Our product provides several approaches to the analysis of source code written in different programming languages.
This series of articles focuses on the structure and operation principles of the signature analysis module (PM, pattern matching). The key benefits of such an analyzer include high performance, simplicity of pattern description, and scalability across various languages. The disadvantage of this approach is that the module is not able to analyze complex vulnerabilities, which require developing high-level models of code execution.
This article focuses on the first stage, parsing: we compare the functionality and features of various parsers and apply the theory to practice using Java, PHP, PL/SQL, T-SQL, and even C# grammars. Other stages will be discussed in future publications.
Regular expressions describe only elementary matching patterns, which nevertheless cover the majority of everyday tasks; another advantage is that most programming languages support them. At the opposite end of the spectrum, unrestricted (Turing-complete) grammars are unsuitable in practice because of the complexity of both writing and parsing them (the esoteric programming language Thue, based on string rewriting, comes to mind).
In fact, the syntax of virtually all modern programming languages can be defined by context-free grammars. Comparing context-free grammars with regular expressions in lay terms, the latter have no memory (they cannot count at all); comparing context-sensitive with context-free grammars, the latter cannot remember which rules have already been applied (informally, they can count one matched pair, as in aⁿbⁿ, but not two, as in aⁿbⁿcⁿ).
Moreover, the same language may be context-free in one respect and context-sensitive in another. Once semantics are taken into account (i.e., consistency with the identifiers defined in the program, and type consistency in particular), the language may be considered context-sensitive. For example, in T a = new T(), the type in the constructor on the right must be the same as the one on the left; it is usually advisable to check such semantics after parsing. Still, there are syntactic constructions that cannot be parsed using context-free grammars at all, for example, Heredoc in PHP:
$x = <<<EOL
Hello world
EOL;
Here EOL is an arbitrary identifier marking both the start and the end of the string, so parsing it requires memorizing the value of the token being visited. This article focuses on the analysis of such context-free and context-sensitive languages.
ANTLR
This parser generator is an LL(*) one; it has existed for over 20 years, and its 4th version was released in 2013. It is now developed on GitHub. ANTLR generates parsers in Java, C#, Python2, Python3, and JavaScript; C++, Swift, and Go targets are coming soon (hopefully). It is simple enough to develop and debug grammars using this tool. Although LL grammars do not allow left-recursive rules, since version 4 ANTLR accepts such rules (except for rules with hidden or indirect left recursion) and translates them into ordinary rules during parser generation. This shortens the notation of expressions, for example, arithmetic ones:
expr
: expr '*' expr
| expr '+' expr
| expr '^' expr
| id
;
In addition, parsing performance is significantly improved by the Adaptive LL(*) algorithm. This algorithm combines the advantages of the relatively slow GLL and GLR algorithms, which can resolve ambiguous cases (and are used for parsing natural languages), with the speed of LL recursive-descent algorithms, which in turn cannot resolve all ambiguity problems. The idea is to run LL parsers pseudo-parallel on each rule, caching and choosing the right prediction (as opposed to GLR, where several alternatives are allowed). Thus, the algorithm is dynamic. Although the theoretical worst-case parsing time of this algorithm is O(n⁴), in practice the parsing rate for existing programming languages turns out to be linear. The 4th version also has an improved error recovery algorithm. Read more about the ANTLR 4 algorithms and how they differ from other parsing algorithms in the paper Adaptive LL(*) Parsing: The Power of Dynamic Analysis.
Roslyn
Roslyn is not just a parser; it is a fully-featured tool for parsing, analyzing, and compiling C# code. It is also developed on GitHub, but it is more advanced than ANTLR. This article deals only with its parsing features, leaving semantics aside. Roslyn parses code into a full-fidelity, immutable, and thread-safe tree. Full fidelity means that such a tree can be converted back into character-for-character identical code, including whitespace, comments, and preprocessor directives, even if the code contains syntax errors. Immutability makes multi-threaded processing easy, since each thread works with a "smart" copy of the tree that stores only its own changes. The tree may consist of the following elements:
- Syntax Node — a non-terminal node of the tree containing a few other nodes and displaying a certain structure. It may also contain an optional node (e.g., ElseClauseSyntax for if).
- Syntax Token — a terminal node that represents a keyword, an identifier, a literal, or a punctuation mark.
- Syntax Trivia — a terminal node that represents a space, a comment or a preprocessor directive (it can be easily removed without losing code information). Trivia does not have a parent. These nodes are critical when converting a tree back to code (e.g., during refactoring).
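As a quick illustration, here is a minimal sketch (assuming the Microsoft.CodeAnalysis.CSharp package; the parsed snippet is ours, not from the article) that touches all three element kinds:
using System;
using System.Linq;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

class TreeDemo
{
    static void Main()
    {
        var tree = CSharpSyntaxTree.ParseText("class C { int x = 42; } // answer");
        var root = tree.GetRoot();
        // Syntax Node: the variable declaration is a non-terminal node
        var decl = root.DescendantNodes().OfType<VariableDeclarationSyntax>().First();
        // Syntax Token: the identifier 'x' is a terminal node
        var id = decl.Variables.First().Identifier;
        // Syntax Trivia: the trailing comment is attached to a nearby token
        var comment = root.DescendantTrivia()
            .First(t => t.IsKind(SyntaxKind.SingleLineCommentTrivia));
        Console.WriteLine($"{decl} | {id.Text} | {comment}");
    }
}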
Parsing problems
The development of grammars and parsers introduces some challenges that should be considered.
Using keywords as identifiers
It often happens that some keywords may also appear as identifiers during parsing. For example, in C#, the async keyword placed before a method signature indicates that the method is asynchronous; but if the same word is used as a variable name, as in var async = 42;, the code is also valid. In ANTLR, this problem can be solved in two ways:
- Using a semantic predicate in the parser rule:
async: {_input.LT(1).GetText() == "async"}? ID ;
Here the async token itself does not exist. This approach is bad because the grammar becomes dependent on the runtime and looks ugly.
- Declaring an ASYNC token and inserting it into the id rule:
ASYNC: 'async';
...
id
    : ID
    ...
    | ASYNC
    ;
Ambiguity
Natural language contains ambiguous phrases (like "Flying planes can be dangerous"). Such constructions may also occur in a formal language. For example:
stat: expr ';'        // expression statement
    | ID '(' ')' ';'  // function call statement
    ;
expr: ID '(' ')'
    | INT
    ;
However, contrary to natural languages, such constructions in a formal language are most likely the result of an improperly designed grammar. ANTLR cannot detect this kind of ambiguity while generating a parser, but it can detect it during parsing if you set the LL_EXACT_AMBIG_DETECTION prediction mode (since ALL is a dynamic algorithm). Ambiguity may occur in both the lexer and the parser; in the lexer, of two rules matching the same text, the one defined first in the grammar produces the token (as in the identifiers example above). In languages where ambiguity is acceptable (for example, C++), you can use semantic predicates (code insertions) to resolve it, for example:
expr: { isfunc(ID) }? ID '(' expr ')' // func call with 1 arg
| { istype(ID) }? ID '(' expr ')' // ctor-style type cast of expr
| INT
| void
;
Sometimes the ambiguity can be fixed with a little grammar redesign. For example, C# has the right-shift bit operator RIGHT_SHIFT: '>>', and the same two angle brackets appear when closing a nested generic type such as List<List<int>>. If we define >> as a single token, the nested-generics construction can never be parsed, because the lexer produces a >> operator instead of two closing brackets. The solution is simply not to define the RIGHT_SHIFT token (the parser can match two consecutive '>' tokens instead). At the same time, the LEFT_SHIFT: '<<' token can be kept as-is, because such a character sequence never occurs with another meaning in valid code.
That said, we have not yet analyzed in detail whether the grammars developed for our module contain any ambiguities.
Another parsing problem is handling comments. Including comments in the grammar would overcomplicate it, since practically every node would have to account for them. However, we cannot simply discard comments, because they may contain relevant information. ANTLR uses so-called channels to handle comments; a channel separates the set of comment tokens from all other tokens:
Comment: ~[\r\n?]+ -> channel(PhpComments);
In Roslyn, comments are included in the tree nodes, but they belong to a special type called Syntax Trivia. Both ANTLR and Roslyn let you get the list of comment tokens associated with a given ordinary token. In ANTLR, there are methods that, for the token with index i in the stream, return all tokens on a specified channel to the left or to the right: getHiddenTokensToLeft(int tokenIndex, int channel) and getHiddenTokensToRight(int tokenIndex, int channel). Roslyn attaches such trivia to the terminal Syntax Tokens.
To retrieve all comments in ANTLR, you can take all tokens on the specified channel:
lexer.GetAllTokens().Where(t => t.Channel == PhpComments)
In Roslyn, you can take all DescendantTrivia of the root node with one of the following SyntaxKind values:
SingleLineCommentTrivia,
MultiLineCommentTrivia,
SingleLineDocumentationCommentTrivia,
MultiLineDocumentationCommentTrivia,
DocumentationCommentExteriorTrivia,
XmlComment.
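For instance, a minimal sketch of the Roslyn variant (assuming the Microsoft.CodeAnalysis.CSharp package; CommentExtractor is our illustrative name):
using System.Collections.Generic;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

static class CommentExtractor
{
    // Collect all comment trivia under the root node of a parsed tree
    public static List<SyntaxTrivia> GetComments(SyntaxTree tree) =>
        tree.GetRoot()
            .DescendantTrivia(descendIntoTrivia: true)
            .Where(t => t.IsKind(SyntaxKind.SingleLineCommentTrivia)
                     || t.IsKind(SyntaxKind.MultiLineCommentTrivia)
                     || t.IsKind(SyntaxKind.SingleLineDocumentationCommentTrivia)
                     || t.IsKind(SyntaxKind.MultiLineDocumentationCommentTrivia))
            .ToList();
}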
The need to handle whitespace and comments is one of the reasons why compiler infrastructures such as LLVM cannot be used for this kind of analysis: they simply discard them. Besides comments, whitespace handling is also important; it helps, for example, to detect an error caused by misleading indentation after an if statement (this example was taken from the article PVS-Studio delved into the FreeBSD kernel):
case MOD_UNLOAD:
if (via_feature_rng & VIA_HAS_RNG)
random_nehemiah_deinit();
random_source_deregister(&random_nehemiah);
Handling parse errors
An important capability of any parser is error handling, for two reasons:
- Parsing should not stop at the first mistake; the parser must recover properly and continue parsing the code (for instance, after a missing semicolon).
- It should find the relevant error and its location, instead of reporting multiple irrelevant errors.
ANTLR errors
The following parsing errors are present in ANTLR:
- Token recognition error (Lexer no viable alt). The only lexical error; it indicates that no lexer rule can build a token from the current lexeme: class # { int i; } (# is the offending lexeme).
- Missing token. ANTLR inserts the missing token into the token stream, marks it as missing, and continues parsing as if the token were present: class T { int f(x) { a = 3 4 5; } } (} is the token in question).
- Extraneous token. ANTLR marks the token as incorrect and continues parsing as if it did not exist; the first ; in class T ; { int i; } is an example of such a token.
- Mismatched input. "Panic mode" is initiated: a set of input tokens is ignored, and the parser waits for a token from the synchronizing set. In class T { int f(x) { a = 3 4 5; } }, the tokens 4 and 5 are ignored, and ; is the synchronizing token.
- No viable alternative input. This error describes all other possible parsing errors: class T { int ; }
Furthermore, errors can be handled manually by adding an error alternative to the rule:
function_call
: ID '(' expr ')'
| ID '(' expr ')' ')' {notifyErrorListeners("Too many parentheses");}
| ID '(' expr {notifyErrorListeners("Missing closing ')'");}
;
Moreover, ANTLR 4 allows you to use your own error handling mechanism. This can be used to increase parser performance: first, the code is parsed with the fast SLL algorithm, which may, however, parse ambiguous code incorrectly. If that pass reveals at least one error (which may be an actual error in the code or an ambiguity), the code is re-parsed with the complete but slower ALL algorithm. A file with an actual error (e.g., a missing semicolon) will of course always be parsed with full LL, but there are fewer such files than correct ones.
Maximizing performance when parsing in ANTLR:
// try with simpler/faster SLL(*)
parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
// we don't want error messages or recovery during first try
parser.removeErrorListeners();
parser.setErrorHandler(new BailErrorStrategy());
try {
    parser.startRule();
    // if we get here, there was no syntax error and SLL(*) was enough;
    // there is no need to try full LL(*)
}
catch (ParseCancellationException ex) { // thrown by BailErrorStrategy
    tokens.reset(); // rewind input stream
    parser.reset();
    // back to standard listeners/handlers
    parser.addErrorListener(ConsoleErrorListener.INSTANCE);
    parser.setErrorHandler(new DefaultErrorStrategy());
    // try again with full LL(*)
    parser.getInterpreter().setPredictionMode(PredictionMode.LL);
    parser.startRule();
}
Roslyn errors
The following parsing errors are present in Roslyn:
- Missing syntax. Roslyn completes the corresponding node and sets its IsMissing property to true (a common example is a statement without a semicolon).
- Incomplete member. A separate IncompleteMember node is created.
- Incorrect value of a numeric, string, or character literal (e.g., a value that is too long, an empty char). A separate node with Kind equal to NumericLiteralToken, StringLiteralToken, or CharacterLiteralToken is created.
- Excessive syntax (e.g., an accidentally typed character). A separate node with Kind = SkippedTokensTrivia is created.
The following code fragment demonstrates these errors (you can explore Roslyn syntax trees using the Syntax Visualizer plugin for Visual Studio):
namespace App
{
class Program
{
; // Skipped Trivia
static void Main(string[] args)
{
a // Missing ';'
ulong ul = 1lu; // Incorrect Numeric
string s = """; // Incorrect String
char c = ''; // Incorrect Char
}
}
class b
{
c // Incomplete Member
}
}
These carefully chosen types of syntax errors allow Roslyn to convert a tree with any number of errors back to character-for-character identical code.
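A small check of this property, as a hedged sketch (the broken snippet is ours for illustration):
using System;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

class RoundTripCheck
{
    static void Main()
    {
        var source = "class C { void M() { a } }";  // deliberately broken code
        SyntaxTree tree = CSharpSyntaxTree.ParseText(source);
        // The tree reproduces the input character for character...
        Console.WriteLine(tree.GetRoot().ToFullString() == source);  // True
        // ...while the errors are reported as diagnostics
        Console.WriteLine(tree.GetDiagnostics().Count());            // > 0
    }
}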
From theory to practice
PHP, T-SQL, and PL/SQL grammars illustrating the above theory were developed and made open-source; our PHP and SQL parsers use them. To parse Java code, we used the already existing java and java8 grammars. We also refined the C# grammar (for language versions 5 and 6), which is used to compare the Roslyn- and ANTLR-based parsers. Below you will find the most interesting aspects of developing and using these grammars. Although SQL-based languages are regarded as declarative rather than imperative, the T-SQL and PL/SQL extensions support imperative programming (Control flow), and our source code analyzer is mainly being developed for these aspects.
Java and Java8 grammars
In most cases, the Java 7-based parser is faster than the Java 8 one, except when recursion is deep: for example, parsing the ManyStringConcatenation.java file takes far longer and requires much more memory. Note that this is not an artificial example; we really came across such "spaghetti code". As it turned out, the problem is caused by the left-recursive expression rules; the Java 8 grammar contains only primitive recursive rules. Primitive recursive rules differ from ordinary recursive rules in that they refer to themselves on either the left or the right side of an alternative, but not on both sides simultaneously. An example of an ordinary recursive expression rule:
expression
: expression ('*'|'/') expression
| expression ('+'|'-') expression
| ID
;
The following rules are obtained after converting the rule above to a primitive left-recursive form:
expression
: multExpression
| expression ('+'|'-') multExpression
;
multExpression
: ID
| multExpression ('*'|'/') ID
;
Or even to non-recursive ones (however, such expressions are harder to process after parsing, because they are no longer binary):
expression
: multExpression (('+'|'-') multExpression)*
;
If the operation is right-associative (e.g., exponentiation), primitive right-recursive rules are used. The ordinary recursive rule
expression
    : expression '^' expression
    | ID
    ;
is converted to
powExpression
    : ID '^' powExpression
    | ID
    ;
On the one hand, converting left-recursive rules solves the excessive memory consumption and poor performance on rare files with a large number of similar expressions; on the other hand, it degrades performance on all other files. It is therefore advisable to use primitive recursion for expressions that may be deeply nested (e.g., string concatenation) and ordinary recursion for all other cases (e.g., number comparison).
PHP grammar
Phalanger allows parsing PHP code on the .NET platform. However, we were not satisfied with the fact that this project is effectively no longer developed and provides no Visitor interface for traversing AST nodes (the only interface available is a Walker). We therefore decided to develop a PHP grammar for ANTLR ourselves.
Case insensitive keywords
As far as we know, all tokens in PHP, except for variable names (which begin with '$'), are case-insensitive. In ANTLR, case insensitivity can be implemented in two ways:
- Declaring fragment lexer rules for all Latin letters and composing the keyword tokens from them:
Abstract: A B S T R A C T;
Array: A R R A Y;
As: A S;
BinaryCast: B I N A R Y;
BoolType: B O O L E A N | B O O L;
BooleanConstant: T R U E | F A L S E;
...
fragment A: [aA];
fragment B: [bB];
...
fragment Z: [zZ];
An ANTLR fragment is a part of a token that can be used in other tokens but is not a token itself; it is "syntactic sugar" for describing tokens. Without fragments, the first token would be written as Abstract: [Aa] [Bb] [Ss] [Tt] [Rr] [Aa] [Cc] [Tt];. The advantage of this approach is that the generated lexer does not depend on the runtime, since the upper- and lowercase characters are declared directly in the grammar. The downside is that the lexer is slower than with the second approach.
- Converting the entire input character stream to lower (or upper) case and running a lexer whose tokens are all described in that case. This conversion, however, has to be implemented separately for each runtime (Java, C#, JavaScript, Python), as described in Implement a custom File or String Stream and Override LA; a sketch for the C# runtime is shown below the list. With this approach, it is also difficult to make some tokens case-sensitive and others case-insensitive.
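As an illustration of the second approach, here is a minimal sketch of such a stream wrapper for the ANTLR C# runtime (the class name is ours; only LA changes behavior):
using Antlr4.Runtime;
using Antlr4.Runtime.Misc;

// Wraps an ICharStream and lowercases the characters the lexer looks at,
// while the underlying stream (and hence token text) keeps the original case.
public class LowerCaseCharStream : ICharStream
{
    private readonly ICharStream inner;
    public LowerCaseCharStream(ICharStream inner) { this.inner = inner; }

    public int LA(int i)
    {
        int c = inner.LA(i);
        return c <= 0 ? c : char.ToLowerInvariant((char)c);
    }

    // Everything else is delegated unchanged
    public void Consume() => inner.Consume();
    public int Mark() => inner.Mark();
    public void Release(int marker) => inner.Release(marker);
    public void Seek(int index) => inner.Seek(index);
    public int Index => inner.Index;
    public int Size => inner.Size;
    public string SourceName => inner.SourceName;
    public string GetText(Interval interval) => inner.GetText(interval);
}
The wrapper would be used when constructing a generated lexer, e.g. new LowerCaseCharStream(new AntlrInputStream(code)).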
The first approach is used in the developed PHP grammar, since lexical analysis usually takes less time than syntactic analysis; it also keeps the grammar independent of the runtime and makes porting it to other runtimes easier. Furthermore, we created the Pull Request RFC Case Insensitivity Proof of Concept to facilitate the description of case-insensitive tokens.
Lexical modes for PHP, HTML, CSS, and JavaScript
It is commonly known that PHP code inclusions may be placed anywhere in HTML code, and the same HTML may in turn include CSS and JavaScript code (such blocks of embedded code are known as "islands"). PHP's Alternative Syntax makes this even trickier: a single control statement such as if/endif may be split across several PHP inclusions with plain HTML in between, and such code is still valid.
Fortunately, ANTLR provides a mechanism called lexical modes, which allows switching between different sets of tokens under certain conditions. For example, the SCRIPT and STYLE modes were designed to produce token streams for JavaScript and CSS (in fact, those tokens are simply skipped in this grammar), while HTML tokens are produced in DEFAULT_MODE. It is worth noting that Alternative Syntax support can be implemented in ANTLR without adding target-specific code to the lexer: the nonEmptyStatement rule includes the inlineHtml rule, which in turn matches the tokens received in DEFAULT_MODE:
nonEmptyStatement
: identifier ':'
| blockStatement
| ifStatement
| ...
| inlineHtml
;
...
inlineHtml
: HtmlComment* ((HtmlDtd | htmlElement) HtmlComment*)+
;
Complex context sensitive statements
We should mention that although ANTLR supports only context-free grammars, it also offers so-called "actions" containing arbitrary code, which extend the class of recognizable languages to at least context-sensitive ones. Such code insertions made it possible to parse Heredoc and some other constructions.
T-SQL grammar
Despite the common «SQL» root, the T-SQL (MSSQL) and PL/SQL grammars differ greatly from each other.
It would have been nice to avoid developing our own parser for this complex language. However, the existing parsers did not meet the criteria of coverage and currency (e.g., a grammar for the deprecated GOLD parser) or had closed source code (General SQL Parser). Finally, it was decided to recover a T-SQL grammar from the MSDN documentation. The result was worth it: the grammar covers many common syntactic constructions, looks neat, remains independent of the runtime, and has been tested on SQL samples from MSDN. One development difficulty was that some tokens in the grammar are optional, for example the semicolon; in such cases, error recovery during parsing is not so smooth.
PL/SQL grammar
Refining the PL/SQL grammar took even less time, because a grammar already existed for ANTLR3. The main difficulty was that it had been developed for the Java runtime. Most Java code insertions were removed, since an AST can be built without them (as mentioned earlier, semantics can be checked at another stage). Insertions such as
decimal_key
: {input.LT(1).getText().equalsIgnoreCase("decimal")}? REGULAR_ID
were replaced with fragment-based tokens, as described above:
decimal_key: D E C I M A L;
C# grammar
Strange as it may seem, refining the grammar to support language versions 5 and 6 was quite a difficult task. The major concerns were string interpolation and proper processing of preprocessor directives. Because these constructions are context-sensitive, the lexer and the directive-processing parser turned out to be dependent on the runtime.
Preprocessor directives
C# compiles the following code properly (the code after the first directive is invalid, but it is excluded from compilation, since the condition is never true):
#if DEBUG && false
Invalid code that is not compiled
var 42 = a + ;
#else
// Compiled code
var x = a + b;
#endif
To be processed correctly, the code is split into tokens placed on the default channel and on the COMMENTS_CHANNEL and DIRECTIVE channels. A codeTokens list is also created; it contains the tokens actually used for parsing. The preprocessor parser then evaluates each directive over the directive tokens. It is worth noting that ANTLR also allows you to write the code for evaluating complex logical expressions directly in the grammar; for implementation details, see CSharpPreprocessorParser.g4. A true or false value is calculated only for the #if, #elif, and #else directives; all the remaining directives always return true, because they do not affect whether the following code is compiled. This parser also handles conditional compilation symbols (DEBUG is defined by default).
After a directive is evaluated, if its value is true, the subsequent tokens are added to the codeTokens list; otherwise they are skipped. This approach allows the wrong tokens (such as var 42 = a + ; in the example above) to be ignored at the parsing stage; a rough sketch of the filtering is shown below. The parsing process itself is implemented in CSharpAntlrParser.cs.
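A rough sketch of this filtering (the Directive channel number and EvaluateDirective are illustrative assumptions, not the module's actual identifiers):
using System.Collections.Generic;
using Antlr4.Runtime;

static class PreprocessorFilter
{
    // Hypothetical channel number and evaluator; the real ones live in the
    // generated CSharpLexer and CSharpPreprocessorParser
    const int Directive = 4;
    static bool EvaluateDirective(IToken directive) => true; // stub

    public static List<IToken> FilterCodeTokens(Lexer lexer)
    {
        var codeTokens = new List<IToken>();
        bool compile = true; // state of the current #if/#elif/#else branch
        foreach (IToken token in lexer.GetAllTokens())
        {
            if (token.Channel == Directive)
                compile = EvaluateDirective(token);
            else if (token.Channel == 0 && compile) // 0 = default channel
                codeTokens.Add(token);              // these reach the parser
        }
        return codeTokens;
    }
}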
String interpolation
This feature was very challenging to implement, because a closing curly bracket may either be part of an interpolation expression or end the expression mode; a colon may likewise be part of the expression or mark the end of the expression and the start of a format specification (for example, #0.##). Additionally, such strings may be regular, verbatim, or nested. For more details about the syntax, see the MSDN page.
The above-described items are shown in the following code, which is valid syntactically:
s = $"{p.Name} is \"{p.Age} year{(p.Age == 1 ? "" : "s")} old";
s = $"{(p.Age == 2 ? $"{new Person { } }" : "")}";
s = $@"\{p.Name}
""\";
s = $"Color [ R={func(b: 3):#0.##}, G={G:#0.##}, B={B:#0.##}, A={A:#0.##} ]";
String interpolation has been implemented using a stack that tracks the current nesting level of interpolated strings and brackets. All of this is implemented in CSharpLexer.g4; a sketch of the idea is shown below.
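The idea can be sketched roughly as follows (an illustrative helper, not the actual code from CSharpLexer.g4): the lexer keeps one brace counter per nested interpolated string; '{' opens or nests an expression hole, and a '}' that brings the counter back to zero resumes string-text mode.
using System.Collections.Generic;

class InterpolationTracker
{
    private readonly Stack<int> braceDepths = new Stack<int>();

    public void EnterString() => braceDepths.Push(0); // lexer saw $" or $@"
    public void ExitString() => braceDepths.Pop();    // lexer saw the closing quote

    public void OpenBrace() => braceDepths.Push(braceDepths.Pop() + 1); // '{' opens or nests a hole

    public bool CloseBrace() // true when '}' closes the {expr} hole
    {
        int depth = braceDepths.Pop() - 1;
        braceDepths.Push(depth);
        return depth == 0;   // resume string-text mode
    }
}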
Testing
Correctness of ANTLR parsers
Obviously, there is no need to test the parsing correctness of Roslyn; we therefore paid close attention to testing the ANTLR parsers.
Testing was conducted in single-threaded mode, in the release configuration, without a debugger attached.
ANTLR 4.5.0-alpha003 and Roslyn (Microsoft.CodeAnalysis) 1.1.1 were tested.
WebGoat.PHP
The number of processed files — 885. The total number of lines — 137 248, characters — 4 461 768.
Approximate time: 00:00:31 (55% by lexer, 45% by parser).
PL/SQL Samples
The number of processed files — 175. The total number of lines — 1 909, characters — 55 741.
Approximate time: < 1 sec (5% by lexer, 95% by parser).
CoreFX-1.0.0-rc2
The number of processed files — 7329. The total number of lines — 2 286 274, characters — 91 132 116.
Approximate time: Roslyn — 00:00:04, ANTLR — 00:00:24 (12% by lexer, 88% by parser).
Roslyn-1.1.1
The number of processed files — 6527. The total number of lines — 1 967 672, characters — 74 319 082.
Approximate time: Roslyn — 00:00:03, ANTLR — 00:00:16 (12% by lexer, 88% by parser).
According to the test results on CoreFX and Roslyn, the C# parser we developed on ANTLR is about five times slower than the Roslyn parser, which testifies to the high quality of the latter. Understandably, a parser created in a week as a kitchen-table effort can hardly compete with a market giant like Roslyn, but it can be used to parse C# code in Java, Python, or JavaScript (and other future runtimes), because the parsing speed is still acceptable.
The remaining tests show that lexing is a substantially faster stage than parsing. The exception is the PHP lexer, which spends more time than the parser. This appears to be due to the complex logic of its lexer rules rather than to case-insensitive keywords, since the T-SQL and PL/SQL lexers (which also contain case-insensitive keywords) are much faster than their parsers (up to 20 times). For example, if you use SHARP: NEW_LINE Whitespace* '#'; instead of SHARP: '#';, the lexer becomes about 10 times slower than the parser instead of 7 times faster! The reason is that any file contains a lot of whitespace, and the lexer tries to match the # symbol on every line, which slows it down significantly (we ran into exactly this problem, so checking that a directive starts a new line should instead be done at the stage of semantic analysis).
Error handling in ANTLR and Roslyn parsers
We wrote a simple C# file that triggers every type of ANTLR parsing error:
namespace App
{
©
class Program
{
static void Main(string[] args)
{
a = 3 4 5;
}
}
class B
{
c
}
ANTLR errors
- token recognition error at: '©' at 3:5
- mismatched input '4' expecting {'as', 'is', '[', '(', '.', ';', '+', '-', '*', '/', '%', '&', '|', '^', '<', '>', '?', '??', '++', '--', '&&', '||', '->', '==', '!=', '<=', '>=', '<<'} at 8:19
- extraneous input '5' expecting {'as', 'is', '[', '(', '.', ';', '+', '-', '*', '/', '%', '&', '|', '^', '<', '>', '?', '??', '++', '--', '&&', '||', '->', '==', '!=', '<=', '>=', '<<'} at 8:21
- no viable alternative at input 'c}' at 15:5
- missing '}' at 'EOF' at 15:6
Next, we checked the same file with the Roslyn compiler and discovered the following errors:
- test(3,5): error CS1056: Unexpected character '©'
- test(8,19): error CS1002: ; expected
- test(8,21): error CS1002: ; expected
- test(15,5): error CS1519: Invalid token '}' in class, struct, or interface member declaration
- test(15,6): error CS1513: } expected
The number of errors detected by Roslyn was similar to the number detected by ANTLR; the first and last errors differ only in wording. The parsers have also been tested on more complex files. Clearly, Roslyn detects fewer errors, and those errors are more relevant; however, in simple cases such as missing or extra tokens (semicolons, brackets), ANTLR also reports the relevant position and description of the error. ANTLR gives consistently worse results where part of the lexer is written manually (compilation directives, interpolated strings). For example, if we write an #if directive without any condition, the rest of the code may not be parsed correctly; in these cases, the code for recovering the parsing process has to be written manually as well (since these are context-sensitive constructions).
Memory consumption of ANTLR runtime
As mentioned above, ANTLR 4 caches internal data during parsing to increase performance on subsequent files. If too many files are processed (we ran a test on about 70 000 PHP files), or files are re-parsed within the same process, memory consumption may grow to several gigabytes. The cache can be cleared by calling the interpreter methods lexer.Interpreter.ClearDFA() for the lexer and parser.Interpreter.ClearDFA() for the parser, after processing a certain number of files or once memory consumption exceeds a threshold.
After solving the cache-clearing problem, we discovered an issue with multi-threaded parsing. By practical experience, we found that calling the GetAllTokens() and ClearDFA() methods on a lexer from different threads (and similarly on a parser) may, in rare cases, lead to an "Object reference not set to an instance of an object" exception. Although this behavior is caused by a bug in the ANTLR C# runtime, it can be worked around by locking with several readers (the parsing threads) and one writer (the cache cleaner). In the C# runtime, the ReaderWriterLockSlim synchronization primitive can be used for this purpose; a minimal sketch follows.
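A minimal sketch of such locking (the surrounding class and method names are ours, not from the module):
using System.Collections.Generic;
using System.Threading;
using Antlr4.Runtime;

static class ParserCache
{
    static readonly ReaderWriterLockSlim CacheLock = new ReaderWriterLockSlim();

    // Parsing threads are readers: many can run concurrently
    public static IList<IToken> Tokenize(Lexer lexer)
    {
        CacheLock.EnterReadLock();
        try { return lexer.GetAllTokens(); }
        finally { CacheLock.ExitReadLock(); }
    }

    // The cache cleaner is the single writer
    public static void ClearCaches(Lexer lexer, Parser parser)
    {
        CacheLock.EnterWriteLock();
        try
        {
            lexer.Interpreter.ClearDFA();
            parser.Interpreter.ClearDFA();
        }
        finally { CacheLock.ExitWriteLock(); }
    }
}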
For obvious reasons, the Roslyn parser does not consume gigabytes of memory: its peak memory consumption did not exceed 200 MB when parsing five large C# projects: aspnet-mvc-6.0.0-rc1, roslyn-1.1.1, corefx, Newtonsoft.Json-8.0.2, and ImageProcessor-2.3.0.
Conclusion
This article has covered source code parsing with ANTLR and Roslyn. Future articles will address the following topics:
- Conversion of parse trees into a unified AST using a Visitor or Walker (Listener).
- A guide to writing easy-to-read, efficient, and user-friendly grammars in ANTLR 4.
- Serialization and traversal of tree structures in .NET.
- Pattern matching in a unified AST.
- Development and use of a DSL for describing patterns.
References
- F. Yamaguchi. Modeling and Discovering Vulnerabilities with Code Property Graphs. Proceedings of the 2014 IEEE Symposium on Security and Privacy, SP, 2014.
- Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Pearson Education Inc., Sep. 2006.
- Terence Parr. The Definitive ANTLR 4 Reference. Pragmatic Bookshelf, 2013.
- Terence Parr, Sam Harwell, Kathleen Fisher. Adaptive LL(*) Parsing: The Power of Dynamic Analysis. ACM New York, 2014.
- Roslyn. https://github.com/dotnet/roslyn
- ANTLR grammars. https://github.com/antlr/grammars-v4
- ANTLR. https://github.com/antlr/antlr4
Author: Ivan Kochurkin, Positive Technologies