Learning ANTLR 4: building grammar

A language will usually follow 4 abstract patterns

  1. sequence : ( a sequence of elem, such as values in array initializer
  2. Choice: choice between mutliple alternative phrases
  3. Token dependence presence of one token requires presence of its counterpart
  4. nested phrase: self-similar language

We can use EBNF (extended backus-naur-format) to express. It will also allow use to use subrules, inline rules wrapped in parenthesis.

Parsers

Sequence

EG

#define true false 

To specify a sequence for grammar, just list the elements in order.

'#define' string string'

For arbitrarily long sequence we need to use + or *. They are subrule operator. + will encode one or more. (INT)+ will describe one of more operator. * will encode 0 or more.

Patterns within sequence

We use csv file to demonstrate two patterns

file : (row '\n')* ; // sequence with a '\n' terminator
row : field (',' field)* ; // sequence with a ',' separator
field: INT ;

terminator

the ‘\n’ token terminates each element of the sequence.

Other terminator includes

stats : (stat ';')*

Separator

Rule row use the list and match a field followed(seperated) by zero or more ‘,’ field. One more separator:

exprList : expr(',' expr)*

###zero-or-one sequence
?

(‘extends’ identifier)?

It will optionally match extend for super class.


##Choice(alternatives)
Use | as the or operator.
EG:

type :’float’ | ‘int’ | ‘void’;

<br />##Token dependency
If you see a left bracket, you better have a right bracket.
We use a sequence that specifier both symbols.

array : ‘[‘ INT+ ‘]’ ;

<br />Or 

functionDeclear : type ID ‘(‘ formalParameters? ‘)’ block // void f(int x) {…} formalParameters: formalParameter ( ‘,’ formalParameter)*; formalParameter: type ID;

<br />##Nested Phrase
self-similar language structure. expression

expr : ‘(‘ expr ‘)’ | ID;

<br />##How parsers handle precedence. 
The higher the line number, the higher the precedence.
```java
expr : expr '^'<assoc=right> expr // ^ operator is right associative
    | expr '*' expr // match subexpressions joined with '*' operator
    | expr '+' expr // match subexpressions joined with '+' operator
    | INT // matches simple integer atom
;

Lexers

There are also some very common lexical constructs:

Identifiers

ID : [a-zA-Z]+ ; // match 1-or-more upper or lowercase letters

What if a string keyword matches with ID

EG: enum can be matched with ID. But it is also a keyword that’s used in ;parser. ANTLR will seperate string literal, lexer rules and parser rules. Parser rules will be checked first, then string literal. Lexer will be checked last.

Numbers

For integer, it is easy:

INT : [0-9]+;

But float needs to consider the dot.

FLOAT: DIGIT + '.' + DIGIT* | '.' DIGIT+;

We use fragment to create reuseable rules only used by lexical rules and not a token in and itself.

fragment
DIGIT: [0-9];

String literal

dot is a wildcard that matches a single character. So .* will match any sequence of zero or more characters. But that would just go to end of file. So the ‘?’ will make the rule nongreedy. It will consume character until what follows the subrule in the lexer rule. In this case, it will stop after seeing “, the subrule right after ‘.*?’.

STRING : '"' .*? '"' ; // match anything in "..."

It consists of three parts: the string literal ” to start a string and
It is not good enough since we need escape sequence to allow special characters such as double quote inside the string.

We can define escape sequence with backslash. If we want to get double quote inside a double-quoted string, we use \”.

STRING : '"'(ESC|.)*? '"';
fragment
ESC: '\\"' | '\\\\';

In ESC, we have \ because ANTLR needs to escape as well.

Comments and whitespace

We need to discard them. Identify them the same way other Lexer rules did but tell Lexer to throw it out via -> skip.

LINE_COMMENT: '//' .*?  '\n' -> skip; //Match "//" stuff '\n'
COMMENT: '/*/ .*? '*/ -> skip;  //Match /* stuff */

Whitespace

WS :(' '| '\t' | '\r' | '\n') + -> skip;

A shorthand is:

WS: [ \t\r\n]+ -> skip;

Dealing with ambiguities

It resolve ambiguities by favoring the rule specified first. That means ID rule should be defined after keyword rules.

Leave a Reply

Your email address will not be published. Required fields are marked *

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax