Learning ANTLR 1: intro

Concepts

For a computer to run a program, it first has to read and react properly to phrases in the input. EG: when we see foo = 2, the computer will assign. A refresh: A language is a set of valid sentences, a sentence is made up of phrases and a phrase is made up of subphrases and vocabs.

If an application “executes” sentences, it is an interpreter

Parsers/syntax analyzers are programmer that recognize language. syntax are rules that govern language membership. Grammar is just sets of rules that express the structure of a phrase.

ANTLR will translate grammar to parser.

2 phrases of a parser

Lexical analysis

Grouping characters into words (tokens). Tokens consist of at least to pieces of info:

  1. token type(to identify the lexical structure)
  2. Text EG int i; The token type is int and text is i

Parser

use token to recognize the sentence structure. ANTLR will build a parse tree that records how the parser recognize the structure. online

The interior nodes are phrase names that group and identify children. The higher we go in the tree, the more abstract we got. The root node is the most abstract. In this case, stat(statement).

The leaves are always token.

The parse-tree subtree roots correspond to grammar rule names.

How ANTLR implement parsers.

ANTLR generates recursive-descent parsers, starting from root and go down till leaves.

A code for assign would look like this:

// assign : ID '=' expr ';' ;
void assign() { // method generated from rule assign
match(ID); // compare ID to current input symbol then consume
match('=');
expr(); // match an expression by calling expr()
match(';');
}

Each match corresponds to matching a token, and expr() will correspond to a subroot with in the tree.

The parsers sometimes might have to look ahead to determine alternatives.

Ambiguity

“To my Ph.D. supervisor, for whom no thanks is too much.” Some grammars are ambiguous

stat: expr ';' // expression statement
| ID '(' ')' ';' // function call statement
;
expr: ID '(' ')'
| INT
;

We can interprete f() as

  1. expression
  2. function call

    stat
    /    \
    

    expr ;
    /|\
    f ( )

    stat
    / | \
    f ( ) ;

ANTLR will resloves by choosing the first alternative in the parse tree.

It could also occur in Lexer: ANTLR will match the input to the rule specified first in the grammar.

BEGIN : 'begin' ; // match b-e-g-i-n sequence; ambiguity resolves to BEGIN
ID : [a-z]+ ; // match one or more of any lowercase letter

When Lexer sees begin, it will match it to a keyword, instead of an id.

Tree walking

We need to traverse the parse tree

Listeners

The listener will automatically walk through a node and all its childern through listener pattern

#

Leave a Reply

Your email address will not be published. Required fields are marked *

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax