Introduction to Parsing Theory

A common problem in computing is dealing with text input. For instance, a compiler must read the source code in order to analyse it. Typically, as a first step, the compiler would translate the source code into a tree format, which is easy to handle for further analysis steps. This process is referred to as parsing.

For instance, given the input string

total + number * price

we want to generate a tree that looks roughly like this:

       |
  +---add----+
  |          |
  |      +--mul--+
  |      |       |
total  number  price

We can also write this tree in LISP-syntax as:

(add
  total
  (mul
    number
    price))

The original expression is written in what's called the concrete syntax, the result is in abstract syntax and the tree is called an abstract syntax tree (or AST).

(As a side note, not all parsers produce abstract syntax trees. For instance, a calculator might instead evaluate partial results on the fly.)

Lexical analysis

Although it is possible to run a parser directly on the ASCII/Unicode input, it is far more common to use a simple preprocessor called a lexical analyser (or lexer for short). Lexical analysis converts the input (a sequence of characters) into a sequence of tokens. For instance, our example above might be converted to

TSYMBOL(total) TPLUS TSYMBOL(number) TMUL TSYMBOL(price) TEOF

Comments are usually elided at this stage and a token TEOF (for "end of file") is inserted at the end of the input stream (whether spaces and newlines are elided depends on whether they are significant in the language being parsed).

While it is possible to derive lexical analysers using regular expressions and finite state machines (using tools such as lex(1)), in practice it is fairly easy to write a lexical analyser by hand. A pseudocode example:

while(is space or comment) skip;
if(no more input) return TEOF;
if(is letter){
    read as many letters or digits as possible into buffer;
    return TSYMBOL(buffer);
}
if(is digits){
    read as many digits as possible into buffer;
    return TNUMBER(stringToInt(buffer));
}
if(is '+') return TPLUS;
etc.
if(unrecognized) error;

Exercise: (longer) implement this lexical analyser in your favourite programming language. You can either use complex data structures for tokens to include the extra info like "buffer" or use a simple enumeration and make buffer into a global variable (which isn't pretty but workable).
You may find the following helper functions useful:

peeked = EMPTY;
getch(){ // return the next input character
    if(peeked != EMPTY){
        c = peeked;
        peeked = EMPTY;
        return c;
    }
    return read one character;
}
peekch(){ // look ahead one input character
    if(peeked == EMPTY)
        peeked = getch();
    return peeked;
}

Grammars

The job of the parser is now to convert the sequence of tokens into an AST. The form of the input is usually described using a grammar. A grammar is a series of rules (or productions) of the form

a b c ... → d e f ...

On both sides there is a number of symbols which can be either terminals (input tokens) or nonterminals. Nonterminals correspond to the intermediate nodes in the tree (such as add or mul above). To be a valid grammar, there must always be one nonterminal on the left-hand side.

For example, a simplified grammar for expressions might look like

expr → TSYMBOL
expr → expr TPLUS expr
expr → expr TMUL expr

The simplest way to understand a grammar is to repeatedly substitute the rules, starting with a start symbol, such as expr, and make your way towards the input sequence. This is called a derivation; for our example it reads

expr → expr TPLUS expr → expr TPLUS expr TMUL expr → TSYMBOL TPLUS TSYMBOL TMUL TSYMBOL

Each rule in a grammar can be taken to produce a tree fragment:

expr → (expr TSYMBOL)
expr → (expr expr TPLUS expr)
expr → (expr expr TMUL expr)

By applying the corresponding rules in the same order, we produce the parse tree for the input expression:

expr
→ (expr expr TPLUS expr)
→ (expr expr TPLUS (expr expr TMUL expr))
→ (expr (expr TSYMBOL) TPLUS (expr (expr TSYMBOL) TMUL (expr TSYMBOL)))

In this particular example you may have noticed that there is an alternative sequence of rules which generates the same input, but leads to a different parse tree:

expr → expr TMUL expr → expr TPLUS expr TMUL expr → TSYMBOL TPLUS TSYMBOL TMUL TSYMBOL
expr
→ (expr expr TMUL expr)
→ (expr (expr expr TPLUS expr) expr)
→ (expr (expr (expr TSYMBOL) TPLUS (expr TSYMBOL)) TMUL (expr TSYMBOL))

This is because this is an ambiguous grammar, i.e. a grammar which permits multiple parse trees for the same input. Specifically, the grammar does not define whether + or * has higher precedence, which means either interpretation is valid. Although it is readily apparent here, in general it can be very hard to determine whether a grammar is ambiguous or not. Given a completely arbitrary grammar, it is, in fact, an undecidable problem like the halting problem.

While ambiguous grammars are undesirable from a theoretical point of view, in practice it is common to use a technically ambiguous grammar with extra rules that define a unique interpretation. In the case of expressions we define + and * to be left-associative (i.e. a+b+c is to be read as (a+b)+c) and * to be of higher precedence than + (i.e. a*b+c is read as (a*b)+c).

It is also possible to modify the grammar to make it unambiguous:

add → mul
add → add TPLUS mul
mul → TSYMBOL
mul → mul TMUL TSYMBOL

Exercise: Write down the derivation and the parse tree for this grammar and our example. Note that there are now extra steps in the derivation and in the parse tree.
Exercise: Write an unambiguous grammar that also allows for parentheses, subtraction, division and numerical constants (TNUMBER tokens), with the usual precedence rules.
Exercise: Write an unambiguous grammar where + is right-associative and * is not associative (i.e. a*b*c is not legal) and + has precedence over *.

Unrestricted vs context-free grammars

You may have noticed that all of our grammars have been of the form

a → b c d ...

where a is a nonterminal. This type of grammar is called a context-free grammar (because the a is not surrounded by other symbols which would form context for the rule), or CFG for short.

A CFG is strictly weaker than a general (unrestricted) grammar, i.e. there are some constructs that cannot be parsed with a context-free grammar. Most famously, because of typedef, C and its derivatives cannot be directly parsed by a CFG. For example, the statement

a ** b;

is read as declaration of a variable b of type a** if a is a type, and as an expression a * (*b) otherwise.

Abstractly, the grammar would have to say

(sequence with typedef for x) type → (sequence with typedef for x) x
(no sequence with typedef for x) variable → (no sequence with typedef for x) x

so that we can write rules like

stat → type stars variable
expr → variable
etc.

Although the rules for variable and type can't be implemented in the parser, we can "cheat" by having the lexer return different tokens depending on whether the parser has seen a typedef (this is the infamous lexer hack). With this hack, C is now parseable by a CFG.

Because it simplifies parsing considerably, from here on, we will assume that we are dealing with a context-free grammar.

At this point I will also mention a short-hand notation:

a → b ... | c ...

is short for

a → b ...
a → c ...

Recursive-descent parsing

A recursive-descent parser is the simplest type of parser. It is also incredibly common; they are used in many production compilers because they can be made very efficient and flexible.

The idea behind recursive-descent parsing is that we can directly implement a grammar rule of the form

a → b c d ...

in code by writing

a() {
    b();
    c();
    d();
}

Exercise: How would the code look if we wanted to generate a parse tree?

If a appears in the definitions of b, c, d, the result will be a mutually recursive set of functions, hence the first part of the name. The "descent" part refers to the fact that we have a top-down parser, i.e. we start with the top of the parse tree (a) and work our way down to the tokens (unlike the bottom-up LR technique we will discuss later).

If we have multiple rules for a, things become a bit trickier. If the two alternatives start with different tokens, we can use look ahead, e.g.

stat → TIF ifstat
     | TWHILE whilestat

stat() {
    if(next token is TIF)
        ifstat();
    else if(next token is TWHILE)
        whilestat();
    else
        error();
}

To introduce a bit of technical terminology, if the grammar can be parsed in this way with n tokens of lookahead, we say it is a LL(n) grammar. Luckily, most programming languages can be described by LL(1) grammars, which means writing a recursive-descent parser for them is straightforward.

To adopt recursive-descent for more general grammars, we need to use backtracking:

stat() {
    save state;
    if(!ifstat()){
        restore state;
        if(!whilestat())
            error();
    }
}

Backtracking can handle any grammar unless it is left-recursive, meaning it has rules of the form a → a b .... This does not limit the power of the technique though, because left-recursive rules can always be rewritten into right-recursive rules of the form a → b ... a.

For instance, the rule

T → F
  | T O F

can be rewritten to

T → F X
X →
X → O F X

Note that in code we can simply write:

T() {
    F();
    while(next is O){
        O();
        F();
    }
}

Exercise: You may have noticed this corresponds to a left-associative operator O. How would the code look like for a right-associative operator O? (Start from the grammar)

Exercise: In pseudocode, write a recursive-descent parser for our example expression grammar. You will need to take the unambiguous form and take care with the left-recursion.

Exercise: Write a parser for general arithmetic expressions (the grammar for which was a previous exercise).

Exercise: (longer) Actually implement the parser from the last question, using a lexical analyser similar to the one from a previous exercise.
You may find it useful to define functions peektok() and gettok() which function like peekch() and getch(), but for tokens.
Two other useful functions:

expect(t) { if(gettok() != t) error(); }
got(t) { if(peektok() == t) {gettok(); return true;} return false; }

Exercise: Turn the parser from the last exercise into an actual calculator. The functions should now return floating-point numbers and perform the appropriate calculations while parsing.

Exercise: (much longer) Write a grammar and parser for a simple imperative programming language.

Precedence parsing

If we have a language like C with many operators and many precedence levels, it becomes tedious to write out the many functions required. Since each precedence also adds a function call level, the parser will also be slowed by the many function calls. Precedence parsing is a specialised technique to parse arithmetic expressions.

If we only had once precedence level and only left-associative operators we could write:

expr() {
    result = primary();
    while(next is an operator){
        op = operator();
        result = op(result, primary());
    }
    return result;
}

(Here primary is a simple recursive-descent procedure for handling variables, numbers, parentheses, function calls, unary operators, etc.)

By contrast, for right-associative operators the recursive-descent parser would look like

expr() {
    result = primary();
    if(next is an operator){
        op = operator();
        return op(result, expr());
    }
}

We can eliminate the recursion here by using a stack. Since op and expr have potentially different types, we will simply use two stacks (I separate out reduce for reasons that will be obvious soon).

operators = empty stack;
primaries = empty stack;
reduce() {
    op = operators.pop();
    b = primaries.pop();
    a = primaries.pop();
    primaries.push(op(a,b));
}
expr() {
    primaries.push(primary());
    while(next is an operator){
        op = operator();
        operators.push(op);
        primaries.push(primary());
    }
    while(operators is not empty)
        reduce();
    return primaries.pop();
}

We can put left-associative operators into the same "framework" by simply adding a reduce call into the first loop:

expr() {
    primaries.push(primary());
    while(next is an operator){
        op = operator();
        if(operators is not empty)
            reduce();
        operators.push(op);
        primaries.push(primary());
    }
    while(operators is not empty)
        reduce();
    return primaries.pop();
}

We are, in fact, almost there now. The only thing that remains is replacing the if condition in the first while loop with a more sophisticated condition. Let's assume for now that all operators are left-associative. The condition then becomes

if(next is lower precedence than operators.top)
    reduce();

Exercise: work by hand through the parsing of our example expression.
Exercise: find the condition for the general case where the operators can also be left or right-associative. Note: operators at the same precedence level have the same associativity.

LR parsing

Although precedence parsing might seem like an extremely specialised algorithm, Donald Knuth realised in 1965 that the same technique could be adapted to parse any grammar.

Imagine a precedence parser with only one stack. When parsing the expression a + b * c + d the stack will evolve as follows:

Stack                      Action     Input
expr                       shift      a + b * c + d
expr +                     shift        + b * c + d
expr + expr                shift          b * c + d
expr + expr *              shift            * c + d
expr + expr * expr         shift              c + d
expr + expr                reduce             c + d
expr                       reduce             c + d
expr +                     shift                + d
expr + expr                shift                  d
expr                       reduce

Here, shift is the conventional name for reading an input token and pushing it on the stack (I'm converting primaries to expr on-the-fly). reduce takes three tokens of the form expr op expr off the top and replaces them with one taken expr. Since expr op expr is just the right-hand side of a grammar rule, this suggests the following algorithm for a general CFG:

  1. If the top of the stack looks like the right-hand side of a grammar rule, reduce.
  2. Otherwise shift.

However, this can't be quite right, because above we sometimes shifted, when rule 1 would have prompted us to reduce. In the case of precedence parsing, it is easy to explain the decision using operator precedences, but since we want to use general CFGs, let's use the unambiguous grammar and figure out what we want to happen:

add → mul
add → add TPLUS mul
mul → TSYMBOL
mul → mul TMUL TSYMBOL

    Stack                      Action      Input
    TSYMBOL                    shift       a + b * c + d
    mul                        reduce        + b * c + d
    add                        reduce        + b * c + d
    add TPLUS                  shift           b * c + d
    add TPLUS TSYMBOL          shift             * c + d
    add TPLUS mul              reduce            * c + d
(*) add TPLUS mul TMUL         shift               c + d
    add TPLUS mul TMUL TSYMBOL shift               c + d
    add TPLUS mul              reduce              c + d
    add                        reduce              c + d
    add TPLUS                  shift                 + d
    add TPLUS TSYMBOL          shift                   d
    add TPLUS mul              reduce
    add                        reduce

Note that at (*) we shifted instead of reduced. If we had reduced, we would have got

    add                        reduce            * c + d
    add TMUL                   shift               c + d
    add TMUL TSYMBOL           shift                 + d
    add TMUL mul               reduce                + d
    add TMUL add TPLUS         shift                   d
    add TMUL add TPLUS TSYMBOL shift                    
    add TMUL add TPLUS mul     reduce
    add TMUL add               reduce

And now we're stuck, because there is no input and no applicable rules.

Let's take a step back. What other problems does this algorithm have?

To fix this we need to make the algorithm aware of where it currently is in the input with respect to the grammar, i.e. it needs to keep track of state. In each state we have a table that tells the parser for each input token whether to shift or whether to reduce. Additionally, each shift specifies a new state and each reduce specifies a rule by which to reduce.

To implement reduce we also need to keep track of past states, since reducing by, e.g., add → add TPLUS mul requires going back three states to remember the context in which those three symbols appeared. This is easily done by keeping a state stack. The top of the stack points to the current state and shift pushes onto the stack. Reduce pops as many times as there are symbols on the right-hand side of the rule, to recover the previous state. From this state we then immediately perform a "shift" by the nonterminal that was on the left-hand side of the rule. This special shift is called a goto (but it behaves just like a shift).

This is the essence of the LR algorithm.

What we haven't talked about is what those states are, because it turns out to be rather difficult.

LR(0) parsing

UNDER CONSTRUCTION.

We can fix both problems if we keep track of where we are in a grammar rule. Let's define an item as a grammar rule with a little marker for the current position. For our grammar we have 12 items:

 1. add → . mul
 2. add → mul .
 3. add → . add TPLUS mul
 4. add → add . TPLUS mul
 5. add → add TPLUS . mul
 6. add → add TPLUS mul .
 7. mul → . TSYMBOL
 8. mul → TSYMBOL .
 9. mul → . mul TMUL TSYMBOL
10. mul → mul . TMUL TSYMBOL
11. mul → mul TMUL . TSYMBOL
12. mul → mul TMUL TSYMBOL .

Now we know that if the dot is before a terminal, we must shift that terminal, and if the dot is at the end of a rule, we must reduce. It's tempting to conclude now that we simply need to keep track of the current item but that can't be quite right. For instance, which one is the start item? 1, 3, 7 or 9?

Twist: It's a trick question! All of them. States are sets of items! Specifically, define a function CLOSURE as follows:

CLOSURE(set){
    do{
        foreach(item in set of the form α → β . A γ with nonterminal A)
            foreach(rule of the form A → . δ)
                add A → δ . to set;
    }while(set was changed);
    return set;
}

(Greek letters stand for arbitrary strings of symbols)

Running CLOSURE on the start rule with the dot at the far left (here 1 and 3), gives the start state:

State 0
    add → . mul
    add → . add TPLUS mul
    mul → . TSYMBOL
    mul → . mul TMUL TSYMBOL

We see that the only legitimate action is shifting TSYMBOL. Once we have shifted TSYMBOL, we take all items with the dot to the left of TSYMBOL and move it to the right of it. After calculating CLOSURE again, we get the next state. In code:

GOTO(set, T){
    newset = empty;
    foreach(item in set of the form α → β . T γ)
        add α → β T . γ to newset;
    return CLOSURE(newset);
}

In this case we get the state:

State 1
    mul → TSYMBOL .

Here the only legitimate action is reduce. We must now go back to state 0, which raises the question of how we are going to remember. In general if we reduce using a rule

a → d e f ...

we need to go back as many states as they are symbols on the right-hand side. We can do this by using a stack to keep track of state. The current state is the state on top of the stack and shift carries a number n that defines the state to be pushed on the stack. reduce similarly carries a rule number which we use to look up how many states to pop and discard.

In our case, after reduce we are back in state 0 but we still need to process mul we have just created from TSYMBOL using the rule mul → TSYMBOL. We should push mul on the token stack and move the dot past mul to determine the new state (to be pushed on the state stack), i.e. simply perform a shift by mul! Although it really is just like a shift, typically this action is called goto.

The new state in our case is

State 2
    add → mul .
    mul → mul . TMUL TSYMBOL

Now we have two legitimate actions: Shift TMUL or reduce add → mul. This unfortunate situation is referred to as a shift/reduce conflict. There isn't actually a solution in this case — the algorithm we built just now can't handle this grammar.