Introduction to Parsing Theory
A common problem in computing is dealing with text input. For instance, a compiler must read the source code in order to analyse it. Typically, as a first step, the compiler would translate the source code into a tree format, which is easy to handle for further analysis steps. This process is referred to as parsing.
For instance, given the input string
total + number * price
we want to generate a tree that looks roughly like this:

+add+
 
 +mul+
  
total number price
We can also write this tree in LISPsyntax as:
(add
total
(mul
number
price))
The original expression is written in what's called the concrete syntax, the result is in abstract syntax and the tree is called an abstract syntax tree (or AST).
(As a side note, not all parsers produce abstract syntax trees. For instance, a calculator might instead evaluate partial results on the fly.)
Lexical analysis
Although it is possible to run a parser directly on the ASCII/Unicode input, it is far more common to use a simple preprocessor called a lexical analyser (or lexer for short). Lexical analysis converts the input (a sequence of characters) into a sequence of tokens. For instance, our example above might be converted to
TSYMBOL(total) TPLUS TSYMBOL(number) TMUL TSYMBOL(price) TEOF
Comments are usually elided at this stage and a token TEOF (for "end of file") is inserted at the end of the input stream (whether spaces and newlines are elided depends on whether they are significant in the language being parsed).
While it is possible to derive lexical analysers using regular expressions and finite state machines (using tools such as lex(1)), in practice it is fairly easy to write a lexical analyser by hand. A pseudocode example:
while(is space or comment) skip;
if(no more input) return TEOF;
if(is letter){
read as many letters or digits as possible into buffer;
return TSYMBOL(buffer);
}
if(is digits){
read as many digits as possible into buffer;
return TNUMBER(stringToInt(buffer));
}
if(is '+') return TPLUS;
etc.
if(unrecognized) error;
Exercise: (longer) implement this lexical analyser in your favourite programming language. You can either use complex data structures for tokens to include the extra info like "buffer" or use a simple enumeration and make buffer into a global variable (which isn't pretty but workable).
You may find the following helper functions useful:
peeked = EMPTY;
getch(){ // return the next input character
if(peeked != EMPTY){
c = peeked;
peeked = EMPTY;
return c;
}
return read one character;
}
peekch(){ // look ahead one input character
if(peeked == EMPTY)
peeked = getch();
return peeked;
}
Grammars
The job of the parser is now to convert the sequence of tokens into an AST. The form of the input is usually described using a grammar. A grammar is a series of rules (or productions) of the form
a b c ... → d e f ...
On both sides there is a number of symbols which can be either terminals (input tokens) or nonterminals.
Nonterminals correspond to the intermediate nodes in the tree (such as add
or mul
above).
To be a valid grammar, there must always be one nonterminal on the lefthand side.
For example, a simplified grammar for expressions might look like
expr → TSYMBOL
expr → expr TPLUS expr
expr → expr TMUL expr
The simplest way to understand a grammar is to repeatedly substitute the rules, starting with a start symbol, such as expr
, and make your way towards the input sequence.
This is called a derivation; for our example it reads
expr → expr TPLUS expr → expr TPLUS expr TMUL expr → TSYMBOL TPLUS TSYMBOL TMUL TSYMBOL
Each rule in a grammar can be taken to produce a tree fragment:
expr → (expr TSYMBOL)
expr → (expr expr TPLUS expr)
expr → (expr expr TMUL expr)
By applying the corresponding rules in the same order, we produce the parse tree for the input expression:
expr
→ (expr expr TPLUS expr)
→ (expr expr TPLUS (expr expr TMUL expr))
→ (expr (expr TSYMBOL) TPLUS (expr (expr TSYMBOL) TMUL (expr TSYMBOL)))
In this particular example you may have noticed that there is an alternative sequence of rules which generates the same input, but leads to a different parse tree:
expr → expr TMUL expr → expr TPLUS expr TMUL expr → TSYMBOL TPLUS TSYMBOL TMUL TSYMBOL
expr
→ (expr expr TMUL expr)
→ (expr (expr expr TPLUS expr) expr)
→ (expr (expr (expr TSYMBOL) TPLUS (expr TSYMBOL)) TMUL (expr TSYMBOL))
This is because this is an ambiguous grammar, i.e. a grammar which permits multiple parse trees for the same input.
Specifically, the grammar does not define whether +
or *
has higher precedence, which means either interpretation is valid.
Although it is readily apparent here, in general it can be very hard to determine whether a grammar is ambiguous or not.
Given a completely arbitrary grammar, it is, in fact, an undecidable problem like the halting problem.
While ambiguous grammars are undesirable from a theoretical point of view, in practice it is common to use a technically ambiguous grammar with extra rules that define a unique interpretation.
In the case of expressions we define +
and *
to be leftassociative (i.e. a+b+c
is to be read as (a+b)+c
) and *
to be of higher precedence than +
(i.e. a*b+c
is read as (a*b)+c
).
It is also possible to modify the grammar to make it unambiguous:
add → mul
add → add TPLUS mul
mul → TSYMBOL
mul → mul TMUL TSYMBOL
Exercise: Write down the derivation and the parse tree for this grammar and our example. Note that there are now extra steps in the derivation and in the parse tree.
Exercise: Write an unambiguous grammar that also allows for parentheses, subtraction, division and numerical constants (TNUMBER
tokens), with the usual precedence rules.
Exercise: Write an unambiguous grammar where +
is rightassociative and *
is not associative (i.e. a*b*c
is not legal) and +
has precedence over *
.
Unrestricted vs contextfree grammars
You may have noticed that all of our grammars have been of the form
a → b c d ...
where a
is a nonterminal.
This type of grammar is called a contextfree grammar (because the a
is not surrounded by other symbols which would form context for the rule), or CFG for short.
A CFG is strictly weaker than a general (unrestricted) grammar, i.e. there are some constructs that cannot be parsed with a contextfree grammar.
Most famously, because of typedef
, C and its derivatives cannot be directly parsed by a CFG.
For example, the statement
a ** b;
is read as declaration of a variable b
of type a**
if a
is a type, and as an expression a * (*b)
otherwise.
Abstractly, the grammar would have to say
(sequence with typedef for x) type → (sequence with typedef for x) x
(no sequence with typedef for x) variable → (no sequence with typedef for x) x
so that we can write rules like
stat → type stars variable
expr → variable
etc.
Although the rules for variable
and type
can't be implemented in the parser, we can "cheat" by having the lexer return different tokens depending on whether the parser has seen a typedef
(this is the infamous lexer hack).
With this hack, C is now parseable by a CFG.
Because it simplifies parsing considerably, from here on, we will assume that we are dealing with a contextfree grammar.
At this point I will also mention a shorthand notation:
a → b ...  c ...
is short for
a → b ...
a → c ...
Recursivedescent parsing
A recursivedescent parser is the simplest type of parser. It is also incredibly common; they are used in many production compilers because they can be made very efficient and flexible.
The idea behind recursivedescent parsing is that we can directly implement a grammar rule of the form
a → b c d ...
in code by writing
a() {
b();
c();
d();
}
Exercise: How would the code look if we wanted to generate a parse tree?
If a
appears in the definitions of b
, c
, d
, the result will be a mutually recursive set of functions, hence the first part of the name.
The "descent" part refers to the fact that we have a topdown parser, i.e. we start with the top of the parse tree (a
) and work our way down to the tokens (unlike the bottomup LR technique we will discuss later).
If we have multiple rules for a
, things become a bit trickier.
If the two alternatives start with different tokens, we can use look ahead, e.g.
stat → TIF ifstat
 TWHILE whilestat
stat() {
if(next token is TIF)
ifstat();
else if(next token is TWHILE)
whilestat();
else
error();
}
To introduce a bit of technical terminology, if the grammar can be parsed in this way with n tokens of lookahead, we say it is a LL(n) grammar. Luckily, most programming languages can be described by LL(1) grammars, which means writing a recursivedescent parser for them is straightforward.
To adopt recursivedescent for more general grammars, we need to use backtracking:
stat() {
save state;
if(!ifstat()){
restore state;
if(!whilestat())
error();
}
}
Backtracking can handle any grammar unless it is leftrecursive, meaning it has rules of the form a → a b ...
.
This does not limit the power of the technique though, because leftrecursive rules can always be rewritten into rightrecursive rules of the form a → b ... a
.
For instance, the rule
T → F
 T O F
can be rewritten to
T → F X
X →
X → O F X
Note that in code we can simply write:
T() {
F();
while(next is O){
O();
F();
}
}
Exercise: You may have noticed this corresponds to a leftassociative operator O
. How would the code look like for a rightassociative operator O
? (Start from the grammar)
Exercise: In pseudocode, write a recursivedescent parser for our example expression grammar. You will need to take the unambiguous form and take care with the leftrecursion.
Exercise: Write a parser for general arithmetic expressions (the grammar for which was a previous exercise).
Exercise: (longer) Actually implement the parser from the last question, using a lexical analyser similar to the one from a previous exercise.
You may find it useful to define functions peektok()
and gettok()
which function like peekch()
and getch()
, but for tokens.
Two other useful functions:
expect(t) { if(gettok() != t) error(); }
got(t) { if(peektok() == t) {gettok(); return true;} return false; }
Exercise: Turn the parser from the last exercise into an actual calculator. The functions should now return floatingpoint numbers and perform the appropriate calculations while parsing.
Exercise: (much longer) Write a grammar and parser for a simple imperative programming language.
Precedence parsing
If we have a language like C with many operators and many precedence levels, it becomes tedious to write out the many functions required. Since each precedence also adds a function call level, the parser will also be slowed by the many function calls. Precedence parsing is a specialised technique to parse arithmetic expressions.
If we only had once precedence level and only leftassociative operators we could write:
expr() {
result = primary();
while(next is an operator){
op = operator();
result = op(result, primary());
}
return result;
}
(Here primary
is a simple recursivedescent procedure for handling variables, numbers, parentheses, function calls, unary operators, etc.)
By contrast, for rightassociative operators the recursivedescent parser would look like
expr() {
result = primary();
if(next is an operator){
op = operator();
return op(result, expr());
}
}
We can eliminate the recursion here by using a stack. Since op
and expr
have potentially different types, we will simply use two stacks
(I separate out reduce
for reasons that will be obvious soon).
operators = empty stack;
primaries = empty stack;
reduce() {
op = operators.pop();
b = primaries.pop();
a = primaries.pop();
primaries.push(op(a,b));
}
expr() {
primaries.push(primary());
while(next is an operator){
op = operator();
operators.push(op);
primaries.push(primary());
}
while(operators is not empty)
reduce();
return primaries.pop();
}
We can put leftassociative operators into the same "framework" by simply adding a reduce
call into the first loop:
expr() {
primaries.push(primary());
while(next is an operator){
op = operator();
if(operators is not empty)
reduce();
operators.push(op);
primaries.push(primary());
}
while(operators is not empty)
reduce();
return primaries.pop();
}
We are, in fact, almost there now. The only thing that remains is replacing the if
condition in the first while
loop with a more sophisticated condition.
Let's assume for now that all operators are leftassociative.
The condition then becomes
if(next is lower precedence than operators.top)
reduce();
Exercise: work by hand through the parsing of our example expression.
Exercise: find the condition for the general case where the operators can also be left or rightassociative. Note: operators at the same precedence level have the same associativity.
LR parsing
Although precedence parsing might seem like an extremely specialised algorithm, Donald Knuth realised in 1965 that the same technique could be adapted to parse any grammar.
Imagine a precedence parser with only one stack. When parsing the expression a + b * c + d the stack will evolve as follows:
Stack Action Input
expr shift a + b * c + d
expr + shift + b * c + d
expr + expr shift b * c + d
expr + expr * shift * c + d
expr + expr * expr shift c + d
expr + expr reduce c + d
expr reduce c + d
expr + shift + d
expr + expr shift d
expr reduce
Here, shift
is the conventional name for reading an input token and pushing it on the stack (I'm converting primaries to expr
onthefly).
reduce
takes three tokens of the form expr op expr
off the top and replaces them with one taken expr
.
Since expr op expr
is just the righthand side of a grammar rule, this suggests the following algorithm for a general CFG:
 If the top of the stack looks like the righthand side of a grammar rule, reduce.
 Otherwise shift.
However, this can't be quite right, because above we sometimes shifted, when rule 1 would have prompted us to reduce. In the case of precedence parsing, it is easy to explain the decision using operator precedences, but since we want to use general CFGs, let's use the unambiguous grammar and figure out what we want to happen:
add → mul
add → add TPLUS mul
mul → TSYMBOL
mul → mul TMUL TSYMBOL
Stack Action Input
TSYMBOL shift a + b * c + d
mul reduce + b * c + d
add reduce + b * c + d
add TPLUS shift b * c + d
add TPLUS TSYMBOL shift * c + d
add TPLUS mul reduce * c + d
(*) add TPLUS mul TMUL shift c + d
add TPLUS mul TMUL TSYMBOL shift c + d
add TPLUS mul reduce c + d
add reduce c + d
add TPLUS shift + d
add TPLUS TSYMBOL shift d
add TPLUS mul reduce
add reduce
Note that at (*)
we shifted instead of reduced.
If we had reduced, we would have got
add reduce * c + d
add TMUL shift c + d
add TMUL TSYMBOL shift + d
add TMUL mul reduce + d
add TMUL add TPLUS shift d
add TMUL add TPLUS TSYMBOL shift
add TMUL add TPLUS mul reduce
add TMUL add reduce
And now we're stuck, because there is no input and no applicable rules.
Let's take a step back. What other problems does this algorithm have?
 Erroneous input is detected far too late (try running it through TSYMBOL TSYMBOL TPLUS TSYMBOL).
 Comparing all grammar rules to the top stack elements is slow.
To fix this we need to make the algorithm aware of where it currently is in the input with respect to the grammar, i.e. it needs to keep track of state. In each state we have a table that tells the parser for each input token whether to shift or whether to reduce. Additionally, each shift specifies a new state and each reduce specifies a rule by which to reduce.
To implement reduce we also need to keep track of past states, since reducing by, e.g., add → add TPLUS mul
requires going back three states to remember the context in which those three symbols appeared.
This is easily done by keeping a state stack.
The top of the stack points to the current state and shift
pushes onto the stack.
Reduce
pops as many times as there are symbols on the righthand side of the rule, to recover the previous state.
From this state we then immediately perform a "shift" by the nonterminal that was on the lefthand side of the rule.
This special shift is called a goto
(but it behaves just like a shift).
This is the essence of the LR algorithm.
What we haven't talked about is what those states are, because it turns out to be rather difficult.
LR(0) parsing
UNDER CONSTRUCTION.
We can fix both problems if we keep track of where we are in a grammar rule. Let's define an item as a grammar rule with a little marker for the current position. For our grammar we have 12 items:
1. add → . mul
2. add → mul .
3. add → . add TPLUS mul
4. add → add . TPLUS mul
5. add → add TPLUS . mul
6. add → add TPLUS mul .
7. mul → . TSYMBOL
8. mul → TSYMBOL .
9. mul → . mul TMUL TSYMBOL
10. mul → mul . TMUL TSYMBOL
11. mul → mul TMUL . TSYMBOL
12. mul → mul TMUL TSYMBOL .
Now we know that if the dot is before a terminal, we must shift that terminal, and if the dot is at the end of a rule, we must reduce. It's tempting to conclude now that we simply need to keep track of the current item but that can't be quite right. For instance, which one is the start item? 1, 3, 7 or 9?
Twist: It's a trick question! All of them. States are sets of items!
Specifically, define a function CLOSURE
as follows:
CLOSURE(set){
do{
foreach(item in set of the form α → β . A γ with nonterminal A)
foreach(rule of the form A → . δ)
add A → δ . to set;
}while(set was changed);
return set;
}
(Greek letters stand for arbitrary strings of symbols)
Running CLOSURE
on the start rule with the dot at the far left (here 1 and 3), gives the start state:
State 0
add → . mul
add → . add TPLUS mul
mul → . TSYMBOL
mul → . mul TMUL TSYMBOL
We see that the only legitimate action is shifting TSYMBOL.
Once we have shifted TSYMBOL, we take all items with the dot to the left of TSYMBOL and move it to the right of it.
After calculating CLOSURE
again, we get the next state. In code:
GOTO(set, T){
newset = empty;
foreach(item in set of the form α → β . T γ)
add α → β T . γ to newset;
return CLOSURE(newset);
}
In this case we get the state:
State 1
mul → TSYMBOL .
Here the only legitimate action is reduce
.
We must now go back to state 0, which raises the question of how we are going to remember.
In general if we reduce using a rule
a → d e f ...
we need to go back as many states as they are symbols on the righthand side.
We can do this by using a stack to keep track of state.
The current state is the state on top of the stack and shift
carries a number n
that defines the state to be pushed on the stack.
reduce
similarly carries a rule number which we use to look up how many states to pop and discard.
In our case, after reduce
we are back in state 0 but we still need to process mul
we have just created from TSYMBOL
using the rule mul → TSYMBOL
.
We should push mul
on the token stack and move the dot past mul
to determine the new state (to be pushed on the state stack), i.e. simply perform a shift
by mul
!
Although it really is just like a shift
, typically this action is called goto
.
The new state in our case is
State 2
add → mul .
mul → mul . TMUL TSYMBOL
Now we have two legitimate actions: Shift TMUL
or reduce add → mul
.
This unfortunate situation is referred to as a shift/reduce conflict.
There isn't actually a solution in this case — the algorithm we built just now can't handle this grammar.