Convert Left Recursion into Left Associative Iteration


Here is an unambiguous "left recursive" grammar for a simple
expression-like language.

      E3 ::= E3 '+' 'a'   // left recursion means left associative
           | 'a'

Using this grammar, we can parse the string "a+a+a+a" only one way,
as if it were "grouped" as ((a+a)+a)+a. Here is the parse tree.

                  E3
                 /|\
                / | \
              E3  +  a
             /|\
            / | \
          E3  +  a
         /|\
        / | \
      E3  +  a
      |
      |
      a

Notice that, in essence, a left-associative operator needs a
left-recursive grammar production that captures the "big stuff" on the
left side of the operator and a "little thing" on the right side of
the operator, e.g.,
       E ::= E '+' 'a'
A right-associative operator needs to have a right-recursive grammar
production that captures "big stuff" on the right side of the operator
and a "little thing" on the left side of the operator, e.g.,
       E ::= 'a' '+' E

A left-recursive production causes a problem for a recursive descent
parser: translated directly into code, it leads to an infinitely
recursive parser. Here is the pseudo code that describes the recursive
descent parser for the above left-recursive grammar.

   void getE(tokens)
   {
      if (number_of_remaining_tokens > 1)
      {
         getE(tokens);
         match('+');
         match('a');
      }
      else
      {
         match('a');
      }
   }

Notice how, if there is more than one token, then a call to getE()
immediately leads to another call to getE() without consuming any
tokens, so the number of tokens does not get decreased, and the
recursion does not stop.
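
The runaway recursion can be observed directly. The following is a
minimal Python sketch of the pseudo code above, not a definitive
implementation. It assumes the tokens are a plain list of strings and
that the parser tracks its place with an index; the names match(),
get_e() and recursion_overflows() are made up for this illustration.

```python
def match(tokens, pos, expected):
    """Consume one expected token and return the next position."""
    if pos < len(tokens) and tokens[pos] == expected:
        return pos + 1
    raise SyntaxError(f"expected {expected!r} at position {pos}")


def get_e(tokens, pos=0):
    """Direct transcription of E3 ::= E3 '+' 'a' | 'a'."""
    if len(tokens) - pos > 1:
        pos = get_e(tokens, pos)   # same tokens, same pos: nothing was consumed
        pos = match(tokens, pos, "+")
        pos = match(tokens, pos, "a")
    else:
        pos = match(tokens, pos, "a")
    return pos


def recursion_overflows(tokens):
    """Return True if get_e() exhausts Python's call stack on these tokens."""
    try:
        get_e(tokens)
    except RecursionError:
        return True
    return False


print(recursion_overflows(["a"]))             # one token: the base case is taken
print(recursion_overflows(["a", "+", "a"]))   # several tokens: the recursion never ends
```

Python reports the endless recursion as a RecursionError once its call
stack limit is reached; other languages would fail in their own way
(for example, with a stack overflow).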

We must rewrite the grammar to remove the left recursion and replace
the recursion with a kind of iteration.

We want to show that the above grammar can be rewritten to use the
Kleene star (from "regular expressions") in place of the left
recursion. The new grammar needs only one production.

   E3 ::= 'a' ( '+' 'a' )*   // * means zero-or-more

Let us look at how the left recursion is removed from the original
grammar to derive the new grammar.

Here is a sequence of left-most derivations of sentential forms
from the original left-recursive grammar.

                   E3
               E3 + a
           E3 + a + a
       E3 + a + a + a
   E3 + a + a + a + a
    a + a + a + a + a

Notice how the "expression part" is moving to the left, and in each
step it grows the expression by concatenating the string "a +" onto
what had been to its right. That means we can think of growing a
string in the language by starting with the string "a" and then
concatenating on the left as many "a +" strings as we like. This
leads to the following iterative production for the language (which
is not yet the production that we want)

       E3 ::= ( 'a' '+' )* 'a'

which uses iteration (the Kleene star) in place of recursion.

But we can just as reasonably look at the final string
      a + a + a + a + a
and say that we grow this string by starting with the string "a"
and then concatenating on the right(!) as many "+ a" strings as
we like. Hence, we can also derive this iterative production for
the language, which is the production that we want.

       E3 ::= 'a' ( '+' 'a' )*

So the left recursion in this production

       E3 ::=  E3 '+' 'a' | 'a'

can be factored out to derive this non-recursive production.

       E3 ::= 'a' ( '+' 'a' )*
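
As a sanity check (not a proof), we can derive a few strings from the
left-recursive production and confirm that each one matches the
iterative production. This is a small Python sketch; writing the
iterative production as the regular expression a(\+a)* and using the
single letter "E" for the nonterminal E3 are conveniences assumed for
this illustration, and derive() is a made-up helper name.

```python
import re

# The iterative production E3 ::= 'a' ( '+' 'a' )* written as a regular expression.
ITERATIVE = re.compile(r"a(\+a)*")


def derive(steps):
    """Apply E3 ::= E3 '+' 'a' `steps` times, then finish with E3 ::= 'a'."""
    form = "E"                               # "E" stands for the nonterminal E3
    for _ in range(steps):
        form = form.replace("E", "E+a", 1)   # expand the single E in the form
    return form.replace("E", "a", 1)


derived = [derive(k) for k in range(6)]
print(derived)

# Every derived string should match the iterative production ...
print(all(ITERATIVE.fullmatch(s) for s in derived))

# ... and strings outside the language should not (this list should be empty).
print([s for s in ["", "+a", "a+", "aa", "a++a"] if ITERATIVE.fullmatch(s)])
```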

It is easy to translate this production into a while-loop that parses
the language. But there is still a problem. Let us go back to the
right recursive expression language E2,

      E2 ::= 'a' '+' E2   // right recursion means right associative
           | 'a'

and use the same trick to factor out the recursion.

Here is a sequence of left-most derivations of sentential forms
from the E2 grammar.

      E2
      a + E2
      a + a + E2
      a + a + a + E2
      a + a + a + a + E2
      a + a + a + a + a

Notice how the "expression part" is moving to the right, and in each
step it grows the expression by concatenating the string "+ a" onto
what had been to its left. So that means we can think of growing a
string in the E2 language by starting with the string "a" and then
concatenating on the right as many "+ a" strings as we like. This
leads to the following production for language E2.

       E2 ::= 'a' ( '+' 'a' )*

But this is exactly the same production we derived for language E3!
Language E2 is supposed to be right associative and language E3 is
supposed to be left associative, but if they have the same grammar,
how can we say which associativity they have?

The answer is that we cannot. This production

       E ::= 'a' ( '+' 'a' )*

does not define an associativity for the '+' operator. Instead, the
associativity of the '+' operator will be determined by how we write
the parser, not by how we write the production. When the parser code
uses this production to parse an expression, we can have the parser
code build a parse tree as either a left associative or a right
associative parse tree.

Here is the non-recursive pseudo code that parses the iterative
production.

   void getE(tokens)
   {
      tokens.match('a');      // the leading 'a'; an error if it is missing
      while ( ! tokens.isEmpty() && tokens.match('+') )
      {
         tokens.match('a');   // every '+' must be followed by an 'a'
      }
   }

This is a recognizing parser. It doesn't do anything but parse and
throw an error if the list of tokens doesn't parse.
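
For readers who want to run the recognizer, here is one possible
Python rendering. It is a sketch under a few assumptions that the
pseudo code leaves open: the tokens are a list of one-character
strings, a failed match raises SyntaxError, and any tokens left over
after the loop count as an error. The helper names expect() and
accepts() are made up for this illustration.

```python
def expect(tokens, pos, expected):
    """Consume one expected token and return the next position."""
    if pos < len(tokens) and tokens[pos] == expected:
        return pos + 1
    raise SyntaxError(f"expected {expected!r} at position {pos}")


def get_e(tokens):
    """Recognize E ::= 'a' ( '+' 'a' )*; raise SyntaxError on bad input."""
    pos = expect(tokens, 0, "a")
    while pos < len(tokens) and tokens[pos] == "+":
        pos = expect(tokens, pos + 1, "a")   # skip the '+', then require an 'a'
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r} at position {pos}")


def accepts(text):
    """Return True if the string parses, False if get_e() raises an error."""
    try:
        get_e(list(text))    # one token per character
    except SyntaxError:
        return False
    return True


for text in ["a", "a+a+a", "a+", "aa", ""]:
    print(repr(text), accepts(text))
```

Only "a" and "a+a+a" in this list are in the language; the other three
strings should be rejected.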


Let us see how to modify this parser so that it builds a left
associative expression tree. To motivate how to modify the above
code, let us consider the example string "a+a+a+a+a+a".

Here is the sequence of left associative expression trees that we
should get as we parse "a+a+a+a+a+a" by starting with "a" and then
iterating the concatenation of "+ a" on the right.

string: "a"   "a+a"      "a+a+a"     "a+a+a+a"     "a+a+a+a+a"    "a+a+a+a+a+a"

  tree:  a      +           +            +              +                +
               / \         / \          / \            / \              / \
              a   a       +   a        +   a          +   a            +   a
                         / \          / \            / \              / \
                        a   a        +   a          +   a            +   a
                                    / \            / \              / \
                                   a   a          +   a            +   a
                                                 / \              / \
                                                a   a            +   a
                                                                / \
                                                               a   a


To better see what is going on, replace the letter 'a' with distinct letters.

string: "a"   "a+b"      "a+b+c"     "a+b+c+d"     "a+b+c+d+e"    "a+b+c+d+e+f"

  tree:  a      +           +            +              +                +
               / \         / \          / \            / \              / \
              a   b       +   c        +   d          +   e            +   f
                         / \          / \            / \              / \
                        a   b        +   c          +   d            +   e
                                    / \            / \              / \
                                   a   b          +   c            +   d
                                                 / \              / \
                                                a   b            +   c
                                                                / \
                                                               a   b

Notice that as we move to the right from string to string, the expression trees
grow in a very specific way. The next tree in the sequence of trees always has
the previous tree as the left branch of its root.

      root of next tree  -->   +
                              / \
                      previous   a
                      tree

This is the hint that we need to write the code that builds these expression
trees. Be sure to carefully compare this version of getE() to the previous
version.

   Tree getE(tokens)
   {
      tokens.match('a');
      Tree currentTree = new Tree("a");

      while ( ! tokens.isEmpty() && tokens.match('+') )
      {
         tokens.match('a');
         currentTree = new Tree("+", currentTree, "a"); // left associative
      }
      return currentTree;
   }

Follow this code as it parses the string "a+a+a+a". (IMPORTANT: Really do
follow this code as it parses the string.) It parses the string into a
left-associative parse tree.
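
If you want to check your hand trace, the loop can be run. The Python
sketch below makes some assumptions that are not in the pseudo code:
a tree is either the string "a" or a tuple ("+", left, right), the
tokens are a list of one-character strings, and the optional trace
list and the show() helper exist only to display each intermediate
tree.

```python
def expect(tokens, pos, expected):
    """Consume one expected token and return the next position."""
    if pos < len(tokens) and tokens[pos] == expected:
        return pos + 1
    raise SyntaxError(f"expected {expected!r} at position {pos}")


def parse_left(tokens, trace=None):
    """Parse E ::= 'a' ( '+' 'a' )* into a left-associative tree."""
    pos = expect(tokens, 0, "a")
    tree = "a"
    if trace is not None:
        trace.append(tree)
    while pos < len(tokens) and tokens[pos] == "+":
        pos = expect(tokens, pos + 1, "a")
        tree = ("+", tree, "a")      # the previous tree becomes the LEFT child
        if trace is not None:
            trace.append(tree)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token at position {pos}")
    return tree


def show(tree):
    """Render a tree as a fully parenthesized string."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return f"({show(left)}{op}{show(right)})"
    return tree


steps = []
parse_left(list("a+a+a+a"), steps)
for t in steps:
    print(show(t))
```

The printed lines are a, (a+a), ((a+a)+a) and (((a+a)+a)+a), one per
pass through the loop, which is the same growth pattern as the
sequence of trees drawn earlier.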

But now modify the code this way.

   Tree getE(tokens)
   {
      tokens.match('a');
      Tree currentTree = new Tree("a");

      while ( ! tokens.isEmpty() && tokens.match('+') )
      {
         tokens.match('a');
         currentTree = new Tree("+", "a", currentTree); // right associative?
      }
      return currentTree;
   }

Again, follow this code as it parses the string "a+a+a+a". (IMPORTANT: Really
do follow this code as it parses the string.) Now it parses the string into
(what seems to be) a right-associative parse tree. But there's a problem.

Modify the parser once again (so that it can parse strings with variables
other than "a").

   Tree getE(tokens)
   {
      Token tk = tokens.nextToken();
      Tree currentTree = new Tree(tk);

      while ( ! tokens.isEmpty() && tokens.match('+') )
      {
         tk = tokens.nextToken();
         currentTree = new Tree("+", tk, currentTree); // right associative
      }
      return currentTree;
   }

Now follow the above parser as it parses the string "a+b+c+d". You will
see that it is not really parsing the expression to be right-associative.
It's not even parsing the string correctly. Instead of building this right
associative expression tree,

         "a+b+c+d"
             +
            / \
           a   +
              / \
             b   +
                / \
               c   d

the code builds this tree.

             +
            / \
           d   +
              / \
             c   +
                / \
               b   a

But if you tokenize the string "a+b+c+d" from right-to-left, so the
token list is
    ["d", "+", "c", "+", "b", "+", "a"]
and then you once again follow the parser as it parses this token list,
then you should get a correct, right-associative, parse tree.
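
Both traces can be checked with a short Python sketch. As assumptions
for this illustration, a tree is a one-letter string or a tuple
("+", left, right), the tokens are a list of one-character strings,
and right-to-left tokenizing is simulated by reversing the token list.
The names parse_prepend() and show() are made up here.

```python
def parse_prepend(tokens):
    """Loop parser that puts each new operand on the LEFT of the current tree."""
    if not tokens or tokens[0] == "+":
        raise SyntaxError("expected an operand at position 0")
    tree = tokens[0]
    pos = 1
    while pos < len(tokens) and tokens[pos] == "+":
        if pos + 1 >= len(tokens) or tokens[pos + 1] == "+":
            raise SyntaxError(f"expected an operand at position {pos + 1}")
        tree = ("+", tokens[pos + 1], tree)   # new operand on the left
        pos += 2
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token at position {pos}")
    return tree


def show(tree):
    """Render a tree as a fully parenthesized string."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return f"({show(left)}{op}{show(right)})"
    return tree


left_to_right = list("a+b+c+d")      # ['a', '+', 'b', '+', 'c', '+', 'd']
right_to_left = left_to_right[::-1]  # ['d', '+', 'c', '+', 'b', '+', 'a']

print(show(parse_prepend(left_to_right)))   # (d+(c+(b+a)))  operands in the wrong order
print(show(parse_prepend(right_to_left)))   # (a+(b+(c+d)))  right associative
```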


The last several examples show that the grammar

       E3 ::= 'a' ( '+' 'a' )*

DOES NOT determine any associativity for the operator. It doesn't really
tell us how to parse. But we can use the grammar as a guide to implement
parsers for either a left-associative operator or a right-associative
operator (but the right-associative parser needs a right-to-left
tokenizer!).

Of course, if we really want a right-associative operator, we should use
the right recursive grammar

       E ::= 'a' '+' E
           | 'a'

and write the recursive descent parser for this grammar, and use a
left-to-right tokenizer.
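
Here is what such a recursive descent parser might look like in
Python. It is a sketch, not the only way to write it. As assumptions
for this illustration, it accepts any one-character operand (not just
'a') so that the grouping is visible, a tree is a string or a tuple
("+", left, right), and the names get_e(), parse_right() and show()
are made up here.

```python
def get_e(tokens, pos=0):
    """E ::= operand '+' E | operand. Return (tree, next position)."""
    if pos >= len(tokens) or tokens[pos] == "+":
        raise SyntaxError(f"expected an operand at position {pos}")
    left = tokens[pos]               # the "little thing" on the left
    pos += 1
    if pos < len(tokens) and tokens[pos] == "+":
        # A token has been consumed before recursing, so the recursion ends.
        right, pos = get_e(tokens, pos + 1)   # the "big stuff" on the right
        return ("+", left, right), pos
    return left, pos


def parse_right(text):
    """Tokenize left to right (one token per character) and parse it all."""
    tokens = list(text)
    tree, pos = get_e(tokens)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token at position {pos}")
    return tree


def show(tree):
    """Render a tree as a fully parenthesized string."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return f"({show(left)}{op}{show(right)})"
    return tree


print(show(parse_right("a+b+c+d")))   # (a+(b+(c+d)))
```

Unlike the left-recursive case, each recursive call here happens only
after an operand and a '+' have been consumed, so the number of
remaining tokens shrinks with every call.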



Question: What does the following grammar give you? Notice
that this production mixes left recursion with iteration.

   E ::= ( E '+' )* 'a'