next up previous contents
Next: Parse Trees Up: CS2111: Design and Implementation Previous: Inside LEX and YACC:   Contents

Subsections


Debugging YACC grammars

How YACC works and fails

It is possible to write apparently sensible grammars that YACC cannot cope with. To deal with this problem, we must understand enough of how YACC works to understand how it can fail.

YACC repeatedly shifts input symbols until it has a complete RHS and then reduces the RHS to its LHS. At each input, it must decide whether it has a complete RHS to reduce or not, and if so which RHS to use. If it has a complete RHS, will shifting more symbols give it a different, overlapping RHS?

YACC uses the LR(1) algorithm, which restricts the grammars it accepts so that it can always decide by looking ahead at the next input symbol.

If, at some point in a grammar, YACC can not make a decision just using 1 symbol lookahead, it reports shift/reduce (is the RHS complete?) or reduce/reduce (which RHS is it?) conflicts. When this happens, we can look in the y.output file to find the states where something is wrong, and work out what the problem is.

Debugging an example grammar

Suppose we tried to modify the left recursive grammar from $\S $6 to detect various errors:

list    : id
        | list optcomma id
        | list ','               {yyerror ("extra comma ignored");}
        ;
optcomma: ','
        |                        {yyerror ("missing comma inserted");}
        ;
the y.output file from YACC includes the following:

4: reduce/reduce conflict (red'ns 3 and 4) on id
state 4 list : list ,_   (3)
        optcomma : ,_   (4)

        .=reduce 3
(_ indicates where we are up to in the input - BYACC uses .)

i.e. having read a ',' to get to state 4, if we then read an id we do not know which of the two rules to recognise the comma with:


\begin{picture}(32,13)
\put(10,12){list}
\put(15,12){*optcomma}
\put(10.8,11.8)...
...10,4){list}
\put(10.8,3.8){\line(1,-1){2.2}} %
\put(12,0){*list}
\end{picture}
$[$ $]$ = state number at this point in the input, * = the alternative reductions

We can sometimes get more information by finding the state(s) that precede(s) state 4 by looking for shift 4 or goto 4 in y.output (and repeating if necessary):


state 1 $accept : list_$end
        list : list_optcomma id
        list : list_,
        optcomma : _   (5)

        $end accept       , shift 4       id reduce 5       optcomma goto 3

\begin{picture}(32,17)
\put(15,16){list}
\put(14.8,15.8){\line(-1,-1){2.2}} % /...
...10,4){list}
\put(10.8,3.8){\line(1,-1){2.2}} %
\put(12,0){*list}
\end{picture}
Finally, if we reexamine the grammar, we can see that the only way the second parse can recognise a list followed by an id is if we can insert an optcomma between them, which we can onll do by recognising an empty optcomma:

\begin{picture}(32,20)
\put(15,20){list}
\put(14.8,19.8){\line(-1,-1){2.2}} % /...
...12,4){*list}
\put(14.8,3.8){\line(1,-1){2.2}} %
\put(18,0){list}
\end{picture}

i.e. the parse tree for an input of id , id can be:

\begin{picture}(16,9)
\put(7,8){list}
\put(6.8,7.8){\line(-2,-1){4.4}} % /
\pu...
...put(7.8,3.8){\line(0,-1){2.2}} % \vert
\put(0,0){id}\put(7,0){','}
\end{picture}

i.e. ((id) (,) id)
or:

\begin{picture}(20,13)
\put(11,12){list}
\put(10.8,11.8){\line(-2,-1){4.4}} % /...
...t(7,4){','}
\put(0.8,3.8){\line(0,-1){2.2}} % \vert
\put(0,0){id}
\end{picture}

i.e. (((id) ,) () id) with two error messages:

   extra comma ignored
   missing comma inserted

Like LEX, YACC assumes that it should use the earlier of the two rules, which in this case means rule 3 rather than rule 4, so we get the second derivation. If we swap the rules for list and optcomma we get the first derivation, which would be best.

However, few grammars are as simple as this, and it is rarely possible to be certain that a reduce-reduce conflict is safe, so we normally try to remove them completely.

In this case, the problem may be that we are forcing the parser to decide what to do with the ',' too early on. We can try to delay the decision by getting rid of the optcomma grammar rule and putting extra alternatives into the rule for list:


list    : id
        | list ',' id
        | list id                {yyerror ("missing comma inserted");}
        | list ','               {yyerror ("extra comma ignored");}
        ;
Unfortunately, looking at the y.output file from YACC for this grammar, we have simply replaced the reduce-reduce conflict by a shift-reduce conflict:

3: shift/reduce conflict (shift 5, reduce 4) on id
state 3 list : list ,_id
        list : list ,_   (4)

        id shift 5        . reduce 4
i.e. after inputting and reducing a list, and shifting a ',' if the parser then encounters another id, it does not know whether to include this id in the current list (shift) or leave this id to become part of a later list (reduce). i.e. the parse tree for an input of id , id can be:

\begin{picture}(16,9)
\put(7,8){list}
\put(6.8,7.8){\line(-2,-1){4.4}} % /
\pu...
...t(14,4){id}
\put(0.8,3.8){\line(0,-1){2.2}} % \vert
\put(0,0){id}
\end{picture}

i.e. ((id) , id)
or:

\begin{picture}(20,13)
\put(11,12){list}
\put(10.8,11.8){\line(-2,-1){4.4}} % /...
...t(7,4){','}
\put(0.8,3.8){\line(0,-1){2.2}} % \vert
\put(0,0){id}
\end{picture}

i.e. (((id) ,) id) with the same two error messages as before

Looking at the possible parses, this is still pretty much the same problem as before, so unfortunately the change we made to the grammar has not made any real improvement - our grammar is fundamentally ambiguous, rather than (as we hoped after the first error) just a little too complicated for YACC to use.

Like LEX, YACC assumes that it should try and recognise the longer of the two rules i.e it should shift the id, which in this case means we get the first derivation, which is what we wanted. However, as mentioned above, it is dangerous to leave a conflict in a grammar and we normally try to remove them completely.

The only thing to do is to take a step back and describe afresh what we are trying to recognise. I will concentrate on the identifiers, as they are what we are really interested in. We want to recognise a list of identifiers, each pair separated by zero or more commas (although we expect exactly one), and the whole list perhaps preceded and trailed by some commas (although we expect none):


list     : no_commas idlist no_commas
         ;
idlist   : id
         | idlist one_comma id
         ;
one_comma: commas
         |                 {yyerror ("missing comma inserted");}
         ;
no_commas: commas          {yyerror ("extra comma ignored");}
         |
         ;
commas   : commas ','      {yyerror ("extra comma ignored");}
         | ','
         ;


Language design and implementation

Most of the time, an ambiguous grammar results from an error made by the implementor of a programming language. Sometimes, however, it is the fault of the language designer. Many languages are defined in such a way that some part is either inherently ambiguous, or at least ambiguous to YACC (i.e. not LALR(1)). There is some controversy as to whether this is important - we should not limit language designers to what a particular type of parser generator can cope with, but on the other hand there is no particular merit in making a language harder to process if a simple change can resolve the problem.

A simple example of this is a well-known problem with conditional statements; the dangling else. Most imperative languages permit conditional statements to take two slightly different forms:


  if ( ... ) ...
  if ( ... ) ...  else ...
(many languages also use the keyword then) so the else d in

  if (a) if (b) c else d
could be associated either with if (a) or with if (b).

Most languages attempt to fix this problem by stating that the second interpretation is more natural, and so is correct, although in Algol 60 the example would be illegal: a then-clause may not consist of an if-statement. Whatever the language definition, it is an extra rule that anyone learning the language has to remember.

Similarly, the language implementor has to deal with this special case: if we use a tool like YACC we get a shift-reduce error, and the generated parser would perform the shift to give us the second alternative as expected. (To avoid the error message, if the language includes the then keyword, we can make then and else right associative or give then a lower precedence than else.)

It is possible to make a simple change to the grammar which completely removes this ambiguity; to have a terminating keyword such as end_if or fi (c.f. $\S $9, $\S $11).

Exercises

($\dag $ indicates harder problems)

Readings

Capon & Jinks: chs. 6, 8.1, 8.4, 8.5
Aho, Sethi & Ullman: chs. 2.2, 4.3
Johnson $\S $5
Levine, Mason & Brown: chs. 3, 7, 8
next up previous contents
Next: Parse Trees Up: CS2111: Design and Implementation Previous: Inside LEX and YACC:   Contents
Pete Jinks
1999-09-30