Next: Parse Trees Up: CS2121: The Implementation and Previous: How lexers (e.g. from

Subsections

Ambiguous and confusing grammars

Ambiguous grammars

C and Java have an ambiguity in the grammar for expressions, which, hugely simplified, looks something like this:

  exp        : exp '-' sub_exp
             | sub_exp
             ;
  sub_exp    : '(' type_name ')' sub_exp
             | '-' sub_exp
             | id
             | literal
             | '(' exp ')'
             ;
  type_name  : id
             | more_complex_type_descriptions
             ;

This allows expressions like: 1, a, a - 1, ( a ), ( a - 1 ), - a, ( int ) a
but what is meant by: ( b ) - ( c ) ?

The problem is that a single input string corresponds to more than one possible parse tree. That is, it is a valid part of the language, but we don't know what it means for certain!

This is a genuine problem with Java and with C, that takes extra work by compiler-writers to solve - every identifier has to be checked (e.g. by LEX) to see if it has already appeared in a class or typedef declaration, in which case it definitely a type_name, otherwise it is an ordinary id and can't become a type_name. We would also need to modify the grammar slightly to make this distinction clear.

Ambiguous grammars are, by definition, going to be difficult to handle no matter what tools we use. The assumption made with languages designed for computers is that we do our best to make them unambiguous. Therefore, we would normally expect any tools we use, like YACC, only to have to handle unambiguous grammars. Given that, can they handle any unambiguous grammar?

Unfortunately, the answer is ``no'' - there are unambiguous grammars that tools like YACC and JAVACC can't handle. Luckily, for most good tools, you are unlikely to come across such a grammar, and if you do, you can usually modify the grammar to overcome the problems but still recognise the same language.

Equally unfortunately, there is no way of deciding whether a grammar is ambiguous or not - the best that can be done is to try to create a parser, but if the process fails it can't tell us whether this is because the grammar is really ambiguous or if it is just because the grammar is too confusing for the kind of parser we are trying to make.

How to confuse parsers

The decision that a parser repeatedly makes is: given what it has already read of the input, and the grammar rules it has already recognised, what grammar rule comes next? The more input the parser can look at before it has to make a decision, the more likely it is to be able to avoid confusion and get it right.

For example, suppose we look at languages where assignment is a particular kind of statement, rather than an operation that can be embedded in any expression:

  stat   : target '=' exp ';'
         | target '(' explist ')' ';'
         ;
  target : id
         | target '.' id
         ;

An LL(1) parser trying to compile this language would have difficulties distinguishing between assignments (e.g. a=x;) and procedure calls i.e. functions/methods returning void (e.g. a(x);). This is because an LL(1) parser has to decide which kind of statement it is looking at after seeing only 1 symbol (i.e. a), and it isn't until we see the = or ( that we can tell what is intended. Suppose we used a more complex algorithm, such as LL(3) - even this couldn't decide between e.g. a.b=x and a.b(x). In fact, no matter how far it looks ahead, an LL(n) parser, which looks ahead a fixed amount, can always be confused by a sufficiently complicated target in an assignment or call.

There are two kinds of solutions - the parser can use a variable amount of lookahead, as JAVACC can be asked to do, so it reads as far as the = or ( before making a decision - or we can rewrite the grammar, by left-factorising it (as mentioned in $\S$ 5.2.1), so that the two kinds of statement are merged until we can make the decision:

  stat            : target assign_or_call ';'
                  ;
  assign_or_call  : '=' exp
                  | '(' explist ')'
                  ;

An LR(1) parser has no difficulty dealing with the original grammar, as it will have read to the end of the statement, and seen the = or ( on the way, before it has to decide whether to recognise an assignment or a call.

It is possible to construct unambiguous grammars that would confuse any LR(n) parser (as well as any LL(n) parser) e.g. palindromes - strings that are their own mirror images, such as abba or abacaba:

  P :
    |    'a'    |    'b'    |    'c'    | . . .
    | 'a' P 'a' | 'b' P 'b' | 'c' P 'c' | . . .
    ;

The problem is that, although it is perfectly obvious to us what to do - find the middle, and work out to both ends - LR(n) and LL(n) read strictly left-to-right, and can only locate the middle of the string by using their finite lookahead to find the end of the string. This could not work for strings of length for LL(n), or length for LR(n).

Confusing YACC

Once an ambiguity has been pointed out in a grammar, it is usually clear enough to the user what the problem is, even if it isn't obvious what to do about it. However, what kinds of error messages are reported by tools like YACC, and how easy is it to find the corresponding ambiguity or confusion?

YACC reports problems with grammars, whether ambiguous or just confusing, as shift/reduce conflicts (where YACC can't decide whether to perform a shift or a reduce - i.e. is the grammar rule complete?) and/or as reduce/reduce conflicts (where YACC can't decide which reduce to perform - i.e. which grammar rule is it?).

An example of a shift/reduce conflict

The start of a function/method declaration in a C-like language, that accepts headers like
void fred(int a, int b, float x, float z), looks something like this (type_name as in $\S$ 6.1):

header  : type_name id '(' params ')'
        | type_name id '(' ')'
        ;
params  : param
        | params ',' param
        ;
param   : type_name id
        ;

YACC has no problems with this grammar, but what if we modify it? It might be nice to be able to write the example above simply as void fred(int a, b, float x, z). We could try rewriting the grammar like this:

param   : type_name ids
        ;
ids     : id
        | ids ',' id
        ;

but now, YACC reports a shift/reduce conflict, and the details from the y.output file are:

13: shift/reduce conflict (shift 15, reduce 5) on ','
state 13
        param : type_name ids .  (5)
        ids : ids . ',' id  (7)

That is, when the generated parser sees a , after a list of identifiers in a param, it doesn't know whether that , (and the id it expects after) is part of the same param (in which case it should shift, to include them as part of the RHS) or the start of the next param (in which case it should reduce this RHS and start a new RHS).

This is not ambiguous, just confusing to YACC, as it needs more lookahead to see if the next few symbols are e.g. , a b (a is a type_name, b is a parameter name of type a) or , a , or , a ) (a is a parameter name of the current type). The way to make this clear to YACC is to rewrite the grammar so that it can see more of the input before having to make a decision:

params  : type_name id
        | params ',' type_name id
        | params ',' id
        ;

An example of a reduce/reduce conflict

YACC reports a reduce/reduce conflict for the grammar given in $\S$ 6.1:

8: reduce/reduce conflict (reduce 5, reduce 8) on ')'
state 8
        sub_exp : id .  (5)
        type_name : id .  (8)

That is, when it sees id ) it doesn't know whether the id is a variable giving a value or a type name, so it doesn't know which rule to use to recognise the id.

Assuming we don't already know what the problem is, this hasn't helped much, but we can get more information by working back through the states in the y.output file to try to find how we get here. To do so, we need to look for states that include shift 8 or goto 8. In this example, all we find is:

state 4
        sub_exp : '(' . type_name ')' sub_exp  (3)
        sub_exp : '(' . exp ')'  (7)
...	
        id  shift 8

so the input must include ( id ), which can be recognised either as a type-cast or as an expression.

This is a big hint about the source of the ambiguity in the grammar, but more by luck than anything else - YACC remains confused even if we make the grammar unambiguous, by removing the rule sub_exp : '-' sub_exp. YACC still reports the same reduce/reduce conflict for this modified grammar, as it is confused by an input as simple as ( a ) - it has to decide whether this is a value in an expression or a type-cast before it reads past the ) to see e.g. ( a ) 99 (i.e. a type-cast) or ( a ) - 99 (i.e. the value ).

Luckily, the solution to the general problem of the ambiguity - to somehow get LEX to distinguish between identifiers that are really type names (or class names) and all other identifiers - also solves this confusion for YACC.

Epilogue

Most of the time, an ambiguous grammar results from an error made by the implementor of a programming language. Sometimes, however, it is the fault of the language designer. Many languages are defined in such a way that some part is either inherently ambiguous or confusing (e.g. not LR(1)). Does this matter? We should not limit language designers to what a particular type of parser generator can cope with, but on the other hand there is no particular merit in making a language harder to compile if a small change can simplify the problem.

An example of this is a well-known problem with conditional statements; the dangling else. Most imperative languages permit conditional statements to take two slightly different forms:

  if ( ... ) ...
  if ( ... ) ...  else ...

so the else d in if (a) if (b) c else d could be associated either with if (a) or with if (b).

Most languages attempt to fix this problem by stating that the second interpretation is more natural, and so is correct, although some languages have different rules. Whatever the language definition, it is an extra rule that anyone learning the language has to remember.

Similarly, the compiler writer has to deal with this special case: if we use a tool like YACC we get a shift/reduce error - do we shift the else to get if (b) c else d, or do we reduce the if (b) c as it stands, so we get if (a) ... else d To overcome this problem, we can rewrite the grammar to explicity say ``you can't have an unmatched then (logically) immediately before an else - the then and the else must be paired up'':

stat      : matched
          | unmatched
          ;
unmatched : IF '(' exp ')' stat
          | IF '(' exp ')' matched ELSE unmatched
          | FOR '(' exp ';' exp ';' exp ')' unmatched
          | WHILE '(' exp ')' unmatched
          | . . .
          ;
matched   : IF '(' exp ')' matched ELSE matched
          | FOR '(' exp ';' exp ';' exp ')' matched
          | WHILE '(' exp ')' matched
          | . . .
          | exp
          ;

Alternatively, it is possible to make a simple change to the language which completely removes this ambiguity - to have a terminating keyword such as end_if or fi:

stat      : IF '(' exp ')' stat FI
          | IF '(' exp ')' stat ELSE stat FI
          | . . .
          ;

Exercises

( $\dag$ indicates harder problems)

You will need your version of the calculator for the exercise of $\S$ 4. Remove the %lefts and see what error messages you get from YACC. Do you understand how they arise? (Use the -v flag with YACC to obtain the file y.output.)
Modify the calculator in $\S$ 3.1 to make the input format more flexible: create a grammar rule that recognises none, one or more spaces, and insert it where appropriate in the original grammar. You are likely to get lots of unexpected shift/reduce and reduce/reduce conflicts, so run your answer through YACC to check it.
$\dag$ The grammar: s = 'a' 'b' | 'a' s 'b' ;
will recognise a string of as followed by an equal number of bs. This grammar has been extended to:
s = 'a' 'b' | 'a' s 'b' | 'a' s {error} | s 'b' {error} ;
to detect errors but this gives shift/reduce and reduce/reduce conflicts (try it for yourself). Write a conflict-free grammar to detect errors.
$\dag$ The grammar for ANSI C given by Kernighan & Ritchie is in $CS2121/e*/c_grammar/* Run it through YACC to discover the conflicts, and try to work out why they are there and how you could improve the grammar to remove them.
$\dag$ How would you use LEX and YACC to recognise a language like SML, where the precedence, associativity and fix of operators can be varied.

Readings

Capon & Jinks: chs. 6, 8.1, 8.4, 8.5
Aho, Sethi & Ullman: chs. 2.2, 4.3
Johnson $\S$ 5
Levine, Mason & Brown: chs. 3, 7, 8

Next: Parse Trees Up: CS2121: The Implementation and Previous: How lexers (e.g. from

Pete Jinks
2004-10-26