next up previous
Next: Parse Trees Up: CS2121: The Implementation and Previous: How lexers (e.g. from

Subsections


Ambiguous and confusing grammars

Ambiguous and confusing grammars


Ambiguous grammars

C and Java have an ambiguity in the grammar for expressions, which, hugely simplified, looks something like this:

  exp        : exp '-' sub_exp
             | sub_exp
             ;
  sub_exp    : '(' type_name ')' sub_exp
             | '-' sub_exp
             | id
             | literal
             | '(' exp ')'
             ;
  type_name  : id
             | more_complex_type_descriptions
             ;

This allows expressions like: 1, a, a - 1, ( a ), ( a - 1 ), - a, ( int ) a
but what is meant by: ( b ) - ( c ) ?

The problem is that a single input string corresponds to more than one possible parse tree. That is, it is a valid part of the language, but we don't know what it means for certain!

This is a genuine problem with Java and with C, that takes extra work by compiler-writers to solve - every identifier has to be checked (e.g. by LEX) to see if it has already appeared in a class or typedef declaration, in which case it definitely a type_name, otherwise it is an ordinary id and can't become a type_name. We would also need to modify the grammar slightly to make this distinction clear.

Ambiguous grammars are, by definition, going to be difficult to handle no matter what tools we use. The assumption made with languages designed for computers is that we do our best to make them unambiguous. Therefore, we would normally expect any tools we use, like YACC, only to have to handle unambiguous grammars. Given that, can they handle any unambiguous grammar?

Unfortunately, the answer is ``no'' - there are unambiguous grammars that tools like YACC and JAVACC can't handle. Luckily, for most good tools, you are unlikely to come across such a grammar, and if you do, you can usually modify the grammar to overcome the problems but still recognise the same language.

Equally unfortunately, there is no way of deciding whether a grammar is ambiguous or not - the best that can be done is to try to create a parser, but if the process fails it can't tell us whether this is because the grammar is really ambiguous or if it is just because the grammar is too confusing for the kind of parser we are trying to make.


How to confuse parsers

The decision that a parser repeatedly makes is: given what it has already read of the input, and the grammar rules it has already recognised, what grammar rule comes next? The more input the parser can look at before it has to make a decision, the more likely it is to be able to avoid confusion and get it right.

For example, suppose we look at languages where assignment is a particular kind of statement, rather than an operation that can be embedded in any expression:

  stat   : target '=' exp ';'
         | target '(' explist ')' ';'
         ;
  target : id
         | target '.' id
         ;

An LL(1) parser trying to compile this language would have difficulties distinguishing between assignments (e.g. a=x;) and procedure calls i.e. functions/methods returning void (e.g. a(x);). This is because an LL(1) parser has to decide which kind of statement it is looking at after seeing only 1 symbol (i.e. a), and it isn't until we see the = or ( that we can tell what is intended. Suppose we used a more complex algorithm, such as LL(3) - even this couldn't decide between e.g. a.b=x and a.b(x). In fact, no matter how far it looks ahead, an LL(n) parser, which looks ahead a fixed amount, can always be confused by a sufficiently complicated target in an assignment or call.

There are two kinds of solutions - the parser can use a variable amount of lookahead, as JAVACC can be asked to do, so it reads as far as the = or ( before making a decision - or we can rewrite the grammar, by left-factorising it (as mentioned in $\S $5.2.1), so that the two kinds of statement are merged until we can make the decision:

  stat            : target assign_or_call ';'
                  ;
  assign_or_call  : '=' exp
                  | '(' explist ')'
                  ;

An LR(1) parser has no difficulty dealing with the original grammar, as it will have read to the end of the statement, and seen the = or ( on the way, before it has to decide whether to recognise an assignment or a call.

It is possible to construct unambiguous grammars that would confuse any LR(n) parser (as well as any LL(n) parser) e.g. palindromes - strings that are their own mirror images, such as abba or abacaba:

  P :
    |    'a'    |    'b'    |    'c'    | . . .
    | 'a' P 'a' | 'b' P 'b' | 'c' P 'c' | . . .
    ;

The problem is that, although it is perfectly obvious to us what to do - find the middle, and work out to both ends - LR(n) and LL(n) read strictly left-to-right, and can only locate the middle of the string by using their finite lookahead to find the end of the string. This could not work for strings of length $>n$ for LL(n), or length $>2n$ for LR(n).

Confusing YACC

Once an ambiguity has been pointed out in a grammar, it is usually clear enough to the user what the problem is, even if it isn't obvious what to do about it. However, what kinds of error messages are reported by tools like YACC, and how easy is it to find the corresponding ambiguity or confusion?

YACC reports problems with grammars, whether ambiguous or just confusing, as shift/reduce conflicts (where YACC can't decide whether to perform a shift or a reduce - i.e. is the grammar rule complete?) and/or as reduce/reduce conflicts (where YACC can't decide which reduce to perform - i.e. which grammar rule is it?).

An example of a shift/reduce conflict

The start of a function/method declaration in a C-like language, that accepts headers like
void fred(int a, int b, float x, float z), looks something like this (type_name as in $\S $6.1):

header  : type_name id '(' params ')'
        | type_name id '(' ')'
        ;
params  : param
        | params ',' param
        ;
param   : type_name id
        ;

YACC has no problems with this grammar, but what if we modify it? It might be nice to be able to write the example above simply as void fred(int a, b, float x, z). We could try rewriting the grammar like this:

param   : type_name ids
        ;
ids     : id
        | ids ',' id
        ;
but now, YACC reports a shift/reduce conflict, and the details from the y.output file are:
13: shift/reduce conflict (shift 15, reduce 5) on ','
state 13
        param : type_name ids .  (5)
        ids : ids . ',' id  (7)

That is, when the generated parser sees a , after a list of identifiers in a param, it doesn't know whether that , (and the id it expects after) is part of the same param (in which case it should shift, to include them as part of the RHS) or the start of the next param (in which case it should reduce this RHS and start a new RHS).

This is not ambiguous, just confusing to YACC, as it needs more lookahead to see if the next few symbols are e.g. , a b (a is a type_name, b is a parameter name of type a) or , a , or , a ) (a is a parameter name of the current type). The way to make this clear to YACC is to rewrite the grammar so that it can see more of the input before having to make a decision:

params  : type_name id
        | params ',' type_name id
        | params ',' id
        ;

An example of a reduce/reduce conflict

YACC reports a reduce/reduce conflict for the grammar given in $\S $6.1:

8: reduce/reduce conflict (reduce 5, reduce 8) on ')'
state 8
        sub_exp : id .  (5)
        type_name : id .  (8)

That is, when it sees id ) it doesn't know whether the id is a variable giving a value or a type name, so it doesn't know which rule to use to recognise the id.

Assuming we don't already know what the problem is, this hasn't helped much, but we can get more information by working back through the states in the y.output file to try to find how we get here. To do so, we need to look for states that include shift 8 or goto 8. In this example, all we find is:

state 4
        sub_exp : '(' . type_name ')' sub_exp  (3)
        sub_exp : '(' . exp ')'  (7)
...	
        id  shift 8
so the input must include ( id ), which can be recognised either as a type-cast or as an expression.

This is a big hint about the source of the ambiguity in the grammar, but more by luck than anything else - YACC remains confused even if we make the grammar unambiguous, by removing the rule sub_exp : '-' sub_exp. YACC still reports the same reduce/reduce conflict for this modified grammar, as it is confused by an input as simple as ( a ) - it has to decide whether this is a value in an expression or a type-cast before it reads past the ) to see e.g. ( a ) 99 (i.e. a type-cast) or ( a ) - 99 (i.e. the value $a-99$).

Luckily, the solution to the general problem of the ambiguity - to somehow get LEX to distinguish between identifiers that are really type names (or class names) and all other identifiers - also solves this confusion for YACC.

Epilogue

Most of the time, an ambiguous grammar results from an error made by the implementor of a programming language. Sometimes, however, it is the fault of the language designer. Many languages are defined in such a way that some part is either inherently ambiguous or confusing (e.g. not LR(1)). Does this matter? We should not limit language designers to what a particular type of parser generator can cope with, but on the other hand there is no particular merit in making a language harder to compile if a small change can simplify the problem.

An example of this is a well-known problem with conditional statements; the dangling else. Most imperative languages permit conditional statements to take two slightly different forms:

  if ( ... ) ...
  if ( ... ) ...  else ...
so the else d in if (a) if (b) c else d could be associated either with if (a) or with if (b).

Most languages attempt to fix this problem by stating that the second interpretation is more natural, and so is correct, although some languages have different rules. Whatever the language definition, it is an extra rule that anyone learning the language has to remember.

Similarly, the compiler writer has to deal with this special case: if we use a tool like YACC we get a shift/reduce error - do we shift the else to get if (b) c else d, or do we reduce the if (b) c as it stands, so we get if (a) ... else d To overcome this problem, we can rewrite the grammar to explicity say ``you can't have an unmatched then (logically) immediately before an else - the then and the else must be paired up'':

stat      : matched
          | unmatched
          ;
unmatched : IF '(' exp ')' stat
          | IF '(' exp ')' matched ELSE unmatched
          | FOR '(' exp ';' exp ';' exp ')' unmatched
          | WHILE '(' exp ')' unmatched
          | . . .
          ;
matched   : IF '(' exp ')' matched ELSE matched
          | FOR '(' exp ';' exp ';' exp ')' matched
          | WHILE '(' exp ')' matched
          | . . .
          | exp
          ;

Alternatively, it is possible to make a simple change to the language which completely removes this ambiguity - to have a terminating keyword such as end_if or fi:

stat      : IF '(' exp ')' stat FI
          | IF '(' exp ')' stat ELSE stat FI
          | . . .
          ;

Exercises

($\dag $ indicates harder problems)

Readings

Capon & Jinks: chs. 6, 8.1, 8.4, 8.5
Aho, Sethi & Ullman: chs. 2.2, 4.3
Johnson $\S $5
Levine, Mason & Brown: chs. 3, 7, 8


next up previous
Next: Parse Trees Up: CS2121: The Implementation and Previous: How lexers (e.g. from
Pete Jinks
2004-10-26