C and Java have an ambiguity in the grammar for expressions, which, hugely simplified, looks something like this:
exp : exp '-' sub_exp | sub_exp ; sub_exp : '(' type_name ')' sub_exp | '-' sub_exp | id | literal | '(' exp ')' ; type_name : id | more_complex_type_descriptions ;
This allows expressions like:
1
,
a
,
a - 1
,
( a )
,
( a - 1 )
,
- a
,
( int ) a
but what is meant by:
( b ) - ( c )
?
The problem is that a single input string corresponds to more than one possible parse tree. That is, it is a valid part of the language, but we don't know what it means for certain!
This is a genuine problem with Java and with C, that takes extra work by
compiler-writers to solve - every identifier has to be checked (e.g. by
LEX) to see if it has already appeared in a class or typedef
declaration, in which case it definitely a type_name
, otherwise it is
an ordinary id
and can't become a type_name
. We would also need
to modify the grammar slightly to make this distinction clear.
Ambiguous grammars are, by definition, going to be difficult to handle no matter what tools we use. The assumption made with languages designed for computers is that we do our best to make them unambiguous. Therefore, we would normally expect any tools we use, like YACC, only to have to handle unambiguous grammars. Given that, can they handle any unambiguous grammar?
Unfortunately, the answer is ``no'' - there are unambiguous grammars that tools like YACC and JAVACC can't handle. Luckily, for most good tools, you are unlikely to come across such a grammar, and if you do, you can usually modify the grammar to overcome the problems but still recognise the same language.
Equally unfortunately, there is no way of deciding whether a grammar is ambiguous or not - the best that can be done is to try to create a parser, but if the process fails it can't tell us whether this is because the grammar is really ambiguous or if it is just because the grammar is too confusing for the kind of parser we are trying to make.
The decision that a parser repeatedly makes is: given what it has already read of the input, and the grammar rules it has already recognised, what grammar rule comes next? The more input the parser can look at before it has to make a decision, the more likely it is to be able to avoid confusion and get it right.
For example, suppose we look at languages where assignment is a particular kind of statement, rather than an operation that can be embedded in any expression:
stat : target '=' exp ';' | target '(' explist ')' ';' ; target : id | target '.' id ;
An LL(1) parser trying to compile this language would
have difficulties distinguishing between assignments
(e.g. a=x;
)
and procedure calls i.e. functions/methods returning void
(e.g. a(x);
).
This is because an LL(1) parser has to decide which kind of statement it is
looking at after seeing only 1 symbol (i.e. a
), and it isn't until we
see the =
or (
that we can tell what is intended. Suppose we
used a more complex algorithm, such as LL(3) - even this couldn't decide
between e.g. a.b=x
and a.b(x)
. In fact, no matter how
far it looks ahead, an LL(n) parser, which looks ahead a fixed amount, can
always be confused by a sufficiently complicated target
in an
assignment or call.
There are two kinds of solutions - the parser can use a variable amount of
lookahead, as JAVACC can be asked to do, so it reads as far as the
=
or (
before making a decision - or we can rewrite the
grammar, by left-factorising it (as mentioned in 5.2.1),
so that the two kinds of statement are merged until we can make the decision:
stat : target assign_or_call ';' ; assign_or_call : '=' exp | '(' explist ')' ;
An LR(1) parser has no difficulty dealing with the original grammar, as it
will have read to the end of the statement, and seen the =
or (
on the way, before it has to decide whether to recognise an assignment or a
call.
It is possible to construct unambiguous grammars that would confuse any LR(n)
parser (as well as any LL(n) parser) e.g. palindromes - strings that are
their own mirror images, such as abba
or abacaba
:
P : | 'a' | 'b' | 'c' | . . . | 'a' P 'a' | 'b' P 'b' | 'c' P 'c' | . . . ;
The problem is that, although it is perfectly obvious to us what to do - find the middle, and work out to both ends - LR(n) and LL(n) read strictly left-to-right, and can only locate the middle of the string by using their finite lookahead to find the end of the string. This could not work for strings of length for LL(n), or length for LR(n).
Once an ambiguity has been pointed out in a grammar, it is usually clear enough to the user what the problem is, even if it isn't obvious what to do about it. However, what kinds of error messages are reported by tools like YACC, and how easy is it to find the corresponding ambiguity or confusion?
YACC reports problems with grammars, whether ambiguous or just confusing, as shift/reduce conflicts (where YACC can't decide whether to perform a shift or a reduce - i.e. is the grammar rule complete?) and/or as reduce/reduce conflicts (where YACC can't decide which reduce to perform - i.e. which grammar rule is it?).
The start of a function/method declaration in a C-like language,
that accepts headers like
void fred(int a, int b, float x, float z)
,
looks something like this (type_name
as in 6.1):
header : type_name id '(' params ')' | type_name id '(' ')' ; params : param | params ',' param ; param : type_name id ;
YACC has no problems with this grammar, but what if we modify it?
It might be nice to be able to write the example above simply as
void fred(int a, b, float x, z)
.
We could try rewriting the grammar like this:
param : type_name ids ; ids : id | ids ',' id ;
y.output
file are:
13: shift/reduce conflict (shift 15, reduce 5) on ',' state 13 param : type_name ids . (5) ids : ids . ',' id (7)
That is, when the generated parser sees a ,
after a list of identifiers
in a param, it doesn't know whether that ,
(and the id
it
expects after) is part of the same param (in which case it should shift, to
include them as part of the RHS) or the start of the next param (in which case
it should reduce this RHS and start a new RHS).
This is not ambiguous, just confusing to YACC, as it needs more
lookahead to see if the next few symbols are e.g. , a b
(a
is a
type_name, b
is a parameter name of type a
) or , a ,
or , a )
(a
is a parameter name of the current type). The way
to make this clear to YACC is to rewrite the grammar so that it can
see more of the input before having to make a decision:
params : type_name id | params ',' type_name id | params ',' id ;
YACC reports a reduce/reduce conflict for the grammar given in 6.1:
8: reduce/reduce conflict (reduce 5, reduce 8) on ')' state 8 sub_exp : id . (5) type_name : id . (8)
That is, when it sees id )
it doesn't know whether the id
is a
variable giving a value or a type name, so it doesn't know which rule to use
to recognise the id
.
Assuming we don't already know what the problem is, this hasn't helped much,
but we can get more information by working back through the states in the
y.output
file to try to find how we get here. To do so, we need to look
for states that include shift 8
or goto 8
. In this example, all
we find is:
state 4 sub_exp : '(' . type_name ')' sub_exp (3) sub_exp : '(' . exp ')' (7) ... id shift 8
( id )
, which
can be recognised either as a type-cast or as an expression.
This is a big hint about the source of the ambiguity in the grammar, but more
by luck than anything else - YACC remains confused even if we make
the grammar unambiguous, by removing the rule sub_exp : '-' sub_exp
.
YACC still reports the same reduce/reduce conflict for this modified
grammar, as it is confused by an input as simple as ( a )
- it has to
decide whether this is a value in an expression or a type-cast before it reads
past the )
to see e.g. ( a ) 99
(i.e. a type-cast) or
( a ) - 99
(i.e. the value ).
Luckily, the solution to the general problem of the ambiguity - to somehow get LEX to distinguish between identifiers that are really type names (or class names) and all other identifiers - also solves this confusion for YACC.
Most of the time, an ambiguous grammar results from an error made by the implementor of a programming language. Sometimes, however, it is the fault of the language designer. Many languages are defined in such a way that some part is either inherently ambiguous or confusing (e.g. not LR(1)). Does this matter? We should not limit language designers to what a particular type of parser generator can cope with, but on the other hand there is no particular merit in making a language harder to compile if a small change can simplify the problem.
An example of this is a well-known problem with conditional statements; the dangling else. Most imperative languages permit conditional statements to take two slightly different forms:
if ( ... ) ... if ( ... ) ... else ...
else d
in if (a) if (b) c else d
could be associated either with if (a)
or with if (b)
.
Most languages attempt to fix this problem by stating that the second interpretation is more natural, and so is correct, although some languages have different rules. Whatever the language definition, it is an extra rule that anyone learning the language has to remember.
Similarly, the compiler writer has to deal with this special case: if
we use a tool like YACC we get a shift/reduce error
- do we shift the else
to get if (b) c else d
,
or do we reduce the if (b) c
as it stands, so we get
if (a) ... else d
To overcome this problem, we can rewrite the grammar to explicity say ``you
can't have an unmatched then
(logically) immediately before an
else
- the
then
and the else
must be paired up'':
stat : matched | unmatched ; unmatched : IF '(' exp ')' stat | IF '(' exp ')' matched ELSE unmatched | FOR '(' exp ';' exp ';' exp ')' unmatched | WHILE '(' exp ')' unmatched | . . . ; matched : IF '(' exp ')' matched ELSE matched | FOR '(' exp ';' exp ';' exp ')' matched | WHILE '(' exp ')' matched | . . . | exp ;
Alternatively, it is possible to make a simple change to the language which
completely removes this ambiguity - to have a terminating keyword such as
end_if
or fi
:
stat : IF '(' exp ')' stat FI | IF '(' exp ')' stat ELSE stat FI | . . . ;
%left
s and see what error messages you
get from YACC. Do you understand how they arise? (Use the
-v
flag with YACC to obtain the file y.output
.)
s = 'a' 'b' | 'a' s 'b' ;
a
s followed by an equal number of
b
s. This grammar has been extended to:
s = 'a' 'b' | 'a' s 'b' | 'a' s {error} | s 'b' {error} ;
$CS2121/e*/c_grammar/*
Run it through YACC to discover the
conflicts, and try to work out why they are there and how you could improve
the grammar to remove them.
Capon & Jinks: chs. 6, 8.1, 8.4, 8.5
Aho, Sethi & Ullman: chs. 2.2, 4.3
Johnson 5
Levine, Mason & Brown: chs. 3, 7, 8