UP

Ambiguous grammars for unambiguous languages


The starting point

Chris Dodd gave a different solution for this ambiguous grammar:
text	: text1
	| text2
	;
text1	:
	| text PARA inlineText PARA_END
	| text1 inlineElement
	;
text2	: text PARA inlineText
	;
inlineText:
	| inlineText inlineElement
	;
inlineElement: TEXTBLOCK
	| BOLD inlineText BOLD_END
	;
This solution is broadly similar to mine, but expressed rather differently, and arrived at in a totally different and very interesting way (see below)

Here is a comparison of the two answers:

-> EBNF (I used my version of EBNF and in particular # to indicate a list separator.)

text	: text1 ( PARA inlineElement* )*
	;
text1	: ( text PARA inlineElement* PARA_END )? inlineElement*
	;
inlineElement: TEXTBLOCK | BOLD inlineElement* BOLD_END
	;
->
text	: (text PARA inlineElement* PARA_END)? (inlineElement* # PARA)+
	;
inlineElement: TEXTBLOCK | BOLD inlineElement* BOLD_END
	;
-> [ a : (a b)? c ; -> a : (c # b)+ ; ]
text	: ((inlineElement* # PARA)+ # PARA inlineElement* PARA_END)+
	;
inlineElement: TEXTBLOCK | BOLD inlineElement* BOLD_END
	;
-> [ ((a # b)+ # c)+ -> (a # (b|c))+ ]
text	: (inlineElement* # (PARA | PARA inlineElement* PARA_END) )+
	;
inlineElement: TEXTBLOCK | BOLD inlineElement* BOLD_END
	;

How does this method work?

Note that I am trying to make sense of something that is new to me and looks to be a very useful technique, rather than trying to make any definitive analysis of all this.

Chris Dodd wrote: "To get rid of shift/reduce conflicts, the basic technique is to split the rules that are causing the reduction part of the conflict into two rules and to use the appropriate rule(s) based on the following context of the use. So you need to look at the symbols that come after the [NON-]terminal you're trying to split, where that [NON-]terminal appears on the right hand side of a rule."

i.e.

text	:
	| text textElement
	;
textElement: inlineElement
	| paragraph
	;
inlineText:
	| inlineText inlineElement
	;
inlineElement: TEXTBLOCK
	| BOLD inlineText BOLD_END
	;
paragraph: PARA inlineText optParaEnd
	;
optParaEnd:
	| PARA_END
	;
gives:
8: shift/reduce conflict (shift 3, reduce 10) on BOLD
8: shift/reduce conflict (shift 4, reduce 10) on TEXTBLOCK
state 8
	inlineText : inlineText . inlineElement  (6)
	paragraph : PARA inlineText . optParaEnd  (9)
	optParaEnd : .  (10)
i.e., if we see the start of an "inlineElement" (BOLD or TEXTBLOCK) we don't know if this is part of the "inlineText" of this "paragraph" of this "textElement" (shift), or we have an empty "optParaEnd" and this is part of the next "textElement" (reduce 10).

"The problem comes when the [NON-]terminal is the last thing on the right hand side. In that case, you need to unfactor the rule (eliminating the [NON-]terminal, but resulting in more rules) until you can find the following context." The rule that is involved in the reduce part of the shift/reduce conflict ("optParaEnd") is indeed at the extremem right-hand-side of the rule it is used in ("paragraph") so we need to expand it:

text	:
	| text textElement
	;
textElement: inlineElement
	| paragraph
	;
inlineText:
	| inlineText inlineElement
	;
inlineElement: TEXTBLOCK
	| BOLD inlineText BOLD_END
	;
paragraph: PARA inlineText PARA_END
	| PARA inlineText
	;
giving
8: shift/reduce conflict (shift 3, reduce 10) on BOLD
8: shift/reduce conflict (shift 4, reduce 10) on TEXTBLOCK
state 8
	inlineText : inlineText . inlineElement  (6)
	paragraph : PARA inlineText . PARA_END  (9)
	paragraph : PARA inlineText .  (10)
"We still have no FOLLOW context, because everywhere paragraph" (the non-terminal for the reduce part of the shift/reduce conflict) "appears on the right side of a rule, it is the last thing. So continue to unfactor" i.e. expand "paragraph":
text	:
	| text textElement
	;
textElement: inlineElement
	| PARA inlineText PARA_END
	| PARA inlineText
	;
inlineText:
	| inlineText inlineElement
	;
inlineElement: TEXTBLOCK
	| BOLD inlineText BOLD_END
	;
giving
7: shift/reduce conflict (shift 3, reduce 5) on BOLD
7: shift/reduce conflict (shift 4, reduce 5) on TEXTBLOCK
state 7
	textElement : PARA inlineText . PARA_END  (4)
	textElement : PARA inlineText .  (5)
	inlineText : inlineText . inlineElement  (7)
This still isn't enough, as "textElement" is also at the extreme right of the grammar rules in which it appears, so now expand "textElement":
text	:
	| text inlineElement
	| text PARA inlineText PARA_END
	| text PARA inlineText
	;
inlineText:
	| inlineText inlineElement
	;
inlineElement: TEXTBLOCK
	| BOLD inlineText BOLD_END
	;
giving
6: shift/reduce conflict (shift 3, reduce 4) on BOLD
6: shift/reduce conflict (shift 4, reduce 4) on TEXTBLOCK
state 6
	text : text PARA inlineText . PARA_END  (3)
	text : text PARA inlineText .  (4)
	inlineText : inlineText . inlineElement  (6)
Now we can see what can FOLLOW rule 4, as it is reduced to "text", which is followed (within the rule for "text" itself) by "PARA" or "inlineElement", and the latter could instead be shifted to be part of the "inlineText".

"The specific rule triggering the conflict is now "text: text PARA inlineText", and it conflicts with a following context of inlineElement. So we need to split the 'text' rule into those parts that conflict" (text2) "and those that do not" (text1)

text	: text1
	| text2
	;
text1	:
	| text inlineElement
	| text PARA inlineText PARA_END
	;
text2	: text PARA inlineText
	;
inlineText:
	| inlineText inlineElement
	;
inlineElement: TEXTBLOCK
	| BOLD inlineText BOLD_END
	;
giving
8: shift/reduce conflict (shift 5, reduce 6) on BOLD
8: shift/reduce conflict (shift 6, reduce 6) on TEXTBLOCK
state 8
	text1 : text PARA inlineText . PARA_END  (5)
	text2 : text PARA inlineText .  (6)
	inlineText : inlineText . inlineElement  (8)
"and change the uses that conflict" (i.e. where what FOLLOWs "text", after a reduce, is the alternative to the shift, in a shift/reduce conflict) "to only use the non-conflicting part"
i.e. we are explicitly resolving the shift/reduce conflict in favour of the shift, just as yacc will implicitly do by default after reporting a conflict.
text	: text1
	| text2
	;
text1	:
	| text1 inlineElement
	| text PARA inlineText PARA_END
	;
text2	: text PARA inlineText
	;
inlineText:
	| inlineText inlineElement
	;
inlineElement: TEXTBLOCK
	| BOLD inlineText BOLD_END
	;
which is conflict-free.

Does this work for other ambiguous grammars for unambiguous languages?

A grammar for errors in list separators

list    : id
        | list optcomma id
        | list ',' /* extra comma */
        ;
optcomma: ','
        | /* missing comma */
        ;
gives:
3: reduce/reduce conflict (reduce 3, reduce 4) on id
state 3
	list : list ',' .  (3)
	optcomma : ',' .  (4)
and, unfortunately, this is a reduce/reduce conflict. However, I think we can proceed by expanding "optcomma":
list    : id
        | list ',' id
        | list id /* missing comma */
        | list ',' /* extra comma */
        ;
to give a shift/reduce conflict:
4: shift/reduce conflict (shift 5, reduce 4) on id
state 4
	list : list ',' . id  (2)
	list : list ',' .  (4)
so we split "list":
list	: list_C
	| list_noC
	;
list_noC: id
        | list ',' id
        | list id /* missing comma */
        ;
list_C	: list ',' /* extra comma */
	;
and disambiguate in favour of the shift for an "id":
list	: list_C
	| list_noC
	;
list_noC: id
        | list ',' id
        | list_noC id /* missing comma */
        ;
list_C	: list ',' /* extra comma */
	;
and this works! Rewriting slightly:
list	: list ',' /* extra comma */
	| list_id
	;
list_id	:          id
        | list ',' id
        | list_id  id /* missing comma */
        ;

The dangling else

This should work, as we want to resolve the shift/reduce conflict in favour of the shift.
stat	: IF '(' exp ')' stat
	| IF '(' exp ')' stat ELSE stat
	| exp
        ;
gives
7: shift/reduce conflict (shift 8, reduce 1) on ELSE
state 7
        stat : IF '(' exp ')' stat .  (1)
        stat : IF '(' exp ')' stat . ELSE stat  (2)
I now need to set about splitting the rule for "stat" into conflicting and non-conflicting parts. As I'm not really sure how to do this, I will start by simply rewriting the grammar as:
stat	: stat_ok
	| stat_not
	;
stat_not: SILLY
	;
stat_ok	: exp
	| IF '(' exp ')' stat_ok
	| IF '(' exp ')' stat_not
	| IF '(' exp ')' stat_ok ELSE stat_not
	| IF '(' exp ')' stat_not ELSE stat_not
	| IF '(' exp ')' stat_ok ELSE stat_ok
	| IF '(' exp ')' stat_not ELSE stat_ok
        ;
(SILLY is just used to stop stat_not matching an empty string before I have moved any rules from stat_ok to stat_not.)

Following the algorithm, the rule(s) triggering the conflicts is (the expansions of) stat:IF '(' exp ')' stat, which conflicts with a following context of ELSE, so we have to get rid of the uses of stat_not ELSE:

stat	: stat_ok
	| stat_not
	;
stat_not: SILLY
	;
stat_ok	: exp
	| IF '(' exp ')' stat_ok
	| IF '(' exp ')' stat_not
	| IF '(' exp ')' stat_ok ELSE stat_not
	| IF '(' exp ')' stat_ok ELSE stat_ok
        ;
and move the conflicting rule - i.e. the expansions of stat:IF '(' exp ')' stat - from "stat_ok" to "stat_not" (as stat_not is no longer empty, I can also get rid of silly):
stat	: stat_ok
	| stat_not
	;
stat_not: IF '(' exp ')' stat_not
	| IF '(' exp ')' stat_ok
	;
stat_ok	: exp
	| IF '(' exp ')' stat_ok ELSE stat_not
	| IF '(' exp ')' stat_ok ELSE stat_ok
        ;
however, this still gives a conflict, and I don't think the report helps us to know what to do next - except maybe it hints that the problem is that there is still a rule in "stat_ok" that ends in "stat_not", and we are trying to ensure that "stat_not" cannot precede "ELSE" in the grammar:
9: shift/reduce conflict (shift 11, reduce 4) on ELSE
state 9
        stat_not : IF '(' exp ')' stat_ok .  (4)
        stat_ok : IF '(' exp ')' stat_ok . ELSE stat_not  (6)
        stat_ok : IF '(' exp ')' stat_ok . ELSE stat_ok  (7)
I guess that it is plausible that, as we are worrying about resolving a shift/reduce conflict in favour of the shift, I should also move rules ending in "stat_not" into the rule for "stat_not" itself, and this certainly cures the problem and gets the expected answer:
stat	: stat_ok
	| stat_not
	;
stat_not: IF '(' exp ')' stat_not
	| IF '(' exp ')' stat_ok
	| IF '(' exp ')' stat_ok ELSE stat_not
	;
stat_ok	: exp
	| IF '(' exp ')' stat_ok ELSE stat_ok
        ;

A very simple version of the ambiguous grammar for expressions

exp	: id
	| exp '+' exp
	| exp '*' exp
	;
which gives:
5: shift/reduce conflict (shift 3, reduce 2) on '+'
5: shift/reduce conflict (shift 4, reduce 2) on '*'
state 5
        exp : exp . '+' exp  (2)
        exp : exp '+' exp .  (2)
        exp : exp . '*' exp  (3)

6: shift/reduce conflict (shift 3, reduce 3) on '+'
6: shift/reduce conflict (shift 4, reduce 3) on '*'
state 6
        exp : exp . '+' exp  (2)
        exp : exp . '*' exp  (3)
        exp : exp '*' exp .  (3)
i.e. we haven't specified the associativity and precedence of the two operators. However, although we want to resolve one of these conflicts in favour of shifting (state 5 and '*') in the other three cases we want to reduce!

Classically, assuming we want to use pure BNF and not %left etc., the way to achieve this is to split the rule so we end up with a different rule for each precedence level, and to rewrite the recursion to get the correct associativity:

exp	: exp2
	| exp '+' exp2
	;
exp2	: id
	| exp2 '*' id
	;

Does this do anything for unambiguous grammars?

An LR(2) grammar that recognises yacc input

%token RULENAME TOKEN
%start yacc
%%
yacc : rule
     | yacc rule
     ;
rule : rulebody
     | rulebody ";"
     ;
rulebody : RULENAME ':' alt
         | rulebody "|" alt
         ;
alt :
    | alt TOKEN
    | alt RULENAME
    ;
gives:
9: shift/reduce conflict (shift 11, reduce 5) on RULENAME
state 9
	rulebody : RULENAME ':' alt .  (5)
	alt : alt . TOKEN  (8)
	alt : alt . RULENAME  (9)

10: shift/reduce conflict (shift 11, reduce 6) on RULENAME
state 10
	rulebody : rulebody '|' alt .  (6)
	alt : alt . TOKEN  (8)
	alt : alt . RULENAME  (9)
So the reductions are of "rulebody", which occurs in 3 places in the grammar, and as one of those is at the extreme right of a rule, we need to start expanding it. As it is recursive, this is not as simple as in the previous grammar:
yacc : rule
     | yacc rule
     ;
rule : RULENAME ':' alt
     | rulebody "|" alt
     | rulebody ";"
     ;
rulebody : RULENAME ':' alt
         | rulebody "|" alt
         ;
alt :
    | alt TOKEN
    | alt RULENAME
    ;
gives:
9: shift/reduce conflict (shift 11, reduce 3) on RULENAME
state 9
	rule : RULENAME ':' alt .  (3)
	rulebody : RULENAME ':' alt .  (6)
	alt : alt . TOKEN  (9)
	alt : alt . RULENAME  (10)

10: shift/reduce conflict (shift 11, reduce 4) on RULENAME
state 10
	rule : rulebody '|' alt .  (4)
	rulebody : rulebody '|' alt .  (7)
	alt : alt . TOKEN  (9)
	alt : alt . RULENAME  (10)
and now the conflicts have moved to the rules for "rule", which also occurs at the extreme right, so expand again:
yacc : RULENAME ':' alt
     | rulebody "|" alt
     | rulebody ";"
     | yacc RULENAME ':' alt
     | yacc rulebody "|" alt
     | yacc rulebody ";"
     ;
rulebody : RULENAME ':' alt
         | rulebody "|" alt
         ;
alt :
    | alt TOKEN
    | alt RULENAME
    ;
giving:
9: shift/reduce conflict (shift 14, reduce 1) on RULENAME
state 9
	yacc : RULENAME ':' alt .  (1)
	rulebody : RULENAME ':' alt .  (7)
	alt : alt . TOKEN  (10)
	alt : alt . RULENAME  (11)

13: shift/reduce conflict (shift 14, reduce 2) on RULENAME
state 13
	yacc : rulebody '|' alt .  (2)
	rulebody : rulebody '|' alt .  (8)
	alt : alt . TOKEN  (10)
	alt : alt . RULENAME  (11)

16: shift/reduce conflict (shift 14, reduce 4) on RULENAME
state 16
	yacc : yacc RULENAME ':' alt .  (4)
	rulebody : RULENAME ':' alt .  (7)
	alt : alt . TOKEN  (10)
	alt : alt . RULENAME  (11)

17: shift/reduce conflict (shift 14, reduce 5) on RULENAME
state 17
	yacc : yacc rulebody '|' alt .  (5)
	rulebody : rulebody '|' alt .  (8)
	alt : alt . TOKEN  (10)
	alt : alt . RULENAME  (11)
There are conflicts in 4 of the rules for "yacc", all to do with how a "RULENAME" is treated, so we split the rule for "yacc":
yacc : yacc_noC
     | yacc_C
     ;
yacc_C : RULENAME ':' alt
       | rulebody "|" alt
       | yacc RULENAME ':' alt
       | yacc rulebody "|" alt
       ;
yacc_noC : rulebody ";"
	 | yacc rulebody ";"
	 ;
rulebody : RULENAME ':' alt
         | rulebody "|" alt
         ;
alt :
    | alt TOKEN
    | alt RULENAME
    ;
and follow the rules to try to shift "RULENAME" at a shift/reduce conflict:
yacc : yacc_noC
     | yacc_C
     ;
yacc_C : RULENAME ':' alt
       | rulebody "|" alt
       | yacc_noC RULENAME ':' alt
       | yacc_noC rulebody "|" alt
       ;
yacc_noC : rulebody ";"
	 | yacc_noC rulebody ";"
	 ;
rulebody : RULENAME ':' alt
         | rulebody "|" alt
         ;
alt :
    | alt TOKEN
    | alt RULENAME
    ;
This gets rid of all conflicts but has changed the language! I believe the problem is that the original grammar was unambiguous and so when we tried to decide the shift/reduce conflict in favour of the shift we can no longer input some legal programs. In fact, I'm not even convinced that, in this case, the final modification does cause the grammar to perform a shift in preference to a reduction.

An LR(2) grammar that recognises parameter lists

header	: type_name id '(' params ')'
	| type_name id '(' ')'
	;
type_name: id
	| complex_type
	;
params	: param
	| params ',' param
	;
param	: type_name ids
	;
ids	: id
	| ids ',' id
	;
gives:
12: shift/reduce conflict (shift 15, reduce 7) on ','
state 12
	param : type_name ids .  (7)
	ids : ids . ',' id  (9)
The reduce rule is for "param", and this occurs on the extreme right-hand-side of other rules, so expand it:
header	: type_name id '(' params ')'
	| type_name id '(' ')'
	;
type_name: id
	| complex_type
	;
params	: type_name ids
	| params ',' type_name ids
	;
ids	: id
	| ids ',' id
	;
giving:
11: shift/reduce conflict (shift 14, reduce 5) on ','
state 11
	params : type_name ids .  (5)
	ids : ids . ',' id  (8)

17: shift/reduce conflict (shift 14, reduce 6) on ','
state 17
	params : params ',' type_name ids .  (6)
	ids : ids . ',' id  (8)
so at this point we should split "params", but all the rules for "params" cause conflicts, so the non-conflicting rule cannot be written!

Instead, at this point, I think I need to recognise that the "," in "ids" is also relevant, and merge the rule for "ids" in with that for "params":

header	: type_name id '(' params ')'
	| type_name id '(' ')'
	;
type_name: id
	| complex_type
	;
params	: type_name id
	| params ',' type_name id
	| params ',' id
	;
and this is conflict-free without any further work!

An LR(2) grammar to recognise C expressions

exp	: exp '-' sub_exp
	| sub_exp
	;
sub_exp	: '(' type_name ')' sub_exp
	| id
	| literal
	| '(' exp ')'
	;
type_name: id
	| complex_type
	;
which gives a reduce/reduce conflict:
6: reduce/reduce conflict (reduce 4, reduce 7) on ')'
state 6
	sub_exp : id .  (4)
	type_name : id .  (7)
so lets try expanding rules until we get a shift/reduce conflict, starting with "type_name":
exp	: exp '-' sub_exp
	| sub_exp
	;
sub_exp	: '(' id ')' sub_exp
	| '(' complex_type ')' sub_exp
	| id
	| literal
	| '(' exp ')'
	;
which gives:
6: shift/reduce conflict (shift 10, reduce 5) on ')'
state 6
	sub_exp : '(' id . ')' sub_exp  (3)
	sub_exp : id .  (5)
but sub_exp appears on the extreme right-hand-side of four rules, and I have no idea what to do next.

A grammar that is not LR(n)

The derivation here is informed by the methodology being discussed on this page.