How to convert grammars to LR(1)

To help to remove unecessary details from the BNF (such as exactly what form of recursion we will use to represent a list, unless it actually matters) I am going to use my version of EBNF and in particular # to indicate a list separator.

Examples

An LR(2) grammar to recognise yacc input
An LR(2) grammar to recognise parameter lists
An LR(2) grammar to recognise C expressions
A grammar that is not LR(n)
An ambiguous grammar - the dangling else
An ambiguous grammar for an HTML-like language
An ambiguous grammar for errors in list separators
An ambiguous grammar for declarations and expressions

Notes

A grammar to recognise yacc input

yacc	: rule
	| yacc rule
	;
rule	: rulebody
	| rulebody ";"
	;
rulebody: RULENAME ':' alt
	| rulebody "|" alt
	;
alt	:
	| alt TOKEN
	| alt RULENAME
	;

-> [EBNF]

yacc	: rule+ ;
rule	: rulebody ";"? ;
rulebody: RULENAME ':' (alt # "|")+ ;
alt	: (TOKEN | RULENAME)* ;

-> [substitute rulebody and alt into rule]

yacc	: rule+ ;
rule	: RULENAME ':' ((TOKEN | RULENAME)* # "|")+ ";"? ;

-> [4.1: A=RULENAME ':'; B=; C=(TOKEN|RULENAME); D="|"; E=";"?]

yacc	: rule+ ;
rule	: RULENAME ':' (TOKEN | RULENAME | "|")* ";"? ;

-> [substitute rule into yacc]

yacc	: (RULENAME ':' (TOKEN | RULENAME | "|")* ";"? )+ ;

-> [4.2: A=; B=RULENAME ':'; C=TOKEN|RULENAME|"|"; D=";"?; E=;]

yacc	: RULENAME ':' (TOKEN | RULENAME | "|" | ";"? RULENAME ':')* ";"? ;

Now, we need to convert back into BNF, and there are several possibilities:

yacc	: head ";"? ;
head	: RULENAME ':' (TOKEN | RULENAME | "|" | ";"? RULENAME ':')* ;

yacc	: head ";"? ;
head	: RULENAME ':'
	| head (TOKEN | RULENAME | "|" | ";"? RULENAME ':')
	;

yacc	: head ";"
	| head
	;
head	: RULENAME ':'
	| head TOKEN
	| head RULENAME
	| head "|"
	| head RULENAME ':'
	| head ";" RULENAME ':'
	;

yacc	: RULENAME ':' middle ";"? ;
middle	: (TOKEN | RULENAME | "|" | ";"? RULENAME ':')* ;

yacc	: RULENAME ':' middle
	| RULENAME ':' middle ";"
	;
middle	:
	| middle TOKEN
	| middle RULENAME
	| middle "|"
	| middle RULENAME ':'
	| middle ";" RULENAME ':'
	;

which is also LR(1)

yacc	: RULENAME ':' tail ;
tail	: (TOKEN | RULENAME | "|" | ";"? RULENAME ':')* ";"? ;

yacc	: RULENAME ':' tail
	;
tail	: TOKEN tail
	| RULENAME tail
	| "|" tail
	| RULENAME ':' tail
	| ";" RULENAME ':' tail
	| ";"
	|
	;

which is also LR(1).

Alternatively, we can do something different at an earlier step:

yacc	: rule+ ;
rule	: RULENAME ':' ((TOKEN | RULENAME)* # "|")+ ";"? ;

yacc	: (rule2 ";"?)+ ;
rule2	: RULENAME ':' ((TOKEN | RULENAME)* # "|")+ ;

-> [4.1: A=RULENAME ':'; B=; C=(TOKEN|RULENAME); D="|"; E=;]

yacc	: (rule2 ";"?)+ ;
rule2	: RULENAME ':' (TOKEN | RULENAME | "|")* ;

-> back to BNF

yacc	: (rule2 ";"?);
rule2	: RULENAME ':'
	| rule2 TOKEN
	| rule2 RULENAME
	| rule2 "|"
	;

and if we use left-recursion for yacc this is still LR(2):

yacc	: rule2 ";"
	| rule2
	| yacc rule2 ";"
	| yacc rule2
	;
rule2	: RULENAME ':'
	| rule2 TOKEN
	| rule2 RULENAME
	| rule2 "|"
	;

but if we use right recursion it is LR(1):

yacc	: rule2 ";"
	| rule2
	| rule2 ";" yacc
	| rule2 yacc
	;
rule2	: RULENAME ':'
	| rule2 TOKEN
	| rule2 RULENAME
	| rule2 "|"
	;

A grammar to recognise parameter lists

header  : type_name id '(' params ')'
        | type_name id '(' ')'
        ;
type_name: id
	| more_complex_type_descriptions
	;
params  : param
        | params ',' param
        ;
param   : type_name ids
        ;
ids     : id
        | ids ',' id
        ;

Looking at the conflict reported by yacc:

12: shift/reduce conflict (shift 15, reduce 7) on ','
state 12
	param : type_name ids .  (7)
	ids : ids . ',' id  (9)

the problem relates to what happens when we see a ',' near or within a "param", so we will concentrate just on the rules for "params", "param" and "ids".
-> [EBNF]

params  : (param # ',')+ ;
param   : type_name ids ;
ids     : (id # ',')+ ;

-> [merge rules]

params  : (type_name (id # ',')+ # ',')+ ;

-> [4.5: A=; B=type_name; C=id; D=','; F=','; G=;]

params  : type_name id (',' typename id | ',' id)* ;

-> [BNF]

header  : type_name id '(' params ')'
        | type_name id '(' ')'
        ;
type_name: id
	| more_complex_type_descriptions
	;
params  : type_name id
	| params ',' typename id
	| params ',' id
	;

A grammar to recognise C expressions

exp	: exp '-' sub_exp
	| sub_exp
	;
sub_exp	: '(' type_name ')' sub_exp
	| id
	| literal
	| '(' exp ')'
	;
type_name: id
	| complex_type
	;

This is the conflict report from yacc:

6: reduce/reduce conflict (reduce 4, reduce 7) on ')'
state 6
	sub_exp : id .  (4)
	type_name : id .  (7)

which doesn't help much. However, maybe it encorages us to expand "type_name" within "sub_exp", to see more context:

exp	: exp '-' sub_exp
	| sub_exp
	;
sub_exp	: '(' id ')' sub_exp
	| '(' complex_type ')' sub_exp
	| id
	| literal
	| '(' exp ')'
	;

giving:

6: shift/reduce conflict (shift 10, reduce 5) on ')'
state 6
	sub_exp : '(' id . ')' sub_exp  (3)
	sub_exp : id .  (5)

which is slightly more informative.
We want to bring out enough right context around a bracketed id that yacc can distinguish between its use as a value in e.g. ( a ) + b
and its use as a cast in e.g. ( a ) b
i.e. we need to make explicit the fact that an exp can be an id in sub_exp : '(' exp ')'
so we want to develop a rule of the form exp : id | . . .
and then substitute it into sub_exp : '(' exp ')'

I can't see how using EBNF will help here e.g.

exp      : (sub_exp # '-')<+ ;
sub_exp  : ('(' (id | complex_type) ')')*> ( id | literal | '(' exp ')' ) ;

Instead, we will stick with the BNF, and gradually expose the ids that represent values within an exp:

exp	: exp '-' sub_exp | sub_exp ;
sub_exp	: '(' id ')' sub_exp
	| '(' complex_type ')' sub_exp
	| id
	| literal
	| '(' exp ')'
	;

-> [exposing id as a special case of sub_exp]

exp	: exp '-' sub_exp | sub_exp ;
sub_exp	: id | sub_exp2 ;
sub_exp2: '(' id ')' sub_exp
	| '(' complex_type ')' sub_exp
	| literal
	| '(' exp ')'
	;

-> [substituting into exp]

exp	: exp '-' sub_exp | id | sub_exp2 ;
sub_exp	: id | sub_exp2 ;
sub_exp2: '(' id ')' sub_exp
	| '(' complex_type ')' sub_exp
	| literal
	| '(' exp ')'
	;

-> [exposing id as a special case of exp]

exp	: id | exp2 ;
exp2	: exp '-' sub_exp | sub_exp2 ;
sub_exp	: id | sub_exp2 ;
sub_exp2: '(' id ')' sub_exp
	| '(' complex_type ')' sub_exp
	| literal
	| '(' exp ')'
	;

-> [substituting back into sub_exp2]

exp	: id | exp2 ;
exp2	: exp '-' sub_exp | sub_exp2 ;
sub_exp	: id | sub_exp2 ;
sub_exp2: '(' id ')' sub_exp
	| '(' complex_type ')' sub_exp
	| literal
	| '(' id ')'
	| '(' exp2 ')'
	;

A grammar that is not LR(n)

The problem is how to parse "id ( . . ." - is the id really an obj_id ? The parser can't know until it sees an "=" (or not), after an obj_id, which may have any number of symbols in it.

obj_id	: id
	| obj_id "." id
	;
expr	: obj_id
	| simple_func
	| obj_func
	;
simple_func: id "(" asgn_list ")"
	;
asgn_list: (obj_id "=" expr ";")+
	;
obj_func: obj_id "(" expr_list? ")"
	;
expr_list: expr ("," expr)*
	;

Converting fully to EBNF may help in understanding the grammar:

expr	   : obj_id
	   | obj_id "(" (           expr # ",")* ")"
	   | id     "(" (obj_id "=" expr   ";")+ ")"
	   ;
obj_id	   : (id # ".")+ ;

but doesn't really help find an equivalent LR(1) grammar.

After converting to BNF:

expr    : obj_id
        | simple_func
        | obj_func
        ;
obj_id  : id
        | obj_id "." id
        ;
simple_func: id "(" asgn_list ")"
        ;
asgn_list: asgn
	| asgn_list asgn
	;
asgn	: obj_id "=" expr ";"
        ;
obj_func: obj_id "(" opt_expr_list ")"
        ;
opt_expr_list:
	| expr_list
	;
expr_list: expr
	| expr_list "," expr
        ;

yacc reports the following:

1: shift/reduce conflict (shift 6, reduce 4) on '('
state 1
	obj_id : id .  (4)
	simple_func : id . '(' asgn_list ')'  (6)

so we need to expand "opt_id" to expose more right-context:

expr    : id
	| obj_id "." id
        | simple_func
        | obj_func
        ;
obj_id  : id
	| obj_id "." id
        ;
simple_func: id "(" asgn_list ")"
        ;
asgn_list: asgn
	| asgn_list asgn
	;
asgn	: id            "=" expr ";"
	| obj_id "." id "=" expr ";"
        ;
obj_func: id            "(" opt_expr_list ")"
	| obj_id "." id "(" opt_expr_list ")"
        ;
opt_expr_list:
	| expr_list
	;
expr_list: expr
	| expr_list "," expr
        ;

which is conflict-free. Rewriting slightly to simplify further use:

expr    : id
	| obj_id "." id
	| simple_func
	| obj_func
	;
obj_id	: id
	| obj_id "." id
	;
simple_func: id "(" asgn_list ")"
	;
asgn_list: asgn
	| asgn_list asgn
	;
asgn	: id            asgn_tail
	| obj_id "." id asgn_tail
	;
asgn_tail: "=" expr ";"
	;
obj_func: id            obj_func_tail
	| obj_id "." id obj_func_tail
	;
obj_func_tail: "(" ")"
	| "(" expr_list ")"
	;
expr_list: expr
	| expr_list "," expr
	;

An ambiguous grammar - the dangling else

I have no idea how to derive this in steps from the original! However, there is an excellent discussion at Resolving the General Dangling Else/If-Else Ambiguity and there is the beginnings of an algorithm here.

An ambiguous grammar for an HTML-like language

original grammar:

%token PARA PARA_END BOLD BOLD_END TEXTBLOCK
%%
text	:
	| text textElement
	;
textElement: inlineElement
	| paragraph
	;
inlineText:
	| inlineText inlineElement
	;
inlineElement: TEXTBLOCK
	| BOLD inlineText BOLD_END
	;
paragraph: PARA inlineText optParaEnd
	;
optParaEnd:
	| PARA_END
	;

-> EBNF

text	: ( inlineElement | PARA inlineElement* PARA_END? )*
	;
inlineElement: TEXTBLOCK | BOLD inlineElement* BOLD_END
	;

-> [ (a | b)* -> (a* # b)* ]

text	: ( inlineElement* # PARA inlineElement* PARA_END? )*
	;
inlineElement: TEXTBLOCK | BOLD inlineElement* BOLD_END
	;

-> [ (a # b)* -> (a # b)+ if a can become epsilon e.g. a=c? or a=c* ]

text	: ( inlineElement* # PARA inlineElement* PARA_END? )+
	;
inlineElement: TEXTBLOCK | BOLD inlineElement* BOLD_END
	;

-> [expand PARA_END?]

text	: ( inlineElement* # ( PARA inlineElement* | PARA inlineElement* PARA_END ) )+
	;
inlineElement: TEXTBLOCK | BOLD inlineElement* BOLD_END
	;

-> [ (a* # (b a* | c) )+ -> (a* # (b|c) )+ ]

text	: ( inlineElement* # ( PARA | PARA inlineElement* PARA_END ) )+
	;
inlineElement: TEXTBLOCK | BOLD inlineElement* BOLD_END
	;

-> BNF

text	: inlineText
	| text PARA inlineText
	| text PARA inlineText PARA_END inlineText
	;
inlineText:
	| inlineText inlineElement
	;
inlineElement: TEXTBLOCK
	| BOLD inlineText BOLD_END
	;

I actually worked out the sequence of manipulations above from the answer, which I got to by remembering what I did for optional commas.

Alternatively, from

text	: ( inlineElement* # PARA inlineElement* PARA_END? )+
	;
inlineElement: TEXTBLOCK | BOLD inlineElement* BOLD_END
	;

->BNF

text	: inlineText
	| text PARA inlineText PARA_END inlineText
	| text PARA inlineText          inlineText
	;
inlineText:
	| inlineText inlineElement
	;
inlineElement: TEXTBLOCK | BOLD inlineText BOLD_END
	;

-> [explicitly remove the ambiguity]

text	: inlineText
	| text PARA inlineText PARA_END inlineText
	| text PARA inlineText
	;
inlineText:
	| inlineText inlineElement
	;
inlineElement: TEXTBLOCK | BOLD inlineText BOLD_END
	;

An ambiguous grammar for errors in list separators

list    : id
        | list optcomma id
        | list ',' /* extra comma */
        ;
optcomma: ','
        | /* missing comma */
        ;

-> EBNF

list	: id ( (',' | /* missing comma */) id | ',' /* extra comma */ )*
	;

-> [expand]

list	: id (',' id | /* missing comma */ id | ',' /* extra comma */ )*
	;

-> [ a ( b a | a | b )* -> a ( b | a )* or (a # b*)+ b* ]
-> [ a ( b a | M a | b E )* -> ( a # (M|b(bE)*) )+ (bE)* ]

list	: ( id # comma )* nocomma
	;
comma	: /* missing comma */
	| ',' nocomma
	;
nocomma	:
	| (',' /* extra comma */)+
	;

-> BNF

list	: sublist nocomma
	;
sublist	: id
	| sublist comma id
	;
comma	: /* missing comma */
	| ',' nocomma
	;
nocomma	:
	| nocomma ',' /* extra comma */
	;

but this still has a conflict (between the nocomma after the sublist in list and the comma after a sublist in sublist) so try again:

list	: sublist nocomma
	;
sublist	: id
	| sublist comma id
	;
comma	: /* missing comma */
	| commas
	;
nocomma	:
	| commas /* extra comma */
	;
commas	: ','
	| commas ',' /* extra comma */
	;

An ambiguous grammar for declarations and expressions

Chris Dodd suggested the following (deliberately?) nasty grammar:

input	: /* empty */
	| input decl
	| input exprlist
	;
decl	: TYPE declarator
	;
declarator: ID
	| '(' declarator ')'
	;
expr	: ID
	| TYPE '(' exprlist ')'
	;
exprlist: expr
	| exprlist OP expr
	;

The problem is that an input of TYPE ( ID ) matches both decl and expr, and there is no obvious reason to prefer one derivation over the other.

The only thing I can see to do is to explicity expose the ambiguity, by expanding rules until it is visible e.g.:

input	:
	/* input decl has been expanded to: */
	| input TYPE declarator
	/* input exprlist has been expanded to: */
	| input expr
	| input exprlist OP expr
	;
declarator: ID
	| '(' declarator ')'
	;
expr	: ID
	| TYPE '(' exprlist ')'
	;
exprlist: expr
	| exprlist OP expr
	;

However, this can't succeed as exprlist is left-recursive, so each expansion of it will leave another exprlist on the left. As with converting grammars for LL parsing, we need to get rid of the left recursion (e.g. replacing it by right recursion, and left-factoring to reduce the amount of work later):

input	:
	/* input decl has been expanded to: */
	| input TYPE declarator
	/* input exprlist has been expanded to: */
	| input expr tail
	;
declarator: ID
	| '(' declarator ')'
	;
expr	: ID
	| TYPE '(' exprlist ')'
	;
exprlist: expr
	| exprlist OP expr
	;
tail	:
	| OP exprlist
	;

now we can expand some more:

input	:
	/* input decl has been expanded to: */
	| input TYPE ID
	| input TYPE '(' declarator ')'
	/* input exprlist has been expanded to: */
	| input ID tail
	| input TYPE '(' exprlist ')' tail
	;
declarator: ID
	| '(' declarator ')'
	;
expr	: ID
	| TYPE '(' exprlist ')'
	;
exprlist: expr
	| exprlist OP expr
	;
tail	:
	| OP exprlist
	;

and some more:

input	:
	/* input decl has been expanded to: */
	| input TYPE ID
	| input TYPE '(' ID ')'
	| input TYPE '(' '(' declarator ')' ')'
	/* input exprlist has been expanded to: */
	| input ID tail
	| input TYPE '(' expr tail ')' tail
	;
declarator: ID
	| '(' declarator ')'
	;
expr	: ID
	| TYPE '(' exprlist ')'
	;
exprlist: expr
	| exprlist OP expr
	;
tail	:
	| OP exprlist
	;

and some more:

input	:
	/* input decl has been expanded to: */
	| input TYPE ID
	| input TYPE '(' ID ')'
	| input TYPE '(' '(' declarator ')' ')'
	/* input exprlist has been expanded to: */
	| input ID tail
	| input TYPE '(' ID tail ')' tail
	| input TYPE '(' TYPE '(' exprlist ')' tail ')' tail
	;
declarator: ID
	| '(' declarator ')'
	;
expr	: ID
	| TYPE '(' exprlist ')'
	;
exprlist: expr
	| exprlist OP expr
	;
tail	:
	| OP exprlist
	;

and some more:

input	:
	/* input decl has been expanded to: */
	| input TYPE ID
	| input TYPE '(' ID ')'
	| input TYPE '(' '(' declarator ')' ')'
	/* input exprlist has been expanded to: */
	| input ID tail
	| input TYPE '(' ID ')' tail
	| input TYPE '(' ID OP exprlist ')' tail
	| input TYPE '(' TYPE '(' exprlist ')' tail ')' tail
	;
declarator: ID
	| '(' declarator ')'
	;
expr	: ID
	| TYPE '(' exprlist ')'
	;
exprlist: expr
	| exprlist OP expr
	;
tail	:
	| OP exprlist
	;

and some more, and now the ambiguity is visible:

input	:
	/* input decl has been expanded to: */
	| input TYPE ID
	| input TYPE '(' ID ')'			/* OOPS! */
	| input TYPE '(' '(' declarator ')' ')'
	/* input exprlist has been expanded to: */
	| input ID tail
	| input TYPE '(' ID ')'			/* OOPS! */
	| input TYPE '(' ID ')' OP exprlist
	| input TYPE '(' ID OP exprlist ')' tail
	| input TYPE '(' TYPE '(' exprlist ')' tail ')' tail
	;
declarator: ID
	| '(' declarator ')'
	;
expr	: ID
	| TYPE '(' exprlist ')'
	;
exprlist: expr
	| exprlist OP expr
	;
tail	:
	| OP exprlist
	;

The ambiguity then has to be resolved somehow (e.g. by forcing it to be recognised as a decl rather than an expr).