UP

Converting gramamrs to LR(1)

Any language that can be recognised by a LR(n) grammar can also be recognised by an equivalent LR(1) grammar (Hopcroft and Ullman, "Introduction to Automata Theory, Languages and Computation" 1979) although the grammar may need to be rewritten to such an extent that it is greatly expanded and obfuscated. Here are some examples of grammars being rewritten to be LR(1).

A grammar to recognise yacc input

LL and LR Translator Need k; Terence Parr and Russell Quong; SIGPLAN Notices Volume 31 #2 Februrary 1996 notes, among other things, that yacc, being LR(1), is unable to parse its own input language, which is clearly LR(2) (the problem is that the terminating ";" of each rule is optional, so we may need to look for the ":" after the RULENAME of the next rule to know that we must terminate this rule):
%token RULENAME TOKEN
%start yacc
%%

yacc : rule
     | yacc rule
     ;
rule : rulebody
     | rulebody ";"
     ;
rulebody : RULENAME ':' alt
	 | rulebody "|" alt
	 ;
alt :
    | alt TOKEN
    | alt RULENAME
    ;
However, yacc can be used to recognise itself, and Appendix B of Yacc: Yet Another Compiler-Compiler; Stephen C. Johnson explains what the problem is and how to "cheat" in the lexical analyser to overcome it (simplified for clarity):
%token  IDENTIFIER      /* includes identifiers and literals */
%token  C_IDENTIFIER    /* identifier (but not literal) followed by colon */
. . .
%%
. . .
rules   :       C_IDENTIFIER  rbody  prec
        |       rules  rule
        ;
rule    :       C_IDENTIFIER  rbody  prec
        |       '|'  rbody  prec
        ;
rbody   :       /*  empty  */
        |       rbody  IDENTIFIER
        ;
prec    :       /*  empty  */
        |       PREC  IDENTIFIER
        |       prec  ';'
        ;
Here are a couple of LR(1) grammars that are equivalent to the original LR(2) grammar above:
yacc	: rulebody
	| rulebody ";"
	| rulebody yacc
	| rulebody ";" yacc
	;
rulebody: RULENAME ':'
	| rulebody TOKEN
	| rulebody RULENAME	
	| rulebody "|"
	;
and:
yacc	: head
	| head ';'
	;
head	: RULENAME ':'
	| head TOKEN
	| head RULENAME
	| head '|'
	| head ';' RULENAME ':'
	| head RULENAME ':'
	;
In both versions, the conversion to LR(1) is achieved by combining grammar rules until the parser doesn't need to perform different reductions on tokens preceding a RULENAME ":" or preceding a RULENAME.

For example, we can ensure that all instances of the ambiguous tokens (RULENAME) are in the same rule. However, that is not sufficient, as the ambiguity is that it is hard to distinguish the end of one rule from the start of the next.

In the first version, this is achieved by using tail-recursion in the rule for "yacc". In fact, in this version, the "RULENAME ':'" can be moved from inside "rulebody" to precede every instance of it in "yacc" and still leave the grammar LR(1).

In the second version, this is achieved by combining the grammar rules describing the list of rules with the grammar rules describing an individual rule.

Also note that, we have expanded the grammar from 9 rules written using 33 tokens to, well, 8 rules and 28 tokens or 8 rules and 30 tokens.

derivation and derivation

A grammar to recognise parameter lists

From my CS2121 lectures. The LR(2) grammar:
header  : type_name id '(' params ')'
        | type_name id '(' ')'
        ;
type_name: id
	| more_complex_type_descriptions
	;

params  : param
        | params ',' param
        ;
param   : type_name ids
        ;
ids     : id
        | ids ',' id
        ;
can be used to recognise headers like:
void fred (int a, int b, float x, float z)
or like:
void fred (int a, b, float x, z)
The problem is that after a "ids ," the next symbols can be "a b" (a is a type_name, b is a parameter name of type a) or "a ," or "a )" (a is a parameter name of the current type), but an LR(1) parser can't see far enough ahead to decide whether the "," is part of a "params" (in which case the preceding "ids" must be reduced to a "param"), or part of a bigger "ids".

The rules for "params", "param" and "ids" can be rewritten as the LR(1) grammar rule:

params  : type_name id
        | params ',' type_name id
        | params ',' id
        ;
This solution is similar to that for the previous example, in that we are combining two recursive rules into one, except that there is no need to use tail recursion. The important point is that the two different uses of "," now appear side-by-side in the one grammar rule, followed by sufficent context ("type_name id" or just "id") that an LR(1) parser can pick the correct action.

derivation and derivation

A grammar to recognise C expressions

There is an example of an ambiguous grammar in my CS2121 lectures. However, if we make it unambiguous by omitting the rule "sub_exp: '-' sub_exp", it is LR(2).
exp	: exp '-' sub_exp
	| sub_exp
	;
sub_exp	: '(' type_name ')' sub_exp
	| id
	| literal
	| '(' exp ')'
	;
type_name: id
	| more_complex_type_descriptions
	;
The problem is that the first "id" in "( id ) id" is a "type_name", but in "( id ) - id" it is an "exp", and the two must be reduced differently when the ")" is seen but before the "-" or second "id" has been seen by an LR(1) parser.

This can be rewritten as the LR(1) grammar:

exp	: id
	| exp2
	;
exp2	: exp '-' sub_exp
	| sub_exp2
	;
sub_exp	: id
	| sub_exp2
	;
sub_exp2: '(' id ')' sub_exp
	| '(' more_complex_type_descriptions ')' sub_exp
	| literal
	| '(' id ')'
	| '(' exp2 ')'
	;
The trick here is to spot that it is only one expansion of "exp" (to "id") that causes problems, and to pick it out as a special case in "sub_exp" and then in "exp", and then substitue it (and the rule for "type_name") explicitly back into the rule for "sub_exp2" so that an "id" does not need to be reduced immediately at a ")". Once I had spotted this, the most difficult part was convincing myself that the grammar still recognised the same language!

derivation and derivation

A grammar that is not LR(n)

Chris Clark suggested:
obj_id	: id
	| obj_id "." id
	;
expr	: obj_id
	| simple_func
	| obj_func
	;
simple_func: id "(" asgn_list ")"
	;
asgn_list: (obj_id "=" expr ";")+
	;
obj_func: obj_id "(" expr_list? ")"
	;
expr_list: expr ("," expr)*
	;
which is not LR(n) for any finite n!

The problem is how to parse "id ( . . ." - is the id really an obj_id ? The parser can't know until it sees an "=" (or not) inside the argument list, which can only follow an obj_id, which can be of any length.

Here is an LR(1) version, in BNF:

expr    : id
	| obj_id "." id
	| simple_func
	| obj_func
	;
obj_id	: id
	| obj_id "." id
	;
simple_func: id "(" asgn_list ")"
	;
asgn_list: asgn
	| asgn_list asgn
	;
asgn	: id            asgn_tail
	| obj_id "." id asgn_tail
	;
asgn_tail: "=" expr ";"
	;
obj_func: id            obj_func_tail
	| obj_id "." id obj_func_tail
	;
obj_func_tail: "(" ")"
	| "(" expr_list ")"
	;
expr_list: expr
	| expr_list "," expr
	;

derivation and derivation

An ambiguous grammar - the dangling else

Most programming languages include syntax for two different kinds of "if" statement, like this for C:
stat	: IF '(' exp ')' stat
	| IF '(' exp ')' stat ELSE stat
	| FOR '(' exp ';' exp ';' exp ')' stat
	| WHILE '(' exp ')' stat
	| . . .
	| exp
	;
but this is ambiguous - which of the following is meant by if (a) if (b) c else d?
if (a)				if (a)
  if (b)			  if (b)
    c				    c
  else				  /* end of if(b) */
    d				else
  /* end of if(b) */		  d
/* end of if(a)	*/		/* end of if(a) */
(actually, the left-most of these is usually "the answer", as in C)

The following grammar is LR(1):

stat      : matched
          | unmatched
          ;
unmatched : IF '(' exp ')' stat
          | IF '(' exp ')' matched ELSE unmatched
          | FOR '(' exp ';' exp ';' exp ')' unmatched
          | WHILE '(' exp ')' unmatched
          | . . .
          ;
matched   : IF '(' exp ')' matched ELSE matched
          | FOR '(' exp ';' exp ';' exp ')' matched
          | WHILE '(' exp ')' matched
          | . . .
          | exp
          ;
derivation and derivation

An ambiguous grammar for an HTML-like language

original grammar:
%token PARA PARA_END BOLD BOLD_END TEXTBLOCK
%%
text	:
	| text textElement
	;
textElement: inlineElement
	| paragraph
	;
inlineText:
	| inlineText inlineElement
	;
inlineElement: TEXTBLOCK
	| BOLD inlineText BOLD_END
	;
paragraph: PARA inlineText optParaEnd
	;
optParaEnd:
	| PARA_END
	;
"The grammar does not specify how to group a sequence of inlineElement. For example:
<P> textblock1 textblock2
when parsed at text can either resolve to
(PARA textblock1) textblock2 or
(PARA textblock1 textblock2)
The latter is correct, that is the shift is correct behavior."

The following grammar is LR(1):

text	: inlineText
	| text PARA inlineText
	| text PARA inlineText PARA_END inlineText
	;
inlineText:
	| inlineText inlineElement
	;
inlineElement: TEXTBLOCK
	| BOLD inlineText BOLD_END
	;

derivation and derivation

An ambiguous grammar for errors in list separators

This grammar is intended to allow for simple typing mistakes in a list of identifiers separated by commas optional commas
list    : id
        | list optcomma id
        | list ',' /* extra comma */
        ;
optcomma: ','
        | /* missing comma */
        ;
The problem is that an input like a , b can be recognised as: ((a) (,) b)
or as: (((a) , /*extra comma*/) (/*missing comma*/) b)
- the former is what is intended.

Either of the following grammars solves the problem:

list	: sublist nocomma
	;
sublist	: id
	| sublist comma id
	;
comma	: /* missing comma */
	| commas
	;
nocomma	:
	| commas /* extra comma */
	;
commas	: ','
	| commas ',' /* extra comma */
	;
or:
list	: list ',' /* extra comma */
	| list_id
	;
list_id	:          id
        | list ',' id
        | list_id  id /* missing comma */
        ;

derivation and derivation

An ambiguous grammar for declarations and expressions

Chris Dodd suggested the following (deliberately?) nasty grammar:
input	: /* empty */
	| input decl
	| input exprlist
	;
decl	: TYPE declarator
	;
declarator: ID
	| '(' declarator ')'
	;
expr	: ID
	| TYPE '(' exprlist ')'
	;
exprlist: expr
	| exprlist OP expr
	;
The problem is that an input of TYPE ( ID ) matches both decl and expr, and there is no obvious reason to prefer one derivation over the other.

This can be rewritten to explicitly expose the ambiguity, which then has to be resolved somehow (e.g. by forcing it to be recognised as a decl rather than an expr):

input	:
	/* input decl has been expanded to: */
	| input TYPE ID
	| input TYPE '(' ID ')' 		/* OOPS! */
	| input TYPE '(' '(' declarator ')' ')'
	/* input exprlist has been expanded to: */
	| input ID tail
	| input TYPE '(' ID ')' 		/* OOPS! */
	| input TYPE '(' ID ')' OP exprlist
	| input TYPE '(' ID OP exprlist ')' tail
	| input TYPE '(' TYPE '(' exprlist ')' tail ')' tail
	;
tail	:
	| OP exprlist
	;
/* these rules are unchanged */
declarator: ID
	| '(' declarator ')'
	;
expr	: ID
	| TYPE '(' exprlist ')'
	;
exprlist: expr
	| exprlist OP expr
	;

derivation