%token RULENAME TOKEN %start yacc %% yacc : rule | yacc rule ; rule : rulebody | rulebody ";" ; rulebody : RULENAME ':' alt | rulebody "|" alt ; alt : | alt TOKEN | alt RULENAME ;However, yacc can be used to recognise itself, and Appendix B of Yacc: Yet Another Compiler-Compiler; Stephen C. Johnson explains what the problem is and how to "cheat" in the lexical analyser to overcome it (simplified for clarity):
%token IDENTIFIER /* includes identifiers and literals */ %token C_IDENTIFIER /* identifier (but not literal) followed by colon */ . . . %% . . . rules : C_IDENTIFIER rbody prec | rules rule ; rule : C_IDENTIFIER rbody prec | '|' rbody prec ; rbody : /* empty */ | rbody IDENTIFIER ; prec : /* empty */ | PREC IDENTIFIER | prec ';' ;Here are a couple of LR(1) grammars that are equivalent to the original LR(2) grammar above:
yacc : rulebody | rulebody ";" | rulebody yacc | rulebody ";" yacc ; rulebody: RULENAME ':' | rulebody TOKEN | rulebody RULENAME | rulebody "|" ;and:
yacc : head | head ';' ; head : RULENAME ':' | head TOKEN | head RULENAME | head '|' | head ';' RULENAME ':' | head RULENAME ':' ;In both versions, the conversion to LR(1) is achieved by combining grammar rules until the parser doesn't need to perform different reductions on tokens preceding a RULENAME ":" or preceding a RULENAME.
For example, we can ensure that all instances of the ambiguous tokens (RULENAME) are in the same rule. However, that is not sufficient, as the ambiguity is that it is hard to distinguish the end of one rule from the start of the next.
In the first version, this is achieved by using tail-recursion in the rule for "yacc". In fact, in this version, the "RULENAME ':'" can be moved from inside "rulebody" to precede every instance of it in "yacc" and still leave the grammar LR(1).
In the second version, this is achieved by combining the grammar rules describing the list of rules with the grammar rules describing an individual rule.
Also note that, we have expanded the grammar from 9 rules written using 33 tokens to, well, 8 rules and 28 tokens or 8 rules and 30 tokens.
derivation and derivation
header : type_name id '(' params ')' | type_name id '(' ')' ; type_name: id | more_complex_type_descriptions ; params : param | params ',' param ; param : type_name ids ; ids : id | ids ',' id ;can be used to recognise headers like:
void fred (int a, int b, float x, float z)or like:
void fred (int a, b, float x, z)The problem is that after a "ids ," the next symbols can be "a b" (a is a type_name, b is a parameter name of type a) or "a ," or "a )" (a is a parameter name of the current type), but an LR(1) parser can't see far enough ahead to decide whether the "," is part of a "params" (in which case the preceding "ids" must be reduced to a "param"), or part of a bigger "ids".
The rules for "params", "param" and "ids" can be rewritten as the LR(1) grammar rule:
params : type_name id | params ',' type_name id | params ',' id ;This solution is similar to that for the previous example, in that we are combining two recursive rules into one, except that there is no need to use tail recursion. The important point is that the two different uses of "," now appear side-by-side in the one grammar rule, followed by sufficent context ("type_name id" or just "id") that an LR(1) parser can pick the correct action.
derivation and derivation
sub_exp: '-' sub_exp
", it is LR(2).
exp : exp '-' sub_exp | sub_exp ; sub_exp : '(' type_name ')' sub_exp | id | literal | '(' exp ')' ; type_name: id | more_complex_type_descriptions ;The problem is that the first "id" in "
( id ) id
" is a
"type_name", but in "( id ) - id
" it is an "exp", and the two
must be reduced differently when the ")" is seen but before the "-" or second
"id" has been seen by an LR(1) parser.
This can be rewritten as the LR(1) grammar:
exp : id | exp2 ; exp2 : exp '-' sub_exp | sub_exp2 ; sub_exp : id | sub_exp2 ; sub_exp2: '(' id ')' sub_exp | '(' more_complex_type_descriptions ')' sub_exp | literal | '(' id ')' | '(' exp2 ')' ;The trick here is to spot that it is only one expansion of "exp" (to "id") that causes problems, and to pick it out as a special case in "sub_exp" and then in "exp", and then substitue it (and the rule for "type_name") explicitly back into the rule for "sub_exp2" so that an "id" does not need to be reduced immediately at a ")". Once I had spotted this, the most difficult part was convincing myself that the grammar still recognised the same language!
derivation and derivation
obj_id : id | obj_id "." id ; expr : obj_id | simple_func | obj_func ; simple_func: id "(" asgn_list ")" ; asgn_list: (obj_id "=" expr ";")+ ; obj_func: obj_id "(" expr_list? ")" ; expr_list: expr ("," expr)* ;which is not LR(n) for any finite n!
The problem is how to parse "id ( . . ." - is the id really an obj_id ? The parser can't know until it sees an "=" (or not) inside the argument list, which can only follow an obj_id, which can be of any length.
Here is an LR(1) version, in BNF:
expr : id | obj_id "." id | simple_func | obj_func ; obj_id : id | obj_id "." id ; simple_func: id "(" asgn_list ")" ; asgn_list: asgn | asgn_list asgn ; asgn : id asgn_tail | obj_id "." id asgn_tail ; asgn_tail: "=" expr ";" ; obj_func: id obj_func_tail | obj_id "." id obj_func_tail ; obj_func_tail: "(" ")" | "(" expr_list ")" ; expr_list: expr | expr_list "," expr ;
derivation and derivation
stat : IF '(' exp ')' stat | IF '(' exp ')' stat ELSE stat | FOR '(' exp ';' exp ';' exp ')' stat | WHILE '(' exp ')' stat | . . . | exp ;but this is ambiguous - which of the following is meant by
if (a) if (b) c else d
?
if (a) if (a) if (b) if (b) c c else /* end of if(b) */ d else /* end of if(b) */ d /* end of if(a) */ /* end of if(a) */(actually, the left-most of these is usually "the answer", as in C)
The following grammar is LR(1):
stat : matched | unmatched ; unmatched : IF '(' exp ')' stat | IF '(' exp ')' matched ELSE unmatched | FOR '(' exp ';' exp ';' exp ')' unmatched | WHILE '(' exp ')' unmatched | . . . ; matched : IF '(' exp ')' matched ELSE matched | FOR '(' exp ';' exp ';' exp ')' matched | WHILE '(' exp ')' matched | . . . | exp ;derivation and derivation
%token PARA PARA_END BOLD BOLD_END TEXTBLOCK %% text : | text textElement ; textElement: inlineElement | paragraph ; inlineText: | inlineText inlineElement ; inlineElement: TEXTBLOCK | BOLD inlineText BOLD_END ; paragraph: PARA inlineText optParaEnd ; optParaEnd: | PARA_END ;"The grammar does not specify how to group a sequence of inlineElement. For example:
<P> textblock1 textblock2
(PARA textblock1) textblock2
or
(PARA textblock1 textblock2)
The following grammar is LR(1):
text : inlineText | text PARA inlineText | text PARA inlineText PARA_END inlineText ; inlineText: | inlineText inlineElement ; inlineElement: TEXTBLOCK | BOLD inlineText BOLD_END ;
derivation and derivation
list : id | list optcomma id | list ',' /* extra comma */ ; optcomma: ',' | /* missing comma */ ;The problem is that an input like
a , b
can be recognised as: ((a) (,) b)
(((a) , /*extra comma*/) (/*missing comma*/) b)
Either of the following grammars solves the problem:
list : sublist nocomma ; sublist : id | sublist comma id ; comma : /* missing comma */ | commas ; nocomma : | commas /* extra comma */ ; commas : ',' | commas ',' /* extra comma */ ;or:
list : list ',' /* extra comma */ | list_id ; list_id : id | list ',' id | list_id id /* missing comma */ ;
derivation and derivation
input : /* empty */ | input decl | input exprlist ; decl : TYPE declarator ; declarator: ID | '(' declarator ')' ; expr : ID | TYPE '(' exprlist ')' ; exprlist: expr | exprlist OP expr ;The problem is that an input of
TYPE ( ID )
matches both decl
and expr
,
and there is no obvious reason to prefer one derivation over the other.
This can be rewritten to explicitly expose the ambiguity,
which then has to be resolved somehow (e.g. by forcing it to be recognised as
a decl
rather than an expr
):
input : /* input decl has been expanded to: */ | input TYPE ID | input TYPE '(' ID ')' /* OOPS! */ | input TYPE '(' '(' declarator ')' ')' /* input exprlist has been expanded to: */ | input ID tail | input TYPE '(' ID ')' /* OOPS! */ | input TYPE '(' ID ')' OP exprlist | input TYPE '(' ID OP exprlist ')' tail | input TYPE '(' TYPE '(' exprlist ')' tail ')' tail ; tail : | OP exprlist ; /* these rules are unchanged */ declarator: ID | '(' declarator ')' ; expr : ID | TYPE '(' exprlist ')' ; exprlist: expr | exprlist OP expr ;