Bending grammars* to your will

If you can write down the grammar you need, or you have already been given a grammar, and you can convert it into a parser using your favourite algorithm (e.g. run it through lex + yacc or javacc etc.), and everything works fine, then you have no problems.

On the other hand, it may be that:

You are trying to design a computer language, but don't really know how to start
You are finding it difficult to write a grammar for your computer language (or whatever)
You have written a grammar but you don't know what to do next
You have put your grammar through e.g. yacc and it has reported problems, and you aren't sure how to solve them, or maybe even what they mean
Yacc (or whatever tool you are using) accepts your grammar but you don't know what to do next
Yacc (or whatever) accepts your grammar but when you try to write the actions for each rule you realise that some of the information you need isn't available when you need it

This discussion concentrates on problem 4 above, but touches on problem 6.

If you have any of the other problems listed above, then the best I can do is refer you to e.g.:

BNF, Syntax Diagrams & EBNF
Example of Formalising a Grammar for use with Lex & Yacc or Lex alone
my CS2111 lectures (computer language design, sections 1, 2, 9-19)
my CS2121 lectures (notations, lexers, parsers, parse trees, dictionaries, compilers, etc.)
Parsers: Theory and Implementation
Parsing including top-down (LL) and Bottom-up (LR) parsing
more information about computer languages, parsing, grammars, and compilers

(*Technical note: when I refer to "grammars", I really mean "context-free grammars", that can be written e.g. in BNF or EBNF. Other kinds of grammars, including both more expressive kinds like context-sensitive grammars, and less expressive kinds like regular expressions, are outside the scope of this discussion.)

Ambiguous and confusing grammars, and what to do about them

Ambiguous grammars

An ambiguous grammar is one that gives more than one way of understanding a single piece of text in the language. e.g. C and Java have an ambiguity in the grammar for expressions

There are contexts in which ambiguity is useful, and in which parsers that can deal with ambiguity are necessary, but not normally for artificial languages like those used with computers - currently, the cost of dealing with ambiguity vastly outweighs any usefulness it might have.

It may be that, as for C above, the language itself is ambiguous (at the grammatical level - there is no ambiguity at the semantic level) or it may be that, as in this example, the language itself is unambigous but the person who wrote the grammar made a mistake.

For the purposes of this discussion, parsers and parser-generators cannot deal with ambiguous grammars.

Confusing grammars

A confusing grammar is what, for the purposes of this discussion, I am calling an unambiguous grammar that is still too complicated for the parsing algorithm you are trying to use. e.g. How to confuse parsers

Unfortunately, for most parsing algorithms, there will always exist grammars that confuse them. It is possible to use parsing algorithms that can handle any unambiguous grammar, but it is not currently cost-effective to do so.

For the purposes of this discussion, I am assuming that cost-effective parsers use variants on the LL(n) and LR(n) algorithms.

More notes and examples

including discussion of the kinds of error messages you may get from yacc (and similar bottom-up parser generators)

Dealing with Ambiguous and Confusing grammars

There are no general-purpose algorithms to deal with ambiguous and confusing grammars, just heuristics, and I want to investigate these heuristics.

[ An annoying feature of grammars is that many things you might sensibly want to do are undecidable i.e. there cannot be an algorithm to achieve it in all cases! For example, it is not possible to decide if a grammar or a language is ambiguous - the best you can do is try to create a parser and, if you succeed, the grammar must be unambiguous, but if you fail, maybe you just need to use a more powerful method. Similarly, there is no way to decide if a grammar can be converted to LL(1) or to LR(1) except by trying to do it - if you succeed, then it was, but if you fail that just tells you that the method you used isn't appropriate.
(Almost the only exception to this that I am aware of are results that say that, if you can create an LR(n) grammar for some language, it is also possible to create an LR(1), LALR(1) or SLR(1) grammar for it - however my understanding is that general purpose algorithms to achieve this have been proved to be impossible!) ]

I can think of four main options for dealing with ambiguous and confusing gramamrs:

Change the parser (or parser-generator), keeping the grammar the same.

LL(1) -> LL(n) -> LALR(1) -> LR(1) -> LR(n), or use modifications of these algorithms that allow extra lookahead, or limited backtracking, or use of predicates, at strategic moments during the parse. (This is not usually a sensible option if the grammar is actually ambiguous, but as mentioned above, it is possible if there really is no alternative.)
e.g. How to confuse parsers discusses a grammar that is LR(1) but not LL(n).
e.g. example grammars that are LR(1) but not LL(n), and LR(1) but not LALR(1)

Change the grammar, keeping the language the same.

Examples of converting grammars to LL(1) and to LR(1) and some heuristics for how to proceed.
(I am assuming that we are preserving the intent of the language designer, even if, strictly speaking, we may be changing the language when we change the grammar e.g. to make it unambiguous.)

Change the language.

e.g. the dangling else

"Cheat" (which usually involves modifying the lexical analyser).

distinguish similar rules by using semantic information e.g. C and Java expressions
(The details are a lot more unpleasant than my brief summary implies: see last paragraph or here or 3rd paragraph or here or here or here or here or here or here or here or here or here or here or here or here or here . . . Apparently, C++ is even worse!)
get more lookahead by recognising longer tokens e.g. recognising yacc input

There are similar problems and solutions when parsing FORTRAN, which:

has no reserved words (it has keywords like "DO", but they can also be used as variable names)
does not treat space as a separator (so "DO 1 I" and "DO1I" can both mean the same variable)
allows default declarations for undefined variables.

This makes it hard work to distinguish between e.g. the loop construct DO 1 I = 1, 10 and the floating point assignment DO 1 I = 1. 10 - even if the variable DO1I is undefined it would (by default) be automatically declared as a floating point variable. e.g. see moderator's footnote

Note that these kinds of cheats can often be expressed more succinctly in more advanced parsers, for example as "semantic predicates" - are they still cheats, or are they just making good use of the facilities provided?

I suppose one can draw a formal distinction between facilities that make it easier to do the job but don't increase the fundamental expressiveness of BNF, and those that actually increase the range of languages that can be handled. e.g. one such system I came across allowed you to use predicates to recognise AⁿBⁿCⁿ (any number of As, followed by the same number of Bs, followed by the same number of Cs) which is not possible for any context-free grammar!

However, the only practical distinction is whether these facilities cause the parsers that use them to be significantly less efficient. Even if they do, it may be that that is a price worth paying to handle more languages. The danger is that the extra price will also be paid by less experienced users handling languages or grammars that are only superficially hard to deal with, and who unknowingly end up using a sledge-hammer to crack a nut.

A related problem

Suppose you are using e.g. yacc, and your grammar has no conflicts, but when you add the actions, you find that you can't get information to flow between the grammar rules correctly:
e.g. Getting actions in the right order or
When to use a tool or technique (scroll down to the last few paragraphs of "parse trees")
Much as above, the solutions include using a "better" parser-generator, that knows how to build a parse tree automatically and walk it to calculate all your attributes (or doing this yourself by hand) or rewriting the grammar to change the order in which various rules are reduced.

Further information

Parsing Techniques

Aho and Ullman's "The Theory of Parsing, Translation and Compiling"

Sippu, S. and E. Soisalon-Soininen (1988). Parsing Theory. Vol.1: Languages and Parsing, Vol. 15 of EATCS Monographs on Theoretical Computer Science. Springer.
Sippu, S. and E. Soisalon-Soininen (1990). Parsing Theory. Vol.2: LR(k) and LL(k) Parsing, Vol. 20 of EATCS Monographs on Theoretical Computer Science. Springer.

some theory and more