UP
Bending grammars* to your will
If you can write down the grammar you need, or you have already been given a
grammar, and you can convert it into a parser using your favourite algorithm
(e.g. run it through lex + yacc or javacc etc.), and everything works fine,
then you have no problems.
On the other hand, it may be that:
- You are trying to design a computer language, but don't really know how
to start
- You are finding it difficult to write a grammar for your computer
language (or whatever)
- You have written a grammar but you don't know what to do next
- You have put your grammar through e.g. yacc and it has reported
problems, and you aren't sure how to solve them, or maybe
even what they mean
- Yacc (or whatever tool you are using) accepts your grammar but you don't
know what to do next
- Yacc (or whatever) accepts your grammar but when you try to write the
actions for each rule you realise that some of the information you
need isn't available when you need it
This discussion concentrates on problem 4 above,
but touches on problem 6.
If you have any of the other problems listed above, then the best I can do is
refer you to e.g.:
- BNF, Syntax Diagrams & EBNF
- Example of Formalising a Grammar for use with
Lex & Yacc
or
Lex alone
- my CS2111 lectures
(computer language design, sections 1, 2, 9-19)
- my CS2121 lectures
(notations, lexers, parsers, parse trees, dictionaries, compilers, etc.)
- Parsers: Theory and Implementation
- Parsing
including top-down (LL) and Bottom-up (LR) parsing
- more information about computer languages, parsing, grammars, and compilers
(*Technical note: when I refer to "grammars", I really mean
"context-free grammars", that can be written e.g. in
BNF
or
EBNF.
Other kinds of grammars, including both
more expressive
kinds like context-sensitive grammars, and less expressive kinds like
regular expressions,
are outside the scope of this discussion.)
Ambiguous and confusing grammars, and what to do about them
Ambiguous grammars
An ambiguous grammar is one that gives more than one way of
understanding a single piece of text in the language. e.g.
C and Java have an ambiguity in the grammar for expressions
There are contexts in which ambiguity is useful, and in which parsers that
can deal with ambiguity are necessary, but not normally for artificial
languages like those used with computers - currently, the cost of dealing with
ambiguity vastly outweighs any usefulness it might have.
It may be that, as for
C above,
the language itself is ambiguous
(at the grammatical level - there is no ambiguity at the semantic level)
or it may be that, as in
this example,
the language itself is unambigous but the person who wrote the grammar made a
mistake.
For the purposes of this discussion, parsers and parser-generators cannot
deal with ambiguous grammars.
Confusing grammars
A confusing grammar is what, for the purposes of this discussion, I am calling
an unambiguous grammar that is still too complicated for the parsing algorithm
you are trying to use. e.g.
How to confuse parsers
Unfortunately, for most parsing algorithms, there will always exist
grammars that confuse them.
It is possible to use parsing algorithms that can handle
any unambiguous grammar, but it is not currently cost-effective to do so.
For the purposes of this discussion, I am assuming that cost-effective
parsers use variants on the LL(n) and LR(n) algorithms.
More notes and examples
including discussion of the kinds of error messages you may get
from yacc (and similar bottom-up parser generators)
Dealing with Ambiguous and Confusing grammars
There are no general-purpose algorithms to deal with ambiguous and confusing
grammars, just heuristics, and I want to investigate these heuristics.
[
An annoying feature of grammars is that many things you might sensibly want
to do are undecidable i.e. there cannot be an algorithm to achieve it in all
cases! For example, it is not possible to decide if a grammar or a language is
ambiguous - the best you can do is try to create a parser and, if you succeed,
the grammar must be unambiguous, but if you fail, maybe you just need to use a
more powerful method. Similarly, there is no way to decide if a grammar can be
converted to LL(1) or to LR(1) except by trying to do it - if you succeed,
then it was, but if you fail that just tells you that the method you used
isn't appropriate.
(Almost the only exception to this that I am aware of are results that say
that, if you can create an LR(n) grammar for some language, it is also
possible to create an LR(1), LALR(1) or SLR(1) grammar for it - however my
understanding is that general purpose algorithms to achieve this have been
proved to be impossible!)
]
I can think of four main options for dealing with ambiguous and confusing
gramamrs:
- Change the parser (or parser-generator), keeping the grammar the same.
-
LL(1) -> LL(n) -> LALR(1) -> LR(1) -> LR(n),
or use modifications of these algorithms that allow extra lookahead, or
limited backtracking, or use of predicates, at strategic moments during the
parse.
(This is not usually a sensible option if the grammar is actually ambiguous,
but as mentioned above, it is possible if there really is no alternative.)
e.g.
How to confuse parsers
discusses a grammar that is LR(1) but not LL(n).
e.g.
example grammars that are
LR(1) but not LL(n), and LR(1) but not LALR(1)
- Change the grammar, keeping the language the same.
-
Examples of converting grammars to
LL(1)
and to
LR(1)
and some
heuristics
for how to proceed.
(I am assuming that we are preserving the intent of the language designer,
even if, strictly speaking, we may be changing the language when we change the
grammar e.g. to make it unambiguous.)
- Change the language.
-
e.g.
the dangling else
- "Cheat" (which usually involves modifying the lexical analyser).
-
There are similar problems and solutions when parsing FORTRAN, which:
- has no reserved words (it has keywords like "DO", but they can also be
used as variable names)
- does not treat space as a separator (so "DO 1 I" and "DO1I" can both mean
the same variable)
- allows default declarations for undefined variables.
This makes it hard work to distinguish between e.g. the loop construct
DO 1 I = 1, 10
and the floating point assignment
DO 1 I = 1. 10
- even if the variable DO1I is undefined it would (by default)
be automatically declared as a floating point variable.
e.g. see
moderator's footnote
Note that these kinds of cheats can often be expressed more succinctly in
more advanced parsers, for example as "semantic predicates" - are they still
cheats, or are they just making good use of the facilities provided?
I suppose one can draw a formal distinction between facilities that make it
easier to do the job but don't increase the fundamental expressiveness of BNF,
and those that actually increase the range of languages that can be handled.
e.g. one such system I came across allowed you to use predicates to recognise
AnBnCn (any number of As, followed by the
same number of Bs, followed by the same number of Cs) which is not possible
for any context-free grammar!
However, the only practical distinction is whether these facilities cause the
parsers that use them to be significantly less efficient. Even if they do, it
may be that that is a price worth paying to handle more languages. The danger
is that the extra price will also be paid by less experienced users handling
languages or grammars that are only superficially hard to deal with, and who
unknowingly end up using a sledge-hammer to crack a nut.
A related problem
Suppose you are using e.g. yacc, and your grammar
has no conflicts, but when you add the actions, you find that you can't get
information to flow between the grammar rules correctly:
e.g.
Getting actions in
the right order
or
When to use a
tool or technique (scroll down to the last few paragraphs of "parse trees")
Much as above, the solutions include using a "better" parser-generator, that
knows how to build a parse tree automatically and walk it to calculate all
your attributes (or doing this yourself by hand) or rewriting the grammar
to change the order in which various rules are reduced.
Further information
Parsing Techniques
Aho and Ullman's "The Theory of Parsing, Translation and Compiling"
Sippu, S. and E. Soisalon-Soininen (1988).
Parsing Theory. Vol.1: Languages and Parsing,
Vol. 15 of EATCS Monographs on Theoretical Computer Science. Springer.
Sippu, S. and E. Soisalon-Soininen (1990).
Parsing Theory. Vol.2: LR(k) and LL(k) Parsing,
Vol. 20 of EATCS Monographs on Theoretical Computer Science. Springer.
some theory
and
more