UP

Grammar-Manipulation Heuristics


General-purpose

The clearest I can be in writing down what I started with (although this is still hopelessly unclear) is:
  1. Convert to an EBNF
    (I am going to use my version of EBNF and in particular # to indicate a list separator)
  2. Repeat:
    1. Select some grammar rule(s) to manipulate
    2. Expand some of the rules, probably resulting in fewer, longer rules
    3. Simplify the resulting rule, e.g using the transformations below
    Until the grammar is in the right form
  3. Convert back to BNF (or whatever format is required by the tool being used)

Note that I have mainly explored this in the context of trying to create LR(1) grammars, so see below for more details.

Grammar Transformations

  1. Simple EBNF equivalences:
    1. A+ = A A*
    2. A? = ( | A )
    3. A* = (A+)? = ( | A+ )
    4. A ( B | C ) D = A B D | A C D
    5. (A # B)+ = A (B A)* = (A B)* A
    6. for (A # B)* use 1.5 + 1.3
  2. EBNF equivalences for recursion:
    1. A : B | A C | D A = A : D* B C*
    2. A : B | A C B = [2.1] A : B ( C B )* = [1.5] A : ( B # C )+
    3. A : B | B C A = [1.4,1.2] A : B ( C A )? = A : ( B # C )+
    4. A : (A B)? C = [1.3,1.4] A : C | A B C = [2.2] A : (C # B)+
    note that we may be losing important information e.g. in 2, that C is left-associative, and in 3, that it is right associative - we can try to preserve this information by decorating the "+" e.g.:
    for 2 - A : ( B # C )<+
    for 3 - A : ( B # C )+>
  3. Simplifying lists
    1. (A # B)+ B = [1.5] (A B)+
    2. (A # B)* = (A # B)+ if A can become epsilon e.g. A=C? or A=C*
    3. A ( B A | A | B )* = A ( A B | A | B )* = A ( B | A )* = (A # B*)+ B*
    4. A (B | C)[*+] = A (B+ | C)[*+] = A (B | B* C)[*+] etc.
      note : [*+] is intended to represent the fact that either * or + can be systematically substituted into the equations
  4. Simplifying lists of lists:
    1. A (B C* # D)+ E = A B (C | D B)* E
    2. A (B C*   D)+ E = [3.1] A B (C | D B)* D E
    3. A (B C+ # D)+ E = [4.1+1.1] A B C (C | D B C)* E
    4. A (B C+   D)+ E = [4.2+1.1] A B C (C | D B C)* D E
    5. A ( B (C # D)+ # F)+ G = A B C (F B C | D C) *  G
    6. A ( B (C # D)+   F)+ G = [3.1] A B C (F B C | D C)* F G
    7. A ( B (C # D)+ # F)* G = [4.5+1.3+1.4] A B C (F B C | D C)*   G | A G
    8. A ( B (C # D)+   F)* G = [4.6+1.3+1.4] A B C (F B C | D C)* F G | A G
    9. ((C # D)+ # F)+ = [4.5] C (F C | D C) * = C ( (F|D) C )* = (C # (D|F))+
    10. (A* # (B A* | C) )[*+] = (A* # (B|C) )[*+]
    11. (A+ # (B A+ | C) )[*+] = (A+ # (B|C) )[*+] = (A+ # (B A* | C) )[*+]
    12. (A* # B)* = (B* # A)* = ( A+ | B+ )* = ( A | B )*

Converting Grammars to LR(1)

examples and more details, mainly of the general heuristic above.

I have come across a much clearer heuristic, for removing the ambiguities from ambiguous grammars for unambiguous languages using an LR(1) parser-generator. The interesting thing about this is that the choice of what to do next is guided by the conflicts detected by the parser-generator, so the process is much more deterministic, although much less general. I would like to come up with similarly deterministic procedures for other situations.

When trying to make a grammar LR(1), a more general procedure may be something like:
expand (and thus dispose of) any rules that are named as the result of a reduction in shift/reduce or reduce/reduce conflicts,
(with reduce/reduce conflicts, it may make sense to pick just one of the two non-terminals - assuming they are different - to expand, and see if that is enough by itself; I wonder how to pick one.)
(not only the particular grammar rule for the non-terminal listed in the conflict, but any other grammar rules for the same non-terminal as well)
(at worst, reduce/reduce conflicts seem to be replaced by shift/reduce conflicts as this process continues)
until either "enough" right context has become visible (because the grammar rules have grown longer) that the conflict is resolved,
or you are unable to expand the rules any further, or expanding doesn't help (e.g. because they are recursive, and so you can't get rid of the original rule after expansion, and there is still a conflict being reported in it)
Then, if you still have conflicts, if you are sure that you want to resolve a shift/reduce conflict in favour of the shift (i.e. the grammar is ambiguous but the language isn't or shouldn't be, and you have actually exposed the ambiguity to view) use the rule-splitting procedure outlined by Chris Dodd.
Otherwise, ... (and maybe I can find something useful to say by investigating the example grammars I have collected).

My repeated failures to do this for yacc input grammar (e.g.) have had one useful outcome - to persuade me that the outline above is not right! Maybe we expand rules in those places where doing so makes "significant" right context visible?

Converting Grammars to LL(1)

examples of these and of the general heuristic above.