UP to more information about computer languages, parsing, grammars, and compilers

Notations for context-free grammars


BNF

BNF = Backus Normal Form or Backus Naur Form.

History: the very first version was created by John Backus, and shortly after improved by Peter Naur, and it was this improved version that was publicly used for the first time, to define Algol 60. [Naur P (ed.), 1963, Revised report on the algorithmic language Algol 60, Comm. ACM 6:1 pp1-17]
Compilers: Principles, Techniques, and Tools, by Aho, Sethi and Ullman says: "The proposal that BNF, which began as an abbreviation of Backus Normal Form, be read as Backus-Naur Form, to recognise Naur's contributions ... is contained in a letter by Knuth [Knuth D.E., 1964, Backus Normal Form vs. Backus Naur Form, Comm. ACM 7:12 pp735-736]"
Knuth's letter (which you may be able to access via ACM) is interesting to read, as it indicates exactly what he thought the importance concepts in BNF were. He also points out that BNF is not a "Normal Form", which would imply there were some restrictions on how a grammar could be written down, as in e.g. Chomsky Normal Form or Greibach Normal Form.
There is a message archived from the comp.compilers newsgroup that gives a different view of Naur's contribution.

This first published version looked like:

<number> ::= <digit> | <number> <digit>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
which can be read as something like:
"a number is a digit, or any number followed by an extra digit"
(which is a contorted but precise way of saying that a number consists of one or more digits)
"a digit is any one of the characters 0, 1, 2, ... 9"

There are many variants, all equivalent to the original definition, such as:

Every variant has to distinguish between the names of grammar rules, such as number, and actual characters that can appear in the text of a program, such as 0123456789. Usually, one or the other is quoted, e.g. <number> or '0123456789' (or "0123456789") so the other can be left unquoted. Sometimes both are quoted, as in the last example above.

However, if we decide to quote the names of rules and leave the actual characters unquoted (as in the first example above) a problem can arise if the metacharacters (i.e. the characters such as ' < > . | used to punctuate the rules) can appear as actual characters in the programming language. Therefore, the most widely used variants of BNF usually quote the actual characters e.g.:

logical_expression = expression '|' expression
                   | expression '&' expression .
real_number = number '.' number .

e.g. ANSI C syntax from K&R using a BNF (the "%token" line lists things that are assumed to be simple enough to leave out, such as names and numbers. The form of BNF used is essentially that accepted by yacc/bison.)


Syntax Diagrams (or Charts or Graphs)

Unlike BNF, this kind of notation does not seem to have a commonly agreed-on name. "Syntax diagrams" are also known as "Railway Tracks" or "Railroad Diagrams". Whatever they are called, they do not allow us to write anything that can't be written in BNF, they just make the grammar easier to understand.

History: I first came across this style in the definition of Pascal
(e.g. identifiers from Amazon, or if-statements from CS370 at Augustana, Canada).
Something similar was used for FORTRAN 77
(e.g. assignment from HP FORTRAN 77/iX Reference)
and COBOL
(e.g. How to Read the Syntax Diagrams from IBM VS COBOL II Reference Summary - you may need to fiddle with your font sizes and/or scaling to be able to read this properly!).
Something more like EBNF (see below), but nearly as space-filling as syntax diagrams, was also used for COBOL. In this notation, alternatives were stacked vertically between brackets.
(e.g. on slides 4 and 8 of (pdf) from Programming Language Concepts at Iowa U
or e.g.
<identifier> ::= <letter> { <letter>
<digit>
} *
 
<integer> ::= [ +
-
] <unsigned integer>
from CS370 at Augustana, Canada)

e.g. (this is a crude attempt to give you an idea of what the Pascal version looks like - the real thing looks a lot better!)

IDENTIFIER:
			     +------+
			  ---|LETTER|---
			 /   +------+   \
			|		 |
	      +------+	v		/
	----->|LETTER|---------------------->
	      +------+	^		\
			|		 |
			 \   +-----+    /
			  ---|DIGIT|----
			     +-----+
which can be read as something like:
"an identifier consists of a letter, possibly followed by any number of letters and/or digits"

The names of diagrams/rules, such as letter and digit, appear in rectangles, and actual characters appear in circles or in boxes with rounded ends.

You can see many (much better-drawn) examples of this particular style at The BNF Web Club e.g. (gif) and in the documentation for Ebnf2ps: Peter's Syntax Diagram Drawing Tool.

A LaTeX package to help produce syntax diagrams is described here.

You may also come across a slightly different, more angular version of this style e.g. in Computer Science: Abstraction to Implementation by Robert M. Keller


EBNF

EBNF = Extended BNF

Like syntax diagrams, EBNF does not allow us to write anything that can't be written in BNF, it just makes the grammar easier to understand.

History: many extensions to BNF were used to define computer languages after Algol 60, all slightly different. Niklaus Wirth proposed a single formulation in: "What Can We Do About the Unnecessary Diversity of Notation for Syntactic Definitions, November 1977, Comm. ACM 20:11 pp. 822-823" However, this did not actually mention "EBNF". You may be able to access Wirth's article via ACM. Alternatively, you can see most of the details of his proposal here - look for section 5.9.1

I don't know of a universally accepted set of extensions to BNF, but here are some attempts:

Almost all variants of EBNF use brackets ( ) for the usual mathematical meaning of grouping items together. (See a detailed comparison of several different EBNFs.) There are (at least) three main styles of EBNF:

Note that e.g. x* means the same as (x+)? , and x+ the same as xx* , and x? the same as (x|) , so it is possible, but not as simple, to use a subset of these facilities.
Given that the only reason for using EBNF instead of BNF is to simplify and clarify language descriptions, it would seem sensible to provide all these facilities, and maybe others as well (e.g. see the note about list separators below).

ANSI C syntax from K&R using an EBNF based on regular-expressions

One further point - if we really want to minimise the amount of work we do and make the grammar as clear as possible, then we need to deal with the problem of list separators:

We can recognise lists where every item is identical, like "a ; b ; c ;" using: ( name ';' )+
[here ';' acts as a list-item terminator]
but the only way to recognise lists where the items are separated from each other, like "a , b , c" is: name ( ',' name )* or equivalently: ( name ',' )* name
[here ',' acts as a list-item separator, and a different marker, such as ';', would normally be used to terminate the list]
so we end up with something that is harder to use (we need to deal with names in two different places rather than one), longer to type and - most important - less easy to understand. You can see 9 examples in ANSI C syntax from K&R using an EBNF.

We really need a notation where we can say: a list of one or more names separated by ','
I don't know of a "natural" notation for this, but we need to be able to write something like: ( name # ',' )+
Unfortunately, most EBNFs do not include a notation for this concept.
As an example of this occuring in "real life", here is an ANSI C syntax using # for list separators.


Related links


UP to more information about computer languages, parsing, grammars, and compilers

This page is maintained by and copyright © Pete Jinks [last modified 12/Mar/2004] suggestions, corrections etc. welcome You are welcome to make educational, not-for-profit use (else what would be the point!) but please give due credit.