1) Language Theory



Basic Language Theory

A compiler is a program that translates one language into another. The target language, that is, the output of the compiler, is often assembly language or object code, but this is not necessarily the case. Many compilers convert one high level language to another high level language. One notable example of this is cfront, the original UNIX C++ compiler, which translates C++ into C. The C language output is then fed into cc, the standard UNIX C compiler, which produces object code. Another example of a compiler that converts from one high level language to another is the BEACON back end.

In order to write a program that performs a translation on a language, that language must first be precisely and accurately defined. By precisely, we mean ALL aspects of the language must be set out. If any aspect of a language is left undefined, compiler writers will be on their own to figure out what to do, and each compiler will handle these things differently. The language will not be portable. By accurately, we mean that the language must be defined consistently, down to the smallest detail. Any details that are in conflict will lead compiler writers to resolve the problems each in their own way. Again, this leads to a language that is not portable.

In order to create a language that meets these criteria, we must develop the concept of language in a formal mathematical sense. There are two parts to creating a language in the mathematical sense. The first part is defining rules for the language. This is called syntax. The second part is defining the meaning behind the rules. This is called semantics. Semantics is a difficult thing to get a handle on, and rules for defining semantics for a language have not yet been codified.



Syntax

In order to define the syntax for a language we start with an alphabet. An alphabet is a set of tokens, not necessarily letters, that can be used to create strings in the language. A string is a series of tokens pieced together according to the rules of the language. Strings are also sometimes referred to as sentences. So in order to define the syntax for a language, we need an alphabet and a set of rules used to manipulate the alphabet. The rules and the alphabet can be expressed together in one of two forms: Chomsky Normal Form (CNF) or Extended Backus Naur Form (EBNF). Both of these forms will be covered in detail. A description of the syntax of a language in either of these forms is called a grammar.

If we take English as an example of a language by the above definition we find something quite interesting: what we normally understand as the alphabet is not what we defined above as the alphabet. According to our mathematical definition of a language, the words of the English language are the alphabet of the English language.

According to the rules of the English language, a sentence must have a subject and an action (a verb) performed by that subject. In addition a sentence can have prepositional phrases, modifiers, and other constructs, which we do not need for this example. The following is an example of an English language sentence:

The cat sat.

In this example there are three tokens, or members of the alphabet: The, cat and sat. There is a subject, cat, and a verb, sat. The following are not English language sentences:

Sat ran run jump.
The paper on compilers.

We will refer to these as phrases. The first phrase is made of four tokens. It does not meet the rules of English for two reasons: it is missing a subject, and it has four actions rather than one. The second phrase also contains four tokens, and is not an English sentence because it is missing a verb.

The alphabet is sometimes referred to as the lexical elements of a language.

Part of the task of a compiler is to determine if a program meets the rules of the language. If not, it contains a syntax error.



Semantics

Semantics refers to the meaning behind a sentence in a language. Developing mathematical rules to describe semantics has proven to be an elusive undertaking. It is possible to define a meaning for each token in an alphabet, but when tokens are placed together there are subtle interactions in meaning that make precise semantic definition of a language quite difficult. This is a major reason for continued difficulties in making modern languages completely portable.

Take the following English language example:

The bank president ate a plate of spaghetti.

This is a good solid sentence that drives AI gurus nuts. Programs that attempt to understand the semantics of English often translate this to mean that the president of a river bank ate a plate, and then a bunch of spaghetti fell on his lap. (I last took a course in AI about 4 years ago. This little problem may have been overcome by now.)

Solutions for describing the semantics of a language tend to grow exponentially, since the interactions between each modifier and the object of that modifier have to be described. Consider how you would tell a computer, in a generic way, that bank in the above example refers to a financial institution. After you've developed a solution, consider how the computer would then handle the following:

The boat washed up onto the bank.

When developing a compiler there are generally well accepted semantics for most constructs. Assignment statements and for loops exist in some form in most languages. Where semantics are unclear, as with a construct unique to a particular language, English language descriptions generally have to suffice. Hopefully the English language description can be made clear enough that all compilers based on it will work the same way.



Chomsky Normal Form

Language theory owes a great deal to Noam Chomsky. In the 1950s Chomsky studied many spoken languages and attempted to develop a formal method of describing language. Among other achievements, it was Chomsky who first divided language study into syntax and semantics. Another achievement was the development of a language used to specify languages.

Chomsky Normal Form (CNF) uses a series of intermediate symbols to describe syntax rules. The left side of a rule in CNF shows a string containing intermediate symbols; the right side shows how that string can be rewritten. The rewritten string may contain terminal symbols, which are the tokens in the language's alphabet, or it may contain intermediate symbols, or it may contain a combination of the two. Traditionally terminals are shown with lower case letters and intermediate symbols are shown in upper case. A CNF grammar always starts with the intermediate symbol S.

S -> a

In this exceedingly simple example, the only possible sentence in the grammar is the single token a.

S -> aBa
B -> bb

This example again shows a grammar that can accept only a single sentence. S is replaced by aBa, and B is replaced by bb, giving the sentence abba. In order to add flexibility to the grammar we need to be able to make choices in how a symbol can be expanded. CNF shows a choice in two ways. The symbol may be repeated on the left side of multiple rules for expansion:

S -> aBa
B -> bb
B -> aa

Or the vertical bar can be used to show a choice. The following example is exactly equivalent to the previous example:

S -> aBa
B -> bb | aa

This grammar can produce two different sentences: abba or aaaa. Using the methods outlined so far we can create a grammar with a large number of intermediate symbols that can produce a large number of distinct sentences. But the set of sentences such a grammar can create is finite, and their lengths are bounded. In order to create a useful language, whether spoken or programming, we need to be able to create sentences of arbitrary length. The only way CNF can be used to define a grammar that can create sentences of arbitrary length is to allow a symbol to appear in its own expansion rules.

S -> aBa
B -> bb | aaB

Such a language is called recursive. Some sentences that can be generated with this grammar include abba, aaabba, aaaaabba, etc. The number of a's before the final bba can be made arbitrarily large.
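
For example, the sentence aaabba comes from applying the recursive rule for B once before finishing with bb:

S -> aBa
  -> a aaB a
  -> a aabb a    (that is, aaabba)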

For a more complex example of CNF, we can look at the more or less standard definition of an expression used by most computer languages.

S -> EXPRESSION
EXPRESSION -> TERM | TERM + EXPRESSION | TERM - EXPRESSION
TERM -> FACTOR | FACTOR * TERM | FACTOR / TERM
FACTOR -> NUMBER | ( EXPRESSION )
NUMBER -> 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0 | 
          1 NUMBER | 2 NUMBER | 3 NUMBER | 4 NUMBER |
          5 NUMBER | 6 NUMBER | 7 NUMBER | 8 NUMBER |
          9 NUMBER | 0 NUMBER 

This grammar is fairly small and compact, but it is also quite complex. There is a lot of ground to cover here. Analyzing it will be well worth the effort, however.

Notice first of all how the expression 123 is covered by this grammar. S is expanded to EXPRESSION, which expands to TERM, which expands to FACTOR, which expands to NUMBER. NUMBER expands to 1 NUMBER. That NUMBER then expands to 2 NUMBER. And finally this NUMBER expands to 3.
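
Written out as a derivation, with one symbol rewritten at each step:

S -> EXPRESSION
  -> TERM
  -> FACTOR
  -> NUMBER
  -> 1 NUMBER
  -> 1 2 NUMBER
  -> 1 2 3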

Next notice how 2 + 3 * 6 expands. S expands to EXPRESSION, which expands to TERM + EXPRESSION. TERM expands to FACTOR, then to NUMBER, then to 2, giving us 2 + EXPRESSION. EXPRESSION expands to TERM, which expands to FACTOR * TERM. FACTOR expands to NUMBER, which then expands to 3. TERM expands to FACTOR, which expands to NUMBER, which finally expands to 6.
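
The same expansion written out as a derivation:

S -> EXPRESSION
  -> TERM + EXPRESSION
  -> FACTOR + EXPRESSION
  -> NUMBER + EXPRESSION
  -> 2 + EXPRESSION
  -> 2 + TERM
  -> 2 + FACTOR * TERM
  -> 2 + NUMBER * TERM
  -> 2 + 3 * TERM
  -> 2 + 3 * FACTOR
  -> 2 + 3 * NUMBER
  -> 2 + 3 * 6

Notice that the entire product 3 * 6 derives from a single TERM.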

This expansion is more complex than it first appears, and requires a deeper analysis. First of all we need to look at the precedence of the operators. Given the expression 2 + 3 * 6, according to standard rules of arithmetic we need to perform 3 * 6 before we add 2, giving a result of 20. A compiler developed from the above grammar will do that, because the multiplication is grouped under a single TERM. If we reversed the roles of TERM and EXPRESSION, we would have the opposite case, where 2 is added to 3 and the result is then multiplied by 6, giving 30.

The next thing we need to consider is that all recursion appears at the right end of the expansions. Any other intermediate symbol in an expansion is at a lower level than the symbol being expanded. Such a grammar is called right recursive. Consider what would happen if you based a compiler on a grammar with the following rules:

S -> EXPRESSION
EXPRESSION -> TERM | EXPRESSION + TERM | EXPRESSION - TERM
TERM -> FACTOR | TERM * FACTOR | TERM / FACTOR
FACTOR -> NUMBER | ( EXPRESSION )
NUMBER -> 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0 | 
          1 NUMBER | 2 NUMBER | 3 NUMBER | 4 NUMBER |
          5 NUMBER | 6 NUMBER | 7 NUMBER | 8 NUMBER |
          9 NUMBER | 0 NUMBER 

To expand 1 + 2, first S would expand to EXPRESSION, then EXPRESSION would expand to EXPRESSION + TERM, and the leftmost EXPRESSION would have to be expanded again before any input is consumed. This is an infinite loop. Such a left recursive grammar cannot be read by mechanical means. (Strictly speaking, this is not true - the entire program can be placed on a stack and then parsed backwards, but this is quite inefficient. Real problems start when one rule is right recursive and another in the same grammar is left recursive.)
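
To see why, here is a minimal sketch in C of what a recursive descent parser would have to do with the left recursive rule EXPRESSION -> EXPRESSION + TERM. The helper routines match() and term() are hypothetical stand-ins for the usual parsing machinery, not code from any real compiler:

/* Hypothetical recursive descent helpers, assumed to exist elsewhere. */
void match(char expected);
void term(void);

/* EXPRESSION -> EXPRESSION + TERM  (left recursive) */
void expression(void)
{
    expression();   /* calls itself before consuming a single token */
    match('+');
    term();
}

Because expression() recurses before reading any input, it can never make progress. Compare this with a right recursive rule, where a TERM is consumed before the recursive call is made.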

Another interesting example is one with parentheses. It will be left to the reader to develop such an example and see how it expands.



Extended Backus Naur Form

A short time after Chomsky devised CNF, two researchers, Backus and Naur, independently developed a similar form for specifying language grammars. Backus Naur Form can specify some languages more compactly than CNF. Over the years other researchers have added symbols to Backus Naur Form, creating Extended Backus Naur Form (EBNF). Grammars specified in either CNF or EBNF can be converted directly into a compiler; however, most compiler writers prefer to work with EBNF.

EBNF uses the symbol :== to separate the left and right sides of a rule. Terminal symbols are placed in single quotes.

S :== 'a' B 'a'
B :== 'bb'

The vertical bar is again used to represent a choice among expansion rules, and just as in CNF, recursion is used to develop strings of arbitrary length. Two symbols added in EBNF that do not exist in CNF are square brackets ([]) and curly braces ({}). Square brackets denote zero or one occurrences of an expansion, and curly braces denote an arbitrary number of occurrences, but at least one.

S :== 'a' [B]
B :== {'a'}

In this example, S expands to a or aB, since there can be 0 or 1 B's. B expands to a or aa or aaa or aaaa, etc. We can rewrite the expression grammar from the last section in EBNF:

S :== EXPRESSION
EXPRESSION :== TERM | TERM { '+' TERM | '-' TERM }
TERM :== FACTOR | FACTOR { '*' FACTOR | '/' FACTOR }
FACTOR :== NUMBER | '(' EXPRESSION ')'
NUMBER :== '1' | '2' | '3' | '4' | '5' | 
           '6' | '7' | '8' | '9' | '0' | 
           '1' NUMBER | '2' NUMBER | '3' NUMBER | 
           '4' NUMBER | '5' NUMBER | '6' NUMBER |
           '7' NUMBER | '8' NUMBER | '9' NUMBER | '0' NUMBER 

This grammar looks a little different. The only recursive rule is in FACTOR. This is because the curly braces allow strings of arbitrary length to be created without recursion. When translating this grammar into a compiler, recursive rules translate to recursive procedure calls, rules with curly braces translate to loops, and rules with square brackets translate to if statements.
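
To make this mapping concrete, here is a minimal sketch in C of a recursive descent evaluator for the expression grammar above. All of the names here (peek, match, factor, term, expression), and the choice to evaluate while parsing, are illustrative assumptions, not part of the original article:

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

static const char *input;   /* current position in the source text */

static void error(void)
{
    fprintf(stderr, "syntax error\n");
    exit(1);
}

/* Skip blanks and return the next significant character. */
static char peek(void)
{
    while (*input == ' ')
        input++;
    return *input;
}

static void match(char expected)
{
    if (peek() == expected)
        input++;
    else
        error();
}

static double expression(void);

/* FACTOR :== NUMBER | '(' EXPRESSION ')'
   The recursive rule becomes a recursive call. */
static double factor(void)
{
    if (peek() == '(') {
        match('(');
        double value = expression();
        match(')');
        return value;
    }
    if (!isdigit((unsigned char)peek()))
        error();
    double value = 0;
    while (isdigit((unsigned char)*input))
        value = value * 10 + (*input++ - '0');
    return value;
}

/* TERM :== FACTOR { '*' FACTOR | '/' FACTOR }
   The curly braces become a while loop. */
static double term(void)
{
    double value = factor();
    while (peek() == '*' || peek() == '/') {
        char op = peek();
        match(op);
        double right = factor();
        value = (op == '*') ? value * right : value / right;
    }
    return value;
}

/* EXPRESSION :== TERM { '+' TERM | '-' TERM } */
static double expression(void)
{
    double value = term();
    while (peek() == '+' || peek() == '-') {
        char op = peek();
        match(op);
        double right = term();
        value = (op == '+') ? value + right : value - right;
    }
    return value;
}

int main(void)
{
    input = "2 + 3 * 6";
    printf("%g\n", expression());   /* prints 20 */
    return 0;
}

Compiling and running this prints 20 for the input 2 + 3 * 6, matching the precedence analysis from the CNF section. Note how the curly brace rules became while loops in term() and expression(), while the parenthesized FACTOR rule became a recursive call back into expression().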

To expand the expression 1 + 2 + 3 * 4 + 5, start with S expanding to EXPRESSION, which expands to TERM { '+' TERM }. The first TERM expands to FACTOR, then to NUMBER, then to 1. The curly braces supply an arbitrary number of additional terms, so the loop produces + TERM + TERM + TERM. The first of these TERMs expands through FACTOR and NUMBER to 2. The second expands to FACTOR * FACTOR, and its two FACTORs expand through NUMBER to 3 and 4. The final TERM expands through FACTOR and NUMBER to 5.

Sometimes it is useful to denote zero to n instances of a rule. This is done by using curly braces with an asterisk after them ({}*).

The same concerns we had with CNF carry over to EBNF. A left recursive language is a left recursive language no matter what notation is used to specify it, and left recursive grammars cannot be parsed by the simple mechanical means described here. We must also worry about operator precedence no matter what notation we use.

Even though EBNF can be more confusing than CNF at first, we will tend to use it over CNF. First of all, EBNF supports loops, so we can avoid some recursion, allowing for more efficient compilers. Also, EBNF is more standard among compiler writers than CNF. When we look at translating EBNF into a parser it will become easier to understand how EBNF works.

The process of translating grammars in either form into a parser is actually quite mechanical. This has led to a number of tools for automatically creating compilers. The best known tool is the UNIX program YACC (Yet Another Compiler Compiler). This program is fed a grammar, written in a BNF-like notation, and outputs source code in C for a parser to match the grammar. BISON is the GNU version of YACC. (The BEACON back end was written in Ada, so we did not use YACC.) Another tool for automatically developing parsers has been built into the PROLOG programming language. By entering a CNF grammar into the PROLOG interpreter, you can effectively turn it into a parser for any right recursive language.




Computer Languages

Computer languages are languages in the mathematical sense. They are made from tokens and they can be described with right recursive grammars using either EBNF or CNF.

Taking Pascal as a typical computer language, its alphabet contains the following token types: reserved words, reserved symbols, identifiers, strings, integers and reals. The entire EBNF of Pascal will not be given, but we do have room for several examples. Examples of reserved words include Program, begin, if, etc. Examples of reserved symbols include (, ), ', .., etc. An identifier starts with a letter, contains letters, digits and underscores, and can be any length (though typically only the first 30 characters are significant). Strings are any characters enclosed in single quotes. Integers are a series of digits with no decimal point, and reals are a series of digits that do contain a decimal point and, optionally, a power of ten scaling factor (scientific notation).

In addition to the alphabet, when reading a computer program we have to worry about delimiters. A delimiter separates two tokens. In most cases a blank space is a delimiter. Reserved symbols also act as delimiters.

Any time a token is encountered that is not part of the language's alphabet, we have encountered a lexical error. This is different from a syntax error. An example of a lexical error might be the symbol -> in a Pascal program (C uses this symbol for pointers, but Pascal does not use it).

Pascal is quite standard in its alphabet. Most other computer languages, such as C, Ada, FORTRAN, BASIC, COBOL, LISP and PROLOG, have basically the same alphabets (except that PROLOG has no reserved words). Some computer languages also have context sensitive reserved words. Such words are reserved in certain parts of the program and not in others. Imagine that we could declare a procedure named Integer in Pascal. Then in the Var section Integer would act as a reserved word, but in sections where procedures can be declared it would not be considered reserved. It is reserved or not depending on the context where it is found. Pascal compilers handle all reserved words as globally reserved words. We will not make this assumption, and we will develop a compiler that can handle context sensitive reserved words.

One final aspect of computer languages is that they must be computable. The definition of computable in this case is that the language can be interpreted by a Turing machine. In order to interpret the language the Turing machine might first translate it to another language, but the Turing machine must then be able to execute the result.

One aspect of spoken languages is that no determination has been made as to whether they are computable. To my knowledge this problem has been solved for only one language (I believe that language is Dutch, but I could be mistaken), and that language was shown NOT to be computable. For other languages it is believed that determining whether they are computable is an undecidable problem. Some would argue that the human brain is nothing more than an advanced computer, and that therefore any spoken language must be computable. This does not follow, because spoken languages contain fuzzy constructs, very open to interpretation and to misinterpretation. When it comes to spoken language, mistakes are often made. Computer languages must be much more strictly defined, in order to keep them computable.
