Building a Compiler: recognising the input

A simple (fast) translation phase often results in a slower execution phase (and possibly a simpler/smaller/more flexible system overall):

e.g. a simple fast compiler generating poor code v. a complex slow compiler generating better code.

e.g. interpreting a command by calling an existing general routine with a particular set of parameters v. compiling and executing a command by translating (binding) it into a specific sequence of instructions which runs many times faster. (e.g. a sh script v. a C program)

We will now look in outline at how we would build a complete compiler for a high-level language like C. To start with, however, it will help if we briefly look again at the subject of grammars.

Chomsky's classification of grammars

Chomsky grouped grammars into 4 distinct classes:
Unrestricted grammars
the most general and thus most powerful form, and the hardest to work with - any string of symbols can be replaced by any other string, with no restrictions on the number of symbols, or whether they are actual characters or e.g. rule names.
Context Sensitive grammars
the next most powerful form, but still very hard to do anything useful with. A single grammar symbol on the LHS can be replaced by different symbols on the RHS depending on the symbols that surround it (the context).
Context Free grammars
similar to the form of grammar that we use in YACC i.e. one grammar symbol on the LHS can be replaced by zero or more symbols on the RHS. However, YACC grammars (LALR(1)) must be decidable with a single symbol of lookahead so that they can be parsed in O(n) time, whereas general context free grammars can be ambiguous and may need O($n^3$) time to parse.
Regular grammars
or equivalently Regular Expressions, the form of grammar that we use in LEX.
Chomsky also proved that various different kinds of automata are equivalent in power to the 4 different kinds of grammars. Regular grammars are equivalent to Finite State Automata, and Context Free grammars to Push Down Automata, which is why these automata are used by the analysers that LEX and YACC create.

More importantly, he showed that Turing Machines are equivalent to Unrestricted grammars, and Finite Turing Machines (Linear Bounded Automata, i.e. Turing Machines whose tape is limited to the size of the input) to Context Sensitive grammars. We also know that programming languages are equivalent in power to (finite?) Turing Machines, so we can deduce that a context free grammar is unlikely to be sufficiently powerful to describe a programming language. However, it is also true that using an unrestricted or context sensitive grammar to describe a programming language would be pointless, as such descriptions are too complicated to be useful.

What we can usefully conclude is that the levels in Chomsky's hierarchy correspond to components of the specification of any programming language (and are of fundamentally different complexities). Therefore, to process any such language efficiently we must separate out these components and process them differently. Further, when separating the components we must be careful to give as much of the work as possible to the simpler tasks and as little as possible to the more complex ones. (Thus programming languages should be designed so that this is straightforward, or else users may face an unnecessary penalty.)

The part of a programming language that should be described by a context-sensitive grammar is the context, i.e. the identifiers, which we have seen we can process using a dictionary. This processing (of identifiers, types etc.) is normally called semantic analysis. Similarly, the part that corresponds to a context-free grammar can be handled by syntactic analysers such as those generated by YACC, and the part that corresponds to a regular grammar can be handled by lexical analysers such as those generated by LEX.
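
For example, a single fragment of C already mixes all three levels (this fragment is purely illustrative and is not part of the example program used later):

  float x;        /* lexical (regular): "float", "x" and ";" can each be
                     recognised in isolation, without any context          */
  x = (x + 1.5);  /* syntactic (context-free): the nesting of the brackets
                     and the overall shape of the expression               */
                  /* semantic (context-sensitive): x must already have been
                     declared, with a type that allows + and =             */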

Note that although we do not normally use a grammar to define the semantics of a programming language, we really need some formal notation (e.g. VDM): automatically generating semantic analysers is difficult even with such a notation, and impossible without one.

Macro pre-processing

Essentially an extra translator, making simple, naive modifications to the text. e.g. CPP knows about identifiers, and so does not replace fred inside freda, but it has no concept of scope other than the end of the text (including any #included files).
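
For example, a minimal (purely illustrative) fragment:

  #define fred 99
  int fred;          /* becomes: int 99;  - cpp replaces the complete identifier     */
  int freda;         /* unchanged         - fred is not a complete identifier here   */
  char *s = "fred";  /* unchanged         - cpp does not look inside string literals */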

Lexical analysis (itemisation, scanning, tokenisation)

input:
characters of source program
output:
symbols/ lexemes/ tokens/ items/ words
names $\rightarrow$ dictionary namelist
error messages, listings etc.

Recognises simplest pieces (i.e. regular part) of grammar:
discard blanks and comments
identify literals, names, keywords, operators, punctuation
as lexemes = (syntactic kind, semantic value)

lexeme kinds are:

identifier
value = particular name (e.g. a pointer to the name in the namelist)
literals
value = particular number, character or string etc. and type
each keyword or punctuation
no value, distinguished purely by kind
each operator
usually a different kind for each operator, but operators with the same precedence and associativity etc. may be combined into a single kind and distinguished by value
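
A lexeme of this kind might be represented in C roughly as follows (a sketch only; the type and field names are illustrative, not taken from any particular compiler):

  enum kind {ID, INT_LIT, FLOAT_LIT, KW_INT, KW_FLOAT, SEMI, ASSIGN, STAR /* etc. */};

  struct lexeme {
      enum kind k;            /* syntactic kind */
      union {
          struct name *name;  /* identifier: pointer to its entry in the namelist */
          int   ival;         /* integer literal */
          float fval;         /* float literal */
      } value;                /* semantic value; unused for keywords and punctuation */
  };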

Example fragment of C:


  int i; float j;
  float f (int i) {int k; float r; k= i*j; r= k; return r;}
and corresponding lexemes:

  (int, -)
  (id, *) -> i in namelist
  (;, -)
  (float, -)
  (id, *) -> j in namelist
  (;, -)
and namelist:

  int
  float
  f
  i
  j
  k
  r
The need for error messages means that we may need a line buffer, so that we can output complete lines containing an error together with a pointer to whereabouts on the line the error is. (We can do this using some of the more advanced facilities of LEX.) However, more sophisticated programming environments can read the error messages and enter an editor, setting the cursor to the position of each error in turn, making a line buffer superfluous.
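
A minimal sketch of the line-buffer idea, assuming the lexical analyser copies each source line (without its newline) into the buffer as it reads it (all names here are illustrative):

  #include <stdio.h>

  char line_buffer[256];  /* current source line, filled in by the lexical analyser */
  int  line_number;       /* its line number */
  int  column;            /* position of the offending lexeme within the line */

  void report_error(const char *message)
  {
      fprintf(stderr, "line %d: %s\n%s\n%*s^\n",
              line_number, message, line_buffer, column, "");
  }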

Syntactic analysis (parsing)

input:
lexemes/symbols
output:
parse trees or equivalent
error messages

Recognises context-free grammar.
Builds structure from string of symbols.

Detects grammatical errors, but usually cannot correct them.

Good parsers will usually try to continue after finding one error, in case there are more errors.
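
The parse trees themselves need only a simple representation; a minimal sketch in C (the names are illustrative), using a first-child/next-sibling layout:

  #include <stdlib.h>

  struct node {
      int            kind;     /* e.g. DECL, EXP, ASSIGN, RETURN, ... */
      int            type;     /* filled in later, by the semantic analyser */
      struct lexeme *leaf;     /* for leaves: the lexeme itself, otherwise NULL */
      struct node   *child;    /* first child ...                     */
      struct node   *sibling;  /* ... and chain of remaining children */
  };

  struct node *mknode(int kind, struct node *child, struct node *sibling)
  {
      struct node *n = malloc(sizeof *n);
      n->kind = kind; n->type = 0; n->leaf = NULL;
      n->child = child; n->sibling = sibling;
      return n;
  }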

Possible parse tree for example C program:


program
  |
decl--------var
  |     [int, *->i]
  |
decl--------var
  |     [float, *->j]
  |
decl--------func--------+----------------------+
        [float, *->f] decl-------param        exp--------[=]
                        |     [int,*->i]       | [*->k]       [*]
                        |                      |        [*->i]   [*->j]
                      decl-------var           |
                        |     [int,*->k]      exp--------[=]
                        |                      |   [*->r]   [*->k]
                      decl-------var           |
                              [float,*->r]    return-----[*->r]
(As in the lexemes, identifiers are pointers to entries in the namelist.)

Semantic analysis

input:
parse trees or equivalent
dictionary containing just names
output:
parse trees (or equivalent) for executable code,
enhanced by information from declarations
declarations $\rightarrow$ dictionary properties
error messages

Semantic actions can be directly embedded in the parser (1 pass), or the semantic analyser can walk over parse trees created earlier (multi-pass).

The semantic analyser recognises and checks the declarations and use of identifiers, using the dictionary.

Unlike lexical and syntactic analysis, there are no widely available toolkits for semantic analysis, although there are packages that help with maintaining a dictionary.
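
For example, checking that an identifier has been declared before it is used might look roughly like this (a sketch only: lookup, declare, report_error and ERROR_TYPE are assumed to be provided by the dictionary package, not by any real library):

  /* called for each use of an identifier found in the parse tree */
  struct property *check_use(struct name *name)
  {
      struct property *p = lookup(name);   /* search current scope, then enclosing scopes */
      if (p == NULL) {                     /* used but never declared */
          report_error("identifier not declared");
          p = declare(name, ERROR_TYPE);   /* fake an entry, to suppress repeated errors */
      }
      return p;                            /* the tree node now points at the properties */
  }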

Semantic actions for executable statements

Possible enhanced parse tree for example C program:


program[2 vars]
  |
decl--func----+
     [*->f]  exp--------[I,=]
              |              [F,->I]
              | [I,*->k]       [F,*]
              |          [I->F]
              |        [I,*->i]   [F,*->j]
              |
             exp--------[F,=]
              |            [I->F]
              |   [F,*->r]   [I,*->k]
              |
             return-----[F,*->r]
(Identifiers have become pointers to property entries; F is short for float, I for int.)
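
The coercion nodes ([I->F] and [F->I]) would typically be inserted by the semantic action for assignments; a sketch, reusing the type field and the illustrative mknode from the earlier parse-tree sketch (COERCE and ASSIGN are assumed node kinds):

  /* convert expression e to the wanted type, inserting a coercion node if necessary
     (a real analyser would also check that the two types are compatible at all)     */
  struct node *coerce(struct node *e, int wanted)
  {
      if (e->type != wanted) {
          e = mknode(COERCE, e, NULL);       /* the [I->F] / [F->I] nodes in the tree above */
          e->type = wanted;
      }
      return e;
  }

  struct node *check_assign(struct node *lhs, struct node *rhs)
  {
      struct node *n;
      lhs->sibling = coerce(rhs, lhs->type); /* rhs is converted to the type of the lhs */
      n = mknode(ASSIGN, lhs, NULL);
      n->type = lhs->type;                   /* the assignment yields the lhs type */
      return n;
  }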

Possible dictionary for example C program:


int   [type  | int] <--I

float [type  | float] <--F

f     [func  | global | ? | 1 param | 2 vars | *->F]

i     [param | local  | 1 | *->I]

      [var   | global | 1 | *->I]

j     [var   | global | 2 | *->F]

k     [var   | local  | 1 | *->I]

r     [var   | local  | 2 | *->F]
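
One possible C representation of a property entry of this kind (a sketch; the field names are illustrative and mirror the columns above; a func entry would also need its number of parameters and variables):

  struct property {
      int  kind;              /* type, var, param or func, as in the first column */
      int  scope;             /* global or local */
      int  offset;            /* position among the variables or parameters at its level */
      struct property *type;  /* the *->I and *->F pointers in the entries above */
  };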

Readings

Louden: ch. 4.7
Capon & Jinks: chs. 6, 3.1, 7.1, 7.3, 3.2
Aho, Sethi & Ullman: chs. 1, 2.1-2.6, 3.1-3.4, 4.1-4.3, 4.5, 4.7-4.9
Pete Jinks
1999-09-30