
Putting it all together

We have looked at various tools and techniques used in compilers, but not at when to use each one, or how to combine them:

When to use a tool or technique

The questions relevant to the use of LEX, YACC, and dictionaries are all about the form of the input text. By contrast, the most important questions relevant to the use of parse trees are about what you want to do with the input text once you have analysed it.

What are the components of a compiler?

Lexical analysis (itemisation, scanning, tokenisation)

input:
characters of source program
outputs:
symbols / lexemes / tokens / items / words
names → dictionary namelist
error messages, listings, etc.

Recognises simplest pieces (i.e. regular part) of grammar:
discard blanks and comments
identify literals, names, keywords, operators, punctuation
lexemes = pair (syntactic kind [from "return"], semantic value [if used, from "yylval.???="])

Example lexeme kinds:

identifier
value = particular name (e.g. a pointer to the name in the namelist)
literals
value = particular number, character or string etc. and type
each keyword or punctuation
no value, distinguished purely by kind
each operator
usually different kinds, but may combine operators with the same precedence and associativity etc. and just distinguish them by value

Many useful tools are available, mostly similar to LEX.

Syntactic analysis (parsing)

input:
lexemes/symbols
outputs:
parse trees or equivalent
error messages

Recognises context-free grammar.
Builds structure from string of symbols.

Detects grammatical errors, but usually cannot correct them. Good parsers will usually try to continue after finding one error, in case there are more errors.

Many useful tools are available, mostly in two broad classes: top-down LL(k) and bottom-up LR(k). Bottom-up parsers, like YACC, are more powerful, but harder to understand at first. Many tools improve the look-ahead beyond the 1 symbol of LL(1) and LR(1), as this significantly increases the power of LL parsers and makes LR parsers easier to use. Some tools integrate the lexical and syntactic grammars, rather than having two separate files as with LEX and YACC, but still generate separate lexical and syntactic analysers.

Semantic analysis

inputs:
parse trees or equivalent
dictionary containing just names
outputs:
parse trees (or equivalent) for executable code, enhanced by information from declarations
declarations → dictionary properties
error messages

The semantic analyser checks the declarations and use of identifiers, using the dictionary.

Unlike lexical and syntactic analysis, there are no widely available toolkits for semantic analysis, although there are packages that help with maintaining a dictionary.

Code generation

input:
enhanced parse trees or equivalent for executable code
output:
object code

There are no code generation tools as ubiquitous and as useful as LEX and YACC.

We have looked at the code generated for expressions in lab exercise 4, for control structures in §9, and for variables in §8. This is covered in much more detail in CS2042.

How to combine components, tools and techniques - Passes

We looked at the concept of passes when we discussed parse trees (§7.1.1) but here is a quick reminder:

A pass is an actual scan of the whole program being translated that must be completed before the next phase/pass can start.

A typical translator program will usually combine several phases into each pass. The first pass in any translator is (by definition) over the actual input text. If there are any subsequent passes, they may also be over the input text, but more usually they are over some internal form of the text, such as parse trees.

Even complex languages like C can be compiled in one pass, and one-pass translators are usually the simplest to write. However, if we have decided above that we have to use parse trees, then by definition we are using more than one pass. In practice, most compilers nowadays are multi-pass, as they include significant code optimisation.

A typical 1-pass organisation

The heart of the translator will be the syntactic analyser, written using YACC or something equivalent. This will make calls on the lexical analyser, written using LEX or some equivalent. The actions associated with each grammar rule in YACC (and/or in LEX) will make calls on routines that do all the rest of the translation, such as building and accessing the dictionary, performing semantic analysis, and generating any output (assuming these are all relevant).

A typical multi-pass organisation

The first pass will be similar to the 1-pass organisation, except that the generated output will be the parse tree (so the actions will need to call tree-building routines), and by definition not all of the routines in the rest of the translator will have been called.
For example, the first pass might consist of all the analysis phases of the compiler (lexical, syntactic and semantic), creating a parse tree with information from the declarations added at each use of each identifier. The second (and final) pass could then be a simple code generator.

The tree will then be handed on to the second pass, which will be organised around a tree-walking routine. Typically, this will be a depth-first, left-to-right traversal over the whole tree, but circumstances may require the nodes of the tree to be visited in a different order, and some of them may even be visited more than once or not at all. The tree-walking routine will make calls on some more of the routines from the rest of the translator as it visits each node.

If this is a 2-pass program, we are done. Otherwise, the output from the second pass, which may be a modified tree or may be some other data structure altogether, is handed on to the third pass, which is again organised around a routine that walks over the data-structure (e.g. modified parse tree), calling some more of the routines in the rest of the translator.
For example, we might decide to arrange our compiler so that the first pass consisted just of lexical and syntactic analysis, ending by building simple parse trees, followed by a second pass during which we performed semantic analysis and significantly changed the trees, followed by a third and final pass in which we performed simple code generation.

If we have completed all the passes, we are done. Otherwise, the next pass will be organised in a similar way to the previous pass, although the data structure handed on may be changed yet again. This process continues until all the passes have been performed.
For example, we might expand the code generation phase in what was our hypothetical 2-pass compiler to include code optimisation, requiring many passes over the parse tree to re-arrange it to allow better code to be generated.

The net effect is that, during each pass, we perform more of the actions, until by the end of the last pass we have completed all of them and the final result has been output. Note, however, that the "parse tree" may be completely unrecognisable as such by the time we have finished!

Epilogue - Syntax-directed Translation

Syntax-directed translation is essentially a formal treatment of two simple concepts:

Attributes are essentially equivalent to the %union value that YACC can calculate in the action associated with each grammar rule - we can use struct values if there is more than one attribute. However, the actions themselves are generalised - YACC values can only move from leaves of the parse tree towards the root, but in syntax-directed translation the values can move around the tree in any way the user requires. This is achieved by automatically creating a parse tree and traversing it as required to move the values around.

The automatic creation and use of parse trees usually also simplifies one more step in the translation process, as we often need parse trees anyway for a later stage e.g. interpretation or code optimisation.

Systems exist, analogous to YACC and LEX, that can generate syntax-directed translators, but they are inevitably harder to learn than YACC and LEX, although, like any other high-level language, experienced users can achieve a lot more with them. However, just as with low-level languages, with sufficient ingenuity YACC and LEX can be made to do anything that the more complex systems can do. (Note that some of the tools being created today, such as JavaCC, if anything have less functionality than LEX and YACC.) More importantly, by becoming aware of these more advanced concepts, we can make better use of LEX and YACC.

Exercises - missing

Readings

Louden: ch. 4.7
Capon & Jinks: chs. 3.1-2, 3.4-6, 6, 7.1, 7.3, 12.1, 10, 14
Aho, Sethi & Ullman: chs. 1, 2.1-2.6, 2.8-9, 3.1-3.4, 4.1-4.3, 4.5, 4.7-4.9, 8, 9.9-10, 10.1-2, 11

CS2121 - Autumn 2002 - Pete Jinks - appendix
Pete Jinks
2004-10-26