We have looked at various tools and techniques used in compilers, but when should each of them be used?
The questions relevant to the use of LEX, YACC, and dictionaries are all about the form of the input text. By contrast, the most important questions relevant to the use of parse trees are about what you want to do with the input text once you have analysed it.
Does the text you are processing have any structure at all?
If so, you will usually want to split it into small pieces (words), so you
will almost always use LEX (or something equivalent).
The only situation I can imagine where it might not be useful would be something like FORTRAN, which has a structure, but where the boundaries between words are decided by the context in which they appear (spaces are not separators, so ``DO 1 I'' might be 3 words in one place and 1 word in another).
Does the text you are processing have a hierarchy of structures e.g. do the
words combine into groups, which combine into larger groups, and so on?
Is this hierarchical structure well-defined (i.e. has a grammar), or can you
create a grammar that accurately captures the required structure?
If so, then YACC should be able to help you.
However, one thing that YACC is not very good at is coping with mistakes in its input. Most modern systems have some facilities to deal with errors automatically, e.g. by skipping or inserting a word, or in more serious cases by jumping to the end of the current grammar rule and skipping input until they find something that can follow it. YACC (and JavaCC) do provide some helpful facilities, but they must be inserted by hand at suitable points in the grammar.
In some circumstances, such as when you are trying to deal with input that is full of errors (e.g. dealing with HTML) this problem can nullify the benefits of using YACC and cause you to just use LEX by itself, particularly if the input is not heavily structured.
Do you have user-defined names, or perhaps a long list of built-in names, to
deal with?
If so, you almost certainly need to use a dictionary of some sort, even if it
is quite simple.
Can there be more than one instance of the same name in the input e.g. are
there blocks and scope rules, or can the same name be used e.g. for a variable
and for a piece of code?
If so, you will need to structure your dictionary correspondingly - e.g. as a
stack for blocks, or with different dictionaries e.g. for variables and for
code.
Do you need to repeatedly scan the input text to be able to completely analyse
it?
e.g. using names before their definition, if the definition can change how
uses of that name have to be translated.
If so, then you will need to use a parse tree so that, having found e.g. the
declaration, you can go back and decide on the detailed translation of the
uses of the name.
If we are starting from scratch, we can usually design a new language to avoid
this difficulty, although we may encounter it if we have to deal with existing
languages. Some modern languages, such as SML, are deliberately defined in
this way to make them easier to use, although harder to translate.
What do you want to do with the input text once you have analysed it?
The more complex this is, the more likely you are to need parse trees:
e.g. significant (global) code optimisation.
e.g. reordering in translations - reading from left to right,
we can easily translate infix to postfix
subexp1 op subexp2 -> subexp1 subexp2 op
because we just have to remember the operator, a
single symbol, until we can output it at the end of the rule.
However, to translate postfix to infix
subexp1 subexp2 op -> subexp1 op subexp2
we have to remember the second sub-expression, which can be arbitrarily
complicated, until the end of the rule, so we need to use a parse
tree to remember it (although if we wanted to we could still do all of this
within the YACC actions).
e.g. interpreting the code - if there are control constructs in the language
then the interpreter will have to move around in the analysed input, so there
will have to be some saved version of the input for it to move around in.
This could be a parse tree, or maybe we generate e.g. postfix code and the
interpreter jumps around in that.
Looking at the actions we want to associate with grammar rules, it may be that we end up needing to use values from places that YACC cannot easily access. If so, the simple solution is to build a parse tree including as many values as can be calculated by YACC, and then walk over the tree calculating any extra values that YACC can't manage.
However, there is an alternative: we can often rewrite the grammar so that values only need to be passed towards the root of the tree, and so YACC can calculate everything. (4.3)
e.g. suppose the language includes declarations
declaration : type idlist ;
idlist : id | idlist ',' id ;
and we want to declare the ``id''s to be of the ``type''. YACC cannot
pass the ``type'' down into the ``idlist''. We can rewrite the grammar
as:
declaration : type id | declaration ',' id ;
and now the ``type'' information can be used by YACC.
In general, it may not be easy to see how to change the grammar and it may not
even be desirable to do so, as you have to show that your new grammar
recognises the same language, and you have to be able to maintain your code.
Note that in the problem above we want information to flow from left
to right, and we have used left recursion in the solution. Similarly,
if we wanted information to flow from right to left we could use right
recursion, transforming:
declaration : idlist type ;
idlist : id | idlist ',' id ;
into:
declaration : id type | id ',' declaration ;
(If you aren't convinced, try drawing the parse trees for a simple
declaration for each different grammar, and then draw arrows on them
showing how the type information has to move around the tree.)
Most (but by no means all, as the example immediately above shows) of this can be summarised as: if you do anything except process the input left to right as you read it, you need to think about using a parse tree.
Recognises simplest pieces (i.e. regular part) of grammar:
discard blanks and comments
identify literals, names, keywords, operators, punctuation
lexemes = pair
(syntactic kind [from "return"],
semantic value [if used, from "yylval.???="])
lexeme kinds are (e.g.): literals, names, keywords, operators, punctuation
Many useful tools available, mainly similar to LEX.
Recognises context-free grammar.
Builds structure from string of symbols.
Detects grammatical errors, but usually cannot correct them. Good parsers will usually try to continue after finding one error, in case there are more errors.
Many useful tools available, mostly in two broad classes - top down LL(k), and bottom up LR(k). The bottom-up parsers, like YACC, are more powerful, but harder to understand at first. Many tools improve the look-ahead beyond the 1 symbol of LL(1) and LR(1), as this significantly increases the power of LL parsers and makes LR parsers easier to use. Some tools integrate the lexical and syntactic grammars, rather than having two separate files as with LEX and YACC, but still generate separate lexical and syntactic analysers.
The semantic analyser checks the declarations and use of identifiers, using the dictionary.
Unlike lexical and syntactic analysis, there are no widely available toolkits for semantic analysis, although there are packages that help with maintaining a dictionary.
There are no code generation tools as ubiquitous and as useful as LEX and YACC.
We have looked at the code generated for expressions in lab exercise 4, for control structures in 9, and for variables in 8. This is covered in much more detail in CS2042.
We looked at the concept of passes when we discussed parse trees (7.1.1) but here is a quick reminder:
A pass is an actual scan of the whole program being translated, that must be completed before the next phase/pass can start.
A typical translator program will usually combine several phases into each pass. The first pass in any translator is (by definition) over the actual input text. If there are any subsequent passes, they may also be over the input text, but more usually they are over some internal form of the text, such as parse trees.
Even complex languages like C can be compiled in one pass, and one-pass translators are usually the simplest to write. However, if we have decided above that we have to use parse trees, then by definition we are using more than one pass. Indeed, most compilers nowadays are multi-pass, as they include significant code optimisation.
The heart of the translator will be the syntactic analyser, written using YACC or something equivalent. This will make calls on the lexical analyser, written using LEX or some equivalent. The actions associated with each grammar rule in YACC (and/or in LEX) will make calls on routines that do all the rest of the translation, such as build and access the dictionary, perform semantic analysis, and generate any output (assuming these are all relevant).
The first pass will be similar to the 1-pass organisation, except that
the generated output will be the parse tree (so the actions will need
to call tree-building routines), and by definition not all of the
routines in the rest of the translator will have been called.
For example, the first pass might consist of all the analysis phases of the
compiler (lexical, syntactic and semantic), creating a parse tree with
information from the declarations added at each use of each identifier. The
second (and final) pass could then be a simple code generator.
The tree will then be handed on to the second pass, which will be organised around a tree-walking routine. Typically, this will be a depth-first, left-to-right traversal over the whole tree, but circumstances may require the nodes of the tree to be visited in a different order, and some of them may even be visited more than once or not at all. The tree-walking routine will make calls on some more of the routines from the rest of the translator as it visits each node.
If this is a 2-pass program, we are done. Otherwise, the output from the
second pass, which may be a modified tree or may be some other data structure
altogether, is handed on to the third pass, which is again organised around a
routine that walks over the data-structure (e.g. modified parse tree), calling
some more of the routines in the rest of the translator.
For example, we might decide to arrange our compiler so that the first pass
consisted just of lexical and syntactic analysis, ending by building simple
parse trees, followed by a second pass during which we performed semantic
analysis and significantly changed the trees, followed by a third and final
pass in which we performed simple code generation.
If we have completed all the passes, we are done. Otherwise, the next pass
will be organised in a similar way to the previous pass, although the data
structure handed on may be changed yet again. This process continues until all
the passes have been performed.
For example, we might expand the code generation phase in what was our
hypothetical 2-pass compiler to include code optimisation, requiring many
passes over the parse tree to re-arrange it to allow better code to be
generated.
The net effect is that during each pass we perform more of the actions, until by the end of the last pass we have completed all of them and the final result has been output. Note, however, that the ``parse tree'' may be completely unrecognisable as such by the time we have finished!
Syntax-directed translation is essentially a formal treatment of two simple concepts:
Attributes are essentially equivalent to the %union value that YACC can calculate in the action associated with each grammar rule - we can use struct values if there is more than one attribute. However, the actions themselves are generalised - YACC values can only move from leaves of the parse tree towards the root, but in syntax-directed translation the values can move around the tree in any way the user requires. This is achieved by automatically creating a parse tree and traversing it as required to move the values around.
The automatic creation and use of parse trees usually also simplifies one more step in the translation process, as we often need parse trees anyway for a later stage e.g. interpretation or code optimisation.
Systems analogous to YACC and LEX exist that can generate syntax-directed translators, but they are inevitably harder to learn, although, as with any other high-level language, experienced users can achieve a lot more. However, just as with low-level languages, with sufficient ingenuity YACC and LEX can be made to do anything that the more complex systems can do. (Note that some of the tools being created today, such as JavaCC, if anything have less functionality than LEX and YACC.) More importantly, by becoming aware of these more advanced concepts, we can make better use of LEX and YACC.