Syntax-directed translation is essentially a formal treatment of two simple concepts:
Attributes are essentially equivalent to the %union value that yacc can calculate in the action associated with each grammar rule - we can use struct values if there is more than one attribute. However, the actions themselves are generalised - yacc values can only move from leaves of the parse tree towards the root, but in syntax-directed translation the values can move around the tree in any way the user requires. This is achieved by automatically creating a parse tree and traversing it as required to move the values around.
The automatic creation and use of parse trees also simplifies one more step in the translation process, as we often need to do this one way or another.
Systems like yacc and lex exist that can generate syntax-directed translators, but they are inevitably harder to learn than yacc and lex, although like any other high-level language, experienced users can achieve a lot more. However, just as with low-level languages, with sufficient ingenuity, yacc and lex do anything that the more complex systems can do. (Note that some of the tools being created today, such as JavaCC, if anything have less functionality than lex and yacc.) More importantly, by becoming aware of these more advanced concepts, we can make better use of lex and yacc. I will introduce a few such ideas below.
We have looked at various tools and techniques used in compilers, but:
The questions relevant to the use of lex, yacc, and dictionaries are all about the form of the input text. By contrast, the most important questions relevant to the use of parse trees are about what you want to do with the input text once you have analysed it.
Does the text you are processing have any structure at all?
If so, you will usually want to split it into small pieces (words), so you
will almost always use lex (or something equivalent).
The only other situation I can imagine when it would not be useful would be e.g. trying to deal with FORTRAN, which has a structure, but the boundaries between words are decided by the context they appear in (spaces are not separators, so ``DO 1 I'' might be 3 words in one place and 1 word in another).
Does the text you are processing have a hierarchy of structures e.g. do the
words combine into groups, which combine into larger groups, and so on?
Is this hierarchical structure well-defined (i.e. has a grammar), or can you
create a grammar that accurately captures the required structure?
If so, then yacc should be able to help you.
However, one thing that yacc is not very good at is coping with mistakes in its input. Most modern systems have some facilities to automatically deal with errors, e.g. by skipping or inserting a word, or in more serious cases by jumping to the end of the current grammar rule and skipping input until it finds something that can follow it. In yacc (and in JavaCC) although there are some helpful facilities, they must inserted by hand at suitable points in the grammar.
In some circumstances, such as when you are trying to deal with input that is full of errors (e.g. dealing with HTML) this problem can nullify the benefits of using yacc and cause you to just use lex by itself, particularly if the input is not heavily structured.
Do you have user-defined names, or perhaps a long list of built-in names, to
deal with?
If so, you almost certainly need to use a dictionary of some sort, even if it
is quite simple.
Can there be more than one instance of the same name in the input e.g. are
there blocks and scope rules, or can the same name be used e.g. for a variable
and for a piece of code?
If so, you will need to structure your dictionary correspondingly - e.g. as a
stack for blocks, or with different dictionaries e.g. for variables and for
code.
Do you need to repeatedly scan the input text to be able to completely analyse
it?
e.g. using names before their definition, if the definition can change how
uses of that name have to be translated.
If so, then you will need to use a parse tree so that, having found e.g. the
declaration, you can go back and decide on the detailed translation of the
uses of the name.
If we are starting from scratch, we can usually design a new language to avoid
this difficulty, although we may encounter it if we have to deal with existing
languages. Some modern languages, such as SML, are deliberately defined in
this way to make them easier to use, although harder to translate.
What do you want to do with the input text once you have analysed it?
The more complex this is, the more likely you are to need parse trees:
e.g. significant (global) code optimisation.
e.g. reordering in translations - we can easily translate infix to postfix
subexp1 op subexp2 -> subexp1 subexp2 op
as we read from left to right because we just have to remember the operator, a
single symbol, until we can output it at the end of the rule.
However, to translate postfix to infix
subexp1 subexp2 op -> subexp1 op subexp2
we have to remember the second sub-expression, which can be arbitrarily
complicated, until the end of the rule, so we need to use a parse
tree to remember it (although if we wanted to we could still do all of this
within the yacc actions).
e.g. interpreting the code - if there are control constructs in the language
then the interpreter will have to move around in the analysed input, so there
will have to be some saved version of the input for it to move around in.
This could be a parse tree, or maybe we generate e.g. postfix code and the
interpreter jumps around in that.
If we are using the general concept of attributes, it may be that we end up
needing to use values from places that yacc cannot easily access. If so the
simple solution, as mentioned above, is to build a parse tree including as
many attributes as can be calculated by yacc, and then walk over the tree
calculating any attributes that yacc can't manage.
However, there is an alternative - we can often
rewrite the grammar so that values only need to be passed towards the root of
the tree, and so yacc can calculate all the attributes.
e.g. suppose the language includes declarations
declaration : type idlist ;
idlist : id | idlist ',' id ;
and we want to declare the ``id''s to be of the ``type''. Yacc cannot
pass the ``type'' down into the ``idlist''. We can rewrite the grammar
as:
declaration : type id | declaration ',' id ;
and now the ``type'' information can be used by yacc.
In general, it may not be easy to see how to change the grammar and it may not
even be desirable to do so, as you have to show that your new grammar
recognises the same language, and you have to be able to maintain your code.
Note that in the problem above we want information to flow from left
to right, and we have used left recursion in the solution. Similarly,
if we wanted information to flow from right to left we could use right
recursion, transforming:
declaration : idlist type ;
idlist : id | idlist ',' id ;
into:
declaration : id type | id ',' declaration ;
(If you aren't convinced, try drawing the parse trees for a simple
declaration for each different grammar, and then draw arrows on them
showing how the type information has to move around the tree.)
Most (but by no means all, as the example immediately above shows) of this can be summarised as: if you do anything except process the input left to right as you read it, you need to think about using a parse tree.
A pass is an actual scan of the whole program being translated, that must be completed before the next phase/pass can start.
A typical translator program will usually combine several phases into each pass. The first pass in any translator is (by definition) over the actual input text. If there are any subsequent passes, they may also be over the input text, but more usually they are over some internal form of the text, such as parse trees.
Even complex languages like C can be compiled in one pass, and one pass translators are usually simplest to write. However, if we have decided above that we have to use parse trees, then by definition we are using more than one pass. Thus, most compilers nowadays are multi-pass as they include significant code optimisation.
The heart of the translator will be the syntactic analyser, written using yacc or something equivalent. This will make calls on the lexical analyser, written using lex or some equivalent. The actions associated with each grammar rule in yacc (and/or in lex) will make calls on routines that do all the rest of the translation, such as build and access the dictionary, perform semantic analysis, and generate any output (assuming these are all relevant).
The first pass will be similar to the 1-pass organisation, except that the generated output will be the parse tree (so the actions will need to call tree-building routines), and by definition not all of the routines in the rest of the translator will have been called.
The tree will then be handed on to the second pass, which will be organised around a tree-walking routine. Typically, this will be a depth-first, left-to-right traversal over the whole tree, but circumstances may require the nodes of the tree to be visited in a different order, and some of them may even be visited more than once or not at all. The tree-walking routine will make calls on some more of the routines from the rest of the translator as it visits each node.
If this is a 2-pass program, we are done. Otherwise, the output from the second pass, which may be a modified tree or may be some other data structure altogether, is handed on to the third pass, which is again organised around a routine that walks over the data-structure (e.g. modified parse tree), calling some more of the routines in the rest of the translator.
If we have completed all the passes, we are done. Otherwise, the next pass will be organised in a similar way to the previous pass, although the data structure handed on may be changed yet again. This process continues until all the passes have been performed.
The net effect is, that during each pass, we calculate more of the attributes, until by end of the last pass we have completed all of them and the final result has been output. Note, however, that the ``parse tree'' may be completely unrecognisable as such by the time that we have finished!