next up previous
Next: Variables and Types Up: CS2121: The Implementation and Previous: Ambiguous and confusing grammars


Parse Trees

The previous sections show how to analyse (parse) the structure of typical computer languages, but what do we do with the results of the analysis - how do we extract the meaning from the representation and make whatever further use of it is necessary?

Language design and implementation

In the examples so far the meaning has been trivial or irrelevant and any further processing was simply embedded in the parser actions. However, what if the language we are processing, or the processing itself, is more complex?

A major source of language complexity is identifiers; processing declarations, matching identifier usage to their declarations, and checking that the usage is legal. We will be discussing this for much of the remainder of this half of the course unit, but for the time being note that once we have a set of routines to maintain and query a dictionary, we can simply embed calls to them in the parser actions.

A further source of complexity in some languages is that it is not possible to decide the type of an identifier before its first use, so we must scan the program more than once, initially to pick up type information, and finally to process the uses of identifiers. Some older languages, such as Algol 60, were designed this way by mistake, before the problem was properly understood. Some more recent languages, such as SML, were deliberately designed this way to simplify programming.

A source of processing complexity only loosely related to language complexity is found in interpreters (e.g. shells). If the language to be interpreted contains control structures (particularly loops), the interpreter has to repeatedly scan portions of input programs. (Compilers do not have this problem.)

A major source of processing complexity, not resulting from language complexity, is found in optimising compilers; good quality code results from the compiler repeatedly scanning and transforming the program.


A pass is a single scan over the input program that must be completed before the next pass can start: for example, one pass creates information for the next pass, and information from later in the program may be needed earlier on in the next pass. Multiple passes sometimes literally re-scan the source text, but usually the first pass transforms the source into some more compact representation - usually parse trees - to simplify the later passes. (A scanner created by LEX and a parser created by YACC, working directly together, form a single pass, not one pass each.)

One-pass language processors, where all further processing is performed via the parser actions, are usually simpler than multi-pass language processors, and so are preferred by implementors. It is common for implementors to classify computer languages by the number of passes necessary to completely analyse them: e.g. Pascal is a 1-pass language, Algol 60 is 2-pass.

Concrete and Abstract

The concrete grammar describes the actual text written by the user. If we created parse trees that contained all the text, they would be concrete parse trees. However, once we have used the punctuation and keywords to recognise and check the input text, most of them can be discarded as they have served their purpose. We need only retain enough information in the parse trees to preserve the underlying meaning. These simplified trees are known as abstract parse trees, and their structure is described by an abstract grammar. If we just refer to parse trees we normally mean this simplified and condensed form.

e.g. creating parse-trees for C:

  statement
  : while '(' exp ')' statement          {$$= make_while ($3, $5);}
  | do statement while '(' exp ')' ';'   {$$= make_do ($2, $5);}
  | '{' statements '}'                   {$$= make_block ($2);}
  | . . . ;
  statements
  : statement                            {$$= list ($1); /* [$1] */}
  | statements statement                 {$$= append ($1, $2); /* $1 @ [$2] */}
  ;

We can change part of the grammar to use right recursion, so that the lists are built in the correct order using cheap cons operations, as long as we can still ensure the correct order for the bulk of the reductions and for any actions other than tree-building, and as long as we do not overflow the parser stack. This will usually be safe, simply because the input is always read from left to right and we will naturally separate the detailed grammar of the list items (statement) from that of the list itself (statements). e.g.

  statements
  : statement                           {$$= list ($1); /* [$1] */}
  | statement statements                {$$= cons ($1, $2); /* $1 :: $2 */}
  ;

If this is not possible, or we are worried that even with these precautions the parser stack might still overflow, we just have to use left recursion and append.
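The list-building helpers used in these actions are not defined in the notes; here is a minimal sketch, assuming the list items are statement trees held as void pointers (the names list, cons and append match the comments above, but this particular representation is an assumption):

```c
#include <stdlib.h>

/* A singly-linked list of parse (sub)trees. */
typedef struct node {
    void *item;               /* a statement parse tree */
    struct node *next;
} node;

static node *list (void *item) {              /* [item] */
    node *n = malloc (sizeof *n);
    n->item = item;
    n->next = NULL;
    return n;
}

static node *cons (void *item, node *rest) {  /* item :: rest, O(1) */
    node *n = list (item);
    n->next = rest;
    return n;
}

static node *append (node *l, void *item) {   /* l @ [item], O(length of l) */
    node *p = l;
    while (p->next) p = p->next;
    p->next = list (item);
    return l;
}
```

The cost difference is why right recursion is attractive here: cons is constant time, whereas append must walk the whole list on every reduction.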

Building parse trees for expressions

 %{
 #include ""
 /* very silly implementation, just sufficient for this example */
 static char findname (char *yytext) {return yytext [0];}
 %}
 %%
 [0-9]+                         {yylval.a_number = atoi(yytext); return number;}
 [A-Za-z][A-Za-z0-9]*           {yylval.a_variable = findname (yytext);
                                 return variable;}
 [ \t\n]                        ;
 [-+*/( );]                     return yytext[0];
 .                              {ECHO; yyerror ("unexpected character");}
 %%
 int yywrap (void) {return 1;}

 %{
 #include <stdio.h>
 #include <stdlib.h>
 enum treetype {operator_node, number_node, variable_node};
 typedef struct tree {
   enum treetype nodetype;
   union {
     struct {struct tree *left, *right; char operator;} an_operator;
     int a_number;
     char a_variable;
   } body;
 } tree;
 static tree *make_operator (tree *l, char o, tree *r) {
   tree *result= (tree*) malloc (sizeof(tree));
   result->nodetype= operator_node;
   result->body.an_operator.left= l;
   result->body.an_operator.operator= o;
   result->body.an_operator.right= r;
   return result;
 }
 static tree *make_number (int n) {
   tree *result= (tree*) malloc (sizeof(tree));
   result->nodetype= number_node;
   result->body.a_number= n;
   return result;
 }
 static tree *make_variable (char v) {
   tree *result= (tree*) malloc (sizeof(tree));
   result->nodetype= variable_node;
   result->body.a_variable= v;
   return result;
 }
 static void printtree (tree *t, int level) {
 #define step 4
   if (t)
     switch (t->nodetype) {
       case operator_node:
         printtree (t->body.an_operator.right, level+step);
         printf ("%*c%c\n", level, ' ', t->body.an_operator.operator);
         printtree (t->body.an_operator.left, level+step);
         break;
       case number_node:
         printf ("%*c%d\n", level, ' ', t->body.a_number);
         break;
       case variable_node:
         printf ("%*c%c\n", level, ' ', t->body.a_variable);
         break;
     }
 }
 %}

 %union {
   int a_number;
   char a_variable;
   tree* a_tree;
 }
 %start input
 %token <a_number> number
 %token <a_variable> variable
 %type <a_tree> exp term factor
 %%
 input  : exp ';'               {printtree ($1, 1);}
        ;
 exp    : '+' term              {$$ = $2;}
        | '-' term              {$$ = make_operator (NULL, '~', $2);}
        | term                  {$$ = $1;}
        | exp '+' term          {$$ = make_operator ($1, '+', $3);}
        | exp '-' term          {$$ = make_operator ($1, '-', $3);}
        ;
 term   : factor                {$$ = $1;}
        | term '*' factor       {$$ = make_operator ($1, '*', $3);}
        | term '/' factor       {$$ = make_operator ($1, '/', $3);}
        ;
 factor : number                {$$ = make_number ($1);}
        | variable              {$$ = make_variable ($1);}
        | '(' exp ')'           {$$ = $2;}
        ;
 %%
 void yyerror (char *s) {fprintf (stderr, "%s\n", s);}
 int main (void) {return yyparse();}
Example input expression, and output equivalent fully-bracketed expression and parse tree:
((1 + 2) * 3 / (- a) - (+ b)) * (c + d + e * 4) ;



Up until now I have been talking about the parse tree for each expression, giving the impression that a single run of a compiler might build many such trees. You can now begin to see that this is only part of the story, as the parse trees for expressions are really just subtrees in bigger trees representing individual functions, which are themselves parts of a single tree representing the whole program.
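Once built, such trees are then consumed by later passes. As a hedged illustration of what a later pass might look like, here is a sketch of an evaluator over the same node shape as the example above (the code is repeated so the sketch is self-contained; the field is named op rather than operator, and the env array mapping 'a'..'z' to values is an assumption, not part of the notes' code):

```c
#include <stdlib.h>

/* Same tree shape as the yacc example above. */
enum treetype {operator_node, number_node, variable_node};
typedef struct tree {
  enum treetype nodetype;
  union {
    struct {struct tree *left, *right; char op;} an_operator;
    int a_number;
    char a_variable;
  } body;
} tree;

static tree *make_operator (tree *l, char o, tree *r) {
  tree *t = malloc (sizeof *t);
  t->nodetype = operator_node;
  t->body.an_operator.left = l;
  t->body.an_operator.op = o;
  t->body.an_operator.right = r;
  return t;
}
static tree *make_number (int n) {
  tree *t = malloc (sizeof *t);
  t->nodetype = number_node;
  t->body.a_number = n;
  return t;
}
static tree *make_variable (char v) {
  tree *t = malloc (sizeof *t);
  t->nodetype = variable_node;
  t->body.a_variable = v;
  return t;
}

static int env[26];   /* hypothetical values of variables 'a'..'z' */

static int eval (tree *t) {
  switch (t->nodetype) {
    case number_node:   return t->body.a_number;
    case variable_node: return env[t->body.a_variable - 'a'];
    case operator_node: {
      int r = eval (t->body.an_operator.right);
      if (t->body.an_operator.op == '~')    /* unary minus: left is NULL */
        return -r;
      {
        int l = eval (t->body.an_operator.left);
        switch (t->body.an_operator.op) {
          case '+': return l + r;
          case '-': return l - r;
          case '*': return l * r;
          case '/': return l / r;
        }
      }
    }
  }
  return 0;
}
```

A compiler would walk the same tree in much the same way, but emit code at each node instead of computing a value; an interpreter with loops would simply re-walk the relevant subtree on each iteration.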




Reading

Louden: chs. 4.3, 4.4
Capon & Jinks: chs. 8.5 (fig. 8.20), 10.2 (fig. 10.3)
Aho, Sethi & Ullman: ch. 5.2

Pete Jinks