next up previous
Next: Variables and Types Up: CS2121: The Implementation and Previous: Ambiguous and confusing grammars


Parse Trees

The previous sections show how to analyse (parse) the structure of typical computer languages, but what do we do with the results of the analysis - how do we extract the meaning from the representation and make whatever further use of it is necessary?

Language design and implementation

In the examples so far the meaning has been trivial or irrelevant and any further processing was simply embedded in the parser actions. However, what if the language we are processing, or the processing itself, is more complex?

A major source of language complexity is identifiers; processing declarations, matching identifier usage to their declarations, and checking that the usage is legal. We will be discussing this for much of the remainder of this half of the course unit, but for the time being note that once we have a set of routines to maintain and query a dictionary, we can simply embed calls to them in the parser actions.

A further source of complexity in some languages is that it is not possible to decide the type of an identifier before its first use, so we must scan the program more than once, initially to pick up type information, and finally to process the uses of identifiers. Some older languages, such as Algol 60, were designed this way by mistake, before the problem was properly understood. Some more recent languages, such as SML, were deliberately designed this way to simplify programming.

A source of processing complexity only loosely related to language complexity is found in interpreters (e.g. shells). If the language to be interpreted contains control structures (particularly loops), the interpreter has to repeatedly scan portions of input programs. (Compilers do not have this problem.)

A major source of processing complexity, not resulting from language complexity, is found in optimising compilers; good quality code results from the compiler repeatedly scanning and transforming the program.


A pass is a single scan over the input program that must be completed before the next pass can start: for example, one pass creates information for the next pass, and information from later in the program may be needed earlier on in the next pass. Multiple passes sometimes literally re-scan the source text, but usually the first pass transforms the source into some more compact representation - usually parse trees - to simplify the later passes. (A scanner created by LEX and a parser created by YACC, working directly together, form a single pass, not one pass each.)

One-pass language processors, where all further processing is performed via the parser actions, are usually simpler than multi-pass language processors, and so are preferred by implementors. It is common for implementors to classify computer languages by the number of passes necessary to completely analyse them: e.g. Pascal is a 1-pass language, Algol 60 is 2-pass.

Concrete and Abstract

The concrete grammar describes the actual text written by the user. If we created parse trees that contained all the text, they would be concrete parse trees. However, once we have used the punctuation and keywords to recognise and check the input text, most of them can be discarded as they have served their purpose. We need only retain enough information in the parse trees to preserve the underlying meaning. These simplified trees are known as abstract parse trees, and their structure is described by an abstract grammar. If we just refer to parse trees we normally mean this simplified and condensed form.

e.g. creating parse-trees for C:

  statement
  : while '(' exp ')' statement          {$$= make_while ($3, $5);}
  | do statement while '(' exp ')' ';'   {$$= make_do ($2, $5);}
  | '{' statements '}'                   {$$= make_block ($2);}
  | . . . ;
  statements
  : statement                            {$$= list ($1); /* [$1] */}
  | statements statement                 {$$= append ($1, $2); /* $1 @ [$2] */}
  ;

We can change part of the grammar to use right recursion, so that the lists are built in the correct order using cheap cons operations, as long as we can still ensure the correct order for the bulk of the reductions and for any actions other than tree-building, and as long as we do not overflow the parser stack. This will usually be safe, simply because the input is always read from left to right and we will naturally separate the detailed grammar of the list items (statement) from that of the list itself (statements). e.g.

  statements
  : statement                           {$$= list ($1); /* [$1] */}
  | statement statements                {$$= cons ($1, $2); /* $1 :: $2 */}
  ;

If this is not possible, or we are worried that even with these precautions the parser stack might still overflow, we just have to use left recursion and append.
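The list-building helpers used in these actions are not defined in the notes; here is a minimal sketch, assuming the list items are statement trees held as void pointers (the names list, cons and append match the comments above, but this particular representation is an assumption):

```c
#include <stdlib.h>

/* A singly-linked list of parse (sub)trees. */
typedef struct node {
    void *item;               /* a statement parse tree */
    struct node *next;
} node;

static node *list (void *item) {              /* [item] */
    node *n = malloc (sizeof *n);
    n->item = item;
    n->next = NULL;
    return n;
}

static node *cons (void *item, node *rest) {  /* item :: rest, O(1) */
    node *n = list (item);
    n->next = rest;
    return n;
}

static node *append (node *l, void *item) {   /* l @ [item], O(length of l) */
    node *p = l;
    while (p->next) p = p->next;
    p->next = list (item);
    return l;
}
```

The cost difference is why right recursion is attractive here: cons is constant time, whereas append must walk the whole list on every reduction.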

Building parse trees for expressions

 %{
 #include ""
 /* very silly implementation, just sufficient for this example */
 static char findname (char *yytext) {return yytext [0];}
 %}
 %%
 [0-9]+                         {yylval.a_number = atoi(yytext); return number;}
 [A-Za-z][A-Za-z0-9]*           {yylval.a_variable = findname (yytext);
                                 return variable;}
 [ \t\n]                        ;
 [-+*/( );]                     return yytext[0];
 .                              {ECHO; yyerror ("unexpected character");}
 %%
 int yywrap (void) {return 1;}

 %{
 #include <stdio.h>
 #include <stdlib.h>
 enum treetype {operator_node, number_node, variable_node};
 typedef struct tree {
   enum treetype nodetype;
   union {
     struct {struct tree *left, *right; char operator;} an_operator;
     int a_number;
     char a_variable;
   } body;
 } tree;
 static tree *make_operator (tree *l, char o, tree *r) {
   tree *result= (tree*) malloc (sizeof(tree));
   result->nodetype= operator_node;
   result->body.an_operator.left= l;
   result->body.an_operator.operator= o;
   result->body.an_operator.right= r;
   return result;
 }
 static tree *make_number (int n) {
   tree *result= (tree*) malloc (sizeof(tree));
   result->nodetype= number_node;
   result->body.a_number= n;
   return result;
 }
 static tree *make_variable (char v) {
   tree *result= (tree*) malloc (sizeof(tree));
   result->nodetype= variable_node;
   result->body.a_variable= v;
   return result;
 }
 static void printtree (tree *t, int level) {
 #define step 4
   if (t)
     switch (t->nodetype) {
       case operator_node:
         printtree (t->body.an_operator.right, level+step);
         printf ("%*c%c\n", level, ' ', t->body.an_operator.operator);
         printtree (t->body.an_operator.left, level+step);
         break;
       case number_node:
         printf ("%*c%d\n", level, ' ', t->body.a_number);
         break;
       case variable_node:
         printf ("%*c%c\n", level, ' ', t->body.a_variable);
         break;
     }
 }
 %}

 %union {
   int a_number;
   char a_variable;
   tree* a_tree;
 }
 %start input
 %token <a_number> number
 %token <a_variable> variable
 %type <a_tree> exp term factor
 %%
 input  : exp ';'               {printtree ($1, 1);}
        ;
 exp    : '+' term              {$$ = $2;}
        | '-' term              {$$ = make_operator (NULL, '~', $2);}
        | term                  {$$ = $1;}
        | exp '+' term          {$$ = make_operator ($1, '+', $3);}
        | exp '-' term          {$$ = make_operator ($1, '-', $3);}
        ;
 term   : factor                {$$ = $1;}
        | term '*' factor       {$$ = make_operator ($1, '*', $3);}
        | term '/' factor       {$$ = make_operator ($1, '/', $3);}
        ;
 factor : number                {$$ = make_number ($1);}
        | variable              {$$ = make_variable ($1);}
        | '(' exp ')'           {$$ = $2;}
        ;
 %%
 void yyerror (char *s) {fprintf (stderr, "%s\n", s);}
 int main (void) {return yyparse();}
Example input expression, and output equivalent fully-bracketed expression and parse tree:
((1 + 2) * 3 / (- a) - (+ b)) * (c + d + e * 4) ;



Up until now I have been talking about the parse tree for each expression, giving the impression that a single run of a compiler might build many such trees. You can now begin to see that this is only part of the story, as the parse trees for expressions are really just subtrees in bigger trees representing individual functions, which are themselves parts of a single tree representing the whole program.
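Once built, such trees are then consumed by later passes. As a hedged illustration of what a later pass might look like, here is a sketch of an evaluator over the same node shape as the example above (the code is repeated so the sketch is self-contained; the field is named op rather than operator, and the env array mapping 'a'..'z' to values is an assumption, not part of the notes' code):

```c
#include <stdlib.h>

/* Same tree shape as the yacc example above. */
enum treetype {operator_node, number_node, variable_node};
typedef struct tree {
  enum treetype nodetype;
  union {
    struct {struct tree *left, *right; char op;} an_operator;
    int a_number;
    char a_variable;
  } body;
} tree;

static tree *make_operator (tree *l, char o, tree *r) {
  tree *t = malloc (sizeof *t);
  t->nodetype = operator_node;
  t->body.an_operator.left = l;
  t->body.an_operator.op = o;
  t->body.an_operator.right = r;
  return t;
}
static tree *make_number (int n) {
  tree *t = malloc (sizeof *t);
  t->nodetype = number_node;
  t->body.a_number = n;
  return t;
}
static tree *make_variable (char v) {
  tree *t = malloc (sizeof *t);
  t->nodetype = variable_node;
  t->body.a_variable = v;
  return t;
}

static int env[26];   /* hypothetical values of variables 'a'..'z' */

static int eval (tree *t) {
  switch (t->nodetype) {
    case number_node:   return t->body.a_number;
    case variable_node: return env[t->body.a_variable - 'a'];
    case operator_node: {
      int r = eval (t->body.an_operator.right);
      if (t->body.an_operator.op == '~')    /* unary minus: left is NULL */
        return -r;
      {
        int l = eval (t->body.an_operator.left);
        switch (t->body.an_operator.op) {
          case '+': return l + r;
          case '-': return l - r;
          case '*': return l * r;
          case '/': return l / r;
        }
      }
    }
  }
  return 0;
}
```

A compiler would walk the same tree in much the same way, but emit code at each node instead of computing a value; an interpreter with loops would simply re-walk the relevant subtree on each iteration.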




Reading

Louden: chs. 4.3, 4.4
Capon & Jinks: chs. 8.5 (fig. 8.20), 10.2 (fig. 10.3)
Aho, Sethi & Ullman: ch. 5.2

Pete Jinks