next up previous contents
Next: Building a Compiler: recognising Up: CS2111: Design and Implementation Previous: Abstract Data-Types, Modules, Classes   Contents

Subsections


Language Design and Implementation - Identifiers

To implement identifiers, we have to take account of their static and dynamic properties, as described by the static and dynamic environments (c.f. $\S $13-$\S $18). Broadly speaking, we use stack frames at run-time to implement dynamic environments, and dictionaries at compile-time to implement static environments. However, as we bind static objects to dynamic objects we blur this boundary, so that e.g. stack frames include static links, and dictionaries include run-time addresses.

Dictionaries

A dictionary contains a set of names used in a piece of code and, for each name, information describing their implicit or explicit declarations. To look up an identifier in a dictionary:
locate the particular name (if present)
using any standard technique, such as B-trees or hashing. This part of the dictionary is known as the name-list or symbol-table.
locate (information about) the relevant declaration of the name
most languages permit multiple scopes and /or multiple meanings for a single name in a single scope. This part of the dictionary is known as the property-list(s), of property entries or properties.
We can either use a single dictionary, containing all the names in the source text, each linked to whichever declarations are currently valid,
or several dictionaries e.g. 1 per scope and/or per kind of declaration, so each dictionary contains fewer declarations, or even just one, for each name in it.

In the former case, we search the dictionary for a name, and then for the relevant declaration. In the latter case, we search several dictionaries in turn (e.g. from the current scope outwards) for the name until we find the declaration we need. As most programs have few nested scopes and few kinds of declarations, these repeated searches need not be a serious overhead, and can simplify other actions, such as tidying up at the end of a scope.

In some languages, a new scope starts at each new declaration. This makes multiple dictionaries harder to use but, worse, complicates the relationship between the static and dynamic environments. Implementations often coalesce such scopes to start at e.g. blocks or functions (or even coalesce blocks into the owning function), but perform extra compile-time and/or run-time actions to comply with the language design. Languages like SML, where the same name may be redeclared many times in the same function, cause problems that I shall not discuss here.


Using a dictionary

At the start of each new scope (e.g. function or block)
SPMquot "
At each declaration of an identifier
SPMquot "
At each use of an identifier
SPMquot "
At the end of each scope
SPMquot "

Property entries

Some information will be used by most or all declaration kinds:
kind
variable/constant/function/type/etc.
scope
$\rightarrow$ property entry of the owning function or block,
and/or the depth of nested scopes
type
(see below)
$\rightarrow$ name in name-list
$\rightarrow$ any previous declarations of the same name in the dictionary
$\rightarrow$ next declaration in the same scope
(?)
Other information will be different for each declaration kind:
constant
For manifest constants, we determine the value at compile time and place that value in the property entry. Otherwise, we treat it as a read-only variable initialised at run-time.
variable
To plant code, we need the run-time address, usually held as (data frame, offset in frame). The data frame is often a stack frame, but could be a global variable frame.

A stack frame can be described by the scope information.

The offset is copied from the current size of the data frame, which we then increment by the size of the variable, derived from the type information.
We may need extra information for other semantic rules, such as not modifying a loop control variable inside the loop.

parameter
The information held is similar to that for a variable.
If the language has more than one parameter mechanism (c.f. $\S $12.2), or there are constraints on e.g. modifying a value parameter, we must note how this applies to each parameter.
record/struct fields
Also very similar to variables, but the data frame refers to the owning record. We calculate the offsets for fields of unions (and the total size of the union frame) slightly differently.
We must save extra information about fields of discriminated unions, to check if accesses are legal.
labels
Consist just of the code address, although we usually need to cope with forward references as described above.
programs, subprograms and blocks
For a function, we need its signature: we could use the type information for the result, and a link to the list of parameters for the rest of the signature. However, this is not sensible if a function signature can be used as a type. We also need the size of the stack frame. We may want to point to a list of all properties defined within the scope to make processing easier. Finally, we need the code address as for labels.
An entry for a procedure would be just like that for a function, but with a null result type.
A program entry is usually needed to provide a scope for global declarations, and to hold any language-specific information.
We do not need entries for blocks if they are simply compound statements, but if they have associated scope or extent rules they may need entries like those for functions, but without signature or code address (and usually without an associated name!)
types
The only types some languages have are a few built-in primitive types and vectors, and we need only remember this information and the resulting size (e.g. in bytes) as part of the property entry for the variable, parameter or constant.
Most modern languages allow types to be created using e.g. multi-dimensional arrays, records, pointers or recursive types, subranges, sets, functions, abstract data types (c.f. $\S $15, $\S $16, $\S $18). If the language forces a complex declaration to be broken up into several simpler declarations, then we can place each declaration in a single property entry. However, most languages permit individual declarations to be arbitrarily complex, and the implementor must split each declaration up, creating entries as if for a whole series of (anonymous) type declarations as well as the declaration of the named type or variable or whatever.

Example of equivalent anonymous and explicit type declarations in C:


  char (*(*x())[])() { . . . }

  typedef char anon1();
  typedef anon1 *anon2;
  typedef anon2 anon3[];
  typedef anon3 *anon4;
  anon4 x() { . . . }
and possible dictionary entries (many details omitted):
namelist properties
   
char \fbox { type $\vert$\ description of char }
   
anon1 \fbox { type $\vert$\ signature $\vert$\ * $\vert$\ () }
   
anon2 \fbox { type $\vert$\ pointer $\vert$\ * }
   
anon3 \fbox { type $\vert$\ array $\vert$\ ? $\vert$\ * }
   
anon4 \fbox { type $\vert$\ pointer $\vert$\ * }
   
x \fbox { function $\vert$\ * $\vert$\ () }

Keywords and predefined identifiers.

All languages include words that could be mistaken for identifiers, but are actually defined as part of the language. These include words used to describe data and control structures, typenames, conversions, values, input/output, operators, and maths functions. These words are treated in two main ways:

Keywords (reserved words) are built into the grammar, and so cannot have their meaning changed by the user.

Predefined identifiers are otherwise normal identifiers, that behave as if they were declared before the start of each program. Their meanings can therefore be changed just like any other name, although this is rarely sensible. These extra declarations can be handled simply by pre-inserting the identifiers into the dictionary.

Keywords are easy to recognise by using a different LEX grammar rule for each, but this tends to make the resulting analyser very large (and slow for LEX to generate). Therefore, some analysers initially do not distinguish between keywords and identifiers, but then search a special dictionary to recognise keywords before treating everything else as an identifier. This makes the analyser smaller but slower.


Binding

The binding of static to dynamic objects can be eager (i.e. as soon as possible) or lazy (i.e. as late as possible).

Most high-level languages, intended for compilation, assume eager binding, although some languages allow for constants and/or arrays whose value or size cannot be calculated at compile time, or allow the binding of external library functions to be delayed.

Many interpreted languages, such as shells, assume lazy binding, so we may not even know if a variable or function exists until it is used at run-time, let alone what type it might be.

Eager binding allows for more checking to be done once and for all, and for faster run-times. Lazy binding allows for more feedback to the user, and may be helpful during interactive debugging.

The description of dictionaries above assumes we are using eager binding; lazy binding essentially requires us to move some use of the dictionary to run-time.

Exercises

($\dag $ indicates harder problems)

Readings

Louden: chs. 4.7, 5.3, 6.6
Capon & Jinks: chs. 3.3, 7.3, 9
Aho, Sethi & Ullman: chs. 2.7, 6.1-6.6, 7.6
next up previous contents
Next: Building a Compiler: recognising Up: CS2111: Design and Implementation Previous: Abstract Data-Types, Modules, Classes   Contents
Pete Jinks
1999-09-30