Next: Building a Compiler: recognising Up: CS2111: Design and Implementation Previous: Abstract Data-Types, Modules, Classes Contents

Subsections

Language Design and Implementation - Identifiers

To implement identifiers, we have to take account of their static and dynamic properties, as described by the static and dynamic environments (c.f. $\S$ 13- $\S$ 18). Broadly speaking, we use stack frames at run-time to implement dynamic environments, and dictionaries at compile-time to implement static environments. However, as we bind static objects to dynamic objects we blur this boundary, so that e.g. stack frames include static links, and dictionaries include run-time addresses.

Dictionaries

A dictionary contains a set of names used in a piece of code and, for each name, information describing their implicit or explicit declarations. To look up an identifier in a dictionary:

locate the particular name (if present): using any standard technique, such as B-trees or hashing. This part of the dictionary is known as the name-list or symbol-table.
locate (information about) the relevant declaration of the name: most languages permit multiple scopes and /or multiple meanings for a single name in a single scope. This part of the dictionary is known as the property-list(s), of property entries or properties.

We can either use a single dictionary, containing all the names in the source text, each linked to whichever declarations are currently valid,
or several dictionaries e.g. 1 per scope and/or per kind of declaration, so each dictionary contains fewer declarations, or even just one, for each name in it.

In the former case, we search the dictionary for a name, and then for the relevant declaration. In the latter case, we search several dictionaries in turn (e.g. from the current scope outwards) for the name until we find the declaration we need. As most programs have few nested scopes and few kinds of declarations, these repeated searches need not be a serious overhead, and can simplify other actions, such as tidying up at the end of a scope.

In some languages, a new scope starts at each new declaration. This makes multiple dictionaries harder to use but, worse, complicates the relationship between the static and dynamic environments. Implementations often coalesce such scopes to start at e.g. blocks or functions (or even coalesce blocks into the owning function), but perform extra compile-time and/or run-time actions to comply with the language design. Languages like SML, where the same name may be redeclared many times in the same function, cause problems that I shall not discuss here.

Using a dictionary

At the start of each new scope (e.g. function or block)

SPMquot "

Whatever initialisation is necessary.

At each declaration of an identifier

SPMquot "

Search the relevant dictionary to find the name in the namelist, and insert it if missing.
Check that there is not already a declaration (of the same kind?) for the name in the scope.
Insert a new property entry describing the declaration. If one dictionary holds multiple scopes, insert new entries at the head of the list of properties for the name, so declarations in scope are found first.

At each use of an identifier

SPMquot "

Search the relevant dictionary (or dictionaries, in sequence) to find the name, and insert it if missing.
Find the current declaration in scope (of the correct kind) for that name, and insert a new property entry if none is found.
- Some languages allow implicit declarations, so they specify default types etc. Others do not, so we report the error, but generate a property entry to avoid repeating the error message.
- Some languages expect explicit declarations but allow forward references i.e. allow an identifier to be used before it is declared. If the semantics of the forward reference are non-trivial we must use a multi-pass translator (c.f. $\S$ 8.1.1). However, most modern languages permit forward references only in very simple circumstances, so we create a dummy entry to be overwritten by the real declaration later. The dummy entry may point to places where the identifier is used, so e.g. generated code can be corrected.

At the end of each scope

SPMquot "

Check all forward references satisfied.
Some languages require us to move unsatisfied references into the enclosing scope, if one exists, but for most this is simply an error.
Delete all properties that are no longer relevant.
- We may need to keep some properties and/or names e.g. public identifiers of an Abstract Data Type, or if we use a function's parameters to hold its signature.
- We may need to copy some information for run-time debugging.
- We may also remove names that no longer have any associated properties to speed up searches.

Property entries

Some information will be used by most or all declaration kinds:

kind: variable/constant/function/type/etc.
scope: $\rightarrow$ property entry of the owning function or block,
and/or the depth of nested scopes
type: (see below)
$\rightarrow$ name in name-list
$\rightarrow$ any previous declarations of the same name in the dictionary
$\rightarrow$ next declaration in the same scope: (?)

Other information will be different for each declaration kind:

constant

For manifest constants, we determine the value at compile time and place that value in the property entry. Otherwise, we treat it as a read-only variable initialised at run-time.

variable

To plant code, we need the run-time address, usually held as (data frame, offset in frame). The data frame is often a stack frame, but could be a global variable frame.

A stack frame can be described by the scope information.

The offset is copied from the current size of the data frame, which we then increment by the size of the variable, derived from the type information.
We may need extra information for other semantic rules, such as not modifying a loop control variable inside the loop.

parameter

The information held is similar to that for a variable.
If the language has more than one parameter mechanism (c.f. $\S$ 12.2), or there are constraints on e.g. modifying a value parameter, we must note how this applies to each parameter.

record/struct fields

Also very similar to variables, but the data frame refers to the owning record. We calculate the offsets for fields of unions (and the total size of the union frame) slightly differently.
We must save extra information about fields of discriminated unions, to check if accesses are legal.

labels

Consist just of the code address, although we usually need to cope with forward references as described above.

programs, subprograms and blocks

For a function, we need its signature: we could use the type information for the result, and a link to the list of parameters for the rest of the signature. However, this is not sensible if a function signature can be used as a type. We also need the size of the stack frame. We may want to point to a list of all properties defined within the scope to make processing easier. Finally, we need the code address as for labels.
An entry for a procedure would be just like that for a function, but with a null result type.
A program entry is usually needed to provide a scope for global declarations, and to hold any language-specific information.
We do not need entries for blocks if they are simply compound statements, but if they have associated scope or extent rules they may need entries like those for functions, but without signature or code address (and usually without an associated name!)

types

The only types some languages have are a few built-in primitive types and vectors, and we need only remember this information and the resulting size (e.g. in bytes) as part of the property entry for the variable, parameter or constant.
Most modern languages allow types to be created using e.g. multi-dimensional arrays, records, pointers or recursive types, subranges, sets, functions, abstract data types (c.f. $\S$ 15, $\S$ 16, $\S$ 18). If the language forces a complex declaration to be broken up into several simpler declarations, then we can place each declaration in a single property entry. However, most languages permit individual declarations to be arbitrarily complex, and the implementor must split each declaration up, creating entries as if for a whole series of (anonymous) type declarations as well as the declaration of the named type or variable or whatever.

Example of equivalent anonymous and explicit type declarations in C:


  char (*(*x())[])() { . . . }

  typedef char anon1();
  typedef anon1 *anon2;
  typedef anon2 anon3[];
  typedef anon3 *anon4;
  anon4 x() { . . . }

and possible dictionary entries (many details omitted):

namelist	properties

char	$\fbox { type $\vert$\ description of char }$

anon1	$\fbox { type $\vert$\ signature $\vert$\ * $\vert$\ () }$

anon2	$\fbox { type $\vert$\ pointer $\vert$\ * }$

anon3	$\fbox { type $\vert$\ array $\vert$\ ? $\vert$\ * }$

anon4	$\fbox { type $\vert$\ pointer $\vert$\ * }$

x	$\fbox { function $\vert$\ * $\vert$\ () }$

Keywords and predefined identifiers.

All languages include words that could be mistaken for identifiers, but are actually defined as part of the language. These include words used to describe data and control structures, typenames, conversions, values, input/output, operators, and maths functions. These words are treated in two main ways:

Keywords (reserved words) are built into the grammar, and so cannot have their meaning changed by the user.

Predefined identifiers are otherwise normal identifiers, that behave as if they were declared before the start of each program. Their meanings can therefore be changed just like any other name, although this is rarely sensible. These extra declarations can be handled simply by pre-inserting the identifiers into the dictionary.

Keywords are easy to recognise by using a different LEX grammar rule for each, but this tends to make the resulting analyser very large (and slow for LEX to generate). Therefore, some analysers initially do not distinguish between keywords and identifiers, but then search a special dictionary to recognise keywords before treating everything else as an identifier. This makes the analyser smaller but slower.

Binding

The binding of static to dynamic objects can be eager (i.e. as soon as possible) or lazy (i.e. as late as possible).

Most high-level languages, intended for compilation, assume eager binding, although some languages allow for constants and/or arrays whose value or size cannot be calculated at compile time, or allow the binding of external library functions to be delayed.

Many interpreted languages, such as shells, assume lazy binding, so we may not even know if a variable or function exists until it is used at run-time, let alone what type it might be.

Eager binding allows for more checking to be done once and for all, and for faster run-times. Lazy binding allows for more feedback to the user, and may be helpful during interactive debugging.

The description of dictionaries above assumes we are using eager binding; lazy binding essentially requires us to move some use of the dictionary to run-time.

Exercises

( $\dag$ indicates harder problems)

Some languages treat upper-case and lower-case names as distinct (e.g. Pascal), some as being the same (e.g. Ada), and some only allow one case. How would you recognise and match names in an Ada compiler?
Some languages ignore characters in names past some limit. e.g. in ANSI C, only the first 6 characters of extern identifiers have to be checked (although more can be) and their case can be ignored. How could you use LEX and YACC to check ANSI C programs for identifiers that might be confused by a compiler/linker that only provided the minimum facilities? How could you use #define to avoid any problems thus detected?
List all the data structures you can think of that could possibly be used for a namelist - which of them would you prefer to use and why? Can you imagine any circumstances in which you would pick a different one?
What identifiers do LEX and YACC (i.e. the parser generators themselves, not the parsers they generate) each have to deal with, and so what do they need in the way of dictionaries?
What about other Unix tools that you are familiar with, like bash?
Check your favourite compilers to see what happens if you misspell an identifier. Do you get just one error message? If you repeatedly misspell it in the same way, are the error messages repeated?
Write a set of C declarations that uses all the different kinds of declarations, and all the sorts of types, that you can think of, and then give the corresponding dictionary entries.
$\dag$ How would you change the semantics of C to simplify semantic analysis? What useful features of the language would be lost by this simplification?

Readings

Louden: chs. 4.7, 5.3, 6.6
Capon & Jinks: chs. 3.3, 7.3, 9
Aho, Sethi & Ullman: chs. 2.7, 6.1-6.6, 7.6

Next: Building a Compiler: recognising Up: CS2111: Design and Implementation Previous: Abstract Data-Types, Modules, Classes Contents

Pete Jinks
1999-09-30