next up previous
Next: About this document ... Up: dict2 Previous: dict2


Identifiers, Declarations and Dictionaries


Declarations only exist in programming languages so that compilers can check things - it is perfectly possible to program without them (e.g. assembly code), just a lot harder to get everything right by yourself:

Identifiers that represent types are not normally needed after the checking process is finished. For other kinds of identifiers, a compiler also has to bind the identifier to (i.e. replace each use of the identifier by) e.g. a value (for a constant) or an address (for a variable or a function) to be used at run time.

To help it do all these things, a compiler uses a dictionary to keep track of the declarations and to remember the bindings.

Visibility - Scope

Identifier visibility, usually known as scope, is a property of the written program, so that it all gets sorted out by the compiler at compile time.

An identifier which is declared at some point in a program may be used, with its declared meaning, at other points in the program. The scope of the identifier is the region of the text of that program in which it is allowed to be used.

The dictionary must change at the start of the scope of an identifier and again at the end of that scope - usually a function body, or a { } block. Usually the same name can be used in several different declarations, each with its own scope.

In early programming languages, identifiers and variables had essentially unlimited scope. However, it soon became clear that it was important to find ways of restricting scope to make the best use of limited computer memories and, more importantly, limited human memories.

How can we safely reuse memory space for different purposes in different parts of a program? How can we safely create new meanings for names in a part of the program without having to worry whether those names are used anywhere else?

Example C program showing nested block structure:

  int main (. . .)
  {                                            /* start of Block1 */
    float A; int B, C[10];
    . . A; . . B; . . C;
    {                                              /* start of Block2 */
      char P; int Q, R[11];
      . . A; . . B; . . C; . . P; . . Q; . . R;
    }                                              /* end of Block2 */
    . . A; . . B; . . C;
    {                                              /* start of Block3 */
      float A[6], C; int D;
      . . B; . . A; . . C; . . D;
    }                                              /* end of Block3 */
    . . A; . . B; . . C;
  }                                            /* end of Block1 */
Compile time - Scopes of identifiers:
  B        = body of Block1
  P, Q, R  = body of Block2
  A, C     = body of Block1 but omitting Block3 (the first A and C)
  A, C, D  = body of Block3                     (the second A and C)
Run time - Lifetimes (extents) of memory locations accessed by identifiers:
  A, B, C  = duration of Block1 (the first A and C)
  P, Q, R  = duration of Block2
  A, C, D  = duration of Block3 (the second A and C)

Layout of memory at different times during the run:
A B C[0] . . . . . . . . C[9]                            
A B C[0] . . . . . . . . C[9] P Q R[0] . . . . . . . . . R[10]  
A B C[0] . . . . . . . . C[9] A[0] . . . . A[5] C D            

In a single function, both visibility and life-time are nested in similar ways - visibility in the program text at compile time, and life-time in the use of memory during run time. The story gets more complicated when we use more than one function, as the identifiers of one function are not visible inside another, even though they may be alive when the other function is called at run time.

Allowing recursion brings further complications, as at any one moment during run time there may be several memory locations representing different recursive instances of the same identifier. Other language features, such as allowing functions to be declared inside other functions, and allowing functions to themselves be passed as parameters, can lead to yet more complexity. There is also another kind of memory usage (heap) which bypasses the use of identifiers and some compile-time checks by using e.g. malloc instead.

Getting the use of memory right during run time is dealt with in CS2042 - I will mainly deal with getting the visibility right at compile time. Getting both right is actually quite straightforward - it is only justifying it that requires looking at all sorts of complex examples, so we will skip them and go straight to the answer:

The compiler has to assign variables to run time memory locations. However, the complications mentioned above mean that these locations cannot be completely fixed at compile time. Instead, we work out everything we can in the compiler, but leave a small amount of work until run time:

at the start of a function
grab a chunk of memory and set a register (namebase or framebase) pointing to it - the compiler makes sure that the chunk is exactly big enough for all the variables of the function to fit
in the body of the function
access variables at fixed offsets (decided by the compiler) from this register
at the end of the function
release the chunk of memory

Because of the pattern of use, the memory can be and usually is on the stack - thus the chunks of memory are known as stack-frames. Referring back to the C example above, we could either allocate the maximum memory needed at the start of the function, or we could allocate the minimum and adjust the size of the stack-frame as we enter and leave the inner blocks. In practice, for simplicity and speed we normally do the former.

Global variables are not affected by recursion, so they can be placed at a fixed location. However, this can only be decided during linking (when libraries etc. are combined after compilation) or loading (putting the code into memory just before running), so again the compiler treats these as being at fixed offsets from some base memory location to be set later. Similarly, functions and labels will be at fixed offsets from some base location decided during linking or loading (unless we can use relative branches instead of absolute jumps, so we only need to know the differences in the offsets).

Thus, for the compiler, all run time addresses are at known offsets from some base address that cannot be known until the program is running. It therefore holds addresses as pairs (offset, memory space). It also decides how big each of the memory spaces has to be.

The final address may be found using an offset from a register (as for variables in stack frames) or it may be calculated by the linker or loader by actually adding the two numbers together and putting the result into the correct positions in the program code (as for global variables and functions and labels).

Implementing Identifiers - Dictionaries

A dictionary contains a set of names used in a piece of program text and, for each name, information describing their declarations. To look up an identifier in a dictionary:
locate the particular name (if present)
using any standard technique, such as B-trees or hashing.
This part of the dictionary is known as the name-list or symbol-table.
locate (information about) the relevant declaration of the name
most languages permit many instances of the same name to exist at once, for variables in nested scopes, and for fields of different structs. (Some languages, such as C, are misguided enough to permit multiple meanings for a single name in a single scope (e.g. variable, label, struct) but we will ignore this complication.)
This part of the dictionary is known as the property-list(s), of property entries or properties.
Theoretically, we could either use a single dictionary, containing all the names in the source text, each linked to whichever declarations are currently valid, or several dictionaries e.g. 1 per scope, so each dictionary contains just one declaration for each name in it. In practice, the latter can get fiddly when we come to deal with e.g. the fields of each different struct, so I prefer to use a single dictionary. Thus, we search the dictionary for a name, and then for the relevant declaration belonging to that name.

Using a dictionary

At the start of each new scope (e.g. function or block)
At each declaration of an identifier
At each use of an identifier
At the end of each scope

Note that the first thing we do at each declaration or use of an identifier (i.e. every time we see an identifier) is look it up in the namelist. Therefore, this is often done as the identifier is recognised by LEX, which then passes the result of the look-up as the value part of the lexeme (yylval), rather than the string of characters that make up the name.

Property entries

Some information will be used by most or all declaration kinds:
$\rightarrow$ property entry of the owning function or block,
and/or the depth of nested scopes
(see below)
$\rightarrow$ name in name-list
$\rightarrow$ any previous declarations of the same name in the dictionary
$\rightarrow$ next declaration in the same scope
Other information will be different for each declaration kind:
constant (enum)
we determine the value at compile time and place that value in the property entry
To plant code, we need the run-time address, usually held as (data frame, offset in frame). The data frame is often a stack frame, but could be a global variable frame.

A stack frame can be described by the scope information.

At the declaration, the offset is copied from the current size of the data frame, which we then increment by the size of the variable, derived from the type information.

similar to variables.
record/struct and union fields
Also very similar to variables, but the data frame refers to the owning record. We calculate the offsets for fields of unions (and the total size of the union frame) as for a function containing nested blocks (i.e. use the maximum size, not the total of all the sizes).
Consist just of the code address (i.e. code frame + offset), although we usually need to cope with forward references as described above.
programs, subprograms and blocks
For a function, we need its signature: we could use the type information for the result, and a link to the list of parameters for the rest of the signature. However, this is not sensible if a function signature can be used as a type. We also need the size of the stack frame. We may want to point to a list of all properties defined within the scope to make processing easier. Finally, we need the code address as for labels.
A program entry is usually needed to provide a scope for global declarations, and to hold any language-specific information.
We need entries for { } blocks if they can contain declarations as in C - their entries will be like those for functions, but without signature, code address or name.
The only types some languages have are a few built-in primitive types and vectors, and we need only remember this information and the resulting size (e.g. in bytes) as part of the property entry for the variable, parameter or constant.
Most modern languages allow types to be created using e.g. multi-dimensional arrays, records, pointers or recursive types, If the language forces a complex declaration to be broken up into several simpler declarations, then we can place each declaration in a single property entry. However, most languages permit individual declarations to be arbitrarily complex, and the implementor must split each declaration up, creating entries as if for a whole series of (anonymous) type declarations as well as the declaration of the named type or variable or whatever.

Example of equivalent anonymous and explicit type declarations in C:

  char (*(*x())[])() { . . . }

  typedef char anon1();
  typedef anon1 *anon2;
  typedef anon2 anon3[];
  typedef anon3 *anon4;
  anon4 x() { . . . }
and possible dictionary entries (many details omitted):
namelist properties
char \fbox{ type $\vert$\ description of char }
anon1 \fbox{ type $\vert$\ signature $\vert$\ * $\vert$\ () }
anon2 \fbox{ type $\vert$\ pointer $\vert$\ * }
anon3 \fbox{ type $\vert$\ array $\vert$\ ? $\vert$\ * }
anon4 \fbox{ type $\vert$\ pointer $\vert$\ * }
x \fbox{ function $\vert$\ * $\vert$\ () }

Keywords and predefined identifiers.

All languages include words that could be mistaken for identifiers, but are actually defined as part of the language. These include words used to describe data and control structures, typenames, conversions, values, input/output, operators, and maths functions. These words are treated in two main ways:

Keywords (reserved words) are built into the grammar, and so cannot have their meaning changed by the user.

Predefined identifiers are otherwise normal identifiers, that behave as if they were declared before the start of each program. Their meanings can therefore be changed just like any other name, although this is rarely sensible. These extra declarations can be handled simply by pre-inserting the identifiers into the dictionary.

Keywords are easy to recognise by using a different LEX grammar rule for each, but this tends to make the resulting analyser very large (and slow for LEX to generate). Therefore, some analysers initially do not distinguish between keywords and identifiers, but then search a special dictionary to recognise keywords before treating everything else as an identifier. This makes the analyser smaller but slower.


A compiler uses a dictionary at compile time to check the declarations and uses of identifiers and to help it plant code that, at run time, creates stack frames to hold variables. A dictionary holds information about the identifiers derived from their declarations in the program text. A stack frame holds the actual values of its variables at any given moment while the program is running.


($\dag $ indicates harder problems)


Louden: chs. 4.7, 5.1-5.6, 6.6, 7.5
Bal & Grune: chs. 2.2.1, 2.3, 2.3.1
Aho, Sethi & Ullman: chs. 2.7, 6.1-6.6, 7.1-7.4, 7.6-7.9, 9.3
Capon & Jinks: chs. 3.3, 7.3, 9
next up previous
Next: About this document ... Up: dict2 Previous: dict2
Pete Jinks 2002-07-25