next up previous
Next: Composing Programs Up: CS2121: The Implementation and Previous: Parse Trees


Variables and Types

Variables and Types

A simple example

Suppose we are implementing a simple language to drive a program, such as a calculator or a shell. We would start simply, implementing expressions or commands, but the next facility users are likely to want is variables: ``nothing complicated, just numbers with simple names, so no need for declarations''. How would we cope?

The simplest implementation would be to use a list to hold all the variables - a namelist. Each entry would have (a pointer to) a string containing the name of the variable, and space to hold its value (e.g. an int). Each time we see a variable name, we look it up in the namelist - if it is not already there, we create a new entry for the name, and some initial value (e.g. zero). (We might even sort the list or use some extra mechanism such as a tree or a hashtable to speed up the search.) We then convert the name into a pointer to its entry in the namelist and e.g. return that from LEX to YACC via yylval. For example:

  \fbox{namelist \dots }
a = 1 + 2 * 3 ; create new entry in namelist for "a"
  \fbox{namelist \dots ,} \fbox{\lq\lq a'', 0}
  [name,pointer to entry for a] [=] [number,1] [+] [number,2] [*] [number,3] [;]
  do the calculation as normal, then set a's value to the result
  \fbox{namelist \dots ,} \fbox{\lq\lq a'', 7}
a / 2 ; [name, pointer to entry for a] [/] [number,2] [;]
  get a's value, divide by 2, output result ``3''

Multiple types

Suppose we wanted to extend our calculator further, and allow for both scalar values and (indeterminate length) arrays. We would need to have some type information in each namelist entry, in this case simply indicating whether the variable is a scalar or an array (and for an array, how big it currently is). It would not be sensible to store the value of an array variable directly in the namelist. Instead we would just have the location of the variable in the namelist, pointing to some other area of memory actually holding the value. e.g. ``a'' as before, ``b'' an array, currently of 100 elements:

\fbox{\lq\lq a'', scalar, 7}
\fbox{\lq\lq b'', array, 100, $\rightarrow$} $\rightarrow$ \fbox{\fbox{b[0]}\fbox{b[1]} \dots \fbox{b[99]}}

Compilers and interpreters

So far, we have assumed that we are trying to write something like a calculator or a shell, where the expressions and commands are interpreted directly as they are read. Instead, suppose we are trying to write a translator, such as a compiler, where we don't obey the commands immediately but instead convert them e.g. into assembly code for execution later.

In this case, we can't have the values of the variables in their namelist entries - their values will not be known until the assembly code is run, long after the compiler has finished work. Therefore, the entries for all the variables (not just arrays) will include a location rather than a value i.e. the address that the variable will appear in at run-time.

In fact, because of the complexity of the process of compiling and running programs written in real languages like C or Java, the location is not simply a numerical address. Instead, it usually consists of a pair: some indication of a data area, and an offset into that data area. The data area might indicate e.g. the area where the class variables of a particular class will go, or a stack frame where the local variables of a method will go.

\fbox{\lq\lq a'', type information, data area, offset}
\fbox{\lq\lq b'', type information, data area, offset}


Declarations only exist in programming languages so that compilers can check things - it is perfectly possible to program without them (e.g. assembly code), just a lot harder to get everything right by yourself:

to spot spelling mistakes

to check what kind of thing the identifier represents - is this a variable you can assign to, or a constant you can only take a value from, or a method (function) you can only call.

to check and sometimes modify the types - inheritance and overloading; an integer can be automatically converted to a float.

to check that local identifiers are only used in a part of the whole program (scope)

A compiler also has to bind the identifier to (i.e. replace each use of the identifier by) e.g. a value (for a constant) or an address (for a variable or a method) to be used at run time.

To a compiler, a declaration like float fred; means something like this: ``Dear compiler, meet fred. fred is a variable, so I only want to use it for values - please don't let me forget and use it as a method. fred is a float, so I will use it for arithmetic and avoid using it like an object. Also, if I use ints with fred, please convert the ints to floats for me. fred is only for use in this class or method, so please don't let me use it anywhere else. I will do my best to help by spelling fred's name correctly at all times. By the way, could you allocate some memory for fred - no rush, I don't need the memory until I run the program, and that is billions of nanoseconds away.''

Compile-time visibility - Scope

Identifier visibility, usually known as scope, is a property of the written program, so that it all gets sorted out by the compiler at compile time. In early programming languages, identifiers and variables had essentially unlimited visibility. However, it soon became clear that it was important to find ways of restricting visibility to make the best use of limited computer memories and, more importantly, limited human memories.

How can we safely create new meanings for names in a part of the program without having to worry whether those names are used anywhere else?
At run-time, how can we safely reuse memory space for different purposes in different parts of a program?
The best answers to these questions known so far, are those found in languages like C, and even more so in Java: the scope of an identifier is usually a class or method body, or a { } block. Usually the same name can be used in several different declarations, each with its own scope. For example:

  \fbox{ \dots }
{ float A; int B, C[10]; \fbox{ \dots , \lq\lq A'' , \lq\lq B'' , \lq\lq C'' }
// can use A, B and C here  
{ char P; int Q, R[11]; \fbox{ \dots , \lq\lq A'' , \lq\lq B'' , \lq\lq C'' , \lq\lq P'' , \lq\lq Q'' , \lq\lq R'' }
// can use A, B, C, P, Q and R here  
} \fbox{ \dots , \lq\lq A'' , \lq\lq B'' , \lq\lq C'' }
// can use A, B and C here  
{ float A[6], C; int D; \fbox{ \dots , [\lq\lq A''] , \lq\lq B'' , [\lq\lq C''] , \lq\lq A'' , \lq\lq C'' , \lq\lq D'' }
// can use A, B, C and D here, but A is an array and C is not !  
} \fbox{ \dots , \lq\lq A'' , \lq\lq B'' , \lq\lq C'' }
// can use A, B and C here - C is an array and A is not !  
} \fbox{ \dots }

We have to arrange things so that, when we see the second declarations for A and C, they are visible in the namelist but the previous declarations for A and C are not. However, when we leave that block, the original declarations for A and C must become visible again in the namelist.

Run-time memory - Stack frames

The layout of memory at different moments during run-time looks similar to the layout of the namelist, if we include those names temporarily hidden by fresh declarations:

A B C[0] . . . . . . . . C[9]                            
A B C[0] . . . . . . . . C[9] P Q R[0] . . . . . . . . . R[10]  
A B C[0] . . . . . . . . C[9]                            
A B C[0] . . . . . . . . C[9] A[0] . . . . A[5] C D            
A B C[0] . . . . . . . . C[9]                            

However, remember that the namelist only exists during compile-time, and the values of variables only exist during run-time, so don't confuse them!

The picture gets more complicated when one method calls another, as the identifiers of the first method are not visible inside the second, even though they still exist when the other method is called at run time. Allowing recursion brings further complications, as at any one moment during run time there may be several memory locations representing different recursive instances of the same identifier.

Other language features, such as allowing functions to be declared inside other functions, and allowing functions to themselves be passed as parameters, can lead to yet more complexity. There is also another kind of memory usage (heap) which bypasses the use of identifiers and some compile-time checks by using e.g. malloc instead.

Getting the use of memory right during run time is dealt with in CS2042 - I will mainly deal with getting the visibility right at compile time. Getting both right is actually quite straightforward - it is only justifying it that requires looking at all sorts of complex examples, so we will skip them and go straight to the answer:

The compiler has to assign variables to run time memory locations. However, the complications mentioned above mean that these locations cannot be completely fixed at compile time. Instead, we work out everything we can in the compiler, but leave a small amount of work until run time:

at the start of a method/function
grab a chunk of memory and set a register (namebase or framebase) pointing to it - the compiler ensures that the chunk is exactly big enough for all the variables of the method/function
in the body of the method/function
access variables at fixed offsets (decided by the compiler) from this register
at the end of the method/function
release the chunk of memory

Because of the pattern of use, the memory can be and usually is on the stack - thus the chunks of memory are known as stack-frames. Referring back to the example above, we could either allocate the maximum memory needed at the start of the method/function, or we could allocate the minimum and adjust the size of the stack-frame as we enter and leave the inner blocks. In practice, for simplicity and speed we normally do the former.

Implementing scope - Dictionary

If we are dealing with a computer language that has similar sorts of scope rules to C and Java, a simple namelist is no longer sufficient. Instead, the namelist is just part of a more complex structure, the dictionary.

We arrange things so that every identifier appears in the namelist exactly once, and each identifier points to the most recent declaration for that name. We manipulate these pointers, and maintain extra links between entries in the dictionary, so that the correct declarations can be found.

  namelist $$dictionary entries
{ float A; int B, C[10];   $$ 
  \fbox{ \lq\lq A'' } $$ \fbox{ float }
  \fbox{ \lq\lq B'' } $$\fbox{ int }
  \fbox{ \lq\lq C'' } $$ \fbox{ int[10] }
{ char P; int Q, R[11];   $$ 
  \fbox{ \lq\lq A'' } $$ \fbox{ float }
  \fbox{ \lq\lq B'' } $$\fbox{ int }
  \fbox{ \lq\lq C'' } $$ \fbox{ int[10] }
  \fbox{ \lq\lq P'' } $$\fbox{ char }
  \fbox{ \lq\lq Q'' } $$\fbox{ int }
  \fbox{ \lq\lq R'' } $$ \fbox{ int[11] }
}   $$ 
  \fbox{ \lq\lq A'' } $$ \fbox{ float }
  \fbox{ \lq\lq B'' } $$\fbox{ int }
  \fbox{ \lq\lq C'' } $$ \fbox{ int[10] }
  \fbox{ \lq\lq P'' } $$ 
  \fbox{ \lq\lq Q'' } $$ 
  \fbox{ \lq\lq R'' } $$ 
{ float A[6], C; int D;   $$ 
  \fbox{ \lq\lq A'' } $$ \fbox{ float[6] }$\rightarrow$ \fbox{ float }
  \fbox{ \lq\lq B'' } $$\fbox{ int }
  \fbox{ \lq\lq C'' } $$ \fbox{ float }$\rightarrow$ \fbox{ int[10] }
  \fbox{ \lq\lq D'' } $$\fbox{ int }
  \fbox{ \lq\lq P'' } $$ 
  \fbox{ \lq\lq Q'' } $$ 
  \fbox{ \lq\lq R'' } $$ 
}   $$ 
  \fbox{ \lq\lq A'' } $$ \fbox{ float }
  \fbox{ \lq\lq B'' } $$\fbox{ int }
  \fbox{ \lq\lq C'' } $$ \fbox{ int[10] }
  \fbox{ \lq\lq D'' } $$ 
  \fbox{ \lq\lq P'' } $$ 
  \fbox{ \lq\lq Q'' } $$ 
  \fbox{ \lq\lq R'' } $$ 
}   $$ 

In particular, at the end of each scope (block), we need to be able to get rid of any dictionary entries that correspond to declarations made within that block, and return the dictionary to how it was before we started the block. The only exceptions are that we need to remember the parameters of a method or function (although not their names) so that we can check that calls have the correct arguments, and for a language like Java, the interface to each class (the public methods etc.). We do not need to discard the names from the namelist - the unused names don't point anywhere, so they can't cause any confusion.

User-defined types and classes

There are several different kinds of names that need to go into the dictionary - variables, constants, classes or types, methods or functions, etc.. For each kind of name, we will need to save slightly different information about it. For a variable (or parameter), we need the type and the address; for a constant, the type and the value (or address for a final object); for a function, the signature (parameter list and result type) and address.

For a type, in a language like C, we may need to save an indeterminate amount of information, which we can deal with most easily as a series of anonymous type declarations.

e.g. char (*(*x())[])() { }
can be treated as if it was really:
  typedef char anon1();
  typedef anon1 *anon2;
  typedef anon2 anon3[];
  typedef anon3 *anon4;
  anon4 x() { }

  \fbox{ char } $$ \fbox{ type $\vert$ built-in type char }
  \fbox{anon1} $$ \fbox{ type $\vert$ signature $\vert \uparrow \vert$ () }
  \fbox{anon2} $$ \fbox{ type $\vert$ pointer $\vert \uparrow $ }
  \fbox{anon3} $$ \fbox{ type $\vert$ array $\vert$ ? $\vert \uparrow $ }
  \fbox{anon4} $$ \fbox{ type $\vert$ pointer $\vert \uparrow $ }
  \fbox{x} $$ \fbox{ function $\vert \uparrow\vert$ () }

In a language like Java, a class is implemented like a struct (record), with a field for each variable or method, containing the address (for a method or class variable) or value (for an instance variable, although this will just be a pointer to the real value for an object). The only complication in the dictionary is that we also need to note the accessibility (public, private, etc.) for each item.


The part of a compiler that deals with declarations and types is usually called the semantic analyser. It will create and use a dictionary, as outlined above, checking that every identifier that is used in a program has been declared, and that there are not two declarations for the same name in the same scope. It will also use the information from the declarations to check that, for example, the user does not try to assign a value to a constant, method or class.

Finally, the semantic analyser will copy information about the types of identifiers into the parse-tree created by the parser to check that, for example, the user does not try to add a Vector to an integer. More importantly, it uses the types to decide whether a + is acting on two integers, in which case it is addition, or on two Strings, in which case it is concatenation, or on an integer and a String, in which case it will modify the parse-tree to indicate the extra conversion of the integer to a String before concatenation.



Louden: chs. 4.7, 5.1-5.6, 6.6, 7.5
Bal & Grune: chs. 2.2.1, 2.3, 2.3.1
Aho, Sethi & Ullman: chs. 2.7, 6.1-6.6, 7.1-7.4, 7.6-7.9, 9.3
Capon & Jinks: chs. 3.3, 7.3, 9

next up previous
Next: Composing Programs Up: CS2121: The Implementation and Previous: Parse Trees
Pete Jinks