Next: About this document ...
Up: dict2
Previous: dict2
Subsections
Identifiers, Declarations and Dictionaries
Declarations only exist in programming languages so that compilers can check
things - it is perfectly possible to program without them (e.g. assembly code),
just a lot harder to get everything right by yourself:
- to spot spelling mistakes
- to check what kind of thing the identifier represents
- is this a variable you can assign to, or a constant you can only take a
value from, or a function you can only call.
- to check and sometimes modify the types
- a pointer variable can be dereferenced to give e.g. an integer, an integer
variable can be automatically converted to be assigned to a float variable.
- to check that local identifiers are only used in a part of the whole
program (scope)
Identifiers that represent types are not normally needed after the checking
process is finished. For other kinds of identifiers, a compiler also has to
bind the identifier to (i.e. replace each use of the identifier by) e.g. a
value (for a constant) or an address (for a variable or a function) to be used
at run time.
To help it do all these things, a compiler uses a dictionary to keep track of
the declarations and to remember the bindings.
Identifier visibility, usually known as scope, is a property of the written
program, so that it all gets sorted out by the compiler at compile time.
An identifier which is declared at some point in a program may be used, with
its declared meaning, at other points in the program. The scope of the
identifier is the region of the text of that program in which it is allowed to
be used.
The dictionary must change at the start of the scope of an identifier and
again at the end of that scope - usually a function body, or a { }
block. Usually the same name can be used in several different declarations,
each with its own scope.
In early programming languages, identifiers and variables had essentially
unlimited scope. However, it soon became clear that it was important to find
ways of restricting scope to make the best use of limited computer memories
and, more importantly, limited human memories.
How can we safely reuse memory space for different purposes in different parts
of a program? How can we safely create new meanings for names in a part of the
program without having to worry whether those names are used anywhere else?
Example C program showing nested block structure:
int main (. . .)
{ /* start of Block1 */
float A; int B, C[10];
. . A; . . B; . . C;
{ /* start of Block2 */
char P; int Q, R[11];
. . A; . . B; . . C; . . P; . . Q; . . R;
} /* end of Block2 */
. . A; . . B; . . C;
{ /* start of Block3 */
float A[6], C; int D;
. . B; . . A; . . C; . . D;
} /* end of Block3 */
. . A; . . B; . . C;
} /* end of Block1 */
Compile time - Scopes of identifiers:
B = body of Block1
P, Q, R = body of Block2
A, C = body of Block1 but omitting Block3 (the first A and C)
A, C, D = body of Block3 (the second A and C)
Run time - Lifetimes (extents) of memory locations accessed by identifiers:
A, B, C = duration of Block1 (the first A and C)
P, Q, R = duration of Block2
A, C, D = duration of Block3 (the second A and C)
Layout of memory at different times during the run:
A |
B |
C[0] |
. |
. |
. |
. |
. |
. |
. |
. |
C[9] |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
A |
B |
C[0] |
. |
. |
. |
. |
. |
. |
. |
. |
C[9] |
P |
Q |
R[0] |
. |
. |
. |
. |
. |
. |
. |
. |
. |
R[10] |
|
A |
B |
C[0] |
. |
. |
. |
. |
. |
. |
. |
. |
C[9] |
A[0] |
. |
. |
. |
. |
A[5] |
C |
D |
|
|
|
|
|
|
In a single function, both visibility and life-time are nested in similar ways
- visibility in the program text at compile time, and life-time in the use of
memory during run time. The story gets more complicated when we use
more than one function, as the identifiers of one function are not visible
inside another, even though they may be alive when the other function is
called at run time.
Allowing recursion brings further complications, as at any one moment during
run time there may be several memory locations representing different
recursive instances of the same identifier. Other language features, such as
allowing functions to be declared inside other functions, and allowing
functions to themselves be passed as parameters, can lead to yet more
complexity. There is also another kind of memory usage (heap) which bypasses
the use of identifiers and some compile-time checks by using e.g. malloc
instead.
Getting the use of memory right during run time is dealt with in CS2042 - I
will mainly deal with getting the visibility right at compile time. Getting
both right is actually quite straightforward - it is only justifying it that
requires looking at all sorts of complex examples, so we will skip them and
go straight to the answer:
The compiler has to assign variables to run time memory locations. However,
the complications mentioned above mean that these locations cannot be
completely fixed at compile time. Instead, we work out everything we can in
the compiler, but leave a small amount of work until run time:
- at the start of a function
- grab a chunk of memory and set a register (namebase or
framebase) pointing to it - the compiler makes sure that the chunk is
exactly big enough for all the variables of the function to fit
- in the body of the function
- access variables at fixed offsets (decided by the compiler) from this register
- at the end of the function
- release the chunk of memory
Because of the pattern of use, the memory can be and usually is on the stack -
thus the chunks of memory are known as stack-frames. Referring back to the C
example above, we could either allocate the maximum memory needed at the start
of the function, or we could allocate the minimum and adjust the size of the
stack-frame as we enter and leave the inner blocks. In practice, for
simplicity and speed we normally do the former.
Global variables are not affected by recursion, so they can be placed at a
fixed location. However, this can only be decided during linking (when
libraries etc. are combined after compilation) or loading (putting the code
into memory just before running), so again the compiler treats these as being
at fixed offsets from some base memory location to be set later. Similarly,
functions and labels will be at fixed offsets from some base location decided
during linking or loading (unless we can use relative branches instead of
absolute jumps, so we only need to know the differences in the offsets).
Thus, for the compiler, all run time addresses are at known offsets from some
base address that cannot be known until the program is running. It therefore
holds addresses as pairs (offset, memory space). It also decides how big each
of the memory spaces has to be.
The final address may be found using an offset from a register (as for
variables in stack frames) or it may be calculated by the linker or loader by
actually adding the two numbers together and putting the result into the
correct positions in the program code (as for global variables and functions
and labels).
Implementing Identifiers - Dictionaries
A dictionary contains a set of names used in a piece of program text and, for
each name, information describing their declarations. To look up an identifier
in a dictionary:
- locate the particular name (if present)
- using any standard technique, such as B-trees or hashing.
This part of the dictionary is known as the name-list or symbol-table.
- locate (information about) the relevant declaration of the name
- most languages permit many instances of the same name to exist at once,
for variables in nested scopes, and for fields of different structs.
(Some languages, such as C, are misguided enough to permit multiple meanings
for a single name in a single scope (e.g. variable, label, struct) but we will
ignore this complication.)
This part of the dictionary is known as the property-list(s), of property
entries or properties.
Theoretically, we could either use a single dictionary, containing all the
names in the source text, each linked to whichever declarations are currently
valid, or several dictionaries e.g. 1 per scope, so each dictionary contains
just one declaration for each name in it. In practice, the latter can get
fiddly when we come to deal with e.g. the fields of each different struct, so
I prefer to use a single dictionary. Thus, we search the dictionary for a
name, and then for the relevant declaration belonging to that name.
Using a dictionary
- At the start of each new scope (e.g. function or block)
- Whatever initialisation is necessary.
- At each declaration of an identifier
- Search the dictionary to find the name in the namelist, and
insert it if missing.
- Check that there is not already a declaration for this name in this
scope.
- Insert a new property entry describing the declaration
- insert the new entry at the head of the list of properties for the name,
so declarations in scope are found first.
- At each use of an identifier
- At the end of each scope
- Check all forward references satisfied.
If we are ending a function an unsatisfied reference may be an error (e.g. an
unknown label) but usually we just move unsatisfied references into the
enclosing scope (assuming one exists).
- Delete all properties that are no longer relevant.
- We may need to keep some properties and/or names
e.g. if we use a function's parameters to hold its signature.
- We may need to copy some information for run-time debugging.
- We may also remove names that no longer have any associated properties
to speed up searches.
Note that the first thing we do at each declaration or use of an identifier
(i.e. every time we see an identifier) is look it up in the
namelist. Therefore, this is often done as the identifier is recognised by
LEX, which then passes the result of the look-up as the value part
of the lexeme (yylval), rather than the string of characters that make up the
name.
Some information will be used by most or all declaration kinds:
- kind
- variable/constant/function/type/etc.
- scope
- property entry of the owning function or block,
and/or the depth of nested scopes
- type
- (see below)
- name in name-list
-
- any previous declarations of the same name in the
dictionary
-
- next declaration in the same scope
- (?)
Other information will be different for each declaration kind:
- constant (enum)
- we determine the value at compile time and place that value in the property
entry
- variable
- To plant code, we need the run-time address, usually held as (data frame,
offset in frame). The data frame is often a stack frame, but could be a global
variable frame.
A stack frame can be described by the scope information.
At the declaration, the offset is copied from the current size of the data
frame, which we then increment by the size of the variable, derived from the
type information.
- parameter
- similar to variables.
- record/struct and union fields
- Also very similar to variables, but the data frame refers to the owning
record. We calculate the offsets for fields of unions (and the total size of
the union frame) as for a function containing nested blocks (i.e. use the
maximum size, not the total of all the sizes).
- labels
- Consist just of the code address (i.e. code frame + offset),
although we usually need to cope with forward references as described above.
- programs, subprograms and blocks
- For a function, we need its signature: we could use the type information for
the result, and a link to the list of parameters for the rest of the
signature. However, this is not sensible if a function signature can be used
as a type. We also need the size of the stack frame. We may want to point to a
list of all properties defined within the scope to make processing
easier. Finally, we need the code address as for labels.
A program entry is usually needed to provide a scope for global declarations,
and to hold any language-specific information.
We need entries for { }
blocks if they can contain declarations as in C
- their entries will be like those for functions, but without signature, code
address or name.
- types
- The only types some languages have are a few built-in primitive types and
vectors, and we need only remember this information and the resulting size
(e.g. in bytes) as part of the property entry for the variable, parameter or
constant.
Most modern languages allow types to be created using e.g. multi-dimensional
arrays, records, pointers or recursive types,
If the language forces a complex declaration to be
broken up into several simpler declarations, then we can place each
declaration in a single property entry. However, most languages permit
individual declarations to be arbitrarily complex, and the implementor must
split each declaration up, creating entries as if for a whole series of
(anonymous) type declarations as well as the declaration of the named type
or variable or whatever.
Example of equivalent anonymous and explicit type declarations in C:
char (*(*x())[])() { . . . }
typedef char anon1();
typedef anon1 *anon2;
typedef anon2 anon3[];
typedef anon3 *anon4;
anon4 x() { . . . }
and possible dictionary entries (many details omitted):
namelist |
properties |
|
|
char |
|
|
|
anon1 |
|
|
|
anon2 |
|
|
|
anon3 |
|
|
|
anon4 |
|
|
|
x |
|
All languages include words that could be mistaken for identifiers, but are
actually defined as part of the language. These include words used to
describe data and control structures, typenames, conversions, values,
input/output, operators, and maths functions.
These words are treated in two main ways:
Keywords (reserved words) are built into the grammar, and so cannot
have their meaning changed by the user.
Predefined identifiers are otherwise normal identifiers, that behave
as if they were declared before the start of each program. Their meanings can
therefore be changed just like any other name, although this is rarely
sensible. These extra declarations can be handled simply by pre-inserting the
identifiers into the dictionary.
Keywords are easy to recognise by using a different LEX grammar
rule for each, but this tends to make the resulting analyser very large (and
slow for LEX to generate). Therefore, some analysers initially do
not distinguish between keywords and identifiers, but then search a special
dictionary to recognise keywords before treating everything else as an
identifier. This makes the analyser smaller but slower.
A compiler uses a dictionary at compile time to check the
declarations and uses of identifiers and to help it plant code that, at
run time, creates stack frames to hold variables.
A dictionary holds information about the identifiers derived from their
declarations in the program text. A stack frame holds the actual values of its
variables at any given moment while the program is running.
( indicates harder problems)
- Some languages treat upper-case and lower-case names as distinct
(e.g. Pascal), some as being the same (e.g. Ada), and some only allow one
case. How would you recognise and match names in an Ada compiler?
Some languages ignore characters in names past some limit. e.g. in ANSI C,
only the first 6 characters of extern identifiers have to be checked
(although more can be) and their case can be ignored. How could you use
LEX and YACC to check ANSI C programs for identifiers that
might be confused by a compiler/linker that only provided the minimum
facilities? How could you use #define to avoid any problems thus detected?
- List all the data structures you can think of that could possibly be
used for a namelist - which of them would you prefer to use and why? Can you
imagine any circumstances in which you would pick a different one?
- What identifiers do LEX and YACC (i.e. the parser
generators themselves, not the parsers they generate) each have to deal
with, and so what do they need in the way of dictionaries?
What about other Unix tools that you are familiar with, like bash?
- Check your favourite compilers to see what happens if you misspell an
identifier. Do you get just one error message? If you repeatedly misspell it
in the same way, are the error messages repeated?
- Write a set of C declarations that uses all the different kinds of
declarations, and all the sorts of types, that you can think of, and then
give the corresponding dictionary entries.
- How would you change the semantics of C to simplify semantic
analysis? What useful features of the language would be lost by this
simplification?
- Assume you could declare one function inside another in C, and the
inner function could access the variables of the outer function. What
complications would this cause at compile time and at run time?
Louden: chs. 4.7, 5.1-5.6, 6.6, 7.5
Bal & Grune: chs. 2.2.1, 2.3, 2.3.1
Aho, Sethi & Ullman: chs. 2.7, 6.1-6.6, 7.1-7.4, 7.6-7.9, 9.3
Capon & Jinks: chs. 3.3, 7.3, 9
Next: About this document ...
Up: dict2
Previous: dict2
Pete Jinks
2002-07-25