.NH S 4 Semantic Analysis

input:
parse trees or equivalent
dictionary containing just names
output:
parse trees (or equivalent) for executable code,
enhanced by information from declarations
declarations -> dictionary properties
error messages

The semantic analyser recognises & checks the declarations & use of identifiers, using the dictionary.

Unlike for lexical and syntactic analysis, there are no widely available toolkits for semantic analysis, although there are packages that help with maintaining a dictionary.

Dictionaries

A dictionary contains the set of names used in a piece of code and, for each name, information describing its declarations. To look up an identifier in a dictionary:

* locate the particular name (if present - insert if missing):
using any standard technique, such as B-trees or hashing. This part of the dictionary is known as the name-list or symbol-table.

* locate (information about) the relevant declaration of the name:
C permits multiple scopes and/or multiple meanings for a single name in a single scope, so a name may have several declarations to choose from. This part of the dictionary is known as the property-list(s), made up of property entries or properties.

We normally use a single dictionary, containing all the names in the source text, each linked to whichever declarations are currently valid, so we search the dictionary for a name, and then for the relevant declaration.

(It is possible to use several dictionaries, e.g. one per scope and/or per kind of declaration, in which case we must search the dictionaries in turn until we find the declaration we need.)
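
As an illustration of the first step (locating a name, inserting it if missing), here is a minimal sketch of a hashed name-list. The names find_name, struct name, NBUCKETS and the hash function are assumptions made for this example, not part of any particular compiler; the property list is left as an opaque pointer.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 211            /* assumed table size; any prime will do */

struct property;                /* declaration details, left opaque here */

struct name {
    char *spelling;             /* the identifier text */
    struct property *props;     /* head of the property list (NULL if none) */
    struct name *next;          /* next name in the same hash bucket */
};

static struct name *buckets[NBUCKETS];

/* Simple multiplicative string hash (an illustrative choice). */
static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s != '\0')
        h = h * 31u + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Locate a name in the name-list, inserting it if it is missing.
   Returns NULL only if memory runs out. */
struct name *find_name(const char *spelling)
{
    unsigned h = hash(spelling);
    struct name *n;

    for (n = buckets[h]; n != NULL; n = n->next)
        if (strcmp(n->spelling, spelling) == 0)
            return n;

    n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->spelling = malloc(strlen(spelling) + 1);
    if (n->spelling == NULL) {
        free(n);
        return NULL;
    }
    strcpy(n->spelling, spelling);
    n->props = NULL;
    n->next = buckets[h];
    buckets[h] = n;
    return n;
}
```

Calling find_name("count") twice returns the same entry, so every occurrence of an identifier in the source text shares one name entry and one property list.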

Using a dictionary

* At the start of each new scope (e.g. function or block):
Whatever initialisation is necessary.

* At each declaration of an identifier:
Find the name in the namelist (insert it if missing).
Check that there is not already a declaration (of the same kind?) for the name in the scope.
Insert a new property entry describing the declaration.
Insert new entries at the head of the list of properties for the name, so declarations in scope are found first.

* At each use of an identifier:

Find the name in the namelist (insert it if missing).
Find the current declaration in scope (of the correct kind) for that name. If none is found, insert a new property entry.
C does allow forward references, i.e. an identifier may be used before it is declared, but only in very simple circumstances (e.g. a goto to a label that appears later in the same function), so we create a dummy entry to be overwritten by the real declaration later. The dummy entry may point to places where the identifier is used, so e.g. generated code can be corrected.
Otherwise, C requires a declaration before any use of an identifier, so we report an error, but complete the new property entry to avoid repeating the error message.

* check identifier kinds

field names only accessible via struct/union variable
assign to: variable, field, parameter
goto: label
call: function (or pointer variable)
expressions: variable, field, parameter, constant, function
parameters: variable, expression, function

* binding: add information about identifiers to parse trees

types
values: enums
data addresses: variables, parameters, fields
code addresses: labels, functions (forward references?)

* check & propagate type information through expressions



check types of operands match operations
insert coercions
resolve overloading
note results of operations

add e.g. variable and function addresses, constant values

* At the end of each scope:
Check all forward references satisfied.
Unsatisfied references indicate an error.
Delete all properties that are no longer relevant.
We may need to keep some properties and/or names e.g. a function's parameters to hold its signature.

We may need to copy some information for run-time debugging.
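
The declaration, use and end-of-scope steps above can be sketched as follows. This is a simplified illustration under several assumptions: all names (declare, use, close_scope, K_UNDECLARED and so on) are invented for the example, the name-list is a plain linked list, there is a single name space (no "kinds" of declaration), and genuine forward references such as goto labels are not modelled.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum kind { K_VARIABLE, K_FUNCTION, K_UNDECLARED };

struct property {
    enum kind kind;
    int depth;                  /* nesting depth of the declaring scope */
    struct property *previous;  /* outer declaration hidden by this one */
};

struct name {
    const char *spelling;       /* assumed to outlive the dictionary */
    struct property *props;     /* innermost declaration first */
    struct name *next;
};

static struct name *names;      /* name-list: a plain list for brevity */
static int depth;               /* current scope depth, 0 = global */
int errors;                     /* number of error messages issued */

static struct name *find_name(const char *s)
{
    struct name *n;
    for (n = names; n != NULL; n = n->next)
        if (strcmp(n->spelling, s) == 0)
            return n;
    n = malloc(sizeof *n);
    if (n == NULL)
        abort();
    n->spelling = s;
    n->props = NULL;
    n->next = names;
    names = n;
    return n;
}

/* Insert a new property at the head of the list for this name. */
static struct property *push(struct name *n, enum kind k)
{
    struct property *p = malloc(sizeof *p);
    if (p == NULL)
        abort();
    p->kind = k;
    p->depth = depth;
    p->previous = n->props;
    n->props = p;
    return p;
}

void open_scope(void) { depth++; }

/* At a declaration: reject a duplicate in this scope, else insert. */
struct property *declare(const char *s, enum kind k)
{
    struct name *n = find_name(s);
    struct property *p = n->props;
    if (p != NULL && p->depth == depth) {
        if (p->kind == K_UNDECLARED) {
            p->kind = k;        /* real declaration overwrites the dummy */
        } else {
            errors++;
            fprintf(stderr, "error: '%s' redeclared in the same scope\n", s);
        }
        return p;
    }
    return push(n, k);
}

/* At a use: return the innermost declaration; report a missing one once. */
struct property *use(const char *s)
{
    struct name *n = find_name(s);
    if (n->props == NULL) {
        errors++;
        fprintf(stderr, "error: '%s' undeclared\n", s);
        return push(n, K_UNDECLARED);   /* stops the message repeating */
    }
    return n->props;
}

/* At the end of a scope: delete every property declared at this depth. */
void close_scope(void)
{
    struct name *n;
    for (n = names; n != NULL; n = n->next)
        while (n->props != NULL && n->props->depth == depth) {
            struct property *dead = n->props;
            n->props = dead->previous;
            free(dead);
        }
    depth--;
}
```

Because new properties go at the head of each list, use() finds an inner declaration before any outer one it hides, and close_scope() only ever has to remove entries from the front.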

Property entries

Some information will be used by most or all declaration kinds:
kind: variable/constant/function/type/etc.
scope: -> property entry of the owning function or block,
and/or the depth of nested scopes
type: (see below)
-> name in name-list
-> any previous declarations of the same name in the dictionary
-> next declaration in the same scope (?)
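
One possible layout for a property entry is sketched below: the common fields listed above, followed by a union holding the kind-specific information discussed in the rest of this section. All field and function names are assumptions for illustration, and alignment and padding are deliberately ignored when allocating offsets.

```c
#include <stddef.h>

struct name;                    /* entry in the name-list */
struct type;                    /* separate type entry */

enum kind { K_ENUM_CONST, K_VARIABLE, K_PARAMETER, K_FIELD,
            K_LABEL, K_FUNCTION, K_TYPE };

struct frame {
    size_t size;                /* bytes allocated in the frame so far */
};

struct property {
    /* information used by most or all declaration kinds */
    enum kind kind;
    struct property *scope;         /* owning function or block */
    int depth;                      /* depth of nested scopes */
    struct type *type;
    struct name *name;              /* -> name in name-list */
    struct property *previous;      /* -> previous declaration of the name */
    struct property *next_in_scope; /* -> next declaration in the same scope */

    /* information that differs for each declaration kind */
    union {
        long value;                 /* enum constant */
        struct {
            struct frame *frame;    /* data frame holding the object */
            size_t offset;          /* offset within that frame */
        } data;                     /* variable, parameter, field */
        long code_address;          /* label, function */
    } u;
};

/* Variable: take the current frame size as the offset, then grow the frame. */
void place_variable(struct property *p, struct frame *f, size_t size)
{
    p->kind = K_VARIABLE;
    p->u.data.frame = f;
    p->u.data.offset = f->size;
    f->size += size;
}

/* Union field: every field starts at offset 0; the frame is as large
   as its largest field. */
void place_union_field(struct property *p, struct frame *f, size_t size)
{
    p->kind = K_FIELD;
    p->u.data.frame = f;
    p->u.data.offset = 0;
    if (size > f->size)
        f->size = size;
}
```

For example, placing a 4-byte and then an 8-byte variable in an empty frame gives offsets 0 and 4 and a frame size of 12, whereas the same two sizes as union fields both get offset 0 and a frame size of 8.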

Other information will be different for each declaration kind:

* enum:
Determine the value at compile time and place that value in the property entry.

* variable & parameter:
To plant code, we need the run-time address, usually held as (data frame, offset in frame). The data frame is often a stack frame, but could be e.g. a common block, or a global variable frame. A stack frame can be described by the scope information, and a common block by a pointer to its property entry. The offset is copied from the current size of the data frame, which we then increment by the size of the variable, derived from the type information.

* struct & union field:
Similar to variables, but the data frame refers to the owning struct. We calculate the offsets for fields of unions (and the total size of the union frame) slightly differently.

* label:
Consists just of the code address, although we need to cope with forward references as described above.

* program, function & block:
For a function, we need its signature: we could use the type information for the result, and a link to the list of parameters for the rest of the signature. However, this is not sensible if a function signature can be used as a type. We also need the size of the stack frame. We may want to point to a list of all properties defined within the scope to make processing easier. Finally, we need the code address as for labels.
A program entry may be needed to provide a scope for global declarations.
As blocks have associated scope rules we may need to have an entry like that for a function, but without signature or code address (and without an associated name!)

* types:
Types can be created using e.g. multi-dimensional arrays, structs, unions, pointers, functions. C permits individual declarations to be arbitrarily complex, and the implementor must split each declaration up, creating entries as if for a whole series of (anonymous) type declarations as well as the declaration of the named type or variable or whatever. The detailed type information may be stored directly in property entries, but it is usually better for them to point to separate type entries.

e.g. char (*(*x())[])() { . . . }
is equivalent to:
typedef char anon1();
typedef anon1 *anon2;
typedef anon2 anon3[];
typedef anon3 *anon4;
anon4 x() { . . . }
which can be represented by (many details omitted):

names properties types

char [ type | \(bu ] [ description of char ]

anon1 [ type | \(bu ] [ signature | \(bu | x ]

anon2 [ type | \(bu ] [ pointer | \(bu ]

anon3 [ type | \(bu ] [ array | ? | \(bu ]

anon4 [ type | \(bu ] [ pointer | \(bu ]

x [ function | \(bu ] [ signature | \(bu | x ]
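
The chain of type entries for x can also be built directly in C. The sketch below is illustrative only: struct type, make_type, describe and type_of_x are invented names, and (like the diagram) it omits many details such as the array size and parameter lists.

```c
#include <stdlib.h>
#include <string.h>

enum type_kind { T_CHAR, T_POINTER, T_ARRAY, T_FUNCTION };

struct type {
    enum type_kind kind;
    struct type *target;    /* pointee, element or result type; NULL for char */
};

struct type *make_type(enum type_kind kind, struct type *target)
{
    struct type *t = malloc(sizeof *t);
    if (t == NULL)
        abort();
    t->kind = kind;
    t->target = target;
    return t;
}

/* Write an English description of t into buf, which holds cap (>= 1) bytes. */
void describe(const struct type *t, char *buf, size_t cap)
{
    static const char *const text[] = {
        "char", "pointer to ", "array of ", "function returning "
    };
    buf[0] = '\0';
    for (; t != NULL; t = t->target)
        strncat(buf, text[t->kind], cap - strlen(buf) - 1);
}

/* One type entry per anonymous typedef in the example above. */
struct type *type_of_x(void)
{
    struct type *chr   = make_type(T_CHAR, NULL);
    struct type *anon1 = make_type(T_FUNCTION, chr);    /* char anon1()     */
    struct type *anon2 = make_type(T_POINTER, anon1);   /* anon1 *anon2     */
    struct type *anon3 = make_type(T_ARRAY, anon2);     /* anon2 anon3[]    */
    struct type *anon4 = make_type(T_POINTER, anon3);   /* anon3 *anon4     */
    return make_type(T_FUNCTION, anon4);                /* anon4 x()        */
}
```

Passing the result of type_of_x() to describe() yields "function returning pointer to array of pointer to function returning char", which is the usual way of reading the original declaration.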

Keywords

C includes words that could be mistaken for identifiers, but are actually defined as part of the language. These are used for data and control structures (e.g. struct, if, while), typenames (e.g. int, char) and operators (sizeof). Input/output and maths functions such as printf and sqrt are not keywords in C: they are ordinary identifiers declared by the standard library.

Keywords (reserved words) are built into the grammar, and so cannot have their meaning changed by the user.

Keywords are easy to recognise by using a different Lex grammar rule for each, but this tends to make the resulting analyser very large (and slow for Lex to generate). Therefore, some analysers initially do not distinguish between keywords and identifiers, but then search a special dictionary to recognise keywords before treating everything else as an identifier. This makes the analyser smaller but slower.
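
A minimal sketch of the second approach follows: the lexical analyser matches every word with one rule and then looks it up in a small keyword dictionary. The table holds only a handful of C keywords, and the names classify, TOK_KEYWORD and TOK_IDENTIFIER are assumptions for this example; a real analyser would return a distinct token for each keyword.

```c
#include <stdlib.h>
#include <string.h>

enum token { TOK_IDENTIFIER, TOK_KEYWORD };

/* A few C keywords, kept in strcmp order so that bsearch can be used. */
static const char *const keywords[] = {
    "char", "else", "for", "if", "int", "return", "struct", "while"
};

/* bsearch comparison: key is the word, element points into the table. */
static int compare(const void *key, const void *element)
{
    return strcmp((const char *)key, *(const char *const *)element);
}

/* Decide whether a word already matched by the lexer is a keyword. */
enum token classify(const char *word)
{
    const void *hit = bsearch(word, keywords,
                              sizeof keywords / sizeof keywords[0],
                              sizeof keywords[0], compare);
    return hit != NULL ? TOK_KEYWORD : TOK_IDENTIFIER;
}
```

Here classify("while") gives TOK_KEYWORD and classify("whilst") gives TOK_IDENTIFIER. The extra search on every word is the run-time cost mentioned above.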

For next lecture

You should bring any reference material you have about ARM assembly code programming to the next CS5031 Compilers lecture.