next up previous
Next: LEX Up: CS2121: The Implementation and Previous: Introduction

Subsections

Representation and Meaning

Representation and Meaning

A translator, whether human or machine, has to understand what it is translating. Because we can normally rely on the common sense and shared experiences of other human beings, natural languages can be ambiguous, fuzzily structured and have large vocabularies.

In contrast, not only do computer languages have limited, mainly user-defined vocabularies, but also we have to exactly specify the legal forms of each construct of the language and what it means, both alone and in combination with every other construct.


Expressions

The meaning of a + b is usually along the lines of:
"take the values associated with the identifiers a and b, combine those two values into one by addition, and have the resulting value ready for further use"

This typical meaning may be modified in many ways e.g.:

This meaning can be represented in several ways e.g.:
  a + b          [infix]
  a b +          [postfix/Reverse Polish Notation]
  + a b          [prefix]
  add a b
  add (a, b)
  (add, a, b)
  add a and b
  a plus b

Prefix and postfix notations, although awkward to use at first, don't get significantly more complex as expressions get bigger. However, most humans, and therefore most computer languages, use infix notation, and this gives rise to yet more possibilities of representation and meaning as expressions get bigger:

Precedence
- what is the meaning of 1 + 2 * 3 ?

If we simply obey the expression left-to-right, it means (1 + 2) * 3 i.e. 9, but we usually give operators like * and / higher precedence than operators like + and - i.e. they are obeyed first, so the expression means 1 + (2 * 3) i.e. 7.

Associativity
- what is the meaning of 3 - 2 - 1 ?

What do we do if precedence doesn't resolve the problem i.e. two operators in an expression have the same precedence?

If we obey the example expression left-to-right, it means (3 - 2) - 1 i.e. 0, but if we obey it right-to-left it means 3 - (2 - 1) i.e. 2.

With most arithmetic operators (including -, so the answer is normally 0) we simply work left-to-right. This is known as left associative.

With some operators we have to work right-to-left (right associative) e.g. raising to a power - $(10^2)^3$ is just $10^6$, whereas $10^{(2^3)}$ is $10^8$.

Some operators are not allowed to be used more than once without explicit brackets (non-associative) e.g. 1 <= x <= 10 is illegal in Pascal. It is legal in C, but is probably not doing what you want (it does (1 <= x) <= 10). The representation for the intended meaning is (1 <= x) && (x <= 10). Making <= etc. non-associative in Pascal means that you can't get the unintended meaning by mistake, although if you really want something similar you can force it by using extra brackets.

(We will be using the concepts of fix, precedence and associativity in later sections and in the lab.)

Implementing Programming Languages

Representation is normally referred to as syntax or grammar. There are many different formal notations for describing syntax, most of which are equivalent in descriptive power, although they may vary in how succinct or verbose they are. We will be looking at two different notations (regular expressions and BNF), which have different descriptive powers, and the associated tools for dealing with them, LEX and YACC. These kinds of tools, that convert a grammar into a parser (a program that recognise inputs that match the grammar), are known as parser-generators.

Meaning is normally referred to as semantics (or static semantics). There are formal notations for describing semantics, as you may have seen in CS1112, but we will not be using them. This is because semantics is a much more complex subject than syntax, still an area for much research, and because different programming languages vary much more in semantics than in syntax. Therefore, we will only be looking at one major area of semantics, that of identifiers and the declarations that bind meanings to them in a scope, and one technique for dealing with them, the use of dictionaries.

We will be concentrating mainly on describing and implementing syntax, and we will be looking at this next.

Implementing Representation - Dealing with Grammars


GREP and EGREP

GREP echoes lines from the input(s) that match a pattern. Each input line is processed in turn: characters on each line are scanned for patterns, and if a pattern is completely recognised, the current line is output.
Metacharacters (characters with a special meaning) \ ^ $ . [ ] * + ? | ( )
any non-metacharacter recognise that character
\any metacharacter recognise that metacharacter
\n or \t as for C - newline or tab
^ beginning of line
$ end of line
. any one character except newline
[ ] any one character from inside the brackets
  indicate ranges by - e.g. [a-zA-Z]
  an actual ] must be the first character
  an actual - must be the first or last character
[^ ] any one character except . . .
anything* anything repeated none or more times
extra facilities in EGREP:
anything+ anything repeated one or more times
anything? anything repeated none or one times
one | another either one or another
( ) recognise the contents
precedence: [ ] ( ), * + ?, concatenation, |

LEX and YACC

We will examine these two programs in a lot more detail later, as they are very useful for implementing computer languages. The major part of their languages are patterns and corresponding pieces of C code (actions). The major difference between LEX and YACC is the way the patterns are described - LEX is similar to EGREP, but YACC is very different.

Parse trees

A 2-dimensional representation. They make the relationships between operands, operators etc. explicit, and so are clearer than the linear notations above. However, we still need extra information about the meaning, such as the order in which operands are accessed. e.g. parse tree for a + b

\begin{picture}(5,4)
\put(3,3){\circle{1.0}}\put(2.8,2.8){+}
\put(1,1){\circle{1...
...){\line(1,1){1.5}} % + -> a
\put(5,1.4){\line(-1,1){1.5}} % + -> b
\end{picture}

Epilogue

When dealing with computer languages, there are two main approaches:

interpretation

Recognise and obey commands immediately, as with EGREP and shells/CLIs.

translation

Recognise commands and create an equivalent set of commands in another language (e.g. assembly language), as with compilers. LEX and YACC fall into this category, as they transform the list of patterns into a C program that searches for them in its input.

Sometimes a combination of these approaches is used: commands can be embedded within the piece of text they are designed to act on, so that the commands are interpreted, but give rise to a translation of the input text into some new form. For example, CPP, which deals with commands like "#include" embedded within a C program and outputs a more verbose C program, and LATEX, which reads commands embedded in e.g. English text to generate beautifully formatted documents like this!

Exercises

Readings

Bal & Grune: chs. 1, 7.2.2
Louden: chs. 1, 2
Capon & Jinks: ch. 1
Aho, Sethi & Ullman: ch. 1

next up previous
Next: LEX Up: CS2121: The Implementation and Previous: Introduction
Pete Jinks
2004-10-26