Next: LEX Up: CS2121: The Implementation and Previous: Introduction

Subsections

Representation and Meaning

A translator, whether human or machine, has to understand what it is translating. Because we can normally rely on the common sense and shared experiences of other human beings, natural languages can be ambiguous, fuzzily structured and have large vocabularies.

In contrast, not only do computer languages have limited, mainly user-defined vocabularies, but also we have to exactly specify the legal forms of each construct of the language and what it means, both alone and in combination with every other construct.

Expressions

The meaning of a + b is usually along the lines of:
"take the values associated with the identifiers a and b, combine those two values into one by addition, and have the resulting value ready for further use"

This typical meaning may be modified in many ways e.g.:

"first take the value... a, then take the value... b, ..."
"...convert the values so they are compatible..."
"...combine the values by set-union/ string-concatenation/ matrix-addition..."

This meaning can be represented in several ways e.g.:

  a + b          [infix]
  a b +          [postfix/Reverse Polish Notation]
  + a b          [prefix]
  add a b
  add (a, b)
  (add, a, b)
  add a and b
  a plus b

Prefix and postfix notations, although awkward to use at first, don't get significantly more complex as expressions get bigger. However, most humans, and therefore most computer languages, use infix notation, and this gives rise to yet more possibilities of representation and meaning as expressions get bigger:

Precedence

- what is the meaning of 1 + 2 * 3 ?

If we simply obey the expression left-to-right, it means (1 + 2) * 3 i.e. 9, but we usually give operators like * and / higher precedence than operators like + and - i.e. they are obeyed first, so the expression means 1 + (2 * 3) i.e. 7.

Associativity

- what is the meaning of 3 - 2 - 1 ?

What do we do if precedence doesn't resolve the problem i.e. two operators in an expression have the same precedence?

If we obey the example expression left-to-right, it means (3 - 2) - 1 i.e. 0, but if we obey it right-to-left it means 3 - (2 - 1) i.e. 2.

With most arithmetic operators (including -, so the answer is normally 0) we simply work left-to-right. This is known as left associative.

With some operators we have to work right-to-left (right associative) e.g. raising to a power - is just , whereas $10^{(2^3)}$ is .

Some operators are not allowed to be used more than once without explicit brackets (non-associative) e.g. 1 <= x <= 10 is illegal in Pascal. It is legal in C, but is probably not doing what you want (it does (1 <= x) <= 10). The representation for the intended meaning is (1 <= x) && (x <= 10). Making <= etc. non-associative in Pascal means that you can't get the unintended meaning by mistake, although if you really want something similar you can force it by using extra brackets.

(We will be using the concepts of fix, precedence and associativity in later sections and in the lab.)

Implementing Programming Languages

Representation is normally referred to as syntax or grammar. There are many different formal notations for describing syntax, most of which are equivalent in descriptive power, although they may vary in how succinct or verbose they are. We will be looking at two different notations (regular expressions and BNF), which have different descriptive powers, and the associated tools for dealing with them, LEX and YACC. These kinds of tools, that convert a grammar into a parser (a program that recognise inputs that match the grammar), are known as parser-generators.

Meaning is normally referred to as semantics (or static semantics). There are formal notations for describing semantics, as you may have seen in CS1112, but we will not be using them. This is because semantics is a much more complex subject than syntax, still an area for much research, and because different programming languages vary much more in semantics than in syntax. Therefore, we will only be looking at one major area of semantics, that of identifiers and the declarations that bind meanings to them in a scope, and one technique for dealing with them, the use of dictionaries.

We will be concentrating mainly on describing and implementing syntax, and we will be looking at this next.

Implementing Representation - Dealing with Grammars

GREP and EGREP

GREP echoes lines from the input(s) that match a pattern. Each input line is processed in turn: characters on each line are scanned for patterns, and if a pattern is completely recognised, the current line is output.
Metacharacters (characters with a special meaning) \ ^ $ . [ ] * + ? | ( )

any non-metacharacter	recognise that character
`\`any metacharacter	recognise that metacharacter
`\n` or `\t`	as for C - newline or tab
`^`	beginning of line
`$`	end of line
`.`	any one character except newline
`[ ]`	any one character from inside the brackets
	indicate ranges by `-` e.g. `[a-zA-Z]`
	an actual `]` must be the first character
	an actual `-` must be the first or last character
`[^ ]`	any one character except . . .
anything*	anything repeated none or more times

extra facilities in EGREP:

anything`+`	anything repeated one or more times
anything`?`	anything repeated none or one times
one `\|` another	either one or another
`( )`	recognise the contents

precedence: [ ] ( ), * + ?, concatenation, |

LEX and YACC

We will examine these two programs in a lot more detail later, as they are very useful for implementing computer languages. The major part of their languages are patterns and corresponding pieces of C code (actions).

unlike GREP, they do not need to process the input line-by-line,
we normally expect their patterns to match the whole of the input,
as each pattern is completely recognised in the input, the corresponding action is obeyed.

The major difference between LEX and YACC is the way the patterns are described - LEX is similar to EGREP, but YACC is very different.

Parse trees

A 2-dimensional representation. They make the relationships between operands, operators etc. explicit, and so are clearer than the linear notations above. However, we still need extra information about the meaning, such as the order in which operands are accessed. e.g. parse tree for a + b

$\begin{picture}(5,4) \put(3,3){\circle{1.0}}\put(2.8,2.8){+} \put(1,1){\circle{1... ...){\line(1,1){1.5}} % + -> a \put(5,1.4){\line(-1,1){1.5}} % + -> b \end{picture}$

Epilogue

When dealing with computer languages, there are two main approaches:

interpretation

Recognise and obey commands immediately, as with EGREP and shells/CLIs.

translation

Recognise commands and create an equivalent set of commands in another language (e.g. assembly language), as with compilers. LEX and YACC fall into this category, as they transform the list of patterns into a C program that searches for them in its input.

Sometimes a combination of these approaches is used: commands can be embedded within the piece of text they are designed to act on, so that the commands are interpreted, but give rise to a translation of the input text into some new form. For example, CPP, which deals with commands like "#include" embedded within a C program and outputs a more verbose C program, and L^ATEX, which reads commands embedded in e.g. English text to generate beautifully formatted documents like this!

Exercises

Any metacharacters of GREP or EGREP that are also metacharacters of the shell you are using have to be quoted. Which characters are these?
Choose some complicated expressions and for each
1. fully bracket it
2. write it using the other representations in $\S$ 1.1
3. draw the corresponding parse tree
Try expressions involving functions calls and data structure accesses - see if you can find different ways of drawing the parse trees for these by making different assumptions about which parts of the expression are the operators.
Use the MAN command to find out more details about MAKE, GREP, EGREP, and whichever shell you normally use (probably BASH) e.g. type man make etc.
Use GREP or EGREP on some text files e.g. to list all the Subjects of all your email.
Write a shell command file to compile and run any program you wrote e.g. for a first semester lab.
Convert your shell command file into a MAKEfile, that can compile and run your program as two separate steps. Add a third entry to the MAKEfile to keep the output from the run and DIFF it with a previous output, so you can immediately see the consequences of any change to your program.

Readings

Bal & Grune: chs. 1, 7.2.2
Louden: chs. 1, 2
Capon & Jinks: ch. 1
Aho, Sethi & Ullman: ch. 1

Next: LEX Up: CS2121: The Implementation and Previous: Introduction

Pete Jinks
2004-10-26