Next: Expressions: Code Generation Up: CS2111: Design and Implementation Previous: Parse Trees Contents

Subsections

Language Design

Rather than discuss vague design principles, I want to look at concrete examples of good and bad design. We will mainly look at ANSI C, not because it is particularly good or bad, but because it should be most familiar to you. Also, we will mainly look at representation (syntax), rather than meaning (semantics).

Error detection and correction

Not all strings are valid representations. The distance between 2 strings is the number of places they are different.

If valid strings are never neighbours then a single error is always detected,
if they are always at least 2 differences apart then a single error can always be corrected, etc.
This can require extra effort, so some disagree.
Sometimes user can make choices that enhance error detection and correction.
Sometimes possible to language designer: change concrete grammar, keeping abstract grammar and meaning the same.

C is deformed because its parents were frightened by keywords

C has a consistent philosophy of minimising the use of keywords. This is probably because C was based on BCPL, a language designed when compilers had to be small to fit in memory - a BCPL compiler on a PDP-7 used less than 4K 18-bit words. It replaces keywords by overloading other symbols such as brackets ( ) and braces { }, or simply overloading the user and implementor! In several places the absence of keywords makes reading and/or writing C programs harder and more error prone.

Control structures

$\S$ 7.3: dangling else - the best solution is simply to insist on a terminating keyword (e.g. end_if or fi). As well as being a good solution to this problem (no special rule for users to learn and compiler writers to implement), there are further benefits when we generalise this to all control constructs (as we should, to avoid yet another special rule):

if each control construct has a different terminating keyword (endwhile, endif, endfor etc.) it is much easier to find where a missing bracket should be within lots of nested blocks,
no danger of unexpected empty construct body, etc.

However, some C control constructs have keywords that other languages avoid:

return: for function results, may be better than Pascal's name:=, although the difference is rarely important. However, some people disapprove of multiple returns, or a return from the middle of a function.
break: most modern languages that have a switch statement (often known as a case statement) insist that the cases are separate and automatically terminate. C, along with many other old languages, allows cases to fall into each other, but at least provides break to take control to the end of the switch. (We will discuss other uses of break and the related continue in a later section.)

Declarations

Pascal and C are very different when it comes to declaring variables with complicated types - in particular, Pascal declarations and accesses are easily derived from an English description of what is required. (Note: Pascal does not have some of the facilities of C with respect to functions, and C does not have some of Pascal's data types.) However, both languages make declarations match the subsequent uses of the variables in expressions.

C:
access: postfix, left-associative [ ] ( ) . -> of higher precedence than prefix, right-associative *
declarations: must use (* ). instead of -> and element typename comes first
(like a cast - same precedence/associativity/fix as dereference)

Pascal:
access: postfix, left-associative ^ . [ ]
declarations: [...] becomes array [...] of, element typename comes last

e.g. an array of pointers to characters
in Pascal:


        s: array [0..9] of ^char;
        . . .
        s[i]^ := 'a';

in C:


        char *s[10];
        . . .
        *s[i] = 'a';        /* i.e. *(s[i]) */

e.g. a pointer to an array of characters
in Pascal:


        s: ^array [0..9] of char;
        . . .
        s^[i] := 'a';

in C:


        char (*s)[10];
        . . .
        (*s)[i] = 'a';

Why do new users find declarations in C harder to understand than in Pascal?

The operations for accessing data structures all have similar precedences and associativities in Pascal, but in C, pointer dereference * is different from the rest.
This is an artefact of the design of C expressions, for which I have seen only historical explanations. Once * has been overloaded as the dereference operator it has to be prefix rather than postfix, or else it is hard to make sense of expressions.
In Pascal declarations, the type information follows the list of identifiers, whereas in simple C declarations it precedes the list. Furthermore, in more complex C declarations, the various type constructors (* [] ()) are attached to each individual identifier declared. Thus, in Pascal, all the identifiers in a single declaration are of the same type, but in C they can all be of different types.
This is partly historical accident and partly a consequence of minimising the use of keywords. Every Pascal declaration must be preceded by a keyword, such as var for variables, whereas C variable declarations make do with the name of the element type - C declarations can be preceded by a storage class specifier (extern, register, auto, static or typedef) but auto is usually omitted.

This leads to a much bigger problem with C: the grammar of declarations is fundamentally confusing, both for LR(1) parsers like YACC, and for human beings e.g.:


  {int bar;}
  {long int;}
  {long int bar;}

  typedef int foo;
  {foo bar;}
  {long foo;}
  {long foo bar;}

CPP


 #define SP3() if(b){int x; av=f(&x); bv+=x;}
  . . .
 if (x==3) SP3(); else bork();

If we omit the ; after the use of SP3, the else will silently become associated with the if in the macro, but if we include it the else does not belong to any if. In Pascal, a block end that replaces a statement must still be followed by the usual ;, but C treats { }; as two separate statements. We need something like:


 #define STMT(stuff) do{stuff}while(0)
 #define SP3() STMT(if(b){int x; av=f(&x); bv+=x;})

Puns: similar representations with different meanings

`(int) *x= 0`	assigns 0 to the object pointed to by x
`int *x= 0`	assigns the NULL pointer to x

`(int) x= y= 0`	makes x and y both equal to 0
`int x= y= 0`	makes x and y both equal to 0, but only declares x

`(x= 1, y= x+1)`	the `,` indicates sequencing
`f (x= 1, y= x+1)`	if f is a function the `,` does not indicate sequencing
	if f is a macro, any sequence depends on its definition

<>
Uses of , are macro definitions and calls, function definitions and calls, operation, separator in initialisations, separator in declarations.

Arrays and functions sometimes behave like pointers (and vice-versa).

0 is an integer, is false, is a pointer, and is the string terminator

void is the result type of a function that is really a procedure, the type of a pointer that is semi-valid (?), and the type of a function that takes no parameters

int f(void) can only be called by f()

int f() can be called with any parameters!

<>

12 == 014 but '\14' == '\014'

'A' getchar() are both of type int, not of type char

MIN <= x <= MAX is a valid expression, but unlikely to mean what the user intended.

typedef enum {red, blue, green=0, yellow} colour; gives the same value to red and to green, and to blue and to yellow, and they are all of type int, so yellow - green == blue

Overloading symbols

x = 1 x == 1 are both expressions, so both can appear inside the condition of an if statement or a loop, and both can appear on their own as a statement.
similarly a <= b a <<= b

static, extern, void, *, &, <, ( )

p= N * sizeof * q; r= malloc (p);

sizeof (int) * x; sizeof ((int *) x); sizeof ((int) * x);

(b) + (c); (b) - (c); (b) * (c); (b) & (c); (b) (c);

When is white space significant?

a+++ ++b
"hello\
again"
char *s[]= {"A", "B", "C" "D", "E",};
*a/ *b
#define a (b) c

Multiple name spaces: same name, different meanings


typedef int foo; struct foo bar; struct foo {int f;};
void fred (void) {foo: ;}

typedef int foo; struct foo {foo foo;};
void fred (int foo) {struct foo foo;}

typedef int foo; struct foo {int foo;};
void fred (void) {foo foo;}

struct foo {int foo;};
void foo (int foo) {struct foo bar;}


typedef struct list *list;
struct list {int field; list list;};
int length (list l)
{
  int i= 0;
  list: if (l) {
        i++; l= l->list;
        goto list;
        }
        else return i;
}

Synonyms: different representations, similar meanings

a=a+1 a+=1 a++ ++a
(*a).b a->b
a[i] *(a+i) *(i+a) i[a]
for(A;B;C)D; A;while(B){D;C;}

Orthogonality

All language features can be combined in every way the user wants. e.g.

records and arrays can contain records and arrays
several kinds of declarations and several kinds of types $\rightarrow$ all combinations valid
I/O of all scalar types, including boolean and enumerated
function results and parameters can be of any type - operations? functions? labels? arrays? polymorphic? generic?
all objects first class
a function can have any number of parameters $\rightarrow$ can have none
any statement can be used anywhere a statement is allowed
a statement can be empty or be replaced by many statements

Exercises

( $\dag$ indicates harder problems)

$\dag$ The grammar for ANSI C given by Kernighan & Ritchie is in $CS2111/e*/c_grammar/* Run it through YACC to discover the conflicts, and try to work out why they are there and how you could improve the grammar to remove them. Try to generally improve the grammar.

Readings

man cdecl
Louden: ch. 3
Bal & Grune: 2.2.3, 2.4.6

Next: Expressions: Code Generation Up: CS2111: Design and Implementation Previous: Parse Trees Contents

Pete Jinks
1999-09-30

`int f(void)`	can only be called by f()
`int f()`	can be called with any parameters!