Next: Vectors and Arrays - Up: CS2111: Design and Implementation Previous: Types Contents

Subsections

Composite Types

A composite type will be based on some component type or types, sometimes referred to as e.g. base-types, element-types, index-types etc. Different languages have different composite types, or different views of similar composite types.

Common Composite Types

sets

e.g. type hue = set of colour Some languages have bags (multisets - sets that permit repeated values) or power sets.
Constructors: {red, yellow} {n: 1 $\le$ n $\le$ 100 and n mod 2 == 0}
Selectors: mathematicians disagree as to whether it should be possible to extract a (random) element from a set (axiom of choice, Zorn's lemma) - few languages support it
Operations: $\cup \cap - \in = \subset \subseteq$ card etc.
Implementation: bit-maps or lists (or trees or hash-tables etc.)

structures (records, tuples)

the Cartesian product of its component types.
e.g. type pigment = colour * real or
struct pigment {colour paint; float intensity;} pigment
Constructors: e.g. (red, .50)
Selectors: e.g. let (paint, intensity) = pigment_instance in ... end or pigment_instance.paint
or paint of pigment_instance
Operations: normally only functions defined by user
Implementation: groups of words

unions (variant records)

The set union of its component types. Unions may be:
discriminated - a structure with one field to indicate the current variant, which should be hidden from the user in a strongly-typed language, or
undiscriminated - should not be used in a strongly-typed language.
Constructors, Selectors, Operations: similar to structures
Implementation: alternative uses of (parts of) the same group of words

vectors

have an index type and an element type. Some languages have a fixed index type (usually integer) and/or a fixed lower bound (usually 0 or 1).
Constructors: e.g. int a[] = {1, 2, 3, 4, 5}; some languages allow e.g. {6=>3, 7..15=>4, others=>--1}
Selectors: e.g. a [j] = a [i];
In many languages we should check that each index falls in the permitted range i.e.
lower bound $\le$ index $\le$ upper bound
Operations: some languages can query a vector (usually a parameter) to find its upper and/or lower bounds, or iterate over all elements of a vector, or copy or compare vectors, or even select subsets of the elements.
Implementation: a sequence of e.g. words or bytes, depending on the element type. May need extra information, depending on operations.
(more details about vectors and multi-dimensional arrays in $\S$ 17)

strings

a variable-length vector. The most useful implementations allow the length of a string to vary as needed at runtime. Most languages require each string to have some fixed maximum length. Bit strings are sometimes only available as computer words.
Constructors: as for vectors, but often also special cases for character strings ("hello") or bit strings (0xF0AD).
Selectors: as for vectors.
Operations: as for vectors, and we can usually obtain the current length of a string. Often special operations for character (strcmp, strcpy) and bit (|, &, $\hat{ }$ ) strings.
Implementation: length==string

or string

length

==0 ?

Recursive Data Types (RDTs)

sequences

a sequence has a base type. A sequence is either empty, or is a value of the base type followed by a sequence. Used for strings, stacks, queues, lists etc.
Constructors: [1, 2, 3] "ABC" 1::[2, 3]
Operations: first, last, length, n-th member, concat, append, remove first/last, begins_with, ends_with, contains, is_empty, pattern matching, iteration over sequence
Implementation: linked lists, vectors (with care), files

RDTs can also be used to build trees, but general graphs can only be represented by e.g. one list of nodes and another list of links between nodes.

Pointers (references)

A pointer is as low-level, dangerous and useful a concept as a go-to: a go-to allows us to move at will in the code, a pointer allows us to move at will in the data. Just as some languages allow us to go-to anywhere, some allow us to point to anywhere - in either case, there is no chance for the language system to check that we are being sensible, and horrible things can happen. Therefore, just as some languages restrict go-tos to certain code items (e.g. labels), some restrict pointers to certain data items (e.g. mallocs) to try and reduce the chance of accidents, often at the expense of more verbose code.

Constructors: some languages allow pointers to any named variables and/or functions e.g. a = &i, some only allow them to anonymous data items e.g. a = malloc(...) and usually also disallow pointer arithmetic.

There are two main rationalisations for restricting pointers to anonymous data items:
extent (sometimes known as dynamic scope) describes the lifetime of a variable. We often want to call a function to create a data structure that uses pointers, and if the lifetime of the objects pointed at is the same as that of the function they were created in, this will not be of any use. However, it is not clear that this means that pointing to data items with a short lifetime should be prohibited. Nor do all languages that disallow pointers to named items cope with the problem that an anonymous item can be explicitly killed (e.g. by free) without all relevant pointers being reset. Some languages use various tricks, such as garbage collection, to help solve this problem.
aliasing is the concept of an object, such as a data item, having more than one name. This makes life more difficult both for humans and compilers trying to understand a program. If a pointer can be used to access a named data item, then the data item has an alias. A similar problem occurs with parameters. However, even if we prohibit pointers to named data items, we can still have more than one pointer to the same anonymous data item, which is still aliasing.

Operations: dereferencing
some languages insist on an explicit dereference operation e.g. *a = 1, whereas some automatically dereference pointers e.g. a = 1. The latter is nifty but dangerous e.g. if a and b are both pointers to integers, a = b changes the value of a, but a = b + 1 changes the value of whatever a points at and leaves a itself unchanged.
There is a related problem with accesses to var- or ref-parameters.
There is also something related going on in assignments e.g. a = b - we treat a and b rather differently, in that b could be anything that can be evaluated, such as a function call or a constant, but a has to be a variable and we do not use its value but instead its location. There are some languages where we can even write e.g. (a ? b : c) = d and the left-hand expression is manipulating the locations of b and c, not their values, and then automatically dereferencing the location to store the value of the right-hand side in it - such a language often treats all variables as if they were references to values, and pointer variables as references to references to values, and so on (e.g. Algol 68).

Other operations: tests for equality, inequality, nul pointer
Pointer arithmetic is dangerous, but some languages allow it (strongly typed languages usually do not). It is often useful, for example, to obtain pointers to components of a data structure, given a pointer to the whole structure e.g. when stepping through the elements of an array. Pointer arithmetic and automatic dereferencing cannot both be allowed in a programming language, unless we can avoid overloading the arithmetic operators - the caveat above hints that even pointer assignment should not be mixed with automatic dereferencing. C function to insert an item into an ordered list using pointers to anonymous (malloc) variables ($CS2111/e*/insert/*):


typedef struct node {int item; struct node* next;} node;

node *insert_1 (int new_item, node *list)
{
  node *temp, *prev, *new= (node*) malloc(sizeof(node));

  new->item= new_item;
  if (!list || new_item<=list->item) {
    new->next= list;
    return new;
  }
  temp= list; prev=NULL;
  while (temp && temp->item<=new_item) {
    prev= temp; temp= temp->next;
  }
  new->next= temp; prev->next= new;
  return list;
}

C function to insert an item into an ordered list using pointers to named variables ($CS2111/e*/insert/*):


node *insert_2 (int new_item, node *list)
{
  node **temp, *new= (node*) malloc(sizeof(node));

  new->item= new_item;
  temp= &list; /* <------------------------ pointer to named variable */
  while (*temp && (*temp)->item<=new_item)
    temp= &((*temp)->next);
  new->next= *temp; *temp= new;
  return list;
}

Language design and implementation

Orthogonality is important: every reasonable combination of data-types should be possible, so programmers do not need to worry about special cases.

Readings

Louden: chs. 6.3, 6.4
Bal & Grune: chs. 2.2.2, 2.5, 2.5.3-2.5.5
Aho, Sethi & Ullman: ch. 7.1

Next: Vectors and Arrays - Up: CS2111: Design and Implementation Previous: Types Contents

Pete Jinks
1999-09-30