On Race Conditions

A long time ago I worked for ICL on the development of the 3900 range. The
system could be seen as organised according to a series of priority levels.
The OS had levels 15 down to 2, lower ones being for hardware and testing. So
the raw instruction set could be seen as level 1, and the microcode (internal
instructions out of which programmer level instructions were built) could be
seen as level 0. I worked on the address translation microcode, which could
be seen as level -1, a yet more primitive set of instructions that performed
service functions for the main level 0 microcode. Conceptually, each level
could pre-empt the higher-numbered levels above it, but had the responsibility
of leaving the system in a state that those levels were capable of understanding.

One day, while I was completing the coding of my subsystem, my code was
required to provide a certain function. It turned out to be impossible as stated,
because part of the data that was involved was inaccessible under the
circumstances. I consulted my immediate boss. He said to complete the
functionality using a so-called 'auxiliary function'. This involved planting
an interrupt to the level 0 microcode and getting it to invoke the required
auxiliary function in my code. I could see that this would work on its own
terms, but I was very uneasy about doing it this way. In between the start
of the task and its completion via the auxiliary function, the system would
be left in an inconsistent state. On the other hand, the design was such that
interrupts from my subsystem had the very highest priority among all the things
that could interrupt at level 0, so, when it occurred, my interrupt would run
first, and the inconsistency would be cleaned up before any other part of
the system could be exposed to it. In the event, that's what we did. Nothing
could go wrong, right?

Things continued swimmingly. The development was completed, machines began
to be sold. Everyone was happy. We moved on to develop the multi-node version
of the machines. In these, several machines would be tightly coupled at the
memory level. To preserve consistency, some memory writes (including things
called semaphores) needed to be strictly serialised, i.e. there had to be
global agreement among all the coupled machines about the order in which they
took place. There was provision for these in the design of the single machines,
but it proved to be hopelessly inefficient (basically because the OS people
had not managed to rewrite the OS so as to drastically reduce the number of
these semaphores ... eventually they did this, and the original design was
reverted to, but something had to be done NOW). A new piece of hardware was
conceived to handle the high number of semaphores efficiently. It was coupled
directly to the memory, in effect introducing a new level -2, with even higher
pre-emption capability than my level -1 code.

The technically savvy can imagine the rest at this point...

Everything went swimmingly. The development was completed, testing revealed
no problem, multi-node machines began to be sold. Everyone was happy. At that
point I left ICL to join the university.

British Gas were buying machines as fast as they could be built to keep ICL
in business. (From their point of view, if there was anything worse than
running their code on ICL hardware, it was not having any ICL hardware to
run their code on.) One day there was an urgent call to the development team.
One of their multi-node systems had seized up over a period of about 15 minutes.
A semaphore had gone missing, and as a consequence, the entire system had
ended up in a tree of queues waiting for the semaphore that was never going
to come back. Of course all the internal hardware logs that retained a bit
of diagnostic information had been overwritten millions of times during this
15-minute death rattle, so there was no clue. It was flagged as a 'red alert'
problem (which means 'fix within 24 hours ... or else').

No one had any idea what was going on. Various noises were made, ruffled
feathers were soothed, etc. No one got too upset though. Most people involved
had seen it all before. About once a month a semaphore would go missing, and
the system would go into its 15-minute death rattle. The BG guys just shrugged
their shoulders and restarted the system from the last checkpoint.

Back at base, the commissioning team carried on investigating. One day a test
program stopped dead. Unlike an operating system, which is designed to soldier
on at all costs, a good test program will stop at the first whiff of trouble.
A semaphore had gone missing. The lads eagerly picked over the steaming
entrails of the fresh kill. It seemed that just prior to the semaphore going
west, the system had been doing a guard page interrupt (it doesn't matter what
that is). The lads wrote new tests that did more or less nothing except guard
page interrupts and semaphores. Pretty soon, semaphores were being lost to
order. They looked more closely. It was the functionality mentioned earlier.
Because of the new multi-node hardware, which had higher priority than my code,
the semaphores could sneak in, in between the two parts of the functionality,
see the inconsistent state, and consequently get written to a spurious
location. The rest of the system meanwhile waited for the semaphore in the
place it should have gone. When this was realised, the functionality was
redesigned (in a much less efficient, though now correct manner), and machines
stopped losing semaphores. It had taken 18 months from the original red alert.
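
For readers who want to see the shape of the failure, here is a toy model in
Python. It is only a sketch under loud assumptions: the names `run`, `state`
and the page labels "old", "scratch" and "new" are invented for illustration,
and nothing here reconstructs the real microcode. It shows a two-part update
with a transient inconsistent value in the middle, and a writer that either
waits for the update to finish or gets in during the gap.

```python
def run(preempt_in_gap):
    """Simulate a two-part update, optionally pre-empted in the middle.

    Hypothetical model: `state["page"]` stands for where a semaphore write
    will land; "scratch" stands for the transient, inconsistent value.
    """
    state = {"page": "old"}
    memory = {}

    def write_semaphore():
        # The writer trusts whatever the state says at this instant.
        memory[state["page"]] = "semaphore"

    # Part 1 of the update: the state is now inconsistent.
    state["page"] = "scratch"

    if preempt_in_gap:
        # A higher-priority writer runs between the two parts.
        write_semaphore()

    # Part 2 of the update (the deferred half): consistency restored.
    state["page"] = "new"

    if not preempt_in_gap:
        # The writer runs only after the update has completed.
        write_semaphore()

    return memory


if __name__ == "__main__":
    print(run(preempt_in_gap=False))
    print(run(preempt_in_gap=True))
```

Without pre-emption the semaphore lands on "new", where everyone else will
look for it. With pre-emption it lands on "scratch", and anything waiting on
"new" waits forever. The toy is deterministic; the real thing needed the
pre-emption to fall inside a very narrow window, which is why it showed up
about once a month.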

That's what race conditions are like.