Towards Petaflop Computing

A Massive FLOP

Toby Howard

This article first appeared in Personal Computer World magazine, February 1997.

GIVEN THE COMPUTING NEEDS of most of us, if we owned a top-of-the-range home machine, we'd probably be more than happy. Today (early 1997), such a system would typically centre around a 200 MHz processor with 32 MB of memory and at least a 2-gigabyte disk. It's hard to think of such a high-spec system as anything but a self-contained machine to impress the neighbours with, but US researchers are now taking 9,000 -- yes, nine thousand -- of these systems and using them as the building blocks for a single, enormous machine. It's called TFLOPS, and it will be the most powerful supercomputer the world has ever seen.

One way to express the performance of a computer is to measure its 'flops' -- not its failures, but the number of floating-point operations it can perform each second. A single 200 MHz Pentium Pro, for example, can perform at best about 130 megaflops (Mflops, 10⁶ flops). Since the 1970s, supercomputers have operated thousands of times faster, measured in gigaflops (Gflops, 10⁹ flops). The new TFLOPS machine, based at Sandia National Laboratories in Albuquerque, will perform a thousand times faster again, at a predicted peak of 1.8 teraflops (Tflops, 10¹² flops).
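The 1.8 teraflop figure can be roughly reconstructed from the numbers quoted so far. The sketch below is a back-of-envelope check, not the designers' own calculation: it assumes each processor completes one floating-point operation per clock cycle at peak, which the article does not state, and the function name `peak_flops` is purely illustrative.

```python
# Back-of-envelope peak estimate from figures quoted in the article.
PROCESSORS = 9_000        # quoted in the article
CLOCK_HZ = 200e6          # 200 MHz, quoted in the article
FLOPS_PER_CYCLE = 1       # ASSUMPTION: one flop per cycle at peak


def peak_flops(processors, clock_hz, flops_per_cycle):
    """Theoretical peak: every processor busy on every cycle."""
    return processors * clock_hz * flops_per_cycle


peak = peak_flops(PROCESSORS, CLOCK_HZ, FLOPS_PER_CYCLE)
print(f"Theoretical peak: {peak / 1e12:.1f} Tflops")

# Using the 'at best about 130 megaflops' per-processor figure instead
# gives a lower estimate for the whole machine.
quoted_best = PROCESSORS * 130e6
print(f"From the 130 Mflops figure: {quoted_best / 1e12:.2f} Tflops")
```

Under that assumption the product comes out at the predicted peak, while the more cautious per-processor figure gives a little under 1.2 teraflops. Either way, these are ceilings that real programs would rarely reach.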

Although TFLOPS will be available for general scientific research, its raison d'etre is virtual weapons testing, now a hot research topic since experimental detonations are no longer considered politically or socially acceptable by most enlightened nations. However, mathematically predicting the behaviour of nuclear weapons is an extremely complex problem, requiring enormous computing power to obtain even approximate solutions.

Traditionally, the supercomputer industry has used custom-built designs, featuring special architectures tailored to the sorts of problems the machines will be used to solve. For example, many problems arise in fields such as seismic modelling, weather forecasting, and fluid flow analysis which involve huge amounts of repeated calculations. One of the earliest special supercomputer architectures to support such calculations was the "vector pipeline processor", which gave huge speed increases by allowing calculations of streams of numbers to be overlapped, in the same way that partially assembled components pass down a constantly moving assembly line. This led to enormous speed-ups, but only for this particular type of large-scale numerical problem.
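The assembly-line analogy can be put into numbers with a very simple timing model. The sketch below is an idealised illustration rather than a description of any particular machine: it assumes every pipeline stage takes exactly one cycle and that nothing ever stalls, and the function names are hypothetical.

```python
def unpipelined_cycles(n_items, n_stages):
    """Each item passes through every stage before the next item starts."""
    return n_items * n_stages


def pipelined_cycles(n_items, n_stages):
    """The first item takes n_stages cycles to emerge; after that,
    one finished item leaves the pipeline on every cycle."""
    if n_items == 0:
        return 0
    return n_stages + (n_items - 1)


def speedup(n_items, n_stages):
    """Ratio of unpipelined to pipelined time for the same work."""
    return unpipelined_cycles(n_items, n_stages) / pipelined_cycles(n_items, n_stages)


# A long stream of numbers through a 4-stage pipeline approaches a 4x gain;
# a very short stream gains far less.
print(speedup(1000, 4))
print(speedup(2, 4))
```

In this model the gain approaches the number of stages only when the stream is long, which is one way to see why the technique pays off for large-scale numerical work and much less for short, irregular calculations.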

The TFLOPS design philosophy is different: the idea is to achieve high performance by connecting together thousands of off-the-shelf components. TFLOPS is a massively parallel machine: instead of the single processor we find in most PCs, it contains many thousands of processors, each of which can operate quite independently of the others. If a calculation can be broken down into multiple parts that can be executed simultaneously, and the results later combined, the parallel approach can give huge savings in compute time.
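The pattern of breaking a calculation into parts, working on them at the same time, and combining the results can be sketched in a few lines. This is only an illustration of the structure, using Python's standard thread pool on one machine. It says nothing about how TFLOPS itself is programmed, threads in standard Python will not actually speed up this arithmetic, and the function names are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Cut the data into at most `parts` contiguous chunks."""
    size = max(1, (len(data) + parts - 1) // parts)
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_sum_of_squares(chunk):
    """The piece of work handed to one worker."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, workers=4):
    """Split the problem, solve the pieces concurrently, combine the results."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    return sum(partials)


print(parallel_sum_of_squares(list(range(10))))
```

The approach works here because no chunk needs to know anything about any other chunk until the final addition.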

Although Intel has been building parallel supercomputers from arrays of standard processors since the early 1980s, what sets TFLOPS apart is its sheer scale. Its architecture is fearsome, comprising over 4,500 motherboards known as 'compute nodes', each of which has two 200 MHz Pentium Pros and 64 MB of memory. With 2,000 gigabytes of hard disk, this giant occupies 85 cabinets on 1,600 square feet of floor space, and will draw 800 kW of power. The operating system is Intel's UNIX-based Paragon system, together with a light-weight kernel which runs in each processor. TFLOPS is costing the US Department of Energy a cool $46 million. One assumes that includes a technical support hotline.

Once you've got a supercomputer, the problem is how to program it effectively to make the most of its power. For a massively parallel machine like TFLOPS, the trick is to structure the program such that pieces of it can be distributed to separate processors, which can work on each part of the problem in parallel.

There are two ways to do this. The first is for the programmer to carefully structure the algorithm for the solution of the overall problem, such that it can be expressed as a collection of smaller, independent algorithms. In general this is hard, and relies on the insight and ingenuity of the programmer. The second approach is to write the program as if it were for a single processor, and let the compiler analyse the code and partition it into portions that can execute in parallel. Modern 'parallelising' compilers can do an excellent job of this, but the results are rarely quite as good as a program originally designed with parallel processing in mind.
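Whether a programmer or a compiler does the partitioning, the central question is whether one step depends on the result of another. The two hypothetical loops below illustrate the difference. They are a simplified sketch of the kind of dependency involved, not a description of how any real parallelising compiler works.

```python
def scale(values, factor):
    """Every result depends only on its own input, so the iterations
    could be handed to separate processors in any order."""
    return [v * factor for v in values]


def running_total(values):
    """Each result depends on the one before it, so the loop as written
    cannot simply be cut into independent pieces."""
    totals = []
    accumulated = 0
    for v in values:
        accumulated += v          # carries state from the previous iteration
        totals.append(accumulated)
    return totals


print(scale([1, 2, 3], 2))
print(running_total([1, 2, 3]))
```

A running total can still be computed in parallel, but only by restructuring the algorithm, which is the kind of redesign the first approach relies on.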

TFLOPS will be the world's most 'super' supercomputer for some years, but an apocryphal law of computing states that eventually usage expands to consume all available resources. Incredible as performance measured in teraflops might seem, research is already underway on machines which will be a thousand times faster still. The next target is the petaflop -- 10¹⁵ flops.

At a recent workshop in California, many of the major figures in supercomputer design met to consider petaflop technologies. Key issues included predictions that miniaturisation would have to exceed the nanometre scale, employing biological construction techniques; that a peta-computer would need at least 30 terabytes of RAM; and that the best chances of success would involve hybrid technologies of superconductors, nanotechnology, optics, and perhaps quantum computation.

The experts predict that petaflop performance will be achieved around 2020. After that will undoubtedly come the next thousand-scale hike towards the exaflop (10¹⁸ flops) and then the thousand-exaflop (there isn't a name for that yet).

Nobody can possibly know what computing will be like twenty-five years from now. The only sure bet must be that we'll look back and wonder how we ever managed to do anything with 200 MHz and 32 MB.

Toby Howard teaches at the University of Manchester.