

Status and Results

We have successfully implemented the techniques described above in the dHPF compiler, extending it to synthesize a static task graph for the MPI code it generates, including the symbolic processor sets and mappings for communication tasks and edges, and the scaling functions for loop nodes. In computing the condensed static task graph, we collapse all DO-NODEs and sequences of computational tasks that do not contain any communication or any IF-NODEs (we would rely on user intervention to collapse IF-NODEs), and we compute the combined scaling function for each collapsed task.
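
The sketch below illustrates the structure of such a condensing pass. It is not dHPF source code: the node representation, the collapsibility test, and the string-valued scaling functions are simplifications assumed purely for exposition (dHPF maintains the combined scaling functions symbolically). It shows how runs of DO-NODEs and computational tasks that contain no communication and no IF-NODEs can be folded into a single computational task whose scaling function is the combination of the originals.

    #include <string>
    #include <vector>

    enum class NodeKind { COMP_TASK, DO_NODE, IF_NODE, ENDIF_NODE, COMM_NODE };

    struct TaskNode {
        NodeKind kind;
        std::string scaling;            // symbolic scaling function, e.g. "c1*N/P"
        std::vector<TaskNode*> body;    // children of a DO-NODE or IF-NODE
    };

    // A node may be collapsed if it is a loop or computational task whose
    // subtree contains no communication tasks and no IF-NODEs.
    static bool collapsible(const TaskNode* n) {
        if (n->kind == NodeKind::COMM_NODE || n->kind == NodeKind::IF_NODE)
            return false;
        for (const TaskNode* c : n->body)
            if (!collapsible(c)) return false;
        return true;
    }

    // Collapse maximal runs of collapsible siblings into a single COMP-TASK,
    // accumulating a combined (symbolic) scaling function for the run.
    static void condense(std::vector<TaskNode*>& seq) {
        std::vector<TaskNode*> out;
        for (TaskNode* n : seq) {
            if (collapsible(n)) {
                n->kind = NodeKind::COMP_TASK;     // fold the loop into one task
                n->body.clear();
                if (!out.empty() && out.back()->kind == NodeKind::COMP_TASK) {
                    out.back()->scaling += " + " + n->scaling;  // combined scaling
                    continue;                      // merged into the previous task
                }
            } else {
                condense(n->body);                 // recurse below IF-NODEs and loops
            }
            out.push_back(n);
        }
        seq.swap(out);
    }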

We have also partially implemented the support needed to instantiate dynamic task graphs at compile time. In particular, we are able to enumerate the task instances and control-flow edges, and we synthesize, from the symbolic integer sets, the code required to enumerate the edge mappings at compile time. We have not yet linked this code in to enumerate the edges.
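
As an illustration of the enumeration code synthesized from a symbolic integer set, consider a nearest-neighbour shift whose edge mapping is the set {[p] -> [p+1] : 0 <= p <= P-2}. The fragment below is a hand-written sketch, not dHPF output; the function name and the emit_edge callback are illustrative placeholders. It shows the kind of loop nest such a set compiles into: executing it at instantiation time emits one dynamic communication edge per processor pair.

    #include <cstdio>

    // Hypothetical callback that records one dynamic communication edge.
    static void emit_edge(int src_proc, int dst_proc) {
        std::printf("edge: send task on proc %d -> recv task on proc %d\n",
                    src_proc, dst_proc);
    }

    // Enumeration loop synthesized from the set {[p] -> [p+1] : 0 <= p <= P-2}.
    static void enumerate_shift_edges(int P) {
        for (int p = 0; p <= P - 2; ++p)    // loop bounds come from the integer set
            emit_edge(p, p + 1);
    }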

Because of the aggressive computation partitioning and communication strategies used by dHPF, capturing the resulting MPI code requires the full generality of our task graph representation. This gives us confidence that we can synthesize task graphs for a wide range of explicit message-passing programs as well (including all the ones we have examined so far).

In order to illustrate the size of the static task graph generated and the effectiveness of condensing the task graph, Table 1 lists some particulars for the STG produced by the dHPF compiler for three HPF benchmarks: Tomcatv (from SPEC92), jacobi (a simple 2D Jacobi iterative PDE solver), and expl (Livermore Loop #18). The effect of condensing the task graph on reducing the number of loops (DO-NODE) and computational tasks (COMP-TASK) can be observed. After condensing, most of the remaining tasks are either IF-NODEs and dummy nodes (e.g., ENDIF-NODEs), or communication tasks (which are never condensed), since we opted for a detailed representation of communication behavior rather than compromising the accuracy of the representation. The compiler-generated task graphs for these codes can be found at http://www.cs.man.ac.uk/~rizos/taskgraph/


Table 1: Size of STG for various example codes before and after condensing.
                                                  tomcatv                 jacobi                  expl
                                        1st pass condensed     1st pass condensed     1st pass condensed
Lines of source HPF program                        227                     64                      94
Lines of output parallel MPI program              1850                   1156                    3722
Total number of tasks                        247       193          122        83          225       174
    # COMM-NODE                               54        54           36        36          114       114
    # DO-NODE                                 18         3           13         1           17         3
    # COMP-TASK                               39        20           16         5           23         6


The most important application of our compiler-synthesized task graphs to date has been in improving the state of the art in parallel simulation of message-passing programs [5]. Those results are briefly summarized here because they provide the best illustration of the correctness and benefits of the compiler-synthesized task graphs. This work was performed in collaboration with the parallel simulation group at UCLA, using MPI-Sim, a direct-execution parallel simulator for MPI programs [10].

The basic strategy in using the STG for program simulation is to generate an abstracted MPI program from the STG where all computation corresponding to a computational task is eliminated, except those computations whose results are required to determine the control-flow, communication behavior, and task scaling functions. We refer to the eliminated computations as redundant computations (from the point of view of performance prediction), and we use the task scaling functions to generate simple symbolic estimates for their execution time. The simulator can avoid simulating the redundant computations, and simply needs to advance the simulation clock by an amount equal to the estimated execution time of the computation. The simulator can even avoid allocating memory for program data that is referenced only in redundant computations.

The key to implementing this strategy lies in identifying non-redundant computations. To do so, we must first identify the values in the program that determine program performance. These are exactly those variable values that appear in the control-flow expressions, communication descriptors, and scaling functions (both for task times and for communication volume) of the STG. Thus, using the STG makes these values very easy to identify. We can then use a standard program slicing algorithm [18] to isolate the computations that affect these values. We then generate the simplified MPI code by including all the control-flow that appears in the static task graph, all the communication calls, and the non-redundant computations identified by program slicing. All the remaining (i.e., redundant) computations in each computational task are replaced with a single call to a special simulator delay function which simply advances the simulator clock by a specified amount. The argument to this function is a symbolic expression for the estimated execution time of the redundant computation. Note that the simulator continues to simulate the communication behavior in detail.
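
The fragment below sketches what an abstracted computational task might look like for a Jacobi-style sweep followed by a boundary exchange. The delay routine name mpisim_delay, the per-element cost constant C_ELEM, and the function signature are illustrative assumptions rather than actual generated code; the pattern itself, replacing the sliced-away computation by a clock advance while keeping the communication calls intact, is the one described above.

    #include <mpi.h>

    extern void mpisim_delay(double seconds);   // hypothetical: advances the simulated clock
    static const double C_ELEM = 1.0e-8;        // assumed per-element cost estimate (seconds)

    // Abstracted version of one computational task plus its boundary exchange.
    void jacobi_sweep_abstracted(double* boundary_row, double* ghost_row,
                                 int n_local, int n, int up, int down,
                                 MPI_Comm comm) {
        // The sliced-away stencil update is replaced by its symbolic time
        // estimate, obtained from the task scaling function: n_local * n * C_ELEM.
        mpisim_delay((double)n_local * (double)n * C_ELEM);

        // Communication is retained unchanged and simulated in detail.
        MPI_Request req[2];
        MPI_Irecv(ghost_row,    n, MPI_DOUBLE, up,   0, comm, &req[0]);
        MPI_Isend(boundary_row, n, MPI_DOUBLE, down, 0, comm, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }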

We have applied this methodology both to HPF programs (compiled to MPI by the dHPF compiler) and to existing MPI programs (in the latter case, generating the abstracted MPI program by hand). The benchmarks include an HPF version of Tomcatv, and MPI versions of Sweep3D (a key ASCI benchmark) and NAS SP. Table 2 shows the percentage error in the execution times predicted by MPI-Sim using the simplified MPI code, compared with direct program measurement. As can be seen, the error was less than 16% in all cases tested, and less than 10% in most of them. This is important because the simplified MPI program can be thought of as simply an executable representation of the static task graph itself. These results show that the task graph abstraction captures the performance-determining properties of the program very accurately. We believe that the observed errors could be reduced further by applying more sophisticated techniques for estimating the execution time of redundant computations, particularly by incorporating simple estimates of cache behavior.


Table 2: Validation of the compiler-generated task graphs using MPI-Sim.
Benchmark               % Error in prediction vs. measurement
          #Procs =         4         8        16        32        64
Tomcatv                -5.44     15.75     11.79      8.50      9.27
Sweep3D                -7.01     -4.97      9.02      9.80      5.13
NAS SP, class A        -2.59               -1.24      7.11      6.10
NAS SP, class C                             0.09    -14.01     -1.58


The benefits of using the task-graph-based simulation strategy were substantial. For these benchmarks, the optimized simulator requires factors of 5 to 2000 less memory and up to a factor of 10 less time to execute than the original simulator. These dramatic savings allow us to simulate systems and problem sizes 10 to 100 times larger than is possible with the original simulator. They have also allowed us to simulate MPI programs for parallel architectures with hundreds of processors faster than real time, and have made it feasible to simulate the execution of programs on machines with 10,000+ processors. These results are described in more detail in [5].


