

Synthesizing the Static Task Graph

Four key steps need to be performed in synthesizing the static task graph (STG): (1) generating computation and control-flow nodes; (2) generating communication tasks for each logical communication event; (3) generating symbolic sets describing the processors that execute each task, and symbolic mappings describing the pattern of communication edges; and (4) eliminating excess control-flow edges.
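Conceptually, each of these steps populates information attached to the nodes and edges of the STG. The following Python sketch shows one possible shape for these records; the class and field names are hypothetical and do not reflect the actual dHPF data structures.

# Hypothetical sketch of the information attached to STG nodes and edges.
# Class and field names are illustrative only, not the dHPF implementation.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class STGNode:
    kind: str                          # e.g. "COMP-TASK", "DO", "ENDDO", "IF", "COMM-TASK"
    statements: List[str] = field(default_factory=list)  # grouped computation statements (step 1)
    processor_set: str = ""            # symbolic set of executing processors (step 3)
    scaling_fn: Optional[str] = None   # symbolic workload/iteration-count/message size, added later

@dataclass
class STGEdge:
    src: STGNode
    dst: STGNode
    kind: str = "control"              # "control" or "synchronization"
    mapping: Optional[str] = None      # symbolic processor mapping on communication edges (step 3)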

Generating computation and control-flow nodes in the STG can be done in a single preorder traversal of the internal representation of each procedure; in our case, the representation is an Abstract Syntax Tree (AST). STG nodes are created as the relevant statements are encountered in the AST. Thus, program statements such as DO, IF, CALL, PROGRAM/FUNCTION/SUBROUTINE, and STOP/RETURN each trigger the creation of a single node in the graph; in addition, a DO statement leads to the creation of an ENDDO-NODE, and an IF statement to the creation of an ENDIF-NODE, a THEN-NODE, and an ELSE-NODE. Any contiguous sequence of other computation statements that are executed by the same set of processors is grouped into a single computational task (contiguous meaning that the sequence is not interrupted by any of the above statements or by communication).
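A minimal, self-contained sketch of this traversal is given below; AST nodes are modeled as plain Python dictionaries, and all names are illustrative rather than taken from the dHPF implementation.

# Sketch of the preorder traversal that creates STG nodes (illustrative only).
# An AST statement is a dict with a "kind", optional nested statement lists,
# and a "pset" recording the processor set from the partitioning phase.
CONTROL_KINDS = {"DO", "IF", "CALL", "PROGRAM", "STOP", "RETURN", "COMM"}

def build_nodes(ast, stg=None, task=None):
    """Append STG nodes (dicts) to stg in preorder; contiguous computation
    statements with the same processor set are grouped into one task."""
    stg = [] if stg is None else stg
    for stmt in ast:
        kind = stmt["kind"]
        if kind in CONTROL_KINDS:
            task = None                                   # close the open computation task
            stg.append({"kind": kind})
            if kind == "DO":                              # loop body, then the matching ENDDO
                build_nodes(stmt.get("children", []), stg)
                stg.append({"kind": "ENDDO"})
            elif kind == "IF":                            # THEN/ELSE branches, then ENDIF
                stg.append({"kind": "THEN"})
                build_nodes(stmt.get("then", []), stg)
                stg.append({"kind": "ELSE"})
                build_nodes(stmt.get("else", []), stg)
                stg.append({"kind": "ENDIF"})
        else:
            # group contiguous computation statements with an identical processor set
            if task is None or task["pset"] != stmt.get("pset"):
                task = {"kind": "COMP-TASK", "pset": stmt.get("pset"), "stmts": []}
                stg.append(task)
            task["stmts"].append(stmt["text"])
    return stg

# Example: a DO loop whose two body statements share the same processor set.
ast = [{"kind": "DO", "children": [
           {"kind": "ASSIGN", "text": "A(I,J)=...", "pset": "all"},
           {"kind": "ASSIGN", "text": "B(I,J)=...", "pset": "all"}]}]
print([n["kind"] for n in build_nodes(ast)])              # ['DO', 'COMP-TASK', 'ENDDO']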

Identifying statements that are executed by the same set of processors is a critical aspect of the above step. This information is derived from the computation partitioning phase of the compiler and is translated into a symbolic integer set [1] that is included with each task. Because the set of processors associated with each task is represented in this general form, the STG can describe sophisticated computation partitioning strategies. The explicit set representation also enables us to check equality by direct set operations, in order to group statements into tasks. These processor sets are also essential for the fourth step listed above, namely eliminating excess control-flow edges between tasks so as to expose program parallelism. In particular, a control-flow edge is retained between two tasks only if the intersection of their processor sets is non-empty. Otherwise, the sink node is connected to its nearest ancestor in the STG whose processor set has a non-empty intersection with its own.
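The edge-elimination rule of step (4) can be sketched as follows; finite Python sets stand in here for the symbolic integer sets of [1], and the function names are ours.

# Sketch of excess control-flow edge elimination (illustrative names only).
# Real processor sets are symbolic integer sets [1]; finite sets stand in here.

def retain_edge(src_pset, dst_pset):
    """A control-flow edge is kept only if the two tasks share processors."""
    return len(src_pset & dst_pset) > 0

def connect(sink, ancestors, edges):
    """Connect sink to its nearest ancestor whose processor set intersects
    sink's own; ancestors is ordered from nearest to farthest."""
    for anc in ancestors:
        if retain_edge(anc["pset"], sink["pset"]):
            edges.append((anc["id"], sink["id"]))
            return

# Example: a task on processors {0,1} need not follow a task on {2,3};
# its incoming edge is redirected to the closest overlapping ancestor.
t_a = {"id": "A", "pset": {0, 1, 2, 3}}
t_b = {"id": "B", "pset": {2, 3}}
t_c = {"id": "C", "pset": {0, 1}}
edges = []
connect(t_c, [t_b, t_a], edges)
print(edges)                      # [('A', 'C')]  -- the B->C edge is eliminated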

When the first communication statement for a logical communication event is encountered, the communication event descriptor and all the communication tasks pertaining to that single event are built. The processor mappings describing the synchronization between these tasks are also built at this time.
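A minimal sketch of such an event descriptor is given below, using the irecv/isend/wait-recv/wait-send task structure that appears in Figure 1; the function name, the event name, and the set/mapping strings are hypothetical.

# Illustrative sketch of a communication event descriptor (names are ours).
# The event owns one task per communication operation plus the processor
# mappings attached to the synchronization edges between those tasks.

def make_comm_event(event_id, send_pset, recv_pset, mapping, inverse_mapping):
    tasks = {
        "IRECV":     {"pset": recv_pset},
        "ISEND":     {"pset": send_pset},
        "WAIT-RECV": {"pset": recv_pset},
        "WAIT-SEND": {"pset": send_pset},
    }
    # Synchronization edges and their symbolic processor mappings.
    sync_edges = [("ISEND", "WAIT-RECV", mapping),
                  ("IRECV", "ISEND", inverse_mapping)]
    return {"event": event_id, "tasks": tasks, "sync": sync_edges}

# Hypothetical event for the boundary exchange of Figure 1.
ev = make_comm_event("A-boundary-shift",
                     send_pset="{[p0] : 0 < p0 < P}",
                     recv_pset="{[q0] : 0 <= q0 < P-1}",
                     mapping="{[p0]->[q0] : q0 = p0-1}",
                     inverse_mapping="{[p0]->[q0] : q0 = p0+1}")
print(len(ev["tasks"]), len(ev["sync"]))    # 4 2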

Figure 1: An example of generating the communication tasks.
CHPF$ DISTRIBUTE A(*,BLOCK)
      DO I=2,N
         DO J=1,M-1
            A(I,J) = A(I-1,J+1)
         ENDDO
      ENDDO

(a) HPF source code fragment

blk = block size per processor
DO I=2,N
    IF (myid < P-1) irecv B(i, myid*blk+blk+1) from myid+1

    ! Execute local iterations of j-loop
    DO J=myid*blk+1, min(myid*blk+blk-1, M-1)
        A(I,J) = A(I-1,J+1)
    ENDDO

    IF (myid > 0) isend B(i, myid*blk+1) to myid-1
    IF (myid < P-1) wait-recv
    IF (myid > 0) wait-send

    ! Execute non-local iterations of j-loop
    J=myid*blk+blk
    IF (J <= M-1) A(I,J) = A(I-1,J+1)
ENDDO

(b) Unoptimized MPI code generated by dHPF

(c) Static task graph

For an explicit message-passing program, the computation partitioning information can be derived by analyzing the control-flow expressions that depend on process-id variables. The communication pattern information has to be extracted by recognizing the communication calls syntactically, analyzing their arguments, and identifying the matching send and receive calls. In principle, both the control flow and the communication can be written in a manner that is too complex for the compiler to decipher, and some message-passing programs will probably not be analyzable. In most of the programs we have examined, however, the control-flow idioms used for partitioning the computation and the types of message-passing operations used are fairly simple. We believe that the analysis required to construct the STG is feasible with standard interprocedural symbolic analysis techniques [16].
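As a rough illustration of this kind of idiom recognition, the sketch below scans a line of message-passing code for a guard on an assumed process-id variable (myid or rank) and for syntactically recognizable send/receive calls; it is only a caricature of the interprocedural symbolic analysis that would actually be needed.

# Rough sketch of recognizing partitioning guards and communication calls
# in explicit message-passing code.  The variable names and call patterns
# below are assumptions for illustration, not an exhaustive recognizer.
import re

ID_VARS = ("myid", "rank")                    # assumed process-id variable names
GUARD = re.compile(r"IF\s*\((.*?)\)", re.IGNORECASE)
COMM  = re.compile(r"\b(MPI_I?SEND|MPI_I?RECV|isend|irecv)\b", re.IGNORECASE)

def classify(line):
    """Return (processor-set predicate, communication call) found on a line."""
    guard = GUARD.search(line)
    pred = guard.group(1) if guard and any(v in guard.group(1) for v in ID_VARS) else None
    call = COMM.search(line)
    return pred, (call.group(1) if call else None)

print(classify("IF (myid > 0) isend B(i, myid*blk+1) to myid-1"))
# -> ('myid > 0', 'isend')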

To illustrate the communication information built by the compiler, consider the simple HPF example shown in Figure 1(a); the MPI parallel code generated by the dHPF compiler is shown in Figure 1(b). Parallelizing the code requires the boundary values of array A along the $j$ dimension to be communicated inside the I loop. (In practice, the compiler pipelines the communication in larger blocks by strip-mining the I loop [17], but that is omitted here to simplify the example.) The corresponding STG is shown in Figure 1(c). Solid lines represent control-flow edges and dashed lines represent interprocessor synchronization. In this example, the compiler uses the non-blocking MPI communication primitives. The two dashed lines show that the wait-recv operation cannot complete until the isend is executed, and that the isend cannot complete until the irecv is issued by the receiver (the latter holds because our target MPI library uses sender-side buffering). Also, the compiler interleaves the communication tasks and the computation so as to overlap the waiting time at the isend with the computation of local loop iterations, i.e., the iterations that do not read or write any off-processor data. The use of explicit communication tasks allows this overlap to be captured precisely in the task graph. The dashed edge between the isend and the wait-recv tasks is associated with the processor mapping $\{[p_0]\rightarrow[q_0]: q_0 = p_0 - 1 \wedge 0 \leq q_0 < P-1 \}$, denoting that each processor receives data from its ``right'' neighbor, except the rightmost boundary processor. The other dashed edge has the inverse mapping, i.e., $q_0 = p_0 + 1$.
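For concreteness, the sketch below enumerates the instances of this mapping for a small processor count; the compiler itself keeps the mapping in symbolic form rather than enumerating it, so the code is purely illustrative.

# Enumerate the isend -> wait-recv mapping
#   {[p0] -> [q0] : q0 = p0 - 1  and  0 <= q0 < P-1}
# for a small P (illustration only; the STG stores the mapping symbolically).

def edge_instances(P):
    """Yield (sending processor p0, receiving processor q0) pairs."""
    for p0 in range(P):
        q0 = p0 - 1
        if 0 <= q0 < P - 1:
            yield (p0, q0)

print(list(edge_instances(4)))    # [(1, 0), (2, 1), (3, 2)]
# Every processor except the rightmost one (q0 = 3) receives from its "right"
# neighbour; the irecv -> isend edge uses the inverse mapping, q0 = p0 + 1.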

Finally, the compiler constructs the symbolic scaling functions for each task and communication event, using direct symbolic analysis of loop bounds and message sizes. For a communication event, the scaling function is simply an expression for the message size. For each DO-NODE, the scaling function describes the number of iterations executed by each processor, as a function of processor-id variables and other symbolic program variables. In the simple example above, the scaling functions for the two DO nodes are N-1 and min(myid*blk+blk-1,M-1) - (myid*blk+1) + 1, respectively. For a computational task, the scaling function is a single parameter representing the workload corresponding to the task. At this stage of the task graph construction, no further handling (such as symbolic iteration counting for more complex loop bounds) takes place.
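The sketch below writes these scaling functions symbolically using SymPy; the variable names follow Figure 1(b), but the use of SymPy, and the message-size expression (one boundary element per outer iteration, as suggested by the single-element irecv in the figure), are our own illustration rather than part of dHPF.

# Scaling functions for the example of Figure 1, written with SymPy.
# This is an illustration of the symbolic form only, not dHPF code.
from sympy import symbols, Min, simplify

N, M, myid, blk = symbols("N M myid blk", integer=True)

outer_iters = N - 1                                                  # DO I = 2, N
inner_iters = Min(myid*blk + blk - 1, M - 1) - (myid*blk + 1) + 1    # local J loop
msg_size    = 1     # assumed: one boundary element of A per outer iteration

print(outer_iters)
print(simplify(inner_iters))
print(msg_size)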


