Condensing Nodes of the Static Task Graph

Next: Instantiating the Dynamic Task Up: Compiler Techniques for Synthesizing Previous: Synthesizing the Static Task

Condensing Nodes of the Static Task Graph

The process described above produces a first-cut version of the STG. For many typical modeling studies of parallel program performance, however, a less detailed graph will be sufficient. For instance, a coarse-grain modeling approach could assume that all operations of a single process between two communication points constitute a single task. In order to add this functionality, the compiler traverses the STG and marks contiguous nodes, connected by a control-flow edge, that do not include communication. Such sequences of nodes are then collapsed and replaced in the STG by a single condensed task. Note that such a task will have a single point of entry and a single point of exit. For example, the two large dotted rectangles in Figure 1 (c) correspond to sequences of tasks that can be collapsed into a single condensed task.

To preserve precision, the computation of the scaling function of the new condensed task is of particular importance. Ignoring conditional control flow for the moment, this scaling function is the symbolic sum of the scaling functions of the individual collapsed tasks, each multiplied by the symbolic number of iterations of the surrounding loops, where appropriate (only if these loops are also collapsed).

  DO I=1,N
     RECV
     DO J=1,M
        S1
        IF (f)
           S2
        ENDIF
        S3
     ENDDO
     SEND
  ENDDO

$\psfig{file=Figures/fig2b.eps,width=15truemm}$

$\psfig{file=Figures/fig2c.eps,width=20truemm}$

$\psfig{file=Figures/fig2d.eps,width=10truemm}$

(a) Source; (b) Initial STG; (c) Collapsed STG if f=G(I); (d) Collapsed STG if f=Y(J)

In cases with conditional control flow, tasks can sometimes be condensed with no loss of accuracy, using sophisticated compiler analysis. For example, no accuracy will be lost in cases where all dynamic instances of the resulting collapsed task have identical computational time; that is, the workload expressions are free of any input parameters whose value changes for different instances of that task. In other cases, condensing would result in some loss of accuracy, and the goals of the modeling study should be used to dictate the degree to which tasks are collapsed together.

To illustrate, consider the code shown in Figure 2 (a). Let represent the workload (i.e., the scaling function) for statements S1, S2 and S3, respectively. The initial version of the STG is shown in Figure 2 (b); the nodes inside the dotted rectangle are candidates for collapsing. Assuming that the function f in the IF depends on at least one of I or J, we distinguish between:

If f is a function of I only, the IF statement can be moved outside the J loop and the J loop can be collapsed with no loss of accuracy. In this case, we are left with two separate tasks, representing the two possible versions of the J loop, as shown in part (c) of the figure. These two tasks have scaling functions given by $M\times (w_1+w_2+w_3)$ , $M\times (w_1+w_3)$ , respectively.
If f is a function of J only, the code can be condensed into a single task as shown in Figure 2(d). The scaling function of the task would be $M \times w$ , where is the workload inside the J loop body per iteration of the I loop.
Finally, if f is a function of both I and J, we can condense the code only by introducing a branching probability parameter. If I represents the probability that S2 will be executed for a given value of I, then the entire code inside the dotted rectangle can be condensed into a single task (as in part (d)) with a combined scaling function given by M $\times(w_1+p($ I $)\times w_2 + w_3)$ . Since this probabilistic expression for execution time can lead to inaccuracies, the decision to condense the task graph in such cases should depend on the goals of the modeling study.

The three cases can be differentiated using well-known but aggressive dataflow analysis. We note that the first two cases correspond directly to loop unswitching and identifying loop-invariant code respectively, except that only the static task graph is modified and the code itself is not transformed. A key point to note is that in the first two cases, there is no resulting loss of accuracy in condensing the task graph. For example, in the ASCI benchmark Sweep3D [4] used in Section 4, the one significant branch is in fact of the first type, which can be pulled out of the task and enclosing loops (the analysis would have to be interprocedural because the enclosing loops are not in the same procedure as the branch).

Next: Instantiating the Dynamic Task Up: Compiler Techniques for Synthesizing Previous: Synthesizing the Static Task

Rizos Sakellariou 2000-10-16