next up previous
Next: Evaluation and Experimental Results Up: Partitioning for Load Balance Previous: Canonical Loop Nests


Generalized Loop Nests

The previous section described an approach to partitioning canonical loop nests, as introduced in Definition 1. In this section we re-consider loop nests having the general form shown in Figure 2, but without the restrictions associated with Definition 1, except that $l_1 \leq u_1$ (i.e., the loop nest is not empty). Then, applying index set splitting, the original loop nest can be transformed into multiple adjacent loop nests each of which satisfies the requirements for partitioning set by Theorem 2 or it is rectangular.

Consider the loop nest shown in Figure 2; the first step consists of finding the values of $i$ which satisfy the inequalities $l_1 \leq i \leq u_1 \mbox{~~and~~} l_{21}i+l_{22} \leq u_{21}i+u_{22}.$ If no such values exist, then the loop with index $j_2$ is never executed. If there are such values, given by $l_1 \leq i \leq u_1$, the loop with index $j_2$ is always executed; therefore, this loop is canonical with respect to the outer loop, and no index set splitting is required. Conversely, if there is a subset of the values of $i$ which satisfies both inequalities, say $l_1 \leq i \leq u'_1$, where $u'_1 < u_1$, then the outer loop must be split into two consecutive loops, the first of which corresponds to the values given by $l_1 \leq i \leq u'_1$, while the second corresponds to $u'_1+1 \leq i \leq u_1$; for the former values, the loop with index $j_2$ is canonical with respect to the outer loop, while, for the latter, it is never executed.

If, as a result of the previous step, there are some values of $i$ for which the two outermost loops are canonical, the next step consists of finding the values of $i, j_2$ which make the three outermost loops canonical; these values must satisfy the system of inequalities

\begin{eqnarray*}
& l'_1 \leq i \leq u'_1 \\
& l_{21}i+l_{22} \leq j_2 \leq u...
...} \\
& l_{31}i+l_{32}j_2+l_{33} \leq u_{31}i+u_{32}j_2+u_{33},
\end{eqnarray*}



where the first inequality corresponds to those values of $i$ that make the two outermost loops canonical.

The same procedure is repeated for each loop, successively, until there are no remaining loops or else a given system of, say $k$, $2 \leq k \leq m$, inequalities has no solutions (this would imply that the loop with index $j_k$ is never executed for these values of $i, j_2, \dots, j_{k-1}$ that render the $k-1$ outermost loops canonical). Note that, in the case where the original loop nest contains more than one consecutive loops at some level, the same procedure should be applied for each loop separately.

 
Example 2Consider the loop nest shown in Figure 4.a. Since there are two consecutive loop nests in the body of the I loop, the procedure described above must be applied separately for each of them.

For the first loop nest, the J loop is canonical with respect to the outer loop; the K loop is canonical (with respect to the I, J loops) when 2*I-J $\leq$ 1000  $\Longleftrightarrow$ J $\geq$ 2*I-1000. Thus, the J loop must be split into two consecutive loops depending on which values of J render the K loop canonical; the bounds of the first loop will be 1 and MAX(1,2*I-1000)-1, and of the second loop MAX(1,2*I-1000) and I. Since the body of the J loop does not contain statements other than the K loop, no statements are executed in the case when 1 $\leq$ J $\leq$ MAX(1,2*I-1000)-1; hence, the corresponding loop can be eliminated.


Table 1: Expected load imbalance and relative load imbalance for upper triangular matrix multiplication.
5.0pt
N Mapping Number of processors
scheme 12 16
$L$ $L_R$ $L$ $L_R$ $L$ $L_R$ $L$ $L_R$ $L$ $L_R$
256 KAP/MARS 1056768 .428 923648 .566 577024 .620 356749.3 .602 319360 .644
CYC 8256 .006 12416 .017 14560 .040 15331.3 .061 15760 .082
BCS 382 .000 13271 .018 7183 .020 6329.3 .026 9828 .053
CAN-2 262144 .156 229376 .245 143360 .288 82091.3 .258 79360 .310
CAN-3 0 .000 0 .000 0 .000 50.3 .000 512 .003
1024 KAP/MARS 67239936 .428 58818560 .567 36757504 .621 22978604 .606 20346880 .645
CYC 131328 .001 197120 .004 230272 .010 241550 .016 247360 .022
BCS 151255 .002 118940 .003 193075 .009 322800 .021 281770 .025
CAN-2 16777216 .158 14680064 .247 9175040 .290 6228806 .294 5079040 .312
CAN-3 0 .000 0 .000 0 .000 48713 .003 0 .000


For the second loop nest, the J loop is canonical with respect to the outer loop when 2*I-500 $\leq$ 1000  $\Longleftrightarrow$ I $\leq$ 750; the index of the I loop is split accordingly. Furthermore, the K loop is canonical when I+J $\leq$ 1000  $\Longleftrightarrow$ J $\leq$ 1000-I. The code resulting after applying the appropriate transformations is shown in Figure 4.b. Evaluating MAX(1,2*I-1000), by replacing it with appropriate conditionals which are then removed using index set splitting (see Section 3.2), results in the code shown in Figure 4.c (note that MIN(1000,1000-I) is always equal to 1000-I since I takes only positive values). Finally, partitioning I for each loop nest as in Section 3.2, and grouping the partitions according to (7) for $m=3$, the partitioned code can lead to perfect load balance when using 5 processors, and, in general, a comparatively low value of load imbalance [15].$\Box$


next up previous
Next: Evaluation and Experimental Results Up: Partitioning for Load Balance Previous: Canonical Loop Nests
Rizos Sakellariou 2000-07-31