Generalized Loop Nests

The previous section described an approach to partitioning canonical loop nests, as introduced in Definition 1. In this section we re-consider loop nests having the general form shown in Figure 2, but without the restrictions associated with Definition 1, except that $l_1 \leq u_1$ (i.e., the loop nest is not empty). Then, applying index set splitting, the original loop nest can be transformed into multiple adjacent loop nests each of which satisfies the requirements for partitioning set by Theorem 2 or it is rectangular.

Consider the loop nest shown in Figure 2; the first step consists of finding the values of

which satisfy the inequalities $l_1 \leq i \leq u_1 \mbox{~~and~~} l_{21}i+l_{22} \leq u_{21}i+u_{22}.$ If no such values exist, then the loop with index

is never executed. If there are such values, given by $l_1 \leq i \leq u_1$ , the loop with index

is always executed; therefore, this loop is canonical with respect to the outer loop, and no index set splitting is required. Conversely, if there is a subset of the values of

which satisfies both inequalities, say $l_1 \leq i \leq u'_1$ , where

, then the outer loop must be split into two consecutive loops, the first of which corresponds to the values given by $l_1 \leq i \leq u'_1$ , while the second corresponds to $u'_1+1 \leq i \leq u_1$ ; for the former values, the loop with index

is canonical with respect to the outer loop, while, for the latter, it is never executed.

If, as a result of the previous step, there are some values of

for which the two outermost loops are canonical, the next step consists of finding the values of

which make the three outermost loops canonical; these values must satisfy the system of inequalities

$\begin{eqnarray*} & l'_1 \leq i \leq u'_1 \\ & l_{21}i+l_{22} \leq j_2 \leq u... ...} \\ & l_{31}i+l_{32}j_2+l_{33} \leq u_{31}i+u_{32}j_2+u_{33}, \end{eqnarray*}$

The same procedure is repeated for each loop, successively, until there are no remaining loops or else a given system of, say

, $2 \leq k \leq m$ , inequalities has no solutions (this would imply that the loop with index

is never executed for these values of $i, j_2, \dots, j_{k-1}$ that render the

outermost loops canonical). Note that, in the case where the original loop nest contains more than one consecutive loops at some level, the same procedure should be applied for each loop separately.

Example 2Consider the loop nest shown in Figure 4.a. Since there are two consecutive loop nests in the body of the I loop, the procedure described above must be applied separately for each of them.

For the first loop nest, the J loop is canonical with respect to the outer loop; the K loop is canonical (with respect to the I, J loops) when 2*I-J $\leq$ 1000 $\Longleftrightarrow$ J $\geq$ 2*I-1000. Thus, the J loop must be split into two consecutive loops depending on which values of J render the K loop canonical; the bounds of the first loop will be 1 and MAX(1,2*I-1000)-1, and of the second loop MAX(1,2*I-1000) and I. Since the body of the J loop does not contain statements other than the K loop, no statements are executed in the case when 1 $\leq$ J $\leq$ MAX(1,2*I-1000)-1; hence, the corresponding loop can be eliminated.

5.0pt

`N`	Mapping	Number of processors
	scheme							12		16

256	KAP/MARS	1056768	.428	923648	.566	577024	.620	356749.3	.602	319360	.644
	CYC	8256	.006	12416	.017	14560	.040	15331.3	.061	15760	.082
	BCS	382	.000	13271	.018	7183	.020	6329.3	.026	9828	.053
	CAN-2	262144	.156	229376	.245	143360	.288	82091.3	.258	79360	.310
	CAN-3	0	.000	0	.000	0	.000	50.3	.000	512	.003
1024	KAP/MARS	67239936	.428	58818560	.567	36757504	.621	22978604	.606	20346880	.645
	CYC	131328	.001	197120	.004	230272	.010	241550	.016	247360	.022
	BCS	151255	.002	118940	.003	193075	.009	322800	.021	281770	.025
	CAN-2	16777216	.158	14680064	.247	9175040	.290	6228806	.294	5079040	.312
	CAN-3	0	.000	0	.000	0	.000	48713	.003	0	.000

For the second loop nest, the J loop is canonical with respect to the outer loop when 2*I-500 $\leq$ 1000 $\Longleftrightarrow$ I $\leq$ 750; the index of the I loop is split accordingly. Furthermore, the K loop is canonical when I+J $\leq$ 1000 $\Longleftrightarrow$ J $\leq$ 1000-I. The code resulting after applying the appropriate transformations is shown in Figure 4.b. Evaluating MAX(1,2*I-1000), by replacing it with appropriate conditionals which are then removed using index set splitting (see Section 3.2), results in the code shown in Figure 4.c (note that MIN(1000,1000-I) is always equal to 1000-I since I takes only positive values). Finally, partitioning I for each loop nest as in Section 3.2, and grouping the partitions according to (7) for

, the partitioned code can lead to perfect load balance when using 5 processors, and, in general, a comparatively low value of load imbalance [15]. $\Box$