next up previous
Next: Generalized Loop Nests Up: Partitioning for Load Balance Previous: Loop Nests Containing Conditionals

Canonical Loop Nests

The partitioning schemes described in Section 3.1 result in perfect load balance, or a small value of load imbalance, when each iteration of the parallel loop performs the same amount of work. A simple example where this is not the case is that of a triangular loop nest; assuming that such a loop having the index of the outer loop, $i$, taking values from $1$ to $n$, and the index of the inner loop taking values from $1$ to $i$, is mapped onto $p$ processors, then, for $p, n \geq 2$, the relative load imbalance has a lower bound of $1/4$. It is apparent that the partitioning schemes described so far are not sufficient to minimise load imbalance; nevertheless, using these as a basis, more appropriate schemes can be devised.

Figure 2: A canonical loop nest of depth $m$.
\begin{figure}\begin{small}
\verb*\vert DOALL \vert$i=l_1,u_1$\par\verb*\vert (s...
...rt (statements.2m-1)\vert
\par\verb*\vert ENDDO\vert
\par\end{small}\end{figure}

In the remainder of this paper we examine loop nests of depth $m$ having the general form shown in Figure 2. It is assumed that the sets of statements labelled statements.1, statements.2, $\dots$, statements.2m-1 do not include statements whose execution depends on the value of the index of a surrounding loop; hence, the workload corresponding to each set of statements remains the same for any iteration of the outer loop. This, however, does not exclude the possibility that, in any of the above-mentioned sets of statements, DO $\dots$ ENDDO loops which perform the same number of iterations regardless of the value of I, may exist. Thus, literally, the depth of a loop nest may be higher than $m$; however, in this and subsequent discussion, only those loops having bounds dependent on the index of the outermost loop (directly or indirectly) are considered; any remaining loops do not affect the strategy followed and they are omitted.

Definition 1   Consider the loop nest of depth $m$, $m \geq 2$, shown in Figure 2, where the body of the outermost loop contains $m-1$ nested loops; this loop nest is a canonical loop nest of depth $m$, if and only if $u_1 > l_1$ and, for all $i, j_2, j_3, \dots, j_m$, the following inequalities always hold

\begin{eqnarray*}
& l_{21}i+l_{22} \leq u_{21}i+u_{22} \\
& l_{31}i+l_{32}j_2...
...{m1}i+l_{m2}j_2+\dots+l_{mm} \leq u_{m1}i+u_{m2}j_2+\dots+u_{mm}
\end{eqnarray*}



where, for each of $j_k$, $2 \leq k \leq m$, at least one of the differences $(l_{k1}-u_{k1})$, $(l_{k2}-u_{k2}), \dots, (l_{k,k-1}-u_{k,k-1})$ is non-zero, and the set (statements.m) contains at least one statement, i.e. it is not empty.

Based on Definition 1, the next theorem is proved [15]:

Theorem 1   Consider any loop nest of the form described in Definition 1. Assume that the outer loop is partitioned into $2p^{m-1}$ equal partitions; then, for all $k$, $0 \leq k \leq 2p^{m-1}-1$, the $k$-th partition has a workload, $W_k$, equal to

\begin{displaymath}W_k = \sum_{i=0}^{m-1} C_i k^{i},
\mbox{~~$C_i$\ constants.} \end{displaymath}

In order to illustrate Theorem 1 with an example, consider again a triangular loop nest having the index of the outer loop, $i$, taking values from $1$ to $n$, and the index of the inner loop taking values from $1$ to $i$; then, assuming that $i$ is partitioned into $p$ equal partitions, the $k$-th partition, $0 \leq k < p$, exhibits a workload equal to $C_1 k + C_0$, for $C_0, C_1$ constants. This property leads to the following [15]:

Theorem 2   Consider any loop nest which is partitioned along the index of the outer loop into $2p^{m-1}$, $m \geq 2$, equal partitions. If, for all $k$, $0 \leq k \leq 2p^{m-1}-1$, the $k$-th partition has workload, $W_k$, equal to

\begin{displaymath}W_k = \sum_{i=0}^{m-1} C_i k^{i},
\mbox{~~where $C_i$\ constants,} \end{displaymath}

then the loop nest can be partitioned into $p$ partitions of equal workload; the set of partitions along the index of the outermost loop which compose the $k'$-th, $0 \leq k' < p$, partition of the loop nest is given by
$\displaystyle S_{k'} \!$ $\textstyle =$ $\displaystyle \bigcup_{i=0}^{p^{m-2}-1} \left\{ \{ 2pi + ( k'+
\sum_{j=0}^{m-3} \lfloor i/p^j \rfloor) \bmod p \}
\cup \right.$  
    $\displaystyle \left. \{ 2p(i+1) - 1 - ( k'+
\sum_{j=0}^{m-3} \lfloor i/p^j \rfloor) \bmod p \} \right\}.$ (7)

For $m=2$, (7) reduces to $S_{k'}=\{k\}\cup\{2p-k-1\}$ [16].

Theorems 1 and 2 can be summarised as follows:

Corollary 1   Consider a canonical loop nest of depth $m$; if the index of the outer loop can be partitioned into $2p^{m-1}$ equal partitions, then the loop nest can be partitioned into $p$ partitions of equal workload.

In order to illustrate the results of Theorems 1 and 2, we consider the following example:

Figure 3: Example of partitioning a loop nest of depth 3.
       DOALL I=1,N
          DO J=-2,3*I-1
             DO K=J+I,5*I+2
                (statements)
             ENDDO
          ENDDO
       ENDDO
a) Unpartitioned loop nest.

C  --- assuming that MODULO(N,2*P*P)=0 ---
C*KSR* PARALLEL REGION(NUMTHREADS=P,
C*KSR*&   PRIVATE=I,J,K,LK,UK,K1,K2)
       K1=IPR_MID()
       DO II=0,P-1
          K2=2*P*II+MOD(K1+II,P)
          LK=1+K2*N/(2*P*P)
          UK=1+(K2+1)*N/(2*P*P)-1
          DO I=LK,UK
             DO J=-2,3*I-1
                DO K=J+I,5*I+2
                   (statements)
                ENDDO
             ENDDO
          ENDDO
          K2=2*P*(II+1)-1-MOD(K1+II,P)
          LK=1+K2*N/(2*P*P)
          UK=1+(K2+1)*N/(2*P*P)-1
          DO I=LK,UK
             DO J=-2,3*I-1
                DO K=J+I,5*I+2
                   (statements)
                ENDDO
             ENDDO
          ENDDO
       ENDDO
C*KSR* END PARALLEL REGION

b) After loop partitioning.

 
Example 1Consider the loop nest shown in Figure 3.a. Assuming that N $> 1$, then the inequalities -2 $\leq$ 3*I-1 and J+I $\leq$ 5*I+2 always hold, while, for each inequality, the coefficients of I are non-zero; hence, the requirements of Definition 1 are satisfied and the loop nest is canonical for $m=3$. Thus, based on Theorems 1 and 2, and assuming that the number of iterations of the outer loop, N, is a multiple of $2p^2$, where $p$ is the number of processors, partitioning the loop nest according to (7) leads to perfect load balance; the partitioned loop nest is shown in Figure 3.b.$\Box$

Figure 4: Transforming a generalized loop nest to canonical loop nests.
DOALL I=1,1000
      DO J=1,I
         DO K=2*I-J,1000
            (statements.1)
         ENDDO
      ENDDO
      (statements.2)
      DO J=2*I-500,1000
         DO K=I+J,1000
            (statements.3)
         ENDDO
      ENDDO
ENDDO
a) Unpartitioned loop nest.

DOALL I=1,750
      DO J=MAX(1,2*I-1000),I
         DO K=2*I-J,1000
            (statements.1)
         ENDDO
      ENDDO
      (statements.2)
      DO J=2*I-500,MIN(1000,1000-I)
         DO K=I+J,1000
            (statements.3)
         ENDDO
      ENDDO
ENDDO
DOALL I=751,1000
      DO J=MAX(1,2*I-1000),I
         DO K=2*I-J,1000
            (statements.1)
         ENDDO
      ENDDO
      (statements.2)
ENDDO
b) After initial index set splitting.

DOALL I=1,500
      DO J=1,I
         DO K=2*I-J,1000
            (statements.1)
         ENDDO
      ENDDO
      (statements.2)
      DO J=2*I-500,1000-I
         DO K=I+J,1000
            (statements.3)
         ENDDO
      ENDDO
ENDDO
DOALL I=501,750
      DO J=2*I-1000,I
         DO K=2*I-J,1000
            (statements.1)
         ENDDO
      ENDDO
      (statements.2)
      DO J=2*I-500,1000-I
         DO K=I+J,1000
            (statements.3)
         ENDDO
      ENDDO
ENDDO
DOALL I=751,1000
      DO J=2*I-1000,I
         DO K=2*I-J,1000
            (statements.1)
         ENDDO
      ENDDO
      (statements.2)
ENDDO
c) Transformed code.

 
In the general case, where the number of iterations of the outer loop, $n$, is not a multiple of $2p^{m-1}$, the partitioning technique suggested by Theorem 2 can be applied, provided that the outer loop is partitioned according to one of the partitioning schemes described in Section 3.1. In this case, a small value of load imbalance is expected.

Theorems 1 and 2 also apply to loop nests where there are more than one inner loops at the same level (i.e., loops which are surrounded only by the same outer loops) whose bounds depend on the index of a surrounding loop; the necessary proviso is that, for any loop in the nest, the lower bound is always less than or equal to the upper bound.


next up previous
Next: Generalized Loop Nests Up: Partitioning for Load Balance Previous: Loop Nests Containing Conditionals
Rizos Sakellariou 2000-07-31