The partitioning schemes described in Section 3.1 result in perfect load balance, or a small value of load imbalance, when each iteration of the parallel loop performs the same amount of work. A simple example where this is not the case is that of a triangular loop nest; assuming that such a loop having the index of the outer loop, , taking values from to , and the index of the inner loop taking values from to , is mapped onto processors, then, for , the relative load imbalance has a lower bound of . It is apparent that the partitioning schemes described so far are not sufficient to minimise load imbalance; nevertheless, using these as a basis, more appropriate schemes can be devised.
In the remainder of this paper we examine loop nests of depth having the general form shown in Figure 2. It is assumed that the sets of statements labelled statements.1, statements.2, , statements.2m-1 do not include statements whose execution depends on the value of the index of a surrounding loop; hence, the workload corresponding to each set of statements remains the same for any iteration of the outer loop. This, however, does not exclude the possibility that, in any of the above-mentioned sets of statements, DO ENDDO loops which perform the same number of iterations regardless of the value of I, may exist. Thus, literally, the depth of a loop nest may be higher than ; however, in this and subsequent discussion, only those loops having bounds dependent on the index of the outermost loop (directly or indirectly) are considered; any remaining loops do not affect the strategy followed and they are omitted.
Based on Definition 1, the next theorem is proved [15]:
In order to illustrate Theorem 1 with an example, consider again a triangular loop nest having the index of the outer loop, , taking values from to , and the index of the inner loop taking values from to ; then, assuming that is partitioned into equal partitions, the -th partition, , exhibits a workload equal to , for constants. This property leads to the following [15]:
Theorems 1 and 2 can be summarised as follows:
In order to illustrate the results of Theorems 1 and 2, we consider the following example:
|
Example 1Consider the loop nest shown in Figure 3.a. Assuming that
N , then the inequalities -2 3*I-1
and J+I 5*I+2 always hold, while, for each inequality,
the coefficients
of I are non-zero; hence, the requirements of Definition 1
are satisfied and the loop nest is canonical for . Thus, based on
Theorems 1 and 2, and assuming that the
number of iterations of the outer loop, N, is a multiple of ,
where is the number of processors,
partitioning the loop nest according to (7) leads to
perfect load balance; the partitioned loop nest is shown in
Figure 3.b.
|
In the general case, where the number of iterations of the outer
loop, , is not a multiple of , the partitioning
technique suggested by Theorem 2 can be applied,
provided that the outer loop is partitioned according to one
of the partitioning schemes described in Section 3.1. In this
case, a small value of load imbalance is expected.
Theorems 1 and 2 also apply to loop nests where there are more than one inner loops at the same level (i.e., loops which are surrounded only by the same outer loops) whose bounds depend on the index of a surrounding loop; the necessary proviso is that, for any loop in the nest, the lower bound is always less than or equal to the upper bound.