next up previous
Next: Methodology Up: Compile-Time Partitioning of Three-Dimensional Previous: Introduction

Background

The loop nests considered in this paper have the form shown in figure 1. It is assumed that the sets of statements labelled statements.1, statements.2, $\dots$, statements.5 do not contain statements whose execution depends on the value of the index of a surrounding loop; hence, the workload corresponding to each set of statements remains the same for any iteration of the outermost loop 2. It is also assumed that the third set (statements.3) contains at least one statement, i.e. it is not empty, while the remaining sets of statements may be empty. The outermost loop, denoted by DOALL in the figure, is parallel, that is, its iterations can run concurrently on different processors.

Figure 1: A canonical loop nest of depth $3$.
\begin{figure}\verb*\vert DOALL \vert$i=l_1,u_1$\par\verb*\vert (statements.1)\v...
...
\par\verb*\vert (statements.5)\vert
\par\verb*\vert ENDDO\vert
\par\end{figure}

Assuming that $W_{tot}$ is the total amount of computation in the loop nest which is distributed amongst $p$ processors in such a way that each processor $i$, $0 \leq i < p$, is assigned an amount of computation equal to $W_i$ (clearly, $W_{tot} = \sum_{i=0}^{p-1} W_i$), then we say that this distribution exhibits a load imbalance, $L$, equal to

(1) \begin{displaymath}
L = \max_{i} \left( W_i - \frac{W_{tot}}{p} \right)
= W_{max} - \frac{W_{tot}}{p},
\end{displaymath}

where $W_{max}$ is equal to $\max(W_0, W_1, \dots, W_{p-1})$. The value of $L$ is bounded between $0$ and $W_{tot} (1-1/p)$. In the former case, that is, when, for all $i$, $W_i = W_{tot}/p$, we say that there exists a perfect load balance. Thus, reducing the overhead due to load imbalance is equivalent to finding $W_0, W_1, \dots, W_{p-1}$ such that $L$ in (1) is minimised.

In order to compute the values of $W_{tot}, W_{i}, 0 \leq i < p$, we consider the computational work corresponding to each set of statements of the loop body. Assume that $W_I$ is the work corresponding to the statements which are executed only by the outermost loop (i.e., statements.1 and statements.5), $W_{J_2}$ is the work corresponding to the statements which are executed by the loop with index $j_2$ but not by the loop with index $j_3$ (i.e., statements.2 and statements.4), and $W_{J_3}$ is the work corresponding to the statements which are in the body of the loop with index $j_3$ (i.e., statements.3); then, the total amount of work in the loop nest, $W_{tot}$, is given by

\begin{displaymath}W_{tot} = \sum_{i=l_1}^{u_1} W_I \; + \;
\sum_{i=l_1}^{u_1} ...
...=l_{31}i+l_{32}j_2+l_{33}}^{u_{31}i+u_{32}j_2+u_{33}} W_{J_3}. \end{displaymath}

The evaluation of sums such as the above depends on the number of symbolic variables involved in the loop bounds; detailed methodologies for the symbolic evaluation of sums in the context of parallelising compilers are described in [3], [11], [12], [14].


next up previous
Next: Methodology Up: Compile-Time Partitioning of Three-Dimensional Previous: Introduction
Rizos Sakellariou 2000-11-12