next up previous
Next: Conclusion Up: Compile-Time Partitioning of Three-Dimensional Previous: Methodology

Experimental Results

Figure 4: Upper Triangular Matrix Multiplication.
\begin{figure}\begin{small}
\begin{verbatim}DO J=1,N
DO I=1,J
DO K=I,J
A(I,J)=A(I,J)+B(I,K)*C(K,J)
ENDDO
ENDDO
ENDDO\end{verbatim}\end{small}\end{figure}

In order to evaluate the performance gains of the partitioning strategy described in the previous section, we consider the parallelisation of the code shown in figure 4, which corresponds to the multiplication of two upper triangular $n \times n$ matrices [6] (the two outer loops have been interchanged in order to increase the number of unit stride array references). Clearly, the loop nest is canonical (see Definition 3.1), and, given that it has a depth of 3, a partitioning scheme based on the proof described in Theorem 3.1 may lead to perfect load balance. We compared this scheme (henceforth denoted by CAN) with three other compile-time mapping schemes, subsequently denoted by the shorthands KAP, MARS, and CYC; KAP corresponds to the mapping strategy of the KAP auto-parallelising compiler (which is based on scheduling chunks of consecutive iterations having a fixed size), MARS corresponds to the mapping strategy of the MARS experimental parallelising compiler [2] (which is equivalent to compile-time partitioning into a number of chunks of consecutive iterations equal to the number of processors), and CYC corresponds to a cyclic (or wrap [5]) way of mapping the iterations onto processors (i.e., processor 0 executes iterations $1, p+1, 2p+1, \dots$, processor 1 executes iterations $2, p+2, 2p+2, \dots$, in general, processor $i$, $0 \leq i \leq p-1$, executes iterations $i+1+kp, k=0,1,2,\dots,n/p-1$).

The parallelised programs were run on a KSR1, using two different values for N, $256$ and $1024$; their execution time is shown in Table 1. In both cases, KAP and MARS perform worst of all while CAN performs best. The performance of CYC is comparable with that of CAN when using a relatively small number of processors; for more than 12 processors, CYC causes a large number of cache misses.


10pt
Table 1: Execution time (in seconds) of upper triangular matrix multiplication for various mapping schemes on the KSR1.
  Mapping Number of processors
N scheme 1 2 4 8 12 16
256 KAP 4.03 3.64 2.78 1.71 1.12 0.97
  MARS 3.99 3.63 2.76 1.68 1.06 0.95
  CYC 4.02 2.05 1.07 0.55 0.40 0.35
  CAN 3.99 2.03 1.04 0.51 0.38 0.29
1024 KAP 527 458 308 174 128 110
  MARS 524 457 310 179 130 113
  CYC 525 267 154 69 55 49
  CAN 524 265 152 66 46 36


next up previous
Next: Conclusion Up: Compile-Time Partitioning of Three-Dimensional Previous: Methodology
Rizos Sakellariou 2000-11-12