next up previous
Next: Conclusion Up: Evaluation and Experimental Results Previous: Upper Triangular Matrix Multiplication

Banded SYR2K

Figure 5: Performance of mapping schemes on the KSR1 for upper triangular matrix multiplication.
\begin{figure}\begin{center}
\noindent\epsfig{figure=ex3/ex3a.ps, width=88truemm...
...ps, width=88truemm}\end{center}\par {\em ~~~b) {\tt N}~$=1024$}
\par\end{figure}

The second benchmark, banded symmetric rank-$2k$ update (SYR2K), contains non-affine bounds; the version we use is shown below.

 DOALL I=1,MIN(N,2*BB-1)
    DO J=MAX(1-BB,1-N),MIN(BB-I,N-I)
       DO K=MAX(1,I+J),MIN(N+J,N)
          C(-I-J+K+1,I)=C(-I-J+K+1,I)+A(K,-I-J+BB+1)*B(K,-J+BB)
&                                    +A(K,-J+BB)*B(K,-I-J+BB+1)
       ENDDO
    ENDDO
 ENDDO

Clearly, the loop nest is not canonical (see Definition 1). However, converting the MIN and MAX functions to IF statements, and removing the latter by index set splitting (see Section 3.2), the code can be transformed into four consecutive canonical loop nests assuming that N $>$ 2*BB-1 [15]; this version is denoted CAN-3t. For comparison, two additional mapping schemes are also implemented; they are based on direct application of the partitioning schemes described by (7) for $m=2$ and $m=3$ to the original loop nest, regardless of the fact that the latter is not canonical. These two versions are henceforth referred to as CAN-2 and CAN-3, respectively. No version based on balanced chunk scheduling is implemented since loop nests having bounds containing MIN and MAX functions do not conform to its requirements.


Table 2: Expected load imbalance and relative load imbalance for banded SYR2K.
5.24pt
N, Mapping Number of processors
BB scheme 12 16
$L$ $L_R$ $L$ $L_R$ $L$ $L_R$ $L$ $L_R$ $L$ $L_R$
512, KAP/MARS 1004896 .350 764592 .450 447832 .490 331685 .516 240300 .507
64 CYC 15360 .008 23056 .024 26936 .055 28645 .084 28940 .110
CAN-2 992 .001 19216 .020 17920 .037 12597 .039 9920 .041
CAN-3 8192 .004 1024 .001 128 .000 1633 .005 560 .002
CAN-3t 2080 .001 31248 .032 60360 .114 73227 .191 31668 .120
1024, KAP/MARS 30758272 .367 23767744 .473 13981024 .513 9924928 .529 7514800 .531
256 CYC 114688 .002 172096 .006 200928 .015 211168 .023 215600 .031
CAN-2 1851264 .034 1478464 .053 1146880 .079 692496 .073 537360 .075
CAN-3 524288 .010 65536 .002 8192 .001 22392 .003 1024 .000
CAN-3t 128 .000 244800 .009 252960 .019 510110 .054 452880 .064


Figure 6: Performance of mapping schemes on the KSR1 for banded SYR2K.
\begin{figure}\begin{center}
\noindent\epsfig{figure=ex4/ex4a.ps, width=88truemm...
...m}\end{center}\par {\em ~~~b) {\tt N}~$=1024$, {\tt BB}~$=256$}
\par\end{figure}

The load imbalance, $L$, computed in terms of the work associated with the assignment statement of the loop body, and the corresponding relative load imbalance, $L_R$, for two different pairs of values for N and BB, $\{512, 64\}$, and $\{1024, 256\}$, are shown in Table 2. It can be seen that MARS and KAP exhibit high load imbalance, while the remaining four schemes exhibit significantly smaller load imbalance; CAN-3 exhibits, on average, the smallest value of load imbalance.

The partitioned programs were run on the KSR1, using the same two pairs of values for N and BB; their performance is depicted in Figure 6. In the first case, in Figure 6.a, KAP and MARS perform worst of all, except when running on 16 processors where CYC performs worst of all. CAN-3t performs best of all when using less than 12 processors; equally good results are achieved by CAN-3 and, to some extent, CAN-2. CYC exhibits a somewhat odd behaviour; it performs nearly best of all when running on 12 processors, but performs worst of all when running on 16 processors, and nearly worst when running on 8 processors. This is due to a significant number of cache misses when the number of processors is a power of 2.

Similar remarks can be made about the results depicted in Figure 6.b. CAN-3t performs best of all; CAN-3 exhibits a comparable performance, but CAN-2 performs significantly worse. KAP and MARS perform worst of all except when using 8 or 16 processors; in these cases, CYC, whose performance also suffers from a high number of cache misses as in Figure 6.a, performs worst of all.

Another interesting observation, in connection with the load imbalance computed in Table 2 and the actual performance depicted in Figure 6, is that, although CAN-3t nearly always exhibits a higher load imbalance than CAN-3 (except when the number of processors is 2), its actual performance is generally better than that of CAN-3 (except when running on more than 12 processors, where the difference in the anticipated load imbalance between the two partitioning schemes becomes comparatively higher); the performance gains of CAN-3t are due to the elimination of MIN and MAX functions from the loop bounds (apart from those necessary for partitioning the outermost, parallel loop).


next up previous
Next: Conclusion Up: Evaluation and Experimental Results Previous: Upper Triangular Matrix Multiplication
Rizos Sakellariou 2000-07-31