Experimental Results

**Figure 4:** Upper Triangular Matrix Multiplication.
$\begin{figure}\begin{small} \begin{verbatim}DO J=1,N DO I=1,J DO K=I,J A(I,J)=A(I,J)+B(I,K)*C(K,J) ENDDO ENDDO ENDDO\end{verbatim}\end{small}\end{figure}$

In order to evaluate the performance gains of the partitioning strategy described in the previous section, we consider the parallelisation of the code shown in figure 4, which corresponds to the multiplication of two upper triangular $n \times n$ matrices [6] (the two outer loops have been interchanged in order to increase the number of unit stride array references). Clearly, the loop nest is canonical (see Definition 3.1), and, given that it has a depth of 3, a partitioning scheme based on the proof described in Theorem 3.1 may lead to perfect load balance. We compared this scheme (henceforth denoted by CAN) with three other compile-time mapping schemes, subsequently denoted by the shorthands KAP, MARS, and CYC; KAP corresponds to the mapping strategy of the KAP auto-parallelising compiler (which is based on scheduling chunks of consecutive iterations having a fixed size), MARS corresponds to the mapping strategy of the MARS experimental parallelising compiler [2] (which is equivalent to compile-time partitioning into a number of chunks of consecutive iterations equal to the number of processors), and CYC corresponds to a cyclic (or wrap [5]) way of mapping the iterations onto processors (i.e., processor 0 executes iterations $1, p+1, 2p+1, \dots$ , processor 1 executes iterations $2, p+2, 2p+2, \dots$ , in general, processor

, $0 \leq i \leq p-1$ , executes iterations $i+1+kp, k=0,1,2,\dots,n/p-1$ ).

The parallelised programs were run on a KSR1, using two different values for N,

and

; their execution time is shown in Table 1. In both cases, KAP and MARS perform worst of all while CAN performs best. The performance of CYC is comparable with that of CAN when using a relatively small number of processors; for more than 12 processors, CYC causes a large number of cache misses.

10pt

**Table 1:** Execution time (in seconds) of upper triangular matrix multiplication for various mapping schemes on the KSR1.
	Mapping	Number of processors
`N`	scheme	1	2	4	8	12	16
256	KAP	4.03	3.64	2.78	1.71	1.12	0.97
	MARS	3.99	3.63	2.76	1.68	1.06	0.95
	CYC	4.02	2.05	1.07	0.55	0.40	0.35
	CAN	3.99	2.03	1.04	0.51	0.38	0.29
1024	KAP	527	458	308	174	128	110
	MARS	524	457	310	179	130	113
	CYC	525	267	154	69	55	49
	CAN	524	265	152	66	46	36