next up previous
Next: Background Up: Compile-Time Minimisation of Load Previous: Compile-Time Minimisation of Load

Introduction

In order to evaluate the performance trade-offs of different transformations, parallelising compilers are usually armed with some performance estimation capability; this issue has been addressed recently by a number of researchers [3,7,19]. Although the implementation details of these models may vary, generally they attempt to identify sources of performance loss, such as load imbalance, interprocessor communication, and cache misses [4,6]. This has two implications for a parallelising compiler. The first is that the compiler must be capable of extracting quantitative measures from programs. Since parallelising compilers usually target the parallelisation of loop nests, significant information lies in the number of loop iterations; this can be used, for instance, to estimate the amount of work assigned to each processor, or the number of non-local accesses [7]. A second implication of using a performance estimator is that the compiler should avoid postponing critical decisions concerning the parallelisation process until run-time; doing so would reduce the amount of information available, and consequently the accuracy of the performance prediction. One such critical decision concerns the way loop nests are mapped onto a parallel architecture, that is, how loop iterations are distributed amongst processors.
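As a concrete illustration of the kind of quantitative measure involved (a minimal sketch, not taken from the paper), the number of times the body of a triangular loop nest executes can be obtained in closed form and checked by enumeration:

```python
# Illustrative example: counting the body executions of a triangular
# loop nest, as a performance estimator might do symbolically.
#
#   for i in 0..n-1:
#       for j in 0..i:       # inner bound depends on i
#           body()
#
# The body executes sum_{i=0}^{n-1} (i+1) = n*(n+1)/2 times.

def triangular_iterations(n):
    """Closed-form count of body executions for the nest above."""
    return n * (n + 1) // 2

def triangular_iterations_by_enumeration(n):
    """Brute-force count, used only to check the closed form."""
    return sum(1 for i in range(n) for j in range(i + 1))

assert triangular_iterations(100) == triangular_iterations_by_enumeration(100) == 5050
```

A compiler would manipulate the closed form symbolically rather than enumerate, but the two must agree for any concrete bound.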

The problem of mapping loop nests has attracted significant interest; traditionally, researchers have been inclined towards run-time based schemes [8,11,13,18]. The underlying argument has been that information not available at compile-time may permit a more balanced distribution of the workload, especially in cases where each iteration of the parallel loop does not perform the same amount of work. Although such a distribution may be achieved at the expense of additional overheads (for instance, increasing the number of memory accesses needed to balance computational work), run-time mapping schemes seem preferable whenever the execution of the loop nest depends on expressions that are unknown at compile-time. However, in a number of cases, each iteration of the parallel loop may perform a different amount of work as a result of enclosed loops whose execution depends on the value of the index of the parallel loop. Such loop nests follow a regular pattern, and it is feasible to devise compile-time mapping schemes for them.
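To see why such regular patterns punish a naive compile-time mapping, consider distributing the outer loop of the triangular nest above in equal-sized blocks over p processors. The following sketch (function names are illustrative, not from the paper) computes the work each processor would receive:

```python
# Illustrative sketch: block (equal-sized chunk) distribution of the
# outer loop of a triangular nest, and the per-processor work it
# yields. Iteration i of the outer loop contributes i+1 executions
# of the loop body.

def block_distribution_work(n, p):
    """Inner-loop iterations per processor under block mapping."""
    chunk = (n + p - 1) // p            # ceiling of n/p outer iterations each
    work = []
    for proc in range(p):
        lo = proc * chunk
        hi = min(n, (proc + 1) * chunk)
        work.append(sum(i + 1 for i in range(lo, hi)))
    return work

# For n = 16, p = 4 the last processor receives almost six times the
# work of the first:
print(block_distribution_work(16, 4))   # → [10, 26, 42, 58]
```

The imbalance is entirely predictable at compile-time, which is precisely what makes compile-time remedies feasible for this class of loop nests.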

This thesis has been advanced by Haghighat and Polychronopoulos [9,10], who suggest that symbolic cost estimates can be used to design robust compile-time strategies for mapping loop nests. They propose balanced chunk scheduling, a mapping scheme for triangular perfect loop nests; the main idea is to partition the outermost loop in such a way that each processor executes the loop body the same (or almost the same) number of times. However, balanced chunk scheduling can only be applied to a limited class of loop nests; furthermore, distributing the iterations of the parallel loop unevenly is often undesirable in practice (for instance, in a data parallel environment, if a triangular loop nest is followed by a rectangular loop nest involving the same arrays, significant communication and/or load imbalance overhead would arise).
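The idea behind such a partition can be sketched as follows; this is an illustrative approximation for the triangular nest above, not the exact algorithm of Haghighat and Polychronopoulos. Since the cumulative work up to outer iteration i grows like i²/2, near-equal chunks of loop-body executions are obtained by cutting the outer loop at roughly n·√(k/p):

```python
import math

# Illustrative sketch of partitioning a triangular loop nest so that
# each processor executes the loop body almost the same number of
# times (an approximation of the balanced-chunk idea; function names
# are hypothetical).

def balanced_chunks(n, p):
    """(start, end) outer-loop ranges giving near-equal body counts."""
    # Cumulative work up to iteration i is i*(i+1)/2 ~ i^2/2, so the
    # k-th cut point that equalises work lies near n*sqrt(k/p).
    cuts = [round(n * math.sqrt(k / p)) for k in range(p + 1)]
    return list(zip(cuts[:-1], cuts[1:]))

def chunk_work(chunks):
    """Body executions per chunk, iteration i contributing i+1."""
    return [sum(i + 1 for i in range(lo, hi)) for lo, hi in chunks]

# For n = 16, p = 4 the triangular workload evens out, at the cost of
# processors receiving different numbers of outer iterations:
print(chunk_work(balanced_chunks(16, 4)))   # → [36, 30, 39, 31]
```

Compared with the block mapping's [10, 26, 42, 58] for the same nest, the workload is far more even, but note that the outer iterations themselves are now distributed unevenly, which is exactly the drawback discussed above.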

In this paper we develop a general approach for mapping, at compile-time, parallel loops that enclose loops whose execution depends on the index of the parallel loop. We make use of symbolic cost estimates as a means of evaluating our strategy, and the minimisation of load imbalance is considered the main objective. However, by avoiding options which may increase other sources of overhead, our scheme also appears to be more efficient in practice.

The remainder of the paper is structured as follows: Section 2 provides a brief background on measuring load imbalance and on counting symbolically the number of loop iterations. Section 3 forms the main body of the paper; after first describing a strategy for mapping a certain class of loop nests termed canonical, it then shows how to transform generalised loop nests into multiple canonical loop nests. This strategy is evaluated and compared with other mapping schemes in Section 4; finally, Section 5 summarises the results of the paper.


Rizos Sakellariou 2000-07-31