
Enumerating the symbolic sets and mappings

The second challenge is that we must enumerate all instances of each parallel task and of each communication edge between tasks. These instances are described by symbolic sets and mappings, respectively. With complex computation partitioning strategies and arbitrary communication patterns, this is a difficult problem to solve at compile time, i.e., without executing the actual program.

Figure 3: Example from NAS LU illustrating processor sets and task mappings for communication tasks. Problem size: $65 \times 65 \times 65$; Processor configuration: $4 \times 4$.
[Figure 3 not reproduced: (a) the loop fragment with its ON HOME computation partitioning descriptors; (b) the symbolic sets $ProcsThatSend$ and $SendToRecvProcsMap$; (c) the subroutine generated to enumerate $ProcsThatSend$.]

For example, consider the loop fragment from the NAS LU benchmark shown in Figure 3. The compiler automatically chooses a sophisticated computation partitioning, denoted by the ON HOME descriptors for each statement in the figure. For example, the ON HOME descriptor for the assignment to flux indicates that the instance of the statement in iteration (k,j,i,m) should be executed by the processors that own either of the array elements rsd(m,i-1,j,k) or rsd(m,i+1,j,k). This means that each boundary iteration of the statement is replicated on the two adjacent processors. This replication eliminates the need for expensive inner-loop communication for the privatizable array flux [6]. Communication is still required for each reference to array $u$, and these references are coalesced by the compiler into a single logical communication event. The communication pattern is equivalent to two SHIFT operations in opposite directions. Part (b) of the figure shows the set of processors that must execute the SEND communication task ($ProcsThatSend$), as well as the mapping between the processors executing the SEND and those executing the corresponding wait-recv ($SendToRecvProcsMap$). (Note that both quantities are described as symbolic integer sets, parameterized by the variables jst, jend, nx and nz.) Each of these sets combines the information for both SHIFT operations. Correctly instantiating the communication tasks and edges for such patterns in a pattern-driven manner would be difficult and error-prone, and would in any case be limited to some predetermined class of patterns that is unlikely to include cases as complex as this one.
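
Because the sets of Figure 3(b) are not reproduced here, the following Python sketch is only a hypothetical illustration of what such a parameterized mapping enumerates: (sender, receiver) processor pairs for two opposite SHIFTs on a $4 \times 4$ processor grid. The function name, the grid shape, and the simple guard on jst and jend are assumptions standing in for the real set's constraints; nx and nz are unused in this simplification.

# Hypothetical illustration only; not the actual dHPF-generated mapping.
def send_to_recv_procs_map(jst, jend, nx, nz, p_rows=4, p_cols=4):
    """Yield (sender, receiver) processor-id pairs for both SHIFT directions."""
    if jst > jend:                  # stand-in for the real set's constraints
        return
    for p1 in range(p_rows):
        for p2 in range(p_cols):
            if p1 + 1 < p_rows:     # SHIFT towards higher processor rows
                yield (p1, p2), (p1 + 1, p2)
            if p1 - 1 >= 0:         # SHIFT towards lower processor rows
                yield (p1, p2), (p1 - 1, p2)

# Enumerate the communication edges for one combination of parameter values.
edges = list(send_to_recv_procs_map(jst=2, jend=64, nx=64, nz=64))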

Instead, we develop a novel and general solution to this problem, based on an unusual use of code generation from integer sets. In ordinary compilation, dHPF and other advanced parallelizing compilers use code generation from integer sets to synthesize loop nests that are executed at runtime, e.g., for a parallel loop nest or for packing and unpacking data for communication [1,7,8,11]. If we could invoke the same capability but execute the generated loop nests at compile time, we could use these synthesized loop nests to enumerate the required tasks and edges.
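
As a rough sketch of this idea (not the dHPF code generator itself, which handles far more general symbolic sets), the following Python fragment takes a rectangular, parameterized set described by per-dimension bound expressions and emits the source text of a subroutine whose loop nest enumerates the set's elements. All names here are illustrative.

# Toy illustration of code generation from an integer set: emit the source of
# a subroutine whose loop nest enumerates a rectangular, parameterized set.
def generate_enumerator(name, params, bounds):
    """bounds: list of (index_var, lower_expr, upper_expr) strings."""
    lines = [f"def {name}({', '.join(params)}):"]
    indent = "    "
    for var, lo, hi in bounds:
        lines.append(f"{indent}for {var} in range({lo}, ({hi}) + 1):")
        indent += "    "
    index_vars = ", ".join(var for var, _, _ in bounds)
    lines.append(f"{indent}yield ({index_vars})")
    return "\n".join(lines)

# Example: a 2-D set {[p1, p2] : 0 <= p1 <= nprow - 1, 0 <= p2 <= npcol - 1}.
src = generate_enumerator("enum_procs",
                          params=["nprow", "npcol"],
                          bounds=[("p1", "0", "nprow - 1"),
                                  ("p2", "0", "npcol - 1")])
print(src)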

Implementing this approach, however, proved to be a non-trivial task. Most importantly, each of the sets is parameterized by several variables, including input variables and loop index variables (e.g., the two sets above are parameterized by jst, jend, nx and nz). This means that each set must be enumerated separately for every combination of these variable values that occurs during the execution of the original program. We solve this problem as follows. We first generate a subroutine for each integer set that we want to enumerate, making the parameters arguments of the subroutine. Then (still at compile time), we compile, link, and invoke this code in a separate process. The desired combinations of variable values for each node and edge are automatically available when interpreting the control-flow of the task graph, as described earlier. Therefore, during this interpretation, we simply invoke the desired subroutine in the other process to enumerate the IDs for a node or the ID pairs for an edge.
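
A minimal sketch of this mechanism is shown below, with illustrative names and a toy generated enumerator; dHPF compiles and links real generated code and keeps a single helper process alive, whereas this sketch spawns a fresh process per call.

# Minimal sketch of the separate-process mechanism (illustrative only).
import ast, subprocess, sys, tempfile, textwrap

# Source text of a generated enumerator, wrapped so that it can be run as a
# stand-alone script taking the set's parameters on the command line.
GENERATED_SRC = textwrap.dedent("""
    import sys

    def enum_procs(nprow, npcol):
        for p1 in range(0, (nprow - 1) + 1):
            for p2 in range(0, (npcol - 1) + 1):
                yield (p1, p2)

    if __name__ == "__main__":
        for elem in enum_procs(*[int(a) for a in sys.argv[1:]]):
            print(elem)
""")

def enumerate_in_subprocess(src, params):
    """Run the generated enumerator in a separate process; collect its output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(src)
        path = f.name
    result = subprocess.run([sys.executable, path] + [str(p) for p in params],
                            capture_output=True, text=True, check=True)
    return [ast.literal_eval(line) for line in result.stdout.splitlines()]

# Invoked once for each combination of parameter values encountered while
# interpreting the task graph, e.g. a 4 x 4 processor configuration:
print(enumerate_in_subprocess(GENERATED_SRC, (4, 4)))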

To illustrate this approach, Figure 3(c) shows the subroutine generated to enumerate the elements of the set $ProcsThatSend$ described earlier. The loop nest in this subroutine is generated directly from the symbolic integer set in part (b) of the figure, and it enumerates the instances of the SEND task, in this case one task per processor that executes the SEND. The subroutine is parameterized by jst, jend, nx and nz. In this example, these are all fixed values determined by the program input. In general, however, they could depend on the index variables of enclosing loops, and the subroutine must then be invoked for each combination of values of its arguments.
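
Since the generated subroutine of Figure 3(c) is not reproduced here, the following Python analogue only suggests its shape; the loop bounds, the guard on jst and jend, and the argument values in the example calls are assumptions, not the actual dHPF output.

# Hypothetical analogue of the generated enumerator in Figure 3(c).
def procs_that_send(jst, jend, nx, nz, p_rows=4, p_cols=4):
    """Yield one SEND-task instance (a processor id) per sending processor."""
    if jst > jend:                          # stand-in for the set's constraints
        return
    for p1 in range(p_rows):
        for p2 in range(p_cols):
            # A processor sends if it has a neighbour in either SHIFT direction.
            if p1 + 1 < p_rows or p1 - 1 >= 0:
                yield (p1, p2)

# Invoked once per combination of argument values; in the NAS LU example the
# arguments are fixed by the program input, but in general they may vary with
# an enclosing loop (the two calls below are purely illustrative).
tasks_a = list(procs_that_send(2, 64, 64, 64))
tasks_b = list(procs_that_send(2, 32, 64, 64))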

Overall, the use of code generated from symbolic sets enables us to support a broad class of computation partitionings and communication patterns in a uniform manner. This approach fits nicely with the core capabilities of advanced parallelizing compilers.


