
Compiler Techniques for the Distribution of Data and Computation

    Angeles Navarro, Emilio Zapata, and David Padua, Fellow, IEEE

Abstract—This paper presents a new method that can be applied by a parallelizing compiler to find, without user intervention, the iteration and data decompositions that minimize communication and load imbalance overheads in parallel programs targeted at NUMA architectures. One of the key ingredients in our approach is the representation of locality as a Locality-Communication Graph (LCG) and the formulation of the compiler technique as a Mixed Integer Nonlinear Programming (MINLP) optimization problem on this graph. The objective function and constraints of the optimization problem model communication costs and load imbalance. The solution to this optimization problem is a decomposition that minimizes the parallel execution overhead. This paper summarizes the process of how the compiler extracts the locality information from a nonannotated code and focuses on how this compiler can derive the optimization problem, solve it, and generate the parallel code with the automatically selected iteration and data distributions. In addition, we include a discussion about our model and the solutions—the decompositions—that it provides. The approach presented in the paper is evaluated using several benchmarks. The experimental results demonstrate that the MINLP formulation does not increase compilation time significantly and that our framework generates very efficient iteration/data distributions for a variety of NUMA machines.

    Index Terms—Parallelizing compiler, locality analysis, load balancing, communication pattern, mixed integer nonlinear programming.


    1 INTRODUCTION

MULTIPROCESSOR architectures are converging toward an organization in which nodes containing memory and one or more processors are connected via a fast network. Processors have access to their local memories and to a hardware-supported global address space [1]. This organization enables high scalability at a reasonable cost. The organization also facilitates programming by enabling the gradual introduction of parallelism on sequential prototypes. However, the Non-Uniform Memory Access (NUMA) organization of these machines makes data locality a crucial performance factor.

When data placement is controlled by the operating system, the programmer (or the compiler) must be aware of the page placement policy and either modify the program to adapt its memory access pattern to the operating system policy, or bypass the operating system and hand-code a customized page placement scheme. A possible page placement policy is first-touch, which maps a page onto the processor that references it first. Affinity scheduling can then be used to enhance locality. An alternative way to deal with dynamic memory reference patterns is dynamic page migration [2], which automatically changes page placement if a remote processor accesses the page significantly more frequently than the local processor where the page is allocated.

In order to evaluate the effectiveness of the operating system strategies, we conducted a set of experiments on an SGI Origin 2000 using three codes: tfft2 from NAS, trfd from the Perfect Benchmark Suite, and tomcatv from SPEC95. One parallel version—version 0—of these programs was generated automatically using PFA (Power Fortran Accelerator [3]). PFA is a source-to-source parallelizer that analyzes a sequential program and automatically inserts parallel directives similar to those of the OpenMP standard [4]. Version 0 relies exclusively on the page migration strategy of IRIX.1 This strategy is based on counters that keep track of the number of memory accesses issued by each node in the system to each physical memory page. Whenever the number of remote accesses to a particular page exceeds the number of local accesses by a certain threshold, an interrupt is generated so that the OS can decide whether to migrate the page. We generated a second version of the parallel programs—version 1—by inserting SGI page-level distribution directives on version 0 programs, to measure how the compiler could help the operating system to improve locality.

In addition, we generated a third parallel version—version 2—that is a message-passing version of the program. This was generated by identifying data and computation distributions in which data elements are, whenever possible, placed in the local memories of the processors accessing them. In this version, each processor allocates its own local data, and accesses to remote memories are handled via explicit put/get communication primitives. The goal of version 2 was to measure how much locality can be exploited in a NUMA architecture using true data distribution techniques instead of data placement controlled by the operating system. A similar performance comparison, for another set of benchmarks, has been conducted by Chapman et al. in [5], achieving similar results.


. A. Navarro and E. Zapata are with the Department of Computer Architecture, University of Málaga, Complejo Tecnológico, Campus de Teatinos, 29080 Málaga, Spain. E-mail: {angeles, ezapata}@ac.uma.es.

. D. Padua is with the Department of Computer Science, University of Illinois at Urbana-Champaign, 3318 Digital Computer Laboratory, 1304 West Springfield Ave., Urbana, IL 61801. E-mail: [email protected].

Manuscript received 5 Mar. 2002; revised 17 Dec. 2002; accepted 26 Feb. 2003. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 116026.

1. IRIX is the operating system technology based on the capabilities and benefits of industry-standard UNIX and developed by SGI.

1045-9219/03/$17.00 © 2003 IEEE. Published by the IEEE Computer Society.


The sequential times and the parallel times for all three versions of our experiments are shown in Fig. 1. The lack of scalability of results in version 0 and version 1 is due to the fact that the page migration and page-level distribution strategies do not optimize the locality of references in a NUMA architecture, such as the Origin 2000, because the granularity of a page is too coarse and, thus, a significant number of remote accesses happen. On the other hand, version 2 obtains good scalability for all the codes. In addition, the execution times of version 2 are the smallest for any number of processors. The results of version 2 gave us the idea that much more locality can be exploited in a NUMA architecture using true data distribution techniques instead of the page-level distribution (version 1) or dynamic page migration (version 0) techniques.

Given these results, we believe that the best way to exploit all the available locality of a code in a NUMA architecture is to identify a suitable distribution for both iteration and data (in other words, the decomposition), as we did in version 2. However, to find and implement a good decomposition by hand is a difficult task requiring extensive analysis and complex transformations of the sequential source code. Fortunately, we think that an advanced compiler can alleviate this cumbersome task. Our approach is to have the programmer write a conventional serial—nonannotated—program and rely on the compiler to automatically parallelize it, distribute the iteration and data between the processors, and generate the communications necessary to keep global data consistent. If such a compiler were truly successful, it would become the key tool in a highly scalable, easy-to-program computer system.

In this paper, assuming that the compiler has detected the parallelism in a previous step, we summarize how the data locality information can be extracted and focus on how to use it to formulate an optimization problem whose goal is to find a decomposition that minimizes overhead and exploits locality for the whole code.

    1.1 Related Work

Current compiler literature contains a lot of information regarding the exploitation of data locality relying on data caching. Some of these works are based on loop transformation techniques [6], [7]; other approaches have addressed data transformation techniques [8], and some recent efforts have aimed at combining the benefits of loop and data transformations [9]. The main difference between our work and these is that they focus on the uniprocessor memory hierarchy, whereas our work focuses on the multiprocessor memory hierarchy. In this sense, we are nearer to the automatic data distribution field, where the goal is finding a data distribution that minimizes the overhead due to nonlocal accesses in NUMA architectures.

A substantial body of work in the compiler literature has formulated the data distribution problem in two parts: 1) the alignment problem [10], [11], which tries to relate the dimensions of different arrays in order to minimize the number of accesses to remote memory positions (communications), and 2) the mapping problem, which decides which of the aligned dimensions are distributed and the number of processors assigned to each distributed dimension. Another influential approach [12] uses an algebraic representation of data and computation mappings and tries to find a static data decomposition without communications; if a communication-free decomposition is not found, a naive algorithm tries to find a dynamic data distribution for the most executed program segment. However, in their formulations these approaches do not consider the load imbalance problem, which can become a critical issue in machines with a high-throughput, low-latency network [13]. One noteworthy difference in our approach is that, besides communications, we address load balancing. As we will demonstrate in Sections 3 and 4, the decompositions derived from considering the load imbalance are completely different from the ones that we obtain if we do not take into account the imbalance. As we will see, these new decompositions have an important impact on the performance of the parallel code in NUMA machines.

One interesting strategy in the automatic data distribution field consists of formulating an integer linear programming framework [14], [15], where the objective is to determine the mappings that minimize communications. Only in [15] is the load imbalance taken into account in a particular way. One important difference between our work and these others is that we base our techniques on the Linear Memory Access Descriptor (LMAD), a powerful and accurate array access notation introduced by Paek et al. [16] for finding parallelism in the Polaris compiler [17]. We use the LMAD to extract the locality information for all arrays across the whole program. The LMAD notation lets us extend the domain of interest of the previous works to programs that contain loops not perfectly nested and to cover nonaffine subscript expressions.


    Fig. 1. Sequential and parallel times of versions 0, 1, and 2: (a) tfft2, (b) trfd, and (c) tomcatv.

Another major difference is that our objective function is a nonlinear expression of problem variables because we have incorporated load imbalance into the problem formulation.

Next, we summarize how the paper is organized. In Section 2, we briefly outline the Locality-Communication Analysis Algorithm [18], [19], an algorithm that identifies the constraints that the iteration/data distributions must fulfill in order to ensure the locality of array references. When remote accesses are unavoidable, our techniques can identify the communication patterns suitable for each situation [19]. Using the information of locality and communication patterns provided by the Locality-Communication Analysis, in Section 3 we present a new technique based on a Mixed Integer Non-Linear Programming (MINLP) problem that attempts to derive, for a code, the optimal iteration/data distributions that exploit all the available locality. Our MINLP problem models the overhead of the parallel code due to the load imbalance and communications. In Section 4, we use several benchmarks to evaluate the approach presented in the paper, in terms of the complexity of the MINLP model and in terms of performance of the derived iteration/data distributions using as target different NUMA machines. Finally, in Section 5, we present the conclusions.

    2 COMPILER REPRESENTATION OF LOCALITY

In our approach, the compiler first identifies the parallel loops. After loop parallelization, the first question to be answered is which loop is to be executed in parallel in each DO loop nest with two or more parallel loops, because our techniques only exploit one-level parallelism. This is important because the selection of the parallel loop determines if communications are necessary in such loop nests as well as the amount of computation carried out by each processor. We assume that this selection will be done using an exhaustive search of all possible combinations of parallel loops. That is, at each step of this search, we consider all the loop nests (these loop nests do not have to be perfectly nested) of the program and, for each loop nest, at most one parallel loop is considered. We call each of these loop nests a phase (see examples in Fig. 4).

For each step of the search algorithm, the compiler builds a graph called the Locality-Communication Graph (LCG), which is our compile-time representation of locality in the code. Next, using the information provided by the LCG, the compiler formulates the optimization problem that looks for the decomposition that minimizes the overhead (communication and load imbalance) at that step. Doing this, we reduce the complexity of the optimization problem. This approach is guaranteed to produce the solution with the best predicted behavior, although in some cases the number of possibilities could be too large. In these cases, the compiler could analyze only a subset of possibilities selected at random or by using a search strategy. In this paper, we focus on how the compiler works for one iteration of this exhaustive search algorithm: the process of building the LCG of the code, formulating the optimization problem, and solving it.
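As a minimal sketch of this step structure (illustrative Python; the loop nests and names are hypothetical, and this is not the compiler implementation), each step of the exhaustive search corresponds to one combination of per-nest choices of parallel loop:

```python
from itertools import product

def search_steps(candidate_parallel_loops):
    """candidate_parallel_loops holds, for each loop nest, the loops that were
    detected as parallel.  Each yielded combination selects at most one
    parallel loop per nest (None means the nest is run serially) and
    corresponds to one step of the exhaustive search."""
    options = [loops + [None] for loops in candidate_parallel_loops]
    return product(*options)

# Hypothetical program with three loop nests:
nests = [["I"], ["I", "J"], ["K"]]
for step, choice in enumerate(search_steps(nests)):
    print(step, choice)   # one LCG and one MINLP problem per combination
```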

    2.1 Notation Overview

For precise locality analysis, it is essential to have an accurate representation of the array regions accessed in each phase of the code. For this, we use the Linear Memory Access Descriptors (LMADs) [16]. Initially, the LMAD form summarizes the region that an array reference touches during the execution of a phase. The collection of the LMADs for all the references to the array in a phase summarizes the array region accessed in such a phase. One advantage of the LMAD notation is that it can accurately represent array references with affine or nonaffine subscript expressions [16]. In addition, since the LMAD can represent array accesses across procedure boundaries efficiently [20], this notation lets us obtain a complete interprocedural analysis of the source program. Another advantage of the LMAD notation is that it allows, even for complex subscripts, summarization operations that need a polynomial-time algorithm, in contrast to the worst-case exponential algorithms required in the summarization process of the linear constraint-based techniques [21], [22]. More details of this comparison can be found in [16], [20].

The Iteration Descriptor—ID [18], [19]—is an extension of the LMAD developed to more conveniently pinpoint the subregions of an array accessed by each iteration of the parallel loop in the phase. Due to space limitations, the LMADs and IDs are only briefly described in Section 2.2.

The IDs are used to build the LCG (the Locality-Communication Graph). The LCG is a set of directed, connected graphs, each of which represents an array. Each graph contains exactly one node for each phase accessing the array the graph represents. The nodes are connected according to the program control flow. The LCG may contain cycles if there are outer loops that are not included in any phase. For example, Fig. 2 shows the LCG for tomcatv and a fragment of tfft2. The LCG of tomcatv contains four graphs corresponding to arrays X, Y, RX, and RY. In the LCG of the tfft2 code section, there are two graphs, corresponding to arrays X and Y. The LCG is constructed by the Locality-Communication Analysis Algorithm, which we summarize in Section 2.3.

    2.2 Linear Memory Access and Iteration Descriptors

In our framework, accessing an array is viewed as traversing a linear memory space. For example, in Fig. 3, the two-dimensional array access is, in reality, the traversal in a linear memory space starting from the base address $\tau$ (= the memory location for A(1, 4)). The diagram in Fig. 3 illustrates that memory traversals are driven by the loop indices I and J. The LMAD attempts to capture the pattern of a memory traversal driven by a single loop index with the notion of stride and span. The stride is the distance (measured in number of array elements) between accesses generated by consecutive values of the index. In this example, the stride for index I is 2. Similarly, we can see the stride for J is N. The span is the total number of elements that the reference traverses when the index moves across its entire range. In the example, the span for index I is $2 \cdot K$, which is the entire distance traversed between iterations 0 and K for a fixed value of J. Similarly, the span of J is $(M-1) \cdot N$ because J takes values from 1 up to M.

The LMAD contains the base offset and the collection of stride/span pairs of all indices involved in the array access. Its general form is $A^{stride_{i_1},\, stride_{i_2},\, \ldots,\, stride_{i_d}}_{span_{i_1},\, span_{i_2},\, \ldots,\, span_{i_d}} + \tau$. Here, we assume that access to array A is driven by d loop indices, $i_1, i_2, \ldots, i_d$. The base offset is $\tau$, which is the distance in elements from the location of the first reference to A to the beginning of the array. As can be seen in the above example, a single stride/span pair for a loop index—such as $\{2,\ 2 \cdot K\}$ for I and $\{N,\ (M-1) \cdot N\}$ for J—characterizes the access pattern generated by one of the indices assuming the other index or indices remain constant. The LMAD representing all the accesses in the example of Fig. 3 is $A^{2,\, N}_{2 \cdot K,\ (M-1) \cdot N} + 3 \cdot (N-1)$.
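As a minimal sketch of the stride/span semantics (illustrative Python; the helper name and the numeric values of K, M, and N are hypothetical), the offsets touched by an LMAD can be enumerated from its stride/span pairs and base offset, here instantiating the Fig. 3 descriptor:

```python
from itertools import product

def lmad_offsets(base, dims):
    """Offsets touched by an LMAD.  dims is a list of (stride, span) pairs;
    each index contributes multiples of its stride from 0 up to its span,
    and the contributions of all indices are added to the base offset."""
    steps = [range(0, span + 1, stride) for stride, span in dims]
    return sorted({base + sum(combo) for combo in product(*steps)})

# Fig. 3 descriptor: strides (2, N), spans (2*K, (M-1)*N), base offset 3*(N-1)
K, M, N = 3, 4, 10
print(lmad_offsets(3 * (N - 1), [(2, 2 * K), (N, (M - 1) * N)]))
```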

Likewise, the LMAD for the accesses of array D in Fig. 4a is $D^{M,\, N}_{(K-1) \cdot M,\ (\lfloor \frac{M-1+N}{N} \rfloor - 1) \cdot N} + 0$. The LMAD that represents the accesses for the diagonal access of the reference in Fig. 4b is $X^{1,\, I}_{I-1,\ \frac{N^2 - N}{2}} + 1$, and the LMADs that summarize the accesses of the two references in the phase of Fig. 4c are:

$$X^{2 \cdot P,\ J \cdot 2^{L-1},\ 2^{L-1},\ 1}_{(Q-1) \cdot 2 \cdot P,\ (P-2) \cdot J/2,\ (P-2^L)/2,\ 2^{L-1}-1} + 0 \quad \text{and} \quad X^{2 \cdot P,\ J \cdot 2^{L-1},\ 2^{L-1},\ 1}_{(Q-1) \cdot 2 \cdot P,\ (P-2) \cdot J/2,\ (P-2^L)/2,\ 2^{L-1}-1} + \frac{P}{2}, \qquad (1)$$

where $P = 2^R$. These examples illustrate how the LMAD notation can be used to capture the region of an array accessed by each reference in a phase.

The last two LMADs are a good example of accurate representation of accesses controlled by nonaffine subscripts. In this case, despite the apparent complexity of the subscript expressions, a simplification operation such as coalescing can safely remove redundant information that may be contained in the initial LMADs without losing accuracy; for instance, the third and fourth stride/spans of each LMAD can be coalesced as a single pair [23].


    Fig. 2. (a) LCG for the tomcatv code, and (b) LCG for a tfft2 code section.

Fig. 3. Traversal in a linear memory space due to the accesses of array A. In the diagram, dark areas denote the array elements accessed during the loop execution, and solid and dotted curves represent the memory traversal driven by indices I and J, respectively.

Fig. 4. Abbreviated versions of phases from some benchmarks from the Perfect and NAS codes. All the loops are obtained after interprocedural value propagation, induction variable substitution, forward substitution, and parallelism detection. (a) ocean, (b) trfd, and (c) tfft2.

After applying this operation twice to each of those descriptors, the new form of the LMADs is:

$$X^{2 \cdot P,\ 1}_{(Q-1) \cdot 2 \cdot P,\ P/2 - 1} + 0 \quad \text{and} \quad X^{2 \cdot P,\ 1}_{(Q-1) \cdot 2 \cdot P,\ P/2 - 1} + \frac{P}{2}. \qquad (2)$$

We note that, after the coalescing operation, the new LMADs still represent the accesses for the two array references accurately, and are more manageable expressions. More details of the coalescing and other simplification operations can be found in [16].
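A minimal numeric sketch of this kind of simplification, assuming a coalescing condition in the spirit of [16] (two adjacent stride/span pairs merge when the outer stride equals the inner span plus the inner stride, i.e., when the outer index resumes the traversal exactly where the inner one stops); the function and the example values are illustrative only:

```python
def coalesce(dims):
    """dims: (stride, span) pairs ordered from outermost to innermost index.
    Repeatedly merge the two innermost pairs while the outer stride continues
    the inner traversal exactly (stride_out == span_in + stride_in); the merged
    pair keeps the inner stride and adds the two spans."""
    dims = list(dims)
    while len(dims) > 1:
        (s_out, sp_out), (s_in, sp_in) = dims[-2], dims[-1]
        if s_out != sp_in + s_in:
            break
        dims[-2:] = [(s_in, sp_out + sp_in)]
    return dims

# A row-by-row traversal of an M x N array stored contiguously:
M, N = 5, 8
print(coalesce([(N, (M - 1) * N), (1, N - 1)]))   # -> [(1, M*N - 1)]
```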

The Iteration Descriptor (ID) [18] was developed to make explicit the subregions of an array accessed by each iteration of the parallel loop in a phase. For example, the subregions of array X accessed by one iteration, say i, of the parallel I-loop in the kth phase of the program, can be represented by an ID with the general form:

$$\mathcal{I}^k(X, i) = \bigcup_{j=1:m} \mathcal{I}^k_j(X, i) = \Big\{ B,\ D,\ \vec{\delta}_I(i),\ \vec{\Delta},\ \mathrm{access\_type},\ [\mathrm{execution\_pred} : par] \Big\}. \qquad (3)$$

Assuming that there are m different LMADs representing all the accesses to array X in the phase, each $\mathcal{I}^k_j(X, i)$ item is derived from the jth collected LMAD. Each item represents a different subregion LMAD. Thus, the ID is, in general, the collection of several LMADs. In the ID, B is an $m \times s$ matrix with a row for each LMAD and a column for each span derived for a sequential loop index, assuming that there are s sequential loop indices in the LMADs (after applying the stride coalescing operation). The entry D represents a similar matrix, but with column entries representing strides. $\vec{\delta}_I(i)$ is the extended offset vector whose elements are $\delta_I(j, i) = \tau_j + i \cdot \delta_P(j)$ for $j = 1 : m$, where $\delta_P(j)$ is the stride associated with the parallel loop index in the jth LMAD. $\delta_I(j, i)$ points to the first memory position of the subregion accessed by the jth LMAD in the ith iteration of the parallel loop. $\vec{\Delta}$ is the symmetry distance vector. This is originally set to null. The components of this vector are briefly described later.

Fig. 5 shows a graphical representation of the IDs of array X, associated with each iteration i of the parallel loop (index I) in phase $F_3$ of the tfft2 code (the code of Fig. 4c). After computing the two LMADs for the two array references in the code and applying the coalescing operation mentioned before, we find two LMADs (the expressions in (2)). As there are two LMADs, there are two ID items; thus,

$$\mathcal{I}^3_1(X, i) = \Big\{ \big(\tfrac{P}{2} - 1\big),\ (1),\ (2 \cdot P \cdot i),\ (\varnothing),\ \mathrm{W},\ \mathrm{LE}:1 \Big\} \qquad (4)$$

$$\mathcal{I}^3_2(X, i) = \Big\{ \big(\tfrac{P}{2} - 1\big),\ (1),\ \big(\tfrac{P}{2} + 2 \cdot P \cdot i\big),\ (\varnothing),\ \mathrm{R},\ \mathrm{LE}:1 \Big\}, \qquad (5)$$

where i is one of the iterations of the parallel loop. The general expression for the ID $\mathcal{I}^3(X, i)$ is shown in Fig. 5. When i is instantiated to a value, we then find the elements of the array subregions accessed during such parallel iterations. For instance, $\mathcal{I}^3(X, 0)$ defines the elements of the array subregions accessed during parallel iteration $i = 0$, $\mathcal{I}^3(X, 1)$ the elements of the subregions for iteration $i = 1$, and so on. On each parallel iteration, two subregions are accessed: one starts at element $2 \cdot P \cdot i$, which has a span of $P/2 - 1$ and whose elements are separated by stride 1; the other starts at element $P/2 + 2 \cdot P \cdot i$, which has a span of $P/2 - 1$ and whose elements are again separated by stride 1.
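As a minimal sketch (illustrative Python; P is a hypothetical small value, and the helpers are not the compiler's data structures), the two ID items of (4) and (5) can be instantiated for a few parallel iterations to list the subregions they describe:

```python
def subregion(span, stride, start):
    """Elements of one ID item: start, start + stride, ..., start + span."""
    return list(range(start, start + span + 1, stride))

def id_items(P, i):
    """I^3_1(X,i) and I^3_2(X,i) from (4) and (5): spans P/2 - 1, strides 1,
    extended offsets 2*P*i and P/2 + 2*P*i."""
    return (subregion(P // 2 - 1, 1, 2 * P * i),
            subregion(P // 2 - 1, 1, P // 2 + 2 * P * i))

P = 8   # hypothetical small problem size (P = 2**R)
for i in range(3):
    first, second = id_items(P, i)
    print(f"i={i}: [{first[0]}..{first[-1]}] and [{second[0]}..{second[-1]}]")
```

For P = 8 this prints the pairs [0..3]/[4..7], [16..19]/[20..23], and [32..35]/[36..39], which also makes visible the gap of P untouched elements between consecutive parallel iterations that Section 2.3.2 calls the memory gap.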

The ID of an array also includes access and control flow information about the execution of the phase. The term access_type in (3) represents the memory access type for all the subregions accessed by an iteration of the parallel loop. This will be used to annotate the nodes of the LCG (Fig. 2) with one of these possible values: P if the array is privatizable [17] in the phase, W when all accesses are writes, R if all accesses are reads, and R/W if there are both read and write accesses. In the example of Fig. 5, the memory access type for the array in all the subregions is R/W. In addition, one or more execution_pred entries in the ID specify the execution properties of the phase. The possible values are: 1) CE if the phase is conditionally executed, in which case par is the probability of execution of that phase, 2) IE if the phase is inside an outer loop, in which case par is the number of iterations of the external loop, and 3) LE if the phase is executed in lexicographic order, in which case par is always 1. Cases 2 and 3 are exclusive because, during the building of the LCG in the Locality-Communication Analysis algorithm, we need to add a backward edge in the second case (e.g., the backward edge that connects phases $F_1$-$F_7$ in the LCG of Fig. 2a), but not in the third case (e.g., the LCG of Fig. 2b, where all the phases are executed in lexicographic order). In the example of Fig. 5, LE states that $F_3$ is executed lexicographically (and the parameter is 1). The execution predicates and their parameters are used to compute the cost functions, as we will discuss in Section 3.

The $\vec{\Delta}$ vector of an ID can be null or, in some cases, can contain symmetry distances to indicate that there are symmetries between the subregions accessed in a parallel iteration.


Fig. 5. $\mathcal{I}^3(X, i)$ represents the subregions of array X accessed on each iteration i (0, 1, 2, ...) of the parallel I-loop in the phase of Fig. 4c.

This situation appears when the ID contains more than one $\mathcal{I}^k_j(X, i)$ item. In these cases, two ID components can describe subregions with the same spans and strides, but different offsets. We identify the symmetry through three classes of symmetry distances [18]. The most relevant symmetry distance for the Locality-Communication Analysis algorithm is the Overlapping distance, $\Delta_o$. This represents that two subregions accessed in different parallel iterations have the same access pattern, but are partially overlapped. For instance, the ID items that describe them, $\mathcal{I}^k_j$ and $\mathcal{I}^k_l$, have the same spans, strides, and parallel stride, but $\mathcal{I}^k_l(X, i)$ contains elements of $\mathcal{I}^k_j(X, i-1)$. In this case, $\Delta_o = \delta_I(j, i) - \delta_I(l, i)$ specifies the overlapping distance. $\Delta_o$ can be used to compute the overlapped elements (these elements define the shadow areas [19], which are array subregions that our approach will replicate in the local memory of two processors). In the example in Fig. 5, there are no overlapped elements (i.e., the subregion $\mathcal{I}^3_2(X, i)$ does not contain elements of $\mathcal{I}^3_1(X, i-1)$), so $\Delta_o = \mathrm{null}$.

    2.3 Locality-Communication Analysis Algorithm

As we mentioned before, we use the LCG representation to capture the memory locality and communication patterns. Each node of the LCG corresponds to a phase accessing an array X. The compiler labels each node of the graph using the flag access_type of the corresponding ID (i.e., R, W, R/W, and P). The edges connecting nodes are also labeled with two attributes, which we call the locality labels: C if interprocessor communication is needed for the array when execution proceeds from the first node to the second, and L if communication is not needed. The locality labels, L and C, will be determined at the end of the Locality-Communication Analysis Algorithm.

Our Locality-Communication Analysis Algorithm looks for decompositions in which: 1) the iterations of each parallel loop are statically distributed between the H processors involved in the execution of the code according to a CYCLIC(k) (block-cyclic) pattern, and 2) the data distribution for each array is induced from the iteration distribution. In other words, our algorithm ensures the affinity between the parallel iterations and the data required by them.

In the following sections, we present some details of our Locality-Communication Analysis. The interested reader can find more details in [19]. Here, we assume that array X is accessed in phases $F_k$ and $F_g$; thus, there is a node for each phase in the graph that represents the array in the LCG.

    2.3.1 Intraphase Locality

Our algorithm must ensure the affinity between a parallel iteration and the data required by it. Let us assume that parallel iteration i (of phase $F_k$) is scheduled in processor $PE_l$, $0 \le l \le H - 1$, where H is the total number of processors. We can state that the affinity is achieved when all accesses to array X in that iteration are local to $PE_l$, which means that the subregions described by the ID of the array for the parallel iteration i—i.e., $\mathcal{I}^k(X, i)$—are allocated to the local memory of $PE_l$. This is what we call the intraphase locality condition. It is a sufficient condition to ensure that all accesses to array X in iteration i are local to $PE_l$. Clearly, this condition must be fulfilled for all the parallel iterations scheduled on each processor, in order to guarantee the locality of array references in $F_k$. The data distribution for the array in that phase will be derived in order to hold this condition, as we will see next.

    2.3.2 Interphase Locality

Once we have established the locality of array references in phase $F_k$, the Locality-Communication Analysis algorithm determines when a decomposition for the array in two phases, $F_k$ and $F_g$, can avoid communications when execution moves from the first phase to the second.

First, we analyze the case when array X is nonprivatizable in $F_k$ and $F_g$. In this situation, the algorithm needs to relate a sequence of parallel iterations in a phase to the array region accessed by these iterations because we are looking for block-cyclic distributions. To do this, our algorithm uses two values: the upper limit and the memory gap. As shown in Fig. 6, the upper limit, $UL(\mathcal{I}^k(X, i))$, of array X for a parallel iteration i represents the highest memory position of the subregions described by the ID of X in the iteration i. This is computed from the sum of all spans plus the extended offset of the ID:

$$UL(\mathcal{I}^k(X, i)) = \max_j \Big\{ \sum_l B(j, l) + \delta_I(j, i) \Big\}.$$

If there is more than one $\mathcal{I}^k_j(X, i)$ item, then the sum is computed for each component and the upper limit is the maximum value. Similarly, $UL(\mathcal{I}^k(X, i), p) = \max_{i:i+p-1}(UL(\mathcal{I}^k(X, i)))$ represents the highest memory position of all the subregions of X for a sequence of p parallel iterations (the sequence goes from i to $i + p - 1$). In the example of Fig. 6, the upper limit for parallel iteration 0 is $UL(\mathcal{I}^3(X, 0)) = (P - 1) + 2 \cdot P \cdot i = P - 1$, whereas the upper limit for the sequence of iterations $0:2$ is

$$UL(\mathcal{I}^3(X, 0), 3) = \max_{i=0:2}(UL(\mathcal{I}^3(X, i))) = 5 \cdot P - 1.$$
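A minimal sketch of this computation (illustrative Python; the B matrix and extended offsets are transcribed from (4) and (5), and P is hypothetical) reproduces the Fig. 6 values:

```python
def upper_limit(B, delta_I, i):
    """UL(I^k(X,i)): for each ID item j, add its spans B[j] to its extended
    offset delta_I(j, i); take the maximum over the items."""
    return max(sum(B[j]) + delta_I(j, i) for j in range(len(B)))

def upper_limit_seq(B, delta_I, i, p):
    """UL(I^k(X,i), p): highest memory position over the p consecutive
    parallel iterations i .. i + p - 1."""
    return max(upper_limit(B, delta_I, it) for it in range(i, i + p))

# Phase F3 of tfft2 (Fig. 5): two ID items with span P/2 - 1 and extended
# offsets 2*P*i and P/2 + 2*P*i (hypothetical P for illustration).
P = 8
B = [[P // 2 - 1], [P // 2 - 1]]
delta_I = lambda j, i: (0 if j == 0 else P // 2) + 2 * P * i

print(upper_limit(B, delta_I, 0))          # P - 1
print(upper_limit_seq(B, delta_I, 0, 3))   # 5*P - 1
```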

The memory gap, $h_k$, of array X in phase $F_k$ is defined as the distance (measured in number of array elements) between the lowest memory position of the subregions associated with the ID of iteration $i + 1$ and the highest memory position of the subregions associated with the ID of the parallel iteration i.


Fig. 6. The upper limit, the memory gap, and the regular regions of each ID $\mathcal{I}^3(X, i)$ shown in Fig. 5. The marker symbols represent the upper limits, and the memory gap is $h_3 = P$.

There can be phases in the program where this distance is not zero, that is, where the memory gap is a positive value. This happens when the sequential loops of the phase do not access all the memory positions between two consecutive parallel iterations. For example, in Fig. 6, the memory gap between subregions is $h_3 = P$. In other cases, this distance can be negative, but then our algorithm sets $h_k = 0$ because, in this situation, the parallel iterations cover all the memory positions.

The Regular Region, $R^k(X, i)$, of an array for a parallel iteration i of phase $F_k$ can be characterized by the upper limit $UL(\mathcal{I}^k(X, i))$ and the memory gap $h_k$. As illustrated in Fig. 6, each $R^k(X, i)$ (represented by a bounding box) contains all the elements of the array subregions described by the corresponding ID. Similarly, the aggregation of all the regular regions that a sequence of p parallel iterations covers, $\bigcup_{i:i+p-1} R^k(X, i)$, can be characterized by the upper limit for the sequence of parallel iterations i to $i + p - 1$ and the memory gap, i.e., $UL(\mathcal{I}^k(X, i), p)$ and $h_k$. Henceforth, we will call $\bigcup_{i:i+p-1} R^k(X, i)$ the Aggregated Regular Region. $\bigcup_{i:i+p-1} R^k(X, i)$ contains all the elements of the array subregions accessed in the chunk of parallel iterations $i : i + p - 1$.

The $UL(\mathcal{I}^k(X, i), p)$ and $h_k$ values are used to formulate the following balanced locality condition for array X in two phases $F_k$ and $F_g$ in which the array is accessed:

$$UL(\mathcal{I}^k(X, i), p_k) + h_k = UL(\mathcal{I}^g(X, i'), p_g) + h_g \qquad (6)$$

$$1 \le p_k \le \left\lceil \frac{u_k + 1}{H} \right\rceil \qquad (7)$$

$$1 \le p_g \le \left\lceil \frac{u_g + 1}{H} \right\rceil, \qquad (8)$$

where $u_k$ and $u_g$ represent the upper bounds of the parallel loops in phases $F_k$ and $F_g$, and H is the number of processors. For instance, starting at parallel iteration $i = 0$, from Fig. 6 we can deduce for array X that

$$UL(\mathcal{I}^3(X, i{=}0), p_3) + h_3 = (P - 1) + 2 \cdot P \cdot (p_3 - 1) + P = 2 \cdot P \cdot p_3 - 1,$$

and from the code of Fig. 4c we know that $u_3 = Q - 1$; thus, $p_3$ must verify (7),

$$1 \le p_3 \le \left\lceil \frac{Q}{H} \right\rceil.$$

Equation (6) determines the length of the sequences of parallel iterations ($p_k$, $p_g$) that ensure that the aggregated regular region is identical in $F_k$ and $F_g$. Equations (7) and (8) limit the maximum size of each sequence of parallel iterations that can be scheduled in a processor for phases $F_k$ and $F_g$. Thus, if we find a solution ($p_k$, $p_g$) that verifies the system of (6), (7), and (8), we say that the balanced locality condition holds. In other words, we can find a ($p_k$, $p_g$) pair for which the sequence of parallel iterations $i : i + p_k - 1$ in $F_k$ and the sequence $i' \cdot p_g : (i' + 1) \cdot p_g - 1$ in $F_g$ access the array elements contained in the same aggregated regular region of the array. To find a ($p_k$, $p_g$) pair means that (6) must be held by sequences $0 : p_k - 1$ and $0 : p_g - 1$ of $F_k$ and $F_g$, respectively, then by sequences $p_k : 2 \cdot p_k - 1$ and $p_g : 2 \cdot p_g - 1$, and so on until all parallel iterations of $F_k$ and $F_g$ have been considered.

If the sequences of parallel iterations $i : i + p_k - 1$ and $i' \cdot p_g : (i' + 1) \cdot p_g - 1$ are scheduled in $PE_l$, then the accesses to array X are local when the common aggregated regular region is stored in the local memory of $PE_l$ (the intraphase locality condition). This process can be repeated in a round-robin fashion to cover all sequences of parallel iterations. That is, we can find a decomposition in which the sequences of iterations $(l \cdot p_k : (l+1) \cdot p_k - 1)$, $((l+H) \cdot p_k : (l+H+1) \cdot p_k - 1)$, ... from $F_k$, and the sequences of iterations $(l \cdot p_g : (l+1) \cdot p_g - 1)$, $((l+H) \cdot p_g : (l+H+1) \cdot p_g - 1)$, ... from $F_g$, can be scheduled in processor $PE_l$, following a CYCLIC($p_k$) and a CYCLIC($p_g$) iteration distribution, respectively. Next, we can build a data distribution for the two phases in such a way that the corresponding common aggregated regular region of array X covered by each sequence of iterations is allocated in the corresponding local memory. That is, the regular regions of array X induced by the iteration distributions will define the data distributions of the array. The implementation of these data distributions takes place in the data allocation procedure during the code generation step (see Section 3.4). From now on, we will call a sequence of parallel iterations a chunk.
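A minimal sketch of the CYCLIC($p_k$) assignment just described (illustrative Python with hypothetical sizes): chunk c goes to processor c mod H, so $PE_l$ executes the iteration sequences $l \cdot p_k : (l+1) \cdot p_k - 1$, $(l+H) \cdot p_k : (l+H+1) \cdot p_k - 1$, and so on.

```python
import math

def cyclic_chunks(u_k, p_k, H):
    """Round-robin (CYCLIC(p_k)) assignment of chunks of p_k consecutive
    parallel iterations 0 .. u_k to H processors.  Returns, for each
    processor, its list of (first, last) iteration ranges."""
    n_chunks = math.ceil((u_k + 1) / p_k)
    owned = {pe: [] for pe in range(H)}
    for c in range(n_chunks):
        owned[c % H].append((c * p_k, min((c + 1) * p_k - 1, u_k)))
    return owned

# Hypothetical example: 16 parallel iterations, chunks of size 2, 4 processors.
for pe, ranges in cyclic_chunks(u_k=15, p_k=2, H=4).items():
    print(f"PE{pe}: {ranges}")
```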

The Locality-Communication Analysis algorithm checks whether one of the following four situations occurs and labels the LCG accordingly:

1. Array X is nonprivatizable in phases $F_k$ and $F_g$ and the balanced locality condition holds. Thus, it is possible to find iteration distributions in $F_k$ (CYCLIC($p_k$)) and $F_g$ (CYCLIC($p_g$)), and to build a data distribution for the two phases, able to avoid communications, following the aggregated regular region strategy explained above. In this case, the edges connecting the corresponding nodes in the LCG are labeled with L.

2. Array X is nonprivatizable in phases $F_k$ and $F_g$ and the balanced locality condition holds. However, now there are shadow areas for the array in another phase (i.e., $\exists\, \Delta_o$; see Section 2.2) and accesses to the array in $F_k$ are writes. In this case, an updating of the replicated shadow areas may happen, and we say that the Frontier communication pattern occurs. Thus, the nodes are connected with a $C_F$ label (a communication label).

3. Array X is nonprivatizable in phases $F_k$ and $F_g$ and the balanced locality condition does not hold. In other words, each phase requires a different static data distribution, so an array redistribution between the two connected nodes must take place, and we say that the Global communication pattern occurs. Thus, the edges connecting the corresponding nodes in the LCG are labeled with $C_G$ (another communication label).

4. Array X is privatizable in phase $F_k$ ($F_g$). By definition, the value of X in $F_g$ ($F_k$) does not depend on the value of X in phase $F_k$ ($F_g$). In this case, the data allocation procedure—during the code generation step—will privatize on the local memory of each processor the regular regions that correspond to the parallel iterations scheduled on it. The edge connecting the two nodes can be removed (these are the dashed edges in Fig. 2b).

More details and examples that illustrate these situations can be found in [19].

    2.3.3 Chains

Notice that (Global or Frontier) communications occur between the phases that are connected by a C edge, which represents a data redistribution ($C_G$) or an updating of shadow areas ($C_F$) when the program control crosses from one phase to the other. On the other hand, we call the set of nodes that are connected consecutively by L edges a chain of nodes. There can be more than one chain for an array, and each array of the LCG has at least one chain. For instance, there is a chain of nodes for array X that includes the nodes corresponding to phases $F_3$-$F_4$-$F_5$-$F_6$-$F_7$-$F_8$ in the LCG of Fig. 2b.

It is possible to find a static data distribution for all the nodes of a chain such that all the accesses to the array in any of the connected nodes of the chain are in local memory. Intuitively, the reason is that the verification of the balanced locality condition for the L-connected nodes of a chain guarantees that all the nodes belonging to the same chain cover the same common aggregated regular regions of the array. A data allocation procedure must allocate the array elements of the common aggregated regular regions in the corresponding local memory before the first node of a chain. This is implemented during the code generation step (see Section 3.4).

    3 PROBLEM FORMULATION

The goal of this section is to show how, from the array access and control flow information supplied by the IDs of each array, and from the locality information of the LCG, we can derive an optimization problem to find the decompositions (iteration/data distributions) that minimize the execution overhead of the parallel code while exploiting all the available locality detected by the LCG. The $p_k$ values (the size of the chunk in the block-cyclic or CYCLIC($p_k$) iteration distribution for each phase) are the variables of the problem. The objective function of our programming problem is the overhead (in seconds) due to: 1) the load imbalance (which depends on the iteration distribution of each phase, i.e., the computation), and 2) the communications (which depend on the communication patterns of the C edges of the LCG). We compute the objective function in Section 3.1. This objective function is subject to several constraints that are described in Section 3.2. We will see in these sections that, for the computation of the cost functions (especially the communication costs) and the collection of the constraints, we use the graph structure of the LCG. Next, in Section 3.3, we discuss several issues relating to our model and its solutions.

    3.1 Objective Function

The C-connected nodes take part in the definition of the objective function because, as we said, communications are one of the overhead sources that we consider in our approach. The other overhead source is the accumulation of the load imbalance assigned to processors. The general expression of this objective function is:

$$\mathrm{o.f.} = \mathrm{Min} \sum_j \sum_k \Big\{ D^k(X_j, p_k) + C^{kg}(X_j, p_k) \Big\}, \qquad (9)$$

where j traverses all arrays ($X_j$) of the LCG and k traverses, for each array, all the nodes where the array is referenced. $D^k(X_j, p_k)$ represents, for array $X_j$, the load imbalance cost function in phase $F_k$, and $C^{kg}(X_j, p_k)$ represents, for array $X_j$, the communication cost function for the communications that take place between phases $F_k$ and $F_g$ when the corresponding nodes of the LCG are connected with a C edge. Both cost functions will be described next.
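Schematically, the objective of (9) can be assembled as in the following minimal sketch (illustrative Python under assumed interfaces: the LCG is summarized, per array, by the phases referencing it and its C-labeled edges, and load_imbalance and comm_cost stand for the $D^k$ and $C^{kg}$ cost functions of Sections 3.1.1 and 3.1.2; the toy cost models are hypothetical):

```python
def objective(lcg, p, load_imbalance, comm_cost):
    """Evaluate the objective of (9) for a candidate chunk-size vector p.
    lcg maps each array name to (phases referencing it, C-labeled edges);
    p maps each phase k to its chunk size p_k."""
    total = 0.0
    for array, (phases, c_edges) in lcg.items():
        for k in phases:                 # load imbalance term in every phase
            total += load_imbalance(array, k, p[k])
        for k, g in c_edges:             # communication term on C edges only
            total += comm_cost(array, k, g, p[k])
    return total

# Toy usage with dummy cost models:
lcg = {"X": ([3, 4, 5], [(4, 5)]),       # phases referencing X and its C edges
       "Y": ([3, 4], [])}
p = {3: 2, 4: 4, 5: 2}
dummy_D = lambda a, k, pk: 0.1 * pk
dummy_C = lambda a, k, g, pk: 1.0 + 0.01 * pk
print(objective(lcg, p, dummy_D, dummy_C))
```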

    3.1.1 Load Imbalance Cost Function

In this paper, to simplify the computation of the load imbalance cost function, let us assume that, for each phase, loops are perfectly nested and that they have been linearized by the compiler. We compute the load imbalance cost function, $D^k(X_j, p_k)$, of a phase $F_k$ and for an array $X_j$ as the difference between the time consumed by the most loaded processor and the least loaded one. Roughly, this cost function depends on $p_k$ and the number of references to array $X_j$. In order to compute this cost function, we must be able to estimate the computational load of a processor $PE_l$, in other words, the time consumed by all the occurrences of an array $X_j$ in a phase $F_k$ when a CYCLIC($p_k$) iteration distribution has been selected for the parallel loop. This time can be estimated as being due to two factors: 1) the load of processor $PE_l$, i.e., the total number of iterations scheduled in that processor, and 2) the time consumed by all the occurrences of array $X_j$ inside the phase.

In order to calculate factor 1, we start computing the sequential load for a chunk of $p_k$ parallel iterations in a phase $F_k$, $L^k(p_k)$, as the number of sequential iterations executed by the $p_k$ parallel iterations. For example, in Fig. 7, we compute the sequential load for a chunk in a rectangular phase and in a triangular phase. Of course, there are other kinds of triangular phases and domains. However, we choose these two simple cases to illustrate how our method computes the cost functions. In the $L^1(p_1)$ expression (i.e., the number of sequential iterations executed by a chunk of $p_1$ parallel iterations in $F_1$), M and R are the number of iterations of the J-loop and the L-loop (the sequential loops), respectively. In the $L^2(p_2)$ expression, R is the number of iterations of the L-loop and Chunk is the chunk number. That is, if the chunk number for which we are computing $L^2(p_2)$ is $Chunk = 0$ (i.e., the first chunk), that chunk contains the parallel iterations $0 : p_2 - 1$; if the chunk number for which we are computing $L^2(p_2)$ is $Chunk = 1$ (i.e., the second chunk), the chunk now contains parallel iterations $p_2 : 2 \cdot p_2 - 1$, and so on.
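Because Fig. 7 is not reproduced here, the following minimal sketch counts the sequential load of a chunk by direct enumeration for two loop shapes in the spirit of $F_1$ and $F_2$; the loop bounds and the linear growth of the triangular inner bound are assumptions for illustration, and the closed-form $L^1(p_1)$ and $L^2(p_2)$ expressions of Fig. 7 are their symbolic counterparts:

```python
def load_rectangular(chunk, p, M, R):
    """Sequential iterations executed by chunk number `chunk` of size p in a
    rectangular phase: every parallel iteration runs M * R of them."""
    return p * M * R

def load_triangular(chunk, p, R):
    """Triangular phase whose inner loop bound grows with the parallel index i:
    assume i * R sequential iterations per parallel iteration, so the load of a
    chunk depends on the chunk number, as L^2(p_2) does."""
    first = chunk * p
    return sum(i * R for i in range(first, first + p))

# Hypothetical sizes:
M, R, p = 4, 3, 2
print([load_rectangular(c, p, M, R) for c in range(4)])  # constant per chunk
print([load_triangular(c, p, R) for c in range(4)])      # grows with the chunk number
```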

Next, we compute the load of processor $PE_l$ in phase $F_k$, $L^k(PE_l, p_k)$, as the accumulation of the sequential load due to all the chunks scheduled in that processor following a CYCLIC($p_k$) iteration distribution, i.e., $L^k(PE_l, p_k) = \sum_{\forall\, Chunk\ \mathrm{in}\ PE_l} L^k(p_k)$. In other words, $L^k(PE_l, p_k)$ represents the total number of sequential iterations executed by all the chunks scheduled in processor $PE_l$. Obviously, the load of a processor depends on the chunks scheduled on such a processor (Chunk in $PE_l$) and the size of a chunk ($p_k$).

We use a simple approach for the computation of factor 2, which we call $Ref^k(X_j)$. We only roughly estimate the time consumed by all the references to $X_j$ in the loop body of phase $F_k$. In our estimation, the compiler analyzes the subscript expressions of the references to $X_j$ in the phase. We have assumed that arrays are laid out in memory in column-major order because we are analyzing Fortran codes. If there are references with subscript expressions that contain the stride 1 (here, we only look through the strides associated with the sequential loop indexes), then we assume that these references will be satisfied mainly in cache. The other references will be satisfied in local memory. Recall that, in our model, there are no remote references inside a phase; the intraphase locality condition ensures this fact. Therefore, for each phase and for each array $X_j$, we count the number of references that will be satisfied mainly in cache and multiply them by the cache latency of the system. We count the rest of the references (those that will be satisfied in local memory) and multiply them by the local memory latency. As we see, $Ref^k(X_j)$ is a machine-dependent function. Another way to compute $Ref^k(X_j)$ is by using profile information [12].
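A minimal sketch of this estimate (illustrative Python; the stride-based classification follows the description above, and the latency values are hypothetical):

```python
def ref_cost(strides_per_reference, cache_latency, memory_latency):
    """Ref^k(X_j): rough cost of one execution of the loop body's references
    to X_j in phase F_k.  strides_per_reference holds, for each reference,
    the strides associated with its sequential loop indices; stride-1
    references are assumed to hit mainly in cache (column-major layout)."""
    cache_refs = sum(1 for strides in strides_per_reference if 1 in strides)
    memory_refs = len(strides_per_reference) - cache_refs
    return cache_refs * cache_latency + memory_refs * memory_latency

# Hypothetical phase with three references to X_j: two stride-1, one strided.
print(ref_cost([[1, 8], [1], [8, 64]], cache_latency=5e-9, memory_latency=3e-7))
```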

Finally, once factors 1 and 2 have been calculated, we can estimate, for processor $PE_l$, the time consumed by all the occurrences of the array $X_j$ in $F_k$, when a CYCLIC($p_k$) iteration distribution has been selected, as $L^k(PE_l, p_k) \cdot Ref^k(X_j)$. Now, we can compute the load imbalance cost function for each array $X_j$ referenced in $F_k$, $D^k(X_j, p_k)$, as the difference between the time consumed by the processor where the load is maximum ($L^k(PE_M, p_k) \cdot Ref^k(X_j)$) and the processor where the load is minimum ($L^k(PE_m, p_k) \cdot Ref^k(X_j)$):

$$D^k(X_j, p_k) = \Big( L^k(PE_M, p_k) - L^k(PE_m, p_k) \Big) \cdot Ref^k(X_j) \cdot \Big( \prod_a par_a \Big). \qquad (10)$$

Equation (10) includes factors 1 and 2, and the term $\big( \prod_a par_a \big)$. This term represents the contribution of phase $F_k$ to the total overhead time. Thus, we must take into account whether or not the phase is nested in some control statements (such as iterative and/or conditional statements). $par_a$ represents the parameter associated with one of the execution predicates of the ID for array $X_j$ in $F_k$ (see (3)). Thus, $\big( \prod_a par_a \big)$ is the contribution due to all parameters of the ID when the phase is nested in several control statements.

Equation (10) represents a rough approach to the computation of the load imbalance overhead. In our load imbalance cost function, we have to deal with a fact detected in real measurements of the load imbalance: if the number of chunks in the CYCLIC($p_k$) iteration distribution is a multiple of the number of processors (H), then the load imbalance is smaller than when this condition does not hold. This is due to the block-cyclic iteration distribution pattern: when the number of chunks is a multiple of H, all processors have the same number of chunks; otherwise, some processors have one chunk more than others. To capture this issue, we compute (10) in two situations:

1. The number of chunks is a multiple of the number of processors. Then, the number of chunks is the same for all the processors. Clearly, it is easy to deduce that if the phase is rectangular, then

$$L^k(PE_M, p_k) - L^k(PE_m, p_k) = 0. \qquad (11)$$

The expression for a triangular phase is a little more complex because, in this case, the most loaded processor is $PE_M = H - 1$ and the least loaded is $PE_m = 0$. Then,

$$L^k(PE_M, p_k) - L^k(PE_m, p_k) = \left\lceil \frac{u_k + 1}{p_k \cdot H} \right\rceil \cdot p_k^2 \cdot (H - 1) \cdot R. \qquad (12)$$

Remember that, for a triangular phase, we must accumulate the sequential load for the corresponding chunk numbers (see the $L^2(p_2)$ expression in Fig. 7). $u_k + 1$ is the number of iterations of such a parallel loop, and R the number of iterations of all sequential loops that do not depend on the parallel loop index (only the L-loop in our example of $F_2$ in Fig. 7).

    (only the L-loop in our example of F2 in Fig. 7).2. The number of chunks is not a multiple of the

    number of processors. In this case, the most loadedprocessor is the one with the last chunk, i.e.,

    NAVARRO ET AL.: COMPILER TECHNIQUES FOR THE DISTRIBUTION OF DATA AND COMPUTATION 553

    Fig. 7. Computation of the sequential load for a chunk of parallel iterations: F1 is rectangular, F2 is triangular, and Chunk represents the chunk

    number.

$$PE_M = \mathrm{mod}\!\left( \mathrm{mod}\!\left( \left\lceil \frac{u_k + 1}{p_k} \right\rceil,\ H \right) - 1 + H,\ H \right),$$

and the least loaded is $PE_m = PE_M + 1$. For example, when the phase is rectangular, then

$$L^k(PE_M, p_k) - L^k(PE_m, p_k) = L^k(p_k), \qquad (13)$$

that is, there is one chunk of load imbalance. When the phase is triangular, then

$$L^k(PE_M, p_k) - L^k(PE_m, p_k) = \left( \left( \left\lfloor \frac{u_k + 1}{p_k \cdot H} \right\rfloor + 1 \right) \cdot p_k + 1 \right) \cdot p_k \cdot (H - 1) \cdot \frac{R}{2}. \qquad (14)$$

Once we have obtained the two load imbalance cost expressions for situations 1 and 2, we can combine them into a single expression. For this, we can use a binary variable $b_k$: $b_k$ is equal to 0 when the number of chunks is a multiple of H, and equal to 1 in other cases. In Table 1, we summarize the final expressions of the load imbalance cost function (10) derived using this method for the rectangular and triangular phases of Fig. 7. These expressions are derived using (11) and (13) for the rectangular case, and (12) and (14) for the triangular one. From Table 1, we notice that the load imbalance cost functions are nonlinear expressions.
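A minimal sketch assembling (10)-(14) with the binary variable $b_k$ (illustrative Python; ref stands for $Ref^k(X_j)$, par_product for $\prod_a par_a$, and the example values are hypothetical):

```python
import math

def imbalance_rectangular(p_k, u_k, H, M, R, ref, par_product=1.0):
    """D^k for a rectangular phase: zero when the number of chunks is a
    multiple of H (b_k = 0), one chunk of load L^k(p_k) = p_k*M*R otherwise."""
    b_k = 0 if math.ceil((u_k + 1) / p_k) % H == 0 else 1
    diff = b_k * (p_k * M * R)                                   # (11), (13)
    return diff * ref * par_product                              # (10)

def imbalance_triangular(p_k, u_k, H, R, ref, par_product=1.0):
    """D^k for a triangular phase, combining (12) and (14) through b_k."""
    b_k = 0 if math.ceil((u_k + 1) / p_k) % H == 0 else 1
    diff_multiple = ((u_k + 1) // (p_k * H)) * p_k**2 * (H - 1) * R        # (12)
    diff_other = ((((u_k + 1) // (p_k * H)) + 1) * p_k + 1) \
                 * p_k * (H - 1) * R / 2                                   # (14)
    diff = (1 - b_k) * diff_multiple + b_k * diff_other
    return diff * ref * par_product                              # (10)

# Hypothetical sizes:
print(imbalance_rectangular(p_k=2, u_k=15, H=4, M=4, R=3, ref=3e-7))
print(imbalance_triangular(p_k=3, u_k=15, H=4, R=3, ref=3e-7))
```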

    3.1.2 Communication Cost Function

The second overhead source in our model is due to the edges labeled with C ($C_F$ or $C_G$ edges) that connect the nodes corresponding to phases $F_k$ and $F_g$ in the LCG graph of array $X_j$. They represent the need for communications when the program control crosses from one phase to the other because we cannot ensure the locality of references to array $X_j$ in the two connected nodes. In these cases, we compute the communication cost function, $C^{kg}(X_j, p_k)$, which is defined as the time consumed in the communications that take place between the execution of the corresponding phases.

The communication cost for M messages of n elements, for a single-step communication routine in the absence of memory contention, is:

$$C^{kg}(X_j, p_k) = M \cdot \left( \alpha + \frac{1}{\omega} \cdot n \right) \cdot \left( \prod_a par_a \right), \qquad (15)$$

where $\alpha$ represents the startup time and $\omega$ represents the bandwidth of the communication primitive, which obviously depend on the target architecture. On the other hand, $\big( \prod_a par_a \big)$ represents the contribution due to all parameters of the ID for array $X_j$ in phase $F_k$ when the phase is nested in some control statements, as we explained in Section 3.1.1.

As mentioned in Section 2.3, we have considered two communication patterns in our approach: the Frontier communication pattern (a CF edge) and the Global communication pattern (a CG edge). Thus, we have to compute two communication cost functions. What we do is use (15) as the general communication cost function, where M and n depend on the communication pattern.

In this work, we use the put primitive as the communication primitive. Thus, the communications are asynchronously initiated by the processor that owns the data. In other words, the number and size of messages that a processor sends depend on the data that are allocated in the local memory of such a processor.

Let's focus our discussion on the Global communication pattern; the detailed analysis of how to derive the cost functions for the Frontier communication patterns can be found in [24]. The Global communication pattern represents a data reallocation. In this case, two phases F_k and F_g (or two chains of nodes) require two different static data distributions of array X_j. Thus, a redistribution of all the elements of the array must take place after the execution of F_k and before the execution of F_g. The general case of redistribution consists of sending messages of size n = 1, which we call Global Communication without Aggregation. In this case, the number of messages M is the number of elements in the local memory of each processor after the execution of F_k. M depends on: 1) the number of chunks per processor scheduled in phase F_k ($\lceil (u_k+1)/(H \cdot p_k) \rceil$), 2) the size of the chunk (p_k), and 3) the number of elements of array X_j that are in the regular region allocated in the local memory of a processor for each parallel iteration, which we call $\delta_k(X_j)$ and can be computed from the corresponding ID. Other cases of redistribution consist of aggregating messages in blocks in order to reduce the latency of the communications, as well as the number of messages in the network. Message aggregation is provided in our model for two particular cases:

1. Messages are aggregated in blocks of size n = p_k. In this case, the number of messages M depends only on the number of chunks per processor scheduled in phase F_k and the number of elements of the array that belong to a regular region ($\delta_k(X_j)$).

2. Messages are aggregated in blocks of size n = $\delta_k(X_j)$. In this case, the number of messages M depends only on the number of chunks per processor scheduled in phase F_k and the size of the chunk.

We have chosen these cases of message aggregation because they help our compiler simplify the code generation for the Global communication pattern. Summarizing, we show in Table 2 the number of messages (M) and the size of each message (n) that must be considered in (15) for the general case of the Global communication pattern without aggregation of messages (n = 1), and for the particular cases of the Global communication pattern when messages are aggregated in blocks of size n = p_k (case 1) or in blocks of size n = $\delta_k(X_j)$ (case 2). More details can be found in [24], where we show the conditions to perform automatic message aggregation (this issue is beyond the scope of this paper, whose focus is to show how to derive the communication cost function in our approach and that our model is able to consider some cases of message aggregation). In any case, from Table 2, we notice that the communication cost functions are nonlinear expressions.
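The following sketch evaluates (15) for the Global pattern under a plausible reading of Table 2: without aggregation a processor sends one element per message, while cases 1 and 2 aggregate into blocks of p_k or delta_k(X_j) elements, which shrinks M accordingly. The parameter values (alpha, omega, delta_k, prod_par) and the function name are illustrative assumptions.

from math import ceil

# Illustrative evaluation of (15) for the Global pattern; (M, n) follow a
# plausible reading of Table 2.  alpha = startup, omega = bandwidth,
# delta_k = elements of the regular region owned per parallel iteration,
# prod_par = contribution of the ID parameters; all values are made up.

def global_comm_cost(u_k, p_k, H, delta_k, alpha, omega,
                     aggregation=None, prod_par=1.0):
    chunks_per_proc = ceil((u_k + 1) / (H * p_k))
    if aggregation is None:            # no aggregation: one element per message
        M, n = chunks_per_proc * p_k * delta_k, 1
    elif aggregation == "case1":       # blocks of p_k elements
        M, n = chunks_per_proc * delta_k, p_k
    elif aggregation == "case2":       # blocks of delta_k elements
        M, n = chunks_per_proc * p_k, delta_k
    else:
        raise ValueError("aggregation must be None, 'case1' or 'case2'")
    return M * (alpha + n / omega) * prod_par

if __name__ == "__main__":
    for p_k in (1, 8, 64):             # 256 iterations on 4 processors
        print(p_k, [round(global_comm_cost(255, p_k, 4, 1, 1e-5, 5e7, a), 8)
                    for a in (None, "case1", "case2")])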

Table 3 illustrates the objective function for the LCG of the tfft2 code example (Fig. 2b). X_j represents one of the two arrays of the program (X_1 ≡ X, X_2 ≡ Y) and the index k traverses all the nodes where the corresponding array is accessed, following the flow of the corresponding LCG graph.

    3.2 Constraints

As we mentioned earlier, our objective function is subject to different kinds of constraints:

1. Locality constraints. These are derived for each pair of phases in the LCG connected with L, and they represent the balanced locality condition (Section 2.3). The expression of the locality constraints is derived from (6). The verification of these constraints states that it is possible to exploit memory locality without communications between the L-connected nodes.

2. Maximum chunk size constraints. For each phase in the LCG, these constraints set the limits for the size of the chunks. The expression of these constraints is similar to those of (7) and (8). In these equations, we see that the size of the chunks in a phase F_k must be between 1 and $\lceil (u_k+1)/H \rceil$.

3. Nonlinear behavior constraints. These constraints are due to the nonlinear behavior of the load imbalance cost functions. We have seen that the load imbalance cost functions are nonlinear expressions that depend on a binary variable b_k. This binary variable is 0 when the number of chunks in a phase is a multiple of the number of processors, H, and 1 when the number of chunks is not a multiple of H. The expressions of the nonlinear behavior constraints are like those that we show in Table 3. In these expressions, c_k represents the number of chunks, bb_k(l) is a binary variable, and MUL_k(l) is a table of constants (each constant in this table is a multiple of H); a small sketch of these definitions follows this list.

4. Integrality constraints. These constraints ensure that the size of a chunk (p_k) and the number of chunks (c_k) are integer variables, while b_k and bb_k(l) are binary variables.
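The next small sketch only restates the definitions from items 3 and 4 in executable form: the chunk count c_k, the binary flag b_k, and a MUL_k table of the multiples of H that can actually occur as chunk counts. The exact constraint expressions handed to the solver are those of Table 3; the helper names here are illustrative.

from math import ceil

# Executable restatement of the definitions in items 3 and 4: c_k is the
# number of chunks, b_k flags chunk counts that are not multiples of H, and
# MUL_k lists the multiples of H that are reachable as chunk counts.

def chunk_count(u_k, p_k):
    return ceil((u_k + 1) / p_k)                       # c_k

def b_flag(c_k, H):
    return 0 if c_k % H == 0 else 1                    # b_k

def mul_table(u_k, H):
    counts = {chunk_count(u_k, p) for p in range(1, ceil((u_k + 1) / H) + 1)}
    return sorted(c for c in counts if c % H == 0)     # MUL_k

if __name__ == "__main__":
    u_k, H = 255, 4
    print(mul_table(u_k, H))
    for p_k in (3, 4, 10, 64):
        c_k = chunk_count(u_k, p_k)
        print(p_k, c_k, b_flag(c_k, H))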

Examples of all these constraints are shown in Table 3 for the LCG of the tfft2 code section of Fig. 2b.

We note that our method formulates a Mixed Integer Nonlinear Programming problem, or MINLP, which is NP-hard. We have chosen to rely on a general-purpose programming solver to find the optimal solution of our complex problem, rather than to use a heuristic. In order to compute the solution to this problem, we have invoked a MINLP solver of GAMS (General Algebraic Modeling System) [25] called DICOPT (DIscrete and Continuous OPTimizer) [26]. DICOPT is based on extensions of the outer-approximation algorithm for the equality relaxation strategy. The MINLP algorithm inside DICOPT solves a series of NLP (Non-Linear Programming) and MIP (Mixed Integer Programming) subproblems. By solving our optimization problem, we find p_k. In other words, we obtain the optimal CYCLIC(p_k) iteration distribution for each phase in the code, which minimizes the overhead due to communications and load imbalance. Once we have obtained the iteration distributions, the arrays are distributed across the processor memories following the procedure sketched in Section 3.4.
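To make the structure of the optimization problem concrete, the sketch below assembles the objective as the paper describes it, one load imbalance term per phase plus one communication term per C edge of the LCG, and then finds the best chunk sizes for a toy two-phase instance by exhaustive search. The graph encoding, the cost stand-ins, and the constants are illustrative assumptions; the paper builds the real symbolic expressions of Tables 1-3 and solves them with GAMS/DICOPT rather than by enumeration.

from math import ceil

def imbalance_term(u_k, p_k, H, triangular, iter_cost):
    loads = [0.0] * H
    for c in range(ceil((u_k + 1) / p_k)):
        lo, hi = c * p_k, min((c + 1) * p_k, u_k + 1)
        loads[c % H] += sum(range(lo + 1, hi + 1)) if triangular else hi - lo
    return iter_cost * (max(loads) - min(loads))

def comm_term(u_k, p_k, H, delta_k, alpha, inv_omega):
    # Global pattern with messages aggregated in blocks of p_k elements.
    return ceil((u_k + 1) / (H * p_k)) * delta_k * (alpha + p_k * inv_omega)

def objective(phases, comm_edges, p, H, alpha, inv_omega, iter_cost):
    # phases: {k: (u_k, triangular)}; comm_edges: [(k, g, delta_k)];
    # p: {k: chunk size p_k}.  Returns the modelled parallel overhead.
    total = sum(imbalance_term(u, p[k], H, tri, iter_cost)
                for k, (u, tri) in phases.items())
    total += sum(comm_term(phases[k][0], p[k], H, d, alpha, inv_omega)
                 for k, g, d in comm_edges)
    return total

if __name__ == "__main__":
    phases = {1: (255, False), 2: (255, True)}   # F1 rectangular, F2 triangular
    comm_edges = [(1, 2, 1)]                     # one CG edge between F1 and F2
    best = min(((p1, p2) for p1 in range(1, 65) for p2 in range(1, 65)),
               key=lambda pp: objective(phases, comm_edges,
                                        {1: pp[0], 2: pp[1]},
                                        H=4, alpha=1e-5, inv_omega=2e-8,
                                        iter_cost=1e-7))
    print(best)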


TABLE 2. Number (M) and Size (n) of Messages for the Global Communication Pattern

TABLE 3. Objective Function and Constraints for the tfft2 Code Example

3.3 Analysis of the Model and the Solutions

An important question at this point is which of the many components of our overhead model are the most relevant. To illustrate this discussion, we present as a case study the fragment of code of Fig. 9a, where we only consider the references to array X. The Locality-Communication Analysis Algorithm (see Section 2.3) detected the Global communication pattern with aggregation (case 1) for array X when the execution moves from phase F_k to F_g. For this reason, our compiler inserted the "Global communication routine" in the code. Let us assume that the number of processors is H = 4.

Fig. 8a represents the overhead due to the global communications. Fig. 8b represents the overhead due to the load imbalance when we assume that F_k is a rectangular phase (for instance, v = 255). Fig. 8c is the overhead due to the load imbalance when we assume that F_k is a triangular phase (i.e., v depends on the outer loop index, for instance, v = I). In the figures, we show the overheads estimated by our cost functions and the overheads measured in a real machine (the target machine is the Cray T3E). The overheads are times in seconds (Y-axis).

Fig. 8. Estimated versus measured overhead in a Cray T3E: (a) due to communications only, (b) due to load imbalance in a rectangular phase, and (c) due to load imbalance in a triangular phase. X-axis: the size of the chunk, p_k; Y-axis: time in seconds.

To estimate the cost of the global communication pattern with aggregation, we use an expression such as the one in the second row of Table 2. Regarding the load imbalance overhead, we use expressions similar to those shown in Table 1 for a rectangular and for a triangular phase, respectively. One point here is that our cost functions seem to model quite accurately the real behavior of the code overhead in a real machine. We have estimated and measured each one of the overheads for different values of p_k (X-axis). As there are 256 iterations in the parallel loop of F_k, the size of the chunk of the iteration distribution that we are looking for, i.e., p_k, could range from 1 to 64, i.e., from the CYCLIC(1) to the CYCLIC(64) (= BLOCK) distribution. Another point is that we can deduce directly from the figures the optimum p_k that achieves minimum overhead, which is the value of interest.

If we just consider the communication cost function as the only overhead source, our method tends to select CYCLIC(64) = BLOCK distributions for F_k (see Fig. 8a, where the minimum overhead is reached when p_k = 64). If we only consider the load imbalance cost function, the selection of p_k depends on the type of phase: If the phase is rectangular, our method could select those values of p_k for which the number of chunks is a multiple of H (see Fig. 8b, where the minimum is reached for p_k in {1, 2, 4, 8, 16, 32, 64}); if the phase is triangular, our method selects p_k = 1, because bigger values of p_k increase the load imbalance overhead (see Fig. 8c). When we combine the overheads of Figs. 8a and 8b (i.e., communications and load imbalance in a rectangular phase), our method selects p_k = 64 (i.e., the BLOCK distribution). When we combine the overheads of Figs. 8a and 8c (i.e., communications and load imbalance in a triangular phase), our method selects a CYCLIC(p_k) distribution in which the size of the chunk, p_k, is a trade-off between the communications and the load imbalance. More specifically, the value of p_k depends basically on a machine-dependent parameter: the latencies of the memory hierarchy. We now go into this issue more deeply.
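The set of chunk sizes quoted above for Fig. 8b can be checked with a short enumeration: with 256 iterations and H = 4, the rectangular load imbalance is exactly zero only for chunk sizes that divide 256 and yield a number of chunks that is a multiple of H. The helper name below is illustrative.

# With 256 iterations and H = 4, zero rectangular load imbalance requires a
# chunk size that divides 256 and yields a chunk count that is a multiple of H.

def zero_imbalance_chunks(n_iters, H):
    return [p for p in range(1, n_iters // H + 1)
            if n_iters % p == 0 and (n_iters // p) % H == 0]

print(zero_imbalance_chunks(256, 4))               # [1, 2, 4, 8, 16, 32, 64]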

In our research, we consider two types of latencies: the latencies of remote memory access (i.e., the latencies of communication primitives) and the latencies of cache and local memory access. The latencies of communication primitives are considered in the computation of the communication cost function (the parameter $\alpha$). On the other hand, the latencies of cache and local memory access affect the computation of the load imbalance cost function (because they take part in the estimation of Ref_k(X_j)). Let us study Figs. 9b and 9c to illustrate how the latencies affect the selection of the optimum chunk size in our model when we analyze the case that we are now studying: a triangular phase with communications. In these figures, we represent the optimum chunk size selected by our approach (the Y-axis), taking into account communication and load imbalance overheads for the triangular phase of Fig. 9a (i.e., v = I) when H = 4.

Fig. 9. (a) A phase with references to array X and communications; optimum chunk size when: (b) $\alpha$ = 1 microsecond and the latency of local memory ranges from 1 nanosecond to 10 microseconds; (c) latency of local memory = 180 nanoseconds and $\alpha$ varies from 1 nanosecond to 10 microseconds.

In Fig. 9b, we fix the parameter $\alpha$ (the latency of the communication primitive) and vary the latency of local memory (LT) from 1 nanosecond to 10 microseconds (the X-axis). In Fig. 9c, we fix the latency of local memory and vary the latency of the communication primitive ($\alpha$) from 1 nanosecond to 10 microseconds (the X-axis). The other values of the cost functions are fixed to the Cray T3E's parameters. We note that the optimum chunk size ranges from p_k = 64 (the BLOCK distribution) to p_k = 1 (the CYCLIC(1) distribution). We note that our method only selected those values of p_k for which the number of chunks is a multiple of H (in this example, the possible values for p_k are {1, 2, 4, 8, 16, 32, 64}). The main conclusion drawn from Fig. 9b is that, when the latency of local memory


increases, the load imbalance becomes the more important factor in the cost function and the optimum tends toward smaller chunks. For example, with a latency of local memory of 10 nanoseconds, the optimum is p_k = 64; with a latency of 1 microsecond, the optimum is p_k = 16; and beyond 7 microseconds, the optimum is p_k = 1. On the other hand, the main conclusion from Fig. 9c is that, when the latency of the communication primitive increases, the communication cost becomes the more important factor in the cost function and the optimum moves toward bigger chunks. In this case, with a latency of remote memory of 1 nanosecond, the optimum is p_k = 1; with a latency of 1 microsecond, the optimum is p_k = 16; and beyond 9 microseconds, the optimum is p_k = 64. As we can see from these two figures, the effects of the local memory latencies are the opposite of those of the remote memory (communication) latencies when we look for the optimum chunk size in a triangular phase. Intuitively, if we only consider the load imbalance in a triangular phase, scheduling chunks of the smallest size is the right decision. If we only consider communications, aggregating messages to build the biggest chunk is the best option. When we consider the communications and the load imbalance together, we have to find a trade-off value. As we see, this trade-off is captured in our approach.
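The following sketch reproduces the qualitative trend of Figs. 9b and 9c with simplified stand-in cost functions rather than the paper's calibrated model or the Cray T3E constants: sweeping the per-iteration local cost pushes the best chunk toward CYCLIC(1), while sweeping the communication startup pushes it toward BLOCK. All constants and names are illustrative.

from math import ceil

# Qualitative reproduction of Figs. 9b and 9c with stand-in cost functions:
# a triangular phase of 256 iterations on H = 4 processors, communication
# modelled as startup plus bandwidth per aggregated chunk message.

def triangular_imbalance(n_iters, p, H):
    loads = [0] * H
    for c in range(ceil(n_iters / p)):
        lo, hi = c * p, min((c + 1) * p, n_iters)
        loads[c % H] += sum(range(lo + 1, hi + 1))
    return max(loads) - min(loads)

def best_chunk(n_iters, H, alpha, inv_omega, local_cost):
    def total(p):
        comm = ceil(n_iters / (H * p)) * (alpha + p * inv_omega)
        return comm + local_cost * triangular_imbalance(n_iters, p, H)
    return min(range(1, n_iters // H + 1), key=total)

if __name__ == "__main__":
    n_iters, H, inv_omega = 256, 4, 2e-8
    for local_cost in (1e-9, 1e-7, 1e-5):          # growing local-memory cost
        print("local", local_cost, "->",
              best_chunk(n_iters, H, 1e-6, inv_omega, local_cost))
    for alpha in (1e-9, 1e-6, 1e-5):               # growing communication startup
        print("alpha", alpha, "->",
              best_chunk(n_iters, H, alpha, inv_omega, 1.8e-7))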

Some other issues related to our MINLP formulation (such as the complexity of the nonlinear programming problem and the optimality of the solutions) are briefly discussed in Section 4.

    3.4 Parallel Code Generation

In the following, we simply outline the procedure to generate the parallel code, once our method has found, for each phase F_k, the optimum p_k, i.e., the CYCLIC(p_k) iteration distribution that minimizes the overhead of the parallel code.

In Fig. 10, we show how the parallel code is generated by our method for a fragment of the tfft2 code that includes phases F1 and F2. We notice that loops marked as parallel by the compiler in a previous step (Fig. 10a) are decomposed into two nested loops (following a strip-mining scheme), as we see in Fig. 10b. This is the implementation of the CYCLIC(p_k) iteration distribution for each phase. In this way, the external loop, driven by I1, traverses all the chunks, CH_k(PE_l), scheduled on processor PE_l, 0 ≤ l ≤ H - 1, and the internal loop, driven by I2, crosses the parallel iterations of each chunk.

Fig. 10. Phases F1 and F2 of tfft2 code: (a) input code and (b) parallel code.
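In Python terms, and only to show the index arithmetic of the strip-mining scheme (the code the compiler actually emits is Fortran, as in Fig. 10b), the two nested loops can be sketched as follows; the names are illustrative.

# Strip-mining sketch: the outer loop (I1) walks the chunks assigned
# cyclically to processor pe, the inner loop (I2) walks the p_k iterations
# of each chunk.

def my_iterations(u_k, p_k, H, pe):
    # Yield the global iteration numbers 0..u_k executed by processor pe
    # under a CYCLIC(p_k) iteration distribution on H processors.
    n_chunks = -(-(u_k + 1) // p_k)                  # ceil
    for chunk in range(pe, n_chunks, H):             # outer loop: I1
        for i in range(chunk * p_k,                  # inner loop: I2
                       min((chunk + 1) * p_k, u_k + 1)):
            yield i

if __name__ == "__main__":
    # 16 iterations, chunks of 3, 4 processors: check the round-robin layout.
    for pe in range(4):
        print(pe, list(my_iterations(15, 3, 4, pe)))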

Now, we can build the data distribution for each array of a phase. Our data distribution follows the aggregated regular region strategy explained in Section 2.3.2, and takes care of holding the intraphase locality condition that we identified in Section 2.3.1. This is done in the "Data allocation procedure," which is called at the beginning of the code execution. In [24], we show the algorithm of a generic data allocation procedure. Basically, this procedure uses the information provided by the ID of array X to select the array elements that must be allocated in the local memory of processor PE_l for each chunk of p_k parallel iterations scheduled on such a processor.
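A minimal sketch of the data allocation idea follows, under the simplifying assumption that the ID of X maps parallel iteration i to the regular region X[i*delta_k .. (i+1)*delta_k - 1]; real IDs handled by the compiler can be far more general, and the names below are illustrative.

# Data allocation sketch: processor pe allocates exactly the regular regions
# touched by the iterations of its chunks, assuming the simplest possible ID
# in which iteration i touches elements i*delta_k .. (i+1)*delta_k - 1.

def local_elements(u_k, p_k, H, pe, delta_k):
    # Indices of array X placed in the local memory of processor pe.
    owned = []
    n_chunks = -(-(u_k + 1) // p_k)
    for chunk in range(pe, n_chunks, H):                    # chunks of PE_l
        for i in range(chunk * p_k, min((chunk + 1) * p_k, u_k + 1)):
            owned.extend(range(i * delta_k, (i + 1) * delta_k))
    return owned

if __name__ == "__main__":
    # 8 iterations, chunks of 2, 2 processors, 3 elements per iteration.
    for pe in range(2):
        print(pe, local_elements(7, 2, 2, pe, 3))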

Let us return to Fig. 10. Before the execution of F2, the compiler inserts a call to a "Global communication routine" for array X. This is because the Locality-Communication Analysis algorithm detected the Global communication pattern for array X between the execution of phases F1 and F2, and this was indicated in the LCG of Fig. 2b with a CG edge. In addition, from this LCG we can deduce that the compiler will insert another call to a "Global communication routine" for array X between the execution of phases F2 and F3 and, after this point, all references to X will be local for the chain of nodes that correspond to phases F3-F4-F5-F6-F7-F8.

The generation of the "Global communication routine" for an array X follows a process similar to that of a data allocation procedure. The idea of the global communication routine is to traverse all array positions of the regular regions allocated in the local memory of each processor, according to the data distribution of the source phase F_k. The task of the global communication routine is to compute the target processor and the remote array position where each local element must be sent (using the put primitive), according to the data distribution of the target phase. We do not go deeper into this issue due to space constraints, but more details can be found in [24].
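Under the same simplified ID, a Global communication routine can be sketched as follows: each processor walks the elements it owns under the source distribution, computes the owner under the target distribution, and issues a put, here simulated by per-destination message lists. The names and the CYCLIC-over-iterations ownership rule are illustrative assumptions, not the generated Fortran routine.

# Redistribution sketch built on the same simplified ID.  The remote
# position is just the global index in this simplified layout; the real
# routine also computes a position inside the target processor's local
# storage.

def owner(element, p, H, delta):
    # Owner of an element when iteration i touches elements
    # i*delta .. (i+1)*delta - 1 and chunks of p iterations go round robin.
    iteration = element // delta
    return (iteration // p) % H

def global_comm(n_elems, H, src, dst):
    # src and dst are (p, delta) pairs describing the two distributions.
    messages = [[[] for _ in range(H)] for _ in range(H)]
    for x in range(n_elems):
        s = owner(x, src[0], H, src[1])
        d = owner(x, dst[0], H, dst[1])
        if s != d:
            messages[s][d].append(x)         # the put from s into d's memory
    return messages

if __name__ == "__main__":
    msgs = global_comm(16, 2, src=(4, 1), dst=(2, 1))
    for s in range(2):
        for d in range(2):
            if msgs[s][d]:
                print(s, "->", d, msgs[s][d])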

    4 EXPERIMENTAL RESULTS

In this section, we attempt to evaluate the impact of our approach using six benchmarks: two codes from SPECfp95 (tomcatv and swim), three codes from the Perfect Benchmark Suite (bdna, mdg, and trfd), and one code from NAS (tfft2). In Table 4a, we summarize the sizes of the main arrays in the codes.

TABLE 4. (a) Main Arrays and Their Sizes in Our Experiments, and (b) Complexity of Our Model in GAMS

The first task in our experiments was to measure the complexity of our model in the GAMS optimization tool. For all the cases, we built the LCG of the code, derived the nonlinear integer programming problem, and obtained the solutions by invoking the DICOPT solver of GAMS. The second task was to evaluate the efficiency of the decompositions (iteration and data distributions) derived with our techniques on different NUMA machines.

    4.1 Complexity of the Method

In Table 4b, we summarize, for each benchmark, some figures describing the complexity of our nonlinear integer programming model: the number of equations (E), the number of variables (V), the number of terms (T), the number of nonlinear terms (NL.T), the number of iterations (I) until DICOPT found an optimal solution, and the time in milliseconds (Time) that DICOPT spent finding a solution on a MIPS R10000 running at 195 MHz.

We note that the DICOPT solver found optimal solutions in a relatively small number of iterations. For this reason, the time that DICOPT needed to find the solutions was very small: from 90 milliseconds for mdg to 160 milliseconds for tfft2. We believe that these times show the feasibility of our approach. In fact, the times are quite competitive, justifying the fact that our method can be implemented in the Polaris parallelizing compiler as a new pass. We can even take advantage of these very small times to avoid the need to fix H at compile time. Thus, the compiler can generate parallel code in which the value of p_k for each phase is an unknown parameter. At the beginning of the source parallel code, we insert a call to a routine where we have parametrized the objective function and constraints of the code. At runtime, this routine has the task of finding out the number of processors (H) and invoking DICOPT to solve the parametrized MINLP problem of the code, but now knowing H. The output of this routine is the value of p_k for each parallel loop of the code.
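The runtime variant described above can be sketched as a thin wrapper: discover H when the program starts and then solve the parametrized problem for every parallel loop. Since reproducing the GAMS/DICOPT invocation is beyond this sketch, the solver is an injected callback, and the NUM_PROCS environment variable and all names are illustrative assumptions.

import os

# Runtime wrapper sketch: discover H at program start, then solve the
# parametrized problem once per parallel loop.  solve_minlp stands in for
# the DICOPT invocation; NUM_PROCS is an illustrative way of obtaining H.

def choose_chunk_sizes(parallel_loops, solve_minlp, default_H=4):
    # parallel_loops: {loop name: u_k}.  Returns {loop name: p_k}.
    H = int(os.environ.get("NUM_PROCS", default_H))
    return {name: solve_minlp(u_k, H) for name, u_k in parallel_loops.items()}

if __name__ == "__main__":
    toy_solver = lambda u_k, H: -(-(u_k + 1) // H)   # BLOCK-sized stand-in
    print(choose_chunk_sizes({"F1": 255, "F2": 127}, toy_solver))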

Another issue that we should address is that, although the complexity of the source code increases, the runtime of the solver to find the solution does not grow dramatically. For example, bdna is the most complex code, so its number of variables and equations is the biggest. However, if we compare the runtime of bdna versus trfd, which is the least complex code, we notice that they are not significantly different. The reason is that, based on experience with other models, the authors of the DICOPT solver have chosen as the default stopping rule "stop as soon as the NLP subproblem has an optimal objective function that is worse than the previous NLP subproblem" [26]. In several cases, this stopping rule makes DICOPT find the best integer solution in the first few iterations (as happens in our codes). Obviously, this heuristic may provide a suboptimal solution (a local minimum instead of the global minimum) because our objective function is not a convex function. As we see, optimization technology has not yet reached maturity for MINLP problems. One of the best current approaches to avoid a local minimum in MINLP problems is to compute the convex envelope of the objective function and to formulate a new optimization problem with this new function. In this case, DICOPT would always give us the global minimum. We could compare this global minimum with the minimum obtained with our original nonconvex objective function.


If they had the same values of p_k, then we would have found the global minimum. Otherwise, we could set bounds on p_k in the area where the global minimum for the convex envelope was found and, again, formulate the original nonconvex objective function with this new constraint, to try to fit the local minimum of the nonconvex objective function to the global minimum of the convex envelope. As we see, this is a challenging area that needs more research and is beyond the scope of this paper. In any case, for our codes, DICOPT always gave us the global optimum in our initial formulation.

4.2 Results of the Iteration/Data Distributions Derived with Our Method

In order to evaluate the efficiency of the iteration and data distributions derived with our method, we conducted two sets of experiments. In the first set of experiments, two different parallel versions were generated as targets from each sequential program: version 1 and version 2. Version 1 was generated by Polaris applying the techniques currently implemented in the compiler [16], [19]: BLOCK iteration distribution for the rectangular parallel loops or CYCLIC(1) iteration distribution for the triangular ones, and always BLOCK distribution for data. Version 2 was generated applying the techniques discussed in Section 3: The LCG of the source code was built, and then the MINLP problem for the optimal iteration/data distributions was derived. We used the DICOPT solver of GAMS to obtain the solutions. Finally, the distributions were hand-coded following the procedure explained in Section 3.4. We note that the same loops were parallelized for both versions. These experiments were originally intended to evaluate the effectiveness of our distribution model and to roughly estimate how much we could improve the parallel performance once our techniques are fully implemented in the Polaris compiler. In Table 5, we summarize for each program the decomposition selected by Polaris (version 1) and by our method (version 2) for the major parallel loops and arrays.

TABLE 5. Iteration and Data Distributions in Version 1 and Version 2 for the Cray T3D. (H is the number of processors and NMOL, N, P, and NRS are input parameters. A.R.R.S. stands for Aggregated Regular Region Strategy.)

Fig. 11. Executions of the sequential code and its parallel code for version 1 and version 2 on the T3D.

In Fig. 11, we show the execution results of these benchmark programs on the Cray T3D. When we compare the execution times for the two versions, we notice that version 2 always outperforms version 1. Let us examine the reasons for these time differences. In swim and tomcatv, both version 1 and version 2 select BLOCK iteration distributions for all of the parallel loops (BLOCK = CYCLIC(N/H)). However, version 2 is faster than version 1 due to the fact that our approach detects and exploits the Frontier communication pattern [19], which reduces the number of communications drastically.

Regarding the codes bdna and trfd, these contain important triangular loops that were distributed following a CYCLIC(1) and BLOCK schedule in version 1, and following a block-cyclic schedule in version 2. However, the key is that the main arrays of these codes were distributed using the BLOCK data distribution in version 1, whereas the arrays were distributed using the aggregated regular region strategy in version 2. The problem in version 1 is that there is no affinity between the iteration distributions (CYCLIC(1)) and the data distributions (BLOCK) in the triangular loops, so version 1 needs to generate many communications. However, in our approach, the affinity of iterations and data is guaranteed thanks to the intraphase locality condition and the aggregated regular region strategy. Thus, using our techniques, we remove the need for communications while the load is balanced in these triangular loops, improving on the performance of version 1 by approximately 50 percent in these cases.

On the other hand, the tfft2 code is an interesting case. The main loop nests are rectangular, but now the arrays present symmetries between subregions. In other words, in the same iteration of a parallel loop, different subregions of an array are accessed with the same access pattern. This information is captured in version 2 [19], where several block-cyclic schedules of iterations are selected. The locality constraints and the aggregated regular region strategy guarantee the affinity of the iterations and the data accessed in such symmetric subregions. In other words, our technique can identify chunks of parallel iterations and the subregions accessed by these chunks, and can schedule the iterations and place the data in the corresponding processor. However, in version 1, the BLOCK distribution for iterations and data cannot handle this complex situation.

We believe that our techniques will be effective for a wide variety of NUMA machines, including machines with high throughput and low latency in which the data locality issue is a less critical factor than load balancing. The reason for this is that our approach looks for a trade-off between the minimization of communications and load balancing. In order to check the behavior of our approach on a machine with these characteristics, we conducted a second set of experiments, using two different NUMA machines: a Cray T3E and an Origin 2000.

    machines: a Cray T3E and an Origin 2000.In Tables 6 and 7, we show the execution times (in

    seconds) for version 2 (the parallel version based on our

    techniques) on the Cray T3E and on the Origin 2000,

    respectively. In addition, we show the efficiencies of the

    parallel codes. We see superlinear behavior in the swim and

    tomcatv codes (due to a cache effect) on the Cray T3E and

    on the Origin 2000. In fact, we achieve in both machines

    better efficiencies than in the Cray T3D. However, we notice

    from the tables that parallel efficiencies for 16 processors

    are, in general, worse in the Origin 2000 than in the Cray

    T3E. This can be explained to some extent because the put

    primitives are less scalable in the Origin than in the Cray

    T3E. In other words, we note that for the same parallel code

    that achieves the same load balance in both machines, the

    higher the remote latency the worse the program scalability.In summary, we believe that these results show that our

    MINLP formulation helps the compiler to find efficient

    decompositions for real codes to be executed in a variety of

    NUMA platforms.


TABLE 6. Cray T3E: Sequential and Parallel Times of Version 2 in Seconds; Efficiencies of the Parallel Codes

TABLE 7. Origin 2000: Sequential and Parallel Times of Version 2 in Seconds; Efficiencies of the Parallel Codes

5 CONCLUSIONS

    We summarize the contributions of this paper as follows:

1. We have modeled the overhead due to communications and load imbalance in NUMA architectures. Using this model, we have formulated a MINLP optimization problem to find the optimal iteration/data distributions that minimize this overhead while exploiting the available locality.

2. We have demonstrated, with measurements, that the MINLP formulation does not increase compilation time significantly and that our approach generates efficient decompositions for a variety of NUMA architectures.

3. We have shown that communications (latencies of remote memory) and load imbalance (latencies of cache and local memory) can affect in opposite ways the selection of the optimal iteration/data distributions, and that our model addresses this issue.

We note that some of the ideas presented in this paper have been described and reported in a number of journal articles: the use of block-cyclic decompositions for load balancing and scalability, redistribution of data between phases when a static data distribution is not found for the whole program, or the use of nonlinear programming techniques to optimize a MINLP problem. However, the main contribution of this paper is to present a new approach that puts the pieces together and that can be used by a parallelizing compiler to generate efficient parallel code from conventional sequential code without user intervention.

    ACKNOWLEDGMENTS

The authors would like to thank the anonymous referees for their helpful and insightful suggestions. This work was supported in part by the European Union under contract 1FD97-2103, and by the Ministry of Education of Spain under contract TIC2000-1658.

REFERENCES

[1] J. Nielocha, R. Harrison, and R. Littlefield, "Global Arrays: A Portable Shared-Memory Programming Model for Distributed Memory Computers," Proc. Int'l Conf. Supercomputing, pp. 340-349, Nov. 1994.
[2] D.S. Nikolopoulos, T.S. Papatheodorou, C.D. Polychronopoulos, J. Labarta, and E. Ayguade, "A Case for User-Level Dynamic Page Migration," Proc. Int'l Conf. Supercomputing, pp. 119-130, May 2000.
[3] Silicon Graphics, Inc., POWER Fortran Accelerator User's Guide, 1997.
[4] OpenMP Architecture Review Board, OpenMP: A Proposed Industry Standard API for Shared Memory Programming, http://www.openmp.org, 1997.
[5] B. Chapman, A. Patil, and A. Prabhakar, "Performance Oriented Programming for NUMA Architectures," Lecture Notes in Computer Science, vol. 2104, p. 137, 2001.
[6] K. McKinley, S. Carr, and C. Tseng, "Improving Data Locality with Loop Transformations," ACM Trans. Programming Languages & Systems, vol. 18, no. 4, pp. 424-453, July 1996.
[7] I. Kodukula, N. Ahmed, and K. Pingali, "Data-Centric Multilevel Blocking," Programming Language Design and Implementation, June 1997.
[8] M. Kandemir, A. Choudhary, J. Ramanujan, and P. Banerjee, "A Graph Based Framework to Detect Optimal Memory Layouts for Improving Data Locality," Proc. Int'l Parallel Processing Symp., Apr. 1999.
[9] M. Kandemir, P. Banerjee, A. Choudhary, J. Ramanujam, and E. Ayguade, "An Integer Linear Programming Approach for Optimizing Cache Locality," Proc. ACM Int'l Conf. Supercomputing, pp. 500-509, June 1999.
[10] D. Bau, I. Kodukula, V. Kotlyar, K. Pingali, and P. Stodghill, "Solving Alignment Using Elementary Linear Algebra," Proc. Int'l Workshop Languages and Compilers for Parallel Computing, K. Pingali et al., eds., pp. 61-75, Aug. 1994.
[11] S. Chatterjee, J.R. Gilbert, R. Schreiber, and T.J. Sheffler, "Algorithms for Automatic Alignment of Arrays," J. Parallel and Distributed Computing, vol. 38, no. 2, pp. 145-157, Nov. 1996.
[12] J.M. Anderson and M.S. Lam, "Global Optimizations for Parallelism and Locality on Scalable Parallel Machines," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, pp. 112-125, June 1993.
[13] T. LeBlanc and E. Markatos, "Shared Memory vs. Message Passing in Shared-Memory Multiprocessors," Proc. Fourth Symp. Parallel and Distributed Processing, Dec. 1992.
[14] K. Kennedy and U. Kremer, "Automatic Data Layout Using 0-1 Integer Programming," Proc. Int'l Conf. Parallel Architectures and Compilation Techniques, Aug. 1994.
[15] J. Garcia, E. Ayguade, and J. Labarta, "Dynamic Data Distribution with Control Flow Analysis," Proc. Supercomputing, Nov. 1996.
[16] Y. Paek, J. Hoeflinger, and D. Padua, "Simplification of Array Access Patterns for Compiler Optimizations," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, June 1998.
[17] W. Blume, R. Doallo, R. Eigenmann, J. Grout, J. Hoeflinger, T. Lawrence, J. Lee, D. Padua, Y. Paek, W. Pottenger, L. Rauchwerger, and P. Tu, "Parallel Programming with Polaris," Computer, pp. 78-82, Dec. 1996.
[18] A. Navarro, R. Asenjo, E. Zapata, and D. Padua, "Access Descriptor Based Locality Analysis for Distributed-Shared Memory Multiprocessors," Proc. Int'l Conf. Parallel Processing, pp. 86-94, Sept. 1999.
[19] Y. Paek, A. Navarro, E. Zapata, J. Hoeflinger, and D. Padua, "An Advanced Compiler Framework for Noncache-Coherent Multiprocessors," IEEE Trans. Parallel and Distributed Systems, vol. 3, no. 3, pp. 241-259, Mar. 2002.
[20] J.P. Hoeflinger, "Interprocedural Parallelization Using Memory Classification Analysis," PhD thesis, Dept. of Computer Science, Univ. of Illinois at Urbana-Champaign, 1998.
[21] R. Triolet, F. Irigoin, and P. Feautrier, "Direct Parallelization of Call Statements," Proc. SIGPLAN Symp. Compiler Construction, pp. 176-185, 1986.
[22] W. Pugh, "A Practical Algorithm for Exact Array Dependence Analysis," Comm. ACM, vol. 35, no. 8, Aug. 1992.
[23] Y. Paek, "Automatic Parallelization for Distributed Memory Machines Based on Access Region Analysis," PhD thesis, Dept. of Computer Science, Univ. of Illinois at Urbana-Champaign, Apr. 1997.
[24] A.G. Navarro and E.L. Zapata, "An Automatic Iteration/Data Distribution Method Based on Access Descriptors for DSM Multiprocessors," Technical Report UMA-DAC-99/07, Dept. of Computer Architecture, Univ. of Málaga, 1999.
[25] A. Brooke, D. Kendrick, and A. Meeraus, Release 2.25 GAMS: A User's Guide, http://www.gams.com/Default.htm, 1992.
[26] J. Viswanathan and I.E. Grossmann, "A Combined Penalty Function and Outer Approximation Method for MINLP Optimization," Computers and Chemical Eng., vol. 14, pp. 769-782, 1990.

Angeles Navarro received the engineering degree in telecommunications in 1995 and the PhD degree in computer science in 2000, both from the University of Málaga, Spain. From 1997 to 2001, she was an assistant professor in the Computer Architecture Department at the University of Málaga, and she has been an associate professor in the same department since 2001. She lectures on computer organization and architecture. Her research interests are in parallelizing compilers, multiprocessor architectures, and multimedia distributed systems.


Emilio Zapata received the degree in physics from the University of Granada in 1978 and the PhD degree in physics from the University of Santiago de Compostela in 1983. From 1978 to 1982, he was an assistant professor at the University of Granada. In 1982, he joined the University of Santiago de Compostela, where he became a full professor in 1990. Since 1991, he has been a full professor at the University of Málaga. He has published more than 90 journal and 200 conference papers in the parallel computing field (applications, compilers, and architectures). His main research topics include: application fields, compilation issues for irregularly structured computations and computer arithmetic, and application-specific array processors. He is a member of the editorial board of the Journal of Parallel Computing and the Journal of Systems Architecture. He has also been a guest editor of special issues of the Journal of Parallel Computing (Languages and Compilers for Parallel Computers) and the Journal of Systems Architecture (Tools and Environments for Parallel Program Development).

David Padua is a professor of computer science at the University of Illinois at Urbana-Champaign, where he has been a faculty member since 1985. He has served as a program committee member, program chair, or general chair of more than 40 conferences and workshops. He served on the editorial board of the IEEE Transactions of Par