
  • Proceedings of IMPACT 2013

  • IMPACT 2013

    Proceedings of the

3rd International Workshop on Polyhedral Compilation Techniques

Berlin, Germany, January 21, 2013

    in conjunction with HiPEAC 2013

    Workshop organizers and proceedings editors:

Armin Größlinger
Louis-Noël Pouchet

  • Each article in these proceedings is copyrighted by its respective authors.

  • Acknowledgements

The program chairs are grateful to the members of the program committee for their incredible work, time commitment and dedication during the review process for IMPACT 2013. We are also grateful to the authors who submitted to IMPACT, Albert Cohen for his keynote speech, and all participants and attendees of IMPACT 2013 in Berlin.

IMPACT 2013 in Berlin, Germany (in conjunction with HiPEAC 2013) is the third workshop in a series of international workshops on polyhedral compilation techniques. The previous workshops were held in Chamonix, France (2011) in conjunction with CGO 2011 and Paris, France (2012) in conjunction with HiPEAC 2012.


  • Contents

    Committees 1

    Tiling & Dependence Analysis

David G. Wonnacott, Michelle Mills Strout
On the Scalability of Loop Tiling Techniques 3

Tomofumi Yuki, Sanjay Rajopadhye
Memory Allocations for Tiled Uniform Dependence Programs 13

Sven Verdoolaege, Hristo Nikolov, Todor Stefanov
On Demand Parametric Array Dataflow Analysis 23

Parallelism Constructs & Speculation

Imèn Fassi, Philippe Clauss, Matthieu Kuhn, Yosr Slama
Multifor for Multicore 37

Dustin Feld, Thomas Soddemann, Michael Jünger, Sven Mallach
Facilitate SIMD-Code-Generation in the Polyhedral Model by Hardware-aware Automatic Code-Transformation 45

Johannes Doerfert, Clemens Hammacher, Kevin Streit, Sebastian Hack
SPolly: Speculative Optimizations in the Polyhedral Model 55


• Organizers and Program Chairs
Armin Größlinger (University of Passau, Germany)
Louis-Noël Pouchet (University of California Los Angeles, USA)

Program Committee
Christophe Alias (ENS Lyon, France)
Cédric Bastoul (INRIA, France)
Uday Bondhugula (IISc, India)
Philippe Clauss (University of Strasbourg, France)
Albert Cohen (INRIA, France)
Alain Darte (ENS Lyon, France)
Paul Feautrier (ENS Lyon, France)
Martin Griebl (University of Passau, Germany)
Sebastian Hack (Saarland University, Germany)
François Irigoin (MINES ParisTech, France)
Paul Kelly (Imperial College London, UK)
Ronan Keryell (Wild Systems/Silkan, USA)
Vincent Loechner (University of Strasbourg, France)
Benoît Meister (Reservoir Labs, Inc., USA)
Sanjay Rajopadhye (Colorado State University, USA)
P. Sadayappan (Ohio State University, USA)
Michelle Mills Strout (Colorado State University, USA)
Nicolas Vasilache (Reservoir Labs, Inc., USA)
Sven Verdoolaege (KU Leuven/ENS, France)

Additional Reviewers
Somashekar Bhaskaracharya
Roshan Dathathri
Alexandra Jimborean
Athanasios Konstantinidis
Andreas Simbürger


  • On the Scalability of Loop Tiling Techniques

David G. Wonnacott
Haverford College
Haverford, PA, U.S.A.
[email protected]

Michelle Mills Strout
Colorado State University
Fort Collins, CO, U.S.A.
[email protected]

ABSTRACT

The Polyhedral model has proven to be a valuable tool for improving memory locality and exploiting parallelism for optimizing dense array codes. This model is expressive enough to describe transformations of imperfectly nested loops, and to capture a variety of program transformations, including many approaches to loop tiling. Tools such as the highly successful PLuTo automatic parallelizer have provided empirical confirmation of the success of polyhedral-based optimization, through experiments in which a number of benchmarks have been executed on machines with small- to medium-scale parallelism.

In anticipation of ever higher degrees of parallelism, we have explored the impact of various loop tiling strategies on the asymptotic degree of available parallelism. In our analysis, we consider "weak scaling" as described by Gustafson, i.e., in which the data set size grows linearly with the number of processors available. Some, but not all, of the approaches to tiling provide weak scaling. In particular, the tiling currently performed by PLuTo does not scale in this sense.

In this article, we review approaches to loop tiling in the published literature, focusing on both scalability and implementation status. We find that fully scalable tilings are not available in general-purpose tools, and call upon the polyhedral compilation community to focus on questions of asymptotic scalability. Finally, we identify ongoing work that may resolve this issue.

1. INTRODUCTION

The Polyhedral model has proven to be a valuable tool for improving memory locality and exploiting parallelism for optimizing dense array codes. This model is expressive enough to express a variety of program transformations, including many forms of loop tiling, which can improve cache line utilization and avoid false sharing [16, 37, 36], as well as increase the granularity of concurrency.

For many codes, the most dramatic locality improvements occur with time tiling, i.e., tiling that spans multiple iterations of an outer time-step loop. In some cases, the degree of locality can increase with the number of time steps in a tile, providing scalable locality [39]. For non-trivial examples, time tiling often requires loop skewing with respect to the time step loop [27, 39], often referred to as time skewing [39, 38]. This transformation typically involves imperfectly nested loops, and was thus not widely implemented before the adoption of the polyhedral approach. However, the PLuTo automatic parallelizer [19, 6] has demonstrated considerable success in obtaining high performance on machines with moderate degrees of parallelism by using this technique to automatically produce OpenMP parallel code.

Unfortunately, the specific tiling transformations that have been implemented and released in tools like PLuTo involve pipelined execution of tiles, which prevents full concurrency from the start. The lack of immediate full concurrency is sometimes dismissed as a start-up cost that will be trivially small for realistic problem sizes. While this may be true for the degrees of parallelism provided by current multi-core processors, this choice of tiling can impact the asymptotic degree of concurrency available if we try to scale up data set size and machine size together, as suggested by Gustafson [15]. Furthermore, Van der Wijngaart et al. [35] have modeled and experimentally demonstrated the load imbalance that occurs on distributed memory machines when using the pipelined approach.

In this paper, we review the status of implemented and proposed techniques for tiling dense array codes (including the important sub-case of stencil codes) in an attempt to determine whether or not the techniques that are currently being implemented are well suited to machines with higher demands for parallelism and control of memory traffic and communication. The published literature on tiling for automatic parallelization seems to be divided into two disjoint categories: "practical" papers describing implemented but unscalable techniques for automatic parallelizers for dense array codes, and "theoretical" papers describing techniques that scale well but are either not implemented or not integrated into a general automatic parallelizer.

In Section 2 of this paper, we discuss that the approach currently used by PLuTo does not allow full scaling as described by Gustafson [15]. In Section 4, we survey other tilings that have been suggested in the literature, classify each approach as fully scalable or not, and discuss its implementation status in current automatic parallelization tools. We also address recent work by Bondhugula et al. [3] on a tiling technique that we believe will be scalable, though asymptotic scaling is not addressed in [3]. Section 5 presents our conclusions: we believe the scalable/implemented dichotomy is an artifact of current design choices, not a fundamental limitation of the polyhedral model, and can thus be addressed via a shift in emphasis by the research community.


// update N pseudo-random seeds T times
// assumes R[ ] is initialized
for t = 1 to T
  for i = 0 to N-1
S0:   R[i] = (a*R[i]+c) % m

Figure 1: “Embarrassingly Parallel” Loop Nest.

2. TILING AND SCALABILITY

In his 1988 article "Reevaluating Amdahl's Law" [15], Gustafson observed that, in actual practice, "One does not take a fixed size problem and run it on various numbers of processors", but rather "expands [the problem] to make use of the increased facilities". In particular, in the successful parallelizations he described, "as a first approximation, the amount of work that can be done in parallel varies linearly with the number of processors", and it is "most realistic to assume run time, not problem size, is constant". This form of scaling is typically referred to as weak scaling or scalable parallelism, as opposed to the strong scaling needed to give speed-up proportional to the number of processors for a fixed-size problem.

Weak scaling can be found in many data parallel codes, in which many elements of a large array can be updated simultaneously. Figure 1 shows a trivial example that we will use to introduce our diagrammatic conventions (following [38]). In Figure 2 each statement execution/loop iteration is drawn as an individual node, with sample values given to symbolic parameters. The time axis, or outer loop, moves from left to right across the page. The grouping of nodes into tiles is illustrated with variously shaped boxes around sets of nodes. Arrows in the figure denote direction of flow of information among iterations or tiles. Line-less arrowheads indicate values that are live-in to the space being illustrated. (When comparing our figures to other work, note that presentation style may vary in several ways: some authors use a time axis that moves up or down the page; some draw data dependence arcs from a use to a definition, thus pointing into the data-flow like a weather vane; some illustrate tiles as rectangular grids on a visually transformed iteration space, rather than with varied shapes on the original iteration space.)

Figure 2 makes clear the possibility of both (weak) scalable parallelism and scalable locality. In the execution of a tile of size (τ × σ), i.e., τ iterations of the t (time) loop and σ iterations of the i (data) loop, data-flow does not prevent P processors from concurrently executing P such tiles. Each tile performs O(σ × τ) computations and has O(σ) live-in and live-out values; if each processor performs all updates of one data element before moving to the next, O(τ) operations can be performed with O(1) accesses to main memory.
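To make the tiled execution concrete, here is a minimal C sketch (ours, not from the original text) of a (τ × σ)-tiled version of the loop nest in Figure 1; the parameter names and the OpenMP pragma are illustrative choices rather than the output of any particular tool.

#define MIN(x, y) ((x) < (y) ? (x) : (y))

/* (tau x sigma)-tiled form of Figure 1: all tiles within one band of tau
 * time steps are independent, so the loop over tile origins (ii) is parallel. */
void update_seeds_tiled(long *R, int N, int T, long a, long c, long m,
                        int tau, int sigma) {
    for (int tt = 1; tt <= T; tt += tau) {
        #pragma omp parallel for
        for (int ii = 0; ii < N; ii += sigma)
            /* within a tile, finish each element before moving to the next:
               O(tau) updates are performed per O(1) main-memory accesses */
            for (int i = ii; i < MIN(ii + sigma, N); i++)
                for (int t = tt; t <= MIN(tt + tau - 1, T); t++)
                    R[i] = (a * R[i] + c) % m;   /* S0 */
    }
}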

Inter-iteration data-flow can constrain, or even prevent, scalable parallelism or scalable locality. For the code in Figure 1, we can scale up parallelism by increasing N and P, or we can scale up locality by increasing T with the machine balance. However, beyond a certain point (i.e., σ = 1), we can no longer use additional processors to explore ever increasing values of T for a given N. (An increase in CPU clock speed might help in this situation, though beyond a certain point it would likely not help performance for increasing N for a given T.)

Figure 2: Iteration Space of Figure 1 with T=5, N=8, tiled with τ = 5, σ = 4.


As we show in the next two sections, scalable parallelism may be constrained not only by the fundamental data-flow, but also by the approach to parallelization.

3. PIPELINED PARALLELISM

In a Jacobi stencil computation, each array element is updated as a function of its value and its neighbors' values, as shown (for a one-dimensional array) in Figure 3. Thus, we must perform time skewing to tile the iteration space (except in the degenerate case of τ = 1, which prevents scalable locality). Figure 4 illustrates the usual tiling performed by automatic parallelizers such as PLuTo, though for readability our figure shows far fewer loop iterations per tile. Nodes represent executions of statement S1; for simplicity, executions of S2 are not shown. (The same data-flow also arises from a doubly-nested execution of the single statement A[t%2,i] = (A[(t-1)%2,i-1]+2*A[(t-1)%2,i]+A[(t-1)%2,i+1])/4, but some tools may not recognize the program in this form.)
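For reference, the single-statement form mentioned above can be written out as a complete C loop nest; this is the same computation as Figure 3, with the two copies of the array folded into one array of two rows selected by t % 2 (the fixed problem size below is only for illustration).

#define N 1024   /* illustrative problem size */

/* Figure 3 as a doubly nested single statement, double-buffered via t % 2 */
void jacobi_1d(double A[2][N], int T) {
    for (int t = 1; t <= T; t++)
        for (int i = 1; i <= N - 2; i++)
            A[t % 2][i] = (A[(t - 1) % 2][i - 1]
                         + 2 * A[(t - 1) % 2][i]
                         + A[(t - 1) % 2][i + 1]) / 4;
}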

Array data-flow analysis [11, 22, 23] is well understood for programs that fit the polyhedral model, and can be used to deduce the data-flow arcs from the original imperative code. The data-flow arcs crossing a tile boundary describe the communication between tiles; in most approaches to tiling for distributed systems, inter-processor communication is aggregated and takes place between executions of tiles, rather than in the middle of any tile. The topology of the inter-tile data-flow thus gives the constraints on possible concurrent execution of tiles.


for t = 1 to T
  for i = 1 to N-2
S1:   new[i] = (A[i-1]+2*A[i]+A[i+1])/4
  for i = 1 to N-2
S2:   A[i] = new[i]

Figure 3: Three Point Jacobi Stencil.

Figure 4: Iteration Space of Figure 3 with T=5, N=10, tiled with τ = 2, σ = 4.

For Figure 4, concurrent execution is possible in pipelined fashion, in which execution of tiles progresses as a wavefront that begins with the lower left tile, then simultaneously executes the two tiles bordering it (above and to the right), and continues to each wave of tiles adjacent to the just-completed wave.
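The wavefront order just described can be sketched as follows. The helper execute_tile is hypothetical (it stands for the loop nest a tool such as PLuTo would generate for one tile), and the tile dependence pattern assumed here is the one shown in Figure 4: tile (tt, ii) waits for the tiles immediately to its left and below it.

void execute_tile(int tt, int ii);   /* hypothetical: runs one skewed tile atomically */

/* Pipelined (wavefront) execution of tiles, as in Figure 5 (sketch).
 * All tiles on the same anti-diagonal tt + ii == w may run concurrently. */
void run_wavefronts(int numT, int numI) {        /* number of tiles along t and i */
    for (int w = 0; w <= numT + numI - 2; w++) {
        #pragma omp parallel for
        for (int tt = 0; tt < numT; tt++) {
            int ii = w - tt;
            if (ii >= 0 && ii < numI)
                execute_tile(tt, ii);
        }
    }
}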

As has been noted in the literature, this is not the only way to tile this set of iterations; however, other tilings are not (currently) selected by fully-automatic loop parallelizers such as PLuTo [19]. Even the semi-automatic AlphaZ system [40], which is designed to allow programmers to experiment with different optimization strategies, cannot express many of these tilings. If such tools are to be considered for extreme scale computing, we must consider whether or not the tiling strategies they support provide the necessary scaling characteristics.

3.1 Scalability

To support our claim that this pipelined tiling does not always provide scalable parallelism, we need only show that it fails to scale on one of the classic examples for which it has shown dramatic success for low-degree parallelism, such as the easily-visualized one-dimensional Jacobi stencil of Figure 3.


    Figure 5: Wavefronts of Pipelined Tile Execution.

We will first do so, and then discuss the issue in higher dimensions.

For some problem sizes, the tiling of Figure 4 can come close to realizing scalable parallelism: if P = N/(σ+τ) and T/τ is much larger than P, most of the execution is done with P processors. Figure 5 illustrates the first 8τ time steps of such an example, with P = 8, σ = 2τ, and N = P(σ + τ) (the ellipsis on the right indicates a large number of additional time steps). The tiles executed in the first 12 waves are numbered 1 to 12 for reference, and individual iterations and data-flow are omitted for clarity. For all iterations after 11, this tiling provides enough parallelism to keep eight processors busy for this problem size, and for large T the running time approaches 1/8 of the sequential execution time (plus communication/synchronization time, which we will discuss later). If we double both N and P, a similar argument shows the running time approaches 1/16 of the now twice-as-large sequential execution time, i.e., the same parallel execution time, as described by Gustafson.

However, as N and P continue to grow, the assumption that T/τ ≫ P eventually fails, and scalability is lost. Consider what happens in Figure 5 if T = 8τ, i.e., the ellipsis corresponds to 0 additional time steps. At this point, doubling N and P produces a figure that is twice as tall, but no wider; parallelism is limited to degree 8, and execution with 16 processors requires 35 steps rather than the 23 needed for Figure 5 when T = 8τ (note that the upper-right region is symmetric with the lower-left). Thus, communication-free execution time has increased rather than remaining constant. Increasing T (rather than N) with P is no better, and in fact no combination of N and T increase can allow 16 processors to execute twice the work of 8 in the same 23 wavefronts: adding even one full row or one full column of tiles means 24 wavefronts are needed.

Figure 5 was, of course, constructed to illustrate a lack of scalability. But even if we start with a more realistic problem size, i.e., with N ≫ P and T ≫ P, the pipelined tiling still limits scalability.


Consider what happens when we apply the tiling of Figure 5 with parameters N = 10000σ, T = 1000τ, P = 100, in which over 99% of the tiles can be run with full 100-fold parallelism, and then scale up N and P together by successive factors of ten. Our first jump in size gives N = 100000σ, T = 1000τ, P = 1000, which still has full (1,000-fold) parallelism in over 98% of the tiles.

After the next jump, to N = 1000000σ, T = 1000τ, P = 10000, there is no 10,000-fold concurrency in the problem. Even if the application programmer is willing to scale up T rather than just N, in an attempt to reach full machine utilization, the execution for N = 38730σ, T = 25820τ, P = 10000 still achieves 10,000-fold parallelism in only 85% of the 10^9 tiles. No combination of N and T allows any 100,000-fold parallelism on 10^10 tiles with this tiling... to maintain a given degree of parallelism asymptotically, we must scale both N and T with P, contrary to Gustafson's original definition. While this may be acceptable for some application domains, we do not presume it to be universally appropriate, and thus see pipelined tiling as a potential restriction on the applicability of time tiling.

It is not always realistic to scale the number of time steps T. Bassetti et al. [4] introduce an optimization they call sliding block temporal tiling. They indicate that in relaxation algorithms such as Jacobi "[g]enerally several sweeps are made". In their experiments they use up to 8 sweeps. Zhou et al. [41] use 16,384 in their modified benchmarks. Experiments with the Pochoir compiler [34] used 200 time steps because their cache oblivious performance optimization improves temporal locality. In summary, the number of time iterations in stencil computation performance optimization research varies dramatically. Work from the BeBOP group at Berkeley [8] discusses how often multiple sweeps over the grid within one loop occur and indicates that it may not be as common as those of us working on time skewing imagine. This makes it even more important for tiling strategies to provide scalable parallelism that does not require the number of time steps to be on par with the spatial domain.

3.2 Tile Size Choice and Communication Cost

The above argument presumes a fixed tile size, ignoring the possibility of reducing the tile size to increase the number of tiles. However, communication costs (either among processors or between processors and RAM) dictate a minimal tile size below which performance will be negatively impacted (see [38, 19] for further discussion).

Note that per-processor communication cost is not likely to shrink as the data set size and number of processors are scaled up: each processor will need to send the same number of tiles per iteration, producing per-processor communication cost that remains roughly constant (e.g., if processors are connected to nearest neighbors via a network of the same dimensionality as the data set space) or rising (e.g., if processors are connected via a shared bus).

Even if we ignore communication costs entirely (i.e., in the notoriously unscalable PRAM abstraction), tile size cannot shrink below a single-iteration (or single-instruction) tile, and eventually our argument of Section 3.1 comes into play in asymptotic analysis.

3.3 Other Factors Influencing Scalability and Performance

Our argument focuses on tile shape, but a number of other factors will influence the degree of parallelism actually achieved by a given tiling. As noted above, reductions in tile size could, up to a point, provide additional parallelism (possibly at the cost of performance on the individual nodes).

The use of global barriers or synchronization is common in the original code generators, but note that MPI code generators are under development for both PLuTo and AlphaZ. While combining different tilings and code generation schemes raises practical challenges in implementation, we do not see any reason why any of the tilings discussed in the next section could not, in principle, be executed without global barriers within the tiled iteration space.

High performance on each node also requires attention to a number of other code generation issues, such as code complexity and impact on vectorization and prefetching. Furthermore, these issues could be exacerbated by changes in tile shape [3, Section III.B]. Thus, different tiling shapes may be optimal for different hardware platforms or even different problem sizes, depending on the relative costs of limiting parallelism vs. per-node performance. Both [3, Section III.B] and [33] discuss these issues and the possibility of choosing a tiling that is scalable in some, but not all, data dimensions.

3.4 Higher-Dimensional Codes

While the two-dimensional iteration space of the three-point Jacobi stencil is easy to visualize on paper, many of the subtleties of tiling techniques are only evident in problems with at least two dimensions of data and one of time. For pipelined tiling, the conflict with scalability is essentially the same in higher dimensions: for a pipelined tiling of a hyper-rectangular iteration space of dimension d, eventually the amount of work must grow by O(k^d) to achieve parallelism O(k^(d-1)).

Conversely, in higher dimensions, the existence of a wavefront that is perpendicular to the time dimension (or any other face of a hyper-rectangular iteration space) is frequently the sign of a parallelization that admits some form of weak scalability. However, as we will see, the parallelism of some tilings scales with only some of the spatial dimensions.

4. VARIATIONS ON THE TILING THEME

The published literature describes many approaches to loop tiling. In this section, we survey these approaches, grouping together those that produce similar (or identical) tilings. Our descriptions focus primarily on the tiling that would be used for the code of Figure 3, which is used as an introductory example in many of the descriptions. We illustrate the tilings of this code with figures that are analogous to our Figure 5, with gray shading highlighting a single tile from time step two.



    Figure 6: Overlapped Tiling.

We delve into the complexities of more complex codes only as necessary to make our point.

For iterative codes with intra-time-step data-flow, the tile shapes discussed below are not legal (or cannot be legally executed in an order that permits scalable parallelism). For example, consider an in-place update of a single copy of a data set, e.g. the single statement A[i] = (A[i-1]+2*A[i]+A[i+1])/4 nested inside t and i loops. Since each tile must wait for data from tiles with lower values of i and the same value of t, pipelined startup is necessary. While such code provides additional asymptotic concurrency when both T and N increase, we see no way to allow concurrency to grow linearly with the total work to be done in parallel. Thus, pipelined tiling does not produce a scalability disadvantage for such codes.

Note that our discussion below focuses on distinct tile shapes, rather than distinctions among algorithms used to deduce tile shape or size or the manner in which individual tiles are scheduled or assigned to processors. For example, we do not specifically discuss the "CORALS" approach [32], in which an iteration space is recursively subdivided into parallelograms, avoiding the need to choose a tile size in advance of starting the computation. Regardless of size, variation in size, and algorithmic provenance, the information flow among atomic parallelogram tiles still forces execution to proceed along the diagonal wavefront, and thus still limits asymptotic scalability.

4.1 Overlapped Tiling

A number of projects have experimented with what is commonly called overlapped tiling [26, 4, 2, 25, 10, 24, 19, 7, 20, 41]. In overlapped tiling for stencil computations, a larger halo is maintained so that each processor can execute more than one time step before needing to communicate with other processors.


    Figure 7: Trapezoidal Tiling.

Figure 6 illustrates this tiling. Two individual tiles from the second wavefront have been indicated with shading, one with gray and one with polka dots; the triangular polka-dotted and gray region is in both tiles, and thus represents redundant computation. This overlap means that all tiles along each vertical wavefront can be executed in parallel while still improving temporal data locality.
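The redundant-work trade-off can be seen in a minimal C sketch of one overlapped tile for the 1-D Jacobi of Figure 3 (our illustration, not code from any of the cited tools): the tile owns the block [lo, hi), starts from a halo of width tau on each side, and lets the computed region shrink by one cell per time step, so tau steps complete with no intermediate communication.

/* One overlapped tile: tau time steps over the owned block [lo, hi),
 * assuming the halo cells of A were exchanged before the call. */
void overlapped_tile(double *A, double *tmp, int N, int lo, int hi, int tau) {
    for (int s = 0; s < tau; s++) {
        int left  = lo - (tau - 1 - s);          /* computed region shrinks      */
        int right = hi - 1 + (tau - 1 - s);      /* by one cell per time step    */
        if (left  < 1)     left  = 1;            /* clamp at the domain boundary */
        if (right > N - 2) right = N - 2;
        for (int i = left; i <= right; i++)
            tmp[i] = (A[i - 1] + 2 * A[i] + A[i + 1]) / 4;
        double *swap = A; A = tmp; tmp = swap;   /* ping-pong buffers */
    }
    /* the caller then exchanges fresh halos with neighbouring tiles */
}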

In terms of parallelism scalability, overlapped tiling does scale because all of the tiles can be executed in parallel. If a two-dimensional tiling in a two-dimensional spatial part of a stencil is used as the seed partition, then two dimensions of parallelism will be available with no need to fill a pipeline. This means that as the data scale, so will the parallelism.

The problem with overlapped tiling is that redundant computation is performed. This leads to a trade-off between parallel scalability and execution time. Tile size selection must also consider the effect of the expanded memory footprint caused by overlapped tiling.

Auto-tuning between overlapped sparse tiling and non-overlapped sparse tiling [29, 30] for irregular iteration spaces has also been investigated by Demmel et al. [9] in the context of iterative sparse matrix computations where the tiling is a run-time reordering transformation [28].

4.2 Trapezoidal Tiling

Frigo and Strumpen [12, 13, 14] propose an algorithm for limiting the asymptotic cache miss rate of "an idealized parallel machine" while providing scalable parallelism. Figure 7 illustrates that even a simplified version of their approach can enable weak scaling for the examples we discuss here (their full algorithm involves a variety of possible decomposition steps; our figure is based on Figure 4 of [14]).



Figure 8: One data dimension and the time dimension in a diamond tiling [33]. The diamond extends into a diamond tube in the second data dimension.

In our Figure 7, the collection of trapezoids marked "1" can start simultaneously; after these tiles complete, the mirror-image trapezoids that fill the spaces between them, marked "2", can all be executed; after these steps, a similar pair of sets of tiles "3" and "4" complete another τ time steps of computation, etc. For discussion of the actual transformation used by Frigo and Strumpen, and its asymptotic behavior, see [14].
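A sketch of that two-phase schedule for the 1-D Jacobi is shown below, assuming a hypothetical helper run_trapezoid(t0, t1, x0, dx0, x1, dx1) that executes all iterations in the trapezoid whose i-extent at time t0 is [x0, x1) and whose sides move by dx0 and dx1 cells per time step; width is assumed to be comfortably larger than 2*tau so the phase-1 trapezoids do not degenerate, and boundary trapezoids at the domain edges are ignored.

void run_trapezoid(int t0, int t1, int x0, int dx0, int x1, int dx1);  /* hypothetical */

void trapezoidal_sweep(int N, int T, int tau, int width) {
    for (int t0 = 0; t0 < T; t0 += tau) {
        int t1 = (t0 + tau < T) ? t0 + tau : T;
        /* phase 1: upright trapezoids, sides sloping inward; all independent */
        #pragma omp parallel for
        for (int x = 0; x < N; x += width)
            run_trapezoid(t0, t1, x, +1, x + width, -1);
        /* phase 2: inverted trapezoids fill the gaps left by phase 1 */
        #pragma omp parallel for
        for (int x = width; x < N; x += width)
            run_trapezoid(t0, t1, x, -1, x, +1);
    }
}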

The limitation of this approach is not its scalability, but rather the challenge of implementing it in a general-purpose compiler. Tang et al. [34] have developed the Pochoir compiler, based on a variant of Frigo and Strumpen's techniques with a higher degree of asymptotic concurrency [34]. However, Pochoir handles a specialized language that allows only stencil computations. Tools like PLuTo handle a larger domain of dense array codes; it may be possible to generalize trapezoidal tiling to PLuTo's domain, but we know of no such work.

4.3 Diamond Tiling

Strzodka et al. [33, 31] present the CATS algorithm for creating diamond "tube" tiles in a 3-d iteration space. The diamonds occur in the time dimension and one data dimension, as in Figure 8. The tube aspect occurs because there is no tiling in the other space dimension. Each diamond tube can be executed in parallel with all other diamond tubes within a temporal row of diamond tubes. For example, in Figure 8 all diamonds labeled "1" can be executed in parallel, after which all diamonds labeled "2" can be executed in parallel, etc.


    Figure 9: Molecular tiling.

Within each diamond tube, the CATS approach schedules another level of wavefront parallelism at the granularity of iteration points.

Although Strzodka et al. [33] do not use diamond tiles for 1-d data/2-d iteration space, diamond tiles are parallel scalable within that context. They actually focus on the 2-d data/3-d iteration space, where asymptotically, diamond tiling only scales for one data dimension. The outermost level of parallelism over diamond tubes only scales with one dimension of data since the diamond tiling occurs across time and one data dimension. On page 2 of [33], Strzodka et al. explicitly state that their results are somewhat surprising in that asymptotically their behavior should not be as good as previously presented tiling approaches, but the performance they observe is excellent, probably due to the concurrent parallel startup that the diamond tiles provide.

Diamond tiling is a practical approach that can perform better than pipelined tiling approaches because it avoids the pipeline fill and drain issue. The diamond tube also has advantages in terms of intra-tile performance: fine-grained wavefront parallelism and leveraging pre-fetchers. The disadvantages of the diamond tiling approach are that it has not been expressed within a framework such as the polyhedral model (although it would be possible, just not with rectangular tiles); that the approach does not cleanly extend to higher dimensions of data (only one dimension of diamond tiles is possible, with other dimensions doing some form of pipelined or split tiling); and that the outermost level of parallelism can only scale with one data dimension.

4.4 Molecular Tiling

Wonnacott [38] described a tiling for stencils that allows true weak scaling for higher-dimensional stencils, performs no redundant work, and contains tiles that are all the same shape.


However, Wonnacott's molecular tiles required mid-tile communication steps, as per Pugh and Rosser's iteration space slicing [21], as illustrated in Figure 9. Each tile first executes its send slice (labeled "S"), the set of iterations that produce values that will be needed by another currently-executing tile, and then sends those values; it then goes on to execute its compute slice ("C"), the set of iterations that require no information from any other currently-executing tile; finally, each tile receives incoming values and executes its receive slice ("R"), the set of iterations that require these data. In higher dimensions, Wonnacott discussed the possibility of extending the parallelograms into prisms (as diamonds are extended into diamond tubes in the diamond tiling), but also presented a multi-stage sequence of send and receive slices to provide full scalability.
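The per-tile execution order just described can be summarized in a short sketch; the tile descriptor and the slice/communication routines are hypothetical stand-ins for what iteration space slicing would generate.

typedef struct Tile Tile;            /* hypothetical per-tile descriptor */
void run_send_slice(Tile *t);        /* "S": produces values other active tiles need */
void send_boundary_values(Tile *t);
void run_compute_slice(Tile *t);     /* "C": needs nothing from other active tiles   */
void recv_boundary_values(Tile *t);
void run_receive_slice(Tile *t);     /* "R": consumes the values just received       */

void execute_molecular_tile(Tile *tile) {
    run_send_slice(tile);
    send_boundary_values(tile);      /* sends can proceed while the compute slice runs */
    run_compute_slice(tile);
    recv_boundary_values(tile);
    run_receive_slice(tile);
}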

Once again, a transformation with potential for true weak scaling remains unrealized due to implementation challenges. No implementation was ever released for iteration space slicing [21]. For the restricted case of stencil computations, these molecular tiles can be described without reference to iteration space slicing, but they make extensive use of modulo constraints supported by the Omega Library's code generation algorithms [18], and Omega has no direct facility for generating the required communication primitives.

The developers of PLuTo explored a similar split tiling [19] approach, and demonstrated improved performance over the pipelined tiling, but this approach was not used for the released implementation of PLuTo.

4.5 A New Hope

Recent work on the PLuTo system [3] has produced a tiling that we believe will address the issue of true scalability with data set size, though the authors frame their approach primarily in terms of "enabling concurrent start-up" rather than improving asymptotic scalability. For a one-dimensional data set, this approach is essentially the same as the "diamond tiling" of Section 4.3, but for higher-dimensional stencils it allows scalable parallelism in all data dimensions.

Although a Jacobi stencil on a two-dimensional data set has an "obvious" four-sided pyramid of dependences, a collection of four-sided pyramids (or a collection of octahedrons made from pairs of such pyramids) cannot cover all points, and thus does not make a regular tiling. The algorithm of [3] produces, instead, a collection of six-faced tiles that appear to be balanced on one corner (these figures are shown with the time dimension moving up the page). Various sculptures of balanced cubes may be helpful in visualizing this tiling; those with limited travel budgets may want to search the internet for images of the "Zabeel park cube sculpture". The approach of [3] manages to construct these corner-balanced solids in such a way that the three faces at the bottom of the tile enclose the data-flow.

Experiments with this approach [3] demonstrate improved results (vs. pipelined tiling) for current shared-memory systems up to 16 cores. The practicality of this approach on such a low degree of parallelism suggests that the per-node penalties discussed in our Section 3.3 are not prohibitively expensive.

While the algorithm is described in terms of stencils, and the authors only claim concurrent startup for stencils, it is implemented in a general automatic parallelizer (PLuTo). We believe it would be interesting to explore the full domain over which this tiling algorithm provides concurrent startup.

The authors of [3] do not discuss asymptotic complexity, but we hope that future collaborations could lead to a detailed theoretical and larger-scale empirical study of the scalability of this technique, using the distributed tile execution techniques of [1] or [5].

4.6 A Note on Implementation Challenges

The pipelined tile execution shown in Figures 4 and 5 is often chosen for ease of implementation in compilers based on the polyhedral model. Such compilers typically combine all iterations of all statements into one large iteration space; the pipelined tiling can then be seen as a simple linear transformation of this space, followed by a tiling with rectangular solids. This approach works well regardless of choice of software infrastructure within the polyhedral model.

The other transformations may be more sensitive to choice of software infrastructure, or the subtle use thereof. At this time, we do not have an exact list of which transformations can be expressed with which transformation and code generation libraries. We are working with tools that allow the direct control of polyhedral transformations from a text input, such as AlphaZ [40] and the Omega Calculator [17], in hopes of better understanding the expressiveness of these tools and the polyhedral libraries that underlie them.

5. CONCLUSIONS

Current work on general-purpose loop tiling exhibits a dichotomy between largely unimplemented explorations of asymptotically high degrees of parallelism and carefully tuned implementations that restrict or inhibit scalable parallelism. This appears to result from the challenge of general implementation of scalable approaches. The pipelined approach requires only a linear transformation of the iteration space followed by rectangular tiling, but does not provide true scalable parallelism. Diamond tiling scales with only one data dimension. Overlapped, trapezoidal, and molecular tiling each pose implementation challenges (due to redundant work, non-uniform tile shape/orientation, or non-atomic tiles, respectively).

We believe automatic parallelization for extreme scale computing will require a tuned implementation of a general technique that does not inhibit or restrict scalability; thus future work in this area must address scalability, generality, and quality of implementation. We are optimistic that ongoing work by Bondhugula et al. [3] may already provide an answer to this dilemma.

6. ACKNOWLEDGMENTS

This work was supported by NSF Grant CCF-0943455, by a Department of Energy Early Career Grant DE-SC0003956, and the CACHE Institute grant DE-SC04030.


7. REFERENCES

[1] Abdalkader, M., Burnette, I., Douglas, T., and Wonnacott, D. G. Distributed shared memory and compiler-induced scalable locality for scalable cluster performance. Cluster Computing and the Grid, IEEE International Symposium on 0 (2012), 688–689.

[2] Allen, G., Dramlitsch, T., Foster, I., Goodale, T., Karonis, N., Ripeanu, M., Seidel, E., and Toonen, B. The Cactus code: A problem solving environment for the grid. In Proceedings of the Ninth IEEE International Symposium on High Performance Distributed Computing (HPDC9) (Pittsburgh, PA, USA, 2000).

[3] Bandishti, V., Pananilath, I., and Bondhugula, U. Tiling stencil computations to maximize parallelism. In Proceedings of the 2012 International Conference for High Performance Computing, Networking, Storage and Analysis (November 2012), SC '12, ACM Press.

[4] Bassetti, F., Davis, K., and Quinlan, D. Optimizing transformations of stencil operations for parallel object-oriented scientific frameworks on cache-based architectures. Lecture Notes in Computer Science 1505 (1998).

[5] Bondhugula, U. Compiling affine loop nests for distributed-memory parallel architectures. Preprint, 2012.

[6] Bondhugula, U., Hartono, A., and Ramanujam, J. A practical automatic polyhedral parallelizer and locality optimizer. In PLDI '08: Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation (2008).

[7] Christen, M., Schenk, O., Neufeld, E., Paulides, M., and Burkhart, H. Manycore stencil computations in hyperthermia applications. In Scientific Computing with Multicore and Accelerators, J. Dongarra, D. Bader, and J. Kurzak, Eds. CRC Press, 2010, pp. 255–277.

[8] Datta, K., Kamil, S., Williams, S., Oliker, L., Shalf, J., and Yelick, K. Optimizations and performance modeling of stencil computations on modern microprocessors. SIAM Review 51, 1 (2009), 129–159.

[9] Demmel, J., Hoemmen, M., Mohiyuddin, M., and Yelick, K. Avoiding communication in sparse matrix computations. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS) (Los Alamitos, CA, USA, 2008), IEEE Computer Society.

[10] Ding, C., and He, Y. A ghost cell expansion method for reducing communications in solving PDE problems. In Proceedings of the ACM/IEEE Conference on Supercomputing (New York, NY, USA, 2001), Supercomputing '01, ACM, pp. 50–50.

[11] Feautrier, P. Dataflow analysis of scalar and array references. International Journal of Parallel Programming 20, 1 (Feb. 1991), 23–53.

[12] Frigo, M., and Strumpen, V. Cache oblivious stencil computations. In Proceedings of the 19th Annual International Conference on Supercomputing (New York, NY, USA, 2005), ICS '05, ACM, pp. 361–366.

[13] Frigo, M., and Strumpen, V. The cache complexity of multithreaded cache oblivious algorithms. In Proceedings of the Eighteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures (New York, NY, USA, 2006), SPAA '06, ACM, pp. 271–280.

[14] Frigo, M., and Strumpen, V. The cache complexity of multithreaded cache oblivious algorithms. Theor. Comp. Sys. 45, 2 (June 2009), 203–233.

[15] Gustafson, J. L. Reevaluating Amdahl's law. Communications of the ACM 31, 5 (May 1988), 532–533.

[16] Irigoin, F., and Triolet, R. Supernode partitioning. In Conference Record of the Fifteenth ACM Symposium on Principles of Programming Languages (1988), pp. 319–329.

[17] Kelly, W., Maslov, V., Pugh, W., Rosser, E., Shpeisman, T., and Wonnacott, D. The Omega Calculator and Library. Tech. rep., Dept. of Computer Science, University of Maryland, College Park, Apr. 1996.

[18] Kelly, W., Pugh, W., and Rosser, E. Code generation for multiple mappings. In The 5th Symposium on the Frontiers of Massively Parallel Computation (McLean, Virginia, Feb. 1995), pp. 332–341.

[19] Krishnamoorthy, S., Baskaran, M., Bondhugula, U., Ramanujam, J., Rountev, A., and Sadayappan, P. Effective automatic parallelization of stencil computations. In Proceedings of Programming Languages Design and Implementation (PLDI) (New York, NY, USA, 2007), vol. 42, ACM, pp. 235–244.

[20] Nguyen, A., Satish, N., Chhugani, J., Kim, C., and Dubey, P. 3.5-D blocking optimization for stencil computations on modern CPUs and GPUs. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (Washington, DC, USA, 2010), SC '10, IEEE Computer Society, pp. 1–13.

[21] Pugh, W., and Rosser, E. Iteration slicing for locality. In 12th International Workshop on Languages and Compilers for Parallel Computing (Aug. 1999).

[22] Pugh, W., and Wonnacott, D. Eliminating false data dependences using the Omega test. In SIGPLAN Conference on Programming Language Design and Implementation (San Francisco, California, June 1992), pp. 140–151.

[23] Pugh, W., and Wonnacott, D. Constraint-based array dependence analysis. ACM Trans. on Programming Languages and Systems 20, 3 (May 1998), 635–678.

[24] Rastello, F., and Dauxois, T. Efficient tiling for an ODE discrete integration program: Redundant tasks instead of trapezoidal shaped-tiles. In Proceedings of the 16th International Parallel and Distributed Processing Symposium (Washington, DC, USA, 2002), IPDPS '02, IEEE Computer Society, pp. 138–.

[25] Ripeanu, M., Iamnitchi, A., and Foster, I. T. Cactus application: Performance predictions in grid environments. In Proceedings of the 7th International Euro-Par Conference Manchester on Parallel Processing (London, UK, 2001), Euro-Par '01, Springer-Verlag, pp. 807–816.

[26] Sawdey, A., and O'Keefe, M. T. Program analysis of overlap area usage in self-similar parallel programs. In Proceedings of the 10th International Workshop on Languages and Compilers for Parallel Computing (London, UK, 1998), LCPC '97, Springer-Verlag, pp. 79–93.

[27] Song, Y., and Li, Z. New tiling techniques to improve cache temporal locality. ACM SIGPLAN Notices (PLDI) 34, 5 (May 1999), 215–228.

[28] Strout, M. M., Carter, L., and Ferrante, J. Compile-time composition of run-time data and iteration reorderings. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI) (New York, NY, USA, June 2003), ACM.

[29] Strout, M. M., Carter, L., Ferrante, J., Freeman, J., and Kreaseck, B. Combining performance aspects of irregular Gauss-Seidel via sparse tiling. In Proceedings of the 15th Workshop on Languages and Compilers for Parallel Computing (LCPC) (Berlin/Heidelberg, July 2002), Springer.

[30] Strout, M. M., Carter, L., Ferrante, J., and Kreaseck, B. Sparse tiling for stationary iterative methods. International Journal of High Performance Computing Applications 18, 1 (February 2004), 95–114.

[31] Strzodka, R., Shaheen, M., and Pajak, D. Time skewing made simple. In PPOPP (2011), C. Cascaval and P.-C. Yew, Eds., ACM, pp. 295–296.

[32] Strzodka, R., Shaheen, M., Pajak, D., and Seidel, H.-P. Cache oblivious parallelograms in iterative stencil computations. In ICS '10: Proceedings of the 24th ACM International Conference on Supercomputing (June 2010), ACM, pp. 49–59.

[33] Strzodka, R., Shaheen, M., Pajak, D., and Seidel, H.-P. Cache accurate time skewing in iterative stencil computations. In Proceedings of the 40th International Conference on Parallel Processing (ICPP) (Taipei, Taiwan, September 2011), IEEE Computer Society, pp. 517–581.

[34] Tang, Y., Chowdhury, R. A., Kuszmaul, B. C., Luk, C.-K., and Leiserson, C. E. The Pochoir stencil compiler. In Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures (New York, NY, USA, 2011), SPAA '11, ACM, pp. 117–128.

[35] Van der Wijngaart, R. F., Sarukkai, S. R., and Mehra, P. The effect of interrupts on software pipeline execution on message-passing architectures. In FCRC '96: Conference Proceedings of the 1996 International Conference on Supercomputing (Philadelphia, Pennsylvania, USA, May 25–28, 1996), ACM Press, pp. 189–196.

[36] Wolf, M. E., and Lam, M. S. A data locality optimizing algorithm. In Programming Language Design and Implementation (New York, NY, USA, 1991), ACM.

[37] Wolfe, M. J. More iteration space tiling. In Proceedings, Supercomputing '89 (Reno, Nevada, November 1989), ACM Press, pp. 655–664.

[38] Wonnacott, D. Using Time Skewing to eliminate idle time due to memory bandwidth and network limitations. In International Parallel and Distributed Processing Symposium (May 2000), IEEE.

[39] Wonnacott, D. Achieving scalable locality with Time Skewing. International Journal of Parallel Programming 30, 3 (June 2002), 181–221.

[40] Yuki, T., Basupalli, V., Gupta, G., Iooss, G., Kim, D., Pathan, T., Srinivasa, P., Zou, Y., and Rajopadhye, S. AlphaZ: A system for analysis, transformation, and code generation in the polyhedral equational model. Tech. Rep. CS-12-101, Colorado State University, 2012.

[41] Zhou, X., Giacalone, J.-P., Garzarán, M. J., Kuhn, R. H., Ni, Y., and Padua, D. Hierarchical overlapped tiling. In Proceedings of the Tenth International Symposium on Code Generation and Optimization (New York, NY, USA, 2012), CGO '12, ACM, pp. 207–218.


• Memory Allocations for Tiled Uniform Dependence Programs∗

Tomofumi Yuki
Colorado State University
Fort Collins, Colorado, U.S.A.
[email protected]

Sanjay Rajopadhye
Colorado State University
Fort Collins, Colorado, U.S.A.
[email protected]

ABSTRACT

In this paper, we develop a series of extensions to schedule-independent storage mapping using Quasi-Universal Occupancy Vectors (QUOVs) targeting tiled execution of polyhedral programs. By quasi-universality, we mean that we restrict the "universe" of the schedule to those that correspond to tiling. This provides the following benefits: (i) the shortest QUOVs may be shorter than the fully universal ones, (ii) the shortest QUOVs can be found without any search, and (iii) multi-statement programs can be handled. The resulting storage mapping is valid for tiled execution by any tile size.

1. INTRODUCTION

In this paper, we discuss storage mappings for tiled programs, especially for the case when tile sizes are not known at compile time. When the tile sizes are parameterized, most techniques for storage mappings [6, 13, 15, 20] cannot be used due to the non-affine nature of parameterized tiling. However, we cannot combine parametric tiling with memory re-allocation if we cannot find a legal allocation for all legal tile sizes.

One approach that can find storage mappings for parametrically tiled programs is the Schedule-Independent Storage Mapping proposed by Strout et al. [19]. For programs with uniform dependences, schedule-independent memory allocation finds storage mappings that are valid for any legal execution of the program, including tiling by any tile size. We present a series of extensions to schedule-independent mapping for finding legal and compact storage mappings for polyhedral programs with uniform dependences.

Schedule-Independent Storage Mapping is based on what are called Universal Occupancy Vectors (UOVs), which characterize when a value produced can safely be overwritten. As originally defined by Strout et al., UOVs are fully universal, where the resulting allocations are valid for any legal schedule. In this paper, we restrict the UOVs to smaller universes and exploit their properties to efficiently find good UOVs. In the remainder of this paper, we call such UOVs that are not fully universal Quasi-UOVs (QUOVs) to distinguish them from fully universal ones. Using QUOVs, we can find valid mappings for tiled execution by any tile size, but not necessarily valid for other legal schedules. This leads to more compact storage mappings for cases when fully universal allocation is an overkill.

∗This work was funded in part by the National Science Foundation, Award Numbers 1240991 and 0917319.

    The restriction on the universality leads to the following:

• QUOVs may be shorter than the shortest UOV. Since the universe of possible schedules is restricted, valid storage mappings may be more compact. We use Manhattan distance as the length of UOVs as a cost measure when we describe optimality of a projective memory allocation.

• The shortest QUOV for tiled loop programs can be analytically found, and the dynamic programming algorithm presented by Strout et al. is no longer necessary.

• Imperfectly nested loops can be handled. The original method assumed single statement programs (and hence perfectly nested loop programs). We extend the method by taking statement orderings, often expressed in the polyhedral model as constant dimensions, into account. This is possible because we focus on tiled execution, where tiling applies the same schedule (except for ordering dimensions) to all statements.

The input to our analysis is a program in polyhedral representation, which can be obtained from loop programs through array data-flow analysis [7, 14]. Our methods can be used for loop programs in single assignment form, an alternative view used in some of the prior work [19, 20].

In addition, we present a method for isolating boundary cases based on index-set splitting [9]. UOV-based allocation assumes that every dependence is active at all points in the iteration space. In practice, programs have boundary cases where certain dependences are only valid at iteration space boundaries. We take advantage of the properties of QUOVs we develop to guide the splitting.

2. BACKGROUND

In this section, we present the background necessary for this paper. We first introduce the terminology used, and then present an overview of Universal Occupancy Vectors [19].

2.1 Polyhedral Representations

Polyhedral representations of programs primarily consist of statement domains and dependences. Statement domains represent the set of iteration points where a statement is executed. In polyhedral programs, such sets are described by a finite union of polyhedra.

In this paper, we focus on programs with uniform dependences, where the producer and the consumer differ by a constant shift. We characterize these dependences using a vector, which we call the data-flow vector, drawn from the producer to the consumer.


For example, if a value produced by an iteration [i, j] is used by another iteration [i+1, j+2], the corresponding data-flow vector is [1, 2].

2.2 Schedules and Storage Mappings

The schedules are represented by affine mappings that map statement domains to a common dimensional space, where the lexicographic order denotes the order of execution. In the polyhedral literature, statement orderings are commonly expressed as constant dimensions in the schedule. Furthermore, one can represent arbitrary orderings of loops and statements by adding d + 1 additional dimensions [8], where d is the dimensionality of statement domains in the programs (note that for uniform dependence programs, all statement domains have the same number of dimensions). To make the presentation consistent, we assume such schedules are always used, resulting in a 2d + 1 dimensional schedule.

The storage mappings are often represented as a combination of affine mappings and dimension-wise modulo factors [13, 16, 19].

2.3 Tiling

Tiling is a well known loop transformation that was originally proposed as a locality optimization [12, 17, 18, 22]. It can also be used to extract coarser grained parallelism, by partitioning the iteration space into tiles (blocks) of computation, some of which may run in parallel [12, 17].

Legality of tiling is a well established concept defined over contiguous subsets of the schedule dimensions (in the range of the scheduling function, the scheduled space), also called bands [3]. These dimensions of the schedules are tilable, and are also known to be fully permutable [12].

The range spaces of the schedules given to the statements in a program all refer to the common space, and thus have the same number of dimensions. Among these dimensions, a dimension is tilable if no dependence is violated (i.e., no producer is scheduled after its consumer, although they may be scheduled to the same time stamp) by a one-dimensional schedule using only the dimension in question. Then any contiguous subset of such dimensions forms a legal tilable band.

We call a subset of dimensions in an iteration space tilable if the identity schedule is tilable for the corresponding subset. The iteration space is fully tilable if all dimensions are tilable.

2.4 Universal Occupancy Vectors

A Universal Occupancy Vector (UOV) [19] is a vector that denotes the "distance" after which a value may safely be overwritten, in the following sense. When an iteration z is executed, its value is stored in some memory location. Another iteration z′ can reuse this same memory location if all iterations that use the value produced by z have been executed. When a storage mapping is such that z and z′ are mapped to the same memory location, the difference of these points, z′ − z, is called the occupancy vector.

A Universal Occupancy Vector is a specialization of such vectors, where all iterations k that use z are guaranteed to be executed before z′ in any legal schedule. Since most machine models assume that, within a single assignment statement, reads happen before writes in a given time step, z′ may also use the value produced by z.

Additionally, we introduce a notion of scoping to UOVs. We call a vector v a UOV with respect to a set of dependences I if the vector v satisfies the necessary property to be a UOV for the subset of all points k that use the value produced by z through one of the dependences in I.

Once the UOV is computed, the mapping that corresponds to the projection of the statement domain along the UOV is a legal storage mapping. If the UOV crosses more than one integer point, then an array that corresponds to a single projection is not sufficient. Instead, multiple arrays are used in turn, implemented using modulo factors. The necessary modulo factor is the GCD of the elements of the UOV.

The trivial UOV, a valid but possibly suboptimal UOV, is computed as follows.

1. Construct the set of data-flow vectors corresponding to all the dependences that use the result of a statement.

2. Compute the sum of all data-flow vectors in the constructed set.

The above follows from a simple proposition shown below, and the observation that the data-flow vector of a dependence is a legal UOV with respect to that dependence.

    Proposition 1 (Sum of UOVs). Let u and v be, re-spectively, the UOVs for two sets of dependences U and V.Then u+ v is a legal UOV for U ∪ V.

    Proof. The value produced at z is dead when z + u canlegally be executed with respect to the dependences in U ,and similarly for V at z + v. Since there is a path from z toz + u + v by following the edges z + u and z + v (in eitherorder), the value produced at z is guaranteed to be used byall uses, z + u and z + v, when z + u + v can legally beexecuted.

The optimality of UOVs without any knowledge of the size or shape of the iteration space is captured by the length of the UOV. However, the length that should be compared is not the Euclidean length, but the Manhattan distance. We discuss the optimality of UOV-based allocations and other projective allocations in Section 7.

3. OVERVIEW OF OUR APPROACH

UOV-based allocation gives legal mappings even for schedules that cannot be implemented as loops. For example, even a run-time work-stealing scheduler can use UOV-based allocation. However, this is obviously overkill if we only consider schedules that can be implemented as loops.

The important change in perspective is that we are not interested in schedule-independent storage mappings, although the concept of UOV is used. We are only interested in using UOV-based allocation in conjunction with tiling. Thus, our allocation is partially schedule-dependent. The overview of our storage mapping strategy is as follows:

1. Extract polyhedral representations of programs (array expansion).

2. Perform scheduling and apply the schedules as transformations to the iteration space (this can be viewed as pre-processing to code generation [2]). After the transformation, lexicographic scan of the resulting iteration space respects the schedules. The resulting space should be (partially) tilable to take advantage of our approach.

3. Apply UOV-guided index-set splitting (Section 6). This step attempts to isolate boundaries of statement domains that negatively influence storage mappings.

4. Apply QUOV-based allocation (Section 4). Our proposed storage mapping, based on extensions to UOVs, is applied to each statement after the splitting. Although inter-statement sharing of arrays may be possible, such optimization is beyond the scope of this paper.

The order of presentation does not follow the above for two reasons. One is that the UOV-guided splitting is an optional step that can further optimize memory usage. In addition, splitting introduces multiple statements into the program, and requires our extension for handling multiple statements, presented in Section 5.

4. QUOV-BASED ALLOCATION FOR TILED PROGRAMS

In this section, we present a series of formalisms to analytically find the shortest QUOV. We first develop a lemma that can eliminate dependences while constructing UOVs. We then apply the lemma to find the shortest QUOVs in different contexts.

4.1 Relevant Set of Dependences for UOV Construction

The trivial UOV, which also serves as the starting point for finding the optimal UOV, is found by taking the sum of all dependences. However, this formulation may lead to significantly inefficient starting points. For example, if two dependences with data-flow vectors [1, 0] and [2, 0] exist, the former dependence may be ignored during UOV construction, since a legal UOV-based allocation using only the latter dependence is also guaranteed to be legal for the former dependence.

We may refine both the construction of the trivial UOV and the optimality algorithm by reducing the set of dependences considered during UOV construction. The optimality algorithm presented by Strout et al. [19] searches a space bounded by the length of the trivial UOV using dynamic programming. Therefore, reducing the number of dependences to consider will improve both the trivial UOV and the dynamic programming algorithm.

The main intuition is that if a dependence can be transitively expressed by another set of dependences, then it is the only dependence that needs to be considered. This is formalized in the following lemma.

Lemma 1 (Dependence Subsumption). If a dependence f can be expressed as a composition of dependences in a set G, where all dependences in G are used at least once in the composition, then a legal UOV with respect to f is also a legal UOV with respect to all elements of G.

Proof. Given a legal UOV with respect to a single dependence f, the value produced at z is preserved at least until z′, defined by f(z′) = z, can be executed. Let the set of dependences in G be denoted as gx, 1 ≤ x ≤ |G|. Since composition of uniform functions is associative and commutative, there is always a function g∗, obtained by composing dependences in G, such that f = g∗ ◦ gx for each x. Thus, all points z′′ with gx(z′′) = z are executed before z′ for all x. Therefore, a legal UOV with respect to f is guaranteed to preserve the value produced at z until all points that directly depend on z by a dependence in the set G have been executed.

Finding such a composition can be implemented as an integer linear programming problem. The problem may also be viewed as determining whether a set of vectors is linearly dependent when restricted to positive combinations. The union of all such sets G found in the initial set of dependences, called the subsumed dependences, can be ignored when constructing the UOV.
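For uniform dependences, composition amounts to adding data-flow vectors, so the subsumption test of Lemma 1 reduces to finding positive integer coefficients. The following brute-force sketch (not from the paper; a real implementation would use an ILP solver, and max_coeff is an arbitrary search bound introduced only for illustration) checks whether a vector f is such a combination of the vectors in G:

    from itertools import product

    def is_subsumed_by(f, G, max_coeff=4):
        # Can f be written as a sum of the vectors in G in which every
        # vector of G appears at least once (positive integer coefficients)?
        if not G:
            return False
        dims = range(len(f))
        for coeffs in product(range(1, max_coeff + 1), repeat=len(G)):
            if all(sum(c * g[d] for c, g in zip(coeffs, G)) == f[d] for d in dims):
                return True
        return False

In the earlier example, is_subsumed_by([2, 0], [[1, 0]]) holds, so the dependence with data-flow vector [1, 0] can be ignored during UOV construction while [2, 0] is kept.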

Applying Lemma 1 may significantly reduce the number of dependences to be considered. However, the trivial UOV of the remaining dependences may still not be the shortest UOV. For example, consider the data-flow vectors [1, 1], [1, −1], and [1, 0]. Although the vectors are independent under positive combinations, the trivial UOV [3, 0] is clearly longer than another UOV, [2, 0]. Further reducing the set of dependences to consider requires a variation of Lemma 1 that allows f also to be a composition of dependences. This leads to complex operations, and the dynamic programming algorithm for finding the optimal UOV by Strout et al. [19] may be a better alternative. Instead of finding the shortest UOV in the general case, we show that such a UOV can be found very efficiently for a specific context, namely tiling.

4.2 Finding the Shortest QUOV for Tiled Programs

When UOV-based allocation is used in the specific context of tiling, the shortest QUOV can be found analytically. If we know that the program is to be tiled, we can add dummy dependences to restrict the universality of the storage mapping, while maintaining tilability. In addition, we may assume that the dependences are all non-positive (for the tilable dimensions) as a result of a pre-scheduling step to ensure the legality of tiling. For the remainder of this section, the "universe" of UOVs is one of the following restricted universes: fully tilable, fully sequential, and mixed sequential and tilable.

Note that the expectation is that "tilable" iteration spaces are tiled in a later phase. We analyze iteration spaces that will be tiled using QUOVs, and then apply tiling. We also assume that the iteration points are scanned in lexicographic order within a tile.

Theorem 1 (Shortest QUOV, Fully Tilable). Given a set of dependences I in a fully tilable space, the shortest QUOV u for tiled execution is the element-wise maxima of the data-flow vectors of all dependences in I.

Proof. Let the element-wise maxima of all data-flow vectors be the vector m, and let fm be a dependence with data-flow vector m. For unit vectors ud in each of the d dimensions, we introduce dummy dependences fd with data-flow vector ud. Because these dependences have non-negative components, the resulting space is still tilable. For all dependences in I there exists a sequence of compositions with the dummy dependences that transitively expresses fm. Using Lemma 1, the only dependence to be considered in UOV construction can therefore be reduced to fm, which has the trivial UOV m.

It remains to show that no QUOV shorter than m exists for the set of dependences I.


(Figure 1 plot: i–j axes showing the data-flow vectors, the dummy vectors, the element-wise maxima with its bounds, and the points at Manhattan distance 4.)

Figure 1: Illustration of Theorem 1 for the set of dependences with data-flow vectors [2, 0], [2, 1], [1, 2], and [0, 2]. The element-wise maxima of the data-flow vectors correspond to the shortest UOV. The value produced by the bottom-left iteration is used by the destinations of the data-flow vectors. The data-flows induced by the dummy dependences guarantee that the iteration pointed to by the element-wise maxima is executed only after all iterations that depend on the bottom left. None of the other iterations with the same Manhattan distance (4) can be reached, since that would require backward data-flow along at least one of the axes.

The shortest QUOV is defined by the closest3 point from z that can be reached from all uses of z by following the dependences. Since the choice of z does not matter, let us use the origin to simplify our presentation. This allows us to use the data-flow vectors interchangeably with coordinate vectors.

Then, the hyper-rectangle with diagonal m includes all of I, and all bounds of the hyper-rectangle are touched by at least one dependence. Since all dependences in a fully tilable space are restricted to have non-negative data-flow vectors, no point within the hyper-rectangle can be reached by following the dependences. Thus, it is clear that m is the closest common point that can be reached from those that touch the bounds.

The theorem is illustrated in Figure 1, and is contrasted with the trivial UOV used by Strout et al. [19] in Figure 2.
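As a quick check of Theorem 1 in code (an illustrative sketch, not the paper's implementation), the shortest QUOV of a fully tilable space is simply the element-wise maxima of the data-flow vectors:

    def shortest_quov_fully_tilable(flows):
        # Theorem 1: element-wise maxima of the data-flow vectors.
        return [max(v[d] for v in flows) for d in range(len(flows[0]))]

For the dependences of Figure 1, [2, 0], [2, 1], [1, 2], and [0, 2], this returns [2, 2], whose Manhattan length is 4.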

The basic idea of inserting dummy dependences to restrict the possible schedules can be used beyond tilable schedules. One important corollary for sequential execution is the following.

Corollary 1 (Sequential Execution). Given a set of dependences I in an n-dimensional space where the lexicographic scan of the space is a legal schedule. Let m be the lexicographic maximum of the data-flow vectors of all dependences in I. Then the shortest QUOV u for lexicographic execution is either m or the vector [m1 + 1, 0, · · · , 0], where m1 is the first element of m.

Proof. For sequential execution, we may introduce dummy dependences to any lexicographically preceding point. Then the dependence with a data-flow vector whose first element is m1 can subsume other dependences with lower values in the first element, according to Lemma 1, by introducing appropriate dummy dependences.

3 Shortest and closest are both in terms of Manhattan distance.

(Figure 2 plot: i–j axes showing the data-flow vectors, the trivial UOV, the shortest UOV/QUOV, and the dynamic programming search space.)

Figure 2: Comparison against the trivial UOV computed as proposed by Strout et al. [19] for the same set of dependences as in Figure 1. The trivial UOV is [5, 5], and it becomes the radius bounding the search space of the dynamic programming algorithm proposed by Strout et al. [19]. In contrast, Theorem 1 gives the shortest QUOV by simply computing the element-wise maxima of the data-flow vectors. Furthermore, the shortest fully universal UOV is twice as long as the shortest QUOV, since the unit-length dummy dependences cannot be assumed. The search space is bounded by a sphere since Euclidean distance is used by Strout et al. [19], but can be adapted to Manhattan distance.

    For the remaining dependences there are two possibilities:

• We may use dummy dependences of the form [1, ∗, · · · , ∗] to let a dependence with data-flow vector [m1 + 1, 0, · · · , 0] subsume all remaining dependences.

    • We may use m as the QUOV following Theorem 1.

It is obvious that when the former option is used, [m1 + 1, 0, · · · , 0] is the shortest. The optimality of the latter case follows from Theorem 1. Thus, the shortest UOV is the shorter of these two options.

Note that m can be shorter than [m1 + 1, 0, · · · , 0] only when m = [m1, 0, · · · , 0]. In addition, although the above corollary can be applied to tilable iteration spaces, the tilability may be lost due to memory-based dependences introduced by the allocation.

The allocation given by the above corollary is not schedule-independent at all. It is an analytical solution to the storage mapping of uniform dependence programs, where the schedule is the lexicographic scan of the iteration space.
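A direct transcription of Corollary 1 (again an illustrative sketch, assuming plain lists of integers as data-flow vectors) picks between the two candidate vectors, using the note above that m wins only when all of its trailing elements are zero:

    def shortest_quov_sequential(flows):
        # Corollary 1: the shortest QUOV for lexicographic execution is either
        # the lexicographic maximum m of the data-flow vectors, or
        # [m1 + 1, 0, ..., 0]; m is shorter only when its tail is all zeros.
        m = list(max(tuple(v) for v in flows))   # lexicographic maximum
        if all(x == 0 for x in m[1:]):
            return m
        return [m[0] + 1] + [0] * (len(m) - 1)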


for (i=0:N)
  S1[i] = foo();
for (j=0:N)
  S2[j] = bar(S1[j]);

(a) When θS1 = (i → 0, i, 0) and θS2 = (j → 1, j, 0).

for (i=0:N)
  S1[i] = foo();
  S2[i] = bar(S1[i]);

(b) When θS1 = (i → 0, i, 0) and θS2 = (j → 0, j, 1).

Figure 3: Two possible schedules for statements S1 and S2. Note that statement S1 in Figure 3a requires O(N) memory, whereas it only requires a scalar in Figure 3b, although the code shown is still in single assignment.

The following corollary can trivially be established by combining the above.

Corollary 2 (Sequence of Tilable Spaces). Given a set of dependences I in a space where a subset of the dimensions is tilable, and lexicographic scan is legal for the other dimensions, the shortest QUOV u for tiled execution of the tilable subspace and sequential execution of the rest is the combination of the vectors computed for each contiguous subset of either sequential or tilable dimensions.

Note that the above corollary only takes effect when there are sequential subsets with at least two contiguous dimensions. When a single sequential dimension is surrounded by tilable dimensions, its element-wise maxima and lexicographic maxima are equivalent.

Using the above, the shortest QUOV for sequential, tiled, or hybrid execution can be computed very efficiently.
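One possible reading of Corollary 2 in code (a sketch under the assumption that the dimensions have already been partitioned into contiguous bands, reusing the two helpers sketched earlier) concatenates the per-band sub-vectors:

    def shortest_quov_mixed(flows, bands):
        # bands: list of (kind, dims) pairs, kind in {'tilable', 'sequential'},
        # dims a list of contiguous dimension indices, outermost band first.
        quov = []
        for kind, dims in bands:
            sub = [[v[d] for d in dims] for v in flows]
            if kind == 'tilable':
                quov += shortest_quov_fully_tilable(sub)
            else:
                quov += shortest_quov_sequential(sub)
        return quov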

5. HANDLING OF PROGRAMS WITH MULTIPLE STATEMENTS

In many programs, there are multiple statements depending on each other. The original formulation of UOVs is for single-statement programs [19]. In this section, we show that the concept can be adapted to multi-statement programs, given that we restrict the universe to tiled execution of the iteration space.

5.1 Limitations of UOV-based Allocation

Allocations based on UOVs have the strong property that they are valid for any legal schedule. Here, the schedule is not limited to affine schedules in the polyhedral model; time stamps may be assigned to each operation arbitrarily, as long as they respect the dependences. The concept of UOV applies to reuse among writes to a single common space, and relies on the fact that every iteration writes to the same space (or array). In programs with multiple statements, different statements may write to different arrays, and hence UOVs cannot be used directly.

For example, consider a program with two statements S1 and S2:

• DS1 = {i | 0 ≤ i ≤ N}
• DS2 = {j | 0 ≤ j ≤ N}

where the dependence is such that iteration x of S1 must be executed before iteration x of S2.

Figure 3 illustrates two possible schedules and their implications on memory usage. Note that we do not discuss the storage mapping for S2, since its result is not used within the code fragment above. A fully universal UOV-based allocation would have to account for such variations of the schedule, but such an extension may not even make sense: when two statements are scheduled differently, the dependence between the two statements in the scheduled space may no longer be uniform.

When we apply the concept of UOV for tiling, we are no longer interested in arbitrary schedules. We first apply all non-tiling scheduling decisions before performing storage mapping. Therefore, the only change in the execution order comes from a tiling transformation viewed as a post-processing step, so the same "schedule" is applied to all statements involved. This allows us to extend the concept of UOVs to imperfectly nested loops.

5.2 Handling of Statement Ordering

When the ordering dimensions are represented as constant dimensions, the elements of the UOV require special handling. In the original formulation there is only one statement, and thus every iteration point writes to the same array. When multiple statements exist in a program, the iteration space of each statement is a subset of the combined space, and these subsets are made disjoint by the statement-ordering dimensions. Thus, not all points in the common space correspond to a write, and this affects how the UOV is interpreted.

Consider the program in Figure 3b. The only dependence involving S1 has data-flow [0, 0, 1], and since it is the only dependence, it is the shortest UOV. A literal interpretation of this vector as a UOV means that the iteration z + [0, 0, 1] can safely overwrite the value produced by z. However, this interpretation does not make sense in the context of by-statement allocation, since statement S2 at z + [0, 0, 1] writes to a different array.

We handle multi-statement programs by removing d out of the d + 1 constant dimensions for the purpose of UOV-based analysis. We remove constant dimensions by transferring the information carried by these dimensions to others. Let v be the data-flow vector of a dependence in the scheduled space. We first apply the following rule to the vector v:

• For each constant dimension x > 0 where vx > 0, set vx−1 = max(vx−1, 1).

Once the above rule is applied, all constant dimensions except for the first one are removed to obtain a d + 1 dimensional data-flow vector. We apply the above to all dependences in the program, resulting in a set of data-flow vectors with d + 1 dimensions.
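As a concrete reading of this rule (an illustrative sketch, assuming the data-flow vector is a plain list and the indices of the constant dimensions are known, outermost first):

    def drop_constant_dims(v, const_dims):
        # Transfer a positive flow on an inner constant dimension to the
        # immediately surrounding loop dimension, then drop all constant
        # dimensions except the outermost one.
        v = list(v)
        for x in const_dims:
            if x > 0 and v[x] > 0:
                v[x - 1] = max(v[x - 1], 1)
        drop = set(const_dims[1:])
        return [v[i] for i in range(len(v)) if i not in drop]

For Figure 3b, drop_constant_dims([0, 0, 1], [0, 2]) gives [0, 1]; for Figure 3a, drop_constant_dims([1, 0, 0], [0, 2]) gives [1, 0], matching the discussion below.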

We justify the above rule in the following. When the constant dimension of the data-flow is greater than 0, it means that some textually later statement uses the produced value. With respect to this particular dependence, the memory location may be reused once this textually later statement has been executed. However, there is always exactly one iteration of a specific statement in a constant dimension, since it is an artificial dimension for statement ordering. Therefore, the earliest possible iteration that can overwrite the memory location in question is the next iteration of the immediately surrounding loop dimension.

For example, the only dependence of S1 in Figure 3b is [0, 0, 1]. The value produced by S1 at i is used by S2 at i, but only overwritten by another iteration of S1 at i + 1. Therefore, we transfer the dependence information to a preceding dimension.


for (t=0:T)
  for (i=0:N)
    A[i] = foo(A[i]);         //S1
  for (i=1:N)
    A[i] = bar(A[i-1], A[i]); //S2

Figure 4: Example to illustrate the influence of boundary cases. Note that the value produced by S1 is last used by S2 of the same outer-loop iteration, except when i = 0, in which case it is used again by S1 at [t + 1, 0].

Projecting a d + 1 dimensional domain along a vector does indeed produce a domain with d dimensions; this is appropriate because, in the presence of imperfect nests, d-dimensional storage may be required for d-dimensional statements even with uniform dependences. Consider the code in Figure 3a. The only use of S1 is by S2 with data-flow [1, 0, 0] in the scheduled space. Applying the rule described above removes the last dimension, yielding [1, 0] as the QUOV. Projecting the statement domain of S1 (in the scheduled space with d constant dimensions removed) along the vector [1, 0] gives a two-dimensional domain: {i, x | 0 ≤ i ≤ N ∧ x = 0}. Although the domain is two-dimensional, it is effectively one-dimensional because of the equality in the constant dimension.

It is also important to note the special case when the UOV takes the form [0, · · · , 0, 1]. When the last constant dimension is the only non-zero entry, it is obvious that the statement requires only a scalar, since its value is immediately consumed.

6. UOV-GUIDED INDEX SET SPLITTING

In the polyhedral representation of programs there are usually boundary cases that behave differently from the rest. For instance, the first iteration of a loop may read from inputs, whereas successive iterations use values computed by previous iterations.

In the polyhedral model, the storage mapping is usually computed for each statement. With pseudo-projective allocations, the same allocation must be used for all points in the statement domain. Thus, dependences that only exist at the boundaries influence the entire allocation.

For example, consider the code in Figure 4. The value produced by S1 at [t, i] is last used by S2 at [t, i + 1] for i > 0. However, the value produced by S1 at [t, 0] is last used by S1 at [t + 1, 0]. Thus, the storage mapping for S1 must ensure that a value produced at [t, i] is live until [t + 1, i] for all instances of S1. This clearly leads to wasteful allocations, and our goal is to avoid them.

One solution to the problem is to apply a form of index set splitting [9] such that the boundary cases and the common cases have different mappings. In the example above, we wish to use piece-wise mappings for S1 at [t, i] where the mapping is different for the two disjoint sets i = 0 and i > 0. This reduces the memory usage from an array of size N + 1 to 2 scalars.

Once the pieces are computed, piece-wise mappings can be applied by splitting the statements (nodes) as defined by the pieces, and then applying a separate mapping to each of the statements after the split. Thus, the only interesting problem that remains is finding meaningful splits.

In this section, we present an algorithm for finding the split with the goal of minimizing memory usage. The algorithm is guided by Universal Occupancy Vectors and works best with UOV-based allocations. The goal of our index set splitting is to isolate boundaries that require longer lifetimes than the main body. Thus, we are interested in sub-domains of a statement with a dependence pattern different from that in the rest of the statement's domain. We focus on boundary domains that contain at least one equality. The approach may be generalized to boundary planes of constant thickness using Thick Face Lattices [11].

The original index-set splitting [9] aimed at finding better schedules. The quality of a split is measured by its influence on possible schedules: whether different scheduling functions for each piece yield a better schedule.

In our case, the goal is different. Our starting point is the program after affine scheduling, and we are now interested in finding storage mappings. When the dependence pattern is the same at all points in the domain, splitting cannot improve the quality of the storage mapping. Since the dependence pattern is the same, the same storage mapping will be used for all pieces (with the possible exception of cases where the split introduces equalities or other properties related to the shape of the domains). Because two points that may have been in the nullspace of the projection may now be split into different pieces, the number of points that can share the same memory location may be reduced as a result of splitting.

Thus, as a general measure of quality, we seek to ensure that a split influences the choice of storage mapping for each piece. The obvious case where splitting is useless is when a dependence function at a boundary is also present in the main part. We present Algorithm 1, based on this intuition, to reduce the number of splits.

The intuition of the algorithm is that we start with all dependences with equalities in their domain as candidate pieces. Then we remove from the candidate pieces those dependences for which splitting does not improve the allocation. The obvious case is when the same dependence function also exists in the non-boundary cases (i.e., dependences with no equalities in their domain). In addition, more sophisticated exclusion is performed using Theorem 1.

The algorithm can also be easily adapted for non-uniform programs. It may also be specialized/generalized by adding more dependence elimination rules to Step 2. This requires a method similar to Lemma 1 for other memory allocation methods.

Example

Let us describe the algorithm in more detail with an example. The statement S1 in Figure 4 has three data-flow dependences:

    • I1 = S1[t, i]→ S2[t, i] when 0 ≤ i ≤ N

    • I2 = S1[t, i]→ S2[t, i+ 1] when i < N

    • I3 = S1[t, i]→ S1[t+ 1, i] when i = 0

The need for index-set splitting does not arise until some prior scheduling fuses the two inner loops. Let the scheduling functions be:

    • θS1 = (t, i→ 0, t, 0, i, 0)

    • θS2 = (t, i→ 0, t, 0, i, 1)


Algorithm 1 UOV-Guided Split

Input:
I: Set of uniform dependences that depend on a statement S. A dependence is a pair 〈f, D〉 where f is the dependence function and D is a domain. The domain is the constraints on the producer statement.

Output:
P: A partition of DS, the domain of S, where each element defines a piece of the split.

Algorithm:
We first inspect the domains of the dependences in I to detect equalities. Let
  Ib be the set of dependences with equalities, and
  Im be the set of those without equalities.
Then,

1. foreach 〈f, D〉 ∈ Ib: if ∃〈g, E〉 ∈ Im such that f = g, then remove 〈f, D〉 from Ib.

2. Further remove dependences from Ib using the following, if applicable:
   (a) Theorem 1 and its corollaries. The following steps are only for Theorem 1.
       Let m be the element-wise maxima of the data-flow vectors in Im.
       foreach 〈f, D〉 ∈ Ib:
         – Let v be the data-flow vector of f.
         – if ∀i : vi ≤ mi, then remove 〈f, D〉 from Ib.

3. Group the remaining dependences in Ib into groups Gi, 0 ≤ i ≤ n, where ∀X, Y ∈ Gi : DX ∩ DY ≠ ∅. In other words, group the dependences with overlapping domains in the producer space.

4. foreach i, 0 ≤ i ≤ n: Pi = ⋃_{X ∈ Gi} DX.

5. if n ≥ 0 then Pn+1 = DS \ ⋃_{i=0}^{n} Pi, else P0 = DS.
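A compact sketch of these steps follows (purely illustrative, not the authors' implementation; each dependence is assumed to be a dict with a data-flow vector 'v', a producer domain 'D' given as an opaque polyhedral set supporting &, |, − and is_empty(), and a flag 'eq' marking an equality in its domain):

    def uov_guided_split(deps, domain_S):
        Ib = [d for d in deps if d['eq']]       # boundary dependences
        Im = [d for d in deps if not d['eq']]   # main-body dependences

        # Step 1: drop boundary dependences whose function also occurs in the body.
        Ib = [d for d in Ib if all(d['v'] != g['v'] for g in Im)]

        # Step 2a: drop dependences that do not extend the element-wise maxima
        # of the main-body data-flow vectors (Theorem 1).
        if Im:
            m = [max(g['v'][i] for g in Im) for i in range(len(Im[0]['v']))]
            Ib = [d for d in Ib if any(vi > mi for vi, mi in zip(d['v'], m))]

        # Step 3: group boundary dependences with overlapping producer domains
        # (greedy; a full implementation would merge transitively overlapping groups).
        groups = []
        for d in Ib:
            for grp in groups:
                if any(not (d['D'] & x['D']).is_empty() for x in grp):
                    grp.append(d)
                    break
            else:
                groups.append([d])

        # Steps 4 and 5: one piece per group, plus the remainder of the domain.
        pieces, covered = [], None
        for grp in groups:
            piece = grp[0]['D']
            for x in grp[1:]:
                piece = piece | x['D']
            pieces.append(piece)
            covered = piece if covered is None else covered | piece
        pieces.append(domain_S - covered if covered is not None else domain_S)
        return pieces

On the running example (continued below), only I′3 would survive Steps 1 and 2a, since its vector [1, 0] exceeds the body maxima [0, 1] in the first component, and the resulting pieces would be {i = 0} and the rest of the domain.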

The data-flow vectors in the scheduled space, after removing constant dimensions (we also remove the outermost one, since there is only one loop at the outermost level), are:

    • I′1 = [0, 1] when 0 ≤ i ≤ N

    • I′2 = [0, 1] when i < N

    • I′3 = [1, 0] when i = 0

As a pre-processing step, we separate the dependences into two sets based on the equalities in the domain:

    • Ib = {I′3}; those with equalities, and

• Im = {I′1, I′2}; those without equalities.

Then we use Step 1 to eliminate identical dependences. If the same dependence function is present both at the boundary and in the main body, separating the boundary does not reduce the number of distinct dependences to be considered. Therefore, splitting such a domain does not positively influence the storage mapping, and the dependence is removed from further consideration. Since I′3 is different from the other two, this step does not change the set of dependences for this example.

Step 2a is the core of this algorithm. The general idea is the same as Step 1, but we use additional properties of UOV-based allocation to identify more cases where splitting does not positively influence the storage mapping.

Theorem 1 states that if all elements of a data-flow vector are less than the corresponding elements of the element-wise maxima of all data-flow vectors under consideration, the dependence does not influence the shortest QUOV. Therefore, if the candidate dependence to split does not contribute to the element-wise maxima, the split is useless in terms of further shortening the QUOV. As an illustration, consider the rectangle in Figure 1 defined by the element-wise maxima. If separating a dependence does not shrink the size of the rectangle, the length of the QUOV cannot be shortened.

In this example, the element-wise maxima of the dependences in Im is [0, 1]. However, the boundary dependence has data-flow vector [1, 0], and when combined, the element-wise maxima becomes [1, 1]. Therefore, the boundary dependence does contribute to the element-wise maxima, and is not removed from the candidate set in Step 2a. The dependences that are left in the set Ib after this step are the dependences that will be split from the main body.

Step 3 is a grouping step, where the dependences to be split are grouped into those that have overlapping domains.


Two overlapping sub-domains of a statement cannot be split into separate statements unless the computation is duplicated. Although computing the values redundantly may be an option, we instead require the two to be split off jointly as one additional statement. This step is irrelevant for our example, since it has only one dependence in the set Ib.

The last step is a cleanup step, which adds the remainder of the statement domain as the final piece of the partition.

Let S1b be the statement obtained by splitting the domain of I3 from S1, and S1a be the remainder, the main body of S1. The domains of the statements in the program are now:

• DS1a = {t, i | 0 ≤ t ≤ T ∧ 1 ≤ i ≤ N}
• DS1b = {t, i | 0 ≤ t ≤ T ∧ i = 0}
• DS2 = {t, i | 0 ≤ t ≤ T ∧ 1 ≤ i ≤ N}

    and the dependences are:

• I∗1 = S1a[t, i] → S2[t, i] when 1 ≤ i ≤ N
• I∗2 = S1a[t, i] → S2[t, i + 1] when i < N
• I∗3 = S1b[t, i] → S1a[t + 1, i] when i = 0

The QUOV for S1a is [0, 0, 0, 0, 1] in the scheduled space with constant dimensions, which is the aforementioned special case, and only a scalar is required for S1a. The QUOV for S1b is [1, 0] after removing the constant dimensions. The projection of its domain along this vector is also a constant, due to the equality in its domain. Thus, the storage requirement for S1 in the original program becomes two scalars, which is much smaller than what is required without the splitting.

7. RELATED WORK

There is a lot of prior work on storage mappings for polyhedral programs [1, 4, 5, 13, 15, 19, 20, 21]. Most approaches focus on the case when the schedules for the statements are given.

7.1 Efficiency of UOV-based Allocation

By the nature of its strategy, UOV-based allocation, including allocation using QUOVs, cannot yield more compact storage mappings than alternative strategies tuned for a specific schedule. However, UOV-based allocation may not be as inefficient as one might think for programs that require d − 1 dimensional storage. The misconception is (at least partially) due to the often-overlooked trade-off between memory usage and parallelism. Consider the following code fragment with Smith-Waterman(-like) dependences.

for (i=1:N)
  for (j=1:M)
    H[i,j] = foo(H[i-1,j], H[i,j-1]);

As illustrated in Figure 5, UOV-based allocation, even with QUOVs, gives a storage mapping that uses O(N + M) memory for this program. However, the program cannot be tiled if O(N) or O(M) storage mappings are used, due to memory-based dependences. One approach to still accomplish tiled execution of this program is to transform the program such that the iteration space is tilable even when the memory-based dependences are taken into consideration.

For the above program, this requires a skewing as depicted in Figure 6.

(Figure 5 plots: (a) Iteration space and possible storage mappings; (b) Set of live values in tiled execution.)

Figure 5: Iteration space of Smith-Waterman(-like) code and possible storage mappings. Figure 5a shows two possible storage mappings, either horizontal or vertical projection of the iteration space. The size of memory is O(M) or O(N), depending on the direction of the projection. Figure 5b shows the iteration space after tiling, and the values that are live after executing two tiles on the diagonal. Observe that neither projection (horizontal or vertical) can store all the live values. One example of a valid projective storage mapping is the projection along [1, 1], which uses O(N + M) memory.

Once the skewing is applied so that the O(M) storage mapping can be tiled, the UOV-based allocation for the same program will also yield a storage mapping with O(M) memory. Note that the skewed iteration space has less parallelism (a longer critical path) compared to the original rectangular iteration space. Parallelism is effectively traded off for decreased memory usage.

    For this exam