
Automatic Parallelization for Multicore Architectures

P. Sadayappan
Department of Computer Science & Engineering, The Ohio State University

Joint work with Uday Bondhugula, Muthu Baskaran, Sriram Krishnamoorthy, Atanas Rountev, and J. Ramanujam

NHC 07, HiPC 07, Goa, India, December 18, 2007. Supported in part by NSF.


Outline

1 Introduction: emerging parallel architectures and automatic parallelization
2 Polyhedral techniques for program optimization
3 Integrating tiling into the polyhedral framework
4 Implementation
5 Experimental results
6 Related and Future work


New parallel architectures and accelerators

On-chip parallel architectures and accelerators are emerging:
General purpose: multi-core microprocessors
Specialized: GPGPUs, Cell, FPGAs, MPSoCs

Programming challenges:
Parallelization (large number of processing elements)
Locality optimization (limited external memory bandwidth)
Scratchpad memory minimization (fast software-controlled caches)


Polyhedral techniques for program optimization

Polyhedral optimization

1 Dependence analysis (exact affine dependences) [Feautrier91, Pugh92, Vasilache06ICS]
2 Automatic transformations [Feautrier92, Lim/Lam97, Griebl04]
3 Code generation from specified transforms [Omega90s, Quillere00, Bastoul04, CLooG, Vasilache06CC]

There have been significant recent advances in the first and last steps.


Background: the polyhedral/polytope model

A powerful model for optimizing loops with regular accesses.

[Figure: (a) a hyperplane h.x = k, which bounds the half-spaces h.x >= k and h.x <= k; (b) a polytope, an intersection of finitely many such half-spaces. The iteration space of a loop nest is represented as a polytope.]
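As a concrete instance (an illustrative example, not one from the slides), the triangular nest do i = 1, N / do j = 1, i has the iteration-space polytope

$$D = \{(i, j) \mid 1 \le i \le N,\ 1 \le j \le i\} = \{\vec x \mid i - 1 \ge 0,\ N - i \ge 0,\ j - 1 \ge 0,\ i - j \ge 0\}$$

where each inequality defines a half-space whose bounding hyperplane contributes a face of the polytope.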


Affine transformations in the polytope model

A one-dimensional affine transform for statement $S_k$ is defined by

$$\phi_{S_k}(\vec i\,) = \begin{pmatrix} c_1 & c_2 & \dots & c_{m_{S_k}} \end{pmatrix}\,\vec i + c_0 = h_{S_k}\,\vec i + c_0$$

where $h_{S_k} = [c_1, c_2, \dots, c_{m_{S_k}}]$ with $c_1, c_2, \dots, c_{m_{S_k}} \in \mathbb{Z}$.
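For example (illustrative, for a 2-d statement with iteration vector $(i, j)^T$):

$$\phi(i, j) = \begin{pmatrix}1 & 0\end{pmatrix}\begin{pmatrix}i \\ j\end{pmatrix} + 0 = i \quad\text{(the original outer loop)}, \qquad \phi(i, j) = \begin{pmatrix}1 & 1\end{pmatrix}\begin{pmatrix}i \\ j\end{pmatrix} = i + j \quad\text{(a skewing hyperplane)}$$

Each $\phi$ assigns every iteration to a hyperplane; the integer points of one hyperplane become one iteration of the transformed loop.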


Affine transformations

A transformation for each statement: $\theta_S(\vec i\,) = T_S\,\vec i + \vec b_S$, i.e.,

$$\begin{pmatrix} i_1' \\ i_2' \\ \vdots \\ i_n' \end{pmatrix} =
\begin{pmatrix} c_{11} & c_{12} & \dots & c_{1n} \\ c_{21} & c_{22} & \dots & c_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ c_{n1} & c_{n2} & \dots & c_{nn} \end{pmatrix}
\begin{pmatrix} i_1 \\ i_2 \\ \vdots \\ i_n \end{pmatrix} +
\begin{pmatrix} c_{01} \\ c_{02} \\ \vdots \\ c_{0n} \end{pmatrix}$$

A transform with full column rank is a one-to-one mapping.

Each 1-d transform ($\phi$) can later be marked as a space loop or a time (sequential) loop, or a band of them can be tiled.

Problem: how do you find good transformations, optimized for parallelism and locality?
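Two familiar transformations written in this form (an illustration, not from the slides): loop interchange and loop skewing of a 2-d nest are

$$T_{\mathrm{interchange}} = \begin{pmatrix}0 & 1 \\ 1 & 0\end{pmatrix}, \qquad T_{\mathrm{skew}} = \begin{pmatrix}1 & 0 \\ 1 & 1\end{pmatrix}, \qquad \vec b_S = \vec 0$$

Both have full column rank, so every source iteration maps to exactly one target iteration.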


Per-statement affine transformations

do i = 1, n
  do j = 1, n
(S1) C[i,j] = 0;
  end do
end do

do i = 1, n
  do j = 1, n
    do k = 1, n
(S2)  C[i,j] = C[i,j] + A[i,k] * B[k,j]
    end do
  end do
end do

do i = 1, n
  do j = 1, n
    do k = 1, n
(S3)  D[i,j] = D[i,j] + E[i,k] * C[k,j]
    end do
  end do
end do

Statement-wise transform coefficients:

      S1             S2                S3
      i  j  const    i  j  k  const    i  j  k  const
c1    1  0  0        1  0  0  0        0  0  1  0
c2    0  1  0        0  1  0  0        0  1  0  0
c3    0  0  0        0  0  0  1        0  0  0  1
c4    0  0  0        0  0  0  0        0  0  0  1
c5    0  0  0        0  0  1  0        1  0  0  0

(Reading the table: row c1 of S3 selects k, so S3 runs with its k loop as the outermost dimension c1; the constant rows c3 and c4 order the statements at each level.)

Transformed (fused) code:

do c1 = 1, n
  do c2 = 1, n
    C[c1,c2] = 0;
    do c5 = 1, n
      C[c1,c2] = C[c1,c2] + A[c1,c5] * B[c5,c2]
    end do
    do c5 = 1, n
      D[c5,c2] = D[c5,c2] + E[c5,c1] * C[c1,c2]
    end do
  end do
end do


Polyhedral transformation frameworks: what remains to be done?

Finding fine-grained parallelism in a sequential (affine) input program: a solved problem (Feautrier, Darte, etc.)
Minimizing synchronization (O(1) vs. O(N)): solved (Lim/Lam)
Efficient coarse-grained, load-balanced execution on multi-level parallel systems: good progress (LooPo, CLooG, etc.), but more is needed
Tiling is a key loop transformation for efficient coarse-grained parallel execution and for data-locality optimization
But tiling has not been tightly integrated into polyhedral frameworks: it is usually a separate post-processing step on fully permutable loop nests in the target code
The tiled loop structure has a higher-dimensional iteration space, which is not expressible as an affine transformation of the input


Integrating tiling into the polyhedral framework

New algorithm

We develop a new algorithm to find statement-wise affine transforms, interpreted as partitioning hyperplanes. It:
models communication- and locality-optimized tiling in the affine transformation framework
simultaneously optimizes parallelism and locality for sequences of imperfectly nested loops
enables fusion across producer-consumer loops


Tiling and its legality

Tiling for parallelism: enables coarse-grained parallelization; reduces the frequency of communication.
Tiling for locality: allows reuse along multiple dimensions; a tile fits in faster memory.
Tile shape and size affect the volume of communication / the number of cache misses.

If $\vec i$ and $\vec i\,'$ are dependent (the condition for dependence is captured by the polyhedral representation), a legal tiling hyperplane must satisfy

$$\phi_{s_j}(\vec i\,') - \phi_{s_i}(\vec i\,) \ge 0, \qquad \forall \langle \vec i, \vec i\,' \rangle \in P_e$$

This is the classic condition of Irigoin and Triolet [PoPL87]: every dependence has only non-negative components along the tiling hyperplane normals.
Here it is extended to affine dependences and multi-statement, imperfectly nested loops.
We want more than just legal transformations.
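A worked instance of the condition (illustrative: the 1-d Jacobi stencil has the uniform dependence vectors $(1,0)$, $(1,1)$, $(1,-1)$ in $(t, i)$): for a hyperplane $\phi(t, i) = c_t\, t + c_i\, i$, legality requires $\phi(\vec d\,) \ge 0$ for every dependence vector $\vec d$, i.e.,

$$c_t \ge 0, \qquad c_t + c_i \ge 0, \qquad c_t - c_i \ge 0$$

so $(1,0)$ and $(1,1)$ are legal tiling hyperplanes, while the plain space hyperplane $(0,1)$ is not.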


Capturing communication volume and reuse distance

Feautrier's exact dependence analysis yields a generalized dependence graph $(V, E)$: for each edge $e \in E$ from $S_s$ to $S_t$,

$$\vec i \in D_{S_s},\ \vec i\,' \in D_{S_t},\ \langle \vec i, \vec i\,' \rangle \in P_e \implies \vec i = f_e(\vec i\,')$$

[Figure: the $(i, t)$ iteration space of a 1-d stencil with candidate hyperplanes $(1,0)$, $(1,1)$, $(2,1)$; each choice marks loops as parallel or sequential and induces processor tiles P0-P3 with different communication at tile boundaries.]

$\phi(\vec i\,') - \phi(\vec i\,)$ is a very important affine function. It represents the component of a dependence along the hyperplane:
the communication volume (per unit area) at processor tile boundaries
the cache misses at local tile edges


Cost function in the affine framework

We define an affine form $\delta_e$ for each dependence edge $e$:

$$\delta_e(\vec i\,') = \phi_{s_j}(\vec i\,') - \phi_{s_i}(\vec i\,), \qquad \vec i = f_e(\vec i\,')$$

Minimizing $\delta_e$ can be used to:
find the hyperplane that minimizes inter-tile communication volume (rate per unit area)
find the direction that minimizes reuse distance
find the intra-tile schedule that minimizes on-chip memory usage

But directly attempting to optimize $\delta_e$ is problematic: it is not expressible as a linear function of the affine partitioning coefficients. For example, $\phi(\vec i\,') - \phi(\vec i\,)$ could be $c_1 i + (c_2 - c_3)\, j$, where $1 \le i \le N$, $1 \le j \le N$, $i \le j$.
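Continuing the 1-d Jacobi illustration from above: for a uniform dependence vector $\vec d$, $\delta_e = \phi(\vec d\,)$ is a constant, and it differs per hyperplane. For the dependences $(1,0), (1,1), (1,-1)$,

$$\phi = (1,0):\ \delta_e = 1, 1, 1 \qquad \phi = (1,1):\ \delta_e = 1, 2, 0 \qquad \phi = (2,1):\ \delta_e = 2, 3, 1$$

Larger components mean more communication across a tile boundary normal to $\phi$.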


Using a bounding function approach

Bound $\delta_e$ from above and minimize the coefficients of the bound after linearizing with the Farkas lemma:

$$\phi_{s_j}(\vec i\,') - \phi_{s_i}(\vec i\,) \le \vec u \cdot \vec n + w, \qquad\text{i.e.,}\qquad v(\vec n) - \delta_e \ge 0, \quad \forall e \in E, \quad\text{where } v(\vec n) = \vec u \cdot \vec n + w$$

Even if $\delta_e$ involves loop variables, we can find large enough constants in $\vec u$ to bound it.
Since $v(\vec n) - \delta_e \ge 0$, $v$ should increase with an increase in the structural parameters; i.e., the coordinates of $\vec u$ are non-negative.
The reuse distance or communication volume for each dependence is bounded in this fashion by the same affine form.
We thus minimize the maximum dependence component along the hyperplane normal, across all dependences.
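On the same 1-d Jacobi illustration (again illustrative, not taken from the slides), the bounding step is immediate because each $\delta_e$ is constant:

$$\vec u \cdot \vec n + w \ge \delta_e \ \ \forall e \quad\Longrightarrow\quad \vec u = 0,\ \ w \ge \max_e \delta_e$$

so lexicographic minimization of $(\vec u, w)$ prefers $\phi = (1,0)$ (with $w = 1$), and then $\phi = (1,1)$ as the next independent solution. When $\delta_e$ involves the parameters, $\vec u$ becomes non-zero and the same machinery applies.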


Bounding function approach (contd.)

Now use the Farkas lemma to obtain

$$v(\vec n) - \delta_e(\vec q\,) \equiv \lambda_{e0} + \sum_{k=1}^{m_e} \lambda_{ek}\, P_e^k, \qquad \lambda_{ek} \ge 0$$

where the $P_e^k$ are the faces of the dependence polyhedron $P_e$. The Farkas lemma is applied to linearize both the legality constraints and the communication volume / reuse distance bounding constraints.

Use PIP to find the lexicographically minimal solution of the resulting system, with $\vec u$ and $w$ in the leading positions and the other variables following in any order:

$$\operatorname{minimize}_{\prec} \{u_1, u_2, \dots, u_k, w, \dots, c_i\text{'s}, \dots\}$$


Implications of cost function minimization

$\vec u = 0$, $w = 0$: a communication-free parallel hyperplane
$\vec u = 0$, $w$ = const: a constant, minimal amount of boundary communication / cache misses
$\vec u > 0$: a non-constant (large) amount of communication / cache misses

A common solution for multiple statements at a level yields a fused loop.


Finding independent solutions iteratively

Once a solution (a hyperplane for all statements) is found, the same formulation is augmented with additional orthogonality constraints (one possible construction is sketched after this list).
Independent solutions are found one after another, until there are enough hyperplanes to scan all statements.
Dependences are not removed as we proceed from one hyperplane to another, unless they need to be.
A hierarchy of permutable loop-nest sets is found: fully tilable, partially tilable (tiling at an arbitrary level), or inner-tilable loops are identified.
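The orthogonality constraints are not spelled out on the slides; one standard construction (an assumption here, following the authors' later published formulation) takes the subspace orthogonal to the rows $H_S$ found so far,

$$H_S^{\perp} = I - H_S^{T} (H_S H_S^{T})^{-1} H_S$$

and requires the next hyperplane $\vec h$ to have a non-zero component in it, $H_S^{\perp}\, \vec h^{\,T} \ne \vec 0$, linearized before the system is handed to PIP.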


Algorithm summary

Dependences are pushed as far inside as possible:
The outer loops become space loops with minimal communication; synchronization-free parallel loops end up first if they exist ($\phi(\vec i\,') - \phi(\vec i\,) = 0$ for such hyperplanes), and space loops with minimal communication are found next (see the sketch after this list).
The inner loops become sequential time loops with maximum reuse.
Complexity: very fast in practice.
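For intuition about the resulting loop structure, here is a hand-written C sketch for matrix multiplication, not the tool's output: the i and j tile loops are synchronization-free space loops, the k dimension carries reuse and stays sequential, and the tile size of 32 is an arbitrary assumption.

    #include <stdio.h>

    #define N 1024
    #define TS 32                      /* arbitrary tile size (assumption) */
    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    static double A[N][N], B[N][N], C[N][N];

    int main(void) {
        /* The it/jt tile loops are synchronization-free space loops: no
           dependence crosses them, so they run in parallel (compile with
           -fopenmp; the pragma is ignored otherwise). */
        #pragma omp parallel for collapse(2)
        for (int it = 0; it < N; it += TS)
            for (int jt = 0; jt < N; jt += TS)
                /* k carries the reuse of C[i][j]: a sequential "time"
                   loop, tiled so each tile's working set fits in cache. */
                for (int kt = 0; kt < N; kt += TS)
                    for (int i = it; i < MIN(it + TS, N); i++)
                        for (int j = jt; j < MIN(jt + TS, N); j++)
                            for (int k = kt; k < MIN(kt + TS, N); k++)
                                C[i][j] += A[i][k] * B[k][j];
        printf("%f\n", C[0][0]);
        return 0;
    }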


Algorithm

Input: generalized dependence graph G = (V, E), including the dependence polyhedra $P_e$, $e \in E$; $S_{max}$, the statement with maximum domain dimensionality.

for each dependence $e \in E$:
  Build legality constraints: apply the Farkas lemma to $\phi(\vec t\,) - \phi(f_e(\vec t\,)) \ge 0$ under $\vec t \in P_e$, and eliminate all Farkas multipliers.
  Build communication volume / reuse distance bounding constraints: apply the Farkas lemma to $v(\vec n) - (\phi(\vec t\,) - \phi(f_e(\vec t\,))) \ge 0$ under $\vec t \in P_e$, and eliminate all Farkas multipliers.
  Aggregate the constraints from both into $C_e$.
repeat:
  $C := \emptyset$
  for each dependence edge $e \in E$: $C := C \cup C_e$
  Compute the lexicographically minimal solution of $C$ with the $u$ coefficients in the leading positions followed by $w$, iteratively finding independent solutions (orthogonality constraints are added as each solution is found).
  if no solutions were found: cut dependences between two strongly-connected components in the GDG, and insert the appropriate splitter in the transformation matrices of the statements.
  Compute $E_c$, the dependences carried by the solutions just found; update the necessary dependence polyhedra (when only a portion of a dependence is satisfied).
  $E := E - E_c$; re-form the GDG (V, E).
until $H_{S_{max}}^{\perp} = 0$ and $E = \emptyset$

Output: a transformation matrix for each statement (all with the same number of rows).

Implementation

Implementation status

[Figure: the automatic parallelizer (source-to-source). Nested loop sequences are handled by the LooPo scanner/parser and dependence tester, producing dependence polyhedra; our affine transformation framework (parallelization + locality optimization) computes statement-wise affine transforms; a polyhedral tile specifier produces updated domains with supernodes and scatterings; CLooG and a syntactic transformer then emit compilable target code annotated with OpenMP, compiled by gcc/icc.]

The framework is implemented with PipLib and PolyLib, and is interfaced with LooPo (Univ. of Passau, Germany) and CLooG (INRIA Futurs) at either end.
C/Fortran input is taken to transformed code in a completely automatic fashion.


Example: imperfectly nested 2-d Jacobi

CONSTANT n;
DO t = 1, n
  DO i = 2, n-1
    DO j = 2, n-1
      b[i,j] = a[i-1,j] + a[i,j] + a[i+1,j] + a[i,j-1] + a[i,j+1];
    END DO
  END DO
  DO k = 2, n-1
    DO l = 2, n-1
      a[k,l] = b[k,l];
    END DO
  END DO
END DO
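The two nests form a producer-consumer pair, and the tool fuses them. As an illustration of why a fused form is legal, here is a hand-written C sketch, not the tool's output, in which the copy statement is shifted by (1, 1); PLuTo's actual transform additionally skews and tiles the time dimension. The array bounds and tmax are assumptions of the sketch.

    #include <stdio.h>

    #define NN 1000
    static double a[NN + 1][NN + 1], b[NN + 1][NN + 1];

    int main(void) {
        int n = NN, tmax = 100;
        for (int t = 1; t <= tmax; t++)
            for (int i = 2; i <= n; i++)
                for (int j = 2; j <= n; j++) {
                    if (i <= n - 1 && j <= n - 1)   /* S1 at (t, i, j) */
                        b[i][j] = a[i-1][j] + a[i][j] + a[i+1][j]
                                + a[i][j-1] + a[i][j+1];
                    if (i >= 3 && j >= 3)           /* S2 at (t, i-1, j-1) */
                        /* Every S1 that reads a[i-1][j-1] at time t has
                           already executed, so overwriting it is legal. */
                        a[i-1][j-1] = b[i-1][j-1];
                }
        printf("%f\n", a[n/2][n/2]);
        return 0;
    }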


Example: TCE 4-index transform

DO a = 1, N
  DO q = 1, N
    DO r = 1, N
      DO s, p = 1, N
        T1[a,q,r,s] = T1[a,q,r,s] + A[p,q,r,s] * C4[p,a]

DO a = 1, N
  DO b = 1, N
    DO r = 1, N
      DO s, q = 1, N
        T2[a,b,r,s] = T2[a,b,r,s] + T1[a,q,r,s] * C3[q,b]

DO a = 1, N
  DO b = 1, N
    DO c = 1, N
      DO s, r = 1, N
        T3[a,b,c,s] = T3[a,b,c,s] + T2[a,b,r,s] * C2[r,c]

DO a = 1, N
  DO b = 1, N
    DO c = 1, N
      DO d, s = 1, N
        B[a,b,c,d] = B[a,b,c,d] + T3[a,b,c,s] * C1[s,d]

Transformed:

DOALL c1 = 1, M
  DO c2 = 1, M
    DOALL c4 = 1, M
      DOALL c5 = 1, M
        DO c7 = 1, M
          S1(a=c1, q=c5, r=c4, s=c2, p=c7)
        DO c7 = 1, M
          S2(a=c1, b=c7, r=c4, s=c2, q=c5)
    DOALL c4 = 1, M
      DOALL c5 = 1, M
        DO c7 = 1, M
          S3(a=c1, b=c5, c=c4, s=c2, r=c7)
        DO c7 = 1, M
          S4(a=c1, b=c5, c=c4, d=c7, s=c2)

Figure: TCE 4-index transform


Running time

Code          Number of    Number of   Number of      Running
              statements   loops       dependences    time
2-d Jacobi     2             6          20            0.05 s
Haar 1-d       3             5          12            0.018 s
LU             2             5          10            0.022 s
TCE 4-index    4            20          15            0.20 s
Swim          58           160         639            20.9 s

Table: Transformation tool running time

Target architectures

The transformation framework can be targeted toward multiple accelerators.
Different kinds, levels, and degrees of parallelism are required in each case.

Experimental results

Experimental Setup

Intel Core2 Quad Q6600, 2.4 GHz (quad core with shared L2 cache), FSB 1066 MHz, DDR2-667 RAM
32 KB L1 cache, 8 MB L2 cache (4 MB per core pair)
ICC 10.1 (-fast)
GCC 4.1.2 (-O3 -ftree-vectorize -mtune=core2 -msse2 -funroll-loops)
Linux 2.6.18 x86-64

All optimized codes were obtained fully automatically through the PLuTo system.

Imperfectly nested 1-d Jacobi stencil (with icc)

[Figure: (a) single core, T = 10^4: speedup over the native compiler (up to about 8x) vs. space extent N (500000 to 4M), for Our, Lim/Lam, scheduling-based (time tiling), and icc -fast. (b) Multi-core parallel, N = 10^6, T = 10^5: GFLOPs vs. number of cores (1-4), for Our (tiled parallel), Lim/Lam affine partitioning, scheduling-based (time tiling), and icc -parallel -fast.]

2-d FDTD

[Figure: (a) single core, T = 500: speedup over the native compiler (up to about 6x) vs. problem size N (500 to 4000), for Our, icc -fast / scheduling-based (space tiling), and scheduling-based (time tiling). (b) Parallel, nx = ny = 2000, tmax = 500: GFLOPs vs. number of cores (1-4), for Our (1-d pipelined parallel), Our (2-d pipelined parallel), scheduling-based (time tiling), and icc -parallel -fast / scheduling-based space tiling.]

LU Decomposition

[Figure: (a) single core (L1 and L2 tiled): GFLOPs vs. problem size (1K to 8K), for Our (tiled), scheduling-based (time tiling), and icc -parallel -fast. (b) Quad core, N = 8000: GFLOPs vs. number of cores (1-4), for Our (2-d pipelined parallel + tiled), Our (1-d pipelined parallel), scheduling-based (time tiling), and icc -parallel -fast.]

MVT

[Figure: MVT performance on a quad core, N = 12000: GFLOPs vs. number of cores (1-4), for Our (+ syntactic post-processing), Our (1-d pipelined parallel, fused (ij/ji)), scheduling-based / Lim-Lam (fused (ij/ij), i parallel), and gcc -O3.]

Related and Future work

Related work

Schedule-based approaches [Feautrier92, Darte/Vivien95, Griebl04 habilitation thesis]
Affine partitioning [Lim/Lam PoPL97, ICS01]: minimizing the order of synchronization alone does not help. Ahmed/Pingali [IJPP01]: method incomplete due to an infeasible heuristic
Semi-automatic: URUK/WRAP-IT [Cohen05ICS, Girbal06IJPP], a very powerful, flexible application of transformations specified manually by an expert
Step-wise completion of transformation matrices was addressed for some restricted inputs in the context of FPGA auto-transformation [Bondhugula PPoPP07]
Automatically finding good transforms for sequences of imperfectly nested loops, for coarse-grained parallelism and locality, had not previously been attempted

Future work

Integrating with full-fledged frontends: GCC's GRAPHITE project (INRIA Futurs)
The fusion/parallelization trade-off: which dependences between strongly-connected components to cut or include
Combining with stronger cost models for tile size selection
Interaction with auto-vectorization and prefetching

Acknowledgments

Cedric Bastoul (INRIA Futurs) for CLooG, and all other contributors to CLooG and PipLib
Martin Griebl (FMI, Universität Passau, Germany) and his team for LooPo

Farkas Lemma

Let the hyperplanes bounding the polytope of statement $S_k$ be given by

$$a_{S,k} \cdot \begin{pmatrix} \vec i \\ \vec n \end{pmatrix} + b_{S,k} \ge 0, \qquad k = 1, \dots, m_S$$

where $\vec n$ is a vector of the structural parameters. A well-known result, useful in the context of the polytope model, is the affine form of the Farkas lemma.

Lemma (affine form of the Farkas lemma). Let $D$ be a non-empty polyhedron defined by $p$ affine inequalities, or faces, $a_k \cdot \vec x + b_k \ge 0$, $k = 1, \dots, p$. Then an affine form $\psi$ is non-negative everywhere in $D$ iff it is a non-negative affine combination of the faces:

$$\psi(\vec x) \equiv \lambda_0 + \sum_k \lambda_k\, (a_k \cdot \vec x + b_k), \qquad \lambda_k \ge 0$$

Affine Schedules as Partitioning Hyperplanes

Let there be a dependence from statement instance $\vec p$ of $S_i$ to instance $\vec q$ of $S_j$, corresponding to an edge $e$ of the GDG of a program. After exact dependence analysis, we obtain

$$\vec p = f_e(\vec q\,), \qquad \vec q \in P_e$$

where $P_e$ is as defined previously, and $f_e$ is an affine function, also called the h-transformation.

Lemma. Let $\phi_{s_i}$ be a one-dimensional affine transform for $S_i$. For $\phi_{s_1}, \phi_{s_2}, \dots, \phi_{s_k}$ to be a coordinated, valid tiling hyperplane for all statements, the following should hold for each edge $e$ from $S_i$ to $S_j$:

$$\phi_{s_j}(\vec q\,) - \phi_{s_i}(\vec p\,) \ge 0, \quad \forall \langle \vec p, \vec q \rangle \in R_e
\quad\Longleftrightarrow\quad
\phi_{s_j}(\vec q\,) - \phi_{s_i}(f_e(\vec q\,)) \ge 0, \quad \forall \vec q \in P_e$$
