
Automatic Parallelization for Multicore Architectures

P. Sadayappan
Department of Computer Science & Engineering, The Ohio State University

Joint work with Uday Bondhugula, Muthu Baskaran, Sriram Krishnamoorthy, Atanas Rountev, and J. Ramanujam

NHC 07, HiPC 07, Goa, India, December 18, 2007. Supported in part by NSF.


Outline

1 Introduction: emerging parallel architectures and automatic parallelization
2 Polyhedral techniques for program optimization
3 Integrating tiling into the polyhedral framework
4 Implementation
5 Experimental results
6 Related and Future work


New parallel architectures and accelerators

On-chip parallel architectures and accelerators are emerging:
General purpose: multi-core microprocessors
Specialized: GPGPUs, Cell, FPGAs, MPSoCs

Programming challenges:
Parallelization (large number of processing elements)
Locality optimization (limited external memory bandwidth)
Scratchpad memory minimization (fast software-controlled caches)


Polyhedral techniques for program optimization

Polyhedral optimization

1 Dependence analysis (exact affine dependences) [Feautrier91, Pugh92, Vasilache06ICS]
2 Automatic transformations [Feautrier92, Lim/Lam97, Griebl04]
3 Code generation from specified transforms [Omega90s, Quillere00, Bastoul04, CLooG, Vasilache06CC]

There have been significant recent advances in the first and last steps.


Background: the polyhedral/polytope model

A powerful model for optimizing loops with regular accesses.

[Figure: (a) a hyperplane h.x = k, which bounds the half-spaces h.x >= k and h.x <= k; (b) a polytope, an intersection of finitely many such half-spaces. The iteration space of a loop nest is represented as a polytope.]
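As a concrete instance (an illustrative example, not one from the slides), the triangular nest do i = 1, N / do j = 1, i has the iteration-space polytope

$$D = \{(i, j) \mid 1 \le i \le N,\ 1 \le j \le i\} = \{\vec x \mid i - 1 \ge 0,\ N - i \ge 0,\ j - 1 \ge 0,\ i - j \ge 0\}$$

where each inequality defines a half-space whose bounding hyperplane contributes a face of the polytope.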


Affine transformations in the polytope model

A one-dimensional affine transform for statement $S_k$ is defined by

$$\phi_{S_k}(\vec i\,) = \begin{pmatrix} c_1 & c_2 & \dots & c_{m_{S_k}} \end{pmatrix}\,\vec i + c_0 = h_{S_k}\,\vec i + c_0$$

where $h_{S_k} = [c_1, c_2, \dots, c_{m_{S_k}}]$ with $c_1, c_2, \dots, c_{m_{S_k}} \in \mathbb{Z}$.
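For example (illustrative, for a 2-d statement with iteration vector $(i, j)^T$):

$$\phi(i, j) = \begin{pmatrix}1 & 0\end{pmatrix}\begin{pmatrix}i \\ j\end{pmatrix} + 0 = i \quad\text{(the original outer loop)}, \qquad \phi(i, j) = \begin{pmatrix}1 & 1\end{pmatrix}\begin{pmatrix}i \\ j\end{pmatrix} = i + j \quad\text{(a skewing hyperplane)}$$

Each $\phi$ assigns every iteration to a hyperplane; the integer points of one hyperplane become one iteration of the transformed loop.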


Affine transformations

A transformation for each statement: $\theta_S(\vec i\,) = T_S\,\vec i + \vec b_S$, i.e.,

$$\begin{pmatrix} i_1' \\ i_2' \\ \vdots \\ i_n' \end{pmatrix} =
\begin{pmatrix} c_{11} & c_{12} & \dots & c_{1n} \\ c_{21} & c_{22} & \dots & c_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ c_{n1} & c_{n2} & \dots & c_{nn} \end{pmatrix}
\begin{pmatrix} i_1 \\ i_2 \\ \vdots \\ i_n \end{pmatrix} +
\begin{pmatrix} c_{01} \\ c_{02} \\ \vdots \\ c_{0n} \end{pmatrix}$$

A transform with full column rank is a one-to-one mapping.

Each 1-d transform ($\phi$) can later be marked as a space loop or a time (sequential) loop, or a band of them can be tiled.

Problem: how do you find good transformations, optimized for parallelism and locality?
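Two familiar transformations written in this form (an illustration, not from the slides): loop interchange and loop skewing of a 2-d nest are

$$T_{\mathrm{interchange}} = \begin{pmatrix}0 & 1 \\ 1 & 0\end{pmatrix}, \qquad T_{\mathrm{skew}} = \begin{pmatrix}1 & 0 \\ 1 & 1\end{pmatrix}, \qquad \vec b_S = \vec 0$$

Both have full column rank, so every source iteration maps to exactly one target iteration.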


Per-statement affine transformations

do i = 1, n
  do j = 1, n
(S1) C[i,j] = 0;
  end do
end do

do i = 1, n
  do j = 1, n
    do k = 1, n
(S2)  C[i,j] = C[i,j] + A[i,k] * B[k,j]
    end do
  end do
end do

do i = 1, n
  do j = 1, n
    do k = 1, n
(S3)  D[i,j] = D[i,j] + E[i,k] * C[k,j]
    end do
  end do
end do

Statement-wise transform coefficients:

      S1             S2                S3
      i  j  const    i  j  k  const    i  j  k  const
c1    1  0  0        1  0  0  0        0  0  1  0
c2    0  1  0        0  1  0  0        0  1  0  0
c3    0  0  0        0  0  0  1        0  0  0  1
c4    0  0  0        0  0  0  0        0  0  0  1
c5    0  0  0        0  0  1  0        1  0  0  0

(Reading the table: row c1 of S3 selects k, so S3 runs with its k loop as the outermost dimension c1; the constant rows c3 and c4 order the statements at each level.)

Transformed (fused) code:

do c1 = 1, n
  do c2 = 1, n
    C[c1,c2] = 0;
    do c5 = 1, n
      C[c1,c2] = C[c1,c2] + A[c1,c5] * B[c5,c2]
    end do
    do c5 = 1, n
      D[c5,c2] = D[c5,c2] + E[c5,c1] * C[c1,c2]
    end do
  end do
end do


Polyhedral transformation frameworks: what remains to be done?

Finding fine-grained parallelism in a sequential (affine) input program: a solved problem (Feautrier, Darte, etc.)
Minimizing synchronization (O(1) vs. O(N)): solved (Lim/Lam)
Efficient coarse-grained, load-balanced execution on multi-level parallel systems: good progress (LooPo, CLooG, etc.), but more is needed
Tiling is a key loop transformation for efficient coarse-grained parallel execution and for data-locality optimization
But tiling has not been tightly integrated into polyhedral frameworks: it is usually a separate post-processing step on fully permutable loop nests in the target code
The tiled loop structure has a higher-dimensional iteration space, which is not expressible as an affine transformation of the input


Integrating tiling into the polyhedral framework

New algorithm

We develop a new algorithm to find statement-wise affine transforms, interpreted as partitioning hyperplanes. It:
models communication- and locality-optimized tiling in the affine transformation framework
simultaneously optimizes parallelism and locality for sequences of imperfectly nested loops
enables fusion across producer-consumer loops


Tiling and its legality

Tiling for parallelism: enables coarse-grained parallelization; reduces the frequency of communication.
Tiling for locality: allows reuse along multiple dimensions; a tile fits in faster memory.
Tile shape and size affect the volume of communication / the number of cache misses.

If $\vec i$ and $\vec i\,'$ are dependent (the condition for dependence is captured by the polyhedral representation), a legal tiling hyperplane must satisfy

$$\phi_{s_j}(\vec i\,') - \phi_{s_i}(\vec i\,) \ge 0, \qquad \forall \langle \vec i, \vec i\,' \rangle \in P_e$$

This is the classic condition of Irigoin and Triolet [PoPL87]: every dependence has only non-negative components along the tiling hyperplane normals.
Here it is extended to affine dependences and multi-statement, imperfectly nested loops.
We want more than just legal transformations.
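A worked instance of the condition (illustrative: the 1-d Jacobi stencil has the uniform dependence vectors $(1,0)$, $(1,1)$, $(1,-1)$ in $(t, i)$): for a hyperplane $\phi(t, i) = c_t\, t + c_i\, i$, legality requires $\phi(\vec d\,) \ge 0$ for every dependence vector $\vec d$, i.e.,

$$c_t \ge 0, \qquad c_t + c_i \ge 0, \qquad c_t - c_i \ge 0$$

so $(1,0)$ and $(1,1)$ are legal tiling hyperplanes, while the plain space hyperplane $(0,1)$ is not.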


Capturing communication volume and reuse distance

Feautrier's exact dependence analysis yields a generalized dependence graph $(V, E)$: for each edge $e \in E$ from $S_s$ to $S_t$,

$$\vec i \in D_{S_s},\ \vec i\,' \in D_{S_t},\ \langle \vec i, \vec i\,' \rangle \in P_e \implies \vec i = f_e(\vec i\,')$$

[Figure: the $(i, t)$ iteration space of a 1-d stencil with candidate hyperplanes $(1,0)$, $(1,1)$, $(2,1)$; each choice marks loops as parallel or sequential and induces processor tiles P0-P3 with different communication at tile boundaries.]

$\phi(\vec i\,') - \phi(\vec i\,)$ is a very important affine function. It represents the component of a dependence along the hyperplane:
the communication volume (per unit area) at processor tile boundaries
the cache misses at local tile edges


Cost function in the affine framework

We define an affine form $\delta_e$ for each dependence edge $e$:

$$\delta_e(\vec i\,') = \phi_{s_j}(\vec i\,') - \phi_{s_i}(\vec i\,), \qquad \vec i = f_e(\vec i\,')$$

Minimizing $\delta_e$ can be used to:
find the hyperplane that minimizes inter-tile communication volume (rate per unit area)
find the direction that minimizes reuse distance
find the intra-tile schedule that minimizes on-chip memory usage

But directly attempting to optimize $\delta_e$ is problematic: it is not expressible as a linear function of the affine partitioning coefficients. For example, $\phi(\vec i\,') - \phi(\vec i\,)$ could be $c_1 i + (c_2 - c_3)\, j$, where $1 \le i \le N$, $1 \le j \le N$, $i \le j$.
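Continuing the 1-d Jacobi illustration from above: for a uniform dependence vector $\vec d$, $\delta_e = \phi(\vec d\,)$ is a constant, and it differs per hyperplane. For the dependences $(1,0), (1,1), (1,-1)$,

$$\phi = (1,0):\ \delta_e = 1, 1, 1 \qquad \phi = (1,1):\ \delta_e = 1, 2, 0 \qquad \phi = (2,1):\ \delta_e = 2, 3, 1$$

Larger components mean more communication across a tile boundary normal to $\phi$.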


Using a bounding function approach

Bound $\delta_e$ from above and minimize the coefficients of the bound after linearizing with the Farkas lemma:

$$\phi_{s_j}(\vec i\,') - \phi_{s_i}(\vec i\,) \le \vec u \cdot \vec n + w, \qquad\text{i.e.,}\qquad v(\vec n) - \delta_e \ge 0, \quad \forall e \in E, \quad\text{where } v(\vec n) = \vec u \cdot \vec n + w$$

Even if $\delta_e$ involves loop variables, we can find large enough constants in $\vec u$ to bound it.
Since $v(\vec n) - \delta_e \ge 0$, $v$ should increase with an increase in the structural parameters; i.e., the coordinates of $\vec u$ are non-negative.
The reuse distance or communication volume for each dependence is bounded in this fashion by the same affine form.
We thus minimize the maximum dependence component along the hyperplane normal, across all dependences.
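On the same 1-d Jacobi illustration (again illustrative, not taken from the slides), the bounding step is immediate because each $\delta_e$ is constant:

$$\vec u \cdot \vec n + w \ge \delta_e \ \ \forall e \quad\Longrightarrow\quad \vec u = 0,\ \ w \ge \max_e \delta_e$$

so lexicographic minimization of $(\vec u, w)$ prefers $\phi = (1,0)$ (with $w = 1$), and then $\phi = (1,1)$ as the next independent solution. When $\delta_e$ involves the parameters, $\vec u$ becomes non-zero and the same machinery applies.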


Bounding function approach (contd.)

Now use the Farkas lemma to obtain

$$v(\vec n) - \delta_e(\vec q\,) \equiv \lambda_{e0} + \sum_{k=1}^{m_e} \lambda_{ek}\, P_e^k, \qquad \lambda_{ek} \ge 0$$

where the $P_e^k$ are the faces of the dependence polyhedron $P_e$. The Farkas lemma is applied to linearize both the legality constraints and the communication volume / reuse distance bounding constraints.

Use PIP to find the lexicographically minimal solution of the resulting system, with $\vec u$ and $w$ in the leading positions and the other variables following in any order:

$$\operatorname{minimize}_{\prec} \{u_1, u_2, \dots, u_k, w, \dots, c_i\text{'s}, \dots\}$$


Implications of cost function minimization

$\vec u = 0$, $w = 0$: a communication-free parallel hyperplane
$\vec u = 0$, $w$ = const: a constant, minimal amount of boundary communication / cache misses
$\vec u > 0$: a non-constant (large) amount of communication / cache misses

A common solution for multiple statements at a level yields a fused loop.


Finding independent solutions iteratively

Once a solution (a hyperplane for all statements) is found, the same formulation is augmented with additional orthogonality constraints (one possible construction is sketched after this list).
Independent solutions are found one after another, until there are enough hyperplanes to scan all statements.
Dependences are not removed as we proceed from one hyperplane to another, unless they need to be.
A hierarchy of permutable loop-nest sets is found: fully tilable, partially tilable (tiling at an arbitrary level), or inner-tilable loops are identified.
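The orthogonality constraints are not spelled out on the slides; one standard construction (an assumption here, following the authors' later published formulation) takes the subspace orthogonal to the rows $H_S$ found so far,

$$H_S^{\perp} = I - H_S^{T} (H_S H_S^{T})^{-1} H_S$$

and requires the next hyperplane $\vec h$ to have a non-zero component in it, $H_S^{\perp}\, \vec h^{\,T} \ne \vec 0$, linearized before the system is handed to PIP.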


Algorithm summary

Dependences are pushed as far inside as possible:
The outer loops become space loops with minimal communication; synchronization-free parallel loops end up first if they exist ($\phi(\vec i\,') - \phi(\vec i\,) = 0$ for such hyperplanes), and space loops with minimal communication are found next (see the sketch after this list).
The inner loops become sequential time loops with maximum reuse.
Complexity: very fast in practice.
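For intuition about the resulting loop structure, here is a hand-written C sketch for matrix multiplication, not the tool's output: the i and j tile loops are synchronization-free space loops, the k dimension carries reuse and stays sequential, and the tile size of 32 is an arbitrary assumption.

    #include <stdio.h>

    #define N 1024
    #define TS 32                      /* arbitrary tile size (assumption) */
    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    static double A[N][N], B[N][N], C[N][N];

    int main(void) {
        /* The it/jt tile loops are synchronization-free space loops: no
           dependence crosses them, so they run in parallel (compile with
           -fopenmp; the pragma is ignored otherwise). */
        #pragma omp parallel for collapse(2)
        for (int it = 0; it < N; it += TS)
            for (int jt = 0; jt < N; jt += TS)
                /* k carries the reuse of C[i][j]: a sequential "time"
                   loop, tiled so each tile's working set fits in cache. */
                for (int kt = 0; kt < N; kt += TS)
                    for (int i = it; i < MIN(it + TS, N); i++)
                        for (int j = jt; j < MIN(jt + TS, N); j++)
                            for (int k = kt; k < MIN(kt + TS, N); k++)
                                C[i][j] += A[i][k] * B[k][j];
        printf("%f\n", C[0][0]);
        return 0;
    }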


Algorithm

Input: generalized dependence graph G = (V, E), including the dependence polyhedra $P_e$, $e \in E$; $S_{max}$, the statement with maximum domain dimensionality.

for each dependence $e \in E$:
  Build legality constraints: apply the Farkas lemma to $\phi(\vec t\,) - \phi(f_e(\vec t\,)) \ge 0$ under $\vec t \in P_e$, and eliminate all Farkas multipliers.
  Build communication volume / reuse distance bounding constraints: apply the Farkas lemma to $v(\vec n) - (\phi(\vec t\,) - \phi(f_e(\vec t\,))) \ge 0$ under $\vec t \in P_e$, and eliminate all Farkas multipliers.
  Aggregate the constraints from both into $C_e$.
repeat:
  $C := \emptyset$
  for each dependence edge $e \in E$: $C := C \cup C_e$
  Compute the lexicographically minimal solution of $C$ with the $u$ coefficients in the leading positions followed by $w$, iteratively finding independent solutions (orthogonality constraints are added as each solution is found).
  if no solutions were found: cut dependences between two strongly-connected components in the GDG, and insert the appropriate splitter in the transformation matrices of the statements.
  Compute $E_c$, the dependences carried by the solutions just found; update the necessary dependence polyhedra (when only a portion of a dependence is satisfied).
  $E := E - E_c$; re-form the GDG (V, E).
until $H_{S_{max}}^{\perp} = 0$ and $E = \emptyset$

Output: a transformation matrix for each statement (all with the same number of rows).

Implementation

Implementation status

[Figure: the automatic parallelizer (source-to-source). Nested loop sequences are handled by the LooPo scanner/parser and dependence tester, producing dependence polyhedra; our affine transformation framework (parallelization + locality optimization) computes statement-wise affine transforms; a polyhedral tile specifier produces updated domains with supernodes and scatterings; CLooG and a syntactic transformer then emit compilable target code annotated with OpenMP, compiled by gcc/icc.]

The framework is implemented with PipLib and PolyLib, and is interfaced with LooPo (Univ. of Passau, Germany) and CLooG (INRIA Futurs) at either end.
C/Fortran input is taken to transformed code in a completely automatic fashion.


Example: imperfectly nested 2-d Jacobi

CONSTANT n;
DO t = 1, n
  DO i = 2, n-1
    DO j = 2, n-1
      b[i,j] = a[i-1,j] + a[i,j] + a[i+1,j] + a[i,j-1] + a[i,j+1];
    END DO
  END DO
  DO k = 2, n-1
    DO l = 2, n-1
      a[k,l] = b[k,l];
    END DO
  END DO
END DO
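The two nests form a producer-consumer pair, and the tool fuses them. As an illustration of why a fused form is legal, here is a hand-written C sketch, not the tool's output, in which the copy statement is shifted by (1, 1); PLuTo's actual transform additionally skews and tiles the time dimension. The array bounds and tmax are assumptions of the sketch.

    #include <stdio.h>

    #define NN 1000
    static double a[NN + 1][NN + 1], b[NN + 1][NN + 1];

    int main(void) {
        int n = NN, tmax = 100;
        for (int t = 1; t <= tmax; t++)
            for (int i = 2; i <= n; i++)
                for (int j = 2; j <= n; j++) {
                    if (i <= n - 1 && j <= n - 1)   /* S1 at (t, i, j) */
                        b[i][j] = a[i-1][j] + a[i][j] + a[i+1][j]
                                + a[i][j-1] + a[i][j+1];
                    if (i >= 3 && j >= 3)           /* S2 at (t, i-1, j-1) */
                        /* Every S1 that reads a[i-1][j-1] at time t has
                           already executed, so overwriting it is legal. */
                        a[i-1][j-1] = b[i-1][j-1];
                }
        printf("%f\n", a[n/2][n/2]);
        return 0;
    }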


Example: TCE 4-index transform

DO a = 1, N
  DO q = 1, N
    DO r = 1, N
      DO s, p = 1, N
        T1[a,q,r,s] = T1[a,q,r,s] + A[p,q,r,s] * C4[p,a]

DO a = 1, N
  DO b = 1, N
    DO r = 1, N
      DO s, q = 1, N
        T2[a,b,r,s] = T2[a,b,r,s] + T1[a,q,r,s] * C3[q,b]

DO a = 1, N
  DO b = 1, N
    DO c = 1, N
      DO s, r = 1, N
        T3[a,b,c,s] = T3[a,b,c,s] + T2[a,b,r,s] * C2[r,c]

DO a = 1, N
  DO b = 1, N
    DO c = 1, N
      DO d, s = 1, N
        B[a,b,c,d] = B[a,b,c,d] + T3[a,b,c,s] * C1[s,d]

Transformed:

DOALL c1 = 1, M
  DO c2 = 1, M
    DOALL c4 = 1, M
      DOALL c5 = 1, M
        DO c7 = 1, M
          S1(a=c1, q=c5, r=c4, s=c2, p=c7)
        DO c7 = 1, M
          S2(a=c1, b=c7, r=c4, s=c2, q=c5)
    DOALL c4 = 1, M
      DOALL c5 = 1, M
        DO c7 = 1, M
          S3(a=c1, b=c5, c=c4, s=c2, r=c7)
        DO c7 = 1, M
          S4(a=c1, b=c5, c=c4, d=c7, s=c2)

Figure: TCE 4-index transform


Running time

Code          Number of    Number of   Number of      Running
              statements   loops       dependences    time
2-d Jacobi     2             6          20            0.05 s
Haar 1-d       3             5          12            0.018 s
LU             2             5          10            0.022 s
TCE 4-index    4            20          15            0.20 s
Swim          58           160         639            20.9 s

Table: Transformation tool running time

Target architectures

The transformation framework can be targeted toward multiple accelerators.
Different kinds, levels, and degrees of parallelism are required in each case.

Experimental results

Experimental Setup

Intel Core2 Quad Q6600, 2.4 GHz (quad core with shared L2 cache), FSB 1066 MHz, DDR2-667 RAM
32 KB L1 cache, 8 MB L2 cache (4 MB per core pair)
ICC 10.1 (-fast)
GCC 4.1.2 (-O3 -ftree-vectorize -mtune=core2 -msse2 -funroll-loops)
Linux 2.6.18 x86-64

All optimized codes were obtained fully automatically through the PLuTo system.

Imperfectly nested 1-d Jacobi stencil (with icc)

[Figure: (a) single core, T = 10^4: speedup over the native compiler (up to about 8x) vs. space extent N (500000 to 4M), for Our, Lim/Lam, scheduling-based (time tiling), and icc -fast. (b) Multi-core parallel, N = 10^6, T = 10^5: GFLOPs vs. number of cores (1-4), for Our (tiled parallel), Lim/Lam affine partitioning, scheduling-based (time tiling), and icc -parallel -fast.]

2-d FDTD

[Figure: (a) single core, T = 500: speedup over the native compiler (up to about 6x) vs. problem size N (500 to 4000), for Our, icc -fast / scheduling-based (space tiling), and scheduling-based (time tiling). (b) Parallel, nx = ny = 2000, tmax = 500: GFLOPs vs. number of cores (1-4), for Our (1-d pipelined parallel), Our (2-d pipelined parallel), scheduling-based (time tiling), and icc -parallel -fast / scheduling-based space tiling.]

LU Decomposition

[Figure: (a) single core (L1 and L2 tiled): GFLOPs vs. problem size (1K to 8K), for Our (tiled), scheduling-based (time tiling), and icc -parallel -fast. (b) Quad core, N = 8000: GFLOPs vs. number of cores (1-4), for Our (2-d pipelined parallel + tiled), Our (1-d pipelined parallel), scheduling-based (time tiling), and icc -parallel -fast.]

MVT

[Figure: MVT performance on a quad core, N = 12000: GFLOPs vs. number of cores (1-4), for Our (+ syntactic post-processing), Our (1-d pipelined parallel, fused (ij/ji)), scheduling-based / Lim-Lam (fused (ij/ij), i parallel), and gcc -O3.]

Related and Future work

Related work

Schedule-based approaches [Feautrier92, Darte/Vivien95, Griebl04 habilitation thesis]
Affine partitioning [Lim/Lam PoPL97, ICS01]: minimizing the order of synchronization alone does not help. Ahmed/Pingali [IJPP01]: method incomplete due to an infeasible heuristic
Semi-automatic: URUK/WRAP-IT [Cohen05ICS, Girbal06IJPP], a very powerful, flexible application of transformations specified manually by an expert
Step-wise completion of transformation matrices was addressed for some restricted inputs in the context of FPGA auto-transformation [Bondhugula PPoPP07]
Automatically finding good transforms for sequences of imperfectly nested loops, for coarse-grained parallelism and locality, had not previously been attempted

Future work

Integrating with full-fledged frontends: GCC's GRAPHITE project (INRIA Futurs)
The fusion/parallelization trade-off: which dependences between strongly-connected components to cut or include
Combining with stronger cost models for tile size selection
Interaction with auto-vectorization and prefetching

Acknowledgments

Cedric Bastoul (INRIA Futurs) for CLooG, and all other contributors to CLooG and PipLib
Martin Griebl (FMI, Universität Passau, Germany) and his team for LooPo

Farkas Lemma

Let the hyperplanes bounding the polytope of statement $S_k$ be given by

$$a_{S,k} \cdot \begin{pmatrix} \vec i \\ \vec n \end{pmatrix} + b_{S,k} \ge 0, \qquad k = 1, \dots, m_S$$

where $\vec n$ is a vector of the structural parameters. A well-known result, useful in the context of the polytope model, is the affine form of the Farkas lemma.

Lemma (affine form of the Farkas lemma). Let $D$ be a non-empty polyhedron defined by $p$ affine inequalities, or faces, $a_k \cdot \vec x + b_k \ge 0$, $k = 1, \dots, p$. Then an affine form $\psi$ is non-negative everywhere in $D$ iff it is a non-negative affine combination of the faces:

$$\psi(\vec x) \equiv \lambda_0 + \sum_k \lambda_k\, (a_k \cdot \vec x + b_k), \qquad \lambda_k \ge 0$$

Affine Schedules as Partitioning Hyperplanes

Let there be a dependence from statement instance $\vec p$ of $S_i$ to instance $\vec q$ of $S_j$, corresponding to an edge $e$ of the GDG of a program. After exact dependence analysis, we obtain

$$\vec p = f_e(\vec q\,), \qquad \vec q \in P_e$$

where $P_e$ is as defined previously, and $f_e$ is an affine function, also called the h-transformation.

Lemma. Let $\phi_{s_i}$ be a one-dimensional affine transform for $S_i$. For $\phi_{s_1}, \phi_{s_2}, \dots, \phi_{s_k}$ to be a coordinated, valid tiling hyperplane for all statements, the following should hold for each edge $e$ from $S_i$ to $S_j$:

$$\phi_{s_j}(\vec q\,) - \phi_{s_i}(\vec p\,) \ge 0, \quad \forall \langle \vec p, \vec q \rangle \in R_e
\quad\Longleftrightarrow\quad
\phi_{s_j}(\vec q\,) - \phi_{s_i}(f_e(\vec q\,)) \ge 0, \quad \forall \vec q \in P_e$$
