Automatic Parallelization for Multicore Architectures

P. Sadayappan
Department of Computer Science & Engineering, The Ohio State University

Joint work with Uday Bondhugula, Muthu Baskaran, Sriram Krishnamoorthy, Atanas Rountev, J. Ramanujam

December 18, 2007. Supported in part by NSF.
Automatic Parallelization for Multicore Architectures NHC 07, HiPC 07, Goa, India, Dec 18, 2007
Introduction

1 Introduction
  Emerging parallel architectures and automatic parallelization
2 Polyhedral techniques for program optimization
3 Integrating tiling into the polyhedral framework
4 Implementation
5 Experimental results
6 Related and Future work
New parallel architectures and accelerators

On-chip parallel architectures and accelerators are emerging:
  General purpose: multi-core microprocessors
  Specialized: GPGPUs, Cell, FPGAs, MPSoCs

Programming challenges:
  Parallelization (large number of processing elements)
  Locality optimization (limited external memory bandwidth)
  Scratchpad memory minimization (fast, software-controlled caches)
Polyhedral techniques for program optimization
Polyhedral optimization

1 Dependence analysis (exact affine dependences) [Feautrier91, Pugh92, Vasilache06ICS]
2 Automatic transformations [Feautrier92, Lim/Lam97, Griebl04]
3 Code generation from specified transforms [Omega90s, Quillere00, Bastoul04, CLooG, Vasilache06CC]

Significant recent advances in the first and last steps.
Background: the polyhedral/polytope model

A powerful model for optimizing loops with regular accesses: each dynamic instance of a statement in an n-deep loop nest is an integer point of an n-dimensional polytope, and a hyperplane h.x = k partitions that space.

Figure: Iteration space represented as a polytope. (a) A hyperplane h.x = k; (b) a polytope.
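As a concrete illustration of the model (a toy example of our own, not from the talk), the sketch below encodes the triangular iteration space { (i, j) : 1 <= i <= N, 1 <= j <= i } as affine inequalities A.x + b >= 0 and enumerates its integer points. Polyhedral tools manipulate exactly this representation, but symbolically rather than by enumeration.

```python
# Sketch: an iteration space as a polytope of affine inequalities A.x + b >= 0.
# Toy helper, not part of any tool discussed in the talk.
N = 4
# Faces of { (i, j) : 1 <= i <= N, 1 <= j <= i }:
#   i - 1 >= 0;  N - i >= 0;  j - 1 >= 0;  i - j >= 0
A = [(1, 0), (-1, 0), (0, 1), (1, -1)]
b = [-1, N, -1, 0]

def in_polytope(point):
    """True iff A.point + b >= 0 holds for every face."""
    return all(ai * point[0] + aj * point[1] + c >= 0
               for (ai, aj), c in zip(A, b))

points = [(i, j) for i in range(0, N + 2)
                 for j in range(0, N + 2) if in_polytope((i, j))]
print(len(points))  # triangular count N*(N+1)/2 = 10
```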
Affine transformations in the polytope model

A one-dimensional affine transform for statement S_k is defined by:

  phi_{S_k}(i) = [ c_1  c_2  ...  c_{m_{S_k}} ] i + c_0
               = [ c_1  c_2  ...  c_{m_{S_k}}  c_0 ] (i, 1)^T
               = h_{S_k} . i + c_0

where h_{S_k} = [c_1, c_2, ..., c_{m_{S_k}}], with c_1, c_2, ..., c_{m_{S_k}} in Z.
Affine transformations

A transformation for each statement: Phi_S(i) = T_S i + b_S

  [ c_11 c_12 ... c_1n ]   [ i_1 ]   [ c_01 ]
  [ c_21 c_22 ... c_2n ] . [ i_2 ] + [ c_02 ]
  [ ...  ...  ... ...  ]   [ ... ]   [ ...  ]
  [ c_n1 c_n2 ... c_nn ]   [ i_n ]   [ c_0n ]

A full column-rank transform is a one-to-one mapping.

Each one-dimensional transform (phi) can later be marked as a space (parallel) loop or a time (sequential) loop, or a band of them can be tiled.

Problem: how do you find good transformations, optimized for parallelism and locality?
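To make the notation concrete, here is a small sketch (the skewing matrix is an illustrative choice of ours, not from the talk) that applies a per-statement transform Phi(i) = T.i + b to a 2-d iteration space and checks that the mapping is one-to-one on integer points, since |det T| = 1.

```python
# Sketch: applying an affine per-statement transform phi(i) = T.i + b.
# The skewing matrix below is an illustrative choice, not from the talk.
T = [[1, 0],
     [1, 1]]        # (i, j) -> (i, i + j): classic loop skewing
b = [0, 0]

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def phi(point):
    return tuple(sum(T[r][c] * point[c] for c in range(2)) + b[r]
                 for r in range(2))

assert abs(det2(T)) == 1            # unimodular => bijective on Z^2
domain = [(i, j) for i in range(3) for j in range(3)]
image = [phi(p) for p in domain]
assert len(set(image)) == len(domain)   # one-to-one on the domain
print(phi((2, 1)))                  # (2, 3)
```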
Per-statement affine transformations

do i = 1, n
  do j = 1, n
(S1) C[i,j] = 0
  end do
end do
do i = 1, n
  do j = 1, n
    do k = 1, n
(S2)  C[i,j] = C[i,j] + A[i,k]*B[k,j]
    end do
  end do
end do
do i = 1, n
  do j = 1, n
    do k = 1, n
(S3)  D[i,j] = D[i,j] + E[i,k]*C[k,j]
    end do
  end do
end do

Statement-wise transforms (rows c1..c5; columns are the coefficients of each statement's iterators and constant):

        S1              S2                 S3
      i  j  const     i  j  k  const     i  j  k  const
 c1   1  0  0         1  0  0  0         0  0  1  0
 c2   0  1  0         0  1  0  0         0  1  0  0
 c3   0  0  0         0  0  0  1         0  0  0  1
 c4   0  0  0         0  0  0  0         0  0  0  1
 c5   0  0  0         0  0  1  0         1  0  0  0

Transformed (fused) code:

do c1 = 1, n
  do c2 = 1, n
    C[c1,c2] = 0
    do c5 = 1, n
      C[c1,c2] = C[c1,c2] + A[c1,c5]*B[c5,c2]
    end do
    do c5 = 1, n
      D[c5,c2] = D[c5,c2] + E[c5,c1]*C[c1,c2]
    end do
  end do
end do
Polyhedral transformation frameworks: what remains to be done?

Finding fine-grained parallelism in a sequential (affine) input program: a solved problem (Feautrier, Darte, etc.)

Minimizing synchronization (O(1) vs. O(N)): solved (Lim/Lam)

Efficient coarse-grained, load-balanced execution on multi-level parallel systems: good progress (LooPo, CLooG, etc.); more is needed

Tiling is a key loop transformation for efficient coarse-grained parallel execution and for data locality optimization

But tiling has not been tightly integrated into polyhedral frameworks: it is usually a separate post-processing step applied to fully permutable loop nests in the target code

The tiled loop structure has a higher-dimensional iteration space, not expressible as an affine transformation of the input
Integrating tiling into the polyhedral framework
New algorithm

We develop a new algorithm to find statement-wise affine transforms, interpreted as partitioning hyperplanes

Models communication/locality-optimized tiling in the affine transformation framework

Simultaneously optimizes parallelism and locality for sequences of imperfectly nested loops

Enables fusion across producer-consumer loops
Tiling and its legality

Tiling for parallelism: enables coarse-grained parallelization and reduces the frequency of communication
Tiling for locality: allows reuse along multiple dimensions; a tile fits in faster memory

Tile shape and size affect the volume of communication / number of cache misses

If i and i' are dependent (the condition for dependence is captured by the polyhedral representation), phi is a legal tiling hyperplane iff:

  phi_{s_j}(i') - phi_{s_i}(i) >= 0,  for all (i, i') in P_e

Classic condition from Irigoin and Triolet [PoPL87]: every dependence has only non-negative components along the hyperplane normal

Extended here to affine dependences and multi-statement imperfectly nested loops

We want more than just legal transformations
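A minimal sketch of the legality test (the dependence vectors are illustrative values of ours, not from the talk): a candidate hyperplane normal h is a valid tiling hyperplane only if h.d >= 0 for every dependence distance vector d.

```python
# Sketch: checking the Irigoin/Triolet tiling-legality condition h.d >= 0
# for constant dependence distance vectors (illustrative values).
deps = [(1, 0), (1, 1), (0, 1)]   # a stencil-like dependence set

def legal_hyperplane(h, deps):
    """True iff every dependence has a non-negative component along h."""
    return all(h[0] * d[0] + h[1] * d[1] >= 0 for d in deps)

assert legal_hyperplane((1, 0), deps)       # components 1, 1, 0
assert legal_hyperplane((1, 1), deps)       # components 1, 2, 1
assert not legal_hyperplane((1, -1), deps)  # (0,1) has component -1
print("(1,0) and (1,1) are both legal, so the band is tilable")
```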
Capturing communication volume and reuse distance

Feautrier's exact dependence analysis yields a generalized dependence graph (V, E): for S_s, S_t in V and e in E, with i in D_{S_s} and i' in D_{S_t},

  (i, i') in P_e  iff  i = f_e(i')

Figure: dependences with distance vectors such as (1,0), (1,1), (2,1) in the (i, t) iteration space, and the loop characters (sequential vs. parallel) and pipelined tile schedules across processors P0-P3 that result from different hyperplane choices.

phi(i') - phi(i) is a very important affine function:

  It represents the component of a dependence along the hyperplane
  It gives the communication volume (per unit area) at processor tile boundaries
  It gives the cache misses at local tile edges
Cost function in the affine framework

We define an affine form delta_e:

  delta_e = phi_{s_j}(i') - phi_{s_i}(i),  (i, i') in P_e

Minimizing delta_e can be used to:

  Find the hyperplane that minimizes inter-tile communication volume (rate per unit area)
  Find the direction that minimizes reuse distance
  Find an intra-tile schedule that minimizes on-chip memory usage

But directly attempting to optimize delta_e is problematic: it is not expressible as a linear function of the affine partitioning coefficients. E.g., phi(i') - phi(i) could be c_1.i + (c_2 - c_3).j over the dependence polyhedron 1 <= i <= N, 1 <= j <= N, i <= j.
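As an illustrative sketch (1-d Jacobi-style constant dependences chosen by us, not taken from the talk): for constant distance vectors, delta_e reduces to h.d, and scanning candidate normals for the smallest worst-case component already reproduces the flavor of the bounding objective introduced next.

```python
# Sketch: for constant dependence distances d, delta_e = h.d.
# Candidate hyperplane normals are scanned for the smallest worst-case
# component across all dependences (illustrative 1-d Jacobi-like deps).
deps = [(1, -1), (1, 0), (1, 1)]     # (time, space) distance vectors
candidates = [(1, 0), (1, 1), (2, 1), (0, -1)]

def worst_component(h):
    comps = [h[0] * d[0] + h[1] * d[1] for d in deps]
    if any(c < 0 for c in comps):    # illegal tiling hyperplane
        return None
    return max(comps)

legal = {h: worst_component(h) for h in candidates
         if worst_component(h) is not None}
best = min(legal, key=legal.get)
print(legal)   # {(1, 0): 1, (1, 1): 2, (2, 1): 3}
print(best)    # (1, 0): minimal communication per tile boundary
```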
Using a bounding function approach

Bound delta_e from above and minimize the coefficients of the bound, after linearizing with the Farkas lemma:

  phi_{s_j}(i') - phi_{s_i}(i) <= u.n + w
  v(n) - delta_e >= 0,  for all e in E,  where v(n) = u.n + w

Even if delta_e involves loop variables, we can find large enough constants in u to bound it.

Since v(n) - delta_e >= 0, v should increase with an increase in the structural parameters, i.e., the coordinates of u are non-negative.

The reuse distance or communication volume for each dependence is bounded in this fashion by the same affine form.

Minimize the maximum component along the hyperplane normal across all dependences.
Bounding function approach (contd.)

Now, use the Farkas lemma to obtain

  v(n) - delta_e(q) == lambda_{e0} + sum_{k=1}^{m_e} lambda_{ek} . f_{ek},  lambda_{ek} >= 0

where the f_{ek} are the faces of the dependence polyhedron P_e. The Farkas lemma is applied to linearize both the legality and the communication volume/reuse distance bounding constraints.

Use PIP to find the lexicographically minimal solution that satisfies the above system, with u and w in the leading positions and the other variables following in any order:

  minimize_< { u_1, u_2, ..., u_k, w, ..., c_i's, ... }
Implications of cost function minimization

u = 0, w = 0: a communication-free parallel hyperplane
u = 0, w = const: constant boundary communication/cache misses
u > 0: a non-constant (large) amount of communication/cache misses

A solution for multiple statements at a level is a fused loop
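The three cases above can be sketched as a toy classifier (our own construction, not from the talk; each dependence's delta_e is represented as a pair (a, b) standing for the affine function a.N + b of the structural parameter N):

```python
# Toy: classify a hyperplane by the tightest bound u*N + w over its
# dependence components delta_e(N) = a*N + b (illustrative encoding).
def classify(deltas):
    u = max(a for a, b in deltas)
    w = max(b for a, b in deltas)
    if u == 0 and w == 0:
        return "communication-free parallel"
    if u == 0:
        return "constant boundary communication"
    return "non-constant communication"

assert classify([(0, 0), (0, 0)]) == "communication-free parallel"
assert classify([(0, 1), (0, 2)]) == "constant boundary communication"
assert classify([(1, 0), (0, 1)]) == "non-constant communication"
print("classification matches the three cases on the slide")
```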
Finding independent solutions iteratively

Once a solution (a hyperplane for all statements) is found, the same formulation is augmented with additional orthogonality constraints

Independent solutions are found one after another until there are enough hyperplanes to scan all statements

Dependences are not removed as we proceed from one hyperplane to another unless they need to be

A hierarchy of permutable loop-nest sets is found

Fully tilable, partially tilable (tiling at an arbitrary level), or inner-tilable loops are identified
Algorithm summary
Dependences are pushed as far inside as possible

Outer loops are space loops with minimal communication

Synchronization-free parallel loops, if they exist, end up first (phi(i') - phi(i) = 0 for such hyperplanes); next, space loops with minimal communication are found

Inner loops are sequential time loops with maximum reuse

Complexity: very fast in practice
Algorithm
Input: generalized dependence graph G = (V, E) (includes dependence polyhedra P_e, e in E); S_max: the statement with maximum domain dimensionality

for each dependence e in E do
  Build legality constraints: apply the Farkas lemma to phi(t') - phi(f_e(t')) >= 0 under t' in P_e, and eliminate all Farkas multipliers
  Build communication volume/reuse distance bounding constraints: apply the Farkas lemma to v(n) - (phi(t') - phi(f_e(t'))) >= 0 under t' in P_e, and eliminate all Farkas multipliers
  Aggregate constraints from both into C_e(i)
end for
repeat
  C = {}
  for each dependence edge e in E do
    C = C union C_e(i)
  end for
  Compute the lexicographically minimal solution with the u coefficients in the leading position followed by w, to iteratively find independent solutions to C (orthogonality constraints are added as each solution is found)
  if no solutions were found then
    Cut dependences between two strongly-connected components in the GDG and insert the appropriate splitter in the transformation matrices of the statements
  end if
  Compute E_c: the dependences carried by the solutions just found; update the necessary dependence polyhedra (when a portion of a dependence is satisfied)
  E = E - E_c; re-form the GDG (V, E)
until H_{S_max} is complete and E is empty

Output: a transformation matrix for each statement (with the same number of rows)
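The iterative loop above can be caricatured as a drastically simplified, brute-force sketch (constant dependence distances only, a tiny exhaustive coefficient search in place of PIP and the Farkas machinery; entirely our own toy, not the actual algorithm): find legal hyperplanes one by one, minimizing the worst dependence component, and add linear-independence constraints until the transform is full rank.

```python
# Toy sketch of the iterative hyperplane search (constant-distance deps,
# exhaustive search over small coefficients in place of PIP/lexmin).
from itertools import product

deps = [(1, 0), (1, 1)]           # illustrative dependence distances
dim = 2

def legal(h):
    return all(sum(a * b for a, b in zip(h, d)) >= 0 for d in deps)

def cost(h):                      # bound: worst component across deps
    return max(sum(a * b for a, b in zip(h, d)) for d in deps)

def independent(h, rows):         # 2-d case: reject scalar multiples
    return all(h[0] * r[1] - h[1] * r[0] != 0 for r in rows)

rows = []
while len(rows) < dim:
    cands = [h for h in product(range(-2, 3), repeat=dim)
             if any(h) and legal(h) and independent(h, rows)]
    rows.append(min(cands, key=cost))
print(rows)   # [(0, 1), (1, -1)]: a full-rank, legal, permutable band
```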
Implementation
Implementation status
Figure: Automatic parallelizer (source-to-source). Nested loop sequences are processed by the LooPo scanner/parser and dependence tester into polyhedra; our affine transformation framework (parallelization + locality optimization) produces statement-wise affine transforms with a tile specifier; a polyhedral/syntactic transformer updates the domains with supernodes and scatterings; CLooG generates target code, annotated (OpenMP) and made compilable with gcc/icc.

The framework is implemented with PipLib and PolyLib, and interfaced with LooPo (Univ. of Passau, Germany) and CLooG (INRIA Futurs) at either end.

C/Fortran to transformed code in a completely automatic fashion.
Example: imperfectly nested 2-d Jacobi
CONSTANT n;

DO t = 1, n
  DO i = 2, n-1
    DO j = 2, n-1
      b[i,j] = a[i-1,j] + a[i,j] + a[i+1,j] + a[i,j-1] + a[i,j+1];
    END DO
  END DO
  DO k = 2, n-1
    DO l = 2, n-1
      a[k,l] = b[k,l];
    END DO
  END DO
END DO

forallpp (c1=1;c1
Example: TCE 4-index transform
DO a = 1, N
 DO q = 1, N
  DO r = 1, N
   DO s, p = 1, N
    T1[a,q,r,s] = T1[a,q,r,s] + A[p,q,r,s]*C4[p,a]
DO a = 1, N
 DO b = 1, N
  DO r = 1, N
   DO s, q = 1, N
    T2[a,b,r,s] = T2[a,b,r,s] + T1[a,q,r,s]*C3[q,b]
DO a = 1, N
 DO b = 1, N
  DO c = 1, N
   DO s, r = 1, N
    T3[a,b,c,s] = T3[a,b,c,s] + T2[a,b,r,s]*C2[r,c]
DO a = 1, N
 DO b = 1, N
  DO c = 1, N
   DO d, s = 1, N
    B[a,b,c,d] = B[a,b,c,d] + T3[a,b,c,s]*C1[s,d]

Transformed:

DOALL c1=1, M
 DO c2=1, M
  DOALL c4=1, M
   DOALL c5=1, M
    DO c7=1, M
     S1(a=c1, q=c5, r=c4, s=c2, p=c7)
    DO c7=1, M
     S2(a=c1, b=c7, r=c4, s=c2, q=c5)
  DOALL c4=1, M
   DOALL c5=1, M
    DO c7=1, M
     S3(a=c1, b=c5, c=c4, s=c2, r=c7)
    DO c7=1, M
     S4(a=c1, b=c5, c=c4, d=c7, s=c2)

Figure: TCE 4-index transform
Running time
Code          Statements   Loops   Deps   Running time
2-d Jacobi    2            6       20     0.05s
Haar 1-d      3            5       12     0.018s
LU            2            5       10     0.022s
TCE 4-index   4            20      15     0.20s
Swim          58           160     639    20.9s

Table: Transformation tool running time
Target architectures

The transformation framework can be targeted towards multiple accelerators

Different kinds, levels, and degrees of parallelism are required in each case
Experimental results
Experimental Setup
Intel Core 2 Quad Q6600, 2.4 GHz (quad core with shared L2 cache), FSB 1066 MHz, DDR2-667 RAM

32 KB L1 cache, 8 MB L2 cache (4 MB per core pair)

ICC 10.1 (-fast)

GCC 4.1.2 (-O3 -ftree-vectorize -mtune=core2 -msse2 -funroll-loops)

Linux 2.6.18 x86-64

All optimized codes were obtained fully automatically through the PLuTo system.
Imperfectly nested 1-d Jacobi stencil (with icc)

Figure: Imperfectly nested 1-d Jacobi stencil. (a) Single core, T = 10^4: speedup over the native compiler (1x-8x axis) vs. N (space extent, 500000 to 4M), for Our approach, Lim/Lam, scheduling-based (time tiling), and icc -fast. (b) Multi-core parallel, N = 10^6, T = 10^5: GFLOPs (0-10 axis) vs. number of cores (1-4), for Our (tiled parallel), Lim/Lam affine partitioning, scheduling-based (time tiling), and icc -parallel -fast.
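The kind of time-tiled code behind these numbers can be sketched by hand for the 1-d case (our own simplified construction, not PLuTo output): the hyperplanes (1,0) and (1,1) are both legal for the Jacobi dependences, so the skewed space (t, s = t + i) is fully permutable, rectangular tiles in it can be executed in lexicographic order, and the result matches the untiled sweep.

```python
# Sketch (our simplification, not PLuTo output): time-tiling the 1-d Jacobi
# stencil using the legal hyperplane band (1,0), (1,1). The skewed
# coordinate s = t + i makes every dependence component non-negative,
# so rectangular tiles in (t, s) can run in lexicographic order.
T, N = 6, 16
St, Ss = 2, 4                      # tile sizes (arbitrary choices)
init = [(3 * i + 1) % 7 for i in range(N)]

# Reference: untiled sweep, double-buffered, boundaries held fixed.
a = init[:]
for t in range(1, T + 1):
    b = a[:]
    for i in range(1, N - 1):
        b[i] = a[i - 1] + a[i] + a[i + 1]
    a = b

# Tiled: execute points (t, i) tile by tile over (t, s = t + i).
val = {(0, i): init[i] for i in range(N)}
for tt in range(0, T + 1, St):
    for ss in range(0, T + N, Ss):
        for t in range(max(tt, 1), min(tt + St, T + 1)):
            for s in range(ss, ss + Ss):
                i = s - t
                if not 0 <= i < N:
                    continue
                if i in (0, N - 1):          # boundary copy
                    val[(t, i)] = val[(t - 1, i)]
                else:
                    val[(t, i)] = (val[(t - 1, i - 1)] + val[(t - 1, i)]
                                   + val[(t - 1, i + 1)])

assert [val[(T, i)] for i in range(N)] == a
print("tiled execution matches the untiled reference")
```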
2-d FDTD

Figure: 2-d FDTD. (a) Single core, T = 500: speedup over the native compiler (1x-6x axis) vs. N (problem size, 500 to 4000), for Our approach, icc -fast / scheduling-based (space tiling), and scheduling-based (time tiling). (b) Parallel, nx = ny = 2000, tmax = 500: GFLOPs (0-8 axis) vs. number of cores (1-4), for Our (1-d pipelined parallel), Our (2-d pipelined parallel), scheduling-based (time tiling), and icc -parallel -fast / scheduling-based space tiling.
LU decomposition

Figure: LU decomposition. (a) Single core (L1 and L2 tiled): GFLOPs (0-4 axis) vs. problem size (1K to 8K), for Our (tiled), scheduling-based (time tiling), and icc -parallel -fast. (b) On a quad core, N = 8000: GFLOPs (0-9 axis) vs. number of cores (1-4), for Our (2-d pipelined parallel + tiled), Our (1-d pipelined parallel), scheduling-based (time tiling), and icc -parallel -fast.
MVT

Figure: MVT performance on a quad core, N = 12000: GFLOPs (0-3.5 axis) vs. number of cores (1-4), for Our (+ syntactic post-processing), Our (1-d pipelined parallel, fused (ij/ji)), scheduling-based / Lim-Lam (fused (ij/ij), i parallel), and gcc -O3.
Related and Future work
Related work
Schedule-based approaches [Feautrier92, Darte/Vivien95, Griebl04 habilitation thesis]

Affine partitioning [Lim/Lam PoPL97, ICS01]: minimizing the order of synchronization does not help; Ahmed/Pingali [IJPP01] (method incomplete due to an infeasible heuristic)

Semi-automatic: URUK/WRAP-IT [Cohen05ICS, Girbal06IJPP], a very powerful, flexible application of transformations specified manually by an expert

Step-wise completion of transformation matrices was addressed for some restricted inputs in the context of FPGA auto-transformation [Bondhugula PPoPP07]

Automatically finding good transforms for sequences of imperfectly nested loops, for coarse-grained parallelism and locality, had not previously been attempted
Future work
Integrating with full-fledged frontends: GCC's GRAPHITE project (INRIA Futurs)

Fusion/parallelization trade-off (which dependences between strongly-connected components to cut or include)

Combining with stronger cost models for tile size selection

Interaction with auto-vectorization and prefetching
Acknowledgments
Cedric Bastoul (INRIA Futurs) for CLooG, and all other contributors to CLooG and PipLib

Martin Griebl (FMI, Universität Passau, Germany) and team for LooPo
Farkas Lemma
Let the hyperplanes bounding the polytope of statement S_k be given by:

  a_{S,k} . (i, n)^T + b_{S,k} >= 0,  k = 1, ..., m_S

where n is a vector of the structural parameters. A well-known result useful in the context of the polytope model is the affine form of the Farkas lemma.

Lemma (affine form of the Farkas lemma). Let D be a non-empty polyhedron defined by p affine inequalities, or faces, a_k.x + b_k >= 0, k = 1, ..., p. Then an affine form psi is non-negative everywhere in D iff it is a non-negative affine combination of the faces:

  psi(x) == lambda_0 + sum_k lambda_k (a_k.x + b_k),  lambda_k >= 0
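A numeric sanity check of the lemma (our own toy instance, not from the talk): on the triangle i >= 0, j >= 0, N - i - j >= 0, the form psi(i, j) = N - i is non-negative, and psi = 0 + 0.(i) + 1.(j) + 1.(N - i - j) exhibits the Farkas multipliers explicitly.

```python
# Sketch: verifying the affine form of the Farkas lemma on a toy triangle
# { (i, j) : i >= 0, j >= 0, N - i - j >= 0 } (our own example).
N = 5
faces = [lambda i, j: i,            # face 1: i >= 0
         lambda i, j: j,            # face 2: j >= 0
         lambda i, j: N - i - j]    # face 3: N - i - j >= 0
psi = lambda i, j: N - i            # non-negative on the whole triangle
lam0, lam = 0, [0, 1, 1]            # psi = lam0 + 0*f1 + 1*f2 + 1*f3

for i in range(N + 1):
    for j in range(N + 1 - i):      # integer points of the triangle
        combo = lam0 + sum(l * f(i, j) for l, f in zip(lam, faces))
        assert psi(i, j) == combo   # identical everywhere in D
        assert psi(i, j) >= 0
print("Farkas decomposition verified on all integer points")
```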
Affine schedules as partitioning hyperplanes

Let there be a dependence from statement instance p of S_i to instance q of S_j, corresponding to an edge e of the GDG of a program. After exact dependence analysis, we obtain

  p = f_e(q),  q in P_e

where P_e is as defined previously, and f_e is an affine function also called the h-transformation.

Lemma. Let phi_{s_i} be a one-dimensional affine transform for S_i. For phi_{s_1}, phi_{s_2}, ..., phi_{s_k} to be a coordinated valid tiling hyperplane for all statements, the following should hold for each edge e from S_i to S_j:

  phi_{s_j}(q) - phi_{s_i}(p) >= 0,  for all (p, q) in R_e
  phi_{s_j}(q) - phi_{s_i}(f_e(q)) >= 0,  for all q in P_e