CR18: Advanced Compilers
L06: Code Generation
Tomofumi Yuki
Code Generation
Completing the transformation loop
Problem: how do we generate code to scan a polyhedron? A union of polyhedra? How do we generate tiled code? How do we generate parametrically tiled code?
Evolution of Code Gen
- Ancourt & Irigoin 1991: single polyhedron scanning
- LooPo (Griebl & Lengauer 1996): first step toward unions of polyhedra; scan the bounding box + guards
- Omega Code Gen 1995: generate inefficient code (convex hull + guards), then try to remove the inefficiencies
Evolution of Code Gen
- LoopGen (Quilleré-Rajopadhye-Wilde 2000): efficiently scanning unions of polyhedra
- CLooG (Bastoul 2004): improvements to the QRW algorithm; robust and well-maintained implementation
- AST Generation (Grosser 2015): polyhedral AST generation is more than scanning polyhedra; scanning is not enough!
Scanning a Polyhedron
Scanning Polyhedra with DO Loops [1991]
Problem: generate the bounds of each loop
- outermost loop: constants and parameters only
- inner loops: may also use the surrounding iterators
Approach: Fourier-Motzkin elimination, projecting out variables one by one
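The projection step can be sketched in a few lines; the constraint encoding and helper below are my own illustration, not from the lecture. Eliminating j from the triangle of the next slide leaves exactly the bounds of the outer i loop.

```python
# Hypothetical sketch of Fourier-Motzkin elimination on 2-D constraints.
# Each constraint (a, b, c) encodes a*i + b*j + c >= 0.

def eliminate_j(constraints):
    """Project out j: pair every lower bound on j with every upper bound."""
    lowers = [c for c in constraints if c[1] > 0]   # b > 0: lower bound on j
    uppers = [c for c in constraints if c[1] < 0]   # b < 0: upper bound on j
    out = [c for c in constraints if c[1] == 0]     # constraints without j
    for (al, bl, cl) in lowers:
        for (au, bu, cu) in uppers:
            # bl*(upper) + (-bu)*(lower) cancels the j terms
            out.append((bl * au - bu * al, 0, bl * cu - bu * cl))
    return out

# The triangle {i <= N, j >= 0, i - j >= 0} with N fixed to 5:
triangle = [(-1, 0, 5), (0, 1, 0), (1, -1, 0)]
i_bounds = eliminate_j(triangle)   # constraints on the outer loop i only
```

Here the projection yields 0 <= i <= N, the outer-loop bounds of the scan.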
Single Polyhedron Example
What is the loop nest for a lexicographic scan?
(figure: the triangle i≤N, j≥0, i-j≥0 in the i-j plane)

for (i = 0 .. N)
  for (j = 0 .. i)
    S;
Single Polyhedron Example
What is the loop nest for the permuted case, with j as the outer loop?
(figure: the same triangle i≤N, j≥0, i-j≥0)

for (j = 0 .. N)
  for (i = j .. N)
    S;
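The two nests scan the same set of points in different orders; a minimal simulation with N fixed to 6 (an assumed concrete value):

```python
# Both loop nests scan the triangle {[i,j] : i <= N, 0 <= j <= i};
# only the visiting order differs. N = 6 is an arbitrary concrete choice.
N = 6

# i as the outer loop: for i = 0..N, for j = 0..i
pts_i_outer = [(i, j) for i in range(N + 1) for j in range(i + 1)]

# j as the outer loop: for j = 0..N, for i = j..N
pts_j_outer = [(i, j) for j in range(N + 1) for i in range(j, N + 1)]
```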
8
Scanning Unions of Polyhedra Consider scanning two statements
Naïve approach: bounding box
S1: [N]->{ S1[i]->[i] : 0≤i<N }S2: [N]->{ S2[i]->[i+5] : 0≤i<N }
S1: [N]->{ [i] : 0≤i≤N }S2: [N]->{ [i] : 5≤i≤N+5 }
CoB
for (i=0 .. N+5) if (0<=i && i<=N) S1; if (5<=i && i<=N+5) S2;
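The guarded bounding-box scan can be checked against the two domains directly; N is fixed to 4 here (an arbitrary choice):

```python
# Simulate the bounding-box code for a concrete N.
N = 4
visited = []
for i in range(0, N + 5 + 1):          # for (i = 0 .. N+5)
    if 0 <= i <= N:
        visited.append(("S1", i))
    if 5 <= i <= N + 5:
        visited.append(("S2", i))

# The domains after the change of basis
want = {("S1", i) for i in range(0, N + 1)} | \
       {("S2", i) for i in range(5, N + 6)}
```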
Slightly Better than BBox
Make the domains disjoint.
But this is also problematic: code size can grow quickly.

for (i = 0 .. 4)      S1;
for (i = 5 .. N)      S1; S2;
for (i = N+1 .. N+5)  S2;
QRW Algorithm
Key: recursive splitting. Given a set of n-D domains to scan, start at d=1 with context = parameters:
1. Restrict the domains to the context
2. Project the domains onto the outer d dimensions
3. Make the projections disjoint
4. Recurse for each disjoint projection, with d=d+1 and context = a piece of the projection
5. Sort the resulting loops
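The recursion can be illustrated on explicit integer point sets. This toy version (names and representation are mine; real implementations operate on polyhedra, not enumerated points) splits per coordinate value, so the pieces are trivially disjoint:

```python
# Toy QRW-style recursive splitting over explicit point sets.
# domains maps a statement name to its set of integer tuples.

def scan(domains, dim, prefix=()):
    """Emit (stmt, point) pairs in lexicographic order."""
    if len(prefix) == dim:
        # Innermost level: emit every statement whose domain has the point.
        return [(s, prefix) for s, pts in sorted(domains.items())
                if prefix in pts]
    # Steps 1-2: restrict to the context (prefix) and project onto the
    # next dimension; distinct values give disjoint pieces.
    values = sorted({p[len(prefix)] for pts in domains.values()
                     for p in pts if p[:len(prefix)] == prefix})
    # Steps 3-5: recurse into each piece, visiting pieces in sorted order.
    out = []
    for v in values:
        out.extend(scan(domains, dim, prefix + (v,)))
    return out
```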
Example
Scan the following domains (two 2-D domains S1 and S2 in the i-j plane, shown as a figure).

d=1, context = universe
- Step 1: Projection
- Step 2: Separation
- Step 3: Recurse
- Step 4: Sort

Separation at d=1 leaves two disjoint pieces:
for (i=0..1) ...
for (i=2..6) ...
Example
d=2, context = 0≤i≤2
Recursing into the first piece gives:
for (i=0..1)
  for (j=0..4)
    S1;
Example
d=2, context = 2≤i≤6
Recursing into the second piece yields four inner loops (L1..L4 in the figure); after sorting:
for (i=2..6)
  L2
  L1
  L3
  L4
CLooG: Chunky Loop Generator
A few problems in the QRW algorithm:
- high complexity
- code size is not controlled
CLooG uses:
- pattern matching to avoid costly polyhedral operations during separation
- may stop the recursion at some depth and generate loops with guards, to reduce code size
Tiled Code Generation
- Tiling with fixed size: we did this earlier
- Tiling with parametric size: problem: non-affine!
Tiling Review
What does the tiled code look like, with tile size ts?

for (i=0; i<=N; i++)
  for (j=0; j<=N; j++)
    S;

Variant 1: tile loops step by ts.
for (ti=0; ti<=N; ti+=ts)
  for (tj=0; tj<=N; tj+=ts)
    for (i=ti; i<min(N+1,ti+ts); i++)
      for (j=tj; j<min(N+1,tj+ts); j++)
        S;

Variant 2: tile loops count tiles.
for (ti=0; ti<=floor(N/ts); ti++)
  for (tj=0; tj<=floor(N/ts); tj++)
    for (i=ti*ts; i<min(N+1,(ti+1)*ts); i++)
      for (j=tj*ts; j<min(N+1,(tj+1)*ts); j++)
        S;
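Both tiled variants visit exactly the points of the original nest; a quick check with concrete sizes (N=10, ts=4, arbitrary choices):

```python
# Compare the two tiled variants against the original i-j nest.
N, ts = 10, 4
original = [(i, j) for i in range(N + 1) for j in range(N + 1)]

# Variant 1: tile loops step by ts
v1 = [(i, j)
      for ti in range(0, N + 1, ts)
      for tj in range(0, N + 1, ts)
      for i in range(ti, min(N + 1, ti + ts))
      for j in range(tj, min(N + 1, tj + ts))]

# Variant 2: tile loops count tiles, floor(N/ts)+1 of them per dimension
v2 = [(i, j)
      for ti in range(N // ts + 1)
      for tj in range(N // ts + 1)
      for i in range(ti * ts, min(N + 1, (ti + 1) * ts))
      for j in range(tj * ts, min(N + 1, (tj + 1) * ts))]
```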
Two Approaches
- Use fixed-size tiling: if the tile size is a constant, everything stays affine; a pragmatic choice made by many tools
- Use non-polyhedral code generation: much better for tuning tile sizes; makes sense for semi-automatic tools
Difficulties in Tiled Code Gen
This is still a very simplified view. In practice, we tile after transformations (skewing, etc.).
Let's look at the tiled iteration space with tvis.

for (ti=0; ti<=N; ti+=ts)
  for (tj=0; tj<=N; tj+=ts)
    for (i=ti; i<min(N+1,ti+ts); i++)
      for (j=tj; j<min(N+1,tj+ts); j++)
        S;
Full Tiles, Inset / Outset
Partial tiles have a lot of control overhead.
Challenges for parametric tiled code gen:
- make sure to scan the outset
- but also separate out the inset
- use efficient point loops for the inset
All without polyhedral analysis.
Point Loops for Full/Partial Tiles
Full tile point loop:
for (i=ti; i<ti+si; i++)
  for (j=tj; j<tj+sj; j++)
    for (k=tk; k<tk+sk; k++)
      ...

Partial/empty tile point loop:
for (i=max(ti,...); i<min(ti+si,...); i++)
  for (j=max(tj,...); j<min(tj+sj,...); j++)
    for (k=max(tk,...); k<min(tk+sk,...); k++)
      if (...) ...
Progression of Parametric Tiling
- Perfectly nested, single loop: TLoG [Renganarayana et al. 2007]
- Multiple levels of tiling: HiTLoG [Renganarayana et al. 2007], PrimeTile [Hartono 2009]
- Parallelizing the tiles: DynTile [Hartono 2010], D-Tiling [Kim 2011]
Computing the Outset
We start with some domain and expand it in each dimension by (symbolic) tile size - 1, except at the upper bounds:

{[i,j]: 0≤i≤10 and i≤j≤i+10}
becomes
{[i,j]: -(ts-1)≤i≤10 and -(ts-1)+i≤j and -(ts-1)+j≤i+10}
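The point of the outset is that any tile touching the domain has its origin inside it. A membership check of that property with a concrete tile size (ts=3, an arbitrary choice):

```python
# Outset property check for the domain above.
ts = 3  # concrete tile size for the check

def in_domain(i, j):
    return 0 <= i <= 10 and i <= j <= i + 10

def in_outset(i, j):
    # the expanded constraints from the slide
    return (-(ts - 1) <= i <= 10
            and i - (ts - 1) <= j
            and j - (ts - 1) <= i + 10)
```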
Computing the Inset
We start with some domain and shrink it in each dimension by (symbolic) tile size - 1, except at the lower bounds:

{[i,j]: 0≤i≤10 and i≤j≤i+10}
becomes
{[i,j]: 0≤i≤10-(ts-1) and i≤j-(ts-1) and j≤i+10-(ts-1)}
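Dually, a tile whose origin lies in the inset is a full tile: every point of the tile is inside the domain. Checked here with ts=3 (an arbitrary concrete choice):

```python
# Inset property check for the domain above.
ts = 3

def in_domain(i, j):
    return 0 <= i <= 10 and i <= j <= i + 10

def in_inset(i, j):
    # the shrunken constraints from the slide
    return (0 <= i <= 10 - (ts - 1)
            and i <= j - (ts - 1)
            and j <= i + 10 - (ts - 1))
```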
Syntactic Manipulation
We cannot use polyhedral code generators, so we are back to modifying the AST:
- modify the loop bounds to get loops that visit the outset
- add guards to switch between point loops
Everything up to here is HiTLoG/PrimeTile.
Problem: Parallelization
After tiling, there is parallelism, but it requires skewing the tiles.
We need non-polyhedral skewing. The key equation (shown as a figure in the slides) relates the wave-front time to the tile origins and tile sizes, where:
- d: number of tiled dimensions
- ti: tile origins
- ts: tile sizes
D-Tiling
The equation enables skewing of tiles: if either the time or one of the tile origins is unknown, it can be computed from the others.

Generated code (tix is the (d-1)-th tile origin):
for (time=start:end)
  for (ti1=ti1LB:ti1UB)
    …
    for (tix=tixLB:tixUB) {
      tid = f(time, ti1, …, tix); // compute tile ti1,ti2,…,tix,tid
    }
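The slides leave f as a figure. One common tile wave-front, assumed here for illustration (not taken from the slides), is time = Σ_k ti_k/ts_k with each tile origin a multiple of its tile size; under that skewing the missing origin is recoverable from the time and the others:

```python
# Assumed skewing: time = sum over dimensions of ti_k / ts_k,
# with each tile origin ti_k a multiple of its tile size ts_k.

def tile_time(origins, sizes):
    """Wave-front time of a tile under the assumed skewing."""
    return sum(t // s for t, s in zip(origins, sizes))

def recover_last(time, outer_origins, sizes):
    """Play the role of f(time, ti1, ..., tix): the missing last origin."""
    outer = sum(t // s for t, s in zip(outer_origins, sizes))
    return (time - outer) * sizes[-1]
```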
Distributed Memory Parallelization
Problems handled implicitly by shared memory now need explicit treatment.
Communication:
- Which processors need to send/receive?
- Which data to send/receive?
- How to manage communication buffers?
Data partitioning:
- How do you allocate memory across nodes?
MPI Code Generator
Distributed memory parallelization:
- tiling based
- parameterized tile sizes
- C+MPI implementation
Uniform dependences as the key enabler:
- many affine dependences can be uniformized
- shared memory performance carries over to distributed memory
- scales as well as PLuTo, but across multiple nodes
Related Work (Polyhedral)
Polyhedral approaches:
- initial idea [Amarasinghe1993]
- analysis for fixed-size tiling [Claßen2006]
- further optimization [Bondhugula2011]
"Brute force" polyhedral analysis for handling communication:
- no hope of handling parametric tile sizes
- can handle arbitrary affine programs
Outline
- Introduction
- "Uniform-ness" of Affine Programs
  - Uniformization
  - Uniform-ness of PolyBench
- MPI Code Generation
  - Tiling
  - Uniform-ness simplifies everything
  - Comparison against PLuTo with PolyBench
- Conclusions and Future Work
Affine vs Uniform
Affine dependences: f = Ax+b. Examples:
- (i,j->j,i)
- (i,j->i,i)
- (i->0)
Uniform dependences: f = Ix+b. Examples:
- (i,j->i-1,j)
- (i->i-1)
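A dependence f(x) = Ax + b is uniform exactly when its linear part A is the identity; a small checker over the slide's examples (the matrix encodings are mine):

```python
def is_uniform(A):
    """True when the linear part A is the identity matrix."""
    n = len(A)
    return all(A[r][c] == (1 if r == c else 0)
               for r in range(n) for c in range(n))

# (i,j->j,i): swaps coordinates; affine but not uniform
swap = [[0, 1], [1, 0]]
# (i,j->i-1,j): A = I with b = (-1, 0); uniform
shift = [[1, 0], [0, 1]]
```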
Uniformization
Example: the affine dependence (i->0) is uniformized into the uniform dependence (i->i-1), propagating the value step by step.
Uniformization
Uniformization is a classic technique:
- "solved" in the 1980s
- has been "forgotten" in the multi-core era
Any affine dependence can be uniformized by adding a dimension [Roychowdhury1988].
Nullspace pipelining:
- simple technique for uniformization
- many dependences are uniformized
Uniformization and Tiling
Uniformization does not influence tilability.
PolyBench [Pouchet2010]
Collection of 30 polyhedral kernels, proposed by Pouchet as a benchmark for polyhedral compilation.
Goal: a benchmark small enough that individual results are reported; no averages.
Kernels from:
- data mining
- linear algebra (kernels, solvers)
- dynamic programming
- stencil computations
Uniform-ness of PolyBench
5 of them are "incorrect" and are excluded.
Embedding: match dimensions of statements.
Phase detection: separate the program into phases, where the output of one phase is used as input to another.

Stage                  Fully uniform programs
Start                  8/25 (32%)
After embedding        13/25 (52%)
After pipelining       21/25 (84%)
After phase detection  24/25 (96%)
Outline
- Introduction
- Uniform-ness of Affine Programs
  - Uniformization
  - Uniform-ness of PolyBench
- MPI Code Generation
  - Tiling
  - Uniform-ness simplifies everything
  - Comparison against PLuTo with PolyBench
- Conclusions and Future Work
Basic Strategy: Tiling
We focus on tilable programs.
Dependences in Tilable Space
All dependences are in the non-positive direction.
Wave-front Parallelization
All tiles with the same color can run in parallel.
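With all dependences in the non-positive direction, scheduling tile (t1, t2) at time t1 + t2 places every producer strictly earlier, so each anti-diagonal of tiles runs in parallel. A small check (the dependence vectors are assumed for illustration):

```python
# Tile-level dependences: all non-positive and nonzero.
deps = [(-1, 0), (0, -1), (-1, -1)]

def wavefront(t1, t2):
    """Wave-front time of tile (t1, t2)."""
    return t1 + t2
```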
Assumptions
- Uniform in at least one of the dimensions; the uniform dimension is made outermost
- Tilable space is fully permutable
- One-dimensional processor allocation
- Large enough tile sizes: dependences do not span multiple tiles
Then, communication is extremely simplified.
Processor Allocation
The outermost tile loop is distributed.
(figure: columns of tiles in the i1-i2 plane assigned to processors P0..P3)
Values to be Communicated
Faces of the tiles (which may be thicker than 1) are communicated.
(figure: tile faces in the i1-i2 plane, processors P0..P3)
Naïve Placement of Send and Receive Codes
The receiver is the consumer tile of the values.
(figure: sends (S) and matching receives (R) across processors P0..P3 in the i1-i2 plane)
Problems in Naïve Placement
The receiver is in the next wave-front time.
(figure: wave-front times t=0..3 across P0..P3, each send (S) matched by a receive (R) one time step later)
Problems in Naïve Placement
- The receiver is in the next wave-front time
- Number of communications "in flight" = amount of parallelism
- MPI_Send will deadlock: it may not return control if the system buffer is full
- Asynchronous communication is required: you must manage your own buffer, and the required buffer = amount of parallelism, i.e., the number of virtual processors
Proposed Placement of Send and Receive Codes
The receiver is one tile below the consumer.
(figure: sends (S) and receives (R) now in the same wave-front time across P0..P3)
Placement within a Tile
Naïve placement: Receive -> Compute -> Send.
Proposed placement:
1. Issue asynchronous receive (MPI_Irecv)
2. Compute
3. Issue asynchronous send (MPI_Isend)
4. Wait for values to arrive
This overlaps computation and communication, and needs only two buffers (send and receive) per physical processor.
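A toy cost model (all numbers made up) of why the proposed placement helps: the transfers proceed while the tile computes, so each wait only pays the communication time not already hidden behind the compute:

```python
# Made-up per-tile costs for a back-of-the-envelope comparison.
comm, compute = 5.0, 10.0

# Naive: blocking Receive -> Compute -> Send, fully sequential.
naive = comm + compute + comm

# Proposed: MPI_Irecv, compute, MPI_Isend, wait; both transfers overlap
# the compute, so only the excess beyond the compute time is exposed.
proposed = compute + 2 * max(0.0, comm - compute)
```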
Evaluation
Compare performance with PLuTo (shared memory version with the same strategy).
- Cray: 24 cores per node, up to 96 cores
- Goal: similar scaling as PLuTo
- Tile sizes are searched with educated guesses
PolyBench:
- 7 are too small
- 3 cannot be tiled or have limited parallelism
- 9 cannot be used due to a PLuTo/PolyBench issue
Performance Results
- Linear extrapolation from the speed-up at 24 cores
- Broadcast cost at most 2.5 seconds