Polyhedral Compilation as a Design Pattern for Compiler Construction

PLISS, May 19-24, 2019


Polyhedra? Example: Tiles

How many of you read “Design Pattern”? →


Tiles Everywhere

1. Hardware

Example: Google Cloud TPU

Architectural Scalability With Tiling

Tiles Everywhere

1. Hardware

Google Edge TPU

Edge computing zoo

Tiles Everywhere

1. Hardware
2. Data Layout

Example: XLA compiler, Tiled data layout

Repeated/Hierarchical Tiling, e.g., BF16 (bfloat16) on Cloud TPU (should be 8x128 then 2x1)

Tiles Everywhere

1. Hardware
2. Data Layout
3. Control Flow
4. Data Flow
5. Data Parallelism

Example: Halide for image processing pipelines
https://halide-lang.org

Meta-programming API and domain-specific language (DSL) for loop transformations, numerical computing kernels

Tiling in Halide

Tiled schedule: strip-mine (a.k.a. split), then permute (a.k.a. reorder)

Vectorized schedule: strip-mine, then vectorize the inner loop

Non-divisible bounds/extent: strip-mine, then shift the last strip left/up, at the cost of redundant computation (also forward substitute/inline operand)
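The strip-mining with a shifted last strip described above can be sketched in plain Python. This is an illustration only, not Halide code: the function name `strip_mine_shifted` and the sizes are hypothetical, and the sketch assumes the extent is at least one full strip.

```python
def strip_mine_shifted(extent, factor):
    """Return the (outer, inner, x) triples visited by a split loop.

    When `extent` is not a multiple of `factor`, the last strip is shifted
    left so that it stays in bounds; some x values are then computed twice.
    """
    assert extent >= factor, "this sketch assumes at least one full strip"
    visited = []
    n_strips = (extent + factor - 1) // factor  # ceiling division
    for xo in range(n_strips):
        base = min(xo * factor, extent - factor)  # shift the last strip left
        for xi in range(factor):
            visited.append((xo, xi, base + xi))
    return visited


points = strip_mine_shifted(extent=10, factor=4)
xs = [x for _, _, x in points]
print(sorted(set(xs)) == list(range(10)))  # every x is covered
print(len(xs) - len(set(xs)))              # number of recomputed points
```

Every inner loop has exactly `factor` iterations, which is what makes it easy to vectorize; the price is the recomputation of the overlapping points.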


And also TVM for neural networks
https://tvm.ai

TVM example: scan cell (RNN)

import tvm

num_thread = 256  # assumed value; num_thread is not defined on the slide

m = tvm.var("m")
n = tvm.var("n")
X = tvm.placeholder((m, n), name="X")
s_state = tvm.placeholder((m, n))
s_init = tvm.compute((1, n), lambda _, i: X[0, i])
s_do = tvm.compute((m, n), lambda t, i: s_state[t-1, i] + X[t, i])
s_scan = tvm.scan(s_init, s_do, s_state, inputs=[X])

s = tvm.create_schedule(s_scan.op)

# Schedule to run the scan cell on a CUDA device
block_x = tvm.thread_axis("blockIdx.x")
thread_x = tvm.thread_axis("threadIdx.x")
xo, xi = s[s_init].split(s_init.op.axis[1], factor=num_thread)
s[s_init].bind(xo, block_x)
s[s_init].bind(xi, thread_x)
xo, xi = s[s_do].split(s_do.op.axis[1], factor=num_thread)
s[s_do].bind(xo, block_x)
s[s_do].bind(xi, thread_x)
print(tvm.lower(s, [X, s_scan], simple_mode=True))


https://docs.tvm.ai/api/python/tvm.html#tvm.var


https://docs.tvm.ai/api/python/tvm.html#tvm.placeholder


https://docs.tvm.ai/api/python/tvm.html#tvm.compute


https://docs.tvm.ai/api/python/tvm.html#tvm.scan

https://docs.tvm.ai/api/python/schedule.html#tvm.create_schedule

https://docs.tvm.ai/api/python/tvm.html#tvm.thread_axis


https://docs.tvm.ai/api/python/build.html#tvm.lower

Tiling and Beyond

1. But what about symbolic bounds, sizes, shapes?
2. Other transformations: fusion, fission, pipelining, unrolling… ?
3. Composition with other transformations and mapping decisions?
4. Consistency with … ?
5. Evaluating cost functions, enforcing resource constraints?

→ Impact on compiler construction, intermediate representations, program analyses and transformations?

→ Polyhedral Compilation as a Design Pattern

● Tiles tend to be hyper-rectangles and occasionally parallelograms, trapezoids

● Compose tile patterns with fusion, fission, pipelining and nested tile patterns

More generally: Polyhedral Compilation = a geometric, affine, periodic view of program transformations
along time: sequence, dependences
and space: parallelism, memory, processors

Polyhedral Compilation in a Nutshell

with Alex Zinenko

Based on “A Performance Vocabulary for Affine Loop Transformations” by Martin Kong, Louis-Noel Pouchet; and a dozen other papers

● Multi-level parallelism

− CPU — typically 3 levels: system threads or finer grain tasks, vectors, instruction-level parallelism

− GPU — 2 to 8 levels: work groups, work items, warps and vectors, instruction-level parallelism

and related features on other HW accelerators

● Deep memory hierarchies

− Positive effects: temporal and spatial locality, coalescing, latency hiding through multithreading

− Negative effects: cache conflicts, false sharing

− Many other concerns: capacity constraints, alignment, exposed pipelines

Architectural Effects to Consider

Architectural Effects to Consider: Memory Coalescing

[Figure: memory coalescing; labels: Lines, Line]

● Need a program representation to reason about individual array elements, individual iterations, relations among these, and with hardware resources
  ○ Programming languages may provide high level abstractions for nested loops and arrays, tensor algebra, graphics...
  ○ The need for performance portability leads to domain-specific approaches. E.g., for ML high-performance kernels alone: XLA, TVM, Tensor Comprehensions, Glow, Tiramisu, etc.

● Yet few compiler intermediate representations reconcile these with
  1. the ability to model hardware features
  2. while capturing complex transformations
  3. supporting both general-purpose and domain-specific optimizers

https://www.tensorflow.org/xla/

https://tvm.ai/

https://pytorch.org/blog/tensor-comprehensions/

https://facebook.ai/developers/tools/glow

http://tiramisu-compiler.org/

Generic Loop Nest Optimizer

(this is a hammer)

E.g., Intel ICC, Pluto, PPCG, LLVM/Polly


(these are not nails)

(this is a hammer)

Domain-Specific Optimizer and Code Generator

E.g., XLA, Halide, Polymage



Polyhedral Framework = semantical and algorithmic design pattern for multi-purpose representation, analysis, transformation, optimization, code generation

Polyhedral Representation Example: 2D convolution

for (int i = 0; i < H - KH; ++i)
  for (int j = 0; j < W - KW; ++j) {
    C[i][j] = 0.;                           // Statement S1
    for (int k = 0; k < KH; ++k)
      for (int l = 0; l < KW; ++l)
        C[i][j] += A[i+k][j+l] * M[k][l];
  }

/* i=0 */ for (int j = 0; j < W - KW; ++j) { C[0][j] = 0.; /*...*/ }
/* i=1 */ for (int j = 0; j < W - KW; ++j) { C[1][j] = 0.; /*...*/ }
/*...*/

Statement Instance

/* i=0 */
  /* and j=0 */ C[0][0] = 0.; // S1(0,0)
  /* and j=1 */ C[0][1] = 0.; // S1(0,1)
  /*...*/
/* i=1 */
  /* and j=0 */ C[1][0] = 0.; // S1(1,0)
  /*...*/
/*...*/

Statement instance = specific execution of a statement in a loop

Iteration Domain

Iteration domain: set of all statement instances

C[0][0] = 0.; C[0][1] = 0.; C[0][2] = 0.; //...

C[1][0] = 0.; C[1][1] = 0.; C[1][2] = 0.; //…

C[2][0] = 0.; C[2][1] = 0.; C[2][2] = 0.; //…

/*...*/

[Figure: iteration domain drawn as points in the (i, j) plane]

for (int i = 0; i < H - KH; ++i)
  for (int j = 0; j < W - KW; ++j) {
    C[i][j] = 0.;
    for (int k = 0; k < KH; ++k)
      for (int l = 0; l < KW; ++l)
        C[i][j] += A[i+k][j+l] * M[k][l];   // S2
  }

Need to represent and operate on “finitely presented” integer sets (and relations), and solve optimization problems:

isl (Sven Verdoolaege): http://repo.or.cz/w/isl.git

D_S2 = (H,W,KH,KW) -> {S2(i,j,k,l): 0 <= i < H-KH && 0 <= j < W-KW && 0 <= k < KH && 0 <= l < KW}
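To make the set notation concrete, the following sketch enumerates D_S2 by brute force for fixed parameter values. This is only an illustration: isl keeps such sets symbolic in the parameters and never enumerates them, and the helper name `domain_S2` and the parameter values are hypothetical.

```python
from itertools import product


def domain_S2(H, W, KH, KW):
    """Enumerate {S2(i,j,k,l) : 0<=i<H-KH, 0<=j<W-KW, 0<=k<KH, 0<=l<KW}."""
    return list(product(range(H - KH), range(W - KW), range(KH), range(KW)))


instances = domain_S2(H=5, W=6, KH=2, KW=3)
print(len(instances))   # (5-2) * (6-3) * 2 * 3 statement instances
print(instances[:3])    # first instances in lexicographic order
```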


Program Transformations

Tiling


May be performed as part of scheduling, or separately from affine scheduling if the scheduler ensures permutability
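As a minimal sketch of rectangular tiling, the code below visits a 2D iteration domain tile by tile and checks that exactly the same instances are executed, only in a different order. The function names and tile sizes are hypothetical, and the sketch assumes the two loops are permutable (no dependence forbids the reordering).

```python
def original_order(N, M):
    return [(i, j) for i in range(N) for j in range(M)]


def tiled_order(N, M, Ti, Tj):
    """Visit the same (i, j) points tile by tile (rectangular tiles)."""
    order = []
    for it in range(0, N, Ti):                        # tile loops
        for jt in range(0, M, Tj):
            for i in range(it, min(it + Ti, N)):      # point loops
                for j in range(jt, min(jt + Tj, M)):
                    order.append((i, j))
    return order


tiled = tiled_order(5, 7, Ti=2, Tj=3)
print(sorted(tiled) == original_order(5, 7))  # same set of instances
print(tiled[:6])                              # the first 2x3 tile
```

The `min` in the point-loop bounds handles partial tiles at the boundary when the extents are not multiples of the tile sizes.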

Loop Fusion and Fission

Look for a single-loop schedule respecting all dependences

Multiple fusion heuristics: min, max, “smart”, “wise”...

Affine Schedules

[Figure: iteration domain with axes i and j, mapped onto a time axis]

Schedules assign logical execution dates to elements of the iteration domain

We are interested* only in affine schedules, that is of the shape

t = c0 + c1 * i1 + c2 * i2 + c3 * i3 + … + k1 * W + k2 * H + k3 * KH + k4 * KW

where c0,c1,c2,...,k1,k2,k3,k4 are constants, and i1,i2,i3… are iteration variables

* We will explain later why

Multi-Dimensional Schedules

To obtain a nested loop structure like

for (int i = 0; i < H - KH; ++i)
  for (int j = 0; j < W - KW; ++j)
    C[i][j] = 0.;

we need to execute the instances C[x][*] after all of C[x-1][*] (and so on, by induction), but

t = (W - KW) * i + j

does not match our affine schedule pattern: the coefficient of i would be (W - KW), which depends on parameters and is therefore not a constant.

Multi-Dimensional Schedules

Solution: use multi-dimensional schedules and execute statement instances following the lexicographical order of their logical dates:

(0,0) << (0,1) << (0,5672658425682435) << (1,0) << (1,1).

We can then get the original order using a two-dimensional schedule

(i,j) → (i,j)
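The lexicographic execution order can be sketched with Python tuples, which compare lexicographically. The parameter values are hypothetical; the point of the sketch is only that sorting the instances by the two-dimensional date (i, j) reproduces the original loop order, while another schedule yields another order.

```python
H, W, KH, KW = 5, 6, 2, 3

# Instances listed in an arbitrary order: a domain is a set, not a sequence.
domain = [(i, j) for j in range(W - KW) for i in range(H - KH)]


def schedule(i, j):
    return (i, j)  # identity two-dimensional schedule


executed = sorted(domain, key=lambda p: schedule(*p))
loop_order = [(i, j) for i in range(H - KH) for j in range(W - KW)]
print(executed == loop_order)

# The schedule (i, j) -> (j, i) yields loop interchange instead.
interchanged = sorted(domain, key=lambda p: (p[1], p[0]))
print(interchanged[:4])
```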


Multi-Statement Schedules

Similar problem exists: for the same i,j, we want to execute all instances of S2 after all instances of S1

for (int i = 0; i < H - KH; ++i)
  for (int j = 0; j < W - KW; ++j) {
    C[i][j] = 0.;                             // S1
    for (int k = 0; k < KH; ++k)
      for (int l = 0; l < KW; ++l)
        C[i][j] += A[i + k][j + l] * M[k][l]; // S2
  }

Note: in this particular case, we can use (i,j,-1,-1) for S1 and (i,j,k,l) for S2 because the lower bounds of k and l are constant. This trick no longer works for bounds that depend on outer iterators.


General solution: introduce auxiliary scalar dimensions to the schedule to separate the statements thanks to the lexicographical order

(i, j, 0, *, *) for S1; (i, j, 1, k, l) for S2.

Any constant can be used in place of *, or these dimensions can be omitted if we extend lexicographical order to vectors of different size with shorter vector preceding longer vectors with the same prefix
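A small sketch of the scalar-dimension trick, using the schedules (i, j, 0, *, *) for S1 and (i, j, 1, k, l) for S2 with 0 substituted for *. The sizes are hypothetical; sorting the dates lexicographically recovers the interleaving of the original loop nest.

```python
KH, KW = 2, 2
pts = [(i, j) for i in range(2) for j in range(2)]

# (date, label) pairs; the third schedule dimension is the scalar one.
events = [((i, j, 0, 0, 0), f"S1({i},{j})") for i, j in pts]
events += [((i, j, 1, k, l), f"S2({i},{j},{k},{l})")
           for i, j in pts for k in range(KH) for l in range(KW)]

trace = [label for _, label in sorted(events)]
print(trace[:6])
```

For each (i, j), the S1 instance comes first, followed by all S2 instances for the same (i, j), and only then the S1 instance of the next (i, j).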

/* Named kernel_2mm here because a C identifier cannot start with a digit. */
void kernel_2mm(double alpha, double beta, double A[NI][NK], double B[NK][NJ],
                double C[NJ][NL], double D[NI][NL]) {
  int i, j, k;
  double tmp[NI][NJ];
  for (i = 0; i < NI; i++)
    for (j = 0; j < NJ; j++) {
S1:   tmp[i][j] = 0.0;
      for (k = 0; k < NK; ++k)
S2:     tmp[i][j] += alpha * A[i][k] * B[k][j];
    }
  for (i = 0; i < NI; i++)
    for (j = 0; j < NL; j++) {
S3:   D[i][j] *= beta;
      for (k = 0; k < NJ; ++k)
S4:     D[i][j] += tmp[i][k] * C[k][j];
    }
}

S1→(0,i,0,0,j)
S2→(0,i,k,1,j)
S3→(1,i,0,0,j)
S4→(1,i,k,1,j)

> 2x faster on 4-core CPU

permute, fuse inner

fuse outer, permute

S1→(0,i,j,1,0)
S2→(1,i,j,0,k)
S3→(0,i,j,0,0)
S4→(1,i,k,1,j)


Schedule Functions or Relations

In the simplest case, we use affine functions to assign logical execution dates, e.g.,

t = f(i, j, k) = i.

In some cases, we want to relax that and use relations constrained by affine inequalities instead. For example, the non-affine function

t1 = g(i, j, k) = floor(i / 42)

can be transformed into an affine relation

{(i, j, k) -> (t1) : 42*t1 <= i <= 42*t1 + 41}
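The equivalence between the floor function and the affine relation can be checked by brute force for non-negative i. This is a sketch only; the helper name `related_dates` and the candidate range are hypothetical.

```python
def related_dates(i, candidates):
    """All t1 with 42*t1 <= i <= 42*t1 + 41 (the affine relation)."""
    return [t1 for t1 in candidates if 42 * t1 <= i <= 42 * t1 + 41]


# For every i, exactly one t1 satisfies the constraints, and it is floor(i / 42).
ok = all(related_dates(i, range(-5, 50)) == [i // 42] for i in range(2000))
print(ok)
```

The two inequalities are affine in both i and t1, which is what lets the polyhedral machinery handle a division by a constant.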

Polyhedral/Affine Scheduling

● Iteratively produce affine schedule functions such that:

● dependence distances are lexicographically positive

● dependence distances are small ⇒ locality

● dependence distances are zero ⇒ parallelism

● dependences have non-negative distances along consecutive dimensions⇒ permutability (which enables tiling)

(0,1,0,0): valid
(0,1,-2,3): also valid
(0,0,-1,42): violated

Generally, dependences = RAW + WAR + WAW
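The conditions on dependence distances can be sketched as two small predicates. The function names are hypothetical; `permutable_band` checks the non-negativity condition only on the listed dimensions.

```python
def lex_positive(d):
    """True if the first non-zero entry of the distance vector d is positive."""
    for x in d:
        if x != 0:
            return x > 0
    return False  # all-zero distance: the two instances are not ordered


def permutable_band(distances, dims):
    """True if every distance is non-negative along each dimension in dims."""
    return all(d[k] >= 0 for d in distances for k in dims)


for d in [(0, 1, 0, 0), (0, 1, -2, 3), (0, 0, -1, 42)]:
    print(d, "valid" if lex_positive(d) else "violated")

# Dimensions 0 and 1 form a permutable band for the two valid distances.
print(permutable_band([(0, 1, 0, 0), (0, 1, -2, 3)], dims=(0, 1)))
```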

Polyhedral/Affine Scheduling

● Iteratively produce affine scheduling functions, one for each statement S and scheduling step k, where
  a, b, d – coefficients
  i – original loop iterators
  P – symbolic parameters

● Minimize the objective under one constraint for every dependence R→S; use the affine form of the Farkas lemma to linearize the inequality

→ Integer Linear Programming (ILP) problem

isl Schedule Trees: Compose w/ Imperative Semantics

Code Generation (simplified)

Given iteration domains and schedules, generate the code that traverses all statement instances in the order defined by the schedules:

- each schedule dimension is (potentially) a loop
- compute loop bounds
- eliminate single-iteration loops and other redundant control flow
- rewrite access subscripts

See [Bastoul 2004], [Vasilache et al. 2006], [Grosser et al. 2015]

Why Affine Schedules?

Code generation requires us to invert the schedule, that is express original loop iterators (i,j,...) in terms of new ones (t…). For example, to rewrite A[i][j].

Please solve:
t0 = i*i*j*N + j*k;
t1 = k*j - i;
t2 = j*k*(i-1);
in integers for i,j,k.

Hilbert’s tenth problem: devise a procedure that decides, for any Diophantine equation, if it has an integer solution. [Hilbert, 1900]
Demonstrated that such a general procedure does not exist [Matiyasevich, 1970].

Why Affine Schedules?

Systems of affine (i.e., linear) Diophantine equations can be solved, e.g., by Gaussian elimination

Systems of affine equations and inequalities can be solved for some optimum using integer linear programming (ILP) or Fourier-Motzkin elimination (FM)

Note: there may be other special cases of Diophantine equations that can be solved, see [Feautrier, 2015], [Yuki, 2019]
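By contrast with the non-linear system above, an affine schedule is easy to invert. The sketch below inverts a hypothetical skewing schedule (i, j) → (t0, t1) = (i + j, j) with exact rational arithmetic and recovers the original iterators from a logical date, which is what code generation needs in order to rewrite a subscript such as A[i][j]. The helper names are illustrative.

```python
from fractions import Fraction


def invert_2x2(T):
    """Exact inverse of a 2x2 integer matrix, or None if it is singular."""
    (a, b), (c, d) = T
    det = a * d - b * c
    if det == 0:
        return None
    return [[Fraction(d, det), Fraction(-b, det)],
            [Fraction(-c, det), Fraction(a, det)]]


T = [[1, 1], [0, 1]]   # skewing schedule: t0 = i + j, t1 = j
Tinv = invert_2x2(T)   # inverse: i = t0 - t1, j = t1


def original_iterators(t0, t1):
    i = Tinv[0][0] * t0 + Tinv[0][1] * t1
    j = Tinv[1][0] * t0 + Tinv[1][1] * t1
    return int(i), int(j)


print(original_iterators(5, 2))  # the instance (i, j) scheduled at date (5, 2)
```

Because this matrix has determinant 1, its inverse is integral and every integer date in the image maps back to an integer iteration.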

Correctness Guarantees

Validity of a Loop Transformation

A code transformation is valid if the transformed code produces the same result as the original code.

If we restrict observable results to memory modifications, the transformed code must write the same data to the same addresses in the same order.

Access Functions

Each statement reads and/or writes to some addresses.

Assuming no aliasing, address = array id + subscripts.

Characterize each access by an affine function:

- arguments: loop induction variables;
- value: vector of array subscripts.
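As a sketch, the accesses A[i+k][j+l] and M[k][l] of the convolution statement S2 can be written as integer matrices applied to the vector (i, j, k, l, 1). The matrix names and the helper `apply_access` are hypothetical.

```python
def apply_access(F, iters):
    """Multiply the access matrix F by the homogeneous vector (iters..., 1)."""
    vec = list(iters) + [1]
    return tuple(sum(c * v for c, v in zip(row, vec)) for row in F)


# Columns: i, j, k, l, constant. Rows: one per array subscript.
ACCESS_A = [[1, 0, 1, 0, 0],   # first subscript:  i + k
            [0, 1, 0, 1, 0]]   # second subscript: j + l
ACCESS_M = [[0, 0, 1, 0, 0],   # first subscript:  k
            [0, 0, 0, 1, 0]]   # second subscript: l

print(apply_access(ACCESS_A, (2, 3, 1, 0)))  # element of A read by S2(2,3,1,0)
print(apply_access(ACCESS_M, (2, 3, 1, 0)))  # element of M read by S2(2,3,1,0)
```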

[Figure: Access Functions. An iteration domain with axes i and j, and an array A[ ]]

Data Dependences

Statement instances that access the same data in some order are dependent.

Define a dependence relation:

- statement instance exists (belongs to its iteration domain)

- two statement instances access the same array element (equality of access functions)

- one of the instances is executed before another (lexicographic order of schedules)
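The three conditions can be checked by brute force on the convolution example, where both S1 and S2 access C[i][j]. This is a sketch under assumed parameter values; real tools compute the relation symbolically, and the helper names are hypothetical.

```python
from itertools import product

H, W, KH, KW = 4, 4, 2, 2
# Condition 1: instances are drawn from the iteration domains.
S1 = [("S1", (i, j)) for i, j in product(range(H - KH), range(W - KW))]
S2 = [("S2", (i, j, k, l))
      for i, j, k, l in product(range(H - KH), range(W - KW), range(KH), range(KW))]


def accessed_C(inst):
    _, it = inst
    return (it[0], it[1])  # both statements access C[i][j]


def date(inst):
    name, it = inst
    if name == "S1":
        return (it[0], it[1], 0, 0, 0)
    return (it[0], it[1], 1, it[2], it[3])


# Condition 2: same array element. Condition 3: source executes first.
deps = [(src, dst) for src in S1 for dst in S2
        if accessed_C(src) == accessed_C(dst) and date(src) < date(dst)]
print(len(deps))  # each S1(i,j) is followed by KH*KW instances of S2
```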

[Figure: Data Dependences. An iteration domain with axes i and j, and an array A[ ]]

Dependence Distance and Satisfaction

Dependence distance: the difference between logical dates of dependent statement instances.

Dependence is satisfied if its distance is lexicographically positive, i.e., the leading non-zero element is positive => guaranteed order of execution.

Correctness guarantee: all dependences are satisfied.

[Figure: Dependence Distance and Satisfaction. Iterations i, j and j+1, with dependences labeled Distance 0 and Distance 1]

Affine Scheduling

Basic Principles of Affine Scheduling

Find coefficients of affine schedules for each statement such that:

- dependences are satisfied (correctness);
- schedule is invertible to unambiguously generate code.

Dependences and invertibility define a space of valid schedules. Explore it:

- as an optimization problem given some objective function (ILP);
- exhaustively or sampling + evaluation (e.g., evolutionary methods).

Ensuring Schedule Invertibility

The matrix of coefficients must be non-singular

Iterative approach:

- look for matrix rows iteratively, in separate ILP problems;
- include constraints that guarantee the matrix still has full row rank

One-shot approach:

- look for the entire matrix of coefficients in a single ILP problem;
- restrict the matrix to forms that are guaranteed to have full row rank
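The full-row-rank condition of the iterative approach can be illustrated with an after-the-fact check. Real schedulers encode the condition as linear constraints inside the ILP; the sketch below instead tests a candidate row with exact Gaussian elimination, and the helper names are hypothetical.

```python
from fractions import Fraction


def rank(rows):
    """Rank of an integer matrix by exact Gaussian elimination."""
    m = [[Fraction(x) for x in r] for r in rows]
    r = 0
    for c in range(len(m[0]) if m else 0):
        pivot = next((p for p in range(r, len(m)) if m[p][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for q in range(r + 1, len(m)):
            f = m[q][c] / m[r][c]
            m[q] = [a - f * b for a, b in zip(m[q], m[r])]
        r += 1
    return r


def keeps_full_row_rank(chosen, candidate):
    return rank(chosen + [candidate]) == len(chosen) + 1


chosen = [[1, 1, 0]]                           # first schedule row: i + j
print(keeps_full_row_rank(chosen, [2, 2, 0]))  # linearly dependent row
print(keeps_full_row_rank(chosen, [0, 1, 0]))  # independent row
```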


Objective Functions

Ingredients of the Objective Function

- Access functions
  - controlling the order of access
  - extended forms, such as “vectorized” access functions

- Dependence distances*
  - dependence satisfaction (positive distance)
  - independence (zero distance = parallelism)
  - proximity (possible reuse)

- Binary decision variables
  - fusion/fission
  - access or dependence properties like 0/1-stride

- Lexicographical order

* cannot use distances directly because they are non-affine; instead, convert them into affine constraints on the schedule coefficients using the Farkas lemma

Some Existing Objective Functions

- Farkas-based scheduler [Feautrier 1992]
  - Iterative; maximize the number of satisfied dependences
  ⇒ validity, inner loop parallelism

- Pluto [Bondhugula et al. 2008]
  - Iterative; minimize the upper bound of the dependence distances
  ⇒ outer loop parallelism + tiling + possible reuse

- Consecutivity [Vasilache et al. 2012]
  - One-shot: maximize the number of stride-0/1 accesses in the last “dimension”
  ⇒ vectorization

- Spatial locality [Zinenko et al. 2018]
  - Iterative: Pluto + maximize the “cache-line-sized” dependence distances
  ⇒ cache locality + possible vectorization

Objective Vocabulary: Parallelism

Encode dependence satisfaction as binary variable (related to distance)

OP: Outer Parallelism

- for the outermost linear dimension, minimize the # of satisfied dependences

IP: Inner Parallelism

- for the innermost linear dimension, minimize the # of satisfied dependences

for (int i = 0; i < NI; ++i)     // distance 0 => parallel
  for (int j = 0; j < NJ; ++j)   // distance * => sequential
    A[i] += 42.;
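The annotations on the loop nest above can be reproduced by brute force: two instances are dependent when they update the same element A[i], and a loop is parallel when every dependence has distance zero along it. The sizes are hypothetical and the enumeration only works for small, fixed bounds.

```python
from itertools import product

NI, NJ = 3, 4
instances = list(product(range(NI), range(NJ)))  # (i, j) instances

# Distance vectors between ordered pairs of instances that touch the same A[i].
distances = {(i2 - i1, j2 - j1)
             for (i1, j1), (i2, j2) in product(instances, repeat=2)
             if (i1, j1) < (i2, j2) and i1 == i2}

i_parallel = all(d[0] == 0 for d in distances)
j_parallel = all(d[1] == 0 for d in distances)
print(sorted(distances))
print(i_parallel, j_parallel)
```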

Objective Vocabulary: Vectorization

Define penalty for each iterator appearing in the fastest-varying array subscript,and not appearing in other subscripts (breaks vectorization).

SO: Stride Optimization

- minimize stride penalty

for (int i = 0; i < NI; ++i)
  for (int j = 0; j < NJ; ++j)
    for (int k = 0; k < NK; ++k) {
      A[i][j] = 0.; // reuse
      B[i][k] = 0.; // vectorization
      C[k][j] = 0.; // nothing => penalty
    }
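The three annotations in the loop nest above (reuse, vectorization, penalty) can be reproduced with a simple classification with respect to the innermost iterator k. This is a sketch that assumes a row-major layout; the function `classify` is hypothetical and is not the penalty formulation used by any particular scheduler.

```python
def classify(subscripts, iterator):
    """Classify an access by how `iterator` appears in its subscripts.

    `subscripts` lists the iterator used in each dimension, slowest-varying
    first (row-major layout is assumed).
    """
    if iterator not in subscripts:
        return "reuse"          # stride 0: same element at every iteration
    if subscripts[-1] == iterator and iterator not in subscripts[:-1]:
        return "vectorizable"   # stride 1: consecutive elements
    return "penalty"            # large stride across rows

accesses = {"A": ("i", "j"), "B": ("i", "k"), "C": ("k", "j")}
result = {name: classify(subs, "k") for name, subs in accesses.items()}
for name, kind in result.items():
    print(name, kind)
```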

Objective Vocabulary: Parallelism+Reuse

Define a benefit for iterator appearance that favors parallelism, another for reuse, define a penalty for iterator appearance that breaks reuse or vectorization.

OPIR: Outer Parallelism Inner Reuse

- maximize total benefit, then minimize total penalty

for (int i = 0; i < NI; ++i)       // parallel
  for (int k = 0; k < NK; ++k)     // reuse
    for (int j = 0; j < NJ; ++j)   // vectorizable
      C[i][j] += A[i][k] * B[k][j];

Objective Vocabulary: Fusion/Fission

If two statements are in fused loops, the difference of their respective scalar schedule dimensions is 0. Dependences exist between statements that reuse data.

DGF: Dependence-Guided Fusion

- Minimize the penalty for fissioned dependence-related statements

SIS: Separation of Independent Statements

- Minimize the penalty for fused independent statements

for (i = 0; i < N; ++i)
  A[i] = 42.;
for (i = 0; i < N; ++i)
  B[i] = A[i];

for (i = 0; i < N; ++i) {
  A[i] = 42.;
  B[i] = 43.;
}
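Whether fusion is legal depends on the dependences between the two statements. The sketch below runs the producer/consumer pair both fissioned and fused. With B[i] = A[i], as in the first example above, both versions agree; the mirrored access B[i] = A[n-1-i] is a hypothetical variant, not from the slides, added to show a fusion that violates a dependence.

```python
def run(n, src, fuse):
    """Execute A[i] = 42; B[i] = A[src(i, n)] either fused or fissioned."""
    A, B = [0.0] * n, [0.0] * n
    if fuse:
        for i in range(n):
            A[i] = 42.0
            B[i] = A[src(i, n)]
    else:
        for i in range(n):
            A[i] = 42.0
        for i in range(n):
            B[i] = A[src(i, n)]
    return B


def same(i, n):
    return i


def mirror(i, n):
    return n - 1 - i


print(run(8, same, True) == run(8, same, False))      # fusion preserves the result
print(run(8, mirror, True) == run(8, mirror, False))  # fused loop reads unwritten data
```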

Objective Vocabulary: Stencils

Stencils access adjacent locations and are often wavefront-parallelized.

for (int t = 0; t < T; ++t)
  for (int i = 0; i < H; ++i)
    for (int j = 0; j < W; ++j)
      A[i][j] = A[i-1][j-1]*M[0][0] + A[i-1][j]*M[0][1] + A[i-1][j+1]*M[0][2]
              + A[i  ][j-1]*M[1][0] + A[i  ][j]*M[1][1] + A[i  ][j+1]*M[1][2]
              + A[i+1][j-1]*M[2][0] + A[i+1][j]*M[2][1] + A[i+1][j+1]*M[2][2];


SPAR: Stencil Parallelization
- Since the outer loop is not parallelizable, favor shifting/skewing in this loop only to expose inner parallelism. Penalize skewing coefficients.

SMVS: Stencils Minimization of Vector Skewing
- Further penalize skewing by the loop iterators that appear in fastest-varying array subscripts since this breaks vectorization.


Selecting Objectives

Analyze iteration domains and access functions to classify programs and select the corresponding sequence of cost functions.

Selected Performance Results

Experimental Methodology

- H/W: 3.3 GHz, 10-core Intel i9-7900X CPU
- Benchmarks: PolyBench/C 3.2 (30 polyhedral kernels)
- Conditions:
  - Baseline: Pluto 0.11.4 + autotuned tiling
  - Tested: PoCC = Pluto with scheduling improvements without autotuning

Performance Comparison

Autotuning Search Space

Refinements: Scalability and Customizable Incremental Scheduling

● Grouping “sufficiently similar” accesses:

● Access rank:

− prioritize array references with more subscripts.

● Access multiplicity:

− prioritize repeated accesses.

● Iterative grouping:

− Consider accesses contributing to each other’s multiplicity together.

● Each group optimized separately for temporal/spatial locality

Scalability: Proximity Relation Grouping

● Sort the groups in the ILP to give them more or less priority

● Default heuristic:

− by rank

− by multiplicity

− temporal first

● May include external factors unavailable to a linear optimizer (types, memory characteristics, costs of cache misses, etc.)

Incremental Scheduling Policies

Refined Scheduling Algorithm Template

● Consists of two parameterizable ILP problems:
  ○ carry as few spatial proximity relations as possible and produce coincident dimensions for parallelism (based on the Pluto algorithm [Bondhugula et al. 2008]);
  ○ carry multiple spatial proximity relations without skewing (based on the Feautrier algorithm [Feautrier 1992]).

● One level of coarse-grain parallelism

− avoid carrying coincident relations in the outer loops.

● Memory hierarchy favoring adjacent accesses

− carry spatial proximity relations in the inner loops.

● False sharing effect

− avoid carrying spatial proximity relations in outer parallel loops.

● High cost of barrier synchronization inside loops

− if few loops remain to be scheduled, prefer carrying coincident

● Parallelism/Locality conflict

− requires tiling and different schedule for point loops.

Instantiation for CPUs

● Multiple levels of parallelism

− avoid carrying coincidence relations, communicate with block/thread mapper.

● Memory coalescing along one thread dimension

− carry multiple spatial proximity relations while not carrying coincidence relations, ensure mapping to the right thread.

● High overhead of kernel launch

− more aggressive fusion including (spatial) RAR relations

Instantiation for GPUs

Polybench on Sequential CPU

Two versions of ppcg: “post-tile” performs a syntactic loop interchange after tiling

Intel Core i7-6600u (Skylake) @ Ubuntu 17.04 + gcc 6.3

Example: 2 matrix multiplications


S1→(0,i,0,0,j)
S2→(0,i,k,1,j)
S3→(1,i,0,0,j)
S4→(1,i,k,1,j)

Ppcg-spatial

Pluto

S1→(0,i,j,1,0)
S2→(1,i,j,0,k)
S3→(0,i,j,0,0)
S4→(1,i,k,1,j)

Example: LU decomposition

void lu(double A[N][N]) {
  int i, j, k;
  for (i = 0; i < N; i++) {
    for (j = 0; j < i; j++) {
      for (k = 0; k < j; k++)
S1:     A[i][j] -= A[i][k] * A[k][j];
S2:   A[i][j] /= A[j][j];
    }
    for (j = i; j < N; j++)
      for (k = 0; k < i; k++)
S3:     A[i][j] -= A[i][k] * A[k][j];
  }
}

S1→tile(i,j,k); point(i,k,j)
S2→tile(i,j,j); point(i,j,j)
S3→tile(i,j,k); point(i,k,j)

S1→tile(i,k,j); point(i,k,j)
S2→tile(i,j,j); point(i,j,j)
S3→tile(i,k,j); point(i,k,j)

Pluto

Ppcg-spatial

+ wavefront parallelism: (i,k,j) → (i+k,k,j)

Reduces false sharing

Parallel CPU

4x Intel Xeon E5-2630 (Ivy Bridge) @ CentOS 7.2.1511 + gcc 6.3

Parallel GPU

GPU performance on Polybench is dominated by efficient parallelism extraction

Nvidia Quadro K4000 (Kepler) @ CentOS 7.2.1511 + CUDA 8.0r1

Affine Scheduling Lessons

- “One-size-fits-all” heuristics don’t really fit all.

- Various optimization objectives can be expressed in polyhedral scheduling.

- Different kernels require different optimization strategies.

- The cost of tile size autotuning can be decreased by picking the right cost function, but guessing the best tile sizes remains a challenge.

[Bastoul 2004] Bastoul “Code generation in the polyhedral model is easier than you think” @ PACT 2004

[Bondhugula et al. 2008] Bondhugula, Hartono, Ramanujam, Sadayappan “A practical automatic polyhedral parallelizer and locality optimizer” @ PLDI 2008

[Feautrier 1992] Feautrier “Some efficient solutions to the affine scheduling problem” (part I and II) @ Intl. Journal of Parallel Programming 21(5) and 21(6).

[Grosser et al. 2015] Grosser, Verdoolaege, Cohen “Polyhedral AST generation is more than scanning polyhedra” @ TOPLAS 2015

[Matiyasevich 1970] Matiyasevich “The Diophantineness of enumerable sets” @ Reports of the USSR Academy of Sciences 191(2) (in Russian).

[Vasilache et al. 2012] Vasilache, Meister, Baskaran, Lethin “Joint scheduling and layout optimization to enable multi-level vectorization” @ IMPACT 2012

[Zinenko et al. 2018] Zinenko, Verdoolaege, Reddy, Shirako, Grosser, Sarkar, Cohen “Modeling the conflicting demands of parallelism and temporal/spatial locality in affine scheduling” @ CC 2018

Some Polyhedral Compilation References

Subjective References in Affine Scheduling

● [Feautrier 1992] first Farkas-based approach to affine scheduling: carry dependences early

● [Bondhugula et al. 2008] Pluto algorithm: outer parallelism and locality in one incremental ILP-based optimization problem

● [Vasilache et al. 2012] introduced access consecutivity constraints and a “one-shot” ILP scheduler

● [Verdoolaege and Isoard 2018] extended Vasilache et al. approach to incremental ILP scheduling

● [Zinenko et al. 2018] unified polyhedral flow for temporal and spatial locality and incremental orchestration of constraints and objectives

Polyhedral Compilation in the Real World

Where From?

Mathematical core: “isl”
parametric linear optimization, Presburger arithmetic
used in GCC’s Graphite and LLVM Polly and many research projects including Pluto, PoCC, PPCG...

Building on 12 years of collaboration
ARM, Inria, ETH Zürich (Tobias Grosser)
AMD, Qualcomm, Xilinx, Facebook
IISc (Uday Bondhugula)
IIT Hyderabad (Ramakrishna Upadrasta)
Ohio State University, Colorado State University, Rice University
Google Summer of Code


Research and Industry Transfer - Virtual Lab
https://www.pollylabs.org

● Bilateral contracts ARM, Facebook (FAIR), Xilinx, Inria, ETHZ in cooperation with Qualcomm (QuIC)
● Focus on LLVM ecosystem: http://llvm.org → exploitation & developer community
● Mutualization of core polyhedral compilation infrastructure
● Contributing to domain-specific (deep learning, image processing, solvers, linear algebra) and research compilers
● Training and tutorials


Timeline

● isl started in 2008, licensed under LGPLv2.1
  Used by GCC as its polyhedral library since 5.0
  http://repo.or.cz/w/isl.git

● 2013: Relicensed under MIT, through CARP EU project
  Used by LLVM through the Polly project

● 2014: Triggered ARM to release tools to generate linear algebra kernels
● 2014: Qualcomm, then ARM, push for Polly Labs
● 2015: Qualcomm Snapdragon LLVM uses Polly, compiles Android OSP
● 2016: Xilinx starts an isl-based project within Vivado HLS
● 2017: Facebook works on a deep learning compiler using isl → Tensor Comprehensions project

Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zach DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, Albert Cohen

Programming Language & Systems Research…for Deep Learning?

Deep learning has become massively popular over the last 10 years

Machine Learning (ML) frameworks are all over the place

Is this good enough?

Programming Language & Systems Research…for Deep Learning?

• Existing layers in vendor libraries from Intel, Nvidia, etc.

• Can reach great performance, 95%+ efficiency on a few, ideal kernels

• But the practice is often far from machine peak

• New layer or architecture → performance bottleneck

• High-performance library coders are scarce, not all genius, don’t scale

• ML differs from scientific/high-performance computing: variety of hardware, data layouts & types, need for symbolic manipulation (automatic differentiation, quantization), and programmer expertise levels

Writing “Good” Neural Network Layers

Tedious experience, out of reach of Machine Learning (ML) users and researchers

Derive ML-specific techniques from software synthesis and compilers

• Mini-language, close to mathematics and easy to manage by automatic tools

• Compiler for algorithmic optimization and automatic differentiation

• Compiler for “polyhedral” scheduling and mapping

• Just-in-time specialization of the model (hyper)parameters for efficient kernel implementations, e.g., for GPU acceleration

• Works transparently: “A New Op” for machine learning and applications

• Integrated with industry-standard frameworks (Caffe2, PyTorch)

Our Approach

Don’t write programs, synthesize them

• “Direct generation” such as active libraries [2] or built-to-order (BTO) [3] provides effective performance, but misses global optimization

• DSLs such as Halide [4] provide usability and permit scheduling transformations, though these must be specified manually.

• Compilers like XLA [5] or Latte [6] optimize and fuse operators, though performance is lacking because the language cannot represent the complex schedules that are crucial to GPUs and other accelerators.

[2] Run-time code generation in C++ as a foundation for domain-specific optimization, 2003
[3] Automating the generation of composed linear algebra kernels, 2009
[4] Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines, 2013
[5] XLA: domain-specific compiler for linear algebra to optimize TensorFlow computations, 2017
[6] Latte: A language, compiler, and runtime for elegant and efficient deep neural networks, 2016

Prior Work

Our Approach


Synthesize From Model…

Tight mathematical model, emits 1000s optimized lines of code

• Group Convolution

• Kronecker Recurrent Units (KRU)

→ algorithmic exploration of storage/recompute tradeoffs

Synthesize From Model…

● Production MLP from Facebook

● And more complex examples, including Google WaveNet

… Rather Than Write Accelerator Code

Polyhedral Compilation to the Rescue


Heard That Before?

• 30 years of parallelizing and optimizing compiler research
• … wrapped into a robust, automated tool, with domain specialization
• … with modern C++ interface and tight ML framework integration

• Embed the most complex compositions of loop nest and tensor optimizations

Algorithmic Contributions

O[l+Idx[i][j]][k] => shared_O[l][i][j][k]

• Cache indirectly accessed arrays
• Only when O and Idx are only read (not written)

• Promote direct accesses if tile of fixed size, elements reused, and >= 1 access without memory coalescing

• Promote indirect accesses in same way (ignore coalescing)

• Heuristics for register promotion as well

Memory Promotion

Performance?


TC in a Nutshell

Polyhedral Compilation in the Real World… once again with broader ambitions

MLIR: Multi-Level Intermediate Representationfor the End of Moore’s Law

Presenting the work of many, many, people

From EuroLLVM 2019 keynote and tutorial

TensorFlow

Huge machine learning community

Programming APIs for many languages

Abstraction layer for accelerators:
- Heterogeneous, distributed, mobile, custom ASICs…
- Urgency is driven by the “end of Moore’s law”

Open Source: https://tensorflow.org and https://tensorflow.org/mlir


Why a new compiler infrastructure?

The LLVM Ecosystem: Clang Compiler

[Diagram: C, C++, CUDA, OpenCL → Clang AST → LLVM IR → Machine IR → Asm]

Green boxes are SSA IRs:
● Different levels of abstraction - operations and types are different
● Abstraction-specific optimization at both levels

Progressive lowering:
● Simpler lowering, reuse across other front/back ends

Azul Falcon JVM

[Diagram: Java & JVM Languages → Java BC, and C, C++, CUDA, OpenCL → Clang AST, both lowering to LLVM IR → Machine IR → Asm]

“Falcon: An Optimizing Java JIT” - LLVM Developer Meeting Oct’2017
https://llvm.org/devmtg/2017-10/#talk12

Uses LLVM IR for high level domain specific optimization:
● Encodes information in lots of ad-hoc ways: IR Metadata, well known functions, intrinsics, …
● Reuses LLVM infrastructure: pass manager, passes like inliner, etc.

Swift, Rust and Julia have a high level IR - Not C and C++

[Diagram: Swift → Swift AST → SIL IR; Rust → Rust AST → MIR IR; Julia → Julia AST → Julia IR; C, C++, CUDA, OpenCL → Clang AST; all lowering to LLVM IR → Machine IR → Asm]

“Introducing MIR”: Rust Language Blog, https://blog.rust-lang.org/2016/04/19/MIR.html
“Julia SSA-form IR”: Julia docs, https://docs.julialang.org/en/v1/devdocs/ssair/index.html

● Domain specific optimizations: generic specialization, devirtualization, reference counting optimizations, library-specific optimizations, etc.
● Dataflow driven type checking - e.g. borrow checker in Rust
● Domain specific optimizations, progressive lowering

TensorFlow XLA Compiler

[Diagram: Swift → Swift AST → SIL IR; Rust → Rust AST → MIR IR; Julia → Julia AST → Julia IR; TensorFlow Ecosystem → TF Graph → XLA HLO; C, C++, CUDA, OpenCL → Clang AST; all lowering to LLVM IR → Machine IR → Asm]

“XLA Overview”: https://www.tensorflow.org/xla/overview (video overview: https://www.youtube.com/watch?v=2IOPpyyuLkc)

● Domain specific optimizations, progressive lowering, ad-hoc emitters

The TensorFlow compiler ecosystem

[Diagram: TensorFlow Graph lowered to many IRs and runtimes: XLA HLO, LLVM IR, TPU IR, several others; TensorRT; nGraph; Core ML; TensorFlow Lite; NNAPI; many others]

Many “Graph IRs”, each with challenges:
● Similar-but-different (some, proprietary) technologies: not going away anytime soon
● Duplication of infrastructure at all levels

→ Need SSA-based design to generalize and improve “ML graphs”

Domain Specific IRs

Great!
● High-level domain-specific optimizations
● Progressive lowering encourages reuse between levels

Not great!
● Huge expense to build this infrastructure
● Reimplementation of all the same stuff
  ○ pass managers, location/error tracking, testing tools
  ○ inlining, use-def chains, constant folding, partial redundancy elimination, ...
● Innovations in one community don’t benefit the others

MLIR Primer

Many similarities to LLVM

● SSA, CFG, typed, three address
● Module/Function/Block/Operation structure
● Round trippable textual form
● Syntactically similar:

func @testFunction(%arg0: i32) -> i32 {
  %x = call @thingToCall(%arg0) : (i32) -> i32
  br ^bb1
^bb1:
  %y = addi %x, %x : i32
  return %y : i32
}

[Diagram: a Module contains Functions, a Function contains Blocks, a Block contains Operations]

MLIR Type System: some examples

Scalars:
● f16, bf16, f32, … i1, i8, i16, i32, … i3, i4, i7, i57, …

Vectors:
● vector<4 x f32>, vector<4x4 x f16>

Tensors, including dynamic shape and rank:
● tensor<4x4 x f32>, tensor<4x?x?x17x? x f32>, tensor<* x f32>

Others:
● functions/closures, memory buffers, quantized integers, TensorFlow stuff, ...

MLIR Operations: an open ecosystem

No fixed / builtin list of globally known operations:
● No “instruction” vs “target-indep intrinsic” vs “target-dep intrinsic” distinction
  ○ Why is “add” an instruction but “add with overflow” an intrinsic in LLVM? 😿

Passes are expected to conservatively handle unknown operations:
● just like LLVM does with unknown intrinsics

func @testFunction(%arg0: i32) -> i32 {
  %x = "any_unknown_operation_here"(%arg0, %arg0) : (i32, i32) -> i32
  %y = "my_increment"(%x) : (i32) -> i32
  return %y : i32
}

MLIR Operations Capabilities

Operations always have: opcode and source location info

Operations may have:
- Block arguments instead of PHI nodes
- Any number of SSA results and operands
- Attributes: constant values of custom syntax and type
- Regions: discussed in later slide
- Custom printing/parsing - or use the more verbose generic syntax
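To make the "always have / may have" split concrete, here is a toy Python model of an operation together with a pass that only rewrites what it recognizes. This is an illustrative sketch, not the MLIR C++ API: the `Operation` class, the `toy.addi` and `toy.constant` opcodes, and `fold_constants` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Operation:
    """Toy stand-in for an operation (not the real MLIR data structure)."""
    opcode: str                                   # always present
    location: str                                 # always present
    operands: List[Any] = field(default_factory=list)
    attributes: Dict[str, Any] = field(default_factory=dict)
    regions: List[List["Operation"]] = field(default_factory=list)


def fold_constants(ops: List[Operation]) -> List[Operation]:
    """Fold "toy.addi" of two integer literals; leave every other op untouched."""
    folded = []
    for op in ops:
        is_literal_add = (
            op.opcode == "toy.addi"
            and len(op.operands) == 2
            and all(isinstance(v, int) for v in op.operands)
        )
        if is_literal_add:
            folded.append(Operation("toy.constant", op.location,
                                    attributes={"value": sum(op.operands)}))
        else:
            # Unknown or non-foldable operation: keep it as is (conservative).
            folded.append(op)
    return folded
```

The pass never inspects the semantics of an opcode it does not know, which is the conservative behavior the slides ask of passes in an open operation ecosystem.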

Extensible Operations Allow Multi-Level IR

TensorFlow
%x = "tf.Conv2d"(%input, %filter) {strides: [1,1,2,1], padding: "SAME", dilations: [2,1,1,1]}
       : (tensor<*xf32>, tensor<*xf32>) -> tensor<*xf32>

XLA HLO
%m = "xla.AllToAll"(%z) {split_dimension: 1, concat_dimension: 0, split_count: 2}
       : (memref<300x200x32xf32>) -> memref<600x100x32xf32>

LLVM IR
%f = "llvm.add"(%a, %b) : (f32, f32) -> f32

Also: TF-Lite, Core ML, other frontends, ...

Don’t we end up with the JSON/XML of compiler IRs???

MLIR “Dialects”: Families of defined operations

Example Dialects:
- TensorFlow, XLA HLO, TF Lite, Swift SIL, ...
- linalg, affine, LLVM IR, ...

Dialects can define:
- Operations
- Custom type and attribute systems

Operations can define:
- Invariants on # operands, results, attributes, ...
- Custom parser, printer, verifier, ...
- Constant folding, canonicalization patterns, …

Operations in a Nutshell

%res:2 = "mydialect.morph"(%input#3) { some.attribute : true, other_attribute : 1.5 }
           : (!mydialect<"custom_type">) -> (!mydialect<"other_type">, !mydialect<"other_type">)
           loc(callsite("foo" at "mysource.cc":10:8))

● No predefined set of instructions
● Operations are like “opaque functions” to MLIR

Annotations:
- %res: name of the results
- :2: number of values returned
- "mydialect.morph": Op Id, with its dialect prefix
- %input: argument
- #3: index in the producer’s results
- !mydialect: dialect prefix for the type
- "custom_type": opaque string / dialect specific type
- { ... }: list of attributes (constant named arguments)
- loc(...): mandatory and rich location

Nested Regions in a Linear IR

➔ Functional control flow, XLA fusion node, lambdas/closures, parallelism abstractions like OpenMP, etc.

%7 = tf.If(%arg0 : tensor<i1>, %arg1 : tensor<2xf32>) -> tensor<2xf32> {
  … “then” code...
  return ...
} else {
  … “else” code...
  return ...
}

%2 = xla.fusion (%0 : tensor<f32>, %1 : tensor<f32>) : tensor<f32> {
^bb0(%a0 : tensor<f32>, %a1 : tensor<f32>):
  %x0 = xla.add %a0, %a1 : tensor<f32>
  %x1 = xla.relu %x0 : tensor<f32>
  return %x1
}

Bigger Example: Polyhedral IR Dialect

affine.for and affine.if represent simplified polyhedral schedule trees:
● Great match for ML kernels
● Includes systems of affine constraints, mappings, solvers, etc.

func @matmul_square(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
  %n = dim %A, 0 : memref<?x?xf32>

  affine.for %i = 0 to %n {
    affine.for %j = 0 to %n {
      store 0, %C[%i, %j] : memref<?x?xf32>
      affine.for %k = 0 to %n {
        %a = load %A[%i, %k] : memref<?x?xf32>
        %b = load %B[%k, %j] : memref<?x?xf32>
        %prod = mulf %a, %b : f32
        %c = load %C[%i, %j] : memref<?x?xf32>
        %sum = addf %c, %prod : f32
        store %sum, %C[%i, %j] : memref<?x?xf32>
      }
    }
  }
  return
}
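For readers less familiar with the IR syntax, the loop nest above has the semantics of the plain Python function below, and a tiled variant shows the kind of transformation a polyhedral framework can derive from such a nest. This is a sketch on lists of lists; the function names and the tile size `T` are illustrative assumptions, not something produced by MLIR.

```python
def matmul_square(A, B, C):
    """Reference semantics of the affine loop nest: C = A * B (n x n)."""
    n = len(A)                          # %n = dim %A, 0
    for i in range(n):                  # affine.for %i = 0 to %n
        for j in range(n):              # affine.for %j = 0 to %n
            C[i][j] = 0.0               # store 0, %C[%i, %j]
            for k in range(n):          # affine.for %k = 0 to %n
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_square_tiled(A, B, C, T=2):
    """Same computation after distributing the initialization and tiling by T."""
    n = len(A)
    for i in range(n):
        for j in range(n):
            C[i][j] = 0.0
    for ii in range(0, n, T):           # tile loops
        for jj in range(0, n, T):
            for kk in range(0, n, T):
                # point loops; min() handles sizes not divisible by T
                for i in range(ii, min(ii + T, n)):
                    for j in range(jj, min(jj + T, n)):
                        for k in range(kk, min(kk + T, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

For each output element the tiled version accumulates over `k` in the same order as the reference, so both functions return identical results.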

More on MLIR: See the EuroLLVM’19 Tutorial
Example: DSL and Compiler for a Heterogeneous World

[Diagram: Toy → Toy AST → TIR → LLVM IR → Machine IR → Asm, with HW Accelerator (TPU, GPU, FPGA, ..) as an additional target. Labels: Shape Inference, Function Specialization, High-Level Language Specific Optimizations.]

Need to analyze and transform the AST → heavy infrastructure
And is the AST really the most friendly representation we can get?
New HW: are we extensible and future-proof?

More on MLIR: See the EuroLLVM’19 Tutorial
It’s All About Dialect(s)

[Diagram: Toy → Toy AST → TIR → LLVM IR → Machine IR → Asm, with HW Accelerator (TPU, GPU, FPGA, ..) as an additional target. Two stages inside the MLIR box are labeled “Implemented as Dialect”. Labels: Shape Inference, Function Specialization, High-Level Language Specific Optimizations.]

MLIR is a Large Project...

Albert Cohen, Alex Zinenko, Alexandre Passos, Andrew Selle, Andy Davis, Bjarke Roune, Brian Patton, Chris Lattner, Cliff Young, David Majnemer, Daniel Killebrew, Dimitrios Vytiniotis, Feng Liu, Himabindu Pucha, Jacques Pienaar, James Molloy, Jeff Dean, Jianwei Xie, Lei Zhang, Mark Heffernan, Martin Wicke, Mehdi Amini, Michael Isard, Nicolas Vasilache, Paul Barham, Peter Hawkins, Rasmus Larsen, Richard Wei, River Riddle, Sanjoy Das, Sergei Lebedev, Skye Wanderman-Milne, Smit Hinsu, Sourabh Bajaj, Stella Laurenzo, Tatiana Shpeisman, Todd Wang, Uday Bondhugula, Ulysse Beaugnon, Yanan Cao

MLIR-Related Research Projects

Hackability & HW/SW Research

Aiming for a super-extensible system, catalyzing next-gen accelerator research:

● domain-specific languages / annotations lower naturally to MLIR
● domain-specific HW constructs are first-class operations
● extend type system: novel numerics, sparse tensors, (G)ADTs, …
● concurrency & parallel constructs, memory modeling
● many classes of transformations have structured search spaces: algorithmic rewriting, graph rewriting, memory-recompute, polyhedral, and synthesis

Accelerate innovation in hardware, compiler algorithms, and applications thereof

Compile to Learn → Learn to Compile

● Move past handwritten heuristics
  ○ NP complete problems
  ○ Cost models that are hard or infeasible to characterize
  ○ Hardware explosion, model diversity, problem diversity, … can’t scale
● Autotuning, search and caching
  ○ Separate algorithms and policy
  ○ Exploit structure in search space

Two projects: past & future

Tensor Comprehensions (TC) & Telamon

Telamon: Commutative Optimizations on Partially Specified Implementations

with Ulysse Beaugnon, Basile Clément, Andi Drebes, and Nicolas Tollenaere
ENS, Inria, Google

CC 2017: Optimization space pruning without regrets
Ulysse Beaugnon, Antoine Pouille, Marc Pouzet, Jacques Pienaar, Albert Cohen
https://ai.google/research/pubs/pub46935

arXiv preprint: On the Representation of Partially Specified Implementations and its Application to the Optimization of Linear Algebra Kernels on GPU
Ulysse Beaugnon, Basile Clément, Nicolas Tollenaere, Albert Cohen
https://arxiv.org/abs/1904.03383

Context: “superoptimizing” loop nests in numerical kernels
Challenge: finding good/best combinations of implementation decisions is hard

● Optimizations may enable or disable others
● Transformation ordering affects performance
● Cannot infer precise performance estimation from intermediate compilation steps

Corollary: optimizing compilation never seems to catch up with new hardware and new optimization tricks… effectively witnessing a widening performance portability gap

Problem Statement

Candidates as Partially Specified Implementations

● Optimizations as independent, commutative decisions
  e.g., tile? unrolling? ordering?
● Vector of choices, listed upfront, decisions taken in any order
  defer any interference to search
● Synthesize imperative code from fixed/complete decision vectors
  e.g., infer buffers, control flow

Constraint Programming for Semantics and Resource Modeling

● Control structure
  e.g., loop nesting
● Semantics of the kernel
  e.g., def-use, array dependences
● Optimization interactions
  e.g., enabling transformations
● Resource constraints
  e.g., local memory

Branch-and-Bound- and MCTS-enabled Search

● Lower bound derived from orthogonal resource modeling
  inspired by roofline modeling
● Lower bound for a candidate = ideal performance for a set of potential implementations
● Empowered by structured, decision vector and CSP-based implementation space

Telamon Approach

Inspired From Polyhedral Compilation

● Polyhedral compilation
  ○ Affine scheduling, e.g., ILP-based
  ○ Code generation, from affine schedules to nested loops
● Meta-programming array processing code
  ○ Halide / TVM specific combinators and scheduling/mapping primitives
  ○ URUK, CHiLL with automatic schedule completion

TVM example: scan cell (RNN)

import tvm

num_thread = 256  # assumed value; not defined on the slide
m = tvm.var("m")
n = tvm.var("n")
X = tvm.placeholder((m, n), name="X")
s_state = tvm.placeholder((m, n))
s_init = tvm.compute((1, n), lambda _, i: X[0, i])
s_do = tvm.compute((m, n), lambda t, i: s_state[t-1, i] + X[t, i])
s_scan = tvm.scan(s_init, s_do, s_state, inputs=[X])
s = tvm.create_schedule(s_scan.op)

# Schedule to run the scan cell on a CUDA device
block_x = tvm.thread_axis("blockIdx.x")
thread_x = tvm.thread_axis("threadIdx.x")
xo, xi = s[s_init].split(s_init.op.axis[1], factor=num_thread)
s[s_init].bind(xo, block_x)
s[s_init].bind(xi, thread_x)
xo, xi = s[s_do].split(s_do.op.axis[1], factor=num_thread)
s[s_do].bind(xo, block_x)
s[s_do].bind(xi, thread_x)
print(tvm.lower(s, [X, s_scan], simple_mode=True))

Candidates?

https://en.wikipedia.org/wiki/Polytope_model

http://halide-lang.org

https://tvm.ai

http://www.cs.colostate.edu/~pouchet/doc/pact-article.07.pdf

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.214.8396







https://docs.tvm.ai/api/python/tvm.html#tvm.scan

https://docs.tvm.ai/api/python/schedule.html#tvm.create_schedule



https://docs.tvm.ai/api/python/build.html#tvm.lower

Inspired From Program Synthesis and Superoptimization

● Program synthesis
  ○ Start from denotational specification, possibly partial (sketching), or (counter-)examples
    Telamon ≈ Domain-specific denotations
  ○ Guess possible implementations by (guided) sampling lots of random ones
    Telamon ≈ Guess efficient implementations by (guided) sampling lots of stupid ones
  ○ Filter correct implementations using SMT solver or theorem prover
    Telamon ≈ Constraint programming to model both correctness and hardware mapping
● Superoptimization
  ○ Typically on basic blocks, with SAT solver or theorem prover and search
  ○ Architecture and performance modeling
    Telamon ≈ Operate on loop nests and arrays

Constraints?

https://en.wikipedia.org/wiki/Program_synthesis

https://en.wikipedia.org/wiki/Superoptimization

Inspired From Adaptive Libraries and Autotuning

● Feedback-directed and iterative compiler optimization, lots of work since the late 90s
● Adaptive libraries
  ○ SPIRAL: Domain-Specific Language (DSL) + Rewrite Rules + Multi-Armed Bandit or MCTS
    http://www.spiral.net
  ○ ATLAS, FFTW, etc.: hand-written fixed-size kernels + micro-benchmarks + meta-heuristics
  ○ Pouchet et al. (affine), Park et al. (affine and CFG): Genetic Algorithm, SVM, Graph Kernels
● Telamon
  ○ vs. SPIRAL, FFTW: better structured, independent/commutative choices, branch-and-bound
  ○ vs. Pouchet and Park: finite space, bounded vectors

Search?

http://www.spiral.net/

https://scholar.google.com/citations?user=TCppIZYAAAAJ&hl=fr

https://scholar.google.com/citations?user=RNzbA4IAAAAJ&hl=en

Partially Instantiated Vector of Decisions
● Every choice is a decision variable
● List the domain of variables: the values they can take
● Taking a decision = restricting a domain
● Fully specified implementation ⇔ All decision variables assigned a single value

- order(a, b) ∈ { Before, After }
- order(a, c) ∈ { Before }
- …

Candidates

Kernel Decisions

Enforce coherent decisions with constraints

order(x, d0) = Inner && order(x, y) = Before => order(y, d0) ∈ { Inner, After }

%x = load X[0]
%y = add %x, 42
for %d0 = 0 to 16 {
  %z = add %y, %d0
}

%y = add %x, 42
for %d0 = 0 to 16 {
  %x = load X[0]
  %z = add %y, %d0
}

for %d0 = 0 to 16 {
  %x = load X[0]
  %y = add %x, 42
  %z = add %y, %d0
}

Initial domains:
order(%x, %d0) ∈ { Before, Inner }
order(%x, %y) ∈ { Before }
order(%y, %d0) ∈ { Before, Inner }
...

After a decision:
order(%x, %d0) ∈ { Inner } <- decision (Before eliminated)
order(%x, %y) ∈ { Before }
order(%y, %d0) ∈ { Before, Inner }
...

After constraint propagation:
order(%x, %d0) ∈ { Inner } <- decision
order(%x, %y) ∈ { Before }
order(%y, %d0) ∈ { Inner } <- constraint propagation (Before eliminated)
...
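The decision-then-propagation step can be sketched in a few lines of Python. This is a toy model with the single constraint order(x, d0) = Inner && order(x, y) = Before => order(y, d0) ∈ { Inner, After } hard-coded; the `decide` and `propagate` helpers and the dictionary encoding are assumptions made for illustration, not Telamon's implementation.

```python
BEFORE, AFTER, INNER = "Before", "After", "Inner"


def propagate(domains):
    """Apply the constraint until no domain changes.

    order(x, d0) = Inner and order(x, y) = Before
        => order(y, d0) in {Inner, After}
    """
    changed = True
    while changed:
        changed = False
        if domains[("x", "d0")] == {INNER} and domains[("x", "y")] == {BEFORE}:
            restricted = domains[("y", "d0")] & {INNER, AFTER}
            if restricted != domains[("y", "d0")]:
                domains[("y", "d0")] = restricted
                changed = True
    return domains


def decide(domains, key, value):
    """Taking a decision = restricting a domain to a single value."""
    if value not in domains[key]:
        raise ValueError(f"{value} is not in the domain of {key}")
    new_domains = {k: set(v) for k, v in domains.items()}  # keep the input intact
    new_domains[key] = {value}
    return propagate(new_domains)


initial = {
    ("x", "d0"): {BEFORE, INNER},
    ("x", "y"): {BEFORE},
    ("y", "d0"): {BEFORE, INNER},
}
after = decide(initial, ("x", "d0"), INNER)
```

After the call, `after[("y", "d0")]` is `{"Inner"}`: placing `%x` inside the loop forces `%y`, which must come after `%x`, inside the loop as well.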

Candidates and Constraints

Well Behaved Set of Actions

● Commute

● All decisions known upfront

● Constraint propagation almost never backtracks in practice

Flat, Fixed Sized, Ideal Environment for Reinforcement Learning (RL)

● Extract features from the decision vector

● Global heuristics, aware of all potential optimizations

● Infer all possible decisions (actions) and/or estimate performance

Enabling Better Search Algorithms

Find an Assignment for Functions

kind: Dimension -> { Loop, Unrolled, Vector, Thread, Block }

order: Statements x Statements -> { Before, After, Inner, Outer, Fused }

That Respects Constraints

∀ a, b ∊ Dimension. order(a, b) = Fused => kind(a) = kind(b)

(a.k.a. typed fusion)

Constraint Satisfaction Problem (CSP)

● Finding an implementation is easy: random decisions + constraint propagation
● Use CSP to represent, not to solve the problem
● No analytical objective function
  ○ Analytical functions cannot model the complexity of the hardware
  ○ Hardware details are proprietary
  ○ Use actual evaluations
  ○ And external heuristics
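"Random decisions + constraint propagation" can be illustrated on the `kind` and `order` functions with the typed-fusion constraint. The sketch below uses two dimensions `a` and `b` only; the variable naming and the propagation routine are hypothetical simplifications, not Telamon's constraint engine.

```python
import random

KINDS = {"Loop", "Unrolled", "Vector", "Thread", "Block"}
ORDERS = {"Before", "After", "Inner", "Outer", "Fused"}


def propagate(dom):
    """Enforce typed fusion: order(a, b) = Fused => kind(a) = kind(b)."""
    changed = True
    while changed:
        changed = False
        ka, kb, order = dom["kind(a)"], dom["kind(b)"], dom["order(a,b)"]
        if order == {"Fused"} and ka != kb:
            # Fusion is decided: both dimensions must share a kind.
            dom["kind(a)"] = dom["kind(b)"] = ka & kb
            changed = True
        elif "Fused" in order and not (ka & kb):
            # Kinds can no longer agree: fusion is ruled out.
            dom["order(a,b)"] = order - {"Fused"}
            changed = True
    return dom


def random_implementation(rng):
    """Take random decisions in a random order, propagating after each one."""
    dom = {"kind(a)": set(KINDS), "kind(b)": set(KINDS), "order(a,b)": set(ORDERS)}
    for var in rng.sample(sorted(dom), len(dom)):
        dom[var] = {rng.choice(sorted(dom[var]))}
        propagate(dom)
    # Every domain is now a singleton: a fully specified implementation.
    return {var: next(iter(values)) for var, values in dom.items()}


impl = random_implementation(random.Random(0))
```

Whatever order the decisions are taken in, propagation keeps the remaining domains consistent, so every sampled implementation in this toy space satisfies the constraint without backtracking.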

CSP Without the Optimization Aspect

Generic loop nest and array optimizations + GPU-specific optimizations

Supported Decisions

● Strip mining factor
● Loop interchange
● Loop fusion
● Statement Scheduling
● Rematerialization
● Array placement in memory spaces
● Memory layout
● Copy to local memories
● Vectorization

Evaluation: Telamon on GPU

Computes Z = X + Y
- load X
- load Y
- add X and Y into Z
- store Z

Implementation space

● Each instruction in its own loop

● Strip-mined 3 times

● Can choose strip-mining factors

● Can fuse, interchange and unroll loops

● Can reorder instructions

● Can coalesce transfers across memory spaces
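As a small illustration of one axis of this implementation space, the Python sketch below strip-mines the vector addition once with a configurable factor. It is an assumed, simplified rendering (one strip-mining level, plain lists), with hypothetical function names.

```python
def vector_add(x, y):
    """Reference: Z = X + Y."""
    return [a + b for a, b in zip(x, y)]


def vector_add_strip_mined(x, y, factor):
    """Same computation with the loop split into an outer and an inner loop."""
    n = len(x)
    z = [0] * n
    for outer in range(0, n, factor):                       # iterates over strips
        for inner in range(outer, min(outer + factor, n)):  # iterates within a strip
            z[inner] = x[inner] + y[inner]
    return z


x = [1, 2, 3, 4, 5]
y = [10, 20, 30, 40, 50]
candidates = {f: vector_add_strip_mined(x, y, f) for f in (1, 2, 4, 8)}
```

Every factor yields the same result as `vector_add`; the factor changes only the loop structure, which is why it can be treated as a pure performance decision.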

Example: Vector Addition

GPU Execution

[Diagram: Kernel Description (Rust API Calls) and Implementation Space Description (Custom DSL, compiled into a Constraint Program) feed Telamon + GPU Backend (Rust), which uses Monte Carlo Tree Search & Program Synthesis to produce an Implementation (PTX) for GPU Execution]

Telamon System Overview

Performance model of a lower bound on the execution time

∀x∊S. Model(S) ≤ Time(x)

● Enables Branch & Bound, with feedback from real executions
  ○ Reduces the search space by several orders of magnitude
  ○ Prunes early in the search tree (75% in the first two levels for matmul on GPU)
● Possible because it is aware of potential future decisions
● GPU model of block- and thread-level performance features, as well as single-thread microarchitecture
  ○ No cache and registers model (yet)
  ○ Coarse-grain model of the interaction between bottlenecks
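The way a lower-bound model enables branch and bound can be sketched on a toy decision vector. Everything below is hypothetical: the decision names, the cost table, and the interaction penalty are made up for illustration, and `evaluate` stands in for a real execution. The only property carried over from the slides is that the model never exceeds the measured time of any implementation in the candidate set.

```python
def lower_bound(domains, cost):
    """Model(S): sum of the cheapest remaining value for each decision."""
    return sum(min(cost[var][v] for v in values) for var, values in domains.items())


def evaluate(impl, cost):
    """Time(x): stand-in for a real execution, with an interaction penalty."""
    time = sum(cost[var][value] for var, value in impl.items())
    if impl["tile"] == 32 and impl["unroll"] == 4:
        time += 5  # hypothetical bottleneck interaction that the model ignores
    return time


def branch_and_bound(domains, cost):
    best = {"time": float("inf"), "impl": None, "evaluated": 0}

    def search(dom):
        if lower_bound(dom, cost) >= best["time"]:
            return  # prune: nothing in this subtree can beat the best so far
        open_vars = [var for var in dom if len(dom[var]) > 1]
        if not open_vars:
            impl = {var: next(iter(values)) for var, values in dom.items()}
            best["evaluated"] += 1
            time = evaluate(impl, cost)
            if time < best["time"]:
                best["time"], best["impl"] = time, impl
            return
        var = open_vars[0]
        for value in sorted(dom[var], key=lambda v: cost[var][v]):
            search({**dom, var: {value}})  # a decision restricts one domain

    search(domains)
    return best


cost = {
    "tile": {8: 9, 16: 6, 32: 4},
    "unroll": {1: 5, 2: 3, 4: 2},
    "vector": {1: 4, 4: 1},
}
domains = {"tile": {8, 16, 32}, "unroll": {1, 2, 4}, "vector": {1, 4}}
result = branch_and_bound(domains, cost)
```

Because the bound is computed on partially specified candidates, whole subtrees are discarded before any of their implementations is evaluated, so `result["evaluated"]` stays well below the 18 implementations in this toy space.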

Branch and Bound + Monte Carlo Tree Search (MCTS)

Source: Wikipedia

Zooming in the MCTS-Based Search

Results on Nvidia Kepler GPU

● High variance of the search time (stuck in suboptimal areas)

● Lots of dead-ends
  ○ Mostly due to performance model
  ○ ~20x more dead-ends than implementations
● Non-stationary distribution due to cuts
  ○ Somewhat intrinsic to MCTS
  ○ Branch & bound strategy makes it trickier

Search Issues (Ongoing Research)

Take Home Message

In a Nutshell — Benefits of Polyhedral Compilation

Search Space: Abstract, Partially Specified Implementations
● Optimizations and lowering, choices and transformations
  e.g., tile? unrolling? ordering?
● Choice vector or sequence of transformations/rewrite rules
  combine with search
● Synthesize imperative code, API calls, assembly code
  infer buffers, control...

Constraints: Functional Semantics and Resource Modeling
● Control structure
  e.g., loop nesting
● Semantics of the kernel
  e.g., def-use, array dependences
● Optimization interactions
  e.g., enabling transformations
● Resource constraints
  e.g., local memory, DMA

Search Heuristics: Semi-automatic or Black-box Optimization
● Objective functions
  linear approximations, resource counting...
● Feedback from actual execution
  profile-directed, JIT, trace-based...
● Combinatorial optimization
  ILP, SMT, graph algorithms, reinforcement learning...

With numerous applications: compiler construction, domain-specific optimization, performance portability
