
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Programming GPUs without Writing a Line of CUDA (Nicolas Vasilache, Reservoir Labs)

Description: http://cs264.org http://j.mp/fmzhYl

Page 1: The R-Stream High-Level Program Transformation Tool

Reservoir Labs, Harvard, 04/12/2011

N. Vasilache, B. Meister, M. Baskaran, A. Hartono, R. Lethin

Page 2: Outline

• R-Stream Overview
• Compilation Walk-through
• Performance Results
• Getting R-Stream

Page 3: Power efficiency driving architectures

[Figure: a grid of accelerator tiles, each combining a GPP, several SIMD units, DMA engines, an FPGA, and a local memory.]

• Heterogeneous processing
• Distributed local memories
• Explicitly managed architecture
• NUMA
• Bandwidth starved
• Hierarchical (including board, chassis, cabinet)
• Multiple execution models
• Mixed parallelism types
• Multiple spatial dimensions

Page 4: Computation choreography

• Expressing it
  • Annotations and pragma dialects for C
  • Explicitly (e.g., new languages like CUDA and OpenCL)
• But before expressing it, how can programmers find it?
  • Manually: constructive procedures, art, sweat, time
    – Artisans get complete control over every detail
  • Automatically (our focus)
    – An operations research problem, like scheduling trucks to save fuel
    – Model, solve, implement
    – Faster, and sometimes better, than a human

Page 5: How to do automatic scheduling?

• Naïve approach
  • Model
    – Tasks, dependences
    – Resource use, latencies
    – Machine model with connectivity, resource capacities
  • Solve with ILP (sketched below)
    – Minimize overall task length
    – Subject to dependences, resource use
  • Problems
    – Complexity: the task graph is huge!
    – Dynamics: loop lengths are unknown.
• So we do something much cooler.
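A minimal sketch of that naïve ILP formulation (our notation, not R-Stream's): give each task $i$ a start time $t_i$ and a duration $d_i$, and minimize the makespan $T$:

$\min T \quad \text{s.t.} \quad t_j \ge t_i + d_i \ \text{for each dependence } i \to j, \qquad t_i + d_i \le T \ \text{for each task } i,$

plus resource-capacity constraints at each time step. One set of variables per task is exactly why a huge task graph makes this intractable.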

Page 6: Program Transformations Specification

[Figure: the iteration space of a statement S(i,j), with axes i and j, mapped by a schedule $\Theta : \mathbb{Z}^2 \to \mathbb{Z}^2$ onto time axes $t_1$ and $t_2$.]

• Schedule maps iterations to multi-dimensional time (worked example below)
  • A feasible schedule preserves dependences
• Placement maps iterations to multi-dimensional space
  • UHPC in progress, partially done
• Layout maps data elements to multi-dimensional space
  • UHPC in progress
• Hierarchical by design; tiling serves separation of concerns
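For concreteness, a small feasibility example (ours, not from the slide): the interchange schedule $\Theta(i,j) = (j,i)$ is feasible iff every dependence still executes in lexicographically increasing time,

$\forall\, (i,j) \to (i',j'): \quad \Theta(i',j') \succ_{\mathrm{lex}} \Theta(i,j),$

so a dependence $(i,j) \to (i,j+1)$ survives interchange, while $(i,j) \to (i+1,j-1)$ forbids it.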

Page 7: Loop transformations

Original loop nest:

    for (i=0; i<N; i++)
      for (j=0; j<N; j++)
        S(i,j);

Permutation (unimodular), $\Theta(i,j) = \begin{pmatrix}0&1\\1&0\end{pmatrix}\binom{i}{j} = (j,i)$:

    for (j=0; j<N; j++)
      for (i=0; i<N; i++)
        S(i,j);

Reversal (unimodular), $\Theta(i,j) = \begin{pmatrix}-1&0\\0&1\end{pmatrix}\binom{i}{j} = (-i,j)$:

    for (i=N-1; i>=0; i--)
      for (j=0; j<N; j++)
        S(i,j);

Skewing by a factor $\alpha$ (unimodular), $\Theta(i,j) = \begin{pmatrix}1&0\\\alpha&1\end{pmatrix}\binom{i}{j} = (i,\ \alpha i + j)$:

    for (i=0; i<N; i++)
      for (j=alpha*i; j<N+alpha*i; j++)
        S(i, j-alpha*i);

Scaling by a factor $\alpha$ (not unimodular), $\Theta(i,j) = \begin{pmatrix}\alpha&0\\0&1\end{pmatrix}\binom{i}{j} = (\alpha i,\ j)$:

    for (i=0; i<alpha*N; i+=alpha)
      for (j=0; j<N; j++)
        S(i/alpha, j);

(The concrete constant factors were lost in extraction; $\alpha$ stands in for them.)

Page 8: Loop fusion and distribution

Distributed form:

    for (i=0; i<N; i++) {
      for (j=0; j<N; j++)
        S1(i,j);
      for (j=0; j<N; j++)
        S2(i,j);
    }

Fused form:

    for (i=0; i<N; i++)
      for (j=0; j<N; j++) {
        S1(i,j);
        S2(i,j);
      }

In schedule form (reconstructed with the usual extra constant time dimensions): fusion gives $\Theta_{S_1}(i,j) = (i,j,0)$ and $\Theta_{S_2}(i,j) = (i,j,1)$, so both statements share the $i$ and $j$ loops; distribution gives $\Theta_{S_1}(i,j) = (i,0,j)$ and $\Theta_{S_2}(i,j) = (i,1,j)$, where the constant dimension after $i$ separates the two $j$ loops.

Page 9: Enabling technology is new compiler math

• Uniform Recurrence Equations [Karp et al. 1970]
• Loop Transformations and Parallelization [1970-]
  • Many: Lamport, Allen/Kennedy, Banerjee, Irigoin, Wolfe/Lam, Pugh, Pingali, etc.
  • Mostly linear-algebraic: unimodular transformations
  • Dependence summary: direction/distance vectors
  • Vectorization, SMP, locality optimizations
• Systolic Array Mapping
• Polyhedral Model [1980-]
  • Many: Feautrier, Darte, Vivien, Wilde, Rajopadhye, etc.
  • New computational techniques based on polyhedral representations
  • Exact dependence analysis
  • General affine transformations
  • Loop synthesis via polyhedral scanning

Page 10: R-Stream model: polyhedra

    n = f();
    for (i=5; i<=n; i+=2) {
      A[i][i] = A[i][i]/B[i];
      for (j=0; j<=i; j++) {
        if (j<=10) {
          … A[i+2j+n][i+3] …
        }
      }
    }

Iteration domain of the inner statement:

$\{\, (i,j) \in \mathbb{Z}^2 \mid \exists k \in \mathbb{Z}:\ 5 \le i \le n,\ 0 \le j \le i,\ j \le 10,\ i = 2k+1 \,\}$

Access function for A[i+2j+n][i+3]:

$f_A(i,j,n) = \begin{pmatrix} 1 & 2 & 1 & 0 \\ 1 & 0 & 0 & 3 \end{pmatrix} \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix}$

• Loop code represented (exactly or conservatively) with polyhedra
• Affine and non-affine transformations: order and place of operations and data
• High-level, mathematical view of a mapping
• But targets concrete properties: parallelism, locality, memory footprint

Page 11: Polyhedral slogans

• Parametric imperfect loop nests
• Subsumes classical transformations
• Compacts the transformation search space
• Parallelization, locality optimization (communication avoiding)
• Preserves semantics
• Analytic joint formulations of optimizations
• Not just for affine static control programs

Page 12: Polyhedral model – challenges in building a compiler

• Killer math
• Scalability of optimizations/code generation
• Mostly confined to dependence-preserving transformations
• Code can be radically transformed – outputs can look wildly different
• Modeling indirections, pointers, non-affine code
• Many of these challenges are solved

Page 13: R-Stream blueprint

[Figure: EDG C front end → scalar representation → raising → polyhedral mapper (driven by a machine model) → lowering → pretty printer.]

Page 14: Inside the polyhedral mapper

[Figure: the mapper operates on a GDG representation (built on Jolylib, …). A tactics module orchestrates the optimization modules: parallelization, locality optimization, tiling, placement, communication generation, memory promotion, synchronization generation, layout optimization, polyhedral scanning, …]

Page 15: Inside the polyhedral mapper (cont.)

[Same figure as Page 14.] The optimization modules are engineered to expose "knobs" that could be used by an auto-tuner.

Page 16: Driving the mapping: the machine model

• Target machine characteristics that influence how the mapping should be done:
  • Local memory / cache sizes
  • Communication facilities: DMA, cache(s)
  • Synchronization capabilities
  • Symmetrical or not
  • SIMD width
  • Bandwidths
• Currently: a two-level model (host and accelerators)
• XML schema and graphical rendering

Page 17: Machine model example: multi-Tesla

[Figure: graphical rendering of the XML machine-model file: a host (OpenMP morph) driving several GPUs (CUDA morph), with one host thread per GPU.]

Page 18: Mapping process

1- Scheduling: parallelism, locality, tilability
2- Task formation:
   – Coarse-grain atomic tasks
   – Master/slave side operations
3- Placement: assign tasks to blocks/threads
Followed by:
   – Local / global data layout optimization
   – Multi-buffering (explicitly managed)
   – Synchronization (barriers)
   – Bulk communications
   – Thread generation → master/slave
   – CUDA-specific optimizations

A sketch of the resulting master/slave structure follows.
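As an illustration of the master/slave structure named above (our hand-written sketch, not R-Stream output; slave_kernel, master, N, and TILE are ours):

    /* Hypothetical master/slave shape after mapping; not actual R-Stream output. */
    #include <cuda_runtime.h>

    #define N 4096
    #define TILE 256

    __global__ void slave_kernel(float *A) {         /* slave side: tasks on the GPU */
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < N)
        A[i] = 2.0f * A[i];
    }

    void master(float *h_A) {                        /* master side: host code */
      float *d_A;
      cudaMalloc(&d_A, N * sizeof(float));
      /* bulk communication in, computed from the tasks' data footprint */
      cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
      slave_kernel<<<N / TILE, TILE>>>(d_A);
      /* bulk communication out */
      cudaMemcpy(h_A, d_A, N * sizeof(float), cudaMemcpyDeviceToHost);
      cudaFree(d_A);
    }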

Page 19: Program Transformations Specification (recap of Page 6)

Page 20: Model for scheduling trades 3 objectives jointly

[Figure: a triangle of tradeoffs. Loop fusion gives more locality and fewer global memory accesses; loop fission gives more parallelism and sufficient occupancy; successive-thread contiguity gives memory coalescing and better effective bandwidth. Patent pending.]

Page 21: Optimization with BLAS vs. global optimization

Optimization with BLAS (outer loop(s) around library calls):

    /* Optimization with BLAS */
    for loop {
      … BLAS call 1 …
      … BLAS call 2 …
      …
      … BLAS call n …
    }

Each call retrieves data Z from disk and stores it back, only for the next call to retrieve Z again (!) – numerous cache misses.

Global optimization:

    /* Global optimization */
    doall loop {
      …
      for loop {
        … [read from Z] …
        … [write to Z] …
        … [read from Z]
      }
      …
    }

Loop fusion can improve locality, and the outer loop(s) can be parallelized. Global optimization can expose better parallelism and locality.

Page 22: Tradeoffs between parallelism and locality

• Significant parallelism is needed to fully utilize all resources
• Locality is also critical, to minimize communication
• Parallelism can come at the expense of locality
• Our approach: the R-Stream compiler exposes parallelism via affine scheduling that simultaneously augments locality using loop fusion

[Figure: high on-chip parallelism; limited bandwidth at the chip border; reusing data once loaded on chip = locality.]

Page 23: Parallelism/locality tradeoff example

Original code (simplified CSLC LMS):

    for (k=0; k<400; k++) {
      for (i=0; i<3997; i++) {
        z[i] = 0;
        for (j=0; j<4000; j++)
          z[i] = z[i] + B[i][j]*x[k][j];
      }
      for (i=0; i<3997; i++)
        w[i] = w[i] + z[i];
    }

Maximum parallelism (no fusion):

    doall (i=0; i<400; i++)
      doall (j=0; j<3997; j++)
        z_e[j][i] = 0;
    doall (i=0; i<400; i++)
      doall (j=0; j<3997; j++)
        for (k=0; k<4000; k++)
          z_e[j][i] = z_e[j][i] + B[j][k]*x[i][k];
    doall (i=0; i<3997; i++)
      for (j=0; j<400; j++)        /* data accumulation */
        w[i] = w[i] + z_e[i][j];
    doall (i=0; i<3997; i++)
      z[i] = z_e[i][399];

Array z gets expanded (to z_e) to introduce another level of parallelism: 2 levels of parallelism, but poor data reuse (on array z_e). Maximum distribution destroys locality.

Page 24: Parallelism/locality tradeoff example (cont.)

Maximum fusion: [original code as on Page 23]

    doall (i=0; i<3997; i++)
      for (j=0; j<400; j++) {
        z[i] = 0;
        for (k=0; k<4000; k++)
          z[i] = z[i] + B[i][k]*x[j][k];
        w[i] = w[i] + z[i];
      }

Very good data reuse (on array z), but aggressive loop fusion destroys parallelism: only 1 level (1 degree) of parallelism remains.

Page 25: Parallelism/locality tradeoff example (cont.)

Parallelism with partial fusion: [original code as on Page 23]

    doall (i=0; i<3997; i++) {
      doall (j=0; j<400; j++) {
        z_e[i][j] = 0;
        for (k=0; k<4000; k++)
          z_e[i][j] = z_e[i][j] + B[i][k]*x[j][k];
      }
      for (j=0; j<400; j++)        /* data accumulation */
        w[i] = w[i] + z_e[i][j];
    }
    doall (i=0; i<3997; i++)
      z[i] = z_e[i][399];

2 levels of parallelism with good data reuse (on array z_e). Array z is expanded, and partial fusion does not decrease parallelism.

Page 26: Parallelism/locality tradeoffs: performance numbers

[Figure: performance chart.] Code with a good balance between parallelism and fusion performs best; on explicitly managed memory/scratchpad architectures this is even more pronounced.

Page 27: R-Stream: affine scheduling and fusion

• R-Stream uses a heuristic based on an objective function with several cost coefficients:
  • the slowdown in execution if a loop p is executed sequentially rather than in parallel
  • the cost in performance if two loops p and q remain unfused rather than fused
• These two cost coefficients address parallelism and locality in a unified and unbiased manner (as opposed to traditional compilers)
• Fine-grained parallelism, such as SIMD, can also be modeled using a similar formulation

The objective on the slide (reconstructed) minimizes

$\sum_{l \in \mathrm{loops}} w_l\, p_l \;+\; \sum_{e \in \mathrm{edges}} u_e\, f_e$

where $p_l$ marks loop $l$ as sequential ($w_l$: the slowdown in sequential execution) and $f_e$ marks the loops joined by edge $e$ as unfused ($u_e$: the cost of unfusing two loops). Patent pending.

Page 28: Parallelism + locality + spatial locality

The same objective, $\min \sum_l w_l\, p_l + \sum_e u_e\, f_e$ (benefits of parallel execution + benefits of improved locality), is extended: a new algorithm (unpublished) balances contiguity, modulo data-layout transformations, to enhance coalescing for GPUs and SIMDization. Hypothesis: auto-tuning should adjust these parameters.

Page 29: Outline

• R-Stream Overview
• Compilation Walk-through
• Performance Results
• Getting R-Stream

Page 30: What R-Stream does for you – in a nutshell

• Input: sequential
  – Short and simple textbook C code
  – Just add a "#pragma rstream map" and R-Stream figures out the rest

Page 31: What R-Stream does for you – in a nutshell (cont.)

• Input: sequential
  – Short and simple textbook C code
  – Just add a "#pragma rstream map" and R-Stream figures out the rest
  – Example: Gauss-Seidel 9-point stencil
    – Used in iterative PDE solvers and scientific modeling (heat, fluid flow, waves, etc.)
    – Building block for faster iterative solvers like multigrid or AMR
• Output: OpenMP + CUDA code
  – Hundreds of lines of tightly optimized GPU-side CUDA code
  – A few lines of host-side OpenMP C code
  – Very difficult to hand-optimize
  – Not available in any standard library

A sketch of such an input follows.
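For concreteness, a sketch of the kind of input meant here: a textbook 9-point Gauss-Seidel sweep with the mapping pragma spelled as in the RTM example on Page 47. This is our reconstruction, not the exact benchmark source; N, T, gs9, and u are ours:

    /* Our sketch of a textbook 9-point Gauss-Seidel sweep; not the benchmark source. */
    #define N 1024
    #define T 100

    #pragma rstream map
    void gs9(double (*u)[N]) {
      int t, i, j;
      for (t = 0; t < T; t++)
        for (i = 1; i < N - 1; i++)
          for (j = 1; j < N - 1; j++)
            u[i][j] = (u[i-1][j-1] + u[i-1][j] + u[i-1][j+1] +
                       u[i][j-1]   + u[i][j]   + u[i][j+1]   +
                       u[i+1][j-1] + u[i+1][j] + u[i+1][j+1]) / 9.0;
    }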

Page 32: What R-Stream does for you – in a nutshell (cont.)

Input (sequential textbook C plus "#pragma rstream map") → R-Stream → output (OpenMP + CUDA: hundreds of lines of tightly optimized GPU-side CUDA, a few lines of host-side OpenMP C). Will be illustrated in the next few slides.

• Achieving up to
  – 20 GFLOPS on a GTX 285
  – 25 GFLOPS on a GTX 480

Page 33: Finding and utilizing available parallelism

Extracting and mapping parallel loops: R-Stream AUTOMATICALLY finds and forms parallelism.

[Figure: GPU block diagram – SM 1 … SM N, each holding SP 1 … SP M with registers, shared memory, an instruction unit, and constant/texture caches, above the off-chip device memory (global, constant, texture) – shown alongside an excerpt of automatically generated code.]
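The generated excerpt was an image and did not survive extraction. As a stand-in, a minimal hand-written sketch of the pattern described – a doall loop mapped over blocks and threads; doall_body and all names are ours, not R-Stream's:

    /* Hand-written stand-in for the generated excerpt; not R-Stream output. */
    __global__ void doall_body(float *y, const float *x, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one doall iteration per thread */
      if (i < n)
        y[i] = x[i] * x[i];
    }
    /* host side: doall_body<<<(n + 255) / 256, 256>>>(y, x, n); */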

Page 34: Memory compaction on GPU scratchpad

R-Stream AUTOMATICALLY manages the local scratchpad.

[Figure: the same GPU block diagram, highlighting the per-SM shared memory, alongside an excerpt of automatically generated code.]
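Again as a stand-in for the lost excerpt, a minimal sketch of scratchpad management: a tile is staged into __shared__ memory, computed on, and written back. TILE and all names are ours; real generated code is far more involved:

    /* Hand-written stand-in; not R-Stream output. Launch with blockDim.x == TILE. */
    #define TILE 256

    __global__ void stage_tile(float *a, int n) {
      __shared__ float buf[TILE];              /* compacted copy on the scratchpad */
      int i = blockIdx.x * TILE + threadIdx.x;
      if (i < n) buf[threadIdx.x] = a[i];      /* global memory -> scratchpad */
      __syncthreads();
      if (i < n) buf[threadIdx.x] *= 2.0f;     /* compute on the scratchpad copy */
      __syncthreads();
      if (i < n) a[i] = buf[threadIdx.x];      /* scratchpad -> global memory */
    }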

Page 35: GPU DRAM to scratchpad coalesced communication

Coalesced GPU DRAM accesses: R-Stream AUTOMATICALLY chooses parallelism to favor coalescing.

[Figure: the same GPU block diagram, highlighting the path from off-chip device memory to shared memory, alongside an excerpt of automatically generated code.]
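A stand-in sketch of the property being engineered (names are ours): consecutive threads touch consecutive addresses, so each warp's accesses coalesce into few DRAM transactions:

    /* Hand-written stand-in; not R-Stream output. */
    __global__ void coalesced_copy(const float *g, float *out, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i >= n) return;
      /* Coalesced: consecutive threads read consecutive addresses, so a warp's
         32 loads merge into a few DRAM transactions. A strided access such as
         g[i * 32] would instead cost roughly one transaction per thread. */
      out[i] = g[i];
    }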

Page 36: Host-to-GPU communication

R-Stream AUTOMATICALLY chooses the partition and sets up host-to-GPU communication.

[Figure: host memory and CPU linked over PCI Express to the GPU block diagram, alongside an excerpt of automatically generated code.]
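A stand-in sketch of host-side communication setup. The page-locked (pinned) staging buffer is our choice for faster PCI Express transfers, not a claim about what R-Stream emits; all names are ours:

    /* Hand-written stand-in; not R-Stream output. */
    #include <string.h>
    #include <cuda_runtime.h>

    void to_gpu_and_back(const float *src, float *dst, float *d_buf, size_t n) {
      float *pinned;
      cudaMallocHost(&pinned, n * sizeof(float));   /* page-locked host buffer */
      memcpy(pinned, src, n * sizeof(float));
      /* the partition the GPU tasks touch crosses PCI Express once, in bulk */
      cudaMemcpy(d_buf, pinned, n * sizeof(float), cudaMemcpyHostToDevice);
      /* ... kernel launches ... */
      cudaMemcpy(pinned, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
      memcpy(dst, pinned, n * sizeof(float));
      cudaFreeHost(pinned);
    }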

Page 37: Multi-GPU mapping

Mapping across all GPUs: R-Stream AUTOMATICALLY finds another level of parallelism, across GPUs.

[Figure: CPUs with host memory driving several GPUs, each with its own GPU memory, alongside an excerpt of automatically generated code.]
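A stand-in sketch of the "one host thread per GPU" scheme from the machine model on Page 17: OpenMP threads each select a device and move their slice of the data. map_across_gpus and all names are ours:

    /* Hand-written stand-in; not R-Stream output. Assumes ngpus divides n. */
    #include <omp.h>
    #include <cuda_runtime.h>

    void map_across_gpus(float *h_data, int n, int ngpus) {
      int chunk = n / ngpus;
      #pragma omp parallel num_threads(ngpus)
      {
        int g = omp_get_thread_num();
        float *d;
        cudaSetDevice(g);                      /* one host thread per GPU */
        cudaMalloc(&d, chunk * sizeof(float));
        cudaMemcpy(d, h_data + g * chunk, chunk * sizeof(float),
                   cudaMemcpyHostToDevice);
        /* ... launch this GPU's share of the tasks ... */
        cudaMemcpy(h_data + g * chunk, d, chunk * sizeof(float),
                   cudaMemcpyDeviceToHost);
        cudaFree(d);
      }
    }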

Page 38: Multi-GPU mapping (cont.)

Multi-streaming of host-GPU communication: R-Stream AUTOMATICALLY creates n-way software pipelines for communications.

[Figure: the same multi-GPU diagram, alongside an excerpt of automatically generated code.]
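A stand-in sketch of an n-way software pipeline: chunked asynchronous copies are round-robined over CUDA streams so the transfer of one chunk overlaps the compute of another. WAYS, CHUNK, and all names are ours:

    /* Hand-written stand-in; not R-Stream output. h must be page-locked. */
    #include <cuda_runtime.h>

    #define WAYS 2                 /* n-way pipeline depth */
    #define CHUNK (1 << 20)        /* elements per chunk   */

    __global__ void work(float *buf, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) buf[i] += 1.0f;
    }

    void pipeline(const float *h, float *d, int nchunks) {
      cudaStream_t s[WAYS];
      for (int i = 0; i < WAYS; i++) cudaStreamCreate(&s[i]);
      for (int c = 0; c < nchunks; c++) {
        cudaStream_t st = s[c % WAYS];        /* round-robin over the ways */
        /* the copy of chunk c overlaps the compute of the previous chunk */
        cudaMemcpyAsync(d + (size_t)c * CHUNK, h + (size_t)c * CHUNK,
                        CHUNK * sizeof(float), cudaMemcpyHostToDevice, st);
        work<<<CHUNK / 256, 256, 0, st>>>(d + (size_t)c * CHUNK, CHUNK);
      }
      for (int i = 0; i < WAYS; i++) {
        cudaStreamSynchronize(s[i]);
        cudaStreamDestroy(s[i]);
      }
    }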

Page 39: Future capabilities – mapping to CPU-GPU clusters

[Figure: a program mapped onto CPU+GPU nodes joined by a high-speed interconnect (e.g., InfiniBand). Each node holds CPUs with DRAM and several GPUs with their own DRAMs; on each node, an MPI process hosts an OpenMP process launching CUDA.]

Page 40: Outline

• R-Stream Overview
• Compilation Walk-through
• Performance Results
• Getting R-Stream

Page 41: Experimental evaluation

• Main comparisons:
  • R-Stream High-Level C Transformation Tool 3.1.2
  • Intel MKL 10.2.1

Configuration 1 (MKL): radar code with MKL calls.
Configuration 2 (low-level compilers): radar code compiled directly with GCC / ICC.
Configuration 3 (R-Stream): radar code → R-Stream → optimized radar code → GCC / ICC.

Page 42: Experimental evaluation (cont.)

• Intel Xeon workstation:
  • Dual quad-core E5405 Xeon processors (8 cores total)
  • 9 GB memory
• 8 OpenMP threads
• Single-precision floating-point data
• Low-level compilers and the flags used:
  • GCC: -O6 -fno-trapping-math -ftree-vectorize -msse3 -fopenmp
  • ICC: -fast -openmp

Page 43: Radar benchmarks

• Beamforming algorithms:
  • MVDR-SER: Minimum Variance Distortionless Response using Sequential Regression
  • CSLC-LMS: Coherent Sidelobe Cancellation using Least Mean Square
  • CSLC-RLS: Coherent Sidelobe Cancellation using Robust Least Square
• Expressed in sequential ANSI C
• 400 radar iterations
• Compute 3 radar sidelobes (for CSLC-LMS and CSLC-RLS)

Page 44: MVDR-SER

[Figure: performance chart.]

Page 45: CSLC-LMS

[Figure: performance chart.]

Page 46: CSLC-RLS

[Figure: performance chart.]

Page 47: 3D Discretized wave equation input code (RTM)

A 25-point, 8th-order (in space) stencil:

    #pragma rstream map
    void RTM_3D(double (*U1)[Y][X], double (*U2)[Y][X], double (*V)[Y][X],
                int pX, int pY, int pZ) {
      /* C0..C4: stencil coefficients defined elsewhere */
      double temp;
      int i, j, k;
      for (k=4; k<pZ-4; k++) {
        for (j=4; j<pY-4; j++) {
          for (i=4; i<pX-4; i++) {
            temp = C0 * U2[k][j][i] +
                   C1 * (U2[k-1][j][i] + U2[k+1][j][i] +
                         U2[k][j-1][i] + U2[k][j+1][i] +
                         U2[k][j][i-1] + U2[k][j][i+1]) +
                   C2 * (U2[k-2][j][i] + U2[k+2][j][i] +
                         U2[k][j-2][i] + U2[k][j+2][i] +
                         U2[k][j][i-2] + U2[k][j][i+2]) +
                   C3 * (U2[k-3][j][i] + U2[k+3][j][i] +
                         U2[k][j-3][i] + U2[k][j+3][i] +
                         U2[k][j][i-3] + U2[k][j][i+3]) +
                   C4 * (U2[k-4][j][i] + U2[k+4][j][i] +
                         U2[k][j-4][i] + U2[k][j+4][i] +
                         U2[k][j][i-4] + U2[k][j][i+4]);
            U1[k][j][i] = 2.0f * U2[k][j][i] - U1[k][j][i] + V[k][j][i] * temp;
          }
        }
      }
    }

Page 48: 3D Discretized wave equation mapped code (RTM)

Not so naïve …

[Figure: excerpt of the generated CUDA, annotated: communication autotuning knobs; threadIdx.x divergence is expensive.]

Page 49: Outline

• R-Stream Overview
• Compilation Walk-through
• Performance Results
• Getting R-Stream

Page 50: Current status

• Ongoing development also supported by DOE, Reservoir
  • Improvements in scope, stability, performance
• Installations/evaluations at US government laboratories
• Going forward, collaboration with Georgia Tech on Keeneland
  • HP SL390 – 3 Fermi GPUs, 2 Westmeres per node
• Basis of the compiler for the DARPA UHPC Intel Corporation team

Page 51: Availability

• Per-developer-seat licensing model
• Support (releases, bug fixes, services)
• Available with commercial-grade external solvers
• Government has limited rights / SBIR data rights
• Academic source licenses with collaborators
• Professional team, continuity, software engineering

Page 52: R-Stream Gives More Insight About Programs

• Teaches you about parallelism and the polyhedral model:
  • Generates correct code for any transformation
    – The transformation may be incorrect if specified by the user
  • Imperative code generation was the bottleneck until 2005
    – Three theses have been written on the topic, and it is still not completely covered
    – It is good to have an intuition of how these things work
• R-Stream has meaningful metrics to represent your program:
  • Maximal amount of parallelism given minimal expansion
  • Tradeoffs between coarse-grained and fine-grained parallelism
  • Loop types (doall, red, perm, seq)
• Helps with algorithm selection
• Tiling of imperfectly nested loops
• Generates code for explicitly managed memory

Page 53: How to Use R-Stream Successfully

• R-Stream is a great transformation tool, and also a great learning tool:
  • Takes "simple C" input code
  • Applies multiple transformations and lists them explicitly
  • Can dump code at any step in the process
  • It's the tool I wish I had during my PhD
    – To be fair, I already had a great set of tools
• R-Stream can be used in multiple modes:
  • Fully automatic + compile-flag options
  • Autotuning mode (more than just tile size and unrolling …)
  • Scripting / programmatic mode (BeanShell + interfaces)
  • A mix of these modes + manual post-processing

Page 54: Use Case: Fully Automatic Mode + Compile Flag Options

• Akin to traditional gcc / icc compilation with flags:
  • Predefined transformations can be parameterized
  • Except with fewer (or no) phase-ordering issues
  • Except you can see what each transformation does at each step
  • Except you can generate compilable and executable code at (almost) any step in the process
  • Except you can control code generation for compactness or performance

Page 55: Use Case: Autotuning Mode

• More advanced than traditional approaches:
  • Knobs go far beyond loop unrolling + unroll-and-jam + tiling
  • Knobs are based on well-understood models
  • Knobs target high-level properties of the program
    – Amount of parallelism, amount of memory expansion, depth of pipelining of communications and computations …
  • Knobs depend on the target machine, the program, and the state of the mapping process
    – Our tool has introspection
• We are really building a hierarchical autotuning transformation tool

Page 56: Use Case: "Power User" Interactive Interface

• BeanShell access to optimizations
• Can direct and review the process of compilation
  • Automatic tactics (affine scheduling)
• Can direct code generation
• Access to "tuning parameters"
• All options / commands available on the command-line interface

Page 57: Conclusion

• R-Stream simplifies software development and maintenance
• Porting: reduces expense and delivery delays
• Does this by automatically parallelizing loop code
  • While optimizing for data locality, coalescing, etc.
• Addresses dense, loop-intensive computations
• Extensions:
  • Data-parallel programming idioms
  • Sparse representations
  • Dynamic runtime execution

Page 58: Contact us

• Per-developer seat, floating, cloud-based licensing
• Discounts for academic users
• Research collaborations with academic partners
• For more information:
  • Call us at 212-780-0527, or
  • See Rich, Ann
  • E-mail us at {sales,lethin,johnson}@reservoir.com