
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Programming GPUs without Writing a Line of CUDA (Nicolas Vasilache, Reservoir Labs)

Description: http://cs264.org http://j.mp/fmzhYl

Page 1: The R-Stream High-Level Program Transformation Tool

Reservoir Labs, Harvard, 04/12/2011

N. Vasilache, B. Meister, M. Baskaran, A. Hartono, R. Lethin

Page 2: Outline

• R-Stream Overview
• Compilation Walk-through
• Performance Results
• Getting R-Stream

Page 3: Power efficiency driving architectures

[Figure: a grid of accelerator tiles, each combining a GPP, several SIMD units, DMA engines, an FPGA, and a local memory.]

• Heterogeneous processing
• Distributed local memories
• Explicitly managed architecture
• NUMA
• Bandwidth starved
• Hierarchical (including board, chassis, cabinet)
• Multiple execution models
• Mixed parallelism types
• Multiple spatial dimensions

Page 4: Computation choreography

• Expressing it
  • Annotations and pragma dialects for C
  • Explicitly (e.g., new languages like CUDA and OpenCL)
• But before expressing it, how can programmers find it?
  • Manually: constructive procedures, art, sweat, time
    – Artisans get complete control over every detail
  • Automatically (our focus)
    – An operations research problem, like scheduling trucks to save fuel
    – Model, solve, implement
    – Faster, and sometimes better, than a human

Page 5: How to do automatic scheduling?

• Naïve approach
  • Model
    – Tasks, dependences
    – Resource use, latencies
    – Machine model with connectivity, resource capacities
  • Solve with ILP (sketched below)
    – Minimize overall task length
    – Subject to dependences, resource use
  • Problems
    – Complexity: the task graph is huge!
    – Dynamics: loop lengths are unknown.
• So we do something much cooler.
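A minimal sketch of that naïve ILP formulation (our notation, not R-Stream's): give each task $i$ a start time $t_i$ and a duration $d_i$, and minimize the makespan $T$:

$\min T \quad \text{s.t.} \quad t_j \ge t_i + d_i \ \text{for each dependence } i \to j, \qquad t_i + d_i \le T \ \text{for each task } i,$

plus resource-capacity constraints at each time step. One set of variables per task is exactly why a huge task graph makes this intractable.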

Page 6: Program Transformations Specification

[Figure: the iteration space of a statement S(i,j), with axes i and j, mapped by a schedule $\Theta : \mathbb{Z}^2 \to \mathbb{Z}^2$ onto time axes $t_1$ and $t_2$.]

• Schedule maps iterations to multi-dimensional time (worked example below)
  • A feasible schedule preserves dependences
• Placement maps iterations to multi-dimensional space
  • UHPC in progress, partially done
• Layout maps data elements to multi-dimensional space
  • UHPC in progress
• Hierarchical by design; tiling serves separation of concerns
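For concreteness, a small feasibility example (ours, not from the slide): the interchange schedule $\Theta(i,j) = (j,i)$ is feasible iff every dependence still executes in lexicographically increasing time,

$\forall\, (i,j) \to (i',j'): \quad \Theta(i',j') \succ_{\mathrm{lex}} \Theta(i,j),$

so a dependence $(i,j) \to (i,j+1)$ survives interchange, while $(i,j) \to (i+1,j-1)$ forbids it.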

Page 7: Loop transformations

Original loop nest:

    for (i=0; i<N; i++)
      for (j=0; j<N; j++)
        S(i,j);

Permutation (unimodular), $\Theta(i,j) = \begin{pmatrix}0&1\\1&0\end{pmatrix}\binom{i}{j} = (j,i)$:

    for (j=0; j<N; j++)
      for (i=0; i<N; i++)
        S(i,j);

Reversal (unimodular), $\Theta(i,j) = \begin{pmatrix}-1&0\\0&1\end{pmatrix}\binom{i}{j} = (-i,j)$:

    for (i=N-1; i>=0; i--)
      for (j=0; j<N; j++)
        S(i,j);

Skewing by a factor $\alpha$ (unimodular), $\Theta(i,j) = \begin{pmatrix}1&0\\\alpha&1\end{pmatrix}\binom{i}{j} = (i,\ \alpha i + j)$:

    for (i=0; i<N; i++)
      for (j=alpha*i; j<N+alpha*i; j++)
        S(i, j-alpha*i);

Scaling by a factor $\alpha$ (not unimodular), $\Theta(i,j) = \begin{pmatrix}\alpha&0\\0&1\end{pmatrix}\binom{i}{j} = (\alpha i,\ j)$:

    for (i=0; i<alpha*N; i+=alpha)
      for (j=0; j<N; j++)
        S(i/alpha, j);

(The concrete constant factors were lost in extraction; $\alpha$ stands in for them.)

Page 8: Loop fusion and distribution

Distributed form:

    for (i=0; i<N; i++) {
      for (j=0; j<N; j++)
        S1(i,j);
      for (j=0; j<N; j++)
        S2(i,j);
    }

Fused form:

    for (i=0; i<N; i++)
      for (j=0; j<N; j++) {
        S1(i,j);
        S2(i,j);
      }

In schedule form (reconstructed with the usual extra constant time dimensions): fusion gives $\Theta_{S_1}(i,j) = (i,j,0)$ and $\Theta_{S_2}(i,j) = (i,j,1)$, so both statements share the $i$ and $j$ loops; distribution gives $\Theta_{S_1}(i,j) = (i,0,j)$ and $\Theta_{S_2}(i,j) = (i,1,j)$, where the constant dimension after $i$ separates the two $j$ loops.

Page 9: Enabling technology is new compiler math

• Uniform Recurrence Equations [Karp et al. 1970]
• Loop Transformations and Parallelization [1970-]
  • Many: Lamport, Allen/Kennedy, Banerjee, Irigoin, Wolfe/Lam, Pugh, Pingali, etc.
  • Mostly linear-algebraic: unimodular transformations
  • Dependence summary: direction/distance vectors
  • Vectorization, SMP, locality optimizations
• Systolic Array Mapping
• Polyhedral Model [1980-]
  • Many: Feautrier, Darte, Vivien, Wilde, Rajopadhye, etc.
  • New computational techniques based on polyhedral representations
  • Exact dependence analysis
  • General affine transformations
  • Loop synthesis via polyhedral scanning

Page 10: R-Stream model: polyhedra

    n = f();
    for (i=5; i<=n; i+=2) {
      A[i][i] = A[i][i]/B[i];
      for (j=0; j<=i; j++) {
        if (j<=10) {
          … A[i+2j+n][i+3] …
        }
      }
    }

Iteration domain of the inner statement:

$\{\, (i,j) \in \mathbb{Z}^2 \mid \exists k \in \mathbb{Z}:\ 5 \le i \le n,\ 0 \le j \le i,\ j \le 10,\ i = 2k+1 \,\}$

Access function for A[i+2j+n][i+3]:

$f_A(i,j,n) = \begin{pmatrix} 1 & 2 & 1 & 0 \\ 1 & 0 & 0 & 3 \end{pmatrix} \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix}$

• Loop code represented (exactly or conservatively) with polyhedra
• Affine and non-affine transformations: order and place of operations and data
• High-level, mathematical view of a mapping
• But targets concrete properties: parallelism, locality, memory footprint

Page 11: Polyhedral slogans

• Parametric imperfect loop nests
• Subsumes classical transformations
• Compacts the transformation search space
• Parallelization, locality optimization (communication avoiding)
• Preserves semantics
• Analytic joint formulations of optimizations
• Not just for affine static control programs

Page 12: Polyhedral model – challenges in building a compiler

• Killer math
• Scalability of optimizations/code generation
• Mostly confined to dependence-preserving transformations
• Code can be radically transformed – outputs can look wildly different
• Modeling indirections, pointers, non-affine code
• Many of these challenges are solved

Page 13: R-Stream blueprint

[Figure: EDG C front end → scalar representation → raising → polyhedral mapper (driven by a machine model) → lowering → pretty printer.]

Page 14: Inside the polyhedral mapper

[Figure: the mapper operates on a GDG representation (built on Jolylib, …). A tactics module orchestrates the optimization modules: parallelization, locality optimization, tiling, placement, communication generation, memory promotion, synchronization generation, layout optimization, polyhedral scanning, …]

Page 15: Inside the polyhedral mapper (cont.)

[Same figure as Page 14.] The optimization modules are engineered to expose "knobs" that could be used by an auto-tuner.

Page 16: Driving the mapping: the machine model

• Target machine characteristics that influence how the mapping should be done:
  • Local memory / cache sizes
  • Communication facilities: DMA, cache(s)
  • Synchronization capabilities
  • Symmetrical or not
  • SIMD width
  • Bandwidths
• Currently: a two-level model (host and accelerators)
• XML schema and graphical rendering

Page 17: Machine model example: multi-Tesla

[Figure: graphical rendering of the XML machine-model file: a host (OpenMP morph) driving several GPUs (CUDA morph), with one host thread per GPU.]

Page 18: Mapping process

1- Scheduling: parallelism, locality, tilability
2- Task formation:
   – Coarse-grain atomic tasks
   – Master/slave side operations
3- Placement: assign tasks to blocks/threads
Followed by:
   – Local / global data layout optimization
   – Multi-buffering (explicitly managed)
   – Synchronization (barriers)
   – Bulk communications
   – Thread generation → master/slave
   – CUDA-specific optimizations

A sketch of the resulting master/slave structure follows.
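As an illustration of the master/slave structure named above (our hand-written sketch, not R-Stream output; slave_kernel, master, N, and TILE are ours):

    /* Hypothetical master/slave shape after mapping; not actual R-Stream output. */
    #include <cuda_runtime.h>

    #define N 4096
    #define TILE 256

    __global__ void slave_kernel(float *A) {         /* slave side: tasks on the GPU */
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < N)
        A[i] = 2.0f * A[i];
    }

    void master(float *h_A) {                        /* master side: host code */
      float *d_A;
      cudaMalloc(&d_A, N * sizeof(float));
      /* bulk communication in, computed from the tasks' data footprint */
      cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
      slave_kernel<<<N / TILE, TILE>>>(d_A);
      /* bulk communication out */
      cudaMemcpy(h_A, d_A, N * sizeof(float), cudaMemcpyDeviceToHost);
      cudaFree(d_A);
    }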

Page 19: Program Transformations Specification (recap of Page 6)

Page 20: Model for scheduling trades 3 objectives jointly

[Figure: a triangle of tradeoffs. Loop fusion gives more locality and fewer global memory accesses; loop fission gives more parallelism and sufficient occupancy; successive-thread contiguity gives memory coalescing and better effective bandwidth. Patent pending.]

Page 21: Optimization with BLAS vs. global optimization

Optimization with BLAS (outer loop(s) around library calls):

    /* Optimization with BLAS */
    for loop {
      … BLAS call 1 …
      … BLAS call 2 …
      …
      … BLAS call n …
    }

Each call retrieves data Z from disk and stores it back, only for the next call to retrieve Z again (!) – numerous cache misses.

Global optimization:

    /* Global optimization */
    doall loop {
      …
      for loop {
        … [read from Z] …
        … [write to Z] …
        … [read from Z]
      }
      …
    }

Loop fusion can improve locality, and the outer loop(s) can be parallelized. Global optimization can expose better parallelism and locality.

Page 22: Tradeoffs between parallelism and locality

• Significant parallelism is needed to fully utilize all resources
• Locality is also critical, to minimize communication
• Parallelism can come at the expense of locality
• Our approach: the R-Stream compiler exposes parallelism via affine scheduling that simultaneously augments locality using loop fusion

[Figure: high on-chip parallelism; limited bandwidth at the chip border; reusing data once loaded on chip = locality.]

Page 23: Parallelism/locality tradeoff example

Original code (simplified CSLC LMS):

    for (k=0; k<400; k++) {
      for (i=0; i<3997; i++) {
        z[i] = 0;
        for (j=0; j<4000; j++)
          z[i] = z[i] + B[i][j]*x[k][j];
      }
      for (i=0; i<3997; i++)
        w[i] = w[i] + z[i];
    }

Maximum parallelism (no fusion):

    doall (i=0; i<400; i++)
      doall (j=0; j<3997; j++)
        z_e[j][i] = 0;
    doall (i=0; i<400; i++)
      doall (j=0; j<3997; j++)
        for (k=0; k<4000; k++)
          z_e[j][i] = z_e[j][i] + B[j][k]*x[i][k];
    doall (i=0; i<3997; i++)
      for (j=0; j<400; j++)        /* data accumulation */
        w[i] = w[i] + z_e[i][j];
    doall (i=0; i<3997; i++)
      z[i] = z_e[i][399];

Array z gets expanded (to z_e) to introduce another level of parallelism: 2 levels of parallelism, but poor data reuse (on array z_e). Maximum distribution destroys locality.

Page 24: Parallelism/locality tradeoff example (cont.)

Maximum fusion: [original code as on Page 23]

    doall (i=0; i<3997; i++)
      for (j=0; j<400; j++) {
        z[i] = 0;
        for (k=0; k<4000; k++)
          z[i] = z[i] + B[i][k]*x[j][k];
        w[i] = w[i] + z[i];
      }

Very good data reuse (on array z), but aggressive loop fusion destroys parallelism: only 1 level (1 degree) of parallelism remains.

Page 25: Parallelism/locality tradeoff example (cont.)

Parallelism with partial fusion: [original code as on Page 23]

    doall (i=0; i<3997; i++) {
      doall (j=0; j<400; j++) {
        z_e[i][j] = 0;
        for (k=0; k<4000; k++)
          z_e[i][j] = z_e[i][j] + B[i][k]*x[j][k];
      }
      for (j=0; j<400; j++)        /* data accumulation */
        w[i] = w[i] + z_e[i][j];
    }
    doall (i=0; i<3997; i++)
      z[i] = z_e[i][399];

2 levels of parallelism with good data reuse (on array z_e). Array z is expanded, and partial fusion does not decrease parallelism.

Page 26: Parallelism/locality tradeoffs: performance numbers

[Figure: performance chart.] Code with a good balance between parallelism and fusion performs best; on explicitly managed memory/scratchpad architectures this is even more pronounced.

Page 27: R-Stream: affine scheduling and fusion

• R-Stream uses a heuristic based on an objective function with several cost coefficients:
  • the slowdown in execution if a loop p is executed sequentially rather than in parallel
  • the cost in performance if two loops p and q remain unfused rather than fused
• These two cost coefficients address parallelism and locality in a unified and unbiased manner (as opposed to traditional compilers)
• Fine-grained parallelism, such as SIMD, can also be modeled using a similar formulation

The objective on the slide (reconstructed) minimizes

$\sum_{l \in \mathrm{loops}} w_l\, p_l \;+\; \sum_{e \in \mathrm{edges}} u_e\, f_e$

where $p_l$ marks loop $l$ as sequential ($w_l$: the slowdown in sequential execution) and $f_e$ marks the loops joined by edge $e$ as unfused ($u_e$: the cost of unfusing two loops). Patent pending.

Page 28: Parallelism + locality + spatial locality

The same objective, $\min \sum_l w_l\, p_l + \sum_e u_e\, f_e$ (benefits of parallel execution + benefits of improved locality), is extended: a new algorithm (unpublished) balances contiguity, modulo data-layout transformations, to enhance coalescing for GPUs and SIMDization. Hypothesis: auto-tuning should adjust these parameters.

Page 29: Outline

• R-Stream Overview
• Compilation Walk-through
• Performance Results
• Getting R-Stream

Page 30: What R-Stream does for you – in a nutshell

• Input: sequential
  – Short and simple textbook C code
  – Just add a "#pragma rstream map" and R-Stream figures out the rest

Page 31: What R-Stream does for you – in a nutshell (cont.)

• Input: sequential
  – Short and simple textbook C code
  – Just add a "#pragma rstream map" and R-Stream figures out the rest
  – Example: Gauss-Seidel 9-point stencil
    – Used in iterative PDE solvers and scientific modeling (heat, fluid flow, waves, etc.)
    – Building block for faster iterative solvers like multigrid or AMR
• Output: OpenMP + CUDA code
  – Hundreds of lines of tightly optimized GPU-side CUDA code
  – A few lines of host-side OpenMP C code
  – Very difficult to hand-optimize
  – Not available in any standard library

A sketch of such an input follows.
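For concreteness, a sketch of the kind of input meant here: a textbook 9-point Gauss-Seidel sweep with the mapping pragma spelled as in the RTM example on Page 47. This is our reconstruction, not the exact benchmark source; N, T, gs9, and u are ours:

    /* Our sketch of a textbook 9-point Gauss-Seidel sweep; not the benchmark source. */
    #define N 1024
    #define T 100

    #pragma rstream map
    void gs9(double (*u)[N]) {
      int t, i, j;
      for (t = 0; t < T; t++)
        for (i = 1; i < N - 1; i++)
          for (j = 1; j < N - 1; j++)
            u[i][j] = (u[i-1][j-1] + u[i-1][j] + u[i-1][j+1] +
                       u[i][j-1]   + u[i][j]   + u[i][j+1]   +
                       u[i+1][j-1] + u[i+1][j] + u[i+1][j+1]) / 9.0;
    }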

Page 32: What R-Stream does for you – in a nutshell (cont.)

Input (sequential textbook C plus "#pragma rstream map") → R-Stream → output (OpenMP + CUDA: hundreds of lines of tightly optimized GPU-side CUDA, a few lines of host-side OpenMP C). Will be illustrated in the next few slides.

• Achieving up to
  – 20 GFLOPS on a GTX 285
  – 25 GFLOPS on a GTX 480

Page 33: Finding and utilizing available parallelism

Extracting and mapping parallel loops: R-Stream AUTOMATICALLY finds and forms parallelism.

[Figure: GPU block diagram – SM 1 … SM N, each holding SP 1 … SP M with registers, shared memory, an instruction unit, and constant/texture caches, above the off-chip device memory (global, constant, texture) – shown alongside an excerpt of automatically generated code.]
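The generated excerpt was an image and did not survive extraction. As a stand-in, a minimal hand-written sketch of the pattern described – a doall loop mapped over blocks and threads; doall_body and all names are ours, not R-Stream's:

    /* Hand-written stand-in for the generated excerpt; not R-Stream output. */
    __global__ void doall_body(float *y, const float *x, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one doall iteration per thread */
      if (i < n)
        y[i] = x[i] * x[i];
    }
    /* host side: doall_body<<<(n + 255) / 256, 256>>>(y, x, n); */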

Page 34: Memory compaction on GPU scratchpad

R-Stream AUTOMATICALLY manages the local scratchpad.

[Figure: the same GPU block diagram, highlighting the per-SM shared memory, alongside an excerpt of automatically generated code.]
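Again as a stand-in for the lost excerpt, a minimal sketch of scratchpad management: a tile is staged into __shared__ memory, computed on, and written back. TILE and all names are ours; real generated code is far more involved:

    /* Hand-written stand-in; not R-Stream output. Launch with blockDim.x == TILE. */
    #define TILE 256

    __global__ void stage_tile(float *a, int n) {
      __shared__ float buf[TILE];              /* compacted copy on the scratchpad */
      int i = blockIdx.x * TILE + threadIdx.x;
      if (i < n) buf[threadIdx.x] = a[i];      /* global memory -> scratchpad */
      __syncthreads();
      if (i < n) buf[threadIdx.x] *= 2.0f;     /* compute on the scratchpad copy */
      __syncthreads();
      if (i < n) a[i] = buf[threadIdx.x];      /* scratchpad -> global memory */
    }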

Page 35: GPU DRAM to scratchpad coalesced communication

Coalesced GPU DRAM accesses: R-Stream AUTOMATICALLY chooses parallelism to favor coalescing.

[Figure: the same GPU block diagram, highlighting the path from off-chip device memory to shared memory, alongside an excerpt of automatically generated code.]
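A stand-in sketch of the property being engineered (names are ours): consecutive threads touch consecutive addresses, so each warp's accesses coalesce into few DRAM transactions:

    /* Hand-written stand-in; not R-Stream output. */
    __global__ void coalesced_copy(const float *g, float *out, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i >= n) return;
      /* Coalesced: consecutive threads read consecutive addresses, so a warp's
         32 loads merge into a few DRAM transactions. A strided access such as
         g[i * 32] would instead cost roughly one transaction per thread. */
      out[i] = g[i];
    }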

Page 36: Host-to-GPU communication

R-Stream AUTOMATICALLY chooses the partition and sets up host-to-GPU communication.

[Figure: host memory and CPU linked over PCI Express to the GPU block diagram, alongside an excerpt of automatically generated code.]
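A stand-in sketch of host-side communication setup. The page-locked (pinned) staging buffer is our choice for faster PCI Express transfers, not a claim about what R-Stream emits; all names are ours:

    /* Hand-written stand-in; not R-Stream output. */
    #include <string.h>
    #include <cuda_runtime.h>

    void to_gpu_and_back(const float *src, float *dst, float *d_buf, size_t n) {
      float *pinned;
      cudaMallocHost(&pinned, n * sizeof(float));   /* page-locked host buffer */
      memcpy(pinned, src, n * sizeof(float));
      /* the partition the GPU tasks touch crosses PCI Express once, in bulk */
      cudaMemcpy(d_buf, pinned, n * sizeof(float), cudaMemcpyHostToDevice);
      /* ... kernel launches ... */
      cudaMemcpy(pinned, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
      memcpy(dst, pinned, n * sizeof(float));
      cudaFreeHost(pinned);
    }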

Page 37: Multi-GPU mapping

Mapping across all GPUs: R-Stream AUTOMATICALLY finds another level of parallelism, across GPUs.

[Figure: CPUs with host memory driving several GPUs, each with its own GPU memory, alongside an excerpt of automatically generated code.]
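A stand-in sketch of the "one host thread per GPU" scheme from the machine model on Page 17: OpenMP threads each select a device and move their slice of the data. map_across_gpus and all names are ours:

    /* Hand-written stand-in; not R-Stream output. Assumes ngpus divides n. */
    #include <omp.h>
    #include <cuda_runtime.h>

    void map_across_gpus(float *h_data, int n, int ngpus) {
      int chunk = n / ngpus;
      #pragma omp parallel num_threads(ngpus)
      {
        int g = omp_get_thread_num();
        float *d;
        cudaSetDevice(g);                      /* one host thread per GPU */
        cudaMalloc(&d, chunk * sizeof(float));
        cudaMemcpy(d, h_data + g * chunk, chunk * sizeof(float),
                   cudaMemcpyHostToDevice);
        /* ... launch this GPU's share of the tasks ... */
        cudaMemcpy(h_data + g * chunk, d, chunk * sizeof(float),
                   cudaMemcpyDeviceToHost);
        cudaFree(d);
      }
    }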

Page 38: Multi-GPU mapping (cont.)

Multi-streaming of host-GPU communication: R-Stream AUTOMATICALLY creates n-way software pipelines for communications.

[Figure: the same multi-GPU diagram, alongside an excerpt of automatically generated code.]
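A stand-in sketch of an n-way software pipeline: chunked asynchronous copies are round-robined over CUDA streams so the transfer of one chunk overlaps the compute of another. WAYS, CHUNK, and all names are ours:

    /* Hand-written stand-in; not R-Stream output. h must be page-locked. */
    #include <cuda_runtime.h>

    #define WAYS 2                 /* n-way pipeline depth */
    #define CHUNK (1 << 20)        /* elements per chunk   */

    __global__ void work(float *buf, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) buf[i] += 1.0f;
    }

    void pipeline(const float *h, float *d, int nchunks) {
      cudaStream_t s[WAYS];
      for (int i = 0; i < WAYS; i++) cudaStreamCreate(&s[i]);
      for (int c = 0; c < nchunks; c++) {
        cudaStream_t st = s[c % WAYS];        /* round-robin over the ways */
        /* the copy of chunk c overlaps the compute of the previous chunk */
        cudaMemcpyAsync(d + (size_t)c * CHUNK, h + (size_t)c * CHUNK,
                        CHUNK * sizeof(float), cudaMemcpyHostToDevice, st);
        work<<<CHUNK / 256, 256, 0, st>>>(d + (size_t)c * CHUNK, CHUNK);
      }
      for (int i = 0; i < WAYS; i++) {
        cudaStreamSynchronize(s[i]);
        cudaStreamDestroy(s[i]);
      }
    }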

Page 39: Future capabilities – mapping to CPU-GPU clusters

[Figure: a program mapped onto CPU+GPU nodes joined by a high-speed interconnect (e.g., InfiniBand). Each node holds CPUs with DRAM and several GPUs with their own DRAMs; on each node, an MPI process hosts an OpenMP process launching CUDA.]

Page 40: Outline

• R-Stream Overview
• Compilation Walk-through
• Performance Results
• Getting R-Stream

Page 41: Experimental evaluation

• Main comparisons:
  • R-Stream High-Level C Transformation Tool 3.1.2
  • Intel MKL 10.2.1

Configuration 1 (MKL): radar code with MKL calls.
Configuration 2 (low-level compilers): radar code compiled directly with GCC / ICC.
Configuration 3 (R-Stream): radar code → R-Stream → optimized radar code → GCC / ICC.

Page 42: Experimental evaluation (cont.)

• Intel Xeon workstation:
  • Dual quad-core E5405 Xeon processors (8 cores total)
  • 9 GB memory
• 8 OpenMP threads
• Single-precision floating-point data
• Low-level compilers and the flags used:
  • GCC: -O6 -fno-trapping-math -ftree-vectorize -msse3 -fopenmp
  • ICC: -fast -openmp

Page 43: Radar benchmarks

• Beamforming algorithms:
  • MVDR-SER: Minimum Variance Distortionless Response using Sequential Regression
  • CSLC-LMS: Coherent Sidelobe Cancellation using Least Mean Square
  • CSLC-RLS: Coherent Sidelobe Cancellation using Robust Least Square
• Expressed in sequential ANSI C
• 400 radar iterations
• Compute 3 radar sidelobes (for CSLC-LMS and CSLC-RLS)

Page 44: MVDR-SER

[Figure: performance chart.]

Page 45: CSLC-LMS

[Figure: performance chart.]

Page 46: CSLC-RLS

[Figure: performance chart.]

Page 47: 3D Discretized wave equation input code (RTM)

A 25-point, 8th-order (in space) stencil:

    #pragma rstream map
    void RTM_3D(double (*U1)[Y][X], double (*U2)[Y][X], double (*V)[Y][X],
                int pX, int pY, int pZ) {
      /* C0..C4: stencil coefficients defined elsewhere */
      double temp;
      int i, j, k;
      for (k=4; k<pZ-4; k++) {
        for (j=4; j<pY-4; j++) {
          for (i=4; i<pX-4; i++) {
            temp = C0 * U2[k][j][i] +
                   C1 * (U2[k-1][j][i] + U2[k+1][j][i] +
                         U2[k][j-1][i] + U2[k][j+1][i] +
                         U2[k][j][i-1] + U2[k][j][i+1]) +
                   C2 * (U2[k-2][j][i] + U2[k+2][j][i] +
                         U2[k][j-2][i] + U2[k][j+2][i] +
                         U2[k][j][i-2] + U2[k][j][i+2]) +
                   C3 * (U2[k-3][j][i] + U2[k+3][j][i] +
                         U2[k][j-3][i] + U2[k][j+3][i] +
                         U2[k][j][i-3] + U2[k][j][i+3]) +
                   C4 * (U2[k-4][j][i] + U2[k+4][j][i] +
                         U2[k][j-4][i] + U2[k][j+4][i] +
                         U2[k][j][i-4] + U2[k][j][i+4]);
            U1[k][j][i] = 2.0f * U2[k][j][i] - U1[k][j][i] + V[k][j][i] * temp;
          }
        }
      }
    }

Page 48: 3D Discretized wave equation mapped code (RTM)

Not so naïve …

[Figure: excerpt of the generated CUDA, annotated: communication autotuning knobs; threadIdx.x divergence is expensive.]

Page 49: Outline

• R-Stream Overview
• Compilation Walk-through
• Performance Results
• Getting R-Stream

Page 50: Current status

• Ongoing development also supported by DOE, Reservoir
  • Improvements in scope, stability, performance
• Installations/evaluations at US government laboratories
• Going forward, collaboration with Georgia Tech on Keeneland
  • HP SL390 – 3 Fermi GPUs, 2 Westmeres per node
• Basis of the compiler for the DARPA UHPC Intel Corporation team

Page 51: Availability

• Per-developer-seat licensing model
• Support (releases, bug fixes, services)
• Available with commercial-grade external solvers
• Government has limited rights / SBIR data rights
• Academic source licenses with collaborators
• Professional team, continuity, software engineering

Page 52: R-Stream Gives More Insight About Programs

• Teaches you about parallelism and the polyhedral model:
  • Generates correct code for any transformation
    – The transformation may be incorrect if specified by the user
  • Imperative code generation was the bottleneck until 2005
    – Three theses have been written on the topic, and it is still not completely covered
    – It is good to have an intuition of how these things work
• R-Stream has meaningful metrics to represent your program:
  • Maximal amount of parallelism given minimal expansion
  • Tradeoffs between coarse-grained and fine-grained parallelism
  • Loop types (doall, red, perm, seq)
• Helps with algorithm selection
• Tiling of imperfectly nested loops
• Generates code for explicitly managed memory

Page 53: How to Use R-Stream Successfully

• R-Stream is a great transformation tool, and also a great learning tool:
  • Takes "simple C" input code
  • Applies multiple transformations and lists them explicitly
  • Can dump code at any step in the process
  • It's the tool I wish I had during my PhD
    – To be fair, I already had a great set of tools
• R-Stream can be used in multiple modes:
  • Fully automatic + compile-flag options
  • Autotuning mode (more than just tile size and unrolling …)
  • Scripting / programmatic mode (BeanShell + interfaces)
  • A mix of these modes + manual post-processing

Page 54: Use Case: Fully Automatic Mode + Compile Flag Options

• Akin to traditional gcc / icc compilation with flags:
  • Predefined transformations can be parameterized
  • Except with fewer (or no) phase-ordering issues
  • Except you can see what each transformation does at each step
  • Except you can generate compilable and executable code at (almost) any step in the process
  • Except you can control code generation for compactness or performance

Page 55: Use Case: Autotuning Mode

• More advanced than traditional approaches:
  • Knobs go far beyond loop unrolling + unroll-and-jam + tiling
  • Knobs are based on well-understood models
  • Knobs target high-level properties of the program
    – Amount of parallelism, amount of memory expansion, depth of pipelining of communications and computations …
  • Knobs depend on the target machine, the program, and the state of the mapping process
    – Our tool has introspection
• We are really building a hierarchical autotuning transformation tool

Page 56: Use Case: "Power User" Interactive Interface

• BeanShell access to optimizations
• Can direct and review the process of compilation
  • Automatic tactics (affine scheduling)
• Can direct code generation
• Access to "tuning parameters"
• All options / commands available on the command-line interface

Page 57: Conclusion

• R-Stream simplifies software development and maintenance
• Porting: reduces expense and delivery delays
• Does this by automatically parallelizing loop code
  • While optimizing for data locality, coalescing, etc.
• Addresses dense, loop-intensive computations
• Extensions:
  • Data-parallel programming idioms
  • Sparse representations
  • Dynamic runtime execution

Page 58: Contact us

• Per-developer seat, floating, cloud-based licensing
• Discounts for academic users
• Research collaborations with academic partners
• For more information:
  • Call us at 212-780-0527, or
  • See Rich, Ann
  • E-mail us at {sales,lethin,johnson}@reservoir.com