Programming the Memory Hierarchy with Sequoia
Michael Bauer, Alex Aiken
Stanford University

[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bauer, Stanford)


http://cs264.org http://j.mp/gBHYkg

Page 1

Programming the Memory Hierarchy with Sequoia

Michael Bauer, Alex Aiken


Stanford University

Page 2

Outline

The Sequoia Programming Model

Compiling Sequoia

Automated Tuning with Sequoia

Performance Results

Extensions for Irregular Parallelism (Time Permitting)

Areas of Future Research

Page 3

The Sequoia Programming Model

Page 4

Sequoia


Language: stream programming for machines with deep memory hierarchies

Idea: expose abstract memory hierarchy to the programmer

Implementation: benchmarks run well on many multi-level machines

SMP, CMP, Cluster of CMPs, GPU, Cluster of GPUs, Disk

Page 5

The key challenge in high performance programming is:

communication (not parallelism)

Latency, Bandwidth → LOCALITY

Page 6

Streaming


Streaming involves structuring algorithms as collections of independent, locality-cognizant computations with well-defined working sets.

This structuring can be done at many scales:
Keep temporaries in registers
Cache/scratchpad blocking
Message passing on a cluster
Out-of-core algorithms

Page 7

Streaming


Streaming involves structuring algorithms as collections of independent, locality-cognizant computations with well-defined working sets.

Efficient programs exhibit this structure at many scales.

Page 8

Facilitate development of hierarchy-aware stream programs

Provide constructs that can be implemented efficiently

Place computation and data in the machine
Explicit parallelism and communication
Large bulk transfers

Page 9

Locality in Programming Languages


Local (private) vs. global (remote) addresses: UPC, Titanium

Domain distributions (map array elements to locations): HPF, UPC, ZPL, X10, Fortress, Chapel

Focus on communication between nodes; ignore hierarchy within a node

Page 10

Locality in Programming Languages


Streams and kernels: stream data off chip, kernel data on chip. StreamC/KernelC, BrookGPU, shading languages (Cg, HLSL)

Architecture specific; only represent two levels

(Except CUDA and PMH)

Page 11

Abstract Machine Model

Tree of independent address spaces

Each level is progressively smaller, but computationally more powerful

Arbitrary branching factor

Arbitrary number of levels
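To make this concrete, here is a minimal sketch (in C++, purely illustrative and not part of Sequoia) of the abstract machine model as a tree of address spaces:

#include <cstddef>
#include <string>
#include <vector>

// Illustrative only: each node is an address space with a capacity;
// children are smaller but computationally more powerful.
struct AddressSpace {
    std::string name;
    std::size_t capacity_bytes;
    std::vector<AddressSpace> children;   // arbitrary branching factor
};

// Example: a dual-core PC as a three-level tree (sizes assumed).
AddressSpace dual_core_pc() {
    return { "main memory", std::size_t(1) << 30,
             { { "L2 cache", std::size_t(4) << 20,
                 { { "L1 cache", std::size_t(32) << 10, {} },
                   { "L1 cache", std::size_t(32) << 10, {} } } } } };
}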

Page 12

Hierarchical Memory

Real machines as trees of memories

[Diagrams: a dual-core PC as a tree (main memory → L2 cache → two L1 caches → ALUs) and a 4-node cluster of PCs (aggregate cluster memory as a virtual level → node memories → L2 caches → L1 caches → ALUs).]

Page 13

Hierarchical Memory


[Diagram: a single GPU as a tree: CPU main memory → GPU main memory → shared memories → warps and registers.]

Page 14

Hierarchical Memory


[Diagram: MPI cluster of CMPs with GPU accelerators: aggregate cluster memory (virtual level) → CMP main memories → GPU main memories → shared memories → warps and registers.]

Page 15

Example: Blocked Matrix Multiply

void matmul_L1( int M, int N, int T, float* A, float* B, float* C )
{
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < T; k++)
        C[i][j] += A[i][k] * B[k][j];
}

C += A x B

[Diagram: matmul_L1 performs a 32x32 matrix multiplication on blocks of A, B, and C.]

Page 16

Example: Blocked Matrix Multiply

C += A x B

void matmul_L2( int M, int N, int T, float* A, float* B, float* C )

Perform series of L1 matrix multiplications.

[Diagram: matmul_L2 performs a 256x256 matrix multiplication as a series of 32x32 matmul_L1 calls.]

Page 17

Example: Blocked Matrix Multiply

void matmul( int M, int N, int T, float* A, float* B, float* C )

Perform series of L2 matrix multiplications.

[Diagram: matmul performs a large matrix multiplication as a series of 256x256 matmul_L2 calls, each of which performs a series of 32x32 matmul_L1 calls.]

Page 18

Sequoia Tasks

Special functions called tasks are the building blocks of Sequoia programs

task matmul::leaf( in    float A[M][T],
                   in    float B[T][N],
                   inout float C[M][N] )
{
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < T; k++)
        C[i][j] += A[i][k] * B[k][j];
}

Page 19

Sequoia Tasks

Task arguments and temporaries define a working set
Task working set is resident at a specific location in the abstract machine model
Tasks are assigned a location in the memory hierarchy

Maintain call-by-value-result (CBVR) semantics

task matmul::leaf( in    float A[M][T],
                   in    float B[T][N],
                   inout float C[M][N] )
{
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < T; k++)
        C[i][j] += A[i][k] * B[k][j];
}
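CBVR means a task's in arguments are copied into its address space on entry and its inout results are copied back on exit, so the task touches only its private working set. A rough C++ analogy (hypothetical, for illustration; not Sequoia source):

#include <cstddef>
#include <vector>

// Hypothetical CBVR analogy: the task body sees only private copies.
static void task_body(std::vector<float>& c,
                      const std::vector<float>& a,
                      const std::vector<float>& b) {
    for (std::size_t i = 0; i < c.size(); ++i)
        c[i] += a[i] * b[i];                // stand-in for the leaf kernel
}

void run_cbvr(std::vector<float>& C,        // inout argument
              const std::vector<float>& A,  // in argument
              const std::vector<float>& B) {
    std::vector<float> a = A, b = B, c = C; // copy-in (in and inout data)
    task_body(c, a, b);                     // compute on the working set
    C = c;                                  // copy-out (inout data only)
}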

Page 20

Task Hierarchies

task matmul::inner( in    float A[M][T],
                    in    float B[T][N],
                    inout float C[M][N] )
{
  tunable int P, Q, R;

  // Recursively call the matmul task on submatrices
  // of A, B, and C of size PxQ, QxR, and PxR.
}

task matmul::leaf( in    float A[M][T],
                   in    float B[T][N],
                   inout float C[M][N] )
{
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < T; k++)
        C[i][j] += A[i][k] * B[k][j];
}

Page 21

Task Hierarchies

task matmul::inner( in    float A[M][T],
                    in    float B[T][N],
                    inout float C[M][N] )
{
  tunable int P, Q, R;

  mappar( int i = 0 to M/P, int j = 0 to N/R ) {
    mapseq( int k = 0 to T/Q ) {
      matmul( A[P*i:P*(i+1);P][Q*k:Q*(k+1);Q],
              B[Q*k:Q*(k+1);Q][R*j:R*(j+1);R],
              C[P*i:P*(i+1);P][R*j:R*(j+1);R] );
    }
  }
}

task matmul::leaf( in    float A[M][T],
                   in    float B[T][N],
                   inout float C[M][N] )
{
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < T; k++)
        C[i][j] += A[i][k] * B[k][j];
}

[Diagram: variant call graph: matmul::inner calls itself recursively and bottoms out at matmul::leaf.]
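The mappar loops above are independent across (i, j) blocks of C and may run in parallel, while mapseq serializes the k loop because each iteration accumulates into the same C block. A plain C++ rendering of that loop structure (illustrative; fixed block sizes stand in for the tunables, and M, N, T are assumed divisible by them):

// Illustrative C++ analogue of matmul::inner; P, Q, R stand in for
// the tunables. Row-major storage: A is MxT, B is TxN, C is MxN.
void matmul_blocked(int M, int N, int T,
                    const float* A, const float* B, float* C) {
    const int P = 32, Q = 32, R = 32;        // assumed block sizes
    for (int i = 0; i < M / P; ++i)          // mappar: parallel-safe
        for (int j = 0; j < N / R; ++j)      // mappar: parallel-safe
            for (int k = 0; k < T / Q; ++k)  // mapseq: reduction into C
                for (int ii = i * P; ii < (i + 1) * P; ++ii)
                    for (int jj = j * R; jj < (j + 1) * R; ++jj)
                        for (int kk = k * Q; kk < (k + 1) * Q; ++kk)
                            C[ii * N + jj] += A[ii * T + kk] * B[kk * N + jj];
}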

Page 22

Summary: Sequoia Tasks

Single abstraction for:
Isolation / parallelism
Explicit communication / working sets
Expressing locality

Sequoia programs describe hierarchies of tasks:
Mapped onto the memory hierarchy
Parameterized for portability

Page 23

Compiling Sequoia

Page 24

Sequoia Compiler

Source-to-source compilation

Three inputs:
Source file
Machine file
Mapping file

Compilation works on hierarchical programs

Many standard optimizations:
Done at all levels of the hierarchy
Greatly increases the leverage of optimization

[Diagram: source.sq, source.mp, and machine.m feed the Sequoia compiler (sq++), which emits machine-specific source code.]

Page 25

Inter-Level Copy Elimination (1)


Copy elimination near the root removes not one instruction, but thousands/millions

[Diagram: a copy from A in memory level Mi+1 to B in level Mi, followed by a copy from B to C, is replaced by a single copy from A directly to C.]

Page 26

Inter-Level Copy Elimination (2)


Copy elimination near the root removes not one instruction, but thousands/millions

[Diagram: a second copy-elimination pattern: a chain of two copies between A, B, and C across levels Mi and Mi+1 is again collapsed into a single copy.]
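In effect the optimization collapses a chain of transfers into one, and because it fires near the root of the hierarchy it saves a large bulk transfer rather than a single instruction. A schematic C++ sketch (hypothetical buffers; Sequoia's copies are bulk transfers between hierarchy levels):

#include <cstddef>
#include <cstring>

// Before: data moves A -> B -> C through an intermediate buffer.
void copy_chain(float* C, float* B, const float* A, std::size_t n) {
    std::memcpy(B, A, n * sizeof(float));   // level-to-level copy
    std::memcpy(C, B, n * sizeof(float));   // redundant second copy
}

// After copy elimination: a single direct transfer A -> C.
void copy_direct(float* C, const float* A, std::size_t n) {
    std::memcpy(C, A, n * sizeof(float));
}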

Page 27

Software Pipelining


Scheduling:
Prefetch batch of data
Compute on data
Initiate write of results

Overlap communication and computation

[Diagram: pipelined schedule over time: read input i+1 overlaps compute i, which overlaps write output i-1 (e.g. read input 2 runs alongside compute 1 and write output 0).]
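A standard way to realize this schedule is double buffering with asynchronous transfers: start the read of batch i+1, compute on batch i, and retire the write of batch i-1 concurrently. A minimal C++ sketch (hypothetical read/compute/write functions, using std::async to model asynchronous transfers; assumes num_batches >= 1):

#include <future>
#include <vector>

// Hypothetical pipelining sketch: overlap communication and computation.
void pipeline(int num_batches,
              std::vector<float> (*read)(int),
              std::vector<float> (*compute)(const std::vector<float>&),
              void (*write)(int, const std::vector<float>&)) {
    auto next = std::async(std::launch::async, read, 0);  // prefetch batch 0
    std::future<void> pending_write;
    for (int i = 0; i < num_batches; ++i) {
        std::vector<float> in = next.get();               // wait for batch i
        if (i + 1 < num_batches)
            next = std::async(std::launch::async, read, i + 1);  // read i+1
        std::vector<float> out = compute(in);             // compute on batch i
        if (pending_write.valid())
            pending_write.get();                          // finish write i-1
        pending_write = std::async(std::launch::async, write, i, out);
    }
    if (pending_write.valid())
        pending_write.get();                              // drain last write
}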

Page 28

Sequoia Runtime

Uniform scheme for explicitly describing memory hierarchies

Capture common traits important for performance
Allow composition of memory hierarchies

Simple, portable API for many parallel machines
Mechanism independence for communication and management of parallel resources

SMP, CMP, MPI Cluster, CUDA, Disk, Cell (deprecated), Scalar (debugging), OpenCL (future)

Page 29

Graphical Runtime Representation

[Diagram: a runtime connects the memory and CPU at level i+1 to the memories and CPUs of children 1..N at level i.]

Page 30

Runtime Design

Uniform API to support many devices

Manages basic program tasks:
Data allocation and naming
Setup of parallel resources
Synchronization

Greatly simplifies/modularizes implementation:
Compiler generates code for one API, not many machines
Each runtime is isolated from all others
Runtimes can be implemented separately and composed freely

Makes some basic assumptions:
Software has control over memory resources
Persistence of data for software-controlled memory
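As a hypothetical illustration (invented names; the actual Sequoia runtime API differs), such a uniform interface between two adjacent memory levels might look like this in C++:

#include <cstddef>

// Hypothetical uniform runtime interface between adjacent levels.
// Each backend (SMP, MPI, CUDA, disk, ...) implements the same API,
// so runtimes can be composed freely to model deeper hierarchies.
class Runtime {
public:
    virtual ~Runtime() = default;
    virtual void* alloc(int child, std::size_t bytes) = 0;  // name data in a child
    virtual void copy_down(int child, void* dst,
                           const void* src, std::size_t bytes) = 0;
    virtual void copy_up(int child, void* dst,
                         const void* src, std::size_t bytes) = 0;
    virtual void launch(int child, void (*task)(void*), void* arg) = 0;
    virtual void barrier() = 0;                             // synchronization
};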

Page 31

Automatic Tuning

Page 32

Autotuner


Many parameters to tune:
Sequoia codes are parameterized by tunables
Choice of task variants at different call sites

The tuning framework sets these parameters:
Search-based
Programmer defines the search space

Page 33

Software-Managed Memory


[Plot: the search space is smooth, with high-frequency components (due to alignment).]

Page 34

Hierarchical Search


[Diagram: bottom-up search over a three-level machine (M0, M1, M2): the set of tunables S0 between M0 and M1 is tuned before the set S1 between M1 and M2.]

Page 35

Search Algorithm


A pyramid search:
Greedy search performed at each level

Achieves good performance quickly because the search space is smooth

Start with a coarse grid; refine the grid spacing when no further progress can be made
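A minimal sketch of this greedy grid-refinement search for a single integer tunable (hypothetical; the real autotuner searches many tunables and task-variant choices, level by level):

#include <functional>
#include <initializer_list>

// Hypothetical greedy pyramid search over one tunable in [lo, hi]:
// move to the best neighbor at the current grid spacing; when no
// neighbor improves, halve the spacing, down to unit spacing.
int tune(int lo, int hi, int start,
         const std::function<double(int)>& measured_runtime) {
    int best = start;
    double best_time = measured_runtime(best);
    for (int step = (hi - lo) / 2; step >= 1; step /= 2) {
        for (bool improved = true; improved; ) {
            improved = false;
            for (int cand : { best - step, best + step }) {
                if (cand < lo || cand > hi) continue;
                double t = measured_runtime(cand);
                if (t < best_time) {
                    best_time = t;
                    best = cand;
                    improved = true;
                }
            }
        }
    }
    return best;
}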

Page 36

Performance Results

Page 37

Sequoia Benchmarks


Linear Algebra: BLAS Level 1 SAXPY, Level 2 SGEMV, and Level 3 SGEMM

Conv2D: 2D single-precision convolution with 9x9 support (non-periodic boundary constraints)

FFT3D: complex single-precision FFT

Gravity: 100 time steps of an N-body stellar dynamics simulation (N^2), single precision

HMMER: fuzzy protein string matching using HMM evaluation (Horn et al., SC2005)

Page 38

Single Runtime Configurations


Scalar: 2.4 GHz Intel Pentium 4 Xeon, 1 GB

8-way SMP: 4 dual-core 2.66 GHz Intel P4 Xeons, 8 GB

Disk: 2.4 GHz Intel P4, 160 GB disk, ~50 MB/s from disk

Cluster: 16 Intel 2.4 GHz P4 Xeons, 1 GB/node, InfiniBand interconnect (780 MB/s)

Cell: 3.2 GHz IBM Cell blade (1 Cell, 8 SPEs), 1 GB

PS3: 3.2 GHz Cell in Sony PlayStation 3 (6 SPEs), 256 MB (160 MB usable)

Page 39

Single Runtime Configurations (GFLOPS)


         Scalar   SMP    Disk    Cluster   Cell   PS3
SAXPY    0.3      0.7    0.007   4.9       3.5    3.1
SGEMV    1.1      1.7    0.04    12        12     10
SGEMM    6.9      45     5.5     91        119    94
CONV2D   1.9      7.8    0.6     24        85     62
FFT3D    0.7      3.9    0.05    5.5       54     31
GRAVITY  4.8      40     3.7     68        97     71
HMMER    0.9      11     0.9     12        12     7.1

Page 40

SGEMM Performance


Cluster:
Intel Cluster MKL: 101 GFlop/s
Sequoia: 91 GFlop/s

SMP:
Intel MKL: 44 GFlop/s
Sequoia: 45 GFlop/s

Page 41

FFT3D Performance


Cell:
Mercury Computer: 58 GFlop/s
FFTW 3.2 alpha 2: 35 GFlop/s
Sequoia: 54 GFlop/s

Cluster:
FFTW 3.2 alpha 2: 5.3 GFlop/s
Sequoia: 5.5 GFlop/s

SMP:
FFTW 3.2 alpha 2: 4.2 GFlop/s
Sequoia: 3.9 GFlop/s

Page 42

Best Known Implementations


HMMER:
ATI X1900XT: 9.4 GFlop/s (Horn et al. 2005)
Sequoia Cell: 12 GFlop/s
Sequoia SMP: 11 GFlop/s

Gravity:
GRAPE-6A: 2 billion interactions/s (Fukushige et al. 2005)
Sequoia Cell: 4 billion interactions/s
Sequoia PS3: 3 billion interactions/s

Page 43

Multi-Runtime System Configurations


Cluster of SMPs: four 2-way 3.16 GHz Intel Pentium 4 Xeons connected via GigE (80 MB/s peak)

Disk + PS3: Sony PlayStation 3 bringing data from disk (~30 MB/s)

Cluster of PS3s: two Sony PlayStation 3s connected via GigE (60 MB/s peak)

Page 44

SMP vs. Cluster of SMPs (GFLOPS)


         Cluster of SMPs   SMP
SAXPY    1.9               0.7
SGEMV    4.4               1.7
SGEMM    48                45
CONV2D   4.8               7.8
FFT3D    1.1               3.9
GRAVITY  50                40
HMMER    14                11

Same number of total processors

Compute-limited applications are agnostic to the interconnect.

Page 45

Disk+PS3 Comparison (GFLOPS)


         Disk+PS3   PS3
SAXPY    0.004      3.1
SGEMV    0.014      10
SGEMM    3.7        94
CONV2D   0.48       62
FFT3D    0.05       31
GRAVITY  66         71
HMMER    8.3        7.1

Some applications have the computational intensity to run from disk with little slowdown, computing long enough on blocks to hide memory latency.

Page 46

Extensions for Irregular Parallelism

Page 47

Regular vs. Irregular Parallelism

Regular Computations:
Statically known working sets
Statically known communication patterns
Predictable running times

Irregular Computations:
Dynamically determined working sets
Dynamically determined communication patterns
Unpredictable running times

Regular applications provide scalability, but large applications still have irregular components that need to be parallelized

Page 48

Spawn Statement

Spawn launches an unbounded number of tasks
Continue launching until the termination condition is met

task<inner> void performWork()
{
  // ...
  // (task to be run, termination condition)
  spawn(performWork(this), workQueue.isEmpty());
  // ...
}

Page 49

Parent Pointers and Call-Up

Parent pointers provide a child with a way to name its parent's address space
A call-up is a task that runs atomically in the parent's address space

Maintains the abstract machine model

task<leaf> void handleWork(parent WorkList *wl)
{
  // Perform call-up to retrieve work
  vector<Work> localWorkQueue = wl->getWork();

  /* Perform work... */

  // Call-up to add back extra work
  wl->addWork(localWorkQueue);
}

Page 50

Case Study: Boolean Satisfiability (SAT)

SAT is useful in many industrial applications:
CAD tools
Model checkers / static analyses

SAT is a search problem

SAT is an irregular application:
Dynamic working set (partial assignments)
Dynamic communication (learned clauses)
Unknown running times (solving partial assignments)

(x1 + ¬x3 + x4) ^ (¬x2 + x3 + x5) ^ (x4 + ¬x5 + ¬x6)

Page 51

Parallel SAT Solving

Spawn many sequential SAT solvers [1]
Periodically call up to update assumptions

1. We use the MiniSat sequential solver

[Diagram: a two-level spawn tree: level 0 holds the shared problem state, e.g. the clause (x1 + x2 + ¬x3); sequential solvers run at level 1 on different partial assignments.]


Page 53

Performance Results for SAT


Speedups over the MiniSat sequential solver (2008 SAT competition champion)

[Plot: speedup over sequential vs. number of processors.]

Page 54

Case Study: Parallel Sorting

[Diagram: an unsorted integer array to be partitioned.]

Generalized Quicksort Algorithm

Sorting is an irregular application:
Dynamic working set (partitions)
Dynamic communication (swizzle)
Unknown execution time (partition sort)

Page 55

Sorting Performance

Sort 2^27 integers (vs. STL sort)
Compare to original Sequoia

Page 56

Dynamic Load Balancing with Spawn

Employ a re-spawn heuristic
Dynamic load balancing with enough tasks

Page 57

Case Study: Sparse Matrix Multiply


Sparse matrix multiplied by a dense vector

No optimizations for sparse patterns

Dynamically allocate chunks of rows to children (see the sketch below):
Handles the case where some rows have more non-zero elements than others
Children have an initial assignment of rows
Use work stealing when finished
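A small C++ sketch of that scheme (illustrative, not the Sequoia implementation): rows are grouped into chunks, each worker starts on its own chunk range, and finished workers steal remaining chunks from others via atomic cursors.

#include <atomic>
#include <thread>
#include <vector>

// Hypothetical sketch: initial chunk assignment plus work stealing.
void spmv_schedule(int num_workers, int num_chunks,
                   void (*process_chunk)(int)) {
    std::vector<std::atomic<int>> cursor(num_workers);  // next chunk per owner
    std::vector<int> end(num_workers);                  // one past last chunk
    for (int w = 0; w < num_workers; ++w) {
        cursor[w] = w * num_chunks / num_workers;
        end[w] = (w + 1) * num_chunks / num_workers;
    }
    auto worker = [&](int w) {
        for (int k = 0; k < num_workers; ++k) {      // own range, then victims
            int v = (w + k) % num_workers;
            for (int c = cursor[v].fetch_add(1); c < end[v];
                 c = cursor[v].fetch_add(1))
                process_chunk(c);                    // each chunk claimed once
        }
    };
    std::vector<std::thread> threads;
    for (int w = 0; w < num_workers; ++w)
        threads.emplace_back(worker, w);
    for (std::thread& t : threads)
        t.join();
}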

Page 58

Performance Results for SpMV


Speedups over sequential OSKI code (excluding tuning time for OSKI execution)

Similar to other results for SpMV: inherently memory bound

[Plot: speedup over sequential vs. number of processors.]

Page 59

Areas of Future Research

Page 60

Memory Management

What abstractions are presented for each address space?
Stack? Heap? GC? Persistence?

How to represent distributed data structures and arrays in virtual levels?

Major problem for clusters with distributed memory

How to handle pointer-based data structures that need to be partitioned?

Better mechanisms for communicating locality?

Page 61

Sequoia: DSL Compiler Target?


Can DSL compilers perform domain-specific optimizations in a machine-agnostic context?

[Diagram: DSL A and DSL B compilers perform domain-specific optimizations and target Sequoia; the Sequoia compiler applies machine optimizations and targets MPI, P-Threads, CUDA, OpenCL, and Disk.]

Page 62

Conclusions

Programming to an abstract memory hierarchy provides both locality information and portability

Sequoia provides a general framework for autotuning in deep memory hierarchies

Constructs for irregular parallelism are important for obtaining good performance

Page 63

Questions?

http://sequoia.stanford.edu

Page 64

Back-up Slides

Page 65

Locality-Aware Programming

Specify functionally independent tasks

Call-by-value-result semantics

Couple locality information with explicit parallelism

Provide tunable variables for machine independence

Page 66

Tunables

Provide a mechanism for specifying machine-dependent variables

Two flavors:
Integer tunables for control
Task tunables for task variants

Specified in the mapping file

Page 67

Sequoia Rough Edges

Memory Management:
What abstractions are presented for each address space?
How to handle pointer-based data structures?

Abstract Machine Model:
Is it too abstract?
What operations would be allowed if it were slightly less abstract?

Pipeline Parallelism:
Could Sequoia support something like a GRAMPS-style graph of queues?

Page 68

Case Study: Fluid Simulation

Fluidanimate application from PARSEC [1]

3-D fluid flow simulated by particles
Space is partitioned into cells

Fluid is an irregular application:
Dynamic working set (cells per grid)
Dynamic communication (ghost cells)
Unknown running time (particles per cell)

1. PARSEC benchmark suite: parsec.cs.princeton.edu

Page 69

The Ghost Cell Problem

All copies of a ghost cell must be reduced sequentially
Different ghost cells can be reduced in parallel

Call-up serializes these parallel reductions

Future research: prove independence of call-ups at compile time to allow concurrent call-up execution
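As a concrete illustration of the desired behavior (hypothetical C++ sketch, not the proposed compiler analysis): a lock per ghost cell serializes the reduction of that cell's copies while distinct cells reduce concurrently.

#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical sketch: one lock per ghost cell serializes that cell's
// reduction; reductions of different cells proceed in parallel.
struct GhostCells {
    std::vector<float> value;
    std::vector<std::mutex> locks;
    explicit GhostCells(std::size_t n) : value(n), locks(n) {}

    void reduce(std::size_t cell, float contribution) {
        std::lock_guard<std::mutex> guard(locks[cell]);
        value[cell] += contribution;   // copies of one cell fold in serially
    }
};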
