Programming the Memory Hierarchy with Sequoia
Michael Bauer, Alex Aiken
Stanford University

[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bauer, Stanford)


http://cs264.org http://j.mp/gBHYkg

Page 1

Programming the Memory Hierarchy with Sequoia

Michael Bauer, Alex Aiken


Stanford University

Page 2

Outline

The Sequoia Programming Model

Compiling Sequoia

Automated Tuning with Sequoia

Performance Results

Extensions for Irregular Parallelism (Time Permitting)

Areas of Future Research

Page 3

The Sequoia Programming Model

Page 4

Sequoia


Language: stream programming for machines with deep memory hierarchies

Idea: expose abstract memory hierarchy to the programmer

Implementation: benchmarks run well on many multi-level machines

SMP, CMP, Cluster of CMPs, GPU, Cluster of GPUs, Disk

Page 5

The key challenge in high performance programming is:

communication (not parallelism)

Latency, Bandwidth → LOCALITY

Page 6

Streaming


Streaming involves structuring algorithms as collections of independent, locality-cognizant computations with well-defined working sets.

This structuring can be done at many scales:
Keep temporaries in registers
Cache/scratchpad blocking
Message passing on a cluster
Out-of-core algorithms

Page 7

Streaming


Streaming involves structuring algorithms as collections of independent, locality-cognizant computations with well-defined working sets.

Efficient programs exhibit this structure at many scales.

Page 8

Facilitate development of hierarchy-aware stream programs

Provide constructs that can be implemented efficiently

Place computation and data in the machine
Explicit parallelism and communication
Large bulk transfers

Page 9

Locality in Programming Languages


Local (private) vs. global (remote) addresses: UPC, Titanium

Domain distributions (map array elements to locations): HPF, UPC, ZPL, X10, Fortress, Chapel

Focus on communication between nodes; ignore hierarchy within a node

Page 10

Locality in Programming Languages


Streams and kernels: stream data off chip, kernel data on chip. StreamC/KernelC, BrookGPU, shading languages (Cg, HLSL)

Architecture specific; only represent two levels

(Except CUDA and PMH)

Page 11

Abstract Machine Model

Tree of independent address spaces

Each level is progressively smaller, but computationally more powerful

Arbitrary branching factor

Arbitrary number of levels
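To make this concrete, here is a minimal sketch (in C++, purely illustrative and not part of Sequoia) of the abstract machine model as a tree of address spaces:

#include <cstddef>
#include <string>
#include <vector>

// Illustrative only: each node is an address space with a capacity;
// children are smaller but computationally more powerful.
struct AddressSpace {
    std::string name;
    std::size_t capacity_bytes;
    std::vector<AddressSpace> children;   // arbitrary branching factor
};

// Example: a dual-core PC as a three-level tree (sizes assumed).
AddressSpace dual_core_pc() {
    return { "main memory", std::size_t(1) << 30,
             { { "L2 cache", std::size_t(4) << 20,
                 { { "L1 cache", std::size_t(32) << 10, {} },
                   { "L1 cache", std::size_t(32) << 10, {} } } } } };
}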

Page 12

Hierarchical Memory

Real machines as trees of memories

[Diagrams: a dual-core PC as a tree (main memory → L2 cache → two L1 caches → ALUs) and a 4-node cluster of PCs (aggregate cluster memory as a virtual level → node memories → L2 caches → L1 caches → ALUs).]

Page 13

Hierarchical Memory


[Diagram: a single GPU as a tree: CPU main memory → GPU main memory → shared memories → warps and registers.]

Page 14

Hierarchical Memory


[Diagram: MPI cluster of CMPs with GPU accelerators: aggregate cluster memory (virtual level) → CMP main memories → GPU main memories → shared memories → warps and registers.]

Page 15

Example: Blocked Matrix Multiply

void matmul_L1( int M, int N, int T, float* A, float* B, float* C )
{
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < T; k++)
        C[i][j] += A[i][k] * B[k][j];
}

C += A x B

[Diagram: matmul_L1 performs a 32x32 matrix multiplication on blocks of A, B, and C.]

Page 16

Example: Blocked Matrix Multiply

C += A x B

void matmul_L2( int M, int N, int T, float* A, float* B, float* C )

Perform series of L1 matrix multiplications.

[Diagram: matmul_L2 performs a 256x256 matrix multiplication as a series of 32x32 matmul_L1 calls.]

Page 17

Example: Blocked Matrix Multiply

void matmul( int M, int N, int T, float* A, float* B, float* C )

Perform series of L2 matrix multiplications.

[Diagram: matmul performs a large matrix multiplication as a series of 256x256 matmul_L2 calls, each of which performs a series of 32x32 matmul_L1 calls.]

Page 18

Sequoia Tasks

Special functions called tasks are the building blocks of Sequoia programs

task matmul::leaf( in    float A[M][T],
                   in    float B[T][N],
                   inout float C[M][N] )
{
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < T; k++)
        C[i][j] += A[i][k] * B[k][j];
}

Page 19

Sequoia Tasks

Task arguments and temporaries define a working set
Task working set is resident at a specific location in the abstract machine model
Tasks are assigned a location in the memory hierarchy

Maintain call-by-value-result (CBVR) semantics

task matmul::leaf( in    float A[M][T],
                   in    float B[T][N],
                   inout float C[M][N] )
{
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < T; k++)
        C[i][j] += A[i][k] * B[k][j];
}
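CBVR means a task's in arguments are copied into its address space on entry and its inout results are copied back on exit, so the task touches only its private working set. A rough C++ analogy (hypothetical, for illustration; not Sequoia source):

#include <cstddef>
#include <vector>

// Hypothetical CBVR analogy: the task body sees only private copies.
static void task_body(std::vector<float>& c,
                      const std::vector<float>& a,
                      const std::vector<float>& b) {
    for (std::size_t i = 0; i < c.size(); ++i)
        c[i] += a[i] * b[i];                // stand-in for the leaf kernel
}

void run_cbvr(std::vector<float>& C,        // inout argument
              const std::vector<float>& A,  // in argument
              const std::vector<float>& B) {
    std::vector<float> a = A, b = B, c = C; // copy-in (in and inout data)
    task_body(c, a, b);                     // compute on the working set
    C = c;                                  // copy-out (inout data only)
}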

Page 20

Task Hierarchies

task matmul::inner( in    float A[M][T],
                    in    float B[T][N],
                    inout float C[M][N] )
{
  tunable int P, Q, R;

  // Recursively call the matmul task on submatrices
  // of A, B, and C of size PxQ, QxR, and PxR.
}

task matmul::leaf( in    float A[M][T],
                   in    float B[T][N],
                   inout float C[M][N] )
{
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < T; k++)
        C[i][j] += A[i][k] * B[k][j];
}

Page 21

Task Hierarchies

task matmul::inner( in    float A[M][T],
                    in    float B[T][N],
                    inout float C[M][N] )
{
  tunable int P, Q, R;

  mappar( int i = 0 to M/P, int j = 0 to N/R ) {
    mapseq( int k = 0 to T/Q ) {
      matmul( A[P*i:P*(i+1);P][Q*k:Q*(k+1);Q],
              B[Q*k:Q*(k+1);Q][R*j:R*(j+1);R],
              C[P*i:P*(i+1);P][R*j:R*(j+1);R] );
    }
  }
}

task matmul::leaf( in    float A[M][T],
                   in    float B[T][N],
                   inout float C[M][N] )
{
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < T; k++)
        C[i][j] += A[i][k] * B[k][j];
}

[Diagram: variant call graph: matmul::inner calls itself recursively and bottoms out at matmul::leaf.]
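The mappar loops above are independent across (i, j) blocks of C and may run in parallel, while mapseq serializes the k loop because each iteration accumulates into the same C block. A plain C++ rendering of that loop structure (illustrative; fixed block sizes stand in for the tunables, and M, N, T are assumed divisible by them):

// Illustrative C++ analogue of matmul::inner; P, Q, R stand in for
// the tunables. Row-major storage: A is MxT, B is TxN, C is MxN.
void matmul_blocked(int M, int N, int T,
                    const float* A, const float* B, float* C) {
    const int P = 32, Q = 32, R = 32;        // assumed block sizes
    for (int i = 0; i < M / P; ++i)          // mappar: parallel-safe
        for (int j = 0; j < N / R; ++j)      // mappar: parallel-safe
            for (int k = 0; k < T / Q; ++k)  // mapseq: reduction into C
                for (int ii = i * P; ii < (i + 1) * P; ++ii)
                    for (int jj = j * R; jj < (j + 1) * R; ++jj)
                        for (int kk = k * Q; kk < (k + 1) * Q; ++kk)
                            C[ii * N + jj] += A[ii * T + kk] * B[kk * N + jj];
}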

Page 22

Summary: Sequoia Tasks

Single abstraction for:
Isolation / parallelism
Explicit communication / working sets
Expressing locality

Sequoia programs describe hierarchies of tasks:
Mapped onto the memory hierarchy
Parameterized for portability

Page 23

Compiling Sequoia

Page 24

Sequoia Compiler

Source-to-source compilation

Three inputs:
Source file
Machine file
Mapping file

Compilation works on hierarchical programs

Many standard optimizations:
Done at all levels of the hierarchy
Greatly increases the leverage of optimization

[Diagram: source.sq, source.mp, and machine.m feed the Sequoia compiler (sq++), which emits machine-specific source code.]

Page 25

Inter-Level Copy Elimination (1)


Copy elimination near the root removes not one instruction, but thousands/millions

[Diagram: a copy from A in memory level Mi+1 to B in level Mi, followed by a copy from B to C, is replaced by a single copy from A directly to C.]

Page 26

Inter-Level Copy Elimination (2)


Copy elimination near the root removes not one instruction, but thousands/millions

[Diagram: a second copy-elimination pattern: a chain of two copies between A, B, and C across levels Mi and Mi+1 is again collapsed into a single copy.]
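In effect the optimization collapses a chain of transfers into one, and because it fires near the root of the hierarchy it saves a large bulk transfer rather than a single instruction. A schematic C++ sketch (hypothetical buffers; Sequoia's copies are bulk transfers between hierarchy levels):

#include <cstddef>
#include <cstring>

// Before: data moves A -> B -> C through an intermediate buffer.
void copy_chain(float* C, float* B, const float* A, std::size_t n) {
    std::memcpy(B, A, n * sizeof(float));   // level-to-level copy
    std::memcpy(C, B, n * sizeof(float));   // redundant second copy
}

// After copy elimination: a single direct transfer A -> C.
void copy_direct(float* C, const float* A, std::size_t n) {
    std::memcpy(C, A, n * sizeof(float));
}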

Page 27

Software Pipelining


Scheduling:
Prefetch batch of data
Compute on data
Initiate write of results

Overlap communication and computation

[Diagram: pipelined schedule over time: read input i+1 overlaps compute i, which overlaps write output i-1 (e.g. read input 2 runs alongside compute 1 and write output 0).]
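A standard way to realize this schedule is double buffering with asynchronous transfers: start the read of batch i+1, compute on batch i, and retire the write of batch i-1 concurrently. A minimal C++ sketch (hypothetical read/compute/write functions, using std::async to model asynchronous transfers; assumes num_batches >= 1):

#include <future>
#include <vector>

// Hypothetical pipelining sketch: overlap communication and computation.
void pipeline(int num_batches,
              std::vector<float> (*read)(int),
              std::vector<float> (*compute)(const std::vector<float>&),
              void (*write)(int, const std::vector<float>&)) {
    auto next = std::async(std::launch::async, read, 0);  // prefetch batch 0
    std::future<void> pending_write;
    for (int i = 0; i < num_batches; ++i) {
        std::vector<float> in = next.get();               // wait for batch i
        if (i + 1 < num_batches)
            next = std::async(std::launch::async, read, i + 1);  // read i+1
        std::vector<float> out = compute(in);             // compute on batch i
        if (pending_write.valid())
            pending_write.get();                          // finish write i-1
        pending_write = std::async(std::launch::async, write, i, out);
    }
    if (pending_write.valid())
        pending_write.get();                              // drain last write
}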

Page 28

Sequoia Runtime

Uniform scheme for explicitly describing memory hierarchies

Capture common traits important for performance
Allow composition of memory hierarchies

Simple, portable API for many parallel machines
Mechanism independence for communication and management of parallel resources

SMP, CMP, MPI Cluster, CUDA, Disk, Cell (deprecated), Scalar (debugging), OpenCL (future)

Page 29

Graphical Runtime Representation

[Diagram: a runtime connects the memory and CPU at level i+1 to the memories and CPUs of children 1..N at level i.]

Page 30

Runtime Design

Uniform API to support many devices

Manages basic program tasks:
Data allocation and naming
Setup of parallel resources
Synchronization

Greatly simplifies/modularizes implementation:
Compiler generates code for one API, not many machines
Each runtime is isolated from all others
Runtimes can be implemented separately and composed freely

Makes some basic assumptions:
Software has control over memory resources
Persistence of data for software-controlled memory
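As a hypothetical illustration (invented names; the actual Sequoia runtime API differs), such a uniform interface between two adjacent memory levels might look like this in C++:

#include <cstddef>

// Hypothetical uniform runtime interface between adjacent levels.
// Each backend (SMP, MPI, CUDA, disk, ...) implements the same API,
// so runtimes can be composed freely to model deeper hierarchies.
class Runtime {
public:
    virtual ~Runtime() = default;
    virtual void* alloc(int child, std::size_t bytes) = 0;  // name data in a child
    virtual void copy_down(int child, void* dst,
                           const void* src, std::size_t bytes) = 0;
    virtual void copy_up(int child, void* dst,
                         const void* src, std::size_t bytes) = 0;
    virtual void launch(int child, void (*task)(void*), void* arg) = 0;
    virtual void barrier() = 0;                             // synchronization
};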

Page 31

Automatic Tuning

Page 32

Autotuner


Many parameters to tune:
Sequoia codes are parameterized by tunables
Choice of task variants at different call sites

The tuning framework sets these parameters:
Search-based
Programmer defines the search space

Page 33

Software-Managed Memory


[Plot: the search space is smooth, with high-frequency components (due to alignment).]

Page 34

Hierarchical Search


[Diagram: bottom-up search over a three-level machine (M0, M1, M2): the set of tunables S0 between M0 and M1 is tuned before the set S1 between M1 and M2.]

Page 35

Search Algorithm


A pyramid search:
Greedy search performed at each level

Achieves good performance quickly because the search space is smooth

Start with a coarse grid; refine the grid spacing when no further progress can be made
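A minimal sketch of this greedy grid-refinement search for a single integer tunable (hypothetical; the real autotuner searches many tunables and task-variant choices, level by level):

#include <functional>
#include <initializer_list>

// Hypothetical greedy pyramid search over one tunable in [lo, hi]:
// move to the best neighbor at the current grid spacing; when no
// neighbor improves, halve the spacing, down to unit spacing.
int tune(int lo, int hi, int start,
         const std::function<double(int)>& measured_runtime) {
    int best = start;
    double best_time = measured_runtime(best);
    for (int step = (hi - lo) / 2; step >= 1; step /= 2) {
        for (bool improved = true; improved; ) {
            improved = false;
            for (int cand : { best - step, best + step }) {
                if (cand < lo || cand > hi) continue;
                double t = measured_runtime(cand);
                if (t < best_time) {
                    best_time = t;
                    best = cand;
                    improved = true;
                }
            }
        }
    }
    return best;
}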

Page 36

Performance Results

Page 37

Sequoia Benchmarks


Linear Algebra: BLAS Level 1 SAXPY, Level 2 SGEMV, and Level 3 SGEMM

Conv2D: 2D single-precision convolution with 9x9 support (non-periodic boundary constraints)

FFT3D: complex single-precision FFT

Gravity: 100 time steps of an N-body stellar dynamics simulation (N^2), single precision

HMMER: fuzzy protein string matching using HMM evaluation (Horn et al., SC2005)

Page 38

Single Runtime Configurations


Scalar: 2.4 GHz Intel Pentium 4 Xeon, 1 GB

8-way SMP: 4 dual-core 2.66 GHz Intel P4 Xeons, 8 GB

Disk: 2.4 GHz Intel P4, 160 GB disk, ~50 MB/s from disk

Cluster: 16 Intel 2.4 GHz P4 Xeons, 1 GB/node, InfiniBand interconnect (780 MB/s)

Cell: 3.2 GHz IBM Cell blade (1 Cell, 8 SPEs), 1 GB

PS3: 3.2 GHz Cell in Sony PlayStation 3 (6 SPEs), 256 MB (160 MB usable)

Page 39

Single Runtime Configurations (GFLOPS)


         Scalar   SMP    Disk    Cluster   Cell   PS3
SAXPY    0.3      0.7    0.007   4.9       3.5    3.1
SGEMV    1.1      1.7    0.04    12        12     10
SGEMM    6.9      45     5.5     91        119    94
CONV2D   1.9      7.8    0.6     24        85     62
FFT3D    0.7      3.9    0.05    5.5       54     31
GRAVITY  4.8      40     3.7     68        97     71
HMMER    0.9      11     0.9     12        12     7.1

Page 40

SGEMM Performance


Cluster:
Intel Cluster MKL: 101 GFlop/s
Sequoia: 91 GFlop/s

SMP:
Intel MKL: 44 GFlop/s
Sequoia: 45 GFlop/s

Page 41

FFT3D Performance


Cell:
Mercury Computer: 58 GFlop/s
FFTW 3.2 alpha 2: 35 GFlop/s
Sequoia: 54 GFlop/s

Cluster:
FFTW 3.2 alpha 2: 5.3 GFlop/s
Sequoia: 5.5 GFlop/s

SMP:
FFTW 3.2 alpha 2: 4.2 GFlop/s
Sequoia: 3.9 GFlop/s

Page 42

Best Known Implementations


HMMER:
ATI X1900XT: 9.4 GFlop/s (Horn et al. 2005)
Sequoia Cell: 12 GFlop/s
Sequoia SMP: 11 GFlop/s

Gravity:
GRAPE-6A: 2 billion interactions/s (Fukushige et al. 2005)
Sequoia Cell: 4 billion interactions/s
Sequoia PS3: 3 billion interactions/s

Page 43

Multi-Runtime System Configurations


Cluster of SMPs: four 2-way 3.16 GHz Intel Pentium 4 Xeons connected via GigE (80 MB/s peak)

Disk + PS3: Sony PlayStation 3 bringing data from disk (~30 MB/s)

Cluster of PS3s: two Sony PlayStation 3s connected via GigE (60 MB/s peak)

Page 44

SMP vs. Cluster of SMPs (GFLOPS)


         Cluster of SMPs   SMP
SAXPY    1.9               0.7
SGEMV    4.4               1.7
SGEMM    48                45
CONV2D   4.8               7.8
FFT3D    1.1               3.9
GRAVITY  50                40
HMMER    14                11

Same number of total processors

Compute-limited applications are agnostic to the interconnect.

Page 45

Disk+PS3 Comparison (GFLOPS)


         Disk+PS3   PS3
SAXPY    0.004      3.1
SGEMV    0.014      10
SGEMM    3.7        94
CONV2D   0.48       62
FFT3D    0.05       31
GRAVITY  66         71
HMMER    8.3        7.1

Some applications have the computational intensity to run from disk with little slowdown, computing long enough on blocks to hide memory latency.

Page 46

Extensions for Irregular Parallelism

Page 47

Regular vs. Irregular Parallelism

Regular Computations:
Statically known working sets
Statically known communication patterns
Predictable running times

Irregular Computations:
Dynamically determined working sets
Dynamically determined communication patterns
Unpredictable running times

Regular applications provide scalability, but large applications still have irregular components that need to be parallelized

Page 48

Spawn Statement

Spawn launches an unbounded number of tasks
Continue launching until the termination condition is met

task<inner> void performWork()
{
  // ...
  // (task to be run, termination condition)
  spawn(performWork(this), workQueue.isEmpty());
  // ...
}

Page 49

Parent Pointers and Call-Up

Parent pointers provide a child with a way to name its parent's address space
A call-up is a task that runs atomically in the parent's address space

Maintains the abstract machine model

task<leaf> void handleWork(parent WorkList *wl)
{
  // Perform call-up to retrieve work
  vector<Work> localWorkQueue = wl->getWork();

  /* Perform work... */

  // Call-up to add back extra work
  wl->addWork(localWorkQueue);
}

Page 50

Case Study: Boolean Satisfiability (SAT)

SAT is useful in many industrial applications:
CAD tools
Model checkers / static analyses

SAT is a search problem

SAT is an irregular application:
Dynamic working set (partial assignments)
Dynamic communication (learned clauses)
Unknown running times (solving partial assignments)

(x1 + ¬x3 + x4) ^ (¬x2 + x3 + x5) ^ (x4 + ¬x5 + ¬x6)

Page 51

Parallel SAT Solving

Spawn many sequential SAT solvers [1]
Periodically call up to update assumptions

1. We use the MiniSat sequential solver

[Diagram: a two-level spawn tree: level 0 holds the shared problem state, e.g. the clause (x1 + x2 + ¬x3); sequential solvers run at level 1 on different partial assignments.]


Page 53

Performance Results for SAT


Speedups over the MiniSat sequential solver (2008 SAT competition champion)

[Plot: speedup over sequential vs. number of processors.]

Page 54

Case Study: Parallel Sorting

[Diagram: an unsorted integer array to be partitioned.]

Generalized Quicksort Algorithm

Sorting is an irregular application:
Dynamic working set (partitions)
Dynamic communication (swizzle)
Unknown execution time (partition sort)

Page 55

Sorting Performance

Sort 2^27 integers (vs. STL sort)
Compare to original Sequoia

Page 56

Dynamic Load Balancing with Spawn

Employ a re-spawn heuristic
Dynamic load balancing with enough tasks

Page 57

Case Study: Sparse Matrix Multiply


Sparse matrix multiplied by a dense vector

No optimizations for sparse patterns

Dynamically allocate chunks of rows to children (see the sketch below):
Handles the case where some rows have more non-zero elements than others
Children have an initial assignment of rows
Use work stealing when finished
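A small C++ sketch of that scheme (illustrative, not the Sequoia implementation): rows are grouped into chunks, each worker starts on its own chunk range, and finished workers steal remaining chunks from others via atomic cursors.

#include <atomic>
#include <thread>
#include <vector>

// Hypothetical sketch: initial chunk assignment plus work stealing.
void spmv_schedule(int num_workers, int num_chunks,
                   void (*process_chunk)(int)) {
    std::vector<std::atomic<int>> cursor(num_workers);  // next chunk per owner
    std::vector<int> end(num_workers);                  // one past last chunk
    for (int w = 0; w < num_workers; ++w) {
        cursor[w] = w * num_chunks / num_workers;
        end[w] = (w + 1) * num_chunks / num_workers;
    }
    auto worker = [&](int w) {
        for (int k = 0; k < num_workers; ++k) {      // own range, then victims
            int v = (w + k) % num_workers;
            for (int c = cursor[v].fetch_add(1); c < end[v];
                 c = cursor[v].fetch_add(1))
                process_chunk(c);                    // each chunk claimed once
        }
    };
    std::vector<std::thread> threads;
    for (int w = 0; w < num_workers; ++w)
        threads.emplace_back(worker, w);
    for (std::thread& t : threads)
        t.join();
}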

Page 58

Performance Results for SpMV


Speedups over sequential OSKI code (excluding tuning time for OSKI execution)

Similar to other results for SpMV: inherently memory bound

[Plot: speedup over sequential vs. number of processors.]

Page 59

Areas of Future Research

Page 60

Memory Management

What abstractions are presented for each address space?
Stack? Heap? GC? Persistence?

How to represent distributed data structures and arrays in virtual levels?

Major problem for clusters with distributed memory

How to handle pointer-based data structures that need to be partitioned?

Better mechanisms for communicating locality?

Page 61

Sequoia: DSL Compiler Target?


Can DSL compilers perform domain-specific optimizations in a machine-agnostic context?

[Diagram: DSL A and DSL B compilers perform domain-specific optimizations and target Sequoia; the Sequoia compiler applies machine optimizations and targets MPI, P-Threads, CUDA, OpenCL, and Disk.]

Page 62

Conclusions

Programming to an abstract memory hierarchy provides both locality information and portability

Sequoia provides a general framework for autotuning in deep memory hierarchies

Constructs for irregular parallelism are important for obtaining good performance

Page 63

Questions?

http://sequoia.stanford.edu

Page 64

Back-up Slides

Page 65

Locality-Aware Programming

Specify functionally independent tasks

Call-by-value-result semantics

Couple locality information with explicit parallelism

Provide tunable variables for machine independence

Page 66

Tunables

Provide a mechanism for specifying machine-dependent variables

Two flavors:
Integer tunables for control
Task tunables for task variants

Specified in the mapping file

Page 67

Sequoia Rough Edges

Memory Management:
What abstractions are presented for each address space?
How to handle pointer-based data structures?

Abstract Machine Model:
Is it too abstract?
What operations would be allowed if it were slightly less abstract?

Pipeline Parallelism:
Could Sequoia support something like a GRAMPS-style graph of queues?

Page 68

Case Study: Fluid Simulation

Fluidanimate application from PARSEC [1]

3-D fluid flow simulated by particles
Space is partitioned into cells

Fluid is an irregular application:
Dynamic working set (cells per grid)
Dynamic communication (ghost cells)
Unknown running time (particles per cell)

1. PARSEC benchmark suite: parsec.cs.princeton.edu

Page 69

The Ghost Cell Problem

All copies of a ghost cell must be reduced sequentially
Different ghost cells can be reduced in parallel

Call-up serializes these parallel reductions

Future research: prove independence of call-ups at compile time to allow concurrent call-up execution
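As a concrete illustration of the desired behavior (hypothetical C++ sketch, not the proposed compiler analysis): a lock per ghost cell serializes the reduction of that cell's copies while distinct cells reduce concurrently.

#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical sketch: one lock per ghost cell serializes that cell's
// reduction; reductions of different cells proceed in parallel.
struct GhostCells {
    std::vector<float> value;
    std::vector<std::mutex> locks;
    explicit GhostCells(std::size_t n) : value(n), locks(n) {}

    void reduce(std::size_t cell, float contribution) {
        std::lock_guard<std::mutex> guard(locks[cell]);
        value[cell] += contribution;   // copies of one cell fold in serially
    }
};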
