Autotuning Scientific Kernels on Multicore Systems
Sam Williams*§, Jonathan Carter*, James Demmel§, Leonid Oliker*§, David Patterson*§, John Shalf*, Katherine Yelick*§, Richard Vuduc‡
*Lawrence Berkeley National Laboratory, §UC Berkeley, ‡Georgia Institute of Technology
Scalability and Performance Comparison
•Clovertown has problems with both multicore and multisocket scaling
•Niagara2 delivered performance between the Opteron and the Clovertown
•Despite being heavily bound by double precision, Cell is by far the fastest
What is Autotuning?
Idea
•There are too many complex architectures with too many possible code transformations to optimize over.
•An optimization on one machine may slow another machine down.
•Need a general, automatic solution.
Code Generators
•Kernel-specific
•Perl script generates 1000's of code variations
•Autotuner searches over all possible implementations (sometimes guided by a performance model to prune the space) to find the optimal configuration
•Examples include:
  Array padding avoids conflicts in the L1/L2
  Vectorization avoids rolling the TLB
  Unrolling/DLP compensates for poor compilers
  SW prefetching attempts to hide L2 and DRAM latency
  SIMDization compensates for poor compilers
  Streaming stores minimize memory traffic
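To make the search step concrete, here is a minimal sketch of an exhaustive autotuning harness: a table of generated kernel variants is timed and the fastest is kept. The variant functions, names, and clock()-based timing are illustrative stand-ins, not the actual Perl-generated code.

```c
#include <stdio.h>
#include <time.h>

#define N (1 << 20)
#define N_VARIANTS 3

typedef void (*kernel_variant_t)(int n, const double *x, double *y);

/* Hand-written stand-ins for machine-generated variants: same semantics
 * (y[i] += 2*x[i]), different unrolling depths. */
static void variant_unroll1(int n, const double *x, double *y) {
    for (int i = 0; i < n; i++) y[i] += 2.0 * x[i];
}
static void variant_unroll2(int n, const double *x, double *y) {
    for (int i = 0; i < n; i += 2) {
        y[i]     += 2.0 * x[i];
        y[i + 1] += 2.0 * x[i + 1];
    }
}
static void variant_unroll4(int n, const double *x, double *y) {
    for (int i = 0; i < n; i += 4) {
        y[i]     += 2.0 * x[i];
        y[i + 1] += 2.0 * x[i + 1];
        y[i + 2] += 2.0 * x[i + 2];
        y[i + 3] += 2.0 * x[i + 3];
    }
}

int main(void) {
    static double x[N], y[N];
    kernel_variant_t variants[N_VARIANTS] =
        { variant_unroll1, variant_unroll2, variant_unroll4 };

    int best = 0;
    double best_time = 1e30;
    for (int v = 0; v < N_VARIANTS; v++) {
        clock_t t0 = clock();
        variants[v](N, x, y);                      /* time each variant */
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (t < best_time) { best_time = t; best = v; }  /* keep fastest */
    }
    printf("best variant: %d (%.6f s)\n", best, best_time);
    return 0;
}
```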
Architectures Evaluated
[Block diagrams of the four systems:
•2.2GHz AMD Opteron: two dual-core sockets; 1MB victim cache per core; SRI/crossbar; each socket's 128b memory controller delivers 10.66 GB/s to 667MHz DDR2 DIMMs; HyperTransport links the sockets at 4GB/s each direction.
•2.33GHz Intel Xeon (Clovertown): two quad-core sockets; each pair of Core2 cores shares a 4MB L2; two 10.6 GB/s front-side buses feed a chipset with 4x64b memory controllers delivering 21.3 GB/s (read) and 10.6 GB/s (write) to 667MHz FBDIMMs.
•1.4GHz Sun Niagara2 (Huron): eight multithreaded SPARC cores, each with an 8K L1; a crossbar switch (179 GB/s fill, 90 GB/s writethru) connects them to a 4MB shared L2 (16 way, address interleaving via 8x64B banks); 4x128b memory controllers (2 banks each) deliver 42.66 GB/s (read) and 21.33 GB/s (write) to 667MHz FBDIMMs.
•3.2GHz IBM Cell Blade (QS20): two sockets, each with a PPE (512KB L2) and eight SPEs (256K local store with MFC each) on an EIB ring network; each socket delivers 25.6GB/s to 512MB XDR DRAM; a BIF link connects the sockets at <20GB/s each direction.]
Autotuning Lattice Methods
Reference
S. Williams, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008 (to appear). Winner, Best Paper Award, Applications track.
Lattice-Boltzmann Methods
•Out-of-place (Jacobi) style structured grid code
•Popular in CFD
•Simplified kinetic model that maintains the macroscopic quantities
•Distribution functions (e.g. 27 velocities per point in space) are used to reconstruct macroscopic quantities
Lattice-Boltzmann Magneto-hydrodynamics (LBMHD)
•Simulates plasma turbulence
•Couples CFD and Maxwell's equations
•Thus it requires:
  a momentum distribution (27 components),
  a magnetic distribution (45 components), and
  7 macroscopic quantities (density, momentum, magnetic field)
•Two phases to the code:
  collision() advances the grid one time step
  stream() handles the boundary conditions (periodic for the benchmark)
•Each cell update requires ~1300 flops and ~1200 bytes of data
•flop:byte ~ 1.0 (ideal), ~0.66 (cache-based machines)
•Two problem sizes: 64^3 (330MB) and 128^3 (2.5GB)
•Currently utilize a Structure-of-Arrays data layout to maximize locality (sketched below)
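A minimal sketch of that Structure-of-Arrays layout, with array padding folded in. The type and function names, the PAD value, and the single-slab allocation are illustrative assumptions, not the actual LBMHD code.

```c
#include <stdlib.h>

#define NVEL 27    /* momentum phase-space velocities per point */
#define PAD  16    /* doubles of padding between component arrays (tunable) */

/* Structure-of-Arrays: one contiguous array per velocity component, so
 * collision() streams through each component with unit stride. */
typedef struct {
    double *f[NVEL];      /* f[v][cell] */
    int nx, ny, nz;
} soa_grid_t;

soa_grid_t *soa_alloc(int nx, int ny, int nz)
{
    soa_grid_t *g = malloc(sizeof *g);
    size_t cells = (size_t)nx * ny * nz;
    /* One slab, with each component staggered by PAD doubles so the 27
     * concurrent streams do not collide in the same cache sets. */
    double *slab = malloc(NVEL * (cells + PAD) * sizeof(double));
    g->nx = nx; g->ny = ny; g->nz = nz;
    for (int v = 0; v < NVEL; v++)
        g->f[v] = slab + (size_t)v * (cells + PAD);
    return g;
}
```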
Autotuning LBMHD
•Autotuning dramatically improved performance on the Opteron (4x)
•Became important when the problem could no longer be mapped with Niagara2's 4MB pages
•Although prefetching showed little benefit, SIMD and streaming stores helped significantly (see the sketch below)
•Cell was not autotuned, and only collision() was implemented
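As a concrete instance of the streaming-store optimization above, a hedged sketch using SSE2 intrinsics; the function name and the scaled-copy kernel are illustrative, not the tuned collision() code.

```c
#include <emmintrin.h>   /* SSE2: __m128d, _mm_stream_pd */

/* Write results with cache-bypassing (non-temporal) stores. On a
 * write-allocate cache this avoids reading dst from DRAM before it is
 * overwritten, cutting memory traffic for out-of-place grid codes.
 * Assumes dst is 16-byte aligned and n is a multiple of 2. */
void scaled_copy_streaming(double *dst, const double *src, double s, long n)
{
    __m128d vs = _mm_set1_pd(s);
    for (long i = 0; i < n; i += 2) {
        __m128d v = _mm_mul_pd(vs, _mm_loadu_pd(&src[i]));
        _mm_stream_pd(&dst[i], v);   /* bypass the cache on the store */
    }
    _mm_sfence();                    /* order the streamed stores */
}
```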
[Charts:
•LBMHD (64^3) sustained power efficiency (MFlop/s/Watt) for Clovertown, Opteron, Niagara2 (Huron), and Cell Blade.
•LBMHD (64^3) aggregate GFlop/s vs. number of cores (1-16) for Opteron, Clovertown, Niagara2 (Huron), and Cell Blade.
•LBMHD GFlop/s on the 64^3 and 128^3 problems at 1-16 cores as optimizations are applied cumulatively (Original, Padded, Vectorized, Unrolled/DLP, Prefetching, SIMD), one panel each for Clovertown, Opteron, Niagara2, and Cell Blade.]
System Power Efficiency
•Used a digital power meter to measure sustained system power
•The Niagara2 system required 50% more power than the other systems
[Figure: the 27 phase-space velocities of the momentum distribution, numbered 0-26 along the +X, +Y, and +Z axes.]
Initial Multicore SpMV Experimentation
Reference
S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, J. Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.
Two threaded implementations:
•Cache-based Pthreads, and Cell local-store based
•1D parallelization by rows
•Progressively expanded the autotuner (constant levels of productivity)
Autotuning
•For the x86 machines, naïve parallelization didn't even double performance, but it delivered 40x on Niagara2
•Prefetching (exhaustive search) / DMA was helpful on both x86 machines (more so after compression)
•NUMA-aware allocation is essential on the Opteron and Cell
•Matrix compression (heuristic) often delivered a performance gain better than the reduction in memory traffic (see the register-blocking sketch below)
•There is no naïve Cell SPE implementation
•Although local-store blocking is essential on Cell for correctness, it is rarely beneficial on cache-based machines
•Machines running a large number of threads need large numbers of DIMMs (not just bandwidth)
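A minimal sketch of one form of matrix compression: register-blocked (BCSR) SpMV with a fixed 2x2 block. In practice the tuner picks the block size per matrix and machine; the names here are illustrative rather than the paper's implementation.

```c
/* y += A*x with A stored as 2x2 register blocks (BCSR).
 * brow_ptr[I]..brow_ptr[I+1] index the blocks of block-row I;
 * bval holds each 2x2 block in row-major order. */
void spmv_bcsr_2x2(int n_brows, const int *brow_ptr, const int *bcol_idx,
                   const double *bval, const double *x, double *y)
{
    for (int I = 0; I < n_brows; I++) {
        double y0 = 0.0, y1 = 0.0;   /* both output rows stay in registers */
        for (int k = brow_ptr[I]; k < brow_ptr[I + 1]; k++) {
            const double *b  = &bval[4 * k];
            const double *xx = &x[2 * bcol_idx[k]];
            y0 += b[0] * xx[0] + b[1] * xx[1];
            y1 += b[2] * xx[0] + b[3] * xx[1];
        }
        y[2 * I]     += y0;
        y[2 * I + 1] += y1;
    }
}
```

Blocking amortizes one column index over four nonzeros (and permits smaller index types), which is how compression can cut memory traffic even when it adds explicit zeros.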
System Power Efficiency
•Used a digital power meter to measure sustained power under load
•16 FBDIMMs on Niagara2 drove sustained power to 450W, while the others required around 300W
•Clovertown delivered lower performance and required more power than the Opteron (thus substantially lower power efficiency)
•Cell delivered good performance at moderate power
Performance and Scalability
•All machines showed good multisocket scaling (bandwidth per socket was the limiting factor)
•Clovertown showed very poor multicore scaling (low data utilization on the FSB)
•Machines with simpler cores and more bandwidth delivered better performance
[Charts: SpMV GFlop/s on Clovertown, Opteron, Niagara2, and Cell Blade as optimizations are applied cumulatively: Naïve Serial, Naïve Parallel, +NUMA, +Prefetching, +Matrix Compression, +Cache/TLB Blocking, +DIMMs / Firmware / Array Padding.]
Comparison with MPI
•Compared the autotuned Pthreads implementation with autotuned shared-memory MPI (MPICH): a PETSc+OSKI autotuned implementation
•For many matrices, MPI was only slightly faster than serial on the x86 machines
•MPI rarely scaled beyond 16 threads on Niagara2 (still under investigation)
•The autotuned Pthreads implementation was often 2-3x faster than autotuned MPI
[Charts: SpMV GFlop/s per matrix for Naïve Serial, autotuned shared-memory MPI (PETSc+OSKI), and autotuned Pthreads on Clovertown, Opteron, and Niagara2.]
Dataset: the matrices
•Pruned the original SPARSITY suite down to 14 matrices of interest
•None should fit in cache: 12-135MB CSR footprints (a memory-intensive benchmark)
•4 categories
•Rank ranging from 2K to 1M

Name            Dimension  Nonzeros (nonzeros/row)  CSR footprint
Dense           2K         4.0M (2K)                48MB
Protein         36K        4.3M (119)               52MB
FEM/Spheres     83K        6.0M (72)                72MB
FEM/Cantilever  62K        4.0M (65)                48MB
Wind Tunnel     218K       11.6M (53)               140MB
FEM/Harbor      47K        2.37M (50)               28MB
QCD             49K        1.90M (39)               23MB
FEM/Ship        141K       3.98M (28)               48MB
Economics       207K       1.27M (6)                16MB
Epidemiology    526K       2.1M (4)                 27MB
FEM/Accel       121K       2.62M (22)               32MB
Circuit         171K       959K (6)                 12MB
webbase         1M         3.1M (3)                 41MB
LP              4K x 1M    11.3M (2825)             135MB
[Spyplots of each matrix omitted.]
[Chart: LBMHD (64^3) GFlop/s per core vs. number of cores (1-16) for Opteron, Clovertown, Niagara2 (Huron), and Cell Blade.]
What are structured grids?
Structured Grids
•Data is arranged in regular multidimensional grids (usually 2-4 dimensions)
•Computation is a series of grid update steps
•Neighbor addressing is implicit, based on each point's coordinates
•For a given point, a stencil is a pre-determined set of nearest neighbors (possibly including itself)
•A stencil code updates every point in a regular grid with a common stencil (see the sweep sketched below)
[Figures: a 5-point 2D stencil and a 7-point 3D stencil.]
•There are several structured grid kernels, including:
  Basic Poisson solvers (e.g., Jacobi and Gauss-Seidel iterations)
  Multigrid
  Mesh Refinement
  Adaptive Mesh Refinement (AMR)
  Lattice Methods (including LBMHD)
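A minimal sketch of one out-of-place (Jacobi-style) sweep of the 7-point 3D stencil; the grid size, indexing macro, and coefficients are illustrative assumptions, and boundary points are simply left untouched.

```c
#define NX 64
#define NY 64
#define NZ 64
#define IDX(i, j, k) ((((k) * NY) + (j)) * NX + (i))

/* One sweep: read from 'in', write to 'out' (out-of-place, Jacobi style). */
void stencil7_sweep(const double *in, double *out, double alpha, double beta)
{
    for (int k = 1; k < NZ - 1; k++)
        for (int j = 1; j < NY - 1; j++)
            for (int i = 1; i < NX - 1; i++)
                /* each point is updated from itself and its 6 face neighbors */
                out[IDX(i, j, k)] =
                    alpha * in[IDX(i, j, k)] +
                    beta  * (in[IDX(i - 1, j, k)] + in[IDX(i + 1, j, k)] +
                             in[IDX(i, j - 1, k)] + in[IDX(i, j + 1, k)] +
                             in[IDX(i, j, k - 1)] + in[IDX(i, j, k + 1)]);
}
```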
Sparse Matrix-Vector Multiplication
[Figure: a sparse matrix A multiplied by a dense vector x, producing a dense vector y.]
Sparse Matrices
•Most entries are 0.0
•Significant performance advantage in only storing/operating on the nonzeros

Evaluate y=Ax
•A is a sparse matrix
•x & y are dense vectors

Challenges
•Difficult to exploit ILP and DLP
•Irregular memory access to the source vector
•Often difficult to load balance
•Very low computational intensity (often >6 bytes/flop)
•Likely memory bound
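For reference, a minimal sketch of the unoptimized kernel these challenges refer to, in standard compressed sparse row (CSR) storage; the names are illustrative.

```c
/* y = A*x for A in CSR format:
 * row_ptr[i]..row_ptr[i+1] index the nonzeros of row i. */
void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];  /* irregular access into x */
        y[i] = sum;
    }
}
```

Each nonzero moves at least 12 bytes (an 8-byte value and a 4-byte index) for 2 flops, which is where the >6 bytes/flop figure above comes from.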
Overview
Multicore architectures are the paradigm for the next decade
•Offer the best tradeoffs: performance, energy, reliability
•However, diverse CMP choices raise difficult questions:
  performance, application suitability, optimization strategy
Progress to date:
•Explored the broadest set of CMP systems in the HPC literature
•Examined two diverse scientific algorithms: SpMV and LBMHD
•Developed performance-portable autotuners
•Achieved the highest intra-node performance in the literature
•Continuing effort with additional platforms and algorithms
•Developing a roofline model which offers insight into CMP performance
•The model ties together floating point performance, arithmetic intensity, and memory performance (see the bound below)
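The bound itself can be stated in one line (a standard formulation of the roofline; the per-machine ceilings in the plots below, such as "w/out SIMD", simply lower the flat roof):

```latex
\text{Attainable GFlop/s} \;=\;
\min\bigl(\text{peak GFlop/s},\;
          \text{peak bandwidth (GB/s)} \times \text{arithmetic intensity (flop:byte)}\bigr)
```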
[Figure: kernels arranged along the arithmetic intensity axis, from O(1) to O(log(N)) to O(N): SpMV and BLAS1,2; stencils (PDEs); lattice methods; FFTs; dense linear algebra (BLAS3); particle methods.]
Insights:
•Paradoxically, the most complex/advanced architectures required the most tuning, and delivered the lowest performance
•Niagara2 delivered both very good performance and productivity
•Cell delivered extremely good performance and efficiency (processor and power)
•Our multicore-specific autotuning is required to achieve top performance
•Architectural transparency is invaluable in optimizing code
•Results suggest CMP designs should maximize effective bandwidth utilization with increasing cores, even at the cost of single-core performance
[Roofline plots: attainable GFlop/s (0.5-256, log scale) vs. actual flop:byte ratio (1/8-16) for three machines. Cell Blade (SPEs): ceilings at peak DP, w/out FMA, w/out SIMD, w/out ILP. Opteron (rev.F): ceilings at peak DP, mul/add imbalance, w/out ILP or SIMD. Xeon (Clovertown): ceilings at peak DP, mul/add imbalance, w/out SIMD, w/out ILP.]