Autotuning Scientific Kernels on Multicore Systems
Sam Williams*§, Jonathan Carter*, James Demmel§, Leonid Oliker*§, David Patterson*§, John Shalf*, Katherine Yelick*§, Richard Vuduc‡
*Lawrence Berkeley National Laboratory, §UC Berkeley, ‡Georgia Institute of Technology
Scalability and Performance Comparison
•Clovertown has problems with both multicore and multisocket scaling
•Niagara2 delivered performance between the Opteron and the Clovertown
•Despite being heavily bound by double precision, Cell is by far the fastest
What is Autotuning?
Idea
•There are too many complex architectures with too many possible code transformations to optimize over.
•An optimization on one machine may slow another machine down.
•Need a general, automatic solution.
Code Generators
•Kernel-specific
•Perl script generates 1000's of code variations
•Autotuner searches over all possible implementations (sometimes guided by a performance model to prune the space) to find the optimal configuration
•Examples include:
  Array padding avoids conflicts in the L1/L2
  Vectorization avoids rolling the TLB
  Unrolling/DLP compensates for poor compilers
  SW prefetching attempts to hide L2 and DRAM latency
  SIMDization compensates for poor compilers
  Streaming stores minimize memory traffic
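To make the search step concrete, here is a minimal sketch of an exhaustive autotuning harness: a table of generated kernel variants is timed and the fastest is kept. The variant functions, names, and clock()-based timing are illustrative stand-ins, not the actual Perl-generated code.

```c
#include <stdio.h>
#include <time.h>

#define N (1 << 20)
#define N_VARIANTS 3

typedef void (*kernel_variant_t)(int n, const double *x, double *y);

/* Hand-written stand-ins for machine-generated variants: same semantics
 * (y[i] += 2*x[i]), different unrolling depths. */
static void variant_unroll1(int n, const double *x, double *y) {
    for (int i = 0; i < n; i++) y[i] += 2.0 * x[i];
}
static void variant_unroll2(int n, const double *x, double *y) {
    for (int i = 0; i < n; i += 2) {
        y[i]     += 2.0 * x[i];
        y[i + 1] += 2.0 * x[i + 1];
    }
}
static void variant_unroll4(int n, const double *x, double *y) {
    for (int i = 0; i < n; i += 4) {
        y[i]     += 2.0 * x[i];
        y[i + 1] += 2.0 * x[i + 1];
        y[i + 2] += 2.0 * x[i + 2];
        y[i + 3] += 2.0 * x[i + 3];
    }
}

int main(void) {
    static double x[N], y[N];
    kernel_variant_t variants[N_VARIANTS] =
        { variant_unroll1, variant_unroll2, variant_unroll4 };

    int best = 0;
    double best_time = 1e30;
    for (int v = 0; v < N_VARIANTS; v++) {
        clock_t t0 = clock();
        variants[v](N, x, y);                      /* time each variant */
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (t < best_time) { best_time = t; best = v; }  /* keep fastest */
    }
    printf("best variant: %d (%.6f s)\n", best, best_time);
    return 0;
}
```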
Architectures Evaluated
[Block diagrams of the four systems:
•2.2GHz AMD Opteron: two dual-core sockets; 1MB victim cache per core; SRI/crossbar; each socket's 128b memory controller delivers 10.66 GB/s to 667MHz DDR2 DIMMs; HyperTransport links the sockets at 4GB/s each direction.
•2.33GHz Intel Xeon (Clovertown): two quad-core sockets; each pair of Core2 cores shares a 4MB L2; two 10.6 GB/s front-side buses feed a chipset with 4x64b memory controllers delivering 21.3 GB/s (read) and 10.6 GB/s (write) to 667MHz FBDIMMs.
•1.4GHz Sun Niagara2 (Huron): eight multithreaded SPARC cores, each with an 8K L1; a crossbar switch (179 GB/s fill, 90 GB/s writethru) connects them to a 4MB shared L2 (16 way, address interleaving via 8x64B banks); 4x128b memory controllers (2 banks each) deliver 42.66 GB/s (read) and 21.33 GB/s (write) to 667MHz FBDIMMs.
•3.2GHz IBM Cell Blade (QS20): two sockets, each with a PPE (512KB L2) and eight SPEs (256K local store with MFC each) on an EIB ring network; each socket delivers 25.6GB/s to 512MB XDR DRAM; a BIF link connects the sockets at <20GB/s each direction.]
Autotuning Lattice Methods
Reference
S. Williams, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008 (to appear). Winner, Best Paper Award, Applications track.
Lattice-Boltzmann Methods
•Out-of-place (Jacobi) style structured grid code
•Popular in CFD
•Simplified kinetic model that maintains the macroscopic quantities
•Distribution functions (e.g. 27 velocities per point in space) are used to reconstruct macroscopic quantities
Lattice-Boltzmann Magneto-hydrodynamics (LBMHD)
•Simulates plasma turbulence
•Couples CFD and Maxwell's equations
•Thus it requires:
  a momentum distribution (27 components),
  a magnetic distribution (45 components), and
  7 macroscopic quantities (density, momentum, magnetic field)
•Two phases to the code:
  collision() advances the grid one time step
  stream() handles the boundary conditions (periodic for the benchmark)
•Each cell update requires ~1300 flops and ~1200 bytes of data
•flop:byte ~ 1.0 (ideal), ~0.66 (cache-based machines)
•Two problem sizes: 64^3 (330MB) and 128^3 (2.5GB)
•Currently utilize a Structure-of-Arrays data layout to maximize locality (sketched below)
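A minimal sketch of that Structure-of-Arrays layout, with array padding folded in. The type and function names, the PAD value, and the single-slab allocation are illustrative assumptions, not the actual LBMHD code.

```c
#include <stdlib.h>

#define NVEL 27    /* momentum phase-space velocities per point */
#define PAD  16    /* doubles of padding between component arrays (tunable) */

/* Structure-of-Arrays: one contiguous array per velocity component, so
 * collision() streams through each component with unit stride. */
typedef struct {
    double *f[NVEL];      /* f[v][cell] */
    int nx, ny, nz;
} soa_grid_t;

soa_grid_t *soa_alloc(int nx, int ny, int nz)
{
    soa_grid_t *g = malloc(sizeof *g);
    size_t cells = (size_t)nx * ny * nz;
    /* One slab, with each component staggered by PAD doubles so the 27
     * concurrent streams do not collide in the same cache sets. */
    double *slab = malloc(NVEL * (cells + PAD) * sizeof(double));
    g->nx = nx; g->ny = ny; g->nz = nz;
    for (int v = 0; v < NVEL; v++)
        g->f[v] = slab + (size_t)v * (cells + PAD);
    return g;
}
```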
Autotuning LBMHD
•Autotuning dramatically improved performance on the Opteron (4x)
•Became important when the problem could no longer be mapped with Niagara2's 4MB pages
•Although prefetching showed little benefit, SIMD and streaming stores helped significantly (see the sketch below)
•Cell was not autotuned, and only collision() was implemented
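As a concrete instance of the streaming-store optimization above, a hedged sketch using SSE2 intrinsics; the function name and the scaled-copy kernel are illustrative, not the tuned collision() code.

```c
#include <emmintrin.h>   /* SSE2: __m128d, _mm_stream_pd */

/* Write results with cache-bypassing (non-temporal) stores. On a
 * write-allocate cache this avoids reading dst from DRAM before it is
 * overwritten, cutting memory traffic for out-of-place grid codes.
 * Assumes dst is 16-byte aligned and n is a multiple of 2. */
void scaled_copy_streaming(double *dst, const double *src, double s, long n)
{
    __m128d vs = _mm_set1_pd(s);
    for (long i = 0; i < n; i += 2) {
        __m128d v = _mm_mul_pd(vs, _mm_loadu_pd(&src[i]));
        _mm_stream_pd(&dst[i], v);   /* bypass the cache on the store */
    }
    _mm_sfence();                    /* order the streamed stores */
}
```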
[Charts:
•LBMHD (64^3) sustained power efficiency (MFlop/s/Watt) for Clovertown, Opteron, Niagara2 (Huron), and Cell Blade.
•LBMHD (64^3) aggregate GFlop/s vs. number of cores (1-16) for Opteron, Clovertown, Niagara2 (Huron), and Cell Blade.
•LBMHD GFlop/s on the 64^3 and 128^3 problems at 1-16 cores as optimizations are applied cumulatively (Original, Padded, Vectorized, Unrolled/DLP, Prefetching, SIMD), one panel each for Clovertown, Opteron, Niagara2, and Cell Blade.]
System Power Efficiency
•Used a digital power meter to measure sustained system power
•The Niagara2 system required 50% more power than the other systems
[Figure: the 27 phase-space velocities of the momentum distribution, numbered 0-26 along the +X, +Y, and +Z axes.]
Initial Multicore SpMV Experimentation
Reference
S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, J. Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.
Two threaded implementations:
•Cache-based Pthreads, and Cell local-store based
•1D parallelization by rows
•Progressively expanded the autotuner (constant levels of productivity)
Autotuning
•For the x86 machines, naïve parallelization didn't even double performance, but it delivered 40x on Niagara2
•Prefetching (exhaustive search) / DMA was helpful on both x86 machines (more so after compression)
•NUMA-aware allocation is essential on the Opteron and Cell
•Matrix compression (heuristic) often delivered a performance gain better than the reduction in memory traffic (see the register-blocking sketch below)
•There is no naïve Cell SPE implementation
•Although local-store blocking is essential on Cell for correctness, it is rarely beneficial on cache-based machines
•Machines running a large number of threads need large numbers of DIMMs (not just bandwidth)
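A minimal sketch of one form of matrix compression: register-blocked (BCSR) SpMV with a fixed 2x2 block. In practice the tuner picks the block size per matrix and machine; the names here are illustrative rather than the paper's implementation.

```c
/* y += A*x with A stored as 2x2 register blocks (BCSR).
 * brow_ptr[I]..brow_ptr[I+1] index the blocks of block-row I;
 * bval holds each 2x2 block in row-major order. */
void spmv_bcsr_2x2(int n_brows, const int *brow_ptr, const int *bcol_idx,
                   const double *bval, const double *x, double *y)
{
    for (int I = 0; I < n_brows; I++) {
        double y0 = 0.0, y1 = 0.0;   /* both output rows stay in registers */
        for (int k = brow_ptr[I]; k < brow_ptr[I + 1]; k++) {
            const double *b  = &bval[4 * k];
            const double *xx = &x[2 * bcol_idx[k]];
            y0 += b[0] * xx[0] + b[1] * xx[1];
            y1 += b[2] * xx[0] + b[3] * xx[1];
        }
        y[2 * I]     += y0;
        y[2 * I + 1] += y1;
    }
}
```

Blocking amortizes one column index over four nonzeros (and permits smaller index types), which is how compression can cut memory traffic even when it adds explicit zeros.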
System Power Efficiency
•Used a digital power meter to measure sustained power under load
•16 FBDIMMs on Niagara2 drove sustained power to 450W, while the others required around 300W
•Clovertown delivered lower performance and required more power than the Opteron (thus substantially lower power efficiency)
•Cell delivered good performance at moderate power
Performance and Scalability
•All machines showed good multisocket scaling (bandwidth per socket was the limiting factor)
•Clovertown showed very poor multicore scaling (low data utilization on the FSB)
•Machines with simpler cores and more bandwidth delivered better performance
[Charts: SpMV GFlop/s on Clovertown, Opteron, Niagara2, and Cell Blade as optimizations are applied cumulatively: Naïve Serial, Naïve Parallel, +NUMA, +Prefetching, +Matrix Compression, +Cache/TLB Blocking, +DIMMs / Firmware / Array Padding.]
Comparison with MPI
•Compared the autotuned Pthreads implementation with autotuned shared-memory MPI (MPICH): a PETSc+OSKI autotuned implementation
•For many matrices, MPI was only slightly faster than serial on the x86 machines
•MPI rarely scaled beyond 16 threads on Niagara2 (still under investigation)
•The autotuned Pthreads implementation was often 2-3x faster than autotuned MPI
[Charts: SpMV GFlop/s per matrix for Naïve Serial, autotuned shared-memory MPI (PETSc+OSKI), and autotuned Pthreads on Clovertown, Opteron, and Niagara2.]
Dataset: the matrices
•Pruned the original SPARSITY suite down to 14 matrices of interest
•None should fit in cache: 12-135MB CSR footprints (a memory-intensive benchmark)
•4 categories
•Rank ranging from 2K to 1M

Name            Dimension  Nonzeros (nonzeros/row)  CSR footprint
Dense           2K         4.0M (2K)                48MB
Protein         36K        4.3M (119)               52MB
FEM/Spheres     83K        6.0M (72)                72MB
FEM/Cantilever  62K        4.0M (65)                48MB
Wind Tunnel     218K       11.6M (53)               140MB
FEM/Harbor      47K        2.37M (50)               28MB
QCD             49K        1.90M (39)               23MB
FEM/Ship        141K       3.98M (28)               48MB
Economics       207K       1.27M (6)                16MB
Epidemiology    526K       2.1M (4)                 27MB
FEM/Accel       121K       2.62M (22)               32MB
Circuit         171K       959K (6)                 12MB
webbase         1M         3.1M (3)                 41MB
LP              4K x 1M    11.3M (2825)             135MB
[Spyplots of each matrix omitted.]
[Chart: LBMHD (64^3) GFlop/s per core vs. number of cores (1-16) for Opteron, Clovertown, Niagara2 (Huron), and Cell Blade.]
What are structured grids?
Structured Grids
•Data is arranged in regular multidimensional grids (usually 2-4 dimensions)
•Computation is a series of grid update steps
•Neighbor addressing is implicit, based on each point's coordinates
•For a given point, a stencil is a pre-determined set of nearest neighbors (possibly including itself)
•A stencil code updates every point in a regular grid with a common stencil (see the sweep sketched below)
[Figures: a 5-point 2D stencil and a 7-point 3D stencil.]
•There are several structured grid kernels, including:
  Basic Poisson solvers (e.g., Jacobi and Gauss-Seidel iterations)
  Multigrid
  Mesh Refinement
  Adaptive Mesh Refinement (AMR)
  Lattice Methods (including LBMHD)
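A minimal sketch of one out-of-place (Jacobi-style) sweep of the 7-point 3D stencil; the grid size, indexing macro, and coefficients are illustrative assumptions, and boundary points are simply left untouched.

```c
#define NX 64
#define NY 64
#define NZ 64
#define IDX(i, j, k) ((((k) * NY) + (j)) * NX + (i))

/* One sweep: read from 'in', write to 'out' (out-of-place, Jacobi style). */
void stencil7_sweep(const double *in, double *out, double alpha, double beta)
{
    for (int k = 1; k < NZ - 1; k++)
        for (int j = 1; j < NY - 1; j++)
            for (int i = 1; i < NX - 1; i++)
                /* each point is updated from itself and its 6 face neighbors */
                out[IDX(i, j, k)] =
                    alpha * in[IDX(i, j, k)] +
                    beta  * (in[IDX(i - 1, j, k)] + in[IDX(i + 1, j, k)] +
                             in[IDX(i, j - 1, k)] + in[IDX(i, j + 1, k)] +
                             in[IDX(i, j, k - 1)] + in[IDX(i, j, k + 1)]);
}
```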
Sparse Matrix-Vector Multiplication
[Figure: a sparse matrix A multiplied by a dense vector x, producing a dense vector y.]
Sparse Matrices
•Most entries are 0.0
•Significant performance advantage in only storing/operating on the nonzeros

Evaluate y=Ax
•A is a sparse matrix
•x & y are dense vectors

Challenges
•Difficult to exploit ILP and DLP
•Irregular memory access to the source vector
•Often difficult to load balance
•Very low computational intensity (often >6 bytes/flop)
•Likely memory bound
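For reference, a minimal sketch of the unoptimized kernel these challenges refer to, in standard compressed sparse row (CSR) storage; the names are illustrative.

```c
/* y = A*x for A in CSR format:
 * row_ptr[i]..row_ptr[i+1] index the nonzeros of row i. */
void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];  /* irregular access into x */
        y[i] = sum;
    }
}
```

Each nonzero moves at least 12 bytes (an 8-byte value and a 4-byte index) for 2 flops, which is where the >6 bytes/flop figure above comes from.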
Overview
Multicore architectures are the paradigm for the next decade
•Offer the best tradeoffs: performance, energy, reliability
•However, diverse CMP choices raise difficult questions:
  performance, application suitability, optimization strategy
Progress to date:
•Explored the broadest set of CMP systems in the HPC literature
•Examined two diverse scientific algorithms: SpMV and LBMHD
•Developed performance-portable autotuners
•Achieved the highest intra-node performance in the literature
•Continuing effort with additional platforms and algorithms
•Developing a roofline model which offers insight into CMP performance
•The model ties together floating point performance, arithmetic intensity, and memory performance (see the bound below)
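The bound itself can be stated in one line (a standard formulation of the roofline; the per-machine ceilings in the plots below, such as "w/out SIMD", simply lower the flat roof):

```latex
\text{Attainable GFlop/s} \;=\;
\min\bigl(\text{peak GFlop/s},\;
          \text{peak bandwidth (GB/s)} \times \text{arithmetic intensity (flop:byte)}\bigr)
```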
[Figure: kernels arranged along the arithmetic intensity axis, from O(1) to O(log(N)) to O(N): SpMV and BLAS1,2; stencils (PDEs); lattice methods; FFTs; dense linear algebra (BLAS3); particle methods.]
Insights:
•Paradoxically, the most complex/advanced architectures required the most tuning, and delivered the lowest performance
•Niagara2 delivered both very good performance and productivity
•Cell delivered extremely good performance and efficiency (processor and power)
•Our multicore-specific autotuning is required to achieve top performance
•Architectural transparency is invaluable in optimizing code
•Results suggest CMP designs should maximize effective bandwidth utilization with increasing cores, even at the cost of single-core performance
[Roofline plots: attainable GFlop/s (0.5-256, log scale) vs. actual flop:byte ratio (1/8-16) for three machines. Cell Blade (SPEs): ceilings at peak DP, w/out FMA, w/out SIMD, w/out ILP. Opteron (rev.F): ceilings at peak DP, mul/add imbalance, w/out ILP or SIMD. Xeon (Clovertown): ceilings at peak DP, mul/add imbalance, w/out SIMD, w/out ILP.]