Computational Research Division
Tuning Sparse Matrix Vector Multiplication for multi-core SMPs
Samuel Williams1,2, Richard Vuduc3, Leonid Oliker1,2,
John Shalf2, Katherine Yelick1,2, James Demmel1,2
1 University of California, Berkeley   2 Lawrence Berkeley National Laboratory   3 Georgia Institute of Technology
Overview

- Multicore is the de facto performance solution for the next decade
- Examined the Sparse Matrix Vector Multiplication (SpMV) kernel: an important HPC kernel, memory intensive, and challenging for multicore
- Present two autotuned threaded implementations: a Pthread, cache-based implementation and a Cell local-store-based implementation
- Benchmarked performance across 4 diverse multicore architectures: Intel Xeon (Clovertown), AMD Opteron, Sun Niagara2, IBM Cell Broadband Engine
- Compare with the leading MPI implementation (PETSc) using an autotuned serial kernel (OSKI)
Sparse Matrix Vector Multiplication

- Sparse matrix: most entries are 0.0, so there is a performance advantage in storing and operating on only the nonzeros, but this requires significant metadata
- Evaluate y = Ax, where A is a sparse matrix and x and y are dense vectors
- Challenges: difficult to exploit ILP (bad for superscalar) and DLP (bad for SIMD); irregular memory access to the source vector; difficult to load balance; very low computational intensity (often >6 bytes per flop)

[Figure: schematic of y = Ax]
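For concreteness, here is a minimal serial SpMV kernel over CSR storage, the format the deck's naive implementation uses. This is a generic sketch of such a kernel, not code from the paper.

```c
/* Minimal serial SpMV (y = A*x) over CSR storage.
 * row_ptr has n_rows+1 entries; col_idx/vals hold the nonzeros. */
void spmv_csr(int n_rows, const int *row_ptr, const int *col_idx,
              const double *vals, const double *x, double *y)
{
    for (int r = 0; r < n_rows; r++) {
        double sum = 0.0;
        for (int i = row_ptr[r]; i < row_ptr[r + 1]; i++)
            sum += vals[i] * x[col_idx[i]];  /* irregular read of x */
        y[r] = sum;
    }
}
```

The indirect read x[col_idx[i]] is exactly the irregular source-vector access listed among the challenges above.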
Test Suite
Dataset (matrices) / Multicore SMPs
Matrices Used

- Pruned the original SPARSITY suite down to 14 matrices; none should fit in cache
- Subdivided them into 4 categories
- Rank ranges from 2K to 1M

The 14 matrices: Dense, Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology, FEM/Accelerator, Circuit, webbase, LP

Categories:
- Dense: a 2K x 2K dense matrix stored in sparse format
- Well structured (sorted by nonzeros/row)
- Poorly structured (hodgepodge)
- Extreme aspect ratio (linear programming)
Multicore SMP Systems

[Figure: block diagrams of the four test systems.
- Intel Clovertown: two quad-core sockets; each pair of Core2 cores shares a 4MB L2; front-side buses (10.6 GB/s each) feed a chipset with 4x64b controllers to fully buffered DRAM (21.3 GB/s read, 10.6 GB/s write).
- AMD Opteron: two dual-core sockets; each core has a 1MB victim cache; each socket has its own memory controller to DDR2 DRAM (10.6 GB/s), with HyperTransport links (4 GB/s each direction) between sockets.
- Sun Niagara2: eight multithreaded UltraSparc cores (8K D$ and an FPU each) connected through a crossbar switch (179 GB/s fill, 90 GB/s write-thru) to a shared 4MB 16-way L2 and 4x128b FBDIMM memory controllers (42.7 GB/s read, 21.3 GB/s write).
- IBM Cell blade: two sockets, each with a PPE (512K L2) and eight SPEs (256K local store with an MFC each) on the EIB ring network (<<20 GB/s each direction) to XDR DRAM (25.6 GB/s per socket), coupled via the BIF.]
Multicore SMP Systems (memory hierarchy)

[Figure: the same four system diagrams. Clovertown, Opteron, and Niagara2 use a conventional cache-based memory hierarchy; the Cell blade uses disjoint local-store memory hierarchies.]
Multicore SMP Systems (cache)

[Figure: the same diagrams, annotated with aggregate on-chip capacity: Clovertown 16MB (the vectors fit), Opteron 4MB, Niagara2 4MB, Cell 4MB (local store).]
Multicore SMP Systems (peak flops)

[Figure: the same diagrams, annotated with peak double-precision rates: Clovertown 75 Gflop/s (w/SIMD), Opteron 17 Gflop/s, Niagara2 11 Gflop/s, Cell 29 Gflop/s (w/SIMD).]
Multicore SMP Systems (peak read bandwidth)

[Figure: the same diagrams, annotated with peak read bandwidth: Clovertown 21 GB/s, Opteron 21 GB/s, Niagara2 43 GB/s, Cell 51 GB/s.]
Multicore SMP Systems (NUMA)

[Figure: the same diagrams, grouped by memory model: Clovertown and Niagara2 are uniform memory access; Opteron and Cell are non-uniform memory access.]
Naïve Implementation
For cache-based machines; a median performance number is included.
Vanilla C Performance

[Figure: naive SpMV performance, GFlop/s from 0.0 to 1.0, across the 14 matrices plus the median, on Intel Clovertown, AMD Opteron, and Sun Niagara2.]

- Vanilla C implementation
- Matrix stored in CSR (compressed sparse row)
- Explored compiler options; only the best is presented here
Pthread Implementation
Optimized for multicore/threading. A variety of shared-memory programming models would be acceptable (not just Pthreads). In the following performance figures, more colors = more optimizations = more work.
Parallelization

- Matrix partitioned by rows and balanced by the number of nonzeros (see the sketch after the figure)
- SPMD-like approach: a barrier() is called before and after the SpMV kernel
- Each submatrix is stored separately in CSR
- Load balancing can be challenging
- Number of threads explored in powers of 2 (in the paper)
[Figure: matrix A partitioned into row blocks, with source vector x and destination vector y]
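As a rough illustration of nonzero-balanced row partitioning, the sketch below computes cutoffs from the CSR row pointer. The greedy cutoff logic is an assumption for illustration, not the paper's exact scheme.

```c
/* Sketch: split n_rows among n_threads so each thread gets roughly
 * an equal share of nonzeros (nnz = row_ptr[n_rows]). Illustrative
 * greedy cutoff, not the paper's actual code. */
void partition_rows(int n_rows, const int *row_ptr, int n_threads,
                    int *row_start /* size n_threads+1 */)
{
    long nnz = row_ptr[n_rows];
    int r = 0;
    row_start[0] = 0;
    for (int t = 1; t < n_threads; t++) {
        long target = nnz * t / n_threads;  /* desired cumulative nnz */
        while (r < n_rows && row_ptr[r] < target)
            r++;
        row_start[t] = r;                   /* thread t starts here */
    }
    row_start[n_threads] = n_rows;
}
```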
Naïve Parallel Performance

[Figure: naive Pthreads vs. naive single-thread performance, GFlop/s from 0.0 to 3.0, across the matrix suite on Intel Clovertown, AMD Opteron, and Sun Niagara2.]
Naïve Parallel Performance

[Figure: the same data, annotated with parallel scaling: Clovertown, 8x cores = 1.9x performance; Opteron, 4x cores = 1.5x performance; Niagara2, 64x threads = 41x performance.]
Naïve Parallel Performance

[Figure: the same data, annotated with sustained fractions of peak: Clovertown, 1.4% of peak flops / 29% of bandwidth; Opteron, 4% of peak flops / 20% of bandwidth; Niagara2, 25% of peak flops / 39% of bandwidth.]
Case for Autotuning

- How do we deliver good performance across all these architectures and all these matrices without exhaustively optimizing every combination?
- Autotuning: write a Perl script that generates all possible optimizations, then heuristically or exhaustively search them
- Existing serial SpMV solution: OSKI (developed at UCB)
- This work: optimizations geared for multicore/multithreading; generates SSE/SIMD intrinsics, prefetching, loop transformations, alternate data structures, etc. A "prototype for parallel OSKI". A sketch of one such generated variant follows.
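For flavor, here is the kind of code variant such a generator might emit: the CSR inner loop with an explicit software-prefetch distance. The distance PF_DIST stands in for the kind of parameter the search would tune; this is an illustrative sketch, not the generator's actual output.

```c
#include <xmmintrin.h>  /* _mm_prefetch */

#define PF_DIST 64  /* tunable prefetch distance (illustrative) */

/* Sketch of one autotuner-generated CSR variant: the vanilla loop
 * plus software prefetch of the value/index streams. Prefetch may
 * run past the array end; as a hint this does not fault. */
void spmv_csr_pf(int n_rows, const int *row_ptr, const int *col_idx,
                 const double *vals, const double *x, double *y)
{
    for (int r = 0; r < n_rows; r++) {
        double sum = 0.0;
        for (int i = row_ptr[r]; i < row_ptr[r + 1]; i++) {
            _mm_prefetch((const char *)&vals[i + PF_DIST], _MM_HINT_NTA);
            _mm_prefetch((const char *)&col_idx[i + PF_DIST], _MM_HINT_NTA);
            sum += vals[i] * x[col_idx[i]];
        }
        y[r] = sum;
    }
}
```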
Exploiting NUMA, Affinity

- Bandwidth on the Opteron (and Cell) can vary substantially based on the placement of data
- Bind each submatrix and the thread that processes it together (a sketch follows the figure)
- Explored libnuma, Linux, and Solaris routines
- Adjacent blocks bound to adjacent cores

[Figure: Opteron memory placement scenarios: a single thread; multiple threads using one memory controller; multiple threads using both memory controllers.]
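One Linux/libnuma flavor of the binding idea is sketched below (the paper also explored Solaris routines); the core/node mapping and helper names are illustrative assumptions, and error handling is omitted.

```c
#define _GNU_SOURCE
#include <numa.h>      /* link with -lnuma */
#include <pthread.h>
#include <sched.h>

/* Sketch: place a submatrix's storage on a given NUMA node... */
void *alloc_block_on_node(size_t bytes, int node)
{
    return numa_alloc_onnode(bytes, node);
}

/* ...and pin the thread that processes it to a core on that node. */
void pin_self_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```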
Performance (+NUMA)

[Figure: performance, GFlop/s from 0.0 to 3.0, on Intel Clovertown, AMD Opteron, and Sun Niagara2, with bars stacked as Naïve Single Thread, Naïve Pthreads, +NUMA/Affinity.]
Performance (+SW Prefetching)

[Figure: as above, adding +Software Prefetching to the stack of Naïve Single Thread, Naïve Pthreads, +NUMA/Affinity.]
Matrix Compression

- For memory-bound kernels, minimizing memory traffic should maximize performance
- Compress the metadata; exploit structure to eliminate metadata
- Heuristic: select the compression that minimizes the matrix size: power-of-2 register blocking, CSR/COO format, 16b/32b indices, etc. (a register-blocked sketch follows)
- Side effect: the matrix may shrink to the point where it fits entirely in cache
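To make the compression concrete, here is a sketch of SpMV over 2x1 register-blocked CSR with 16-bit column indices, one point in the search space the slide describes. The layout and names are assumptions for illustration; it assumes an even row count and index values below 65536.

```c
#include <stdint.h>

/* Sketch: 2x1-blocked CSR with compressed (16b) column indices.
 * brow_ptr indexes blocks per block-row; each block holds the two
 * values for rows 2*br and 2*br+1 of one column. */
void spmv_bcsr_2x1(int n_brows, const int *brow_ptr,
                   const uint16_t *bcol_idx, const double *bvals,
                   const double *x, double *y)
{
    for (int br = 0; br < n_brows; br++) {
        double sum0 = 0.0, sum1 = 0.0;
        for (int b = brow_ptr[br]; b < brow_ptr[br + 1]; b++) {
            double xv = x[bcol_idx[b]];     /* one x load per block */
            sum0 += bvals[2 * b + 0] * xv;
            sum1 += bvals[2 * b + 1] * xv;
        }
        y[2 * br + 0] = sum0;
        y[2 * br + 1] = sum1;
    }
}
```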
Performance (+matrix compression)

[Figure: performance, GFlop/s now from 0.0 to 6.0, adding +Matrix Compression to the optimization stack, on Intel Clovertown, AMD Opteron, and Sun Niagara2.]
Cache and TLB Blocking

- Accesses to the matrix and destination vector are streaming, but access to the source vector can be random
- Reorganize the matrix (and thus the access pattern) to maximize reuse; this applies equally to TLB blocking (caching PTEs)
- Heuristic: block the destination, then keep adding more columns as long as the number of source-vector cache lines (or pages) touched is less than the cache (or TLB). Apply all previous optimizations individually to each cache block. (A sketch of the heuristic follows the figure.)
- Search: neither, cache, cache&TLB
- Better locality comes at the expense of confusing the hardware prefetchers

[Figure: cache-blocked matrix A with source vector x and destination vector y]
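A rough sketch of the column-growing heuristic follows. The helper count_touched_lines() is hypothetical; a real implementation would scan the candidate block's column indices to count distinct source-vector cache lines (or pages, for the TLB variant).

```c
/* Sketch: grow a cache block's column range until the distinct
 * source-vector lines it touches would exceed the cache capacity.
 * count_touched_lines(lo, hi) is a hypothetical helper returning
 * the number of distinct x-vector lines touched by columns [lo,hi). */
int grow_cache_block(int n_cols, int col_start, long cache_lines,
                     long (*count_touched_lines)(int lo, int hi))
{
    int hi = col_start;
    while (hi < n_cols &&
           count_touched_lines(col_start, hi + 1) <= cache_lines)
        hi++;              /* extend while x still fits in cache */
    return hi;             /* exclusive end column of this block */
}
```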
Performance (+cache blocking)

[Figure: performance, GFlop/s from 0.0 to 6.0, adding +Cache/TLB Blocking to the optimization stack, on Intel Clovertown, AMD Opteron, and Sun Niagara2.]
Banks, Ranks, and DIMMs

- In this SPMD approach, as the number of threads increases, so too does the number of concurrent streams to memory
- Most memory controllers have a finite capability to reorder requests (DMA can avoid or minimize this)
- Addressing/bank conflicts become increasingly likely
- Adding more DIMMs and configuring ranks can help; the Clovertown system was already fully populated
Performance (more DIMMs, …)

[Figure: performance, GFlop/s from 0.0 to 6.0, with the full optimization stack: Naïve Single Thread, Naïve Pthreads, +NUMA/Affinity, +Software Prefetching, +Matrix Compression, +Cache/TLB Blocking, +More DIMMs/rank configuration, on Intel Clovertown, AMD Opteron, and Sun Niagara2.]
Performance (more DIMMs, …)

[Figure: the same data, annotated with sustained fractions of peak: Clovertown, 4% of peak flops / 52% of bandwidth; Opteron, 20% of peak flops / 66% of bandwidth; Niagara2, 52% of peak flops / 54% of bandwidth.]
Performance (more DIMMs, …)

[Figure: the same data, annotated with how many optimizations proved essential on each machine: Clovertown 3, Opteron 4, Niagara2 2.]
Cell Implementation
Comments and performance
Cell Implementation

- No vanilla C implementation (aside from the PPE): even SIMDized double precision is extremely weak, and scalar double precision is unbearable
- Minimum register blocking is 2x1 (SIMDizable), which can increase memory traffic by 66%
- The cache-blocking optimization is transformed into local-store blocking; spatial and temporal locality is captured in software when the matrix is optimized. In essence, the high bits of the column indices are grouped into DMA lists
- No branch prediction: replace branches with conditional operations (see the sketch below)
- In some cases, what were optional optimizations on cache-based machines are requirements for correctness on Cell
- Despite the performance, Cell is still handicapped by double precision
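As a tiny illustration of replacing a branch with a conditional operation, the sketch below shows a branch-free select in plain C; it is illustrative only, not code from the paper.

```c
/* Sketch: branch-free select, the style of transformation used on
 * the SPEs (which have no branch prediction). Compilers can map
 * this ternary onto a select instruction (e.g., selb on the SPU)
 * rather than a conditional branch, though this is not guaranteed. */
static inline double select_d(int cond, double a, double b)
{
    return cond ? a : b;
}
```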
Performance

[Figure: autotuned performance, GFlop/s from 0.0 to 12.0, across the matrix suite on Intel Clovertown, AMD Opteron, Sun Niagara2, and the IBM Cell Broadband Engine; the Cell panel shows scaling over 1, 8, and 16 SPEs.]
Performance

[Figure: the same data, with the Cell panel annotated: 39% of peak flops, 89% of bandwidth.]
Multicore MPI Implementation
This is the default approach to programming multicore.
Multicore MPI Implementation

- Used PETSc with shared-memory MPICH
- Used OSKI (developed at UCB) to optimize each thread, yielding a highly optimized MPI baseline
[Figure: MPI (autotuned) vs. Pthreads (autotuned) vs. Naïve Single Thread, GFlop/s from 0.0 to 4.0, on Intel Clovertown and AMD Opteron.]
Median Performance & Efficiency

[Figure, left: median performance, GFlop/s from 0.0 to 7.0, for Naïve, PETSc+OSKI (autotuned), and Pthreads (autotuned) on Intel Clovertown, AMD Opteron, Sun Niagara2, and IBM Cell.]
[Figure, right: median power efficiency, MFlop/s per Watt from 0.0 to 24.0, for the same configurations and machines.]

- Used a digital power meter to measure sustained system power; FBDIMM drives up Clovertown and Niagara2 power
- Right: sustained MFlop/s divided by sustained Watts
- The default approach (MPI) achieves very low performance and efficiency
Summary

- Paradoxically, the most complex/advanced architectures required the most tuning and delivered the lowest performance
- Most machines achieved less than 50-60% of DRAM bandwidth
- Niagara2 delivered both very good performance and productivity
- Cell delivered very good performance and efficiency: 90% of memory bandwidth, high power efficiency, and easily understood performance; extra traffic = lower performance (future work can address this)
- The multicore-specific autotuned implementation significantly outperformed a state-of-the-art MPI implementation: matrix compression geared towards multicore, NUMA, prefetching
Acknowledgments

- UC Berkeley: RADLab cluster (Opterons), PSI cluster (Clovertowns)
- Sun Microsystems: Niagara2
- Forschungszentrum Jülich: Cell blade cluster