
Computational Research Division (BIPS)

Tuning Sparse Matrix Vector Multiplication for multi-core SMPs

Samuel Williams (1,2), Richard Vuduc (3), Leonid Oliker (1,2), John Shalf (2), Katherine Yelick (1,2), James Demmel (1,2)

(1) University of California, Berkeley   (2) Lawrence Berkeley National Laboratory   (3) Georgia Institute of Technology

[email protected]

Overview

Multicore is the de facto performance solution for the next decade

Examined the Sparse Matrix Vector Multiplication (SpMV) kernel: an important HPC kernel, memory intensive, and challenging for multicore

Present two autotuned threaded implementations: a Pthread, cache-based implementation and a Cell local store-based implementation

Benchmarked performance across 4 diverse multicore architectures: Intel Xeon (Clovertown), AMD Opteron, Sun Niagara2, and IBM Cell Broadband Engine

Compare with the leading MPI implementation (PETSc) using an autotuned serial kernel (OSKI)

Sparse Matrix Vector Multiplication

Sparse matrix: most entries are 0.0; there is a performance advantage in only storing/operating on the nonzeros, but this requires significant meta data

Evaluate y = Ax, where A is a sparse matrix and x & y are dense vectors

Challenges: difficult to exploit ILP (bad for superscalar); difficult to exploit DLP (bad for SIMD); irregular memory access to the source vector; difficult to load balance; very low computational intensity (often >6 bytes/flop)

[Figure: y = A·x]
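For reference, a minimal CSR SpMV kernel in plain C looks roughly like the sketch below; the array names (rowptr, colidx, values) are illustrative, not the code used in this study.

```c
#include <stddef.h>

/* y = A*x for a matrix A stored in CSR (compressed sparse row):
 *   rowptr[i]..rowptr[i+1]-1 index the nonzeros of row i,
 *   colidx[k] is the column of the k-th nonzero, values[k] its value. */
void spmv_csr(size_t nrows,
              const size_t *rowptr, const int *colidx,
              const double *values, const double *x, double *y)
{
    for (size_t i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (size_t k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += values[k] * x[colidx[k]];   /* irregular access to x */
        y[i] = sum;
    }
}
```

The indirect load x[colidx[k]] is the irregular access that makes the kernel hard to SIMDize and hard on the cache hierarchy.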

Test Suite

Multicore SMPs

Dataset (Matrices)

Matrices Used

Pruned the original SPARSITY suite down to 14 matrices; none should fit in cache; subdivided them into 4 categories; rank ranges from 2K to 1M

[Figure: spy plots of the 14 matrices: Dense, Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology, FEM/Accelerator, Circuit, webbase, LP]

Categories: 2K x 2K dense matrix stored in sparse format; well structured (sorted by nonzeros/row); poorly structured hodgepodge; extreme aspect ratio (linear programming)

Multicore SMP Systems

[Figure: block diagrams of the four test platforms]

Intel Clovertown: 2 sockets x 4 Core2 cores; each pair of cores shares a 4MB L2; two front-side buses (10.6 GB/s each) into a chipset with 4x64b fully buffered DRAM controllers; 21.3 GB/s (read), 10.6 GB/s (write)

AMD Opteron: 2 sockets x 2 cores, each core with a 1MB victim cache; per-socket memory controller / HyperTransport; DDR2 DRAM at 10.6 GB/s per socket; 4 GB/s HT links (each direction)

Sun Niagara2: 8 multithreaded UltraSparc cores (8K D$ and FPU each) behind a crossbar switch; 4MB shared L2 (16-way) with 179 GB/s (fill) and 90 GB/s (write-thru) crossbar bandwidth; 4x128b FBDIMM memory controllers; 42.7 GB/s (read), 21.3 GB/s (write)

IBM Cell Blade: 2 chips, each with a PPE (512K L2) and 8 SPEs (256K local store and MFC each) on the EIB ring network; XDR DRAM at 25.6 GB/s per chip; BIF link at <20 GB/s each direction

Multicore SMP Systems (memory hierarchy)

[Figure: the same four block diagrams, grouped by memory hierarchy]

Clovertown, Opteron, and Niagara2 use a conventional cache-based memory hierarchy; the Cell blade uses disjoint, software-managed local stores

Multicore SMP Systems (cache)

[Figure: the same diagrams, annotated with aggregate on-chip cache / local store capacity]

Clovertown: 16MB (vectors fit); Opteron: 4MB; Niagara2: 4MB; Cell: 4MB (local store)

Multicore SMP Systems (peak flops)

[Figure: the same diagrams, annotated with peak double-precision flop rates]

Clovertown: 75 Gflop/s (w/SIMD); Opteron: 17 Gflop/s; Niagara2: 11 Gflop/s; Cell: 29 Gflop/s (w/SIMD)

Multicore SMP Systems (peak read bandwidth)

[Figure: the same diagrams, annotated with peak DRAM read bandwidth]

Clovertown: 21 GB/s; Opteron: 21 GB/s; Niagara2: 43 GB/s; Cell: 51 GB/s

Multicore SMP Systems (NUMA)

[Figure: the same diagrams, grouped by memory access uniformity]

Clovertown and Niagara2 are uniform memory access; the Opteron and Cell blades are non-uniform memory access (NUMA)

Naïve Implementation

For cache-based machines

Included a median performance number

Vanilla C Performance

[Figure: naïve single-threaded SpMV performance (GFlop/s) across the matrix suite and its median, on Intel Clovertown, AMD Opteron, and Sun Niagara2]

Vanilla C implementation

Matrix stored in CSR (compressed sparse row)

Explored compiler options; only the best is presented here

Pthread Implementation

Optimized for multicore/threading

A variety of shared memory programming models are acceptable (not just Pthreads)

More colors = more optimizations = more work

Parallelization

Matrix partitioned by rows and balanced by the number of nonzeros (see the sketch below)

SPMD-like approach: a barrier() is called before and after the SpMV kernel; each sub-matrix is stored separately in CSR; load balancing can be challenging

Number of threads explored in powers of 2 (in paper)

[Figure: row-partitioned y = A·x]
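A minimal sketch of how such a nonzero-balanced row partition can be computed from the CSR row pointers (illustrative helper, not the authors' code); each thread t then runs the CSR kernel over rows row_start[t]..row_start[t+1]-1 between barriers.

```c
#include <stddef.h>

/* Split rows into nthreads contiguous blocks with roughly equal nonzeros.
 * On return, row_start[t]..row_start[t+1]-1 are the rows owned by thread t;
 * row_start must have room for nthreads+1 entries. */
void partition_rows_by_nnz(size_t nrows, const size_t *rowptr,
                           int nthreads, size_t *row_start)
{
    size_t nnz = rowptr[nrows];
    size_t row = 0;
    row_start[0] = 0;
    for (int t = 1; t < nthreads; t++) {
        size_t target = (nnz * (size_t)t) / (size_t)nthreads;
        while (row < nrows && rowptr[row] < target)
            row++;                    /* advance to the first row at/after target */
        row_start[t] = row;
    }
    row_start[nthreads] = nrows;
}
```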

Naïve Parallel Performance

[Figure: naïve Pthreads vs. naïve single-thread SpMV performance (GFlop/s) on Intel Clovertown, AMD Opteron, and Sun Niagara2]

Naïve Parallel Performance

[Figure: the same data, annotated with parallel scaling]

Clovertown: 8x cores = 1.9x performance; Opteron: 4x cores = 1.5x performance; Niagara2: 64x threads = 41x performance

Naïve Parallel Performance

[Figure: the same data, annotated with the fraction of peak flops and of DRAM bandwidth achieved]

Clovertown: 1.4% of peak flops, 29% of bandwidth; Opteron: 4% of peak flops, 20% of bandwidth; Niagara2: 25% of peak flops, 39% of bandwidth

Case for Autotuning

How do we deliver good performance across all these architectures and all these matrices without exhaustively optimizing every combination?

Autotuning: write a Perl script that generates all possible optimizations; heuristically or exhaustively search the optimizations; existing SpMV solution: OSKI (developed at UCB)

This work: optimizations geared for multicore/multithreading; generates SSE/SIMD intrinsics, prefetching, loop transformations, alternate data structures, etc.; a "prototype for parallel OSKI"
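As an illustration of one class of generated variant, here is a hand-written CSR inner loop with explicit software prefetching of the nonzero stream; PF_DIST is a tunable parameter of the kind an autotuner would search over, and the example is assumed, not the generator's actual output.

```c
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA (SSE) */
#include <stddef.h>

#define PF_DIST 64       /* tunable: how many nonzeros ahead to prefetch */

void spmv_csr_prefetch(size_t nrows,
                       const size_t *rowptr, const int *colidx,
                       const double *values, const double *x, double *y)
{
    for (size_t i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (size_t k = rowptr[i]; k < rowptr[i + 1]; k++) {
            /* Stream the matrix arrays ahead of use; prefetches are hints
             * and do not fault, so running past the end is harmless. */
            _mm_prefetch((const char *)&values[k + PF_DIST], _MM_HINT_NTA);
            _mm_prefetch((const char *)&colidx[k + PF_DIST], _MM_HINT_NTA);
            sum += values[k] * x[colidx[k]];
        }
        y[i] = sum;
    }
}
```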

Exploiting NUMA, Affinity

Bandwidth on the Opteron (and Cell) can vary substantially based on the placement of data

Bind each sub-matrix and the thread that processes it together

Explored libnuma, Linux, and Solaris routines; adjacent blocks are bound to adjacent cores

[Figure: Opteron data placement scenarios: a single thread; multiple threads on one memory controller; multiple threads across both memory controllers]
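A minimal sketch of the binding step on Linux, assuming libnuma and pthreads (the helper name is hypothetical): pin the thread to its core, then allocate its sub-matrix on that core's memory node so data placement and computation coincide.

```c
#define _GNU_SOURCE
#include <numa.h>        /* numa_available, numa_alloc_onnode, numa_node_of_cpu; link with -lnuma */
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>

/* Pin the calling thread to `cpu` and return a buffer of `bytes`
 * allocated on that CPU's local memory node. */
void *alloc_submatrix_local(int cpu, size_t bytes)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);

    if (numa_available() < 0)
        return malloc(bytes);                 /* no NUMA support: plain allocation */
    return numa_alloc_onnode(bytes, numa_node_of_cpu(cpu));
}
```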

Performance (+NUMA)

[Figure: SpMV performance (GFlop/s) on Intel Clovertown, AMD Opteron, and Sun Niagara2; bars stack +NUMA/Affinity on top of naïve Pthreads and naïve single thread]

Performance (+SW Prefetching)

[Figure: as before, adding +Software Prefetching on top of +NUMA/Affinity, naïve Pthreads, and naïve single thread]

Matrix Compression

For memory-bound kernels, minimizing memory traffic should maximize performance

Compress the meta data; exploit structure to eliminate meta data

Heuristic: select the compression that minimizes the matrix size: power-of-2 register blocking, CSR/COO format, 16b/32b indices, etc.

Side effect: the matrix may be compressed to the point where it fits entirely in cache
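The size heuristic is easy to state in code; a sketch with assumed parameter names: for each candidate register block r x c and index width, estimate the blocked-CSR footprint (including any explicit zero fill) and keep the candidate with the smallest value.

```c
#include <stddef.h>

/* Estimated blocked-CSR storage (bytes) for an r x c register blocking with
 * `blocks` nonzero register blocks, `brows` block rows, and idx_bytes (2 or 4)
 * bytes per block column index.  An autotuner-style heuristic evaluates this
 * for each candidate (r, c, index width) and picks the smallest footprint. */
size_t bcsr_bytes(int r, int c, size_t blocks, size_t brows, int idx_bytes)
{
    return blocks * (size_t)(r * c) * sizeof(double)   /* values, incl. zero fill */
         + blocks * (size_t)idx_bytes                  /* block column indices    */
         + (brows + 1) * sizeof(size_t);               /* block row pointers      */
}
```

For example, a matrix whose nonzeros naturally occur in 2x1 pairs (no fill) stores half as many column indices as scalar CSR, and switching from 32b to 16b indices halves the index traffic again.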

Performance (+matrix compression)

[Figure: as before (the scale now extends to 6 GFlop/s), adding +Matrix Compression on top of the previous optimizations]

Cache and TLB Blocking

Accesses to the matrix and destination vector are streaming, but access to the source vector can be random; reorganize the matrix (and thus the access pattern) to maximize reuse; applies equally to TLB blocking (caching PTEs)

Heuristic: block the destination, then keep adding more columns as long as the number of source-vector cache lines (or pages) touched is less than the cache (or TLB) capacity; apply all previous optimizations individually to each cache block

Search: neither, cache, cache & TLB

Better locality at the expense of confusing the hardware prefetchers

[Figure: cache-blocked y = A·x]
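A sketch of the column-growing step of that heuristic, under the assumption that the distinct source-vector columns touched by the current row block are available in sorted order (names are illustrative): extend the cache block until it would touch more source-vector cache lines than the budget allows.

```c
#include <stddef.h>

#define DOUBLES_PER_LINE 8   /* 64-byte cache line / 8-byte double */

/* cols[0..ncols-1]: sorted, distinct source-vector columns touched by this
 * block of rows.  Returns the index one past the last column included in the
 * next cache block, so cols[start..k-1] form the block. */
size_t next_cache_block_end(const int *cols, size_t ncols,
                            size_t start, size_t line_budget)
{
    size_t lines = 0;
    long last_line = -1;
    size_t k;
    for (k = start; k < ncols; k++) {
        long line = cols[k] / DOUBLES_PER_LINE;
        if (line != last_line) {          /* this column touches a new line */
            if (lines == line_budget)
                break;                     /* budget exhausted: end the block */
            lines++;
            last_line = line;
        }
    }
    return k;
}
```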

Performance (+cache blocking)

[Figure: as before, adding +Cache/TLB Blocking on top of the previous optimizations]

Banks, Ranks, and DIMMs

In this SPMD approach, as the number of threads increases, so too does the number of concurrent streams to memory

Most memory controllers have only a finite capability to reorder requests (DMA can avoid or minimize this)

Addressing/bank conflicts become increasingly likely

Adding more DIMMs and configuring ranks can help; the Clovertown system was already fully populated

Performance (more DIMMs, …)

[Figure: as before, adding +More DIMMs, rank configuration, etc. on top of the previous optimizations]

Performance (more DIMMs, …)

[Figure: the same data, annotated with the fraction of peak flops and of DRAM bandwidth achieved after full tuning]

Clovertown: 4% of peak flops, 52% of bandwidth; Opteron: 20% of peak flops, 66% of bandwidth; Niagara2: 52% of peak flops, 54% of bandwidth

Performance (more DIMMs, …)

[Figure: the same data, annotated with the number of essential optimizations per platform]

Clovertown: 3 essential optimizations; Opteron: 4 essential optimizations; Niagara2: 2 essential optimizations

Cell Implementation

Comments

Performance

Cell Implementation

No vanilla C implementation (aside from the PPE): even SIMDized double precision is extremely weak, and scalar double precision is unbearable

Minimum register blocking is 2x1 (SIMDizable); this can increase memory traffic by 66%

The cache blocking optimization is transformed into local store blocking: spatial and temporal locality is captured by software when the matrix is optimized; in essence, the high bits of the column indices are grouped into DMA lists

No branch prediction: replace branches with conditional operations

In some cases, what were optional optimizations on cache-based machines are requirements for correctness on Cell

Despite the performance, Cell is still handicapped by double precision
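As a generic illustration of the branch-to-select transformation (not the kernel's actual branches): compute both outcomes and blend them with a mask, a pattern that on the SPE maps onto select intrinsics such as spu_sel.

```c
/* Branchy form: a data-dependent branch, costly without branch prediction. */
static inline double clamp_branchy(double v, double hi)
{
    if (v > hi)
        return hi;
    return v;
}

/* Branch-free form: evaluate both sides and select with an arithmetic mask. */
static inline double clamp_select(double v, double hi)
{
    double take_hi = (double)(v > hi);        /* 1.0 if v > hi, else 0.0 */
    return take_hi * hi + (1.0 - take_hi) * v;
}
```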

Performance

[Figure: autotuned SpMV performance (GFlop/s) on Intel Clovertown, AMD Opteron, Sun Niagara2, and IBM Cell Broadband Engine; Cell results shown for 1, 8, and 16 SPEs]

Performance

[Figure: the same data; the Cell panel is annotated with 39% of peak flops and 89% of bandwidth]

Multicore MPI Implementation

This is the default approach to programming multicore

Multicore MPI Implementation

Used PETSc with shared-memory MPICH

Used OSKI (developed at UCB) to optimize each MPI process = a highly optimized MPI baseline

[Figure: MPI (autotuned) vs. Pthreads (autotuned) vs. naïve single-thread performance (GFlop/s) on Intel Clovertown and AMD Opteron]
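For context, the structure of such a baseline, sketched with standard PETSc calls (a minimal illustration, not the benchmark harness used here): each MPI task holds its piece of the matrix in PETSc's AIJ (CSR) format, and the library performs the halo exchange inside MatMult.

```c
#include <petscmat.h>

/* Minimal distributed y = A*x with PETSc; run with mpiexec -n <tasks>. */
int main(int argc, char **argv)
{
    Mat A; Vec x, y;
    PetscInt n = 1000;                       /* global matrix dimension (example) */

    PetscInitialize(&argc, &argv, NULL, NULL);

    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    MatSetType(A, MATAIJ);                   /* distributed CSR storage */
    MatSetUp(A);
    /* ... MatSetValues(...) would fill the locally owned rows here ... */
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    MatCreateVecs(A, &x, &y);
    VecSet(x, 1.0);
    MatMult(A, x, y);                        /* the SpMV being benchmarked */

    VecDestroy(&x); VecDestroy(&y); MatDestroy(&A);
    PetscFinalize();
    return 0;
}
```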

Summary

Median Performance & Efficiency

[Figure, left: median SpMV performance (GFlop/s) for naïve, PETSc+OSKI (autotuned), and Pthreads (autotuned) on Intel Clovertown, AMD Opteron, Sun Niagara2, and IBM Cell]

[Figure, right: median power efficiency (MFlop/s/Watt) for the same configurations]

Used a digital power meter to measure sustained system power; FBDIMM drives up Clovertown and Niagara2 power

Right chart: sustained MFlop/s divided by sustained Watts

The default approach (MPI) achieves very low performance and efficiency

Summary

Paradoxically, the most complex/advanced architectures required the most tuning and delivered the lowest performance

Most machines achieved less than 50-60% of DRAM bandwidth

Niagara2 delivered both very good performance and productivity

Cell delivered very good performance and efficiency: 90% of memory bandwidth, high power efficiency, and easily understood performance; extra traffic = lower performance (future work can address this)

The multicore-specific autotuned implementation significantly outperformed a state-of-the-art MPI implementation, thanks to matrix compression geared towards multicore, NUMA, and prefetching

Acknowledgments

UC Berkeley: RADLab cluster (Opterons), PSI cluster (Clovertowns)

Sun Microsystems: Niagara2

Forschungszentrum Jülich: Cell blade cluster

Questions?