3D-Stacked Logic-in-Memory Hardware For Sparse Matrix Operations
Franz Franchetti, Carnegie Mellon University. In collaboration with Qiuling Zhu, Fazle Sadi, Qi Guo, Guangling Xu, Ekin Sumbul, James C. Hoe, and Larry Pileggi.
The work was sponsored by the Defense Advanced Research Projects Agency (DARPA) under agreement No. HR0011-13-2-0007. The content, views and conclusions presented in this document do not necessarily reflect the position or the policy of DARPA or the U.S. Government. No official endorsement should be inferred. This work was also supported in part by the Intelligence Advanced Research Projects Activity and Space and Naval Warfare Systems Center Pacific under Contract No. N66001-12-C-2008, Program Manager Dennis Polla.
How to Accelerate Graph Algorithms?
Insight: important graph algorithms = funny* tensor contraction**
- Edge betweenness centrality
- Bayesian belief propagation
- Minimal spanning trees
- Single-source and all-pairs shortest path
But: sparse matrix operations are inefficient on current machines
- High meta-data ratio: 1 index per data value in CSR*** format
- Memory bound: little or no reuse (e.g., SpMV****)
- Tiling impractical: tiling in CSR format is hard
- Inefficient: bandwidth bound, runs at about 1% of compute peak
*funny = over a semi-ring with add/mul redefined, e.g., (*, +) = (+, min); a small sketch follows below
**tensor contraction captures scalar product, matrix * vector, matrix * matrix and multiple matrix * matrix, including nD transposes
***CSR = compressed sparse row
****SpMV = sparse matrix-vector multiplication
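As an illustration of both points (not taken from the talk): a CSR sparse matrix-vector product over the (min, +) semi-ring performs one relaxation sweep of a shortest-path computation. Note the one-index-per-value meta-data and the irregular, essentially reuse-free accesses to x that make plain SpMV bandwidth bound. Types and names below are a minimal sketch.

```c
#include <stddef.h>

/* CSR storage: row_ptr has n+1 entries; col_idx and val have nnz entries each,
 * i.e., one index per stored value (the 1:1 meta-data ratio mentioned above). */
typedef struct {
    size_t        n;        /* number of rows        */
    const size_t *row_ptr;  /* n+1 row start offsets */
    const size_t *col_idx;  /* nnz column indices    */
    const double *val;      /* nnz edge weights      */
} csr_t;

/* y[i] = min over j of (A[i][j] + x[j]): SpMV with (*, +) replaced by (+, min).
 * With x holding current distance estimates, this is one relaxation sweep of a
 * shortest-path iteration. x[col_idx[k]] is accessed irregularly, with little reuse. */
void spmv_min_plus(const csr_t *A, const double *x, double *y, double inf) {
    for (size_t i = 0; i < A->n; i++) {
        double acc = inf;                              /* semi-ring "zero" is +infinity */
        for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++) {
            double t = A->val[k] + x[A->col_idx[k]];   /* "multiply" is +  */
            if (t < acc) acc = t;                      /* "add" is min     */
        }
        y[i] = acc;
    }
}
```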
Enabling Technology: Logic on 3D DRAM Stack
3D-stacked DRAM and logic/DRAM integration
- Logic and DRAM layers connected by TSVs
- Better timing, lower area, advanced IO in the logic layer
Logic directly on the DRAM stack
- Bandwidth and latency concerns still exist for off-chip traffic
- Sitting behind the DRAM interface opens up internal resources
Examples: DDR4 DIMMs, Micron HMC, Intel Knights Landing, Nvidia Volta
Concept: Memory-Side Accelerators
Idea: accelerator on the DRAM side
- No off-DIMM data traffic
- Huge problem sizes possible
- 3D stacking is the enabling technology
Configurable array of accelerators
- Domain specific, highly configurable
- Covers DoD-relevant kernels
- Configurable on-accelerator routing
System CPU
- Multicore/manycore CPU
- CPU-side accelerators
- Explicit memory management
- SIMD and multicore parallelism
Simulation, Emulation, and Software Stack
Accelerator simulation
- Timing: Synopsys DesignWare
- Power: McPAT, DRAMSim2, CACTI, CACTI-3DD
Full system evaluation
- Run code on a real system (Haswell, Xeon Phi) or in a simulator (SimpleScalar, ...)
- Normal DRAM access for the CPU, but trap the accelerator command memory space and invoke the simulator
API and software stack (a hypothetical call sequence is sketched below)
- Accelerator: memory-mapped device with command and data address spaces
- User API: C configuration library, standard APIs where applicable
- Virtual memory: fixed non-standard logical-to-physical mapping
- Memory management: special malloc/free, Linux kernel support
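To make this model concrete, here is a hypothetical call sequence under the stack described above (memory-mapped command space plus a special allocator). All names, register offsets, and opcodes are invented for illustration; they are not the project's actual API.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative user-level interface: allocation in accelerator-visible DRAM,
 * stores into the memory-mapped command address space, completion wait. */
void    *acc_malloc(size_t bytes);                 /* special malloc for device DRAM   */
void     acc_free(void *p);
void     acc_write_cmd(uint64_t reg, uint64_t v);  /* store into command address space */
int      acc_wait(void);                           /* poll/interrupt until completion  */

enum { CMD_OP = 0, CMD_A = 1, CMD_B = 2, CMD_C = 3, CMD_GO = 4, OP_SPGEMM = 7 };

/* C = A * B on the memory-side accelerator; operands live in device DRAM. */
int acc_spgemm(const void *A, const void *B, void *C) {
    acc_write_cmd(CMD_OP, OP_SPGEMM);
    acc_write_cmd(CMD_A, (uint64_t)(uintptr_t)A);  /* physical placement is handled by */
    acc_write_cmd(CMD_B, (uint64_t)(uintptr_t)B);  /* the fixed logical-to-physical    */
    acc_write_cmd(CMD_C, (uint64_t)(uintptr_t)C);  /* mapping and kernel support       */
    acc_write_cmd(CMD_GO, 1);
    return acc_wait();
}
```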
Sparse Matrix Multiplication (SpGEMM)
Figure: C = A * B (Buluc, Gilbert)
State-of-the-art algorithm:
- Algorithm minimizes floating-point operations
- Matrix B: streaming memory access
- But Matrix A: every access is a cache and row-buffer miss
- Expensive multiway merge required for Matrix C
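For reference, a minimal column-by-column SpGEMM in CSC form (a Gustavson-style sketch, not the Buluc-Gilbert hypersparse formulation itself) makes the access pattern visible: B is streamed once, every touched column of A is an irregular access, and each output column is assembled by merging contributions in an accumulator.

```c
#include <stddef.h>

typedef struct {
    size_t n, nnz;
    const size_t *col_ptr;   /* n+1 column start offsets */
    const size_t *row_idx;   /* nnz row indices          */
    const double *val;       /* nnz values               */
} csc_t;

/* Emit column j of C = A*B into caller-provided buffers; returns its nnz.
 * acc/flag are dense scratch arrays of length A->n (flag must start all-zero);
 * they play the role of the merge/accumulator state. */
size_t spgemm_column(const csc_t *A, const csc_t *B, size_t j,
                     double *acc, unsigned char *flag,
                     size_t *c_row, double *c_val) {
    size_t out = 0;
    for (size_t t = B->col_ptr[j]; t < B->col_ptr[j + 1]; t++) {     /* stream B          */
        size_t k   = B->row_idx[t];
        double bkj = B->val[t];
        for (size_t s = A->col_ptr[k]; s < A->col_ptr[k + 1]; s++) { /* gather A(:,k)     */
            size_t i = A->row_idx[s];                 /* irregular access into A          */
            if (!flag[i]) { flag[i] = 1; acc[i] = 0.0; c_row[out++] = i; }
            acc[i] += A->val[s] * bkj;                /* merge/accumulate into column of C */
        }
    }
    for (size_t t = 0; t < out; t++) {                /* scatter accumulator out, reset    */
        c_val[t] = acc[c_row[t]];
        flag[c_row[t]] = 0;
    }
    return out;
}
```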
SpGEMM Revisited: 3DIC Changes Trade-Offs
On chip: standard approach (C = A * B)
Off-chip: SUMMA with hypersparse blocks
3DIC: stacked DRAM + ASIC (DRAM layers on top of an accelerator layer)
Trade-off: full streaming memory access, but higher memory bandwidth required
Why Does This Work?
Power-graph incidence matrices are very sparse: on average 3 non-zeroes per row
Tiles are hypersparse: about 1,000 entries in a 50,000 x 50,000 tile
Total memory size limits the matrix dimension: only about 100 tiles per row/column
3D stacking provides enormous bandwidth, but the DRAM array (and row-buffer miss latency) is unchanged
Here the O(n^3/b^3) algorithm beats the O(n*nnz) algorithm: the key is the huge block size and the limited maximum matrix dimension
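A back-of-envelope reading, using only the numbers above (an illustration, not the talk's detailed cost model): the blocked algorithm trades a larger data volume for accesses that are orders of magnitude cheaper per element than a row-buffer miss.

```latex
% n/b \approx 100 tiles per dimension, \sim 10^3 entries per hypersparse tile, \mathrm{nnz} \approx 3n
\text{blocked: } (n/b)^3 \approx 100^3 = 10^6 \text{ tile products, each streaming } \sim 10^3 \text{ entries sequentially}
\qquad \text{vs.} \qquad
\text{classical: } \mathrm{nnz} \approx 3n \approx 1.5 \cdot 10^7 \text{ irregular accesses, each a cache/row-buffer miss}
```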
3D Sparse Matrix Accelerator
Off-chip: SRUMMA (tiled shared-memory algorithm)
On-chip: regular SpGEMM kernel (loop structure sketched after the figure below)
Figure: 3D stack with stacked DRAM dies, an EDRAM scratchpad layer (buffer), and LiM dies (SRAM logic-in-memory) forming the accelerator layer; the scratchpad supports hierarchical tiling.
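The loop structure this implies is sketched below, assuming an in-scratchpad kernel and streaming tile loads; all type and function names are placeholders rather than the project's interfaces.

```c
/* Sketch of the off-chip SRUMMA-style tiling: O((n/b)^3) tile products, each
 * streaming two hypersparse tiles into the scratchpad and running the regular
 * on-chip SpGEMM kernel. tile_t, load_tile, spgemm_tile, store_tile are
 * illustrative placeholders. */
#define T 100  /* ~100 tiles per row/column, limited by total DRAM capacity */

typedef struct tile tile_t;  /* hypersparse b x b tile (~1,000 entries) */

void load_tile(tile_t *dst, int ti, int tj, char which);       /* stream from DRAM */
void store_tile(const tile_t *src, int ti, int tj);            /* stream back      */
void spgemm_tile(tile_t *c, const tile_t *a, const tile_t *b); /* on-chip kernel   */
void clear_tile(tile_t *c);

void spgemm_tiled(void) {
    static tile_t a, b, c;                 /* resident in scratchpad/EDRAM        */
    for (int i = 0; i < T; i++)
        for (int j = 0; j < T; j++) {
            clear_tile(&c);
            for (int k = 0; k < T; k++) {  /* every DRAM access below is sequential */
                load_tile(&a, i, k, 'A');
                load_tile(&b, k, j, 'B');
                spgemm_tile(&c, &a, &b);   /* CAM-based merge, see later slides     */
            }
            store_tile(&c, i, j);
        }
}
```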
Logic Layer Functionality Examples
Smart SRAM designs (figure):
- Data sorting memory
- Pointer chasing memory (non-zero row index array, column index array)
- Orthogonal access memory
- Sparse matrix content-addressable memory (CAM), bit-0 access, one-big-array reordering
SpGEMM Kernel
Figure: CAM-based SpGEMM kernel datapath. A sequencer drives a CAM holding (index, value) entries; an encoder/decoder selects the entry matching the search index; a pipelined multiply-add consumes row_A, val_A, val_B and produces row_C, val_C. Op1 (mismatch): insert a new entry. Op2 (match): read, modify, and write back.
SpGEMM-CAM Functional Architecture
CAM structure facilitates efficient one-cycle non-zero merge
Content-addressable memory (CAM): match the algorithm to the hardware structure
Sparse matrix component -> CAM storage structure:
- Numerical values -> CAM data array
- Row/column index -> CAM key array
- Index matching -> CAM comparisons
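In software terms, the per-contribution behavior of the CAM corresponds to the accumulate-or-insert step below (a functional model only; in the hardware, the key search, the insert, and the multiply-add form a pipelined, effectively single-cycle structure rather than a loop).

```c
#include <stddef.h>

#define CAM_CAPACITY 1024  /* illustrative size; capacity/overflow handling omitted */

/* Functional model of the CAM: keys are output row indices, values are the
 * running partial sums of the current output column/tile of C. */
typedef struct {
    size_t n;                      /* number of live entries            */
    int    key[CAM_CAPACITY];      /* CAM key array: row/column indices */
    double val[CAM_CAPACITY];      /* CAM data array: numerical values  */
} cam_t;

/* Merge one contribution val_a * val_b destined for output row row_c.
 * The linear search below is a parallel one-cycle lookup in the real CAM. */
void cam_merge(cam_t *cam, int row_c, double val_a, double val_b) {
    double prod = val_a * val_b;
    for (size_t i = 0; i < cam->n; i++) {
        if (cam->key[i] == row_c) {        /* Op2 (match): read, modify, write back */
            cam->val[i] += prod;
            return;
        }
    }
    cam->key[cam->n] = row_c;              /* Op1 (mismatch): insert a new entry */
    cam->val[cam->n] = prod;
    cam->n++;
}
```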
Map SpGEMM Algorithm to 3D LiM Architecture
Map matrix tiles to DRAM rows: sequential tile streaming guarantees maximum sustained bandwidth and good data locality (cheap regular access)
Figure: conceptual 3D stack structure; each matrix tile of A and B maps to a DRAM row, with TSVs carrying the tile streams for C = A * B.
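One way this mapping could look in software terms (purely illustrative; the actual DRAM row size, tile layout, and addressing are the generator's choice and are not shown in the talk):

```c
#include <stdint.h>

#define TILES_PER_DIM 100u  /* roughly 100 tiles per row/column (from the earlier slide) */

/* Linearize tile coordinates into consecutive DRAM row indices so that the
 * inner k-loop of C(ti,tj) += A(ti,k) * B(k,tj) walks DRAM rows sequentially:
 * tiles of one tile-row of A and one tile-column of B are stored contiguously. */
static inline uint32_t dram_row_of_A(uint32_t ti, uint32_t k)  { return ti * TILES_PER_DIM + k; }
static inline uint32_t dram_row_of_B(uint32_t k,  uint32_t tj) { return tj * TILES_PER_DIM + k; }
```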
SpGEMM Multicore
Duplicated SpGEMM LiM cores for parallel matrix processing
Match TSV bandwidth to the LiM processing throughput (number of kernels)
A completely balanced system, optimized from application knowledge (see the back-of-envelope below)
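Balancing in this sense can be read as a simple condition (illustrative; the 668 GB/s and 30-50 core figures are taken from the results slide later in the deck):

```latex
N_{\mathrm{cores}} \approx \frac{B_{\mathrm{TSV}}}{B_{\mathrm{core}}}
\qquad\Rightarrow\qquad
B_{\mathrm{core}} \approx \frac{668~\mathrm{GB/s}}{30\text{-}50~\mathrm{cores}} \approx 13\text{-}22~\mathrm{GB/s~per~core}.
```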
Figure: LiM die layer with a central controller and replicated SpGEMM LiM cores (core-1 ... core-n), each with a CAM meta-data array, connected through TSVs to the stacked DRAM dies.
Recursive SpGEMM on Hierarchical 3D LiM
Figure: recursive, hierarchical tiling (levels L-1/L-2/L-3, n1 x n2 tiles) across the LiM dies, the scratchpad/buffer layer, and the stacked DRAM dies, connected by TSVs.
Chip Generator GUI Interface
Chart (b): bandwidth [GB/s] with 256 IO per bank, swept over the number of stacks (N_stack = 2...32) and banks (N_bank = 8...32).
Re-architects the DRAM dies
3D-system design space exploration: 3D memory bandwidth, access timing, energy efficiency (bandwidth/power), thermal and temperature behavior
Silicon interposer: future system scaling (multiple stacks on one silicon interposer)
Design automation framework & chip generator
Test Chip: Single SpGEMM Core
Transistor count: > 1M
Core area: 1.003 x 1.292 mm^2
Max frequency: 475 MHz
Average power at max frequency: 71.7 mW
SpGEMM: Simulation Results
Intel baseline: dual-CPU Intel Xeon E5-2430 system, about 140 W, 6 DRAM channels, 64 GB/s max, 6 DIMMs (48 chips), 210 GFLOPS peak
- Intel MKL 11.0 mkl_dcsrmultdcsr (unsorted): 100 MFLOPS - 1 GFLOPS, 1 - 10 MFLOPS/W
Our 3DIC system: 4 DRAM layers @ 2 Gbit, 1 logic layer, 32 nm, 30 - 50 cores, 1 GHz, 30 - 50 GFLOPS peak
- Performance: 10 - 50 GFLOPS @ 668 GB/s with 1024 TSVs
- Power efficiency: 1 GFLOPS/W @ 350 GB/s with 512 TSVs or 8 GB/s with 1024 TSVs
Matrices from the University of Florida sparse matrix collection
Beyond SpGEMM: SpMV Accelerator
3DIC memory-side accelerator; target: large sparse graphs
Algorithm: SpMV kernel (a loop sketch follows below)
- Block-column multiply + merge step
- Partial results are streamed to DRAM
- Matrix streams from DRAM
- Vector segment is held in scratchpad/EDRAM
Standard sparse matrix times dense vector
- Matrix is very sparse (a few non-zeroes per row)
- Matrix and vector size is large (>10M x 10M/GB)
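A loop-level sketch of this scheme, with illustrative types and helper names (the block format, segment size, and merge strategy below are assumptions, not the accelerator's actual design):

```c
#include <stddef.h>

#define SEG_LEN (1u << 16)  /* x segment assumed to fit in scratchpad/EDRAM */

/* One block column of the matrix in coordinate form; col[] is local to the block. */
typedef struct { size_t nnz; const size_t *row; const size_t *col; const double *val; } block_t;

void load_x_segment(double *xseg, size_t b);               /* x segment -> scratchpad */
void stream_block(block_t *blk, size_t b);                 /* matrix block <- DRAM    */
void emit_partial(size_t b, size_t row, double v);         /* (row, value) -> DRAM    */
void merge_partials(double *y, size_t n, size_t nblocks);  /* final merge/add pass    */

/* y = A*x, processed one block column at a time. */
void spmv_block_column(double *y, size_t n, size_t nblocks) {
    static double xseg[SEG_LEN];                  /* held on chip for the whole block */
    for (size_t b = 0; b < nblocks; b++) {
        load_x_segment(xseg, b);
        block_t blk;
        stream_block(&blk, b);                    /* matrix streams from DRAM         */
        for (size_t t = 0; t < blk.nnz; t++)      /* sequential pass over the block   */
            emit_partial(b, blk.row[t], blk.val[t] * xseg[blk.col[t]]);
    }
    merge_partials(y, n, nblocks);                /* partial results merged into y    */
}
```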
Ongoing: System-Level Design
- Full system: CPU and DRAM
- Scalability: multiple DRAM stacks on a silicon interposer
Summary
Chart: power efficiency [MFLOPS/W] for various benchmark matrices (sizes 10K-100K, 100K-1M, and 1M-10M), Intel MKL (12 threads) vs. 3D-LiM (32K blocks); the 3D-LiM design shows an improvement of about 1000x.
SpGEMM is a core algorithm for graphs: graphs and sparse matrices are dual
3DIC requires rethinking algorithms; example: trade latency for bandwidth
Enormous efficiency gains are possible, in both power efficiency and performance