3D-Stacked Logic-in-Memory Hardware For Sparse Matrix Operations
Franz Franchetti, Carnegie Mellon University. In collaboration with Qiuling Zhu, Fazle Sadi, Qi Guo, Guangling Xu, Ekin Sumbul, James C. Hoe, and Larry Pileggi.
The work was sponsored by the Defense Advanced Research Projects Agency (DARPA) under agreement No. HR0011-13-2-0007. The content, views and conclusions presented in this document do not necessarily reflect the position or the policy of DARPA or the U.S. Government. No official endorsement should be inferred. This work was also supported in part by the Intelligence Advanced Research Projects Activity and Space and Naval Warfare Systems Center Pacific under Contract No. N66001-12-C-2008, Program Manager Dennis Polla.
How to Accelerate Graph Algorithms?
Insight: important graph algorithms = funny* tensor contraction**
- Edge betweenness centrality
- Bayesian belief propagation
- Minimal spanning trees
- Single-source and all-pairs shortest path
But: sparse matrix operations are inefficient on current machines
- High meta-data ratio: 1 index per data value in CSR*** format
- Memory bound: little or no reuse (e.g., SpMV****)
- Tiling impractical: tiling in CSR format is hard
- Inefficient: bandwidth bound, runs at about 1% of compute peak
*funny = over a semi-ring with add/mul redefined, e.g., (*, +) = (+, min); a small sketch follows below
**tensor contraction captures scalar product, matrix * vector, matrix * matrix and multiple matrix * matrix, including nD transposes
***CSR = compressed sparse row
****SpMV = sparse matrix-vector multiplication
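As an illustration of both points (not taken from the talk): a CSR sparse matrix-vector product over the (min, +) semi-ring performs one relaxation sweep of a shortest-path computation. Note the one-index-per-value meta-data and the irregular, essentially reuse-free accesses to x that make plain SpMV bandwidth bound. Types and names below are a minimal sketch.

```c
#include <stddef.h>

/* CSR storage: row_ptr has n+1 entries; col_idx and val have nnz entries each,
 * i.e., one index per stored value (the 1:1 meta-data ratio mentioned above). */
typedef struct {
    size_t        n;        /* number of rows        */
    const size_t *row_ptr;  /* n+1 row start offsets */
    const size_t *col_idx;  /* nnz column indices    */
    const double *val;      /* nnz edge weights      */
} csr_t;

/* y[i] = min over j of (A[i][j] + x[j]): SpMV with (*, +) replaced by (+, min).
 * With x holding current distance estimates, this is one relaxation sweep of a
 * shortest-path iteration. x[col_idx[k]] is accessed irregularly, with little reuse. */
void spmv_min_plus(const csr_t *A, const double *x, double *y, double inf) {
    for (size_t i = 0; i < A->n; i++) {
        double acc = inf;                              /* semi-ring "zero" is +infinity */
        for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++) {
            double t = A->val[k] + x[A->col_idx[k]];   /* "multiply" is +  */
            if (t < acc) acc = t;                      /* "add" is min     */
        }
        y[i] = acc;
    }
}
```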
Enabling Technology: Logic on 3D DRAM Stack
3D-stacked DRAM and logic/DRAM integration
- Logic and DRAM layers connected by TSVs
- Better timing, lower area, advanced IO in the logic layer
Logic directly on the DRAM stack
- Bandwidth and latency concerns still exist for off-chip traffic
- Sitting behind the DRAM interface opens up internal resources
Examples: DDR4 DIMMs, Micron HMC, Intel Knights Landing, Nvidia Volta
Concept: Memory-Side Accelerators
Idea: accelerator on the DRAM side
- No off-DIMM data traffic
- Huge problem sizes possible
- 3D stacking is the enabling technology
Configurable array of accelerators
- Domain specific, highly configurable
- Covers DoD-relevant kernels
- Configurable on-accelerator routing
System CPU
- Multicore/manycore CPU
- CPU-side accelerators
- Explicit memory management
- SIMD and multicore parallelism
Simulation, Emulation, and Software Stack
Accelerator simulation
- Timing: Synopsys DesignWare
- Power: McPAT, DRAMSim2, CACTI, CACTI-3DD
Full system evaluation
- Run code on a real system (Haswell, Xeon Phi) or in a simulator (SimpleScalar, ...)
- Normal DRAM access for the CPU, but trap the accelerator command memory space and invoke the simulator
API and software stack (a hypothetical call sequence is sketched below)
- Accelerator: memory-mapped device with command and data address spaces
- User API: C configuration library, standard APIs where applicable
- Virtual memory: fixed non-standard logical-to-physical mapping
- Memory management: special malloc/free, Linux kernel support
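To make this model concrete, here is a hypothetical call sequence under the stack described above (memory-mapped command space plus a special allocator). All names, register offsets, and opcodes are invented for illustration; they are not the project's actual API.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative user-level interface: allocation in accelerator-visible DRAM,
 * stores into the memory-mapped command address space, completion wait. */
void    *acc_malloc(size_t bytes);                 /* special malloc for device DRAM   */
void     acc_free(void *p);
void     acc_write_cmd(uint64_t reg, uint64_t v);  /* store into command address space */
int      acc_wait(void);                           /* poll/interrupt until completion  */

enum { CMD_OP = 0, CMD_A = 1, CMD_B = 2, CMD_C = 3, CMD_GO = 4, OP_SPGEMM = 7 };

/* C = A * B on the memory-side accelerator; operands live in device DRAM. */
int acc_spgemm(const void *A, const void *B, void *C) {
    acc_write_cmd(CMD_OP, OP_SPGEMM);
    acc_write_cmd(CMD_A, (uint64_t)(uintptr_t)A);  /* physical placement is handled by */
    acc_write_cmd(CMD_B, (uint64_t)(uintptr_t)B);  /* the fixed logical-to-physical    */
    acc_write_cmd(CMD_C, (uint64_t)(uintptr_t)C);  /* mapping and kernel support       */
    acc_write_cmd(CMD_GO, 1);
    return acc_wait();
}
```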
Sparse Matrix Multiplication (SpGEMM)
Figure: C = A * B (Buluc, Gilbert)
State-of-the-art algorithm:
- Algorithm minimizes floating-point operations
- Matrix B: streaming memory access
- But Matrix A: every access is a cache and row-buffer miss
- Expensive multiway merge required for Matrix C
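For reference, a minimal column-by-column SpGEMM in CSC form (a Gustavson-style sketch, not the Buluc-Gilbert hypersparse formulation itself) makes the access pattern visible: B is streamed once, every touched column of A is an irregular access, and each output column is assembled by merging contributions in an accumulator.

```c
#include <stddef.h>

typedef struct {
    size_t n, nnz;
    const size_t *col_ptr;   /* n+1 column start offsets */
    const size_t *row_idx;   /* nnz row indices          */
    const double *val;       /* nnz values               */
} csc_t;

/* Emit column j of C = A*B into caller-provided buffers; returns its nnz.
 * acc/flag are dense scratch arrays of length A->n (flag must start all-zero);
 * they play the role of the merge/accumulator state. */
size_t spgemm_column(const csc_t *A, const csc_t *B, size_t j,
                     double *acc, unsigned char *flag,
                     size_t *c_row, double *c_val) {
    size_t out = 0;
    for (size_t t = B->col_ptr[j]; t < B->col_ptr[j + 1]; t++) {     /* stream B          */
        size_t k   = B->row_idx[t];
        double bkj = B->val[t];
        for (size_t s = A->col_ptr[k]; s < A->col_ptr[k + 1]; s++) { /* gather A(:,k)     */
            size_t i = A->row_idx[s];                 /* irregular access into A          */
            if (!flag[i]) { flag[i] = 1; acc[i] = 0.0; c_row[out++] = i; }
            acc[i] += A->val[s] * bkj;                /* merge/accumulate into column of C */
        }
    }
    for (size_t t = 0; t < out; t++) {                /* scatter accumulator out, reset    */
        c_val[t] = acc[c_row[t]];
        flag[c_row[t]] = 0;
    }
    return out;
}
```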
SpGEMM Revisited: 3DIC Changes Trade-Offs
On chip: standard approach (C = A * B)
Off-chip: SUMMA with hypersparse blocks
3DIC: stacked DRAM + ASIC (DRAM layers on top of an accelerator layer)
Trade-off: full streaming memory access, but higher memory bandwidth required
Why Does This Work?
Power-graph incidence matrices are very sparse: on average 3 non-zeroes per row
Tiles are hypersparse: about 1,000 entries in a 50,000 x 50,000 tile
Total memory size limits the matrix dimension: only about 100 tiles per row/column
3D stacking provides enormous bandwidth, but the DRAM array (and row-buffer miss latency) is unchanged
Here the O(n^3/b^3) algorithm beats the O(n*nnz) algorithm: the key is the huge block size and the limited maximum matrix dimension
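A back-of-envelope reading, using only the numbers above (an illustration, not the talk's detailed cost model): the blocked algorithm trades a larger data volume for accesses that are orders of magnitude cheaper per element than a row-buffer miss.

```latex
% n/b \approx 100 tiles per dimension, \sim 10^3 entries per hypersparse tile, \mathrm{nnz} \approx 3n
\text{blocked: } (n/b)^3 \approx 100^3 = 10^6 \text{ tile products, each streaming } \sim 10^3 \text{ entries sequentially}
\qquad \text{vs.} \qquad
\text{classical: } \mathrm{nnz} \approx 3n \approx 1.5 \cdot 10^7 \text{ irregular accesses, each a cache/row-buffer miss}
```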
3D Sparse Matrix Accelerator
Off-chip: SRUMMA (tiled shared-memory algorithm)
On-chip: regular SpGEMM kernel (loop structure sketched after the figure below)
Figure: 3D stack with stacked DRAM dies, an EDRAM scratchpad layer (buffer), and LiM dies (SRAM logic-in-memory) forming the accelerator layer; the scratchpad supports hierarchical tiling.
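The loop structure this implies is sketched below, assuming an in-scratchpad kernel and streaming tile loads; all type and function names are placeholders rather than the project's interfaces.

```c
/* Sketch of the off-chip SRUMMA-style tiling: O((n/b)^3) tile products, each
 * streaming two hypersparse tiles into the scratchpad and running the regular
 * on-chip SpGEMM kernel. tile_t, load_tile, spgemm_tile, store_tile are
 * illustrative placeholders. */
#define T 100  /* ~100 tiles per row/column, limited by total DRAM capacity */

typedef struct tile tile_t;  /* hypersparse b x b tile (~1,000 entries) */

void load_tile(tile_t *dst, int ti, int tj, char which);       /* stream from DRAM */
void store_tile(const tile_t *src, int ti, int tj);            /* stream back      */
void spgemm_tile(tile_t *c, const tile_t *a, const tile_t *b); /* on-chip kernel   */
void clear_tile(tile_t *c);

void spgemm_tiled(void) {
    static tile_t a, b, c;                 /* resident in scratchpad/EDRAM        */
    for (int i = 0; i < T; i++)
        for (int j = 0; j < T; j++) {
            clear_tile(&c);
            for (int k = 0; k < T; k++) {  /* every DRAM access below is sequential */
                load_tile(&a, i, k, 'A');
                load_tile(&b, k, j, 'B');
                spgemm_tile(&c, &a, &b);   /* CAM-based merge, see later slides     */
            }
            store_tile(&c, i, j);
        }
}
```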
Logic Layer Functionality Examples
Smart SRAM designs (figure):
- Data sorting memory
- Pointer chasing memory (non-zero row index array, column index array)
- Orthogonal access memory
- Sparse matrix content-addressable memory (CAM), bit-0 access, one-big-array reordering
SpGEMM Kernel
Figure: CAM-based SpGEMM kernel datapath. A sequencer drives a CAM holding (index, value) entries; an encoder/decoder selects the entry matching the search index; a pipelined multiply-add consumes row_A, val_A, val_B and produces row_C, val_C. Op1 (mismatch): insert a new entry. Op2 (match): read, modify, and write back.
SpGEMM-CAM Functional Architecture
CAM structure facilitates efficient one-cycle non-zero merge
Content-addressable memory (CAM): match the algorithm to the hardware structure
Sparse matrix component -> CAM storage structure:
- Numerical values -> CAM data array
- Row/column index -> CAM key array
- Index matching -> CAM comparisons
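In software terms, the per-contribution behavior of the CAM corresponds to the accumulate-or-insert step below (a functional model only; in the hardware, the key search, the insert, and the multiply-add form a pipelined, effectively single-cycle structure rather than a loop).

```c
#include <stddef.h>

#define CAM_CAPACITY 1024  /* illustrative size; capacity/overflow handling omitted */

/* Functional model of the CAM: keys are output row indices, values are the
 * running partial sums of the current output column/tile of C. */
typedef struct {
    size_t n;                      /* number of live entries            */
    int    key[CAM_CAPACITY];      /* CAM key array: row/column indices */
    double val[CAM_CAPACITY];      /* CAM data array: numerical values  */
} cam_t;

/* Merge one contribution val_a * val_b destined for output row row_c.
 * The linear search below is a parallel one-cycle lookup in the real CAM. */
void cam_merge(cam_t *cam, int row_c, double val_a, double val_b) {
    double prod = val_a * val_b;
    for (size_t i = 0; i < cam->n; i++) {
        if (cam->key[i] == row_c) {        /* Op2 (match): read, modify, write back */
            cam->val[i] += prod;
            return;
        }
    }
    cam->key[cam->n] = row_c;              /* Op1 (mismatch): insert a new entry */
    cam->val[cam->n] = prod;
    cam->n++;
}
```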
Map SpGEMM Algorithm to 3D LiM Architecture
Map matrix tiles to DRAM rows: sequential tile streaming guarantees maximum sustained bandwidth and good data locality (cheap regular access)
Figure: conceptual 3D stack structure; each matrix tile of A and B maps to a DRAM row, with TSVs carrying the tile streams for C = A * B.
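One way this mapping could look in software terms (purely illustrative; the actual DRAM row size, tile layout, and addressing are the generator's choice and are not shown in the talk):

```c
#include <stdint.h>

#define TILES_PER_DIM 100u  /* roughly 100 tiles per row/column (from the earlier slide) */

/* Linearize tile coordinates into consecutive DRAM row indices so that the
 * inner k-loop of C(ti,tj) += A(ti,k) * B(k,tj) walks DRAM rows sequentially:
 * tiles of one tile-row of A and one tile-column of B are stored contiguously. */
static inline uint32_t dram_row_of_A(uint32_t ti, uint32_t k)  { return ti * TILES_PER_DIM + k; }
static inline uint32_t dram_row_of_B(uint32_t k,  uint32_t tj) { return tj * TILES_PER_DIM + k; }
```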
SpGEMM Multicore
Duplicated SpGEMM LiM cores for parallel matrix processing
Match TSV bandwidth to the LiM processing throughput (number of kernels)
A completely balanced system, optimized from application knowledge (see the back-of-envelope below)
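Balancing in this sense can be read as a simple condition (illustrative; the 668 GB/s and 30-50 core figures are taken from the results slide later in the deck):

```latex
N_{\mathrm{cores}} \approx \frac{B_{\mathrm{TSV}}}{B_{\mathrm{core}}}
\qquad\Rightarrow\qquad
B_{\mathrm{core}} \approx \frac{668~\mathrm{GB/s}}{30\text{-}50~\mathrm{cores}} \approx 13\text{-}22~\mathrm{GB/s~per~core}.
```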
Figure: LiM die layer with a central controller and replicated SpGEMM LiM cores (core-1 ... core-n), each with a CAM meta-data array, connected through TSVs to the stacked DRAM dies.
Recursive SpGEMM on Hierarchical 3D LiM
Figure: recursive, hierarchical tiling (levels L-1/L-2/L-3, n1 x n2 tiles) across the LiM dies, the scratchpad/buffer layer, and the stacked DRAM dies, connected by TSVs.
Chip Generator GUI Interface
Chart (b): bandwidth [GB/s] with 256 IO per bank, swept over the number of stacks (N_stack = 2...32) and banks (N_bank = 8...32).
Re-architects the DRAM dies
3D-system design space exploration: 3D memory bandwidth, access timing, energy efficiency (bandwidth/power), thermal and temperature behavior
Silicon interposer: future system scaling (multiple stacks on one silicon interposer)
Design automation framework & chip generator
Test Chip: Single SpGEMM Core
Transistor count: > 1M
Core area: 1.003 x 1.292 mm^2
Max frequency: 475 MHz
Average power at max frequency: 71.7 mW
SpGEMM: Simulation Results
Intel baseline: dual-CPU Intel Xeon E5-2430 system, about 140 W, 6 DRAM channels, 64 GB/s max, 6 DIMMs (48 chips), 210 GFLOPS peak
- Intel MKL 11.0 mkl_dcsrmultdcsr (unsorted): 100 MFLOPS - 1 GFLOPS, 1 - 10 MFLOPS/W
Our 3DIC system: 4 DRAM layers @ 2 Gbit, 1 logic layer, 32 nm, 30 - 50 cores, 1 GHz, 30 - 50 GFLOPS peak
- Performance: 10 - 50 GFLOPS @ 668 GB/s with 1024 TSVs
- Power efficiency: 1 GFLOPS/W @ 350 GB/s with 512 TSVs or 8 GB/s with 1024 TSVs
Matrices from the University of Florida sparse matrix collection
Beyond SpGEMM: SpMV Accelerator
3DIC memory-side accelerator; target: large sparse graphs
Algorithm: SpMV kernel (a loop sketch follows below)
- Block-column multiply + merge step
- Partial results are streamed to DRAM
- Matrix streams from DRAM
- Vector segment is held in scratchpad/EDRAM
Standard sparse matrix times dense vector
- Matrix is very sparse (a few non-zeroes per row)
- Matrix and vector size is large (>10M x 10M/GB)
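A loop-level sketch of this scheme, with illustrative types and helper names (the block format, segment size, and merge strategy below are assumptions, not the accelerator's actual design):

```c
#include <stddef.h>

#define SEG_LEN (1u << 16)  /* x segment assumed to fit in scratchpad/EDRAM */

/* One block column of the matrix in coordinate form; col[] is local to the block. */
typedef struct { size_t nnz; const size_t *row; const size_t *col; const double *val; } block_t;

void load_x_segment(double *xseg, size_t b);               /* x segment -> scratchpad */
void stream_block(block_t *blk, size_t b);                 /* matrix block <- DRAM    */
void emit_partial(size_t b, size_t row, double v);         /* (row, value) -> DRAM    */
void merge_partials(double *y, size_t n, size_t nblocks);  /* final merge/add pass    */

/* y = A*x, processed one block column at a time. */
void spmv_block_column(double *y, size_t n, size_t nblocks) {
    static double xseg[SEG_LEN];                  /* held on chip for the whole block */
    for (size_t b = 0; b < nblocks; b++) {
        load_x_segment(xseg, b);
        block_t blk;
        stream_block(&blk, b);                    /* matrix streams from DRAM         */
        for (size_t t = 0; t < blk.nnz; t++)      /* sequential pass over the block   */
            emit_partial(b, blk.row[t], blk.val[t] * xseg[blk.col[t]]);
    }
    merge_partials(y, n, nblocks);                /* partial results merged into y    */
}
```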
Ongoing: System-Level Design
- Full system: CPU and DRAM
- Scalability: multiple DRAM stacks on a silicon interposer
Summary
Chart: power efficiency [MFLOPS/W] for various benchmark matrices (sizes 10K-100K, 100K-1M, and 1M-10M), Intel MKL (12 threads) vs. 3D-LiM (32K blocks); the 3D-LiM design shows an improvement of about 1000x.
SpGEMM is a core algorithm for graphs: graphs and sparse matrices are dual
3DIC requires rethinking algorithms; example: trade latency for bandwidth
Enormous efficiency gains are possible, in both power efficiency and performance