On-Chip Memory Technology Design Space Explorations for Mobile Deep Neural
Network Accelerators
Haitong Li1, Mudit Bhargava2, Paul N. Whatmough2, and H.-S. Philip Wong1
1Stanford University 2Arm Research
2015.04.15 2
Machine Learning at Edge
Industry efforts: Arm ML Processor, Google Edge TPU, Apple Neural Engine, various startups, etc.
Efficiency & privacy
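Machine Learning at Edge
[Figure: memory cell schematics (filament-based metal-oxide cell with top and bottom electrodes, oxygen ions, and oxygen vacancies; magnetic cell with soft magnet, pinned magnet, and oxide tunnel barrier; SRAM cell with WL, VDD, and BL lines; capacitor-based cell with WL and BL) next to a DNN accelerator block diagram (PE array with Input Fmap SRAM, Weight SRAM, and Output Fmap SRAM; main memory technologies on-chip or off-chip) and an example neural network (input layer ∈ ℝ¹⁶, hidden layers ∈ ℝ¹² and ℝ¹⁰, output layer ∈ ℝ¹)]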
Co-evaluate previously disjoint spaces
▪ Mobile SoCs: heavily area-constrained, memory resources limited
▪ Data movement: a critical energy-efficiency bottleneck
▪ “Augment” DNN accelerators with on-chip NVMs → understand energy-area tradeoffs
H. Li, M. Bhargava, P. N. Whatmough, H.-S. P. Wong, DAC, 2019
Memory Technology Landscape
▪ DRAM & eDRAM: destructive read; high refresh power; scalability
▪ Emerging NVM (e.g., MRAM, RRAM): non-volatile; reads more efficient than writes; 3D integration
▪ SRAM: bleeding edge, most expensive
Modeling and Evaluation Methodology
[Figure: accelerator block diagram with PE array, Input Fmap SRAM, Weight SRAM, and Output Fmap SRAM; main memory technologies on-chip or off-chip]
Inputs: neural network topologies, µArch definitions, SW/HW techniques, technology characteristics
Outputs: Pareto-optimal designs, technology benchmarking, tradeoff analysis
Evaluation framework built upon open-source SCALE-Sim from Arm Research: https://github.com/ARM-software/SCALE-Sim
Design Space Explorations (DSE)
▪ Networks: ResNet-50, GoogLeNet, MobileNet, FasterRCNN, YOLO-tiny
▪ PE array: 16×16, 24×24, 32×32
▪ Input Fmap SRAM: {32, 64, 128, 256, 1024} KB
▪ Weight SRAM: {32, 64, 128, 256, 1024} KB
▪ Output Fmap SRAM: {32, 64, 128, 256, 1024} KB
▪ Memory (on-chip or off-chip): LPDDR3 DRAM, MRAM, 3D VRRAM, eDRAM/SRAM
▪ Dataflows: output stationary, weight stationary
▪ Precision = 8 bit; Clk = 1 GHz; DRAM latency hidden (pipelining)
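The size of this design space can be illustrated with a short enumeration. This is a minimal sketch, not the authors' framework: it assumes the three SRAM buffers are sized independently and that every combination of buffer sizes, PE array, memory technology, and dataflow is a valid design point. The dictionary keys are hypothetical names chosen for illustration.

```python
from itertools import product

# Parameter values as listed on the slide.
SRAM_SIZES_KB = [32, 64, 128, 256, 1024]
PE_ARRAYS = [(16, 16), (24, 24), (32, 32)]
MAIN_MEMORIES = ["LPDDR3 DRAM", "MRAM", "3D VRRAM", "eDRAM/SRAM"]
DATAFLOWS = ["output stationary", "weight stationary"]


def enumerate_design_space():
    """Yield one dict per configuration (assumed full cross product)."""
    for ifmap, weight, ofmap, pe, mem, flow in product(
        SRAM_SIZES_KB, SRAM_SIZES_KB, SRAM_SIZES_KB,
        PE_ARRAYS, MAIN_MEMORIES, DATAFLOWS,
    ):
        yield {
            "ifmap_sram_kb": ifmap,    # hypothetical key names
            "weight_sram_kb": weight,
            "ofmap_sram_kb": ofmap,
            "pe_array": pe,
            "main_memory": mem,
            "dataflow": flow,
        }


designs = list(enumerate_design_space())
print(len(designs))  # 5 * 5 * 5 * 3 * 4 * 2 configurations per network
```

Under these assumptions each network is evaluated over 3000 configurations, which is why the results are summarized as Pareto frontiers rather than individual points.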
(ResNet-50: 24.35 MB weights + 0.77 MB activations)
Details can be found in: H. Li, M. Bhargava, P. N. Whatmough, H.-S. P. Wong, DAC, 2019
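To put the ResNet-50 footprint in context, the sketch below compares it with the largest single SRAM buffer in the design space. It assumes 1 MB = 1024 KB and treats the quoted activation figure as a total that could be held at once; neither assumption is stated on the slide, and the function name is hypothetical.

```python
# Figures quoted on the slide (8-bit precision).
RESNET50_WEIGHTS_MB = 24.35
RESNET50_ACTIVATIONS_MB = 0.77
LARGEST_BUFFER_KB = 1024  # largest per-buffer SRAM size in the design space


def fraction_resident(footprint_mb: float, buffer_kb: float) -> float:
    """Fraction of a footprint that fits in a buffer at once (capped at 1)."""
    footprint_kb = footprint_mb * 1024  # assumption: 1 MB = 1024 KB
    return min(1.0, buffer_kb / footprint_kb)


weights_frac = fraction_resident(RESNET50_WEIGHTS_MB, LARGEST_BUFFER_KB)
acts_frac = fraction_resident(RESNET50_ACTIVATIONS_MB, LARGEST_BUFFER_KB)
print(f"{weights_frac:.1%} of weights, {acts_frac:.0%} of activations")
```

Only a small fraction of the weights can be resident in SRAM at any time, so most weight traffic goes to main memory, which is where the choice of memory technology matters.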
Accelerator Energy-Area Pareto Frontiers
Colormap represents the total SRAM capacity of each design w/ off-chip DRAM
ResNet-50, output stationary, DRAM baselines
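An energy-area Pareto frontier keeps only the designs for which no other design is at least as good in both area and energy. The sketch below shows one common way to extract it; the sample points are hypothetical values in arbitrary units, not results from the talk.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (area, energy), both to be minimized


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points not dominated in both area and energy."""
    frontier: List[Point] = []
    best_energy = float("inf")
    # Sort by area, then energy; keep a point only if it lowers the best energy.
    for area, energy in sorted(points):
        if energy < best_energy:
            frontier.append((area, energy))
            best_energy = energy
    return frontier


# Hypothetical (area, energy) values, for illustration only.
designs = [(1.0, 9.0), (1.5, 6.0), (1.6, 7.0), (2.0, 4.0), (2.5, 4.5), (3.0, 3.0)]
print(pareto_frontier(designs))
```

Here (1.6, 7.0) and (2.5, 4.5) are dropped because a smaller design already achieves lower energy.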
Design-Time vs. Run-Time SRAM Allocation
▪ Compare representative Pareto-optimal designs (same SRAM capacity)
– Layer-wise vs. network-wise optimization & allocation
▪ Granularity = 32 KB
– Minimum block size to allocate for IFMap, OFMap, and filter weights
IFMap: 800 KB; Filters: 32 KB; OFMap: 224 KB
Filters: 544 KB; IFMap: 32 KB; OFMap: 480 KB
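The two allocations above both sum to 1056 KB, i.e. 33 blocks of 32 KB. The sketch below enumerates every way to split a fixed SRAM capacity among IFMap, filter, and OFMap buffers at that granularity. It assumes each buffer needs at least one block, and it only lists candidates; choosing one per layer (run-time) or one for the whole network (design-time) would require an energy model that is not shown here.

```python
GRANULARITY_KB = 32  # minimum block size from the slide


def allocations(total_kb: int, granularity_kb: int = GRANULARITY_KB):
    """Yield (ifmap_kb, filter_kb, ofmap_kb) splits of total_kb.

    Each buffer receives a whole number of blocks and at least one block.
    """
    if total_kb % granularity_kb:
        raise ValueError("total capacity must be a multiple of the granularity")
    blocks = total_kb // granularity_kb
    for ifmap in range(1, blocks - 1):
        for filt in range(1, blocks - ifmap):
            ofmap = blocks - ifmap - filt
            yield (ifmap * granularity_kb,
                   filt * granularity_kb,
                   ofmap * granularity_kb)


total_kb = 800 + 32 + 224  # 1056 KB, the total of both allocations on the slide
candidates = list(allocations(total_kb))
print(len(candidates))
print((800, 32, 224) in candidates, (32, 544, 480) in candidates)
```

Both allocations from the slide appear among the candidates, which is consistent with them being different splits of the same total capacity.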
Recap
▪ PE array: 16×16, 24×24, 32×32
▪ Input Fmap SRAM, Weight SRAM, Output Fmap SRAM: {32, 64, 128, 256, 1024} KB each
▪ Memory (on-chip or off-chip): LPDDR3 DRAM, MRAM, 3D VRRAM, eDRAM/SRAM
▪ Dataflows: output stationary, weight stationary
▪ Precision = 8 bit; Clk = 1 GHz; DRAM latency hidden (pipelining)
Energy-Area Tradeoffs with MRAM
[Figure: energy-area tradeoff plot comparing designs w/ off-chip DRAM and w/ MRAM, each w/o compression, with 2.5X weight compression, and with 4X weight compression; annotations: 3.2X, +57%]
ResNet-50 | output stationary | PE size = 24 × 24
3.2X energy benefits | 57% area increase
Energy gap: MRAM-SRAM gap is smaller than DRAM-SRAM gap → less SRAM needed in an MRAM-based design
Can we partition accelerator area more efficiently among PE array, SRAM, and NVM?
Towards Higher Density: 3D Vertical RRAM
▪ Array structure similar to 3D V-NAND
▪ More layers → bit area ↓ & bit cost ↓
▪ Today: energy/bit higher than MRAM
F. K. Hsueh et al., IEDM, 2017
H. Li et al., VLSI, 2016
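The "more layers → bit area ↓" relation can be illustrated with a first-order model: if every stacked layer stores one bit within the same cell footprint, the effective area per bit falls inversely with the number of layers. This is a simplified sketch that ignores peripheral circuitry, contacts, and select devices; the function name and the 4F² footprint are assumptions for illustration, not figures from the cited papers.

```python
def effective_bit_area_f2(cell_footprint_f2: float, layers: int) -> float:
    """Effective area per bit, in units of F^2, for a vertically stacked array.

    First-order model: one bit per layer within the same cell footprint.
    """
    if layers < 1:
        raise ValueError("layers must be >= 1")
    return cell_footprint_f2 / layers


# Hypothetical 4F^2 cell footprint, chosen only for illustration.
for n in (1, 8, 16, 32):
    print(n, effective_bit_area_f2(4.0, n))
```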
Architecture Implications
▪ IFMap/OFMap/Weights → VRRAM
– Endurance resilience required [1], [2]
▪ Weights → VRRAM; IFMap/OFMap → DRAM
– NVM for weight storage only
▪ Ultra dense due to 3D, then what?
[1] M. M. A. Sabry et al., Proc. IEEE, 2019
[2] T. Wu et al., ISSCC, 2019
[Figure: accelerator with SRAM buffers, MACs, and 3D RRAM]
Energy-Area Tradeoffs with 3D VRRAM
1.83X energy benefits | 33% area savings
3D RRAM-based design vs. DRAM baseline: accelerator area savings due to 4X less SRAM required
High density is an enabler for more aggressive energy-area optimization
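The MRAM and 3D VRRAM results quoted on the slides can be placed on a common scale relative to the DRAM baseline. The sketch below converts each "N X energy benefit" and percentage area change into relative energy and area, and also multiplies them. The energy-area product is an assumed figure of merit introduced here for illustration, not a metric used in the talk, and the two results may not share the same accelerator configuration.

```python
def relative_energy_area(energy_gain: float, area_change: float):
    """Return (relative energy, relative area, product) versus the baseline.

    energy_gain: baseline energy / new energy (e.g. 1.83 for "1.83X").
    area_change: fractional area change (e.g. -0.33 for "33% savings").
    """
    rel_energy = 1.0 / energy_gain
    rel_area = 1.0 + area_change
    return rel_energy, rel_area, rel_energy * rel_area


# Numbers quoted on the slides, each versus its DRAM baseline.
options = {"MRAM": (3.2, 0.57), "3D VRRAM": (1.83, -0.33)}
for name, (gain, delta) in options.items():
    e, a, product = relative_energy_area(gain, delta)
    print(f"{name}: energy {e:.2f}x, area {a:.2f}x, product {product:.2f}x")
```

MRAM gives the larger energy reduction but costs area, while 3D VRRAM improves both, which is the sense in which high density enables more aggressive joint optimization.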
Energy-Area-Efficiency Landscape
[Figure annotations: 733% area increase; 57% area increase; 33% area savings; 172% area increase]
Baseline design: SRAM buffers + DRAM
Conclusions
▪ Extensive design space explorations for DNN accelerators with NVM
▪ Energy-area tradeoffs obtained w.r.t. Pareto-optimal baselines
– MRAM: 4.68X energy benefits & 57% area increase
– 3D VRRAM: 2.22X energy benefits & 33% area savings
▪ Today’s “technology node gap” between SRAM and NVM
– Low-power SRAM and high-density NVM join forces
Acknowledgement
Brian Cline, Greg Yeric, Matthew Mattina, Naveen Suda, YK Chong (Arm)
Priyanka Raina, Subhasish Mitra (Stanford University)
E2CDA - ENIGMA
Non-Volatile Memory Technology Research Initiative (NMTRI)
End of Talk