
On-Chip Memory Technology Design Space Explorations for Mobile Deep Neural Network Accelerators

Haitong Li¹, Mudit Bhargava², Paul N. Whatmough², and H.-S. Philip Wong¹

¹Stanford University, ²Arm Research

2015.04.15 2

Machine Learning at Edge

Industry efforts: Arm ML Processor, Google Edge TPU, Apple Neural Engine, various startups, etc.

Efficiency & privacy

[Figure: device sketches of the candidate memory cells and the accelerator. RRAM labels: top electrode, bottom electrode, metal oxide, oxygen ion, oxygen vacancy, filament. MRAM labels: soft magnet, pinned magnet, tunnel barrier (oxide), current. SRAM cell labels: WL, VDD, BL. DRAM cell labels: WL, BL, capacitor, n+, p-Si. Accelerator: PE array; input fmap, weight, and output fmap SRAMs; on-chip/off-chip main memory technologies. Neural network: input layer ∈ ℝ¹⁶, hidden layers ∈ ℝ¹² and ℝ¹⁰, output layer ∈ ℝ¹.]

Co-evaluate previously disjoint spaces

▪ Mobile SoCs: heavily area-constrained, memory resources limited

▪ Data movement: a critical energy-efficiency bottleneck

▪ “Augment” DNN accelerators with on-chip NVMs → understand energy-area tradeoffs

H. Li, M. Bhargava, P. N. Whatmough, H.-S. P. Wong, DAC, 2019


Memory Technology Landscape

[Figure: memory technology comparison; only the labels "N/A" and "(off-chip)" survive]

▪ DRAM & eDRAM: destructive read; high refresh power; scalability

▪ Emerging NVM (e.g., MRAM, RRAM): non-volatile; reads more efficient than writes; 3D integration

▪ SRAM: bleeding edge, most expensive


Modeling and Evaluation Methodology

[Figure: evaluation flow around the accelerator model (PE array; input fmap, weight, and output fmap SRAMs; on-chip/off-chip main memory technologies). Inputs: neural network topologies, µArch definitions, SW/HW techniques, technology characteristics. Outputs: Pareto-optimal designs, technology benchmarking, tradeoff analysis.]

Evaluation framework built upon open-source SCALE-Sim from Arm Research: https://github.com/ARM-software/SCALE-Sim


Design Space Explorations (DSE)

▪ Workloads: ResNet-50, GoogLeNet, MobileNet, FasterRCNN, YOLO-tiny

▪ Input Fmap SRAM: {32, 64, 128, 256, 1024} KB

▪ Weight SRAM: {32, 64, 128, 256, 1024} KB

▪ Output Fmap SRAM: {32, 64, 128, 256, 1024} KB

▪ PE array: 16×16, 24×24, 32×32

▪ Memory: LPDDR3 DRAM, MRAM, 3D VRRAM, eDRAM/SRAM

▪ Dataflows: output stationary, weight stationary

▪ Precision = 8 bit; Clk = 1 GHz; DRAM latency hidden (pipelining)


(ResNet-50: 24.35 MB weights + 0.77 MB activations)

Details can be found in: H. Li, M. Bhargava, P. N. Whatmough, H.-S. P. Wong, DAC, 2019
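The size of this design space can be sketched in a few lines of Python. This is only an illustration, assuming the three SRAM capacities, the PE array size, and the dataflow are swept independently (a full cross product); the function and field names are hypothetical and are not taken from SCALE-Sim or the paper.

```python
from itertools import product

# Values copied from the design-space slide.
SRAM_SIZES_KB = [32, 64, 128, 256, 1024]
PE_ARRAYS = [(16, 16), (24, 24), (32, 32)]
DATAFLOWS = ["output_stationary", "weight_stationary"]
RESNET50_WEIGHTS_MB = 24.35  # from the slide: ResNet-50 weight footprint


def enumerate_design_points():
    """Yield one dict per combination (assumes an independent sweep of every knob)."""
    for ifmap_kb, weight_kb, ofmap_kb, pe, dataflow in product(
        SRAM_SIZES_KB, SRAM_SIZES_KB, SRAM_SIZES_KB, PE_ARRAYS, DATAFLOWS
    ):
        yield {
            "ifmap_kb": ifmap_kb,
            "weight_kb": weight_kb,
            "ofmap_kb": ofmap_kb,
            "pe_array": pe,
            "dataflow": dataflow,
            "total_sram_kb": ifmap_kb + weight_kb + ofmap_kb,
        }


points = list(enumerate_design_points())
max_sram_kb = max(p["total_sram_kb"] for p in points)

print(len(points))   # 750 points per (workload, memory technology) pair
print(max_sram_kb)   # 3072 KB, i.e. 3 MB
```

Under this assumption, even the largest SRAM configuration (3 MB) is far smaller than the 24.35 MB of ResNet-50 weights, so weights must stream from the main memory technology in every design point.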


Accelerator Energy-Area Pareto Frontiers

colormap represents total SRAM capacity in a specific design w/ off-chip DRAM

ResNet-50, output stationary, DRAM baselines
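A Pareto frontier over (energy, area) can be extracted with a simple dominance check. The sketch below is a generic illustration, not the authors' tooling; the design names and numbers are made up purely to exercise the function, and lower is assumed to be better on both axes.

```python
def pareto_front(points):
    """Return the points not dominated in (energy, area); lower is better for both."""
    front = []
    for p in points:
        dominated = any(
            q["energy"] <= p["energy"]
            and q["area"] <= p["area"]
            and (q["energy"] < p["energy"] or q["area"] < p["area"])
            for q in points
        )
        if not dominated:
            front.append(p)
    # Sort by area so the frontier reads left to right on an energy-area plot.
    return sorted(front, key=lambda d: d["area"])


# Hypothetical design points (arbitrary units), for illustration only.
designs = [
    {"name": "A", "energy": 10.0, "area": 1.0},
    {"name": "B", "energy": 6.0, "area": 2.0},
    {"name": "C", "energy": 7.0, "area": 2.5},  # dominated by B
    {"name": "D", "energy": 3.0, "area": 4.0},
]

front = pareto_front(designs)
print([d["name"] for d in front])  # ['A', 'B', 'D']
```

The quadratic scan is adequate for a few hundred design points per workload; a sort-based sweep would be preferable for much larger spaces.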


Design-Time vs. Run-Time SRAM Allocation

▪ Compare representative Pareto-optimal designs (same SRAM capacity)

– Layer-wise vs. network-wise optimization & allocation

▪ Granularity = 32 KB

– Minimum block size to allocate for IFMap, OFMap, and filter weights


IFMap: 800 KB; Filters: 32 KB; OFMap: 224 KB

Filters: 544 KB; IFMap: 32 KB; OFMap: 480 KB
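Both allocations above sum to 1056 KB, i.e. 33 blocks at the 32 KB granularity. As a rough sketch of what a layer-wise allocator has to search, the following enumerates every way to split a fixed SRAM budget among IFMap, filter, and OFMap buffers, assuming each buffer receives at least one block. The function name is hypothetical, and the cost model that would rank the candidates is deliberately left out.

```python
BLOCK_KB = 32  # allocation granularity from the slide


def allocations(total_kb, block_kb=BLOCK_KB):
    """Yield (ifmap_kb, filter_kb, ofmap_kb) splits of total_kb in block_kb steps.

    Assumes every buffer gets at least one block.
    """
    if total_kb % block_kb:
        raise ValueError("total capacity must be a multiple of the block size")
    n = total_kb // block_kb
    for i in range(1, n - 1):          # blocks given to IFMap
        for f in range(1, n - i):      # blocks given to filters
            o = n - i - f              # remaining blocks go to OFMap (>= 1)
            yield (i * block_kb, f * block_kb, o * block_kb)


candidates = list(allocations(1056))
print(len(candidates))  # 496 candidate splits for a 1056 KB budget
```

A run-time scheme would pick one of these candidates per layer, while a design-time scheme fixes a single split for the whole network.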


Energy-Area Tradeoffs with MRAM

[Figure: energy vs. area for designs with off-chip DRAM and with MRAM (without compression, 2.5X weight compression, 4X weight compression). Annotations: 3.2X, +57%.]

ResNet-50 | output stationary | PE size = 24 × 24
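To put the compression ratios in perspective, the weight capacity the on-chip MRAM would need to hold can be estimated from the ResNet-50 weight footprint quoted earlier (24.35 MB). This back-of-the-envelope sketch assumes the ratio applies uniformly to all stored weights and ignores any metadata overhead of the compression scheme; the helper name is hypothetical.

```python
RESNET50_WEIGHTS_MB = 24.35  # uncompressed weight footprint from the DSE slide


def compressed_weight_mb(weights_mb, ratio):
    """Weight footprint after compression by `ratio` (assumes no metadata overhead)."""
    if ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return weights_mb / ratio


for ratio in (1.0, 2.5, 4.0):
    mb = compressed_weight_mb(RESNET50_WEIGHTS_MB, ratio)
    print(f"{ratio}X compression: {mb:.2f} MB of weights")
```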


3.2X energy benefits | 57% area increase


Energy gap: MRAM-SRAM < DRAM-SRAM → less SRAM needed in MRAM-based design


Can we partition accelerator area more efficiently among PE array, SRAM, and NVM?


Towards Higher Density: 3D Vertical RRAM

▪ Array structure similar to 3D V-NAND

▪ More layers → bit area ↓ & bit cost ↓

▪ Today: energy/bit higher than MRAM

F. K. Hsueh et al., IEDM, 2017

H. Li et al., VLSI, 2016


Architecture Implications

▪ IFMap/OFMap/Weights → VRRAM

– Endurance resilience required [1], [2]

▪ Weights → VRRAM; IFMap/OFMap → DRAM

– NVM for weight storage only

▪ Ultra dense due to 3D, then what?

[1] M. M. A. Sabry, et al., Proc. IEEE, 2019

[2] T. Wu, et al., ISSCC, 2019

[Figure: accelerator with MACs, SRAM buffers, and 3D RRAM]


Energy-Area Tradeoffs with 3D VRRAM


1.83X energy benefits | 33% area savings


3D RRAM-based design vs. DRAM baseline: accelerator area savings due to 4X less SRAM required


High density is an enabler for more aggressive energy-area optimization
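The per-slide figures for the two NVM options can be put on a common scale relative to the SRAM + DRAM baseline. The sketch below assumes that an "N X energy benefit" means baseline energy divided by design energy, and that area changes are percentages of baseline area; it uses the slide values for ResNet-50 (3.2X, +57% for MRAM; 1.83X, -33% for 3D VRRAM). The function name is hypothetical.

```python
def relative_to_baseline(energy_benefit, area_change_pct):
    """Return (relative energy, relative area), with the baseline at (1.0, 1.0).

    Assumes energy_benefit = baseline energy / design energy.
    """
    return 1.0 / energy_benefit, 1.0 + area_change_pct / 100.0


mram = relative_to_baseline(3.2, +57)
vrram = relative_to_baseline(1.83, -33)

for name, (energy, area) in (("MRAM", mram), ("3D VRRAM", vrram)):
    print(f"{name}: {energy:.3f}x energy, {area:.2f}x area vs. baseline")
```

Under these assumptions neither option dominates the other: the MRAM design uses less energy, while the 3D VRRAM design is the only one that also shrinks area.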


Energy-Area-Efficiency Landscape

[Figure: energy-area-efficiency landscape. Annotations: 733% area increase; 57% area increase; 33% area savings; 172% area increase.]

Baseline design: SRAM buffers + DRAM


Conclusions

▪ Extensive design space explorations for DNN accelerators with NVM

▪ Energy-area tradeoffs obtained w.r.t. Pareto-optimal baselines

– MRAM: 4.68X energy benefits & 57% area increase

– 3D VRRAM: 2.22X energy benefits & 33% area savings

▪ Today’s “technology node gap” between SRAM and NVM

– Low-power SRAM and high-density NVM join forces


Acknowledgement

Brian Cline, Greg Yeric, Matthew Mattina, Naveen Suda, YK Chong (Arm)

Priyanka Raina, Subhasish Mitra (Stanford University)

E2CDA - ENIGMA

Non-Volatile Memory Technology Research Initiative (NMTRI)


End of Talk
