On-Chip Memory Technology Design Space Explorations for Mobile Deep Neural Network Accelerators Haitong Li 1 , Mudit Bhargava 2 , Paul N. Whatmough 2 and H.-S. Philip Wong 1 1 Stanford University 2 Arm Research



Machine Learning at Edge

Industry efforts: Arm ML Processor, Google Edge TPU, Apple Neural Engine, various startups, etc.

Efficiency & privacy


Machine Learning at Edge

[Figure: memory cell schematics (RRAM: top electrode, metal oxide, bottom electrode, filament, oxygen ions, oxygen vacancies; MRAM: soft magnet, tunnel barrier (oxide), pinned magnet, current; SRAM cell: WL, VDD, BL; DRAM cell: WL, BL, capacitor, n+, p-Si), the accelerator block diagram (PE array; input fmap, weight, and output fmap SRAM; on-chip/off-chip boundary; main memory technologies), and an example neural network (input layer ∈ ℝ¹⁶, hidden layers ∈ ℝ¹² and ℝ¹⁰, output layer ∈ ℝ¹)]

Co-evaluate previously disjoint spaces

▪ Mobile SoCs: heavily area-constrained, memory resources limited

▪ Data movement: a critical energy-efficiency bottleneck

▪ “Augment” DNN accelerators with on-chip NVMs → understand energy-area tradeoffs

H. Li, M. Bhargava, P. N. Whatmough, H.-S. P. Wong, DAC, 2019


Memory Technology Landscape


▪ DRAM & eDRAM: destructive read; high refresh power; scalability

▪ Emerging NVM (e.g., MRAM, RRAM): non-volatile; reads more efficient than writes; 3D integration

▪ SRAM: bleeding edge, most expensive


Modeling and Evaluation Methodology

[Figure: evaluation flow. Inputs: neural network topologies, µArch definitions, SW/HW techniques, technology characteristics. Model: accelerator with PE array; input fmap, weight, and output fmap SRAM; on-chip/off-chip boundary; main memory technologies. Outputs: Pareto-optimal designs, technology benchmarking, tradeoff analysis.]

Evaluation framework built upon open-source SCALE-Sim from Arm Research: https://github.com/ARM-software/SCALE-Sim


Design Space Explorations (DSE)

▪ Networks: ResNet-50, GoogLeNet, MobileNet, FasterRCNN, YOLO-tiny

▪ PE array: 16×16, 24×24, 32×32

▪ Input Fmap SRAM: {32, 64, 128, 256, 1024} KB

▪ Weight SRAM: {32, 64, 128, 256, 1024} KB

▪ Output Fmap SRAM: {32, 64, 128, 256, 1024} KB

▪ Memory: LPDDR3 DRAM (off-chip), MRAM, 3D VRRAM, eDRAM/SRAM

▪ Dataflows: output stationary, weight stationary

▪ Precision = 8 bit; Clk = 1 GHz; DRAM latency hidden (pipelining)
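
To get a feel for the size of this design space, the parameter values listed on the slide can be enumerated directly. The sketch below is illustrative only: it assumes every axis is swept independently (the slide does not say whether all combinations were simulated), and the dictionary keys are made-up names.

```python
from itertools import product

# Parameter values as listed on the slide.
PE_ARRAYS = [(16, 16), (24, 24), (32, 32)]
SRAM_KB = [32, 64, 128, 256, 1024]  # options for each of the three buffers
MEMORIES = ["LPDDR3 DRAM", "MRAM", "3D VRRAM", "eDRAM/SRAM"]
DATAFLOWS = ["output stationary", "weight stationary"]


def design_points():
    """Yield one dict per combination, assuming independent axes."""
    for pe, ifmap, weight, ofmap, mem, flow in product(
        PE_ARRAYS, SRAM_KB, SRAM_KB, SRAM_KB, MEMORIES, DATAFLOWS
    ):
        yield {
            "pe": pe,
            "ifmap_kb": ifmap,
            "weight_kb": weight,
            "ofmap_kb": ofmap,
            "memory": mem,
            "dataflow": flow,
        }


points = list(design_points())
print(len(points))  # 3 * 5**3 * 4 * 2 = 3000 combinations per network
```

Under that independence assumption, each of the five networks would have 3000 candidate configurations.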


(ResNet-50: 24.35 MB weights + 0.77 MB activations)

Details can be found in: H. Li, M. Bhargava, P. N. Whatmough, H.-S. P. Wong, DAC, 2019
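
A quick arithmetic check shows why main memory matters here: the ResNet-50 footprint quoted above is far larger than the largest SRAM configuration in the sweep. The sketch assumes 1 MB = 1024 KB and that the three buffers are each set to their 1024 KB maximum; the variable names are illustrative.

```python
# Figures quoted on the slide (8-bit precision).
WEIGHTS_MB = 24.35
ACTIVATIONS_MB = 0.77

# Largest swept configuration: three buffers at 1024 KB each (assumed 1 MB = 1024 KB).
NUM_BUFFERS = 3
MAX_BUFFER_KB = 1024
max_total_sram_mb = NUM_BUFFERS * MAX_BUFFER_KB / 1024

total_mb = WEIGHTS_MB + ACTIVATIONS_MB
weights_fit = WEIGHTS_MB <= max_total_sram_mb

print(f"max on-chip SRAM: {max_total_sram_mb:.2f} MB")
print(f"ResNet-50 weights + activations: {total_mb:.2f} MB")
print(f"weights fit entirely in SRAM: {weights_fit}")
```

Even at the top of the swept range, the SRAM buffers hold only a fraction of the weights, so most weight traffic has to go to main memory.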


Accelerator Energy-Area Pareto Frontiers

colormap represents total SRAM capacity in a specific design w/ off-chip DRAM

ResNet-50, output stationary, DRAM baselines
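
For readers unfamiliar with how such a frontier is extracted, here is a minimal sketch of a Pareto filter over (area, energy) pairs, where lower is better on both axes. The data points are invented for illustration and are not results from the talk; `pareto_front` is a hypothetical helper, not part of SCALE-Sim.

```python
def pareto_front(points):
    """Return the (area, energy) points that no other point dominates.

    A point is dominated if another point is no worse on both axes
    and strictly better on at least one. Lower is better for both.
    """
    front = []
    best_energy = float("inf")
    # Sorting by area (then energy) means any earlier point has area <= current.
    for area, energy in sorted(points):
        if energy < best_energy:
            front.append((area, energy))
            best_energy = energy
    return front


# Made-up design points: (area, energy) in arbitrary units.
designs = [(2.0, 6.5), (1.0, 9.0), (3.0, 4.0), (1.5, 6.0), (2.5, 4.0)]
print(pareto_front(designs))
```

With these made-up inputs the filter keeps three points; the other two cost more area without lowering energy.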


Design-Time vs. Run-Time SRAM Allocation

▪ Compare representative Pareto-optimal designs (same SRAM capacity)

– Layer-wise vs. network-wise optimization & allocation

▪ Granularity = 32 KB

– Minimum block size to allocate for IFMap, OFMap, and filter weights
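
The difference between the two strategies can be sketched with a toy cost table. This assumes design-time allocation means one split fixed for the whole network and run-time allocation means choosing a split per layer; the energy numbers are invented and carry no units.

```python
# Hypothetical per-layer energy for each candidate split of 128 KB.
# Keys are (IFMap KB, filter KB, OFMap KB), all multiples of 32 KB;
# each value lists the energy of layers 0, 1, 2 under that split.
energy = {
    (64, 32, 32): [5.0, 9.0, 7.0],
    (32, 64, 32): [8.0, 4.0, 6.0],
    (32, 32, 64): [7.0, 6.0, 3.0],
}
num_layers = 3

# Network-wise: pick the single split with the lowest total energy.
network_alloc = min(energy, key=lambda a: sum(energy[a]))
network_wise = sum(energy[network_alloc])

# Layer-wise: pick the best split independently for every layer.
layer_wise = sum(min(energy[a][i] for a in energy) for i in range(num_layers))

print(network_alloc, network_wise)  # best fixed split and its total
print(layer_wise)                   # total with per-layer splits
```

Because the per-layer choice can always fall back to the fixed split, the layer-wise total can never exceed the network-wise total in this model.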


IFMap: 800 KB; Filters: 32 KB; OFMap: 224 KB

Filters: 544 KB; IFMap: 32 KB; OFMap: 480 KB
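
The two allocations above can be checked against the stated constraints (same total capacity, 32 KB granularity). The sketch also counts how many three-way splits exist at that capacity, assuming each buffer gets at least one 32 KB block; that minimum is taken from the granularity bullet on the earlier slide.

```python
from math import comb

GRANULARITY_KB = 32

# The two example allocations from the slide.
examples = [
    {"IFMap": 800, "Filters": 32, "OFMap": 224},
    {"Filters": 544, "IFMap": 32, "OFMap": 480},
]

totals = [sum(e.values()) for e in examples]
aligned = all(kb % GRANULARITY_KB == 0 for e in examples for kb in e.values())

# Stars and bars: ways to split `blocks` blocks over 3 buffers, each >= 1 block.
blocks = totals[0] // GRANULARITY_KB
num_splits = comb(blocks - 1, 2)

print(totals, aligned, blocks, num_splits)
```

Both examples use 1056 KB in total (33 blocks), and a per-layer search over that capacity would have 496 candidate splits under the stated assumption.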




Energy-Area Tradeoffs with MRAM

[Figure: energy-area plot comparing designs w/ off-chip DRAM and w/ MRAM; series: w/o compression, 2.5X weight compression, 4X weight compression; annotations: 3.2X, +57%]

ResNet-50 | output stationary | PE size = 24 × 24


3.2X energy gains | 57% area increase

MRAM-SRAM energy gap smaller than DRAM-SRAM → less SRAM needed for MRAM design

Can we partition accelerator area more efficiently among PE array, SRAM, and NVM?
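
As a rough guide to what the compression settings in the plot mean for on-chip capacity, the sketch below scales the ResNet-50 weight footprint quoted earlier in the talk (24.35 MB at 8 bit). It assumes the compression ratio applies uniformly to all weights, which the slides do not state.

```python
WEIGHTS_MB = 24.35  # ResNet-50 weights at 8-bit precision, from the DSE slide

# Compression settings shown in the plot legend (1.0 = w/o compression).
ratios = [1.0, 2.5, 4.0]

# Assumed model: stored size = original size / compression ratio.
compressed_mb = {r: WEIGHTS_MB / r for r in ratios}

for ratio, size in compressed_mb.items():
    print(f"{ratio}X compression: {size:.2f} MB of weight storage")
```

Under this assumption the NVM capacity needed for weights drops from 24.35 MB to roughly 9.7 MB and 6.1 MB.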

Page 17: On-Chip Memory Technology Design Space Explorations for ... · 2015.04.15 7 Design Space Explorations (DSE) PE I n p u t F m a p S R A M Weight SRAM Output Fmap SRAM On-chip Off-chip

2015.04.15 172015.04.15

Towards Higher Density: 3D Vertical RRAM

▪ Array structure similar to 3D V-NAND

▪ More layers → bit area ↓ & bit cost ↓

▪ Today: energy/bit higher than MRAM

F. K. Hsueh et al., IEDM, 2017

H. Li et al., VLSI, 2016
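
The density argument can be illustrated with an idealized model in which stacked layers share one cell footprint, so the effective area per bit is the footprint divided by the layer count. This is a simplification that ignores pillar, selector, and periphery overheads, and the 4 F² footprint used below is a placeholder value, not a figure from the talk.

```python
def effective_bit_area(footprint_f2, layers):
    """Idealized area per bit (in F^2) when `layers` bits share one footprint."""
    if layers < 1:
        raise ValueError("layers must be at least 1")
    return footprint_f2 / layers


# Placeholder footprint of 4 F^2 per vertical cell (assumption for illustration).
for n in (1, 4, 8, 16):
    print(n, effective_bit_area(4.0, n))
```

In this model the bit area falls in inverse proportion to the number of layers, which is the trend the first two bullets describe.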


Architecture Implications

▪ IFMap/OFMap/Weights → VRRAM

– Endurance resilience required [1], [2]

▪ Weights → VRRAM; IFMap/OFMap → DRAM

– NVM for weight storage only

▪ Ultra dense due to 3D, then what?

[1] M. M. A. Sabry, et al., Proc. IEEE, 2019

[2] T. Wu, et al., ISSCC, 2019

[Figure: diagram labeled SRAM buffers, MACs, 3D RRAM]


Energy-Area Tradeoffs with 3D VRRAM

[Figure: energy-area plot; annotations: 1.83X energy, -33% area]

1.83X energy benefits | 33% area savings


3D RRAM-based design vs. DRAM baseline: accelerator area savings due to 4X less SRAM required

High density is an enabler for more aggressive energy-area optimization


Energy-Area-Efficiency Landscape

[Figure: energy-area-efficiency landscape; design points annotated with 733% area increase, 57% area increase, 33% area savings, and 172% area increase]

Baseline design: SRAM buffers + DRAM
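
One way to compare the points in such a landscape is a combined energy-area product relative to the baseline. That metric is an assumption of this sketch, not one used in the talk. The inputs are the figures quoted on the two tradeoff slides (3.2X energy with +57% area for MRAM, 1.83X energy with -33% area for 3D VRRAM), reading "N X energy benefits" as energy divided by N.

```python
# (relative energy, relative area) with the SRAM + DRAM baseline at (1.0, 1.0).
designs = {
    "DRAM baseline": (1.0, 1.0),
    "MRAM": (1 / 3.2, 1.57),       # 3.2X lower energy, 57% more area
    "3D VRRAM": (1 / 1.83, 0.67),  # 1.83X lower energy, 33% less area
}

# Energy-area product: lower is better (illustrative metric, an assumption).
eap = {name: energy * area for name, (energy, area) in designs.items()}

for name, value in eap.items():
    print(f"{name}: {value:.3f}")
```

On this illustrative metric both NVM options come out below the baseline; MRAM saves more energy per inference, while 3D VRRAM is the only option that also shrinks the accelerator.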


Conclusions

▪ Extensive design space explorations for DNN accelerators with NVM

▪ Energy-area tradeoffs obtained w.r.t. Pareto-optimal baselines

– MRAM: 4.68X energy benefits & 57% area increase

– 3D VRRAM: 2.22X energy benefits & 33% area savings

▪ Today’s “technology node gap” between SRAM and NVM

– Low-power SRAM and high-density NVM join forces


Acknowledgement

Brian Cline, Greg Yeric, Matthew Mattina, Naveen Suda, YK Chong (Arm)

Priyanka Raina, Subhasish Mitra (Stanford University)

E2CDA - ENIGMA

Non-Volatile Memory Technology Research Initiative (NMTRI)
