
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine with >0.1 Timing Error Rate Tolerance for IoT Applications

© 2017 IEEE International Solid-State Circuits Conference

Paul Whatmough, S. K. Lee, H. Lee, S. Rama, D. Brooks, G.-Y. Wei

Harvard University, Cambridge, MA






Motivation

•  Deep Neural Networks (DNNs)
   –  A universal classifier with many diverse applications
   –  Interpret noisy real-world data from sensor-rich systems
   –  Large memory footprint and compute load
   –  Perform DNN inference on device in micro-Joules






Motivation

•  DNN ENGINE – A Shared Programmable Classifier
   –  Specialized Architecture and Efficient Digital Circuits
   –  Parallelism: Balanced memory bandwidth and throughput
   –  Sparsity: HW support for sparse data and small datatypes
   –  Resilience: Razor adaptation to minimize PVT margins

[Figure: DNN ENGINE block diagram. Labels: Inputs (Sensor Data, +DNN Topology, +Weights), Input Vector, Hidden Layers, Outputs (Argmax), Output Classes. Legend: Weight, Neuron, Classification.]



Outline

•  Motivation

•  DNN Engine
   –  Parallelism
   –  Sparsity
   –  Resilience

•  Measurement Results

•  Summary



Fully-Connected DNN Graph

[Figure: fully-connected DNN graph, animated over several builds. Labels: Input Vector, Input Layer, Hidden Layers, Output Layer, Output Classes, Neuron, Weight, SIMD Window, Weights.]



Balancing Efficiency and Bandwidth

•  Parallelism increases throughput and data reuse
   –  But also increases Memory Bandwidth demands

•  8-Way SIMD is efficient at reasonable memory BW
   –  10x Activation reuse, with 128b AXI channel

[Figure: Weight Reads / Cycle (0 to 30) and Data Reads (Norm., 0 to 1) versus SIMD Width (1, 4, 8, 16, 32 words). Annotations: "No Data Reuse", "Bandwidth too high", "128b/cycle".]
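The trade-off above can be sketched with a toy model. It assumes one new weight word per SIMD lane per cycle, 16-bit weight words (an assumption, chosen because it makes 8 lanes fill a 128b channel), and one activation read shared by all lanes. The function names are illustrative and the model is not taken from the slides. Under these assumptions 8 lanes give 8x fewer activation reads, so the 10x reuse figure quoted above is the authors' result and is not reproduced by this simplification.

```python
WEIGHT_BITS = 16  # assumed weight word size (not stated on the slide)


def weight_bandwidth_bits(simd_width: int, weight_bits: int = WEIGHT_BITS) -> int:
    """Weight bits fetched per cycle when every lane consumes one new weight."""
    return simd_width * weight_bits


def activation_reads_norm(simd_width: int) -> float:
    """Activation reads relative to a 1-wide datapath (one read shared by all lanes)."""
    return 1.0 / simd_width


for width in (1, 4, 8, 16, 32):
    print(f"{width:2d} lanes: {weight_bandwidth_bits(width):4d} b/cycle of weights, "
          f"{activation_reads_norm(width):.3f} normalized activation reads")
```

Wider datapaths keep cutting activation reads, but the weight bandwidth grows linearly with the lane count, which is the tension the plot illustrates.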



DNN ENGINE Micro-Architecture

[Figure: DNN ENGINE block diagram, animated over several builds. Labels: Host Processor, CFG, Data In, Data Out, DMA, IPBUF, XBUF, NBUF, W-MEM, Weight Load, Data Read, MAC Datapath, Activation, Writeback.]

•  Host Processor loads configuration and input data

•  Accumulate products of Activation and Weights

•  Add bias, apply ReLU activation and writeback

•  IRQ to host, which retrieves prediction data
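The steps above can be mirrored in software as a minimal sketch. It is a functional model only, not the hardware pipeline: the function names and the tiny 2-2-2 network are hypothetical, and skipping ReLU on the final layer before the argmax is an assumption.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def fc_layer(act, weights, bias, apply_relu):
    """One layer: multiply-accumulate per neuron, add bias, optional ReLU."""
    out = []
    for w_row, b in zip(weights, bias):
        acc = 0.0
        for a, w in zip(act, w_row):
            acc += a * w          # accumulate activation * weight products
        out.append(acc + b)       # add bias
    return relu(out) if apply_relu else out


def predict(x, layers):
    """layers is a list of (weights, bias) pairs; returns the argmax class index."""
    act = x
    for i, (w, b) in enumerate(layers):
        # ReLU on hidden layers only (assumption), then "write back" as next input
        act = fc_layer(act, w, b, apply_relu=(i < len(layers) - 1))
    return max(range(len(act)), key=act.__getitem__)


# Tiny hypothetical 2-2-2 network standing in for a real topology
layers = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.5]),
]
print(predict([2.0, 1.0], layers))
```

In the flow described above, the host would supply `x` and the layer parameters, and read back the returned class after the interrupt.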






Exploiting Sparse Data in DNNs

[Figure: fully-connected DNN graph (Input Vector to Output Classes), animated over several builds.]

[Figure: DNN ENGINE block diagram with the same labels as before (Host Processor, CFG, Data In, Data Out, DMA, IPBUF, XBUF, NBUF, W-MEM, Data Read, Activation, Writeback), with Index and SKIP labels added in the final build.]
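One common way to exploit sparse activations is sketched below. The slides only show the Index and SKIP labels, so the storage format and function names here are assumptions rather than the engine's actual mechanism: non-zero activations are stored with their original index, and the MAC loop visits only those entries.

```python
def compress(activations):
    """Keep only non-zero activations, each paired with its original index."""
    return [(i, a) for i, a in enumerate(activations) if a != 0]


def sparse_neuron(indexed_acts, w_row, bias=0.0):
    """MAC over the stored non-zero activations only; zeros cost no iterations."""
    acc = bias
    for i, a in indexed_acts:
        acc += a * w_row[i]   # the index selects the matching weight
    return acc


# Synthetic example: ReLU outputs are often zero, here 6 of 8 entries
acts = [0.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]
w_row = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

packed = compress(acts)
dense = sum(a * w for a, w in zip(acts, w_row))
sparse = sparse_neuron(packed, w_row)
print(len(acts), "dense MACs vs", len(packed), "sparse MACs")
```

Both paths compute the same sum; the sparse path simply performs fewer multiply-accumulates and fewer weight fetches.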






Razor Hardware Accelerators

[Figure: clock period (CLK) with stacked timing margins. Labels: voltage, process, temp, coupling, jitter, ageing, safety.]

•  Occasionally allow real critical paths to fail timing
   –  Explicitly check for late-arriving signals
   –  Adapt Vdd/Fclk to track near-zero error-rate

•  Must ensure timing errors do not affect system
   –  Roll-back and replay [Das et al., CICC'13]
   –  Algorithmic error tolerance [Whatmough et al., ISSCC'13]
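The idea of adapting Vdd to track a near-zero error rate can be sketched as a simple control loop. Everything numeric here is hypothetical: `error_rate` is a made-up stand-in for on-chip error counters, and the step size, target, and 770 mV default are placeholders, not the controller used on the chip.

```python
def error_rate(vdd_mv: int, poff_mv: int = 770) -> float:
    """Made-up error model: no errors at or above poff_mv, then 10x more
    timing errors for every 10 mV below it."""
    if vdd_mv >= poff_mv:
        return 0.0
    return min(1.0, 1e-7 * 10 ** ((poff_mv - vdd_mv) / 10))


def adapt_vdd(start_mv=900, step_mv=5, target=3e-6, iterations=100):
    """Step Vdd down while the observed error rate is under target, up otherwise."""
    vdd = start_mv
    trace = []
    for _ in range(iterations):
        if error_rate(vdd) > target:
            vdd += step_mv    # too many late-arriving signals: back off
        else:
            vdd -= step_mv    # still clean enough: remove more margin
        trace.append(vdd)
    return trace


trace = adapt_vdd()
print(trace[-4:])
```

The loop walks the supply down from the starting voltage and then dithers around the lowest setting that keeps the modeled error rate under the target.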




DNN Engine Micro-Architecture

[Figure: DNN ENGINE block diagram as before (including Index and SKIP), with Razor (RZ) stages marked.]

•  Two Razor stages: W Memory and MAC Datapath



Memory: Algorithm-Level Tolerance

•  Weights are zero-mean Gaussians (regularization)
   –  Small magnitude, random sign (worst-case in TC)

•  Sign-Magnitude (SM) reduces switching in weights
   –  Lowers switched cap (power) and bit error rate

Two's Complement (TC)   -1: 11111111   +1: 00000001
Sign-Magnitude (SM)     -1: 10000001   +1: 00000001

[Figure: histogram of weight values. Count (a.u.) versus Weight Value, from -0.3 to 0.3.]
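The switching argument can be checked with a short sketch. It encodes small-magnitude, random-sign integers in both formats and counts bit flips between consecutive words. The weight stream is synthetic stand-in data and the helper names are illustrative; only the two 8-bit encodings of -1 and +1 come from the slide.

```python
import random


def to_tc(value: int, bits: int = 8) -> int:
    """Two's-complement bit pattern of a signed integer."""
    return value & ((1 << bits) - 1)


def to_sm(value: int, bits: int = 8) -> int:
    """Sign-magnitude bit pattern: top bit is the sign, the rest is |value|."""
    sign = 1 << (bits - 1) if value < 0 else 0
    return sign | abs(value)


def toggles(words):
    """Total bit flips between consecutive words read in sequence."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


# Synthetic small-magnitude, random-sign weights (not measured data)
rng = random.Random(0)
weights = [rng.choice([-1, 1]) * rng.randint(1, 3) for _ in range(1000)]

tc_toggles = toggles([to_tc(w) for w in weights])
sm_toggles = toggles([to_sm(w) for w in weights])
print(f"TC toggles: {tc_toggles}, SM toggles: {sm_toggles}")
```

In two's complement every sign change flips all the upper bits of a small value, whereas sign-magnitude flips only the sign bit, so the SM stream toggles far fewer bits.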



Datapath: Circuit-Level Tolerance

•  Algorithmic error tolerance in datapath is poor
   –  Errors are persistent in accumulator

•  Circuit-level error tolerance using time borrowing
   –  Single-sided critical path at accumulator register

[Figure: MAC datapath with inputs X and W, output Y, and Razor (RZ) registers. Later builds highlight the Critical Path and the Slack.]






28nm SoC for IoT Applications

[Figure: SoC block diagram with a Cortex-M0 Subsystem and an Accelerator Subsystem. Blocks: ARM Cortex-M0, I-MEM 64KB, D-MEM 64KB, DNN Engine, W-MEM 1MB, SYSCTL, Bridge, BIST, SCAN, ASYNC, DCO, OSC, RTC, and Low-BW Peripherals (UARTs, GPIO, Timers, Wdog). Buses: 32b AHB, 32b APB, 128b AXI, with master (M) and slave (S) ports. Clocks: FCLK, HCLK. Supplies: VACC (Accelerator Logic), VMEMP (SRAM Periphery), VMEMC (SRAM Core), VSOC, VDCO.]

28nm SoC Test Chip

[Figure: test chip. Labels: USB, GPIO, Scan, DCO.]



Razor Voltage Scaling

•  FC-DNN 784-256-256-256-10 (98.36% MNIST)

•  Operation at PoFF (770mV) lowers power by 30%

•  Margin for fast variations down to 715mV

[Figure: VDD Scaling (FCLK = 667MHz). Timing Violation Rate and Power (mW) versus VDD for TC and SM. Marked points: Sign-off, PoFF, Margin, Zero Margin. Annotation: 30% Power Reduction.]
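As a rough cross-check of the power saving, dynamic power at fixed frequency scales approximately with the square of the supply voltage (P ~ C * V^2 * f). The sketch below applies that rule, assuming a 0.9 V sign-off supply and ignoring leakage; both are simplifications, and the function name is illustrative. It gives roughly 27% at 770 mV, in the same range as the 30% measured.

```python
def dynamic_power_ratio(v_new: float, v_ref: float) -> float:
    """Dynamic power at v_new relative to v_ref at fixed frequency (P ~ C * V^2 * f)."""
    return (v_new / v_ref) ** 2


V_SIGNOFF = 0.90  # assumed sign-off supply in volts

for v in (0.770, 0.715):
    saving = 1.0 - dynamic_power_ratio(v, V_SIGNOFF)
    print(f"{v * 1000:.0f} mV: about {saving * 100:.0f}% lower dynamic power")
```

The measured figure also includes effects this first-order rule leaves out, such as leakage, so the two numbers are not expected to match exactly.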



Timing Error Tolerance Summary

•  Aggregate timing violation rate greater than 10^-1

[Figure: Timing Violation Rate (log scale, 10^-7 to 10^-1) for MNIST @ 98.36% Accuracy. Groups: Memory, Datapath, Combined. Bar labels in order: TC, SM, BM, TC, SM, TB, TC, ALL. Annotation: > 1,000,000X.]



Energy/Throughput Summary

MNIST @ 98.36% Accuracy

[Figure: Normalized Throughput (Pred/Sec, 1.0 to 8.0) and Normalized Energy/Pred (0.2 to 1.0), built up over several slides. Configuration labels: TC, SM, 8b, SKIP, AVS, AFS. Operating points: 0.9V 667MHz, 0.77V 667MHz, 0.715V 667MHz, 0.9V 1GHz, 0.9V 1.2GHz. Markers: Sign-off, PoFF. Annotations: -30%, -4X, +4X, +50%, +20%. Energy labels: 260nJ, 570nJ.]



Summary and Conclusion

•  DNN Engine – A Shared Programmable Classifier
   –  Parallelism: 10X Data Reuse @ 128b/cycle BW
   –  Sparsity: +4X Throughput, -4X Energy
   –  Resilience: +50% Throughput or -30% Energy w/Razor

•  Energy Efficient DNN
   –  568 nJ/pred @ 1.2 GHz

•  Timing Error Tolerance
   –  Pe > 10^-1 @ 98.36% (MNIST)


