Upload
letuong
View
213
Download
1
Embed Size (px)
Citation preview
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 1 of 86
A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network
Engine with >0.1 Timing Error Rate Tolerance for IoT Applications
Paul Whatmough, S. K. Lee,
H. Lee, S. Rama, D. Brooks, G.-Y. Wei Harvard University, Cambridge, MA
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 2 of 86
Motivation
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 3 of 86
Motivation
• Deep Neural Networks (DNNs) – A universal classifier with many diverse applica5ons – Interpret noisy real-‐world data from sensor-‐rich systems
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 4 of 86
Motivation
• Deep Neural Networks (DNNs) – A universal classifier with many diverse applica5ons – Interpret noisy real-‐world data from sensor-‐rich systems – Large memory footprint and compute load – Perform DNN inference on device in micro-‐Joules
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 5 of 86
Motivation
• DNN ENGINE – A Shared Programmable Classifier – Specialized Architecture and Efficient Digital Circuits
Inpu
t Vector
Outpu
t Classes
DNN ENGINE
Hidden Layers
Outputs
+DNN Topology
+Weights
Inputs
Sensor Data
WeightNeuron Classification
Argmax
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 6 of 86
Motivation
• DNN ENGINE – A Shared Programmable Classifier – Specialized Architecture and Efficient Digital Circuits – Parallelism: Balanced memory bandwidth and throughput
Inpu
t Vector
Outpu
t Classes
DNN ENGINE
Hidden Layers
Outputs
+DNN Topology
+Weights
Inputs
Sensor Data
WeightNeuron Classification
Argmax
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 7 of 86
Motivation
• DNN ENGINE – A Shared Programmable Classifier – Specialized Architecture and Efficient Digital Circuits – Parallelism: Balanced memory bandwidth and throughput – Sparsity: HW support for sparse data and small datatypes
Inpu
t Vector
Outpu
t Classes
DNN ENGINE
Hidden Layers
Outputs
+DNN Topology
+Weights
Inputs
Sensor Data
WeightNeuron Classification
Argmax
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 8 of 86
Motivation
• DNN ENGINE – A Shared Programmable Classifier – Specialized Architecture and Efficient Digital Circuits – Parallelism: Balanced memory bandwidth and throughput – Sparsity: HW support for sparse data and small datatypes – Resilience: Razor adapta5on to minimize PVT margins
Inpu
t Vector
Outpu
t Classes
DNN ENGINE
Hidden Layers
Outputs
+DNN Topology
+Weights
Inputs
Sensor Data
WeightNeuron Classification
Argmax
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 9 of 86
Outline
• Mo5va5on
• DNN Engine – Parallelism – Sparsity – Resilience
• Measurement Results
• Summary
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 10 of 86
Fully-Connected DNN Graph Inpu
t Vector
Outpu
t Classes
Hidden Layers
Input Layer Output Layer
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 11 of 86
Fully-Connected DNN Graph Inpu
t Vector
Outpu
t Classes
Neuron Weight
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 12 of 86
Fully-Connected DNN Graph Inpu
t Vector
Outpu
t Classes
SIMD Window
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 13 of 86
Fully-Connected DNN Graph Inpu
t Vector
Outpu
t Classes
Weights
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 14 of 86
Fully-Connected DNN Graph Inpu
t Vector
Outpu
t Classes
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 15 of 86
Fully-Connected DNN Graph Inpu
t Vector
Outpu
t Classes
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 16 of 86
Fully-Connected DNN Graph Inpu
t Vector
Outpu
t Classes
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 17 of 86
Fully-Connected DNN Graph Inpu
t Vector
Outpu
t Classes
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 18 of 86
Fully-Connected DNN Graph Inpu
t Vector
Outpu
t Classes
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 19 of 86
Fully-Connected DNN Graph Inpu
t Vector
Outpu
t Classes
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 20 of 86
Fully-Connected DNN Graph Inpu
t Vector
Outpu
t Classes
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 21 of 86
Fully-Connected DNN Graph Inpu
t Vector
Outpu
t Classes
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 22 of 86
0"
10"
20"
30"
0"0.2"0.4"0.6"0.8"1"
1" 4" 8" 16" 32"
Weight"R
eads"/"Cycle"
Data"Reads"(N
orm.)"
SIMD"Width"(Words)"
Balancing Efficiency and Bandwidth
• Parallelism increases throughput and data reuse
No Data Reuse
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 23 of 86
0"
10"
20"
30"
0"0.2"0.4"0.6"0.8"1"
1" 4" 8" 16" 32"
Weight"R
eads"/"Cycle"
Data"Reads"(N
orm.)"
SIMD"Width"(Words)"
Balancing Efficiency and Bandwidth
Bandwidth too high
• Parallelism increases throughput and data reuse – But also increases Memory Bandwidth demands
No Data Reuse
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 24 of 86
0"
10"
20"
30"
0"0.2"0.4"0.6"0.8"1"
1" 4" 8" 16" 32"
Weight"R
eads"/"Cycle"
Data"Reads"(N
orm.)"
SIMD"Width"(Words)"
Balancing Efficiency and Bandwidth
• Parallelism increases throughput and data reuse – But also increases Memory Bandwidth demands
• 8-‐Way SIMD is efficient at reasonable memory BW – 10x Ac5va5on reuse, with 128b AXI channel
Bandwidth too high
No Data Reuse
128b
/cycle
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 25 of 86
DNN ENGINE Micro-Architecture
MAC DatapathWeight Load
NBUF
IPBUF
XBUF
Data In
Data Out
CFG
ActivationDMA
Host Processor
Writeback
Data Read
W-‐MEM
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 26 of 86
MAC DatapathWeight Load
NBUF
IPBUF
XBUF
Data In
Data Out
CFG
ActivationDMA
Host Processor
Writeback
Data Read
W-‐MEM
DNN ENGINE Micro-Architecture
• Host Processor loads configura5on and input data
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 27 of 86
MAC DatapathWeight Load
NBUF
IPBUF
XBUF
Data In
Data Out
CFG
ActivationDMA
Host Processor
Writeback
Data Read
W-‐MEM
DNN ENGINE Micro-Architecture
• Accumulate products of Ac5va5on and Weights
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 28 of 86
MAC DatapathWeight Load
NBUF
IPBUF
XBUF
Data In
Data Out
CFG
ActivationDMA
Host Processor
Writeback
Data Read
W-‐MEM
DNN ENGINE Micro-Architecture
• Add bias, apply ReLU ac5va5on and writeback
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 29 of 86
MAC DatapathWeight Load
NBUF
IPBUF
XBUF
Data In
Data Out
CFG
ActivationDMA
Host Processor
Writeback
Data Read
W-‐MEM
DNN ENGINE Micro-Architecture
• IRQ to host, which retrieves predic5on data
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 30 of 86
Outline
• Mo5va5on
• DNN Engine – Parallelism – Sparsity – Resilience
• Measurement Results
• Summary
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 31 of 86
Exploiting Sparse Data in DNNs Inpu
t Vector
Outpu
t Classes
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 32 of 86
Exploiting Sparse Data in DNNs Inpu
t Vector
Outpu
t Classes
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 33 of 86
Exploiting Sparse Data in DNNs Inpu
t Vector
Outpu
t Classes
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 34 of 86
Exploiting Sparse Data in DNNs Inpu
t Vector
Outpu
t Classes
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 35 of 86
Exploiting Sparse Data in DNNs Inpu
t Vector
Outpu
t Classes
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 36 of 86
Exploiting Sparse Data in DNNs Inpu
t Vector
Outpu
t Classes
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 37 of 86
Exploiting Sparse Data in DNNs Inpu
t Vector
Outpu
t Classes
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 38 of 86
Exploiting Sparse Data in DNNs Inpu
t Vector
Outpu
t Classes
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 39 of 86
Exploiting Sparse Data in DNNs Inpu
t Vector
Outpu
t Classes
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 40 of 86
Exploiting Sparse Data in DNNs Inpu
t Vector
Outpu
t Classes
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 41 of 86
Exploiting Sparse Data in DNNs Inpu
t Vector
Outpu
t Classes
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 42 of 86
Exploiting Sparse Data in DNNs Inpu
t Vector
Outpu
t Classes
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 43 of 86
DNN ENGINE Micro-Architecture
MAC DatapathWeight Load
NBUF
IPBUF
XBUF
Data In
Data Out
CFG
ActivationDMA
Host Processor
Writeback
Data Read
W-‐MEM
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 44 of 86
DNN ENGINE Micro-Architecture
MAC DatapathWeight Load
NBUF
IPBUF
XBUF
Data In
Data Out
CFG
ActivationDMAIndex
Host Processor
Writeback
SKIP
Data Read
W-‐MEM
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 45 of 86
Outline
• Mo5va5on
• DNN Engine – Parallelism – Sparsity – Resilience
• Measurement Results
• Summary
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 46 of 86
Razor Hardware Accelerators
CLK
voltage
process
coupling jiZer ageing
safety
temp
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 47 of 86
Razor Hardware Accelerators
CLK
voltage
process
coupling jiZer ageing
safety
temp
• Occasionally allow real cri5cal paths to fail 5ming – Explicitly check for late-‐arriving signals – Adapt Vdd/Fclk to track near-‐zero error-‐rate
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 48 of 86
Razor Hardware Accelerators
CLK
voltage
process
coupling jiZer ageing
safety
temp
• Occasionally allow real cri5cal paths to fail 5ming – Explicitly check for late-‐arriving signals – Adapt Vdd/Fclk to track near-‐zero error-‐rate
• Must ensure 5ming errors do not affect system – Roll-‐back and replay [Das et al., CICC’13] – Algorithmic error tolerance [Whatmough et al., ISSCC’13]
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 49 of 86
MAC DatapathWeight Load
RZNBUF
IPBUF
XBUF
RZ
Data In
Data Out
CFG
ActivationDMAIndex
Host Processor
Writeback
SKIP
Data Read
W-‐MEM
DNN Engine Micro-Architecture
• Two Razor stages: W Memory and MAC Datapath
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 50 of 86
Memory: Algorithm-Level Tolerance
• Weights are zero-‐mean Gaussians (regulariza5on) – Small magnitude, random sign (worst-‐case in TC)
Two’s Compliment (TC) -1: 11111111+1: 00000001
!0.3% !0.2% !0.1% 0.0% 0.1% 0.2% 0.3%
Coun
t%(a.u.)%
Weight%Value%
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 51 of 86
Memory: Algorithm-Level Tolerance Two’s Compliment (TC) -1: 11111111+1: 00000001
Sign-‐Magnitude (SM) -1: 10000001+1: 00000001
• Weights are zero-‐mean Gaussians (regulariza5on) – Small magnitude, random sign (worst-‐case in TC)
• Sign-‐Magnitude (SM) reduces switching in weights – Lowers switched cap (power) and bit error rate
!0.3% !0.2% !0.1% 0.0% 0.1% 0.2% 0.3%
Coun
t%(a.u.)%
Weight%Value%
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 52 of 86
RZ
RZ
X
W
Y
Datapath: Circuit-Level Tolerance
• Algorithmic error tolerance in datapath is poor – Errors are persistent in accumulator
• Circuit-‐level error tolerance using 5me borrowing – Single-‐sided cri5cal path at accumulator register
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 53 of 86
RZ
RZ
X
W
Y
Datapath: Circuit-Level Tolerance
• Algorithmic error tolerance in datapath is poor – Errors are persistent in accumulator
• Circuit-‐level error tolerance using 5me borrowing – Single-‐sided cri5cal path at accumulator register
Cri5cal Path
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 54 of 86
RZ
RZ
X
W
Y
Datapath: Circuit-Level Tolerance
• Algorithmic error tolerance in datapath is poor – Errors are persistent in accumulator
• Circuit-‐level error tolerance using 5me borrowing – Single-‐sided cri5cal path at accumulator register
Cri5cal Path
Slack
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 55 of 86
Outline
• Mo5va5on
• DNN Engine – Parallelism – Sparsity – Resilience
• Measurement Results
• Summary
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 56 of 86
28nm SoC for IoT Applications
ARM Cortex-‐M0
UARTDNN Engine
W-‐MEM1MB
SYSCTL
Bridge
GPIO
D-‐MEM64KB
I-‐MEM64KBTimers
BIST
Wdog
UARTs
32b AH
B
32b AP
B
128b
AXI
M
M
S
S
S
S M S
S
S
Low-‐BW Peripherals
S
S
SCAN MS M
S
Accelerator SubsystemCortex-‐M0 Subsystem
VACC –!Accelerator LogicVMEMP –!SRAM PeripheryVMEMC –!SRAM Core
USB
GPIO
Scan
USB
DCO
28nm SoC Test Chip
S
VDCO
VSOC
RTC
ASYN
C
FCLKHCLKOSC
VSOC
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 57 of 86
Razor Voltage Scaling
• FC-‐DNN 784-‐256-‐256-‐256-‐10 (98.36% MNIST)
Sign-‐off
VDD Scaling (FCLK = 667MHz)
Tim
ing
Vio
latio
n R
ate
Pow
er (m
W)
TCSM
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 58 of 86
Razor Voltage Scaling
• Opera5on at PoFF (770mV) lowers power by 30%
VDD Scaling (FCLK = 667MHz)
Tim
ing
Vio
latio
n R
ate
Pow
er (m
W)
TCSM Si
gn-‐off
PoFF
30% Power Reduc5on
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 59 of 86
Razor Voltage Scaling
VDD Scaling (FCLK = 667MHz)
Tim
ing
Vio
latio
n R
ate
Pow
er (m
W)
TCSM
Zero M
argin
• Margin for fast varia5ons down to 715mV
30% Power Reduc5on
Sign-‐off
Margin Po
FF
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 60 of 86
Timing Error Tolerance Summary
• Aggregate 5ming viola5on rate greater than 10-‐1
10-1
10-3
10-5
10-7
MNIST @ 98.36% AccuracyMemory Datapath Combined
Tim
ing
Vio
latio
n R
ate
TC SM BM TC SM TB TC ALL
> 1,
000,
000
X
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 61 of 86
0.4
0.6
1.0
0.2
0.8
Nor
mal
ized
Pre
d/S
ec
Nor
mal
ized
Ene
rgy/
Pre
d
TC
ThroughputEnergy
SM 8b
0.9V667MHz
2.0
4.0
8.0
1.0
3.0
5.0
6.0
7.0
Sign-off
Energy/Throughput Summary
-‐30%
MNIST @ 98.36% Accuracy
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 62 of 86
0.4
0.6
1.0
0.2
0.8
Nor
mal
ized
Pre
d/S
ec
Nor
mal
ized
Ene
rgy/
Pre
d
TC
ThroughputEnergy
SM 8b AVS SKIP
0.9V667MHz
0.77V667MHz
2.0
4.0
8.0
1.0
3.0
5.0
6.0
7.0
PoFFSign-off
Energy/Throughput Summary
-‐30%
-‐4X
+4X
MNIST @ 98.36% Accuracy
-‐30%
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 63 of 86
0.4
0.6
1.0
0.2
0.8
Nor
mal
ized
Pre
d/S
ec
Nor
mal
ized
Ene
rgy/
Pre
d
TC
ThroughputEnergy
260nJ
SM 8b AVS SKIP AVS
0.9V667MHz
0.77V667MHz
0.715V667MHz
2.0
4.0
8.0
1.0
3.0
5.0
6.0
7.0
PoFFSign-off
Energy/Throughput Summary MNIST @ 98.36% Accuracy
-‐30%
-‐4X
+4X
-‐30%
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 64 of 86
0.4
0.6
1.0
0.2
0.8
Nor
mal
ized
Pre
d/S
ec
Nor
mal
ized
Ene
rgy/
Pre
d
TC
ThroughputEnergy
260nJ
SM 8b AVS SKIP AVS AFS SKIP
0.9V667MHz
0.77V667MHz
0.715V667MHz
0.9V1GHz
2.0
4.0
8.0
1.0
3.0
5.0
6.0
7.0
PoFF PoFFSign-off
Energy/Throughput Summary
+50%
+4X
-‐4X
MNIST @ 98.36% Accuracy
-‐30%
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 65 of 86
0.4
0.6
1.0
0.2
0.8
Nor
mal
ized
Pre
d/S
ec
Nor
mal
ized
Ene
rgy/
Pre
d
TC
ThroughputEnergy
260nJ570nJ
SM 8b AVS SKIP AVS AFS SKIP AFS
0.9V667MHz
0.77V667MHz
0.715V667MHz
0.9V1GHz
0.9V1.2GHz
2.0
4.0
8.0
1.0
3.0
5.0
6.0
7.0
PoFF PoFFSign-off
Energy/Throughput Summary
+20%
MNIST @ 98.36% Accuracy
+50%
+4X
-‐4X
-‐30%
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 66 of 86
Summary and Conclusion
• DNN Engine – A Shared Programmable Classifier – Parallelism: 10X Data Reuse @ 128b/cycle BW – Sparsity: +4X Throughput, -‐4X Energy – Resilience: +50% Throughput or -‐30% Energy w/Razor
• Energy Efficient DNN – 568 nJ/pred @ 1.2 GHz
• Timing Error Tolerance – Pe >10-‐1 @ 98.36% (MNIST)
14.3: A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine
with >0.1 Timing Error Rate Tolerance for IoT Applications © 2017 IEEE International Solid-State Circuits Conference 67 of 86