Mixed-Signal Techniques for Embedded
Machine Learning Systems
Boris Murmann
June 16, 2019
Applications
Speed of response
Bandwidth utilized
Privacy
Power consumed
Cloud vs. Edge
Conversational
Interfaces
…natural?
…seamless?
…real-time?
Task Complexity, Memory and Classification Energy
[Chart: classification energy per inference, from pJ up to J, versus task complexity and memory: Face Detection (SVM), Digit Recognition, Image Classification with 10 classes (Binary CNN), Image Classification with 1000 classes (MobileNet v1), Self-Driving (ResNet-50 HD)]
[Same chart, with the pJ-to-J energy axis, annotated with the region of my main interest]
Edge Inference System
[Block diagram: Sensors → A/D → Wake-up Detector and Pre-Processing → Deep Neural Network; memory hierarchy: DRAM → SRAM → Buffer → Register File; compute: PE Array with MAC units]
Opportunities for Analog/Mixed-Signal Design
[Same block diagram, annotated with mixed-signal opportunities: Analog Feature Extraction at the sensor interface, Analog MAC in the compute array, and In-Memory Computing in the memory hierarchy]
Outline
Data-Compressive Imager for Object Detection
► Omid-Zohoor & Young, TCSVT 2018 & ISSCC 2019
Mixed-Signal ConvNet
► Bankman, ISSCC 2018 & JSSC 2019
RRAM-based ConvNet with In-Memory Compute
► Ongoing work
Wake-Up Detector with Hand-Crafted Features
[Diagram: Sensor → A/D → Feature Extractor → Simple Classifier → Label; the classifier is trained offline by a Training Algorithm using Labeled Training Data; the high-dimensional sensor data ("data deluge") is reduced to a low-dimensional representation]
Analog Feature Extractor
Low-rate and/or low-resolution ADC
Low data rate digital I/O
Reduced memory requirements
[Diagram: Sensor → analog Feature Extractor → A/D → Simple Classification → Label; Training Algorithm with Labeled Training Data; only the low-dimensional representation is digitized]
Histogram of Oriented Gradients
Analog Gradient Computation
[Figure: pixel P with horizontal neighbors H1, H2 and vertical neighbors V1, V2]
G_H = H1 - H2 (horizontal)
G_V = V1 - V2 (vertical)
Bright patch: G_H = 400 mV - 100 mV = 300 mV
Dark patch (1/4 the illumination): G_H = (1/4)·400 mV - (1/4)·100 mV = 75 mV
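The bright/dark patch arithmetic can be checked numerically; this minimal sketch uses the pixel voltages from the slide (the names H1, H2, alpha are illustrative):

```python
# Horizontal neighbor pixels of a bright patch, in volts (values from the slide).
H1, H2 = 0.400, 0.100
alpha = 0.25  # the dark patch receives 1/4 of the illumination

# Linear gradient scales with illumination:
g_bright = H1 - H2                     # 0.300 V
g_dark = alpha * H1 - alpha * H2       # 0.075 V

# Ratio ("log") gradient cancels the illumination factor:
r_bright = H1 / H2                     # 4.0
r_dark = (alpha * H1) / (alpha * H2)   # 4.0
```

The linear gradient drops from 300 mV to 75 mV, while the ratio stays at 4.0; that invariance is the motivation for the ratio-based gradients on the following slides.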
High Dynamic Range Images
Gradient magnitude varies significantly across the image
Would require high-resolution ADCs (≥ 9b) to digitize computed gradients
► But, we want to produce as little data as possible
Ratio-Based ("Log") Gradients
[Figure: pixel P with horizontal neighbors H1, H2 and vertical neighbors V1, V2]
G_H = H1 / H2
G_V = V1 / V2
With illumination scaling α: G_H = (α × H1) / (α × H2) = H1 / H2 → Illumination Invariant
Ratio Quantization
[Figure: the pixel ratio pix1/pix2 (e.g. H1/H2, V1/V2) is quantized with empirically determined thresholds. A 1.5b quantizer distinguishes negative edge / no edge / positive edge; a 2.75b quantizer adds finer levels (the slide shows thresholds at 1.3 and 4.2 and their reciprocals)]
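A 1.5b ratio quantizer of this kind can be sketched as below; the threshold value is a placeholder, since the slide only notes that the thresholds were determined empirically:

```python
def quantize_ratio_1p5b(pix1, pix2, thresh=2.0):
    """Quantize the pixel ratio pix1/pix2 into three levels:
    -1 (negative edge), 0 (no edge), +1 (positive edge).
    thresh is an assumed placeholder for the empirical threshold."""
    r = pix1 / pix2
    if r > thresh:
        return +1          # positive edge
    if r < 1.0 / thresh:
        return -1          # negative edge
    return 0               # no edge

# A strong bright-to-dark transition reads as a positive edge:
edge = quantize_ratio_1p5b(400, 100)   # ratio 4.0
```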
HOG Feature Compression with 1.5b Gradients
Fewer angle bins
Quantizing histogram magnitudes
25× less data in HOG features compared to 8-bit image
Log vs. Linear Gradients
[Image sequence over three slides: example images with progressively less illumination, comparing log and linear gradient outputs]
Prototype Chip
Row Buffers with Pixel Binning (Image Pyramid)
Ratio-to-Digital Converter (RDC)
Data-Driven Spec Derivation
G_H = (H1 + v_n) / (H2 + v_n)
[Noise v_n added to each pixel corrupts the ratio; specs are derived from data]
Chip Summary
• 0.13 µm CIS 1P4M
• 5 µm 4T pixels
• QVGA 320(H) x 240(V)
• 229 mW @ 30 FPS
Supply Voltages
Pixel: 2.5V
Analog: 1.5V, 2.5V
Digital: 0.9V
Sample Images
Results using Deformable Parts Model detection & custom database (PascalRAW)
Comparison to State of the Art
Information Preservation
Use Log Gradients as ConvNet Input?
Ongoing work; comparable performance using ResNet-10 (PascalRaw dataset)
Can We Play Mixed-Signal Tricks in a ConvNet?
[Diagram: Digital ConvNet Fabric]
BinaryNet
Courbariaux et al., NIPS 2016
Weights and activations constrained to +1 and -1; multiplication becomes XNOR
Minimizes D/A and A/D overhead
Nice option for small/medium-size problems and mixed-signal exploration
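With ±1 weights and activations encoded as single bits (1 → +1, 0 → −1), the multiply-accumulate reduces to XNOR plus popcount. A minimal sketch of that equivalence:

```python
def binary_dot(w_bits, x_bits):
    """Dot product of two ±1 vectors stored as 0/1 bits.
    XNOR marks matching bits; with m matches out of n elements,
    the ±1 dot product equals 2*m - n."""
    n = len(w_bits)
    m = sum(1 for w, x in zip(w_bits, x_bits) if w == x)  # XNOR + popcount
    return 2 * m - n

# Cross-check against the ±1 arithmetic it replaces:
w = [1, 0, 1, 1]
x = [1, 1, 0, 1]
ref = sum((2 * a - 1) * (2 * b - 1) for a, b in zip(w, x))
```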
Mixed-Signal Binary CNN Processor
Bankman et al., ISSCC 2018
1. Binary CNN with "CMOS-inspired" topology, engineered for minimal circuit-level path loading
2. Hardware architecture amortizes memory access across many computations, with all memory on chip (328 KB)
3. Energy-efficient switched-capacitor neuron for wide vector summation, replacing digital adder tree
Original BinaryNet Topology
Zhao et al., FPGA 2017
88.54% accuracy on CIFAR-10
1.67 MB weight memory (68% FC layers)
27.9 mJ/classification with FPGA
Mixed-Signal BinaryNet Topology
Sacrificed accuracy for regularity and energy efficiency
86.05% accuracy on CIFAR-10
328 KB weight memory
3.8 µJ per classification
Neuron
Naïve Sequential Computation
Weight-Stationary
[Array of 256 neurons; x2 (north/south); x4 (neuron mux)]
Weight-Stationary and Data-Parallel
[Parallel broadcast of inputs]
Complete Architecture
Neuron Function
z = sgn( Σ_{i=0..1023} w_i · x_i + b )
w_i: filter weights (2×2×256)
x_i: image patch (2×2×256)
b: filter bias, 9b sign-magnitude: b = (-1)^s · Σ_{i=0..7} 2^i · b_i
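As a sketch, the neuron function and the 9b sign-magnitude bias decode can be written out as follows (the sgn(0) convention is an assumption, since the slide does not specify it):

```python
def decode_bias(sign_bit, mag_bits):
    """9b sign-magnitude filter bias: b = (-1)^s * sum_{i=0..7} 2^i * b_i."""
    mag = sum(bit << i for i, bit in enumerate(mag_bits))
    return -mag if sign_bit else mag

def neuron(w, x, b):
    """z = sgn( sum_i w_i * x_i + b ), with w_i, x_i in {-1, +1}.
    Here sgn(0) is taken as +1 (assumption)."""
    acc = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if acc >= 0 else -1
```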
Switched-Capacitor Implementation
v_diff / V_DD = (C_u / C_tot) · ( Σ_{i=0..1023} w_i · x_i + b ),  b = (-1)^s · Σ_{i=0..7} 2^i · b_i
Weights × inputs; bias & offset calibration
Batch normalization folded into weight signs and bias
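The charge-redistribution formula can be modeled behaviorally in a few lines; defaulting C_tot to len(w)·C_u is an assumption that ignores bias and parasitic capacitance on the summation node:

```python
def sc_neuron_vdiff(w, x, b, VDD=0.8, Cu=1e-15, Ctot=None):
    """Differential neuron output before the comparator:
    v_diff = VDD * (Cu / Ctot) * (sum_i w_i * x_i + b).
    Ctot defaults to len(w) * Cu (assumption: bias and parasitic
    capacitance are ignored)."""
    if Ctot is None:
        Ctot = len(w) * Cu
    acc = sum(wi * xi for wi, xi in zip(w, x)) + b
    return VDD * (Cu / Ctot) * acc
```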
Behavioral Simulations
Significant margin in noise, offset, and mismatch (V_FS = 460 mV)
C_u = 1 fF, v_os = 4.6 mV RMS, v_n = 460 µV RMS
"Memory-Cell-Like" Processing Element
Standard-cell-based, 42 transistors
1 fF metal-oxide-metal fringe capacitor
24,107 F² area
Die Photo
TSMC 28nm HPL 1P8M
6 mm² area
328 KB SRAM
10 MHz clock
Supply Voltages
VDD (digital logic): 0.6V – 1.0V
VMEM (SRAM): 0.53V – 1.0V
VNEU (neuron array): 0.6V
VCOMP (comparators): 0.8V
Measured Classification Accuracy
10 chips, 180 runs each through 10,000 CIFAR-10 test images
VDD = 0.8V, VMEM = 0.8V
µ = 86.05%, σ = 0.40%
3.8 µJ/classification
237 FPS, 899 µW
0.43 µJ in 1.8V I/O
Mean accuracy µ = 86.05%, same as perfect digital model
Comparison to Synthesized Digital
BinarEye (Moons et al., CICC 2018): synthesized digital vs. this mixed-signal design
Digital vs. Mixed-Signal Binary CNN Processor
[Table: Energy @ 86.05% CIFAR-10 for three designs: Synthesized Digital (Moons et al., CICC 2018), Hand-Designed Digital (projected), Mixed-Signal (Bankman et al., ISSCC 2018)]
CIFAR-10 Energy vs. Accuracy
Neuromorphic ► [1] TrueNorth, Esser PNAS 2016
GPU ► [2] Zhao FPGA 2017
FPGA ► [2] Zhao FPGA 2017; [3] Umuroglu FPGA 2017
MCU ► [4] CMSIS-NN, Lai arXiv 2018
Memory-like, mixed-signal ► [5] Bankman ISSCC 2018
BinarEye, digital ► [6] Moons CICC 2018
In-memory, mixed-signal ► [7] Jia arXiv 2018
*energy excludes off-chip DRAM
Limitations of Mixed-Signal BinaryNet
Poor programmability
Relatively limited accuracy (even on CIFAR-10) due to 1b arithmetic
Energy advantage over customized digital is not revolutionary
► Same SRAM, essentially same dataflow
Need a more "analog" memory system to unleash larger gains
► In-memory computing
BinaryNet Synapse versus Resistive RAM
BinaryNet synapse (28 nm): 0.93 fJ per 1b-MAC, 24,107 F², single-bit
Resistive RAM (Tsai, 2018): energy TBD, 25 F², multi-bit (?)
Matrix-Vector Multiplication with Resistive Cells
Typically use two cells to achieve pos/neg weights (other schemes possible)
[Array: 1024 rows × 256 columns]
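A behavioral sketch of the column-current computation, with each signed weight stored as a conductance pair (array shapes follow the 1024 × 256 example; the conductance and voltage values are illustrative):

```python
import numpy as np

def rram_mvm(G_pos, G_neg, v_in):
    """Column currents of a resistive crossbar:
    I_out[j] = sum_i (G_pos[i, j] - G_neg[i, j]) * v_in[i]
    (Kirchhoff current summation along each column)."""
    return v_in @ (G_pos - G_neg)

rng = np.random.default_rng(0)
G_pos = rng.uniform(0, 1e-6, (1024, 256))  # conductances in siemens
G_neg = rng.uniform(0, 1e-6, (1024, 256))
v_in = rng.uniform(0, 0.2, 1024)           # read voltages in volts
i_out = rram_mvm(G_pos, G_neg, v_in)       # one output current per column
```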
RRAM Density – BinaryNet Example
25 F² cell @ F = 90 nm → side length s = 0.45 µm
2 (pos/neg) × 1024 × 0.45 µm × 256 × 0.45 µm = 0.106 mm² per layer
0.84 mm² for the complete 8-layer network
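The area numbers can be reproduced with a few lines of arithmetic:

```python
# RRAM area estimate from the slide, in SI units.
F = 90e-9                              # feature size
s = (25 * F**2) ** 0.5                 # side of a 25 F^2 cell: 0.45 um

rows, cols = 1024, 256
layer_area = 2 * rows * s * cols * s   # 2x for pos/neg conductance pairs
network_area = 8 * layer_area          # 8-layer BinaryNet

layer_mm2 = layer_area * 1e6           # ~0.106 mm^2
network_mm2 = network_area * 1e6       # ~0.85 mm^2 (slide rounds to 0.84)
```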
Ongoing Research
What is the best architecture?
How many levels can be stored in each cell?
What is the most efficient readout?
Can we cope with nonidealities using training techniques?
(Content deleted for online distribution)
VGG-7 Experiment (4.8 Million Parameters)
[Network: conv layers with 3×3 kernels — 3×3×3, 3×3×128, 3×3×128, 3×3×256, 3×3×256, 3×3×512 — followed by a fully connected layer, 8192 → 10]
2-bit weights, 2-bit activations
Accuracy on CIFAR-10
► 2b quantization only: 93%
► 2b quantization + RRAM/ADC model: 92%
Work in progress!
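The slide does not specify the quantizer; as an illustrative assumption, a symmetric uniform 2-bit quantizer with the four levels {-1.5, -0.5, +0.5, +1.5}·step could look like this:

```python
import numpy as np

def quantize_sym_2b(x, step):
    """Symmetric uniform 2-bit quantizer (four levels, no zero level):
    output values are {-1.5, -0.5, +0.5, +1.5} * step.
    Illustrative scheme only; the experiment's actual quantizer may differ."""
    codes = np.clip(np.floor(x / step), -2, 1)   # integer code in {-2, -1, 0, 1}
    return (codes + 0.5) * step
```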
Energy Model for Column in Conv6 Layer
Optimal energy/MAC = 0.38 fJ
Summary
Analog feature extraction is attractive for wake-up detectors
Adding analog compute in ConvNets can be beneficial when it simultaneously lets us reduce data movement
► In-memory analog compute looks most promising
► Can consider SRAM or emerging memories (e.g. RRAM)
Expect significant progress as more application drivers for "machine learning at the edge" emerge