Luca Benini
The Deep Learning Revolution: Why Now?
http://www.pulp-platform.org
Machine Learning Frenzy
Yes, but Why?
First, it was machine vision…
[Chart: ILSVRC classification error over time, with human-level performance marked; deep-learning entries appear as green dots.]
ImageNet Large Scale Visual Recognition Competition (ILSVRC)
ImageNet training data: 10 million hand-annotated images (object in picture), with 1 million bounding boxes also provided.
Then, speech recognition…
English training data: 11,940 hours, 8 million utterances, and growing every day.
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin (Baidu Research, 2016)
And automatic translation…
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (2016)
The En→Fr training set contains 36M sentence pairs.
Only Research Milestones?
Growing Use of Deep Learning at Google [J. Dean]
Across many
products/areas:
Android
Apps
drug discovery
Gmail
Image understanding
Maps
Natural language
understanding
Photos
Robotics research
Speech
Translation
YouTube
…many others...
[Chart: number of directories containing model description files, growing over time.]
RankBrain is the third most useful signal for ranking pages (it extracts a "purified" search from what you type).
While humans guessed the top-ranking pages correctly 70% of the time, RankBrain had an 80% success rate.
Only a Google Thing?
[Chart: Deep Learning market forecast; DL software revenue.]
Only for Big Guys?
What is it?
Anatomy of a Neural Network
From Stanford cs231n lecture notes
[Figure: a biological neuron alongside its artificial counterpart, with inputs x1, x2, x3, weights w1, w2, w3, and output y.]
Artificial neuron: y = F(w1·x1 + w2·x2 + w3·x3), with F(x) = max(0, x) (ReLU).
Inference: given x, compute y with fixed w.
Training (aka backpropagation): given x and y, adjust w so that the output gets "close" to y, as measured by a fitness function.
Don't get too close in a single shot: make smooth adjustments with some randomization (stochastic gradient).
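To make this concrete, here is a minimal sketch of the artificial neuron and one stochastic-gradient step in Python (NumPy; the function names, loss, and learning rate are illustrative, not from the slides):

import numpy as np

def relu(z):                              # F(x) = max(0, x)
    return np.maximum(0.0, z)

def neuron(w, x):                         # inference: y = F(w1*x1 + w2*x2 + w3*x3)
    return relu(np.dot(w, x))

def sgd_step(w, x, y_target, lr=0.01):
    # One smooth adjustment toward the target under a squared-error fitness function.
    z = np.dot(w, x)
    err = relu(z) - y_target
    grad = err * (z > 0) * x              # chain rule through ReLU (derivative 0 or 1)
    return w - lr * grad

w = np.array([0.5, -0.3, 0.8])
x = np.array([1.0, 2.0, 3.0])
print(neuron(w, x))                       # inference with fixed w
w = sgd_step(w, x, y_target=1.0)          # one training update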
Deep neural networks (DNNs)
[Figure: pipeline from input to result; raw data is transformed through low-level, mid-level, and high-level features.]
Application components:
Task objective, e.g. identify a face
Training data: 10-100M images
Network architecture: ~10 layers, ~1B parameters
Learning algorithm: ~30 exaflops, ~30 GPU-days
Layer-by-layer computation
Key operation is a dense matrix-vector multiply (M x V).
Training works by recursive backpropagation of the error on the fitness function.
Batching for training (also used for latency-insensitive inference):
Batched operation boosts reuse of weights; without batching, each element of the weight matrix would be used only once.
You want 10-50 arithmetic operations per memory fetch to avoid waiting all the time for data from memory, as the sketch below illustrates.
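A minimal sketch of the weight-reuse argument (NumPy; the dimensions and batch size are illustrative):

import numpy as np

M, N, B = 1024, 1024, 32          # layer dimensions and batch size
W = np.random.randn(M, N)         # weight matrix

x = np.random.randn(N)            # single input: dense M x V
y = W @ x                         # every weight is fetched for exactly one MAC

X = np.random.randn(N, B)         # a batch of B inputs, stacked as columns
Y = W @ X                         # every weight fetched once but used in B MACs

print("MACs per weight fetch:", 1, "->", B)   # arithmetic intensity grows with B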
CNN Computation: main kernel (per layer)
Deep Convolutional NNs
Filters are shared across the whole plane.
MAC-dominated, even without batching.
Can be cast as a matrix multiply (see the im2col sketch below).
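One standard way to cast the convolution as a matrix multiply is im2col; a minimal sketch under simplifying assumptions (no padding, stride 1; NumPy):

import numpy as np

def conv2d_as_matmul(x, w):
    # x: (C, H, W) input planes; w: (K, C, R, S) filters shared across the plane.
    C, H, Wd = x.shape
    K, _, R, S = w.shape
    Ho, Wo = H - R + 1, Wd - S + 1
    cols = np.empty((C * R * S, Ho * Wo))     # im2col: one column per receptive field
    for i in range(Ho):
        for j in range(Wo):
            cols[:, i * Wo + j] = x[:, i:i+R, j:j+S].ravel()
    Wm = w.reshape(K, C * R * S)              # each filter becomes one matrix row
    return (Wm @ cols).reshape(K, Ho, Wo)     # one MAC-dominated matrix multiply

out = conv2d_as_matmul(np.random.randn(3, 8, 8), np.random.randn(4, 3, 3, 3))
print(out.shape)                              # (4, 6, 6)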
Recipe for Deep Learning
http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learning-explained-in-a-single-powerpoint-slide/
Don't forget overfitting!
Preventing overfitting: modify the network, or use a better optimization strategy (a minimal example follows below).
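As one concrete instance of "modify the network", here is a minimal inverted-dropout sketch (a standard regularization technique, not something specific to these slides; NumPy):

import numpy as np

def dropout(a, p=0.5, training=True):
    # Randomly zero activations during training to discourage co-adaptation.
    if not training:
        return a                               # inference uses the full network
    mask = (np.random.rand(*a.shape) >= p) / (1.0 - p)   # rescale survivors
    return a * mask

a = np.random.randn(4, 8)                      # some hidden-layer activations
print(dropout(a, p=0.5))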
The Singularity of DL
Results get better with more data + bigger models + more computation.
Better algorithms, new insights, and improved techniques always help, too!
GPUs are Great for DL
Why? Because they are good at matrix multiply: 90% utilization is achievable (on lots of "cores").
Pascal GP100: 3840 "cores" at 3840 MAC/cycle @ 1.4 GHz gives 5.3 TMACS (FP) @ 300 W, i.e. ~28 pJ/OP.
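A quick back-of-the-envelope check of these numbers (counting 1 MAC as 2 operations; plain Python):

macs_per_s = 3840 * 1.4e9                     # 3840 MAC/cycle at 1.4 GHz
pj_per_op = 300.0 / (2 * macs_per_s) * 1e12   # 300 W spread over 2 ops per MAC
print(f"{macs_per_s / 1e12:.1f} TMAC/s, {pj_per_op:.0f} pJ/OP")   # ~5.4 TMAC/s, ~28 pJ/OP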
What’s Next?
Near-Sensor (aka Edge) DL
[Block diagram: an edge node, battery + harvesting powered, with a power envelope of a few mW.]
Sense: MEMS IMU, MEMS microphone, ULP imager, EMG/ECG/EIT; 100 µW to 2 mW.
Analyze and classify: microcontroller (e.g. Cortex-M) with IOs and L2 memory; performance points range from 1-25 MOPS to 1-2000 MOPS, all within 1-10 mW.
Transmit: long-range/low-bandwidth or short-range/higher-bandwidth links; low-rate (periodic) data out, SW updates and commands in; idle ~1 µW, active ~50 mW.
Does it Matter?
3x cost reduction if the data volume is reduced by 95%.
Origami: A CNN Accelerator
FP is not needed: 12-bit signals are sufficient.
From input to classification, the double-precision-vs-12-bit accuracy loss is < 0.5% (80.6% to 80.1%).
https://arxiv.org/abs/1512.04295
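As an illustration, a minimal sketch of the fixed-point format involved, Q2.9 (1 sign bit, 2 integer bits, 9 fractional bits = 12 bits; this matches the Q2.9 label used in the comparison table later, but the helper itself is hypothetical):

import numpy as np

def quantize_q2_9(x):
    # Q2.9: 12-bit two's complement with 9 fractional bits.
    scale = 2 ** 9
    q = np.clip(np.round(x * scale), -2 ** 11, 2 ** 11 - 1)
    return q / scale                           # back to reals, for comparison

x = np.random.randn(5)
print(x)
print(quantize_q2_9(x))                        # per-value error is at most 2**-10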
CNNs: typical workload
Example: ResNet-34 classifies 224x224 images into 1000 classes at roughly trained-human-level performance, with ~21M parameters and ~3.6G MAC operations per image.
Performance for 10 fps: ~73 GOP/s.
Scaling Origami to 28nm FDSOI: ~2300 GOPS/W energy efficiency (0.4 pJ/OP), i.e. 10 fps ResNet-34 at ~32 mW.
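These figures are mutually consistent; a quick check (counting 1 MAC as 2 operations; plain Python):

gop_per_s = 2 * 3.6e9 * 10 / 1e9               # 3.6G MACs/frame at 10 fps
mw = gop_per_s / 2300 * 1000                   # power at 2300 GOPS/W
print(f"{gop_per_s:.0f} GOP/s, {mw:.0f} mW, {1000 / 2300:.1f} pJ/OP")
# -> 72 GOP/s, 31 mW, 0.4 pJ/OP, matching the quoted ~73 GOP/s and ~32 mW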
Pushing Further: YodaNN¹
Approximation on the algorithmic side: binary weights.
BinaryConnect [Courbariaux, NIPS15], XNOR-Net [Rastegari, arXiv16]
Reduce each weight to a binary value, -1/+1.
The accuracy loss is significant, but the gap is closing: ResNet-18 on ImageNet reaches 83.0% (binary-weight) vs. 89.2% (single-precision) top-5 accuracy, and 60.8% vs. 69.3% top-1 accuracy.
Ultra-optimized HW is possible! (A binarization sketch follows below.)
¹ After the Jedi Master from Star Wars: "small in size but wise and powerful" (cit. www.starwars.com)
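A minimal sketch of binary-weight quantization in the BinaryConnect / XNOR-Net style (the per-filter scale alpha = mean(|W|) follows the XNOR-Net recipe; the names are illustrative):

import numpy as np

def binarize_weights(W):
    # Per-filter scale alpha = mean(|W|): the L2-optimal scale for sign(W) (XNOR-Net).
    alpha = np.mean(np.abs(W), axis=(1, 2, 3), keepdims=True)
    Wb = np.where(W >= 0, 1.0, -1.0)           # 1 bit per weight instead of 32
    return alpha, Wb

W = np.random.randn(64, 3, 3, 3)               # a bank of 64 3x3x3 conv filters
alpha, Wb = binarize_weights(W)
W_approx = alpha * Wb                          # inference convolves with alpha * sign(W)
print(alpha.shape, np.unique(Wb))              # (64, 1, 1, 1) [-1. 1.]

With weights reduced to signs, multiplications collapse to additions and subtractions, which is what enables the ultra-optimized hardware of the next slide.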
YodaNN Energy Efficiency
Same area: 832 SoP (sum-of-products) units + all-SCM (standard-cell memory) storage.
12x boost in core energy efficiency (single layer): 0.03 pJ/OP.
https://arxiv.org/abs/1606.05487
Origami, YodaNN vs. Human

                      Human          Origami      Origami      YodaNN
Type                  Analog (bio)   Q2.9 fixed   Q2.9 fixed   Binary-weight
Network               human          ResNet-34    ResNet-18    ResNet-18
Top-1 error [%]       -              21.53        30.7         39.2
Top-5 error [%]       5.1            5.6          10.8         17.0
Hardware              Brain          Origami      Origami      YodaNN
Energy [µJ/img]       100,000(*)     1086         543          31

The «energy-efficient AI» challenge (e.g. Human vs. IBM Watson): game over for humans also in energy-efficient vision? ...Not yet! (Object recognition is a super-simple task.)
(*) Pbrain = 10 W with 10% of the brain used for vision, and a trained human working at 10 img/sec: 1 W / 10 img/s = 0.1 J/img = 100,000 µJ/img.
Thanks!!!
www.pulp-platform.org
www-micrel.deis.unibo.it/pulp-project
iis-projects.ee.ethz.ch/index.php/PULP
Moral:
Plenty of room at the bottom
Tensor Processing Unit
[Chart: TPU performance; N.B.: comparison vs. a K80 GPU.]
A specialization story:
1. Reduced precision (8-bit)
2. Dedicated memory architecture