Luca Benini
The Deep Learning Revolution: Why Now?
http://www.pulp-platform.org
Machine Learning Frenzy
Yes, but Why?
First, it was machine vision…
[Chart: ILSVRC classification error over time, with human-level performance marked; deep-learning entries appear as green dots.]
ImageNet Large Scale Visual Recognition Competition (ILSVRC)
ImageNet training data: 10 million hand-annotated images (object in picture), with 1 million bounding boxes also provided.
Then, speech recognition…
English training data: 11,940 hours, 8 million utterances, and growing every day.
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin (Baidu Research, 2016)
And automatic translation…
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (2016)
The En→Fr training set contains 36M sentence pairs.
Only Research Milestones?
Growing Use of Deep Learning at Google [J. Dean]
Across many
products/areas:
Android
Apps
drug discovery
Gmail
Image understanding
Maps
Natural language
understanding
Photos
Robotics research
Speech
Translation
YouTube
…many others...
[Chart: number of directories containing model description files, growing over time.]
RankBrain is the third most useful signal for ranking pages (it extracts a "purified" search from what you type).
While humans guessed the top-ranking pages correctly 70% of the time, RankBrain had an 80% success rate.
Only a Google Thing?
[Chart: Deep Learning market forecast; DL software revenue.]
Only for Big Guys?
What is it?
Anatomy of a Neural Network
From Stanford cs231n lecture notes
[Figure: a biological neuron alongside its artificial counterpart, with inputs x1, x2, x3, weights w1, w2, w3, and output y.]
Artificial neuron: y = F(w1·x1 + w2·x2 + w3·x3), with F(x) = max(0, x) (ReLU).
Inference: given x, compute y with fixed w.
Training (aka backpropagation): given x and y, adjust w so that the output gets "close" to y, as measured by a fitness function.
Don't get too close in a single shot: make smooth adjustments with some randomization (stochastic gradient).
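To make this concrete, here is a minimal sketch of the artificial neuron and one stochastic-gradient step in Python (NumPy; the function names, loss, and learning rate are illustrative, not from the slides):

import numpy as np

def relu(z):                              # F(x) = max(0, x)
    return np.maximum(0.0, z)

def neuron(w, x):                         # inference: y = F(w1*x1 + w2*x2 + w3*x3)
    return relu(np.dot(w, x))

def sgd_step(w, x, y_target, lr=0.01):
    # One smooth adjustment toward the target under a squared-error fitness function.
    z = np.dot(w, x)
    err = relu(z) - y_target
    grad = err * (z > 0) * x              # chain rule through ReLU (derivative 0 or 1)
    return w - lr * grad

w = np.array([0.5, -0.3, 0.8])
x = np.array([1.0, 2.0, 3.0])
print(neuron(w, x))                       # inference with fixed w
w = sgd_step(w, x, y_target=1.0)          # one training update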
Deep neural networks (DNNs)
[Figure: pipeline from input to result; raw data is transformed through low-level, mid-level, and high-level features.]
Application components:
Task objective, e.g. identify a face
Training data: 10-100M images
Network architecture: ~10 layers, ~1B parameters
Learning algorithm: ~30 exaflops, ~30 GPU-days
Layer-by-layer computation
Key operation is a dense matrix-vector multiply (M x V).
Training works by recursive backpropagation of the error on the fitness function.
Batching for training (also used for latency-insensitive inference):
Batched operation boosts reuse of weights; without batching, each element of the weight matrix would be used only once.
You want 10-50 arithmetic operations per memory fetch to avoid waiting all the time for data from memory, as the sketch below illustrates.
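A minimal sketch of the weight-reuse argument (NumPy; the dimensions and batch size are illustrative):

import numpy as np

M, N, B = 1024, 1024, 32          # layer dimensions and batch size
W = np.random.randn(M, N)         # weight matrix

x = np.random.randn(N)            # single input: dense M x V
y = W @ x                         # every weight is fetched for exactly one MAC

X = np.random.randn(N, B)         # a batch of B inputs, stacked as columns
Y = W @ X                         # every weight fetched once but used in B MACs

print("MACs per weight fetch:", 1, "->", B)   # arithmetic intensity grows with B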
CNN Computation: main kernel (per layer)
Deep Convolutional NNs
Filters are shared across the whole plane.
MAC-dominated, even without batching.
Can be cast as a matrix multiply (see the im2col sketch below).
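One standard way to cast the convolution as a matrix multiply is im2col; a minimal sketch under simplifying assumptions (no padding, stride 1; NumPy):

import numpy as np

def conv2d_as_matmul(x, w):
    # x: (C, H, W) input planes; w: (K, C, R, S) filters shared across the plane.
    C, H, Wd = x.shape
    K, _, R, S = w.shape
    Ho, Wo = H - R + 1, Wd - S + 1
    cols = np.empty((C * R * S, Ho * Wo))     # im2col: one column per receptive field
    for i in range(Ho):
        for j in range(Wo):
            cols[:, i * Wo + j] = x[:, i:i+R, j:j+S].ravel()
    Wm = w.reshape(K, C * R * S)              # each filter becomes one matrix row
    return (Wm @ cols).reshape(K, Ho, Wo)     # one MAC-dominated matrix multiply

out = conv2d_as_matmul(np.random.randn(3, 8, 8), np.random.randn(4, 3, 3, 3))
print(out.shape)                              # (4, 6, 6)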
Recipe for Deep Learning
http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learning-explained-in-a-single-powerpoint-slide/
Don't forget overfitting!
Preventing overfitting: modify the network, or use a better optimization strategy (a minimal example follows below).
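As one concrete instance of "modify the network", here is a minimal inverted-dropout sketch (a standard regularization technique, not something specific to these slides; NumPy):

import numpy as np

def dropout(a, p=0.5, training=True):
    # Randomly zero activations during training to discourage co-adaptation.
    if not training:
        return a                               # inference uses the full network
    mask = (np.random.rand(*a.shape) >= p) / (1.0 - p)   # rescale survivors
    return a * mask

a = np.random.randn(4, 8)                      # some hidden-layer activations
print(dropout(a, p=0.5))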
The Singularity of DL
Results get better with more data + bigger models + more computation.
Better algorithms, new insights, and improved techniques always help, too!
GPUs are Great for DL
Why? Because they are good at matrix multiply: 90% utilization is achievable (on lots of "cores").
Pascal GP100: 3840 "cores" at 3840 MAC/cycle @ 1.4 GHz gives 5.3 TMACS (FP) @ 300 W, i.e. ~28 pJ/OP.
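A quick back-of-the-envelope check of these numbers (counting 1 MAC as 2 operations; plain Python):

macs_per_s = 3840 * 1.4e9                     # 3840 MAC/cycle at 1.4 GHz
pj_per_op = 300.0 / (2 * macs_per_s) * 1e12   # 300 W spread over 2 ops per MAC
print(f"{macs_per_s / 1e12:.1f} TMAC/s, {pj_per_op:.0f} pJ/OP")   # ~5.4 TMAC/s, ~28 pJ/OP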
What’s Next?
Near-Sensor (aka Edge) DL
[Block diagram: an edge node, battery + harvesting powered, with a power envelope of a few mW.]
Sense: MEMS IMU, MEMS microphone, ULP imager, EMG/ECG/EIT; 100 µW to 2 mW.
Analyze and classify: microcontroller (e.g. Cortex-M) with IOs and L2 memory; performance points range from 1-25 MOPS to 1-2000 MOPS, all within 1-10 mW.
Transmit: long-range/low-bandwidth or short-range/higher-bandwidth links; low-rate (periodic) data out, SW updates and commands in; idle ~1 µW, active ~50 mW.
Does it Matter?
3x cost reduction if the data volume is reduced by 95%.
Origami: A CNN Accelerator
FP is not needed: 12-bit signals are sufficient.
From input to classification, the double-precision-vs-12-bit accuracy loss is < 0.5% (80.6% to 80.1%).
https://arxiv.org/abs/1512.04295
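As an illustration, a minimal sketch of the fixed-point format involved, Q2.9 (1 sign bit, 2 integer bits, 9 fractional bits = 12 bits; this matches the Q2.9 label used in the comparison table later, but the helper itself is hypothetical):

import numpy as np

def quantize_q2_9(x):
    # Q2.9: 12-bit two's complement with 9 fractional bits.
    scale = 2 ** 9
    q = np.clip(np.round(x * scale), -2 ** 11, 2 ** 11 - 1)
    return q / scale                           # back to reals, for comparison

x = np.random.randn(5)
print(x)
print(quantize_q2_9(x))                        # per-value error is at most 2**-10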
CNNs: typical workload
Example: ResNet-34 classifies 224x224 images into 1000 classes at roughly trained-human-level performance, with ~21M parameters and ~3.6G MAC operations per image.
Performance for 10 fps: ~73 GOP/s.
Scaling Origami to 28nm FDSOI: ~2300 GOPS/W energy efficiency (0.4 pJ/OP), i.e. 10 fps ResNet-34 at ~32 mW.
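These figures are mutually consistent; a quick check (counting 1 MAC as 2 operations; plain Python):

gop_per_s = 2 * 3.6e9 * 10 / 1e9               # 3.6G MACs/frame at 10 fps
mw = gop_per_s / 2300 * 1000                   # power at 2300 GOPS/W
print(f"{gop_per_s:.0f} GOP/s, {mw:.0f} mW, {1000 / 2300:.1f} pJ/OP")
# -> 72 GOP/s, 31 mW, 0.4 pJ/OP, matching the quoted ~73 GOP/s and ~32 mW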
Pushing Further: YodaNN¹
Approximation on the algorithmic side: binary weights.
BinaryConnect [Courbariaux, NIPS15], XNOR-Net [Rastegari, arXiv16]
Reduce each weight to a binary value, -1/+1.
The accuracy loss is significant, but the gap is closing: ResNet-18 on ImageNet reaches 83.0% (binary-weight) vs. 89.2% (single-precision) top-5 accuracy, and 60.8% vs. 69.3% top-1 accuracy.
Ultra-optimized HW is possible! (A binarization sketch follows below.)
¹ After the Jedi Master from Star Wars: "small in size but wise and powerful" (cit. www.starwars.com)
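A minimal sketch of binary-weight quantization in the BinaryConnect / XNOR-Net style (the per-filter scale alpha = mean(|W|) follows the XNOR-Net recipe; the names are illustrative):

import numpy as np

def binarize_weights(W):
    # Per-filter scale alpha = mean(|W|): the L2-optimal scale for sign(W) (XNOR-Net).
    alpha = np.mean(np.abs(W), axis=(1, 2, 3), keepdims=True)
    Wb = np.where(W >= 0, 1.0, -1.0)           # 1 bit per weight instead of 32
    return alpha, Wb

W = np.random.randn(64, 3, 3, 3)               # a bank of 64 3x3x3 conv filters
alpha, Wb = binarize_weights(W)
W_approx = alpha * Wb                          # inference convolves with alpha * sign(W)
print(alpha.shape, np.unique(Wb))              # (64, 1, 1, 1) [-1. 1.]

With weights reduced to signs, multiplications collapse to additions and subtractions, which is what enables the ultra-optimized hardware of the next slide.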
YodaNN Energy Efficiency
Same area: 832 SoP (sum-of-products) units + all-SCM (standard-cell memory) storage.
12x boost in core energy efficiency (single layer): 0.03 pJ/OP.
https://arxiv.org/abs/1606.05487
Origami, YodaNN vs. Human

                      Human          Origami      Origami      YodaNN
Type                  Analog (bio)   Q2.9 fixed   Q2.9 fixed   Binary-weight
Network               human          ResNet-34    ResNet-18    ResNet-18
Top-1 error [%]       -              21.53        30.7         39.2
Top-5 error [%]       5.1            5.6          10.8         17.0
Hardware              Brain          Origami      Origami      YodaNN
Energy [µJ/img]       100,000(*)     1086         543          31

The «energy-efficient AI» challenge (e.g. Human vs. IBM Watson): game over for humans also in energy-efficient vision? ...Not yet! (Object recognition is a super-simple task.)
(*) Pbrain = 10 W with 10% of the brain used for vision, and a trained human working at 10 img/sec: 1 W / 10 img/s = 0.1 J/img = 100,000 µJ/img.
Thanks!!!
www.pulp-platform.org
www-micrel.deis.unibo.it/pulp-project
iis-projects.ee.ethz.ch/index.php/PULP
Moral:
Plenty of room at the bottom
Tensor Processing Unit
[Chart: TPU performance; N.B.: comparison vs. a K80 GPU.]
A specialization story:
1. Reduced precision (8-bit)
2. Dedicated memory architecture