
The Deep Learning Revolution - T3LAB


Page 1: The Deep Learning Revolution - T3LAB


Luca Benini

The Deep Learning Revolution Why now?

http://www.pulp-platform.org

Page 2: The Deep Learning Revolution - T3LAB


Machine Learning Frenzy

Page 3: The Deep Learning Revolution - T3LAB


Yes, but Why?

First, it was machine vision…

[Figure: ILSVRC classification error by year, with human-level error marked; deep-learning entries appear as green dots]

ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

ImageNet training data: 10 million hand-annotated images (object in picture); bounding boxes are also provided for 1 million images.

Page 4: The Deep Learning Revolution - T3LAB


Then, speech recognition…

English training data:
▪ 11,940 hours
▪ 8 million utterances
▪ ...and growing every day

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin (Baidu Research) 2016

Page 5: The Deep Learning Revolution - T3LAB


And automatic translation…

Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (2016)

The En→Fr training set contains 36M sentence pairs.

Page 6: The Deep Learning Revolution - T3LAB


Only Research Milestones?

Growing Use of Deep Learning at Google [J. Dean]

Across many products/areas: Android, Apps, drug discovery, Gmail, image understanding, Maps, natural language understanding, Photos, robotics research, Speech, Translation, YouTube, ...many others...

[Chart: # of directories containing model description files, growing over time]

RankBrain is the third most useful signal for ranking pages (it extracts a “purified” search from what you type). While humans correctly guessed the top-ranking pages 70% of the time, RankBrain had an 80% success rate.

Page 7: The Deep Learning Revolution - T3LAB


Only a Google Thing?

[Charts: Deep Learning market size and DL software revenue forecasts]

Page 8: The Deep Learning Revolution - T3LAB


Only for Big Guys?

Page 9: The Deep Learning Revolution - T3LAB


What is it?

Page 10: The Deep Learning Revolution - T3LAB


Anatomy of a Neural Network

From Stanford cs231n lecture notes

[Figure: a biological neuron alongside its artificial counterpart, a three-input neuron with weights w1, w2, w3]

Artificial neuron: y = F(w1*x1 + w2*x2 + w3*x3), with F(x) = max(0, x) (ReLU)

Inference: given x, compute y with fixed weights w.

Training (aka backpropagation): given x and the target y, adjust w so that the network output gets “close” to y, as measured by a fitness (loss) function.

Don’t get too close in a single shot – make smooth adjustments with some randomization (stochastic gradient descent).
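A minimal sketch of both steps (my own illustration; the NumPy code, weights, and learning rate are not from the slides):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def infer(w, x):
    # Inference: given x, compute y with fixed weights w.
    return relu(np.dot(w, x))

def sgd_step(w, x, y_target, lr=0.01):
    # One stochastic-gradient step on the loss 0.5 * (y - y_target)**2.
    z = np.dot(w, x)
    y = relu(z)
    relu_grad = 1.0 if z > 0 else 0.0   # derivative of max(0, z)
    grad_w = (y - y_target) * relu_grad * x
    return w - lr * grad_w              # smooth adjustment, not a single shot

w = np.array([0.5, -0.3, 0.8])          # hypothetical initial weights
x = np.array([1.0, 2.0, 0.5])
print(infer(w, x))                      # y = F(w1*x1 + w2*x2 + w3*x3)
w = sgd_step(w, x, y_target=1.0)
```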

Page 11: The Deep Learning Revolution - T3LAB


Deep neural networks (DNNs)

Input → Result

Application components:
▪ Task objective, e.g. identify a face
▪ Training data: 10-100M images
▪ Network architecture: ~10 layers, ~1B parameters
▪ Learning algorithm: ~30 exaflops, ~30 GPU-days

Raw data → low-level features → mid-level features → high-level features

Page 12: The Deep Learning Revolution - T3LAB


Layer-by-layer computation

Key operation is a dense matrix-vector product (M x V).

Training works by recursively backpropagating the error of the fitness function through the layers.
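A sketch of what layer-by-layer inference looks like as repeated dense M x V products (layer sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two dense layers with made-up sizes: 128 -> 64 -> 10.
W1 = rng.standard_normal((64, 128))
W2 = rng.standard_normal((10, 64))

x = rng.standard_normal(128)
for W in (W1, W2):
    x = np.maximum(0.0, W @ x)   # dense M x V product, then nonlinearity
print(x.shape)                   # (10,)
```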

Page 13: The Deep Learning Revolution - T3LAB


Batching for training

Also used for latency-insensitive inference.

Batched operation boosts re-use of weights: without batching, each element of the weight matrix would be used only once per fetch.

Target 10-50 arithmetic operations per memory fetch to avoid constantly waiting for data from memory, as illustrated below.
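A back-of-the-envelope illustration (matrix and batch sizes are my own) of how batching raises the number of MACs per weight fetch toward that 10-50 target:

```python
# Dense layer with an M x N weight matrix processed with batch size B.
M, N = 1024, 1024
for B in (1, 16, 64):
    macs = M * N * B          # total multiply-accumulates for the batch
    fetches = M * N           # each weight is fetched once, reused B times
    print(f"batch={B:3d}: {macs // fetches} MACs per weight fetch")
```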

Page 14: The Deep Learning Revolution - T3LAB



CNN Computation: main kernel (per layer)

Deep Convolutional NNs:
▪ Filters are shared across the whole image plane (weight sharing)
▪ MAC-dominated – even without batching
▪ Can be cast as a matrix multiply
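A sketch of the standard im2col trick that casts a convolution layer as a matrix multiply (shapes are illustrative; this is not any particular accelerator's implementation):

```python
import numpy as np

def im2col(x, k):
    # x: (C, H, W) input plane; k: filter size. Returns (C*k*k, out_h*out_w).
    C, H, W = x.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = np.empty((C * k * k, out_h * out_w))
    idx = 0
    for i in range(out_h):
        for j in range(out_w):
            cols[:, idx] = x[:, i:i+k, j:j+k].ravel()
            idx += 1
    return cols

C, H, W, K, k = 3, 8, 8, 16, 3
x = np.random.randn(C, H, W)
filters = np.random.randn(K, C * k * k)   # K filters, shared across the plane
y = filters @ im2col(x, k)                # the MAC-dominated matrix multiply
print(y.shape)                            # (16, 36) = (K, out_h*out_w)
```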

Page 15: The Deep Learning Revolution - T3LAB


Recipe for Deep Learning

http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learning-explained-in-a-single-powerpoint-slide/

Page 16: The Deep Learning Revolution - T3LAB


Recipe for Deep Learning

http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learning-explained-in-a-single-powerpoint-slide/

Don’t forget overfitting!

Preventing overfitting:
▪ Modify the network
▪ Better optimization strategy

Page 17: The Deep Learning Revolution - T3LAB


The Singularity of DL

Results get better with: more data + bigger models + more computation

Page 18: The Deep Learning Revolution - T3LAB


The Singularity of DL

Results get better with: more data + bigger models + more computation

Better algorithms, new insights and improved techniques always help, too!

Page 19: The Deep Learning Revolution - T3LAB


GPUs are Great for DL

Why? Because they are good at matrix multiplication: 90% utilization is achievable (on lots of “cores”).

Pascal GP100:
▪ 3840 “cores”, 3840 MAC/cycle @ 1.4 GHz
▪ 5.3 TMACS (FP) @ 300 W
▪ 28 pJ/OP
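A quick arithmetic check of these numbers (all values taken from the slide; 2 ops per MAC assumed):

```python
cores = 3840                             # one FP MAC per "core" per cycle
freq_hz = 1.4e9
macs_per_s = cores * freq_hz             # ~5.4e12, i.e. the quoted 5.3 TMACS
power_w = 300.0
pj_per_op = power_w / (2 * macs_per_s) * 1e12   # 2 ops (mul+add) per MAC
print(f"{macs_per_s / 1e12:.1f} TMAC/s, {pj_per_op:.0f} pJ/op")  # ~28 pJ/op
```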

Page 20: The Deep Learning Revolution - T3LAB


What’s Next?

Page 21: The Deep Learning Revolution - T3LAB


Near-Sensor (aka Edge) DL

Battery + harvesting powered: a few mW power envelope.

▪ Transmit: long range at low BW, short range at higher BW. Low-rate (periodic) data out; SW updates and commands in. Idle: ~1 µW, active: ~50 mW.
▪ Sense: MEMS IMU, MEMS microphone, ULP imager (100 µW - 2 mW), EMG/ECG/EIT.
▪ Analyze and classify: µController (e.g. Cortex-M, 1-25 MOPS @ 1-10 mW) with L2 memory and IOs; the analysis workload calls for 1-2000 MOPS within 1-10 mW.

Page 22: The Deep Learning Revolution - T3LAB


Does it Matter?

3x cost reduction if data volume is reduced by 95%.

Page 23: The Deep Learning Revolution - T3LAB


Origami: A CNN Accelerator

FP not needed: 12-bit signals are sufficient. From input to classification, the double-precision vs. 12-bit accuracy loss is < 0.5% (80.6% to 80.1%).

https://arxiv.org/abs/1512.04295
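A sketch of the kind of 12-bit fixed-point quantization involved, assuming a Q2.9-style format (1 sign, 2 integer, 9 fraction bits) like the one quoted later in the deck; the helper names are mine:

```python
import numpy as np

def to_fixed(x, frac_bits=9, total_bits=12):
    # Q2.9-style quantization: round to 1/512 steps, clip to the 12-bit range.
    scale = 2.0 ** frac_bits
    lo, hi = -2 ** (total_bits - 1), 2 ** (total_bits - 1) - 1
    return np.clip(np.round(x * scale), lo, hi).astype(np.int16)

def to_float(q, frac_bits=9):
    return q.astype(np.float64) / 2.0 ** frac_bits

x = np.random.randn(1000) * 0.5
err = np.abs(to_float(to_fixed(x)) - x).max()
print(f"max round-trip error: {err:.2e}")   # bounded by ~2^-10
```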

Page 24: The Deep Learning Revolution - T3LAB


CNNs: typical workload

Example: ResNet-34
▪ classifies 224x224 images into 1000 classes
▪ ~ trained-human-level performance
▪ ~ 21M parameters
▪ ~ 3.6G MAC operations per image

Performance for 10 fps: ~73 GOPS
Energy efficiency: ~2300 GOPS/W (~0.4 pJ/OP)

Scaling Origami to 28nm FDSOI: 10 fps ResNet-34 within ~32 mW.
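A quick check that the quoted figures are mutually consistent (all inputs from the slide; 2 ops per MAC assumed):

```python
macs_per_img = 3.6e9              # ResNet-34 MACs per image (from the slide)
fps = 10
gops = 2 * macs_per_img * fps / 1e9         # 2 ops per MAC
gops_per_w = 2300.0                         # Origami @ 28nm FDSOI (slide value)
print(f"{gops:.0f} GOPS")                   # ~72, matching the quoted ~73 GOPS
print(f"{gops / gops_per_w * 1e3:.0f} mW")  # ~31, matching the quoted ~32 mW
print(f"{1e3 / gops_per_w:.2f} pJ/op")      # ~0.43, the quoted ~0.4 pJ/OP
```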

Page 25: The Deep Learning Revolution - T3LAB


Pushing Further: YodaNN¹

Approximation on the algorithmic side: binary weights.
BinaryConnect [Courbariaux, NIPS15], XNOR-Net [Rastegari, arXiv16]

Reduce each weight to a binary value: -1/+1.

Accuracy loss is significant, but the gap is closing: ResNet-18 on ImageNet reaches 83.0% (binary-weight) vs. 89.2% (single-precision) top-5 accuracy, and 60.8% vs. 69.3% top-1 accuracy.

Ultra-optimized HW is possible!

¹After the Jedi Master from Star Wars – “Small in size but wise and powerful” cit. www.starwars.com
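A minimal sketch of binary-weight quantization in the BinaryConnect/XNOR-Net spirit, with a per-filter scaling factor alpha = mean(|W|); this is my simplification, not the YodaNN datapath:

```python
import numpy as np

def binarize(W):
    alpha = np.abs(W).mean()              # real-valued per-filter scale
    Wb = np.where(W >= 0, 1.0, -1.0)      # weights collapse to {-1, +1}
    return Wb, alpha

W = np.random.randn(64)                   # hypothetical filter weights
x = np.random.randn(64)
Wb, alpha = binarize(W)
# Multiplies by the weights become sign flips plus additions.
print(W @ x, alpha * (Wb @ x))            # exact vs. binary-weight approximation
```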

Page 26: The Deep Learning Revolution - T3LAB


YodaNN Energy Efficiency

Same area: 832 SoP units + all-SCM (standard-cell memory).

12x boost in core energy efficiency (single layer): 0.03 pJ/OP.

https://arxiv.org/abs/1606.05487

Page 27: The Deep Learning Revolution - T3LAB


Origami, YodaNN vs. Human

Hardware            Brain          Origami         Origami         YodaNN
Type                Analog (bio)   Q2.9 precision  Q2.9 precision  Binary-weight
Network             human          ResNet-34       ResNet-18       ResNet-18
Top-1 error [%]     -              21.53           30.7            39.2
Top-5 error [%]     5.1            5.6             10.8            17.0
Energy [µJ/img]     100,000(*)     1086            543             31

The «energy-efficient AI» challenge (e.g. Human vs. IBM Watson).

Game over for humans also in energy-efficient vision?

.... Not yet! (object recognition is a super-simple task)

(*) Pbrain = 10 W, 10% of the brain used for vision, trained human working at 10 img/s: 1 W / (10 img/s) = 100,000 µJ/img

Page 28: The Deep Learning Revolution - T3LAB


Thanks!!!

www.pulp-platform.org

www-micrel.deis.unibo.it/pulp-project

iis-projects.ee.ethz.ch/index.php/PULP

Moral:

Plenty of room at the bottom

Page 29: The Deep Learning Revolution - T3LAB


Tensor Processing Unit

NB: compared vs. K80 GPU.

A specialization story:
1. Reduced precision (8-bit)
2. Dedicated memory architecture
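A sketch of the flavor of arithmetic point 1 implies: 8-bit quantized operands with 32-bit accumulation (the quantization scheme and code are illustrative, not the TPU's actual pipeline):

```python
import numpy as np

def quantize_u8(x):
    # Affine quantization of a float tensor to uint8 (scale/zero-point mine).
    scale = (x.max() - x.min()) / 255.0
    zero = x.min()
    q = np.round((x - zero) / scale).astype(np.uint8)
    return q, scale, zero

A = np.random.randn(4, 8).astype(np.float32)
B = np.random.randn(8, 4).astype(np.float32)
Aq, sa, za = quantize_u8(A)
Bq, sb, zb = quantize_u8(B)
# 8-bit operands, 32-bit accumulation (dequantization omitted for brevity).
acc = Aq.astype(np.int32) @ Bq.astype(np.int32)
print(acc)
```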