Squeezing Deep Learning into Mobile Phones - A Practitioner's Guide
Anirudh Koul

Page 1: Squeezing Deep Learning Into Mobile Phones

Squeezing Deep Learning into Mobile Phones
- A Practitioner's Guide
Anirudh Koul

Page 2: Squeezing Deep Learning Into Mobile Phones


Anirudh Koul, @anirudhkoul, http://koul.ai
Project Lead, Seeing AI
Applied Researcher, Microsoft AI & Research
akoul at microsoft dot com

Currently working on applying artificial intelligence for productivity, augmented reality and accessibility, along with Eugene Seleznev, Saqib Shaikh, and Meher Kasam

Page 3: Squeezing Deep Learning Into Mobile Phones

Why Deep Learning On Mobile?


Latency

Privacy

Page 4: Squeezing Deep Learning Into Mobile Phones

Mobile Deep Learning Recipe

(Efficient) Mobile Inference Engine + (Efficient) Pretrained Model = DL App

Page 5: Squeezing Deep Learning Into Mobile Phones

Building a DL App in _ time

Page 6: Squeezing Deep Learning Into Mobile Phones

Building a DL App in 1 hour

Page 7: Squeezing Deep Learning Into Mobile Phones

Use Cloud APIs

Microsoft Cognitive Services
Clarifai
Google Cloud Vision
IBM Watson Services
Amazon Rekognition

Page 8: Squeezing Deep Learning Into Mobile Phones

Microsoft Cognitive Services

Models won the 2015 ImageNet Large Scale Visual Recognition Challenge
Vision, Face, Emotion, Video and 21 other topics

Page 9: Squeezing Deep Learning Into Mobile Phones

Building a DL App in 1 day

Page 10: Squeezing Deep Learning Into Mobile Phones

http://deeplearningkit.org/2015/12/28/deeplearningkit-deep-learning-for-ios-tested-on-iphone-6s-tvos-and-os-x-developed-in-metal-and-swift/

[Chart: energy to train a Convolutional Neural Network vs energy to use (run) a Convolutional Neural Network]

Page 11: Squeezing Deep Learning Into Mobile Phones

Base PreTrained Model

ImageNet – 1000-category object classifier
Inception
ResNet

Page 12: Squeezing Deep Learning Into Mobile Phones

Running pre-trained models on mobile

MXNet
Tensorflow
CNNDroid
DeepLearningKit
Caffe
Torch

Page 13: Squeezing Deep Learning Into Mobile Phones

MXNET

Amalgamation: pack all the code into a single source file

Pros:
• Cross-platform (iOS, Android), easy porting
• Usable from any programming language

Cons:
• CPU only, slow

Sample app: https://github.com/Leliana/WhatsThis

Page 14: Squeezing Deep Learning Into Mobile Phones

Tensorflow

Easy pipeline to bring Tensorflow models to mobile
Great documentation
Optimizations to bring models to mobile
Upcoming: XLA (Accelerated Linear Algebra) compiler to optimize for hardware

Page 15: Squeezing Deep Learning Into Mobile Phones

CNNdroid

GPU-accelerated CNNs for Android
Supports Caffe, Torch and Theano models
~30-40x speedup using the mobile GPU vs the CPU (AlexNet)

Internally, CNNdroid expresses data parallelism for the different layers itself, instead of leaving it to the GPU's hardware scheduler

Page 16: Squeezing Deep Learning Into Mobile Phones

DeepLearningKit

Platform: iOS, OS X and tvOS (Apple TV)
DNN type: CNNs trained in Caffe
Runs on the mobile GPU, uses Metal

Pro: fast, directly ingests Caffe models
Con: unmaintained

Page 17: Squeezing Deep Learning Into Mobile Phones

Caffe

Caffe for Android: https://github.com/sh1r0/caffe-android-lib
Sample app: https://github.com/sh1r0/caffe-android-demo

Caffe for iOS: https://github.com/aleph7/caffe
Sample app: https://github.com/noradaiko/caffe-ios-sample

Pro: usually only a couple of lines to port a pretrained model to the mobile CPU
Con: unmaintained

Page 18: Squeezing Deep Learning Into Mobile Phones

Running pre-trained models on mobile

Mobile Library | Platform | GPU | DNN Architectures Supported | Trained Models Supported
Tensorflow | iOS/Android | Yes | CNN, RNN, LSTM, etc. | Tensorflow
CNNDroid | Android | Yes | CNN | Caffe, Torch, Theano
DeepLearningKit | iOS | Yes | CNN | Caffe
MXNet | iOS/Android | No | CNN, RNN, LSTM, etc. | MXNet
Caffe | iOS/Android | No | CNN | Caffe
Torch | iOS/Android | No | CNN, RNN, LSTM, etc. | Torch

Page 19: Squeezing Deep Learning Into Mobile Phones

Building a DL App in 1 week

Page 20: Squeezing Deep Learning Into Mobile Phones

Learning to play an accordion: 3 months

Page 21: Squeezing Deep Learning Into Mobile Phones

Learning to play an accordion: 3 months

Already knows the piano? Fine-tune those skills: 1 week

Page 22: Squeezing Deep Learning Into Mobile Phones

I got a dataset, Now What?

Step 1: Find a pre-trained model
Step 2: Fine-tune the pre-trained model
Step 3: Run it using an existing framework

“Don’t Be A Hero” - Andrej Karpathy

Page 23: Squeezing Deep Learning Into Mobile Phones

How to find pretrained models for my task?

Search “Model Zoo”

Microsoft Cognitive Toolkit (previously called CNTK) – 50 models
Caffe Model Zoo
Keras
Tensorflow
MXNet

Page 24: Squeezing Deep Learning Into Mobile Phones

AlexNet, 2012 (simplified)

[Krizhevsky, Sutskever, Hinton ’12]

Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng, “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”, 2011

[Figure: n-dimensional feature representation]

Page 25: Squeezing Deep Learning Into Mobile Phones

Deciding how to fine tune

Size of New Dataset | Similarity to Original Dataset | What to do?
Large | High | Fine-tune.
Small | High | Don't fine-tune (it will overfit); train a linear classifier on CNN features.
Small | Low | Train a classifier from activations in lower layers; higher layers are specific to the original dataset.
Large | Low | Train the CNN from scratch.

http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html
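As a toy illustration, the decision table above can be captured in a tiny helper function (the function name and strategy strings are my own, not from the slides):

```python
def finetune_strategy(dataset_size, similarity):
    """Map (dataset size, similarity to the original dataset) to a
    transfer-learning strategy, following the table above.
    dataset_size is 'large'/'small'; similarity is 'high'/'low'."""
    table = {
        ("large", "high"): "fine-tune the pre-trained model",
        ("small", "high"): "train a linear classifier on CNN features",
        ("small", "low"):  "train a classifier on lower-layer activations",
        ("large", "low"):  "train the CNN from scratch",
    }
    return table[(dataset_size, similarity)]
```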


Page 29: Squeezing Deep Learning Into Mobile Phones

Building a DL Website in 1 week

Page 30: Squeezing Deep Learning Into Mobile Phones

Less Data + Smaller Networks = Faster browser training


Page 31: Squeezing Deep Learning Into Mobile Phones

Several JavaScript Libraries

Run large CNNs:
• Keras-JS
• MXNetJS
• CaffeJS

Train and run CNNs:
• ConvNetJS

Train and run LSTMs:
• Brain.js
• Synaptic.js

Train and run NNs:
• Mind.js
• DN2A

Page 32: Squeezing Deep Learning Into Mobile Phones

ConvNetJS

Train and test NNs in the browser, including training CNNs in the browser

Page 33: Squeezing Deep Learning Into Mobile Phones

Keras.js


Run Keras models in browser, with GPU support.

Page 34: Squeezing Deep Learning Into Mobile Phones

Brain.JS

Train and run NNs in the browser
Supports feedforward networks, RNN, LSTM, GRU
No CNNs
Demo: http://brainjs.com/

[Demo: a trained NN recognizing color contrast]

Page 35: Squeezing Deep Learning Into Mobile Phones

MXNetJS

On Firefox and Microsoft Edge, performance is 8x faster than on Chrome, an optimization difference due to ASM.js.

Page 36: Squeezing Deep Learning Into Mobile Phones

Building a DL App in 1 month

(and get featured in the Apple App Store)

Page 37: Squeezing Deep Learning Into Mobile Phones

Response Time Limits – Powers of 10

0.1 second: reacting instantly
1.0 second: the limit for the user's flow of thought
10 seconds: the limit for keeping the user's attention

[Miller 1968; Card et al. 1991; Jakob Nielsen 1993]

Page 38: Squeezing Deep Learning Into Mobile Phones

Apple frameworks for Deep Learning Inference

BNNS – Basic Neural Network Subroutines
MPS – Metal Performance Shaders

Page 39: Squeezing Deep Learning Into Mobile Phones

Metal Performance Shaders (MPS)

Fast; provides GPU acceleration for the inference phase
Faster app load times than Tensorflow (Jan 2017)
About 1/3rd the runtime memory of Tensorflow on Inception-V3 (Jan 2017)
~130 ms on iPhone 7 Plus to run Inception-V3

Cons:
• Limited documentation
• No easy way to programmatically port models
• No batch normalization. Solution: join the Conv and BatchNorm weights
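Joining Conv and BatchNorm weights works because batch norm at inference time is an affine transform, so it folds into the preceding convolution: w' = w * gamma / sqrt(var + eps) and b' = (b - mean) * gamma / sqrt(var + eps) + beta. A minimal single-weight sketch, with made-up parameter values:

```python
import math

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    # Scale factor contributed by batch norm for this channel
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

# Toy per-channel parameters (hypothetical values)
w, b = 0.5, 0.1
gamma, beta, mean, var = 1.2, -0.3, 0.05, 0.8

w_f, b_f = fold_bn_into_conv(w, b, gamma, beta, mean, var)

# The folded convolution matches conv followed by batch norm
x = 2.0
conv_then_bn = gamma * ((w * x + b) - mean) / math.sqrt(var + 1e-5) + beta
folded = w_f * x + b_f
assert abs(conv_then_bn - folded) < 1e-9
```

The same algebra applies per output channel of a real convolution, which is why frameworks can drop the BatchNorm layer entirely after folding.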

Page 40: Squeezing Deep Learning Into Mobile Phones


Putting out more frames than an art gallery

Page 41: Squeezing Deep Learning Into Mobile Phones

Basic Neural Network Subroutines (BNNS)

Runs on the CPU

BNNS is faster than MPS for smaller networks, but slower for bigger networks

Page 42: Squeezing Deep Learning Into Mobile Phones

BrainCore

NN framework for iOS
Provides LSTM functionality
Fast; uses Metal, runs on the iPhone GPU
https://github.com/aleph7/braincore

Page 43: Squeezing Deep Learning Into Mobile Phones

Building a DL App in 6 months

Page 44: Squeezing Deep Learning Into Mobile Phones

What you want: $200,000
What you can afford: $2,000

Images: https://www.flickr.com/photos/kenjonbro/9075514760/ and http://www.newcars.com/land-rover/range-rover-sport/2016

Page 45: Squeezing Deep Learning Into Mobile Phones

[Figure: AlexNet, 8 layers (ILSVRC 2012) – 11x11, 5x5 and 3x3 convolutions followed by three fully connected layers]

Revolution of Depth

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun, “Deep Residual Learning for Image Recognition”, 2015

Page 46: Squeezing Deep Learning Into Mobile Phones

[Figure: AlexNet, 8 layers (ILSVRC 2012) vs VGG, 19 layers (ILSVRC 2014) vs GoogleNet, 22 layers (ILSVRC 2014)]

Revolution of Depth

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun, “Deep Residual Learning for Image Recognition”, 2015

Page 47: Squeezing Deep Learning Into Mobile Phones

[Figure: AlexNet, 8 layers (ILSVRC 2012) vs VGG, 19 layers (ILSVRC 2014) vs ResNet, 152 layers (ILSVRC 2015) – ultra deep]

Revolution of Depth

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun, “Deep Residual Learning for Image Recognition”, 2015

Page 48: Squeezing Deep Learning Into Mobile Phones

[Figure: ResNet, 152 layers – a 7x7 stem followed by repeated 1x1/3x3/1x1 bottleneck blocks, then average pooling and fc 1000]

Revolution of Depth

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun, “Deep Residual Learning for Image Recognition”, 2015

Page 49: Squeezing Deep Learning Into Mobile Phones

Revolution of Depth – ImageNet classification top-5 error (%):

ILSVRC'10: 28.2 (shallow)
ILSVRC'11: 25.8 (shallow)
ILSVRC'12 AlexNet: 16.4 (8 layers)
ILSVRC'13: 11.7 (8 layers)
ILSVRC'14 VGG: 7.3 (19 layers)
ILSVRC'14 GoogleNet: 6.7 (22 layers)
ILSVRC'15 ResNet: 3.57 (152 layers)

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun, “Deep Residual Learning for Image Recognition”, 2015

Page 50: Squeezing Deep Learning Into Mobile Phones

Your Budget - Smartphone Floating Point Operations Per Second (2015)

http://pages.experts-exchange.com/processing-power-compared/

Page 51: Squeezing Deep Learning Into Mobile Phones

Accuracy vs Operations Per Image Inference

[Chart: accuracy vs operations per image inference; marker size is proportional to the number of parameters. VGG-16 is 552 MB, AlexNet is 240 MB. “What we want”: high accuracy with few operations]

Alfredo Canziani, Adam Paszke, Eugenio Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications”, 2016

Page 52: Squeezing Deep Learning Into Mobile Phones

Accuracy Per Parameter

Alfredo Canziani, Adam Paszke, Eugenio Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications”, 2016

Page 53: Squeezing Deep Learning Into Mobile Phones

Pick your DNN Architecture for your mobile architecture

ResNet family

Under 150 ms on iPhone 7 using the Metal GPU

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, “Deep Residual Learning for Image Recognition”, 2015

Page 54: Squeezing Deep Learning Into Mobile Phones

Strategies to make DNNs even more efficient

Shallow networks
Compressing pre-trained networks
Designing compact layers
Quantizing parameters
Network binarization

Page 55: Squeezing Deep Learning Into Mobile Phones

Pruning

Aim: remove all connections whose absolute weight is below a threshold

Song Han, Jeff Pool, John Tran, William J. Dally, "Learning both Weights and Connections for Efficient Neural Networks", 2015
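Magnitude pruning is simple to sketch in pure Python (the weights and threshold below are arbitrary illustrative values, not from the paper):

```python
def prune(weights, threshold):
    """Zero out connections whose absolute weight falls below threshold."""
    return [0.0 if abs(w) < threshold else w for w in weights]

weights = [0.01, -0.4, 0.002, 0.9, -0.05, 0.3]
pruned = prune(weights, threshold=0.1)
# Fraction of connections removed; stored sparsely, these cost nothing
sparsity = pruned.count(0.0) / len(pruned)
```

In practice the pruned network is then retrained for a few epochs so the surviving weights compensate for the removed connections.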

Page 56: Squeezing Deep Learning Into Mobile Phones

Observation : Most parameters in Fully Connected Layers

AlexNet (240 MB): 96% of all parameters are in the fully connected layers
VGG-16 (552 MB): 90% of all parameters are in the fully connected layers

Page 57: Squeezing Deep Learning Into Mobile Phones

Pruning gets quickest model compression without accuracy loss

AlexNet (240 MB), VGG-16 (552 MB)

The first layer, which directly interacts with the image, is sensitive and cannot be pruned much without hurting accuracy

Page 58: Squeezing Deep Learning Into Mobile Phones

Weight Sharing

Idea: cluster weights with similar values together, and store them in a dictionary.

Codebooks
Huffman coding
HashedNets

Simplest implementation:
• Round all weights to 256 levels
• Tensorflow's export script reduces the Inception zip file from 87 MB to 26 MB, with a 1% drop in precision
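The "round all weights to 256 levels" idea amounts to linear quantization: store each weight as an 8-bit index into a uniformly spaced range, plus a shared offset and step size. A sketch of that idea (not Tensorflow's actual export script):

```python
def quantize(weights, levels=256):
    """Map each float weight to an integer index in [0, levels-1]."""
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / (levels - 1)
    idx = [round((w - lo) / step) for w in weights]
    return idx, lo, step

def dequantize(idx, lo, step):
    """Recover approximate float weights from indices."""
    return [lo + i * step for i in idx]

weights = [0.12, -0.53, 0.77, 0.0, -0.98]
idx, lo, step = quantize(weights)
restored = dequantize(idx, lo, step)
# Rounding error is bounded by half a quantization step
max_err = max(abs(w - r) for w, r in zip(weights, restored))
assert max_err <= step / 2 + 1e-12
```

The 8-bit indices compress much better (and can be Huffman coded further), which is where the 87 MB to 26 MB reduction comes from.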

Page 59: Squeezing Deep Learning Into Mobile Phones

Selective training to keep networks shallow

Idea: limit data augmentation to how your network will actually be used

Example: for a selfie app, there is no benefit in rotating training images beyond ±45 degrees; the phone will rotate the image anyway. This approach is followed by WordLens / Google Translate.

Example: add blur if analyzing mobile phone camera frames

Page 60: Squeezing Deep Learning Into Mobile Phones

Design consideration for custom architectures – Small Filters

Three layers of 3x3 convolutions >> one layer of 7x7 convolutions

Replace large 5x5 and 7x7 convolutions with stacks of 3x3 convolutions
Replace NxN convolutions with a stack of 1xN and Nx1 convolutions
⇒ Fewer parameters ⇒ Less compute ⇒ More non-linearity

Better, Faster, Stronger

Andrej Karpathy, CS-231n Notes, Lecture 11
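The parameter savings are easy to check by counting weights: a 7x7 convolution with C input and C output channels costs 49·C² weights, three stacked 3x3 convolutions (same 7x7 receptive field) cost 27·C², and a 1x7 plus 7x1 pair costs 14·C²:

```python
def conv_params(k_h, k_w, c_in, c_out):
    # Weights in a single convolution layer (ignoring biases)
    return k_h * k_w * c_in * c_out

C = 256  # example channel count
one_7x7 = conv_params(7, 7, C, C)                                # 49 * C*C
three_3x3 = 3 * conv_params(3, 3, C, C)                          # 27 * C*C
stacked_1xN = conv_params(1, 7, C, C) + conv_params(7, 1, C, C)  # 14 * C*C

assert three_3x3 < one_7x7
assert stacked_1xN < one_7x7
```

The stacked versions also interleave extra non-linearities between the layers, which is the "more non-linearity" point above.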

Page 61: Squeezing Deep Learning Into Mobile Phones

SqueezeNet - AlexNet-level accuracy in 0.5 MB

SqueezeNet base: 4.8 MB
SqueezeNet compressed: 0.5 MB

80.3% top-5 accuracy on ImageNet
0.72 GFLOPS/image

[Figure: Fire block]

Forrest N. Iandola, Song Han et al., “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size”

Page 62: Squeezing Deep Learning Into Mobile Phones

Reduced precision

Reduce precision from 32 bits to 16 bits or fewer
Use stochastic rounding for best results

In practice:
• Ristretto + Caffe: automatic network quantization, finds a balance between compression rate and accuracy
• Apple Metal Performance Shaders automatically quantize to 16 bits
• Tensorflow has 8-bit quantization support
• gemmlowp – low-precision matrix multiplication library
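Stochastic rounding rounds a value up with probability equal to its fractional part, so the result is unbiased in expectation, which is why it behaves better than truncation at low precision. A pure-Python sketch:

```python
import math
import random

def stochastic_round(x):
    # Round up with probability equal to the fractional part of x,
    # so E[stochastic_round(x)] == x
    floor = math.floor(x)
    return floor + (1 if random.random() < x - floor else 0)

random.seed(0)  # fixed seed so the sketch is repeatable
samples = [stochastic_round(2.3) for _ in range(10000)]
mean = sum(samples) / len(samples)  # hovers near 2.3
assert all(s in (2, 3) for s in samples)
assert abs(mean - 2.3) < 0.05
```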

Page 63: Squeezing Deep Learning Into Mobile Phones

Binary weighted Networks

Idea: reduce the weights to -1 and +1
Speedup: the convolution operation can be approximated using only summation and subtraction

Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
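In a binary-weight network, each filter keeps a single scaling factor alpha = mean(|W|) and replaces its weights with their signs, so a dot product needs only additions and subtractions plus one final multiply. A toy sketch (the weight and input values are hypothetical):

```python
def binarize(weights):
    """Return the scaling factor alpha and the sign pattern of the weights."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs

def binary_dot(alpha, signs, inputs):
    # Only additions/subtractions of inputs, one multiply by alpha at the end
    return alpha * sum(x if s > 0 else -x for s, x in zip(signs, inputs))

weights = [0.6, -0.4, 0.5, -0.7]
inputs = [1.0, 2.0, -1.0, 0.5]
alpha, signs = binarize(weights)
approx = binary_dot(alpha, signs, inputs)                # approximation
exact = sum(w * x for w, x in zip(weights, inputs))      # full-precision
```

The approximation is lossy per filter, but the paper shows accuracy holds up surprisingly well at network scale while the model shrinks ~32x.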


Page 66: Squeezing Deep Learning Into Mobile Phones

XNOR-Net

Idea: reduce both the weights and the inputs to -1 and +1
Speedup: the convolution operation can be approximated using XNOR and bitcount operations

Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
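With both weights and inputs constrained to ±1, a dot product over n positions equals n minus twice the number of mismatching positions, which is one XOR and a bit count. A pure-Python sketch using integers as bit vectors (a toy, not the paper's GPU kernel):

```python
def pack(values):
    # Encode +1 as bit 1 and -1 as bit 0 in a single integer
    bits = 0
    for i, v in enumerate(values):
        if v > 0:
            bits |= 1 << i
    return bits

def xnor_dot(a_bits, b_bits, n):
    # Mismatching positions contribute -1, matches +1:
    # dot = n - 2 * popcount(a XOR b)
    return n - 2 * bin(a_bits ^ b_bits).count("1")

w = [1, -1, 1, 1, -1]
x = [1, 1, -1, 1, -1]
dot = xnor_dot(pack(w), pack(x), len(w))
exact = sum(a * b for a, b in zip(w, x))
assert dot == exact
```

Packing 32 or 64 ±1 values per machine word is what makes binary convolutions so fast on commodity hardware.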


Page 69: Squeezing Deep Learning Into Mobile Phones

XNOR-Net on Mobile


Page 70: Squeezing Deep Learning Into Mobile Phones

Building a DL App and getting $10 million in funding

(or a PhD)

Page 71: Squeezing Deep Learning Into Mobile Phones


Page 72: Squeezing Deep Learning Into Mobile Phones

Minerva


Page 74: Squeezing Deep Learning Into Mobile Phones

DeepX Toolkit

Nicholas D. Lane et al., “DXTK: Enabling Resource-efficient Deep Learning on Mobile and Embedded Devices with the DeepX Toolkit”, 2016

Page 75: Squeezing Deep Learning Into Mobile Phones

EIE : Efficient Inference Engine on Compressed DNNs

Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark Horowitz, William Dally, “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, 2016

189x faster than CPU, 13x faster than GPU

Page 76: Squeezing Deep Learning Into Mobile Phones

One Last Question

Page 77: Squeezing Deep Learning Into Mobile Phones

How to access the slides in 1 second

Link posted here -> @anirudhkoul