Squeezing Deep Learning into mobile phones - A Practitioner's Guide
Anirudh Koul
Anirudh Koul, @anirudhkoul, http://koul.ai
Project Lead, Seeing AI
Applied Researcher, Microsoft AI & Research
Akoul at Microsoft dot com
Currently working on applying artificial intelligence for productivity, augmented reality and accessibility
Along with Eugene Seleznev, Saqib Shaikh, Meher Kasam
Why Deep Learning On Mobile?
Latency
Privacy
Mobile Deep Learning Recipe
Mobile Inference Engine (Efficient) + Pretrained Model (Efficient) = DL App
Building a DL App in _ time
Building a DL App in 1 hour
Use Cloud APIs
Microsoft Cognitive Services
Clarifai
Google Cloud Vision
IBM Watson Services
Amazon Rekognition
Microsoft Cognitive Services
Models won the 2015 ImageNet Large Scale Visual Recognition Challenge
Vision, Face, Emotion, Video and 21 other topics
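These APIs need only a few lines of client code. A minimal sketch against the Cognitive Services vision endpoint; the region, API version, key and file name below are placeholders of mine, not from the deck, so check the current documentation:

import requests

SUBSCRIPTION_KEY = "your-key-here"   # placeholder
# Region and API version are assumptions; adjust to your subscription.
ENDPOINT = "https://westus.api.cognitive.microsoft.com/vision/v1.0/analyze"

def describe_image(image_path):
    # Send raw image bytes and ask for a description plus tags.
    headers = {"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY,
               "Content-Type": "application/octet-stream"}
    params = {"visualFeatures": "Description,Tags"}
    with open(image_path, "rb") as f:
        resp = requests.post(ENDPOINT, headers=headers, params=params, data=f.read())
    resp.raise_for_status()
    return resp.json()

# print(describe_image("photo.jpg"))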
Building a DL App in 1 day
[Chart: Energy to train a Convolutional Neural Network vs. energy to use a Convolutional Neural Network]
Source: http://deeplearningkit.org/2015/12/28/deeplearningkit-deep-learning-for-ios-tested-on-iphone-6s-tvos-and-os-x-developed-in-metal-and-swift/
Base PreTrained Model
ImageNet – 1000 object categorizer
Inception
ResNet
Running pre-trained models on mobile
MXNet
Tensorflow
CNNDroid
DeepLearningKit
Caffe
Torch
MXNet
Amalgamation: pack all the code into a single source file
Pro:
• Cross-platform (iOS, Android), easy porting
• Usable in any programming language
Con:
• CPU only, slow
https://github.com/Leliana/WhatsThis
Tensorflow
Easy pipeline to bring Tensorflow models to mobile
Great documentation
Optimizations to bring the model to mobile
Upcoming: XLA (Accelerated Linear Algebra) compiler to optimize for hardware
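As a rough sketch of that pipeline (assuming the TensorFlow 1.x-era tooling this deck refers to; the file names and the "input"/"softmax" node names are hypothetical):

import tensorflow as tf
from tensorflow.python.tools import freeze_graph, optimize_for_inference_lib

# 1. Freeze: bake checkpoint variables into constants inside a single GraphDef.
freeze_graph.freeze_graph(
    input_graph="model.pbtxt", input_saver="", input_binary=False,
    input_checkpoint="model.ckpt", output_node_names="softmax",
    restore_op_name="save/restore_all", filename_tensor_name="save/Const:0",
    output_graph="frozen.pb", clear_devices=True, initializer_nodes="")

# 2. Strip training-only ops so the mobile runtime has less to load.
graph_def = tf.GraphDef()
with tf.gfile.GFile("frozen.pb", "rb") as f:
    graph_def.ParseFromString(f.read())
optimized = optimize_for_inference_lib.optimize_for_inference(
    graph_def, ["input"], ["softmax"], tf.float32.as_datatype_enum)
with tf.gfile.GFile("optimized.pb", "wb") as f:
    f.write(optimized.SerializeToString())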
CNNdroid
GPU-accelerated CNNs for Android
Supports Caffe, Torch and Theano models
~30-40x speedup using the mobile GPU vs CPU (AlexNet)
Internally, CNNdroid expresses data parallelism for the different layers, instead of leaving it to the GPU's hardware scheduler
DeepLearningKit
Platform: iOS, OS X and tvOS (Apple TV)
DNN type: CNN models trained in Caffe
Runs on the mobile GPU, uses Metal
Pro: Fast, directly ingests Caffe models
Con: Unmaintained
Caffe
Caffe for Android: https://github.com/sh1r0/caffe-android-lib
Sample app: https://github.com/sh1r0/caffe-android-demo
Caffe for iOS: https://github.com/aleph7/caffe
Sample app: https://github.com/noradaiko/caffe-ios-sample
Pro: Usually a couple of lines to port a pretrained model to the mobile CPU
Con: Unmaintained
Running pre-trained models on mobile
Mobile Library   | Platform    | GPU | DNN Architectures Supported | Trained Models Supported
Tensorflow       | iOS/Android | Yes | CNN, RNN, LSTM, etc.        | Tensorflow
CNNDroid         | Android     | Yes | CNN                         | Caffe, Torch, Theano
DeepLearningKit  | iOS         | Yes | CNN                         | Caffe
MXNet            | iOS/Android | No  | CNN, RNN, LSTM, etc.        | MXNet
Caffe            | iOS/Android | No  | CNN                         | Caffe
Torch            | iOS/Android | No  | CNN, RNN, LSTM, etc.        | Torch
Building a DL App in 1 week
Learn playing an accordion from scratch: 3 months
Knows piano, fine-tune skills: 1 week
I got a dataset, now what?
Step 1: Find a pre-trained model
Step 2: Fine-tune the pre-trained model
Step 3: Run it using existing frameworks
“Don’t Be A Hero” - Andrej Karpathy
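A minimal Keras sketch of steps 1 and 2 (the InceptionV3 backbone, the 10-class head and the freeze-everything-first policy are illustrative assumptions, not the deck's prescription):

from keras.applications.inception_v3 import InceptionV3
from keras.layers import GlobalAveragePooling2D, Dense
from keras.models import Model

base = InceptionV3(weights="imagenet", include_top=False)   # Step 1: pre-trained model
x = GlobalAveragePooling2D()(base.output)
x = Dense(1024, activation="relu")(x)
predictions = Dense(10, activation="softmax")(x)             # 10 classes: assumption

model = Model(inputs=base.input, outputs=predictions)
for layer in base.layers:        # Step 2: freeze the pre-trained layers and
    layer.trainable = False      # train only the new classifier head first
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
# model.fit(train_images, train_labels, epochs=5)   # your fine-tuning data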
How to find pretrained models for my task?
Search "Model Zoo"
Microsoft Cognitive Toolkit (previously called CNTK) – 50 models
Caffe Model Zoo
Keras
Tensorflow
MXNet
AlexNet, 2012 (simplified)
[Krizhevsky, Sutskever, Hinton '12]
Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng, "Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks", 2011
n-dimensional feature representation
Deciding how to fine tune
Size of New Dataset | Similarity to Original Dataset | What to do?
Large               | High | Fine-tune.
Small               | High | Don't fine-tune (it will overfit); train a linear classifier on CNN features.
Small               | Low  | Train a classifier from activations in lower layers; higher layers are specific to the old dataset.
Large               | Low  | Train the CNN from scratch.
Source: http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html
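For the small-but-similar-dataset row, a minimal sketch of training a linear classifier on CNN features; the InceptionV3 backbone and the train/test arrays are illustrative assumptions:

import numpy as np
from keras.applications.inception_v3 import InceptionV3, preprocess_input
from sklearn.linear_model import LogisticRegression

# Pre-trained CNN used as a fixed feature extractor (global-average-pooled activations).
extractor = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def cnn_features(images):          # images: float array of shape (N, 299, 299, 3)
    return extractor.predict(preprocess_input(images.astype(np.float32)))

clf = LogisticRegression(max_iter=1000)
clf.fit(cnn_features(train_images), train_labels)            # arrays assumed given
# print("accuracy:", clf.score(cnn_features(test_images), test_labels))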
Building a DL Website in 1 week
Less Data + Smaller Networks = Faster browser training
Several JavaScript Libraries
Run large CNNs
• Keras-JS
• MXNetJS
• CaffeJS
Train and run CNNs
• ConvNetJS
Train and run LSTMs
• Brain.js
• Synaptic.js
Train and run NNs
• Mind.js
• DN2A
ConvNetJS
Train and test NNs in the browser
Can also train CNNs in the browser
Keras.js
Run Keras models in browser, with GPU support.
Brain.JS
Train and run NNs in the browser
Supports feedforward networks, RNN, LSTM, GRU
No CNNs
Demo: http://brainjs.com/
Trained NN to recognize color contrast
MXNetJS
On Firefox and Microsoft Edge, performance is about 8x faster than on Chrome, due to differences in ASM.js optimization.
Building a DL App in 1 month
(and get featured in Apple App store)
Response Time Limits – Powers of 10
0.1 second: reacting instantly
1.0 second: user's flow of thought
10 seconds: keeping the user's attention
[Miller 1968; Card et al. 1991; Jakob Nielsen 1993]
Apple frameworks for Deep Learning Inference
BNNS – Basic Neural Network Subroutines
MPS – Metal Performance Shaders
Metal Performance Shaders (MPS)
Fast; provides GPU acceleration for the inference phase
Faster app load times than Tensorflow (Jan 2017)
About 1/3rd the runtime memory of Tensorflow on Inception-V3 (Jan 2017)
~130 ms to run Inception-V3 on an iPhone 7 Plus
Cons:
• Limited documentation
• No easy way to programmatically port models
• No batch normalization. Solution: join the Conv and BatchNorm weights
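A minimal numpy sketch of joining (folding) a BatchNorm layer into the preceding convolution; the channels-first weight layout and the epsilon value are assumptions, and at inference time the folded layer reproduces conv followed by batch norm:

import numpy as np

def fold_batchnorm(conv_w, conv_b, gamma, beta, mean, var, eps=1e-5):
    # conv_w: weights with the output channel on the first axis (assumption)
    # conv_b: per-channel bias (zeros if the conv had none)
    # gamma, beta, mean, var: per-channel BatchNorm parameters
    scale = gamma / np.sqrt(var + eps)
    folded_w = conv_w * scale.reshape(-1, *([1] * (conv_w.ndim - 1)))
    folded_b = (conv_b - mean) * scale + beta
    return folded_w, folded_b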
Putting out more frames than an art gallery
Basic Neural Network Subroutines (BNNS)
Runs on CPU
BNNS is faster than MPS for smaller networks, but slower for bigger networks
BrainCore
NN framework for iOS
Provides LSTM functionality
Fast; uses Metal and runs on the iPhone GPU
https://github.com/aleph7/braincore
Building a DL App in 6 months
What you want: $200,000
What you can afford: $2,000
Image sources: https://www.flickr.com/photos/kenjonbro/9075514760/ and http://www.newcars.com/land-rover/range-rover-sport/2016
Revolution of Depth
[Architecture diagram: AlexNet, 8 layers (ILSVRC 2012)]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun, "Deep Residual Learning for Image Recognition", 2015
Revolution of Depth
[Architecture diagrams: AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); GoogleNet, 22 layers (ILSVRC 2014)]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun, "Deep Residual Learning for Image Recognition", 2015
Revolution of Depth
[Architecture diagrams: AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); ResNet, 152 layers (ILSVRC 2015)]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun, "Deep Residual Learning for Image Recognition", 2015
Revolution of Depth
Ultra deep: ResNet, 152 layers
[Architecture diagram: ResNet, 152 layers]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun, "Deep Residual Learning for Image Recognition", 2015
Revolution of Depth
ImageNet classification top-5 error (%):
ILSVRC'10: 28.2 (shallow)
ILSVRC'11: 25.8 (shallow)
ILSVRC'12 (AlexNet): 16.4 (8 layers)
ILSVRC'13: 11.7 (8 layers)
ILSVRC'14 (VGG): 7.3 (19 layers)
ILSVRC'14 (GoogleNet): 6.7 (22 layers)
ILSVRC'15 (ResNet): 3.57 (152 layers)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun, "Deep Residual Learning for Image Recognition", 2015
Your Budget - Smartphone Floating Point Operations Per Second (2015)
Source: http://pages.experts-exchange.com/processing-power-compared/
Accuracy vs Operations Per Image Inference
Size is proportional to the number of parameters (VGG-16: 552 MB, AlexNet: 240 MB)
"What we want": high accuracy at a low operation count
Alfredo Canziani, Adam Paszke, Eugenio Culurciello, "An Analysis of Deep Neural Network Models for Practical Applications", 2016
Accuracy Per Parameter
Alfredo Canziani, Adam Paszke, Eugenio Culurciello, "An Analysis of Deep Neural Network Models for Practical Applications", 2016
Pick your DNN Architecture for your mobile architecture
ResNet family
Under 150 ms on iPhone 7 using the Metal GPU
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, "Deep Residual Learning for Image Recognition", 2015
Strategies to make DNNs even more efficient
Shallow networks
Compressing pre-trained networks
Designing compact layers
Quantizing parameters
Network binarization
Pruning
Aim: Remove all connections with absolute weights below a threshold
Song Han, Jeff Pool, John Tran, William J. Dally, "Learning both Weights and Connections for Efficient Neural Networks", 2015
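A minimal numpy sketch of that idea (the threshold value is an illustrative assumption; in the paper above the network is retrained after pruning to recover accuracy):

import numpy as np

def prune_weights(w, threshold=0.01):
    # Zero out every connection whose absolute weight is below the threshold
    # and return the pruned weights plus the mask of surviving connections.
    mask = np.abs(w) >= threshold
    return w * mask, mask

# w = np.random.randn(1000, 1000) * 0.05
# pruned, mask = prune_weights(w)
# print("fraction of weights kept:", mask.mean())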
Observation: Most parameters are in the fully connected layers
AlexNet (240 MB): 96% of all parameters are in the fully connected layers
VGG-16 (552 MB): 90% of all parameters are in the fully connected layers
Pruning gives the quickest model compression without accuracy loss (AlexNet: 240 MB, VGG-16: 552 MB)
The first layer, which directly interacts with the image, is sensitive and cannot be pruned much without hurting accuracy
Weight Sharing
Idea: Cluster weights with similar values together and store them in a dictionary
Variants: Codebook, Huffman coding, HashedNets
Simplest implementation:
• Round all weights to 256 levels
• Tensorflow's export script reduces the Inception zip file from 87 MB to 26 MB, with a 1% drop in precision
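A minimal numpy sketch of the round-to-256-levels idea, using a per-tensor linear scale and offset as the implicit codebook (the per-tensor granularity and dtype choices are assumptions):

import numpy as np

def quantize_256(w):
    # Store 8-bit codes plus (lo, scale); reconstruct approximate float weights.
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0                             # assumes hi > lo
    codes = np.round((w - lo) / scale).astype(np.uint8)   # what actually gets stored/zipped
    w_approx = lo + codes.astype(np.float32) * scale
    return codes, w_approx

# codes, w_approx = quantize_256(np.random.randn(1024, 1024).astype(np.float32))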
Selective training to keep networks shallow
Idea: Limit data augmentation to how your network will actually be used
Example: For a selfie app, there is no benefit in rotating training images beyond +-45 degrees; the phone will rotate the image anyway. (Approach followed by WordLens / Google Translate.)
Example: Add blur if analyzing mobile phone camera frames
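A minimal Keras-style sketch of augmentation constrained to the deployment scenario; the +-45 degree limit comes from the selfie example above, while the blur helper and its strength are assumptions:

import numpy as np
from scipy.ndimage import gaussian_filter
from keras.preprocessing.image import ImageDataGenerator

def mild_blur(image):
    # Simulate slightly out-of-focus phone-camera frames (blur spatially, not across channels).
    return gaussian_filter(image, sigma=(1.0, 1.0, 0.0))

datagen = ImageDataGenerator(
    rotation_range=45,                  # no benefit rotating beyond +-45 degrees
    preprocessing_function=mild_blur)   # mimic real capture conditions

# for batch, labels in datagen.flow(train_images, train_labels, batch_size=32):
#     model.train_on_batch(batch, labels)
#     break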
Design consideration for custom architectures – Small Filters
Three layers of 3x3 convolutions are better than a single 7x7 convolution layer
Replace large 5x5 and 7x7 convolutions with stacks of 3x3 convolutions
Replace NxN convolutions with a stack of 1xN and Nx1 convolutions
⇒ Fewer parameters ⇒ Less compute ⇒ More non-linearity
Better, Faster, Stronger
Andrej Karpathy, CS-231n Notes, Lecture 11
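A quick back-of-the-envelope check of the savings (the channel count is an illustrative assumption; counts ignore biases):

# Parameters of a conv layer = kernel_h * kernel_w * in_channels * out_channels
C = 256                              # in = out = C channels: assumption
one_7x7   = 7 * 7 * C * C            # 49 * C^2 parameters, 1 non-linearity
three_3x3 = 3 * (3 * 3 * C * C)      # 27 * C^2 parameters, 3 non-linearities
one_5x5   = 5 * 5 * C * C            # 25 * C^2
two_3x3   = 2 * (3 * 3 * C * C)      # 18 * C^2
print(one_7x7, three_3x3)            # 3211264 vs 1769472: roughly 45% fewer parameters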
SqueezeNet - AlexNet-level accuracy in 0.5 MB
SqueezeNet base: 4.8 MB
SqueezeNet compressed: 0.5 MB
80.3% top-5 accuracy on ImageNet
0.72 GFLOPS/image
Fire Block
Forrest N. Iandola, Song Han et al, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size"
Reduced precision
Reduce precision from 32 bits to 16 bits or fewer
Use stochastic rounding for best results
In practice:
• Ristretto + Caffe: automatic network quantization, finds a balance between compression rate and accuracy
• Apple Metal Performance Shaders automatically quantize to 16 bits
• Tensorflow has 8-bit quantization support
• Gemmlowp – low-precision matrix multiplication library
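A minimal numpy sketch of stochastic rounding onto a coarser grid (the step size is an illustrative assumption); rounding up or down with probability proportional to the distance keeps the quantization unbiased in expectation:

import numpy as np

def stochastic_round(w, step=1.0 / 128):
    scaled = w / step
    lower = np.floor(scaled)
    prob_up = scaled - lower                          # distance to the lower grid point
    rounded = lower + (np.random.rand(*w.shape) < prob_up)
    return rounded * step

# w = np.random.randn(4, 4).astype(np.float32)
# print(stochastic_round(w))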
Binary weighted Networks
Idea: Reduce the weights to -1, +1
Speedup: The convolution operation can be approximated using only summation and subtraction
Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
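A minimal numpy sketch of binary weights with a per-filter scaling factor, following the formulation in the paper cited above (the channels-first layout is an assumption):

import numpy as np

def binarize_weights(w):
    # Approximate w as alpha * sign(w), one scale per output channel (first axis).
    alpha = np.mean(np.abs(w), axis=tuple(range(1, w.ndim)), keepdims=True)
    return alpha * np.sign(w)

# w = np.random.randn(64, 3, 3, 3)
# w_bin = binarize_weights(w)   # convolving with sign(w) needs only adds and subtracts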
XNOR-Net
Idea: Reduce both the weights and the inputs to -1, +1
Speedup: The convolution operation can be approximated by XNOR and bitcount operations
Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
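A minimal sketch of the bit trick on plain Python integers (a real implementation packs bits into machine words; the 8-bit width is an illustrative assumption). For vectors over {-1, +1}, the dot product equals 2 * popcount(XNOR(a, b)) - n:

def xnor_dot(a_bits, b_bits, n_bits=8):
    # a_bits, b_bits: n_bits-wide integers where bit 1 stands for +1 and bit 0 for -1.
    mask = (1 << n_bits) - 1
    agreements = bin(~(a_bits ^ b_bits) & mask).count("1")   # XNOR, then popcount
    return 2 * agreements - n_bits                           # matches minus mismatches

# a = 0b10110010 and b = 0b10010110 agree in 6 of 8 positions -> dot product 4
# print(xnor_dot(0b10110010, 0b10010110))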
XNOR-Net on Mobile
Building a DL App and getting $10 million in funding (or a PhD)
Minerva
DeepX Toolkit
Nicholas D. Lane et al., "DXTK: Enabling Resource-efficient Deep Learning on Mobile and Embedded Devices with the DeepX Toolkit", 2016
EIE: Efficient Inference Engine on Compressed DNNs
Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark Horowitz, William Dally, "EIE: Efficient Inference Engine on Compressed Deep Neural Network", 2016
189x faster than CPU, 13x faster than GPU
One Last Question
How to access the slides in 1 second
Link posted here -> @anirudhkoul