Efficient and Scalable Deep Learning
Wei Wen, Yiran Chen, Hai (Helen) Li, Duke University
{wei.wen,yiran.chen,hai.li}@duke.edu
Efficient Deep Learning for Faster Inference on the Edge and in the Cloud
[Figures: image classification on ImageNet (Alfredo et al. 2016); language modeling on Penn Treebank, test perplexity vs. model size in M parameters, comparing VD-LSTM (Inan et al. 2016), NAS (Zoph & Le 2016), and RHN (Zilly et al. 2016)]
Larger models give more accurate predictions, but at a higher computation cost.
Inference on the edge:
• Limited computing capability
• Limited memory
• Limited battery energy
A single machine: 48 CPUs and 8 GPUs; the distributed version: 1,202 CPUs and 176 GPUs (Silver et al. 2016).
“Prototypes use around 2,500 watts”
Inference in the cloud:
• A challenge: real-time response to trillions of AI service requests
• Facebook (K. Hazelwood @ ISCA’18): 200+ trillion predictions per day; 5+ billion language translations per day
We must compress and accelerate deep learning models.
One of our solutions: Structurally Sparse Deep Neural Networks (SSDNNs)
Group Lasso Regularization is All you Need for SSDNNs
$\arg\min_{w} E(w) = \arg\min_{w} \left\{ E_D(w) + \lambda_g \cdot R_g(w) \right\}$
Step 1: Split the weights into G groups $w^{(1)}, \dots, w^{(G)}$; e.g., $(w_1, w_2, w_3, w_4, w_5)$ → group 1: $(w_1, w_2, w_3)$, group 2: $(w_4, w_5)$.
Step 2: Apply group Lasso (the L2 norm, i.e., the vector length) to each group $w^{(g)}$; e.g., $\sqrt{w_1^2 + w_2^2 + w_3^2}$ and $\sqrt{w_4^2 + w_5^2}$.
Step 3: Sum the group Lasso terms over all groups to form the regularization $R_g(w)$; e.g., $\sqrt{w_1^2 + w_2^2 + w_3^2} + \sqrt{w_4^2 + w_5^2}$.
Step 4: Optimize with SGD.
We refer to our method as Structured Sparsity Learning (SSL)
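To make these steps concrete, here is a minimal PyTorch-style sketch (illustrative only, not the authors' released SSL code; the toy data loss and the value of lambda_g are assumptions), using the five-weight grouping from the example above:

import torch

def group_lasso(w, groups):
    # Steps 2 & 3: L2 norm ("vector length") of each group, summed over all groups.
    return sum(torch.norm(w[g], p=2) for g in groups)

# Step 1: (w1, ..., w5) split into group 1 = (w1, w2, w3) and group 2 = (w4, w5).
w = torch.tensor([0.5, -0.3, 0.1, 0.8, -0.2], requires_grad=True)
groups = [[0, 1, 2], [3, 4]]
lambda_g = 0.01                                    # regularization strength (assumed)

# Step 4: SGD on E(w) = E_D(w) + lambda_g * R_g(w).
optimizer = torch.optim.SGD([w], lr=0.1)
for step in range(100):
    optimizer.zero_grad()
    data_loss = ((w - 1.0) ** 2).mean()            # stand-in for the data loss E_D(w)
    loss = data_loss + lambda_g * group_lasso(w, groups)
    loss.backward()
    optimizer.step()

Because each group's norm enters the loss unsquared, SGD can drive whole groups to (near) zero, which is what makes the resulting sparsity structured rather than scattered.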
Structurally Sparse AlexNet
[Figures: AlexNet per-layer sparsity (conv1–conv5) and FLOP reduction vs. % top-1 error; per-conv-layer speedup vs. sparsity on an Intel Xeon CPU and an Nvidia Titan Black GPU; diagrams of removing columns and rows of the weight matrix and removing 2D convolutions]
Structurally Sparse LSTMs
[Figure: learned structurally sparse LSTMs (LSTM 1 → LSTM 2 → output), ours vs. baseline]
SSL is able to remove filters, channels, neurons, layers, matrix dimensions, hidden states, weight blocks, etc.
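As an illustration of how these structures map onto weight groups (a sketch assuming the standard PyTorch conv-weight layout, not the authors' code):

import torch

# A conv layer's weight has shape (out_channels, in_channels, kH, kW).
W = torch.randn(64, 32, 3, 3, requires_grad=True)

# Filter-wise groups: one group per output filter; zeroing a group removes that filter.
filter_reg = W.reshape(W.shape[0], -1).norm(p=2, dim=1).sum()

# Channel-wise groups: one group per input channel; zeroing a group removes that channel.
channel_reg = W.permute(1, 0, 2, 3).reshape(W.shape[1], -1).norm(p=2, dim=1).sum()

# Adding lambda_f * filter_reg + lambda_c * channel_reg to the training loss pushes
# whole filters / channels toward zero so they can be removed after training.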
Our work on Efficient Deep Learning: NIPS 2016, ICLR 2018, ICCV 2017, ICLR 2017, CVPR 2017, ASP-DAC 2017 (Best Paper Award), DAC 2015 & 2016 (two Best Paper Nominations).
Scalable Deep Learning for Faster Training in Distributed Systems
Challenges of Training Deep Learning Models
(Jeffrey et al. 2012; K. Hazelwood @ ISCA’18)
[Figure: training time vs. # of machines; the parallel portion shrinks while the sequential portion (the bottleneck) limits the total time]
(1) Communication bottleneck
(2) Bad generalization in large-batch SGD (Priya et al. 2017, ResNet-50)
Gradient Quantization for Communication Reduction
ternary gradients (TernGrad)
Parameter server: aggregate the workers' gradients, $\bar{g}_t = \sum_{i=1}^{N} g_t^{(i)}$; each worker $i$ then pulls $\bar{g}_t$ and updates $w_{t+1} \leftarrow w_t - \eta_t \bar{g}_t$.
[Figures: gradient histograms over iterations for a conv layer and an fc layer: (a) original, (b) clipped, (c) ternary, (d) final]
< 2 bits per gradient element (a >16× reduction versus 32-bit floats)
This is challenging because quantization loses precision, while learning relies on small, precise updates.
$\tilde{g}_t = \mathrm{ternarize}(g_t) = s_t \cdot \mathrm{sign}(g_t) \circ b_t$, where $s_t \triangleq \|g_t\|_\infty = \max(\mathrm{abs}(g_t))$,
$P(b_{tk} = 1 \mid g_t) = |g_{tk}| / s_t$, and $P(b_{tk} = 0 \mid g_t) = 1 - |g_{tk}| / s_t$.
An example:
$g_t^{(i)}$: [0.30, −1.20, …, 0.9]
$s_t$: 1.20; signs: [1, −1, …, 1]
$P(b_{tk} = 1 \mid g_t)$: [0.30/1.20, 1.20/1.20, …, 0.9/1.20]
Sampled $b_t$: [0, 1, …, 1], so $\tilde{g}_t^{(i)}$: [0, −1, …, 1] × 1.20
TernGrad: quantizing gradients to reduce communication (NIPS 2017, Oral)
[Figures: training throughput (images/sec) vs. # of GPUs (1–512) for AlexNet, GoogLeNet, and VggNet-A, FP32 vs. TernGrad, on (a) a GPU cluster with Ethernet and a PCI switch and (b) a GPU cluster with InfiniBand and NVLink]
[Figures: (a) top-1 accuracy vs. iteration and (b) training loss vs. iteration, baseline vs. TernGrad; (c) gradient sparsity of TernGrad in fc6]
TernGrad summary:
• No accuracy loss on AlexNet
• < 2% loss on GoogLeNet
TernGrad is now in PyTorch/Caffe2 and adopted in Facebook's AI infrastructure.
W. Wen, Y. Wang, F. Yan, C. Xu, Y. Chen, H. Li, “SmoothOut: Smoothing Out Sharp Minima to Improve Generalization in Deep Learning”, AAAI 2018 submission.
TernGrad is our solution to (1), the communication bottleneck, and comes with a convergence guarantee; SmoothOut is our solution to (2), bad generalization in large-batch SGD.