

Efficient and Scalable Deep Learning

Wei Wen, Yiran Chen, Hai (Helen) Li, Duke University

{wei.wen, yiran.chen, hai.li}@duke.edu

Efficient Deep Learning for Faster Inference on the Edge and in the Cloud

[Figures: (left) image classification on ImageNet, accuracy vs. model size (Alfredo et al. 2016); (right) language modeling on Penn Treebank, test perplexity vs. model size (millions of parameters) for VD LSTM (Inan et al. 2016), NAS (Zoph & Le 2016), and RHN (Zilly et al. 2016).]

Larger models deliver more accurate predictions at the cost of more computation.

Inference on the edge:
• Limited computing capability
• Limited memory
• Limited battery energy


A machine: 48 CPUs and 8 GPUs; a distributed version: 1,202 CPUs and 176 GPUs (Silver et al. 2016).

"Prototypes use around 2,500 watts."

Inference in the cloud:
• A challenge: real-time response to trillions of AI service requests
• Facebook (K. Hazelwood@ISCA'18): 200+ trillion predictions per day; 5+ billion language translations per day

We must compress and accelerate deep learning models.


One of our solutions: Structurally Sparse Deep Neural Networks (SSDNNs)

Group Lasso Regularization Is All You Need for SSDNNs

$$\arg\min_{w} \{ E(w) \} = \arg\min_{w} \{ E_D(w) + \lambda_g \cdot R_g(w) \}$$

Step 1: Split the weights into $G$ groups $w^{(1..G)}$ (e.g., by vector).
Step 2: Apply group Lasso to each group $w^{(g)}$.
Step 3: Sum the group Lasso terms over all groups as the regularization $R_g(w)$.
Step 4: Optimize with SGD.

Example: $(w_1, w_2, w_3, w_4, w_5)$ → group 1: $(w_1, w_2, w_3)$, group 2: $(w_4, w_5)$
Per-group terms: $\sqrt{w_1^2 + w_2^2 + w_3^2}$ and $\sqrt{w_4^2 + w_5^2}$
Regularization: $R_g(w) = \sqrt{w_1^2 + w_2^2 + w_3^2} + \sqrt{w_4^2 + w_5^2}$

We refer to our method as Structured Sparsity Learning (SSL)
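A minimal sketch of Steps 1–4 in PyTorch (the grouping, `lambda_g`, and the zero stand-in for the data loss are illustrative, not the released SSL code):

```python
import torch

def group_lasso(weight, groups):
    """Sum of L2 norms over the given groups of weight entries (Steps 2-3)."""
    flat = weight.reshape(-1)
    return sum(flat[idx].norm(p=2) for idx in groups)

# The poster's example: (w1..w5) split into group 1 = (w1, w2, w3), group 2 = (w4, w5).
w = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0], requires_grad=True)
groups = [torch.tensor([0, 1, 2]), torch.tensor([3, 4])]

lambda_g = 1e-4                    # illustrative regularization strength
data_loss = torch.tensor(0.0)      # stands in for the data loss E_D(w)
loss = data_loss + lambda_g * group_lasso(w, groups)
loss.backward()                    # Step 4: SGD then updates w as usual
```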

Structurally Sparse AlexNet

[Figures: per-layer sparsity and FLOP reduction of AlexNet conv1–conv5 vs. top-1 error (41.5%–44%); AlexNet per-layer sparsity and measured conv speedups on an Intel Xeon CPU and an Nvidia Titan Black GPU.]

[Diagrams: removing columns and rows of the weight matrix; removing 2D convolutions.]

Structurally Sparse LSTMs

[Diagram: a stacked network (LSTM 1 → LSTM 2 → output) and the learned structurally sparse LSTMs (ours).]

SSL is able to remove filters, channels, neurons, layers, matrix dimensions, hidden states, weight blocks, etc.
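As a sketch of how such structured groups can be defined for a convolutional layer (again PyTorch with illustrative shapes; one group per output filter or per input channel, so that a zeroed group removes that whole structure):

```python
import torch

def filter_wise_group_lasso(conv_weight):
    # conv_weight: (out_channels, in_channels, kH, kW); one group per output filter,
    # so a zeroed group removes an entire filter (and its output feature map).
    return conv_weight.flatten(start_dim=1).norm(p=2, dim=1).sum()

def channel_wise_group_lasso(conv_weight):
    # One group per input channel, so a zeroed group removes an entire input channel.
    return conv_weight.transpose(0, 1).flatten(start_dim=1).norm(p=2, dim=1).sum()

conv = torch.nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3)
reg = filter_wise_group_lasso(conv.weight) + channel_wise_group_lasso(conv.weight)
```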

Our publications on efficient deep learning: NIPS 2016, ICLR 2018, ICCV 2017, ICLR 2017, CVPR 2017, ASP-DAC 2017 (Best Paper Award), and DAC 2015 & 2016 (two Best Paper Nominations).

Scalable Deep Learning for Faster Training in Distributed Systems

Challenges of Training Deep Learning Models

[Pictures: Google Cloud TPU and data-center training clusters (Jeffrey et al. 2012; K. Hazelwood@ISCA'18).]

[Chart: total training time vs. number of machines — the parallel portion shrinks, but the sequential portion becomes the bottleneck.]

Two limitations:
(1) Communication bottleneck
(2) Poor generalization in large-batch SGD (Priya et al. 2017, ResNet-50)

Gradient Quantization for Communication Reduction

ternary gradients (TernGrad)

[Diagram: a parameter server with N workers. The server aggregates the gradients, $g_t = \sum_i g_t^{(i)}$; each worker then applies the same SGD update $w_{t+1} \leftarrow w_t - \eta_t \, g_t$.]

[Figure: gradient histograms of conv and fc layers over iterations — (a) original, (b) clipped, (c) ternary, (d) final.]

< 2 bits per gradient element (> 16× reduction vs. 32-bit floats).

This is challenging because quantization loses precision, while learning relies on small gradient updates.

$$\tilde{g}_t = \mathrm{ternarize}(g_t) = s_t \cdot \mathrm{sign}(g_t) \circ b_t, \qquad s_t \triangleq \|g_t\|_\infty \triangleq \max(\mathrm{abs}(g_t))$$
$$P(b_{tk} = 1 \mid g_t) = |g_{tk}|/s_t, \qquad P(b_{tk} = 0 \mid g_t) = 1 - |g_{tk}|/s_t$$

An example:
$g_t^{(i)}$: [0.30, -1.20, …, 0.90]
$s_t$ = 1.20; signs: [1, -1, …, 1]
$P(b_{tk} = 1 \mid g_t)$: [0.30/1.20, 1.20/1.20, …, 0.90/1.20]
Sampled $b_t$: [0, 1, …, 1]
Ternarized $\tilde{g}_t^{(i)}$: [0, -1, …, 1] × 1.20

TernGrad: quantizing gradients to reduce communication (NIPS 2017, Oral)

[Figures: training throughput (images/sec) vs. number of GPUs (1–512) for AlexNet, GoogLeNet, and VggNet-A with FP32 vs. TernGrad, on (a) a GPU cluster with Ethernet and a PCI switch and (b) a GPU cluster with InfiniBand and NVLink.]

[Figures: baseline vs. TernGrad — (a) top-1 accuracy vs. iteration, (b) training loss vs. iteration, (c) gradient sparsity of TernGrad in fc6.]

TernGrad summary:
✓ No accuracy loss on AlexNet
✓ <2% accuracy loss on GoogLeNet

TernGrad is now in PyTorch/Caffe2 and adopted in Facebook's AI infrastructure.

W. Wen, Y. Wang, F. Yan, C. Xu, Y. Chen, H. Li, "SmoothOut: Smoothing Out Sharp Minima to Improve Generalization in Deep Learning", AAAI 2018 submission.

TernGrad is our solution to limitation (1), the communication bottleneck, with convergence guaranteed; SmoothOut is our solution to limitation (2), poor generalization in large-batch SGD.