

Efficient and Scalable Deep Learning

Wei Wen, Yiran Chen, Hai (Helen) Li, Duke University

{wei.wen, yiran.chen, hai.li}@duke.edu

Efficient Deep Learning for Faster Inference on the Edge and in the Cloud

[Figures: (left) image classification on ImageNet, accuracy vs. model size (Alfredo et al. 2016); (right) language modeling on Penn Treebank, test perplexity vs. model size (millions of parameters) for VD LSTM (Inan et al. 2016), NAS (Zoph & Le 2016), and RHN (Zilly et al. 2016).]

Larger models deliver more accurate predictions at the cost of more computation.

Inference on the edge:
• Limited computing capability
• Limited memory
• Limited battery energy


A machine: 48 CPUs and 8 GPUs; a distributed version: 1,202 CPUs and 176 GPUs (Silver et al. 2016).

"Prototypes use around 2,500 watts."

Inference in the cloud:
• A challenge: real-time response to trillions of AI service requests
• Facebook (K. Hazelwood@ISCA'18): 200+ trillion predictions per day; 5+ billion language translations per day

We must compress and accelerate deep learning models.


One of our solutions: Structurally Sparse Deep Neural Networks (SSDNNs)

Group Lasso Regularization Is All You Need for SSDNNs

$$\arg\min_{w} \{ E(w) \} = \arg\min_{w} \{ E_D(w) + \lambda_g \cdot R_g(w) \}$$

Step 1: Split the weights into $G$ groups $w^{(1..G)}$ (e.g., by vector).
Step 2: Apply group Lasso to each group $w^{(g)}$.
Step 3: Sum the group Lasso terms over all groups as the regularization $R_g(w)$.
Step 4: Optimize with SGD.

Example: $(w_1, w_2, w_3, w_4, w_5)$ → group 1: $(w_1, w_2, w_3)$, group 2: $(w_4, w_5)$
Per-group terms: $\sqrt{w_1^2 + w_2^2 + w_3^2}$ and $\sqrt{w_4^2 + w_5^2}$
Regularization: $R_g(w) = \sqrt{w_1^2 + w_2^2 + w_3^2} + \sqrt{w_4^2 + w_5^2}$

We refer to our method as Structured Sparsity Learning (SSL)
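A minimal sketch of Steps 1–4 in PyTorch (the grouping, `lambda_g`, and the zero stand-in for the data loss are illustrative, not the released SSL code):

```python
import torch

def group_lasso(weight, groups):
    """Sum of L2 norms over the given groups of weight entries (Steps 2-3)."""
    flat = weight.reshape(-1)
    return sum(flat[idx].norm(p=2) for idx in groups)

# The poster's example: (w1..w5) split into group 1 = (w1, w2, w3), group 2 = (w4, w5).
w = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0], requires_grad=True)
groups = [torch.tensor([0, 1, 2]), torch.tensor([3, 4])]

lambda_g = 1e-4                    # illustrative regularization strength
data_loss = torch.tensor(0.0)      # stands in for the data loss E_D(w)
loss = data_loss + lambda_g * group_lasso(w, groups)
loss.backward()                    # Step 4: SGD then updates w as usual
```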

Structurally Sparse AlexNet

[Figures: per-layer sparsity and FLOP reduction of AlexNet conv1–conv5 vs. top-1 error (41.5%–44%); AlexNet per-layer sparsity and measured conv speedups on an Intel Xeon CPU and an Nvidia Titan Black GPU.]

[Diagrams: removing columns and rows of the weight matrix; removing 2D convolutions.]

Structurally Sparse LSTMs

[Diagram: a stacked network (LSTM 1 → LSTM 2 → output) and the learned structurally sparse LSTMs (ours).]

SSL is able to remove filters, channels, neurons, layers, matrix dimensions, hidden states, weight blocks, etc.
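As a sketch of how such structured groups can be defined for a convolutional layer (again PyTorch with illustrative shapes; one group per output filter or per input channel, so that a zeroed group removes that whole structure):

```python
import torch

def filter_wise_group_lasso(conv_weight):
    # conv_weight: (out_channels, in_channels, kH, kW); one group per output filter,
    # so a zeroed group removes an entire filter (and its output feature map).
    return conv_weight.flatten(start_dim=1).norm(p=2, dim=1).sum()

def channel_wise_group_lasso(conv_weight):
    # One group per input channel, so a zeroed group removes an entire input channel.
    return conv_weight.transpose(0, 1).flatten(start_dim=1).norm(p=2, dim=1).sum()

conv = torch.nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3)
reg = filter_wise_group_lasso(conv.weight) + channel_wise_group_lasso(conv.weight)
```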

Our publications on efficient deep learning: NIPS 2016, ICLR 2018, ICCV 2017, ICLR 2017, CVPR 2017, ASP-DAC 2017 (Best Paper Award), and DAC 2015 & 2016 (two Best Paper Nominations).

Scalable Deep Learning for Faster Training in Distributed Systems

Challenges of Training Deep Learning Models

[Pictures: Google Cloud TPU and data-center training clusters (Jeffrey et al. 2012; K. Hazelwood@ISCA'18).]

[Chart: total training time vs. number of machines — the parallel portion shrinks, but the sequential portion becomes the bottleneck.]

Two limitations:
(1) Communication bottleneck
(2) Poor generalization in large-batch SGD (Priya et al. 2017, ResNet-50)

Gradient Quantization for Communication Reduction

ternary gradients (TernGrad)

[Diagram: a parameter server with N workers. The server aggregates the gradients, $g_t = \sum_i g_t^{(i)}$; each worker then applies the same SGD update $w_{t+1} \leftarrow w_t - \eta_t \, g_t$.]

[Figure: gradient histograms of conv and fc layers over iterations — (a) original, (b) clipped, (c) ternary, (d) final.]

< 2 bits per gradient element (> 16× reduction vs. 32-bit floats).

This is challenging because quantization loses precision, while learning relies on small gradient updates.

$$\tilde{g}_t = \mathrm{ternarize}(g_t) = s_t \cdot \mathrm{sign}(g_t) \circ b_t, \qquad s_t \triangleq \|g_t\|_\infty \triangleq \max(\mathrm{abs}(g_t))$$
$$P(b_{tk} = 1 \mid g_t) = |g_{tk}|/s_t, \qquad P(b_{tk} = 0 \mid g_t) = 1 - |g_{tk}|/s_t$$

An example:
$g_t^{(i)}$: [0.30, -1.20, …, 0.90]
$s_t$ = 1.20; signs: [1, -1, …, 1]
$P(b_{tk} = 1 \mid g_t)$: [0.30/1.20, 1.20/1.20, …, 0.90/1.20]
Sampled $b_t$: [0, 1, …, 1]
Ternarized $\tilde{g}_t^{(i)}$: [0, -1, …, 1] × 1.20

TernGrad: quantizing gradients to reduce communication (NIPS 2017, Oral)

[Figures: training throughput (images/sec) vs. number of GPUs (1–512) for AlexNet, GoogLeNet, and VggNet-A with FP32 vs. TernGrad, on (a) a GPU cluster with Ethernet and a PCI switch and (b) a GPU cluster with InfiniBand and NVLink.]

[Figures: baseline vs. TernGrad — (a) top-1 accuracy vs. iteration, (b) training loss vs. iteration, (c) gradient sparsity of TernGrad in fc6.]

TernGrad summary:
✓ No accuracy loss on AlexNet
✓ <2% accuracy loss on GoogLeNet

TernGrad is now in PyTorch/Caffe2 and adopted in Facebook's AI infrastructure.

W. Wen, Y. Wang, F. Yan, C. Xu, Y. Chen, H. Li, "SmoothOut: Smoothing Out Sharp Minima to Improve Generalization in Deep Learning", AAAI 2018 submission.

TernGrad is our solution to limitation (1), the communication bottleneck, with convergence guaranteed; SmoothOut is our solution to limitation (2), poor generalization in large-batch SGD.