Scaling Distributed Machine Learning with System and Algorithm Co-design

Mu Li
Thesis Defense
CSD, CMU
Feb 2nd, 2017


$$\min_w \sum_{i=1}^{n} f_i(w)$$

Large-scale problems

✧ Distributed systems
✧ Large scale optimization methods

Large Scale Machine Learning

✧ Machine learning learns from data
✧ More data
  ✓ better accuracy
  ✓ can use more complex models

[Figure: accuracy versus data size, with a curve labeled "More complex models"]

Ads Click Prediction

✧ Predict if an ad will be clicked
✧ Each ad impression is an example
✧ Logistic regression
  ✓ Single machine processes 1 million examples per second
✧ A typical industrial size problem has
  ✓ 100 billion examples
  ✓ 10 billion unique features

[Figure: training data size (TB) by year, 2010 to 2014; y-axis from 0 to 700]
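A back-of-the-envelope check of why a single machine falls short, using only the two figures quoted on the slide (1 million examples per second, 100 billion examples). The variable names are illustrative.

```python
# Figures taken from the slide; the names are illustrative.
examples_per_second = 1_000_000       # single-machine throughput
num_examples = 100_000_000_000        # typical industrial problem size

seconds_per_pass = num_examples / examples_per_second
hours_per_pass = seconds_per_pass / 3600

print(f"one pass over the data: {hours_per_pass:.1f} hours")
```

An iterative solver needs many such passes, so a single machine would take days on a problem of this size.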

Image Recognition

✧ Recognize the object in an image
✧ Convolutional neural network
✧ A state-of-the-art network
  ✓ Hundreds of layers
  ✓ Billions of floating-point operations for processing a single image

Distributed Computing for Large Scale Problems

✧ Distribute workload among many machines
✧ Widely available thanks to cloud providers (AWS, GCP, Azure)
✧ Challenges
  ✓ Limited communication bandwidth (10x less than memory bandwidth)
  ✓ Large synchronization cost (1ms latency)
  ✓ Job failures

[Figure: machines connected through a hierarchy of switches]

[Figure: failure rate (%), 0 to 24, versus machine time (hour), 100 to 10000]

Distributed Optimization for Large Scale ML

$$\min_w \sum_{i=1}^{n} f_i(w)$$

[Figure: the examples are partitioned into index sets with $\bigcup_{i=1}^{m} I_i = \{1, 2, \dots, n\}$; the parts compute $\sum_{i \in I_1} \partial f_i(w_t)$, $\sum_{i \in I_2} \partial f_i(w_t)$, …, $\sum_{i \in I_m} \partial f_i(w_t)$, and the partial sums are added to move $w_t$ to $w_{t+1}$]

✧ Challenges
  ✓ Massive communication traffic
  ✓ Expensive global synchronization
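The partition-and-sum scheme on this slide can be sketched in a few lines. This is a single-process NumPy illustration, not the thesis implementation: the least-squares loss, the sizes n, d, m, and the step size eta are all assumptions chosen for the example.

```python
import numpy as np

def partial_gradient(X, y, w):
    """Sum of per-example gradients of 0.5 * (x_i . w - y_i)^2 over one partition."""
    return X.T @ (X @ w - y)

rng = np.random.default_rng(0)
n, d, m = 1200, 5, 4                  # examples, features, machines (illustrative)
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w_t = np.zeros(d)

# Partition the example indices into m disjoint sets I_1, ..., I_m.
parts = np.array_split(rng.permutation(n), m)

# Each "machine" sums the gradients of its own examples at the same w_t.
partials = [partial_gradient(X[I], y[I], w_t) for I in parts]

# Aggregation: add the partial sums, then take one gradient step.
full_gradient = np.sum(partials, axis=0)
eta = 1e-4
w_next = w_t - eta * full_gradient
```

Every machine must see the same w_t and the sum must wait for the slowest machine, which is where the communication and synchronization costs named on the slide come from.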

Scaling Distributed Machine Learning

Distributed Systems
  ✓ Large data size, complex models
  ✓ Fault tolerant
  ✓ Easy to use

Large Scale Optimization
  ✓ Communication efficient
  ✓ Convergence guarantee

Scaling Distributed Machine Learning

Distributed Systems
  ✧ Parameter Server for machine learning
  ✧ MXNet for deep learning

Large Scale Optimization
  ✧ DBPG for non-convex non-smooth $f_i$
  ✧ EMSO for efficient minibatch SGD

Existing Open Source Systems in 2012

✧ MPI (message passing interface)
  ✓ Hard to use for sparse problems
  ✓ No fault tolerance
✧ Key-value store, e.g. Redis
  ✓ Expensive individual key-value pair communication
  ✓ Difficult to program on the server side
✧ Hadoop/Spark
  ✓ BSP data consistency makes efficient implementation challenging

Parameter Server Architecture [Smola’10, Dean’12]

[Figure: training data is split across worker machines and the model is held on server machines. Workers push to the servers, the servers update the model, and workers pull from the servers]
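The push, update, pull cycle in the diagram can be mimicked with a toy in-memory table. This is a hypothetical single-process sketch: the class name `ToyServer`, the plain gradient-descent update rule, and the feature IDs are assumptions made for illustration, not the interface of the actual system.

```python
import numpy as np

class ToyServer:
    """Holds the model as a key -> value table (hypothetical, single process)."""

    def __init__(self, learning_rate):
        self.weights = {}
        self.learning_rate = learning_rate

    def push(self, keys, grads):
        # "update": apply the pushed gradients to the stored weights.
        for k, g in zip(keys, grads):
            self.weights[k] = self.weights.get(k, 0.0) - self.learning_rate * g

    def pull(self, keys):
        # Return the current values; keys never seen before start at zero.
        return np.array([self.weights.get(k, 0.0) for k in keys])

server = ToyServer(learning_rate=0.1)
keys = [3, 17, 42]                           # feature IDs used by this worker
w_before = server.pull(keys)                 # pull: all zeros initially
server.push(keys, grads=[1.0, -2.0, 0.5])    # push: worker sends gradients
w_after = server.pull(keys)                  # pull: updated weights
```

Workers only exchange the keys that appear in their own data, which is what makes the pattern attractive for sparse models.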

Key Features of our Implementation [Li et al, OSDI’14]

✧ Trade off data consistency for speed
  ✓ Flexible consistency models
  ✓ User-defined filters
✧ Fault tolerance with chain replication

Flexible Consistency Model

[Figure: two timelines, each showing iter 0 and iter 1 as a "gradient" step followed by a "push & pull" step. In the first, iter 1 executes after its dependency has finished; in the second, there is no dependency]

✧ Flexible models via task dependency graph
  ✓ Sequential / BSP (tasks 1 to 4)
  ✓ Eventual / Total asynchronous [Smola 10] (tasks 1 to 4)
  ✓ Bounded delay / SSP [Langford 09, Cipar 13] (tasks 1 to 5)
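The three consistency models can be viewed as one rule with a single knob. The sketch below assumes the following reading of bounded delay, which the slide does not spell out: with maximal delay tau, task t may start once every task up to t - tau - 1 has finished. Under that assumption tau = 0 gives the sequential model and an infinite tau gives the eventual model. The function name `can_start` is hypothetical.

```python
import math

def can_start(task, finished, max_delay):
    """Return True if `task` may start under the assumed bounded-delay rule.

    Assumed rule: task t waits until all tasks with index
    <= t - max_delay - 1 are in `finished`.
    """
    if math.isinf(max_delay):
        return True                   # eventual consistency: never wait
    last_required = task - int(max_delay) - 1
    return all(i in finished for i in range(last_required + 1))

finished = {0, 1}
sequential = can_start(3, finished, max_delay=0)      # needs tasks 0, 1, 2
bounded = can_start(3, finished, max_delay=1)         # needs tasks 0, 1
eventual = can_start(3, set(), max_delay=math.inf)    # needs nothing
```

With tasks 0 and 1 finished, task 3 is blocked under the sequential model but allowed with a delay of 1.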

User-defined Filters

✧ User defined encoder/decoder for efficient communication
✧ Lossless compression
  ✓ General data compression: LZ, LZR, ..
✧ Lossy compression
  ✓ Random skip
  ✓ Fixed-point encoding

[Figure: data on one machine passes through an encoder, is communicated efficiently, and passes through a decoder to become data on the other machine]
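An encoder/decoder pair that chains a lossy and a lossless filter might look like the sketch below. It is an illustration under assumptions, not the system's filter API: the 8-bit fixed-point format with a per-message scale is one possible encoding, and Python's `zlib` (DEFLATE, an LZ77-based scheme) stands in for the LZ-family compressors named on the slide.

```python
import zlib
import numpy as np

LEVELS = 127  # largest magnitude representable in a signed 8-bit integer

def encode(values):
    """Lossy 8-bit fixed-point encoding followed by lossless compression."""
    scale = float(np.max(np.abs(values))) or 1.0   # avoid dividing by zero
    quantized = np.round(values / scale * LEVELS).astype(np.int8)
    return scale, zlib.compress(quantized.tobytes())

def decode(scale, payload):
    """Inverse of encode, up to the quantization error."""
    quantized = np.frombuffer(zlib.decompress(payload), dtype=np.int8)
    return quantized.astype(np.float64) * scale / LEVELS

rng = np.random.default_rng(0)
grad = rng.normal(size=1000)          # stand-in for a gradient block

scale, payload = encode(grad)
restored = decode(scale, payload)
max_error = float(np.max(np.abs(grad - restored)))
```

The payload is much smaller than the 8000 bytes of the original float64 array, and the reconstruction error is bounded by half a quantization step, scale / (2 * LEVELS).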

Fault Tolerance with Chain Replication

✧ Model is partitioned by consistent hashing
✧ Chain replication
✧ Option: aggregation reduces backup traffic

[Figure: worker 0 pushes to server 0, which pushes to server 1; the acks travel back along the same path]

[Figure: with aggregation, worker 0 and worker 1 both push to server 0, which pushes to server 1; acks are returned to server 0 and to both workers]
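The push and ack flow of the first diagram can be imitated with two in-memory replicas. This is a hypothetical single-process sketch: the class `ReplicaServer`, the additive update, and the key name are assumptions, and failure handling and aggregation are left out.

```python
class ReplicaServer:
    """One link in a replication chain (hypothetical, single process)."""

    def __init__(self, name, backup=None):
        self.name = name
        self.backup = backup          # next server in the chain, if any
        self.store = {}

    def push(self, key, delta):
        # Apply the update locally, then forward it down the chain.
        self.store[key] = self.store.get(key, 0.0) + delta
        if self.backup is not None and not self.backup.push(key, delta):
            return False
        return True                   # ack only after every replica applied it

server1 = ReplicaServer("server 1")
server0 = ReplicaServer("server 0", backup=server1)

ack = server0.push("w[7]", 0.5)       # worker 0 pushes to the head of the chain
```

Because the ack is returned only after the backup has applied the update, an acknowledged push survives the loss of either server.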


Proximal Gradient Method [Combettes’09]

$$\min_{w \in \Omega} \sum_{i=1}^{n} f_i(w) + h(w)$$

  ✓ $f_i$: continuously differentiable but not necessarily convex
  ✓ $h$: convex but possibly non-smooth

✧ Iterative update

$$w_{t+1} = \mathrm{Prox}_{\gamma_t}\left[ w_t - \eta_t \sum_{i=1}^{n} \nabla f_i(w_t) \right]$$

where

$$\mathrm{Prox}_{\eta}(x) := \operatorname*{argmin}_{y \in \Omega} \; h(y) + \frac{1}{2\eta} \|x - y\|^2$$
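For a concrete instance of the update, take $h(w) = \lambda \|w\|_1$ and $\Omega = \mathbb{R}^d$, for which the proximal operator has the closed form known as soft-thresholding. The sketch below makes those choices, uses the same step size for the gradient step and the proximal step, and picks a simple quadratic for the smooth part; all of these are assumptions made for illustration.

```python
import numpy as np

def prox_l1(x, eta, lam):
    """Prox of h(y) = lam * ||y||_1 with step eta: soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - eta * lam, 0.0)

def proximal_gradient(grad_f, w0, eta, lam, num_iters):
    """Repeat w <- Prox[w - eta * grad_f(w)] for a fixed number of steps."""
    w = w0.copy()
    for _ in range(num_iters):
        w = prox_l1(w - eta * grad_f(w), eta, lam)
    return w

# Smooth part f(w) = 0.5 * ||w - b||^2, whose gradient is w - b.
b = np.array([3.0, 0.2, -1.5])
w = proximal_gradient(lambda v: v - b, np.zeros(3), eta=1.0, lam=0.5, num_iters=1)
```

With a unit step the gradient step lands exactly on b, and soft-thresholding then shrinks each coordinate toward zero by 0.5, setting the small middle coordinate to exactly zero. This is how the L1 term produces sparse solutions.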

Delayed Block Proximal Gradient [Li et al, NIPS’14]

✧ Algorithm design tailored for parameter server implementation
  ✓ Update a block of coordinates each time
  ✓ Allow delay among blocks
  ✓ Use filters during communication
✧ Only 300 lines of code

[Figure: data and model]

Convergence Analysis

✧ Assumptions:
  ✓ Block Lipschitz continuity: $L_{\text{var}}$ within block, $L_{\text{cor}}$ cross blocks
  ✓ Delay is bounded by $\tau$
  ✓ Lossy compressions such as random skip filter and significantly-modified filter
✧ DBPG converges to a stationary point if the learning rate is chosen as

$$\eta_t < \frac{1}{L_{\text{var}} + \tau L_{\text{cor}}}$$

Experiments on Ads Click Prediction

✧ Real dataset used in production
  ✓ 170 billion examples, 65 billion unique features, 636 TB in total
✧ 1000 machines
✧ Sparse logistic regression

$$\min_w \sum_{i=1}^{n} \log\left(1 + \exp(-y_i \langle x_i, w \rangle)\right) + \lambda \|w\|_1$$

[Figure: time (hour), 0 to 2, to achieve the same objective value, split into computing and waiting, for maximal delay $\tau$ = 0, 1, 2, 4, 8, 16. The $\tau$ = 0 bar is labeled "sequential"; an annotation marks the best trade-off at 1.6x]

Filters to Reduce Communication Traffic

✧ Key caching
  ✓ Cache feature IDs on both sender and receiver
✧ Data compression
✧ KKT filter
  ✓ Shrink gradient to 0 based on the KKT condition

[Figure: two bar charts of traffic (%), 0 to 100, one for the server and one for the worker, each comparing Baseline, Key Caching, Compressing, and KKT Filter. Annotated reductions are 2x, 40x, 40x on one chart and 2x, 2.5x, 12x on the other]
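One way key caching could work is sketched below: the sender transmits the list of feature IDs only the first time and afterwards sends a short signature that the receiver resolves from its cache. The classes `Sender` and `Receiver`, the SHA-1 signature, and the message layout are all hypothetical choices for this illustration.

```python
import hashlib

def signature(keys):
    """Short fingerprint of a key list (SHA-1 is an arbitrary choice here)."""
    return hashlib.sha1(repr(list(keys)).encode()).hexdigest()

class Sender:
    def __init__(self):
        self.sent = set()             # signatures the receiver already holds

    def message(self, keys, values):
        sig = signature(keys)
        if sig in self.sent:
            return sig, None, values  # cache hit: omit the key list
        self.sent.add(sig)
        return sig, list(keys), values

class Receiver:
    def __init__(self):
        self.cache = {}               # signature -> key list

    def receive(self, sig, keys, values):
        if keys is not None:
            self.cache[sig] = keys
        return dict(zip(self.cache[sig], values))

sender, receiver = Sender(), Receiver()
feature_ids = [3, 17, 42]

first = sender.message(feature_ids, [0.1, 0.2, 0.3])    # carries the keys
second = sender.message(feature_ids, [0.4, 0.5, 0.6])   # keys omitted

out1 = receiver.receive(*first)
out2 = receiver.receive(*second)
```

When the same set of features is touched in every iteration, only the values travel after the first message.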


Deep Learning is Unique

✧ Complex workloads
✧ Heterogeneous computing
✧ Easy to use programming interface

[Figure: “deep learning” trend in the past 5 years]

Key Features of MXNet [Chen et al, NIPS’15 workshop] (corresponding author)

✧ Easy-to-use front-end
  ✓ Mixed programming
✧ Scalable and efficient back-end
  ✓ Computation and memory optimization
  ✓ Auto-parallelization
  ✓ Scaling to multiple GPU/machines

Mixed Programming

✧ Imperative programming is flexible
  ✓ e.g. NumPy, Matlab, Torch, …

    import mxnet as mx
    a = mx.nd.zeros((100, 50))
    b = mx.nd.ones((100, 50))
    c = a * b
    print(c)
    c += 1

Good for updating and interacting with the neural network

✧ Declarative programs are easy to optimize
  ✓ e.g. TensorFlow, Theano, Caffe, …

    import mxnet as mx
    net = mx.symbol.Variable('data')
    net = mx.symbol.FullyConnected(data=net, num_hidden=128)
    net = mx.symbol.SoftmaxOutput(data=net)
    model = mx.module.Module(net)
    model.forward(data=c)
    model.backward()

Good for defining the neural network

Back-end System

Front-end

    import mxnet as mx
    a = mx.nd.zeros((100, 50))
    b = mx.nd.ones((100, 50))
    c = a * b
    c += 1

    import mxnet as mx
    net = mx.symbol.Variable('data')
    net = mx.symbol.FullyConnected(data=net, num_hidden=128)
    net = mx.symbol.SoftmaxOutput(data=net)
    texec = mx.module.Module(net)
    texec.forward(data=c)
    texec.backward()

Back-end

[Figure: computation graph with nodes a, b, 1, +, c, fullc, softmax, weight, and bias]

✧ Optimization
  ✓ Memory optimization
  ✓ Operator fusion and runtime compilation
✧ Scheduling
  ✓ Auto-parallelization

Scale to Multiple GPU Machines

[Figure: four GPUs attached to a PCIe switch, a CPU, and a network switch. Link bandwidths shown: 63 GB/s (4 PCIe 3.0 16x), 15.75 GB/s (PCIe 3.0 16x), 1.25 GB/s (10 Gbit Ethernet)]

✧ Hierarchical parameter server

[Figure: workers, level-1 servers, and level-2 servers, mapped onto GPUs and CPUs]

✧ 1000 lines of code

Experiment Setup

✧ Dataset
  ✓ 1.2 million images with 1000 classes
✧ Resnet 152-layer model
✧ EC2 p2.8xlarge
  ✓ 8 K80 GPUs per machine

[Figure: GPU 0-7 connected to the CPU through PCIe switches]

✧ Minibatch SGD
  ✓ Draw a random set of examples $I_t$ at iteration $t$
  ✓ Update

$$w_{t+1} = w_t - \frac{\eta_t}{|I_t|} \sum_{i \in I_t} \partial f_i(w_t)$$

✧ Synchronized updating
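The minibatch SGD update can be tried on a small problem. The sketch below is a single-process NumPy illustration under assumptions: a noiseless least-squares loss replaces the neural network, and the batch size, step size, and iteration count are arbitrary choices.

```python
import numpy as np

def minibatch_sgd_step(w, X, y, batch_size, eta, rng):
    """One update w <- w - eta / |I_t| * sum over I_t of grad f_i(w),
    with f_i(w) = 0.5 * (x_i . w - y_i)^2."""
    I_t = rng.choice(len(X), size=batch_size, replace=False)  # random examples
    grad_sum = X[I_t].T @ (X[I_t] @ w - y[I_t])
    return w - eta / len(I_t) * grad_sum

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true                        # noiseless targets

w = np.zeros(4)
for t in range(500):
    w = minibatch_sgd_step(w, X, y, batch_size=32, eta=0.1, rng=rng)
```

In a synchronized distributed run, the sum over $I_t$ is split across GPUs and the partial sums are added before the weights change, so every device applies the same update.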

Communication Cost

✧ Fix #GPUs per machine

[Figure: time (sec), 0.2 to 0.7, versus # of GPUs, 0 to 128, with one curve each for 1, 2, 4, and 8 GPUs per machine]

Scalability

[Figure: time (sec) vs. # of GPUs (0 to 128); series: communication cost, batch size/GPU = 4, 8, and 16]

115x speedup
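One way to read the 115x figure is as parallel efficiency, that is, speedup divided by GPU count. The sketch below assumes the speedup was measured at 128 GPUs (the right edge of the plot's x-axis; the slide does not state this explicitly), and `scaling_efficiency` is a hypothetical helper name.

```python
def scaling_efficiency(speedup: float, num_gpus: int) -> float:
    """Fraction of ideal linear speedup that was achieved."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return speedup / num_gpus


# Assumption: the 115x speedup corresponds to 128 GPUs.
efficiency = scaling_efficiency(115.0, 128)
print(f"{efficiency:.1%}")
```

Under that assumption the system keeps roughly nine tenths of ideal linear scaling.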

Convergence

✧ Increase learning rate by 5x
✧ Increase learning rate by 10x, decrease it at epochs 50 and 80

[Figure: top-1 validation accuracy (0.1 to 0.8) vs. # of epochs (0 to 120); series: batch size = 256, 2,560, and 5,120]
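The schedule in the second bullet can be sketched as a step function. The base rate of 0.1 and the 10x decay at each milestone are illustrative assumptions; the slide only gives the 10x scaling and the decay epochs 50 and 80.

```python
def learning_rate(epoch, base_lr=0.1, scale=10.0, milestones=(50, 80), decay=0.1):
    """Step schedule: start at base_lr * scale, multiply by decay at each milestone.

    base_lr and decay are assumed values; scale and milestones follow the slide.
    """
    lr = base_lr * scale
    for milestone in milestones:
        if epoch >= milestone:
            lr *= decay
    return lr


# Learning rate just before and after each assumed decay point.
schedule = [learning_rate(e) for e in (0, 49, 50, 79, 80, 119)]
print(schedule)
```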

Scaling Distributed Machine Learning

Distributed Systems
✧ Parameter Server for machine learning
✧ MXNet for deep learning

Large Scale Optimization
✧ DBPG for non-convex non-smooth $f_i$
✧ EMSO for efficient minibatch SGD

Minibatch SGD

✧ Large batch size b in SGD
  ✓ Better parallelization within a batch
  ✓ Less switching/communication cost
✧ Small batch size b
  ✓ Faster convergence: $O(1/\sqrt{N} + b/N)$, where N is the number of examples processed

[Figure: system performance and convergence rate (worse to better) vs. batch size (b)]
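To see how the batch size enters the rate $O(1/\sqrt{N} + b/N)$, the sketch below evaluates the bound with constants dropped. The value of N and the batch sizes are illustrative assumptions, not numbers from the slides.

```python
import math


def sgd_bound(num_examples, batch_size):
    """Minibatch SGD rate O(1/sqrt(N) + b/N), with constants dropped."""
    return 1.0 / math.sqrt(num_examples) + batch_size / num_examples


N = 10**8  # assumed number of examples processed
for b in (10**2, 10**4, 10**6):
    print(b, sgd_bound(N, b))
```

The $b/N$ term overtakes $1/\sqrt{N}$ once b exceeds $\sqrt{N}$ (here $10^4$), which is where large batches start to hurt the bound.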

Motivation

✧ Improve convergence rate for large batch size
  ✓ Example variance decreases with batch size
  ✓ Solve a more “accurate” optimization subproblem over each batch

[Figure: system performance and convergence rate (worse to better) vs. batch size (b)] [Li et al, KDD’14]

Efficient Minibatch SGD

✧ Define $f_{I_t}(w) := \sum_{i \in I_t} f_i(w)$. Minibatch SGD solves

$$w_t = \operatorname*{argmin}_{w \in \Omega} \Big[ \underbrace{f_{I_t}(w_{t-1}) + \langle \partial f_{I_t}(w_{t-1}),\, w - w_{t-1} \rangle}_{\text{first-order approximation}} + \underbrace{\frac{1}{2\eta_t} \| w - w_{t-1} \|_2^2}_{\text{conservative penalty}} \Big]$$

✧ EMSO solves the subproblem at each iteration

$$w_t = \operatorname*{argmin}_{w \in \Omega} \Big[ \underbrace{f_{I_t}(w)}_{\text{exact objective}} + \frac{1}{2\eta_t} \| w - w_{t-1} \|_2^2 \Big]$$

✧ For convex $f_i$, choose $\eta_t = O(b/\sqrt{N})$. EMSO has convergence rate $O(1/\sqrt{N})$ (compared to $O(1/\sqrt{N} + b/N)$)
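The difference between the two subproblems can be sketched on a toy problem. The example assumes a least-squares loss and an unconstrained domain ($\Omega = \mathbb{R}^d$), so the EMSO subproblem has a closed-form solution. The function names, data, and step size are hypothetical, and the thesis may solve the subproblem differently (for example, approximately).

```python
import numpy as np


def batch_loss(w, X, y):
    """f_{I_t}(w): half the sum of squared errors over the minibatch (assumed loss)."""
    r = X @ w - y
    return 0.5 * float(r @ r)


def subproblem(w, w_prev, X, y, eta):
    """Exact batch objective plus the conservative penalty."""
    diff = w - w_prev
    return batch_loss(w, X, y) + float(diff @ diff) / (2.0 * eta)


def sgd_step(w_prev, X, y, eta):
    """Minimizer of the linearized subproblem: a plain gradient step."""
    grad = X.T @ (X @ w_prev - y)
    return w_prev - eta * grad


def emso_step(w_prev, X, y, eta):
    """Minimizer of f_{I_t}(w) + ||w - w_prev||^2 / (2 eta), closed form for least squares."""
    d = X.shape[1]
    A = X.T @ X + np.eye(d) / eta
    rhs = X.T @ y + w_prev / eta
    return np.linalg.solve(A, rhs)


rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))  # one minibatch of 64 examples (illustrative)
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=64)
w0 = np.zeros(5)
eta = 0.01

w_sgd = sgd_step(w0, X, y, eta)
w_emso = emso_step(w0, X, y, eta)
print(subproblem(w_sgd, w0, X, y, eta), subproblem(w_emso, w0, X, y, eta))
```

Because `emso_step` minimizes the penalized batch objective exactly, its subproblem value is never larger than the one reached by the gradient step from the same starting point.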

Experiment

✧ Ads click prediction with fixed run time

[Figure: objective vs. batch size (1e3, 1e4, 1e5) for SGD and EMSO; two panels: single machine and 10 machines]

Extended to deep learning in [Keskar et al, arXiv’16]

$$\min_w \sum_{i=1}^{n} f_i(w)$$

Large-scale problems

✧ Distributed systems
✧ Large scale optimization

Co-design

Reduce communication cost
✓ Communicate less
✓ Message compression
✓ Relaxed data consistency

With appropriate computational frameworks and algorithm design, distributed machine learning can be made simple, fast, and scalable, both in theory and in practice.

Acknowledgement

with 13 other collaborators

Advisors

QQ & Alex

Committee members

Backup slides

Scaling to 16 GPUs in a Single Machine

[Figure: time (sec) vs. # of GPUs (0 to 16); series: communication cost, batch size/GPU = 2, 4, 8, and 16]

Communication dominates

15x

Compare to an L-BFGS Based System

[Figure: objective vs. time (hour; log scale 0.1, 1, 10); series: System A, Parameter Server]

[Figure: time (hour; 0 to 5) for System A and Parameter Server, split into computing and waiting]

Sections not Covered

✧ AdaDelay: model the actual delay for asynchronous SGD
✧ Parsa: data placement to reduce communication cost
✧ DiFacto: large scale factorization machines