Motivating example: a user issues the query “osdi” to a search engine, which returns results alongside ads; when the user clicks an ad, the advertiser is charged.
Machine learning is concerned with systems that can learn from data.

[Figure: Ad click prediction. Training data size (TB, 0 to 700) and annual revenue (billion $, 0 to 80) per year, 2010 to 2014.]

We will report results using 1,000 machines!
Overview of machine learning

raw data → training data → machine learning system → model

Training data as (key, value) pairs: an example-by-feature matrix, e.g. an example with value 1 for each of the features “osdi”, “www.usenix.org”, and “user_mu_li”.

Scale of industry problems:
✦ 100 billion examples
✦ 10 billion features
✦ 1 TB to 1 PB of training data
✦ 100 to 1,000 machines

Goals:
✦ scale to industry problems
✦ efficient communication
✦ fault tolerance
✦ easy to use
Industry-size machine learning problems
Data and model partition

✦ Training data is partitioned across the worker machines
✦ The model is partitioned across the server machines
✦ Workers push updates to, and pull model parameters from, the servers

[Diagram: training data blocks assigned to workers; model shards assigned to servers; push and pull arrows between the two tiers.]
Example: distributed gradient descent

✦ Workers pull the working set of the model
✦ Iterate until stop:
  ★ workers compute gradients
  ★ workers push gradients
  ★ servers update the model
  ★ workers pull the updated model

(A minimal sketch of this worker loop follows.)
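A minimal sketch of the worker-side loop, assuming a hypothetical Python handle with `push`/`pull` methods (the real system exposes a C++ interface; all names here are illustrative):

```python
import numpy as np

def worker_loop(server, data, labels, keys, num_iters):
    """data: local examples (n x d), labels in {-1, +1}, keys: feature ids."""
    w = server.pull(keys)                 # pull the working set of the model
    for _ in range(num_iters):
        # gradient of the logistic loss on this worker's data partition
        margin = labels * (data @ w)
        grad = data.T @ (-labels / (1.0 + np.exp(margin)))
        server.push(keys, grad)           # servers aggregate and update the model
        w = server.pull(keys)             # pull the updated model
    return w
```

Each worker only needs the keys its own data touches, so `pull(keys)` fetches a working set rather than the full 10-billion-feature model.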
Efficient communication
Challenges for data synchronization

✦ Massive communication traffic
  ★ frequent access to the shared model
✦ Expensive global barriers
  ★ between iterations
Task

✦ a push / pull / user-defined function (an iteration)
✦ “execute-after-finished” dependency
✦ executed asynchronously

[Timeline: in synchronous execution, iter 0's gradient computation (CPU intensive) and its push & pull (network intensive) complete before iter 1 starts; in asynchronous execution, iter 1's gradient computation overlaps with iter 0's push & pull.]

(A toy sketch of this overlap follows.)
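A toy sketch of the asynchronous execution above (not the system's API): iteration t's CPU-intensive gradient computation overlaps with iteration t-1's network-intensive push & pull, while the “execute-after-finished” dependency is still honored:

```python
from concurrent.futures import ThreadPoolExecutor

def run_async(num_iters, compute_gradient, push_and_pull):
    io = ThreadPoolExecutor(max_workers=1)  # network-intensive tasks
    pending = None                          # in-flight push & pull, if any
    for t in range(num_iters):
        grad = compute_gradient(t)          # overlaps with pending communication
        if pending is not None:
            pending.result()                # "execute-after-finished" dependency
        pending = io.submit(push_and_pull, grad)
    if pending is not None:
        pending.result()                    # drain the last task
```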
Flexible consistency

✦ Trade-off between algorithm efficiency and system performance

Dependency graphs over tasks:
✦ Sequential: 1 → 2 → 3 → 4
✦ Eventual: 1, 2, 3, 4 (no dependencies)
✦ 1-bounded delay: 1, 2, 3, 4, 5, where task t starts only after task t-2 has finished

(See the sketch below.)
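A minimal sketch of the bounded-delay rule, under the assumption (consistent with the diagrams above) that with delay tau, iteration t may start only once iteration t - tau - 1 has finished; tau = 0 recovers sequential consistency and unbounded tau recovers eventual consistency:

```python
def issue_iterations(num_iters, tau, start, wait_for):
    """start(t) launches iteration t; wait_for(t) blocks until it finishes."""
    for t in range(num_iters):
        if t - tau - 1 >= 0:
            wait_for(t - tau - 1)  # at most tau + 1 iterations run concurrently
        start(t)
```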
Results for bounded delay

[Figure: Ad click prediction. Total time (hours, 0 to 1.8), split into computing and waiting, versus the bounded delay (0, 1, 2, 4, 8, 16). Delay 0 is sequential; an intermediate delay gives the best trade-off.]
User-defined filters

✦ Selectively communicate (key, value) pairs
✦ E.g., the KKT filter
  ★ send pairs only if they are likely to affect the model
  ★ >95% of keys are filtered in the ad click prediction task

(A rough sketch of such a filter follows.)
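A rough sketch of a KKT-style filter for an L1-regularized model (the exact condition used by the system may differ; `lam` is the assumed L1 penalty): a coordinate whose weight is zero stays zero unless its gradient can overcome the penalty, so its (key, value) pair need not be sent:

```python
import numpy as np

def kkt_filter(keys, weights, grads, lam):
    """Return only the (key, gradient) pairs likely to change the model."""
    keep = (weights != 0) | (np.abs(grads) > lam)
    return keys[keep], grads[keep]
```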
Fault tolerance
Machine learning job logs over a three-month period:

[Figure: job failure rate (%, 0 to 26) versus job size in #machines x time (100, 1,000, and 10,000 machine-hours).]
Fault tolerance

✦ Model is partitioned by consistent hashing
✦ Default replication: chain replication (consistent, safe)
  ★ [Diagram: worker 0 pushes x to server 0; server 0 pushes f(x) on to its replica, server 1; acks flow back along the chain.]
✦ Option: aggregation reduces backup traffic (algorithm specific)
  ★ [Diagram: workers 0 and 1 push x and y to server 0; server 0 aggregates and pushes a single f(x+y) to server 1; each push is acked.]
✦ Implemented with an efficient vector clock

(A minimal consistent-hashing sketch follows.)
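A minimal sketch of consistent hashing for assigning model keys to server nodes (illustrative only; the real system also places replicas along the ring for the chain replication described above):

```python
import bisect
import hashlib

def _h(x):
    return int(hashlib.md5(str(x).encode()).hexdigest(), 16)

class ConsistentHash:
    def __init__(self, servers, vnodes=64):
        # each server owns several virtual points on the hash ring
        self.ring = sorted((_h(f"{s}#{i}"), s)
                           for s in servers for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def server_for(self, key):
        # first virtual point clockwise from the key's hash
        i = bisect.bisect(self.points, _h(key)) % len(self.ring)
        return self.ring[i][1]
```

Adding or removing a server moves only the keys in its arcs of the ring, which keeps recovery and rebalancing traffic small.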
Easy to use
(Key, value) vectors for the shared parameters

A sparse vector in math notation maps directly onto a (key, value) store: entries at indices i1, i2, i3 become the pairs (i1, v1), (i2, v2), (i3, v3).

✦ Good for programmers: matches the mental model
✦ Good for the system: exposes optimizations based on the structure of the data
Example: computing the gradient

gradient = data^T × ( −label / ( 1 + exp( label × data × model ) ) )

(A runnable sketch follows.)
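The gradient above, written with dense numpy arrays for clarity (the real workload is sparse and partitioned; this is only an illustration):

```python
import numpy as np

def logistic_gradient(data, label, model):
    """data: n x d matrix, label: n-vector in {-1, +1}, model: d-vector."""
    margin = label * (data @ model)
    return data.T @ (-label / (1.0 + np.exp(margin)))
```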
Evaluation
Sparse Logistic Regression

✦ Predict whether ads will be clicked or not
✦ Baseline: two systems in production
✦ 636 TB of real ads data
  ★ 170 billion examples, 65 billion features
✦ 1,000 machines with 16,000 cores

System            Method    Consistency          LOC
System-A          L-BFGS    Sequential           10K
System-B          Block PG  Sequential           30K
Parameter Server  Block PG  Bounded delay + KKT  300
Progress

[Figure: error (large to small) versus time (hours, log scale from 0.1 to 10) for System A, System B, and the Parameter Server.]
Time decomposition

[Figure: per-system time (hours, 0 to 5) split into computing and waiting, for System-A, System-B, and the Parameter Server.]
Topic Modeling (“LDA”)

✦ Gradient descent with eventual consistency
✦ 5B users' click logs; users grouped into 1,000 groups based on the URLs they clicked

[Figure: error (large to small) versus time (0 to 70 hours) with 1,000 and 6,000 machines; 6,000 machines give a 4x speedup.]
Largest experiments of related systems

[Figure: model size (10^4 to 10^11) versus #cores (10^1 to 10^5); data collected in April 2014. Points: REEF (LR), VW (LR), Naiad (LR), Petuum (Lasso), MLbase (LR), GraphLab (LDA), YahooLDA (LDA), DistBelief (DNN), Adam (DNN), and the Parameter Server's Sparse LR and LDA experiments, which are the largest in both model size and core count.]
Summary
✦ Industry-size machine learning problems
✦ Efficient communication
✦ Fault tolerance
✦ Easy to use
✦ Evaluation

Code available at parameterserver.org

Q&A