Motivating example: a user issues the query “osdi” to a search engine, which returns results alongside ads; when the user clicks an ad, the advertiser is charged.
Machine learning is concerned with systems that can learn from data.

[Figure: Ad click prediction. Training data size (TB, 0 to 700) and annual revenue (billion $, 0 to 80) per year, 2010 to 2014.]

We will report results using 1,000 machines!
Overview of machine learning

raw data → training data → machine learning system → model

Training data as (key, value) pairs: an example-by-feature matrix, e.g. an example with value 1 for each of the features “osdi”, “www.usenix.org”, and “user_mu_li”.

Scale of industry problems:
✦ 100 billion examples
✦ 10 billion features
✦ 1 TB to 1 PB of training data
✦ 100 to 1,000 machines

Goals:
✦ scale to industry problems
✦ efficient communication
✦ fault tolerance
✦ easy to use
Industry-size machine learning problems
Data and model partition

✦ Training data is partitioned across the worker machines
✦ The model is partitioned across the server machines
✦ Workers push updates to, and pull model parameters from, the servers

[Diagram: training data blocks assigned to workers; model shards assigned to servers; push and pull arrows between the two tiers.]
Example: distributed gradient descent

✦ Workers pull the working set of the model
✦ Iterate until stop:
  ★ workers compute gradients
  ★ workers push gradients
  ★ servers update the model
  ★ workers pull the updated model

(A minimal sketch of this worker loop follows.)
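A minimal sketch of the worker-side loop, assuming a hypothetical Python handle with `push`/`pull` methods (the real system exposes a C++ interface; all names here are illustrative):

```python
import numpy as np

def worker_loop(server, data, labels, keys, num_iters):
    """data: local examples (n x d), labels in {-1, +1}, keys: feature ids."""
    w = server.pull(keys)                 # pull the working set of the model
    for _ in range(num_iters):
        # gradient of the logistic loss on this worker's data partition
        margin = labels * (data @ w)
        grad = data.T @ (-labels / (1.0 + np.exp(margin)))
        server.push(keys, grad)           # servers aggregate and update the model
        w = server.pull(keys)             # pull the updated model
    return w
```

Each worker only needs the keys its own data touches, so `pull(keys)` fetches a working set rather than the full 10-billion-feature model.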
Efficient communication
Challenges for data synchronization

✦ Massive communication traffic
  ★ frequent access to the shared model
✦ Expensive global barriers
  ★ between iterations
Task

✦ a push / pull / user-defined function (an iteration)
✦ “execute-after-finished” dependency
✦ executed asynchronously

[Timeline: in synchronous execution, iter 0's gradient computation (CPU intensive) and its push & pull (network intensive) complete before iter 1 starts; in asynchronous execution, iter 1's gradient computation overlaps with iter 0's push & pull.]

(A toy sketch of this overlap follows.)
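A toy sketch of the asynchronous execution above (not the system's API): iteration t's CPU-intensive gradient computation overlaps with iteration t-1's network-intensive push & pull, while the “execute-after-finished” dependency is still honored:

```python
from concurrent.futures import ThreadPoolExecutor

def run_async(num_iters, compute_gradient, push_and_pull):
    io = ThreadPoolExecutor(max_workers=1)  # network-intensive tasks
    pending = None                          # in-flight push & pull, if any
    for t in range(num_iters):
        grad = compute_gradient(t)          # overlaps with pending communication
        if pending is not None:
            pending.result()                # "execute-after-finished" dependency
        pending = io.submit(push_and_pull, grad)
    if pending is not None:
        pending.result()                    # drain the last task
```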
Flexible consistency

✦ Trade-off between algorithm efficiency and system performance

Dependency graphs over tasks:
✦ Sequential: 1 → 2 → 3 → 4
✦ Eventual: 1, 2, 3, 4 (no dependencies)
✦ 1-bounded delay: 1, 2, 3, 4, 5, where task t starts only after task t-2 has finished

(See the sketch below.)
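A minimal sketch of the bounded-delay rule, under the assumption (consistent with the diagrams above) that with delay tau, iteration t may start only once iteration t - tau - 1 has finished; tau = 0 recovers sequential consistency and unbounded tau recovers eventual consistency:

```python
def issue_iterations(num_iters, tau, start, wait_for):
    """start(t) launches iteration t; wait_for(t) blocks until it finishes."""
    for t in range(num_iters):
        if t - tau - 1 >= 0:
            wait_for(t - tau - 1)  # at most tau + 1 iterations run concurrently
        start(t)
```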
Results for bounded delay

[Figure: Ad click prediction. Total time (hours, 0 to 1.8), split into computing and waiting, versus the bounded delay (0, 1, 2, 4, 8, 16). Delay 0 is sequential; an intermediate delay gives the best trade-off.]
User-defined filters

✦ Selectively communicate (key, value) pairs
✦ E.g., the KKT filter
  ★ send pairs only if they are likely to affect the model
  ★ >95% of keys are filtered in the ad click prediction task

(A rough sketch of such a filter follows.)
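A rough sketch of a KKT-style filter for an L1-regularized model (the exact condition used by the system may differ; `lam` is the assumed L1 penalty): a coordinate whose weight is zero stays zero unless its gradient can overcome the penalty, so its (key, value) pair need not be sent:

```python
import numpy as np

def kkt_filter(keys, weights, grads, lam):
    """Return only the (key, gradient) pairs likely to change the model."""
    keep = (weights != 0) | (np.abs(grads) > lam)
    return keys[keep], grads[keep]
```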
Fault tolerance
Machine learning job logs over a three-month period:

[Figure: job failure rate (%, 0 to 26) versus job size in #machines x time (100, 1,000, and 10,000 machine-hours).]
Fault tolerance

✦ Model is partitioned by consistent hashing
✦ Default replication: chain replication (consistent, safe)
  ★ [Diagram: worker 0 pushes x to server 0; server 0 pushes f(x) on to its replica, server 1; acks flow back along the chain.]
✦ Option: aggregation reduces backup traffic (algorithm specific)
  ★ [Diagram: workers 0 and 1 push x and y to server 0; server 0 aggregates and pushes a single f(x+y) to server 1; each push is acked.]
✦ Implemented with an efficient vector clock

(A minimal consistent-hashing sketch follows.)
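A minimal sketch of consistent hashing for assigning model keys to server nodes (illustrative only; the real system also places replicas along the ring for the chain replication described above):

```python
import bisect
import hashlib

def _h(x):
    return int(hashlib.md5(str(x).encode()).hexdigest(), 16)

class ConsistentHash:
    def __init__(self, servers, vnodes=64):
        # each server owns several virtual points on the hash ring
        self.ring = sorted((_h(f"{s}#{i}"), s)
                           for s in servers for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def server_for(self, key):
        # first virtual point clockwise from the key's hash
        i = bisect.bisect(self.points, _h(key)) % len(self.ring)
        return self.ring[i][1]
```

Adding or removing a server moves only the keys in its arcs of the ring, which keeps recovery and rebalancing traffic small.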
Easy to use
(Key, value) vectors for the shared parameters

A sparse vector in math notation maps directly onto a (key, value) store: entries at indices i1, i2, i3 become the pairs (i1, v1), (i2, v2), (i3, v3).

✦ Good for programmers: matches the mental model
✦ Good for the system: exposes optimizations based on the structure of the data
Example: computing the gradient

gradient = data^T × ( −label / ( 1 + exp( label × data × model ) ) )

(A runnable sketch follows.)
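The gradient above, written with dense numpy arrays for clarity (the real workload is sparse and partitioned; this is only an illustration):

```python
import numpy as np

def logistic_gradient(data, label, model):
    """data: n x d matrix, label: n-vector in {-1, +1}, model: d-vector."""
    margin = label * (data @ model)
    return data.T @ (-label / (1.0 + np.exp(margin)))
```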
Evaluation
Sparse Logistic Regression

✦ Predict whether ads will be clicked or not
✦ Baseline: two systems in production
✦ 636 TB of real ads data
  ★ 170 billion examples, 65 billion features
✦ 1,000 machines with 16,000 cores

System            Method    Consistency          LOC
System-A          L-BFGS    Sequential           10K
System-B          Block PG  Sequential           30K
Parameter Server  Block PG  Bounded delay + KKT  300
Progress

[Figure: error (large to small) versus time (hours, log scale from 0.1 to 10) for System A, System B, and the Parameter Server.]
Time decomposition

[Figure: per-system time (hours, 0 to 5) split into computing and waiting, for System-A, System-B, and the Parameter Server.]
Topic Modeling (“LDA”)

✦ Gradient descent with eventual consistency
✦ 5B users' click logs; users grouped into 1,000 groups based on the URLs they clicked

[Figure: error (large to small) versus time (0 to 70 hours) with 1,000 and 6,000 machines; 6,000 machines give a 4x speedup.]
Largest experiments of related systems

[Figure: model size (10^4 to 10^11) versus #cores (10^1 to 10^5); data collected in April 2014. Points: REEF (LR), VW (LR), Naiad (LR), Petuum (Lasso), MLbase (LR), GraphLab (LDA), YahooLDA (LDA), DistBelief (DNN), Adam (DNN), and the Parameter Server's Sparse LR and LDA experiments, which are the largest in both model size and core count.]
Summary
✦ Industry-size machine learning problems
✦ Efficient communication
✦ Fault tolerance
✦ Easy to use
✦ Evaluation

Code available at parameterserver.org

Q&A