SLAQ: Quality-Driven Scheduling for Distributed Machine Learning
Haoyu Zhang*, Logan Stafman*, Andrew Or, Michael J. Freedman
“AI is the new electricity.”
• Machine translation
• Recommendation systems
• Autonomous driving
• Object detection and recognition
[Figure: the machine learning landscape spans supervised, unsupervised, transfer, and reinforcement learning]
ML algorithms are approximate
• ML model: a parametric transformation
• maps input variables x to output variables y
• typically contains a set of parameters θ
• Quality: how well the model maps inputs to the correct outputs
• Loss function: the discrepancy between model output and ground truth
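As a concrete illustration of a loss function, a minimal sketch with a squared-error loss (the one-parameter linear model and the example data below are hypothetical, chosen only to make the definition concrete):

```python
def squared_loss(model, params, data):
    """Average squared discrepancy between model output and ground truth."""
    return sum((model(params, x) - y) ** 2 for x, y in data) / len(data)

# A one-parameter linear model f_w(x) = w * x on two labeled examples.
linear = lambda w, x: w * x
examples = [(1.0, 2.0), (2.0, 4.0)]

# With the correct parameter (w = 2) the loss is zero; a wrong parameter
# (w = 1) yields a positive loss, quantifying the model's quality.
```

The scheduler only needs the scalar loss value, not the model internals, which is what makes a loss-based quality metric broadly applicable.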
Training ML models: an iterative process
• Training algorithms iteratively minimize a loss function
• E.g., stochastic gradient descent (SGD), L-BFGS
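A minimal sketch of the SGD loop this refers to, for a toy one-parameter linear model (the data, learning rate, and epoch count are illustrative, not from the talk):

```python
import random

def sgd(data, lr=0.1, epochs=50):
    """Minimize the squared loss L(w) = (w*x - y)^2 by stochastic gradient descent."""
    w = 0.0
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            grad = 2 * (w * x - y) * x  # gradient of the loss on one example
            w -= lr * grad
    return w

# Toy data drawn from y = 3x; iterating drives the loss toward zero, w toward 3.
points = [(x / 10, 3 * x / 10) for x in range(1, 11)]
w = sgd(points)
```

Each pass reduces the loss a little at a time; this iterative, gradually converging structure is the property the rest of the talk builds on.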
[Diagram: the job driver sends tasks to workers; each worker holds a data shard and a model replica, and workers send model updates back to produce the updated model]
Training ML models: an iterative process
• Quality improvement is subject to diminishing returns
• More than 80% of the work is done in the first 20% of the time
[Figure: loss reduction (%) vs. cumulative time (%) for LogReg, SVM, LDA, and MLPC; most of the loss reduction occurs early in training]
Exploratory ML training: not a one-time effort
• Train models multiple times for exploratory purposes
• Provide early feedback; direct the model search toward high-quality models
[Diagram: exploratory training loop: collect data, extract features, train ML models; then adjust the feature space, tune hyperparameters, or restructure models, and retrain]
[Diagram: a scheduler assigns varying numbers of workers to concurrent jobs #1, #2, and #3 on a shared cluster]
How to schedule multiple training jobs on shared cluster?
• Key features of ML jobs
  • Approximate
  • Diminishing returns
  • Exploratory process
• Problem with resource-fairness scheduling
  • Jobs in early stages could benefit greatly from additional resources
  • Jobs that have almost converged make only marginal improvements
SLAQ: quality-aware scheduling
• Intuition: in the context of approximate ML training, more resources should be allocated to jobs that have the most potential for quality improvement
[Figure: accuracy and loss over time under quality-aware vs. fair resource scheduling; quality-aware allocation reaches high accuracy sooner]
Solution Overview
Normalize quality metrics
Predict quality improvement
Quality-driven scheduling
Normalizing quality metrics
Metric | Applicable to All Algorithms? | Comparable Magnitudes? | Known Range? | Predictable?
Accuracy / F1 Score / Area Under Curve / Confusion Matrix / etc. | ✗ | ✓ | ✓ | ✗
Loss | ✓ | ✗ | ✗ | ✓
Normalized Loss | ✓ | ✗ | ✓ | ✓
ΔLoss | ✓ | ✗ | ✗ | ✓
Normalized ΔLoss | ✓ | ✓ | ✓ | ✓
Normalizing quality metrics
• Normalize the change in loss value with respect to the largest change seen so far
• Currently does not support some non-convex optimization algorithms
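One way to compute the normalized ΔLoss described above (a sketch; SLAQ's exact bookkeeping may differ):

```python
def normalized_delta_loss(losses):
    """Normalize each iteration's loss reduction by the largest reduction
    seen so far, making progress comparable across jobs whose raw loss
    values have very different magnitudes."""
    out, largest = [], 0.0
    for prev, curr in zip(losses, losses[1:]):
        delta = prev - curr
        largest = max(largest, abs(delta))
        out.append(delta / largest if largest > 0 else 0.0)
    return out

# A job whose loss goes 10 -> 6 -> 4 -> 3 yields normalized deltas 1.0, 0.5, 0.25:
# the same diminishing-returns shape regardless of the loss's absolute scale.
```

Because the normalizer is the largest change observed so far, early iterations score near 1 and nearly converged iterations score near 0, directly exposing each job's remaining potential.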
[Figure: normalized ΔLoss vs. iteration for K-Means, LogReg, SVM, SVMPoly, GBT, GBTReg, MLPC, LDA, and LinReg; after normalization the loss-reduction curves behave similarly across algorithms]
Training iterations: loss prediction
• Previous work: offline profiling / analysis [Ernest NSDI '16] [CherryPick NSDI '17]
  • Overhead of frequent offline analysis is prohibitive
• Strawman: use the last ΔLoss as the prediction for future ΔLoss
• SLAQ: online prediction using weighted curve fitting
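A sketch of weighted curve fitting for loss prediction: fit loss(k) ≈ a/k + b by weighted least squares, weighting recent iterations more heavily. The 1/k curve shape and the linear weights are illustrative choices, not necessarily SLAQ's exact model family.

```python
def fit_loss_curve(history):
    """Weighted least-squares fit of loss(k) = a/k + b over past iterations,
    with weight k/n so that recent iterations dominate the fit."""
    n = len(history)
    pts = [(1.0 / k, loss, k / n) for k, loss in enumerate(history, start=1)]
    sw = sum(w for _, _, w in pts)
    su = sum(w * u for u, _, w in pts)
    sy = sum(w * y for _, y, w in pts)
    suu = sum(w * u * u for u, _, w in pts)
    suy = sum(w * u * y for u, y, w in pts)
    a = (suy * sw - su * sy) / (suu * sw - su * su)
    b = (sy - a * su) / sw
    return a, b

def predict_loss(history, k):
    """Predict the loss at a future iteration k from the fitted curve."""
    a, b = fit_loss_curve(history)
    return a / k + b
```

Unlike the strawman, the fitted curve extrapolates the diminishing-returns shape several iterations ahead instead of assuming the last ΔLoss repeats.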
[Figure: loss prediction error (%) for LDA, GBT, LinReg, SVM, MLPC, LogReg, and SVMPoly under the strawman vs. weighted curve fitting; weighted curve fitting keeps errors within a few percent, while the strawman errs by up to 52.5%]
Scheduling approximate ML training jobs
• Predict how much quality improves when X workers are assigned to a job
• Reassign workers to maximize total quality improvement
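The reassignment step can be sketched as a greedy allocation: give each successive worker to the job with the largest predicted marginal quality gain. The per-job predictor functions here are hypothetical stand-ins for SLAQ's loss-reduction predictions.

```python
import heapq

def allocate(predictors, total_workers):
    """Greedily assign workers one at a time to the job whose predicted
    quality improvement from one more worker is largest.

    predictors: job id -> function mapping a worker count to predicted
    cumulative quality improvement (diminishing in the count)."""
    alloc = {job: 0 for job in predictors}
    # Max-heap (negated gains) keyed on each job's next marginal gain.
    heap = [(-(f(1) - f(0)), job) for job, f in predictors.items()]
    heapq.heapify(heap)
    for _ in range(total_workers):
        _, job = heapq.heappop(heap)
        alloc[job] += 1
        n = alloc[job]
        f = predictors[job]
        heapq.heappush(heap, (-(f(n + 1) - f(n)), job))
    return alloc
```

With diminishing returns, a nearly converged job stops attracting workers as soon as its marginal gain falls below that of jobs still making rapid progress, which is exactly the behavior fair sharing lacks.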
[Diagram: the scheduler predicts per-job quality improvement and reallocates workers across jobs #1, #2, and #3 accordingly]
Experiment setup
• Representative mix of training jobs (see Table 1)
• Compare against a work-conserving fair scheduler
Algorithm | Acronym | Type | Optimization Algorithm | Dataset
K-Means | K-Means | Clustering | Lloyd Algorithm | Synthetic
Logistic Regression | LogReg | Classification | Gradient Descent | Epsilon [33]
Support Vector Machine | SVM | Classification | Gradient Descent | Epsilon
SVM (polynomial kernel) | SVMPoly | Classification | Gradient Descent | MNIST [34]
Gradient Boosted Tree | GBT | Classification | Gradient Boosting | Epsilon
GBT Regression | GBTReg | Regression | Gradient Boosting | YearPredictionMSD [35]
Multi-Layer Perceptron Classifier | MLPC | Classification | L-BFGS | Epsilon
Latent Dirichlet Allocation | LDA | Clustering | EM / Online Algorithm | Associated Press Corpus [36]
Linear Regression | LinReg | Regression | L-BFGS | YearPredictionMSD
Table 1: Summary of ML algorithms, types, and the optimizers and datasets we used for testing.
4.2 Measuring and Predicting Loss
After unifying the quality metrics for different jobs, we proceed to allocate resources for global quality improvement. When making a scheduling decision for a given job, SLAQ needs to know how much loss reduction the job would achieve by the next epoch if it were granted a certain amount of resources. We derive this information by predicting (i) how many iterations the job will have completed by the next epoch (§4.2.1), and (ii) how much progress (i.e., loss reduction) the job could make within these iterations (§4.2.2).
Prediction for iterative ML training jobs differs from that for general big-data analytics jobs. Previous work [15, 38] estimates a job's runtime on given cluster resources by analyzing the job's computation and communication structure, using offline analysis or code profiling. As the computation and communication pattern changes during ML model configuration tuning, the offline analysis would need to be repeated each time, incurring significant overhead. ML prediction is also different from estimations for approximate analytical SQL queries [16, 17], where the resulting accuracy can be directly inferred from the sampling rate and the analytics being performed. For iterative ML training jobs, we need to make online predictions of the runtime and intermediate quality changes for each iteration.
4.2.1 Runtime Prediction
SLAQ is designed to work with distributed ML training jobs running on batch-processing computational frameworks like Spark and MapReduce. The underlying frameworks achieve data parallelism for training ML models: the training dataset is large and gets partitioned across multiple worker nodes, while the size of the model (i.e., its set of parameters) is comparably much smaller. The model parameters are updated by the workers, aggregated in the job driver, and disseminated back to the workers in the next iteration.
SLAQ's fine-grained scheduler resizes the set of workers for ML jobs frequently, so we need to predict the runtime of each job's iterations even while the number and set of workers available to that job is dynamically changing. Fortunately, the runtime of ML training (at least for the set of ML algorithms and model sizes on which we focus) is dominated by computation on the partitioned datasets. SLAQ models the total CPU time of running each iteration as c · S, where c is a constant determined by the algorithm's complexity, and S is the size of the data processed in an iteration. SLAQ collects the aggregate worker CPU time and data size information from the job driver, and the constant c is easily learned from a history of past iterations. SLAQ thus predicts an iteration's runtime simply as c · S / N, where N is the number of worker CPUs allocated to the job.
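This heuristic is simple enough to sketch directly (the history record format below is an assumption for illustration, not SLAQ's actual bookkeeping):

```python
def estimate_c(history):
    """Learn the constant c from past iterations, where each entry is
    (aggregate_cpu_seconds, data_bytes_processed) and CPU time ~= c * S."""
    return sum(t for t, _ in history) / sum(s for _, s in history)

def predict_iteration_runtime(history, data_size, num_cpus):
    """Predict an iteration's wall-clock runtime as c * S / N for N worker CPUs."""
    return estimate_c(history) * data_size / num_cpus
```

Because c is a single scalar per job, it can be refreshed cheaply every epoch as the driver reports new CPU-time and data-size measurements.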
We use this heuristic for its simplicity and accuracy (validated through evaluation in §6.3), under the assumption that communicating updates and synchronizing models does not become a bottleneck. Even with models larger than hundreds of MBs (e.g., deep neural networks), many ML frameworks can significantly reduce network traffic with model parallelism [39] or by training with relaxed model consistency under bounded staleness [40], as discussed in §7. Advanced runtime prediction models [41] can also be plugged into SLAQ.
4.2.2 Loss Prediction
Iterations in some ML jobs may be on the order of tens to hundreds of milliseconds, while SLAQ schedules only on the order of hundreds of milliseconds to a few seconds. Performing scheduling at smaller intervals would be disproportionately expensive due to scheduling overhead and the lack of meaningful quality changes. Further, as disparate jobs have different iteration periods, and these periods are not aligned, it does not make sense to try to schedule at "every" iteration of the jobs.
Instead, with runtime prediction, SLAQ knows how many iterations a job could complete in the given scheduling epoch. To understand how much quality improvement the job could get, we also need to predict the loss reduction over the following several iterations.
A strawman solution is to directly use the loss reduction obtained in the last iteration as the predicted loss reduction for the following several iterations.
Evaluation: resource allocation across jobs
• 160 training jobs submitted to the cluster following a Poisson distribution
  • 25% of jobs with high loss values
  • 25% of jobs with medium loss values
  • 50% of jobs with low loss values (almost converged)
[Figure: share of cluster CPUs (%) over time for the bottom 50%, second 25%, and top 25% of jobs; SLAQ shifts resources toward the jobs with the most room for quality improvement]
Evaluation: cluster-wide quality and time
• SLAQ’s average loss is 73% lower than that of the fair scheduler
[Figure: (left) cluster-wide loss over time and (right) time to reach a given loss reduction (%), fair resource vs. SLAQ]
• SLAQ reduces time to reach 90% (95%) loss reduction by 45% (30%)
SLAQ Evaluation: Scalability
• Frequently reschedules and reconfigures in reaction to changes in job progress
• Even with thousands of concurrent jobs, SLAQ makes rescheduling decisions in just a few seconds
[Figure: scheduling time (s) vs. number of workers (1,000 to 16,000) with 1,000 to 4,000 concurrent jobs; rescheduling decisions complete within about two seconds]
Conclusion
• SLAQ leverages the approximate and iterative nature of the ML training process
• Tailored online prediction of iterative job quality
• Allocates resources to maximize cluster-wide quality improvement
• SLAQ achieves better overall quality and shorter end-to-end training time
Training iterations: runtime prediction
[Figure: iteration time (s) vs. number of cores (32 to 256) for different data sizes; measured runtimes follow the c · S / N model]
• Iteration runtime: c · S / N
  • Algorithm complexity c, data size S, number of workers N
• Model updates (i.e., the size of Δw) are comparably much smaller