SLAQ: Quality-Driven Scheduling for Distributed Machine Learning
Haoyu Zhang*, Logan Stafman*, Andrew Or, Michael J. Freedman
“AI is the new electricity.”
• Machine translation
• Recommendation systems
• Autonomous driving
• Object detection and recognition
[Figure: the machine learning landscape spans supervised, unsupervised, transfer, and reinforcement learning]
ML algorithms are approximate
• ML model: a parametric transformation
• maps input variables x to output variables y
• typically contains a set of parameters θ
• Quality: how well the model maps inputs to the correct outputs
• Loss function: the discrepancy between model output and ground truth
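As a concrete illustration of a loss function, a minimal sketch with a squared-error loss (the one-parameter linear model and the example data below are hypothetical, chosen only to make the definition concrete):

```python
def squared_loss(model, params, data):
    """Average squared discrepancy between model output and ground truth."""
    return sum((model(params, x) - y) ** 2 for x, y in data) / len(data)

# A one-parameter linear model f_w(x) = w * x on two labeled examples.
linear = lambda w, x: w * x
examples = [(1.0, 2.0), (2.0, 4.0)]

# With the correct parameter (w = 2) the loss is zero; a wrong parameter
# (w = 1) yields a positive loss, quantifying the model's quality.
```

The scheduler only needs the scalar loss value, not the model internals, which is what makes a loss-based quality metric broadly applicable.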
Training ML models: an iterative process
• Training algorithms iteratively minimize a loss function
• E.g., stochastic gradient descent (SGD), L-BFGS
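A minimal sketch of the SGD loop this refers to, for a toy one-parameter linear model (the data, learning rate, and epoch count are illustrative, not from the talk):

```python
import random

def sgd(data, lr=0.1, epochs=50):
    """Minimize the squared loss L(w) = (w*x - y)^2 by stochastic gradient descent."""
    w = 0.0
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            grad = 2 * (w * x - y) * x  # gradient of the loss on one example
            w -= lr * grad
    return w

# Toy data drawn from y = 3x; iterating drives the loss toward zero, w toward 3.
points = [(x / 10, 3 * x / 10) for x in range(1, 11)]
w = sgd(points)
```

Each pass reduces the loss a little at a time; this iterative, gradually converging structure is the property the rest of the talk builds on.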
[Diagram: the job driver sends tasks to workers; each worker holds a data shard and a model replica, and workers send model updates back to produce the updated model]
Training ML models: an iterative process
• Quality improvement is subject to diminishing returns
• More than 80% of the work is done in the first 20% of the time
[Figure: loss reduction (%) vs. cumulative time (%) for LogReg, SVM, LDA, and MLPC; most of the loss reduction occurs early in training]
Exploratory ML training: not a one-time effort
• Train models multiple times for exploratory purposes
• Provide early feedback; direct the model search toward high-quality models
[Diagram: exploratory training loop: collect data, extract features, train ML models; then adjust the feature space, tune hyperparameters, or restructure models, and retrain]
[Diagram: a scheduler assigns varying numbers of workers to concurrent jobs #1, #2, and #3 on a shared cluster]
How to schedule multiple training jobs on shared cluster?
• Key features of ML jobs
  • Approximate
  • Diminishing returns
  • Exploratory process
• Problem with resource-fairness scheduling
  • Jobs in early stages could benefit greatly from additional resources
  • Jobs that have almost converged make only marginal improvements
SLAQ: quality-aware scheduling
• Intuition: in the context of approximate ML training, more resources should be allocated to jobs that have the most potential for quality improvement
[Figure: accuracy and loss over time under quality-aware vs. fair resource scheduling; quality-aware allocation reaches high accuracy sooner]
Solution Overview
Normalize quality metrics
Predict quality improvement
Quality-driven scheduling
Normalizing quality metrics
Metric | Applicable to All Algorithms? | Comparable Magnitudes? | Known Range? | Predictable?
Accuracy / F1 Score / Area Under Curve / Confusion Matrix / etc. | ✗ | ✓ | ✓ | ✗
Loss | ✓ | ✗ | ✗ | ✓
Normalized Loss | ✓ | ✗ | ✓ | ✓
ΔLoss | ✓ | ✗ | ✗ | ✓
Normalized ΔLoss | ✓ | ✓ | ✓ | ✓
Normalizing quality metrics
• Normalize the change in loss value with respect to the largest change seen so far
• Currently does not support some non-convex optimization algorithms
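One way to compute the normalized ΔLoss described above (a sketch; SLAQ's exact bookkeeping may differ):

```python
def normalized_delta_loss(losses):
    """Normalize each iteration's loss reduction by the largest reduction
    seen so far, making progress comparable across jobs whose raw loss
    values have very different magnitudes."""
    out, largest = [], 0.0
    for prev, curr in zip(losses, losses[1:]):
        delta = prev - curr
        largest = max(largest, abs(delta))
        out.append(delta / largest if largest > 0 else 0.0)
    return out

# A job whose loss goes 10 -> 6 -> 4 -> 3 yields normalized deltas 1.0, 0.5, 0.25:
# the same diminishing-returns shape regardless of the loss's absolute scale.
```

Because the normalizer is the largest change observed so far, early iterations score near 1 and nearly converged iterations score near 0, directly exposing each job's remaining potential.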
[Figure: normalized ΔLoss vs. iteration for K-Means, LogReg, SVM, SVMPoly, GBT, GBTReg, MLPC, LDA, and LinReg; after normalization the loss-reduction curves behave similarly across algorithms]
Training iterations: loss prediction
• Previous work: offline profiling / analysis [Ernest NSDI '16] [CherryPick NSDI '17]
  • Overhead of frequent offline analysis is prohibitive
• Strawman: use the last ΔLoss as the prediction for future ΔLoss
• SLAQ: online prediction using weighted curve fitting
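A sketch of weighted curve fitting for loss prediction: fit loss(k) ≈ a/k + b by weighted least squares, weighting recent iterations more heavily. The 1/k curve shape and the linear weights are illustrative choices, not necessarily SLAQ's exact model family.

```python
def fit_loss_curve(history):
    """Weighted least-squares fit of loss(k) = a/k + b over past iterations,
    with weight k/n so that recent iterations dominate the fit."""
    n = len(history)
    pts = [(1.0 / k, loss, k / n) for k, loss in enumerate(history, start=1)]
    sw = sum(w for _, _, w in pts)
    su = sum(w * u for u, _, w in pts)
    sy = sum(w * y for _, y, w in pts)
    suu = sum(w * u * u for u, _, w in pts)
    suy = sum(w * u * y for u, y, w in pts)
    a = (suy * sw - su * sy) / (suu * sw - su * su)
    b = (sy - a * su) / sw
    return a, b

def predict_loss(history, k):
    """Predict the loss at a future iteration k from the fitted curve."""
    a, b = fit_loss_curve(history)
    return a / k + b
```

Unlike the strawman, the fitted curve extrapolates the diminishing-returns shape several iterations ahead instead of assuming the last ΔLoss repeats.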
[Figure: loss prediction error (%) for LDA, GBT, LinReg, SVM, MLPC, LogReg, and SVMPoly under the strawman vs. weighted curve fitting; weighted curve fitting keeps errors within a few percent, while the strawman errs by up to 52.5%]
Scheduling approximate ML training jobs
• Predict how much quality improves when X workers are assigned to a job
• Reassign workers to maximize total quality improvement
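The reassignment step can be sketched as a greedy allocation: give each successive worker to the job with the largest predicted marginal quality gain. The per-job predictor functions here are hypothetical stand-ins for SLAQ's loss-reduction predictions.

```python
import heapq

def allocate(predictors, total_workers):
    """Greedily assign workers one at a time to the job whose predicted
    quality improvement from one more worker is largest.

    predictors: job id -> function mapping a worker count to predicted
    cumulative quality improvement (diminishing in the count)."""
    alloc = {job: 0 for job in predictors}
    # Max-heap (negated gains) keyed on each job's next marginal gain.
    heap = [(-(f(1) - f(0)), job) for job, f in predictors.items()]
    heapq.heapify(heap)
    for _ in range(total_workers):
        _, job = heapq.heappop(heap)
        alloc[job] += 1
        n = alloc[job]
        f = predictors[job]
        heapq.heappush(heap, (-(f(n + 1) - f(n)), job))
    return alloc
```

With diminishing returns, a nearly converged job stops attracting workers as soon as its marginal gain falls below that of jobs still making rapid progress, which is exactly the behavior fair sharing lacks.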
[Diagram: the scheduler predicts per-job quality improvement and reallocates workers across jobs #1, #2, and #3 accordingly]
Experiment setup
• Representative mix of training jobs (see Table 1)
• Compare against a work-conserving fair scheduler
Algorithm | Acronym | Type | Optimization Algorithm | Dataset
K-Means | K-Means | Clustering | Lloyd Algorithm | Synthetic
Logistic Regression | LogReg | Classification | Gradient Descent | Epsilon [33]
Support Vector Machine | SVM | Classification | Gradient Descent | Epsilon
SVM (polynomial kernel) | SVMPoly | Classification | Gradient Descent | MNIST [34]
Gradient Boosted Tree | GBT | Classification | Gradient Boosting | Epsilon
GBT Regression | GBTReg | Regression | Gradient Boosting | YearPredictionMSD [35]
Multi-Layer Perceptron Classifier | MLPC | Classification | L-BFGS | Epsilon
Latent Dirichlet Allocation | LDA | Clustering | EM / Online Algorithm | Associated Press Corpus [36]
Linear Regression | LinReg | Regression | L-BFGS | YearPredictionMSD
Table 1: Summary of ML algorithms, types, and the optimizers and datasets we used for testing.
4.2 Measuring and Predicting Loss
After unifying the quality metrics for different jobs, we proceed to allocate resources for global quality improvement. When making a scheduling decision for a given job, SLAQ needs to know how much loss reduction the job would achieve by the next epoch if it were granted a certain amount of resources. We derive this information by predicting (i) how many iterations the job will have completed by the next epoch (§4.2.1), and (ii) how much progress (i.e., loss reduction) the job could make within these iterations (§4.2.2).
Prediction for iterative ML training jobs differs from that for general big-data analytics jobs. Previous work [15, 38] estimates a job's runtime on given cluster resources by analyzing the job's computation and communication structure, using offline analysis or code profiling. As the computation and communication pattern changes during ML model configuration tuning, the offline analysis would need to be repeated each time, incurring significant overhead. ML prediction is also different from estimations for approximate analytical SQL queries [16, 17], where the resulting accuracy can be directly inferred from the sampling rate and the analytics being performed. For iterative ML training jobs, we need to make online predictions of the runtime and intermediate quality changes for each iteration.
4.2.1 Runtime Prediction
SLAQ is designed to work with distributed ML training jobs running on batch-processing computational frameworks like Spark and MapReduce. The underlying frameworks achieve data parallelism for training ML models: the training dataset is large and gets partitioned across multiple worker nodes, while the size of the model (i.e., its set of parameters) is comparably much smaller. The model parameters are updated by the workers, aggregated in the job driver, and disseminated back to the workers in the next iteration.
SLAQ's fine-grained scheduler resizes the set of workers for ML jobs frequently, so we need to predict the runtime of each job's iterations even while the number and set of workers available to that job is dynamically changing. Fortunately, the runtime of ML training (at least for the set of ML algorithms and model sizes on which we focus) is dominated by computation on the partitioned datasets. SLAQ models the total CPU time of running each iteration as c · S, where c is a constant determined by the algorithm's complexity, and S is the size of the data processed in an iteration. SLAQ collects the aggregate worker CPU time and data size information from the job driver, and the constant c is easily learned from a history of past iterations. SLAQ thus predicts an iteration's runtime simply as c · S / N, where N is the number of worker CPUs allocated to the job.
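This heuristic is simple enough to sketch directly (the history record format below is an assumption for illustration, not SLAQ's actual bookkeeping):

```python
def estimate_c(history):
    """Learn the constant c from past iterations, where each entry is
    (aggregate_cpu_seconds, data_bytes_processed) and CPU time ~= c * S."""
    return sum(t for t, _ in history) / sum(s for _, s in history)

def predict_iteration_runtime(history, data_size, num_cpus):
    """Predict an iteration's wall-clock runtime as c * S / N for N worker CPUs."""
    return estimate_c(history) * data_size / num_cpus
```

Because c is a single scalar per job, it can be refreshed cheaply every epoch as the driver reports new CPU-time and data-size measurements.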
We use this heuristic for its simplicity and accuracy (validated through evaluation in §6.3), under the assumption that communicating updates and synchronizing models does not become a bottleneck. Even with models larger than hundreds of MBs (e.g., deep neural networks), many ML frameworks can significantly reduce network traffic with model parallelism [39] or by training with relaxed model consistency under bounded staleness [40], as discussed in §7. Advanced runtime prediction models [41] can also be plugged into SLAQ.
4.2.2 Loss Prediction
Iterations in some ML jobs may be on the order of tens to hundreds of milliseconds, while SLAQ schedules only on the order of hundreds of milliseconds to a few seconds. Performing scheduling at smaller intervals would be disproportionately expensive due to scheduling overhead and the lack of meaningful quality changes. Further, as disparate jobs have different iteration periods, and these periods are not aligned, it does not make sense to try to schedule at "every" iteration of the jobs.
Instead, with runtime prediction, SLAQ knows how many iterations a job could complete in the given scheduling epoch. To understand how much quality improvement the job could get, we also need to predict the loss reduction over the following several iterations.
A strawman solution is to directly use the loss reduction obtained in the last iteration as the predicted loss reduction for the following several iterations.
Evaluation: resource allocation across jobs
• 160 training jobs submitted to the cluster following a Poisson distribution
  • 25% of jobs with high loss values
  • 25% of jobs with medium loss values
  • 50% of jobs with low loss values (almost converged)
[Figure: share of cluster CPUs (%) over time for the bottom 50%, second 25%, and top 25% of jobs; SLAQ shifts resources toward the jobs with the most room for quality improvement]
Evaluation: cluster-wide quality and time
• SLAQ’s average loss is 73% lower than that of the fair scheduler
[Figure: (left) cluster-wide loss over time and (right) time to reach a given loss reduction (%), fair resource vs. SLAQ]
• SLAQ reduces time to reach 90% (95%) loss reduction by 45% (30%)
SLAQ Evaluation: Scalability
• Frequently reschedules and reconfigures in reaction to changes in job progress
• Even with thousands of concurrent jobs, SLAQ makes rescheduling decisions in just a few seconds
[Figure: scheduling time (s) vs. number of workers (1,000 to 16,000) with 1,000 to 4,000 concurrent jobs; rescheduling decisions complete within about two seconds]
Conclusion
• SLAQ leverages the approximate and iterative nature of the ML training process
• Tailored online prediction of iterative job quality
• Allocates resources to maximize cluster-wide quality improvement
• SLAQ achieves better overall quality and shorter end-to-end training time
Training iterations: runtime prediction
[Figure: iteration time (s) vs. number of cores (32 to 256) for different data sizes; measured runtimes follow the c · S / N model]
• Iteration runtime: c · S / N
  • Algorithm complexity c, data size S, number of workers N
• Model updates (i.e., the size of Δw) are comparably much smaller