Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spark
Shivaram Venkataraman, Zongheng Yang, Michael Franklin, Benjamin Recht, Ion Stoica
Workload Trends: Advanced Analytics
Keystone-ML TIMIT Pipeline

Raw Data → Cosine Transform → Normalization → Linear Solver (~100 iterations)

Properties:
- Iterative (each iteration runs many jobs)
- Long running → expensive
- Numerically intensive
Cloud Computing Choices

Amazon EC2: t2.nano, t2.micro, t2.small, m4.large, m4.xlarge, m4.2xlarge, m4.4xlarge, m3.medium, c4.large, c4.xlarge, c4.2xlarge, c3.large, c3.xlarge, c3.4xlarge, r3.large, r3.xlarge, r3.4xlarge, i2.2xlarge, i2.4xlarge, d2.xlarge, d2.2xlarge, d2.4xlarge, …

Microsoft Azure: Basic tier: A0, A1, A2, A3, A4; Compute optimized: D1, D2, D3, D4, D11, D12, D13, D1v2, D2v2, D3v2, D11v2, …; Latest CPUs: G1, G2, G3, …; Network optimized: A8, A9; Compute intensive: A10, A11, …

Google Compute Engine: n1-standard-1, n1-standard-2, n1-standard-4, n1-standard-8, n1-standard-16, n1-highmem-2, n1-highmem-4, n1-highmem-8, n1-highcpu-2, n1-highcpu-4, n1-highcpu-8, n1-highcpu-16, n1-highcpu-32, f1-micro, g1-small, …

Instance Types and Number of Instances
Tyranny of Choice
User Concerns
- "What is the cheapest configuration to run my job in 2 hours?"
- "Given a budget, how fast can I run my job?"
- "What kind of instances should I use on EC2?"
Do Choices Matter? Matrix Multiply

[Figure: running time (s) for matrix multiply (400K by 1K) on five configurations with the same totals — 16 cores, 244 GB memory, $2.66/hr: 1 r3.8xlarge, 2 r3.4xlarge, 4 r3.2xlarge, 8 r3.xlarge, 16 r3.large]
Do Choices Matter?

[Figure: running time (s) on 1 r3.8xlarge, 2 r3.4xlarge, 4 r3.2xlarge, 8 r3.xlarge, and 16 r3.large, for Matrix Multiply (400K by 1K; memory-bandwidth bound) and QR Factorization (1M by 1K; network bound)]
[Figure: actual vs. ideal running time (s) as cores scale from 0 to 600 on r3.4xlarge instances, QR Factorization 1M by 1K — actual time diverges from the ideal curve as cores increase]
Do Choices Matter?

Computation + Communication → Non-linear Scaling
Approach

Job + Data → Performance Model

Challenges: black-box jobs, model-building overhead
Opportunities: regular structure + few iterations needed for training
Modeling Jobs

Computation patterns: time grows with input size and shrinks with the number of machines (time ∝ input / machines)

Communication patterns: one-to-one → constant; tree DAG → log(machines); all-to-one → linear in machines
Basic Model

    time = x1 + x2 * (input / machines) + x3 * log(machines) + x4 * machines

- x1: serial execution
- x2 term: computation (linear in input, parallelized across machines)
- x3 term: tree-DAG communication
- x4 term: all-to-one DAG communication

Collect training data, then fit a linear regression.
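The fit step above can be sketched with an ordinary least-squares regression over the four model terms. The training runs below are hypothetical numbers invented for illustration; Ernest itself uses a non-negative least-squares (NNLS) solver to keep the coefficients physically interpretable, which this sketch omits.

```python
import numpy as np

# Hypothetical training runs: (input fraction, machines, observed time in s).
runs = [(0.01, 1, 12.0), (0.02, 1, 22.0), (0.02, 2, 13.0),
        (0.04, 2, 23.0), (0.04, 4, 14.5), (0.08, 8, 16.0)]

def features(scale, machines):
    # The four terms of the basic model:
    # [serial, computation, tree-DAG comm, all-to-one comm]
    return [1.0, scale / machines, np.log(machines), float(machines)]

A = np.array([features(s, m) for s, m, _ in runs])
y = np.array([t for _, _, t in runs])

# Plain least-squares fit of
# time = x1 + x2*(input/machines) + x3*log(machines) + x4*machines
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(scale, machines):
    return float(np.dot(features(scale, machines), coeffs))
```

Once fitted, `predict` can extrapolate to scales and machine counts outside the training grid, which is where the model earns its keep.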
Does the Model Match Real Applications?

Candidate terms — compute: O(n/p), O(n²/p); communication: O(1), O(p), O(log p), other (n = number of data items, p = number of processors, constant number of features). Each ✔ marks a term from these columns in the algorithm's cost:

- GLM Regression: ✔ ✔ ✔
- KMeans: ✔ ✔ ✔, plus an O(centers) term
- Naïve Bayes: ✔ ✔
- Pearson Correlation: ✔ ✔
- PCA: ✔ ✔ ✔
- QR (TSQR): ✔ ✔
Collecting Training Data

[Figure: grid of candidate experiments — input fraction (1%, 2%, 4%, 8%) by machines (1, 2, 4, 8)]

- Grid of (input, machines) configurations
- Associate a cost with each experiment
- Baseline: run the cheapest configurations first
Optimal Design of Experiments
7.5 Experiment design

We consider the problem of estimating a vector x ∈ Rⁿ from measurements or experiments

    y_i = a_iᵀ x + w_i,   i = 1, …, m,

where w_i is measurement noise. We assume that the w_i are independent Gaussian random variables with zero mean and unit variance, and that the measurement vectors a_1, …, a_m span Rⁿ. The maximum likelihood estimate of x, which is the same as the minimum variance estimate, is given by the least-squares solution

    x̂ = (Σ_{i=1}^m a_i a_iᵀ)⁻¹ Σ_{i=1}^m y_i a_i.

The associated estimation error e = x̂ − x has zero mean and covariance matrix

    E = E[e eᵀ] = (Σ_{i=1}^m a_i a_iᵀ)⁻¹.

The matrix E characterizes the accuracy of the estimation, or the informativeness of the experiments. For example the α-confidence level ellipsoid for x is given by

    E = {z | (z − x̂)ᵀ E⁻¹ (z − x̂) ≤ β},

where β is a constant that depends on n and α. We suppose that the vectors a_1, …, a_m, which characterize the measurements, can be chosen among p possible test vectors v_1, …, v_p ∈ Rⁿ, i.e., each a_i is one of the v_j.
Given a Linear Model

Lower variance → better model
λ_i — fraction of times each experiment is run

comparing two schemes: in the first scheme we collect data in an increasing order of machines and in the second scheme we use a mixed strategy as shown in Figure 6. From the figure we make two important observations: (a) in this particular case, the mixed strategy gets to a lower error quickly. After three data points we get to less than 15% error. (b) We see a trend of diminishing returns where adding more data points does not improve accuracy by much. Thus, in order to minimize the time spent on collecting training data we need techniques that will help us find how much training data is required and what those data points should be.

4. Improving Predictions

To improve the time taken for training without sacrificing the prediction accuracy, in this section we outline a scheme based on optimal experiment design, a statistical technique that can be used to minimize the number of experiment runs required. After that we discuss the trade-offs associated with including more fine-grained information in the form of time taken per-task as opposed to end-to-end jobs.

4.1 Optimal Experiment Design

In statistics, experiment design [55] refers to the study of how to collect data required for any experiment given the modeling task at hand. Optimal experiment design specifically looks at how to choose experiments that are optimal with respect to some statistical criterion. At a high level, the goal of experiment design is to determine the data points that can give us the most information to build an accurate model. In order to do this we choose some subset of training data points and then determine how far a model trained with those data points is from the ideal model.

More formally, consider the problem where we are trying to fit a linear model X given measurements y_1, …, y_m and features a_1, …, a_m for each measurement. Each feature vector could in turn consist of a number of dimensions (say n dimensions). In the case of a linear model we typically estimate X using linear regression. We can denote this estimate as X̂, and X̂ − X is the estimation error, or a measure of how far our model is from the true model.

To measure estimation error we can compute the Mean Squared Error (MSE), which takes into account both the bias and the variance of the estimator. In the case of the linear model above, if we have m data points each having n features, then the variance of the estimator is represented by the n × n covariance matrix (Σ_{i=1}^m a_i a_iᵀ)⁻¹. The key point to note here is that the covariance matrix only depends on the feature vectors that were used for this experiment and not on the model that we are estimating.

In optimal experiment design we choose feature vectors (i.e., a_i) that minimize the estimation error. Thus we can frame this as an optimization problem where we minimize the estimation error subject to some constraints on the cost or number of experiments we wish to run. More formally, we can set λ_i as the fraction of times an experiment is chosen and minimize the trace of the covariance matrix:

    Minimize tr((Σ_{i=1}^m λ_i a_i a_iᵀ)⁻¹)
    subject to λ_i ≥ 0, λ_i ≤ 1

Using Experiment Design: The predictive model we described in the previous section can be formulated as an experiment design problem. The feature set used for model design consisted of just the scale of the data used and the number of machines used for the experiment. Given some bounds for the scale and number of machines we want to explore, we can come up with all the features that could be used in our experiment. For example, if the scale bounds range from say 1% to 10% of the data and the number of machines we can use ranges from 1 to 5, we can enumerate 50 different feature vectors from all the scale and machine values possible. We can then feed these feature vectors into the experiment design setup described above and only choose to run those experiments whose λ values are non-zero.

Accounting for Cost: One additional factor we need to consider in using experiment design is that each experiment we run costs a different amount. This cost could be in terms of time (i.e., it is more expensive to train with a larger fraction of the input) or in terms of machines (i.e., there is a fixed cost to, say, launching a machine). To account for the cost of an experiment we can augment the optimization problem we set up above with an additional constraint that the total cost should be less than some budget. That is, if we have a cost function which gives us a cost c_i for an experiment with scale s_i and m_i machines, we add a constraint to our solver that Σ_{i=1}^m c_i λ_i ≤ B, where B is the total budget. For the rest of this paper we use the time taken to collect training data as the cost and ignore any machine setup costs, as we can usually amortize that cost over all the training data we need to collect. However, we can plug in any user-defined cost function in our framework.

Implementation: The optimization problem we defined above can be solved using any standard convex programming solver like CVX [38, 39]. Even for a large range of scale and machine values we find that the time to complete this process is a few seconds, so we believe there should be no additional overhead due to this step. Finally, we also note that the results from experiment design can be reused across multiple jobs, as they only depend on the scale and number of machines under consideration.

4.2 Using Per-Stage Timings

Although the model described in the previous section encodes some of the patterns we see in execution DAGs, we didn't explicitly measure anything other than the end-to-end running time of the whole job. Given that existing data processing frameworks already have support for fine-grained
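Solving the optimization above needs a convex solver in general; the sketch below only evaluates the design criterion tr((Σ λ_i a_i a_iᵀ)⁻¹) for two hand-picked candidate designs, which is enough to compare a cheapest-first design against a mixed one. The feature vectors follow the basic model, and all scales, machine counts, and λ weights are illustrative.

```python
import numpy as np

def feature(scale_pct, machines):
    # Feature vector of the basic model: [1, input/machines, log m, m].
    return np.array([1.0, scale_pct / machines, np.log(machines),
                     float(machines)])

def design_criterion(experiments):
    # tr((sum_i lambda_i a_i a_i^T)^-1): the objective being minimized.
    # A lower value means lower variance for the fitted coefficients.
    M = sum(lam * np.outer(feature(s, m), feature(s, m))
            for s, m, lam in experiments)
    return float(np.trace(np.linalg.inv(M)))

# Two candidate designs over (input %, machines, lambda) -- illustrative only.
cheapest_first = [(1, 1, 1.0), (2, 1, 1.0), (1, 2, 1.0), (1, 4, 1.0)]
mixed = [(1, 1, 1.0), (8, 2, 1.0), (2, 4, 1.0), (8, 8, 1.0)]

print(design_criterion(cheapest_first))
print(design_criterion(mixed))
```

A full implementation would instead make the λ_i decision variables and hand the trace objective, the 0 ≤ λ_i ≤ 1 bounds, and the Σ c_i λ_i ≤ B budget constraint to a convex solver.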
Bound total cost: Σᵢ c_i λ_i ≤ B
Optimal Design of Experiments

[Figure: grid of candidate experiments — input fraction (1%, 2%, 4%, 8%) by machines (1, 2, 4, 8), with the subset chosen by experiment design]

Use an off-the-shelf solver (CVX)
Using Ernest

Job binary + (machines, input size) → experiment design → training jobs (run only a few iterations for training) → linear model
[Figure: Ernest's predicted running time (s) as the number of machines scales from 1 to 900]
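The workflow above can be sketched end to end: given a fitted model, sweep candidate cluster sizes and pick the cheapest one that meets a deadline, answering the earlier "cheapest configuration in 2 hours" question. The coefficients and hourly price below are made up for illustration.

```python
import numpy as np

# Hypothetical fitted coefficients for the basic model
# time = x1 + x2*(input/machines) + x3*log(machines) + x4*machines
coeffs = np.array([100.0, 4000.0, 30.0, 2.0])  # seconds

def predict_time(scale, machines):
    f = np.array([1.0, scale / machines, np.log(machines), float(machines)])
    return float(np.dot(coeffs, f))

def cheapest_config(scale, deadline_s, price_per_machine_hr, candidates):
    # Keep candidate machine counts that meet the deadline and
    # return (cost in dollars, machines, predicted time) for the cheapest.
    feasible = []
    for m in candidates:
        t = predict_time(scale, m)
        if t <= deadline_s:
            feasible.append((t / 3600.0 * m * price_per_machine_hr, m, t))
    return min(feasible) if feasible else None

best = cheapest_config(scale=1.0, deadline_s=7200,
                       price_per_machine_hr=0.35,  # illustrative hourly price
                       candidates=[1, 2, 4, 8, 16, 32, 64])
```

Because the model is cheap to evaluate, the same sweep also answers the dual question: given a budget, which feasible configuration minimizes running time.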
Evaluation

Workloads: Keystone-ML, Spark MLlib, ADAM, GenBase, Sparse GLMs, Random Projections

Objectives: optimal number of machines; prediction accuracy; model training overhead; importance of experiment design; choosing EC2 instance types; model extensions
Number of Instances: Keystone-ML

[Figure: running time (s) vs. machines (0–140) for the TIMIT pipeline on r3.xlarge instances, 100 iterations]

For a 2-hour deadline, up to 5x lower cost
Accuracy: Keystone-ML

[Figure: time per iteration (s) vs. machines (0–140) for the TIMIT pipeline on r3.xlarge instances]

10%–15% prediction error
Accuracy: Spark MLlib

[Figure: predicted/actual time ratio (0–1.2) for Regression, KMeans, NaiveBayes, PCA, and Pearson at 64 and 45 machines]
Training Time: Keystone-ML

TIMIT pipeline on r3.xlarge instances, 100 iterations

Experiment design chose 7 data points, using up to 16 machines and up to 10% of the data.

[Figure: training time vs. running time (s) at 42 machines]
Accuracy, Overhead: GLM Regression

MLlib Regression: < 15% prediction error

[Figure: actual, predicted, and training time (s) at 64 and 45 machines; bars annotated 3.9%, 2.9%, 113.4%, 112.7%]
Is Experiment Design Useful?

[Figure: prediction error (%) for Regression, Classification, KMeans, PCA, and TIMIT, comparing experiment design against a cost-based baseline]
More Details

- Detecting when the model is wrong
- Model extensions
- Amazon EC2 variations over time
- Straggler mitigation strategies
http://shivaram.org/publications/ernest-nsdi.pdf
Open Source

- Python-based implementation
- Experiment design and predictor modules
- Example using SparkML + RCV1
https://github.com/amplab/ernest
In Conclusion

- Workload trends: advanced analytics in the cloud
- Computation and communication patterns affect scalability
- Ernest: performance predictions with low overhead — an end-to-end linear model plus optimal experiment design
Workloads / Traces ?
Shivaram Venkataraman [email protected]