Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
ACHIEVING PERFORMANCE AND AVAILABILITY
GUARANTEES WITH AMAZON SPOT INSTANCES
MICHELE MAZZUCCO*
UNIVERSITY OF TARTU
* And Marlon Dumas. Thanks to Madeline Gonzalez for the useful discussions.
INTRODUCTION (1)
• Cloud computing = pay-as-you go + “unlimited” capacity • Amazon Elastic Cloud Compute (EC2)
• Web service providing resizable compute capacity • Different instance types, e.g., small, large, xlarge servers • Different purchasing options
1. On-Demand Instances • Pay for resources by the hour • No long-term commitments
2. Reserved Instances • Upfront payment for reserving instances • Discounted usage rate
2
INTRODUCTION (2)
3. Spot Instances • Users bid for unused resources at a “limit price” (the
maximum price the bidder is willing to pay) • Amazon gathers the bids and determines a clearing
price (“spot price”) based on the bids and the available capacity
• A bidder gets the required instances if his/her limit price is above the clearing price. In this case, the bidder pays the clearing price (not his/her limit price)
• The clearing price is updated as new bids arrive. If the clearing price goes above the bidder’s limit price, the bidder’s running spot instances are terminated.
3
INTRODUCTION (3)
• In this presentation I will focus on “spot instances” • They allow IaaS providers to sell spare capacity
(via auctions), thus improving resource utilization • At the same time they enable IaaS users to
acquire resources at a price lower than on-demand price
• However, spot instances are terminated any time the “clearing price” goes above the user’s “limit price”
4
PROBLEM STATEMENT
“Can we use spot instances to run paid web services while achieving
performance and availability guarantees?”!
5
TALK OUTLINE
6
Bidding strategy
• Decides how much to bid on the spot market
Server allocation
• Decides how many servers to use
Admission control
• Decides how many jobs to admit
Mission accomplished Meet performance
targets
Meet availability goals
PART 1: SPOT PRICE PREDICTION
• We must determine the optimal “limit price” that the SaaS provider should bid on the spot market • The goal is to bid the lowest possible price capable of
achieving the desired level of availability
• Let’s see how some spot prices look like…
7
SPOT PRICES (M1.SMALL)
8
0.03
0.04
0.05
0.06
0.07
0.08
0.09
25/12 01/01 08/01 15/01 22/01 29/01 05/02 12/02 19/02 26/02
Pri
ce (
$/h
)
m1.small (Linux/UNIX)
0.029
0.0295
0.03
0.0305
0.031
11/Fri 12/Sat 13/Sun 14/Mon 15/Tue
Pri
ce (
$/h
)
SPOT PRICES (M1.LARGE)
9
0.1
0.15
0.2
0.25
0.3
0.35
25/12 01/01 08/01 15/01 22/01 29/01 05/02 12/02 19/02 26/02
Pri
ce (
$/h)
m1.large (Linux/UNIX)
0.113
0.116
0.119
0.122
0.125
0.128
11/Fri 12/Sat 13/Sun 14/Mon 15/Tue
Pri
ce (
$/h
)
SPOT PRICES (M1.XLARGE)
10
0.2
0.3
0.4
0.5
0.6
0.7
0.8
25/12 01/01 08/01 15/01 22/01 29/01 05/02 12/02 19/02 26/02
Pri
ce (
$/h)
m1.xlarge (Linux/UNIX)
0.24
0.27
0.3
0.33
0.36
0.39
11/Fri 12/Sat 13/Sun 14/Mon 15/Tue
Pri
ce (
$/h
)
1ST ATTEMPT • Assume that prices follow patterns, e.g., day/night, week
days/week ends, etc. • Use triple exponential smoothing (“Holt-Winters”) for
predicting future prices
11
St =!ytIt+ (1!!)(St!1 + bt!1)
bt = " (St ! St!1)+ (1!" )bt!1
It = #ytSt+ (1!#)It!L
Ft+m = (St +mbt )It!L+m
Updates the trend
Updates the smoothed value
Updates the seasonal component
m periods ahead’s forecast
Con
stan
ts m
inim
izin
g th
e M
SE
WIKIPEDIA TRAFFIC
Time [hours]
Arr.
rate
[req
./hou
r]
0 100 200 300 400 500 600 700
2.0e
+07
2.5e
+07
3.0e
+07
3.5e
+07
12
~ 10,000 req./sec at peak
WIKIPEDIA TRAFFIC: TIME SERIES DECOMPOSITION
13
2.0e+07
3.0e+07
data
−6e+06
−2e+06
2e+06
seasonal
2.3e+07
2.6e+07
2.9e+07
trend
−1e+07
−4e+06
2e+06
0 5 10 15 20 25 30
remainder
time
The max irregular component is ~11% of the data
= da
ta –
(tre
nd +
sea
sona
l)
PREDICTING WIKIPEDIA TRAFFIC WITH HOLT WINTERS
Predicting Wikipedia traffic with Holt−Winters
Time [days]
Arr.
rate
[req
./hou
r]
0 5 10 15 20 25 30
2.0e
+07
2.5e
+07
3.0e
+07
3.5e
+07 Observed
Predicted
14
PREDICTING SPOT PRICES WITH HOLT WINTERS (1)
m1.small
Time [~ days]
Pric
e [$
/h]
10 20 30 40
0.03
0.04
0.05
0.06
0.07
0.08
ObservedPredicted
15
Days [5−9]
Time [hours]
Pric
e [$
/h]
0 20 40 60 80 100 120
0.03
00.
035
0.04
00.
045
0.05
0
ObservedPredicted
PREDICTING SPOT PRICES WITH HOLT WINTERS (2)
m1.large
Time [~ days]
Pric
e [$
/h]
10 20 30 40
0.15
0.20
0.25
0.30
ObservedPredicted
16
Days [5−9]
Time [hours]
Pric
e [$
/h]
0 20 40 60 80 100 120
0.15
0.20
0.25
0.30
ObservedPredicted
PREDICTING SPOT PRICES WITH HOLT WINTERS (3)
m1.xlarge
Time [~ days]
Pric
e [$
/h]
10 20 30 40
0.3
0.4
0.5
0.6
0.7
0.8 Observed
Predicted
17
Days [5−9]
Time [hours]
Pric
e [$
/h]
0 20 40 60 80 100 120
0.3
0.4
0.5
0.6
0.7
0.8 Observed
Predicted
XLARGE PRICES: TIME SERIES DECOMPOSITION
m1.xlarge
0.3
0.5
0.7
data
−0.03
−0.01
0.01
0.03
seasonal
0.25
0.35
0.45
trend
−0.2
0.0
0.2
0.4
5 10 15
remainder
time
18
The max irregular component is ~50% of the data
2ND ATTEMPT (1) • What about employing the distribution of the relative
errors to correct the prediction?
19
Histogram of errors
Relative error [(observed − predicted)/observed]
Den
sity
−0.2 −0.1 0.0 0.1 0.2
01
23
45
6
m1.large
We are interested in this part
2ND ATTEMPT (2)
20
0.00 0.05 0.10 0.15 0.20
0.0
0.2
0.4
0.6
0.8
1.0
CDF rel. error (Wikipedia)
Relative error [(observed − predicted)/observed]
Fn(x
)
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●● ●●●●●●●●
●●●●● ●●●●●● ● ●● ●●● ● ● ● ● ● ●● ● ●● ● ● ● ●
All the arrival rates can be estimated with 83% accuracy
2ND ATTEMPT (3)
21
0.0 0.1 0.2 0.3 0.4 0.5 0.6
0.2
0.4
0.6
0.8
1.0
CDF rel. error. (xlarge)
Relative error [(observed − predicted)/observed]
Fn(x
)
It does not really work, does it?
ONE MORE THING… Dec. 25, 2010 − Feb. 23, 2011
Time [hours]
Pric
e [$
/h]
0 200 400 600 800 1000
0.03
0.05
0.07
Jun. 22, 2011 − Sept. 23, 2011
Time [hours]
Pric
e [$
/h]
0 100 200 300 400 500
0.03
0.05
0.07
22
• ... they look quite different, don’t they?
ACF(l) =(xt ! x )(xt+l ! x )
t=1
N
"
(xt ! x )2
t=1
N
"
3RD ATTEMPT
1. Employ the autocorrelation function (ACF) to determine the similarity between prices as a function of the time difference between them (lag)
23
Average value
Value at time t
No. of observations
Lag
2ND ATTEMPT
• The autocorrelation function (ACF) of the prices confirms the lack of any relationship between prices over the time
2. Use a normal approximation to model prices • There is some evidence that prices in similar markets are
approximately normally distributed 24
0 50 100 150
−0.2
0.0
0.2
0.4
0.6
0.8
1.0
LAG (hours)
ACF
m1.small
2ND ATTEMPT
• The autocorrelation function (ACF) of the prices confirms the lack of any relationship between prices over the time
2. Use a normal approximation to model prices • There is some evidence that prices in similar markets are
approximately normally distributed 25
0 50 100 150
0.0
0.2
0.4
0.6
0.8
1.0
LAG (hours)
ACF
m1.large
0 50 100 150
0.0
0.2
0.4
0.6
0.8
1.0
LAG (hours)
ACF
2ND ATTEMPT
• The autocorrelation function (ACF) of the prices confirms the lack of any relationship between prices over the time
2. Use a normal approximation to model prices • There is some evidence that prices in similar markets are
approximately normally distributed 26
m1.xlarge
SPOT PRICE DISTRIBUTION
Data Normal Approx. Minimum 0.029 0.02098 1st Quartile 0.029 0.02877 Median 0.031 0.03020 3rd Quartile 0.031 0.03163 Maximum 0.085 0.04010 Mean (1st moment) 0.0302 0.03025 Variance (2nd moment) 4.503941 × 10-6 4.515323 × 10-6
Skewness (3rd moment) 1.877202 × 10 1.659259 × 10-2 Kurtosis (4th moment) 4.700735 × 102 3.004374
27
m1.small (Linux/Unix), us-east1 region. Dec. 25, 2010 – Feb. 23, 2011
The
dist
ribut
ion
of th
e pr
ices
is m
ore
heav
y ta
iled
(c
onfir
med
als
o by
Q-Q
plo
ts a
nd S
hapi
ro-W
ilk te
st)
PRICE PREDICTION ALGORITHM ① Collect price statistics, and compute mean and variance ② Compute ACF of the prices for lag l, with l being the prediction
horizon ③ If ACF(l) > 0.4,
a) Predict the future prices using linear regression, y = a + bx, otherwise
b) Use the quantile function (inverse CDF) of the Normal distribution to predict the future prices
• The inverse CDF returns the value of x such as P(X ≤ x) = p
④ Use the max value returned by (3) as a “limit price”
⑤ In case of out-of-bid events, increase the bid by 40%
28
PREDICTION ALGORITHM PERFORMANCE
29 m1.small (Linux/Unix), us-east1 region. Dec. 25, 2010 – Feb. 23, 2011.
The price of “on-demand” instances is 0.085$/h
Prediction (hours)
Target availability
Achieved availability
Avg. big ($/h)
6 90% 99.673% 0.03170 6 99% 99.673% 0.03270 6 99.999% 99.673% 0.03463 12 90% 99.675% 0.03207 12 99% 99.675% 0.03307 12 99.999% 99.675% 0.03499 24 90% 99.679% 0.03253 24 99% 99.679% 0.03350 24 99.999% 99.679% 0.03533
PART 2: MODEL AND SLA (1)
The SaaS provider 1. Pays the IaaS for renting computing resources 2. Charges its customer for executing WS transactions 3. Pays penalties to its customer for failing to meet performance and availability objectives
30
MODEL AND SLA (2)…
1. For each accepted and completed request, the client pays a charge of c $
2. Performance: if the avg. resp. time, β, over an interval of length t exceeds the threshold q, then the provider pays a penalty of r1 $ for each job executed in the interval
3. Availability: the provider should pay a penalty of r2 $ for each rejected job
4. Disaster: the provider is liable to pay a r3 $ for every accepted job that is lost due to resources becoming unavailable
5. The provider pays r4 $/h to rent each server • The provider tries to optimize the average net revenue
earned per unit time
31
… OR
• r1 = penalty for “performance” • r2 = penalty for “availability” • r3 = penalty for “disaster” • r4 = server rental cost • γ = rate at which jobs are accepted into the system
32
R = ![c! r1P(" > q)]! r2 (# !! )! r3P(lp < sp )L ! r4n
Prob. that the avg. resp. time exceeds the threshold
Rate at which jobs are rejected
Prob. that spot instances are terminated
Avg. no. of jobs in the system
POLICIES Resource allocation and admission control
• Admission control defined by means of a threshold, K
1. “Threshold” policy
• The simultaneous optimization of K and n is non trivial • Treats the system as an M/M/n/K queue • Uses a “Hill Climbing” algorithm to find the “best” value of
n and K 2. “Heuristic”: assumes that the response time is dominated
by the service time (true in “large” systems) • Treats the system as an GI/G/n queue • K = ∞, e.g., no job is rejected
33
PERFORMANCE EVALUATION - SETTINGS
b 0.5 sec. Average service time q 1 sec. Performance threshold c 2.951 × 10-5 $ Charge per job
r1 1.5 × c Penalty (performance) r2 2 × c Penalty (availability) r3 3 × c Penalty (disaster) r4 0.085 $/h Rental cost (“on-demand” instances) t 1 min. Interval length (SLA evaluation)
34
REVENUE AS FUNCTION OF K AND N
35
-100
-80
-60
-40
-20
0
20
40
60
20 40 60 80 100 120 140 160 180 200 220
Rev
enu
e, $
/h
Admission threshold, K
n=23n=27n=30n=31n=35
p(lp < cp) = 0
REVENUE AS FUNCTION OF K AND N
36
-100
-80
-60
-40
-20
0
20
40
60
20 40 60 80 100 120 140 160 180 200 220
Rev
enu
e, $
/h
Admission threshold, K
n=23n=27n=30n=31n=35
p(lp < cp) = 0.25
REVENUE AS FUNCTION OF K AND N
37
-100
-80
-60
-40
-20
0
20
40
60
20 40 60 80 100 120 140 160 180 200 220
Rev
enu
e, $
/h
Admission threshold, K
n=23n=27n=30n=31n=35
p(lp < cp) = 0.5
REVENUE AS FUNCTION OF R1, R2 AND K
38 n = 10, ρ = 10, p(lp < cp) = 0 (e.g., no disasters occur)
-4
-3
-2
-1
0
1
2
10 15 20 25 30 35 40
Rev
enu
e, $
/h
Admission threshold, K
r1=2.5c, r2=0r1=2c, r2=0.5c
r1=1.5c, r2=cr1=c, r2=1.5c
r1=0.5c, r2=2cr1=0, r2=2.5c
The system is saturated • Have to choose between a
long queue and rejecting many jobs
REVENUE AS FUNCTION OF THE ARRIVAL RATE
39 ca2=1, cs2=1, ρ = 7.5,…,25, p(lp < cp) = 0.001
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
15 20 25 30 35 40 45 50
Rev
enue
, $/h
Arrival rate, ! (req./sec.)
On-Demand (Threshold)On-Demand (Heuristic)
Spot (Threshold)Spot (Heuristic)
1. The revenue increases with the load
2. The use spot instances makes the system more profitable
3. The “Heuristic” policy performs almost as good as the “Threshold” algorithm
Each point represents a run lasting 28 days
CONCLUSIONS
• I have discussed an approach aiming at maximizing the net revenue earned by a SaaS using spot instances to provide a web service to paying customers 1. How much to bid for resources on the spot market? 2. How many servers should be allocated for a given time
period, and how many jobs to accept?
• The number of running servers, as well as the maximum number of jobs admitted into the system, have a significant effect on the earned revenues • The optimal queue length is highly dependent on the
availability level (e.g., shorter queue when the likelihood premature termination is high)
40
41