ACHIEVING PERFORMANCE AND AVAILABILITY GUARANTEES WITH AMAZON SPOT …tarmo/tday-torve/mazzucco-slides.pdf · 2011-10-13 · INTRODUCTION (1) • Cloud computing = pay-as-you go +

ACHIEVING PERFORMANCE AND AVAILABILITY GUARANTEES WITH AMAZON SPOT INSTANCES

MICHELE MAZZUCCO*

UNIVERSITY OF TARTU

* And Marlon Dumas. Thanks to Madeline Gonzalez for the useful discussions.

INTRODUCTION (1)

•  Cloud computing = pay-as-you-go + “unlimited” capacity
•  Amazon Elastic Compute Cloud (EC2)
   •  Web service providing resizable compute capacity
   •  Different instance types, e.g., small, large, xlarge servers
   •  Different purchasing options

1.  On-Demand Instances
    •  Pay for resources by the hour
    •  No long-term commitments
2.  Reserved Instances
    •  Upfront payment for reserving instances
    •  Discounted usage rate

INTRODUCTION (2)

3.  Spot Instances
    •  Users bid for unused resources at a “limit price” (the maximum price the bidder is willing to pay)
    •  Amazon gathers the bids and determines a clearing price (“spot price”) based on the bids and the available capacity
    •  A bidder gets the required instances if his/her limit price is above the clearing price. In this case, the bidder pays the clearing price (not his/her limit price)
    •  The clearing price is updated as new bids arrive. If the clearing price goes above the bidder’s limit price, the bidder’s running spot instances are terminated.

INTRODUCTION (3)

•  In this presentation I will focus on “spot instances”
•  They allow IaaS providers to sell spare capacity (via auctions), thus improving resource utilization
•  At the same time they enable IaaS users to acquire resources at a price lower than the on-demand price
•  However, spot instances are terminated any time the “clearing price” goes above the user’s “limit price”

PROBLEM STATEMENT

“Can we use spot instances to run paid web services while achieving performance and availability guarantees?”

TALK OUTLINE

•  Bidding strategy: decides how much to bid on the spot market
•  Server allocation: decides how many servers to use
•  Admission control: decides how many jobs to admit
•  Mission accomplished: meet performance targets and availability goals

PART 1: SPOT PRICE PREDICTION

•  We must determine the optimal “limit price” that the SaaS provider should bid on the spot market
•  The goal is to bid the lowest possible price capable of achieving the desired level of availability
•  Let’s see what some spot prices look like…

SPOT PRICES (M1.SMALL)

[Figure: m1.small (Linux/UNIX) spot price. Main panel: price ($/h, axis from 0.03 to 0.09) vs. date, 25/12 to 26/02. Zoomed panel: price ($/h, axis from 0.029 to 0.031) from Fri 11 to Tue 15.]

SPOT PRICES (M1.LARGE)

[Figure: m1.large (Linux/UNIX) spot price. Main panel: price ($/h, axis from 0.1 to 0.35) vs. date, 25/12 to 26/02. Zoomed panel: price ($/h, axis from 0.113 to 0.128).]

SPOT PRICES (M1.XLARGE)

[Figure: m1.xlarge (Linux/UNIX) spot price. Main panel: price ($/h, axis from 0.2 to 0.8) vs. date, 25/12 to 26/02. Zoomed panel: price ($/h, axis from 0.24 to 0.39).]

1ST ATTEMPT

•  Assume that prices follow patterns, e.g., day/night, weekdays/weekends, etc.
•  Use triple exponential smoothing (“Holt-Winters”) for predicting future prices

Updates the smoothed value:
$$S_t = \alpha \frac{y_t}{I_{t-L}} + (1-\alpha)(S_{t-1} + b_{t-1})$$

Updates the trend:
$$b_t = \gamma (S_t - S_{t-1}) + (1-\gamma) b_{t-1}$$

Updates the seasonal component:
$$I_t = \beta \frac{y_t}{S_t} + (1-\beta) I_{t-L}$$

Forecast m periods ahead:
$$F_{t+m} = (S_t + m b_t) I_{t-L+m}$$

The constants α, β and γ are chosen to minimize the MSE.
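The four update rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `holt_winters`, the argument order and the initialisation (level from the first season, trend from the difference between the first two seasons, seasonal indices from the first season) are choices made for this sketch and do not come from the slides. The fitting of α, β and γ by minimizing the MSE is not shown.

```python
def holt_winters(y, L, alpha, beta, gamma, m):
    """Multiplicative Holt-Winters forecast F_{t+m}, made at the last observation.

    y: observations (at least two full seasons), L: season length,
    alpha / gamma / beta: smoothing constants for level, trend and seasonality,
    m: forecast horizon, with 1 <= m <= L.
    """
    if len(y) < 2 * L or not 1 <= m <= L:
        raise ValueError("need at least two seasons and 1 <= m <= L")
    # Initialisation (an assumption of this sketch, not taken from the slides).
    first = sum(y[:L]) / L
    second = sum(y[L:2 * L]) / L
    S = first                      # smoothed value (level)
    b = (second - first) / L       # trend per time step
    I = [v / first for v in y[:L]]  # I[t % L] holds I_{t-L} until it is updated
    for t in range(L, len(y)):
        S_prev = S
        S = alpha * y[t] / I[t % L] + (1 - alpha) * (S_prev + b)
        b = gamma * (S - S_prev) + (1 - gamma) * b
        I[t % L] = beta * y[t] / S + (1 - beta) * I[t % L]
    t = len(y) - 1
    return (S + m * b) * I[(t + m) % L]
```

For a perfectly periodic series without trend, such as `[10, 20, 30, 40] * 5` with `L = 4`, the one-step forecast simply repeats the pattern.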

WIKIPEDIA TRAFFIC

[Figure: Wikipedia arrival rate (req./hour, axis from 2.0e+07 to 3.5e+07) vs. time (hours, 0 to 700).]

~ 10,000 req./sec at peak

WIKIPEDIA TRAFFIC: TIME SERIES DECOMPOSITION

[Figure: decomposition of the Wikipedia traffic into four panels (data, seasonal, trend, remainder) over a time axis from 0 to 30, where remainder = data – (trend + seasonal).]

The max irregular component is ~11% of the data

PREDICTING WIKIPEDIA TRAFFIC WITH HOLT WINTERS

[Figure: observed vs. predicted Wikipedia arrival rate (req./hour, axis from 2.0e+07 to 3.5e+07) vs. time (days, 0 to 30).]

PREDICTING SPOT PRICES WITH HOLT WINTERS (1)

[Figure: m1.small, observed vs. predicted price ($/h) vs. time (~ days, 10 to 40), with a zoomed panel for days 5–9 (time in hours, 0 to 120).]

[Figure: m1.large, observed vs. predicted price ($/h) vs. time (~ days, 10 to 40), with a zoomed panel for days 5–9 (time in hours, 0 to 120).]

[Figure: m1.xlarge, observed vs. predicted price ($/h) vs. time (~ days, 10 to 40), with a zoomed panel for days 5–9 (time in hours, 0 to 120).]

XLARGE PRICES: TIME SERIES DECOMPOSITION

[Figure: decomposition of the m1.xlarge prices into four panels (data, seasonal, trend, remainder) over a time axis from 5 to 15.]

The max irregular component is ~50% of the data

2ND ATTEMPT (1)

•  What about employing the distribution of the relative errors to correct the prediction?

[Figure: histogram of errors for m1.large. x-axis: relative error, (observed − predicted)/observed, from −0.2 to 0.2; y-axis: density. One region of the histogram is marked “We are interested in this part”.]
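One possible reading of “correcting the prediction with the error distribution” is sketched below. The slides do not spell out the correction rule, so everything here is an assumption: the helper names, the use of an empirical quantile of past relative errors, and the inversion of the error definition to inflate a new prediction.

```python
import math


def relative_errors(observed, predicted):
    """Relative error (observed - predicted) / observed for each pair."""
    return [(o - p) / o for o, p in zip(observed, predicted)]


def empirical_quantile(values, p):
    """Smallest sample v such that at least a fraction p of the samples are <= v."""
    ordered = sorted(values)
    k = max(0, math.ceil(p * len(ordered)) - 1)
    return ordered[k]


def corrected_prediction(prediction, errors, p):
    """Inflate a prediction so that, judging by past errors, the observed
    value would not exceed it with empirical probability p (hypothetical rule)."""
    e = empirical_quantile(errors, p)
    # From e = (observed - predicted) / observed: observed = predicted / (1 - e).
    return prediction / (1.0 - e)
```

With past errors `[0.1, 0.0, -0.1, 0.2]` and `p = 1.0`, a prediction of 80 is raised to about 100, i.e. enough to cover the worst under-prediction seen so far.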

2ND ATTEMPT (2)

[Figure: empirical CDF, Fn(x), of the relative error for the Wikipedia traffic; x-axis from 0.00 to 0.20, y-axis from 0.0 to 1.0.]

All the arrival rates can be estimated with 83% accuracy

2ND ATTEMPT (3)

[Figure: empirical CDF, Fn(x), of the relative error for the xlarge prices; x-axis from 0.0 to 0.6, y-axis from 0.2 to 1.0.]

It does not really work, does it?

ONE MORE THING…

[Figure: price ($/h, axis from 0.03 to 0.07) vs. time (hours) for two periods. Panel 1: Dec. 25, 2010 − Feb. 23, 2011 (0 to 1000 hours). Panel 2: Jun. 22, 2011 − Sept. 23, 2011 (0 to 500 hours).]

•  ... they look quite different, don’t they?

3RD ATTEMPT

1.  Employ the autocorrelation function (ACF) to determine the similarity between prices as a function of the time difference between them (lag)

$$\mathrm{ACF}(l) = \frac{\sum_{t=1}^{N-l} (x_t - \bar{x})(x_{t+l} - \bar{x})}{\sum_{t=1}^{N} (x_t - \bar{x})^2}$$

where x_t is the value at time t, x̄ is the average value, N is the no. of observations and l is the lag.
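The ACF definition translates almost directly into code. The sketch below is a plain-Python illustration; the function name `acf` and the error raised for a constant series are choices made here. The numerator runs over the N − l pairs for which both x_t and x_{t+l} exist.

```python
def acf(x, lag):
    """Sample autocorrelation of the series x at the given lag (0 <= lag < len(x))."""
    n = len(x)
    mean = sum(x) / n
    denominator = sum((v - mean) ** 2 for v in x)
    if denominator == 0:
        raise ValueError("ACF is undefined for a constant series")
    # Only the n - lag pairs (x_t, x_{t+lag}) that actually exist are summed.
    numerator = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return numerator / denominator
```

A value close to 1 at lag l means that prices l hours apart tend to move together; a value close to 0 means that the price now says little about the price l hours later.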

3RD ATTEMPT

•  The autocorrelation function (ACF) of the prices confirms the lack of any relationship between prices over time

2.  Use a normal approximation to model prices
    •  There is some evidence that prices in similar markets are approximately normally distributed

[Figure: ACF (axis from −0.2 to 1.0) vs. lag (hours, 0 to 150) for m1.small.]

3RD ATTEMPT

[Figure: ACF (axis from 0.0 to 1.0) vs. lag (hours, 0 to 150) for m1.large.]

[Figure: ACF (axis from 0.0 to 1.0) vs. lag (hours, 0 to 150) for m1.xlarge.]

SPOT PRICE DISTRIBUTION

Statistic               Data               Normal approx.
Minimum                 0.029              0.02098
1st quartile            0.029              0.02877
Median                  0.031              0.03020
3rd quartile            0.031              0.03163
Maximum                 0.085              0.04010
Mean (1st moment)       0.0302             0.03025
Variance (2nd moment)   4.503941 × 10^-6   4.515323 × 10^-6
Skewness (3rd moment)   1.877202 × 10      1.659259 × 10^-2
Kurtosis (4th moment)   4.700735 × 10^2    3.004374

m1.small (Linux/Unix), us-east-1 region. Dec. 25, 2010 – Feb. 23, 2011

The distribution of the prices is heavier-tailed than the normal approximation (also confirmed by Q-Q plots and the Shapiro-Wilk test)

PRICE PREDICTION ALGORITHM

①  Collect price statistics, and compute mean and variance
②  Compute the ACF of the prices for lag l, with l being the prediction horizon
③  If ACF(l) > 0.4,
    a)  Predict the future prices using linear regression, y = a + bx, otherwise
    b)  Use the quantile function (inverse CDF) of the Normal distribution to predict the future prices
        •  The inverse CDF returns the value of x such that P(X ≤ x) = p
④  Use the max value returned by ③ as the “limit price”
⑤  In case of out-of-bid events, increase the bid by 40%
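The five steps can be sketched as follows. This is a hedged illustration, not the implementation behind the slides: the function names are invented for this sketch, the probability `p` passed to the inverse CDF is assumed to be chosen from the target availability, the regression is fitted on the whole price history, and a flat history simply returns the mean. The inverse CDF comes from `statistics.NormalDist` in the Python standard library.

```python
from math import sqrt
from statistics import NormalDist, mean, pvariance


def acf(x, lag):
    """Sample autocorrelation of x at the given lag (x must not be constant)."""
    m = mean(x)
    den = sum((v - m) ** 2 for v in x)
    num = sum((x[t] - m) * (x[t + lag] - m) for t in range(len(x) - lag))
    return num / den


def linear_fit(y):
    """Least-squares coefficients (a, b) of y = a + b*x for x = 0, 1, 2, ..."""
    n = len(y)
    x_bar, y_bar = (n - 1) / 2, mean(y)
    sxx = sum((x - x_bar) ** 2 for x in range(n))
    sxy = sum((x - x_bar) * (v - y_bar) for x, v in enumerate(y))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def limit_price(prices, horizon, p, acf_threshold=0.4):
    """Steps 1-4: return the limit price for the next `horizon` periods."""
    mu, var = mean(prices), pvariance(prices)           # step 1
    if var == 0:
        return mu                                        # flat history (assumption)
    if acf(prices, horizon) > acf_threshold:             # steps 2 and 3
        a, b = linear_fit(prices)                        # step 3a
        n = len(prices)
        predictions = [a + b * (n - 1 + h) for h in range(1, horizon + 1)]
    else:
        predictions = [NormalDist(mu, sqrt(var)).inv_cdf(p)]  # step 3b
    return max(predictions)                              # step 4


def rebid(bid, increase=0.40):
    """Step 5: raise the bid after an out-of-bid event."""
    return bid * (1 + increase)
```

With a strongly trending history the regression branch is taken and the bid is the largest extrapolated price; with uncorrelated prices the bid is the p-quantile of the fitted normal distribution.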

PREDICTION ALGORITHM PERFORMANCE

m1.small (Linux/Unix), us-east-1 region. Dec. 25, 2010 – Feb. 23, 2011.

The price of “on-demand” instances is 0.085 $/h

Prediction horizon (hours)   Target availability   Achieved availability   Avg. bid ($/h)
6                            90%                   99.673%                 0.03170
6                            99%                   99.673%                 0.03270
6                            99.999%               99.673%                 0.03463
12                           90%                   99.675%                 0.03207
12                           99%                   99.675%                 0.03307
12                           99.999%               99.675%                 0.03499
24                           90%                   99.679%                 0.03253
24                           99%                   99.679%                 0.03350
24                           99.999%               99.679%                 0.03533

PART 2: MODEL AND SLA (1)

The SaaS provider
1.  Pays the IaaS provider for renting computing resources
2.  Charges its customers for executing WS transactions
3.  Pays penalties to its customers for failing to meet performance and availability objectives

MODEL AND SLA (2)…

1.  For each accepted and completed request, the client pays a charge of c $
2.  Performance: if the avg. resp. time, β, over an interval of length t exceeds the threshold q, then the provider pays a penalty of r1 $ for each job executed in the interval
3.  Availability: the provider pays a penalty of r2 $ for each rejected job
4.  Disaster: the provider is liable to pay a penalty of r3 $ for every accepted job that is lost due to resources becoming unavailable
5.  The provider pays r4 $/h to rent each server

•  The provider tries to optimize the average net revenue earned per unit time

… OR

$$R = \gamma\,[c - r_1 P(\beta > q)] - r_2 (\lambda - \gamma) - r_3 P(l_p < s_p)\,L - r_4 n$$

•  r1 = penalty for “performance”
•  r2 = penalty for “availability”
•  r3 = penalty for “disaster”
•  r4 = server rental cost
•  γ = rate at which jobs are accepted into the system
•  P(β > q) = prob. that the avg. resp. time exceeds the threshold
•  λ − γ = rate at which jobs are rejected
•  P(lp < sp) = prob. that spot instances are terminated (limit price below spot price)
•  L = avg. no. of jobs in the system
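As a quick way to experiment with the revenue expression, it can be written as a small function. The name `net_revenue` and its parameter names are made up for this sketch; the caller is assumed to express the rates and the rental cost in the same time unit (for example, everything per hour).

```python
def net_revenue(gamma, lam, n, c, r1, r2, r3, r4,
                p_slow, p_terminated, jobs_in_system):
    """Average net revenue R per unit time.

    gamma: rate of accepted jobs, lam: arrival rate, n: number of servers,
    c: charge per job, r1/r2/r3: penalties, r4: rental cost per server,
    p_slow: P(avg. resp. time > q), p_terminated: P(limit price < spot price),
    jobs_in_system: average number of jobs in the system (L).
    """
    income = gamma * (c - r1 * p_slow)            # charges minus performance penalties
    rejected = r2 * (lam - gamma)                 # availability penalties
    lost = r3 * p_terminated * jobs_in_system     # disaster penalties
    rental = r4 * n                               # server rental cost
    return income - rejected - lost - rental
```

For example, with 10 accepted out of 12 arriving jobs per unit time, c = 1, r1 = 1.5, r2 = 2, r4 = 0.5, two servers, a 10% chance of missing the response-time target and no terminations, the function returns 3.5.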

POLICIES

Resource allocation and admission control

•  Admission control defined by means of a threshold, K

1.  “Threshold” policy
    •  The simultaneous optimization of K and n is non-trivial
    •  Treats the system as an M/M/n/K queue
    •  Uses a “Hill Climbing” algorithm to find the “best” values of n and K
2.  “Heuristic”: assumes that the response time is dominated by the service time (true in “large” systems)
    •  Treats the system as a GI/G/n queue
    •  K = ∞, i.e., no job is rejected
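To make the “Threshold” policy more concrete, the sketch below computes the standard steady-state metrics of an M/M/n/K queue (K is taken as the maximum number of jobs in the system, K ≥ n) and runs a simple greedy hill climb over integer pairs (n, K). The function names, the neighbourhood (±1 in each coordinate) and the stopping rule are assumptions of this sketch; the slides do not describe the search in that detail.

```python
def mmnk_metrics(lam, b, n, K):
    """Return (rejection probability, accepted rate, mean jobs in system)
    for an M/M/n/K queue with arrival rate lam and mean service time b."""
    rho = lam * b                       # offered load
    weights = [1.0]                     # unnormalised state probabilities
    for j in range(1, K + 1):
        # With j jobs in the system, min(j, n) servers are busy.
        weights.append(weights[-1] * rho / min(j, n))
    total = sum(weights)
    probs = [w / total for w in weights]
    p_reject = probs[K]                 # an arrival is rejected when the system is full
    mean_jobs = sum(j * p for j, p in enumerate(probs))
    return p_reject, lam * (1 - p_reject), mean_jobs


def hill_climb(objective, n, K, max_steps=10_000):
    """Greedy local search maximising objective(n, K) subject to 1 <= n <= K."""
    best = objective(n, K)
    for _ in range(max_steps):
        neighbours = [(n + dn, K + dk)
                      for dn, dk in ((1, 0), (-1, 0), (0, 1), (0, -1))
                      if 1 <= n + dn <= K + dk]
        value, candidate = max((objective(a, k), (a, k)) for a, k in neighbours)
        if value <= best:               # no neighbour improves: local optimum
            break
        best, (n, K) = value, candidate
    return n, K, best
```

In the setting of the talk, `objective` would evaluate the net revenue R from the queue metrics for a given n and K. Like any hill climber, the search can stop in a local optimum, so the starting point matters.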

PERFORMANCE EVALUATION - SETTINGS

Parameter   Value             Description
b           0.5 sec.          Average service time
q           1 sec.            Performance threshold
c           2.951 × 10^-5 $   Charge per job
r1          1.5 × c           Penalty (performance)
r2          2 × c             Penalty (availability)
r3          3 × c             Penalty (disaster)
r4          0.085 $/h         Rental cost (“on-demand” instances)
t           1 min.            Interval length (SLA evaluation)

REVENUE AS FUNCTION OF K AND N

[Figure: three panels of revenue ($/h, axis from −100 to 60) vs. admission threshold K (20 to 220), each with curves for n = 23, 27, 30, 31 and 35. Panel 1: p(lp < cp) = 0. Panel 2: p(lp < cp) = 0.25. Panel 3: p(lp < cp) = 0.5.]

REVENUE AS FUNCTION OF R1, R2 AND K

n = 10, ρ = 10, p(lp < cp) = 0 (i.e., no disasters occur)

[Figure: revenue ($/h, axis from −4 to 2) vs. K (10 to 40), with one curve per penalty pair: (r1 = 2.5c, r2 = 0), (r1 = 2c, r2 = 0.5c), (r1 = 1.5c, r2 = c), (r1 = c, r2 = 1.5c), (r1 = 0.5c, r2 = 2c), (r1 = 0, r2 = 2.5c).]

The system is saturated
•  Have to choose between a long queue and rejecting many jobs

REVENUE AS FUNCTION OF THE ARRIVAL RATE

ca² = 1, cs² = 1, ρ = 7.5,…,25, p(lp < cp) = 0.001

[Figure: revenue ($/h, axis from 0 to 4.5) vs. arrival rate, λ (req./sec., 15 to 50), with four curves: On-Demand (Threshold), On-Demand (Heuristic), Spot (Threshold), Spot (Heuristic). Each point represents a run lasting 28 days.]

1.  The revenue increases with the load
2.  The use of spot instances makes the system more profitable
3.  The “Heuristic” policy performs almost as well as the “Threshold” algorithm

CONCLUSIONS

•  I have discussed an approach aiming at maximizing the net revenue earned by a SaaS provider using spot instances to provide a web service to paying customers
   1.  How much to bid for resources on the spot market?
   2.  How many servers should be allocated for a given time period, and how many jobs to accept?
•  The number of running servers, as well as the maximum number of jobs admitted into the system, has a significant effect on the earned revenue
   •  The optimal queue length is highly dependent on the availability level (e.g., shorter queue when the likelihood of premature termination is high)
