Cost-Minimizing Online VM Purchasing for Application ...i.cs.hku.hk/~cwu/papers/skshi-cloud15.pdf · framework for dynamic, cost-minimizing migration of content distribution services

Cost-Minimizing Online VM Purchasing for ApplicationService Providers with Arbitrary Demands

Shengkai Shi∗‡, Chuan Wu∗, and Zongpeng Li†∗The University of Hong Kong, †University of Calgary, ‡Ecosystem and Cloud Services Business Group, Lenovo

Email: [email protected], [email protected], [email protected]

Abstract—Recent years witness the proliferation ofInfrastructure-as-a-Service (IaaS) cloud services, whichprovide on-demand resources (CPU, RAM, disk) in the formof virtual machines (VMs) for hosting applications/services ofthird parties. Given the state-of-the-art IaaS offerings, it isstill a problem of fundamental importance how the ApplicationService Providers (ASPs) should rent VMs from the cloudsto serve their application needs, in order to minimize thecost while meeting their job demands over a long run. Cloudproviders offer different pricing options to meet computingrequirements of a variety of applications. However, the challengefacing an ASP is how these pricing options can be dynamicallycombined to serve arbitrary demands at the optimal cost. In thispaper, we propose an online VM purchasing algorithm basedon the Lyapunov optimization technique, for minimizing thelong-term-averaged VM rental cost of an ASP with time-varyingand delay-tolerant workloads, while bounding the maximumresponse delay of its jobs. In stark contrast with the existingstudies, the proposed algorithm enables an ASP to optimallydecide the amount of reserved, on-demand and spot instancesto purchase simultaneously. Rigorous analysis shows that ouralgorithm can achieve a time-averaged resource cost close tothe offline optimum. Trace-driven simulations further verify theefficacy of our algorithm.

I. INTRODUCTION

As a major type of cloud services, Infrastructure-as-a-Service (IaaS) cloud offerings provide abundant and elasticcomputing resources for third party usage, which has re-markably revolutionized the way of enabling scalable anddynamic Internet applications. More and more ApplicationService Providers (ASPs) are launching their applicationsin clouds, without the need for building and maintainingtheir owned IT infrastructures. The leading online contentprovider Netflix [1] offers on-demand Internet video serviceand receives enormous streaming requests from its worldwidesubscribers every minute. With Amazon EC2 [2], Netflix canrun critical encoding tasks, to serve its clients’ video demands,with a number of VM instances configured with the selectedencoding software, and shut them down when completed [3].

Cost management is still a crucial task in such a switch tocloud-based services. VM instance purchases from the cloudsplay a critical role for cost management of an ASP. Under thepay-as-you-go pricing model, a practical problem for ASPs ishow to minimize the VM purchasing cost while guaranteeinga good service performance. Cloud vendors usually offer

This paper was supported in part by grants from Hong Kong RGC undercontracts HKU 717812 and 718513.

TABLE IPRICING OF RESERVED INSTANCE, ON-DEMAND INSTANCE AND SPOT

INSTANCE (LINUX, US WEST) IN AMAZON EC2, AS OF APR 25, 2015.

Instance Type Pricing Option Up-front Hourly

m3.medium1-year reserved $372 $0.0425on-demand $0 $0.070spot $0 $0.008

m3.large1-year reserved $751 $0.0857on-demand $0 $0.140spot $0 $0.0255

multiple pricing options that allow the flexibility to optimizecosts [4]. The commonly adopted cloud pricing schemes are(1) reserved instance pricing, (2) on-demand instance pricing,and (3) spot instance pricing. With reserved instances, userspay an one-time upfront fee and reserve instances with asignificantly lower hourly charge for a 1-year or 3-year term.On-demand instances enable users to pay a fixed hourly ratewith no long-term usage commitment. Spot instances, offeredby Amazon EC2 [2], allow users to bid whatever price theywant for spare instances with no upfront commitment andto run them at an hourly rate substantially lower than theon-demand rate, whenever their bid price is larger than thespot market price. A pricing example of reserved instance,on-demand instance and spot instance is given in Table I.

While reserved instances are more beneficial for application-s with long-term steady or predictable workload, on-demandinstances are more recommended for applications with short-term spiky or unpredictable workload. Spot instances couldact as complements of reserved and on-demand instancesto achieve further cost savings. However, the spot price isfluctuating all the time in tune with the demand and supplylevels. Fig. 1 shows a significant variation in Amazon EC2 spotprices for Linux/UNIX instances of type r3.xlarge from Mar27, 2015, to Apr 24, 2015. We observe that at times the spotprice exceeds even the on-demand price. Thus purchasing spotinstances risks frequent job interruptions during the execution,and is more suitable for time-flexible or delay-tolerant appli-cations. Simply operating the entire workload with only onepricing option can be highly cost-ineffective. Since gaps doexist among different pricing schemes, it is quite desirable yetchallenging for ASPs to intermingle different pricing optionsbased on their own demand, in order to optimize the long-term-averaged cost. In particular, with time-varying demands,one ASP should answer two basic questions at any decision-making instant: (1) how many instances to purchase, and (2)which type to purchase (reserved, on-demand, spot or all)?

2015 IEEE 8th International Conference on Cloud Computing

2159-6190/15 $31.00 © 2015 IEEE

DOI 10.1109/CLOUD.2015.29

146

Fig. 1. The variation in Amazon EC2 spot prices for Linux/UNIX r3.xlargeinstances in the US-West-2c region from Mar 27, 2015, to Apr 24, 2015.

Amazon EC2 introduced one customized service, AWSTrusted Advisor, to help users realize more cost reduction [5].The AWS Trusted Advisor can make recommendations onwhether the customer can save money with a more suitableVM instance, by drawing on the previous usage data aggre-gated across all consolidated billing accounts with complexdata mining and machine learning techniques. There havebeen some efforts on IaaS cost management in terms ofuncertain demand [6][7], but they require a priori knowledgeof the workloads or accurate prediction of future information.Even though some statistics may be obtained using dynamicprogramming techniques, it suffers from high computationcomplexity, and hence not suitable for online decision makingin practice [8][9]. In addition, very few studies have addressedthe randomness of the spot prices, and more importantly, howto achieve cost optimization under this price uncertainty.

To the contrast, we seek to integrate all available pricingschemes and design effective online algorithms for the long-term operation of ASPs. We formulate the long-term-averagedVM cost minimization problem of an ASP with time-varyingand delay-tolerant workloads in a stochastic optimization mod-el. An efficient online VM purchasing algorithm is designedto guide the VM purchasing decisions of the ASP based onthe Lyapunov optimization technique. Lyapunov optimizationprovides a framework for designing algorithms with perfor-mance arbitrarily close to the optimal offline performance overa long run of the system, without the need for any futureinformation. It has been widely applied to resource allocationoptimization in data centers [10][11]. Different from theseexisting studies, our online VM purchasing algorithm does notrequire any prior knowledge or assume any distribution of theworkload. Moreover, it addresses the possible job interruptiondue to uncertain availability of spot instances. To our bestknowledge, this work is the first effort on jointly leveraging allthree common IaaS cloud pricing options, in order to exploitthe highly-coveted cost advantages of cloud computing.

The rest of the paper is organized as follows. We reviewrelated literature in Sec. II, describe the system model inSec. III, design the online algorithm in Sec. IV, present thesimulation results in Sec. V, and conclude the paper in Sec. VI.

II. RELATED WORK

We now provide a snapshot of the related work. Therehas been a growing interest in cost management of clouds.

Sharma et al. [12] propose a cost-aware capacity provision-ing mechanism for ASPs to choose server configurationsand reconfigurations in order to minimize the rental cost ofcloud infrastructure. Qiu et al. [13] propose an optimizationframework for dynamic, cost-minimizing migration of contentdistribution services into a hybrid cloud infrastructure thatspans geographically distributed data centers. Khanafer etal. [14] model the cost optimization problem of a cloudfile system as a variant of classical Ski-Rental problem, andpropose new randomized algorithms to generate significantcost savings. Setty et al. [15] provide a cost-effective resourceprovisioning scheme for deploying publish/subscribe servicesin the cloud, so as to minimize the total cost of VM acquisitionand bandwidth consumption, while ensuring a good subscribersatisfaction. Roh et al. [16] formulate a concave game takinginto account both the resource pricing of clouds and resourcecompetition of ASPs, and investigate the characteristics of theequilibrium point.

There have been some works discussing the strategic combi-nation of different cloud pricing options. Leslie et al. [17] andLu et al. [18] exploit cost-effective hybrid resource provision-ing approaches for deploying applications on on-demand andspot VMs. Menache et al. [19] propose an online learningalgorithm for allocating on-demand and spot VM instancesfor batch applications. The candidate policy weights are dy-namically adjusted through learning from performance on jobexecutions, spot prices and workload characteristics, in orderto reinforce best performing policies. Hong et al. [7] studythe costs of margins, which are a pool of servers kept activefor unpredictable potential workload surges, and propose adynamic programming approach for minimizing the margincost. Then a VM purchasing strategy combining reserved andon-demand VMs is designed to achieve optimal true cost.However, the proposed approaches are possible only when apriori knowledge of the workload is available. Zhao et al. [6]analyze the time-varying spot prices in Amazon EC2 and showthat the spot price is highly unpredictable. Then a stochasticresource rental planning model is proposed to take spot priceuncertainty into account, and a hybrid VM renting approachbased on on-demand and spot VM instances is provided. Thedemand in the planning horizon is simply assumed to beknown. Wang et al. [9] design an instance reservation strategyvia a cloud brokerage mechanism to minimize the total cost ofboth reserved and on-demand VM instances. A cloud brokercan aggregate the demands from a large number of users to s-mooth out individual demand bursts, and time-multiplex partialusage in the same instance-hour. The proposed method suffersfrom the prohibitive complexity of dynamic programming andrelies on workload prediction. Based on [9], Wang et al. [8]further propose a deterministic algorithm and a randomizedonline reservation algorithm blending the two pricing optionswithout any knowledge of future demand. However, spot VMsare not leveraged for achieving more cost reductions and moreflexibility in VM acquisition. Also in general, these schemesdo not provide any tradeoff guarantee between the cost andthe service performance. Our design addresses these issues.

147

VM VM VM VMVM

Customer

Cloud ProviderASP

Service Request

Fig. 2. System overview.

III. SYSTEM MODEL

As illustrated in Fig. 2, we consider an ASP providingservices to its customers over the Internet. Instead of usingits own resources, the ASP accepts and processes job requestswith VMs purchased from an IaaS cloud provider. The systemruns in a time-slotted fashion with time slots of equal lengthsindexed by t = 0, 1, . . ., where each t is a decision-makinginstant of the ASP. In practice, a slot t could be one hour.

A. Job ModelThere are total G types of jobs served by the ASP. Each job,

or service request from a customer of the ASP, is characterizedby a three-tuple (sg, lg,mg). Here, sg ∈ [1, S] represents thetype of the required VM, where S is the maximum number ofVM types, and each type comprises a different configuration ofCPU, memory, and storage; lg is the Service Level Agreement(SLA) of the type-g job, specified by the maximum responsedelay for scheduling a job, i.e., the time-span from whenthe job arrives to when it is dropped or starts to run onthe scheduled VM; mg is the number of time slots requiredand is referred to the workload of one type-g job. During itsexecution, a job could be suspended and resumed later.

B. Scheduling ModelA FIFO queue is maintained at the ASP to store the

unscheduled workload, Qg , for each job type g ∈ [1, G]. Inevery time slot, jobs arrive at the ASP. Let rg(t) ∈ [0, rmax

g ]denote the number of type-g jobs that arrive at the beginningof each time slot t, where rmax

g is the maximum value ofrg(t). We assume that rg(t) is an i.i.d stochastic process acrosstime slots. Upon arrival of rg(t) type-g jobs, mgrg(t) units ofworkload are appended to Qg. The ASP then decides howto distribute the awaiting jobs. We use ug(t) to represent thenumber of type-g jobs successfully scheduled in time slot t.When a type-g job is scheduled for processing, the job departsfrom its queue and starts to run on a type-sg VM. Let ug(t−)denote the number of type-g jobs scheduled before t, whichare still running at t. For each newly scheduled type-g jobat t or each leftover type-g job at t, one unit workload isdeducted from Qg at the end of time slot t. When a job’smaximum response time cannot be met, it is dropped. LetDg(t) ∈ [0, Dmax

g ] be the number of type-g jobs dropped

by the ASP in time slot t, where Dmaxg is the maximum

number of type-g jobs allowed to be dropped in one timeslot. The drop of Dg(t) type-g jobs introduces mgDg(t) unitsof workload reduced from Qg. Denote Qg(t) as the totalunscheduled workload of type-g jobs in time slot t. Thus, itevolves over time following the dynamics specified by

Qg(t+ 1) =max{Qg(t)− ug(t)− ug(t−)−mgDg(t), 0}

+mgrg(t), (1)

assuming that the initial queue is empty, i.e., Q(0) = 0.In order to satisfy the SLA constraint, we define the

following virtual queues, each associated with a workloadqueue Qg. We use the ϵ−persistence queue technique [20]to create virtual queue Zg(t), which starts from Zg(0) = 0and evolves as

Zg(t+ 1) =max{Zg(t) + 1{Qg(t)>0} · [ϵg − ug(t)− ug(t−)]

−mgDg(t)− 1{Qg(t)=0}umaxg , 0}, ∀g ∈ [1, G].

(2)

Here, ϵg is a constant that ensures Zg(t) grows whenever thereis unscheduled workload in Qg, and umax

g is the maximumnumber of type-g jobs that could be processed simultaneouslyby the ASP, with 0 ≤ ug(t) + ug(t−) ≤ umax

g . Indicatorfunction 1{Qg(t)>0} is 1 when Qg(t) > 0 and 0 otherwise.Similarly, 1{Qg(t)=0} is 1 when Qg(t) = 0 and 0 otherwise.Length of a virtual queue reflects the cumulated responsedelay of workloads from the respective workload queue. Ifthe system can bound the maximum lengths of the workloadqueues and virtual queues with properly set ϵg, then themaximum response delay of jobs can be bounded.

C. VM Provisioning Model

The system deploys a hybrid VM provisioning mechanismthrough an integration of three different pricing options: (1)reserved instances, (2) on-demand instances, and (3) spotinstances. After observing the VM demands of all jobs beingprocessed, including newly scheduled jobs and leftover jobs,the ASP makes a decision to reserve as(t) ∈ [0, amax

s ] type-s VMs in time slot t (i.e., reserved instances), ∀s ∈ [1, S],where amax

s is the reservation limit of type-s instances pertime slot. Let N denote the fixed number of time slotsany reserved VM is reserved for (i.e., the reservation pe-riod). Each reserved type-s instance will stay effective inthe whole reservation period [t, t + N − 1]. So the totalnumber of type-s reserved VMs remaining effective at t is!t

τ=t−N+1 as(τ). The aggregated reserved resource may beinsufficient to accommodate all job demands. The ASP mayalso launch additional bs(t) ∈ [0, bmax

s ] on-demand type-sVMs, ∀s ∈ [1, S], to serve the residual demand, where bmax

s

is the on-demand limit of type-s instances per time slot.Spot VMs can also be launched as an alternative to reserved

or on-demand VMs. After observing the current spot price,the ASP bids for spot instances and use them whenever itswillingness-to-pay is larger than the spot price. How to setthe bid prices depends on the ASP’s evaluation on a number

148

of operational factors. In this paper, we set the bid price fortype-s spot VMs as the price of type-s on-demand VMs. Therationale is that it is more cost-efficient to launch spot VMsonly when the current spot price is smaller than the currenton-demand price. Let fs(t) ∈ [0, fmax

s ] denote the number oftype-s spot VMs the ASP obtains in time slot t, where fmax

s

is the upper limit. Different from on-demand and reservedinstances, spot instances are priced in real-time as illustratedin Fig. 1. So the ASP should always prepare for the possibilitythat its spot instances are terminated when the spot pricesexceed the bid prices. Since the bid price is the same foreach type-s spot VM started at the same time slot, all type-sspot VMs acquired at t will be terminated once the spot priceexceeds the bid price. We consider that the interrupted jobsdue to terminated spot VMs will be served immediately bynewly launched on-demand VMs. In case that an out-of-bidevent occurs for type-s spot VMs during time slot t, fs(t)on-demand VMs will be instantly purchased to carry on theleftover workloads of the interrupted jobs. We assume thatthe switch time from terminated spot VMs to newly launchedon-demand VMs is negligible.

At each time slot, the total supply of VMs purchased shouldalways accommodate the total demand of job scheduling:

t"

τ=t−N+1

as(τ) + bs(t) + 1{βs>γs(t)}fs(t)

≥"

g:sg=s

[ug(t) + ug(t−)], ∀t, ∀s ∈ [1, S]. (3)

IV. DYNAMIC COST MINIMIZING ALGORITHM

A. Problem Formulation

A number of cost parameters are associated with the VMcost optimization. When dropping one job, a penalty is en-forced to compensate for the customer’s loss. Let σg denotethe penalty to drop one type-g job. For reserved instances, auser needs to make a one-time payment for the reservation,and the charging policies slightly differ across different typesof reserved instances [21]. We limit our discussions to reservedinstances with fixed costs, which represent a common case inIaaS clouds. The total cost of a reserved instance is abstractedas a one-time upfront reservation fee, which is denoted as αs

for a type-s reserved VM.We assume that the length of each time slot t matches the

duration of one billing cycle for on-demand and spot instances,e.g., one hour. Let βs denote the price of running one type-son-demand VM per billing cycle. To model the availability ofspot VMs during one time slot, we use Ps(t) ∈ [0, 1] to denotethe probability that a termination event may happen within[t, t+1] for the type-s spot VMs acquired at t, which could bedynamically estimated based on historical price series, withina window of past time slots of a certain size. Let γs(t) denotethe spot price at t of running one type-s spot VM for onebilling cycle. We consider two cases: (1) Case 1: a type-sspot VM successfully runs for one entire time slot and ischarged at γs(t) for the billing cycle [t, t+1], which happens

with probability 1 − Ps(t). (2) Case 2: a type-s spot VMis terminated during [t, t + 1] with probability Ps(t), and isreplaced immediately by a newly launched type-s on-demandVM to process the leftover workloads. Spot instances will notbe charged for any partial billing cycle of usage, while partialbilling cycle consumed by on-demand instances is charged asa full billing cycle. Thus the expected VM cost in Case 2should be [1− Ps(t)]γs(t) + Ps(t)βs.

Given the system model and cost model, the total VM costto accommodate all demand in time slot t is given by

Cost(t) ="

s∈[1,S]

{αsas(t) + βsbs(t)

+ 1{βs>γs(t)}[Ps(t)βsfs(t)

+ (1− Ps(t))γs(t)fs(t)]}+"

g∈[1,G]

Dg(t)σg. (4)

The time-averaged expected VM purchasing cost is

Costav = limT→∞

1

T

T−1"

t=0

E[Cost(t)]. (5)

Therefore, the VM cost minimization pursued by the ASP canbe formulated as follows

min Costav (6)s.t. 0 ≤ rg(t) ≤ rmax

g , ∀g ∈ [1, G], ∀t; (7)0 ≤ ug(t) + ug(t

−) ≤ umaxg , ∀g ∈ [1, G], ∀t; (8)

0 ≤ Dg(t) ≤ Dmaxg , ∀g ∈ [1, G], ∀t; (9)

0 ≤ as(t) ≤ amaxs , ∀s ∈ [1, S], ∀t; (10)

0 ≤ bs(t) ≤ bmaxs , ∀s ∈ [1, S], ∀t; (11)

0 ≤ fs(t) ≤ fmaxs , ∀s ∈ [1, S], ∀t; (12)

Constraints (1)-(3),

where umaxg = N ∗amax

sg +bmaxsg +fmax

sg . The objective of ouroptimization problem is to make dynamic job scheduling andVM acquiring decisions, so as to minimize long-term time-averaged VM cost. Important notations are summarized inTable II, for the ease of reference.

B. Online Algorithm Design

We now design an online algorithm to solve the costminimization problem in (6). To minimize the time-averagedobjective function in (6) based on decisions at each time slot,we resort to the drift-plus-penalty framework in Lyapunovoptimization [20], a classical technique for transforming along-term time-average optimization problem into a series ofsimilar one-shot optimization problems. In each time slot t,we have a set of queues Θ(t) as

Θ(t) = {Qg(t), Zg(t)|g ∈ [1, G]}. (13)

Define Lyapunov function as follows

L(Θ(t)) =1

2

"

g∈[1,G]

[(Qg(t))2 + (Zg(t))

2], (14)

149

TABLE IIIMPORTANT NOTATIONS

sg type of the VM required by one type-g jobmg number of time slots required by one type-g joblg maximum response delay for scheduling one type-g jobrg(t) number of type-g jobs arriving at the beginning of tDg(t) number of type-g jobs dropped by the ASP at tug(t) number of type-g jobs scheduled for processing at tug(t−) number of running type-g jobs left over before tQg(t) length of workload queue for type-g jobs at tas(t) number of new type-s reserved VMs at tN reservation period for reserved VMs / length of a super time

framebs(t) number of launched type-s on-demand VMs at tfs(t) number of obtained type-s spot VMs at tαs upfront reservation fee for one type-s reserved VMβs price for running one type-s on-demand VM per billing cycleγs(t) spot price at t of running one type-s spot VM for one billing

cycleσg penalty for dropping one type-g jobPs(t) estimated probability of a termination event in time span [t,

t+1] for type-s spot VMs obtained at t

where the constant 12 is added for the convenience of math-

ematical derivations. Next, we define the one-slot conditionalLyapunov drift as follows

∆(Θ(t)) = E[L(Θ(t+ 1))− L(Θ(t))|Θ(t)]. (15)

Following the framework of Lyapunov drift-plus-penaltyalgorithm, we add the VM purchasing cost as a penaltyfunction to obtain the drift-plus-penalty term

∆(Θ(t)) + V Cost(t), (16)

where V > 0 is a user-defined positive constant that can beunderstood as the weight of the VM purchasing cost in thisexpression, and can be tuned to different values to indicatethe tradeoff between the cost and the SLA guarantee. Thefollowing lemma defines such an upper bound on this drift-plus-penalty term.

Lemma 1: Let V > 0, and let ϵg > 0. Then the drift-plus-penalty term satisfies the following inequality:

∆(Θ(t)) + V Cost(t) ≤B +

!

g∈[1,G]

Qg(t)[mgrg(t)− ug(t)− ug(t−)−mgDg(t)]

+!

g∈[1,G]

Zg(t)[ϵg − ug(t)− ug(t−)−mgDg(t)]

+ V!

s∈[1,S]


+ 1{βs>γs(t)}[Ps(t)βsfs(t) + (1− Ps(t))γs(t)fs(t)]}+ V

!

g∈[1,G]

Dg(t)σg, (17)

where B = 12

!g∈[1,G]{[ϵg]2 + (mgrmax

g )2 + 2[umaxg +

mgDmaxg ]2}.

Proof: Detailed proof is provided in [22].Different from previous work using Lyapunov optimiza-

tion [10][11], we seek to model the more general scenarioin which a job may take more than one time slot to finishand the ASP can choose any pricing option at every time

slot. In each time slot t, the ASP observes the queues Qg(t)and Zg(t), and the number of left-over jobs ug(t−), and thendecides the optimal values of Dg(t), ug(t), as(t), bs(t) andfs(t) to minimize the the upper bound shown on the right-hand side of inequality (17). We can get the following one-shot optimization problem to be solved by the ASP in eachtime slot t:

min ϕ1(t) + ϕ2(t) (18)s.t. Constraints (1)-(3), (8)-(12),

where

ϕ1(t) =!

g∈[1,G]

Dg(t)[V σg −mgQg(t)−mgZg(t)]

ϕ2(t) = V!

s∈[1,S]


+ 1{βs>γs(t)}[(1− Ps(t))γs(t)fs(t) + Ps(t)βsfs(t)]}−

!

g∈[1,G]

ug(t)[Qg(t) + Zg(t)].

After careful derivation, minimization problem (18) can bedecoupled to two independent optimization problems dealingwith (a) job dropping, and (b) job scheduling and VM pur-chasing, respectively.

(a) Job Dropping. The number of dropped jobs Dg(t),∀g ∈ [1, G], is obtained by solving the following optimizationproblem:

min Dg(t)[V σg −mgQg(t)−mgZg(t)] (19)s.t. Constraint (9).

The optimal solution of problem (19) is:

Dg(t) =

#Dmax

g if Qg(t) + Zg(t) >V σg

mg

0 if Qg(t) + Zg(t) ≤ V σg

mg

.

It indicates that one type-g job with more dropping penaltyand less workload is less likely to be dropped.

(b) Job scheduling and VM purchasing. The decisions onthe number of jobs to schedule, the number of reserved VMsto purchase, the number of on-demand VMs to launch, andthe number of spot VMs to acquire, all at t, can be obtainedby solving the following optimization problem:

min V"

s∈[1,S]


+ 1{βs>γs(t)}[(1− Ps(t))γs(t)fs(t)

+ Ps(t)βsfs(t)]}−"

g∈[1,G]

ug(t)[Qg(t) + Zg(t)]

(20)s.t. Constraints (3)(8)(10)(11)(12).

Problem (20) is a joint job scheduling and VM purchasingproblem. We can start with solving ug(t) by assuming alreadyknown feasible assignments to as(t), bs(t), and fs(t). To

150

minimize (20), we should maximally schedule jobs of type-g∗s ,whose observed value of Qg(t) + Zg(t) is the largest amongall types of jobs requiring type-s VMs. We have

g∗s = argmaxg:gs=s[Qg(t) + Zg(t)], ∀s ∈ [1, S]. (21)

The number of type-g∗s jobs we can schedule at t is decidedby constraint (3), at

ug∗s(t) =as(t) + bs(t) + 1{βs>γs(t)}fs(t)

−"

g:sg=s

ug(t−) + φs(t). (22)

Here, φs(t) =!t−1

τ=t−N+1 as(τ). Except type-g∗s jobs, noother types of jobs are scheduled, i.e.,

ug(t) = 0, ∀g ̸= g∗s , ∀s ∈ [1, S]. (23)

Hence the second part of (20) can be expressed using variablesas(t), bs(t), and fs(t). Removing the constants, (20) can beconverted to the following equivalent VM purchasing problem

min"

s∈[1,S]

{as(t)[V αs −Qg∗s(t)− Zg∗

s(t)]

+ bs(t)[V βs −Qg∗s(t)− Zg∗

s(t)]

+ 1{βs>γs(t)}fs(t)[V (1− Ps(t))γs(t) + V Ps(t)βs

−Qg∗s(t)− Zg∗

s(t)]} (24)

s.t. as(t) + bs(t) + fs(t) ≥"

g:sg=s

ug(t−)− φs(t)

Constraints (8)(10)(11)(12).

Here, we show the solutions when βs > γs(t). The other caseβs ≤ γs(t) can be solved in a similar manner with fs(t) =0. Based on real cases of Amazon EC2 [4], the followinginequality holds in general given the fact Ps(t) ∈ [0, 1]

[(1− Ps(t))γs(t) + Ps(t)βs] ≤ βs ≤ αs.

The objective function of (24) is linear in as(t), bs(t), andfs(t). There are four cases in terms of the workload queuelength and the virtual queue length.

Case 1: Qg∗s(t)+Zg∗

s(t) ≤ V (1−Ps(t))γs(t)+V Ps(t)βs.

The objective function is always non-negative. as(t), bs(t),and fs(t) should be as small as possible.

Case 1.1:!

g:sg=s ug(t−)− φs(t) ≤ 0. The reserved VMsremaining effective are sufficient to accommodate left-overjobs. Then we have as(t) = bs(t) = fs(t) = 0.

Case 1.2: 0 <!

g:sg=s ug(t−)−φs(t) ≤ fmaxs . Spot VMs

should be launched to run left-over jobs with the reservedVMs remaining effective. Then we have as(t) = bs(t) = 0and fs(t) =

!g:sg=s ug(t−)− φs(t).

Case 1.3: 0 <!

g:sg=s ug(t−) − φs(t) − fmaxs ≤ bmax

s .On-demand and Spot VMs should be launched to run left-over jobs with the reserved VMs remaining effective. Thenwe have as(t) = 0, bs(t) =

!g:sg=s ug(t−)− φs(t)− fmax

s ,and fs(t) = fmax

s .

Case 1.4: 0 <!

g:sg=s ug(t−) − φs(t) − fmaxs − bmax

s ≤amaxs . All three types of VMs should be acquired to accommo-

date left-over jobs. Then we have as(t) =!

g:sg=s ug(t−)−φs(t)− fmax

s − bmaxs , bs(t) = bmax

s , and fs(t) = fmaxs .

Case 2: V (1−Ps(t))γs(t)+V Ps(t)βs < Qg∗s(t)+Zg∗

s(t) ≤

V βs. Case 3: V βs < Qg∗s(t)+Zg∗

s(t) ≤ V αs. Case 4: V αs <

Qg∗s(t)+Zg∗

s(t). The solutions of these cases can be found in

our technical report [22].The intuition here is to trade the queueing delay for cost

reduction by using the workload queue length and virtualqueue length as a guidance for making job scheduling and VMpurchasing decisions. Once as(t), bs(t) and fs(t) are decided,the job scheduling decisions can be made based on Eqn. (22)and Eqn. (23).

In the standard Lyapunov optimization framework [20],decisions made at the current time slot do not have anyinfluence on decision making at the subsequent time slots. Thispaper handles a more general case in which one scheduledjob will occupy VM resources for mg consecutive time slots,directly affecting the job scheduling and VM purchasingdecisions in later times. We make the following design toour cost minimization algorithm, in order to achieve provablealgorithmic optimality.

First, we group N time slots into a super time frame, whereN is the fixed reservation period for reserved instances. Letmmax denote the maximum units of workload of all types ofjobs. We assume that N > mmax. The above job schedulingand VM purchasing algorithm varies depending on which timeslot it is running at : at a time slot t ∈ [xN, (x + 1)N −mmax], where x could be any non-negative integer, the abovealgorithm remains the same; in a time slot t ∈ [(x + 1)N −mmax +1, (x+1)N − 1], only jobs with mg ≤ (x+1)N − tare considered in the selection of g∗s :

g∗s = argmaxg:gs=s,mg≤(x+1)N−t[Qg(t) + Zg(t)]. (25)

This design indicates that a new type-g job is scheduled onlyif it can finish its service within the super time frame.

Second, we impose another constraint that instances re-served at t will remain effective in the time span [t,N(⌊ t

N ⌋+1)− 1] instead of [t, t+N − 1]. All instances reserved duringone super time frame will be cleared out at the end of thatsuper time frame. So within one super time frame, the ASPuses reserved VMs over time in a nondecreasing manner.

C. Performance Analysis

We next analyze the performance of the designed onlinealgorithm in terms of queueing delay bound, no job droppingconditions and cost optimality.

Theorem 1 (Queueing Delay Bound): If mgDmaxg >

max{mgrmaxg , ϵg}, then each workload queue Qg(t) and

each virtual queue Zg(t) are upper bounded by Qmaxg =

V σg/mg +mgrmaxg and Zmax

g = V σg/mg + ϵg, respectively,∀t, ∀g ∈ [1, G]. The SLA of each job can be guaranteed byQmax

g +Zmaxg

ϵg, ∀g ∈ [1, G], if we set ϵg =

Qmaxg +Zmax

g

lg.

Proof: Detailed proof is provided in [22].

151

Theorem 2 (No Job Dropping Conditions): There is no jobdropping in each time slot if the two conditions are satisfied

amaxsg + bmax

sg + fmaxsg ≥ (

"

g′∈G

mg′ )(mmaxrmax + ϵmax)

(26)

V σg

mg≥ V αmax + (

"

g′∈G

mg′ )(mmaxrmax + ϵmax), ∀g ∈ G.

(27)

Here, αmax = max{αs, ∀s ∈ S}, rmax = max{rmaxg , ∀g ∈

G}, and ϵmax = max{ϵg, ∀g ∈ G}.Proof: With condition (26), we can prove Qg(t) +Zg(t) ≤

V αmax + (!

g′∈G mg′ )(mmaxrmax + ϵmax), ∀g ∈ G. Thencondition (27) can guarantee Qg(t)+Zg(t) ≤ V σg

mg. According

to the optimal solution of problem (19), there is no jobdropping. Detailed proof is provided in [22].

We next prove the performance optimality of our onlinealgorithm. Define χ as the vector of time-averaged arrivingworkload for different types of jobs, i.e.,

χg = limT→∞

1

T

T−1"

t=0

mgrg(t).

A workload arrival rate vector χ is said to be supportable ifthere exist job scheduling and VM purchasing algorithms withno job dropping and no violation of the SLA requirements,under which all workload queues can be stabilized. The setof all supportable vectors of workload arrival rates is definedas the capacity region C at the ASP. We call an algorithm(1 + δ)-optimal if the algorithm can support any χ such that(1 + δ)χ ∈ C for some δ > 0.

Theorem 3 (Performance Optimality): Suppose conditions(26) and (27) are satisfied, and (1+δ)N

N−mmaxχ ∈ C for some δ >0, under our online algorithm we have :

limκ→∞

1κN

κ−1!

x=0

(x+1)N−1!

t=xN

E[Cost(t)]

≤ Cost(1+δ)N

N−mmax +BV

+(N −mmax)(N −mmax − 1)

2V NB1

+N − 12V

!

g∈[1,G]

[(ϵg)2 + (mg)

2(rmaxg )2]

+mmax

N

!

s∈[1,S]

(αsamaxs + βsb

maxs + βsf

maxs )

+(N −mmax)(N −mmax − 1)

2N

!

s∈[1,S]

(fmaxs )(βs − γmin

s ),

(28)

where B is given in Lemma 1, γmins is the minimum spot

price for type-s spot VMs, and B1 =!

g∈[1,G][mgrmaxg +

2umaxg + ϵg]umax

g . RHS is the (1+δ)NN−mmax -optimal cost plus a

constant. Note that 1-optimal cost is the offline optimum forthe cost-minimization problem in Eqn. (6).

0 100 200 300 400 500 600 7000

500

1000

1500

2000

2500

Time slot (hour)

Job

No.

Fig. 3. The arrival pattern of one typical job type.

Proof: Detailed proof is provided in [22].Theorem 1 and Theorem 3 show that, given a control

parameter V , our algorithm is O(1/V )-optimal with respectto the average VM cost against the (1+δ)N

N−mmax -optimal offlinealgorithm, while the queue length is bounded by O(V ). Byincreasing V , the developed online algorithm can push thetime-average cost closer to the (1+δ)N

N−mmax -optimal value at theexpense of increasing the queueing delay at the ASP. Hence,by appropriately selecting the control parameter V , we canachieve a desired tradeoff between the VM cost and thequeueing delay. If V → ∞, N → ∞ and N

V < ∞, ouralgorithm can achieve a time-averaged cost with a constantgap to the 1-optimal cost, i.e., the offline optimum.

V. PERFORMANCE EVALUATION

In this section, we evaluate the performance of the pro-posed online algorithm through trace-driven simulations underrealistic settings. We simulate an ASP which schedules andprocesses workloads with VMs purchased from one IaaS cloudprovider like Amazon EC2 [2].

A. Simulation SetupJob Types. We consider six types of VMs S = {m3.xlarge,

m3.2xlarge, c3.2xlarge, c3.4xlarge, r3.xlarge, r3.2xlarge}. Ajob requiring some of the VMs lasts for different numbers oftime slots. The number of time slots a job needs the VMsfor, i.e., the workload, is chosen randomly from {1, 2, 3, 4}.Hence, there are 24 types of jobs in total.

Demand Curve. We conduct our simulations based onGoogle cluster-usage traces [23], reflecting the resource de-mands (CPU, memory, etc.) of jobs submitted to the Googlecluster. We translate the Google data into concrete hourly jobarrival rates as our input. One time slot is an hour. Fig. 3shows job arrivals at the ASP for one typical job type.

Pricing. The cost parameters in the problem formulationare all set according to Amazon EC2 pricing policies [4].Specifically, the one-time payment for a reserved VM iscalculated by scaling down the total charge (upfront fee plushourly charges) of a real reserved instance with 1-year termaccording to the ratio N/(365 ∗ 24). The hourly price forrunning one on-demand VM is set as {$0.28 $0.56 $0.42 $0.84$0.35 $0.7} for the six VM types, the same as the instanceshosted on Amazon EC2. We extract real spot prices of the sixVM types in the region US West (Oregon) of Amazon EC2from Jun 22 to Jul 22, 2014 by Amazon EC2 CLI Tools [24].

152

0 1000 2000 3000 4000 50000

0.2

0.4

Time slot (5 minutes)

Spot

Pric

e ($

)

r3.xlargeOn−demand Price

0 1000 2000 3000 4000 50000

0.5

1

1.5


Spot

Pric

e ($

)

m3.xlargeOn−demand Price

0 1000 2000 3000 4000 50000

1

2

3


Spot

Pric

e ($

)

c3.2xlargeOn−demand Price

Fig. 4. The spot price fluctuations of three typical VM types.

0 100 200 300 400 500 600 7000

1

2

3

4

5 x 104

Time slot (V=3000, N=24)

Valu

e ($

)

CostPenalty

Fig. 5. Cost and penalty in each time slot.

Spot prices are updated every 5 minutes. Fig. 4 illustratestemporal variations in spot prices for three typical VM types,in contrast with fixed on-demand prices.

B. CostWe first run our online algorithm for 720 time slots with

parameters V = 3000, N = 24, ϵg = 50 ∗ mg and σg =1000 ∗ αsg . We scale down the value of N due to the limitof available spot price traces, but it still represents the maincharacteristics of reserved instances. Fig. 5 shows the costand penalty incurred by dropped jobs of the ASP in each timeslot. We see that no penalty occurs under our setup, validatingTheorem 2 in Sec. IV.

For comparison purposes, we also implement a heuristic al-gorithm that always maximally schedules type-g†s jobs, whoseobserved value of Qg(t) is the largest among all types of jobsrequiring type-s VMs. Fig. 6 shows the costs achieved bythe two algorithms respectively. We observe that our onlinealgorithm outperforms the heuristic algorithm over time.

C. Impact of Spot Price FluctuationWe illustrate the fractions of VM acquisition cost for three

typical VM types with our online algorithm in the upper figurein Fig. 7. We observe that the VM acquisition cost comesprimarily from reserved and on-demand VMs. To furtherunderstand the connection between spot price fluctuationsand cost savings, we next evaluate the VM acquisition costdiscount due to using and not using spot instances in our

0 100 200 300 400 500 600 7000

1

2

3

4

5 x 104

Time slot (V=3000, N=24)

Cost

($)

Our Online AlgorithmHeuristic Algorithm

Fig. 6. Comparison of costs between our algorithm and heuristic algorithm.

m3.xlarge r3.xlarge c3.2xlarge0

2

4

6

8

10

12 x 105

Tota

l VM

Acq

uisit

ion

Cost

($)

Reserved VMOn−demand VMSpot VM

m3.xlarge r3.xlarge c3.2xlarge0

0.5

1

1.5

2

2.5 x 106

Tota

l VM

Acq

uisit

ion

Cost

($)

No SpotOur Online Algorithm

− 52.7%− 53.2%

− 52.5%

Fig. 7. Cost savings with and without spot VMs for different VM types.

online algorithm for three typical VM types. The lower figurein Fig. 7 shows the cost advantage of our online algorithmover another online algorithm No Spot that conducts VMacquisition without spot VMs for comparison purposes. Asillustrated in Fig. 4, spot prices for the three typical VMtypes present quite different temporal fluctuation patterns. Butwe see that our online algorithm can bring more than 50%cost savings for all three VM types, when spot instances areexploited. Therefore our algorithm is robust to the fluctuationof spot prices. We also observe that though the number ofspot instances exploited by our online algorithm is smallas compared to the numbers of reserved and on-demandinstances, the cost saving is huge, due to the very low pricesof spot instances.

D. Impact of V and N

We next study the scheduling delays experienced by jobs.Fig. 8 shows the average response delays and maximumresponse delays with our online algorithm under differentvalues of V . We see that both average response delay andmaximum response delay increase as V increases, confirmingthe impact of V on the queueing delay in Theorem 1.

From Theorem 3, we note that the cost performance of theproposed online algorithm depends on two critical factors, Vand N . Fig. 9 reveals how the time-averaged cost achieved byour algorithm varies with different values of V and N . We seethat as V increases, the time-averaged cost decreases, verifying

153

1000 2000 3000 4000 5000 6000 7000 8000 90000

50

100

150

200

V Value ( εg=50 × mg)

Aver

age

resp

onse

del

ay (T

ime

Slot

)

N=24N=72N=120

(a)

1000 2000 3000 4000 5000 6000 7000 8000 90000

200

400

600

800


Max

imum

resp

onse

del

ay (T

ime

Slot

)

N=24N=72N=120

(b)

Fig. 8. Average and maximum response delays under different values of V.

1000 2000 3000 4000 5000 6000 7000 8000 90000.4

0.6

0.8

1

1.2

1.4

1.6

1.8 x 104


Tim

e−av

erag

ed c

osts

($)

N=168N=336N=504N=720

(a)

100 200 300 400 500 600 7000

5000

10000

15000

N Value ( εg=50 × mg)

Tim

e−av

erag

ed c

osts

($)

V=3000V=5000V=7000

(b)

Fig. 9. Time-averaged costs under different values of V and N.

the O(1/V )-versus-O(V ) cost-delay tradeoff in Theorem 1and Theorem 3. N is the reservation period and the length ofa super frame as well. Fig. 9 (b) suggests that the value ofN has relatively less impact on the time-averaged cost of ouronline algorithm . When V increases, ϵ is properly set, and Nis large enough, the time-averaged cost is arbitrarily close tothe offline optimum plus a constant.

E. Characterizing Algorithm RobustnessAs mentioned in Sec. IV, our algorithm needs to predict the

probability of the occurrence of a termination event for type-sspot VMs in the coming time slot Ps(t), which is estimatedbased on historical activity records of type-s spot VMs withina window of a certain number of past time slots. Now weexplore the influence of the prediction window size on the costperformance. In Fig. 10, we show the time-averaged costs withdifferent prediction window sizes. We can see that the costperformance of our online algorithm is completely insensitiveto the prediction window size. Therefore, the proposed onlinealgorithm is robust in our estimation.

VI. CONCLUDING REMARKS

This paper investigates cost minimization strategies at ASPswith arbitrary demands which provision application serviceswith VMs purchased from IaaS Clouds. We design a dynamicalgorithm for an ASP to schedule job service/drop, and todecide the amount of reserved, on-demand and spot instancesto purchase simultaneously in the most economic fashion,under time-varying job arrivals and fluctuating spot instanceprices. The proposed algorithm can obtain a time-averagedVM purchasing cost with a constant gap from its offlineminimum, based on solid theoretical analysis and trace-drivenevaluations.

REFERENCES

[1] https://www.netflix.com/global.[2] http://aws.amazon.com/ec2/.

2 4 6 8 109000

9200

9400

9600

9800

10000

10200

Prediction Window Size ( × N, N = 24)

Tim

e−av

erag

ed c

osts

($)

V=3000V=5000V=7000

Fig. 10. Time-averaged costs with different prediction window sizes.

[3] http://aws.amazon.com/solutions/case-studies/netflix/.[4] http://aws.amazon.com/ec2/pricing/.[5] https://aws.amazon.com/premiumsupport/trustedadvisor/.[6] H. Zhao, M. Pan, X. Liu, and Y. Fang, “Optimal resource rental planning

for elastic applications in cloud market,” in Proc. of IEEE InternationalParallel and Distributed Processing Symposium (IPDPS), 2012.

[7] Y. Hong, J. Xue, and M. Thottethodi, “Dynamic Server Provisioning toMinimize Cost in an IaaS Cloud,” in Proc. of ACM SIGMETRICS, 2011.

[8] W. Wang, B. Li, and B. Liang, “To Reserve or Not to Reserve: OptimalOnline Multi-Instance Acquisition in IaaS Clouds,” in Proc. of USENIXInternational Conference on Autonomic Computing, 2013.

[9] W. Wang, D. Niu, B. Li, and B. Liang, “Dynamic Cloud ResourceReservation via Cloud Brokerage,” in Proc. of IEEE InternationalConference on Distributed Computing Systems (ICDCS), 2013.

[10] Y. Yao, L. Huang, A. Sharma, L. Golubchik, and M. Neely, “DataCenters Power Reduction: A Two Time Scale Approach for DelayTolerant Workload,” in Proc. of IEEE INFOCOM, 2012.

[11] S. Ren, Y. He, and F. Xu, “Provably-Efficient Job Scheduling forEnergy and Fairness in Geographically Distributed Data Centers,” inProc. of IEEE International Conference on Distributed ComputingSystems (ICDCS), 2012.

[12] U. Sharma, S. Prashant, S. Sahu, and A. Shaikh, “A Cost-Aware Elastic-ity Provisioning System for the Cloud,” in Proc. of IEEE InternationalConference on Distributed Computing Systems (ICDCS), 2011.

[13] X. Qiu, H. Li, C. Wu, Z. Li, and F. C. Lau, “Cost-Minimizing DynamicMigration of Content Distribution Services into Hybrid Clouds,” inProc. of IEEE INFOCOM, Mini Conference, 2012.

[14] A. Khanafer, M. Kodialam, and K. P. N. Puttaswamy, “The ConstrainedSki-Rental Problem and its Application to Online Cloud Cost Optimiza-tion,” in Proc. of IEEE INFOCOM, 2013.

[15] V. Setty, R. Vitenberg, G. Kreitz, U. Guido, and M. v. Steen, “Cost-Effective Resource Allocation for Deploying Pub/Sub on Cloud,” inProc. of IEEE International Conference on Distributed ComputingSystems (ICDCS), 2014.

[16] H. Roh, C. Jung, W. Lee, and D. Du, “Resource Pricing Game in Geo-distributed Clouds,” in Proc. of IEEE INFOCOM, 2013.

[17] L. M. Leslie, Y. C. Lee, P. Lu, and A. Y. Zomaya, “Exploiting Perfor-mance and Cost Diversity in the Cloud,” in Proc. of IEEE InternationalConference on Cloud Computing, 2013.

[18] S. Lu, X. Li, L. Wang, H. Kasim, H. Palit, T. Hung, E. F. T. Legara,and G. Lee, “A Dynamic Hybrid Resource Provisoning Approach forRunning Large-scale Computional Applications on Cloud Spot andOn-demand Instances,” in Proc. of IEEE International Conference onParallel and Distributed Systems (ICPADS), 2013.

[19] I. Menache, O. Shamir, and N. Jain, “On-demand, Spot, or Both:Dynamic Resource Allocation for Executing Batch Jobs in the Cloud,”in Proc. of USENIX International Conference on Autonomic Computing,2014.

[20] M. J. Neely, Stochastic Network Optimization with Application toCommunication and Queueing Systems. Morgan&Claypool Publishers,2010.

[21] https://aws.amazon.com/ec2/purchasing-options/reserved-instances/.[22] http://i.cs.hku.hk/%7ecwu/papers/techreport-skshicloud15.pdf.[23] http://code.google.com/p/googleclusterdata/.[24] http://aws.amazon.com/documentation/cli/.

154

Documents

Cost-Minimizing Online VM Purchasing for Application ...i.cs.hku.hk/~cwu/papers/skshi-cloud15.pdf · framework for dynamic, cost-minimizing migration of content distribution services