14
1 Cost-aware Cooperative Resource Provisioning for Heterogeneous Workloads in Data Centers Jianfeng Zhan, Lei Wang, Xiaona Li, Weisong Shi, Senior Member, IEEE , Chuliang Weng, Wenyao Zhang, and Xiutao Zang Abstract—Recent cost analysis shows that the server cost still dominates the total cost of high-scale data centers or cloud systems. In this paper, we argue for a new twist on the classical resource provisioning problem: heterogeneous workloads are a fact of life in large-scale data centers, and current resource provisioning solutions do not act upon this heterogeneity. Our contributions are threefold: first, we propose a cooperative resource provisioning solution, and take advantage of differences of heterogeneous workloads so as to decrease their peak resources consumption under competitive conditions; second, for four typical heterogeneous workloads: parallel batch jobs, Web servers, search engines, and MapReduce jobs, we build an agile system PhoenixCloud that enables cooperative resource provisioning; and third, we perform a comprehensive evaluation for both real and synthetic workload traces. Our experiments show that our solution could save the server cost aggressively with respect to the non-cooperative solutions that are widely used in state-of-the-practice hosting data centers or cloud systems: e.g., EC2, which leverages the statistical multiplexing technique, or RightScale, which roughly implements the elastic resource provisioning technique proposed in related state-of-the-art work. Index Terms—Data Centers, Cloud, Cooperative Resource Provisioning, Statistical Multiplexing, Cost, and Heterogeneous Workloads. 1 I NTRODUCTION More and more computing and storage are moving from PC-like clients to data centers [32] or (public) clouds [21], which are exemplified by typical services like EC2 and Google Apps. The shift toward server-side computing is driven primarily not only by user needs, such as ease of management (no need of configuration or backups) and ubiquity of access supported by browsers [32], but also by the economies of scale provided by high-scale data centers [32], which is five to ten over small-scale deployments [20] [29]. However, high-scale data center cost is very high, e.g., it was reported in Amazon [29] that the cost of a hosting data center with 15 megawatt (MW) power facility is high as $5.6 M per month. High data center cost puts a big burden on resource providers that provide both data center resources like power and cooling infrastructures and server or storage resources to hosted service providers, which directly provides services to end users, and hence how to lower data center cost is a very important and urgent issue. Previous efforts [32] [44] [46] [51] studied the cost models of data centers in Amazon, and Google etc, and concluded with two observations: first, different from enterprise systems, the personnel cost of hosting data center shifts from top to nearly irrelevant [29]; second, J. Zhan and L. Wang, X. Li, X. Zang are with the Institute of Com- puting Technology, Chinese Academy of Sciences. W. Shi is with Wayne State University. C. Weng is with Shanghai Jiaotong University. As the corresponding author, W. Zhang is with Beijing Institute of Technology. the server cost—the cost of server hardware 1 contributes the largest share (more than a half), and the power & cooling infrastructure cost, the power cost and other infrastructure cost follow, respectively. 
In this context, the focus of this paper is how to lower the server cost, which is the largest share of data center cost. Current solutions to lowering data center cost can be classified into two categories: first, virtualization-based consolidation is proposed: a) to combat server sprawl caused by isolating applications or operating system heterogeneity [50], which refers to a situation that under- utilized servers take up more space and consume more data center, server or storage resources than that are required by by their workloads; b) to provision elastic resources for either multi-tier services [48] [49] or scien- tific computing workloads [20] in hosting data centers. Second, statistical multiplexing techniques [27] are lever- aged to decrease the server cost in state-of-the-practice EC2 systems; an overbooking resource approach [39], routinely used to maximize yield in airline reservation systems, is also proposed to maximize the revenue— the number of applications with light loads that can be housed on a given hardware configuration. Multiplexing means that a resource is shared among a number of users [58]. Different from peak rate allocation multiplexing 1. For a data center with 15 MW power facility in Amazon, the share of server cost, power & cooling infrastructure cost, power cost and other infrastructure cost are 53, 23, 19 and 5 percents, [29] respectively. Internet-scale services typically employ direct attached disks rather than storage-area networks common in enterprise deployments [29], so the cost of storages is included into the server cost. For a real deployment of 50k serves with 40 servers per rack, the network cost is under 3% about the server cost [29]. Digital Object Indentifier 10.1109/TC.2012.103 0018-9340/12/$31.00 © 2012 IEEE IEEE TRANSACTIONS ON COMPUTERS This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Cost-aware Cooperative Resource Provisioning for ...weisong.eng.wayne.edu/_resources/pdfs/zhan12_phoenix.pdf · data centers [32], which is five to ten over small-scale deployments

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Cost-aware Cooperative Resource Provisioning for ...weisong.eng.wayne.edu/_resources/pdfs/zhan12_phoenix.pdf · data centers [32], which is five to ten over small-scale deployments

1

Cost-aware Cooperative Resource Provisioningfor Heterogeneous Workloads in Data Centers

Jianfeng Zhan, Lei Wang, Xiaona Li, Weisong Shi, Senior Member, IEEE , Chuliang Weng,Wenyao Zhang, and Xiutao Zang

Abstract—Recent cost analysis shows that the server cost still dominates the total cost of high-scale data centers or cloud systems.In this paper, we argue for a new twist on the classical resource provisioning problem: heterogeneous workloads are a fact of life inlarge-scale data centers, and current resource provisioning solutions do not act upon this heterogeneity. Our contributions are threefold:first, we propose a cooperative resource provisioning solution, and take advantage of differences of heterogeneous workloads so as todecrease their peak resources consumption under competitive conditions; second, for four typical heterogeneous workloads: parallelbatch jobs, Web servers, search engines, and MapReduce jobs, we build an agile system PhoenixCloud that enables cooperativeresource provisioning; and third, we perform a comprehensive evaluation for both real and synthetic workload traces. Our experimentsshow that our solution could save the server cost aggressively with respect to the non-cooperative solutions that are widely usedin state-of-the-practice hosting data centers or cloud systems: e.g., EC2, which leverages the statistical multiplexing technique, orRightScale, which roughly implements the elastic resource provisioning technique proposed in related state-of-the-art work.

Index Terms—Data Centers, Cloud, Cooperative Resource Provisioning, Statistical Multiplexing, Cost, and Heterogeneous Workloads.

1 INTRODUCTION

More and more computing and storage are moving fromPC-like clients to data centers [32] or (public) clouds [21],which are exemplified by typical services like EC2 andGoogle Apps. The shift toward server-side computing isdriven primarily not only by user needs, such as easeof management (no need of configuration or backups)and ubiquity of access supported by browsers [32], butalso by the economies of scale provided by high-scaledata centers [32], which is five to ten over small-scaledeployments [20] [29]. However, high-scale data centercost is very high, e.g., it was reported in Amazon [29]that the cost of a hosting data center with 15 megawatt(MW) power facility is high as $5.6 M per month. Highdata center cost puts a big burden on resource providersthat provide both data center resources like power andcooling infrastructures and server or storage resources tohosted service providers, which directly provides servicesto end users, and hence how to lower data center cost is avery important and urgent issue.

Previous efforts [32] [44] [46] [51] studied the costmodels of data centers in Amazon, and Google etc, andconcluded with two observations: first, different fromenterprise systems, the personnel cost of hosting datacenter shifts from top to nearly irrelevant [29]; second,

• J. Zhan and L. Wang, X. Li, X. Zang are with the Institute of Com-puting Technology, Chinese Academy of Sciences. W. Shi is with WayneState University. C. Weng is with Shanghai Jiaotong University. As thecorresponding author, W. Zhang is with Beijing Institute of Technology.

the server cost—the cost of server hardware 1 contributesthe largest share (more than a half), and the power &cooling infrastructure cost, the power cost and otherinfrastructure cost follow, respectively. In this context,the focus of this paper is how to lower the server cost, whichis the largest share of data center cost.

Current solutions to lowering data center cost can beclassified into two categories: first, virtualization-basedconsolidation is proposed: a) to combat server sprawlcaused by isolating applications or operating systemheterogeneity [50], which refers to a situation that under-utilized servers take up more space and consume moredata center, server or storage resources than that arerequired by by their workloads; b) to provision elasticresources for either multi-tier services [48] [49] or scien-tific computing workloads [20] in hosting data centers.Second, statistical multiplexing techniques [27] are lever-aged to decrease the server cost in state-of-the-practiceEC2 systems; an overbooking resource approach [39],routinely used to maximize yield in airline reservationsystems, is also proposed to maximize the revenue—the number of applications with light loads that can behoused on a given hardware configuration. Multiplexingmeans that a resource is shared among a number of users[58]. Different from peak rate allocation multiplexing

1. For a data center with 15 MW power facility in Amazon, the shareof server cost, power & cooling infrastructure cost, power cost andother infrastructure cost are 53, 23, 19 and 5 percents, [29] respectively.Internet-scale services typically employ direct attached disks ratherthan storage-area networks common in enterprise deployments [29],so the cost of storages is included into the server cost. For a realdeployment of 50k serves with 40 servers per rack, the network costis under 3% about the server cost [29].

Digital Object Indentifier 10.1109/TC.2012.103 0018-9340/12/$31.00 © 2012 IEEE

IEEE TRANSACTIONS ON COMPUTERSThis article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Page 2: Cost-aware Cooperative Resource Provisioning for ...weisong.eng.wayne.edu/_resources/pdfs/zhan12_phoenix.pdf · data centers [32], which is five to ten over small-scale deployments

2

that guarantees the peak resource demands for each user[58], statistical multiplexing allows the sum of the peakresource demands of each user exceeding the capacity ofa data center.

On one hand, as more and more computing moves todata centers, resource providers have to confront withthe server sprawl challenge caused by increasing hetero-geneous workloads in terms of different classes of workloads,e.g., Web server, data analysis jobs, and parallel batchjobs. This observation is supported by more and morecase reports. For example, search engine systems, likeNutch (http://nutch.apache.org/), include two majorheterogeneous workloads: MapReduce-like parallel dataanalysis and search engine services The Boeing companyalso reported this trend of consolidating parallel batchjobs and Web service applications on one data centerin the recent IDC HPC user forum, held in October30, 2010 at Beijing, China. Heterogeneous workloadsoften have significantly different resource consumptioncharacteristics and performance goals, of which we deferthe discussion to Section 3, and hence the server sprawlchallenge caused by them can not be simply resolved onthe extension of the previous virtualization-based con-solidation work. On the other hand, simply leveragingstatistical multiplexing or overbooking resources can noteffectively cope with the server sprawl challenge causedby increasing heterogeneous workloads. Constrained bythe power supply and other issues, the scale of a data cen-ter can not be limitless. Even many users and servicescan be deployed or undeployed, it is impossible thatthe resource curves of the systems that leverage thestatistical multiplexing technique, like EC2, will remainsmooth. Besides, the overbooking solution to resourceprovisioning is probabilistic, and hence it is unsuitablefor critical services.

In this paper, we gain an insight from the previousdata cost analysis [32] [44] [46] [51] to decrease datacenter costs. From the perspective of a resource provider,the highest fraction of the cost—the server cost (53%)[29] is decided by the system capacity, and the power& infrastructure cost is also closely related with thesystem capacity. Since the system capacity at least mustbe greater than the peak resource consumption of work-loads, which is the minimum system capacity allowed, andhence we can decrease the peak resources demands ofworkloads so as to save both the server cost and thepower & infrastructure cost.

We leverage differences of heterogeneous workloadsin terms of resource consumption characteristics andperformance goals, and propose a cooperative resourceprovisioning solution to decreasing the peak resource con-sumption of workloads on data centers. Our solutioncoordinates the resource provider’s resource provision-ing actions for service providers running heterogeneousworkloads under competitive conditions, and hence wecan lower the server cost. On a basis of our previouswork [20] [21], we design and implement an innovativesystem PhoenixCloud that enables cooperative resource

provisioning for four typical heterogeneous workloads:parallel batch jobs, Web servers, search engines, andMapReduce jobs. The contributions of our paper arethreefold:

First, leveraging differences of heterogeneous work-loads, we propose a cooperative resource provisioningsolution, and take advantage of statistical multiplexingat a higher granularity—a group of two heterogeneousworkloads to save the server cost.

Second, we build an innovative system PhoenixCloudto enable cooperative resource provisioning for fourrepresentative heterogeneous workloads: parallel batchjobs, Web servers, search engines, and MapReduce jobs.

Third, we perform a comprehensive evaluation ofPhoenixCloud and the non-cooperative EC2+RightScale-like systems.

The rest of the paper is organized as follows. Sec-tion 2 formulates the problem. Section 3 presents thecooperative resource provisioning solution, followed bythe description of our cooperative resource provisioningsolution in Section 4. A comprehensive evaluation andcomparison of the systems are depicted in Section 5.Finally, related work and concluding remarks are listedin Section 6 and 7, respectively.

2 PROBLEM FORMULATION AND MOTIVATION

According to [29] [32] [46], the server cost contributes thelargest proportion of the total cost of data centers. Fromthe cost model analysis [29] [32] [46], we can observethat the system capacity is a key factor in loweringdata center costs: on one hand, the highest fraction ofthe cost—the server cost (53%) [29] is decided by thesystem capacity in terms of servers; on the other hand, thesecond highest fraction of data center cost—the power& cooling infrastructure cost depends on the maximumdesign (rated) power consumption that is decided bythe system capacity in addition to the degree of faulttolerance and functionality desired of data centers [46].

Since the system capacity at least must be greaterthan the peak resource consumption of workloads, andhence we must decrease the peak resource demandsof workloads so as to lower the server cost and thepower & cooling infrastructure cost. At the same time, aresource provider (in short, RP) also concerns the totalresource consumption of workloads. Since each event ofrequesting, releasing or provisioning resources will trig-ger a setup action, for example wiping off the operatingsystem or data, elastic resource provisioning in EC2-likesystems will result in the management overhead, whichcan be measured in terms of (the accumulated countsof adjusting resources) × (the average setup duration). Sofor a RP, the total resource consumption is the sum ofthe effective resource consumption, directly consumed byworkloads, and the management overhead.

We gives the accurate definitions of the peak resourceconsumption and the effective resource consumption asfollows:

IEEE TRANSACTIONS ON COMPUTERSThis article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Page 3: Cost-aware Cooperative Resource Provisioning for ...weisong.eng.wayne.edu/_resources/pdfs/zhan12_phoenix.pdf · data centers [32], which is five to ten over small-scale deployments

3

For a service provider (in short, SP), the workloadsare wi,i=1...N . For wi, the whole running duration isTi. At the jth time unit of leasing resources (Δt),j = 1...M , the size of the resources provisioned to wi

is ci,j servers, and the peak resource consumption ismax(

∑Ni=1 ci,j,j=1...M ), while the effective resource con-

sumption is∑N

i=1

∑� TiΔt �

j=1 ci,jΔt.Lowering the server cost can not sacrifice performance

concerns of both SPs and end users. For workloads,when a user leases resources, resources are charged ata granularity of a time unit of leasing resources—Δt. Forexample, in EC2, the time unit of leasing resources isone hour. So we can use the effective resource consumptionof workloads that is the sizes of leased resources times theirrespective leasing terms to indirectly reflect the SP cost.The other performance metrics are closely related withdifferent workloads. For parallel batch jobs or MapRe-duce jobs, the major concern of an SP is the throughputin terms of the number of completed jobs [3] [8]; whilethe main concern of end users is the average turnaroundtime per job, which is the average time from submitting jobstill completing them [8] [10]. For Web servers and searchengines, the major concern of an SP is the throughput interms of requests per second [5] [7], while the quality ofservice in terms of the average response time per request isthe major concern of end users [5] [7].

In this context, this paper focuses on how to lowerthe server cost while not severely degrading the performancemetrics of SPs and end users. As explained above, the min-imum system capacity allowed in terms of the numberof nodes at least must be greater than the peak resourceconsumption of workloads, so we measure the minimumserver cost in terms of the peak resource demands ofworkloads, and our resource provisioning solution fo-cuses on how to decrease the peak resource demandswhile not severely degrading the performance metricsof SPs and end users.

2.1 Why previous work fails to propose a coopera-tive solution?

First, in the past, users choose dedicated systems forparallel batch jobs or Web services [20]. The ideathat interactive/web workloads and batch workloadsshould be scheduled differently is classics. Moreover,the trace data of individual HPC or Web service work-loads are publicly available in http://www.cs.huji.ac.il/labs/parallel/workload/ or http://ita.ee.lbl.gov/html/traces.html, respectively. In this context, most of re-searchers focus on how to optimize for homogeneousworkloads.

Second, cloud is an emerging platform in recent years.Only after Amazon’ EC2 becomes a public utility, peo-ple begin to realize that more and more heterogeneousworkloads appeared on the same platform.

Third, the trace data of consolidating heterogeneousworkloads on the same platform are not publicly available,

and hence researchers have no publicly available bench-marks and traces to evaluation their solutions, whichprevents them from deep investigations into the issueof cooperative resources provisioning for heterogeneousworkloads. Recently, we release a benchmark suite forcloud computing [61].

Lastly, EC2 uses a statistical multiplexing technique,which indeed leverages the differences of heterogenousworkloads. However, our work takes a further actionand takes advantage of statistical multiplexing at ahigher granularity—a group of two heterogeneous work-loads to save the server cost.

3 THE COOPERATIVE RESOURCE PROVISION-ING MODEL

As more and more computing moves to data centers,a RP needs to provision resources for increasing hetero-geneous workloads. Different from the server sprawlcaused by isolating applications or operating systemheterogeneity [50] (server consolidation), increasing het-erogeneous workloads in terms of both types and in-tensities raise new challenges in the system capacityplanning, since they have significantly different resourceconsumption characteristics.

In this paper, for four typical workloads: batch jobs,Web servers, search engines, and MapReduce jobs, weleverage the workload differences to save the server costin two ways.

First, their resource demands in terms of usage mode,timing, intensity, size and duration are significantly dif-ferent. Web server workloads are often composed ofa series of requests with short durations like seconds;the ratios of peak loads to normal loads are high; re-quests can be serviced simultaneously and interleavedlythrough resource multiplexing. Batch job workloads arecomposed of a series of submitted jobs with varyingresource demand sizes, and each job is a parallel orserial application, whose runtime is varying but muchlonger, e.g., hours; running a parallel application needsa group of exclusive resources. Though MapReduce jobsshare several similarities with batch jobs written in MPI,they have two distinguished differences: (a) MapReducejobs are often composed of different jobs with varioussizes of data inputs, which decide scales of their resourcedemands. (b) In running, MapReduce tasks are indepen-dent of each other so killing one task will not impactanother [28]. This independence between tasks is incontrast to batch jobs of programming models like MPIin which tasks execute concurrently and communicateduring their execution [28].

Second, their performance goals are different, and wecan coordinate resource provisioning actions of serviceproviders running heterogeneous workloads to decreasepeak resource consumption. In general, from perspec-tives of end users, submitted jobs can be queued whenresources are not available. However, for Web serviceslike Web servers or search engines, each individual

IEEE TRANSACTIONS ON COMPUTERSThis article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Page 4: Cost-aware Cooperative Resource Provisioning for ...weisong.eng.wayne.edu/_resources/pdfs/zhan12_phoenix.pdf · data centers [32], which is five to ten over small-scale deployments

4

request needs an immediate response. So when urgentresources are needed by web service spikes, we canfirstly satisfy the resource requests of Web services whiledelaying execution of parallel batch jobs or MapReducejobs.

For cooperative resource provisioning, we proposetwo guiding principles as follows: first, if an SP doesnot allow cooperative resource provisioning, e.g., forsecurity or privacy, or can not find a coordinated SP,the RP will independently provision resources for itsloads. We call two SPs adopting the cooperative resourceprovisioning solution two coordinated SPs. Second, ifallowed by SPs, the RP supports cooperative resourceprovisioning at a granularity of a group, which consists oftwo heterogeneous workloads.

Fig. 1: The system capacity and two resource bounds.

For each group of heterogeneous workloads, we pro-pose the following cooperative resource provisioningmodel.

First, when an SP resorts to a hosting data center, itneeds to specify two resource bounds: the lower resourcebound and the upper resource bound. The RP will guaranteethat the resources within the lower resource bound canonly be allocated to the specified SP or its coordinatedSP. Regarding the resources between the lower andthe upper resource bounds, the RP firstly satisfies re-source requests of the specified SP or its coordinatedSP, however the resources between two bounds can bereallocated to another SP when they are idle. Fig.1 showsthe relationship among the system capacity and tworesource bounds.

Second, the RP decides whether a group of two SPcandidates can join or not according to the followingstrategy.

The RP will set and maintain a resource bookingthreshold—RBT that determines whether two SP candidatesare allowed to join or not. Through setting a reasonablevalue of RBT, the RP guarantees that the sum of theupper bounds of all SPs joined in the system and the twoSPs to be joined (candidates) is lower than the systemcapacity times RBT. When RBT is larger than one, itmeans resources are overbooked. In addition, the sumof the lower bounds of the two SP candidates mustbe lower than the size of the current idle resources inthe system. Only when the two requirements are met atthe same time, the two SP candidates could join in thesystem.

Third, if a group is allowed to join, the RP allocatesresources within the lower resource bounds to the group,which consists of the two SPs running heterogeneousworkloads, at their startup, respectively.

Fourth, for each group of two heterogenous work-loads, one heterogenous workload whose resource re-quests need an immediate response has a higher pri-ority than another one whose resource requests canbe queued. For example, for two typical heterogenousworkloads: Web servers and parallel batch jobs, Webservers have a higher priority than batch jobs becausesubmitted jobs can be queued when resources are notavailable, while for Web services each individual requestneeds an immediate response.

We propose the following runtime algorithm as shownin Algorithm 1 of Appendix A.

1) If it is an SP with a higher priority to requestresources, the RP will allocate resources with therequested size immediately. However, if it is an SPwith a lower priority to request resources, the extraconditions need to be checked as follows. Since theresources are not unlimited, the RP sets a usage ratereference —URR and a proportional sharing factor—PSF to coordinate resource sharing under compet-itive conditions. For an SP with a lower priority, ifthe size of its allocated resources plus its requestedresources is over its upper bound, no resources willbe allocated. Otherwise the SP will be granted withonly a part of the requested resources, which isdecided by the RP through comparing the usage ratethat is the ratio of the resources allocated to all SPs tothe system capacity, with URR.

2) If the usage rate is larger than URR, the RP willgrant resources to the SP with the size defined asfollows:AllocatedSize = min{the size of the requestedresources, the size of the unallocated resources}∗PSF .We define PSF as the ratio of LB of the SP witha lower priority requesting for resources over thesum of LBs of all joined SPs with lower priorities.

3) When the usage rate is not larger than URR, ifthe size of the requested resources is less thanthe size of the unallocated resources, the RP willgrant resources to the SP with the requested size,otherwise the RP will grant resources to the SP withthe size defined as follows:AllocatedSize = (the size of the unallocated re-sources) ∗PSF .

Fifth, it is the SP that proactively decides to requestor release resources according to their own resourcedemands.

Lastly, if an SP does not allow cooperative resourceprovisioning or can not find a coordinated SP, we cantreat the coordinated SP as a null and simplify the abovemodel as an independent resource provisioning solutionfor either a workload with a higher priority or a lowerone.

IEEE TRANSACTIONS ON COMPUTERSThis article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Page 5: Cost-aware Cooperative Resource Provisioning for ...weisong.eng.wayne.edu/_resources/pdfs/zhan12_phoenix.pdf · data centers [32], which is five to ten over small-scale deployments

5

When we presume the RP has unlimited resources,both the resource booking threshold and the usage rate refer-ence can be set as one. This case indicates that the resourcesare enough with respect to SPs’ resources requests, soresources overbooking is not necessary and resourcesdemanded in each request � the system capacity.

4 DESIGN AND IMPLEMENTATION OFPHOENIXCLOUD

In this section, we give out the design and implementa-tion of PhoenixCloud, which is on the basis of our pre-vious system—-DawningCloud [20] [21]. Different fromDawningCloud, the unique difference of PhoenixCloudis that it supports cooperative resource provisioning forheterogeneous workloads through creating coordinatedruntime environments that are responsible for cooperativelymanaging resources and workloads.

Presently, PhoenixCloud only supports cooperative re-source provisioning between Web servers, parallel batchjobs, search engine, and MapReduce jobs. However, itcan be extended for other data-intensive programmingmodels, e.g., Dryad-like data flow and All-Pairs, whichare supported by our previous Transformer system [33].

PhoenixCloud follows a two-layered architecture: oneis the common service framework (in short CSF) for aRP, and another is thin runtime environment software (inshort, TRE) for an SP, which is responsible for managingresources and workloads for each SP. The two-layeredarchitecture indicates that there lies a separation betweenthe CSF and a TRE: the CSF is provided and managedby a RP, independent of any TRE; with the supportof the CSF, a TRE or two coordinated TREs can becreated on demand. The concept of TRE implies that forheterogeneous workloads, the common sets of functionsof runtime environments are delegated to the CSF, whilea TRE only implements the core functions for a specificworkload.

Fig. 2 describe the system architecture view ofPhoenixCloud, of which a TRE for parallel batch jobsor MapReduce jobs (in short PBJ TRE) and a TRE forWeb servers (in short WS TRE) reuse the CSF. Hereby,we just simply gives out the major components of CSFand TRE, and the detail of the other components can befound at our publicly available technical report [35].

Among the CSF, the resource provision service is re-sponsible for coordinating resources provisioning; thereare two types of monitors: the resource monitor and theapplication monitor. The resource monitor on each nodemonitors usages of physical resources, e.g. CPU, mem-ory, swap, disk I/O and network I/O. The applicationmonitor detects application status.

There are three components in a TRE: the manager, thescheduler, and the Web portal. The manager is responsiblefor dealing with user requests, managing resources, andinteracting with the CSF. The scheduler is responsible forscheduling jobs or distributing requests. The Web portalis a graphical user interface, through which end users

Fig. 2: Interactions of a PBJ TRE and a WS TRE with theCSF.

submit and monitor jobs or applications. When a TREis created, a configuration file is used to describe theirdependencies.

Fig.2 shows the interactions among two TREs and theCSF. The coordination of two service providers runningheterogeneous workloads and the RP is as follows:

Specified for the resource provision service, a resourceprovision policy determines when the resource provisionservice provisions how many resources to a TRE orhow to coordinate resources between two coordinatedruntime environments. In Section 3, we introduce ourresource provision policy.

Specified for the manager—the management entity ofeach TRE, a resource management policy determines whenthe manager requests or releases how many resourcesfrom or to the resource provision service according towhat policy.

For different workloads, the scheduling policy has differ-ent implications. For parallel batch jobs or MapReducejobs, the scheduling policy determines when and howthe scheduler chooses jobs or tasks (which constitute ajob) for running. For Web servers, the scheduling policyincludes two specific policies: the instance adjustmentpolicy and the request distribution policy. The instanceadjustment policy decides when the number of Webserver instances is adjusted to what an extent, and therequest distribution policy decides how to distributerequests according to what criteria.

As an example, the interaction of a WS TRE and a PBJTRE with the CSF is as follows, respectively:

1) The WS manager obtains initial resources withinthe lower resource bound from the resource provi-sion service, and runs Web server instances with amatching scale.

2) The WS manager interacts with the load balancer—a type of the scheduler that is responsible for as-

IEEE TRANSACTIONS ON COMPUTERSThis article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Page 6: Cost-aware Cooperative Resource Provisioning for ...weisong.eng.wayne.edu/_resources/pdfs/zhan12_phoenix.pdf · data centers [32], which is five to ten over small-scale deployments

6

signing workloads to Web server instances to setits request distribution policy. The WS managerregisters the IP and port information of Web serverinstances to the load balancer, and the load balancerdistributes requests to Web server instances accord-ing to the request distribution policy. We integrateLVS (http://www.linuxvirtualserver.org/) as theIP-level load balancer.

3) The monitor on each node periodically checks re-sources utilization rates and reports them to theWS manager. If the threshold performance valueis exceeded, e.g., the average of utilization rates ofCPUs consumed by instances exceeds 80 percent,the WS manager adjusts the number of Web serverinstances according to the instance adjustment pol-icy.

4) According to the current Web server instances, theWS manager requests or releases resources from orto the resource provision service.

For parallel batch jobs or MapReduce jobs, Phoenix-Cloud adopts the resource management policy as fol-lows:

There is a time unit of leasing resources, which is repre-sented as L. We presume that the lease term of a resourceis a time unit of leasing resource times a positive integer. InEC2-like systems, for parallel batch jobs, each end useris responsible for manually managing resources, and wepresume that a user only releases resources at the end ofeach time unit of leasing resources if a job runs over. Thisis because: first, EC2-like systems charge the resourceusage in terms of a time unit of leasing resources (anhour); second, it is difficult for end users to predict thecompleted time of jobs, and hence releasing resources toRP on time is almost impossible.

We define the ratio of adjusting resource as the ratio ofthe accumulated resource demands of all jobs in queue tothe current resources owned by a TRE. When the ratio ofadjusting resource is greater than one, it indicates that forimmediate running, some jobs in the queue need moreresources than that currently owned by a TRE.

We set two threshold values of adjusting resources,and call them the threshold ratio of requesting resource,which is represented as U , and the threshold ratio ofreleasing resource, which is represented as V , respectively.The processes of requesting and releasing resources areas follows:

First, the PBJ manager registers a periodical timer (atime unit of leasing resources) for adjusting resources pertime unit of leasing resources. Driven by the periodicaltimer, the PBJ manager scans jobs in queue.

Second, if the ratio of adjusting resources exceeds thethreshold ratio of requesting resource, the PBJ managerwill request resources with the size of DR1 as follows:

for parallel batch jobs:DR1=((the accumulated resources needed by all jobs

in the queue)−(the current resources owned by a PBJTRE))/2)

for MapReduce jobs:

DR1=((the resources needed by the present biggest jobin queue))

Third, for parallel batch jobs if the ratio of adjustingresource does not exceed the threshold ratio of request-ing resources, but the ratio of the resource demand of thepresent biggest job in queue to the current resources owned bya TRE is greater than one, the PBJ manager will requestresources with the size of DR2:

DR2=(the resources needed by the present biggest jobin queue)-(the current idle resources owned by a TRE)

For parallel batch jobs when the ratio of the resourcedemand of the present biggest job in the queue to the cur-rent resources owned by a TRE is greater than one, itimplies that the largest job will not run without availableresources. Please note that significantly different fromparallel batch jobs, MapReduce tasks (that constitutesa job) are independent of each other [28], so the casementioned above will not happen.

Fourth, if the ratio of adjusting resources is lowerthan the threshold ratio of releasing resources, the PBJmanager will releases idle resources with the size of RSS(ReleaSing Size).

RSS=(idle resources owned by the PBJ TRE)/2.Fifth, if the resource provision service proactively

provisions resources to the PBJ manager, the latter willreceive resources.

Zhang et al. [22] argue that in managing web servicesof data centers, actual experiments are cheaper, simpler,and more accurate than models for many managementtasks. We also hold the same position. In this paper, wedeploy the real systems to decide the resource manage-ment policy for two Web servers and one search engineworkload traces.

5 PERFORMANCE EVALUATIONSIn this section, for four typical workloads: Web servers,parallel batch jobs, search engine, and MapReduce jobs,we compare the performance of the non-cooperativesystem: EC2+RightScale-like systems and PhoenixCloudfor the same workload traces: first, on a hosting datacenter, the RP deploys an EC2-like system for a largeamount of end users that submit parallel batch jobs orMapReduce jobs, and RightScale for service providersrunning Web server or search engine; second, the RPdeploys PhoneixCloud, providing a cooperative resourceprovisioning solution for heterogeneous workloads. Wechoose EC2+RightScale-like systems because EC2 lever-ages the statistical multiplexing technique to decreaseserver sprawl while RightScale roughly implementsthe elastic resource provisioning technique proposed instate-of-the-art work [47] [48] [49] [50].

5.1 Real workload tracesFor parallel batch jobs, we choose three typical work-load traces: NASA iPSC, SDSC BLUE and LLNLThunder from http://www.cs.huji.ac.il/labs/parallel/workload/. NASA iPSC is a real trace segment of two

IEEE TRANSACTIONS ON COMPUTERSThis article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Page 7: Cost-aware Cooperative Resource Provisioning for ...weisong.eng.wayne.edu/_resources/pdfs/zhan12_phoenix.pdf · data centers [32], which is five to ten over small-scale deployments

7

weeks from Oct 01 00:00:03 PDT 1993. For the NASAiPSC trace, the configuration of the cluster system is128 nodes. SDSC BLUE is a real trace segment of twoweeks from Apr 25 15:00:03 PDT 2000. For the SDSCBLUE trace, the cluster configuration is 144 nodes. LLNLThunder is a real trace segment of two weeks fromFeb 1 18:10:25 PDT 2007. For the LLNL Thunder trace,the configuration of the cluster system is 1002 nodes(excluding the management nodes and so on).

For Web servers, we use two real workload traces fromhttp://ita.ee.lbl.gov/html/traces.html: one is the WorldCup workload trace [2] from June 7 to June 20 in 1998,and the other is the HTTP workload trace from a busyInternet service provider—ClarkNet from August 28 toSeptember 10 in 1995. Meanwhile, we choose a publiclyavailable anonymous search engine workload trace fromMarch 28 00:00:00 to 23:59:59 in 2011 [56], which we callSEARCH in the rest of this paper.

5.2 Synthetic Workload tracesTo the best of our knowledge, the real traces of parallelbatch jobs, Web servers, search engines, and MapReducejobs on the same platform are not available. So in ourexperiments, on a basis of the real workload traces intro-duced in Section 5.1, we create synthetic workload traces. We propose a tuple of (PRCA, PRCB) to representtwo heterogeneous workload traces: A that has a lowerpriority, e.g., parallel batch jobs or MapReduce jobs, andB that has a higher priority, e.g., Web servers or searchengines; PRCA is the maximum resource demand ofthe largest job in A, and PRCB is the peak resourceconsumption of B. For example, a tuple of (100, 60) thatis scaled on a basis of the SDSC BLUE and World Cuptraces has two-fold implications: (a) we scale the SDSCBLUE and World Cup traces with two different constantfactors, respectively; (b) on the same simulated clustersystem, the maximum resource demand of the largestjob in SDSC BLUE and the peak resource consumptionof World Cup is 100 and 60 nodes, respectively.

5.3 Experiment methodsTo evaluate and compare the PhoenixCloud andEC2+RightScale-like systems, we adopt the followingexperiments methods.

5.3.1 The experiments of deploying real systemsFor Web servers and search engines, we obtain resourceconsumption traces through deploying and running thereal systems for the three real workload traces in Section5.1. For MapReduce jobs, we also perform experimentsthrough deploying and running real systems on physicalhardware as shown in Section 5.5.2.

5.3.2 The simulated experiments for heterogeneousworkloadsThe period of a typical workload trace is often weeks,or even months. To evaluate a system, many key factors

have effects on experiment results, and we need tofrequently perform time-consuming experiments. More-over, when we perform experiments for several hun-dreds of workload traces, their resource consumptionis up to more than ten thousand nodes, which arenot affordable. So we use the simulation method tospeedup experiments. We speed up the submission andcompletion of jobs by a factor of 100. This speedupallows two weeks’ trace to complete in about threehours. In addition, in Section 5.7, we deploy the realsystems on physical hardware to validate the accuracies ofour simulated systems.

5.4 The testbed

Shown in Fig.3, the testbed includes three types of nodes,nodes with the name starting with glnode, nodes withthe name starting with ganode, and nodes with the namestarting with gdnode. The nodes of glnode have the sameconfiguration, and each node has 2 GB memory andtwo quad-core Intel(R) Xeon(R) (2.00 GHz) processors.The OS is a 64-bit Linux with the kernel of 2.6.18-xen.The nodes of ganode have the same configuration, andeach node has 1 GB memory and 2 AMD Optero242 (1.6GHz) processors. The OS is an 64-bit Linux with thekernel version of 2.6.5-7.97-smp. The nodes of gdnodehave the same configuration, and each node has 4 GBmemory and one quad-core Intel(R) Xeon(R) (1.60 GHz)processor. The OS is an 64-bit Linux with the kernelversion of 2.6.18-194.8.1.el5. All nodes are connectedwith a 1 Gb/s switch.

Fig. 3: The testbed.

5.5 The experiments of deploying real systems

5.5.1 Running real web servers and search enginessystemsIn this section, we deploy the real systems to obtain theresource consumption traces for two Web servers andone search engine workload traces.

We deploy eight XEN virtual machines (http://www.xensource.com/) on each node of starting with glnode.For each XEN virtual machine, one core and 256 MBmemory is allocated, and the guest operating system

IEEE TRANSACTIONS ON COMPUTERSThis article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Page 8: Cost-aware Cooperative Resource Provisioning for ...weisong.eng.wayne.edu/_resources/pdfs/zhan12_phoenix.pdf · data centers [32], which is five to ten over small-scale deployments

8

is a 64-bit CentOS with the kernel version of 2.6.18-XEN. we deploy the real PhoenixCloud system: theload balancer is LVS with the direct route mode http://kb.linuxvirtualserver.org/wiki/LVS/DR; each agentand each monitor is deployed on each virtual machine,respectively; LVS and the other services are deployed onganode004, since all of them have light loads.

We choose the least-connection schedulingpolicy (http://kb.linuxvirtualserver.org/wiki/Least-Connection Scheduling) to distribute requests.We choose httperf (http://www.hpl.hp.com/research/linux/httperf/) as the load generator and an opensource application—ZAP! (http://www.indexdata.dk/)as the target Web server. The versions of httperf,LVS and ZAP! are 0.9.0, 1.24 and 1.4.5, respectively.Two httperf instances are deployed on ganode002 andganode003.

The Web server workload trace is obtained from theWorld Cup workload trace [2] with a scaling factor of2.22. The experiments include two steps. First, we decidethe instance adjustment policy; secondly, we obtain theresource consumption trace.

In the first step, we deploy PhoenixCloud with theinstance adjustment policy disabled. For this configu-ration, the WS manager will not adjust the numberof the Web service instances. On the testbed of 16virtual machines, 16 ZAP! instances are deployed witheach instance deployed on each virtual machine. Whenhttperf generates different scales of loads, we record theactual throughput, the average response time, and theaverage utilization rate of CPU cores. Since we guaranteethat one CPU core is allocated to one virtual machine,for virtual machine, the number of VCPUs is numberof CPU cores. So the average utilization rate of eachCPU core is also the average utilization rate of VCPUs.We observed that when the average utilization rate ofVCPUs is below 80 percent, the average response time ofrequests is less than 50 milliseconds. However, when theaverage utilization rate of VCPUs increases to 97%, theaverage response time of requests dramatically increaseto 1528 milliseconds. Based on the above observation,we choose the average utilization rate of VCPUs as thecriterion for adjusting the number of instances of ZAP!,and set 80 percent as the threshold value. For ZAP!, wespecify the instance adjustment policy as follows: theinitial service instances are two. If the average utilizationrate of VCPUs consumed by all instances of Web serviceexceeds 80% in the past 20 seconds, the WS managerwill add one instance. If the average utilization rateof VCPUs, consumed by the current instances of Webservice, is lower than (0.80(n−1

n )) in the past 20 secondswhere n is the number of current instances, the WSmanager will decrease one instance.

In the second step, we deploy PhoenixCloud withthe above instance adjustment policy enabled. The WSmanager adjusts the number of Web service instancesaccording to the instance adjustment policy. In the ex-periments, we also record the relationship between the

actual throughput, the average response time, and thenumber of virtual machine. We observe that for differentnumber of VMs, the average response time is below 700milliseconds and the throughput increases linearly whenthe number of VM increases. This indicates that theinstance adjust policy is appropriate, may not optimal.

With the above policies, we obtain the resource con-sumption trace of two weeks for the World Cup work-load trace. Using the same approach, we also obtain theresource consumption trace of two weeks for ClarkNetand an anonymous search engine trace—SEARCH [56]with a scaling factor of 772.03 and 171.29, respectively.The resource consumption traces can be found at AppendixB.1.

5.5.2 Experiments of running MapReduce jobs on phys-ical hardwareThis section evaluates PhoenixCloud for MapReducejobs through deploying the real systems.

The testbed is composed of 20 X86-64 nodes, start-ing from gdnode36 to gdnode55 as shown in Fig.3. weselect a series of widely used MapReduce jobs as ourMapReduce workloads, the details of which can befound in Appdendix A.5. Meanwhile, we replay theSEARCH resource consumption trace introduced in Sec-tion 5.5.1. We use (PRCA, PRCB) to present the peakresource consumption of MapReduce jobs and searchengine workloads. In this experiment, (PRCA, PRCB)is (19, 64).

We use Hadoop with the version 0.21.0 to run MapRe-duce jobs, and its JDK version is 1.6.0.20. We configurethe namenode and jobtracker on one node while datan-odes and tasktrackers on the other nodes. In Hadoop,the block size is 64 MB, and there are 4 map slots and2 reduce slots on every node, and the scheduling policyis first come first serve(FCFS).

We adopt the resource management policy in Section4. In PhoenixCloud, for MapReduce jobs, the baselineparameters are [R0.28/E19/U5/V 0.05/L60]. we also setRBT and URR as one, respectively. In Hadoop, theresources requirement of a MapReduce job is determinedby its map task number, and hence the requested re-sources are the map task number divided the the mapslots of each node. So as to prevent the peak resourcedemand of MapReduce jobs from exceeding the systemcapacity of the testbed, we divide each resource requestby a scaling number. For the current 20-node testbed,the scaling number is 16. For an EC2-like system, werun each MapReduce job one by one according to itstimestamp. Finally, we add up the resource consumptionof all MapReduce jobs to get its peak and total resourceconsumption. For the same reason, we also use a scalingfactor to decrease the peak resource demand of MapRe-duce jobs.

From the above results, we can observe that ouralgorithm works well for MapReduce jobs. With re-spect to the non-cooperative systems: EC2+RightScale-like systems, PhoenixCloud can significantly decrease

IEEE TRANSACTIONS ON COMPUTERSThis article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Page 9: Cost-aware Cooperative Resource Provisioning for ...weisong.eng.wayne.edu/_resources/pdfs/zhan12_phoenix.pdf · data centers [32], which is five to ten over small-scale deployments

9

TABLE 1: The RP’s metrics for MapReduce jobs and thesearch engine workload—SEARCH.

System peakresourceconsump-tion

resourceconsump-tion(node ∗hour)

saved peakresourceconsump-tion

savedresourceconsump-tion

PhoenixCloud

78 960 63.21% 57.33%

Non-cooperative

212 2250 0 0

TABLE 2: The service provider’s metrics for MapReducejobs.

System number ofcompletedjobs

average turnaround time(seconds)

resourceconsumption(node ∗ hour)

PhoenixCloud

477 568 960

Non-cooperative

477 310 2250

the peak resource consumption by 63.21 percent andtotal resource consumption by 57.33 percent. Meanwhile,the throughput of PhoenixCloud is the same as that ofthe EC2-like system; the average turnaround time perjob is 568, 310 seconds for PhoenixCloud and the EC2-like system, respectively, and the average delay is 258seconds per job.

5.6 Simulation ExperimentsIn this section, we compare the performance ofEC2+RightScale-like systems and PhoenixCloud for Webserver and parallel batch jobs workload traces, and theexperiment setups are as follows:

5.6.1 The simulated systems(a) The simulated clusters. The workload traces men-tioned above are obtained from the platforms with differ-ent configurations. For example, NASA iPSC is obtainedfrom the cluster system with each node composed ofone processor; SDSC BLUE is obtained from the clustersystem with each node composed of eight processors;LLNL Thunder is obtained from the cluster system witheach node composed of four processors. In the rest ofexperiments, our simulated cluster system is modeled afterthe NASA iPSC cluster, comprising only single-processornodes.

(b) The simulated EC2+RightScale-like systems.Based on the framework of PhoenixCloud, we imple-ment and deploy the EC2+rightScale-like systems asshown in Fig.4 on the testbed. The simulated systemincludes two simulation modules: the job simulator andthe resource simulator. With respect to the real Phoenix-Cloud system, we only keep the resource provisionservice, the WS manager and the PBJ manager. Thejob simulator reads the number of nodes which eachjob requests in the trace file and sends the resourcerequests to the resource provision service, which assigns

corresponding resources for each job. When each jobruns over, the job simulator will release resources tothe resource provision service. The resource simulatorsimulates the varying resources consumption and drivesthe WS manager to request or release resources fromor to the resource provision service. Because RightScaleprovides the same scalable management for Web serviceas PhoenixCloud, we just use the same resource consump-tion traces of Web service in Section 5.5.1 in two systems,respectively.

(c) The simulated PhoenixCloud. With respect tothe real PhoenixCloud system in Fig.2, our simulatedPhoenixCloud keeps the resource provision service, thePBJ manager, the WS manager, and the scheduler, whileother services are removed. The resource provisionservice enforces the cooperative resource provisioningmodel defined in Section 3.

Fig. 4: The simulated system of EC2+RightScale-likesystems.

5.6.2 Experiment configurations(a) The scheduling policy of parallel batch jobs. Forparallel batch jobs, PhoenixCloud adopts the first-fitscheduling policy. The first-fit scheduling policy scansall the queued jobs in the order of job arrivals andchooses the first job, whose resources requirement canbe met by the system, to execute. The EC2-like systemneeds no scheduling policy, since it is each end userthat is responsible for manually requesting or releasingresources for running parallel batch jobs.

(b) The resource management policy. For parallelbatch jobs, we adopt the resource management policystated in Section 4.

5.6.3 Experiments with limited system capacitiesIn Appendix B.2, we present experiments results whenwe presume the RP has limitless resources. In this sec-tion, we report results when we presume the RP haslimited resources of different system capacities.

For the six workload traces introduced in Section5.1, there are three combinations. We use the combi-nations of heterogeneous workloads:((IPSC, WorldCup),(SDSC, Clark), (LLNL, SEARCH)), and their numbers are((128, 128), (144, 128), (1002, 896)), respectively.

Through comparisons with a large amountof experiments, we set the baseline parametersin PhoenixCloud: [R0.5/E400/U1.2/V 0.1/L60]for (IPSC,WorldCup) and (SDSC,Clark); and

IEEE TRANSACTIONS ON COMPUTERSThis article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Page 10: Cost-aware Cooperative Resource Provisioning for ...weisong.eng.wayne.edu/_resources/pdfs/zhan12_phoenix.pdf · data centers [32], which is five to ten over small-scale deployments

10

[R0.5/E3000/U1.2/V 0.1/L60] for (LLNL,SEARCH).[Ri/Ej/Uk/V l/Lm] indicates that the ratio of the sumof the two lower resource bounds to the sum of PRCA andPRCB , which is represented as R, is i; the sum of thetwo upper resource bounds, which is represented as E, is jnodes; the threshold ratio of requesting resources U is k;the threshold ratio of releasing resources V is l; the timeunit of leasing resources L is m minutes. In AppendixB.4, we investigate the effects of different parameters inPhoenixCloud.

Each experiment is performed six times, and we reportthe mean values across six times experiments. In Table3, for different system capacities, i.e., 4000, 5000, 6000,in addition to the baseline parameters for SPs, we setthe usage rate reference—URR, and the resource bookingthreshold—RBT for the RP as 0.5 and 2, respectively. ForE (the sum of the two upper resource bounds), [A/B/C]means that for (IPSC, WorldCup), the sum of two upperbounds is A; for (SDSC, Clark), the sum of two upperbounds is B; for(LLNL, SEARCH), the sum of two upperbounds is C.

TABLE 3: The configurations vs. different system capacities.

  System capacity (nodes)   System         E [A/B/C]        URR   RBT
  4000                      PhoenixCloud   [400/400/3000]   0.5   2
  5000                      PhoenixCloud   [400/400/3000]   0.5   2
  6000                      PhoenixCloud   [400/400/3000]   0.5   2

TABLE 4: The RP's metrics of six real workload traces in PhoenixCloud vs. system capacities.

  System capacity (nodes)   System            Peak resource consumption (nodes)   Effective resource consumption (node*hour)   Total resource consumption (node*hour)
  six traces                Non-cooperative   6126                                 677190                                        687441
  4000                      PhoenixCloud      3518                                 545450                                        551845
  5000                      PhoenixCloud      3283                                 546395                                        552846
  6000                      PhoenixCloud      2899                                 543569                                        549832

From Tables 4 through 8, we can see that our algorithm works well across different system capacities:

First, from the perspective of the RP, PhoenixCloud saves a significant amount of peak resource consumption. Even when the system capacity in PhoenixCloud decreases to 4000 nodes, only 65.3 percent of the minimum system capacity allowed in EC2+RightScale-like systems (which must be at least as large as the peak resource consumption), PhoenixCloud reduces the peak resource consumption and the total resource consumption with respect to EC2+RightScale-like systems by up to 43 and 20 percent, respectively.
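These percentages can be checked against the figures in Table 4 (a non-cooperative peak of 6126 nodes, and a 3518-node peak and 551845 node*hour total for PhoenixCloud at the 4000-node capacity):

    \[ \frac{4000}{6126} \approx 0.653, \qquad \frac{6126 - 3518}{6126} \approx 42.6\%, \qquad \frac{687441 - 551845}{687441} \approx 19.7\% \]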

TABLE 5: The service provider's metrics of NASA iPSC vs. system capacities.

  System capacity (nodes)   System            Completed jobs   Average turnaround time (s)   Resource consumption (node*hour)
  six traces                Non-cooperative   2603             573                           54118
  4000                      PhoenixCloud      2603             766                           35786
  5000                      PhoenixCloud      2603             738                           35621
  6000                      PhoenixCloud      2603             733                           35146

TABLE 6: The service provider's metrics of SDSC Blue vs. system capacities.

  System capacity (nodes)   System            Completed jobs   Average turnaround time (s)   Resource consumption (node*hour)
  six traces                Non-cooperative   2657             1975                          35838
  4000                      PhoenixCloud      2656             2642                          31231
  5000                      PhoenixCloud      2652             2546                          31654
  6000                      PhoenixCloud      2654             2535                          31573

These results have two causes: (a) parallel batch jobs can be queued in PhoenixCloud, whereas the EC2-like system provisions the required resources immediately for each batch job, so the resource provider using PhoenixCloud can lower both the peak and the total resource consumption with respect to EC2+RightScale-like systems; (b) in PhoenixCloud, setting a usage rate reference and a proportional sharing factor helps decrease the peak resource consumption effectively under competitive conditions.

Second, from the perspective of the service providers, the throughput of PhoenixCloud in terms of the number of completed jobs changes little with respect to EC2+RightScale-like systems (the maximal decrease is only 0.2 percent), and the average turnaround time is delayed by small amounts (the maximal delay is 34, 34 and 21 percent for NASA, SDSC and LLNL, respectively). On the other hand, the resource consumption of each SP is reduced significantly: the maximal saving is 35, 13 and 21 percent for NASA, SDSC and LLNL with respect to EC2+RightScale-like systems, respectively. The reason is as follows: parallel batch jobs may be queued in PhoenixCloud, whereas the EC2-like system provisions the required resources immediately for each batch job, so the number of completed jobs and the average turnaround time are affected; however, the resource management policy in PhoenixCloud keeps these effects small and lets the SPs save resource consumption significantly.


TABLE 7: The service provider's metrics of LLNL Thunder vs. system capacities.

  System capacity (nodes)   System            Completed jobs   Average turnaround time (s)   Resource consumption (node*hour)
  six traces                Non-cooperative   7273             2465                          339416
  4000                      PhoenixCloud      7261             2941                          270847
  5000                      PhoenixCloud      7262             2882                          271233
  6000                      PhoenixCloud      7262             2978                          268298

TABLE 8: With respect to EC2+RightScale-like systems, the peak and total resource consumption saved by PhoenixCloud for different system capacities.

  System capacity (nodes)   System         Saved peak resource consumption   Saved total resource consumption
  4000                      PhoenixCloud   42.57%                            19.72%
  5000                      PhoenixCloud   46.41%                            19.58%
  6000                      PhoenixCloud   52.68%                            20.02%


5.6.4 Experiments with different amounts of synthetic heterogeneous workloads

This section evaluates the efficiency of our approach under different amounts of synthetic heterogeneous workloads, from 6 to 120 traces.

The synthetic workloads are generated from the ones introduced in Section 5.1: (IPSC, WorldCup), (SDSC, Clark), and (LLNL, SEARCH). When we perform experiments for a group of (6N + 6) workload traces, we generate N new workload combinations as follows:

We denote the ith workload combination as ((IPSC_i, WorldCup_i), (SDSC_i, Clark_i), (LLNL_i, SEARCH_i)), i = 1...N. We divide each workload trace in ((IPSC, WorldCup), (SDSC, Clark), (LLNL, SEARCH)) into (N + 1) parts. To avoid duplicating workload traces in the same period within the same combination, for IPSC, SDSC, or LLNL we swap the first i parts with the remaining parts to obtain IPSC_i, SDSC_i, and LLNL_i; for WorldCup, Clark, or SEARCH we swap the last i parts with the preceding parts to obtain WorldCup_i, Clark_i, and SEARCH_i.

For parallel batch job workload traces, we take IPSC_i as an example: first, we define Sub = (336 * 3600 * i)/(N + 1). Second, for a submitted job with time stamp t, if t < Sub, we reset its submission time to t + (336 * 3600 - Sub); otherwise, to t - Sub.

For Web service workload traces, we take Clark_i as an example. We define MP = (336 * i)/(N + 1). Let RC_j, j = 1...336, be the resource consumption of the Clark workload trace at the jth hour. For the synthetic workloads Clark_i, i = 1...N: if j < MP, RC_{i,j} = RC_{336-MP+j}; otherwise, RC_{i,j} = RC_{j-MP}.
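Both rules are cyclic rotations of a 336-hour (two-week) trace, forward in time for batch-job submission times and backward for hourly Web-service consumption. The sketch below is our own illustration (0-based indexing, hypothetical function names), not the generator used in the paper:

    HOURS = 336                      # two-week traces used in the paper
    SECONDS = HOURS * 3600

    def shift_batch_trace(jobs, i, n):
        # jobs: list of (submit_time_in_seconds, ...) tuples.
        # Rotate submit times by Sub = 336*3600*i/(N+1), as described above.
        sub = SECONDS * i // (n + 1)
        return [(t + (SECONDS - sub) if t < sub else t - sub, *rest)
                for (t, *rest) in jobs]

    def shift_web_trace(rc, i, n):
        # rc: 336 hourly resource-consumption values (0-based here).
        # RC_i[j] = RC[336 - MP + j] if j < MP else RC[j - MP].
        mp = HOURS * i // (n + 1)
        return [rc[HOURS - mp + j] if j < mp else rc[j - mp]
                for j in range(HOURS)]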

Since we consider a large number of workload traces, we presume the RP has unlimited resources and set both the resource booking threshold and the usage rate reference to one (see the explanation in Section 3). We set the baseline parameters in PhoenixCloud to [R0.5/E100000/U1.2/V0.1/L60]; E100000 means that each service provider has a large upper bound and will get enough resources. Table 9 summarizes the RP's metrics of EC2+RightScale-like systems and PhoenixCloud running different amounts of workloads.

From Table 9, we observe that for larger-scale deployments, our solution saves the server cost more aggressively. For small-scale (6126 nodes), medium-scale (28206 nodes), and large-scale (116030 nodes) deployments, our solution saves the server cost by 18.01, 57.60, and 65.54 percent with respect to the EC2+RightScale-like solutions, respectively.

Note that in Table 9, for the six workload traces, the configuration of PhoenixCloud is slightly different from that in Table 8, and hence the results differ: first, in Table 9 we presume that each service provider has a larger upper bound; second, we presume that the resource provider has unlimited resources, and we set the resource booking threshold and the usage rate reference to one, respectively.

TABLE 9: With respect to EC2+RightScale-like systems, the peak and total resource consumption saved by PhoenixCloud for different amounts of workloads.

  Workload number   Peak resource consumption of EC2+RightScale-like systems (nodes)   Saved peak resource consumption   Saved total resource consumption
  6                 6126                                                                18.01%                            17.22%
  30                28206                                                               57.60%                            15.32%
  60                56920                                                               61.14%                            16.52%
  120               116030                                                              65.54%                            16.56%

5.7 The accuracy of the simulated systems

To verify the accuracy of our simulated system, we deploy the real PhoenixCloud system on the testbed. The testbed is composed of 40 x86-64 nodes, from gdnode36 to gdnode75, as shown in Fig. 3.

Since we have already run the real Web service workloads in Section 5.5.1, we only perform the real experiment of running the parallel batch workload on physical hardware. We synthesize a parallel batch job workload trace that includes 100 jobs with sizes from 8 to 64 cores; the 100 jobs are submitted to PhoenixCloud within 10 hours, with an average submission interval of 300 seconds.
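A small generator in the spirit of this description might look as follows; it is our own sketch, and the exponential inter-arrival times, power-of-two job sizes, and the 600-3600 s runtime range are assumptions not stated in the paper:

    import random

    def synthesize_trace(num_jobs=100, mean_interval=300, seed=1):
        # 100 jobs of 8-64 cores submitted over roughly 10 hours with an
        # average inter-arrival time of 300 s; runtimes and the
        # power-of-two sizes are assumed for illustration.
        random.seed(seed)
        t, jobs = 0.0, []
        for _ in range(num_jobs):
            t += random.expovariate(1.0 / mean_interval)
            cores = random.choice([8, 16, 32, 64])
            runtime = random.randint(600, 3600)
            jobs.append((int(t), cores, runtime))
        return jobs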


Our experiment includes two steps: first, we submit the synthetic parallel batch workload to the real PhoenixCloud system and collect the resulting workload trace; second, we submit the same workload trace to the simulated system. We then compare the metrics of the real and simulated systems to evaluate the accuracy of the simulation. The baseline parameters in PhoenixCloud are [R0.5/E40/U1.2/V0.1/L60], and we set both RBT and URR to one. Through the experiments, we find that the ratios of the peak resource consumption and the total resource consumption of the real PhoenixCloud system to those of the simulated system are both about 1.14, which validates that our simulated system is sufficiently accurate.

6 RELATED WORK

In this section, we summarize the related work from four perspectives.

6.1 Approaches and Systems Leveraged to Save Data Center Costs

Virtualization-based consolidation. Vogels et al. [50] summarize both state-of-the-practice and state-of-the-art work on leveraging virtualization-based consolidation to combat server sprawl caused by isolating applications or operating system heterogeneity. Several previous efforts utilize virtual machines to provision elastic resources for either multi-tier services [49] [48] or HPC jobs [20] in hosting data centers. Chase et al. [43] propose to provision server resources for co-hosted services in hosting data centers in a way that automatically adapts to offered load, so as to improve the energy efficiency of server clusters.

Statistical multiplexing. As pointed out by Zaharia et al. [27], state-of-the-practice EC2 systems leverage statistical multiplexing techniques to decrease the server cost (the system capacity). Urgaonkar et al. [39] present a resource overbooking approach to maximize the number of applications with light loads that can be housed on a given hardware configuration; however, they only validated their results for a large number of light service loads with fixed intensities. Based on insights gained from detailed profiling of several applications, both individual and consolidated, Choi et al. [53] developed models for predicting the average and sustained power consumption of consolidated applications.

6.2 Data Center Software Infrastructure

Two open source projects, OpenNebula (http://www.opennebula.org/) and Haizea (http://haizea.cs.uchicago.edu/), are complementary and can be used to manage virtual infrastructures in private/hybrid clouds [25]. Steinder et al. [18] show that virtual machines allow heterogeneous workloads to be collocated on any server machine. Our previous DawningCloud system [20] [21] aims to provide an enabling platform for answering the question: can scientific communities benefit from the economies of scale? Hindman et al. [26] present Mesos, a platform for sharing commodity clusters between multiple diverse cluster computing frameworks, such as Hadoop and MPI. The systems proposed in [5] [18] [25] [20] [26] do not support cooperative resource provisioning for heterogeneous workloads.

6.3 Resource Provisioning and Scheduling

Steinder et al. [18] only exploit a range of new automation mechanisms that benefit a system with a homogeneous, particularly non-interactive, workload by allowing more effective scheduling of jobs. Silberstein et al. [16] devise a scheduling algorithm for massively parallel tasks with different resource demands.

Lin et al. [12] provide an OS-level scheduling mechanism, VSched, which enforces compute rate and interactivity goals for interactive workloads and provides soft real-time guarantees for VMs hosted on a single server machine. Margo et al. [13] are interested in metascheduling capabilities (co-scheduling for Grid applications) in the TeraGrid system, including user-settable reservations among distributed cluster sites. Sotomayor et al. [17] present the design of a lease management architecture that only considers homogeneous workloads: parallel batch jobs mixed with best-effort lease requests and advance reservation requests.

Isard et al. [28] and Zaharia et al. [27] focus on resolving the conflict between scheduling fairness and data locality for MapReduce-like data-intensive jobs on low-end systems with directly attached storage. Benoit et al. [54] deal with the problem of scheduling multiple applications, each made of collections of independent and identical tasks, on a heterogeneous master-worker platform. Sadhasivam et al. [57] enhance Hadoop's fair scheduler to queue jobs for execution in a fine-grained manner using task scheduling.

In non-cooperative settings, Waldspurger et al. [59] present lottery scheduling, a novel randomized resource allocation mechanism that provides efficient, responsive control over the relative execution rates of computations; Jin et al. [60] propose an approach to proportional-share resource control in shared services through interposed request scheduling.

6.4 Data Center and Cloud Benchmarks

Benchmarking is the foundation of evaluating computer systems. Zhan et al. [41] systematically identify three categories of throughput-oriented workloads in data centers, whose targets are to increase the volume of throughput in terms of processed requests, processed data, or the maximum number of simultaneously supported subscribers, respectively, and coin the term high volume throughput computing (HVC) to describe those workloads and the data center systems designed for them. Luo et al. [61] propose a new benchmark suite, CloudRank, to benchmark and rank private cloud systems that are shared for running big data applications. Jia et al. [40] propose twenty-one representative applications in different domains as a benchmark suite, BigDataBench, for big data applications.

7 CONCLUSIONS

In this paper, we leveraged the differences among heterogeneous workloads and proposed a cooperative resource provisioning solution to save the server cost, which is the largest share of hosting data center costs. To that end, we built an innovative system, PhoenixCloud, that enables cooperative resource provisioning for four representative heterogeneous workloads: parallel batch jobs, Web servers, search engines, and MapReduce jobs. We proposed an innovative experimental methodology that combines real and emulated system experiments, and performed a comprehensive evaluation. Our experiments show that taking advantage of statistical multiplexing at a higher granularity, namely over a group of two heterogeneous workloads, gains more benefit with respect to the non-cooperative EC2+RightScale-like systems, which simply leverage the statistical multiplexing technique: for small-scale (6126 nodes), medium-scale (28206 nodes), and large-scale (116030 nodes) deployments, our solution saves the minimum server cost allowed (at least as large as the peak resource consumption) by 18, 58, and 66 percent with respect to the non-cooperative systems, respectively.

ACKNOWLEDGMENT

We are very grateful to the anonymous reviewers. This work is supported by the Chinese 973 project (Grant No. 2011CB302500) and the NSFC project (Grant No. 60933003).

REFERENCES

[1] K. Appleby et al. 2001. Oceano: SLA Based Management of a Computing Utility. In Proc. of IM 2001, pp. 855-868.
[2] M. Arlitt et al. 1999. Workload Characterization of the 1998 World Cup Web Site. Hewlett-Packard Company, 1999.
[3] A. AuYoung et al. 2006. Service contracts and aggregate utility functions. In Proc. of 15th HPDC, pp. 119-131.
[4] A. Bavier et al. 2004. Operating system support for planetary-scale network services. In Proc. of NSDI 04, pp. 19-19.
[5] J. S. Chase et al. 2001. Managing energy and server resources in hosting centers. In Proc. of SOSP '01, pp. 103-116.
[6] J. S. Chase et al. 2003. Dynamic Virtual Clusters in a Grid Site Manager. In Proc. of the 12th HPDC, pp. 90-103.
[7] A. Fox et al. 1997. Cluster-based scalable network services. SIGOPS Oper. Syst. Rev. 31, 5 (Dec. 1997), pp. 78-91.
[8] K. Gaj et al. 2002. Performance Evaluation of Selected Job Management Systems. In Proc. of 16th IPDPS, pp. 260-260.
[9] L. Grit et al. 2008. Weighted fair sharing for dynamic virtual clusters. In Proc. of SIGMETRICS 2008, pp. 461-462.
[10] K. Hwang et al. Scalable Parallel Computing: Technology, Architecture, Programming. McGraw-Hill, 1998.
[11] D. Irwin et al. 2006. Sharing networked resources with brokered leases. In Proc. of USENIX '06, pp. 18-18.
[12] B. Lin et al. 2005. VSched: Mixing batch and interactive virtual machines using periodic real-time scheduling. In Proc. of SC 05, pp. 8-8.
[13] M. W. Margo et al. 2007. Impact of Reservations on Production Job Scheduling. In Proc. of JSSPP 07, pp. 116-131.
[14] B. Rochwerger et al. 2009. The Reservoir model and architecture for open federated cloud computing. IBM J. Res. Dev., Vol. 53, No. 4, 2009.
[15] P. Ruth et al. 2005. VioCluster: Virtualization for dynamic computational domains. In Proc. of Cluster 05, pp. 1-10.
[16] M. Silberstein et al. 2006. Scheduling Mixed Workloads in Multi-grids: The Grid Execution Hierarchy. In Proc. of 15th HPDC, pp. 291-302.
[17] B. Sotomayor et al. 2008. Combining Batch Execution and Leasing Using Virtual Machines. In Proc. of HPDC 2008, pp. 87-96.
[18] M. Steinder et al. 2008. Server virtualization in autonomic management of heterogeneous workloads. SIGOPS Oper. Syst. Rev. 42, 1 (Jan. 2008), pp. 94-95.
[19] J. Zhan et al. 2005. Fire Phoenix Cluster Operating System Kernel and its Evaluation. In Proc. of Cluster 05, pp. 1-9.
[20] L. Wang et al. 2009. In cloud, do MTC or HTC service providers benefit from the economies of scale? In Proc. of SC-MTAGS '09. ACM, New York, NY, 1-10.
[21] L. Wang et al. 2011. In Cloud, Can Scientific Communities Benefit from the Economies of Scale? IEEE TPDS, vol. 23, no. 2, pp. 296-303, Feb. 2012.
[22] W. Zheng et al. 2009. JustRunIt: Experiment-based management of virtualized data centers. In Proc. of USENIX 09.
[23] Z. Zhang et al. 2006. Easy and reliable cluster management: the self-management experience of Fire Phoenix. In Proc. of IPDPS 2006.
[24] J. Zhan et al. 2008. Phoenix Cloud: Consolidating Different Computing Loads on Shared Cluster System for Large Organization. In Proc. of CCA '08. http://arxiv.org/abs/0906.1346.
[25] B. Sotomayor et al. 2009. Virtual infrastructure management in private and hybrid clouds. IEEE Internet Computing, pp. 14-22, 2009.
[26] B. Hindman et al. 2010. Mesos: A platform for fine-grained resource sharing in the data center. In Proc. of NSDI 2011.
[27] M. Zaharia et al. 2010. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In Proc. of EuroSys '10. ACM, New York, NY, 265-278.
[28] M. Isard et al. 2009. Quincy: fair scheduling for distributed computing clusters. In Proc. of SOSP '09. ACM, New York, NY, 261-276.
[29] J. Hamilton. 2009. Internet-scale service infrastructure efficiency. SIGARCH Comput. Archit. News 37, 3 (Jun. 2009), 232-232.
[30] D. Thain et al. Distributed Computing in Practice: The Condor Experience. Concurrency and Computation: Practice and Experience, 17(2):323-356, February 2005.
[31] U. Hoelzle et al. 2009. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. 1st ed. Morgan and Claypool Publishers.
[32] P. Wang et al. 2010. Transformer: A New Paradigm for Building Data-Parallel Programming Models. IEEE Micro 30, 4 (Jul. 2010), 55-64.
[33] L. Rodero-Merino et al. 2010. From infrastructure delivery to service management in clouds. Future Gener. Comput. Syst. 26, 8 (Oct. 2010), 1226-1240.
[34] J. Zhan et al. 2010. PhoenixCloud: Provisioning Runtime Environments for Heterogeneous Cloud Workloads. Technical Report. http://arxiv.org/abs/1003.0958v1.
[35] S. Govindan et al. 2009. Statistical profiling-based techniques for effective power provisioning in data centers. In Proc. of EuroSys '09. ACM, New York, NY, 317-330.
[36] P. Padala et al. Automated control of multiple virtualized resources. In Proc. of EuroSys '09.
[37] B. Urgaonkar et al. 2002. Resource overbooking and application profiling in shared hosting platforms. In Proc. of OSDI '02, Boston, MA.
[38] Z. Jia et al. BigDataBench: a Benchmark Suite for Big Data Applications in Data Centers. ICT Technical Report.
[39] J. Zhan et al. High Volume Throughput Computing: Identifying and Characterizing Throughput Oriented Workloads in Data Centers. In Proc. of LSPP 2012, in conjunction with IPDPS 2012.
[40] A. Verma et al. 2008. Power-aware dynamic placement of HPC applications. In Proc. of ICS '08.
[41] J. Chase et al. Managing energy and server resources in hosting centers. In Proc. of SOSP '01.
[42] C. D. Patel et al. Cost Model for Planning, Development and Operation of a Data Center. Hewlett-Packard Laboratories Technical Report, HPL-2005-107(R.1), June 9, 2005.


[43] J. Moreira et al. The Case for Full-Throttle Computing: An Alternative Datacenter Design Strategy. IEEE Micro, vol. 30, no. 4, pp. 25-28, July/Aug. 2010.
[44] J. Karidis et al. 2009. True value: assessing and optimizing the cost of computing at the data center level. In Proc. of CF '09. ACM, New York, NY, 185-192.
[45] W. Iqbal et al. 2010. SLA-Driven Dynamic Resource Management for Multi-tier Web Applications in a Cloud. In Proc. of CCGrid '10, pp. 832-837.
[46] P. Campegiani et al. 2009. A General Model for Virtual Machines Resources Allocation in Multi-tier Distributed Systems. In Proc. of ICAS '09.
[47] I. Goiri et al. 2010. Characterizing Cloud Federation for Enhancing Providers' Profit. In Proc. of Cloud '10, pp. 123-130.
[48] W. Vogels. Beyond Server Consolidation. ACM Queue, vol. 6, no. 1, 2008, pp. 20-26.
[49] X. Fan et al. 2007. Power provisioning for a warehouse-sized computer. In Proc. of ISCA '07.
[50] L. Rodero-Merino et al. 2010. From infrastructure delivery to service management in clouds. Future Gener. Comput. Syst. 26, 8 (Oct. 2010), 1226-1240.
[51] J. Choi et al. 2010. Power Consumption Prediction and Power-Aware Packing in Consolidated Environments. IEEE Transactions on Computers, vol. 59, no. 12, pp. 1640-1654, Dec. 2010.
[52] A. Benoit et al. Scheduling Concurrent Bag-of-Tasks Applications on Heterogeneous Platforms. IEEE Transactions on Computers, vol. 59, no. 2, pp. 202-217, Feb. 2010.
[53] P. Martí et al. 2009. Draco: Efficient Resource Management for Resource-Constrained Control Tasks. IEEE Trans. Comput. 58, 1 (January 2009), 90-105.
[54] H. Xi et al. Characterization of Real Workloads of Web Search Engines. In Proc. of IISWC 2011.
[55] G. S. Sadhasivam et al. Design and Implementation of a Two Level Scheduler for HADOOP Data Grids. Int. J. of Advanced Networking and Applications, Volume 01, Issue 05, Pages 295-300 (2010).
[56] http://www.sics.se/ aeg/report/node6.html
[57] C. A. Waldspurger et al. Lottery scheduling: Flexible proportional-share resource management. In Proc. of OSDI 1994.
[58] W. Jin et al. Interposed proportional sharing for a storage service utility. In Proc. of SIGMETRICS 2004.
[59] C. Luo et al. CloudRank: Benchmarking and Ranking Private Cloud Systems. Frontiers of Computer Science.

Jianfeng Zhan received the Ph.D. degree in computer engineering from the Chinese Academy of Sciences, Beijing, China, in 2002. He is currently an Associate Professor of computer science with the Institute of Computing Technology, Chinese Academy of Sciences. His current research interests include distributed and parallel systems. He was a recipient of the Second-class Chinese National Technology Promotion Prize in 2006, and the Distinguished Achievement Award of the Chinese Academy of Sciences in 2005.

Lei Wang received the master degree in computer engineering from the Chinese Academy of Sciences, Beijing, China, in 2006. He is currently a senior engineer with the Institute of Computing Technology, Chinese Academy of Sciences. His current research interests include resource management of cloud systems. He was a recipient of the Distinguished Achievement Award of the Chinese Academy of Sciences in 2005.

Xiaona Li received the master degree in software engineering from Beihang University, Beijing, China, in 2012. From 2010 to 2012, she was a visiting graduate student at the Institute of Computing Technology, Chinese Academy of Sciences, China. She is currently a software engineer in the Project Management Department at Baidu.

Weisong Shi received the Ph.D. degree in computer engineering from the Chinese Academy of Sciences, Beijing, China, in 2000. He is currently an Associate Professor of computer science with Wayne State University. His current research interests include computer systems and mobile computing. He is a recipient of the NSF CAREER award.

Chuliang Weng received the PhD degree in computer software and theory from Shanghai Jiao Tong University (SJTU), China, in 2004. He is currently an associate professor with the Department of Computer Science and Engineering at SJTU. His research interests include parallel and distributed systems, system virtualization, and system security.

Wenyao Zhang received the PhD degree in computer science from the Chinese Academy of Sciences, China, in 2003. He is currently an assistant professor with the School of Computer Science, Beijing Institute of Technology. He is also a member of the Beijing Lab of Intelligent Information Technology. His current research interests include scientific visualization, media computing, and parallel processing.

Xiutao Zang received the master degree in computer engineering from the Chinese Academy of Sciences, Beijing, China, in 2010. He is currently a software engineer with the Institute of Computing Technology, Chinese Academy of Sciences. His current research interests include testbeds for data center and cloud computing.
