
Jua: Customer Facing System to Improve User Experience in Public Clouds

Arjun Singhvi [email protected]

Shreya Kamath [email protected]

Shruthi Racha [email protected]

Abstract—Public cloud providers lease virtual machines (VMs) on a server to customers. This model makes it easier for cloud providers to manage their infrastructure. However, it leaves customers with the difficulty of manually selecting a VM instance type. The inherent performance heterogeneity observed in public clouds further adds uncertainty to the performance a VM will deliver.

We propose Jua, a customer facing system that solves the above problems. It consists of two main modules: the VM Instance Selector and the VM Placement Gamer. The former chooses the appropriate VM instance type using a machine learning methodology, and the latter implements a number of placement gaming strategies to ensure that customers get the best performance relative to the cost they incur. Initial evaluation of Jua confirms the viability of using a machine learning based approach for the selection of instance type. Finally, we develop a simulator to model and implement the various placement strategies and evaluate them using a wide variety of synthetic benchmarks. We achieve a maximum performance improvement of 70%.

1 INTRODUCTION

Public cloud providers use an easy to understand, coarse-grained billing model in which customers pay a flat fee per time quantum to use a virtual machine (VM) pre-configured with a bundle of virtualized resources. For example, Amazon's Elastic Compute Cloud (EC2) charges customers hourly and allows customers to choose from a wide variety of VM instance types that mainly differ in the virtualized resources that they offer [1]. Various other public cloud providers follow suit [2, 3]. Adopting such an approach for billing makes managing the cloud easier for the cloud providers by limiting the ways in which cloud resources can be used by customers. However, there are two main shortcomings of this billing approach from the point of view of the customers.

Firstly, the customers are presented with a wide range of VM instance types to choose from. Each VM instance type is characterized by a number of low-level technical specifications such as CPU, memory, storage and networking capacity. In an effort to provide more flexibility to customers, each VM instance type is normally offered in multiple sizes. As a result, customers are faced with the non-trivial task of choosing the appropriate VM instance type for their application based on these specifications. The selection process requires customers to map their high-level application specific service level attributes (HL-SLAs) to a set of low-level technical specifications and then choose a specific VM instance type. However, this mapping can be a complex task for customers, as it is non-intuitive for them to translate their application requirements into low-level technical specifications.

Secondly, since the billing model does not allow customers fine-grained control over the physical resources allocated for hosting their VM, not all instances of a given type offer the same performance. The main reason for this performance heterogeneity is that data centers contain multiple generations of hardware, and VMs of the same type are not guaranteed to be hosted on the same underlying hardware. Extensive prior work [4, 5, 6, 7] in this domain confirms that performance heterogeneity is a commonly observed phenomenon in public clouds today. In [8], the authors state and verify that the three main causes of the observed performance heterogeneity are inter-architecture heterogeneity, i.e., differences due to processor architecture or system configuration; intra-architecture heterogeneity, i.e., differences within a processor architecture or system configuration; and temporal heterogeneity, i.e., differences within a single machine over time.

As a part of this work, we propose methodologies to solve the above two problems.

Figure 1: Temporal network performance heterogeneity with microbenchmark iPerf3 (bandwidth in Mbps vs. trial number)

The former problem is tackled using a module, developed by us, that automatically translates application specific HL-SLAs to an appropriate VM instance type. The latter problem is tackled by introducing a number of customer-controlled strategies for selecting instances in order to mitigate the negative impact of the existing performance heterogeneity in public clouds. The proposed strategies are largely based on the strategies mentioned in [8].

Our main contributions can be summarized as follows:

• Verification of the existence of performance heterogeneity in public clouds

• Jua, a customer facing system that ensures

  – Automatic VM instance type selection based on HL-SLAs

  – Customers get the best performance relative to the cost they incur by using a number of VM placement strategies

The rest of the paper is organized as follows: Section 2 gives details about the experiments carried out on Amazon's EC2 to verify the existence of performance heterogeneity in public clouds. The details regarding the proposed solution are presented in Section 3. Section 4 gives details regarding the extensions incorporated by Jua in the context of the various VM placement strategies. We present the key insights from the evaluation of Jua in Section 5 and demonstrate a performance improvement of up to 70%. Section 6 gives details regarding prior work done in the same domain of research. Section 7 lists the future enhancements based on which Jua can be extended. Section 8 summarizes the work put forth in this paper.

Figure 2: Temporal CPU performance heterogeneity with microbenchmark NQueens (time taken in seconds vs. trial number)

Table 1. Microbenchmarks for EC2 performance evaluation

Benchmark     Measured resources
NQueens       CPU
iPerf3        Network bandwidth
ApacheBench   CPU and network bandwidth

2 EC2 PERFORMANCE EVALUATION

2.1 Methodology

The main purpose of this performance evaluation was to verify whether performance heterogeneity due to the factors mentioned in [8] still exists. We carried out our evaluation in the US-West region of Amazon's EC2. We restricted the scope of our evaluation to investigating whether or not performance variation exists in t2.micro instances, which are guaranteed to have a balance of compute, network and memory resources. We aim to extensively study the nature and range of performance variation across the different available instance types in the future.

In order to verify whether temporal heterogeneity still exists, we use two benchmarks: NQueens and iPerf3 [9]. The former benchmarks CPU performance whereas the latter benchmarks network performance. We used the Apache web server [10] to verify whether performance heterogeneity exists due to inter-architecture heterogeneity and intra-architecture heterogeneity. Table 1 details the benchmarks and the application used to evaluate the performance of the chosen instance type.

2.2 Temporal Performance Heterogeneity

In order to verify the existence of temporal heterogeneity, we ran iPerf3 [9] as well as NQueens for 24 hours on a t2.micro instance and recorded the relevant metrics reported by the benchmarks.


Table 2. Examples of processor heterogeneity in Amazon EC2

CPU             Count   Freq [GHz]   Cache [MB]   Mem [GiB]
Intel E5-2670   29      2.50         25           1
Intel E5-2676   21      2.40         30           1

Table 3. Performance heterogeneity observed on 3 t2.micro instances that use the same processor

Sl. No.   Requests/s   Time per request (ms)   Transfer rate (KBps)
1         1073.85      1.02                    12,356.64
2         1042.62      1.16                    11,997.29
3         725.54       1.77                    8,348.62

As seen from Figures 1 and 2, there is performance variation across time: the execution time of the NQueens benchmark varies across multiple runs, and the bandwidth reported by iPerf3 also varies. One probable cause of the observed heterogeneity is the presence of competing workloads on the physical machine on which the VM under test is hosted.

2.3 Performance Heterogeneity due to Processor Heterogeneity

As reported in [8], one of the major causes of performance heterogeneity is the presence of different processor architectures in the public cloud. Newer generation processor architectures tend to perform better than older generations because they incorporate the latest technological performance enhancements. To verify the presence of different processors in Amazon's EC2, we launched a number of instances and recorded the details of the underlying hardware on which each VM was hosted. Table 2 lists the different processors that we encountered while randomly launching fifty t2.micro instances.

Table 4. Performance heterogeneity observed on 2 t2.micro instances that use different processors

CPU             Requests/s   Time per request (ms)   Transfer rate (KBps)
Intel E5-2670   927.59       1.12                    10,673.69
Intel E5-2676   947.95       1.05                    12,792.84

Figure 3: Proposed design of Jua (customers submit HL-SLAs; each VM instance in the public cloud runs the application and a performance daemon, which reports to the Collaboration Information Bank)

In order to verify the presence of intra-architecture and inter-architecture performance heterogeneity, we benchmarked the performance of the Apache web server [10] using ApacheBench (ab) [11]. As seen in Tables 3 and 4, performance heterogeneity due to processor heterogeneity still exists. Probable reasons for the observed variation include different peripherals being associated with the various VM instances as well as the presence of different competing workloads.

3 DESIGN

We propose Jua, a customer facing system that resolves the above mentioned issues. As can be seen from Figure 3, the system consists of the following three modules:

• VM Instance Selector (VIS)
• VM Placement Gamer (VPG)
• Collaboration Information Bank (CIB)

3.1 Jua Workflow

We envision that a customer installs Jua on a management VM, via which Jua can appropriately choose and monitor the VMs allocated for the customer's application while ensuring that the customer gets the best deal, both with respect to the ease of deciding the instance type and with respect to performance relative to cost.

A customer interacting with Jua would specify as input to the system the HL-SLAs that capture the requirements of the application they would like to host in the public cloud. For example, a customer who wants to host a web server such as Apache can specify the HL-SLAs in terms of metrics such as requests per second, request concurrency level and so on. This group of HL-SLAs is used by the VIS module to predict the VM instance type that would best meet the customer's requirements. The output of the VIS (i.e., the VM instance type) is passed as input to the VPG.

Based on the output of the VIS module and the VM placement strategy selected by the customer, Jua launches a number of VMs in the public cloud. Afterwards, Jua ensures that a performance monitoring daemon, which monitors the application's performance in terms of the customer specified HL-SLAs, is launched in each VM. At the end of each quantum, these daemons communicate the measured performance statistics, along with the details of the underlying hardware, to the centralized CIB. The various strategies of the VPG module use the performance statistics to decide whether a particular VM should be retained for the next quantum or not. The following subsections give a more detailed description of the modules of Jua.
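To make the workflow concrete, the following sketch shows one possible shape of the per-quantum report a monitoring daemon might send to the CIB; the field names and the simple in-memory bank are illustrative assumptions, not Jua's actual interfaces.

```python
# Hypothetical sketch of a daemon's per-quantum report and a toy CIB store;
# field names and aggregation are illustrative, not Jua's actual format.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class QuantumReport:
    vm_id: str                       # identifier of the monitored VM
    instance_type: str               # e.g. "t2.micro"
    cpu_model: str                   # underlying hardware, e.g. "Intel E5-2670"
    measured_slas: Dict[str, float]  # HL-SLA name -> value measured this quantum

class CollaborationInformationBank:
    """Toy centralized store of reports, queryable per (instance type, CPU)."""
    def __init__(self) -> None:
        self.reports: List[QuantumReport] = []

    def submit(self, report: QuantumReport) -> None:
        self.reports.append(report)

    def mean_sla(self, instance_type: str, cpu_model: str, sla: str) -> float:
        # Average of a given HL-SLA over all reports for this hardware combination.
        vals = [r.measured_slas[sla] for r in self.reports
                if r.instance_type == instance_type
                and r.cpu_model == cpu_model and sla in r.measured_slas]
        return sum(vals) / len(vals) if vals else float("nan")
```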

    3.2 VM Instance Selector

The VIS is responsible for mapping the customer's application needs to the various low-level technical specifications and then choosing the appropriate VM instance type. This module takes as input the customer's application requirements in terms of a vector of HL-SLAs and outputs a specific VM instance type.

A machine learning based approach is used to solve the task of automatic instance type selection based on the customer's HL-SLAs. Internally, the VIS consists of one machine learning based classifier per application type, which performs the required mapping from customer application level requirements to a VM instance type. All applications that have the same type of HL-SLAs are assumed to be of the same application type. Based on the customer's HL-SLAs, the appropriate classifier is chosen to do the required mapping. Each of the classifiers is trained with data that contains the required mapping for the entire range of permissible values of the HL-SLA vector. This training data can be collected from the decisions taken by the customer or by launching a benchmarking phase to collect the required data. More details about the latter are given in the evaluation section.
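As an illustration of this design, the sketch below trains one classifier per application type on labelled HL-SLA vectors and uses it to predict an instance type; the training rows, feature ordering and labels are made-up placeholders rather than Jua's actual data.

```python
# Minimal sketch of the VIS mapping idea using scikit-learn; the feature names,
# training data, and instance-type labels here are placeholders, not Jua's.
from sklearn.ensemble import RandomForestClassifier

# One classifier per application type; here, a hypothetical "web_server" type
# whose HL-SLA vector is (requests/s, time per request in ms, transfer rate in KBps).
classifiers = {"web_server": RandomForestClassifier(n_estimators=100)}

# Training rows: HL-SLA vectors labelled with the instance type that met them.
X_train = [[1000, 1.0, 12000], [400, 3.0, 5000], [150, 8.0, 1500]]
y_train = ["t2.small", "t2.micro", "t2.nano"]
classifiers["web_server"].fit(X_train, y_train)

def select_instance_type(app_type: str, hl_sla_vector: list) -> str:
    """Pick the classifier for the application type and predict an instance type."""
    return classifiers[app_type].predict([hl_sla_vector])[0]

print(select_instance_type("web_server", [900, 1.2, 11000]))  # e.g. "t2.small"
```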

    3.3 VM Placement Gamer

This module is responsible for ensuring that customers get the best performance relative to the cost they incur, while mitigating the performance heterogeneity observed due to the various factors mentioned earlier. Internally, the VPG implements a number of VM placement gaming strategies with the objective of maximizing performance relative to cost. Currently, Jua supports and extends all the strategies proposed in [8]. This is achieved by giving Jua the capability of launching performance monitoring daemons on each of the customer's VMs. The VPG in turn decides the next step to be taken, based on the strategy chosen by the customer as well as the metrics it has aggregated over the last quantum from the various daemons running on the customer's VMs.

The VPG leverages the fact that cloud providers allow customers to launch instances programmatically using the provider's Application Programming Interface (API), as well as to return VMs at the end of each quantum. However, the task of the VPG is harder than it might seem, primarily because customers have only coarse-grained control over the physical hardware on which VM instances are launched, i.e., the customers cannot select the underlying physical hardware on which a VM should be hosted.
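As a rough example of the programmatic control the VPG relies on, the snippet below uses the EC2 API via boto3 to launch and return instances; the AMI ID and region are placeholders, and note that no call lets the customer pick the underlying physical host.

```python
# Sketch of the kind of provider-API calls the VPG relies on, using boto3 for
# EC2 as an example; the AMI ID and region are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

def launch_instance(instance_type: str, ami_id: str = "ami-0123456789abcdef0") -> str:
    """Launch one VM of the requested type and return its instance id."""
    resp = ec2.run_instances(ImageId=ami_id, InstanceType=instance_type,
                             MinCount=1, MaxCount=1)
    return resp["Instances"][0]["InstanceId"]

def retire_instance(instance_id: str) -> None:
    """Return the VM to the provider at the end of a quantum."""
    ec2.terminate_instances(InstanceIds=[instance_id])
```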

    3.4 Collaboration Information Bank

The collaboration information bank is a centralized repository of performance measurements that enables customers to simultaneously learn about the distribution of VMs in the public cloud and to ascertain the performance differences between the various instance types, taking the underlying hardware into consideration. Performance monitoring daemons running on the customer's VMs feed back to the CIB the details of the hardware on which their VM is running, along with the measured values of the HL-SLAs, at the end of each quantum.

The details regarding how the various strategies of the VPG leverage the CIB are described in the next section.


4 CORE VM PLACEMENT STRATEGIES

    4.1 Background

A placement strategy embodies a trade-off between exploration and exploitation. The basis of exploitation based strategies is to continue using a particular VM that gives the required performance. On the other hand, the basis of exploration strategies is to launch new VM instances in the hope of discovering a better performing VM. Prior work [8] in this domain has considered a restricted space of placement strategies called (A,B)-strategies. These run at least A servers in every quantum and launch an additional B "exploratory" instances for one quantum each at some point during the job execution. Two general mechanisms are useful in building (A, B)-strategies:

• Up-front exploration: aims to find high performing VMs early so that they can be used for a longer duration. An (A, B)-strategy that uses up-front exploration launches all B "exploratory" instances at the start of a job. At the end of the first quantum, the highest performing A instances are retained and the remaining B instances are shut down.

• Opportunistic replacement: by migrating an instance (shutting it down and replacing it with a new one to continue its work) we can seek out better performing instances or adapt to slowdowns in performance. An (A, B)-strategy that uses opportunistic replacement will retain any VM that is deemed a high performer for a given time quantum and migrate any VM that is a low performer.

In the "CPU" strategy the high performing instances are determined based on their processor type. In the "PERF" strategy, on the other hand, the high performing instances are determined based on the measured VM performance. The above strategies are extended to perform opportunistic replacement by launching additional VMs in each time quantum in order to find better performing VM instances. The extensions are abbreviated as "CPU_OPREP" (CPU with opportunistic replacement) and "PERF_OPREP" (PERF with opportunistic replacement).
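To make the mechanics concrete, the following sketch shows one quantum of an opportunistic-replacement strategy in the spirit of PERF_OPREP; the helper functions and the per-quantum exploration count are illustrative assumptions rather than the exact procedure of [8].

```python
# Illustrative sketch of one quantum of a PERF_OPREP-style strategy: launch a few
# exploratory VMs (within the overall budget of B extras), rank all candidates by
# measured performance, keep the best A, and terminate the rest.
# `launch_vm`, `terminate_vm` and `perf` (vm_id -> measured performance) are
# assumed helpers, not part of Jua's actual implementation.
def oprep_quantum_step(active_vms, perf, A, num_explore, launch_vm, terminate_vm):
    explorers = [launch_vm() for _ in range(num_explore)]
    candidates = active_vms + explorers
    ranked = sorted(candidates, key=lambda vm: perf.get(vm, 0.0), reverse=True)
    keep, drop = ranked[:A], ranked[A:]
    for vm in drop:
        terminate_vm(vm)          # migrate away from low performers
    return keep                   # VMs retained for the next quantum
```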

    4.2 Customer Collaboration

The strategies based on opportunistic replacement proposed in [8] have been extended to leverage the performance statistics present in the CIB. Specifically, when the collaboration based strategies are being executed by the VPG module, the module has the ability to kill an allocated VM immediately if the performance reported by the CIB does not seem promising, and to launch a new VM instead. However, this strategy can have a negative impact on the customer if it keeps preemptively killing VMs and fails to find a better VM, since the customer is charged even for the VMs that are killed preemptively by the VPG module. Thus, there is a need for a mechanism by which customers can control the aggressiveness of the collaboration based strategies so that they do not incur additional costs. The aggressiveness of the strategies can be controlled manually by customers by setting two parameters defined by us: the thrift index and the preemptive index.

The thrift index is defined as the minimum probability required of finding a VM that will be hosted on the underlying hardware for which the best performance has been reported by other customers via the CIB. A lower thrift index implies that the customer is ready to take a higher risk in anticipation of a better performing VM. On the other hand, a higher thrift index corresponds to a less aggressive strategy, as the preemptive killing of VMs kicks in only if there is a high probability of finding a better performing VM.

The preemptive index is defined as the maximum number of VMs that can be killed preemptively per quantum in the hope of finding a better performing VM. A higher preemptive index implies a more aggressive strategy whereas a lower value implies a relatively conservative strategy.

By incorporating these two parameters, Jua gives a customer the flexibility to control the aggressiveness of the collaboration based strategies. Currently, customers have to manually set the values of these parameters based on their economic constraints. We plan to develop a module in the future that would automatically tune these parameters based on a customer's monetary budget.
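As a rough illustration of how these two knobs could gate a collaboration based strategy, consider the sketch below; the probability estimate and the helper predicate are hypothetical stand-ins for information derived from the CIB.

```python
# Hedged sketch of how the two knobs might gate collaboration-based replacement:
# p_good is the probability (estimated from the CIB) of landing on the hardware
# with the best reported performance; names and structure are illustrative.
def collab_replacements(active_vms, is_best_hw, p_good, thrift_index, preemptive_index):
    """Return the subset of VMs to kill preemptively this quantum."""
    # Thrift index: only act if the chance of drawing best-reported hardware
    # is at least the customer's threshold.
    if p_good < thrift_index:
        return []  # fall back to the non-collaborative behaviour
    # Preemptive index: cap the number of preemptive kills per quantum.
    victims = [vm for vm in active_vms if not is_best_hw(vm)]
    return victims[:preemptive_index]
```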


4.3 Application-Level Metrics Monitoring Support

Jua allows customers to specify multiple application specific HL-SLAs that need to be satisfied by the VMs hosting the application. The strategies implemented as a part of the VPG take all of these metrics into account when making scheduling decisions. Jua supports this by launching measurement daemons that report statistics corresponding to the HL-SLAs on every active VM. The main advantage of using HL-SLAs is that customers can immediately relate to the decision taken by the strategy in action, as opposed to a scenario wherein the strategy bases its decisions on lower-level metrics such as CPU performance, bandwidth usage, etc.

5 EVALUATION

This section gives details regarding the evaluation of the two main modules of Jua: the VIS and the VPG.

5.1 VM Instance Selector

5.1.1 Methodology

In order to evaluate the effectiveness of the VIS, we took web servers as an example application type. The customer HL-SLAs considered for web servers are: time (s), requests per second, time per request (ms) and transfer rate (KBps). Based on these HL-SLAs, the VIS selects the classifier that chooses the appropriate VM instance type. The effectiveness of a machine learning based classifier depends on the data used to train it. The collection of training data involved launching three web servers, Apache [10], NGINX [12] and Lighttpd [13], and benchmarking their performance using ApacheBench, requesting a file of size 100 MB. ApacheBench reported the benchmarking results in terms of the selected HL-SLAs.

More specifically, the three web servers were launched on four different VM instance types: t2.nano, t2.micro, t2.small and t2.large. Using ApacheBench, we obtained a corpus of data with which the accuracy of the VIS could be measured.

    5.1.2 Accuracy

In order to select the best performing classifier, we evaluated the classification performance of the following machine learning models: Random Forest (RF), Multinomial Naive Bayes (MNB), Decision Tree (DT), Linear Support Vector Classifier (LSVC), and Support Vector Classifier (SVC).

Figure 4: Machine learning accuracy (percentage accuracy for the RF, MNB, DT, LSVC and SVC classifiers)

The training data consists of data obtained from benchmarking the Apache and NGINX web servers, whereas the data collected from Lighttpd was used as the testing data. From Figure 4, we can see that Random Forest gives the best accuracy of 80% in comparison to the other models considered.
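A sketch of this comparison is shown below; the tiny literal datasets stand in for the ApacheBench-derived HL-SLA measurements and are purely illustrative.

```python
# Sketch of the model comparison described above: train on rows from two web
# servers, test on rows from the third, and report per-model accuracy.
# The tiny literal datasets are placeholders for the ApacheBench-derived
# HL-SLA measurements (requests/s, time per request, transfer rate).
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.metrics import accuracy_score

# Placeholder data: Apache + NGINX rows for training, Lighttpd rows for testing.
X_train = [[1000, 1.0, 12000], [950, 1.1, 11500], [400, 3.0, 5000], [150, 8.0, 1500]]
y_train = ["t2.small", "t2.small", "t2.micro", "t2.nano"]
X_test = [[980, 1.05, 11800], [160, 7.5, 1600]]
y_test = ["t2.small", "t2.nano"]

models = {
    "RF": RandomForestClassifier(),
    "MNB": MultinomialNB(),
    "DT": DecisionTreeClassifier(),
    "LSVC": LinearSVC(),
    "SVC": SVC(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy {acc:.0%} on the held-out web server")
```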

    5.2 VM Placement Gamer

    5.2.1 Methodology

In order to evaluate the effectiveness of the various strategies used in the VPG, we developed a simulator to quickly evaluate the performance of these strategies under different cloud provider configurations. A cloud provider configuration indicates the distribution fraction of the different kinds of machines, along with their expected performance in terms of a mean rate and the associated standard deviation.

At the end of each quantum, when a VM needs to be chosen, the simulator chooses an instance randomly using weights corresponding to the distribution fractions of the various machines. The VM's per-quantum performance is drawn as an independent normal random variable with the mean and standard deviation of the machine in question. The input to the simulator consists of:

• S - strategy to run
• T - total simulator execution time
• q - time quantum (in seconds)
• A - number of machines required to run the job
• B - number of extra machines that can be used by the placement gamer
• p - migration penalty (in seconds)
• tIn - thrift index
• pIn - preemptive index

The simulator calculates the total work done by each strategy in the same way as proposed in [8]. The relative speedup of each strategy is calculated by comparing its performance to the null strategy, in which the required number of VMs is randomly chosen at the beginning of the first quantum and used until the end of the job.
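A minimal sketch of the simulator's sampling model described above might look as follows; the configuration values are arbitrary examples, not the configurations used in our experiments.

```python
# Sketch of the simulator's sampling step under the stated model: a VM's type is
# drawn using the configuration's distribution fractions, and its per-quantum
# performance is an independent normal draw for that type.
import random

# Hypothetical cloud provider configuration: fraction, mean rate, std dev per type.
CONFIG = {
    "good": {"fraction": 0.5, "mean": 10.0, "std": 0.5},
    "bad":  {"fraction": 0.5, "mean": 7.0,  "std": 0.35},
}

def launch_simulated_vm() -> str:
    """Pick a machine type weighted by its distribution fraction."""
    kinds = list(CONFIG)
    weights = [CONFIG[k]["fraction"] for k in kinds]
    return random.choices(kinds, weights=weights, k=1)[0]

def quantum_performance(kind: str) -> float:
    """Work done by this VM in one quantum (independent normal draw, floored at 0)."""
    return max(0.0, random.gauss(CONFIG[kind]["mean"], CONFIG[kind]["std"]))
```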

    5.2.2 Synthetic Simulations

We have used three different synthetic cloud provider configurations, corresponding to three features of the cloud environment that play an important role in determining the effectiveness of the various placement strategies. The first synthetic simulation evaluates the performance of the strategies when the difference in performance between machines is varied (architecture variation). The next synthetic simulation examines the impact of the performance variability observed by customers (instance variation) on the various strategies. Lastly, we examine the behavior of the various strategies when the distribution of the different performing machines in the cloud is varied (architecture mix).

To limit the scope of our parameter search, we fixed the value of T to 24, A to 30 and B to 20, to indicate a workload that ran for 24 hours with 10 VMs computing at all times and the flexibility of launching an additional 20 "exploratory" VMs during the job's run time. For simplicity, we have assumed the public cloud to consist of only two types of machines, "good" machines and "bad" machines, with an even likelihood of each unless mentioned otherwise.

As stated earlier, the effectiveness of the collaboration strategies depends on the values of the thrift index as well as the preemptive index. We performed a parameter sweep of these two parameters, varying the former from 0.1 to 1.0 and the latter from 1 to 10, and recording the performance of the strategies in each case. The preexisting strategies are compared with the collaboration strategies with the two parameters set to maximize performance.
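The parameter sweep itself can be expressed as a simple grid search; run_simulation below is a hypothetical wrapper around the simulator that returns the speedup over the null strategy, stubbed out here so the loop structure stands alone.

```python
# Sketch of the parameter sweep over the two knobs; `run_simulation` stands in
# for the actual simulator run and here just returns a dummy speedup so the
# loop structure is runnable.
def run_simulation(strategy: str, thrift_index: float, preemptive_index: int) -> float:
    # Placeholder: the real simulator would execute `strategy` for T quanta
    # and return its speedup (%) over the null strategy.
    return 0.0

best = (None, None, float("-inf"))
for tIn in [round(0.1 * i, 1) for i in range(1, 11)]:   # thrift index 0.1 .. 1.0
    for pIn in range(1, 11):                            # preemptive index 1 .. 10
        speedup = run_simulation("PERF_COLLAB", tIn, pIn)
        if speedup > best[2]:
            best = (tIn, pIn, speedup)
print("best tIn=%s, pIn=%s, speedup=%.1f%%" % best)
```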

Figure 5: Observed performance improvement (%) for the simulation of architecture variation, as the mean performance of the bad machine varies, for the CPU, PERF, CPU_OPREP, PERF_OPREP, CPU_COLLAB and PERF_COLLAB strategies

Table 5. Optimal thrift index (tIn) and preemptive index (pIn) for the simulation of varying architecture

Mean performance of bad machine   Thrift index   pIn PERF   pIn CPU
5.0                               0.25           9          10
6.0                               0.25           3          10
7.0                               0.25           8          10
8.0                               0.25           2          7
9.0                               0.25           8          6

Architecture variation. In this simulation, we hold the performance variability (standard deviation) steady at 5% of the average performance (a number chosen to match the variation reported in [8]) and spread machines evenly between the two distributions. The performance gap between "good" machines and "bad" machines is varied from 10% to 50%.

As seen in Figure 5, the collaboration based strategies (CPU_COLLAB and PERF_COLLAB) consistently outperform the strategies without collaboration. As the gap between the "good" and "bad" machines increases, the performance benefit achieved from collaboration also increases. This behavior is consistent with our expectation: the collaboration based strategies strive to get hold of the best performing VM, and when the gap is largest there is more benefit in migrating from a relatively "bad" machine to a "good" one. The values of the thrift index and preemptive index corresponding to the above measurements are presented in Table 5.

It must be noted that the collaboration strategies give little or no benefit when there is no performance gap between the different machines.


Figure 6: Observed performance improvement (%) for the simulation of instance variation, as the percentage variation of the standard deviation varies, for the same six strategies

This is due to the fact that in such a scenario the benefit of running on a better VM is outweighed by the migration cost.

Instance variation. We then proceeded to evaluate the collaboration based strategies on instances with temporal heterogeneity. For this, we enforced variability within a machine's performance by holding the average performance difference between "good" and "bad" instances steady at 30% (again, a number reported in [8]) and instead varying the standard deviation of performance within that instance type. We vary the standard deviation from 2.5% up to 10%.

As seen in Figure 6, the collaboration based strategies continue to outperform the other strategies as the performance variability is varied. The minimum percentage improvement across all values of performance variability is 26% (when the variability is 2.5%). Even such a modest improvement can ensure that customers get better value for the cost they incur. However, the degree of variability does not have a significant impact on the performance of the collaboration based strategies. This is because the performance improvement achieved by these strategies largely depends on the benefit of migrating to a "good" machine and on the availability of "good" machines. For the above performance measurements, the thrift index was set to 0.25 and the preemptive index was set to 2, as this combination yielded the maximum improvement.

Architecture Mix. Lastly, we look at how the distribution of machines affects each strategy, in order to determine how each strategy fares when the cards are stacked in its favor or against it.

Figure 7: Observed performance improvement (%) for the simulation of architecture mix, as the fraction of good machines varies, for the same six strategies

Table 6. Optimal thrift index (tIn) and preemptive index (pIn) for the simulation of architecture mix

Fraction of good machines   Thrift index   pIn PERF   pIn CPU
0.1                         0.05           2          1
0.2                         0.1            4          4
0.3                         0.15           4          10
0.4                         0.2            4          2
0.5                         0.25           6          10
0.6                         0.25           5          5
0.7                         0.35           10         4
0.8                         0.4            8          6
0.9                         0.45           3          6
1.0                         0.5            3          7

The performance gap between "good" and "bad" machines is set to 30% and the performance variability is fixed at 5%.

As seen from Figure 7, we continue to observe the same trend of the collaboration based strategies outperforming the other ones. When the fraction of "good" machines is 1.0, which implies that all the machines in the public cloud offer the same performance, none of the strategies gives a performance improvement, as the customers incur unnecessary migration costs while the performance they get remains the same. The values of the thrift index and preemptive index corresponding to the above performance measurements are presented in Table 6.

Another trend observed is that when the fraction of good machines is low, PERF_COLLAB outperforms CPU_COLLAB. However, when the fraction of good machines is high (70% onwards), CPU_COLLAB performs better, as this strategy makes decisions to preemptively kill VMs based on the CPU type and not on the performance observed in the previous quantum.


  • 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0Thrift Index

    10

    15

    20

    25

    30

    35

    Num

    ber o

    f Migr

    ation

    s

    Figure 8: Variation of thrift index(tIn) for a fixeddistribution(50%) of good machines

When the fraction of "good" machines is high, preemptive killing of VMs based on CPU type has greater benefits than PERF_COLLAB, which kills preemptively only if it observes a drop in performance. The reason is that the latter may stick with a comparatively lower performing machine if there is no significant drop in its observed performance, whereas CPU_COLLAB always strives for the best of the available machines.

From the above evaluation of Jua across the simulated cloud distribution scenarios, it is evident that the proposed collaboration based strategies outperform the strategies proposed in [8]. A maximum performance improvement of 70% is achieved.

    5.2.3 Customer Collaboration

As stated before, the effectiveness of the collaboration based strategies depends on the values of the thrift index and the preemptive index set by the customer. The main idea behind these strategies is to obtain a VM whose underlying hardware has given the best performance to previous customers. In order to gauge the overhead involved in these strategies, we carried out a simple experiment to count the number of migrations performed when the collaboration strategy kicks in, compared to when it is not in operation.

For this experiment, the probability of finding a "good" machine was set to be equal to the probability of finding a "bad" machine, i.e., the fraction of each type in the public cloud was 0.5. The preemptive index was set to 20 and the thrift index was varied from 0.1 to 1.0. In such a setting, preemptive killing of VMs is observed only when the probability of finding a "good" machine is greater than or equal to the customer specified thrift index. No preemptive killing of VMs is observed when the thrift index is higher than the probability of finding a "good" machine; in that case the strategy simply falls back to the default strategy without leveraging the CIB.

In terms of the overhead of the collaboration based strategies, it can be seen from Figure 8 that the number of migrations is far lower when the collaboration based strategies are operational (when the thrift index lies between 0.1 and 0.5) than when the default strategy is operational (when the thrift index is greater than 0.5). This simple experiment shows that the collaboration based strategies not only perform better than the other strategies but also provide this improvement at a far lower cost, due to the reduced number of migrations involved. The main reason for the reduced number of migrations is that the collaboration based strategies preemptively keep migrating until the best of the available VMs is assigned, after which they do not carry out any further migrations.

    6 RELATED WORK

Public cloud providers present their clients with a user-friendly billing model and a pre-defined set of minimal primitives in order to give them autonomous control over remote virtual machines (VMs). Due to inter-architecture heterogeneity, intra-architecture heterogeneity and temporal heterogeneity, VM instances of the same configuration in a public cloud environment do not guarantee identical performance. Extensive prior work [14, 15, 16, 17, 18] in this domain has addressed this as a resource scheduling problem among virtual machines in the cloud. Recent works [8, 19, 20] have fueled research interest in guaranteeing homogeneous performance across identical VM instances by incorporating scheduling mechanisms on the client side.

Paragon [15] uses collaborative filtering techniques to quickly and accurately classify an unknown incoming application by identifying similarity to previously scheduled applications.


Quasar [16], on the other hand, does not rely solely on user specified resource reservations, as that can lead to under-utilization due to the fact that users are oblivious to the actual workload requirements. Instead, Quasar uses classification techniques tailored to application categories to perform resource allocation and assignment. The VIS in Jua draws inspiration from the above two approaches and leverages application specific machine learning classifiers to map user defined service level agreements (SLAs) to a VM instance type in the public cloud.

The VM placement gaming policies proposed in [8] use exploratory and exploitative placement strategies. These strategies choose the next candidate VM based on either the CPU performance or the system performance. The aim is to provide the best user experience given a customer's SLA. However, the work in [8] supported only a single SLA and had no support for customer collaboration. Jua, on the other hand, supports multiple application centric SLAs. In addition, it provides customers the flexibility of collaborating among themselves in order to obtain the best performance relative to the cost they incur.

ClouDiA [19] is a client side deployment advisor that selects application node deployments which minimize either the largest latency between application nodes or the longest critical path among all application nodes. Its deployment search selects and allocates a VM instance based on the nature of the path between the VM and the application node. Jua, on the other hand, selects a VM instance based on the customer satisfaction quotient, i.e., the number of customer provided HL-SLAs which are fulfilled by an instance.

7 FUTURE WORK

Though Jua already ensures that customers enjoy considerable performance improvements, there are a number of aspects of Jua that can be extended to deliver far more benefit to the customer. Jua can be enhanced through the following extensions:

• Automatic selection of the thrift index and the preemptive index - Currently, customers are required to manually set these two parameters. However, as seen in the previous section, the performance benefits achieved heavily depend on the values of these two parameters. We plan to add a module that would set the optimal values of these parameters based on the customer's cost budget.

• VIS correction strategy - Currently, Jua assumes that the instance type chosen by the VIS is always right. However, it may happen that the instance type chosen by the VIS is wrong. We plan to introduce a new strategy in the VPG that would validate the correctness of the VIS selection and, if necessary, rectify its mistake.

8 CONCLUSION

At present, customers get a raw deal when it comes to using VMs in public clouds. Firstly, customers need to perform the non-trivial task of choosing the VM instance type manually. Secondly, customers experience performance heterogeneity because they cannot control the underlying physical hardware on which their VMs are hosted. We propose Jua, a customer facing system to solve the above problems. Initial evaluation of Jua confirms the viability of using a machine learning approach to select a VM instance type. Also, the evaluation of Jua using a simulator shows a maximum performance improvement of 70% for the proposed collaboration based strategies. This demonstrates that the proposed extensions to the placement strategies are indeed beneficial.

ACKNOWLEDGEMENT

We would like to thank Prof. Michael Swift for his guidance in helping us complete this project.

References

[1] Amazon Web Services. Amazon EC2 instance types. http://aws.amazon.com/ec2/instance-types/
[2] Microsoft Corp. Windows Azure: Pricing details. http://www.windowsazure.com/en-us/pricing/details/
[3] Rackspace Inc. How we price cloud servers. http://www.rackspace.com/cloud/cloud_hosting_products/servers/pricing/
[4] Y. El-Khamra, H. Kim, S. Jha, and M. Parashar. "Exploring the performance fluctuations of HPC workloads on clouds." In Cloud Computing Technology and Science (CloudCom), pp. 383-387. IEEE, 2010.
[5] Dave Mangot. EC2 variability: The numbers revealed. http://tech.mangot.com/roller/dave/entry/ec2_variability_the_numbers_revealed, May 2009.
[6] Z. Ou, H. Zhuang, J. K. Nurminen, A. Ylä-Jääski, and P. Hui. "Exploiting hardware heterogeneity within the same instance type of Amazon EC2." In 4th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud), 2012.
[7] J. Schad, J. Dittrich, and J.-A. Quiané-Ruiz. "Runtime measurements in the cloud: observing, analyzing, and reducing variance." Proceedings of the VLDB Endowment, 3(1-2), pp. 460-471. VLDB, 2010.
[8] B. Farley, A. Juels, V. Varadarajan, T. Ristenpart, K. D. Bowers, and M. M. Swift. "More for your money: exploiting performance heterogeneity in public clouds." In Proc. of the 3rd ACM Symp. on Cloud Computing, 2012.
[9] A. Tirumala, F. Qin, J. Dugan, J. Ferguson, and K. Gibbs. Iperf: The TCP/UDP bandwidth measurement tool, version 2.0.5. http://sourceforge.net/projects/iperf/, 2010.
[10] The Apache HTTP Server. https://httpd.apache.org/
[11] The Apache HTTP Server benchmarking (ab) tool. https://httpd.apache.org/docs/2.4/programs/ab.html
[12] The NGINX web server. https://www.nginx.com/
[13] The Lighttpd web server. https://www.lighttpd.net/
[14] D. Lo, L. Cheng, R. Govindaraju, P. Ranganathan, and C. Kozyrakis. "Heracles: improving resource efficiency at scale." In Proc. of the 42nd Annual International Symposium on Computer Architecture, June 13-17, 2015, Portland, Oregon.
[15] C. Delimitrou and C. Kozyrakis. "Paragon: QoS-Aware Scheduling for Heterogeneous Datacenters." In Proc. of the 18th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 2013.
[16] C. Delimitrou and C. Kozyrakis. "Quasar: Resource-Efficient and QoS-Aware Cluster Management." In Proc. of the 19th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 2014.
[17] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. "Quincy: fair scheduling for distributed computing clusters." In Proc. of the 22nd Symp. on Operating System Principles, 2009.
[18] M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes. "Omega: flexible, scalable schedulers for large compute clusters." In Proc. of the EuroSys Conf., 2013.
[19] T. Zou, R. Le Bras, M. Vaz Salles, A. Demers, and J. Gehrke. "ClouDiA: A Deployment Advisor for Public Clouds." In Proc. of the VLDB Endowment, Vol. 6, No. 2, 2012.
[20] D. Novakovic, N. Vasic, S. Novakovic, D. Kostic, and R. Bianchini. "DeepDive: Transparently Identifying and Managing Performance in Virtualized Environments." In Proc. of the USENIX Annual Technical Conference, 2013.
