Reading Report Cost of VM Live Migration
By William Voorsluys, James Broberg, Srikumar Venugopal, and Rajkumar Buyya
CLOUD 2009
The paper's goal
Migration overhead is acceptable but cannot be disregarded, especially when the SLA is strict.
The paper gives a performance evaluation of live migration.
Related Works(1)
Multicore [8,9], paravirtualization [1], hardware-assisted virtualization [10], live migration [3].
Individual measurements of the VM runtime overhead imposed by hypervisors on a variety of workloads [1, 11, 12].
The impact of consolidating several applications on a single server running Xen [13].
Related Works(2)
Performance degradation when migrating CPU- and memory-intensive workloads, as well as when migrating multiple VMs at the same time in a stop-and-copy way [15].
Quantifying the effects of live migration on a set of four applications common to hosting environments, primarily focusing on downtime and total migration time and demonstrating the viability of live migration [3].
These studies did not evaluate the effect of migration on the performance of modern Internet workloads, such as multi-tier and social-network-oriented applications.
Related Works
Evaluating the efficacy of migrating VMs across long distances, such as over the Internet [16].
The vConsolidate benchmark [14]: a Web server, a database server, a Java server, a mail server, and an idle server.
The Cloudstone benchmark [17] aims at computing the monetary cost, in dollars/user/month, of hosting Web 2.0 applications on cloud computing platforms.
Background: advantage of live migration
Live (or hot) migration: the hypervisor migrates an OS while it continues to run.
Stop-and-copy (or cold) migration: halting the VM, copying all its memory pages to the destination, then restarting the new VM.
The advantage of live migration: the possibility of migrating an OS with near-zero downtime.
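The downtime gap between the two approaches can be illustrated with a back-of-the-envelope model. The sketch below assumes an iterative pre-copy scheme for live migration: dirty pages are re-sent in rounds while the VM keeps running, and only the final remainder is copied while it is paused. The slides do not describe the mechanism, so this is an assumption, and the function names, stop threshold, and round limit are hypothetical.

```python
def stop_and_copy_downtime(memory_mb, bandwidth_mb_s):
    """Cold migration: the VM is halted for the whole memory transfer."""
    return memory_mb / bandwidth_mb_s


def precopy_migration(memory_mb, bandwidth_mb_s, dirty_rate_mb_s,
                      stop_threshold_mb=50.0, max_rounds=30):
    """Simplified iterative pre-copy model (an assumption, not the paper's method).

    Returns (total_migration_time_s, downtime_s).
    """
    to_send = memory_mb
    live_time = 0.0
    for _ in range(max_rounds):
        if to_send <= stop_threshold_mb:
            break  # remainder is small enough to copy while the VM is paused
        round_time = to_send / bandwidth_mb_s
        live_time += round_time
        # Pages dirtied while this round was in flight must be sent again.
        to_send = min(memory_mb, dirty_rate_mb_s * round_time)
    downtime = to_send / bandwidth_mb_s
    return live_time + downtime, downtime


if __name__ == "__main__":
    # Illustrative values only, not measurements from the paper.
    total, down = precopy_migration(2048, 100, 20)
    print("pre-copy total/downtime (s):", total, down)
    print("stop-and-copy downtime (s):", stop_and_copy_downtime(2048, 100))
```

In this model the downtime shrinks only while the dirtying rate stays below the link bandwidth. A workload that dirties memory quickly leaves a larger remainder for the final paused copy.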
Background: characteristics of modern Internet applications
Highly dynamic and interactive features have driven the explosive growth of Web 2.0 applications.
Social networking features make each user's actions affect many other users, which makes static load partitioning unsuitable as a scaling strategy.
By means of blogs, photostreams, and tagging, users now publish content to one another rather than just consuming static content.
Testbed specifications
A cluster composed of 6 servers: 1 head-node and 5 VM hosts.
Each is equipped with an Intel Xeon E5410 (2.33 GHz quad-core with 2x6MB L2 cache and Intel VT technology), 4GB memory, and a 7200rpm hard drive, and is connected through a Gigabit Ethernet switch.
Head-node: Ubuntu Server 7.10 with no hypervisor.
Other nodes: Citrix XenServer Enterprise Edition 5.0.0.
All VMs run 64-bit Ubuntu Linux 8.04 Server Edition, paravirtualized kernel version 2.6.24-23.
The installed web server is Apache 2.2.8 running in prefork mode. PHP version is 5.2.4-2. MySQL, with the InnoDB engine, is version 5.1.32.
Workload
Olio [18] as a Web 2.0 application, combined with the Faban load generator [19].
Olio's PHP implementation, employing the popular LAMP stack (Linux, Apache, MySQL, PHP).
Olio/Faban was originally proposed as part of the Cloudstone benchmark [17].
The main metric: the Service Level Agreement (SLA) defined in Cloudstone.
Cloudstone's SLA
Table 1. The 90th/99th percentile of response times measured in any 5-minute window during steady state should not exceed the following values (in seconds)
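The percentile rule in the Table 1 caption can be sketched as a small check over a response-time log. This is a minimal illustration, not Cloudstone's or Faban's actual implementation: it assumes a nearest-rank percentile, splits the steady state into consecutive non-overlapping 5-minute windows (a simplification of "any 5-minute window"), and takes the per-operation limit as a parameter because the table values are not reproduced here. The function names are hypothetical.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile (integer p in 1..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def sla_violations(samples, limit_s, p=90, window_s=300):
    """Find windows whose p-th percentile response time exceeds limit_s.

    samples: iterable of (timestamp_s, response_time_s) taken in steady state.
    Returns a list of (window_index, percentile_value) for violating windows.
    """
    windows = {}
    for ts, rt in samples:
        windows.setdefault(int(ts // window_s), []).append(rt)
    violations = []
    for idx in sorted(windows):
        value = percentile(windows[idx], p)
        if value > limit_s:
            violations.append((idx, value))
    return violations
```

A run would count as SLA-compliant for a given operation when `sla_violations` returns an empty list for the chosen percentile and limit.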
Benchmarking architecture(1)
Benchmarking architecture(2)
MySQL tends to be CPU-bound when serving the Olio database, whereas Apache/PHP tends to be memory-bound [17].
All nodes share an NFS (Network File System) mounted storage device, which resides in the head-node and stores VM images and virtual disks.
A local virtual disk is hosted in the server that hosts MySQL.
The load is driven from the head-node, where the multi-threaded workload drivers run, along with Faban's master component.
Experimental objective
To quantify the slowdown and downtime experienced by the application when VM migrations are performed in the middle of a run.
A series of runs did not consist of migrating a VM back and forth between the same two machines.
Preliminary experiments
Purpose: to define exact VM sizes, without performing any migration.
Drive load against Olio and gradually increase the number of concurrent users.
By analyzing the SLA, they found that 600 is the maximum number of concurrent users.
From memory and CPU usage, they found that the minimum VM sizes for serving 600 users are:
VM hosting Apache/PHP: 1 vCPU, 2GB memory
VM hosting MySQL: 2 vCPU, 1GB memory
Hosting MySQL on NFS can support only 400 users, so MySQL uses a local virtual disk and the experiments do not include database server migration.
Migration Experiment
First set of experiments with Olio: 10-minute and 20-minute benchmark runs with 600 concurrent users.
Goal: to evaluate how the SLA is violated when the system is nearly oversubscribed but not overloaded, and to quantify the downtime when live migrations happen.
Then the benchmark is run with smaller numbers of concurrent users, namely 100, 200, ..., 500, searching for a “safe” level (lower risk of SLA violation).
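The search for a "safe" level can be pictured as a simple scan over load levels. The sketch below is only an outline of that procedure under stated assumptions: `meets_sla` is a hypothetical callback that runs one benchmark with a migration at the given number of concurrent users and reports whether the SLA held. It does not reflect the actual Faban driver interface.

```python
def highest_safe_level(meets_sla, levels):
    """Scan load levels in increasing order and return the last one that
    met the SLA before the first failure, or None if even the lowest fails.

    meets_sla: hypothetical callable, users -> bool.
    levels: iterable of concurrent-user counts to try.
    """
    safe = None
    for users in sorted(levels):
        if not meets_sla(users):
            break  # stop at the first level that violates the SLA
        safe = users
    return safe
```

Stopping at the first failure treats SLA compliance as monotonic in load, which is itself an assumption. A more careful study would repeat each level several times, as the slides do when averaging runs.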
Result and Discussion(1)
Results show that the overhead due to live migration is acceptable but cannot be disregarded, especially in SLA-oriented environments requiring more demanding service levels.
Result and Discussion(2)
Fig. 2. Effects of a live migration on Olio's homepage loading activity
Result and Discussion(3)
Figure 2 shows the effect of a single migration performed after five minutes in steady state of one run.
A downtime of 3 seconds is experienced near the end of a 44-second migration.
The highest peak observed in response times takes place immediately after the VM resumes in the destination node; 5 seconds elapse until the system can fully serve all requests that had been initiated during the downtime.
In spite of that, no requests were dropped or timed out due to application downtime.
The downtime experienced by Olio when serving 600 concurrent users is well above the expected millisecond level previously reported in the literature for a range of workloads [3].
This interesting result suggests that the workload complexity imposes an unusual memory
Result and Discussion(4)
Fig. 3. 90th and 99th percentile SLA computed for the homepage loading response time with 600 concurrent users. The maximum allowed response time is 1 second
Result and Discussion(5)
Figure 3 presents the effect of multiple migrations on the homepage loading response times.
These results correspond to the average of 5 runs.
It is paramount that this information be employed by SLA-oriented VM-allocation mechanisms, with the objective of reducing the risk of SLA non-compliance in situations where VM migrations are inevitable.
Result and Discussion(6)
Result and Discussion(7)
Table 2 presents more detailed results, listing maximum response times for all user actions as computed by the 99th percentile SLA formula when one migration was performed in the middle of a 10-minute run.
These results indicate that a workload of 500 users is the load level at which a live migration of the Web server should be carried out (e.g., to a less loaded server) in order to decrease the risk of SLA violation.
Reference(1)
Reference(2)
Reference(3)