
Performance overheads of Virtualization

Sandeep Joshi, Principal SDE, Storage Startup

26 April 2014

www.cmgindia.org


Contents

1. Hypervisor classification and overview
2. Overheads introduced
3. Benchmarking
4. Analytic models
5. Conclusion


Not covered in this talk

• Mobile virtualization - the Motorola Evoke QA4 was the first phone to run two operating systems; ARM added a new Hyp mode in the Cortex-A15 processor.
• Nested virtualization - running one hypervisor on top of another.
• Network virtualization - SDN, OpenFlow.
• Containers (aka OS-level virtualization) - Solaris Zones, LXC, OpenVZ.
• Older hypervisors which did binary translation.


Classification

• Image : blog.technet.com/b/chenley


VMWare ESX

• Image : blog.vmware.com


VMWare ESX

• Each virtual machine has multiple worlds (threads), some of which correspond to guest CPUs while others are dedicated to device processing (run "esxtop" on the host).

• Monolithic kernel. Hardware support is limited to the drivers installed in the hypervisor.


KVM

Used in Google Cloud, Eucalyptus, and most OpenStack clouds.

• Image : Redhat Summit, June 2013


KVM

Linux is the hypervisor. Leverages Linux features (device drivers, NAPI, CPU and IO schedulers, cgroups, madvise, NUMA, etc.)

• Each guest OS sits inside a Linux process running QEMU; each virtual CPU is a thread inside this process (see the sketch below).

• Uses QEMU for device virtualization. QEMU in one guest is not aware of QEMU running in another guest.
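The point that vCPUs are ordinary threads of the QEMU process can be seen directly from /proc. Below is a minimal, hypothetical sketch: you pass it the PID of a running qemu-system process (e.g. from `pgrep qemu`), and it lists that process's threads; the exact thread names vary by QEMU version.

```python
# Hypothetical sketch: enumerate the threads of a QEMU/KVM guest process to
# see its vCPU and I/O worker threads (thread naming varies by QEMU version).
import os
import sys

def list_qemu_threads(pid: int) -> None:
    task_dir = f"/proc/{pid}/task"
    for tid in sorted(os.listdir(task_dir), key=int):
        with open(f"{task_dir}/{tid}/comm") as f:
            name = f.read().strip()          # thread name, e.g. a vCPU or IO thread
        print(f"tid={tid:<8} name={name}")

if __name__ == "__main__":
    # Pass the PID of a qemu-system process, e.g. obtained via `pgrep qemu`.
    list_qemu_threads(int(sys.argv[1]))
```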


Microsoft HyperV

Used in Microsoft Azure cloud


Xen

When you use Amazon or Rackspace, you are using Xen.


Contents

1. Hypervisor classification and overview
2. Overheads introduced
3. Benchmarking
4. Analytic models
5. Conclusion


Overheads introduced

1. CPU : nested scheduling triggers the lock preemption problem (use gang scheduling); VM exits are costly.
2. Memory : nested page tables, NUMA topology.
3. Disk : nested filesystems, page cache, IO schedulers, interrupt delivery, DMA.
4. Network : DMA, interrupt delivery.

• The next few slides cover hardware assists, nested filesystems, nested IO schedulers and the benefits of IO paravirtualization.


Hardware assists

Hardware assists have considerably eased many virtualization overheads:
1. CPU : binary translation was replaced by extra CPU modes (root and guest mode), each with 4 rings.
2. MMU : the software shadow page table was replaced by EPT/nested page tables.
3. IOMMU : during DMA, it translates guest physical addresses to machine physical addresses.
4. IO-APIC : interrupt delivery is done directly to the guest using the IDT.
5. SR-IOV : virtual functions implemented in the NIC (SR-IOV is also defined for storage adapters but not yet implemented).

Benefits: hardware assistance reduces CPU cache contention as well as the "Service Demand" on the VM (Service Demand = CPU Utilization / Throughput). Higher throughput is obtained for less CPU utilization.


Hardware assists

Images: intel.com (IOMMU), virtualizationdeepdive.wordpress.com (APIC)


How much does hardware assist help?

Ganesan et al (2013) ran microbenchmarks on a 2-core Intel Xeon, comparing native vs Xen, with and without hardware assistance (SR-IOV, IOMMU, Intel VT).

Finding: network throughput is near-native with SR-IOV, but CPU utilization still remains high (possibly because interrupt processing still triggers numerous guest VM-hypervisor transitions?). The table below shows max throughput with iPerf; a service-demand comparison follows it.

            Mbps   Dom0 CPU   VM CPU
Native       940      NA       16.68
SR-IOV       940      20       65 (high)
No SR-IOV    192      82       39
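A quick way to see the "service demand" point from the previous slide is to divide CPU cost by throughput. The sketch below does this with the approximate numbers from the table, under the assumption that the VM CPU column is percentage utilization.

```python
# Rough service-demand comparison using the approximate iPerf numbers above.
# Service demand = CPU utilization / throughput, i.e. CPU cost per unit of work.

results = {
    # name: (throughput_mbps, vm_cpu_percent)  -- values read off the chart
    "Native":    (940, 16.68),
    "SR-IOV":    (940, 65.0),
    "No SR-IOV": (192, 39.0),
}

for name, (mbps, cpu) in results.items():
    demand = cpu / mbps  # CPU-% consumed per Mbps of throughput
    print(f"{name:10s} throughput={mbps:4d} Mbps  VM CPU={cpu:5.2f}%  "
          f"service demand={demand:.3f} %CPU per Mbps")
```

Run this and SR-IOV shows roughly four times the per-Mbps CPU cost of native even though the throughput matches, which is exactly the "throughput is near-native but CPU stays high" finding.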


How much does hardware assist help?

Further results from Ganesan et al (2013). Disk throughput was tested using RUBiS (disk+network intensive) and BLAST (disk intensive). Finding: disk IO is not yet benefiting from hardware assists. Most of the RUBiS improvement comes from SR-IOV rather than IOMMU; similar finding with BLAST.


Nested IO scheduling

VM and hypervisor are both running IO scheduling algorithms (and so is the disk drive). IO requests are rearranged and merged by the IO scheduler (scheduler can be set in Xen Dom0 or KVM host but not in ESX).

On Linux, there are 4 schedulers - CFQ, NOOP, Anticipatory, Deadline. Each block device can have a different scheduler.

Image: dannykim.me
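Since the IO scheduler can be chosen per block device in both the guest and the KVM host, it is easy to inspect or change it through sysfs. A minimal sketch follows; the device name is an example and the available scheduler names depend on the kernel. Root is needed to write the setting.

```python
# Minimal sketch: inspect and change the IO scheduler of one block device on a
# Linux guest or KVM host via sysfs. Device name is an example ("sda").
DEV = "sda"
SCHED_FILE = f"/sys/block/{DEV}/queue/scheduler"

def current_scheduler() -> str:
    with open(SCHED_FILE) as f:
        line = f.read().strip()          # e.g. "noop anticipatory deadline [cfq]"
    return line.split("[")[1].rstrip("]")

def set_scheduler(name: str) -> None:
    with open(SCHED_FILE, "w") as f:     # must be one of the names listed above
        f.write(name)

if __name__ == "__main__":
    print("active scheduler:", current_scheduler())
    # set_scheduler("noop")              # e.g. NOOP in the guest, CFQ on the host
```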


Nested IO scheduling

Results of Boutcher and Chandra:

• The best combination of schedulers depends on the workload.
• There is a tradeoff between fairness (among VMs) and throughput.
• The scheduler closest to the workload has the most impact.
• NOOP has the best throughput in the hypervisor but is least fair by Jain's fairness measure.
• In the guest VM, CFQ is 17% better than Anticipatory for the FFSB benchmark, but for Postmark, Anticipatory is 18% better than CFQ.
• On the hypervisor, NOOP is 60% better than CFQ for FFSB and 72% better for Postmark.


Nested IO scheduling

Boutcher's numbers for FFSB on Xen 3.2 with 128 threads; each VM was allocated a contiguous 120GB region on a 500GB SATA drive. Columns are the scheduler in the VM; rows are the scheduler in the hypervisor. The numbers (transactions per second on Xen) are approximate because they were read off a bar graph.

Hypervisor \ Guest   Anticipatory   CFQ   Deadline   NOOP
Anticipatory             200        260     175       240
CFQ                      260        240     155       160
Deadline                 315        360     250       255
NOOP                     320        370     245       255


Sequential IO becomes random

• Sequential IO issued from multiple VMs to the same block device becomes random when aggregated in the hypervisor.

• Set a longer disk queue length in the hypervisor to enable better aggregation. On VMware, you can set Disk.SchedNumReqOutstanding=NNN.

• Use PCI flash or SSDs to absorb random writes.


Nested filesystems and page cache

A filesystem in the VM can map to a flat file on the underlying filesystem, a raw block device (local or iSCSI), or NFS. A flat file on a filesystem is preferred for ease of management; it can be in raw, qcow2, vmdk or vhd format.

Image: guest filesystems in VM1 (/dev/sda) and VM-2 (/dev/sdb) map to files in /vmfs on the hypervisor, which sits on /dev/sdc.

• Flat files introduce another performance overhead (next slide).
• KVM has four caching modes (none, writeback, writethrough, unsafe) which can disable/enable either cache (see the sketch below).
• In Xen Dom0, the page cache comes into play when the file-storage option is in use.
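For KVM, the cache mode is chosen per disk when the guest is started. The sketch below assembles a QEMU command line that selects the cache mode and a virtio (paravirtual) disk interface; the image path, memory size and network options are placeholders, not the deck's configuration.

```python
# Hedged sketch: build a QEMU command line that picks the disk cache mode and a
# paravirtual (virtio) disk interface. Image path and sizes are placeholders.
import subprocess

def qemu_cmd(image="/var/lib/libvirt/images/guest.qcow2", cache="none"):
    # cache = one of: none, writeback, writethrough, unsafe
    return [
        "qemu-system-x86_64",
        "-enable-kvm",
        "-m", "2048",
        "-drive", f"file={image},if=virtio,cache={cache}",
        "-net", "nic,model=virtio", "-net", "user",
    ]

if __name__ == "__main__":
    print(" ".join(qemu_cmd(cache="none")))
    # subprocess.run(qemu_cmd())   # uncomment to actually boot the guest
```

cache=none bypasses the host page cache (direct IO) while keeping the guest's own cache, which avoids double caching between the nested filesystems.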


Nested filesystems and page cache

Le et al (2012) ran FileBench and "fio" on KVM, trying 42 different combinations of guest and host file systems, and found a worst-case 67% degradation.

Their conclusions:
• Read-dominated workloads benefit from readahead.
• Avoid journaled filesystems for write-dominated workloads.
• Latency goes up 10-30% even in the best case.
• The host FS should act like a dumb disk or VirtFS; it should not make placement decisions over what the guest FS has already decided.

• Jannen (2013) found that the overheads are worse for filesystems on SSD; on HDD they are masked by rotational latency.


Nested filesystems and page cache

Le et al (2012) - random file-write test using "fio". Rows are the host file system; columns are the guest file system. Throughput in MB/sec.

Host \ Guest   ext2   ext3   reiser   xfs   jfs
ext2            60     55      65      80    95
ext3            60     55      65      80    75
ext4            60     55      55      70    95
reiser          60     55      65      80   100
xfs             60     40      60      70    65
jfs             60     50      65      80   105


Nested IO stacks : use paravirtualization

The hypervisor exposes a virtual NIC or storage HBA, written in software, to the VM. An IO request issued by the VM travels to the bottom of the guest stack before it is repackaged and reissued by the hypervisor.

Paravirtualization traps the IO request and uses shared memory to route it faster to the hypervisor.

1. VMware: install "VMware tools" and select the "Paravirtual SCSI controller" for storage and the "vmxnet" driver for networking. VMware claims PVSCSI offers 12% higher throughput at 18% less CPU cost with an 8KB block size (blogs.vmware.com).

2. KVM: use the newer in-kernel "vhost-net" for networking and the "virtio-scsi" or "virtio-blk-data-plane" drivers for storage.

3. Xen: the split driver is used for PVM guests, while HVM guests use QEMU or a StubDom. HVM guests can also use PV drivers.


Xen: PVM and HVM difference

HVM is 3-6% better than a PV guest on CPU+RAM-intensive tests, for 1 VM with 4 vCPUs and a 6GB JVM heap (Yu 2012).


Xen: PVM and HVM difference

Here the HVM guest was using PV drivers; it outperformed PV by 39% on a disk-intensive test running on SSD (Yu 2012).


Contents

1. Hypervisor classification and overview
2. Overheads introduced
3. Benchmarking
4. Analytic models
5. Conclusion


Virtual Machine Benchmarking

Two aspects

1. Performance : how does a consolidated server compare to a non-virtualized OS running on bare-metal hardware?

2. Isolation : does overloading one VM bring down the performance of other VMs running on the same node?

Impact of factors:
• Virtualization-sensitive instructions issued by the guest.
• VM exits and interrupt delivery.
• Architectural choices made within the hypervisor.
• Interference between VMs due to shared resources (visible and invisible).


Testing isolation capability of a hypervisor

1. Run the application on the cloud with collocated VMs.
2. Then run it in isolation with no collocated VMs to find the gaps.
3. Then run it in a controlled environment, gradually adding collocated VMs which create CPU, disk or network load, until you can reproduce the behaviour seen in the cloud.


CPU isolation on Xen

(Barker and Shenoy, 2010) Measure the variation in completion times of a single thread running a "floating point operations" test periodically over a few hours (a minimal sketch of such a probe follows below).

• Amazon EC2 small instance: the average completion time was 500 ms, but there was significant jitter; some runs took an entire second.

• Local setup: the same test on a local Xen server showed almost NO variation in completion time.

• Conclusion: the CPU scheduler on Xen does not provide perfect isolation.

• Further tests were done to narrow down the problem in the CPU scheduler.
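A minimal probe in the spirit of that test: time a fixed amount of floating-point work at regular intervals and look at the spread of the completion times. The workload size, sample count and interval below are arbitrary choices, not the values used in the study.

```python
# Minimal jitter probe: time a fixed floating-point workload repeatedly and
# report the spread of completion times (arbitrary sizes, not the paper's).
import time
import statistics

def fp_work(n=2_000_000):
    x = 0.0
    for i in range(1, n):
        x += (i * 3.14159) / (i + 1.0)   # fixed amount of floating-point work
    return x

def run(samples=60, interval=1.0):
    times_ms = []
    for _ in range(samples):
        t0 = time.perf_counter()
        fp_work()
        times_ms.append((time.perf_counter() - t0) * 1000.0)
        time.sleep(interval)             # run the probe periodically
    print(f"mean={statistics.mean(times_ms):.1f} ms  "
          f"stdev={statistics.stdev(times_ms):.1f} ms  "
          f"max={max(times_ms):.1f} ms")

if __name__ == "__main__":
    run()
```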


Xen’s credit scheduler for CPU

(Barker and Shenoy, 2010) Xen has 2 main CPU schedulers - EDF (realtime) and Credit (default). Each VM runs on one or more virtual CPUs (vCPUs); the hypervisor maps vCPUs to physical CPUs (floating or pinned). For each VM you can define (weight, cap), which can be set from the toolstack (see the sketch below):
1. Weight = proportion of CPU allocated.
2. Cap = max limit or ceiling on CPU time.

The Credit scheduler periodically issues 30ms of credit to each vCPU. The allocation is decremented in 10ms intervals; when the credits expire, the VM must wait until the next 30ms cycle. If a VM receives an interrupt, it gets a "Boost" which moves it to the top of the vCPU queue, provided it has not exhausted its credits.

The scheduler also has a work-conserving mode which transfers unused capacity to those VMs that need it.
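On recent Xen (xl toolstack) the per-VM weight and cap can be read and set from Dom0 with `xl sched-credit`. The sketch below wraps that command; the domain name and values are examples.

```python
# Hedged sketch: read and set the credit scheduler (weight, cap) of a Xen guest
# from Dom0 using the xl toolstack. Domain name and values are examples.
import subprocess

def show_credit(domain="guest1"):
    subprocess.run(["xl", "sched-credit", "-d", domain], check=True)

def set_credit(domain="guest1", weight=512, cap=50):
    # weight: relative share (default 256); cap: ceiling in % of one CPU (0 = no cap)
    subprocess.run(
        ["xl", "sched-credit", "-d", domain, "-w", str(weight), "-c", str(cap)],
        check=True,
    )

if __name__ == "__main__":
    show_credit()
    set_credit(weight=512, cap=50)
    show_credit()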


CPU isolation on Xen

(Barker and Shenoy study 2010)

On the local setup, two VMs were tied to the same physical core. The (weight, cap) of the foreground VM was varied while the background VM was kept busy.

1. First test: keep the weights of both VMs equal. Observed the same jitter as seen in the EC2 test.

2. Second test: vary "weight" while keeping "cap" constant. Weight does not directly correspond to CPU time; a weight ratio of 1:1000 only translates into an actual CPU ratio of 1:1.5 (33% more).

3. Third test: vary "cap" on both VMs. The CPU allocation of the foreground VM was in proportion to the "cap" (even when the background VM was idle).

Conclusion: strong isolation requires pinning the VM to a core or setting the "cap".


Disk isolation on Xen

Test jitter for small random or large streaming IO to simulate game servers and media servers (Barker, 2010).

Amazon EC2: significant variation in completion time for reads and writes. Write bandwidth can vary up to 50% from the mean; read bandwidth variation can be due to caching side-effects.

Isolated local setup: completion times are consistent if there is no other VM on the Xen node.

Introduce a background VM: the same tests were run with another background VM doing heavy IO, using the CFQ IO scheduler in Dom0 and NOOP in the guest VM.

Finding: Xen has no native disk isolation mechanism to identify per-VM disk flows. Throughput of the foreground VM dropped by 65-75% and latency increased by 70-80%. The degradation is limited by the round-robin policy of the Xen Dom0 driver.


Network isolation on Xen

(Barker 2010)
1. Measure "ping" time to the next hop.
2. Measure the sum of "ping" times to the first three hops.
3. Measure the time to transfer a 32KB block between a local machine and an EC2 instance.

Pop quiz: what is the point of conducting these three tests?


Network isolation on Xen

(Barker 2010)
1. Measure "ping" time to the next hop.
2. Measure the sum of "ping" times to the first three hops.
3. Measure the time to transfer a 32KB block between a local machine and an EC2 instance.

Purpose:
a) The first measurement captures jitter of the network interface.
b) The second captures jitter in routers inside the Amazon data center.
c) The third captures Internet WAN transfer rate and jitter.

Results:
1. No jitter in the first measurement.
2. Significant variation in the second: most probes took 5ms, but a significant number took an order of magnitude longer.
3. The third test showed regular variation (related to peak hours) typical of most WAN applications.
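A simple way to reproduce the first two measurements is to collect many RTT samples with the system ping and summarize their spread, including a tail percentile. The target address, sample count and timeout below are placeholders.

```python
# Minimal sketch: collect RTT samples to a hop and summarize the jitter,
# including a tail percentile. Host, count and timeout are placeholders.
import subprocess
import statistics

def ping_once(host="10.0.0.1", timeout_s=1):
    out = subprocess.run(["ping", "-c", "1", "-W", str(timeout_s), host],
                         capture_output=True, text=True).stdout
    for token in out.split():
        if token.startswith("time="):
            return float(token[len("time="):])   # RTT in milliseconds
    return None                                  # timed out / no reply

def summarize(samples):
    samples = sorted(samples)
    p999 = samples[int(0.999 * (len(samples) - 1))]
    print(f"n={len(samples)}  median={statistics.median(samples):.2f} ms  "
          f"mean={statistics.mean(samples):.2f} ms  99.9th={p999:.2f} ms")

if __name__ == "__main__":
    rtts = [r for r in (ping_once() for _ in range(1000)) if r is not None]
    summarize(rtts)
```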


Network isolation on Xen

Network latency tests on a game server and a media server on a local Xen cloud (Barker 2010). Found that "tc" defines per-VM flows using the IP address and provides good isolation. There are two ways to allocate bandwidth using the Linux "tc" tool (see the sketch after the table):

1. Dedicated : divide bandwidth between competing VMs and prevent any VM from using more (i.e. weight + cap).
2. Shared : divide bandwidth but allow VMs to draw more if required (i.e. weight + work-conserving).

In both the game and media server tests the results were consistent: "dedicated" mode produced lower latency, while "shared" mode produced lower jitter.

Interference Mean Std deviation

Dedicated 23.6 ms 29.6

Shared 33.9 ms 16.9
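The two policies map naturally onto tc's HTB qdisc: "dedicated" means a class whose ceiling equals its rate (no borrowing), "shared" means the ceiling is raised to the link rate so a VM may borrow idle bandwidth. The sketch below sets this up with per-VM flows keyed on destination IP; the interface, addresses and rates are examples and it must run as root on Dom0.

```python
# Hedged sketch of the two tc/HTB policies above, keyed on per-VM IP addresses.
# Interface, IPs and rates are examples; run as root on Dom0.
import subprocess

DEV = "eth0"

def sh(cmd):
    subprocess.run(cmd.split(), check=True)

def setup(total="1000mbit",
          vms=(("10.0.0.2", "1:10"), ("10.0.0.3", "1:20")),
          shared=True):
    per_vm = "500mbit"
    ceil = total if shared else per_vm      # shared: may borrow up to link rate
    sh(f"tc qdisc add dev {DEV} root handle 1: htb default 99")
    for ip, classid in vms:
        sh(f"tc class add dev {DEV} parent 1: classid {classid} "
           f"htb rate {per_vm} ceil {ceil}")
        sh(f"tc filter add dev {DEV} protocol ip parent 1:0 prio 1 "
           f"u32 match ip dst {ip}/32 flowid {classid}")

if __name__ == "__main__":
    setup(shared=False)   # "dedicated": rate == ceil, no borrowing
```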


Long latency tails on EC2

(Xu et al, Bobtail, 2013)

Initial observations:
1. Median RTTs within EC2 are up to 0.6ms, but the 99.9th-percentile RTT on EC2 is 4 times longer than that seen in dedicated data centers. (In other words, a few packets see much longer delays than normal.)
2. Small instances are most susceptible to the problem.
3. Measured RTT between node pairs in the same AZ. The pattern is not symmetric; hence the long tail is not caused by the location of the host on the network.
4. RTT between good and bad nodes in an AZ can differ by an order of magnitude.
5. One AZ which had newer CPU models did not return that many bad nodes.


Long latency tails on EC2

(Xu et al, Bobtail, 2013) Experimental setup: on a 4-core Xen server, dedicate 2 cores to Dom0. The remaining 2 cores are shared between 5 VMs with a 40% share each. Vary the mix of latency-sensitive versus CPU-intensive VMs.

Latency-sensitive VMs   CPU-intensive VMs   RTT
         5                      0            1 ms
         4                      1            1 ms
         3                      2           <10 ms
         2                      3           ~30 ms
         1                      4           ~30 ms

Long-tail emerges when CPU-intensive VMs exceed number of cores.


Long latency tails on EC2

(Xu et al, Bobtail, 2013)Hypothesis: Do all CPU-intensive VMs cause a problem? Test: Vary the CPU usage of CPU-intensive VM to find out.

Long tail occurs when a competing VM does not use all its CPU allocation. The Boost mechanism for quickly scheduling latency-sensitive VMs fails against such VMs.


Long latency tails on EC2

(Xu et al, Bobtail, 2013)

1. Latency-sensitive VMs cannot respond in a timely manner because they are starved of CPU by other VMs.

2. The VMs which starve them are those that are CPU-intensive but are not using 100% of their allocation within 30ms.

3. The BOOST mechanism in the Xen scheduler runs in FIFO manner and treats these two types of VMs equally instead of prioritizing the latency-sensitive VM.

Authors designed “Bobtail” to select the EC2 instance on which to place a latency-sensitive VM. (see paper)


EC2 Xen settings

Tested for small instances (Varadarajan, 2012):
1. EC2 uses the Xen credit scheduler in non-work-conserving mode, which reduces efficiency but improves isolation.
2. It allows vCPUs to float across cores instead of pinning them to a core.
3. Disk and network scheduling is work-conserving, but only network scheduling has a max cap of 300 Mbps.


Know your hypervisor : Xen

Xen has two CPU schedulers: Credit(2) and EDF.
• The Credit scheduler keeps a per-VM (weight, cap) and can be work-conserving or not. Work-conserving means "distribute any idle time to the other running VMs"; otherwise the total CPU quantum is capped.
• IO-intensive VMs benefit from BOOST, which bumps a vCPU to the head of the queue when it receives an interrupt, provided it has not exhausted its credits.

Device scheduler:
• Disk and network IO goes through Domain 0, which schedules it in batches in round-robin fashion. To control network bandwidth, use Linux tools to define per-VM flows.

Best practice: increase the CPU weight of Dom0 in proportion to the amount of IO, dedicate core(s) to it, dedicate memory and prevent ballooning.


Know your hypervisor - KVM

• QEMU originally started as a complete machine emulator [Bellard, 2005]. Code emulation is done by TCG (tiny code generator), originally called "dyngen". KVM was later added as another code accelerator into the QEMU framework.

• There is only one IO thread, and the big QEMU driver lock is held in many IO functions.

• A Redhat "fio" benchmark in Sept 2012 reported 1.4M IOPS with 4 guests, but this was using passthrough IO (i.e. bypassing QEMU).

• Similar numbers were reported in Mar 2013, this time using an experimental virtio-dataplane feature which uses dedicated per-device threads for IO.

• Performance of an RTOS (as a guest OS) in KVM also suffers when it comes in contact with QEMU [Kiszka].


Tile-based benchmarking to test consolidation

Traditional benchmarks are designed for individual servers. For virtualization, tiles of virtual machines that mimic actual consolidation are used.

1. SPECvirt sc2013 (supersedes SPECvirt sc2010)
2. vConsolidate (Intel): a tile consisting of SPECjbb, Sysbench, WebBench and a mail server.
3. VMmark (VMware): Exchange mail server, standby system, Apache server, database server.

SPECvirt sc2013:
• Run for 3 hours on a single node to stress CPU, RAM, disk and network.
• Incorporates four workloads: a web server, a Java app server connected to a backend database server (to test multiple vCPUs on SMP), a mail server and a batch server.
• Keep adding additional sets of virtual machines (tiles) until overall throughput reaches a peak. All VMs must continue to meet the required QoS (spec.org/virt_sc2013).


SPECvirt sc2013


Contents

1. Hypervisor classification and overview
2. Overheads introduced
3. Benchmarking
4. Analytic models
5. Conclusion


Value of analytic model

Benchmarks have to:
• produce repeatable results,
• be easily comparable across architectures & platforms,
• have predictive power (extrapolation).

There is a tension between realism and reproducibility. Macrobenchmarks simulate real-world conditions but are not comparable and lack extrapolation; microbenchmarks determine the cost of primitive operations.

An analytic model is needed to tie benchmarks to prospective application use. Seltzer proposed three approaches:
1. Vector-based: combine a system vector with an application vector (illustrated below).
2. Trace-based: generate a workload from a trace to capture the dynamic sequence of requests.
3. Hybrid: a combination of both.

(Mogul 1999; Seltzer et al 1999)
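A toy illustration of the vector-based idea: the "system vector" holds the measured cost of each primitive operation (from microbenchmarks), the "application vector" holds how often the application performs each one, and the predicted cost is their inner product. All numbers below are made up.

```python
# Toy vector-based model: predicted time = system vector . application vector.
system_vector = {          # cost of each primitive operation (microseconds)
    "syscall": 0.4,
    "page_fault": 2.5,
    "disk_read_4k": 80.0,
    "net_rtt": 120.0,
}
application_vector = {     # how often the application performs each primitive
    "syscall": 50_000,
    "page_fault": 1_200,
    "disk_read_4k": 400,
    "net_rtt": 150,
}

predicted_us = sum(system_vector[k] * application_vector[k] for k in system_vector)
print(f"predicted time ~ {predicted_us / 1e6:.3f} s")
```

Re-measuring only the system vector on a virtualized platform then yields a new prediction for the same application vector, which is the appeal of the approach.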


Analytic models for virtualization

1. Layered queuing network (Menasce; Benevenuto 2006).
2. Factor graphs to determine per-VM utilization (Lu 2011).
3. Trace-based approach (Wood et al).
4. VMBench (Moller @ Karlsruhe).
5. Equations for cache and core interference (Apparao et al).
6. Machine learning.


Layered Queueing network (for Xen)

Image: layered queueing network for Xen - requests flow IN through the VM, Domain 0 and the disk, then OUT.


Layered Queueing network (for Xen)

Total response time R = R(VM) + R(Dom0) + R(disk)

For M/M/1 with feedback: R of one resource = D / (1 - U)

U = Utilization = λ * D = Arrival rate * Service demand  (0 < U < 1)
D = Service demand = total time taken by one request

D(resource by VM)   = D(bare) * Slowdown(resource) / P(VM)
D(resource by Dom0) = D(VM) * Cost(Dom0) / P(IDD)
where
  P = speedup of the VM's hardware compared to bare metal
  Cost(Dom0) = BusyTime(Dom0) / BusyTime(VM)
  Slowdown(resource) = BusyTime(virtual) / BusyTime(bare)
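A small worked example of these formulas, with made-up numbers for the demands, slowdown and Dom0 cost factors:

```python
# Worked example of the layered model above (all input numbers are made up).
# R(resource) = D / (1 - U), with U = arrival_rate * D, and the virtualized
# demands derived from bare-metal demands via slowdown/cost factors.
lam = 20.0                      # arrival rate (requests/sec)

D_bare_cpu   = 0.010            # bare-metal CPU demand per request (sec)
slowdown_cpu = 1.3              # BusyTime(virtual) / BusyTime(bare)
P_vm, P_idd  = 1.0, 1.0         # hardware speedup factors
cost_dom0    = 0.4              # BusyTime(Dom0) / BusyTime(VM)
D_disk       = 0.008            # disk demand per request (sec)

D_vm   = D_bare_cpu * slowdown_cpu / P_vm
D_dom0 = D_vm * cost_dom0 / P_idd

def response(D, lam):
    U = lam * D                 # utilization, must stay below 1
    return D / (1.0 - U)

R = response(D_vm, lam) + response(D_dom0, lam) + response(D_disk, lam)
print(f"total response time ~ {R * 1000:.1f} ms")
```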


Factor graphs

Resource utilization at the guest VMs is known and the aggregate utilization at the hypervisor is known. How do we determine the function which defines the per-VM utilization of each resource?

This can be modeled as a "source separation problem", as studied in digital signal processing.

Measurements inside a VM and on the hypervisor can differ:
1. Disk IO inside the VM can be higher than on the hypervisor due to merging of IOs in the hypervisor.
2. CPU utilization inside a VM can be half of that at the hypervisor, because Xen issues per-VM IO through Dom0 (seen via "xentop").


Factor graphs

Image: factor graph linking the CPU, Disk, Net and Mem utilization of the Host, VM1 and VM2 through factor nodes h1-h4, f1-f4 and g1-g4.


Trace-based approach

How to model the migration from bare-metal to virtual environment?

1. Create platform profiles to measure the cost of primitive operations: run the same microbenchmarks on the native (bare-metal) and the virtualized platform.

2. Relate native and virtualized: formulate a set of equations which relate native metrics to virtualized ones.

3. Capture a trace of the application which is to be migrated: determine how many primitive operations it uses and plug that in.

(The actual process employs statistical methods and is more complicated.)


Trace-based approach

How to model the migration from bare-metal to virtual environment?

Step 1: Create platform profiles. Run carefully chosen CPU-, disk- and network-intensive microbenchmarks on both the bare-metal and the virtual environment (a sampling sketch follows the table below).

Measure key metrics for each benchmark:
a) CPU - percentage of time spent in user, kernel and iowait
b) Network - read and write packets/sec and bytes/sec
c) Disk - read and write blocks/sec and bytes/sec

             CPU user (%)   CPU sys (%)   CPU iowait (%)
Bare metal        23             13              3
Virtual           32             20              8
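A minimal sketch of how the CPU portion of such a profile can be sampled on Linux: read /proc/stat before and after the microbenchmark window and compute approximate user/system/iowait percentages (irq/softirq/steal time is ignored here for brevity).

```python
# Minimal sketch of step 1: sample CPU user/system/iowait percentages from
# /proc/stat while a microbenchmark runs (Linux only; approximate).
import time

def cpu_times():
    with open("/proc/stat") as f:
        parts = f.readline().split()      # "cpu user nice system idle iowait ..."
    user, nice, system, idle, iowait = map(int, parts[1:6])
    return user + nice, system, idle, iowait

def sample(interval=5.0):
    u0, s0, id0, io0 = cpu_times()
    time.sleep(interval)                  # run the microbenchmark in this window
    u1, s1, id1, io1 = cpu_times()
    du, ds, did, dio = u1 - u0, s1 - s0, id1 - id0, io1 - io0
    total = du + ds + did + dio
    print(f"CPU user={100*du/total:.1f}%  sys={100*ds/total:.1f}%  "
          f"iowait={100*dio/total:.1f}%")

if __name__ == "__main__":
    sample()
```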


Trace-based approach

How to model the migration from bare-metal to virtual environment?

Step 2: Relate native and virtualized. Formulate a set of equations which relate native metrics to virtualized ones, e.g.

Util(cpu on VM) = c0 + c1*M1 + c2*M2 + ... + cn*Mn
where Mk = metric gathered from the native microbenchmark.

Solve for the model coefficients using least-squares regression. The coefficients c_k capture the relation between the native and the virtualized platform, e.g. c0=4, c1=19, c2=23, ...
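The fitting step is an ordinary least-squares problem. The sketch below shows it with synthetic data (the metric values and utilizations are made up; only the structure matches the equation above).

```python
# Sketch of step 2: fit the coefficients c_k relating native microbenchmark
# metrics to the CPU utilization observed on the virtual platform (fake data).
import numpy as np

# One row per microbenchmark: [1, M1, M2, M3] -- the leading 1 yields c0.
native_metrics = np.array([
    [1, 0.10, 1200,  300],
    [1, 0.45,  200, 2500],
    [1, 0.80, 3500,  100],
    [1, 0.25,  900, 1800],
    [1, 0.60, 2100,  700],
])
virt_cpu_util = np.array([12.0, 35.0, 61.0, 28.0, 48.0])   # observed on the VM

coeffs, *_ = np.linalg.lstsq(native_metrics, virt_cpu_util, rcond=None)
print("c0..c3 =", np.round(coeffs, 3))

# Step 3: apply the model to the traced application's native metrics.
app_native = np.array([1, 0.5, 1500, 900])
print("predicted VM CPU util =", float(app_native @ coeffs))
```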


Trace-based approach

How to model the migration from bare-metal to virtual environment?

Step 3: Capture a trace of the application which is to be migrated, find its native metrics Mk, and plug them into the above equation. Then solve it. Voila!

Util(cpu on VM) = 4 + 19 * (M1) + 23 * (M2) + ...

Recap:
1. Create platform profiles for native and virtual.
2. Find the coefficients which relate native & virtual.
3. Capture the application trace and apply the equation.

Their findings:
1. A single model is not applicable to both Intel and AMD, since CPU utilization varies.
2. A feedback loop within the application can distort the performance prediction.


Conclusion

"All problems in CS can be solved by another level of indirection."

- David Wheeler (1927-2004), recipient of the first PhD in Computer Science

... and performance problems introduced by indirection require caching, interlayer cooperation and hardware assists (e.g. TLB cache, EPT, paravirtualization).

Virtual machines have finally arrived. Dismissed for a number of years as merely academic curiosities, they are now seen as cost-effective techniques for organizing computer systems resources to provide extraordinary system flexibility and support for certain unique applications.

[Goldberg, Survey of Virtual Machine Research, 1974]


References

1. Ganesan et al. Empirical Study of Performance Benefits of Hardware Assisted Virtualization, 2013.
2. Boutcher and Chandra. Does Virtualization Make Disk Scheduling Passé?
3. Le et al. Understanding Performance Implications of Nested File Systems in a Virtualized Environment.
4. Jannen. Virtualize Storage, Not Disks.
5. Yu. Xen PV Performance Status and Optimization Opportunities.
6. Barker and Shenoy. Empirical Evaluation of Latency-Sensitive Application Performance in the Cloud.
7. Xu et al. Bobtail: Avoiding Long Tails in the Cloud.
8. Varadarajan et al. Resource-Freeing Attacks.
9. Bellard. QEMU, a Fast and Portable Dynamic Translator.
10. Kiszka. Using KVM as a Real-Time Hypervisor.
11. Mogul. Brittle Metrics in Operating Systems Research.
12. Seltzer et al. The Case for Application-Specific Benchmarking.


References

1. Menasce. Virtualization: Concepts, Applications, and Performance Modeling.
2. Benevenuto et al. Performance Models for Virtualized Applications.
3. Lu et al. Untangling Mixed Information to Calibrate Resource Utilization in Virtual Machines, 2011.
4. Wood et al. Profiling and Modeling Resource Usage of Virtualized Applications.


BACKUP SLIDES


Classification

• OS-level virtualization : Does not run any intermediary hypervisor. Modify the OS to support namespaces for networking, processes and filesystem.

• Paravirtualization : Guest OS is modified and is aware that it is running inside a hypervisor.

• Full virtualization : Guest OS runs unmodified. Hypervisor emulates hardware devices.


NUMA/SMP

• If you run a monster-server VM with many vCPUs, you may have to worry about NUMA scaling. Depending on the NUMA ratio, there is a 30-40% higher cost (latency and throughput) in accessing remote memory.

• The hypervisor must be able to:
1. manually pin a vCPU to a core,
2. export the NUMA topology to the guest OS,
3. do automatic NUMA-aware scheduling of all guest VMs.

• VMware introduced vNUMA in vSphere 5.
• On Xen, pin Dom0 to a core. In case of NUMA, put the frontend and backend drivers on the same core.
• KVM exports the NUMA topology to the VM but is still lagging on automatic scheduling (a vCPU-pinning sketch follows below).
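On a libvirt-managed KVM host, manual pinning can be scripted around virsh. The sketch below assumes a guest named "guest1" and that cores 0-3 belong to the NUMA node you want; adjust both to your host (virsh capabilities prints the host's NUMA cells).

```python
# Hedged sketch: pin a KVM guest's vCPUs to cores of one NUMA node with virsh.
# Domain name and CPU lists are examples; the guest must be libvirt-managed.
import subprocess

def show_topology():
    # "virsh capabilities" lists the host NUMA cells and the CPUs in each.
    subprocess.run(["virsh", "capabilities"], check=True)

def pin_vcpus(domain="guest1", mapping=None):
    # Pin both vCPUs of the example guest to cores 0-3 (assumed NUMA node 0).
    mapping = mapping or {0: "0-3", 1: "0-3"}
    for vcpu, cpulist in mapping.items():
        subprocess.run(["virsh", "vcpupin", domain, str(vcpu), cpulist],
                       check=True)

if __name__ == "__main__":
    show_topology()
    pin_vcpus()
```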


NUMA/SMP

• Depending on the NUMA ratio, there is a 30-40% higher cost (latency and throughput) in accessing remote memory.

• The hypervisor must support the ability to pin a vCPU to a core, and also to allocate memory from a specific NUMA node.

• The hypervisor must export the NUMA topology (ACPI tables) so the guest OS can do its job.

• The hypervisor should do automatic NUMA-aware scheduling of all guest VMs.

• VMware introduced vNUMA in vSphere 5.
• On Xen, pin Dom0 to a core. In case of NUMA, put the frontend and backend drivers on the same core.
• KVM exports the NUMA topology to the VM but is still lagging on automatic scheduling.

• Cross-call overhead: on an SMP machine, when a semaphore is released by one thread, it issues a cross-call or inter-processor interrupt if the waiting threads are sleeping on another core. On a VM, the cross-call becomes a costly privileged op (Akther). Interrupt delivery may also trigger a cross-call.


Nested CPU scheduling

• Each guest OS runs on one or more virtual CPUs. The hypervisor schedules virtual CPUs on its run queue, and then each guest OS decides which task to run on that virtual CPU.

• This introduces the lock preemption problem: a process in the guest OS may get scheduled out by the hypervisor while holding a spin lock, delaying other processes waiting for that spin lock.

• The guest OS would not schedule out a process holding a spin lock, but the hypervisor is unaware of processes within the guest OS.

• The solution is some form of co-scheduling or "gang scheduling". VMware actively seeks to reduce the "skew" between multiple vCPUs of the same VM.


Nested Page tables

• A page fault in the VM may occur because the hypervisor has not allocated RAM to the VM.

• Guest page table: guest virtual address -> hypervisor virtual address.
• Hypervisor page table: hypervisor virtual address -> actual RAM.

• Earlier, hypervisors would maintain a "shadow page table" for each guest OS. This function has now moved into hardware by both Intel and AMD; it is called "nested page tables".

• Nested page tables require a costly two-dimensional page walk: for each step that is resolved in the guest table, you have to look up the host table.

• The overhead can be alleviated by using "huge pages" and per-VM tags in the TLB cache.
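A quick bit of arithmetic shows why the two-dimensional walk is costly. Assuming 4-level guest and 4-level host page tables and no TLB or paging-structure caches (worst case), each of the guest's 4 levels plus the final guest-physical address must itself be translated by a full host walk:

```python
# Worst-case memory references for a nested (two-dimensional) page walk with
# n-level guest and m-level host page tables: n*m + n + m, i.e. 24 for n=m=4,
# versus 4 references for a native 4-level walk.
def nested_walk_refs(n=4, m=4):
    return n * m + n + m

print(nested_walk_refs())        # 24
```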


Memory overheads & solutions

Balloon driver: take back memory from the guest.
• VMware
• KVM (see virtio_balloon.c in linux_kernel/drivers/virtio)
• HyperV calls it Dynamic Memory
• Xen Transcendent Memory

Memory deduplication:
• Present in System/370 (Smith & Nair)
• VMware calls it Transparent Page Sharing (patented)
• KVM uses KSM (which calls Linux madvise())
• Xen uses KSM in HVM mode only


Quantifying isolation

• Deshane et al (2007) defined BenchVM to test isolation.
• Run normal VMs alongside an overloaded VM and test whether the normal VMs remain responsive.

• On the overloaded VM, you run various stress tests:
1. CPU stress test
2. Memory stress test: calloc and touch memory without free()
3. Network: threaded UDP send and receive
4. Disk: IOzone
5. Fork bomb: test fast process creation and scheduling

Their conclusion: full virtualization provides better isolation than container-based virtualization. Their other results may be outdated due to advances in virtualization.


VM exits are costly

• Interrupt processing causes context switches between the VM and the hypervisor.

• KVM EOI optimization: the guest IDT (interrupt descriptor table) is shadowed.

• VMware detects clusters of instructions that can cause guest exits.
• Use a combination of polling and interrupts (NAPI).


mclock

• Disk capacity varies dynamically and cannot be statically allocated like CPU or RAM.
• A proportional-sharing algorithm is needed to reserve disk capacity.
• Gulati et al propose a dynamic algorithm which interleaves two schedulers and uses three tags with every IO request.


Hadoop benchmark

• VMWare :

• HyperV (conducted on HDInsight – Microsoft's version of Hadoop) :

• KVM:

• Xen: (Netflix runs map-reduce on AWS)


HPC/Scientific benchmark

• VMware paper: SPEC MPI and SPEC OMP.

• Xen: Jackson et al (2010) ran NERSC benchmarks on Amazon EC2 - six times slower than a Linux cluster and 20 times slower than a modern HPC system. The EC2 interconnect severely limits performance. They could not use processor-specific compiler options because of the heterogeneous mix of CPUs across nodes.

• In Jun 2010, Amazon launched "Cluster Compute Nodes", which are basically nodes running Xen in HVM mode connected via 10G Ethernet (no Infiniband yet).

• KVM and OpenVZ: Regola (2010) ran NPB on these nodes.


Realtime benchmark

• In order to minimize jitter and limit the worst-case latency, a realtime system must provide mechanisms for resource reservation, process preemption and prevention of priority inversion.

• Soft realtime (VoIP) vs hard realtime: soft means 20ms of jitter between packets is acceptable.

• RT-Xen
• Kiszka: KVM - the QEMU driver lock.


Layered Queueing network (Xen)

Total response time R = R(vcpu) + R(dom0_cpu) + R(disk)

Resp. time = Demand / (1 - Utilization)
R(vcpu)     = D(vcpu) / [1 - U(vcpu)]
R(dom0_cpu) = D(dom0_cpu) / [1 - U(dom0_cpu)]
R(disk)     = D(disk) / [1 - U(disk)]

Util = λ * D = Arrival rate * Demand

D(vm_cpu)   = D(isol_cpu) * S(cpu) / P(vm)        where S = slowdown, P = speedup
D(dom0_cpu) = D(vm_cpu) * Cost(dom0_vm) / P(dom0)

Cost(dom0_vm) = B(dom0_cpu) / B(vm_cpu)           where B = busy time

Slowdown(cpu)  = B(vm_cpu) / B(bare_cpu)
Slowdown(disk) = B(vm_disk) / B(bare_disk)