
mClock: Handling Throughput Variability for Hypervisor IO Scheduling

USENIX Symposium on Operating Systems Design and Implementation (OSDI) 2010

Ajay Gulati, VMware Inc.
Arif Merchant, HP Labs
Peter Varman, Rice University


Outline

• Introduction
• Scheduling goals of mClock
• mClock Algorithm
• Distributed mClock
• Performance Evaluation
• Conclusion


Introduction

• Hypervisors are responsible for multiplexing the underlying hardware resources among VMs
  – CPU, memory, network, and storage IO

[Figure: two hosts, each running several VMs on top of CPU, RAM, and a storage IO scheduler, share a storage array. The CPU and memory resources on a host are fixed and time-invariant, but the storage throughput available to a host is not under its own control.]


Introduction (cont’d)

• Existing methods provide many knobs for allocating CPU and memory to VMs.
• The current state of the art in IO resource allocation is much more rudimentary.
  – Limited to providing proportional shares to different VMs.
• Lack of QoS support for IO resources can have widespread effects, rendering existing CPU and memory controls ineffective when applications block on IO requests.


Introduction (cont’d)

• The amount of IO throughput available to any particular host can fluctuate widely based on the behavior of other hosts accessing the shared device.

[Figure: the throughput received by a host fluctuates widely as VM5 starts, then VM1, VMs 2 and 3, and VM4 successively start, and again as VM1, VMs 2 and 3, and VM4 stop.]


Introduction (cont’d)

• Three main controls in resource allocation (illustrated in the sketch below):
  – Shares (a.k.a. weights)
    • proportional resource allocation
  – Reservations
    • minimum amount of resource allocation
    • provide latency guarantees
  – Limits
    • maximum allowed resource allocation
    • prevent competing IO-intensive applications from consuming all the spare bandwidth in the system
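For concreteness, the three controls can be pictured as a per-VM tuple; a minimal sketch (the names and example values are ours, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class VMControls:
    reservation: float  # r_i: minimum IOPS the VM is guaranteed
    limit: float        # l_i: maximum IOPS the VM may receive
    weight: float       # w_i: share of any remaining throughput

# Example: an OLTP VM guaranteed 500 IOPS, capped at 2000 IOPS,
# and competing for spare bandwidth with weight 2.
oltp = VMControls(reservation=500.0, limit=2000.0, weight=2.0)
```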


Scheduling goals of mClock

VM                                     IO Throughput   IO Latency
Remote Desktop (RD)                    Low             Low
Online Transaction Processing (OLTP)   High            Low
Data Migration (DM)                    High            Insensitive

• When reservations cannot be met: allocate throughput proportional to reservations.
• When reservations can be met: satisfy reservations first, then allocate the remainder proportional to weights.
• Limit the maximum throughput of DM.


Scheduling goals of mClock (cont’d)

• Each VM i has three parameters:
  – Reservation (r_i), Limit (l_i), Weight (w_i)
• VMs are partitioned into three sets: reservation-clamped (R), limit-clamped (L), or proportional (P), based on whether their current allocation is clamped at the lower or upper bound or is in between.
• Define the target allocation for each set (see the reconstruction below).
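The defining equation is missing from this transcript; one reconstruction consistent with the partition above, writing T for the current array throughput and T_i for VM i's target allocation, is:

```latex
T_i =
\begin{cases}
  r_i, & i \in R \\[4pt]
  l_i, & i \in L \\[4pt]
  \dfrac{w_i}{\sum_{j \in P} w_j}
    \Bigl( T - \sum_{j \in R} r_j - \sum_{j \in L} l_j \Bigr), & i \in P
\end{cases}
```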


mClock Algorithm

• mClock uses two main ideas:
  – Multiple real-time clocks
    • reservation-based, limit-based, and weight-based tags
  – Dynamic clock selection
    • dynamically select one of the real-time clocks to use for scheduling
• The tag assignment method is similar to Virtual Clock scheduling (a sketch follows below).
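As a rough illustration of how these tags behave, here is a minimal Python sketch of the tag assignment, simplified to one set of tags per VM (the paper tags individual requests); all names are ours:

```python
import time
from dataclasses import dataclass

@dataclass
class VMState:
    r: float        # reservation (IOPS)
    l: float        # limit (IOPS)
    w: float        # weight
    R: float = 0.0  # reservation tag of the latest request
    L: float = 0.0  # limit tag of the latest request
    P: float = 0.0  # proportional (weight) tag of the latest request

def assign_tags(vm: VMState, now: float) -> None:
    """Tag the VM's newly arrived request, Virtual Clock style:
    each tag advances by the inverse of its rate, but is pulled
    forward to real time if the VM was idle (no credit banking)."""
    vm.R = max(vm.R + 1.0 / vm.r, now)  # R tags end up spaced 1/r_i apart
    vm.L = max(vm.L + 1.0 / vm.l, now)  # L tags spaced 1/l_i apart
    vm.P = max(vm.P + 1.0 / vm.w, now)  # P tags spaced 1/w_i apart

# Example: a VM with reservation 250 IOPS, limit 1000 IOPS, weight 2.
vm = VMState(r=250.0, l=1000.0, w=2.0)
assign_tags(vm, now=time.monotonic())
```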


mClock Algorithm (cont’d)

• Tag Adjustment
  – Calibrates the proportional-share (P) tags against real time when an idle VM becomes active, to prevent starvation.
  – In virtual-time-based scheduling, this synchronization is done using global virtual time: S_{i,k} = max{F_{i,k-1}, V(a_{i,k})}.
  – In mClock, the reservation and limit tags must be based on real time, so instead the origin of the existing P tags is adjusted to the current real time.


mClock Algorithm (cont’d)

[Algorithm listing with callouts: reservation-tagged requests are served first; weight-based selection only considers VMs still under their limit; Active_IOs counts each VM's queue length; tag adjustment is applied when an idle VM becomes active. A sketch of this dispatch logic follows.]
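A condensed sketch of the two-phase dispatch those callouts describe (constraint-based phase first, then weight-based phase), with our own helper names and the same one-tag-per-VM simplification as above:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class VMQueue:
    r: float              # reservation (IOPS)
    R: float = 0.0        # reservation tag
    L: float = 0.0        # limit tag
    P: float = 0.0        # proportional tag
    queue: List[str] = field(default_factory=list)  # pending request IDs

def dispatch(vms: List[VMQueue], now: float) -> Optional[str]:
    backlogged = [vm for vm in vms if vm.queue]

    # Constraint-based phase: serve overdue reservations first,
    # smallest R tag wins.
    due = [vm for vm in backlogged if vm.R <= now]
    if due:
        return min(due, key=lambda v: v.R).queue.pop(0)

    # Weight-based phase: only VMs not clamped by their limit
    # (L tag already due) compete; smallest P tag wins.
    under_limit = [vm for vm in backlogged if vm.L <= now]
    if under_limit:
        vm = min(under_limit, key=lambda v: v.P)
        req = vm.queue.pop(0)
        # Weight-based service must not count against the reservation:
        # pull the VM's outstanding R tag back by 1/r_i so R tags stay
        # spaced 1/r_i apart (the adjustment the next slide motivates).
        vm.R -= 1.0 / vm.r
        return req
    return None  # nothing eligible right now
```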


mClock Algorithm (cont’d)

• This maintains the condition that R tags are always spaced 1/r_i apart, so that reserved service is not affected by the service provided in the weight-based phase.

[Figure: requests with R tags R_k^1 ... R_k^5 spaced 1/r_k apart on the time axis. If r_k^3 is served in the weight-based phase at current time t without adjustment, the waiting time of r_k^4 may exceed 1/r_k; subtracting 1/r_k from the outstanding R tags restores the spacing.]


Storage-specific Issues

• Burst Handling
  – Storage workloads are known to be bursty.
  – Requests from the same VM often have high spatial locality.
  – We help bursty workloads that were idle to gain a limited preference in scheduling when the system next has spare capacity.
  – To accomplish this, we allow VMs to gain idle credits.

[Figure: P tags P_k^1 and P_k^2 spaced 1/w_i apart, followed by an idle period t_idle. When request r_k^3 arrives at current time t, its tag P_k^3 is set back by up to σ_i/w_i, giving the previously idle VM a bounded scheduling preference.]
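In tag form, with σ_i idle credits, the P tag of the first request after an idle period would plausibly be assigned as follows (a reconstruction consistent with the figure, not a formula quoted from the slides):

```latex
P_i^{k} = \max\Bigl\{\, P_i^{k-1} + \tfrac{1}{w_i},\; t - \tfrac{\sigma_i}{w_i} \,\Bigr\}
```

A VM returning from idle is thus scheduled as if it had arrived up to σ_i/w_i time units earlier, i.e., it receives at most σ_i requests' worth of preference, which bounds the effect on other VMs.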


Storage-specific Issues (cont’d)

• IO size
  – Since larger IOs take longer to complete, differently-sized IOs should not be treated equally by the IO scheduler.
  – The IO latency with n random outstanding IOs of size S each can be written in terms of Tm, the mechanical delay due to seek and disk rotation, and Bpeak, the peak transfer bandwidth of the disk.
  – Converting the latency observed for an IO of size S1 to that of an IO of a reference size S2 yields a size-dependent scaling factor; for a small reference size, the transfer-time term is negligible.
  – A single request of IO size S is treated as equivalent to (1 + S/(Tm × Bpeak)) IO requests (see the reconstruction below).
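The equations themselves did not survive this transcript; reconstructed from the stated conversion factor (our derivation, not a verbatim quote), they would read:

```latex
\mathrm{Lat}(n, S) \approx n \left( T_m + \frac{S}{B_{peak}} \right),
\qquad
\frac{\mathrm{Lat}(S_1)}{\mathrm{Lat}(S_2)}
  = \frac{T_m + S_1 / B_{peak}}{T_m + S_2 / B_{peak}}
  \;\approx\; 1 + \frac{S_1}{T_m \, B_{peak}}
  \quad \text{when } S_2 / B_{peak} \ll T_m .
```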


Storage-specific Issues (cont’d)

• Request Location
  – mClock improves the overall efficiency of the system by scheduling IOs with high locality as a batch.
  – A VM is allowed to issue IO requests in a batch as long as the requests are close in logical block number space.
• Reservation Setting
  – IOPS = outstanding IOs / latency
  – An application that keeps 8 IOs outstanding and requires 25 ms latency needs a reservation of 8 / 0.025 = 320 IOPS.


Distributed mClock

• Targets cluster-based storage systems.
• dmClock runs a modified version of mClock at each storage server.
  – Each request from VM v_i to a storage server s_j piggybacks two integers, ρ_i and δ_i.
  – δ_i: the number of IO requests from v_i that have completed service at all servers between v_i's previous request to s_j and the current request.
  – ρ_i: the number of IO requests from v_i served as part of the constraint-satisfying (reservation) phase between the previous request to s_j and the current request (tag forms sketched below).
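These piggybacked counts replace the constant increments in the per-server tag assignment, so each server accounts for service the VM received elsewhere in the cluster. A sketch of the resulting tag updates, following the dmClock construction in our notation:

```latex
R_i^{k} = \max\Bigl\{ R_i^{k-1} + \tfrac{\rho_i}{r_i},\; t \Bigr\}, \quad
L_i^{k} = \max\Bigl\{ L_i^{k-1} + \tfrac{\delta_i}{l_i},\; t \Bigr\}, \quad
P_i^{k} = \max\Bigl\{ P_i^{k-1} + \tfrac{\delta_i}{w_i},\; t \Bigr\}
```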


Performance Evaluation

• Implemented in the VMware ESX server hypervisor by modifying the SCSI scheduling layer in its IO stack.
• The host is a Dell PowerEdge 2950 server with two QLogic HBAs connected to an EMC CLARiiON CX3-40 storage array over a FC SAN.
• Two different storage volumes were used:
  – a 10-disk RAID 0 disk group
  – a 10-disk RAID 5 disk group


Performance Evaluation

• Two kinds of VMs:
  – Linux VMs with a 10GB virtual disk, one VCPU, and 512MB of memory
  – Windows Server 2003 VMs with a 16GB virtual disk, one VCPU, and 1GB of memory
• Workload generators:
  – Iometer (http://www.iometer.org/) in the Windows Server VMs
  – a self-designed workload generator in the Linux VMs


Performance Evaluation (cont’d)

• Limit Enforcement

                 RD                          OLTP                     DM
Workload         32 random IOs (75% read)    Always backlogged        Always backlogged
                 every 250 ms                (75% read)               (all sequential reads)
IO size          4KB                         8KB                      32KB
Latency bound    30ms                        30ms                     none
Weight           2                           2                        1

• At t = 140 s, the limit for DM is set to 300 IOPS.


Performance Evaluation (cont’d)

• Reservations Enforcement
  – Five VMs with weights in the ratio 1:1:2:2:2.
  – VMs are started at 60-second intervals.

[Figure: SFQ only does proportional allocation, while mClock also enforces the 300 IOPS and 250 IOPS reservations.]


Performance Evaluation (cont’d)

• Bursty VM Workloads
  – VM1: 128 IOs every 400ms, all 4KB reads, 80% random.
  – VM2: 16KB reads, 20% random and the rest sequential, with 32 outstanding IOs.
  – Idle credits do not impact the overall bandwidth allocation over time.
  – The latency seen by the bursty VM1 decreases as the idle credits are increased.


Performance Evaluation (cont’d)

• Filebench Workloads
  – Filebench [25] is used to emulate the workload of OLTP VMs.

[25] R. McDougall. Filebench: Application level file system benchmark. http://www.solarisinternals.com/si/tools/filebench/index.php


Performance Evaluation (cont’d)

• dmClock Evaluation
  – Implemented in a distributed storage system consisting of multiple storage servers (nodes).
  – Each node is a virtual machine running RHEL Linux with a 10GB OS disk and a 10GB experimental disk.


Conclusion

• mClock provides per-VM quality of service. The QoS requirements are expressed as:
  – a minimum reservation
  – a maximum limit
  – a proportional share (weight)
• The controls provided by mClock allow stronger isolation among VMs.
• The techniques are quite generic and can be applied to array-level scheduling and to other resources, such as network bandwidth allocation.


Comments

• Existing VM services provision resources only in terms of CPU, memory, and storage capacity, yet IO throughput may be the largest factor in QoS provisioning, in terms of response time or delay.
• Combining reservations, limits, and proportional shares in one scheduling algorithm is a good idea; WF2Q-M considers limits but not reservations.
• How should reservations, limits, and proportional shares be handled between VMs on different hosts?


Comments (cont’d)

• The experiments only validate the correctness of mClock. What about short-term fairness, the latency distribution, and the computation overhead?
• The experiments use only one host machine, so they cannot reflect throughput variability when multiple hosts share the array.