
Towards Linux as a Real-Time Hypervisor

Jan Kiszka
Siemens AG, Corporate Technology, CT SE 2

Otto-Hahn-Ring 6, D-81739 Munich

[email protected]

Abstract

Combining virtualization and real-time is important for an increasing number of use cases, from embedded systems to enterprise computing. In this paper, we analyze the real-time capabilities of Linux as a hypervisor when using KVM and QEMU. We furthermore introduce and evaluate a paravirtual scheduling interface that helps resolve priority inversion problems in embedded virtualization scenarios.

1 Introduction

Combining virtualization and real-time requirements is so far primarily a topic in embedded systems. In this domain, a general purpose operating system (GPOS) often needs to be executed alongside a real-time operating system (RTOS). A hypervisor may then be used to manage shared resources and isolate the OSes from each other. Such architectures can be found in industrial control systems where the RTOS takes over the time-critical control of a machine while the GPOS runs, for example, the visualization software. Further examples are single-processor smartphones where an RTOS is used to manage critical tasks of the radio communication while a GPOS hosts the typical set of mobile phone applications.

With the increasing importance of real-time in enterprise computing scenarios, the overlap with virtualization becomes larger as well. In the absence of highly demanding real-time requirements, virtualization is already used for server consolidation and load balancing. Thus, improving the real-time capabilities of virtualization solutions will also extend their usability into the real-time enterprise domain.

But the gap between embedded real-time and enterprise virtualization is still large. In many embedded scenarios, specialized hypervisors are used that have a reduced feature set and are mostly statically configured, but can provide better performance and require fewer resources than full-featured versions. Hypervisors from the enterprise domain focus on good average performance as well as resource control at a comparatively high level. They manage significantly larger resource sizes and include mechanisms to dynamically shift load between physical nodes.

Linux as an operating system has already demonstrated that it can adapt to a very broad range of use cases: from low-end embedded up to large-scale machines with thousands of CPU cores [1]. In the real-time domain, Linux is currently approaching more demanding use cases with the PREEMPT-RT patch [2, 3]. And Linux includes built-in virtualization support with the KVM subsystem [4, 5]. Consequently, one question is what quality of service could already be achieved by combining these technologies, which furthermore carry the advantages of open source and are thus easily adaptable to specific requirements whenever needed. A realistic question is not whether Linux can solve everything, but how large the share of use cases is that it can cover and how it can be gradually extended so that all Linux users benefit.

2 Linux as Hypervisor

Using the Linux OS itself as a hypervisor requires two components: a kernel driver and a user space application. Both will be briefly introduced in the following.

2.1 Kernel-Based Virtual Machine

The Kernel-Based Virtual Machine (KVM) is a Linux device driver whose main task is to manage unprivileged access to virtualization features that can only be used directly by the privileged kernel. It was originally developed to grant this access to Intel's and AMD's x86 hardware-based virtualization extensions, but meanwhile provides essentially the identical interface on ia64, s390, and PowerPC hosts, even though not all of them provide comparable hardware virtualization support.

Another important task of KVM is to provide fast virtualization for frequently accessed guest devices. So far, interrupt and timer controllers are available as kernel-hosted models. The advantage is that no consultation of the controlling user space process is required if a guest accesses any of these devices, which reduces the virtualization overhead. In-kernel support for paravirtualized network adapters is currently under development.

KVM is a so-called type-2 hypervisor [6], that is, it makes use of the host OS for tasks like setup/cleanup, scheduling, native interrupt and memory management, etc. For this reason, the execution model of KVM is very simple: a Linux user space process acting as a hypervisor can request to create and configure a virtual machine (VM) by opening the KVM character device and issuing a series of IOCTLs. This VM shares the host memory map with its controlling process, while the hypervisor can freely configure the mapping between the VM's physical and its own virtual memory.

The hypervisor process can furthermore request to create virtual CPUs within the VM. Given a guest CPU state, they can then execute arbitrary guest code in a confined environment. Any guest activity that cannot be handled by the host CPU's virtualization extension will trigger an exit from the virtual CPU (VCPU) execution environment. If the KVM kernel code is not able to handle the exit internally, for example by swapping in a host memory page that the guest has requested by accessing a virtual page, the exit event is propagated to the user space hypervisor.
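The following minimal sketch illustrates this control flow from user space. It only shows the IOCTL sequence described above; guest memory layout, register setup, and error handling are omitted, and the function name is ours.

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/* Sketch only: create a VM and one VCPU, run it, and return the exit reason. */
int run_vcpu_once(void)
{
    int kvm  = open("/dev/kvm", O_RDWR);          /* KVM character device */
    int vm   = ioctl(kvm, KVM_CREATE_VM, 0);      /* VM fd, sharing our memory map */
    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);     /* virtual CPU 0 */
    int size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);

    /* The kvm_run area reports why guest execution exited. */
    struct kvm_run *run = mmap(NULL, size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu, 0);

    /* Guest memory (KVM_SET_USER_MEMORY_REGION) and register setup omitted. */
    ioctl(vcpu, KVM_RUN, 0);                      /* execute guest code until an exit */
    return run->exit_reason;                      /* e.g. KVM_EXIT_IO or KVM_EXIT_MMIO */
}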

The execution context of the VCPU is implicitly defined by the user space hypervisor: it invokes the VCPU execution request from within one of its threads, and KVM preserves this scheduling context while running the guest code. Thus, if the Linux host scheduler decides to switch out a hypervisor thread, it is the corresponding guest CPU which is effectively scheduled out. The hypervisor process and/or the system administrator can therefore apply standard means at the level of threads, processes, users or control groups to influence the scheduling of virtual machines as well as further host services.

Guest exits also take place on asynchronous host events, foremost interrupts. As soon as KVM has restored the host OS context, it re-enables host interrupt processing, so the corresponding Linux interrupt handler is able to run. If the interrupt handler creates the need to reschedule, KVM detects this and invokes the corresponding service that possibly switches to a different host task.

2.2 QEMU

QEMU [7] is a fast CPU and peripheral hardware emulator. Its emulation does not depend on specific host features, that is, it also works across architectures. The KVM project chose QEMU as the foundation for its user space hypervisor as it already came with most of the additionally required infrastructure like I/O device emulation.

KVM's QEMU version basically replaces the CPU emulation with the hardware-assisted execution provided by the kernel driver. It furthermore splits the single-threaded execution model of QEMU into VCPU threads and a main I/O processing loop, enabling more scalable SMP virtualization. As the QEMU core code is not yet prepared for multi-threaded execution, a global lock protects any access to the device emulation layer. Thus, only one VCPU or the I/O main loop may query or modify emulated devices of a virtual machine at the same time.

The KVM project also adds asynchronous I/O (AIO) support to QEMU. As Linux's native AIO implementation performs worse than a pthread-based approach, QEMU gained AIO via a thread pool: for each AIO request, a user space thread is spawned or obtained from the pool that executes the request and then signals the completion to the main I/O thread.

Although the upstream QEMU version meanwhile includes KVM support, we developed our modifications only for KVM's QEMU branch [8, commit f2593a0] as it includes some required features that have not yet been merged into upstream.

3 Prioritizing QEMU/KVM

A straightforward approach to improve guest responsiveness over QEMU/KVM is to raise the priorities of QEMU's threads and lift them into a real-time scheduling class. This comes with the risk of starving host services that the guest may indirectly busy-wait upon. For example, the guest could issue an I/O request that is handled asynchronously by the host, but requires completion within the context of a non-real-time service like the kernel's event or AIO thread.

This risk can be mitigated by avoiding CPU overcommitment, that is, running fewer VCPUs than real processor cores exist. Furthermore, Linux comes with a real-time throttling mechanism [9] that restricts the CPU slice of all real-time tasks in the host system. The latter approach, however, can result in very low performance as the guest system consumes a lot of CPU time without making relevant progress.


3.1 Realization

While QEMU's main I/O thread and its VCPU threads could also be adjusted after start-up via chrt, its AIO thread pool grows and shrinks dynamically. Therefore a QEMU extension was required to establish a stable prioritization of all involved threads. We defined a new command line switch for QEMU:

-rt maxprio=prio[,policy=ff|rr|ts][,aioprio=prio][,aiopolicy=ff|rr|ts]

If aioprio and aiopolicy are omitted, the scheduling policy is applied to all existing and future threads of the QEMU process. QEMU's main thread and its AIO threads then gain the specified maximum priority, while VCPU threads will use the next lower priority level. This ensures that a spinning guest cannot starve itself from receiving I/O events that are routed through those threads. However, if asynchronous I/O is not considered most important for a guest, the priority and policy of the corresponding completion threads can be specified separately. The effect is discussed in Section 3.3.

We furthermore added

mlockall(MCL_CURRENT | MCL_FUTURE);

in case a real-time scheduling class is selected. This prevents any swapping of QEMU's process memory, including guest pages, and avoids lazy mapping. While mlockall and memory overcommitment are generally not compatible, real-time requirements deny the latter in most cases anyway.

A further real-time enhancement for QEMU concerned switching the mutexes used for synchronizing I/O handling to the priority inheritance protocol. The impact of this change, however, can be considered minor at this point, as pointed out in Section 6.
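For illustration, the following sketch shows how such a mutex can be initialized with the priority inheritance protocol via the standard pthread API; the mutex name is ours and only stands in for QEMU's I/O locks.

#include <pthread.h>

static pthread_mutex_t io_lock;

static void init_io_lock(void)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    /* With priority inheritance, a low-priority lock holder is boosted to the
     * priority of the highest-priority waiter, bounding the inversion time. */
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&io_lock, &attr);
    pthread_mutexattr_destroy(&attr);
}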

3.2 Test Setups

Our test setup consisted of a host system with an Intel dual-core processor (Core2 T5500 at 1.6 GHz) with 2 GB of RAM. Host as well as guest systems were based on OpenSUSE 11.1. We started our evaluation with a host kernel built out of KVM's git tree [10, commit 0b972aa], which basically generated a 2.6.31-rc4 kernel. We enabled CONFIG_PREEMPT and CONFIG_PREEMPT_RCU and disabled CONFIG_ACPI_PROCESSOR for best latencies. Tracing facilities were built in, but remained disabled during latency tests (echo 0 > /proc/sys/kernel/ftrace_enabled, echo 0 > /sys/kernel/debug/tracing/tracing_enabled).

To limit the size of the test matrix, we decided to focus on a single real-time aspect in the following evaluations: the latency of timed high-priority user space threads in the guest system. For measuring this latency, we used cyclictest from the rt-tests project [11, v0.50] with the parameters -m -n -p 99 -l 1000000 -h 100000 -q, that is, a 1 ms periodic nanosleep test over 1 million iterations. As our kernel configuration allowed both host and guest system to use the native TSC as clock source, we were able to rely on the measurement results obtained within the guest. Moreover, our interest in the latency distribution was stronger than in a reliable upper bound for the worst case. We therefore ran all tests only for 15 minutes, which is typically not enough to obtain maximum latencies but sufficient to compare distributions.

The guest kernel was kept identical for all tests: 2.6.29.6-rt23 plus the paravirtual scheduling patches for Linux guests that are introduced in Section 5. The latter extensions were always enabled, as Linux disables them automatically during boot-up if the hypervisor does not provide this feature.

QEMU was started with the parameters

qemu-system-x86_64 -m 512 OpenSuse_64.raw \
    -nographic -serial /dev/null -net nic \
    -net user,hostfwd=::2222-:22

that is, using 512 MB RAM, a raw disk image, and no local console. Instead, we controlled the guests via a forwarded SSH connection. The reasons for choosing a raw image over QEMU's QCOW2 disk format and avoiding local graphics or even serial console emulation are discussed in Section 6.

The tests on PREEMPT-RT host kernels required us to bind the prioritized QEMU to the second core of our host CPU as we otherwise faced livelocks during boot-up. We did not analyze the reason for this in further detail, but instead decided to establish this CPU binding consistently for all tests. The binding to the second core was not only applied to the QEMU process but also to the load applications as well as to the host's disk interrupt. A positive side effect of moving the tests was that we were able to avoid occasional latency peaks of a few hundred microseconds due to system management interrupts (SMI) which apparently only hit the first core on our platform.

We defined two load applications that were running in parallel in endless loops during the tests: bonnie 1.4 (as provided by OpenSUSE 11.1) was executed with -y -s 2000 to stress the I/O subsystems. Furthermore, the cache calibrator [12] was run with 1660 4M /tmp/calibrator.log, that is, a 4 MB RAM test area to stress memory caches. For all tests, the calibrator load stayed on the host as the intended cache pollution can be triggered from both host and guest domain equally well.

3.3 Evaluation

To evaluate the effect of raising QEMU's thread priorities, we first measured the latencies natively achievable with cyclictest on the host. For this test, the bonnie load ran on the host, and no QEMU instance was started. Figure 1 shows the result for the CONFIG_PREEMPT host kernel we used in this first round. Note that the axes in all graphs are logarithmically scaled.

FIGURE 1: Host Latency (host latency on 2.6.31-rc4; maximum: 82555 µs, average: 89 µs)

Next we measured the guest latency without any QEMU prioritization. Bonnie now ran inside the guest. In Figure 2, the difference to the host latency is visible in the form of a shifted frequency maximum and a much higher probability of multi-millisecond latencies. In fact, the selected upper bound for the histogram of 100 ms did not catch the maximum latencies. Clearly, this setup is unsuited for latency-sensitive guest load.

FIGURE 2: Unprioritized Guest Latency (guest latency on 2.6.31-rc4, -rt maxprio=0; maximum: 99999 µs, average: 5216 µs)

We then applied SCHED_FIFO with a priority of 98 for I/O and 97 for VCPU threads on QEMU. The result is depicted in Figure 3: the average latency dropped significantly.

FIGURE 3: Prioritized Guest Latency (guest latency on 2.6.31-rc4, -rt maxprio=98; maximum: 43328 µs, average: 195 µs)

With the aim to reduce the I/O load impact on our test scenario, we moved QEMU's AIO threads out of the real-time scheduling class again. Figure 4 visualizes that this kept the overall distribution shape but reduced the probability of high or low latencies.

FIGURE 4: Lowered AIO Priority (guest latency on 2.6.31-rc4, -rt maxprio=98,aioprio=0; maximum: 26480 µs, average: 117 µs)

Processing guest-originated I/O load, even if handled asynchronously, apparently contributes to excessive latencies in cyclictest. To confirm this assumption, we moved the bonnie test back to the host. This roughly resulted in the same latency distribution as in the previous experiment, see Figure 5.


FIGURE 5: I/O Load Moved to Host (guest latency on 2.6.31-rc4, -rt maxprio=98, host I/O load; maximum: 72989 µs, average: 130 µs)

4 PREEMPT-RT hosting

In the next step, we switched the host to the same PREEMPT-RT kernel that was already in use for the guest system. The aim was to evaluate if the I/O-related priority inversions we found on a standard kernel are caused by QEMU/KVM itself or by the host kernel not respecting priorities properly in all cases.

4.1 Priority Assignment

Selecting the maximum priority of QEMU under PREEMPT-RT has to be done carefully in order to avoid livelocks. A Linux guest, for example, performs a timer check during boot-up, busy-waiting for timer interrupts to arrive. On the host side, these interrupts flow from the timer IRQ handler via the hrtimer kernel thread to QEMU's I/O thread and finally to the VCPU. If the hrtimer thread is not given a higher priority than QEMU, the guest will spin endlessly, preventing any timer IRQ delivery. As our test setups focus on guest timer IRQ latency, we lifted hrtimer processing above all other interrupt threads:

# chrt -p -f 99 `pgrep hrtimer/1`

4.2 Evaluation

Again we measured the timer latency via cyclictest on the host first, running bonnie and calibrator as load. Figure 6 shows promising numbers. They indicate that PREEMPT-RT avoids priority inversions caused by low-priority I/O load, as far as this short test can reveal.

FIGURE 6: PREEMPT-RT Host Latency (host latency on 2.6.29.6-rt23; maximum: 70 µs, average: 6 µs)

Running QEMU with raised priority on PREEMPT-RT results in a clear worst case latency improvement over a standard PREEMPT kernel. Figure 7 depicts this measurement; it corresponds to Figure 3. On PREEMPT-RT, the frequency maximum moved below 100 µs, and the average latency dropped by one third.

FIGURE 7: Guest Latency on PREEMPT-RT Host (guest latency on 2.6.29.6-rt23, -rt maxprio=98; maximum: 3446 µs, average: 131 µs)

Corresponding to the test shown in Figure 4, we next dropped AIO priorities to 0. Figure 8 visualizes the best guest latency distribution we were able to achieve with our test setup. Here the latency average drops to only 75 µs, and a maximum latency of less than 500 µs was observable (again noting that the test only ran for 15 minutes).


FIGURE 8: PREEMPT-RT with Lowered AIO (guest latency on 2.6.29.6-rt23, -rt maxprio=98,aioprio=0; maximum: 437 µs, average: 75 µs)

5 Paravirtualized Scheduling

To fulfill the requirements of real-time tasks running in a VM, the hypervisor may assign guaranteed resource time shares to that VM. All real-time task requirements must then be fulfillable by granting an adequate time slice in a predefined scheduling cycle. Such a static time slice allocation can be impractical in a highly reactive environment.

Alternatively, the hypervisor may prioritize the real-time VM over any other VM. This approach is sufficient as long as the real-time VM only executes tasks that have a higher importance than any other task in the system. Otherwise, background jobs in the prioritized VM may use up all CPU resources of the host, starving the other VMs. A straightforward approach to avoid this priority inversion is to give the hypervisor a hint about the internal states of its guests. The hypervisor can then adjust the schedule of the VCPUs accordingly.

Such an approach is already used for resolving efficiency issues related to guest CPUs waiting on contended spinlocks [13]. Consider one VCPU holding a spinlock, then being preempted by another VCPU which starts to spin on the very same lock. Without any knowledge of the guest state, the hypervisor can only keep the second VCPU running until it has consumed its time slice. But given the information that the second VCPU is currently spinning on a lock, the hypervisor can instead grant the host CPU to the first VCPU, which is able to make progress on its jobs. Moreover, there is a probability that this VCPU will have released the spinlock the next time it is preempted by the second VCPU. Thus, the overall efficiency increases as less CPU time is burned unproductively.

Today this additional information about spinlock contention can only be provided by the guest OS itself. Linux introduced paravirtual spinlock operations for this purpose. These replace the native spinlock code when Linux detects that it is running over a hypervisor which supports an alternative mechanism. Currently only Xen makes use of them [14]. Upcoming Intel and AMD CPUs will include hardware-assisted detection of spinlock loops, which triggers a guest exit so that the hypervisor can schedule a different VCPU.

However, hardware assistance cannot be expected for solving the scheduling problem of prioritized guest CPUs because task scheduling decisions are done in software on commodity platforms like x86. Therefore, we need a software interface between guest and hypervisor to propagate the required information, and we have to define an execution model which both hypervisor and guest OSes can adopt.

5.1 Execution Model

The least common denominator in task scheduling today is fixed-priority scheduling. We chose POSIX [15] to define the detailed semantics of the supported scheduling policies. The following policies are so far supported by our model:

• SCHED_OTHER: Default time-sharing policy of the host.

• SCHED_FIFO: FIFO policy as defined by POSIX.

• SCHED_RR: Round-robin policy as defined by POSIX.

The hypervisor schedules a VCPU according to the last scheduling policy and priority that have been reported by the guest. The available priority range is [0, pmax], where pmax is a per-VCPU priority limit that can be configured in the hypervisor. The guest-provided priority pguest, confined to the range [0, 99], is mapped to the host priority p according to

p = ⌊pguest · pmax / 99⌋
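In C, this mapping amounts to a simple integer computation, where the integer division provides the floor operation (a small sketch; the function name is ours):

/* Map a guest priority in [0, 99] onto the host range [0, pmax]. */
static int map_guest_prio(int pguest, int pmax)
{
        return (pguest * pmax) / 99;    /* integer division implements the floor */
}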

The guest-provided scheduling parameters apply if no virtual interrupt is waiting to be injected or is currently processed by the guest. As soon as the hypervisor marks an interrupt for injection into a VCPU, it raises the VCPU to pmax. If the guest's current scheduling policy is SCHED_OTHER, it is changed to SCHED_FIFO, otherwise it is left unmodified. The boost is kept until the guest reports that it completed interrupt handling on the corresponding VCPU.


The model also supports one level of interrupt nesting to allow for accurate non-maskable interrupt (NMI) virtualization. If a virtual NMI is marked pending while interrupt boosting is already applied, the NMI boosting is added on top. Completion of both interrupt types then has to be reported by the guest CPU before it is deboosted to its base policy and priority.

5.2 Interface

Our current version of a paravirtual scheduling interface requires two new hypercalls, that is, paravirtual service calls from the guest to the hypervisor:

• Set Scheduling Parameters

This hypercall informs the hypervisor about a potentially changed scheduling policy and/or priority for the specified VCPU. The hypercall's argument list consists of the target VCPU, the new scheduling policy and the new priority value pguest.

In order to support SMP guest use cases where one VCPU may change the scheduling parameters of another VCPU, the hypercall interface can also be invoked on remote VCPUs. In our implementation for x86, we use the physical ID of the emulated local APICs to specify the target VCPU of this hypercall.

If the target VCPU is the same as the caller of the hypercall and the VCPU currently undergoes interrupt priority boosting, the boost is cleared so that no additional Interrupt Done hypercall has to be issued.

• Interrupt Done

This hypercall has to be invoked by a VCPU that finished its interrupt handling but will not invoke Set Scheduling Parameters as no reschedule is required. No arguments are passed.

For the integration in QEMU/KVM, we chose KVM's own hypercall interface that is based on vmcall/vmmcall, the vendor-specific x86 instructions to call from a guest into its hypervisor. Like other KVM hypercall interfaces, this extension is advertised to the guest via a flag in KVM's CPUID leaf.
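As an illustration of how a Linux guest might issue the two hypercalls through KVM's vmcall-based helpers, consider the following sketch. The hypercall numbers and wrapper names are placeholders of ours, not part of the existing KVM ABI.

#include <asm/kvm_para.h>

#define PV_SCHED_HC_SET_PARAM  100   /* placeholder hypercall number */
#define PV_SCHED_HC_IRQ_DONE   101   /* placeholder hypercall number */

/* Report a new policy/priority for the VCPU identified by its APIC ID. */
static void pv_sched_set_param(int apic_id, int policy, int priority)
{
        kvm_hypercall3(PV_SCHED_HC_SET_PARAM, apic_id, policy, priority);
}

/* Signal completion of interrupt handling without changing any parameters. */
static void pv_sched_irq_done(void)
{
        kvm_hypercall0(PV_SCHED_HC_IRQ_DONE);
}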

5.3 Host-Side Realization

The modifications to implement our paravirtual scheduling interface in QEMU/KVM focused on KVM's kernel module. Here we had to provide the Set Scheduling Parameters hypercall interface which, after checking and mapping the passed parameters as well as potentially deboosting the VCPU, finally invokes sched_setscheduler() for the Linux task of the corresponding VCPU.
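On the host side, the final step boils down to a call like the following sketch, assuming the guest values have already been validated and mapped; the helper name is ours, while sched_setscheduler() is the existing kernel service:

#include <linux/sched.h>

/* Apply the mapped policy and priority to the task backing the VCPU. */
static int apply_vcpu_sched(struct task_struct *vcpu_task,
                            int policy, int host_prio)
{
        struct sched_param param = { .sched_priority = host_prio };

        return sched_setscheduler(vcpu_task, policy, &param);
}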

Standard interrupt handling was also straightforward to implement. We hooked into kvm_vcpu_kick() and the IOCTLs to trigger an interrupt or an NMI from user space and applied the required boosting. Furthermore, we added the Interrupt Done hypercall interface.

More challenging was the implementation of VCPU boosting on APIC and PIC timer events. To avoid a priority boost gap between the host timer IRQ and the guest IRQ injection, we have to start the boosting as soon as the host timer fires. KVM uses Linux hrtimers for setting up guest timer events. When these timers fire, a KVM-provided handler is invoked, looks up the target VCPU and sets a flag that pending timers have to be considered on the next guest entry.¹ It is not yet clear at this point if an IRQ or an NMI will finally be injected. Moreover, on non-PREEMPT-RT kernels, the hrtimer handler runs in interrupt context, preventing a direct invocation of sched_setscheduler() due to the IRQ-unsafe implementation of that service.

To work around this, we register per-CPU kernel threads, running at the highest host priority, forwarding our boost requests from the timer handlers to the Linux scheduler. We additionally introduce an intermediate boosting state, "timer pending", that is kept until KVM decides if the pending timer will raise an IRQ or an NMI. This boost is not affected by any guest-originated clearing.

As we rely on the physical APIC ID for looking up the internal VCPU state during the Set Scheduling Parameters hypercall, our implementation is so far limited to the standard case of in-kernel guest IRQ controller emulation. All information we need is at hand here, while it would otherwise reside only in user space.

To configure pmax for every VCPU, the new IOCTL KVM_SET_MAX_PRIO is now available on KVM's VCPU file descriptors. User space is expected to call it during setup of a new virtual CPU.
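From user space this could look roughly as follows; the sketch assumes the patched KVM headers define KVM_SET_MAX_PRIO, and the exact argument encoding is an assumption of ours:

#include <sys/ioctl.h>
#include <linux/kvm.h>   /* patched headers providing KVM_SET_MAX_PRIO */

/* Configure the per-VCPU priority limit pmax right after KVM_CREATE_VCPU. */
static int set_vcpu_max_prio(int vcpu_fd, int pmax)
{
        return ioctl(vcpu_fd, KVM_SET_MAX_PRIO, pmax);
}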

The QEMU user space part only had to be extended by the pvsched=on|off option for the -rt command line switch. It controls whether paravirtual scheduling is advertised to the guest and VCPU maximum priorities are configured to the value defined by maxprio minus one.

¹ A plain flag is enough as timers always fire on the same host CPU the associated VCPU may run on. Thus, the guest is either already preempted by the host timer and the flag will simply be evaluated on guest re-entry, or this will happen as soon as the VCPU resumes (i.e., re-enters guest context).


5.4 Guest-Side Realization

To demonstrate the guest-side implementation, we chose Linux as well. Linux has the advantage of providing a well-established OS paravirtualization layer called paravirt-ops. It comes with detection and run-time deactivation services that allow an image to be reused in the absence of paravirtual scheduling support without noticeable overhead.

We first introduced a new group of paravirt-ops, pv_sched_ops. This group contains two callbacks that map directly onto the two previously defined hypervisor calls. The wrappers for these calls are:

void cpu_set_sched(int cpu, int policy, int priority);
void cpu_interrupt_done(void);
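Modeled on the other paravirt-ops groups, the new group and its wrappers might be sketched as follows; apart from pv_sched_ops and the two wrapper names, the structure layout and field names are assumptions of ours:

/* Paravirtual scheduling callbacks, filled in during boot when the
 * hypervisor advertises the feature. */
struct pv_sched_ops {
        void (*set_sched)(int cpu, int policy, int priority);
        void (*interrupt_done)(void);
};

extern struct pv_sched_ops pv_sched_ops;

static inline void cpu_set_sched(int cpu, int policy, int priority)
{
        pv_sched_ops.set_sched(cpu, policy, priority);
}

static inline void cpu_interrupt_done(void)
{
        pv_sched_ops.interrupt_done();
}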

cpu_set_sched() is called from schedule(), equipped with the parameters of the next task. Furthermore, we hook into sched_setscheduler(), where we call cpu_set_sched() if the manipulated thread is the current one on its assigned CPU.

cpu_interrupt_done() is inserted into nmi_exit(), where it is called unconditionally, and into irq_exit(). In the latter case, we only invoke the hook if we left interrupt handling and no reschedule is pending that will deboost the current CPU anyway. The following code extract shows the modified function:

void irq_exit(void)
{
        ...
        sub_preempt_count(IRQ_EXIT_OFFSET);
        if (!in_interrupt()) {
                if (local_softirq_pending())
                        invoke_softirq();
                if (!need_resched())
                        cpu_interrupt_done();
        }
        ...
}

While the pv_sched_ops declaration and setup are currently only available for x86, our hooks were exclusively added to generic code and are thus portable to other architectures. Moreover, no adaptation of this paravirtual interface was required to use it in a PREEMPT-RT guest. But we were forced to disable KVM's paravirtual MMU operations as their current implementation is incompatible with PREEMPT-RT's MMU subsystem changes.

5.5 Evaluation

Ideally, enabling paravirtual scheduling should not have any negative impact on our test scenarios. It should not raise the worst case latency nor shift the average. Realistically, paravirtual scheduling introduces overhead, related both to leaving the guest to inform the host about priority changes and to applying priority changes because of guest activity or interrupt injections.

To evaluate the impact, we repeated the test of running cyclictest with raised maximum but low AIO priority, both over the standard and over the PREEMPT-RT kernel, this time with pvsched=on. The results can be found in Figures 9 and 10, which correspond to Figures 4 and 8, respectively. We see a slight increase in the average latency on both host kernels, which is assumed to reflect the paravirtual scheduling overhead. Moreover, we see a higher worst case latency on the standard PREEMPT kernel. But as this increase does not repeat over PREEMPT-RT and the tests may not have revealed the actual worst case, we consider this a random effect.

FIGURE 9: Paravirtual Scheduling on CONFIG_PREEMPT (guest latency on 2.6.31-rc4, -rt maxprio=98,aioprio=0,pvsched=on; maximum: 31632 µs, average: 144 µs)

FIGURE 10: Paravirtual Scheduling on CONFIG_PREEMPT_RT (guest latency on 2.6.29.6-rt23, -rt maxprio=98,aioprio=0,pvsched=on; maximum: 434 µs, average: 86 µs)


6 Analysis

The measurements of the real-time properties QEMU/KVM can provide for its guests have shown the following:

• With careful tuning, worst case timer-driven guest scheduling latencies below 1 ms and average latencies below 100 µs can be achieved on a PREEMPT-RT host kernel.

• Parallel I/O activity of the guest has a significant influence on its overall latency. This can be mitigated by lowering the priority of AIO completion handling.

• Our paravirtual scheduling interface can be implemented in a way that introduces no obvious priority inversions, while only causing a slight increase in average latency.

We furthermore observed the following effects during the tests:

• Using QCOW2 images (backing or temporary) imposes high latencies on the guest, irrespective of PREEMPT-RT use and careful priority settings.

• Using local SDL graphics output or applying high load on the emulated serial port also causes priority inversions for the guest.

The latter effects are apparently related to the locking scheme in QEMU's user space part. As a global "I/O lock" is held while accessing potentially blocking Linux services, every concurrent access to QEMU's I/O device model, for example by a VCPU, will suffer from the same delay. These bottlenecks are expected to cause problems also on large SMP guest systems as the number of parallel I/O accesses naturally increases there. In the long term, QEMU's device model will therefore have to be made thread-safe by applying a more fine-grained and scalable locking scheme.

Yet hidden behind the above effects and/or not sufficiently stressed by our tests are delays caused by significant CPU load inside QEMU's I/O layer. For example, the user space networking stack may face large packet queues or decide to fork off helper programs. Such paths require further analysis, even once a more scalable locking scheme has been established. They could still impact their direct caller, namely VCPUs issuing synchronous I/O operations or the main I/O thread dispatching incoming data from the host.

As the priority inversion effects in user space dominated our tests, we did not take a closer look at the locking situation in KVM's kernel module. However, at the time of writing, there were ongoing activities to establish a more fine-grained locking scheme in kernel space as well [16].

7 Conclusion

In this paper, we analyzed QEMU/KVM regarding its currently achievable soft real-time capabilities. We demonstrated that, when carefully tuning the setup and using PREEMPT-RT on the host, sub-millisecond scheduling latencies inside guests can be achieved.

We furthermore introduced a paravirtual scheduling interface. It overcomes the priority inversion problems on embedded systems that are related to black-box scheduling of guests. Measurements demonstrated that our first implementation does not introduce obvious priority inversions on the host or guest side. It currently comes with an overhead that results in 15% to 25% higher average latencies and is expected to have a measurable impact on the overall system performance as well. This impact should be reducible by optimizing our approach, specifically by avoiding unneeded guest exits due to paravirtual scheduling updates. We expect improvements from skipping updates on context switches that do not change any parameters and from using lazy update algorithms.

Along with the publication of this paper, we will also release the source code that was developed for this research work. Our goal is to evolve the code base towards mainline acceptability.

References

[1] Top 500, Operating System share for 06/2009, TOP500.Org, 2009. http://www.top500.org/stats/list/33/os

[2] Internals of the RT Patch, Steven Rostedt, Darren Hart, Proceedings of the Linux Symposium, 2007.

[3] Real-Time Linux Wiki, 2009. http://rt.wiki.kernel.org

[4] KVM: The Linux Virtual Machine Monitor, Avi Kivity et al., Proceedings of the Linux Symposium, 2007.

[5] Kernel Based Virtual Machine, 2009. http://www.linux-kvm.org

[6] IBM Systems Virtualization, Version 2 Release 1, IBM Corporation, 2005. http://publib.boulder.ibm.com/infocenter/eserver/v1r2/topic/

[7] QEMU - open source processor emulator, 2009. http://www.qemu.org

[8] qemu-kvm git repository, 2009. http://git.kernel.org/?p=virt/kvm/qemu-kvm.git

[9] Real-Time group scheduling, Linux kernel 2.6.31, Documentation/scheduler/sched-rt-group.txt, 2009.

[10] kvm kernel git repository, 2009. http://git.kernel.org/?p=linux/kernel/git/avi/kvm.git

[11] RT test utils, 2009. http://git.kernel.org/?p=linux/kernel/git/tglx/rt-tests.git

[12] The Calibrator (v0.9e), Stefan Manegold, 2004. http://homepages.cwi.nl/~manegold/Calibrator

[13] Preventing Guests from Spinning Around, Thomas Friebel, Xen Summit 2008. http://www.xen.org/files/xensummitboston08/LHP.pdf

[14] [PATCH RFC 0/4] Paravirtual spinlocks, Jeremy Fitzhardinge, 2008. http://permalink.gmane.org/gmane.comp.emulators.xen.devel/53239

[15] The Open Group Base Specifications Issue 6, IEEE Std 1003.1, IEEE and Open Group, 2004.

[16] KVM Mailing List: [RFC] more fine grained locking for IRQ injection, Gleb Natapov, 2009. http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/37114