OSE 2011– OSE – virtual machines 1 Operating Systems Engineering Virtual Machines By Dan Tsafrir, 25/5/2011


Operating Systems Engineering

Virtual Machines

By Dan Tsafrir, 25/5/2011


What’s a virtual machine?

A VM is a simulation of a full computer With its disk & NIC & OS & user-level apps, …

Running as an application On some “host” computer Simulation is called a “guest”


VMs – requirements

Simulation needs to be accurate Emulate HW faithfully, handle weird quirks of kernels & such Reproduce bugs exactly

Simulation needs to be isolated Guest must not break out of VM SW inside guest might be faulty and/or malicious

Simulation needs to be fast Well, as fast as possible…

Simulation needs to be believable Guest shouldn’t be able to distinguish VM from real computer

The “blue pill” saga [ http://en.wikipedia.org/wiki/Blue_Pill_(malware) ]

In reality, if guests can accurately time stuff, they can tell they are virtualized (And indeed, viruses often refuse to work when virtualized)


VMs – origin

Late 1960s IBM used VMs to share mainframes

Late 1990s VMware re-popularized VMs (for x86 HW) Economic boom: nowadays a multi-billion-dollar business Everyone is playing

SW: Microsoft, IBM, Red Hat, Oracle, … HW: Intel, AMD, ARM, IBM, Oracle, …


VMs – why?

For developers & power users

One computer w/ multiple OSes My Win 7 laptop also runs Ubuntu My MacBook Pro @ home also runs XP (for office)

Kernel development Like QEMU, but performs reasonably


VMs – why?

Business case: saves money! Server consolidation

Once we had underutilized machines per service… Reduces cost of HW, power consumption, cooling

Portability (why should Intel/AMD/IBM care about consolidation?) Decouples OS from HW and makes upgrades easy

Increased robustness Can backup entire machine + easily restore if HW breaks No need to reinstall all SW Can isolate important apps in their own VM (safety)

Makes cloud models possible Such as Amazon’s EC2 (“Elastic Compute Cloud”)

Certain costly sys-admin chores made much easier Provisioning a new machine (just clone ready image)


What’s in a name SW that runs the show (3 names referring to same thing):

VMM Virtual machine monitor

Hypervisor (Of IBM origin) Sometimes denoted “HV”

~Host

VMMs Citrix Xen, KVM, VMware ESXi, MS Hyper-V, IBM pHyp, …

2 possible settings Next 2 slides…


Hosted VMM (“type 2 hypervisor”)

• Like VMware Workstation, Parallels, VirtualBox, QEMU, …

• Typically personal use


Bare metal / native VMM (“type 1 hypervisor”)

• XenServer, VMware ESXi, MS Hyper-V, IBM pHyp, …

• Typically for servers, data centers, clouds


VMM multiplexes HW

Just like an OS…

Divides memory among guests Related: de-duplication, ballooning

Time-shares CPU among guests Related: notion of VCPU vs. PCPU (can hot-plug)

Simulates per-guest virtual devices Disk Network, …


Virtualization refinement

Paravirtualization Guest OS is aware it is being virtualized For performance purposes Paravirtualized devices

HW support Intel VT-x, AMD-V


ASSUMING NO HW SUPPORT: How to virtualize x86…


VMs – how?

SW interpretation, instruction by instruction Can do it, but much, much too slow

Idea1: when possible, execute VM’s instructions on real CPU Works fine for most instructions (e.g., add %eax, %ebx) But what about isolation? (e.g., VM writes outside its memory)

Idea2: run VMs at CPL=3 Ordinary instructions work fine Writing to %cr3 traps to VMM

VMM examines guest’s page table VMM can manipulate page table if it wants Only then set %cr3 and resume VM

This virtualization model is called: “trap & emulate”
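
The trap-&-emulate flow above can be sketched as a toy simulation. This is only an illustration under stated assumptions: `ToyCPU`, `ToyVMM`, `PrivilegedTrap`, and the instruction names are hypothetical and do not correspond to any real VMM's API, and the "shadow" is a placeholder value rather than a real page table.

```python
class PrivilegedTrap(Exception):
    """Raised when the simulated guest runs a privileged instruction at CPL=3."""
    def __init__(self, instr, operand):
        super().__init__(instr)
        self.instr = instr
        self.operand = operand


class ToyCPU:
    """Hypothetical CPU model with only the state this sketch needs."""
    def __init__(self):
        self.cpl = 3          # the guest always runs at CPL=3
        self.real_cr3 = None  # managed exclusively by the VMM
        self.eax = 0

    def execute(self, instr, operand=None):
        if instr == "add":
            # Ordinary instruction: runs directly, no VMM involvement.
            self.eax += operand
        elif instr == "mov_to_cr3":
            # Privileged instruction: traps unless running at CPL=0.
            if self.cpl != 0:
                raise PrivilegedTrap(instr, operand)
            self.real_cr3 = operand
        else:
            raise ValueError(f"unknown instruction {instr!r}")


class ToyVMM:
    def __init__(self, cpu):
        self.cpu = cpu
        self.virtual_cr3 = None  # what the guest believes cr3 holds

    def build_shadow(self, guest_cr3):
        # Placeholder: a real VMM would examine the guest's page table here.
        return ("shadow-of", guest_cr3)

    def run(self, program):
        for instr, operand in program:
            try:
                self.cpu.execute(instr, operand)
            except PrivilegedTrap as trap:
                # Emulate: record the virtual value, install the real one,
                # then fall through to resume the guest at the next instruction.
                self.virtual_cr3 = trap.operand
                self.cpu.real_cr3 = self.build_shadow(trap.operand)
```

Running `ToyVMM(ToyCPU()).run([("add", 2), ("mov_to_cr3", 0x5000)])` executes the `add` directly, while the cr3 write is intercepted so that the guest's value and the real value differ.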


VMM hides real machine

Virtual vs. real resources Virtual vs. real cr3

Virtual cr3: the VM (thinks it) sets the real cr3 Real cr3: exclusively managed (= virtualized) by VMM

Virtual vs. real machine-defined data structures Virtual page table: VM thinks it’s real Real page table: real page tables virtualized by VMM

VMM’s job Make guest see only virtual machine state Completely hide & protect real machine state

Problems Trap-&-emulate is tricky on x86

Not all privileged instructions trap at CPL=3 All those traps can be slow…


x86 state we must virtualize

state                  reason for hiding it
CPL (low bits of CS)   always 3; guest sometimes expects it to be 0
GDT descriptors        their DPL (descriptor priv level) is 3; guest may expect 0
gdtr                   points to “shadow” (real) GDT
IDT descriptors        trap to VMM code, not guest kernel (VMM forwards or fakes interrupts to guest when necessary)
idtr                   points to “shadow” (real) IDT
page tables            entries don’t map to expected physical address
cr3                    points to “shadow” page table
IF in EFLAGS           interrupts must always be on when in guest mode
cr0                    can’t allow guest to go into real mode
…


Terminology

Letters H = host G = guest P = physical V = virtual A = address

Combinations GVA = guest virtual address GPA = guest physical address HPA = host physical address …


Providing guest with illusion of physical memory (simplistic)

Guest view Wants to start at PA=0 Wants to use all “installed” DRAM

Host opposing view Must support several guests, they can’t all start at 0 Must protect one VM’s memory from the others

Idea Fake a smaller DRAM size than real DRAM Ensure paging is enabled Rewrite guest’s PTEs


Providing guest with illusion of physical memory (simplistic)

Example
VMM allocates a guest phys mem 0x1000000 to 0x2000000
VMM gets trap if guest changes cr3 (guest @ CPL=3)
VMM copies guest's page table to "shadow" page table
While copying, VMM adds 0x1000000 to each PA in shadow table
VMM checks that each resulting HPA is < 0x2000000
Must copy the guest's page table, so guest doesn't see VMM's modifications to PAs
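
A minimal sketch of the copy-and-relocate step in the example, assuming (for illustration only) that a page table is a flat dictionary from virtual page address to physical page address. The names `GUEST_BASE`, `GUEST_LIMIT`, and `make_shadow` are hypothetical; the two constants are the bounds used in the example.

```python
GUEST_BASE = 0x1000000   # start of the host memory given to this guest
GUEST_LIMIT = 0x2000000  # end (exclusive) of that region


def make_shadow(guest_page_table):
    """Build a shadow copy of a flat {VA: guest PA} table.

    Each guest PA is relocated by GUEST_BASE and bounds-checked. The guest's
    own table is never modified, so the guest cannot see the relocation.
    """
    shadow = {}
    for va, gpa in guest_page_table.items():
        hpa = gpa + GUEST_BASE
        if not (GUEST_BASE <= hpa < GUEST_LIMIT):
            raise MemoryError(f"guest PA {gpa:#x} is outside the guest's memory")
        shadow[va] = hpa
    return shadow
```

For instance, a guest entry mapping some VA to PA 0x3000 becomes a shadow entry mapping the same VA to HPA 0x1003000, while a guest PA of 0x1000000 or more is rejected.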


Address translation (reminder)

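[Figure: 4-level page-table walk. CR3 points to the top-level table; the 48bit VA is split into four 9bit indices (p0, p1, p2, p3) and a 12bit offset; each index selects one of the 512 PTEs (8B each) in a 4KB page-table page (the selected entries are labeled Q, W, K in the figure); the last level yields the PA.]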
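
As a reminder of the arithmetic, the split of a 48bit VA into four 9bit indices and a 12bit offset can be written out as follows. The function name `split_va` is hypothetical; the field names follow the figure.

```python
def split_va(va):
    """Split a 48-bit virtual address into (p0, p1, p2, p3, offset).

    p0 indexes the top-level table (the one CR3 points to), p3 the last level.
    Each index is 9 bits (512 PTEs per 4KB table page); the offset is 12 bits.
    """
    offset = va & 0xFFF
    p3 = (va >> 12) & 0x1FF
    p2 = (va >> 21) & 0x1FF
    p1 = (va >> 30) & 0x1FF
    p0 = (va >> 39) & 0x1FF
    return p0, p1, p2, p3, offset
```

The byte offset of a PTE inside its table page is then the index times 8, since each PTE is 8 bytes.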


Providing guest with illusion of physical memory (realistic)

Host allocates N pages to guest No need for them to be contiguous in phys mem Host maintains a GPA_to_HPA mapping (say, using a hash) GPAs are contiguous

What happens when guest changes cr3 Assume guest assigns GPA1 to cr3 A trap will occur and host will gain control Host’s goal:

Generate, on the fly, the shadow page table hierarchy From GVA to HPA There’s only one such shadow hierarchy at any given time, per core



The host’s actions Saves GPA1 internally Allocates brand new zeroed page = root of the shadow hierarchy

Let base of new page be HPA1 Assigns HPA1 to cr3 Resumes guest, which immediately faults on GVA2

GVA2 = virtual address of 1st fetched instruction of guest Takes 9 most significant bits from GVA2

Assume 48bit VA = 4 levels hierarchy (9bits each) + 4KB page 8 bytes per PTE

Computes GPA_to_HPA(GPA1) + 9bits * 8 = HPA of the guest PTE that points (by GPA) to the 2nd level of the guest’s hierarchy

…



The host’s actions (cont.)
…
Continue like so with next 9bits, repeatedly, until reaching the HPA of the requested page = HPA2
Now, there needs to be a GVA2=>HPA2 mapping in the shadow hierarchy
Adds the translation GVA2=>HPA2 to shadow hierarchy, starting at HPA1 and allocating the rest of the levels in the hierarchy as needed

Resumes guest Repeats same procedure when next fault occurs

This continues until all address space is mapped Or until next context switch (=> need to start over)
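
The fault-driven procedure above can be sketched as a small simulation. It is a simplified model under several assumptions that are not in the slides: host memory is a dictionary keyed by the HPA of each 8-byte PTE slot, PTEs hold only a page base (no flag bits), host pages come from a bump allocator, and the names `Host`, `guest_walk`, `shadow_fill`, and `on_guest_fault` are hypothetical.

```python
PAGE = 4096
LEVELS = 4


def indices(gva):
    """Return the four 9-bit table indices of a 48-bit GVA, top level first."""
    return [(gva >> (12 + 9 * (LEVELS - 1 - i))) & 0x1FF for i in range(LEVELS)]


class Host:
    def __init__(self):
        self.mem = {}          # HPA of a PTE slot -> page base stored in it
        self.next_free = PAGE  # bump allocator over host pages
        self.gpa_to_hpa = {}   # guest page base -> host page base

    def alloc_page(self):
        hpa = self.next_free
        self.next_free += PAGE
        return hpa

    def give_guest_page(self, gpa):
        self.gpa_to_hpa[gpa] = self.alloc_page()

    def translate(self, gpa):
        """GPA_to_HPA at page granularity, preserving the in-page offset."""
        return self.gpa_to_hpa[gpa & ~(PAGE - 1)] + (gpa & (PAGE - 1))

    def guest_walk(self, guest_cr3, gva):
        """Walk the guest's tables (which hold GPAs); return the HPA of the page."""
        table_gpa = guest_cr3
        for idx in indices(gva):
            pte_hpa = self.translate(table_gpa) + idx * 8
            table_gpa = self.mem[pte_hpa]  # KeyError models a non-present PTE
        return self.translate(table_gpa)

    def shadow_fill(self, shadow_root, gva, target_hpa):
        """Add GVA => HPA to the shadow hierarchy, allocating levels as needed."""
        idxs = indices(gva)
        table = shadow_root
        for idx in idxs[:-1]:
            slot = table + idx * 8
            if slot not in self.mem:
                self.mem[slot] = self.alloc_page()
            table = self.mem[slot]
        self.mem[table + idxs[-1] * 8] = target_hpa

    def on_guest_fault(self, guest_cr3, shadow_root, gva):
        """Handle one shadow page fault; the guest would be resumed afterwards."""
        hpa = self.guest_walk(guest_cr3, gva)
        self.shadow_fill(shadow_root, gva, hpa)
        return hpa
```

In this model the shadow root plays the role of HPA1 and the saved `guest_cr3` the role of GPA1. On a context switch the host would allocate a fresh shadow root and start over, as the slide notes.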



Building shadow page tables is costly

Can we cache? Yes, but need to write protect all pages involved

Will generate trap whenever pages are modified Host would be able to respond accordingly

The problem How do we know when to stop write-protecting?

Solution Must employ some heuristic Need not be perfect as long as it maintains correctness


Not all sensitive reads/writes trap at CPL=3

Push CS Will show CPL=3 (not 0) if guest reads pushed value

sgdt (save gdtr) Reveals real gdtr if guest reads it

pushf Pushes real IF Always on in guest mode (why?) Host injects interrupts to guest as needed

popf Ignores IF in CPL=3 => no trap => host won’t know if guest wants interrupts

iret Invoked, e.g., after handling a system call No ring change => SS/ESP will not be restored
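
To make the popf case concrete, here is a toy model of how popf treats IF in protected mode. It is a deliberate simplification: only the IF bit is modeled (the handling of IOPL and reserved bits is ignored), and the function name `popf` and its parameters are hypothetical.

```python
IF_BIT = 1 << 9  # interrupt-enable flag in EFLAGS


def popf(eflags, popped, cpl, iopl):
    """Return the new EFLAGS after popf (toy model, only IF is handled).

    If CPL <= IOPL the popped IF takes effect. Otherwise IF silently keeps
    its old value: no exception is raised, so a VMM gets no trap.
    """
    if cpl <= iopl:
        return popped
    return (popped & ~IF_BIT) | (eflags & IF_BIT)
```

With the guest kernel at CPL=3 and IOPL=0, a popf that tries to clear IF leaves it set and raises nothing, which is why the host cannot learn that the guest wanted interrupts off.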


How can we cope?

Solution: binary translation Rewrite guest code Change every problematic instruction to INT 3 Keep track of original instructions + emulate in VMM Note: INT 3 is 1-byte long => small enough to overwrite any inst

Must be done dynamically at runtime Need to know whether bytes are code or data Need to know where instructions start (x86 is CISC) Consequently, scan code only as executed
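
The patch-and-remember step can be sketched over a byte buffer. This is an illustration only: the `Patcher` class is hypothetical, and it assumes the caller already knows that the given offset is the start of an instruction (which, per the slide, is only known for code as it is executed). The opcode values 0xCC (INT 3) and 0x9D (popf) are the real one-byte x86 encodings.

```python
INT3 = 0xCC  # one-byte breakpoint opcode
POPF = 0x9D  # one-byte popf opcode


class Patcher:
    """Overwrite problematic instructions with INT 3 and remember the originals."""

    def __init__(self, code):
        self.code = bytearray(code)
        self.original = {}  # offset -> original first opcode byte

    def patch(self, offset):
        # Only the first byte is overwritten; INT 3 fits in any instruction.
        self.original[offset] = self.code[offset]
        self.code[offset] = INT3

    def on_int3(self, offset):
        """Called by the (simulated) VMM on a trap: return what to emulate."""
        return self.original[offset]
```

For the byte sequence `55 9D 90` (push %ebp, popf, nop), patching offset 1 yields `55 CC 90`, and the trap handler later recovers 0x9D as the instruction to emulate.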


Binary translation – example

Assume guest kernel starts like so:

    pushl %ebp
    …
    popf
    …
    jnz x
    …
    j?? y
 x: …
    j?? z

Rewrite: INT3 instead of bad instructions (popf) and of the first jump (jnz)

Then start guest kernel

INT3 traps to host: emulates popf; for a jump, looks where the jump could go

For each jump: translate upon the 1st encounter of block, keep track of translated code

Next time, replace INT3 with original instructions if target is known (when j is direct)


BT: indirect jumps & ret

Same, but

Can’t replace INT3 with original jump Since we’re not sure address will be the same next time ret = indirect jump via pointer on the stack => must take trap every time (slow!)

Can we speed up? Yes, by writing our own code rather than hacking the original

=> more aggressive translation, addresses change See VMware’s “A Comparison of Software and Hardware Techniques for x86 Virtualization”, by Adams & Agesen, in ASPLOS 2006 http://www.vmware.com/pdf/asplos235_adams.pdf

Read it to make sure you know how!


Intel/AMD HW support for VMs

Much easier to implement VMM w/ reasonable performance HW itself directly maintains per-guest virtual state

CS (w/ CPL), EFLAGS, idtr, etc. In-memory HW struct can be loaded/unloaded like a context switch

HW knows it’s in guest mode Instructions directly modify virtual state Avoids lots of traps to VMM

HW basically adds a new privilege level VMM mode, CPL=0, ..., CPL=3 Guest-mode/CPL=0 isn’t fully privileged

No traps to VMM on system calls HW handles CPL transition

No need for shadow page tables Next slide…


Nested paging

In guest mode, there are *2* page tables in effect Guest page table & host page table

Guest memory refs go through multiple lookups
Guest tables hold GVA=>GPA translations
HW knows this, so in every level of the hierarchy HW automatically translates GPA to HPA, then continues the table walk process
HW table walk can take ~20 memory refs
=> There’s a new “page table cache” (in addition to the TLB), which caches partial translations of the GVA in an attempt to skip levels (shown to be very effective)

Thus, guest can directly modify its page table w/o VMM having to shadow it No need for VMM to write-protect guest page tables No need for VMM to track cr3 changes
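
The "~20 memory refs" figure can be checked with a short count of the worst case, assuming no TLB or page-table-cache hits at all. The function name `nested_walk_refs` is hypothetical.

```python
def nested_walk_refs(guest_levels, host_levels):
    """Count memory references in an uncached nested (two-dimensional) walk."""
    refs = 0
    for _ in range(guest_levels):
        refs += host_levels  # host walk: GPA of the guest table page => HPA
        refs += 1            # read the guest PTE itself
    refs += host_levels      # host walk: GPA of the final data page => HPA
    return refs
```

For 4 guest levels and 4 host levels this gives 4 * (4 + 1) + 4 = 24 references, compared with 4 for a native walk, which is why caching partial walks matters so much.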


Nested paging

Is nested paging faster than shadow paging? Depends… (on what?)


Devices

Trap INB and OUTB
DMA addresses are physical; VMM must trust devices or utilize HW support (IOMMU/IOTLB)
Devices nowadays are typically shared (=> virtualized)

If you want to share between multiple guests Each guest gets a part of the disk Each guest looks like a distinct Internet host Each guest gets an X window

VMM might mimic some standard (or legacy) devices Regardless of actual h/w on host computer

Guest might run paravirtualized drivers Typically aggregate messages before switching to VMM

For high-performance I/O => device assignment Sharing through SR-IOV (new standard)
