CrySys guest-lecture: Virtual machine introspection on modern hardware

Virtual machine introspection on modern hardware

Tamas K Lengyel

@tklengyel

2/19/2015 – CrySys Lab, Budapest

Agenda

1. VMI intro

2. Intel’s split-TLB

3. Intel’s EPT and its limitations

4. Intel’s #VE / VMFUNC / EPTP-switching

5. Intel’s SMM/DMM

6. ARM

7. Conclusion

Virtual Machine Introspection (VMI)

Interpret virtual hardware

● Network, Disk, vCPU & Memory

● The semantic gap problem:

Reconstruct high-level state information from low-level data sources.


Bridging the semantic gap

● The guest OS is in charge of managing the virtual hardware

● How to get the info from it?

o Install an in-guest agent to query it using standard interfaces

o If the OS is compromised, the in-guest agent can be disabled or tampered with

o Just as vulnerable as your antivirus


Bridging the semantic gap

● Replicate guest OS functions externally

o Requires expert knowledge of OS internals and hardware behavior

o Requires debug data to understand in-memory structures

● Need access to VM memory, vCPU registers, etc.

o Hypervisor support required (or custom VMM)

o Need to emulate the hardware virtual-to-physical (v2p) translation
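Emulating the v2p translation from outside the guest starts with splitting a guest virtual address into page-table indices. A minimal sketch, assuming 4-level x86-64 paging with 4 KB pages; the struct and function names are illustrative and do not come from any VMI library.

```c
#include <stdint.h>

/* Indices used by a 4-level x86-64 page walk.
 * Assumption: 4 KB pages, no 5-level paging, no large pages. */
struct va_indices {
    unsigned pml4;   /* bits 47:39 */
    unsigned pdpt;   /* bits 38:30 */
    unsigned pd;     /* bits 29:21 */
    unsigned pt;     /* bits 20:12 */
    unsigned offset; /* bits 11:0  */
};

/* Split a guest virtual address into its table indices. */
struct va_indices split_va(uint64_t va)
{
    struct va_indices i;
    i.pml4   = (unsigned)((va >> 39) & 0x1ff);
    i.pdpt   = (unsigned)((va >> 30) & 0x1ff);
    i.pd     = (unsigned)((va >> 21) & 0x1ff);
    i.pt     = (unsigned)((va >> 12) & 0x1ff);
    i.offset = (unsigned)(va & 0xfff);
    return i;
}

/* Guest-physical address of the entry to read at one level of the walk.
 * table_base is CR3 or the previous level's entry; flag bits are masked off.
 * Each entry is 8 bytes wide. */
uint64_t entry_addr(uint64_t table_base, unsigned index)
{
    return (table_base & 0x000ffffffffff000ULL) + (uint64_t)index * 8;
}
```

An external tool would read the 8-byte entry at each `entry_addr` through the hypervisor's memory-access interface, check the present bit, and repeat for the next level.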

Translation lookaside buffer (TLB) poisoning

Translation lookaside buffer (TLB)

● Virtual to physical address translation is expensive

● Hardware-managed, transparent cache of the results

● Separate caches for data read/write and instruction fetch (Harvard-type architecture)!

● The OS can flush it, but cannot query it

o Opportunity to whack it out of sync!

o Shadow Walker / FU rootkit

TLB poisoning

Original algorithm:

Input: Splitting Page Address (addr)

Pagetable Entry for addr (pte)

invalidate_instr_tlb (pte); // flush the stale iTLB entry

pte = the_shadow_code_page (addr); // replace PTE in memory

mark_global (pte); // disable auto-flush

reload_instr_tlb (pte); // load it into the iTLB

pte = the_orig_code_page (addr); // put original entry back

TLB poisoning and virtualization

VMENTRY/VMEXIT automatically flush the entire TLB

● TLB poisoning is impossible

● Performance hit

Introduction of TLB tagging (VPID) in Intel Nehalem (2008)

● 16-bit field specified in the VMCS for each vCPU

● Performance boost!

● VM TLB entries invisible to the VMM

● The problem is not the split TLB, it’s the TLB itself

TLB poisoning with Windows / Linux

TLB poisoning uses global pages

● CR4.PGE (bit 7)

● Makes PTEs marked as global survive context switches

● Great performance boost for kernel pages!

Windows 7

● Regularly flushes global pages by disabling & re-enabling CR4.PGE

Linux

● Doesn’t touch CR4.PGE after boot

The tagged TLB in Xen

The TLB tag is assigned to the vCPU from a global counter

● asid->asid = data->next_asid++;

No flushes, just assign a new tag when needed!

When the counter overflows, flush everything and restart from 1

A new TLB tag is assigned to the vCPU every time a MOV-TO-CR3 is caught!

● The use of global pages is negated in the guest!

● The TLB needs to be primed on each context-switch!
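The counter-based tagging can be sketched as follows. This is a simplified model and not Xen's actual code: the struct layouts, the generation field and the function names are assumptions made for illustration, and only the `asid = next_asid++` step is taken from the slide.

```c
#include <stdbool.h>
#include <stdint.h>

/* Global tag allocator. Tag 0 is left unused here so that numbering
 * restarts from 1 after an overflow, as on the slide. */
struct asid_pool {
    uint32_t next_asid;   /* next free tag */
    uint32_t max_asid;    /* highest usable tag */
    uint64_t generation;  /* bumped on every wrap-around */
};

struct vcpu_asid {
    uint32_t asid;
    uint64_t generation;  /* generation this tag was handed out in */
};

/* Hand a fresh tag to the vCPU instead of flushing its TLB entries.
 * Returns true when the counter overflowed, in which case the caller
 * must flush all tagged TLB entries before resuming the guest. */
bool asid_assign(struct asid_pool *pool, struct vcpu_asid *v)
{
    bool need_flush = false;

    if (pool->next_asid > pool->max_asid) {
        pool->generation++;
        pool->next_asid = 1;
        need_flush = true;
    }
    v->asid = pool->next_asid++;
    v->generation = pool->generation;
    return need_flush;
}

/* A tag from an older generation may have been reused by another vCPU. */
bool asid_is_stale(const struct asid_pool *pool, const struct vcpu_asid *v)
{
    return v->generation != pool->generation;
}
```

Calling `asid_assign` on every trapped MOV-TO-CR3 reproduces the behavior described above: the old entries are never flushed, they simply become unreachable under the new tag.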

The tagged TLB in KVM

Tag is assigned when the vCPU structure is created

• Doesn’t matter if the vCPU is activated or not.

• Disables tagging and reverts to the old VMENTRY/VMEXIT TLB flushing if out of tags

Priming the TLB in Linux guests on KVM is a problem!

• However, the split TLB still has issues

The sTLB!

Intel Nehalem introduced a second-level (victim) cache, the sTLB

The problem:

● Split-TLB relies on a custom page-fault handler being called to re-split the TLB when an entry is evicted

● With the sTLB, the entry is brought back...

o Both into the iTLB & the dTLB

o Split-TLB becomes unsplit!

● Split-TLB poisoning is unreliable in VMs!

Disabling the sTLB

2014: MoRE Shadow Walker - The evolution of TLB splitting on x86

● sTLB doesn’t merge entries with conflicting PTE permissions

● Definitely applies to EPT PTEs but hasn’t been tested with regular PTEs

● Make the shadow code page X-only before loading it and the custom #PF handler will be triggered when it is evicted

Reconstructing the guest OS’s view

Use standard OS structures at known locations

● KPCR, KDBG

● Linux Sysmap

Scan for objects of interest with signatures

● Memory is in fast-flux

● False positives

● Weak signatures

● Cross-view validation is costly

Interposition on x86 Intel

Need to trap VM execution for coherent view

● Avoids memory fast-flux problems

● Can avoid DKOM/DKSM

Two main options

● Tracing via EPT

o Granularity is that of a page but can trap R/W/X

● Tracing via #BP

o Only traps X on direct hit but is guest-visible

o Can be protected with EPT to hide it

Extended Page Tables (EPT)

Speed up guest virtual to machine physical address translation.

Two sets of tables:

1st layer managed by guest OS

2nd layer managed by the VMM

Permissions can be different in the two layers!


Up to 512 EPTs per VM!

● Most VMMs just assign 1 EPT / VM

● Value defined in VMCS for each vCPU

● Xen 4.6 will allow different EPT / vCPU

Permission stored in bits 0-2 of the EPT PTE

● r : 1, /* bit 0 - Read permission */

● w : 1, /* bit 1 - Write permission */

● x : 1, /* bit 2 - Execute permission */

Lots of software-programmable bits available in the EPT PTE

● access : 4, /* bits 61:58 - p2m_access_t */


Unlike normal PTEs, permission can be:

• R

• X

• R/W

• R/X

• R/W/X

• W and W/X trigger an EPT misconfiguration

• Still traps to the VMM but less information is transferred
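The permission rules above can be checked in a few lines. A sketch under stated assumptions: the macro and function names are made up for illustration, and `exec_only_ok` stands for the CPU's support for execute-only translations, which the caller is assumed to have read from the VMX capability MSRs.

```c
#include <stdbool.h>
#include <stdint.h>

#define EPT_R (1u << 0)  /* bit 0 - read    */
#define EPT_W (1u << 1)  /* bit 1 - write   */
#define EPT_X (1u << 2)  /* bit 2 - execute */

/* True if the low three bits of an EPT PTE would raise an EPT
 * misconfiguration instead of an ordinary EPT violation. */
bool ept_perm_misconfigured(uint64_t epte, bool exec_only_ok)
{
    unsigned p = (unsigned)(epte & 7);

    if ((p & EPT_W) && !(p & EPT_R))
        return true;   /* W or W/X: writable but not readable */
    if (p == EPT_X && !exec_only_ok)
        return true;   /* X alone needs hardware support */
    return false;      /* not-present, R, X, R/W, R/X, R/W/X */
}
```

A monitor that wants to restrict a page would pick one of the valid combinations and keep its intended policy in the software-programmable bits.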

EPT limitations

● When a violation is trapped, the information only gives the start address and type of the violation.

● A single R/W violation may touch up to 8 bytes!

● Not sufficient to match the violation offset against known locations

● Violations in the vicinity of a watched area need to be treated as potential hits as well!
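Since only the start address is reported, a monitor has to test for overlap rather than for an exact match. A minimal sketch, assuming the 8-byte upper bound stated above; the function name and parameters are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Upper bound on the access width, as stated on the slide. */
#define MAX_ACCESS_BYTES 8u

/* True if an access starting at fault_gpa could overlap the watched
 * range [watch_start, watch_start + watch_len). The access is assumed
 * to cover at most [fault_gpa, fault_gpa + MAX_ACCESS_BYTES). */
bool may_touch_watched(uint64_t fault_gpa,
                       uint64_t watch_start, uint64_t watch_len)
{
    return fault_gpa < watch_start + watch_len &&
           watch_start < fault_gpa + MAX_ACCESS_BYTES;
}
```

An access that starts a few bytes before the watched field is therefore reported as a potential hit, while one that starts 8 or more bytes earlier is not.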

EPT limitations

Read/Write violation ambiguities

“An EPT violation that occurs during as a result of execution of a read-modify-write operation sets bit 1 (data write). Whether it also sets bit 0 (data read) is implementation-specific and, for a given implementation, may differ for different kinds of read-modify-write operations.” - Intel SDM

It is possible to siphon data using r-m-w operations from a page that doesn’t allow reading!

Fixed in Xen 4.5
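One conservative way to deal with the ambiguity is to treat every write violation as a possible read. The sketch below decodes the low bits of the exit qualification and applies that rule; it illustrates the idea only and is not necessarily how Xen 4.5 addressed the issue. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low bits of the EPT-violation exit qualification. */
#define EPT_QUAL_READ  (1ULL << 0)  /* data read         */
#define EPT_QUAL_WRITE (1ULL << 1)  /* data write        */
#define EPT_QUAL_FETCH (1ULL << 2)  /* instruction fetch */

struct ept_access {
    bool read;
    bool write;
    bool fetch;
};

/* Decode the access type exactly as the hardware reported it. */
struct ept_access decode_ept_qualification(uint64_t qual)
{
    struct ept_access a;
    a.read  = (qual & EPT_QUAL_READ)  != 0;
    a.write = (qual & EPT_QUAL_WRITE) != 0;
    a.fetch = (qual & EPT_QUAL_FETCH) != 0;
    return a;
}

/* Conservative rule: a write may be a read-modify-write whose read
 * half was not reported, so it counts as a potential read as well. */
bool may_have_read(uint64_t qual)
{
    return (qual & (EPT_QUAL_READ | EPT_QUAL_WRITE)) != 0;
}
```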

EPT limitations

How to let the VM progress without missing a potential event?

Emulate the instruction!

• Xen comes with a built-in x86 emulator!

• Option to “emulate with no write”

• Perform the instruction but don’t let it write to memory

Intel #VE (Virtualization Exceptions)

EPT-based tracing has a 4 KB granularity

● Too much overhead when we only care about certain points

● Handle EPT violations in the guest!

#VE Interrupt Service Routine (ISR)

● Interrupt handler within the guest

● Defined in the guest IDT as vector 20

● EPT PTE bit 63 determines if a violation triggers #VE or VMEXIT
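The role of bit 63 can be sketched as a small decision function. This is a simplification: it assumes bit 63 acts as a "suppress #VE" flag that only matters while the “EPT-violation #VE” VM-execution control is set, and it ignores the other conditions under which the CPU falls back to a VM exit. The names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 63 of an EPT PTE: when set, violations on this entry are not
 * converted into #VE. */
#define EPT_SUPPRESS_VE (1ULL << 63)

enum ept_delivery {
    DELIVER_VMEXIT,  /* violation is handled by the VMM */
    DELIVER_VE       /* violation is handled by the guest's #VE ISR */
};

/* Decide where an EPT violation on this entry ends up.
 * ve_control_enabled: state of the "EPT-violation #VE" control. */
enum ept_delivery ept_violation_delivery(uint64_t epte, bool ve_control_enabled)
{
    if (ve_control_enabled && !(epte & EPT_SUPPRESS_VE))
        return DELIVER_VE;
    return DELIVER_VMEXIT;
}
```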

VMFUNC and EPTP switching

“This instruction allows software in VMX non-root operation to invoke a VM function, which is processor functionality enabled and configured by software in VMX root operation. No VM exit occurs.”

VMFUNC with EAX=0 => EPTP switching

● Remember how we can have up to 512 EPTs per VM?

● Pass the EPT index as parameter in ECX (0-511)

● Faulting translation of GPFN is performed via the new table!

● Guest never really “knows” the value of EPTP

EPTP switching

How do we go “back” to the original EPTP afterwards?

Single-step and restore?

“The key observation is that at any single point in time, a given hardware thread can be fetching an instruction or reading data, but not both. There are ways of avoiding the single-step too. [...] It's an optimization certainly, but it's not required, and it's not a technique we have placed in the public domain. You could try talking to us under NDA or figure it out for yourself.” - Ed White (Intel), xen-devel, 1/16/2015

1-setting of the “EPT-violation #VE” VM-execution control

Optional CPU feature baked into the chip

During VMFUNC[0], the CPU saves the updated EPT index (ECX[15:0]) and reports it in the next #VE

Allows you to juggle the “views” without missing anything!

Index 0 := selective X traps; 1 := only X; 2 := only R/W

• If #VE with IDX=0, switch to IDX=1 (VMEXIT if of interest)

• If #VE with IDX=1, switch to IDX=2

• If #VE with IDX=2, switch to IDX=0
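The view rotation can be written as a small state machine. A sketch of the decision logic only: the actual switch would be done by the guest's #VE handler executing VMFUNC with the returned index, which is not shown, and the type and function names are made up for illustration.

```c
#include <stdbool.h>

/* View indices as listed on the slide. */
enum ept_view {
    VIEW_SELECTIVE_X = 0,  /* selective X traps */
    VIEW_X_ONLY      = 1,  /* only X allowed    */
    VIEW_RW_ONLY     = 2   /* only R/W allowed  */
};

struct ve_action {
    enum ept_view next_view;  /* index to pass to VMFUNC[0] */
    bool vmexit;              /* also notify the VMM? */
};

/* Decide what the #VE handler does, given the view the fault
 * occurred in and whether the faulting address is of interest. */
struct ve_action handle_ve(enum ept_view faulting_view, bool of_interest)
{
    struct ve_action a = { VIEW_SELECTIVE_X, false };

    switch (faulting_view) {
    case VIEW_SELECTIVE_X:
        a.next_view = VIEW_X_ONLY;
        a.vmexit = of_interest;
        break;
    case VIEW_X_ONLY:
        a.next_view = VIEW_RW_ONLY;
        break;
    case VIEW_RW_ONLY:
        a.next_view = VIEW_SELECTIVE_X;
        break;
    }
    return a;
}
```

Because a hardware thread is either fetching or accessing data at any one time, alternating between the X-only and R/W-only views lets the instruction complete before execution returns to view 0.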

System Management Mode (SMM)

SMM intended for low-level services, such as:

● thermal (fan) control

● USB emulation

● hw errata workarounds

Can be used for:

● VMI

● Anti-VMI!

Hard to take control of it on (most) Intel devices as it is loaded by the signed BIOS.

SMM

● Normal mode SMM is triggered by interrupts (SMI)

● Can be configured to happen periodically

● Always returns to the same execution mode afterwards

SMM

The problem for SMM based VMI systems:

“A limitation of any SMM-based solution [...] is that a malicious hypervisor could block SMI interrupts on every CPU in the APIC, effectively starving the introspection tool. For VMI, trusting the hypervisor is not a problem, but the hardware isolation from the hypervisor is incomplete.”

Jain et al., “SoK: Introspections on Trust and the Semantic Gap”, IEEE S&P 2014

Intel Dual-monitor mode SMM – the DMM

Available on all CPUs with VT-x

SMM can become an independent hypervisor!

SMIs are still available but...

A VMCALL executed by the VMM automatically and unconditionally traps into the DMM

Intel DMM

The VMCALL instruction can be used to instrument the VMM.

● Same way the VMM can use #BP to instrument a VM.

The DMM can enter any execution mode on the system!

● Full control over the execution flow

● Hidden VMs

The DMM can disable SMIs for a VM!

● Forced execution

● Nothing in the system can preempt it (SMIs, NMIs, etc.)

ARM

ARM 2-stage paging

Very similar to EPT

• Fewer software-available bits

• We need to store our custom permissions somewhere so we know if a violation was induced by us or not

• With EPT it’s just stored directly in the PTE

Xen mem_access for ARM

• VMM maintains a radix tree to store custom permissions

• During a violation the tree is queried

• If the violation was induced by the monitor, forward it to the monitor
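The lookup step can be sketched as follows. To keep the example short, a linear table stands in for the radix tree, so this shows the decision and not Xen's data structure; all names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ACC_R 1u
#define ACC_W 2u
#define ACC_X 4u

/* One custom permission set by the monitor for a guest frame. */
struct access_entry {
    uint64_t gfn;      /* guest frame number */
    unsigned allowed;  /* ACC_* bits the monitor still allows */
};

/* Stand-in for the radix-tree query: frames without an entry are
 * unrestricted. */
unsigned lookup_access(const struct access_entry *tbl, size_t n, uint64_t gfn)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].gfn == gfn)
            return tbl[i].allowed;
    return ACC_R | ACC_W | ACC_X;
}

/* True if the stage-2 violation was caused by a restriction the
 * monitor installed, so the event should be forwarded to it. */
bool violation_induced_by_monitor(const struct access_entry *tbl, size_t n,
                                  uint64_t gfn, unsigned attempted)
{
    return (attempted & ~lookup_access(tbl, n, gfn)) != 0;
}
```

Violations that are not induced by the monitor are left to the VMM's normal fault handling.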

ARM 2-stage paging

How do we let the VM progress without potentially missing the next event?

No MTF-like singlestep available

• Emulation?

• On x86 Xen already comes with a built-in emulator

• Not so much on ARM...

The same trick we did with VMFUNC!

• Only this time it traps into the VMM

ARM SMC is trappable to the VMM!

ARM has no #BP that traps to the VMM

• Secure Monitor Call (SMC) can be configured to do so!

• SMC allowed only from VMM or guest kernel

• Better than nothing

What to use the TrustZone for?

• Integrity check the VMM!

• Make all VMM pages non-writable

• VMM #PF handler executes SMC

• Verify if change is allowed

Conclusion

Modern hardware is complex

Behavior sometimes vaguely defined

Intel really made a lot of progress since VT-x was introduced but the DMM is very odd

ARM generally needs to catch up but it’s progressing rapidly

Thanks! Questions?
