Execution Replay for
Multiprocessor Virtual Machines
George W. Dunlap, Dominic Lucchetti,
Michael A. Fetterman, Peter M. Chen
Big ideas
• Detection and replay of memory races is possible on commodity hardware
• Overhead is high for some workloads
• …but surprisingly low for other workloads
Execution Replay
[Diagram: a replayed virtual machine's CPU and memory, driven by logged non-deterministic inputs: disk, network, keyboard/mouse, and interrupts]
Uses of Execution Replay
• Reconstructing state
  – Fault tolerance
• Reconstructing execution
  – Debugging
  – Realistic trace generation
• Both
  – Intrusion analysis
Single-processor Replay
• Basic principles well understood
  – Log all non-deterministic inputs
  – Log the timing of asynchronous events
• Minimal overhead (Dunlap02)
  – 13% worst case
  – Can log for months or years
• Available commercially
  – VMware Record/Replay
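To make the single-processor principle concrete, here is a minimal, hypothetical Python sketch (not the paper's implementation): record every non-deterministic input during the original run, then feed the same log back during replay so the deterministic computation reproduces itself.

```python
import random

def run(workload_steps, log=None):
    """Run a toy workload, recording inputs if log is None, else replaying them."""
    recording = log is None
    log = [] if recording else list(log)
    outputs = []
    for _ in range(workload_steps):
        if recording:
            value = random.randint(0, 99)   # stand-in for a device input
            log.append(value)
        else:
            value = log.pop(0)              # deterministic: read from the log
        outputs.append(value * 2)           # deterministic computation
    return outputs, log

recorded_out, log = run(5)
replayed_out, _ = run(5, log=log)
assert recorded_out == replayed_out  # replay reproduces the execution
```

The only state that must be logged is the non-deterministic input stream; everything deterministic is simply re-executed.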
Replay for Multiprocessors
• Memory races in multiprocessor VMs
• The Ordering Requirement
• The CREW Protocol
  – Implementing with page protections
  – Relation to the Ordering Requirement
  – Generating constraints from CREW events
• DMA-capable devices and CREW
• Performance
The Multiprocessor Challenge
• Interleaved reads and writes
  – Fine-grained non-determinism
  – Much more difficult
• Existing solutions
  – Hardware modification
  – Software instrumentation
• SMP-ReVirt
  – Hardware MMU to detect sharing
Multiprocessor Replay
[Diagram: processors P1 and P2 share memory; P1 writes n=3 while P2 writes n=5, and a subsequent `if (n<4)` can take either branch depending on which write lands last]
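The race on this slide can be sketched directly; this is an illustrative toy, not the paper's code. Two interleavings of the same writes lead `if (n<4)` down different branches:

```python
# P1 writes n=3, P2 writes n=5, then the guest executes `if (n < 4)`.
# The branch taken depends entirely on which write lands last.
def run_interleaving(writes):
    n = 0
    for _processor, value in writes:
        n = value            # the last write wins
    return n < 4             # outcome of `if (n < 4)`

first = run_interleaving([("P1", 3), ("P2", 5)])   # P2's write lands last
second = run_interleaving([("P2", 5), ("P1", 3)])  # P1's write lands last
assert first != second  # same program, two different executions
```

This is exactly the fine-grained non-determinism a multiprocessor replay system must capture.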
Ordering Memory Accesses
• Preserving order will reproduce execution
  – a→b: “a happens-before b”
  – Ordering is transitive: a→b and b→c imply a→c
• Two instructions must be ordered if:
  – they both access the same memory, and
  – one of them is a write
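The conflict rule above is small enough to state as a predicate; this is an illustrative sketch of the rule, with made-up addresses:

```python
# Two accesses must be ordered iff they touch the same memory and at
# least one of them is a write.
def must_order(access_a, access_b):
    addr_a, is_write_a = access_a
    addr_b, is_write_b = access_b
    return addr_a == addr_b and (is_write_a or is_write_b)

assert must_order((0x1000, True), (0x1000, False))       # write/read: conflict
assert not must_order((0x1000, False), (0x1000, False))  # two reads: no order needed
assert not must_order((0x1000, True), (0x2000, True))    # different memory: no conflict
```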
Constraints: Enforcing order
[Diagram: instructions a, b in program order on P1; instructions c, d in program order on P2]
• To guarantee a→d, we could enforce all of:
  – a→d
  – b→d
  – a→c
  – b→c
  (overconstrained)
• Suppose we need b→c
  – b→c is necessary
  – a→d is redundant: a→b, b→c, and c→d give a→d by transitivity
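The redundancy argument is just transitive reachability over program order plus logged constraints. A minimal sketch (illustrative, not the paper's algorithm):

```python
# A constraint is redundant if it is already implied by program order
# plus the other constraints, via transitivity of happens-before.
def reachable(edges, src, dst):
    seen, stack = set(), [src]
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(b for a, b in edges if a == node)
    return False

# Program order: a→b on P1, c→d on P2 (as in the slide's diagram).
program_order = [("a", "b"), ("c", "d")]
constraint = ("b", "c")            # the one constraint we log
edges = program_order + [constraint]
assert reachable(edges, "a", "d")  # a→d is implied, so logging it is redundant
```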
CREW Protocol
• Each shared object is in one of two states:
  – Concurrent-Read: all processors can read, none can write
  – Exclusive-Write: one processor (the owner) can read and write; others have no access
CREW protocol, cont’d
• Enforced with hardware MMU
  – Read/write
  – Read-only
  – None
• Change CREW states on demand
  – Fault, fixup, re-execute
• CREW event
  – Increasing or reducing permission due to CREW state changes
CREW Property
• If two instructions on different processors:
  – access the same page,
  – and one of them is a write,
  there will be a CREW event on each processor between them.
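As a rough illustration of the two CREW states and their on-demand transitions, here is a hypothetical per-page state machine in Python (the real system enforces this with MMU page protections, not software checks):

```python
# Minimal sketch of a per-page CREW state machine:
# Concurrent-Read (all read-only) or Exclusive-Write (one owner).
class CrewPage:
    def __init__(self):
        self.state = "concurrent-read"
        self.owner = None
        self.events = []   # privilege changes: the slides' "CREW events"

    def access(self, cpu, is_write):
        if is_write and not (self.state == "exclusive-write" and self.owner == cpu):
            # Fault on write: demote everyone else, promote the writer.
            self.events.append(("reduce", "others"))
            self.events.append(("increase", cpu))
            self.state, self.owner = "exclusive-write", cpu
        elif not is_write and self.state == "exclusive-write" and self.owner != cpu:
            # Fault on read: demote the owner, make the page read-only for all.
            self.events.append(("reduce", self.owner))
            self.events.append(("increase", "all"))
            self.state, self.owner = "concurrent-read", None

page = CrewPage()
page.access("P2", is_write=True)   # Concurrent-Read → P2 Exclusive
page.access("P1", is_write=False)  # P2 Exclusive → Concurrent-Read
assert len(page.events) == 4       # every transition produced CREW events
```

Note that conflicting accesses always force a state transition, which is exactly what makes the CREW property hold.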
Generating Constraints
• State: Concurrent-Read
  – All processors read-only
• d*: CREW fault (P2's write to the page)
• New state: P2 Exclusive
• r: privilege reduction on P1
  – Read to None
• i: privilege increase on P2
  – Read to Read/write
• Log the timing of r and i
• Constraint:
  – r → i
[Diagram: P1 executes a, then its privilege is reduced (r); P2's faulting access d* triggers the state change, P2's privilege is increased (i), and d re-executes]
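A sketch of how a constraint record might look, assuming (hypothetically) that each CREW event is timestamped with its processor's instruction count; the logged pair enforces r → i at replay:

```python
# When a CREW state change reduces one processor's privilege (r) and
# increases another's (i), replay must enforce r → i: the gaining
# processor may not pass point i until the losing one has reached r.
def generate_constraint(reduce_event, increase_event):
    r_cpu, r_count = reduce_event
    i_cpu, i_count = increase_event
    return {"before": (r_cpu, r_count), "after": (i_cpu, i_count)}

# Hypothetical timings: P1's privilege reduced at its instruction 1042,
# P2's increased at its instruction 880 (each on its own counter).
c = generate_constraint(("P1", 1042), ("P2", 880))
assert c["before"] == ("P1", 1042)
assert c["after"] == ("P2", 880)
```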
Direct Memory Access
• Device accesses memory directly
• Logically another processor
  – Reads and writes need to be ordered
  – IOMMU: can’t fault/fixup/re-execute
• Observation: Transaction model
• Device: non-preemptible actor
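One way to picture the transaction model (an illustrative sketch, not the paper's mechanism): since the device cannot be made to fault and re-execute, treat each DMA transfer as a non-preemptible actor that holds exclusive ownership of its pages for the whole transaction.

```python
# A DMA transfer takes exclusive ownership of its pages when it starts
# and releases it only when the device signals completion; CPUs that
# touch those pages in between would fault and wait.
class DmaTransaction:
    def __init__(self, crew_pages):
        self.pages = crew_pages

    def start(self):
        for page in self.pages:
            page["state"], page["owner"] = "exclusive-write", "device"

    def complete(self):
        for page in self.pages:
            page["state"], page["owner"] = "concurrent-read", None

pages = [{"state": "concurrent-read", "owner": None} for _ in range(2)]
txn = DmaTransaction(pages)
txn.start()
assert all(p["owner"] == "device" for p in pages)  # CPUs excluded for the duration
txn.complete()
assert all(p["state"] == "concurrent-read" for p in pages)
```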
Prototype: SMP-ReVirt
• Modified Xen hypervisor
• Implement logging, CREW protocol
• Details in paper
Evaluation questions
• What is the overhead?
• What affects performance?
  – In paper
• When might I want to use MP replay?
  – Log with 1, 2, or N cpus?
Evaluation Workloads
• SPLASH2 parallel application suite
  – FMM, LU, ocean, radix, water-spatial, radiosity
• Kernel-build
• Dbench
Predicting results
• Key changes in sharing attributes
  – 4096-byte sharing granularity
  – A “miss” is very expensive
• SPLASH2
  – Good: high spatial locality / low false sharing
  – Bad: random access patterns / high false sharing
• The Linux kernel
  – Tuned to 16-byte cachelines
  – Involving the kernel may be expensive
Single-processor Xen guests
[Bar chart. Normalized runtime of a logging 1-cpu guest vs. an unmodified 1-cpu guest. Labeled logging values: FMM 1.00, LU 1.04, ocean 1.01, radix 1.00, water-spatial 1.03, kernel-build 1.13, radiosity 1.00, dbench 1.05]
Log Growth Rate

Workload        Log growth (GB/day)   Days to fill 300GB
FMM             0.234                 1280
LU              0.237                 1261
Ocean           0.232                 1295
Radix           0.292                 1025
Water-spatial   0.232                 1296
Kernel-build    0.564                 531
Radiosity       0.231                 1295
Dbench          0.557                 538
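The last column is just capacity divided by growth rate; a quick check (small differences from the slide's numbers come from rounding of the printed rates):

```python
# Days to fill a 300 GB disk at a given log growth rate.
def days_to_fill(rate_gb_per_day, capacity_gb=300):
    return capacity_gb / rate_gb_per_day

assert round(days_to_fill(0.564)) == 532   # slide shows 531 (rate rounding)
assert round(days_to_fill(0.232)) == 1293  # slide shows 1295/1296
```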
2-processor Xen guests
[Bar chart. Normalized runtime for FMM, LU, ocean, radix, water-spatial, and kernel-build, comparing an unmodified 2-cpu guest, a logging 2-cpu guest, and a logging 1-cpu guest. Labeled values range from 1.00 to 2.10, i.e. roughly 1.5–2x logging overhead on these workloads]
2-processor, cont’d
[Bar chart. Radiosity and dbench: the logging 2-cpu guest reaches about 8.70x and 7.21x normalized runtime, while the logging 1-cpu guest stays near 1.85x and 1.88x]
Log Growth Rate

Workload        Log growth (GB/day)   Days to fill 300GB
FMM             34.5                  8.7
LU              3.2                   92.7
Ocean           4.3                   69.1
Radix           39.8                  7.5
Water-spatial   36.3                  8.25
Kernel-build    43.3                  6.9
Radiosity       88.4                  3.4
Dbench          77.0                  3.9
4-processor Xen guests
[Bar chart. Normalized runtime for FMM, LU, ocean, radix, water-spatial, and kernel-build, comparing an unmodified 4-cpu domain with CREW logging at 4, 2, and 1 cpus. Labeled values (apparently the 4-cpu logging series): FMM 7.36, LU 1.12, ocean 1.28, radix 4.20, water-spatial 1.72, kernel-build 9.03]
Recap
• Memory races in multiprocessor VMs
• The Ordering Requirement
• The CREW Protocol
  – Implementing with page protections
  – Relation to the Ordering Requirement
  – Generating constraints from CREW events
• DMA-capable devices and CREW
• Performance
Big ideas
• Detection and replay of memory races is possible on commodity hardware
• Overhead is high for some workloads
• …but surprisingly low for other workloads
Questions