42
Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

Embed Size (px)

Citation preview

Page 1: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

Introduction to the Palacios VMMand the V3Vee Project

John R. Lange

Department of Computer ScienceUniversity of Pittsburgh

September 28th, 2010

Page 2: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

2

Outline• V3Vee Project

– Multi-institutional collaboration

• Palacios– Developed first VMM for scalable High Performance

Computing (HPC)

• Largest scale study of virtualization– Proved HPC virtualization is effective at scale

• Current Research– Symbiotic Virtualization – Cloud <-> HPC integration– New system architectures

Page 3: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

Virtuoso Project (2002-2007)virtuoso.cs.northestern.edu

• “Infrastructure as a Service” distributed/grid/cloud virtual computing system– Particularly for HPC and multi-VM scalable apps

• First adaptive virtual computing system– Drives virtualization mechanisms to increase the

performance of existing, unmodified apps running in collections of VMs

– Focus on wide-area computation

3

R. Figueiredo, P. Dinda, J. Fortes, A Case for Grid Computing on Virtual Machines, Proceedings of the 23rd International Conference on Distributed Computing (ICDCS 2003), May, 2003.

Page 4: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

4

Virtuoso: Adaptive Virtual Computing

• Providers sell computational and communication bandwidth

• Users run collections of virtual machines (VMs) that are interconnected by overlay networks

• Replacement for buying machines

• That continuously adapts…to increase the performance of your existing, unmodified applications and operating systems

See virtuoso.cs.northwestern.edu

for many papers, talks, and movies

See virtuoso.cs.northwestern.edu

for many papers, talks, and movies

Page 5: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

V3VEE Overview

• Goal: Create an original open source virtual machine monitor (VMM) framework for x86/x64 that permits per use composition of VMMs optimized for…– High Performance Computing

– Computer architecture research

– Experimental computer systems research

– Teaching

Community resource development project (NSF CRI)

X-Stack exascale systems software research (DOE)5

Page 6: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

Collaborators

6

Page 7: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

7

• University of Pittsburgh– Jack Lange, and more…

• Northwestern University– Jack Lange, Peter Dinda, Russ Joseph, Lei Xia, Chang Bae,

Giang Hoang, Fabian Bustamante, Steven Jaconette, Andy Gocke, Mat Wojick, Peter Kamm, Robert Deloatch, Yuan Tang, Steve Chen, Brad Weinberger Mahdav Suresh, and more…

• University of New Mexico– Patrick Bridges, Zheng Cui, Philip Soltero, Nathan Graham,

Patrick Widener, and more…

• Sandia National Labs– Kevin Pedretti, Trammell Hudson, Ron Brightwell, and more…

• Oak Ridge National Lab– Stephen Scott, Geoffroy Vallee, and more…

People

Page 8: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

Virtual Machine Monitor

Research in Computer Architecture

Research and Use In High Performance

ComputingTeaching

Open Source Community

broad user and developer base

Why a New VMM?

Well-served

Research in Systems,Cloud, and Datacenters

Commercial SpaceCloud / Datacenter

Page 9: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

Virtual Machine Monitor

Research in Systems,Cloud, and Datacenters

Research in Computer Architecture

Research and Use In High Performance

ComputingTeaching

Open Source Community

broad user and developer base

Why a New VMM?Commercial SpaceCloud / Datacenter

Ill-served

Page 10: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

Why a New VMM?• Code Scale

– Compact codebase by leveraging hardware virtualization assistance

• Code decoupling– Make virtualization support self-contained

– Linux/Windows kernel experience unneeded

• Make it possible for researchers and students to come up to speed quickly

• Freedom: BSD license• Composition

– Enable raw performance at large scales10

Page 11: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

11

Palacios VMM• OS-independent embeddable virtual machine monitor• Developed at Northwestern and University of New Mexico

– And now University of Pittsburgh

• Open source and freely available

• Users:– Kitten: Lightweight supercomputing OS from Sandia National Labs– MINIX 3– Modified Linux

• Successfully used on supercomputers, clusters (Infiniband and Ethernet), and servers

http://www.v3vee.org/palacios

Page 12: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

• Palacios compiles to a static library

• Compact host OS interface allows Palacios to be embedded in different Host OSes– Palacios adds virtualization support to an OS– Palacios + lightweight kernel = traditional “type-I”

“hypervisor” VMM

• Current embeddings– Kitten, GeekOS, Minix 3, Linux

12

Embeddability

Page 13: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

13

Kitten: An Open Source LWK

• Better match for user expectations– Provides mostly Linux-compatible user environment

• Including threading

– Supports unmodified compiler toolchains and ELF executables

• Better match vendor expectations– Modern code-base with familiar Linux-like organization

• Drop-in compatible with Linux

– Infiniband support

• End-goal is deployment on future capability system

http://code.google.com/p/kitten/

Page 14: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

14

Palacios as an HPC VMM

• Minimalist interface– Suitable for an LWK

• Compile and runtime configurability– Create a VMM tailored to specific environments

• Low noise

• Contiguous memory pre-allocation

• Passthrough resources and resource partitioning

Page 15: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

15

A Compact Type-I VMM

Xen: 580k lines (50k – 80k core)

KVM: 50k-60k lines + Kernel dependencies (??)+ User level devices (180k)

Page 16: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

16

HPC Performance Evaluation

• Virtualization is very useful for HPC, but…Only if it doesn’t hurt performance

• Virtualized RedStorm with Palacios– Evaluated with Sandia’s system evaluation

benchmarks

17th fastest supercomputer

Cray XT338208 cores~3500 sq ft

2.5 MegaWatts$90 million

Page 17: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

17

Scalability at Small Scales(Catamount)

Within 5%Scalable

HPCCG: conjugant gradient solver

Page 18: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

18

Large Scale Study

• Evaluation on full RedStorm system– 12 hours of dedicated system time on full machine– Largest virtualization performance scaling study to date

• Measured performance at exponentially increasing scales– Up to 4096 nodes

• Publicity– New York Times– Slashdot– HPCWire– Communications of the ACM– PC World

Page 19: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

19

Scalability at Large Scale(Catamount)

CTH: multi-material, large deformation, strong shockwave simulation

Within 3%

Scalable

Page 20: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

20

Infiniband on Commodity Linux

2 node Infiniband Ping Pong bandwidth measurement

(Linux guest on IB cluster)

Page 21: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

21

Symbiotic Virtualization

• Virtualization can scale– Near native performance for optimized VMM/guest (within 5%)

• VMM needs to know about guest internals– Should modify behavior for each guest environment

• Symbiotic Virtualization– Design both guest OS and VMM to minimize semantic gap

– Bidirectional synchronous and asynchronous communication channels

– Interfaces are optional and can be dynamically added by VMM

Page 22: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

22

Symbiotic Interfaces• SymSpy Passive Interface

– Internal state already exists but it is hidden– Asynchronous bi-directional communication

• via shared memory– Structured state information that is easily parsed

• Semantically rich

• SymCall Functional Interface– Synchronous upcalls into guest during exit handling– API

• Function call in VMM• System call in Guest

– Brand new interface construct

• SymMod Code Injection• Guest OS might lack functionality

• VMM loads code directly into guest– Extend guest OS with arbitrary interfaces/functionality

Page 23: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

VNET Overlay Networking for HPC• Virtuoso project built “VNET/U”

– User-level implementation

– Overlay appears to host to be simple L2 network (Ethernet)

– Ethernet in UDP encapsulation (among others)

– Global control over topology and routing

– Works with any VMM

– Plenty fast for WAN (200 Mbps), but not HPC

• V3VEE project is building “VNET/P”– VMM-level implementation in Palacios

– Goal: Near-native performance on Clusters and Supers for at least 10 Gbps

– Also: easily migrate existing apps/VMs to specialized interconnect networks

23

Page 24: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

Current VNET/P Node Architecture

24

Service VMApplication VM

Page 25: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

Conclusion• Palacios: original open-source VMM for

modern architectures– Small and Compact

• Easy to understand and extend/modify

– Framework for exploring alternative VMM architectures

25

V3VEE Project http://v3vee.org

Collaborators Welcome!

Page 26: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

Backup Slides

26

Page 27: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

27

Current Research

• Virtualization can scale– Near native performance for optimized VMM/guest (within 5%)

• VMM needs to know about guest internals– Should modify behavior for each guest environment– Example: Paging method to use depends on guest

• Black Box inference is not desirable in HPC environment– Unacceptable performance overhead– Convergence time– Mistakes have large consequences

• Need guest cooperation– Guest and VMM relationship should be symbiotic (Thesis)

Page 28: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

28

Semantic Gap

• VMM architectures are designed as black boxes– Explicit lowlevel OS interface (hardware or paravirtual)– Internal OS state is not exposed to the VMM

• Many uses for internal state– Performance, security, etc...– VMM must recreate that state

• “Bridging the Semantic Gap”– [Chen: HotOS 2001]

• Two existing approaches: Black Box and Gray Box– Black Box: Monitor external guest interactions– Gray Box: Reverse engineer internal guest state– Examples

• Virtuoso Project (Early graduate work)• Lycosid, Antfarm, Geiger, IBMon, many others

Page 29: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

29

Example: Swapping

Physical MemorySwappedMemory

Application Memory Working Set

Swap Disk

• Disk storage for expanding physical memory

Only basic knowledge without internal state

Guest

VMM

Page 30: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

30

SymSpy

• Shared memory page between OS and VMM– Global and per-core interfaces

• Standardized data structures– Shared state information

• Read and write without exits

Page 31: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

31

SymCall (Symbiotic Upcalls)

• Conceptually similar to System Calls– System Calls: Application requests OS services– Symbiotic Upcalls: VMM requests OS services

• Designed to be architecturally similar– Virtual hardware interface

• Superset of System Call MSRs

– Internal OS implementation• Share same system call data structures and basic operations

• Guest OS configures a special execution context– VMM instantiates that context to execute synchronous upcall– Symcalls exit via a dedicated hypercall

Page 32: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

32

SymCall Control Flow

Handle exit

Running in guest

Returnto VMM

NestedExits

Page 33: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

33

Implementation

• Symbiotic Linux guest OS– Exports SymSpy and SymCall interfaces

• Palacios– Fairly significant modifications to enable nested

VM entries• Re-entrant exit handlers

• Serialize subset of guest state out of global hardware structures

Page 34: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

34

SwapBypass

• Purpose: improve performance when swapping– Temporarily expand guest memory– Completely bypass the Linux swap subsystem

• Enabled by SymCall– Not feasible without symbiotic interfaces

• VMM detects guest thrashing– Shadow page tables used to prevent it

Page 35: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

35

Symbiotic Shadow Page tablesGuest Page Tables Shadow Page Tables

PageDirectory

PageTable

PhysicalMemory

PageDirectory

PageTable

PhysicalMemory

SwapDiskCache

Swapped out page Swap Bypass Page

Page 36: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

36

Guest 3Global Swap Disk Cache

Guest 1

SwapBypass Concept

Guest Physical MemorySwappedMemory

Swap Disk

Application Working Set

VMM physical Memory

Swap Disk

Guest 2

Guest 2 Guest 3

Page 37: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

37

Necessary SymCall: query_vaddr()

1. Get current process ID– get_current(); (Internal Linux API)

2. Determine presence in Linux swap cache– find_get_page(); (Internal Linux API)

3. Determine page permissions for virtual address– find_vma(); (Internal Linux API)

• Information extremely hard to get otherwise• Must be collected while exit is being handled

Page 38: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

38

Evaluation

• Memory system micro-benchmarks– Stream, GUPS, ECT Memperf– Configured to over commit anonymous memory

• Cause thrashing in the guest OS

• Overhead isolated to swap subsystem– Ideal swap device implemented as RAM disk

• I/O occurs at main memory speeds

• Provides lower bound for performance gains

Page 39: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

39

Bypassing Swap Overhead

Working set size

Ideal I/O improvement

Stream: simple vector kernel

Performance improves

Page 40: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

40

• Access standard OS driver API– Dependent on internal OS implementation

OS Drivers/Modules

Page 41: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

41

Standard Symbiotic Modules

• Guest exposes standard symbiotic API via SymSpy

Page 42: Introduction to the Palacios VMM and the V3Vee Project John R. Lange Department of Computer Science University of Pittsburgh September 28th, 2010

42

Secure Symbiotic Modules

• Symbiotic API, protected from guest– Secure initialization– Virtual memory overlay