Unikernels: the Rise of the Library Hypervisor

Anil Madhavapeddy, @avsm

Mindy Preston, @yomimono

Martin Lucina

+ the MirageOS and Docker for Mac/Win teams

Docker Inc, @docker, with contributions from IBM

Docker Distributed Systems Summit, 7th October 2016, Berlin, Germany

Conventional hypervisors

• Run full guest operating systems with complex emulation needs.

• Scaffolding for device emulation, instruction emulation, etc.

• Hard to compose into existing infrastructure without wrapping a full hypervisor layer.

[Diagram: the Xen Hypervisor on Hardware, with Dom0 and DomU domains; components shown: qemu, xenstored, xenconsoled.]

Conventional hypervisors

CVE-2016-3710: VGA emulation missing bounds checks causes exploit.

CVE-2016-5403: unbounded virtio memory usage causes DoS.

CVE-2014-3672: unrestricted qemu logging causes DoS.

CVE-2015-8554: qemu-dm buffer overrun in MSI-X causes exploit.

CVE-2015-7504: heap overflow in pcnet emulator causes exploit.

How can distributed systems use hardware protection more flexibly and composably?

Recap: Unikernels

• "library operating systems" break kernels into libraries.

• Link libraries with a boot layer, scheduler and application.

• Portable microservices that boot directly on hypervisors or Unix.

[Diagram: a unikernel links the Application (Configuration, Business Logic) with Libraries (HTTP, JSON, SSL, TCP/IP, Xen Devices; or Unix libev and musl libc). Shown alongside stacks labelled Xen on Hardware, and Docker App on Linux on Hardware.]

Recap: Unikernels

• Many benefits are lost when deploying on existing clouds.

• Tiny binaries (200k) still require scaffolding of a full OS to boot.

• Difficult to manage a hypervisor from inside a container, as full host privilege is needed.

Library Hypervisors

• Extend the "kit" model and break down hypervisor functionality into libraries.

• Expose core functionality (CPU and memory) as a library, with other pieces (device emulation) optional.

• Benefit: huge reduction in TCB, and better fit to container-native infrastructure with privilege dropping.

• Drawback: no existing support in operating systems.

But let's take a closer look!

What has changed?

• OSX Hypervisor framework: xHyve (xhyve.org) and HyperKit (github.com/docker/hyperkit), used by Docker for Mac.

• FreeBSD bHyve (bhyve.org).

• Linux /dev/kvm: kvmtool, novm and ukvm, the last used by MirageOS 3.

Docker for Mac

Aiming for a native OSX experience that works with existing developer workflows.

• Easy drag-and-drop installation, and auto-updates to get the latest Docker.

• Secure, sandboxed virtualisation architecture without elevated privileges.

• Native networking support, with VPN and network sharing compatibility.

• File sharing between container and host: uid mapping, inotify events, etc.

Virtualisation

• Uses the new HyperKit framework, which is in turn based on xHyve and FreeBSD's bHyve.

• Sandbox friendly: processes largely run as non-root, with privileges of the local user.

[Diagram: the OSX kernel provides Hypervisor.framework on top of hardware virtualisation (VMX, nested paging). In userspace, HyperKit runs as a user process with a thread per vCPU, traps on I/O pages, and manages ACPI and PCI devices. The guest is a Linux kernel with an Alpine Linux userspace and the latest Docker preconfigured, attached through VirtIO IPC, VirtIO Block and VirtIO Net. Other labels: QCow2, VPNKit, logs redirected to the OSX host.]

• Embeds Linux: includes an embedded lightweight Alpine Linux distribution optimised for fast boot and stateless operation for containers.

$ docker info
Containers: 358
 Running: 13
 Paused: 0
 Stopped: 345
Images: 485
Server Version: 1.11.1
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge null host
Kernel Version: 4.4.9-moby
Operating System: Alpine Linux v3.3
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.858 GiB

HyperKit library structure

• In HyperKit, most functionality is linked as a library.

• If an app doesn't need a protocol, it is not linked, and so is not part of the trusted computing base.
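As a rough illustration of the "link only what you need" idea, here is a toy Python model. It is not HyperKit code: the `Monitor` class and its `link`, `handle_io` and `tcb` methods are hypothetical names made up for this sketch. The core is always present, device handlers are optional, and anything that was never linked simply does not exist in the resulting monitor.

```python
class Monitor:
    """Toy 'library hypervisor': a fixed core plus optionally linked devices."""

    def __init__(self, memory_mb):
        self.memory_mb = memory_mb   # core state: guest memory size
        self.devices = {}            # optional pieces, empty by default

    def link(self, name, handler):
        """Link an optional device handler; returns self so calls can chain."""
        self.devices[name] = handler
        return self

    def handle_io(self, name, request):
        """Dispatch a guest I/O request to a linked device, if there is one."""
        if name not in self.devices:
            raise LookupError(f"device {name!r} is not linked into this monitor")
        return self.devices[name](request)

    def tcb(self):
        """Names of the components this monitor actually contains."""
        return ["core"] + sorted(self.devices)
```

A monitor built as `Monitor(64).link("net", handler)` has no block device at all, so a guest request for one fails instead of reaching emulation code that nobody asked for.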

Networking

• Want to hide the gory details of virtualisation from the user. The Linux VM should be "invisible".

• Not solving this leads to many user complaints:

  • VPN software and corporate installations do not like bridged virtual machines or custom routing. Result: container traffic cannot connect to the Internet.

  • Services cannot be exposed on localhost or the external interface and are instead on the Linux VM IP address. Result: breaks common web OAuth workflows.

[Diagram: conventional bridged networking. HyperKit runs in userspace over Hypervisor.framework and hardware virtualisation (VMX, nested paging), exposing VirtIO IPC, VirtIO Block and VirtIO Net to the containers. Ethernet frames from VirtIO Net go through a Bridge to an Ethernet kernel module.]

Networking

• Challenge: Services publishing ports should be exposed on localhost without needing VM info.

• Solution: VPNKit forwards container port requests to an OSX service which binds them natively on its external interface.

• Benefits:

  • docker run -P on the Mac now works without requiring any knowledge of the VM innards.

  • External OAuth workflows operate with web apps.
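The general shape of "bind on the host, relay to the service" can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it relays over plain TCP, whereas the slides describe VPNKit handing requests to an OSX service, and the names `forward`, `_relay` and `_pump` are hypothetical.

```python
import socket
import threading


def _pump(src, dst):
    """Copy bytes from src to dst until EOF, then half-close dst."""
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass  # peer went away; fall through to the half-close
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass


def _relay(client, backend_addr):
    """Connect to the backend and copy traffic in both directions."""
    try:
        backend = socket.create_connection(backend_addr)
    except OSError:
        client.close()
        return
    upstream = threading.Thread(target=_pump, args=(client, backend), daemon=True)
    upstream.start()
    _pump(backend, client)
    upstream.join()
    client.close()
    backend.close()


def forward(listen_addr, backend_addr):
    """Bind listen_addr on the host and relay each connection to backend_addr.

    Returns the listening socket; closing it stops the accept loop.
    """
    server = socket.create_server(listen_addr)

    def accept_loop():
        while True:
            try:
                client, _ = server.accept()
            except OSError:
                return  # listener was closed
            threading.Thread(
                target=_relay, args=(client, backend_addr), daemon=True
            ).start()

    threading.Thread(target=accept_loop, daemon=True).start()
    return server
```

A caller would run something like `forward(("127.0.0.1", 8080), (vm_ip, 80))`, where `vm_ip` stands in for wherever the container service is reachable. The client only ever sees the localhost address.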

[Diagram: networking with VPNKit. The Bridge and Ethernet kernel module are replaced by VPNKit, built from MirageOS libraries (TCP/IP, DNS, Socketer), which takes Ethernet frames from HyperKit's VirtIO Net device and issues ordinary kernel socket calls.]

github.com/docker/vpnkit

Networking

• Challenge: Deal with custom VPN software on the host that makes it difficult to bridge.

• Solution: VPNKit efficiently reconstructs container traffic into separate TCP/IP flows and translates them into native OSX/Windows sockets.

• Benefits:

  • All network traffic is generated from normal socket calls (e.g. gethostbyaddr) on the Mac, so it interacts well with firewalls, VPNs, and any local security policies.
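To make "reconstructs container traffic into separate TCP/IP flows" a little more concrete, the sketch below groups already-parsed TCP segments by their address/port 4-tuple and joins the payloads in sequence order. It is a deliberately simplified illustration, not VPNKit's implementation: the `reassemble` function and the tuple layout are assumptions for this example, and it ignores gaps, overlapping segments and sequence-number wraparound.

```python
from collections import defaultdict


def reassemble(segments):
    """Group TCP segments into flows and join each flow's payload in order.

    Each segment is (src_ip, src_port, dst_ip, dst_port, seq, payload).
    Returns {(src_ip, src_port, dst_ip, dst_port): bytes}.
    """
    flows = defaultdict(dict)
    for src, sport, dst, dport, seq, payload in segments:
        # Keep the first copy of each sequence number; later copies are
        # treated as retransmissions and dropped.
        flows[(src, sport, dst, dport)].setdefault(seq, payload)
    return {
        flow: b"".join(payload for _, payload in sorted(by_seq.items()))
        for flow, by_seq in flows.items()
    }
```

Once traffic is separated per flow like this, each flow can be replayed through an ordinary host socket, which is what lets host firewalls and VPNs treat it like any other local application.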

Docker for Mac + unikernels

• Native OSX application that uses HyperKit to virtualise for a domain-specific purpose ("docker run").

• Links MirageOS unikernel libraries for networking and storage translation between OS boundaries.

• The library approach let us glue together these components really easily.

• Docker for Mac is quite a complex distributed system internally, but (hopefully) hidden from the user.

MirageOS 3 + Solo5

• Unikernels have been gathering pace; next challenge is to make them easily deployable.

• Build handled via Docker, but docker run shouldn't need privileges (e.g. to start a VM).

• MirageOS 3 has a new library hypervisor for Linux, developed by IBM, Docker and Cambridge University contributors.

mirage.io

MirageOS 3 + Solo5

• Source: https://github.com/Solo5/solo5

• Runs as a Unix process and opens /dev/kvm for hardware isolation.

• ukvm is a small, modular monitor that links only what is needed. Can be 10k in size!

• Can run privilege separated: one process opens /dev/kvm, then drops privileges and executes the unikernel.

• Boot times are the same as process fork times, since all the device setup is handled in-process.
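The "open the device, then drop privileges" pattern mentioned above is a standard Unix idiom, and can be sketched as follows. This is not ukvm's code (which is not shown in these slides); `open_then_drop` is a hypothetical helper, and a real monitor would go on to issue KVM ioctls on the returned descriptor.

```python
import os


def open_then_drop(path, uid, gid):
    """Open `path` while still privileged, then switch to uid/gid.

    The returned file descriptor stays usable after the switch, so the
    unprivileged process keeps access to exactly this one resource.
    """
    fd = os.open(path, os.O_RDWR)      # privileged step, e.g. /dev/kvm
    if os.geteuid() == 0:
        os.setgroups([])               # clear supplementary groups (root only)
    os.setgid(gid)                     # group first: not possible after setuid
    os.setuid(uid)                     # from here on we run unprivileged
    return fd
```

The ordering matters: once `setuid` has given up root, the process can no longer change its groups, so the group changes have to come first.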

MirageOS 3 + Solo5

Source: Dan Williams and Ricardo Koller, IBM Research, HotCloud 16

MirageOS 3 + Solo5

• Due for stable release in the next month.

• Intended to be a "unikernel template" for other projects to share hypervisor code.

• Liberally licensed under BSD/Apache2/ISC to encourage adoption and embedding.

• BoF and tutorials tomorrow to demonstrate it. Developers are all here and hacking!

Demo!

How can distributed systems use hardware protection more flexibly and composably?

Questions?

Download free at docker.com

Twitter: @avsm

https://github.com/docker/hyperkit

https://github.com/docker/vpnkit

https://github.com/docker/datakit

https://github.com/mirage/

We will be hacking tomorrow!

Backup Slides

Filesystem Sharing

• Challenge: Share arbitrary OSX directory tree into Linux container without requiring extensive modification of either side.

• Solution: Use a FUSE forwarding layer and translate Linux filesystem calls to OSX equivalents.

[Diagram: a container VOLUME on the Linux host is connected over FUSE to com.docker.osxfs on the OSX host, which tracks extra metadata and translates to OSX filesystem calls.]

Filesystem Sharing

• Challenge: Need filesystem activation so events on the Mac wake up container servers and vice-versa.

• Solution: osxfs uses FSEvents API and injects inotify activation events into container.

[Diagram: OSX Host, Linux Host and container VOLUME, connected by com.docker.osxfs over FUSE. FSEvents watches open files; events from Linux cause OSX apps to wake up.]
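One way to picture the FSEvents-to-inotify step is as a translation between two sets of bit flags. The mapping below is a hypothetical illustration, not the one osxfs uses (the slides do not show it). The constant values are taken from Apple's FSEvents.h and Linux's sys/inotify.h; the short `FSE_*` names and the `translate` function are made up for this sketch.

```python
# FSEvents per-item flags (kFSEventStreamEventFlagItem* in FSEvents.h)
FSE_CREATED = 0x0100
FSE_REMOVED = 0x0200
FSE_INODE_META_MOD = 0x0400
FSE_RENAMED = 0x0800
FSE_MODIFIED = 0x1000

# inotify event masks (sys/inotify.h)
IN_MODIFY = 0x0002
IN_ATTRIB = 0x0004
IN_CREATE = 0x0100
IN_DELETE = 0x0200

# Assumed mapping, for illustration only. Renames are left out because an
# FSEvents rename flag alone does not say whether this path is the old or
# the new name.
FSE_TO_INOTIFY = {
    FSE_CREATED: IN_CREATE,
    FSE_REMOVED: IN_DELETE,
    FSE_INODE_META_MOD: IN_ATTRIB,
    FSE_MODIFIED: IN_MODIFY,
}


def translate(fse_flags):
    """Return the inotify mask for the FSEvents flags that have a mapping."""
    mask = 0
    for fse_bit, inotify_bit in FSE_TO_INOTIFY.items():
        if fse_flags & fse_bit:
            mask |= inotify_bit
    return mask
```

A file that was created and then written on the Mac would, under this mapping, show up in the container as a single event with both IN_CREATE and IN_MODIFY set.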

Networking

• Challenge: Deal with custom VPN software on the host that makes it difficult to bridge.

• Solution: VPNKit efficiently reconstructs container traffic into separate TCP/IP flows and translates them into native OSX/Windows sockets.

[Diagram: traffic from a container (RUN <...>) on the Linux host crosses an Ethernet bridge to com.docker.hyperkit-net on the OSX host, which reconstructs the traffic into TCP flows and translates them to OSX socket calls. Other labels: DHCPv4, NTP.]

Networking

• Challenge: Services publishing ports should be exposed on localhost without needing VM info.

• Solution: VPNKit forwards container port requests to an OSX service which binds them natively on its external interface.

[Diagram: OSX Host and Linux Host. Labels: Container (EXPOSE, RUN <...>), Port Service, Privileged Port Service, VSock Binder, VSock Listener, Userland Proxy.]

Multi-CPU architectures

$ docker run resin/armv7hf-debian uname -a
Linux 7ed2fca7a3f0 4.1.12 #1 SMP Tue Jan 12 10:51:00 UTC 2016 armv7l GNU/Linux

$ docker run justincormack/ppc64le-debian uname -a
Linux edd13885f316 4.1.12 #1 SMP Tue Jan 12 10:51:00 UTC 2016 ppc64le GNU/Linux
