
CSR: Small: BMX: Scalability and Low-Latency Communication with Bare Metal Execution

Over the last five years computer hardware has made a revolutionary step forward. Recent CPUs have integrated memory and bus controllers, direct I/O interfaces, and low-latency point-to-point interconnects for bypassing the I/O bus on access to network interfaces. Early benchmarks performed by Intel as part of the Data Plane Development Kit project, a user-level device driver framework, demonstrate that careful cache optimizations, direct access to I/O devices, and the ability to bypass an operating system (OS) enable a modern server to forward 80 million packets per second. Traditional systems achieve only 15% of this speed. Historically, this level of performance was possible only with special-purpose hardware accelerators. A theoretical analysis of Ethernet communication latency on a modern Linux system yields a similar conclusion: a factor of 6 to 10 improvement is possible if the traditional OS is removed from the critical path.

Proposed Research The PIs propose creating BMX, a novel mechanism for removing the operating system from the critical path of a high-performance, I/O intensive application. BMX provides applications with direct, low-latency access to hardware and an isolated, interference-free execution environment. Relying on techniques of hardware-assisted virtualization and I/O MMU-based protection, BMX implements a new operating system abstraction—a bare metal process—that runs right on the hardware, has direct access to peripheral devices, and only occasionally communicates with the operating system kernel using horizontal system calls. BMX provides traditional applications with the ability to avoid overheads, delays, and scheduling unpredictabilities traditionally introduced by the operating system on the I/O path. While modifying the operating system stack, BMX preserves compatibility with existing software engineering and debugging tools.

Intellectual Merit The first contribution of this work is the design of the BMX execution environment. BMX implements the principle of separation of control and data paths in a traditional operating system kernel. The second main contribution is a collection of mechanisms required for building efficient applications inside the BMX environment. BMX provides applications with performance isolation from the operating system kernel, and implements an environment in which applications have direct control over the data path. Having direct access to hardware, bare metal applications have complete freedom to implement purpose-built I/O protocols based on optimizing the I/O path through the CPU cache, applying cache-friendly data structures, and leveraging low-latency access to peripheral devices.

Broader Impacts The PIs expect that the proposed work will enable new kinds of datacenter applications. In a transparent manner, BMX enriches traditional operating systems with properties of lightweight execution runtimes used in supercomputers and achieves performance improvements traditionally impossible without specialized hardware. BMX is a practical foundation for efficiently utilizing datacenters for solving big data analytics tasks. BMX will be implemented as part of the de-facto research and industry standard Linux operating system, and will be open source, directly benefiting the broader community.

Key Words. operating systems; datacenters; low-latency networking; virtualization.

Contents

1 Overview
  1.1 BMX: Scalability and Low-latency Through Direct Hardware Access
  1.2 BMX Architecture
  1.3 BMX Examples

2 The State of the Art is Inadequate

3 Details of Proposed Research
  3.1 Bare metal processes
  3.2 Bare Metal Applications

4 Broader Impacts

5 Timeline and Management Plan

6 Results from Prior NSF Support

1 Overview

Modern operating systems inherit their design from early computing environments. Four decades ago, the main task of an operating system was multiplexing relatively long-lived jobs on scarce hardware resources. Performance of the system on a large data set was often limited by disk access speed. If blocked on an I/O operation, an optimal strategy was to perform a process switch; overheads of the operating system were acceptable.

Today, the same operating systems control dramatically faster hardware in a datacenter environment; they run tasks that span thousands of machines, serve billions of web views per day [78], and may keep an entire data set in memory [56]. Performance of datacenter systems depends on the ability to support millions of connections, saturate fast network interfaces, and communicate with microsecond latencies. To achieve one microsecond node-to-node communication latency, or to serve a 1 KB packet on a saturated gigabit link, fewer than 2,000 cycles are available per operation. A traditional operating system cannot work within this budget: it requires more than 2,000 cycles just to move a packet from a device interface to the application [45].
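As a back-of-the-envelope check (our own arithmetic, assuming the 2 GHz clock rate used in the examples of Section 1.3), the cycle budget follows directly from the latency target:

\[
1\,\mu\text{s} \times 2\times10^{9}\ \text{cycles/s} = 2{,}000\ \text{cycles per operation}.
\]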

Facing the challenge of serving billions of users a day, companies like eBay utilize 52,000 servers [24]. To achieve a one microsecond latency, high-performance computing clusters and high-frequency trading platforms rely on specialized non-Ethernet network interfaces like Infiniband [33, 63] and purpose-built FPGA-based network accelerators [46, 48]. Lacking proper performance isolation in commodity systems, the high-performance computing community developed a family of lightweight kernels [30, 44].

Evidence suggests that a 6–10x performance improvement is possible on modern hardware if the operating system is removed from the critical path and applications receive complete control over hardware [37]. For example, running on bare metal, carefully optimized event-driven software can be capable of achieving a forwarding rate of 80,000,000 64-byte packets per second [36, 37]. The research question addressed by this proposal is: Can we build systems software exposing this kind of near-peak performance without completely sacrificing backward compatibility with existing systems?

How do traditional OSes lose performance? Two main problems turn a traditional operating system into a bottleneck. First, operating systems were designed to be general and multiplex hardware across multiple applications. Today, the overheads introduced by general multiplexing are wasteful and can even be prohibitive. In a traditional kernel, receiving a network packet involves acknowledging the hardware interrupt in a fast upper half of an interrupt handler, scheduling of a slow bottom half, context switching in and out of the interrupt handler, copying packet data from the kernel to the user space, returning to the user application waiting on the packet in a system call, and occasionally going through overheads of application scheduling and process switching. This general packet receive path takes several microseconds [45].

Second, traditional operating systems were designed in an era in which the performance overhead of cache misses and shared memory consistency was lower. Today, the budget for processing a network packet is a handful of cache misses. To fit this requirement, data structures need to be redesigned to minimize the cache footprint and avoid shared memory contention. The cost of translation lookaside buffer (TLB) misses and associated cache misses due to page table walks prohibits the use of small pages. The network receive path should rely on lock- and contention-free buffers. Despite many improvements in traditional operating systems, many parts of the network stack inherit their architecture from early computer systems. Today, a single network stack with multiple receive threads but shared data structures remains a bottleneck.


1.1 BMX: Scalability and Low-latency Through Direct Hardware Access

BMX is a novel operating system mechanism which enables applications to bypass the operating system and gain direct access to hardware. BMX implements a new abstraction—a bare metal process—that enables a new mode of execution in a traditional operating system. Bare metal processes execute directly on the hardware, without a layer of operating system underneath them. Bare metal processes have direct access to the private hardware assigned to them, and complete control over it—cores, caches, physical memory, network interfaces. Bare metal processes run with complete performance isolation from the operating system. The operating system will never preempt or interfere with a bare metal process. BMX will be implemented as part of the Linux kernel.

To enable implementation of novel, scalable, I/O intensive, and low-latency applications in datacenters and cloud infrastructure, BMX builds on the following principles:

No operating system on the critical path BMX will remove the traditional operating system from the critical path of bare metal applications. Instead, bare metal applications will rely on user-level device drivers and a lightweight execution environment to communicate with I/O devices, implement lightweight cooperative scheduling, scalable implementations of network stacks, and lightweight, low-latency protocols.

Performance isolation from the main kernel Traditional operating systems were not designed to support performance isolation of applications. The operating system introduces noise—periodic execution of accounting, updating time, executing timers, preemption and scheduling, and execution of slow bottom halves of interrupt handlers and work queues [6]. Lack of performance isolation results in scheduling overheads, cache and TLB pollution, and unpredictability of response times. Loss of performance on a single node can become amplified when thousands of machines cooperate to complete a task [60]. Despite advances such as scheduling and interrupt affinity, and Linux Control Groups [1], the Linux kernel is notorious for not being able to ensure performance isolation of performance-critical applications [6, 14]. To ensure that bare metal processes are isolated from the main kernel, BMX implements a separation hypervisor, which deprivileges the operating system kernel at runtime and runs it in a minimal virtual machine environment.

Direct assignment of CPUs and devices Bare metal applications will exclusively own processors and I/O devices. BMX avoids software multiplexing of processes, and instead requires applications to implement lightweight, application-specific schedulers. BMX relies on techniques of hardware CPU and device virtualization to implement direct assignment of devices to bare metal applications. This provides applications and user-level device drivers with full access to the device interface.

Lean data path BMX is designed to provide the fastest possible delivery of data from the I/O device to the application and back to the device. BMX will use parts of the Intel Data Plane Development Kit (DPDK) [38, 76] for building I/O intensive applications. DPDK relies on ring buffers to implement lock-free, zero-copy, zero-allocation processing of network packets. BMX will extend DPDK with support for low-latency communications, e.g., the ability to develop lightweight, purpose-built protocols designed to provide the lowest possible latency [47] for applications that require random access to an in-memory database [56].
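To illustrate the style of data-path primitive we have in mind, the following sketch (our own simplified illustration, not DPDK's rte_ring implementation) shows a single-producer, single-consumer lock-free ring that passes packet-buffer pointers between a driver stage and an application stage without locks, copies, or allocation:

/* Minimal sketch of a single-producer/single-consumer lock-free ring.
 * A power-of-two size lets the monotonically increasing head/tail indices
 * wrap with a cheap mask. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 1024                  /* must be a power of two */

struct spsc_ring {
    _Atomic size_t head;                /* written only by the producer */
    _Atomic size_t tail;                /* written only by the consumer */
    void *slots[RING_SIZE];             /* pointers to packet buffers */
};

static bool ring_enqueue(struct spsc_ring *r, void *pkt)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)       /* ring is full */
        return false;
    r->slots[head & (RING_SIZE - 1)] = pkt;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

static void *ring_dequeue(struct spsc_ring *r)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail == head)                   /* ring is empty */
        return NULL;
    void *pkt = r->slots[tail & (RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return pkt;
}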

Cache friendly data structures As part of the Data Direct I/O technology, modern Intel platforms receive and send packet data straight to and from the L3 cache [35]. When the CPU polls for new packets from the network interface, packet data is already available in the cache—a crucial advantage for systems whose performance largely depends on the number of cache misses. BMX will rely on cache-friendly data structures to reduce the number of cache misses required to serve packet requests, be it routing, serving a key-value store, or providing more complex data management. To reduce the number of cache misses on the packet dispatch and processing pass, BMX will explore possibilities of co-locating network stack and application data. To reduce the effects of TLB and I/O TLB misses, BMX will use huge pages.
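As a small concrete example of the last point, the sketch below (a minimal illustration using the standard Linux mmap interface; the pool size is an arbitrary placeholder) backs a packet-buffer pool with huge pages so that a handful of TLB entries covers the entire pool:

/* Minimal sketch: back a packet-buffer pool with huge pages so that a single
 * TLB entry covers megabytes of buffer memory.  Assumes the system has
 * preallocated huge pages, e.g., via /proc/sys/vm/nr_hugepages. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define POOL_SIZE (64UL * 1024 * 1024)   /* 64 MB of packet buffers */

static void *alloc_hugepage_pool(void)
{
    void *pool = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (pool == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return NULL;
    }
    return pool;
}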

Backward compatibility While enabling applications with direct access to hardware, BMX provides bare metal applications with a limited—but highly useful—form of backward compatibility. Bare metal applications will have access to the full Linux system call interface. Providing complete backward compatibility is challenging [61]. BMX will abandon system calls such as fork() and exec(), which clone the state of the application, and the exchange of sockets through the UNIX domain socket interface. BMX will implement libc-BMX, a libc-like library which will provide emulation of most system calls to bare metal applications. Instead of invoking the operating system on the same CPU core, bare metal applications communicate with the kernel through a horizontal system call invocation mechanism. Furthermore, BMX will implement a layer enabling the use of traditional debugging tools with bare metal processes.

1.2 BMX Architecture

BMX implements bare metal processes by relying on three architectural components: 1) a minimal BMX virtual machine monitor, which implements the BMX separation layer and direct assignment of devices; 2) the libc-BMX library, which provides emulation of traditional system calls inside the minimal execution environment of the bare metal process; and 3) the system call emulation layer inside the host kernel (Figure 1). A bare metal networking application runs a user-level device driver, which can be implemented as a part of the Intel Data Plane Development Kit, a scalable, low-latency network stack, and the layer of application logic.

BMX relies on hardware virtualization support to implement performance isolation of bare metal processes and private assignment of peripheral devices. A minimal BMX virtual machine monitor starts as a module in the kernel of the host system. When requested, it enters isolation mode, using techniques of late launch to deprivilege the host kernel [16, 40]. Effectively, the BMX VMM moves the host kernel into a VT-x non-root context. It "unplugs" the CPUs, unbinds devices, and balloons memory which are to be assigned to bare metal processes. Bare metal processes start as normal processes in the host kernel, but at the moment they are ready to enter the "bare metal" mode of execution, they are moved to a VT-x non-root context on a physical CPU which is assigned to them.

BMX uses I/O MMU protection mechanisms to implement secure assignment of a peripheral device to the bare metal process. Hardware support for virtualization is used to implement interrupt routing straight from the peripheral device into the address space of the bare metal process.

It is important to emphasize that although the BMX VMM launches each bare metal process, it never executes underneath them. All I/O and CPU interrupts are configured to reach interrupt handlers inside the bare metal process. An external interrupt from the peripheral device does not cause a transition from the bare metal process into the host operating system, or into the BMX separation layer. Instead, it is executed immediately by the interrupt handler of the bare metal process.


Figure 1: BMX architecture. The BMX separation layer provides performance isolation across bare metal processes and the Linux kernel. The libc-BMX library implements a limited form of backward compatibility with horizontal, exception-less system calls. Bare metal processes have direct access to hardware through VT-x and VT-d mechanisms.

1.3 BMX Examples

Example 1: Low-latency Network Access Hardware-accelerated network interfaces like InfiniBand achieve a one-way, back-to-back communication latency of one microsecond [63]. A clean-slate research prototype, VELO, achieves a sub-microsecond, one-way communication latency of 970 ns with a relatively slow FPGA board [47]. At the same time, the Linux Ethernet stack is only able to achieve an end-to-end latency of 14 µs for 1 Gbps and 10.2 µs for 10 Gbps devices [45]. The main design choices for low latency in Infiniband and FPGA devices are: (1) implementation of a kernel bypass interface—user-level processes are able to directly connect to one of the send and receive queues provided by the device; (2) a simpler network protocol and off-loading of the traditional network protocol in hardware; (3) use of polling device drivers instead of interrupts. The BMX architecture provides a way to leverage these principles of low-latency design for Ethernet devices and bare-metal applications.

Example 2: 10 Million Packets Per Second Challenge Modern DNS root servers, intrusion detection systems, firewalls, load balancers, and web cache servers need to scale to process millions of packets per second and maintain millions of open connections [32]. On a fully saturated 10 Gigabit link, packets of a typical datacenter size—1024 bytes—arrive at a rate of 1.2 million packets per second (a packet every 835 nanoseconds, or 1670 cycles on a 2 GHz machine) [37]. If carefully designed, a server system with 8 network adapters and 8 cores can reach a speed of processing 10 million packets per second. With a memory access latency of 70 nanoseconds [37], each cache miss takes 140 cycles. A budget of 1670 cycles per packet leaves space for only several cache misses. Modern operating systems keep connection information in a chain of linked data structures. For example, for an incoming TCP packet the network stack has to look up its destination, find a pointer to the transmission control block, then a pointer to the socket, and then one for the application [32]. For ten million connections, the necessary data structures will not all fit in a cache (modern CPUs come with 20 megabyte level 3 caches). The network stack needs to be redesigned to collocate connection information together, so that a lookup results in a single cache miss. Furthermore, if traditional paging is enabled, each cache miss may result in an additional page table entry miss. For example, to cover 32 GB of RAM with 4 KB pages, an operating system requires 64 MB of space for page tables. Of course, the page tables do not fit in the cache, and result in additional cache misses. Traditional paging must be disabled [32, 36, 76]. BMX provides a clean abstraction for implementing scalable network processing applications.
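Spelling out the arithmetic behind these budgets (all input figures are taken from the example above; the packet rate ignores Ethernet framing overhead):

\begin{align*}
\text{packet rate} &\approx \frac{10\times10^{9}\ \text{b/s}}{1024\ \text{B}\times 8\ \text{b/B}} \approx 1.2\times10^{6}\ \text{packets/s}
  \;\Rightarrow\; \text{one packet every} \approx 835\ \text{ns} \approx 1670\ \text{cycles at 2 GHz},\\
\text{cache miss} &= 70\ \text{ns}\times 2\ \text{GHz} = 140\ \text{cycles}
  \;\Rightarrow\; \text{at most} \approx 1670/140 \approx 12\ \text{misses per packet, even with no other work},\\
\text{page tables} &= \frac{32\ \text{GB}}{4\ \text{KB/page}}\times 8\ \text{B/PTE} = 64\ \text{MB} \gg 20\ \text{MB of L3 cache}.
\end{align*}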

2 The State of the Art is Inadequate

Traditional operating systems Three main problems of traditional operating systems motivate development of BMX: a long data processing path, poor multi-core scalability, and operating system noise. Trying to shorten the packet processing path in a traditional UNIX kernel, PF_RING [55] and Netmap [67] implement a general mechanism to relay raw packets from the kernel to user-level processes. Both PF_RING and Netmap avoid memory allocations, system call overheads, and a data copy between user and kernel space. The Intel Data Plane Development Kit is a user-level framework designed for efficient packet processing [38]. DPDK runs as a Linux process, exports an entire device interface to user space, and relies on a user-level device driver. Infiniband devices and Solarflare OpenOnload export a virtual device interface to user-level applications [71]. ELI explores techniques for optimizing the performance of device I/O exported with the I/O MMU mechanisms [31].

Over the last decade Linux has made tremendous progress in improving the scalability of kernel components [13, 20, 58]. Unfortunately, some parts of the Linux kernel are inherently non-scalable due to limitations embedded in the POSIX interface which they implement [21]. The biggest scalability limitation for the network processing path in a traditional kernel is that there is no guarantee that the entire path from a device driver to the application is executed on the same core [58]. Acknowledging the above problems, the industry has moved towards developing scalable, low-latency network stacks from scratch [4, 71, 76].

The high-performance computing community has explored the effects of operating system noise on the performance of parallel applications [60]. It can make sense to trade compute resources for isolation. By assigning one out of four cores in a 2048-node cluster to run Linux, Petrini et al. observed a reduction in parallel job completion times even though 25% of the system's computing power was lost [60]. Several projects explore the possibility of pushing the Linux kernel towards providing isolation through the concept of core specialization [6, 52], i.e., assigning some set of cores to applications and ensuring unpreempted execution. Linux already has the Control Group interface and the boot-time isolcpus option, which allow eliminating noise from user-level applications and disabling load balancing. More work is required for kernel threads, software IRQs, timers, and work queues. Two projects provide a commercial Linux distribution specifically modified for the needs of high-performance computing tasks: Cray CNL (Compute Node Linux) [39] and ZeptoOS [11]. Both try to reduce memory management overheads by allocating a long contiguous region of physical memory at boot time, and use huge page tables [41].

BMX is designed to allow clean-slate development of operating system and application components in a convenient, backward-compatible environment.


Performance isolation Traditional virtual machines concentrate on virtualization, i.e., multiplexing of physical resources across multiple virtual machines, rather than on ensuring performance isolation. Kocoloski and Lange use the Palacios VMM framework to implement a virtualization layer on top of the Linux kernel [41]. Palacios is embedded into the Linux kernel as a kernel module. Palacios allocates half of the physical memory for itself to make sure that it is accessible from lightweight kernel VMs without traditional Linux overheads. Palacios tries to minimize the noise from the Linux kernel by use of the tickless configuration option. It is not clear whether other sources of noise are addressed. Probably the closest to BMX, the recently announced Jailhouse separation hypervisor aims to implement performance isolation and direct assignment of devices for virtual containers, called "cells", which are similar to bare metal processes in BMX [40].

High-Performance Computing Systems High-performance computing (HPC) systems share many problems with today's datacenter environments. Acknowledging the problems of noise, scalability, and communication overheads in traditional operating systems, HPC clusters run a custom, lightweight operating system on the compute nodes of the cluster. The commercial distribution of the IBM Blue Gene/P lightweight Compute Node Kernel (CNK) [30] and Sandia Labs' Kitten [44] run massively parallel tasks in a single-thread-per-core manner. Lightweight kernels support a subset of the POSIX API to support development of POSIX-like applications on a supercomputer. With low OS noise, and a focus on performance, scalability, and extensibility, lightweight kernels are a good choice for today's supercomputer applications. The main drawback of lightweight kernels is the lack of POSIX compatibility, memory management and debugging tools, and standards-compliant networking [72].

Facing the lack of POSIX compatibility and the complexity of porting applications to the custom runtime environments of lightweight operating systems, the HPC community started exploring the possibility of using hardware-level virtualization as a flexible way to choose between a custom lightweight kernel and a full Linux environment. If the porting effort is high, it is possible to run unmodified Linux applications in virtual machines. Palacios extends the Kitten lightweight kernel with full-system virtualization [44]. The Kitten kernel boots and runs on the bare metal without any loss of performance. Palacios is a VMM framework, i.e., it is designed to embed into the source code of the host kernel (Kitten in this case). A research project from IBM provides support for virtualization on compute nodes of the IBM Blue Gene/P supercomputer with the L4 microkernel used as a lightweight compute kernel [72]. A virtual machine monitor runs on top of L4, as a regular L4 process. The fact that even VMM functionality is moved outside of the microkernel ensures that, after being ported to the L4 interface, performance-critical applications can access the compute node with minimal overhead.

Novel lightweight runtimes The idea of removing the operating system from the critical path is not new. Thibault et al. suggest embedding bare metal applications inside Xen virtual machines [74]. A number of projects aim to run language runtimes on the bare metal inside virtual machines [2, 28, 49]. Probably the most mature project, Mirage OS develops a complete library operating system in a bare metal OCaml runtime inside a Xen virtual machine [49]. Drawbridge implements a library operating system capable of running Windows applications [62]. Akaros is a research operating system aimed at delivering new operating system abstractions matching the needs of datacenter applications [66]. Akaros tries to expose as much information about the underlying platform to the application as possible, and to provide applications with the tools to manage it. BMX has a similar goal, but tries to avoid repeating the steps required to develop the functionality of a traditional OS, and instead uses Linux as a library which implements the POSIX interface. NIX is a collaborative effort between Bell Labs and Sandia aimed at building a hybrid operating system for heterogeneous hardware on which some cores are not able to run the full operating system due to power or performance constraints [51]. Similar to BMX, NIX calls out into a full operating system for the events which are not supported by the NIX kernel. FusedOS is a hybrid operating system stack from IBM, which encapsulates the CNK kernel as a library on the compute nodes, and calls out into the complete Linux kernel to process system calls [57]. While NIX concentrates more on the scheduling of cores, FusedOS aims to be a practical platform capable of running both CNK applications and unmodified Linux programs. Osprey is an operating system project aimed at delivering predictable performance in cloud environments [68]. Similar to BMX, Osprey runs a minimal library OS environment along with the performance-critical application. Osprey disables clock interrupts on the application nodes; if Osprey needs to preempt an application, it uses an inter-processor interrupt. Osprey uses asynchronous, horizontal system calls and large memory pages. Similar to Barrelfish [10], Osprey partitions hardware memory across the nodes. Osprey tries to achieve efficient network communication and runs the network stack inside the application, but message dispatch is still done in the kernel. Arrakis [59] aims to implement the separation of data and control planes in a research operating system built on top of Barrelfish. Sato et al. [69] suggest a hybrid operating system which runs the Linux kernel on a traditional multi-core CPU, and runs a lightweight OS kernel on a manycore CPU like the Intel MIC [34].

3 Details of Proposed Research

3.1 Bare metal processes

Challenge #1: Performance isolation for bare metal processes While existing Linux mechanisms provide partial solutions to operating system noise, they fall short in a number of areas. For example, Linux currently provides a mechanism for dedicating CPU cores to user processes, but this mechanism does not prevent kernel threads from being scheduled on those CPUs or prevent interrupts from being handled on them. Moreover, even if these mechanisms were extended to provide the necessary level of isolation, Linux ultimately remains in charge of all physical resources. Thus at any time, perhaps through misconfiguration, Linux could violate the isolation.

Task: Implementing bare metal processes via the BMX VMM. To properly provide isolation for bare metal processes, we will use an approach similar to that used in Jailhouse [40]. In this approach there is a minimal hypervisor—loaded as a Linux kernel module—whose primary responsibility is to partition and isolate machine resources between virtual machine contexts. When the first bare metal process is created, the BMX VMM kernel module will be invoked and it will de-privilege the running OS by moving it into a virtual machine (VT-x non-root) context. By virtualizing the running Linux kernel, resources such as CPUs, memory, and I/O space can be removed entirely from the reach of the kernel.

Bare metal processes are created by making a hypercall to the BMX VMM. The VMM can then use Linux interfaces to gracefully deallocate the needed resources from Linux before removing them from the VM context. For example, any cores that are to be dedicated to the bare metal process are first marked as "down" in Linux using an existing interface. Similarly, mappings for both the memory and I/O space to be dedicated to the bare metal process are invalidated in Linux. A new VM context is then created with the appropriate resources. Destruction of a bare metal process follows the reverse procedure, destroying the process VM and reintegrating the freed resources into Linux.
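To make the intended lifecycle concrete, the sketch below outlines a possible control interface for the BMX VMM. Every name here (struct bmx_resources, bmx_vmm_isolate, bmx_create_bare_process, bmx_destroy_bare_process) is a hypothetical placeholder for interfaces we will design, not existing code:

/* Hypothetical sketch of the BMX VMM control interface; the structure and
 * functions below are placeholders that illustrate the creation/destruction
 * flow described above. */
#include <stdint.h>

struct bmx_resources {
    uint64_t cpu_mask;            /* cores to offline in Linux and dedicate   */
    uint64_t mem_base, mem_len;   /* physical memory region for the process   */
    uint32_t pci_bdf;             /* PCI bus/device/function of the device    */
};

/* Deprivilege the running Linux kernel into a VT-x non-root context;
 * invoked once, when the first bare metal process is created. */
int bmx_vmm_isolate(void);

/* Offline the requested cores (CPU hotplug), invalidate the memory and I/O
 * mappings in Linux, and build a new VM context around those resources. */
int bmx_create_bare_process(const struct bmx_resources *res);

/* Reverse procedure: tear down the process VM and return its cores,
 * memory, and devices to Linux. */
int bmx_destroy_bare_process(int handle);

/* Example intended use: dedicate cores 2-3, 1 GB of memory, and NIC 03:00.0. */
static int example_launch(void)
{
    struct bmx_resources res = {
        .cpu_mask = (1ULL << 2) | (1ULL << 3),
        .mem_base = 0x100000000ULL,              /* placeholder address */
        .mem_len  = 1ULL << 30,
        .pci_bdf  = (0x03 << 8) | (0x00 << 3) | 0x0,
    };
    if (bmx_vmm_isolate() != 0)
        return -1;
    return bmx_create_bare_process(&res);
}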

Challenge #2: Direct hardware access for bare metal processes I/O devices are among the resources that must be isolated in BMX. Hypervisors have historically provided shared access to physical I/O devices by implementing virtual devices in software that are multiplexed on the physical device. However, this technique seriously impacts performance, both in terms of throughput and latency. Modern CPUs now have hardware support for associating physical hardware devices with one or more virtual machine contexts. For example, Intel's VT-d extension allows for dedicating devices to a virtual machine context by remapping device interrupts to processors within the context and by providing an IOMMU to remap DMA requests to/from the device into memory assigned to the context.

Task: Support for interrupt routing and DMA remapping. BMX will use the Intel VT-d processor extensions to remap hardware devices into bare metal processes when they are created by the BMX VMM. To simplify access to hardware for bare metal applications, we will implement a minimal bare metal process runtime infrastructure called libc-BMX. libc-BMX will be responsible for implementing an interface for registering and executing device interrupt handlers. Bare metal applications can register their interrupt handling routines to ensure low-latency control transfer straight to the application code. libc-BMX will also provide a high-level programmable interface to configure direct memory access between a physical peripheral device and the bare metal application.
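A rough sketch of the runtime interface we envision follows; the names (bmx_irq_register, bmx_dma_map) are hypothetical placeholders for the libc-BMX API to be designed:

/* Hypothetical libc-BMX runtime interface: register an interrupt handler that
 * runs directly inside the bare metal process, and establish an I/O MMU
 * mapping for a DMA buffer. */
#include <stddef.h>
#include <stdint.h>

typedef void (*bmx_irq_handler_t)(uint32_t vector, void *cookie);

/* Route MSI-X vector `vector` of the directly assigned device to `handler`,
 * without exiting to the host kernel or to the BMX separation layer. */
int bmx_irq_register(uint32_t vector, bmx_irq_handler_t handler, void *cookie);

/* Pin `len` bytes at `vaddr` and install an I/O MMU mapping so the assigned
 * device can DMA into the buffer; returns the I/O virtual address to program
 * into the device's descriptor rings. */
uint64_t bmx_dma_map(void *vaddr, size_t len);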

Task: Providing a device driver framework. Performance-critical applications often do not need or want general I/O device mechanisms. For example, a network application may not require a full TCP/IP stack or may have its own specialized stack. However, more than likely, they will not want to implement their own hardware device drivers. To enable this, BMX will provide a device driver framework that allows low-level access to the hardware via a convenient API. One potential starting point for this is the Intel Data Plane Development Kit (DPDK) [38], which provides high-performance network packet processing using a variety of techniques including polling-mode, lock-free device drivers for current Intel NICs. BMX will also explore adding interrupt-mode support (for devices not on the critical path), support for Infiniband or combined Ethernet and Infiniband devices, and support for high-performance storage devices such as PCIe-based solid-state storage devices (SSDs).
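For illustration, a bare metal forwarding application built on this framework would spend its life in a run-to-completion polling loop along the following lines (a sketch using DPDK's burst-mode receive/transmit API; port and memory-pool initialization is omitted):

/* Sketch of a run-to-completion polling loop in the style of a DPDK
 * application; initialization of ports and mbuf pools is omitted. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST 32

static void forward_loop(uint16_t rx_port, uint16_t tx_port)
{
    struct rte_mbuf *pkts[BURST];

    for (;;) {
        /* Poll the NIC: no interrupts, no system calls, no copies. */
        uint16_t n = rte_eth_rx_burst(rx_port, 0, pkts, BURST);
        if (n == 0)
            continue;

        /* Application-specific packet processing happens here. */

        uint16_t sent = rte_eth_tx_burst(tx_port, 0, pkts, n);
        while (sent < n)               /* free anything the NIC did not take */
            rte_pktmbuf_free(pkts[sent++]);
    }
}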

Challenge #3: Backward compatibility Despite the fact that most performance-critical applications are largely independent of the operating system interface in their operational phase, they may still require OS services at other points, most notably during their startup phase. Rather than independently implementing these services in each bare metal application, BMX will instead enable controlled access to the host OS.

Task: Implementing horizontal system calls. BMX will export a traditional system call interface to the bare metal process. This interface, rather than trapping "vertically" to an OS running beneath on the same CPU, instead makes "horizontal" calls to an OS running on a separate CPU. Thus, the system call invocation will not trigger an exit from the BMX VMM context. Furthermore, BMX will implement an exceptionless system call mechanism [70] very similar to that used in the VirtuOS system [54]. This mechanism can provide a high-performance interface as it allows asynchronous calls and batching of calls. The BMX VMM will implement a transparent interface for copying the system call arguments in and out of the address space of the bare metal process in support of this interface.

Figure 2: Internals of a bare metal process with two physical cores and two network interfaces: libc-BMX, Intel DPDK, the BMX network stack, application logic, system call emulation, and cross-core asynchronous communication. Each core runs a simple run-to-completion event loop. Cores communicate with the asynchronous messaging mechanism to implement pipeline parallelism.
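A minimal sketch of the kind of shared-memory request slot that such a horizontal, exceptionless interface could be built on is shown below; the structure layout and names are hypothetical, and the real design will follow the exceptionless mechanisms of [70] and [54]:

/* Hypothetical sketch of an exceptionless, "horizontal" system call entry.
 * The bare metal process fills an entry in a shared-memory region; a kernel
 * worker thread running on a different core polls the region, executes the
 * call, and publishes the result -- no trap, no VM exit. */
#include <stdatomic.h>
#include <stdint.h>

enum bmx_call_status { BMX_FREE, BMX_SUBMITTED, BMX_DONE };

struct bmx_syscall_entry {
    _Atomic int status;      /* BMX_FREE -> BMX_SUBMITTED -> BMX_DONE         */
    uint64_t    nr;          /* Linux system call number                      */
    uint64_t    args[6];     /* arguments; buffers are passed by shared pages */
    int64_t     ret;         /* return value filled in by the kernel worker   */
};

/* Bare-metal side: submit a call and poll (or do other work) until done. */
static int64_t bmx_syscall(struct bmx_syscall_entry *e,
                           uint64_t nr, const uint64_t args[6])
{
    e->nr = nr;
    for (int i = 0; i < 6; i++)
        e->args[i] = args[i];
    atomic_store_explicit(&e->status, BMX_SUBMITTED, memory_order_release);

    while (atomic_load_explicit(&e->status, memory_order_acquire) != BMX_DONE)
        ;                              /* could instead yield to the event loop */
    return e->ret;
}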

3.2 Bare Metal Applications

Challenge #4: Building scalable, low-latency services

Task: Demonstrating BMX with real applications. BMX is intended to provide a convenient environment for developing scalable, low-latency network applications (Figure 2). Part of the proposed work is to port several widely-used applications to run as bare metal BMX processes: the memcached distributed in-memory caching system [3], an OpenFlow controller [50], and the Spark big data analytics engine [80].

Memcached is a distributed, in-memory key-value store [3]. Memcached became a critical building block in many large-scale web installations like LiveJournal, Wikipedia, Flickr, Twitter, and YouTube. To optimize the performance of memcached, which critically depends on the latency of network communication, research and academic installations revert to using low-latency non-Ethernet devices like Infiniband [56]. BMX will try to achieve similarly low latency with bare metal processes and a short packet processing path.

An OpenFlow controller is part of a software-defined network responsible for the initial establishment of every connection in the network [50]. Latency, throughput, and scalability of the controller are critical for the performance of the entire network. In a datacenter with 100 edge switches, a centralized controller should be able to handle 10 million flow requests per second [12]. The most efficient OpenFlow controllers today handle 1.6 million requests per second [75]. We plan to see how this number changes if an OpenFlow controller is optimized as a bare metal application.

Spark is a distributed, in-memory big data analytics framework [80]. The main intuition behind Spark is the abstraction of resilient distributed datasets (RDDs), which is designed to avoid the cross-node communication traditionally required for replication of in-memory data [79]. RDDs allow Spark to outperform similar frameworks like Hive, Hadoop, and MapReduce by a factor of 10-100x [77, 80]. Spark is implemented in Scala, a language which runs on top of a Java virtual machine. The ability to run a complex language runtime inside a bare metal process will test BMX's backward compatibility.

Task: Using cache-friendly data structures. Two research directions are focused on data structures that optimize utilization of CPU caches: cache-oblivious [23, 27] and cache-conscious [19]. Cache-oblivious techniques aim to achieve asymptotically optimal performance for data structures and algorithms on multi-level memory hierarchies, but without considering specific cache and memory parameters. Cache-conscious techniques consider parameters of a specific cache hierarchy. Both techniques rely on similar design principles [18, 19]:

• Clustering: improves the spatial locality of a data structure by packing multiple elements that are likely to be accessed contemporaneously into the same and consecutive cache blocks for more efficient prefetching.

• Coloring: avoids conflict cache misses by dispatching contemporaneously accessed elements into non-conflicting regions of the set-associative cache.

• Compression: reduces the size of a single element in a data structure for cache efficiency. Two common techniques for data structure compression are: 1) pointer elimination, which uses offsets rather than address pointers to reference an object in the heap memory model, and 2) hot and cold structure splitting, which separates a large data structure by putting the small portion that is essential for making decisions on the critical path into a separate part of the data structure.

Many widely used data structures have been redesigned to improve cache friendliness: binary trees [15], B+ trees and their variants [17, 64], linked lists [26], tries [8], and hashtables [9]. BMX will apply these cache-friendly data structure techniques to the design of scalable and efficient applications.
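To make the clustering and hot/cold-splitting principles concrete in the context of the connection-lookup example from Section 1.3, the sketch below (our own illustration; the structure layout is not an existing BMX design) packs the fast-path state of a TCP connection into one 64-byte cache line and stores those records in a flat, open-addressing table, so that the common-case lookup costs roughly one cache miss:

/* Illustrative hot/cold split of per-connection TCP state.  The hot part is
 * sized to one 64-byte cache line and stored in a flat open-addressing hash
 * table keyed by the 4-tuple, so the common-case lookup touches one line. */
#include <stdint.h>

struct conn_hot {                 /* 64 bytes: everything the fast path needs */
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    uint32_t snd_nxt, rcv_nxt;
    uint32_t snd_wnd, rcv_wnd;
    uint8_t  state;               /* 0 means the slot is free                 */
    uint8_t  pad[35];             /* pads the structure to exactly 64 bytes   */
} __attribute__((aligned(64)));

struct conn_cold {                /* rarely touched: timers, options, stats   */
    uint64_t bytes_in, bytes_out;
    uint64_t last_ack_time;
};

#define TABLE_SIZE (1u << 20)     /* power of two; ~64 MB of hot records,
                                     sized down from 10M connections for
                                     illustration                            */

static struct conn_hot  hot_table[TABLE_SIZE];
static struct conn_cold cold_table[TABLE_SIZE];   /* same index as hot_table */

static uint32_t hash4(uint32_t s, uint32_t d, uint16_t sp, uint16_t dp)
{
    uint32_t h = s ^ d ^ (((uint32_t)sp << 16) | dp);
    h ^= h >> 16; h *= 0x7feb352d; h ^= h >> 15;      /* cheap integer mixing */
    return h & (TABLE_SIZE - 1);
}

static struct conn_hot *conn_lookup(uint32_t s, uint32_t d,
                                    uint16_t sp, uint16_t dp)
{
    /* Linear probing keeps colliding entries in adjacent cache lines. */
    for (uint32_t i = hash4(s, d, sp, dp); ; i = (i + 1) & (TABLE_SIZE - 1)) {
        struct conn_hot *c = &hot_table[i];
        if (c->state == 0)
            return NULL;                          /* empty slot: not found */
        if (c->saddr == s && c->daddr == d && c->sport == sp && c->dport == dp)
            return c;
    }
}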

Task: Providing cross-core communication and synchronization. Similar to multicore microkernels, BMX will use a lock-free, asynchronous message passing mechanism for communication inside bare metal applications [10]. We plan to rely on techniques of cache-optimized, lock- and contention-free producer-consumer buffers [29].

Task: Implementing low-overhead scheduling. Modern, highly concurrent network servers like Nginx [73], memcached [3], and Node.js [22] rely on lightweight, event-driven, cooperative scheduling. Compared to even the lightest user-level threads, event-driven architectures reduce the overheads of context switching and the memory footprint to a minimum. Event programming works well with coarse-grained events and simple event pipelines. With complex event patterns, event-driven architectures suffer from the problem of "stack ripping" [5, 43], and require relatively complex language support to make sense of the model [43]. In BMX we will leverage the event-driven programming model for simple packet processing pipelines. BMX will use a run-to-completion model on each core, and pipeline parallelism across cores with the help of an asynchronous, nonblocking message passing mechanism.


Challenge #5: Achieving a one microsecond latency Our goal is to achieve a one microsecond one-way communication latency with BMX. We plan to explore both legacy TCP/IP protocols and experimental network protocols designed specifically to reduce communication latency in a datacenter environment. For legacy applications that require a socket interface, BMX will provide a backward-compatible socket abstraction, as well as support for applications designed for a messaging interface like MPI.

Larsen et al. [45] present a detailed analysis of the send and receive latencies in an Ethernet-based TCP/IP stack, which we can use as a basis for understanding the latency achievable by the BMX architecture. Larsen et al. measure the end-to-end latency of sending a 1-byte TCP packet over 1 Gbps and 10 Gbps Ethernet interfaces. In their experiment, two machines were connected back-to-back with a 50 meter cable. The Ethernet stack achieves an end-to-end latency of 14 µs for the 1 Gbps and 10.2 µs for the 10 Gbps devices. On the 1 Gbps link, TCP/IP overhead is only 1.3 µs, physical propagation delay is 720 ns, and the overheads of device processing are estimated to be 1.7 µs. We can expect a reduction of device processing overheads as device clock speeds go up (the measured Intel 82571 1 Gbps NIC runs at only 135 MHz). If two machines are connected back to back, we can expect to achieve a communication latency of less than 3 µs, or 1.7 µs if the TCP/IP protocol is replaced with a lighter, purpose-built network protocol.
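Summing the components reported for the 1 Gbps case (our own back-of-the-envelope reading of those numbers) shows where the remaining budget lies:

\[
\underbrace{1.3\,\mu\text{s}}_{\text{TCP/IP}} + \underbrace{0.72\,\mu\text{s}}_{\text{propagation}} + \underbrace{1.7\,\mu\text{s}}_{\text{device}} \approx 3.7\,\mu\text{s},
\]

so faster NICs that shrink the device component bring the end-to-end figure below 3 µs, and replacing TCP/IP with a lighter protocol removes most of the remaining software overhead, approaching the 1.7 µs estimate above.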

Task: Designing a low-latency network path. BMX implements key principles of low-latency networking. With bare metal processes BMX implements kernel bypass: bare metal processes can directly communicate with the network interface through the PCIe interface. BMX will use Data Direct I/O [35], rather than DMA to main memory, for both send and receive paths; send and receive operations will be triggered with a single PCIe transaction bringing data straight to the L3 cache. BMX will use techniques of careful cache management to increase the chances of packet data remaining in the cache from the point it is received from the NIC to the moment it is processed by the process. BMX will experiment with simple communication protocols optimized for low latency in a datacenter environment. One possible direction is to experiment with software-defined networking to reduce the amount of per-packet source and destination information. Finally, BMX will rely on polling device drivers to reduce receive latency and overhead.

Challenge #6: Scaling to 10 million concurrent connections

Task: Providing a scalable, user-level TCP stack. Many applications are, and will continue to be, built on standard network protocols, in particular TCP/IP. For this reason, BMX will include a user-level TCP stack that presents a familiar socket API. The proposed work will likely adapt an existing network stack, of which there are a number of possibilities.

OpenOnload [61] is a high-performance TCP/UDP/IP stack specifically designed for user-mode operation under Linux. It is designed to co-exist with the Linux network stack and to work in the face of destructive user-mode system calls such as fork and exec, features that are not needed in BMX.

Another possibility would be to wrap the Linux networking stack, as has been done previously [25]. This approach has the advantage of providing a network stack that is under constant active development and will likely always be at the forefront of new features and protocols. For example, it supports DCTCP [7], a modification to TCP for low-latency communication in datacenters. The disadvantages are that it is considerable work to isolate and "wrap" the code, and the result is likely to be larger and more general-purpose than the other alternatives.



Finally, BMX could use Click [42], a modular software architecture for constructing routers and other network middleboxes. Click elements can be composed into graphs to provide the desired functionality. Click includes standard elements for a wide variety of network protocols, including TCP/UDP/IP. The explicit structure of the packet processing makes Click compositions amenable to concurrent processing on multiple cores.

4 Broader Impacts

We expect that the broadest impacts from BMX will come from the fact that it will enable new kinds of datacenter applications, which will in turn have impacts outside the field of computer science. By bringing direct access to hardware through bare metal processes, BMX will form a foundation for the development of a new generation of applications that process data more efficiently using less computing hardware. The potential impacts of this are large: it will reduce the costs associated with large datacenter applications, including not only computing costs themselves, but all of the ancillary costs, including space, administration, etc. In turn, this will have significant "green" effects, as less power and less cooling are required. BMX will be particularly well suited to applications that process "big data," as it provides a foundation for building highly efficient, highly scalable, bulk data processing pipelines. As has been seen recently, applications of big data processing can have significant societal effects on public health, disaster detection and response, product design, and social networking.

BMX will be developed in the Linux kernel, and our implementation will be open source; this provides a direct path for our work to be adopted by the broader research community, to be used in education, and to be adopted by commercial entities.

5 Timeline and Management Plan

PI Burtsev will guide the overall direction of the project and will be the development lead. Co-PI Regehr will provide broad expertise in the areas of performance tuning, cache-friendly data structures, and scalability. PI Burtsev is a member of the research staff in the Flux Research Group, an established research group with over 15 years of work on systems and networking. Burtsev will do much of the day-to-day management and mentoring of the graduate students that we recruit for this work.

Year 1 The primary focus of the first year is to produce a working prototype of BMX that can host multiple bare metal processes and demonstrate the basic technologies. Specific milestones include: (a) initial design and implementation of the BMX VMM and the bare metal process abstraction; (b) support for interrupt routing and DMA remapping; (c) implementation of an initial device driver framework, probably DPDK ported to run in BMX; (d) a prototype of the horizontal syscall mechanism; (e) an initial demonstration of low-latency mechanisms including cache-friendly data structures, cross-core communication, and low-overhead scheduling; and (f) integration and evaluation of a "proof of concept" TCP/IP stack, probably a port of OpenOnload.

Year 2 In the second year, we will produce a robust implementation and evaluate the benefits of bare metal processes. We will also demonstrate the practicality of BMX by running real applications. Milestones include: (a) fill out the device driver framework, including support for Infiniband and storage devices; (b) complete the backward-compatibility layer, including syscall redirection and libc emulation; (c) provide cross-core communication/synchronization and low-overhead scheduling mechanisms; (d) produce a high-performance TCP/IP stack; and (e) demonstrate low-latency networking with our target applications.

Year 3 In the final year we will continue to refine the platform and strive to achieve the latency and scaling goals. Milestones include: (a) implement, and evaluate the impact of, cache-friendly data structures and other low-overhead features throughout the BMX system; and (b) explore new low-latency network protocols and new applications of the system.

6 Results from Prior NSF Support

The most closely related prior NSF support for both PI Burtsev and co-PI Regehr is NSF award CNS-1319076, "XCap: Practical Capabilities and Least Authority for Virtualized Environments," October 2013–September 2016, $499,912.

XCap is a system designed to provide a practical environment for enforcing strong isolation, fine-grained access control, and the principle of least authority for unmodified, untrusted, off-the-shelf applications and operating system services [53, 65]. XCap builds on two principles: strong isolation and secure collaboration. To isolate applications and components implementing system services, XCap relies on hardware-level, full-system virtualization. The interface exported by a hypervisor, the x86 instruction set, has far simpler semantics than the system call interface exported by an operating system. Each application starts in an isolated virtual machine, and runs a fresh, private copy of an operating system kernel. All sharing between virtual machines is mediated by a capability access control model that explicitly describes the external resources available to the application and also the ways in which the application can interact with the rest of the system. In XCap, capabilities serve as a general foundation for constructing least-privilege services out of existing components of the traditional operating system stack. Furthermore, capabilities enable formal reasoning about the authority of individual applications, and of the system overall.


References cited

[1] cgroups. http://en.wikipedia.org/wiki/Cgroups.

[2] Erlang on Xen. http://erlangonxen.org/.

[3] Memcached, 2013. http://memcached.org/.

[4] 6WIND. 6WINDGate Protocols and Management. http://www.6wind.com/products/6windgate-protocols.

[5] A. Adya, J. Howell, M. Theimer, W. J. Bolosky, and J. R. Douceur. Cooperative task management without manual stack management. In Proceedings of the General Track of the Annual Conference on USENIX Annual Technical Conference, ATEC '02, pages 289–302, Berkeley, CA, USA, 2002. USENIX Association.

[6] H. Akkan, M. Lang, and L. M. Liebrock. Stepping towards noiseless Linux environment. In Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers, ROSS '12, pages 7:1–7:7, New York, NY, USA, 2012. ACM.

[7] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan. Data Center TCP (DCTCP). In Proceedings of the ACM SIGCOMM 2010 Conference, SIGCOMM '10, pages 63–74, New York, NY, USA, 2010. ACM.

[8] N. Askitis and R. Sinha. HAT-trie: A Cache-conscious Trie-based Data Structure for Strings. In Proceedings of the Thirtieth Australasian Conference on Computer Science, ACSC '07, pages 97–105. Australian Computer Society, Inc., 2007.

[9] N. Askitis and J. Zobel. Cache-Conscious Collision Resolution in String Hash Tables. In Proceedings of the 12th International Conference on String Processing and Information Retrieval, SPIRE '05, pages 91–102. Springer-Verlag, 2005.

[10] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schupbach, and A. Singhania. The Multikernel: A New OS Architecture for Scalable Multicore Systems. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP '09, pages 29–44, New York, NY, USA, 2009. ACM.

[11] Beckman et al. ZeptoOS project, 2013. http://www.mcs.anl.gov/research/projects/zeptoos/.

[12] T. Benson, A. Akella, and D. A. Maltz. Network traffic characteristics of data centers in the wild. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, IMC '10, pages 267–280, New York, NY, USA, 2010. ACM.

[13] S. Boyd-Wickizer, A. T. Clements, Y. Mao, A. Pesterev, M. F. Kaashoek, R. Morris, and N. Zeldovich. An Analysis of Linux Scalability to Many Cores. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI '10, pages 1–8, Berkeley, CA, USA, 2010. USENIX Association.


[14] R. Brightwell, A. Maccabe, and R. Riesen. On the appropriateness of commodity operating systems for large-scale, balanced computing systems. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), 2003.

[15] G. S. Brodal, R. Fagerberg, and R. Jacob. Cache Oblivious Search Trees via Binary Trees of Small Height. In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '02, pages 39–48. Society for Industrial and Applied Mathematics, 2002.

[16] Bromium, Inc. Bromium vSentry. http://www.bromium.com/sites/default/files/Bromium-Whitepaper-vSentry_2.pdf.

[17] S. K. Cha, S. Hwang, K. Kim, and K. Kwon. Cache-Conscious Concurrency Control of Main-Memory Indexes on Shared-Memory Multiprocessor Systems. In Proceedings of the 27th International Conference on Very Large Data Bases, VLDB '01, pages 181–190. Morgan Kaufmann Publishers Inc., 2001.

[18] T. Chilimbi, M. Hill, and J. Larus. Making Pointer-based Data Structures Cache Conscious. Computer, 33(12):67–74, 2000.

[19] T. M. Chilimbi. Cache-Conscious Data Structures: Design and Implementation. PhD thesis, 1999.

[20] A. T. Clements, M. F. Kaashoek, and N. Zeldovich. Scalable Address Spaces Using RCU Balanced Trees. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVII, pages 199–210, New York, NY, USA, 2012. ACM.

[21] A. T. Clements, M. F. Kaashoek, N. Zeldovich, R. T. Morris, and E. Kohler. The scalable commutativity rule: Designing scalable software for multicore processors. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 1–17, New York, NY, USA, 2013. ACM.

[22] R. Dahl. Node.js. http://nodejs.org.

[23] E. D. Demaine. Cache-Oblivious Algorithms and Data Structures. Lecture Notes from the EEF Summer School on Massive Data Sets, pages 1–29, 2002.

[24] eBay, Inc. eBay Digital Service Efficiency (DSE) Dashboard, 2013. http://tech.ebay.com/dashboard.

[25] B. Ford, G. Back, G. Benson, J. Lepreau, A. Lin, and O. Shivers. The Flux OSKit: A Substrate for Kernel and Language Research. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, SOSP '97, pages 38–51, New York, NY, USA, 1997. ACM.

[26] L. Frias, J. Petit, and S. Roura. Lists Revisited: Cache Conscious STL Lists. In Experimental Algorithms, volume 4007 of Lecture Notes in Computer Science, pages 121–133. Springer Berlin Heidelberg, 2006.

[27] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-Oblivious Algorithms. In 40th Annual Symposium on Foundations of Computer Science, pages 285–297. IEEE, 1999.


[28] Galois, Inc. Haskell Lightweight Virtual Machine (HaLVM). http://corp.galois.com/halvm.

[29] J. Giacomoni, T. Moseley, and M. Vachharajani. FastForward for efficient pipeline parallelism: A cache-optimized concurrent lock-free queue. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’08, pages 43–52, New York, NY, USA, 2008. ACM.

[30] M. Giampapa, T. Gooding, T. Inglett, and R. W. Wisniewski. Experiences with a Lightweight Supercomputer Kernel: Lessons Learned from Blue Gene’s CNK. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’10, pages 1–10, Washington, DC, USA, 2010. IEEE Computer Society.

[31] A. Gordon, N. Amit, N. Har’El, M. Ben-Yehuda, A. Landau, D. Tsafrir, and A. Schuster. ELI: Bare-Metal Performance for I/O Virtualization. In ACM Architectural Support for Programming Languages & Operating Systems (ASPLOS), 2012.

[32] R. Graham. The C10M problem. http://c10m.robertgraham.com/p/manifesto.html.

[33] InfiniBand Trade Association. About InfiniBand. http://www.infinibandta.org/content/pages.php?pg=about_us_infiniband.

[34] Intel Corporation. Intel Many Integrated Core Architecture (Intel MIC Architecture) – Advanced. http://www.intel.com/content/www/us/en/architecture-and-technology/many-integrated-core/intel-many-integrated-core-architecture.html.

[35] Intel. Intel Data Direct I/O Technology (Intel DDIO): A Primer, 2012. http://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/data-direct-i-o-technology-brief.pdf.

[36] Intel. Impressive Packet Processing Performance Enables Greater Workload Consolidation, 2013. http://www.intel.com/content/dam/www/public/us/en/documents/solution-briefs/communications-packet-processing-brief.pdf.

[37] Intel. Intel Data Plane Development Kit (Intel DPDK) Overview: Packet Processing on Intel Architecture, 2013. http://www.intel.com/content/dam/www/public/us/en/documents/presentation/dpdk-packet-processing-ia-overview-presentation.pdf.

[38] Intel. Intel DPDK: Data Plane Development Kit, 2013. http://dpdk.org.

[39] L. Kaplan. Cray CNL. In FastOS PI Meeting and Workshop, 2007.

[40] J. Kiszka et al. Jailhouse project, 2013. https://docs.google.com/file/d/0B6HTUUWSPdd-Zl93MVhlMnRJRjg/edit?pli=1.

[41] B. Kocoloski and J. Lange. Better than native: using virtualization to improve compute node performance. In Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers, ROSS ’12, pages 8:1–8:8, New York, NY, USA, 2012. ACM.

[42] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F. Kaashoek. The Click Modular Router. ACM Transactions on Computer Systems, 18(3):263–297, Aug. 2000.

[43] M. Krohn, E. Kohler, and M. F. Kaashoek. Events can make sense. In Proceedings of the USENIX Annual Technical Conference, ATC ’07, pages 7:1–7:14. USENIX Association, 2007.

[44] J. Lange, K. Pedretti, T. Hudson, P. Dinda, Z. Cui, L. Xia, P. Bridges, A. Gocke, S. Jaconette, M. Levenhagen, and R. Brightwell. Palacios and Kitten: New high performance operating systems for scalable virtualized and native supercomputing. In 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pages 1–12, 2010.

[45] S. Larsen, P. Sarangam, R. Huggahalli, and S. Kulkarni. Architectural breakdown of end-to-end latency in a TCP/IP network. Int. J. Parallel Program., 37(6):556–571, Dec. 2009.

[46] C. Leber, B. Geib, and H. Litz. High frequency trading acceleration using FPGAs. In 2011 International Conference on Field Programmable Logic and Applications (FPL), pages 317–322. IEEE, 2011.

[47] H. Litz, H. Froening, M. Nuessle, and U. Bruening. VELO: A novel communication engine for ultra-low latency message transfers. In 37th International Conference on Parallel Processing (ICPP ’08), pages 238–245, 2008.

[48] J. W. Lockwood, A. Gupte, N. Mehta, M. Blott, T. English, and K. Vissers. A low-latency library in FPGA hardware for High-Frequency Trading (HFT). In 2012 IEEE 20th Annual Symposium on High-Performance Interconnects (HOTI), pages 9–16. IEEE, 2012.

[49] A. Madhavapeddy, R. Mortier, C. Rotsos, D. Scott, B. Singh, T. Gazagnaire, S. Smith, S. Hand, and J. Crowcroft. Unikernels: Library operating systems for the cloud. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’13, pages 461–472, New York, NY, USA, 2013. ACM.

[50] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: Enabling innovation in campus networks. SIGCOMM Comput. Commun. Rev., 38(2):69–74, Mar. 2008.

[51] J. McKie and J. Floren. Edging towards exascale with NIX. In Workshop on Exascale Operating Systems and Runtime Software, 2012.

[52] R. G. Minnich, M. J. Sottile, S.-E. Choi, E. Hendriks, and J. McKie. Right-weight kernels: an off-the-shelf alternative to custom light-weight kernels. SIGOPS Oper. Syst. Rev., 40(2):22–28, Apr. 2006.

[53] Y. Naik. Xen-Cap: a capability framework for Xen, 2013. MSc Thesis Report, University of Utah. Also available as: http://www.cs.utah.edu/~aburtsev/doc/yathi-xcap-ms-report-2013.pdf.

[54] R. Nikolaev and G. Back. VirtuOS: An operating system with kernel virtualization. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, pages 116–132, New York, NY, USA, 2013. ACM.

[55] ntop. PF_RING: High-speed packet capture, filtering and analysis. http://www.ntop.org/products/pf_ring/.

[56] J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazieres, S. Mitra, A. Narayanan, D. Ongaro, G. Parulkar, M. Rosenblum, S. M. Rumble, E. Stratmann, and R. Stutsman. The Case for RAMCloud. Communications of the ACM, 54(7):121–130, July 2011.

[57] Y. Park, E. V. Hensbergen, M. Hillenbrand, T. Inglett, B. Rosenburg, K. D. Ryu, and R. W. Wisniewski. FusedOS: Fusing LWK performance with FWK functionality in a heterogeneous environment. In Symposium on Computer Architecture and High Performance Computing, pages 211–218, 2012.

[58] A. Pesterev, J. Strauss, N. Zeldovich, and R. T. Morris. Improving network connection locality on multicore systems. In Proceedings of the 7th ACM European Conference on Computer Systems, EuroSys ’12, pages 337–350, New York, NY, USA, 2012. ACM.

[59] S. Peter and T. Anderson. Arrakis: The operating system as control plane. USENIX ;login, 38(4):44–47, Aug. 2013.

[60] F. Petrini, D. J. Kerbyson, and S. Pakin. The case of the missing supercomputer performance: Achieving optimal performance on the 8,192 processors of ASCI Q. In Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, SC ’03, pages 55–, New York, NY, USA, 2003. ACM.

[61] S. Pope and D. Riddoch. Introduction to OpenOnload—Building Application Transparency and Protocol Conformance into Application Acceleration Middleware. https://support.solarflare.com/index.php?option=com_cognidox&file=SF-105918-CD-1_Introduction_to_OpenOnload_White_Paper.pdf&task=download&format=raw&Itemid=17.

[62] D. E. Porter, S. Boyd-Wickizer, J. Howell, R. Olinsky, and G. C. Hunt. Rethinking the library OS from the top down. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 291–304, New York, NY, USA, 2011. ACM.

[63] QLogic. QLogic TrueScale QLE7300 Series InfiniBand Adapters. http://www.qlogic.com/Resources/Documents/DataSheets/Adapters/DataSheet_QLE7300Series.pdf.

[64] J. Rao and K. A. Ross. Making B+-Trees Cache Conscious in Main Memory. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD ’00, pages 475–486. ACM, 2000.

[65] J. Regehr (PI) and A. Burtsev (co-PI). National Science Foundation Secure and Trustworthy Cyberspace Program. XCap: Practical Capabilities and Least Authority for Virtualized Environments. Award CNS-1319076, October 2013–September 2016. $499,912. Also available as: http://www.cs.utah.edu/~aburtsev/doc/2012-nsf-xencap.pdf.

[66] B. Rhoden, K. Klues, D. Zhu, and E. Brewer. Improving per-node efficiency in the datacenter with new OS abstractions. In Proceedings of the 2nd ACM Symposium on Cloud Computing, SOCC ’11, pages 25:1–25:8, New York, NY, USA, 2011. ACM.

[67] L. Rizzo and M. Landi. Netmap: Memory mapped access to network devices. In Proceedings of the ACM SIGCOMM 2011 Conference, SIGCOMM ’11, pages 422–423, New York, NY, USA, 2011. ACM.

[68] J. Sacha, J. Napper, S. Mullender, and J. McKie. Osprey: Operating system for predictable clouds. In 2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W), pages 1–6, 2012.

[69] M. Sato, G. Fukazawa, K. Nagamine, R. Sakamoto, M. Namiki, K. Yoshinaga, Y. Tsujita, A. Hori, and Y. Ishikawa. A design of hybrid operating system for a parallel computer with multi-core and many-core processors. In Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers, ROSS ’12, pages 9:1–9:8, New York, NY, USA, 2012. ACM.

[70] L. Soares and M. Stumm. FlexSC: Flexible system call scheduling with exception-less system calls. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI ’10, pages 1–8, Berkeley, CA, USA, 2010. USENIX Association.

[71] Solarflare. OpenOnload, 2008. http://www.openonload.org/.

[72] J. Stoess, J. Appavoo, U. Steinberg, A. Waterland, V. Uhlig, and J. Kehne. A light-weight virtual machine monitor for Blue Gene/P. In Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers, ROSS ’11, pages 3–10, New York, NY, USA, 2011. ACM.

[73] I. Sysoev. Nginx HTTP server. http://nginx.org.

[74] S. Thibault and T. Deegan. Improving performance by embedding HPC applications in lightweight Xen domains. In Proceedings of the 2nd Workshop on System-level Virtualization for High Performance Computing, HPCVirt ’08, pages 9–15, New York, NY, USA, 2008. ACM.

[75] A. Tootoonchian, S. Gorbunov, Y. Ganjali, M. Casado, and R. Sherwood. On controller performance in software-defined networks. In Proceedings of the 2nd USENIX Conference on Hot Topics in Management of Internet, Cloud, and Enterprise Networks and Services, Hot-ICE ’12, pages 10–10, Berkeley, CA, USA, 2012. USENIX Association.

[76] Wind River. Wind River Application Acceleration Engine. http://www.windriver.com/products/product-overviews/wr-application-acceleration-engine-product-overview.pdf.

[77] R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: SQL and rich analytics at scale. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD ’13, pages 13–24, New York, NY, USA, 2013. ACM.

[78] YouTube Official Blog. Y,000,000,000uTube, 2009. http://youtube-global.blogspot.com/2009/10/y000000000utube.html.

[79] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI ’12, pages 2–2, Berkeley, CA, USA, 2012. USENIX Association.

[80] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud ’10, pages 10–10, Berkeley, CA, USA, 2010. USENIX Association.
