
10th USENIX Symposium on Networked Systems Design and Implementation (NSDI '13)

EyeQ: Practical Network Performance Isolation at the Edge

Vimalkumar Jeyakumar¹, Mohammad Alizadeh¹,², David Mazières¹, Balaji Prabhakar¹, Changhoon Kim³, and Albert Greenberg³

¹Stanford University  ²Insieme Networks  ³Windows Azure

Abstract

The datacenter network is shared among untrusted tenants in a public cloud, and hundreds of services in a private cloud. Today we lack fine-grained control over network bandwidth partitioning across tenants. In this paper we present EyeQ, a simple and practical system that provides tenants with bandwidth guarantees as if their endpoints were connected to a dedicated switch. To realize this goal, EyeQ leverages the high bisection bandwidth in a datacenter fabric and enforces admission control on traffic, regardless of the tenant transport protocol. We show that this pushes bandwidth contention to the network's edge, enabling EyeQ to support end-to-end minimum bandwidth guarantees to tenant endpoints in a simple and scalable manner at the servers. EyeQ requires no changes to applications and is deployable with support from the network available today. We evaluate EyeQ with an efficient software implementation at 10Gb/s speeds using unmodified applications and adversarial traffic patterns. Our evaluation demonstrates EyeQ's promise of predictable network performance isolation. For instance, even with an adversarial tenant with bursty UDP traffic, EyeQ is able to maintain the 99.9th percentile latency for a collocated memcached application close to that of a dedicated deployment.

1 Introduction

In the datacenter, we seek to virtualize the network for its tenants, as has been done for compute and storage. Ideally, a tenant running on shared physical infrastructure should see the same range of control- and data-path capabilities on its virtual network as it would see on a dedicated physical network. This vision has been in full swing for some years in the control plane [1, 2]. An early innovator in the control plane was Amazon Web Services, where a tenant can create a "Virtual Private Cloud" [1] with their own IP addresses without interfering with other tenants. In the data plane, there has been little comparable progress.

To make comparable progress, we posit that the provider should present a simple performance abstraction of a dedicated switch connecting a tenant's endpoints [3], independent of the underlying physical topology. The endpoints may be anywhere in the datacenter, but a tenant should be able to attain full line rate for any traffic pattern between its endpoints, constrained only by endpoint capacities. Bandwidth assurances to this tenant should suffer no negative impact from the behavior and churn of other tenants in the datacenter. This abstraction has been a consistent ask of enterprise customers considering moving to the cloud, as the enterprise mission demands a high degree of infrastructure predictability [4].

Is this abstraction realizable? EyeQ, described in this paper, attempts to deliver this abstraction for every tenant. This requires three key components, of which EyeQ provides the final missing piece.

First, with little to no knowledge of tenant communication patterns, promising bandwidth guarantees to endpoints requires smart endpoint placement in a network with adequate capacity (for the worst case). Hence, topologies with bottlenecks between server-server ("east-west") traffic are undesirable. Fortunately, recent proposals [5, 6, 7] have demonstrated cost-effective means of building "high bisection bandwidth" network topologies. These topologies are realizable in practice (§2.3), and substantially lower the complexity of endpoint placement (§3.5), as server-server capacity is more uniform. Second, utilizing this high bisection bandwidth requires effective traffic load balancing schemes to mitigate network hotspots. While today's routing protocols (e.g., Equal-Cost Multi-Path [8]) do a reasonable job of utilizing available capacity, there have been continuous improvements on this front [9, 10, 11].

Figure 1: EyeQ's sender module (SEM) and receiver module (REM) work in a distributed fashion by exchanging feedback messages, to iteratively converge to bandwidth guarantees.

The third (and missing) piece is a bandwidth arbitration mechanism that schedules tenant flows in accordance with the bandwidth guarantees, even with misbehaving or malicious tenants. Today, TCP's congestion control shares bandwidth equally across flows and is agnostic to tenant requirements; it thus falls short of predictably sharing bandwidth across tenants.

EyeQ, the main contribution of this paper, is a programmable bandwidth arbitration mechanism for the datacenter network. Our design is based on the key insight that by relieving the network's core of persistent congestion, we can partition bandwidth in a simple and distributed manner, completely at the edge. EyeQ uses server-to-server congestion control mechanisms to partition bandwidth locally at senders and receivers. The design is highly scalable and responsive, and ensures bandwidth guarantees are met even in the presence of highly volatile traffic patterns. The congestion control mechanism pushes overloads back to the sources, while draining traffic at maximal rates. This ensures that network bandwidth is not wasted.

The EyeQ model. EyeQ allows administrators to configure a minimum and a maximum bandwidth for a VM's Virtual Network Interface Card (vNIC). The lower bound on bandwidth permits a work-conserving allocation among vNICs collocated on a physical machine.

The EyeQ arbitration mechanism. We explain the distributed mechanism using the example shown in Figure 1. VMs of tenants A and B are given minimum bandwidth guarantees of 2Gb/s and 8Gb/s respectively. The first network flow F1 starts at VM A1, destined for A2. In the absence of contention, it is allocated the full line rate of 10Gb/s. While F1 is in progress, a second flow F2 starts at B1, destined for B2, creating congestion at server j. The Receiver EyeQ Module (REM) at j detects this contention for bandwidth, and uses end-to-end feedback to rate limit F1 to 2Gb/s and F2 to 8Gb/s. Now, suppose flow F3 starts at VM B1, destined for B3. The Sender EyeQ Module (SEM) at server i′ partitions its link bandwidth equally between F2 and F3. Since this lowers the rate of F2 at server j to 5Gb/s, the REM at j will allocate the spare 3Gb/s of bandwidth to F1 through subsequent feedback. In this way, EyeQ schedules bandwidth across the network in a distributed, iterative fashion, simultaneously maximizing utilization and meeting bandwidth guarantees.

EyeQ is practical; the SEM and REM shim layers enforce traffic admission control without awareness of application traffic demands, traffic patterns, or transport protocol behavior (TCP/UDP), and without requiring any more support from network switches than what is already available today. We demonstrate this through extensive evaluations on real applications.

In summary, our main contributions are:

• The design of EyeQ, which simultaneously achieves predictable and work-conserving bandwidth arbitration in a scalable fashion, completely from the network edge (host network stack, hypervisor, or NIC).

• An open implementation of EyeQ in software that scales to high line rates.

• An evaluation of EyeQ's feasibility at 10Gb/s on real applications.

The rest of the paper is organized as follows. We describe the nature of EyeQ's guarantees and discuss insights about network contention from a production cluster (§2) that motivate our design. We then delve into the design (§3), our software implementation (§4), and evaluation (§5) using micro- and macro-benchmarks. We summarize related work (§6) and conclude (§7).

We are committed to making our work easily available for reproducibility. Our implementation and evaluation scripts are online at http://jvimal.github.com/eyeq.

2 Predictable Bandwidth Partitioning

The goal of EyeQ is to schedule network traffic across a datacenter network such that it meets tenant endpoint bandwidth guarantees over short intervals of time (e.g., a few milliseconds). In this section, we define this notion of bandwidth guarantees more precisely and explain why bandwidth guarantees need to be met over short timescales. Then, we describe the key insight that makes EyeQ's simple design possible: the fact that the network's core in today's high bisection bandwidth datacenter networks can be kept free of persistent congestion. We show measurements from a Windows Azure production storage cluster that validate this claim.


Figure 2: (a) Topology setup: TCP and UDP tenants share a 10Gb/s link, provisioned 6Gb/s and 3Gb/s respectively. (b) UDP burstiness at small timescales. (c) Latency of TCP request-response with and without EyeQ. The UDP tenant bursts for 5ms and sleeps for 15ms, which even at 100ms timescales seems benign (2.5Gb/s, or 25% utilization). These finer-timescale interactions place a stringent performance requirement on the reaction times of any isolation mechanism, and mechanisms that react at large timescales may not see the big picture. EyeQ rate limits UDP over short timescales and improves the TCP tenant's median latency by over 10x. (There were no TCP timeouts in this experiment.)

2.1 EyeQ's bandwidth guarantees

EyeQ provides bandwidth guarantees for every endpoint (e.g., VM NIC) in order to mimic the performance of a dedicated switch for each tenant. The bandwidth guarantee for each endpoint is configured at provision time. The endpoints should be able to attain their guaranteed bandwidth as long as their traffic does not oversubscribe any endpoint's capacity.¹ For instance, if N VMs, each with 1Gb/s capacity, attempt to send traffic at full rate to a single 1Gb/s receiver, EyeQ only guarantees that the receiver (in aggregate) receives 1Gb/s of traffic. The excess traffic is dropped at the senders. Hence, EyeQ enforces traffic admissibility, promises bandwidth guarantees only at the bottleneck port (the receiver), and allocates bandwidth across senders in a max-min fashion.

¹This constraint is identical to what would occur with a dedicated switch, and is sometimes referred to as a hose constraint [12].
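To make the max-min allocation at a bottleneck receiver concrete, the following toy water-filling computation splits one receiver's capacity across senders with unequal demands. This is a standalone illustration of the sharing policy, not code from EyeQ, and the demand values are hypothetical.

```c
/* Water-filling: repeatedly give each unsatisfied sender an equal share
 * of the remaining capacity, fully satisfying any demand that fits. */
#include <stdio.h>

int main(void)
{
    double demand[] = {0.2, 0.5, 3.0, 4.0};   /* Gb/s, hypothetical */
    double alloc[4] = {0};
    double cap = 1.0;                          /* receiver capacity (Gb/s) */
    int n = 4, satisfied = 0;

    while (satisfied < n && cap > 1e-9) {
        double share = cap / (n - satisfied);
        int progressed = 0;
        for (int i = 0; i < n; i++) {
            if (alloc[i] < demand[i] && demand[i] - alloc[i] <= share) {
                cap -= demand[i] - alloc[i];   /* small demand: satisfy fully */
                alloc[i] = demand[i];
                satisfied++;
                progressed = 1;
            }
        }
        if (!progressed) {                     /* split remainder equally */
            for (int i = 0; i < n; i++)
                if (alloc[i] < demand[i])
                    alloc[i] += share;
            cap = 0;
        }
    }
    for (int i = 0; i < n; i++)
        printf("sender %d: %.3f Gb/s\n", i, alloc[i]);
    return 0;
}
```

Here the 0.2Gb/s demand is fully satisfied, and the remaining 0.8Gb/s is split equally among the other three senders.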

There are different notions of bandwidth guarantees, ranging from exact rate and delay guarantees at the level of individual packets [13], to approximate "average rate" guarantees [14] over an acceptable interval of time. As observed in prior work [14], exact rate guarantees require per-endpoint queues and precise packet scheduling mechanisms [15, 16, 17] at every switch in the network. Such mechanisms are expensive to implement and are not available in today's switches at a scale that can isolate thousands of tenants and millions of VMs [18]. Hence, with EyeQ, we strive to attain average rate guarantees over an interval of time that is as short as possible.

2.2 Rate guarantees at short timescales

Datacenter traffic has been found to be highly volatile and bursty [5, 19, 20], leading to interactions at short timescales of a few milliseconds that adversely impact flow throughput and tail latency [21, 22, 23]. This is exacerbated by high network speeds and the use of shallow-buffered commodity switches in datacenters. Today, a single large TCP flow is capable of causing congestion on its path in a matter of milliseconds, exhausting switch buffers [21, 24]. We refer the reader to [25] and our prior work [26], which demonstrate how bursty packet losses can adversely affect TCP's throughput.

The prior demonstrations highlight an artifact of TCP's behavior, but such interactions can also affect end-to-end latency, regardless of the transport protocol. To see this, consider the multi-tenant setting with a TCP and a UDP tenant shown in Figure 2(a). Two VMs (one of each tenant) collocated on a physical machine receive traffic from their tenant's VMs on other machines. Assume an administrator divides 9Gb/s of the access link bandwidth (at the receiver) between the TCP and UDP tenants in the ratio 2:1 (the spare 1Gb/s, or 10% of bandwidth, is reserved as headroom to ensure good latency [22]). The UDP tenant transmits at an average rate of 2.5Gb/s by bursting in an ON-OFF fashion (ON at 10Gb/s for 5ms, OFF for 15ms). The TCP client issues back-to-back 1-byte requests over one connection and receives 1-byte responses from the server collocated with the UDP tenant.

Figure 2(c) shows the latency distribution of the TCP tenant with and without EyeQ. Though the average throughput of the UDP tenant (2.5Gb/s) is less than its allocated 3Gb/s, the TCP tenant's median and 99th percentile latency increase by over 10x. This is caused by the congestion the UDP tenant creates during its 5ms bursts at line rate, as shown in Figure 2(b). When EyeQ is enabled, it cuts off UDP's bursts at short timescales. We see that the latency with EyeQ is about 55µs, which is close to the bare-metal performance we saw when running the TCP tenant without the UDP tenant.

Figure 3: Emerging datacenter network architectures are over-subscribed only at the Top-of-Rack switches, which are connected to a spine layer that offers uniform bandwidth between racks. The over-subscription ratio is typically less than 3.

Thus, any mechanism that only looks at average utilization over large timescales (e.g., over 100 milliseconds) fails to see the contention happening at finer timescales, which can substantially degrade both bandwidth and latency for contending applications. To address this challenge, EyeQ operates in the dataplane in a distributed fashion, and uses a responsive rate-based congestion control mechanism based on the Rate Control Protocol [27]. This enables EyeQ to react to congestion within 100s of microseconds. We believe this timescale is short enough to at least protect tenants from persistent packet drops caused by other tenants, as most datacenter switches have a few milliseconds' worth of buffering.

2.3 The Fat and Flat Datacenter Network

In this section, we show how the high bisection bandwidth network architecture of a datacenter can simplify the task of bandwidth arbitration, which essentially boils down to managing network congestion wherever it occurs. In a flat datacenter network with little to no flow aggregation, congestion can occur everywhere, and therefore switches would need to be aware of thousands of tenants. Configuring every switch as tenants and their VMs come and go is unrealistic. We therefore investigate where congestion actually occurs in a datacenter.

An emerging trend in datacenter network architecture is that over-subscription, typically less than 3:1, exists only at the Top-of-Rack (ToR) switches (Figure 3). Beyond the ToR switches, the network design eliminates structural bottlenecks and offers uniform high capacity between racks in a cluster. To study where congestion occurs in such a topology, we collected link utilization statistics from a Windows Azure production storage cluster, whose network has a small oversubscription (less than 3:1). Figure 4 plots the top 10 percentiles of link utilization (averaged over 5-minute intervals) on edge and core links, using data collected from 20 racks over the course of one week. The plot reveals two trends. First, we observe that edge links have higher peak link utilization. This suggests that persistent congestion manifests itself more often, and earlier, on server ports than in the network core. Second, the variation of link utilization on core links is smaller. This suggests that the network core is evenly utilized and free of persistent hot-spots.

Figure 4: Utilization trends observed in a cluster running a multi-tenant storage service. The edge links exhibit higher peak and variance in link utilization compared to core links.

The reason we observe this behavior is twofold. First, TCP is the dominant protocol in our datacenters [19, 21]. The nature of TCP's congestion control ensures that traffic is admissible, i.e., sources do not send more traffic (in aggregate) to a sink than the bottleneck capacity along the paths. In a high capacity fabric, the only bottlenecks are at over-subscription points (the server access links and the links between the ToRs and Spines), provided packets are optimally routed elsewhere. Second, datacenters today use Equal-Cost Multi-Path (ECMP) routing to randomize routing at the level of flows. In practice, ECMP is "good enough" to mitigate contention within the fabric, particularly when most flows are short-lived. While the above link utilizations reflect persistent congestion over 5-minute intervals, we conducted a detailed packet-level simulation study and found that randomized per-packet routing can push even millisecond-timescale congestion to the edge (§5.3).

Thus, if (a) the network has high bisection bandwidth, (b) the network employs randomized traffic routing, and (c) traffic is admissible, then persistent congestion occurs only at the access links and not in the core. This observation guides our design: it is sufficient to push per-tenant state to the edge, where it is already available.

3 EyeQ Design

We now describe EyeQ's design in light of the observations in §2. For ease of exposition, we first abstract the datacenter network as a single switch and describe how EyeQ ensures rate guarantees to each endpoint connected to this switch. Then, in §3.5, we explain how this design fits in a network of switches. Finally, we describe how endpoints that do not need guarantees can coexist.

If a single switch connects all servers, bandwidth contention happens only at the first-hop link connecting the sender to the switch, and at the last-hop link connecting the switch to the receiver. Therefore, any contention is local to the servers, where the number of competing entities is small (typically 8-32 services/VMs per server). To resolve local contention between endpoints, two features are indispensable: (a) a mechanism that detects and accounts for contention, and (b) a mechanism that enforces rate limits on flows that violate their share. We now describe mechanisms to detect and resolve contention at senders and receivers.

3.1 Detecting and Resolving Contention

Senders. Contention between transmitters at the sender is straightforward to detect and resolve, as the first point of contention is the server NIC. To resolve this contention and achieve rate guarantees at the sender, EyeQ uses weighted fair queueing, where weights are set proportional to the endpoints' minimum bandwidth guarantees.

Receivers. Contention at the receiver, however, first happens inside the switch, and not at the receiving server. To see this, consider the example shown in Figure 2, where UDP generates highly bursty traffic that leads to 25% average utilization of the receiver link. When TCP begins to transmit packets, the link utilization soon approaches 100%, and packets are queued up in the limited buffer space inside the switch. If the switch does not differentially treat TCP and UDP packets, TCP's request/response traffic experiences high latency.

Unfortunately, neither the sending nor the receiving server has accurate, if any, visibility into this switch-internal contention, especially at the timescales it takes to fill the switch's packet buffers. These timescales can be very small, as datacenter switches have limited packet buffers. Consider a scenario where two switch ports send data to a common port that has a 1MB buffer. If each port starts sending at line rate, it takes just 800µs (at 10Gb/s) to fill the shared buffer inside the switch.

Fortunately, the scenario in Figure 2 offers an insight into the problem: contention happens when the link utilization, at short timescales, approaches the link's capacity. EyeQ therefore measures rates every 200µs and uses this information to rate limit flows before they cause further congestion. EyeQ stands to benefit if the network can further assist in quickly detecting congestion, either using Explicit Congestion Notification (ECN) marks from a shared queue when buffers exceed a configured occupancy, or using per-tenant dedicated queues on the access link alone. But EyeQ does not need per-tenant network queues to function.

Figure 5: The design consists of a Sender EyeQ Module (SEM) and Receiver EyeQ Module (REM) at every end-host. The SEM consists of hierarchical rate limiters to enforce admission control, and a WRR scheduler to enforce minimum bandwidth guarantees. The REM consists of rate meters that detect and signal congestion. In tandem, the SEM and REM achieve end-to-end flow control.

Resolving receiver contention. Unlike sender-side violations, performing both detection and rate limiting at the receiver is not effective, as rate limiting at the receiver can only control well-behaved, TCP-like flows (by having them back off via drops). Unfortunately, VMs may use transport protocols such as UDP, which do not react to downstream drops. EyeQ therefore implements receiver-side detection with sender-side reaction. Specifically, EyeQ detects bandwidth violations at the receiver using per-endpoint rate meters, and enforces rate limits at the senders using per-destination rate limiters. These per-destination rate limiters are programmed by congestion feedback generated by the receiver (§3.4).

In summary, EyeQ's design has two main components: (a) a rate meter at receivers that sends feedback to (b) rate limiters at senders. A combination of the two is needed to address both contention at the receiver, indicated via feedback, and local contention at the sender. The rate limiters work in a distributed fashion, using a control algorithm to iteratively converge to the 'right' rates.

3.2 Receiver EyeQ Module

The Receiver EyeQ Module (REM) consists of an RX scheduler and a rate meter for every endpoint. The rate meter is just a byte counter that periodically (every 200µs) tracks the endpoint's receive rate and invokes the RX scheduler, which computes a capacity for each endpoint as Ci = Bi · C / (∑j∈A Bj). Here, A is the set of active VMs (those receiving traffic at a non-zero rate), Bi is the minimum bandwidth guarantee of VM i, and C is the link capacity. Rate meters can be hierarchical; for example, a tenant can create rate meters to further split its capacity Ci across the tenants communicating with it.
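A minimal sketch of this capacity split follows. The structure and names (struct endpoint, rx_schedule) are our own illustration, not the EyeQ sources, and integer Mb/s arithmetic stands in for whatever fixed-point math the module actually uses.

```c
/* RX scheduler capacity split (Section 3.2): each active endpoint i
 * gets C_i = B_i * C / sum_{j in A} B_j. */
#include <stddef.h>

struct endpoint {
    unsigned long min_rate_mbps;  /* configured guarantee B_i */
    unsigned long rx_rate_mbps;   /* measured receive rate y_i */
    unsigned long cap_mbps;       /* computed capacity C_i */
};

static void rx_schedule(struct endpoint *eps, size_t n,
                        unsigned long link_mbps)
{
    unsigned long sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        if (eps[i].rx_rate_mbps > 0)       /* active set A */
            sum += eps[i].min_rate_mbps;

    for (i = 0; i < n; i++) {
        if (eps[i].rx_rate_mbps > 0 && sum > 0)
            eps[i].cap_mbps = eps[i].min_rate_mbps * link_mbps / sum;
        else
            eps[i].cap_mbps = 0;           /* inactive this interval */
    }
}
```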

The REM is clocked by incoming packets and measures each tenant's aggregate utilization every 200µs. A packet arrival triggers feedback to the source of the current packet. The feedback is a 16-bit value R computed by the rate meter using a control algorithm. To avoid generating excessive feedback, we send feedback only to the source address of a packet sampled every 10kB of received data. This tends to choose senders that are communicating at high rates over a period of time.

This sampling also bounds the maximum bandwidth consumed by feedback. Since feedback packets are minimum-sized packets (64 bytes), feedback traffic will not consume more than 64Mb/s on a 10Gb/s link: one 64-byte feedback packet per 10kB received is 0.64% of line rate. This does not depend on the number of rate meters or the number of senders transmitting to a single machine. We call this feedback packet a host-ACK, or HACK. The HACK is a special IP packet that is never delivered to the tenant; we picked the first unused IP protocol number (143) as a 'HACK.' The HACK encodes a 16-bit rate in its IPID field. Alternatively, this feedback can be piggybacked on traffic to the source.
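The byte-count sampling can be sketched as follows. send_hack() and the surrounding names are hypothetical stand-ins for the module's actual transmit path, shown here only to make the 10kB sampling rule concrete.

```c
/* HACK generation by byte-count sampling (Section 3.2): one 64-byte
 * feedback packet per 10kB of received data caps feedback traffic at
 * 64/10000 of line rate, i.e., 64Mb/s on a 10Gb/s link. */
#include <stdio.h>

#define HACK_SAMPLE_BYTES 10000

static void send_hack(unsigned int src_ip, unsigned short rate)
{
    /* Stand-in: the real module emits an IP packet with protocol
     * number 143, carrying the 16-bit rate in the IPID field. */
    printf("HACK to %u: rate %u\n", src_ip, rate);
}

static unsigned long bytes_since_hack;

static void on_packet_rx(unsigned int pkt_len, unsigned int src_ip,
                         unsigned short rate)
{
    bytes_since_hack += pkt_len;
    if (bytes_since_hack >= HACK_SAMPLE_BYTES) {
        bytes_since_hack -= HACK_SAMPLE_BYTES;
        send_hack(src_ip, rate);  /* feedback to this packet's source */
    }
}
```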

3.3 Sender EyeQ Module

To enforce traffic admission control, the SEM uses multiple rate limiters organized in a hierarchical fashion. To see this hierarchy, consider the scenario in Figure 5. The root WRR scheduler schedules packet transmissions so that VM1 and VM2 get (say) an equal share of the transmit bandwidth. To further ensure that traffic does not congest a destination, there is a set of rate limiters at the leaf level, one per congested destination. Per-destination rate limiters ensure that traffic to uncongested destinations is not head-of-line blocked by traffic to congested destinations. These per-destination rate limiters set their rates as dictated by HACKs from receivers (§3.4). EyeQ associates traffic with a rate limiter only when the destination signals congestion through rate feedback. If no feedback is received within 100 milliseconds, the rate limiter halves its rate, down to a floor of 1Mb/s, to avoid congesting the network if the receiver is unresponsive (e.g., due to failures or network partitions).
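The timeout behavior can be sketched as below; field and function names are illustrative, not taken from the EyeQ sources.

```c
/* Per-destination rate limiter timeout (Section 3.3): if no HACK
 * arrives within 100ms, halve the rate, down to a 1Mb/s floor. */
struct dst_rate_limiter {
    unsigned long rate_kbps;         /* current enforced rate */
    unsigned long last_feedback_us;  /* timestamp of last HACK */
};

#define FEEDBACK_TIMEOUT_US (100 * 1000)
#define RATE_FLOOR_KBPS     1000     /* 1Mb/s */

static void check_feedback_timeout(struct dst_rate_limiter *rl,
                                   unsigned long now_us)
{
    if (now_us - rl->last_feedback_us < FEEDBACK_TIMEOUT_US)
        return;
    rl->rate_kbps /= 2;              /* receiver unresponsive: back off */
    if (rl->rate_kbps < RATE_FLOOR_KBPS)
        rl->rate_kbps = RATE_FLOOR_KBPS;
    rl->last_feedback_us = now_us;   /* halve again after another 100ms */
}
```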

3.4 Rate Control Loop

The heart of EyeQ's architecture is a rate control algorithm that computes the rates to which senders should converge so as to avoid overwhelming the receiver's capacity. The goal of this algorithm is to compute one rate Ri to which all flows of a tenant destined to endpoint i should be rate limited. If N senders all send long-lived flows, then Ri is simply Ci/N. In practice, N is hard to estimate, as senders are in a constant state of flux, and not all of them may want to send traffic at rate Ri. Hence, we need a mechanism that can compute Ri without estimating the number of senders or their demands. This makes the implementation practical and, more importantly, makes it possible to offload this functionality to hardware such as programmable NICs [28].

The control algorithm (operating at each endpoint) uses the measured receive rate yi and the endpoint's allowed receive capacity Ci (determined by the RX scheduler) to compute a rate Ri that is advertised to the senders communicating with this endpoint. The basic idea is that the algorithm starts with an initial rate estimate Ri and periodically corrects it based on the observed yi: if the incoming rate yi is too small, it increases Ri, and if yi is larger than Ci, it decreases Ri. This iterative procedure for computing Ri can be written as follows, taking care to keep Ri positive:

Ri ← Ri (1 − α · (yi − Ci) / Ci)
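The update lends itself to a very small implementation. The sketch below is our floating-point simplification of the control law; the actual module advertises a 16-bit rate and presumably uses fixed-point arithmetic, and the 1Mb/s floor mirrors the rate limiter floor of §3.3.

```c
/* Per-endpoint RCP-variant rate update, run every 200us (Section 3.4):
 * R_i <- R_i * (1 - alpha * (y_i - C_i) / C_i), kept positive. */
#define ALPHA         0.5    /* recommended gain (see below) */
#define MIN_RATE_MBPS 1.0    /* floor keeps R_i positive */

static double update_rate(double ri, double yi, double ci)
{
    ri = ri * (1.0 - ALPHA * (yi - ci) / ci);
    return ri < MIN_RATE_MBPS ? MIN_RATE_MBPS : ri;
}
```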

The algorithm is a variant of the Rate Control Protocol (RCP) proposed in [27, 29], but with an important difference. RCP's control algorithm operates on links in the network to split each link's capacity among every flow in a max-min fashion. This achieves per-flow max-min fairness, and therefore RCP suffers from the same problems as TCP. Instead, we operate the control algorithm in a hierarchical fashion. At the top level, the physical link capacity is divided by the RX scheduler into multiple virtual link capacities (Ci), one per VM, which isolates VMs from one another. Next, we operate the above algorithm independently on each virtual link.

The sensitivity of the algorithm is controlled by the parameter α; higher values make the algorithm more aggressive in adjusting Ri. The above equation can be analyzed as follows. In the case where N flows traverse a single congested link of unit capacity, the rate evolution of R can be described in the standard form² z[n+1] = b·z[n]·(1 − z[n]), where z[n] = ((b−1)/b)·(R[n]/R*), R* = 1/N, and b = 1 + α. It can be shown that R* is the only stable fixed point of the above recurrence if 1 < b < 3, i.e., 0 < α < 2.

²This is a standard non-linear one-dimensional dynamical system called the Logistic Map.

By linearizing the recurrence about its fixed point, we can show that R[n] ≈ R* + (R[0] − R*)(1 − α)^n. Therefore, the system converges linearly. In practice, we found that high values of α lead to oscillations around R*, and we therefore recommend setting α conservatively to 0.5, for which R[n] converges to within 0.01% of R* in about 30 iterations, irrespective of R[0].
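The convergence claim is easy to check numerically. The toy program below (ours, not from the paper's artifacts) drives the update with N senders that obey the advertised rate on a unit-capacity link, so y = N·R:

```c
/* Iterate R <- R(1 - alpha(NR - C)/C) from an arbitrary starting point
 * and check that R approaches R* = C/N within ~30 iterations. */
#include <stdio.h>

int main(void)
{
    const double alpha = 0.5, C = 1.0, floor = 1e-3;
    const int N = 10;
    double R = 1.0;                 /* arbitrary initial estimate R[0] */

    for (int n = 1; n <= 30; n++) {
        double y = N * R;           /* aggregate arrival rate */
        R = R * (1.0 - alpha * (y - C) / C);
        if (R < floor)              /* keep R positive, as required */
            R = floor;
    }
    printf("R after 30 iterations: %.6f (target %.6f)\n", R, C / N);
    return 0;
}
```

With α = 0.5, R settles to within a small fraction of a percent of C/N well inside the 30-iteration budget.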

Though EyeQ assumes a congestion-free network core, ECN marks from the network enable EyeQ to degrade gracefully in the presence of in-network congestion, which can arise, for example, when links fail. In the rare event of persistent in-network congestion, we estimate the fraction of marked incoming packets as β and reduce Ri proportionally: Ri ← Ri(1 − β/2). The β term reduces the rates of transmitting endpoints only in such transient cases. Though minimum bandwidth guarantees cannot be met in this case, the mechanism prevents starvation, where one endpoint is completely dominated by another; bottleneck bandwidth is then shared equally among all receiving endpoints.
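A sketch of this graceful-degradation step, under the same simplifications as the update above:

```c
/* Scale the advertised rate down when a fraction beta of incoming
 * packets carry ECN marks (Section 3.4): R_i <- R_i * (1 - beta/2). */
static double apply_ecn_penalty(double ri, double beta)
{
    return ri * (1.0 - beta / 2.0);
}
```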

We experimented with other control algorithms based on Data Center TCP (DCTCP) [21] and Quantized Congestion Notification (QCN) [30], and found that they have different convergence and stability properties. For example, DCTCP's convergence was on the order of 100-150ms, whereas QCN converged within 20-30ms. It is important that the control loop be fast and stable, reacting to bursts without over-reacting, and RCP converges to the right rate within a few milliseconds. Since EyeQ computes rates every 200µs, the worst-case convergence time (30 iterations) is 6ms. In practice, it is much faster.

3.5 EyeQ on a network

So far, we have described EyeQ under the assumption that all end-hosts are connected to a single switch. A few steps must be taken to ensure EyeQ's design applies directly to networks that have a small over-subscription at the ToR switches (Figure 3). Clearly, if the policy is to guarantee minimum bandwidth to each VM, the cluster manager must ensure that capacity is not overbooked. This becomes simpler in such networks, where the core of the network is guaranteed to be congestion free; admission control must then only ensure that:

• The access links at end-hosts are not over-subscribed: i.e., the sum of the bandwidth guarantees of the VMs on a server is less than 10Gb/s.

• The ToR's uplink capacity is not over-subscribed: i.e., the sum of the bandwidth guarantees of the VMs under a ToR switch is less than the switch's total capacity to the Spine layer.

The above conditions ensure that VMs are guaranteed their bandwidth in the worst case, when every VM needs it. The remaining capacity can be used for VMs that have no bandwidth requirements. Traffic from VMs that do not need bandwidth guarantees is mapped to a low-priority, "best-effort" network class. This requires (i) a one-time network configuration of a low-priority queue, which is easily possible in today's commodity switches, and (ii) end-hosts to mark packets so they can be classified into low-priority network queues. This partitions the available bisection bandwidth between the class of VMs that need performance guarantees and those that do not. As we saw in §3.4, EyeQ gracefully degrades in the presence of network congestion that can happen due to over-subscription, by sharing bandwidth equally among all receiving endpoints.

4 Implementation

EyeQ needs two components: rate limiters and rate meters. These components can be implemented in software, in hardware, or in a combination of the two for optimum performance. In this paper, we present a full-software implementation of EyeQ's mechanisms, addressing the following challenges: (a) maintaining line-rate performance at 10Gb/s while reacting quickly to contention at fine timescales, and (b) co-existing with today's network stacks, which use various offload techniques to speed up packet processing. EyeQ uses a combination of simple and well-known techniques to reduce CPU overhead. We avoid writing to data structures shared across multiple CPUs to minimize cache misses. If sharing is inevitable, we minimize updates to shared data as much as possible through batching.

In untrusted environments, EyeQ is implemented in the trusted codebase at the virtual switch (in the hypervisor or Dom0). In a non-virtualized, trusted environment, EyeQ resides in the network stack as a shim layer above the device driver. As a prototype, we implemented RX and TX processing in a VMSwitch filter driver for Windows Server 2008, and in a kernel module for Linux, which we use for all experiments in this paper. The kernel module implements a queueing discipline (qdisc) in about 1900 lines of C code and about 700 lines of header files. We implemented a simple hash-table-based IP classifier to identify endpoints. EyeQ hooks into the RX datapath using netdev_rx_handler_register.

4.1 Receiver EyeQ Module

The REM consists of rate meters, a scheduler, and a HACK generator. A rate meter is created for each VM and tracks the VM's receive rate in an integer counter. Clocked by incoming packets, the scheduler determines each endpoint's allowed rate, distributing the receive capacity among active endpoints in accordance with their minimum bandwidth requirements.

At 10Gb/s, today's NICs use techniques such as Receive Side Scaling [31] to scale software packet processing by load balancing interrupts across CPU cores. A single, atomically updated byte counter in the critical path is a bottleneck and limits parallelism. To avoid such inefficiencies, we exploit the fact that today's NICs use interrupt coalescing to deliver multiple packets in a single interrupt, and therefore batch counter updates over 200µs time intervals. A smaller interval results in inaccurate rate measurement due to tiny bursts, and a larger interval decreases the rate meter's ability to detect short bursts that can cause interference. In a typical shallow-buffered ToR switch with a 1MB shared buffer, it takes 800µs to fill the buffer when two ports send at line rate to a common receiver. Thus, the choice of a 200µs interval balances the ability to detect short bursts against measuring rates reasonably accurately.
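The batching can be sketched as follows; the flat per-CPU array and helper names are our illustration (a kernel version would use proper per-CPU variables and timers):

```c
/* Per-CPU byte counting with periodic aggregation (Section 4.1): RX
 * paths increment a CPU-local counter; a 200us timer folds the counts
 * into a rate, avoiding a shared atomic update on every packet. */
#define NCPUS       16
#define INTERVAL_US 200

static unsigned long rx_bytes[NCPUS];   /* written only by owning CPU */

static void on_rx(int cpu, unsigned int len)
{
    rx_bytes[cpu] += len;                /* no cross-CPU sharing */
}

/* Timer callback: fold per-CPU counts into a rate in Mb/s
 * (bits per microsecond equals Mb/s). */
static unsigned long measure_rate_mbps(void)
{
    unsigned long total = 0;
    for (int c = 0; c < NCPUS; c++) {
        total += rx_bytes[c];
        rx_bytes[c] = 0;
    }
    return total * 8 / INTERVAL_US;
}
```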

4.2 Sender EyeQ Module

The SEM consists of multiple TX-contexts, one per endpoint, that are isolated from one another. The SEM classifies packets to their corresponding TX-context. Each context has one root rate limiter, and a hash table of rate limiters keyed by IP destination d; the hash table stores the per-destination rate control state R_d for that endpoint. Recall that rate enforcement is done hierarchically: leaf rate limiters enforce per-destination rates determined by the end-to-end feedback loop, and the root rate limiter enforces a per-endpoint aggregate rate determined by the TX scheduler.

Rate limiters to IP destinations are created only on a need-to-rate-limit basis. Initially, packets to a destination are rate limited only at the root level. A per-destination rate limiter for destination d is created, and added to the hierarchy, only upon receiving congestion feedback from receiver d. Inactive rate limiters are garbage collected every few seconds. The TX WRR scheduler executes every 200µs and reassigns the total TX capacity among active endpoints, i.e., those that have a backlog of packets waiting to be transmitted in rate limiters.

Multi-queue rate limiter. The rate limiter is implemented as a token bucket, which has an associated (tail-drop) FIFO queue backed by a linked list, a timer, a rate R, and some tokens. This simple design can be inefficient: a single-queue rate limiter increases lock contention, which degrades performance significantly, as the queue is touched for every packet. Hence, we split the ideal rate limiter's FIFO queue into per-CPU queues Qc, and the total tokens into a local token count tc on each CPU c. The value tc is the number of bytes that Qc can transmit without violating the global rate limit. Only if Qc runs out of tokens to transmit the head of its queue does it grab the rate limiter's lock to borrow all of the total tokens.

If the borrow fails due to a lack of total tokens, the per-CPU queue is throttled and appended to a per-CPU list of backlogged queues. We found that having a timer for every rate limiter was very expensive. Therefore, a single per-CPU timer fires every 50µs and clocks only the backlogged rate limiters on that CPU. Decreasing the firing interval increases the precision of the rate limiter, but increases CPU overhead (halving the interval doubles the number of timer interrupts per second). In practice, we found that 50µs is sufficient: at 10Gb/s, at most 64kB can be transmitted every 50µs without violating rate constraints.
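A minimal sketch of the per-CPU token borrow follows (locking elided; structure and names are ours, not the module's):

```c
/* Per-CPU token borrow in the multi-queue rate limiter (Section 4).
 * A real kernel version would protect total_tokens with a spinlock
 * and use per-CPU variables for the local queues. */
struct rate_limiter {
    long total_tokens;        /* global byte budget, refilled at rate R */
};

struct percpu_queue {
    long tc;                  /* local tokens: bytes this CPU may send */
    struct rate_limiter *rl;
};

/* Returns nonzero if the packet may be sent now. */
static int try_xmit(struct percpu_queue *q, long pkt_bytes)
{
    if (q->tc < pkt_bytes) {
        /* lock(q->rl) */
        q->tc += q->rl->total_tokens;   /* borrow everything */
        q->rl->total_tokens = 0;
        /* unlock(q->rl) */
    }
    if (q->tc < pkt_bytes)
        return 0;             /* throttled: joins the backlogged list */
    q->tc -= pkt_bytes;
    return 1;
}
```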

The rate limiter's per-CPU FIFO queue size is capped at 128kB, beyond which it back-pressures the network stack by refusing to accept more packets. While a TCP flow responds to this immediate feedback by stopping transmission, UDP applications may continue to send packets, which will be dropped. Stopped TCP flows are resumed by incoming ACKs.

Rate limiter accuracy. Techniques such as large segmentation offload (LSO) make it challenging to enforce rates precisely. With the default configuration, the TCP stack can transmit data in 64kB chunks, each of which takes 51.2µs to transmit at 10Gb/s. If a flow is rate limited to 1Gb/s, the rate limiter would transmit one 64kB chunk every 512µs. This burstiness affects the accuracy with which the rate meter measures rates. To limit burstiness, we restrict the maximum LSO packet size to 32kB, which enables reasonably accurate rate metering at 256µs intervals. For rates less than 1Gb/s, the rate limiter selectively disables segmentation offload by splitting large packets into smaller chunks of at most the MTU (1500 bytes). This improves rate precision without incurring much CPU overhead. Limiting the size of an LSO packet also improves latency for short flows by reducing head-of-line blocking at the NIC.
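The resulting segmentation policy can be summarized in a few lines (a simplification under the parameters stated above):

```c
/* Cap LSO segment size at 32kB; below 1Gb/s, fall back to MTU-sized
 * packets so the inter-packet gap stays small (Section 4). */
#define MTU_BYTES     1500
#define LSO_CAP_BYTES (32 * 1024)

static unsigned int max_segment_bytes(unsigned long rate_mbps)
{
    return rate_mbps < 1000 ? MTU_BYTES : LSO_CAP_BYTES;
}
```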

5 Evaluation

We evaluate EyeQ to understand the following aspects:

• Responsiveness: We stress EyeQ's convergence time against a large burst of UDP streams and find that EyeQ converges within 5ms to protect a collocated TCP tenant.

• CPU overhead: At 10Gb/s, we evaluate the main overhead of EyeQ, which is due to its rate limiters. We find that they outperform the software rate limiters in Linux.


Recall that EyeQ requires a number of rate limiters that varies with the number of flows. In practice, the number of active flows (flows that have outstanding data) on a machine is typically less than a few hundred [5]. Nevertheless, we evaluated EyeQ's rate limiters by creating 20,000 long-lived flows assigned to a number of rate limiters in round-robin fashion. As we increased the number of rate limiters attached to the root (which is limited to 5Gb/s) from 1 to 1,000 to 10,000, we found that CPU usage stays the same. This is because the net output (packets per second) is the same in all cases, apart from the (small) overhead of bookkeeping rate limiters.

5.2 Macro Benchmarks

The micro-benchmarks show that EyeQ is efficient and responsive in mitigating congestion. In this section, we explore the benefits of EyeQ on the traffic characteristics of two real-world applications: a long data shuffle (e.g., Hadoop) and memcached.

5.2.1 All-to-all data shuffle

To mimic a Hadoop job's network traffic component, we generate the all-to-all traffic pattern of a sort job using a traffic generator.³ Hadoop's reduce phase is bandwidth intensive, and job completion times depend on the availability of bandwidth [32]. Sorting S TB of data on a cluster of N nodes involves a roughly equal amount of data shuffled between all pairs; in effect, S/(N(N−1)) TB of data is shuffled between every pair of nodes. We use a TCP traffic generator to create long-lived flows according to the above traffic pattern, and record the flow completion times of all N(N−1) flows. We then plot the CDF of flow completion times for every job to visualize its progress over time; the job is complete when the last flow completes. We make no optimizations to mitigate stragglers.

³We used a traffic generator because our disks could not sustain enough write throughput to saturate a 10Gb/s network.

Multiple all-to-all shuffles. In this test, we run three collocated all-to-all shuffle jobs. Each job has 16 workers, one on each server in the cluster; each server thus hosts three workers, one from each job. Each job has a varying degree of aggressiveness in consuming network bandwidth: job Pi (i = 1, 2, 4) creates i parallel TCP connections between each pair of its nodes, and each TCP connection transfers an equal amount of data. The jobs are all temporally and spatially collocated with each other and run a common 1 TB sort workload.

Figure 8: EyeQ's ability to differentially allocate bandwidth to all-to-all shuffle jobs can affect their completion times, and this can be done at runtime, without reconfiguring jobs. (a) Without EyeQ, job P4 creates 4 parallel TCP connections between its workers and completes faster. (b) With different minimum bandwidth guarantees, EyeQ can directly affect the job completion times.

Figure 8(a) shows that jobs that open more TCP connections complete faster. However, EyeQ provides the flexibility to explicitly configure job priorities, irrespective of traffic or protocol behavior. Figure 8(b) shows the job completion times when the less aggressive jobs are given higher priority; the priority can be inverted, and the job completion times reflect the change of priorities. The job priority is inverted by assigning minimum bandwidth guarantees Bi to jobs Pi inversely proportional to their aggressiveness, i.e., B1 : B2 : B4 = 4 : 2 : 1. The final completion time under EyeQ increases from 180s to 210s, for two reasons. First, EyeQ's congestion detectors maintain a 10% bandwidth headroom in order to work at the end hosts without network ECN support. Second, the REM (§3.2) does not share bandwidth in a fine-grained, per-packet fashion, but over a 200µs time window. This leads to a small loss of utilization when (say) P1 is allocated some bandwidth but does not use it.

5.2.2 Memcached

Our final macro-evaluation is a scenario where a memcached tenant is collocated alongside an adversarial UDP tenant. The memcached tenant is a cluster consisting of 16 processes: 4 memcached instances and 12 clients. Each process is located on a different host. At the start of the experiment, each cache instance allocates 8GB of memory, each client starts one thread per cache instance, and each thread opens 10 permanent TCP connections to its designated cache instance.

  Scenario         50th      99th     99.9th
  Bare               98       370        666
  Bare+EyeQ         100       333        630
  Bare+UDP         4127  0.89×10⁶    1.1×10⁶
  Bare+UDP+EyeQ     102       437        750

Table 1: Latency percentiles (µs) of memcached SET requests at low load (144k req/s). In all cases, the cluster throughput was the same, but EyeQ protects memcached from bursty traffic, bringing the 99.9th percentile latency closer to bare-metal performance.

Throughput test. We generate an external load of about 288k requests/sec, load balanced equally across all clients; at each client, the mean load is 6000 requests/sec to each cache instance. The clients generate SET requests with 6kB values and 32B keys, and record the latency of each operation. We contrast the performance in four cases. First, we dedicate the cluster to the memcached tenant and establish baseline performance: the cluster was able to sustain the external load of 288k requests/sec. Second, we enable EyeQ on the same setup and find that EyeQ does not affect total throughput.

Third, we collocate memcached with a UDP tenant by instantiating a UDP node on every end host. Each UDP node sends a half-second burst of data to one other node (chosen in round-robin fashion), sleeping for half a second between bursts; the average utilization of the UDP tenant is thus 5Gb/s. We chose this pattern because some cloud providers today allow a VM to burst at high rates for a few seconds before it is throttled. In this case, we find that the cluster was able to keep up with only 269k requests/sec, causing many responses to time out, even though the UDP tenant consumes only 5Gb/s. Finally, we set equal bandwidth guarantees (5Gb/s) for both the UDP and memcached tenants. We find that the cluster can then sustain the demand of 288k requests/sec. This shows that EyeQ is able to protect the memcached tenant from bursty UDP traffic.

Latency test. We over-provisioned the cluster by halving the external load (to 144k requests/sec). When memcached is collocated with UDP without EyeQ's protection, we observed that the cluster was able to meet its demand, but UDP was still able to affect the latency of memcached requests, increasing the 99th percentile latency by over three orders of magnitude (Table 1). When enabled, EyeQ was able to protect the memcached tenant from fine-grained traffic bursts, bringing the 99.9th percentile latency down to 750µs. The latency is still higher than bare metal, as the total load on the network is higher.

Takeaways. This experiment highlights a subtle point. Though we pay a 10% bandwidth price for low latency, EyeQ improves net cluster utilization and tail latency. In a real setup, an unsuspecting customer would pay money to spin up additional memcached instances to cope with the additional load. While this does increase revenue for providers, we believe they can earn more by using fewer resources to achieve the same level of performance. In datacenters, servers account for over 60% of the cost, while the network accounts for only 10-15% [33]. With EyeQ, cloud operators can hope to achieve better CPU packing without worrying about network interference.

Figure 9: Queue occupancy distribution at the edge and core links, for two routing algorithms: (a) per-flow ECMP and (b) per-packet ECMP. The maximum queue size is 225kB (150 packets). Larger queue sizes indicate more congestion. In all cases, the network edge is more congested, even at small timescales (1ms).

5.3 Congestion in the Fabric

In §2.3 we used link utilization data from a production cluster as evidence that network congestion happens more often at the edge than in the network core. These coarse-grained average link utilizations over five-minute intervals show macroscopic trends, but they do not capture congestion at packet timescales. Using packet-level simulations in ns-2 [34], we study the extent to which transient congestion can be mitigated in the network core using two routing algorithms: (a) per-flow ECMP, and (b) a per-packet variant of ECMP. In the per-packet variant, each packet's route is chosen uniformly at random among all next hops [35]. To cope with packet reordering, we increase TCP's duplicate ACK threshold.

We created a full bisection bandwidth topology of 144 10GbE hosts, 9 ToRs, and 16 Spines, as in our datacenters (Figure 3). Servers open TCP connections to every other server and generate traffic at an average rate of 10Gb/s × λ, where λ is the offered load. Each server picks a random TCP connection on which to transmit data. Flow sizes are drawn from a distribution observed in a large datacenter [21]; the median, mean, and 90th percentile flow sizes are 19kB, 2.4MB, and 133kB respectively. We set the sizes of all network queues to 150 packets and collect queue samples every 1ms. Figure 9 shows queue occupancy percentiles at the edge and core when λ = 0.9.

Even at high load, we observe that the core links are far less congested than the edge links. With per-packet ECMP, the maximum queue occupancy in the core links is less than 50kB, and all packet drops occur at the server access links. In all cases, we observed that per-packet ECMP practically eliminates in-network congestion, irrespective of the traffic pattern.

6 Related work

EyeQ's goals fall under the umbrella of network Quality of Service (QoS), which has a rich history. Early QoS models [13, 36] and their implementations [15, 16, 36] focus on link-level and network-wide rate and delay guarantees for flows between a source and destination. Protocols such as the Resource Reservation Protocol (RSVP) [37] reserve and relinquish resources across multiple links using such QoS primitives. Managing network state for every flow becomes untenable when there are many flows. This led to approaches that relax strict guarantees to "average bandwidth" [14, 38, 39] while incurring lower state-management overhead. A notable candidate is Core-Stateless Fair Queueing (CSFQ), which distributes state between the network core and the network edge, relying on flow aggregation at edge routers. In a flat datacenter network, tenant VMs are distributed across racks for availability, and hence there is little to no flow aggregation. OverQoS [40] provided the abstraction of a controlled-loss virtual link with statistical bandwidth guarantees between two nodes on an overlay network; this "pipe" model requires customers to specify bandwidth requirements between all communicating pairs, in contrast to a hose model [12].

Among recent approaches, Seawall [18] shares bottleneck capacity across competing source VMs in proportion to their weights. This notion of sharing lacks predictability, as a tenant can grab more bandwidth by launching more source VMs. Oktopus [3] argues for predictability by enforcing a static hose model using rate limiters. It computes rates using a pseudo-centralized mechanism, in which VMs communicate their pairwise bandwidth consumption to a tenant-specific centralized coordinator. This control-plane overhead limits reaction times to about 2 seconds. However, as we have seen (§2.2), an isolation mechanism has to react quickly to be effective. SecondNet [41] is limited to providing static bandwidth reservations between pairs of VMs. In contrast to Oktopus and SecondNet, EyeQ supports both static and work-conserving bandwidth allocations.

The closest related work to EyeQ is Gatekeeper [42], which also argues for predictable bandwidth allocation and uses congestion control to provide rate guarantees to VMs. While the high-level architecture is similar, Gatekeeper lacks details on the system design, especially the rate control mechanism, which is critical to providing bandwidth guarantees at short timescales. Gatekeeper's evaluation is limited to static scenarios with long-lived flows. Moreover, Gatekeeper uses Linux's hierarchical token bucket, which incurs high overhead at 10Gb/s.

FairCloud [43] explored fundamental trade-offs between network utilization, minimum guarantees and payment proportionality for a number of sharing policies. FairCloud demonstrated the effect of such policies with per-flow queues in switches and CSFQ, which have limited or no support in today's commodity switches. However, the minimum-bandwidth guarantee that EyeQ supports conforms to FairCloud's 'Proportional Sharing on Proximate links (PS-P)' policy, which, as the authors demonstrate, outperforms many other sharing policies. NetShare [44] used in-network weighted fair queueing to enforce bandwidth sharing among VMs. This approach, unfortunately, does not scale well owing to the limited number of queues (8–64) per switch port.

The literature on congestion control mechanisms is vast; however, the fundamental unit of allocation there is still the flow, and therefore these mechanisms alone are inadequate for network performance isolation. We refer the interested reader to [45] for a more comprehensive survey of recent efforts to address performance unpredictability in datacenter networks.

7 Concluding Remarks

In this paper, we presented EyeQ, a platform that enforces predictable network bandwidth sharing within the datacenter using minimum bandwidth guarantees to endpoints. Our design and evaluation show that a synthesis of well-known techniques can lead to a simple and scalable design for network performance isolation. EyeQ is practical, and is deployable on today's and next-generation high-speed datacenter networks with no changes to network hardware or applications. With EyeQ, providers can flexibly and efficiently apportion network bandwidth across tenants by giving each tenant endpoint a predictable minimum bandwidth guarantee, eliminating the problem of accidental or malicious traffic interference.

Acknowledgments

We would like to thank the anonymous reviewers, Ali Ghodsi, and our shepherd Lakshminarayanan Subramanian for their feedback and suggestions. The work at Stanford was funded by NSF FIA award CNS-1040190, by a gift from Google, and by DARPA CRASH award #N66001-10-2-4088. Opinions, findings, and conclusions do not necessarily reflect the views of the NSF or other sponsors.


References

[1] Amazon Virtual Private Cloud. http://aws.amazon.com/vpc/.

[2] Jayaram Mudigonda, Praveen Yalagandula, Jeff Mogul, Bryan Stiekes, and Yanick Pouffary. NetLord: a scalable multi-tenant network architecture for virtualized datacenters. In ACM SIGCOMM Computer Communication Review, volume 41, pages 62–73. ACM, 2011.

[3] Hitesh Ballani, Paolo Costa, Thomas Karagiannis, and Ant Rowstron. Towards predictable datacenter networks. In ACM SIGCOMM Computer Communication Review, volume 41, pages 242–253. ACM, 2011.

[4] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, et al. A view of cloud computing. Communications of the ACM, 53(4):50–58, 2010.

[5] Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, David A. Maltz, Parveen Patel, and Sudipta Sengupta. VL2: a scalable and flexible data center network. In ACM SIGCOMM Computer Communication Review, volume 39, pages 51–62. ACM, 2009.

[6] Radhika Niranjan Mysore, Andreas Pamboris, Nathan Farrington, Nelson Huang, Pardis Miri, Sivasankar Radhakrishnan, Vikram Subramanya, and Amin Vahdat. PortLand: a scalable fault-tolerant layer 2 data center network fabric. In ACM SIGCOMM Computer Communication Review, volume 39, pages 39–50. ACM, 2009.

[7] Chuanxiong Guo, Haitao Wu, Kun Tan, Lei Shi, Yongguang Zhang, and Songwu Lu. DCell: a scalable and fault-tolerant network structure for data centers. In ACM SIGCOMM Computer Communication Review, volume 38, pages 75–86. ACM, 2008.

[8] C. E. Hopps. Analysis of an equal-cost multi-path algorithm. RFC 2992, 2000.

[9] Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, and Amin Vahdat. Hedera: dynamic flow scheduling for data center networks. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation, pages 19–19, 2010.

[10] Theophilus Benson, Aditya Akella, Anees Shaikh, and Sambit Sahu. CloudNaaS: a cloud networking platform for enterprise applications. In Proceedings of the 2nd ACM Symposium on Cloud Computing, page 8. ACM, 2011.

[11] Costin Raiciu, Sebastien Barre, Christopher Pluntke, Adam Greenhalgh, Damon Wischik, and Mark Handley. Improving datacenter performance and robustness with multipath TCP. In ACM SIGCOMM Computer Communication Review, volume 41, pages 266–277. ACM, 2011.

[12] Nicholas G. Duffield, Pawan Goyal, Albert Greenberg, Partho Mishra, Kadangode K. Ramakrishnan, Jacobus E. Van der Merwe, et al. A flexible model for resource management in virtual private networks. ACM SIGCOMM Computer Communication Review, 29(4):95–108, 1999.

[13] A. K. Parekh and R. G. Gallager. A generalized processor sharing approach to flow control in integrated services networks: the single-node case. IEEE/ACM Transactions on Networking (TON), 1(3):344–357, 1993.

[14] Rong Pan, Lee Breslau, Balaji Prabhakar, and Scott Shenker. Approximate fairness through differential dropping. In ACM SIGCOMM Computer Communication Review, volume 33, pages 23–39. ACM, 2003.

[15] Alan Demers, Srinivasan Keshav, and Scott Shenker. Analysis and simulation of a fair queueing algorithm. In ACM SIGCOMM Computer Communication Review, volume 19, pages 1–12. ACM, 1989.

[16] Madhavapeddi Shreedhar and George Varghese. Efficient fair queueing using deficit round robin. In ACM SIGCOMM Computer Communication Review, volume 25, pages 231–242. ACM, 1995.

[17] Jon C. R. Bennett and Hui Zhang. WF2Q: worst-case fair weighted fair queueing. In INFOCOM '96: Fifteenth Annual Joint Conference of the IEEE Computer Societies, volume 1, pages 120–128. IEEE, 1996.

[18] Alan Shieh, Srikanth Kandula, Albert Greenberg, Changhoon Kim, and Bikas Saha. Sharing the data center network. In USENIX NSDI, volume 11, 2011.


[19] Srikanth Kandula, Sudipta Sengupta, Albert Greenberg, Parveen Patel, and Ronnie Chaiken. The nature of data center traffic: measurements & analysis. In Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement, pages 202–208. ACM, 2009.

[20] Theophilus Benson, Ashok Anand, Aditya Akella, and Ming Zhang. Understanding data center traffic characteristics. ACM SIGCOMM Computer Communication Review, 40(1):92–99, 2010.

[21] Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan. Data center TCP (DCTCP). ACM SIGCOMM Computer Communication Review, 40(4):63–74, 2010.

[22] Mohammad Alizadeh, Abdul Kabbani, Tom Edsall, Balaji Prabhakar, Amin Vahdat, and Masato Yasuda. Less is more: trading a little bandwidth for ultra-low latency in the data center. In Proceedings of the USENIX NSDI Conference, 2012.

[23] David Zats, Tathagata Das, Prashanth Mohan, Dhruba Borthakur, and Randy Katz. DeTail: reducing the flow completion time tail in datacenter networks. ACM SIGCOMM Computer Communication Review, 42(4):139–150, 2012.

[24] Mohammad Alizadeh, Berk Atikoglu, Abdul Kabbani, Ashvin Lakshmikantha, Rong Pan, Balaji Prabhakar, and Mick Seaman. Data center transport mechanisms: congestion control theory and IEEE standardization. In 46th Annual Allerton Conference on Communication, Control, and Computing, pages 1270–1277. IEEE, 2008.

[25] Aleksandar Kuzmanovic and Edward W. Knightly. Low-rate TCP-targeted denial of service attacks: the shrew vs. the mice and elephants. In Proceedings of the 2003 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pages 75–86. ACM, 2003.

[26] Vimalkumar Jeyakumar, Mohammad Alizadeh, David Mazières, Balaji Prabhakar, and Changhoon Kim. EyeQ: practical network performance isolation for the multi-tenant cloud. In Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computing, pages 8–8. USENIX Association, 2012.

[27] Nandita Dukkipati and Nick McKeown. Why flow-completion time is the right metric for congestion control. ACM SIGCOMM Computer Communication Review, 36(1):59–62, 2006.

[28] Guohan Lu, Chuanxiong Guo, Yulong Li, Zhiqiang Zhou, Tong Yuan, Haitao Wu, Yongqiang Xiong, Rui Gao, and Yongguang Zhang. ServerSwitch: a programmable and high performance platform for data center networks. In Proc. NSDI, 2011.

[29] Frank Kelly, Gaurav Raina, and Thomas Voice. Stability and fairness of explicit congestion control with small buffers. ACM SIGCOMM Computer Communication Review, 38(3):51–62, 2008.

[30] Mohammad Alizadeh, Abdul Kabbani, Berk Atikoglu, and Balaji Prabhakar. Stability analysis of QCN: the averaging principle. ACM SIGMETRICS Performance Evaluation Review, 39(1):49–60, 2011.

[31] Guide to Linux kernel network scaling. http://code.google.com/p/kernel/wiki/NetScalingGuide.

[32] Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris. Reining in the outliers in map-reduce clusters using Mantri. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, pages 1–16. USENIX Association, 2010.

[33] Internet-scale datacenter economics: costs and opportunities. http://mvdirona.com/jrh/TalksAndPapers/JamesHamilton_HPTS2011.pdf.

[34] Steven McCanne, Sally Floyd, Kevin Fall, Kannan Varadhan, et al. Network simulator ns-2, 1997.

[35] Advait Dixit, Pawan Prakash, and Ramana Rao Kompella. On the efficacy of fine-grained traffic splitting protocols in data center networks. In ACM SIGCOMM Computer Communication Review, volume 41, pages 430–431. ACM, 2011.

[36] Ion Stoica, Hui Zhang, and T. S. Ng. A hierarchical fair service curve algorithm for link-sharing, real-time and priority services, volume 27. ACM, 1997.

[37] Lixia Zhang, Steve Deering, Deborah Estrin, Scott Shenker, and Daniel Zappala. RSVP: a new resource reservation protocol. IEEE Network, 7(5):8–18, 1993.


[38] Ion Stoica, Scott Shenker, and Hui Zhang. Core-stateless fair queueing: achieving approximately fair bandwidth allocations in high speed networks. In ACM SIGCOMM Computer Communication Review, volume 28, pages 118–130. ACM, 1998.

[39] Abdul Kabbani, Mohammad Alizadeh, Masato Yasuda, Rong Pan, and Balaji Prabhakar. AF-QCN: approximate fairness with quantized congestion notification for multi-tenanted data centers. In 2010 IEEE 18th Annual Symposium on High Performance Interconnects (HOTI), pages 58–65. IEEE, 2010.

[40] Lakshminarayanan Subramanian, Ion Stoica, Hari Balakrishnan, and Randy Katz. OverQoS: an overlay based architecture for enhancing internet QoS. In Proceedings of the 1st Conference on Symposium on Networked Systems Design and Implementation, volume 1, pages 6–21, 2004.

[41] Chuanxiong Guo, Guohan Lu, Helen J. Wang, Shuang Yang, Chao Kong, Peng Sun, Wenfei Wu, and Yongguang Zhang. SecondNet: a data center network virtualization architecture with bandwidth guarantees. In Proceedings of the 6th International Conference, page 15. ACM, 2010.

[42] Henrique Rodrigues, Jose Renato Santos, Yoshio Turner, Paolo Soares, and Dorgival Guedes. Gatekeeper: supporting bandwidth guarantees for multi-tenant datacenter networks. USENIX WIOV, 2011.

[43] Lucian Popa, Gautam Kumar, Mosharaf Chowdhury, Arvind Krishnamurthy, Sylvia Ratnasamy, and Ion Stoica. FairCloud: sharing the network in cloud computing. In Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, pages 187–198. ACM, 2012.

[44] Sivasankar Radhakrishnan, Rong Pan, Amin Vahdat, George Varghese, et al. NetShare and stochastic NetShare: predictable bandwidth allocation for data centers. ACM SIGCOMM Computer Communication Review, 42(3):5–11, 2012.

[45] Jeffrey C. Mogul and Lucian Popa. What we talk about when we talk about cloud network performance. ACM SIGCOMM Computer Communication Review, 42(5):44–48, 2012.
