Network Virtualization Case Study: VL2

Page 1

Network Virtualization Case Study: VL2

Page 2: Network Virtualization Case Study: VL2 - ETH Z · Case Study VL2: A Scalable and Flexible Data Center Network Albert Greenberg James R. Hamilton Navendu Jain Srikanth Kandula Changhoon

Key Needs

Agility

Location independent addressing

Performance uniformity

Security

Network semantics

Page 3: Network Virtualization Case Study: VL2 - ETH Z · Case Study VL2: A Scalable and Flexible Data Center Network Albert Greenberg James R. Hamilton Navendu Jain Srikanth Kandula Changhoon

Case Study

VL2: A Scalable and Flexible Data Center Network

Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri,

David A. Maltz Parveen Patel Sudipta Sengupta

Microsoft Research

Abstract

To be agile and cost effective, data centers should allow dynamic resource allocation across large server pools. In particular, the data center network should enable any server to be assigned to any service. To meet these goals, we present VL2, a practical network architecture that scales to support huge data centers with uniform high capacity between servers, performance isolation between services, and Ethernet layer-2 semantics. VL2 uses (1) flat addressing to allow service instances to be placed anywhere in the network, (2) Valiant Load Balancing to spread traffic uniformly across network paths, and (3) end-system based address resolution to scale to large server pools, without introducing complexity to the network control plane. VL2's design is driven by detailed measurements of traffic and fault data from a large operational cloud service provider. VL2's implementation leverages proven network technologies, already available at low cost in high-speed hardware implementations, to build a scalable and reliable network architecture. As a result, VL2 networks can be deployed today, and we have built a working prototype. We evaluate the merits of the VL2 design using measurement, analysis, and experiments. Our VL2 prototype shuffles 2.7 TB of data among 75 servers in 395 seconds – sustaining a rate that is 94% of the maximum possible.

Categories and Subject Descriptors: C.2.1 [Computer-Communication Networks]: Network Architecture and Design

General Terms: Design, Performance, Reliability

Keywords: Data center network, commoditization

1. INTRODUCTION

Cloud services are driving the creation of data centers that hold tens to hundreds of thousands of servers and that concurrently support a large number of distinct services (e.g., search, email, map-reduce computations, and utility computing). The motivations for building such shared data centers are both economic and technical: to leverage the economies of scale available to bulk deployments and to benefit from the ability to dynamically reallocate servers among services as workload changes or equipment fails [, ]. The cost is also large – upwards of $12 million per month for a 50,000 server data center — with the servers themselves comprising the largest cost component. To be profitable, these data centers must achieve high utilization, and key to this is the property of agility — the capacity to assign any server to any service.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGCOMM'09, August 17–21, 2009, Barcelona, Spain. Copyright 2009 ACM 978-1-60558-594-9/09/08 ...$10.00.

Agility promises improved risk management and cost savings. Without agility, each service must pre-allocate enough servers to meet difficult to predict demand spikes, or risk failure at the brink of success. With agility, the data center operator can meet the fluctuating demands of individual services from a large shared server pool, resulting in higher server utilization and lower costs.

Unfortunately, the designs for today's data center network prevent agility in several ways. First, existing architectures do not provide enough capacity between the servers they interconnect. Conventional architectures rely on tree-like network configurations built from high-cost hardware. Due to the cost of the equipment, the capacity between different branches of the tree is typically oversubscribed by factors of 1:5 or more, with paths through the highest levels of the tree oversubscribed by factors of 1:80 to 1:240. This limits communication between servers to the point that it fragments the server pool — congestion and computation hot-spots are prevalent even when spare capacity is available elsewhere. Second, while data centers host multiple services, the network does little to prevent a traffic flood in one service from affecting the other services around it — when one service experiences a traffic flood, it is common for all those sharing the same network sub-tree to suffer collateral damage. Third, the routing design in conventional networks achieves scale by assigning servers topologically significant IP addresses and dividing servers among VLANs. Such fragmentation of the address space limits the utility of virtual machines, which cannot migrate out of their original VLAN while keeping the same IP address. Further, the fragmentation of address space creates an enormous configuration burden when servers must be reassigned among services, and the human involvement typically required in these reconfigurations limits the speed of deployment.

To overcome these limitations in today's design and achieve agility, we arrange for the network to implement a familiar and concrete model: give each service the illusion that all the servers assigned to it, and only those servers, are connected by a single non-interfering Ethernet switch — a Virtual Layer 2 — and maintain this illusion even as the size of each service varies from 1 server to 100,000. Realizing this vision concretely translates into building a network that meets the following three objectives:

• Uniform high capacity: The maximum rate of a server-to-server traffic flow should be limited only by the available capacity on the network-interface cards of the sending and receiving servers, and assigning servers to a service should be independent of network topology.

• Performance isolation: Traffic of one service should not be affected by the traffic of any other service, just as if each service was connected by a separate physical switch.

• Layer-2 semantics: Just as if the servers were on a LAN — where any IP address can be connected to any port of an Ethernet switch due to flat addressing — data-center management software should be able to easily assign any server to any service and configure

[ACM SIGCOMM 2009]

Influenced architecture of Microsoft Azure

VL2 → Azure Clos Fabrics with 40G NICs


[Figure: Azure Clos fabric. Racks of servers sit under T0 (ToR) switches (T0-1 … T0-20 per rack group); T1 switches (T1-1 … T1-8) form the Row Spine; T2 switches (T2-1-1 … T2-2-4) form the Data Center Spine; T3 switches (T3-1 … T3-4) form the Regional Spine.]

Outcome of >10 years of history, with major revisions every six months

Scale-out, active-active (L3 Clos fabric) vs. scale-up, active-passive (L2 with LB/FW pairs)

[From Albert Greenberg's keynote at SIGCOMM 2015: http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/keynote.pdf]

Page 4

Motivating Environmental Characteristics

Increasing internal traffic is a bottleneck

• Traffic volume between servers is 4x external traffic

Unpredictable, rapidly-changing traffic matrices (TMs)

[Figure: PDF and CDF of the number of concurrent flows in/out of each machine; x-axis: number of concurrent flows (1 to 1000), y-axes: fraction of time (PDF) and cumulative fraction (CDF).]

Figure: Number of concurrent connections has two modes: (1) 10 flows per node more than 50% of the time and (2) 80 flows per node for at least 5% of the time.

Similar to Internet flow characteristics [], we find that there are myriad small flows (mice). On the other hand, as compared with Internet flows, the distribution is simpler and more uniform. The reason is that in data centers, internal flows arise in an engineered environment driven by careful design decisions (e.g., the 100-MB chunk size is driven by the need to amortize disk-seek times over read times) and by strong incentives to use storage and analytic tools with well understood resilience and performance.

Number of Concurrent Flows: Figure shows the probability density function (as a fraction of time) for the number of concurrent flows going in and out of a machine, computed over all monitored machines for a representative day's worth of flow data. There are two modes. More than 50% of the time, an average machine has about ten concurrent flows, but at least 5% of the time it has greater than 80 concurrent flows. We almost never see more than 100 concurrent flows.
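As a rough illustration of how such a fraction-of-time distribution can be tabulated (this is not the paper's measurement code; the sample layout and numbers are invented), a minimal Python sketch:

```python
from collections import Counter

def concurrent_flow_distribution(samples):
    """samples: one concurrent-flow count per machine per time slot.
    Returns (pdf, cdf) mapping flow count -> fraction of machine-time."""
    total = len(samples)
    counts = Counter(samples)
    pdf = {k: v / total for k, v in sorted(counts.items())}
    cdf, running = {}, 0.0
    for k, frac in pdf.items():
        running += frac
        cdf[k] = running
    return pdf, cdf

# Toy usage: most samples around 10 flows, a small heavy tail near 80.
samples = [10] * 60 + [12] * 20 + [80] * 5 + [100] * 1
pdf, cdf = concurrent_flow_distribution(samples)
print(cdf[12])  # fraction of machine-time with at most 12 concurrent flows
```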

The distributions of flow size and number of concurrent flows both imply that VLB will perform well on this traffic. Since even big flows are only 100 MB (1 s of transmit time at 1 Gbps), randomizing at flow granularity (rather than packet) will not cause perpetual congestion if there is unlucky placement of a few flows. Moreover, adaptive routing schemes may be difficult to implement in the data center, since any reactive traffic engineering will need to run at least once a second if it wants to react to individual flows.

3.3 Traffic Matrix Analysis

Poor summarizability of traffic patterns: Next, we ask the question: Is there regularity in the traffic that might be exploited through careful measurement and traffic engineering? If traffic in the DC were to follow a few simple patterns, then the network could be easily optimized to be capacity-efficient for most traffic. To answer, we examine how the Traffic Matrix (TM) of the 1,500 server cluster changes over time. For computational tractability, we compute the ToR-to-ToR TM — the entry TM(t)i,j is the number of bytes sent from servers in ToR i to servers in ToR j during the 100 s beginning at time t. We compute one TM for every 100 s interval, and servers outside the cluster are treated as belonging to a single “ToR”.
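A minimal sketch of this ToR-to-ToR TM computation, assuming flow records of the form (timestamp, source server, destination server, bytes) and a server-to-ToR map; the record format and names are assumptions, not the paper's pipeline:

```python
from collections import defaultdict

INTERVAL = 100  # seconds, as in the measurement study

def tor_traffic_matrices(flow_records, server_to_tor):
    """flow_records: iterable of (t, src_server, dst_server, num_bytes).
    Returns {interval_start: {(src_tor, dst_tor): bytes}}. Servers outside
    the cluster map to the single pseudo-ToR 'external'."""
    tms = defaultdict(lambda: defaultdict(int))
    for t, src, dst, nbytes in flow_records:
        bucket = int(t // INTERVAL) * INTERVAL
        i = server_to_tor.get(src, "external")
        j = server_to_tor.get(dst, "external")
        tms[bucket][(i, j)] += nbytes
    return tms
```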

Given the timeseries of TMs, we find clusters of similar TMs using a technique due to Zhang et al. []. In short, the technique recursively collapses the traffic matrices that are most similar to each other into a cluster, where the distance (i.e., similarity) reflects how much traffic needs to be shuffled to make one TM look like the other. We then choose a representative TM for each cluster, such that any routing that can deal with the representative TM performs no worse on every TM in the cluster. Using a single representative TM per cluster yields a fitting error (quantified by the distances between each representative TM and the actual TMs it represents), which will decrease as the number of clusters increases. Finally, if there is a knee point (i.e., a small number of clusters that reduces the fitting error considerably), the resulting set of clusters and their representative TMs at that knee corresponds to a succinct number of distinct traffic matrices that summarize all TMs in the set.
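The paper relies on the technique of Zhang et al.; the following is only a simplified agglomerative sketch of the general idea, using the total traffic that would have to be shuffled as the distance and an element-wise maximum as the cluster representative (so that a routing sized for the representative also handles every member):

```python
import numpy as np

def shuffle_distance(tm_a, tm_b):
    """Bytes that would have to be 'shuffled' to turn one TM into the other."""
    return np.abs(tm_a - tm_b).sum()

def cluster_tms(tms, k):
    """Greedy agglomerative clustering of equal-shape TMs (numpy arrays)
    down to k clusters. Returns (clusters, representatives)."""
    clusters = [[i] for i in range(len(tms))]
    reps = [tm.astype(float).copy() for tm in tms]
    while len(clusters) > k:
        # merge the pair of clusters whose representatives are closest
        _, a, b = min((shuffle_distance(reps[a], reps[b]), a, b)
                      for a in range(len(reps)) for b in range(a + 1, len(reps)))
        clusters[a].extend(clusters.pop(b))
        reps[a] = np.maximum(reps[a], reps.pop(b))
    return clusters, reps
```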

[Figure, three panels: (a) index of the containing cluster vs. time in 100 s intervals; (b) histogram of run lengths; (c) histogram of log(time to repeat).]

Figure: Lack of short-term predictability: The cluster to which a traffic matrix belongs, i.e., the type of traffic mix in the TM, changes quickly and randomly.

Surprisingly, the number of representative traffic matrices in our data center is quite large. On a timeseries of 864 TMs, indicating a day's worth of traffic in the datacenter, even when approximating with 50 − 60 clusters, the fitting error remains high (60%) and only decreases moderately beyond that point. This indicates that the variability in datacenter traffic is not amenable to concise summarization and hence engineering routes for just a few traffic matrices is unlikely to work well for the traffic encountered in practice.

Instability of traffic patterns: Next we ask how predictable is the traffic in the next interval given the current traffic? Traffic predictability enhances the ability of an operator to engineer routing as traffic demand changes. To analyze the predictability of traffic in the network, we find the best 40 TM clusters using the technique above and classify the traffic matrix for each 100 s interval to the best fitting cluster. Figure (a) shows that the traffic pattern changes nearly constantly, with no periodicity that could help predict the future. Figure (b) shows the distribution of run lengths - how many intervals does the network traffic pattern spend in one cluster before shifting to the next. The run length is 1 to the 60th percentile. Figure (c) shows the time between intervals where the traffic maps to the same cluster. But for the mode at 100 s caused by transitions within a run, there is no structure to when a traffic pattern will next appear.
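A small sketch of the two statistics behind panels (b) and (c): run lengths of the per-interval cluster labels, and the time until the same cluster label recurs. The label sequence is assumed to come from a clustering step like the one above.

```python
def run_lengths(labels):
    """Lengths of maximal runs of identical consecutive cluster labels."""
    runs, current = [], 1
    for prev, cur in zip(labels, labels[1:]):
        if cur == prev:
            current += 1
        else:
            runs.append(current)
            current = 1
    runs.append(current)
    return runs

def time_to_repeat(labels, interval=100):
    """For each interval, seconds until the same cluster label next appears."""
    last_seen, gaps = {}, []
    for idx, lab in enumerate(labels):
        if lab in last_seen:
            gaps.append((idx - last_seen[lab]) * interval)
        last_seen[lab] = idx
    return gaps

labels = [3, 3, 7, 1, 3, 3, 3, 7]
print(run_lengths(labels))     # [2, 1, 1, 3, 1]
print(time_to_repeat(labels))  # [100, 300, 100, 100, 500] (seconds)
```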

The lack of predictability stems from the use of randomness to improve the performance of data-center applications. For example, the distributed file system spreads data chunks randomly across servers for load distribution and redundancy. The volatility implies that it is unlikely that other routing strategies will outperform VLB.

3.4 Failure Characteristics

To design VL2 to tolerate the failures and churn found in data centers, we collected failure logs for over a year from eight production data centers that comprise hundreds of thousands of servers, host over a hundred cloud services and serve millions of users. We analyzed hardware and software failures of switches, routers, load balancers, firewalls, links and servers using SNMP polling/traps, syslogs, server alarms, and transaction monitoring frameworks. In all, we looked at 36M error events from over 300K alarm tickets.

What is the pattern of networking equipment failures? We define a failure as the event that occurs when a system or component is unable to perform its required function for more than 30 s. As expected, most failures are small in size (e.g., 50% of network device failures involve < 4 devices and 95% of network device failures involve < 20 devices) while large correlated failures are rare (e.g., the largest correlated failure involved 217 switches). However, downtimes can be significant: 95% of failures are resolved in 10 min, 98% in < 1 hr, 99.6% in < 1 day, but 0.09% last > 10 days.
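A hedged sketch of how such resolution-time statistics could be tabulated from outage records, assuming a simple (component, down-time, up-time) log format; the 30 s threshold is the failure definition above, and the layout is invented for illustration:

```python
def summarize_failures(events, min_duration=30.0):
    """events: list of (component_id, down_time, up_time), times in seconds.
    Keeps only outages longer than min_duration and reports the fraction
    resolved within a few thresholds."""
    durations = sorted(up - down for _, down, up in events if up - down > min_duration)
    if not durations:
        return {}
    def frac_within(limit):
        return sum(d <= limit for d in durations) / len(durations)
    return {
        "failures": len(durations),
        "within_10min": frac_within(600),
        "within_1hr": frac_within(3600),
        "within_1day": frac_within(86400),
    }
```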

What is the impact of networking equipment failure? As discussed in Section , conventional data center networks apply 1:1 redundancy…

[Greenberg et al.]


Page 6

Motivating Environmental Characteristics

Increasing internal traffic is a bottleneck

• Traffic volume between servers is 4x external traffic

Unpredictable, rapidly-changing traffic matrices (TMs)

Design result: Nonblocking fabric

• High throughput for any TM that respects server NIC rates
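The nonblocking claim is exactly the hose model: a traffic matrix is admissible as long as each server's total send rate and total receive rate stay within its NIC speed. A minimal sketch of that admission check (the TM layout and rates are illustrative):

```python
import numpy as np

def respects_hose_model(tm, nic_rate_gbps=1.0):
    """tm[i, j] = offered rate from server i to server j (Gbps).
    The hose model only constrains per-server totals, not individual pairs."""
    send_ok = tm.sum(axis=1) <= nic_rate_gbps + 1e-9
    recv_ok = tm.sum(axis=0) <= nic_rate_gbps + 1e-9
    return bool(send_ok.all() and recv_ok.all())

tm = np.array([[0.0, 0.6, 0.3],
               [0.2, 0.0, 0.5],
               [0.4, 0.1, 0.0]])
print(respects_hose_model(tm))  # True: every row and column sums to <= 1 Gbps
```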


Page 7

Motivating Environmental Characteristics

Failure characteristics

• Analyzed 300K alarm tickets, 36M error events
• 0.4% of failures were resolved in over one day
• 0.3% of failures eliminated all redundancy in a device group (e.g., both uplinks)

Design result: Clos topology

• “Scale out” instead of “scale up”

Page 8

VL2 physical topology

that server with whatever IP address the service expects. Virtual machines should be able to migrate to any server while keeping the same IP address, and the network configuration of each server should be identical to what it would be if connected via a LAN. Finally, features like link-local broadcast, on which many legacy applications depend, should work.

In this paper we design, implement and evaluate VL2, a network architecture for data centers that meets these three objectives and thereby provides agility. In creating VL2, a goal was to investigate whether we could create a network architecture that could be deployed today, so we limit ourselves from making any changes to the hardware of the switches or servers, and we require that legacy applications work unmodified. However, the software and operating systems on data-center servers are already extensively modified (e.g., to create hypervisors for virtualization or blob file-systems to store data). Therefore, VL2's design explores a new split in the responsibilities between host and network — using a layer 2.5 shim in servers' network stack to work around limitations of the network devices. No new switch software or APIs are needed.

VL2 consists of a network built from low-cost switch ASICs arranged into a Clos topology [] that provides extensive path diversity between servers. Our measurements show data centers have tremendous volatility in their workload, their traffic, and their failure patterns. To cope with this volatility, we adopt Valiant Load Balancing (VLB) [, ] to spread traffic across all available paths without any centralized coordination or traffic engineering. Using VLB, each server independently picks a path at random through the network for each of the flows it sends to other servers in the data center. Common concerns with VLB, such as the extra latency and the consumption of extra network capacity caused by path stretch, are overcome by a combination of our environment (propagation delay is very small inside a data center) and our topology (which includes an extra layer of switches that packets bounce off of). Our experiments verify that our choice of using VLB achieves both the uniform capacity and performance isolation objectives.
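A sketch of the per-flow randomization idea: hash the flow's 5-tuple and use the hash to pick one Intermediate switch, so that all packets of a flow follow one path while different flows spread evenly. The hash choice and names are illustrative only; VL2's actual mechanism combines anycast addresses with ECMP in the switches.

```python
import hashlib

def pick_intermediate(five_tuple, intermediates):
    """Deterministically map a flow to one Intermediate switch.
    five_tuple: (src_ip, dst_ip, proto, src_port, dst_port)."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(intermediates)
    return intermediates[index]

intermediates = [f"int-{i}" for i in range(6)]
flow = ("20.0.0.55", "20.0.1.10", "tcp", 51200, 80)
print(pick_intermediate(flow, intermediates))  # same flow -> same switch; distinct flows spread evenly
```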

The switches that make up the network operate as layer-3 routers with routing tables calculated by OSPF, thereby enabling the use of multiple paths (unlike Spanning Tree Protocol) while using a well-trusted protocol. However, the IP addresses used by services running in the data center cannot be tied to particular switches in the network, or the agility to reassign servers between services would be lost. Leveraging a trick used in many systems [], VL2 assigns servers IP addresses that act as names alone, with no topological significance. When a server sends a packet, the shim-layer on the server invokes a directory system to learn the actual location of the destination and then tunnels the original packet there. The shim-layer also helps eliminate the scalability problems created by ARP in layer-2 networks, and the tunneling improves our ability to implement VLB. These aspects of the design enable VL2 to provide layer-2 semantics, while eliminating the fragmentation and waste of server pool capacity that the binding between addresses and locations causes in the existing architecture.

Taken together, VL2's choices of topology, routing design, and software architecture create a huge shared pool of network capacity that each pair of servers can draw from when communicating. We implement VLB by causing the traffic between any pair of servers to bounce off a randomly chosen switch in the top level of the Clos topology and leverage the features of layer-3 routers, such as Equal-Cost MultiPath (ECMP), to spread the traffic along multiple subpaths for these two path segments. Further, we use anycast addresses and an implementation of Paxos [] in a way that simplifies the design of the Directory System and, when failures occur, provides consistency properties that are on par with existing protocols.

Figure: A conventional network architecture for data centers (adapted from figure by Cisco []).

The feasibility of our design rests on several questions that we experimentally evaluate. First, the theory behind Valiant Load Balancing, which proves that the network will be hot-spot free, requires that (a) randomization is performed at the granularity of small packets, and (b) the traffic sent into the network conforms to the hose model []. For practical reasons, however, VL2 picks a different path for each flow rather than packet (falling short of (a)), and it also relies on TCP to police the offered traffic to the hose model (falling short of (b), as TCP needs multiple RTTs to conform traffic to the hose model). Nonetheless, our experiments show that for data-center traffic, the VL2 design choices are sufficient to offer the desired hot-spot free properties in real deployments. Second, the directory system that provides the routing information needed to reach servers in the data center must be able to handle heavy workloads at very low latency. We show that designing and implementing such a directory system is achievable.

In the remainder of this paper we will make the following contributions, in roughly this order.

• We make a first of its kind study of the traffic patterns in a production data center, and find that there is tremendous volatility in the traffic, cycling among 50-60 different patterns during a day and spending less than 100 s in each pattern at the 60th percentile.

• We design, build, and deploy every component of VL2 in an 80-server cluster. Using the cluster, we experimentally validate that VL2 has the properties set out as objectives, such as uniform capacity and performance isolation. We also demonstrate the speed of the network, such as its ability to shuffle 2.7 TB of data among 75 servers in 395 s.

• We apply Valiant Load Balancing in a new context, the inter-switch fabric of a data center, and show that flow-level traffic splitting achieves almost identical split ratios (within 1% of optimal fairness index) on realistic data center traffic, and it smoothes utilization while eliminating persistent congestion.

• We justify the design trade-offs made in VL2 by comparing the cost of a VL2 network with that of an equivalent network based on existing designs.

2. BACKGROUND

In this section, we first explain the dominant design pattern for data-center architecture today []. We then discuss why this architecture is insufficient to serve large cloud-service data centers.

As shown in Figure , the network is a hierarchy reaching from a layer of servers in racks at the bottom to a layer of core routers at the top. There are typically 20 to 40 servers per rack, each singly connected to a Top of Rack (ToR) switch with a 1 Gbps link. ToRs connect to two aggregation switches for redundancy, and these switches aggregate further connecting to access routers. At the top of the hierarchy, core routers carry traffic between access routers and manage traffic into and out of the data center.

Traditional vs. VL2

[Figure: the VL2 folded Clos. A link-state network carrying only LAs (e.g., 10/8) connects the Internet, DA/2 Intermediate switches (DI × 10G links each), DI Aggregate switches (DA/2 × 10G up and DA/2 × 10G down), and DA·DI/4 ToR switches (2 × 10G uplinks each); below sit 20·(DA·DI/4) servers, a fungible pool owning AAs (e.g., 20/8).]

Figure: An example Clos network between Aggregation and Intermediate switches provides a richly-connected backbone well-suited for VLB. The network is built with two separate address families — topologically significant Locator Addresses (LAs) and flat Application Addresses (AAs).

…redundancy to improve reliability at higher layers of the hierarchical tree. Despite these techniques, we find that in 0.3% of failures all redundant components in a network device group became unavailable (e.g., the pair of switches that comprise each node in the conventional network (Figure ) or both the uplinks from a switch). In one incident, the failure of a core switch (due to a faulty supervisor card) affected ten million users for about four hours. We found the main causes of these downtimes are network misconfigurations, firmware bugs, and faulty components (e.g., ports). With no obvious way to eliminate all failures from the top of the hierarchy, VL2's approach is to broaden the topmost levels of the network so that the impact of failures is muted and performance degrades gracefully, moving from 1:1 redundancy to n:m redundancy.

4. VIRTUAL LAYER TWO NETWORKING

Before detailing our solution, we briefly discuss our design principles and preview how they will be used in the VL2 design.

Randomizing to Cope with Volatility: VL2 copes with the high divergence and unpredictability of data-center traffic matrices by using Valiant Load Balancing to do destination-independent (e.g., random) traffic spreading across multiple intermediate nodes. We introduce our network topology suited for VLB in §4.1, and the corresponding flow spreading mechanism in §.

VLB, in theory, ensures a non-interfering packet switched network [], the counterpart of a non-blocking circuit switched network, as long as (a) traffic spreading ratios are uniform, and (b) the offered traffic patterns do not violate edge constraints (i.e., line card speeds). To meet the latter condition, we rely on TCP's end-to-end congestion control mechanism. While our mechanisms to realize VLB do not perfectly meet either of these conditions, we show in §. that our scheme's performance is close to the optimum.

Building on proven networking technology: VL2 is based on IP routing and forwarding technologies that are already available in commodity switches: link-state routing, equal-cost multi-path (ECMP) forwarding, IP anycasting, and IP multicasting. VL2 uses a link-state routing protocol to maintain the switch-level topology, but not to disseminate end hosts' information. This strategy protects switches from needing to learn voluminous, frequently-changing host information. Furthermore, the routing design uses ECMP forwarding along with anycast addresses to enable VLB with minimal control plane messaging or churn.

Separating names from locators: The data center network must support agility, which means, in particular, support for hosting any service on any server, for rapid growing and shrinking of server pools, and for rapid virtual machine migration. In turn, this calls for separating names from locations. VL2's addressing scheme separates server names, termed application-specific addresses (AAs), from their locations, termed location-specific addresses (LAs). VL2 uses a scalable, reliable directory system to maintain the mappings between names and locators. A shim layer running in the network stack on every server, called the VL2 agent, invokes the directory system's resolution service. We evaluate the performance of the directory system in §.
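A toy sketch of the name/locator split from the sender's side: resolve the destination's AA to a locator, cache it, and tunnel the packet. The directory API and packet representation here are invented for illustration; the real VL2 agent also encapsulates to an anycast Intermediate-switch LA, which is omitted.

```python
class DirectoryClient:
    """Toy AA -> LA resolver with a local cache, standing in for the
    directory-system lookup a VL2-style agent would perform."""
    def __init__(self, mappings):
        self._mappings = dict(mappings)   # authoritative AA -> ToR LA
        self._cache = {}

    def resolve(self, aa):
        if aa not in self._cache:
            self._cache[aa] = self._mappings[aa]   # would be an RPC in practice
        return self._cache[aa]

def encapsulate(packet, directory):
    """Wrap an AA-addressed packet in an outer header addressed to the
    destination ToR's LA; the fabric routes only on LAs."""
    la = directory.resolve(packet["dst_aa"])
    return {"outer_dst_la": la, "inner": packet}

directory = DirectoryClient({"20.0.0.55": "10.0.6.1"})
pkt = {"src_aa": "20.0.0.1", "dst_aa": "20.0.0.55", "payload": b"hello"}
print(encapsulate(pkt, directory)["outer_dst_la"])  # 10.0.6.1
```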

Embracing End Systems: The rich and homogeneous programmability available at data-center hosts provides a mechanism to rapidly realize new functionality. For example, the VL2 agent enables fine-grained path control by adjusting the randomization used in VLB. The agent also replaces Ethernet's ARP functionality with queries to the VL2 directory system. The directory system itself is also realized on servers, rather than switches, and thus offers flexibility, such as fine-grained, context-aware server access control and dynamic service re-provisioning.

We next describe each aspect of the VL2 system and how they work together to implement a virtual layer-2 network. These aspects include the network topology, the addressing and routing design, and the directory that manages name-locator mappings.

4.1 Scale-out Topologies

As described in §., conventional hierarchical data-center topologies have poor bisection bandwidth and are also susceptible to major disruptions due to device failures at the highest levels. Rather than scale up individual network devices with more capacity and features, we scale out the devices — build a broad network offering huge aggregate capacity using a large number of simple, inexpensive devices, as shown in Figure . This is an example of a folded Clos network [] where the links between the Intermediate switches and the Aggregation switches form a complete bipartite graph. As in the conventional topology, ToRs connect to two Aggregation switches, but the large number of paths between every two Aggregation switches means that if there are n Intermediate switches, the failure of any one of them reduces the bisection bandwidth by only 1/n – a desirable graceful degradation of bandwidth that we evaluate in §. Further, it is easy and less expensive to build a Clos network for which there is no over-subscription (further discussion on cost is in §). For example, in Figure , we use DA-port Aggregation and DI-port Intermediate switches, and connect these switches such that the capacity between each layer is DI·DA/2 times the link capacity.
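The sizing relations stated above, written out as a small helper; the switch radices used in the example call are hypothetical, while 20 servers per ToR and 10G switch links follow the figure:

```python
def clos_sizing(d_a, d_i, servers_per_tor=20, link_gbps=10):
    """Counts for a VL2-style folded Clos built from DA-port Aggregation
    and DI-port Intermediate switches, as described in the text."""
    n_intermediate = d_a // 2
    n_aggregation  = d_i
    n_tor          = d_a * d_i // 4
    n_servers      = servers_per_tor * n_tor
    inter_layer_gbps = (d_i * d_a // 2) * link_gbps   # capacity between layers
    return {
        "intermediate": n_intermediate,
        "aggregation": n_aggregation,
        "tor": n_tor,
        "servers": n_servers,
        "inter_layer_capacity_gbps": inter_layer_gbps,
        # losing one Intermediate switch removes 1/n of the bisection bandwidth
        "loss_if_one_intermediate_fails": 1 / n_intermediate,
    }

print(clos_sizing(d_a=72, d_i=144))  # hypothetical radices: 51,840 servers, 51,840 Gbps between layers
```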

The Clos topology is exceptionally well suited for VLB in that by indirectly forwarding traffic through an Intermediate switch at the top tier or “spine” of the network, the network can provide bandwidth guarantees for any traffic matrices subject to the hose model. Meanwhile, routing is extremely simple and resilient on this topology — take a random path up to a random intermediate switch and a random path down to a destination ToR switch.

VL2 leverages the fact that at every generation of technology, switch-to-switch links are typically faster than server-to-switch links, and trends suggest that this gap will remain. Our current design uses 1G server links and 10G switch links, and the next design point will probably be 10G server links with 40G switch links. By leveraging this gap, we reduce the number of cables required to implement the Clos (as compared with a fat-tree []), and we simplify the task of spreading load over the links (§.).

[Figures from Greenberg et al.]

Page 9

Routing in VL2

Unpredictable traffic

• Difficult to adapt

Design result: “Valiant Load Balancing”

• Route traffic independent of current traffic matrix
• Spreads an arbitrary traffic pattern so it is uniform among top-layer switches

Page 10

Routing Implementation

[Figure: the example Clos network from Greenberg et al., as on Page 8, here annotated to show that every Intermediate switch carries the same anycast Locator Address, 10.1.1.1.]

Page 11: Network Virtualization Case Study: VL2 - ETH Z · Case Study VL2: A Scalable and Flexible Data Center Network Albert Greenberg James R. Hamilton Navendu Jain Srikanth Kandula Changhoon

Routing Implementation

. . .

. . .

!"#

$%&

. . .

. . . .

'(()

DA/2 x 10G

DA/2 x 10G

DI x10G

2 x10G DADI/4 x ToR Switches

DI x Aggregate Switches

20(DADI/4) x Servers

InternetLink-state networkcarrying only LAs

(e.g., 10/8) DA/2 x Intermediate Switches

Fungible pool ofservers owning AAs

(e.g., 20/8)

Figure : An exampleClos network betweenAggregation and In-termediate switches provides a richly-connected backbone well-suited for VLB. The network is built with two separate addressfamilies— topologically significant LocatorAddresses (LAs) andflat Application Addresses (AAs).

dundancy to improve reliability at higher layers of the hierarchicaltree. Despite these techniques, we find that in . of failures allredundant components in a network device group became unavail-able (e.g., the pair of switches that comprise each node in the con-ventional network (Figure ) or both the uplinks from a switch). Inone incident, the failure of a core switch (due to a faulty supervi-sor card) affected ten million users for about four hours. We foundthe main causes of these downtimes are networkmisconfigurations,firmware bugs, and faulty components (e.g., ports). With no obvi-ous way to eliminate all failures from the top of the hierarchy, VL’sapproach is to broaden the topmost levels of the network so that theimpact of failures is muted and performance degrades gracefully,moving from : redundancy to n:m redundancy.

4. VIRTUAL LAYER TWO NETWORKINGBefore detailing our solution, we brieflydiscuss our design prin-

ciples and preview how they will be used in the VL design.Randomizing to Cope with Volatility: VL copes with

the high divergence and unpredictability of data-center trafficmatrices by using Valiant Load Balancing to do destination-independent (e.g., random) traffic spreading across multiple inter-mediate nodes. We introduce our network topology suited for VLBin §., and the corresponding flow spreading mechanism in §..

VLB, in theory, ensures a non-interfering packet switched net-work [], the counterpart of a non-blocking circuit switched net-work, as long as (a) traffic spreading ratios are uniform, and (b) theoffered traffic patterns do not violate edge constraints (i.e., line cardspeeds). To meet the latter condition, we rely on TCP’s end-to-endcongestion control mechanism. While our mechanisms to realizeVLB do not perfectly meet either of these conditions, we show in§. that our scheme’s performance is close to the optimum.

Building on proven networking technology: VL is based onIP routing and forwarding technologies that are already availablein commodity switches: link-state routing, equal-cost multi-path(ECMP) forwarding, IP anycasting, and IP multicasting. VL usesa link-state routing protocol to maintain the switch-level topology,but not to disseminate endhosts’ information.This strategyprotectsswitches from needing to learn voluminous, frequently-changinghost information. Furthermore, the routing design uses ECMP for-warding along with anycast addresses to enable VLB with minimalcontrol plane messaging or churn.

Separating names from locators: The data center networkmust support agility, whichmeans, in particular, support for hostingany service on any server, for rapid growing and shrinking of serverpools, and for rapid virtual machine migration. In turn, this callsfor separating names from locations. VL’s addressing scheme sep-arates server names, termed application-specific addresses (AAs),from their locations, termed location-specific addresses (LAs). VLuses a scalable, reliable directory system to maintain the mappingsbetween names and locators. A shim layer running in the networkstack on every server, called the VL agent, invokes the directorysystem’s resolution service. We evaluate the performance of the di-rectory system in §..

Embracing End Systems: The rich and homogeneous pro-grammability available at data-center hosts provides a mechanismto rapidly realize new functionality. For example, the VL agent en-ables fine-grained path control by adjusting the randomization usedin VLB. The agent also replaces Ethernet’s ARP functionality withqueries to the VL directory system. The directory system itself isalso realized on servers, rather than switches, and thus offers flexi-bility, such as fine-grained, context-aware server access control anddynamic service re-provisioning.

We next describe each aspect of the VL system and how theywork together to implement a virtual layer- network.These aspectsinclude the network topology, the addressing and routing design,and the directory that manages name-locator mappings.

4.1 Scale-out Topologies

As described earlier, conventional hierarchical data-center topologies have poor bisection bandwidth and are also susceptible to major disruptions due to device failures at the highest levels. Rather than scale up individual network devices with more capacity and features, we scale out the devices — build a broad network offering huge aggregate capacity using a large number of simple, inexpensive devices, as shown in Figure 5. This is an example of a folded Clos network [] where the links between the Intermediate switches and the Aggregation switches form a complete bipartite graph. As in the conventional topology, ToRs connect to two Aggregation switches, but the large number of paths between every two Aggregation switches means that if there are n Intermediate switches, the failure of any one of them reduces the bisection bandwidth by only 1/n, a desirable graceful degradation of bandwidth that we evaluate later. Further, it is easy and less expensive to build a Clos network for which there is no over-subscription (we return to cost later). For example, in Figure 5, we use DA-port Aggregation and DI-port Intermediate switches, and connect these switches such that the capacity between each layer is DI·DA/2 times the link capacity.
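To make the sizing arithmetic concrete, the following is a minimal sketch (not from the paper) that derives the device counts and inter-layer capacity shown in Figure 5 from assumed port counts DA and DI; the 20-servers-per-ToR figure follows from each ToR's 2 x 10G uplinks feeding 1G server links.

```python
# Sketch: sizing a VL2-style folded Clos from the port counts above.
# The port counts in the example call are assumptions chosen only for illustration.
def clos_sizing(da: int, di: int, link_gbps: int = 10):
    """da = ports per Aggregation switch, di = ports per Intermediate switch."""
    intermediates = da // 2                        # DA/2 Intermediate switches
    aggregations = di                              # DI Aggregation switches
    tors = da * di // 4                            # DA*DI/4 ToR switches, each with 2 x 10G uplinks
    servers = 20 * tors                            # 20 x 1G servers fill a ToR's 2 x 10G of uplink
    interlayer_gbps = (di * da // 2) * link_gbps   # DI*DA/2 links between the top two layers
    return intermediates, aggregations, tors, servers, interlayer_gbps

# Example with (assumed) 144-port switches; losing one Intermediate switch
# removes only 1/(DA/2) of the inter-layer capacity.
print(clos_sizing(da=144, di=144))
```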

The Clos topology is exceptionally well suited for VLB in that by indirectly forwarding traffic through an Intermediate switch at the top tier or “spine” of the network, the network can provide bandwidth guarantees for any traffic matrices subject to the hose model. Meanwhile, routing is extremely simple and resilient on this topology — take a random path up to a random intermediate switch and a random path down to a destination ToR switch.
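As a toy illustration of this routing rule (a sketch with made-up switch names, not the paper's mechanism, in which ECMP effectively makes the random choices), a path is composed by choosing the up-hop, the Intermediate switch, and the down-hop uniformly at random:

```python
import random

# Toy topology labels (assumptions for illustration); in VL2 the random choice is
# effectively made by ECMP hashing in the switches, not by explicit path computation.
INTERMEDIATES = ["int1", "int2", "int3"]
AGGS_OF_TOR = {"tor_src": ["agg1", "agg2"], "tor_dst": ["agg3", "agg4"]}

def random_vlb_path(src_tor, dst_tor):
    """Random path up to a random Intermediate switch, then a random path down."""
    up_agg = random.choice(AGGS_OF_TOR[src_tor])     # random Aggregation switch on the way up
    spine = random.choice(INTERMEDIATES)             # random Intermediate ("spine") switch
    down_agg = random.choice(AGGS_OF_TOR[dst_tor])   # random Aggregation switch on the way down
    return [src_tor, up_agg, spine, down_agg, dst_tor]

print(random_vlb_path("tor_src", "tor_dst"))
```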

VL2 leverages the fact that at every generation of technology, switch-to-switch links are typically faster than server-to-switch links, and trends suggest that this gap will remain. Our current design uses 1G server links and 10G switch links, and the next design point will probably be 10G server links with 40G switch links. By leveraging this gap, we reduce the number of cables required to implement the Clos (as compared with a fat-tree []), and we simplify the task of spreading load over the links (§4.2.2).



Routing Implementation

[Figure 5 diagram: DA/2 Intermediate switches, DI Aggregation switches, and DA·DI/4 ToR switches (each with 2 x 10G uplinks) serving 20·(DA·DI/4) servers; DI x 10G and DA/2 x 10G links between the layers. Labels: Internet; link-state network carrying only LAs (e.g., 10/8); fungible pool of servers owning AAs (e.g., 20/8).]

Figure 5: An example Clos network between Aggregation and Intermediate switches provides a richly-connected backbone well-suited for VLB. The network is built with two separate address families — topologically significant Locator Addresses (LAs) and flat Application Addresses (AAs).


10.1.1.1, 10.1.1.1, 10.1.1.1 (the same anycast LA assigned to every Intermediate switch)

Similar effect to ECMP to each rack

Smaller forwarding tables at most switches


Virtualization

“All problems in computer science can be solved by another level of indirection.”

– David Wheeler

App / Tenant layer
• Application Addresses (AAs): Location independent
• Illusion of a single big Layer 2 switch connecting the app

Virtualization layer
• Directory server: Maintain AA to LA mapping
• Server agent: Query server, wrap AAs in outer LA header

Physical network layer
• Locator Addresses (LAs): Tied to topology, used to route
• Layer 3 routing via OSPF


End-to-end example

!"#$%&'()#*+),#--$#./0123

4+'56$)7)(#'()*895#*+),#4-$#.:0123

!"#$%&'()#*+),#--$#./0123

!"!"#$#$#$%%&

!')!'#$'$'$'&

#"!"#$#$#$%(&

!"#$%"&

!"#"#"#$$ !"#"#"#$%H&'() *"#"#"#%

H&'() *"#*#*#*

!"#$%"&

!"#"#"#$$ !"#"#"#%%

!'#$#$#$(&

;8<!"#$#$#$'&

!'#$#$#$)&

;8<!"#$#$#$'&

!')!'#$'$'$'&

!')!'#$'$'$'&

. . . . . .

!"#$%"&

!"#"#"#$$ !"#"#"#$%H&'() *"#"#"#%

H&'() *"#*#*#*

!"#$%"&

!"#"#"#$$ !"#"#"#$%

H&'() *"#"#"#%

Figure 6: VLB in an example VL2 network. Sender S sends packets to destination D via a randomly-chosen intermediate switch using IP-in-IP encapsulation. AAs are from 20/8, and LAs are from 10/8. H(ft) denotes a hash of the five tuple.

4.2 VL2 Addressing and Routing

This section explains how packets flow through a VL2 network, and how the topology, routing design, VL2 agent, and directory system combine to virtualize the underlying network fabric — creating the illusion that hosts are connected to a big, non-interfering data-center-wide layer-2 switch.

4.2.1 Address resolution and packet forwarding

VL2 uses two different IP-address families, as illustrated in Figure 5. The network infrastructure operates using location-specific IP addresses (LAs); all switches and interfaces are assigned LAs, and switches run an IP-based (layer-3) link-state routing protocol that disseminates only these LAs. This allows switches to obtain the complete switch-level topology, as well as forward packets encapsulated with LAs along shortest paths. On the other hand, applications use application-specific IP addresses (AAs), which remain unaltered no matter how servers' locations change due to virtual-machine migration or re-provisioning. Each AA (server) is associated with an LA, the identifier of the ToR switch to which the server is connected. The VL2 directory system stores the mapping of AAs to LAs, and this mapping is created when application servers are provisioned to a service and assigned AA addresses.

The crux of offering layer-2 semantics is having servers believe they share a single large IP subnet (i.e., the entire AA space) with other servers in the same service, while eliminating the ARP and DHCP scaling bottlenecks that plague large Ethernets.

Packet forwarding: To route traffic between servers, which use AA addresses, on an underlying network that knows routes for LA addresses, the VL2 agent at each server traps packets from the host and encapsulates the packet with the LA address of the ToR of the destination, as shown in Figure 6. Once the packet arrives at the LA (the destination ToR), the switch decapsulates the packet and delivers it to the destination AA carried in the inner header.

Address resolution: Servers in each service are configured to believe that they all belong to the same IP subnet. Hence, when an application sends a packet to an AA for the first time, the networking stack on the host generates a broadcast ARP request for the destination AA. The VL2 agent running on the host intercepts this ARP request and converts it to a unicast query to the VL2 directory system. The directory system answers the query with the LA of the ToR to which packets should be tunneled. The VL2 agent caches this mapping from AA to LA addresses, similar to a host's ARP cache, such that subsequent communication need not entail a directory lookup.
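A minimal sketch of the agent-side logic described in the two paragraphs above (class, method, and RPC names are assumptions; the real VL2 agent is a shim in the server's network stack): intercept what would have been an ARP, resolve the AA through the directory, cache the answer, and tunnel subsequent packets to the returned ToR LA.

```python
# Sketch of the VL2 agent's resolve-and-tunnel behavior (illustrative names only).
class VL2Agent:
    def __init__(self, directory_client):
        self.directory = directory_client   # some unicast query channel to the directory system
        self.arp_like_cache = {}            # AA -> ToR LA, analogous to an ARP cache

    def resolve(self, dst_aa: str) -> str:
        """Replace the broadcast ARP with a unicast directory lookup, with caching."""
        if dst_aa not in self.arp_like_cache:
            self.arp_like_cache[dst_aa] = self.directory.lookup(dst_aa)
        return self.arp_like_cache[dst_aa]

    def send(self, inner_packet: bytes, dst_aa: str):
        """Encapsulate the AA packet toward the LA of the destination's ToR (IP-in-IP)."""
        tor_la = self.resolve(dst_aa)
        outer_header = {"dst": tor_la}      # stand-in for the real outer IP header
        return outer_header, inner_packet
```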

Access control via the directory service: A server cannot send packets to an AA if the directory service refuses to provide it with an LA through which it can route its packets. This means that the directory service can enforce access-control policies. Further, since the directory system knows which server is making the request when handling a lookup, it can enforce fine-grained isolation policies. For example, it could enforce the policy that only servers belonging to the same service can communicate with each other. An advantage of VL2 is that, when inter-service communication is allowed, packets flow directly from a source to a destination, without being detoured to an IP gateway as is required to connect two VLANs in the conventional architecture.
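One way the directory side could combine resolution with such a policy check is sketched below (the same-service rule and all names are assumptions; the paper leaves the policy model open): a lookup returns an LA only if the requester is allowed to reach the target.

```python
from typing import Optional

# Sketch: directory lookup doubling as an access-control check (illustrative only).
class Directory:
    def __init__(self):
        self.aa_to_la = {}         # AA -> LA of the hosting ToR
        self.aa_to_service = {}    # AA -> service, recorded when servers are provisioned

    def provision(self, aa: str, tor_la: str, service: str) -> None:
        self.aa_to_la[aa] = tor_la
        self.aa_to_service[aa] = service

    def lookup(self, requester_aa: str, target_aa: str) -> Optional[str]:
        """Refuse to hand out an LA unless the isolation policy allows the pair."""
        if self.aa_to_service.get(requester_aa) != self.aa_to_service.get(target_aa):
            return None            # no LA returned, so the requester cannot route to the target
        return self.aa_to_la.get(target_aa)

d = Directory()
d.provision("20.0.0.56", "10.0.0.6", service="web")
d.provision("20.0.0.55", "10.0.0.7", service="web")   # second AA/LA pair assumed for the example
print(d.lookup("20.0.0.55", "20.0.0.56"))              # 10.0.0.6: same service, lookup allowed
```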

These addressing and forwarding mechanisms were chosen for two reasons. First, they make it possible to use low-cost switches, which often have small routing tables (typically just 16K entries) that can hold only LA routes, without concern for the huge number of AAs. Second, they reduce overhead in the network control plane by preventing it from seeing the churn in host state, tasking it to the more scalable directory system instead.

4.2.2 Random traffic spreading over multiple paths

To offer hot-spot-free performance for arbitrary traffic matrices, VL2 uses two related mechanisms: VLB and ECMP. The goals of both are similar — VLB distributes traffic across a set of intermediate nodes and ECMP distributes across equal-cost paths — but each is needed to overcome limitations in the other. VL2 uses flows, rather than packets, as the basic unit of traffic spreading and thus avoids out-of-order delivery.

Figure 6 illustrates how the VL2 agent uses encapsulation to implement VLB by sending traffic through a randomly-chosen Intermediate switch. The packet is first delivered to one of the Intermediate switches, decapsulated by the switch, delivered to the ToR's LA, decapsulated again, and finally sent to the destination.
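Putting the pieces together with the addresses used in this example (destination AA 20.0.0.56 behind ToR LA 10.0.0.6, and 10.1.1.1 as the Intermediate switches' shared anycast LA, as annotated on the Routing Implementation slide), the header stack might look like the sketch below; the sender AA and the exact field layout are assumptions.

```python
# Sketch of the nested headers as the packet leaves the sender (illustrative layout).
packet_on_the_wire = {
    "outer":  {"src": "H(five-tuple)", "dst": "10.1.1.1"},   # anycast LA of the Intermediate layer
    "middle": {"src": "H(five-tuple)", "dst": "10.0.0.6"},   # LA of the destination's ToR
    "inner":  {"src": "20.0.0.55", "dst": "20.0.0.56"},      # AA header (sender AA assumed)
    "payload": b"...",
}
# The Intermediate switch strips the outer header and forwards on 10.0.0.6;
# the destination ToR strips the middle header and delivers to AA 20.0.0.56.
```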

While encapsulating packets to a specific, but randomly chosen, Intermediate switch correctly realizes VLB, it would require updating a potentially huge number of VL2 agents whenever an Intermediate switch's availability changes due to switch/link failures. Instead, we assign the same LA address to all Intermediate switches, and the directory system returns this anycast address to agents upon lookup. Since all Intermediate switches are exactly three hops away from a source host, ECMP takes care of delivering packets encapsulated with the anycast address to any one of the active Intermediate switches. Upon switch or link failures, ECMP will react, eliminating the need to notify agents and ensuring scalability.

In practice, however, the use of ECMP leads to two problems. First, switches today only support up to 16-way ECMP, with 256-way ECMP being released by some vendors this year. If there are more paths available than ECMP can use, then VL2 defines several anycast addresses, each associated with only as many Intermediate switches as ECMP can accommodate. When an Intermediate switch fails, VL2 reassigns the anycast addresses from that switch to other Intermediate switches so that all anycast addresses remain live, and servers can remain unaware of the network churn. Second, some inexpensive switches cannot correctly retrieve the five-tuple values (e.g., the TCP ports) when a packet is encapsulated with multiple IP headers. Thus, the agent at the source computes a hash of the five-tuple values and writes that value into the source IP address field, which all switches do use in making ECMP forwarding decisions.

The greatest concern with both ECMP and VLB is that if “elephant flows” are present, then the random placement of flows could lead to persistent congestion on some links while others are underutilized. Our evaluation did not find this to be a problem on data-center workloads. Should it occur, initial results show the VL2 agent can detect and deal with such situations with simple mechanisms, such as re-hashing to change the path of large flows when TCP detects a severe congestion event (e.g., a full window loss).
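The two agent-side tricks above can be sketched together as follows (the hash function and the salt-based re-hash are assumptions, not the paper's exact implementation): the five-tuple hash is folded into the source IP field that switches hash on, and bumping a per-flow salt moves an elephant flow to a different path after a severe congestion event.

```python
import hashlib
import socket

def ecmp_source_ip(five_tuple, salt: int = 0) -> str:
    """Fold H(five-tuple, salt) into an IPv4-shaped value for the outer source IP field."""
    digest = hashlib.sha1(repr((five_tuple, salt)).encode()).digest()
    return socket.inet_ntoa(digest[:4])    # switches hash on this field when making ECMP choices

flow = ("20.0.0.55", "20.0.0.56", 6, 44312, 80)   # src AA, dst AA, proto, sport, dport (example values)
print(ecmp_source_ip(flow))             # written into the encapsulation header's source IP
print(ecmp_source_ip(flow, salt=1))     # re-hash after, e.g., a full-window TCP loss
```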

[Greenberg et al.]

Steps in the example:
• Application sends to AA 20.0.0.56
• Host agent encapsulates, after asking the directory servers “Where is AA 20.0.0.56?” and receiving “LA 10.0.0.6”
• Intermediate switch decapsulates
• Destination ToR decapsulates again
• Host agent delivers



Did we achieve agility?

Location independent addressing

• AAs are location independent

L2 network semantics

• Agent intercepts and handles L2 broadcast, multicast

• Both of the above require a “layer 2.5” shim agent running on the host; but the concept transfers to a hypervisor-based virtual switch


Did we achieve agility?

Performance uniformity

• Clos network is nonblocking (non-oversubscribed)
• Uniform capacity everywhere
• ECMP provides good (though not perfect) load balancing
• But, performance isolation among tenants depends on TCP backing off to the rate the destination can receive
• Leaves open the possibility of fast load balancing

Security

• Directory system can allow/deny connections by choosing whether to resolve an AA to a LA

• But, segmentation not explicitly enforced at hosts


Where’s the SDN?

Directory servers: Logically centralized control

• Orchestrate application locations
• Control communication policy

Host agents: dynamic “programming” of data path