ALIAS: Scalable, Decentralized Label Assignment for Data Centers

Meg Walraed-Sullivan, UC San Diego, [email protected]
Radhika Niranjan Mysore, UC San Diego, [email protected]
Malveeka Tewari, UC San Diego, [email protected]
Ying Zhang, Ericsson Research, [email protected]
Keith Marzullo, UC San Diego, [email protected]
Amin Vahdat, UC San Diego, [email protected]

ABSTRACT
Modern data centers can consist of hundreds of thousands of servers and millions of virtualized end hosts. Managing address assignment while simultaneously enabling scalable communication is a challenge in such an environment. We present ALIAS, an addressing and communication protocol that automates topology discovery and address assignment for the hierarchical topologies that underlie many data center network fabrics. Addresses assigned by ALIAS interoperate with a variety of scalable communication techniques. ALIAS is fully decentralized, scales to large network sizes, and dynamically recovers from arbitrary failures, without requiring modifications to hosts or to commodity switch hardware. We demonstrate through simulation that ALIAS quickly and correctly configures networks that support up to hundreds of thousands of hosts, even in the face of failures and erroneous cabling, and we show that ALIAS is a practical solution for auto-configuration with our NetFPGA testbed implementation.

Categories and Subject Descriptors
C.2.1 [Network Architecture and Design]: Network Communications; C.2.2 [Computer-Communication Networks]: Network Protocols

General Terms
Algorithms, Design, Management, Performance, Reliability

Keywords
Data Center Address Assignment, Hierarchical Labeling

SOCC'11, October 27–28, 2011, Cascais, Portugal.

1. ALIAS
The goal of ALIAS is to automatically assign globally unique, topologically meaningful host labels that the network can internally employ for efficient forwarding. We aim to deliver one of the key benefits of IP addressing (hierarchical address assignments such that hosts with the same prefix share the same path through the network, and a single forwarding table entry suffices to reach all such hosts) without requiring manual address assignment and subnet configuration. A requirement in achieving this goal is that ALIAS be entirely decentralized and broadcast-free. At a high level, ALIAS switches automatically locate clusters of good switch connectivity within network topologies and assign a shared, non-conflicting prefix to all hosts below such pockets. The resulting hierarchically aggregatable labels result in compact switch forwarding entries.[1] Labels are simply a reflection of the current topology; ALIAS updates and reassigns labels to affected hosts based on topology dynamics.

1.1 Environment
ALIAS overlays a logical hierarchy on its input topology. In this logical hierarchy, switches are partitioned into levels and each switch belongs to exactly one level. Switches connect predominantly to switches in the levels directly above or below them, though pairs of switches at the same level (peers) may connect to each other via peer links.

One high-level dichotomy in multi-computer interconnects is that of direct versus indirect topologies [19]. In a direct topology, a host can connect to any switch in the network. With indirect topologies, only a subset of the switches connect directly to hosts; communication between hosts connected to different switches is facilitated by one or more intermediate switch levels. We focus on indirect topologies because such topologies appear more amenable to automatic configuration and because they make up the vast majority of topologies currently deployed in the data center [1, 11, 14, 3, 6]. Figure 1(a) gives an example of an indirect 3-level topology, on which ALIAS has overlaid a logical hierarchy. In the figure, Sx and Hx refer to unique IDs of switches and hosts, respectively.

A host with multiple network interfaces may connect to multiple switches, and will have a separate ALIAS label for each interface. ALIAS also assumes that hosts do not play a switching role in the network and that switches are programmable (or run software such as OpenFlow [2]).

[1] We use the terms "label" and "address" interchangeably.


[Figure 1: Level and Label Assignment for Sample Topology. (a) Sample Topology; (b) Sample Level and Label Assignment.]

1.2 Protocol Overview
ALIAS first assigns topologically meaningful labels to hosts, and then enables communication over these labels. As with IP subnetting, topologically nearby hosts share a common prefix in their labels. In general, longer shared prefixes correspond to closer hosts. ALIAS groups hosts into related clusters by automatically locating pockets of strong connectivity in the hierarchy: groups of switches separated by one level in the hierarchy with full bipartite connectivity between them. However, even assigning a common prefix to all hosts connected to the same leaf switch can reduce the number of required forwarding table entries by a large factor (e.g., the number of host-facing switch ports multiplied by the typical number of virtual machines on each host).

1.2.1 Hierarchical Label Assignment
ALIAS labels are of the form (cn−1...c1.H.VM); the first n−1 fields encode a host's location within an n-level topology, the H field identifies the port to which each host connects on its local switch, and the VM field provides support for multiple VMs multiplexed onto a single physical machine. ALIAS assigns these hierarchically meaningful labels by locating clusters of high connectivity and assigning to each cluster (and its member switches) a coordinate. Coordinates then combine to form host labels; the concatenation of switches' coordinates along a path from the core of the hierarchy to a host makes up the ci fields of a host's label.
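As a concrete illustration of this label format, the sketch below composes and splits a dotted label of the form (cn−1...c1.H.VM). The helper names `make_label` and `split_label` are ours, purely for illustration, and are not part of ALIAS.

```python
# Illustrative only: composing and splitting an ALIAS-style label
# (c_{n-1}...c_1.H.VM). make_label/split_label are hypothetical helpers.

def make_label(coords, h, vm):
    """coords lists c_{n-1}..c_1, topmost coordinate first."""
    return ".".join(str(f) for f in (*coords, h, vm))

def split_label(label):
    """Return (coords, H, VM) for a dotted label string."""
    *coords, h, vm = (int(f) for f in label.split("."))
    return coords, h, vm

# H2 in Figure 1(b): topology prefix 1.2, host port 1, VM 0.
label = make_label([1, 2], 1, 0)      # "1.2.1.0"
coords, h, vm = split_label(label)
```

For a 3-level topology as in Figure 1, `coords` holds the single topology prefix pair (c2, c1); deeper hierarchies simply prepend more coordinates.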

Prior to selecting coordinates, switches first discover their levels within the hierarchy, as well as those of their neighbors. Switches i hops from the nearest host are in level Li, as indicated by the L1, L2, and L3 labels in Figure 1(b).

Once a switch establishes its level, it begins to participate in coordinate assignment. ALIAS first assigns unique H-coordinates to all hosts connected to the same L1 switch, creating multiple one-level trees with an L1 switch at the root and hosts as leaves. Next, ALIAS locates sets of L2 switches connected via full bipartite graphs to sets of L1 switches, and groups each such set of L2 switches into a hypernode (HN). The intuition behind HNs is that all L2 switches in an L2HN can reach the same set of L1 switches, and therefore these L2 switches can all share the same prefix. This process continues up the hierarchy, grouping Li switches into LiHNs based on bipartite connections to Li−1HNs.

Finally, ALIAS assigns unique coordinates to switches, where a coordinate is a number shared by all switches in an HN and unique across all other HNs at the same level. By sharing coordinates between HN members, ALIAS leverages the hierarchy present in the topology and reduces the number of coordinates used overall, thus collapsing forwarding table entries. Switches at the core of the hierarchy do not require coordinates and are not grouped into HNs (see Section 2.4 for more details). L1 switches select coordinates without being grouped into HNs. Further, we employ an optimization (Section 2.2) that assigns multiple coordinates to an Li switch, one per neighboring Li+1HN.

When the physical topology changes due to switch, host, or link failure, configuration changes, or any other circumstances, ALIAS adjusts all label assignments and forwarding entries as necessary (Sections 2.3 and 3.3).

Figure 1(b) shows a possible set of coordinate assignments and the resulting host label assignments for the topology of Figure 1(a); only topology-related prefixes are shown for host labels. For this 3-level topology, L2 switches are grouped into HNs (as shown with dotted lines), and L1 switches have multiple coordinates corresponding to multiple neighboring L2HNs. Hosts have multiple labels corresponding to the L2HNs connected to their ingress L1 switches.

1.2.2 Communication
ALIAS's labels can be used in a variety of routing and forwarding contexts, such as tunneling, IP-encapsulation, or MAC address rewriting [13]. We have implemented one such communication technique (based on MAC address rewriting) and present an example of this communication below.

An ALIAS packet's traversal through the topology is controlled by a combination of forwarding (Section 3.2) and addressing logic (Section 2.2). Consider the topology shown in Figure 1(b). A packet sent from H4 to H2 must flow upward to one of S7, S8, or S9, and then downward towards its destination. First, H4 sends an ARP to its first-hop switch, S13, for H2's label (Section 3.3). S13 determines this label (with cooperation from nearby switches if necessary) and responds to H4. H4 can then forward its packet to S13 with the appropriate label for H2, for example (1.2.1.0) if H2 is connected to port 1 of S2 and has VM coordinate 0. At this point, forwarding logic moves the packet to one of S7, S8, or S9, all of which have a downward path to H2; the routing protocol (Section 3.1) creates the proper forwarding entries at switches between H4 and the core of the network, so that the packet can move towards an appropriate L3 switch. Next, the packet is forwarded to one of S5 or S6, based on the (1.x.x.x) prefix of H2's label. Finally, based on the second field of H2's label, the packet moves to S2, where it can be delivered to its destination.
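The downward steering described above can be mimicked with a toy longest-prefix lookup over label fields. The table contents and the helper below are assumptions for illustration only; they are not ALIAS's actual forwarding implementation (Section 3.2).

```python
# Sketch (assumed, not ALIAS's code): downward forwarding by longest
# matching coordinate prefix, mirroring how a packet for H2's label
# (1.2.1.0) is steered first on (1.x.x.x), then on the second field.

def longest_prefix_port(table, label):
    """table maps coordinate-prefix tuples to egress ports."""
    fields = tuple(int(f) for f in label.split("."))
    for plen in range(len(fields), 0, -1):
        port = table.get(fields[:plen])
        if port is not None:
            return port
    return None  # no downward entry: the packet is forwarded upward

# Hypothetical table at an L2 switch: prefix (1, 2) reaches S2 via port 4.
table = {(1,): 9, (1, 2): 4}
port = longest_prefix_port(table, "1.2.1.0")   # matches prefix (1, 2)
```

The hierarchy keeps such tables small: one entry per lower HN (or per host-facing port at L1) suffices, rather than one per host.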

1.3 Multi-Path Support
Multi-rooted trees provide multiple paths between host pairs, and routing and forwarding protocols should discover and utilize these multiple paths for good performance and fault tolerance. ALIAS provides multi-path support for a given destination label via its forwarding component (Section 3.2). For example, in Figure 1(b), a packet sent from H4 to H2 with destination label (1.2.1.0) may traverse one of five different paths.

An interesting aspect of ALIAS is that it enables a second class of multi-path support: hosts may have multiple labels, where each label corresponds to a set of paths to a host. Thus, choosing a label corresponds to selecting a set of paths to a host. For example, in Figure 1(b), H2 has two labels. Label (1.2.1.0) encodes 5 paths from H4 to H2, and label (7.3.1.0) encodes a single H4-to-H2 path. These two classes of multi-path support help limit the effects of topology changes and failures. In practice, common data center fabric topologies will result in hosts with few labels, where each label encodes many paths. Policy for choosing a label for a given destination is a separable issue; we present some potential methods in Section 3.3.

2. PROTOCOL
ALIAS comprises two components, Level Assignment and Coordinate Assignment. These components operate continuously, acting whenever topology conditions change. For example, a change to a switch's level may trigger changes to that switch's and its neighbors' coordinates. ALIAS also involves a Communication component (for routing, forwarding, and label resolution and invalidation); in Section 3 we present one of the many possible communication components that might use the labels assigned by ALIAS.

ALIAS operates based on the periodic exchange of Topology View Messages (TVMs) between switches. In an n-level topology, individual computations rely on information from no more than n−1 hops away.

Listing 1 gives an overview of the general state stored at each switch, as well as that related to level assignment. A switch knows its unique ID and the IDs and types (hosts or switches) of each of its neighbors (lines 1-3). Switches also know their own levels as well as those of their neighbors, and the types of links (regular or peer) connecting them to each neighbor (lines 4-6); these values are set by the level assignment protocol (Section 2.1).

Listing 1: ALIAS local state

1 UID myId
2 UIDSet nbrs
3 Map(UID→NodeType) types
4 Level level
5 Map(UID→Level) levels
6 Map(UID→LinkType) link types

2.1 Level Assignment
ALIAS level assignment enables each switch to determine its own level as well as those of its neighbors, and to detect and mark peer links for special consideration by other components. ALIAS defines an Li switch to be a switch with a minimum of i hops to the nearest host. For convenience, in an n-level topology, Ln switches may be referred to as cores. Regular links connect L1 switches to hosts, and Li switches to switches at Li±1, while peer links connect switches of the same level.

Level assignment is bootstrapped by L1 switch identification as follows: in addition to sending TVMs, each switch also periodically sends IP pings to all neighbors that it does not know to be switches. Hosts reply to pings but do not send TVMs, enabling switches to detect neighboring hosts. This allows L1 identification to proceed without host modification. If hosts provided self-identification, the protocol would become much simpler; recent trends toward virtualization in the data center, with a trusted hypervisor, may take on this functionality.

When a switch receives a ping reply from a host, it immediately knows that it is at L1 and that the sending neighbor is a host, and updates its state accordingly (lines 3-4, Listing 1). If a ping reply causes the switch to change its current level, it may need to mark some of its links to neighbors as peer links (line 6). For instance, if the switch previously believed itself to be at L2, it must have done so because of a neighboring L1 switch, and its connection to that neighbor is now a peer link.

Based on L1 identification, level assignment operates via a wave of information from the lowest level of the hierarchy upwards: a switch that receives a TVM from an L1 switch labels itself as L2 if it has not already labeled itself as L1, and this process continues up the hierarchy. More generally, each switch labels itself as Li, where i−1 is the minimum level of all of its neighbors.

On receipt of a TVM, a switch determines whether the source's level is smaller than that recorded for any of its other neighbors, and if so, adjusts its own level assignment (line 4, Listing 1). It also updates its state for its neighbor's level and type if necessary (lines 3, 5). If its level or that of its neighbor has changed, it detects any changes to the link types for its neighbors and updates its state accordingly (line 6). For instance, if an L3 switch moves to L2, links to L2 neighbors become peer links.
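The end state of this wave can be checked centrally: since an Li switch is one with a minimum of i hops to the nearest host, a multi-source BFS from the hosts reproduces the levels that the distributed protocol converges to. The sketch below is our own verification aid, not the TVM protocol itself.

```python
# Our sketch of the level-assignment *outcome* (not the distributed TVM
# protocol): a switch is at L_i iff its minimum hop count to the nearest
# host is i, which a multi-source BFS from the hosts computes directly.

from collections import deque

def assign_levels(adj, hosts):
    """adj: node -> iterable of neighbors; hosts: set of host nodes.
    Returns node -> level, with hosts themselves at level 0."""
    level = {h: 0 for h in hosts}
    queue = deque(hosts)
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in level:
                level[nbr] = level[node] + 1
                queue.append(nbr)
    return level

# Tiny hypothetical slice of Figure 1's topology: H1 - S1 - S4 - S8.
adj = {"H1": ["S1"], "S1": ["H1", "S4"], "S4": ["S1", "S8"], "S8": ["S4"]}
levels = assign_levels(adj, {"H1"})   # S1 at L1, S4 at L2, S8 at L3
```

A link between two switches that end up at the same level is then exactly a peer link under the definitions above.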


The presence of unexpected or numerous peer links may indicate a miswiring, or erroneous cabling, with respect to the intended topology. If ALIAS suspects a miswiring, it raises an alert (e.g., by notifying the administrator) but continues to operate. In this way, miswirings do not bring the system to a halt, but are also not ignored.

ALIAS’s level assignment can assign levels to all switches aslong as at least one host is present. Once a switch learns itslevel, it participates in coordinate assignment.

2.2 Label Assignment
An ALIAS switch's label is the concatenation of n−1 coordinates, cn−1cn−2...c2c1, each corresponding to one switch along a path from a core switch to the labeled switch. A host's label is then the concatenation of an ingress L1 switch's label and its own H and VM coordinates. As there may be multiple paths from the core switches of the topology to a switch (host), switches (hosts) may have multiple labels.

2.2.1 Coordinate Aggregation
Since highly connected data center networks tend to have numerous paths to each host, per-path labeling can lead to overwhelming numbers of host labels. ALIAS creates compact forwarding tables by dynamically identifying sets of Li switches that are strongly connected to sets of Li−1 switches below. It then assigns to these Li hypernodes unique Li coordinates. By sharing one coordinate among the members of an LiHN, ALIAS allows hosts below this HN to share a common label prefix, thus reducing forwarding table entries.

An LiHN is defined as a maximal set of Li switches that all connect to an identical set of Li−1HNs, via any constituent members of the Li−1HNs. Each Li switch is a member of exactly one LiHN. L2HN groupings are based on L1 switches rather than HNs. In Figure 2, L2 switches S5 and S6 connect to the same set of L1 switches, namely {S1, S2, S3}, and are grouped together into an L2HN, whereas S4 connects to {S1, S2}, and therefore forms its own L2HN. Similarly, S7 and S8 connect to both L2HNs (though via different constituent members) and form one L3HN, while S9 forms a second L3HN, as it connects only to one L2HN below.
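Under this definition, HN grouping at one level reduces to partitioning switches by identical lower-neighbor sets. The centralized sketch below (a hypothetical helper, not ALIAS's distributed computation) reproduces the Figure 2 grouping for L2; at L2 the "lower neighbors" are L1 switches, and at higher levels they would be Li−1HN identifiers.

```python
# Our centralized sketch of HN grouping: switches with identical sets of
# lower-level neighbors (L1 switches at L2; lower HN ids above) share an HN.

def group_hypernodes(down_nbrs):
    """down_nbrs: switch -> set of lower-neighbor identifiers.
    Returns a list of hypernodes as frozensets of switches."""
    by_key = {}
    for sw, nbrs in down_nbrs.items():
        by_key.setdefault(frozenset(nbrs), set()).add(sw)
    return [frozenset(hn) for hn in by_key.values()]

# Figure 2: S5 and S6 both see {S1, S2, S3}; S4 sees only {S1, S2}.
hns = group_hypernodes({"S4": {"S1", "S2"},
                        "S5": {"S1", "S2", "S3"},
                        "S6": {"S1", "S2", "S3"}})
# -> one HN {S5, S6} and a singleton HN {S4}
```

Because the partition key is the full lower-neighbor set, each switch lands in exactly one HN, matching the "member of exactly one LiHN" property above.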

[Figure 2: ALIAS Hypernodes]

Since LiHNs are defined based on connectivity to identical sets of Li−1HNs, the members of an LiHN are interchangeable with respect to downward forwarding. This is the key intuition that allows HN members to share a coordinate, ultimately leading to smaller forwarding tables.

ALIAS employs an optimization with respect to HN grouping for coordinate assignment. Consider switch S1 of Figure 2, and suppose the L2HNs {S4} and {S5, S6} have coordinates x and y, respectively. Then S1 has labels of the form ...xc1 and ...yc1, where c1 is S1's coordinate. Since S1 is connected to both L2HNs, it needs to ensure that c1 is distinct from the coordinates of all other L1 switches neighboring {S4} and {S5, S6} (in this example, all other L1 switches).

It is helpful to limit the sets of switches competing for coordinates, to decrease the probability of collisions (two HNs selecting the same coordinate) and to allow for a smaller coordinate domain. We accomplish this as follows: S1 has two coordinates, one corresponding to each of its label prefixes, giving it labels of the form ...xc1 and ...yc2. In this way S1 competes only with S2 for labels corresponding to HN {S4}. In fact, ALIAS assigns to each switch a coordinate per upper neighboring HN. This reduces coordinate contention without increasing the coordinate domain size.

2.2.2 Decider/Chooser Abstraction
The goal of coordinate assignment in ALIAS is to select coordinates for each switch such that these coordinates can be combined into forwarding prefixes. By assigning per-HN rather than per-switch coordinates, ALIAS leverages a topology's inherent hierarchy and allows nearby hosts to share forwarding prefixes. In order for an Li switch to differentiate between two lower-level HNs for forwarding purposes, these two HNs must have different coordinates. Thus, the problem of coordinate assignment in ALIAS is to enable LiHNs to cooperatively select coordinates that do not conflict with those of other LiHNs that have overlapping Li+1 neighbors. Since the members of an HN are typically not directly connected to one another, this task requires indirect coordination.

To explain ALIAS's coordinate assignment protocol, we begin with a simplified Decider/Chooser Abstraction (DCA), and refine the abstraction to solve the more complicated problem of coordinate assignment. The basic DCA includes a set of choosers that select random values from a given space, and a set of deciders that ensure uniqueness among the choosers' selections. A requirement of DCA is that any two choosers that connect to the same decider select distinct values. Choosers make choices and send these requests to all connected deciders. Upon receipt of a request from a chooser, a decider determines whether it has already stored the value for another chooser. If not, it stores the value for the requester and sends an acknowledgment. If it has already stored the requested value for another chooser, the decider compiles a list of hints of already-selected values and sends this list with its rejection to the chooser. A chooser reselects its value if it receives a rejection from any decider, and considers its choice stable once it receives acknowledgments from all connected deciders.

We employ DCA within a single LiHN and its Li−1 neighbors to assign coordinates to the Li−1 switches, as in Figure 3(a). The members of L2HN {S5, S6} act as deciders for L1 choosers S1, S2, and S3, ensuring that the three choosers select unique L1 coordinates.

[Figure 3: (a) Basic Decider/Chooser; (b) Multiple Hypernodes; (c) Distributed Chooser]

Recall that as an optimization, ALIAS assigns to each switch multiple coordinates, one per neighboring higher-level HN. We extend the basic DCA to have switches keep track of the HN membership of upward neighbors, and to store coordinates (and an indication of whether a choice is stable) on a per-HN basis. This is shown in Figure 3(b), where each L1 switch stores information for all neighboring L2HNs. The figure includes two instances of DCA: that from Figure 3(a), and that in which S4 is a decider for choosers S1 and S2.

Finally, we refine DCA to support coordinate sharing within an HN. Since each member of an HN may connect to a different set of higher-level switches (deciders), it is necessary that all HN members cooperate to form a distributed chooser. HN members cooperate with the help of a deterministically selected representative L1 switch (for example, the L1 switch with the lowest MAC address of those connected to the HN). L1 switches determine whether they represent a particular HN as part of the HN grouping calculations.

The members of an LiHN and the HN's representative L1 switch collaborate to select a shared coordinate for all HN members as follows: the representative L1 switch performs all calculations and makes all decisions for the chooser, and uses the HN's Li switches as virtual channels to the deciders. LiHN members gather and combine hints from L3 deciders above, passing them down to the representative L1 switch for calculations. The basic chooser protocol introduced above is extended to support reliable communication over the virtual channels between the representative L1 switch and the HN's Li switches. Additionally, for the distributed version of DCA, deciders maintain state about the HN membership of their L2 neighbors in order to avoid falsely detecting conflicts; a decider may be connected to a single chooser via multiple virtual channels (L2 switches) and should not perceive identical requests across such channels as conflicts.

Figure 3(c) shows two distributed choosers in our example topology. Choosers {S1, S4} and {S1, S5, S6} are shaded in light and dark grey, respectively. Note that S7 is a decider for both choosers, while S8 and S9 are deciders only for the second chooser. S2 and S3 play no part in L2 coordinate selection for this topology.

Our implementation does not separate each level's coordinate assignment into its own instance of the extended DCA protocol; rather, all information pertaining to both level and coordinate assignment is contained in a single TVM. For instance, in a 5-level topology, a TVM from an L3 switch to an L2 switch might contain hints for L2 coordinates, L3HN grouping information, and L4 information on its way down to a representative L1 switch. Full details of the Decider/Chooser Abstraction, a protocol derivation for its refinements, and a proof of correctness are available in a separate technical report [22].

Label assignment converges when all L2 through Ln−1 switches have grouped themselves into hypernodes, and all L1 through Ln−1 switches have selected coordinates.

2.2.3 Example Assignments
Figure 4 depicts the TVMs sent to assign coordinates to the L2 switches in Figure 3's topology. For clarity, we show TVMs only for a subset of the switches. In TVM 1, all core switches disallow the selection of L2-coordinate 3, due to its use in another HN (not shown). L2 switches incorporate this restriction into their outgoing TVMs, including their sets of connected L1 switches (TVMs 2a and 2b). S1 is the representative L1 switch for both HNs, as it has the lowest ID. S1 selects coordinates for the HNs and informs neighboring L2 switches of their HNs and coordinates (TVMs 3a and 3b).

2.3 Relabeling
Since ALIAS labels encode paths to hosts, topology changes may affect switch coordinates and hence host labels. For instance, when the set of L1 switches reachable by a particular L2 switch changes, the L2 switch may have to select a new L2-coordinate. This process is termed relabeling.

[Figure 4: Label Assignment: L2-coordinates]

Time  TVM
1     L2 hints {3}
2a    dwn nbrs {S1, S2}; L2 hints {3}
2b    dwn nbrs {S1, S2, S3}; L2 hints {3}
3a    HN {S5, S6}; L2 coordinate: 1
3b    HN {S4}; L2 coordinate: 7

Consider the example shown in Figure 5, where the highlighted link between S5 and S3 fails. At this point, affected switches must adjust their coordinates. With TVM 1, S5 informs its L1 neighbors of its new connection status. Since S1 knows the L1 neighbors of each of its neighboring L2 switches, it knows that it remains the representative L1 switch for both HNs. S1 informs S4 and S5 of the HN membership changes in TVM 2a, and informs S6 of S5's departure in TVM 2b. Since S5 simply left one HN and joined another existing HN, host labels are not affected.

[Figure 5: Relabeling Example]

Time  TVM
1     dwn nbrs {S1, S2}; L2 hints {3}
2a    HN {S4, S5}; L2 coordinate: 7
2b    HN {S6}; L2 coordinate: 1

The effects of relabeling (whether caused by link addition or deletion) are determined solely by changes to the HN membership of the upper-level switch incident on the affected link. Table 1 shows the effects of relabeling after a change to a link between an L2 switch ℓ2 and an L1 switch ℓ1, in a 3-level topology. Case 1 corresponds to the example of Figure 5; ℓ2 moves from one HN to another. In this case, no labels are created or destroyed. In case 2, one HN splits into two, and all L1 switches neighboring ℓ2 add a new label to their sets of labels. In case 3, two HNs merge into a single HN, and with the exception of ℓ1, all L1 switches neighboring ℓ2 lose one of their labels. Finally, case 4 represents a situation in which an HN simply changes its coordinate, causing all neighboring L1 switches to replace the corresponding label.

Changes due to relabeling are completely encapsulated in the forwarding information propagated, as described in Section 3.2. Additionally, in Section 3.3 we present an optimization that limits the effects of relabeling on ongoing sessions between pairs of hosts.

Case  Old HN   New HN    Single L1 switch  Remaining L1 switches
1     Intact   Existing  None              None
2     Intact   New       +1 label          +1 label
3     Removed  Existing  None              -1 label
4     Removed  New       Swap label        Swap label

Table 1: Relabeling Cases
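The four cases of Table 1 reduce to two booleans: whether ℓ2's old HN remains intact, and whether the HN it joins already exists. The following sketch (our own illustration, not part of ALIAS; function name and return encoding are ours) captures the mapping:

```python
def relabeling_effect(old_hn_intact, new_hn_exists):
    """Map a relabeling event to its effect on L1 label sets (Table 1).

    old_hn_intact: True if l2's old hypernode still exists after the change.
    new_hn_exists: True if l2 joins an already-existing hypernode.
    Returns (case, effect on l1, effect on remaining L1 switches).
    """
    if old_hn_intact and new_hn_exists:
        return (1, "none", "none")          # l2 moves between existing HNs
    if old_hn_intact and not new_hn_exists:
        return (2, "+1 label", "+1 label")  # HN splits: a new HN appears
    if not old_hn_intact and new_hn_exists:
        return (3, "none", "-1 label")      # HNs merge: one HN disappears
    return (4, "swap label", "swap label")  # HN replaced: coordinate changes
```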

2.4 M-graphs

There are some rare situations in which ALIAS provides connectivity between switches from the point of view of the communication component, but not from that of coordinate assignment. The presence of an M-graph in a topology can lead to this problem, as can the use of peer links. We consider M-graphs below and discuss peer links in Section 3.4.

[Figure 6 (diagram): an M-graph. Cores S8 and S9; L2 switches S4 (coordinate 3), S5 (2), S6 (2), S7 (3); L1 switches S1 (1), S2 (2), S3 (1); host labels H1: 3.1, H2: 2.2, H3: 3.1. S4 and S7 share no core parent.]

Figure 6: Example M-graph

ALIAS relies on shared core switch parents to enforce the restriction that pairs of Ln−1 HNs do not select identical coordinates. There are topologies, though, in which two Ln−1 HNs do not share a core and could therefore select identical coordinates. Such an M-graph is shown in Figure 6. In the example, there are 3 L2 HNs: {S4}, {S5, S6}, and {S7}. It is possible that S4 and S7 select the same L2-coordinate, e.g., 3, as they do not share a neighboring core. Since {S5, S6} shares a parent with each of the other HNs, its coordinate differs from those of S4 and S7. L1 switches S1 and S3 are free to choose the same L1-coordinate, 1 in this example. As a result, the two hosts H1 and H3 are legally assigned identical ALIAS labels, (3.1.4.0), if both H1 and H3 are connected to their L1 switches on the same numbered port (in this case, 4) and have VM coordinate 0.
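A pair of Ln−1 HNs can collide exactly when they share no core parent, so M-graph candidates are detectable from the HN-to-parent relation. A minimal sketch under that framing (helper name and data encoding are ours), using Figure 6's topology:

```python
def m_graph_pairs(hn_parents):
    """Return pairs of hypernodes that share no core parent.

    hn_parents: dict mapping an HN (frozenset of switch IDs) to the set
    of core switches above it. HN pairs with disjoint parent sets have
    no shared decider and may legally pick identical coordinates.
    """
    hns = list(hn_parents)
    return [(a, b) for i, a in enumerate(hns) for b in hns[i + 1:]
            if not (hn_parents[a] & hn_parents[b])]

# Figure 6: cores S8 and S9 above the three L2 hypernodes.
parents = {
    frozenset({"S4"}): {"S8"},
    frozenset({"S5", "S6"}): {"S8", "S9"},
    frozenset({"S7"}): {"S9"},
}
```

Running `m_graph_pairs(parents)` flags only ({S4}, {S7}), matching the collision described above.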

H2 can now see two non-unique ALIAS labels, which introduces a routing ambiguity. If H2 attempts to forward a packet to H1, it will use the label (3.1.4.0). When S2 receives the packet, S2 can send this packet either to S5 or S6, since it believes it is connected to an L2 HN with coordinate 3 via both. The packet could be transmitted to the unintended destination H3 via S6, S9, S7, S3. When the packet reaches S3, S3 is in a position to verify whether the packet's IP address matches H3's ALIAS label, by referencing a flow table entry that holds IP address-to-ALIAS label mappings. (Note that such flow table entries are already present for the communication component, as in Section 3.3.) A packet destined to H1's IP address would not match such a flow entry and would be punted to switch software.[2]

Because we expect M-graphs to occur infrequently in well-connected data center environments, our implementation favors a simple "detect and resolve" technique. In our example, S3 receives the mis-routed packet and knows that it is part of an M-graph. At this point S3 sends a directive to S7 to choose a new L2-coordinate. This results in different ALIAS labels for H1 and H3. Once the relabeling decision propagates via routing updates, S2 correctly routes H1's packets via S5. The convergence time of this relabeling equals the convergence period for our routing protocol, or 3 TVM periods.[3]

In our simulations we encounter M-graphs only for input topologies with extremely poor connectivity, or when we artificially reduce the size of the coordinate domain to cause collisions. If M-graphs are not tolerable for a particular network, they can be prevented in two ways, each with an additional application of the DCA abstraction. With the first method, the set of deciders for a pair of HNs is augmented to include not only shared parents but also lower-level switches that can reach both HNs. For example, in Figure 6, S2 would be a decider for (and would ensure L2-coordinate uniqueness among) all three L2 HNs. The second method for preventing M-graphs relies on coordinate assignments for core switches. In this case, core switches group themselves into hypernodes and select shared coordinates, using representative L1 switches to facilitate cooperation. Lower-level switches act as deciders for these core HNs. Both of these solutions increase convergence time, as there may be up to n hops between a hypernode and its deciders in an n-level hierarchy. Because of this, our implementation favors simple detect-and-resolve over the cost of preventing M-graphs.

3. COMMUNICATION

3.1 Routing

ALIAS labels specify the 'downward' path from a core to the identified host. Each core switch is able to reach all hosts with a label that begins with the coordinate of any Ln−1 HN directly connected to it. Similarly, each switch in an Li HN can reach any host with a label that contains one of the HN's coordinates in the ith position. Thus, routing packets downward is simply a matter of an Li switch matching the destination label's (i−1)th coordinate to that of one or more of its Li−1 neighbors.

To leverage this simple downward routing, ingress switches must be able to move data packets to cores capable of reaching a destination. This reduces to sending a data packet towards a core that reaches the Ln−1 HN corresponding to the first coordinate in the destination label. Ln−1 switches learn which cores reach other Ln−1 HNs directly from neighboring cores and pass this information downward via TVMs. Switches at level Li in turn learn about the set of Ln−1 HNs reachable via each neighboring Li+1 switch.
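The downward routing step above can be sketched as a simple coordinate match. This is our own illustration (names and the tuple label encoding are assumptions), fixed to a 3-level topology where a label reads (L2-coord, L1-coord, host port, VM):

```python
def downward_next_hops(dest_label, i, neighbor_coords):
    """Candidate downward next hops at an Li switch.

    dest_label: host label as a coordinate tuple, top level first,
                e.g. (7, 5, 3, 0) = (L2-coord, L1-coord, port, VM).
    i: the forwarding switch's level (n-1 down to 2).
    neighbor_coords: {neighbor switch ID: its L(i-1)-coordinate}.
    In an n-level topology, the L(i-1)-coordinate sits at index n-i
    of the label; we fix n = 3 to match the paper's examples.
    """
    n = 3  # assumed 3-level topology
    want = dest_label[n - i]
    return [s for s, c in neighbor_coords.items() if c == want]
```

For example, an L2 switch whose L1 neighbors have coordinates {S1: 5, S2: 3} forwards a packet destined to (7, 5, 3, 0) toward S1; a core forwarding the same packet matches the L2 coordinate 7 instead.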

[2] If the L1 switches' coordinates did not overlap, detection would occur at S7.

[3] It is possible that two HNs involved in an M-graph simultaneously detect and recover from a collision, causing an extra relabeling. However, we optimize for the common case, as this potential cost is small and unlikely to occur.

3.2 Forwarding

Switch forwarding entries map a packet's input port and coordinates to the appropriate output port. The coordinate fields in a forwarding entry can hold a number, requiring an exact match, or a 'don't care' (DC), which matches all values for that coordinate. An Li switch forwards a packet with a destination label matching any of its own label prefixes downward to the appropriate Li−1 HN. If none of its prefixes match, it uses the label's Ln−1 coordinate to send the packet towards a core that reaches the packet's destination.

Figure 7 presents a subset of the forwarding table entries of switches S7, S4, and S1 of Figure 3(c), assuming the L1-coordinate assignments of Figure 3(b) and that S1 has a single host on port 3. Entries for exception cases are omitted.

Core (S7):
InPort  L2  L1  H   OutPort
DC      1   DC  DC  1
DC      7   DC  DC  0

Level L2 (S4):
InPort  L2  L1  H   OutPort
DC      7   5   DC  0
DC      7   3   DC  1
0/1     1   DC  DC  2

Level L1 (S1):
InPort  L2  L1  H   OutPort
0       7   5   3   3
1/2     1   5   3   3
3       7   DC  DC  0
3       1   DC  DC  1/2

Figure 7: Example of forwarding table entries
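Entries of this form can be modeled as an ordered rule list with 'don't care' fields. The sketch below (our own representation; the `DC` sentinel, tuple-valued port sets, and helper name are assumptions) encodes S1's table from Figure 7:

```python
DC = None  # 'don't care': matches any value in that field

def lookup(table, in_port, l2, l1, h):
    """Return the output port(s) of the first entry matching the packet."""
    key = (in_port, l2, l1, h)
    for pattern, out in table:
        if all(p is DC or p == want or (isinstance(p, tuple) and want in p)
               for p, want in zip(pattern, key)):
            return out
    return None

# S1's table from Figure 7; a tuple in a field means "either port".
s1_table = [
    ((0, 7, 5, 3), 3),        # downward from the L2-coord-7 side: host port 3
    (((1, 2), 1, 5, 3), 3),   # downward from the L2-coord-1 side: host port 3
    ((3, 7, DC, DC), 0),      # upward from the host toward L2-coordinate 7
    ((3, 1, DC, DC), (1, 2)), # upward toward L2-coordinate 1 (two choices)
]
```

A packet arriving from the host (input port 3) with top-level coordinate 7 is sent up on port 0; a downward packet labeled 7.5.3 is delivered to host port 3.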

All forwarding entries are directional, in that a packet can be headed 'downwards' to a lower-level switch or 'upwards' to a higher-level switch. Directionality is determined by the packet's input port. ALIAS restricts the direction of packet forwarding to ensure loop-free forwarding. The key restriction is that a packet coming into a switch from a higher-level switch can only be forwarded downwards, and that a packet moving laterally cannot be forwarded upwards. We refer to this property as up*/across*/down* forwarding, an extension of the up*/down* forwarding introduced in [18].

3.3 End-to-End Communication

ALIAS labels can serve as a basis for a variety of communication techniques. Here we present an implementation based on MAC address rewriting.

When two hosts wish to communicate, the first step is generally ARP resolution to map a destination host's IP address to a MAC address. In ALIAS, we instead resolve IP addresses to ALIAS labels. This ALIAS label is then written into the destination Ethernet address. All switch forwarding proceeds based on this destination label. Unlike standard Layer 2 forwarding, the destination MAC address is not rewritten hop-by-hop through the network.

Figure 8 depicts the flow of information used to establish end-to-end communication between two hosts.

[Figure 8 (diagram): end-to-end communication. Steps 1-3: host IP address discovery and IP address-to-ALIAS label mappings propagate from L1 switches up through Ln−1 to the cores. Step 4: ARP query from host. Steps 5-6: proxy ARP query up to cores. Steps 7-8: proxy ARP responses carrying ALIAS labels. Step 9: ARP reply (IP address to ALIAS label). Steps 10-11: address mapping invalidations from cores to L1 switches. Step 12: gratuitous ARP to the host.]

Figure 8: End-to-End Communication

When an L1 switch discovers a connected host, it assigns to it a set of ALIAS labels. L1 switches maintain a mapping between the IP address, MAC address, and ALIAS labels of each connected host. Additionally, they send a mapping of IP address to ALIAS labels for connected hosts upward to all reachable cores. This eliminates the need for a broadcast-based ARP mechanism. Arrows 1-3 in Figure 8 show this mapping as it moves from L1 to the cores.

To support unmodified hosts, ALIAS L1 switches intercept ARP queries (arrow 4) and reply if possible (arrow 9). If not, they send a proxy ARP query to all cores above them in the hierarchy via intermediate switches (arrows 5, 6). Cores with the requested mappings reply (arrows 7, 8). The querying L1 switch then replies to the host with an ALIAS label (arrow 9) and incorporates the new information into its local map, taking care to ensure proper handling of responses from multiple cores. As a result, the host uses this label as the address in the packet's Ethernet header. Prior to delivering a data packet, the egress switch rewrites the ALIAS destination MAC address with the actual MAC address of the destination host, using locally available state encoded in the hardware forwarding table.

Hosts can have multiple ALIAS labels corresponding to multiple sets of paths from cores. However, during ARP resolution, a host expects only one MAC address to be associated with a particular IP address. To address this, the querying host's neighboring L1 switch chooses one of the ALIAS labels of the destination host. This choice could be made in a number of ways; switches could select randomly or could base their decisions on local views of dynamically changing congestion. In our implementation, we include a measure of each label's value when passing labels from L1 switches to cores. We base a host label's value on connectivity between the host h and the core of the network, as well as on the number of other hosts that can reach h using this label. To calculate the former, we keep track of the number of paths from the core level to h that are represented by each label. For the latter, we count the number of hosts that each core reaches and weight paths according to these counts. An L1 switch uses these combined values to select a label out of the set returned by a core.

Link additions and failures can result in relabeling. While the routing protocol adapts to changes, existing flows to previously valid ALIAS labels will be affected due to ARP caching in unmodified end hosts. Here, we describe our approach to minimizing disruption to existing flows in the face of shifts in topology. We note, however, that any network environment is subject to some period of convergence following a failure. Our goal is to ensure that ALIAS convergence time at least matches the behavior of currently deployed networks.

Upon a link addition or failure, ALIAS performs appropriate relabeling of switches and hosts (Section 2.3) and propagates the new topology view to all switches as part of standard TVM exchanges. Recall that cores store a mapping of IP addresses to ALIAS labels for hosts. Cores compare received mappings to existing state to determine newly invalid mappings. Cores also maintain a cache of recently queried ARP mappings. Using this cache, core switches inform recent L1 requesters that an ARP mapping has changed (arrows 10-11), and L1 switches in turn send gratuitous ARP replies to hosts (arrow 12).
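The core-side state this paragraph describes pairs a current-mapping store with a recent-requester cache. A minimal sketch of that bookkeeping (class and method names are ours; this is an illustration of the idea, not ALIAS's actual data structures):

```python
class CoreMappings:
    """Sketch of a core switch's IP-to-ALIAS-label state.

    Tracks the current label set per IP and which L1 switches recently
    queried each IP, so that those requesters can be sent invalidations
    when relabeling changes a mapping (arrows 10-11 in Figure 8).
    """
    def __init__(self):
        self.labels = {}      # ip -> set of ALIAS labels
        self.requesters = {}  # ip -> set of recently querying L1 switches

    def query(self, ip, l1_switch):
        """Record the requester and return the current labels for ip."""
        self.requesters.setdefault(ip, set()).add(l1_switch)
        return self.labels.get(ip, set())

    def update(self, ip, new_labels):
        """Install a received mapping; return L1 switches to notify."""
        old = self.labels.get(ip, set())
        self.labels[ip] = set(new_labels)
        if old and old != set(new_labels):
            return self.requesters.get(ip, set())  # stale requesters
        return set()
```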

Additionally, ingress L1 switches can preemptively rewrite stale ALIAS labels to maintain connectivity between pairs of hosts during the window of vulnerability when a gratuitous ARP has been sent but not yet received. In the worst case, failure of certain cores may necessitate an ARP cache timeout at hosts before communication can resume.

Recall that ALIAS enables two classes of multi-path support. The first class is tied to the selection of a particular label (and thus a corresponding set of paths) from a host's label set, whereas the second represents a choice within this set of paths. For this second class of multi-path, ALIAS supports standard multi-path forwarding techniques such as ECMP [7]. Essentially, forwarding entries on the upward path can contain multiple next hops toward the potentially multiple core switches capable of reaching the appropriate top-level coordinate in the destination host label.

3.4 Peer Links

ALIAS treats peer links, links between switches at the same level of the hierarchy, as special cases for forwarding. There are two considerations to keep in mind when introducing peer links into ALIAS: maintaining loop-free forwarding guarantees and retaining ALIAS's scalability properties. We consider each in turn below.

To motivate our method for accommodating peer links, we first consider the reasons a peer link might exist in a given network. A peer link might be added (1) to create direct connectivity between two otherwise disconnected HNs or cores, (2) to create a "shortcut" between two HNs (for instance, HNs with frequent interaction), or (3) unintentionally. ALIAS supports intentional peer links with up*/across*/down* forwarding. In other words, a packet may travel upwards, then may "jump" from one HN to another directly or traverse a set of cores, before moving downwards towards its destination.

Switches advertise hosts reachable via peer links as part of outgoing TVMs. While the up* and down* components of the forwarding path are limited in length by the overall depth of the hierarchy, the across* component can be arbitrarily long. To avoid the introduction of forwarding loops, all peer link advertisements include a hop count.

[Figure 9 (diagram): switches Si and Sj joined by a long chain of peer links; host Hk can reach both Hi (below Si) and Hj (below Sj).]

Figure 9: Peer Link Tradeoff

The number of peer link traversals allowed during the across* component of forwarding represents a tradeoff between routing flexibility and ALIAS convergence. This is because links used for communication must also be considered for coordinate assignment, as discussed in Section 2.4. Consider the example of Figure 9. In the figure, dotted lines indicate long chains of links, perhaps involving switches not shown. Since host Hk can reach both other hosts, Hi and Hj, switches Si and Sj need to have unique coordinates. However, they do not share a common parent and therefore must cooperate across the long chain of peer links between them to ensure coordinate uniqueness. In fact, if a packet is allowed to cross p peer links during the across* segment of its path, switches as far as 2p peer links apart must not share coordinates. This increases convergence time for large values of p. Because of this, ALIAS allows a network designer to tune the number of peer links allowed per across* segment, limiting convergence time while still providing the necessary routing flexibility. Since core switches do not have coordinates, this restriction on the length of the across* component is not necessary at the core level; cores use a standard hop count to avoid forwarding loops.

It is important that peer links be used judiciously, given the particular style of forwarding chosen. For instance, supporting shortest-path forwarding may require disabling "shortcut"-style peer links when they represent a small percentage of the connections between two HNs. This avoids a situation in which all traffic is directed across a peer link (as it provides the shortest path) and the link is overwhelmed.

3.5 Switch Modifications

We engineer ALIAS labels to be encoded into 48 bits, for compatibility with existing destination MAC addresses in protocol headers. Our task of assigning globally unique hierarchical labels would be simplified if there were no possibility of collisions in coordinates, for instance if we allowed each coordinate to be 48 bits in length. If we adopted longer ALIAS labels, we would require modified switch hardware supporting an encapsulation header containing the forwarding address. Forwarding tables would need to support matching on pre-selected and variable numbers of bits in encapsulation headers. Many commercial switches already support such functionality for emerging Layer 2 protocols such as TRILL [20] and SEATTLE [10].
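With the testbed's 8-bit coordinates (C = 8, Section 5.2), a 3-level label of the form (L2-coordinate, L1-coordinate, host port, VM) packs comfortably into a 48-bit MAC-sized field. The sketch below illustrates one such packing; the per-field width and field layout are our assumptions, not ALIAS's specified encoding:

```python
def pack_label(l2, l1, port, vm, width=8):
    """Pack a 3-level ALIAS-style label into an integer under 48 bits.

    Each field gets `width` bits (8 here, matching the testbed's
    C = 8-bit coordinates); 4 fields x 8 bits = 32 bits, leaving
    headroom inside a 48-bit MAC-sized label.
    """
    for field in (l2, l1, port, vm):
        assert 0 <= field < (1 << width), "field exceeds its bit width"
    return (l2 << 3 * width) | (l1 << 2 * width) | (port << width) | vm

def unpack_label(label, width=8):
    """Recover the (l2, l1, port, vm) fields from a packed label."""
    mask = (1 << width) - 1
    return ((label >> 3 * width) & mask, (label >> 2 * width) & mask,
            (label >> width) & mask, label & mask)
```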

Our goal of operating with unmodified hosts does require some support from network switching elements. ALIAS L1 switches intercept all ARP packets from hosts. This does not require any hardware modifications, since packets that do not match a flow table entry can always be sent to the switch software, and ARP packets need not be processed at line rate. We further introduce IP address-to-ALIAS label mappings at cores, and IP address, actual MAC address, and ALIAS label mappings at L1 switches. We also maintain a cache of recent ARP queries at cores. All such functionality can be realized in switch software without hardware modifications.

4. IMPLEMENTATION

ALIAS switches maintain the state necessary for level and coordinate assignment as well as local forwarding tables. Switches react to two types of events: timer firings and message receipt. When a switch receives a TVM, it updates the necessary local state and forwarding table entries. The next time its TVMsend timer fires, it compiles a TVM for each switch neighbor as well as a ping for all host neighbors. Neighbors of unknown type receive both. Outgoing TVMs include all information related to level and coordinate assignment and forwarding state, and may include label mappings as they are passed upwards towards cores. The particular TVM created for any neighbor varies with the levels of the sender and the receiver as well as with the identity of the receiver. For instance, in a 3-level topology, an L2 switch sends the set of its neighboring L1 switches downward for HN grouping by the representative L1 switch. On the other hand, it need not send this information to cores.

[Figure 10 (diagram): ALIAS switch architecture. On TVM receipt: process with respect to layer, process with respect to coordinates, store address mappings, update forwarding tables. On TVMsend timer: select next neighbor, compile TVM (UID, layer), send TVM.]

Figure 10: ALIAS Architecture

Figure 10 shows the basic architecture of ALIAS. We have produced two different implementations of ALIAS, which we describe below.

4.1 Mace Implementation

We first implemented ALIAS in Mace [5, 9]. Mace is a language for distributed system development that we chose for two reasons: the Mace toolkit includes a model checker [8] that can be used to verify correctness, and Mace code compiles into standard C++ code, allowing deployment of the exact code that was model checked.

We verified the correctness of ALIAS by model checking our Mace implementation. This included all protocols discussed in this paper: level assignment, coordinate and label assignment, routing and forwarding, and proxy ARP support with invalidations on relabeling. For a range of topologies with intermittent switch, host, and network failures, we verified (via liveness properties) the convergence of level and coordinate assignment and routing state, as well as the correct operation of label resolution and invalidation. Further, we verified that all pairs of hosts that are connected by the physical topology are eventually able to communicate.

4.2 NetFPGA Testbed Implementation

Using our Mace code as a specification, we implemented ALIAS on an OpenFlow [2] testbed consisting of 20 4-port NetFPGA PCI-card switches [12] hosted in 1U dual-core 3.2 GHz Intel Xeon machines with 3GB of RAM. 16 end hosts connect to the 20 4-port switches, wired as a 3-level fat tree. All machines run Linux 2.6.18-92.1.18.el5 and switches run OpenFlow v0.8.9r2.

Although OpenFlow is based around a centralized controller model, we wished to remain completely decentralized. To accomplish this, we implemented ALIAS directly in the OpenFlow switch, relying only on OpenFlow's ability to insert new forwarding rules into a switch's tables. We also modified the OpenFlow configuration to use a separate controller per switch. These modifications to the OpenFlow software consist of approximately 1,200 lines of C code.

5. EVALUATION

We set out to answer the following questions with our experimental evaluation of ALIAS:

• How scalable is ALIAS in terms of storage requirements and control overhead?

• How effective are hypernodes in compacting forwarding tables?

• How quickly does ALIAS converge on startup and after faults? How many switches relabel after a topology change, and how quickly does the new information propagate?

Our experiments run on our NetFPGA testbed, which we augment with miswirings and peer links as necessary. For measurements on topologies larger than our testbed, we rely on simulations.

5.1 Storage Requirements

We first consider the storage requirements of ALIAS. This includes all state used to compute switches' levels, coordinates, and forwarding tables. For a given number of hosts, H, we determined the number of L1, L2, and L3 switches present in a 3-level, 128-port fat tree-based topology. We then calculated analytically the storage overhead required at each type of switch as a function of the input topology size, as shown in Figure 11. L1 switches store the most state, as they may be representative switches for higher-level HNs and therefore must store state to calculate higher-level HN coordinates.

We also empirically measured the storage requirements of ALIAS. L1, L2, and L3 switches required 122, 52, and 22 bytes of storage, respectively, for our 16-node testbed; these results would grow linearly with the number of hosts. Overall, the total required state is well within the range of what is available in commodity switches today. Note that this state need not be accessed on the data path; it can reside in DRAM accessed by the local embedded processor.

[Figure 11 (plot): storage overhead in bytes (200-1200) vs. number of hosts H (0-14,000), with curves for L1, L2, and L3 switches.]

Figure 11: Storage Overhead for 3-level 128-port tree

5.2 Control Overhead

We next consider the control message overhead of ALIAS. Table 2 shows the contents of TVMs, both for immediate neighbors and for communication with representative L1 switches. The table gives the expected size of each field (where S and C are the sizes of a switch ID and a coordinate), as well as the measured sizes for our testbed implementation (where S = 48 and C = 8 bits). Since our testbed has 3 levels, TVMs from L2 switches to their L1 neighbors are combined with those to representative L1 switches (and likewise for upward TVMs); our results reflect these combinations. Messages sent downward to L1 switches come from all members of an Li HN and contain per-parent information for each HN member; these messages are therefore the largest.

Sender and Receiver   Field                  Expected Size  Measured Size
All-to-All            level                  log(n)         2 bits
To Downward Neighbor  hints                  kC/2           L3 to L2: 6B
                      dwnwrd HNs             kS/2
To rep. L1 switch     per-parent hints       k²C/4          L2 to L1: 28B
                      per-parent dwnwrd HNs  k²S/4
To Upward Neighbor    coord                  C              L2 to L3: 5B
                      HN                     kS
                      rep. L1                S
From rep. L1 switch   per-parent coords      kC/2           L1 to L2: 7B
                      HN assignment          O(kS)

Table 2: Level and Coordinate Assignment TVM Fields

The TVM period must be at least as large as the time it takes a switch to process k incoming TVMs, one per port. On our NetFPGA testbed, the worst-case processing time for a set of TVMs was 57µs, plus an additional 291µs for updating forwarding table entries in OpenFlow in a small configuration. Given this, 100ms is a reasonable setting for the TVM cycle at scale. L1 switches send k/2 TVMs per cycle while all other switches send k TVMs. The largest TVM is dominated by the k²S/4 term, giving a control overhead of k²S/400 bits per ms per TVM, or k³S/400 bits per ms in total. For a network with 64-port switches, this is 31.5Mbps, or 0.3% of a 10Gbps link, an acceptable cost for a routing/location protocol that scales to hundreds of thousands of ports with 4 levels and k = 64. This brings out a tradeoff between convergence time and control overhead; a smaller TVM cycle time is certainly possible, but would correspond to a larger amount of control data sent per second. It is also important to note that this control overhead is a function only of k and the TVM cycle time; it does not increase with link speed.
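The overhead figure above can be checked numerically. The helper below (name and unit conversions are ours) evaluates the paper's bound of k³S/400 bits per ms, generalized to an arbitrary TVM cycle as k³S/(4 · cycle):

```python
def control_overhead_mbps(k, s_bits, tvm_cycle_ms=100):
    """Worst-case per-link control overhead in Mbps.

    Evaluates the paper's bound: k^3 * S / (4 * cycle) bits per ms,
    which is k^3 * S / 400 b/ms at the 100 ms cycle used in the text.
    k: switch port count; s_bits: switch ID size S in bits.
    """
    bits_per_ms = k ** 3 * s_bits / (4 * tvm_cycle_ms)
    return bits_per_ms * 1000 / 1e6  # bits/ms -> bits/s -> Mbps
```

With k = 64 and S = 48 bits this yields roughly 31.5 Mbps, about 0.3% of a 10 Gbps link, matching the figures in the text; halving the cycle time doubles the overhead, illustrating the convergence/overhead tradeoff.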

5.3 Compact Forwarding Tables

Next, we assess the effectiveness of hypernodes in compacting forwarding tables. We use our simulator to generate fully provisioned fat tree topologies made up of k-port switches. We then remove a percentage of the links at each level of the topology to model less than fully connected networks. We use the smallest possible coordinate domain that can accommodate the worst-case number of pods for each topology, and allow data packets to cross as many peer links as needed, within the constraints of up*/across*/down* forwarding.

Once the input topology has been generated, we use the simulator to calculate all switches' levels and HNs, and we select random coordinates for switches based on common upper-level neighbors. Finally, we populate forwarding tables based on the labels corresponding to the selected coordinates and analyze the forwarding table sizes of the switches.

Table 3 gives the parameters used to create each input topology, along with the total number of servers supported and the average number of forwarding table entries per switch. The table provides values for optimized forwarding tables (in which redundant entries are removed and entries for peer links appear only when providing otherwise unavailable connectivity) and unoptimized tables (which include redundant entries for use with techniques such as ECMP). As the table shows, even in graphs supporting millions of servers, the number of forwarding entries is dramatically reduced from the entry-per-host requirement of Layer 2 techniques.

As the provisioning of the tree decreases, the number of forwarding entries initially increases. This corresponds to cases in which the tree has become somewhat fragmented from its initial fat tree specification, leading to more HNs and thus more coordinates across the graph. However, as even more links are deleted, forwarding table sizes begin to decrease; for extremely fragmented trees, mutual connectivity between pairs of switches drops, and a switch need not store forwarding entries for unreachable destinations.

5.4 Convergence Time

We measured ALIAS's convergence time on our testbed both for an initial startup period and across transient failures. We consider a switch to have converged when it has stabilized all applicable coordinates and HN membership information.

As shown in Figure 12(a), ALIAS takes a maximum of 10 TVM cycles to converge when all switches and hosts are initially booted, even though they are not booted simultaneously. L3 switches converge most quickly since they simply facilitate L2-coordinate uniqueness. L1 switches converge more slowly; the last L1 switch to converge might see the following chain of events: (1) L2 switch ℓ2a sends its coordinate to L3 switch ℓ3, (2) ℓ3 passes a hint about this coordinate to L2 switch ℓ2b, which (3) forwards the hint to its representative L1 switch, which replies (4) with an assignment for ℓ2b's coordinate.

        Topology Info                    Forwarding Entries
levels  ports  % fully      total       optimized  without
               provisioned  servers                opts
3       16     100          1,024       22         112
               80                       62         96
               50                       48         58
               20                       28         31
        32     100          8,192       45         429
               80                       262        386
               50                       173        217
               20                       86         95
        64     100          65,536      90         1677
               80                       1028       1530
               50                       653        842
               20                       291        320
4       16     100          8,192       23         119
               80                       197        246
               50                       273        307
               20                       280        304
        32     100          131,072     46         457
               80                       1278       1499
               50                       2079       2248
               20                       2415       2552
5       16     100          65,536      23         123
               80                       492        550
               50                       886        931
               20                       1108       1147

Table 3: Forwarding Entries Per Switch

These 4 TVM cycles combine with 5 cycles to propagate level information up and down the 3-level hierarchy, for a total of 9 cycles. The small variation in our results is due to our asynchronous deployment setting. In our implementation, a TVM cycle is 400µs, leading to an initial convergence time of 4ms for our small topology. Our cycle time accounts for 57µs of TVM processing and 291µs of flow table updates in OpenFlow. In general, the TVM period may be set to anything larger than the time required for a switch to process one incoming TVM per port. In practice we would expect significantly longer cycle times in order to minimize control overhead.

We also considered the behavior of ALIAS in response to failures. As discussed in Section 2.3, relabeling is triggered by additions or deletions of links, and its effects depend on the HN membership of the upper-level switch on the affected link. Figure 12(b) shows an example of each of the cases from Table 1, along with measured convergence times on our testbed. The examples in the figure are for link addition; we verified the parallel cases for link deletion by reversing the experiments. We measured the time for all HN membership and coordinate information to stabilize at each affected switch. Our results confirm the locality of relabeling effects; only immediate neighbors of the affected L2 switch react, and few require more than the 2 TVM cycles used to recompute HN membership.

6. RELATED WORK

ALIAS provides automatic, decentralized, scalable assignment of hierarchical host labels. To the best of our knowledge, it is the first system to address all three of these goals simultaneously.

[Figure 12: Convergence Analysis on startup and after failures. (a) CDF of Initial Convergence Times: percentage of switches converged vs. TVM cycles, broken out by L1, L2, L3, and all switches. (b) Relabeling Convergence Times in TVM cycles for the four cases of Table 1, each showing hypernode membership before and after a link change; dashed lines denote new links. Plotted data lost during extraction.]

Our work can trace its lineage back to the original work on spanning trees [15] designed to bridge multiple physical Layer 2 networks. While clearly ground-breaking, spanning trees suffer from scalability challenges and do not support hierarchical labeling. SmartBridge [17] provides shortest path routing among Layer 2 hosts but is still broadcast based and does not support hierarchical host labels. More recently, Rbridges [16] and TRILL [20] suggest running a full-blown routing protocol among Layer 2 switches along with an additional Layer 2 header to protect against forwarding loops.

SEATTLE [10] improves upon aspects of Rbridges' scalability by distributing knowledge of the host-to-egress-switch mapping across a directory service implemented as a one-hop DHT. In general, however, all of these earlier protocols target arbitrary topologies with broadcast-based routing and flat host labels. ALIAS benefits from the underlying assumption that we target hierarchical topologies.

VL2 [6] proposed scaling Layer 2 to mega data centers using end-host modification and addressed load balancing to improve agility in data centers. However, VL2 uses an underlying IP network fabric, which requires subnet and DHCP server configuration, and does not address the requirement for automation.

Most related to ALIAS are PortLand [13] and DAC [4]. PortLand employs a Location Discovery Protocol for host numbering but differs from ALIAS in that it relies on a central fabric manager, assumes a 3-level fat tree topology, and does not support arbitrary miswirings and failures. In addition, LDP makes decisions (e.g., edge switch labeling and pod groupings) based on particular interconnection patterns in fat trees. This limits the approach under heterogeneous conditions (e.g., a network fabric that is not yet fully deployed) and during transitory periods (e.g., when the system first boots). In contrast, ALIAS makes decisions based solely on current network conditions. DAC supports arbitrary topologies but is fully centralized. Additionally, DAC requires that an administrator manually input configuration information both initially and prior to any planned changes.

Landmark [21], designed for ad hoc wireless networks, also automatically configures a hierarchy onto a physical topology and relabels in response to topology changes. However, Landmark's hierarchy levels are defined such that even small topology changes (e.g., a router losing a single neighbor) trigger relabeling. Also, routers maintain forwarding state for distant nodes, while ALIAS aggregates such state with hypernodes.

7. CONCLUSION
Current naming and communication protocols for data center networks rely on manual configuration or centralization to provide scalable communication between end hosts. Such manual configuration is costly, time-consuming, and error prone. Centralized approaches introduce the need for an out-of-band control network.

We take advantage of particular characteristics of data center topologies to design and implement ALIAS. We show how to automatically overlay appropriate hierarchy on top of a data center network interconnect such that end hosts can automatically be assigned hierarchical, topologically meaningful labels using only pairwise communication and with no central components. Our evaluation indicates that ALIAS holds promise for simplifying data center management while simultaneously improving overall scalability.


9. REFERENCES
[1] Cisco data center infrastructure 2.5 design guide. http://tinyurl.com/23486bs.

[2] OpenFlow. www.openflowswitch.org.

[3] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. In Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication, SIGCOMM '08, pages 63–74, New York, NY, USA, 2008. ACM.

[4] K. Chen, C. Guo, H. Wu, J. Yuan, Z. Feng, Y. Chen, S. Lu, and W. Wu. Generic and automatic address configuration for data center networks. In Proceedings of the ACM SIGCOMM 2010 Conference, SIGCOMM '10, pages 39–50, New York, NY, USA, 2010. ACM.

[5] D. Dao, J. Albrecht, C. Killian, and A. Vahdat. Live debugging of distributed systems. In Proceedings of the 18th International Conference on Compiler Construction: Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2009, CC '09, pages 94–108, Berlin, Heidelberg, 2009. Springer-Verlag.

[6] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. VL2: a scalable and flexible data center network. In Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication, SIGCOMM '09, pages 51–62, New York, NY, USA, 2009. ACM.

[7] C. Hopps. Analysis of an equal-cost multi-path algorithm, 2000.

[8] C. Killian, J. W. Anderson, R. Jhala, and A. Vahdat. Life, death, and the critical transition: finding liveness bugs in systems code. In Proceedings of the 4th USENIX Conference on Networked Systems Design & Implementation, NSDI '07, pages 18–18, Berkeley, CA, USA, 2007. USENIX Association.

[9] C. E. Killian, J. W. Anderson, R. Braud, R. Jhala, and A. M. Vahdat. Mace: language support for building distributed systems. In Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '07, pages 179–188, New York, NY, USA, 2007. ACM.

[10] C. Kim, M. Caesar, and J. Rexford. Floodless in Seattle: a scalable ethernet architecture for large enterprises. In Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication, SIGCOMM '08, pages 3–14, New York, NY, USA, 2008. ACM.

[11] C. E. Leiserson. Fat-trees: universal networks for hardware-efficient supercomputing. IEEE Transactions on Computers, 34:892–901, October 1985.

[12] J. W. Lockwood, N. McKeown, G. Watson, G. Gibb, P. Hartke, J. Naous, R. Raghuraman, and J. Luo. NetFPGA: an open platform for gigabit-rate network switching and routing. In Proceedings of the 2007 IEEE International Conference on Microelectronic Systems Education, MSE '07, pages 160–161, Washington, DC, USA, 2007. IEEE Computer Society.

[13] R. Niranjan Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat. PortLand: a scalable fault-tolerant layer 2 data center network fabric. In Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication, SIGCOMM '09, pages 39–50, New York, NY, USA, 2009. ACM.

[14] J.-H. Park, H. Yoon, and H.-K. Lee. The deflection self-routing banyan network: a large-scale ATM switch using the fully adaptive self-routing and its performance analyses. IEEE/ACM Transactions on Networking (TON), 7:588–604, August 1999.

[15] R. Perlman. An algorithm for distributed computation of a spanning tree in an extended LAN. In Proceedings of the Ninth Symposium on Data Communications, SIGCOMM '85, pages 44–53, New York, NY, USA, 1985. ACM.

[16] R. Perlman. Rbridges: transparent routing. In INFOCOM 2004, Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies, volume 2, pages 1211–1218, March 2004.

[17] T. L. Rodeheffer, C. A. Thekkath, and D. C. Anderson. SmartBridge: a scalable bridge architecture. In Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, SIGCOMM '00, pages 205–216, New York, NY, USA, 2000. ACM.

[18] M. Schroeder, A. Birrell, M. Burrows, H. Murray, R. Needham, T. Rodeheffer, E. Satterthwaite, and C. Thacker. Autonet: a high-speed, self-configuring local area network using point-to-point links. IEEE Journal on Selected Areas in Communications, 9(8):1318–1335, October 1991.

[19] H. J. Siegel and C. B. Stunkel. Inside parallel computers: Trends in interconnection networks. IEEE Computer Science & Engineering, 3:69–71, September 1996.

[20] J. Touch and R. Perlman. Transparent interconnection of lots of links (TRILL): Problem and applicability statement, RFC 5556, May 2009.

[21] P. F. Tsuchiya. The landmark hierarchy: a new hierarchy for routing in very large networks. In Symposium Proceedings on Communications Architectures and Protocols, SIGCOMM '88, pages 35–42, New York, NY, USA, 1988. ACM.

[22] M. Walraed-Sullivan, R. Niranjan Mysore, K. Marzullo, and A. Vahdat. Brief Announcement: A Randomized Algorithm for Label Assignment in Dynamic Networks. In Proceedings of the 25th International Symposium on DIStributed Computing, DISC '11, 2011.