
Scalable Flow-Based Networking with DIFANE

Minlan Yu∗ Jennifer Rexford∗ Michael J. Freedman∗ Jia Wang†

∗ Princeton University, Princeton, NJ, USA † AT&T Labs - Research, Florham Park, NJ, USA

ABSTRACT

Ideally, enterprise administrators could specify fine-grain policies that drive how the underlying switches forward, drop, and measure traffic. However, existing techniques for flow-based networking rely too heavily on centralized controller software that installs rules reactively, based on the first packet of each flow. In this paper, we propose DIFANE, a scalable and efficient solution that keeps all traffic in the data plane by selectively directing packets through intermediate switches that store the necessary rules. DIFANE relegates the controller to the simpler task of partitioning these rules over the switches. DIFANE can be readily implemented with commodity switch hardware, since all data-plane functions can be expressed in terms of wildcard rules that perform simple actions on matching packets. Experiments with our prototype on Click-based OpenFlow switches show that DIFANE scales to larger networks with richer policies.

1. INTRODUCTION

The emergence of flow-based switches [1, 2] has enabled enterprise networks that support flexible policies. These switches perform simple actions, such as dropping or forwarding packets, based on rules that match on bits in the packet header. Installing all of the rules in advance is not attractive, because the rules change over time (due to policy changes and host mobility) and the switches have relatively limited high-speed memory (such as TCAMs). Instead, current solutions rely on directing the first packet of each “microflow” to a centralized controller that reactively installs the appropriate rules in the switches [3, 4]. In this paper, we argue that the switches themselves should collectively perform this function, both to avoid a bottleneck at the controller and to keep all traffic in the data plane for better performance and scalability.

1.1 DIFANE: Doing It Fast ANd Easy

Our key challenge, then, is to determine the appropriate “division of labor” between the controller and the underlying switches, to support high-level policies in a scalable way. Previous work has demonstrated that a logically-centralized controller can track changes in user

Figure 1: DIFANE flow management architecture. (Dashed lines are control messages. Straight lines are data traffic.)

locations/addresses and compute rules the switches can apply to enforce a high-level policy [4, 5, 6]. For example, an access-control policy may deny the engineering group access to the human-resources database, leading to low-level rules based on the MAC or IP addresses of the current members of the engineering team, the IP addresses of the HR servers, and the TCP port number of the database service. Similar policies could direct packets on customized paths, or collect detailed traffic statistics. The controller can generate the appropriate switch rules simply by substituting high-level names with network addresses. The policies are represented with 30K–8M rules in the four different networks we studied. This separation of concerns between rules (in the switches) and policies (in the controller) is the basis of several promising new approaches to network management [3, 7, 8, 9, 10].

While we agree the controller should generate the rules, we do not think the controller should (or needs to) be involved in the real-time handling of data packets. Our DIFANE (DIstributed Flow Architecture for Networked Enterprises) architecture, illustrated in Figure 1, has the following two main ideas:

• The controller distributes the rules across (a subset of) the switches, called “authority switches,” to scale to large topologies with many rules. The controller runs a partitioning algorithm that divides the rules evenly and minimizes fragmentation of the rules across multiple authority switches.

• The switches handle all packets in the data plane (i.e., TCAM), diverting packets through authority switches as needed to access the appropriate rules. The “rules” for diverting packets are themselves naturally expressed as TCAM entries.

All data-plane functionality in DIFANE is expressible in terms of wildcard rules with simple actions, exactly the capabilities of commodity flow switches. As such, a DIFANE implementation requires only modifications to the control-plane software of the authority switches, and no data-plane changes in any of the switches. Experiments with our prototype, built on top of the Click-based OpenFlow switch [11], illustrate that distributed rule management in the data plane provides lower delay, higher throughput, and better scalability than directing packets through a separate controller.

Section 2 presents our main design decisions, followed by our DIFANE architecture in Section 3. Next, Section 4 describes how we handle network dynamics, and Section 5 presents our algorithms for caching and partitioning wildcard rules. Section 6 presents our switch implementation, followed by the performance evaluation in Section 7. Section 8 describes different deployment scenarios of DIFANE. Section 9 discusses the support for flow management tasks. The paper concludes in Section 10.

1.2 Comparison to Related Work

Recent work shows how to support policy-based management using flow switches [1, 2] and centralized controllers [4, 5, 6, 3]. The most closely related work is the Ethane controller that reactively installs flow-level rules based on the first packet of each TCP/UDP flow [3]. The Ethane controller can be duplicated [3] or distributed [12] to improve its performance. In contrast, DIFANE distributes wildcard rules amongst the switches, and handles all data packets in the data plane. Other recent work capitalizes on OpenFlow to rethink network management in enterprises and data centers [7, 8, 9, 10]; these systems could easily run as applications on top of DIFANE.

These research efforts, and ours, depart from traditional enterprise designs that use IP routers to interconnect smaller layer-two subnets, and rely heavily on inflexible mechanisms like VLANs. Today, network operators must configure Virtual LANs (VLANs) to scope broadcast traffic and direct traffic on longer paths through routers that perform access control on IP and TCP/UDP header fields. In addition, an individual MAC address or wall jack is typically associated with just one VLAN, making it difficult to support more fine-grained policies that treat different traffic from the same user or office differently.

Other research designs more scalable networks by selectively directing traffic through intermediate nodes to reduce routing-table size [13, 14, 15]. However, hash-based redirection techniques [13, 14], while useful for flat keys like IP or MAC addresses, are not appropriate for look-ups on rules with wildcards in arbitrary bit positions. ViAggre [15] subdivides the IP prefix space and forces some traffic to always traverse an intermediate node, but does not consider on-demand caching or multi-dimensional, overlapping rules.

2. DIFANE DESIGN DECISIONS

On the surface, the simplest approach to flow-based management is to install all of the low-level rules in the switches in advance. However, preinstalling the rules does not scale well in networks with mobile hosts, since the same rules would need to be installed in multiple locations (e.g., any place a user might plug in his laptop). In addition, the controller would need to update many switches whenever rules change. Even in the absence of mobile devices, a network with many rules might not have enough table space in the switches to store all the rules, particularly as the network grows or its policies become more complex. Instead, the system should install rules on demand [3].

To build a flow-processing system that has high performance and scales to large networks, DIFANE makes four high-level design decisions that reduce the overhead of handling cache misses and allow the system to scale to a large number of hosts, rules, and switches.

2.1 Reducing Overhead of Cache Misses

Reactively caching rules in the switches could easily cause problems such as packet delay, larger buffers, and switch complexity when cache misses happen. More importantly, misbehaving hosts could easily trigger excessive cache misses simply by scanning a wide range of addresses or port numbers, overloading the TCAM and introducing extra packet-processing overhead. DIFANE handles “miss” packets efficiently by keeping them in the data plane and reduces the number of “miss” packets by caching wildcard rules.

Process all packets in the data plane: Some flow management architectures direct the first packet (or first packet header) of each microflow to the controller and have the switch buffer the packet awaiting further instructions [3].¹ In a network with many short flows, a controller that handles “miss” packets can easily become a bottleneck. In addition, UDP flows introduce extra overhead, since multiple (potentially large) packets in the same flow may be in flight (and need to visit the controller) at the same time. The switches need a more complex and expensive buffering mechanism, because they must temporarily store the “miss” packets while continuing to serve other traffic, and then retrieve them upon receiving the rule. Instead, DIFANE makes it cheap and easy for switches to forward all data packets in the data plane (i.e., hardware), by directing “miss” packets through an intermediate switch. Transferring packets in the data plane through a slightly longer path is much faster than handling packets in the control plane.

¹Another solution for handling a cache miss is for the switch to encapsulate and forward the entire packet to the controller. This is also problematic because it significantly increases controller load.

Efficient rule caching with wildcards: Caching a separate low-level rule for each TCP or UDP microflow [3], while conceptually simple, has several disadvantages compared to wildcard rules. For example, a wildcard rule that matches on all destinations in the 123.132.8.0/22 subnet would require up to 1024 microflow rules. In addition to consuming more data-plane memory on the switches, fine-grained rules require special handling for more packets (i.e., the first packet of each microflow), leading to longer delays, higher overhead, and more vulnerability to misbehaving hosts. Instead, DIFANE supports wildcard rules, to have fewer rules (and fewer cache “misses”) and capitalize on TCAMs in the switches. Caching wildcard rules introduces several interesting technical challenges that we address in our design and implementation.
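To make the rule-count gap concrete, the following sketch (ours, not from the paper's prototype) counts the per-destination microflow rules that a single /22 wildcard rule replaces:

```python
# Illustration only: one wildcard rule covering 123.132.8.0/22 versus one
# microflow rule per destination address in that subnet.
import ipaddress

subnet = ipaddress.ip_network("123.132.8.0/22")
microflow_rules = subnet.num_addresses  # a separate rule per destination
wildcard_rules = 1                      # a single TCAM entry with a /22 mask

print(microflow_rules, wildcard_rules)  # prints: 1024 1
```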

2.2 Scaling to Large Networks and Many Rules

To scale to large networks with richer policies, DIFANE divides the rules across the switches and handles them in a distributed fashion. We also keep consistent topology information among switches by leveraging link-state protocols.

Partition and distribute the flow rules: Replicating the controller seems like a natural way to scale the system and avoid a single point of failure. However, this requires each controller to maintain all the rules, and coordinate with the other replicas to maintain consistency when rules change. (Rules may change relatively often, not only because the policy changes, but also because host mobility triggers changes in the mapping of policies to rules.) Instead, we partition the space of rules to reduce the number of rules each component must handle and enable simpler techniques for maintaining consistency. As such, DIFANE has one primary controller (perhaps with backups) that manages policies, computes the corresponding rules, and divides these rules across the switches; each switch handles a portion of the rule space and receives updates only when those rules change. That is, while the switches reactively cache rules in response to the data traffic, the DIFANE controller proactively partitions the rules across different switches.

Consistent topology information distribution with the link-state protocol: Flow-based management relies on the switches having a way to communicate with the controller and adapt to topology changes. Relying on rules for this communication introduces circularity, where the controller cannot communicate with the switches until the appropriate rules have been installed. Rather than bootstrapping communication by having the switches construct a spanning tree [3], we advocate running a link-state protocol amongst the switches. Link-state routing enables the switches to compute paths and learn about topology changes and host location changes without involving the controller, reducing overhead and also removing the controller from the critical path of failure recovery. In addition, link-state routing scales to large networks, enables switches to direct packets through intermediate nodes, and reacts quickly to switch failure [13]. As such, DIFANE runs a link-state routing protocol amongst the switches, while also supporting flow rules that allow customized forwarding of traffic between end hosts. The controller also participates in link-state routing to reach the switches and learn of network topology changes.²

3. DIFANE ARCHITECTURE

The DIFANE architecture consists of a controller that generates the rules and allocates them to the authority switches, as shown in Figure 1. Authority switches can be a subset of existing switches in the network (including ingress/egress switches), or dedicated switches that have larger memory and processing capability.

Upon receiving traffic that does not match the cached rules, the ingress switch encapsulates and redirects the packet to the appropriate authority switch based on the partition information. The authority switch handles the packet in the data plane and sends feedback to the ingress switch to cache the relevant rule(s) locally. Subsequent packets matching the cached rules can be encapsulated and forwarded directly to the egress switch.

In this section, we first discuss how the controller partitions the rules and distributes the authority and partition rules to the switches. Next, we describe how a switch directs packets through the authority switch and caches the necessary rules, using link-state routing to compute the path to the authority switch. Finally, we show that the data-plane functionality of DIFANE can be easily implemented on today’s flow-based switches using wildcard rules.

3.1 Rule Partition and Allocation

As shown in Figure 2, we use the controller to precompute the low-level rules, generate partition rules that

²The links between the controller and switches are set with high link-weights so that traffic between switches does not go through the controller.


[Figure 3(a): Low-level rules R1–R7 and the partition of the two-dimensional flow space (F1, F2) among authority switches A–D.]

(b) Wildcard rules in the TCAM of Switch A:

Type       Priority  F1    F2    Action                       Timeout  Note
Cache      210       00**  111*  Encap, forward to B          10 sec   R3
rules      209       1110  11**  Drop                         10 sec   R7
           ...       ...   ...   ...                          ...      ...
Authority  110       00**  001*  Encap, forward to D,         ∞        R1
rules                            trigger ctrl. plane func.
           109       0001  0***  Drop,                        ∞        R2
                                 trigger ctrl. plane func.
Partition  15        0***  000*  Encap, redirect to B         ∞        Primary
rules      14        0***  1***  Encap, redirect to C         ∞
           13        11**  ****  Encap, redirect to D         ∞
           5         0***  000*  Encap, redirect to B′        ∞        Backup
           ...       ...   ...   ...                          ...      ...

Figure 3: Wildcard rules in DIFANE (A-D are authority switches).

[Figure 2: Rule operations in the controller. The controller maps policies to rules, partitions the rules, and distributes the authority rules for each authority switch along with the partition rules.]

describe which low-level rules are stored in which authority switches, and then distribute the partition rules to all the switches. The partition rules are represented by coarse-grained wildcard rules on the switches.

Precompute low-level rules: The controller precomputes the low-level rules based on the high-level policies by simply substituting high-level names with network addresses. Since low-level rules are precomputed and installed in the TCAM of switches, we can always process packets in the fast path. Most kinds of policies can be translated to low-level rules in advance, because the controller knows the addresses of the hosts when the hosts first connect to the ingress switch, and thus can substitute the high-level names with addresses. However, precomputation is not an effective solution for policies (like traffic engineering) that depend on dynamically changing network state.
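As a toy illustration of this substitution (the group names and addresses are invented by us), the access-control policy from Section 1.1 might compile as:

```python
# Hypothetical name-to-address bindings; in DIFANE these come from the
# controller's knowledge of hosts when they connect to an ingress switch.
groups = {
    "engineering": ["10.0.1.5", "10.0.1.9"],  # current team members
    "hr-db":       ["10.0.9.2"],              # HR database server
}

def compile_policy(src_group, dst_group, action):
    """Substitute high-level names with addresses to get low-level rules."""
    return [(s, d, action)
            for s in groups[src_group]
            for d in groups[dst_group]]

rules = compile_policy("engineering", "hr-db", "drop")
print(rules)  # one low-level drop rule per (member, server) pair
```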

Use partitioning to subdivide the space of all rules: Hashing is an appealing way to subdivide the rules and direct packets to the appropriate authority switch. While useful for flat keys like an IP or MAC address [13, 14], hashing is not effective when the keys can have wildcards in arbitrary bit positions. In particular, packets matching the same wildcard rule would have different hash values, leading them to different authority switches; as a result, multiple authority switches would need to store the same wildcard rule. Instead of relying on hashing, DIFANE partitions the rule space, and assigns each portion of rule space to one or more authority switches. Each authority switch stores the rules falling in its part of the partition.

Run the partitioning algorithm on the controller: Running the partitioning algorithm on the switches themselves would introduce a large overhead, because they would need to learn the rules from the controller, run the partitioning algorithm, and distribute the results. In contrast, the controller is a more natural place to run the partitioning algorithm. The controller is already responsible for translating policies into rules and can easily run the partitioning algorithm periodically, as the distribution of low-level rules changes. We expect the controller would recompute the partition relatively infrequently, as most rule changes would not require rebalancing the division of rule space. Section 4 discusses how DIFANE handles changes to rules with minimal interruption to the data traffic.

Represent the partition as a small collection of partition rules: Low-level rules are defined as actions on a flow space. The flow space usually has seven dimensions (source/destination IP addresses, MAC addresses, ports, and the protocol) or more. Figure 3(a) shows a two-dimensional flow space (F1, F2) and the rules on it. Each field ranges from 0 to 15 (i.e., F1 = F2 = [0..15]); for example, F1 and F2 can be viewed as the source and destination fields of a packet, respectively. Rule R2 denotes that all packets from source 1 (F1 = 1) destined to a destination in [0..7] (F2 = [0..7]) should be dropped.

The controller partitions the flow space into M ranges, and assigns each range to an authority switch. The resulting partition can be expressed concisely as a small number of coarse-grain partition rules, where M is proportional to the number of authority switches rather than the number of low-level rules. For example, in Figure 3(a), the flow space is partitioned into four parts by


the straight lines, which are represented by the partition rules in Figure 3(b). Section 5 discusses how the controller computes a partition of overlapping wildcard rules that reduces TCAM usage.
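The mapping from a packet to its authority switch can be sketched as follows; this is our own toy code, with patterns and switch names mirroring the partition rules of Figure 3(b) and the remaining ranges (shown as “...” in the figure) omitted:

```python
# Coarse wildcard partition rules on 4-bit fields F1, F2 that map a packet
# to the authority switch responsible for its part of the flow space.
def matches(pattern, value):
    """Check a 4-bit value against a wildcard pattern like '0***'."""
    bits = format(value, "04b")
    return all(p in ("*", b) for p, b in zip(pattern, bits))

partition_rules = [  # (priority, F1 pattern, F2 pattern, authority switch)
    (15, "0***", "000*", "B"),
    (14, "0***", "1***", "C"),
    (13, "11**", "****", "D"),
]

def authority_for(f1, f2):
    # Highest-priority matching partition rule wins, as in a TCAM.
    for _, p1, p2, switch in sorted(partition_rules, reverse=True):
        if matches(p1, f1) and matches(p2, f2):
            return switch
    return None  # ranges elided in the figure are not modeled here

print(authority_for(2, 9))  # F1=0010, F2=1001 matches rule 14, prints: C
```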

Duplicate authority rules to reduce stretch and failure-recovery time: The first packet covered by an authority rule traverses a longer path through an authority switch. To reduce the extra distance the traffic must travel (i.e., “stretch”), the controller can assign each of the M ranges to multiple authority switches. For example, if each range is handled by two authority switches, the controller can generate two partition rules for each range, and assign each switch the rule that would minimize stretch. That way, on a cache miss,³ a switch directs packets to the closest authority switch responsible for that range of rules. The placement of multiple authority switches is discussed in Section 5.
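The per-switch choice among replicas can be sketched as below (a hypothetical helper with made-up hop counts, not the paper's algorithm): with each range replicated at two authority switches, an ingress switch installs the partition rule pointing at whichever replica is fewer hops away.

```python
# Pick the replica that minimizes stretch for "miss" packets from this switch.
def pick_authority(replicas, hops):
    return min(replicas, key=lambda s: hops[s])

hops_from_ingress = {"B": 2, "B'": 5}  # made-up hop counts from link-state routing
print(pick_authority(["B", "B'"], hops_from_ingress))  # prints: B
```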

Assigning multiple authority switches to the same range can also reduce failure-recovery time. By pushing backup partition rules to every switch, a switch can quickly fail over to the backup authority switch when the primary one fails (see Section 4). This requires each switch to store more partition rules (e.g., 2M instead of M), in exchange for faster failure recovery. For example, in Figure 3(b), switch A has a primary partition rule that directs packets to B and a backup one that directs packets to B′.

3.2 Packet Redirection and Rule Caching

The authority switch stores the authority rules. The ingress switch encapsulates the first packet covered by an authority rule and redirects it to the authority switch.⁴ The authority switch processes the packet and also caches rules in the ingress switch so that the following packets can be processed at the ingress switch.

Packet redirection: In the ingress switch, the first packet of a wildcard flow matches a partition rule. The partition rule indicates which authority switch maintains the authority rules that are related to the packet. For example, in Figure 3 a packet with (F1 = 9, F2 = 7) hits the primary partition rule for authority switch B and should be redirected to B. The ingress switch encapsulates the packet and forwards it to the authority switch. The authority switch decapsulates the packet, processes it, re-encapsulates it, and forwards it to the egress switch.

Rule caching: To avoid redirecting all the data traffic to the authority switch, the authority switch caches the rules in the ingress switch.⁵ Packets that match the cache rules are encapsulated and forwarded directly to the egress switch (e.g., packets matching R3 in Figure 3(b) are encapsulated and forwarded to D). In DIFANE, “miss” packets do not wait for rule caching, because they are sent through the authority switch rather than buffered at the ingress switch. Therefore, we can run a simple caching function in the control plane of the authority switch to generate and install cache rules in the ingress switch. The caching function is triggered whenever a packet matches the authority rules in the authority switch. The cache rule has an idle timeout so that it can be removed by the switch automatically due to inactivity.

³In DIFANE, every packet matches some rule in the switch. A “cache miss” in DIFANE means a packet does not match any cache rules, but matches a partition rule instead.

⁴With encapsulation, the authority switch knows the address of the ingress switch from the packet header and sends the cache rules to the ingress switch.
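The idle-timeout behavior of cache rules can be sketched as follows; this is a simplification in our own terms (real switches track per-entry timeouts in hardware, and the class and method names here are invented):

```python
class IngressCache:
    """Cached wildcard rules at an ingress switch, evicted on inactivity."""

    def __init__(self, idle_timeout=10.0):
        self.idle_timeout = idle_timeout
        self._last_hit = {}  # rule -> time of last matching packet

    def install(self, rule, now):
        # Installed by the authority switch's control-plane caching function.
        self._last_hit[rule] = now

    def hit(self, rule, now):
        last = self._last_hit.get(rule)
        if last is not None and now - last < self.idle_timeout:
            self._last_hit[rule] = now
            return True                   # cache hit: forward in the data plane
        self._last_hit.pop(rule, None)    # expired due to inactivity
        return False                      # cache miss: redirect via authority switch

cache = IngressCache(idle_timeout=10.0)
cache.install("R3", now=0.0)
assert cache.hit("R3", now=5.0)       # within the idle timeout
assert not cache.hit("R3", now=16.0)  # 11 s idle, so the rule has expired
```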

3.3 Implement DIFANE with Wildcard Rules

All the data-plane functions required in DIFANE can be expressed with three sets of wildcard rules of various granularity with simple actions, as shown in Figure 3(b).

Cache rules: The ingress switches cache rules so that most of the data traffic hits in the cache and is processed by the ingress switch. The cache rules are installed by the authority switches in the network.

Authority rules: Authority rules are only stored in authority switches. The controller installs and updates the authority rules for all the authority switches. When a packet matches an authority rule, it triggers a control-plane function to install rules in the ingress switch.

Partition rules: The controller installs partition rules in each switch. The partition rules are a set of coarse-grained rules. With these partition rules, we ensure a packet will always match at least one rule in the switch and thus always stay in the data plane.

The three sets of rules can be easily expressed as a single list of wildcard rules with different priorities. Priorities are naturally supported by TCAM. If a packet matches multiple rules, the packet is processed based on the rule that has the highest priority. The cached rules have the highest priority because packets matching cache rules do not need to be directed to authority switches. In authority switches, authority rules have higher priority than partition rules, because packets matching authority rules should be processed based on these rules. The primary partition rules have higher priority than backup partition rules.
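This priority ordering can be sketched in a few lines of our own code; the entries and priorities loosely follow Figure 3(b), and the action strings are just labels:

```python
# TCAM-style lookup: all three rule sets live in one priority-ordered list,
# and a packet is handled by the highest-priority matching entry.
def matches(pattern, value):
    return all(p in ("*", b) for p, b in zip(pattern, format(value, "04b")))

tcam = [  # (priority, F1 pattern, F2 pattern, action)
    (210, "00**", "111*", "cache: encap, forward to B"),
    (110, "00**", "001*", "authority: forward to D, trigger ctrl func"),
    (15,  "0***", "000*", "partition: encap, redirect to B"),
    (5,   "0***", "000*", "partition (backup): encap, redirect to B'"),
]

def process(f1, f2):
    for _, p1, p2, action in sorted(tcam, reverse=True):
        if matches(p1, f1) and matches(p2, f2):
            return action

print(process(1, 14))  # hits the cache rule (priority 210)
print(process(1, 2))   # no cache hit; hits the authority rule (110)
print(process(1, 1))   # falls through to the primary partition rule (15)
```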

Since all functionality in DIFANE is expressed with wildcard rules, DIFANE does not require any data-plane modifications to the switches and needs only minor software extensions in the control plane of the authority switches.

⁵Here we assume that we cache flow rules only at the ingress switch. Section 9 discusses the design choices of where to cache flow rules.


4. HANDLING NETWORK DYNAMICS

In this section, we describe how DIFANE handles dynamics in different parts of the network: To handle rule changes at the controller, we need to update the authority rules in the authority switches and occasionally repartition the rules. To handle topology changes at the switches, we leverage link-state routing and focus on reducing the interruptions of authority switch failure and recovery. To handle host mobility, we dynamically update the rules for the host, and use redirection to handle changes in the routing rules.

4.1 Changes to the Rules

The rules change when administrators modify the policies, or when network events (e.g., topology changes) affect the mapping between policies and rules. The related authority rules, cache rules, and partition rules in the switches should be modified correspondingly.

Authority rules are modified by the controller directly: The controller changes the authority rules in the related authority switches. The controller can easily identify the related authority switches based on the partition it generates.

Cache rules expire automatically: Cached copies of the old rules may still exist in some ingress switches. These cache rules will expire after the timeout period. For critical changes (e.g., preventing DoS attacks), the authority switches can get the list of all the ingress switches from link-state routing and send them a message to evict the related TCAM entries.

Partition rules are recomputed occasionally: When the rules change, the number of authority rules in the authority switches may become unbalanced. If the difference in the number of rules among the authority switches exceeds a threshold, the controller recomputes the partition of the flow space. Once the new partition rules are generated, the controller notifies the switches of the new partition rules, and updates the authority rules in the authority switches.
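The rebalancing trigger can be sketched as a one-line check (our naming; the paper does not specify the exact imbalance metric or threshold):

```python
# Recompute the partition only when per-switch authority-rule counts
# drift too far apart.
def needs_repartition(rule_counts, threshold):
    return max(rule_counts) - min(rule_counts) > threshold

print(needs_repartition([100, 120, 110], threshold=50))  # prints: False
print(needs_repartition([100, 300, 110], threshold=50))  # prints: True
```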

The controller cannot update all the switches at exactly the same time, so the switches may not have a consistent view of the partition during the update, which may cause transient loops and packet loss in the network. To avoid packet loss, the controller simply updates the switches in a specific order. Assume the controller decides to move some authority rules from authority switch A to B. The controller first sends the authority rules to authority switch B, before sending the new partition rules for A and B to all the switches in the network. Meanwhile, switches can redirect packets to either A or B for those authority rules. Finally, the controller deletes the authority rules in switch A. In this way, we can prevent packet loss during the change. The same staged update mechanism also applies to partition changes among multiple authority switches.
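The staged update above can be sketched as follows. The callbacks `install`, `update_partition`, and `delete` are hypothetical stand-ins for the controller's rule-installation messages, not DIFANE's actual API; the point is only the ordering of the three steps:

```python
# Sketch of the staged update: when moving authority rules from switch A
# to switch B, order the steps so that at every moment some authority
# switch (A or B) holds the rules that packets may be redirected to.

def move_authority_rules(rules, a, b, all_switches,
                         install, update_partition, delete):
    # Step 1: install the moved authority rules at B while A still has them.
    install(b, rules)
    # Step 2: only now repoint every switch's partition rules at the new owner.
    for sw in all_switches:
        update_partition(sw, rules, owner=b)
    # Step 3: finally remove the duplicated rules from A. During steps 1-2,
    # packets redirected to either A or B are handled correctly.
    delete(a, rules)

log = []
move_authority_rules(
    ["R1", "R2"], "A", "B", ["S1", "S2"],
    install=lambda sw, r: log.append(("install", sw)),
    update_partition=lambda sw, r, owner: log.append(("partition", sw, owner)),
    delete=lambda sw, r: log.append(("delete", sw)),
)
print(log)
# [('install', 'B'), ('partition', 'S1', 'B'), ('partition', 'S2', 'B'), ('delete', 'A')]
```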

4.2 Topology Dynamics

Link-state routing enables the switches to learn about topology changes and adapt routing quickly. When authority switches fail or recover, DIFANE adapts the rules to reduce traffic interruption.

Authority switch failure: When an authority switch fails, packets directed through it are dropped. To minimize packet loss, we must react quickly to authority switch failures. We design a distributed authority switch takeover mechanism. As discussed in Section 3.1, the controller assigns the same group of authority rules to multiple authority switches to reduce stretch and failure-recovery time. Each ingress switch has primary partition rules directing traffic to its closest authority switch and backup partition rules, with lower priority, that direct traffic to another authority switch when the primary one fails.

The link-state routing protocol propagates a message about the switch failure throughout the network. Upon receiving this message, the switches invalidate their partition rules that direct traffic to the failed authority switch. As a result, the backup partition rule takes effect and automatically directs packets through the backup authority switch. For the switches that have not yet received the failure information, the packets may get sent towards the failed authority switch, but will eventually get dropped by a switch that has updated its forwarding table.6
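The primary/backup behavior can be illustrated with a small sketch. The rule representation and switch names here are hypothetical; a real switch expresses the same thing with rule priorities in its TCAM, and "alive" is what link-state routing reports:

```python
# Sketch: primary and backup partition rules at an ingress switch. The
# highest-priority rule whose authority switch is still reachable wins,
# so a link-state failure notification automatically activates the backup.

PARTITION_RULES = [  # (priority, inclusive flow range, authority switch)
    (1, (0, 7), "auth_near"),   # primary for flow range [0..7]
    (2, (0, 7), "auth_far"),    # backup for the same range, lower priority
]

def pick_authority(flow, alive):
    """Choose the authority switch for a flow, skipping failed switches."""
    for _prio, (lo, hi), auth in sorted(PARTITION_RULES):
        if lo <= flow <= hi and auth in alive:
            return auth
    return None  # no live authority switch covers this flow

print(pick_authority(3, alive={"auth_near", "auth_far"}))  # auth_near
print(pick_authority(3, alive={"auth_far"}))               # auth_far
```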

Authority switch addition/recovery: We use the controller to handle switches joining the network, because this does not require as fast a reaction as authority switch failures do. To minimize changes to the partition rules and authority rules, the controller randomly picks an authority switch and divides its flow range evenly into two parts. The controller then moves one part of the flow range to the new switch, and installs the authority rules in the new switch. Finally, the controller updates the partition rules correspondingly in all the switches.
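A minimal sketch of this handover, with hypothetical names. For simplicity, rules are keyed by a single flow value, and rules that span the cut (which Section 5.2 handles by splitting) are ignored:

```python
# Sketch: when an authority switch joins, halve an existing switch's flow
# range and hand over the authority rules that fall in the second half.

def split_range(lo, hi):
    """Divide an inclusive flow range evenly into two halves."""
    mid = (lo + hi) // 2
    return (lo, mid), (mid + 1, hi)

def handover(rules, old_range):
    """Keep rules in the first half; move rules in the second half."""
    keep_range, new_range = split_range(*old_range)
    keep = [r for r in rules if keep_range[0] <= r[0] <= keep_range[1]]
    move = [r for r in rules if new_range[0] <= r[0] <= new_range[1]]
    return keep, move, new_range

# Flow range [0..15]; authority rules keyed by a flow value.
rules = [(2, "R1"), (6, "R2"), (9, "R3"), (14, "R4")]
print(handover(rules, (0, 15)))
# ([(2, 'R1'), (6, 'R2')], [(9, 'R3'), (14, 'R4')], (8, 15))
```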

4.3 Host Mobility

In DIFANE, when a host moves, its MAC and perhaps IP address stay the same. The rules in the controller are defined based on these addresses, and thus do not change as hosts move. As a result, the partition rules and the authority rules also stay the same.7 Therefore we only need to consider changes to the cache rules.

6 If the switch can decapsulate the packet and encapsulate it with the backup authority switch as the destination, we can avoid such packet loss.
7 In some enterprises, a host changes its identifier when it moves. The rules also change correspondingly. We can use the techniques in Section 4.1 to handle the rule changes.

Installing rules at the new ingress switch on demand: When a host connects to a new ingress switch, the switch may not have the cache rules for the packets sent by the host, so the packets are redirected to the responsible authority switch. The authority switch then caches rules at the new ingress switch.

Removing rules from the old ingress switch by timeout: Today's flow-based switches usually have a timeout for removing inactive rules. Since the host's old ingress switch no longer receives packets from the moved host, the cache rules at that switch are automatically removed once the timeout expires.

Redirecting traffic from the old ingress switch to the new one: The rules for routing packets to the host change when the host moves. When a host connects to a new switch, the controller gets notified through the link-state routing and constructs new routing rules that map the address of the host to the corresponding egress switch. The new routing rules are then installed in the authority switches.

The cached routing rules in some switches may be outdated. Suppose a host H moves from ingress switch Sold to Snew. The controller first gets notified about the host movement. It then installs new routing rules in the authority switch, and also installs a rule in Sold redirecting packets to Snew. If an ingress switch A receives a packet whose destination is H, A may still send the packet to the old egress point Sold if the cache rule has not expired. Sold then redirects packets to Snew. After the cache rule expires in switch A, A directs the packets to the authority switch for the correct egress point Snew and caches the new routing rule.
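The two phases of this redirection chain can be traced with a small sketch; the switch names follow the Sold/Snew example, while the dictionary-based "rules" are a hypothetical stand-in for the switches' forwarding state:

```python
# Sketch: forwarding paths during host mobility. Ingress switch A holds a
# stale cache rule pointing at the old egress S_old; S_old redirects to
# S_new until A's cache entry expires and is refreshed via the authority.

AUTHORITY_EGRESS = {"H": "S_new"}   # updated by the controller on the move
REDIRECT = {"S_old": "S_new"}       # redirection rule installed at S_old

def forward(dst, cache):
    """Return the sequence of switches a packet to dst traverses from A."""
    if dst in cache:                      # cache rule (possibly stale)
        egress = cache[dst]
        path = ["A", egress]
        while egress in REDIRECT:         # follow redirection rules
            egress = REDIRECT[egress]
            path.append(egress)
        return path
    # Cache miss: redirect through the authority switch, then cache the rule.
    egress = AUTHORITY_EGRESS[dst]
    cache[dst] = egress
    return ["A", "authority", egress]

cache = {"H": "S_old"}          # before the old cache rule expires
print(forward("H", cache))      # ['A', 'S_old', 'S_new']
cache.pop("H")                  # cache rule times out
print(forward("H", cache))      # ['A', 'authority', 'S_new']
print(forward("H", cache))      # ['A', 'S_new']  (new rule now cached)
```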

5. HANDLING WILDCARD RULES

Most flow-management systems simply use microflow rules [3] or transform overlapping wildcard rules into a set of non-overlapping wildcard rules. However, these methods significantly increase the number of rules in switches, as shown in our evaluation in Section 7. To the best of our knowledge, there is no systematic and efficient solution for handling overlapping wildcard rules in network-wide flow-management systems. In this section, we first propose a simple and efficient solution for multiple authority switches to independently insert cache rules in ingress switches. We then discuss the key ideas of partitioning overlapping wildcard rules, deferring the description of our algorithm to the Appendix.

5.1 Caching Wildcard Rules

Wildcard rules complicate dynamic caching at ingress switches. In the context of access control, for example in Figure 4, the packet (F1=7, F2=0) matches an "accept" rule R3 that overlaps with "deny" rule R2, which has higher priority. Simply caching R3 is not safe. If we just cache R3 in the ingress switch, another packet (F1=7, F2=5) could incorrectly pass the cached rule R3, because the ingress switch is not aware of the rule R2. Thus, because rules can overlap with each other, the authority switch cannot simply cache the rule that a packet matches. This problem exists in all flow-management systems that cache wildcard rules, and therefore it is not trivial to extend Ethane controllers [3] to support wildcards.

  Rule  F1     F2    Action
  R1    4      0-15  Accept
  R2    0-7    5-6   Drop
  R3    6-7    0-15  Accept
  R4    14-15  0-15  Accept

(a) Wildcard rules listed in decreasing order of priority (R1 > R2 > R3 > R4).

(b) Graphical view and two partition solutions: "Cut A" splits the flow space on field F1 into ranges A1 and A2; "Cut B" splits it on field F2 into ranges B1 and B2.

Figure 4: An illustration of wildcard rules.

To address this problem, DIFANE constructs one or more new wildcard rules that cover the largest flow range (i.e., a hypercube in flow space) in which all packets take the same action. We use Figure 4 to illustrate the solution. Although the rules overlap, which means a packet may match multiple rules, the packet only takes the action of the rule with the highest priority. That is, each point in the flow space has a unique action (denoted by the shading of each spot in Figure 4(b)). As long as we cache a rule that covers only packets with the same action (i.e., spots with the same shading), we ensure that the caching preserves semantic correctness. For example, we cannot cache rule R3, because the spots it covers have different shading. In contrast, we can safely cache R1, because all the spots it covers have the same shading. For the packet (F1=7, F2=0), we construct and cache a new rule: F1=[6..7], F2=[0..3].
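The safety condition can be checked directly on the Figure 4 rule set. The brute-force check below verifies whether a candidate region is cacheable; it is only an illustration of the condition, not the paper's algorithm for constructing the maximal safe hypercube:

```python
# Sketch: a candidate wildcard rule is safe to cache iff every point it
# covers takes the same (highest-priority) action. Rules from Figure 4;
# all field ranges are inclusive, earlier rules have higher priority.

RULES = [
    ("R1", (4, 4),   (0, 15), "Accept"),
    ("R2", (0, 7),   (5, 6),  "Drop"),
    ("R3", (6, 7),   (0, 15), "Accept"),
    ("R4", (14, 15), (0, 15), "Accept"),
]

def action_at(f1, f2):
    """Action of the highest-priority rule matching the point (f1, f2)."""
    for _, (a1, b1), (a2, b2), action in RULES:
        if a1 <= f1 <= b1 and a2 <= f2 <= b2:
            return action
    return "Drop"  # assume default-deny for unmatched packets

def safe_to_cache(r1, r2):
    """True iff every point in the box r1 x r2 has the same action."""
    actions = {action_at(f1, f2)
               for f1 in range(r1[0], r1[1] + 1)
               for f2 in range(r2[0], r2[1] + 1)}
    return len(actions) == 1

# R3's full region mixes Accept with R2's higher-priority Drop: unsafe.
print(safe_to_cache((6, 7), (0, 15)))  # False
# The new rule constructed for packet (F1=7, F2=0): safe.
print(safe_to_cache((6, 7), (0, 3)))   # True
```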

The problem of constructing new wildcard rules for caching at a single switch was studied in [16]. Our contribution lies in extending this approach to multiple authority switches, each of which can independently install cache rules at ingress switches. Given such a setting, we must prevent authority switches from installing conflicting cache rules. To guarantee this, DIFANE ensures that the caching rules installed by different authority switches do not overlap. This is achieved by allocating non-overlapping flow ranges to the authority switches, and only allowing each authority switch to install caching rules within its own flow range. We later evaluate our caching scheme in Section 7.

5.2 Partitioning Wildcard Rules

Overlapping wildcard rules also introduce challenges in partitioning. We first formulate the partition problem: the controller needs to partition rules into M parts to minimize the total number of TCAM entries across all M authority switches, with the constraint that the rules should not take more TCAM entries than are available in the switches. There are three key ideas in the partition algorithm:

Allocating non-overlapping flow ranges to authority switches: As discussed in the caching solution, we must ensure that the flow ranges of authority switches do not overlap with each other. To achieve this goal, DIFANE first partitions the entire flow space into M flow ranges and then stores the rules in each flow range in an authority switch. For example, the "Cut A" shown in Figure 4(b) partitions the flow space on field F1 into two equal flow ranges A1 and A2. We then assign R1, R2, and R3 in A1 to one authority switch, and R4 to another.

DIFANE splits the rules so that each rule only belongs to one authority switch. With the above partitioning approach, one rule may span multiple partitions. For example, "Cut B" partitions the flow space on field F2, which results in the rules R1, R3, and R4 spanning the two flow ranges B1 and B2. We split each such rule into two independent rules by intersecting it with the two flow ranges. For example, the two new rules generated from rule R4 are F1=[14..15], F2=[0..7] → Accept and F1=[14..15], F2=[8..15] → Accept. These two independent rules can then be stored in different authority switches. Splitting rules thus avoids the overlapping of rules among authority switches, but at the cost of increased TCAM usage.
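Rule splitting is just a range intersection. The sketch below reproduces the R4 example under Cut B (function names are illustrative, not from the paper):

```python
# Sketch: splitting one wildcard rule across non-overlapping flow ranges
# by intersecting its F2 range with each partition's F2 range, as in the
# "Cut B" example of Figure 4 (F2 split into [0..7] and [8..15]).

def intersect(r, s):
    """Intersection of two inclusive ranges, or None if disjoint."""
    lo, hi = max(r[0], s[0]), min(r[1], s[1])
    return (lo, hi) if lo <= hi else None

def split_rule(f1, f2, action, partitions):
    """Clip rule (f1, f2) -> action against each partition's F2 range."""
    pieces = []
    for part_f2 in partitions:
        clipped = intersect(f2, part_f2)
        if clipped is not None:
            pieces.append((f1, clipped, action))
    return pieces

# Rule R4: F1=[14..15], F2=[0..15] -> Accept, split by Cut B on F2.
print(split_rule((14, 15), (0, 15), "Accept", [(0, 7), (8, 15)]))
# [((14, 15), (0, 7), 'Accept'), ((14, 15), (8, 15), 'Accept')]
```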

To reduce TCAM usage, we prefer cuts that align with rule boundaries. For example, "Cut A" is better than "Cut B" because "Cut A" does not break any rules. We also observe that cutting on field F1 is better than on F2, since F1 offers more rule boundaries to choose from. Based on these observations, we design a decision-tree-based partition algorithm, which is described in [17].

In summary, DIFANE partitions the entire rule space into M independent portions, and thus each authority switch is assigned a non-overlapping portion. However, within the portion managed by a single authority switch, DIFANE allows overlapping or nested rules. This substantially reduces the TCAM usage of authority switches (see Section 7).

Duplicating authority rules to reduce stretch: We can duplicate the rules in each partition on multiple authority switches to reduce stretch and to react quickly to authority switch failures. Due to host mobility, we cannot pre-locate authority switches to minimize stretch. Instead, we assume traffic related to one rule may come from any ingress switch, and place replicated authority switches to reduce the average stretch. One simple method is to place the replicated switches randomly. Alternatively, by leveraging an approximation algorithm for the "k-median problem" [18], we can place the replicated authority switches so that they have minimal average stretch to any pair of switches. Both schemes are evaluated in Section 7.

6. DESIGN AND IMPLEMENTATION

In this section, we present our design and implementation of DIFANE. First, we describe how our prototype handles multiple sets of rules from different kinds of high-level policies for different management functions. Second, we describe the prototype architecture, which just adds a few control-plane functions for authority switches to today's flow-based switches.

6.1 Managing Multiple Sets of Rules

Different management functions, such as access control, measurement, and routing, may have totally different kinds of policies. To make our prototype efficient and easy to implement, we generate different sets of rules for different policies, partition them using a single partition algorithm, and process them sequentially in the switch.

Generate multiple sets of low-level rules: Translating and combining different kinds of high-level policies into one set of rules is complicated and significantly increases TCAM usage. For example, if the policies are to monitor web traffic and perform destination-based routing, we have to provide rules for (dst, port 80) and (dst, other ports) for each destination dst. If the administrator changes the policy to monitor port 21 rather than port 80, we must change the rules for every destination. In contrast, if we keep different sets of rules, we only need one routing rule for each destination and a single measurement rule for port-80 traffic, which is easy to change.

To distribute multiple sets of rules, the controller first partitions the flow space to minimize the total TCAM usage. It then assigns all the rules (of different management modules) in one flow range to one authority switch. We choose to use the same partition for different sets of low-level rules so that packets only need to be redirected to one authority switch to match all sets of authority rules.8

Figure 5: Rules for various management modules. (In the switch TCAM, incoming packets are matched sequentially against access control rules, measurement rules, routing rules, and switch connection rules.)

Processing packets through multiple sets of rules in switches: In switches we have one set of rules for each management module.9 We process the flow rules sequentially through the rules for different modules, as shown in Figure 5. We put access control rules first to block malicious traffic. Routing rules are placed later to identify the egress switch for the packets. Finally, the link-state routing constructs a set of switch connection rules to direct the packets to their egress switches.

To implement sequential processing in the switch, where all the rules share the same TCAM, the controller sets a "module identifier" in the rules to indicate the module they belong to. The switch first initializes a module identifier in the packet. It then matches the packet against the rules that have the same module identifier. Next, the switch increments the module identifier in the packet and matches against the next set of rules. By processing the packet several times through the memory, the switch matches the packet to the rules of the different modules sequentially. The administrator specifies the order of the modules by assigning different module identifiers to the high-level policies in the controller.
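A software analogue of this sequential matching is sketched below. The module names and predicates are hypothetical examples; a real switch instead matches TCAM entries tagged with the module identifier, one pass per module:

```python
# Sketch: one shared rule table, with each rule tagged by a module
# identifier. The packet is matched once per module, in module order,
# mimicking the multi-pass TCAM processing described above.

ACCESS, MEASURE, ROUTE = 0, 1, 2  # module identifiers, in processing order

# (module_id, predicate, action); first match within a module wins.
RULES = [
    (ACCESS,  lambda p: p["dport"] == 23, "drop"),           # block telnet
    (ACCESS,  lambda p: True,             "continue"),       # default allow
    (MEASURE, lambda p: p["dport"] == 80, "count+continue"), # count web traffic
    (MEASURE, lambda p: True,             "continue"),
    (ROUTE,   lambda p: True,             "fwd:egress1"),    # default route
]

def process(pkt):
    actions = []
    for module in (ACCESS, MEASURE, ROUTE):  # increment module id each pass
        for mod, pred, action in RULES:
            if mod == module and pred(pkt):
                actions.append(action)
                break
        if actions[-1] == "drop":            # access control ends processing
            return actions
    return actions

print(process({"dport": 80}))  # ['continue', 'count+continue', 'fwd:egress1']
print(process({"dport": 23}))  # ['drop']
```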

A packet may be redirected to the authority switch when it is in the middle of this sequential processing (e.g., while being processed in the measurement module). After redirection, the packet is processed through the remaining modules in the authority switch based on the module identifier carried in the packet.

6.2 DIFANE Switch Prototype

Figure 6 shows both the control and data plane of our DIFANE switch prototype.

Control plane: We use XORP [19] to run the link-state routing protocol to maintain the switch-level connectivity, and keep track of topology changes. XORP also sends updates of the switch connection rules in the data plane.

The authority switch also runs a cache manager, which installs cache rules in the ingress switch. If a packet misses the cache and matches an authority rule in the authority switch, the cache manager is triggered to send a cache update to the ingress switch of the packet. The cache manager is implemented in software in the control plane because packets are not buffered while waiting for the cache rule in the ingress switch; the ingress switch simply continues to forward packets to the authority switch until the cache rules are installed. The cache manager sends an update for every packet that matches an authority rule, because we can infer that the ingress switch does not have any related rules cached; otherwise it would forward the packets directly rather than sending them to the authority switch.10

Figure 6: DIFANE prototype implementation. (The cache manager and authority rules exist only in authority switches. The control plane runs the link-state protocol in XORP and the cache manager; the Click-based data plane holds the partition, authority, cache, and switch connection rules, encapsulates packets, and triggers the cache manager on authority-rule matches.)

8 One can also choose to provide a different partition of the flow space for different sets of low-level rules, but packets may then be redirected to multiple authority switches to match the authority rules of different management modules.
9 To make the rule processing simple, we duplicate the same set of partition rules in the management modules.

Data plane: We run the Click-based OpenFlow switch [11] in the kernel as the data plane of DIFANE. Click manages the rules for the different management modules, and encapsulates and forwards packets based on the switch connection rules. We implemented the packet encapsulation function to enable tunneling in the Click OpenFlow element. We also modified the Click OpenFlow element to support the flow-rule action "trigger the cache manager". If a packet matches an authority rule, Click generates a message to the cache manager through the kernel-level socket "netlink". Today's flow-based switches already support actions that send messages to a local controller in order to communicate with the centralized controller [4]; we just add a new message type, "matching authority rules". In addition, today's flow-based switches already have interfaces for the centralized controller to install new rules. The cache manager then just leverages the same interfaces to install cache rules in the ingress switches.

10 For UDP flows, a few packets may be sent to the authority switch. The authority switch sends one feedback message for each UDP flow, because sending a cache update message has very low overhead (just one UDP packet). In addition, we do not need to store the existing cached flow rules in the authority switch or fetch them from the ingress switch.

Figure 7: Experiment setup. (Dashed lines are control messages; straight lines are data traffic. In the NOX setup (a), each of switches 1..p runs a local controller and data plane, and the NOX controller installs rules. In the DIFANE setup (b), switches 1..p run only the data plane, and an authority switch with a cache manager installs rules.)

7. EVALUATION

Ideally we would like to evaluate DIFANE based on policies, topology data, and user-mobility traces from real networks. Unfortunately, most networks today are still configured with rules that are tightly bound to their network configurations (e.g., IP address assignment, routing, and VLANs). Therefore, we evaluated DIFANE's approach against the topology and access-control rules of a variety of different networks to explore DIFANE's benefit across various settings. We also perform latency, throughput, and scalability micro-benchmarks of our DIFANE prototype and a trace-driven evaluation of our partition and caching algorithms.

To verify the design decisions in Section 2, we evaluate two central questions in this section: (1) How efficient and scalable is DIFANE? (2) How well do our partition and caching algorithms work in handling large sets of wildcard rules?

7.1 Performance of the DIFANE Prototype

We implemented the DIFANE prototype using a kernel-level Click-based OpenFlow switch and compared the delay and throughput of DIFANE with NOX [4], which is a centralized solution for flow management. For a fair comparison, we first evaluated DIFANE using only one authority switch. Then we evaluated the throughput of DIFANE with multiple authority switches. Finally, we investigated how fast DIFANE reacts to authority switch failures.

Figure 8: Delay comparison of DIFANE and NOX. (Cumulative fraction of packets vs. RTT in ms.)

In the experiments, each sender sends packets to a receiver through a single ingress switch, which is connected directly to either NOX or a DIFANE authority switch, as shown in Figure 7. (In this way, the network delay from the ingress switch to NOX and to the DIFANE authority switch is minimized. We evaluate the extra delay caused by redirecting through authority switches in Section 7.2.) With NOX, when a packet does not match a cached rule, the packet is buffered in the ingress switch before the NOX controller installs a rule at the ingress switch. In contrast, in DIFANE the authority switch redirects the packet to the receiver in the data plane and installs a rule in the ingress switch at the same time. We generate flows with different port numbers, and use a separate rule for each flow. Since the difference between NOX and DIFANE lies in the processing of the first packet, we generate each flow as a single 64-byte UDP packet. Based on the measurements, we also calculate the performance difference between NOX and DIFANE for flows of normal sizes. Switches, NOX, and traffic generators ("clients") run on separate 3.0 GHz 64-bit Intel Xeon machines to avoid interference between them.

(1) DIFANE achieves small delay for the first packet of a flow by always keeping packets in the fast path. In Figure 8, we send traffic at 100 single-packet flows/s and measure the round-trip time (RTT) of each packet being sent through a switch to the receiver and an ACK packet being sent back. Although we placed NOX near the switch, the packets still experience an RTT of 10 ms on average, which is not acceptable for networks with tight latency requirements, such as data centers. In DIFANE, since the packets stay in the fast path (forwarded through an authority switch), the packets experience only 0.4 ms RTT on average. Since all subsequent packets take 0.3 ms RTT for both DIFANE and NOX, we can easily calculate that, to transfer a flow of a normal size of 35 packets (based on the measurement in [20]), the average packet transfer time is 0.3 ms (=(0.3*34+0.4)/35) for DIFANE but 0.58 ms (=(0.3*34+10)/35) for NOX. Others have tested NOX with commercial OpenFlow switches and observed similar delays [21].
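These averages follow directly from the per-packet RTTs, since only the first packet of a flow pays the setup delay:

```python
# Worked check of the average per-packet transfer times quoted above,
# for a 35-packet flow where only the first packet pays the setup RTT.

PACKETS = 35
LATER_RTT = 0.3  # ms, RTT of every packet after the first

def avg_transfer_ms(first_rtt):
    return (LATER_RTT * (PACKETS - 1) + first_rtt) / PACKETS

print(round(avg_transfer_ms(0.4), 2))   # DIFANE: 0.3
print(round(avg_transfer_ms(10.0), 2))  # NOX:    0.58
```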

Figure 9: Throughput comparison of DIFANE and NOX. (Flow-setup throughput vs. sending rate, both in flows/s; the upper x-axis gives the number of switches, 1-10.)

(2) DIFANE achieves significantly higher throughput than NOX. We then increase the number of switches (p) to measure the throughput of NOX and DIFANE. In Figure 9, we show the maximum throughput of flow setup for one client using one switch. In DIFANE, the switch was able to achieve the client's maximum flow-setup rate, 75K flows/s, while the NOX architecture was only able to achieve 20K flows/s. This is because, while all packets remain in the fast path (the software kernel) in DIFANE, the OpenFlow switch's local controller (implemented in user space) becomes a bottleneck before NOX does. Today's commercial OpenFlow switches can only send 60-330 flows/s to the controller due to the limited CPU resources in the controller [22]. Section 8 discusses how to run DIFANE on today's commercial switches.

As we increase the number of ingress switches (each additional data point represents an additional switch and client, as shown on the upper x-axis), we see that the NOX controller soon becomes a bottleneck: with four switches, a single NOX controller achieves a peak throughput of 50K single-packet flows/s. If a network had 1K switches, a single NOX controller could only support 50 new flows per second for each switch simultaneously. In comparison, the peak throughput of DIFANE with one authority switch is 800K single-packet flows/s.

Admittedly, this large difference exists because DIFANE handles packets in the kernel while NOX operates in user space. However, while it is possible to move NOX into the kernel, it would not be feasible to implement all of NOX in today's switch hardware, because the online rule generation is too complex for hardware implementation. In contrast, DIFANE is designed for precisely that setting: it installs a set of rules in the data plane.11

Figure 10: Authority switch failure. (Packet loss rate over time, with the switch failure and failure detection marked.)

Network  # switches/routers  # Rules
Campus   ∼1700               30K
VPN      ∼1500               59K
IPTV     ∼3000               5M
IP       ∼2000               8M

Table 1: Network characteristics.

We then evaluate the performance of the cache manager in authority switches. When there are 10 authority rules, the cache manager can handle 30K packets/s and generate one rule for each packet. When there are 9K authority rules, the cache manager can handle 12K packets/s. The CPU is the bottleneck of the cache manager.

(3) DIFANE scales with the number of authority switches. Our experiments show that DIFANE's throughput increases linearly with the number of authority switches. With four authority switches, the throughput of DIFANE reaches over 3M flows/s. Administrators can determine the number of authority switches according to the size and the throughput requirements of their networks.

(4) DIFANE recovers quickly from authority switch failure. Figure 10 shows the effect of an authority switch failure. We construct a diamond topology with two authority switches, both connecting to the ingress and the egress switches. We set the OSPF hello interval to 1 s, and the dead interval to 3 s. After the authority switch fails, OSPF notifies the ingress switch. Once the dead interval elapses, it takes less than 10 ms for the ingress switch to switch over to the other authority switch: the ingress switch sets the backup partition rule as the primary one, and connectivity is restored.

7.2 Evaluation of Partitioning and Caching

We now evaluate DIFANE's partitioning algorithm using the topologies and access-control rules from several sizable networks (as of Sept. 10, 2009), including a large-scale campus network [24] and three large backbone networks that are operated by a tier-1 ISP for its enterprise VPN, IPTV, and traditional IP services. The basic characteristics of these networks are shown in Table 1. In these networks, each access-control rule has six fields: ingress interface, source IP/port, destination IP/port, and protocol. Access-control rules are configured on the ingress switches/routers. It is highly likely that different sets of rules are configured at different switches and routers; hence one packet may be permitted at one ingress switch but denied at another. In addition, as Table 1 shows, there are a large number of access-control rules configured in these networks. This is due to the large number of ingress routers that need to have access-control rules configured, and the potentially large number of rules that need to be configured on even a single router. For example, ISPs often configure a set of rules on each ingress switch to protect their infrastructures and important servers from unauthorized customers. This easily results in a large number of rules on an ingress switch if a large number of customers connect to the switch and/or these customers use a large amount of non-aggregatable address space.

11 Today's TCAMs with pipelined processing take only 3–4 ns per lookup [23], which is more than three orders of magnitude faster than software.

Figure 11: Comparison of overlapping and non-overlapping rules. (Cumulative fraction vs. the ratio of the number of non-overlapping rules to the number of overlapping rules, for the Campus, VPN, IPTV, and IP networks.)

In the rest of this section, we evaluate the effect of overlapping rules, the number of authority switches needed for different networks, the number of extra rules needed after the partitioning, the miss rate of caching wildcard rules, and the stretch experienced by packets that travel through an authority switch.

(1) Installing overlapping rules in an authority switch significantly reduces the memory requirement of switches: In our set of access-control rules, the use of wildcard rules can result in overlapping rules. One straightforward way to handle these overlapping rules is to translate them into non-overlapping rules that represent the same semantics. However, this is not a good solution: Figure 11 shows that the resulting non-overlapping rules require one to two orders of magnitude more TCAM space than the original overlapping rules. This suggests that the use of overlapping wildcard rules can significantly reduce the number of TCAM entries.

Figure 12: Evaluation of the number of authority switches. (Number of authority switches vs. number of TCAM entries per authority switch, for the IP, IPTV, VPN, and Campus networks.)

Figure 13: The overhead of rule partition. (Percentage of extra TCAM entries vs. number of splits.)

(2) A small number of authority switches are needed for the large networks we evaluated. Figure 12 shows the number of authority switches needed under varying TCAM capacities. The number of authority switches needed decreases almost linearly as the switch memory increases. For networks with relatively few rules, such as the campus and VPN networks, we would require 5-6 authority switches with 10K TCAM entries each (assuming we need 16 B to store the six fields and action of a TCAM entry, we need about 160 KB of TCAM in each authority switch). To handle networks with many rules, such as the IP and IPTV networks, we would need approximately 100 authority switches with 100K TCAM entries (1.6 MB of TCAM) each.12 The number of authority switches is still relatively small compared to the network size (2K-3K switches).
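The sizing figures follow from the 16-byte-per-entry assumption:

```python
# Quick check of the TCAM sizing arithmetic above: total TCAM bytes per
# authority switch at 16 bytes per entry (six match fields plus action).

BYTES_PER_ENTRY = 16

def tcam_bytes(entries):
    return entries * BYTES_PER_ENTRY

print(tcam_bytes(10_000))   # 160000 bytes, i.e., 160 KB (campus/VPN case)
print(tcam_bytes(100_000))  # 1600000 bytes, i.e., 1.6 MB (IP/IPTV case)
```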

(3) Our partition algorithm is efficient in reducing the TCAM usage in switches. As shown in Figure 4, depending on the rules, partitioning wildcard rules can increase the total number of rules and the TCAM usage for representing the rules. With the 6-tuple access-control rules in the IP network, the total number of TCAM entries increases by only 0.01% if we distribute the rules over 100 authority switches. This is because most of the cuts are on the ingress dimension and, in the data set, most rules differ between ingresses. To evaluate how the partition algorithm handles highly overlapping rules, we use our algorithm to partition the 1.6K rules in one ingress router in the IP network. Figure 13 shows that we increase the number of TCAM entries by only 10% with 10 splits (100-200 TCAM entries per split).

12 Today's commercial switches are commonly equipped with 2 MB of TCAM.

Figure 14: Cache miss rate. (Cache miss rate vs. cache size per switch in TCAM entries, for microflow and wildcard caching.)

Figure 15: The stretch of the campus network. (We place 1, 3, or 10 authority switches for each set of the authority rules, using the k-median and random algorithms respectively.)

(4) Our wildcard caching solution is efficient in reducing cache misses and cache memory size. We evaluate our cache algorithm with packet-level traces of 10M packets collected in December 2008 and the corresponding access-control lists (9K rules) in a router in the IP network. Figure 14 shows that if we only cache microflow rules, the miss rate is 10% with 1K cache entries. In contrast, with wildcard rule caching in DIFANE, the cache miss rate is only 0.1% with 100 cache entries: 99.9% of packets are forwarded directly to the destination, while only 0.1% of the packets take a slightly longer path through an authority switch. Ingress switches that are not authority switches only need 1K TCAM entries for cache rules (Figure 14) and 10-1000 TCAM entries for partition rules (Figure 12).
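The gap between microflow and wildcard caching can be reproduced with a toy LRU simulation (a sketch, not the paper's trace-driven evaluation): microflow caching keys on the exact header, so every new microflow misses at least once, while wildcard caching keys on the matching rule, so one cached entry covers all microflows that hit that rule.

```python
from collections import OrderedDict

def miss_rate(packets, classify, cache_size):
    """Fraction of packets that miss an LRU cache keyed by classify(pkt)."""
    cache, misses = OrderedDict(), 0
    for pkt in packets:
        key = classify(pkt)
        if key in cache:
            cache.move_to_end(key)              # LRU update on a hit
        else:
            misses += 1
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)       # evict least recently used
    return misses / len(packets)

# 10,000 packets cycling over 1,000 microflows that match only 10 rules.
packets = [i % 1000 for i in range(10000)]
micro = miss_rate(packets, classify=lambda p: p, cache_size=100)
wild = miss_rate(packets, classify=lambda p: p % 10, cache_size=100)
print(micro, wild)   # wildcard caching misses far less often
```

With the cyclic access pattern above, the microflow cache thrashes (every lookup misses) while the wildcard cache misses only on the first packet per rule.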

(5) The stretch caused by packet redirection is small. We evaluated the stretch of the two authority switch placement schemes (random and k-median) discussed in Section 5. Figure 15 shows the distribution of stretch (delay normalized by that of the shortest path) among all source-destination pairs in the campus network. With only one authority switch for each set of authority rules, the average stretch is twice the delay of the shortest path in the random scheme. Though some packets experience 10 times the delay of the shortest path, this usually happens to pairs of nodes that are one or two hops away from each other, so the absolute delay of these paths is not large. If we store each set of rules at three random places, the average stretch drops to 1.8. With 10 copies of the rules, the stretch is reduced to 1.5. By placing the authority switches with the k-median scheme, we can further reduce the stretch (e.g., with 10 copies of the rules, the stretch is reduced to 1.07).

8. DIFANE DEPLOYMENT SCENARIOS

DIFANE proposes to use authority switches to always keep packets in the data plane. However, today's commercial OpenFlow switches have resource constraints in the control plane. In this section, we describe how DIFANE can be deployed in today's switches despite these constraints, and in future switches as a clean-slate solution. We provide three types of design choices: implementing tunneling with packet encapsulation or VLAN tags; performing rule caching in the authority switches or in the controller; and choosing normal switches or dedicated switches as authority switches.

Deployment with today's switches: Today's commercial OpenFlow switches have resource constraints in the control plane. For example, they do not have enough CPU resources to generate caching rules, or hardware to encapsulate packets quickly. Therefore, we use VLAN tags to implement tunneling and move the wildcard rule caching to the DIFANE controller. The ingress switch tags the "miss" packet and sends it to the authority switch. The authority switch tags the packet with a different VLAN tag and sends it to the corresponding egress switch. The ingress switch also sends the packet header to the controller, and the controller installs the cache rules in the ingress switch.

Performing rule caching in the controller resolves the limitations of today's commercial OpenFlow switches for two reasons: (i) authority switches do not need to run caching functions; (ii) the authority switches do not need to know the addresses of the ingress switch. We can use VLAN tagging instead of packet encapsulation, which can be implemented in hardware in today's switches.

This deployment scenario is similar to Ethane [3] in that it also has the overhead of the ingress switch sending packets to the controller and the controller installing caching rules. However, the difference is that DIFANE always keeps the packet in the fast path. Since we need one VLAN tag for each authority switch (10-100 authority switches) and each egress switch (at most a few thousand switches), we have enough VLAN tags to support tunneling in DIFANE in today's networks.
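The tag-budget argument above can be sanity-checked in a few lines (the switch counts are the text's worst-case figures; the 12-bit VID space comes from standard 802.1Q tagging):

```python
def vlan_tags_suffice(num_authority, num_egress, vid_space=4094):
    """One VLAN tag per authority switch plus one per egress switch must
    fit in the usable 802.1Q VID space (4094 = 2^12 minus reserved 0 and 4095)."""
    return num_authority + num_egress <= vid_space

# Worst case from the text: ~100 authority switches, a few thousand egress.
print(vlan_tags_suffice(100, 3000))   # fits within today's VLAN space
```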

Clean-slate deployment with future switches: Future switches can have more CPU resources and hardware-based packet encapsulation techniques. In this case, we can have a clean-slate design. The ingress switch encapsulates the "miss" packet with its own address as the source address. The authority switch decapsulates the packet, gets the address of the ingress switch, and re-encapsulates the packet with the egress switch address as the packet's destination address. The authority switch also installs cache rules in the ingress switch based on the address it gets from the packet header. In this scenario, we can avoid the overhead and single point of failure of the controller. Future switches should also have a high-bandwidth channel between the data plane and the control plane to improve caching performance.

Deployment of authority switches: There are two deployment scenarios for authority switches: (i) the authority switches are just normal switches in the network, which can be taken over by other switches when they fail; (ii) the authority switches are dedicated switches that have larger TCAMs to store more authority rules and serve essentially as a distributed data plane of the centralized controller. All the other switches just need a small TCAM to store a few partition rules and the cache rules. In the second scenario, even when DIFANE has only one authority switch that serves as the data plane of the controller, DIFANE and Ethane [3] are still fundamentally different in that DIFANE pre-installs authority rules in TCAM and thus always keeps the packet in the fast path.

9. FLEXIBLE MANAGEMENT SUPPORT

Recently proposed flow-based management solutions [3, 7, 8, 9, 10] can easily run on top of DIFANE by defining their own policies and translating them into rules, while capitalizing on our support for distributed rule processing for better scalability and performance. In addition, DIFANE is flexible enough to support other proposed techniques and management functions for managing how rules are handled.

Support scalable routing with packet redirection: SEATTLE [13] performs packet redirection based on the hash of a packet's destination MAC address, reactively caching information about the host's current location. SEATTLE can be easily implemented within our architecture even when the flow-based switches do not support hashing. In particular, the DIFANE controller could generate routing rules that map a destination MAC address to the switch where the host is located. The partitioning algorithm could divide the space of rules by cutting along the destination-address dimension to generate partition rules, with wildcards in some bit positions of the destination address and wildcards in all other header fields. When an ingress switch does not have a cached rule that matches an incoming packet, the partition rule directs the packet through the appropriate authority switch. This triggers the authority switch to install the rules for the destination address at the ingress switch.
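As a sketch of how such partition rules could look (the helper name and rule encoding are illustrative, not from the paper), the controller might cut the 48-bit destination-MAC space along its top bits, emitting one wildcard rule per partition:

```python
def mac_partition_rules(num_partitions):
    """Generate wildcard partition rules that cut the 48-bit destination-MAC
    space on its high-order bits; each rule redirects cache misses to one
    authority switch. '*' marks a wildcarded bit position."""
    bits = max(1, (num_partitions - 1).bit_length())
    rules = []
    for i in range(2 ** bits):
        prefix = format(i, "0{}b".format(bits))
        match = prefix + "*" * (48 - bits)   # wildcard the low-order MAC bits
        rules.append((match, "redirect_to_authority_{}".format(i)))
    return rules

for match, action in mac_partition_rules(4):
    print(match, action)
```

Because the cut is on bit prefixes, each partition rule is a single TCAM entry, and no hashing support is needed in the switches.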

Similarly, DIFANE could support ViAggre [15] by creating partition rules based on the destination IP address space (rather than MAC addresses), so ingress switches forward packets to authority switches with more-specific forwarding-table entries.

Handle measurement rules in both authority and ingress switches: In DIFANE, packets may be processed in the ingress switch or in the authority switch. If we have a measurement rule that accumulates the statistics of a flow, we need to install it in both the ingress and the authority switch. The controller installs these measurement rules in the authority switches, which then install the rules in the ingress switches. A packet that matches one of the cached rules is counted at the ingress switch. Otherwise, it is redirected to an authority switch and counted there.

The controller can use either a pull or a push method to collect flow information from the authority switches and ingress switches. A cache rule in the ingress switch may be swapped out if it has not been used before the timeout or if there is not enough memory. In this case, the ingress switch notifies the controller and sends the data of the cache rule to the controller, which is already supported by today's flow-based switches.

Most network management functions can be achieved by caching rules only at the ingress switch: We choose to cache rules only at the ingress switch for lower overhead. One alternative is to cache rules at every switch on the path a packet traverses. The authority switch would then have to determine all the switches on the packet's path and install the rule in each of them.

In fact, caching rules only at the ingress can meet most of the requirements of management modules. For example, access-control rules are checked at the ingress switch to block malicious traffic before it enters the network. Measurement rules are usually applied at the ingress switch to count the number of packets or the amount of traffic. Routing rules are used at the ingress switch to select the egress point for the packets. Packets are then encapsulated and forwarded directly to the egress switch without matching routing rules at other switches. Customized routing can also be supported by tagging the packets and selecting pre-computed paths at the ingress switch.

10. CONCLUSION


We design and implement DIFANE, a distributed flow management architecture that distributes rules to authority switches and handles all data traffic in the fast path. DIFANE can handle wildcard rules efficiently and react quickly to network dynamics such as policy changes, topology changes, and host mobility. DIFANE can be easily implemented with today's flow-based switches. Our evaluation of the DIFANE prototype, various networks, and large sets of wildcard rules shows that DIFANE scales to networks with a large number of hosts, flows, and rules.

11. REFERENCES

[1] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner, "OpenFlow: Enabling innovation in campus networks," ACM Computer Communication Review, Apr. 2008.

[2] "Anagran: Go with the flow." http://anagran.com.

[3] M. Casado, M. J. Freedman, J. Pettit, J. Luo, N. Gude, N. McKeown, and S. Shenker, "Rethinking enterprise network control," IEEE/ACM Transactions on Networking, 2009.

[4] N. Gude, T. Koponen, J. Pettit, B. Pfaff, M. Casado, N. McKeown, and S. Shenker, "NOX: Toward an operating system for networks," ACM Computer Communication Review, July 2008.

[5] A. Greenberg, G. Hjalmtysson, D. A. Maltz, A. Myers, J. Rexford, G. Xie, H. Yan, J. Zhan, and H. Zhang, "A clean slate 4D approach to network control and management," ACM Computer Communication Review, 2005.

[6] H. Yan, D. A. Maltz, T. S. E. Ng, H. Gogineni, H. Zhang, and Z. Cai, "Tesseract: A 4D network control plane," in Proc. Networked Systems Design and Implementation, Apr. 2007.

[7] A. Nayak, A. Reimers, N. Feamster, and R. Clark, "Resonance: Dynamic access control for enterprise networks," in Proc. Workshop on Research in Enterprise Networks, 2009.

[8] N. Handigol, S. Seetharaman, M. Flajslik, N. McKeown, and R. Johari, "Plug-n-Serve: Load-balancing Web traffic using OpenFlow," Aug. 2009. ACM SIGCOMM Demo.

[9] D. Erickson et al., "A demonstration of virtual machine mobility in an OpenFlow network," Aug. 2008. ACM SIGCOMM Demo.

[10] B. Heller, S. Seetharaman, P. Mahadevan, Y. Yiakoumis, P. Sharma, S. Banerjee, and N. McKeown, "ElasticTree: Saving energy in data center networks," in Proc. Networked Systems Design and Implementation, Apr. 2010.

[11] Y. Mundada, R. Sherwood, and N. Feamster, "An OpenFlow switch element for Click," in Symposium on Click Modular Router, 2009.

[12] A. Tootoonchian and Y. Ganjali, "HyperFlow: A distributed control plane for OpenFlow," in INM/WREN Workshop, 2010.

[13] C. Kim, M. Caesar, and J. Rexford, "Floodless in SEATTLE: A scalable Ethernet architecture for large enterprises," in Proc. ACM SIGCOMM, 2008.

[14] S. Ray, R. Guerin, and R. Sofia, "A distributed hash table based address resolution scheme for large-scale Ethernet networks," in Proc. International Conference on Communications, June 2007.

[15] H. Ballani, P. Francis, T. Cao, and J. Wang, "Making routers last longer with ViAggre," in Proc. NSDI, 2009.

[16] Q. Dong, S. Banerjee, J. Wang, and D. Agrawal, "Wire speed packet classification without TCAMs: A few more registers (and a bit of logic) are enough," in ACM SIGMETRICS, 2007.

[17] M. Yu, J. Rexford, M. J. Freedman, and J. Wang, "Scalable flow-based networking with DIFANE," Princeton University Technical Report, 2010.

[18] J.-H. Lin and J. S. Vitter, "ε-approximations with minimum packing constraint violation," in ACM Symposium on Theory of Computing, 1992.

[19] M. Handley, O. Hudson, and E. Kohler, "XORP: An open platform for network research," in Proc. SIGCOMM Workshop on Hot Topics in Networking, Oct. 2002.

[20] N. Brownlee, "Some observations of Internet stream lifetimes," in Passive and Active Measurement, 2005.

[21] "Stanford OpenFlow network real time monitor." http://yuba.stanford.edu/~masayosi/ofgates/.

[22] "Personal communication with Stanford OpenFlow deployment group."

[23] "NetLogic Microsystems." www.netlogicmicro.com.

[24] Y.-W. E. Sung, S. Rao, G. Xie, and D. Maltz, "Towards systematic design of enterprise networks," in Proc. ACM CoNEXT, 2008.

[25] P. Gupta and N. McKeown, "Packet classification using hierarchical intelligent cuttings," in Hot Interconnects VII, 1999.

[26] S. Singh, F. Baboescu, G. Varghese, and J. Wang, "Packet classification using multidimensional cutting," in Proc. ACM SIGCOMM, 2003.

APPENDIX

A. PARTITION ALGORITHM

We formulate the rule partition problem as follows: Assume that there are M candidate authority switches, each of which can store up to S TCAM entries. For a given set of N low-level rules of K dimensions, we would like to partition the flow space into n hypercubes (n ≤ M), because hypercubes are easy to represent as wildcard partition rules. Each of the hypercubes is represented by a K-tuple of ranges, [l1..r1], . . . , [lK..rK], and is stored in an authority switch. The optimization objective is to minimize the total number of TCAM entries in all n authority switches. The flow partition problem is NP-hard for K ≥ 2.13 Therefore, we instead design a heuristic algorithm for partitioning rules.

Rule \ Field |  F1   |  F2    |  F3   |  F4   |  F5   | Action
R1           |  0-1  |  14-15 |  2    |  0-3  |  0    | accept
R2           |  0-1  |  14-15 |  1    |  2    |  0    | accept
R3           |  0-1  |  8-11  |  0-3  |  2    |  1    | deny
R4           |  0-1  |  8-11  |  2    |  3    |  1    | deny
R5           |  0-15 |  0-7   |  0-3  |  1    |  0    | accept
R6           |  0-15 |  14-15 |  2    |  1    |  0    | accept
R7           |  0-15 |  14-15 |  2    |  2    |  0    | accept
R8           |  0-15 |  0-15  |  0-3  |  0-3  |  0-1  | deny

(a) A group of wildcard rules

Root (cut on Field F2)
├─ F2=[0,7]:   R5, R8^{F2=[0,7]}
├─ F2=[8,11]:  R8^{F2=[8,11]}
└─ F2=[12,15] (cut on Field F3)
   ├─ F3=[0,1]: R2, R8^{F2=[12,15]}^{F3=[0,1]}
   └─ F3=[2,3]: R1, R6, R7, R8^{F2=[12,15]}^{F3=[2,3]}

(b) The decision tree for the rules

Figure 16: Construct decision tree for a group of rules.

The complete algorithm is shown in Algorithm 1, which consists of two key ideas:

Use a decision tree to represent the partition:

13 The proof of the NP-hardness of the partition problem is omitted due to lack of space.


Algorithm 1 Heuristic partition algorithm using decision tree

Initialization:
Step 1: k = 0. T0 is the tree node that represents the entire flow range.

Split the flow range:
Step 2: Increment k. Pick a tree node Tk to split. Tk is a leaf node in the decision tree that requires more than S TCAM entries to represent its hypercube Ck. If we cannot find such a node, stop.
Step 3: Select a flow dimension i that has the maximum number of unique components ui.
Step 4: Pick w boundaries of the unique components that minimize (Σ_{1≤t≤w} f(C_k^t) − f(C_k))/w.
Step 5: Put all the child nodes of Tk in the tree.
Step 6: Go to Step 2.

The root node of the decision tree denotes the hypercube of the entire flow space. Similarly, each node in the decision tree denotes a hypercube of flow range and maintains all the rules intersecting that hypercube. We start with the root node (Step 1). In each round of the splitting process, we pick a node in the decision tree that requires more than S authority rules and split it into a group of child nodes, each of which covers a subset of its parent node's flow range (Step 2). The splitting process terminates when each leaf node has fewer than S authority rules. In the end, the authority rules in each leaf node are assigned to an authority switch.

Cut based on rule boundaries: We first choose thedimension i that has the maximum number of uniquecomponents (i.e., non-overlapping ranges) as the di-mension to split (Step 3). This gives us a better chanceto be able to split the flow range into balanced pieces.For example, in Figure 16, field F2 has four unique com-ponents [0..7], [8..11], [12..13], [14..15].

The next challenge is to split the hypercube in the selected dimension. As we discussed earlier, partitioning the flow range equally may not be the best choice. It leads to many rules that span multiple partitions, and hence yields more authority rules. Instead, we partition the flow space based on the boundaries of the unique components (Step 4). Let the function f(C) be the number of TCAM entries required for hypercube C. Assume that Ck is the hypercube we want to split in the k-th round of our algorithm. We select w boundaries of the unique components in dimension i (b1 . . . bw) and the resulting sub-hypercubes C_k^1 . . . C_k^w such that the average increase of authority rules per cut, U = (Σ_{1≤t≤w} f(C_k^t) − f(C_k))/w, is minimized. We enumerate all the boundary selections and choose the one with minimal U.

Figure 16 illustrates an example of rule partitioning. Assume that we have S = 4 in each authority switch. Using our partition algorithm, we can construct the decision tree shown in Figure 16(b) for the rules shown in Figure 16(a). In the first round, we choose to partition the root node on field F2, because it has the maximum number of unique ranges. We then divide the root node into three child nodes on field F2: [0..7], [8..11], [12..15]. This yields U = 0 because the rules R3, R4, and R8 that fall in the second child (F2 = [8..11]) all take the deny action. In the second round, since the third child node requires 5 authority rules, we further split it into two child nodes on field F3.
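Steps 3-4 of the algorithm can be sketched as follows (an illustrative re-implementation, not the authors' code, with f(C) approximated as the number of rules intersecting C and w interior cut points per split):

```python
from itertools import combinations

def boundaries(rules, dim):
    """Boundary points of the unique components in one dimension."""
    return sorted({p for r in rules for p in (r[dim][0], r[dim][1] + 1)})

def f(rules, cube):
    """TCAM-cost proxy: number of rules intersecting hypercube `cube`."""
    return sum(all(r[d][0] <= cube[d][1] and cube[d][0] <= r[d][1]
                   for d in range(len(cube))) for r in rules)

def best_split(rules, cube, w):
    """Pick the dimension with the most unique components (Step 3), then the
    w cut boundaries minimizing the average rule increase per cut,
    U = (sum_t f(C_k^t) - f(C_k)) / w (Step 4)."""
    dim = max(range(len(cube)), key=lambda d: len(boundaries(rules, d)))
    interior = [p for p in boundaries(rules, dim)
                if cube[dim][0] < p <= cube[dim][1]]
    best = None
    for cuts in combinations(interior, w):
        edges = [cube[dim][0]] + list(cuts) + [cube[dim][1] + 1]
        subs = []
        for lo, hi in zip(edges, edges[1:]):
            sub = list(cube)
            sub[dim] = (lo, hi - 1)           # sub-hypercube along `dim`
            subs.append(tuple(sub))
        U = (sum(f(rules, s) for s in subs) - f(rules, cube)) / w
        if best is None or U < best[0]:
            best = (U, dim, subs)
    return best

# Two rules that split cleanly at x = 8: the cut adds no extra entries.
rules = [((0, 7), (0, 3)), ((8, 15), (0, 3))]
U, dim, subs = best_split(rules, cube=((0, 15), (0, 3)), w=1)
print(U, dim)
```

Iterating this split on every leaf whose cost exceeds S, as in Algorithm 1, builds the full decision tree; the exhaustive enumeration of boundary subsets mirrors the paper's "enumerate all the boundary selections" step and is exponential in w, which is why small w is used per round.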

Note that, although our partition algorithm is motivated by both HiCuts [25] and HyperCuts [26], which explored the efficient software processing of packet classification rules in one switch with the help of a decision tree, we differ in our optimization goals. Both HiCuts and HyperCuts sought to speed up software processing of the rules, while DIFANE aims at minimizing the TCAM usage in switches, because the partition and authority rules are all processed in hardware. This leads to two key design differences in our algorithm: (i) we cut the selected dimension based on the rule boundaries, while HiCuts and HyperCuts cut the selected dimension equally into c cuts; (ii) we choose the cuts so that the number of TCAM entries per cut is minimized, whereas HiCuts and HyperCuts optimize the number of cuts in order to minimize the size of the tree (and especially its depth). DIFANE allows a slightly deeper decision tree if it reduces TCAM usage.
