
Proceedings of the 2014 Winter Simulation Conference
A. Tolk, S. D. Diallo, I. O. Ryzhov, L. Yilmaz, S. Buckley, and J. A. Miller, eds.

A SIMULATION AND EMULATION STUDY OF SDN-BASED MULTIPATH ROUTING
FOR FAT-TREE DATA CENTER NETWORKS

Eric Jo
Deng Pan
Jason Liu

School of Computing and Information Sciences
Florida International University

Miami, Florida, USA

Linda Butler

Department of Computer Science and Engineering
University of South Florida

Tampa, Florida, USA

ABSTRACT

The fat tree topology with multipath capability has been used in many recent data center networks (DCNs) for increased bandwidth and fault tolerance. Traditional routing protocols have only limited support for multipath routing, and cannot fully utilize the available bandwidth in such networks. In this paper, we study multipath routing for fat tree networks. We formulate the problem as a linear program and prove its NP-completeness. We propose a practical solution, which takes advantage of the emerging software-defined networking paradigm. Our algorithm relies on a central controller to collect necessary network state information in order to make optimized routing decisions. We implemented the algorithm as an OpenFlow controller module and validated it with Mininet emulation. We also developed a fluid-based DCN simulator and conducted experiments, which show that our algorithm outperforms the traditional multipath algorithm based on random assignments, both in terms of increased throughput and in reduced end-to-end delay.

1 INTRODUCTION

Data center networks (DCNs) must provide efficient inter-connection between a large number of servers in order for the data centers to achieve the necessary economy of scale (Mudigonda and Yalagandula 2010, Wang et al. 2010). In order to provide abundant bisection bandwidth, modern DCNs usually adopt a hierarchical multi-rooted network topology, such as the fat tree (Al-Fares, Loukissas, and Vahdat 2008), VL2 (Greenberg et al. 2009), DCell (Guo et al. 2008), and BCube (Guo et al. 2009). The hierarchical structure makes such topologies easy to expand, and the multi-rooted topology allows more routing options due to multipath.

Among the existing alternatives, the fat tree topology is especially popular for its simplicity, and has been considered in a number of recent data center designs (Heller et al. 2010, Mysore et al. 2009). Figure 1 shows a 4-pod fat tree with four hierarchical layers, including, from the bottom, hosts, edge switches, aggregation switches, and core switches. The four core switches at the top are the roots of the tree. The figure shows two of the four possible paths between host A and host B in different colors. In this network, all switches have the same number of ports, and all links have the same bandwidth. We refer the readers to Al-Fares, Loukissas, and Vahdat (2008) and Mysore et al. (2009) for a detailed description of the fat tree topology.

Traditional link-state and distance-vector routing protocols, which have been commonly used on the Internet (Kurose and Ross 2007), cannot be readily used to take advantage of the multipath capability of the fat tree topology. For example, most IP-based routing protocols calculate routes and forward packets based solely on the packet's destination address, and as a result, all packets with the same destination would follow the same path.


Figure 1: The fat tree topology with four pods.

There are indeed protocols that support multipath. For example, some protocols come with the “equal cost multipath” (ECMP) feature. ECMP performs static traffic splitting based on information from the packet headers. In particular, the algorithm does not consider the bandwidth requirement of the flows. It is rather limited in terms of path optimization, since it does not consider the traffic condition of the entire network. Furthermore, the algorithm is only able to choose among the available paths with the same minimum cost. Consequently, the algorithm is limited to only a small number of alternative paths (Heller et al. 2010). One must note that DCN routing is significantly different from Internet routing. Normally, the Internet protocols give preference to the shortest paths in order to reduce latency. By contrast, DCNs are less concerned about latency; after all, DCNs are local area networks with short geographical distances between nodes. Routing in DCNs should therefore give priority to bandwidth utilization rather than latency.

The goal of our study is to fully explore the potential of the fat tree topology for data center networking. In particular, we design practical and efficient multipath routing algorithms with the following three objectives. First, the algorithms shall improve the bandwidth utilization of the entire network; in other words, they should accommodate more traffic on the same network fabric. Second, the algorithms shall achieve better traffic load balance, preventing the network from creating hot spots and causing congestion. Last, the algorithms shall have low time complexity, so that they can make fast routing decisions in order to handle large traffic volumes.

In this paper, we focus on load-balanced multipath routing for fat tree DCNs. First, we formulate the multipath problem as a linear program, and prove that it is NP-complete. The NP-completeness proof establishes that the problem has no known polynomial-time solution, which directs us to seek efficient algorithms using approximation and heuristics. As a practical solution, we propose a load-balanced multipath routing algorithm that takes advantage of the emerging Software Defined Networking (SDN) paradigm (Open Networking Foundation 2012). The main idea is to use an OpenFlow controller to collect the necessary information about the entire network, based on which the algorithm can make centralized routing decisions upon each flow arrival. More specifically, our algorithm first determines all available paths that can carry the new flow, and then finds the bottleneck link of each path. A bottleneck link is defined to be the link with the minimum available bandwidth among all links along the path. The algorithm compares the available bandwidth of the bottleneck links of the different paths, and selects the path whose bottleneck link has the most available bandwidth. In this way, the algorithm guarantees that the selected path minimizes the maximum link load of the entire network at the decision time. Consequently, the algorithm is able to achieve better load balance.

To validate our designs, we have implemented the proposed load-balanced multipath algorithm in two environments. First, we implemented the algorithm as a module for an open-source OpenFlow controller, which can be readily deployed in an existing data center. We used Mininet (Lantz, Heller, and McKeown 2010) to validate the implementation. Mininet is a container-based emulation environment for flexibly testing SDN applications by running them in networks consisting of virtual hosts and virtual switches, all on a single physical machine.


Second, we developed a fluid-based DCN simulator to evaluate the performance of our load-balanced multipath algorithm. The simulator was implemented using the MiniSSF parallel discrete-event simulation framework (Rong, Hao, and Liu 2014). In the experiments, we compared the performance of our algorithm with that of a multipath algorithm based on random path allocations. Our experiment results show that our load-balanced multipath routing algorithm consistently outperforms the traditional approach.

The main contributions of this paper are three-fold. First, we formulate the load-balanced multipath routing problem for the fat tree network as a linear program, and prove that it is NP-complete. Second, we propose a practical algorithm that can be applied in a data center using the SDN technology. Finally, through extensive emulation and simulation experiments, we demonstrate the performance superiority of the proposed algorithm over the existing approach.

The rest of the paper is organized as follows. In section 2, we review related work. In section 3, we formulate the load-balanced multipath problem as a linear program and prove its NP-completeness. In section 4, we describe the proposed SDN-based algorithm. In section 5, we describe the two implementations, one for emulation and the other for simulation, which we use to evaluate our algorithm. In section 6, we present the experiment results. Finally, we conclude the paper and outline future work in section 7.

2 RELATED WORK

Typical DCNs offer multiple routing paths for increased bandwidth and fault tolerance. Multipath routing can reduce congestion by taking advantage of the path diversity in DCNs. Multipath routing can be conducted at either layer two or layer three. Layer two typically uses a spanning tree, where only one path is made available between a pair of nodes. A recent work by Mudigonda and Yalagandula (2010) allows multipath by exploiting the path redundancy in the network. The algorithm computes a set of available paths and merges them into a set of trees, each mapped to a separate VLAN. At layer three, equal cost multipath (ECMP) (Cisco Systems 2007) provides the multipath capability by performing static load splitting among the flows. ECMP-enabled switches are configured with several forwarding paths in a given subnet. When a packet arrives at a switch that has multiple candidate paths, the switch selects a path based on a hash function over selected fields of the packet header, and then forwards the packet along that path. In doing so, the mechanism is able to split the traffic load among multiple paths in a subnet. It is important to note, however, that ECMP does not account for flow bandwidth in making path allocation decisions, which may lead to oversubscription even for simple communication patterns. Furthermore, current ECMP implementations limit the number of candidate paths to 8 or 16, which can be far fewer than what would be needed to deliver the bisection bandwidth for larger data centers (Heller et al. 2010).

There exist multipath solutions specifically designed for DCNs. Al-Fares et al. (2010) proposed two multipath algorithms: one uses global first-fit and the other uses simulated annealing. The first algorithm simply selects, among all possible paths, the first one that can accommodate a flow. The algorithm is simple; its drawback, however, is that it needs to maintain accurate information of all paths between all pairs of nodes. The second algorithm performs a probabilistic search for the optimal path. The algorithm is fast, but it has been shown that it may also converge slowly. Heller et al. (2010) proposed the ElasticTree DCN power management scheme, which includes two multipath algorithms: a greedy bin-packing algorithm and a topology-aware heuristic algorithm. The former evaluates the available paths and deterministically chooses the leftmost one with sufficient capacity. The latter uses a fast heuristic based on the topological features of fat trees; it relies on flow splitting. Benson et al. (2010a) also proposed the MicroTE framework that supports multipath routing. Our algorithm described in this paper also considers multipath. It uses depth-first search to quickly find the possible paths among the layers of switches, and then uses worst-fit to select the candidate path. Compared with the previous approaches, our algorithm provides a fast and effective solution for achieving system-wide traffic load balance using SDN.


3 PROBLEM FORMULATION

In this section, we formulate the load-balanced multipath routing problem as a linear program, and then prove its NP-completeness for the fat tree topology by reduction from the integer partition problem.

3.1 Load-Balanced Multipath Routing Problem

A fat tree DCN can be modeled as a directed graph G = (H ∪ S, E), where H is the set of hosts, S is the set of switches, and E is the set of links, each connecting a switch with another switch or a host. Each edge (v_i, v_j) ∈ E has a nonnegative capacity c(v_i, v_j) ≥ 0, which denotes the available bandwidth of the corresponding link. Let F = {F_1, ..., F_n} be the set of all flows in the network. A flow F_k (where 1 ≤ k ≤ n) is defined as a triple (a_k, b_k, d_k), where a_k ∈ H is the source host, b_k ∈ H is the destination host, and d_k is the demanded bandwidth. We use f_k(v_i, v_j) ∈ {0, 1} to indicate whether the flow F_k traverses the link (v_i, v_j).

The load-balanced multipath routing problem can be expressed as a linear program, whose objective is to minimize the maximum load among all the links in the network. More formally, the load-balanced multipath routing problem can be specified as follows:

Minimize maxload

Subject to the following constraints:

∑_{k=1}^{n} f_k(v_i, v_j) d_k ≤ maxload ≤ c(v_i, v_j),   ∀(v_i, v_j) ∈ E   (1)

∑_{v_i ∈ H∪S} f_k(v_i, v) = ∑_{v_j ∈ H∪S} f_k(v, v_j),   ∀k ∈ {1, 2, ..., n}, ∀v ∈ H ∪ S \ {a_k, b_k}   (2)

∑_{v_j ∈ H∪S} f_k(a_k, v_j) = ∑_{v_i ∈ H∪S} f_k(v_i, b_k) = 1,   ∀k ∈ {1, 2, ..., n}   (3)

Equation (1) defines maxload, and also states the link capacity constraint, i.e., the total demanded bandwidth on a link shall not exceed its available bandwidth. Equation (2) states the flow conservation constraint, i.e., the amount of any flow shall not change at an intermediate node. Equation (3) states the demand satisfaction constraint, i.e., for any flow, the outgoing traffic at the source and the incoming traffic at the destination shall be the same as the demand of the flow.
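For concreteness, the formulation above can be transcribed almost line by line into an optimization solver. The sketch below is our own illustration (not part of the paper's implementation) and uses the open-source PuLP library; the inputs nodes, links, cap, and flows are assumed descriptions of the graph and the flow set.

# Illustrative sketch (not from the paper): the load-balanced multipath formulation in PuLP.
# Assumed inputs:
#   nodes - list of hosts and switches
#   links - list of directed (u, v) pairs
#   cap   - dict mapping (u, v) -> available bandwidth c(u, v)
#   flows - list of (src, dst, demand) triples, one per flow F_k
import pulp

def build_problem(nodes, links, cap, flows):
    prob = pulp.LpProblem("load_balanced_multipath", pulp.LpMinimize)
    maxload = pulp.LpVariable("maxload", lowBound=0)
    # f[k, u, v] = 1 if flow k traverses link (u, v); binary, per the formulation
    f = {(k, u, v): pulp.LpVariable("f_%d_%s_%s" % (k, u, v), cat="Binary")
         for k in range(len(flows)) for (u, v) in links}
    prob += maxload                                  # objective: minimize maxload
    for (u, v) in links:                             # constraint (1): link load and capacity
        load = pulp.lpSum(f[k, u, v] * flows[k][2] for k in range(len(flows)))
        prob += load <= maxload
        prob += maxload <= cap[(u, v)]
    for k, (src, dst, d) in enumerate(flows):
        for v in nodes:                              # constraint (2): flow conservation
            if v in (src, dst):
                continue
            incoming = pulp.lpSum(f[k, u, w] for (u, w) in links if w == v)
            outgoing = pulp.lpSum(f[k, u, w] for (u, w) in links if u == v)
            prob += incoming == outgoing
        # constraint (3): one outgoing link at the source, one incoming link at the destination
        prob += pulp.lpSum(f[k, u, w] for (u, w) in links if u == src) == 1
        prob += pulp.lpSum(f[k, u, w] for (u, w) in links if w == dst) == 1
    return prob

Because the indicator variables f_k(v_i, v_j) are binary, a solver would treat this as an integer program, which is consistent with the NP-completeness result shown next.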

3.2 NP-Completeness

In this section we show the NP-completeness of the load-balanced multipath routing problem.

Theorem 1  The load-balanced multipath routing problem is NP-complete for the fat tree topology.

Proof. The proof is by reduction from the NP-complete partition problem (Garey and Johnson 1990). A partition problem decides whether a set of integers, X = {x_1, ..., x_n}, can be partitioned into two subsets, X_s and X \ X_s, such that the sum of the elements in X_s is equal to that in X \ X_s, i.e.,

∃X_s ⊆ X,   ∑_{x_i ∈ X_s} x_i = (1/2) ∑_{x_i ∈ X} x_i   (4)

For the reduction, we consider an instance of the partition problem with set X, and construct an instance of the load-balanced multipath routing problem for a fat tree. The basic idea is to create a group of flows between two hosts, where the demand of each flow is one of the integers in set X, and to leave only two disjoint paths between the two hosts available for these flows. If there is a partition of the integer set, then we can accordingly allocate the flows to the two paths to achieve optimal load balance, and vice versa. Figure 2 shows an example of one pod of a 4-pod fat tree, where hosts 1 and 3 have only two disjoint paths, ABD and ACD, between them.


Figure 2: One pod from a four-pod fat tree topology.

The detailed reduction is as follows. First, set up a K-pod fat tree, where the capacity of each link is set to infinity to avoid violating the link capacity constraint. Second, select two hosts, denoted as h_1 and h_2, in the same pod but under different edge switches. We will designate all the flows between these two hosts. Third, create n flows, each corresponding to an integer in X, where flow F_i has a_i = h_1, b_i = h_2, and d_i = x_i with x_i ∈ X. Fourth, create an additional K/2 − 2 flows, where each flow F_k has a_k = h_1, b_k = h_2, and d_k = ∑_{x_i ∈ X} x_i / 2.
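As a concrete illustration of this construction (a sketch of our own, assuming flows are represented simply as (source, destination, demand) triples; it is not code from the paper):

# Illustrative sketch (not from the paper): build the flow set used in the reduction
# from a partition-problem instance X, for a K-pod fat tree with hosts h1 and h2
# located in the same pod but under different edge switches.
def reduction_flows(X, K, h1="h1", h2="h2"):
    flows = [(h1, h2, x) for x in X]              # third step: one flow per integer in X
    half_sum = sum(X) / 2.0
    flows += [(h1, h2, half_sum)] * (K // 2 - 2)  # fourth step: K/2 - 2 "blocking" flows
    return flows

# Example: X = {3, 1, 2, 2} partitions into {3, 1} and {2, 2}; with K = 8, the two
# subsets map onto the two paths left vacant by the three blocking flows of demand 4.
print(reduction_flows([3, 1, 2, 2], K=8))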

Suppose that the partition problem for set X has a solution X_s. Note that for the two hosts h_1 and h_2 in a K-pod fat tree, there are K/2 disjoint paths between them, each through a different aggregation switch. In the fourth step of the above reduction process, we created K/2 − 2 flows, each with a demand of ∑_{x_i ∈ X} x_i / 2. Due to load balancing, each of them will be allocated to a different path, leaving two paths vacant. Finally, for each of the n flows created in the third step, if x_i ∈ X_s, we designate the corresponding flow F_i to the first of the two remaining paths; otherwise, we designate it to the second path. In this way, we find a solution for the load-balanced multipath routing problem that minimizes the maximum load to ∑_{x_i ∈ X} x_i / 2 and achieves the optimal load balance.

Conversely, suppose that the load-balanced multipath routing problem has an optimal solution with a maximum load of ∑_{x_i ∈ X} x_i / 2. Each of the K/2 − 2 flows created in the fourth step must occupy a different path. The n flows created in the third step must then share the remaining two paths, from which we can construct the two subsets X_s and X \ X_s accordingly, and thus we find a solution for the partition problem.

4 AN SDN-BASED ALGORITHM

In this section, we propose an SDN-based load-balanced multipath routing algorithm. Software-Defined Networking (Open Networking Foundation 2012) is an emerging networking paradigm that has recently gained tremendous popularity. SDN can be realized using OpenFlow: the network has a central OpenFlow controller that communicates with the OpenFlow switches in the network using the OpenFlow protocol. The controller can be used to collect the network state and direct all traffic flows in the network. Our main idea here is to use the central controller to collect the traffic load information for all the links in the network so that our algorithm can make globally optimized routing decisions. When a new flow enters the system, the switch that first encounters the flow will inform the controller. The controller will enumerate all possible paths that can carry the flow, compare the load of the bottleneck links of those paths, and then select the one with the minimum load. After the path has been determined, the controller will inform all switches along the path and update their flow tables so that subsequent packets of the flow will be forwarded along the same path.

4.1 Design Philosophy

To optimize bandwidth utilization in DCNs, one needs a global view of the available resources in the network (Benson et al. 2010a).


To do that, we implement the load-balanced multipath routing scheme at the central controller, which communicates with the switches in the network using the OpenFlow protocol. Each OpenFlow-enabled switch contains a flow table, which the switch uses to control the flows. A flow can be flexibly defined by any combination of the ten packet header fields identified by the switch, including the source and destination IP addresses, the port numbers, and others. A controller can control the flows in the system by querying, inserting, or modifying the corresponding entries in the flow table of an OpenFlow switch. In particular, the central controller is able to collect bandwidth and flow information of the entire network from the switches in order to make globally optimized routing decisions, and then direct the flows by updating the flow tables at the switches along the flow paths.

OpenFlow devices are already widely available on the market today. Many recent data centers use OpenFlow as part of their network switching fabric. As such, our proposed solution can be readily applied in those data centers to improve their network performance.

4.2 The Algorithm

In the following, we describe the SDN-based load-balanced multipath algorithm in more detail. The algorithm has three steps:

In the first step, the algorithm identifies the highest layer of switches a flow needs to reach, based on the locations of its source and destination hosts. If the source and destination hosts belong to different pods, the path will need to include the core switches. If the two hosts are attached to the same edge switch, the edge switch will simply connect them. If the two hosts are in the same pod but not under the same edge switch, one of the aggregation switches will be needed to connect them.
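A minimal sketch of this first step (our own illustration, assuming the controller keeps a location table mapping each host to its pod and edge switch; the table and field names below are not from the paper's implementation):

# Illustrative sketch (not from the paper): decide the highest switch layer a flow must reach.
# 'loc' is an assumed lookup table: host -> (pod_id, edge_switch_id).
def highest_layer(src, dst, loc):
    src_pod, src_edge = loc[src]
    dst_pod, dst_edge = loc[dst]
    if src_pod != dst_pod:
        return "core"         # different pods: path must go through a core switch
    if src_edge != dst_edge:
        return "aggregation"  # same pod, different edge switches
    return "edge"             # same edge switch connects the two hosts directly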

In the second step, the central controller determines the bottleneck link of each potential path between the source and destination hosts. Note that in the fat tree, the switch at the highest layer uniquely identifies the path between the given source and destination. Since the central controller has the information of every link in the network, it can quickly identify the bottleneck link of a path by comparing the available bandwidth on all the links along the path. The link with the smallest available bandwidth is the bottleneck link. Arbitrary tie-breaking rules can be applied if a path has more than one link with the smallest available bandwidth.

In the third step, the controller compares the available bandwidth of the bottleneck links of all potential paths and selects the one with the maximum value. If the maximum available bandwidth is greater than the demand of the flow, the corresponding path is selected for the flow. The controller will then set up the flow tables of all the switches along the chosen path accordingly. However, if the maximum available bandwidth is less than the flow demand, there does not exist a viable path for this flow. The system may choose to either reject this flow, or admit it at the risk of jeopardizing the quality of service of this flow and of all other existing flows in the system that may intersect with it.
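The second and third steps amount to a worst-fit (widest-bottleneck) selection over the candidate paths. The following sketch is our own summary under simplifying assumptions and is not the Beacon module itself (which is written in Java): paths is a list of candidate paths, each a list of links, and avail maps a link to its available bandwidth as estimated from the port statistics.

# Illustrative sketch (not the paper's Java module): worst-fit path selection.
# 'paths'  - candidate paths for the new flow, each a list of links (from step 1)
# 'avail'  - assumed dict mapping each link to its estimated available bandwidth
# 'demand' - the bandwidth demand of the new flow
def select_path(paths, avail, demand):
    best_path, best_bottleneck = None, -1.0
    for path in paths:
        bottleneck = min(avail[link] for link in path)  # step 2: bottleneck link of this path
        if bottleneck > best_bottleneck:                 # step 3: keep the widest bottleneck
            best_path, best_bottleneck = path, bottleneck
    if best_bottleneck >= demand:
        return best_path   # install flow entries along this path
    return None            # no viable path: reject or admit at risk, per policy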

4.3 Time Complexity

In a k-pod fat tree, which consists of k^3/4 hosts and 5k^2/4 switches, there are at most k distinct paths between any two hosts. For each path, the algorithm needs to check at most 8 links (the maximum path length) in order to find the bottleneck. Therefore, the complexity of the routing process is O(k). In other words, the proposed multipath routing algorithm has a time complexity of O(n^{1/3}) for a network with O(n) nodes.
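To spell out the last step of this argument: the total node count is

n = k^3/4 + 5k^2/4 = Θ(k^3),   so   k = Θ(n^{1/3}),

and the O(k) per-flow routing cost is therefore O(n^{1/3}).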

5 IMPLEMENTATIONS

To validate our designs, we have implemented the proposed algorithm in two environments: as a module of an open-source OpenFlow controller, and in a home-grown DCN simulator.


5.1 OpenFlow Controller Module

We implemented the algorithm as a module of the Beacon controller (version 1.04) (Erickson 2013). The module has two functions: monitoring the link state and calculating paths for new flows.

The Beacon controller module schedules a monitoring event once every T seconds, upon which the controller sends an OFStatisticsRequest command to every switch in the network. Each switch will in turn reply with an OFStatisticsReply message. The controller processes the reply message by extracting the switch ID, the port number, the total number of bytes transmitted, as well as the time the reply message was received. An estimate of the flow rate for each port can be calculated from this information. The size of the messages is quite small: 8 bytes for the OFStatisticsRequest message and 104 bytes for the OFStatisticsReply message. The overhead of the port statistics request and reply was correspondingly small. In a 4-pod or 8-pod fat tree set up in Mininet (Lantz, Heller, and McKeown 2010), we noticed that making T smaller (from 6 to 2 seconds) increased the CPU usage only by approximately 2-5%.
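The per-port rate estimation amounts to differencing the byte counters between consecutive polls. A minimal sketch of this calculation (our own illustration in Python; the actual Beacon module is written in Java, and the variable names below are assumptions):

# Illustrative sketch (not the Beacon module): estimate per-port transmit rate
# from successive byte counters carried in OFStatisticsReply messages.
last = {}  # (switch_id, port_no) -> (tx_bytes, timestamp) from the previous poll

def update_rate(switch_id, port_no, tx_bytes, now):
    key = (switch_id, port_no)
    rate_bps = 0.0
    if key in last:
        prev_bytes, prev_time = last[key]
        elapsed = now - prev_time
        if elapsed > 0:
            rate_bps = 8.0 * (tx_bytes - prev_bytes) / elapsed  # bits per second
    last[key] = (tx_bytes, now)
    return rate_bps  # available bandwidth = link capacity - rate_bps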

The flow path scheduling function requires up-to-date port statistics for load balancing. When an OpenFlow switch receives a packet, the switch will attempt to match the packet header with the entries in its flow table. If the switch cannot find a matching flow entry, it will send the packet to the controller wrapped in an OFPacketIn message. Upon receiving this message, the controller will decide on an optimized path for the flow, as described in Section 4. Once the path is determined, the controller will create and send an OFFlowMod message, which contains the packet header information for flow matching, along with the input and output ports for each switch on the flow path.

Mininet 2.1 was selected for emulation. We studied the algorithm using a 4-pod or an 8-pod fat tree. A Python script was developed to automate the testing process, including switch initialization and connection to the Beacon controller, host initialization, and traffic generation. Traffic consists of UDP flows generated by iperf: each host runs an iperf server, and based on the traffic load, the Python script starts a UDP flow between a randomly chosen pair of hosts (iperf client to iperf server) at certain time intervals.
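A condensed sketch of what such a test script might look like is given below. It is our own illustration, not the paper's actual script; FatTreeTopo is an assumed helper that builds the k-pod fat tree, and the controller address, flow-arrival loop, and rate parameters are stand-ins for the real setup.

#!/usr/bin/env python
# Illustrative sketch (not the paper's script): drive a Mininet fat-tree emulation
# against a remote Beacon controller and generate random iperf UDP flows.
import random, time
from mininet.net import Mininet
from mininet.node import RemoteController
from mininet.link import TCLink

def run(topo, duration=300, mean_interarrival=5.0):
    net = Mininet(topo=topo, link=TCLink, controller=None)
    net.addController('c0', controller=RemoteController, ip='127.0.0.1', port=6633)
    net.start()
    hosts = net.hosts
    for h in hosts:
        h.cmd('iperf -s -u &')                    # every host runs an iperf UDP server
    end = time.time() + duration
    while time.time() < end:
        time.sleep(random.expovariate(1.0 / mean_interarrival))  # Poisson flow arrivals
        src, dst = random.sample(hosts, 2)
        rate = random.uniform(8, 12)              # constant-rate flow, 8-12 Mbps
        src.cmd('iperf -c %s -u -b %.1fM -t 30 &' % (dst.IP(), rate))
    net.stop()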

5.2 DCN Simulation

We also developed a fluid-based DCN simulator to help evaluate the performance of the proposed multipath algorithm in diverse settings. The DCN simulator was implemented using the MiniSSF parallel discrete-event simulation framework (Rong, Hao, and Liu 2014), although it only uses the sequential simulation functions.

The DCN simulator creates the fat tree network model with simulated hosts, switches, and links between the switches and hosts. Currently we use an exponential distribution to model the inter-arrival time of flows, which are randomly placed between the hosts according to a uniform distribution. Here we assume that the flow arrival is a stationary Poisson process. One can capture non-stationary behavior (such as the diurnal effect) by using piecewise-constant flow arrival rates at different time intervals. Such a traffic pattern can also be easily extended to include more realistic scenarios, e.g., Benson et al. (2010), Benson et al. (2010b). Upon each new flow arrival, the simulator invokes a routing oracle to determine its path. The aforementioned load-balanced multipath routing algorithm is implemented as the routing oracle.
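As an illustration of this traffic model (a sketch with names of our own choosing, not the simulator's actual interfaces), stationary Poisson arrivals and the piecewise-constant extension differ only in how inter-arrival times are drawn:

# Illustrative sketch (not the simulator's code): generate flow arrival times.
import random

def poisson_arrivals(rate, horizon):
    """Stationary Poisson arrivals with a constant rate (flows per second)."""
    t, times = 0.0, []
    while True:
        t += random.expovariate(rate)
        if t > horizon:
            return times
        times.append(t)

def piecewise_arrivals(intervals):
    """Non-stationary arrivals from piecewise-constant rates, e.g., to mimic a diurnal pattern.
    'intervals' is a list of (start, end, rate) tuples covering the simulated horizon."""
    times = []
    for start, end, rate in intervals:
        t = start
        while True:
            t += random.expovariate(rate)
            if t > end:
                break
            times.append(t)
    return times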

To achieve high performance for running large-scale network models and scenarios with high traffic intensity, we developed a fluid model to represent the network traffic. Fluid models, e.g., Nicol (2001), Liu et al. (2003), can be applied to reduce the computational complexity, where one uses fluids to represent a series of packets traversing the network. It is also possible to capture the interaction between network flows represented as fluids and those represented by individual packets, e.g., Nicol and Yan (2004), Gu, Liu, and Towsley (2004), Liu (2006), Li, Van Vorst, and Liu (2013). These models have been shown to achieve substantial performance gains, in some cases more than three orders of magnitude, over the traditional packet-oriented approach, while still maintaining good accuracy.


[Figure 3 plot: average throughput (Mb/s) vs. offered load (%); series: simulation throughput, Mininet mean throughput, individual Mininet results.]

Figure 3: Mininet throughput.

[Figure 4 plot: average delay (ms) vs. offered load (%); series: simulation delay, Mininet mean delay, individual Mininet results.]

Figure 4: Mininet delay.

In our case, we developed a simple fluid model that captures the steady-state queuing behavior. Upon the arrival of a new flow, we update the flow arrival rate of the network queues at each hop visited by the flow. We calculate the throughput and end-to-end delay of the flow traversing a queue using a steady-state approximation of the M/D/1/N queue. For simplicity, we do not capture the transient, time-varying aspects of the queuing behavior. Changes in the throughput and delay at one queue are propagated immediately to the subsequent queues. The throughput and delay of other flows traversing the same queue may also need to be updated. To avoid the “ripple effect” (Nicol 2001), where a change in the arrival rate of one flow at one queue may cause changes to the rates of all other flows at the same queue as well as at other queues subsequently traversed by these flows, we apply a dampening factor to prevent small changes from being propagated to subsequent queues. Our fluid model is simple and can accurately capture the fluctuations in flow rates as flows enter and leave the system.
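To illustrate the structure of such a per-queue calculation, the sketch below uses the simpler closed-form M/M/1/N expressions as a stand-in for the paper's M/D/1/N approximation; it is our own illustration, not the simulator's code, and is meant only to show how an aggregate arrival rate yields an effective throughput and a mean delay for a finite-buffer queue.

# Illustrative sketch (not the simulator's code): steady-state metrics of a finite-buffer
# queue, using M/M/1/N formulas as a stand-in for the paper's M/D/1/N approximation.
def finite_queue_metrics(lam, mu, N):
    """lam: aggregate arrival rate, mu: service rate, N: queue capacity (packets)."""
    rho = lam / mu
    if abs(rho - 1.0) < 1e-9:
        p_block = 1.0 / (N + 1)               # all N+1 states equally likely when rho = 1
        mean_in_system = N / 2.0
    else:
        p_block = (1 - rho) * rho**N / (1 - rho**(N + 1))
        mean_in_system = rho / (1 - rho) - (N + 1) * rho**(N + 1) / (1 - rho**(N + 1))
    throughput = lam * (1 - p_block)           # effective (carried) rate
    delay = mean_in_system / throughput if throughput > 0 else 0.0  # Little's law
    return throughput, delay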

6 EXPERIMENTS

We conducted two sets of experiments: one with the Beacon SDN controller and the Mininet emulation platform, and the other with the MiniSSF simulator. The former validates the correctness of the proposed design in a real implemented system. The latter uses simulation to evaluate the algorithm performance for networks with various configurations and of larger scales.

6.1 Emulation

While our OpenFlow controller module implementation can be applied directly in a DCN, we first need to validate its correctness. For this, we chose Mininet (version 2.1.0), which is a container-based emulation environment for testing SDN applications (Lantz, Heller, and McKeown 2010). Mininet can instantiate a set of virtual hosts and virtual switches, which can be connected as an arbitrary network, all on the same machine. Mininet runs on a single Linux kernel for all virtual host instances using a lightweight container-based virtualization solution. Using lightweight container-based virtualization provides a scalable mechanism: one can use Mininet to create hundreds and even thousands of virtual hosts. One limitation is that all instantiated virtual machines must run the same OS kernel (since each container simply runs in a separate kernel namespace). The main drawback of container-based virtualization, however, as we acutely observed in our experiments, is that containers do not impose resource isolation. There is a stringent limit on the amount of traffic that can be emulated by Mininet on a single machine in real time. Moreover, the applications running on the virtual hosts also compete for CPU resources. For our experiment scenarios, potentially with heavy traffic, Mininet cannot produce reasonable results.

To show this limitation, we used Mininet to emulate a fat tree with k = 4 pods, in which case the network contains only 16 hosts and 20 switches. We set the bandwidth of all links to 100 Mbps.


We used iperf to generate UDP flows; the source and destination of the flows were chosen uniformly at random. Each flow generates packets at a constant rate, which is uniformly distributed between 8 and 12 Mbps, with a mean of 10 Mbps. The duration of each flow is 30 seconds. We varied the offered traffic load on the network by changing the mean inter-arrival time between the flows, which is exponentially distributed. For the experiment, we varied the offered load from 1% to 20% of the total bandwidth for connecting the hosts. For k = 4, that is 100 Mbps for each of the 16 hosts, or 1.6 Gbps in total. Each experiment ran for 300 seconds; we conducted 10 runs for each offered load.
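The text does not spell out how a target offered load maps to a mean inter-arrival time; one plausible mapping, assuming the offered load is the long-run average of aggregate flow demand (by Little's law, arrival rate × 30 s × 10 Mbps) over the 1.6 Gbps total host bandwidth, is

offered load = (λ · 30 s · 10 Mbps) / 1.6 Gbps   ⟹   λ = offered load · 1.6 Gbps / (10 Mbps · 30 s),

so that, for example, a 10% offered load would correspond to λ ≈ 0.53 flows per second, i.e., a mean inter-arrival time of roughly 1.9 seconds.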

Figure 3 shows the average throughput of all flows created in the experiment; Figure 4 shows the average delay. In the figures, we plot the results from each of the 10 runs, together with the averages. For comparison, we also show the results from simulation with a similar setting (described in more detail in the next section). From this experiment, we can make two observations. First, there exists significant variation in the Mininet results between runs, especially considering that the results are averages among all flows in each experiment. Second, the throughput drops significantly when the offered load increases; this decrease starts early, even when the offered load is small. The same is true for the end-to-end delay, which increases with the increasing offered load. In this aspect, the emulation results diverge significantly from the simulation results, which show a rather stable behavior when the offered load is below 20%.

There could be many possible contributing factors to this disparity. We measured the CPU utilization during the emulation experiment. The machine on which we ran the experiment has a four-core 3.3 GHz Intel Xeon processor with hyper-threading enabled, and 12 GB of memory. At 1% traffic load, the CPU utilization averaged 77%. At 2% load, it increased to 93%. For an offered load of 3% and more, the CPU utilization was consistently above 97%. At this rate, Mininet was not able to keep up with the emulation events. We believe the fast increase in delay and the rapid decrease in throughput can be attributed to this resource contention. Note that, with such a light traffic load for a small-scale DCN, we cannot conduct meaningful experiments to compare the performance of our load-balanced multipath algorithm. In the next section, we resort to simulation for the performance evaluations.

6.2 Simulation

In this section, we present the simulation results to demonstrate the effectiveness of our algorithm. We consider fat trees of k = 4, 8, and 16. A fat tree contains k^3/4 hosts and 5k^2/4 switches; for k = 16, we have 1024 hosts and 320 switches. We used the same parameters as in the Mininet experiment described in the previous section, except that we increased the range of the offered load up to 90% of the total bandwidth for connecting the hosts. For each offered load we conducted 15 trials, each running for 1000 seconds.

We compare the results from using our load-balanced multipath algorithm (labeled as LBMP) with those from a simple multipath algorithm, which, like ECMP, randomly selects one of the available paths between the source and destination hosts. Figure 5 shows the average throughput achieved under different offered loads, and Figure 6 shows the average end-to-end delay. We plot the 95% confidence interval for the 15 runs at each data point (which is negligible in most cases). As expected, our load-balanced multipath algorithm consistently achieved better throughput and delay compared to the random multipath algorithm. With the same offered load, the end-to-end delay of our algorithm is measured slightly higher for larger fat trees. This is due to the longer average path length. The throughput remains similar between different fat tree sizes, because the algorithm is able to balance the traffic load in the network. This is not true for the random multipath algorithm: at higher traffic intensity, congestion starts to form, which causes the throughput to drop more precipitously, especially for larger network sizes.

The size of the network queues should be an important factor in determining the packet delays. In the previous experiment, we fixed the buffer size at 1000 packets (that is 500 KB, assuming the average packet size is 500 bytes). In the next experiment, we varied the buffer size from 1000 to 16000 packets to examine its effect on the throughput and delay. Figures 7 and 8 show the results for a k = 4 fat tree. We observe that, while the throughput does not change under different buffer sizes, a larger buffer size would increase the end-to-end delay.


[Figure 5 plot: average throughput (Mb/s) vs. offered load (%); series: LBMP and ECMP for k = 4, 8, and 16.]

Figure 5: Simulation throughput.

[Figure 6 plot: average delay (ms) vs. offered load (%); series: LBMP and ECMP for k = 4, 8, and 16.]

Figure 6: Simulation delay.

[Figure 7 plot: average throughput (Mb/s) vs. offered load (%); series: LBMP and ECMP for buffer sizes N = 1000 to 16000.]

Figure 7: Throughput with varying buffer size.

[Figure 8 plot: average delay (ms) vs. offered load (%); series: LBMP and ECMP for buffer sizes N = 1000, 2000, 4000, 8000, and 16000.]

Figure 8: Delay with varying buffer size.

Both algorithms are able to spread the traffic load, and the network has enough capacity to handle the flows originating from the hosts. The queue length remains relatively small, and therefore the blocking probability stays low enough not to significantly affect the throughput. However, with a larger buffer size, the average queue length does increase, which directly affects the queuing delay of the flows traversing the queue, as shown in Figure 8. In all cases, our load-balanced multipath algorithm consistently achieves better performance than the random multipath algorithm, both in increased throughput and in reduced delay.

7 CONCLUSIONS

The fat tree with multipath capability has become a popular topology for data center networks for its increased bandwidth and fault tolerance. In this paper, we propose a load-balanced multipath routing algorithm, which can better distribute the traffic load and utilize the available bandwidth in the fat tree network. We formulate the problem as a linear program and show its NP-completeness by reduction from the integer partition problem. As a practical solution, we propose a load-balanced multipath routing algorithm for SDN-based data center networks. The algorithm uses the central controller to collect information about the network state, and makes optimized load-balancing decisions when admitting new flows to the network. We implemented the proposed algorithm both in Mininet, to establish the basic validation of the implementation, and in a simulator, to conduct extensive performance evaluations. Experiment results show that our algorithm consistently outperforms the traditional multipath algorithm using random assignments.

For future work, we plan to conduct more validation experiments. In particular, we would like to compare the simulation results against those from a real physical SDN testbed, and against other multipath algorithms.


We will also conduct experiments with more realistic traffic loads, including the use of TCP, for a better representation of data center applications and their network behavior.

ACKNOWLEDGMENTS

This research is supported in part by NSF grant CNS-1117016, and a subcontract from the GENI Project Office at Raytheon BBN Technologies (NSF grant CNS-1346688). We thank the anonymous reviewers for their constructive comments and suggestions.

REFERENCES

Al-Fares, M., A. Loukissas, and A. Vahdat. 2008. “A scalable, commodity data center network architecture”. ACM SIGCOMM Computer Communication Review 38 (4): 63–74.

Al-Fares, M., S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat. 2010. “Hedera: dynamic flow scheduling for data center networks”. In Proceedings of the 7th USENIX Symposium on Networked Systems Design and Implementation, 19:1–19:15.

Benson, T., A. Akella, and D. A. Maltz. 2010. “Network traffic characteristics of data centers in the wild”. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, 267–280.

Benson, T., A. Anand, A. Akella, and M. Zhang. 2010a. “The case for fine-grained traffic engineering in data centers”. In Proceedings of the 2010 Internet Network Management Workshop/Workshop on Research on Enterprise Networking, 2:1–2:6.

Benson, T., A. Anand, A. Akella, and M. Zhang. 2010b. “Understanding data center traffic characteristics”. ACM SIGCOMM Computer Communication Review 40 (1): 92–99.

Cisco Systems 2007. “IP multicast load splitting - Equal Cost Multipath (ECMP) using S, G and next hop”. http://www.cisco.com/en/US/docs/ios/12 2sr/12 2srb/feature/guide/srbmpath.html.

Erickson, D. 2013. “The Beacon OpenFlow controller”. In Proceedings of the 2013 ACM SIGCOMM Workshop on Hot Topics in Software Defined Networking, 13–18.

Garey, M. R., and D. S. Johnson. 1990. Computers and Intractability: A Guide to the Theory of NP-Completeness. New York, NY, USA: W. H. Freeman & Co.

Greenberg, A., N. Jain, S. Kandula, C. Kim, P. Lahiri, D. Maltz, P. Patel, and S. Sengupta. 2009. “VL2: a scalable and flexible data center network”. In ACM SIGCOMM Computer Communication Review, Volume 39, 51–62.

Gu, Y., Y. Liu, and D. Towsley. 2004. “On integrating fluid models with packet simulation”. In Proceedings of the 23rd Annual Joint Conference of the IEEE Computer and Communications Societies, Volume 4, 2856–2866.

Guo, C., G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu. 2009. “BCube: a high performance, server-centric network architecture for modular data centers”. ACM SIGCOMM Computer Communication Review 39 (4): 63–74.

Guo, C., H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu. 2008. “DCell: a scalable and fault-tolerant network structure for data centers”. ACM SIGCOMM Computer Communication Review 38 (4): 75–86.

Heller, B., S. Seetharaman, P. Mahadevan, Y. Yiakoumis, P. Sharma, S. Banerjee, and N. McKeown. 2010. “ElasticTree: saving energy in data center networks”. In Proceedings of the 7th USENIX Symposium on Networked Systems Design and Implementation, 17:1–17:16.

Kurose, J. F., and K. W. Ross. 2007. Computer Networking: A Top-Down Approach. 4th ed. Addison-Wesley.

Lantz, B., B. Heller, and N. McKeown. 2010. “A network in a laptop: rapid prototyping for software-defined networks”. In Proceedings of the 9th ACM Workshop on Hot Topics in Networks, 19:1–19:6.

Li, T., N. Van Vorst, and J. Liu. 2013. “A rate-based TCP traffic model to accelerate network simulation”. SIMULATION 89 (4): 466–480.

Liu, J. 2006. “Packet-level integration of fluid TCP models in real-time network simulation”. In Proceedings of the 2006 Winter Simulation Conference, edited by L. F. Perrone, F. P. Wieland, J. Liu, B. G. Lawson, D. M. Nicol, and R. M. Fujimoto, 2162–2169. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc.

Liu, Y., F. Lo Presti, V. Misra, D. Towsley, and Y. Gu. 2003. “Fluid models and solutions for large-scale IP networks”. In ACM SIGMETRICS Performance Evaluation Review, Volume 31, 91–101.

Mudigonda, J., and P. Yalagandula. 2010. “SPAIN: COTS data-center ethernet for multipathing over arbitrary topologies”. In Proceedings of the 7th USENIX Symposium on Networked Systems Design and Implementation, 18:1–18:16.

Mysore, R. N., A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat. 2009. “PortLand: a scalable fault-tolerant layer 2 data center network fabric”. In ACM SIGCOMM Computer Communication Review, Volume 39, 39–50.

Nicol, D. M. 2001. “Discrete event fluid modeling of TCP”. In Proceedings of the 2001 Winter Simulation Conference, edited by B. A. Peters, J. S. Smith, D. J. Medeiros, and M. W. Rohrer, 1291–1299. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc.

Nicol, D. M., and G. Yan. 2004. “Discrete event fluid modeling of background TCP traffic”. ACM Transactions on Modeling and Computer Simulation 14 (3): 211–250.

Open Networking Foundation 2012. “Software-defined networking: The new norm for networks”. https://www.opennetworking.org/images/stories/downloads/sdn-resources/white-papers/wp-sdn-newnorm.pdf.

Rong, R., J. Hao, and J. Liu. 2014. “Performance study of a minimalistic simulator on XSEDE massively parallel systems”. In Proceedings of the 3rd Annual Conference of the Extreme Science and Engineering Discovery Environment, Article 15.

Wang, G., D. G. Andersen, M. Kaminsky, M. Kozuch, T. S. E. Ng, K. Papagiannaki, and M. Ryan. 2010. “c-Through: part-time optics in data centers”. In ACM SIGCOMM Computer Communication Review, Volume 40, 327–338.

AUTHOR BIOGRAPHIES

ERIC JO is an undergraduate student in the School of Computing and Information Sciences, Florida International University. His email address is [email protected].

LINDA BUTLER is an undergraduate student in the Department of Computer Science and Engineering, University of South Florida. Her email address is [email protected].

DENG PAN is an Associate Professor at the School of Computing and Information Sciences, Florida International University. His research interests include high performance switch architecture and high speed networking. He received the B.S. and M.S. degrees in Computer Science from Xian Jiaotong University, China, in 1999 and 2002, respectively, and the Ph.D. degree in Computer Science from the State University of New York at Stony Brook in 2007. His email address is [email protected]; his home webpage is http://www.cis.fiu.edu/∼pand/.

JASON LIU is an Associate Professor at the School of Computing and Information Sciences, Florida International University. His research focuses on parallel simulation and high-performance modeling of computer systems and communication networks. He received a B.A. degree from Beijing University of Technology in China in 1993, an M.S. degree from the College of William and Mary in 2000, and a Ph.D. degree from Dartmouth College in 2003. He served as General Chair for MASCOTS 2010, SIMUTools 2011, and PADS 2012, and also as Program Chair for PADS 2008 and SIMUTools 2010. He is an Associate Editor for the ACM Transactions on Modeling and Computer Simulation (TOMACS) and for SIMULATION: Transactions of the Society for Modeling and Simulation International. He is also a Steering Committee member for SIGSIM-PADS. His email address is [email protected]; his home webpage is http://www.cis.fiu.edu/∼liux/.