
The Switch Reordering Contagion: Preventing a Few Late Packets from Ruining the Whole Party

Ori Rottenstreich, Pu Li, Inbal Horev, Isaac Keslassy, Senior Member, IEEE, and Shivkumar Kalyanaraman, Fellow, IEEE

Abstract—Packet reordering has now become one of the most significant bottlenecks in next-generation switch designs. A switch practically experiences a reordering delay contagion, such that a few late packets may affect a disproportionate number of other packets. This contagion can have two possible forms. First, since switch designers tend to keep the switch flow order, i.e., the order of packets arriving at the same switch input and departing from the same switch output, a packet may be delayed due to packets of other flows with little or no reason. Further, within a flow, if a single packet is delayed for a long time, then all the other packets of the same flow will have to wait for it and suffer as well. In this paper, we suggest solutions against this reordering contagion. We first suggest several hash-based counter schemes that prevent inter-flow blocking and reduce reordering delay. We further suggest schemes based on network coding to protect against rare events with high queueing delay within a flow. Last, we demonstrate using both analysis and simulations that the use of these solutions can indeed reduce the resequencing delay. For instance, resequencing delays are reduced by up to an order of magnitude using real-life traces and a real hashing function.

Index Terms—Switching theory, packet-switching networks

1 INTRODUCTION

1.1 Switch Reordering

Packet reordering is emerging as a significant design issue in next-generation high-end switch designs. While it was easier to prevent in older switch designs, which were more centralized, the increasingly distributed nature of switch designs and the growth in port numbers are making it harder to tackle [1].

Fig. 1 illustrates a simplified multi-stage switch architecture, with input ports, middle elements, and output ports. It could for instance be implemented using either a Clos network or a load-balanced switch [2]. Variable-size packets arriving to the switch are segmented into fixed-size cells. Each such cell is load-balanced to one of the middle elements, either round-robin or uniformly-at-random. It is later sent to its appropriate output. There, in the output resequencing buffer, it waits for late packets from other middle elements, so as to depart the switch in order.

For instance, assume that several cells arrive in order to the switch and belong to the same switch flow, i.e. share the same input and output. When the first cell arrives to the output, it can depart. However, if the second and third cells still wait to be transmitted out of their middle elements, then the subsequent cells cannot depart the switch, because they must wait for the second and third cells to be sent in order.

More generally, packet reordering occurs when there is a difference in delay between middle elements. In practice, this can happen for many reasons: e.g., a temporary congestion at a middle element following the sudden arrival of a few high-fanout multicast packets from different inputs, a periodic maintenance algorithm, or a flow-control backpressure message that temporarily stops transmissions out of a middle element but has not yet stopped arrivals to it [3].

Because of packet reordering, a large delay at a single middle element can affect the whole switch. For instance, assume that a given middle element temporarily stops transmitting cells because of a flow-control message. Consider an arbitrary switch flow, i.e. a set of packets sharing the same switch input and output. If the switch-flow cells are load-balanced in round-robin order across middle elements, and the switch flow has at least as many cells as there are middle elements, then at least one cell will be stuck at the blocked middle element. In turn, this will block the whole switch flow, since all other subsequent cells are waiting for it at the output resequencing buffer. Therefore, the switch will behave as if all the middle elements were blocked. Instead of affecting only the fraction of the traffic traversing a single middle element, the blocking potentially affects the whole traffic.

Reordering causes many issues in switch designs. The cells that are waiting at the output resequencing buffer are causing both increased delays and increased buffering needs. In fact, in current switch designs, the resequencing buffer sizes are typically pushing the limits of available on-chip memory. In addition, reordering buffers cause implementation complexity issues. This is because the output resequencing structures, shown in Fig. 1, are conceptually implemented as linked lists.

• O. Rottenstreich and I. Keslassy are with the Department of Electrical Engineering, Technion, Haifa 3200003, Israel. E-mail: {or@tx, isaac@ee}.technion.ac.il.

• P. Li is with ASML Netherlands B.V., De Run 6501, 5504 DR Veldhoven, The Netherlands. E-mail: [email protected].

• I. Horev is with the Department of Mathematics and Computer Sciences, Weizmann Institute of Science, Rehovot 761001, Israel. E-mail: [email protected].

• S. Kalyanaraman is with IBM Research, Bangalore 560045, India. E-mail: [email protected].

Manuscript received 19 Apr. 2012; revised 07 Oct. 2012; accepted 12 Nov. 2012; published online 9 Dec. 2012; date of current version 29 Apr. 2014. Recommended for acceptance by V. Eramo. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TC.2012.288


Therefore, longer resequencing buffers imply that designers need to insert cells into increasingly longer linked lists within a small constant average time.

There are two ways to address these reordering issues. First, it is possible to change routing. One commonly-proposed solution to fight reordering is to hash each flow to a random middle switch [4]–[6]. However, while this indeed prevents reordering, it also achieves a significantly lower throughput, especially when dealing with bigger flows. In addition, other solutions rely on a careful synchronization of packets to avoid reordering [2], [4]. Yet these solutions are often too complex for typical high-performance routers with high internal round-trip-times and large packet rates. In this paper, we assume that routing is given, and do not try to improve it in any way.

A second approach, which is often adopted by switch vendors, is to use significant speedups on the switch fabrics (typically 2X, but published numbers go up to 5X [6], [7]). However, this approach also wastes a large part of the bandwidth. Our goal is to suggest ways of scaling routers in the future without further increasing this speedup.

1.2 Reordering Contagion and Flow Blocking

In this paper, we divide reordering effects into two different types, and suggest algorithms to deal with each type. More precisely, we consider two types of reordering contagion, i.e. cases where the large delay of one cell can significantly affect the delay of many other cells:

• A contagion within flows, e.g. when a large packet is segmented into many cells, and one late cell makes all other cells wait.
• A contagion across flows, e.g. when a late cell of one flow makes cells of other flows wait.

The first type is easy to understand. The paper suggests dealing with it by using network coding, so as to not have any irreplaceable cell.

The second type is due to flow blocking, as explained below, and is tackled using hash-based counter schemes.

Today, switches guarantee that packets belonging to the same switch flow, i.e. arriving at the same switch input and departing from the same switch output, will depart in the same order as they arrived. For switch vendors as well as for their customers, breaking this switch-flow order guarantee is not an option.

To the best of our knowledge, all current switch designers provide this guarantee.1 This holds even though no standard requires it; in fact, the IPv4 router standard does not even forbid packet reordering (Section 2.2.2 in [9]). Today, it is a significant customer requirement that needs to be addressed (it is well known that a major core-router vendor lost market share because its routers experienced some limited reordering).

However, this guarantee has been increasingly hard to address in high-end multi-stage switches because of the complex and often-overlooked flow blocking interactions. Within a switch flow, let a flow be the set of all packets that also share the same (source, destination) IP address pair. Then flow blocking happens when, in order to satisfy the switch-flow order guarantee, packets from one flow wait for late packets from another flow within the same switch flow.

To illustrate this flow blocking phenomenon, Table 1 details the packets from Fig. 1. Assume that even though the first six packets belong to the same switch flow, they do not all belong to the same flow: two of them belong to one flow (shown with a circle in Fig. 1) since they have the same (source, destination) IP pair, while the remaining packets belong to four other flows (shown, e.g., with a pentagon and a triangle). An additional packet belongs to a different switch flow and can independently depart the switch. As previously mentioned, the packet that arrives to the switch first can depart in order. However, several of the later packets are out of switch-flow order, and therefore need to wait at the output for the late packets that precede them. They can only depart from the switch when these late packets arrive, and are blocked meanwhile. In particular, note that these packets are blocked by a late packet from a different flow: this is flow blocking.

TABLE 1
An Example of Packets with Their Flow IDs and Switch Flow IDs

The first six packets belong to the same switch flow, i.e., share the same switch input and switch output. An additional packet belongs to a second switch flow. In addition, the first six packets include packets from 5 different flows, i.e., have 5 possible (source, destination) IP address pairs.

Fig. 1. Switch flow blocking in the switch architecture.

1. However, since switch internal details are often kept confidential, we could not find a publicly available reference. [8] (in Chapter 4), [7], [6] (in Appendix B) refer to the use of internal sequence numbers to maintain packet order, but none of these detail how the sequence numbers are formed.


1.3 Our Contributions

In this paper, we propose solutions for the two different types of the reordering contagion. We first suggest hash-based schemes to separate between flows more accurately and reduce the inter-flow blocking, and then suggest a coding scheme to reduce the intra-flow reordering delay contagion.

First, instead of providing a switch-flow order guarantee, we only provide a flow order guarantee, so that packets of the same flow are still maintained in order, but are not constrained with respect to packets of different flows. As mentioned previously, this is compatible with all known standards. We then suggest schemes that use this new order constraint to reduce resequencing delays in the output resequencing buffers. Unlike most previous papers on switch reordering, these schemes do not change the internal load-balancing algorithm, and therefore do not change the internal packet reordering either. In other words, our resequencing schemes are end-to-end in the switch, in the sense that they only affect the input and output buffers, and not any element in-between, so as to be transparent to the switch designer's internal routing and flow-control algorithms. Our schemes use various methods to increasingly distinguish between flows and decrease flow blocking. Our first scheme, Hashed-Counter, uses a hash-based counter algorithm. This scheme extends to switch fabrics the idea suggested in [10] for network processors. It replaces each single flow counter with an array of hashed flow counters, and therefore effectively reduces flow blocking within large switch flows to flow blocking within smaller flow aggregates.

Then, our second scheme, the Multiple-Counter scheme, uses the same counter structure, but replaces the single hash function by several hash functions to distinguish even better between flows and therefore reduce flow blocking.

Fig. 2 illustrates the Hashed-Counter scheme and the Multiple-Counter scheme in comparison with a simple counter-based scheme called the Baseline scheme. The single counter is replaced with an array of counters in the Hashed-Counter and the Multiple-Counter schemes. While a packet from a given flow is hashed to a single counter in the Hashed-Counter scheme, it is hashed to several counters in the Multiple-Counter scheme.

Finally, our last scheme, the B_h-Multiple-Counter scheme, attempts to reduce flow blocking even further by using variable-increment counters based on B_h sequences. All these schemes effectively attempt to reduce hashing collisions between different flows within the same switch flow. Note that in the worst case, even if all hashes collide, all these schemes guarantee that they will not perform worse than the currently common scheme, which uses the same counter for all flows and therefore has the worst flow blocking within any switch flow.

However, if a packet is delayed in a long queue, these counter-based schemes cannot prevent it from affecting many packets within its flow. Therefore, we suggest using network coding to reduce reordering delay. We introduce several possible network coding schemes and discuss their effects on reordering delays. In particular, we show the existence of a time-constrained network coding that is not necessarily related to channel capacity optimality.

Finally, using simulations, we show how the schemes can significantly reduce the total resequencing delay, and analyze the impact of architectural variables on the switch performance. For instance, resequencing delays are reduced by up to an order of magnitude using real-life traces and a real hashing function.

Note that since they do not affect routing, our suggested schemes are general and can apply to a variety of previously-studied multi-stage switch architectures, including Clos, PPS (Parallel Packet Switch), and load-balanced switch architectures [2], [11]–[14]. They can also apply to fat-tree data center topologies with seven stages [15]–[17], and more generally to any network topology with inputs, outputs, and load-balancing with reordering in-between. While we do not expand on these for the sake of readability, we believe that our schemes can decrease resequencing delay in all these potential architectures. (For instance, to avoid reordering, data center links are currently oversubscribed by factors of 1:5 or more [15]–[17]. If reordering did not incur such a high resequencing delay, it might be easier to efficiently load-balance packets and fully utilize link capacities.)

1.4 Related Work

Resequencing schemes for switches are usually divided into two main categories. First, counter-based schemes, which rely on sequence numbers. For instance, Turner [18] describes an implementation of a counter-based scheme that corresponds to the Baseline scheme, as described later.

Second, timestamp-based schemes, which rely on timestamps deriving from a common clock. Henrion [19] introduces such a scheme with a fixed time threshold. Turner [20] presents an adaptive timestamp-based scheme with congestion-based dynamic thresholds. However, while timestamp-based schemes can be simpler to implement, their delays can become prohibitive when internal delays are highly variable, as explained in [20], because most packets experience a worst-case delay. Therefore, we restrict this paper to counter-based schemes.

Several schemes for load-balanced switches [12], [13] attempt to prevent any reordering within the switch. [2], [4], [21] provide an overview of such schemes.

Fig. 2. Illustration of the Baseline scheme, the Hashed-Counter scheme and the Multiple-Counter scheme. The single flow counter from the Baseline scheme is replaced with an array of hashed flow counters in the Hashed-Counter scheme and the Multiple-Counter scheme. A packet from a given flow is hashed to a single counter in the Hashed-Counter scheme and to several counters in the Multiple-Counter scheme.


In particular, in AFBR (Application Flow-Based Routing) as well as in [5] and [6], packets belonging to the same hashed flow are forwarded through the same route, thus preventing resequencing but also changing the flow paths and obtaining low throughput in the worst case. For instance, if there are a few large flows, the achieved balancing is only partial.

Resequencing schemes have also been considered in network processors. Wu et al. [22] describe a hardware mechanism in which each flow gets its own counter. The mechanism remembers all previous flows and sequentially adds new flows to the list of chains. It then matches a packet from an existing flow with the right SRAM (Static Random-Access Memory) entry using a TCAM (Ternary Content-Addressable Memory) lookup. Meitinger et al. [10] are the first to suggest the use of hashing for mapping flows into counters in network processors in order to reduce reordering. They further discuss the tradeoff between the large number of counters and the possible collisions. However, all these works on network processors only consider a single input and a single output, and therefore do not consider the complexity introduced by the switch flows. In addition, the load-balancing might be contained in network processors with stateful algorithms, while it is not in switches.

Packet reordering is of course also studied in many additional networking contexts, such as the parallel download of multiple parts of a given file [23], or in ARQ (Automatic Repeat reQuest) protocols [24]. Recently, Leith et al. [25] considered the problem of encoding packets with independent probabilistic delays such that the probability that a packet can be decoded within a given decoding deadline is maximized.

2 THE HASHED-COUNTER SCHEME

2.1 Background

A commonly used scheme for preserving switch flow order is to keep a sequence-number counter for each switch flow. In this scheme, denoted as the Baseline scheme, each packet arriving at a switch input and destined to a switch output is assigned the corresponding sequence number. As illustrated in Fig. 3(a), in each switch input, we keep one counter for each switch output. Then, the switch output simply sends packets from this switch input according to their sequence number. Referring to the example of Table 1, the first six packets all share the same counter as they belong to the same switch flow. Assume that they get sequence numbers 1 through 6 in their arrival order. Then the packet with sequence number 1 can depart without waiting because its sequence number is the smallest, while the packets following the delayed ones need to wait for them. In the end, the departure order follows the sequence numbers. Likewise, the packet of the second switch flow uses a different counter and departs independently.

It would seem natural to similarly preserve flow order by keeping a counter for each flow. However, the potential number of (source, destination) IPv4 flows going through a high-performance switch is too large to keep a counter for each.

We could also devise a solution in which counters would only be kept for the most recent flows. But such a solution might be complex to maintain and not worth its cost and complexity. For instance, a 10 Gbps line with an average 400 B packet size would have to keep up to 3 million flows for the last second. If each flow takes 64 bits to store the (source, destination) IPv4 addresses and 10 bits for the counter, we would need more than 200 Mb of memory, thus requiring expensive off-chip DRAM (Dynamic Random-Access Memory).
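As a back-of-the-envelope check of this estimate (assuming the 64-bit flow key above): a 10 Gb/s line with 400 B (3200-bit) packets carries about 10^10 / 3200, i.e. roughly 3.1 million packets per second, and 3 million flows times (64 + 10) bits per flow is about 222 Mb, indeed more than 200 Mb.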

Instead, the following algorithms rely on hashing to reduce the number of needed counters by several orders of magnitude, in exchange for a small collision probability.

2.2 Scheme Description

Fig. 4(a) illustrates how the Hashed-Counter scheme is implemented in the input port. Arrays of packet counters exist in each input port, one array per output port. For a given output port, the Hashed-Counter scheme uses an array of counters, instead of the single counter of the Baseline scheme.

Fig. 3. Baseline scheme.

Fig. 4. Hashed-Counter scheme.


As shown in Fig. 4(b), each incoming packet is hashed to a specific counter based on its flow source and destination IP addresses. The counter value is then incremented, and the packet is assigned this value as its sequence number. For instance, a packet belonging to some flow is hashed to a counter whose incremented value is 9, and it is therefore assigned sequence number 9. This sequence number is inserted into the packet, which is forwarded to the middle switch element.

At the output resequencing buffer, the same hash function is used as well. Therefore, it will yield the same counter index. All packets hashing to the same counter will then leave in order, as indicated by their sequence numbers.

Since the hash function is kept constant, all packets of the same flow always use the same counter. Therefore, the switch is flow order preserving.
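To make the mechanics concrete, here is a minimal Python sketch of the input-side stamping and the matching output-side release test for one (input, output) pair. This is our illustration, not the paper's implementation; the class names and the use of a 64-bit slice of SHA-256 as the hash are assumptions.

```python
import hashlib

def flow_counter_index(flow_id: str, m: int) -> int:
    # The same deterministic hash must be used at the input and at the output.
    digest = hashlib.sha256(flow_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % m

class HashedCounterInput:
    """Input-side stamping: an array of m counters for one (input, output) pair."""
    def __init__(self, m: int):
        self.counters = [0] * m

    def stamp(self, flow_id: str):
        """Return the (counter index, sequence number) carried in the cell."""
        i = flow_counter_index(flow_id, len(self.counters))
        self.counters[i] += 1
        return i, self.counters[i]

class HashedCounterOutput:
    """Output-side test: per counter, packets leave in sequence-number order."""
    def __init__(self, m: int):
        self.expected = [1] * m      # next expected sequence number per counter

    def can_release(self, i: int, seq: int) -> bool:
        return seq == self.expected[i]

    def on_release(self, i: int):
        self.expected[i] += 1        # the next packet of this counter may now leave
```

A packet whose test fails simply waits in the linked list of its counter, as in Fig. 4(c).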

Note that the Baseline scheme is a private case of the Hashed-Counter scheme with a single counter per array. Intuitively, the Hashed-Counter scheme splits all the flows that are part of a switch flow into sets of flows, and keeps the order within each set. Therefore, a late packet will only delay packets within its set, and not affect the packets of the other sets. Consider the example in Table 1 again. The packets of the five flows are now hashed to different counters, e.g. the packets of the first flow to the fourth counter, those of two other flows to the second counter, and the remaining packet to the third counter. At the output, a packet may still need to be buffered to wait for a delayed packet that shares its counter, as in the Baseline scheme. However, the packet that is the first to use the third counter does not necessarily need to be buffered until any other packet's arrival. Thus, it can depart from the switch right away. Therefore, the packet departure order is different from the arrival order, and fewer packets are blocked by the delayed packets.

2.3 Output Resequencing Buffer

We now want to illustrate the operation of the output resequencing buffer in the two schemes, i.e., the Baseline and Hashed-Counter schemes, by considering the example above.

1. Baseline Scheme: As shown in Fig. 3(c), the Baseline scheme is easy to implement in our example. All packets are kept in a single linked list. When a packet arrives and is the first one in the linked list, it is ready to depart. Since the first packet has departed, the next expected packet has a counter value of 2. Placeholders are used in the linked list for the missing packets. (Of course, real implementations might be optimized by mixing linked lists and arrays and skipping placeholders, but this is beyond the scope of this paper.)

2. Hashed-Counter Scheme: Fig. 4(c) illustrates how the Hashed-Counter scheme is implemented using separate linked lists, one per counter. Each packet is hashed into its corresponding linked list depending on its flow. When a packet arrives and is the first of its linked list, it is ready to depart.

In this example, two flows hash to the second linked list, which holds a placeholder for a delayed packet. Another flow hashes to the third linked list; as the counter value of its packet is the expected one, that packet leaves immediately. Finally, the remaining flows hash to the last linked list, again with a placeholder for a delayed packet.

3 THE MULTIPLE-COUNTER SCHEME

3.1 Scheme Description

While the Hashed-Counter scheme splits the flows into sets using counters, we would like to use these counters even more efficiently and split the flows into more sets. We suggest using several hash functions, while making sure that the order is still preserved.

The implementation of the Multiple-Counter scheme is illustrated in Fig. 5(a); it is schematically the same as that of the Hashed-Counter scheme. As shown in Fig. 5(b), the Multiple-Counter scheme also keeps an array of counters for each pair of input and output ports. However, each incoming packet is now hashed into several different counters using different hash functions of the flow ID (the case of a single hash function corresponds of course to the previous Hashed-Counter scheme). All the corresponding counter values are incremented and sent with the packet. In this example, in the input port, the packet is hashed to three counters using three hash functions and is assigned the sequence numbers 3, 9, 4, respectively. The sequence numbers are inserted into the packet, which is then forwarded to the middle switch element.

When a packet arrives at the output resequencing buffer, it now needs to check the same counters. In the case that its counter value is the next expected one in at least one of its counters, we can deduce that it is the earliest remaining packet of its flow, and therefore that it can be released. This is because any earlier packet of the same flow would have hashed to the same counters, and therefore would have had smaller counter values in all counters; if the counter value of the arriving packet is the next expected one, it necessarily implies that every such earlier packet has already departed.

The Multiple-Counter scheme thus ensures that the switch is flow order preserving. We now describe in greater detail the resequencing buffer implementation.
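The release rule can be sketched in the same style as above, reusing flow_counter_index and deriving the k hash functions by salting the flow ID (again our own simplified illustration: the placeholder and marked-entry bookkeeping of Section 3.2 is omitted, and repeated indices are allowed for simplicity).

```python
class MultipleCounterPair:
    """One (input, output) pair with m counters and k hash functions."""
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counters = [0] * m      # input side
        self.expected = [1] * m      # output side

    def indices(self, flow_id: str):
        # k counter indices per flow, derived by salting the flow ID.
        return [flow_counter_index(f"{j}:{flow_id}", self.m) for j in range(self.k)]

    def stamp(self, flow_id: str):
        """Input side: increment all k counters, return k (index, value) pairs."""
        stamped = []
        for i in self.indices(flow_id):
            self.counters[i] += 1
            stamped.append((i, self.counters[i]))
        return stamped

    def can_release(self, stamped) -> bool:
        # A packet may leave if ANY of its k counters is the next expected one:
        # every earlier packet of its flow used the same k counters, so one
        # matching counter proves that all of them have already departed.
        return any(seq == self.expected[i] for i, seq in stamped)
```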

3.2 Output Resequencing Buffer

As shown in Fig. 5(c), the implementation of the Multiple-Counter scheme is similar to that of the Hashed-Counter scheme.

Fig. 5. Multiple-Counter scheme.


However, there is a small tweak: each packet actually belongs to several linked lists. Therefore, the lists only hold packet pointers, while the real packets sit in the packet buffer. Each arriving packet is hashed into its corresponding linked lists and places a pointer in each. When a packet arrives and is the first of at least one of its linked lists, it is ready to depart.

We refer again to the example in Table 1, with additional assumptions on which counters each flow uses. A packet that is the first in one of its linked lists can leave immediately after its arrival. Later, when another packet arrives whose counter value is the next expected one for one of its lists, it can also leave immediately. Meanwhile, packets that left ahead of a missing packet are temporarily kept in the remaining lists with a special mark; when the missing packet in such a list departs, the links are updated and the expected counter value would then be set to 5 in our example. Finally, consider an arriving packet whose counter values for its two lists are 4 and 2, while the lists expect a packet of counter value 2 and 1, respectively. It is not the first of any of its two linked lists. It is therefore placed in each list, a placeholder is added for the missing packets of each counter, and the real packet is inserted in the packet buffer.

Note that this implementation could be readily optimized by using doubly-linked lists, and by pointing directly to the packets instead of using copies. Thus, each packet would contain two pointers per linked list. It could also be optimized by skipping departed packets in the linked list. All these considerations, however, are beyond the scope of this paper.

4 THE B_h-MULTIPLE-COUNTER SCHEME

4.1 B_h Sequences

While all previous schemes (Baseline, Hashed-Counter and Multiple-Counter) increment their counters by one upon packet arrival, we now want to introduce flow-based variable increments to distinguish between flows even further.

Note how the schemes increasingly distinguish between flows and prevent inter-flow blocking: starting from the Baseline scheme, we replace a single flow counter by an array of hashed flow counters to obtain the Hashed-Counter scheme. Then we replace a single hash function by several hash functions to obtain the Multiple-Counter scheme. Last, we now distinguish between flows by changing the counter increments, as described below.

To do so, we introduce the B_h-Multiple-Counter scheme. In this scheme, we keep an array of pairs of counters in each input for each output, as illustrated in Fig. 6. The scheme is based on B_h sequences [26]. Intuitively, a B_h sequence is a set of integers with the property that all the sums of at most h elements from the set are distinct. Therefore, given such a sum of elements, we can determine whether an element of the sequence is a part of the sum. Recently, it was suggested to use B_h sequences in another network application [27].

Definition 1 (B_h Sequence). Let S = {a_1, ..., a_n} be a sequence of positive integers. Then S is a B_h sequence iff all the sums a_{i_1} + a_{i_2} + ... + a_{i_h} with i_1 <= i_2 <= ... <= i_h are distinct, i.e. all the sums of exactly h (not necessarily distinct) elements from S have different values.

It is easy to see that a B_h sequence has the following property:

Observation 1. If S is a B_h sequence then all the sums of exactly h' elements of S, for any h' <= h, are distinct as well.

Example 1. Consider a set S of four positive integers such that all the 10 sums of 2 (not necessarily distinct) elements of S are distinct. Then S is a B_2 sequence. If, however, two different triples of elements of S share the same sum, then S is not a B_3 sequence.
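Definition 1 can be checked by brute force for small sets. The following sketch (ours) follows the definition, allowing repeated elements in a sum; the set used in the assertions is an illustrative one, not necessarily the example of the paper.

```python
from itertools import combinations_with_replacement

def is_bh_sequence(s, h: int) -> bool:
    """True iff all sums of exactly h (not necessarily distinct) elements
    of s are distinct, i.e. s is a B_h sequence."""
    sums = [sum(c) for c in combinations_with_replacement(sorted(s), h)]
    return len(sums) == len(set(sums))

assert is_bh_sequence([1, 2, 5, 11], 2)      # all 10 two-element sums distinct
assert not is_bh_sequence([1, 2, 5, 11], 3)  # 5+5+5 == 2+2+11 == 15
```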

4.2 Scheme Description

As illustrated in Fig. 6, the scheme uses two sets of hash functions based on the flow ID: first, hash functions whose output points to a pair of counters in the array; and second, hash functions whose output corresponds to a variable increment. At each switch input, each incoming packet is hashed into several different array entries using the first set of hash functions of the flow ID. At each such array entry, there is a pair of counters. The first counter, with fixed increments, is incremented by one. The second counter is incremented by the packet's corresponding element of the B_h sequence S. The values of the pairs of counters are then sent with the packet.

The counters are initialized as follows. At the switch input, in each pair of counters, both counters are first set to zero. At the output, the first counter in each pair is set to one (according to the value of the corresponding counter of the first packet). The second counter is initialized to zero. This is the expected value of the second counter in the first packet minus the corresponding variable increment.

When a packet p arrives at the output resequencing buffer, it now needs to check the same pairs of counters. For each pair of counters, let Δ1 be the difference between the first counter value of this packet and its expected value at the output. We also denote by Δ2 the difference between (a) the second counter value of the packet minus its variable increment, and (b) the expected value as indicated by the second counter at the output. To calculate Δ1 and Δ2 we consider the counter values at the output before they are updated by the current packet p.

As in the Multiple-Counter scheme, if the first counter of the packet is the next expected one, i.e. Δ1 = 0, we can deduce that p is the earliest remaining packet of its flow, and therefore that it can be released. If Δ1 > 0, we also consider the value of Δ2. This difference equals the sum of the variable increments of the Δ1 earlier delayed packets from the current pair of input and output ports.

Fig. 6. B_h-Multiple-Counter scheme in the input, for an arriving packet.


Since these variable increments are based on the B_h sequence S, by definition of the sequence we can determine whether Δ2 is composed of the packet's own increment. If this is not the case, we can release p even though there are Δ1 > 0 earlier delayed packets that used this array entry. This is because any earlier packet of the same flow would have hashed to this pair of counters, and would have incremented this second counter by the same value.

We now consider again the previously suggested example from Table 1, with additional assumptions on the variable increments. Now, each array entry holds a pair of counters. Specifically, we concentrate on one such pair and assume that the flow with two packets increments its variable-increment counter by 2, a second flow increments it by 7, and the flow of the arriving packet increments it by 5. As in the previous scheme, when the packet arrives, its fixed-increment counter values for its two pairs are 4 and 2, while the output expects a packet of counter value 2 and 1, respectively. Therefore, it is not the first of any of its two linked lists. We now consider the values of one pair of counters. After having been updated by the earlier packets, the variable-increment counter of this pair holds the sum of their increments. Due to the missing packets, at the output the value of the corresponding counter before being updated by the current packet is only 2 (after having been updated only by one arrived packet). Thus, we can compute Δ1 and Δ2. Since Δ1 = 2, we can determine whether the sum Δ2 is composed of exactly two elements of S. Since the relevant variable increment of the packet's own flow is 5, and Δ2 cannot be composed of 5 together with another element of S, we can deduce that there are no missing packets of this flow; the packet is therefore the earliest packet of its flow and can be released. Otherwise, we would necessarily have had that Δ2 is composed of 5.

4.3 Output Resequencing Buffer

The implementation of the output resequencing buffer in this scheme is similar to the implementation of the Multiple-Counter scheme, besides one change. The decision whether a packet can be released is based on the values of Δ1 and Δ2. To implement this scheme, we suggest the use of a predetermined two-dimensional binary table based on the B_h sequence S. The value of the table in entry (x, j) equals one iff a sum x can be composed of exactly j elements of S. We can see that the sum Δ2 can be composed of the packet's increment b and Δ1 − 1 other elements of S iff the sum Δ2 − b can be composed of exactly Δ1 − 1 elements of S. Therefore, in order to determine if p can be released, we access the table at entry (Δ2 − b, Δ1 − 1). Since Δ1 <= h, we must have that Δ1 − 1 <= h − 1; together with the bound on the largest possible sum of h elements of S, this limits the table to a small number of memory bits. As illustrated in Fig. 7, in order to determine whether we can release the packet in the example above, we access the table at entry (Δ2 − 5, 1). The bit value of zero means that this value cannot be composed of exactly one element of S, and therefore the packet can be released.
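The table precomputation and release test can be sketched as follows (our reconstruction of the textual description above; the entry convention and the handling of out-of-range values are assumptions).

```python
def build_sum_table(s, h: int, max_sum: int):
    """table[j][x] is True iff x is a sum of exactly j (not necessarily
    distinct) elements of s, for 0 <= j <= h and 0 <= x <= max_sum."""
    table = [[False] * (max_sum + 1) for _ in range(h + 1)]
    table[0][0] = True
    for j in range(1, h + 1):
        for x in range(max_sum + 1):
            table[j][x] = any(x >= a and table[j - 1][x - a] for a in s)
    return table

def can_release(table, delta1: int, delta2: int, b: int) -> bool:
    """Release unless delta2 - b can be a sum of exactly delta1 - 1 elements
    of s, i.e. unless a missing packet may carry the packet's own increment b."""
    if delta1 == 0:
        return True          # no earlier packet of this array entry is missing
    x = delta2 - b
    if x < 0:
        return True          # b cannot possibly be part of the missing sum
    return not table[delta1 - 1][x]
```

Here max_sum would be chosen as the largest possible sum of h elements of S, matching the table-size bound discussed above.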

5 PERFORMANCE TRADE-OFF OVERVIEW

Table 2 provides an overview of the properties of the four schemes. It is meant to give some intuition on the trade-offs involved.

It first shows the collision probability, i.e. the probability that two arbitrary packets of a given switch flow use exactly the same counters (with the same variable increments in the B_h-Multiple-Counter scheme). For simplicity, it assumes that each switch flow consists of an infinity of flows of negligible size and uses uniformly-distributed hash functions. While two packets of a switch flow necessarily collide in the Baseline scheme, since they all share the same counter, their collision probability drops to the inverse of the number of counters in the Hashed-Counter scheme, because they choose uniformly at random among the counters. In the Multiple-Counter scheme, there are many more possible choices of the subset of counters, and therefore a correspondingly smaller collision probability. Thus, the collision probability is significantly smaller.

For the B_h-Multiple-Counter scheme, a collision happens only when, in addition, all the choices of variable increments from the B_h sequence are the same. Consequently, the collision probability is further reduced. Note, as discussed in more detail in Section 7.2, that these collision probabilities are not necessarily precisely equal to the probabilities that a given packet is delayed by a late packet from another flow.

On the other hand, the second column shows that the Hashed-Counter, Multiple-Counter and B_h-Multiple-Counter schemes incur an increase in the number of counters. Also, the third column shows that the Multiple-Counter and B_h-Multiple-Counter schemes increase the packet overhead, since each packet carries several sequence numbers; consider for instance a cell size of 128 B and a counter size of 3 B, for which each carried counter adds over 2% of overhead. While this overhead is small in front of typical switch speedups (e.g. 2X, and up to 5X), a switch designer would still need to take it into account.

Fig. 7. Output resequencing structure for the B_h-Multiple-Counter scheme.

TABLE 2
Scheme Comparison


The last column illustrates schematically the processing complexity needed to insert a packet in the resequencing buffer. The Baseline scheme only needs to search through one list to find the packet position. Likewise, the Hashed-Counter scheme finds the correct list using a single hash function, then goes through the list. However, the Multiple-Counter scheme needs to do so for several lists accessed in parallel. In addition, the B_h-Multiple-Counter scheme needs another hash function and parallel table lookups to interpret the variable-increment information.

Incidentally, note that these complexity measures might be misleading. For instance, the size of the Baseline scheme list is on average many times larger than the lists of the Hashed-Counter, Multiple-Counter and B_h-Multiple-Counter schemes, since their total resequencing buffer size is smaller and they spread it over many lists. Even though finding the largest of several elements on parallel lists takes more time on average than finding a single one, in practice the much shorter lists actually make it significantly easier to reach high access speeds with the Hashed-Counter scheme than with the Baseline scheme.

6 NETWORK CODING

6.1 Coding against Rare Events

While the counter schemes above can reduce reordering delay, the (worst-case) total packet delay is necessarily lower-bounded by the (worst-case) queueing delay. This is because the total delay of a packet is composed of its queueing delay and its reordering delay. If the first packet in a flow of 100 packets is delayed in an extremely long queue, then all the other packets will have to wait for it and suffer as well. Therefore, reordering delay is vulnerable to rare events.

To solve this problem, we suggest considering an intriguing idea: using network coding to reduce reordering delay. While network coding has often been used in the past, it has mainly been intended to address packet losses, and reordering has often only been a minor side effect [28], [29]. We suggest here to use it exclusively to reduce reordering delay by addressing the vulnerability to rare events. Interestingly, we will show that in some sense, the total delay of a packet can be smaller than its queueing delay, because network coding enables the switch output to reconstruct and send it before it actually arrives at the output.

Consider the switch architecture, as shown in Fig. 1, and assume for simplicity that all packets have the same size. To implement network coding, in each input port, we add one packet buffer per switch flow, i.e. one buffer per (input, output) pair over all input ports. Then, for each switch flow, we keep computing the running XOR (exclusive or) function of the previous packets. After a given number of slots (or packets), we send a redundant protection packet that contains this XOR of the previous packets, and re-initialize the XOR computation.

Fig. 8 illustrates an example of use of the protection XOR packet. In the input port, the XOR packet x is generated using the previous three packets a, b and c, i.e. x = a XOR b XOR c. Suppose that packet b is delayed by a relatively long time, so that it arrives to the output port last. Without the XOR packet, the packets after b would have to wait for b to arrive in order to depart the switch. However, when using network coding, packet b is simply recovered by taking the XOR function of x, a and c, because of the simple identity b = x XOR a XOR c. As we can see, packets a, c and the recovered b can depart the switch before the original packet b even arrives at the output. Thus, in this example, the XOR packet mainly helps reduce b's effective queueing delay, and we do not really care about its impact on channel capacity, as in typical network-coding examples.
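The recovery identity is easy to demonstrate with a self-contained sketch (the two-byte 'cells' are just stand-ins for fixed-size cells):

```python
def xor_cells(*cells: bytes) -> bytes:
    """Bytewise XOR of equal-length cells."""
    out = bytearray(len(cells[0]))
    for cell in cells:
        for i, byte in enumerate(cell):
            out[i] ^= byte
    return bytes(out)

a, b, c = b"\x01\x02", b"\x10\x20", b"\x0a\x0b"  # three fixed-size cells
x = xor_cells(a, b, c)             # protection packet sent after the frame
b_recovered = xor_cells(x, a, c)   # the output reconstructs the late cell b
assert b_recovered == b            # b = x XOR a XOR c
```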

To further reduce the total delay, more XOR packets can be generated. However, there is no free lunch. The XOR packets will increase the traffic load in the switch, which will result in higher packet queueing delay in the central stage. Therefore, there should exist an optimal point beyond which XORs no longer help.

To implement the network coding, in each input port, we keep one packet buffer per output port. In this packet buffer we keep the cumulative XOR of the packets of the current switch flow, which might include packets of several flows. The total number of packet buffers over all input ports is thus the number of input ports times the number of output ports.

Since for each recovered packet we would also like to obtain the values of the corresponding counters, the packet buffer should also keep the cumulative XOR of the corresponding counters of the packets, in addition to their data. Thus the required memory size for each pair of input and output ports is the sum of the cell size and the total size of the counters assigned to a single packet. This memory overhead is smaller when the fixed size of the cells is smaller.

After a delayed packet is recovered, when it finally arrives at the output it is simply dropped. We can detect that such a situation has occurred based on the counters at the switch output.

6.2 Network Coding Schemes

There are several possible network coding schemes to decide when the protection XOR packets should be sent. First, as shown in Fig. 9(a), a simple coding scheme is to generate the XOR packets periodically by taking the XOR of the packets in the last few slots. The XOR packet is then inserted following the last of these slots, and covers them. The resulting slots make up a frame. In its header, the protection XOR packet contains the sequence numbers of the first and last packets in the frame, so that the output will be able to know which packets it is supposed to protect.

However, it can practically be further improved in two directions. First, if the traffic load of the flow is light, the number of packets in a frame could be low, and the frame could even be empty.

Fig. 8. Network coding example.


In such a case, the contribution of the XOR packets does not justify their high relative overhead and additional delay. It would then make more sense to insert protection XOR packets every k packets instead of every fixed number of slots. Fig. 9(b) illustrates such a scheme, where the blank boxes stand for the regular packets and the dark boxes for protection XOR packets. The scheme overhead is then clearly one protection packet per k regular packets.
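A sketch of this 'every k packets' variant, reusing xor_cells from the sketch above (the generator interface is our assumption, not the paper's):

```python
def protect_every_k(cells, k: int):
    """Yield ('data', cell) for each regular cell and one ('xor', parity)
    protection packet after every k regular cells."""
    acc, count = None, 0
    for cell in cells:
        yield ("data", cell)
        acc = cell if acc is None else xor_cells(acc, cell)
        count += 1
        if count == k:
            yield ("xor", acc)     # covers exactly the last k regular cells
            acc, count = None, 0
```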

Interestingly, another possible improvement to the first scheme could be to use XORs that protect a variable number of slots, e.g. with cumulative coding. For instance, frames could be grouped into large mega-frames. Within each mega-frame, at the end of every frame, the protection XOR packet cumulatively covers not only the last frame, but the whole time since the start of the mega-frame, excluding other XOR packets. Therefore, there is a global coding instead of a mere local coding. Fig. 9(c) illustrates this cumulative coding scheme with a mega-frame consisting of three frames, each frame having several regular slots and one protection slot.

This last scheme illustrates an interesting coding tradeoff. On the one hand, we would like the coding to be efficient, so in some sense we would like to wait as long as possible until the end of the mega-frame and then use several protection packets, e.g. using a simple Reed-Solomon or Hamming code with fixed packet dependencies. However, if we only release the protection packets at the end of the mega-frame, the coding becomes useless, because the protected packets will probably have already arrived at the output. Therefore, the protection scheme needs some form of time-constrained coding, in the sense that the protection packets need to be close to the packets they protect. That is why the schemes displayed above are more effective for reordering delay than more complex schemes that might be closer to the Shannon capacity bounds.

7 PERFORMANCE ANALYSIS

7.1 Delay Models

We now want to model the performance of the schemes under different delay models. We assume simple Bernoulli i.i.d. arrivals in slotted time, so that the probability of having a packet arrival for a given switch flow at a given slot is equal to a fixed rate. We analyze the performance of the schemes based on several delay models for the queueing delay, which is experienced by each packet in the middle switch elements.

First, we will consider a Rare-Event delay model in which the queueing delay is either some large delay T with probability p, or 0 with probability 1 − p. This delay distribution models the impact of low-probability events in which some packets experience very large delays, and these delays impact other packets. For instance, this could be the case of a middle element with a significant temporary congestion. It could also model a failure probability of p for middle elements, with a timeout value of T at the output resequencing buffer and a negligible queueing time. After time T, an absent packet is declared lost, and the following packets can be released.

We then consider a General delay model with an arbitrary cumulative distribution function F of the queueing delay. Then, we apply the analysis to a Geometric delay model, such that the delay is geometrically distributed. The Geometric delay model will help us get some intuition on the reordering delay in a switch under a given load.
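For intuition, here is a small Monte Carlo sketch of the Baseline scheme under these delay models (our own harness; the parameter values are arbitrary examples, not the paper's settings):

```python
import random

def baseline_delays(num_slots: int, rho: float, sample_delay):
    """Simulate one switch flow under the Baseline scheme: Bernoulli(rho)
    arrivals, i.i.d. queueing delays, strict in-order release at the output.
    Returns a list of (queueing delay, resequencing delay) pairs."""
    delays, release = [], 0
    for t in range(num_slots):
        if random.random() < rho:
            out = t + sample_delay()           # arrival time at the output
            release = max(release, out)        # wait for all earlier packets
            delays.append((out - t, release - out))
    return delays

rare_event = lambda: 50 if random.random() < 0.01 else 0  # Rare-Event model
geometric = lambda: int(random.expovariate(0.5))          # discretized, ~geometric
stats = baseline_delays(10**5, 0.3, rare_event)
```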

In these models, we only take into account the queueing delay in the middle elements and the resequencing delay in the resequencing buffer. Therefore, the total delay is the sum of the two. We neglect the propagation delays, as well as the additional queueing delays in the inputs and outputs. We also assume uniformly-distributed hash functions. For the sake of readability, the performance of the B_h-Multiple-Counter scheme is presented here only for part of the delay models.

7.2 Delay Distributions

7.2.1 Rare-Event Delay Model

We now want to analyze the Baseline, Hashed-Counter, Multiple-Counter and B_h-Multiple-Counter schemes. Note that the performances of the Hashed-Counter, Multiple-Counter and B_h-Multiple-Counter schemes depend on the flow size distribution. For instance, if a switch flow consists of a single large flow, then there is no point in adding counters to distinguish between flows. We have also developed a full model based on the flow sizes. However, for the sake of readability, we will only present below the much simpler case in which all flows have negligible size.

In Delay Model 1, the queueing delay has one of two possible values: it is equal to 0 w.p. 1 − p, and to T w.p. p. The queueing delay is chosen at random independently for each packet. If it equals T, we say that the packet is delayed. Using this delay model, the total delay has an upper bound of T. The reason is that the queueing delay is either 0 or T. Therefore, T time slots after its arrival at the switch input, a packet is guaranteed to have arrived at the output. In addition, due to the upper bound of T on the queueing delay, all its earlier packets are also guaranteed to reach the output no later than that time. In particular, in this delay model, the total delay never exceeds T.

We present the CDF (Cumulative Distribution Function) of the total delay for the different counting schemes. To calculate this function, we have to know the probability of the arrival of another packet of the same flow in the current input-output pair, in each of the previous time slots.

Fig. 9. Network coding schemes.


For simplicity, we assume that this probability is independent of the time difference between the two packets. According to the assumption that all flows have negligible size, and that packets are arbitrarily assigned a flow from the possible set of flows, this probability can be derived directly from the arrival rate and the number of counters.

Since the total arrival process for this pair is Bernoulli distributed with a fixed probability, we can also deduce the probability of an arrival of a packet of a different flow.

The first theorem is about the Rare-Event delay model. As mentioned earlier, in this delay model the worst-case delay is T, and therefore the total delay never exceeds T.

Theorem 1 (Rare-Event Delay Model).
i. Using the Baseline scheme, for t < T,
ii. Using the Hashed-Counter scheme, for t < T,
iii. Using the Multiple-Counter scheme, for t < T,
iv. Using the B_h-Multiple-Counter scheme, for t < T,

Proof.
i. The probability for a given packet not to experience a queueing delay of T is 1 − p. In that case, its reordering delay will be at most t if it does not need to wait for earlier delayed packets beyond t slots. Assume that our packet arrived at time t0. An earlier packet that arrives at time t0 − i and is delayed by T slots will arrive at time t0 − i + T; if i < T − t, this causes our packet to wait beyond time t0 + t and therefore miss the time constraint of t. Therefore, to reach this time constraint, each of the T − t slots immediately preceding the packet's arrival (included) should either not receive a packet, or not encounter delay. The probability of this event is the product of the corresponding per-slot probabilities.
ii. This is the same result as above, given a uniformly-distributed hash function and therefore a smaller probability of having another packet share the same counter.
iii. In the Multiple-Counter scheme, a packet can leave the switch if at least one of its counters is in order. Therefore, the reordering delay exceeds t only if the reordering delay in all of its counters exceeds t, hence the exponent in the formula. Further, a given counter is shared with a given other packet with a fixed probability, since this other packet also uses a subset of the counters. The remainder of the formula is then the same as above.
iv. In the B_h-Multiple-Counter scheme, a packet can leave the switch if in at least one of its counters there are at most h earlier missing packets, and each of them does not use the same value from the B_h sequence as the current packet. As explained for the previous scheme, in a time slot we have a delayed packet that uses a specific counter with a fixed probability; a specific element of the sequence is then not used with the complementary probability, as above. ◽

7.2.2 General Delay Model

In Delay Model 2, the queueing delay is distributed according to a general distribution function F. For example, this delay might be distributed as the delay of a Geom/Geom/1 queue with Bernoulli-distributed arrivals and Bernoulli-distributed service in each time slot. Using this delay model, the resequencing delay and the total delay are not bounded from above. Here, a packet that enters its input port at time t0 waits at most t slots in total if this packet and all the other previous packets with the same hashing values that appeared at their input node arrive at their output node no later than time t0 + t.

The next two theorems provide exact models of the performance of the schemes under a General delay model and under a Geometric delay model.

Theorem 2 (General Delay Model).
i. Using the Baseline scheme,
ii. Using the Hashed-Counter scheme,
iii. Using the Multiple-Counter scheme, with .

Proof.
i. In the General delay model, a packet arriving at a given time experiences a total delay of at most the target value iff it satisfies two independent conditions. First, its queueing delay is at most the target, w.p. (with probability) given by the delay distribution. Second, none of the previous packets has been delayed beyond the corresponding departure time, and therefore no earlier packet prevents ours from leaving. Since an earlier packet arrives in a given slot w.p. the arrival probability, and in that case is delayed beyond the deadline w.p. the tail of the delay distribution, the result follows by multiplying, over all such possible slots, the probabilities that no late packet originates from that slot.
ii. The result follows again directly from the above when considering a single counter out of the total.
iii. The proof is again exactly the same as in the previous case, and follows from the previous result. ◽
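Both proofs reason about packets that share one of the counters. For concreteness, here is a minimal sketch (ours, reconstructing only what the proofs describe: each flow is hashed to a counter, and a packet departs once all earlier packets of its counter have departed; the class name and sizes are our own choices) of per-counter resequencing:

class HashedCounterResequencer:
    """Minimal sketch of per-counter resequencing: a packet may depart
    once every earlier packet mapped to the same counter has departed."""
    def __init__(self, m):
        self.m = m
        self.next_seq = [0] * m      # next sequence number to assign, per counter
        self.expected = [0] * m      # next sequence number allowed to depart
        self.waiting = {}            # (counter, seq) -> packet, held at the output

    def tag(self, flow_id):
        """At the input: map the flow to a counter and stamp a sequence number."""
        c = hash(flow_id) % self.m   # stand-in for the paper's 64-bit mix hash
        seq = self.next_seq[c]
        self.next_seq[c] += 1
        return c, seq

    def deliver(self, c, seq, packet):
        """At the output: release every packet that is now in counter order."""
        self.waiting[(c, seq)] = packet
        released = []
        while (c, self.expected[c]) in self.waiting:
            released.append(self.waiting.pop((c, self.expected[c])))
            self.expected[c] += 1
        return released

r = HashedCounterResequencer(m=4096)
tag_a = r.tag("flow-a"); tag_b = r.tag("flow-a")
print(r.deliver(*tag_b, "second"))   # []: blocked, waits for the earlier packet
print(r.deliver(*tag_a, "first"))    # ['first', 'second']: both leave in order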

7.2.3 Geometric Delay Model

We now focus on the performance evaluation under a special case of the General delay model with geometrically distributed delay. In this case, the queueing delay takes each value with a geometrically decaying probability, as written below in our notation. The results for this case are summarized in the following theorem.
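In our own notation (a hedged reconstruction, with q standing in for the geometric parameter), this special case reads

\[
\Pr(Q = d) = (1-q)\,q^{d}, \quad d \ge 0,
\qquad \text{so that} \qquad
\Pr(Q \le d) = 1 - q^{\,d+1}.
\]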

Theorem 3 (Geometric Delay Model).
i. Using the Baseline scheme,
ii. Using the Hashed-Counter scheme,
iii. Using the Multiple-Counter scheme, with .

Proof. The results follow from the previous theorem using the expression of the geometrically-distributed delay model. ◽

8 SIMULATIONS

We fully developed from scratch a switch simulator that can run several use cases, including the geometric delay model, the Clos switch and the data center switch. To better mimic practical network behavior, we further added the possibility of feeding the flow source with real-life traces [30]. Simulations for each use case are conducted as described below. Throughout this section, delays are given in units of time slots. In the simulations, we rely on real-life 64-bit mix hash functions [31]. In addition, for the evaluation of the Multiple-Counter scheme, we use a B_h sequence [26], whose defining property is illustrated below.
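The defining property of a B_h sequence [26] is that all sums of h of its elements are distinct, which is what allows the Multiple-Counter scheme to disambiguate packets sharing a counter. The concrete sequence used in the simulations is not reproduced here, so the following sketch (ours) merely checks the property on an illustrative example for h = 2:

from itertools import combinations_with_replacement

def is_bh_sequence(seq, h=2):
    """Check the B_h property: all sums of h elements of seq (with repetition) are distinct."""
    sums = [sum(c) for c in combinations_with_replacement(seq, h)]
    return len(sums) == len(set(sums))

print(is_bh_sequence([1, 2, 5, 11, 22]))   # True: a small Sidon (B_2) sequence
print(is_bh_sequence([1, 2, 3, 4]))        # False: 1 + 4 == 2 + 3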

8.1 Geometric Delay Model Simulations

We run simulations for the geometric delay model discussed in Section 7.2. In the simulation, the number of flows is chosen large enough that the assumption that the probability that two arbitrary packets share the same flow is negligible still holds. We generate the switch flow using Bernoulli i.i.d. traffic with a fixed arrival rate, a fixed parameter of the geometric delay, and a fixed total number of counters. In the Multiple-Counter schemes, we use several hash functions.

Fig. 10(a) depicts the simulation results, compared with the theoretical results from (8), (9) and (10). The simulation results match the theoretical results quite well. Furthermore, we can see that changing the ordering guarantee from switch-flow order preservation to flow order preservation significantly decreases the average and the standard deviation of both delays. Using the Hashed-Counter scheme, the average total delay is drastically reduced from 19.09 to 9.35 time slots, an improvement of 51%. In the Multiple-Counter scheme, the average total delay is reduced by an additional 8.3% to 8.58 time slots without any additional memory.

Fig. 10(b) shows the average resequencing delay as a function of the number of counters. The Hashed-Counter and Multiple-Counter schemes significantly reduce the resequencing delay of the Baseline scheme, which is of course not affected by the number of counters. For instance, for a given number of counters, the average resequencing delay for the Baseline, Hashed-Counter and Multiple-Counter schemes is 11.7315, 2.0312 and 1.2442 time slots, respectively. Furthermore, for a larger number of counters, the delay of the Hashed-Counter scheme is 0.2731 time slots, while the Multiple-Counter scheme reduces it to only 0.0302.

Fig. 10. Geometric delay model.



8.2 Switch Simulations

We now run simulations with the switch structure from Fig. 1. Therefore, the delay experienced in the middle switch elements does not follow a specific delay model as above, but is instead incurred by other packets. We keep the numbers of inputs and outputs fixed, yielding a fixed number of switch flows going through the different paths. We set the total number of counters for the Hashed-Counter and Multiple-Counter schemes, and adjust it for the stronger Multiple-Counter variant to account for its larger memory requirements. We also assume a uniform traffic matrix. The presented results are based on simulations.

We start by comparing the performance of the hash-based counter schemes using 4096 flows per (input, output) pair, i.e., a large total number of flows. The total load is fixed, and each flow is generated using Bernoulli i.i.d. traffic of equal rate. In the Multiple-Counter schemes, we use several hash functions.

Fig. 11(a) compares the results of the switch simulations to the theoretical results from (8), (9) and (10), similarly to Fig. 10(a). For the Baseline scheme and for the Hashed-Counter and Multiple-Counter schemes, the optimal fit was reached using different values of the fitted parameter. The discrepancy between the theory and the simulation can be explained when we consider that in the switch simulations the assumption that the flows are mice of negligible size does not hold.

Fig. 11(b) plots the average resequencing delay in the switch as a function of the traffic load. As the load increases and the delays become larger, the impact of the Hashed-Counter and Multiple-Counter schemes becomes increasingly significant, and the strongest scheme results in the lowest resequencing delay. For instance, at high traffic load, the Multiple-Counter scheme achieves a 24.2% reduction of the resequencing delay obtained by the Hashed-Counter scheme, while the stronger Multiple-Counter variant further reduces it by an additional 16%.

Fig. 11(c) shows the PDF (Probability Density Function) of the number of packets that are blocked from exiting the resequencing buffers for the Hashed-Counter and Multiple-Counter schemes. The hash-based schemes vastly improve on the average number of waiting packets of the Baseline scheme: the Baseline scheme has an average of 22.47 packets per time slot, while the Hashed-Counter scheme has approximately 5. The Multiple-Counter scheme reduces this number to 4.6 packets per time slot, and finally the stronger Multiple-Counter variant has only 4.07 packets per time slot.

8.3 Switch Simulations Using Real-Life Traces

We now conduct experiments using real-life traces recorded on a single direction of an OC192 backbone link [30]. We use a real hash function [31] to match each (source, destination)

Fig. 11. Switch simulation results.



IP-address flow with a given counter bin. In the Multiple-Counter schemes, we use several hash functions.

As expected, Fig. 11(d) shows how the Hashed-Counter scheme drastically reduces the resequencing delay on this real-life traffic, from 1 to 0.1 time slots. Again, the reduction in total delay is more modest, from 2.4 to 1.5 time slots.

8.4 Network Coding Simulations

Fig. 12 presents simulation results of the CDF of the total delay for the Rare-Event delay model. We assume a fixed traffic load and that the flows of the packets are uniformly distributed among a set of 4096 possible flows. The network coding parameter was set so that a protection XOR packet was inserted every 10 packets. In order to have a fair comparison, we assumed that when network coding is used and the load is thereby increased by some factor, the worst-case queueing delay is increased by the same factor. Thus, the maximal queueing delay is 110. Fig. 12(a) presents the CDF of the total delay without network coding and Fig. 12(b) shows the CDF with coding.

A CDF value of 0.9 is achieved in the Baseline without coding only at a total delay of 90. For the Baseline with coding and for all other schemes, this CDF value is achieved even at a delay of 0. Without network coding, the mean total delay for the Baseline, Hashed-Counter, and the two Multiple-Counter schemes was 36.12, 3.33, 1.53 and 1.14, respectively. Using network coding, the mean delay was significantly reduced to 3.30, 0.41, 0.23 and 0.17, respectively, an improvement of approximately an order of magnitude.

In the delay models from Section 7 and in the simulations so far, we have assumed that a packet can leave the switch immediately when it is available at the switch output and in order. In practice, in many switches at most one packet may leave the same switch output in a single time slot. Thus a packet that is ready to leave the switch at the same time as a second packet might experience additional delay. We call this delay the exit delay and assume that the switch gives precedence to packets with earlier arrival times, as sketched below. Fig. 12(c) presents the CDF of the sum of the total delay and the exit delay in the Rare-Event delay model (without network coding). The mean of this sum of delays for the Baseline, Hashed-Counter, and the two Multiple-Counter schemes was 94.00, 6.64, 2.50 and 1.52, respectively. With network coding the means were reduced to 42.24, 1.30, 0.54 and 0.37, respectively. (We do not present the full CDF for this case with network coding due to space limits.)
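The exit-delay convention can be made concrete with the following sketch (ours, for illustration only): among the packets that are ready at a given output, at most one departs per time slot, and ties are broken in favor of the earliest arrival.

import heapq

def serialize_exits(ready_events):
    """ready_events: list of (ready_time, arrival_time) pairs, one per packet.
    Returns each packet's exit delay when at most one packet leaves per slot
    and earlier arrivals take precedence."""
    ready_events.sort()                                   # process in order of readiness
    pending, exit_delays = [], []
    t, i = 0, 0
    while i < len(ready_events) or pending:
        while i < len(ready_events) and ready_events[i][0] <= t:
            ready_time, arrival = ready_events[i]         # admit packets ready by slot t
            heapq.heappush(pending, (arrival, ready_time))
            i += 1
        if pending:
            arrival, ready_time = heapq.heappop(pending)  # earliest arrival departs
            exit_delays.append(t - ready_time)            # extra wait beyond readiness
        else:
            t = ready_events[i][0] - 1                    # jump to the next ready packet
        t += 1
    return exit_delays

# Three packets become ready in the same slot; only one may exit per slot.
print(serialize_exits([(5, 0), (5, 1), (5, 2)]))          # [0, 1, 2]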

8.5 Data Center Simulations Using Real-Life Traces

We also perform simulations on the 5-stage data center structure described in [17]. The input ports are fed with a real-life trace as in Section 8.3. The total number of counters is the same in the Hashed-Counter and Multiple-Counter schemes, while it is adjusted for the stronger Multiple-Counter variant to account for its larger memory requirements. The Multiple-Counter schemes again use several hash functions.

Fig. 13 illustrates the simulation results for the resequencing delay. As expected, the Hashed-Counter scheme noticeably reduces the resequencing delay, from 16.3 to 6.73 time slots. In this data-center simulation, the improvement from the Hashed-Counter scheme to the Multiple-Counter scheme is more significant: the average resequencing delay decreases from 6.73 to 3.30 time slots (a 51.0% improvement). The reason is that as a packet goes through more stages of switch elements, the variance of the queueing delay grows and the reordering caused by other flows becomes more severe. This delay can be improved further by the stronger Multiple-Counter variant. As in the previous simulations, the best results are obtained by using this variant, for which the resequencing delay is only 1.84 time slots.

9 CONCLUSION

In this paper, we provided schemes to deal with packet reordering, an emerging key problem in next-generation switch designs. We argued that current packet order

Fig. 12. CDF of the total delay and the total delay + exit delay for the Rare-Event delay model with and without network coding. The results are based on simulations.

Fig. 13. CDF of the resequencing delay for a data center, using a real-life trace (based on simulation).



requirements for switches are too stringent, and suggested only requiring flow order preservation instead of switch-flow order preservation.

We then suggested several schemes to reduce reordering. We showed that hash-based counter schemes can help prevent inter-flow blocking. We also suggested schemes based on network coding, which are useful against rare events with high queueing delay, and identified a time-constrained coding problem. We further pointed out an inherent reordering delay unfairness between elephants and mice, and suggested several mechanisms to correct this unfairness. We finally demonstrated in simulations reordering delay gains by factors of up to an order of magnitude.

In future work, we intend to further investigate the optimality of our schemes. In particular, we intend to find whether there are fundamental lower bounds on the average delay caused by reordering in a switch, given any possible scheme.

ACKNOWLEDGMENTS

The authors thank Alex Shpiner for his helpful suggestions. This work was supported in part by the European Research Council Starting Grant 210389, in part by the Jacobs-Qualcomm fellowship, in part by an Intel graduate fellowship, in part by a Gutwirth Memorial fellowship, in part by an Intel research grant on Heterogeneous Computing, in part by the Intel ICRI-CI Center, in part by the Technion Funds for Security Research, and in part by the Hasso Plattner Center for Scalable Computing.

REFERENCES

[1] F. Abel et al., “Design issues in next-generation merchant switch fabrics,” IEEE/ACM Trans. Netw., vol. 15, no. 6, pp. 1603–1615, 2007.
[2] I. Keslassy et al., “Scaling internet routers using optics,” ACM SIGCOMM, vol. 33, no. 4, pp. 189–200, 2003.
[3] N. Chrysos et al., “End-to-end congestion management for non-blocking multi-stage switching fabrics,” in Proc. IEEE/ACM Architectures Netw. Commun. Syst. (ANCS), 2010, p. 6.
[4] I. Keslassy, The Load-Balanced Router. Saarbrücken, Germany: VDM Verlag, 2008.
[5] Cisco Catalyst 6500 Documentation [Online]. Available: http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/ios/12.2SXF/native/configuration/guide/cef.html.
[6] Juniper—T Series Core Routers Architecture Overview [Online]. Available: http://www.juniper.net/us/en/local/pdf/whitepapers/2000302-en.pdf.
[7] Cisco CRS-1 Overview [Online]. Available: http://cseweb.ucsd.edu/∼varghese/crs1.ppt.
[8] Cisco CRS Carrier Routing System 16-Slot Line Card Chassis System Description [Online]. Available: http://www.cisco.com/en/US/docs/routers/crs/crs1/16-slot-lc/system-description/reference/guide/sysdsc.pdf.
[9] F. Baker. (1995, Jun.). RFC 1812: Requirements for IP Version 4 Routers [Online]. Available: http://www.faqs.org/rfcs/rfc1812.html.
[10] M. Meitinger et al., “A hardware packet resequencer unit for network processors,” in Proc. Architecture Comput. Syst. (ARCS), 2008, pp. 85–97.
[11] S. Iyer and N. McKeown, “Analysis of the parallel packet switch architecture,” IEEE/ACM Trans. Netw., vol. 11, no. 2, pp. 314–324, Apr. 2003.
[12] C.-S. Chang, D.-S. Lee, and Y.-S. Jou, “Load balanced Birkhoff-von Neumann switches, part I: One-stage buffering,” Comput. Commun., vol. 25, no. 6, pp. 611–622, 2002.
[13] C.-S. Chang, D.-S. Lee, and C.-M. Lien, “Load balanced Birkhoff-von Neumann switches, part II: Multi-stage buffering,” Comput. Commun., vol. 25, no. 6, pp. 623–634, 2002.
[14] W. Shi, M. H. MacGregor, and P. Gburzynski, “Load balancing for parallel forwarding,” IEEE/ACM Trans. Netw., vol. 13, no. 4, pp. 790–801, Aug. 2005.
[15] A. G. Greenberg et al., “VL2: A scalable and flexible data center network,” in Proc. ACM SIGCOMM, 2009, pp. 51–62.
[16] M. Al-Fares, A. Loukissas, and A. Vahdat, “A scalable, commodity data center network architecture,” in Proc. ACM SIGCOMM, 2008, pp. 63–74.
[17] R. N. Mysore et al., “PortLand: A scalable fault-tolerant layer 2 data center network fabric,” in Proc. ACM SIGCOMM, 2009, pp. 39–50.
[18] J. S. Turner, “Resequencing cells in an ATM switch,” Comput. Sci. Dept., Washington Univ., St. Louis, MO, Tech. Rep. WUCS-91-21, Feb. 1991.
[19] M. Henrion, “Resequencing system for a switching node,” U.S. Patent 5,127,000, Jun. 1992.
[20] J. S. Turner, “Resilient cell resequencing in terabit routers,” in Proc. Allerton Conf. Commun. Control Comput., 2003, pp. 1183–1192.
[21] S. Kandula et al., “Dynamic load balancing without packet reordering,” SIGCOMM Comput. Commun. Rev., vol. 37, no. 2, pp. 51–62, 2007.
[22] B. Wu et al., “A practical packet reordering mechanism with flow granularity for parallelism exploiting in network processors,” in Proc. IEEE Int. Parallel Distrib. Process. Symp. (IPDPS), 2005, p. 133.
[23] Y. Nebat and M. Sidi, “Parallel downloads for streaming applications—A resequencing analysis,” Perform. Eval., vol. 63, no. 1, pp. 15–35, 2006.
[24] Y. Xia and D. N. C. Tse, “Analysis on packet resequencing for reliable network protocols,” Perform. Eval., vol. 61, no. 4, pp. 299–328, 2005.
[25] D. J. Leith and D. Vasudevan, “Coding packets over reordering channels,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT), 2010, pp. 2458–2462.
[26] S. Graham, “B_h sequences,” in Analytic Number Theory, Springer, 1996, pp. 431–449.
[27] O. Rottenstreich, Y. Kanizo, and I. Keslassy, “The variable-increment counting Bloom filter,” in Proc. IEEE Conf. Comput. Commun. (INFOCOM), 2012, pp. 1880–1888.
[28] O. Tickoo et al., “LT-TCP: End-to-end framework to improve TCP performance over networks with lossy channels,” in Proc. IEEE/ACM Int. Workshop Quality of Service (IWQoS), 2005, pp. 81–93.
[29] V. Sharma et al., “MPLOT: A transport protocol exploiting multipath diversity using erasure codes,” in Proc. IEEE Conf. Comput. Commun. (INFOCOM), 2008, pp. 121–125.
[30] C. Shannon et al. CAIDA Anonymized 2008 Internet Trace [Online]. Available: http://imdc.datcat.org/collection/.
[31] T. Wang. Integer Hash Function [Online]. Available: http://www.concentric.net/∼Ttwang/tech/inthash.htm.

Ori Rottenstreich received the BS degree in computer engineering (summa cum laude) and the PhD degree from the Electrical Engineering Department, Technion, Haifa, Israel, in 2008 and 2014, respectively. His research interests include exploring novel coding techniques for networking applications. He is a recipient of the Google Europe Fellowship in Computer Networking, the Andrew Viterbi graduate fellowship, the Jacobs-Qualcomm fellowship, the Intel graduate fellowship, and the Gutwirth Memorial fellowship. He also received the Best Paper Runner-Up Award at the IEEE Infocom 2013 conference.

Pu Li received the BS and MSc degrees from Zhejiang University, China, and the Technical University of Eindhoven, The Netherlands, in 2007 and 2009, respectively. He was with the Department of Electrical Engineering, Technion, between November 2009 and July 2010. He is now working as a new technology introduction engineer at ASML, The Netherlands. He has conducted multiple research projects at the Chinese University of Hong Kong, TNO Netherlands, and Philips Research Europe, in the field of communication networks. His research interests include wireless technologies, high-performance switching networks, and network-on-chip.



Inbal Horev received the BSc degree in both physics and electrical engineering from the Technion—Israel Institute of Technology, Haifa, Israel. She is an MSc student at the Department of Mathematics and Computer Sciences, Weizmann Institute of Science, Rehovot, Israel. Her current research fields include active learning, computer vision, and geometry processing.

Isaac Keslassy (M’02–SM’11) received the MS and PhD degrees in electrical engineering from Stanford University, California, in 2000 and 2004, respectively. He is currently an associate professor in the Electrical Engineering Department, Technion, Israel. His research interests include the design and analysis of high-performance routers and multi-core architectures. He is the recipient of the European Research Council Starting Grant, the Alon Fellowship, the Mani Teaching Award, and the Yanai Teaching Award.

Shivkumar Kalyanaraman received the BTech degree from the Indian Institute of Technology, Madras, India, the MS and PhD degrees from Ohio State University, Columbus, and the executive MBA degree from Rensselaer Polytechnic Institute (RPI), Troy, New York. He is a senior manager at IBM Research, Bangalore, India. Previously, he was a professor in the Department of Electrical, Computer and Systems Engineering at RPI. His research at IBM is at the intersection of emerging wireless technologies, smarter energy systems, and IBM middleware and systems technologies. He is an ACM Distinguished Scientist.

