Deadlocks in Datacenter Networks: Why Do They Form, and How to Avoid Them

Shuihai Hu (1,2), Yibo Zhu (1), Peng Cheng (1), Chuanxiong Guo (1), Kun Tan (1), Jitendra Padhye (1), Kai Chen (2)

(1) Microsoft   (2) Hong Kong University of Science and Technology

ABSTRACT

Driven by the need for ultra-low latency, high throughput and low CPU overhead, Remote Direct Memory Access (RDMA) is being deployed by many cloud providers. To deploy RDMA in Ethernet networks, Priority-based Flow Control (PFC) must be used. PFC, however, makes Ethernet networks prone to deadlocks. Prior work on deadlock avoidance has focused on the necessary condition for deadlock formation, which leads to rather onerous and expensive solutions for deadlock avoidance. In this paper, we investigate sufficient conditions for deadlock formation, conjecturing that avoiding sufficient conditions might be less onerous.

1. INTRODUCTION

In this paper we discuss a problem that is quite (c)old, albeit one that has re-emerged in a new context, and admit that we have no idea how to solve it completely. Our hope is to draw the community's attention to this problem, and re-ignite research in this area.

The problem is deadlock formation in lossless networks.

Driven by the need for ultra-low latency, high throughput and low CPU overhead, major cloud service providers are deploying Remote Direct Memory Access (RDMA) in their datacenter networks [17, 24]. Among the available RDMA technologies, RDMA over Converged Ethernet (RoCE) [1] is a promising one as it is compatible with current IP and Ethernet based datacenter networks.

The deployment of RoCE requires Priority-based Flow Control (PFC) [2] to provide a lossless L2 network. With PFC, packet loss can be avoided by letting a switch pause its immediate upstream device before buffer overflow occurs. However, PFC can cause deadlocks.


Figure 1: PFC-induced deadlock: simple illustration (packets loop A→B→C→A; each switch's buffer reaches the PFC threshold and PAUSEs its upstream switch).

Deadlock refers to a standstill situation: there is a Cyclic Buffer Dependency (CBD) among a set of switches. Each switch in the cycle holds all the buffer needed by its upstream switch, and meanwhile is waiting for its downstream switch to release some buffer and resume its packet transmission. A simple scenario is illustrated in Figure 1.

It is easy to see that when deadlock occurs, no switch in the cycle can proceed. Further, throughput of the whole network or part of the network will go to zero due to the back-pressure effect of PFC pause.

It is often believed that such deadlocks cannot occur in Clos-structured datacenter networks, since a loop cannot form in such networks with valley-free routing [24]. However, Guo et al. [10] have shown that deadlocks can indeed occur in such networks. We now believe that deadlocks can also occur when transient loops form in Clos-structured networks. In our datacenters, these can happen as BGP (we use BGP for routing, with each switch being a private AS) re-routes around link failures. In SDN-based datacenters, transient loops can occur during updates [12]. While loops may be transient, the deadlocks caused by them are not. Deadlocks do not automatically break even after the problems (misconfiguration, failures/updates, etc.) that caused them have been fixed.

Hence some mechanism for handling the deadlock problem must be used when deploying RDMA in datacenter networks. These mechanisms fall into two broad categories:

Reactive mechanisms/systems detect that a deadlock has formed, and then try to break it by resetting links, ports, hosts, etc. These mechanisms are inelegant, disruptive, and should be used only as a last resort. We do not consider them further in this paper.

Proactive deadlock prevention is a more principled approach to this problem. Prior work on deadlock prevention can be classified into two categories: 1) routing restriction-based approaches [7, 21], and 2) buffer management (structured buffer pool) based approaches [8, 14].


The idea of the first approach is to ensure that no CBD exists in the network by limiting the routing paths used in each priority class. The second approach divides the switch buffer into several buffer classes; a packet is allowed to access more buffer classes as it travels a greater distance in the network. It can be proved that as long as the number of buffer classes is no smaller than the hop count of the longest routing path, there will be no CBD.
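
To make the buffer-class rule concrete, here is a minimal Python sketch; the function names and the hop-indexed class assignment are our own illustration, not an implementation from [8, 14].

```python
# A minimal sketch (our own names, not from [8, 14]) of the structured buffer
# pool rule described above: a packet that has traversed h hops may only occupy
# buffer class h, so dependencies point from lower to higher classes and can
# never form a cycle, provided there are at least as many classes as hops.

def buffer_class_for_hop(hop_count: int, num_classes: int) -> int:
    """Buffer class a packet may occupy after traversing hop_count hops."""
    if hop_count >= num_classes:
        raise ValueError("need at least as many buffer classes as the longest path")
    return hop_count

def cbd_free(longest_path_hops: int, num_classes: int) -> bool:
    """The condition quoted above: classes >= hop count of the longest path."""
    return num_classes >= longest_path_hops

print(cbd_free(longest_path_hops=6, num_classes=8))  # True
print(cbd_free(longest_path_hops=6, num_classes=2))  # False: shallow-buffer switches
```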

Both categories of approaches to deadlock prevention have some important drawbacks. Using routing restrictions to prevent deadlocks usually wastes link bandwidth and limits throughput performance. Preventing deadlocks via buffer management may require a lot of priority classes and switch buffer for networks of large diameter, rendering this approach impractical: today's commodity switches with shallow buffers can support at most 2 lossless traffic classes [10].

As noted earlier, all these mechanisms share a commonality: they seek to eliminate CBD. Thus, the drawbacks described above can be viewed as the cost of eliminating CBD in the network. While avoiding cyclic buffer dependency guarantees a deadlock-free network, it may not always be feasible to pay these costs.

Thus, we take a step back and ask: Is CBD a necessary condition for deadlock formation, or is it a sufficient condition? If it is only a necessary condition, can we focus on sufficient conditions, and guarantee deadlock freedom, without eliminating CBD?

To answer the above questions, we studied several representative deadlock cases. First, we find that CBD is just a necessary condition for deadlock. In some sense, this is trivially true: if no flow is sending any data, there will obviously be no deadlock, regardless of CBD. However, there are also several non-trivial cases where CBD is met, all flows are active, but there is no deadlock. Second, we find that even if all the links in a switch cycle are paused simultaneously, deadlock may still not occur. These findings indicate that prior solutions are too conservative.

So, in this paper, we shall try to understand the sufficient conditions for deadlock formation, which we conjecture to be far easier to ameliorate than the necessary conditions. As mentioned earlier, we do not yet have a precise characterization of the sufficient conditions. Yet, we have made some headway, which allows us to sketch a few possible solutions to the problem.

2. DEADLOCK IN LOSSLESS NETWORKS

We now briefly discuss how deadlocks form in lossless networks, and why we must study the necessary and sufficient conditions of deadlock formation.

Lossless Ethernet relies on PFC. RoCE needs PFC to provide a lossless L2 network. With the PFC PAUSE mechanism, a switch can pause an incoming link when its ingress buffer occupancy reaches a preset threshold. Properly tuned, no packet will be dropped due to insufficient buffer space. Unfortunately, deadlock may occur in such lossless networks.

PFC may lead to deadlock if paused links form a cycle. In a PFC-enabled network, if a subset of links simultaneously paused by PFC happen to form a directed cycle, no packets in the cycle can move even if no new traffic is injected into the cycle.

To avoid such deadlocks, deadlock-free routing [21] has been proposed. It guarantees that, if the routing configuration is correct, no traffic pattern causes deadlock.

Unfortunately, achieving deadlock-free routing is inefficient, and may not even be viable. Deadlock-free routing is achieved by eliminating Cyclic Buffer Dependency (CBD) [5]. However, ensuring that there is never any CBD is challenging.

First, deadlock-free routing largely limits the choice of topology. For example, Stephens et al. [21] propose to only use tree-based topology and routing, and show that it is deadlock-free. However, there are a number of other datacenter topologies and routing schemes that are not tree-based [3, 9, 19], and they do not have a deadlock-freedom guarantee.

Second, due to bugs or misconfiguration, a deadlock-free routing configuration may become deadlock-vulnerable. In fact, recent work has observed a PFC deadlock case in a real-world tree-based datacenter [10], caused by the (unexpected) flooding of lossless-class traffic. This case is a concrete example showing that even with a tree-based topology, CBD can still occur if up-down routing is not strictly followed. Furthermore, there are multiple reports of routing loops due to misconfiguration in today's production datacenters [23, 25]. If lossless traffic encounters any of these loops, CBD is unavoidable.

In this paper, we argue that we should accept the fact that CBD cannot be completely avoided (in other words, deadlock-free routing may not always apply or work correctly), and instead try to understand more precise deadlock conditions. Our findings show that even if there is CBD, deadlock may not occur (see Section 3). This means that CBD is only a necessary but not sufficient condition for deadlock. We show that the occurrence of deadlock is affected by the packet TTL, the traffic matrix, as well as flow rates. Based on these findings, we propose several ways to avoid deadlock even in the face of CBD.

3. CASE STUDIES: CBD IS INSUFFICIENT FOR DEADLOCK

Although CBD is a necessary condition for deadlock, it is not a sufficient condition. In this section, we present our case studies in which CBD is present, but deadlock formation still depends on other factors. We demonstrate that 1) a looping flow that generates CBD does not always lead to deadlock; the length of the loop, flow rates and packet Time-To-Live (TTL) affect whether the deadlock forms; 2) multiple flows may cause CBD, but slightly different flow sets lead to different deadlock results; and 3) rate limiting can prevent deadlock from happening.



Figure 2: Single looping flow creates CBD but may not create deadlock. (a) Topology and flows; (b) buffer dependency graph; (c) stable state model.

3.1 Case 1: Flow Rate and TTL Determine Deadlock in a Routing Loop

A routing loop can create CBD if a flow is trapped in the loop. The simplest example is a two-hop loop between two switches, as shown in Figure 2(a) (RX represents an input queue (or port), and TX represents an output queue (or port)). We then plot the buffer dependency graph (Figure 2(b)). Each directed line represents a buffer dependency from the source RX to the destination RX. For example, packets buffered in RX1 of switch A will be sent to RX1 of B, and vice versa. So in Figure 2(b), two directed lines are drawn between A and B. Switch A's dependency on switch B means that whether switch A can move the packets in its receiving buffer RX1 to egress depends on switch B's buffer RX1 (we focus on the receiving buffer because PFC PAUSE triggers based on the occupancy of the receiving buffer). The switches can send packets to the other side only when the other side's buffer utilization is under the PFC PAUSE threshold. Deadlock happens when both of the involved buffers reach the PFC threshold at the same time and PAUSE the links.
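
As an illustration of this dependency-graph reasoning, the following sketch (the helper names are ours, not the paper's; a flow path is given as an ordered list of (switch, ingress-queue) hops) builds a buffer dependency graph and checks it for a cycle, i.e., for CBD.

```python
# Illustrative sketch: build the buffer dependency graph described above and
# test it for a cycle. An edge u -> v means that draining queue u depends on
# queue v staying below the PFC threshold.

from collections import defaultdict

def build_dependency_graph(flows):
    graph = defaultdict(set)
    for path in flows:
        for upstream, downstream in zip(path, path[1:]):
            graph[upstream].add(downstream)
    return graph

def has_cbd(graph):
    """A cyclic buffer dependency exists iff the graph has no topological order."""
    nodes = set(graph) | {v for deps in graph.values() for v in deps}
    indegree = {node: 0 for node in nodes}
    for deps in graph.values():
        for v in deps:
            indegree[v] += 1
    ready = [node for node in nodes if indegree[node] == 0]
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for v in graph.get(node, ()):
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return visited < len(nodes)  # nodes left over must lie on a cycle

# The two-switch routing loop of Figure 2: A:RX1 -> B:RX1 -> A:RX1 -> ...
loop = [("A", "RX1"), ("B", "RX1"), ("A", "RX1")]
print(has_cbd(build_dependency_graph([loop])))  # True
```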

However, this CBD may not always turn into a deadlock state. The flow rate, the TTL (time-to-live) of packets and the length of the loop together determine whether there will be deadlock. In our testbed, we run a simple experiment on two switches that are connected by a 40Gbps link and configured with a routing loop. All packets have an initial TTL of 16 and are injected into one of the switches. We find that deadlock forms only if the packet injection rate exceeds 5Gbps.

In order to analyze deadlock formation in routing loop cases, we develop a mechanism called boundary state analysis. It yields an accurate prediction of whether deadlock forms, as shown below.

Boundary state analysis. On any of the switches in the loop, packets are injected by the previous hop and drained by the next hop. If the draining rate is smaller than the injecting rate, packets will continuously queue up in the switch buffers. In a loop, once one switch buffers enough packets and triggers PFC, the PFC will soon cascade through the whole loop and form a deadlock. We define the boundary state as the state in which the injecting rate and draining rate are balanced on every switch; any larger injecting rate leads to deadlock because the draining rate cannot catch up.

We build the boundary state model as illustrated in Figure 2(c). The variables are described in Table 1.

Table 1: Stable state analysis variables
  Variable   Description
  r          Inject rate of new packets.
  B          Link bandwidth.
  r_d        Packet drain rate caused by TTL expiration.
  TTL        Initial Time-To-Live value.
  n          The length of the routing loop.

According to the boundary state definition, the injecting rate and draining rate must be equal on the first switch:

r + B - r_d = B    (1)

In addition, we consider the sum of the TTL values of all packets in the system. During the boundary state, it should remain stable. Therefore, the increase rate and decrease rate of the sum of TTLs should be the same:

n * B = TTL * r    (2)

Combining Equation 1, Equation 2, and the fact that deadlock requires a larger injecting rate than that in the boundary state, we derive the necessary and sufficient condition for deadlock in a routing loop scenario:

r > r_d = n * B / TTL    (3)

This matches what we observe in the testbed experiment: with B = 40Gbps, n = 2 and TTL = 16, the flow injecting rate must be at least 5Gbps to cause deadlock. With larger bandwidth, longer loop length or smaller initial TTL values, the threshold on r can be higher. As long as the flow rate is smaller than the threshold, no deadlock will form. As shown in Section 4, we may utilize this property to avoid deadlock, even if a routing loop occurs.
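
A small Python sketch of Equation (3) (variable and function names are ours), reproducing the 5Gbps threshold measured above:

```python
# Equation (3): in an n-hop routing loop with link bandwidth B and initial TTL,
# a looping flow can deadlock only if its injection rate r exceeds n * B / TTL.
# The numbers below reproduce the 40Gbps / 2-hop / TTL=16 testbed case.

def loop_deadlock_threshold_gbps(link_bw_gbps: float, loop_hops: int,
                                 initial_ttl: int) -> float:
    """Injection-rate threshold above which deadlock can form in the loop."""
    return loop_hops * link_bw_gbps / initial_ttl

def can_deadlock(inject_rate_gbps: float, link_bw_gbps: float,
                 loop_hops: int, initial_ttl: int) -> bool:
    return inject_rate_gbps > loop_deadlock_threshold_gbps(
        link_bw_gbps, loop_hops, initial_ttl)

print(loop_deadlock_threshold_gbps(40, 2, 16))  # 5.0 Gbps
print(can_deadlock(4, 40, 2, 16))               # False: below the threshold
print(can_deadlock(6, 40, 2, 16))               # True: deadlock can form
```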

3.2 Case 2: Traffic Matrix Affects Deadlock

Multiple flows may create a CBD even if there is no routing loop. Figure 3(a) shows a simple example with four switches A, B, C and D. Flow 1 starts at a host (not shown) attached to A, passes through B and C, and ends at a host attached to D. Flow 2 starts at a host attached to C, passes through D and A, and ends at a host attached to B. Similar to the previous case, we can draw the dependency lines between switches. As shown in Figure 3(b), there is a CBD among the four switches, i.e., dependencies from RX1 of A to RX1 of B, then to RX1 of C, then to RX1 of D, and finally back to RX1 of A.

In this example, the boundary state analysis does not yield meaningful results. Because the flows do not have any rate limiting, one can easily analyze that the stable throughput of each flow is B/2. However, it is not easy to tell whether deadlock will form.

Figure 3: There is no deadlock even though two flows create CBD among four switches. (a) Topology and flows; (b) buffer dependency graph; (c) pause events at four links; (d)-(g) instantaneous buffer occupancy at switches A, B, C and D.

For example, one may suspect that switch A's RX1 will generate PFC PAUSE, and these PAUSE frames may cascade from A to D, then to C and B, and finally back to A, thus creating deadlock. To understand such scenarios, we must analyze and simulate them at packet level.

Simulation setup: To create a well-controlled experimental environment, we simulate the scenario in Figure 3(a) using packet-level NS-3 simulations. In our NS-3 simulator, we implement the PFC protocol (i.e., the IEEE 802.1Qbb protocol). For each ingress queue, the switch maintains a counter to track the bytes of buffered packets received by this ingress queue. Once the queue length exceeds the preset PFC threshold, the corresponding incoming link will be paused.

In our simulations, we configure static routing on all switches so that flow paths are enforced. Both flows are UDP flows with infinite traffic demand. The capacity of all links is 40Gbps. All the switches have a 12MB buffer. The PFC threshold is statically set to 40KB for each ingress queue. These parameters affect how fast deadlock forms (if any), but do not affect whether deadlock forms.
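
For concreteness, the following standalone Python sketch mirrors the per-ingress-queue accounting described above; it is not our NS-3 code, and the class name and the resume mark are simplifications of ours.

```python
# Simplified sketch of the per-ingress-queue accounting: a byte counter per
# ingress queue, with the upstream link paused once the counter crosses the
# PFC threshold and resumed after it drains below a (hypothetical) resume mark.

class IngressQueue:
    def __init__(self, pause_threshold_bytes=40 * 1024, resume_threshold_bytes=20 * 1024):
        self.buffered_bytes = 0
        self.pause_threshold = pause_threshold_bytes
        self.resume_threshold = resume_threshold_bytes
        self.upstream_paused = False

    def enqueue(self, packet_bytes: int) -> None:
        self.buffered_bytes += packet_bytes
        if not self.upstream_paused and self.buffered_bytes >= self.pause_threshold:
            self.upstream_paused = True   # would emit a PFC PAUSE frame upstream

    def dequeue(self, packet_bytes: int) -> None:
        self.buffered_bytes = max(0, self.buffered_bytes - packet_bytes)
        if self.upstream_paused and self.buffered_bytes <= self.resume_threshold:
            self.upstream_paused = False  # would emit a PFC PAUSE with pause time 0

q = IngressQueue()
for _ in range(50):
    q.enqueue(1000)          # 50 KB buffered: crosses the 40 KB threshold
print(q.upstream_paused)     # True
for _ in range(35):
    q.dequeue(1000)          # drains to 15 KB, below the resume mark
print(q.upstream_paused)     # False
```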

In Figure 3(c), we plot the PFC pause events at the four links L1, L2, L3 and L4. If link Li (i = 1, 2, 3, 4) is paused at time t, we plot a point at location (t, i). Pause events at different links are plotted with different colors and at different heights. As we can observe, links L2 and L4 are paused continuously, while the other two links L1 and L3 never get paused. In this case, deadlock will never form as no packet will be paused permanently.

To understand the pause pattern, we sample the instantaneous buffer occupancy of both flows at the RX1 queues of A, B, C and D every 1us. In Figure 3(d), we draw the instantaneous buffer occupancy of flow 2 at RX1 of A. Similarly, in Figure 3(e), Figure 3(f) and Figure 3(g), we draw the instantaneous buffer occupancy of the flows of interest at the RX1 queues of B, C and D, respectively. As flow 1 and flow 2 are symmetric, we only present the analysis for Figure 3(d) and Figure 3(e) to show why link L4 is paused continuously but link L1 never gets paused. As shown in the figures, the buffer occupancy of flow 2 at RX1 of A fluctuates between 10KB and 55KB, around the PFC threshold, so link L4 gets paused intermittently. In contrast, the buffer occupancy of flow 1 at RX1 of B is well below the PFC threshold (it fluctuates between 0KB and 18KB), hence link L1 never gets paused.

The takeaway is that we cannot simply predict deadlock based on the existence of CBD and flow-level stable state analysis. This is because we cannot predict the instantaneous buffer occupancy (and whether PFC is triggered) from flow-level analysis that only focuses on average flow throughput.

A slightly different traffic matrix leads to deadlock: as shown in Figure 4(a), based on the previous scenario, we add another flow (flow 3) which enters the network at switch B and leaves at switch C. All three flows are UDP flows with infinite traffic demand. The buffer dependency graph is drawn in Figure 4(b). Compared with the previous scenario, one additional dependency from RX2 of B to RX1 of C is added, but it is outside the CBD. The CBD itself remains unchanged.

Pause events at the four links L1, L2, L3 and L4 are plotted in Figure 4(c). As we can see, in this case all four links are paused. To check whether deadlock will form in this case, we stop the three flows after a sufficiently long period (1000ms). We find that pause events are continuously generated at all four links even after the three flows stop sending new packets. This means that a deadlock has been created among the four switches.

The bizarre thing is that, if we apply the stable state flow analysis based on PFC fairness (PFC ensures per-hop, per-port fairness: if packets from two ingress ports go to the same egress port, each ingress port gets half of the egress bandwidth), it is easy to see that all flows should have 20Gbps throughput. In particular, at switch D, the stable ingress and egress rates of flow 1 and flow 2 should remain the same as in the previous case (Figure 3). However, now switch D starts to generate PFC at RX1 towards switch C, as opposed to no PFC being generated in the previous scenario.



Figure 4: Slightly different traffic matrix leads to deadlock, even though the flow-level analysis shows that the average throughput of flow 1 and flow 2 should not be affected. (a) Topology and flows; (b) buffer dependency graph; (c) pause events at four links.

Figure 5: Different rate limiting determines whether the deadlock forms. (a) Topology and flows, with flow 3 rate-limited to 2Gbps at switch B's ingress RX2; (b) pause events at four links; (c) 2Gbps rate limiter, no deadlock; (d) 3Gbps rate limiter, deadlock.

Flow-level stable state analysis cannot capture such behavior. We can only get answers from packet-level analysis. Unfortunately, we have not yet found any analytic tools that can precisely describe the PFC behaviors in these two examples. Looking at the packet traces, we only roughly know that after adding flow 3, flow 1 has to share the bandwidth of link L2 with flow 3, and this may cause different PFC patterns on link L1 without affecting the average throughput of flow 1 and flow 2. But this change in pattern makes PFC cascade towards L4 and finally L3.

Once all four links are paused simultaneously, there is a chance that no link can get resumed. For example, it is possible that when the simultaneous pause happens, at switch A and switch B the packet buffered at the head of the queue belongs to flow 1, and meanwhile, at switch C and switch D the packet buffered at the head of the queue belongs to flow 2. Once this condition is met, PFC deadlock occurs.

Summary: In the above multi-flow scenarios, CBD can be created without a routing loop. However, it is again not a sufficient condition for deadlock. The analysis of sufficient conditions is complicated. Stable flow state analysis does not apply. A slightly different traffic matrix that does not significantly affect the stable flow state may lead to very different packet-level behavior, and thus different deadlock results. Though packet-level simulations help us understand these scenarios, we have so far not found any analytic tools that work at packet level for the above examples.

3.3 Case 3: Rate Limiting Mitigates Deadlock

In the last deadlock example (Figure 4), if we additionally limit the rate of flow 3, deadlock may be avoided. As shown in Figure 5(a), we add a rate limiter to switch B's ingress port RX2. While the buffer dependency graph remains the same as in Figure 4(b), a slower flow 3 means that the congestion on switch B is reduced, PFC is less frequent, and deadlock may be avoided. The question is, what is the maximum rate that avoids deadlock?

Again, using flow-level stable state analysis, we cannot get the answer. We use the packet-level simulator to test different rate-limiting values. We find that when the rate of flow 3 is no more than 2Gbps, there is no deadlock even though all links see frequent PAUSEs (Figure 5(b)). Note that, after zooming in on Figure 5(b), we can see that the four links are never paused simultaneously at the packet level. Why is 2Gbps different from a higher rate, like 3Gbps?

We plot the buffer occupancy of RX1 at switch B, and compare when we limit the rate of RX2 to 2Gbps (Figure 5(c)) and 3Gbps (Figure 5(d)). Interestingly, with the 2Gbps rate limiter the buffer occupancy always fluctuates between 0 and a little above the PFC threshold (it takes some time for the PFC PAUSE to arrive at the other side and become effective after the PFC threshold is reached, so the switch buffers additional packets due to this delay). With the 3Gbps rate limiter, after some fluctuation the buffer goes into deadlock, even though the peak buffer usage is the same as in the 2Gbps case. Unfortunately, we cannot find any existing analysis tools that explain what we have observed.

In short, while rate limiting mitigates deadlock, packet-level analysis is required for understanding the actual threshold. We are currently working on analysis tools, e.g., a fluid model that can describe PFC behavior, and will report on them in future work.

Summary: From all the examples in this section, we conclude that CBD is a loose condition for deadlock. The traffic demand matrix, TTL and flow rates all affect deadlock formation. While we cannot obtain the tightest condition (i.e., a necessary and sufficient condition), we know that a tighter condition should include those factors, and that these factors can be utilized for deadlock mitigation.


In Section 4, we discuss potential deadlock mitigations in addition to avoiding CBD.

4. POTENTIAL DEADLOCK MITIGATIONS

Since CBD is just a necessary condition for deadlock, there are mitigation mechanisms that avoid deadlock even if CBD is present. The examples and analysis in Section 3 inspire some of the following potential deadlock mitigations. We stress that we have not been able to formally characterize the sufficient conditions for deadlock formation. Thus, the solutions presented below are mitigations, and essentially heuristic in nature. In the future, we hope to fully characterize the sufficient conditions, and come up with more precise and efficient deadlock prevention solutions.

TTL-based mitigation for deadlock caused by loops. In a routing loop, deadlock formation becomes less likely with smaller TTL (see Equation 3). Thus, the most straightforward mitigation is to reduce packets' initial TTL values. For example, in an N-hop routing loop, if the initial TTL is not larger than N, no deadlock will form because the deadlock threshold for r is at least B, which can never be exceeded.

In practice, we may not be able to guarantee that initial TTL values are always smaller than the size of the loop. However, by proper switch buffer management, we may make the class-specific TTL much smaller than the actual TTL values. For example, if we assign packets whose TTL values differ by at least X to different priority classes, the effective TTL within a priority class becomes X. Since PFC PAUSE operates on priority classes, the deadlock threshold on the injection rate r is effectively increased.
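
A hedged sketch of this TTL-to-class mapping (function names and example values are ours; it assumes enough lossless classes are available, which commodity switches may not provide):

```python
# Hypothetical sketch of the TTL-based class assignment suggested above.
# Packets whose TTLs differ by at least X map to different priority classes,
# so within one class the TTL spread is below X and Equation (3)'s threshold
# rises to n * B / X for that class.

def priority_class_for_ttl(ttl: int, class_width_x: int) -> int:
    """TTLs that differ by at least class_width_x never share a class."""
    return ttl // class_width_x

def per_class_deadlock_threshold_gbps(link_bw_gbps: float, loop_hops: int,
                                      class_width_x: int) -> float:
    """Equation (3) with the effective TTL replaced by the class width X."""
    return loop_hops * link_bw_gbps / class_width_x

print(priority_class_for_ttl(13, 4), priority_class_for_ttl(9, 4))  # 3 2: different classes
print(per_class_deadlock_threshold_gbps(40, 2, 4))                  # 20.0 Gbps instead of 5.0
```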

In worst-case scenarios, the effective TTL may still be larger than the size of the loop, meaning that some r smaller than B leads to deadlock. We may consider rate limiting to keep r below the threshold NB/TTL, as discussed below.

Rate limiting. Commodity switches support bandwidth shaping for each priority class or even for particular flows. This can mitigate deadlock caused by both routing loops and multi-flow buffer dependency, as shown in Section 3. If we are able to predict the rate threshold for deadlock, we may bound the individual flow rate by that threshold on switches that are involved in CBD. However, this requires intelligent rate-limiting schemes to avoid over-punishing innocent flows. We leave this to future work.

Limiting PFC pause frame propagation. PFC is well known for its HoL blocking problem. The damage of HoL blocking and the potential deadlock caused by PFC are significant because the pause frames are generated near the destination or in the middle of the network, where network congestion usually happens. Hence, if we can limit PFC pause frame propagation, or simply generate pause frames near the source, we can reduce the damage of both deadlock and HoL blocking.

Here we describe several possible ways of doing so. First, we can assign different PFC thresholds to the ports of a switch based on their position in the topology: ports connecting to the downstream (i.e., towards the leaves) get a smaller threshold, whereas ports connecting to the upstream get a larger threshold. Second, we can use switches with larger threshold values at the higher tiers so that they can absorb small bursts instead of generating PFC pause frames. Third, again, we may classify packets with different TTLs into different classes and assign them different PFC thresholds. Unfortunately, these solutions may lead to other issues, including unfairness between long (across different high-tier switches) and short (e.g., within the same rack) flows. This trade-off requires further study.

Preventing PFC from being generated. Recent transport protocols, DCQCN [24] and TIMELY [17], are designed to reduce the possibility of PFC generation. But due to the feedback latency of end-to-end delay, neither algorithm can detect congestion instantaneously, and thus they cannot completely prevent PFC from being generated.

One possible way to further reduce PFC is to integrate DCQCN with phantom queuing, as in [4]. By reacting to phantom queues that assume a lower link speed, congestion signals are generated much earlier.

5. RELATED WORK

RDMA in datacenters. RDMA has been used for improving distributed application performance, for example in in-memory key-value stores [6, 13, 16], Hadoop RPC [15] and HBase [11]. It has recently been deployed inside modern datacenters [10, 17, 24], based on RoCE (RDMA over Converged Ethernet), which relies on PFC to create a lossless Ethernet. Recent work [17, 24] discusses congestion control for RoCEv2 networks. The issue of deadlock is mentioned in these papers, but not directly addressed. In this paper, we aim to get a deeper understanding of deadlock and possible mitigations.

Deadlock-free routing. To avoid deadlock in lossless networks, previous work [14, 18, 20-22] has focused on deadlock-free routing, i.e., deadlock freedom regardless of traffic pattern. It has also been proven that eliminating CBD is a necessary and sufficient condition for deadlock-free routing [5]. However, deadlock-free routing is difficult to implement in practice, since it is challenging to eliminate CBD in the face of arbitrary bugs and failures. Our work explores how we may control the flows, packet formats and switch configurations to avoid deadlock even if routing is not deadlock-free.

6. CONCLUSION

In this paper, we studied the problem of deadlock in datacenter networks. We showed that CBD is a necessary but not sufficient condition for deadlock formation. We are unable to fully characterize the sufficient conditions, but using insights gained from a few examples, we discussed potential deadlock mitigation mechanisms including TTL-based schemes, rate limiting, and reducing PFC propagation.


7. REFERENCES

[1] RDMA over Converged Ethernet (RoCE). http://www.mellanox.com/page/products_dyn?product_family=79.
[2] IEEE 802.1Qbb. Priority-based flow control, 2011.
[3] Hussam Abu-Libdeh et al. Symbiotic routing in future data centers. In Proc. of SIGCOMM, 2010.
[4] Mohammad Alizadeh et al. Less is more: Trading a little bandwidth for ultra-low latency in the data center. In Proc. of NSDI, 2012.
[5] W. J. Dally and C. L. Seitz. Deadlock-free message routing in multiprocessor interconnection networks. IEEE Transactions on Computers, C-36(5):547-553, May 1987.
[6] Aleksandar Dragojevic, Dushyanth Narayanan, Orion Hodson, and Miguel Castro. FaRM: Fast remote memory. In Proc. of NSDI, 2014.
[7] Jose Flich, Tor Skeie, Andres Mejia, Olav Lysne, Pierre Lopez, Antonio Robles, Jose Duato, Michihiro Koibuchi, Tomas Rokicki, and Jose Carlos Sancho. A survey and evaluation of topology-agnostic deterministic routing algorithms. IEEE Transactions on Parallel and Distributed Systems, 2012.
[8] Mario Gerla and Leonard Kleinrock. Flow control: A comparative survey. IEEE Transactions on Communications, 1980.
[9] Chuanxiong Guo et al. BCube: A high performance, server-centric network architecture for modular data centers. In Proc. of SIGCOMM, 2009.
[10] Chuanxiong Guo et al. RDMA over commodity Ethernet at scale. In Proc. of SIGCOMM, 2016.
[11] Jian Huang et al. High-performance design of HBase with RDMA over InfiniBand. In Proc. of IPDPS, 2012.
[12] Xin Jin, Hongqiang Harry Liu, Rohan Gandhi, Srikanth Kandula, Ratul Mahajan, Ming Zhang, Jennifer Rexford, and Roger Wattenhofer. Dynamic scheduling of network updates. In Proc. of SIGCOMM, 2014.
[13] Anuj Kalia, Michael Kaminsky, and David G. Andersen. Using RDMA efficiently for key-value services. In Proc. of SIGCOMM, 2014.
[14] Mark Karol, S. Jamaloddin Golestani, and David Lee. Prevention of deadlocks and livelocks in lossless backpressured packet networks. IEEE/ACM Transactions on Networking, 2003.
[15] Xiaoyi Lu, Nusrat S. Islam, Md. Wasi-Ur-Rahman, Jithin Jose, Hari Subramoni, Hao Wang, and Dhabaleswar K. Panda. High-performance design of Hadoop RPC with RDMA over InfiniBand. In Proc. of ICPP, 2013.
[16] Christopher Mitchell, Yifeng Geng, and Jinyang Li. Using one-sided RDMA reads to build a fast, CPU-efficient key-value store. In Proc. of ATC, 2013.
[17] Radhika Mittal, Vinh The Lam, Nandita Dukkipati, Emily Blem, Hassan Wassel, Monia Ghobadi, Amin Vahdat, Yaogong Wang, David Wetherall, and David Zats. TIMELY: RTT-based congestion control for the datacenter. In Proc. of SIGCOMM, 2015.
[18] Jose Carlos Sancho, Antonio Robles, and Jose Duato. An effective methodology to improve the performance of the up*/down* routing algorithm. IEEE Transactions on Parallel and Distributed Systems.
[19] Ankit Singla, Chi-Yao Hong, Lucian Popa, and P. Brighten Godfrey. Jellyfish: Networking data centers randomly. In Proc. of NSDI, 2012.
[20] Tor Skeie, Olav Lysne, and Ingebjørg Theiss. Layered shortest path (LASH) routing in irregular system area networks. In Proc. of IPDPS, 2012.
[21] Brent Stephens, Alan L. Cox, Ankit Singla, John Carter, Colin Dixon, and Wesley Felter. Practical DCB for improved data center networks. In Proc. of INFOCOM, 2014.
[22] Jie Wu. A fault-tolerant and deadlock-free routing protocol in 2D meshes based on odd-even turn model. IEEE Transactions on Computers.
[23] Hongyi Zeng, Shidong Zhang, Fei Ye, Vimalkumar Jeyakumar, Mickey Ju, Junda Liu, Nick McKeown, and Amin Vahdat. Libra: Divide and conquer to verify forwarding tables in huge networks. In Proc. of NSDI, 2014.
[24] Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. Congestion control for large-scale RDMA deployments. In Proc. of SIGCOMM, 2015.
[25] Yibo Zhu, Nanxi Kang, Jiaxin Cao, Albert Greenberg, Guohan Lu, Ratul Mahajan, Dave Maltz, Lihua Yuan, Ming Zhang, Ben Y. Zhao, and Haitao Zheng. Packet-level telemetry in large datacenter networks. In Proc. of SIGCOMM, 2015.