Track – Next Generation Data Center Fabrics: Do New Technologies Meet Operator Needs?
NANOG 76, 10 June 2019
1
2
Agenda
• Advancing Ethernet for Data Center Networks (Roger Marks, EthAirNet Associates)
• Towards Hyperscale High Performance Computing with RDMA (Omar Cardona, Microsoft)
• The IEEE P802.1Qcz Project on Congestion Isolation (Paul Congdon, Tallac Networks)
• Discussion: Identifying Problems and Inefficiencies in Current Data Center Operations
2
Advancing Ethernet for Data Center Networks
Roger Marks, roger@ethair.net, +1 802 capable
Prepared 3 June 2019
3
4
Disclaimer
• All speakers presenting information on IEEE standards speak as individuals, and their views should be considered the personal views of that individual rather than the formal position, explanation, or interpretation of the IEEE
4
5
IEEE 802 LAN/MAN Standards Committee (LMSC)
• Develops standards for Local Area Networks (LAN), Metropolitan Area Networks (MAN), Regional Area Networks (RAN), Personal Area Networks (PAN), Wireless Specialty Networks, etc.
• In operation since March 1980
5
6
Reference Model for End Stations
[Figure: OSI stack – Application / Presentation / Session / Transport / Network / Data Link / Physical – with IEEE 802 covering the Data Link and Physical layers over copper, fiber, air, and other media]
Lower-Layer Focus: IEEE 802 standards emphasize the functionality of the lowest two layers of the OSI reference model, and the higher layers as they relate to network management:
• physical layer (PHY; Layer 1)
• data link layer (DLL; Layer 2)
IEEE 802 divides the DLL into:
• Medium Access Control (MAC): multiple specifications
• Logical Link Control (LLC): common
See details in IEEE Std 802, "IEEE Standard for Local and Metropolitan Area Networks: Overview and Architecture"
6
[Figure: LLC and MAC sublayers of the Data Link Layer, above the Medium]
7
Nendica
• Nendica: IEEE 802 "Network Enhancements for the Next Decade" Industry Connections Activity
• IEEE Industry Connections activities "provide an efficient environment for building consensus and developing many different types of shared results. Such activities may complement, supplement, or be precursors of IEEE Standards projects"
• Organized under the IEEE 802.1 Working Group
• Chartered through March 2021
• Chair: Roger Marks
• Open to all participants; no membership
7
8
Nendica Motivation and Goals
• "The goal of this activity is to assess… emerging requirements for IEEE 802 wireless and higher-layer communication infrastructures, identify commonalities, gaps, and trends not currently addressed by IEEE 802 standards and projects, and facilitate building industry consensus towards proposals to initiate new standards development efforts"
8
9
Nendica Report The Lossless Network for Data Centers
• Paul Congdon, Editor
• Key messages regarding the data center:
  – Packet loss leads to large delays
  – Congestion leads to packet loss
  – Conventional methods are problematic
  – A Layer 3 network uses Layer 2 transport; action at Layer 2 can reduce congestion, and thereby loss
  – The paper is not specifying a "lossless" network, but describing a few prospective methods to progress towards a lossless data center network in the future
• The report is open to comment and may be revised
9
10
Use Cases The Lossless Network for Data Centers
• The scale of Data Center Networks continues to grow:
  – Online Data-Intensive (OLDI) Services
  – AI Deep Learning and Model Training
  – Cloud High-Performance Computing
  – Financial Trading
  – Distributed Storage
  – Non-Volatile Memory Express (NVMe) over Fabrics
• Data Center ⇒ High-Performance Computer
• New requirements for a lossless Ethernet fabric
10
11
Data Center Applications are distributed and latency-sensitive
11
Copyright © 2018 IEEE. All rights reserved.
…experience is highly dependent upon the system responsiveness, and even moderate delays of less than a second can have a measurable impact on individual queries and their associated advertising revenue. A large chunk of unavoidable delay, due to the speed of light, is inherently built into a system that uses the remote cloud as the source of decision and information. This puts even more pressure on the deadlines within the data center itself. To address these latency concerns, OLDI services deploy individual requests across thousands of servers simultaneously. The responses from these servers are coordinated and aggregated to form the best recommendations or answers. Delays in obtaining these answers are compounded by delayed or 'straggler' communication flows between the servers. This creates a long tail latency distribution in the data center for highly parallel applications. To combat tail latency, servers are often arranged in a hierarchy, as shown in Figure 1, with strict deadlines given to each tier to produce an answer. If valuable data arrives late because of latency in the network, the data is simply discarded and a sub-optimal answer may be returned. Studies have shown that the network becomes a significant component of overall data center latency when congestion occurs in the network [2].
Figure 1 – Parallel Application Hierarchy
The long tail of latency distribution in OLDI data centers can be caused by various factors [3]. One is simply related to the mix of traffic between control messages (mice) and data messages (elephants). While most of the flows in the data center are mice, most of the bytes transferred across the network are due to elephants. Therefore, a small number of elephant flows can delay the set-up of control channels established by mice flows. Since OLDI data centers are processing requests over thousands of servers simultaneously, the mix and interplay of mice and elephant flows is highly uncoordinated. An additional complexity is that flows can change behavior over time: what was once an elephant can transform into a mouse after an application has reached steady state. Another cause of latency is due to incast at the lower tiers of the node hierarchy. Leaf worker nodes return their answers to a common parent in the tree at nearly the same time. This can cause buffer over-runs and packet loss within an individual switch. It may invoke congestion management schemes, such as flow control or congestion notification, which have little effect on mice flows and tail latency.
• Tend toward congestion, e.g., due to incast
• Packet loss leads to retransmission, more congestion, more delay
Source: IEEE 802 Nendica Report, The Lossless Network for Data Centers [2]
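The loss-retransmission feedback loop described above can be illustrated with a toy queue model: dropped packets re-enter the offered load on the next round, so congestion compounds. All parameters here are illustrative, not taken from the report.

```python
# Toy discrete-time model of incast: N senders burst into one switch
# egress queue; tail-dropped packets are retransmitted next round,
# adding to the congestion. Parameters are hypothetical.
def incast_rounds(senders=32, pkts_each=10, queue_cap=100, rounds=4):
    backlog = 0                                  # packets awaiting retransmission
    history = []
    for _ in range(rounds):
        offered = senders * pkts_each + backlog  # fresh burst + retries
        queued = min(offered, queue_cap)         # queue accepts up to capacity
        dropped = offered - queued               # tail drop
        backlog = dropped                        # drops must be resent
        history.append((offered, dropped))
    return history

for offered, dropped in incast_rounds():
    print(f"offered={offered:4d}  dropped={dropped:4d}")  # drops grow each round
```

The point of the sketch: once the queue overflows, retransmissions alone guarantee that the next round's offered load is larger, exactly the spiral the slide summarizes.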
12
12
Remote Direct Memory Access (RDMA) in the Data Center
Source: InfiniBand Trade Association <http://www.roceinitiative.org/roce-introduction>
13
13
RoCE: RDMA over Converged Ethernet
• Dependent on the transport layer; intolerant of packet loss
• No protection against packet loss
• Typically no protection against packet loss
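One reason RDMA transports are so intolerant of loss: hardware implementations have commonly used go-back-N recovery, so a single drop forces retransmission of everything sent after it. A toy cost model (the window size and loss positions are illustrative):

```python
# Simplified go-back-N cost model: a drop at position p in a window
# forces retransmission of all packets after p. Numbers are illustrative.
def goback_n_cost(window=64, loss_positions=(10,)):
    """Total packets sent to deliver `window` packets under go-back-N,
    when the packets at `loss_positions` are each dropped once."""
    sent = window
    for pos in sorted(loss_positions):
        sent += window - pos          # everything after the loss is resent
    return sent

print(goback_n_cost())                       # one mid-window drop: 118 sends for 64 packets
print(goback_n_cost(loss_positions=(1, 2)))  # early drops nearly triple the cost
```

This is why even a small loss rate translates into a large goodput penalty for RoCE, and why the fabric underneath is expected to be (near-)lossless.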
14
Lossless Ethernet is a Foundation of the New Data Center
• Data Centers are becoming High-Performance Computers
• Most Data Center networking is at Layer 3/4, but it all rides on Layer 1/2
• Ethernet is the preferred option for Layer 1/2:
  – Ethernet infrastructure can support multiple upper layers (e.g., TCP and RoCEv2) simultaneously
• Most Data Center networking requires low latency
• Some Data Center networking requires a lossless Ethernet fabric
14
15
IEEE 802 Developed Data Center Bridging
• In the IEEE 802.1 Working Group, the Data Center Bridging Task Group developed many standards to enhance Layer 2 support for the data center, beginning around 2006
• Key technologies include:
  – Priority-based Flow Control (PFC)
  – Congestion Notification
  – Enhanced Transmission Selection (ETS)
15
16
Folded-Clos Network: Many Paths from Server to Server
16
[Figure: folded-Clos fabric – spine switches above top-of-rack switches, with servers in each rack]
17
Incast fills the output queue (note: ECMP cannot help)
17
[Figure: incast – ECMP spreads flows across the spines, but they converge on one rack switch's egress queue]
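Why ECMP cannot help here: equal-cost multipath hashes each flow onto a spine path, but incast flows share the same destination, so whatever paths they take, they all converge on the same last-hop egress link. A minimal sketch (the hash function and addresses are hypothetical):

```python
import hashlib

SPINES = 4  # equal-cost paths between leaf switches

def ecmp_path(src, dst, sport=12345, dport=4791, proto=17):
    """Pick a spine by hashing flow identifiers, as ECMP does."""
    key = f"{src},{dst},{sport},{dport},{proto}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % SPINES

# Eight workers answer the same parent server at once (incast).
flows = [(f"10.0.{i}.2", "10.0.99.2") for i in range(8)]
paths = [ecmp_path(src, dst) for src, dst in flows]
print("spine paths used:", sorted(set(paths)))          # flows spread across spines...
print("last-hop egress links:", {dst for _, dst in flows})  # ...but all meet at one port
```

ECMP balances load across the middle of the fabric; the congestion point in incast is the single egress queue at the edge, which no path choice avoids.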
18
Priority Flow Control (PFC)
[Figure: incast at a rack switch; PFC pause propagates upstream toward the spine]
• Output backup fills the ingress queue
• PFC can be used to pause input per QoS class
• IEEE 802.1Q (originally in 802.1Qbb)
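PFC's per-class pause behavior can be sketched as watermarks on an ingress queue: crossing an XOFF threshold sends a PAUSE for that priority; draining below XON resumes it. The thresholds and class below are illustrative, not values from 802.1Q.

```python
class PfcIngressQueue:
    """Toy per-priority ingress queue with PFC pause/resume watermarks."""
    def __init__(self, priority, xoff=80, xon=40):
        self.priority, self.xoff, self.xon = priority, xoff, xon
        self.depth = 0
        self.paused = False        # True => PAUSE frame sent upstream

    def enqueue(self, n=1):
        self.depth += n
        if not self.paused and self.depth >= self.xoff:
            self.paused = True     # pause this traffic class only

    def dequeue(self, n=1):
        self.depth = max(0, self.depth - n)
        if self.paused and self.depth <= self.xon:
            self.paused = False    # resume (PAUSE with zero quanta)

q = PfcIngressQueue(priority=3)
q.enqueue(100)      # incast burst crosses XOFF: upstream paused
print(q.paused)     # True
q.dequeue(70)       # drain below XON: upstream resumed
print(q.paused)     # False
```

Note the hysteresis between XOFF and XON: it prevents pause/resume flapping, but while paused, every flow in the class stops, which is exactly the "victim flow" problem the next slide shows.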
19
PFC pauses all flows of the class, including "victim" flows
19
[Figure: PFC stops both flows – the incast flow and a "victim" flow sharing the same class]
20
Traffic-class blocking analysis
Figure 11: Network performance degradation due to HOL blocking
…other flows in the network that are injected from other sources and are addressed to destinations different from X (Y and Z, respectively), so that they do not contribute to the congestion tree. However, as can be seen, the latter flows interact (i.e., share queues) with the congested flows at several points of the network; then the latter flows produce HOL blocking to the former ones. As a consequence, all the flows end up being injected at the same rate as the contributor ones, despite the fact that the links connected to Y and Z are not saturated. On the other hand, note that at the switch where the root is located, the HOL blocking appears due to the direct contention of the congested flows to access the root of the tree, but in other (upstream) switches, the HOL blocking appears due to congestion propagation through flow control. In general, the HOL blocking produced in the switch where congestion originates is called "low-order" HOL blocking, while the one produced at upstream switches is called "high-order" HOL blocking.
4 Reducing In-Network Congestion
Taking into account the problems derived from congestion trees, a natural approach to deal with them is trying to prevent their formation. This approach is actually a particular case…
Source: IEEE 802.1-19-0012-00 [4]
21
Congested-Flow Isolation (see IEEE Project P802.1Qcz [5])
21
[Figure: downstream and upstream switches, each with ingress ports (holding virtual queues) feeding egress ports; congested flows are separated from non-congested flows]
1. Identify the flow causing congestion and isolate it locally
2. CIP: signal to the neighbor when the congested queue fills
3. The upstream switch isolates the flow too, eliminating head-of-line (HoL) blocking
4. PFC: if the congested queue continues to fill, invoke PFC for lossless operation
Source: IEEE 802 Nendica Report, The Lossless Network for Data Centers [2]
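The four steps above can be sketched as switch-port logic. This is a simplified model of the P802.1Qcz idea, not the specified mechanism; the class name, thresholds, and signal format are hypothetical.

```python
class CiSwitchPort:
    """Toy model of congestion isolation at one switch port."""
    def __init__(self, cip_thresh=50, pfc_thresh=90):
        self.normal_q, self.isolated_q = [], []
        self.cip_thresh, self.pfc_thresh = cip_thresh, pfc_thresh
        self.isolated_flows = set()
        self.signals = []              # messages sent to the upstream neighbor

    def mark_congested(self, flow):
        # Step 1: identify the congesting flow and isolate it locally.
        self.isolated_flows.add(flow)

    def receive(self, flow, pkt):
        if flow not in self.isolated_flows:
            self.normal_q.append(pkt)  # victim flows keep moving
            return
        self.isolated_q.append(pkt)
        depth = len(self.isolated_q)
        if depth == self.cip_thresh:
            # Step 2: CIP tells the upstream switch to isolate too (Step 3),
            # which eliminates head-of-line blocking there.
            self.signals.append(("CIP", flow))
        if depth == self.pfc_thresh:
            # Step 4: congested queue still filling -> invoke PFC for losslessness.
            self.signals.append(("PFC", flow))

port = CiSwitchPort()
port.mark_congested("elephant")
for i in range(95):
    port.receive("elephant", i)
    port.receive("mouse", i)
print(port.signals)        # [('CIP', 'elephant'), ('PFC', 'elephant')]
print(len(port.normal_q))  # 95 -- the victim flow is unaffected
```

The key property the sketch shows: only the isolated flow's queue backs up and eventually triggers PFC, so the pause no longer stalls well-behaved flows in the same class.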
22
Congested-Flow Isolation: Analysis
Source: IEEE 802.1-19-0012-00 [4]
23
Figure 14: Isolating contributor flows eliminates HOL blocking
…other flows, so that if HOL blocking were prevented, congestion trees could exist without actually being a problem on their own.
Although this basic approach has been followed by very different techniques, in general a common advantage of all of them is that they deal with HOL blocking fast and locally, as they attack the HOL blocking when and where it may appear. Therefore, they can complement some techniques explained above that react slowly to congestion (see Section 7). Another common feature of all these techniques is that they configure several queues per port to separately store packets belonging to different flows, hence reducing or eliminating the interaction among these flows, and so the HOL blocking they can produce on each other. Hence, the ultimate goal of these techniques is to isolate in separate queues the flows that contribute to congestion, which are the ones producing significant HOL blocking. This can be seen in Figure 15, which shows the same situation depicted in Figure 11, but now with several queues configured per port to separate the flows addressed to different destinations. As can be seen, the separation of flows into different queues leads to the elimination of HOL blocking, and so to an improvement in the performance of the network, as the utilization of several links increases with respect to the situation where the flows share queues.
Note that HOL blocking would be ideally and completely eliminated if there were, at every port, a queue specifically dedicated to each destination [10], but this is obviously unaffordable, as the number of queues in the network would grow quadratically with…
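The quadratic growth the excerpt refers to is easy to make concrete: one queue per destination per port means queue count scales with (ports × hosts), and in a Clos fabric the total port count itself grows roughly linearly with host count. A quick illustration (the fabric sizing factor is hypothetical):

```python
def queues_per_destination_scheme(n_hosts, ports_per_host=3):
    """Queues needed if every fabric port keeps one queue per destination.

    Rough model: total fabric ports grow linearly with host count
    (ports_per_host is an illustrative folded-Clos sizing factor),
    so total queues grow with the square of the host count.
    """
    total_ports = ports_per_host * n_hosts
    return total_ports * n_hosts   # one queue per destination on every port

for n in (100, 1_000, 10_000):
    print(f"{n:6d} hosts -> {queues_per_destination_scheme(n):,} queues")
```

Growing the fabric 10× multiplies the queue requirement 100×, which is why congestion isolation settles for a small number of isolation queues per port instead of per-destination queues.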
23
Bibliography
1) IEEE 802 Orientation for New Participants, https://mentor.ieee.org/802-ec/dcn/18/ec-18-0117-02.pdf
2) IEEE 802 "Network Enhancements for the Next Decade" Industry Connections Activity (Nendica), https://1.ieee802.org/802-nendica
3) IEEE 802 Nendica Report, "The Lossless Network for Data Centers" (18 August 2018), https://mentor.ieee.org/802.1/dcn/18/1-18-0042-00.pdf
4) Paul Congdon, "The Lossless Network in the Data Center," IEEE 802.1-17-0007-01, 7 November 2017, https://mentor.ieee.org/802.1/dcn/17/1-17-0007-01.pdf
5) Pedro Javier Garcia, Jesus Escudero-Sahuquillo, Francisco J. Quiles, and Jose Duato, "Congestion Management for Ethernet-based Lossless Data Center Networks," IEEE 802.1-19-0012-00, 4 February 2019, https://mentor.ieee.org/802.1/dcn/19/1-19-0012-00.pdf
6) IEEE P802.1Qcz Project, "Congestion Isolation," https://1.ieee802.org/tsn/802-1qcz
23
24
Going forward
• IEEE 802 Nendica Report "The Lossless Network for Data Centers" (18 August 2018) is published but open to further comment; comments are welcome from all
• Could open an activity to revise the report, addressing new issues; proposal [5] may be discussed in a future teleconference
• The report could help identify issues in need of further research or unified action
• Nendica could instigate standardization of key topics, mainly in IEEE 802, but perhaps also in, e.g., the IETF
24
25
Nendica Participation
• Nendica is open to all participants; please join in (no membership requirements):
  – Comment by Internet or contribute documents
  – Call in to teleconferences
  – Attend meetings
• Nendica web site: https://1.ieee802.org/802-nendica (details of in-person and teleconference meetings)
• Feel free to contact the Chair (see first slide)
25
4 Reducing In-Network Congestion
Taking into account the problems derived from congestion trees a natural approach to dealwith them is trying to prevent their formation This approach is actually a particular case
Source IEEE 8021-19-0012-00 [4]
21
Congested-flow Isolation(see IEEE Project P8021Qcz [5])
21
DownstreamUpstream1 3 1 3
2
4
2
4Ingress Port
(Virtual Queues)Egress Port Ingress Port
(Virtual Queues)Egress Port
Congested Flows
Non-Congested Flows
1 Identify the flow causing congestion and isolate locally
CIP 2 Signal to neighbor when congested queue fills
Eliminate HoL Blocking
3 Upstream isolates the flow too eliminating head-of-line blocking
PFC 4 If congested queue continues to fill invoke PFC for lossless
Source IEEE 802 Nendica Report The Lossless Network for Data Centers [2]
22
Congested-flow Isolation Analysis22
Source IEEE 8021-19-0012-00 [4]
23
Figure 14 Isolating contributor flows eliminates HOL blocking
other flows so that if HOL blocking were prevented congestion trees could exist withoutactually being a problem on their own
Although this basic approach have been followed by very dicrarrerent techniques in generala common advantage of all of them is that they deal with HOL blocking fast and locallyas they attack the HOL blocking when and where it may appear Therefore they cancomplement some techniques explained above that react slowly to congestion (see Section7) Another common issue of all these techniques is that they configure several queues perport to separately store packets belonging to dicrarrerent flows hence reducing or eliminatingthe interaction among these flows and so the HOL blocking they can produce each otherHence the ultimate goal of these techniques is to isolate in separate queues the flows thatcontribute to congestion which are the ones producing significant HOL blocking This canbe seen in Figure 15 that shows the same situation depicted in Figure 11 but now withseveral queues configured per port to separate the flows addressed to dicrarrerent destinationsAs can be seen the separation of flows into dicrarrerent queues leads to the elimination ofHOL blocking so to an improvement of the performance of the network as the utilizationof several links increases with respect to the situation that the flows share queues
Note that the HOL blocking would be ideally and completely eliminated if there wereat every port a queue specifically dedicated to each destination [10] but this is obvi-ously unacrarrordable as the number of queues in the network would grow quadratically with
23
Bibliography1) IEEE 802 Orientation for New Participants httpsmentorieeeorg802-ecdcn18ec-18-0117-02pdf
2) IEEE 802 ldquoNetwork Enhancements for the Next Decaderdquo Industry Connections Activity (Nendica) https1ieee802org802-nendica
3) IEEE 802 Nendica Report ldquoThe Lossless Network for Data Centersrdquo (18 August 2018) httpsmentorieeeorg8021dcn181-18-0042-00pdf
4) Paul Congdon ldquoThe Lossless Network in the Data Centerrdquo IEEE 8021-17-0007-01 7 November 2017 httpsmentorieeeorg8021dcn171-17-0007-01pdf
5) Pedro Javier Garcia Jesus Escudero-Sahuquillo Francisco J Quiles and Jose Duato ldquoCongestion Management for Ethernet-based Lossless DataCenterNetworksrdquo IEEE 8021-19-0012-00 4 February 2012 httpsmentorieeeorg8021dcn191-19-0012-00pdf
6) IEEE P8021Qcz Project ldquoCongestion Isolationrdquo https1ieee802orgtsn802-1qcz
23
24
Going forward
bull IEEE 802 Nendica Report ldquoThe Lossless Network for Data Centersrdquo (18 August 2018) is published but open to further comment Comments are welcome from all
bull Could open an activity to revise the report addressing new issues Proposal [5] may be discussed in future teleconference
bull Report could help identify issues in need of further research or unified action
bull Nendica could instigate standardization of key topics Mainly in IEEE 802 perhaps also in eg IETF
24
25
Nendica Participation
bull Nendica is open to all participants please join in no membership requirements Comment by Internet or contribute documents Call in to teleconferences Attend meetings
bull Nendica web site https1ieee802org802-nendica Details of in-person and teleconference meetings
bull Feel free to contact the Chair (see first slide)
25
• Organized under the IEEE 802.1 Working Group
• Chartered through March 2021
• Chair: Roger Marks
• Open to all participants; no membership

7
8

Nendica Motivation and Goals
• "The goal of this activity is to assess… emerging requirements for IEEE 802 wireless and higher-layer communication infrastructures, identify commonalities, gaps, and trends not currently addressed by IEEE 802 standards and projects, and facilitate building industry consensus towards proposals to initiate new standards development efforts."
9

Nendica Report: The Lossless Network for Data Centers
• Paul Congdon, Editor
• Key messages regarding the data center:
  – Packet loss leads to large delays
  – Congestion leads to packet loss
  – Conventional methods are problematic
  – A Layer 3 network uses Layer 2 transport; action at Layer 2 can reduce congestion and thereby loss
  – The paper is not specifying a "lossless" network but describing a few prospective methods to progress towards a lossless data center network in the future
• The report is open to comment and may be revised
10

Use Cases: The Lossless Network for Data Centers
• The scale of Data Center Networks continues to grow:
  – Online Data Intensive (OLDI) Services
  – AI / Deep Learning and Model Training
  – Cloud High-Performance Computing
  – Financial Trading
  – Distributed Storage
  – Non-Volatile Memory Express (NVMe) over Fabrics
• Data Center => High Performance Computer
• New requirements for lossless Ethernet fabric
11

Data Center Applications are distributed and latency-sensitive
…experience is highly dependent upon the system responsiveness, and even moderate delays of less than a second can have a measurable impact on individual queries and their associated advertising revenue. A large chunk of unavoidable delay due to the speed of light is inherently built into a system that uses the remote cloud as the source of decision and information. This puts even more pressure on the deadlines within the data center itself. To address these latency concerns, OLDI services deploy individual requests across thousands of servers simultaneously. The responses from these servers are coordinated and aggregated to form the best recommendations or answers. Delays in obtaining these answers are compounded by delayed or 'straggler' communication flows between the servers. This creates a long tail latency distribution in the data center for highly parallel applications. To combat tail latency, servers are often arranged in a hierarchy, as shown in Figure 1, with strict deadlines given to each tier to produce an answer. If valuable data arrives late because of latency in the network, the data is simply discarded and a sub-optimal answer may be returned. Studies have shown that the network becomes a significant component of overall data center latency when congestion occurs in the network [2].

Figure 1 – Parallel Application Hierarchy

The long tail of the latency distribution in OLDI data centers can be caused by various factors [3]. One is simply related to the mix of traffic between control messages (mice) and data messages (elephants). While most of the flows in the data center are mice, most of the bytes transferred across the network are due to elephants. Therefore, a small number of elephant flows can delay the set-up of control channels established by mice flows. Since OLDI data centers are processing requests over thousands of servers simultaneously, the mix and interplay of mice and elephant flows is highly uncoordinated. An additional complexity is that flows can change behavior over time: what was once an elephant can transform into a mouse after an application has reached steady state. Another cause of latency is incast at the lower tiers of the node hierarchy. Leaf worker nodes return their answers to a common parent in the tree at nearly the same time. This can cause buffer over-runs and packet loss within an individual switch. It may invoke congestion management schemes such as flow-control or congestion notification, which have little effect on mice flows and tail latency.
• Tend toward congestion, e.g. due to incast
• Packet loss leads to retransmission: more congestion, more delay

Source: IEEE 802 Nendica Report, "The Lossless Network for Data Centers" [2]
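The fan-out arithmetic behind this tail-latency problem can be sketched numerically (an illustrative back-of-the-envelope calculation, not taken from the report): even when each worker individually meets its per-tier deadline 99% of the time, a request spread across hundreds of workers almost always waits on at least one straggler.

```python
# Tail-at-scale sketch: probability that a fanned-out request is held up
# by at least one straggler, assuming each worker independently misses
# its per-tier deadline with probability p_miss.

def p_straggler(p_miss: float, fanout: int) -> float:
    """P(at least one of `fanout` workers misses its deadline)."""
    return 1.0 - (1.0 - p_miss) ** fanout

# Each worker meets its deadline 99% of the time (p_miss = 0.01).
for n in (1, 10, 100, 1000):
    print(f"fan-out {n:4d}: {p_straggler(0.01, n):6.1%} of requests see a straggler")
```

With p_miss = 0.01 the straggler probability climbs from 1% at fan-out 1 to roughly 63% at fan-out 100, which is why each tier enforces a deadline and simply discards late answers.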
12
Remote Direct Memory Access (RDMA) in the Data Center

Source: InfiniBand Trade Association, <http://www.roceinitiative.org/roce-introduction>

13
RoCE: RDMA over Converged Ethernet
• Dependent on transport layer; intolerant of packet loss
• No protection against packet loss
• Typically no protection against packet loss
14

Lossless Ethernet is a Foundation of the New Data Center
• Data Centers are becoming High Performance Computers
• Most Data Center networking is at Layer 3/4, but it all rides on Layer 1/2
• Ethernet is the preferred option for Layer 1/2
  – Ethernet infrastructure can support multiple upper layers (e.g. TCP and RoCEv2) simultaneously
• Most Data Center networking requires low latency
• Some Data Center networking requires a lossless Ethernet fabric
15

IEEE 802 Developed Data Center Bridging
• In the IEEE 802.1 Working Group, the Data Center Bridging Task Group developed many standards to enhance Layer 2 support for the data center, beginning around 2006
• Key technologies include:
  – Priority-based Flow Control (PFC)
  – Congestion Notification
  – Enhanced Transmission Selection (ETS)
16

Folded-Clos Network: Many Paths from Server to Server
[Figure: folded-Clos topology with spine switches, racks, and servers]
17

Incast fills output queue (note: ECMP cannot help)
[Figure: incast flows converging across the folded-Clos fabric of spines, racks, and servers; ECMP spreads the flows, but the shared output queue still fills]
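The "ECMP cannot help" point can be illustrated with a small sketch (hypothetical hash-based path selection; the names and the md5-based hash are invented for illustration): per-flow hashing spreads the incast flows across different spine paths, but they all still converge on the single last-hop link to the common parent server.

```python
# ECMP sketch: each flow's 5-tuple hash picks one of several equal-cost
# spine paths, but incast flows share a destination, so they all funnel
# into the same last-hop link regardless of the path chosen.
import hashlib

N_SPINES = 4

def ecmp_spine(src: str, dst: str, sport: int, dport: int, proto: str = "tcp") -> int:
    """Deterministically hash the 5-tuple to an equal-cost spine index."""
    key = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % N_SPINES

# Eight leaf workers answer the same parent (10.0.0.1:5000) at once.
flows = [(f"10.0.1.{i}", "10.0.0.1", 40000 + i, 5000) for i in range(8)]
spines = [ecmp_spine(*f) for f in flows]
last_hops = {f[1] for f in flows}          # every flow exits toward 10.0.0.1

print("spine per flow:", spines)           # typically spread across spines
print("distinct last-hop destinations:", last_hops)
```

ECMP balances *paths*, not *destinations*: since every incast flow terminates at the same server, the bottleneck output queue is shared no matter which spine each flow hashed to.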
18

Priority flow control (PFC)
[Figure: PFC pause propagating upstream through spines, racks, and servers from the incast point]
• Output backup fills ingress queue
• PFC can be used to pause input per QoS class
• IEEE 802.1Q (originally in 802.1Qbb)
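A minimal sketch of the per-class pause mechanism (the XOFF/XON watermark values and message strings are invented for illustration; real thresholds depend on buffer and cable sizing):

```python
# PFC sketch: an ingress queue per priority class; crossing the XOFF
# watermark sends a PAUSE upstream for that class only, and draining
# below XON sends a resume. Thresholds are illustrative.

XOFF, XON = 80, 40   # queue depths (e.g. in buffer cells) triggering pause/resume

class IngressQueue:
    def __init__(self, priority: int):
        self.priority = priority
        self.depth = 0
        self.paused = False

    def enqueue(self, n: int = 1):
        self.depth += n
        if not self.paused and self.depth >= XOFF:
            self.paused = True
            return f"PAUSE priority {self.priority}"   # sent to upstream neighbor
        return None

    def dequeue(self, n: int = 1):
        self.depth = max(0, self.depth - n)
        if self.paused and self.depth <= XON:
            self.paused = False
            return f"RESUME priority {self.priority}"
        return None

q = IngressQueue(priority=3)
msgs = [m for _ in range(100) if (m := q.enqueue())]
print(msgs)   # a single PAUSE once the XOFF watermark is crossed
```

Note the pause applies to the whole priority class, not to individual flows, which is exactly why "victim" flows of the same class stall too (next slide).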
19

PFC pauses all flows of the class, including "victim" flows
[Figure: PFC stops both the incast flow and an unrelated "victim" flow sharing the same class across the spine/rack/server fabric]
20

Traffic-class blocking analysis

Figure 11: Network performance degradation due to HOL blocking

…other flows in the network that are injected from other sources and are addressed to destinations different from X (Y and Z, respectively), so that they do not contribute to the congestion tree. However, as can be seen, the latter flows interact (i.e., share queues) with the congested flows at several points of the network; the latter flows then produce HOL blocking to the former ones. As a consequence, all the flows end up being injected at the same rate as the contributor ones, despite the fact that the links connected to Y and Z are not saturated. On the other hand, note that at the switch where the root is located, the HOL blocking appears due to the direct contention of the congested flows to access the root of the tree, but in other (upstream) switches the HOL blocking appears due to congestion propagation through flow control. In general, the HOL blocking produced in the switch where congestion originates is called "low-order" HOL blocking, while the one produced at upstream switches is called "high-order" HOL blocking.

4 Reducing In-Network Congestion

Taking into account the problems derived from congestion trees, a natural approach to deal with them is trying to prevent their formation. This approach is actually a particular case

Source: IEEE 802.1-19-0012-00 [4]
21

Congested-flow Isolation (see IEEE Project P802.1Qcz [5])
[Figure: upstream and downstream switches, each with ingress ports (virtual queues) and egress ports; congested flows are queued separately from non-congested flows]
1. Identify the flow causing congestion and isolate it locally
2. CIP: signal to neighbor when the congested queue fills
3. Upstream isolates the flow too, eliminating head-of-line (HoL) blocking
4. PFC: if the congested queue continues to fill, invoke PFC for lossless operation

Source: IEEE 802 Nendica Report, "The Lossless Network for Data Centers" [2]
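The four numbered steps can be sketched as a toy model (the thresholds, class names, and message strings here are invented; P802.1Qcz defines the real CIP message and queue behavior):

```python
# Congestion Isolation sketch: a port keeps a normal queue and a
# congested-flow queue. A flow identified as a congestion contributor is
# isolated locally (step 1), the upstream neighbor is signaled via CIP so
# it isolates the flow too (steps 2-3), and PFC is the last resort,
# applied to the congested queue only (step 4). Thresholds illustrative.

ISOLATE_AT, CIP_AT, PFC_AT = 20, 40, 60   # invented queue depths

class CIPort:
    def __init__(self):
        self.normal, self.congested = [], []
        self.isolated = set()             # flow IDs moved to the congested queue

    def receive(self, flow_id: str):
        events = []
        queue = self.congested if flow_id in self.isolated else self.normal
        queue.append(flow_id)
        # Step 1: identify the contributing flow and isolate it locally.
        if flow_id not in self.isolated and len(self.normal) == ISOLATE_AT:
            self.isolated.add(flow_id)
            events.append(f"isolate {flow_id}")
        # Steps 2-3: signal upstream (CIP) so the neighbor isolates it too.
        if len(self.congested) == CIP_AT:
            events.append("send CIP upstream")
        # Step 4: only if the congested queue keeps filling, PFC it.
        if len(self.congested) == PFC_AT:
            events.append("PFC pause congested queue")
        return events

port = CIPort()
log = [e for _ in range(100) for e in port.receive("elephant")]
print(log)
```

The key design point: because only the congested queue is ever paused, mice and other well-behaved flows keep flowing, avoiding the victim-flow HoL blocking of plain PFC.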
22

Congested-flow Isolation Analysis

Source: IEEE 802.1-19-0012-00 [4]
23

Figure 14: Isolating contributor flows eliminates HOL blocking

…other flows, so that if HOL blocking were prevented, congestion trees could exist without actually being a problem on their own.

Although this basic approach has been followed by very different techniques, in general a common advantage of all of them is that they deal with HOL blocking fast and locally, as they attack the HOL blocking when and where it may appear. Therefore, they can complement some techniques explained above that react slowly to congestion (see Section 7). Another common feature of all these techniques is that they configure several queues per port to separately store packets belonging to different flows, hence reducing or eliminating the interaction among these flows and so the HOL blocking they can produce on each other. Hence, the ultimate goal of these techniques is to isolate in separate queues the flows that contribute to congestion, which are the ones producing significant HOL blocking. This can be seen in Figure 15, which shows the same situation depicted in Figure 11, but now with several queues configured per port to separate the flows addressed to different destinations. As can be seen, the separation of flows into different queues leads to the elimination of HOL blocking, and so to an improvement of the performance of the network, as the utilization of several links increases with respect to the situation where the flows share queues.

Note that the HOL blocking would be ideally and completely eliminated if there were at every port a queue specifically dedicated to each destination [10], but this is obviously unaffordable, as the number of queues in the network would grow quadratically with
Bibliography
1) IEEE 802 Orientation for New Participants: https://mentor.ieee.org/802-ec/dcn/18/ec-18-0117-02.pdf
2) IEEE 802 "Network Enhancements for the Next Decade" Industry Connections Activity (Nendica): https://1.ieee802.org/802-nendica
3) IEEE 802 Nendica Report, "The Lossless Network for Data Centers" (18 August 2018): https://mentor.ieee.org/802.1/dcn/18/1-18-0042-00.pdf
4) Paul Congdon, "The Lossless Network in the Data Center," IEEE 802.1-17-0007-01, 7 November 2017: https://mentor.ieee.org/802.1/dcn/17/1-17-0007-01.pdf
5) Pedro Javier Garcia, Jesus Escudero-Sahuquillo, Francisco J. Quiles, and Jose Duato, "Congestion Management for Ethernet-based Lossless Data Center Networks," IEEE 802.1-19-0012-00, 4 February 2019: https://mentor.ieee.org/802.1/dcn/19/1-19-0012-00.pdf
6) IEEE P802.1Qcz Project, "Congestion Isolation": https://1.ieee802.org/tsn/802-1qcz

23
24

Going forward
• The IEEE 802 Nendica Report "The Lossless Network for Data Centers" (18 August 2018) is published but open to further comment; comments are welcome from all
• Could open an activity to revise the report, addressing new issues; proposal [5] may be discussed in a future teleconference
• The report could help identify issues in need of further research or unified action
• Nendica could instigate standardization of key topics, mainly in IEEE 802, perhaps also in e.g. the IETF
25

Nendica Participation
• Nendica is open to all participants; please join in, no membership requirements
  – Comment by Internet or contribute documents
  – Call in to teleconferences
  – Attend meetings
• Nendica web site: https://1.ieee802.org/802-nendica
  – Details of in-person and teleconference meetings
• Feel free to contact the Chair (see first slide)
5
IEEE 802 The LANMAN Standard Committee (LMSC) IEEE 802 LANMAN Standards Committee (LMSC) Develops standards for Local Area Networks (LAN) Metropolitan Area Networks (MAN) Regional Area Networks (RAN) Personal Area Networks (PAN) Wireless Specialty Networks etc In operation since March 1980
5
6
Reference Model for
End Stations
Application
Presentation
Session
Transport
Network
Data Link
Physical
bullCopperbullFiberbullAirbullOthers
IEEE802
Lower-Layer Focus IEEE 802 standards emphasize
the functionality of the lowest two layers of the OSI reference model and the higher layers as they relate to network management physical layer (PHY Layer 1) data link layer (DLL Layer 2)
IEEE 802 divides DLL into Medium Access Control (MAC)
Multiple specifications Common logical link control (LLC)
See details in IEEE Std 802 ldquoIEEE Standard for Local and
Metropolitan Area Networks Overview and Architecturerdquo
6
MAC
LLC
Medium
7
Nendicabull Nendica IEEE 802 ldquoNetwork Enhancements for the Next
Decaderdquo Industry Connections Activity IEEE Industry Connections Activities ldquoprovide an efficient
environment for building consensus and developing many different types of shared results Such activities may complement supplement or be precursors of IEEE Standards projectsrdquo
bull Organized under the IEEE 8021 Working Groupbull Chartered through March 2021bull Chair Roger Marksbull Open to all participants no membership
7
8
Nendica Motivation and Goalsbull ldquoThe goal of this activity is to assesshellip emerging
requirements for IEEE 802 wireless and higher-layer communication infrastructures identify commonalities gaps and trends not currently addressed by IEEE 802 standards and projects and facilitate building industry consensus towards proposals to initiate new standards development effortsrdquo
8
9
Nendica Report The Lossless Network for Data Centers
bull Paul Congdon Editorbull Key messages regarding the data center Packet loss leads to large delays Congestion leads to packet loss Conventional methods are problematic A Layer 3 network uses Layer 2 transport action at
Layer 2 can reduce congestion and thereby loss The paper is not specifying a ldquolosslessrdquo network but
describing a few prospective methods to progress towards a lossless data center network in the future
bull The report is open to comment and may be revised
9
10
Use Cases The Lossless Network for Data Centers
bull The scale of Data Center Networks continues to grow Online Data Intensive (OLDI) Services AI Deep Learning and Model Training Cloud High-Performance Computing Financial Trading Distributed Storage Non-Volatile Memory Express (NVMe) over Fabrics
bull Data Center =gt High Performance Computerbull New requirements for lossless Ethernet fabric
10
11
Data Center Applications are distributed and latency-sensitive
11
3
Copyright copy 2018 IEEE All rights reserved
experience is highly dependent upon the system responsiveness and even moderate delays of less than a second can have a measurable impact on individual queries and their associated advertising revenue A large chunk of unavoidable delay due to the speed of light is inherently built into a system that uses the remote cloud as the source of decision and information This puts even more pressure on the deadlines within the data center itself To address these latency concerns OLDI services deploy individual requests across thousands of servers simultaneously The responses from these servers are coordinated and aggregated to form the best recommendations or answers Delays in obtaining these answers are compounded by delayed or lsquostragglerrsquo communication flows between the servers This creates a long tail latency distribution in the data center for highly parallel applications To combat tail latency servers are often arranged in a hierarchy as shown in Figure 1 with strict deadlines given to each tier to produce an answer If valuable data arrives late because of latency in the network the data is simply discarded and a sub‐optimal answer may be returned Studies have shown that the network becomes a significant component of overall data center latency when congestion occurs in the network [2]
Figure 1 ndash Parallel Application Hierarchy
The long tail of latency distribution in OLDI data centers can be caused by various factors [3] One is simply related to the mix of traffic between control messages (mice) and data messages (elephants) While most of the flows in the data center are mice most of the bytes transferred across the network are due to elephants Therefore a small number of elephant flows can delay the set‐up of control channels established by mice flows Since OLDI data centers are processing requests over thousands of servers simultaneously the mix and interplay of mice and elephant flows is highly uncoordinated An additional complexity is that flows can change behavior over time what was once an elephant can transform into a mouse after an application has reached steady state Another cause of latency is due to incast at the lower tiers of the node hierarchy Leaf worker nodes return their answers to a common parent in the tree at nearly the same time This can cause buffer over‐runs and packet loss within an individual switch It may invoke congestion management schemes such as flow‐control or congestion notification which have little effect on mice flows and tail latency
Authorized licensed use limited to Roger Marks Downloaded on September 282018 at 034705 UTC from IEEE Xplore Restrictions apply
bull Tend toward congestion eg due to incastbull Packet loss leads to retransmission more
congestion more delaySource IEEE 802 Nendica Report The Lossless Network for Data Centers [2]
12
12
Remote Direct Memory Access (RDMA) in the Data Center
Source InfiniBand Trade Association lthttpwwwroceinitiativeorgroce-introductiongt
13
13
RoCE RDMA over Converged Ethernet
Dependent on transport layerIntolerant of packet loss
No protection against packet loss
Typically no protectionagainst packet loss
14
Lossless Ethernet is a Foundation of the New Data Center
bull Data Centers are becoming High Performance Computersbull Most Data Center networking is at Layer 34 but it all
rides on Layer 12bull Ethernet is preferred option for Layer 12 Ethernet infrastructure can support multiple upper layers
(eg TCP and RoCEv2) simultaneouslybull Most Data Center networking requires low latency bull Some Data Center networking requires a lossless
Ethernet fabric
14
15
IEEE 802 DevelopedData Center Bridging
bull In the IEEE 8021 Working Group the Data Center Bridging Task Group developed many standards to enhance Layer 2 support for the data center Beginning around 2006
bull Key technologies include Priority-based Flow Control (PFC) Congestion Notification Enhanced Transmission Selection (ETS)
15
16
Folded-Clos NetworkMany Paths from Server to Server
16
server
spine
rack
17
Incast fills output queue(note ECMP cannot help)
17
server
spine
rack
ECMPincast
18
Priority flow control (PFC)18
server
spine
rack
PFC
incast
bull Output backup fills ingress queuebull PFC can be used to pause input per QoS classbull IEEE 8021Q (originally in 8021Qbb)
19
PFC pauses all flows of the classincluding ldquovictimrdquo flows
19
server
spine
rack
PFC stops both flows
incast
20
Traffic-class blocking analysis20
17
Figure 11 Network performance degradation due to HOL blocking
other flows in the network that are injected from other sources and are addressed todestinations dicrarrerent from X (Y and Z respectively) so that they do not contribute to thecongestion tree However as can be seen the latter flows interact (ie share queues) withthe congested flows at several points of the network then the latter flows produce HOLblocking to the former ones As a consequence all the flows end up being injected at thesame rate as the contributor ones despite the fact that the links connected to Y and Zare not saturated On the other hand note that at the switch where the root is locatedthe HOL blocking appears due to the direct contention of the congested flows to accessthe root of the tree but in other (upstream) switches the HOL blocking appears due tocongestion propagation through flow control In general the HOL blocking produced inthe switch where congestion originates is called ldquolow-orderrdquo HOL blocking while the oneproduced at upstream switches is called ldquohigh-orderrdquo HOL blocking
4 Reducing In-Network Congestion
Taking into account the problems derived from congestion trees a natural approach to dealwith them is trying to prevent their formation This approach is actually a particular case
Source IEEE 8021-19-0012-00 [4]
21
Congested-flow Isolation(see IEEE Project P8021Qcz [5])
21
DownstreamUpstream1 3 1 3
2
4
2
4Ingress Port
(Virtual Queues)Egress Port Ingress Port
(Virtual Queues)Egress Port
Congested Flows
Non-Congested Flows
1 Identify the flow causing congestion and isolate locally
CIP 2 Signal to neighbor when congested queue fills
Eliminate HoL Blocking
3 Upstream isolates the flow too eliminating head-of-line blocking
PFC 4 If congested queue continues to fill invoke PFC for lossless
Source IEEE 802 Nendica Report The Lossless Network for Data Centers [2]
22
Congested-flow Isolation Analysis22
Source IEEE 8021-19-0012-00 [4]
23
Figure 14 Isolating contributor flows eliminates HOL blocking
other flows so that if HOL blocking were prevented congestion trees could exist withoutactually being a problem on their own
• Organized under the IEEE 802.1 Working Group
• Chartered through March 2021
• Chair: Roger Marks
• Open to all participants; no membership requirements
7
8
Nendica Motivation and Goals
• "The goal of this activity is to assess… emerging requirements for IEEE 802 wireless and higher-layer communication infrastructures, identify commonalities, gaps, and trends not currently addressed by IEEE 802 standards and projects, and facilitate building industry consensus towards proposals to initiate new standards development efforts"
8
9
Nendica Report: The Lossless Network for Data Centers
• Paul Congdon, Editor
• Key messages regarding the data center:
  – Packet loss leads to large delays
  – Congestion leads to packet loss
  – Conventional methods are problematic
  – A Layer 3 network uses Layer 2 transport; action at Layer 2 can reduce congestion and thereby loss
  – The paper is not specifying a "lossless" network but describing a few prospective methods to progress towards a lossless data center network in the future
bull The report is open to comment and may be revised
9
10
Use Cases The Lossless Network for Data Centers
bull The scale of Data Center Networks continues to grow Online Data Intensive (OLDI) Services AI Deep Learning and Model Training Cloud High-Performance Computing Financial Trading Distributed Storage Non-Volatile Memory Express (NVMe) over Fabrics
bull Data Center =gt High Performance Computerbull New requirements for lossless Ethernet fabric
10
11
Data Center Applications are distributed and latency-sensitive
11
Copyright © 2018 IEEE. All rights reserved.
experience is highly dependent upon the system responsiveness, and even moderate delays of less than a second can have a measurable impact on individual queries and their associated advertising revenue. A large chunk of unavoidable delay due to the speed of light is inherently built into a system that uses the remote cloud as the source of decision and information. This puts even more pressure on the deadlines within the data center itself. To address these latency concerns, OLDI services deploy individual requests across thousands of servers simultaneously. The responses from these servers are coordinated and aggregated to form the best recommendations or answers. Delays in obtaining these answers are compounded by delayed or 'straggler' communication flows between the servers. This creates a long tail latency distribution in the data center for highly parallel applications. To combat tail latency, servers are often arranged in a hierarchy, as shown in Figure 1, with strict deadlines given to each tier to produce an answer. If valuable data arrives late because of latency in the network, the data is simply discarded and a sub-optimal answer may be returned. Studies have shown that the network becomes a significant component of overall data center latency when congestion occurs in the network [2].
Figure 1 – Parallel Application Hierarchy
The long tail of latency distribution in OLDI data centers can be caused by various factors [3]. One is simply related to the mix of traffic between control messages (mice) and data messages (elephants). While most of the flows in the data center are mice, most of the bytes transferred across the network are due to elephants. Therefore, a small number of elephant flows can delay the set-up of control channels established by mice flows. Since OLDI data centers are processing requests over thousands of servers simultaneously, the mix and interplay of mice and elephant flows is highly uncoordinated. An additional complexity is that flows can change behavior over time: what was once an elephant can transform into a mouse after an application has reached steady state. Another cause of latency is due to incast at the lower tiers of the node hierarchy. Leaf worker nodes return their answers to a common parent in the tree at nearly the same time. This can cause buffer over-runs and packet loss within an individual switch. It may invoke congestion management schemes such as flow-control or congestion notification, which have little effect on mice flows and tail latency.
• Tend toward congestion, e.g. due to incast
• Packet loss leads to retransmission, more congestion, more delay
Source: IEEE 802 Nendica Report "The Lossless Network for Data Centers" [2]
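The fan-out arithmetic behind that long tail can be made concrete. A minimal sketch, assuming an illustrative per-worker deadline-miss probability of 0.1% and fan-outs of 1, 100, and 1000 workers (these numbers are assumptions for illustration, not figures from the report):

```python
def p_straggler(p, n):
    """Probability that at least one of n independent workers misses its deadline."""
    return 1 - (1 - p) ** n

for n in (1, 100, 1000):
    print(f"n={n:>4}: {p_straggler(0.001, n):.1%}")
```

Even a 0.1% per-server tail makes a straggler nearly certain at a fan-out of thousands, which is why the network's congestion behavior dominates OLDI response times.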
12
12
Remote Direct Memory Access (RDMA) in the Data Center
Source: InfiniBand Trade Association <http://www.roceinitiative.org/roce-introduction>
13
13
RoCE: RDMA over Converged Ethernet
[Figure: protocol stack comparison, annotated:]
– Dependent on transport layer; intolerant of packet loss
– No protection against packet loss
– Typically no protection against packet loss
14
Lossless Ethernet is a Foundation of the New Data Center
• Data Centers are becoming High Performance Computers
• Most Data Center networking is at Layer 3/4, but it all rides on Layer 1/2
• Ethernet is the preferred option for Layer 1/2
  – Ethernet infrastructure can support multiple upper layers (e.g. TCP and RoCEv2) simultaneously
• Most Data Center networking requires low latency
• Some Data Center networking requires a lossless Ethernet fabric
14
15
IEEE 802 Developed Data Center Bridging
• In the IEEE 802.1 Working Group, the Data Center Bridging Task Group developed many standards to enhance Layer 2 support for the data center
  – Beginning around 2006
• Key technologies include:
  – Priority-based Flow Control (PFC)
  – Congestion Notification
  – Enhanced Transmission Selection (ETS)
15
16
Folded-Clos Network: Many Paths from Server to Server
16
[Figure: folded-Clos topology – spine switches, rack switches, servers]
17
Incast fills output queue (note: ECMP cannot help)
17
[Figure: folded-Clos topology – ECMP spreads paths across spines, but incast flows converge on one rack-switch egress port toward the server]
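Why ECMP cannot help here can be sketched in a few lines: hashing the 5-tuple spreads flows across spine uplinks, but every incast flow still funnels into the single egress port facing the receiver. The addresses, spine count, and hash below are illustrative assumptions:

```python
import hashlib

SPINES = 4  # illustrative spine count

def ecmp_uplink(src, dst, sport, dport=4791):
    """Pick a spine uplink by hashing the flow 5-tuple, as ECMP does."""
    key = f"{src}:{dst}:{sport}:{dport}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % SPINES

# 16 workers answer the same parent server at nearly the same time (incast)
flows = [(f"10.0.0.{i}", "10.0.1.1", 49152 + i) for i in range(16)]

uplinks = {ecmp_uplink(*f) for f in flows}
print("spine uplinks used:", sorted(uplinks))  # load is balanced upstream...

# ...but every flow exits through the one port facing 10.0.1.1, so that
# single egress queue absorbs all 16 flows regardless of ECMP hashing
egress_ports = {dst for _, dst, _ in flows}
print("egress ports toward the receiver:", len(egress_ports))
```

ECMP balances the middle of the fabric; the bottleneck is the last hop, which is the same port for every flow of the incast.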
18
Priority flow control (PFC)
18
[Figure: folded-Clos topology – PFC pause propagating upstream from the incast point]
• Output backup fills ingress queue
• PFC can be used to pause input per QoS class
• IEEE 802.1Q (originally in 802.1Qbb)
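The per-class pause in the bullets above behaves like an XOFF/XON threshold pair on an ingress queue. A minimal sketch; the threshold values and single-queue model are illustrative assumptions, not values from IEEE 802.1Q:

```python
from dataclasses import dataclass, field

XOFF, XON = 80, 40  # illustrative thresholds, in packets

@dataclass
class PfcIngressQueue:
    """One ingress queue for one IEEE 802.1Q priority class."""
    priority: int
    occupancy: int = 0
    paused_upstream: bool = False
    events: list = field(default_factory=list)

    def enqueue(self, n=1):
        self.occupancy += n
        if not self.paused_upstream and self.occupancy >= XOFF:
            self.paused_upstream = True            # emit PFC PAUSE (XOFF)
            self.events.append(("PAUSE", self.priority))

    def dequeue(self, n=1):
        self.occupancy = max(0, self.occupancy - n)
        if self.paused_upstream and self.occupancy <= XON:
            self.paused_upstream = False           # emit PFC resume (XON)
            self.events.append(("RESUME", self.priority))

q = PfcIngressQueue(priority=3)
for _ in range(100):   # incast burst arrives faster than the queue drains
    q.enqueue()
for _ in range(60):    # queue drains once the burst subsides
    q.dequeue()
print(q.events)        # one PAUSE crossing XOFF, one RESUME at XON
```

Note the pause applies to the whole priority class, not to individual flows, which is exactly the limitation the next slide illustrates.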
19
PFC pauses all flows of the class, including "victim" flows
19
[Figure: folded-Clos topology – PFC stops both flows, the incast flow and the victim flow alike]
20
Traffic-class blocking analysis
20
Figure 11: Network performance degradation due to HOL blocking
[excerpt:] …other flows in the network that are injected from other sources and are addressed to destinations different from X (Y and Z, respectively), so that they do not contribute to the congestion tree. However, as can be seen, the latter flows interact (i.e., share queues) with the congested flows at several points of the network; then the latter flows produce HOL blocking to the former ones. As a consequence, all the flows end up being injected at the same rate as the contributor ones, despite the fact that the links connected to Y and Z are not saturated. On the other hand, note that at the switch where the root is located, the HOL blocking appears due to the direct contention of the congested flows to access the root of the tree, but in other (upstream) switches the HOL blocking appears due to congestion propagation through flow control. In general, the HOL blocking produced in the switch where congestion originates is called "low-order" HOL blocking, while the one produced at upstream switches is called "high-order" HOL blocking.
Source: IEEE 802.1-19-0012-00 [4]
21
Congested-flow Isolation (see IEEE Project P802.1Qcz [5])
21
[Figure: upstream and downstream switches, each with ingress ports (virtual queues) and egress ports; congested and non-congested flows held in separate queues]
1. Identify the flow causing congestion and isolate it locally
2. (CIP) Signal to neighbor when the congested queue fills
3. Upstream isolates the flow too, eliminating head-of-line (HoL) blocking
4. (PFC) If the congested queue continues to fill, invoke PFC for lossless operation
Source: IEEE 802 Nendica Report "The Lossless Network for Data Centers" [2]
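Steps 1–4 can be sketched as a toy model of one egress port holding a normal queue and a congested queue. The thresholds, flow names, and the "dominant flow" heuristic are illustrative assumptions; P802.1Qcz defines the actual identification rules and CIP signaling:

```python
ISOLATE_AT = 20   # per-flow share of the queue that marks it "congesting"
CIP_AT     = 50   # congested-queue depth that triggers a CIP upstream
PFC_AT     = 80   # congested-queue depth that finally falls back to PFC

class CiEgressPort:
    def __init__(self):
        self.normal, self.congested = [], []   # two queues per port
        self.isolated, self.signals = set(), []

    def receive(self, flow):
        # step 1: a flow dominating the queue is declared congesting
        # and moved to the congested queue (local isolation)
        if flow not in self.isolated and self.normal.count(flow) + 1 >= ISOLATE_AT:
            self.isolated.add(flow)
            self.congested += [p for p in self.normal if p == flow]
            self.normal = [p for p in self.normal if p != flow]
        (self.congested if flow in self.isolated else self.normal).append(flow)
        # steps 2-3: a CIP asks the upstream switch to isolate the flow too
        if len(self.congested) == CIP_AT:
            self.signals.append("CIP")
        # step 4: only if the congested queue keeps filling, invoke PFC
        if len(self.congested) == PFC_AT:
            self.signals.append("PFC")

port = CiEgressPort()
for i in range(100):          # flow A hammers the port; flow B is a victim
    port.receive("A")
    if i % 10 == 0:
        port.receive("B")
print(port.signals)                                   # ['CIP', 'PFC']
print("victim packets in the normal queue:", port.normal.count("B"))
```

The victim flow B never shares a queue with A after isolation, so it is not paused when A's congested queue backs up; PFC is the last resort, not the first.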
22
Congested-flow Isolation Analysis
22
Source: IEEE 802.1-19-0012-00 [4]
23
Figure 14: Isolating contributor flows eliminates HOL blocking
[excerpt:] …other flows, so that if HOL blocking were prevented, congestion trees could exist without actually being a problem on their own.
Although this basic approach has been followed by very different techniques, in general a common advantage of all of them is that they deal with HOL blocking fast and locally, as they attack the HOL blocking when and where it may appear. Therefore, they can complement some techniques explained above that react slowly to congestion (see Section 7). Another common issue of all these techniques is that they configure several queues per port to separately store packets belonging to different flows, hence reducing or eliminating the interaction among these flows, and so the HOL blocking they can produce each other. Hence, the ultimate goal of these techniques is to isolate in separate queues the flows that contribute to congestion, which are the ones producing significant HOL blocking. This can be seen in Figure 15, which shows the same situation depicted in Figure 11, but now with several queues configured per port to separate the flows addressed to different destinations. As can be seen, the separation of flows into different queues leads to the elimination of HOL blocking, and so to an improvement of the performance of the network, as the utilization of several links increases with respect to the situation where the flows share queues.
Note that HOL blocking would be ideally and completely eliminated if there were, at every port, a queue specifically dedicated to each destination [10], but this is obviously unaffordable, as the number of queues in the network would grow quadratically with…
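The "grows quadratically" point is easy to see with rough numbers: one dedicated queue per (port, destination) pair means the queue count is hosts × fabric ports, and in a folded Clos the fabric port count itself grows roughly linearly with the host count. The 3-ports-per-host constant below is an illustrative assumption:

```python
def queues_needed(hosts, fabric_ports):
    # one dedicated queue per (port, destination) pair
    return hosts * fabric_ports

for hosts in (1_000, 10_000, 100_000):
    ports = 3 * hosts   # fabric ports scale ~linearly with hosts (assumed)
    print(f"{hosts:>7} hosts -> {queues_needed(hosts, ports):,} queues")
```

A 10× growth in hosts costs 100× the queues, which is why per-destination queueing is ruled out and congestion isolation settles for a small number of queues per port instead.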
23
Bibliography
1) IEEE 802 Orientation for New Participants: https://mentor.ieee.org/802-ec/dcn/18/ec-18-0117-02.pdf
2) IEEE 802 "Network Enhancements for the Next Decade" Industry Connections Activity (Nendica): https://1.ieee802.org/802-nendica
3) IEEE 802 Nendica Report: "The Lossless Network for Data Centers" (18 August 2018): https://mentor.ieee.org/802.1/dcn/18/1-18-0042-00.pdf
4) Paul Congdon, "The Lossless Network in the Data Center," IEEE 802.1-17-0007-01, 7 November 2017: https://mentor.ieee.org/802.1/dcn/17/1-17-0007-01.pdf
5) Pedro Javier Garcia, Jesus Escudero-Sahuquillo, Francisco J. Quiles, and Jose Duato, "Congestion Management for Ethernet-based Lossless Data Center Networks," IEEE 802.1-19-0012-00, 4 February 2019: https://mentor.ieee.org/802.1/dcn/19/1-19-0012-00.pdf
6) IEEE P802.1Qcz Project, "Congestion Isolation": https://1.ieee802.org/tsn/802-1qcz
23
24
Going forward
• IEEE 802 Nendica Report "The Lossless Network for Data Centers" (18 August 2018) is published but open to further comment
  – Comments are welcome from all
• Could open an activity to revise the report, addressing new issues
  – Proposal [5] may be discussed in a future teleconference
• Report could help identify issues in need of further research or unified action
• Nendica could instigate standardization of key topics
  – Mainly in IEEE 802; perhaps also in, e.g., IETF
24
25
Nendica Participation
• Nendica is open to all participants; please join in, no membership requirements
  – Comment by Internet or contribute documents
  – Call in to teleconferences
  – Attend meetings
• Nendica web site: https://1.ieee802.org/802-nendica
  – Details of in-person and teleconference meetings
• Feel free to contact the Chair (see first slide)
25
Note that the HOL blocking would be ideally and completely eliminated if there wereat every port a queue specifically dedicated to each destination [10] but this is obvi-ously unacrarrordable as the number of queues in the network would grow quadratically with
23
Bibliography1) IEEE 802 Orientation for New Participants httpsmentorieeeorg802-ecdcn18ec-18-0117-02pdf
2) IEEE 802 ldquoNetwork Enhancements for the Next Decaderdquo Industry Connections Activity (Nendica) https1ieee802org802-nendica
3) IEEE 802 Nendica Report ldquoThe Lossless Network for Data Centersrdquo (18 August 2018) httpsmentorieeeorg8021dcn181-18-0042-00pdf
4) Paul Congdon ldquoThe Lossless Network in the Data Centerrdquo IEEE 8021-17-0007-01 7 November 2017 httpsmentorieeeorg8021dcn171-17-0007-01pdf
5) Pedro Javier Garcia Jesus Escudero-Sahuquillo Francisco J Quiles and Jose Duato ldquoCongestion Management for Ethernet-based Lossless DataCenterNetworksrdquo IEEE 8021-19-0012-00 4 February 2012 httpsmentorieeeorg8021dcn191-19-0012-00pdf
6) IEEE P8021Qcz Project ldquoCongestion Isolationrdquo https1ieee802orgtsn802-1qcz
23
24
Going forward
bull IEEE 802 Nendica Report ldquoThe Lossless Network for Data Centersrdquo (18 August 2018) is published but open to further comment Comments are welcome from all
bull Could open an activity to revise the report addressing new issues Proposal [5] may be discussed in future teleconference
bull Report could help identify issues in need of further research or unified action
bull Nendica could instigate standardization of key topics Mainly in IEEE 802 perhaps also in eg IETF
24
25
Nendica Participation
bull Nendica is open to all participants please join in no membership requirements Comment by Internet or contribute documents Call in to teleconferences Attend meetings
bull Nendica web site https1ieee802org802-nendica Details of in-person and teleconference meetings
bull Feel free to contact the Chair (see first slide)
25
Nendica Report: The Lossless Network for Data Centers

• Paul Congdon, Editor
• Key messages regarding the data center:
  - Packet loss leads to large delays
  - Congestion leads to packet loss
  - Conventional methods are problematic
  - A Layer 3 network uses Layer 2 transport; action at Layer 2 can reduce congestion and thereby loss
  - The paper is not specifying a "lossless" network but describing a few prospective methods to progress toward a lossless data center network in the future
• The report is open to comment and may be revised
Use Cases: The Lossless Network for Data Centers

• The scale of Data Center Networks continues to grow:
  - Online Data-Intensive (OLDI) Services
  - AI: Deep Learning and Model Training
  - Cloud High-Performance Computing
  - Financial Trading
  - Distributed Storage
  - Non-Volatile Memory Express (NVMe) over Fabrics
• Data Center => High-Performance Computer
• New requirements for a lossless Ethernet fabric
Data Center Applications are distributed and latency-sensitive
"…experience is highly dependent upon the system responsiveness, and even moderate delays of less than a second can have a measurable impact on individual queries and their associated advertising revenue. A large chunk of unavoidable delay, due to the speed of light, is inherently built into a system that uses the remote cloud as the source of decision and information. This puts even more pressure on the deadlines within the data center itself. To address these latency concerns, OLDI services deploy individual requests across thousands of servers simultaneously. The responses from these servers are coordinated and aggregated to form the best recommendations or answers. Delays in obtaining these answers are compounded by delayed or 'straggler' communication flows between the servers. This creates a long tail latency distribution in the data center for highly parallel applications. To combat tail latency, servers are often arranged in a hierarchy, as shown in Figure 1, with strict deadlines given to each tier to produce an answer. If valuable data arrives late because of latency in the network, the data is simply discarded and a sub-optimal answer may be returned. Studies have shown that the network becomes a significant component of overall data center latency when congestion occurs in the network [2].

Figure 1 – Parallel Application Hierarchy

The long tail of latency distribution in OLDI data centers can be caused by various factors [3]. One is simply related to the mix of traffic between control messages (mice) and data messages (elephants). While most of the flows in the data center are mice, most of the bytes transferred across the network are due to elephants. Therefore, a small number of elephant flows can delay the set-up of control channels established by mice flows. Since OLDI data centers are processing requests over thousands of servers simultaneously, the mix and interplay of mice and elephant flows is highly uncoordinated. An additional complexity is that flows can change behavior over time: what was once an elephant can transform into a mouse after an application has reached steady state. Another cause of latency is due to incast at the lower tiers of the node hierarchy. Leaf worker nodes return their answers to a common parent in the tree at nearly the same time. This can cause buffer over-runs and packet loss within an individual switch. It may invoke congestion management schemes, such as flow-control or congestion notification, which have little effect on mice flows and tail latency."
• Tend toward congestion, e.g. due to incast
• Packet loss leads to retransmission, more congestion, more delay

Source: IEEE 802 Nendica Report, The Lossless Network for Data Centers [2]
Remote Direct Memory Access (RDMA) in the Data Center

Source: InfiniBand Trade Association <http://www.roceinitiative.org/roce-introduction>
RoCE: RDMA over Converged Ethernet

[Figure: comparison of RDMA transport stacks; labels from the diagram:]
• Dependent on transport layer; intolerant of packet loss
• No protection against packet loss
• Typically no protection against packet loss
Lossless Ethernet is a Foundation of the New Data Center

• Data Centers are becoming High-Performance Computers
• Most Data Center networking is at Layer 3/4, but it all rides on Layer 1/2
• Ethernet is the preferred option for Layer 1/2
  - Ethernet infrastructure can support multiple upper layers (e.g. TCP and RoCEv2) simultaneously
• Most Data Center networking requires low latency
• Some Data Center networking requires a lossless Ethernet fabric
IEEE 802 Developed Data Center Bridging

• In the IEEE 802.1 Working Group, the Data Center Bridging Task Group developed many standards to enhance Layer 2 support for the data center, beginning around 2006
• Key technologies include:
  - Priority-based Flow Control (PFC)
  - Congestion Notification
  - Enhanced Transmission Selection (ETS)
Folded-Clos Network: Many Paths from Server to Server

[Diagram: leaf-spine fabric with spine switches, top-of-rack switches, and servers]
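As an illustrative aside (not part of the original slides), the "many paths" property of a folded-Clos fabric can be sketched in a few lines: between servers on different racks, each spine switch contributes one distinct equal-cost path. The function name and parameters are assumptions for illustration.

```python
# Sketch: path diversity in a 2-tier folded-Clos (leaf-spine) fabric.
# Assumption: every top-of-rack switch connects to every spine, so each
# spine yields one distinct path between two different racks.

def clos_paths(num_spines: int, src_rack: int, dst_rack: int) -> int:
    """Count distinct server-to-server paths through the fabric."""
    if src_rack == dst_rack:
        return 1          # traffic stays on the shared top-of-rack switch
    return num_spines     # one equal-cost path per spine switch

# A fabric with 4 spines offers 4 equal-cost paths between racks.
print(clos_paths(4, src_rack=0, dst_rack=1))  # -> 4
print(clos_paths(4, src_rack=2, dst_rack=2))  # -> 1
```

This path diversity is what ECMP load-balancing exploits on the next slide.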
Incast fills the output queue (note: ECMP cannot help)

[Diagram: ECMP spreads flows across the spines, but incast flows converge on one server's output queue]
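A toy model (not from the slides) of why ECMP cannot relieve incast: ECMP hashes each flow's 5-tuple to choose an uplink, which spreads distinct flows across the spines, yet every flow addressed to the same server still funnels into that server's single egress port. The hash scheme, addresses, and port numbers below are illustrative assumptions.

```python
# Sketch: ECMP balances *paths*, not the shared *destination* queue.
import hashlib

NUM_SPINES = 4

def ecmp_uplink(src, dst, sport, dport, proto="udp"):
    """Deterministically map a flow's 5-tuple to one spine uplink."""
    key = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % NUM_SPINES

# 8 senders all target the same server ("10.0.0.9", port 4791).
flows = [(f"10.0.1.{i}", "10.0.0.9", 40000 + i, 4791) for i in range(8)]
uplinks = [ecmp_uplink(*f) for f in flows]
egress = {dst for _, dst, _, _ in flows}

print(sorted(set(uplinks)))  # the flows hash across the spine uplinks...
print(egress)                # ...but all share one egress: {'10.0.0.9'}
```

However the hash scatters the flows, the incast point is the destination's output queue, which ECMP never touches.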
Priority Flow Control (PFC)

[Diagram: PFC pause issued upstream from the incast point]

• Output backup fills the ingress queue
• PFC can be used to pause input per QoS class
• IEEE 802.1Q (originally in 802.1Qbb)
PFC pauses all flows of the class, including "victim" flows

[Diagram: PFC stops both the incast flow and an unrelated "victim" flow in the same class]
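The victim-flow effect follows from PFC's granularity: the pause applies to a priority class, not to individual flows. A minimal sketch, with an assumed XOFF threshold and invented sender names, shows an innocent flow being paused along with the incast senders.

```python
# Sketch: PFC pause granularity is the priority class, not the flow.
# XOFF_THRESHOLD and the sender names are illustrative assumptions.

XOFF_THRESHOLD = 10  # packets buffered before a PFC PAUSE is emitted

class_queue = []     # one shared queue for the lossless priority class
paused = set()       # senders currently paused by PFC
SENDERS = ["incast-A", "incast-B", "victim"]

def enqueue(sender, pkt):
    """Accept a packet unless the sender's class has been paused."""
    if sender in paused:
        return False                 # PFC: sender must hold its traffic
    class_queue.append((sender, pkt))
    if len(class_queue) >= XOFF_THRESHOLD:
        paused.update(SENDERS)       # PAUSE reaches *every* class sender
    return True

enqueue("victim", 0)                 # the victim sends a single packet
for i in range(6):                   # incast senders flood the queue
    enqueue("incast-A", i)
    enqueue("incast-B", i)

print("victim paused:", "victim" in paused)  # -> victim paused: True
```

The victim contributed one packet, yet once the incast traffic crosses the threshold it is paused like everyone else in the class; this is exactly the head-of-line blocking problem the following slides analyze.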
Traffic-class blocking analysis

Figure 11: Network performance degradation due to HOL blocking

"…other flows in the network that are injected from other sources and are addressed to destinations different from X (Y and Z, respectively), so that they do not contribute to the congestion tree. However, as can be seen, the latter flows interact (i.e. share queues) with the congested flows at several points of the network; then the latter flows produce HOL blocking to the former ones. As a consequence, all the flows end up being injected at the same rate as the contributor ones, despite the fact that the links connected to Y and Z are not saturated. On the other hand, note that at the switch where the root is located, the HOL blocking appears due to the direct contention of the congested flows to access the root of the tree, but in other (upstream) switches the HOL blocking appears due to congestion propagation through flow control. In general, the HOL blocking produced in the switch where congestion originates is called 'low-order' HOL blocking, while the one produced at upstream switches is called 'high-order' HOL blocking."

Source: IEEE 802.1-19-0012-00 [4]
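The HOL-blocking mechanism quoted above can be demonstrated with a toy queue model (my illustration, not from the cited contribution): in a shared FIFO, a head packet bound for a congested output stalls every packet behind it, even those bound for idle outputs, while per-output (virtual) queues let the unaffected traffic drain.

```python
# Sketch: shared FIFO vs. virtual output queues under a blocked output.
from collections import deque

def drain(queue, blocked_output):
    """Forward packets until the head targets the blocked output."""
    sent = 0
    while queue and queue[0] != blocked_output:
        queue.popleft()
        sent += 1
    return sent

# One shared FIFO: a packet for congested output "X" sits at the head.
shared = deque(["X", "Y", "Z", "Y", "Z"])
print(drain(shared, "X"))   # -> 0: the Y/Z packets are HOL-blocked

# Per-output (virtual) queues: Y and Z traffic is unaffected by X.
voq = {"X": deque(["X"]), "Y": deque(["Y", "Y"]), "Z": deque(["Z", "Z"])}
sent = sum(drain(q, "X") for out, q in voq.items() if out != "X")
print(sent)                 # -> 4: only the X queue has to wait
```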
Congested-flow Isolation (see IEEE Project P802.1Qcz [5])

[Diagram: upstream and downstream switches, each with ingress ports holding virtual queues and egress ports; congested flows are queued separately from non-congested flows]

1. Identify the flow causing congestion and isolate it locally
2. (CIP) Signal to the neighbor when the congested queue fills
3. Upstream isolates the flow too, eliminating head-of-line (HoL) blocking
4. (PFC) If the congested queue continues to fill, invoke PFC for lossless operation

Source: IEEE 802 Nendica Report, The Lossless Network for Data Centers [2]
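The four steps above can be sketched as a toy two-switch model. This is purely illustrative (class names, thresholds, and the message plumbing are my assumptions; the real behavior is what IEEE P802.1Qcz specifies).

```python
# Sketch of the four Congestion Isolation steps as a toy switch model.

CONGESTED_LIMIT = 4   # congested-queue depth that triggers a CIP signal
PFC_LIMIT = 8         # depth at which PFC becomes the last resort

class CISwitch:
    def __init__(self, name, upstream=None):
        self.name = name
        self.upstream = upstream
        self.normal_q, self.congested_q = [], []
        self.isolated = set()   # flows identified as congesting
        self.pfc_sent = False

    def identify_congestion(self, flow):
        # Step 1: identify the congesting flow and isolate it locally.
        self.isolated.add(flow)

    def receive_cip(self, flow):
        # Step 3: upstream isolates the flow too, removing HoL blocking.
        self.isolated.add(flow)

    def enqueue(self, flow, pkt):
        if flow not in self.isolated:
            self.normal_q.append((flow, pkt))  # mice keep moving
            return
        self.congested_q.append((flow, pkt))
        if len(self.congested_q) >= CONGESTED_LIMIT and self.upstream:
            # Step 2: CIP message to the neighbor as the queue fills.
            self.upstream.receive_cip(flow)
        if len(self.congested_q) >= PFC_LIMIT:
            # Step 4: still filling -> invoke PFC to remain lossless.
            self.pfc_sent = True

upstream = CISwitch("upstream")
downstream = CISwitch("downstream", upstream=upstream)
downstream.identify_congestion("elephant")

for i in range(8):
    downstream.enqueue("elephant", i)   # the congesting flow
    downstream.enqueue("mouse", i)      # an unrelated flow

print(upstream.isolated)          # {'elephant'}: CIP propagated upstream
print(downstream.pfc_sent)        # True: the queue kept filling
print(len(downstream.normal_q))   # 8: the mouse was never blocked
```

The key contrast with plain PFC is visible in the last line: the unrelated flow keeps flowing because only the isolated flow's queue backs up.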
Congested-flow Isolation Analysis

Figure 14: Isolating contributor flows eliminates HOL blocking

"…other flows, so that if HOL blocking were prevented, congestion trees could exist without actually being a problem on their own.

Although this basic approach has been followed by very different techniques, in general a common advantage of all of them is that they deal with HOL blocking fast and locally, as they attack the HOL blocking when and where it may appear. Therefore, they can complement some techniques explained above that react slowly to congestion (see Section 7). Another common issue of all these techniques is that they configure several queues per port to separately store packets belonging to different flows, hence reducing or eliminating the interaction among these flows, and so the HOL blocking they can produce each other. Hence, the ultimate goal of these techniques is to isolate in separate queues the flows that contribute to congestion, which are the ones producing significant HOL blocking. This can be seen in Figure 15, which shows the same situation depicted in Figure 11, but now with several queues configured per port to separate the flows addressed to different destinations. As can be seen, the separation of flows into different queues leads to the elimination of HOL blocking, and so to an improvement of the performance of the network, as the utilization of several links increases with respect to the situation in which the flows share queues.

Note that the HOL blocking would be ideally and completely eliminated if there were, at every port, a queue specifically dedicated to each destination [10], but this is obviously unaffordable, as the number of queues in the network would grow quadratically with…"

Source: IEEE 802.1-19-0012-00 [4]
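The excerpt's closing observation, that a dedicated per-destination queue at every port would eliminate HOL blocking but is unaffordable, can be checked with back-of-envelope arithmetic. The linear port-count assumption below is mine, for illustration only.

```python
# Sketch: queue count with one queue per destination at every port.
# Assumption (illustrative): fabric port count grows roughly linearly
# with the number of endpoints, so total queues grow as O(N^2).

def total_queues(num_ports: int, num_destinations: int) -> int:
    """One dedicated queue per destination at each port."""
    return num_ports * num_destinations

for n in (100, 1000, 10000):
    print(n, total_queues(2 * n, n))   # ~2N fabric ports, N destinations
```

At 10,000 endpoints this is already hundreds of millions of queues, which is why congestion isolation settles for a small number of dynamically assigned congested-flow queues instead.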
Bibliography

1) IEEE 802 Orientation for New Participants: https://mentor.ieee.org/802-ec/dcn/18/ec-18-0117-02.pdf
2) IEEE 802 "Network Enhancements for the Next Decade" Industry Connections Activity (Nendica): https://1.ieee802.org/802-nendica
3) IEEE 802 Nendica Report, "The Lossless Network for Data Centers" (18 August 2018): https://mentor.ieee.org/802.1/dcn/18/1-18-0042-00.pdf
4) Paul Congdon, "The Lossless Network in the Data Center," IEEE 802.1-17-0007-01, 7 November 2017: https://mentor.ieee.org/802.1/dcn/17/1-17-0007-01.pdf
5) Pedro Javier Garcia, Jesus Escudero-Sahuquillo, Francisco J. Quiles, and Jose Duato, "Congestion Management for Ethernet-based Lossless Data Center Networks," IEEE 802.1-19-0012-00, 4 February 2019: https://mentor.ieee.org/802.1/dcn/19/1-19-0012-00.pdf
6) IEEE P802.1Qcz Project, "Congestion Isolation": https://1.ieee802.org/tsn/802-1qcz
Going forward

• The IEEE 802 Nendica Report "The Lossless Network for Data Centers" (18 August 2018) is published but open to further comment; comments are welcome from all
• Nendica could open an activity to revise the report, addressing new issues; proposal [5] may be discussed in a future teleconference
• The report could help identify issues in need of further research or unified action
• Nendica could instigate standardization of key topics, mainly in IEEE 802, perhaps also in e.g. the IETF
Nendica Participation

• Nendica is open to all participants; please join in, no membership requirements
  - Comment by Internet or contribute documents
  - Call in to teleconferences
  - Attend meetings
• Nendica web site: https://1.ieee802.org/802-nendica
  - Details of in-person and teleconference meetings
• Feel free to contact the Chair (see first slide)
10
Use Cases The Lossless Network for Data Centers
bull The scale of Data Center Networks continues to grow Online Data Intensive (OLDI) Services AI Deep Learning and Model Training Cloud High-Performance Computing Financial Trading Distributed Storage Non-Volatile Memory Express (NVMe) over Fabrics
bull Data Center =gt High Performance Computerbull New requirements for lossless Ethernet fabric
10
11
Data Center Applications are distributed and latency-sensitive
11
3
Copyright copy 2018 IEEE All rights reserved
experience is highly dependent upon the system responsiveness and even moderate delays of less than a second can have a measurable impact on individual queries and their associated advertising revenue A large chunk of unavoidable delay due to the speed of light is inherently built into a system that uses the remote cloud as the source of decision and information This puts even more pressure on the deadlines within the data center itself To address these latency concerns OLDI services deploy individual requests across thousands of servers simultaneously The responses from these servers are coordinated and aggregated to form the best recommendations or answers Delays in obtaining these answers are compounded by delayed or lsquostragglerrsquo communication flows between the servers This creates a long tail latency distribution in the data center for highly parallel applications To combat tail latency servers are often arranged in a hierarchy as shown in Figure 1 with strict deadlines given to each tier to produce an answer If valuable data arrives late because of latency in the network the data is simply discarded and a sub‐optimal answer may be returned Studies have shown that the network becomes a significant component of overall data center latency when congestion occurs in the network [2]
Figure 1 ndash Parallel Application Hierarchy
The long tail of latency distribution in OLDI data centers can be caused by various factors [3] One is simply related to the mix of traffic between control messages (mice) and data messages (elephants) While most of the flows in the data center are mice most of the bytes transferred across the network are due to elephants Therefore a small number of elephant flows can delay the set‐up of control channels established by mice flows Since OLDI data centers are processing requests over thousands of servers simultaneously the mix and interplay of mice and elephant flows is highly uncoordinated An additional complexity is that flows can change behavior over time what was once an elephant can transform into a mouse after an application has reached steady state Another cause of latency is due to incast at the lower tiers of the node hierarchy Leaf worker nodes return their answers to a common parent in the tree at nearly the same time This can cause buffer over‐runs and packet loss within an individual switch It may invoke congestion management schemes such as flow‐control or congestion notification which have little effect on mice flows and tail latency
Authorized licensed use limited to Roger Marks Downloaded on September 282018 at 034705 UTC from IEEE Xplore Restrictions apply
bull Tend toward congestion eg due to incastbull Packet loss leads to retransmission more
congestion more delaySource IEEE 802 Nendica Report The Lossless Network for Data Centers [2]
12
12
Remote Direct Memory Access (RDMA) in the Data Center
Source InfiniBand Trade Association lthttpwwwroceinitiativeorgroce-introductiongt
13
13
RoCE RDMA over Converged Ethernet
Dependent on transport layerIntolerant of packet loss
No protection against packet loss
Typically no protectionagainst packet loss
14
Lossless Ethernet is a Foundation of the New Data Center
bull Data Centers are becoming High Performance Computersbull Most Data Center networking is at Layer 34 but it all
rides on Layer 12bull Ethernet is preferred option for Layer 12 Ethernet infrastructure can support multiple upper layers
(eg TCP and RoCEv2) simultaneouslybull Most Data Center networking requires low latency bull Some Data Center networking requires a lossless
Ethernet fabric
14
15
IEEE 802 DevelopedData Center Bridging
bull In the IEEE 8021 Working Group the Data Center Bridging Task Group developed many standards to enhance Layer 2 support for the data center Beginning around 2006
bull Key technologies include Priority-based Flow Control (PFC) Congestion Notification Enhanced Transmission Selection (ETS)
15
16
Folded-Clos NetworkMany Paths from Server to Server
16
server
spine
rack
17
Incast fills output queue(note ECMP cannot help)
17
server
spine
rack
ECMPincast
18
Priority flow control (PFC)18
server
spine
rack
PFC
incast
bull Output backup fills ingress queuebull PFC can be used to pause input per QoS classbull IEEE 8021Q (originally in 8021Qbb)
19
PFC pauses all flows of the classincluding ldquovictimrdquo flows
19
server
spine
rack
PFC stops both flows
incast
20
Traffic-class blocking analysis20
17
Figure 11 Network performance degradation due to HOL blocking
other flows in the network that are injected from other sources and are addressed todestinations dicrarrerent from X (Y and Z respectively) so that they do not contribute to thecongestion tree However as can be seen the latter flows interact (ie share queues) withthe congested flows at several points of the network then the latter flows produce HOLblocking to the former ones As a consequence all the flows end up being injected at thesame rate as the contributor ones despite the fact that the links connected to Y and Zare not saturated On the other hand note that at the switch where the root is locatedthe HOL blocking appears due to the direct contention of the congested flows to accessthe root of the tree but in other (upstream) switches the HOL blocking appears due tocongestion propagation through flow control In general the HOL blocking produced inthe switch where congestion originates is called ldquolow-orderrdquo HOL blocking while the oneproduced at upstream switches is called ldquohigh-orderrdquo HOL blocking
4 Reducing In-Network Congestion
Taking into account the problems derived from congestion trees a natural approach to dealwith them is trying to prevent their formation This approach is actually a particular case
Source IEEE 8021-19-0012-00 [4]
21
Congested-flow Isolation(see IEEE Project P8021Qcz [5])
21
DownstreamUpstream1 3 1 3
2
4
2
4Ingress Port
(Virtual Queues)Egress Port Ingress Port
(Virtual Queues)Egress Port
Congested Flows
Non-Congested Flows
1 Identify the flow causing congestion and isolate locally
CIP 2 Signal to neighbor when congested queue fills
Eliminate HoL Blocking
3 Upstream isolates the flow too eliminating head-of-line blocking
PFC 4 If congested queue continues to fill invoke PFC for lossless
Source IEEE 802 Nendica Report The Lossless Network for Data Centers [2]
22
Congested-flow Isolation Analysis22
Source IEEE 8021-19-0012-00 [4]
23
Figure 14 Isolating contributor flows eliminates HOL blocking
other flows so that if HOL blocking were prevented congestion trees could exist withoutactually being a problem on their own
Although this basic approach have been followed by very dicrarrerent techniques in generala common advantage of all of them is that they deal with HOL blocking fast and locallyas they attack the HOL blocking when and where it may appear Therefore they cancomplement some techniques explained above that react slowly to congestion (see Section7) Another common issue of all these techniques is that they configure several queues perport to separately store packets belonging to dicrarrerent flows hence reducing or eliminatingthe interaction among these flows and so the HOL blocking they can produce each otherHence the ultimate goal of these techniques is to isolate in separate queues the flows thatcontribute to congestion which are the ones producing significant HOL blocking This canbe seen in Figure 15 that shows the same situation depicted in Figure 11 but now withseveral queues configured per port to separate the flows addressed to dicrarrerent destinationsAs can be seen the separation of flows into dicrarrerent queues leads to the elimination ofHOL blocking so to an improvement of the performance of the network as the utilizationof several links increases with respect to the situation that the flows share queues
Note that the HOL blocking would be ideally and completely eliminated if there wereat every port a queue specifically dedicated to each destination [10] but this is obvi-ously unacrarrordable as the number of queues in the network would grow quadratically with
23
Bibliography1) IEEE 802 Orientation for New Participants httpsmentorieeeorg802-ecdcn18ec-18-0117-02pdf
2) IEEE 802 ldquoNetwork Enhancements for the Next Decaderdquo Industry Connections Activity (Nendica) https1ieee802org802-nendica
3) IEEE 802 Nendica Report ldquoThe Lossless Network for Data Centersrdquo (18 August 2018) httpsmentorieeeorg8021dcn181-18-0042-00pdf
4) Paul Congdon ldquoThe Lossless Network in the Data Centerrdquo IEEE 8021-17-0007-01 7 November 2017 httpsmentorieeeorg8021dcn171-17-0007-01pdf
5) Pedro Javier Garcia Jesus Escudero-Sahuquillo Francisco J Quiles and Jose Duato ldquoCongestion Management for Ethernet-based Lossless DataCenterNetworksrdquo IEEE 8021-19-0012-00 4 February 2012 httpsmentorieeeorg8021dcn191-19-0012-00pdf
6) IEEE P8021Qcz Project ldquoCongestion Isolationrdquo https1ieee802orgtsn802-1qcz
23
24
Going forward
bull IEEE 802 Nendica Report ldquoThe Lossless Network for Data Centersrdquo (18 August 2018) is published but open to further comment Comments are welcome from all
bull Could open an activity to revise the report addressing new issues Proposal [5] may be discussed in future teleconference
bull Report could help identify issues in need of further research or unified action
bull Nendica could instigate standardization of key topics Mainly in IEEE 802 perhaps also in eg IETF
24
25
Nendica Participation
bull Nendica is open to all participants please join in no membership requirements Comment by Internet or contribute documents Call in to teleconferences Attend meetings
bull Nendica web site https1ieee802org802-nendica Details of in-person and teleconference meetings
bull Feel free to contact the Chair (see first slide)
25
11
Data Center Applications are distributed and latency-sensitive
11
3
Copyright copy 2018 IEEE All rights reserved
experience is highly dependent upon the system responsiveness and even moderate delays of less than a second can have a measurable impact on individual queries and their associated advertising revenue A large chunk of unavoidable delay due to the speed of light is inherently built into a system that uses the remote cloud as the source of decision and information This puts even more pressure on the deadlines within the data center itself To address these latency concerns OLDI services deploy individual requests across thousands of servers simultaneously The responses from these servers are coordinated and aggregated to form the best recommendations or answers Delays in obtaining these answers are compounded by delayed or lsquostragglerrsquo communication flows between the servers This creates a long tail latency distribution in the data center for highly parallel applications To combat tail latency servers are often arranged in a hierarchy as shown in Figure 1 with strict deadlines given to each tier to produce an answer If valuable data arrives late because of latency in the network the data is simply discarded and a sub‐optimal answer may be returned Studies have shown that the network becomes a significant component of overall data center latency when congestion occurs in the network [2]
Figure 1 ndash Parallel Application Hierarchy
The long tail of latency distribution in OLDI data centers can be caused by various factors [3] One is simply related to the mix of traffic between control messages (mice) and data messages (elephants) While most of the flows in the data center are mice most of the bytes transferred across the network are due to elephants Therefore a small number of elephant flows can delay the set‐up of control channels established by mice flows Since OLDI data centers are processing requests over thousands of servers simultaneously the mix and interplay of mice and elephant flows is highly uncoordinated An additional complexity is that flows can change behavior over time what was once an elephant can transform into a mouse after an application has reached steady state Another cause of latency is due to incast at the lower tiers of the node hierarchy Leaf worker nodes return their answers to a common parent in the tree at nearly the same time This can cause buffer over‐runs and packet loss within an individual switch It may invoke congestion management schemes such as flow‐control or congestion notification which have little effect on mice flows and tail latency
Authorized licensed use limited to Roger Marks Downloaded on September 282018 at 034705 UTC from IEEE Xplore Restrictions apply
bull Tend toward congestion eg due to incastbull Packet loss leads to retransmission more
congestion more delaySource IEEE 802 Nendica Report The Lossless Network for Data Centers [2]
12
12
Remote Direct Memory Access (RDMA) in the Data Center
Source InfiniBand Trade Association lthttpwwwroceinitiativeorgroce-introductiongt
13
13
RoCE RDMA over Converged Ethernet
Dependent on transport layerIntolerant of packet loss
No protection against packet loss
Typically no protectionagainst packet loss
14
Lossless Ethernet is a Foundation of the New Data Center
bull Data Centers are becoming High Performance Computersbull Most Data Center networking is at Layer 34 but it all
rides on Layer 12bull Ethernet is preferred option for Layer 12 Ethernet infrastructure can support multiple upper layers
(eg TCP and RoCEv2) simultaneouslybull Most Data Center networking requires low latency bull Some Data Center networking requires a lossless
Ethernet fabric
14
15
IEEE 802 DevelopedData Center Bridging
bull In the IEEE 8021 Working Group the Data Center Bridging Task Group developed many standards to enhance Layer 2 support for the data center Beginning around 2006
bull Key technologies include Priority-based Flow Control (PFC) Congestion Notification Enhanced Transmission Selection (ETS)
15
16
Folded-Clos NetworkMany Paths from Server to Server
16
server
spine
rack
17
Incast fills output queue(note ECMP cannot help)
17
server
spine
rack
ECMPincast
18
Priority flow control (PFC)18
server
spine
rack
PFC
incast
bull Output backup fills ingress queuebull PFC can be used to pause input per QoS classbull IEEE 8021Q (originally in 8021Qbb)
19
PFC pauses all flows of the classincluding ldquovictimrdquo flows
19
server
spine
rack
PFC stops both flows
incast
20
Traffic-class blocking analysis20
17
Figure 11 Network performance degradation due to HOL blocking
other flows in the network that are injected from other sources and are addressed todestinations dicrarrerent from X (Y and Z respectively) so that they do not contribute to thecongestion tree However as can be seen the latter flows interact (ie share queues) withthe congested flows at several points of the network then the latter flows produce HOLblocking to the former ones As a consequence all the flows end up being injected at thesame rate as the contributor ones despite the fact that the links connected to Y and Zare not saturated On the other hand note that at the switch where the root is locatedthe HOL blocking appears due to the direct contention of the congested flows to accessthe root of the tree but in other (upstream) switches the HOL blocking appears due tocongestion propagation through flow control In general the HOL blocking produced inthe switch where congestion originates is called ldquolow-orderrdquo HOL blocking while the oneproduced at upstream switches is called ldquohigh-orderrdquo HOL blocking
4 Reducing In-Network Congestion
Taking into account the problems derived from congestion trees a natural approach to dealwith them is trying to prevent their formation This approach is actually a particular case
Source IEEE 8021-19-0012-00 [4]
21
Congested-flow Isolation(see IEEE Project P8021Qcz [5])
21
DownstreamUpstream1 3 1 3
2
4
2
4Ingress Port
(Virtual Queues)Egress Port Ingress Port
(Virtual Queues)Egress Port
Congested Flows
Non-Congested Flows
1 Identify the flow causing congestion and isolate locally
CIP 2 Signal to neighbor when congested queue fills
Eliminate HoL Blocking
3 Upstream isolates the flow too eliminating head-of-line blocking
PFC 4 If congested queue continues to fill invoke PFC for lossless
Source IEEE 802 Nendica Report The Lossless Network for Data Centers [2]
22
Congested-flow Isolation Analysis22
Source IEEE 8021-19-0012-00 [4]
23
Figure 14: Isolating contributor flows eliminates HOL blocking

Excerpt from [5]: "... other flows, so that if HOL blocking were prevented, congestion trees could exist without actually being a problem on their own.

Although this basic approach has been followed by very different techniques, in general a common advantage of all of them is that they deal with HOL blocking fast and locally, as they attack the HOL blocking when and where it may appear. Therefore, they can complement some techniques explained above that react slowly to congestion (see Section 7). Another common issue of all these techniques is that they configure several queues per port to separately store packets belonging to different flows, hence reducing or eliminating the interaction among these flows, and so the HOL blocking they can produce on each other. Hence, the ultimate goal of these techniques is to isolate in separate queues the flows that contribute to congestion, which are the ones producing significant HOL blocking. This can be seen in Figure 15, which shows the same situation depicted in Figure 11, but now with several queues configured per port to separate the flows addressed to different destinations. As can be seen, the separation of flows into different queues leads to the elimination of HOL blocking, and so to an improvement of the performance of the network, as the utilization of several links increases with respect to the situation where the flows share queues.

Note that HOL blocking would ideally be completely eliminated if there were at every port a queue specifically dedicated to each destination [10], but this is obviously unaffordable, as the number of queues in the network would grow quadratically with ..."
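The excerpt's central claim can be illustrated with a toy simulation: with a single shared FIFO, one packet headed to a congested destination stalls everything behind it, while per-destination queues let unrelated traffic proceed. The destinations and packet counts below are invented for illustration.

```python
from collections import deque

def drain(queues, blocked):
    """Count packets delivered in one pass; 'blocked' destinations accept nothing.

    A queue stops draining as soon as its head packet targets a blocked
    destination, which is exactly head-of-line (HOL) blocking.
    """
    delivered = 0
    for q in queues:
        while q and q[0] not in blocked:
            q.popleft()
            delivered += 1
    return delivered

packets = ["X", "Y", "X", "Z", "Y"]   # X is congested; Y and Z are not

shared = [deque(packets)]                                      # one shared FIFO
per_dest = [deque(p for p in packets if p == d) for d in "XYZ"]  # one queue per destination

print(drain(shared, {"X"}))    # 0: the head-of-line X packet blocks Y and Z
print(drain(per_dest, {"X"}))  # 3: all Y and Z traffic is unaffected
```

This is also why the excerpt calls full per-destination queuing unaffordable: with N switch ports each holding a queue for every possible destination, the queue count grows quadratically, which motivates isolating only the few flows that actually contribute to congestion.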
Bibliography
1) IEEE 802 Orientation for New Participants, https://mentor.ieee.org/802-ec/dcn/18/ec-18-0117-02.pdf
2) IEEE 802 "Network Enhancements for the Next Decade" Industry Connections Activity (Nendica), https://1.ieee802.org/802-nendica
3) IEEE 802 Nendica Report, "The Lossless Network for Data Centers" (18 August 2018), https://mentor.ieee.org/802.1/dcn/18/1-18-0042-00.pdf
4) Paul Congdon, "The Lossless Network in the Data Center," IEEE 802.1-17-0007-01, 7 November 2017, https://mentor.ieee.org/802.1/dcn/17/1-17-0007-01.pdf
5) Pedro Javier Garcia, Jesus Escudero-Sahuquillo, Francisco J. Quiles, and Jose Duato, "Congestion Management for Ethernet-based Lossless Data Center Networks," IEEE 802.1-19-0012-00, 4 February 2019, https://mentor.ieee.org/802.1/dcn/19/1-19-0012-00.pdf
6) IEEE P802.1Qcz Project, "Congestion Isolation," https://1.ieee802.org/tsn/802-1qcz
24
Going forward

• The IEEE 802 Nendica Report "The Lossless Network for Data Centers" (18 August 2018) is published but remains open to further comment. Comments are welcome from all.
• Nendica could open an activity to revise the report, addressing new issues. Proposal [5] may be discussed in a future teleconference.
• The report could help identify issues in need of further research or unified action.
• Nendica could instigate standardization of key topics, mainly in IEEE 802, but perhaps also elsewhere, e.g., in the IETF.
25
Nendica Participation

• Nendica is open to all participants; please join in. There are no membership requirements. Comment by Internet, contribute documents, call in to teleconferences, or attend meetings.
• Nendica web site: https://1.ieee802.org/802-nendica (details of in-person and teleconference meetings)
• Feel free to contact the Chair (see first slide)
12
12
Remote Direct Memory Access (RDMA) in the Data Center
Source InfiniBand Trade Association lthttpwwwroceinitiativeorgroce-introductiongt
13
13
RoCE RDMA over Converged Ethernet
Dependent on transport layerIntolerant of packet loss
No protection against packet loss
Typically no protectionagainst packet loss
14
Lossless Ethernet is a Foundation of the New Data Center
bull Data Centers are becoming High Performance Computersbull Most Data Center networking is at Layer 34 but it all
rides on Layer 12bull Ethernet is preferred option for Layer 12 Ethernet infrastructure can support multiple upper layers
(eg TCP and RoCEv2) simultaneouslybull Most Data Center networking requires low latency bull Some Data Center networking requires a lossless
Ethernet fabric
14
15
IEEE 802 DevelopedData Center Bridging
bull In the IEEE 8021 Working Group the Data Center Bridging Task Group developed many standards to enhance Layer 2 support for the data center Beginning around 2006
bull Key technologies include Priority-based Flow Control (PFC) Congestion Notification Enhanced Transmission Selection (ETS)
15
16
Folded-Clos NetworkMany Paths from Server to Server
16
server
spine
rack
17
Incast fills output queue(note ECMP cannot help)
17
server
spine
rack
ECMPincast
18
Priority flow control (PFC)18
server
spine
rack
PFC
incast
bull Output backup fills ingress queuebull PFC can be used to pause input per QoS classbull IEEE 8021Q (originally in 8021Qbb)
19
PFC pauses all flows of the classincluding ldquovictimrdquo flows
19
server
spine
rack
PFC stops both flows
incast
20
Traffic-class blocking analysis20
17
Figure 11 Network performance degradation due to HOL blocking
other flows in the network that are injected from other sources and are addressed todestinations dicrarrerent from X (Y and Z respectively) so that they do not contribute to thecongestion tree However as can be seen the latter flows interact (ie share queues) withthe congested flows at several points of the network then the latter flows produce HOLblocking to the former ones As a consequence all the flows end up being injected at thesame rate as the contributor ones despite the fact that the links connected to Y and Zare not saturated On the other hand note that at the switch where the root is locatedthe HOL blocking appears due to the direct contention of the congested flows to accessthe root of the tree but in other (upstream) switches the HOL blocking appears due tocongestion propagation through flow control In general the HOL blocking produced inthe switch where congestion originates is called ldquolow-orderrdquo HOL blocking while the oneproduced at upstream switches is called ldquohigh-orderrdquo HOL blocking
4 Reducing In-Network Congestion
Taking into account the problems derived from congestion trees a natural approach to dealwith them is trying to prevent their formation This approach is actually a particular case
Source IEEE 8021-19-0012-00 [4]
21
Congested-flow Isolation(see IEEE Project P8021Qcz [5])
21
DownstreamUpstream1 3 1 3
2
4
2
4Ingress Port
(Virtual Queues)Egress Port Ingress Port
(Virtual Queues)Egress Port
Congested Flows
Non-Congested Flows
1 Identify the flow causing congestion and isolate locally
CIP 2 Signal to neighbor when congested queue fills
Eliminate HoL Blocking
3 Upstream isolates the flow too eliminating head-of-line blocking
PFC 4 If congested queue continues to fill invoke PFC for lossless
Source IEEE 802 Nendica Report The Lossless Network for Data Centers [2]
22
Congested-flow Isolation Analysis22
Source IEEE 8021-19-0012-00 [4]
23
Figure 14 Isolating contributor flows eliminates HOL blocking
other flows so that if HOL blocking were prevented congestion trees could exist withoutactually being a problem on their own
Although this basic approach have been followed by very dicrarrerent techniques in generala common advantage of all of them is that they deal with HOL blocking fast and locallyas they attack the HOL blocking when and where it may appear Therefore they cancomplement some techniques explained above that react slowly to congestion (see Section7) Another common issue of all these techniques is that they configure several queues perport to separately store packets belonging to dicrarrerent flows hence reducing or eliminatingthe interaction among these flows and so the HOL blocking they can produce each otherHence the ultimate goal of these techniques is to isolate in separate queues the flows thatcontribute to congestion which are the ones producing significant HOL blocking This canbe seen in Figure 15 that shows the same situation depicted in Figure 11 but now withseveral queues configured per port to separate the flows addressed to dicrarrerent destinationsAs can be seen the separation of flows into dicrarrerent queues leads to the elimination ofHOL blocking so to an improvement of the performance of the network as the utilizationof several links increases with respect to the situation that the flows share queues
Note that the HOL blocking would be ideally and completely eliminated if there wereat every port a queue specifically dedicated to each destination [10] but this is obvi-ously unacrarrordable as the number of queues in the network would grow quadratically with
23
Bibliography1) IEEE 802 Orientation for New Participants httpsmentorieeeorg802-ecdcn18ec-18-0117-02pdf
2) IEEE 802 ldquoNetwork Enhancements for the Next Decaderdquo Industry Connections Activity (Nendica) https1ieee802org802-nendica
3) IEEE 802 Nendica Report ldquoThe Lossless Network for Data Centersrdquo (18 August 2018) httpsmentorieeeorg8021dcn181-18-0042-00pdf
4) Paul Congdon ldquoThe Lossless Network in the Data Centerrdquo IEEE 8021-17-0007-01 7 November 2017 httpsmentorieeeorg8021dcn171-17-0007-01pdf
5) Pedro Javier Garcia Jesus Escudero-Sahuquillo Francisco J Quiles and Jose Duato ldquoCongestion Management for Ethernet-based Lossless DataCenterNetworksrdquo IEEE 8021-19-0012-00 4 February 2012 httpsmentorieeeorg8021dcn191-19-0012-00pdf
6) IEEE P8021Qcz Project ldquoCongestion Isolationrdquo https1ieee802orgtsn802-1qcz
23
24
Going forward
bull IEEE 802 Nendica Report ldquoThe Lossless Network for Data Centersrdquo (18 August 2018) is published but open to further comment Comments are welcome from all
bull Could open an activity to revise the report addressing new issues Proposal [5] may be discussed in future teleconference
bull Report could help identify issues in need of further research or unified action
bull Nendica could instigate standardization of key topics Mainly in IEEE 802 perhaps also in eg IETF
24
25
Nendica Participation
bull Nendica is open to all participants please join in no membership requirements Comment by Internet or contribute documents Call in to teleconferences Attend meetings
bull Nendica web site https1ieee802org802-nendica Details of in-person and teleconference meetings
bull Feel free to contact the Chair (see first slide)
25
13
13
RoCE RDMA over Converged Ethernet
Dependent on transport layerIntolerant of packet loss
No protection against packet loss
Typically no protectionagainst packet loss
14
Lossless Ethernet is a Foundation of the New Data Center
bull Data Centers are becoming High Performance Computersbull Most Data Center networking is at Layer 34 but it all
rides on Layer 12bull Ethernet is preferred option for Layer 12 Ethernet infrastructure can support multiple upper layers
(eg TCP and RoCEv2) simultaneouslybull Most Data Center networking requires low latency bull Some Data Center networking requires a lossless
Ethernet fabric
14
15
IEEE 802 DevelopedData Center Bridging
bull In the IEEE 8021 Working Group the Data Center Bridging Task Group developed many standards to enhance Layer 2 support for the data center Beginning around 2006
bull Key technologies include Priority-based Flow Control (PFC) Congestion Notification Enhanced Transmission Selection (ETS)
15
16
Folded-Clos NetworkMany Paths from Server to Server
16
server
spine
rack
17
Incast fills output queue(note ECMP cannot help)
17
server
spine
rack
ECMPincast
18
Priority flow control (PFC)18
server
spine
rack
PFC
incast
bull Output backup fills ingress queuebull PFC can be used to pause input per QoS classbull IEEE 8021Q (originally in 8021Qbb)
19
PFC pauses all flows of the classincluding ldquovictimrdquo flows
19
server
spine
rack
PFC stops both flows
incast
20
Traffic-class blocking analysis20
17
Figure 11 Network performance degradation due to HOL blocking
other flows in the network that are injected from other sources and are addressed todestinations dicrarrerent from X (Y and Z respectively) so that they do not contribute to thecongestion tree However as can be seen the latter flows interact (ie share queues) withthe congested flows at several points of the network then the latter flows produce HOLblocking to the former ones As a consequence all the flows end up being injected at thesame rate as the contributor ones despite the fact that the links connected to Y and Zare not saturated On the other hand note that at the switch where the root is locatedthe HOL blocking appears due to the direct contention of the congested flows to accessthe root of the tree but in other (upstream) switches the HOL blocking appears due tocongestion propagation through flow control In general the HOL blocking produced inthe switch where congestion originates is called ldquolow-orderrdquo HOL blocking while the oneproduced at upstream switches is called ldquohigh-orderrdquo HOL blocking
4 Reducing In-Network Congestion
Taking into account the problems derived from congestion trees a natural approach to dealwith them is trying to prevent their formation This approach is actually a particular case
Source IEEE 8021-19-0012-00 [4]
21
Congested-flow Isolation(see IEEE Project P8021Qcz [5])
21
DownstreamUpstream1 3 1 3
2
4
2
4Ingress Port
(Virtual Queues)Egress Port Ingress Port
(Virtual Queues)Egress Port
Congested Flows
Non-Congested Flows
1 Identify the flow causing congestion and isolate locally
CIP 2 Signal to neighbor when congested queue fills
Eliminate HoL Blocking
3 Upstream isolates the flow too eliminating head-of-line blocking
PFC 4 If congested queue continues to fill invoke PFC for lossless
Source IEEE 802 Nendica Report The Lossless Network for Data Centers [2]
22
Congested-flow Isolation Analysis22
Source IEEE 8021-19-0012-00 [4]
23
Figure 14 Isolating contributor flows eliminates HOL blocking
other flows so that if HOL blocking were prevented congestion trees could exist withoutactually being a problem on their own
Although this basic approach have been followed by very dicrarrerent techniques in generala common advantage of all of them is that they deal with HOL blocking fast and locallyas they attack the HOL blocking when and where it may appear Therefore they cancomplement some techniques explained above that react slowly to congestion (see Section7) Another common issue of all these techniques is that they configure several queues perport to separately store packets belonging to dicrarrerent flows hence reducing or eliminatingthe interaction among these flows and so the HOL blocking they can produce each otherHence the ultimate goal of these techniques is to isolate in separate queues the flows thatcontribute to congestion which are the ones producing significant HOL blocking This canbe seen in Figure 15 that shows the same situation depicted in Figure 11 but now withseveral queues configured per port to separate the flows addressed to dicrarrerent destinationsAs can be seen the separation of flows into dicrarrerent queues leads to the elimination ofHOL blocking so to an improvement of the performance of the network as the utilizationof several links increases with respect to the situation that the flows share queues
Note that the HOL blocking would be ideally and completely eliminated if there wereat every port a queue specifically dedicated to each destination [10] but this is obvi-ously unacrarrordable as the number of queues in the network would grow quadratically with
23
Bibliography1) IEEE 802 Orientation for New Participants httpsmentorieeeorg802-ecdcn18ec-18-0117-02pdf
2) IEEE 802 ldquoNetwork Enhancements for the Next Decaderdquo Industry Connections Activity (Nendica) https1ieee802org802-nendica
3) IEEE 802 Nendica Report ldquoThe Lossless Network for Data Centersrdquo (18 August 2018) httpsmentorieeeorg8021dcn181-18-0042-00pdf
4) Paul Congdon ldquoThe Lossless Network in the Data Centerrdquo IEEE 8021-17-0007-01 7 November 2017 httpsmentorieeeorg8021dcn171-17-0007-01pdf
5) Pedro Javier Garcia Jesus Escudero-Sahuquillo Francisco J Quiles and Jose Duato ldquoCongestion Management for Ethernet-based Lossless DataCenterNetworksrdquo IEEE 8021-19-0012-00 4 February 2012 httpsmentorieeeorg8021dcn191-19-0012-00pdf
6) IEEE P8021Qcz Project ldquoCongestion Isolationrdquo https1ieee802orgtsn802-1qcz
23
24
Going forward
bull IEEE 802 Nendica Report ldquoThe Lossless Network for Data Centersrdquo (18 August 2018) is published but open to further comment Comments are welcome from all
bull Could open an activity to revise the report addressing new issues Proposal [5] may be discussed in future teleconference
bull Report could help identify issues in need of further research or unified action
bull Nendica could instigate standardization of key topics Mainly in IEEE 802 perhaps also in eg IETF
24
25
Nendica Participation
bull Nendica is open to all participants please join in no membership requirements Comment by Internet or contribute documents Call in to teleconferences Attend meetings
bull Nendica web site https1ieee802org802-nendica Details of in-person and teleconference meetings
bull Feel free to contact the Chair (see first slide)
25
14
Lossless Ethernet is a Foundation of the New Data Center
bull Data Centers are becoming High Performance Computersbull Most Data Center networking is at Layer 34 but it all
rides on Layer 12bull Ethernet is preferred option for Layer 12 Ethernet infrastructure can support multiple upper layers
(eg TCP and RoCEv2) simultaneouslybull Most Data Center networking requires low latency bull Some Data Center networking requires a lossless
Ethernet fabric
14
15
IEEE 802 DevelopedData Center Bridging
bull In the IEEE 8021 Working Group the Data Center Bridging Task Group developed many standards to enhance Layer 2 support for the data center Beginning around 2006
bull Key technologies include Priority-based Flow Control (PFC) Congestion Notification Enhanced Transmission Selection (ETS)
15
16
Folded-Clos NetworkMany Paths from Server to Server
16
server
spine
rack
17
Incast fills output queue(note ECMP cannot help)
17
server
spine
rack
ECMPincast
18
Priority flow control (PFC)18
server
spine
rack
PFC
incast
bull Output backup fills ingress queuebull PFC can be used to pause input per QoS classbull IEEE 8021Q (originally in 8021Qbb)
19
PFC pauses all flows of the classincluding ldquovictimrdquo flows
19
server
spine
rack
PFC stops both flows
incast
20
Traffic-class blocking analysis20
17
Figure 11 Network performance degradation due to HOL blocking
other flows in the network that are injected from other sources and are addressed todestinations dicrarrerent from X (Y and Z respectively) so that they do not contribute to thecongestion tree However as can be seen the latter flows interact (ie share queues) withthe congested flows at several points of the network then the latter flows produce HOLblocking to the former ones As a consequence all the flows end up being injected at thesame rate as the contributor ones despite the fact that the links connected to Y and Zare not saturated On the other hand note that at the switch where the root is locatedthe HOL blocking appears due to the direct contention of the congested flows to accessthe root of the tree but in other (upstream) switches the HOL blocking appears due tocongestion propagation through flow control In general the HOL blocking produced inthe switch where congestion originates is called ldquolow-orderrdquo HOL blocking while the oneproduced at upstream switches is called ldquohigh-orderrdquo HOL blocking
4 Reducing In-Network Congestion
Taking into account the problems derived from congestion trees a natural approach to dealwith them is trying to prevent their formation This approach is actually a particular case
Source IEEE 8021-19-0012-00 [4]
21
Congested-flow Isolation(see IEEE Project P8021Qcz [5])
21
DownstreamUpstream1 3 1 3
2
4
2
4Ingress Port
(Virtual Queues)Egress Port Ingress Port
(Virtual Queues)Egress Port
Congested Flows
Non-Congested Flows
1 Identify the flow causing congestion and isolate locally
CIP 2 Signal to neighbor when congested queue fills
Eliminate HoL Blocking
3 Upstream isolates the flow too eliminating head-of-line blocking
PFC 4 If congested queue continues to fill invoke PFC for lossless
Source IEEE 802 Nendica Report The Lossless Network for Data Centers [2]
22
Congested-flow Isolation Analysis22
Source IEEE 8021-19-0012-00 [4]
23
Figure 14 Isolating contributor flows eliminates HOL blocking
other flows so that if HOL blocking were prevented congestion trees could exist withoutactually being a problem on their own
Although this basic approach have been followed by very dicrarrerent techniques in generala common advantage of all of them is that they deal with HOL blocking fast and locallyas they attack the HOL blocking when and where it may appear Therefore they cancomplement some techniques explained above that react slowly to congestion (see Section7) Another common issue of all these techniques is that they configure several queues perport to separately store packets belonging to dicrarrerent flows hence reducing or eliminatingthe interaction among these flows and so the HOL blocking they can produce each otherHence the ultimate goal of these techniques is to isolate in separate queues the flows thatcontribute to congestion which are the ones producing significant HOL blocking This canbe seen in Figure 15 that shows the same situation depicted in Figure 11 but now withseveral queues configured per port to separate the flows addressed to dicrarrerent destinationsAs can be seen the separation of flows into dicrarrerent queues leads to the elimination ofHOL blocking so to an improvement of the performance of the network as the utilizationof several links increases with respect to the situation that the flows share queues
Note that the HOL blocking would be ideally and completely eliminated if there wereat every port a queue specifically dedicated to each destination [10] but this is obvi-ously unacrarrordable as the number of queues in the network would grow quadratically with
23
Bibliography1) IEEE 802 Orientation for New Participants httpsmentorieeeorg802-ecdcn18ec-18-0117-02pdf
2) IEEE 802 ldquoNetwork Enhancements for the Next Decaderdquo Industry Connections Activity (Nendica) https1ieee802org802-nendica
3) IEEE 802 Nendica Report ldquoThe Lossless Network for Data Centersrdquo (18 August 2018) httpsmentorieeeorg8021dcn181-18-0042-00pdf
4) Paul Congdon ldquoThe Lossless Network in the Data Centerrdquo IEEE 8021-17-0007-01 7 November 2017 httpsmentorieeeorg8021dcn171-17-0007-01pdf
5) Pedro Javier Garcia Jesus Escudero-Sahuquillo Francisco J Quiles and Jose Duato ldquoCongestion Management for Ethernet-based Lossless DataCenterNetworksrdquo IEEE 8021-19-0012-00 4 February 2012 httpsmentorieeeorg8021dcn191-19-0012-00pdf
6) IEEE P8021Qcz Project ldquoCongestion Isolationrdquo https1ieee802orgtsn802-1qcz
23
24
Going forward
bull IEEE 802 Nendica Report ldquoThe Lossless Network for Data Centersrdquo (18 August 2018) is published but open to further comment Comments are welcome from all
bull Could open an activity to revise the report addressing new issues Proposal [5] may be discussed in future teleconference
bull Report could help identify issues in need of further research or unified action
bull Nendica could instigate standardization of key topics Mainly in IEEE 802 perhaps also in eg IETF
24
25
Nendica Participation
bull Nendica is open to all participants please join in no membership requirements Comment by Internet or contribute documents Call in to teleconferences Attend meetings
bull Nendica web site https1ieee802org802-nendica Details of in-person and teleconference meetings
bull Feel free to contact the Chair (see first slide)
25
15
IEEE 802 DevelopedData Center Bridging
bull In the IEEE 8021 Working Group the Data Center Bridging Task Group developed many standards to enhance Layer 2 support for the data center Beginning around 2006
bull Key technologies include Priority-based Flow Control (PFC) Congestion Notification Enhanced Transmission Selection (ETS)
15
16
Folded-Clos NetworkMany Paths from Server to Server
16
server
spine
rack
17
Incast fills output queue(note ECMP cannot help)
17
server
spine
rack
ECMPincast
18
Priority flow control (PFC)18
server
spine
rack
PFC
incast
bull Output backup fills ingress queuebull PFC can be used to pause input per QoS classbull IEEE 8021Q (originally in 8021Qbb)
19
PFC pauses all flows of the classincluding ldquovictimrdquo flows
19
server
spine
rack
PFC stops both flows
incast
20
Traffic-class blocking analysis20
17
Figure 11 Network performance degradation due to HOL blocking
other flows in the network that are injected from other sources and are addressed todestinations dicrarrerent from X (Y and Z respectively) so that they do not contribute to thecongestion tree However as can be seen the latter flows interact (ie share queues) withthe congested flows at several points of the network then the latter flows produce HOLblocking to the former ones As a consequence all the flows end up being injected at thesame rate as the contributor ones despite the fact that the links connected to Y and Zare not saturated On the other hand note that at the switch where the root is locatedthe HOL blocking appears due to the direct contention of the congested flows to accessthe root of the tree but in other (upstream) switches the HOL blocking appears due tocongestion propagation through flow control In general the HOL blocking produced inthe switch where congestion originates is called ldquolow-orderrdquo HOL blocking while the oneproduced at upstream switches is called ldquohigh-orderrdquo HOL blocking
4 Reducing In-Network Congestion
Taking into account the problems derived from congestion trees a natural approach to dealwith them is trying to prevent their formation This approach is actually a particular case
Source IEEE 8021-19-0012-00 [4]
21
Congested-flow Isolation(see IEEE Project P8021Qcz [5])
21
DownstreamUpstream1 3 1 3
2
4
2
4Ingress Port
(Virtual Queues)Egress Port Ingress Port
(Virtual Queues)Egress Port
Congested Flows
Non-Congested Flows
1 Identify the flow causing congestion and isolate locally
CIP 2 Signal to neighbor when congested queue fills
Eliminate HoL Blocking
3 Upstream isolates the flow too eliminating head-of-line blocking
PFC 4 If congested queue continues to fill invoke PFC for lossless
Source IEEE 802 Nendica Report The Lossless Network for Data Centers [2]
22
Congested-flow Isolation Analysis22
Source IEEE 8021-19-0012-00 [4]
23
Figure 14 Isolating contributor flows eliminates HOL blocking
other flows so that if HOL blocking were prevented congestion trees could exist withoutactually being a problem on their own
Although this basic approach have been followed by very dicrarrerent techniques in generala common advantage of all of them is that they deal with HOL blocking fast and locallyas they attack the HOL blocking when and where it may appear Therefore they cancomplement some techniques explained above that react slowly to congestion (see Section7) Another common issue of all these techniques is that they configure several queues perport to separately store packets belonging to dicrarrerent flows hence reducing or eliminatingthe interaction among these flows and so the HOL blocking they can produce each otherHence the ultimate goal of these techniques is to isolate in separate queues the flows thatcontribute to congestion which are the ones producing significant HOL blocking This canbe seen in Figure 15 that shows the same situation depicted in Figure 11 but now withseveral queues configured per port to separate the flows addressed to dicrarrerent destinationsAs can be seen the separation of flows into dicrarrerent queues leads to the elimination ofHOL blocking so to an improvement of the performance of the network as the utilizationof several links increases with respect to the situation that the flows share queues
Note that the HOL blocking would be ideally and completely eliminated if there wereat every port a queue specifically dedicated to each destination [10] but this is obvi-ously unacrarrordable as the number of queues in the network would grow quadratically with
23
Bibliography1) IEEE 802 Orientation for New Participants httpsmentorieeeorg802-ecdcn18ec-18-0117-02pdf
2) IEEE 802 ldquoNetwork Enhancements for the Next Decaderdquo Industry Connections Activity (Nendica) https1ieee802org802-nendica
3) IEEE 802 Nendica Report ldquoThe Lossless Network for Data Centersrdquo (18 August 2018) httpsmentorieeeorg8021dcn181-18-0042-00pdf
4) Paul Congdon ldquoThe Lossless Network in the Data Centerrdquo IEEE 8021-17-0007-01 7 November 2017 httpsmentorieeeorg8021dcn171-17-0007-01pdf
5) Pedro Javier Garcia Jesus Escudero-Sahuquillo Francisco J Quiles and Jose Duato ldquoCongestion Management for Ethernet-based Lossless DataCenterNetworksrdquo IEEE 8021-19-0012-00 4 February 2012 httpsmentorieeeorg8021dcn191-19-0012-00pdf
6) IEEE P8021Qcz Project ldquoCongestion Isolationrdquo https1ieee802orgtsn802-1qcz
23
24
Going forward
bull IEEE 802 Nendica Report ldquoThe Lossless Network for Data Centersrdquo (18 August 2018) is published but open to further comment Comments are welcome from all
bull Could open an activity to revise the report addressing new issues Proposal [5] may be discussed in future teleconference
bull Report could help identify issues in need of further research or unified action
bull Nendica could instigate standardization of key topics Mainly in IEEE 802 perhaps also in eg IETF
24
25
Nendica Participation
bull Nendica is open to all participants please join in no membership requirements Comment by Internet or contribute documents Call in to teleconferences Attend meetings
bull Nendica web site https1ieee802org802-nendica Details of in-person and teleconference meetings
bull Feel free to contact the Chair (see first slide)
25
16
Folded-Clos NetworkMany Paths from Server to Server
16
server
spine
rack
17
Incast fills output queue(note ECMP cannot help)
17
server
spine
rack
ECMPincast
18
Priority flow control (PFC)18
server
spine
rack
PFC
incast
bull Output backup fills ingress queuebull PFC can be used to pause input per QoS classbull IEEE 8021Q (originally in 8021Qbb)
19
PFC pauses all flows of the classincluding ldquovictimrdquo flows
19
server
spine
rack
PFC stops both flows
incast
20
Traffic-class blocking analysis20
17
Figure 11 Network performance degradation due to HOL blocking
other flows in the network that are injected from other sources and are addressed todestinations dicrarrerent from X (Y and Z respectively) so that they do not contribute to thecongestion tree However as can be seen the latter flows interact (ie share queues) withthe congested flows at several points of the network then the latter flows produce HOLblocking to the former ones As a consequence all the flows end up being injected at thesame rate as the contributor ones despite the fact that the links connected to Y and Zare not saturated On the other hand note that at the switch where the root is locatedthe HOL blocking appears due to the direct contention of the congested flows to accessthe root of the tree but in other (upstream) switches the HOL blocking appears due tocongestion propagation through flow control In general the HOL blocking produced inthe switch where congestion originates is called ldquolow-orderrdquo HOL blocking while the oneproduced at upstream switches is called ldquohigh-orderrdquo HOL blocking
4 Reducing In-Network Congestion
Taking into account the problems derived from congestion trees a natural approach to dealwith them is trying to prevent their formation This approach is actually a particular case
Source IEEE 8021-19-0012-00 [4]
21
17
Incast fills output queue (note: ECMP cannot help)
[Diagram: spine / rack / server fabric; incast traffic from many servers converges on one server-facing output port while ECMP spreads the flows across the spine paths]
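Why ECMP cannot help can be shown with back-of-the-envelope arithmetic (the link speed, fan-in, and burst size below are hypothetical, chosen only for illustration): however ECMP spreads the flows across spine paths, they all converge on the same receiver-facing link.

```python
# Illustrative sketch (hypothetical numbers): N servers send a burst to one
# receiver behind a single 100 Gb/s link. ECMP can balance the flows across
# spine paths, but every path converges on the same output port.

LINK_GBPS = 100          # capacity of the receiver-facing link
N_SENDERS = 32           # servers replying at once (e.g., partition/aggregate)
BURST_MB = 1             # burst size per sender, in megabytes

offered_gbps = N_SENDERS * LINK_GBPS   # each sender can fill its own link
overload = offered_gbps / LINK_GBPS    # 32x oversubscription at the output

total_bits = N_SENDERS * BURST_MB * 8e6
drain_time_ms = total_bits / (LINK_GBPS * 1e9) * 1e3

print(f"oversubscription: {overload:.0f}x")
print(f"time to drain {N_SENDERS} bursts: {drain_time_ms:.2f} ms")
# The output queue must absorb the excess no matter which spine paths ECMP
# chose: the bottleneck is the destination link, not the fabric paths.
```

The point of the sketch is that incast oversubscription is determined entirely at the last hop, so path-diversity mechanisms upstream cannot relieve it.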
18
Priority flow control (PFC)
[Diagram: spine / rack / server fabric under incast; the rack switch sends PFC upstream toward the spine]
• Output backup fills the ingress queue
• PFC can be used to pause input per QoS class
• IEEE 802.1Q (originally in 802.1Qbb)
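The per-class pause mechanism is carried in a MAC Control frame. The following is a minimal encoding sketch of a PFC frame per IEEE 802.1Qbb (now part of 802.1Q): a 16-bit priority-enable vector marks which of the eight 16-bit pause timers are valid. FCS and minimum-frame padding are omitted here.

```python
import struct

def build_pfc_frame(src_mac, pause_quanta):
    """Encode a PFC (Priority-based Flow Control) MAC Control frame.

    pause_quanta maps priority (0-7) -> pause time in 512-bit-time quanta.
    A quantum of 0xFFFF pauses the priority for the maximum time; 0 resumes it.
    """
    dst_mac = bytes.fromhex("0180c2000001")   # reserved multicast address
    ethertype = 0x8808                        # MAC Control
    opcode = 0x0101                           # PFC (classic PAUSE is 0x0001)
    enable_vector = 0
    times = [0] * 8
    for prio, quanta in pause_quanta.items():
        enable_vector |= 1 << prio            # bit e[n] marks time[n] as valid
        times[prio] = quanta
    return (dst_mac + src_mac
            + struct.pack("!HHH", ethertype, opcode, enable_vector)
            + struct.pack("!8H", *times))

frame = build_pfc_frame(bytes(6), {3: 0xFFFF})   # pause priority 3 only
assert len(frame) == 34   # 12 addr + 2 type + 2 opcode + 2 vector + 16 times
```

Because the pause is keyed only by priority, every flow mapped to priority 3 on that link stops, which is exactly the "victim flow" problem the next slide illustrates.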
19
PFC pauses all flows of the class, including “victim” flows
[Diagram: spine / rack / server fabric under incast; PFC stops both flows, the congesting flow and the innocent “victim” flow, because they share a traffic class]
20
Traffic-class blocking analysis
Figure 11: Network performance degradation due to HOL blocking

…other flows in the network that are injected from other sources and are addressed to destinations different from X (Y and Z, respectively), so that they do not contribute to the congestion tree. However, as can be seen, the latter flows interact (i.e., share queues) with the congested flows at several points of the network; then the latter flows suffer HOL blocking from the former ones. As a consequence, all the flows end up being injected at the same rate as the contributor ones, despite the fact that the links connected to Y and Z are not saturated. On the other hand, note that at the switch where the root is located, the HOL blocking appears due to the direct contention of the congested flows to access the root of the tree, but in other (upstream) switches the HOL blocking appears due to congestion propagation through flow control. In general, the HOL blocking produced in the switch where congestion originates is called “low-order” HOL blocking, while the one produced at upstream switches is called “high-order” HOL blocking.

4 Reducing In-Network Congestion
Taking into account the problems derived from congestion trees, a natural approach to deal with them is trying to prevent their formation. This approach is actually a particular case…

Source: IEEE 802.1-19-0012-00 [5]
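The low-order HOL blocking described above can be reproduced in a toy queue model (the scheduler, packet list, and pause set below are invented for illustration): a packet for an idle output Y is stuck behind packets for a paused output X when they share one FIFO, but drains immediately with per-destination queues.

```python
from collections import deque

paused = {"X"}                       # output X is backpressured (e.g., by PFC)
packets = [("X", 1), ("X", 2), ("Y", 1)]   # (destination, packet id)

def run(queues, slots=3):
    """Serve queue heads for a few slots; a paused destination blocks its queue."""
    delivered = []
    for _ in range(slots):
        for q in queues:
            if q and q[0][0] not in paused:
                delivered.append(q.popleft())
                break                # at most one packet per slot
    return delivered

# Shared FIFO: Y's packet is stuck behind X's packets (head-of-line blocking).
shared = run([deque(packets)])
# Per-destination queues: Y drains even while X stays paused.
voq = run([deque(p for p in packets if p[0] == "X"),
           deque(p for p in packets if p[0] == "Y")])

print(shared)  # []
print(voq)     # [('Y', 1)]
```

In the shared case nothing is delivered in any slot, even though Y's link is idle; that is precisely the performance loss the excerpt attributes to HOL blocking.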
21
Congested-flow Isolation (see IEEE Project P802.1Qcz [6])
[Diagram: an upstream and a downstream switch, each with ingress ports (virtual queues) and egress ports; congested flows are steered into a separate queue from non-congested flows]
1. Identify the flow causing congestion and isolate it locally
2. CIP: signal to the neighbor when the congested queue fills
3. Upstream isolates the flow too, eliminating head-of-line blocking
4. PFC: if the congested queue continues to fill, invoke PFC for lossless operation
Source: IEEE 802 Nendica Report, “The Lossless Network for Data Centers” [3]
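The four steps above can be sketched as event-driven switch logic. This is not the P802.1Qcz state machine; the class, thresholds, and message passing below are invented purely to show the sequence local isolation → CIP signal → upstream isolation → PFC fallback.

```python
# Hedged sketch of the congestion-isolation sequence from the slide.
# Names and thresholds are hypothetical, not taken from the standard.

class Switch:
    ISOLATE_AT = 4      # packets of one flow in the normal queue -> step 1
    CIP_AT = 6          # congested-queue depth that triggers CIP  -> step 2
    PFC_AT = 8          # congested-queue depth that triggers PFC  -> step 4

    def __init__(self, name):
        self.name = name
        self.normal, self.congested = [], []
        self.isolated_flows = set()
        self.upstream = None

    def enqueue(self, flow):
        queue = self.congested if flow in self.isolated_flows else self.normal
        queue.append(flow)
        if self.normal.count(flow) >= self.ISOLATE_AT:
            self.isolated_flows.add(flow)                    # step 1: isolate
            self.congested += [f for f in self.normal if f == flow]
            self.normal = [f for f in self.normal if f != flow]
        if len(self.congested) >= self.CIP_AT and self.upstream:
            self.upstream.isolated_flows.add(flow)           # steps 2-3: CIP
        return "PFC" if len(self.congested) >= self.PFC_AT else "ok"  # step 4

up, down = Switch("upstream"), Switch("downstream")
down.upstream = up
events = [down.enqueue("elephant") for _ in range(9)]
# The elephant flow is isolated locally, then upstream via CIP; PFC is only
# invoked once the congested queue keeps growing, so victim flows in the
# normal queue are never paused.
```

The design point the sketch illustrates: PFC becomes a last-resort lossless guarantee on the already-isolated queue, rather than the first response that pauses a whole traffic class.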
22
Congested-flow Isolation Analysis
Source: IEEE 802.1-19-0012-00 [5]
Figure 14: Isolating contributor flows eliminates HOL blocking

…other flows, so that if HOL blocking were prevented, congestion trees could exist without actually being a problem on their own.

Although this basic approach has been followed by very different techniques, in general a common advantage of all of them is that they deal with HOL blocking fast and locally, as they attack the HOL blocking when and where it may appear. Therefore, they can complement some techniques explained above that react slowly to congestion (see Section 7). Another common issue of all these techniques is that they configure several queues per port to separately store packets belonging to different flows, hence reducing or eliminating the interaction among these flows, and so the HOL blocking they can produce on each other. Hence, the ultimate goal of these techniques is to isolate in separate queues the flows that contribute to congestion, which are the ones producing significant HOL blocking. This can be seen in Figure 15, which shows the same situation depicted in Figure 11, but now with several queues configured per port to separate the flows addressed to different destinations. As can be seen, the separation of flows into different queues leads to the elimination of HOL blocking, and so to an improvement of the performance of the network, as the utilization of several links increases with respect to the situation where the flows share queues.

Note that HOL blocking would be ideally and completely eliminated if there were at every port a queue specifically dedicated to each destination [10], but this is obviously unaffordable, as the number of queues in the network would grow quadratically with…
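The "quadratic" cost of per-destination queues can be made concrete with illustrative arithmetic (fabric sizes below are hypothetical): total queues scale with ports × destinations, whereas isolating congested flows needs only a small fixed number of extra queues per port.

```python
# Illustrative count (hypothetical fabric): per-destination queues vs. the
# small fixed number of queues that congestion isolation adds per port.

hosts = 1024                 # end stations (destinations)
switches = 64
ports_per_switch = 48
total_ports = switches * ports_per_switch            # 3072 switch ports

per_destination_queues = total_ports * hosts         # -> 3145728
isolation_queues = total_ports * 2                   # -> 6144
# (two queues per port per class in this sketch: congested + non-congested)

print(per_destination_queues)
print(isolation_queues)
```

Since the number of ports itself grows with the number of hosts, the per-destination total grows roughly quadratically in network size, which is the point the excerpt makes.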
23
Bibliography
1) IEEE 802 Orientation for New Participants, https://mentor.ieee.org/802-ec/dcn/18/ec-18-0117-02.pdf
2) IEEE 802 “Network Enhancements for the Next Decade” Industry Connections Activity (Nendica), https://1.ieee802.org/802-nendica
3) IEEE 802 Nendica Report, “The Lossless Network for Data Centers” (18 August 2018), https://mentor.ieee.org/802.1/dcn/18/1-18-0042-00.pdf
4) Paul Congdon, “The Lossless Network in the Data Center,” IEEE 802.1-17-0007-01, 7 November 2017, https://mentor.ieee.org/802.1/dcn/17/1-17-0007-01.pdf
5) Pedro Javier Garcia, Jesus Escudero-Sahuquillo, Francisco J. Quiles, and Jose Duato, “Congestion Management for Ethernet-based Lossless Data Center Networks,” IEEE 802.1-19-0012-00, 4 February 2019, https://mentor.ieee.org/802.1/dcn/19/1-19-0012-00.pdf
6) IEEE P802.1Qcz Project, “Congestion Isolation,” https://1.ieee802.org/tsn/802-1qcz
24
Going forward
• The IEEE 802 Nendica Report “The Lossless Network for Data Centers” (18 August 2018) is published but open to further comment; comments are welcome from all
• An activity could be opened to revise the report, addressing new issues; proposal [5] may be discussed in a future teleconference
• The report could help identify issues in need of further research or unified action
• Nendica could instigate standardization of key topics, mainly in IEEE 802, perhaps also in, e.g., the IETF
25
Nendica Participation
• Nendica is open to all participants; please join in
  - No membership requirements
  - Comment by Internet or contribute documents
  - Call in to teleconferences
  - Attend meetings
• Nendica web site: https://1.ieee802.org/802-nendica
  - Details of in-person and teleconference meetings
• Feel free to contact the Chair (see first slide)
bull Nendica web site https1ieee802org802-nendica Details of in-person and teleconference meetings
bull Feel free to contact the Chair (see first slide)
25
25
Nendica Participation
bull Nendica is open to all participants please join in no membership requirements Comment by Internet or contribute documents Call in to teleconferences Attend meetings
bull Nendica web site https1ieee802org802-nendica Details of in-person and teleconference meetings
bull Feel free to contact the Chair (see first slide)
25