RDMA in Data Centers: Looking Back and Looking Forward Chuanxiong Guo ACM SIGCOMM APNet 2017 August 3 2017 Microsoft Research

Slides: conferences.sigcomm.org/events/apnet2017/slides/cx.pdf

RDMA in Data Centers: Looking Back and Looking Forward

Chuanxiong Guo

ACM SIGCOMM APNet 2017

August 3, 2017

Microsoft Research

The Rising of Cloud Computing

40 Azure regions

Data Centers

Data center networks (DCN)

• Cloud-scale services: IaaS, PaaS, Search, Big Data, Storage, Machine Learning, Deep Learning
• Services are latency sensitive, bandwidth hungry, or both
• Cloud-scale services need cloud-scale computing and communication infrastructure

Data center networks (DCN)

• Single ownership
• Large scale
• High bisection bandwidth
• Commodity Ethernet switches
• TCP/IP protocol suite

[Figure: network topology with Spine, Leaf, and ToR switch layers; servers are grouped into pods and podsets]

But TCP/IP is not doing well

TCP latency

• 405 us (P50)
• 716 us (P90)
• 2132 us (P99)

Long latency tail

Pingmesh measurement results

TCP processing overhead (40G)

[Figure: sender and receiver panels; 8 TCP connections over a 40G NIC]

An RDMA renaissance story

• Virtual Interface Architecture Spec 1.0: 1997
• InfiniBand Architecture Spec: 1.0 (2000), 1.1 (2002), 1.2 (2004), 1.3 (2015)
• RoCE: 2010
• RoCEv2: 2014

RDMA

• Remote Direct Memory Access (RDMA): method of accessing memory on a remote system without interrupting the processing of the CPU(s) on that system
• RDMA offloads packet processing protocols to the NIC
• RDMA in Ethernet based data centers

RoCEv2: RDMA over Commodity Ethernet

• RoCEv2 for Ethernet based data centers
• RoCEv2 encapsulates packets in UDP
• OS kernel is not in data path
• NIC for network protocol processing and message DMA

[Figure: two hosts connected by a lossless network. On each host the RDMA app calls RDMA verbs from user space, the kernel holds the TCP/IP stack and NIC driver, and the NIC hardware runs RDMA transport, IP, and Ethernet and moves message data by DMA]
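To make the "encapsulates packets in UDP" bullet concrete, the following is a minimal Python sketch of the 12-byte InfiniBand Base Transport Header (BTH) that RoCEv2 carries as the start of the UDP payload. It is an illustration under stated assumptions, not what a NIC actually runs: the helper name `pack_bth` is hypothetical, the field layout and UDP destination port 4791 are taken from the public RoCEv2 and InfiniBand specifications rather than from the slides, and the message payload and trailing ICRC are left out.

```python
import struct

ROCEV2_UDP_DPORT = 4791   # UDP destination port registered for RoCEv2
RC_SEND_ONLY = 0x04       # BTH opcode for a single-packet reliable-connection Send


def pack_bth(opcode, dest_qp, psn, pkey=0xFFFF, ack_req=False):
    """Pack a 12-byte Base Transport Header (hypothetical helper)."""
    if not 0 <= dest_qp < (1 << 24):
        raise ValueError("dest_qp must fit in 24 bits")
    if not 0 <= psn < (1 << 24):
        raise ValueError("psn must fit in 24 bits")
    flags = 0                               # SE, M, PadCnt, TVer all zero here
    qp_word = dest_qp                       # upper 8 bits are reserved
    psn_word = (int(ack_req) << 31) | psn   # AckReq bit, 7 reserved bits, 24-bit PSN
    return struct.pack("!BBHII", opcode, flags, pkey, qp_word, psn_word)


# A UDP datagram to port 4791 would carry: BTH + payload + ICRC (omitted).
bth = pack_bth(RC_SEND_ONLY, dest_qp=0x12, psn=5)
print(len(bth), bth.hex())
```

Because the outer headers are ordinary IP and UDP, commodity switches can route these packets like any other datagram, while the NIC parses the BTH to find the destination queue pair.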

RDMA benefit: latency reduction

• For small msgs (<32KB), OS processing latency matters
• For large msgs (100KB+), speed matters

RDMA benefit: CPU overhead reduction

[Figure: sender and receiver panels; one ND connection over a 40G NIC, 37 Gb/s goodput]

RDMA benefit: CPU overhead reduction

• RDMA: single QP, 88 Gb/s, 1.7% CPU
• TCP: eight connections, 30-50 Gb/s, client: 2.6%, server: 4.3% CPU
• Intel(R) Xeon(R) [email protected], two sockets, 28 cores

RoCEv2 needs a lossless Ethernet network

• PFC for hop-by-hop flow control
• DCQCN for connection-level congestion control

Priority-based flow control (PFC)

• Hop-by-hop flow control, with eight priorities for HOL blocking mitigation
• The priority in data packets is carried in the VLAN tag or DSCP
• PFC pause frame to inform the upstream to stop
• PFC causes HOL blocking and collateral damage

[Figure: an upstream egress port and a downstream ingress port, each with queues p0 to p7. Data packets flow downstream; when ingress queue p1 reaches the XOFF threshold, a PFC pause frame for p1 is sent upstream]
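The per-priority pause behavior described above can be sketched as a small Python model. This is a toy under stated assumptions: the class name `PfcIngressPort` and the byte-count interface are hypothetical, and the XON (resume) threshold is an assumption added for the example, since the slide only shows the XOFF threshold.

```python
class PfcIngressPort:
    """Toy model of one ingress port with per-priority PFC state."""

    def __init__(self, xoff, xon, priorities=8):
        self.xoff = xoff                     # queue depth that triggers a pause
        self.xon = xon                       # assumed depth at which to resume
        self.depth = [0] * priorities        # buffered bytes per priority
        self.paused = [False] * priorities   # whether upstream is paused

    def enqueue(self, prio, nbytes):
        """Buffer a packet; return a pause signal if XOFF is crossed."""
        self.depth[prio] += nbytes
        if not self.paused[prio] and self.depth[prio] >= self.xoff:
            self.paused[prio] = True
            return ("PAUSE", prio)
        return None

    def dequeue(self, prio, nbytes):
        """Drain bytes; return a resume signal once depth falls to XON."""
        self.depth[prio] = max(0, self.depth[prio] - nbytes)
        if self.paused[prio] and self.depth[prio] <= self.xon:
            self.paused[prio] = False
            return ("RESUME", prio)
        return None


port = PfcIngressPort(xoff=100, xon=40)
print(port.enqueue(1, 60))   # below XOFF, no signal
print(port.enqueue(1, 50))   # crosses XOFF, pause priority 1 only
print(port.dequeue(1, 80))   # drains to XON or below, resume priority 1
```

Only the congested priority is paused, so the other seven keep flowing. Every flow that shares the paused priority stops as well, including flows that are not causing the congestion, which is the HOL blocking and collateral damage the last bullet refers to.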

DCQCN

• CP: switches use ECN for packet marking
• NP: periodically check if ECN-marked packets arrived; if so, notify the sender
• RP: adjust sending rate based on NP feedback

[Figure: Sender NIC (Reaction Point, RP) -> Switch (Congestion Point, CP) -> Receiver NIC (Notification Point, NP)]

DCQCN = Keep PFC + Use ECN + hardware rate-based congestion control
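The reaction point's rate adjustment can be sketched in a few lines of Python. This is a simplified illustration, not the NIC firmware: the update rules follow the published DCQCN description (they are not on the slide), the class and method names are hypothetical, and the timers, byte counters, and the additive and hyper increase phases are omitted.

```python
class ReactionPoint:
    """Simplified DCQCN sender-side state (assumed names and defaults)."""

    def __init__(self, line_rate_gbps, g=1 / 256):
        self.rc = line_rate_gbps   # current sending rate
        self.rt = line_rate_gbps   # target rate remembered for recovery
        self.alpha = 1.0           # running estimate of congestion extent
        self.g = g                 # gain used to update alpha

    def on_cnp(self):
        """Congestion notification arrived from the NP: cut the rate."""
        self.rt = self.rc
        self.rc = self.rc * (1 - self.alpha / 2)
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_interval(self):
        """No notification during an alpha-update interval: decay alpha."""
        self.alpha = (1 - self.g) * self.alpha

    def fast_recovery_step(self):
        """Move halfway back toward the target rate."""
        self.rc = (self.rt + self.rc) / 2


rp = ReactionPoint(40.0)
rp.on_cnp()
print(rp.rc, rp.rt)    # rate halves on the first notification because alpha starts at 1
rp.fast_recovery_step()
print(rp.rc)
```

Keeping the loop rate-based rather than window-based is what makes it practical to implement in NIC hardware, in line with the "hardware rate-based congestion control" summary above.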

The lossless requirement causes safety and performance challenges

• RDMA transport livelock
• PFC deadlock
• PFC pause frame storm
• Slow-receiver symptom

RDMA transport livelock

[Figure: a sender and a receiver connected through a switch with packet drop rate 1/256, comparing two retransmission schemes after the receiver returns NAK N. Go-back-0: the sender restarts from RDMA Send 0. Go-back-N: the sender resumes from RDMA Send N]
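A small Python simulation shows why go-back-0 can livelock at a 1/256 drop rate while go-back-N keeps making progress. It is a sketch under simplifying assumptions that are not in the slides: the drop is modeled deterministically as every 256th transmission, the sender reacts to a loss immediately (no packets in flight), and the function name `transmissions_needed` is hypothetical.

```python
def transmissions_needed(msg_pkts, go_back_zero, drop_every=256, max_tx=100_000):
    """Count transmissions to deliver one message, or None if it never finishes.

    Every `drop_every`-th transmission is dropped (assumed deterministic model).
    """
    delivered = 0   # packets received in order so far
    tx = 0          # total transmissions attempted
    while delivered < msg_pkts:
        if tx >= max_tx:
            return None              # no completion within the budget: livelock
        tx += 1
        if tx % drop_every == 0:
            if go_back_zero:
                delivered = 0        # restart the whole message
            # go-back-N: keep `delivered`, resend only the lost packet onward
        else:
            delivered += 1
    return tx


print(transmissions_needed(1000, go_back_zero=True))    # never completes
print(transmissions_needed(1000, go_back_zero=False))   # completes with a few resends
print(transmissions_needed(100, go_back_zero=True))     # short messages still get through
```

In this model any message of 256 packets or more never completes under go-back-0, because a drop always arrives before the message finishes, even though the link stays fully busy.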

PFC deadlock

• Our data centers use Clos network
• Packets first travel up then go down
• No cyclic buffer dependency for up-down routing -> no deadlock
• But we did experience deadlock!

[Figure: Clos topology with Spine, Leaf, and ToR layers; servers grouped into pods and podsets]

PFC deadlock

• Preliminaries
  • ARP table: IP address to MAC address mapping
  • MAC table: MAC address to port mapping
  • If MAC entry is missing, packets are flooded to all ports

ARP table

IP     MAC     TTL
IP0    MAC0    2h
IP1    MAC1    1h

MAC table

MAC     Port     TTL
MAC0    Port0    10min
MAC1    -        -

[Figure: an input packet with Dst: IP1 is looked up in the ARP table and then in the MAC table to choose the output port]

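PFC deadlock

[Figure: servers S1 to S5 attached to ToR switches T0 and T1, which connect to leaf switches La and Lb. Ingress and egress ports are labeled p0 to p4. The legend marks packet drops, congested ports, dead servers, and PFC pause frames; pause frames numbered 1 to 4 pass between the switches]

• Path: {S1, T0, La, T1, S3}
• Path: {S1, T0, La, T1, S5}
• Path: {S4, T1, Lb, T0, S2}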

PFC deadlock

• The PFC deadlock root cause: the interaction between the PFC flow control and the Ethernet packet flooding
• Solution: drop the lossless packets if the ARP entry is incomplete
• Recommendation: do not flood or multicast for lossless traffic
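The forwarding decision behind this fix can be sketched in Python. This is a toy model, not switch code: the function name `forward`, the dictionary-based tables, and the port names are hypothetical, and an unknown IP is treated the same as a missing MAC entry for simplicity.

```python
def forward(dst_ip, lossless, arp_table, mac_table, all_ports):
    """Return (action, ports) for one packet in a toy ToR model."""
    mac = arp_table.get(dst_ip)                        # IP -> MAC
    port = mac_table.get(mac) if mac is not None else None   # MAC -> port
    if port is not None:
        return ("forward", [port])
    if lossless:
        # Incomplete entry: drop instead of flooding lossless traffic.
        return ("drop", [])
    return ("flood", list(all_ports))                  # classic Ethernet behavior


# Tables mirroring the earlier preliminaries: MAC1 has no port entry.
arp_table = {"IP0": "MAC0", "IP1": "MAC1"}
mac_table = {"MAC0": "Port0"}
ports = ["Port0", "Port1", "Port2"]

print(forward("IP0", True, arp_table, mac_table, ports))
print(forward("IP1", True, arp_table, mac_table, ports))
print(forward("IP1", False, arp_table, mac_table, ports))
```

Dropping here trades a lost packet to an unreachable destination for the guarantee that flooded lossless packets cannot fill buffers on unrelated ports and trigger pause frames there.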

Tagger: practical PFC deadlock prevention

• Tagger algorithm works for general network topology
• Deployable in existing switching ASICs
• Concept: Expected Lossless Path (ELP) to decouple Tagger from routing
• Strategy: move packets to a different lossless queue before a cyclic buffer dependency (CBD) forms

[Figure: two copies of a topology with spine switches S0 and S1, leaf switches L0 to L3, and ToR switches T0 to T3]
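To illustrate what a cyclic buffer dependency looks like, the following Python sketch builds a dependency graph from a set of paths and checks it for cycles. It is not the Tagger algorithm itself, only a detector for the condition Tagger is designed to avoid; the function names, the two-leaf two-ToR topology, and the "bounce" paths are hypothetical examples.

```python
def buffer_dependencies(paths):
    """Map each directed link (u, v) to the links its packets may wait on next."""
    deps = {}
    for path in paths:
        links = list(zip(path, path[1:]))
        for link in links:
            deps.setdefault(link, set())
        for cur, nxt in zip(links, links[1:]):
            deps[cur].add(nxt)       # buffer on `cur` drains into `nxt`
    return deps


def has_cycle(deps):
    """Depth-first search for a cycle in the dependency graph."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in deps}

    def visit(node):
        color[node] = GRAY
        for nxt in deps[node]:
            if color[nxt] == GRAY or (color[nxt] == WHITE and visit(nxt)):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in deps)


# Up-down paths only: ToR -> leaf -> ToR.
up_down = [["T0", "L0", "T1"], ["T1", "L0", "T0"],
           ["T0", "L1", "T1"], ["T1", "L1", "T0"]]
# Hypothetical bounce paths that go down and then back up.
bounce = [["L0", "T1", "L1", "T0"], ["L1", "T0", "L0", "T1"]]

print(has_cycle(buffer_dependencies(up_down)))
print(has_cycle(buffer_dependencies(up_down + bounce)))
```

Moving a packet to a different lossless queue when it leaves its expected path amounts to giving the bounced hop a separate node in this graph, which is one way to read the strategy bullet above.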

NIC PFC pause frame storm

• A malfunctioning NIC may block the whole network
• PFC pause frame storms caused several incidents
• Solution: watchdogs at both NIC and switch sides to stop the storm

[Figure: two podsets (Podset 0 and Podset 1) with spine layer, leaf layer, ToRs, and servers 0 to 7; one server has a malfunctioning NIC]
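A watchdog of the kind mentioned in the solution bullet could look like the following Python sketch. The slide does not describe the mechanism, so everything here is an assumption made for illustration: the class name `PauseWatchdog`, the millisecond threshold, and the reaction of turning off lossless mode on the affected port.

```python
class PauseWatchdog:
    """Toy watchdog for one port (assumed policy, not the deployed one)."""

    def __init__(self, max_paused_ms):
        self.max_paused_ms = max_paused_ms
        self.paused_since = None        # time the current pause streak began
        self.lossless_enabled = True

    def observe(self, now_ms, is_paused):
        """Record one sample; return whether lossless mode is still enabled."""
        if not is_paused:
            self.paused_since = None    # streak broken, reset the timer
        elif self.paused_since is None:
            self.paused_since = now_ms  # streak starts now
        elif now_ms - self.paused_since >= self.max_paused_ms:
            self.lossless_enabled = False   # storm suspected: stop honoring pauses
        return self.lossless_enabled


wd = PauseWatchdog(max_paused_ms=100)
for t, paused in [(0, True), (50, True), (100, True)]:
    print(t, wd.observe(t, paused))
```

Short pauses are normal under PFC, so the watchdog only reacts to a continuous pause that outlasts the threshold.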

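The slow-receiver symptom

• ToR to NIC is 40 Gb/s, NIC to server is 64 Gb/s
• But NICs may generate a large number of PFC pause frames
• Root cause: NIC is resource constrained
• Mitigation
  • Large page size for the MTT (memory translation table) entry
  • Dynamic buffer sharing at the ToR

[Figure: a server whose NIC (labeled with MTT, WQEs, and QPC) connects to the CPU and DRAM over PCIe Gen3 8x8 at 64 Gb/s and to the ToR over QSFP at 40 Gb/s; pause frames go from the NIC to the ToR]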
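The effect of the page-size mitigation can be shown with a short calculation in Python. The numbers are hypothetical and not from the slides: 4 GiB of registered memory, and page sizes of 4 KiB versus 2 MiB. The helper name `mtt_entries` is also an assumption.

```python
def mtt_entries(registered_bytes, page_bytes):
    """Number of translation entries needed to cover the registered memory."""
    return -(-registered_bytes // page_bytes)   # ceiling division


GIB = 1 << 30
small_pages = mtt_entries(4 * GIB, 4 * 1024)          # 4 KiB pages
large_pages = mtt_entries(4 * GIB, 2 * 1024 * 1024)   # 2 MiB pages

print(small_pages, large_pages, small_pages // large_pages)
```

With fewer entries, far more of the table fits in the NIC's limited on-board memory, so the NIC stalls less often on translation lookups and has less reason to send pause frames.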

Deployment experiences and lessons learned

Latency reduction

• RoCEv2 deployed in Bing world-wide for two and a half years
• Significant latency reduction
• Incast problem solved as no packet drops

RDMA throughput

• Using two podsets, each with 500+ servers
• 5 Tb/s capacity between the two podsets
• Achieved 3 Tb/s inter-podset throughput
• Bottlenecked by ECMP routing
• Close to 0 CPU overhead
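The slide does not explain how ECMP limits throughput to 3 Tb/s of the 5 Tb/s capacity (60%). One common reason is that per-flow hashing spreads flows unevenly across equal-cost links, which the Python sketch below illustrates. It is a toy model under stated assumptions: SHA-256 stands in for whatever hash a real switch uses, and the flow tuples, link count, and function names are hypothetical.

```python
import hashlib
from collections import Counter


def ecmp_link(flow, n_links):
    """Pick a link by hashing the flow tuple (stand-in for a switch hash)."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_links


def link_loads(flows, n_links):
    """Count how many flows land on each link."""
    counts = Counter(ecmp_link(flow, n_links) for flow in flows)
    return [counts.get(i, 0) for i in range(n_links)]


# 64 hypothetical equal-rate flows: (src IP, dst IP, protocol, src port, dst port).
flows = [("10.0.0.%d" % i, "10.0.1.%d" % i, 17, 49152 + i, 4791) for i in range(64)]
loads = link_loads(flows, n_links=8)
mean = sum(loads) / len(loads)

print(loads)
print("busiest link carries %.2fx the mean load" % (max(loads) / mean))
```

When all flows send at the same rate, the busiest link saturates first and caps every flow hashed onto it, so aggregate throughput falls short of the sum of link capacities whenever the spread is uneven.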

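Latency and throughput tradeoff

• RDMA latencies increase as data shuffling started
• Low latency vs high throughput

[Figure: topology with leaf switches (L0, L1), ToR switches (T0, T1), and servers S0,0 to S0,23 and S1,0 to S1,23; plot of RDMA latency in us before and during data shuffling]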

Lessons learned

• Providing lossless is hard!
  • Deadlock, livelock, PFC pause frames propagation and storm did happen
• Be prepared for the unexpected
  • Configuration management, latency/availability, PFC pause frame, RDMA traffic monitoring
• NICs are the key to make RoCEv2 work

What’s next?

[Figure: topics arranged around four areas: Applications, Technologies, Architectures, Protocols]

• RDMA for X (Search, Storage, HFT, DNN, etc.)
• Lossy vs lossless network
• Practical, large-scale deadlock free network
• RDMA programming
• RDMA for heterogeneous computing systems
• RDMA virtualization
• Reducing collateral damage
• RDMA security
• Software vs hardware
• Inter-DC RDMA

Will software win (again)?

• Historically, software based packet processing won (multiple times)
  • TCP processing overhead analysis by David Clark, et al.
  • None of the stateful TCP offloading took off (e.g., TCP Chimney)
• The story is different this time
  • Moore’s law is ending
  • Accelerators are coming
  • Network speed keeps increasing
  • Demands for ultra low latency are real

Is lossless mandatory for RDMA?

• There is no binding between RDMA and lossless network
• But implementing more sophisticated transport protocol in hardware is a challenge

RDMA virtualization for the container networking

• A router acts as a proxy for the containers
• Shared memory for improved performance
• Zero copy possible

[Figure: FreeFlow architecture. Host 1 runs Container 1 (IP: 1.1.1.1) and Container 2 (IP: 2.2.2.2); Host 2 runs Container 3 (IP: 3.3.3.3). Each container holds an Application, NetAPI, FreeFlow NetLib, and vNIC. Other labeled components: FreeFlow Router, Control Agent, IPC Channel, Shared Memory Space, Phy NIC, RDMA, Host Network, and FreeFlow NetOrchestrator]

RDMA for DNN

• TCP does not work for distributed DNN training
• For 16-GPU, 2-host speech training with CNTK, TCP communications dominate the training time (72%); RDMA is much faster (44%)

RDMA Programming

• How many LOC for a “hello world” communication using RDMA?
  • For TCP, it is 60 LOC for client or server code
  • For RDMA, it is complicated…
• IB Verbs: 600 LOC
• RDMA CM: 300 LOC
• Rsocket: 60 LOC

RDMA Programming

• Make RDMA programming more accessible
  • Easy-to-setup RDMA server and switch configurations
  • Can I run and debug my RDMA code on my desktop/laptop?
  • High quality code samples
• Loosely coupled vs tightly coupled (Send/Recv vs Write/Read)

Summary: RDMA for data centers!

• RDMA is experiencing a renaissance in data centers
• RoCEv2 has been running safely in Microsoft data centers for two and a half years
• Many opportunities and interesting problems for high-speed, low-latency RDMA networking
• Many opportunities in making RDMA accessible to more developers

Acknowledgement

• Yan Cai, Gang Cheng, Zhong Deng, Daniel Firestone, Juncheng Gu, Shuihai Hu, Hongqiang Liu, Marina Lipshteyn, Ali Monfared, Jitendra Padhye, Gaurav Soni, Haitao Wu, Jianxi Ye, Yibo Zhu
• Azure, Bing, CNTK, Philly collaborators
• Arista Networks, Cisco, Dell, Mellanox partners

Questions?