
6.888: Lecture 4
Data Center Load Balancing
Mohammad Alizadeh
Spring 2016

Motivation

DC networks need large bisection bandwidth for distributed apps (big data, HPC, web services, etc.)

• Single-rooted tree (Core / Agg / Access, 1000s of server ports): high oversubscription
• Multi-rooted tree [Fat-tree, Leaf-Spine, …] (Spine / Leaf, 1000s of server ports): full bisection bandwidth, achieved via multipathing


Multi-rooted ≠ Ideal DC Network

Ideal DC network: one big output-queued switch (1000s of server ports)
• No internal bottlenecks → predictable
• Simplifies BW management [BW guarantees, QoS, …]

We can't build such a switch. A multi-rooted tree approximates it, but it has possible internal bottlenecks → need efficient load balancing.

Today: ECMP Load Balancing

Pick among equal-cost paths by a hash of the 5-tuple (e.g., path = H(f) mod 3)
• Randomized load balancing
• Preserves packet order

Problems:
– Hash collisions (coarse granularity)
– Local & stateless (bad with asymmetry; e.g., due to link failures)
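A minimal sketch of this per-flow hashing (the hash function and field layout here are illustrative assumptions; real switches do this in hardware):

```python
import hashlib

def ecmp_path(flow, num_paths):
    """All packets of a flow hash to one path, so packet order is
    preserved -- but two big flows can collide on the same path."""
    src_ip, dst_ip, src_port, dst_port, proto = flow
    key = f"{src_ip}:{dst_ip}:{src_port}:{dst_port}:{proto}".encode()
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % num_paths

f1 = ("10.0.0.1", "10.0.1.1", 5001, 80, "tcp")
f2 = ("10.0.0.2", "10.0.1.2", 5002, 80, "tcp")
print(ecmp_path(f1, 3), ecmp_path(f2, 3))  # same index = a hash collision
```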

Solution Landscape

In-Network
• Centralized [Hedera, Planck, Fastpass, …]
• Distributed
  – Cong. Oblivious [ECMP, WCMP, packet-spray, …]
  – Cong. Aware [Flare, TeXCP, CONGA, DeTail, HULA, …]

Host-Based
• Cong. Oblivious [Presto]
• Cong. Aware [MPTCP, FlowBender, …]

MPTCP

• Slides by Damon Wischik (with minor modifications)

What problem is MPTCP trying to solve?

Multipath "pools" links: two separate links become one pool of links.

TCP controls how a link is shared. How should a pool be shared?

Application: Multihomed web server

Two 100Mb/s links:
• 2 TCPs @ 50Mb/s
• 4 TCPs @ 25Mb/s

Application: Multihomed web server

Two 100Mb/s links:
• 2 TCPs @ 33Mb/s
• 1 MPTCP @ 33Mb/s
• 4 TCPs @ 25Mb/s

Application: Multihomed web server

Two 100Mb/s links:
• 2 TCPs @ 25Mb/s
• 2 MPTCPs @ 25Mb/s
• 4 TCPs @ 25Mb/s

The total capacity, 200Mb/s, is shared out evenly between all 8 flows.

Application: Multihomed web server

Two 100Mb/s links:
• 2 TCPs @ 22Mb/s
• 3 MPTCPs @ 22Mb/s
• 4 TCPs @ 22Mb/s

The total capacity, 200Mb/s, is shared out evenly between all 9 flows. It's as if they were all sharing a single 200Mb/s link; the two links can be said to form a 200Mb/s pool.

Application: Multihomed web server

Two 100Mb/s links:
• 2 TCPs @ 20Mb/s
• 4 MPTCPs @ 20Mb/s
• 4 TCPs @ 20Mb/s

The total capacity, 200Mb/s, is shared out evenly between all 10 flows. It's as if they were all sharing a single 200Mb/s link; the two links can be said to form a 200Mb/s pool.

Application: WiFi & cellular together

How should your phone balance its traffic across very different paths?
• WiFi path: high loss, small RTT
• 3G path: low loss, high RTT

Application: Datacenters

Can we make the network behave like a large pool of capacity?

MPTCP is a general-purpose multipath replacement for TCP.

What is the MPTCP protocol?

MPTCP is a replacement for TCP which lets you use multiple paths simultaneously.

[Figure: the user-space socket API is unchanged; beneath it, MPTCP runs one TCP subflow per path (addr1, addr2 → addr).]

The sender stripes packets across paths. The receiver puts the packets in the correct order.

What is the MPTCP protocol?

MPTCP is a replacement for TCP which lets you use multiple paths simultaneously.

[Figure: the same stack, but with a single address per host; the two subflows use ports p1 and p2 and traverse a switch with port-based routing.]

The sender stripes packets across paths. The receiver puts the packets in the correct order.

Design goal 1: Multipath TCP should be fair to regular TCP at shared bottlenecks

To be fair, Multipath TCP should take as much capacity as TCP at a bottleneck link, no matter how many paths it is using.

Strawman solution: run "½ TCP" on each path.

[Figure: a multipath TCP flow with two subflows sharing a bottleneck link with a regular TCP flow.]

Design goal 2: MPTCP should use efficient paths

[Figure: three flows in a triangle of 12Mb/s links.] Each flow has a choice of a 1-hop and a 2-hop path. How should it split its traffic?

Design goal 2: MPTCP should use efficient paths

If each flow split its traffic 1:1 between its 1-hop and 2-hop paths, each flow would get 8Mb/s on the 12Mb/s links.

Design goal 2: MPTCP should use efficient paths

If each flow split its traffic 2:1, each flow would get 9Mb/s.

Design goal 2: MPTCP should use efficient paths

If each flow split its traffic 4:1, each flow would get 10Mb/s.

Design goal 2: MPTCP should use efficient paths

If each flow split its traffic ∞:1 (everything on the 1-hop path), each flow would get 12Mb/s, fully using the 12Mb/s links.

Design goal 2: MPTCP should use efficient paths

Theoretical solution (Kelly + Voice 2005; Han, Towsley et al. 2006): MPTCP should send all its traffic on its least-congested paths.

Theorem. This will lead to the most efficient allocation possible, given a network topology and a set of available paths.
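The progression 8 → 9 → 10 → 12 Mb/s checks out directly. Assume by symmetry each flow sends fraction f of its rate x on its 1-hop path; each 12Mb/s link then carries f·x from its own flow plus (1−f)·x from each of the two other flows' 2-hop paths, so x(2−f) = 12:

```python
# Per-flow throughput x = 12 / (2 - f) in the 12Mb/s triangle,
# where f is the fraction of traffic sent on the 1-hop path.
for label, f in [("1:1", 1/2), ("2:1", 2/3), ("4:1", 4/5), ("inf:1", 1.0)]:
    print(label, 12 / (2 - f), "Mb/s")  # 8, 9, 10, 12 -- matching the slides
```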

Design goal 3: MPTCP should be fair compared to TCP

Design goal 2 says to send all your traffic on the least-congested path, in this case 3G. But this has high RTT, hence it will give low throughput.
• WiFi path: high loss, small RTT
• 3G path: low loss, high RTT

Goal 3a. A Multipath TCP user should get at least as much throughput as a single-path TCP would on the best of the available paths.

Goal 3b. A Multipath TCP flow should take no more capacity on any link than a single-path TCP would.

Design goals

Goal 1. Be fair to TCP at bottleneck links
Goal 2. Use efficient paths…
Goal 3. …as much as we can, while being fair to TCP
Goal 4. Adapt quickly when congestion changes
Goal 5. Don't oscillate

How does MPTCP achieve all this? Read: "Design, Implementation and Evaluation of Congestion Control for Multipath TCP", NSDI 2011.

How does TCP congestion control work?

Maintain a congestion window w.
• Increase w for each ACK, by 1/w
• Decrease w for each drop, by w/2
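A minimal sketch of those two rules (window in packets; slow start and timeouts omitted):

```python
def on_ack(w):
    # Additive increase: +1/w per ACK, i.e. roughly +1 packet per RTT.
    return w + 1.0 / w

def on_drop(w):
    # Multiplicative decrease: halve the window on a loss.
    return w / 2.0
```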

How does MPTCP congestion control work?

Maintain a congestion window w_r, one window for each path, where r ∈ R ranges over the set of available paths.
• Increase w_r for each ACK on path r, by min(a / Σ_{r'∈R} w_{r'}, 1/w_r) (the coupled increase of the NSDI 2011 paper; the parameter a controls aggressiveness)
• Decrease w_r for each drop on path r, by w_r/2
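A sketch of the coupled control, assuming the linked-increases form from the NSDI 2011 paper; computing the constant `a` from window sizes and RTTs is spelled out in the paper, so it is a plain parameter here:

```python
def mptcp_on_ack(w, r, a):
    """ACK on path r: increase w[r] by min(a / total, 1 / w[r]).
    The 1/w[r] cap keeps any one subflow no more aggressive than TCP."""
    total = sum(w.values())
    w[r] += min(a / total, 1.0 / w[r])

def mptcp_on_drop(w, r):
    """Drop on path r: halve that path's window, as in TCP."""
    w[r] /= 2.0

# Two subflows: drops on one path shift traffic toward the other.
w = {"wifi": 10.0, "3g": 4.0}
mptcp_on_ack(w, "wifi", a=0.5)
mptcp_on_drop(w, "3g")
```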

Discussion


What You Said

Ravi: "An interesting point in the MPTCP paper is that they target a 'sweet spot' where there is a fair amount of traffic but the core is neither overloaded nor underloaded."

What You Said

Hongzi: "The paper talked a bit of 'probing' to see if a link has high load and pick some other links, and there are some specific ways of assigning randomized assignment of subflows on links. I was wondering, does the 'power of 2 choices' have some role to play here?"

MPTCP discovers available capacity, and it doesn't need much path choice.

If each node-pair balances its traffic over 8 paths, chosen at random, then utilization is around 90% of optimal.

[Figure: throughput (% of optimal) vs. number of paths, for FatTrees of 128 and 8192 nodes. Simulations of FatTree, 100Mb/s links, permutation traffic matrix, one flow per host, TCP+ECMP versus MPTCP.]

MPTCP discovers available capacity, and it shares it out more fairly than TCP+ECMP.

[Figure: throughput (% of optimal) vs. flow rank, for FatTrees of 128 and 8192 nodes. Simulations of FatTree, 100Mb/s links, permutation traffic matrix, one flow per host, TCP+ECMP versus MPTCP.]

MPTCP can make good path choices, as good as a very fast centralized scheduler.

Simulation of FatTree with 128 hosts:
• Permutation traffic matrix
• Closed-loop flow arrivals (one flow finishes, another starts)
• Flow size distributions from VL2 dataset

[Figure: throughput (% of optimal), Hedera first-fit heuristic versus MPTCP.]

MPTCP permits flexible topologies

Because an MPTCP flow shifts its traffic onto its least-congested paths, congestion hotspots are made to "diffuse" throughout the network. Non-adaptive congestion control, on the other hand, does not cope well with non-homogeneous topologies.

[Figure: average throughput (% of optimal) vs. rank of flow. Simulation of a 128-node FatTree when one of the 1Gb/s core links is cut to 100Mb/s.]

MPTCP permits flexible topologies

• At low loads, there are few collisions, and NICs are saturated, so TCP ≈ MPTCP
• At high loads, the core is severely congested, and TCP can fully exploit all the core links, so TCP ≈ MPTCP
• When the core is "right-provisioned", i.e. just saturated, MPTCP > TCP

[Figure: ratio of throughputs, MPTCP/TCP, vs. connections per host. Simulation of a FatTree-like topology with 512 nodes, but with 4 hosts for every up-link from a top-of-rack switch, i.e. the core is oversubscribed 4:1.]
• Permutation TM: each host sends to one other, each host receives from one other
• Random TM: each host sends to one other, each host may receive from any number

MPTCP permits flexible topologies

If only 50% of hosts are active, you'd like each host to be able to send at 2Gb/s, faster than one NIC can support.

[Figure: FatTree vs. dual-homed FatTree, each with 5 ports per host in total and 1Gb/s bisection bandwidth.]

Presto

• Adapted from slides by Keqiang He (Wisconsin)

Solution Landscape

In-Network
• Centralized [Hedera, Planck, Fastpass, …]
• Distributed
  – Cong. Oblivious [ECMP, WCMP, packet-spray, …]
  – Cong. Aware [Flare, TeXCP, CONGA, HULA, DeTail, …]

Host-Based
• Cong. Oblivious [Presto]
• Cong. Aware [MPTCP, FlowBender, …]

Is congestion-aware load balancing overkill for datacenters?

We've already seen a congestion-oblivious load balancing scheme: ECMP. Presto is fine-grained congestion-oblivious load balancing.

The key challenge is making this practical.

Key Design Decisions

Use the software edge
• No changes to transport (e.g., inside VMs) or switches

LB granularity: flowcells [e.g., 64KB of data]
• Works with TSO hardware offload
• No reordering for mice
• Makes dealing with reordering simpler (Why?)

"Fix" reordering at the GRO layer
• Avoid high per-packet processing (esp. at 10Gb/s and above)

End-to-end path control

Presto at a High Level

[Figure: leaf-spine fabric; at each host, TCP/IP sits above a vSwitch, which connects to the NIC.]

• Near uniform-sized data units
• Proactively distributed evenly over the symmetric network by the vSwitch sender
• Receiver masks packet reordering due to multipathing below the transport layer

Discussion


What You Said

Arman: "From an (information) theoretic perspective, order should not be such a troubling phenomenon, yet in real networks ordering is so important. How practical are 'rateless codes' (network coding, raptor codes, etc.) in alleviating this problem?"

Amy: "The main complexities in the paper stem from the requirement that servers use TSO and GRO to achieve high throughput. It is surprising to me that people still rely so heavily on TSO and GRO. Why doesn't someone build a multi-core TCP stack that can process individual packets at line rate in software?"

Presto LB Granularity

Presto: load-balance on flowcells. What is a flowcell?
– A set of TCP segments with a bounded byte count
– The bound is the maximal TCP Segmentation Offload (TSO) size
• Maximizes the benefit of TSO for high speed
• 64KB in the implementation

What's TSO? [Figure: TCP/IP hands a large segment to the NIC, which performs segmentation & checksum offload and emits MTU-sized Ethernet frames.]

Presto LB Granularity: Example

TCP segments of 25KB, 30KB, and 30KB arrive in order. Starting from the first segment, the flowcell closes at 55KB (25KB + 30KB): adding the next 30KB segment would exceed the 64KB bound, so that segment starts a new flowcell.
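A toy sketch combining the flowcell bound with Presto's even spreading (helper names are made up; the real logic lives in the vSwitch):

```python
import itertools

MAX_FLOWCELL = 64 * 1024  # bound = maximal TSO size (64KB in the implementation)

def assign_flowcells(segment_sizes, paths):
    """Group TCP segments into <=64KB flowcells and spread flowcells
    over paths round-robin, with no congestion feedback."""
    rr = itertools.cycle(paths)
    path, cell_bytes = next(rr), 0
    for size in segment_sizes:
        if cell_bytes + size > MAX_FLOWCELL:   # close the current flowcell
            path, cell_bytes = next(rr), 0
        cell_bytes += size
        yield size, path

# The slide's example: 25KB + 30KB = 55KB flowcell; the next 30KB starts a new one.
print(list(assign_flowcells([25 * 1024, 30 * 1024, 30 * 1024], ["path0", "path1"])))
```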

Intro to GRO

Generic Receive Offload (GRO)
– The reverse process of TSO

[Figure: hardware/OS boundary; the NIC (hardware) delivers MTU-sized packets to GRO, which feeds TCP/IP (OS).]

MTU-sized packets P1…P5 sit at the queue head and are merged one by one: P1, then P1–P2, P1–P3, P1–P4, and finally P1–P5. The large merged TCP segment is pushed up at the end of a batched IO event (i.e., a polling event).

Merging packets in GRO creates fewer segments & avoids using substantially more cycles at TCP/IP and above [Menon, ATC'08]. If GRO is disabled: ~6Gbps with 100% CPU usage of one core.
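A toy model of this merge loop, assuming packets are identified by consecutive sequence indices (real GRO matches TCP/IP headers and MSS limits):

```python
def gro_batch(packets, merge_limit=45):
    """Merge consecutive packets into segments; push the current segment
    up to TCP/IP on a sequence gap, on reaching the merge limit, or at
    the end of the polling event (modeling GRO's timeout)."""
    segments, current = [], []
    for p in packets:
        if current and p != current[-1] + 1:  # gap in sequence number
            segments.append(current)
            current = []
        current.append(p)
        if len(current) >= merge_limit:       # merged segment reached limit
            segments.append(current)
            current = []
    if current:                               # end of batched IO event
        segments.append(current)
    return segments

print(gro_batch([1, 2, 3, 4, 5]))  # [[1, 2, 3, 4, 5]]: one large segment pushed up
```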

Reordering Challenges

Packets arrive at the NIC out of order: P1 P2 P3 P6 P4 P7 P5 P8 P9.

GRO merges P1–P3, but then P6 arrives. GRO is designed to be fast and simple; it pushes up the existing segment immediately when (1) there is a gap in sequence number, (2) the MSS is reached, or (3) a timeout fires. So P1–P3 is pushed up to TCP/IP as soon as the gap at P6 appears.

The pattern repeats: P6 is pushed up when P4 arrives, P4 when P7 arrives, P7 when P5 arrives, and P5 when P8 arrives; only P8–P9 merge. TCP/IP ends up receiving P1–P3, P6, P4, P7, P5, P8–P9.

GRO is effectively disabled: lots of small packets are pushed up to TCP/IP.
• Huge CPU processing overhead
• Poor TCP performance due to massive reordering
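Feeding the slide's out-of-order arrival into the toy `gro_batch` model above reproduces the breakdown:

```python
print(gro_batch([1, 2, 3, 6, 4, 7, 5, 8, 9]))
# [[1, 2, 3], [6], [4], [7], [5], [8, 9]]: six segments instead of one,
# so GRO's merging is effectively disabled.
```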

Handling Asymmetry

[Figure: a leaf-spine topology in which every link is 40G.]

Handling asymmetry optimally needs traffic awareness.

[Figure: the same topology, but one path carries 30G of UDP cross-traffic while a 40G TCP flow is balanced across both paths. The traffic-aware split sends 35G on the clean path and 5G on the shared path, equalizing both at 35G and letting the TCP flow reach its full 40G.]
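A quick check of those numbers under this reading (two 40G paths; 30G of UDP pinned to one of them; a 40G TCP flow split across both):

```python
UDP = 30  # Gb/s of UDP cross-traffic on the shared 40G path

def loads(tcp_clean, tcp_shared):
    """Offered load on (clean path, shared path), each with 40G capacity."""
    return tcp_clean, tcp_shared + UDP

print(loads(20, 20))  # (20, 50): an oblivious 50/50 split overloads the shared path
print(loads(35, 5))   # (35, 35): the traffic-aware split keeps both paths under 40G
```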