Passive realtime datacenter fault detection and localization
Arjun Roy, James Hongyi Zeng*, Jasmeet Bagga*, and Alex C. Snoeren
University of California, San Diego; Facebook*
1
“It would be nice if we could figure out which link was causing these retransmits.”
- Ranjeeth Dasineni, Facebook (paraphrased)
2
Contemporary datacenter network
However: faults may be partial/intermittent.
3
Partial faults: A few examples
• NetPilot (SIGCOMM 2011): Frame check error, unequal ECMP hashing, etc.
  Wu, Xin, et al. "NetPilot: automating datacenter network failure mitigation." ACM SIGCOMM Computer Communication Review 42.4 (2012): 419-430.
• Everflow (SIGCOMM 2015): TCAM bit errors, silent packet drops.
  Zhu, Yibo, et al. "Packet-Level Telemetry in Large Datacenter Networks." SIGCOMM, 2015.
• Pingmesh (SIGCOMM 2015): “fiber FCS … errors, switching ASIC defects, switch fabric flaw, switch software bug, NIC configuration issue, network congestions, etc. We have seen all these types of issues in our production networks.”
  Guo, Chuanxiong, et al. "Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis." SIGCOMM, 2015.
4
Vast body of prior work (just a small sample…)
• Application instrumentation: various production systems
• Active probing: Pingmesh (SIGCOMM ’15), NetNORAD (Facebook), ATPG (CoNEXT ’12), Everflow (SIGCOMM ’15)
• Machine learning: NetPoirot (SIGCOMM ’16)
• Graph algorithms: Gestalt (USENIX ATC ’14), SCORE (NSDI ’05)
• Path tracing: Everflow (SIGCOMM ’15), NetNORAD (Facebook), NetSight (NSDI ’14), Tiny Packet Programs (SIGCOMM ’14)
• Network instrumentation: FlowRadar (NSDI ’16), Planck (SIGCOMM ’14), NetPilot (SIGCOMM ’11)
5
We exploit: highly regular load balanced traffic
[Figure: traffic magnitude by source rack and destination rack.]
Arjun Roy, Hongyi Zeng, Jasmeet Bagga, George Porter, and Alex C. Snoeren. Inside the Social Network's (Datacenter) Network. ACM SIGCOMM '15, London, England.
6
Load balanced traffic simplifies fault handling
• Evenly loaded paths mean per-path performance is similar if there are no errors.
• Network faults lead to outlier paths.
• If a flow's network path is known, we can correlate flow performance with path.
• This approach allows us to find and localize faults:
  • In an application-agnostic manner
  • Incurring no additional probing overhead
  • More rapidly than prior published works
7
Facebook datacenter topology
Alexey Andreyev. Introducing data center fabric, the next-generation Facebook data center network. https://code.facebook.com/posts/360346274145943/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/
8
Finding path information at Facebook
[Figure: network diagram with source host, destination host, two ToRs, Agg switches, and Core switches.]
9
Finding path information at Facebook
[Figure: network diagram with source host, destination host, two ToRs, Agg switches, and Core switches.]
Solution: aggregation switch marks packets based on core downlink traversed.
17
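As a rough illustration of how a receiving host could use such a mark, the sketch below decodes the DSCP field of a packet into a link identifier (a later slide mentions marking the packet DSCP). DSCP occupies the upper six bits of the IP TOS / traffic class byte; the table `DSCP_TO_CORE_DOWNLINK`, its values, and the function names are hypothetical and not taken from the talk.

```python
from typing import Optional

# Hypothetical mapping from DSCP code point to the core downlink that the
# aggregation switch reports the packet traversed (values are made up).
DSCP_TO_CORE_DOWNLINK = {
    1: "core-downlink-1",
    2: "core-downlink-2",
    3: "core-downlink-3",
    4: "core-downlink-4",
}

def dscp_from_tos(tos_byte: int) -> int:
    """Extract DSCP: the upper 6 bits of the TOS / traffic class byte."""
    if not 0 <= tos_byte <= 0xFF:
        raise ValueError("TOS must fit in one byte")
    return tos_byte >> 2  # the lower 2 bits are ECN and are discarded

def link_for_packet(tos_byte: int) -> Optional[str]:
    """Return the marked link, or None if the packet carries no known mark."""
    return DSCP_TO_CORE_DOWNLINK.get(dscp_from_tos(tos_byte))

marked = link_for_packet(0b00001011)  # DSCP 2 with both ECN bits set
unmarked = link_for_packet(0)         # DSCP 0: no mark in this sketch
```

Keeping the ECN bits out of the lookup means the mark does not interfere with congestion signalling carried in the same byte.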
How do we use path information?
• In principle: can compare flow performance by path.
  1. Combinatorial disaster: O(10,000) paths from a single host to remote racks.
  2. No localization: doesn’t tell us which link/switch is at fault.
• But: for this traffic pattern, ECMP routing gives us even bytes/link.
• Solution: Just compare links!
Create “Equivalence Sets”: sets of links handling similar load and exhibiting similar performance, in the absence of faults.
Equivalence sets:
1. Reduce the number of comparisons needed.
2. Pinpoint the fault to a specific location.
18
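One way to picture equivalence sets is as a grouping of links by where they start and which layer they lead to. The sketch below is a minimal illustration under that assumption; the `Link` type, the switch names, and the grouping key are hypothetical and are not the authors' implementation.

```python
from collections import defaultdict
from typing import NamedTuple

class Link(NamedTuple):
    src: str        # switch the link leaves from (hypothetical names)
    dst: str        # switch the link arrives at
    dst_layer: str  # layer of the far end, e.g. "agg" or "core"

def equivalence_sets(links):
    """Group links that leave the same switch toward the same layer."""
    sets = defaultdict(list)
    for link in links:
        sets[(link.src, link.dst_layer)].append(link)
    return dict(sets)

# Toy topology: one ToR with 4 Agg uplinks, one Agg with 3 Core uplinks.
links = [Link("tor1", f"agg{i}", "agg") for i in range(1, 5)]
links += [Link("agg1", f"core{i}", "core") for i in range(1, 4)]
sets = equivalence_sets(links)
```

Comparisons then happen only among the members of one set, so a deviating member names a specific link rather than a whole path.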
Equivalence sets in Facebook topology
[Figure: topology diagram with source host, ToR, Agg switches, and Core switches.]
Equivalence set: 4 uplinks from each ToR to pod Agg layer
…each has close to identical performance distribution in absence of errors
19
Equivalence sets in Facebook topology
[Figure: topology diagram with source host, ToR, Agg switches, and Core switches.]
Equivalence set: N uplinks from pod Agg layer to core layer
…each has close to identical performance distribution in absence of errors
20
Outlier analysis with application-agnostic metrics
Hosts already track metrics for congestion control or performance monitoring:
• TCP congestion window: affected by packet loss.
• TCP retransmits: affected by packet loss.
• Smoothed round-trip time: affected by latency spikes.
• System call latency: affected by packet loss.
Caveat: It can be difficult to determine if an effect is due to a faulty link, overloaded hosts, application variance, etc.
With equivalence-set-based grouping, we can compare distributions by link.
Only link faults cause variance between links.
21
Demonstrating equivalence sets from Agg to ToR
(1) ToR marks packet DSCP per inbound link
(2) Host aggregates TCP metrics by link
(3a) We simulate an error on one link:
(3b) Host drops 0.5% of packets traversing that link
[Figure: Host, ToR, and Agg 1 through Agg 4.]
22
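The demonstration above can be mimicked with a toy model: flows are spread over four Agg to ToR links, one link drops 0.5% of packets, and the host groups a per-flow retransmit count by link. This is only a sketch under simplifying assumptions (each dropped packet counts as exactly one retransmit, healthy links never drop, no real TCP behaviour is modelled); the function name, flow counts, and seed are hypothetical.

```python
import random
from collections import defaultdict
from statistics import mean

def simulate_flows(links, faulty_link, drop_rate,
                   flows_per_link=500, packets_per_flow=100, seed=0):
    """Return {link: [retransmits per flow]} for a toy loss model."""
    rng = random.Random(seed)
    per_link = defaultdict(list)
    for link in links:
        # Only the faulty link drops packets in this simplified model.
        p = drop_rate if link == faulty_link else 0.0
        for _ in range(flows_per_link):
            drops = sum(rng.random() < p for _ in range(packets_per_flow))
            per_link[link].append(drops)
    return dict(per_link)

links = ["agg1", "agg2", "agg3", "agg4"]
metrics = simulate_flows(links, faulty_link="agg3", drop_rate=0.005)
means = {link: mean(values) for link, values in metrics.items()}
worst = max(means, key=means.get)  # link with the most retransmits per flow
```

Because all four links carry the same number of flows, any difference in the per-link means comes from the injected drops alone.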
TCP congestion window in Agg to ToR equivalence set
[Figure panel: Cache server]
23
Congestion window signal is application agnostic
[Figure panels: Cache server, Web server]
24
We use: TCP retransmits in our work
[Figure panels: Cache server, Web server]
25
Detecting faults in production
• Monitored traffic through pod aggregation switch.
  1. No faults injected.
  2. Collected TCP metric data on 30 web server hosts.
  3. Equivalence set: four linecards connecting to core layer (each linecard has equal share of uplinks).
• On January 25th, a single linecard had a software fault.
  1. Linecard controller software hung.
  2. BGP routes timed out, production traffic through linecard routed away.
  3. A few minutes later, NetNORAD flagged unresponsive linecard.
26
Fault visible to our approach in 30 seconds
27
Classifying faulty links
• “Does this link have more retransmits per flow than the other links?”
• “Do two distributions have the same mean, or is one greater?”
Classifier: compare each link to other links with one-sample Student’s t-test.
28
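A minimal sketch of such a classifier follows. It assumes the "population mean" for the one-sample test is the pooled mean of all other links in the equivalence set, and it approximates the p-value with a normal distribution instead of Student's t distribution, which is only reasonable for large samples. Both choices, the function names, and the sample data are assumptions of this sketch, not details from the talk.

```python
from math import copysign, inf, sqrt
from statistics import NormalDist, mean, stdev

def one_sample_t(sample, population_mean):
    """Return (t_statistic, approximate two-sided p-value)."""
    n = len(sample)
    diff = mean(sample) - population_mean
    s = stdev(sample)
    if s == 0:
        # Degenerate sample: every flow saw the same value.
        return (0.0, 1.0) if diff == 0 else (copysign(inf, diff), 0.0)
    t = diff / (s / sqrt(n))
    # Normal approximation to the t distribution (large-sample assumption).
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p

def link_t_test(per_link, link):
    """Compare one link's per-flow retransmits with the mean of the others."""
    others = [x for name, xs in per_link.items() if name != link for x in xs]
    return one_sample_t(per_link[link], mean(others))

healthy = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0] * 5
per_link = {
    "agg1": healthy, "agg2": healthy, "agg4": healthy,
    "agg3": [2, 3, 1, 4, 2, 3, 2, 1, 3, 2] * 5,  # made-up faulty link
}
t3, p3 = link_t_test(per_link, "agg3")
t1, p1 = link_t_test(per_link, "agg1")
```

In this toy data the faulty link gets a large positive t-statistic. A healthy link also differs from the pooled mean of the others, because the faulty link inflates that mean, but its t-statistic is negative, so the sign matters when reading the result.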
Online fault monitoring with t-test alone
• In principle: can set up a system that uses the end-host t-test result to tell us which network links are faulty.
• However: by itself this is susceptible to false positives.
• Can’t afford false positives in a network with O(10,000) links!
29
Accounting for false positives
• However, two characteristics aid us:
  1. Per-host false positives are evenly distributed per link over time.
  2. The datacenter has a plethora of hosts for which this is true.
• Thus, we’re not trying to see if a given link is marked faulty by hosts.
• Instead, we once again perform outlier analysis.
  1. “Are all the links being marked faulty by hosts at similar rates?”
  2. “Are hosts flagging a particular subset of links as faulty at higher rates?”
Chi-squared test: determines if any links are outliers.
P-value ≈ 1: “Yes, all the links are being marked faulty by hosts at similar rates.”
P-value ≈ 0: “No, a subset has a comparatively high percentage of hosts claiming fault.”
30
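The outlier check across hosts can be sketched as a chi-squared goodness-of-fit test against the hypothesis that every link is flagged equally often. The code below is an illustration under that assumption; the function names and the counts are hypothetical, and the p-value helper uses a plain series expansion that is adequate only for the moderate statistics shown here.

```python
from math import exp, gamma

def chi2_sf(x, df):
    """P(X >= x) for a chi-squared variable with df degrees of freedom.

    Uses the series for the regularized lower incomplete gamma function.
    """
    if x <= 0:
        return 1.0
    a, y = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-15 * total:
        n += 1
        term *= y / (a + n)
        total += term
    lower = total * exp(-y) * y ** a / gamma(a)
    return max(0.0, 1.0 - lower)

def links_flagged_evenly(flag_counts):
    """Return (statistic, p_value) for 'all links are flagged at similar rates'.

    flag_counts maps link -> number of hosts whose t-test flagged that link.
    """
    observed = list(flag_counts.values())
    total = sum(observed)
    if total == 0:
        return 0.0, 1.0  # nothing flagged, nothing to distinguish
    expected = total / len(observed)
    stat = sum((o - expected) ** 2 / expected for o in observed)
    return stat, chi2_sf(stat, df=len(observed) - 1)

# Made-up counts: background false positives vs. one link flagged by many hosts.
stat_even, p_even = links_flagged_evenly({"l1": 5, "l2": 6, "l3": 4, "l4": 5})
stat_skew, p_skew = links_flagged_evenly({"l1": 5, "l2": 6, "l3": 40, "l4": 5})
```

With evenly spread flags the p-value stays near 1, while a link flagged by far more hosts than its peers drives it toward 0.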
Evaluation in the datacenter
• Small detection surface; did not detect any ‘organic’ partial faults.
• Approach: inject ‘simulated’ faults to evaluate approach.
• Induced a variety of fault scenarios to challenge our system.
31
Evaluation in the datacenter: fault scenarios
• Minuscule faults: faults can have very low drop rates.
• Concurrent faults: multiple faults can occur simultaneously.
• Masked faults: a larger fault can mask the effect of a minuscule fault.
• Correlated faults: a hardware fault can impact multiple nearby links, confounding outlier analysis.
32
Finding minuscule faults: experiment setup
[Figure: topology diagram showing Hosts, ToRs, Aggs, and Cores, with a detail of one Agg and Core 1, Core 2, Core 3, …, Core N.]
Equivalence set: N uplinks from pod Agg layer to core layer
Partial fault induced on single Core to Agg downlink.
37
Fault detection rate vs drop rate
38
Minuscule faults: choosing between detection speed and sensitivity
39
“It would be nice if we could figure out which link was causing these retransmits.”
- Ranjeeth Dasineni, Facebook (paraphrased)
45
46
Interpreting the t-test
1. T-statistic: “Does this link have more or fewer retransmits than average?”
  • Positive t-statistic means larger than average.
  • Negative t-statistic means smaller than average.
2. P-value: “Is the difference in mean big enough to concern us?”
  • Close to 0 means this link could be an outlier.
  • Close to 1 means we are not concerned.
47
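The reading rules above can be written down as a small decision function. This is only a sketch: the significance threshold `alpha` and the wording of the verdicts are hypothetical choices, not values from the talk.

```python
def interpret(t_stat, p_value, alpha=0.05):
    """Turn a (t-statistic, p-value) pair into a verdict for one link."""
    if p_value >= alpha:
        # Difference in mean is too small to worry about.
        return "not concerned"
    if t_stat > 0:
        return "possible outlier: more retransmits than average"
    return "significantly fewer retransmits than average"

verdict_faulty = interpret(t_stat=15.6, p_value=0.0)
verdict_normal = interpret(t_stat=0.1, p_value=0.92)
```

Only the combination of a small p-value and a positive t-statistic points at a link that retransmits more than its peers.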
Interpreting the t-test
[Figure annotation: P-value ≈ 0, t-stat > 0]
[Figure annotation: P-value ≈ 1, t-stat ≈ 0]
48