Peer-to-peer systems and Distributed Hash Tables (DHTs)

COS 461: Computer Networks, Spring 2010 (MW 3:00-4:20 in COS 105)
Mike Freedman
http://www.cs.princeton.edu/courses/archive/spring10/cos461/
Overlay Networks
• P2P applications need to:
  – Track identities & (IP) addresses of peers
    • May be many and may have significant churn
    • Best not to have n^2 ID references
    – Thus, nodes' "views" << view in consistent hashing
  – Route messages among peers
    • If you don't keep track of all peers, this is "multi-hop"
• Overlay network
  – Peers doing both naming and routing
  – IP becomes "just" the low-level transport
    • All the IP routing is opaque
    • Assumption that network is fully connected (true?)
(Many slides borrowed from Joe Hellerstein's VLDB '04 keynote)
Many New Challenges
• Relative to other parallel/distributed systems
  – Partial failure
  – Churn
  – Few guarantees on transport, storage, etc.
  – Huge optimization space
  – Network bottlenecks & other resource constraints
  – No administrative organizations
  – Trust issues: security, privacy, incentives
• Relative to IP networking
  – Much higher function, more flexible
  – Much less controllable/predictable
Early P2P
Early P2P I: Client-Server
• Napster
  – Client-server search
  – "P2P" file transfer
[Figure: clients send "xyz.mp3?" queries to a central Napster server; the file itself is transferred peer-to-peer]
Early P2P II: Flooding on Overlays
• An "unstructured" overlay network
• Flooding
[Figure: "xyz.mp3?" query flooded across the overlay; the file is returned from the peer that holds it]
Early P2P II.v: "Ultra/super peers"
• Ultra-peers can be installed (KaZaA) or self-promoted (Gnutella)
  – Also useful for NAT circumvention, e.g., in Skype
Hierarchical Networks (& Queries)
• IP
  – Hierarchical namespace
  – Hierarchical routing: ASes correspond with namespace (not perfectly)
• DNS
  – Hierarchical namespace ("clients" + hierarchy of servers)
  – Hierarchical routing w/ aggressive caching
• Traditional pros/cons of hierarchical management
  – Works well for things aligned with the hierarchy
    • E.g., physical or administrative locality
  – Inflexible
    • No data independence!
Lessons and Limitations
• Client-server performs well
  – But not always feasible: performance not often the key issue!
• Things that flood-based systems do well
  – Organic scaling
  – Decentralization of visibility and liability
  – Finding popular stuff
  – Fancy local queries
• Things that flood-based systems do poorly
  – Finding unpopular stuff
  – Fancy distributed queries
  – Vulnerabilities: data poisoning, tracking, etc.
  – Guarantees about anything (answer quality, privacy, etc.)
Structured Overlays: Distributed Hash Tables
DHT Outline
• High-level overview
• Fundamentals of structured network topologies
  – And examples
• One concrete DHT
  – Chord
• Some systems issues
  – Heterogeneity
  – Storage models & soft state
  – Locality
  – Churn management
  – Underlay network issues
High-Level Idea: Indirection
• Indirection in space
  – Logical (content-based) IDs, routing to those IDs
  – "Content-addressable" network
  – Tolerant of churn
    • nodes joining and leaving the network
• Indirection in time
  – Temporally decouple send and receive
  – Persistence required. Hence, typical sol'n: soft state
    • Combo of persistence via storage and via retry
      – "Publisher" requests TTL on storage
      – Republishes as needed
• Metaphor: Distributed Hash Table
[Figure: a message addressed to hash ID h(z) is routed through the overlay to the node currently responsible for that ID]
What is a DHT?
• Hash table
  – Data structure that maps "keys" to "values"
  – Essential building block in software systems
• Distributed Hash Table (DHT)
  – Similar, but spread across the Internet
• Interface
  – insert(key, value) or put(key, value)
  – lookup(key) or get(key)
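The put/get interface above can be sketched with consistent hashing in a single process. The class and helper names below are made up for illustration; this toy model only captures the key-to-node mapping, not the actual multi-hop routing that later slides cover:

```python
import hashlib
from bisect import bisect_left

def ring_id(name, m=16):
    """Hash a node name or key onto the ID ring [0, 2^m)."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (2 ** m)

class ToyDHT:
    """Toy single-process model: only the key->node mapping is 'distributed'."""
    def __init__(self, node_names, m=16):
        self.m = m
        self.ring = sorted(ring_id(n, m) for n in node_names)
        self.store = {nid: {} for nid in self.ring}   # per-node local storage

    def successor(self, key_id):
        """First node clockwise from key_id, wrapping past the top of the ring."""
        i = bisect_left(self.ring, key_id)
        return self.ring[i % len(self.ring)]

    def put(self, key, value):
        self.store[self.successor(ring_id(key, self.m))][key] = value

    def get(self, key):
        return self.store[self.successor(ring_id(key, self.m))].get(key)

dht = ToyDHT(["node-%d" % i for i in range(8)])
dht.put("xyz.mp3", b"...")
assert dht.get("xyz.mp3") == b"..."
```

Because put() and get() hash the same key to the same ring position, they always meet at the same node; a real DHT replaces the local `successor` call with multi-hop routing.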
How?
Every DHT node supports a single operation:
– Given key as input, route messages toward the node holding that key

DHT in action
Operation: take key as input; route msgs to node holding key
[Figure: overlay of nodes, each storing (K, V) pairs; a request is routed hop-by-hop across the overlay]

DHT in action: put()
put(K1, V1) is routed to the node responsible for K1, which stores (K1, V1)

DHT in action: get()
get(K1) is routed to the same node, which returns V1

Iterative vs. Recursive Routing
Previously showed recursive. Another option: iterative
[Figure: a recursive lookup is forwarded hop-by-hop to the target; an iterative lookup returns each next hop to the originator, which contacts it directly]
DHT Design Goals
• An "overlay" network with:
  – Flexible mapping of keys to physical nodes
  – Small network diameter
  – Small degree (fanout)
  – Local routing decisions
  – Robustness to churn
  – Routing flexibility
  – Decent locality (low "stretch")
• Different "storage" mechanisms considered:
  – Persistence w/ additional mechanisms for fault recovery
  – Best-effort caching and maintenance via soft state
DHT Outline
• High-level overview
• Fundamentals of structured network topologies
  – And examples
• One concrete DHT
  – Chord
• Some systems issues
  – Heterogeneity
  – Storage models & soft state
  – Locality
  – Churn management
  – Underlay network issues
An Example DHT: Chord
• Assume n = 2^m nodes for a moment
  – A "complete" Chord ring
  – We'll generalize shortly
• Each node has a particular view of the network
  – Set of known neighbors
[Figure: ring of nodes; each node's neighbor edges at power-of-2 distances trace inscribed polygons]
Cayley Graphs
• The Cayley Graph (S, E) of a group:
  – Vertices corresponding to the underlying set S
  – Edges corresponding to the actions of the generators
• (Complete) Chord is a Cayley graph for (Z_n, +)
  – S = Z mod n (n = 2^k)
  – Generators {1, 2, 4, …, 2^(k-1)}
  – That's what the polygons are all about!
• Fact: Most (complete) DHTs are Cayley graphs
  – And they didn't even know it!
  – Follows from parallel InterConnect Networks (ICNs)
How Hairy met Cayley
• What do you want in a structured network?
  – Uniformity of routing logic
  – Efficiency/load-balance of routing and maintenance
  – Generality at different scales
• Theorem: All Cayley graphs are vertex symmetric.
  – I.e., isomorphic under swaps of nodes
  – So routing from y to x looks just like routing from (y-x) to 0
    • The routing code at each node is the same
    • Moreover, under a random workload the routing responsibilities (congestion) at each node are the same!
• Cayley graphs tend to have good degree/diameter tradeoffs
  – Efficient routing with few neighbors to maintain
• Many Cayley graphs are hierarchical
  – Made of smaller Cayley graphs connected by a new generator
    • E.g., a Chord graph on 2^(m+1) nodes looks like 2 interleaved (half-notch rotated) Chord graphs of 2^m nodes with half-notch edges
Pastry/Bamboo
• Based on Plaxton Mesh
• Names are fixed bit strings
• Topology: Prefix Hypercube
  – For each bit from left to right, pick neighbor ID with common flipped bit and common prefix
  – log n degree & diameter
• Plus a ring
  – For reliability (with k pred/succ)
• Suffix routing from A to B
  – "Fix" bits from left to right
  – E.g., 1010 to 0001: 1010 → 0101 → 0010 → 0000 → 0001
[Figure: mesh containing nodes 1010, 0101, 1100, 1000, 1011]
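The "fix bits left to right" idea can be sketched in isolation. This is an idealized assumption for illustration: it pretends a neighbor exists for every exact bit-flip, whereas a real Plaxton/Pastry hop only needs to share a longer prefix with the destination (which is why the slide's example path passes through 0101):

```python
def bitfix_route(src, dst, m):
    """Greedy prefix routing: each hop fixes the leftmost bit that differs."""
    path, cur = [src], src
    for i in reversed(range(m)):        # scan bits MSB -> LSB
        if (cur ^ dst) >> i & 1:
            cur ^= 1 << i               # hop to the ID with this bit corrected
            path.append(cur)
    return path

# 1010 -> 0001 in an idealized 4-bit prefix hypercube: at most m = log n hops
route = bitfix_route(0b1010, 0b0001, m=4)
assert [format(x, "04b") for x in route] == ["1010", "0010", "0000", "0001"]
```

Each hop corrects one bit, so the path length is bounded by m, matching the log n diameter claimed above.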
CAN: Content Addressable Network
• Exploit multiple dimensions
• Each node is assigned a zone
• Nodes ID'd by zone boundaries
• Join: choose a random point, split its zone
[Figure: unit square from (0,0) to (1,1) partitioned into zones such as (0, 0.5, 0.5, 1), (0.5, 0.5, 1, 1), (0, 0, 0.5, 0.5), (0.5, 0.25, 0.75, 0.5), (0.75, 0, 1, 0.5)]
Routing in 2 dimensions
• Routing is navigating a d-dimensional ID space
  – Route to closest neighbor in direction of destination
  – Routing table contains O(d) neighbors
• Number of hops is O(d N^(1/d))
[Figure: same zone layout; a lookup steps zone-to-zone toward the destination point]
Koorde
• De Bruijn graphs
  – Link from node x to nodes 2x and 2x+1
  – Degree 2, diameter log n
    • Optimal!
• Koorde is Chord-based
  – Basically Chord, but with de Bruijn fingers
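De Bruijn routing can be sketched directly (an illustrative toy on a complete 2^m-node graph, not Koorde's actual code for incomplete rings): since x's neighbors are 2x and 2x+1 mod 2^m, shifting the destination's bits in one per hop reaches any node in m = log n hops.

```python
def debruijn_route(src, dst, m):
    """Each hop is 2x or 2x+1 (mod 2^m): shift in dst's bits, MSB first."""
    mask, path, cur = (1 << m) - 1, [src], src
    for i in reversed(range(m)):
        cur = ((cur << 1) | ((dst >> i) & 1)) & mask   # choose 2x or 2x+1
        path.append(cur)
    return path

path = debruijn_route(0b101, 0b010, m=3)
assert path[-1] == 0b010 and len(path) == 4   # exactly m = log n hops
```

Every step follows a degree-2 de Bruijn edge, which is why the slide calls the degree/diameter tradeoff optimal.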
Topologies of Other Oft-cited DHTs
• Tapestry
  – Very similar to Pastry/Bamboo topology
  – No ring
• Kademlia
  – Also similar to Pastry/Bamboo
  – But the "ring" is ordered by the XOR metric: "bidirectional"
  – Used by the eMule/BitTorrent/Azureus (Vuze) systems
• Viceroy
  – An emulated Butterfly network
• Symphony
  – A randomized "small-world" network
Incomplete Graphs: Emulation
• For Chord, we assumed exactly 2^m nodes. What if not?
  – Need to "emulate" a complete graph even when incomplete.
• DHT-specific schemes used
  – In Chord, node x is responsible for the range (pred(x), x]
  – The "holes" on the ring should be randomly distributed due to hashing
Handle node heterogeneity
• Sources of unbalanced load
  – Unequal portion of keyspace
  – Unequal load per key
• Balancing keyspace
  – Consistent hashing: region owned by a single node is O(1/n (1 + log n))
  – What about node heterogeneity?
    • Nodes create a number of "virtual nodes" proportional to capacity
• Load per key
  – Assumes many keys per region
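The virtual-node fix can be sketched as follows (the names and the `base` multiplier are made up for illustration): a node with capacity c registers c times as many IDs on the ring, so in expectation it owns c times as much keyspace.

```python
import hashlib

def ring_id(s, m=32):
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % (2 ** m)

def build_ring(capacities, base=4):
    """Each physical node inserts base * capacity virtual IDs onto the ring."""
    ring = []
    for node, cap in capacities.items():
        for v in range(base * cap):
            ring.append((ring_id("%s#%d" % (node, v)), node))
    return sorted(ring)

ring = build_ring({"small": 1, "big": 4})
big_share = sum(1 for _, owner in ring if owner == "big") / len(ring)
assert big_share == 0.8   # 16 of the 20 virtual nodes belong to "big"
```

The fraction of virtual IDs is exact by construction; the fraction of actual ring arc (and hence keys) owned only converges to it in expectation, per the O(1/n (1 + log n)) bound above.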
Chord in Flux
• Essentially never a "complete" graph
  – Maintain a "ring" of successor nodes
  – For redundancy, point to k successors
  – Point to nodes responsible for IDs at powers of 2
    • Called "fingers" in Chord
    • 1st finger is the successor
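Finger placement can be computed directly. The sketch below is a standalone illustration (not Chord's distributed find_successor protocol) on the small example ring from the original Chord paper: m = 3, nodes {0, 1, 3}.

```python
def finger_table(n, m, nodes):
    """finger[i] = successor of (n + 2^i) mod 2^m; finger[0] is n's successor."""
    ring = sorted(nodes)
    def successor(x):
        for nid in ring:
            if nid >= x:
                return nid
        return ring[0]                 # wrap around the top of the ring
    return [successor((n + 2 ** i) % 2 ** m) for i in range(m)]

assert finger_table(0, 3, [0, 1, 3]) == [1, 3, 0]
assert finger_table(1, 3, [0, 1, 3]) == [3, 3, 0]
assert finger_table(3, 3, [0, 1, 3]) == [0, 0, 0]
```

Note that finger[0] is always the node's successor, matching the slide's "1st finger is the successor".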
Joining the Chord Ring
• Need IP of some node
• Pick a random ID
  – e.g., SHA-1(IP)
• Send msg to current owner of that ID
  – That's your successor in Chord routing
Joining the Chord Ring
• Update pred/succ links
  – Once ring is in place, all well!
• Inform application to move data appropriately
• Search to find "fingers" of varying powers of 2
  – Or just copy from pred/succ and check!
• Inbound fingers fixed lazily

Theorem: If consistency is reached before the network doubles, lookups remain log n
Fingers must be constrained?
• No: Proximity Neighbor Selection (PNS)
Handling Churn
• Churn
  – Session time? Lifetime?
  – For system resilience, session time is what matters
• Three main issues
  – Determining timeouts
    • Significant component of lookup latency under churn
  – Recovering from a lost neighbor in "leaf set"
    • Periodic, not reactive!
    • Reactive causes feedback cycles
      – Esp. when a neighbor is stressed and timing in and out
  – Neighbor selection again
Timeouts
• Recall iterative vs. recursive routing
  – Iterative: originator requests IP address of each hop
  – Recursive: message transferred hop-by-hop
• Effect on timeout mechanism
  – Need to track latency of communication channels
  – Iterative results in direct n × n communication
    • Can't keep timeout stats at that scale
    • Solution: virtual coordinate schemes [Vivaldi, etc.]
  – With recursive, can do TCP-like tracking of latency
    • Exponentially weighted mean and variance
• Upshot: both work OK up to a point
  – TCP-style does somewhat better than virtual coords at modest churn rates (23 min. or more mean session time)
  – Virtual coords begin to fail at higher churn rates
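The "exponentially weighted mean and variance" tracker is essentially TCP's retransmission-timeout estimator. A minimal sketch, using the standard Jacobson/Karels constants (alpha = 1/8, beta = 1/4); the class name is made up:

```python
class RttEstimator:
    """TCP-style timeout: smoothed RTT plus 4x smoothed deviation."""
    def __init__(self, alpha=0.125, beta=0.25):
        self.alpha, self.beta = alpha, beta
        self.srtt = None                      # smoothed round-trip time
        self.rttvar = 0.0                     # smoothed deviation

    def observe(self, rtt):
        if self.srtt is None:                 # first sample seeds the estimator
            self.srtt, self.rttvar = rtt, rtt / 2
        else:                                 # update deviation before the mean
            self.rttvar += self.beta * (abs(rtt - self.srtt) - self.rttvar)
            self.srtt += self.alpha * (rtt - self.srtt)

    def timeout(self):
        return self.srtt + 4 * self.rttvar

est = RttEstimator()
for ms in [100, 110, 90, 105]:               # per-neighbor RTT samples (ms)
    est.observe(ms)
assert est.timeout() > est.srtt              # timeout keeps a margin above the mean
```

Recursive routing can keep one such estimator per neighbor link; iterative routing would need one per (source, hop) pair, which is the n x n scaling problem noted above.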
Recursive vs. Iterative
Left: simulation of 20,000 lookups for random keys. Recursive lookup takes 0.6 times as long as iterative.
Right figure: 1,000 lookups in test-bed; confirms simulation.
Recursive vs. Iterative
• Recursive
  – Faster under many conditions
    • Fewer round-trip times
    • Better proximity neighbor selection
    • Can timeout individual RPCs more tightly
  – Better tolerance to network failure
    • Path between neighbors already known
• Iterative
  – Tighter control over entire lookup
    • Easily support windowed RPCs for parallelism
    • Easier to timeout entire lookup as failed
  – Faster to return data directly than use recursive path
Storage Models for DHTs
• Up to now we focused on routing
  – DHTs as "content-addressable network"
• Implicit in the "DHT" name is some kind of storage
  – Or perhaps a better word is "memory"
  – Enables indirection in time
  – But also can be viewed as a place to store things
Storage models
• Store only on key's immediate successor
  – Churn, routing issues, packet loss make lookup failure more likely
• Store on k successors
  – When nodes detect succ/pred fail, re-replicate
• Cache along reverse lookup path
  – Provided data is immutable
  – … and performing recursive responses
Storage on successors?
• Erasure coding
  – Data block split into l fragments
  – m different fragments necessary to reconstruct the block
  – Redundant storage of data
• Replication
  – Node stores entire block
  – Special case: m = 1 and l is number of replicas
  – Redundant information spread over fewer nodes
• Comparison of both methods
  – r = l/m amount of redundancy
  – Prob. block available:

    p_avail = Σ_{i=m}^{l} (l choose i) · p_0^i · (1 - p_0)^(l-i)
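The availability formula can be evaluated directly. In this sketch, the per-node availability p_0 = 0.9 is an assumed number for illustration; it compares DHash++'s coding parameters with plain replication at the same redundancy r = l/m = 2:

```python
from math import comb

def p_avail(p0, l, m):
    """P(block recoverable) = sum_{i=m..l} C(l,i) * p0^i * (1-p0)^(l-i)."""
    return sum(comb(l, i) * p0 ** i * (1 - p0) ** (l - i)
               for i in range(m, l + 1))

p0 = 0.9                               # assumed per-node availability
coded = p_avail(p0, l=14, m=7)         # DHash++: any 7 of 14 fragments suffice
replicas = p_avail(p0, l=2, m=1)       # same redundancy: 2 full copies
assert abs(replicas - (1 - (1 - p0) ** 2)) < 1e-12   # sanity check: ~0.99
assert coded > replicas                # coding survives more failure patterns
```

At equal redundancy, coding tolerates any 7 fragment losses out of 14, while replication fails as soon as both copies are down, which is why the next slide reports higher availability for erasure coding.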
Latency: Erasure coding vs. replication
• Replication: slightly lower latency
• Erasure coding: higher availability
• DHash++ uses erasure coding with m = 7 and l = 14
What about mutable data?
• Ugh!
• Different views
  – Ivy: create version trees [Muthitacharoen, OSDI '02]
    • Think "distributed version control" system
• Global agreement?
  – Reach consensus among all nodes belonging to a successor group: "distributed agreement"
    • Difficult, especially at scale
An oft-overlooked assumption: The underlay isn't perfect!
• All have implicit assumption: full connectivity
• Non-transitive connectivity (NTC) not uncommon: B↔C, C↔A, but A and B cannot communicate
• A thinks C is its successor!
[Figure: nodes A, B, C in ID order on the ring, with key k; the A–B link is down (X)]
Does non-transitivity exist?
• Gerding/Stribling PlanetLab study
  – 9% of all node triples exhibit NTC
  – Attributed high extent to Internet-2
• Yet NTC is also transient
  – One 3-hour PlanetLab all-pair-pings trace
  – 2.9% have persistent NTC
  – 2.3% have intermittent NTC
  – 1.3% fail only for a single 15-minute snapshot
• Example: Level3 and Cogent cannot talk directly, but Level3 ↔ X ↔ Cogent
• NTC motivates RON and other overlay routing!
NTC problem fundamental?
• DHTs implement greedy routing for scalability
• Sender might not use a path even though one exists: greedy id-distance routing finds local minima
• Traditional routing toward R: S forwards to A, A to B, B to R
• Greedy routing toward R: S forwards to A, A to C, C has no closer reachable node (X)
[Figure: nodes S, A, B, C, R in ID order along the ring]
Potential problems?
• Invisible nodes
• Routing loops
• Broken return paths
• Inconsistent roots
Iterative routing: Invisible nodes
• Invisible nodes cause lookup to halt
• Enable lookup to continue:
  – Tighter timeouts via network coordinates
  – Lookup RPCs in parallel
  – Unreachable-node cache
[Figure: iterative lookup for key k from S via nodes A, B, C, D; X marks an unreachable node]
Inconsistent roots
• Nodes do not agree where a key is assigned: inconsistent views of the root
  – Can be caused by membership changes
  – Also due to non-transitive connectivity: may persist!
• Root replicates (key, value) among leaf set
  – Leafs periodically synchronize
  – Get gathers results from multiple leafs
• Not applicable when fast update is required
[Figure: S and S' route lookups for key k to different roots R and R'; leaf-set neighbors M and N synchronize with R]
Longer-term solution?
• Route around local minima when possible
• Have nodes maintain link-state of neighbors
  – Perform one-hop forwarding if necessary
[Figure: same S, A, B, C, R example as the previous slide, contrasting traditional and greedy routing]
Summary
• Peer-to-peer systems
  – Unstructured systems
    • Finding hay, performing keyword search
  – Structured systems (DHTs)
    • Finding needles, exact match
• Distributed hash tables
  – Based around consistent hashing with views of O(log n)
  – Chord, Pastry, CAN, Koorde, Kademlia, Tapestry, Viceroy, …
• Lots of systems issues
  – Heterogeneity, storage models, locality, churn management, underlay issues, …
  – DHTs (Kademlia) deployed in the wild: Vuze has 1M+ active users