Scalable membership management and failure detection?
Vinay Setty, INF5360
What is Gossiping?
• Spread of information in a random manner
• Some examples:
– Human gossiping
– Epidemic diseases
– Physical phenomena: wild fire, diffusion, etc.
– Computer viruses and worms
Gossiping in Computer Science
• Term first coined by Demers et al. (1987)
• Some applications of gossip protocols:
– Peer sampling
– Data aggregation
– Clustering
– Information dissemination (multicast, pub/sub)
– Overlay/topology maintenance
– Failure detection?
Gossip-Based Protocol: Example
[Figure: ten nodes, labeled 0–9, exchanging gossip messages]
Today’s Focus
• Theoretical angle on gossip-based protocols [Allavena et al., PODC 2005]
– Probability of partitioning
– Time till partitioning
– Bounds on in-degree
– Essential elements of gossiping
– Simulation results
• Cyclon [Voulgaris et al.]
• Scamp [Ganesh et al.]
• NewsCast [Jelasity et al.]
Membership Service
• Full membership
– Complete knowledge at each node
– Random subset used for gossiping
– Not scalable
– Hard to maintain
• Partial membership
– Random subset at each node
– Gossip partners chosen from the local view
View Selection
[Figure: node u builds list L1 from the local views (s,p,r and t,q,r) of the nodes in its own view, and list L2 from nodes such as v that requested u's view; the new view is selected from L1 and L2, with L2 weighted with w]
Essential Elements of Gossiping
• Mixing: construct a list L1 consisting of the local views of the nodes in the local view of node u
– Guarantees non-partitioning
– "Pull" based
• Reinforcement: construct a list L2 consisting of the nodes that requested the local view of u
– Balances the network
– Removes old, possibly dead edges and adds new edges
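A minimal sketch of one view-update round combining the two lists (hypothetical function and parameter names; Allavena et al. sample the new view from a weighted mixture of L1 and L2, which is simplified here to a plain uniform sample):

```python
import random

def update_view(pulled_views, requesters, k):
    """One round of view selection at node u (sketch, not the paper's exact rule).

    pulled_views -- local views pulled from the nodes in u's view (mixing, L1)
    requesters   -- nodes that recently requested u's view (reinforcement, L2)
    k            -- fixed view size
    """
    l1 = [peer for view in pulled_views for peer in view]  # mixing list
    l2 = list(requesters)                                  # reinforcement list
    candidates = l1 + l2
    if len(candidates) <= k:
        return candidates
    # The paper weights L2 with a factor w; uniform sampling is used here for brevity.
    return random.sample(candidates, k)
```

The essential point is that the new view mixes pulled entries (which keeps the graph connected) with requester entries (which replaces stale edges).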
Partitioning and Size Estimate
• A and B partition iff x = 1 and y = 0
• Partitioning is least likely when x = y
• The goal of the protocol is to maintain this balance
Size Estimates
• Idea:
– Assuming edges were drawn uniformly at random, the expected value of x + y is proportional to |A|
– x is the estimate of the size of A by nodes in A
– y is the estimate of the size of A by nodes in B
• Mixing:
– Agreeing on the estimates x and y ensures no partition (even if x and y are not accurate)
• Reinforcement:
– Brings the estimates x and y to the correct value
K-regularity
• View size: k
• Number of nodes: n
• Fraction of nodes in partition: γ
• |A| = γn ≤ |B|
• #edges from A to B: (1−x)γkn
• #edges from B to A: y(1−γ)kn
• Number of edges in the A–B cut:
– (1−x)γkn + x(1−γ)kn (since x = y)
– ≥ γkn (assuming γ ≤ ½)
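The cut-size bound can be checked numerically. A small sketch (hypothetical function name) counting the edges crossing the A–B cut; with x = y and γ ≤ ½ the cut never drops below γkn:

```python
def cut_edges(n, k, gamma, x, y):
    """Edges crossing the A-B cut in a k-regular directed graph of n nodes.

    A holds gamma*n of the nodes; x (resp. y) is the fraction of A's
    (resp. B's) out-edges attributed to A, as in the slides.
    """
    a_to_b = (1 - x) * gamma * k * n   # A's out-edges leaving A
    b_to_a = y * (1 - gamma) * k * n   # B's out-edges entering A
    return a_to_b + b_to_a

# With x = y and gamma <= 1/2, the cut holds at least gamma*k*n edges:
n, k, gamma = 1000, 10, 0.25
for v in (0.1, 0.5, 0.9):
    assert cut_edges(n, k, gamma, v, v) >= gamma * k * n
```

Algebraically, with x = y the cut equals kn·(γ + x(1 − 2γ)), which is at least γkn whenever γ ≤ ½.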
Time Till Partitioning
• View size: k
• Number of nodes: n
• Fraction of nodes in partition: γ
• Churn rate: μ (μn nodes leave and join)
• Claim: expected time before a partition of size γn happens ≈ 2^(γkn)
– As long as μ ≪ γkn
Iterations until Partitioning
[Plot. Number of nodes: n; view size: k = log n; churn: n/32]
View Size vs Time until Partition
[Plot. Number of nodes: n; view size: k = log n; churn: n/32]
Simplified Model for Proof
– A single randomly chosen element of the view is replaced instead of the whole view
– Assumption: the out-edges of nodes in A are identically distributed, and the same applies to B
– a = #edges from A to A
– c = #edges from A to B
– b = #edges from B to A
– d = #edges from B to B
Proof Intuition
Partition state: a = γkn and b = 0 (so c = 0: no edges cross the cut)
In-Degree Analysis
• Load balancing requires a balanced in-degree distribution
• In-degree is governed by the way edges are created, copied and destroyed
• Copying some edges more than others causes variability in in-degree
• A node living longer is expected to have a higher in-degree
• Solution: increase reinforcement and keep track of timestamps, as in Cyclon
• Simulation: max in-degree < 4.5 times that of a random graph, and standard deviation < 3.2 times
Discussion
• Are these theoretical guarantees practically useful?
• The goal is not to provide failure detection
Cyclon
• Consists of the same elements as suggested by [Allavena et al., PODC 2005]
• The analysis of [Allavena et al., PODC 2005] holds for Cyclon
• Major differences:
– Timestamps
– Shuffling
Basic Shuffling
• Select a random subset of l neighbors (1 ≤ l ≤ c) from P's own cache, and a random peer, Q, within this subset, where l is a system parameter called the shuffle length.
• Replace Q's address with P's address.
• Send the updated subset to Q.
• Receive from Q a subset of no more than l of Q's neighbors.
• Discard entries pointing to P, and entries that are already in P's cache.
• Update P's cache to include all remaining entries, by
– firstly using empty cache slots (if any), and
– secondly replacing entries among the ones originally sent to Q.
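The steps above can be sketched from P's side as follows (hypothetical names; Q's reply is passed in as a plain list instead of being exchanged over the network):

```python
import random

def basic_shuffle(p_id, p_cache, cache_size, q_reply, l):
    """One basic Cyclon shuffle from initiator P's side (a sketch).

    p_cache  -- P's current neighbor list, q_reply -- entries received from Q,
    l        -- the shuffle length, cache_size -- maximum cache size c.
    Returns (P's updated cache, the subset sent to Q).
    """
    subset = random.sample(p_cache, min(l, len(p_cache)))
    q_id = subset[0]                                   # random shuffle partner Q
    sent = [p_id if e == q_id else e for e in subset]  # swap Q's entry for P's
    # Discard entries pointing to P or already in P's cache.
    fresh = [e for e in q_reply if e != p_id and e not in p_cache]
    new_cache, victims = list(p_cache), list(subset)
    for e in fresh:
        if len(new_cache) < cache_size:        # firstly: use empty slots
            new_cache.append(e)
        elif victims:                          # secondly: replace entries sent to Q
            new_cache[new_cache.index(victims.pop())] = e
    return new_cache, sent
```

The cache size therefore never shrinks: received entries either fill free slots or overwrite entries that Q now holds anyway.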
Shuffling Example
Enhanced Shuffling
• Increase by one the age of all neighbors.
• Select neighbor Q with the highest age among all neighbors, and l − 1 other random neighbors.
• Replace Q's entry with a new entry of age 0 and with P's address.
• Send the updated subset to peer Q.
• Receive from Q a subset of no more than l of its own entries.
• Discard entries pointing at P and entries already contained in P's cache.
• Update P's cache to include all remaining entries, by firstly using empty cache slots (if any), and secondly replacing entries among the ones sent to Q.
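The age bookkeeping that distinguishes enhanced from basic shuffling can be sketched as follows (hypothetical names; cache entries carry an explicit age field):

```python
def pick_shuffle_partner(cache):
    """Enhanced shuffling partner selection (sketch): age every entry by one,
    then pick the neighbor with the highest age as the shuffle partner Q.

    cache -- list of dicts like {"id": node_id, "age": int}
    """
    for entry in cache:
        entry["age"] += 1          # step 1: increase the age of all neighbors
    return max(cache, key=lambda e: e["age"])  # step 2: oldest entry is Q
```

Targeting the oldest entry is what lets Cyclon expire links to dead nodes: an entry whose owner never shuffles back only grows older until it is chosen and replaced.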
Time Until Dead Links Removed
Number of Clusters
Tolerance to Partitioning
In-Degree Distribution
SCAMP
• Partial knowledge of the membership: the local view
• Fanout automatically set = size of the local view
• Fanout evolves naturally with the size of the group
– Size of local views converges towards C·log(n)
Join (Subscription)
[Figure: new node s subscribes via a random member; each node receiving the forwarded subscription keeps it with probability P = 1/(size of view) and forwards it on with probability (1 − P)]
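The forwarding rule in the figure can be sketched as follows (hypothetical names; following the slides, a node keeps a subscription with probability P = 1/(size of view) and otherwise forwards it to a random member of its view):

```python
import random

def forward_subscription(nodes, start, new_id, max_hops=10_000):
    """SCAMP-style subscription forwarding (sketch following the slides).

    nodes  -- dict mapping node id -> local view (list of node ids)
    start  -- node that first receives the subscription of new_id
    Returns the id of the node that kept the subscription.
    """
    current = start
    for _ in range(max_hops):
        view = nodes[current]
        if new_id not in view and random.random() < 1.0 / max(len(view), 1):
            view.append(new_id)        # keep: add the new node to the local view
            return current
        current = random.choice(view)  # forward to a random view member
    raise RuntimeError("subscription not placed")
```

Because the keep probability shrinks as views grow, larger systems forward subscriptions further, which is how the view size tracks the group size.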
Join (Subscription) Algorithm
[Figure: step-by-step example of the subscription algorithm, with node 6's subscription forwarded and added to the local views of existing nodes]
Load Balancing
• Indirection:
– Forward the subscription instead of handling the request
• Lease associated with each subscription
• Periodically nodes have to re-subscribe
– Nodes that have failed permanently will time out
– Re-balances the partial views
Unsubscription
[Figure: node 0 unsubscribes by announcing Unsub(0) together with its view [1,4,5]; nodes x, y and z replace the entries for node 0 in their local views with members of 0's view]
Degree
• System modelled as a random directed graph
• D(N) = average out-degree for an N-node system
• A subscription adds D(N)+1 directed arcs, so
• (N+1)·D(N+1) = N·D(N) + D(N) + 1
• The solution of this recursion is
• D(N) = D(1) + 1/2 + 1/3 + … + 1/N ≈ log(N)
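The recursion and its harmonic-series solution can be checked directly (hypothetical function name):

```python
import math

def avg_out_degree(n, d1=1.0):
    """Solve (N+1)*D(N+1) = N*D(N) + D(N) + 1, i.e. D(N+1) = D(N) + 1/(N+1),
    giving D(N) = D(1) + 1/2 + 1/3 + ... + 1/N (the harmonic series)."""
    return d1 + sum(1.0 / i for i in range(2, n + 1))

# The harmonic sum grows like log(N), matching the slide's claim:
d = avg_out_degree(5000)
approx = math.log(5000)
```

Dividing both sides of the recursion by (N+1) shows each new node adds exactly 1/(N+1) to the average, which is where the harmonic series comes from.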
Distribution of view size
[Plot: distribution of view size; curves labeled Log=12.2 and Log=13.12]
Reliability: 5000 node system
[Plot: reliability (y-axis, 0.9–1.0) vs. number of failures (x-axis, 0–2500); curves: SCAMP, global membership knowledge with fanout 8, and global membership knowledge with fanout 9]
NewsCast
• Goal: aggregate information
– in a large and dynamic distributed environment
– in a robust and dependable manner
Idea
• Gets news from the application, timestamps it, and adds the local peer address to the cache entry
• Finds a random peer among the cache addresses
– Sends all cache entries to this peer
– Receives all cache entries from that peer
• Passes on cache entries (containing news items) to the application
• Merges the old cache with the received cache
– Keeps at most the C most recent cache entries
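The merge step can be sketched as follows (hypothetical names; cache entries are (peer, timestamp) pairs and C bounds the cache size):

```python
def merge_caches(own, received, c):
    """NewsCast cache merge (sketch): combine both caches, keep the freshest
    copy per peer, then keep only the C most recent entries overall.

    own, received -- lists of (peer_id, timestamp); higher timestamp = newer.
    """
    best = {}
    for peer, ts in own + received:
        if peer not in best or ts > best[peer]:
            best[peer] = ts                      # deduplicate, prefer newer
    merged = sorted(best.items(), key=lambda e: e[1], reverse=True)
    return merged[:c]                            # keep the C most recent
```

Favoring recent timestamps is what flushes dead peers out of the cache: their entries stop being refreshed and eventually fall off the end.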
Aggregation
• Each node ni maintains a single number xi
• Every node ni selects a random node nk and sends its value xi to nk
• nk responds with the aggregate (e.g. max(xi, xk)) of the incoming value and its own
• Aggregate values converge "exponentially"
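The aggregation rounds can be sketched as follows (hypothetical names; each node contacts one random peer per round and both adopt the pairwise maximum):

```python
import random

def gossip_max(values, rounds):
    """Epidemic max-aggregation (sketch): per round, every node contacts a
    random peer; both ends keep the maximum of the two values."""
    values = list(values)
    n = len(values)
    for _ in range(rounds):
        for i in range(n):
            k = random.randrange(n)          # random gossip partner
            m = max(values[i], values[k])    # the aggregate of both values
            values[i] = values[k] = m        # both ends adopt it
    return values

vals = gossip_max([3, 1, 4, 1, 5, 9, 2, 6], rounds=10)
```

The set of nodes holding the true maximum roughly doubles each round, which is the "exponential" convergence the slide refers to; other aggregates (min, averaging variants) follow the same pattern.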
Path length under failures
Connectivity Under Failures
Aggregation