Scalable membership management and failure detection?
Vinay Setty, INF5360
What is Gossiping?
• Spread of information in a random manner
• Some examples:
– Human gossiping
– Epidemic diseases
– Physical phenomena: wild fire, diffusion, etc.
– Computer viruses and worms
Gossiping in Computer Science
• Term first coined by Demers et al. (1987)
• Some applications of gossip protocols:
– Peer sampling
– Data aggregation
– Clustering
– Information dissemination (multicast, pub/sub)
– Overlay/topology maintenance
– Failure detection?
Gossip-Based Protocol: Example
[Figure: ten nodes, labeled 0–9, exchanging gossip messages]
Today’s Focus
• Theoretical angle on gossip-based protocols [Allavena et al., PODC 2005]
– Probability of partitioning
– Time till partitioning
– Bounds on in-degree
– Essential elements of gossiping
– Simulation results
• Cyclon [Voulgaris et al.]
• Scamp [Ganesh et al.]
• NewsCast [Jelasity et al.]
Membership Service
• Full membership
– Complete knowledge at each node
– Random subset used for gossiping
– Not scalable
– Hard to maintain
• Partial membership
– Random subset at each node
– Gossip partners chosen from the local view
View Selection
[Figure: node u builds list L1 from the local views (s,p,r and t,q,r) of the nodes in its own view, and list L2 from nodes such as v that requested u's view; the new view is selected from L1 and L2, with L2 weighted with w]
Essential Elements of Gossiping
• Mixing: construct a list L1 consisting of the local views of the nodes in the local view of node u
– Guarantees non-partitioning
– "Pull" based
• Reinforcement: construct a list L2 consisting of the nodes that requested the local view of u
– Balances the network
– Removes old, possibly dead edges and adds new edges
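A minimal sketch of one view-update round combining the two lists (hypothetical function and parameter names; Allavena et al. sample the new view from a weighted mixture of L1 and L2, which is simplified here to a plain uniform sample):

```python
import random

def update_view(pulled_views, requesters, k):
    """One round of view selection at node u (sketch, not the paper's exact rule).

    pulled_views -- local views pulled from the nodes in u's view (mixing, L1)
    requesters   -- nodes that recently requested u's view (reinforcement, L2)
    k            -- fixed view size
    """
    l1 = [peer for view in pulled_views for peer in view]  # mixing list
    l2 = list(requesters)                                  # reinforcement list
    candidates = l1 + l2
    if len(candidates) <= k:
        return candidates
    # The paper weights L2 with a factor w; uniform sampling is used here for brevity.
    return random.sample(candidates, k)
```

The essential point is that the new view mixes pulled entries (which keeps the graph connected) with requester entries (which replaces stale edges).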
Partitioning and Size Estimate
• A and B partition iff x = 1 and y = 0
• Partitioning is least likely when x = y
• The goal of the protocol is to maintain this balance
Size Estimates
• Idea:
– Assuming edges were drawn uniformly at random, the expected value of x + y is proportional to |A|
– x is the estimate of the size of A by nodes in A
– y is the estimate of the size of A by nodes in B
• Mixing:
– Agreeing on the estimates x and y ensures no partition (even if x and y are not accurate)
• Reinforcement:
– Brings the estimates x and y to the correct value
K-regularity
• View size: k
• Number of nodes: n
• Fraction of nodes in partition: γ
• |A| = γn ≤ |B|
• #edges from A to B: (1−x)γkn
• #edges from B to A: y(1−γ)kn
• Number of edges in the A–B cut:
– (1−x)γkn + x(1−γ)kn (since x = y)
– ≥ γkn (assuming γ ≤ ½)
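The cut-size bound can be checked numerically. A small sketch (hypothetical function name) counting the edges crossing the A–B cut; with x = y and γ ≤ ½ the cut never drops below γkn:

```python
def cut_edges(n, k, gamma, x, y):
    """Edges crossing the A-B cut in a k-regular directed graph of n nodes.

    A holds gamma*n of the nodes; x (resp. y) is the fraction of A's
    (resp. B's) out-edges attributed to A, as in the slides.
    """
    a_to_b = (1 - x) * gamma * k * n   # A's out-edges leaving A
    b_to_a = y * (1 - gamma) * k * n   # B's out-edges entering A
    return a_to_b + b_to_a

# With x = y and gamma <= 1/2, the cut holds at least gamma*k*n edges:
n, k, gamma = 1000, 10, 0.25
for v in (0.1, 0.5, 0.9):
    assert cut_edges(n, k, gamma, v, v) >= gamma * k * n
```

Algebraically, with x = y the cut equals kn·(γ + x(1 − 2γ)), which is at least γkn whenever γ ≤ ½.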
Time Till Partitioning
• View size: k
• Number of nodes: n
• Fraction of nodes in partition: γ
• Churn rate: μ (μn nodes leave and join)
• Claim: expected time before a partition of size γn happens ≈ 2^(γkn)
– As long as μ ≪ γkn
Iterations until Partitioning
[Plot. Number of nodes: n; view size: k = log n; churn: n/32]
View Size vs Time until Partition
[Plot. Number of nodes: n; view size: k = log n; churn: n/32]
Simplified Model for Proof
– A single randomly chosen element of the view is replaced instead of the whole view
– Assumption: the out-edges of nodes in A are identically distributed, and the same applies to B
– a = #edges from A to A
– c = #edges from A to B
– b = #edges from B to A
– d = #edges from B to B
Proof Intuition
Partition state: a = γkn and b = 0 (so c = 0: no edges cross the cut)
In-Degree Analysis
• Load balancing requires a balanced in-degree distribution
• In-degree is governed by the way edges are created, copied and destroyed
• Copying some edges more than others causes variability in in-degree
• A node living longer is expected to have a higher in-degree
• Solution: increase reinforcement and keep track of timestamps, as in Cyclon
• Simulation: max in-degree < 4.5 times that of a random graph, and standard deviation < 3.2 times
Discussion
• Are these theoretical guarantees practically useful?
• The goal is not to provide failure detection
Cyclon
• Consists of the same elements as suggested by [Allavena et al., PODC 2005]
• The analysis of [Allavena et al., PODC 2005] holds for Cyclon
• Major differences:
– Timestamps
– Shuffling
Basic Shuffling
• Select a random subset of l neighbors (1 ≤ l ≤ c) from P's own cache, and a random peer, Q, within this subset, where l is a system parameter called the shuffle length.
• Replace Q's address with P's address.
• Send the updated subset to Q.
• Receive from Q a subset of no more than l of Q's neighbors.
• Discard entries pointing to P, and entries that are already in P's cache.
• Update P's cache to include all remaining entries, by
– firstly using empty cache slots (if any), and
– secondly replacing entries among the ones originally sent to Q.
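The steps above can be sketched from P's side as follows (hypothetical names; Q's reply is passed in as a plain list instead of being exchanged over the network):

```python
import random

def basic_shuffle(p_id, p_cache, cache_size, q_reply, l):
    """One basic Cyclon shuffle from initiator P's side (a sketch).

    p_cache  -- P's current neighbor list, q_reply -- entries received from Q,
    l        -- the shuffle length, cache_size -- maximum cache size c.
    Returns (P's updated cache, the subset sent to Q).
    """
    subset = random.sample(p_cache, min(l, len(p_cache)))
    q_id = subset[0]                                   # random shuffle partner Q
    sent = [p_id if e == q_id else e for e in subset]  # swap Q's entry for P's
    # Discard entries pointing to P or already in P's cache.
    fresh = [e for e in q_reply if e != p_id and e not in p_cache]
    new_cache, victims = list(p_cache), list(subset)
    for e in fresh:
        if len(new_cache) < cache_size:        # firstly: use empty slots
            new_cache.append(e)
        elif victims:                          # secondly: replace entries sent to Q
            new_cache[new_cache.index(victims.pop())] = e
    return new_cache, sent
```

The cache size therefore never shrinks: received entries either fill free slots or overwrite entries that Q now holds anyway.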
Shuffling Example
Enhanced Shuffling
• Increase by one the age of all neighbors.
• Select neighbor Q with the highest age among all neighbors, and l − 1 other random neighbors.
• Replace Q's entry with a new entry of age 0 and with P's address.
• Send the updated subset to peer Q.
• Receive from Q a subset of no more than l of its own entries.
• Discard entries pointing at P and entries already contained in P's cache.
• Update P's cache to include all remaining entries, by firstly using empty cache slots (if any), and secondly replacing entries among the ones sent to Q.
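The age bookkeeping that distinguishes enhanced from basic shuffling can be sketched as follows (hypothetical names; cache entries carry an explicit age field):

```python
def pick_shuffle_partner(cache):
    """Enhanced shuffling partner selection (sketch): age every entry by one,
    then pick the neighbor with the highest age as the shuffle partner Q.

    cache -- list of dicts like {"id": node_id, "age": int}
    """
    for entry in cache:
        entry["age"] += 1          # step 1: increase the age of all neighbors
    return max(cache, key=lambda e: e["age"])  # step 2: oldest entry is Q
```

Targeting the oldest entry is what lets Cyclon expire links to dead nodes: an entry whose owner never shuffles back only grows older until it is chosen and replaced.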
Time Until Dead Links Removed
Number of Clusters
Tolerance to Partitioning
In-Degree Distribution
SCAMP
• Partial knowledge of the membership: the local view
• Fanout automatically set = size of the local view
• Fanout evolves naturally with the size of the group
– Size of local views converges towards C·log(n)
Join (Subscription)
[Figure: new node s subscribes via a random member; each node receiving the forwarded subscription keeps it with probability P = 1/(size of view) and forwards it on with probability (1 − P)]
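The forwarding rule in the figure can be sketched as follows (hypothetical names; following the slides, a node keeps a subscription with probability P = 1/(size of view) and otherwise forwards it to a random member of its view):

```python
import random

def forward_subscription(nodes, start, new_id, max_hops=10_000):
    """SCAMP-style subscription forwarding (sketch following the slides).

    nodes  -- dict mapping node id -> local view (list of node ids)
    start  -- node that first receives the subscription of new_id
    Returns the id of the node that kept the subscription.
    """
    current = start
    for _ in range(max_hops):
        view = nodes[current]
        if new_id not in view and random.random() < 1.0 / max(len(view), 1):
            view.append(new_id)        # keep: add the new node to the local view
            return current
        current = random.choice(view)  # forward to a random view member
    raise RuntimeError("subscription not placed")
```

Because the keep probability shrinks as views grow, larger systems forward subscriptions further, which is how the view size tracks the group size.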
Join (Subscription) Algorithm
[Figure: step-by-step example of the subscription algorithm, with node 6's subscription forwarded and added to the local views of existing nodes]
Load Balancing
• Indirection:
– Forward the subscription instead of handling the request
• Lease associated with each subscription
• Periodically nodes have to re-subscribe
– Nodes that have failed permanently will time out
– Re-balances the partial views
Unsubscription
[Figure: node 0 unsubscribes by announcing Unsub(0) together with its view [1,4,5]; nodes x, y and z replace the entries for node 0 in their local views with members of 0's view]
Degree
• System modelled as a random directed graph
• D(N) = average out-degree for an N-node system
• A subscription adds D(N)+1 directed arcs, so
• (N+1)·D(N+1) = N·D(N) + D(N) + 1
• The solution of this recursion is
• D(N) = D(1) + 1/2 + 1/3 + … + 1/N ≈ log(N)
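The recursion and its harmonic-series solution can be checked directly (hypothetical function name):

```python
import math

def avg_out_degree(n, d1=1.0):
    """Solve (N+1)*D(N+1) = N*D(N) + D(N) + 1, i.e. D(N+1) = D(N) + 1/(N+1),
    giving D(N) = D(1) + 1/2 + 1/3 + ... + 1/N (the harmonic series)."""
    return d1 + sum(1.0 / i for i in range(2, n + 1))

# The harmonic sum grows like log(N), matching the slide's claim:
d = avg_out_degree(5000)
approx = math.log(5000)
```

Dividing both sides of the recursion by (N+1) shows each new node adds exactly 1/(N+1) to the average, which is where the harmonic series comes from.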
Distribution of view size
[Plot: distribution of view size; curves labeled Log=12.2 and Log=13.12]
Reliability: 5000 node system
[Plot: reliability (y-axis, 0.9–1.0) vs. number of failures (x-axis, 0–2500); curves: SCAMP, global membership knowledge with fanout 8, and global membership knowledge with fanout 9]
NewsCast
• Goal: aggregate information
– in a large and dynamic distributed environment
– in a robust and dependable manner
Idea
• Gets news from the application, timestamps it, and adds the local peer address to the cache entry
• Finds a random peer among the cache addresses
– Sends all cache entries to this peer
– Receives all cache entries from that peer
• Passes on cache entries (containing news items) to the application
• Merges the old cache with the received cache
– Keeps at most the C most recent cache entries
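The merge step can be sketched as follows (hypothetical names; cache entries are (peer, timestamp) pairs and C bounds the cache size):

```python
def merge_caches(own, received, c):
    """NewsCast cache merge (sketch): combine both caches, keep the freshest
    copy per peer, then keep only the C most recent entries overall.

    own, received -- lists of (peer_id, timestamp); higher timestamp = newer.
    """
    best = {}
    for peer, ts in own + received:
        if peer not in best or ts > best[peer]:
            best[peer] = ts                      # deduplicate, prefer newer
    merged = sorted(best.items(), key=lambda e: e[1], reverse=True)
    return merged[:c]                            # keep the C most recent
```

Favoring recent timestamps is what flushes dead peers out of the cache: their entries stop being refreshed and eventually fall off the end.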
Aggregation
• Each node ni maintains a single number xi
• Every node ni selects a random node nk and sends its value xi to nk
• nk responds with the aggregate (e.g. max(xi, xk)) of the incoming value and its own
• Aggregate values converge "exponentially"
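The aggregation rounds can be sketched as follows (hypothetical names; each node contacts one random peer per round and both adopt the pairwise maximum):

```python
import random

def gossip_max(values, rounds):
    """Epidemic max-aggregation (sketch): per round, every node contacts a
    random peer; both ends keep the maximum of the two values."""
    values = list(values)
    n = len(values)
    for _ in range(rounds):
        for i in range(n):
            k = random.randrange(n)          # random gossip partner
            m = max(values[i], values[k])    # the aggregate of both values
            values[i] = values[k] = m        # both ends adopt it
    return values

vals = gossip_max([3, 1, 4, 1, 5, 9, 2, 6], rounds=10)
```

The set of nodes holding the true maximum roughly doubles each round, which is the "exponential" convergence the slide refers to; other aggregates (min, averaging variants) follow the same pattern.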
Path length under failures
Connectivity Under Failures
Aggregation