Self-stabilizing Overlay Networks

Sukumar Ghosh, University of Iowa

Work in progress. Jointly with Andrew Berns and Sriram Pemmaraju

(Talk at Michigan Technological University)

On Thursday, 16th August 2007 Skype had an outage

(Skype is known to be a “self-healing” overlay network)

(Skype’s explanation)

The disruption was triggered by a massive restart of users’ computers across the globe within a very short timeframe, as they re-booted after receiving a routine set of patches through Windows Update.

Overlay Network

A logical network laid on top of the Internet

[Figure: nodes A, B, and C placed over the Internet, joined by logical link AB and logical link BC]

The Formal Model

Let V be a set of nodes. The functions

id : V → Z+ assigns a unique id to each node in V

rs : V → {0, 1}* assigns a random bit string to each node in V

A family of overlay networks is a function ON : F → G, where F is the set of all triples λ = (V, id, rs) and G is the set of all directed graphs.

The family of overlay networks associates a unique directed graph ON(λ) ∈ G with each labeled set λ = (V, id, rs) of nodes.

Structured vs. Unstructured

Overlay networks

Unstructured: No restriction on network topology. Examples: Gnutella, Kazaa, BitTorrent, Skype, etc.

Structured: Network topology satisfies specific invariants. Examples: Chord, CAN, Pastry, Skip Graph, etc.

The Challenge

Can an overlay network restore its correct functionality from

an arbitrary initial configuration?

Bad configurations can be caused by failures, perturbations,

selfish actions, malicious attacks.

Autonomic Systems

Self-management is the holy grail of all complex

dynamic systems.

Self-stabilizing systems

(Convergence) Recover from any arbitrary

initial configuration to a legal configuration in a

bounded number of steps, and

(Closure) remain in the legal configuration

thereafter, until another failure or perturbation

occurs.

Self-stabilizing Overlay Networks

Can an overlay network restore its topology from

an arbitrary initial configuration?

Does it make sense in unstructured networks?

Does it make sense in structured networks?

Related work

Self-stabilizing and Byzantine-tolerant overlay network. OPODIS 2007 [Dolev, Hoch, van Renesse]

A distributed polylog time algorithm for self-stabilizing SKIP graph. PODC ’09 [Jacob, Richa, Scheideler et al.]

Linearization: Locally self-stabilizing sorting in graphs. ALENEX, SIAM ’07 [Onus, Richa, Scheideler]

Example: Linearization

[Figure: an arbitrary connected graph (top) and the target sorted list 2 5 7 10 13 15 18 21 30 34 (bottom)]

The ideal topology is a sorted list. The goal is to spontaneously recover to the ideal topology from an arbitrary connected topology.

(Onus, Richa, Scheideler, ALENEX 2007)

Self-stabilizing algorithm: Linearization

Left and right neighbors:

– ‘w’ is a left neighbor of node ‘u’ if {u, w} ∈ E and w < u.

– ‘w’ is a right neighbor of node ‘u’ if {u, w} ∈ E and u < w.

[Figure: node u = 10 with left neighbors w1 = 2, w2 = 3, w3 = 6, w4 = 8 and right neighbors v1 = 19, v2 = 28, v3 = 30, v4 = 35]

(The Algorithm) In each round do

Convert left neighbors into a sorted list

Convert right neighbors into a sorted list

Takes at most (n-2) rounds.

Slide borrowed from Onus et al.
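The round described above can be sketched in Python. This is a minimal illustration, assuming synchronous rounds in which every node proposes a sorted chain through its left neighbors and itself, and another through itself and its right neighbors, and the next edge set is the union of all proposals. The function names `linearize_round` and `linearize`, and the `max_rounds` guard, are hypothetical and not taken from Onus et al.

```python
def linearize_round(edges):
    """One synchronous round on an undirected graph given as a set of (a, b) pairs with a < b."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    proposed = set()
    for u, nbrs in neighbors.items():
        left = sorted(w for w in nbrs if w < u)    # left neighbors of u, ascending
        right = sorted(v for v in nbrs if v > u)   # right neighbors of u, ascending
        # Turn each side into a sorted chain that ends (or starts) at u.
        for chain in (left + [u], [u] + right):
            for a, b in zip(chain, chain[1:]):
                proposed.add((a, b))
    return proposed


def linearize(edges, max_rounds=1000):
    """Repeat rounds until the edge set stops changing; returns (edges, rounds)."""
    current = {tuple(sorted(e)) for e in edges}
    for rounds in range(max_rounds):
        nxt = linearize_round(current)
        if nxt == current:
            return current, rounds
        current = nxt
    raise RuntimeError("did not stabilize within max_rounds")


# Star around u = 10, using the neighbor ids from the slide.
star = [(10, x) for x in (2, 3, 6, 8, 19, 28, 30, 35)]
final_edges, rounds_used = linearize(star)
```

Starting from the star, `final_edges` ends up as the path through the ids in sorted order. The sketch only detects a fixed point; it does not itself demonstrate the round bound quoted on the slide.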

Evolution of Skip Graph (Aspnes, Shah, SODA 2003)

[Figure: the keys arranged as a single sorted linked list]

Search time is O(n) hops

SKIP Graph

Node degree = O(log n), diameter = O(log n)

Number of levels = O(log n)

Search time now is O(log n) hops

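[Figure: skip graph whose nodes carry 3-bit membership vectors. Level 0 is one list of all nodes; Level 1 splits it into lists for prefixes 0 and 1; Level 2 splits it into lists for prefixes 00, 01, 10, and 11]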
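As a rough illustration of how the levels arise, the sketch below builds the edges of a skip graph from a labeled node set in the spirit of ON(λ): at level i, nodes that share the first i bits of their random string are linked in id order. The function name `skip_graph_edges` and the input format (a dict from id to bit string) are assumptions made for this example.

```python
from collections import defaultdict


def skip_graph_edges(rs):
    """rs maps each node id to its membership bit string (hypothetical input format).

    Returns a set of (level, a, b) triples with a < b, one per pair of
    nodes that are adjacent in some level-`level` sorted list.
    """
    levels = max(len(bits) for bits in rs.values())
    edges = set()
    for level in range(levels + 1):
        lists = defaultdict(list)
        for node, bits in rs.items():
            lists[bits[:level]].append(node)   # group by the first `level` bits
        for members in lists.values():
            members.sort()                      # each list is sorted by id
            for a, b in zip(members, members[1:]):
                edges.add((level, a, b))
    return edges


example = skip_graph_edges({1: "00", 2: "10", 3: "01", 4: "11"})
```

In this small example, level 0 contributes the list 1, 2, 3, 4, and level 1 contributes the lists 1, 3 (prefix 0) and 2, 4 (prefix 1). Because the graph is a deterministic function of the ids and bit strings, the legal topology for a given labeled set is unique.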

SKIP Graph: the question

Can we have a self-stabilizing skip graph that

can spontaneously restore its topology starting

from any “connected” initial configuration?

Why local checking is important

Unless bad configurations are detected via local

checking, periodic global snapshots are needed,

which is disruptive for the system.

SKIP Graph is NOT locally checkable

Self-stabilization requires local detection of errors,

but certain failures are not locally checkable

SKIP+ graph

Jacob, Richa, Scheideler et al. (PODC 2009) proposed a

locally checkable version of SKIP Graph by adding a

few extra edges to an existing Skip Graph. They called

it a SKIP+ Graph.

They presented an algorithm to stabilize such a topology in O(log² n) rounds with high probability. The algorithm is quite cumbersome.

We try to devise a simpler and better solution.

Detectors

[Figure: a network in which several nodes are marked as detectors]

Our first step

Detector diameter

The detector diameter of G is the maximum hop distance in G between any node and its closest detector.
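The definition can be computed with a multi-source breadth-first search. The sketch below assumes an undirected graph stored as an adjacency dict and a non-empty detector set from which every node is reachable; the name `detector_diameter` is chosen here for illustration.

```python
from collections import deque


def detector_diameter(adj, detectors):
    """Maximum over all nodes of the hop distance to the nearest detector."""
    dist = {d: 0 for d in detectors}
    queue = deque(detectors)
    while queue:                      # BFS started from all detectors at once
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    if len(dist) < len(adj):
        raise ValueError("some node cannot reach any detector")
    return max(dist.values())


# Path 0 - 1 - 2 - 3 - 4
path = {i: {j for j in (i - 1, i + 1) if 0 <= j < 5} for i in range(5)}
```

On this path, a single detector at one end gives a detector diameter of 4, while detectors at both ends give 2.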

Transitive Closure Framework

Due to the local checkability property, in any faulty configuration there is at least one detector.


Theorem

For a SKIP+ graph, the detector diameter D = O(log n)



The neighbors of each detector become detectors in the next

round. In O(log n) rounds, every node becomes a detector, and

these detectors initiate the transitive closure process. After

an additional O(log n) rounds, all nodes become connected with

one another, and the topology becomes completely connected.
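The closure phase alone can be simulated as follows. This sketch assumes an undirected, connected graph and a synchronous round in which every node links to all neighbors of its neighbors; it leaves out the detector-spreading phase. The names `closure_round` and `rounds_to_clique` are hypothetical.

```python
def closure_round(adj):
    """Each node adds edges to every neighbor of its current neighbors."""
    return {
        u: set(nbrs).union(*(adj[v] for v in nbrs)) - {u}
        for u, nbrs in adj.items()
    }


def rounds_to_clique(adj):
    """Number of closure rounds until every node is adjacent to all others."""
    n = len(adj)
    rounds = 0
    while any(len(nbrs) < n - 1 for nbrs in adj.values()):
        nxt = closure_round(adj)
        if nxt == adj:
            raise ValueError("graph is not connected")
        adj = nxt
        rounds += 1
    return rounds


def path_graph(n):
    return {i: {j for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)}
```

Each round roughly halves the hop distance between any two nodes, so a path of 9 nodes (diameter 8) becomes a clique after 3 rounds. This is the reason the closure phase needs only a logarithmic number of rounds in the diameter.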


After all nodes become detectors and the topology becomes completely connected, the nodes rebuild the correct topology using a REPAIR subroutine. REPAIR takes only one round.

The Repair Process

Lemma

If the network is completely connected and all nodes are

detectors in round i, a legal overlay network will be built in

round (i + 1), and no node will be a detector.
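One way to picture why a single round suffices: in a clique every node already knows the full node set, so each node can compute the legal topology locally and keep only its own edges. The sketch below assumes the legal topology is a deterministic function of the node labels and uses a sorted list as a stand-in overlay; `repair` and `sorted_list_overlay` are illustrative names, not the REPAIR subroutine from the talk.

```python
def sorted_list_overlay(ids):
    """Stand-in for ON(λ): directed edges of the doubly linked sorted list."""
    order = sorted(ids)
    edges = set()
    for a, b in zip(order, order[1:]):
        edges.add((a, b))
        edges.add((b, a))
    return edges


def repair(all_nodes, overlay):
    """Every node evaluates the same overlay function and keeps its outgoing edges."""
    legal = overlay(all_nodes)
    return {u: {v for (a, v) in legal if a == u} for u in all_nodes}


repaired = repair({7, 2, 10}, sorted_list_overlay)
```

Because all nodes evaluate the same function on the same input, their local decisions agree without any further communication.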

Compare with Jacob et al.’s results

Local checkability

Let L define a correct configuration of an overlay network.

Then the network is locally checkable when

L = p_0 ∧ p_1 ∧ p_2 ∧ … ∧ p_{n-1}

where p_i is a local predicate involving process i and its

immediate neighbors only.

Most of the real life networks are NOT locally checkable

Example: a clique

Theorem. A completely connected topology is locally checkable.

[Figure: a clique on nodes a, b, and c]
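A possible local predicate for the clique is sketched below, assuming the graph is connected and undirected and that each node can read its neighbors' neighbor lists. Node u passes its check when every neighbor has the same closed neighborhood as u; the set of nodes that fail plays the role of the detectors. The name `clique_detectors` is hypothetical.

```python
def clique_detectors(adj):
    """Return the nodes whose local predicate fails."""
    detectors = set()
    for u, nbrs in adj.items():
        closed_u = nbrs | {u}
        # Local predicate: every neighbor sees exactly the same node set as u.
        if any(adj[v] | {v} != closed_u for v in nbrs):
            detectors.add(u)
    return detectors


triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
path_abc = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
```

If no node is a detector in a connected graph, closed neighborhoods agree along every edge and hence everywhere, which forces the graph to be a clique. The triangle therefore has no detectors, while in the path a - b - c every node is one.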

Chord is not locally checkable

[Figure: a Chord ring (left) and a loopy Chord ring (right)]
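The difficulty can be illustrated with a simplified model that keeps only successor and predecessor pointers and omits fingers. In the sketch below, a loopy ring is built by letting each node point to the node two positions ahead in id order; the pointers are mutually consistent, so a check that only compares a node's pointers with its neighbors' pointers passes everywhere, yet following successors winds around the key space twice. All function names and the node ids are assumptions made for this example.

```python
def ring_pointers(ids, step=1):
    """step=1 gives the correct ring; step=2 with an odd node count gives a loopy ring."""
    order = sorted(ids)
    n = len(order)
    succ = {order[i]: order[(i + step) % n] for i in range(n)}
    pred = {v: u for u, v in succ.items()}   # pointers are consistent by construction
    return succ, pred


def local_detectors(succ, pred):
    """Nodes whose successor or predecessor pointer is not reciprocated."""
    return {u for u in succ if pred[succ[u]] != u or succ[pred[u]] != u}


def windings(succ):
    """How many times a walk along successors wraps around the key space."""
    start = min(succ)
    wraps, u = 0, start
    while True:
        v = succ[u]
        if v <= u:          # the id decreased, so the walk passed the origin
            wraps += 1
        u = v
        if u == start:
            return wraps


ids = [1, 8, 14, 21, 32, 38, 42]
good_succ, good_pred = ring_pointers(ids)
loopy_succ, loopy_pred = ring_pointers(ids, step=2)
```

Both configurations produce an empty detector set under this pointer check, but only the first has a winding number of one. Telling them apart requires information beyond each node's immediate neighborhood.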

CAN is not locally checkable

Content Addressable Network (CAN) on a 2D torus

Replace the black edges by the red edges, and each column becomes a loopy Chord ring.

LCON: a locally checkable overlay network in a circular key space

[Figure: ring with key space N = 64 and nodes 0, 3, 5, 7, 18, 23, 25, 32, 37, 40, 50, 54, 59]

S-links for node u: one edge to each node in the range (u to u+s mod N)

D-links for node u: Succ(u+s mod N), Succ(u+2s mod N), …, Succ(u+(d-1)s mod N)

Nmax = s × d. Let s = 16, d = 4.
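The link rules can be sketched in Python using the node set from the figure. Two points are assumptions made for this example rather than statements from the talk: the S-range is taken to be the keys u+1 through u+s-1 (which matches the neighbor count of d+s-2 in a fully populated ring), and Succ(k) is taken to be the first node at or after key k going clockwise. The function name `lcon_links` is hypothetical.

```python
def lcon_links(u, nodes, s, d):
    """Return (s_links, d_links) for node u in a key space of size N = s * d."""
    N = s * d
    present = sorted(nodes)
    members = set(present)

    def succ(key):
        # First node at or after `key` clockwise (assumed convention).
        key %= N
        return next((k for k in present if k >= key), present[0])

    # S-links: every existing node among the keys u+1 .. u+s-1 (mod N).
    s_links = [(u + off) % N for off in range(1, s) if (u + off) % N in members]
    # D-links: successors of the keys u+s, u+2s, ..., u+(d-1)s (mod N).
    d_links = [succ(u + j * s) for j in range(1, d)]
    return s_links, d_links


ring = {0, 3, 5, 7, 18, 23, 25, 32, 37, 40, 50, 54, 59}
s_links_0, d_links_0 = lcon_links(0, ring, s=16, d=4)
```

For node 0 with s = 16 and d = 4, the S-links reach nodes 3, 5, and 7, and the D-links reach the successors of keys 16, 32, and 48, which are nodes 18, 32, and 50.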

Observations

Observation. Each node in LCON has (d+s-2) neighbors. When d = s, we have s = √N, so the size of the neighborhood is 2√N − 2 = O(√N).

Theorem. The detector diameter of LCON is at most two.

Some properties of LCON

Theorem. LCON is locally checkable.

Main idea.

Case 1. If the diameter is two, then every node can “see”

every other node, and check if the topology is correct.

Case 2. We show that if the diameter is greater than two, then there is at least one detector.

Self-stabilization of LCON

The Transitive Closure Framework (TCF) will stabilize LCON

in O(log N) time.

But it may be a sledgehammer. What is the space complexity

of stabilization using TCF?

Self-stabilization of LCON

We have an algorithm customized for LCON that stabilizes

LCON in polylog time, while the space complexity does

not skyrocket to O(n)

Generalization of LCON

Main idea

Consider a CAN-like topology on a d-dimensional torus. Convert the “ring” in each dimension into an LCON ring. The figure shows this only partially, on a 2-dimensional torus.

Each node has O(d · N^(1/(2d))) neighbors.

Conclusion

A new problem of growing interest. We need efficient algorithms for stabilizing a variety of overlay topologies.

The initial topology must be connected. Stabilization from a partitioned topology is impossible. Also, for a given (V, id, rs) the legal topology should be unique. Otherwise there will be an additional step for distributed consensus.

Working on extending this to more fragile networks.

Questions?
