Failure Detectors, CS 717, Ashish Motivala, Dec 6th 2001



Some Relevant Papers

• Unreliable Failure Detectors for Reliable Distributed Systems. Tushar Deepak Chandra and Sam Toueg. Journal of the ACM.

• A gossip-style failure detection service. R. van Renesse, Y. Minsky, and M. Hayden. Middleware '98.

• Scalable Weakly-consistent Infection-style Process Group Membership protocol. Ashish Motivala, Abhinandan Das, Indranil Gupta. To be submitted to DSN 2002 tomorrow.

• On the Quality of Service of Failure Detectors. Wei Chen, Cornell University (with Sam Toueg, Advisor, and Marcos Aguilera, Contributing Author). DSN 2000.

• Fail-aware failure detectors. C. Fetzer and F. Cristian. In Proceedings of the 15th Symposium on Reliable Distributed Systems.

Asynchronous vs Synchronous Model

Asynchronous model:

• No value to assumptions about process speed

• Network can arbitrarily delay a message

• But we assume that messages are sequenced and retransmitted (arbitrary numbers of times), so they eventually get through.

Synchronous model:

• Assume that every process will run within bounded delay

• Assume that every link has bounded delay

• Usually described as “synchronous rounds”

Failures in Asynchronous and Synchronous Systems

In the asynchronous model:

• Usually, limited to process “crash” faults

• If detectable, we call this “fail-stop” – but how to detect?

In the synchronous model:

• Can talk about message “omission” failures: failure to send is the usual approach

• But network assumed reliable (loss “charged” to sender)

• Process crash failures, as in asynchronous setting

• “Byzantine” failures: arbitrary misbehavior by processes

Realistic???

• Asynchronous model is too weak since it has no clocks (real systems have clocks, and “most” timing meets expectations… but with heavy tails)

• Synchronous model is too strong (real systems lack a way to implement synchronized rounds)

• Partially Synchronous Model: asynchronous network with a reliable channel

• Timed Asynchronous Model: time bounds on clock drift rates and message delays [Fetzer]

Impossibility Results

• Consensus: All processes need to agree on a value

• FLP Impossibility of Consensus
  – A single faulty process can prevent consensus
  – Realistic because a slow process is indistinguishable from a crashed one.

• Chandra/Toueg showed that FLP Impossibility applies to many problems, not just consensus
  – In particular, they show that FLP applies to group membership, reliable multicast
  – So these practical problems are impossible in asynchronous systems

• They also look at the weakest condition under which consensus can be solved

Byzantine Consensus

• Example: 3 processes, 1 is faulty (A, B, C)

• Non-faulty processes A and B start with input 0 and 1, respectively

• They exchange messages: each now has a set of inputs {0, 1, x}, where x comes from C

• C sends 0 to A and 1 to B

• A has {0, 1, 0} and wants to pick 0. B has {0, 1, 1} and wants to pick 1.

• By definition, impossibility in this model means “xxx can’t always be done”
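
The three-process scenario above can be replayed in a few lines. This is a minimal sketch, assuming each non-faulty process decides by simple majority over the three values it holds; the `majority` helper and the dictionary layout are illustrative and do not come from the slides.

```python
from collections import Counter

def majority(values):
    """Return the most common value in the list (illustrative decision rule)."""
    return Counter(values).most_common(1)[0][0]

# Inputs of the two non-faulty processes.
inputs = {"A": 0, "B": 1}

# The faulty process C equivocates: it tells A one thing and B another.
c_says = {"A": 0, "B": 1}

# Each non-faulty process holds A's input, B's input, and what C told it.
views = {p: [inputs["A"], inputs["B"], c_says[p]] for p in ("A", "B")}
decisions = {p: majority(v) for p, v in views.items()}

print(views)
print(decisions)
```

Both A and B follow the same rule on the values they received, yet they end up with different decisions, which is exactly the disagreement the example describes.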

Chandra/Toueg Idea

• Theoretical Idea

• Separate problem into
  – The consensus algorithm itself
  – A “failure detector”: a form of oracle that announces suspected failure
  – But the process can change its decision

• Question: what is the weakest oracle for which consensus is always solvable?

Sample properties

• Completeness: detection of every crash
  – Strong completeness: Eventually, every process that crashes is permanently suspected by every correct process
  – Weak completeness: Eventually, every process that crashes is permanently suspected by some correct process

Sample properties

• Accuracy: does it make mistakes?
  – Strong accuracy: No process is suspected before it crashes.
  – Weak accuracy: Some correct process is never suspected
  – Eventual {strong/weak} accuracy: there is a time after which {strong/weak} accuracy is satisfied.

Failure detector classes (rows: completeness, columns: accuracy)

Completeness   Strong       Weak       Eventually Strong        Eventually Weak
Strong         Perfect P    Strong S   Eventually Perfect ◇P    Eventually Strong ◇S
Weak           Q            Weak W     ◇Q                       Eventually Weak ◇W

Perfect Detector?

• Named Perfect, written P

• Strong completeness and strong accuracy

• Immediately detects all failures

• Never makes mistakes

Example of a failure detector

• The detector they call ◇W: “eventually weak”

• More commonly read as “diamond-W”

• Defined by two properties:
  – There is a time after which every process that crashes is suspected by some correct process {weak completeness}
  – There is a time after which some correct process is never suspected by any correct process {eventual weak accuracy}

• E.g. we can eventually agree upon a leader. If it crashes, we eventually, accurately detect the crash

◇W: Weakest failure detector

• They show that ◇W is the weakest failure detector for which consensus is guaranteed to be achieved

• Algorithm is pretty simple
  – Rotate a token around a ring of processes
  – Decision can occur once token makes it around once without a change in failure-suspicion status for any process
  – Subsequently, as token is passed, each recipient learns the decision outcome

Building systems with ◇W

• Unfortunately, this failure detector is not implementable

• This is the weakest failure detector that solves consensus

• Using timeouts we can make mistakes at arbitrary times
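
The point about timeouts can be illustrated with a small heartbeat-based detector. This is a sketch under stated assumptions, not the construction from the paper: the class name `TimeoutDetector`, its methods, and the rule of doubling the timeout after each false suspicion are hypothetical choices made for illustration. It shows a process being wrongly suspected when its heartbeat is merely late, and the suspicion being withdrawn once the heartbeat arrives.

```python
class TimeoutDetector:
    """Suspects a peer when no heartbeat has arrived within its timeout."""

    def __init__(self, peers, timeout=1.0):
        self.timeout = {p: timeout for p in peers}   # per-peer timeout
        self.last_seen = {p: 0.0 for p in peers}     # time of last heartbeat
        self.suspected = set()

    def on_heartbeat(self, peer, now):
        self.last_seen[peer] = now
        if peer in self.suspected:
            # The earlier suspicion was a mistake: the peer was slow, not crashed.
            self.suspected.discard(peer)
            self.timeout[peer] *= 2                  # be more patient next time

    def check(self, now):
        """Suspect every peer whose last heartbeat is older than its timeout."""
        for peer, seen in self.last_seen.items():
            if now - seen > self.timeout[peer]:
                self.suspected.add(peer)
        return set(self.suspected)


fd = TimeoutDetector(["p", "q"], timeout=1.0)
fd.on_heartbeat("p", now=0.5)
fd.on_heartbeat("q", now=0.5)
fd.on_heartbeat("p", now=1.5)
after_delay = fd.check(now=2.0)      # q's heartbeat is late, so q is suspected
fd.on_heartbeat("q", now=2.25)       # q was only slow: suspicion is revoked
fd.on_heartbeat("p", now=2.25)
after_arrival = fd.check(now=2.5)
print(after_delay, after_arrival, fd.timeout["q"])
```

Because message delays are unbounded in the asynchronous model, no finite timeout rules out this kind of mistake; growing the timeout only makes mistakes rarer.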

Group Membership Service

[Figure: a process group whose members, including pi and pj, communicate over an asynchronous lossy network; pj's membership list is shown, and an X marks one process.]

Data Dissemination using Epidemic Protocols

• Want efficiency, robustness, speed and scale

• Tree distribution is efficient, but fragile and hard to configure

• Gossip is efficient and robust but has high latency. Almost linear in network load and scales O(n log n) in detection time with number of processes.

State Monotonic Property

• A gossip message contains the state of the sender of the gossip.

• The receiver uses a merge function to merge the received state with its own state.

• Need some kind of monotonicity in state and in gossip
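
One way to obtain the monotonicity mentioned above is to make each state entry a counter that only grows and to merge by taking the element-wise maximum. The sketch below assumes the state is a dictionary from member name to counter; the function name `merge` and that representation are illustrative, not taken from the slides.

```python
def merge(local, received):
    """Element-wise maximum of two counter maps; never decreases any entry."""
    merged = dict(local)
    for member, counter in received.items():
        merged[member] = max(merged.get(member, 0), counter)
    return merged


a = {"p1": 3, "p2": 7}
b = {"p2": 5, "p3": 1}
print(merge(a, b))
```

With this choice the merge is commutative and idempotent, so duplicated or reordered gossip messages cannot move the state backwards.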

Simple Epidemic

• Assume a fixed population of size n

• For simplicity, assume homogeneous spreading
  – Simple epidemic: anyone can infect anyone with equal probability

• Assume that k members are already infected

• And that the infection occurs in rounds

Probability of Infection

• Probability $P_{\mathrm{infect}}(k,n)$ that a particular uninfected member is infected in a round if k are already infected?

• $P_{\mathrm{infect}}(k,n) = 1 - P(\text{nobody infects member}) = 1 - (1 - 1/n)^{k}$

$E(\#\text{newly infected members}) = (n-k) \times P_{\mathrm{infect}}(k,n)$

Basically it's a binomial distribution
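
The two formulas above are easy to evaluate directly. The following is a small sketch; the function names `p_infect` and `expected_newly_infected` are illustrative, and the model is the one stated on the slide (each of the k infected members contacts one of the n members uniformly at random in a round).

```python
def p_infect(k, n):
    """Probability that a given uninfected member is infected in one round
    when k members are already infected: 1 - (1 - 1/n)^k."""
    return 1.0 - (1.0 - 1.0 / n) ** k


def expected_newly_infected(k, n):
    """Expected number of members infected in this round: (n - k) * p_infect."""
    return (n - k) * p_infect(k, n)


n = 1000
for k in (1, 10, 500):
    print(k, round(p_infect(k, n), 4), round(expected_newly_infected(k, n), 1))
```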

2 Phases

• Intuition: 2 Phases

• First Half: 1 -> n/2 (Phase 1)

• Second Half: n/2 -> n (Phase 2)

• For large n, $P_{\mathrm{infect}}(n/2,n) \approx 1 - (1/e)^{0.5} \approx 0.4$
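
The value of about 0.4 follows from the standard limit $(1 - 1/n)^{n} \to e^{-1}$. A short derivation, using only the infection probability formula from the previous slide:

```latex
\begin{align*}
P_{\mathrm{infect}}(n/2,\, n)
  &= 1 - \left(1 - \frac{1}{n}\right)^{n/2}
   = 1 - \left[\left(1 - \frac{1}{n}\right)^{n}\right]^{1/2} \\
  &\xrightarrow[n \to \infty]{} 1 - \left(e^{-1}\right)^{1/2}
   = 1 - e^{-1/2} \approx 1 - 0.607 = 0.393 \approx 0.4 .
\end{align*}
```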

Infection and Uninfection

• Infection
  – Initial growth factor is very high, about 2
  – At the halfway mark it's about 1.4
  – Exponential growth

• Uninfection
  – Slow death of uninfection to start
  – At the halfway mark it's about 0.4
  – Exponential decline

• Number of rounds necessary to infect the entire population is O(log n)

• Robbert uses a base of 1.585 for experiments
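
The growth factors and the O(log n) bound can be checked numerically by iterating the expected-value recurrence k ← k + (n − k) · P_infect(k, n). This is a sketch that tracks only the expected number of infected members, which approximates the real random process; the function names and the stopping rule ("fewer than one uninfected member expected") are assumptions made for illustration.

```python
import math


def p_infect(k, n):
    """Per-round infection probability for one uninfected member."""
    return 1.0 - (1.0 - 1.0 / n) ** k


def growth_factor(k, n):
    """Expected infected count after one round divided by the count before it."""
    return (k + (n - k) * p_infect(k, n)) / k


def rounds_to_infect_all(n):
    """Rounds until the expected number of uninfected members drops below 1."""
    k, rounds = 1.0, 0
    while n - k >= 1.0:
        k += (n - k) * p_infect(k, n)
        rounds += 1
    return rounds


n = 1_000_000
print(growth_factor(1, n), growth_factor(n / 2, n))
for size in (10**3, 10**6):
    print(size, rounds_to_infect_all(size), round(math.log2(size), 1))
```

Under this model the growth factor starts close to 2 and is close to 1.4 at the halfway mark, and multiplying the population by a thousand adds rounds rather than multiplying them.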

How the Protocol Works

• Each member maintains a list of (address, heartbeat) pairs.

• Periodically each member gossips:
  – Increments its heartbeat
  – Sends (part of) list to a randomly chosen member

• On receipt of gossip, merge the lists

• Each member maintains the last heartbeat of each list member
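
The steps above can be sketched as a small simulation. It assumes synchronous gossip rounds, gossip of the full list (the slide allows sending only part of it), and a seeded random generator for reproducibility; the `Member` class, its method names, and the `run` helper are hypothetical. "Last heartbeat" is interpreted here as the round in which a member's heartbeat counter was last seen to increase, which a real service would compare against a timeout before suspecting a failure.

```python
import random


class Member:
    def __init__(self, address):
        self.address = address
        self.heartbeats = {address: 0}    # address -> highest heartbeat seen
        self.last_update = {address: 0}   # address -> round of last increase

    def tick(self, now):
        """Increment own heartbeat before gossiping."""
        self.heartbeats[self.address] += 1
        self.last_update[self.address] = now

    def receive(self, gossip, now):
        """Merge a received list, keeping the larger heartbeat per address."""
        for address, beat in gossip.items():
            if beat > self.heartbeats.get(address, -1):
                self.heartbeats[address] = beat
                self.last_update[address] = now


def run(num_members, num_rounds, seed=0):
    rng = random.Random(seed)
    members = [Member(f"m{i}") for i in range(num_members)]
    for now in range(1, num_rounds + 1):
        for sender in members:
            sender.tick(now)
            # Gossip the list to one randomly chosen other member.
            target = rng.choice([m for m in members if m is not sender])
            target.receive(dict(sender.heartbeats), now)
    return members


members = run(num_members=8, num_rounds=20)
print(members[0].heartbeats)
```

An entry whose `last_update` value stops advancing for many rounds is the signal such a detector would use to suspect that member.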

SWIM Group Membership Service

[Figure: a process group whose members, including pi and pj, communicate over an asynchronous lossy network; pj's membership list is shown, and an X marks one process.]

System Design

• Join, Leave, Failure: broadcast to all processes

• Need to detect a process failure at some process quickly (to be able to broadcast it)

• Failure Detector Protocol Specifications
  – Detection time and accuracy: specified by the application designer to SWIM
  – Load: optimized by SWIM

SWIM Failure Detector Protocol

[Figure: one protocol period (T time units) at pi, involving pj and K random processes.]

• Expected detection time: e/(e-1) protocol periods

• Load: O(K) per process
  – Inaccuracy probability exponential in K

• Process failures detected
  – in O(log N) protocol periods w.h.p.
  – in O(N) protocol periods deterministically
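
The e/(e-1) figure can be reproduced with a short calculation. The sketch assumes that in every protocol period each of the other N-1 members independently picks one probe target uniformly at random among its N-1 peers, so the number of periods until a crashed member is first probed is geometrically distributed; the function names are illustrative.

```python
import math


def p_probed(n):
    """Probability that a given member is picked as a probe target by at least
    one of the other n-1 members in a single protocol period."""
    return 1.0 - (1.0 - 1.0 / (n - 1)) ** (n - 1)


def expected_detection_periods(n):
    """Mean of a geometric distribution with success probability p_probed(n)."""
    return 1.0 / p_probed(n)


for n in (10, 100, 10_000):
    print(n, round(expected_detection_periods(n), 4))
print("limit e/(e-1):", round(math.e / (math.e - 1), 4))
```

As N grows, the per-period probability tends to 1 - 1/e, so the expected detection time approaches e/(e-1) protocol periods regardless of group size.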


Why not Heartbeating?

• Centralized: single failure point

• All-to-all: O(N) load per process

• Logical ring: unpredictability on multiple failures









LAN Scalability

[Figure: plot against the number of processes. Setup: Win2000, 100Base-T Ethernet LAN; protocol period = 3*RTT, RTT = 10 ms, K = 1.]


• Broadcast ‘suspicion’ before ‘declaring’ process failure

• Piggyback broadcasts through ping messages
  – Epidemic-style broadcast

• WAN
  – Load on core routers
  – No representatives per subnet/domain