The Fault Detection Problem

Andreas Haeberlen (MPI-SWS)
Petr Kuznetsov (TU Berlin / Deutsche Telekom Labs)

BFTW3 workshop (Sep 22, 2009)
© 2009 Andreas Haeberlen


Fault tolerance vs. fault detection

Distributed systems need to be robust against faults

Approaches: Masking and detection
  Complementary: In many systems, we want both!

Masking is well understood, detection is not

[Figure: nodes connected by a network; one output reads "Machine XYZ is faulty"]



What we know about fault detection

Rich literature on detecting crash faults
  e.g. failure detector abstraction [Chandra96]

But very little on general (Byzantine) faults
  Byzantine failure detector for consensus [Kihlstrom03]
  Several specific algorithms: SUNDR, PeerReview, ...

What we do not know about general faults:
  Which faults are detectable in a given system?
  What is the complexity of detection?
  How does detection depend on synchrony?
  How much do cryptographic primitives help?
  ...

This talk



Outline

A "language" for reasoning about faults
  System model
  Extensions and fault instances
  Precise problem statement

Two initial results
  Characterization of the set of detectable faults
  Tight lower bounds on the 'cost' of detection



System model

Set N of nodes
  One terminal per node

Reliable unicast network
Messages can be signed
System is asynchronous

Execution is a sequence of events e_k = (i_k, I_k, O_k)

Node i has an algorithm A_i = (M_i, TI_i, TO_i, S_i, s_{0,i}, a_i)
  Algorithm is deterministic
  We say that node i is correct if it follows its algorithm A_i

Distributed algorithm A := (A_1, A_2, ..., A_n)

[Figure: nodes A, B, and C connected by the network, each with a terminal]
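
The model above can be made concrete with a small sketch. It assumes a hypothetical `Event` record for an event (i_k, I_k, O_k) and a deterministic transition function standing in for a_i; a node then counts as correct exactly when replaying its inputs reproduces the outputs it actually produced. The names `Event`, `is_correct`, and the toy `sum_step` algorithm are illustrative and not taken from the talk, and tuples stand in for the input and output sets.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Sequence


@dataclass(frozen=True)
class Event:
    node: str        # i_k: the node at which the event occurs
    inputs: tuple    # I_k: messages and terminal inputs consumed
    outputs: tuple   # O_k: messages and terminal outputs produced


# A transition function maps (state, inputs) to (new_state, outputs).
Transition = Callable[[Hashable, tuple], tuple]


def is_correct(node: str, execution: Sequence[Event],
               initial_state: Hashable, step: Transition) -> bool:
    """Replay the node's events; it is correct if every output matches."""
    state = initial_state
    for event in execution:
        if event.node != node:
            continue  # events of other nodes do not affect this node's state
        state, expected = step(state, event.inputs)
        if expected != event.outputs:
            return False
    return True


def sum_step(state, inputs):
    """Toy algorithm (an assumption): add inputs to a running total, output it."""
    total = state + sum(inputs)
    return total, (total,)


good = [Event("C", (7, 4), (11,))]
bad = [Event("C", (7, 4), (15,))]

print(is_correct("C", good, 0, sum_step))  # True
print(is_correct("C", bad, 0, sum_step))   # False
```

Because the algorithm is deterministic, replaying the inputs is enough to decide correctness; no guess about the node's internal state is needed.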



Intuition: What is fault detection?

Given: Algorithm A, set of faults F to detect

Goal: When a fault f ∈ F occurs, correct nodes output a list of faulty nodes to their terminals

"Fault detection problem" for F: Find a function t_F that maps any algorithm A to another algorithm t_F(A) that solves the same problem but can additionally detect all the faults in F

[Figure: nodes A, B, C, D. A sends x (0<x<10) and B sends y (0<y<10) to C; C sends x+y to D. Values shown: 7, 4, 11, 15. Three nodes output FAULTY(C).]
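
The example on this slide can be sketched as a check that any node holding the relevant messages could run. This is an illustration only: the function name `audit_sum` is hypothetical, it assumes the auditor sees the values x and y that A and B sent as well as the sum C reported, and it ignores signatures entirely.

```python
def audit_sum(x: int, y: int, reported: int) -> set:
    """Return the nodes whose behavior deviates from the example algorithm.

    Assumed algorithm: A sends x with 0 < x < 10, B sends y with 0 < y < 10,
    and C forwards x + y to D.
    """
    faulty = set()
    if not 0 < x < 10:
        faulty.add("A")   # A sent a value outside its allowed range
    if not 0 < y < 10:
        faulty.add("B")   # B sent a value outside its allowed range
    # Only blame C when its inputs were valid; the slide does not say what
    # C should do with out-of-range inputs, so that case is left open here.
    if not faulty and reported != x + y:
        faulty.add("C")
    return faulty


for node in sorted(audit_sum(7, 4, 15)):
    print(f"FAULTY({node})")   # prints FAULTY(C)
```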



Intuition: Faults and extensions

How should a general 'fault' be defined?
  Needs to be specific to an algorithm!
  First approximation: A tuple (A, e)

Problem: Need to change the algorithm to do fault detection → new algorithm A' = t_F(A)!
  What does it mean to detect a fault (A, e) in an execution of A'?

Need to restrict the power of t!
  Idea: Produce an extension of A that 'works exactly like A', except that it does some additional work to detect faults



Definition: Extensions

A' is an extension of A if:
  1. their terminal in/outputs are compatible: TI' = TI, TO' ⊇ TO
  2. there are surjective mappings m1 and m2 such that, when a' has a transition (I, s1) → (O, s2), then a has a transition (m1(I), m2(s1)) → (m1(O), m2(s2))

What does this mean?
  We can map each execution e of A' to an execution m_e(e) of A
  If a node is correct in e, then it is also correct in m_e(e)!

[Figure: the tuple A' = (M', TI', TO', S', s0', a') mapped component-wise onto A = (M, TI, TO, S, s0, a), with arrows labeled =, m1 and m2]
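
For finite transition tables, condition 2 of the definition can be checked mechanically. The sketch below is a simplification under stated assumptions: inputs and outputs are single opaque symbols rather than sets, transition functions are dictionaries from (input, state) to (output, state), and the helper names `is_surjective` and `is_extension` as well as the ping/pong algorithm are made up for illustration.

```python
def is_surjective(mapping: dict, codomain: set) -> bool:
    """True if every element of codomain is hit by the mapping."""
    return set(mapping.values()) >= codomain


def is_extension(a_ext: dict, a_base: dict, m1: dict, m2: dict) -> bool:
    """Check that each transition of the extension maps onto a transition
    of the base algorithm under m1 (in/outputs) and m2 (states)."""
    for (inp, s1), (out, s2) in a_ext.items():
        if a_base.get((m1[inp], m2[s1])) != (m1[out], m2[s2]):
            return False
    return True


# Base algorithm: answer "ping" with "pong", staying in state "idle".
a_base = {("ping", "idle"): ("pong", "idle")}

# Hypothetical extension: same behavior, but the state also carries a
# one-bit counter, standing in for extra fault-detection bookkeeping.
a_ext = {
    ("ping", ("idle", 0)): ("pong", ("idle", 1)),
    ("ping", ("idle", 1)): ("pong", ("idle", 0)),
}
m1 = {"ping": "ping", "pong": "pong"}
m2 = {("idle", 0): "idle", ("idle", 1): "idle"}

print(is_surjective(m1, {"ping", "pong"}) and is_surjective(m2, {"idle"}))  # True
print(is_extension(a_ext, a_base, m1, m2))  # True

# A transition with no counterpart in the base algorithm breaks the property.
a_bad = dict(a_ext)
a_bad[("ping", ("idle", 0))] = ("ping", ("idle", 1))
print(is_extension(a_bad, a_base, m1, m2))  # False
```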



Definition: Fault instances

A fault instance is a four-tuple (A, C, S, e):
  A is an algorithm
  e is an execution
  C is a set of correct nodes
    Needed because detectability depends on who is correct
  S is a set of suspects; S must contain at least one faulty node
    Needed to quantify how precisely we can say who is faulty

Algorithm:
  1. A sends 0≤x≤10 to C
  2. B sends 0≤y≤10 to C
  3. C sends x+y to D

[Figure: network of nodes A through Q with the sets C and S marked; values shown: 9, 15, 4, 23; two nodes are marked "?"]
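
As a data structure, a fault instance is just a record with two constraints attached. The sketch below is a minimal illustration: the class `FaultInstance` and the helper `is_well_formed` are hypothetical names, the algorithm and execution are treated as opaque placeholders, and the set of nodes that actually misbehave is assumed to be supplied by the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultInstance:
    algorithm: str        # A: placeholder name for the algorithm
    correct: frozenset    # C: nodes that are correct
    suspects: frozenset   # S: nodes under suspicion
    execution: tuple      # e: sequence of events (opaque here)


def is_well_formed(fi: FaultInstance, faulty: frozenset) -> bool:
    """Check the two constraints from the definition: no node in C is
    faulty, and S contains at least one faulty node."""
    no_faulty_in_c = fi.correct.isdisjoint(faulty)
    some_faulty_in_s = bool(fi.suspects & faulty)
    return no_faulty_in_c and some_faulty_in_s


fi = FaultInstance("sum", frozenset({"A", "B", "D"}), frozenset({"C"}), ())
print(is_well_formed(fi, frozenset({"C"})))  # True
print(is_well_formed(fi, frozenset({"A"})))  # False
```

Note that S is allowed to contain correct nodes as well; a smaller S simply pins down the faulty node more precisely.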



The Fault Detection Problem

Given a fault class F, find a t that maps any algorithm A to an extension t(A) such that:

Nontriviality: Correct nodes regularly output lists of faulty nodes

Completeness: If a fault (A, C, S, e) ∈ F occurs, at least one correct node c ∈ C will permanently output at least one faulty suspect s ∈ S

Accuracy: Correct nodes do not permanently output each other (occasional mistakes are ok)

Agreement: Eventually all correct nodes will permanently output the same set of nodes
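
The properties above speak about infinite executions ("permanently", "eventually"), so they cannot be checked exactly on a finite log. The sketch below is a rough approximation under a loud assumption: an output counts as permanent if it appears in each of the last `tail` outputs of a node's trace. All function names and the sample traces are hypothetical.

```python
def permanent(trace, tail):
    """Nodes present in every one of the last `tail` outputs of a trace."""
    last = trace[-tail:]
    return set.intersection(*map(set, last)) if last else set()


def completeness(outputs, correct, suspects, faulty, tail=3):
    """Some correct node permanently outputs a faulty suspect."""
    return any(permanent(outputs[c], tail) & suspects & faulty for c in correct)


def accuracy(outputs, correct, tail=3):
    """No correct node permanently outputs another correct node."""
    return all(not (permanent(outputs[c], tail) & correct) for c in correct)


def agreement(outputs, correct, tail=3):
    """All recent outputs of all correct nodes are the same set."""
    recent = {frozenset(o) for c in correct for o in outputs[c][-tail:]}
    return len(recent) <= 1


correct, suspects, faulty = {"A", "B", "D"}, {"C"}, {"C"}
outputs = {
    "A": [set(), {"D"}, {"C"}, {"C"}, {"C"}],   # one early mistake about D
    "B": [set(), set(), {"C"}, {"C"}, {"C"}],
    "D": [{"A"}, set(), {"C"}, {"C"}, {"C"}],   # one early mistake about A
}

print(completeness(outputs, correct, suspects, faulty))  # True
print(accuracy(outputs, correct))                        # True
print(agreement(outputs, correct))                       # True
```

The early, transient accusations against D and A do not violate accuracy, which matches the remark that occasional mistakes are acceptable.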



Outline

A "language" for reasoning about faults
  System model
  Extensions and fault instances
  Precise problem statement

Some initial results
  How to prove impossibilities
  Fault classification; commission and omission faults
  Message complexity of detection



Preview: Fault classification

Characterize the set of fault instances for which the Fault Detection Problem can be solved

[Figure: classification of all (!) general fault instances]
  Impossible to solve: non-observable fault instances, ambiguous fault instances
  Solution exists: commission faults, omission faults



Preview: Message complexity

How many additional messages are needed?
  t has message complexity c iff, for each execution e of A, the number of messages sent by correct nodes in any execution e' of t(A) with m_e(e') = e is at most c times the number of messages in e

Assumption: At most f < |N|-2 nodes can be faulty

                     Fault detection problem   Fault detection problem with agreement
Commission faults    f+2                       f+2
Omission faults      3f+4                      (|N|-1)^2
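
To get a feel for the numbers in the table, the bounds can be evaluated for a concrete system size. The helper below is a sketch, not part of the talk: the name `detection_bound` is hypothetical, and it assumes the omission-with-agreement entry is the square (|N|-1)^2, i.e. that the trailing 2 is an exponent.

```python
def detection_bound(fault_type: str, agreement: bool, f: int, n: int) -> int:
    """Message-complexity bound from the table, for at most f faulty
    nodes out of n = |N| nodes."""
    if not 0 <= f < n - 2:
        raise ValueError("the table assumes f < |N| - 2")
    if fault_type == "commission":
        return f + 2                      # same with and without agreement
    if fault_type == "omission":
        return (n - 1) ** 2 if agreement else 3 * f + 4
    raise ValueError(f"unknown fault type: {fault_type}")


# Example: |N| = 4 nodes, at most f = 1 faulty.
print(detection_bound("commission", False, 1, 4))  # 3
print(detection_bound("omission", False, 1, 4))    # 7
print(detection_bound("omission", True, 1, 4))     # 9
```

Under this reading, only omission faults become more expensive when agreement is required, and the cost then depends on the system size rather than on f.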



Future work

This work provides a "language" for reasoning about fault detection with general faults

Possible next steps:
  Probabilistic guarantees
    Lower cost? [SOSP'07]
  More synchrony
    More faults detectable? Stronger accuracy? Bound the time to detection?
  Bounds on message sizes and/or state space
    Impact on set of detectable faults?
  Broadcast channel
  Different cryptographic primitives
  ...



Summary

Framework for reasoning about fault detection with general (Byzantine) faults
  Precise definition of a general fault instance
  Formal statement of the fault detection problem

Two initial results
  Characterization of the set of detectable faults
  Tight lower bounds on the message complexity of detection

Questions?
