© 2009 Andreas Haeberlen
BFTW3 workshop (Sep 22, 2009)
The Fault Detection Problem
Andreas Haeberlen, MPI-SWS
Petr Kuznetsov, TU Berlin / Deutsche Telekom Labs
Fault tolerance vs. fault detection
- Distributed systems need to be robust against faults
- Approaches: masking and detection
  - Complementary: in many systems, we want both!
- Masking is well understood; detection is not
[Figure: a network of machines; the detection output reads "Machine XYZ is faulty"]
What we know about fault detection
- Rich literature on detecting crash faults
  - e.g., the failure detector abstraction [Chandra96]
- But very little on general (Byzantine) faults
  - Byzantine failure detector for consensus [Kihlstrom03]
  - Several specific algorithms: SUNDR, PeerReview, ...
- What we do not know about general faults (the focus of this talk):
  - Which faults are detectable in a given system?
  - What is the complexity of detection?
  - How does detection depend on synchrony?
  - How much do cryptographic primitives help?
  - ...
Outline
- A "language" for reasoning about faults
  - System model
  - Extensions and fault instances
  - Precise problem statement
- Two initial results
  - Characterization of the set of detectable faults
  - Tight lower bounds on the 'cost' of detection
System model
- Set $N$ of nodes
- One terminal per node
- Reliable unicast network
- Messages can be signed
- System is asynchronous
- An execution is a sequence of events $e_k = (i_k, I_k, O_k)$
- Node $i$ has an algorithm $A_i = (M_i, TI_i, TO_i, S_i, s_{0,i}, a_i)$
  - The algorithm is deterministic
  - We say that node $i$ is correct if it follows its algorithm $A_i$
- Distributed algorithm $A := (A_1, A_2, ..., A_n)$
[Figure: nodes A, B, and C, each with a terminal, connected by a network]
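The notion of a correct node can be made concrete with a small sketch. It assumes a simplified model that the slides do not spell out: messages are integers, a node's state is an integer, and a node counts as correct if replaying its events through its deterministic transition function reproduces every recorded output. The names `Event`, `Algorithm`, `is_correct`, and `summing_step` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple

Msgs = FrozenSet[int]  # assumption: messages are plain integers


@dataclass(frozen=True)
class Event:
    node: str      # i_k: the node at which the event occurs
    inputs: Msgs   # I_k: the inputs consumed by the event
    outputs: Msgs  # O_k: the outputs produced by the event


@dataclass(frozen=True)
class Algorithm:
    initial_state: int
    # Deterministic transition function: (state, inputs) -> (outputs, new state)
    step: Callable[[int, Msgs], Tuple[Msgs, int]]


def is_correct(node: str, algorithm: Algorithm, execution: List[Event]) -> bool:
    """Replay the node's events and compare recorded outputs with expected ones."""
    state = algorithm.initial_state
    for event in execution:
        if event.node != node:
            continue
        expected, state = algorithm.step(state, event.inputs)
        if event.outputs != expected:
            return False
    return True


def summing_step(state: int, inputs: Msgs) -> Tuple[Msgs, int]:
    """Toy algorithm: keep a running sum of all inputs and output it."""
    total = state + sum(inputs)
    return frozenset({total}), total


SUMMER = Algorithm(initial_state=0, step=summing_step)

good = [Event("C", frozenset({7}), frozenset({7})),
        Event("C", frozenset({4}), frozenset({11}))]
bad = [Event("C", frozenset({7}), frozenset({7})),
       Event("C", frozenset({4}), frozenset({15}))]

print(is_correct("C", SUMMER, good))
print(is_correct("C", SUMMER, bad))
```

Determinism is what makes the replay meaningful: with a nondeterministic algorithm, a mismatch between recorded and recomputed outputs would not by itself show that the node deviated.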
Intuition: What is fault detection?
- Given: algorithm $A$, set of faults $F$ to detect
- Goal: when a fault $f \in F$ occurs, correct nodes output a list of faulty nodes to their terminals
- "Fault detection problem" for $F$: find a function $t_F$ that maps any algorithm $A$ to another algorithm $t_F(A)$ that solves the same problem but can additionally detect all the faults in $F$
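[Figure: A sends $x$ with $0 < x < 10$ and B sends $y$ with $0 < y < 10$ to C, which sends $x+y$ to D. In the example, A sends 7 and B sends 4, but C sends 15 instead of 11; the correct nodes output FAULTY(C).]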
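The check behind the example can be sketched in a few lines. This is only an illustration under a strong assumption that the slides do not state: the checker holds authentic copies of the values sent by A and B (for instance because the messages are signed). The function `audit` and the values 7, 4, and 15 are illustrative.

```python
from typing import List


def audit(x: int, y: int, reported_sum: int) -> List[str]:
    """Return the nodes that visibly deviated from the three-step algorithm.

    Assumes x and y are authentic copies of what A and B actually sent.
    """
    faulty = []
    if not 0 < x < 10:
        faulty.append("A")  # A sent a value outside its allowed range
    if not 0 < y < 10:
        faulty.append("B")  # B sent a value outside its allowed range
    # Only blame C when its inputs were valid; the algorithm does not say
    # what C should forward if A or B misbehaves.
    if not faulty and reported_sum != x + y:
        faulty.append("C")
    return faulty


print(audit(7, 4, 15))
print(audit(7, 4, 11))
```

Without authentic copies of $x$ and $y$, D alone could only notice sums outside the range of possible values, which is one reason detectability depends on the system's assumptions.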
Intuition: Faults and extensions
- How should a general 'fault' be defined?
  - It needs to be specific to an algorithm!
  - First approximation: a tuple $(A, e)$
- Problem: fault detection requires changing the algorithm, which yields a new algorithm $\bar{A} = t_F(A)$!
  - What does it mean to detect a fault $(A, e)$ in an execution of $\bar{A}$?
- Need to restrict the power of $t$!
  - Idea: produce an extension of $A$ that 'works exactly like $A$', except that it does some additional work to detect faults
Definition: Extensions
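$\bar{A}$ is an extension of $A$ if:
1. their terminal inputs and outputs are compatible: $\overline{TI} = TI$ and $\overline{TO} \supseteq TO$;
2. there are surjective mappings $m_1$ and $m_2$ such that, when $\bar{a}$ has a transition $(I, s_1) \to (O, s_2)$, then $a$ has a transition $(m_1(I), m_2(s_1)) \to (m_1(O), m_2(s_2))$.
What does this mean?
- We can map each execution $e$ of $\bar{A}$ to an execution $m_e(e)$ of $A$
- If a node is correct in $e$, then it is also correct in $m_e(e)$!
[Figure: $\bar{A} = (\bar{M}, \overline{TI}, \overline{TO}, \bar{S}, \bar{s}_0, \bar{a})$ is related to $A = (M, TI, TO, S, s_0, a)$ by the equality of terminal inputs and by the mappings $m_1$ and $m_2$]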
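The transition condition in the definition can be checked mechanically on small, finite examples. The sketch below makes several simplifying assumptions: transition functions are finite tables, each transition consumes and produces a single message rather than a set, and the terminal input/output condition is not checked. The names `is_extension`, `BASE`, and `EXTENDED` are hypothetical.

```python
from typing import Callable, Dict, Hashable, Tuple

# A finite transition table: (input, state) -> (output, next state)
Table = Dict[Tuple[Hashable, Hashable], Tuple[Hashable, Hashable]]


def is_extension(ext: Table, base: Table,
                 m1: Callable[[Hashable], Hashable],
                 m2: Callable[[Hashable], Hashable]) -> bool:
    """Check that m1 (messages) and m2 (states) map ext's transitions onto base's."""
    # Every transition of the extension must map to a transition of the base.
    for (i, s1), (o, s2) in ext.items():
        if base.get((m1(i), m2(s1))) != (m1(o), m2(s2)):
            return False
    # Surjectivity, restricted to the messages and states that occur in the tables.
    base_msgs = {i for i, _ in base} | {o for o, _ in base.values()}
    base_states = {s for _, s in base} | {s for _, s in base.values()}
    ext_msgs = {i for i, _ in ext} | {o for o, _ in ext.values()}
    ext_states = {s for _, s in ext} | {s for _, s in ext.values()}
    return ({m1(m) for m in ext_msgs} >= base_msgs
            and {m2(s) for s in ext_states} >= base_states)


# Base algorithm: a single state that answers "ping" with "pong".
BASE: Table = {("ping", "idle"): ("pong", "idle")}

# Extended algorithm: same behaviour, but messages carry an extra field and
# the state carries a one-bit counter (the 'additional work').
EXTENDED: Table = {
    (("ping", "sig"), ("idle", 0)): (("pong", "audit"), ("idle", 1)),
    (("ping", "sig"), ("idle", 1)): (("pong", "audit"), ("idle", 0)),
}


def strip_message(message):
    return message[0]  # m1: drop the extra message field


def strip_state(state):
    return state[0]    # m2: drop the counter


print(is_extension(EXTENDED, BASE, strip_message, strip_state))
```

Because both mappings simply discard the added fields, every step of the extended algorithm projects onto a legal step of the base algorithm, which is the sense in which the extension 'works exactly like' the original.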
Definition: Fault instances
A fault instance is a four-tuple $(A, C, S, e)$:
- $A$ is an algorithm
- $e$ is an execution
- $C$ is a set of correct nodes (needed because detectability depends on who is correct)
- $S$ is a set of suspects (needed to quantify how precisely we can say who is faulty); $S$ must contain at least one faulty node
Algorithm:
1. A sends 0≤x≤10 to C
2. B sends 0≤y≤10 to C
3. C sends x+y to D
[Figure: an example execution of this algorithm in a system with nodes A through Q, with the set $C$ of correct nodes and the set $S$ of suspects marked]
The Fault Detection Problem
Given a fault class $F$, find a $t$ that maps any algorithm $A$ to an extension $t(A)$ such that:
- Nontriviality: Correct nodes regularly output lists of faulty nodes
- Completeness: If a fault $(A, C, S, e) \in F$ occurs, at least one correct node $c \in C$ will permanently output at least one faulty suspect $s \in S$
- Accuracy: Correct nodes do not permanently output each other (occasional mistakes are ok)
- Agreement: Eventually all correct nodes will permanently output the same set of nodes
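The four properties speak about infinite executions, so they cannot be decided on a finite trace. As a rough illustration only, the sketch below approximates "permanently output" by "present in each of the last `tail` outputs" of a node and "regularly output" by "at least `tail` outputs recorded". The function names and the `tail` parameter are assumptions made for this sketch.

```python
from typing import Dict, List, Set

History = List[Set[str]]  # successive lists of faulty nodes output by one node


def permanently_output(history: History, tail: int = 3) -> Set[str]:
    """Finite stand-in for 'permanently output': in each of the last `tail` outputs."""
    if len(history) < tail:
        return set()
    return set.intersection(*(set(h) for h in history[-tail:]))


def nontriviality(outputs: Dict[str, History], correct: Set[str], tail: int = 3) -> bool:
    return all(len(outputs.get(c, [])) >= tail for c in correct)


def completeness(outputs: Dict[str, History], correct: Set[str],
                 suspects: Set[str], faulty: Set[str], tail: int = 3) -> bool:
    # Some correct node permanently outputs at least one faulty suspect.
    return any(permanently_output(outputs[c], tail) & suspects & faulty
               for c in correct)


def accuracy(outputs: Dict[str, History], correct: Set[str], tail: int = 3) -> bool:
    # No correct node permanently outputs another correct node.
    return all(not (permanently_output(outputs[c], tail) & correct)
               for c in correct)


def agreement(outputs: Dict[str, History], correct: Set[str], tail: int = 3) -> bool:
    # The last `tail` outputs of all correct nodes are one and the same set.
    finals = {frozenset(h) for c in correct for h in outputs[c][-tail:]}
    return len(finals) == 1


trace = {
    "A": [set(), {"C"}, {"C"}, {"C"}],
    "B": [{"A"}, {"C"}, {"C"}, {"C"}],  # one early mistake, later retracted
    "D": [set(), {"C"}, {"C"}, {"C"}],
}
correct, faulty, suspects = {"A", "B", "D"}, {"C"}, {"C"}

print(nontriviality(trace, correct),
      completeness(trace, correct, suspects, faulty),
      accuracy(trace, correct),
      agreement(trace, correct))
```

Node B's early output of A does not violate accuracy in this approximation, which mirrors the remark that occasional mistakes are acceptable as long as they are not permanent.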
Outline
- A "language" for reasoning about faults
  - System model
  - Extensions and fault instances
  - Precise problem statement
- Some initial results
  - How to prove impossibilities
  - Fault classification; commission and omission faults
  - Message complexity of detection
Preview: Fault classification
Characterize the set of fault instances for which the Fault Detection Problem can be solved
[Figure: the set of all (!) general fault instances, divided into four classes.
  Impossible to solve: non-observable fault instances, ambiguous fault instances.
  Solution exists: commission faults, omission faults.]
Preview: Message complexity
- How many additional messages are needed?
  - $t$ has message complexity $c$ iff, for each execution $e$ of $A$, the number of messages sent by correct nodes in any execution $\bar{e}$ of $t(A)$ with $m_e(\bar{e}) = e$ is at most $c$ times the number of messages in $e$
- Assumption: at most $f < |N| - 2$ nodes can be faulty

                     Fault detection problem    Fault detection problem with agreement
Commission faults    $f+2$                      $f+2$
Omission faults      $3f+4$                     $(|N|-1)^2$
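To get a feel for the numbers, the table can be turned into a small lookup function. This is only a convenience sketch: it reads the last table entry as $(|N|-1)^2$, treats each entry as the multiplicative factor $c$ from the definition of message complexity, and uses a hypothetical function name and string labels.

```python
def message_complexity_bound(fault_class: str, f: int, n: int,
                             agreement: bool = False) -> int:
    """Return the table entry for the given fault class.

    f is the maximum number of faulty nodes and n = |N| the number of nodes;
    the slide assumes f < |N| - 2.
    """
    if not 0 <= f < n - 2:
        raise ValueError("the bounds assume 0 <= f < |N| - 2")
    if fault_class == "commission":
        return f + 2  # same entry with and without agreement
    if fault_class == "omission":
        return (n - 1) ** 2 if agreement else 3 * f + 4
    raise ValueError(f"unknown fault class: {fault_class!r}")


for fault_class in ("commission", "omission"):
    for with_agreement in (False, True):
        print(fault_class, with_agreement,
              message_complexity_bound(fault_class, f=1, n=5,
                                       agreement=with_agreement))
```

For omission faults, the entry with agreement depends on the system size $|N|$ rather than on $f$, so it grows quadratically with the number of nodes even when few of them can be faulty.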
Future work
- This work provides a "language" for reasoning about fault detection with general faults
- Possible next steps:
  - Probabilistic guarantees
    - Lower cost? [SOSP'07]
  - More synchrony
    - More faults detectable? Stronger accuracy? Bound the time to detection?
  - Bounds on message sizes and/or state space
    - Impact on set of detectable faults?
  - Broadcast channel
  - Different cryptographic primitives
  - ...
Summary
- Framework for reasoning about fault detection with general (Byzantine) faults
  - Precise definition of a general fault instance
  - Formal statement of the fault detection problem
- Two initial results
  - Characterization of the set of detectable faults
  - Tight lower bounds on the message complexity of detection
Questions?