The Fault Detection Problem

Andreas Haeberlen (MPI-SWS) and Petr Kuznetsov (TU Berlin / Deutsche Telekom Labs)

BFTW3 workshop (Sep 22, 2009), © 2009 Andreas Haeberlen




Fault tolerance vs. fault detection

Distributed systems need to be robust against faults

Approaches: Masking and detection
  Complementary: In many systems, we want both!

Masking is well understood, detection is not

[Figure: machines connected by a network, with the output "Machine XYZ is faulty"]


What we know about fault detection

Rich literature on detecting crash faults
  e.g., failure detector abstraction [Chandra96]

But very little on general (Byzantine) faults
  Byzantine failure detector for consensus [Kihlstrom03]
  Several specific algorithms: SUNDR, PeerReview, ...

What we do not know about general faults (the subject of this talk):
  Which faults are detectable in a given system?
  What is the complexity of detection?
  How does detection depend on synchrony?
  How much do cryptographic primitives help?
  ...


Outline

A "language" for reasoning about faults
  System model
  Extensions and fault instances
  Precise problem statement

Two initial results
  Characterization of the set of detectable faults
  Tight lower bounds on the 'cost' of detection


System model

Set $N$ of nodes
  One terminal per node

Reliable unicast network
Messages can be signed
System is asynchronous

Execution is a sequence of events $e_k = (i_k, I_k, O_k)$

Node $i$ has an algorithm $A_i = (M_i, TI_i, TO_i, \Sigma_i, \sigma_{0,i}, \alpha_i)$
  Algorithm is deterministic
  We say that node $i$ is correct if it follows its algorithm $A_i$

Distributed algorithm $A := (A_1, A_2, \ldots, A_n)$

[Figure: nodes A, B, and C, each with a terminal, connected by the network]
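As a minimal sketch of this model, the snippet below represents a node's deterministic algorithm as a transition function and checks whether a recorded sequence of events is consistent with it. The names `Event`, `is_correct`, and the toy `echo_step` algorithm are illustrative assumptions that do not come from the talk, and modelling inputs and outputs as tuples of strings is a simplification.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Iterable, Tuple

State = Hashable
Msgs = Tuple[str, ...]
# Hypothetical transition function: (inputs, state) -> (outputs, new state).
Step = Callable[[Msgs, State], Tuple[Msgs, State]]


@dataclass(frozen=True)
class Event:
    """One event (node id, inputs, outputs) of an execution."""
    node: str
    inputs: Msgs
    outputs: Msgs


def is_correct(node: str, step: Step, initial: State,
               execution: Iterable[Event]) -> bool:
    """Replay the node's events; the node counts as correct if every
    recorded output matches what its deterministic algorithm produces."""
    state = initial
    for ev in execution:
        if ev.node != node:
            continue  # events of other nodes do not affect this check
        expected, state = step(ev.inputs, state)
        if expected != ev.outputs:
            return False
    return True


def echo_step(inputs: Msgs, state: int) -> Tuple[Msgs, int]:
    """Toy algorithm (assumption): echo each input, count events in state."""
    return tuple(f"echo:{m}" for m in inputs), state + 1


good = [Event("A", ("hi",), ("echo:hi",)), Event("B", ("x",), ("junk",))]
bad = [Event("A", ("hi",), ("echo:bye",))]
print(is_correct("A", echo_step, 0, good))  # True
print(is_correct("A", echo_step, 0, bad))   # False
```

Determinism is what makes the replay meaningful: given the same inputs and state, a correct node has exactly one admissible output.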


Intuition: What is fault detection?

Given: Algorithm A, set of faults F to detect

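Goal: When a fault $f \in F$ occurs, correct nodes output a list of faulty nodes to their terminals

"Fault detection problem" for $F$: Find a function $\tau_F$ that maps any algorithm $A$ to another algorithm $\tau_F(A)$ that solves the same problem but can additionally detect all the faults in $F$

[Figure: nodes A, B, C, and D. A sends $x$ with $0 < x < 10$ and B sends $y$ with $0 < y < 10$ to C, which sends $x + y$ to D. The example values shown are 7, 4, 11, and 15, and the correct nodes output FAULTY(C).]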
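The example in the figure can be sketched as a check that a correct node might run, assuming it can see the (signed) values that A and B sent to C as well as the value C sent to D. The function name `detect_faulty` is hypothetical, and the concrete numbers are one reading of the figure: A sends 7, B sends 4, and C forwards 15 instead of 11.

```python
def detect_faulty(x: int, y: int, c_output: int) -> set:
    """Return the nodes whose messages violate the toy algorithm:
    A sends 0 < x < 10, B sends 0 < y < 10, and C must forward x + y."""
    faulty = set()
    if not 0 < x < 10:
        faulty.add("A")
    if not 0 < y < 10:
        faulty.add("B")
    # Assumption: C is only judged when both of its inputs were valid,
    # since the slide does not say how C should react to invalid input.
    if not faulty and c_output != x + y:
        faulty.add("C")
    return faulty


print(detect_faulty(7, 4, 11))  # set()
print(detect_faulty(7, 4, 15))  # {'C'}
```

How the evidence reaches the checking node is deliberately left open here; that is part of what an actual detection algorithm has to provide.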


Intuition: Faults and extensions

How should a general 'fault' be defined?
  Needs to be specific to an algorithm!
  First approximation: A tuple $(A, e)$

Problem: Need to change the algorithm to do fault detection, which gives a new algorithm $\bar{A} = \tau_F(A)$!
  What does it mean to detect a fault $(A, e)$ in an execution of $\bar{A}$?

Need to restrict the power of $\tau$!
  Idea: Produce an extension of $A$ that 'works exactly like $A$', except that it does some additional work to detect faults


Definition: Extensions

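$\bar{A}$ is an extension of $A$ if:
1. their terminal inputs and outputs are compatible: $\overline{TI} = TI$ and $TO \subseteq \overline{TO}$
2. there are surjective mappings $\mu_1$ and $\mu_2$ such that, when $\bar{\alpha}$ has a transition $(I, \sigma_1) \to (O, \sigma_2)$, then $\alpha$ has a transition $(\mu_1(I), \mu_2(\sigma_1)) \to (\mu_1(O), \mu_2(\sigma_2))$

What does this mean?
  We can map each execution $e$ of $\bar{A}$ to an execution $\mu_e(e)$ of $A$
  If a node is correct in $e$, then it is also correct in $\mu_e(e)$!

[Figure: $\bar{A} = (\bar{M}, \overline{TI}, \overline{TO}, \bar{\Sigma}, \bar{\sigma}_0, \bar{\alpha})$ is related to $A = (M, TI, TO, \Sigma, \sigma_0, \alpha)$ through the mappings $\mu_1$ and $\mu_2$]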
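For finite transition tables, the second condition of the definition can be checked mechanically. The sketch below is only an illustration under assumptions: the dictionary encoding, the function name `respects_mapping`, and the toy ping/pong algorithm are hypothetical, and `m1` and `m2` stand for the two mappings from the extension's inputs/outputs and states to those of the original algorithm. Surjectivity and the terminal compatibility condition are not checked.

```python
from typing import Callable, Dict, Hashable, Tuple

Transitions = Dict[Tuple[Hashable, Hashable], Tuple[Hashable, Hashable]]


def respects_mapping(ext: Transitions, base: Transitions,
                     m1: Callable, m2: Callable) -> bool:
    """Each extension transition (I, s1) -> (O, s2) must map onto a
    base transition (m1(I), m2(s1)) -> (m1(O), m2(s2))."""
    for (i, s1), (o, s2) in ext.items():
        if base.get((m1(i), m2(s1))) != (m1(o), m2(s2)):
            return False
    return True


# Base algorithm (assumption): answer every "ping" with "pong".
base = {("ping", "idle"): ("pong", "idle")}

# Extension: same behavior, plus a one-bit log in the state and a tag
# on the output; the mappings strip both away again.
ext = {
    ("ping", ("idle", 0)): ("pong+log", ("idle", 1)),
    ("ping", ("idle", 1)): ("pong+log", ("idle", 0)),
}
m1 = lambda msg: msg.split("+")[0]  # drop the extra tag
m2 = lambda state: state[0]         # drop the extra log bit

print(respects_mapping(ext, base, m1, m2))  # True
```

The extension is free to keep extra state and send extra data, as long as projecting both away leaves a valid step of the original algorithm.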


Definition: Fault instances

A fault instance is a four-tuple $(A, C, S, e)$:
  $A$ is an algorithm
  $e$ is an execution
  $C$ is a set of correct nodes (needed because detectability depends on who is correct)
  $S$ is a set of suspects; $S$ must contain at least one faulty node (needed to quantify how precisely we can say who is faulty)

Example algorithm:
1. A sends $0 \le x \le 10$ to C
2. B sends $0 \le y \le 10$ to C
3. C sends $x + y$ to D

[Figure: a network of nodes A through Q with the sets $C$ and $S$ marked, illustrating the example algorithm with sample message values]
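A fault instance can be written down as a small record. The sketch below uses hypothetical field names and abbreviates the execution to a tuple of strings. It also assumes that a faulty suspect cannot be a member of $C$, so the constructor only enforces a necessary condition for "S contains at least one faulty node".

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class FaultInstance:
    """A fault instance (A, C, S, e), with hypothetical field names."""
    algorithm: str              # name of the algorithm A
    correct: FrozenSet[str]     # C: nodes that are correct
    suspects: FrozenSet[str]    # S: candidates for being faulty
    execution: Tuple[str, ...]  # e: the observed events, abbreviated

    def __post_init__(self):
        # Necessary condition only: a faulty suspect cannot be in C.
        if not (self.suspects - self.correct):
            raise ValueError("S must contain at least one faulty node")


fi = FaultInstance(
    algorithm="sum",
    correct=frozenset({"A", "B", "D"}),
    suspects=frozenset({"C"}),
    execution=("A->C:x", "B->C:y", "C->D:z"),
)
print(sorted(fi.suspects))  # ['C']
```

A larger suspect set describes a less precise accusation: the instance only promises that someone in it is faulty.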


The Fault Detection Problem

Given a fault class $F$, find a $\tau$ that maps any algorithm $A$ to an extension $\tau(A)$ such that:

Nontriviality: Correct nodes regularly output lists of faulty nodes

Completeness: If a fault $(A, C, S, e) \in F$ occurs, at least one correct node $c \in C$ will permanently output at least one faulty suspect $s \in S$

Accuracy: Correct nodes do not permanently output each other (occasional mistakes are ok)

Agreement: Eventually all correct nodes will permanently output the same set of nodes
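The properties refer to what nodes output "permanently", which is a statement about infinite executions. As a rough finite stand-in, the sketch below treats a node as permanently output if it appears in each of the last `tail` outputs of a recorded trace. The trace encoding and the names `stable` and `check` are assumptions made for illustration, agreement is approximated by comparing these stable sets, and nontriviality is not checked.

```python
from typing import Dict, List, Set

Trace = Dict[str, List[Set[str]]]  # node -> successive lists of suspects


def stable(outputs: List[Set[str]], tail: int) -> Set[str]:
    """Nodes present in each of the last `tail` outputs (a finite
    stand-in for 'permanently output')."""
    recent = outputs[-tail:]
    return set.intersection(*recent) if recent else set()


def check(trace: Trace, correct: Set[str], suspects: Set[str],
          tail: int = 2) -> Dict[str, bool]:
    views = {c: stable(trace.get(c, []), tail) for c in correct}
    faulty_suspects = suspects - correct
    return {
        # some correct node keeps naming a faulty suspect
        "completeness": any(v & faulty_suspects for v in views.values()),
        # no correct node keeps naming another correct node
        "accuracy": not any(v & correct for v in views.values()),
        # all correct nodes settle on the same stable set
        "agreement": len({frozenset(v) for v in views.values()}) <= 1,
    }


trace = {
    "A": [set(), {"C"}, {"C"}],
    "B": [{"A"}, {"C"}, {"C"}],  # early mistake, later corrected
    "D": [set(), {"C"}, {"C"}],
}
result = check(trace, correct={"A", "B", "D"}, suspects={"C"})
print(result)
```

In this trace B briefly accuses A, which accuracy tolerates because the accusation does not persist.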


Outline

A "language" for reasoning about faults
  System model
  Extensions and fault instances
  Precise problem statement

Some initial results
  How to prove impossibilities
  Fault classification; commission and omission faults
  Message complexity of detection


Preview: Fault classification

Characterize the set of fault instances for which the Fault Detection Problem can be solved

[Figure: the set of all (!) general fault instances, divided into four classes]
  Impossible to solve: non-observable fault instances, ambiguous fault instances
  Solution exists: commission faults, omission faults


Preview: Message complexity

How many additional messages are needed?
  $\tau$ has message complexity $c$ iff, for each execution $e$ of $A$, the number of messages sent by correct nodes in any execution $\bar{e}$ of $\tau(A)$ with $\mu_e(\bar{e}) = e$ is at most $c$ times the number of messages in $e$

Assumption: At most $f < |N| - 2$ nodes can be faulty

| Fault class | Fault detection problem | Fault detection problem with agreement |
| --- | --- | --- |
| Commission faults | $f + 2$ | $f + 2$ |
| Omission faults | $3f + 4$ | $(\lvert N \rvert - 1)^2$ |
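To get a feel for the numbers, the sketch below evaluates the table entries for a given $f$ and $|N|$. The function name `detection_bounds` is hypothetical, the last entry is read as $(|N| - 1)$ squared, which is an assumption about the flattened slide, and requiring $f \ge 0$ is an added sanity check.

```python
def detection_bounds(f: int, n: int) -> dict:
    """Table entries for at most f faulty nodes among n = |N| nodes."""
    if not 0 <= f < n - 2:
        raise ValueError("the table assumes f < |N| - 2")
    return {
        ("commission", "detection"): f + 2,
        ("commission", "agreement"): f + 2,
        ("omission", "detection"): 3 * f + 4,
        # Assumption: the slide's "(|N|-1)2" is a lost superscript.
        ("omission", "agreement"): (n - 1) ** 2,
    }


bounds = detection_bounds(f=1, n=10)
for key, value in bounds.items():
    print(key, value)
```

Under the definition above, a message complexity of $c$ means that an execution with 100 messages in the original algorithm corresponds to at most $100c$ messages sent by correct nodes in the extension.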


Future work

This work provides a "language" for reasoning about fault detection with general faults

Possible next steps:

Probabilistic guarantees
  Lower cost? [SOSP'07]

More synchrony
  More faults detectable? Stronger accuracy? Bound the time to detection?

Bounds on message sizes and/or state space
  Impact on set of detectable faults?

Broadcast channel

Different cryptographic primitives

...


Summary

Framework for reasoning about fault detection with general (Byzantine) faults
  Precise definition of a general fault instance
  Formal statement of the fault detection problem

Two initial results
  Characterization of the set of detectable faults
  Tight lower bounds on the message complexity of detection

Questions?