Byzantine Fault Isolation in the Farsite Distributed File System John R. Douceur and Jon Howell


Definitions

Byzantine fault \'biz-ən-tēn folt\ n (1982) : a failure of a system component that produces arbitrary behavior

Byzantine fault isolation \'biz-ən-tēn folt ī-sə-'lā-shən\ n (2006) : methodology for designing a distributed system that can, under Byzantine failure, operate with application-defined partial correctness

BFI \bē-ef-'ī\ n (2006) : Byzantine fault isolation

Farsite \'fär-sīt\ n (2000) : serverless distributed file system developed at Microsoft Research, designed to be scalable, strongly consistent, and secure despite running on an untrusted infrastructure of desktop PCs

Talk Outline

• Context – Farsite system

• Why BFT doesn’t scale

• Farsite’s use of multiple BFT groups

• The need for isolating Byzantine faults

• Formal system specification

• BFI in Farsite

Farsite System

[Diagram: clients and servers.]

Farsite System – Metadata

[Diagram: users at clients; metadata held by a BFT group.]

• Using Byzantine agreement protocol, assign sequence numbers to messages

• Prepare-commit among 2T + 1 servers
  – T = tolerable faults
  – R = count of replicas
  – R > 3T

• Deterministically update metadata

• Reply to client
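The replica arithmetic on this slide can be sketched in a few lines. This is a minimal illustration, assuming the smallest group that satisfies R > 3T (that is, R = 3T + 1); the function names are hypothetical and are not taken from Farsite.

```python
def min_replicas(t: int) -> int:
    """Smallest replica count R that satisfies R > 3T."""
    return 3 * t + 1


def tolerable_faults(r: int) -> int:
    """Largest T for which R > 3T still holds."""
    return (r - 1) // 3


def prepare_commit_size(t: int) -> int:
    """Number of servers in the prepare-commit step: 2T + 1."""
    return 2 * t + 1


# Group sizes 4, 7, and 10 are the ones compared later in the talk.
for r in (4, 7, 10):
    t = tolerable_faults(r)
    print(f"R={r}: T={t}, prepare-commit among {prepare_commit_size(t)} servers")
```

A 4-member group therefore masks one faulty machine, and each additional tolerated fault costs three more replicas.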

The Cost of BFT Groups

computation    messages    message delays
     1             2             2
     4            32             5

Throughput vs. Scale

[Chart: throughput multiple (0 to 7) vs. machine count (1 to 7); legend: ideal, typical, flat, BFT.]

Workload Sharing

[Diagram: workload divided between client and server.]

BFT at Scale

Multiple BFT Groups

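Tree of BFT Groups

[Diagram: file-system namespace tree rooted at /, with portions managed by different BFT groups. Names shown: users, public, cruft, emacs, vi, Outlook, Alice, Bob, docs, code, C++, C#, foo, bar, Proj X, src, bin.]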

Delegation to New Group

[Diagram: the same namespace tree, with part of the tree delegated to a new BFT group.]

Pathname Resolution

[Diagram: the same namespace tree, with the path below traced through it.]

/users/Alice/code/C#/bar
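Resolving a path through a tree of BFT groups can be sketched as a walk that hands off to a new group wherever a subtree has been delegated. The delegation table and group names below are hypothetical; the slides do not say which group manages which directory, and this is not Farsite's actual resolution protocol.

```python
# Hypothetical delegation table: path prefix -> BFT group managing that subtree.
DELEGATIONS = {
    "/": "group-root",
    "/users/Alice": "group-A",
    "/users/Alice/code/C#": "group-B",
}


def resolve(path: str) -> list[tuple[str, str]]:
    """Return the (prefix, group) hand-offs made while resolving `path`."""
    hops = [("/", DELEGATIONS["/"])]
    prefix = ""
    for part in path.strip("/").split("/"):
        prefix += "/" + part
        group = DELEGATIONS.get(prefix)
        # Hand off only when this prefix is delegated to a different group.
        if group is not None and group != hops[-1][1]:
            hops.append((prefix, group))
    return hops


def managing_group(path: str) -> str:
    """Group responsible for the final component of `path`."""
    return resolve(path)[-1][1]


for prefix, group in resolve("/users/Alice/code/C#/bar"):
    print(f"{prefix} -> {group}")
```

Under this table the example path crosses three groups, so a fault in any of them could affect the operation, which is the dependency the fault analysis later in the talk quantifies.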

Machine Failures at Scale

Group Failures at Scale

System Failure at Scale

Quantitative Fault Analysis

• Example system
  – File system distributed among interacting BFT groups

• Simplifying assumptions
  – Files are partitioned evenly among BFT groups
  – Machine failures are independent

• Machine fault probability = 0.001

• Evaluate: operational fault rate
  – Probability that an operation on a randomly selected file exhibits a fault
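The analysis above can be reproduced with a short binomial calculation. The sketch below assumes a group fails when more than T of its R machines are faulty, that without isolation any one failed group taints every operation, and that with ideal isolation an operation is affected only by the group holding its file. The function names are illustrative.

```python
from math import comb


def group_fault_probability(r: int, p: float) -> float:
    """P(more than T of R machines are faulty), with T = (R - 1) // 3."""
    t = (r - 1) // 3
    masked = sum(comb(r, k) * p**k * (1 - p) ** (r - k) for k in range(t + 1))
    return 1.0 - masked


def rate_without_isolation(r: int, p: float, groups: int) -> float:
    """Operation faults if any group in the system has failed."""
    return 1.0 - (1.0 - group_fault_probability(r, p)) ** groups


def rate_with_ideal_isolation(r: int, p: float) -> float:
    """Operation faults only if the file's own group has failed."""
    return group_fault_probability(r, p)


p = 0.001
print(f"4-member group fault probability: {group_fault_probability(4, p):.2e}")
print(f"no BFI, 100,000 groups:           {rate_without_isolation(4, p, 100_000):.2f}")
print(f"ideal BFI, any scale:             {rate_with_ideal_isolation(4, p):.2e}")
```

For 4-member groups this gives a group fault probability of about 6 × 10^-6 and, at 100,000 groups without isolation, an operational fault rate of about 0.45.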

Operational Faults vs. System Scale

[Chart: operational fault rate (log scale, 10^-7 to 10^0) vs. system scale (count of BFT groups, 1 to 100,000). Series: BFT 4, no BFI; BFT 7, no BFI; BFT 10, no BFI; BFT 4, ideal BFI; BFT 4, tree (4) BFI; BFT 4, tree (16) BFI. Labeled values: 6 × 10^-6, 0.45, 6 × 10^-6, 3 × 10^-5.]

BFI versus no BFI

                                    computation    messages
4-member BFT groups with BFI             4            32
10-member BFT groups without BFI        10           200
throughput reduction:                   60%          84%
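The figures in this comparison are consistent with a simple cost model in which computation per operation grows as R and messages as 2R². That model is an inference from the slide's numbers, not a formula stated in the talk; the throughput reduction is then one minus the ratio of the smaller cost to the larger.

```python
def computation_cost(r: int) -> int:
    """Assumed model: every replica executes every operation."""
    return r


def message_cost(r: int) -> int:
    """Assumed model: message count grows as 2 * R^2."""
    return 2 * r * r


def throughput_reduction(cost_before: float, cost_after: float) -> float:
    """Fraction of throughput lost when per-operation cost rises."""
    return 1 - cost_before / cost_after


small, large = 4, 10
print(f"computation: {computation_cost(small)} -> {computation_cost(large)}, "
      f"reduction {throughput_reduction(computation_cost(small), computation_cost(large)):.0%}")
print(f"messages:    {message_cost(small)} -> {message_cost(large)}, "
      f"reduction {throughput_reduction(message_cost(small), message_cost(large)):.0%}")
```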

BFI via Formal Specification

[Diagram: a semantic spec (state, actions + faults) and a distributed-system spec (state, actions + faults), related by refinement; the distributed-system spec is flagged "NEW Improved!".]

Farsite Semantic Spec

[Diagram: a single namespace tree (/, code, tools, C++, emacs, src, bin, a.h, a.cpp, a.exe, a.obj, cl.exe) together with open handles and pending operations (open, read, move).]

Farsite Distributed-System Spec

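Farsite Refinement

[Diagram: the distributed-system state interpreted as the semantic namespace tree (/, code, tools, C++, emacs, src, bin, a.h, a.cpp, a.exe, a.obj, cl.exe), with open handles and pending operations (del, read, move).]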

Actions are State Transitions

[Diagram: a state consisting of the namespace tree, open handles, and pending operations, changing as an action is applied to a.cpp.]

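Proving Refinement Inductively

[Diagram: the same state (namespace tree, open handles, pending operations, a.cpp) followed step by step through a sequence of actions.]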
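An inductive refinement argument can be illustrated on a toy model far smaller than the Farsite specs. In the sketch below, which is an assumption-laden illustration and not the TLA+ proof, the semantic state is a set of file names, the distributed state is one set per hypothetical group, and the refinement mapping is the union of the groups. The check confirms the base case and then that every distributed step maps to either a semantic step or no visible change.

```python
from collections import deque

NAMES = ("a.h", "a.cpp")
GROUPS = 2  # number of hypothetical BFT groups


def semantic_next(files):
    """Semantic states reachable by one action: create or delete a file."""
    return {files | {n} for n in NAMES} | {files - {n} for n in NAMES}


def put(state, i, files):
    """Copy of the distributed state in which group i holds `files`."""
    return state[:i] + (frozenset(files),) + state[i + 1:]


def distributed_next(state):
    """Distributed states reachable by one action."""
    out = set()
    for i in range(GROUPS):
        for n in NAMES:
            if all(n not in g for g in state):
                out.add(put(state, i, state[i] | {n}))      # create at group i
            if n in state[i]:
                removed = put(state, i, state[i] - {n})
                out.add(removed)                            # delete
                j = (i + 1) % GROUPS
                out.add(put(removed, j, removed[j] | {n}))  # hand off to group j
    return out


def refine(state):
    """Refinement mapping: the semantic view is the union of all groups."""
    return frozenset().union(*state)


def check_refinement():
    init = tuple(frozenset() for _ in range(GROUPS))
    if refine(init) != frozenset():                  # base case
        return False
    seen, queue = {init}, deque([init])
    while queue:
        s = queue.popleft()
        for t in distributed_next(s):
            a, b = refine(s), refine(t)
            if a != b and b not in semantic_next(a):  # inductive step
                return False
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return True


print(check_refinement())
```

The hand-off between groups changes the distributed state but not the semantic one, which is the kind of step a refinement has to permit.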

Refinement with Byzantine Faults

[Diagram: the refinement picture again (namespace tree, open handles, pending operations del, read, move). In the second build, part of the tree and its file contents are replaced by garbage, illustrating arbitrary state under a Byzantine fault.]

Semantic Fault Specification

• Safety
  – A tainted file may have arbitrary contents and attributes
  – A tainted file may appear not linked into namespace
  – A tainted file may pretend not to have children it actually has
  – A tainted file may pretend to have children that do not exist
  – A tainted file may pretend another tainted file is a child or parent

• Liveness
  – Operations involving a tainted file may not complete
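The safety clauses above can be read as a bound on how far an observed view may drift from the fault-free view. The following sketch is a simplified reading, not the Farsite semantic spec: each file is modeled as hypothetical (contents, children) data, untainted files must match exactly, and a tainted file may claim extra children only if they do not exist or are themselves tainted.

```python
def violations(ideal, observed, tainted):
    """List the ways `observed` exceeds what the weakened spec permits."""
    problems = []
    for name, record in ideal.items():
        if name in tainted:
            continue  # arbitrary contents and missing children are permitted
        if observed.get(name) != record:
            problems.append(f"untainted file {name!r} differs from the spec")
    for name in tainted:
        _, claimed = observed.get(name, (None, frozenset()))
        actual = ideal[name][1] if name in ideal else frozenset()
        for child in claimed - actual:
            # Extra children must be nonexistent or tainted themselves.
            if child in ideal and child not in tainted:
                problems.append(f"tainted {name!r} claims untainted {child!r}")
    return problems


# Illustrative data only: name -> (contents, children).
ideal = {
    "/": ("", frozenset({"code", "tools"})),
    "code": ("", frozenset({"a.cpp"})),
    "a.cpp": ("int main() {}", frozenset()),
    "tools": ("", frozenset({"cl.exe"})),
    "cl.exe": ("<binary>", frozenset()),
}
tainted = {"code"}

garbage_view = dict(ideal, code=("%%garbage%%", frozenset({"ghost"})))
leaky_view = dict(ideal, code=("", frozenset({"a.cpp", "cl.exe"})))

print(violations(ideal, garbage_view, tainted))  # permitted: damage stays in "code"
print(violations(ideal, leaky_view, tainted))    # not permitted: claims "cl.exe"
```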

Distributed-System Improvements

• Maintain redundant info across BFT group boundaries

• Augment messages with info that justifies correctness

• Ensure unambiguous chains of authority over data

• Carefully order messages and state updates for operations involving multiple BFT groups

Summary of BFI Methodology

• Formally specify your system
  – Semantic spec: user’s view of system
  – Distributed-system spec: designer’s view of system
  – Refinement interprets distributed-system spec in semantic terms

• Modify distributed-system spec to express Byzantine faults

• Simultaneously
  – Strategically weaken semantic spec to describe faults
  – Improve distributed-system spec to quarantine faults

• Refinement lets you know when you are done

Conclusions

• BFT groups have negative throughput scaling

• Scalable systems can be built from multiple BFT groups

• System scale increases the probability of non-maskable Byzantine faults

• If faults are not isolated, a single faulty group can corrupt the entire system

• BFI is a methodology for isolating Byzantine faults

• BFI uses formal system specification

• Improves fault tolerance without hurting throughput, unlike increasing BFT group size

Contact Information

[email protected]

[email protected]

http://research.microsoft.com/farsite

Backup Slides

Farsite Spec Stats

• Semantic specification
  – 1800 lines of TLA+
  – 114 definitions

• Distributed-system specification
  – 11,500 lines of TLA+
  – 775 definitions

• Why so big?
  – Windows file-system semantics are complex
  – Scalability and strong consistency
  – Byzantine fault isolation