PASC fault tolerance

Taming Data Corruptions in Distributed Systems

Marco Serafini (Yahoo! Research BCN)

Infrastructure dependability

o Service availability, data durability
o In presence of hardware faults
o Current approaches tolerate crashes

Crashes

o Assumptions
  o A server (process) suddenly stops
  o Until then, only correct steps

Data corruptions

o What if there are data corruptions?
  o The state of a process may be corrupted
  o The process may make incorrect steps before stopping

NOT COVERED!

Sources of data corruptions

o Commodity disks are known to be unreliable
  o Faulty firmware, bad sectors, etc.
o RAM: ECC errors are frequent
  o Production machines only see detected errors
  o Coverage not known
o Interconnects and CPUs also fail
  o Faulty drivers or bit flips

A horror story

An 8-hour system-wide outage due to a single hardware fault

What happened?

o Quoted from the Amazon service health dashboard
  o “A handful of messages had a single bit corrupted”
  o “The message was still intelligible, but the system state information was incorrect”
  o “We used MD5 checksums throughout the system (but not) for this particular internal state information”
  o “(The corruption) spread throughout the system causing the symptoms described above”
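The kind of check mentioned in the quotes can be illustrated with a minimal sketch. It assumes a message is sealed by prepending its MD5 digest; `seal`, `verify`, `flip_bit` and the payload text are hypothetical names and values chosen for illustration, not taken from the system in the quote.

```python
import hashlib

def seal(payload: bytes) -> bytes:
    """Prepend the 16-byte MD5 digest of the payload."""
    return hashlib.md5(payload).digest() + payload

def verify(message: bytes) -> bool:
    """Return True if the stored digest matches the payload."""
    digest, payload = message[:16], message[16:]
    return hashlib.md5(payload).digest() == digest

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with a single bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)

message = seal(b"state: server-42 is available")
corrupted = flip_bit(message, 16 * 8 + 3)  # flip one bit inside the payload

print(verify(message))    # True
print(verify(corrupted))  # False
```

A check like this only protects the fields it is applied to, which is the gap the quote describes.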

Error propagation

[Figure: event handling at Process i and Process j]

Common practice

o Manual placement of ad-hoc error detection checks
  o Application knowledge
  o Time consuming
o Hard to structure without fault model
o No error isolation guarantee

Research: Byzantine faults

o Byzantine model
  o Faulty nodes controlled by an adversary
  o Worst-case model

Byzantine fault model

o Black-box model of faulty processes: adversarial
o Hardening for error isolation [Nysiad NSDI 2008]
  o Based on state machine replication
  o Replication and performance costs

[Figure: client and servers; agreement on requests]

Byzantine faults

o Byzantine hardening covers attacks and bugs…
  o … assuming, e.g., design diversity of replicas
o Impractical in most systems: no real adoption

[Figure labels: Attacks, Security, Data corruptions, ASC Hardening]

A new approach to error isolation

[Figure: event handling at Process i and Process j]

1. General model of process behavior
2. Arbitrary State Corruption (ASC) fault model
3. Guarantee error isolation through hardening

with M. Correia, D. Ferro and F. Junqueira
2012 USENIX Annual Technical Conference

Process and fault models

Defining Arbitrary State Corruptions

Process model

upon receive message <REQ, r> do        1) Event dispatching
    if v > 5 then                       2) Event handling
        u = r + v + 5;
    else
        u = r + v;
    v = u;
    send <WRITE, v> to process p        3) Message sending
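The three phases of the process model can be written as a small runnable sketch. The `Process` class, the `outbox` list standing in for the network, and the tuple message format are assumptions made for illustration; only the handler logic comes from the slide.

```python
from dataclasses import dataclass, field

@dataclass
class Process:
    v: int = 0
    outbox: list = field(default_factory=list)  # stand-in for the network

    def dispatch(self, message: tuple) -> None:
        # 1) Event dispatching: pick the handler for the message type.
        kind, payload = message
        if kind == "REQ":
            self.handle_req(payload)

    def handle_req(self, r: int) -> None:
        # 2) Event handling: update local state from the request.
        if self.v > 5:
            u = r + self.v + 5
        else:
            u = r + self.v
        self.v = u
        # 3) Message sending: emit <WRITE, v> to process p.
        self.outbox.append(("WRITE", self.v, "p"))

proc = Process(v=3)
proc.dispatch(("REQ", 4))  # v <= 5, so u = 4 + 3 = 7
proc.dispatch(("REQ", 1))  # v > 5, so u = 1 + 7 + 5 = 13
print(proc.v, proc.outbox)
```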

ASC fault model

o An Arbitrary State Corruption can make a process
  o Crash
  o Assign an arbitrary value to any variable
  o Start the execution from an arbitrary instruction
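The three effects listed above can be mimicked with a toy fault injector. This is a rough sketch under strong assumptions: the handler is split into coarse steps so that "an arbitrary instruction" becomes "an arbitrary step", and the names `STEPS`, `run`, `inject_asc_fault` and `Crash` are hypothetical.

```python
import random

class Crash(Exception):
    """Models the process stopping."""

# The handler from the process model, split into steps that mutate a state dict.
def step_compute(s):
    s["u"] = s["r"] + s["v"] + (5 if s["v"] > 5 else 0)

def step_store(s):
    s["v"] = s["u"]

def step_send(s):
    s["sent"].append(("WRITE", s["v"]))

STEPS = [step_compute, step_store, step_send]

def run(state, start=0):
    """Execute the handler steps, beginning at index `start`."""
    for step in STEPS[start:]:
        step(state)

def inject_asc_fault(state, rng):
    """Apply one ASC fault and return the step index to start from."""
    kind = rng.choice(["crash", "corrupt", "jump"])
    if kind == "crash":
        raise Crash()
    if kind == "corrupt":
        # Assign an arbitrary value to an arbitrarily chosen variable.
        state[rng.choice(["r", "u", "v"])] = rng.randrange(-2**31, 2**31)
        return 0
    # Start the execution from an arbitrary step.
    return rng.randrange(len(STEPS))

rng = random.Random(7)
state = {"r": 4, "u": 0, "v": 3, "sent": []}
try:
    run(state, start=inject_asc_fault(state, rng))
except Crash:
    print("process crashed")
print(state)
```

Starting from the second step, for example, stores and sends a stale `u` without ever computing it, which is an incorrect step taken before any crash.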

Fault frequency

o One fault for every processed input message

Fault diversity

o A corrupted variable is different from its replica
  o Only holds immediately after the fault
  o Can be invalidated if instructions modify the variable
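A toy illustration of these two points, assuming the state and its replica are plain dictionaries and the fault is a single bit flip. The variable name `v` and the modulo instruction are chosen only to make the effect visible.

```python
def diverged(state, replica, name):
    """True if the variable differs between the original and the replica."""
    return state[name] != replica[name]

state = {"v": 3}
replica = dict(state)

state["v"] ^= 1 << 4                   # fault: one bit flips in the original (3 -> 19)
print(diverged(state, replica, "v"))   # True: diversity holds right after the fault

# The same instruction then modifies the variable in both copies.
for copy_ in (state, replica):
    copy_["v"] = copy_["v"] % 2        # 19 % 2 == 1 and 3 % 2 == 1
print(diverged(state, replica, "v"))   # False: comparing v no longer reveals the fault
```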

[Figure: original and replica copies of the state]

Error propagation

o Fault diversity does not hold
o Hardening preserves diversity

[Figure: original and replica; fault diversity]

ASC hardening

From ASC faults to crashes and message omissions

From ASC to crashes

o Transparent: to the hardened process
o Local: no process replication on multiple machines
o Untrusted: can have faults while executing hardening
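The idea of turning a corruption into a crash can be sketched as follows. This is a simplified illustration, not the PASC library's API: it assumes the runtime keeps a local replica of the process state, runs the handler on both copies, and stops the process when they disagree. The names `HardenedProcess` and `DetectedCorruption` are hypothetical, and the sketch does not address faults that hit the checks themselves.

```python
import copy

class DetectedCorruption(Exception):
    """Raised to stop the process instead of letting an error propagate."""

def handler(state, r):
    """Handler from the process model; returns the message to send."""
    if state["v"] > 5:
        u = r + state["v"] + 5
    else:
        u = r + state["v"]
    state["v"] = u
    return ("WRITE", state["v"])

class HardenedProcess:
    def __init__(self, initial_state):
        # Local replica: a second copy of the state on the same machine.
        self.state = copy.deepcopy(initial_state)
        self.replica = copy.deepcopy(initial_state)

    def on_message(self, r):
        out = handler(self.state, r)
        out_replica = handler(self.replica, r)
        # Compare both the outgoing message and the resulting state.
        if out != out_replica or self.state != self.replica:
            raise DetectedCorruption("state and replica disagree")
        return out

p = HardenedProcess({"v": 3})
print(p.on_message(4))    # ('WRITE', 7)
p.state["v"] ^= 1 << 4    # simulated bit flip in the original state: 7 -> 23
try:
    p.on_message(1)
except DetectedCorruption as err:
    print("crashed:", err)
```

The handler itself is unchanged, which is the sense in which hardening is transparent to the process, and no second machine is involved.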

[Figure: hardening runtime around event handling]

PASC runtime

[Figure labels: EH1, EH2, EH3; Process state; Replica state; PASC checks; PASC library; User-defined; Transparent]

github.com/yahoo/pasc

Evaluation

Hardening an echo server

o Little computation, network bound, no overhead
o PBFT is a reference (Nysiad not available)

Hardening State Machine Replication

[Figure annotations: +70%, -15%]

Zookeeper (core)

Memory overhead

Scalability

o SimpleKV: eventually consistent store, no replication
o Scales similarly with hardening
o No server “wasted” for replication

[Figure: PASC sKV vs. Unprot. sKV; x-axis: Number of servers]

PASC fault coverage

o Injected random bit flips in Paxos
  o Code corruptions: bytecode and binary code
  o State corruptions: pointers and primitive values

              Code corruptions     State corruptions
              Unprot.    PASC      Unprot.    PASC
Undet.        3          0         93         0
Det.          -          1         -          330
Crash         1640       1663      2301       2066
Not manif.    1213       1193      2843       2841
Total         2856       2856      5237       5237
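To put the undetected counts in perspective, the short sketch below turns them into rates, using the totals exactly as they appear in the table. The dictionary layout and the helper `undetected_rate` are assumptions for illustration.

```python
# Undetected faults and totals, copied from the table above.
results = {
    ("code", "Unprot."):  {"undetected": 3,  "total": 2856},
    ("code", "PASC"):     {"undetected": 0,  "total": 2856},
    ("state", "Unprot."): {"undetected": 93, "total": 5237},
    ("state", "PASC"):    {"undetected": 0,  "total": 5237},
}

def undetected_rate(cell):
    """Fraction of injected faults that went undetected."""
    return cell["undetected"] / cell["total"]

for (kind, system), cell in results.items():
    print(f"{kind:5s} {system:7s} {100 * undetected_rate(cell):.2f}%")
```

With these numbers, roughly 1.78% of state corruptions go undetected without protection, and none with PASC.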

Wrap up

o Hardware data corruptions are a real danger
o Proposed new systematic approach
  o BFT not realistic
  o Ad-hoc approaches are not systematic
o Hardening algorithm for error isolation
  o Local: does not require replication
  o Efficient: PASC-Paxos has up to 70% more throughput than PBFT
  o High fault coverage

Directions

o Systematic protection of Yahoo! infrastructure against data corruptions
o ASC just scratched the surface: some todos
  o Reduce memory footprint
  o Support for external memory (disks/SSDs)
  o Hardening of legacy code
  o Theoretical foundations

Thank you

serafini@yahoo-inc.com
