Reliable Distributed Systems

How and Why Complex Systems Fail

How and Why Systems Fail We’ve talked about

Transactional reliability And we’ve mentioned replication for high

availability But does this give us “fault-tolerant

solutions?” How and why do real systems fail? Do real systems offer the hooks we’ll

need to intervene?

Failure Failure is just one of the aspects of

reliability, but it is clearly an important one

To make a system fault-tolerant we need to understand how to detect failures and plan an appropriate response if a failure occurs

This lecture focuses on how systems fail, how they can be “hardened”, and what still fails after doing so

Systems can be built in many ways Reliability is not always a major goal

when development first starts Most systems evolve over time, through

incremental changes with some rewriting

Most reliable systems are entirely rewritten using clean-room techniques after they reach a mature stage of development

Clean-room concept Based on goal of using “best available”

practice Requires good specifications Design reviews in teams Actual software also reviewed for correctness Extensive stress testing and code coverage

testing, use tools like “Purify” Use of formal proof tools where practical

But systems still fail! Gray studied failures in Tandem

systems Hardware was fault-tolerant and

rarely caused failures Software bugs, environmental

factors, human factors (user error), incorrect specification were all major sources of failure

Bohrbugs and Heisenbugs Classification proposed by Bruce Lindsey Bohrbug: like the Bohr model of the nucleus:

solid, easily reproduced, can track it down and fix it

Heisenbug: like the Heisenberg nucleus: a diffuse cloud, very hard to pin down and hence fix

Anita Borr and others have studied life-cycle bugs in complex software using this classification

Programmer facing bugs

Heisenbug is fuzzy,hard to find/fix

Bohrbug is solid,easy to recognize and fix

Lifecycle of Bohrbug Usually introduced in some form of code

change or in original design Often detected during thorough testing Once seen, easily fixed Remain a problem over life-cycle of

software because of need to extend system or to correct other bugs.

Same input will reliably trigger the bug!

Lifecycle of Bohrbug

A Bohrbug is boring.

Lifecycle of a Heisenbug These are often side-effects of some

other problem Example: bug corrupts a data structure

or misuses a pointer. Damage is not noticed right away, but causes a crash much later when structure is referenced

Attempting to detect the bug may shift memory layout enough to change its symptoms!

How programmers fix a Bohrbug They develop a test scenario that

triggers it Use a form of binary search to narrow in

on it Pin down the bug and understand

precisely what is wrong Correct the algorithm or the coding

error Retest extensively to confirm that the

bug is fixed

How they fix Heisenbugs They fix the symptom: periodically scan the

structure that is ususally corrupted and clean it up

They add self-checking code (which may itself be a source of bugs)

They develop theories of what is wrong and fix the theoretical problem, but lack a test to confirm that this eliminated the bug

These bugs are extremely sensitive to event orders

Bug-free software is uncommon Heavily used software may become

extremely reliable over its life (the C compiler rarely crashes, UNIX is pretty reliable by now)

Large, complex systems depend upon so many components, many complex, that bug freedom is an unachievable goal

Instead, adopt view that bugs will happen and we should try and plan for them

Bugs in a typical distributed system Usual pattern: some component crashes

or becomes partitioned away Other system components that depend

on it freeze or crash too Chains of dependencies gradually cause

more and more of the overall system to fail or freeze

Tools can help Everyone should use tools like “purify”

(detects stray pointers, uninitialized variables and memory leaks)

But these tools don’t help at the level of a distributed system

Benefit of a model, like transactions or virtual synchrony, is that the model simplifies developer’s task

Leslie Lamport

“A distributed system is one in which the failure of a machine you have never heard of can cause your own machine to become unusable”

Issue is dependency on critical components

Notion is that state and “health” of system at site A is linked to state and health at site B

Component Architectures Make it Worse

Modern systems are structured using object-oriented component interfaces: CORBA, COM (or DCOM), Jini XML

In these systems, we create a web of dependencies between components

Any faulty component could cripple the system!

Reminder: Networks versus Distributed Systems

Network focus is on connectivity but components are logically independent: program fetches a file and operates on it, but server is stateless and forgets the interaction Less sophisticated but more robust?

Distributed systems focus is on joint behavior of a set of logically related components. Can talk about “the system” as an entity. But needs fancier failure handling!

Component Systems? Includes CORBA and Web Services These are distributed in the sense of our

definition Often, they share state between

components If a component fails, replacing it with a new

version may be hard Replicating the state of a component: an

appealing option… Deceptively appealing, as we’ll see

Thought question Suppose that a distributed system was

built by interconnecting a set of extremely reliable components running on fault-tolerant hardware

Would such a system be expected to be reliable?

Thought question Suppose that a distributed system was built by

interconnecting a set of extremely reliable components running on fault-tolerant hardware

Would such a system be expected to be reliable?

Perhaps not. The pattern of interaction, the need to match rates of data production and consumption, and other “distributed” factors all can prevent a system from operating correctly!

Example The Web components are individually reliable But the Web can fail by returning inconsistent

or stale data, can freeze up or claim that a server is not responding (even if both browser and server are operational), and it can be so slow that we consider it faulty even if it is working

For stateful systems (the Web is stateless) this issue extends to joint behavior of sets of programs

Example The Arianne rocket is designed in a

modular fashion Guidance system Flight telemetry Rocket engine control …. Etc

When they upgraded some rocket components in a new model, working modules failed because hidden assumptions were invalided.

Arianne Rocket

Guidance

Thrust Control

Attitude Control

Accelerometer

Telemetry

Altitude

Arianne Rocket

Guidance

Thrust Control

Attitude Control

Accelerometer

Telemetry

AltitudeOverflow!

Arianne Rocket

Guidance

Thrust Control

Attitude Control

Accelerometer

Telemetry

Altitude

Insights? Correctness depends very much on the

environment A component that is correct in setting A

may be incorrect in setting B Components make hidden assumptions Perceived reliability is in part a matter of

experience and comfort with a technology base and its limitations!

Detecting failure Not always necessary: there are ways to

overcome failures that don’t explicitly detect them

But situation is much easier with detectable faults

Usual approach: process does something to say “I am still alive”

Absence of proof of liveness taken as evidence of a failure

Example: pinging with timeouts Programs P and B are the primary,

backup of a service Programs X, Y, Z are clients of the

service All “ping” each other for liveness If a process doesn’t respond to a

few pings, consider it faulty.

Consistent failure detection Impossible in an asynchronous network

that can lose packets: partitioning can mimic failure Best option is to track membership But few systems have GMS services

Many real networks suffer from this problem, hence consistent detection is impossible “in practice” too!

Can always detect failures if risk of mistakes is acceptable

Component failure detection An even harder problem! Now we need to worry

About programs that fail But also about modules that fail

Unclear how to do this or even how to tell Recall that RPC makes component

use rather transparent…

Vogels: the Failure Investigator Argues that we would not consider someone to

have died because they don’t answer the phone

Approach is to consult other data sources: Operating system where process runs Information about status of network routing nodes Can augment with application-specific solutions

Won’t detect program that looks healthy but is actually not operating correctly

Further options: “Hot” button Usually implemented using shared

memory Monitored program must periodically

update a counter in a shared memory region. Designed to do this at some frequency, e.g. 10 times per second.

Monitoring program polls the counter, perhaps 5 times per second. If counter stops changing, kills the “faulty” process and notifies others.

Friedman’s approach Used in a telecommunications co-

processor mockup Can’t wait for failures to be sensed, so

his protocol reissues requests as soon as soon as the reply seems late

Issue of detecting failure becomes a background task; need to do it soon enough so that overhead won’t be excessive or realtime response impacted

Broad picture? Distributed systems have many

components, linked by chains of dependencies

Failures are inevitable, hardware failures are less and less central to availability

Inconsistency of failure detection will introduce inconsistency of behavior and could freeze the application

Suggested solution? Replace critical components with group

of components that can each act on behalf of the original one

Develop a technology by which states can be kept consistent and processes in system can agree on status (operational/failured) of components

Separate handling of partitioning from handling of isolated component failures if possible

Reliable Distributed Systems

Documents

Aran Bergman, Principles of Reliable Distributed Systems, Technion EE, Spring 2004 1 Principles of Reliable Distributed Systems Recitation 5: Reliable

Reliable and Highly Available Distributed Publish/Subscribe Systems

Reliable Distributed Systems More on Group Communication

Reliable Distributed Systems

1 Reliable Distributed Systems Stateless and Stateful Client- Server Systems

Reliable Distributed Systems Distributed Hash Tables

High-Performance Reliable Distributed Storage Systems

Unreliable Failure Detectors for Reliable Distributed Systems

Dependability I: Reliable Distributed Systems

Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 1 Principles of Reliable Distributed Systems Lecture 5: Synchronous (Uniform)

Guide to Reliable Distributed Systems: Building High ...cse.iitd.ernet.in/~bhalla/OHP1.pdfGuide to Reliable Distributed Systems: Building High-Assurance Applications and ... Yahoo!

Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2007 1 Principles of Reliable Distributed Systems Lecture 9: Paxos Spring

Reliable Distributed Computing: The Price of Mastering Churn in Distributed Systems

Guide to Reliable Distributed Systems: Building High

Aran Bergman & Eddie Bortnikov & Alex Shraer, Principles of Reliable Distributed Systems, Spring 2008 1 Principles of Reliable Distributed Systems Recitation

Reliable Distributed Systems How and Why Complex Systems Fail

Unreliable Failure Detectors for Reliable Distributed Systems · 2009. 2. 12. · Unreliable Failure Detectors for Reliable Distributed Systems Tushar Deepak Chandra I.B.M Thomas

designing distributed scalable and reliable systems

Alex Shraer, Principles of Reliable Distributed Systems, Technion EE, Spring 2008 1 Principles of Reliable Distributed Systems Tutorial 4: SkipNet Spring

Reliable Distributed Systems Fault Tolerance (Recoverability High Availability)