
Page 1:

Leslie Lamport

“A distributed system is one in which the failure of a machine you have never heard of can cause your own machine to become unusable”

• Issue is dependency on critical components
• Notion is that state and “health” of system at site A is linked to state and health at site B

Page 2:

Component Architectures Make it Worse

• Modern systems are structured using object-oriented component interfaces:
  – CORBA, COM (or DCOM), Jini
  – XML

• In these systems, we create a web of dependencies between components

• Any faulty component could cripple the system!

Page 3:

Reminder: Networks versus Distributed Systems

• Network focus is on connectivity, but components are logically independent: a program fetches a file and operates on it, but the server is stateless and forgets the interaction
  – Less sophisticated but more robust?

• Distributed systems focus is on the joint behavior of a set of logically related components. We can talk about “the system” as an entity.
  – But this needs fancier failure handling!

Page 4:

Component Systems?

• Includes CORBA and Web Services
• These are distributed in the sense of our definition
  – Often, they share state between components
  – If a component fails, replacing it with a new version may be hard
  – Replicating the state of a component: an appealing option…
    • Deceptively appealing, as we’ll see

Page 5:

Example

• The Web components are individually reliable
• But the Web can fail by returning inconsistent or stale data, can freeze up or claim that a server is not responding (even if both browser and server are operational), and it can be so slow that we consider it faulty even if it is working
• For stateful systems (the Web is stateless) this issue extends to the joint behavior of sets of programs

Page 6:

Example

• The Ariane rocket is designed in a modular fashion
  – Guidance system
  – Flight telemetry
  – Rocket engine control
  – … etc.

• When they upgraded some rocket components in a new model, working modules failed because hidden assumptions were invalidated.

Page 7:

Ariane Rocket

[Block diagram: Guidance, Thrust Control, Attitude Control, Accelerometer, Telemetry, and Altitude modules and their interconnections]

Page 8:

Ariane Rocket

[Block diagram: the same modules, with an overflow occurring in the Altitude value]

Page 9:

Ariane Rocket

[Block diagram: the same modules, shown again after the overflow]

Page 10:

Insights?

• Correctness depends very much on the environment
  – A component that is correct in setting A may be incorrect in setting B
  – Components make hidden assumptions
  – Perceived reliability is in part a matter of experience and comfort with a technology base and its limitations!

Page 11:

Detecting failure

• Not always necessary: there are ways to overcome failures that don’t explicitly detect them

• But situation is much easier with detectable faults

• Usual approach: process does something to say “I am still alive”

• Absence of proof of liveness taken as evidence of a failure

Page 12:

Example: pinging with timeouts

• Programs P and B are the primary and backup of a service
• Programs X, Y, Z are clients of the service
• All “ping” each other for liveness
• If a process doesn’t respond to a few pings, consider it faulty.
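
A minimal sketch of this ping-based detector, assuming a caller-supplied `ping(peer, timeout)` function (hypothetical, not part of any particular library) that returns True when the peer answers in time; the thresholds are illustrative.

```python
# Sketch of a ping-based failure detector. `ping(peer, timeout)` is assumed
# to be supplied by the application and to return True if the peer answered
# within `timeout` seconds.
import time

def monitor(peers, ping, timeout=1.0, max_misses=3, interval=1.0):
    """Ping every peer; suspect a peer after max_misses consecutive missed pings."""
    misses = {p: 0 for p in peers}
    suspected = set()
    while True:
        for p in peers:
            if p in suspected:
                continue
            if ping(p, timeout):
                misses[p] = 0                  # proof of liveness resets the count
            else:
                misses[p] += 1
                if misses[p] >= max_misses:    # absence of proof taken as failure
                    suspected.add(p)
                    print(f"suspecting {p}: no reply to {max_misses} pings")
        time.sleep(interval)
```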

Page 13:

Component failure detection

• An even harder problem!
• Now we need to worry
  – About programs that fail
  – But also about modules that fail
• Unclear how to do this or even how to tell
  – Recall that RPC makes component use rather transparent…

Page 14:

Vogels: the Failure Investigator

• Argues that we would not consider someone to have died because they don’t answer the phone

• Approach is to consult other data sources:
  – Operating system where the process runs
  – Information about status of network routing nodes
  – Can augment with application-specific solutions

• Won’t detect program that looks healthy but is actually not operating correctly

Page 15:

Further options: “Hot” button

• Usually implemented using shared memory
• Monitored program must periodically update a counter in a shared memory region. Designed to do this at some frequency, e.g. 10 times per second.
• Monitoring program polls the counter, perhaps 5 times per second. If the counter stops changing, kills the “faulty” process and notifies others.

Page 16:

Friedman’s approach

• Used in a telecommunications co-processor mockup

• Can’t wait for failures to be sensed, so his protocol reissues requests as soon as the reply seems late

• Issue of detecting failure becomes a background task; need to do it soon enough so that overhead won’t be excessive or realtime response impacted

Page 17:

Broad picture?

• Distributed systems have many components, linked by chains of dependencies

• Failures are inevitable, but hardware failures are less and less central to availability

• Inconsistency of failure detection will introduce inconsistency of behavior and could freeze the application

Page 18:

Suggested solution?

• Replace critical components with a group of components that can each act on behalf of the original one

• Develop a technology by which states can be kept consistent and processes in the system can agree on the status (operational/failed) of components

• Separate handling of partitioning from handling of isolated component failures if possible

Page 19:

Suggested Solution

[Diagram: a Program connected to the Module it uses]

Page 20:

Suggested Solution

[Diagram: the Program now connects, via a transparent replication multicast, to replicas of the Module it uses]

Page 21:

Replication: the key technology

• Replicate critical components for availability
• Replicate critical data: like coherent caching
• Replicate critical system state: control information such as “I’ll do X while you do Y”
• In the limit, replication and coordination are really the same problem

Page 22:

Basic issues with the approach

• We need to understand client-side software architectures better to appreciate the practical limitations on replacing a server with a group

• Sometimes, this simply isn’t practical

Page 23:

Client-Server issues

• Suppose that a client observes a failure during a request

• What should it do?

Page 24:

Client-server issues

[Diagram: a client’s request to the server times out]

Page 25:

Client-server issues

• What should the client do?
  – No way to know if the request was finished
  – We don’t even know if the server really crashed
  – But suppose it genuinely crashed…

Page 26:

Client-server issues

[Diagram: after the timeout, the client turns to the backup]

Page 27:

Client-server issues

• What should the client “say” to the backup?
  – Please check on the status of my last request?
    • But perhaps the backup has not yet finished the fault-handling protocol
  – Reissue the request? (one safeguard is sketched below)
    • Not all requests are idempotent
    • And what about any “cached” server state? Will it need to be refreshed?
• Worse still: what if the RPC throws an exception? E.g. “demarshalling error”
  – A risk if failure breaks a stream connection
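
One common safeguard, not spelled out on the slide, is to tag every request with a unique ID so a server that already executed it returns the cached reply instead of re-executing; the sketch below assumes the completed-request table is somehow replicated to the backup.

```python
# Sketch: making "reissue the request" safe for non-idempotent operations.
# Each request carries a unique ID; a server (primary or backup) that has
# already executed that ID returns the cached reply instead of re-executing.
import uuid

class Server:
    def __init__(self):
        self.completed = {}   # request_id -> cached reply (assumed replicated to the backup)
        self.balance = 0

    def handle(self, request_id, amount):
        if request_id in self.completed:        # duplicate caused by a client retry
            return self.completed[request_id]
        self.balance += amount                  # non-idempotent side effect happens once
        reply = self.balance
        self.completed[request_id] = reply
        return reply

# Client side: after the timeout, retry against the backup with the SAME request id.
primary, backup = Server(), Server()
rid = str(uuid.uuid4())
primary.handle(rid, 100)                        # suppose the reply was lost
backup.completed = dict(primary.completed)      # stand-in for state replication
print(backup.handle(rid, 100))                  # returns 100; the deposit is not applied twice
```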

Page 28:

Client-server issues

• Client is doing a request that might be disrupted by failure
  – Must catch this request
• Client needs to reconnect
  – Figure out who will take over
  – Wait until it knows about the crash
  – Cached data may no longer be valid
  – Track down outcome of pending requests
• Meanwhile must synchronize w.r.t. any new requests that the application issues

Page 29:

Client-server issues

• This argues that we need to make server failure “transparent” to the client
  – But in practice, doing so is hard
  – Normally, this requires deterministic servers
    • But not many servers are deterministic
  – Techniques are also very slow…

Page 30:

Client-server issues

• Transparency
  – On the client side, “nothing happens”
  – On the server side
    • There may be a connection that the backup needs to take over
    • What if the server was in the middle of sending a request?
    • How can the backup exactly mimic the actions of the primary?

Page 31:

Other approaches to consider

• N-version programming: use more than one implementation to overcome software bugs
  – Explicitly uses some form of group architecture
  – We run multiple copies of the component
  – Compare their outputs and pick the majority
    • Could be identical copies, or separate versions
    • In the limit, each is coded by a different team!
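
A minimal sketch of the voting wrapper, assuming three independently implemented versions of the same function; the version functions here are placeholders, not from the original material.

```python
# N-version programming sketch: run several implementations, pick the majority output.
from collections import Counter

def majority_vote(versions, *args):
    """Run every version on the same input and return the most common answer."""
    outputs = [v(*args) for v in versions]
    answer, count = Counter(outputs).most_common(1)[0]
    if count <= len(versions) // 2:
        raise RuntimeError("no majority: versions disagree too much")
    return answer

# Placeholder "versions", in the limit coded by different teams.
def sqrt_v1(x): return x ** 0.5
def sqrt_v2(x): return x ** 0.5
def sqrt_v3(x): return -1.0          # a buggy version

print(majority_vote([sqrt_v1, sqrt_v2, sqrt_v3], 9.0))   # 3.0: the buggy minority is outvoted
```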

Page 32:

Other approaches to consider

• Even with n-version programming, we get limited defense against bugs
  – … studies show that Bohrbugs will occur in all versions! For Heisenbugs we won’t need multiple versions; running one version multiple times suffices if the copies see different inputs or a different order of inputs

Page 33:

Logging and checkpoints

• Processes make periodic checkpoints, log messages sent in between

• Rollback to consistent set of checkpoints after a failure. Technique is simple and costs are low.

• But method must be used throughout system and is limited to deterministic programs (everything in the system must satisfy this assumption)

• Consequence: useful in limited settings.
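
A single-process sketch of the mechanism, assuming a deterministic apply step; in a real system the checkpoint and log live on stable storage and the processes must coordinate on a consistent set of checkpoints, which is not shown here.

```python
# Checkpointing sketch for one deterministic process: periodically snapshot the
# state, log messages received since the snapshot, and after a crash restore
# the snapshot and replay the logged messages.
import copy

class CheckpointedProcess:
    def __init__(self):
        self.state = {"count": 0}
        self.checkpoint = copy.deepcopy(self.state)
        self.log = []                       # messages since the last checkpoint

    def apply(self, msg):
        self.state["count"] += msg          # must be deterministic for replay to work

    def deliver(self, msg):
        self.log.append(msg)                # log first, then apply
        self.apply(msg)

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)
        self.log = []

    def recover(self):
        self.state = copy.deepcopy(self.checkpoint)
        for msg in self.log:                # replay messages logged after the checkpoint
            self.apply(msg)

p = CheckpointedProcess()
p.deliver(1); p.take_checkpoint(); p.deliver(2); p.deliver(3)
p.state = None                              # simulate a crash that loses volatile state
p.recover()
print(p.state)                              # {'count': 6}: same state as before the crash
```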

Page 34:

Byzantine approach

• Assumes that failures are arbitrary and may be malicious
• Uses groups of components that take actions by majority consensus only
• Protocols prove to be costly
  – 3t+1 components needed to overcome t failures
  – Takes a long time to agree on each action
• Currently employed mostly in security settings

Page 35:

Tougher failure models

• We’ve focused on crash failures
  – In the synchronous model these look like a “farewell cruel world” message
  – Some call it the “failstop model”. A faulty process is viewed as first saying goodbye, then crashing

• What about tougher kinds of failures?
  – Corrupted messages
  – Processes that don’t follow the algorithm
  – Malicious processes out to cause havoc?

Page 36:

Here the situation is much harder

• Generally we need at least 3f+1 processes in a system to tolerate f Byzantine failures
  – For example, to tolerate 1 failure we need 4 or more processes

• We also need f+1 “rounds”

• Let’s see why this happens

Page 37:

Byzantine scenario

• Generals (N of them) surround a city
  – They communicate by courier
• Each has an opinion: “attack” or “wait”
  – In fact, an attack would succeed: the city will fall.
  – Waiting will succeed too: the city will surrender.
  – But if some attack and some wait, disaster ensues
• Some Generals (f of them) are traitors… it doesn’t matter if they attack or wait, but we must prevent them from disrupting the battle
  – Traitors can’t forge messages from other Generals

Page 38:

Byzantine scenario

[Diagram: Generals around the city; two say “Attack!”, two say “Wait…”, and the traitor says “Attack! No, wait! Surrender!”]

Page 39:

A timeline perspective

• Suppose that p and q favor attack, r is a traitor and s and t favor waiting… assume that in a tie vote, we attack

[Timeline diagram: processes p, q, r, s, t]

Page 40:

A timeline perspective

• After the first round the collected votes are:
  – {attack, attack, wait, wait, traitor’s-vote}

[Timeline diagram: processes p, q, r, s, t exchanging round-1 votes]

Page 41:

What can the traitor do?

• Add a legitimate vote of “attack”
  – Anyone with 3 votes to attack knows the outcome
• Add a legitimate vote of “wait”
  – Vote now favors “wait”
• Or send different votes to different folks
• Or don’t send a vote at all, to some

Page 42:

Outcomes?

• Traitor simply votes:
  – Either all see {a,a,a,w,w}
  – Or all see {a,a,w,w,w}
• Traitor double-votes
  – Some see {a,a,a,w,w} and some {a,a,w,w,w}
• Traitor withholds some vote(s)
  – Some see {a,a,w,w}, perhaps others see {a,a,a,w,w} and still others see {a,a,w,w,w}
• Notice that the traitor can’t manipulate the votes of loyal Generals!
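
A small sketch enumerating these cases under the tie-breaking rule stated earlier (“in a tie vote, we attack”); the vote encoding and the scenario names are illustrative.

```python
# Round-1 tallies a loyal General might collect, given that p and q vote
# attack ("a"), s and t vote wait ("w"), and the traitor r misbehaves in
# one of the ways listed above. Ties go to "attack".
from collections import Counter

def decide(votes):
    c = Counter(votes)
    return "attack" if c["a"] >= c["w"] else "wait"

loyal = ["a", "a", "w", "w"]                   # votes of p, q, s, t
scenarios = {
    "traitor votes attack":        [loyal + ["a"]],
    "traitor votes wait":          [loyal + ["w"]],
    "traitor double-votes":        [loyal + ["a"], loyal + ["w"]],          # different Generals see different votes
    "traitor withholds from some": [loyal, loyal + ["a"], loyal + ["w"]],   # some see 4 votes, others 5
}
for name, views in scenarios.items():
    print(name, "->", [decide(v) for v in views])
# Whenever the traitor treats Generals differently, loyal Generals can reach
# different round-1 decisions; that is what the round-2 witness messages catch.
```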

Page 43:

What can we do?

• Clearly we can’t decide yet; some loyal Generals might have contradictory data
  – In fact if anyone has 3 votes to attack, they can already “decide”.
  – Similarly, anyone with just 4 votes can decide
  – But with 3 votes to “wait” a General isn’t sure (one could be a traitor…)
• So: in round 2, each sends out “witness” messages: here’s what I saw in round 1
  – “General Smith sent me: attack (signed, Smith)”

Page 44:

Digital signatures

• These require a cryptographic system
  – For example, RSA
  – Each player has a secret (private) key K⁻¹ and a public key K.
    • She can publish her public key
  – RSA gives us a single “encrypt” function:
    • Encrypt(Encrypt(M,K),K⁻¹) = Encrypt(Encrypt(M,K⁻¹),K) = M
    • Encrypt a hash of the message to “sign” it
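
A sketch of signing and verifying a General’s vote with RSA using the third-party `cryptography` package; PSS padding and SHA-256 are modern choices layered on the raw “encrypt a hash with the private key” idea above, and the vote text is illustrative.

```python
# Sign-then-verify sketch using RSA (pip install cryptography).
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes
from cryptography.exceptions import InvalidSignature

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

vote = b"attack (signed) Smith"
pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH)
signature = private_key.sign(vote, pss, hashes.SHA256())    # only the key holder can produce this

try:
    public_key.verify(signature, vote, pss, hashes.SHA256())
    print("signature checks out")
except InvalidSignature:
    print("forged or modified message")

# A traitor relaying the vote cannot alter it without invalidating the signature:
try:
    public_key.verify(signature, b"wait (signed) Smith", pss, hashes.SHA256())
except InvalidSignature:
    print("tampering detected")
```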

Page 45:

With such a system

• A can send a message to B that only A could have sent
  – A just encrypts the body with her private key
• … or one that only B can read
  – A encrypts it with B’s public key
• Or can sign it as proof she sent it
  – B can recompute the signature and decrypt A’s hashed signature to see if they match
• These capabilities limit what our traitor can do: he can’t forge or modify a message

Page 46:

A timeline perspective

• In the second round, if the traitor didn’t behave identically for all Generals, we can weed out his faulty votes

[Timeline diagram: processes p, q, r, s, t exchanging round-2 witness messages]

Page 47:

A timeline perspective

• We attack!

[Timeline diagram: p, q, s, t each conclude “Attack!!”; the traitor r: “Damn! They’re on to me”]

Page 48:

Traitor is stymied

• Our loyal generals can deduce that the decision was to attack

• Traitor can’t disrupt this…
  – Either forced to vote legitimately, or is caught
  – But costs were steep!
    • (f+1) · n² messages!
    • Rounds can also be slow…
      – “Early stopping” protocols: min(t+2, f+1) rounds; t is the true number of faults

Page 49:

Recent work with Byzantine model

• Focus is typically on using it to secure particularly sensitive, ultra-critical services
  – For example the “certification authority” that hands out keys in a domain
  – Or a database maintaining top-secret data

• Researchers have suggested that for such purposes, a “Byzantine Quorum” approach can work well

• They are implementing this in real systems by simulating rounds using various tricks

Page 50:

Byzantine Quorums

• Arrange servers into an n x n array
  – Idea is that any row or column is a quorum
  – Then use Byzantine Agreement to access that quorum, doing a read or a write
• Separately, Castro and Liskov have tackled a related problem, using BA to secure a file server
  – By keeping BA out of the critical path, can avoid most of the delay BA normally imposes
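
A toy sketch of the grid layout described above; the server names and the n = 4 grid are made up, and the check simply confirms that any row quorum and any column quorum always share a server.

```python
# Byzantine-quorum layout sketch: arrange n*n servers in a grid and treat
# any full row or full column as a quorum.
n = 4
servers = [[f"s{r}{c}" for c in range(n)] for r in range(n)]

def row_quorum(r):
    return set(servers[r])

def col_quorum(c):
    return {servers[r][c] for r in range(n)}

# Any row quorum and any column quorum intersect in exactly one server,
# so operations routed through different quorums still meet somewhere.
for r in range(n):
    for c in range(n):
        assert len(row_quorum(r) & col_quorum(c)) == 1
print("every row quorum meets every column quorum")
```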

Page 51:

Split secrets

• In fact BA algorithms are just the tip of a broader “coding theory” iceberg

• One exciting idea is called a “split secret”
  – Idea is to spread a secret among n servers so that any k can reconstruct the secret, but no individual actually has all the bits
  – Protocol lets the client obtain the “shares” without the servers seeing one another’s messages
  – The servers keep but can’t read the secret!
• Question: In what ways is this better than just encrypting a secret?

Page 52:

How split secrets work

• They build on a famous result
  – With k+1 distinct points you can uniquely identify an order-k polynomial
    • i.e. 2 points determine a line
    • 3 points determine a unique quadratic
  – The polynomial is the “secret”
  – And the servers themselves have the points – the “shares”
  – With coding theory the shares are made just redundant enough to overcome n-k faults
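
A sketch of the polynomial idea (Shamir-style secret splitting) over a prime field; the modulus, the threshold k = 3, and the n = 5 servers are illustrative choices, and the coding-theory redundancy mentioned above is not shown.

```python
# Split-secret sketch: the secret is the constant term of a random
# degree-(k-1) polynomial; each server holds one point ("share");
# any k shares reconstruct the polynomial, fewer reveal nothing.
import random

P = 2**61 - 1          # prime modulus for the field (illustrative)

def make_shares(secret, n, k):
    coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
    def f(x): return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]       # one share per server

def reconstruct(shares):
    # Lagrange interpolation at x = 0 recovers the constant term (the secret).
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

shares = make_shares(secret=123456789, n=5, k=3)
print(reconstruct(shares[:3]))     # any 3 of the 5 shares -> 123456789
print(reconstruct(shares[1:4]))    # a different 3 also work
```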

Page 53:

Byzantine Broadcast (BB)

• Many classical research results use Byzantine Agreement to implement a form of fault-tolerant multicast
  – To send a message I initiate “agreement” on that message
  – We end up agreeing on content and ordering w.r.t. other messages

• Used as a primitive in many published papers

Page 54:

Pros and cons to BB

• On the positive side, the primitive is very powerful
  – For example this is the core of the Castro and Liskov technique
• But on the negative side, BB is slow
  – We’ll see ways of doing fault-tolerant multicast that run at 150,000 small messages per second
  – BB: more like 5 or 10 per second

• The right choice for infrequent, very sensitive actions… but wrong if performance matters

Page 55:

Take-aways?

• Fault-tolerance matters in many systems
  – But we need to agree on what a “fault” is
  – Extreme models lead to high costs!
• Common to reduce fault-tolerance to some form of data or “state” replication
  – In this case fault-tolerance is often provided by some form of broadcast
  – Mechanism for detecting faults is also important in many systems.
    • Timeout is common… but can behave inconsistently
    • “View change” notification is used in some systems. They typically implement a fault agreement protocol.