124
Reliable Distributed Systems Logical Clocks

Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Embed Size (px)

Citation preview

Page 1: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Reliable Distributed Systems

Logical Clocks

Page 2: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Time and Ordering

We tend to casually use temporal concepts Example: “membership changes dynamically”

Implies a notion of time: first membership was X, later membership was Y

Challenge: relating local notion of time in a single process to a global notion of time

Will discuss this issue before developing multicast delivery ordering options in more detail

Page 3: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Time in Distributed Systems

Three notions of time: Time seen by external observer. A global

clock of perfect accuracy Time seen on clocks of individual processes.

Each has its own clock, and clocks may drift out of sync.

Logical notion of time: event a occurs before event b and this is detectable because information about a may have reached b.

Page 4: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

External Time

The “gold standard” against which many protocols are defined Not implementable: no system can avoid

uncertain details that limit temporal precision!

Use of external time is also risky: many protocols that seek to provide properties defined by external observers are extremely costly and, sometimes, are unable to cope with failures

Page 5: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Time seen on internal clocks Most workstations have reasonable clocks Clock synchronization is the big problem (will

visit topic later in course): clocks can drift apart and resynchronization, in software, is inaccurate

Unpredictable speeds a feature of all computing systems, hence can’t predict how long events will take (e.g. how long it will take to send a message and be sure it was delivered to the destination)

Page 6: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Logical notion of time

Has no clock in the sense of “real-time” Focus is on definition of the “happens

before” relationship: “a happens before b” if: both occur at same place and a finished

before b started, or a is the send of message m, b is the

delivery of m, or a and b are linked by a chain of such events

Page 7: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Logical time as a time-space picture

a

b

d

c

p0

p1

p2

p3

a, b are concurrent

c happens after a, b

d happens after a, b, c

Page 8: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Notation

Use “arrow” to represent happens-before relation

For previous slide: a c, b c, c d hence, a d, b d a, b are concurrent

Also called the “potential causality” relation

Page 9: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Logical clocks

Proposed by Lamport to represent causal order

Write: LT(e) to denote logical timestamp of an event e, LT(m) for a timestamp on a message, LT(p) for the timestamp associated with process p

Algorithm ensures that if a b, then LT(a) < LT(b)

Page 10: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Algorithm

Each process maintains a counter, LT(p) For each event other than message delivery:

set LT(p) = LT(p)+1 When sending message m, LT(m) = LT(p) When delivering message m to process q, set

LT(q) = max(LT(m), LT(q))+1

Page 11: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Illustration of logical timestamps

LT(a)=1

LT(b)=1

g

LT(e)=3

LT(f)=4

p0

p1

p2

p3

LT(d)=2

LT(c)=1

Page 12: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Concurrent events

If a, b are concurrent, LT(a) and LT(b) may have arbitrary values!

Thus, logical time lets us determine that a potentially happened before b, but not that a definitely did so!

Example: processes p and q never communicate. Both will have events 1, 2, ... but even if LT(e)<LT(e’) e may not have happened before e’

Page 13: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

What about “real-time” clocks? Accuracy of clock synchronization is

ultimately limited by uncertainty in communication latencies

These latencies are “large” compared with speed of modern processors (typical latency may be 35us to 500us, time for thousands of instructions)

Limits use of real-time clocks to “coarse-grained” applications

Page 14: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Interpretations of temporal terms

Understand now that “a happens before b” means that information can flow from a to b

Understand that “a is concurrent with b” means that there is no information flow between a and b

What about the notion of an “instant in time”, over a set of processes?

Page 15: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Chandy and Lamport: Consistent cuts Draw a line across a set of processes Line cuts each execution Consistent cut has property that the set

of included events is closed under happens-before relation: If the cut “includes” event b, and event a

happens before b, then the cut also includes event a

In practice, this means that every “delivered” message was sent within the cut

Page 16: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Illustration of consistent cuts

a

b

g

e

p0

p1

p2

p3

d

c

f

green cuts are consistent

red cut is inconsistent

Page 17: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Intuition into consistent cuts A consistent cut is a state that could

have arisen during execution, depending on how processes were scheduled

An inconsistent cut could not have arisen during execution

One way to see this: think of process timelines as rubber bands. Scheduler stretches or compresses time but can’t deliver message before it was sent

Page 18: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Illustration of consistent cuts

a

b

g

e

p0

p1

p2

p3

d

c

f

green cuts are consistent

stretch

shrink

Page 19: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

There may be many consistent cuts through any point in the execution

a

b

g

e

p0

p1

p2

p3

d

c

f

possible cuts define a range of potentially instantaneous system states

Page 20: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Illustration of consistent cuts

a

b

g

e

p0

p1

p2

p3

d

c

f

to make the red cut “straight” message f has to travel backwards in time!

Page 21: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Reliable Distributed Systems

Quorums

Page 22: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Quorum replication We developed a whole architecture

based on our four-step recipe But there is a second major approach

that also yields a complete group communication framework and solutions Based on “quorum” read and write

operations Omits notion of process group views

Page 23: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Today’s topic Quorum methods from a mile high

Don’t have time to be equally detailed We’ll explore

How the basic read/update protocol works

Failure considerations State machine replication (a form of lock-

step replication for deterministic objects) Performance issues

Page 24: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

A peek at the conclusion These methods are

Widely known and closely tied to consensus Perhaps, easier to implement

But they have serious drawbacks: Need deterministic components Are drastically slower (10s-100s of events/second)

Big win? Recent systems combine quorums with Byzantine

Agreement for ultra-sensitive databases

Page 25: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Static membership Subsets of a known set of processes

E.g. a cluster of five machines, each running replica of a database server

Machines can crash or recover but don’t depart permanently and new ones don’t join “out of the blue”

In practice the dynamic membership systems can easily be made to work this way… but usually aren’t

Page 26: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Static membership example

p

q

r

s

t

client

Qread = 2, Qwrite = 4

read write read Write fails

To do a read, this client (or even a

group member) must access at least 2

replicas

To do a write, must update at least 4

replicas

This write will fail: the client only manages to contact 2 replicas and must

“abort” the operation (we use this terminology even though we aren’t

doing transactions)

Page 27: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Quorums

Must satisfy two basic rules1. A quorum read should “intersect” any

prior quorum write at >= 1 processes2. A quorum write should also intersect

any other quorum write So, in a group of size N:

1. Qr + Qw > N, and

2. Qw + Qw > N

Page 28: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Versions of replicated data

Replicated data items have “versions”, and these are numbered I.e. can’t just say “Xp=3”. Instead say

that Xp has timestamp [7,q] and value 3 Timestamp must increase monotonically

and includes a process id to break ties This is NOT the pid of the update

source… we’ll see where it comes from

Page 29: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Doing a read is easy Send RPCs until Qr processes reply Then use the value with the largest

timestamp Break ties by looking at the pid For example

[6,x] < [9,a] (first look at the “time”) [7,p] < [7,q] (but use pid as a tie-breaker)

Even if a process owns a replica, it can’t just trust it’s own data. Every “read access” must collect Qr values first…

Page 30: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Doing a write is trickier First, we can’t support incremental updates

(x=x+1), since no process can “trust” its own replica.

Such updates require a read followed by a write. When we initiate the write, we don’t know if

we’ll succeed in updating a quorum of processes wE can’t update just some subset; that could confuse a

reader Hence need to use a commit protocol

Moreover, must implement a mechanism to determine the version number as part of the protocol. We’ll use a form of voting

Page 31: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

The sequence of events1. Propose the write: “I would like to set X=3”

2. Members “lock” the variable against reads, put the request into a queue of pending writes (must store this on disk or in some form of crash-tolerant memory), and send back:

“OK. I propose time [t,pid]”

Here, time is a logical clock. Pid is the member’s own pid

3. Initiator collects replies, hoping to receive Qw (or more)

Qw OKs

Compute maximum of proposed [t,pid] pairs.

Commit at that time

Abort

< Qw OKs

Page 32: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Which votes got counted? It turns out that we also need to know

which votes were “counted” E.g. suppose there are five group members,

A…E and they vote: {[17,A] [19,B] [20,C] [200,D] [21,E]}

But somehow the vote from D didn’t get through and the maximum is picked as [21,E]

We’ll need to also remember that the votes used to make this decision were from {A,B,C,E}

Page 33: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

What’s with the [t,pid] stuff? Lamport’s suggestion: use logical clocks

Each process receives an update message Places it in an ordered queue And responds with a proposed time: [t,pid]

using its own process id for the time The update source takes the maximum

Commit message says “commit at [t,pid]” Group members who’s votes were considered

deliver committed updates in timestamp order Group members who votes were not

considered discard the update and don’t do it, at all.

Page 34: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Reliable Distributed Systems

Models for Distributed Computing. The Fischer,

Lynch and Paterson Result

Page 35: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Who needs failure “models”? Role of a failure model

Lets us reduce fault-tolerance to a mathematical question

In model M, can problem P be solved? How costly is it to do so? What are the best solutions? What tradeoffs arise?

And clarifies what we are saying Lacking a model, confusion is common

Page 36: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Categories of failures

Crash faults, message loss These are common in real systems Crash failures: process simply stops,

and does nothing wrong that would be externally visible before it stops

These faults can’t be directly detected

Page 37: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Categories of failures Fail-stop failures

These require system support Idea is that the process fails by crashing,

and the system notifies anyone who was talking to it

With fail-stop failures we can overcome message loss by just resending packets, which must be uniquely numbered

Easy to work with… but rarely supported

Page 38: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Categories of failures Non-malicious Byzantine failures

This is the best way to understand many kinds of corruption and buggy behaviors

Program can do pretty much anything, including sending corrupted messages

But it doesn’t do so with the intention of screwing up our protocols

Unfortunately, a pretty common mode of failure

Page 39: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Categories of failure Malicious, true Byzantine, failures

Model is of an attacker who has studied the system and wants to break it

She can corrupt or replay messages, intercept them at will, compromise programs and substitute hacked versions

This is a worst-case scenario mindset In practice, doesn’t actually happen Very costly to defend against; typically used

in very limited ways (e.g. key mgt. server)

Page 40: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Models of failure

Question here concerns how failures appear in formal models used when proving things about protocols

Think back to Lamport’s happens-before relationship, Model already has processes, messages,

temporal ordering Assumes messages are reliably

delivered

Page 41: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Recall: Two kinds of models We tend to work within two models

Asynchronous model makes no assumptions about time

Lamport’s model is a good fit Processes have no clocks, will wait indefinitely

for messages, could run arbitrarily fast/slow Distributed computing at an “eons” timescale

Synchronous model assumes a lock-step execution in which processes share a clock

Page 42: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Adding failures in Lamport’s model

Also called the asynchronous model Normally we just assume that a failed

process “crashes:” it stops doing anything Notice that in this model, a failed process is

indistinguishable from a delayed process In fact, the decision that something has

failed takes on an arbitrary flavor Suppose that at point e in its execution, process p

decides to treat q as faulty….”

Page 43: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

What about the synchronous model?

Here, we also have processes and messages But communication is usually assumed to be

reliable: any message sent at time t is delivered by time t+

Algorithms are often structured into rounds, each lasting some fixed amount of time , giving time for each process to communicate with every other process

In this model, a crash failure is easily detected

When people have considered malicious failures, they often used this model

Page 44: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Neither model is realistic Value of the asynchronous model is that

it is so stripped down and simple If we can do something “well” in this model

we can do at least as well in the real world So we’ll want “best” solutions

Value of the synchronous model is that it adds a lot of “unrealistic” mechanism If we can’t solve a problem with all this help,

we probably can’t solve it in a more realistic setting!

So seek impossibility results

Page 45: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Tougher failure models We’ve focused on crash failures

In the synchronous model these look like a “farewell cruel world” message

Some call it the “failstop model”. A faulty process is viewed as first saying goodbye, then crashing

What about tougher kinds of failures? Corrupted messages Processes that don’t follow the algorithm Malicious processes out to cause havoc?

Page 46: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Here the situation is much harder

Generally we need at least 3f+1 processes in a system to tolerate f Byzantine failures For example, to tolerate 1 failure we

need 4 or more processes We also need f+1 “rounds” Let’s see why this happens

Page 47: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Byzantine scenario Generals (N of them) surround a city

They communicate by courier Each has an opinion: “attack” or “wait”

In fact, an attack would succeed: the city will fall. Waiting will succeed too: the city will surrender. But if some attack and some wait, disaster ensues

Some Generals (f of them) are traitors… it doesn’t matter if they attack or wait, but we must prevent them from disrupting the battle Traitor can’t forge messages from other Generals

Page 48: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Byzantine scenario

Attack!

Wait…

Attack!

Attack! No, wait! Surrender!

Wait…

Page 49: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

A timeline perspective

Suppose that p and q favor attack, r is a traitor and s and t favor waiting… assume that in a tie vote, we attack

p

q

r

s

t

Page 50: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

A timeline perspective

After first round collected votes are: {attack, attack, wait, wait, traitor’s-

vote}

p

q

r

s

t

Page 51: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

What can the traitor do?

Add a legitimate vote of “attack” Anyone with 3 votes to attack knows

the outcome Add a legitimate vote of “wait”

Vote now favors “wait” Or send different votes to different

folks Or don’t send a vote, at all, to some

Page 52: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Outcomes? Traitor simply votes:

Either all see {a,a,a,w,w} Or all see {a,a,w,w,w}

Traitor double-votes Some see {a,a,a,w,w} and some {a,a,w,w,w}

Traitor withholds some vote(s) Some see {a,a,w,w}, perhaps others see

{a,a,a,w,w,} and still others see {a,a,w,w,w} Notice that traitor can’t manipulate votes

of loyal Generals!

Page 53: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

What can we do? Clearly we can’t decide yet; some loyal

Generals might have contradictory data In fact if anyone has 3 votes to attack, they

can already “decide”. Similarly, anyone with just 4 votes can decide But with 3 votes to “wait” a General isn’t sure

(one could be a traitor…) So: in round 2, each sends out “witness”

messages: here’s what I saw in round 1 General Smith send me: “attack(signed) Smith”

Page 54: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Digital signatures These require a cryptographic system

For example, RSA Each player has a secret (private) key K-1

and a public key K. She can publish her public key

RSA gives us a single “encrypt” function: Encrypt(Encrypt(M,K),K-1) =

Encrypt(Encrypt(M,K-1),K) = M Encrypt a hash of the message to “sign” it

Page 55: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

With such a system A can send a message to B that only A

could have sent A just encrypts the body with her private key

… or one that only B can read A encrypts it with B’s public key

Or can sign it as proof she sent it B can recompute the signature and decrypt

A’s hashed signature to see if they match These capabilities limit what our traitor

can do: he can’t forge or modify a message

Page 56: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

A timeline perspective

In second round if the traitor didn’t behave identically for all Generals, we can weed out his faulty votes

p

q

r

s

t

Page 57: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

A timeline perspective

We attack!

p

q

r

s

t

Attack!!

Attack!!

Attack!!

Attack!!

Damn! They’re on to me

Page 58: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Traitor is stymied Our loyal generals can deduce that the

decision was to attack Traitor can’t disrupt this…

Either forced to vote legitimately, or is caught

But costs were steep! (f+1)*n2 ,messages! Rounds can also be slow….

“Early stopping” protocols: min(t+2, f+1) rounds; t is true number of faults

Page 59: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Recent work with Byzantine model

Focus is typically on using it to secure particularly sensitive, ultra-critical services For example the “certification authority” that

hands out keys in a domain Or a database maintaining top-secret data

Researchers have suggested that for such purposes, a “Byzantine Quorum” approach can work well

They are implementing this in real systems by simulating rounds using various tricks

Page 60: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Split secrets In fact BA algorithms are just the tip of a

broader “coding theory” iceberg One exciting idea is called a “split secret”

Idea is to spread a secret among n servers so that any k can reconstruct the secret, but no individual actually has all the bits

Protocol lets the client obtain the “shares” without the servers seeing one-another’s messages

The servers keep but can’t read the secret! Question: In what ways is this better than

just encrypting a secret?

Page 61: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Take-aways? Fault-tolerance matters in many systems

But we need to agree on what a “fault” is Extreme models lead to high costs!

Common to reduce fault-tolerance to some form of data or “state” replication In this case fault-tolerance is often provided

by some form of broadcast Mechanism for detecting faults is also

important in many systems. Timeout is common… but can behave inconsistently “View change” notification is used in some systems.

They typically implement a fault agreement protocol.

Page 62: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Reliable Distributed Systems

Fault Tolerance (Recoverability High

Availability)

Page 63: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Reliability and transactions Transactions are well matched to

database model and recoverability goals

Transactions don’t work well for non-database applications (general purpose O/S applications) or availability goals (systems that must keep running if applications fail)

When building high availability systems, encounter replication issue

Page 64: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Types of reliability Recoverability

Server can restart without intervention in a sensible state

Transactions do give us this High availability

System remains operational during failure Challenge is to replicate critical data

needed for continued operation

Page 65: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Replicating a transactional server Two broad approaches

Just use distributed transactions to update multiple copies of each replicated data item

We already know how to do this, with 2PC Each server has “equal status”

Somehow treat replication as a special situation

Leads to a primary server approach with a “warm standby”

Page 66: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Replication with 2PC Our goal will be “1-copy serializability”

Defined to mean that the multi-copy system behaves indistinguishably from a single-copy system

Considerable form and theoretical work has been done on this

As a practical matter Replicate each data item Transaction manager

Reads any single copy Updates all copies

Page 67: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Observation Notice that transaction manager must

know where the copies reside In fact there are two models

Static replication set: basically, the set is fixed, although some members may be down

Dynamic: the set changes while the system runs, but only has operational members listed within it

Today stick to the static case

Page 68: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Replication and Availability

A series of potential issues How can we update an object during

periods when one of its replicas may be inaccessible?

How can 2PC protocol be made fault-tolerant?

A topic we’ll study in more depth But the bottom line is: we can’t!

Page 69: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Usual responses?

Quorum methods: Each replicated object has an update

and a read quorum Designed so Qu+Qr > # replicas and

Qu+Qu > # replicas Idea is that any read or update will

overlap with the last update

Page 70: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Quorum example X is replicated at {a,b,c,d,e} Possible values?

Qu = 1, Qr = 5 (violates QU+Qu > 5) Qu = 2, Qr = 4 (same issue) Qu = 3, Qr = 3 Qu = 4, Qr = 2 Qu = 5, Qr = 1 (violates availability)

Probably prefer Qu=4, Qr=2

Page 71: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Things to notice Even reading a data item requires that

multiple copies be accessed! This could be much slower than normal

local access performance Also, notice that we won’t know if we

succeeded in reaching the update quorum until we get responses Implies that any quorum replication scheme

needs a 2PC protocol to commit

Page 72: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Next issue?

Now we know that we can solve the availability problem for reads and updates if we have enough copies

What about for 2PC? Need to tolerate crashes before or

during runs of the protocol A well-known problem

Page 73: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Availability of 2PC

It is easy to see that 2PC is not able to guarantee availability Suppose that manager talks to 3

processes And suppose 1 process and manager

fail The other 2 are “stuck” and can’t

terminate the protocol

Page 74: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

What can be done? We’ll revisit this issue soon Basically,

Can extend to a 3PC protocol that will tolerate failures if we have a reliable way to detect them

But network problems can be indistinguishable from failures

Hence there is no commit protocol that can tolerate failures

Anyhow, cost of 3PC is very high

Page 75: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

A quandry?

We set out to replicate data for increased availability

And concluded that Quorum scheme works for updates But commit is required And represents a vulnerability

Other options?

Page 76: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Other options

We mentioned primary-backup schemes

These are a second way to solve the problem

Based on the log at the data manager

Page 77: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Server replication Suppose the primary sends the log

to the backup server It replays the log and applies

committed transactions to its replicated state

If primary crashes, the backup soon catches up and can take over

Page 78: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Primary/backup

primary

backup

Clients initially connected to primary, which keeps backup up to date. Backup tracks log

log

Page 79: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Primary/backup

primary

backup

Primary crashes. Backup sees the channel break, applies committed updates. But it may have missedthe last few updates!

Page 80: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Primary/backup

primary

backup

Clients detect the failure and reconnect to backup. Butsome clients may have “gone away”. Backup state couldbe slightly stale. New transactions might suffer from this

Page 81: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Issues? Under what conditions should

backup take over Revisits the consistency problem seen

earlier with clients and servers Could end up with a “split brain”

Also notice that still needs 2PC to ensure that primary and backup stay in same states!

Page 82: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Split brain: reminder

primary

backup

Clients initially connected to primary, which keeps backup up to date. Backup follows log

log

Page 83: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Split brain: reminder

Transient problem causes some links to break but not all.Backup thinks it is now primary, primary thinks backup is down

primary

backup

Page 84: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Split brain: reminder

Some clients still connected to primary, but one has switchedto backup and one is completely disconnected from both

primary

backup

Page 85: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Implication?

A strict interpretation of ACID leads to conclusions that There are no ACID replication

schemes that provide high availability Most real systems solve by

weakening ACID

Page 86: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Real systems

They use primary-backup with logging

But they simply omit the 2PC Server might take over in the wrong

state (may lag state of primary) Can use hardware to reduce or

eliminate split brain problem

Page 87: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

How does hardware help? Idea is that primary and backup share a

disk Hardware is configured so only one can

write the disk If server takes over it grabs the “token” Token loss causes primary to shut down

(if it hasn’t actually crashed)

Page 88: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Reconciliation This is the problem of fixing the transactions

impacted by lack of 2PC Usually just a handful of transactions

They committed but backup doesn’t know because never saw commit record

Later. server recovers and we discover the problem

Need to apply the missing ones Also causes cascaded rollback Worst case may require human intervention

Page 89: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Summary Reliability can be understood in terms of

Availability: system keeps running during a crash

Recoverability: system can recover automatically

Transactions are best for latter Some systems need both sorts of

mechanisms, but there are “deep” tradeoffs involved

Page 90: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Replication and High Availability All is not lost! Suppose we move away from the

transactional model Can we replicate data at lower cost and

with high availability? Leads to “virtual synchrony” model Treats data as the “state” of a group of

participating processes Replicated update: done with multicast

Page 91: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Steps to a solution First look more closely at 2PC, 3PC, failure

detection 2PC and 3PC both “block” in real settings But we can replace failure detection by

consensus on membership Then these protocols become non-blocking

(although solving a slightly different problem) Generalized approach leads to ordered

atomic multicast in dynamic process groups

Page 92: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Non-blocking Commit

Goal: a protocol that allows all operational processes to terminate the protocol even if some subset crash

Needed if we are to build high availability transactional systems (or systems that use quorum replication)

Page 93: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Definition of problem Given a set of processes, one of which

wants to initiate an action Participants may vote for or against the

action Originator will perform the action only if

all vote in favor; if any votes against (or don’t vote), we will “abort” the protocol and not take the action

Goal is all-or-nothing outcome

Page 94: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Non-triviality Want to avoid solutions that do nothing

(trivial case of “all or none”) Would like to say that if all vote for

commit, protocol will commit... but in distributed systems we can’t be sure

votes will reach the coordinator! any “live” protocol risks making a mistake and

counting a live process that voted to commit as a failed process, leading to an abort

Hence, non-triviality condition is hard to capture

Page 95: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Typical protocol Coordinator asks all processes if

they can take the action Processes decide if they can and

send back “ok” or “abort” Coordinator collects all the

answers (or times out) Coordinator computes outcome

and sends it back

Page 96: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Commit protocol illustrated

ok to commit?

Page 97: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Commit protocol illustrated

ok to commit?

ok with us

Page 98: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Commit protocol illustrated

ok to commit?

ok with uscommit

Note: garbage collection protocol not shown here

Page 99: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Failure issues So far, have implicitly assumed that

processes fail by halting (and hence not voting)

In real systems a process could fail in arbitrary ways, even maliciously

This has lead to work on the “Byzantine generals” problem, which is a variation on commit set in a “synchronous” model with malicious failures

Page 100: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Failure model impacts costs! Byzantine model is very costly: 3t+1

processes needed to overcome t failures, protocol runs in t+1 rounds

This cost is unacceptable for most real systems, hence protocols are rarely used

Main area of application: hardware fault-tolerance, security systems

For these reasons, we won’t study such protocols

Page 101: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Commit with simpler failure model Assume processes fail by halting Coordinator detects failures (unreliably)

using timouts. It can make mistakes! Now the challenge is to terminate the

protocol if the coordinator fails instead of, or in addition to, a participant!

Page 102: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Commit protocol illustrated

ok to commit?

ok with us… times outabort!

Note: garbage collection protocol not shown here

crashed!

Page 103: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Example of a hard scenario Coordinator starts the protocol One participant votes to abort, all

others to commit Coordinator and one participant

now fail... we now lack the information to

correctly terminate the protocol!

Page 104: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Commit protocol illustrated

ok to commit?

okdecision unknown!

vote unknown!ok

Page 105: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Example of a hard scenario Problem is that if coordinator told the

failed participant to abort, all must abort

If it voted for commit and was told to commit, all must commit

Surviving participants can’t deduce the outcome without knowing how failed participant voted

Thus protocol “blocks” until recovery occurs

Page 106: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Skeen: Three-phase commit

Seeks to increase availability Makes an unrealistic assumption

that failures are accurately detectable

With this, can terminate the protocol even if a failure does occur

Page 107: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Skeen: Three-phase commit Coordinator starts protocol by sending request Participants vote to commit or to abort Coordinator collects votes, decides on

outcome Coordinator can abort immediately To commit, coordinator first sends a “prepare

to commit” message Participants acknowledge, commit occurs

during a final round of “commit” messages

Page 108: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Three phase commit protocol illustrated

ok to commit?

ok ....

commit

prepare to commit

prepared...

Note: garbage collection protocol not shown here

Page 109: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Observations about 3PC If any process is in “prepare to commit”

all voted for commit Protocol commits only when all

surviving processes have acknowledged prepare to commit

After coordinator fails, it is easy to run the protocol forward to commit state (or back to abort state)

Page 110: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Assumptions about failures If the coordinator suspects a failure, the

failure is “real” and the faulty process, if it later recovers, will know it was faulty

Failures are detectable with bounded delay

On recovery, process must go through a reconnection protocol to rejoin the system! (Find out status of pending protocols that terminated while it was not operational)

Page 111: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Problems with 3PC With realistic failure detectors (that can make

mistakes), protocol still blocks! Bad case arises during “network partitioning”

when the network splits the participating processes into two or more sets of operational processes

Can prove that this problem is not avoidable: there are no non-blocking commit protocols for asynchronous networks

Page 112: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Situation in practical systems? Most use protocols based on 2PC: 3PC is more

costly and ultimately, still subject to blocking! Need to extend with a form of garbage

collection mechanism to avoid accumulation of protocol state information (can solve in the background)

Some systems simply accept the risk of blocking when a failure occurs

Others reduce the consistency property to make progress at risk of inconsistency with failed proc.

Page 113: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Process groups To overcome cost of replication will

introduce dynamic process group model (processes that join, leave while system is running) Will also relax our consistency goal: seek

only consistency within a set of processes that all remain operational and members of the system

In this model, 3PC is non-blocking! Yields an extremely cheap replication

scheme!

Page 114: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Failure detection Basic question: how to detect a failure

Wait until the process recovers. If it was dead, it tells you

I died, but I feel much better now Could be a long wait

Use some form of probe But might make mistakes

Substitute agreement on membership Now, failure is a “soft” concept Rather than “up” or “down” we think about

whether a process is behaving acceptably in the eyes of peer processes

Page 115: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Architecture

Membership Agreement, “join/leave” and “P seems to be unresponsive”

3PC-like protocols use membership changes instead of failure notification

Applications use replicated data for Applications use replicated data for high availabilityhigh availability

Page 116: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Issues? How to “detect” failures

Can use timeout Or could use other system monitoring tools

and interfaces Sometimes can exploit hardware

Tracking membership Basically, need a new replicated service System membership “lists” are the data it

manages We’ll say it takes join/leave requests as input

and produces “views” as output

Page 117: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Architecture

GMS

A

B

C

D

join leave

join

A seems to have failed

{A}

{A,B,D}

{A,D}

{A,D,C}

{D,C}

X Y Z

Application processes

GMS processes

membership views

Page 118: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Issues Group membership service (GMS) has

just a small number of members This core set will tracks membership for a

large number of system processes Internally it runs a group membership

protocol (GMP) Full system membership list is just

replicated data managed by GMS members, updated using multicast

Page 119: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

GMP design What protocol should we use to

track the membership of GMS Must avoid split-brain problem Desire continuous availability

We’ll see that a version of 3PC can be used

But can’t “always” guarantee liveness

Page 120: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Class Project Analysis

Page 121: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Approach 1

primary

backup

Clients initially connected to primary, which keeps backup up to date. Backup follows log

log

Page 122: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Approach 1

primary

backup

Clients detect the failure and reconnect to backup. Butsome clients may have “gone away”. Backup state couldbe slightly stale. New transactions might suffer from this

log

Page 123: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Approach 2

primary

backup

Clients initially connected to primary, but through an interface.Backup follows log

log

Page 124: Reliable Distributed Systems Logical Clocks. Time and Ordering We tend to casually use temporal concepts Example: “membership changes dynamically” Implies

Approach 2

primary

backup

The interface module detects the failure and automaticallyreconnects to backup. But some clients may have “gone away”. Backup state could be slightly stale. New transactions might suffer from this

log