Lecture 12 - Distributed Systems

Agreement in Distributed Systems

CS 188, Distributed Systems, Winter 2015
February 19, 2015


Introduction

We frequently want to get a set of nodes in a distributed system to agree

Commitment protocols and mutual exclusion are particular cases

The approaches we discussed for those work in limited situations

In general, when can we reach agreement in a distributed system?


Basics of Agreement Protocols

What is agreement?
What are the necessary conditions for agreement?


What Do We Mean By Agreement?

In the simplest case, can n processors agree that a variable takes on value 0 or 1?
  Only non-faulty processors need agree
More complex agreements can be built from this simple agreement


Conditions for Agreement Protocols

Consistency
  All participants agree on the same value, and decisions are final
Validity
  Participants agree on a value at least one of them wanted
Termination/Progress
  All participants choose a value in a finite number of steps
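
The three conditions can be read as checks on the outcome of one protocol run. The sketch below is only an illustration: the function name `check_agreement` and the dictionary layout are assumptions made for this example, not part of any protocol in the lecture.

```python
from typing import Dict, Optional

def check_agreement(proposed: Dict[str, int],
                    decided: Dict[str, Optional[int]]) -> Dict[str, bool]:
    """Check one finished run against the three conditions.

    proposed maps each non-faulty participant to the value it wanted;
    decided maps it to its final decision (None means it never decided).
    """
    decisions = list(decided.values())
    finished = [d for d in decisions if d is not None]
    return {
        # Termination: every participant reached a decision.
        "termination": len(finished) == len(decisions),
        # Consistency: no two participants decided differently.
        "consistency": len(set(finished)) <= 1,
        # Validity: every decided value was wanted by at least one participant.
        "validity": all(d in proposed.values() for d in finished),
    }

wanted = {"a": 0, "b": 1, "c": 1}
good = check_agreement(wanted, {"a": 1, "b": 1, "c": 1})
split = check_agreement(wanted, {"a": 0, "b": 1, "c": 1})
```

Here `good` satisfies all three conditions, while `split` terminates with valid values but fails consistency because the participants decided differently.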


Challenges to Agreement

Delays
  In message delivery
  In nodes responding to messages
Failures
  And recovery from failures
Lies by participants
  Or innocent errors that have similar effects


Failures and Agreement

Failures make agreement difficult
  Failed nodes don't participate
  Failed nodes sometimes recover at inconvenient times
  At worst, failed nodes participate in harmful ways
Real failures are worse than fail-stop


Types of Failures

Fail-stop
  A nice, clean failure
  Processor stops executing anything
Realistic failures
  Partitionings
  Arbitrary delays
Adversarial failures
  Arbitrary bad things happen


Election Algorithms

If you get everyone to agree a particular node is in charge, future consensus is easy, since he makes the decisions
How do you determine who's in charge?
  Statically
  Dynamically


Static Leader Selection Methods

Predefine one process/node as the leader

Simple
  Everyone always knows who's the leader
Not very resilient
  If the leader fails, then what?


Dynamic Leader Selection Methods

Choose a new leader dynamically whenever necessary

More complicated
But failure of a leader is easy to handle
  Just elect a new one
Election doesn't imply voting
  Not necessarily majority-based


Election Algorithms vs. Mutual Exclusion Algorithms

Most mutual exclusion algorithms don't care much about failures

Election algorithms are designed to handle failures

Also, mutual exclusion algorithms only need a winner

Election algorithms need everyone to know who won


A Typical Use of Election Algorithms

A group of processes wants to periodically take a distributed snapshot

They don't want multiple simultaneous snapshots

So they want one leader to order them to take the snapshot


Problems in Election Algorithms

Some of the nodes may have failed before the algorithm starts

Some of the nodes may fail during the algorithm

Some nodes may recover from failure
  Possibly at inconvenient times

What about partitions?


Election Algorithms and the Real Work

The election algorithm is usually overhead
  There's a real computation you want to perform
  The election algorithm chooses someone to lead it
Having two leaders while real computation is going on is bad


The Bully Algorithm

The biggest kid on the block gets to be the leader

But what if the biggest kid on the block is taking his piano lesson?

The next biggest kid gets to be leader
  Until the piano lesson is over . . .


Electing a Bully

The kids come out to play
  Hey, Spike! Hey, Spike! Hey, Spike!
  Spike's Mom hasn't let him out yet
  Hey, Butch!
  I'm here, who else is? Peewee! Cuthbert!
  I'm the leader, let's play tag!
The piano lesson ends
  I'm here, where are you sissies?
  Cuthbert! Peewee! Butch! I'm the leader, and we're playing baseball!


Assumptions of the Bully Algorithm

A static set of possible participants
  With an agreed-upon order
All messages are delivered within Tm seconds
All responses are sent within Tp seconds of delivery
These last two imply synchronous behavior
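
Taken together, the two bounds give a deadline for hearing back from a live node: the request takes at most Tm to arrive, the response is sent within Tp, and the response takes at most Tm to return. A minimal sketch of that arithmetic follows; the function names are made up for illustration.

```python
def reply_deadline(tm: float, tp: float) -> float:
    """Longest wait for a reply from a live node under the two bounds.

    tm: upper bound on message delivery time (Tm)
    tp: upper bound on time to send a response after delivery (Tp)
    """
    # request delivery + time to respond + reply delivery
    return tm + tp + tm

def is_down(waited: float, tm: float, tp: float) -> bool:
    """Under the synchronous assumptions, silence past the deadline means failure."""
    return waited > reply_deadline(tm, tp)
```

With Tm = 0.5 and Tp = 0.25, a node that has waited longer than 1.25 seconds without a reply can conclude the other node is down. That conclusion is only as good as the two bounds themselves.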


The Basic Idea Behind the Bully Algorithm

Possible leaders try to take over
If they detect a better leader, they agree to its leadership
Keep track of state information about whether you are electing a leader
Only do real work when you agree on a leader


The Bully Algorithm and Timeouts

Call out the biggest kid's name
If he doesn't answer soon enough, call out the next biggest kid's name
  Until you hear an answer
  Or the caller is the biggest kid
Then take over, by telling everyone else you're the leader
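
The calling-out loop can be sketched in a few lines. This is a simplified illustration, not the full bully protocol: `bully_elect` and the `answers` probe are hypothetical names, and the probe stands in for "send a message and wait for the timeout".

```python
from typing import Callable, List

def bully_elect(me: str, ranked: List[str], answers: Callable[[str], bool]) -> str:
    """Return the leader that node `me` ends up with.

    ranked lists all possible participants from biggest to smallest
    (the agreed-upon order). answers(node) is a hypothetical probe that
    returns True if `node` replies before the timeout expires.
    """
    for candidate in ranked:
        if candidate == me:
            # Nobody bigger answered: the caller is the biggest kid around.
            return me
        if answers(candidate):
            # A bigger node is alive, so agree to its leadership.
            return candidate
    raise ValueError("caller is not in the participant list")

ranked = ["Spike", "Butch", "Peewee", "Cuthbert"]   # biggest first
at_lesson = {"Spike"}

def is_up(kid: str) -> bool:
    return kid not in at_lesson

before = bully_elect("Peewee", ranked, is_up)   # Spike is silent, Butch answers
at_lesson.clear()                               # the piano lesson ends
after = bully_elect("Peewee", ranked, is_up)    # Spike answers first
```

In this run `before` is "Butch" and `after` is "Spike", matching the playground story.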


The Bully Algorithm At Work

One node is currently the coordinator
It expects a certain set of nodes to be up and participating
The coordinator asks all other nodes
  If an expected node doesn't answer, start an election
  Also if it answers in the negative
If an unexpected node answers, start an election
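
The coordinator's periodic check reduces to comparing who was expected with who answered. A small sketch follows; the function and parameter names are assumptions for this example.

```python
from typing import Set

def needs_election(expected: Set[str], said_yes: Set[str], said_no: Set[str]) -> bool:
    """Decide whether the coordinator's check should trigger a new election."""
    answered = said_yes | said_no
    silent = expected - answered       # an expected node did not answer
    refused = expected & said_no       # an expected node answered in the negative
    strangers = answered - expected    # an unexpected node answered
    return bool(silent or refused or strangers)
```

For example, `needs_election({"a", "b"}, {"a", "b"}, set())` is False, while a missing, refusing, or unexpected node each makes it True.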


The Practicality of the Bully Algorithm

The bully algorithm works reasonably well if the timeouts are effective
  A timeout occurring really means the site in question is down
And there are no partitions at all
  If there are, what happens?


The Invitation Algorithm

More practical than bully algorithm
Doesn't depend on timeouts
  But its results are not as definitive
An asynchronous algorithm


The Basic Idea Behind the Invitation Algorithm

A current coordinator tries to get all other nodes to agree to his leadership

If more than one coordinator is around, get together and merge groups

Use timeouts only to allow progress, not to make definitive decisions

No set priorities for who will be coordinator


The Invitation Algorithm and Group Numbers

The invitation algorithm recruits a group of nodes to work together
More than one group can exist simultaneously
Group numbers identify the group
Why not identify with coordinator ID?
  Because one node can serially coordinate many groups


The Basic Operation of the Invitation Algorithm

Coordinators in a normal state periodically check all other nodes

If any other node is a coordinator, try to merge the groups

If timeouts occur, don't worry about it
Also don't worry if a response to a check comes from this or an earlier request


Merging in the Invitation Algorithm

Merging always requires forming a new group
  May have same coordinator, but different group number
Coordinator who initiates merge asks all other known coordinators to merge
  They ask their group members
  Original group members also asked


A Simplified Example

Nodes: 1, 2, 3, 4
Node 1 checks for other coordinator
  AreYouCoordinator? AreYouCoordinator?
  Yes, No
So node 1 finds another coordinator
Node 1 asks the other coordinator and his old node to join his group
  Invite, Invite
  Invite on behalf of node 1
  Accept, Accept
  UP = {1,2,3,4}
  Ready, Ready
If all members of UP{} respond, we're fine
Node 1 forms a new group


The Reorganization State

Nodes enter the reorganization state after getting their answer
What's the point of this state?
  Why not just start up the group?
  After all, we all know who's going to be a member
  Or do we?


Why We Need Another Round of Messages

Nodes: 1, 2, 3, 4
Two invitations go out
Who does 1 think will join the group, at this point?
  2 and 3
A third invitation goes out
Assuming no timeouts, 4 will also join
  And 2 needs to know that
And what if someone crashes?
  Presumably not accepting the invitation?


Timeouts in the Merge

Don't worry too much about them
Some nodes respond before the timeout
  Some don't
If you don't catch them this time, you might the next


Straggler Messages

This algorithm is asynchronous
  So messages may come in late
What do we do when messages arrive late?
  Mostly, reject them
How do we tell?
  Messages contain group number
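
One way to picture the group-number check is sketched below. The layout is an assumption for illustration: a group number is modeled as a (coordinator, sequence) pair so that one node can coordinate several groups over time, and the class names `GroupId`, `Message`, and `Member` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class GroupId:
    coordinator: int   # node that formed the group
    sequence: int      # bumped each time that node forms a new group

@dataclass
class Message:
    group: GroupId
    body: str

@dataclass
class Member:
    group: GroupId
    inbox: List[str] = field(default_factory=list)

    def receive(self, msg: Message) -> bool:
        """Keep a message only if it names the group this node is in now."""
        if msg.group != self.group:
            return False               # straggler from an older or different group
        self.inbox.append(msg.body)
        return True

node = Member(GroupId(coordinator=1, sequence=2))
late = node.receive(Message(GroupId(1, 1), "work for the old group"))
current = node.receive(Message(GroupId(1, 2), "work for the new group"))
```

The late message is dropped even though it comes from the same coordinator, which is why the coordinator ID alone is not enough to identify a group.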


Multiple Simultaneous Groups

The invitation algorithm allows multiple simultaneous groups to exist
  Each with a proper coordinator
Is this a good thing?
  No, but what are the alternatives?
No node ever belongs to more than one group, at least


Paxos

A family of algorithms that allow a distributed system to reach agreement
  In the face of delays and failures
Can't perfectly guarantee progress
  But makes progress in realistic conditions
Does guarantee consistency
Usually defined to reach consensus on some value v


Paxos Assumptions

Processors are of variable speed and may fail
  Might recover after failure
  But they don't lie
Any processor can send a message to any other processor
Messages can be lost, arbitrarily delayed, reordered, or duplicated
  But never corrupted


Paxos Processor Roles

Client
  Issues a request, waits for a response
Acceptor/voter
  Remembers things for the protocol
Proposer (simpler if there's only one)
  Assists client in getting a response
Learner
  Actually executes a request
Leader
  One of the proposers that leads the process
One processor can play several roles
  Usually, all processes are acceptors, proposers, and learners


Paxos Quorums

Collections of acceptors that make decisions
  Several different quorums in system
Messages are sent to quorums, not single acceptors
  Messages only effective if all quorum members receive them
  Similarly, all acceptors in a quorum must send a message for it to be effective
If any member of the quorum survives, its decisions survive


Quorum Membership

All quorums must contain a majority of all acceptors in the system
  Any two quorums must share at least one acceptor
E.g., if there are four acceptors {1,2,3,4}, quorums might be:
  {1,2,3}, {1,2,4}, {2,3,4}, {1,3,4}
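
The overlap property follows from the majority requirement and is easy to check by brute force. The sketch below enumerates the smallest majority quorums and tests every pair; the helper names are made up for this example.

```python
from itertools import combinations

def majority_quorums(acceptors):
    """All smallest subsets that contain a strict majority of the acceptors."""
    size = len(acceptors) // 2 + 1
    return [set(q) for q in combinations(sorted(acceptors), size)]

def all_pairs_intersect(quorums):
    """True if every two quorums share at least one acceptor."""
    return all(a & b for a, b in combinations(quorums, 2))

quorums = majority_quorums({1, 2, 3, 4})
overlap_ok = all_pairs_intersect(quorums)
halves_ok = all_pairs_intersect([{1, 2}, {3, 4}])
```

For four acceptors this yields the four three-member quorums listed above, and every pair overlaps. Two halves such as {1,2} and {3,4} are not majorities and do not overlap, so they could reach conflicting decisions.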


Paxos Rounds

Paxos proceeds in rounds
  In response to a client request
If the round reaches agreement, the client gets a response
If not, you start another round
  Continue till a round reaches agreement


A Simple Paxos Round

Participants: client C, proposer P, acceptors A1, A2, A3, learners L1, L2

1. request
2. prepare(N)
   N is a bigger number than P has ever used or seen before
3. promise(N, null), one from each of A1, A2, and A3
   If an acceptor ever promised on this item before, it returns the generation and value from that run of Paxos, not null
4. accept(N, Vres)
   Vres is a result chosen by P, if no promise had a value
5. accepted(N, Vmax)
6. response
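
The round can be sketched as plain function calls. This is a minimal single-proposer illustration under strong assumptions: messages are direct method calls that are never lost, the learners and the client are left out, and the names `Acceptor` and `run_round` are invented for this example.

```python
from typing import List, Optional, Tuple

class Acceptor:
    def __init__(self) -> None:
        self.promised = 0                                  # highest N promised so far
        self.accepted: Optional[Tuple[int, str]] = None    # (N, value) last accepted

    def prepare(self, n: int):
        """Promise to ignore proposals numbered below n, reporting any earlier accept."""
        if n <= self.promised:
            return None                                    # stale proposer: no promise
        self.promised = n
        return ("promise", n, self.accepted)

    def accept(self, n: int, value: str) -> bool:
        """Accept unless a higher-numbered prepare arrived in the meantime."""
        if n < self.promised:
            return False
        self.promised = n
        self.accepted = (n, value)
        return True

def run_round(acceptors: List[Acceptor], n: int, v_res: str) -> Optional[str]:
    """One round led by a single proposer; returns the chosen value or None."""
    majority = len(acceptors) // 2 + 1
    promises = [p for p in (a.prepare(n) for a in acceptors) if p is not None]
    if len(promises) < majority:
        return None                                        # no quorum: retry with a bigger N
    earlier = [p[2] for p in promises if p[2] is not None]
    # Reuse the value of the highest-numbered earlier accept, else the proposer's own.
    value = max(earlier)[1] if earlier else v_res
    votes = sum(a.accept(n, value) for a in acceptors)
    return value if votes >= majority else None

acceptors = [Acceptor() for _ in range(3)]
first = run_round(acceptors, 1, "x")    # nothing promised before, so "x" is chosen
second = run_round(acceptors, 2, "y")   # promises report "x", so "x" is kept
stale = run_round(acceptors, 1, "z")    # N is too small, no promises, no decision
```

The second round shows why promises carry earlier values: once "x" has been accepted by a quorum, a later proposer has to adopt it instead of its own "y", which keeps the decision final.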


The Point of Different Paxos Roles

C: The client wants to get something done
P: The proposer coordinates protocol activities
A1, A2, A3: The acceptors ensure proper concurrent behavior and handle proposer failures
L1, L2: The learners ensure redundant memory of the result of a decision
Remember! One machine can play multiple roles


Paxos Error Handling

Some cases simple, some complex
A simple case:
  One of the acceptors fails
  If there's still a quorum, no problem
  Go ahead without him
Another simple case:
  One of the learners failed
  If any learners are left, they'll provide the right response to the client


More Complex Error Cases

Things like failure of proposer in middle of a round

Paxos chooses a new leader and uses him from this point

What if old leader comes back?
  Even more complex, but it works out


Paxos and Overheads

Generally quite expensive
  In messages and thus delays
Many optimizations possible
  Some don't alter the protocol characteristics
  Some trade off handling some error conditions for better performance


Byzantine Agreement

Life can be a lot worse than merely being unable to rely on timeouts

What if one of the nodes we're working with is lying?

How can we reach agreement if we can't trust all the participants?


The Purpose of Byzantine Agreement

Well, why would one of our distributed system components lie?

It probably wouldn't
But it might contain a bug
If it contains the worst possible bug, what can it do?
  Essentially, inadvertently lie


The Realism of Byzantine Agreement

It isn't realistic
  It doesn't really happen
  No one really uses it
But it demonstrates a limit on how badly things can go while still allowing agreement


Why Is It Called Byzantine?

After the fall of Rome itself, the empire lived on in the east
  Called Byzantium

Byzantium survived for around 1000 years

The Byzantines were famous for their treachery and double-dealing


The Byzantine General Problem

Several Byzantine generals each command their own army
They are far apart and communicate with messengers
The emperor wants to attack the Turks
  If all generals attack, they'll win
  Even if a majority attack, they'll win
  Retreating is OK, if everyone does it
But the Turks may have bribed some generals


The Complete Problem Statement

Messages are point-to-point
Messages are reliably delivered, with a predictable timeout
  Failure to receive message in time means sender is a traitor
Traitors can send any messages they please
  But cannot forge their identities


How Many Traitors Is Too Many?

Can all the loyal generals reach agreement on whether to attack or retreat?

Or can the traitors prevent them from reaching any agreement?

How many generals must the Turks bribe before no agreement is possible?


The Answer

If the Turks bribe 1/3 of the generals, the remaining 2/3 cannot reach agreement
How can that be?
  Why not just a majority?
Easiest to consider in the case of a commander
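
Stated as arithmetic, agreement with unsigned messages needs more than three generals for every traitor. The helper names below are assumptions for this example.

```python
def agreement_possible(generals: int, traitors: int) -> bool:
    """Unsigned agreement needs strictly more than three generals per traitor."""
    return generals > 3 * traitors

def max_traitors(generals: int) -> int:
    """Largest number of traitors the loyal generals can tolerate."""
    return (generals - 1) // 3
```

So three generals tolerate no traitor at all, four tolerate one, and six still tolerate only one, because two traitors out of six is exactly one third.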


The 3-General Byzantine Problem

What if they're all loyal?
  The commander sends: Attack, Attack
  Everyone attacks and the Turk is vanquished
But what if the commander is a traitor?
  The commander sends: Attack, Retreat
  One general attacks, one retreats, the traitor pockets the bribe, and the Turks win


Can't the Loyal Generals Check Their Orders?

Commander (1) sends Attack to general 2 and Retreat to general 3
Generals 2 and 3 check their orders
  3 tells 2: Retreat
  2 tells 3: Attack
They figure out 1 is a traitor and come to their own agreement


But What if the Commander Wasn't the Traitor?

Commander (1) sends Attack to both general 2 and general 3
3 is the traitor, this time
Generals 2 and 3 check their orders
  3 tells 2: Retreat
  2 tells 3: Attack
They figure out 1 is a traitor and come to their own agreement
But 1 isn't the traitor, 3 is the traitor
He convinces 2 to retreat, 1 is slaughtered attacking, and 3 pockets the bribe


Can General 2 Tell Which Scenario Is Occurring?

When 1 was the traitor, 2 saw:
  From 1: Attack
  From 3: Retreat
When 3 was the traitor, 2 saw:
  From 1: Attack
  From 3: Retreat
2 can't tell the difference, so he can't decide whether to attack or retreat
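
The two scenarios can be written out to show that general 2 receives identical information in both. The function name and the dictionary layout are assumptions for this sketch.

```python
ATTACK, RETREAT = "Attack", "Retreat"

def view_of_general_2(commander_orders: dict, report_from_3: str) -> tuple:
    """What general 2 observes: his own order, plus what 3 claims he was told."""
    return (commander_orders[2], report_from_3)

# Scenario A: commander 1 is the traitor and sends conflicting orders.
# Loyal general 3 reports his own order honestly.
orders_a = {2: ATTACK, 3: RETREAT}
view_a = view_of_general_2(orders_a, report_from_3=orders_a[3])

# Scenario B: commander 1 is loyal and sends Attack to both.
# Traitor 3 lies about the order he received.
orders_b = {2: ATTACK, 3: ATTACK}
view_b = view_of_general_2(orders_b, report_from_3=RETREAT)

indistinguishable = view_a == view_b
```

Both views are ("Attack", "Retreat"), so any rule general 2 applies gives the same action in both scenarios, and at least one of those scenarios ends badly.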


What If There Were 4 Generals?

Commander (1) and generals 2, 3, and 4
What if the commander (1) is the traitor?
  If he doesn't send some messages, he'll be seen as the traitor
But what can he send?
  Attack, Attack, Retreat


Can the Three Loyal Generals Reach Agreement?

Commander (1) sends Attack, Attack, Retreat to generals 2, 3, and 4
They can exchange all the messages and let the majority rule
  Since there are only two possible messages (attack or retreat) and three recipients, the commander must have sent the same message to at least two nodes
If the commander is loyal and someone else is lying, the majority represents the loyal commander's will
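
The exchange-and-vote step for four generals can be sketched as follows. It is a simplified illustration with one round of relaying and at most one traitor; `lieutenant_decisions` and the shape of the `lies` argument are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Optional

def lieutenant_decisions(orders: Dict[int, str],
                         lies: Optional[Dict[int, Dict[int, str]]] = None) -> Dict[int, str]:
    """Decision of each loyal lieutenant after exchanging orders.

    orders: lieutenant -> order received from the commander.
    lies: traitor lieutenant -> {recipient: order it claims to have received}.
    """
    lies = lies or {}
    decisions = {}
    for me in orders:
        if me in lies:
            continue                       # a traitor's own decision doesn't matter
        seen = [orders[me]]                # my own order from the commander
        for other in orders:
            if other != me:
                # A loyal peer relays its order; a traitor says whatever it likes.
                seen.append(lies[other][me] if other in lies else orders[other])
        decisions[me] = Counter(seen).most_common(1)[0][0]
    return decisions

# Traitor commander: conflicting orders, but every lieutenant sees the same three values.
traitor_commander = lieutenant_decisions({2: "Attack", 3: "Attack", 4: "Retreat"})

# Loyal commander orders Attack; traitor 4 tells both peers he was told Retreat.
traitor_lieutenant = lieutenant_decisions(
    {2: "Attack", 3: "Attack", 4: "Attack"},
    lies={4: {2: "Retreat", 3: "Retreat"}},
)
```

In the first case all three lieutenants decide Attack, so they agree even though the commander lied. In the second case loyal lieutenants 2 and 3 outvote the lie and follow the commander's real order.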


But What if There Were Five Generals?

Commander (1) sends Attack to generals 2 and 3, and Retreat to generals 4 and 5
Pre-arrange a tie-breaker
  E.g., always retreat on ties
All the loyal generals then retreat
And the traitor must explain his failure to the Turks
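
With an even number of orders the vote can tie, so the majority rule needs a default agreed in advance. A small sketch, where the function name and the `tie_breaker` parameter are assumptions:

```python
from collections import Counter

def decide(orders_seen, tie_breaker="Retreat"):
    """Majority vote over the exchanged orders, with a pre-arranged default on ties."""
    counts = Counter(orders_seen)
    attack, retreat = counts["Attack"], counts["Retreat"]
    if attack == retreat:
        return tie_breaker
    return "Attack" if attack > retreat else "Retreat"

five_generals = decide(["Attack", "Attack", "Retreat", "Retreat"])
four_generals = decide(["Attack", "Attack", "Retreat"])
```

Every loyal general sees the same two-to-two split in the five-general case, so they all fall back to the same default and retreat together.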


What If You Don't Want a Commander?

What if you want everyone to vote? And accept the majority?

With the guarantee that all loyal nodes abide by the majority?

Serially treat each node as the commander
  Reach agreement on his vote
  Then move on to the next node


The Trick Behind Byzantine Agreement

Everyone must know what everyone else thinks about everything else

Not just what I think the commander said, but what everyone else claims the commander said

Resulting algorithms are tricky and expensive
  But it could be (and will be) worse


Authenticated Byzantine Agreement

What if the messages are signed in an unforgeable way?

Then dishonest generals can't lie about what honest generals told them

In this case, honest generals reach agreement regardless of how many are dishonest
