73
Reaching reliable agreement in an unreliable world Heidi Howard [email protected] @heidiann Research Students Lecture Series 13th October 2015 1

Reaching reliable agreement in an unreliable world

Embed Size (px)

Citation preview

Page 1: Reaching reliable agreement in an unreliable world

Reaching reliable agreement in an unreliable

worldHeidi Howard

[email protected] @heidiann

Research Students Lecture Series 13th October 2015

1

Page 2: Reaching reliable agreement in an unreliable world

Introducing Alice

Alice is new graduate from Cambridge off to the world of work.

She joins a cool new start up, where she is responsible for a key value store.

2

Page 3: Reaching reliable agreement in an unreliable world

Single Server System

Server

Client 2

A 7B 2

Client 1 Client 3

3

Page 4: Reaching reliable agreement in an unreliable world

Single Server System

Server

Client 2

A 7B 2

Client 1 Client 3

A?

7

4

Page 5: Reaching reliable agreement in an unreliable world

Single Server System

Server

Client 2

A 7B 2

Client 1 Client 3

B=3 OK

3

5

Page 6: Reaching reliable agreement in an unreliable world

Single Server System

Server

Client 2

A 7B 2

Client 1 Client 3

B?3

3

6

Page 7: Reaching reliable agreement in an unreliable world

Single Server SystemPros

• linearizable semantics

• durability with write-ahead logging

• easy to deploy

• low latency (1 RTT in common case)

• partition tolerance with retransmission & command cache

Cons

• system unavailable if server fails

• throughput limited to one server

7

Page 8: Reaching reliable agreement in an unreliable world

Backups

aka Primary backup replication

Primary

Client 2

Client 1Backup 1

Backup 1

Backup 1

A 7B 2

A?7

A 7B 2

A 7B 2

A 7B 2

8

Page 9: Reaching reliable agreement in an unreliable world

Backups

aka Primary backup replication

Primary

Client 2

Client 1Backup 1

Backup 1

Backup 1

A 7B 1

B=1

A 7B 2

A 7B 2

A 7B 2

A 7B 1

9

Page 10: Reaching reliable agreement in an unreliable world

Backups

aka Primary backup replication

Primary

Client 2

Client 1Backup 1

Backup 1

Backup 1

A 7B 1

OK

A 7B 1

A 7B 1

A 7B 1

OK

OK

OK

10

Page 11: Reaching reliable agreement in an unreliable world

Big GotchaWe are assuming total ordered broadcast

11

Page 12: Reaching reliable agreement in an unreliable world

Totally Ordered Broadcast

(aka atomic broadcast) the guarantee that messages are received reliably and in the same order by all nodes.

12

Page 13: Reaching reliable agreement in an unreliable world

Requirements• Scalability - High throughout processing of

operations.

• Latency - Low latency commit of operation as perceived by the client.

• Fault-tolerance - Availability in the face of machine and network failures.

• Linearizable semantics - Operate as if a single server system.

13

Page 14: Reaching reliable agreement in an unreliable world

Doing the Impossible

14

Page 15: Reaching reliable agreement in an unreliable world

CAP TheoremPick 2 of 3:

• Consistency

• Availability

• Partition tolerance

Proposed by Brewer in 1998, still debated and regarded as misleading. [Brewer’12] [Kleppmann’15]

15

Eric Brewer

Page 16: Reaching reliable agreement in an unreliable world

FLP Impossibility

It is impossible to guarantee consensus when messages may be delay if even one node may fail. [JACM’85]

16

Page 17: Reaching reliable agreement in an unreliable world

Consensus is impossible[PODC’89]

Nancy Lynch17

Page 18: Reaching reliable agreement in an unreliable world

Aside from Simon PJ

Don’t drag your reader or listener through your blood strained path.

Simon Peyton Jones18

Page 19: Reaching reliable agreement in an unreliable world

PaxosPaxos is at the foundation of (almost) all distributed consensus protocols.

It is a general approach of using two phases and majority quorums.

It takes much more to construct a complete fault-tolerance distributed systems. Leslie Lamport

19

Page 20: Reaching reliable agreement in an unreliable world

Beyond Paxos

• Replication techniques - mapping a consensus algorithm to a fault tolerance system

• Classes of operations - handling of read vs writes, operating on stale data, soft vs hard state

• Leadership - fixed, variable or leaderless, multi-paxos, how to elect a leader, how to discover a leader

• Failure detectors - heartbeats & timers

• Dynamic membership, Log compaction & GC, sharding, batching etc

20

Page 21: Reaching reliable agreement in an unreliable world

Consensus is hard

21

Page 22: Reaching reliable agreement in an unreliable world

A raft in the sea of confusion

22

Page 23: Reaching reliable agreement in an unreliable world

Case Study: Raft

Raft is the understandable replication algorithm.

Provides linearisable client semantics, 1 RTT best case latency for clients.

A complete(ish) architecture for making our application fault-tolerance.

Uses SMR and Paxos

23

Page 24: Reaching reliable agreement in an unreliable world

State Machine Replication

Server

Client

ServerServer A 7B 2

A 7B 2

A 7B 2

B=3

24

Page 25: Reaching reliable agreement in an unreliable world

State Machine Replication

Server

Client

ServerServer A 7B 2

A 7B 2

A 7B 2

B=3

25

Page 26: Reaching reliable agreement in an unreliable world

State Machine Replication

Server

Client

ServerServer A 7B 2

A 7B 2

A 7B 2

B=3

3

3 3

26

Page 27: Reaching reliable agreement in an unreliable world

Leadership

Follower Candidate Leader

Startup/ Restart

Timeout Win

Timeout

Step down

27

Step down

Page 28: Reaching reliable agreement in an unreliable world

OrderingEach node stores is own perspective on a value known as the term.

Each message includes the sender’s term and this is checked by the recipient.

The term orders periods of leadership to aid in avoiding conflict.

Each has one vote per term, thus there is at most one leader per term.

28

Page 29: Reaching reliable agreement in an unreliable world

Ordering

29

ID: 1

ID: 2

ID: 3

1

1

1

2

2 2

2

Page 30: Reaching reliable agreement in an unreliable world

ID: 1 Term: 0 Vote: n

ID: 2 Term: 0 Vote: n

ID: 5 Term: 0 Vote: n

ID: 4 Term: 0 Vote: n

ID: 3 Term: 0 Vote: n

30

Page 31: Reaching reliable agreement in an unreliable world

Leadership

Follower Candidate Leader

Startup/ Restart

Timeout Win

Timeout

Step down

31

Step down

Page 32: Reaching reliable agreement in an unreliable world

ID: 1 Term: 0 Vote: n

ID: 2 Term: 0 Vote: n

ID: 5 Term: 0 Vote: n

ID: 4 Term: 1

Vote: me

ID: 3 Term: 0 Vote: n

Vote for me in term 1!32

Page 33: Reaching reliable agreement in an unreliable world

ID: 1 Term: 1 Vote: 4

ID: 2 Term: 1 Vote: 4

ID: 5 Term: 1 Vote: 4

ID: 4 Term: 1

Vote: me

ID: 3 Term: 1 Vote: 4

Ok!33

Page 34: Reaching reliable agreement in an unreliable world

ID: 1 Term: 1 Vote: 4

ID: 2 Term: 1 Vote: 4

ID: 5 Term: 1 Vote: 4

ID: 4 Term: 1

Vote: me

ID: 3 Term: 1 Vote: 4

Leader fails34

Page 35: Reaching reliable agreement in an unreliable world

ID: 1 Term: 2 Vote: 5

ID: 2 Term: 1 Vote: 4

ID: 5 Term: 2

Vote: me

ID: 4 Term: 1

Vote: me

ID: 3 Term: 1 Vote: 4

Vote for me in term 2!

35

Page 36: Reaching reliable agreement in an unreliable world

ID: 1 Term: 2 Vote: 5

ID: 2 Term: 2

Vote: me

ID: 5 Term: 2

Vote: me

ID: 4 Term: 1

Vote: me

ID: 3 Term: 2 Vote: 2

Vote for me in term 2!

36

Page 37: Reaching reliable agreement in an unreliable world

ID: 1 Term: 2 Vote: 5

ID: 2 Term: 2

Vote: me

ID: 5 Term: 2

Vote: me

ID: 4 Term: 1

Vote: me

ID: 3 Term: 2 Vote: 2

37

Page 38: Reaching reliable agreement in an unreliable world

Leadership

Follower Candidate Leader

Startup/ Restart

Timeout Win

Timeout

Step down

38

Step down

Page 39: Reaching reliable agreement in an unreliable world

ID: 1 Term: 3 Vote: 2

ID: 2 Term: 3

Vote: me

ID: 5 Term: 3 Vote: 2

ID: 4 Term: 3 Vote: 2

ID: 3 Term: 3 Vote: 2

Vote for me in term 3!

39

Page 40: Reaching reliable agreement in an unreliable world

ID: 1 Term: 3 Vote: 2

ID: 2 Term: 3

Vote: me

ID: 5 Term: 3 Vote: 2

ID: 4 Term: 3 Vote: 2

ID: 3 Term: 3 Vote: 2

40

Page 41: Reaching reliable agreement in an unreliable world

ReplicationEach node has a log of client commands and a index into this representing which commands have been committed

41

Page 42: Reaching reliable agreement in an unreliable world

ID: 1

ID: 2

ID: 3

A=4

A=4

A=4

A 4B 2

A 4B 2

A 4B 2

Simple Replication

42

Page 43: Reaching reliable agreement in an unreliable world

ID: 1

ID: 2

ID: 3

A=4

A=4

A=4

A 4B 2

A 4B 2

A 4B 2

B=7

Simple Replication

43

Page 44: Reaching reliable agreement in an unreliable world

ID: 1

ID: 2

ID: 3

A=4

A=4

A=4

A 4B 2

A 4B 2

A 4B 2

B=7

B=7

B=7

Simple Replication

44

Page 45: Reaching reliable agreement in an unreliable world

ID: 1

ID: 2

ID: 3

A=4

A=4

A=4

A 4B 2

A 4B 7

A 4B 2

B=7

B=7

B=7

Simple Replication

45

Page 46: Reaching reliable agreement in an unreliable world

ID: 1

ID: 2

ID: 3

A=4

A=4

A=4

A 4B 7

A 4B 7

A 4B 7

B=7

B=7

B=7

Simple Replication

46

Page 47: Reaching reliable agreement in an unreliable world

ID: 1

ID: 2

ID: 3

A=4

A=4

A=4

A 4B 7

A 4B 7

A 4B 7

B=7

B=7

B=7

B?

Simple Replication

47

Page 48: Reaching reliable agreement in an unreliable world

ID: 1

ID: 2

ID: 3

A=4

A=4

A=4

A 4B 7

A 4B 7

A 4B 7

B=7

B=7

B=7

B?

B?

B?

Simple Replication

48

Page 49: Reaching reliable agreement in an unreliable world

ID: 1

ID: 2

ID: 3

A=4

A=4

A=4

A 4B 7

A 4B 7

A 4B 7

B=7

B=7

B=7

B?

B?

B?

Simple Replication

49

Page 50: Reaching reliable agreement in an unreliable world

ID: 1

ID: 2

ID: 3

A=4

A=4

A=4

A 4B 2

A 4B 7

A 4B 7

B=7

B=7

B?

B?

Catchup

A=6

50

Page 51: Reaching reliable agreement in an unreliable world

ID: 1

ID: 2

ID: 3

A=4

A=4

A=4

A 4B 2

A 4B 7

A 4B 7

B=7

B=7

B?

B?

Catchup

A=6

A=6

:(

51

Page 52: Reaching reliable agreement in an unreliable world

ID: 1

ID: 2

ID: 3

A=4

A=4

A=4

A 4B 2

A 4B 7

A 4B 7

B=7

B=7

B?

B?

Catchup

A=6

A=6

:(

52

Page 53: Reaching reliable agreement in an unreliable world

ID: 1

ID: 2

ID: 3

A=4

A=4

A=4

A 4B 2

A 4B 7

A 4B 7

B=7

B=7

B?

B?

Catchup

A=6

A=6

:)

53

Page 54: Reaching reliable agreement in an unreliable world

ID: 1

ID: 2

ID: 3

A=4

A=4

A=4

A 4B 2

A 4B 7

A 4B 7

B=7

B=7

B?

B?

Catchup

A=6

A=6

B=7 B? A=6

54

Page 55: Reaching reliable agreement in an unreliable world

Evaluation

• The leader is a serious bottleneck -> limited scalability

• Can only handle the failure of a minority of nodes

• Some rare network partitions render protocol in livelock

55

Page 56: Reaching reliable agreement in an unreliable world

Beyond Raft

56

Page 57: Reaching reliable agreement in an unreliable world

Case Study: Tango

Tango is designed to be a scalable replication protocol.

It’s a variant of chain replication + Paxos.

It is leaderless and pushes more work onto clients

57

Page 58: Reaching reliable agreement in an unreliable world

Simple Replication

Client 1

A 7B 2

Client 2

A 4B 2 Sequencer

Server 1 Server 2 Server 3

0A=4

Next: 1

0A=4

0A=4

B=5

58

Page 59: Reaching reliable agreement in an unreliable world

Simple Replication

Client 1

A 7B 2

Client 2

A 4B 2 Sequencer

Server 1 Server 2 Server 3

0A=4

Next: 2

0A=4

0A=4

Next?

1

59

B=5

Page 60: Reaching reliable agreement in an unreliable world

Simple Replication

Client 1

A 7B 2

Client 2

A 4B 2 Sequencer

Server 1 Server 2 Server 3

0A=4

Next: 2

0A=4

0A=4

B=5 @ 1

OK

1B=5

60

B=5

Page 61: Reaching reliable agreement in an unreliable world

Simple Replication

Client 1

A 7B 2

Client 2

A 4B 2 Sequencer

Server 1 Server 2 Server 3

0A=4

Next: 2

0A=4

0A=4

1B=5

1B=5

61

B=5 @ 1 OK

Page 62: Reaching reliable agreement in an unreliable world

Simple Replication

Client 1

A 7B 2

Client 2

A 4B 2 Sequencer

Server 1 Server 2 Server 3

0A=4

Next: 2

0A=4

0A=4

1B=5

1B=5

1B=5

62

B=5 @ 1

OK

Page 63: Reaching reliable agreement in an unreliable world

Simple Replication

Client 1

A 7B 2

Client 2

A 4B 5 Sequencer

Server 1 Server 2 Server 3

0A=4

Next: 2

0A=4

0A=4

1B=5

1B=5

1B=5

63

Page 64: Reaching reliable agreement in an unreliable world

Fast Read

Client 1

A 7B 2

Client 2

A 4B 5 Sequencer

Server 1 Server 2 Server 3

0A=4

Next: 2

0A=4

0A=4

1B=5

1B=5

1B=5

Check?

1

B?

64

Page 65: Reaching reliable agreement in an unreliable world

Handling FailuresSequencer is soft-state which can be reconstructed by querying head server.

Clients failing between receiving token and first read leaves gaps in the log. Clients can mark these as empty and space (though not address) is reused.

Clients may fail before completing a write. The next client can fill-in and complete the operation

Server failures are detected by clients, who initiate a membership change, using term numbers.

65

Page 66: Reaching reliable agreement in an unreliable world

Evaluation

Tango is scalable, the leader is not longer the bottleneck.

Dynamic membership and sharding come for free with design.

High latency of chain replication

66

Page 67: Reaching reliable agreement in an unreliable world

Next Steps

67

Page 68: Reaching reliable agreement in an unreliable world

68

wait… we’re not finished yet!

Page 69: Reaching reliable agreement in an unreliable world

Requirements• Scalability - High throughout processing of

operations.

• Latency - Low latency commit of operation as perceived by the client.

• Fault-tolerance - Availability in the face of machine and network failures.

• Linearizable semantics - Operate as if a single server system.

69

Page 70: Reaching reliable agreement in an unreliable world

Many more examples• Raft [ATC’14] - Good starting point, understandable

algorithm from SMR + multi-paxos variant

• VRR [MIT-TR’12] - Raft with round-robin leadership & more distributed load

• Tango [SOSP’13] - Scalable algorithm for f+1 nodes, uses CR + multi-paxos variant

• Zookeeper [ATC'10] - Primary backup replication + atomic broadcast protocol (Zab [DSN’11])

• EPaxos [SOSP’13] - leaderless Paxos varient for WANs

70

Page 71: Reaching reliable agreement in an unreliable world

Can we do even better?• Self-scaling replication - adapting resources to

maintain resilience level.

• Geo replication - strong consistency between wide area links

• Auto configuration - adapting timeouts and configure as network changes

• Integrated with unikernels, virtualisation, containers and other such deployment tech

71

Page 72: Reaching reliable agreement in an unreliable world

Evaluation is hard

• few common evaluation metrics.

• often only one experiment setup is used.

• different workloads

• evaluation to demonstrate protocol strength

72

Page 73: Reaching reliable agreement in an unreliable world

Lessons Learned

• Reaching consensus in distributed systems is do able

• Exploit domain knowledge

• Raft is a good starting point but we can do much better!

Questions?73