74
COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 06/20/22 1 Distributed Systems - COMP 655

COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

Embed Size (px)

Citation preview

Page 1: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

COMP 655:Distributed/Operating

SystemsSummer 2011

Dr. Chunbo ChuWeek 7: Fault Tolerance

04/20/23 1Distributed Systems - COMP 655

Page 2: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 2

Fault Tolerance• Fault tolerance concepts• Implementation – distributed agreement• Distributed agreement meets transaction

processing: 2- and 3-phase commit

Bonus material• Implementation – reliable point-to-point

communication• Implementation – process groups• Implementation – reliable multicast• Recovery• Sparing

Page 3: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 3

Fault tolerance concepts• Availability – can I use it now?

– Usually quantified as a percentage• Reliability – can I use it for a

certain period of time?– Usually quantified as MTBF

• Safety – will anything really bad happen if it does fail?

• Maintainability – how hard is it to fix when it fails?– Usually quantified as MTTR

Page 4: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 4

Comparing nines• 1 year = 8760 hr• Availability levels

– 90% = 876 hr downtime/yr– 99% = 87.6 hr downtime/yr– 99.9% = 8.76 hr downtime/yr– 99.99% = 52.56 min downtime/yr– 99.999% = 5.256 min downtime/yr

Page 5: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 5

Exercise: how to get five nines

1. Brainstorm what you would have to deal with to build a single-machine system that could run for five years with 25 min downtime. Consider:

– Hardware failures, especially disks– Power failures– Network outages– Software installation– What else?

2. Come up with some ideas about how to solve the problems you identify

Page 6: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 6

Multiple machines at 99%

Assuming independent failures

Page 7: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 7

Multiple machines at 95%

Assuming independent failures

Page 8: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 8

Multiple machines at 80%

Assuming independent failures

Page 9: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 9

1,000 components

Page 10: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 10

Things to watch out for in availability requirements

• What constitutes an outage …– A client PC going down?– A client applet going into an infinite

loop?– A server crashing?– A network outage?– Reports unavailable?– If a transaction times out?– If 100 transactions time out in a 10

min period?– etc

Page 11: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 11

More to watch out for• What constitutes being back up

after an outage?• When does an outage start?• When does it end?• Are there outages that don’t

count?– Natural disasters?– Outages due to operator errors?

• What about MTBF?

Page 12: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 12

Ways to get 99% availability

1. MTBF = 99 hr, MTTR = 1 hr2. MTBF = 99 min, MTTR = 1 min3. MTBF = 99 sec, MTTR = 1 sec

Page 13: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 13

More definitions

failure

error

fault

causes

may causeFault tolerance is continuing to work correctly in the presence of faults.

Types of faults:• transient• intermittent• permanent

Page 14: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 14

Types of failures

Page 15: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 15

If you remember one thing• Components fail in distributed systems

on a regular basis.• Distributed systems have to be

designed to deal with the failure of individual components so that the system as a whole– Is available and/or– Is reliable and/or– Is safe and/or– Is maintainable

depending on the problem it is trying to solve and the resources available …

Page 16: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 16

Fault Tolerance• Fault tolerance concepts• Implementation – distributed

agreement• Distributed agreement meets

transaction processing: 2- and 3-phase commit

Page 17: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 17

Two-army problem• Red army has 5,000 troops• Blue army and White army have

3,000 troops each• Attack together and win• Attack separately and lose in serial• Communication is by messenger,

who might be captured• Blue and white generals have no

way to know when a messenger is captured

Page 18: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 18

Activity: outsmart the generals

• Take your best shot at designing a protocol that can solve the two-army problem

• Spend ten minutes• Did you think of anything

promising?

Page 19: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 19

Conclusion: go home• “agreement between even two

processes is not possible in the face of unreliable communication”

Page 20: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 20

Byzantine generals• Assume perfect communication• Assume n generals, m of whom

should not be trusted• The problem is to reach agreement

on troop strength among the non-faulty generals

Page 21: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 21

Byzantine generals - example

n = 4, m = 1(units are K-troops)

(a) Multicast troop-strength messages(b) Construct troop-strength vectors(c) Compare notes: majority rules in each componentResult: 1, 2, and 4 agree on (1,2,unknown,4)

Page 22: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 22

Doesn’t work with n=3, m=1

Page 23: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 23

Fault Tolerance• Fault tolerance concepts• Implementation – distributed

agreement• Distributed agreement meets

transaction processing: 2- and 3-phase commit

Page 24: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 24

Distributed commit protocols

• What is the problem they are trying to solve?– Ensure that a group of processes all

do something, or none of them do– Example: in a distributed transaction

that involves updates to data on three different servers, ensure that all three commit or none of them do

Page 25: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 25

2-phase commit

Coordinator Participant

What to do when P, in READY state, contacts Q

Page 26: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 26

If coordinator crashes• Participants could wait until the

coordinator recovers• Or, they could try to figure out

what to do among themselves– Example, if P contacts Q, and Q is in

the COMMIT state, P should COMMIT as well

Page 27: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 27

2-phase commitWhat to do when P, in READY state, contacts Q

If all surviving participants are in READY state,1. Wait for coordinator to recover2. Elect a new coordinator (?)

Page 28: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 28

3-phase commit• Problem addressed:

– Non-blocking distributed commit in the presence of failures

– Interesting theoretically, but rarely used in practice

Page 29: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 29

3-phase commit

Coordinator Participant

Page 30: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 30

Bonus material• Implementation – reliable point-to-

point communication• Implementation – process groups• Implementation – reliable multicast• Recovery• Sparing

Page 31: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 31

RPC, RMI crash & omission failures

• Client can’t locate server• Request lost• Server crashes after receipt of

request• Response lost• Client crashes after sending

request

Page 32: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 32

Can’t locate server• Raise an exception, or• Send a signal, or• Log an error and return an error

code

Note: hard to mask distribution in this case

Page 33: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 33

Request lost• Timeout and retry• Back off to “cannot locate server”

if too many timeouts occur

Page 34: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 34

Server crashes after receipt of request

• Possible semantic commitments– Exactly once– At least once– At most once

Normal Work done Work not done

Page 35: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 35

Behavioral possibilities• Server events

– Process (P)– Send completion message (M)– Crash (C)

• Server order– P then M– M then P

• Client strategies– Retry every message– Retry no messages– Retry if unacknowledged– Retry if acknowledged

Page 36: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 36

Combining the options

Page 37: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 37

Lost replies• Make server operations

idempotent whenever possible• Structure requests so that server

can distinguish retries from the original

Page 38: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 38

Client crashes• The server-side activity is called an orphan computation

• Orphans can tie up resources, hold locks, etc

• Four strategies (at least)– Extermination, based on client-side logs

• Client writes a log record before and after each call• When client restarts after a crash, it checks the log

and kills outstanding orphan computations• Problems include:

– Lots of disk activity– Grand-orphans

Page 39: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 39

Client crashes, continued• More approaches for handling orphans

– Re-incarnation, based on client-defined epochs• When client restarts after a crash, it

broadcasts a start-of-epoch message• On receipt of a start-of-epoch message, each

server kills any computation for that client

– “Gentle” re-incarnation• Similar, but server tries to verify that a

computation is really an orphan before killing it

Page 40: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 40

Yet more client-crash strategies

• One more strategy– Expiration

• Each computation has a lease on life• If not complete when the lease expires, a

computation must obtain another lease from its owner

• Clients wait one lease period before restarting after a crash (so any orphans will be gone)

• Problem: what’s a reasonable lease period?

Page 41: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 41

Common problems with client-crash strategies

• Crashes that involve network partition(communication between partitions will

not work at all)

• Killed orphans may leave persistent traces behind, for example– Locks– Requests in message queues

Page 42: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 42

Bonus material• Implementation – reliable point-to-

point communication• Implementation – process groups• Implementation – reliable multicast• Recovery• Sparing

Page 43: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 43

How to do it?• Redundancy applied

– In the appropriate places– In the appropriate ways

• Types of redundancy– Data (e.g. error correcting codes,

replicated data)– Time (e.g. retry)– Physical (e.g. replicated hardware,

backup systems)

Page 44: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 44

Triple Modular Redundancy

Page 45: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 45

Tandem Computers• TMR on

– CPUs– Memory

• Duplicated– Buses– Disks– Power supplies

• A big hit in operations systems for a while

Page 46: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 46

Replicated processing• Based on process groups• A process group consists of one or more

identical processes• Key events

– Message sent to one member of a group– Process joins group– Process leaves group– Process crashes

• Key requirements– Messages must be received by all members– All members must agree on group

membership

Page 47: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 47

Flat or non-flat?

Page 48: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 48

Effective process groups require

• Distributed agreement– On group membership– On coordinator elections– On whether or not to commit a

transaction

• Effective communication– Reliable enough– Scalable enough– Often, multicast– Typically looking for atomic multicast

Page 49: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 49

Process groups also require

• Ability to tolerate crash failures and omission failures– Need k+1 processes to deal with up to k

silent failures

• Ability to tolerate performance, response, and arbitrary failures– Need 3k+1 processes to reach agreement

with up to k Byzantine failures– Need 2k+1 processes to ensure that a

majority of the system produces the correct results with up to k Byzantine failures

Page 50: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 50

Bonus material• Implementation – reliable point-to-

point communication• Implementation – process groups• Implementation – reliable multicast• Recovery• Sparing

Page 51: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 51

Reliable multicasting

Page 52: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 52

Scalability problem• Too many acknowledgements

– One from each receiver– Can be a huge number in some

systems– Also known as “feedback implosion”

Page 53: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 53

Basic feedback suppression in scalable

reliable multicast

If a receiver decides it has missed a message,• it waits a random time, then multicasts a retransmission request• while waiting, if it sees a sufficient request from another receiver,

it does not send its own request• server multicasts all retransmissions

Page 54: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 54

Hierarchical feedback suppression for scalable

reliable multicast

• messages flow from root toward leaves• acks and retransmit requests flow toward root from coordinators• each group can use any reliable small-group multicast scheme

Page 55: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 55

Atomic multicast• Often, in a distributed system,

reliable multicast is a step toward atomic multicast

• Atomic multicast is atomicity applied to communications:– Either all members of a process group

receive a message, OR– No members receive it

• Often requires some form of order agreement as well

Page 56: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 56

How atomic multicast helps

1. Assume we have atomic multicast, among a group of processes, each of which owns a replica of a database

2. One replica goes down3. Database activity continues4. The process comes back up5. Atomic multicast allows us to figure

out exactly which transactions have to be re-played (see pp 386-387)

Page 57: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 57

More concepts• Group view• View change• Virtually synchronous

– Each message is received by all non-faulty processes, or

– If sender crashes during multicast, message could be ignored by all processes

Page 58: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 58

Virtual synchrony picture

Basic idea:in virtual synchrony, a multicast cannot cross a view-change

Page 59: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 59

Receipt vs Delivery

Remember totally-ordered multicast …

Page 60: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 60

What about multicast message order?

• Two aspects:– Relationship between sending order and

delivery order– Agreement on delivery order

• Send/delivery ordering relationships– Unordered– FIFO-ordered– Causally-ordered

• If receivers agree on delivery order, it’s called totally-ordered multicast

Page 61: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 61

UnorderedProcess P1 Process P2 Process P3

sends m1sends m2

delivers m1delivers m2

delivers m2delivers m1

Page 62: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 62

FIFO-ordered

Agreement on: m1 before m2 m3 before m4

Process P1 Process P2 Process P3

sends m1sends m2

delivers m1delivers m3delivers m2delivers m4

delivers m3delivers m1delivers m2delivers m4

Process P4

sends m3sends m4

Page 63: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 63

Six types of virtually synchronous reliable

multicast

Relationship between sendingorder and delivery order

Agreement ondelivery order

Page 64: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 64

Implementing virtual synchrony

Don’t deliver a message until it’s been received everywhere -but “everywhere” can change

(a) 7’s crash is detected by 4, which sends a view-change message

(b) Processes forward unstable messages, followed by flush

(c) When have flush from all processes in new view, install new view

Page 65: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 65

Bonus material• Implementation – reliable point-to-

point communication• Implementation – process groups• Implementation – reliable multicast• Recovery• Sparing

Page 66: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 66

Recovery from error • Two main types:

– Backward recovery to a checkpoint (assumed to be error-free)

– Forward recovery (infer a correct state from available data)

Page 67: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 67

More about checkpoints• They are expensive• Usually combined with a message log• Message logs are cleared at checkpoints• Recovering a crashed process:

– Restart it– Restore its state to the most recent

checkpoint– Replay the message log

Page 68: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 68

Recovery line == most recent distributed

snapshot

Page 69: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 69

Domino effect

Page 70: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 70

Bonus material• Implementation – reliable point-to-

point communication• Implementation – process groups• Implementation – reliable multicast• Recovery• Sparing

Page 71: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 71

Sparing• Not really fault tolerance• But it can be cheaper, and provide

fast restoration time after a failure• Types of spares

– Cold– Hot– Warm

• The spare may or may not also have regular responsibilities in the system

Page 72: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 72

Switchover• Repair is accomplished by

switching processing away from a failed server to a spare

Page 73: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 73

Questions on switchover• Has the failed system really failed?• Is the spare operational?• Can the spare handle the load?

– May need a way to block medium to low priority work during switchovers

• How will the spare get access to the failed server’s data?

• What client session data will be preserved, and how?

Page 74: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655

04/20/23 Distributed Systems - COMP 655 74

More switchover questions• What about configuration files?• What about network addressing?• What about switching back after the

failed server has been repaired?– Partial shutdown of the spare– Updating directories to redirect part of the

load– Making up for lost medium-to-low priority

work