COMP 655: Distributed/Operating Systems
Summer 2011
Dr. Chunbo Chu
Week 7: Fault Tolerance
04/20/23 Distributed Systems - COMP 655
Fault Tolerance
• Fault tolerance concepts
• Implementation – distributed agreement
• Distributed agreement meets transaction processing: 2- and 3-phase commit

Bonus material
• Implementation – reliable point-to-point communication
• Implementation – process groups
• Implementation – reliable multicast
• Recovery
• Sparing
Fault tolerance concepts

• Availability – can I use it now?
– Usually quantified as a percentage
• Reliability – can I use it for a certain period of time?
– Usually quantified as MTBF
• Safety – will anything really bad happen if it does fail?
• Maintainability – how hard is it to fix when it fails?
– Usually quantified as MTTR
Comparing nines

• 1 year = 8760 hr
• Availability levels
– 90% = 876 hr downtime/yr
– 99% = 87.6 hr downtime/yr
– 99.9% = 8.76 hr downtime/yr
– 99.99% = 52.56 min downtime/yr
– 99.999% = 5.256 min downtime/yr
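The downtime figures above follow directly from the availability percentage; a quick sketch (the helper name is mine):

```python
HOURS_PER_YEAR = 8760

def downtime_per_year_hours(availability: float) -> float:
    """Expected downtime per year, in hours, at a given availability level."""
    return (1.0 - availability) * HOURS_PER_YEAR

for a in (0.90, 0.99, 0.999, 0.9999, 0.99999):
    hours = downtime_per_year_hours(a)
    print(f"{a * 100:g}%: {hours:.3f} hr/yr = {hours * 60:.2f} min/yr")
```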
Exercise: how to get five nines

1. Brainstorm what you would have to deal with to build a single-machine system that could run for five years with 25 min downtime. Consider:
– Hardware failures, especially disks
– Power failures
– Network outages
– Software installation
– What else?
2. Come up with some ideas about how to solve the problems you identify
Multiple machines at 99%
Assuming independent failures

Multiple machines at 95%
Assuming independent failures

Multiple machines at 80%
Assuming independent failures

1,000 components
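The trends in these charts come from two one-line formulas, assuming independent failures (a sketch; the function names are mine):

```python
def series_availability(a: float, n: int) -> float:
    """System is up only if ALL n machines are up (fails when any one fails)."""
    return a ** n

def parallel_availability(a: float, n: int) -> float:
    """System is up if at least ONE of n machines is up (replication)."""
    return 1.0 - (1.0 - a) ** n

# Two replicated machines at 99% already give four nines ...
print(parallel_availability(0.99, 2))
# ... but 1,000 components at 99.9% in series give only about 37%.
print(series_availability(0.999, 1000))
```

This is why replication improves availability while long dependency chains destroy it.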
Things to watch out for in availability requirements

• What constitutes an outage …
– A client PC going down?
– A client applet going into an infinite loop?
– A server crashing?
– A network outage?
– Reports unavailable?
– If a transaction times out?
– If 100 transactions time out in a 10 min period?
– etc
More to watch out for

• What constitutes being back up after an outage?
• When does an outage start?
• When does it end?
• Are there outages that don’t count?
– Natural disasters?
– Outages due to operator errors?
• What about MTBF?
Ways to get 99% availability

1. MTBF = 99 hr, MTTR = 1 hr
2. MTBF = 99 min, MTTR = 1 min
3. MTBF = 99 sec, MTTR = 1 sec
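All three combinations give the same steady-state availability, MTBF / (MTBF + MTTR), even though they feel very different to users (one long outage vs many tiny ones). A sketch:

```python
def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR), both in the same units."""
    return mtbf / (mtbf + mttr)

# Hours, minutes, or seconds: as long as the ratio is 99:1, availability is 99%.
print(availability(99.0, 1.0))
```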
More definitions

fault → error → failure
(a fault causes an error, which may cause a failure)

Fault tolerance is continuing to work correctly in the presence of faults.

Types of faults:
• transient
• intermittent
• permanent
Types of failures
If you remember one thing

• Components fail in distributed systems on a regular basis.
• Distributed systems have to be designed to deal with the failure of individual components so that the system as a whole
– Is available and/or
– Is reliable and/or
– Is safe and/or
– Is maintainable
depending on the problem it is trying to solve and the resources available …
Fault Tolerance
• Fault tolerance concepts
• Implementation – distributed agreement
• Distributed agreement meets transaction processing: 2- and 3-phase commit
Two-army problem

• Red army has 5,000 troops
• Blue army and White army have 3,000 troops each
• Attack together and win
• Attack separately and lose, one army at a time
• Communication is by messenger, who might be captured
• Blue and white generals have no way to know when a messenger is captured
Activity: outsmart the generals

• Take your best shot at designing a protocol that can solve the two-army problem
• Spend ten minutes
• Did you think of anything promising?
Conclusion: go home

• “agreement between even two processes is not possible in the face of unreliable communication”
Byzantine generals

• Assume perfect communication
• Assume n generals, m of whom should not be trusted
• The problem is to reach agreement on troop strength among the non-faulty generals
Byzantine generals - example

n = 4, m = 1 (units are K-troops)

(a) Multicast troop-strength messages
(b) Construct troop-strength vectors
(c) Compare notes: majority rules in each component

Result: 1, 2, and 4 agree on (1, 2, unknown, 4)
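The example can be simulated directly. A sketch with n = 4, m = 1, where general 3 is the traitor; every value the traitor sends is made up for illustration:

```python
from collections import Counter

LOYAL = {1: 1, 2: 2, 4: 4}   # loyal generals -> true troop strength (K-troops)

def majority(values):
    """The value with a strict majority, or 'unknown' if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) // 2 else "unknown"

# Step (a): everyone multicasts troop strength; the traitor (general 3)
# tells each peer something different (arbitrary lies).
lies_round1 = {1: 8, 2: 9, 4: 7}
# Step (b): each loyal general g now holds a first-hand vector.
vectors = {g: {1: 1, 2: 2, 3: lies_round1[g], 4: 4} for g in LOYAL}

# Step (c): generals exchange vectors; the traitor again sends arbitrary ones.
lies_round2 = {1: (5, 6, 0, 5), 2: (6, 5, 1, 6), 4: (0, 0, 0, 0)}
decisions = {}
for g in LOYAL:
    received = [vectors[p] for p in LOYAL if p != g]
    received.append(dict(zip((1, 2, 3, 4), lies_round2[g])))
    decisions[g] = tuple(majority([v[i] for v in received]) for i in (1, 2, 3, 4))

print(decisions)   # every loyal general decides (1, 2, 'unknown', 4)
```

The traitor's lies never win a majority vote, so the loyal generals agree on each other's strengths and mark the traitor's component unknown.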
Doesn’t work with n = 3, m = 1
Fault Tolerance
• Fault tolerance concepts
• Implementation – distributed agreement
• Distributed agreement meets transaction processing: 2- and 3-phase commit
Distributed commit protocols

• What is the problem they are trying to solve?
– Ensure that a group of processes all do something, or none of them do
– Example: in a distributed transaction that involves updates to data on three different servers, ensure that all three commit or none of them do
2-phase commit

(figure: coordinator and participant state machines)

What to do when P, in READY state, contacts Q
If coordinator crashes

• Participants could wait until the coordinator recovers
• Or, they could try to figure out what to do among themselves
– Example: if P contacts Q, and Q is in the COMMIT state, P should COMMIT as well
2-phase commit
What to do when P, in READY state, contacts Q

If all surviving participants are in READY state,
1. Wait for coordinator to recover
2. Elect a new coordinator (?)
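The decision table for a READY participant P that contacts a peer Q can be sketched as follows (a reading of the standard 2PC recovery rules; the function name is mine):

```python
def ready_participant_action(q_state: str) -> str:
    """What P (blocked in READY) should do after learning peer Q's state."""
    return {
        "COMMIT": "commit",   # the coordinator must have decided commit
        "ABORT": "abort",     # the coordinator must have decided abort
        "INIT": "abort",      # Q never voted, so the coordinator cannot have decided commit
        "READY": "contact another participant; block if all are READY",
    }[q_state]
```

The READY/READY case is exactly why 2PC is a blocking protocol: if the coordinator and every informed participant are down, the survivors cannot decide.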
3-phase commit

• Problem addressed: non-blocking distributed commit in the presence of failures
• Interesting theoretically, but rarely used in practice
3-phase commit

(figure: coordinator and participant state machines)
Bonus material
• Implementation – reliable point-to-point communication
• Implementation – process groups
• Implementation – reliable multicast
• Recovery
• Sparing
RPC, RMI crash & omission failures

• Client can’t locate server
• Request lost
• Server crashes after receipt of request
• Response lost
• Client crashes after sending request
Can’t locate server

• Raise an exception, or
• Send a signal, or
• Log an error and return an error code

Note: hard to mask distribution in this case
Request lost

• Timeout and retry
• Back off to “cannot locate server” if too many timeouts occur
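A client-side sketch of timeout-and-retry with a fallback to "cannot locate server" (the names and the backoff policy are illustrative):

```python
import time

def call_with_retry(send, max_tries: int = 5, base_delay: float = 0.01):
    """Retry a request that may have been lost; give up after max_tries."""
    for attempt in range(max_tries):
        try:
            return send()   # send() raises TimeoutError when the request is lost
        except TimeoutError:
            time.sleep(base_delay * 2 ** attempt)   # simple exponential backoff
    raise ConnectionError("cannot locate server")
```

Note that blind retry is only safe if the server can tolerate duplicates, which is the point of the idempotence discussion below.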
Server crashes after receipt of request

• Possible semantic commitments
– Exactly once
– At least once
– At most once

(figure: normal case, crash after work done, crash before work done)
Behavioral possibilities

• Server events
– Process (P)
– Send completion message (M)
– Crash (C)
• Server order
– P then M
– M then P
• Client strategies
– Retry every message
– Retry no messages
– Retry if unacknowledged
– Retry if acknowledged
Combining the options
Lost replies

• Make server operations idempotent whenever possible
• Structure requests so that server can distinguish retries from the original
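When an operation cannot be made idempotent, the server can still distinguish retries from originals by deduplicating on a request id. A minimal sketch (the table and names are hypothetical stand-ins for stable storage):

```python
replies: dict = {}   # request_id -> cached reply

def handle(request_id: str, operation, *args):
    """Execute each request at most once; a retried request_id gets the cached reply."""
    if request_id not in replies:
        replies[request_id] = operation(*args)   # executed only on first arrival
    return replies[request_id]
```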
Client crashes

• The server-side activity is called an orphan computation
• Orphans can tie up resources, hold locks, etc
• Four strategies (at least)
– Extermination, based on client-side logs
• Client writes a log record before and after each call
• When client restarts after a crash, it checks the log and kills outstanding orphan computations
• Problems include:
– Lots of disk activity
– Grand-orphans
Client crashes, continued

• More approaches for handling orphans
– Re-incarnation, based on client-defined epochs
• When client restarts after a crash, it broadcasts a start-of-epoch message
• On receipt of a start-of-epoch message, each server kills any computation for that client
– “Gentle” re-incarnation
• Similar, but server tries to verify that a computation is really an orphan before killing it
Yet more client-crash strategies

• One more strategy
– Expiration
• Each computation has a lease on life
• If not complete when the lease expires, a computation must obtain another lease from its owner
• Clients wait one lease period before restarting after a crash (so any orphans will be gone)
• Problem: what’s a reasonable lease period?
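The expiration strategy can be sketched with a lease object and a server-side reaper (all names are hypothetical):

```python
import time

class Lease:
    """A lease on a computation's life; the owner must renew it to keep the computation alive."""
    def __init__(self, duration: float):
        self.duration = duration
        self.expires = time.monotonic() + duration

    def renew(self) -> None:
        self.expires = time.monotonic() + self.duration

    def expired(self) -> bool:
        return time.monotonic() > self.expires

def reap(computations: dict) -> list:
    """Kill computations whose lease ran out: their owner crashed and never renewed."""
    dead = [name for name, lease in computations.items() if lease.expired()]
    for name in dead:
        del computations[name]
    return dead
```

The open question from the slide shows up here as the `duration` parameter: too short and live clients spend all their time renewing; too long and orphans linger.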
Common problems with client-crash strategies

• Crashes that involve network partition (communication between partitions will not work at all)
• Killed orphans may leave persistent traces behind, for example
– Locks
– Requests in message queues
Bonus material
• Implementation – reliable point-to-point communication
• Implementation – process groups
• Implementation – reliable multicast
• Recovery
• Sparing
How to do it?

• Redundancy applied
– In the appropriate places
– In the appropriate ways
• Types of redundancy
– Data (e.g. error correcting codes, replicated data)
– Time (e.g. retry)
– Physical (e.g. replicated hardware, backup systems)
Triple Modular Redundancy
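At the heart of TMR is a majority voter placed after each triplicated stage; a sketch:

```python
def vote(a, b, c):
    """Majority voter: return the value at least two of the three modules agree on."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no two modules agree")   # double fault: TMR cannot mask this
```

A single faulty module is outvoted by the other two, so any one failure per voting stage is masked.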
Tandem Computers

• TMR on
– CPUs
– Memory
• Duplicated
– Buses
– Disks
– Power supplies
• A big hit in operations systems for a while
Replicated processing

• Based on process groups
• A process group consists of one or more identical processes
• Key events
– Message sent to one member of a group
– Process joins group
– Process leaves group
– Process crashes
• Key requirements
– Messages must be received by all members
– All members must agree on group membership
Flat or non-flat?
Effective process groups require

• Distributed agreement
– On group membership
– On coordinator elections
– On whether or not to commit a transaction
• Effective communication
– Reliable enough
– Scalable enough
– Often, multicast
– Typically looking for atomic multicast
Process groups also require

• Ability to tolerate crash failures and omission failures
– Need k+1 processes to deal with up to k silent failures
• Ability to tolerate performance, response, and arbitrary failures
– Need 3k+1 processes to reach agreement with up to k Byzantine failures
– Need 2k+1 processes to ensure that a majority of the system produces the correct results with up to k Byzantine failures
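The replication requirements above can be written down directly (a hypothetical helper; the model names are mine):

```python
def group_size(k: int, failure_model: str) -> int:
    """Minimum number of processes to tolerate k faulty members under each model."""
    return {
        "silent": k + 1,                    # crash/omission: one survivor is enough
        "byzantine_agreement": 3 * k + 1,   # reach agreement despite k arbitrary failures
        "byzantine_majority": 2 * k + 1,    # a majority of outputs is still correct
    }[failure_model]

print(group_size(1, "byzantine_agreement"))   # → 4, matching the n = 4, m = 1 example
```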
Bonus material
• Implementation – reliable point-to-point communication
• Implementation – process groups
• Implementation – reliable multicast
• Recovery
• Sparing
Reliable multicasting
Scalability problem

• Too many acknowledgements
– One from each receiver
– Can be a huge number in some systems
– Also known as “feedback implosion”
Basic feedback suppression in scalable reliable multicast

If a receiver decides it has missed a message,
• it waits a random time, then multicasts a retransmission request
• while waiting, if it sees a retransmission request for the same message from another receiver, it does not send its own request
• the server multicasts all retransmissions
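The effect of the random wait can be shown with a toy model: every receiver that missed the message picks a random delay, and a receiver sends its request only if its timer fires before the earliest request (delayed by network latency) reaches it. Everything here is illustrative:

```python
import random

def requests_sent(num_missing: int, latency: float, max_delay: float = 0.5) -> int:
    """How many retransmission requests actually go out in one suppression round."""
    delays = sorted(random.uniform(0.0, max_delay) for _ in range(num_missing))
    first = delays[0]   # the earliest timer always fires and its request is multicast
    # a receiver sends iff its own timer fires before that first request reaches it
    return sum(1 for d in delays if d <= first + latency)

random.seed(1)
print(requests_sent(100, latency=0.001))   # low latency: nearly all requests suppressed
print(requests_sent(100, latency=1.0))     # latency > max_delay: suppression never helps
```

The model shows why the random-delay window must be large relative to network latency for suppression to work.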
Hierarchical feedback suppression for scalable reliable multicast

• messages flow from root toward leaves
• acks and retransmit requests flow toward root from coordinators
• each group can use any reliable small-group multicast scheme
Atomic multicast

• Often, in a distributed system, reliable multicast is a step toward atomic multicast
• Atomic multicast is atomicity applied to communications:
– Either all members of a process group receive a message, OR
– No members receive it
• Often requires some form of order agreement as well
How atomic multicast helps

1. Assume we have atomic multicast among a group of processes, each of which owns a replica of a database
2. One replica goes down
3. Database activity continues
4. The process comes back up
5. Atomic multicast allows us to figure out exactly which transactions have to be re-played (see pp 386-387)
More concepts

• Group view
• View change
• Virtually synchronous
– Each message is received by all non-faulty processes, or
– If the sender crashes during the multicast, the message may be ignored by all processes
Virtual synchrony picture

Basic idea: in virtual synchrony, a multicast cannot cross a view change
Receipt vs Delivery
Remember totally-ordered multicast …
What about multicast message order?

• Two aspects:
– Relationship between sending order and delivery order
– Agreement on delivery order
• Send/delivery ordering relationships
– Unordered
– FIFO-ordered
– Causally-ordered
• If receivers agree on delivery order, it’s called totally-ordered multicast
Unordered

Process P1    Process P2     Process P3
sends m1      delivers m1    delivers m2
sends m2      delivers m2    delivers m1
FIFO-ordered

Process P1    Process P2     Process P3     Process P4
sends m1      delivers m1    delivers m3    sends m3
sends m2      delivers m3    delivers m1    sends m4
              delivers m2    delivers m2
              delivers m4    delivers m4

Agreement on: m1 before m2, m3 before m4
Six types of virtually synchronous reliable multicast

(table: relationship between sending order and delivery order, crossed with agreement on delivery order)
Implementing virtual synchrony

Don’t deliver a message until it’s been received everywhere - but “everywhere” can change

(a) 7’s crash is detected by 4, which sends a view-change message
(b) Processes forward unstable messages, followed by a flush message
(c) When a process has a flush message from every process in the new view, it installs the new view
Bonus material
• Implementation – reliable point-to-point communication
• Implementation – process groups
• Implementation – reliable multicast
• Recovery
• Sparing
Recovery from error

• Two main types:
– Backward recovery to a checkpoint (assumed to be error-free)
– Forward recovery (infer a correct state from available data)
More about checkpoints

• They are expensive
• Usually combined with a message log
• Message logs are cleared at checkpoints
• Recovering a crashed process:
– Restart it
– Restore its state to the most recent checkpoint
– Replay the message log
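The recovery recipe above, in miniature (in-memory stand-ins for stable storage; all names are hypothetical):

```python
class RecoverableProcess:
    """Backward recovery: checkpoint the state, log messages, replay the log after a crash."""
    def __init__(self):
        self.state = 0         # the process state is just a running sum here
        self.saved_state = 0   # most recent checkpoint
        self.log = []          # messages received since that checkpoint

    def handle(self, msg: int):
        self.log.append(msg)   # log to stable storage before processing
        self.state += msg

    def checkpoint(self):
        self.saved_state = self.state
        self.log.clear()       # the message log is cleared at each checkpoint

    def recover(self):
        self.state = self.saved_state        # 1. restore the most recent checkpoint
        to_replay, self.log = self.log, []
        for msg in to_replay:                # 2. replay the message log
            self.handle(msg)

p = RecoverableProcess()
p.handle(1); p.handle(2); p.checkpoint(); p.handle(3)
p.state = -999    # simulate a crash corrupting in-memory state
p.recover()
print(p.state)    # → 6: checkpointed 3 plus the replayed message 3
```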
Recovery line == most recent distributed snapshot
Domino effect
Bonus material
• Implementation – reliable point-to-point communication
• Implementation – process groups
• Implementation – reliable multicast
• Recovery
• Sparing
Sparing

• Not really fault tolerance
• But it can be cheaper, and provide fast restoration time after a failure
• Types of spares
– Cold
– Hot
– Warm
• The spare may or may not also have regular responsibilities in the system
Switchover

• Repair is accomplished by switching processing away from a failed server to a spare
Questions on switchover

• Has the failed system really failed?
• Is the spare operational?
• Can the spare handle the load?
– May need a way to block medium to low priority work during switchovers
• How will the spare get access to the failed server’s data?
• What client session data will be preserved, and how?
More switchover questions

• What about configuration files?
• What about network addressing?
• What about switching back after the failed server has been repaired?
– Partial shutdown of the spare
– Updating directories to redirect part of the load
– Making up for lost medium-to-low priority work