Ali Asghar Pourhaji Kazem, Spring 2015
DISTRIBUTED SYSTEMS
Principles and Paradigms
Second Edition
ANDREW S. TANENBAUM
MAARTEN VAN STEEN
Chapter 8: Fault Tolerance
Fault Tolerance Basic Concepts
• Being fault tolerant is strongly related to what are called dependable systems
• Dependability implies the following:
  1. Availability: readiness for immediate use
  2. Reliability: continuity of correct service delivery
  3. Safety: very low probability of catastrophes
  4. Maintainability: how easily a failed system can be repaired
Types of Faults
• Transient: a transient fault occurs once and then disappears
  • If the operation is repeated, the fault goes away. A bird flying through the beam of a microwave transmitter may cause lost bits on some network
• Intermittent: an intermittent fault occurs, then vanishes of its own accord, then reappears, and so on
  • A loose contact on a connector will often cause an intermittent fault
• Permanent: a permanent fault is one that continues to exist until the faulty component is replaced
Terminology
• Failure: when a component is not living up to its specifications, a failure occurs
• Error: that part of a component's state that can lead to a failure
• Fault: the cause of an error
What to do about faults
• Fault prevention: prevent the occurrence of a fault
• Fault tolerance: build a component such that it can mask the presence of faults
• Fault removal: reduce the presence, number, and seriousness of faults
• Fault forecasting: estimate the present number, future incidence, and consequences of faults
Failure Models
Figure 8-1. Different types of failures.
Failure Masking by Redundancy
• If a system is to be fault tolerant, the best it can do is to try to hide the occurrence of failures from other processes
• The key technique for masking faults is to use redundancy
• Three kinds are possible:
  • information redundancy
  • time redundancy
  • physical redundancy
Failure Masking by Redundancy
• Information redundancy: extra bits are added to allow recovery from garbled bits
• Time redundancy: an action is performed, and then, if need be, it is performed again
• Physical redundancy: extra equipment or processes are added to make it possible for the system as a whole to tolerate the loss or malfunctioning of some components
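Physical redundancy is the idea behind triple modular redundancy (Figure 8-2): three replicas compute the same result and a voter takes the majority, so a single faulty replica is masked. A minimal sketch (the function name and value types are illustrative, not from the text):

```python
def tmr_vote(a, b, c):
    # Majority vote over the outputs of three replicated modules.
    # A single faulty replica is outvoted by the two correct ones.
    if a == b or a == c:
        return a
    return b  # a disagrees with both, so b == c if any majority exists


# One garbled replica output is masked by the vote:
print(tmr_vote(1, 1, 0))  # 1
```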
Failure Masking by Redundancy
Figure 8-2. Triple modular redundancy.
PROCESS RESILIENCE (1)
• The first topic we discuss is protection against process failures, which is achieved by replicating processes into groups
• The key approach to tolerating a faulty process is to organize several identical processes into a group
• The key property that all groups have is that when a message is sent to the group itself, all members of the group receive it
• In this way, if one process in a group fails, hopefully some other process can take over for it
PROCESS RESILIENCE (2)
• Basic issue
  • Protect yourself against faulty processes by replicating and distributing computations in a group
• Flat groups
  • Good for fault tolerance, as information exchange immediately occurs with all group members; however, may impose more overhead, as control is completely distributed (hard to implement)
• Hierarchical groups
  • All communication goes through a single coordinator ⇒ not really fault tolerant or scalable, but relatively easy to implement
Flat Groups versus Hierarchical Groups
Figure 8-3. (a) Communication in a flat group. (b) Communication in a simple hierarchical group.
Groups and failure masking (1)
• k-fault tolerant group
  • When a group can mask any k concurrent member failures (k is called the degree of fault tolerance)
• How large does a k-fault tolerant group need to be?
  • Assume crash/performance failure semantics ⇒ a total of k + 1 members are needed to survive k member failures
  • Assume arbitrary failure semantics, with group output defined by voting ⇒ a total of 2k + 1 members are needed to survive k member failures
• Assumption
  • All members are identical and process all input in the same order ⇒ only then can we be sure that they do exactly the same thing
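The two sizing rules above can be written down directly; a small sketch under those assumptions (the function name is ours):

```python
def min_group_size(k, failure_semantics):
    # Minimum number of members needed to mask k concurrent failures,
    # following the sizing rules stated above.
    if failure_semantics == "crash":
        return k + 1        # one surviving member suffices
    if failure_semantics == "arbitrary":
        return 2 * k + 1    # correct members must outvote the faulty ones
    raise ValueError("unknown failure semantics: " + failure_semantics)


print(min_group_size(2, "crash"))      # 3
print(min_group_size(2, "arbitrary"))  # 5
```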
Groups and failure masking (2)
• Scenario
  • Assuming arbitrary failure semantics, we need 3k + 1 group members to survive the attacks of k faulty members. This is also known as Byzantine failures
• Essence
  • We are trying to reach a majority vote among the group of loyalists, in the presence of k traitors ⇒ need 2k + 1 loyalists
Agreement in Faulty Systems (1)
(a) What they send to each other
(b) What each one got from the other
(c) What each one got in the second step
Agreement in Faulty Systems (2)
Figure 8-6. The same as Fig. 8-5, except now with two correct processes and one faulty process.
Failure Detection
• We detect failures through timeout mechanisms
• Setting timeouts properly is very difficult and application dependent
  • You cannot distinguish process failures from network failures
• We need to consider failure notification throughout the system:
  • Gossiping (i.e., proactively disseminate a failure detection)
  • On failure detection, pretend you failed as well
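The timeout mechanism above can be sketched as a minimal heartbeat-based detector (class and method names are ours, not from the text):

```python
import time


class TimeoutFailureDetector:
    # A peer not heard from within `timeout` seconds is suspected.
    # As noted above, a suspicion may in fact be a network failure,
    # so this detector can only *suspect* a crash, never know.
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heard = {}

    def heartbeat(self, peer, now=None):
        # Record that we just heard from `peer`.
        self.last_heard[peer] = time.monotonic() if now is None else now

    def suspects(self, peer, now=None):
        # True if `peer` has been silent longer than the timeout.
        now = time.monotonic() if now is None else now
        last = self.last_heard.get(peer)
        return last is None or now - last > self.timeout
```

Passing `now` explicitly makes the detector testable; in production the monotonic clock is used so that wall-clock adjustments cannot trigger false suspicions.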
Reliable Communication
• So far we concentrated on process resilience (by means of process groups). What about reliable communication channels?
• Error detection
  • Framing of packets to allow for bit error detection
  • Use of frame numbering to detect packet loss
• Error correction
  • Add enough redundancy that corrupted packets can be automatically corrected
  • Request retransmission of lost, or the last N, packets
RPC Semantics in the Presence of Failures (1)
Five different classes of failures can occur in RPC systems:
1. The client is unable to locate the server.
2. The request message from the client to the server is lost.
3. The server crashes after receiving a request.
4. The reply message from the server to the client is lost.
5. The client crashes after sending a request.
RPC Semantics in the Presence of Failures (2)
• Client cannot locate server: raise an exception or send a signal to the client, leading to a loss in transparency.
• Lost request messages: start a timer when sending a request. If the timer expires before a reply is received, send the request again. The server would need to detect duplicate requests.
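The client side of the timer scheme above can be sketched as a retry loop. Here `send_request` is a hypothetical transport hook that returns the reply or `None` on timeout; it is not part of any particular RPC library:

```python
def call_with_retry(send_request, max_tries):
    # Resend the request each time the (simulated) timer expires,
    # giving up after max_tries attempts.
    for _ in range(max_tries):
        reply = send_request()
        if reply is not None:
            return reply
    raise TimeoutError("no reply after %d tries" % max_tries)


# Two timeouts, then a reply finally arrives:
replies = iter([None, None, "ok"])
print(call_with_retry(lambda: next(replies), 5))  # ok
```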
Server Crashes (1)
• Server crashes: a crash before or after executing the request is indistinguishable from the client side...
• We need to decide on what we expect from the server:
  • At-least-once semantics: the server guarantees it will carry out the operation at least once, no matter what
  • At-most-once semantics: the server guarantees it will carry out the operation at most once
  • Guarantee-nothing semantics!
  • Exactly-once semantics
Server Crashes (2)
Figure 8-7. A server in client-server communication. (a) The normal case. (b) Crash after execution. (c) Crash before execution.
Server Crashes (3)
Three events can happen at the server:
• Send the completion message (M)
• Print the text (P)
• Crash (C)
Server Crashes (4)
These events can occur in six different orderings:
1. M → P → C: A crash occurs after sending the completion message and printing the text.
2. M → C (→ P): A crash happens after sending the completion message, but before the text could be printed.
3. P → M → C: A crash occurs after printing the text and sending the completion message.
4. P → C (→ M): The text was printed, after which a crash occurs before the completion message could be sent.
5. C (→ P → M): A crash happens before the server could do anything.
6. C (→ M → P): A crash happens before the server could do anything.
Server Crashes (5)
Figure 8-8. Different combinations of client and server strategies in the presence of server crashes.
RPC Semantics in the Presence of Failures (3)
• Lost reply messages
  • Detecting lost replies can be hard, because it can also be that the server has crashed. You don't know whether the server has carried out the operation
• Solution
  • Set a timer on the client. If it expires without a reply, then send the request again. If requests are idempotent, then they can be repeated without ill effects
RPC Semantics in the Presence of Failures (4)
• Client crashes create orphans. An orphan is an active computation on the server for which there is no client waiting. Dealing with orphans:
• Extermination
  • The client logs each request in a file before sending it. After a reboot the file is checked and the orphan is explicitly killed off. Expensive, cannot locate grand-orphans, etc.
• Reincarnation
  • Divide time into sequentially numbered epochs. When a client reboots, it broadcasts a message declaring a new epoch. This allows servers to terminate orphan computations.
• Gentle reincarnation
  • A server tries to locate the owner of orphans before killing the computation.
• Expiration
  • Each RPC is given a quantum of time to finish its job. If it cannot finish, it asks for another quantum. After a crash, a client need only wait for a quantum to make sure all orphans are gone.
RPC Semantics in the Presence of Failures (5)
• An idempotent operation can be repeated as often as necessary without any harm being done, e.g., reading a block from a file.
• In general, try to make RPC/RMI methods idempotent if possible. If not, duplicates can be dealt with in a couple of ways:
  • Use a sequence number with each request so the server can detect duplicates. But now the server needs to keep state for each client.
  • Have a bit in the message to distinguish between original and duplicate transmission.
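The sequence-number scheme above can be sketched as a server that caches replies per (client, sequence number) and replays them for duplicates rather than re-executing a non-idempotent operation (class and method names are ours):

```python
class DedupServer:
    # Per-client duplicate detection: remember the reply for each
    # request number already handled, and replay it on retransmission.
    # This is the state the server "needs to keep for each client".
    def __init__(self):
        self.replies = {}    # (client, seq) -> cached reply

    def handle(self, client, seq, operation):
        key = (client, seq)
        if key in self.replies:
            return self.replies[key]   # duplicate: do not execute again
        reply = operation()            # execute the operation once
        self.replies[key] = reply
        return reply
```

A retried request (same client, same sequence number) thus returns the cached reply, and the operation runs exactly once on the server.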
Basic Reliable Multicasting Schemes (1)
• Although most transport layers offer reliable point-to-point channels, they rarely offer reliable communication to a collection of processes
• The best they can offer is to let each process set up a point-to-point connection to each other process it wants to communicate with
• Obviously, such an organization is not very efficient, as it may waste network bandwidth
• Nevertheless, if the number of processes is small, achieving reliability through multiple reliable point-to-point channels is a simple and often straightforward solution
Basic Reliable Multicasting Schemes (2)
• To go beyond this simple case, we need to define precisely what reliable multicasting is
• Intuitively, it means that a message that is sent to a process group should be delivered to each member of that group
• To cover such situations, a distinction should be made between reliable communication in the presence of faulty processes, and reliable communication when processes are assumed to operate correctly
Basic Reliable-Multicasting Schemes
Figure 8-9. A simple solution to reliable multicasting when all receivers are known and are assumed not to fail.
(a) Message transmission. (b) Reporting feedback.
Basic Reliable Multicasting Schemes (3)
• The sending process assigns a sequence number to each message it multicasts
• We assume that messages are received in the order they are sent
• In this way, it is easy for a receiver to detect it is missing a message
• Each multicast message is stored locally in a history buffer at the sender
• Assuming the receivers are known to the sender, the sender simply keeps the message in its history buffer until each receiver has returned an acknowledgment
• If a receiver detects it is missing a message, it may return a negative acknowledgment, requesting the sender for a retransmission
• Alternatively, the sender may automatically retransmit the message when it has not received all acknowledgments within a certain time
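The sender side of this scheme can be sketched in a few lines: sequence-numbered messages stay in the history buffer until every known receiver has acknowledged them, and the buffer also serves negative acknowledgments (class and method names are ours):

```python
class SequencedMulticastSender:
    # History-buffer reliable multicast, as described above.
    def __init__(self, receivers):
        self.receivers = set(receivers)
        self.next_seq = 0
        self.history = {}    # seq -> [message, set of pending receivers]

    def multicast(self, msg):
        # Assign the next sequence number and buffer the message.
        self.next_seq += 1
        self.history[self.next_seq] = [msg, set(self.receivers)]
        return self.next_seq

    def ack(self, receiver, seq):
        # Drop the message once every receiver has acknowledged it.
        if seq in self.history:
            self.history[seq][1].discard(receiver)
            if not self.history[seq][1]:
                del self.history[seq]

    def retransmit(self, seq):
        # Serve a negative acknowledgment from the history buffer.
        return self.history[seq][0]
```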
Scalability in Reliable Multicasting: Feedback Suppression
Basic idea
• Let a process P suppress its own feedback when it notices another process Q is already asking for a retransmission

Assumptions
• All receivers listen to a common feedback channel to which feedback messages are submitted
• Process P schedules its own feedback message randomly, and suppresses it when observing another feedback message
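A small simulation of one feedback round illustrates the effect: every receiver that missed the message draws a random delay, and all but the earliest observe that NACK on the shared channel first and suppress their own (the function is our illustration, not from the text):

```python
import random


def nack_round(num_receivers, max_delay, seed=42):
    # Each receiver schedules its retransmission request at a random
    # time; only the earliest one is actually sent, the rest are
    # suppressed when they hear it on the common feedback channel.
    rng = random.Random(seed)
    delays = [rng.uniform(0, max_delay) for _ in range(num_receivers)]
    earliest = min(delays)
    return [d == earliest for d in delays]   # True = this NACK is sent


# Ten receivers missed the message, but only one NACK reaches the sender:
print(sum(nack_round(10, 1.0)))  # 1
```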
Nonhierarchical Feedback Control
Figure 8-10. Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others.
Hierarchical Feedback Control (1)
Basic solution
• Construct a hierarchical feedback channel in which all submitted messages are sent only to the root. Intermediate nodes aggregate feedback messages before passing them on.

Observation
• Intermediate nodes can easily be used for retransmission purposes
Hierarchical Feedback Control (2)
Figure 8-11. The essence of hierarchical reliable multicasting. Each local coordinator forwards the message to its children and later handles retransmission requests.
Atomic Multicast
• In particular, what is often needed in a distributed system is the guarantee that a message is delivered either to all processes or to none at all
• In addition, it is generally also required that all messages are delivered in the same order to all processes
• This is also known as the atomic multicast problem
Virtual Synchrony (1)
• We again adopt a model in which the distributed system consists of a communication layer
• Within this communication layer, messages are sent and received
• A received message is locally buffered in the communication layer until it can be delivered to the application, which is logically placed at a higher layer
Virtual Synchrony (2)
Figure 8-12. The logical organization of a distributed system to distinguish between message receipt and message delivery.
Virtual Synchrony (3)

Figure 8-13. The principle of virtual synchronous multicast.
Message Ordering (1)
Four different orderings are distinguished:
• Unordered multicasts
• FIFO-ordered multicasts
• Causally-ordered multicasts
• Totally-ordered multicasts
Message Ordering (2)
• A reliable, unordered multicast is a virtually synchronous multicast in which no guarantees are given concerning the order in which received messages are delivered by different processes.
• In the case of reliable FIFO-ordered multicasts, the communication layer is forced to deliver incoming messages from the same process in the same order as they have been sent.
Message Ordering (3)
• Reliable causally-ordered multicast delivers messages so that potential causality between different messages is preserved
• In other words, if a message m1 causally precedes another message m2, regardless of whether they were multicast by the same sender, then the communication layer at each receiver will always deliver m2 after it has received and delivered m1
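Causal delivery is commonly realized with vector timestamps; the text above does not prescribe this mechanism, so the sketch below is one standard realization, with names of our choosing. A message from sender j carrying timestamp ts is deliverable once ts[j] equals clock[j] + 1 and ts[k] ≤ clock[k] for every other k:

```python
class CausalDelivery:
    # One member's communication layer for causally ordered delivery,
    # using vector timestamps.
    def __init__(self, group_size):
        self.clock = [0] * group_size
        self.pending = []

    def _deliverable(self, sender, ts):
        return all(t == self.clock[k] + 1 if k == sender else t <= self.clock[k]
                   for k, t in enumerate(ts))

    def receive(self, sender, ts, msg):
        # Buffer the message, then repeatedly deliver anything whose
        # causal predecessors have all been delivered.
        self.pending.append((sender, list(ts), msg))
        delivered, progress = [], True
        while progress:
            progress = False
            for entry in list(self.pending):
                s, t, m = entry
                if self._deliverable(s, t):
                    self.pending.remove(entry)
                    self.clock[s] = t[s]
                    delivered.append(m)
                    progress = True
        return delivered
```

If m2 (which causally follows m1) arrives first, it is buffered; delivering m1 then releases m2, so every member sees m1 before m2.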
Message Ordering (4)
• Besides the three orderings discussed, there may be the additional constraint that message delivery is totally ordered as well
• Totally-ordered delivery means that regardless of whether message delivery is unordered, FIFO ordered, or causally ordered, it is additionally required that when messages are delivered, they are delivered in the same order to all group members
Message Ordering (5)
Figure 8-14. Three communicating processes in the same group. The ordering of events per process is shown along the vertical axis.
Message Ordering (6)
Figure 8-15. Four processes in the same group with two different senders, and a possible delivery order of messages under FIFO-ordered multicasting.
Implementing Virtual Synchrony (1)
Figure 8-16. Six different versions of virtually synchronous reliable multicasting.
Implementing Virtual Synchrony (2)
Figure 8-17. (a) Process 4 notices that process 7 has crashed and sends a view change.
Implementing Virtual Synchrony (3)
Figure 8-17. (b) Process 6 sends out all its unstable messages, followed by a flush message.
Implementing Virtual Synchrony (4)
Figure 8-17. (c) Process 6 installs the new view when it has received a flush message from everyone else.
Two-Phase Commit (1)
Figure 8-18. (a) The finite state machine for the coordinator in 2PC. (b) The finite state machine for a participant.
Two-Phase Commit (2)
Figure 8-19. Actions taken by a participant P when residing in state READY and having contacted another participant Q.
Two-Phase Commit (3)
Figure 8-20. Outline of the steps taken by the coordinator in a two-phase commit protocol.
. . .
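The coordinator's outline in Fig. 8-20 boils down to two phases: collect votes, then broadcast the global decision. A stripped-down sketch follows; the participant interface is hypothetical, and the real protocol also writes a log and handles timeouts, both omitted here:

```python
def two_phase_commit(participants):
    # Phase 1: ask every participant to vote on the transaction.
    votes = [p.vote() for p in participants]
    # The global decision is COMMIT only if everyone voted to commit.
    if all(v == "VOTE_COMMIT" for v in votes):
        decision = "GLOBAL_COMMIT"
    else:
        decision = "GLOBAL_ABORT"
    # Phase 2: broadcast the decision; participants act on it.
    for p in participants:
        p.decide(decision)
    return decision


class Participant:
    # Minimal stub for demonstration only.
    def __init__(self, vote):
        self._vote, self.decision = vote, None

    def vote(self):
        return self._vote

    def decide(self, decision):
        self.decision = decision
```

A single VOTE_ABORT is enough to abort the whole transaction, which is exactly the blocking point the three-phase variant later addresses.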
Two-Phase Commit (4)
Figure 8-20. Outline of the steps taken by the coordinator in a two-phase commit protocol.
. . .
Two-Phase Commit (5)
Figure 8-21. (a) The steps taken by a participant process in 2PC.
Two-Phase Commit (6)

Figure 8-21. (b) The steps for handling incoming decision requests.
Three-Phase Commit (1)
The states of the coordinator and each participant satisfy the following two conditions:
1. There is no single state from which it is possible to make a transition directly to either a COMMIT or an ABORT state.
2. There is no state in which it is not possible to make a final decision, and from which a transition to a COMMIT state can be made.
Three-Phase Commit (2)
Figure 8-22. (a) The finite state machine for the coordinator in 3PC. (b) The finite state machine for a participant.
Recovery: Background
Essence
• When a failure occurs, we need to bring the system into an error-free state:
  • Forward error recovery: find a new state from which the system can continue operation
  • Backward error recovery: bring the system back into a previous error-free state

Practice
• Use backward error recovery, requiring that we establish recovery points

Observation
• Recovery in distributed systems is complicated by the fact that processes need to cooperate in identifying a consistent state from which to recover
Recovery – Stable Storage
Figure 8-23. (a) Stable storage. (b) Crash after drive 1 is updated. (c) Bad spot.
Consistent Recovery State
Requirement
• Every message that has been received is also shown to have been sent in the state of the sender

Recovery line
• Assuming processes regularly checkpoint their state, the most recent consistent global checkpoint
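The requirement above can be checked mechanically for a candidate set of checkpoints. In this illustrative sketch (names and the time-based encoding are ours), `cut` maps each process to its checkpoint time and each message is a `(sender, send_time, receiver, recv_time)` tuple; the cut is inconsistent if some message was received before the cut but sent after it:

```python
def is_consistent(cut, messages):
    # A cut is consistent when no message is recorded as received
    # (recv_time <= receiver's checkpoint) without also being recorded
    # as sent (send_time <= sender's checkpoint).
    return all(not (rt <= cut[r] and st > cut[s])
               for s, st, r, rt in messages)


# p sends a message at time 5; q receives it at time 7.
msgs = [("p", 5, "q", 7)]
print(is_consistent({"p": 6, "q": 8}, msgs))  # True
print(is_consistent({"p": 4, "q": 8}, msgs))  # False
```

A recovery line is then the most recent cut for which this check holds; rolling back past inconsistent cuts one by one is what produces the domino effect of Figure 8-25.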
Independent Checkpointing
Figure 8-25. The domino effect.
Characterizing Message-Logging Schemes
Figure 8-26. Incorrect replay of messages after recovery, leading to an orphan process.