Reliable Communication in the Presence of Failures
Kenneth P. Birman and Thomas A. Joseph
Presented by Gloria Chang
Failures.. Failures.. Failures..
• Problem: Failures happen. How are we to recover data?
• Failures still happen
  - No fault prevention technique is 100% effective
• Recovery mechanisms are needed
  - Recovery by fault-tolerance techniques
Purpose
1. Halt failures: a process stops executing without performing any incorrect actions
2. Enable a process to deduce the event orderings that will be observed by other processes in the system
Why do this?
- Simplifies higher-level code
- Permits distributed computations to be implemented with reduced risk of inconsistent actions being taken
Goal
- To construct a broadcast protocol that orders messages relative to failure and recovery events such that inconsistencies are avoided
- To ensure that every process experiences the same sequence of events
- Thus:
  1. updates can be performed immediately
  2. recovery actions can be performed immediately after detecting a failure
Environment / System Characteristics
• Processes possess local state
• Communication through messages
• Communication network is structured hierarchically into clusters of local sites
• Failure = halting failure: “A process ceases execution w/o taking any (visible) incorrect or malicious actions”
• Types of failure:
  - Process failure
  - Communication failure
Physical vs Logical Failure Handling
Key fact:
• The perceived order of failures varies from process to process
Types of failure handling:
• Physical = a process acts directly after a failure is detected
  - Bad! Why? Inconsistent actions may occur
• Logical = uses the beauty of protocols!!!
What does a protocol do for failure handling?
• A protocol is run to reach agreement with other processes that a failure event has occurred and to order it with respect to other events
Definitions
• What is a Process Group?
  - A collection of processes that:
    1. cooperate to perform a distributed computation
    2. interact using communication protocols
• What is a Process Group View (a.k.a. view)?
  - A snapshot of the membership and global properties of a process group at some (logical) instant in time
• What is a Broadcast?
  - Transmission of a message from a process to the members of a process group (and possibly some additional processes)
Fault-Tolerant Process Groups
• Purpose: Allow the members of a group of processes to monitor one another
• Why bother monitoring? If there is a change in the status of a member, all processes need to agree on whether a request should be handled before or after the change in status, so they can consistently decide which process should respond to the request
• How? Provide a process group abstraction, through broadcast primitives, such that changes in the properties of the group are ordered with respect to ongoing broadcasts
Broadcast Primitives
Types:
• Group Broadcast Primitive (GBCAST)
• Atomic Broadcast Primitive (ABCAST)
• Causal Broadcast Primitive (CBCAST)
All broadcast primitives are atomic: either all destinations receive a message or none do
The set of destinations is assumed to be known at the time a broadcast is issued
Definitions
• What is Group Communication?
[Figure: processes P0, P1, P2, and P3 exchanging messages m0, m1, m2, and m3]
Properties of Group Communication
• Reliability: a message has to be received by all nodes
  - Reliable broadcast
• Consistent ordering: different messages sent by different nodes are delivered to all nodes in the same order
  - Atomic broadcast
• Causality preservation: the order in which messages are delivered at the nodes is consistent with causality between the send events of these messages
  - Causal broadcast
Group Broadcast Primitive
• Purpose:
  - Manages group addressing; informs operational group members when another member fails, recovers, joins, or withdraws voluntarily, or when some other change to a global property of the group occurs
• Goal:
  - Maintain a local copy of the view
  - Update and act on it when receiving a GBCAST message
• Notation:
  - GBCAST(action,G), where G denotes a view
  - Example: GBCAST(“p has failed”,G)
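As a minimal sketch of "maintain a local copy of the view and update it on each GBCAST", the snippet below models a view as an immutable record. The `View` class, the `apply_gbcast` function, and the action strings are hypothetical names chosen for illustration; the slides do not specify how actions are encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    view_id: int        # hypothetical counter, incremented on every change
    members: frozenset  # processes currently in the group


def apply_gbcast(view: View, action: str, process: str) -> View:
    """Return the next view after a GBCAST(action, G) about `process`."""
    if action in ("failed", "withdrew"):
        members = view.members - {process}
    elif action in ("joined", "recovered"):
        members = view.members | {process}
    else:
        raise ValueError(f"unknown action: {action}")
    return View(view.view_id + 1, members)


v0 = View(0, frozenset({"p", "q", "r"}))
v1 = apply_gbcast(v0, "failed", "p")   # GBCAST("p has failed", G)
v2 = apply_gbcast(v1, "joined", "s")
```

Because every member receives GBCASTs in the same order, each member that applies them this way would step through the same sequence of views.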
Group Broadcast Primitive
• Ensures all messages from a failed process are ordered before the GBCAST
For failure GBCASTs:
(1.1) The process p running the protocol acquires a read-lock on its copy of the site view. It then sends a message to all processes in the system, informing them of the start of the failure GBCAST for f.
(1.2) A process q receiving this message schedules for transmission any message B in BUFq sent by f that includes a member of G in REM_DESTS(B). It then waits until the status of these messages turns to sent.
(1.3) If q belongs to G, q waits until all ABCASTs from f have become deliverable. This will happen eventually because some process (perhaps q itself) will take over to complete the ABCAST protocol.
(1.4) The process q then sends an acknowledgment to p. When acknowledgments have been received from all operational processes, p releases its read-lock. The lock is implicitly released if p fails prior to doing so.
Group Broadcast Primitive
• Orders GBCASTs to the same group relative to one another
(2.1) The process p distributes the message action to the members of the process group G.
(2.2) A recipient q places copies of the message on all ABCAST priority queues, tagging them undeliverable. We assume that there is always a (possibly empty) queue for every possible ABCAST label. It assigns it a priority greater than that of any message that has been placed on any of the ABCAST queues, and sends this priority value back to p (all copies receive the same priority).
(2.3) After collecting the responses, p sends the maximum of all values it has received to the members of G, which change the priority accordingly and re-sort their queues. Unlike what happens in the ABCAST protocol, the messages are not tagged deliverable at this time. Thus, when a GBCAST message reaches the head of an ABCAST priority queue, further delivery of messages from the queue will be suspended.
(2.4) When the GBCAST message reaches the head of all ABCAST queues, the next part is begun.
Group Broadcast Primitive
• Orders GBCASTs relative to CBCASTs
(3.1) The process p initiating the protocol contacts all members of G.
(3.2) A participant q establishes a FIFO wait queue (unless one already exists). Until the GBCAST protocol completes, messages that would have been placed on the delivery queue at q by the CBCAST protocols are placed on this queue instead.
(3.3) If any message B in IDlistq is in PBUFq, and the remaining destinations of B include sites in G, q must assume that those sites have not yet received a copy of B. Any such message is scheduled for transmission to the destinations in REM_DESTS(B) ∩ G, and q waits until the messages have been sent. It then sends IDlistq to p.
(3.4) After collecting these messages, p merges all the lists it has received, calling this the before list. It sends the before list to all participants. When a participant q receives this list, any message that was transmitted during step 3.3 must have arrived and is on the wait queue unless it has already been delivered. Similarly, during step 1.2 all CBCAST messages from a failed process were either placed on the wait queue or delivered.
Group Broadcast Primitive
• Orders GBCASTs relative to CBCASTs
(4.1) Each participant q does the following: For each CBCAST B in its wait queue, if B is in the before list, or if there is some B’ in the before list and B → B’, or if the GBCAST is for a failure of process f and SENDER(B) = f, then B is added to the before list.
(4.2) Any messages in the wait queue that are also in the before list are now transferred to the delivery queue, preserving their relative order. The GBCAST message is then placed on the delivery queue.
(4.3) If there are no other GBCAST protocols in progress, p appends the contents of the wait queue to the delivery queue and deletes the wait queue.
(4.4) The GBCAST messages are removed from the heads of the ABCAST queues, allowing ABCAST messages to be delivered.
If a failure occurs, any participant can restart the protocol from the …
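Steps 4.1 to 4.3 can be sketched as a single function over plain lists. This is only an illustration under stated assumptions: messages are represented by string ids, `precedes(b, b2)` is a hypothetical callable standing in for the B → B’ test, and `sender_of` is a hypothetical map from message id to sender. The sketch also re-checks step 4.1 until nothing changes, which is one possible reading of "some B’ in the before list".

```python
def finish_gbcast(wait_queue, before_list, precedes, sender_of,
                  gbcast_msg, failed=None, others_in_progress=False):
    """Return (delivery_queue_additions, remaining_wait_queue)."""
    before = set(before_list)
    # Step 4.1: grow the before list (repeated until stable; an assumption).
    changed = True
    while changed:
        changed = False
        for b in wait_queue:
            if b in before:
                continue
            from_failed = failed is not None and sender_of.get(b) == failed
            if from_failed or any(precedes(b, b2) for b2 in before):
                before.add(b)
                changed = True
    # Step 4.2: move "before" messages to delivery in wait-queue order,
    # then place the GBCAST message itself.
    delivery = [b for b in wait_queue if b in before] + [gbcast_msg]
    remaining = [b for b in wait_queue if b not in before]
    # Step 4.3: with no other GBCAST running, flush the rest of the wait queue.
    if not others_in_progress:
        delivery += remaining
        remaining = []
    return delivery, remaining


edges = {("a", "b")}  # a happened before b
delivery, remaining = finish_gbcast(
    wait_queue=["a", "b", "c", "d"],
    before_list=["b"],
    precedes=lambda x, y: (x, y) in edges,
    sender_of={"a": "p", "b": "p", "c": "f", "d": "q"},
    gbcast_msg="GBCAST(f has failed)",
    failed="f",
)
```

Here `a` is pulled in because it precedes `b`, `c` because its sender is the failed process, and `d` is delivered only after the GBCAST message.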
Atomic Broadcast Primitive
• Purpose:
  - Delivers messages atomically and in the same order everywhere
  - Example: processes maintain copies of a replicated queue; items inserted into and removed from the queues must be the same at all locations
• Notation:
  - ABCAST(msg,label,dests)
    msg = message to be broadcast
    label = string of characters
    dests = set of processes to which the message must be delivered
Atomic Broadcast Primitive
• A three-phase algorithm:
  - A message is a pair (m,p), where m is the content and p is the priority
  - Phase 1: The sender transmits its message (m,p) to all the nodes.
  - Phase 2: Each receiver adds the message to its queue and tags it as “undeliverable”. It then assigns a new priority q, which is higher than the priority of any message in the queue, and informs the sender of the new priority.
  - Phase 3:
    • The sender collects all replies, computes the maximum of the new priorities it receives, and sends that value back to all receivers.
    • Each receiver changes the priority to the value received from the sender and tags the message as “deliverable”. It sorts the messages in the queue by priority and delivers the messages at the head of the queue that are marked “deliverable”, stopping at the first “undeliverable” one.
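The three phases can be sketched as a small single-threaded simulation of two receivers. The names `Entry`, `Receiver`, `propose`, and `finalize` are illustrative and not from the paper. Two assumptions go beyond the slide: each receiver keeps a `clock` holding the highest priority it has seen (so proposals stay above already-delivered messages), and ties between equal priorities are broken by message id.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    msg_id: str
    priority: int
    deliverable: bool = False


@dataclass
class Receiver:
    name: str
    queue: list = field(default_factory=list)
    delivered: list = field(default_factory=list)
    clock: int = 0  # assumption: highest priority seen so far

    def propose(self, msg_id: str, sender_priority: int) -> int:
        """Phase 2: enqueue as undeliverable and propose a higher priority."""
        self.clock = max(self.clock, sender_priority) + 1
        self.queue.append(Entry(msg_id, self.clock))
        return self.clock

    def finalize(self, msg_id: str, final_priority: int) -> None:
        """Phase 3: adopt the final priority, re-sort, deliver the head run."""
        self.clock = max(self.clock, final_priority)
        for entry in self.queue:
            if entry.msg_id == msg_id:
                entry.priority = final_priority
                entry.deliverable = True
        # Tie-break on msg_id (assumption) so every receiver sorts alike.
        self.queue.sort(key=lambda e: (e.priority, e.msg_id))
        while self.queue and self.queue[0].deliverable:
            self.delivered.append(self.queue.pop(0).msg_id)


p0, p1 = Receiver("p0"), Receiver("p1")
# Phases 1-2, interleaved: each process sees its own message first.
m0_props = [p0.propose("m0", 3)]
m1_props = [p1.propose("m1", 5)]
m0_props.append(p1.propose("m0", 3))
m1_props.append(p0.propose("m1", 5))
# Phase 3: each sender distributes the maximum proposal.
m0_final, m1_final = max(m0_props), max(m1_props)
for r in (p0, p1):
    r.finalize("m0", m0_final)  # m0 stays blocked behind undeliverable m1
for r in (p0, p1):
    r.finalize("m1", m1_final)
```

In this run `m0` becomes deliverable first but sits behind the still-undeliverable `m1`, so both receivers end up delivering the two messages in the same order.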
Atomic Broadcast Primitive
• Assume two processes P0 and P1:
  - P0 sends (m0, 3) to itself and to P1
  - P1 sends (m1, 5) to itself and to P0
• Draw a time-line and the queues in each process
[Figure: time-lines for p0 and p1 with their queues as (m0,3) and (m1,5) are broadcast; u = undeliverable, d = deliverable. The labels "0->6" and "1->7" mark the priority updates for m0 and m1. Queue states shown: [(m0,3,u)], [(m1,5,u)], [(m0,3,u) (m1,5,u)], [(m1,5,u) (m0,6,d)], [(m0,6,d)], [(m0,6,d) (m1,7,d)], [(m1,7,d)].]
Causal Broadcast Primitive
• Purpose:
  - The order in which messages are delivered at the nodes is consistent with causality between the send events of these messages
  - “Happened before” order
  - Messages from a given process are delivered in order
• Notation:
  - CBCAST(msg,label,dests)
Causal Broadcast Primitive
• clabels: used to indicate the order in which broadcasts should be delivered
• “Happens-Before”
  - clabel1 → clabel2 ⇔ clabel1 < clabel2 and both are comparable
  - CLABEL(B) = clabel of broadcast B
  - B → B’ ⇔ CLABEL(B) → CLABEL(B’)
Causal Broadcast Primitive
• A message B is transmitted from BUFp at site s to BUFq at site t as follows:
(1) A transfer packet <B1, B2, …> is first created; it includes all messages B’ in BUFp such that B’ → B and REM_DESTS(B’) is nonempty. The messages are sorted so that, if Bi → Bj, then i < j.
(2) The transfer packet is then transmitted from site s to site t.
(3) When the packet has been sent, for each Bi that it contained, q is deleted from REM_DESTS(Bi), if it was listed there.
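Steps (1) and (3) can be sketched as follows. The `Msg` record and both function names are hypothetical. The sketch assumes each message carries `preds`, the ids of all messages that happened before it (transitively closed), which lets a sort by `len(preds)` stand in for the causal sort. It also assumes B itself travels at the end of the packet, which the slide does not state explicitly.

```python
from dataclasses import dataclass, field


@dataclass
class Msg:
    msg_id: str
    preds: frozenset  # assumption: ids of ALL causal predecessors
    rem_dests: set = field(default_factory=set)


def build_packet(buf: dict, target_id: str) -> list:
    """Step 1: B plus every buffered predecessor that still has destinations."""
    target = buf[target_id]
    packet = [m for m in buf.values()
              if m.msg_id in target.preds and m.rem_dests]
    packet.append(target)  # assumption: B itself is part of the packet
    # If Bi -> Bj then preds(Bi) is a strict subset of preds(Bj),
    # so sorting by size yields an order consistent with causality.
    packet.sort(key=lambda m: len(m.preds))
    return packet


def mark_sent(packet: list, q: str) -> None:
    """Step 3: q no longer needs these messages from this buffer."""
    for m in packet:
        m.rem_dests.discard(q)


buf_p = {
    "c": Msg("c", frozenset({"a", "b"}), {"q"}),
    "b": Msg("b", frozenset({"a"}), set()),     # already sent everywhere
    "a": Msg("a", frozenset(), {"q", "r"}),
    "d": Msg("d", frozenset(), {"q"}),          # unrelated to c
}
packet = build_packet(buf_p, "c")
packet_ids = [m.msg_id for m in packet]
mark_sent(packet, "q")
```

In this example `b` is skipped because its remaining-destination set is empty, and `d` is skipped because it does not precede `c`.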
Causal Broadcast Primitive
• When a process q receives a packet <B1,B2,…>, the following is done for each i in increasing order of i:
(4) If ID(Bi) is already associated with a message in BUFq, then Bi is a duplicate and is discarded.
(5) If q ∈ REM_DESTS(Bi), Bi is placed on the delivery queue for q, q is removed from REM_DESTS(Bi), and a copy of Bi is placed in BUFq.
(6) Otherwise, Bi is a message in transit to another process, and it is simply placed in BUFq.
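The receive side, steps (4) to (6), can be sketched with plain dictionaries. The function name and the message layout (`id`, `rem_dests`) are assumptions made for illustration only.

```python
def receive_packet(packet, q, buf_q, delivery_queue):
    """Process a transfer packet at q; packet is already in causal order."""
    for msg in packet:
        # Step 4: a message whose id is already buffered is a duplicate.
        if msg["id"] in buf_q:
            continue
        if q in msg["rem_dests"]:
            # Step 5: q is a destination, so deliver and strike q off.
            delivery_queue.append(msg["id"])
            msg = {**msg, "rem_dests": msg["rem_dests"] - {q}}
        # Steps 5 and 6 both end by keeping a copy in BUFq.
        buf_q[msg["id"]] = msg


buf_q = {"a": {"id": "a", "rem_dests": set()}}
delivery_q = []
incoming = [
    {"id": "a", "rem_dests": {"q"}},   # duplicate, discarded
    {"id": "b", "rem_dests": {"q"}},   # addressed to q, delivered
    {"id": "c", "rem_dests": {"r"}},   # in transit to r, only buffered
]
receive_packet(incoming, "q", buf_q, delivery_q)
```

Keeping in-transit messages in BUFq is what lets q forward them later if the original sender fails before reaching the remaining destinations.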
Conclusion
• With ABCAST, CBCAST and GBCAST protocols, failure handling can be implemented in any local or wide area network.
• With these protocols, failure-handling mechanisms and event orderings are integrated without compromising efficiency.