Reliable Communication in the Presence of Failures
Based on the paper by Kenneth Birman and Thomas A. Joseph
Cesar Talledo
COEN 317, Fall 05


Agenda
- Introduction
- Challenges in Fault-Tolerant Distributed Systems
- Consistent Event Ordering in a Distributed System
- Key Aspects of Proposed Approach
- Logical vs. Physical Failures
- Proposed Broadcast Primitives
  - GBCAST
  - ABCAST
  - CBCAST
- Advantages of the Proposed Approach
- Sample Application: Updating Replicated Data
- Final Thoughts

Introduction
- White paper written in 1987, funded by DoD
- Purpose: present a set of communication primitives that facilitate distributed processing in the presence of failures
- System assumptions:
  - One computation, executed by multiple processes in a distributed system (DS)
  - Each process has a local state
  - Processes communicate with other processes via broadcasts
  - Processes may “halt” at any time
- The paper does not address Byzantine failures

Challenges in Fault-Tolerant DS
- Challenges in fault-tolerant distributed systems
  - How does the system handle exit and re-entry of processes?
    - A process may leave the computation (due to failure)
    - A process may re-enter the computation (recovery)
  - How do processes communicate with each other when:
    - Messages may be lost by the communication subsystem
    - Messages may be re-ordered while in transit
    - Some receiver processes may halt
    - The sender process may halt
  - How are failures handled when the system is asynchronous?
- Goal: continue the computation in the presence of failures

Consistent Event Ordering in DS
- Key aspects of distributed processing
  - Processes must have a consistent view of the ordering of events (i.e., messages) during the computation
  - A system must provide ordering, but also allow concurrency
  - Failures can affect the consistent view of event ordering
- Example:
  - Process ‘A’ sends a broadcast message, then fails
  - Process ‘B’ receives the message, then notices the failure
  - Process ‘C’ notices the failure, then receives the message
  - Process ‘D’ never receives the message, but notices the failure
  - Process ‘F’ receives the message, but never notices the failure
  - Process ‘G’ neither receives the message nor notices the failure
- Key: All processes must agree on the events that occurred and on the order of those events
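
The example above can be written down as per-process event histories and checked for agreement. This is only an illustrative sketch: the event names and the `consistent` helper are hypothetical and do not come from the paper.

```python
# Hypothetical encoding of the two events in the example.
MSG, FAIL = "msg_from_A", "A_failed"

# What each surviving process observes, in the order it observes it.
observed = {
    "B": [MSG, FAIL],   # receives the message, then notices the failure
    "C": [FAIL, MSG],   # notices the failure, then receives the message
    "D": [FAIL],        # never receives the message
    "F": [MSG],         # never notices the failure
    "G": [],            # sees neither event
}

def consistent(histories):
    """True only if every process saw the same events in the same order."""
    sequences = list(histories.values())
    return all(seq == sequences[0] for seq in sequences)

print(consistent(observed))                              # False
print(consistent({"B": [MSG, FAIL], "C": [MSG, FAIL]}))  # True
```

The first call fails the check because no two processes in the example share a history; the second shows the outcome the broadcast primitives aim for.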

... Consistent Event Ordering in DS
- Approaches to keep consistent event ordering in the presence of failures
  1) Run an agreement protocol after a failure is detected
     - Problems: slow, and requires synchronous communication
  2) Use this rule: a process should discard messages received from a process that is known to have failed
     - Problem: processes learn of failures at different times, so the system may still be inconsistent
- Proposed idea: “Construct a broadcast protocol that orders messages relative to failure and recovery events”

Key Aspects of Proposed Approach
- Failure and recovery are treated as system events, just like local processing and messages
- Thus, failure and recovery have an ordering with respect to messages and local processing
- The paper proposes communication primitives that maintain consistent ordering among processes
  - All processes experience the same sequence of events, including failures
- Advantages:
  - When a process notices a failure, it can assume that the rest of the system sees that failure at the same point in the event order
  - Therefore, the process can react to the failure immediately (no agreement protocol required)

Logical vs. Physical Failures
- Failures (i.e., lost messages, process halts) are physical events, occurring in real time
  - Processes cannot control when a failure occurs
- Recall that processes use logical clocks to track the order of events in a distributed computation
- In order to treat failures as ordered events, physical failures must be mapped to logical failures
- How?
  - Introduce the “Process-Group View”: a logical snapshot of the processes involved in the distributed computation
  - Changes in the properties of the group (i.e., failures, recovery) are ordered with respect to other events
  - These changes are communicated among processes by using the proposed broadcast primitives
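
One way to picture a process-group view is as an immutable snapshot with a version number, where each logical failure or recovery produces the next snapshot. The sketch below assumes this representation; the `GroupView` class, its fields, and the action names are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GroupView:
    view_id: int        # logical position of this view in the event order
    members: frozenset  # processes currently taking part in the computation

    def apply(self, action, process):
        """Return the next view after a logical 'fail' or 'recover' event."""
        if action == "fail":
            members = self.members - {process}
        elif action == "recover":
            members = self.members | {process}
        else:
            raise ValueError(f"unknown action: {action}")
        return GroupView(self.view_id + 1, members)

v0 = GroupView(0, frozenset({"p1", "p2", "p3"}))
v1 = v0.apply("fail", "p2")
v2 = v1.apply("recover", "p2")
print(v1.view_id, sorted(v1.members))  # 1 ['p1', 'p3']
print(v2.view_id, sorted(v2.members))  # 2 ['p1', 'p2', 'p3']
```

Because every change increments `view_id`, a physical failure that happens at an arbitrary real time is mapped to a definite position in the logical sequence of views.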

Proposed Broadcast Primitives
- 3 broadcast communication primitives
  - Group-Broadcast (GBCAST)
  - Atomic-Broadcast (ABCAST)
  - Causal-Broadcast (CBCAST)
- All 3 are atomic: either all destination processes receive the message or none of them do
- Emphasis on lightweight primitives: quick processing is desired to improve performance

GBCAST
- GBCAST: Group Broadcast
  - Used to keep a consistent “process group view”
- Call: GBCAST(action, G)
  - action: type of event that has occurred
  - G: process group view
- GBCAST satisfies the following ordering constraints
  - Delivered in the same order with respect to all other broadcasts at each destination
  - Delivered after any messages sent by the failed process

... GBCAST
- GBCAST is used to inform group member processes that the process group view has changed
  - Each process keeps a local copy of the “process group view”
  - Reception of a GBCAST updates the local copy
  - A process can assume that its local copy is consistent with the rest of the group
- Upon failure or recovery, a GBCAST is sent by either:
  - The supervisory process executing on the same machine where the process failure or recovery occurred (if the machine is alive)
  - Failure detection software executing on another machine
- The use of GBCAST avoids execution of an agreement protocol
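
The receiving side of a GBCAST can be sketched as follows. This is a minimal model of the two points above (the local view copy is updated on delivery, and a failure notice is delivered after any messages sent by the failed process), not the paper's protocol. The `Member` class, its buffer, and the action names are assumptions made for illustration.

```python
class Member:
    """One group member holding a local copy of the process group view."""

    def __init__(self, view):
        self.view = set(view)
        self.pending = []    # (sender, msg) pairs received but not yet delivered
        self.delivered = []  # the event sequence this member experiences

    def receive(self, sender, msg):
        self.pending.append((sender, msg))

    def deliver_gbcast(self, action, process):
        if action == "fail":
            # Deliver everything the failed process sent before its failure event.
            for sender, msg in [p for p in self.pending if p[0] == process]:
                self.delivered.append(("msg", sender, msg))
            self.pending = [p for p in self.pending if p[0] != process]
            self.view.discard(process)
        elif action == "recover":
            self.view.add(process)
        self.delivered.append((action, process))

m = Member({"A", "B", "C"})
m.receive("A", "update-1")
m.deliver_gbcast("fail", "A")
print(m.delivered)     # [('msg', 'A', 'update-1'), ('fail', 'A')]
print(sorted(m.view))  # ['B', 'C']
```

If every member runs the same rule, each one sees the message from ‘A’ before the failure of ‘A’, which is the consistent ordering the slide describes.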

ABCAST
- ABCAST: Atomic Broadcast
- Provides sequential consistency on replicated data
  - Applications use ABCAST to enforce order in the way data is updated in the distributed system (i.e., shared data structure)
- Call: ABCAST(msg, label, dests)
  - msg: message to be broadcast
  - label: identifies ABCASTs that are related to each other
  - dests: set of processes to which the broadcast is sent
- ABCASTs with the same label that have destinations in common are delivered in the same (unspecified) order to all such destinations
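
The delivery guarantee can be illustrated with a deliberately simplified model in which a central sequencer numbers the broadcasts for each label and every destination delivers in number order. This sequencer is a swapped-in technique chosen for brevity and is not presented here as the paper's protocol. The `Sequencer` and `Destination` classes are hypothetical, and the sketch assumes all broadcasts with a given label go to the same destinations.

```python
from collections import defaultdict
from itertools import count

class Sequencer:
    """Hypothetical central sequencer: one counter per label."""

    def __init__(self):
        self._counters = defaultdict(count)

    def stamp(self, label):
        return next(self._counters[label])

class Destination:
    def __init__(self):
        self.held = defaultdict(dict)     # label -> {seq: msg} waiting for delivery
        self.next_seq = defaultdict(int)  # label -> next sequence number to deliver
        self.delivered = []

    def receive(self, label, seq, msg):
        self.held[label][seq] = msg
        # Deliver in sequence order, holding back anything that arrived early.
        while self.next_seq[label] in self.held[label]:
            self.delivered.append(self.held[label].pop(self.next_seq[label]))
            self.next_seq[label] += 1

seq = Sequencer()
stamped = [("x", seq.stamp("x"), m) for m in ("a", "b", "c")]

d1, d2 = Destination(), Destination()
for label, n, msg in stamped:            # arrives in sending order
    d1.receive(label, n, msg)
for label, n, msg in reversed(stamped):  # arrives in reverse order
    d2.receive(label, n, msg)

print(d1.delivered == d2.delivered == ["a", "b", "c"])  # True
```

Both destinations deliver the same sequence even though the network handed them the messages in opposite orders.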

CBCAST
- CBCAST: Causal Broadcast
- Provides causal consistency on replicated data
  - Applications use CBCAST to enforce causal order in the way data is updated in the distributed system
- Call: CBCAST(msg, clabel, dests)
  - msg: message to be broadcast
  - clabel: identifies related CBCASTs and the type of ordering
  - dests: set of processes to which the broadcast is sent
- CBCASTs with the same ‘clabel’ that have destinations in common are delivered in a predetermined order to all such destinations

... CBCAST
- Broadcast ‘A’ causally precedes broadcast ‘B’ if
  - A and B are sent by the same process, and A is sent before B
  - A and B are sent by different processes, and A was received by the process that sent B before B was sent
- Causal ordering is determined by the value of ‘clabels’
  - If broadcast A causally precedes broadcast B, then clabel(A) < clabel(B)
- Use of ‘clabels’ gives applications the power to decide which events are causally related
  - Not all CBCASTs are causally related; ordering them would limit system concurrency
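
Causal delivery can be sketched with vector timestamps standing in for the ‘clabel’ comparison. Vector timestamps are a well-known technique used here for illustration; this is an assumption of the sketch, not a description of how the paper encodes clabels. The `Process` class and its methods are hypothetical.

```python
class Process:
    def __init__(self, pid, n):
        self.pid = pid
        self.clock = [0] * n  # broadcasts delivered so far, per sender
        self.held = []        # packets waiting for their causal predecessors
        self.delivered = []

    def send(self, msg):
        self.clock[self.pid] += 1
        return (self.pid, list(self.clock), msg)

    def _ready(self, sender, stamp):
        # Next broadcast from this sender, and nothing it depends on is missing.
        return stamp[sender] == self.clock[sender] + 1 and all(
            stamp[k] <= self.clock[k] for k in range(len(stamp)) if k != sender
        )

    def receive(self, packet):
        self.held.append(packet)
        progress = True
        while progress:
            progress = False
            for p in list(self.held):
                sender, stamp, msg = p
                if self._ready(sender, stamp):
                    self.held.remove(p)
                    self.clock[sender] = stamp[sender]
                    self.delivered.append(msg)
                    progress = True

p0, p1, p2 = Process(0, 3), Process(1, 3), Process(2, 3)
a = p0.send("A")
p1.receive(a)      # p1 receives A before sending B ...
b = p1.send("B")   # ... so A causally precedes B
p2.receive(b)      # B arrives first and is held back
p2.receive(a)
print(p2.delivered)  # ['A', 'B']
```

Broadcasts with no causal relation pass the `_ready` check independently of each other, so they are not forced into a single order, which preserves concurrency.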

Advantages of the Proposed Approach
- Simplify applications
  - Eliminate the need for application-level ‘ordering protocols’ that prevent inconsistencies due to potential failures
  - These protocols would be needed if communication were done via simple atomic broadcasts
- Improve system performance
  - Application ordering protocols restrict concurrency by imposing synchronization rules
- Note: the assumption is that GBCAST, ABCAST, and CBCAST are implemented at a level below the application (i.e., in the kernel)

Sample Application: Updating Replicated Data
- All copies of the replicated data must be updated in the same order
- Without the proposed broadcast primitives, a process would need to do explicit synchronization
  - Send a basic atomic broadcast to the remote copies
  - Wait for the remote copies to reply with confirmation of the update
  - Update the local copy, and perform the next update
  - Note: similar to 2-Phase-Commit
- Using CBCAST, a process can assume that all copies have been updated once CBCAST returns and the local copy is updated
  - CBCAST guarantees that all copies receive the update in the required order with respect to previous CBCASTs that update the same data
  - CBCASTs are ordered with respect to failures (notified via GBCASTs)
  - Note: use of CBCASTs improves performance
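
A small worked example shows why every copy must apply updates in the same order. The operations and the `apply_all` helper are hypothetical and chosen only because addition and multiplication do not commute.

```python
def apply_all(updates, initial=0):
    """Apply a sequence of ('add' | 'mul', argument) updates to one copy."""
    value = initial
    for op, arg in updates:
        value = value + arg if op == "add" else value * arg
    return value

updates = [("add", 5), ("mul", 2)]

print(apply_all(updates))            # 10  (0 + 5, then * 2)
print(apply_all(reversed(updates)))  # 5   (0 * 2, then + 5)
```

Two copies that receive the same updates in different orders end up with different values, which is the inconsistency that ordered delivery is meant to rule out.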

Final Thoughts
- The proposed broadcast primitives provide
  - Implicit ordering of messages
    - Applications need not do explicit synchronization to prevent ordering problems when failures are possible
  - Message ordering with respect to faults and recoveries
    - Faults and recoveries are treated as logical events, subject to ordering with respect to messages
    - This provides consistency among the processes in the distributed system (all processes experience the same set of events)
  - Improved performance
    - Elimination of explicit application ordering protocols allows higher concurrency in the computation
