DISTRIBUTED SYSTEMS II FAULT-TOLERANT BROADCAST (CONT.) Prof Philippas Tsigas, Distributed Computing and Systems Research Group

3rd Lecture


  • DISTRIBUTED SYSTEMS II FAULT-TOLERANT BROADCAST (CONT.)

    Prof Philippas Tsigas, Distributed Computing and Systems Research Group

  • Total, FIFO and causal ordering of multicast messages. Notice the consistent ordering of the totally ordered messages T1 and T2: they are opposite to real time, and the order can be arbitrary; it need not be FIFO or causal. Note also the causally related messages C1 and C3.

  • Atomic Broadcast: requires that all correct processes deliver all messages in the same order.

    Implies that all correct processes see the same view of the world.

  • Atomic Broadcast. Theorem: Atomic broadcast is impossible in asynchronous systems.

    Proof: equivalent to the consensus problem.

  • Review of Consensus

  • What is Consensus? N processes. Each process p has an input variable xp (value v): initially either 0 or 1, and an output variable yp (decision d): initially b (b = undecided). A process is non-faulty in a run provided that it takes infinitely many steps; it is faulty otherwise. Consensus problem: design a protocol so that either (1) all non-faulty processes set their output variables to 0, or (2) all non-faulty processes set their output variables to 1. There is at least one initial state that leads to each of the outcomes 1 and 2 above.

  • Consensus (II): all correct processes propose a value, and must agree on a value related to the proposed values. Definition: the Consensus problem is specified as follows. Termination: every correct process eventually decides some value. Validity: if all processes that propose a value propose v, then all correct processes eventually decide v. Agreement: if a correct process decides v, then all correct processes eventually decide v. Integrity: every process decides at most once, and if it decides v (not NU), then some process must have proposed it. (NU is a special value which stands for "no unanimity".)

  • FLP. Theorem: Consensus is impossible in any asynchronous system if one process can halt. [Fischer, Lynch, Paterson 1985]

    Impossibility of distributed consensus with one faulty process (the original paper): http://dl.acm.org/citation.cfm?id=214121

    A Brief Tour of FLP Impossibility: http://the-paper-trail.org/blog/a-brief-tour-of-flp-impossibility/

    Possible homework assignment area.

  • Atomic Broadcast. Theorem 1: Any atomic broadcast algorithm solves consensus.

    Every process atomically broadcasts its proposed value and decides on the first value delivered (a sketch of this reduction follows below).

    Theorem 2: Atomic broadcast is impossible in any asynchronous system if one process can halt.

    Proof: by contradiction, using FLP and Theorem 1.
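
    The reduction behind Theorem 1 is mechanical and can be sketched in a few lines. This is only an illustrative sketch, not taken from the slides: the `abcast` service and its method names are hypothetical placeholders for whatever atomic-broadcast layer is assumed.

      # Sketch: consensus on top of atomic broadcast (Theorem 1).
      # `abcast` is a placeholder for a layer that delivers every message to
      # every correct process in the same total order.
      class ConsensusViaAtomicBroadcast:
          def __init__(self, abcast):
              self.abcast = abcast          # hypothetical atomic-broadcast service
              self.decision = None          # None = still undecided

          def propose(self, value):
              self.abcast.broadcast(value)  # every process broadcasts its proposal

          def on_deliver(self, value):
              # Decide on the FIRST value delivered; the shared total order
              # guarantees all correct processes decide the same proposed value.
              if self.decision is None:
                  self.decision = value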

  • Total ordering using a sequencer (figure)

  • Atomic Broadcast. Consensus is solvable in: synchronous systems (we will discuss an algorithm that works in f+1 rounds) [we will come back to that!]; certain semi-synchronous systems.

    Consensus is also solvable in: asynchronous systems with randomization; asynchronous systems with failure detectors [we will come back to that!].

  • SLIDES FROM THE BOOK TO HAVE A LOOK AT. Please check also the slides from your book; I appended them here.

    Copyright George Coulouris, Jean Dollimore, Tim Kindberg 2001, email: [email protected]. This material is made available for private study and for direct use by individual teachers. It may not be included in any product or employed in any service without the written permission of the authors. Viewing: these slides must be viewed in slide show mode. Teaching material based on Distributed Systems: Concepts and Design, Edition 3, Addison-Wesley 2001.

    Distributed Systems Course: Coordination and Agreement, 11.4 Multicast communication. This chapter covers other types of coordination and agreement, such as mutual exclusion, elections and consensus; we will study only multicast. But we will study the two-phase commit protocol for transactions in Chapter 12, which is an example of consensus. We also omit the discussion of failure detectors, which is relevant to replication.

  • Revision of IP multicast (section 4.5.1, page 154). IP multicast is an implementation of group communication built on top of IP (note: IP packets are addressed to computers). It allows the sender to transmit a single IP packet to a set of computers that form a multicast group (a class D internet address, with first 4 bits 1110). Membership of groups is dynamic; a sender can send to a group with or without joining it. To multicast, send a UDP datagram with a multicast address. To join, make a socket join a group (s.joinGroup(group), Fig 4.17), enabling it to receive messages sent to the group. Multicast routers: local messages use the local multicast capability; routers make delivery efficient by choosing other routers on the way. Failure model: omission failures, i.e. some but not all members may receive a message (e.g. a recipient may drop a message, or a multicast router may fail); IP packets may not arrive in sender order, and group members can receive messages in different orders.

    How can you restrict a multicast to the local area network? Give two reasons for restricting the scope of a multicast message.

  • Introduction to multicast. Multicast communication requires coordination and agreement: the aim is for members of a group to receive copies of messages sent to the group. Many different delivery guarantees are possible, e.g. agreement on the set of messages received or on delivery ordering. A process can multicast by the use of a single operation instead of a send to each member; for example, in IP multicast, aSocket.send(aMessage). The single operation allows for: efficiency, i.e. send once on each link, using hardware multicast when available (e.g. multicast from a computer in London to two in Beijing); delivery guarantees, e.g. one cannot make a guarantee if multicast is implemented as multiple sends and the sender fails; it also allows ordering. Many projects: Amoeba, Isis, Transis, Horus (refs p. 436). What is meant by the term broadcast?

  • System model. The system consists of a collection of processes which can communicate reliably over one-to-one channels. Processes fail only by crashing (no arbitrary failures). Processes are members of groups, which are the destinations of multicast messages; in general a process p can belong to more than one group. Operations: multicast(g, m) sends message m to all members of process group g; deliver(m) is called to get a multicast message delivered. Deliver is different from receive, as delivery may be delayed to allow for ordering or reliability. A multicast message m carries the id of the sending process, sender(m), and the id of the destination group, group(m). We assume there is no falsification of the origin and destination of messages.

  • Open and closed groups. Closed groups: only members can send to the group, and a member delivers to itself; they are useful for coordination of groups of cooperating servers. Open groups: they are useful for notification of events to groups of interested processes. Does IP multicast support open and closed groups?

  • Reliability of one-to-one communication (Ch. 2, page 57). The term reliable one-to-one communication is defined in terms of validity and integrity as follows. Validity: any message in the outgoing message buffer is eventually delivered to the incoming message buffer. Integrity: the message received is identical to one sent, and no messages are delivered twice. How do we achieve validity? By use of acknowledgements and retries. How do we achieve integrity? By use of checksums and rejection of duplicates (e.g. due to retries); if allowing for malicious users, use security techniques.

  • 11.4.1 Basic multicast. A correct process will eventually deliver the message provided the multicaster does not crash; note that IP multicast does not give this guarantee. The primitives are called B-multicast and B-deliver. A straightforward but inefficient method of implementation: use a reliable one-to-one send (i.e. with integrity and validity as above). To B-multicast(g, m): for each process p in g, send(p, m). On receive(m) at p: B-deliver(m) at p. Problem: if the number of processes is large, the protocol will suffer from ack-implosion (see the sketch below).

    What are ack-implosions? A practical implementation of Basic Multicast may be achieved over IP multicast (next slide).
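
    A minimal sketch of this straightforward implementation, assuming a reliable one-to-one `send(p, m)` is available; the function names are illustrative, not from the book.

      # B-multicast over reliable 1-1 channels: send once to every member.
      def b_multicast(group, message, send):
          for process in group:          # the group includes the sender itself
              send(process, message)

      # On receive(m) at p: B-deliver(m) at p, with no further checks.
      def on_receive(message, b_deliver):
          b_deliver(message)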

  • Implementation of basic multicast over IP. Each process p maintains a sequence number Spg for each group it belongs to, and Rqg, the sequence number of the latest message it has delivered from process q. For process p to B-multicast a message m to group g, it piggybacks Spg on the message and uses IP multicast to send it; the piggybacked sequence numbers allow recipients to learn about messages they have not received. On receive(g, m, S) at p: if S = Rqg + 1, B-deliver(m) and increment Rqg by 1; if S < Rqg + 1, reject the message because it has been delivered before; if S > Rqg + 1, note that a message is missing and request the missing message from the sender (this uses a hold-back queue, discussed later on). If the sender crashes, a message may be delivered to some members of the group but not others.
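
    A sketch of the sequence-numbered variant described on this slide; `ip_multicast` is a placeholder for the unreliable IP-multicast send, and the request for missing messages is only hinted at (hold-back queues come later). All names are illustrative.

      from collections import defaultdict

      class BasicMulticastOverIP:
          # Sketch of B-multicast on top of unreliable IP multicast.
          def __init__(self, pid, ip_multicast, b_deliver):
              self.pid = pid
              self.ip_multicast = ip_multicast
              self.b_deliver = b_deliver
              self.S = defaultdict(int)   # S[g]: last sequence number sent to group g
              self.R = defaultdict(int)   # R[(q, g)]: last seq number delivered from q

          def multicast(self, g, m):
              self.S[g] += 1
              self.ip_multicast(g, (self.pid, g, self.S[g], m))   # piggyback S_p^g

          def on_receive(self, packet):
              q, g, s, m = packet
              if s == self.R[(q, g)] + 1:    # next expected message: deliver it
                  self.b_deliver(m)
                  self.R[(q, g)] += 1
              elif s <= self.R[(q, g)]:      # delivered before: reject the duplicate
                  pass
              else:                          # s > R + 1: a message is missing; a full
                  pass                       # protocol would hold back and request it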

  • 11.4.2 Reliable multicast. The protocol is correct even if the multicaster crashes; it satisfies criteria for validity, integrity and agreement; it provides operations R-multicast and R-deliver. Integrity: a correct process p delivers m at most once; also p is in group(m) and m was supplied to a multicast operation by sender(m). Validity: if a correct process multicasts m, it will eventually deliver m. Agreement: if a correct process delivers m, then all correct processes in group(m) will eventually deliver m. Integrity is as for one-to-one communication; validity is simplified by choosing the sender as the one process; agreement gives all-or-nothing (atomicity), even if the multicaster crashes.

  • Reliable multicast algorithm over basic multicast. Processes can belong to several closed groups. To R-multicast a message, a process B-multicasts it to the processes in the group, including itself. Validity: a correct process will B-deliver to itself. Integrity: because the reliable one-to-one channels used for B-multicast guarantee integrity. Agreement: every correct process B-multicasts the message to the others; if p does not R-deliver, then this is because it didn't B-deliver, because no others did either. What can you say about the performance of this algorithm? Is this algorithm correct in an asynchronous system? Reliable multicast can be implemented efficiently over IP multicast by holding back messages until every member can receive them; we skip that.
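
    A sketch of this relaying algorithm, assuming a `b_multicast` layer and unique message identifiers; class and parameter names are illustrative, not from the book.

      class ReliableMulticast:
          # R-multicast over B-multicast: on first B-delivery of a message,
          # re-multicast it to the group before R-delivering, so that if the
          # original sender crashed mid-multicast the others still receive it.
          def __init__(self, pid, b_multicast, r_deliver):
              self.pid = pid
              self.b_multicast = b_multicast
              self.r_deliver = r_deliver
              self.received = set()             # message ids already seen

          def multicast(self, g, msg_id, m):
              self.b_multicast(g, (msg_id, g, m))

          def on_b_deliver(self, packet, sender):
              msg_id, g, m = packet
              if msg_id in self.received:       # duplicate: ignore
                  return
              self.received.add(msg_id)
              if sender != self.pid:            # relay before delivering
                  self.b_multicast(g, packet)
              self.r_deliver(m)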

  • Reliable multicast over IP multicast (page 440). This protocol assumes groups are closed. It uses piggybacked acknowledgement messages and negative acknowledgements when messages are missed. Process p maintains Spg, a message sequence number for each group it belongs to, and Rqg, the sequence number of the latest message received from process q to g. For process p to R-multicast message m to group g: piggyback Spg and positive acks for messages received, IP multicast the message to g, and increment Spg by 1. On receipt of a message to g with sequence number S from p: if S = Rpg + 1, R-deliver the message and increment Rpg by 1; if S <= Rpg, discard the message; if S > Rpg + 1, or if a piggybacked ack shows that a message has been missed, place the message in the hold-back queue and request retransmission (negative acknowledgement).
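
    A sketch of the receive rule with a hold-back queue and a negative acknowledgement on a detected gap; `ip_multicast` and `send_nack` are hypothetical transport hooks, and the piggybacked acks and retransmission path are omitted for brevity.

      from collections import defaultdict

      class ReliableMulticastOverIP:
          # Sketch of the NACK-based protocol: deliver in sequence, discard
          # duplicates, hold back and complain when a gap is detected.
          def __init__(self, pid, ip_multicast, send_nack, r_deliver):
              self.pid = pid
              self.ip_multicast = ip_multicast
              self.send_nack = send_nack
              self.r_deliver = r_deliver
              self.S = defaultdict(int)            # S[g]: last seq number sent to g
              self.R = defaultdict(int)            # R[(q, g)]: last seq delivered from q
              self.hold_back = defaultdict(dict)   # (q, g) -> {seq: message}

          def multicast(self, g, m):
              self.S[g] += 1
              self.ip_multicast(g, (self.pid, g, self.S[g], m))

          def on_receive(self, packet):
              q, g, s, m = packet
              if s == self.R[(q, g)] + 1:          # next expected: R-deliver it
                  self.r_deliver(m)
                  self.R[(q, g)] += 1
                  self._flush(q, g)
              elif s > self.R[(q, g)] + 1:         # gap: hold back, ask for missing
                  self.hold_back[(q, g)][s] = m
                  self.send_nack(q, g, self.R[(q, g)] + 1)
              # s <= R[(q, g)]: already delivered, discard silently

          def _flush(self, q, g):
              queue = self.hold_back[(q, g)]
              while self.R[(q, g)] + 1 in queue:
                  self.r_deliver(queue.pop(self.R[(q, g)] + 1))
                  self.R[(q, g)] += 1
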
  • The hold-back queue for arriving multicast messages. In the implementation over IP multicast the hold-back queue is not necessary for reliability, but it simplifies the protocol by allowing sequence numbers to represent sets of messages. Hold-back queues are also used for ordering protocols.

  • Reliability properties of reliable multicast over IP. Integrity: duplicate messages are detected and rejected; IP multicast uses checksums to reject corrupt messages. Validity: due to IP multicast, in which the sender delivers to itself. Agreement: processes can detect missing messages; they must keep copies of messages they have delivered so that they can retransmit them to others. Discarding copies of messages that are no longer needed: when piggybacked acknowledgements arrive, note which processes have received the messages; when all processes in g have a message, discard it. The problem of a process that stops sending is handled with heartbeat messages. This protocol has been implemented in a practical way in Psync and Trans (refs. on p. 442).

  • 11.4.3 Ordered multicast. The basic multicast algorithm delivers messages to processes in an arbitrary order. A variety of orderings may be implemented. FIFO ordering: if a correct process issues multicast(g, m) and then multicast(g, m'), then every correct process that delivers m' will deliver m before m'. Causal ordering: if multicast(g, m) -> multicast(g, m'), where -> is the happened-before relation between messages in group g, then any correct process that delivers m' will deliver m before m'. Total ordering: if a correct process delivers message m before it delivers m', then any other correct process that delivers m' will deliver m before m'. Ordering is expensive in delivery latency and bandwidth consumption.

  • Total, FIFO and causal ordering of multicast messages. These definitions do not imply reliability, but we can define atomic multicast: reliable and totally ordered. Notice the consistent ordering of the totally ordered messages T1 and T2: they are opposite to real time, and the order can be arbitrary; it need not be FIFO or causal. Note the FIFO-related messages F1 and F2, and the causally related messages C1 and C3. Ordered multicast delivery is expensive in bandwidth and latency, therefore the less expensive orderings (e.g. FIFO or causal) are chosen for applications for which they are suitable.

  • Display from a bulletin board program. Users run bulletin board applications which multicast messages. One multicast group per topic (e.g. os.interesting). Requires reliable multicast, so that all members receive messages. Ordering: see the figure.

  • Implementation of FIFO ordering over basic multicast. We discuss FIFO ordered multicast, with operations FO-multicast and FO-deliver, for non-overlapping groups. It can be implemented on top of any basic multicast. Each process p holds Spg, a count of messages sent by p to g, and Rqg, the sequence number of the latest message to g that p delivered from q. For p to FO-multicast a message to g, it piggybacks Spg on the message, B-multicasts it and increments Spg by 1. On receipt of a message from q with sequence number S, p checks whether S = Rqg + 1; if so, it FO-delivers it. If S > Rqg + 1, then p places the message in the hold-back queue until the intervening messages have been delivered. (Note that B-multicast does eventually deliver messages unless the sender crashes.)
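
    A sketch of FO-multicast over any basic multicast layer, for non-overlapping groups; the class and parameter names are illustrative, not from the book.

      from collections import defaultdict

      class FifoMulticast:
          # FIFO ordering: deliver each sender's messages in the order sent.
          def __init__(self, pid, b_multicast, fo_deliver):
              self.pid = pid
              self.b_multicast = b_multicast
              self.fo_deliver = fo_deliver
              self.S = defaultdict(int)            # S[g]: messages this process sent to g
              self.R = defaultdict(int)            # R[(q, g)]: last seq FO-delivered from q
              self.hold_back = defaultdict(dict)   # (q, g) -> {seq: message}

          def multicast(self, g, m):
              self.S[g] += 1
              self.b_multicast(g, (self.pid, g, self.S[g], m))   # piggyback S_p^g

          def on_b_deliver(self, packet):
              q, g, s, m = packet
              self.hold_back[(q, g)][s] = m
              # deliver in the sender's order; later messages wait in the queue
              while self.R[(q, g)] + 1 in self.hold_back[(q, g)]:
                  self.R[(q, g)] += 1
                  self.fo_deliver(self.hold_back[(q, g)].pop(self.R[(q, g)]))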

  • Implementation of totally ordered multicast. The general approach is to attach totally ordered identifiers to multicast messages; each receiving process makes ordering decisions based on the identifiers. This is similar to the FIFO algorithm, but processes keep group-specific sequence numbers. Operations: TO-multicast and TO-deliver. We present two approaches to implementing totally ordered multicast over basic multicast: using a sequencer (only for non-overlapping groups), or having the processes in a group collectively agree on a sequence number for each message.

  • Total ordering using a sequencer (figure)

  • Discussion of the sequencer protocol. Since sequence numbers are defined by the sequencer, we have total ordering. As with B-multicast, if the sender does not crash, all members receive the message (a sketch of a sequencer-based scheme follows below).

    Kaashoek's protocol uses hardware-based multicast. The sender transmits one message to the sequencer, then the sequencer multicasts the sequence number and the message; but IP multicast is not as reliable as B-multicast, so the sequencer stores messages in its history buffer for retransmission on request, and members notice that messages are missing by inspecting sequence numbers. What are the potential problems with using a single sequencer? What can the sequencer do about its history buffer becoming full? Members piggyback on their messages the latest sequence number they have seen. What happens when some member stops multicasting? Members that do not multicast send heartbeat messages (with a sequence number).
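
    A sketch of the sequencer scheme (not Kaashoek's optimised protocol): members B-multicast their messages, the sequencer B-multicasts order announcements, and members deliver held-back messages in the announced order. The class and message-format choices are illustrative assumptions.

      import itertools

      class Sequencer:
          # The sequencer assigns the global delivery order.
          def __init__(self, b_multicast):
              self.b_multicast = b_multicast
              self.next_seq = itertools.count(1)

          def on_b_deliver(self, g, msg_id, m):
              # Announce the order; members deliver messages in this order.
              self.b_multicast(g, ("order", msg_id, next(self.next_seq)))

      class SequencedMember:
          # A group member: sends via B-multicast, delivers in sequencer order.
          def __init__(self, b_multicast, to_deliver):
              self.b_multicast = b_multicast
              self.to_deliver = to_deliver
              self.next_expected = 1
              self.orders = {}       # seq -> msg_id announced by the sequencer
              self.messages = {}     # msg_id -> message held back until ordered

          def multicast(self, g, msg_id, m):
              self.b_multicast(g, ("msg", msg_id, m))

          def on_b_deliver(self, packet):
              kind, a, b = packet
              if kind == "msg":
                  self.messages[a] = b           # ("msg", msg_id, m)
              else:
                  self.orders[b] = a             # ("order", msg_id, seq)
              while (self.next_expected in self.orders and
                     self.orders[self.next_expected] in self.messages):
                  msg_id = self.orders.pop(self.next_expected)
                  self.to_deliver(self.messages.pop(msg_id))
                  self.next_expected += 1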

  • The ISIS algorithm for total ordering. This protocol is for open or closed groups. 1. The process P1 B-multicasts a message to the members of the group. 2. The receiving processes propose sequence numbers and return them to the sender. 3. The sender uses the proposed numbers to generate an agreed number.

  • ISIS total ordering: agreement of sequence numbers. Each process q keeps Aqg, the largest agreed sequence number it has seen, and Pqg, its own largest proposed sequence number. 1. Process p B-multicasts the message m with a unique identifier i to g. 2. Each process q replies to the sender p with a proposal for the message's agreed sequence number, Pqg := Max(Aqg, Pqg) + 1; q assigns the proposed sequence number to the message and places it in its hold-back queue. 3. p collects all the proposed sequence numbers and selects the largest as the next agreed sequence number, a; it B-multicasts the pair (i, a) to g. Recipients set Aqg := Max(Aqg, a), attach a to the message and re-order the hold-back queue.
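
    A sketch of the three steps above; tie-breaking between equal proposals (normally done with process identifiers) and failure handling are omitted, and all names are illustrative assumptions.

      class IsisTotalOrder:
          def __init__(self, pid, b_multicast, send, to_deliver):
              self.pid = pid
              self.b_multicast = b_multicast   # placeholder basic multicast
              self.send = send                 # placeholder one-to-one reply channel
              self.to_deliver = to_deliver
              self.A = 0                       # largest agreed sequence number seen
              self.P = 0                       # own largest proposed sequence number
              self.hold_back = {}              # msg_id -> [number, agreed?, message]
              self.proposals = {}              # msg_id -> (group size, proposals) at sender

          # step 1: the sender announces the message
          def multicast(self, g, msg_id, m, group_size):
              self.proposals[msg_id] = (group_size, [])
              self.b_multicast(g, ("msg", msg_id, m, self.pid))

          # step 2: each receiver proposes a sequence number and holds the message back
          def on_message(self, msg_id, m, sender):
              self.P = max(self.A, self.P) + 1
              self.hold_back[msg_id] = [self.P, False, m]     # provisional position
              self.send(sender, ("proposal", msg_id, self.P))

          # the sender collects proposals and announces the largest as the agreed number
          def on_proposal(self, g, msg_id, proposed):
              size, got = self.proposals[msg_id]
              got.append(proposed)
              if len(got) == size:
                  self.b_multicast(g, ("agreed", msg_id, max(got)))

          # step 3: receivers adopt the agreed number and deliver from the front in order
          def on_agreed(self, msg_id, agreed):
              self.A = max(self.A, agreed)
              self.hold_back[msg_id][0:2] = [agreed, True]
              self._deliver_ready()

          def _deliver_ready(self):
              while self.hold_back:
                  msg_id, entry = min(self.hold_back.items(), key=lambda kv: kv[1][0])
                  if not entry[1]:
                      return          # smallest number not yet agreed: wait
                  self.to_deliver(self.hold_back.pop(msg_id)[2])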

  • Discussion of ordering in the ISIS protocol. Hold-back queue: ordered with the message with the smallest sequence number at the front of the queue; when the agreed number is added to a message, the queue is re-ordered; when the message at the front has an agreed id, it is transferred to the delivery queue; even if agreed, messages not at the front of the queue are not transferred. Every process agrees on the same order and delivers messages in that order, therefore we have total ordering. Latency: three messages are sent in sequence, therefore it has a higher latency than the sequencer method. This ordering may not be causal or FIFO. Proof of total ordering on page 448.

  • Causally ordered multicast. We present an algorithm of Birman (1991) for causally ordered multicast in non-overlapping, closed groups. It uses the happened-before relation (on multicast messages only); that is, ordering imposed by one-to-one messages is not taken into account. It uses vector timestamps, which count the number of multicast messages from each process that happened before the next message to be multicast.

  • Causal ordering using vector timestamps (figure). Note: a process can immediately CO-deliver to itself its own messages (not shown).

  • Comments. After delivering a message from pj, process pi updates its vector timestamp by adding 1 to the jth element of its timestamp. Compare the vector clock rule, where Vi[j] := max(Vi[j], t[j]) for j = 1, 2, ..., N; in this algorithm we know that only the jth element will increase. For an outline of the proof see page 449. If we use R-multicast instead of B-multicast, then the protocol is reliable as well as causally ordered. If we combine it with the sequencer algorithm, we get total and causal ordering.
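
    A sketch of Birman's algorithm as described above, for a closed, non-overlapping group of N processes numbered 0..N-1; `b_multicast` is the underlying basic multicast, and the names are illustrative assumptions.

      class CausalMulticast:
          def __init__(self, i, n, b_multicast, co_deliver):
              self.i = i                     # this process's index in the group
              self.n = n
              self.b_multicast = b_multicast
              self.co_deliver = co_deliver
              self.V = [0] * n               # V[j]: multicasts delivered from p_j
              self.hold_back = []            # list of (sender j, timestamp, message)

          def multicast(self, g, m):
              self.V[self.i] += 1
              self.b_multicast(g, (self.i, list(self.V), m))
              # a process CO-delivers its own messages immediately (not shown)

          def on_b_deliver(self, packet):
              self.hold_back.append(packet)
              self._deliver_ready()

          def _deliver_ready(self):
              progressed = True
              while progressed:
                  progressed = False
                  for entry in list(self.hold_back):
                      j, ts, m = entry
                      # deliver only when this is the next message from p_j and every
                      # message it causally depends on has already been delivered
                      if (ts[j] == self.V[j] + 1 and
                              all(ts[k] <= self.V[k] for k in range(self.n) if k != j)):
                          self.hold_back.remove(entry)
                          self.co_deliver(m)
                          self.V[j] += 1     # only the j-th entry increases
                          progressed = True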

  • Comments on multicast protocols. We need protocols for overlapping groups, because applications do need to subscribe to several groups; definitions of global FIFO ordering etc. are on page 450, with some references to papers on them. Multicast in synchronous and asynchronous systems: all of our algorithms work in both. Reliable and totally ordered multicast can be implemented in a synchronous system, but is impossible in an asynchronous system (reasons discussed in the consensus section; paper by Fischer et al.).

  • Summary. Multicast communication can specify requirements for reliability and ordering, in terms of integrity, validity and agreement. B-multicast: a correct process will eventually deliver a message provided the multicaster does not crash. Reliable multicast: the correct processes agree on the set of messages to be delivered; we showed two implementations, over B-multicast and over IP multicast. Delivery ordering: FIFO, total and causal delivery ordering; FIFO ordering by means of the sender's sequence numbers; total ordering by means of a sequencer or by agreement of sequence numbers between the processes in a group; causal ordering by means of vector timestamps. The hold-back queue is a useful component in implementing multicast protocols.

  • DISTRIBUTED SYSTEMS II FAULT-TOLERANT AGREEMENT

    Prof Philippas Tsigas, Distributed Computing and Systems Research Group


    Distributed Systems Course: Coordination and Agreement, 11.5 Consensus and Related Problems

  • Consensus (Agreement). All correct processes propose a value, and must agree on a value related to the proposed values. Definition: the Consensus problem is specified as follows. Termination: every correct process eventually decides some value. Validity: if all processes that propose a value propose v, then all correct processes eventually decide v. Agreement: if a correct process decides v, then all correct processes eventually decide v. Integrity: every process decides at most once, and if it decides v (not NU), then some process must have proposed it. (NU is a special value which stands for "no unanimity".)

  • The one-general problem (trivial!): a single general G commands troops on the battlefield (figure).

  • The two-generals problem: the blue army (general BlueG) and the red army (general RedG) communicate by messengers across enemy territory (figure).

  • The blue and red armies must attack at the same time. The blue and red generals synchronize through messengers. Messengers (messages) can be lost. Rules: (figure).

  • How many messages do we need? Assume blue starts: BG sends RG "attack at 9am". Is this enough?

  • How many messages do we need? Assume blue starts: BG sends "attack at 9am"; RG replies with an ack ("red goes at 9am"). Is this enough?

  • How many messages do we need? Assume blue starts: BG sends "attack at 9am"; RG acks; BG replies "got ack". Is this enough?

  • The stated problem is impossible! Theorem: there is no protocol that uses a finite number of messages and solves the two-generals problem (as stated here).

    Proof: consider the shortest such protocol (execution), and consider its last message. The protocol must work even if that last message never arrives, so don't send it. But now you have a shorter protocol (execution): contradiction.

  • The stated problem is impossible! Theorem: there is no protocol that uses a finite number of messages and solves the two-generals problem (as stated here). Alternatives?

  • Probabilistic approach? Send as many messages as possible and hope one gets through (assume blue starts; figure).

  • Eventual commit: eventually both sides attack; both sides keep retransmitting (assume blue starts; figure).

  • 2-phase eventual commit: eventually both sides attack; retransmissions are organized in two phases (assume blue starts; figure).

  • Chalmers is surrounded by army units. The armies have to attack simultaneously in order to conquer Chalmers. Communication between the generals is by means of messengers. Some of the generals are traitors.

  • The Byzantine agreement problem

    One process (the source or commander) starts with a binary value. Each of the remaining processes (the lieutenants) has to decide on a binary value such that: Agreement: all non-faulty processes agree on the same value. Validity: if the source is non-faulty, then all non-faulty processes agree on the initial value of the source. Termination: all processes decide within finite time. So if the source is faulty, the non-faulty processes may agree on any value. It is irrelevant what value a faulty process decides.

  • Byzantine Empire

  • Conditions for a solution for Byzantine faults

    Number of processes: n. Maximum number of possibly failing processes: f. Necessary and sufficient condition for a solution to Byzantine agreement: f < n/3 (equivalently, n >= 3f + 1).

  • Scenario 1 (figure)

  • Scenario 2 (figure)

  • The Byzantine agreement problem (general formulation)

    All processes start with a binary value. They have to decide on a binary value such that: Agreement: all non-faulty processes agree on the same value. Validity: if all non-faulty processes start with the same value v, then they agree on v. Termination: all correct processes decide within finite time. It is irrelevant what value a faulty process decides.

  • Impossibility of 1-resilient 3-processor agreement (figure: six processes A:VA=0, B:VB=0, C:VC=0, A:VA=1, B:VB=1, C:VC=1 arranged in a ring; execution E1)

  • Impossibility of 1-resilient 3-processor agreement (figure: execution E0)

  • Impossibility of 1-resilient 3-processor agreement (figure: execution E1)

  • Impossibility of 1-resilient 3-processor agreement (figure: execution E2)

  • Proof: in E0, A and B decide 0; in E1, B and C decide 1; in E2, C has to decide 1 and A has to decide 0: contradiction!


    Note: there is an error in the book's figure in the 1st impression.

    For a 1 hr 15 min lecture, the following slides were omitted: the IP multicast implementation of basic multicast; the IP multicast implementation of reliable multicast; the implementation of FIFO multicast over basic multicast; the ISIS algorithm for totally ordered multicast. (One of these topics could have been included in the time.)

    Notes: restrict the range of a multicast by setting the time to live; the reasons for doing this are 1) efficiency and 2) to prevent sending to another group elsewhere with the same address. The term broadcast means delivery to all local processes. Reliability of 1-1 channels is discussed shortly; arbitrary failures are e.g. sending false messages such as untrue acks; process p can belong to more than one group, but to simplify this is often restricted to one. Yes, IP multicast does support both open and closed groups. Ack implosions: the multicaster is sent acks by members, and so many arrive that it may drop some of them, so it retransmits and gets yet more acks. The reliable multicast over B-multicast is correct in an asynchronous system (no timing assumptions) but inefficient: each message is sent |g| times. The piggybacked values in a message allow recipients to learn about messages they have not yet received.


    Notes on the sequencer protocol: the ordering is also FIFO, because the sender's messages arrive in order at the sequencer; it is not causal. A sequencer may become a bottleneck and is a critical point of failure; this can be improved by storing the message at f+1 nodes before delivering, so as to tolerate up to f failures. Members inform the sequencer of the latest sequence number they have seen (piggybacked). If a member stops multicasting, the sequencer will not know what it has received; to solve this problem, members send 'heartbeat messages' containing sequence numbers.
    Notes on ISIS: a process B-multicasts a message to the members of the group; the receiving processes propose numbers and return them to the sender; the sender uses the proposed numbers to generate the agreed number.
    Notes on causal ordering: pi has delivered Vi[j] messages from pj; if pj has delivered Vj[k], pi needs to wait until Vi[k] >= Vj[k]; a process can CO-deliver to itself immediately after B-multicast.
    Times taken to present these slides: sections 14.1-14.2 took about 1 hour 15 minutes (but omitted the details of view-synchronous group communication on page 563); sections 14.4 and 14.5 then took about 1 hour 15 minutes, but omitted 14.4.2 Bayou, 14.4.3 Coda and 14.5.6 the virtual partition algorithm.