Overview - Southern Illinois University Carbondale, rahimi/cs420/slides/cs420-part4.pdf

Overview
♦ Introduction
♦ Fundamental Concepts of Distributed Systems
  • System models
  • Review of network architectures
  • Interprocess communication
♦ Time and Global States
  • Clocks and concepts of time
  • Synchronization
  • Global states
♦ Coordination
  • Distributed mutual exclusion
  • Multicast
  • Byzantine problems
♦ Distribution and Operating Systems
  • Protection mechanisms
  • Processes and threads
♦ Distributed File Systems
  • Network file system (NFS)

Overview
♦ Middleware
  • Distributed object models
    – Remote invocation
    – CORBA
  • Name services
♦ Security
  • Cryptographic algorithms
  • Digital signatures
♦ Distribution and Database Systems
  • Distribution of databases
  • Transactions and concurrency control
  • Concurrency control in distributed transactions
♦ Distributed Shared Memory
  • Sequential consistency
♦ Telecommunications Systems
  • Distributed multimedia systems
  • Intelligent networks
  • Network management

Coordination
♦ Coordination Problems in Distributed Systems
  • asynchronous distributed systems: no single process has a view of the current global system state
  • need to coordinate the actions of the independent processes to achieve common goals
    – failure detection: how do I know, in an asynchronous network, whether my peer is dead or alive?
    – mutual exclusion: no two processes will ever get access to a shared resource in a critical section at the same time
    – election: in master-slave systems, how will the system elect a master (either at boot-up time or when the master fails)?
    – multicast: sending to a group of recipients
      ◦ reliability of multicast
      ◦ order preservation
    – consensus in the presence of faults (Byzantine problems):
      ◦ how to know whether an acknowledgement was received over an unreliable communication medium
      ◦ how to know whether a peer process knows about one’s own intentions in the presence of a non-confidential communication channel

Failure Detection
♦ Failure Detector
  • service that possesses the capability to decide whether a particular process has crashed or not
  • local failure detector in each object, collaborating with peers in other processes to detect failure
    – unreliable failure detector: distinguishes suspected and unsuspected peer processes
      ◦ unsuspected: failure is unlikely (e.g., the failure detector has recently received communication from the unsuspected peer)
        * may be inaccurate
      ◦ suspected: indication that the peer process has failed (e.g., no message received in quite some time)
        * may be inaccurate (e.g., the peer process has not failed, but the communication link is down, or the peer process is much slower than expected)
    – reliable failure detector
      ◦ unsuspected: potentially inaccurate, as above
      ◦ failed
        * accurate determination that the peer process has failed

Failure Detection
♦ Failure Detector
  • implementation of unreliable failure detector
    – periodically, every T seconds, each process p sends an “I’m alive” message to every other process
    – if the local failure detector at q does not receive “I’m alive” from p within T+D (D = estimated maximum transmission delay), then p is suspected
    – will revise verdict if a message is subsequently received
  • problem: how to calibrate D
    – for small D, intermittent drops in network performance will often cause non-crashed processes to be suspected, or
    – for large D, crashes will remain unobserved (crashed nodes will be fixed before the timeout expires)
  • solution approaches
    – variable D, based on observed network latencies
  • implementation of reliable failure detectors is only possible in synchronous networks
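The timeout rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the class name `HeartbeatDetector`, its method names, and the use of explicit `now` arguments instead of a real clock are all assumptions made for the example.

```python
class HeartbeatDetector:
    """Unreliable failure detector: suspects a peer silent for more than T + D."""

    def __init__(self, period_t, delay_d):
        self.period_t = period_t      # heartbeat interval T (seconds)
        self.delay_d = delay_d        # estimated maximum transmission delay D
        self.last_seen = {}           # peer id -> arrival time of last "I'm alive"

    def heartbeat(self, peer, now):
        """Record an "I'm alive" message; a late message revises an earlier suspicion."""
        self.last_seen[peer] = now

    def suspected(self, peer, now):
        """True if nothing was ever received, or the last message is older than T + D."""
        last = self.last_seen.get(peer)
        return last is None or now - last > self.period_t + self.delay_d


fd = HeartbeatDetector(period_t=1.0, delay_d=0.5)
fd.heartbeat("p", now=0.0)
print(fd.suspected("p", now=1.2))   # False: still within T + D
print(fd.suspected("p", now=2.0))   # True: silent for too long
fd.heartbeat("p", now=2.1)          # delayed message finally arrives
print(fd.suspected("p", now=2.2))   # False: verdict revised
```

A variable D, as suggested above, could be obtained by updating `delay_d` from observed arrival gaps; that refinement is left out here.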

Mutual Exclusion
♦ Mutual exclusion problems
  • prominent problem in multitasking operating systems
    – access to shared memory
    – access to shared resources
    – access to shared data
    – various algorithms to ensure mutual exclusion, e.g.
      ◦ Dijkstra’s semaphores
      ◦ monitors
  • mutual exclusion in distributed systems
    – no shared memory
    – usually, no centralized instance like an operating system kernel that would coordinate access
    – based on synchronous or asynchronous design approach
  • examples
    – consistent access to shared files (e.g., Network File Systems)
    – coordination of access to an access point in an IEEE 802.11 WaveLAN

Mutual Exclusion
♦ Requirements for Mutual Exclusion Algorithms
  • ME1: at most one process may execute in the critical section at any given point in time (safety)
  • ME2: requests to enter or exit the critical section will eventually succeed (liveness)
    – impossible for one process to enter the critical section more than once while other processes are awaiting entry
  • ME3: if one request to enter the critical section is issued before another request (as per the → relation), then the requests will be served in the same order

[Figure: each of process 1, ..., process n executes enter(), access(), exit()]

Mutual Exclusion
♦ Performance criteria to be used in the assessment of mutual exclusion algorithms
  • bandwidth consumed (corresponds to number of messages sent)
  • client delay at each entry and exit
  • throughput: number of critical region accesses that the system allows
    – here measured in terms of the synchronization delay between one process exiting the critical section and the next process entering


Mutual Exclusion
♦ Central Server-based Algorithm
  • central server receives access requests
    – if no process is in the critical section, the request is granted
    – if a process is in the critical section, the request is queued
  • process leaving critical section
    – grant access to next process in queue, or wait for new requests if queue is empty
♦ Properties
  • satisfies ME1 and ME2, but not ME3 (network delays may reorder requests)
  • two messages per request, one per exit; exit does not delay the exiting process
  • performance and availability of the server are the bottlenecks

[Figure: server managing a mutual exclusion token for processes p1 to p4, with a queue of requests. 1. Request token, 2. Release token, 3. Grant token. © Addison-Wesley Publishers 2000]
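The server-side behaviour described above (grant if free, otherwise queue; on release, grant to the head of the queue) might look as follows. This is a sketch only: `LockServer` and its methods are hypothetical names, and message passing is replaced by direct method calls.

```python
from collections import deque


class LockServer:
    """Central server holding a single token; waiting requests are queued in arrival order."""

    def __init__(self):
        self.holder = None          # process currently in the critical section
        self.queue = deque()        # requests that could not be granted yet

    def request(self, pid):
        """Return True if pid is granted the token immediately, else queue the request."""
        if self.holder is None:
            self.holder = pid
            return True
        self.queue.append(pid)
        return False

    def release(self, pid):
        """Token handed back by the holder; returns the next process granted, or None."""
        assert pid == self.holder, "only the holder may release"
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder


server = LockServer()
print(server.request("p1"))   # True: critical section was free
print(server.request("p2"))   # False: queued behind p1
print(server.release("p1"))   # p2: token passes to the head of the queue
```

Requests are served in the order in which they reach the server, which is why network delays can still violate ME3.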

Mutual Exclusion
♦ Ring-based Algorithm
  • logical, not necessarily physical link: every process pi has a connection to process p(i+1) mod N
  • token passes in one direction through the ring
  • token arrival
    – only the process in possession of the token may access the critical region
    – if no request upon arrival of token, or when exiting critical region, pass token on to neighbour
  • satisfies ME1 and ME2, but not ME3
  • performance
    – constant bandwidth consumption
    – entry delay between 0 and N message transmission times
    – synchronization delay between 1 and N message transmission times

[Figure: ring of processes p1, p2, p3, p4, ..., pn with a token circulating around the ring]


Mutual Exclusion
♦ Algorithm by Ricart and Agrawala
  • based on multicast
    – process requesting access multicasts request to all other processes
    – process may only enter critical section if all other processes return positive acknowledgement messages
  • assumptions
    – all processes have communication channels to all other processes
    – all processes have distinct numeric ID and maintain logical clocks

On initialization
    state := RELEASED;

To enter the section
    state := WANTED;
    Multicast request to all processes;        (processing of incoming requests deferred here)
    T := request’s timestamp;
    Wait until (number of replies received = (N – 1));
    state := HELD;

On receipt of a request <Ti, pi> at pj (i ≠ j)
    if (state = HELD or (state = WANTED and (T, pj) < (Ti, pi)))
    then
        queue request from pi without replying;
    else
        reply immediately to pi;
    end if

To exit the critical section
    state := RELEASED;
    reply to any queued requests;

© Addison-Wesley Publishers 2000
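The pseudocode above can be exercised with a small Python sketch of the per-process state. Everything here is illustrative: `RAProcess` and its method names are made up, the transport is omitted (requests are handed over by direct calls), and the clocks are preset so that the two concurrent requests carry timestamps 41 and 34.

```python
RELEASED, WANTED, HELD = "RELEASED", "WANTED", "HELD"


class RAProcess:
    """State kept by one process; sending and receiving messages is left to the caller."""

    def __init__(self, pid):
        self.pid = pid
        self.state = RELEASED
        self.clock = 0            # Lamport logical clock
        self.request_ts = None    # T: timestamp of own pending request
        self.deferred = []        # requesters to answer when leaving the section

    def want(self):
        """Begin a request; returns the <T, pid> pair that would be multicast."""
        self.state = WANTED
        self.clock += 1
        self.request_ts = self.clock
        return (self.request_ts, self.pid)

    def on_request(self, ts, sender):
        """Return True to reply immediately, False if the request is queued."""
        self.clock = max(self.clock, ts) + 1
        own_first = self.state == WANTED and (self.request_ts, self.pid) < (ts, sender)
        if self.state == HELD or own_first:
            self.deferred.append(sender)
            return False
        return True

    def exit(self):
        """Leave the critical section; returns the requesters that now get a reply."""
        self.state = RELEASED
        answered, self.deferred = self.deferred, []
        return answered


p1, p2, p3 = RAProcess(1), RAProcess(2), RAProcess(3)
p1.clock, p2.clock = 40, 33            # preset so the requests carry 41 and 34
req1, req2 = p1.want(), p2.want()      # (41, 1) and (34, 2)
print(p3.on_request(*req1), p3.on_request(*req2))  # True True: p3 is RELEASED
print(p1.on_request(*req2))            # True: p2's request is older than p1's
print(p2.on_request(*req1))            # False: p2 queues p1's request
p2.state = HELD                        # p2 has its N - 1 = 2 replies
print(p2.exit())                       # [1]: p1 is answered on exit
```

Comparing `(timestamp, pid)` tuples gives the total order on requests that the safety argument relies on, with the process ID breaking timestamp ties.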

Mutual Exclusion
♦ Algorithm by Ricart and Agrawala
  • if a request is broadcast and the state of all other processes is RELEASED, then all processes will reply immediately and the requester will obtain entry
  • if at least one process is in state HELD, that process will not reply until it has left the critical section, hence mutual exclusion
  • if two or more processes request at the same time, whichever process’s request bears the lowest timestamp will be the first to get N-1 replies
  • in case of equal timestamps, the process with the lower ID wins







Mutual Exclusion
♦ Algorithm by Ricart and Agrawala
  • p3 is not attempting to enter; p1 and p2 request entry simultaneously
  • p3 replies immediately
  • p2 receives the request from p1; timestamp(p2) < timestamp(p1), therefore p2 does not reply
  • p1 sees that its own timestamp is larger than that of the request from p2, hence it replies immediately and p2 is granted access
  • p2 will reply to p1’s request after exiting the critical section

[Figure: processes p1, p2 and p3 exchanging requests with timestamps 41 and 34, and the resulting Reply messages]


Mutual Exclusion
♦ Algorithm by Ricart and Agrawala
  • algorithm satisfies ME1
    – two processes pi and pj could only access the critical section at the same time if they had replied to each other
    – since pairs <Ti, pi> are totally ordered, this cannot happen
  • algorithm also satisfies ME2 and ME3







Mutual Exclusion
♦ Algorithm by Ricart and Agrawala
  • performance
    – getting access requires 2(N-1) messages per request
    – synchronization delay: only one message transmission time; client delay: just one round-trip time (previous algorithms: up to N)
  • protocol improvements
    – repeated entry of same process without executing protocol
    – optimization possible to N messages per request (with hardware support for multicast)







Mutual Exclusion
♦ Maekawa’s Voting Algorithm
  • observation
    – to get access, not all processes have to agree
    – suffices to split set of processes up into subsets (“voting sets”) that overlap
    – suffices that there is consensus within every subset
  • model
    – processes p1, .., pN
    – voting sets V1, .., VN chosen such that ∀ i,k and for some integer M:
      ◦ pi ∈ Vi
      ◦ Vi ∩ Vk ≠ ∅ (some overlap between every pair of voting sets)
      ◦ | Vi | = K (fairness: all voting sets have equal size)
      ◦ each process pk is contained in M voting sets



  • protocol
    – to obtain entry to the critical section, pi sends request messages to all K-1 other members of voting set Vi
    – cannot enter until K-1 replies received
    – when leaving critical section, send release to all members of Vi
    – when receiving request
      ◦ if state = HELD or already replied (voted) since last release
        * then queue request
      ◦ else immediately send reply
    – when receiving release
      ◦ remove request at head of queue and send reply


On initialization
    state := RELEASED;
    voted := FALSE;

For pi to enter the critical section
    state := WANTED;
    Multicast request to all processes in Vi – {pi};
    Wait until (number of replies received = (K – 1));
    state := HELD;

On receipt of a request from pi at pj (i ≠ j)
    if (state = HELD or voted = TRUE)
    then
        queue request from pi without replying;
    else
        send reply to pi;
        voted := TRUE;
    end if

For pi to exit the critical section
    state := RELEASED;
    Multicast release to all processes in Vi – {pi};

On receipt of a release from pi at pj (i ≠ j)
    if (queue of requests is non-empty)
    then
        remove head of queue – from pk, say;
        send reply to pk;
        voted := TRUE;
    else
        voted := FALSE;
    end if

© Addison-Wesley Publishers 2000


  • optimization goal: minimize K while achieving mutual exclusion
    – can be shown to be reached when K ~ √N and M = K
  • optimal voting sets: nontrivial to calculate
    – approximation: derive Vi so that | Vi | ~ 2√N
      ◦ place processes in a √N by √N matrix
      ◦ let Vi be the union of the row and column containing pi
  • satisfies ME1
    – if it were possible for two processes to enter the critical section at the same time, then the processes in the non-empty intersection of their voting sets would have granted access to both
    – impossible, since each process casts at most one vote after receiving a request
  • deadlocks are possible
    – consider three processes with
      ◦ V1 = {p1, p2}, V2 = {p2, p3}, V3 = {p3, p1}
    – possible to construct cyclic wait graph
      ◦ p1 replies to p2, but queues request from p3
      ◦ p2 replies to p3, but queues request from p1
      ◦ p3 replies to p1, but queues request from p2


  • algorithm can be adapted to become deadlock-free
    – use of logical clocks
    – processes queue requests in happened-before order
    – means that ME3 is also satisfied
  • performance
    – bandwidth utilization
      ◦ 2√N per entry, √N per exit; the total of 3√N is better than the 2(N-1) of Ricart and Agrawala for N > 4
    – synchronization delay
      ◦ round-trip time instead of single-message transmission time in Ricart and Agrawala
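The matrix approximation for voting sets (row plus column of a √N by √N grid) is easy to check mechanically. The sketch below assumes N is a perfect square and numbers processes 0 to N-1 in row-major order; `grid_voting_sets` is a name chosen for this example.

```python
import math


def grid_voting_sets(n):
    """Voting set of process i = all processes in its row or its column of the grid."""
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("this sketch assumes n is a perfect square")
    sets = []
    for i in range(n):
        row, col = divmod(i, side)
        row_members = {row * side + c for c in range(side)}
        col_members = {r * side + col for r in range(side)}
        sets.append(row_members | col_members)
    return sets


sets = grid_voting_sets(16)
print(sorted(sets[5]))                           # [1, 4, 5, 6, 7, 9, 13]
print(all(a & b for a in sets for b in sets))    # True: every pair of sets overlaps
print({len(s) for s in sets})                    # {7}: each set has 2*sqrt(N) - 1 members
```

Any two sets overlap because the row of one process always crosses the column of the other, which is the property the ME1 argument depends on.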

Mutual Exclusion
♦ Notes on Fault Tolerance
  • none of these algorithms tolerates message loss
  • the ring-based algorithm cannot tolerate a single crash failure
  • Maekawa’s algorithm can tolerate some crash failures
    – if the crashed process is in a voting set that is not required, the rest of the system is not affected
  • Central-Server: tolerates crash failure of a node that has neither requested access nor is currently in the critical section
  • the Ricart and Agrawala algorithm can be modified to tolerate crash failures by the assumption that a failed process grants all requests immediately
    – requires reliable failure detector

Election Algorithms
♦ Election
  • algorithm designed to designate one unique process out of a set of processes with similar capabilities to take over certain functions in a distributed system
    – central server for mutual exclusion
    – ring master in token ring networks
    – bus master
  • necessary when
    – system is booted
    – server fails
    – server retires
  • properties, to be valid during any particular run of the system
    – E1: a process pi has electedi = ⊥ (undefined) or electedi = P, where P is the non-crashed process with the largest identifier that will be chosen at the end of the run (safety)
    – E2: all processes pi will eventually set electedi ≠ ⊥ (liveness)
  • performance
    – network bandwidth utilization (proportional to total number of messages sent)
    – turnaround time: the number of serialized message transmission times between initiation and termination of a single run

Election Algorithms
♦ Ring-based Algorithm
  • assumptions
    – all nodes communicate on uni-directional ring structure
    – all processes have unique integer id
    – asynchronous, reliable system
  • initially, all processes are marked “non-participant”
  • to begin an election, a process places an election message with its identifier on the ring and marks itself “participant”
  • upon receipt of election message, compare received identifier with its own
    – if received id greater than your id, forward message to neighbour
    – if received id smaller than your id
      ◦ if your status is “non-participant”, then substitute your id in election message and forward on ring
      ◦ otherwise, do not forward message (already “participant”)
    – if received id is identical to your id
      ◦ this process’s id must be greatest and it becomes elected
      ◦ mark your status as “non-participant”
      ◦ send out “elected” message
  • upon any forwarding, mark your state as “participant”
  • when receiving “elected” message
    – mark your status as “non-participant”
    – set electedi appropriately and forward elected message

Election Algorithms
♦ Ring-based Algorithm
  • properties
    – E1 satisfied, since all identifiers are compared
    – E2 follows from reliable communication property
  • performance
    – at worst 2N-1 messages for electing the right-hand neighbour
    – another N elected messages
  • failures
    – tolerates no failures
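The message counts quoted above can be reproduced with a small simulation. The sketch assumes a single initiator and no failures, so the "already participant" branch of the algorithm never fires and is omitted; `ring_election` and its return convention are invented for this example.

```python
def ring_election(ids, starter):
    """Simulate one election with a single initiator on a unidirectional ring.

    ids[k] is the identifier at ring position k; messages travel from
    position k to (k + 1) % len(ids). Returns (elected id, number of
    election messages, number of elected messages).
    """
    n = len(ids)
    msg = ids[starter]              # initiator places its own id on the ring
    pos = (starter + 1) % n
    election_msgs = 1
    while msg != ids[pos]:          # a process seeing its own id is elected
        if msg < ids[pos]:
            msg = ids[pos]          # non-participant substitutes its larger id
        pos = (pos + 1) % n         # forward to the neighbour
        election_msgs += 1
    return msg, election_msgs, n    # the elected message goes once round the ring


print(ring_election([3, 1, 2, 9], starter=0))   # (9, 7, 4): worst case, 2N-1 = 7
print(ring_election([9, 1, 2, 3], starter=0))   # (9, 4, 4): initiator has the highest id
```

In the first call the highest identifier sits at the last position the message reaches, so the election message needs almost two full circuits.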

Election Algorithms
♦ The Bully Algorithm
  • works for synchronous networks
    – nodes can crash, and crashes will be detected reliably
  • assumptions
    – each node knows identifiers of all other nodes
    – every node can communicate with every other node
  • message types
    – election: announce an election
    – answer: reply to an election message
    – coordinator: announce identity of elected process

Election Algorithms
♦ The Bully Algorithm
  • initiation of algorithm: reliable failure detection
    – a peer process has failed if there is no answer to a request within
      ◦ T = 2 T_trans + T_process
  • a process can decide whether to become coordinator by comparing its own id with all other ids (highest wins)
    – announce by sending coordinator message to all other nodes with lower id
  • a process with a lower id can bid to become coordinator by sending an election message to all processes with higher id
    – if no response within T, it considers itself elected coordinator and sends a coordinator message to all processes with lower id
    – otherwise, it waits for another T’ time units for a coordinator message to arrive from the new coordinator
      ◦ if no response, then begin another election process
  • a process receiving a coordinator message sets its variable electedi to the id of the coordinator received in the message
  • if a process receives an election message, it sends back an answer message and begins another election, unless it has already initiated one
  • new process replacing crashed process
    – if it has the highest id, it will immediately send a coordinator message and “bully” the current coordinator into resigning

Election Algorithms
♦ The Bully Algorithm
  • example
[Figure: the election of coordinator p2, after the failure of p4 and then p3, shown for processes p1 to p4 in four stages: election and answer messages (Stages 1 and 2), a timeout (Stage 3) and, eventually, the coordinator message (Stage 4)]


Election Algorithms
♦ The Bully Algorithm
  • properties
    – E1 satisfied (if no process is replaced and the timeout estimate T is accurate)
    – E2 satisfied (synchronous network, reliable transmission)
    – E1 not satisfied if a crashed process is replaced at the same time as another process announces that it is the new coordinator
  • performance
    – best case: the process with the second highest identifier detects the coordinator’s failure
      ◦ elects itself coordinator and sends N-2 coordinator messages
    – requires O(N²) messages in the worst case, when the process with the least id detects the failure first
      ◦ N-1 processes with higher IDs start election
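A rough message count for the two cases above can be obtained from a simplified simulation. The assumptions are loud ones: no process crashes or recovers during the election, timeouts are not modelled, and election messages sent to crashed processes are counted too (so the best case comes out one message above the N-2 coordinator messages). `bully_election` is a made-up helper.

```python
def bully_election(ids, crashed, initiator):
    """Return (coordinator, total messages) for one election started by initiator.

    ids lists all known process identifiers, crashed is the set of ids
    that never answer, and initiator must be a live process.
    """
    alive = set(ids) - set(crashed)
    messages = 0
    started = set()                     # processes that have begun an election
    to_run = [initiator]
    while to_run:
        p = to_run.pop()
        if p in started:
            continue                    # an election is begun only once per process
        started.add(p)
        for q in ids:
            if q > p:
                messages += 1           # election message p -> q
                if q in alive:
                    messages += 1       # answer message q -> p
                    to_run.append(q)    # q begins its own election
    coordinator = max(alive)            # the highest live id gets no answer
    messages += sum(1 for q in ids if q < coordinator)   # coordinator messages
    return coordinator, messages


print(bully_election([1, 2, 3, 4], crashed={4}, initiator=3))   # (3, 3)
print(bully_election([1, 2, 3, 4], crashed={4}, initiator=1))   # (3, 11)
```

Letting the lowest identifier start the election makes every higher live process run its own round, which is where the quadratic growth comes from.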

Multicast
♦ Multicast
  • group communication
    – sending and delivery of messages to more than one recipient
    – membership in recipient group transparent to sender
      ◦ one send operation to one address without having to send individual messages to all recipients
  • issues
    – addressing
    – coordination
      ◦ guarantees that messages are received by a group of recipients
      ◦ delivery ordering among group members
  • uses of multicast
    – Computer Supported Collaborative Work (CSCW)
      ◦ shared white boards
      ◦ video-conferencing
    – communication with replicated servers (to achieve fault-tolerance)
    – event notification in networks

Multicast
♦ IP-based Multicast
  • only implemented by some IP routers
  • available for UDP transport service
  • addressing: multicast address and port number
  • IP multicast group
    – class D IP address, for which the first 4 bits are 1110 in IPv4
    – membership is dynamic
    – computer belongs to multicast group if one or more processes have sockets that belong to a multicast group
  • implementation of multicast IP routers
    – on local area networks, use LAN's multicast capabilities (e.g., Ethernet)
      ◦ use locally valid multicast address, set Time To Live (TTL) counter in IP header to 1 so that packet will never get routed outside LAN
    – in the Internet, router forwards messages to all other routers that have members in the multicast group, which in turn forward the datagrams to group members
      ◦ session directory (sd)
        * allowing users to advertise multicast sessions as well as their valid multicast addresses
  • no guarantees whatsoever
    – message loss, reordering, duplication, etc.
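The class D rule (first four bits 1110) can be checked directly on the 32-bit value of an IPv4 address. The helper name `is_class_d` is an assumption of this sketch; `ipaddress` is the Python standard library module.

```python
import ipaddress


def is_class_d(addr):
    """True if the first four bits of the IPv4 address are 1110."""
    value = int(ipaddress.IPv4Address(addr))   # address as a 32-bit integer
    return value >> 28 == 0b1110


print(is_class_d("224.0.0.1"))         # True
print(is_class_d("239.255.255.255"))   # True
print(is_class_d("192.168.0.1"))       # False: first four bits are 1100
```

The same test is available as the `is_multicast` property of `ipaddress.IPv4Address`; the bit shift is spelled out here only to mirror the wording of the slide.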

Multicast
♦ Properties of multicast
  • achieves not only transparency, but also enables stronger guarantees than "delivery by hand"
    – efficient use of network hardware
      ◦ router sends individual messages
      ◦ uses tree-like distribution structure if available
      ◦ use of LAN-based multicast capabilities, if available
    – delivery guarantees
♦ System model
  • message m: contains ID of sender and of destination group
    – multicast(g, m): multicast message m to group g
    – deliver(m): delivery of a message at recipient
  • multicast group is
    – closed, if multicast is only possible from within the group
    – open, if processes that are not members of the group may send to it

Multicast
♦ Basic multicast
  • guaranteed delivery, unless multicaster crashes
  • primitives and implementation
    – B-multicast(g, m): for each process p ∈ g, send(p, m)
    – B-deliver(m) at p: when receive(m) at p, for all p
  • problem in using concurrent send(p, m) operations
    – ack-implosion:
      ◦ all recipients acknowledge receipt at about the same time
      ◦ buffer overflow leads to dropping of ack messages
      ◦ retransmits, even more ack messages

Multicast
♦ Reliable multicast
  • primitives
    – R-multicast(m, g)
    – R-deliver(m)
  • desired properties
    – integrity: a correct process p delivers a message at most once, and the delivered message is identical to the message sent in the multicast send operation (safety)
    – validity: if a correct process multicasts message m, then it will eventually deliver m (liveness)
    – agreement: if a correct process delivers a message m, then all other correct processes in the target group of message m will also deliver message m
    – (additionally) uniform agreement: if a process, no matter whether it is correct or fails, delivers a message m, then all correct processes in the group will deliver m as well
  • notes:
    – validity is expressed in terms of self-delivery, for simplicity reasons
      ◦ validity and agreement amount to overall liveness requirement: if one process (the sender) delivers a message m, then m will eventually be delivered to all the group’s correct members
    – agreement is similar to “atomicity”: all-or-nothing semantics

Multicast
♦ Reliable multicast
  • implementation
    – B-multicast to processes in group
    – R-deliver
  • properties
    – validity: a correct process will eventually B-deliver to itself
    – integrity: based on underlying communication medium
    – agreement: B-multicast to all other processes after B-deliver
  • inefficient, since each message is sent |g| times to each process
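The relay scheme behind these properties (re-multicast a message on first receipt, then deliver it) can be simulated in memory. This is a sketch under simplifying assumptions: `Node` is a hypothetical class, the "network" is a shared dictionary, sends are synchronous method calls, and no process crashes.

```python
class Node:
    """Group member implementing R-multicast on top of a simulated B-multicast."""

    def __init__(self, name, network):
        self.name = name
        self.network = network        # shared dict: name -> Node (the group g)
        self.received = set()         # ids of messages already seen
        self.delivered = []           # payloads R-delivered, in order
        self.b_receives = 0           # how many copies reached this node

    def b_multicast(self, msg_id, sender, payload):
        for node in list(self.network.values()):
            node.b_deliver(msg_id, sender, payload)

    def r_multicast(self, msg_id, payload):
        self.b_multicast(msg_id, self.name, payload)    # the group includes the sender

    def b_deliver(self, msg_id, sender, payload):
        self.b_receives += 1
        if msg_id in self.received:
            return                                      # duplicate: delivered at most once
        self.received.add(msg_id)
        if sender != self.name:
            self.b_multicast(msg_id, sender, payload)   # relay first, for agreement
        self.delivered.append(payload)                  # R-deliver


network = {}
for name in ("p", "q", "r"):
    network[name] = Node(name, network)
network["p"].r_multicast("m1", "hello")
print([n.delivered for n in network.values()])    # [['hello'], ['hello'], ['hello']]
print([n.b_receives for n in network.values()])   # [3, 3, 3]: |g| copies per process
```

The second line of output shows the cost noted on the slide: every member receives the message once from each member of the group.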



Multicast
♦ Reliable Multicast over IP Multicast
  • R-IP-multicast is based on the observation that multicast is successful in most cases
    – use negative acknowledgement to indicate non-delivery
  • Basic idea
    – closed multicast groups
    – S_g^p: sequence number for group g that process p belongs to
    – R_g^p: sequence number of the latest message that a process has delivered from process p and that was sent to group g
    – p R-multicasts message to group g
      ◦ piggy back onto message
        * S_g^p
        * acknowledgements <q, R_g^q> for all q
      ◦ IP-multicast message and piggy back information
      ◦ increment S_g^p by one


  • Basic idea
    – R-deliver message from p
      ◦ only if received sequence number S = R_g^p + 1
      ◦ then increment R_g^p by 1
      ◦ retain any message that cannot yet be delivered in hold-back queue

[Figure: incoming messages enter message processing and are placed in the hold-back queue; when delivery guarantees are met they move to the delivery queue, from which they are delivered]



  • Basic idea
    – R-deliver message from p
      ◦ if S ≤ R_g^p, then message is already delivered, discard
      ◦ if S > R_g^p + 1, or R > R_g^q for any enclosed acknowledgement <q, R>, then receiver has missed one or more messages, requests retransmit through negative acknowledgement
  • properties
    – integrity
      ◦ follows from detection of duplicates and properties of IP multicast (e.g., checksum to detect message corruption)
    – validity & agreement (validity holds because IP multicast has this property)
      ◦ message loss can only be detected when a successor message is eventually transmitted
      ◦ requires processes to multicast messages indefinitely
      ◦ requires unbounded history for broadcast messages so that retransmit is always possible
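The receiver-side rules (discard duplicates, hold back out-of-order messages, request missing ones) can be sketched as below. Assumptions of this example: a single closed group, sequence numbers starting at 1, negative acknowledgements merely collected in a list, and no suppression of repeated requests. `Receiver` and its attributes are invented names.

```python
class Receiver:
    """Per-sender sequence state at one group member."""

    def __init__(self):
        self.latest = {}      # sender -> R: last sequence number delivered from it
        self.held = {}        # (sender, seq) -> message in the hold-back queue
        self.delivered = []
        self.nacks = []       # (sender, missing seq) retransmit requests

    def on_message(self, sender, seq, msg, acks=()):
        r = self.latest.get(sender, 0)
        if seq <= r:
            return                                   # S <= R: already delivered
        self.held[(sender, seq)] = msg
        for missing in range(r + 1, seq):            # S > R + 1: gap detected
            if (sender, missing) not in self.held:
                self.nacks.append((sender, missing))
        for q, r_q in acks:                          # piggybacked <q, R>
            for missing in range(self.latest.get(q, 0) + 1, r_q + 1):
                if (q, missing) not in self.held:
                    self.nacks.append((q, missing))
        while (sender, self.latest.get(sender, 0) + 1) in self.held:
            nxt = self.latest.get(sender, 0) + 1     # S = R + 1: deliver in order
            self.delivered.append(self.held.pop((sender, nxt)))
            self.latest[sender] = nxt


rx = Receiver()
rx.on_message("p", 1, "a")
rx.on_message("p", 3, "c")            # message 2 is missing
print(rx.delivered, rx.nacks)         # ['a'] [('p', 2)]
rx.on_message("p", 2, "b")            # retransmission arrives
print(rx.delivered)                   # ['a', 'b', 'c']
```

The same hold-back logic, without the negative acknowledgements, is what FIFO-ordered delivery needs.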

Multicast
♦ Ordered Multicast
  • assume: every process belongs to at most one group
  • properties
    – FIFO ordering: if a correct process issues a multicast(g, m) and then multicast(g, m’), then every correct process that delivers m’ will deliver m before m’
    – causal ordering: if multicast(g, m) → multicast(g, m’), where → is induced by message passing only, then every correct process that delivers m’ will deliver m before m’
    – total ordering: if a correct process delivers m before it delivers m’, then any other correct process that delivers m’ will deliver m before m’
  • notes
    – causal ordering implies FIFO ordering
    – FIFO ordering and causal ordering are partial orders
    – total order allows arbitrary ordering of deliver events relative to multicast events, as long as this order is identical in all correct processes


  • implementing FIFO ordering
    – S_g^p: sequence number for group g that process p belongs to
    – R_g^p: sequence number of the latest message that a process has delivered from process p and that was sent to group g
    – assumption: non-overlapping groups
    – FO-multicast(m, g)
      ◦ B-multicast(m, g, <S_g^p>)
      ◦ increment S_g^p by 1
    – upon receipt of a message from q with sequence number S
      ◦ if S = R_g^q + 1, then this is the next message,
        * therefore FO-deliver(m)
        * R_g^q := S
      ◦ if S > R_g^q + 1, then
        * place message on hold-back queue until intervening messages have been delivered and S = R_g^q + 1


  • implementing total ordering
    – idea: assign totally ordered identifiers to multicast messages so that every process makes the same delivery decision based on these identifiers
    – delivery similar to FIFO delivery, only that group-specific sequence numbers rather than process-specific sequence numbers are used
    – assumption: non-overlapping groups
    – two main methods for the assignment of identifiers
      ◦ sequencer
      ◦ collective agreement on the assignment of message identifiers


  • implementing total ordering
    – sequencer
      ◦ process wishing to TO-multicast attaches a unique identifier id(m) to the message
      ◦ message is sent to sequencer as well as all members of g
      ◦ sequencer maintains group-specific sequence number s_g which it uses to assign increasing and consecutive sequence numbers to the messages it B-delivers
      ◦ announces the order in which members of g have to deliver these messages using a B-multicast order message
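A sequencer-based total order can be sketched with two small classes. The names `Sequencer` and `Member`, the choice to start numbering at 1, and the direct method calls standing in for B-multicast are all assumptions of this example.

```python
class Sequencer:
    """Assigns increasing, consecutive group-wide sequence numbers."""

    def __init__(self):
        self.s_g = 0                       # group-specific sequence number

    def order(self, msg_id):
        """Return the content of the order message for msg_id."""
        self.s_g += 1
        return (msg_id, self.s_g)


class Member:
    """Delivers a message only once its order message has arrived and it is next in line."""

    def __init__(self):
        self.pending = {}                  # msg_id -> body, not yet delivered
        self.orders = {}                   # sequence number -> msg_id
        self.next_seq = 1
        self.delivered = []

    def on_message(self, msg_id, body):
        self.pending[msg_id] = body
        self._try_deliver()

    def on_order(self, msg_id, seq):
        self.orders[seq] = msg_id
        self._try_deliver()

    def _try_deliver(self):
        # deliver strictly by sequence number, once message and order are both present
        while self.orders.get(self.next_seq) in self.pending:
            msg_id = self.orders.pop(self.next_seq)
            self.delivered.append(self.pending.pop(msg_id))
            self.next_seq += 1


seq = Sequencer()
o1, o2 = seq.order("m1"), seq.order("m2")
a, b = Member(), Member()
a.on_message("m1", "x"); a.on_message("m2", "y"); a.on_order(*o1); a.on_order(*o2)
b.on_message("m2", "y"); b.on_order(*o2); b.on_message("m1", "x"); b.on_order(*o1)
print(a.delivered, b.delivered)        # ['x', 'y'] ['x', 'y']
```

Member `b` sees everything in the opposite arrival order and still delivers in the sequencer's order.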





  • implementing total ordering
    – sequencer is bottleneck (performance and/or reliability)
    – collective agreement on the assignment of message identifiers
      ◦ implemented in the ISIS toolkit
      ◦ groups may be open or closed
      ◦ receiving processes bounce proposed sequence numbers to sender
      ◦ sender returns agreed sequence numbers
      ◦ each process q in group g maintains
        * A_g^q: the largest agreed sequence number it has observed so far for group g
        * P_g^q: its own largest proposed sequence number


  • implementing total ordering
    – algorithm for collective agreement on the assignment of message identifiers
      ◦ p B-multicasts <m, i> to g, where i is a unique identifier for m
      ◦ each recipient q replies to the sender p with a proposal for the agreed sequence number
        * P_g^q := max(A_g^q, P_g^q) + 1
        * each process q provisionally assigns its own proposed sequence number to the message and queues the message in its hold-back queue, ordered according to proposed sequence number
      ◦ p chooses the largest proposed number as the agreed sequence number, a
      ◦ p B-multicasts <i, a> to g
      ◦ each process q in group
        * sets A_g^q := max(A_g^q, a)
        * reorders the received message in the hold-back queue if the agreed sequence number differs from the proposed number
        * only when the message at the head of the hold-back queue has been assigned an agreed sequence number will it be moved to the delivery queue
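The proposal and agreement steps can be traced with a small sketch. Two things here are additions for the sake of a runnable example and not taken from the slides: ties between equal sequence numbers are broken by message identifier, and the sender's role (collecting proposals, taking the maximum) is played by the driver code at the bottom. `IsisMember` is a hypothetical name.

```python
class IsisMember:
    """Hold-back queue with provisional (proposed) and final (agreed) sequence numbers."""

    def __init__(self):
        self.agreed = 0        # A: largest agreed sequence number observed
        self.proposed = 0      # P: own largest proposed sequence number
        self.hold_back = {}    # msg id -> [sequence number, is_agreed, body]
        self.delivered = []

    def propose(self, msg_id, body):
        """Handle <m, i>: queue m provisionally and return the proposal."""
        self.proposed = max(self.agreed, self.proposed) + 1
        self.hold_back[msg_id] = [self.proposed, False, body]
        return self.proposed

    def agree(self, msg_id, a):
        """Handle <i, a>: record the agreed number, then deliver from the head."""
        self.agreed = max(self.agreed, a)
        self.hold_back[msg_id][0:2] = [a, True]
        while self.hold_back:
            # head = smallest sequence number; message id breaks ties (assumption)
            head = min(self.hold_back, key=lambda i: (self.hold_back[i][0], i))
            if not self.hold_back[head][1]:
                break                      # head still has a provisional number
            self.delivered.append(self.hold_back.pop(head)[2])


q, r = IsisMember(), IsisMember()
p1 = [q.propose("m1", "x")]                          # q proposes 1 for m1
p2 = [q.propose("m2", "y"), r.propose("m2", "y")]    # q proposes 2, r proposes 1
p1.append(r.propose("m1", "x"))                      # r proposes 2 for m1
for member in (q, r):
    member.agree("m1", max(p1))                      # agreed number a = 2
for member in (r, q):
    member.agree("m2", max(p2))                      # agreed number a = 2
print(q.delivered, r.delivered)                      # ['x', 'y'] ['x', 'y']
```

Although `q` and `r` receive the two messages in opposite orders, both end up delivering them in the same order.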


  • implementing causal ordering (after Birman et al.)
    – algorithm shown here ensures compliance with the happened-before relation only when it is established by multicast messages, not by individual one-to-one communication
    – each process maintains a vector clock counting the multicast events that have happened before a local multicast event
    – CO-multicast(m, g)
      ◦ add one to its own timestamp
      ◦ B-multicast message
    – when pi B-delivers message from pk
      ◦ place it in hold-back queue until it is assured that all causally preceding messages have been delivered:
      ◦ consider vector timestamp of received message
        * wait until it has delivered any earlier message sent by pk, and
        * it has delivered any message that pk had delivered at the time it multicast the current message
      ◦ update own vector timestamp in the k-th position
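The two waiting conditions translate into a simple test on vector timestamps. The sketch below is illustrative: `CausalMember` is an invented class, processes are numbered from 0, a process delivers its own messages directly, and B-multicast is replaced by calling `b_deliver` by hand.

```python
class CausalMember:
    """Causally ordered delivery using vector timestamps of multicast events."""

    def __init__(self, index, n):
        self.i = index
        self.vc = [0] * n          # vc[k] = number of multicasts from p_k delivered here
        self.hold_back = []
        self.delivered = []

    def co_multicast(self, body):
        """Stamp a new message; the caller B-multicasts the returned tuple."""
        self.vc[self.i] += 1
        return (self.i, list(self.vc), body)

    def _ready(self, k, ts):
        # next message from p_k, and nothing p_k had delivered is still missing here
        return ts[k] == self.vc[k] + 1 and all(
            ts[j] <= self.vc[j] for j in range(len(ts)) if j != k)

    def b_deliver(self, message):
        if message[0] == self.i:
            self.delivered.append(message[2])      # own message: deliver directly
            return
        self.hold_back.append(message)
        progress = True
        while progress:                            # re-scan until nothing more is ready
            progress = False
            for item in list(self.hold_back):
                k, ts, body = item
                if self._ready(k, ts):
                    self.hold_back.remove(item)
                    self.delivered.append(body)
                    self.vc[k] += 1                # update the k-th position
                    progress = True


p0, p1, p2 = (CausalMember(i, 3) for i in range(3))
m = p0.co_multicast("m")            # timestamp [1, 0, 0]
p1.b_deliver(m)
reply = p1.co_multicast("reply")    # timestamp [1, 1, 0]: causally after m
p2.b_deliver(reply)                 # arrives first at p2 and is held back
print(p2.delivered)                 # []
p2.b_deliver(m)
print(p2.delivered)                 # ['m', 'reply']
```

The reply is released at `p2` only after `m` has been delivered there, which is the causal order required.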




  • note: combinations are possible
    – CO-multicast + TO-multicast (sequencer) yields total and causal message delivery
      ◦ idea: if all processes deliver in the same order, i.e., in the sequencer’s order, and this order is causal, we get total and causal order
  • extensions to overlapping groups
    – “naive” extension: implement orderings on all processes at hand; those that are not in a particular group will discard messages not addressed to them
    – inefficient solution; suggestions for more efficient solutions exist

Group Communication
♦ Multicast communication to groups with dynamic membership
[Figure: a process group with group membership management (Join, Leave, Fail), group address expansion, and multicast communication via group send]


Group Communication
♦ Group membership service
  • interface for group membership changes
  • failure detector
  • notification of membership changes to group members
  • group address expansion
♦ Group views
  • lists of current group members
  • process “suspected”
    – exclusion from group view
    – if process not failed, or recovered, it needs to re-join group
    – false suspicion reduces effectiveness of group

Group Communication
♦ View delivery
  • necessary to relieve the programmer of having to query the state of all other group members before making a send decision
  • group management service delivers sequence of views to members, e.g.
    – v0(g) = {p}, v1(g) = {p, p’}, v2(g) = {p}, ...
  • system imposes an ordering on the possibly concurrent view changes
  • receiving/delivering a view
    – queue in hold-back queue as for multicast until all members agree to deliver the view

Consensus
♦ Consensus problems
  • all correct computers controlling a spaceship should decide to proceed with landing, or all of them should decide to abort (after each has proposed one action or the other)
  • in an electronic money transfer transaction, all involved processes must consistently agree on whether to perform the transaction (debit and credit), or not
  • in mutual exclusion, processes need to agree on which process enters critical section
  • in election, processes need to agree on elected process
  • in totally ordered multicast, processes need to agree on a consistent message delivery order

Consensus
♦ Recall process failure models
  • crash failures: processes stop (fail) and remain silent
  • Byzantine failures: processes fail, but may still respond to environment with arbitrary, erratic behavior (e.g., send false acknowledgements, etc.)


Consensus
♦ Factors threatening consensus
  • failures
    – communication link or process failures
    – crash failures (fail-silent) or Byzantine failures (arbitrary)
      ◦ (after the Byzantine Empire, 330-1453, in which unfaithfulness and untruthfulness are alleged to have been very common)
  • network characteristics
    – synchronous or asynchronous
  • failure detectors
    – reliable or unreliable
  • are messages authenticated (digitally signed) or not
    – can a process lie about the content of a message that it received from a correct process?
    – can an adversary claim to send a message under a false sender’s id?
♦ Model
  • processes communicating by message passing
  • desirable: reaching consensus even in the presence of faults
    – assumption: communication is reliable, but processes may fail

Consensus
♦ The Consensus Problem (C)
  • agreement on the value of a decision variable among all correct processes
    – pi is in state undecided and proposes a single value vi
    – next, processes communicate with each other to exchange values
    – in doing so, pi sets decision variable di and enters the decided state, after which the value of di remains unchanged

[Figure: consensus algorithm for three processes. P1 proposes v1=proceed, P2 proposes v2=proceed, P3 proposes v3=abort and crashes; the decisions are d1:=proceed and d2:=proceed]



  • properties of a consensus algorithm
    – termination: eventually, each correct process sets its decision variable
    – agreement:
      ◦ for all correct pi and pk such that state(pi) = state(pk) = decided: di = dk
    – integrity: if the correct processes all proposed the same value, then any correct process has chosen that value in the decided state
      ◦ variation: ... then some correct process has chosen that value in the decided state


  • algorithm to solve consensus in a failure-free environment
    – each process reliably multicasts its proposed value
    – after receiving the values of all processes, each process evaluates the consensus function majority(v1,.., vN), which returns the most often proposed value, or undefined if no majority exists [remark: other problem-specific functions possible]
    – properties
      ◦ termination: guaranteed by reliability of multicast
      ◦ agreement, integrity: follow from the definition of majority and the integrity of reliable multicast (all processes evaluate the same function on the same data)
  • when crashes occur
    – how to detect failure?
    – will algorithm terminate?
  • when byzantine failures occur
    – processes may communicate random values
    – evaluation of consensus function may be inconsistent
    – malevolent processes may deliberately propose false or inconsistent values
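The failure-free algorithm above can be sketched in a few lines. This is a minimal illustration, not a networked implementation: reliable multicast is modelled by simply handing every process the full list of proposals, "majority" is assumed to mean a strict majority (more than half), and None stands in for "undefined". The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Optional, Sequence


def majority(values: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value proposed by more than half of the processes, else None."""
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None


def failure_free_consensus(proposals: Dict[str, Hashable]) -> Dict[str, Optional[Hashable]]:
    """Every process receives every proposal and evaluates the same function."""
    # Assumption: reliable multicast delivers all N proposals to all N processes.
    received = {pid: list(proposals.values()) for pid in proposals}
    return {pid: majority(vals) for pid, vals in received.items()}


decisions = failure_free_consensus({"p1": "proceed", "p2": "proceed", "p3": "abort"})
print(decisions)
```

Because all processes apply the same function to the same data, they necessarily decide identically; the difficulties listed above arise only once processes can crash or lie.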

Consensus
♦ The Byzantine Generals Problem (BG)
  • three or more generals are to agree on whether to attack or retreat
  • commander issues order
    – others (lieutenants to the commander) have to decide to attack or retreat
  • one of the generals may be treacherous
    – if the commander is treacherous, it proposes attacking to one general and retreating to the other
    – if a lieutenant is treacherous, it tells one of its peers that the commander ordered to attack, and others that the commander ordered to retreat
  • difference to consensus problem: one process supplies a value that others have to agree on
  • properties
    – termination: eventually each correct process sets its decision variable
    – agreement: the decision value of all correct processes is the same
    – integrity: if the commander is correct, then all correct processes decide on the value that the commander proposes
      ◦ note: integrity implies agreement only if the commander is correct, but the commander need not be correct (see above)

Consensus
♦ Interactive Consistency (IC)
  • each process suggests one value
  • goal: all correct processes agree on a vector of values, each component corresponding to one process's agreed value
    – example: agreement about each process's local state
  • requirements
    – termination: eventually each correct process sets its decision variable
    – agreement: the decision vector of all correct processes is the same
    – integrity: if pi is correct, then all correct processes decide on vi as the i-th component of their vector

Consensus
♦ Relationship of Consensus to Other Problems
  • assume that the previous problems could be solved, yielding the following decision variables
    – Ci(v1,.., vN) returns the decision value of pi
    – BGi(k, v) returns the decision value of pi, where pk is the commander, which proposes value v
    – ICi(v1,.., vN)[k] returns the k-th value in the decision vector of pi, where v1,.., vN are the values that the processes propose
  • possibilities to derive solutions from these problem solutions
    – IC from BG
      ◦ run BG N times, once with each pk acting as commander
      ◦ ICi(v1,.., vN)[k] = BGi(k, vk)
    – C from IC
      ◦ run IC to produce a vector of values at each process
      ◦ apply an appropriate function to the vector's values to derive a single value
      ◦ Ci(v1,.., vN) = majority(ICi(v1,.., vN)[1],.., ICi(v1,.., vN)[N])
    – BG from C
      ◦ commander pk sends its proposed value v to itself and each of the remaining processes
      ◦ all processes run C with the values v1,.., vN that they receive
      ◦ BGi(k, v) = Ci(v1,.., vN)
    – termination, agreement and integrity are preserved in each case
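The three reductions can be written as higher-order functions. The sketch below only shows how the definitions compose; it assumes a fault-free setting in which the hypothetical stand-in trivial_bg simply returns the commander's value, and all names are illustrative. Process indices are 0-based here.

```python
from collections import Counter


def majority(values):
    """Strict majority of the values, or None if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None


def trivial_bg(i, k, v):
    """Stand-in BG_i(k, v) for a fault-free run: p_i decides the commander's value."""
    return v


def ic_from_bg(bg, i, proposals):
    """IC_i(v_1..v_N)[k] = BG_i(k, v_k): run BG once per commander p_k."""
    return [bg(i, k, v_k) for k, v_k in enumerate(proposals)]


def c_from_ic(ic, i, proposals):
    """C_i(v_1..v_N) = majority(IC_i(v_1..v_N)[1..N])."""
    return majority(ic(i, proposals))


def bg_from_c(c, i, k, v, n):
    """BG_i(k, v) = C_i(v_1..v_N), where each v_j is the value p_j received from p_k."""
    # Assumption: a correct commander sends the same v to all n processes.
    received = [v] * n
    return c(i, received)


def ic(i, proposals):
    return ic_from_bg(trivial_bg, i, proposals)


def c(i, proposals):
    return c_from_ic(ic, i, proposals)


print(ic(0, ["a", "b", "a"]), c(1, ["a", "b", "a"]), bg_from_c(c, 0, 2, "attack", 4))
```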

Consensus
♦ Relationship of Consensus to Other Problems
  • solving consensus is equivalent to solving reliable, totally ordered (RTO) multicast
    – implementing consensus with RTO-multicast
      ◦ collect all processes in one group g
      ◦ each pi performs RTO-multicast(g, vi)
      ◦ each pi chooses di = mi, where mi is the first value that the RTO-multicast delivers to pi
      ◦ properties
        * termination follows from reliability of multicast
        * agreement and integrity follow from reliability and total ordering
    – implementing RTO-multicast from consensus can be shown as well
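A toy illustration of the first direction, under a strong simplifying assumption: the hypothetical TotalOrderGroup class models RTO-multicast as a single shared log, so every member sees every message in the same order. A real RTO-multicast has to establish that order with its own protocol; only the "decide on the first delivered value" step is shown.

```python
import random


class TotalOrderGroup:
    """Toy stand-in for RTO-multicast: one shared log fixes a single delivery order."""

    def __init__(self, members):
        self.members = list(members)
        self.log = []

    def multicast(self, sender, value):
        self.log.append((sender, value))

    def deliveries(self, member):
        # Reliability and total order: every member sees the whole log, in log order.
        return list(self.log)


def consensus_via_rto(proposals, seed=0):
    """Each p_i multicasts v_i and decides on the first value delivered to it."""
    group = TotalOrderGroup(proposals)
    senders = list(proposals)
    random.Random(seed).shuffle(senders)  # arbitrary but fixed arrival order
    for p in senders:
        group.multicast(p, proposals[p])
    return {p: group.deliveries(p)[0][1] for p in proposals}


decisions = consensus_via_rto({"p1": "x", "p2": "y", "p3": "z"}, seed=3)
print(decisions)
```

Whichever proposal happens to be ordered first, all processes pick it, which is exactly why agreement follows from total ordering.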

Consensus
♦ Consensus in Synchronous Networks
  • assumption: no more than f of the N processes crash
  • algorithm proceeds in f+1 rounds
    – processes B-multicast values between them
    – at the end of f+1 rounds, all surviving processes are in a position to agree



  • Dolev-Strong algorithm
    – Values_i^r: set of proposed values known to process i at the beginning of round r
    – in each round, every process multicasts the values it has not sent in previous rounds
    – then takes delivery of values from other processes
    – round is potentially terminated by timeout
    – at the end of f+1 rounds, each process chooses the minimum value
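The round structure can be simulated in a single process. The sketch below is an illustration under stated assumptions, not the original algorithm text: rounds are lock-step loop iterations, B-multicast is a loop over receivers, and a crash is modelled by the hypothetical crashes parameter, which names the round in which a process crashes and the subset of peers that still receive its message in that round.

```python
def synchronous_consensus(proposals, f, crashes=None):
    """
    proposals: dict pid -> comparable value
    f:         number of crash failures to tolerate (runs f+1 rounds)
    crashes:   dict pid -> (round, set of pids that still get its message that round)
    Returns dict pid -> decision for the surviving processes.
    """
    crashes = crashes or {}
    values = {p: {v} for p, v in proposals.items()}  # Values_i^1 = {v_i}
    sent = {p: set() for p in proposals}             # values already multicast
    alive = set(proposals)
    for r in range(1, f + 2):                        # rounds 1 .. f+1
        inbox = {p: set() for p in proposals}
        for p in list(alive):
            new = values[p] - sent[p]                # only values not sent before
            sent[p] |= new
            if p in crashes and crashes[p][0] == r:
                receivers = crashes[p][1]            # crashes in the middle of the multicast
            else:
                receivers = set(proposals) - {p}
            for q in receivers:
                inbox[q] |= new
        alive -= {p for p, (crash_round, _) in crashes.items() if crash_round == r}
        for p in alive:
            values[p] |= inbox[p]                    # Values_i^(r+1)
    return {p: min(values[p]) for p in alive}


props = {"p1": 1, "p2": 5, "p3": 7}
# p1 crashes in round 1 after reaching only p2.
crash = {"p1": (1, {"p2"})}
print(synchronous_consensus(props, f=1, crashes=crash))
print(synchronous_consensus(props, f=0, crashes=crash))
```

With f = 1 the second round lets p2 forward the value 1 to p3, so both decide 1. Running only one round although a crash occurs (f = 0) leaves p2 and p3 with different sets, which shows why f+1 rounds are needed.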



  • Dolev-Strong algorithm
    – termination: guaranteed through synchrony property of system
    – correctness: will every process arrive at the same set of values at the end of the final round?
      ◦ if proven, integrity and agreement will follow, since processes consistently apply the minimum function to this set



  • Dolev-Strong algorithm
    – proof sketch (all correct processes hold the same set of values at the end of the final round)
      ◦ assume, for contradiction, that two correct processes differ in their final set of values
      ◦ hence, some correct process i possesses a value v that another correct process k (i ≠ k) does not possess
      ◦ the only way to explain this is that some other process m, which sent v to i in the final round, crashed before v could be delivered to k
      ◦ in turn, any process sending v in the previous round must have crashed, otherwise k would have received v in that round
      ◦ continuing backwards, we have to assume at least one crash per round
      ◦ there are f+1 rounds, but at most f crashes, hence contradiction

  • It can be shown that in synchronous systems, any algorithm to reach consensus that tolerates up to f crash or byzantine failures requires at least f+1 rounds

Consensus
♦ Byzantine Generals Problem in Synchronous Network
  • allow arbitrary (byzantine) failures
  • up to f faulty processes
  • correct processes can detect the absence of a message through timeout, but cannot conclude that the sender has crashed, since it may be silent for some time and then start sending messages again
  • assume private communication channels
    – a third process cannot detect whether one process sends messages with different content to two peers
    – no faulty process can inject messages into channels connecting correct processes
  • assume that messages are not digitally signed (i.e., not authenticated and verifiable)
  • general result (Lamport, Shostak and Pease)
    – no solution if N ≤ 3f
    – they give an algorithm for N ≥ 3f+1
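As a quick arithmetic check of the bound N ≥ 3f+1, the following helper (a hypothetical name, for illustration only) computes the largest number of byzantine processes that N processes can tolerate with unsigned messages.

```python
def max_byzantine_faults(n: int) -> int:
    """Largest f with n >= 3f + 1; 0 if no byzantine fault can be tolerated."""
    return max((n - 1) // 3, 0)


def solvable(n: int, f: int) -> bool:
    """True if the bound N >= 3f + 1 holds for n processes and f faulty ones."""
    return n >= 3 * f + 1


for n in (3, 4, 6, 7, 10):
    print(n, max_byzantine_faults(n))
```

Three processes tolerate no byzantine fault at all, four tolerate one, and a second fault needs seven.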

Consensus
♦ Byzantine Generals Problem in Synchronous Network
  • impossibility for N = 3 processes
    – read “3:1:u” as “three says one says u”
    – both scenarios show two rounds of messages
    – left: p3 is faulty; all p2 knows is that it has received two different values
    – right: same situation for p2, even though now the commander is faulty
    – assume a solution existed
      ◦ in the left-hand scenario p2 would have to decide on value v, by the integrity condition of BG
    – no algorithm can distinguish locally at p2 between the two scenarios
      ◦ hence p2 would need to decide on w (the value sent to it by the commander) in the right-hand scenario
    – same reasoning for p3
      ◦ it will have to decide on the value the commander sent to it (x), which is a violation of agreement in the right-hand scenario, hence contradiction

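[Figure: two scenarios with three byzantine generals, commander p1 and lieutenants p2 and p3. Left (p3 faulty): p1 sends 1:v to p2 and to p3; p2 sends 2:1:v to p3; p3 sends 3:1:u to p2. Right (p1 faulty): p1 sends 1:w to p2 and 1:x to p3; p2 sends 2:1:w to p3; p3 sends 3:1:x to p2. Faulty processes are shown shaded. © Addison-Wesley Publishers 2000]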


  • sketch of impossibility for N ≤ 3f (Pease, Shostak and Lamport)
    – assume a solution existed for some N ≤ 3f
    – let three processes p1, p2 and p3 simulate n1, n2 and n3 generals respectively, where n1 + n2 + n3 = N and n1, n2, n3 ≤ N/3
    – assume that one of the three processes is faulty
    – correct processes simulate correct generals
      ◦ internal interaction of “own” generals
      ◦ send messages from “own” generals to those generals simulated by other processes
    – the generals simulated by the faulty process are faulty and may emit spurious messages
    – since n1 + n2 + n3 = N and n1, n2, n3 ≤ N/3 ≤ f, at most f generals are faulty
    – since the algorithm that is run on the generals is correct, the simulation will terminate
    – however, now there is a way for the two correct processes out of three to reach consensus: each process decides on the value chosen by all of its simulated generals
    – contradicts impossibility for N = 3


  • solution for N ≥ 3f+1
    – general solution by Pease, Shostak and Lamport too complex to present here
    – therefore: presentation of solution for N = 4, f = 1
    – correct generals reach agreement in two rounds:
      ◦ first, commander sends value to each lieutenant
      ◦ second, each lieutenant sends the value it received to all its peers
    – each lieutenant receives
      ◦ one value from the commander
      ◦ N-2 values from its peers
    – if the commander is faulty, then all lieutenants are correct, and each will have gathered exactly the set of values that the commander sent out
    – if one lieutenant is faulty, each of its correct peers holds N-2 copies of the value the commander sent out (one received directly, N-3 relayed by correct peers), plus the faulty lieutenant's value
    – to reach agreement, simple majority function suffices
      ◦ since N ≥ 4, N-2 ≥ 2, so the majority function will ignore the value of a faulty lieutenant and produce the value of the commander if the commander is correct
      ◦ if the commander is faulty, all correct lieutenants evaluate majority on the same values and obtain the same result (⊥ if no majority exists)
    – note: BG integrity applies only if the commander is correct; agreement among the correct lieutenants holds in both cases



[Figure: four byzantine generals (commander p1, lieutenants p2, p3, p4), two rounds of messages, and the values each lieutenant gathers.
  (a) p3 faulty: p1 sends 1:v to p2, p3 and p4; p2 relays 2:1:v and p4 relays 4:1:v; p3 sends 3:1:u to p2 and 3:1:w to p4.
      p2: majority({v,u,v}) = v; p4: majority({v,v,w}) = v
  (b) p1 faulty: p1 sends 1:u to p2, 1:w to p3 and 1:v to p4; the lieutenants relay 2:1:u, 3:1:w and 4:1:v.
      p2, p3, p4: majority({u,v,w}) = ⊥
  (c) p1 faulty: p1 sends 1:v to p2, 1:w to p3 and 1:v to p4; the lieutenants relay 2:1:v, 3:1:w and 4:1:v.
      p2: majority({v,w,v}) = v; p3: majority({v,v,w}) = v; p4: majority({w,v,v}) = v]
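The two-round algorithm for N = 4, f = 1 can be replayed in a few lines. This is a sketch under assumptions, not the general Pease, Shostak and Lamport algorithm: messages are plain dictionary entries, None stands for ⊥, majority means strict majority, and the function and parameter names (byzantine_generals_n4, relay, faulty) are hypothetical.

```python
from collections import Counter

BOTTOM = None  # stands for ⊥ (no majority)


def majority(values):
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else BOTTOM


def byzantine_generals_n4(commander_sends, relay=None, faulty=()):
    """
    commander_sends: dict lieutenant -> value the commander sent it in round 1
    relay:           dict (sender, receiver) -> value; what a faulty lieutenant
                     relays in round 2 instead of the value it received
    faulty:          faulty lieutenants (their decisions are not reported)
    """
    relay = relay or {}
    lieutenants = list(commander_sends)
    decisions = {}
    for p in lieutenants:
        if p in faulty:
            continue
        gathered = [commander_sends[p]]  # round 1: value from the commander
        for q in lieutenants:            # round 2: values relayed by peers
            if q != p:
                gathered.append(relay.get((q, p), commander_sends[q]))
        decisions[p] = majority(gathered)
    return decisions


# Faulty lieutenant p3 tells p2 "u" and p4 "w"; the commander is correct.
a = byzantine_generals_n4({"p2": "v", "p3": "v", "p4": "v"},
                          relay={("p3", "p2"): "u", ("p3", "p4"): "w"},
                          faulty=("p3",))
# Faulty commander sends three different values.
b = byzantine_generals_n4({"p2": "u", "p3": "w", "p4": "v"})
# Faulty commander sends w to p3 only.
c = byzantine_generals_n4({"p2": "v", "p3": "w", "p4": "v"})
print(a, b, c)
```

In every run the correct lieutenants end up with the same decision, and with a correct commander that decision is the commander's value.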

Consensus
♦ Impossibility of Agreement in Asynchronous Systems
  • previous algorithms: synchrony assumption
    – message exchanges in rounds
    – timeouts
  • in asynchronous systems, no algorithm can guarantee reaching consensus, even with just one process crash failure (Fischer, Lynch and Paterson, 1985)
    – proof idea
      ◦ show that there is always some continuation of the processes' execution that avoids consensus being reached
  • consequences
    – in asynchronous systems, no guaranteed solution to BG, IC, RTO-multicast
  • of course, in practice consensus can often be reached, but a residual probability that consensus cannot be reached remains
  • possible approaches to reaching consensus by weakening system assumptions
    – partial synchrony
    – masking faults
    – modified failure detectors
    – randomized algorithms


  • partial synchrony
    – message delays are bounded, but the bound is unknown
    – or: the bound is known, but transmission delays may exceed it during some finite initial period of time
  • masking faults
    – design system so that failures appear like intermittent slowdown in processing of messages
      ◦ store system state on persistent storage so that it survives a crash
      ◦ restart system in that state after recovery


  • modified failure detectors
    – in ISIS system (Birman, 1993)
      ◦ deem process that has not responded as failed
      ◦ treat this process as fail-silent, i.e., discard any subsequent messages from this process
      ◦ problems:
        * long timeouts necessary
        * false suspicions of correct but slow processes are possible and reduce effectiveness of system
    – eventually weak failure detector (Chandra and Toueg, 1996)
      ◦ consensus can be solved, even with a weak failure detector, if fewer than N/2 processes crash and communication is reliable
      ◦ eventually weak failure detector
        * eventually weakly complete: each faulty process is eventually suspected permanently by some correct process
        * eventually weakly accurate: after some time, at least one correct process is never suspected by any correct process
      ◦ an eventually weak failure detector cannot be implemented in an asynchronous system based on message passing alone; however, failure detectors that adapt their timeout values can come close to “ewfd”s
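One way to picture a failure detector that adapts its timeout values is the heartbeat sketch below. It is an illustration only and not the Chandra and Toueg construction: the class name, the doubling back-off and the explicit now arguments (used instead of a wall clock so that the run is deterministic) are all assumptions.

```python
class AdaptiveTimeoutDetector:
    """Suspects a peer when no heartbeat arrives within its current timeout.
    After a false suspicion, the timeout for that peer is increased."""

    def __init__(self, peers, initial_timeout=1.0, backoff=2.0):
        self.timeout = {p: initial_timeout for p in peers}
        self.last_heard = {p: 0.0 for p in peers}
        self.suspected = set()
        self.backoff = backoff

    def heartbeat(self, peer, now):
        """Record a heartbeat from peer received at time now."""
        self.last_heard[peer] = now
        if peer in self.suspected:
            # The peer was only slow: withdraw the suspicion and be more patient.
            self.suspected.discard(peer)
            self.timeout[peer] *= self.backoff

    def check(self, now):
        """Return the set of peers currently suspected at time now."""
        for peer, heard in self.last_heard.items():
            if now - heard > self.timeout[peer]:
                self.suspected.add(peer)
        return set(self.suspected)


d = AdaptiveTimeoutDetector(["p1", "p2"], initial_timeout=1.0)
d.heartbeat("p1", 0.5)
d.heartbeat("p2", 0.5)
first = d.check(1.0)    # nobody is late yet
d.heartbeat("p1", 1.4)
second = d.check(2.0)   # p2 has been silent for 1.5 > 1.0
d.heartbeat("p2", 2.1)  # p2 was merely slow; its timeout doubles to 2.0
d.heartbeat("p1", 3.5)
third = d.check(3.9)    # p2 silent for 1.8, within its new timeout
print(first, second, third)
```

Each false suspicion makes the detector more tolerant of that peer, so a slow but correct process is suspected less and less often, while a crashed process stays suspected once its silence exceeds any finite timeout.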
