
Page 1: Ch12 (continued) Replicated Data Management

Ch12 (continued) Replicated Data Management

Page 2: Ch12 (continued) Replicated Data Management

Outline
• Serializability
  - Quorum-based protocol
  - Replicated Servers
  - Dynamic Quorum Changes
  - One-copy serializability
  - View-based quorums
• Atomic Multicast
  - Virtual synchrony
  - Ordered Multicast
  - Reliable and Causal Multicast
  - Trans Algorithm
  - Group Membership
  - Transis Algorithm
• Update Propagation

Page 3: Ch12 (continued) Replicated Data Management

Data replication

Data replication: why?
• Make data available in spite of the failure of some processors
• Enable transactions (user-defined actions) to complete successfully even if some failures occur in the system, i.e., actions are resilient to the failure of some nodes

Problems to be solved: consistency, management of replicas

Page 4: Ch12 (continued) Replicated Data Management

Data replication Intuitive representation of a replicated data system

[Figure: transactions issue logical operations on a logical data item d; the underlying system maps each logical operation onto operations on multiple replicas of d.]

Transactions see logical data. The underlying system maps each operation on the logical data to operations on multiple copies.

Page 5: Ch12 (continued) Replicated Data Management

Data replication Correctness criteria of the underlying system

To be correct, the mapping performed by the underlying system must ensure one-copy serializability. One-copy serializability property: the concurrent execution of transactions on replicated data should be equivalent to some serial execution of the transactions on non-replicated data.

Page 6: Ch12 (continued) Replicated Data Management

Data replication Quorum-based protocol

Ensure that any pair of conflicting accesses to the same data item access overlapping sites

Here we discuss read/write quorums. A data item d is stored at every processor p in P(d).

Every processor p in P(d) has a vote weight v_p(d). R(d): read threshold; W(d): write threshold. Read quorum of d: a subset P' of P(d) such that Σ_{p ∈ P'} v_p(d) ≥ R(d)

Page 7: Ch12 (continued) Replicated Data Management

Data replication Quorum-based protocol

Write quorum of d: a subset P' of P(d) such that Σ_{p ∈ P'} v_p(d) ≥ W(d)

The total number of votes for d: V(d) = Σ_{p ∈ P(d)} v_p(d)

Quorums must satisfy the following two conditions:
Condition 1: R(d) + W(d) > V(d). Intuitively, every read quorum of d intersects every write quorum of d. Hence:
• a read and a write cannot be performed concurrently on d
• every read quorum can access a copy that reflects the latest update

Page 8: Ch12 (continued) Replicated Data Management

Data replication Quorum-based protocol

Condition 2: 2*W(d) > V(d). Intuitively, any two write quorums of d intersect. Hence:
• a write can be performed in at most one group (partition)
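As a quick aside (not from the slides), these two conditions are easy to check for a given vote assignment; the sketch below is a minimal illustration with hypothetical names.

```python
# Hypothetical helper: check the static quorum conditions for a data item d.
# `votes` maps each processor in P(d) to its vote weight v_p(d).
def check_quorum_conditions(votes, R, W):
    V = sum(votes.values())     # V(d): total number of votes for d
    cond1 = R + W > V           # condition 1: read and write quorums intersect
    cond2 = 2 * W > V           # condition 2: any two write quorums intersect
    return cond1 and cond2

# Example: five replicas with one vote each; R=2, W=4 satisfies both conditions,
# while R=2, W=3 violates condition 1 (R + W = V).
assert check_quorum_conditions({p: 1 for p in "ABCDE"}, R=2, W=4)
assert not check_quorum_conditions({p: 1 for p in "ABCDE"}, R=2, W=3)
```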

How do read and write operations work?

Page 9: Ch12 (continued) Replicated Data Management

Data replication Quorum-based protocol

Read operation: Each replica of d has a version number; n_p(d) is the version number of the replica at processor p, and initially n_p(d) is zero. When a transaction T wants to read d, the following steps are performed:
1. Broadcast a request for votes to P(d) (a remote processor q replies by sending a message with n_q(d) and v_q(d))
2. Collect replies and construct P' until Σ_{p ∈ P'} v_p(d) ≥ R(d)
3. Lock all copies in P'
4. Read the replica d_p, p in P', with the highest version number
5. Unlock the copies of d

Page 10: Ch12 (continued) Replicated Data Management

Data replication Quorum-based protocol

Write operation: Each replica of d has a version number; n_p(d) is the version number of the replica at processor p, and initially n_p(d) is zero. When a transaction T wants to write d, the following steps are performed:
1. Broadcast a request for votes to P(d) (a remote processor q replies by sending a message with n_q(d) and v_q(d))
2. Collect replies and construct P' until Σ_{p ∈ P'} v_p(d) ≥ W(d)
3. Lock all copies in P'
4. Compute the new value d' of d
5. Let max_n(d) be the highest version number read in step 2; for all p in P', write d' to d_p with n_p(d') = max_n(d) + 1

6. Unlock the copies of d
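A compact sketch of the two procedures above, under simplifying assumptions: the replicas live in memory, locking and failures are ignored, and the quorum is built from whatever replicas answer first. The class and function names are illustrative, not part of the protocol.

```python
# Illustrative model of quorum reads and writes with version numbers
# (no real messaging, locking, or failure handling).
class Replica:
    def __init__(self, vote=1):
        self.value, self.version, self.vote = None, 0, vote   # n_p(d) starts at zero

def collect_quorum(replicas, threshold):
    # Steps 1-3: gather replicas until the collected vote weight reaches the threshold.
    quorum, votes = [], 0
    for p in replicas.values():
        quorum.append(p)
        votes += p.vote
        if votes >= threshold:
            return quorum
    raise RuntimeError("not enough votes available")

def quorum_read(replicas, R):
    quorum = collect_quorum(replicas, R)
    return max(quorum, key=lambda p: p.version).value   # step 4: highest version wins

def quorum_write(replicas, W, new_value):
    quorum = collect_quorum(replicas, W)
    new_version = max(p.version for p in quorum) + 1    # step 5: max_n(d) + 1
    for p in quorum:
        p.value, p.version = new_value, new_version

# Example with five single-vote replicas, R=2, W=4: the read quorum necessarily
# overlaps the write quorum and returns the latest value.
replicas = {name: Replica() for name in "ABCDE"}
quorum_write(replicas, 4, "x=1")
print(quorum_read(replicas, 2))   # -> x=1
```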

Page 11: Ch12 (continued) Replicated Data Management

Data replication Replicated Servers

[Figure: clients issue logical requests to a logical server S; the underlying system maps each logical request onto requests to multiple replicas of S and returns a reply.]

Clients see logical servers. The underlying system maps each request on the logical server to requests on multiple copies.

Page 12: Ch12 (continued) Replicated Data Management

Data replication Replicated Servers

In the context of replicated data, one might consider that the system consists of servers and clients

Servers: processors having a copy of the data item. Clients: processors requesting operations on the data item.

Some approaches for replicating servers: Active replication Primary site approach

Page 13: Ch12 (continued) Replicated Data Management

Data replication Replicated Servers

A copy of S is at every processor p in P(S)

Active replication: All the replicas are simultaneously active All replicas are equivalent

When a client C requests a service from S, C contacts any one of the replicas Sp, p in P(S); Sp acts as the coordinator for the transaction.

To be fault tolerant, the client must contact all the replicas (plus other restrictions, e.g., the same set of requests and the same order of requests at all replicas).

In general, suitable for processor failures

Page 14: Ch12 (continued) Replicated Data Management

Data replication Replicated Servers

Primary site approach: One replica is the primary copy: coordinator for all transactions

All other replicas are backups (passive in general)

When a client C requests a service from S, C contacts the primary copy. If the primary fails, a new primary is elected.

If a network partitioning occurs, only the partition having the primary can be serviced

Page 15: Ch12 (continued) Replicated Data Management

Data replication Replicated Servers

Primary site approach:
Read operation: if the requested operation is a read, the primary performs the operation and sends the result to the requester.
Write operation: if the requested operation is a write, the primary server makes sure that all the backups maintain the most recent value of the data item. The primary processor might periodically checkpoint the state of the data item on the backups to reduce the computation overhead at the backups.
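A rough sketch of the primary-site behaviour described above, assuming a synchronous update of the backups; election of a new primary and periodic checkpointing are omitted, and all names are illustrative.

```python
# Simplified primary-site replication: the primary answers reads by itself and
# propagates every write to the passive backups before acknowledging it.
class BackupServer:
    def __init__(self):
        self.value = None

class PrimaryServer:
    def __init__(self, backups):
        self.value = None
        self.backups = backups

    def read(self):
        return self.value                 # reads are served by the primary alone

    def write(self, new_value):
        self.value = new_value
        for b in self.backups:            # keep every backup up to date
            b.value = new_value
        return "ok"

backups = [BackupServer(), BackupServer()]
primary = PrimaryServer(backups)
primary.write(42)
assert primary.read() == 42 and all(b.value == 42 for b in backups)
```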

Page 16: Ch12 (continued) Replicated Data Management

Data replication Dynamic Quorum changes

The quorum-based protocol we have seen is a static method: a single processor failure can make a data item unavailable.

If the system breaks into small groups, it might be the case that no group can perform a write operation.

The dynamic quorum change algorithm avoids this (within certain limits). Idea: for a data item d, quorums are defined on the set of alive replicas of d. This introduces the notion of a view; each transaction executes in a single view, and views are changed sequentially.

Page 17: Ch12 (continued) Replicated Data Management

Data replication Dynamic Quorum changes

d: a data item. P(d): the processors at which a copy of d is stored. Some processors in P(d) can fail.

View: we can regard a view of d as consisting of:
• the alive processors of P(d): AR(d)
• a read quorum defined on AR(d)
• a write quorum defined on AR(d)
• a unique name v(d) (view names are assumed to be totally ordered)

Page 18: Ch12 (continued) Replicated Data Management

Data replication Dynamic Quorum changes

For a transaction Ti, v(Ti) denotes the view in which Ti executes

The idea behind view-based quorums is to ensure that if v(Ti) < v(Tj), then Ti comes before Tj in an equivalent serial execution.

Problem: ensure serializability within a view and serializability between views. New conditions are necessary for quorums to satisfy these requirements.

Page 19: Ch12 (continued) Replicated Data Management

Data replication Dynamic Quorum changes

New conditions for quorums
• d: data item
• v: a view of d
• P(d,v): the alive processors that store d in view v
• |P(d,v)| = n(d,v)
• R(d,v): read threshold for d in view v
• W(d,v): write threshold for d in view v
• Ar(d): read accessibility threshold for d in all views (availability: d can be read in a view v as long as there are at least Ar(d) alive processors in view v)
• Aw(d): write accessibility threshold for d in all views (availability)

Page 20: Ch12 (continued) Replicated Data Management

Data replication Dynamic Quorum changes

New conditions for quorums (cont.). The thresholds must satisfy the following conditions:
DQC1. R(d,v) + W(d,v) > n(d,v) /* in a view, read and write quorums intersect */

DQC2. 2*W(d,v) > n(d,v) /* in a view, write quorums intersect; the nodes participating in an update form a majority of the view */

DQC3. Ar(d) + Aw(d) > |P(d)| /* read accessibility and write accessibility intersect across all views */

DQC4. Aw(d) ≤ W(d,v) ≤ n(d,v) /* ensures consistency of views (we'll see later) */

DQC5. 1 ≤ R(d,v) ≤ n(d,v) /* the minimum size of a read quorum is 1 */
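A small, hypothetical checker for the five conditions, useful for sanity-testing a choice of thresholds (the numbers in the example are made up and are not the example used later in the slides):

```python
# Check DQC1-DQC5 for one view v of a data item d.
def check_view_quorums(n_dv, R_dv, W_dv, Ar, Aw, P_d):
    return (R_dv + W_dv > n_dv           # DQC1: read and write quorums intersect in v
            and 2 * W_dv > n_dv          # DQC2: write quorums intersect in v
            and Ar + Aw > P_d            # DQC3: accessibility thresholds intersect over P(d)
            and Aw <= W_dv <= n_dv       # DQC4: consistency of view changes
            and 1 <= R_dv <= n_dv)       # DQC5: a read quorum has at least one member

# Hypothetical example: |P(d)| = 5, Ar(d)=3, Aw(d)=3.
assert check_view_quorums(n_dv=5, R_dv=2, W_dv=4, Ar=3, Aw=3, P_d=5)       # valid view
assert not check_view_quorums(n_dv=5, R_dv=1, W_dv=3, Ar=3, Aw=3, P_d=5)   # violates DQC1
```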

Page 21: Ch12 (continued) Replicated Data Management

Data replication Dynamic Quorum changes

Restrictions on read and write operations: a data item d can be
• read in view v only if n(d,v) ≥ Ar(d), i.e., the number of alive replicas of d must be greater than or equal to the read accessibility threshold
• written in view v only if n(d,v) ≥ Aw(d), i.e., the number of alive replicas of d must be greater than or equal to the write accessibility threshold
These restrictions are imposed to ensure consistent changes of quorums.

Page 22: Ch12 (continued) Replicated Data Management

Data replication Dynamic Quorum changes

How read and write operations work: Similar to the static quorum-based protocol except that:

Only processors in P(d,v) are contacted for votes (hence, for constructing the quorum i.e. P’)

The version number of each replica becomes : (view_number, in_view_sequence_number)

If a processor p receives a request from a transaction Ti and v(Ti) is not the view p has for d, then p rejects the request

Page 23: Ch12 (continued) Replicated Data Management

Data replication Dynamic Quorum changes

Installation of new view: We have claimed that views are changed sequentially

How is this achieved?

A processor p in P(d) can initiate an update of the view for d because it is recovering, because a member of the view has failed, or because its version number for d is not current.

Page 24: Ch12 (continued) Replicated Data Management

Data replication Dynamic Quorum changes

Installation of a new view (cont.). The idea: assume that processor p is the one that wants to change the view.
1. p determines whether the view it belongs to (the set of nodes with which p can communicate) satisfies the conditions for quorums (n(d,v) ≥ Ar(d), n(d,v) ≥ Aw(d), …); if this is not the case, p cannot change the view
2. p reads all copies of d in P(d,v)
3. p takes as the new copy the replica with the highest version number
4. p increments the view number
5. p broadcasts the latest version to all members of P(d,v)
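A sketch of these installation steps, assuming a simple in-memory model in which each replica holds a value, a version number and a view number; real implementations exchange messages and must cope with concurrent view changes.

```python
# Illustrative view installation for data item d at the initiating processor p.
# `reachable` is the set of replicas p can currently communicate with.
def install_new_view(reachable, Ar, Aw, current_view):
    n = len(reachable)
    if n < Ar or n < Aw:                                   # step 1: accessibility check
        return None                                        # p cannot change the view
    latest = max(reachable, key=lambda r: r['version'])    # steps 2-3: newest copy
    new_view = current_view + 1                            # step 4
    for r in reachable:                                    # step 5: broadcast latest version
        r['value'], r['version'], r['view'] = latest['value'], latest['version'], new_view
    return new_view

replicas = [{'value': f'x={i}', 'version': (0, i), 'view': 0} for i in range(4)]
print(install_new_view(replicas, Ar=2, Aw=3, current_view=0))   # -> 1
```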

Page 25: Ch12 (continued) Replicated Data Management

Data replication Dynamic Quorum changes

Installation of a new view (cont.): let v be the old view and v' the view after the change. We have
W(d,v) ≥ Aw(d), n(d,v') ≥ Ar(d), and Ar(d) + Aw(d) > |P(d)|,

which implies that W(d,v) + n(d,v') > |P(d)|. That is, every write quorum of view v overlaps the set of replicas of view v' when changing to v': a "consistent" change of view.

Page 26: Ch12 (continued) Replicated Data Management

Data replication Dynamic Quorum changes

View changes handle network partitions: assume a data item d is replicated at five processors A, B, C, D, E, with Ar(d)=2 and Aw(d)=3. Initial view 0: P(d,0) = {A,B,C,D,E}, W(d,0)=5, R(d,0)=1.

Assume that the system partitions: A B C D || E /* node E cannot communicate with the others */. If an update request for d arrives at any processor and the view is not updated, the operation cannot be performed.

Page 27: Ch12 (continued) Replicated Data Management

Data replication View changes handle network partitions (cont.): assume a data item d is replicated at five processors A, B, C, D, E with Ar(d)=2, Aw(d)=3. Let view 1 be: P(d,1) = {A,B,C,D}, W(d,1)=4, R(d,1)=1. In this view, partition {E} can still read d but cannot update d,

while partition {A,B,C,D} can read and write d. Assume that D fails, giving partitions {E} and {A,B,C}. d can be read by both partitions; to enable write operations, the view must be updated, e.g., P(d,2) = {A,B,C}, W(d,2)=3, R(d,2)=1.

Page 28: Ch12 (continued) Replicated Data Management

Data replication View change illustrated:

[Figure: view change illustrated. Transaction T1 reads and writes d at A, B, C, D, E in View 0 (W(d,0)=5, R(d,0)=1); T3 writes d and T2 reads d in View 1 (W(d,1)=4, R(d,1)=1); T4 reads and writes d in View 2 (W(d,2), R(d,2)=1).]

The quorum-based algorithm serializes T2 before T3.

Is there any notion of majority behind view changes?

Page 29: Ch12 (continued) Replicated Data Management

Outline
• Serializability
  - Quorum-based protocol
  - Replicated Servers
  - Dynamic Quorum Changes
  - One-copy serializability
  - View-based quorums
• Atomic Multicast
  - Virtual synchrony
  - Ordered Multicast
  - Reliable and Causal Multicast
  - Trans Algorithm
  - Group Membership
  - Transis Algorithm
• Update Propagation

Page 30: Ch12 (continued) Replicated Data Management

Atomic Multicast

In many situations, a one-to-many form of communication is useful, e.g., for maintaining replicated servers, etc.

Two forms of one-to-many communication are possible: Broadcast and Multicast

Broadcast: the sender sends to all the nodes in the system

Multicast: the sender sends to a subset L of the nodes in the system

we are interested in multicast and we assume the sender sends a message m

Page 31: Ch12 (continued) Replicated Data Management

Atomic Multicast A naïve algorithm for Multicast: for each processor p in L, send m to p

Problem: the sender fails after it has sent m to only some of the processors.

Some members of the list L receive m while others do not. This is not acceptable in fault-tolerant systems. Multicast must be reliable: if one processor in L receives m, every alive processor in L must receive m.

Page 32: Ch12 (continued) Replicated Data Management

Atomic Multicast The naïve algorithm for Multicast + 2PC technique The 2PC technique can improve reliability of the naïve algorithm

Idea: regard a multicast as a transaction ("all-or-nothing" property) and distinguish the delivery of a message from the reception of that message.

[Figure: a message m is received by the communication layer at each node; delivering m hands it up to the application (APP), which is a separate event from receiving it.]

Page 33: Ch12 (continued) Replicated Data Management

Atomic Multicast The naïve algorithm for Multicast + 2PC technique The 2PC technique can improve reliability of the naïve algorithm

Idea(cont.): Rule for delivery: deliver a message only when you know that the message will be delivered everywhere

Algorithm for the sender: 1. Send m to every processor in L. 2. When you have received all acknowledgements, deliver m locally and tell every processor in L that it can deliver m.
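A sketch of this sender-side rule; the transport layer (send and recv_ack) and the participant side are stubbed out as assumptions, and blocking on a failed destination is not handled, which is exactly the weakness discussed on the next slide.

```python
# 2PC-style reliable multicast, sender side (sketch). `channels` maps each
# destination in L to an object offering send() and recv_ack().
def multicast_2pc(m, channels, deliver):
    for ch in channels.values():       # 1. send m to every processor in L
        ch.send(("MSG", m))
    for ch in channels.values():       # wait for all acknowledgements
        ch.recv_ack()                  # blocks forever if a destination has failed
    deliver(m)                         # 2. safe to deliver locally ...
    for ch in channels.values():       # ... and tell everyone it can deliver m
        ch.send(("DELIVER", m))
```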

Page 34: Ch12 (continued) Replicated Data Management

Atomic Multicast The naïve algorithm for Multicast + 2PC technique. This technique might require a significant amount of work when a failed processor recovers. In addition, it inherits the vulnerability (blocking) of 2PC.

The main difficulty comes from the correctness criterion: how can a processor determine which nodes in L are up? Virtual synchrony takes this into account.

Page 35: Ch12 (continued) Replicated Data Management

Atomic Multicast Virtual synchrony Accounts for the fact that it is difficult to determine exactly which are the non-failed processors

Processors are organized into groups that cooperate to perform a reliable multicast

Each group corresponds to a multicast list: multicast in a group

Group view : The current list of processors to receive a multicast message (+ some global properties)

Consistency of group view: common view on the members

Page 36: Ch12 (continued) Replicated Data Management

Atomic Multicast Virtual synchrony. An algorithm for performing reliable multicast is virtually synchronous if:
1. In any consistent state, there is a unique group view on which all members of the group agree

2. If a message m is multicast in group view v before view change c, then either:
2.1. no processor in v participating in c ever receives m, or
2.2. all processors in v participating in c receive m before performing c

Page 37: Ch12 (continued) Replicated Data Management

Atomic Multicast Virtual synchrony illustrated

[Figure: two executions of a view change c in which {A, B, C} participate. Case 2.1: no processor participating in c receives m before the change (each delivers {}). Case 2.2: every processor participating in c receives m before performing c (each delivers {m}).]

Page 38: Ch12 (continued) Replicated Data Management

Atomic Multicast Virtual synchrony View changes can be considered as checkpoints

Delivery list in virtual synchrony: between two consecutive "checkpoints" v and v', a set G of messages is multicast. A sender of a message in G must be in v. Hence, if p is removed from the view, the remaining processors can consider that p has failed.

There is a guarantee that, from v' on, no message from p will be delivered.

Page 39: Ch12 (continued) Replicated Data Management

Atomic Multicast Ordered multicasts One might want a multicast that satisfies a specific order: e.g. Causal order, total order

Causal order (for causal multicast): if processor p receives m1 and then multicasts m2, then every processor that receives {m1, m2} should receive m1 before m2.

Total order (for atomic multicast): if p receives m1 before m2, then every processor that receives {m1, m2} should receive m1 before m2 (i.e., the same order of reception everywhere).
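The Trans algorithm presented later enforces causal order through its acknowledgement DAG. For comparison only, here is the classic vector-timestamp delivery test for causal multicast, sketched with illustrative names; it is not the mechanism used in the rest of the chapter.

```python
# A message from `sender` stamped with vector clock V_m is causally deliverable at a
# process whose clock is V_i iff it is the next message from the sender and every
# message it depends on has already been delivered locally.
def deliverable(V_m, V_i, sender):
    return (V_m[sender] == V_i[sender] + 1 and
            all(V_m[k] <= V_i[k] for k in range(len(V_m)) if k != sender))

V_i = [0, 0, 0]                                # nothing delivered yet at this process
print(deliverable([1, 0, 0], V_i, sender=0))   # True:  m1 from p0 can be delivered
print(deliverable([1, 1, 0], V_i, sender=1))   # False: m2 depends on m1, delay it
```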

Page 40: Ch12 (continued) Replicated Data Management

Atomic Multicast Why causal multicast? Assume that the data item x is replicated and consider the following scenario:

[Figure: one processor multicasts m1: "set x to zero"; after receiving m1, another processor multicasts m2: "increment x by 1"; both messages reach the replicas at p, q and r.]

m1 must be delivered before m2 everywhere. Otherwise, inconsistency!

Page 41: Ch12 (continued) Replicated Data Management

Atomic Multicast Why total order for multicast? Assume that p sends m and, after that, p crashes. Then, by some mechanism, q and r are informed about the crash of p, but q receives m before crash(p) while r receives crash(p) before m.

[Figure: p multicasts m and then crashes; q receives m and then crash(p); r receives crash(p) and then m.]

Total order is necessary; otherwise, q and r might take different decisions.

Page 42: Ch12 (continued) Replicated Data Management

Atomic Multicast Why total order for multicast (cont.)? Assume that the data item x is a replicated queue and consider the following scenario:

[Figure: one processor multicasts m1: "insert a into x"; another multicasts m2: "delete a from x"; both messages reach the replicas at p, q and r.]

m1 and m2 must be delivered in the same order everywhere. Otherwise, inconsistency!

Page 43: Ch12 (continued) Replicated Data Management

Atomic Multicast The Trans algorithm

Executes between two view changes, exploiting the guarantee provided by virtual synchrony.

Hence, the algorithm works within one view.

Mechanisms: a combination of positive and negative acknowledgements for reliability. Piggybacking acknowledgements on the messages being multicast simplifies the detection of missed messages and minimizes the need for explicit acknowledgements.

Page 44: Ch12 (continued) Replicated Data Management

Atomic Multicast

By piggybacking positive and negative acknowledgements, when a processor p receives a multicast message, p learns: which messages it does not need to acknowledge, and which messages it has missed and must request a retransmission of.

Page 45: Ch12 (continued) Replicated Data Management

Atomic Multicast The idea behind Trans is illustrated by the following scenario. Let L = [P, Q, R] be a delivery list for multicast.

Step 1. P multicasts m1. Step 2. Q receives m1 and piggybacks a positive acknowledgement on the next message m2 that it multicasts (we write m2:ack(m1) to mean that m2 contains an ack for m1). Step 3. R receives m2 (i.e., m2:ack(m1)).

Two cases arise for R upon receipt of m2: Case 1: if R had received m1, it realizes that it does not need to send an acknowledgement for m1, as Q has already acknowledged it.

Case 2: if R had not received m1, then R learns (because of the ack(m1) attached to m2) that m1 is missing; R then requests a retransmission of m1 by attaching a negative acknowledgement for m1 to the next message it multicasts.

Page 46: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: an invariant The protocol maintains the following invariant

A processor p multicasts an acknowledgement of message m only if processor p has received m and all messages that causally precede m.

If you acknowledge m, you do not need to acknowledge the unacknowledged messages that causally precede m: they are acknowledged implicitly.

Page 47: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: stable messages. A message is said to be stable if it has reached all the processors in the group view.

This is detectable because each receiver of a message multicasts acknowledgements.

Some assumptions:

All messages are assumed to be uniquely identified (processor_id , message_seq_number)

Each sender sequentially numbers its messages

A virtual synchrony layer is assumed

Page 48: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: Variables used Each processor maintains:

ack_list : the list of identifiers of messages for which that node has to send a positive acknowledgement

nack_list : the list of identifiers of messages for which that node has to send a negative acknowledgement

G: the causal DAG; it contains all messages that the processor has received but that are not yet stable. (m, m') is in G if message m acknowledges message m'.

Page 49: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: retransmission Using information given by the local DAG, a processor can determine which messages it should have received

For such a message, a negative acknowledgement is multicast to request a retransmission

Page 50: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: variables (cont.)

m : message container (serves as id of message here) m.message : application-level message (to be delivered at the app.) m.nacks : list of negative acknowledgments m.acks : list of positive acknowledgments

L : destinations list (maintained by an underlying algorithm)

Page 51: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: Causal DAG functions used

add_to_DAG(m,G) : insert m into G

not_duplicate(m,G) : True if m has never been received before

causal(m,G) : True if all messages that m causally follows have been received

stable(m,G) : True if all (alive) processors have acknowledged m

Page 52: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: Sending a message

Trans_send(message):
  create a container m;
  m.message := message;
  m.acks := ack_list;    /* attach positive acknowledgements to m */
  m.nacks := nack_list;  /* attach negative acknowledgements to m */
  put m in ack_list;
  add_to_DAG(m, G);
  send m to every processor in L

Page 53: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: Receiving a message

Trans_receive(m):
  for every nack(x) in m.nacks do              /* note: for a retransmission, m.nacks is empty */
    if x is in G then multicast x;
  if not_duplicate(m, G) then                  /* m has never been received before */
    for every ack(m') in m.acks do             /* update of nack_list */
      if not_duplicate(m', G) then add m' to nack_list;
    if m is in nack_list then                  /* m is a retransmission */
      remove m from nack_list;
    add m to undelivered_list;
    remove m.nacks from m;
    add_to_DAG(m, G);
    while there is m' in undelivered_list such that causal(m', G) do
      remove m' from undelivered_list;
      deliver m'.message to the application;

  compute ack_list to be all m in G such that causal(m, G) and there is no m' such that causal(m', G) and m' acknowledges m;
  for every m' in G do
    if stable(m', G) then remove m' from G and reclaim the buffer
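Below is a condensed, executable sketch of the bookkeeping that Trans_send and Trans_receive describe, with networking, retransmission timers and view changes left out; the container layout, the causal() test and the ack_list recomputation are simplifications of the pseudocode above, not a faithful implementation.

```python
# Skeleton of the Trans bookkeeping at one processor (no networking; garbage
# collection of stable messages is omitted).
class Container:
    def __init__(self, mid, message, acks, nacks):
        self.id, self.message = mid, message
        self.acks, self.nacks = list(acks), list(nacks)

class TransNode:
    def __init__(self, pid, send_fn):
        self.pid, self.seq, self.send_fn = pid, 0, send_fn
        self.ack_list, self.nack_list = [], []   # ids to ack / to nack in the next message
        self.G = {}                              # causal DAG: message id -> container
        self.undelivered, self.delivered = [], []

    def trans_send(self, message):
        self.seq += 1
        m = Container((self.pid, self.seq), message, self.ack_list, self.nack_list)
        self.ack_list, self.nack_list = [m.id], []   # m now stands for everything it acks
        self.G[m.id] = m                             # add_to_DAG(m, G)
        self.send_fn(m)                              # send m to every processor in L

    def causal(self, m):
        # every message that m acknowledges has been received
        return all(a in self.G for a in m.acks)

    def trans_receive(self, m):
        for x in m.nacks:                        # retransmit messages the sender missed
            if x in self.G:
                self.send_fn(self.G[x])
        if m.id in self.G:                       # duplicate: ignore
            return
        for a in m.acks:                         # learn about messages we have missed
            if a not in self.G and a not in self.nack_list:
                self.nack_list.append(a)
        if m.id in self.nack_list:               # m is a retransmission we asked for
            self.nack_list.remove(m.id)
        self.G[m.id] = m
        self.undelivered.append(m)
        progress = True
        while progress:                          # deliver everything that is now causal
            progress = False
            for u in list(self.undelivered):
                if self.causal(u):
                    self.undelivered.remove(u)
                    self.delivered.append(u.message)
                    progress = True
        # ack_list: causal messages not yet acknowledged by another causal message
        acked = {a for c in self.G.values() if self.causal(c) for a in c.acks}
        self.ack_list = [i for i, c in self.G.items() if self.causal(c) and i not in acked]
```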

Page 54: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: the Causal DAG G : the causal DAG contains all messages that the processor has received but that are not yet stable (m, m’) is in G if message m acknowledges message m’

[Figure: a DAG fragment with nodes b1, c1 and d1.] The message (c1, [ack(b1), nack(d1)]) represents a piece of the DAG: processor C, which sends c1, acknowledges message b1 but also requests a retransmission of message d1.

Page 55: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: retransmission

Can a retransmitted message be different from the original message? The retransmitted message must contain all the positive acknowledgements that the original message had;

the list of negative acknowledgements is not useful in the retransmission.

Page 56: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: illustrated

[Figure: processors A, B, C, D.]
(a, []): A multicasts a; D does not get it.
(b, [ack(a)]): B multicasts b. D receives b and learns about the unreceived message a. All nodes know that B has got a.

Page 57: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: illustrated

[Figure: processors A, B, C, D (scenario continued).]
(c, [ack(b)]): C multicasts c and only acknowledges b (an implicit acknowledgement for a). All nodes know that C has got b and a (there is no nack(a) in the message C broadcasts).

Page 58: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: illustrated

[Figure: processors A, B, C, D (scenario continued).]
(d, [nack(a)]): D multicasts d; D's message carries nack(a), a request for the retransmission of a. A does not get the message from D. D must piggyback ack(c) on its next multicast.

Page 59: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: illustrated

[Figure: processors A, B, C, D (same step, detailed).]
(d, [nack(a)]): D multicasts d with a nack for a, requesting a retransmission. A does not get the message from D. D must piggyback only ack(c) on its next multicast, i.e., an implicit ack for a and b.

Page 60: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: illustrated

[Figure: processors A, B, C, D (scenario continued).]
(a, []): C "re-multicasts" a with no attached acks. Note: the representation of a at D changes. In their next messages, C and B acknowledge d.

Page 61: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: Properties If a message is received by a processor that does not fail, eventually every non-failed processor receives that message

Messages are delivered in causal order

G forms a tight description of the causal ordering among messages

If a processor fails, the storage requirement of the algorithm grows without bound

The Trans algorithm needs to be composed with a distributed algorithm that maintains consistency of views in order to avoid unbounded storage requirements

Page 62: Ch12 (continued) Replicated Data Management

Atomic Multicast Group membership Goal: to maintain the membership (delivery) list -- itself a piece of replicated data

Properties of the membership list: the value of Li is the same at all processors in Li

Processors install new versions of the membership list in exactly the same order

Page 63: Ch12 (continued) Replicated Data Management

Atomic Multicast Group membership Determination of group view:

A group view is determined by the set of alive processors that are involved in the computation of that view

Notion of agreed view

Page 64: Ch12 (continued) Replicated Data Management

Atomic Multicast Correctness criteria for a reliable group membership distributed algorithm: 1. There is an initial agreed group view in any execution 2. Processors change their local view based on information about failures and new processors

3. The agreed view is unique in any consistent state 4. If p and q are members of the agreed view that goes through a series of changes, p and q see the same sequence of changes

5. The algorithm responds to notifications that processors are faulty or operating

Page 65: Ch12 (continued) Replicated Data Management

Atomic Multicast The Transis Algorithm

An "asynchronous agreement distributed algorithm". Properties (to make things simple!):

Paranoid (sacrifices accurate identification of failed processors): if a non-suspected processor p suspects a processor q of failing, q is declared faulty and all processors in the group remove q from their local view, even if they can still communicate with q.

Unidirectional: once a processor is removed from the membership list, it is never readmitted to the group, except as a "new" processor.

Page 66: Ch12 (continued) Replicated Data Management

Atomic Multicast The paranoid and unidirectional properties lead to monotone agreement: the set of suspected processors monotonically increases at every non-suspected processor.

Eventually, every non-suspected processor has the same set of suspected processors (agreement!).

If everyone suspects everyone, the view collapses.

Virtual synchrony: views are changed in a manner that ensures virtual synchrony; every processor must be able to agree on what the last message from every other processor was.

Page 67: Ch12 (continued) Replicated Data Management

Atomic Multicast Interactions with Trans Transis interacts with Trans through the causal DAG (C-DAG).

e.g., Transis might query the C-DAG, modify the C-DAG, block some messages, or require the immediate sending of some messages.

Fault detection: through a consistent line in the C-DAG, i.e., a global state in which all the members agree on the new group view.

Page 68: Ch12 (continued) Replicated Data Management

Atomic Multicast Achievement of a consistent line Message F(q) means “q is suspected faulty”

Algorithm for new group view computation

When processor p suspects that processor q is faulty, p multicasts F(q) using Trans

When message F(q) becomes causal in the DAG at a processor r, processor r removes q from its local view and multicasts F(q).

When F(q) has been received from all non-suspected processors, agreement on the new view is reached.
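A small sketch of this paranoid fault-propagation rule; the Trans layer and the consistent-line computation are abstracted behind a multicast callback, and the class and method names are assumptions.

```python
# When a processor suspects q it multicasts F(q); on seeing F(q) from someone else it
# removes q locally and forwards its own F(q). Agreement is reached once F(q) has been
# seen from every remaining (non-suspected) member.
class MembershipNode:
    def __init__(self, pid, view, multicast):
        self.pid = pid
        self.view = set(view)          # current local group view
        self.multicast = multicast     # e.g. backed by the Trans layer
        self.fq_seen = {}              # q -> set of processors that sent F(q)

    def suspect(self, q):
        self.multicast(("F", q, self.pid))
        self._handle_fq(q, self.pid)

    def on_fq(self, q, sender):
        if q in self.view:                         # first time we learn of F(q)
            self.multicast(("F", q, self.pid))     # forward our own F(q)
            self._handle_fq(q, self.pid)
        self._handle_fq(q, sender)

    def _handle_fq(self, q, sender):
        self.view.discard(q)                       # paranoid: remove q immediately
        self.fq_seen.setdefault(q, set()).add(sender)
        if self.fq_seen[q] >= self.view:           # F(q) seen from all remaining members
            print(f"{self.pid}: agreed new view {sorted(self.view)}")
```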

Page 69: Ch12 (continued) Replicated Data Management

Atomic Multicast Achievement of a consistent line Message F(q) means “q is suspected faulty”

[Figure: processors A, B, C, D, E. A, B, C and D each multicast F(E); the F(E) messages form a consistent line in the DAG separating view Lx from view Lx+1 over time.]

The F(q) messages are multicast using Trans; all the processors receive the same DAG, so the consistent line can be computed from the DAG.

Page 70: Ch12 (continued) Replicated Data Management

Atomic Multicast Messages partitioned by a consistent line: regular messages that precede the consistent line are in view Lx;

regular messages that follow the consistent line are in view Lx+1.

[Figure: processors A, B, C, D, E; the F(E) messages form the consistent line between Lx and Lx+1; regular messages m1 and m2 fall on either side of the line.]

Regular message = a message different from F(q)

Page 71: Ch12 (continued) Replicated Data Management

Atomic Multicast Message delivery with respect to a consistent line (virtual synchrony)

Regular messages that follow the consistent line are in view Lx+1 and should be delivered in view Lx+1.

[Figure: processors A, B, C, D, E; the F(E) messages form the consistent line; message m1 follows the line at C.]

m1 should be delivered after installing Lx+1 at C

Page 72: Ch12 (continued) Replicated Data Management

Atomic Multicast Message delivery with respect to a consistent line (virtual synchrony)

Regular messages that precede the consistent line are in view Lx and should be delivered before installing the new view Lx+1.

[Figure: processors A, B, C, D, E; the F(E) messages form the consistent line; message m2 precedes the line.]

Regular message = a message different from F(q)

Page 73: Ch12 (continued) Replicated Data Management

Atomic Multicast Concurrent failures handling: the first processor that learns both F(q) and F(r) proposes F(q,r).

[Figure: processors A, B, C, D, E, G. C and G fail concurrently; some processors multicast F(C), others F(G); the first processor that learns of both proposes F(C,G), and the F(C,G) messages form the consistent line between Lx and Lx+1.]

New view: A, B, D, E

Page 74: Ch12 (continued) Replicated Data Management

Atomic Multicast Handling messages from suspected processors

Assume q is suspected of being faulty At some processors, messages from q might causally precede or follow F(q)

To ensure virtual synchrony: all messages from q that precede the first F(q) must be delivered before installing the new view.

No message from q that causally follows any F(q) can be delivered (such messages are discarded).

Messages from q that are concurrent with all F(q) messages are discarded.

Page 75: Ch12 (continued) Replicated Data Management

Atomic Multicast Handling messages from a suspected processor illustrated

[Figure: processors A, B, C, D, E; the F(E) messages form the consistent line. A message from E that causally follows an F(E) is discarded; a message from E concurrent with all F(E) messages is discarded; a message from E that precedes the first F(E) is delivered.]

Page 76: Ch12 (continued) Replicated Data Management

Atomic Multicast Preventing regular messages from straddling view change

When a processor p receives F(q), processor p multicasts F(q) before it multicasts any regular message i.e. F(q) receives “high priority”

The causal layer guarantees that any message that p sends after F(q) causally follows F(q)

Hence the following situation cannot occur

[Figure: the forbidden ordering, in which a regular message m from p straddles the F(q) messages.]

Page 77: Ch12 (continued) Replicated Data Management

Atomic Multicast Preventing empty views

Suspected processors are removed forever; this might lead to empty views.

When a non-faulty processor learns that it has been removed from the view, it fails and then rejoins as a new processor; incarnation numbers are used for that.

An algorithm for adding new processors is needed.

Page 78: Ch12 (continued) Replicated Data Management

Atomic Multicast Adding new processor Algorithm similar to the algorithm for fault detection

Progressive construction of the join set (to account for concurrent join propositions).

When all processors in the current view multicast the same join set, a consistent line is achieved.

When a consistent line is achieved, the joining processors must be properly initialized to start participating in the multicast (Trans).

Page 79: Ch12 (continued) Replicated Data Management

Atomic Multicast Update propagation Deals with relaxed consistency constraints on replicated data

The main requirement: updates on data items must be propagated among all the replicas in a timely manner.

Example: Routing tables

Useful for large networks

Algorithms are often based on gossip: processors contact each other to bring themselves up to date by exchanging "news"

Page 80: Ch12 (continued) Replicated Data Management

Atomic Multicast Gossip-type Algorithms

Based on the meaning of “update”

1) update = “overwrite the old value of an object”

2) update = “modify the old value of an object” (e.g. increment a counter, change an entry of an array)

Page 81: Ch12 (continued) Replicated Data Management

Atomic Multicast Epidemic Algorithms

What is a method for gossiping updates to replicated data?

Assume: update = “overwrite the old value”;

d : data item replicated on M servers;

the computation of an update of d (data item) assigns d a version number (timestamp)

When “gossiping”: if a newer version is proposed, take it

Difficulty : how to spread new updates without too many messages

Page 82: Ch12 (continued) Replicated Data Management

Atomic Multicast Epidemic Algorithms

The simplest method for distributing the news of a new update is direct mail: when a server performs an update, it informs all other servers directly.

M-1 messages are sent. Properties: simple; all servers are reached in a fault-free system.

Problems: the sender might fail, so some destinations will not hear of the update; a large communication burden is placed on the sender; not suitable for dynamic topologies.

Page 83: Ch12 (continued) Replicated Data Management

Atomic Multicast Epidemics: Idea: when a server performs an update, it informs its direct neighbors;

a neighbor informs its neighbors and the propagation continues.

Drawbacks: a large number of messages;

no guarantee that the update will reach all sites.

Page 84: Ch12 (continued) Replicated Data Management

Atomic Multicast Randomized Epidemics: Idea: when a server performs an update, it informs some of the other servers, chosen at random.

Details. Definitions: let u(d) be an update of data item d;

1. A susceptible server for u(d) is one that has never heard of u(d);

2. An infectious server for u(d) is one that has heard of u(d) and is actively propagating u(d);

3. A removed server for u(d) is one that knows of u(d) but is no longer actively propagating it.

Page 85: Ch12 (continued) Replicated Data Management

Atomic Multicast Randomized Epidemics: The algorithm: let k be a parameter;

1. When a susceptible server learns of u(d), it becomes infectious for u(d)

2. An infectious server repeatedly contacts a random server and informs it about u(d)

3. If an infectious server p for u(d) contacts a server p' that is already infectious or removed for u(d), then with probability 1/k, p becomes removed
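A small simulation sketch of this rumor-mongering process; the synchronous rounds and all names are assumptions, but the residual fraction it produces can be compared with the analysis on the next slides.

```python
import random

# Simulate the randomized epidemic: each infectious server contacts one random server
# per round; on contacting an already-informed server it becomes removed with
# probability 1/k. Returns the fraction of servers that never hear about u(d).
def epidemic(M=1000, k=3, seed=0):
    rng = random.Random(seed)
    state = ["susceptible"] * M
    state[0] = "infectious"                  # the server that performed the update
    while "infectious" in state:
        for p in range(M):
            if state[p] != "infectious":
                continue
            q = rng.randrange(M)             # contact a random server
            if state[q] == "susceptible":
                state[q] = "infectious"      # q learns u(d)
            elif rng.random() < 1.0 / k:     # q already knew: p gives up with prob. 1/k
                state[p] = "removed"
    return state.count("susceptible") / M

print(epidemic(k=3))   # roughly exp(-(k+1)) = exp(-4) ≈ 0.018 of the servers miss the update
```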

Page 86: Ch12 (continued) Replicated Data Management

Atomic Multicast Analysis of the randomized epidemics: In: fraction of infectious servers; s: fraction of susceptible servers.

Assume that in every time unit (dt), every infectious server contacts another server.

Let ds be the variation of s during dt and dIn the variation of In during dt. Then, on average, a fraction s·In of the servers becomes infectious during dt and a fraction (1-s)·In/k becomes removed during dt.

Page 87: Ch12 (continued) Replicated Data Management

Atomic Multicast Analysis of the randomized Epidemics:

Thus: ds = -s·In (1) and dIn = s·In - (1-s)·In/k (2).

(2)/(1) gives dIn/ds = 1/(k·s) - (k+1)/k, a differential equation

whose solution is In(s) = [(k+1)/k]·(1-s) + log(s)/k.

Solve In(s) = 0 to determine the fraction of sites that do not hear about the update by the time the epidemic terminates (In = 0).

Page 88: Ch12 (continued) Replicated Data Management

Atomic Multicast Analysis of the randomized epidemics: the trivial solution is s = 1.

The interesting solution satisfies s0(k) = exp(-(k+1)(1-s0)) ≈ exp(-(k+1)) when s0 << 1.

An infectious site becomes removed at its i-th contact with probability (1-1/k)^(i-1)·(1/k).

Thus the expected number of messages is m_tot = M·(1-s)·Σ_{i≥1} i·(1-1/k)^(i-1)·(1/k) ≈ k·M·(1-s).
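The residual fraction s0(k) can also be obtained numerically; a short illustration (the fixed-point iteration is an assumption, not part of the slides):

```python
import math

# Solve s = exp(-(k+1)(1-s)) by fixed-point iteration, starting from 0, to get the
# non-trivial root s0(k): the fraction of servers that never hear the update.
def residual_fraction(k, iterations=200):
    s = 0.0
    for _ in range(iterations):
        s = math.exp(-(k + 1) * (1 - s))
    return s

for k in (1, 2, 3, 4):
    print(k, round(residual_fraction(k), 4), round(math.exp(-(k + 1)), 4))
# The exact residual is slightly above the exp(-(k+1)) approximation for small k.
```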

Page 89: Ch12 (continued) Replicated Data Management

Atomic Multicast Analysis of the randomized epidemics: consider increasing the parameter from k to k+1.

The additional sites reached are M·[s0(k) - s0(k+1)] ≈ M·(1 - 1/e)·exp(-(k+1)), i.e., the number of additional sites infected by the extra messages decreases exponentially with k.

Conclusion: epidemic algorithms are good for the initial distribution (high probability of contacting susceptible servers), but some other mechanism is needed to infect the last few sites.

Page 90: Ch12 (continued) Replicated Data Management

Atomic Multicast Anti-entropy: Idea: one site contacts another to exchange recent updates. A processor p initiates a contact by executing Gossip(): pick a random processor s; exchange(s).

The contacted processor sends the list of its timestamps for the data item d.

Methods for accomplishing exchange(s): Pull: p pulls the most recent updates from s (take); Push: p pushes its most recent updates to s (give); Pull-push: p takes the most recent updates from s and gives its most recent updates to s.
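A sketch of the three exchange styles, assuming each processor keeps a map from data items to (timestamp, value) pairs and that newer timestamps win; the function names are illustrative.

```python
# Anti-entropy exchange between two stores, where a store maps each data item to a
# (timestamp, value) pair.
def merge_into(dst, src):
    for item, (ts, val) in src.items():
        if item not in dst or ts > dst[item][0]:
            dst[item] = (ts, val)           # keep the newer version

def exchange(p, s, mode="pull-push"):
    if mode in ("pull", "pull-push"):
        merge_into(p, s)                    # p takes the most recent updates from s
    if mode in ("push", "pull-push"):
        merge_into(s, p)                    # p gives its most recent updates to s

p = {"d": (3, "x=3")}
s = {"d": (5, "x=5"), "e": (1, "y=1")}
exchange(p, s, "pull")
print(p)   # {'d': (5, 'x=5'), 'e': (1, 'y=1')}
```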

Page 91: Ch12 (continued) Replicated Data Management

Atomic Multicast Anti-entropy:

If most processors are already infectious, Pull is better than Push.

Let p_i be the probability that a random processor is still uninfected after i contacts.

Pull: p_{i+1} = p_i^2

(it remains uninfected only if it contacts an uninfected processor)

Push: p_{i+1} ≈ p_i/e for small p_i (cf. s0(k) = exp(-(k+1)))

If p_i ≈ 1, Pull is slow. If p_i << 1, Pull is much better than Push.
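A quick numeric illustration of the two recurrences (purely illustrative, starting from an arbitrary p_0 = 0.5):

```python
import math

# Compare how fast the uninfected fraction p_i shrinks under Pull vs Push.
p_pull = p_push = 0.5
for i in range(5):
    p_pull, p_push = p_pull ** 2, p_push / math.e
    print(i + 1, round(p_pull, 8), round(p_push, 8))
# Pull squares the residual each round, so once p_i is small it dies out much
# faster than Push's constant factor of 1/e per round.
```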

Page 92: Ch12 (continued) Replicated Data Management

Atomic Multicast Update logs Here we assume that update = "modify the old value"

The value of an object is considered to consist of: an initial value and the history of updates applied to the object. For consistency, the same history of updates must be applied to all copies.

A new update is then a function of the whole history of updates.

Page 93: Ch12 (continued) Replicated Data Management

Atomic Multicast Update logs Each processor keeps a log: an ordered listing containing all the updates that the processor has processed

Processors distribute their logs to each other

Causal log propagation: events are added to the log in an order that is consistent with causality.

Notations: L: a log; L[i]: the first i elements of L; e: an element of L; index(e): the position of e in L; L[e]: shorthand for L[index(e)].

Page 94: Ch12 (continued) Replicated Data Management

Atomic Multicast Update logs Consistency of logs: let e be an event that is first executed at processor p. Then, for every processor j = 1,…,M and every event f, f is in p.L[index(e)] if and only if f is in j.L[index(e)].

When a processor p propagates its log, it propagates all the events in its log: event propagation is transitive.

Log propagation transmits the context of events: the context of an event is described by a vector timestamp.

Contexts are merged upon reception (vector timestamp technique)
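A sketch of causal log propagation with vector-timestamp contexts, assuming each entry carries the sender's vector clock and that logs are exchanged wholesale; the ordering rule and all names are simplifying assumptions, not the slides' exact algorithm.

```python
# Each log entry carries the vector timestamp (context) of its update; merging another
# processor's log keeps every entry exactly once and sorts them causally.
class LoggedReplica:
    def __init__(self, pid, n):
        self.pid, self.clock = pid, [0] * n
        self.log = []                            # list of (vector_timestamp, update)

    def local_update(self, update):
        self.clock[self.pid] += 1
        self.log.append((tuple(self.clock), update))

    def receive_log(self, other_log):
        known = {ts for ts, _ in self.log}
        for ts, update in other_log:             # propagation is transitive: whole log
            if ts not in known:
                self.log.append((ts, update))
                self.clock = [max(a, b) for a, b in zip(self.clock, ts)]  # merge contexts
        # causally earlier entries first (sum of the clock respects causal order;
        # the tuple itself breaks ties between concurrent entries)
        self.log.sort(key=lambda e: (sum(e[0]), e[0]))

p1, p2 = LoggedReplica(0, 2), LoggedReplica(1, 2)
p1.local_update("insert a")
p2.receive_log(p1.log)
p2.local_update("delete a")
p1.receive_log(p2.log)
print([u for _, u in p1.log])   # ['insert a', 'delete a'] -- same order at both replicas
```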

Page 95: Ch12 (continued) Replicated Data Management

Atomic Multicast Causal log propagation

[Figure: logs at processors p1, p2, p3 and p4. When event 6 occurs at p1, the first 5 events from p1, 2 events from p2, 3 events from p3 and 3 events from p4 are already in p1's log.]