
Page 1: Ch12 (continued) Replicated Data Management

Ch12 (continued) Replicated Data Management

Page 2: Ch12 (continued) Replicated Data Management

Outline
• Serializability
  - Quorum-based protocol
  - Replicated Servers
  - Dynamic Quorum Changes
  - One-copy serializability
  - View-based quorums
• Atomic Multicast
  - Virtual synchrony
  - Ordered Multicast
  - Reliable and Causal Multicast
  - Trans Algorithm
  - Group Membership
  - Transis Algorithm
• Update Propagation

Page 3: Ch12 (continued) Replicated Data Management

Data replication

Data replication: why?
• Make data available in spite of the failure of some processors
• Enable transactions (user-defined actions) to complete successfully even if some failures occur in the system, i.e., actions are resilient to the failure of some nodes

Problems to be solved: consistency, management of replicas

Page 4: Ch12 (continued) Replicated Data Management

Data replication Intuitive representation of a replicated data system

[Figure: transactions issue logical operations on a logical data item d; the underlying system maps each logical operation onto operations on multiple replicas of d.]

Transactions see logical data. The underlying system maps each operation on the logical data to operations on multiple copies.

Page 5: Ch12 (continued) Replicated Data Management

Data replication Correctness criteria of the underlying system

To be correct, the mapping performed by the underlying system must ensure one-copy serializability. One-copy serializability property: the concurrent execution of transactions on replicated data should be equivalent to some serial execution of the transactions on non-replicated data.

Page 6: Ch12 (continued) Replicated Data Management

Data replication Quorum-based protocol

Ensure that any pair of conflicting accesses to the same data item access overlapping sites

Here we discuss read/write quorums. A data item d is stored at every processor p in P(d).

Every processor p in P(d) has a vote weight v_p(d). R(d): read threshold; W(d): write threshold. Read quorum of d: a subset P' of P(d) such that Σ_{p ∈ P'} v_p(d) ≥ R(d)

Page 7: Ch12 (continued) Replicated Data Management

Data replication Quorum-based protocol

Write quorum of d: a subset P' of P(d) such that Σ_{p ∈ P'} v_p(d) ≥ W(d)

The total number of votes for d: V(d) = Σ_{p ∈ P(d)} v_p(d)

Quorums must satisfy the following two conditions:
Condition 1: R(d) + W(d) > V(d). Intuitively, every read quorum of d intersects every write quorum of d. Hence:
• a read and a write cannot be performed concurrently on d
• every read quorum can access a copy that reflects the latest update

Page 8: Ch12 (continued) Replicated Data Management

Data replication Quorum-based protocol

Condition 2: 2*W(d) > V(d). Intuitively, any two write quorums of d intersect. Hence:
• a write can be performed in at most one group (partition)
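As a quick aside (not from the slides), these two conditions are easy to check for a given vote assignment; the sketch below is a minimal illustration with hypothetical names.

```python
# Hypothetical helper: check the static quorum conditions for a data item d.
# `votes` maps each processor in P(d) to its vote weight v_p(d).
def check_quorum_conditions(votes, R, W):
    V = sum(votes.values())     # V(d): total number of votes for d
    cond1 = R + W > V           # condition 1: read and write quorums intersect
    cond2 = 2 * W > V           # condition 2: any two write quorums intersect
    return cond1 and cond2

# Example: five replicas with one vote each; R=2, W=4 satisfies both conditions,
# while R=2, W=3 violates condition 1 (R + W = V).
assert check_quorum_conditions({p: 1 for p in "ABCDE"}, R=2, W=4)
assert not check_quorum_conditions({p: 1 for p in "ABCDE"}, R=2, W=3)
```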

How do read and write operations work?

Page 9: Ch12 (continued) Replicated Data Management

Data replication Quorum-based protocol

Read operation: Each replica of d has a version number; n_p(d) is the version number of the replica at processor p, and initially n_p(d) is zero. When a transaction T wants to read d, the following steps are performed:
1. Broadcast a request for votes to P(d) (a remote processor q replies by sending a message with n_q(d) and v_q(d))
2. Collect replies and construct P' until Σ_{p ∈ P'} v_p(d) ≥ R(d)
3. Lock all copies in P'
4. Read the replica d_p, p in P', with the highest version number
5. Unlock the copies of d

Page 10: Ch12 (continued) Replicated Data Management

Data replication Quorum-based protocol

Write operation: Each replica of d has a version number; n_p(d) is the version number of the replica at processor p, and initially n_p(d) is zero. When a transaction T wants to write d, the following steps are performed:
1. Broadcast a request for votes to P(d) (a remote processor q replies by sending a message with n_q(d) and v_q(d))
2. Collect replies and construct P' until Σ_{p ∈ P'} v_p(d) ≥ W(d)
3. Lock all copies in P'
4. Compute the new value d' of d
5. Let max_n(d) be the highest version number read in step 2; for all p in P', write d' to d_p with n_p(d') = max_n(d) + 1

6. Unlock the copies of d
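A compact sketch of the two procedures above, under simplifying assumptions: the replicas live in memory, locking and failures are ignored, and the quorum is built from whatever replicas answer first. The class and function names are illustrative, not part of the protocol.

```python
# Illustrative model of quorum reads and writes with version numbers
# (no real messaging, locking, or failure handling).
class Replica:
    def __init__(self, vote=1):
        self.value, self.version, self.vote = None, 0, vote   # n_p(d) starts at zero

def collect_quorum(replicas, threshold):
    # Steps 1-3: gather replicas until the collected vote weight reaches the threshold.
    quorum, votes = [], 0
    for p in replicas.values():
        quorum.append(p)
        votes += p.vote
        if votes >= threshold:
            return quorum
    raise RuntimeError("not enough votes available")

def quorum_read(replicas, R):
    quorum = collect_quorum(replicas, R)
    return max(quorum, key=lambda p: p.version).value   # step 4: highest version wins

def quorum_write(replicas, W, new_value):
    quorum = collect_quorum(replicas, W)
    new_version = max(p.version for p in quorum) + 1    # step 5: max_n(d) + 1
    for p in quorum:
        p.value, p.version = new_value, new_version

# Example with five single-vote replicas, R=2, W=4: the read quorum necessarily
# overlaps the write quorum and returns the latest value.
replicas = {name: Replica() for name in "ABCDE"}
quorum_write(replicas, 4, "x=1")
print(quorum_read(replicas, 2))   # -> x=1
```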

Page 11: Ch12 (continued) Replicated Data Management

Data replication Replicated Servers

[Figure: clients issue logical requests to a logical server S; the underlying system maps each logical request onto requests to multiple replicas of S and returns a reply.]

Clients see logical servers. The underlying system maps each request on the logical server to requests on multiple copies.

Page 12: Ch12 (continued) Replicated Data Management

Data replication Replicated Servers

In the context of replicated data, one might consider that the system consists of servers and clients

Servers: processors having a copy of the data item. Clients: processors requesting operations on the data item.

Some approaches for replicating servers: Active replication Primary site approach

Page 13: Ch12 (continued) Replicated Data Management

Data replication Replicated Servers

A copy of S is at every processor p in P(S)

Active replication: All the replicas are simultaneously active All replicas are equivalent

When a client C requests a service from S, C contacts any one of the replicas Sp, p in P(S); Sp acts as the coordinator for the transaction.

To be fault tolerant, the client must contact all the replicas (plus other restrictions, e.g., the same set of requests and the same order of requests at all replicas).

In general, suitable for processor failures

Page 14: Ch12 (continued) Replicated Data Management

Data replication Replicated Servers

Primary site approach: One replica is the primary copy: coordinator for all transactions

All other replicas are backups (passive in general)

When a client C requests a service from S, C contacts the primary copy. If the primary fails, a new primary is elected.

If a network partitioning occurs, only the partition having the primary can be serviced

Page 15: Ch12 (continued) Replicated Data Management

Data replication Replicated Servers

Primary site approach:
Read operation: if the requested operation is a read, the primary performs the operation and sends the result to the requester.
Write operation: if the requested operation is a write, the primary server makes sure that all the backups maintain the most recent value of the data item. The primary processor might periodically checkpoint the state of the data item on the backups to reduce the computation overhead at the backups.
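A rough sketch of the primary-site behaviour described above, assuming a synchronous update of the backups; election of a new primary and periodic checkpointing are omitted, and all names are illustrative.

```python
# Simplified primary-site replication: the primary answers reads by itself and
# propagates every write to the passive backups before acknowledging it.
class BackupServer:
    def __init__(self):
        self.value = None

class PrimaryServer:
    def __init__(self, backups):
        self.value = None
        self.backups = backups

    def read(self):
        return self.value                 # reads are served by the primary alone

    def write(self, new_value):
        self.value = new_value
        for b in self.backups:            # keep every backup up to date
            b.value = new_value
        return "ok"

backups = [BackupServer(), BackupServer()]
primary = PrimaryServer(backups)
primary.write(42)
assert primary.read() == 42 and all(b.value == 42 for b in backups)
```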

Page 16: Ch12 (continued) Replicated Data Management

Data replication Dynamic Quorum changes

The quorum-based protocol we have seen is a static method: a single processor failure can make a data item unavailable.

If the system breaks into small groups, it might be the case that no group can perform a write operation.

The dynamic quorum change algorithm avoids this (within certain limits). Idea: for a data item d, quorums are defined on the set of alive replicas of d. This introduces the notion of a view; each transaction executes in a single view, and views are changed sequentially.

Page 17: Ch12 (continued) Replicated Data Management

Data replication Dynamic Quorum changes

d: a data item. P(d): the processors at which a copy of d is stored. Some processors in P(d) can fail.

View: we can regard a view of d as consisting of:
• the alive processors of P(d): AR(d)
• a read quorum defined on AR(d)
• a write quorum defined on AR(d)
• a unique name v(d) (view names are assumed to be totally ordered)

Page 18: Ch12 (continued) Replicated Data Management

Data replication Dynamic Quorum changes

For a transaction Ti, v(Ti) denotes the view in which Ti executes

The idea behind view-based quorums is to ensure that if v(Ti) < v(Tj), then Ti comes before Tj in an equivalent serial execution.

Problem: ensure serializability within a view and serializability between views. New conditions are necessary for quorums to satisfy these requirements.

Page 19: Ch12 (continued) Replicated Data Management

Data replication Dynamic Quorum changes

New conditions for quorums
• d: data item
• v: a view of d
• P(d,v): the alive processors that store d in view v
• |P(d,v)| = n(d,v)
• R(d,v): read threshold for d in view v
• W(d,v): write threshold for d in view v
• Ar(d): read accessibility threshold for d in all views (availability: d can be read in a view v as long as there are at least Ar(d) alive processors in view v)
• Aw(d): write accessibility threshold for d in all views (availability)

Page 20: Ch12 (continued) Replicated Data Management

Data replication Dynamic Quorum changes

New conditions for quorums (cont.). The thresholds must satisfy the following conditions:
DQC1. R(d,v) + W(d,v) > n(d,v) /* in a view, read and write quorums intersect */

DQC2. 2*W(d,v) > n(d,v) /* in a view, write quorums intersect; the nodes participating in an update form a majority of the view */

DQC3. Ar(d) + Aw(d) > |P(d)| /* read accessibility and write accessibility intersect across all views */

DQC4. Aw(d) ≤ W(d,v) ≤ n(d,v) /* ensures consistency of views (we'll see later) */

DQC5. 1 ≤ R(d,v) ≤ n(d,v) /* the minimum size of a read quorum is 1 */
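A small, hypothetical checker for the five conditions, useful for sanity-testing a choice of thresholds (the numbers in the example are made up and are not the example used later in the slides):

```python
# Check DQC1-DQC5 for one view v of a data item d.
def check_view_quorums(n_dv, R_dv, W_dv, Ar, Aw, P_d):
    return (R_dv + W_dv > n_dv           # DQC1: read and write quorums intersect in v
            and 2 * W_dv > n_dv          # DQC2: write quorums intersect in v
            and Ar + Aw > P_d            # DQC3: accessibility thresholds intersect over P(d)
            and Aw <= W_dv <= n_dv       # DQC4: consistency of view changes
            and 1 <= R_dv <= n_dv)       # DQC5: a read quorum has at least one member

# Hypothetical example: |P(d)| = 5, Ar(d)=3, Aw(d)=3.
assert check_view_quorums(n_dv=5, R_dv=2, W_dv=4, Ar=3, Aw=3, P_d=5)       # valid view
assert not check_view_quorums(n_dv=5, R_dv=1, W_dv=3, Ar=3, Aw=3, P_d=5)   # violates DQC1
```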

Page 21: Ch12 (continued) Replicated Data Management

Data replication Dynamic Quorum changes

Restrictions on read and write operations: a data item d can be
• read in view v only if n(d,v) ≥ Ar(d), i.e., the number of alive replicas of d must be greater than or equal to the read accessibility threshold
• written in view v only if n(d,v) ≥ Aw(d), i.e., the number of alive replicas of d must be greater than or equal to the write accessibility threshold
These restrictions are imposed to ensure consistent changes of quorums.

Page 22: Ch12 (continued) Replicated Data Management

Data replication Dynamic Quorum changes

How read and write operations work: Similar to the static quorum-based protocol except that:

Only processors in P(d,v) are contacted for votes (hence, for constructing the quorum i.e. P’)

The version number of each replica becomes : (view_number, in_view_sequence_number)

If a processor p receives a request from a transaction Ti and v(Ti) is not the view p has for d, then p rejects the request

Page 23: Ch12 (continued) Replicated Data Management

Data replication Dynamic Quorum changes

Installation of new view: We have claimed that views are changed sequentially

How is this achieved?

A processor p in P(d) can initiate an update of the view for d because it is recovering, because a member of the view has failed, or because its version number for d is not current.

Page 24: Ch12 (continued) Replicated Data Management

Data replication Dynamic Quorum changes

Installation of a new view (cont.). The idea: assume that processor p is the one that wants to change the view.
1. p determines whether the view it belongs to (the set of nodes with which p can communicate) satisfies the conditions for quorums (n(d,v) ≥ Ar(d), n(d,v) ≥ Aw(d), …); if this is not the case, p cannot change the view
2. p reads all copies of d in P(d,v)
3. p takes as the new copy the replica with the highest version number
4. p increments the view number
5. p broadcasts the latest version to all members of P(d,v)
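A sketch of these installation steps, assuming a simple in-memory model in which each replica holds a value, a version number and a view number; real implementations exchange messages and must cope with concurrent view changes.

```python
# Illustrative view installation for data item d at the initiating processor p.
# `reachable` is the set of replicas p can currently communicate with.
def install_new_view(reachable, Ar, Aw, current_view):
    n = len(reachable)
    if n < Ar or n < Aw:                                   # step 1: accessibility check
        return None                                        # p cannot change the view
    latest = max(reachable, key=lambda r: r['version'])    # steps 2-3: newest copy
    new_view = current_view + 1                            # step 4
    for r in reachable:                                    # step 5: broadcast latest version
        r['value'], r['version'], r['view'] = latest['value'], latest['version'], new_view
    return new_view

replicas = [{'value': f'x={i}', 'version': (0, i), 'view': 0} for i in range(4)]
print(install_new_view(replicas, Ar=2, Aw=3, current_view=0))   # -> 1
```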

Page 25: Ch12 (continued) Replicated Data Management

Data replication Dynamic Quorum changes

Installation of a new view (cont.): let v be the old view and v' the view after the change. We have
W(d,v) ≥ Aw(d), n(d,v') ≥ Ar(d), and Ar(d) + Aw(d) > |P(d)|,

which implies that W(d,v) + n(d,v') > |P(d)|. That is, every write quorum of view v overlaps the set of replicas of view v' when changing to v': a "consistent" change of view.

Page 26: Ch12 (continued) Replicated Data Management

Data replication Dynamic Quorum changes

View changes handle network partitions: assume a data item d is replicated at five processors A, B, C, D, E, with Ar(d)=2 and Aw(d)=3. Initial view 0: P(d,0) = {A,B,C,D,E}, W(d,0)=5, R(d,0)=1.

Assume that the system partitions: A B C D || E /* node E cannot communicate with the others */. If an update request for d arrives at any processor and the view is not updated, the operation cannot be performed.

Page 27: Ch12 (continued) Replicated Data Management

Data replication View changes handle network partitions (cont.): assume a data item d is replicated at five processors A, B, C, D, E with Ar(d)=2, Aw(d)=3. Let view 1 be: P(d,1) = {A,B,C,D}, W(d,1)=4, R(d,1)=1. In this view, partition {E} can still read d but cannot update d,

while partition {A,B,C,D} can read and write d. Assume that D fails, giving partitions {E} and {A,B,C}. d can be read by both partitions; to enable write operations, the view must be updated, e.g., P(d,2) = {A,B,C}, W(d,2)=3, R(d,2)=1.

Page 28: Ch12 (continued) Replicated Data Management

Data replication View change illustrated:

[Figure: view change illustrated. Transaction T1 reads and writes d at A, B, C, D, E in View 0 (W(d,0)=5, R(d,0)=1); T3 writes d and T2 reads d in View 1 (W(d,1)=4, R(d,1)=1); T4 reads and writes d in View 2 (W(d,2), R(d,2)=1).]

The quorum-based algorithm serializes T2 before T3.

Is there any notion of majority behind view changes?

Page 29: Ch12 (continued) Replicated Data Management

Outline
• Serializability
  - Quorum-based protocol
  - Replicated Servers
  - Dynamic Quorum Changes
  - One-copy serializability
  - View-based quorums
• Atomic Multicast
  - Virtual synchrony
  - Ordered Multicast
  - Reliable and Causal Multicast
  - Trans Algorithm
  - Group Membership
  - Transis Algorithm
• Update Propagation

Page 30: Ch12 (continued) Replicated Data Management

Atomic Multicast

In many situations, a one-to-many form of communication is useful, e.g., for maintaining replicated servers, etc.

Two forms of one-to-many communication are possible: Broadcast and Multicast

Broadcast: the sender sends to all the nodes in the system

Multicast: the sender sends to a subset L of the nodes in the system

we are interested in multicast and we assume the sender sends a message m

Page 31: Ch12 (continued) Replicated Data Management

Atomic Multicast A naïve algorithm for Multicast: for each processor p in L, send m to p

Problem: the sender fails after it has sent m to only some of the processors.

Some members of the list L receive m while others do not. This is not acceptable in fault-tolerant systems. Multicast must be reliable: if one processor in L receives m, every alive processor in L must receive m.

Page 32: Ch12 (continued) Replicated Data Management

Atomic Multicast The naïve algorithm for Multicast + 2PC technique The 2PC technique can improve reliability of the naïve algorithm

Idea: regard a multicast as a transaction ("all-or-nothing" property) and distinguish the delivery of a message from the reception of that message.

[Figure: a message m is received by the communication layer at each node; delivering m hands it up to the application (APP), which is a separate event from receiving it.]

Page 33: Ch12 (continued) Replicated Data Management

Atomic Multicast The naïve algorithm for Multicast + 2PC technique The 2PC technique can improve reliability of the naïve algorithm

Idea(cont.): Rule for delivery: deliver a message only when you know that the message will be delivered everywhere

Algorithm for the sender: 1. Send m to every processor in L. 2. When you have received all acknowledgements, deliver m locally and tell every processor in L that it can deliver m.
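A sketch of this sender-side rule; the transport layer (send and recv_ack) and the participant side are stubbed out as assumptions, and blocking on a failed destination is not handled, which is exactly the weakness discussed on the next slide.

```python
# 2PC-style reliable multicast, sender side (sketch). `channels` maps each
# destination in L to an object offering send() and recv_ack().
def multicast_2pc(m, channels, deliver):
    for ch in channels.values():       # 1. send m to every processor in L
        ch.send(("MSG", m))
    for ch in channels.values():       # wait for all acknowledgements
        ch.recv_ack()                  # blocks forever if a destination has failed
    deliver(m)                         # 2. safe to deliver locally ...
    for ch in channels.values():       # ... and tell everyone it can deliver m
        ch.send(("DELIVER", m))
```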

Page 34: Ch12 (continued) Replicated Data Management

Atomic Multicast The naïve algorithm for Multicast + 2PC technique. This technique might require a significant amount of work when a failed processor recovers. In addition, it inherits the vulnerability (blocking) of 2PC.

The main difficulty comes from the correctness criterion: how can a processor determine which nodes in L are up? Virtual synchrony takes this into account.

Page 35: Ch12 (continued) Replicated Data Management

Atomic Multicast Virtual synchrony Accounts for the fact that it is difficult to determine exactly which are the non-failed processors

Processors are organized into groups that cooperate to perform a reliable multicast

Each group corresponds to a multicast list: multicast in a group

Group view : The current list of processors to receive a multicast message (+ some global properties)

Consistency of group view: common view on the members

Page 36: Ch12 (continued) Replicated Data Management

Atomic Multicast Virtual synchrony. An algorithm for performing reliable multicast is virtually synchronous if:
1. In any consistent state, there is a unique group view on which all members of the group agree

2. If a message m is multicast in group view v before view change c, then either:
2.1. no processor in v participating in c ever receives m, or
2.2. all processors in v participating in c receive m before performing c

Page 37: Ch12 (continued) Replicated Data Management

Atomic Multicast Virtual synchrony illustrated

[Figure: two executions of a view change c in which {A, B, C} participate. Case 2.1: no processor participating in c receives m before the change (each delivers {}). Case 2.2: every processor participating in c receives m before performing c (each delivers {m}).]

Page 38: Ch12 (continued) Replicated Data Management

Atomic Multicast Virtual synchrony View changes can be considered as checkpoints

Delivery list in virtual synchrony: between two consecutive "checkpoints" v and v', a set G of messages is multicast. A sender of a message in G must be in v. Hence, if p is removed from the view, the remaining processors can consider that p has failed.

There is a guarantee that, from v' on, no message from p will be delivered.

Page 39: Ch12 (continued) Replicated Data Management

Atomic Multicast Ordered multicasts One might want a multicast that satisfies a specific order: e.g. Causal order, total order

Causal order (for causal multicast): if processor p receives m1 and then multicasts m2, then every processor that receives {m1, m2} should receive m1 before m2.

Total order (for atomic multicast): if p receives m1 before m2, then every processor that receives {m1, m2} should receive m1 before m2 (i.e., the same order of reception everywhere).
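The Trans algorithm presented later enforces causal order through its acknowledgement DAG. For comparison only, here is the classic vector-timestamp delivery test for causal multicast, sketched with illustrative names; it is not the mechanism used in the rest of the chapter.

```python
# A message from `sender` stamped with vector clock V_m is causally deliverable at a
# process whose clock is V_i iff it is the next message from the sender and every
# message it depends on has already been delivered locally.
def deliverable(V_m, V_i, sender):
    return (V_m[sender] == V_i[sender] + 1 and
            all(V_m[k] <= V_i[k] for k in range(len(V_m)) if k != sender))

V_i = [0, 0, 0]                                # nothing delivered yet at this process
print(deliverable([1, 0, 0], V_i, sender=0))   # True:  m1 from p0 can be delivered
print(deliverable([1, 1, 0], V_i, sender=1))   # False: m2 depends on m1, delay it
```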

Page 40: Ch12 (continued) Replicated Data Management

Atomic Multicast Why causal multicast? Assume that the data item x is replicated and consider the following scenario:

[Figure: one processor multicasts m1: "set x to zero"; after receiving m1, another processor multicasts m2: "increment x by 1"; both messages reach the replicas at p, q and r.]

m1 must be delivered before m2 everywhere. Otherwise, inconsistency!

Page 41: Ch12 (continued) Replicated Data Management

Atomic Multicast Why total order for multicast? Assume that p sends m and, after that, p crashes. Then, by some mechanism, q and r are informed about the crash of p, but q receives m before crash(p) while r receives crash(p) before m.

[Figure: p multicasts m and then crashes; q receives m and then crash(p); r receives crash(p) and then m.]

Total order is necessary; otherwise, q and r might take different decisions.

Page 42: Ch12 (continued) Replicated Data Management

Atomic Multicast Why total order for multicast (cont.)? Assume that the data item x is a replicated queue and consider the following scenario:

[Figure: one processor multicasts m1: "insert a into x"; another multicasts m2: "delete a from x"; both messages reach the replicas at p, q and r.]

m1 and m2 must be delivered in the same order everywhere. Otherwise, inconsistency!

Page 43: Ch12 (continued) Replicated Data Management

Atomic Multicast The Trans algorithm

Executes between two view changes, exploiting the guarantee provided by virtual synchrony.

Hence, the algorithm works within one view.

Mechanisms: a combination of positive and negative acknowledgements for reliability. Piggybacking acknowledgements on the messages being multicast simplifies the detection of missed messages and minimizes the need for explicit acknowledgements.

Page 44: Ch12 (continued) Replicated Data Management

Atomic Multicast

By piggybacking positive and negative acknowledgements, when a processor p receives a multicast message, p learns: which messages it does not need to acknowledge, and which messages it has missed and must request a retransmission of.

Page 45: Ch12 (continued) Replicated Data Management

Atomic Multicast The idea behind Trans is illustrated by the following scenario. Let L = [P, Q, R] be a delivery list for multicast.

Step 1. P multicasts m1. Step 2. Q receives m1 and piggybacks a positive acknowledgement on the next message m2 that it multicasts (we write m2:ack(m1) to mean that m2 contains an ack for m1). Step 3. R receives m2 (i.e., m2:ack(m1)).

Two cases arise for R upon receipt of m2: Case 1: if R had received m1, it realizes that it does not need to send an acknowledgement for m1, as Q has already acknowledged it.

Case 2: if R had not received m1, then R learns (because of the ack(m1) attached to m2) that m1 is missing; R then requests a retransmission of m1 by attaching a negative acknowledgement for m1 to the next message it multicasts.

Page 46: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: an invariant The protocol maintains the following invariant

A processor p multicasts an acknowledgement of message m only if processor p has received m and all messages that causally precede m.

If you acknowledge m, you do not need to acknowledge the unacknowledged messages that causally precede m: they are acknowledged implicitly.

Page 47: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: stable messages. A message is said to be stable if it has reached all the processors in the group view.

This is detectable because each receiver of a message multicasts acknowledgements.

Some assumptions:

All messages are assumed to be uniquely identified (processor_id , message_seq_number)

Each sender sequentially numbers its messages

A virtual synchrony layer is assumed

Page 48: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: Variables used Each processor maintains:

ack_list : the list of identifiers of messages for which that node has to send a positive acknowledgement

nack_list : the list of identifiers of messages for which that node has to send a negative acknowledgement

G: the causal DAG; it contains all messages that the processor has received but that are not yet stable. (m, m') is in G if message m acknowledges message m'.

Page 49: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: retransmission Using information given by the local DAG, a processor can determine which messages it should have received

For such a message, a negative acknowledgement is multicast to request a retransmission

Page 50: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: variables (cont.)

m : message container (serves as id of message here) m.message : application-level message (to be delivered at the app.) m.nacks : list of negative acknowledgments m.acks : list of positive acknowledgments

L : destinations list (maintained by an underlying algorithm)

Page 51: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: Causal DAG functions used

add_to_DAG(m,G) : insert m into G

not_duplicate(m,G) : True if m has never been received before

causal(m,G) : True if all messages that m causally follows have been received

stable(m,G) : True if all (alive) processors have acknowledged m

Page 52: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: Sending a message

Trans_send(message):
  create a container m;
  m.message := message;
  m.acks := ack_list;    /* attach positive acknowledgements to m */
  m.nacks := nack_list;  /* attach negative acknowledgements to m */
  put m in ack_list;
  add_to_DAG(m, G);
  send m to every processor in L

Page 53: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: Receiving a message

Trans_receive(m):
  for every nack(x) in m.nacks do              /* note: for a retransmission, m.nacks is empty */
    if x is in G then multicast x;
  if not_duplicate(m, G) then                  /* m has never been received before */
    for every ack(m') in m.acks do             /* update of nack_list */
      if not_duplicate(m', G) then add m' to nack_list;
    if m is in nack_list then                  /* m is a retransmission */
      remove m from nack_list;
    add m to undelivered_list;
    remove m.nacks from m;
    add_to_DAG(m, G);
    while there is m' in undelivered_list such that causal(m', G) do
      remove m' from undelivered_list;
      deliver m'.message to the application;

  compute ack_list to be all m in G such that causal(m, G) and there is no m' such that causal(m', G) and m' acknowledges m;
  for every m' in G do
    if stable(m', G) then remove m' from G and reclaim the buffer
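Below is a condensed, executable sketch of the bookkeeping that Trans_send and Trans_receive describe, with networking, retransmission timers and view changes left out; the container layout, the causal() test and the ack_list recomputation are simplifications of the pseudocode above, not a faithful implementation.

```python
# Skeleton of the Trans bookkeeping at one processor (no networking; garbage
# collection of stable messages is omitted).
class Container:
    def __init__(self, mid, message, acks, nacks):
        self.id, self.message = mid, message
        self.acks, self.nacks = list(acks), list(nacks)

class TransNode:
    def __init__(self, pid, send_fn):
        self.pid, self.seq, self.send_fn = pid, 0, send_fn
        self.ack_list, self.nack_list = [], []   # ids to ack / to nack in the next message
        self.G = {}                              # causal DAG: message id -> container
        self.undelivered, self.delivered = [], []

    def trans_send(self, message):
        self.seq += 1
        m = Container((self.pid, self.seq), message, self.ack_list, self.nack_list)
        self.ack_list, self.nack_list = [m.id], []   # m now stands for everything it acks
        self.G[m.id] = m                             # add_to_DAG(m, G)
        self.send_fn(m)                              # send m to every processor in L

    def causal(self, m):
        # every message that m acknowledges has been received
        return all(a in self.G for a in m.acks)

    def trans_receive(self, m):
        for x in m.nacks:                        # retransmit messages the sender missed
            if x in self.G:
                self.send_fn(self.G[x])
        if m.id in self.G:                       # duplicate: ignore
            return
        for a in m.acks:                         # learn about messages we have missed
            if a not in self.G and a not in self.nack_list:
                self.nack_list.append(a)
        if m.id in self.nack_list:               # m is a retransmission we asked for
            self.nack_list.remove(m.id)
        self.G[m.id] = m
        self.undelivered.append(m)
        progress = True
        while progress:                          # deliver everything that is now causal
            progress = False
            for u in list(self.undelivered):
                if self.causal(u):
                    self.undelivered.remove(u)
                    self.delivered.append(u.message)
                    progress = True
        # ack_list: causal messages not yet acknowledged by another causal message
        acked = {a for c in self.G.values() if self.causal(c) for a in c.acks}
        self.ack_list = [i for i, c in self.G.items() if self.causal(c) and i not in acked]
```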

Page 54: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: the Causal DAG G : the causal DAG contains all messages that the processor has received but that are not yet stable (m, m’) is in G if message m acknowledges message m’

[Figure: a DAG fragment with nodes b1, c1 and d1.] The message (c1, [ack(b1), nack(d1)]) represents a piece of the DAG: processor C, which sends c1, acknowledges message b1 but also requests a retransmission of message d1.

Page 55: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: retransmission

Can a retransmitted message be different from the original message? The retransmitted message must contain all the positive acknowledgements that the original message had;

the list of negative acknowledgements is not useful in the retransmission.

Page 56: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: illustrated

[Figure: processors A, B, C, D.]
(a, []): A multicasts a; D does not get it.
(b, [ack(a)]): B multicasts b. D receives b and learns about the unreceived message a. All nodes know that B has got a.

Page 57: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: illustrated

[Figure: processors A, B, C, D (scenario continued).]
(c, [ack(b)]): C multicasts c and only acknowledges b (an implicit acknowledgement for a). All nodes know that C has got b and a (there is no nack(a) in the message C broadcasts).

Page 58: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: illustrated

[Figure: processors A, B, C, D (scenario continued).]
(d, [nack(a)]): D multicasts d; D's message carries nack(a), a request for the retransmission of a. A does not get the message from D. D must piggyback ack(c) on its next multicast.

Page 59: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: illustrated

[Figure: processors A, B, C, D (same step, detailed).]
(d, [nack(a)]): D multicasts d with a nack for a, requesting a retransmission. A does not get the message from D. D must piggyback only ack(c) on its next multicast, i.e., an implicit ack for a and b.

Page 60: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: illustrated

[Figure: processors A, B, C, D (scenario continued).]
(a, []): C "re-multicasts" a with no attached acks. Note: the representation of a at D changes. In their next messages, C and B acknowledge d.

Page 61: Ch12 (continued) Replicated Data Management

Atomic Multicast Trans: Properties If a message is received by a processor that does not fail, eventually every non-failed processor receives that message

Messages are delivered in causal order

G forms a tight description of the causal ordering among messages

If a processor fails, the storage requirement of the algorithm grows without bound

The Trans algorithm needs to be composed with a distributed algorithm that maintains consistency of views in order to avoid unbounded storage requirements

Page 62: Ch12 (continued) Replicated Data Management

Atomic Multicast Group membership Goal: to maintain the membership (delivery) list -- itself a piece of replicated data

Properties of the membership list: the value of Li is the same at all processors in Li

Processors install new versions of the membership list in exactly the same order

Page 63: Ch12 (continued) Replicated Data Management

Atomic Multicast Group membership Determination of group view:

A group view is determined by the set of alive processors that are involved in the computation of that view

Notion of agreed view

Page 64: Ch12 (continued) Replicated Data Management

Atomic Multicast Correctness criteria for a reliable group membership distributed algorithm: 1. There is an initial agreed group view in any execution 2. Processors change their local view based on information about failures and new processors

3. The agreed view is unique in any consistent state 4. If p and q are members of the agreed view that goes through a series of changes, p and q see the same sequence of changes

5. The algorithm responds to notifications that processors are faulty or operating

Page 65: Ch12 (continued) Replicated Data Management

Atomic Multicast The Transis Algorithm

An "asynchronous agreement distributed algorithm". Properties (to make things simple!):

Paranoid (sacrifices accurate identification of failed processors): if a non-suspected processor p suspects a processor q of failing, q is declared faulty and all processors in the group remove q from their local view, even if they can still communicate with q.

Unidirectional: once a processor is removed from the membership list, it is never readmitted to the group, except as a "new" processor.

Page 66: Ch12 (continued) Replicated Data Management

Atomic Multicast The paranoid and unidirectional properties lead to monotone agreement: the set of suspected processors monotonically increases at every non-suspected processor.

Eventually, every non-suspected processor has the same set of suspected processors (agreement!).

If everyone suspects everyone, the view collapses.

Virtual synchrony: views are changed in a manner that ensures virtual synchrony; every processor must be able to agree on what the last message from every other processor was.

Page 67: Ch12 (continued) Replicated Data Management

Atomic Multicast Interactions with Trans Transis interacts with Trans through the causal DAG (C-DAG).

e.g., Transis might query the C-DAG, modify the C-DAG, block some messages, or require the immediate sending of some messages.

Fault detection: through a consistent line in the C-DAG, i.e., a global state in which all the members agree on the new group view.

Page 68: Ch12 (continued) Replicated Data Management

Atomic Multicast Achievement of a consistent line Message F(q) means “q is suspected faulty”

Algorithm for new group view computation

When processor p suspects that processor q is faulty, p multicasts F(q) using Trans

When message F(q) becomes causal in the DAG at a processor r, processor r removes q from its local view and multicasts F(q).

When F(q) has been received from all non-suspected processors, agreement on the new view is reached.
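A small sketch of this paranoid fault-propagation rule; the Trans layer and the consistent-line computation are abstracted behind a multicast callback, and the class and method names are assumptions.

```python
# When a processor suspects q it multicasts F(q); on seeing F(q) from someone else it
# removes q locally and forwards its own F(q). Agreement is reached once F(q) has been
# seen from every remaining (non-suspected) member.
class MembershipNode:
    def __init__(self, pid, view, multicast):
        self.pid = pid
        self.view = set(view)          # current local group view
        self.multicast = multicast     # e.g. backed by the Trans layer
        self.fq_seen = {}              # q -> set of processors that sent F(q)

    def suspect(self, q):
        self.multicast(("F", q, self.pid))
        self._handle_fq(q, self.pid)

    def on_fq(self, q, sender):
        if q in self.view:                         # first time we learn of F(q)
            self.multicast(("F", q, self.pid))     # forward our own F(q)
            self._handle_fq(q, self.pid)
        self._handle_fq(q, sender)

    def _handle_fq(self, q, sender):
        self.view.discard(q)                       # paranoid: remove q immediately
        self.fq_seen.setdefault(q, set()).add(sender)
        if self.fq_seen[q] >= self.view:           # F(q) seen from all remaining members
            print(f"{self.pid}: agreed new view {sorted(self.view)}")
```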

Page 69: Ch12 (continued) Replicated Data Management

Atomic Multicast Achievement of a consistent line Message F(q) means “q is suspected faulty”

[Figure: processors A, B, C, D, E. A, B, C and D each multicast F(E); the F(E) messages form a consistent line in the DAG separating view Lx from view Lx+1 over time.]

The F(q) messages are multicast using Trans; all the processors receive the same DAG, so the consistent line can be computed from the DAG.

Page 70: Ch12 (continued) Replicated Data Management

Atomic Multicast Messages partitioned by a consistent line: regular messages that precede the consistent line are in view Lx;

regular messages that follow the consistent line are in view Lx+1.

[Figure: processors A, B, C, D, E; the F(E) messages form the consistent line between Lx and Lx+1; regular messages m1 and m2 fall on either side of the line.]

Regular message = a message different from F(q)

Page 71: Ch12 (continued) Replicated Data Management

Atomic Multicast Message delivery with respect to a consistent line (virtual synchrony)

Regular messages that follow the consistent line are in view Lx+1 and should be delivered in view Lx+1.

[Figure: processors A, B, C, D, E; the F(E) messages form the consistent line; message m1 follows the line at C.]

m1 should be delivered after installing Lx+1 at C

Page 72: Ch12 (continued) Replicated Data Management

Atomic Multicast Message delivery with respect to a consistent line (virtual synchrony)

Regular messages that precede the consistent line are in view Lx and should be delivered before installing the new view Lx+1.

[Figure: processors A, B, C, D, E; the F(E) messages form the consistent line; message m2 precedes the line.]

Regular message = a message different from F(q)

Page 73: Ch12 (continued) Replicated Data Management

Atomic Multicast Concurrent failures handling: the first processor that learns both F(q) and F(r) proposes F(q,r).

[Figure: processors A, B, C, D, E, G. C and G fail concurrently; some processors multicast F(C), others F(G); the first processor that learns of both proposes F(C,G), and the F(C,G) messages form the consistent line between Lx and Lx+1.]

New view: A, B, D, E

Page 74: Ch12 (continued) Replicated Data Management

Atomic Multicast Handling messages from suspected processors

Assume q is suspected of being faulty At some processors, messages from q might causally precede or follow F(q)

To ensure virtual synchrony: all messages from q that precede the first F(q) must be delivered before installing the new view.

No message from q that causally follows any F(q) can be delivered (such messages are discarded).

Messages from q that are concurrent with all F(q) messages are discarded.

Page 75: Ch12 (continued) Replicated Data Management

Atomic Multicast Handling messages from a suspected processor illustrated

[Figure: processors A, B, C, D, E; the F(E) messages form the consistent line. A message from E that causally follows an F(E) is discarded; a message from E concurrent with all F(E) messages is discarded; a message from E that precedes the first F(E) is delivered.]

Page 76: Ch12 (continued) Replicated Data Management

Atomic Multicast Preventing regular messages from straddling view change

When a processor p receives F(q), processor p multicasts F(q) before it multicasts any regular message i.e. F(q) receives “high priority”

The causal layer guarantees that any message that p sends after F(q) causally follows F(q)

Hence the following situation cannot occur

[Figure: the forbidden ordering, in which a regular message m from p straddles the F(q) messages.]

Page 77: Ch12 (continued) Replicated Data Management

Atomic Multicast Preventing empty views

Suspected processors are removed forever; this might lead to empty views.

When a non-faulty processor learns that it has been removed from the view, it fails and then rejoins as a new processor; incarnation numbers are used for that.

An algorithm for adding new processors is needed.

Page 78: Ch12 (continued) Replicated Data Management

Atomic Multicast Adding new processor Algorithm similar to the algorithm for fault detection

Progressive construction of the join set (to account for concurrent join propositions).

When all processors in the current view multicast the same join set, a consistent line is achieved.

When a consistent line is achieved, the joining processors must be properly initialized to start participating in the multicast (Trans).

Page 79: Ch12 (continued) Replicated Data Management

Atomic Multicast Update propagation Deals with relaxed consistency constraints on replicated data

The main requirement: updates on data items must be propagated among all the replicas in a timely manner.

Example: Routing tables

Useful for large networks

Algorithms are often based on gossip: processors contact each other to bring themselves up to date by exchanging "news"

Page 80: Ch12 (continued) Replicated Data Management

Atomic Multicast Gossip-type Algorithms

Based on the meaning of “update”

1) update = “overwrite the old value of an object”

2) update = “modify the old value of an object” (e.g. increment a counter, change an entry of an array)

Page 81: Ch12 (continued) Replicated Data Management

Atomic Multicast Epidemic Algorithms

What is a method for gossiping updates to replicated data?

Assume: update = “overwrite the old value”;

d : data item replicated on M servers;

the computation of an update of d (data item) assigns d a version number (timestamp)

When “gossiping”: if a newer version is proposed, take it

Difficulty : how to spread new updates without too many messages

Page 82: Ch12 (continued) Replicated Data Management

Atomic Multicast Epidemic Algorithms

The simplest method for distributing the news of a new update is direct mail: when a server performs an update, it informs all other servers directly.

M-1 messages are sent. Properties: simple; all servers are reached in a fault-free system.

Problems: the sender might fail, so some destinations will not hear of the update; a large communication burden is placed on the sender; not suitable for dynamic topologies.

Page 83: Ch12 (continued) Replicated Data Management

Atomic Multicast Epidemics: Idea: when a server performs an update, it informs its direct neighbors;

a neighbor informs its neighbors and the propagation continues.

Drawbacks: a large number of messages;

no guarantee that the update will reach all sites.

Page 84: Ch12 (continued) Replicated Data Management

Atomic Multicast Randomized Epidemics: Idea: when a server performs an update, it informs some of the other servers, chosen at random.

Details. Definitions: let u(d) be an update of data item d;

1. A susceptible server for u(d) is one that has never heard of u(d);

2. An infectious server for u(d) is one that has heard of u(d) and is actively propagating u(d);

3. A removed server for u(d) is one that knows of u(d) but is no longer actively propagating it.

Page 85: Ch12 (continued) Replicated Data Management

Atomic Multicast Randomized Epidemics: The algorithm: let k be a parameter;

1. When a susceptible server learns of u(d), it becomes infectious for u(d)

2. An infectious server repeatedly contacts a random server and informs it about u(d)

3. If an infectious server p for u(d) contacts a server p' that is already infectious or removed for u(d), then with probability 1/k, p becomes removed
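A small simulation sketch of this rumor-mongering process; the synchronous rounds and all names are assumptions, but the residual fraction it produces can be compared with the analysis on the next slides.

```python
import random

# Simulate the randomized epidemic: each infectious server contacts one random server
# per round; on contacting an already-informed server it becomes removed with
# probability 1/k. Returns the fraction of servers that never hear about u(d).
def epidemic(M=1000, k=3, seed=0):
    rng = random.Random(seed)
    state = ["susceptible"] * M
    state[0] = "infectious"                  # the server that performed the update
    while "infectious" in state:
        for p in range(M):
            if state[p] != "infectious":
                continue
            q = rng.randrange(M)             # contact a random server
            if state[q] == "susceptible":
                state[q] = "infectious"      # q learns u(d)
            elif rng.random() < 1.0 / k:     # q already knew: p gives up with prob. 1/k
                state[p] = "removed"
    return state.count("susceptible") / M

print(epidemic(k=3))   # roughly exp(-(k+1)) = exp(-4) ≈ 0.018 of the servers miss the update
```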

Page 86: Ch12 (continued) Replicated Data Management

Atomic Multicast Analysis of the randomized epidemics: In: fraction of infectious servers; s: fraction of susceptible servers.

Assume that in every time unit (dt), every infectious server contacts another server.

Let ds be the variation of s during dt and dIn the variation of In during dt. Then, on average, a fraction s·In of the servers becomes infectious during dt and a fraction (1-s)·In/k becomes removed during dt.

Page 87: Ch12 (continued) Replicated Data Management

Atomic Multicast Analysis of the randomized Epidemics:

Thus: ds = -s·In (1) and dIn = s·In - (1-s)·In/k (2).

(2)/(1) gives dIn/ds = 1/(k·s) - (k+1)/k, a differential equation

whose solution is In(s) = [(k+1)/k]·(1-s) + log(s)/k.

Solve In(s) = 0 to determine the fraction of sites that do not hear about the update by the time the epidemic terminates (In = 0).

Page 88: Ch12 (continued) Replicated Data Management

Atomic Multicast Analysis of the randomized epidemics: the trivial solution is s = 1.

The interesting solution satisfies s0(k) = exp(-(k+1)(1-s0)) ≈ exp(-(k+1)) when s0 << 1.

An infectious site becomes removed at its i-th contact with probability (1-1/k)^(i-1)·(1/k).

Thus the expected number of messages is m_tot = M·(1-s)·Σ_{i≥1} i·(1-1/k)^(i-1)·(1/k) ≈ k·M·(1-s).
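The residual fraction s0(k) can also be obtained numerically; a short illustration (the fixed-point iteration is an assumption, not part of the slides):

```python
import math

# Solve s = exp(-(k+1)(1-s)) by fixed-point iteration, starting from 0, to get the
# non-trivial root s0(k): the fraction of servers that never hear the update.
def residual_fraction(k, iterations=200):
    s = 0.0
    for _ in range(iterations):
        s = math.exp(-(k + 1) * (1 - s))
    return s

for k in (1, 2, 3, 4):
    print(k, round(residual_fraction(k), 4), round(math.exp(-(k + 1)), 4))
# The exact residual is slightly above the exp(-(k+1)) approximation for small k.
```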

Page 89: Ch12 (continued) Replicated Data Management

Atomic Multicast Analysis of the randomized epidemics: consider increasing the parameter from k to k+1.

The additional sites reached are M·[s0(k) - s0(k+1)] ≈ M·(1 - 1/e)·exp(-(k+1)), i.e., the number of additional sites infected by the extra messages decreases exponentially with k.

Conclusion: epidemic algorithms are good for the initial distribution (high probability of contacting susceptible servers), but some other mechanism is needed to infect the last few sites.

Page 90: Ch12 (continued) Replicated Data Management

Atomic Multicast Anti-entropy: Idea: one site contacts another to exchange recent updates. A processor p initiates a contact by executing Gossip(): pick a random processor s; exchange(s).

The contacted processor sends the list of its timestamps for the data item d.

Methods for accomplishing exchange(s): Pull: p pulls the most recent updates from s (take); Push: p pushes its most recent updates to s (give); Pull-push: p takes the most recent updates from s and gives its most recent updates to s.
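A sketch of the three exchange styles, assuming each processor keeps a map from data items to (timestamp, value) pairs and that newer timestamps win; the function names are illustrative.

```python
# Anti-entropy exchange between two stores, where a store maps each data item to a
# (timestamp, value) pair.
def merge_into(dst, src):
    for item, (ts, val) in src.items():
        if item not in dst or ts > dst[item][0]:
            dst[item] = (ts, val)           # keep the newer version

def exchange(p, s, mode="pull-push"):
    if mode in ("pull", "pull-push"):
        merge_into(p, s)                    # p takes the most recent updates from s
    if mode in ("push", "pull-push"):
        merge_into(s, p)                    # p gives its most recent updates to s

p = {"d": (3, "x=3")}
s = {"d": (5, "x=5"), "e": (1, "y=1")}
exchange(p, s, "pull")
print(p)   # {'d': (5, 'x=5'), 'e': (1, 'y=1')}
```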

Page 91: Ch12 (continued) Replicated Data Management

Atomic Multicast Anti-entropy:

If most processors are already infectious, Pull is better than Push.

Let p_i be the probability that a random processor is still uninfected after i contacts.

Pull: p_{i+1} = p_i^2

(it remains uninfected only if it contacts an uninfected processor)

Push: p_{i+1} ≈ p_i/e for small p_i (cf. s0(k) = exp(-(k+1)))

If p_i ≈ 1, Pull is slow. If p_i << 1, Pull is much better than Push.
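A quick numeric illustration of the two recurrences (purely illustrative, starting from an arbitrary p_0 = 0.5):

```python
import math

# Compare how fast the uninfected fraction p_i shrinks under Pull vs Push.
p_pull = p_push = 0.5
for i in range(5):
    p_pull, p_push = p_pull ** 2, p_push / math.e
    print(i + 1, round(p_pull, 8), round(p_push, 8))
# Pull squares the residual each round, so once p_i is small it dies out much
# faster than Push's constant factor of 1/e per round.
```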

Page 92: Ch12 (continued) Replicated Data Management

Atomic Multicast Update logs Here we assume that update = "modify the old value"

The value of an object is considered to consist of: an initial value and the history of updates applied to the object. For consistency, the same history of updates must be applied to all copies.

A new update is then a function of the whole history of updates.

Page 93: Ch12 (continued) Replicated Data Management

Atomic Multicast Update logs Each processor keeps a log: an ordered listing containing all the updates that the processor has processed

Processors distribute their logs to each other

Causal log propagation: events are added to the log in an order that is consistent with causality.

Notations: L: a log; L[i]: the first i elements of L; e: an element of L; index(e): the position of e in L; L[e]: shorthand for L[index(e)].

Page 94: Ch12 (continued) Replicated Data Management

Atomic Multicast Update logs Consistency of logs: let e be an event that is first executed at processor p. Then, for every processor j = 1,…,M and every event f, f is in p.L[index(e)] if and only if f is in j.L[index(e)].

When a processor p propagates its log, it propagates all the events in its log: event propagation is transitive.

Log propagation transmits the context of events: the context of an event is described by a vector timestamp.

Contexts are merged upon reception (vector timestamp technique)
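A sketch of causal log propagation with vector-timestamp contexts, assuming each entry carries the sender's vector clock and that logs are exchanged wholesale; the ordering rule and all names are simplifying assumptions, not the slides' exact algorithm.

```python
# Each log entry carries the vector timestamp (context) of its update; merging another
# processor's log keeps every entry exactly once and sorts them causally.
class LoggedReplica:
    def __init__(self, pid, n):
        self.pid, self.clock = pid, [0] * n
        self.log = []                            # list of (vector_timestamp, update)

    def local_update(self, update):
        self.clock[self.pid] += 1
        self.log.append((tuple(self.clock), update))

    def receive_log(self, other_log):
        known = {ts for ts, _ in self.log}
        for ts, update in other_log:             # propagation is transitive: whole log
            if ts not in known:
                self.log.append((ts, update))
                self.clock = [max(a, b) for a, b in zip(self.clock, ts)]  # merge contexts
        # causally earlier entries first (sum of the clock respects causal order;
        # the tuple itself breaks ties between concurrent entries)
        self.log.sort(key=lambda e: (sum(e[0]), e[0]))

p1, p2 = LoggedReplica(0, 2), LoggedReplica(1, 2)
p1.local_update("insert a")
p2.receive_log(p1.log)
p2.local_update("delete a")
p1.receive_log(p2.log)
print([u for _, u in p1.log])   # ['insert a', 'delete a'] -- same order at both replicas
```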

Page 95: Ch12 (continued) Replicated Data Management

Atomic Multicast Causal log propagation

[Figure: logs at processors p1, p2, p3 and p4. When event 6 occurs at p1, the first 5 events from p1, 2 events from p2, 3 events from p3 and 3 events from p4 are already in p1's log.]