Slides for Chapter 15: Replication
From Coulouris, Dollimore and Kindberg, Distributed Systems: Concepts and Design, Edition 4, © Addison-Wesley 2005
Outline
- Introduction
- System model and group communication
- Fault-tolerant services
- Case studies of highly available services: the gossip architecture, Bayou and Coda
- Transactions with replicated data
- Summary
Instructor’s Guide for Coulouris, Dollimore and Kindberg Distributed Systems: Concepts and Design Edn. 4 © Pearson Education 2005
Introduction
Replication: the maintenance of copies of data at multiple computers. A key to the effectiveness of distributed systems, replication can:
- Enhance performance
- Increase availability
- Provide fault tolerance
System model and group communication
System model
Five phases are involved in the performance of a single request upon replicated objects:
- Request: the front end issues the request to one or more replica managers
- Coordination: the replica managers agree on the order in which to apply the request (FIFO, causal, or total ordering)
- Execution
- Agreement
- Response
Group communication
- Multicast communication
- Group membership management
These two are highly interrelated.
Figure 15.1A basic architectural model for the management of replicated data
[Diagram: clients (C) send requests and receive replies through front ends (FE), which communicate with the service's replica managers (RM).]
Front ends communicate with one or more of the replica managers, making replication transparent to clients.
Figure 15.2Services provided for process groups
[Diagram: a process group served by group membership management (Join, Leave, Fail), group address expansion, and multicast communication (send).]
A process outside the group can send a message to the group without knowing the group's membership.
The group communication service has to manage changes in the group’s membership while multicasts take place concurrently
The role of the group membership service:
- Providing an interface for group membership changes: create or destroy groups, add or withdraw a process from a group
- Implementing a failure detector: exclude a process if it is suspected to have failed or become unreachable
- Notifying members of group membership changes
- Performing group address expansion: expanding the group identifier into the current group membership for delivery
IP multicast is a weak case of a group membership service. A full group membership service maintains group views: ordered lists of the current group members with their unique process identifiers.
View delivery
For each group g, the group membership service delivers to any member process a series of views v0(g), v1(g), v2(g), …
A member delivers a view when a membership change occurs. An event occurs in a view v(g) at process p if, at the time of the event's occurrence, p has delivered v(g) but has not yet delivered the next view.
Some basic requirements for view delivery:
- Order: if a process p delivers view v(g) and then view v'(g), then no other process q ≠ p delivers v'(g) before v(g)
- Integrity: if process p delivers view v(g), then p ∈ v(g)
- Non-triviality: if process q joins a group and is or becomes indefinitely reachable from process p ≠ q, then eventually q is always in the views that p delivers
View-synchronous group communication
Guarantees provided by view-synchronous group communication:
- Agreement: correct processes deliver the same sequence of views (starting from the view in which they join the group), and the same set of messages in any given view
- Integrity: if a correct process p delivers message m, then it will not deliver m again; furthermore, p ∈ group(m) and the process that sent m is in the view in which p delivers m
- Validity (closed groups): correct processes always deliver the messages that they send
Figure 15.3View-synchronous group communication
[Diagram: four executions over processes p, q and r, each starting in view (p, q, r); p crashes and the survivors deliver view (q, r). Scenarios a and b are allowed; c and d are disallowed.]
In each scenario, p sends a message m while in view (p, q, r) but crashes soon after sending m, while q and r remain correct.
Fault-tolerant services
Correctness criteria for replicated objects
Linearizability: a replicated shared-object service is linearizable if, for any execution, there is some interleaving of the series of operations issued by all the clients that satisfies the following two criteria:
- The interleaved sequence of operations meets the specification of a (single) correct copy of the objects
- The order of operations in the interleaving is consistent with the real times at which the operations occurred in the actual execution
Client 1:            Client 2:
setBalanceB(x, 1)
                     setBalanceA(y, 2)
                     getBalanceA(y) → 2
                     getBalanceA(x) → 0

This execution is not linearizable: setBalanceB(x, 1) completes in real time before getBalanceA(x), yet the read still returns 0.
Sequential consistency: a service is sequentially consistent if, for any execution, there is some interleaving of the series of operations issued by all the clients that satisfies the following two criteria:
- The interleaved sequence of operations meets the specification of a (single) correct copy of the objects
- The order of operations in the interleaving is consistent with the program order in which each individual client executed them
Client 1:            Client 2:
setBalanceB(x, 1)
                     getBalanceA(y) → 0
                     getBalanceA(x) → 0
                     setBalanceA(y, 2)

This execution is sequentially consistent (interleave Client 2's operations before setBalanceB(x, 1)) but not linearizable.
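The two criteria can be checked mechanically for small executions. The sketch below (an illustrative encoding, not from the book) brute-forces every interleaving that preserves each client's program order and replays it against a single correct copy. Both bank-account executions above pass, so sequential consistency alone cannot distinguish them; only the real-time condition rules the first one out as linearizable.

```python
from itertools import permutations

# Each client is a list of operations: ("w", var, value) for a write,
# ("r", var, value) for a read that returned `value`; balances start at 0.
def sequentially_consistent(clients):
    ops = [(c, i) for c, cl in enumerate(clients) for i in range(len(cl))]

    def valid(order):
        next_op = [0] * len(clients)   # enforce each client's program order
        store = {}                     # a single correct copy of the objects
        for c, i in order:
            if i != next_op[c]:
                return False
            next_op[c] += 1
            kind, var, val = clients[c][i]
            if kind == "w":
                store[var] = val
            elif store.get(var, 0) != val:
                return False           # read disagrees with the single copy
        return True

    return any(valid(order) for order in permutations(ops))

example1 = [[("w", "x", 1)],
            [("w", "y", 2), ("r", "y", 2), ("r", "x", 0)]]
example2 = [[("w", "x", 1)],
            [("r", "y", 0), ("r", "x", 0), ("w", "y", 2)]]
ex1_sc = sequentially_consistent(example1)   # True
ex2_sc = sequentially_consistent(example2)   # True
```

Brute force is exponential in the number of operations, but for slide-sized examples it makes the definition concrete.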
Passive (primary-backup) replication
At any time there is a single primary replica manager and one or more secondary replica managers (backups):
- Front ends communicate only with the primary replica manager
- The primary replica manager executes the operations and sends copies of the updated data to the backups
- If the primary fails, one of the backups is promoted to act as the primary
This system obviously implements linearizability if the primary is correct. Why?
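A minimal sketch of the passive model, with illustrative class names (not from the book) and updates reduced to key/value writes:

```python
# Minimal primary-backup sketch: the primary executes the operation,
# pushes the updated data to every backup, then replies.
class ReplicaManager:
    def __init__(self):
        self.state = {}

    def apply(self, update):
        key, value = update
        self.state[key] = value

class Primary(ReplicaManager):
    def __init__(self, backups):
        super().__init__()
        self.backups = backups

    def handle_request(self, update):
        self.apply(update)             # execution at the primary
        for b in self.backups:         # agreement: propagate updated data
            b.apply(update)
        return "ok"                    # response to the front end

backups = [ReplicaManager(), ReplicaManager()]
primary = Primary(backups)
primary.handle_request(("x", 1))
# Every replica now holds the same state, so a promoted backup can
# take over after a primary crash.
```

Because the primary serializes all operations and replies only after the backups hold the updated data, the observed behaviour matches a single correct copy in real-time order.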
Figure 15.4The passive (primary-backup) model for fault tolerance
[Diagram: clients and front ends (C, FE) connect to the primary RM, which propagates updates to two backup RMs.]
Active replication
The replica managers are state machines that play equivalent roles and are organized as a group:
- Front ends multicast their requests to the group of replica managers
- All the replica managers process the request independently but identically, and reply
- If any replica manager crashes, there is no impact on the performance of the service
Active replication achieves sequential consistency.
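The state-machine idea can be sketched as follows (illustrative names; the `multicast` function stands in for a totally ordered multicast, which the real scheme requires):

```python
# Active replication sketch: every replica is a deterministic state
# machine and processes the same request sequence in the same order.
class Replica:
    def __init__(self):
        self.balance = 0

    def process(self, op, amount=0):
        if op == "deposit":
            self.balance += amount
        return self.balance

def multicast(replicas, op, amount=0):
    # Stand-in for totally ordered multicast: deliver to every replica
    # in the same order; the front end keeps the first reply.
    replies = [r.process(op, amount) for r in replicas]
    return replies[0]

group = [Replica(), Replica(), Replica()]
multicast(group, "deposit", 5)
result = multicast(group, "balance")
# A crashed replica could simply be dropped from `group`; the
# remaining replicas still return the same answer.
```

Determinism is essential: if replicas made independent nondeterministic choices (timeouts, random numbers), their states would diverge even under identical request orderings.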
Figure 15.5Active replication
[Diagram: clients and front ends (C, FE) multicast requests to a group of three RMs.]
Case studies of highly available services
Replication techniques can be applied to make services highly available.
Three systems that provide highly available services: the gossip architecture, Bayou and Coda
The gossip architecture
- Data is replicated close to the points where groups of clients need it
- Replica managers exchange messages periodically in order to convey the updates they have each received from clients
The system makes two guarantees:
- Each client obtains a consistent service over time: replica managers only ever provide a client with data that reflects at least the updates that the client has observed so far
- Relaxed consistency between replicas: all replica managers eventually receive all updates, and they apply updates with ordering guarantees that make the replicas sufficiently similar to suit the needs of the application
Figure 15.6Query and update operations in a gossip service
[Diagram: clients issue Query(prev) → Val(new) and Update(prev) → Update id through front ends to the service's RMs, which exchange gossip messages.]
Figure 15.7Front ends propagate their timestamps whenever clients communicate directly
[Diagram: clients carry vector timestamps between front ends; the service's RMs exchange gossip messages.]
In order to control the ordering of operations, each front end keeps a vector timestamp that reflects the version of the latest data values accessed by that front end.
Clients exchange data both by accessing the same gossip service and by communicating directly with one another via their front ends.
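The timestamp bookkeeping reduces to a component-wise maximum. A small sketch (illustrative variable names, one vector entry per replica manager):

```python
# Vector-timestamp merge for a gossip front end: the merged timestamp
# reflects every update known to either party.
def merge(ts_a, ts_b):
    return [max(a, b) for a, b in zip(ts_a, ts_b)]

fe_prev = [2, 0, 1]   # versions this front end has observed so far
rm_new = [2, 3, 1]    # timestamp returned with a query reply

# Consistent service: the RM replies only once its value timestamp
# dominates `prev`; the front end then merges `new` into its own.
fe_prev = merge(fe_prev, rm_new)
```

The same merge is applied when two clients communicate directly: each front end folds the other's timestamp into its own, so neither client can subsequently read state older than what either has seen.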
Figure 15.8A gossip replica manager, showing its main state components
[Diagram: a gossip replica manager's main state components are its value, value timestamp, update log, replica timestamp, executed operation table, and timestamp table. Front ends submit updates (operation id, update, prev); other replica managers exchange gossip messages carrying their replica logs and replica timestamps; stable updates are applied to the value.]
Figure 15.9Committed and tentative updates in Bayou
[Diagram: the update log holds committed updates c0, c1, c2, …, cN followed by tentative updates t0, t1, t2, …, ti, ti+1, …]
Tentative update ti becomes the next committed update and is inserted after the last committed update cN.
Transactions with replicated data
One-copy serializability: the effect of transactions performed by clients on replicated objects should be the same as if they had been performed one at a time on a single set of objects.
Three replication schemes:
- Available copies with validation
- Quorum consensus
- Virtual partition
Architecture for replicated transactions
Figure 15.10Transactions on replicated data
[Diagram: transactions T and U run through two clients' front ends against replica managers holding copies of A (getBalance(A)) and B (deposit(B, 3)).]
In the read-one/write-all scheme, a read request can be performed by a single replica manager, whereas a write request must be performed by all the replica managers in the group.
ACID: Atomicity, Consistency, Isolation, Durability
The two-phase commit protocol:
- One of the RMs is the coordinator
- First phase: the coordinator sends canCommit? to the other RMs
- Second phase: the coordinator sends doCommit or doAbort to the other RMs
Primary copy replication:
- The primary copy is used for transactions
- Concurrency control is applied at the primary
- The primary communicates with the other RMs (backups, to handle failure)
Read-one/write-all:
- Read: at one single RM, which sets a read lock
- Write: at all RMs, each of which sets a write lock
- A read and a write on the same RM require conflicting locks
- One-copy serializability is achieved
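The two phases above can be sketched in a few lines. This is a toy model (illustrative names, no timeouts, logging or crash recovery, which the real protocol needs):

```python
# Toy two-phase commit: the coordinator commits only if every
# participant votes Yes in phase one.
class Participant:
    def __init__(self, vote):
        self.vote = vote        # this RM's answer to canCommit?
        self.outcome = None

    def can_commit(self):
        return self.vote

    def decide(self, decision):
        self.outcome = decision

def two_phase_commit(participants):
    # Phase 1: send canCommit? and collect votes.
    votes = [p.can_commit() for p in participants]
    # Phase 2: doCommit only if every participant voted Yes.
    decision = "doCommit" if all(votes) else "doAbort"
    for p in participants:
        p.decide(decision)
    return decision

rms = [Participant(True), Participant(True), Participant(False)]
outcome = two_phase_commit(rms)   # "doAbort": one RM voted No
```

A single No vote aborts the whole transaction, which is exactly what atomicity across replicas requires.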
Available copies replication
- Designed to allow for some replica managers being temporarily unavailable
- A client's read request on a logical object may be performed by any available replica manager
- A client's update request must be performed by all available replica managers in the group with copies of the object
Figure 15.11Available copies
[Diagram: transactions T and U, via two front ends, perform getBalance and deposit operations on replica managers X, Y, M, N and P holding copies of accounts A and B.]
At X, transaction T has read A and therefore transaction U is not allowed to update A with the deposit operation until transaction T has completed
Figure 15.12Network partition
[Diagram: a network partition separates a client executing withdraw(B, 4) from a client executing deposit(B, 3); replica managers holding copies of B sit on both sides of the partition.]
Replication schemes are designed with the assumption that partitions will eventually be repaired. The replica managers within a single partition must therefore ensure that any requests they execute during a partition will not make the set of replicas inconsistent when the partition is repaired.
Optimistic approaches allow updates in all partitions; after the partition is repaired, any updates that break the one-copy serializability criterion are aborted. Pessimistic approaches limit availability even when there are no partitions, but prevent any inconsistencies from occurring during partitions.
Available copies with validation
Quorum consensus methods
In quorum consensus replication schemes, an update operation on a logical object may be completed successfully by a subgroup of its group of replica managers. Version numbers or timestamps may be used to determine whether copies are up to date.
Page 650: Gifford's quorum consensus examples

                                     Example 1   Example 2   Example 3
Latency        Replica 1                 75          75          75
(milliseconds) Replica 2                 65         100         750
               Replica 3                 65         750         750
Voting         Replica 1                  1           2           1
configuration  Replica 2                  0           1           1
               Replica 3                  0           1           1
Quorum sizes   R                          1           2           1
               W                          1           3           3

Derived performance of file suite:
Read           Latency                   65          75          75
               Blocking probability    0.01      0.0002    0.000001
Write          Latency                   75         100         750
               Blocking probability    0.01      0.0101        0.03
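The voting configurations above must satisfy two overlap constraints: any read quorum must intersect any write quorum (R + W exceeds the total number of votes), and any two write quorums must intersect (2W exceeds the total). A small sketch checking Gifford's three examples:

```python
# Quorum intersection check: reads must see the latest write, and two
# writes must never proceed on disjoint sets of replicas.
def quorums_valid(votes, R, W):
    total = sum(votes)
    return R + W > total and 2 * W > total

examples = {
    1: ([1, 0, 0], 1, 1),   # (per-replica votes, R, W)
    2: ([2, 1, 1], 2, 3),
    3: ([1, 1, 1], 1, 3),
}
checks = {n: quorums_valid(v, R, W) for n, (v, R, W) in examples.items()}
# All three of Gifford's configurations satisfy both conditions.
```

Note that example 1's derived read latency of 65 ms relies on weak representatives (zero-vote replicas that can serve reads locally), which this simple vote check does not model.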
Virtual partition algorithm Combines the quorum consensus approach with the
available copies algorithm
A virtual partition is an abstraction of a real partition and contains a set of replica managers
Figure 15.13Two network partitions
[Diagram: a network partition separates replica managers V and X from Y and Z; transaction T runs on one side.]
Figure 15.14Virtual partition
[Diagram: replica managers X, V, Y and Z; a virtual partition overlays the real network partition.]
Figure 15.15Two overlapping virtual partitions
[Diagram: two overlapping virtual partitions V1 and V2 over replica managers Y, X, V and Z.]
Figure 15.16Creating a virtual partition
Phase 1:
- The initiator sends a Join request to each potential member. The argument of Join is a proposed logical timestamp for the new virtual partition.
- When a replica manager receives a Join request, it compares the proposed logical timestamp with that of its current virtual partition:
  - If the proposed logical timestamp is greater, it agrees to join and replies Yes
  - If it is less, it refuses to join and replies No
Phase 2:
- If the initiator has received sufficient Yes replies to have read and write quora, it may complete the creation of the new virtual partition by sending a Confirmation message to the sites that agreed to join. The creation timestamp and list of actual members are sent as arguments.
- Replica managers receiving the Confirmation message join the new virtual partition and record its creation timestamp and list of actual members.
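The two phases above can be sketched as follows (illustrative class names and quorum sizes; real quora would be counted in votes as in the quorum consensus scheme):

```python
# Sketch of virtual-partition creation: Phase 1 collects Yes/No votes
# on a proposed logical timestamp; Phase 2 confirms if quora are met.
class RM:
    def __init__(self, current_ts):
        self.vp_ts = current_ts   # timestamp of current virtual partition
        self.members = None

    def join(self, proposed_ts):
        # Agree only if the proposal is newer than the current partition.
        return proposed_ts > self.vp_ts

    def confirm(self, ts, members):
        self.vp_ts = ts
        self.members = members

def create_virtual_partition(proposed_ts, rms, read_quorum, write_quorum):
    # Phase 1: send Join(proposed timestamp) and collect Yes replies.
    yes = [rm for rm in rms if rm.join(proposed_ts)]
    # Phase 2: confirm only if the Yes-voters form read and write quora.
    if len(yes) >= read_quorum and len(yes) >= write_quorum:
        for rm in yes:
            rm.confirm(proposed_ts, yes)
        return yes
    return None

rms = [RM(1), RM(1), RM(2)]
vp = create_virtual_partition(3, rms, read_quorum=1, write_quorum=2)
# Timestamp 3 is newer than every RM's current partition, so all
# three agree and the new virtual partition forms.
```

The logical-timestamp comparison is what prevents a stale initiator from dragging replica managers back into an older virtual partition.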