Slides for Chapter 15: Replication
From Coulouris, Dollimore and Kindberg, Distributed Systems: Concepts and Design, Edition 4, © Addison-Wesley 2005
Outline
- Introduction
- System model and group communication
- Fault-tolerant services
- Case studies of highly available services: the gossip architecture, Bayou and Coda
- Transactions with replicated data
- Summary
Instructor’s Guide for Coulouris, Dollimore and Kindberg Distributed Systems: Concepts and Design Edn. 4 © Pearson Education 2005
Introduction
Replication: the maintenance of copies of data at multiple computers. A key to the effectiveness of distributed systems, replication can:
- Enhance performance
- Increase availability
- Provide fault tolerance
System model and group communication
System model
Five phases are involved in the performance of a single request upon replicated objects:
- Request: the front end issues the request to one or more replica managers
- Coordination: the replica managers agree on the order in which to apply the request (FIFO, causal, or total ordering)
- Execution
- Agreement
- Response
Group communication
- Multicast communication
- Group membership management
These two are highly interrelated.
Figure 15.1A basic architectural model for the management of replicated data
[Diagram: clients (C) send requests and receive replies through front ends (FE), which communicate with the service's replica managers (RM).]
Front ends communicate with one or more of the replica managers, making replication transparent to clients.
Figure 15.2Services provided for process groups
[Diagram: a process group served by group membership management (Join, Leave, Fail), group address expansion, and multicast communication (send).]
A process outside the group can send a message to the group without knowing the group's membership.
The group communication service has to manage changes in the group’s membership while multicasts take place concurrently
The role of the group membership service:
- Providing an interface for group membership changes: create or destroy groups, add or withdraw a process from a group
- Implementing a failure detector: exclude a process if it is suspected to have failed or become unreachable
- Notifying members of group membership changes
- Performing group address expansion: expanding the group identifier into the current group membership for delivery
IP multicast is a weak case of a group membership service. A full group membership service maintains group views: ordered lists of the current group members with their unique process identifiers.
View delivery
For each group g, the group membership service delivers to any member process a series of views v0(g), v1(g), v2(g), …
A member delivers a view when a membership change occurs. An event occurs in a view v(g) at process p if, at the time of the event's occurrence, p has delivered v(g) but has not yet delivered the next view.
Some basic requirements for view delivery:
- Order: if a process p delivers view v(g) and then view v'(g), then no other process q ≠ p delivers v'(g) before v(g)
- Integrity: if process p delivers view v(g), then p ∈ v(g)
- Non-triviality: if process q joins a group and is or becomes indefinitely reachable from process p ≠ q, then eventually q is always in the views that p delivers
View-synchronous group communication
Guarantees provided by view-synchronous group communication:
- Agreement: correct processes deliver the same sequence of views (starting from the view in which they join the group), and the same set of messages in any given view
- Integrity: if a correct process p delivers message m, then it will not deliver m again; furthermore, p ∈ group(m) and the process that sent m is in the view in which p delivers m
- Validity (closed groups): correct processes always deliver the messages that they send
Figure 15.3View-synchronous group communication
[Diagram: four executions over processes p, q and r, each starting in view (p, q, r); p crashes and the survivors deliver view (q, r). Scenarios a and b are allowed; c and d are disallowed.]
In each scenario, p sends a message m while in view (p, q, r) but crashes soon after sending m, while q and r remain correct.
Fault-tolerant services
Correctness criteria for replicated objects
Linearizability: a replicated shared-object service is linearizable if, for any execution, there is some interleaving of the series of operations issued by all the clients that satisfies the following two criteria:
- The interleaved sequence of operations meets the specification of a (single) correct copy of the objects
- The order of operations in the interleaving is consistent with the real times at which the operations occurred in the actual execution
Client 1:            Client 2:
setBalanceB(x, 1)
                     setBalanceA(y, 2)
                     getBalanceA(y) → 2
                     getBalanceA(x) → 0

This execution is not linearizable: setBalanceB(x, 1) completes in real time before getBalanceA(x), yet the read still returns 0.
Sequential consistency: a service is sequentially consistent if, for any execution, there is some interleaving of the series of operations issued by all the clients that satisfies the following two criteria:
- The interleaved sequence of operations meets the specification of a (single) correct copy of the objects
- The order of operations in the interleaving is consistent with the program order in which each individual client executed them
Client 1:            Client 2:
setBalanceB(x, 1)
                     getBalanceA(y) → 0
                     getBalanceA(x) → 0
                     setBalanceA(y, 2)

This execution is sequentially consistent (interleave Client 2's operations before setBalanceB(x, 1)) but not linearizable.
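The two criteria can be checked mechanically for small executions. The sketch below (an illustrative encoding, not from the book) brute-forces every interleaving that preserves each client's program order and replays it against a single correct copy. Both bank-account executions above pass, so sequential consistency alone cannot distinguish them; only the real-time condition rules the first one out as linearizable.

```python
from itertools import permutations

# Each client is a list of operations: ("w", var, value) for a write,
# ("r", var, value) for a read that returned `value`; balances start at 0.
def sequentially_consistent(clients):
    ops = [(c, i) for c, cl in enumerate(clients) for i in range(len(cl))]

    def valid(order):
        next_op = [0] * len(clients)   # enforce each client's program order
        store = {}                     # a single correct copy of the objects
        for c, i in order:
            if i != next_op[c]:
                return False
            next_op[c] += 1
            kind, var, val = clients[c][i]
            if kind == "w":
                store[var] = val
            elif store.get(var, 0) != val:
                return False           # read disagrees with the single copy
        return True

    return any(valid(order) for order in permutations(ops))

example1 = [[("w", "x", 1)],
            [("w", "y", 2), ("r", "y", 2), ("r", "x", 0)]]
example2 = [[("w", "x", 1)],
            [("r", "y", 0), ("r", "x", 0), ("w", "y", 2)]]
ex1_sc = sequentially_consistent(example1)   # True
ex2_sc = sequentially_consistent(example2)   # True
```

Brute force is exponential in the number of operations, but for slide-sized examples it makes the definition concrete.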
Passive (primary-backup) replication
At any time there is a single primary replica manager and one or more secondary replica managers (backups):
- Front ends communicate only with the primary replica manager
- The primary replica manager executes the operations and sends copies of the updated data to the backups
- If the primary fails, one of the backups is promoted to act as the primary
This system obviously implements linearizability if the primary is correct. Why?
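A minimal sketch of the passive model, with illustrative class names (not from the book) and updates reduced to key/value writes:

```python
# Minimal primary-backup sketch: the primary executes the operation,
# pushes the updated data to every backup, then replies.
class ReplicaManager:
    def __init__(self):
        self.state = {}

    def apply(self, update):
        key, value = update
        self.state[key] = value

class Primary(ReplicaManager):
    def __init__(self, backups):
        super().__init__()
        self.backups = backups

    def handle_request(self, update):
        self.apply(update)             # execution at the primary
        for b in self.backups:         # agreement: propagate updated data
            b.apply(update)
        return "ok"                    # response to the front end

backups = [ReplicaManager(), ReplicaManager()]
primary = Primary(backups)
primary.handle_request(("x", 1))
# Every replica now holds the same state, so a promoted backup can
# take over after a primary crash.
```

Because the primary serializes all operations and replies only after the backups hold the updated data, the observed behaviour matches a single correct copy in real-time order.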
Figure 15.4The passive (primary-backup) model for fault tolerance
[Diagram: clients and front ends (C, FE) connect to the primary RM, which propagates updates to two backup RMs.]
Active replication
The replica managers are state machines that play equivalent roles and are organized as a group:
- Front ends multicast their requests to the group of replica managers
- All the replica managers process the request independently but identically, and reply
- If any replica manager crashes, there is no impact on the performance of the service
Active replication achieves sequential consistency.
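The state-machine idea can be sketched as follows (illustrative names; the `multicast` function stands in for a totally ordered multicast, which the real scheme requires):

```python
# Active replication sketch: every replica is a deterministic state
# machine and processes the same request sequence in the same order.
class Replica:
    def __init__(self):
        self.balance = 0

    def process(self, op, amount=0):
        if op == "deposit":
            self.balance += amount
        return self.balance

def multicast(replicas, op, amount=0):
    # Stand-in for totally ordered multicast: deliver to every replica
    # in the same order; the front end keeps the first reply.
    replies = [r.process(op, amount) for r in replicas]
    return replies[0]

group = [Replica(), Replica(), Replica()]
multicast(group, "deposit", 5)
result = multicast(group, "balance")
# A crashed replica could simply be dropped from `group`; the
# remaining replicas still return the same answer.
```

Determinism is essential: if replicas made independent nondeterministic choices (timeouts, random numbers), their states would diverge even under identical request orderings.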
Figure 15.5Active replication
[Diagram: clients and front ends (C, FE) multicast requests to a group of three RMs.]
Case studies of highly available services
Replication techniques can be applied to make services highly available.
Three systems that provide highly available services: the gossip architecture, Bayou and Coda
The gossip architecture
- Data is replicated close to the points where groups of clients need it
- Replica managers exchange messages periodically in order to convey the updates they have each received from clients
The system makes two guarantees:
- Each client obtains a consistent service over time: replica managers only ever provide a client with data that reflects at least the updates that the client has observed so far
- Relaxed consistency between replicas: all replica managers eventually receive all updates, and they apply updates with ordering guarantees that make the replicas sufficiently similar to suit the needs of the application
Figure 15.6Query and update operations in a gossip service
[Diagram: clients issue Query(prev) → Val(new) and Update(prev) → Update id through front ends to the service's RMs, which exchange gossip messages.]
Figure 15.7Front ends propagate their timestamps whenever clients communicate directly
[Diagram: clients carry vector timestamps between front ends; the service's RMs exchange gossip messages.]
In order to control the ordering of operations, each front end keeps a vector timestamp that reflects the version of the latest data values accessed by that front end.
Clients exchange data both by accessing the same gossip service and by communicating directly with one another via their front ends.
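The timestamp bookkeeping reduces to a component-wise maximum. A small sketch (illustrative variable names, one vector entry per replica manager):

```python
# Vector-timestamp merge for a gossip front end: the merged timestamp
# reflects every update known to either party.
def merge(ts_a, ts_b):
    return [max(a, b) for a, b in zip(ts_a, ts_b)]

fe_prev = [2, 0, 1]   # versions this front end has observed so far
rm_new = [2, 3, 1]    # timestamp returned with a query reply

# Consistent service: the RM replies only once its value timestamp
# dominates `prev`; the front end then merges `new` into its own.
fe_prev = merge(fe_prev, rm_new)
```

The same merge is applied when two clients communicate directly: each front end folds the other's timestamp into its own, so neither client can subsequently read state older than what either has seen.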
Figure 15.8A gossip replica manager, showing its main state components
[Diagram: a gossip replica manager's main state components are its value, value timestamp, update log, replica timestamp, executed operation table, and timestamp table. Front ends submit updates (operation id, update, prev); other replica managers exchange gossip messages carrying their replica logs and replica timestamps; stable updates are applied to the value.]
Figure 15.9Committed and tentative updates in Bayou
[Diagram: the update log holds committed updates c0, c1, c2, …, cN followed by tentative updates t0, t1, t2, …, ti, ti+1, …]
Tentative update ti becomes the next committed update and is inserted after the last committed update cN.
Transactions with replicated data
One-copy serializability: the effect of transactions performed by clients on replicated objects should be the same as if they had been performed one at a time on a single set of objects.
Three replication schemes:
- Available copies with validation
- Quorum consensus
- Virtual partition
Architecture for replicated transactions
Figure 15.10Transactions on replicated data
[Diagram: transactions T and U run through two clients' front ends against replica managers holding copies of A (getBalance(A)) and B (deposit(B, 3)).]
In the read-one/write-all scheme, a read request can be performed by a single replica manager, whereas a write request must be performed by all the replica managers in the group.
ACID: Atomicity, Consistency, Isolation, Durability
The two-phase commit protocol:
- One of the RMs is the coordinator
- First phase: the coordinator sends canCommit? to the other RMs
- Second phase: the coordinator sends doCommit or doAbort to the other RMs
Primary copy replication:
- The primary copy is used for transactions
- Concurrency control is applied at the primary
- The primary communicates with the other RMs (backups, to handle failure)
Read-one/write-all:
- Read: at one single RM, which sets a read lock
- Write: at all RMs, each of which sets a write lock
- A read and a write on the same RM require conflicting locks
- One-copy serializability is achieved
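The two phases above can be sketched in a few lines. This is a toy model (illustrative names, no timeouts, logging or crash recovery, which the real protocol needs):

```python
# Toy two-phase commit: the coordinator commits only if every
# participant votes Yes in phase one.
class Participant:
    def __init__(self, vote):
        self.vote = vote        # this RM's answer to canCommit?
        self.outcome = None

    def can_commit(self):
        return self.vote

    def decide(self, decision):
        self.outcome = decision

def two_phase_commit(participants):
    # Phase 1: send canCommit? and collect votes.
    votes = [p.can_commit() for p in participants]
    # Phase 2: doCommit only if every participant voted Yes.
    decision = "doCommit" if all(votes) else "doAbort"
    for p in participants:
        p.decide(decision)
    return decision

rms = [Participant(True), Participant(True), Participant(False)]
outcome = two_phase_commit(rms)   # "doAbort": one RM voted No
```

A single No vote aborts the whole transaction, which is exactly what atomicity across replicas requires.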
Available copies replication
- Designed to allow for some replica managers being temporarily unavailable
- A client's read request on a logical object may be performed by any available replica manager
- A client's update request must be performed by all available replica managers in the group with copies of the object
Figure 15.11Available copies
[Diagram: transactions T and U, via two front ends, perform getBalance and deposit operations on replica managers X, Y, M, N and P holding copies of accounts A and B.]
At X, transaction T has read A and therefore transaction U is not allowed to update A with the deposit operation until transaction T has completed
Figure 15.12Network partition
[Diagram: a network partition separates a client executing withdraw(B, 4) from a client executing deposit(B, 3); replica managers holding copies of B sit on both sides of the partition.]
Replication schemes are designed with the assumption that partitions will eventually be repaired. The replica managers within a single partition must therefore ensure that any requests they execute during a partition will not make the set of replicas inconsistent when the partition is repaired.
Optimistic approaches allow updates in all partitions; after the partition is repaired, any updates that break the one-copy serializability criterion are aborted. Pessimistic approaches limit availability even when there are no partitions, but prevent any inconsistencies from occurring during partitions.
Available copies with validation
Quorum consensus methods
In quorum consensus replication schemes, an update operation on a logical object may be completed successfully by a subgroup of its group of replica managers. Version numbers or timestamps may be used to determine whether copies are up to date.
Page 650: Gifford's quorum consensus examples

                                     Example 1   Example 2   Example 3
Latency        Replica 1                 75          75          75
(milliseconds) Replica 2                 65         100         750
               Replica 3                 65         750         750
Voting         Replica 1                  1           2           1
configuration  Replica 2                  0           1           1
               Replica 3                  0           1           1
Quorum sizes   R                          1           2           1
               W                          1           3           3

Derived performance of file suite:
Read           Latency                   65          75          75
               Blocking probability    0.01      0.0002    0.000001
Write          Latency                   75         100         750
               Blocking probability    0.01      0.0101        0.03
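The voting configurations above must satisfy two overlap constraints: any read quorum must intersect any write quorum (R + W exceeds the total number of votes), and any two write quorums must intersect (2W exceeds the total). A small sketch checking Gifford's three examples:

```python
# Quorum intersection check: reads must see the latest write, and two
# writes must never proceed on disjoint sets of replicas.
def quorums_valid(votes, R, W):
    total = sum(votes)
    return R + W > total and 2 * W > total

examples = {
    1: ([1, 0, 0], 1, 1),   # (per-replica votes, R, W)
    2: ([2, 1, 1], 2, 3),
    3: ([1, 1, 1], 1, 3),
}
checks = {n: quorums_valid(v, R, W) for n, (v, R, W) in examples.items()}
# All three of Gifford's configurations satisfy both conditions.
```

Note that example 1's derived read latency of 65 ms relies on weak representatives (zero-vote replicas that can serve reads locally), which this simple vote check does not model.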
Virtual partition algorithm Combines the quorum consensus approach with the
available copies algorithm
A virtual partition is an abstraction of a real partition and contains a set of replica managers
Figure 15.13Two network partitions
[Diagram: a network partition separates replica managers V and X from Y and Z; transaction T runs on one side.]
Figure 15.14Virtual partition
[Diagram: replica managers X, V, Y and Z; a virtual partition overlays the real network partition.]
Figure 15.15Two overlapping virtual partitions
[Diagram: two overlapping virtual partitions V1 and V2 over replica managers Y, X, V and Z.]
Figure 15.16Creating a virtual partition
Phase 1:
- The initiator sends a Join request to each potential member. The argument of Join is a proposed logical timestamp for the new virtual partition.
- When a replica manager receives a Join request, it compares the proposed logical timestamp with that of its current virtual partition:
  - If the proposed logical timestamp is greater, it agrees to join and replies Yes
  - If it is less, it refuses to join and replies No
Phase 2:
- If the initiator has received sufficient Yes replies to have read and write quora, it may complete the creation of the new virtual partition by sending a Confirmation message to the sites that agreed to join. The creation timestamp and list of actual members are sent as arguments.
- Replica managers receiving the Confirmation message join the new virtual partition and record its creation timestamp and list of actual members.
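The two phases above can be sketched as follows (illustrative class names and quorum sizes; real quora would be counted in votes as in the quorum consensus scheme):

```python
# Sketch of virtual-partition creation: Phase 1 collects Yes/No votes
# on a proposed logical timestamp; Phase 2 confirms if quora are met.
class RM:
    def __init__(self, current_ts):
        self.vp_ts = current_ts   # timestamp of current virtual partition
        self.members = None

    def join(self, proposed_ts):
        # Agree only if the proposal is newer than the current partition.
        return proposed_ts > self.vp_ts

    def confirm(self, ts, members):
        self.vp_ts = ts
        self.members = members

def create_virtual_partition(proposed_ts, rms, read_quorum, write_quorum):
    # Phase 1: send Join(proposed timestamp) and collect Yes replies.
    yes = [rm for rm in rms if rm.join(proposed_ts)]
    # Phase 2: confirm only if the Yes-voters form read and write quora.
    if len(yes) >= read_quorum and len(yes) >= write_quorum:
        for rm in yes:
            rm.confirm(proposed_ts, yes)
        return yes
    return None

rms = [RM(1), RM(1), RM(2)]
vp = create_virtual_partition(3, rms, read_quorum=1, write_quorum=2)
# Timestamp 3 is newer than every RM's current partition, so all
# three agree and the new virtual partition forms.
```

The logical-timestamp comparison is what prevents a stale initiator from dragging replica managers back into an older virtual partition.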