
November 2005 Distributed systems: shared data 1

Distributed Systems:

Shared Data

November 2005 Distributed systems: shared data 2

Overview of chapters

• Introduction
• Co-ordination models and languages
• General services
• Distributed algorithms
• Shared data

– Ch 13 Transactions and concurrency control, 13.1-13.4

– Ch 14 Distributed transactions

– Ch 15 Replication

November 2005 Distributed systems: shared data 3

Overview• Transactions and locks

• Distributed transactions

• Replication

November 2005 Distributed systems: shared data 4

Overview• Transactions

• Nested transactions

• Locks

• Distributed transactions

• Replication

Known material!

November 2005 Distributed systems: shared data 5

Transactions: Introduction

• Environment
  – data partitioned over different servers on different systems
  – a sequence of operations treated as an individual unit
  – long-lived data at servers (cf. databases)

• transactions = an approach to achieve consistency of data in a distributed environment

November 2005 Distributed systems: shared data 6

Transactions: Introduction• Example

Person 1:

Withdraw ( A, 100);

Deposit (B, 100);

Person 2:

Withdraw ( C, 200);

Deposit (B, 200);

A

C

B

November 2005 Distributed systems: shared data 7

Transactions: Introduction• Critical section

– group of instructions executed as an indivisible block with respect to other critical sections

– short duration

• atomic operation (within a server)
  – the operation is free of interference from operations being performed on behalf of other (concurrent) clients
  – concurrency in the server → multiple threads
  – atomic operation ≠ critical section

• transaction

November 2005 Distributed systems: shared data 8

Transactions: Introduction• Critical section• atomic operation• transaction

– group of different operations + properties

– single transaction may contain operations on different servers

– possibly long duration

ACID properties

November 2005 Distributed systems: shared data 9

Transactions: ACID• Properties concerning the sequence of operations

that read or modify shared data:

A tomicity

C onsistency

I solation

D urability

November 2005 Distributed systems: shared data 10

Transactions: ACID• Atomicity or the “all-or-nothing” property

– a transaction

• commits = completes successfully or

• aborts = has no effect at all

– the effect of a committed transaction

• is guaranteed to persist

• can be made visible to other transactions

– transaction aborts can be initiated by

• the system (e.g. when a node fails) or

• a user issuing an abort command

November 2005 Distributed systems: shared data 11

Transactions: ACID• Consistency

– a transaction moves data from one consistent state to another

• Isolation– no interference from other transactions

– intermediate effects invisible to other transactions

The isolation property has 2 parts:

– serializability: running concurrent transactions has the same effect as some serial ordering of the transactions

– Failure isolation: a transaction cannot see the uncommitted effects of another transaction

November 2005 Distributed systems: shared data 12

Transactions: ACID

• Durability– once a transaction commits, the effects of the

transaction are preserved despite subsequent failures

November 2005 Distributed systems: shared data 13

Transactions: Life histories• Transactional service operations

– OpenTransaction() → Trans
  • starts a new transaction
  • returns a unique identifier for the transaction

– CloseTransaction(Trans) → (Commit, Abort)
  • ends the transaction
  • returns Commit if the transaction committed, else Abort

– AbortTransaction(Trans)
  • aborts the transaction

November 2005 Distributed systems: shared data 14

Transactions: Life histories• History 1: success

T := OpenTransaction();

operation;

operation;

….

operation;

CloseTransaction(T);

Operations have

read

or

write

semantics

November 2005 Distributed systems: shared data 15

Transactions: Life histories• History 2: abort by client

T := OpenTransaction();

operation;

operation;

….

operation;

AbortTransaction(T);

November 2005 Distributed systems: shared data 16

Transactions: Life histories• History 3: abort by server

T := OpenTransaction();

operation;

operation;

….

operation;

Server aborts!

Error reported

November 2005 Distributed systems: shared data 17

Transactions: Concurrency• Illustration of well known problems:

– the lost update problem

– inconsistent retrievals

• operations used + implementations

– Withdraw(A, n):
    b := A.read();
    A.write(b - n);

– Deposit(A, n):
    b := A.read();
    A.write(b + n);
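
As an aside, the interleaving illustrated on the next slides can be reproduced with a small Python sketch; the Account behaviour and the starting balances 100/200/300 are taken from the example, everything else is illustrative:

class Account:
    def __init__(self, balance):
        self.balance = balance
    def read(self):
        return self.balance
    def write(self, value):
        self.balance = value

A, B, C = Account(100), Account(200), Account(300)

# Transaction T: A -> B: 4            Transaction U: C -> B: 3
bt = A.read(); A.write(bt - 4)        # T: A becomes 96
bu = C.read(); C.write(bu - 3)        # U: C becomes 297
bt = B.read()                         # T reads B = 200
bu = B.read()                         # U also reads B = 200
B.write(bu + 3)                       # U writes 203
B.write(bt + 4)                       # T overwrites with 204: U's deposit is lost
print(A.balance, B.balance, C.balance)  # 96 204 297 (a serial execution would give B = 207)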

November 2005 Distributed systems: shared data 18

Transactions: Concurrency• The lost update problem:

Transaction T

Withdraw(A,4);

Deposit(B,4);

Transaction U

Withdraw(C,3);

Deposit(B,3);

Interleaved execution of operations on B ?

November 2005 Distributed systems: shared data 19

Transactions: Concurrency• The lost update problem:

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

A.write(bt-4);A: 100

B: 200

C: 300

November 2005 Distributed systems: shared data 20

Transactions: Concurrency• The lost update problem:

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

A.write(bt-4); bu := C.read();

C.write(bu-3);

A: 96

B: 200

C: 300

November 2005 Distributed systems: shared data 21

Transactions: Concurrency• The lost update problem:

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

A.write(bt-4); bu := C.read();

C.write(bu-3);

A: 96

B: 200

C: 297

bt := B.read();

bt=200 bu := B.read();

B.write(bu+3);

November 2005 Distributed systems: shared data 22

Transactions: Concurrency• The lost update problem:

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

A.write(bt-4); bu := C.read();

C.write(bu-3);

A: 96

B: 203

C: 297

bt := B.read();

bt=200 bu := B.read();

B.write(bu+3);B.write(bt+4);

November 2005 Distributed systems: shared data 23

Transactions: Concurrency• The lost update problem:

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

A.write(bt-4); bu := C.read();

C.write(bu-3);

A: 96

B: 204

C: 297

bt := B.read();

bt=200 bu := B.read();

B.write(bu+3);B.write(bt+4);

Correct B = 207!!

November 2005 Distributed systems: shared data 24

Transactions: Concurrency• The inconsistent retrieval problem:

Transaction T

Withdraw(A,50);

Deposit(B,50);

Transaction U

BranchTotal();

November 2005 Distributed systems: shared data 25

Transactions: Concurrency• The inconsistent retrieval problem :

Transaction T (A → B: 50)                   Transaction U (BranchTotal)

bt := A.read();

A.write(bt-50);A: 100

B: 200

C: 300

November 2005 Distributed systems: shared data 26

Transactions: Concurrency• The inconsistent retrieval problem :

Transaction T (A → B: 50)                   Transaction U (BranchTotal)

bt := A.read();

A.write(bt-50);A: 50

B: 200

C: 300

bu := A.read();

bu := bu + B. read();

bu := bu + C.read();

550bt := B.read();

B.write(bt+50);

November 2005 Distributed systems: shared data 27

Transactions: Concurrency• The inconsistent retrieval problem:

Transaction T (A → B: 50)                   Transaction U (BranchTotal)

bt := A.read();

A.write(bt-50);A: 50

B: 250

C: 300

bu := A.read();

bu := bu + B. read();

bu := bu + C.read();

550bt := B.read();

B.write(bt+50);Correct total: 600

November 2005 Distributed systems: shared data 28

Transactions: Concurrency• Illustration of well known problems:

– the lost update problem

– inconsistent retrievals

• elements of solution
  – execute all transactions serially?
    • no concurrency → unacceptable
  – execute transactions in such a way that the overall execution is equivalent to some serial execution
    • sufficient? Yes
    • how? Concurrency control

November 2005 Distributed systems: shared data 29

Transactions: Concurrency• The lost update problem: serially equivalent interleaving

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

A.write(bt-4);A: 100

B: 200

C: 300

November 2005 Distributed systems: shared data 30

Transactions: Concurrency• The lost update problem: serially equivalent interleaving

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

A.write(bt-4); bu := C.read();

C.write(bu-3);

A: 96

B: 200

C: 300

November 2005 Distributed systems: shared data 31

Transactions: Concurrency• The lost update problem: serially equivalent interleaving

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

A.write(bt-4); bu := C.read();

C.write(bu-3);

A: 96

B: 200

C: 297

bt := B.read();

B.write(bt+4);

November 2005 Distributed systems: shared data 32

Transactions: Concurrency• The lost update problem: serially equivalent interleaving

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

A.write(bt-4); bu := C.read();

C.write(bu-3);

A: 96

B: 204

C: 297

bt := B.read();

B.write(bt+4); bu := B.read();

B.write(bu+3);

November 2005 Distributed systems: shared data 33

Transactions: Concurrency• The lost update problem: serially equivalent interleaving

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

A.write(bt-4); bu := C.read();

C.write(bu-3);

A: 96

B: 207

C: 297

bt := B.read();

B.write(bt+4); bu := B.read();

B.write(bu+3);

November 2005 Distributed systems: shared data 34

Transactions: Recovery• Illustration of well known problems:

– a dirty read

– premature write

• operations used + implementations

– Withdraw(A, n):
    b := A.read();
    A.write(b - n);

– Deposit(A, n):
    b := A.read();
    A.write(b + n);

November 2005 Distributed systems: shared data 35

Transactions: Recovery• A dirty read problem:

Transaction T

Deposit(A,4);

Transaction U

Deposit(A,3);

Interleaved execution and abort ?

November 2005 Distributed systems: shared data 36

Transactions: Recovery• A dirty read problem:

Transaction T (Deposit(A,4))                Transaction U (Deposit(A,3))

bt := A.read();

A.write(bt+4);A: 100

November 2005 Distributed systems: shared data 37

Transactions: Recovery• A dirty read problem:

Transaction T (Deposit(A,4))                Transaction U (Deposit(A,3))

bt := A.read();

A.write(bt+4); bu := A.read();

A.write(bu+3);

A: 104

November 2005 Distributed systems: shared data 38

Transactions: Recovery• A dirty read problem:

Transaction T (Deposit(A,4))                Transaction U (Deposit(A,3))

bt := A.read();

A.write(bt+4); bu := A.read();

A.write(bu+3);

A: 107

Commit

Abort

Correct result: A = 103

November 2005 Distributed systems: shared data 39

Transactions: Recovery• Premature write or

Over-writing uncommitted values :

Transaction T

Deposit(A,4);

Transaction U

Deposit(A,3);

Interleaved execution and Abort ?

November 2005 Distributed systems: shared data 40

Transactions: Recovery• Over-writing uncommitted values :

Transaction T (Deposit(A,4))                Transaction U (Deposit(A,3))

bt := A.read();

A.write(bt+4);A: 100

November 2005 Distributed systems: shared data 41

Transactions: Recovery• Over-writing uncommitted values :

Transaction T (Deposit(A,4))                Transaction U (Deposit(A,3))

bt := A.read();

A.write(bt+4); bu := A.read();

A.write(bu+3);

A: 104

November 2005 Distributed systems: shared data 42

Transactions: Recovery• Over-writing uncommitted values :

Transaction T (Deposit(A,4))                Transaction U (Deposit(A,3))

bt := A.read();

A.write(bt+4); bu := A.read();

A.write(bu+3);

A: 107

Abort

Correct result: A = 104

November 2005 Distributed systems: shared data 43

Transactions: Recovery• Illustration of well known problems:

– a dirty read

– premature write

• elements of solution:– Cascading Aborts: a transaction reading uncommitted

data must be aborted if the transaction that modified the data aborts

– to avoid cascading aborts, transactions can only read data written by committed transactions

– undo of write operations must be possible

November 2005 Distributed systems: shared data 44

Transactions: Recovery

• how to preserve data despite subsequent failures?– usually by using stable storage

• two copies of data stored – in separate parts of disks

– not decay related (probability of both parts corrupted is small)

November 2005 Distributed systems: shared data 45

Nested Transactions• Transactions composed of several

sub-transactions

• Why nesting?– Modular approach to structuring transactions in

applications

– means of controlling concurrency within a transaction• concurrent sub-transactions accessing shared data are

serialized

– a finer-grained recovery from failures
  • sub-transactions fail independently

November 2005 Distributed systems: shared data 46

Nested Transactions

• Sub-transactions commit or abort independently– without effect on outcome of other sub-transactions or

enclosing transactions

• effect of sub-transaction becomes durable only when top-level transaction commits

T = Transfer

T1 = Deposit T2 = Withdraw

November 2005 Distributed systems: shared data 47

Concurrency control: locking• Environment

– shared data in a single server (this section)
– many competing clients

• problem:
  – realize transactions
  – maximize concurrency

• solution: serial equivalence

• difference with mutual exclusion?

November 2005 Distributed systems: shared data 48

Concurrency control: locking

• Protocols:
  – Locks
  – Optimistic Concurrency Control
  – Timestamp Ordering

November 2005 Distributed systems: shared data 49

Concurrency control: locking• Example:

– access to shared data within a transaction → lock (= data reserved for …)

– exclusive locks• exclude access by other transactions

November 2005 Distributed systems: shared data 50

Concurrency control: locking• Same example (lost update) with locking

Transaction T

Withdraw(A,4);

Deposit(B,4);

Transaction U

Withdraw(C,3);

Deposit(B,3);

Colour of data show owner of lock

November 2005 Distributed systems: shared data 51

Concurrency control: locking

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

B: 200

C: 300

• Exclusive locks

A: 100

November 2005 Distributed systems: shared data 52

Concurrency control: locking

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

B: 200

C: 300

• Exclusive locks

A: 100A.write(bt-4);

November 2005 Distributed systems: shared data 53

Concurrency control: locking

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

B: 200

C: 300

• Exclusive locks

A: 96A.write(bt-4);

bu := C.read();

November 2005 Distributed systems: shared data 54

Concurrency control: locking

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

B: 200

C: 300

• Exclusive locks

A: 96A.write(bt-4);

bu := C.read();

C.write(bu-3);

November 2005 Distributed systems: shared data 55

Concurrency control: locking

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

B: 200

C: 297

• Exclusive locks

A: 96A.write(bt-4);

bu := C.read();

C.write(bu-3);

bu := B.read();

November 2005 Distributed systems: shared data 56

Concurrency control: locking

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

B: 200

C: 297

• Exclusive locks

A: 96A.write(bt-4);

bu := C.read();

C.write(bu-3);

bt := B.read();bu := B.read();

Wait for T

B.write(bt+4);

November 2005 Distributed systems: shared data 57

Concurrency control: locking

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

B: 204

C: 297

• Exclusive locks

A: 96A.write(bt-4);

bu := C.read();

C.write(bu-3);

bt := B.read();bu := B.read();

Wait for T

B.write(bt+4);

CloseTransaction(T);

November 2005 Distributed systems: shared data 58

Concurrency control: locking

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

B: 204

C: 297

• Exclusive locks

A: 96A.write(bt-4);

bu := C.read();

C.write(bu-3);

bt := B.read();bu := B.read();

Wait for T

B.write(bt+4);

CloseTransaction(T);

November 2005 Distributed systems: shared data 59

Concurrency control: locking

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

B: 204

C: 297

• Exclusive locks

A: 96A.write(bt-4);

bu := C.read();

C.write(bu-3);

bt := B.read();bu := B.read();

B.write(bt+4);

CloseTransaction(T);B.write(bu+3);

November 2005 Distributed systems: shared data 60

Concurrency control: locking

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

B: 207

C: 297

• Exclusive locks

A: 96A.write(bt-4);

bu := C.read();

C.write(bu-3);

bt := B.read();bu := B.read();

B.write(bt+4);

CloseTransaction(T);B.write(bu+3); CloseTransaction(U);

November 2005 Distributed systems: shared data 61

Concurrency control: locking

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

B: 207

C: 297

• Exclusive locks

A: 96A.write(bt-4);

bu := C.read();

C.write(bu-3);

bt := B.read();bu := B.read();

B.write(bt+4);

CloseTransaction(T);B.write(bu+3); CloseTransaction(U);

November 2005 Distributed systems: shared data 62

Concurrency control: locking• Basic elements of protocol

1 serial equivalence• requirements

– all of a transaction’s accesses to a particular data item should be serialized with respect to accesses by other transactions

– all pairs of conflicting operations of 2 transactions should be executed in the same order

• how?– A transaction is not allowed any new locks after it has

released a lock

Two-phase locking

November 2005 Distributed systems: shared data 63

Concurrency control: locking• Two-phase locking

– Growing Phase• new locks can be acquired

– Shrinking Phase• no new locks

• locks are released

November 2005 Distributed systems: shared data 64

Concurrency control: locking• Basic elements of protocol

1 serial equivalence → two-phase locking
2 hide intermediate results

• conflict:
  – release of a lock → access by other transactions becomes possible
  – access should be delayed until the transaction commits/aborts

• how?
  – a new mechanism?
  – (better) release locks only at commit/abort

→ strict two-phase locking
  – locks held till the end of the transaction

November 2005 Distributed systems: shared data 65

Concurrency control: locking• How increase concurrency and preserve

serial equivalence?– Granularity of locks

– Appropriate locking rules

November 2005 Distributed systems: shared data 66

Concurrency control: locking• Granularity of locks

– observations
  • large number of data items on a server
  • a typical transaction needs only a few items
  • conflicts unlikely

– large granularity → limits concurrent access
  • example: all accounts in a branch of a bank are locked together

– small granularity → overhead

November 2005 Distributed systems: shared data 67

Concurrency control: locking• Appropriate locking rules

– when do conflicts occur? → read & write locks

  operation by T    operation by U    conflict
  read              read              No
  read              write             Yes
  write             write             Yes

November 2005 Distributed systems: shared data 68

Concurrency control: locking• Lock compatibility

                     Lock requested
                     Read     Write
  Lock      None     OK       OK
  already   Read     OK       Wait
  set       Write    Wait     Wait

  (for one data item)
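
As a sketch (not part of the slides), the compatibility table can be expressed as a small predicate; a requesting transaction waits whenever it returns False:

def compatible(lock_already_set, lock_requested):
    # lock_already_set: None, "read" or "write"; lock_requested: "read" or "write"
    if lock_already_set is None:
        return True                      # no lock set yet -> OK
    if lock_already_set == "read" and lock_requested == "read":
        return True                      # shared read locks -> OK
    return False                         # read/write and write/write conflict -> wait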

November 2005 Distributed systems: shared data 69

Concurrency control: locking• Strict two-phase locking

– locking• done by server (containing data item)

– unlocking• done by commit/abort of the transactional service

November 2005 Distributed systems: shared data 70

Concurrency control: locking• Use of locks on strict two-phase locking

– when an operation accesses a data item
  • not locked yet → lock set & operation proceeds
  • conflicting lock set by another transaction → transaction must wait till ...
  • non-conflicting lock set by another transaction → lock shared & operation proceeds
  • already locked by the same transaction → lock promoted if necessary & operation proceeds

November 2005 Distributed systems: shared data 71

Concurrency control: locking• Use of locks on strict two-phase locking

– when an operation accesses a data item

– when a transaction is committed/aborted
  → the server unlocks all data items locked for the transaction

November 2005 Distributed systems: shared data 72

Concurrency control: locking• Lock implementation

– lock manager
– manages a table of locks:
  • transaction identifiers
  • identifier of the (locked) data item
  • lock type
  • condition variable (for waiting transactions)
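
A minimal sketch of such a lock manager for exclusive locks only (hypothetical class and method names; a real manager also records lock types and supports promotion):

import threading

class LockManager:
    def __init__(self):
        self.mutex = threading.Lock()
        self.cond = threading.Condition(self.mutex)   # waiting transactions block here
        self.table = {}                               # data item id -> set of transaction ids

    def lock(self, item, trans):
        with self.cond:
            # wait while another transaction holds the lock on this item
            while item in self.table and self.table[item] != {trans}:
                self.cond.wait()
            self.table.setdefault(item, set()).add(trans)

    def unlock_all(self, trans):
        # called at commit/abort (strict two-phase locking)
        with self.cond:
            for item in list(self.table):
                self.table[item].discard(trans)
                if not self.table[item]:
                    del self.table[item]
            self.cond.notify_all()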

November 2005 Distributed systems: shared data 73

Concurrency control: locking• Deadlocks

– a state in which each member of a group of transactions is waiting for some other member to release a lock

no progress possible!

– Example: with read/write locks

November 2005 Distributed systems: shared data 74

Concurrency control: locking• Same example (lost update) with locking

Transaction T

Withdraw(A,4);

Deposit(B,4);

Transaction U

Withdraw(C,3);

Deposit(B,3);

Colour of data show owner of lock

November 2005 Distributed systems: shared data 75

Concurrency control: locking

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

B: 200

C: 300

• Read/write locks

A: 100

November 2005 Distributed systems: shared data 76

Concurrency control: locking

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

B: 200

C: 300

• Read/write locks

A: 100A.write(bt-4);

November 2005 Distributed systems: shared data 77

Concurrency control: locking

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

B: 200

C: 300

• Read/write locks

A: 96A.write(bt-4);

bu := C.read();

November 2005 Distributed systems: shared data 78

Concurrency control: locking

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

B: 200

C: 300

• Read/write locks

A: 96A.write(bt-4);

bu := C.read();

C.write(bu-3);

November 2005 Distributed systems: shared data 79

Concurrency control: locking

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

B: 200

C: 297

• Read/write locks

A: 96A.write(bt-4);

bu := C.read();

C.write(bu-3);

bu := B.read();

November 2005 Distributed systems: shared data 80

Concurrency control: locking

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

B: 200

C: 297

• Read/write locks

A: 96A.write(bt-4);

bu := C.read();

C.write(bu-3);

bt := B.read();bu := B.read();

November 2005 Distributed systems: shared data 81

Concurrency control: locking

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

B: 204

C: 297

• Read/write locks

A: 96A.write(bt-4);

bu := C.read();

C.write(bu-3);

bt := B.read();bu := B.read();

B.write(bt+4);

Wait for release by U

B.write(bu+3);

Wait for release by T

Deadlock!!

November 2005 Distributed systems: shared data 82

Concurrency control: locking• Solutions to the Deadlock problem

– Prevention
  • by locking all data items used by a transaction when it starts
  • by requesting locks on data items in a predefined order

  Evaluation:
  • impossible for interactive transactions
  • reduction of concurrency

November 2005 Distributed systems: shared data 83

Concurrency control: locking• Solutions to the Deadlock problem

– Detection
  • the server keeps track of a wait-for graph
    – lock: an edge is added
    – unlock: an edge is removed
  • the presence of cycles may be checked
    – when an edge is added
    – periodically
  • example on the next slides (see also the cycle-detection sketch below)
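
A sketch of the cycle check on a wait-for graph kept as a dictionary from each waiting transaction to the transactions it waits for (a simplified, hypothetical representation):

def find_cycle(wait_for, start):
    """Depth-first search from `start`; returns a list of transactions forming a cycle, or None."""
    path, seen = [], set()
    def dfs(t):
        if t in path:
            return path[path.index(t):]          # cycle found
        if t in seen:
            return None
        seen.add(t)
        path.append(t)
        for u in wait_for.get(t, ()):
            cycle = dfs(u)
            if cycle:
                return cycle
        path.pop()
        return None
    return dfs(start)

# Situation on the next slides: T waits for U and U waits for T
print(find_cycle({"T": ["U"], "U": ["T"]}, "T"))   # ['T', 'U']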

November 2005 Distributed systems: shared data 84

Concurrency control: locking

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

B: 200

C: 297

• Read/write locks

A: 96A.write(bt-4);

bu := C.read();

C.write(bu-3);

bt := B.read();bu := B.read();

November 2005 Distributed systems: shared data 85

Concurrency control: locking• Wait-for graph

A

B

C

T U

Held by

November 2005 Distributed systems: shared data 86

Concurrency control: locking

Transaction T (A → B: 4)                    Transaction U (C → B: 3)

bt := A.read();

B: 204

C: 297

• Read/write locks

A: 96A.write(bt-4);

bu := C.read();

C.write(bu-3);

bt := B.read();bu := B.read();

B.write(bt+4);

Wait for release by U

B.write(bu+3);

Wait for release by T

November 2005 Distributed systems: shared data 87

Concurrency control: locking• Wait-for graph

A

B

C

T U

Held by

November 2005 Distributed systems: shared data 88

Concurrency control: locking• Wait-for graph

A

B

C

T U

Held by

November 2005 Distributed systems: shared data 89

Concurrency control: locking• Wait-for graph

B

T U

Cycle → deadlock

November 2005 Distributed systems: shared data 90

Concurrency control: locking• Solutions to the Deadlock problem

– Detection• the server keeps track of a wait-for graph

• the presence of cycles must be checked

• once a deadlock detected, the server must select a transaction and abort it (to break the cycle)

• choice of transaction? Important factors– age of transaction

– number of cycles the transaction is involved in

November 2005 Distributed systems: shared data 91

Concurrency control: locking• Solutions to the Deadlock problem

– Timeouts• locks granted for a limited period of time

– within period: lock invulnerable

– after period: lock vulnerable

November 2005 Distributed systems: shared data 92

Overview• Transactions

• Distributed transactions– Flat and nested distributed transactions– Atomic commit protocols– Concurrency in distributed transactions– Distributed deadlocks– Transaction recovery

• Replication

November 2005 Distributed systems: shared data 93

Distributed transactions

• Definition

Any transaction whose activities involve

multiple servers

• Examples

– simple: client accesses several servers

– nested: server accesses several other servers

November 2005 Distributed systems: shared data 94

Distributed transactions

• Examples: simple

client

Y

X

Z

Serial execution of requests on different server

November 2005 Distributed systems: shared data 95

Distributed transactions

• Examples: nesting

Serial or parallel execution of requests on different servers

client

Y

X

Z

P

N

M

TT1

T2

T11

T12

T22

T21

November 2005 Distributed systems: shared data 96

Distributed transactions

• Examples:

Client

X

Y

Z

X

Y

M

NT1

T2

T11

Client

P

TT

12

T21

T22

T

T

November 2005 Distributed systems: shared data 97

Distributed transactions• Commit: agreement between all servers involved

• to commit• to abort

• take one server as coordinator → simple (?) protocol

– single point of failure?

• tasks of the coordinator– keep track of other servers, called workers

– responsible for final decision

November 2005 Distributed systems: shared data 98

Distributed transactions• New service operations:

– AddServer( TransID, CoordinatorID)• called by clients

• first operation on server that has not joined the transaction yet

– NewServer( TransID, WorkerID)• called by new server on the coordinator

• coordinator records ServerID of the worker in its workers list
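
A sketch of the joining logic (hypothetical Server class; in a real system AddServer and NewServer are remote invocations):

class Server:
    def __init__(self, name):
        self.name = name
        self.workers = []                 # filled in when acting as coordinator

    def AddServer(self, trans, coordinator):
        # first request of this transaction at this server: report to the coordinator
        coordinator.NewServer(trans, self)

    def NewServer(self, trans, worker):
        # coordinator records the worker so it can run the commit protocol later
        self.workers.append(worker)

X, Y, Z = Server("X"), Server("Y"), Server("Z")   # X is the coordinator
joined = set()

def first_call(server, trans, coordinator):
    if server is not coordinator and server not in joined:
        server.AddServer(trans, coordinator)
        joined.add(server)

first_call(Z, "T", X)     # steps 3/4 on the next slide: Z joins, X records Z
first_call(Y, "T", X)     # steps 6/7: Y joins, X records Y
print([w.name for w in X.workers])    # ['Z', 'Y']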

November 2005 Distributed systems: shared data 99

Distributed transactions

• Examples: simple

client

Y

X

Z

T := OpenTransaction();

X$Withdraw(A,4);

Z$Deposit(C,4);

Y$Withdraw(B,3);

Z$Deposit(D,3);

CloseTransaction(T);

1. T := X$OpenTransaction();

2. X$Withdraw(A,4);

coordinator

A

B

C,D

November 2005 Distributed systems: shared data 100

Distributed transactions

• Examples: simple

client

Y

X

Z

T := OpenTransaction();

X$Withdraw(A,4);

Z$Deposit(C,4);

Y$Withdraw(B,3);

Z$Deposit(D,3);

CloseTransaction(T);

coordinator

A

B

C,D

3. Z$AddServer(T, X)

4. X$NewServer(T, Z);

5. Z$Deposit(C,4);

worker

November 2005 Distributed systems: shared data 101

Distributed transactions

• Examples: simple

client

Y

X

Z

T := OpenTransaction();

X$Withdraw(A,4);

Z$Deposit(C,4);

Y$Withdraw(B,3);

Z$Deposit(D,3);

CloseTransaction(T);

coordinator

A

B

C,D

6. Y$AddServer(T, X)

7. X$NewServer(T, Y);

8. Y$Withdraw(B,3);

worker

worker

November 2005 Distributed systems: shared data 102

Distributed transactions

• Examples: simple

client

Y

X

Z

T := OpenTransaction();

X$Withdraw(A,4);

Z$Deposit(C,4);

Y$Withdraw(B,3);

Z$Deposit(D,3);

CloseTransaction(T);

coordinator

A

B

C,D

9. Z$Deposit(D, 3);

worker

worker

November 2005 Distributed systems: shared data 103

Distributed transactions

• Examples: simple

client

Y

X

Z

T := OpenTransaction();

X$Withdraw(A,4);

Z$Deposit(C,4);

Y$Withdraw(B,3);

Z$Deposit(D,3);

CloseTransaction(T);

coordinator

A

B

C,D

10. X$CloseTransaction(T);

worker

worker

November 2005 Distributed systems: shared data 104

Distributed transactions

• Examples: data at servers

client

Y

X

Z

coordinator

A

B

C,D

worker

worker

Server   Trans   Role     Coordinator   Workers
X        T       Coord    (here)        Y, Z
Y        T       Worker   X
Z        T       Worker   X

November 2005 Distributed systems: shared data 105

Overview• Transactions

• Distributed transactions– Flat and nested distributed transactions– Atomic commit protocols– Concurrency in distributed transactions– Distributed deadlocks– Transaction recovery

• Replication

November 2005 Distributed systems: shared data 106

Atomic Commit protocol

• Elements of the protocol– each server is allowed to abort its part of a transaction

– if a server votes to commit it must ensure that it will

eventually be able to carry out this commitment

• the transaction must be in the prepared state

• all altered data items must be on permanent storage

– if any server votes to abort, then the decision must be

to abort the transaction

November 2005 Distributed systems: shared data 107

Atomic Commit protocol• Elements of the protocol (cont.)

– the protocol must work correctly, even when

• some servers fail

• messages are lost

• servers are temporarily unable to communicate

November 2005 Distributed systems: shared data 108

Atomic Commit protocol• Protocol:

– Phase 1: voting phase

– Phase 2: completion according to outcome of vote

November 2005 Distributed systems: shared data 109

Atomic Commit protocol• Protocol

Coordinator                                      Worker

Step  Status                                     Step  Status
1     prepared to commit      CanCommit? →
                                                 2     prepared to commit
                              ← Yes
3     committed
      (counting votes)        DoCommit →
                                                 4     committed
                              ← HaveCommitted
      done

November 2005 Distributed systems: shared data 110

Atomic Commit protocol• Protocol: Phase 1 voting phase

– Coordinator: on the CloseTransaction operation
  • sends CanCommit? to each worker
  • behaves as a worker itself in phase 1
  • waits for the replies from the workers

– Worker: when receiving CanCommit?
  • if the worker can commit its part of the transaction
    – saves the data items (prepared state)
    – sends Yes to the coordinator
  • if the worker cannot commit its part
    – sends No to the coordinator
    – clears its data structures, removes locks

November 2005 Distributed systems: shared data 111

Atomic Commit protocol• Protocol: Phase 2

– Coordinator: collecting votes
  • all votes Yes:
    commit the transaction; send DoCommit to the workers   ← point of decision!!
  • one vote No:
    abort the transaction

– Worker: voted Yes, waits for the decision of the coordinator
  • receives DoCommit
    – makes the committed data available; removes locks
  • receives AbortTransaction
    – clears its data structures; removes locks
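
A condensed sketch of the decision logic of both phases (in-memory only; a real implementation writes each status change to the recovery file and handles the timeouts on the next slide):

class Participant:
    def __init__(self, vote):
        self.vote = vote
        self.state = "prepared"
    def can_commit(self):
        return self.vote
    def record(self, state):
        self.state = state
    def do_commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

def two_phase_commit(coordinator, workers):
    # Phase 1: voting
    votes = [w.can_commit() for w in workers] + [coordinator.can_commit()]
    # Phase 2: completion according to the outcome of the vote
    if all(votes):
        coordinator.record("committed")            # the point of decision
        for w in workers:
            w.do_commit()                          # workers make data available, release locks
        return "commit"
    coordinator.record("aborted")
    for w in workers:
        w.abort()                                  # workers clear data structures, release locks
    return "abort"

print(two_phase_commit(Participant(True), [Participant(True), Participant(True)]))   # commit
print(two_phase_commit(Participant(True), [Participant(True), Participant(False)]))  # abort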

November 2005 Distributed systems: shared data 112

Atomic Commit protocol• Timeouts:

– worker did all/some operations and waits for CanCommit?
  • unilateral abort possible
– coordinator waits for the votes of the workers
  • unilateral abort possible
– worker voted Yes and waits for the final decision of the coordinator
  • wait unavoidable
  • extensive delay possible
  • the additional operation GetDecision can be used to obtain the decision from the coordinator or from other workers

November 2005 Distributed systems: shared data 113

Atomic Commit protocol• Performance:

– C → W: CanCommit?       N-1 messages
– W → C: Yes/No           N-1 messages
– C → W: DoCommit         N-1 messages
– W → C: HaveCommitted    N-1 messages

+ (unavoidable) delays possible

November 2005 Distributed systems: shared data 114

Atomic Commit protocol• Nested Transactions

– top level transaction & subtransactions

transaction tree

November 2005 Distributed systems: shared data 115

Atomic Commit protocol

client

Y

X

Z

P

N

M

TT1

T2

T11

T12

T22

T21

November 2005 Distributed systems: shared data 116

Atomic Commit protocol• Nested Transactions

– top level transaction & subtransactions

transaction tree

– coordinator = top level transaction

– subtransaction identifiers• globally unique

• allow derivation of the ancestor transactions (why is this necessary?)

November 2005 Distributed systems: shared data 117

Atomic Commit protocol• Nested Transactions: Transaction IDs

TID in example    actual TID
T                 Z, nz
T1                Z, nz ; X, nx
T11               Z, nz ; X, nx ; M, nm
T2                Z, nz ; Y, ny
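
Ancestors must be derivable because, for example, a worker later has to check whether any ancestor of a provisionally committed subtransaction appears in the abort list (see the flat commit protocol below). A sketch, encoding a TID as a tuple of (server, local id) pairs as in the table above:

def ancestors(tid):
    # tid is a tuple of (server, local_id) pairs, e.g. T11 = (("Z","nz"), ("X","nx"), ("M","nm"))
    return [tid[:i] for i in range(1, len(tid))]   # all proper prefixes, top level first

T11 = (("Z", "nz"), ("X", "nx"), ("M", "nm"))
print(ancestors(T11))   # [(('Z','nz'),), (('Z','nz'), ('X','nx'))]  ->  T and T1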

November 2005 Distributed systems: shared data 118

Atomic Commit protocol• Upon completion of a subtransaction

– independent decision to commit or abort
– commit of a subtransaction
  • only provisional
  • status (including the status of its descendants) reported to its parent
  • final outcome depends on its ancestors
– abort of a subtransaction
  • implies abort of all its descendants
  • abort reported to its parent (always possible?)

November 2005 Distributed systems: shared data 119

Atomic Commit protocol• Data structures

– commit list: list of all committed (sub)transactions

– aborts list: list of all aborted (sub)transactions

– example

November 2005 Distributed systems: shared data 120

Atomic Commit protocol• Data structures: example

1

2

T11

T12

T22

T21

abort (at M)

provisional commit (at N)

provisional commit (at X)

aborted (at Y)

provisional commit (at N)

provisional commit (at P)

T

T

T

November 2005 Distributed systems: shared data 121

Atomic Commit protocol• Data structures: example

T22

P

T1

T2

T12

T21

X

Y

T11

M

N

T

Z

November 2005 Distributed systems: shared data 122

Atomic Commit protocol• Data structures: example

Server Trans ChildTrans

CommitList

AbortList

Z T

X

Y

M

N

P

November 2005 Distributed systems: shared data 123

Atomic Commit protocol• Data structures: example

T

T1

T2

T12

T21

X

Y

T11

M

T22

P

NZ

November 2005 Distributed systems: shared data 124

Atomic Commit protocol• Data structures: example

Server Trans ChildTrans

CommitList

AbortList

Z T T1

X T1

Y

M

N

P

November 2005 Distributed systems: shared data 125

Atomic Commit protocol• Data structures: example

T

T1

T2

T12

T21

X

Y

T11

M

T22

P

NZ

November 2005 Distributed systems: shared data 126

Atomic Commit protocol• Data structures: example

Server Trans ChildTrans

CommitList

AbortList

Z T T1

X T1 T11

Y

M T11

N

P

November 2005 Distributed systems: shared data 127

Atomic Commit protocol• Data structures: example

T

T1

T2

T12

T21

X

Y

T11

M

T22

P

NZ

abort

November 2005 Distributed systems: shared data 128

Atomic Commit protocol• Data structures: example

Server Trans ChildTrans

CommitList

AbortList

Z T T1

X T1 T11 T11

Y

M T11 T11

N

P

November 2005 Distributed systems: shared data 129

Atomic Commit protocol• Data structures: example

T

T1

T2

T12

T21

X

Y

T11

M

T22

P

NZ

abort

November 2005 Distributed systems: shared data 130

Atomic Commit protocol• Data structures: example

Server Trans ChildTrans

CommitList

AbortList

Z T T1

X T1 T11 ,T12 T11

Y

M T11 T11

N T12

P

November 2005 Distributed systems: shared data 131

Atomic Commit protocol• Data structures: example

T

T1

T2

T12

T21

X

Y

T11

M

T22

P

NZ

abort

commit

November 2005 Distributed systems: shared data 132

Atomic Commit protocol• Data structures: example

Server Trans ChildTrans

CommitList

AbortList

Z T T1

X T1 T11 ,T12 T12 T11

Y

M T11 T11

N T12 T12

P

November 2005 Distributed systems: shared data 133

Atomic Commit protocol• Data structures: example

T

T1

T2

T12

T21

X

Y

T11

M

T22

P

NZ

abort

commitcommit

November 2005 Distributed systems: shared data 134

Atomic Commit protocol• Data structures: example

Server Trans ChildTrans

CommitList

AbortList

Z T T1

X T1 T11 ,T12 T12 , T1 T11

Y

M T11 T11

N T12 T12

P

November 2005 Distributed systems: shared data 135

Atomic Commit protocol• Data structures: example

Server Trans ChildTrans

CommitList

AbortList

Z T T1 T12 , T1 T11

X T1 T11 ,T12 T12 , T1 T11

Y

M T11 T11

N T12 T12

P

November 2005 Distributed systems: shared data 136

Atomic Commit protocol• Data structures: example

T

T1

T2

T12

T21

X

Y

T11

M

T22

P

NZ

abort

commitcommit

November 2005 Distributed systems: shared data 137

Atomic Commit protocol• Data structures: example

Server Trans ChildTrans

CommitList

AbortList

Z T T1 ,T2 T12 , T1 T11

X T1 T11 ,T12 T12 , T1 T11

Y T2

M T11 T11

N T12 T12

P

November 2005 Distributed systems: shared data 138

Atomic Commit protocol• Data structures: example

T

T1

T2

T12

T21

X

Y

T11

M

T22

P

NZ

abort

commitcommit

November 2005 Distributed systems: shared data 139

Atomic Commit protocol• Data structures: example

Server Trans ChildTrans

CommitList

AbortList

Z T T1 ,T2 T12 , T1 T11

X T1 T11 ,T12 T12 , T1 T11

Y T2 T21

M T11 T11

N T12 ,T21 T12

P

November 2005 Distributed systems: shared data 140

Atomic Commit protocol• Data structures: example

T

T1

T2

T12

T21

X

Y

T11

M

T22

P

NZ

abort

commitcommit

commit

November 2005 Distributed systems: shared data 141

Atomic Commit protocol• Data structures: example

Server Trans ChildTrans

CommitList

AbortList

Z T T1 ,T2 T12 , T1 T11

X T1 T11 ,T12 T12 , T1 T11

Y T2 T21 T21

M T11 T11

N T12 ,T21 T12 ,T21

P

November 2005 Distributed systems: shared data 142

Atomic Commit protocol• Data structures: example

T

T1

T2

T12

T21

X

Y

T11

M

T22

P

NZ

abort

commitcommit

commit

November 2005 Distributed systems: shared data 143

Atomic Commit protocol• Data structures: example

Server Trans ChildTrans

CommitList

AbortList

Z T T1 ,T2 T12 , T1 T11

X T1 T11 ,T12 T12 , T1 T11

Y T2 T21 ,T22 T21

M T11 T11

N T12 ,T21 T12 ,T21

P T22

November 2005 Distributed systems: shared data 144

Atomic Commit protocol• Data structures: example

T

T1

T2

T12

T21

X

Y

T11

M

T22

P

NZ

abort

commitcommit

commit

commit

November 2005 Distributed systems: shared data 145

Atomic Commit protocol• Data structures: example

Server Trans ChildTrans

CommitList

AbortList

Z T T1 ,T2 T12 , T1 T11

X T1 T11 ,T12 T12 , T1 T11

Y T2 T21 ,T22 T21 ,T22

M T11 T11

N T12 ,T21 T12 ,T21

P T22 T22

November 2005 Distributed systems: shared data 146

Atomic Commit protocol• Data structures: example

T

T1

T2

T12

T21

X

Y

T11

M

T22

P

NZ

abort

commitcommit

commit

commitabort

November 2005 Distributed systems: shared data 147

Atomic Commit protocol• Data structures: example

Server Trans ChildTrans

CommitList

AbortList

Z T T1 ,T2 T12 , T1 T11

X T1 T11 ,T12 T12 , T1 T11

Y T2 T21 ,T22 T21 ,T22 T2

M T11 T11

N T12 ,T21 T12 ,T21

P T22 T22

November 2005 Distributed systems: shared data 148

Atomic Commit protocol• Data structures: example

Server Trans ChildTrans

CommitList

AbortList

Z T T1 ,T2 T12 , T1 T11 ,T2

X T1 T11 ,T12 T12 , T1 T11

Y T2 T21 ,T22 T21 ,T22 T2

M T11 T11

N T12 ,T21 T12 ,T21

P T22 T22

November 2005 Distributed systems: shared data 149

Atomic Commit protocol• Data structures: final data

Server Trans Child Trans

Commit List

Abort List

Z T T12 , T1

N, X T11 ,T2

X T1 T12 , T1 T11

Y

M

N T12 ,T21 T12 ,T21

P T22 T22

November 2005 Distributed systems: shared data 150

Atomic Commit protocol• Algorithm of coordinator (flat protocol)

– Phase 1
  • send CanCommit? to each worker in the commit list, carrying
    – the transaction id T
    – the abort list
  • the coordinator behaves as a worker itself
– Phase 2 (as for non-nested transactions)
  • all votes Yes:
    commit the transaction; send DoCommit to the workers
  • one vote No:
    abort the transaction

November 2005 Distributed systems: shared data 151

Atomic Commit protocol• Algorithm of worker (flat protocol)

– Phase 1 (after receipt of CanCommit)

• at least one (provisionally) committed descendant of top level transaction:

– transactions with ancestors in abort list are aborted

– prepare for commit of other transactions

– send Yes to coordinator

• no (provisionally) committed descendant
  – send No to the coordinator

– Phase 2 (as for non-nested transactions)

November 2005 Distributed systems: shared data 152

Atomic Commit protocol• Algorithm of worker (flat protocol)

– Phase 1 (after receipt of CanCommit)

– Phase 2 voted yes, waits for decision of coordinator• receives DoCommit

– makes committed data available; removes locks

• receives AbortTransaction– clears data structures; removes locks

November 2005 Distributed systems: shared data 153

Atomic Commit protocol• Timeouts:

– the same 3 as above:
  • worker did all/some operations and waits for CanCommit?
  • coordinator waits for the votes of the workers
  • worker voted Yes and waits for the final decision of the coordinator
– a provisionally committed child with an aborted ancestor:
  • does not participate in the algorithm
  • has to make an enquiry itself
  • when?

November 2005 Distributed systems: shared data 154

Atomic Commit protocol• Data structures: final data

Server Trans ChildTrans

CommitList

AbortList

Z T T12 , T1

N, XT11 ,T2

X T1 T12 , T1 T11

Y

M

N T12 ,T21 T12 ,T21

P T22 T22

November 2005 Distributed systems: shared data 155

Atomic Commit protocol• Data structures: example

T

T1

T2

T12

T21

X

Y

T11

M

T22

P

NZ

abort

commitcommit

commit

commitabort

November 2005 Distributed systems: shared data 156

Overview• Transactions

• Distributed transactions– Flat and nested distributed transactions– Atomic commit protocols– Concurrency in distributed transactions– Distributed deadlocks– Transaction recovery

• Replication

November 2005 Distributed systems: shared data 157

Distributed transactions Locking

• Locks are maintained locally (at each server)

– it decides whether• to grant a lock

• to make the requesting transaction wait

– it cannot release the lock until it knows whether the transaction has been

• committed

• aborted

at all servers

– deadlocks can occur

November 2005 Distributed systems: shared data 158

Distributed transactions Locking

• Locking rules for nested transactions
  – a child transaction inherits locks from its parents
  – when a nested transaction commits, its locks are inherited by its parent
  – when a nested transaction aborts, its locks are removed
  – a nested transaction can get a read lock when all the holders of write locks (on that data item) are ancestors
  – a nested transaction can get a write lock when all the holders of read and write locks (on that data item) are ancestors
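
A sketch of the last two rules, reusing the prefix encoding of transaction identifiers from the nested commit protocol (hypothetical helper names):

def ancestors_of(trans):
    # proper prefixes of the (server, local id) identifier
    return {trans[:i] for i in range(1, len(trans))}

def can_read(trans, read_holders, write_holders):
    # read lock allowed when every write-lock holder is an ancestor of trans
    return all(h in ancestors_of(trans) for h in write_holders)

def can_write(trans, read_holders, write_holders):
    # write lock allowed when every read- and write-lock holder is an ancestor of trans
    holders = set(read_holders) | set(write_holders)
    return all(h in ancestors_of(trans) for h in holders)

T1 = (("Z", "nz"), ("X", "nx"))
T11 = T1 + (("M", "nm"),)
T12 = T1 + (("N", "nn"),)
print(can_write(T11, read_holders=[T1], write_holders=[]))   # True: the only holder is an ancestor
print(can_read(T11, read_holders=[], write_holders=[T12]))   # False: a sibling holds a write lock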

November 2005 Distributed systems: shared data 159

Distributed transactions Locking

• Who can access A?

T

T1

T2

T12

T21

X

Y

T11

M

T22

P

NZ

A

November 2005 Distributed systems: shared data 160

Overview• Transactions

• Distributed transactions– Flat and nested distributed transactions– Atomic commit protocols– Concurrency in distributed transactions– Distributed deadlocks– Transaction recovery

• Replication

November 2005 Distributed systems: shared data 161

Distributed deadlocks

• Single-server approaches
  – prevention: difficult to apply
  – timeouts: how to choose a value with variable delays?

→ Detection
  • a global wait-for graph can be constructed from the local ones
  • a cycle in the global graph is possible without a cycle in any local graph

November 2005 Distributed systems: shared data 162

Distributed transactions Deadlocks

[Figure: distributed deadlock — transactions U, V and W hold and wait for objects A, B, C and D spread over servers X, Y and Z; no local wait-for graph contains a cycle, but the global graph does]

November 2005 Distributed systems: shared data 163

Distributed transactions Deadlocks

• Algorithms:

– centralised deadlock detection: not a good idea

• depends on a single server

• cost of transmission of local wait-for graphs

– distributed algorithm:

• complex

• phantom deadlocks

→ edge-chasing approach

November 2005 Distributed systems: shared data 164

Distributed transactions Deadlocks

• Phantom deadlocks

– deadlock detected that is not really a deadlock

– during deadlock detection

• while constructing global wait-for graph

• waiting transaction is aborted

November 2005 Distributed systems: shared data 165

Distributed transactions Deadlocks

• Edge Chasing

– distributed approach to deadlock detection:

• no global wait-for graph is constructed

– servers attempt to find cycles

• by forwarding probes (= messages) that follow

edges of the wait-for graph throughout the

distributed system

November 2005 Distributed systems: shared data 166

Distributed transactions Deadlocks

• Edge Chasing

– three steps:

• initiation: transaction starts waiting

– new probe constructed

• detection: probe received

– extend probe

– check for loop

– forward new probe

• resolution

November 2005 Distributed systems: shared data 167

• Edge Chasing: initiation

– send out probe

when transaction T starts waiting for U (and U

is already waiting for …)

– in case of lock sharing, different probes are

forwarded

Distributed transactions Deadlocks

probe <T → U>

November 2005 Distributed systems: shared data 168

Distributed transactions Deadlocks

C

Z

A

X

B

Y

W

V

U

initiation: probe <W → U>

November 2005 Distributed systems: shared data 169

• Edge Chasing: detection

– when receiving probe

• Check if U is waiting

• if U is waiting for V (and V is waiting)

add V to probe

• check for loop in probe?

– yes → deadlock

– no → forward the new probe

Distributed transactions Deadlocks

probe <T → U>

extended probe <T → U → V>
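
A sketch of the detection step performed when a probe arrives (the local wait-for information and the forwarding of the probe to another server are abstracted into a dictionary and a return value):

def handle_probe(probe, waits_for):
    """probe: list of transactions, e.g. ['W', 'U']; waits_for: local map T -> transaction T waits for."""
    last = probe[-1]
    nxt = waits_for.get(last)           # is the last transaction in the probe itself waiting?
    if nxt is None:
        return None                     # nothing to extend, discard the probe
    new_probe = probe + [nxt]
    if nxt in probe:
        return ("deadlock", new_probe)  # loop in the probe -> deadlock detected
    return ("forward", new_probe)       # forward the new probe to where nxt is blocked

print(handle_probe(["W", "U"], {"U": "V"}))          # ('forward', ['W', 'U', 'V'])
print(handle_probe(["W", "U", "V"], {"V": "W"}))     # ('deadlock', ['W', 'U', 'V', 'W'])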

November 2005 Distributed systems: shared data 170

Distributed transactions Deadlocks

C

Z

A

X

B

Y

W

V

U

initiation: probe <W → U>

<W → U → V>

<W → U → V → W>   → cycle: deadlock detected

November 2005 Distributed systems: shared data 171

• Edge Chasing: resolution

– abort one transaction

– problem?

• Every waiting transaction can initiate deadlock

detection

• detection may happen at different servers

• several transactions may be aborted

– solution: transactions priorities

Distributed transactions Deadlocks

November 2005 Distributed systems: shared data 172

• Edge Chasing: transaction priorities

– assign priority to each transaction, e.g. using

timestamps

– solution of problem above:

• abort transaction with lowest priority

• if different servers detect same cycle, the same

transaction will be aborted

Distributed transactions Deadlocks

November 2005 Distributed systems: shared data 173

• Edge Chasing: transaction priorities

– other improvements

• reduce the number of initiated probe messages
  – detection only initiated when a higher-priority transaction waits for a lower-priority one
• reduce the number of forwarded probe messages
  – probes travel downhill: from transactions with high priority to transactions with lower priority
  – probe queues required; more complex algorithm

Distributed transactions Deadlocks

November 2005 Distributed systems: shared data 174

Overview• Transactions

• Distributed transactions– Flat and nested distributed transactions– Atomic commit protocols– Concurrency in distributed transactions– Distributed deadlocks– Transaction recovery

• Replication

November 2005 Distributed systems: shared data 175

Transactions and failures• Introduction

– Approaches to fault-tolerant systems• replication

– instantaneous recovery from a single fault

– expensive in computing resources

• restart and restore consistent state– less expensive

– requires stable storage

– slow(er) recovery process

November 2005 Distributed systems: shared data 176

Transactions and failures• Overview

– Stable storage

– Transaction recovery

– Recovery of the two-phase commit protocol

November 2005 Distributed systems: shared data 177

Transactions and failures Stable storage

• Ensures that any essential permanent data will be recoverable after any single system failure

• allow system failures: – during a disk write

– damage to any single disk block

• hardware solution → RAID technology
• software solution:

– based on pairs of blocks for same data item

– checksum to determine whether block is good or bad

November 2005 Distributed systems: shared data 178

Transactions and failures Stable storage

• Based on the following invariant:– not more than one block of any pair is bad

– if both are good• same data

• except during execution of write operation

• write operation:
  – maintains the invariant
  – writes to both blocks are done strictly sequentially

• restart of the stable storage server after a crash → recovery procedure to restore the invariant

November 2005 Distributed systems: shared data 179

Transactions and failures Stable storage

• Recovery for a pair:
  – both good and the same → OK
  – one good, one bad → copy the good block over the bad block
  – both good but different → copy one block to the other
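
A sketch of the pair-recovery rule, assuming each block carries a checksum so that a bad (e.g. partially written) block can be recognised:

def recover_pair(b1, b2, is_good):
    # invariant: at most one block of the pair is bad
    if is_good(b1) and is_good(b2):
        return (b1, b2) if b1["data"] == b2["data"] else (b1, b1)   # if different, copy one to the other
    return (b1, b1) if is_good(b1) else (b2, b2)                    # copy the good block over the bad one

good = {"data": "x=42", "checksum": "ok"}
bad  = {"data": "x=4?", "checksum": "BAD"}
print(recover_pair(good, bad, lambda b: b["checksum"] == "ok"))     # both copies become the good block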

November 2005 Distributed systems: shared data 180

Transactions and failures• Overview

– Stable storage

– Transaction recovery

– Recovery of the two-phase commit protocol

November 2005 Distributed systems: shared data 181

Transactions and failures Transaction recovery

• atomic property of transaction implies:– durability

• data items stored in permanent storage

• data will remain available indefinitely

– failure atomicity• effects of transactions are atomic even when servers

fail

• recovery should ensure durability and failure atomicity

November 2005 Distributed systems: shared data 182

Transactions and failures Transaction recovery

• Assumptions about servers– servers keep data in volatile storage

– committed data recorded in a recovery file

• single mechanism: recovery manager– save data items in permanent storage for committed

transactions

– restore the server’s data items after a crash

– reorganize the recovery file to improve performance of recovery

– reclaim storage space in the recovery file

November 2005 Distributed systems: shared data 183

Transactions and failures Transaction recovery

• Elements of the algorithm:
  – each server maintains an intentions list for each of its active transactions: pairs of
    • data item name
    • new value
  – decision of the server: prepared to commit a transaction → intentions list saved in the recovery file (stable storage)
  – server receives DoCommit → commit recorded in the recovery file
  – after a crash: based on the recovery file
    • effects of committed transactions are restored (in the correct order)
    • effects of other transactions are neglected
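
A sketch of the idea with an in-memory stand-in for the recovery file (a real recovery manager appends to stable storage and also logs aborts and checkpoints):

recovery_file = []                                   # append-only log

def prepare(trans, intentions):
    # intentions list: pairs of (data item name, new value), saved when prepared to commit
    recovery_file.append(("prepared", trans, list(intentions)))

def commit(trans):
    recovery_file.append(("committed", trans))       # written when DoCommit is received

def restart():
    # after a crash: replay the effects of committed transactions, in order; ignore the rest
    data, pending = {}, {}
    for record in recovery_file:
        if record[0] == "prepared":
            pending[record[1]] = record[2]
        elif record[0] == "committed":
            for name, value in pending.pop(record[1], []):
                data[name] = value
    return data

prepare("T", [("A", 96), ("B", 204)]); commit("T")
prepare("U", [("C", 297)])                           # U never committed
print(restart())                                     # {'A': 96, 'B': 204}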

November 2005 Distributed systems: shared data 184

Transactions and failures Transaction recovery

• Alternative implementations for recovery file:

– logging technique

– shadow versions

• (see book for details)

November 2005 Distributed systems: shared data 185

Transactions and failures• Overview

– Stable storage

– Transaction recovery

– Recovery of the two-phase commit protocol

November 2005 Distributed systems: shared data 186

Transactions and failures two-phase commit protocol

• Server can fail during commit protocol

• each server keeps its own recovery file

• 2 new status values:– done– uncertain

November 2005 Distributed systems: shared data 187

Transactions and failures two-phase commit protocol

• meaning of status values:– committed:

• coordinator: outcome of votes is yes

• worker: protocol is complete

– done• coordinator: protocol is complete

– uncertain:• worker: voted yes; outcome unknown

November 2005 Distributed systems: shared data 188

Transactions and failures two-phase commit protocol

• Recovery actions: (status@…) in recovery file

– prepared@coordinator
  • no decision was taken before the failure of the server
  • send AbortTransaction to all workers
– aborted@coordinator
  • send AbortTransaction to all workers
– committed@coordinator
  • the decision to commit was taken before the crash
  • send DoCommit to all workers
  • resume the protocol

November 2005 Distributed systems: shared data 189

Transactions and failures two-phase commit protocol

• Recovery actions: (status@…) in recovery file

– committed@worker
  • send HaveCommitted to the coordinator
– uncertain@worker
  • send GetDecision to the coordinator to get the status
– prepared@worker
  • not yet voted Yes
  • unilateral abort possible
– done@coordinator
  • no action required
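
The recovery actions of these two slides amount to a dispatch on the last status value found in the recovery file; a condensed sketch (the actual message sending is abstracted into the returned action):

def recover(role, status):
    actions = {
        ("coordinator", "prepared"):  "send AbortTransaction to all workers",   # no decision yet
        ("coordinator", "aborted"):   "send AbortTransaction to all workers",
        ("coordinator", "committed"): "send DoCommit to all workers; resume protocol",
        ("coordinator", "done"):      "no action required",
        ("worker", "committed"):      "send HaveCommitted to coordinator",
        ("worker", "uncertain"):      "send GetDecision to coordinator",
        ("worker", "prepared"):       "unilateral abort possible",              # has not voted Yes yet
    }
    return actions[(role, status)]

print(recover("worker", "uncertain"))   # send GetDecision to coordinator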

November 2005 Distributed systems: shared data 190

Overview• Transactions

• Distributed transactions

• Replication– System model and group communication– Fault-tolerant services– Highly available services– Transactions with replicated data

November 2005 Distributed systems: shared data 191

Replication• A technique for enhancing services

– Performance enhancement– Increased availability– Fault tolerance

• Requirements– Replication transparency– Consistency

November 2005 Distributed systems: shared data 192

Overview• Transactions

• Distributed transactions

• Replication– System model and group communication– Fault-tolerant services– Highly available services– Transactions with replicated data

November 2005 Distributed systems: shared data 193

System model and group communication

• Architectural model

[Figure: architectural model — clients C send requests and receive replies via front ends FE, which talk to the service's replica managers RM]

November 2005 Distributed systems: shared data 194

System model and group communication

• 5 phases in the execution of a request:

– FE issues requests to one or more RMs

– Coordination: needed to execute requests consistently

• FIFO

• Causal

• Total

– Execution: by all managers, perhaps tentatively

– Agreement

– Response

November 2005 Distributed systems: shared data 195

System model and group communication

• Need for dynamic groups!

• Role of group membership service– Interface for group membership changes: create/destroy groups, add process

– Implementing a failure detector: monitor group members

– Notifying members of group membership changes

– Performing group address expansion

• Handling network partitions: group is

– Reduced: primary-partition

– Split: partitionable

November 2005 Distributed systems: shared data 196

System model and group communication

[Figure: group membership management — processes join, leave or fail; the group membership service manages the process group, performs group address expansion, and multicast communication delivers sends to the group]

November 2005 Distributed systems: shared data 197

System model and group communication

• View delivery

– To all members when a change in membership occurs

– ≠ receiving a view

• Event occurring in a view v(g) at process p

• Basic requirements for view delivery

  – Order: if process p delivers v(g) and then v(g'), then no process delivers v(g') before v(g)

  – Integrity: if p delivers v(g) then p ∈ v(g)

  – Non-triviality: if q joins the group and remains reachable, then eventually q ∈ v(g) at p

November 2005 Distributed systems: shared data 198

System model and group communication

• View-synchronous group communication

– Reliable multicast + handle changing group views

– Guarantees

• Agreement: correct processes deliver the same set of

messages in any given view

• Integrity: if a process delivers m, it will not deliver it again

• Validity: if the system fails to deliver m to q, then the other processes deliver the view v'(g) (= v(g) – {q}) before delivering m

November 2005 Distributed systems: shared data 199

System model and group communication

[Figure: four executions with processes p, q, r in view (p, q, r); p crashes and the survivors deliver view (q, r). Cases a and b are allowed by view-synchronous communication; cases c and d are disallowed.]

November 2005 Distributed systems: shared data 200

Overview• Transactions

• Distributed transactions

• Replication– System model and group communication– Fault-tolerant services– Highly available services– Transactions with replicated data

November 2005 Distributed systems: shared data 201

Fault-tolerant services• Goal: provide a service that is correct

despite up to f process failures

• Assumptions:– Communication reliable– No network partitions

• Meaning of correct in case of replication– Service keeps responding– Clients cannot discover difference with ...

November 2005 Distributed systems: shared data 202

Fault-tolerant services• Naive replication system:

– Clients read and update accounts at local replica manager

– Clients try another replica manager in case of failure

– Replica managers propagate updates in the background

• Example: 2 replica managers: A and B

2 bank accounts: x and y

2 clients: client 1 will use B by preference

client 2 will use A by preference

Client 1                      Client 2
setBalanceB(x,1)
                              setBalanceA(y,2)
                              getBalanceA(y) → 2
                              getBalanceA(x) → 0

Strange behaviour:
Client 2 sees 0 on account x (and NOT 1) but 2 on account y,
although the update of x was done earlier!!

November 2005 Distributed systems: shared data 203

Fault-tolerant services• Correct behaviour?

– Linearizability• Strong requirement

– Sequential consistency• Weaker requirement

November 2005 Distributed systems: shared data 204

Fault-tolerant services• Linearizability

– Terminology:
  • Oij: client i performs operation j
  • sequence of operations by one client: O20, O21, O22, ...
  • virtual interleaving of the operations performed by all clients

– Correctness requirements: there is an interleaved sequence such that ...
  • the interleaved sequence of operations meets the specification of a (single) copy of the objects
  • the order of operations in the interleaving is consistent with the real times at which the operations occurred

– Real time?
  • yes, we prefer up-to-date information
  • requires clock synchronization: difficult

November 2005 Distributed systems: shared data 205

Fault-tolerant services• Sequential consistency

– Correctness requirements: there is an interleaved sequence such that ... (second requirement differs!)
  • the interleaved sequence of operations meets the specification of a (single) copy of the objects
  • the order of operations in the interleaving is consistent with the program order in which each individual client executed them

– Example: sequentially consistent, but not linearizable

  Client 1                      Client 2
  setBalanceB(x,1)
                                getBalanceA(y) → 0
                                getBalanceA(x) → 0
  setBalanceA(y,2)

November 2005 Distributed systems: shared data 206

Fault-tolerant services

• Passive (primary-backup) replication

[Figure: clients C talk via front ends FE to the primary replica manager RM; the primary propagates updates to the backup RMs.]

November 2005 Distributed systems: shared data 207

Fault-tolerant services

• Passive (primary-backup) replication

– Sequence of events for handling a client request (see the sketch below):
  • Request: the FE issues the request, with a unique id, to the primary
  • Coordination: requests are handled atomically and in order; if a request has already been handled, the stored response is re-sent
  • Execution: the primary executes the request and stores the response
  • Agreement: the primary sends the updated state to the backups and waits for their acks
  • Response: the primary responds to the FE; the FE hands the response back to the client

– Correctness: linearizability

– Failures?
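A minimal sketch, in Python, of the request-handling sequence above. The class and function names (Primary, Backup, set_balance) are invented for this illustration and the "agreement" phase is reduced to synchronous method calls; it is not the protocol as specified, only its shape.

class Backup:
    def update(self, new_state, executed):
        # Agreement phase: the backup installs the state sent by the primary.
        self.state = dict(new_state)
        self.executed = dict(executed)
        return "ack"

class Primary:
    def __init__(self, backups):
        self.state = {}            # e.g. account balances
        self.executed = {}         # request id -> stored response (for re-sends)
        self.backups = backups

    def handle(self, request_id, op, *args):
        # Coordination: if the request was already handled, re-send the stored response.
        if request_id in self.executed:
            return self.executed[request_id]
        # Execution: perform the operation and store the response.
        response = op(self.state, *args)
        self.executed[request_id] = response
        # Agreement: send the updated state to all backups and wait for acks.
        acks = [b.update(self.state, self.executed) for b in self.backups]
        assert all(a == "ack" for a in acks)
        # Response: hand the result back to the front end.
        return response

def set_balance(state, account, amount):
    state[account] = amount
    return amount

def get_balance(state, account):
    return state.get(account, 0)

if __name__ == "__main__":
    primary = Primary([Backup(), Backup()])
    primary.handle("c1-0", set_balance, "x", 1)
    print(primary.handle("c2-0", get_balance, "x"))   # 1
    print(primary.handle("c2-0", get_balance, "x"))   # duplicate id -> same stored reply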

November 2005 Distributed systems: shared data 208

Fault-tolerant services

• Passive (primary-backup) replication

– Failures?
  • the primary uses view-synchronous group communication towards the backups
  • linearizability is preserved if
    – the primary is replaced by a unique backup
    – the surviving replica managers agree on which operations had been performed at the replacement point

– Evaluation:
  • non-deterministic behaviour of the primary is supported
  • large overhead: view-synchronous communication is required
  • variation of the model:
    – read requests handled by backups: linearizability → sequential consistency

November 2005 Distributed systems: shared data 209

Fault-tolerant services

• Active replication

[Figure: client requests go through front ends FE, which multicast them to all replica managers RM; every RM processes every request.]

November 2005 Distributed systems: shared data 210

Fault-tolerant services

• Active replication

– Sequence of events for handling a client request (see the sketch below):
  • Request: the FE performs a reliable, totally ordered TO-multicast(g, <m, i>) and waits for the reply
  • Coordination: every correct RM receives the requests in the same order
  • Execution: every correct RM executes the request; all RMs execute all requests in the same order
  • Agreement: not needed
  • Response: every RM returns its result to the FE; when does the FE return the result to the client?
    – crash failures: after the first response from an RM
    – Byzantine failures: after f+1 identical responses from RMs

– Correctness: sequential consistency, not linearizability
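An illustrative Python sketch of active replication, under simplifying assumptions: the loop over RMs stands in for a reliable, totally ordered multicast, and the names (ReplicaManager, FrontEnd, deposit) are invented for this example.

class ReplicaManager:
    def __init__(self):
        self.state = {}

    def execute(self, op, *args):
        # Every correct RM executes every request, in the same order.
        return op(self.state, *args)

class FrontEnd:
    def __init__(self, rms, f=0):
        self.rms = rms
        self.f = f                    # number of Byzantine failures to tolerate

    def request(self, op, *args):
        # Stand-in for TO-multicast: deliver the request to all RMs in order.
        replies = [rm.execute(op, *args) for rm in self.rms]
        if self.f == 0:
            return replies[0]         # crash failures: the first reply suffices
        # Byzantine failures: wait for f+1 identical replies.
        for r in replies:
            if replies.count(r) >= self.f + 1:
                return r
        raise RuntimeError("no f+1 identical replies")

def deposit(state, account, amount):
    state[account] = state.get(account, 0) + amount
    return state[account]

if __name__ == "__main__":
    fe = FrontEnd([ReplicaManager() for _ in range(3)], f=1)
    print(fe.request(deposit, "x", 100))   # 100, agreed by at least 2 of the 3 RMs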

November 2005 Distributed systems: shared data 211

Fault-tolerant services

• Active replication

– Evaluation
  • reliable + totally ordered multicast amounts to solving consensus → requires
    – a synchronous system, or
    – an asynchronous system with failure detectors
    – overhead!
  • more performance:
    – relax the total order when operations commute: result of o1;o2 = result of o2;o1
    – forward read-only requests to a single RM

November 2005 Distributed systems: shared data 212

Overview

• Transactions

• Distributed transactions

• Replication
  – System model and group communication
  – Fault-tolerant services
  – Highly available services
  – Transactions with replicated data

November 2005 Distributed systems: shared data 213

Highly available services

• Goal
  – provide an acceptable level of service
  – use a minimal number of RMs
  – minimize the delay for returning a result
  → weaker consistency

• Overview
  – Coda
  – Gossip architecture
  – Bayou

November 2005 Distributed systems: shared data 214

Highly available services Coda

• Aims: constant data availability
  – better performance, e.g. for bulletin boards, databases, …
  – more fault tolerance with increasing scale
  – support for mobile and portable computers (disconnected operation)

• Approach: AFS + replication

November 2005 Distributed systems: shared data 215

Highly available services Coda

• Design AFS+
  – file volumes replicated on different servers
  – volume storage group (VSG) per file volume
  – available volume storage group (AVSG) per file volume at a particular instant of time
  – a volume is disconnected when its AVSG is empty, due to
    • network failure, partitioning
    • server failures
    • deliberate disconnection of a portable workstation

November 2005 Distributed systems: shared data 216

Highly available services Coda

• Replication and consistency
  – file version
    • integer number associated with a file copy
    • incremented when the file is changed
  – Coda version vector (CVV)
    • array of numbers stored with the file copy on a particular server (holding a volume)
    • one element per server in the VSG

November 2005 Distributed systems: shared data 217

Highly available services Coda

• Replication and consistency: example 1

– File F stored at 3 servers: S1, S2, S3

– Initial values of all CVVs: CVVi = [1,1,1]

– update by C1 at S1 and S2; S3 inaccessible:
  CVV1 = [2,2,1], CVV2 = [2,2,1], CVV3 = [1,1,1]

– network repaired → no conflict detected (CVV3 is dominated) → file copy at S3 updated:
  CVV1 = [2,2,2], CVV2 = [2,2,2], CVV3 = [2,2,2]

November 2005 Distributed systems: shared data 218

Highly available services Coda

• Replication and consistency: example 2

– File F stored at 3 servers: S1, S2, S3

– Initial values of all CVVs: CVVi = [1,1,1]

– update by C1 at S1 and S2; S3 inaccessible:
  CVV1 = [2,2,1], CVV2 = [2,2,1], CVV3 = [1,1,1]

– update by C2 at S3; S1 and S2 inaccessible:
  CVV1 = [2,2,1], CVV2 = [2,2,1], CVV3 = [1,1,2]

– network repaired → conflict detected (neither CVV dominates) → manual intervention or ….
  (both examples are illustrated in the sketch below)
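A small Python sketch of CVV comparison as used in the two examples above. The helper compare() is invented for this illustration and is not actual Coda code; it only shows how version-vector dominance separates a stale copy from a real conflict.

def compare(cvv_a, cvv_b):
    a_ge = all(x >= y for x, y in zip(cvv_a, cvv_b))
    b_ge = all(y >= x for x, y in zip(cvv_a, cvv_b))
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "a dominates"      # copy b is merely stale: bring it up to date
    if b_ge:
        return "b dominates"
    return "conflict"             # neither dominates: manual/application resolution

# Example 1: S3 missed an update -> its copy is stale, no conflict.
print(compare([2, 2, 1], [1, 1, 1]))   # a dominates
# Example 2: updates on both sides of a partition -> conflict.
print(compare([2, 2, 1], [1, 1, 2]))   # conflict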

November 2005 Distributed systems: shared data 219

Highly available services Coda

• Implementation
  – on open
    • select one server from the AVSG
    • check the CVV with all servers in the AVSG
    • files in a replicated volume remain accessible to a client that can access at least one of the replicas
    • load sharing over replicated volumes
  – on close
    • multicast the file to the AVSG
    • update the CVV
  – manual resolution of conflicts might be necessary

November 2005 Distributed systems: shared data 220

Highly available services Coda

• Caching: update semantics
  – successful open:
    (AVSG not empty and latest(F, AVSG, 0)) or
    (AVSG not empty and latest(F, AVSG, T) and lostcallback(AVSG, T) and incache(F)) or
    (AVSG empty and incache(F))

November 2005 Distributed systems: shared data 221

Highly available services Coda

• Caching: update semantics
  – failed open:
    (AVSG not empty and conflict(F, AVSG)) or
    (AVSG empty and not incache(F))
  – successful close:
    (AVSG not empty and updated(F, AVSG)) or
    AVSG empty
  – failed close:
    AVSG not empty and conflict(F, AVSG)

November 2005 Distributed systems: shared data 222

Highly available services Coda

• Caching: cache coherence
  – relevant events to be detected by Venus within T seconds of their occurrence:
    • enlargement of the AVSG
    • shrinking of the AVSG
    • a lost callback event
  – method: probe message to all servers in the VSG of any cached file every T seconds

November 2005 Distributed systems: shared data 223

Highly available services Coda

• Caching: disconnected operation
  – cache replacement policy: e.g. least recently used
  – how to support long disconnection of portables:
    • Venus can monitor file referencing
    • users can specify a prioritised list of files to retain on the local disk
  – reintegration after disconnection:
    • priority for the files on the server
    • client files in conflict are stored on covolumes; the client is informed

November 2005 Distributed systems: shared data 224

Highly available services Coda

• Performance: Coda vs. AFS
  – no replication: no significant difference
  – 3-fold replication & load for 5 users: load +5%
  – 3-fold replication & load for 50 users: load +70% for Coda vs. +16% for AFS
    • difference: replication + tuning?

• Discussion
  – optimistic approach to achieve high availability
  – use of semantics-free conflict detection (except for file directories)

November 2005 Distributed systems: shared data 225

Highly available services Gossip

• Goal of the Gossip architecture
  – framework for implementing highly ...
  – replicate data close to the points where groups of clients need it

• Operations:
  – 2 types:
    • queries: read-only operations
    • updates: change the state (do not read the state)
  – FEs send operations to any RM; selection criterion: available + reasonable response time
  – guarantees

November 2005 Distributed systems: shared data 226

Highly available services Gossip

• Goal of the Gossip architecture

• Operations:
  – 2 types: queries & updates
  – FEs send operations to any RM
  – guarantees:
    • each client obtains a consistent service over time, even when communicating with different RMs
    • relaxed consistency between replicas: weaker than sequential consistency

November 2005 Distributed systems: shared data 227

Highly available services Gossip

• Update ordering:
  – causal: least costly
  – forced (= total + causal)
  – immediate
    • applied in a consistent order relative to any other update at all RMs, independent of the order requested for other updates

• Choice
  – left to the application designer
  – reflects a trade-off between consistency and operation cost
  – has implications for users

November 2005 Distributed systems: shared data 228

Highly available services Gossip

• Update ordering:
  – causal: least costly
  – forced (= total + causal)
  – immediate
    • applied in a consistent order relative to any other update at all RMs, independent of the order requested for other updates

• Example, electronic bulletin board:
  – causal: for posting items
  – forced: for adding a new subscriber
  – immediate: for removing a user

November 2005 Distributed systems: shared data 229

Highly available services Gossip

• Architecture
  – clients + one FE per client
  – timestamps added to operations (see the next figure):
    • prev: reflects the version of the latest data values seen by the client
    • new: reflects the state of the responding RM
  – gossip messages:
    • exchange of operations between RMs

November 2005 Distributed systems: shared data 230

Highly available services Gossip

[Figure: clients issue operations through front ends FE to the gossip service (a group of RMs); a query carries the FE's prev timestamp and returns (val, new), an update carries prev and returns an update id; the RMs exchange gossip messages among themselves.]

November 2005 Distributed systems: shared data 231

Highly available services Gossip

• Sequence of events for handling a client request:
  – Request:
    • the FE sends the request to a single RM
    • for a query operation: the client is blocked
    • for an update operation: the FE returns to the client as soon as possible, then forwards the operation to one RM (or to f+1 RMs for increased reliability)
  – Update response:
    • for an update operation, the RM replies to the FE as soon as it has received the request
  – Coordination:
    • the request is stored in the log (hold-back queue) until it can be executed
    • gossip messages can be exchanged to update the state of the RM
  – Execution:

November 2005 Distributed systems: shared data 232

Highly available services Gossip

• Sequence of events for handling a client request:
  – Request
  – Update response
  – Coordination
  – Execution:
    • the RM executes the request
  – Query response:
    • if the operation is a query, the RM replies at this point
  – Agreement:
    • lazy update propagation between RMs

November 2005 Distributed systems: shared data 233

Highly available services Gossip

• Gossip internals
  – timestamps at FEs
  – state of RMs
  – handling of query operations
  – processing of update operations in causal order
  – forced and immediate update operations
  – gossip messages
  – update propagation

November 2005 Distributed systems: shared data 234

Highly available services Gossip

• Timestamps at FEs (see the sketch below)
  – vector timestamps with one entry per RM
  – the local component is updated at every operation
  – a returned timestamp is merged with the local one
  – client-to-client operations
    • via the FEs
    • include timestamps (to preserve causal order)
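Minimal vector-timestamp helpers, sketched in Python, for the merging just described; the function names merge and leq are invented for this illustration.

def merge(ts_a, ts_b):
    # Component-wise maximum: the merged timestamp covers both histories.
    return [max(a, b) for a, b in zip(ts_a, ts_b)]

def leq(ts_a, ts_b):
    # ts_a <= ts_b: everything ts_a has seen is also covered by ts_b.
    return all(a <= b for a, b in zip(ts_a, ts_b))

# A front end merges the timestamp returned by an RM into its own prev:
prev = [2, 0, 1]
new = [2, 3, 1]                      # returned with a query result
prev = merge(prev, new)
print(prev, leq([2, 0, 1], prev))    # [2, 3, 1] True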

November 2005 Distributed systems: shared data 235

Highly available services Gossip

[Figure: as in the previous architecture figure, but each front end FE now maintains a vector timestamp ts; queries carry prev and return (val, new), updates carry prev and return an update id, and the RMs exchange gossip messages.]

November 2005 Distributed systems: shared data 236

Highly available services Gossip

• State of RMs
  – value: the application state as maintained by the RM
  – value timestamp: associated with the value
  – update log: why a log?
    • an operation cannot be executed yet
    • an operation has to be forwarded to other RMs
  – replica timestamp: reflects the updates received by the RM
  – executed operation table: to prevent re-execution
  – timestamp table:
    • timestamps from the other RMs
    • received with gossip messages
    • used to check for consistency between RMs

November 2005 Distributed systems: shared data 237

Highly available services Gossip

[Figure: components of a Gossip replica manager: value, value timestamp, update log (records <OperationID, Update, Prev, FE>), replica timestamp, executed operation table and timestamp table; updates arrive from FEs, stable updates are applied to the value, and gossip messages are exchanged with the other replica managers.]

November 2005 Distributed systems: shared data 238

Highly available services Gossip

• Handling of query operations (see the sketch below)
  – a query request contains:
    • q = the operation
    • q.prev = the timestamp at the FE
  – if q.prev <= valueTS
    then the operation can be executed
    else the operation is put in a hold-back queue
  – the result contains the new timestamp, which is merged with the timestamp at the FE
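A short Python sketch of this query check; handle_query and the in-memory hold-back queue are invented for the illustration and simplify the real RM processing.

from collections import deque

hold_back = deque()                  # queries that cannot be answered yet

def handle_query(value, value_ts, q, q_prev):
    # Execute only if this RM is at least as up to date as the FE:
    # q.prev <= valueTS, compared component-wise.
    if all(p <= v for p, v in zip(q_prev, value_ts)):
        return q(value), list(value_ts)      # result plus the 'new' timestamp for the FE
    hold_back.append((q, q_prev))            # wait until gossip brings the missing updates
    return None

value, value_ts = {"x": 5}, [1, 0, 0]
print(handle_query(value, value_ts, lambda v: v["x"], [1, 0, 0]))   # (5, [1, 0, 0])
print(handle_query(value, value_ts, lambda v: v["x"], [1, 1, 0]))   # None: held back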

November 2005 Distributed systems: shared data 239

Highly available services Gossip

• Processing of update operations in causal order
  – an update request u contains
    • u.op: specification of the operation (type & parameters)
    • u.prev: the timestamp generated at the FE
    • u.id: a unique identifier
  – handling of u at RMi

November 2005 Distributed systems: shared data 240

Highly available services Gossip

• Processing of update operations in causal order (see the sketch below)
  – update request u = <u.op, u.prev, u.id>
  – handling of u at RMi:
    • has u already been processed?
    • increment the i-th element of replicaTS
    • assign a timestamp ts to u: ts[i] = replicaTS[i], ts[k] = u.prev[k] for k ≠ i
    • add the log record r = <i, ts, u.op, u.prev, u.id> to the log
    • return ts to the FE
    • if the stability condition u.prev <= valueTS is satisfied, then
      value := apply(value, r.u.op)
      valueTS := merge(valueTS, r.ts)
      executed := executed ∪ {r.u.id}
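A Python sketch of these steps, under simplifying assumptions: GossipRM and its fields are invented names, the log is an in-memory list, and operations are plain functions mutating a dictionary.

def merge(a, b):
    return [max(x, y) for x, y in zip(a, b)]

def leq(a, b):
    return all(x <= y for x, y in zip(a, b))

class GossipRM:
    def __init__(self, i, n):
        self.i = i
        self.value = {}
        self.value_ts = [0] * n        # valueTS
        self.replica_ts = [0] * n      # replicaTS
        self.log = []                  # records <i, ts, op, prev, id>
        self.executed = set()          # executed operation table

    def receive_update(self, op, prev, uid):
        if uid in self.executed:                  # u already processed?
            return None
        self.replica_ts[self.i] += 1              # increment the i-th element of replicaTS
        ts = list(prev)
        ts[self.i] = self.replica_ts[self.i]      # ts[i] = replicaTS[i]; ts[k] = prev[k], k != i
        self.log.append((self.i, ts, op, prev, uid))
        self.apply_stable()
        return ts                                 # returned to the FE

    def apply_stable(self):
        # Apply log records whose stability condition prev <= valueTS holds.
        changed = True
        while changed:
            changed = False
            for i, ts, op, prev, uid in self.log:
                if uid not in self.executed and leq(prev, self.value_ts):
                    op(self.value)                               # value := apply(value, op)
                    self.value_ts = merge(self.value_ts, ts)     # valueTS := merge(valueTS, ts)
                    self.executed.add(uid)                       # executed := executed U {id}
                    changed = True

rm = GossipRM(0, 3)
print(rm.receive_update(lambda v: v.update(x=1), [0, 0, 0], "u1"))   # [1, 0, 0]
print(rm.value)                                                      # {'x': 1}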

November 2005 Distributed systems: shared data 241

Highly available services Gossip

• Forced and immediate update operations
  – special treatment
  – forced = total + causal order
    • unique global sequence numbers issued by a primary RM (re-election if necessary)
  – immediate
    • placed in the sequence by the primary RM (as for forced order)
    • additional communication between RMs

November 2005 Distributed systems: shared data 242

Highly available services Gossip

• Gossip messages (see the sketch below)
  – a gossip message m contains
    • m.log = the sender's log
    • m.ts = the sender's replicaTS
  – tasks done by an RM when receiving m:
    • merge m.log with its own log
      – drop a record r when r.ts <= replicaTS
      – replicaTS := merge(replicaTS, m.ts)
    • apply updates that have become stable
    • eliminate records from the log and the executed operation table
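A sketch of gossip-message handling, continuing the illustrative GossipRM class (with its leq, merge and apply_stable helpers) from the update-processing sketch above; garbage collection of the log, which depends on the timestamp table, is deliberately omitted.

def receive_gossip(rm, m_log, m_ts):
    # Merge m.log into our own log, dropping records we already cover.
    for rec in m_log:
        i, ts, op, prev, uid = rec
        if not leq(ts, rm.replica_ts):             # drop r when r.ts <= replicaTS
            rm.log.append(rec)
    rm.replica_ts = merge(rm.replica_ts, m_ts)     # replicaTS := merge(replicaTS, m.ts)
    rm.apply_stable()                              # apply updates that have become stable
    # (elimination of old log records and executed-table entries omitted here)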

November 2005 Distributed systems: shared data 243

Highly available services Gossip

• Update propagation
  – gossip exchange frequency
    • to be tuned per application
  – partner selection policy
    • random
    • deterministic
    • topological
  – how much time will it take for all RMs to receive an update? Depends on:
    • the frequency and duration of network partitions
    • the frequency of exchanging gossip messages
    • the policy for choosing partners

November 2005 Distributed systems: shared data 244

Highly available services Gossip

• Discussion of the architecture
  + clients can continue to obtain a service even during a network partition
  - relaxed consistency guarantees
  - inappropriate for updating replicas in near-real time
  ? scalability? depends upon
    • the number of updates in a gossip message
    • the use of read-only replicas

November 2005 Distributed systems: shared data 245

Highly available services Bayou

• Goal
  – data replication for high availability
  – weaker guarantees than sequential consistency
  – cope with variable connectivity

• Guarantees: every RM eventually
  – receives the same set of updates
  – applies those updates

November 2005 Distributed systems: shared data 246

Highly available services Bayou

• Approach (see the sketch below)
  – use a domain-specific policy for detecting and resolving conflicts
  – every Bayou update contains
    • a dependency check procedure
      – checks for a conflict if the update were applied
    • a merge procedure
      – adapts the update operation so that it
        » achieves something similar
        » passes the dependency check
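A Bayou-style update sketched in Python for a shared diary: the dependency-check/merge structure follows the slide, but the diary example, the booking semantics and all names are invented for this illustration.

def make_booking_update(slot, who, alternatives):
    def dependency_check(diary):
        return slot not in diary            # conflict if the slot is already taken

    def merge(diary):
        # Try to achieve something similar that passes the dependency check.
        for alt in alternatives:
            if alt not in diary:
                diary[alt] = who
                return
        # otherwise record the failure for the user to resolve

    def apply(diary):
        if dependency_check(diary):
            diary[slot] = who
        else:
            merge(diary)
    return apply

diary = {}
make_booking_update("Mon 10:00", "alice", ["Mon 11:00"])(diary)
make_booking_update("Mon 10:00", "bob",   ["Mon 11:00"])(diary)   # conflict -> merged
print(diary)   # {'Mon 10:00': 'alice', 'Mon 11:00': 'bob'}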

November 2005 Distributed systems: shared data 247

Highly available services Bayou

• Committed and tentative updates (see the sketch below)
  – new updates are applied and marked as tentative
  – tentative updates can be undone and reapplied later
  – the final order is decided by a primary replica manager → committed order
  – tentative update ti becomes the next committed update:
    • undo all tentative updates after the last committed update
    • apply ti
    • reapply the other tentative updates
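A minimal Python sketch of this reordering, under the simplifying assumption that replica state can always be rebuilt by replaying updates over an initial state (the update format and function names are invented here).

def reapply(initial_state, committed, tentative):
    # Replica state = initial state + committed updates + tentative updates, in order.
    state = dict(initial_state)
    for u in committed + tentative:
        u(state)
    return state

def commit(committed, tentative, t):
    # t becomes the next committed update; later tentative updates are
    # (conceptually) undone and reapplied after it by rebuilding the state.
    tentative.remove(t)
    committed.append(t)

committed = [lambda s: s.update(a=1)]
tentative = [lambda s: s.update(b=2), lambda s: s.update(c=3)]
commit(committed, tentative, tentative[0])
print(reapply({}, committed, tentative))   # {'a': 1, 'b': 2, 'c': 3}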

November 2005 Distributed systems: shared data 248

Highly available services Bayou

[Figure: a sequence of committed updates c0, c1, c2, ..., cN followed by tentative updates t0, t1, t2, ..., ti, ti+1, ... Tentative update ti becomes the next committed update and is inserted after the last committed update cN.]

November 2005 Distributed systems: shared data 249

Highly available services Bayou

• Discussion
  – makes replication non-transparent to the application
    • exploits the application's semantics to increase availability
    • maintains the replicated state as eventually sequentially consistent
  – disadvantages:
    • increased complexity for the application programmer
    • increased complexity for the user: returned results can change
  – suitable for applications where ...
    • conflicts are rare
    • the underlying data semantics are simple
    • users can cope with tentative information
    e.g. a diary

November 2005 Distributed systems: shared data 250

Overview

• Transactions

• Distributed transactions

• Replication
  – System model and group communication
  – Fault-tolerant services
  – Highly available services
  – Transactions with replicated data

November 2005 Distributed systems: shared data 251

Transactions with replicated data

• Introduction
  – replicated transactional service
    • each data item replicated at a group of servers → replica managers
    • transparency → one-copy serializability
  – why?
    • increase availability
    • increase performance
  – advantages/disadvantages
    • higher performance for read-only requests
    • degraded performance on update requests

November 2005 Distributed systems: shared data 252

Transactions with replicated data

• Architectures for replicated transactions
  – questions:
    • Can a client send requests to any replica manager?
    • How many replica managers are needed for the successful completion of an operation?
    • If one replica manager is addressed, can it delay forwarding until the commit of the transaction?
    • How to carry out two-phase commit?

November 2005 Distributed systems: shared data 253

Transactions with replicated data

• Architectures for replicated transactions
  – replication schemes:
    • Read-one/Write-all
      – a read request can be performed by a single replica manager
      – a write request must be performed by all replica managers
    • Quorum consensus
    • Primary copy

November 2005 Distributed systems: shared data 254

Transactions with replicated data

• Architectures for replicated transactions
  – replication schemes:
    • Read-one/Write-all
    • Quorum consensus (see the sketch below)
      – nr replica managers required to read a data item
      – nw replica managers required to update a data item
      – nr + nw > number of replica managers
      – advantage: fewer managers needed for an update
    • Primary copy
      – all requests directed to a single server
      – slaves can take over when the primary fails
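An illustrative Python sketch of the quorum rule nr + nw > n: with 5 replica managers, nr = 2 and nw = 4, every read quorum overlaps every write quorum, so a read sees the latest version even when one manager is down. The data layout and helper functions are invented for this example, and a real protocol would first read version numbers from a read quorum before writing.

def write(replicas, nw, key, value):
    reachable = [r for r in replicas if r["up"]]
    if len(reachable) < nw:
        raise RuntimeError("write quorum not available")
    # Simplification: take the next version number from the reachable copies.
    version = 1 + max(r["data"].get(key, (0, None))[0] for r in reachable)
    for r in reachable[:nw]:
        r["data"][key] = (version, value)

def read(replicas, nr, key):
    reachable = [r for r in replicas if r["up"]]
    if len(reachable) < nr:
        raise RuntimeError("read quorum not available")
    # Take the value with the highest version number in the read quorum.
    return max(r["data"].get(key, (0, None)) for r in reachable[:nr])[1]

replicas = [{"up": True, "data": {}} for _ in range(5)]
nr, nw = 2, 4                      # nr + nw = 6 > 5
write(replicas, nw, "x", 42)
replicas[0]["up"] = False          # one failure: both quorums still satisfiable
print(read(replicas, nr, "x"))     # 42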

November 2005 Distributed systems: shared data 255

Transactions with replicated data

• Architectures for replicated transactions
  – when are update requests forwarded?
    • Read-one/Write-all
      – as soon as the operation is received (write requests only)
    • Quorum consensus
      – as soon as the operation is received (read + write requests)
    • Primary copy
      – when the transaction commits

November 2005 Distributed systems: shared data 256

Transactions with replicated data

• Architectures for replicated transactions

– Two-phase commit
  • locking (read-one/write-all)
    – read operation: lock on a single replica
    – write operation: locks on all replicas
    – one-copy serializability: assured, as a read and a write have a conflicting lock on at least one replica
  • protocol:

November 2005 Distributed systems: shared data 257

Transactions with replicated data

• Architectures for replicated transactions
  – Two-phase commit
    • locking (read-one/write-all)
    • protocol (see the sketch below):
      – becomes a two-level nested two-phase commit protocol
      – first phase:
        » a worker receives the CanCommit? request,
          passes the request on to all of its replica managers,
          collects their replies and sends a single reply to the coordinator
      – second phase:
        » the same approach for the DoCommit and Abort requests
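A small Python sketch of the worker's role in the two-level nested protocol; the classes, method names and the decision logic inside the replica manager are invented for this illustration.

class ReplicaManager:
    def __init__(self, ok=True):
        self.ok = ok
    def can_commit(self, trans_id):
        return self.ok            # a real RM would check locks, logs, prepared state, ...
    def do_commit(self, trans_id):
        pass

class Worker:
    def __init__(self, replica_managers):
        self.rms = replica_managers

    def can_commit(self, trans_id):
        # First phase: forward CanCommit? to all replica managers,
        # collect their replies and return a single combined vote.
        return all(rm.can_commit(trans_id) for rm in self.rms)

    def do_commit(self, trans_id):
        # Second phase: the same fan-out for DoCommit (and similarly for Abort).
        for rm in self.rms:
            rm.do_commit(trans_id)

worker = Worker([ReplicaManager(), ReplicaManager(ok=False)])
print(worker.can_commit("T1"))    # False: one replica manager cannot commit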

November 2005 Distributed systems: shared data 258

Transactions with replicated data

• Architectures for replicated transactions
  – failure implications?
    • Read-one/Write-all
      – updates become impossible → available copies replication
    • Quorum consensus
      – quorum satisfaction may still be possible
    • Primary copy
      – failure of the primary → elect a new primary?
      – failure of a slave → update delayed

November 2005 Distributed systems: shared data 259

Transactions with replicated data

• Available copies replication
  – updates are performed by all available replica managers (cf. Coda)
  – no failures:
    • local concurrency control → one-copy serializability
  – failures:
    • additional concurrency control is needed
    • example (next slides)

November 2005 Distributed systems: shared data 260

Transactions with replicated data

[Figure: replica managers X and Y hold copies of account A; replica managers M, P and N hold copies of account B; clients C1 and C2 issue transactions T and U. Shown here: the operations GetBalance(A); Deposit(B,3) of one of the transactions.]

November 2005 Distributed systems: shared data 261

Transactions with replicated data

[Figure: the same configuration of replica managers X, Y (account A) and M, P, N (account B); shown here: the other transaction's operations GetBalance(B); Deposit(A,3).]

November 2005 Distributed systems: shared data 262

Transactions with replicated data

[Figure: transaction T (GetBalance(A); Deposit(B,3)) and transaction U (GetBalance(B); Deposit(A,3)) run concurrently on the replicated accounts; U ends up waiting for A at X and T waiting for B at N → deadlock!]

November 2005 Distributed systems: shared data 263

Transactions with replicated data

[Figure: the same transactions T (GetBalance(A); Deposit(B,3)) and U (GetBalance(B); Deposit(A,3)), but now with some replica managers unavailable: both T and U can proceed at the available copies and the conflict between them is not detected.]

November 2005 Distributed systems: shared data 264

Transactions with replicated data

• Available copies replication
  – failures & additional concurrency control (see the sketch below)
    • at commit time it is checked whether
      – servers that were unavailable during the transaction are still unavailable
      – servers that were available during the transaction are still available
      – if not → abort the transaction
    • implications for the two-phase commit protocol?
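A Python sketch of this commit-time check (often called local validation); the bookkeeping sets and the function name are invented for the example.

def local_validation(failed_during_txn, contacted_during_txn, currently_up):
    # Servers that were unavailable during the transaction must still be unavailable,
    # and servers the transaction contacted must still be available; otherwise abort.
    for rm in failed_during_txn:
        if rm in currently_up:
            return "abort"
    for rm in contacted_during_txn:
        if rm not in currently_up:
            return "abort"
    return "commit"   # validation is performed just before two-phase commit

# A transaction read A at X and deposited B at M and P while N was unavailable:
print(local_validation({"N"}, {"X", "M", "P"}, {"X", "M", "P"}))   # commit
print(local_validation({"N"}, {"X", "M", "P"}, {"M", "P", "N"}))   # abort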

November 2005 Distributed systems: shared data 265

Transactions with replicated data

• Replication and network partitions
  – optimistic approach: available copies
    • operations can go on in each partition
    • conflicting transactions must be detected and compensated for
    • conflict detection:
      – for file systems: version vectors (see Coda)
      – for read/write conflicts: ….

November 2005 Distributed systems: shared data 266

Transactions with replicated data

• Replication and network partitions (cont.)
  – pessimistic approach: quorum consensus
    • operations can go on in a single partition only
    • R read quorum & W write quorum
      – W > half of the votes
      – R + W > total number of votes
    • out-of-date copies must be detected
      – version vectors
      – timestamps

November 2005 Distributed systems: shared data 267

Transactions with replicated data

• Replication and network partitions (cont.)
  – quorum consensus + available copies
    • virtual partitions → advantages of both approaches
    • virtual partition = abstraction of a real partition
    • a transaction can operate in a virtual partition with
      – sufficient replica managers to have a read & write quorum
      – available copies used within the transaction
    • the virtual partition changes during the transaction → abort the transaction
    • a member of the virtual partition cannot reach another member → create a new virtual partition

November 2005 Distributed systems: shared data 268

Overview

• Transactions

• Distributed transactions

• Replication
  – System model and group communication
  – Fault-tolerant services
  – Highly available services
  – Transactions with replicated data

November 2005 Distributed systems: shared data 269

Distributed Systems:

Shared Data