Slides for Chapter 14: Distributed transactions

From Coulouris, Dollimore and Kindberg

Distributed Systems: Concepts and Design

Edition 4, © Addison-Wesley 2005

Instructor’s Guide for Coulouris, Dollimore and Kindberg Distributed Systems: Concepts and Design Edn. 3

© Addison-Wesley Publishers 2000

Commitment of distributed transactions: introduction

• A distributed transaction is a flat or nested transaction that accesses objects managed by multiple servers.
• Atomicity must still be preserved.
    • A process on one of the servers acts as coordinator; it must ensure the same outcome at all of the servers.
    • The ‘two-phase commit protocol’ (2PC) is the most commonly used protocol for achieving this.



Fig. 14.3 A distributed banking transaction

[Figure: a client runs transaction T against accounts A, B, C and D, which are held at the participant servers BranchX, BranchY and BranchZ. Each participant sends a join request to the coordinator, and each client call carries the transaction identifier to the server, e.g. b.withdraw(3) becomes b.withdraw(T, 3).]

T = openTransaction
    a.withdraw(4);
    c.deposit(4);
    b.withdraw(3);
    d.deposit(3);
closeTransaction

Note: the coordinator is in one of the servers, e.g. BranchX



Fig. 14.4 Operations for the two-phase commit protocol (2PC)

canCommit?(trans) -> Yes / No
    Call from coordinator to participant to ask whether it can commit a transaction. Participant replies with its vote.

doCommit(trans)
    Call from coordinator to participant to tell participant to commit its part of a transaction.

doAbort(trans)
    Call from coordinator to participant to tell participant to abort its part of a transaction.

haveCommitted(trans, participant)
    Call from participant to coordinator to confirm that it has committed the transaction.

getDecision(trans) -> Yes / No
    Call from participant to coordinator to ask for the decision on a transaction after it has voted Yes but has still had no reply after some delay. Used to recover from server crash or delayed messages.



Fig. 14.5 Two-phase commit protocol

Phase 1 (voting phase):

1. The coordinator sends a canCommit? request to each of the participants in the transaction.

2. When a participant receives a canCommit? request it replies with its vote (Yes or No) to the coordinator. Before voting Yes, it prepares to commit by saving objects in permanent storage. If the vote is No, the participant aborts immediately.



Fig. 14.5 Two-phase commit protocol

Phase 2 (completion according to outcome of vote):

3. The coordinator collects the votes (including its own).

    (a) If there are no failures and all the votes are Yes, the coordinator decides to commit the transaction and sends a doCommit request to each of the participants.

    (b) Otherwise the coordinator decides to abort the transaction and sends doAbort requests to all participants that voted Yes.

4. Participants that voted Yes are waiting for a doCommit or doAbort request from the coordinator. When a participant receives one of these messages it acts accordingly and, in the case of commit, makes a haveCommitted call as confirmation to the coordinator.
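
The voting and completion phases can be sketched in Python. This is a minimal in-memory illustration, not the book's implementation: the class names, the `can_commit_flag` switch and the direct method calls standing in for remote invocations are assumptions, and timeouts, the coordinator's own vote, getDecision and permanent storage are all left out.

```python
from enum import Enum


class Vote(Enum):
    YES = "Yes"
    NO = "No"


class Participant:
    """One server's part in a transaction (in-memory stand-in, hypothetical)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit_flag = can_commit  # assumption: decides the vote
        self.state = "active"

    def can_commit(self, trans):
        # Phase 1: a real server would save objects to permanent storage
        # before voting Yes; a No vote aborts immediately.
        if self.can_commit_flag:
            self.state = "prepared"
            return Vote.YES
        self.state = "aborted"
        return Vote.NO

    def do_commit(self, trans, coordinator):
        self.state = "committed"
        coordinator.have_committed(trans, self)  # confirmation (step 4)

    def do_abort(self, trans):
        self.state = "aborted"


class Coordinator:
    def __init__(self, participants):
        self.participants = participants
        self.confirmed = []

    def have_committed(self, trans, participant):
        self.confirmed.append(participant.name)

    def run(self, trans):
        # Phase 1: send canCommit? to every participant and collect votes.
        votes = {p: p.can_commit(trans) for p in self.participants}
        # Phase 2: commit only if every vote is Yes.
        if all(v is Vote.YES for v in votes.values()):
            for p in self.participants:
                p.do_commit(trans, self)
            return "committed"
        # Otherwise abort; only the Yes voters still need to be told.
        for p, v in votes.items():
            if v is Vote.YES:
                p.do_abort(trans)
        return "aborted"
```

With two willing participants `Coordinator(...).run("T")` returns `"committed"`; a single No vote makes every participant end in the aborted state.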



Summary of 2PC

• A distributed transaction involves several different servers.
    • A nested transaction structure allows additional concurrency and independent committing by the servers in a distributed transaction.
• Atomicity requires that the servers participating in a distributed transaction either all commit it or all abort it.
    • Atomic commit protocols are designed to achieve this effect, even if servers crash during their execution.
• The 2PC protocol allows a server to abort unilaterally.
    • It includes timeout actions to deal with delays due to servers crashing.
    • The 2PC protocol can take an unbounded amount of time to complete but is guaranteed to complete eventually.



14.5 Distributed deadlocks

• Single-server transactions can experience deadlocks.
    • Prevent them, or detect and resolve them.
    • Use of timeouts is clumsy; detection is preferable.
    • Detection uses wait-for graphs.
• Distributed transactions lead to distributed deadlocks.
    • In theory, a global wait-for graph can be constructed from the local ones.
    • A cycle in the global wait-for graph that is not in the local ones is a distributed deadlock.
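
As an illustration of the last point, the sketch below merges per-server wait-for graphs and searches the result for a cycle. The dictionary representation (each transaction mapped to the set of transactions it waits for), the function names and the transaction names T, U, V are assumptions made for this example; a real system would not normally ship whole graphs to one place.

```python
def merge_graphs(*local_graphs):
    """Union several local wait-for graphs into one global graph."""
    merged = {}
    for graph in local_graphs:
        for trans, waits_for in graph.items():
            merged.setdefault(trans, set()).update(waits_for)
    return merged


def find_cycle(wait_for):
    """Return one cycle as a list of transactions, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    colour = {}

    def visit(node, path):
        colour[node] = GREY
        path.append(node)
        for nxt in wait_for.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:              # back edge: cycle found
                return path[path.index(nxt):]
            if state == WHITE:
                cycle = visit(nxt, path)
                if cycle:
                    return cycle
        path.pop()
        colour[node] = BLACK
        return None

    for node in list(wait_for):
        if colour.get(node, WHITE) == WHITE:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None


# Hypothetical local graphs at three servers: no server sees a cycle alone.
server_x = {"T": {"U"}}
server_y = {"U": {"V"}}
server_z = {"V": {"T"}}
global_graph = merge_graphs(server_x, server_y, server_z)
```

Here `find_cycle` finds nothing in any single local graph but reports a cycle through T, U and V in `global_graph`, which is exactly the distributed-deadlock case.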



14.6 Transaction recovery

• Atomicity property of transactions: durability and failure atomicity.
    • Durability requires that objects are saved in permanent storage and will be available indefinitely.
    • Failure atomicity requires that the effects of transactions are atomic even when the server crashes.
• Recovery is concerned with ensuring that a server’s objects are durable and that the service provides failure atomicity.
    • For simplicity we assume that when a server is running, all of its objects are in volatile memory and all of its committed objects are in a recovery file in permanent storage.
    • Recovery consists of restoring the server with the latest committed versions of all of its objects from its recovery file.



Recovery manager

• The task of the Recovery Manager (RM) is:
    • to save objects in permanent storage (in a recovery file) for committed transactions;
    • to restore the server’s objects after a crash;
    • to reorganize the recovery file to improve performance;
    • to reclaim storage space (in the recovery file).
• Media failures, i.e. disk failures affecting the recovery file, require another copy of the recovery file on an independent disk.



Fig. 14.18 Types of entry in a recovery file

Type of entry | Description of contents of entry
Object | A value of an object.
Transaction status | Transaction identifier, transaction status (prepared, committed, aborted) and other status values used for the two-phase commit protocol.
Intentions list | Transaction identifier and a sequence of intentions, each of which consists of <identifier of object>, <position in recovery file of value of object>.
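
A rough sketch of how these three entry types could be replayed after a crash. The tuple layout, the `recover` function and the sample values are assumptions made for illustration; object entries carry an object identifier here for readability, and the log is replayed forwards for simplicity rather than following any particular recovery procedure.

```python
def recover(log):
    """Rebuild the committed object values from a recovery file.

    `log` is a list of entries (hypothetical layout):
      ("object", object_id, value)
      ("intentions", trans_id, [(object_id, position_in_log), ...])
      ("status", trans_id, "prepared" | "committed" | "aborted")
    """
    intentions = {}  # trans_id -> its intentions list
    objects = {}     # object_id -> latest committed value
    for entry in log:
        kind = entry[0]
        if kind == "intentions":
            _, trans_id, items = entry
            intentions[trans_id] = items
        elif kind == "status" and entry[2] == "committed":
            # Apply the values this transaction intended to write.
            for object_id, position in intentions.get(entry[1], []):
                objects[object_id] = log[position][2]
    return objects


# Hypothetical recovery file: T committed, U only prepared before the crash.
log = [
    ("object", "A", 96),                          # position 0
    ("object", "B", 203),                         # position 1
    ("intentions", "T", [("A", 0), ("B", 1)]),
    ("status", "T", "prepared"),
    ("status", "T", "committed"),
    ("object", "C", 278),                         # position 5
    ("intentions", "U", [("C", 5)]),
    ("status", "U", "prepared"),
]
```

Only T's values survive recovery in this example; U's tentative value for C is ignored because no committed status entry was written for U.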



Logging - reorganizing the recovery file

• The RM is responsible for reorganizing its recovery file.
• Checkpointing:
    • the process of writing to a new recovery file the current committed values of a server’s objects, together with the transaction status entries and intentions lists of transactions not yet fully resolved, including information related to 2PC (see later)
    • checkpointing makes recovery faster and saves disk space
    • done after recovery and from time to time



Recovery of the 2PC

Role | Status | Action of RM
Coordinator | prepared or aborted | No decision had been reached before the server failed. Sends abortTransaction to all participants on its list and writes the status aborted to its recovery file.
Coordinator | committed | A decision to commit had been reached before the server failed. Sends a doCommit to all participants on its list and resumes 2PC at step 4.
Participant | committed | The participant sends a haveCommitted msg to the coordinator. This allows the coordinator to discard info about the transaction at the next checkpoint.
Participant | uncertain | The participant failed before it knew the outcome of the transaction. It cannot determine the status of the transaction until the coordinator informs it of the decision. Sends a getDecision to the coordinator to determine the status of the transaction. When the reply is received, it commits or aborts accordingly.
Participant | prepared | The participant has not yet voted; it can abort the transaction.
Coordinator | done | No action is required.

Time and Global States

ECEN5053 Software Engineering of Distributed Systems

University of Colorado, Boulder



Topics

• Clock synchronization
• Logical clocks
• Global State




How processes can synchronize

Multiple processes must be able to cooperate in granting each other temporary exclusive access to a resource

Also, multiple processes may need to agree on the ordering of events, such as whether message m1 from process P was sent before or after message m2 from process Q.



Centralized system

• Time is unambiguous.
• If a process wants to know the time, it makes a system call and finds out.
• If process A asks for the time and gets it, and then process B asks for the time and gets it, the time that B was told will be later than the time that A was told.

Simple, no?



Physical Clocks

• Physical computer clocks are not clocks; they are timers.
    • A quartz crystal oscillates at a well-defined frequency that depends on its physical properties.
    • Two registers: a counter and a holding register.
    • Each oscillation decrements the counter by one.
    • When the counter reaches zero, the timer generates an interrupt and the counter is reloaded from the holding register.
    • Each interrupt is called a clock tick.
• The interrupt service procedure adds 1 to the time stored in memory, so the software clock is kept up to date.



The one and the many

• What if the clock is “off” by a little?
    • All processes on a single machine use the same clock, so they will still be internally consistent.
    • What matters is relative time.
• It is impossible to guarantee that crystals in different computers run at exactly the same frequency.
    • The software clocks gradually get out of synch: clock skew.
• A program that expects time to be independent of the machine on which it is run ... fails.



Hey buddy, can you spare me a second?

• To provide UTC (Coordinated Universal Time) to those who need precise time, NIST operates a short wave radio station, WWV, from Fort Collins, CO.
• WWV broadcasts a short pulse at the start of each second.
• There are stations in other countries, plus satellites.
• Using either short wave or satellite services requires accurate knowledge of the relative position of the sender and receiver. Why?



To WWV or not to WWV

• If one computer has a WWV receiver, the goal is keeping all the others synchronized to it.
• If no machines have WWV receivers, each machine keeps track of its own time.
    • Goal: keep all machines together as well as possible.
• There are many algorithms.



Underlying model for synchronization models

• Each machine has a timer that interrupts H times a second.
• The interrupt handler adds 1 to a software clock that keeps track of the number of ticks since some agreed-upon time in the past.
• Call the value of the clock C.
• Notationally, when UTC time is t, the value of the clock on machine p is Cp(t).
• In a perfect world, Cp(t) = t for all p and all t.



Back to reality

• Theoretically, a timer with H = 60 should generate 216,000 ticks per hour.
• The relative error is about 10^-5, meaning a particular machine gets a value in the range 215,998 to 216,002 ticks per hour.
• There is a constant ρ, called the maximum drift rate; a timer is working within its specification if it runs at a rate between 1 - ρ and 1 + ρ times the “perfect” rate.
• If two clocks drift in opposite directions, then at a time Δt after they were synchronized they may be as much as 2ρΔt apart.
• To differ by no more than δ, clocks must be resynchronized at least every δ/(2ρ) seconds.
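
The drift arithmetic can be checked with a few lines of Python. The function and parameter names below are made up for this illustration; the formulas are the ones stated in the bullets.

```python
def tick_range(h, seconds, max_drift_rate):
    """Lowest and highest tick counts a timer within spec can produce."""
    nominal = h * seconds
    return nominal * (1 - max_drift_rate), nominal * (1 + max_drift_rate)


def resync_interval(max_skew, max_drift_rate):
    """Seconds between resynchronizations so two clocks that drift in
    opposite directions never differ by more than max_skew seconds."""
    return max_skew / (2 * max_drift_rate)


low, high = tick_range(h=60, seconds=3600, max_drift_rate=1e-5)
interval = resync_interval(max_skew=0.001, max_drift_rate=1e-5)
```

With H = 60 and a drift rate of 10^-5 the hourly range is roughly 215,998 to 216,002 ticks, and keeping two such clocks within 1 ms of each other requires resynchronizing about every 50 seconds.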



Cristian’s algorithm

• Well suited to one machine with a WWV receiver and a goal to have all other machines stay synchronized with it.
• Call the one with the WWV receiver the time server.
• Periodically, each machine sends a message to the time server asking for the current time.
• The time server responds with CUTC as fast as it can.
• As a first approximation, the requester sets its clock to CUTC.
• What’s wrong with that?



Big Trouble

• Major problem: time really should never run backward. Why?
    • If the sender’s clock was fast, CUTC will be smaller than the sender’s current value of C.
    • The change must be introduced gradually.
• If the timer generates 100 interrupts/second, each interrupt normally adds 10 ms to the time.
    • To slow down, the ISR adds only 9 ms at each interrupt until the clock is correct.
    • To speed up, it adds 11 ms at each interrupt.
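
A small simulation of this gradual adjustment, assuming integer millisecond values and a hypothetical `slew` function (neither is from the slides). Each simulated interrupt advances the reference time by 10 ms and the local clock by 9 or 11 ms, so the local clock never runs backward. The sketch assumes `adjust_ms` is smaller than `tick_ms`.

```python
def slew(local_ms, reference_ms, tick_ms=10, adjust_ms=1):
    """Close the gap between a local clock and a reference clock gradually.

    Returns (number of interrupts needed, list of local clock readings).
    """
    ticks = 0
    history = [local_ms]
    while local_ms != reference_ms:
        gap = local_ms - reference_ms
        correction = min(adjust_ms, abs(gap))
        if gap > 0:
            step = tick_ms - correction   # clock is fast: add only 9 ms
        else:
            step = tick_ms + correction   # clock is slow: add 11 ms
        local_ms += step
        reference_ms += tick_ms           # real time keeps advancing
        ticks += 1
        history.append(local_ms)
    return ticks, history


fast_ticks, fast_history = slew(local_ms=1050, reference_ms=1000)
slow_ticks, slow_history = slew(local_ms=950, reference_ms=1000)
```

A clock that is 50 ms fast or slow needs 50 interrupts (half a second at 100 interrupts/second) to be corrected at 1 ms per tick, and every reading in the history is larger than the one before it.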



Little Trouble

• Minor problem: it takes a nonzero amount of time for the time server’s reply to get back to the sender.
    • The delay may be large and vary with network load.
    • Cristian attempts to measure it: record the send time T0 and the receive time T1, subtract, and divide by 2; add the result to the received CUTC.
    • Better: if the time server’s interrupt handling and incoming message processing time, I, is known, estimate the one-way delay as (T1 - T0 - I)/2.
    • To improve accuracy, take several measurements and average them.
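
The delay estimate can be written out as a short sketch. The function names and the sample measurements are assumptions for illustration; T0 and T1 are read from the requester's own clock, and I is the server's handling time when it is known.

```python
def one_way_delay(t0, t1, handling_time=0.0):
    """Estimated propagation delay of the reply: (T1 - T0 - I) / 2."""
    return (t1 - t0 - handling_time) / 2


def cristian_estimate(t0, t1, server_time, handling_time=0.0):
    """Server time corrected by the estimated one-way delay."""
    return server_time + one_way_delay(t0, t1, handling_time)


def average_delay(samples):
    """Average the delay over several (T0, T1, I) measurements."""
    delays = [one_way_delay(t0, t1, i) for t0, t1, i in samples]
    return sum(delays) / len(delays)


# Hypothetical measurement: 20 ms round trip, 4 ms spent inside the server.
estimate = cristian_estimate(t0=100.000, t1=100.020,
                             server_time=500.000, handling_time=0.004)
```

For the sample values the estimated one-way delay is 8 ms, so the requester would set its clock to about 500.008 rather than 500.000.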



If no WWV Receiver

• Berkeley UNIX algorithm
    • The time server (actually a time daemon) is active, not passive.
    • It polls every machine and asks what time it is.
    • Based on the answers, it computes an average time and tells all machines to adjust their clocks to the new time.
• The time daemon’s time is set manually by the operator periodically.
• It is a centralized algorithm, even though the time daemon does not have a WWV receiver.
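
A minimal sketch of the averaging step, assuming the daemon's own clock is included in the average and that each machine is told a signed adjustment rather than an absolute time. The function name, the machine names and the readings (in minutes) are hypothetical.

```python
def berkeley_adjustments(clocks):
    """Return the adjustment each machine should apply to reach the average.

    `clocks` maps a machine name to the time it reported (daemon included).
    """
    average = sum(clocks.values()) / len(clocks)
    return {name: average - value for name, value in clocks.items()}


# Hypothetical readings in minutes past midnight: 3:00, 3:25 and 2:50.
readings = {"daemon": 180, "machine_a": 205, "machine_b": 170}
adjustments = berkeley_adjustments(readings)
```

The average is 3:05, so the daemon moves forward 5 minutes, machine_a back 20 and machine_b forward 15. A negative adjustment would be applied by slowing the clock down rather than setting it backward.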



Decentralized synchronization

• Cristian and Berkeley UNIX are centralized algorithms with the usual downside. What?
• There are several decentralized algorithms, for example:
    • Divide time into fixed-length resynchronization intervals.
    • At the beginning of each interval, every machine broadcasts its current time.
    • Each machine starts a local timer and collects all broadcasts arriving during a certain interval.
    • An algorithm then computes a new time based on some or all of the collected values.



Internet Synchronization

• New hardware and software technology in the past few years makes it possible to keep millions of clocks synchronized to within a few ms of UTC.
• New algorithms using these synchronized clocks are beginning to appear.
• Synchronized clocks can be used
    • to achieve cache consistency
    • to use time-out tickets in distributed system authentication
    • to handle commitment in atomic transactions



Logical Clocks

• See also notes from 3 weeks ago.
• For many purposes, it is sufficient that machines agree on the same time even if it is not the “right” time.
    • Internal consistency of the clocks matters.
• Clock synchronization is possible but does not have to be absolute.
    • If 2 processes do not interact, their clocks need not be synchronized; the lack of synch would not be seen.
• What is important is that all processes agree on the order in which events occur.



Lamport timestamps

• a happens-before b means that all processes agree that first event a occurs, then afterward, event b occurs.
• We write a happens-before b as a --> b.
    • If a occurs before b in the same process, a --> b is true.
    • If event a is the sending of a message and event b is the receipt of that message in another process, a --> b is also true, because a message cannot be received until after it is sent.
• happens-before is transitive: if a --> b and b --> c, then a --> c.



Ya cain’t say

• If x and y happen in different processes that do not exchange messages, then
    • we cannot say x --> y
    • we cannot say y --> x
    • nothing can be said about when the events happened or which event happened first
    • we call these events concurrent



Invent time

• We need a way of measuring time so that every event a can be assigned a time C(a) on which all processes agree, such that if a --> b, then C(a) < C(b).
    • If a and b are two events in the same process and a happens before b, then C(a) < C(b).
    • If a is the sending of a msg by one process and b is the receiving of that msg by another, then C(a) and C(b) must be assigned so that everyone agrees on the values of C(a) and C(b), with C(a) < C(b).
• Corrections to C can only be made by addition, never subtraction, so that the clock time always goes forward.



If msg leaves at time N, it arrives at >= N+1

Each message carries the time according to its sender’s clock

When it arrives, if the receiver’s clock shows a value prior to the time the message was sent, the receiver fast-forwards its clock to be 1 more than the sending time

Between every two events the clock must tick at least once

If a process sends or receives 2 messages in quick succession, it must advance its clock by (at least) 1 tick in between

No 2 events ever occur at exactly the same time
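
These rules fit in a very small class. The sketch below is one common way to write them, with hypothetical method names: taking the maximum of the local clock and the message timestamp and then adding 1 covers both the fast-forward rule and the tick between consecutive events.

```python
class LamportClock:
    """Logical clock for one process (illustrative sketch)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance the clock for a local event and return the new time."""
        self.time += 1
        return self.time

    def send(self):
        """Timestamp to attach to an outgoing message."""
        return self.tick()

    def receive(self, msg_time):
        """Adjust the clock on receipt of a message stamped msg_time."""
        self.time = max(self.time, msg_time) + 1
        return self.time


p, q = LamportClock(), LamportClock()
stamp = p.send()              # p's clock moves to 1
arrival = q.receive(stamp)    # q was at 0, so it jumps to 2
```

The receive time is always strictly greater than the send time, and a receiver whose clock is already ahead simply ticks once.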



Totally-ordered Multicast

Consider a bank with replicated data in San Francisco and New York City.

Customer in SF wants to add $100 to an account that holds $1000

Meanwhile, a bank employee in NY initiates an update that increases the customer’s account by 1% interest.

Due to communication delays, the two updates could arrive at the replicated sites in different orders and leave different final balances: deposit then interest gives $1111, while interest then deposit gives $1110

Both updates should have been performed at both sites in the same order



Using Lamport timestamps to get totally ordered multicast

Consider group of processes multicasting messages to each other

Each message is timestamped with the current (logical) time of its sender

Conceptually, a multicast msg is also sent to its own sender

We assume msgs from the same sender are received in the order they were sent and that no messages were lost



totally ordered multicast (cont.)

When a process receives a message, it puts it into a local queue ordered according to its timestamp

The receiver multicasts an acknowledgement

Using Lamport’s algorithm for adjusting local clocks, the timestamp of the received msg is lower than the timestamp of the acknowledgement

All processes will eventually have the same copy of the local queue, because each msg is multicast, as are the acks

We assumed msgs are delivered in the order sent by each sender



totally ordered multicast (cont. more)

Each process inserts a received msg in its local queue according to the timestamp in that msg.

Lamport’s clocks ensure no two messages have the same timestamp

Also, the timestamps reflect a consistent global ordering of events

A process delivers a queued msg to the application it is running when that message is at the head of the queue and has been acknowledged by each other process

The msg is then removed from the queue, and its associated acks are removed too.
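
The queue-and-acknowledge scheme can be tried out in a small simulation. Everything named here (`Process`, `simulate`, the per-pair FIFO channels, the random scheduler) is an assumption made for illustration. Ties between equal Lamport times are broken by the sender's id, which is one common way to make timestamps unique, and a message is delivered once every process in the group, the receiver included, has acknowledged it.

```python
import heapq
import random
from collections import deque


class Process:
    def __init__(self, pid, group_size):
        self.pid = pid
        self.n = group_size
        self.clock = 0
        self.queue = []      # heap of (timestamp, sender, payload)
        self.acks = {}       # (timestamp, sender) -> set of acking pids
        self.delivered = []

    def stamp(self):
        self.clock += 1
        return self.clock

    def on_message(self, ts, sender, payload):
        self.clock = max(self.clock, ts) + 1      # Lamport adjustment
        heapq.heappush(self.queue, (ts, sender, payload))
        return ("ack", (ts, sender), self.pid)    # caller multicasts this

    def on_ack(self, msg_id, acker):
        self.acks.setdefault(msg_id, set()).add(acker)
        # Deliver from the head of the queue while it is fully acknowledged.
        while self.queue:
            ts, sender, payload = self.queue[0]
            if len(self.acks.get((ts, sender), ())) < self.n:
                break
            heapq.heappop(self.queue)
            del self.acks[(ts, sender)]
            self.delivered.append(payload)


def simulate(sends, n=2, seed=0):
    """sends: list of (sender pid, payload). Returns each delivery order."""
    rng = random.Random(seed)
    procs = [Process(pid, n) for pid in range(n)]
    # One FIFO channel per (source, destination) pair; no messages are lost.
    channels = {(s, d): deque() for s in range(n) for d in range(n)}

    def multicast(src, item):
        for dst in range(n):
            channels[(src, dst)].append(item)

    for sender, payload in sends:
        multicast(sender, ("msg", procs[sender].stamp(), sender, payload))

    while any(channels.values()):
        # Pick any non-empty channel: arbitrary delays between channels.
        src, dst = rng.choice([k for k, q in channels.items() if q])
        item = channels[(src, dst)].popleft()
        if item[0] == "msg":
            _, ts, sender, payload = item
            multicast(dst, procs[dst].on_message(ts, sender, payload))
        else:
            _, msg_id, acker = item
            procs[dst].on_ack(msg_id, acker)
    return [p.delivered for p in procs]
```

Calling `simulate([(0, "deposit $100"), (1, "add 1% interest")], seed=s)` for different seeds changes the arrival order of messages but not the delivery order, which is the same at both replicas.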




Vector Timestamps

With Lamport timestamps, nothing can be said about the relationship between a and b simply by comparing their timestamps C(a) and C(b).

Just because C(a) < C(b) doesn’t mean a happened before b (remember concurrent events)

Consider network news, where processes post articles and react to posted articles

Postings are multicast to all members

We want reactions delivered after their associated postings







Will totally-ordered multicasting work?

• That scheme does not mean that if msg B is delivered after msg A, B is a reaction to msg A. They may be completely independent.
• What’s missing?
    • If causal relationships are maintained within a group of processes, then receipt of a reaction to an article should always follow the receipt of the article.
    • If two items are independent, their order of delivery should not matter at all.



Vector Timestamps capture causality

• VT(a) < VT(b) means event a causally precedes event b.
• Let each process Pi maintain a vector Vi such that
    • Vi[i] is the number of events that have occurred so far at Pi
    • if Vi[j] = k then Pi knows that k events have occurred at Pj
• We increment Vi[i] at the occurrence of each new event that happens at process Pi.
• Piggyback vectors with msgs that are sent. When Pi sends msg m, it sends its current vector along as a timestamp vt.



• Receiver thus knows the number of events that have occurred at Pi.
• Receiver is also told how many events at other processes have taken place before Pi sent message m.
    • The timestamp vt of m tells the receiver how many events in other processes have preceded m and on which m may causally depend.
• When Pj receives m, it adjusts its own vector by setting each entry Vj[k] to max{Vj[k], vt[k]}.
    • The vector now reflects the number of msgs that Pj must receive to have at least seen the same msgs that preceded the sending of m.
    • Vj[i] is incremented by 1, representing the event of receiving msg m as the next message from Pi.
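
Two of the vector operations are easy to show in isolation: the entry-wise maximum applied on receipt, and the comparison of two timestamps. The function names are hypothetical, and the componentwise definition of VT(a) < VT(b) used here (every entry less than or equal, and the vectors not identical) is the usual one, stated as an assumption because it is not spelled out above. The per-event increment is deliberately left out.

```python
def merge(local, vt):
    """Entry-wise maximum of the receiver's vector and a message timestamp."""
    return [max(a, b) for a, b in zip(local, vt)]


def causally_precedes(vt_a, vt_b):
    """True if vt_a < vt_b: no entry is larger and the vectors differ."""
    return all(x <= y for x, y in zip(vt_a, vt_b)) and vt_a != vt_b


def concurrent(vt_a, vt_b):
    """True if neither timestamp causally precedes the other."""
    return (vt_a != vt_b
            and not causally_precedes(vt_a, vt_b)
            and not causally_precedes(vt_b, vt_a))
```

For example, `[1, 0, 0]` causally precedes `[1, 1, 0]`, while `[1, 0, 0]` and `[0, 1, 0]` are concurrent. A single Lamport number per event could not tell these two cases apart.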



When are messages delivered?

Vector timestamps are used to deliver msgs when no causality constraints are violated.

When process Pi posts an article, it multicasts that article as a msg a with timestamp vt(a) set equal to Vi.

When another process Pj receives a, it will have adjusted its own vector such that Vj[i] > vt(a)[i]

Now suppose Pj posts a reaction by multicasting msg r with timestamp vt(r) equal to Vj, so that vt(r)[i] > vt(a)[i].

Both msg a and msg r will arrive at Pk in some order



When receiving r, Pk inspects timestamp vt(r) and decides to postpone delivery until all msgs that causally precede r have been received as well.

In particular, r is delivered only if the following conditions are met

1. vt(r)[j] = Vk[j] + 1

2. vt(r)[i] <= Vk[i] for all i not equal to j

Condition 1 says r is the next msg Pk was expecting from Pj

Condition 2 says Pk has already seen every msg that Pj had seen when it sent r. In particular, Pk has already seen message a.
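
The two delivery conditions translate directly into a check, sketched below with a small hold-back queue. This uses one common convention, stated here as an assumption: each vector entry counts the messages multicast by that process, a sender increments its own entry before sending, and the receiver sets its entry for the sender to the message's value on delivery. The names `can_deliver` and `Receiver` and the process numbering (Pi = 0, Pj = 1, Pk = 2) are illustrative.

```python
def can_deliver(vt, sender, local):
    """Check conditions 1 and 2 for a message from `sender` stamped vt."""
    next_from_sender = vt[sender] == local[sender] + 1
    nothing_missing = all(
        vt[i] <= local[i] for i in range(len(vt)) if i != sender
    )
    return next_from_sender and nothing_missing


class Receiver:
    """Process Pk: holds messages back until they can be delivered."""

    def __init__(self, n):
        self.vector = [0] * n
        self.pending = []
        self.delivered = []

    def receive(self, payload, vt, sender):
        self.pending.append((payload, vt, sender))
        progress = True
        while progress:                 # one delivery may unblock others
            progress = False
            for item in list(self.pending):
                body, stamp, src = item
                if can_deliver(stamp, src, self.vector):
                    self.pending.remove(item)
                    self.vector[src] = stamp[src]
                    self.delivered.append(body)
                    progress = True


pk = Receiver(3)
pk.receive("reaction", [1, 1, 0], sender=1)   # arrives first, held back
pk.receive("article", [1, 0, 0], sender=0)    # unblocks the reaction
```

After both calls `pk.delivered` is `["article", "reaction"]`, even though the reaction arrived first.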



Controversy

• There has been some debate about whether support for totally-ordered and causally-ordered multicasting should be provided as part of the message-communication layer or whether applications should handle ordering.
    • The comm layer doesn’t know what a message contains, only potential causality.
    • 2 msgs from the same sender will always be marked as causally related even if they are not.
    • The application developer may not want to think about it.
