An Extended Home-Based Coherence Protocol for
Causally Consistent Replicated Read-Write Objects
Jerzy Brzeziński Michał Szychowiak
Checkpoint replication – a new recovery technique for Distributed Shared Memory (DSM) systems:
- low overhead – checkpointing integrated with the coherence protocol managing replicated DSM objects
- high availability of shared objects in spite of multiple node failures and network partitioning (majority partition)
- fast recovery
PLAN of the talk:
1. DSM model, consistency models, coherence protocols
2. DSM recovery problem
3. Checkpoint replication
4. Example for causal consistency
5. Examples for other consistency models
Message Passing
- easy to implement, hard to program
[Diagram: each node has a PROCESSOR and LOCAL MEMORY; no global clock, no common memory.]
DSM data unit
- physical memory page / single variable / object
- object = members ("object value") + methods (encapsulation); persistent objects; read/write objects
- read access: r(x)v
- write access: w(x)v
Replication
[Diagram: each processor holds its own replica of DATA X (X′, X″). Data replication enables local access; local access = efficiency improvement.]
Consistency models
How to efficiently control concurrent access to the
same object on different nodes to maintain the
consistency of DSM?
“A consistency model is essentially a contract between the software and the memory. It says that if the software agrees to obey certain rules, the memory promises to work correctly.”
[A. S. Tanenbaum]
Consistency models
- Atomic consistency
- Sequential consistency
- Causal consistency
- PRAM consistency
- Cache consistency
- Processor consistency
- Relaxed consistency models (release, entry, ...)
Symbols
Ĥ = set of all operations issued by the system
Hi = set of access operations to shared objects issued by Pi
HW = set of all write operations to shared objects
H|x = set of all access operations issued to object x
local order relation →i = total order of operations issued by Pi
real-time order: o1 →RT o2 = operation o1 finishes in real time before o2 starts
Atomic consistency
Execution of access operations is atomically consistent if there exists a total order ⇒ of the operations in Ĥ such that
o1 →RT o2 ⟹ o1 ⇒ o2,
satisfying two conditions:
(legality) ∀ w(x)v, r(x)v ∈ Ĥ, ∀ o(x)u ∈ Ĥ :: u ≠ v ⟹ ¬( w(x)v ⇒ o(x)u ⇒ r(x)v )
(exclusive writing) ∀ w1(x), w2(x) ∈ HW :: w1(x) ⇒ w2(x) ∨ w2(x) ⇒ w1(x)
Sequential consistency
Execution of access operations is sequentially consistent if for each process Pi ∈ Pcor there exists a legal serialization ⇒i of the operations in Hi ∪ HW satisfying two conditions:
(local order) ∀ Pj ∈ Pcor, ∀ o1, o2 ∈ Hj ∪ HW :: ( o1 →j o2 ⟹ o1 ⇒i o2 )
(exclusive writing) ∀ w1(x), w2(x) ∈ HW :: w1(x) ⇒i w2(x) ∨ w2(x) ⇒i w1(x)
Causal order
The causal-order relation ↝ in Ĥ is the transitive closure of the local order relations →i and a write-before relation that holds between a write operation and a read operation returning the written value:
(i) ∀ o1, o2 ∈ Ĥ :: ( ∃ Pi : o1 →i o2 ) ⟹ o1 ↝ o2
(ii) ∀ x ∈ O :: w(x)v ↝ r(x)v
(iii) ∀ o1, o2, o ∈ Ĥ :: ( o1 ↝ o ∧ o ↝ o2 ) ⟹ o1 ↝ o2
Causal consistency
Execution of access operations is causally consistent if for each process Pi ∈ Pcor there exists a legal serialization ⇒i of Hi ∪ HW satisfying the following condition:
∀ o1, o2 ∈ Hi ∪ HW :: ( o1 ↝ o2 ⟹ o1 ⇒i o2 )
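The causal-order relation above can be illustrated in code. This is a minimal sketch under a simplified, hypothetical operation model: operations are tuples (process, kind, variable, value), listed in per-process issue order; the relation is the transitive closure of local order and write-before.

```python
# Sketch: computing the causal-order relation on a tiny history.
# Operation model (hypothetical): (process, kind, var, value).
from itertools import product

def causal_order(ops):
    """Transitive closure of local (program) order and write-before."""
    n = len(ops)
    rel = [[False] * n for _ in range(n)]
    for i, j in product(range(n), repeat=2):
        oi, oj = ops[i], ops[j]
        # (i) local order: same process, oi issued before oj
        if oi[0] == oj[0] and i < j:
            rel[i][j] = True
        # (ii) write-before: a read returning the written value
        if oi[1] == 'w' and oj[1] == 'r' and oi[2] == oj[2] and oi[3] == oj[3]:
            rel[i][j] = True
    # (iii) transitive closure (Floyd-Warshall style)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if rel[i][k] and rel[k][j]:
                    rel[i][j] = True
    return rel

# History: P1 writes x=1; P2 reads x=1 and then writes y=2
ops = [(1, 'w', 'x', 1), (2, 'r', 'x', 1), (2, 'w', 'y', 2)]
rel = causal_order(ops)
assert rel[0][1]      # write-before: w(x)1 precedes r(x)1
assert rel[0][2]      # transitivity: w(x)1 precedes w(y)2
```

A causally consistent serialization for any process must then order w(x)1 before w(y)2.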
Coherence protocols
[Diagram: owner(x) – the OWNER NODE holding the master replica of x; the remaining nodes hold ordinary replicas; manager(x); CS(x) – copyset, the set of nodes holding replicas of x.]
Coherence protocols – write update
[Diagram: on W(x)5 at the OWNER NODE, the master replica takes the value 5 and UPD(x,5) messages carry it to every ordinary replica; requires broadcast or similar tools.]
Coherence protocols – write invalidate
[Diagram: on W(x)5 at the OWNER NODE, the master replica takes the value 5 and INV(x) messages invalidate the ordinary replicas.]
Coherence protocols – write invalidate (continued)
[Diagram: after invalidation, only the master replica holds the value; suited to objects with a low read/write ratio.]
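The two propagation strategies can be contrasted with a minimal sketch. The class names `Owner` and `Replica` are illustrative assumptions, not part of the protocol.

```python
# Sketch (hypothetical classes): write-update vs. write-invalidate
# propagation from the owner (master replica) to the copyset CS(x).

class Replica:
    def __init__(self):
        self.state, self.value = 'RO', None

class Owner:
    def __init__(self, copyset):
        self.value = None
        self.copyset = copyset          # CS(x): nodes holding ordinary replicas

    def write_update(self, v):
        self.value = v
        for r in self.copyset:          # UPD(x,v) to every replica: needs broadcast
            r.value, r.state = v, 'RO'

    def write_invalidate(self, v):
        self.value = v
        for r in self.copyset:          # INV(x): replicas re-fetch on next access
            r.state = 'INV'

cs = [Replica(), Replica()]
o = Owner(cs)
o.write_update(5)
assert all(r.value == 5 for r in cs)
o.write_invalidate(7)
assert all(r.state == 'INV' for r in cs)
```

Write-update pays the broadcast cost on every write; write-invalidate defers the cost to the next read, which pays off when the read/write ratio is low.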
Recovery problem
[Diagram: Pi performs wi(x)2 and fails at tf; x is restored to its old value x=1. Meanwhile Pj has read rj(x)2 and, with y = square(x), performed wj(y)4. After recovery x=1 but y=4 – an inconsistency; a later rj(x) at Pj observes it.]
Recovery problem
The goal of the DSM recovery protocol is to restore a global state of the shared memory that is consistent, i.e. one that allows the distributed computation to restart from this state and guarantees that any access operation issued in the restarted computation will be consistent according to the assumed consistency model.
Checkpoints
restored object value should reflect the history of operations performed on the object (this is the job of the DSM recovery protocol)
Solutions
Three main approaches:
1. write-update with full replication
2. logging of coherence operations/messages
3. checkpointing algorithms formerly developed for message passing, adapted for the DSM
Solutions
Most solutions offer only single-failure resilience.
(Design space: write-update vs. write-invalidate; logging vs. checkpointing; coordinated vs. independent.)
Checkpoints
restored object value should reflect the history of operations performed on the object (this is the job of the DSM recovery protocol)
consistent state of the memory may be stored in the form of backup copies of all existing objects (checkpoints)
on recovery, consistent values of all objects can be quickly fetched from their checkpoints
checkpoints need to be saved in a stable (non‑volatile) storage able to survive failures
Drawbacks
1. Requirements: external stable storage
2. Cost of checkpointing: coordination between processes; access time to the stable storage
3. Cost of recovery: restoration of the consistent state; re-execution of the processing
New solution
- checkpoint storage in the DSM memory
- checkpoint replication – special object replicas – CCS(x) = checkpoint copyset
- full integration with the coherence protocol to manage all kinds of replicas
- fast recovery – the current checkpoint of each object is consistent with the current state of the memory – no re-execution
Home-based coherence protocol
Every object x is assigned a home-node, which collects all modifications of x. The process running on that node is called the owner of x. Different objects can have different home-nodes (owners). We assume that there is a home-node for each object x, and that the identity of the owner holding the master replica of x is known.
Home-node
The owner holds the master replica of x. Home-node = static owner. The master replica is distinguished by the WR state. Each write access issued to a WR replica is performed instantaneously; similarly, each read access.
Multiple readers
Besides the master replica there can be several ordinary replicas in the RO state. Each read access issued to an RO replica is performed instantaneously. Each write access issued to an RO replica requires communication with the home-node (to update the master replica).
Multiple writers
Several processes can issue write accesses to the
same object x at the same time.
The global order of modifications of x is determined
by the order of reception of the update messages
(UPD) at the home-node.
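A minimal sketch of this serialization rule, assuming a hypothetical (sender, value) message format:

```python
# Sketch: the home-node serializes concurrent writers by applying
# UPD messages in their reception order.

def home_node_apply(updates):
    """Reception order at the home-node defines the global write order."""
    value, log = None, []
    for sender, v in updates:
        value = v                    # later-received update overwrites earlier ones
        log.append((sender, v))
    return value, log

# Pi and Pj write x concurrently; arrival order decides
value, log = home_node_apply([('Pj', 2), ('Pi', 1), ('Pj', 3)])
assert value == 3                    # last-received UPD wins
assert [v for _, v in log] == [2, 1, 3]
```

All replicas then observe the modifications of x in this single, home-node-defined order.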
Invalidation
The INV state denotes an invalidated ordinary replica. Any access issued to such a replica requires fetching the value from the master replica. The UPD message is sent from the home-node; on reception, the invalidated replica is updated and becomes RO.
Coherence protocol automaton
ri(x)
RO owner i
ri(x)
wi(x)
rj(x) ______________________________________
Send x to Pj wj(x)
______________________________________
local_invalidatei(...) Send ack to j
local_invalidatei({...x...})
ri(x) ______________________________________
Recv x from owner k local_invalidatei(...)
INV owner i
WR owner = i
wi(x) _______________________________________
Recv ack from owner k
a) b)
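The ordinary-replica side of the automaton can be sketched as a small state machine. This is a simplification: the UPD exchange with the home-node is abstracted into a callback.

```python
# Sketch of the ordinary-replica states (owner != i): RO serves local
# reads; local_invalidate moves the replica to INV; a read on an INV
# replica first fetches the current value from the master replica.

class OrdinaryReplica:
    def __init__(self, fetch_from_owner):
        self.state = 'RO'
        self.value = None
        self.fetch = fetch_from_owner   # stands in for the UPD exchange

    def local_invalidate(self):
        self.state = 'INV'              # value may be stale

    def read(self):
        if self.state == 'INV':
            self.value = self.fetch()   # recv x from owner k
            self.state = 'RO'
        return self.value

r = OrdinaryReplica(fetch_from_owner=lambda: 42)
r.local_invalidate()
assert r.state == 'INV'
assert r.read() == 42 and r.state == 'RO'
```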
Example
[Message diagram: operations wi(x)1, wj(x)2, wj(y)4, wj(x)3, wi(x)4, wj(y)9; messages UPD(x,2,[0,1]), UPD(x,-,[1,1]), UPD(x,3,[1,3]), UPD(x,-,[1,3]); replica states evolve from x{RO,0,[0,0]}, y{RO,0,[0,0]} through x{WR,1,[1,0]}, x{RO,2,[1,1]}, y{WR,4,[1,2]}, x{WR,3,[1,3]}, x{RO,3,[1,3]} to x{WR,4,[2,3]}, y{WR,9,[1,3]}.]
Checkpointing
A WR replica which has been modified and not accessed by any process other than the owner is dirty. The first external access to a WR/dirty replica requires checkpointing the accessed value (in order to protect the causal dependency). The checkpoint operation consists in replicating the accessed value, i.e. updating the checkpoint replicas.
Checkpoint replicas
A C replica is updated on checkpoint operations and never becomes invalidated. A single checkpoint operation ckpt(x)v is performed by the owner of x, Pk, and consists in atomically updating all C replicas of x held by processes included in CCS(x) with the value v, carried in the ckpt(x,v,VTkx) message.
Checkpoint replicas
The C replica of x held by Pj is duplicated into a new ROC replica on the first local access to x. In fact, a checkpoint replica is a prefetched current value of x and can therefore serve further accesses. During the next single checkpoint operation ckpt(x)v′, all existing ROC replicas of x are destroyed.
Checkpointing
RO/dirty and ROC/dirty denote additional information about write accesses invoked locally on an RO or ROC replica. A replica is in the RO/dirty or ROC/dirty state when it has been modified by the holding process but possibly not yet checkpointed by its owner.
Burst checkpointing
The coordinated burst checkpoint operation consists in atomically performing two operations:
(1) taking a single checkpoint of all objects x in the WR state, and
(2) sending checkpoint requests CKPT_REQ(y) to the owners of all other objects y in the RO/dirty and ROC/dirty states.
After that, the RO/dirty and ROC/dirty objects are switched to the RO and ROC states, respectively.
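The two steps of the burst checkpoint can be sketched as follows; the object-store layout and the `send_ckpt_req` callback are assumptions for illustration only.

```python
# Sketch of the coordinated burst checkpoint (two steps, performed
# atomically in the protocol); store layout is hypothetical.

def burst_checkpoint(objs, send_ckpt_req):
    # (1) single checkpoint of every object held in WR state
    for name, o in objs.items():
        if o['state'].startswith('WR'):
            o['ckpt'] = o['value']
            o['state'] = 'WR'                        # WR/dirty -> WR
    # (2) CKPT_REQ(y) to owners of RO/dirty and ROC/dirty objects
    for name, o in objs.items():
        if o['state'] in ('RO/dirty', 'ROC/dirty'):
            send_ckpt_req(name)                      # owner will checkpoint y
            o['state'] = o['state'].split('/')[0]    # -> RO or ROC

reqs = []
objs = {'x': {'state': 'WR/dirty', 'value': 5, 'ckpt': 2},
        'y': {'state': 'RO/dirty', 'value': 3, 'ckpt': 0}}
burst_checkpoint(objs, reqs.append)
assert objs['x']['ckpt'] == 5 and objs['x']['state'] == 'WR'
assert reqs == ['y'] and objs['y']['state'] == 'RO'
```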
Vector clock
The causal relationship of the memory accesses is reflected in the vector timestamps associated with each shared object. Each process Pi manages a vector clock VTi. The i-th component of VTi counts writes performed by Pi. More precisely, only intervals of write operations not interlaced with communication with other processes are counted, as this is sufficient to track the causal dependency between operations issued by distinct processes.
Vector clock
There are three operations performed on VTi:
inc(VTi) – increments the i-th component of VTi
update(VTi,VTj) – returns the component-wise maximum of the two vectors
compare(VTi<VTj) – returns true iff ∀k: VTi[k] ≤ VTj[k] and ∃k: VTi[k] < VTj[k]
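A minimal sketch of the three operations, assuming vectors are represented as plain lists of length n:

```python
# Sketch of the three vector-clock operations from the slide.

def inc(vt, i):
    """Increment the i-th component (a write interval by Pi)."""
    vt = list(vt)
    vt[i] += 1
    return vt

def update(vt_i, vt_j):
    """Component-wise maximum of the two vectors."""
    return [max(a, b) for a, b in zip(vt_i, vt_j)]

def compare(vt_i, vt_j):
    """True iff vt_i < vt_j: every component <=, at least one <."""
    return all(a <= b for a, b in zip(vt_i, vt_j)) and \
           any(a < b for a, b in zip(vt_i, vt_j))

assert inc([1, 0], 0) == [2, 0]
assert update([1, 3], [2, 1]) == [2, 3]
assert compare([1, 1], [1, 3]) is True
assert compare([1, 3], [2, 1]) is False   # concurrent vectors
```

Note that `compare` is a strict partial order: two concurrent timestamps compare as false in both directions.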
Vector timestamp
Each replica of a shared object x stored at Pi is assigned a timestamp VTix. On any local modification wi(x), VTix becomes equal to VTi; on an update UPD(x,v,VTx) from the master replica, VTix becomes equal to VTx.
Causal order
The local_invalidatei(VT) operation ensures the correctness of the basic protocol by setting to INV the status of all locally held replicas x not owned by Pi for which compare(VTix<VT) is true. The purpose of this operation is to invalidate all replicas that could potentially have been overwritten between VTix and VT.
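A sketch of local_invalidate, assuming a hypothetical dictionary representation of the locally held replicas:

```python
# Sketch: local_invalidate_i(VT) invalidates every locally held replica
# not owned by Pi whose timestamp is dominated by VT, i.e.
# compare(VT_ix < VT) holds.

def dominated(vt_x, vt):
    return all(a <= b for a, b in zip(vt_x, vt)) and \
           any(a < b for a, b in zip(vt_x, vt))

def local_invalidate(replicas, me, vt):
    for x in replicas.values():
        if x['owner'] != me and dominated(x['vt'], vt):
            x['state'] = 'INV'    # value may have been overwritten since vt_x

replicas = {'x': {'owner': 2, 'vt': [1, 1], 'state': 'RO'},
            'y': {'owner': 1, 'vt': [1, 1], 'state': 'RO'}}
local_invalidate(replicas, me=1, vt=[2, 3])
assert replicas['x']['state'] == 'INV'   # not owned, dominated timestamp
assert replicas['y']['state'] == 'RO'    # owned by P1: never invalidated locally
```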
Causal order
The local_invalidatej(VTjx) must be executed
when copying a checkpoint replica from C to
ROC.
Example
[Message diagram with checkpointing: processes Pi, Pj, Pl; operations wi(x)1, wj(x)2, wj(y)4, wj(x)3, wj(x)4, wi(x)5, wj(y)9, rl(y)9; messages UPD(x,2,[0,1]), UPD(x,-,[1,1]), UPD(x,3,[1,3]), UPD(x,-,[1,3]), UPD(x,4,[1,4]), UPD(x,-,[1,4]), UPD(y,9,[1,3]); checkpoints CKPT(x,2,[1,1]) and, after CKPT_REQ(x), CKPT(x,5,[2,4]); replica states include x{WR,1,[1,0]}, x{RO,2,[1,1]}, x{RO/dirty,3,[1,3]}, x{RO/dirty,4,[1,4]}, x{WR/dirty,5,[2,4]}, y{WR,4,[1,2]}, y{WR,9,[1,3]}.]
Example
[Variant with checkpointing of y: operations wi(x)1, wj(x)2, wj(y)4, wj(x)3, wi(x)4, wi(x)5, wj(y)9, rl(y)9; checkpoints CKPT(x,2,[1,1]), CKPT(y,4,[1,3]), CKPT(y,9,[1,3]) and, after CKPT_REQ(x), CKPT(x,5,[3,3]); messages UPD(x,2,[0,1]), UPD(x,-,[1,1]), UPD(x,3,[1,3]), UPD(x,-,[1,3]), UPD(y,9,[1,3]); replica states include x{WR,1,[1,0]}, x{RO,2,[1,1]}, x{RO/dirty,3,[1,3]}, x{WR/dirty,4,[2,3]}, y{WR,4,[1,2]}, y{WR,9,[1,3]}.]
Recovery
At any time before a failure occurs, there are at least nrmin+1 replicas of x (the master replica plus nr ≥ nrmin C replicas). Thus, in the case of a failure of f ≤ nrmin processes (at most f processes crash or become separated from the majority partition), there will be at least one non-faulty replica of x in the primary partition, which can serve further access requests to x.
Recovery
As long as the current owner is non-faulty and in the majority partition, the extended coherence protocol ensures that all requests to x issued in the majority partition are processed. However, if the current owner becomes unavailable, the recovery procedure must elect a new owner from among all available processes in the partition – the one holding the most recent replica of x (timestamp comparison is necessary for RO/ROC replicas).
Recovery
If there are no RO/ROC replicas of x, any available C replica may be used to restore the value of x. This is sufficient to continue the distributed processing in the majority partition. Alternatively, all shared objects may be synchronously rolled back to their checkpoint values, with all RO/dirty and ROC/dirty replicas discarded.
CS(x) vs. CCS(x)
boundary restriction on nr = |CCS(x)|: nrmin ≤ nr ≤ nrmax
f-resilience: nr = nrmin = nrmax = f (with f bounded by the majority-partition requirement, f < n/2)
prefetching: if |CS(x)| ≥ nr then CCS(x) ⊆ CS(x) (reduction of the update cost)
Other protocols for ckpt replication
1. Another protocol for causal consistency
2. Atomic consistency
3. Processor consistency
Coherence protocol
WR (owner = i):
   ri(x), wi(x) – performed locally
   rj(x) – send x to Pj
   wj(x) – send x and ownership to Pj
RO (owner ≠ i):
   ri(x) – performed locally
   wi(x) – recv x and ownership from owner k, local_invalidatei(VTk) → WR
   → INV on local invalidation when performing local_invalidatei(VT)
INV (owner ≠ i):
   ri(x) – recv x from owner k, local_invalidatei(VTk) → RO
   wi(x) – recv x and ownership from owner k, local_invalidatei(VTk) → WR
Extended coherence protocol
Two additional replica states:
C (checkpoint)
ROC (read-only checkpoint)
A checkpoint replica C is updated on checkpoint operations. After that moment, it can be switched to ROC on the next local read access to x, triggering coherence operations (local_invalidate). Until the next checkpoint, the ROC replica serves read accesses as RO replicas do.
Extended coherence protocol
WR (owner = i):
   ri(x), wi(x) – performed locally
   rj(x), j ∉ CCS – send x to Pj
   rj(x), j ∈ CCS – Checkpoint(CCS), send x to Pj
   wj(x), j ∉ CCS – send x and ownership to Pj
   wj(x), j ∈ CCS – Checkpoint(CCS), send x and ownership to Pj
RO (owner ≠ i):
   ri(x) – performed locally
   wi(x) – recv x and ownership from owner k, local_invalidatei(VTk) → WR
   → INV on invalidation when performing local_invalidatei(VT)
ROC (owner ≠ i):
   ri(x) – performed locally
   wi(x) – recv x and ownership from owner k, Change CCS, local_invalidatei(VTk) → WR
   CHECKPOINT – atomic update → C
INV (owner ≠ i):
   ri(x) – recv x from owner k, local_invalidatei(VTk) → RO
   wi(x) – recv x and ownership from owner k, local_invalidatei(VTk) → WR
C (owner ≠ i):
   CHECKPOINT – atomic update
   ri(x) – local_invalidatei(VTx) → ROC
Extended coherence protocol
[Diagram: Pi issues wi(x)1, wi(x)2; the external read rj(x)2 triggers CKPT(x,2,VTx), which atomically updates checkpoint replicas x{C,2,VTx} at Pa and Pb (ACK). Pi continues with wi(x)3, wi(y)2, wi(x)4 and fails at tf; Pj's later rj(x) can still be served.]
Extended coherence protocol
[Diagram: Pi issues wi(x)2, wi(y)5, wi(y)6; checkpoints CKPT(x,2,VTx) and CKPT(y,5,VTy) are taken; Pj issues rj(x)2 and rj(y), where rj(y)5 is marked inconsistent (!); y{INV,-,-} is shown at Pj; Pi fails at tf.]
Atomic consistency
Execution of access operations is atomically consistent if there exists a total order ⇒ of the operations in Ĥ such that
o1 →RT o2 ⟹ o1 ⇒ o2,
satisfying two conditions:
(legality) ∀ w(x)v, r(x)v ∈ Ĥ, ∀ o(x)u ∈ Ĥ :: u ≠ v ⟹ ¬( w(x)v ⇒ o(x)u ⇒ r(x)v )
(exclusive writing) ∀ w1(x), w2(x) ∈ HW :: w1(x) ⇒ w2(x) ∨ w2(x) ⇒ w1(x)
Coherence protocol
WR (owner = i):
   ri(x), wi(x) – performed locally
   rj(x) – send x to Pj
   wj(x) – send x and ownership to Pj
RO (owner = i):
   ri(x) – performed locally
   wi(x) – Invalidate(CS) → WR
   rj(x) – send x to Pj
   wj(x) – send x and ownership to Pj
RO (owner ≠ i):
   ri(x) – performed locally
   wi(x) – recv ownership, Invalidate(CS) → WR
   INVALIDATE – send ACK to owner → INV
INV (owner ≠ i):
   ri(x) – recv x from owner → RO
   wi(x) – recv x and ownership from owner, Invalidate(CS) → WR
Delayed checkpointing
[Diagram: Pi issues wi(x)1, wi(x)2; ckpt(x)2 is taken on the external read rj(x)2; Pi continues with wi(x)3, wi(x)4, sending INV(x); Pi fails at tf; Pj later re-reads rj(x)2 from the checkpoint; Pj also issues wj(y)1, wj(y)4.]
Extensions for checkpointing
ROC (read-only checkpoint) – a checkpoint replica available for read access to object x. All checkpoint replicas are in state ROC directly after the checkpointing operation, until some process issues a write request to x.
C (checkpoint) – a checkpoint replica available only on recovery, to restore a consistent state of the DSM (an invalidated ROC).
CCS(x) – checkpoint copyset
Extensions for checkpointing
WR (owner = i):
   wi(x) – Invalidate(CS ∪ CCS)
   wj(x) – Checkpoint(CCS), send x and ownership to Pj
   rj(x) – Checkpoint(CCS), send x to Pj
RO (owner = i):
   ri(x) – performed locally
ROC (owner ≠ i):
   ri(x) – performed locally
   wi(x) – recv ownership, Change CCS, Invalidate(CS ∪ CCS) → WR
   INVALIDATE – send ACK to owner
   CHECKPOINT – send ACK to owner / atomic update
C (owner ≠ i):
   CHECKPOINT – atomic update
   ri(x) – recv x from owner → ROC
INV (owner ≠ i):
   INVALIDATE – send ACK to owner
. . .
Checkpoint inconsistency problem
[Diagram: Pi issues wi(x)2, wi(y)5, wi(y)6; ckpt(x)2 is taken on rj(x)2, ckpt(y)5 later; a read rj(y)5 is marked inconsistent (!); INV(y) is sent; Pi fails at tf.]
Checkpoint atomicity
Checkpoint replicas are updated in two phases:
C (owner ≠ i): UPD – send ACK1 to owner → CX
CX (owner ≠ i): COMMIT – send ACK2 to owner → ROC (owner ≠ i)
Recovery
if the owner is not available (crashed or separated from the majority partition of system nodes), then ...
if any process from CS(x) is available – choose it as the new owner, else ...
if any process from CCS(x) is available – choose it as the new owner
concurrent owners (in distinct partitions) – only the "majority" owner can serve write requests and make new checkpoints
requires failure detection and majority-partition detection
Correctness of the extended protocol
1. safety – the protocol correctly maintains the coherence of shared data, according to the consistency model, despite failures of processes and communication links
2. liveness – each access operation issued to any shared data will eventually be performed (in finite time), even in the presence of failures.
Correctness of the extended protocol
Operation dependency: For any two operations o1, o2 ∈ Ĥ(t), we say that o2 is dependent on o1, denoted o1 ↝ o2, iff:
1) ∃ Pi ∈ Pcor(t) :: o1 →i o2, or
2) (write-before) ∃ x ∈ O, ∃ Pi, Pj :: o1 = wi(x)v ∧ o2 = rj(x)v, or
3) (transitive closure) ∃ o ∈ Ĥ(t) :: o1 ↝ o ∧ o ↝ o2
Correctness of the extended protocol
Legal history: A set H(t) ⊆ Ĥ(t) is a legal history of access operations iff:
1) ∀ Pi ∈ Pcor(t) :: Hi(t) ⊆ H(t), and
2) ∀ o2 ∈ H(t), ∀ o1 ∈ Ĥ(t) :: ( o1 ↝ o2 ⟹ o1 ∈ H(t) )
Correctness of the extended protocol
Execution of access operations is atomically consistent if there exists a total order ⇒ of the operations in H such that
o1 →RT o2 ⟹ o1 ⇒ o2,
satisfying two conditions:
(legality) ∀ w(x)v, r(x)v ∈ H, ∀ o(x)u ∈ H :: u ≠ v ⟹ ¬( w(x)v ⇒ o(x)u ⇒ r(x)v )
(exclusive writing) ∀ w1(x), w2(x) ∈ H :: w1(x) ⇒ w2(x) ∨ w2(x) ⇒ w1(x)
where H = legal history of operations issued by the system, Hi = set of all access operations to shared objects issued by Pi, and o1 →RT o2 = operation o1 finishes in real time before o2 starts.
Correctness of the extended protocol
Theorem 1 (safety): Every access to an object x
performed by a correct process is an atomically
consistent reliable operation.
Theorem 2 (liveness): The protocol eventually brings
a value of x to any correct process in the majority
partition requesting the access.
Brief summary of the protocol
independent checkpointing of DSM objects
replicated checkpoints
resilience to multiple node failures and network
partitioning
independent recovery (only lost objects must be
recovered)
Processor consistency
Execution of access operations is processor consistent if for each Pi, i = 1..n, there exists a serialization ⇒i of the set Hi ∪ HW that satisfies:
(PRAM) ∀ j = 1..n, ∀ o1, o2 ∈ Hj ∪ HW :: ( o1 →j o2 ⟹ o1 ⇒i o2 )
(cache) ∀ x ∈ O, ∀ w1, w2 ∈ HW ∩ H|x, ∀ i, j = 1..n :: ( w1 ⇒i w2 ⟹ w1 ⇒j w2 )
Coherence protocol
Brand new coherence protocol:
Brzeziński and Szychowiak 2003
home-based (home-node=static owner, no object
manager necessary) – ensures cache consistency
local invalidation – ensures PRAM consistency
proven correct
Coherence protocol
[Diagram: P2 = owner(x,y), with VT2 advancing ⟨0 1 0⟩ → ⟨0 2 0⟩ → ⟨0 3 0⟩ and modified_since2(0) = {x.id, y.id}; operations w2(x)0 (x.ts = VT2[2] = 1), w2(y)0 (y.ts = VT2[2] = 2), w2(x)2, w3(y)1, r1(y)0, r1(x)2; messages W_UPD(y,1,...), R_UPD(y,2,{x.id}) answering REQ(y.id,VT1[2]=0), and R_UPD(x,3,...) answering REQ(x.id,VT1[2]=2); receiving R_UPD(y,2,{x.id}) sets x.state = INV at P1; final values x.value=3 (x.ts=3), y.value=1 (y.ts=2), both owned by P2.]
Coherence protocol
ri(x)
RO owner i
ri(x)
wi(x)
rj(x) __________________________________
Send x to Pj wj(x)
__________________________________
local_invalidatei(...) Send ACK to j
local_invalidatei({...x...})
ri(x) ___________________________________
Recv x from owner k
local_invalidatei(...)
INV owner i
WR owner = i
wi(x) ___________________________________
Recv ACK from owner k
a) b)
Conclusions
1. introduction of a new concept: replication of checkpoints
– strict integration with the coherence protocol (may lead to overall cost reduction)
– fast recovery (the coherence protocol ensures consistency of the global checkpoint)
– network partitioning tolerated
2. design of coherence protocols extended with checkpoint replication for high reliability: atomic, causal and processor consistency
3. correctness proofs: consistency models redefined for an unreliable environment; safety (consistency of checkpoints); liveness (availability of checkpoints)