An Extended Home-Based Coherence Protocol for
Causally Consistent Replicated Read-Write Objects
Jerzy Brzeziński Michał Szychowiak
Checkpoint replication – a new recovery technique for Distributed Shared Memory (DSM) systems:
- low overhead – checkpointing integrated with the coherence protocol managing replicated DSM objects
- high availability of shared objects in spite of multiple node failures and network partitioning (majority partition)
- fast recovery
PLAN of the talk:
1. DSM model, consistency models, coherence protocols
2. DSM recovery problem
3. Checkpoint replication
4. Example for causal consistency
5. Examples for other consistency models
Message Passing
- easy to implement, hard to program
[Diagram: each node has a PROCESSOR and LOCAL MEMORY; no global clock, no common memory.]
DSM data unit
- physical memory page / single variable / object
- object = members ("object value") + methods (encapsulation); persistent objects; read/write objects
- read access: r(x)v
- write access: w(x)v
Replication
[Diagram: each processor holds its own replica of DATA X (X′, X″). Data replication enables local access; local access = efficiency improvement.]
Consistency models
How to efficiently control concurrent access to the
same object on different nodes to maintain the
consistency of DSM?
“A consistency model is essentially a contract between the software and the memory. It says that if the software agrees to obey certain rules, the memory promises to work correctly.”
[A. S. Tanenbaum]
Consistency models
- Atomic consistency
- Sequential consistency
- Causal consistency
- PRAM consistency
- Cache consistency
- Processor consistency
- Relaxed consistency models (release, entry, ...)
Symbols
Ĥ = set of all operations issued by the system
Hi = set of access operations to shared objects issued by Pi
HW = set of all write operations to shared objects
H|x = set of all access operations issued to object x
local order relation →i = total order of operations issued by Pi
real-time order: o1 →RT o2 = operation o1 finishes in real time before o2 starts
Atomic consistency
Execution of access operations is atomically consistent if there exists a total order ⇒ of the operations in Ĥ such that
o1 →RT o2 ⟹ o1 ⇒ o2,
satisfying two conditions:
(legality) ∀ w(x)v, r(x)v ∈ Ĥ, ∀ o(x)u ∈ Ĥ :: u ≠ v ⟹ ¬( w(x)v ⇒ o(x)u ⇒ r(x)v )
(exclusive writing) ∀ w1(x), w2(x) ∈ HW :: w1(x) ⇒ w2(x) ∨ w2(x) ⇒ w1(x)
Sequential consistency
Execution of access operations is sequentially consistent if for each process Pi ∈ Pcor there exists a legal serialization ⇒i of the operations in Hi ∪ HW satisfying two conditions:
(local order) ∀ Pj ∈ Pcor, ∀ o1, o2 ∈ Hj ∪ HW :: ( o1 →j o2 ⟹ o1 ⇒i o2 )
(exclusive writing) ∀ w1(x), w2(x) ∈ HW :: w1(x) ⇒i w2(x) ∨ w2(x) ⇒i w1(x)
Causal order
The causal-order relation ↝ in Ĥ is the transitive closure of the local order relations →i and a write-before relation that holds between a write operation and a read operation returning the written value:
(i) ∀ o1, o2 ∈ Ĥ :: ( ∃ Pi : o1 →i o2 ) ⟹ o1 ↝ o2
(ii) ∀ x ∈ O :: w(x)v ↝ r(x)v
(iii) ∀ o1, o2, o ∈ Ĥ :: ( o1 ↝ o ∧ o ↝ o2 ) ⟹ o1 ↝ o2
Causal consistency
Execution of access operations is causally consistent if for each process Pi ∈ Pcor there exists a legal serialization ⇒i of Hi ∪ HW satisfying the following condition:
∀ o1, o2 ∈ Hi ∪ HW :: ( o1 ↝ o2 ⟹ o1 ⇒i o2 )
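The causal-order relation above can be illustrated in code. This is a minimal sketch under a simplified, hypothetical operation model: operations are tuples (process, kind, variable, value), listed in per-process issue order; the relation is the transitive closure of local order and write-before.

```python
# Sketch: computing the causal-order relation on a tiny history.
# Operation model (hypothetical): (process, kind, var, value).
from itertools import product

def causal_order(ops):
    """Transitive closure of local (program) order and write-before."""
    n = len(ops)
    rel = [[False] * n for _ in range(n)]
    for i, j in product(range(n), repeat=2):
        oi, oj = ops[i], ops[j]
        # (i) local order: same process, oi issued before oj
        if oi[0] == oj[0] and i < j:
            rel[i][j] = True
        # (ii) write-before: a read returning the written value
        if oi[1] == 'w' and oj[1] == 'r' and oi[2] == oj[2] and oi[3] == oj[3]:
            rel[i][j] = True
    # (iii) transitive closure (Floyd-Warshall style)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if rel[i][k] and rel[k][j]:
                    rel[i][j] = True
    return rel

# History: P1 writes x=1; P2 reads x=1 and then writes y=2
ops = [(1, 'w', 'x', 1), (2, 'r', 'x', 1), (2, 'w', 'y', 2)]
rel = causal_order(ops)
assert rel[0][1]      # write-before: w(x)1 precedes r(x)1
assert rel[0][2]      # transitivity: w(x)1 precedes w(y)2
```

A causally consistent serialization for any process must then order w(x)1 before w(y)2.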
Coherence protocols
[Diagram: owner(x) – the OWNER NODE holding the master replica of x; the remaining nodes hold ordinary replicas; manager(x); CS(x) – copyset, the set of nodes holding replicas of x.]
Coherence protocols – write update
[Diagram: on W(x)5 at the OWNER NODE, the master replica takes the value 5 and UPD(x,5) messages carry it to every ordinary replica; requires broadcast or similar tools.]
Coherence protocols – write invalidate
[Diagram: on W(x)5 at the OWNER NODE, the master replica takes the value 5 and INV(x) messages invalidate the ordinary replicas.]
Coherence protocols – write invalidate (continued)
[Diagram: after invalidation, only the master replica holds the value; suited to objects with a low read/write ratio.]
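The two propagation strategies can be contrasted with a minimal sketch. The class names `Owner` and `Replica` are illustrative assumptions, not part of the protocol.

```python
# Sketch (hypothetical classes): write-update vs. write-invalidate
# propagation from the owner (master replica) to the copyset CS(x).

class Replica:
    def __init__(self):
        self.state, self.value = 'RO', None

class Owner:
    def __init__(self, copyset):
        self.value = None
        self.copyset = copyset          # CS(x): nodes holding ordinary replicas

    def write_update(self, v):
        self.value = v
        for r in self.copyset:          # UPD(x,v) to every replica: needs broadcast
            r.value, r.state = v, 'RO'

    def write_invalidate(self, v):
        self.value = v
        for r in self.copyset:          # INV(x): replicas re-fetch on next access
            r.state = 'INV'

cs = [Replica(), Replica()]
o = Owner(cs)
o.write_update(5)
assert all(r.value == 5 for r in cs)
o.write_invalidate(7)
assert all(r.state == 'INV' for r in cs)
```

Write-update pays the broadcast cost on every write; write-invalidate defers the cost to the next read, which pays off when the read/write ratio is low.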
Recovery problem
[Diagram: Pi performs wi(x)2 and fails at tf; x is restored to its old value x=1. Meanwhile Pj has read rj(x)2 and, with y = square(x), performed wj(y)4. After recovery x=1 but y=4 – an inconsistency; a later rj(x) at Pj observes it.]
Recovery problem
The goal of the DSM recovery protocol is to restore a global state of the shared memory that is consistent, i.e. one that allows the distributed computation to restart from this state and guarantees that any access operation issued in the restarted computation will be consistent according to the assumed consistency model.
Checkpoints
restored object value should reflect the history of operations performed on the object (this is the job of the DSM recovery protocol)
Solutions
Three main approaches:
1. write-update with full replication
2. logging of coherence operations/messages
3. checkpointing algorithms formerly developed for message passing, adapted for the DSM
Solutions
Most solutions offer only single-failure resilience.
(Design space: write-update vs. write-invalidate; logging vs. checkpointing; coordinated vs. independent.)
Checkpoints
restored object value should reflect the history of operations performed on the object (this is the job of the DSM recovery protocol)
consistent state of the memory may be stored in the form of backup copies of all existing objects (checkpoints)
on recovery, consistent values of all objects can be quickly fetched from their checkpoints
checkpoints need to be saved in a stable (non‑volatile) storage able to survive failures
Drawbacks
1. Requirements: external stable storage
2. Cost of checkpointing: coordination between processes; access time to the stable storage
3. Cost of recovery: restoration of the consistent state; re-execution of the processing
New solution
- checkpoint storage in the DSM memory
- checkpoint replication – special object replicas – CCS(x) = checkpoint copyset
- full integration with the coherence protocol to manage all kinds of replicas
- fast recovery – the current checkpoint of each object is consistent with the current state of the memory – no re-execution
Home-based coherence protocol
Every object x is assigned a home-node, which collects all modifications of x. The process running on that node is called the owner of x. Different objects can have different home-nodes (owners). We assume that there is a home-node for each object x, and that the identity of the owner holding the master replica of x is known.
Home-node
The owner holds the master replica of x. Home-node = static owner. The master replica is distinguished by the WR state. Each write access issued to a WR replica is performed instantaneously; similarly, each read access.
Multiple readers
Besides the master replica there can be several ordinary replicas in the RO state. Each read access issued to an RO replica is performed instantaneously. Each write access issued to an RO replica requires communication with the home-node (to update the master replica).
Multiple writers
Several processes can issue write accesses to the
same object x at the same time.
The global order of modifications of x is determined
by the order of reception of the update messages
(UPD) at the home-node.
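A minimal sketch of this serialization rule, assuming a hypothetical (sender, value) message format:

```python
# Sketch: the home-node serializes concurrent writers by applying
# UPD messages in their reception order.

def home_node_apply(updates):
    """Reception order at the home-node defines the global write order."""
    value, log = None, []
    for sender, v in updates:
        value = v                    # later-received update overwrites earlier ones
        log.append((sender, v))
    return value, log

# Pi and Pj write x concurrently; arrival order decides
value, log = home_node_apply([('Pj', 2), ('Pi', 1), ('Pj', 3)])
assert value == 3                    # last-received UPD wins
assert [v for _, v in log] == [2, 1, 3]
```

All replicas then observe the modifications of x in this single, home-node-defined order.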
Invalidation
The INV state denotes an invalidated ordinary replica. Any access issued to such a replica requires fetching the value from the master replica. The UPD message is sent from the home-node; on reception, the invalidated replica is updated and becomes RO.
Coherence protocol automaton
ri(x)
RO owner i
ri(x)
wi(x)
rj(x) ______________________________________
Send x to Pj wj(x)
______________________________________
local_invalidatei(...) Send ack to j
local_invalidatei({...x...})
ri(x) ______________________________________
Recv x from owner k local_invalidatei(...)
INV owner i
WR owner = i
wi(x) _______________________________________
Recv ack from owner k
a) b)
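The ordinary-replica side of the automaton can be sketched as a small state machine. This is a simplification: the UPD exchange with the home-node is abstracted into a callback.

```python
# Sketch of the ordinary-replica states (owner != i): RO serves local
# reads; local_invalidate moves the replica to INV; a read on an INV
# replica first fetches the current value from the master replica.

class OrdinaryReplica:
    def __init__(self, fetch_from_owner):
        self.state = 'RO'
        self.value = None
        self.fetch = fetch_from_owner   # stands in for the UPD exchange

    def local_invalidate(self):
        self.state = 'INV'              # value may be stale

    def read(self):
        if self.state == 'INV':
            self.value = self.fetch()   # recv x from owner k
            self.state = 'RO'
        return self.value

r = OrdinaryReplica(fetch_from_owner=lambda: 42)
r.local_invalidate()
assert r.state == 'INV'
assert r.read() == 42 and r.state == 'RO'
```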
Example
[Message diagram: operations wi(x)1, wj(x)2, wj(y)4, wj(x)3, wi(x)4, wj(y)9; messages UPD(x,2,[0,1]), UPD(x,-,[1,1]), UPD(x,3,[1,3]), UPD(x,-,[1,3]); replica states evolve from x{RO,0,[0,0]}, y{RO,0,[0,0]} through x{WR,1,[1,0]}, x{RO,2,[1,1]}, y{WR,4,[1,2]}, x{WR,3,[1,3]}, x{RO,3,[1,3]} to x{WR,4,[2,3]}, y{WR,9,[1,3]}.]
Checkpointing
A WR replica which has been modified and not accessed by any process other than the owner is dirty. The first external access to a WR/dirty replica requires checkpointing the accessed value (in order to protect the causal dependency). The checkpoint operation consists in replicating the accessed value, i.e. updating the checkpoint replicas.
Checkpoint replicas
A C replica is updated on checkpoint operations and never becomes invalidated. A single checkpoint operation ckpt(x)v is performed by the owner of x, Pk, and consists in atomically updating all C replicas of x held by processes included in CCS(x) with the value v, carried in the ckpt(x,v,VTkx) message.
Checkpoint replicas
The C replica of x held by Pj is duplicated into a new ROC replica on the first local access to x. In fact, a checkpoint replica is a prefetched current value of x and can therefore serve further accesses. During the next single checkpoint operation ckpt(x)v′, all existing ROC replicas of x are destroyed.
Checkpointing
RO/dirty and ROC/dirty denote additional information about write accesses invoked locally on an RO or ROC replica. A replica is in the RO/dirty or ROC/dirty state when it has been modified by the holding process but possibly not yet checkpointed by its owner.
Burst checkpointing
The coordinated burst checkpoint operation consists in atomically performing two operations:
(1) taking a single checkpoint of all objects x in the WR state, and
(2) sending checkpoint requests CKPT_REQ(y) to the owners of all other objects y in the RO/dirty and ROC/dirty states.
After that, the RO/dirty and ROC/dirty objects are switched to the RO and ROC states, respectively.
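The two steps of the burst checkpoint can be sketched as follows; the object-store layout and the `send_ckpt_req` callback are assumptions for illustration only.

```python
# Sketch of the coordinated burst checkpoint (two steps, performed
# atomically in the protocol); store layout is hypothetical.

def burst_checkpoint(objs, send_ckpt_req):
    # (1) single checkpoint of every object held in WR state
    for name, o in objs.items():
        if o['state'].startswith('WR'):
            o['ckpt'] = o['value']
            o['state'] = 'WR'                        # WR/dirty -> WR
    # (2) CKPT_REQ(y) to owners of RO/dirty and ROC/dirty objects
    for name, o in objs.items():
        if o['state'] in ('RO/dirty', 'ROC/dirty'):
            send_ckpt_req(name)                      # owner will checkpoint y
            o['state'] = o['state'].split('/')[0]    # -> RO or ROC

reqs = []
objs = {'x': {'state': 'WR/dirty', 'value': 5, 'ckpt': 2},
        'y': {'state': 'RO/dirty', 'value': 3, 'ckpt': 0}}
burst_checkpoint(objs, reqs.append)
assert objs['x']['ckpt'] == 5 and objs['x']['state'] == 'WR'
assert reqs == ['y'] and objs['y']['state'] == 'RO'
```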
Vector clock
The causal relationship of the memory accesses is reflected in the vector timestamps associated with each shared object. Each process Pi manages a vector clock VTi. The i-th component of VTi counts writes performed by Pi. More precisely, only intervals of write operations not interlaced with communication with other processes are counted, as this is sufficient to track the causal dependency between operations issued by distinct processes.
Vector clock
There are three operations performed on VTi:
inc(VTi) – increments the i-th component of VTi
update(VTi,VTj) – returns the component-wise maximum of the two vectors
compare(VTi<VTj) – returns true iff ∀k: VTi[k] ≤ VTj[k] and ∃k: VTi[k] < VTj[k]
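A minimal sketch of the three operations, assuming vectors are represented as plain lists of length n:

```python
# Sketch of the three vector-clock operations from the slide.

def inc(vt, i):
    """Increment the i-th component (a write interval by Pi)."""
    vt = list(vt)
    vt[i] += 1
    return vt

def update(vt_i, vt_j):
    """Component-wise maximum of the two vectors."""
    return [max(a, b) for a, b in zip(vt_i, vt_j)]

def compare(vt_i, vt_j):
    """True iff vt_i < vt_j: every component <=, at least one <."""
    return all(a <= b for a, b in zip(vt_i, vt_j)) and \
           any(a < b for a, b in zip(vt_i, vt_j))

assert inc([1, 0], 0) == [2, 0]
assert update([1, 3], [2, 1]) == [2, 3]
assert compare([1, 1], [1, 3]) is True
assert compare([1, 3], [2, 1]) is False   # concurrent vectors
```

Note that `compare` is a strict partial order: two concurrent timestamps compare as false in both directions.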
Vector timestamp
Each replica of a shared object x stored at Pi is assigned a timestamp VTix. On any local modification wi(x), VTix becomes equal to VTi; on an update UPD(x,v,VTx) from the master replica, VTix becomes equal to VTx.
Causal order
The local_invalidatei(VT) operation ensures the correctness of the basic protocol by setting to INV the status of all locally held replicas x not owned by Pi for which compare(VTix<VT) is true. The purpose of this operation is to invalidate all replicas that could potentially have been overwritten between VTix and VT.
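A sketch of local_invalidate, assuming a hypothetical dictionary representation of the locally held replicas:

```python
# Sketch: local_invalidate_i(VT) invalidates every locally held replica
# not owned by Pi whose timestamp is dominated by VT, i.e.
# compare(VT_ix < VT) holds.

def dominated(vt_x, vt):
    return all(a <= b for a, b in zip(vt_x, vt)) and \
           any(a < b for a, b in zip(vt_x, vt))

def local_invalidate(replicas, me, vt):
    for x in replicas.values():
        if x['owner'] != me and dominated(x['vt'], vt):
            x['state'] = 'INV'    # value may have been overwritten since vt_x

replicas = {'x': {'owner': 2, 'vt': [1, 1], 'state': 'RO'},
            'y': {'owner': 1, 'vt': [1, 1], 'state': 'RO'}}
local_invalidate(replicas, me=1, vt=[2, 3])
assert replicas['x']['state'] == 'INV'   # not owned, dominated timestamp
assert replicas['y']['state'] == 'RO'    # owned by P1: never invalidated locally
```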
Causal order
The local_invalidatej(VTjx) must be executed
when copying a checkpoint replica from C to
ROC.
Example
[Message diagram with checkpointing: processes Pi, Pj, Pl; operations wi(x)1, wj(x)2, wj(y)4, wj(x)3, wj(x)4, wi(x)5, wj(y)9, rl(y)9; messages UPD(x,2,[0,1]), UPD(x,-,[1,1]), UPD(x,3,[1,3]), UPD(x,-,[1,3]), UPD(x,4,[1,4]), UPD(x,-,[1,4]), UPD(y,9,[1,3]); checkpoints CKPT(x,2,[1,1]) and, after CKPT_REQ(x), CKPT(x,5,[2,4]); replica states include x{WR,1,[1,0]}, x{RO,2,[1,1]}, x{RO/dirty,3,[1,3]}, x{RO/dirty,4,[1,4]}, x{WR/dirty,5,[2,4]}, y{WR,4,[1,2]}, y{WR,9,[1,3]}.]
Example
[Variant with checkpointing of y: operations wi(x)1, wj(x)2, wj(y)4, wj(x)3, wi(x)4, wi(x)5, wj(y)9, rl(y)9; checkpoints CKPT(x,2,[1,1]), CKPT(y,4,[1,3]), CKPT(y,9,[1,3]) and, after CKPT_REQ(x), CKPT(x,5,[3,3]); messages UPD(x,2,[0,1]), UPD(x,-,[1,1]), UPD(x,3,[1,3]), UPD(x,-,[1,3]), UPD(y,9,[1,3]); replica states include x{WR,1,[1,0]}, x{RO,2,[1,1]}, x{RO/dirty,3,[1,3]}, x{WR/dirty,4,[2,3]}, y{WR,4,[1,2]}, y{WR,9,[1,3]}.]
Recovery
At any time before a failure occurs, there are at least nrmin+1 replicas of x (the master replica plus nr ≥ nrmin C replicas). Thus, in the case of a failure of f ≤ nrmin processes (at most f processes crash or become separated from the majority partition), there will be at least one non-faulty replica of x in the primary partition, which can serve further access requests to x.
Recovery
As long as the current owner is non-faulty and in the majority partition, the extended coherence protocol ensures that all requests to x issued in the majority partition are processed. However, if the current owner becomes unavailable, the recovery procedure must elect a new owner from among all available processes in the partition – the one holding the most recent replica of x (timestamp comparison is necessary for RO/ROC replicas).
Recovery
If there are no RO/ROC replicas of x, any available C replica may be used to restore the value of x. This is sufficient to continue the distributed processing in the majority partition. Alternatively, all shared objects may be synchronously rolled back to their checkpoint values, with all RO/dirty and ROC/dirty replicas discarded.
CS(x) vs. CCS(x)
boundary restriction on nr = |CCS(x)|: nrmin ≤ nr ≤ nrmax
f-resilience: nr = nrmin = nrmax = f (with f bounded by the majority-partition requirement, f < n/2)
prefetching: if |CS(x)| ≥ nr then CCS(x) ⊆ CS(x) (reduction of the update cost)
Other protocols for ckpt replication
1. Another protocol for causal consistency
2. Atomic consistency
3. Processor consistency
Coherence protocol
WR (owner = i):
   ri(x), wi(x) – performed locally
   rj(x) – send x to Pj
   wj(x) – send x and ownership to Pj
RO (owner ≠ i):
   ri(x) – performed locally
   wi(x) – recv x and ownership from owner k, local_invalidatei(VTk) → WR
   → INV on local invalidation when performing local_invalidatei(VT)
INV (owner ≠ i):
   ri(x) – recv x from owner k, local_invalidatei(VTk) → RO
   wi(x) – recv x and ownership from owner k, local_invalidatei(VTk) → WR
Extended coherence protocol
Two additional replica states:
C (checkpoint)
ROC (read-only checkpoint)
A checkpoint replica C is updated on checkpoint operations. After that moment, it can be switched to ROC on the next local read access to x, triggering coherence operations (local_invalidate). Until the next checkpoint, the ROC replica serves read accesses as RO replicas do.
Extended coherence protocol
WR (owner = i):
   ri(x), wi(x) – performed locally
   rj(x), j ∉ CCS – send x to Pj
   rj(x), j ∈ CCS – Checkpoint(CCS), send x to Pj
   wj(x), j ∉ CCS – send x and ownership to Pj
   wj(x), j ∈ CCS – Checkpoint(CCS), send x and ownership to Pj
RO (owner ≠ i):
   ri(x) – performed locally
   wi(x) – recv x and ownership from owner k, local_invalidatei(VTk) → WR
   → INV on invalidation when performing local_invalidatei(VT)
ROC (owner ≠ i):
   ri(x) – performed locally
   wi(x) – recv x and ownership from owner k, Change CCS, local_invalidatei(VTk) → WR
   CHECKPOINT – atomic update → C
INV (owner ≠ i):
   ri(x) – recv x from owner k, local_invalidatei(VTk) → RO
   wi(x) – recv x and ownership from owner k, local_invalidatei(VTk) → WR
C (owner ≠ i):
   CHECKPOINT – atomic update
   ri(x) – local_invalidatei(VTx) → ROC
Extended coherence protocol
[Diagram: Pi issues wi(x)1, wi(x)2; the external read rj(x)2 triggers CKPT(x,2,VTx), which atomically updates checkpoint replicas x{C,2,VTx} at Pa and Pb (ACK). Pi continues with wi(x)3, wi(y)2, wi(x)4 and fails at tf; Pj's later rj(x) can still be served.]
Extended coherence protocol
[Diagram: Pi issues wi(x)2, wi(y)5, wi(y)6; checkpoints CKPT(x,2,VTx) and CKPT(y,5,VTy) are taken; Pj issues rj(x)2 and rj(y), where rj(y)5 is marked inconsistent (!); y{INV,-,-} is shown at Pj; Pi fails at tf.]
Atomic consistency
Execution of access operations is atomically consistent if there exists a total order ⇒ of the operations in Ĥ such that
o1 →RT o2 ⟹ o1 ⇒ o2,
satisfying two conditions:
(legality) ∀ w(x)v, r(x)v ∈ Ĥ, ∀ o(x)u ∈ Ĥ :: u ≠ v ⟹ ¬( w(x)v ⇒ o(x)u ⇒ r(x)v )
(exclusive writing) ∀ w1(x), w2(x) ∈ HW :: w1(x) ⇒ w2(x) ∨ w2(x) ⇒ w1(x)
Coherence protocol
WR (owner = i):
   ri(x), wi(x) – performed locally
   rj(x) – send x to Pj
   wj(x) – send x and ownership to Pj
RO (owner = i):
   ri(x) – performed locally
   wi(x) – Invalidate(CS) → WR
   rj(x) – send x to Pj
   wj(x) – send x and ownership to Pj
RO (owner ≠ i):
   ri(x) – performed locally
   wi(x) – recv ownership, Invalidate(CS) → WR
   INVALIDATE – send ACK to owner → INV
INV (owner ≠ i):
   ri(x) – recv x from owner → RO
   wi(x) – recv x and ownership from owner, Invalidate(CS) → WR
Delayed checkpointing
[Diagram: Pi issues wi(x)1, wi(x)2; ckpt(x)2 is taken on the external read rj(x)2; Pi continues with wi(x)3, wi(x)4, sending INV(x); Pi fails at tf; Pj later re-reads rj(x)2 from the checkpoint; Pj also issues wj(y)1, wj(y)4.]
Extensions for checkpointing
ROC (read-only checkpoint) – a checkpoint replica available for read access to object x. All checkpoint replicas are in state ROC directly after the checkpointing operation, until some process issues a write request to x.
C (checkpoint) – a checkpoint replica available only on recovery, to restore a consistent state of the DSM (an invalidated ROC).
CCS(x) – checkpoint copyset
Extensions for checkpointing
WR (owner = i):
   wi(x) – Invalidate(CS ∪ CCS)
   wj(x) – Checkpoint(CCS), send x and ownership to Pj
   rj(x) – Checkpoint(CCS), send x to Pj
RO (owner = i):
   ri(x) – performed locally
ROC (owner ≠ i):
   ri(x) – performed locally
   wi(x) – recv ownership, Change CCS, Invalidate(CS ∪ CCS) → WR
   INVALIDATE – send ACK to owner
   CHECKPOINT – send ACK to owner / atomic update
C (owner ≠ i):
   CHECKPOINT – atomic update
   ri(x) – recv x from owner → ROC
INV (owner ≠ i):
   INVALIDATE – send ACK to owner
. . .
Checkpoint inconsistency problem
[Diagram: Pi issues wi(x)2, wi(y)5, wi(y)6; ckpt(x)2 is taken on rj(x)2, ckpt(y)5 later; a read rj(y)5 is marked inconsistent (!); INV(y) is sent; Pi fails at tf.]
Checkpoint atomicity
Checkpoint replicas are updated in two phases:
C (owner ≠ i): UPD – send ACK1 to owner → CX
CX (owner ≠ i): COMMIT – send ACK2 to owner → ROC (owner ≠ i)
Recovery
if the owner is not available (crashed or separated from the majority partition of system nodes), then ...
if any process from CS(x) is available – choose it as the new owner, else ...
if any process from CCS(x) is available – choose it as the new owner
concurrent owners (in distinct partitions) – only the "majority" owner can serve write requests and make new checkpoints
requires failure detection and majority-partition detection
Correctness of the extended protocol
1. safety – the protocol correctly maintains the coherence of shared data, according to the consistency model, despite failures of processes and communication links
2. liveness – each access operation issued to any shared data will eventually be performed (in finite time), even in the presence of failures.
Correctness of the extended protocol
Operation dependency: For any two operations o1, o2 ∈ Ĥ(t), we say that o2 is dependent on o1, denoted o1 ↝ o2, iff:
1) ∃ Pi ∈ Pcor(t) :: o1 →i o2, or
2) (write-before) ∃ x ∈ O, ∃ Pi, Pj :: o1 = wi(x)v ∧ o2 = rj(x)v, or
3) (transitive closure) ∃ o ∈ Ĥ(t) :: o1 ↝ o ∧ o ↝ o2
Correctness of the extended protocol
Legal history: A set H(t) ⊆ Ĥ(t) is a legal history of access operations iff:
1) ∀ Pi ∈ Pcor(t) :: Hi(t) ⊆ H(t), and
2) ∀ o2 ∈ H(t), ∀ o1 ∈ Ĥ(t) :: ( o1 ↝ o2 ⟹ o1 ∈ H(t) )
Correctness of the extended protocol
Execution of access operations is atomically consistent if there exists a total order ⇒ of the operations in H such that
o1 →RT o2 ⟹ o1 ⇒ o2,
satisfying two conditions:
(legality) ∀ w(x)v, r(x)v ∈ H, ∀ o(x)u ∈ H :: u ≠ v ⟹ ¬( w(x)v ⇒ o(x)u ⇒ r(x)v )
(exclusive writing) ∀ w1(x), w2(x) ∈ H :: w1(x) ⇒ w2(x) ∨ w2(x) ⇒ w1(x)
where H = legal history of operations issued by the system, Hi = set of all access operations to shared objects issued by Pi, and o1 →RT o2 = operation o1 finishes in real time before o2 starts.
Correctness of the extended protocol
Theorem 1 (safety): Every access to an object x
performed by a correct process is an atomically
consistent reliable operation.
Theorem 2 (liveness): The protocol eventually brings
a value of x to any correct process in the majority
partition requesting the access.
Brief summary of the protocol
independent checkpointing of DSM objects
replicated checkpoints
resilience to multiple node failures and network
partitioning
independent recovery (only lost objects must be
recovered)
Processor consistency
Execution of access operations is processor consistent if for each Pi, i = 1..n, there exists a serialization ⇒i of the set Hi ∪ HW that satisfies:
(PRAM) ∀ j = 1..n, ∀ o1, o2 ∈ Hj ∪ HW :: ( o1 →j o2 ⟹ o1 ⇒i o2 )
(cache) ∀ x ∈ O, ∀ w1, w2 ∈ HW ∩ H|x, ∀ i, j = 1..n :: ( w1 ⇒i w2 ⟹ w1 ⇒j w2 )
Coherence protocol
Brand new coherence protocol:
Brzeziński and Szychowiak 2003
home-based (home-node=static owner, no object
manager necessary) – ensures cache consistency
local invalidation – ensures PRAM consistency
proven correct
Coherence protocol
[Diagram: P2 = owner(x,y), with VT2 advancing ⟨0 1 0⟩ → ⟨0 2 0⟩ → ⟨0 3 0⟩ and modified_since2(0) = {x.id, y.id}; operations w2(x)0 (x.ts = VT2[2] = 1), w2(y)0 (y.ts = VT2[2] = 2), w2(x)2, w3(y)1, r1(y)0, r1(x)2; messages W_UPD(y,1,...), R_UPD(y,2,{x.id}) answering REQ(y.id,VT1[2]=0), and R_UPD(x,3,...) answering REQ(x.id,VT1[2]=2); receiving R_UPD(y,2,{x.id}) sets x.state = INV at P1; final values x.value=3 (x.ts=3), y.value=1 (y.ts=2), both owned by P2.]
Coherence protocol
ri(x)
RO owner i
ri(x)
wi(x)
rj(x) __________________________________
Send x to Pj wj(x)
__________________________________
local_invalidatei(...) Send ACK to j
local_invalidatei({...x...})
ri(x) ___________________________________
Recv x from owner k
local_invalidatei(...)
INV owner i
WR owner = i
wi(x) ___________________________________
Recv ACK from owner k
a) b)
Conclusions
1. introduction of a new concept: replication of checkpoints
– strict integration with the coherence protocol (may lead to overall cost reduction)
– fast recovery (the coherence protocol ensures consistency of the global checkpoint)
– network partitioning tolerated
2. design of coherence protocols extended with checkpoint replication for high reliability: atomic, causal and processor consistency
3. correctness proofs: consistency models redefined for an unreliable environment; safety (consistency of checkpoints); liveness (availability of checkpoints)