
Advanced Database Topics

Copyright © Ellis Cohen 2002-2005

Synchronous Data Replication

These slides are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License.

For more information on how you may use them, please see http://www.openlineconsult.com/db


Topics

Models for Replication

ROWA Approaches

Eager Approaches

Reconciliation


Models for Replication


Data Replication

Put copies of the same data at multiple sites

Achieves
– Decreased network latency, by placing replicas near multiple high-demand sites (access data at nearest replica)
– High availability & reliability in the face of failure
– Parallel processing
– Scalability (a single copy is no longer a bottleneck)
– Disconnected operation

Query Processing
Is even more complex, since the coordinator's query optimizer additionally needs to decide which replica to use


Replication Groups

Replication Group: a group of related data items which are replicated together at different sites

A database may have multiple replication groups

Two different replication groups may be replicated
– at the same set of sites
– at disjoint sets of sites
– at overlapping sets of sites

A replication group typically represents all the data used in an application

A transaction generally accesses data in a single replication group. Things are much more complicated if a transaction accesses multiple replication groups.

[Figure: sites S1, S2 and S3 holding replicas of replication groups Ra and Rb]


Consistency Timeframes

Immediate Consistency
All replicas in a group are consistent after each and every update is made.

Transactional Consistency
At the end of each transaction, the replica group has a consistent model of the committed data. Updates need only be made at a single replica, and are then consistently propagated to the others.

Eventual Consistency
Updates made at one replica will eventually be propagated to the other replicas, but at any point in time, different replicas may have inconsistent committed data. The replica group will only have a consistent model of the committed data when the system quiesces (all updates have been propagated).


1SR - 1 Copy Serializability

Replicated form of serializability:
Interleaved execution of transactions on a replicated database is equivalent to serial execution of those transactions on a database where there is only one copy of each data item

Eventual approaches
– may not be able to ensure 1SR, but
– may be able to satisfy other weak consistency guarantees


Basic Replication Topologies

Master/Snapshot
The replication group has a primary copy held at the master. Its committed state represents the current committed state of the entire replication group.

[Figure: a Master and its Snapshots]

Group
The replication group has no single primary copy. It has a consistent (committed) state only if the replicas are consistent with one another


Master vs Snapshot Sites

Master (Primary Copy) Site
– Holds all the data in the replication group
– Holds the most consistent / up-to-date version of the data

Snapshot Site
– May hold only part (instead of all) of the data in a replication group, i.e. a subset of its tables, or even horizontal and/or vertical fragments of tables
– Often not as up-to-date as Master sites
– May be completely or partly read-only, in particular if it contains materialized views (involving multiple tables) of the data in the replication group


Partial Snapshots

Partial snapshots further complicate query processing
– Coordinator may need to get part of the queried data from one snapshot, and part from another
– Especially complicated if an operation needs to update different partial snapshots (we'll ignore this case!)

Static vs Dynamic Partial Snapshots

Static Partial Snapshots
Specified declaratively, and made known to the coordinator, which uses the specification for query processing

Dynamic Partial Snapshots
Replica appears complete, but processes queries by obtaining missing data from the primary copy

Leads to complex query processing at the snapshot site, since its request to the primary copy must take into account data already replicated (which might have already been modified during the transaction!)


Snapshots with Materialized Views

A materialized view is a view where the results of the view are actually maintained persistently

When the underlying data is modified, the materialized view must generally be modified as well (e.g. by triggers, sometimes set up automatically by the replication manager) or deleted.

If a query optimizer knows about materialized views, it can rewrite queries on the underlying data to efficiently use the materialized view instead ("query rewriting")

Different snapshots may contain different materialized views of the data. If the coordinator's query optimizer knows the location of materialized views, this can affect the replica it asks to process a query.

Materialized views can be static or dynamic. Dynamic materialized views often result from remembering the result sets of executed queries.


Complex Replication Topologies

MultiMaster (Combines Group & Master)

Hierarchical Master


Consistency Models

Total Consistency
All replicas (unless they are crashed or disconnected) are consistent with one another.

Master Consistency
There is a master replica (the primary copy). Transactions committed at the master site reflect the intended state of the data.

MultiMaster Consistency
Like master consistency, but there are multiple master replicas, all consistent with one another


Update Propagation Models

Synchronous
– Write-Synchronous (ROWA): Coordinator's write operation is not completed until every replica is updated
– Commit-Synchronous (EAGER): All replicas commit (using 2PC) as part of coordinator's transaction

Asynchronous
– As-Needed-Propagation (LAZY): After transaction ends, updates are propagated to other replicas as needed
– Eventual-Propagation (EVENTUAL): After transaction ends, updates are eventually propagated to other replicas

What is the consistency timeframe & model for each one?


Update & Consistency Models

Write-Synchronous (ROWA)
Coordinator's write operation is not completed until every replica is updated
Consistency: Immediate Total Consistency

Commit-Synchronous (EAGER)
All replicas commit (using 2PC) as part of coordinator's transaction
Consistency: Transactional Total Consistency

As-Needed-Propagation (LAZY)
After transaction ends, updates are propagated to other replicas as needed
Consistency: Transactional (Multi-)Master Consistency

Eventual-Propagation (EVENTUAL)
After transaction ends, updates are eventually propagated to other replicas
Consistency: Eventual Consistency


Update Models & Topologies

[Table: rows are the topologies (Master/Snapshot, Group); columns are the update models (ROWA, Eager, Lazy, Eventual)]

Sometimes ROWA and Eager are both called EAGER

Sometimes Lazy and Eventual are both called LAZY


ROWA: Read One, Write All


ROWA Update Model

ROWA Consistency Model

Immediate Consistency
– Write-Synchronous (ROWA): Coordinator's write operation is not completed until every replica is updated

Transactional Consistency
– Commit-Synchronous (EAGER): All replicas commit (using 2PC) as part of coordinator's transaction
– As-Needed-Propagation (LAZY): After transaction ends, updates are propagated to other replicas as needed

Eventual Consistency
– Eventual-Propagation (EVENTUAL): After transaction ends, updates are eventually propagated to other replicas


ROWA Overview

Read One, Write All
All replicas are updated immediately (without waiting until the transaction doing the update commits)
Data can be read from any replica

Immediate Total Consistency
All replicas (unless crashed or disconnected) are always consistent with one another

Topology
Group-based. No need for a special primary copy.

Concurrency
Usually lock-based. Other concurrency models can be used as well.

When might the ROWA model be used?


ROWA Uses

Hot Standby
Upon failure, another replica can be switched in immediately

Mobile Use
Suppose every cell has a nearby replica. A mobile coordinator can switch from replica to replica during a transaction, using whichever one is nearest

Reliability
Read from multiple replicas simultaneously to
– avoid waiting in case of site/link failure
– ensure that data is correct

Updates can be very expensive. Either they're done infrequently, or they must be worth the cost


ROWA

On write
Coordinator must acquire X locks on all replicas, and writes to all of them

On read
Coordinator acquires an S lock on the one replica it will actually read

Ensures 1SR
Non-locking concurrency control can also be used

ROWA Advantage
Can read from any replica

ROWA Disadvantage
Every write requires a communication round trip involving the farthest & slowest replicas

Serious ROWA Problem
If a replica site crashes, the coordinator and all competing transactions must wait until it recovers

Solutions
• All Available Writes
• Quorum Consensus (Read Some, Write Some)
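
The lock-based ROWA rules (X lock on every replica to write, S lock on one replica to read) can be sketched in a few lines. This is only an illustration under simplifying assumptions: the `Replica` class, the in-memory lock table and the `LockConflict` exception are hypothetical names, and a conflicting request raises an error where a real system would usually wait.

```python
class LockConflict(Exception):
    """Raised when a lock cannot be granted (a real system might wait instead)."""


class Replica:
    """One copy of the data with a tiny per-item lock table (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.locks = {}  # key -> (mode, set of transaction ids)

    def lock(self, txn, key, mode):
        held = self.locks.get(key)
        if held is None:
            self.locks[key] = (mode, {txn})
            return
        held_mode, owners = held
        if owners == {txn}:
            # Sole owner: keep the lock, upgrading S to X if needed.
            stronger = "X" if "X" in (mode, held_mode) else "S"
            self.locks[key] = (stronger, owners)
        elif mode == "S" and held_mode == "S":
            owners.add(txn)  # shared locks are compatible
        else:
            raise LockConflict(f"{txn} blocked on {key} at {self.name}")

    def release_all(self, txn):
        for key in list(self.locks):
            _, owners = self.locks[key]
            owners.discard(txn)
            if not owners:
                del self.locks[key]


def rowa_write(txn, replicas, key, value):
    # Write All: take an X lock on every replica before changing any copy.
    for replica in replicas:
        replica.lock(txn, key, "X")
    for replica in replicas:
        replica.data[key] = value


def rowa_read(txn, replica, key):
    # Read One: an S lock on the chosen replica is enough.
    replica.lock(txn, key, "S")
    return replica.data.get(key)
```

Because a writer holds X locks everywhere, a reader at any single replica either sees the write or is blocked by it, which is why one S lock suffices. A transaction that hits `LockConflict` is expected to call `release_all` on every replica before retrying.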


All Available Writes (AAW)

At transaction start, assume
All replicas are available

Coordinator writes by
Writing to all known available replicas. Those which do not ACK within the timeout period are marked as unavailable but otherwise ignored

Coordinator reads by
Reading from a chosen (e.g. nearest) replica. If it times out, mark it as unavailable, and read from a different replica

Coordinator augments 2PC with
– Missing Writes Validation: Make sure that all replicas that were not written to are still unavailable
– Access Validation: Make sure that all replicas read or written are still available
This is necessary for 1SR
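
A minimal sketch of the AAW bookkeeping follows, assuming an in-memory `Site` whose `up` flag stands in for "ACKs within the timeout". The class names, the `TimeoutError` signalling and the `validate` method are illustrative assumptions, not part of any particular system.

```python
class Site:
    def __init__(self, name, up=True):
        self.name, self.up, self.data = name, up, {}

    def write(self, key, value):
        if not self.up:
            raise TimeoutError(self.name)  # stands in for a missing ACK
        self.data[key] = value

    def read(self, key):
        if not self.up:
            raise TimeoutError(self.name)
        return self.data.get(key)


class AAWCoordinator:
    def __init__(self, sites):
        self.sites = list(sites)                  # assumed ordered nearest-first
        self.available = {s.name for s in sites}  # assume all available at start
        self.accessed = set()                     # replicas read or written
        self.missed = set()                       # replicas that missed a write

    def write(self, key, value):
        for site in self.sites:
            if site.name not in self.available:
                self.missed.add(site.name)
                continue
            try:
                site.write(key, value)
                self.accessed.add(site.name)
            except TimeoutError:
                self.available.discard(site.name)
                self.missed.add(site.name)

    def read(self, key):
        for site in self.sites:
            if site.name not in self.available:
                continue
            try:
                value = site.read(key)
                self.accessed.add(site.name)
                return value
            except TimeoutError:
                self.available.discard(site.name)
        raise RuntimeError("no replica available")

    def validate(self):
        # Run as part of 2PC; reading `up` stands in for probing each site.
        up = {s.name: s.up for s in self.sites}
        missing_writes_ok = all(not up[name] for name in self.missed)
        access_ok = all(up[name] for name in self.accessed)
        return missing_writes_ok and access_ok
```

If a replica that missed a write has come back before commit, `validate` fails and the transaction should abort, since that replica could now serve a stale value to another transaction.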



Partitioning

Assume a set of replicas is partitioned.

[Figure: partitioned replicas, with labels C1 and C2]

Can each partition continue executing read-write transactions that update its set of replicas?

Majority Partition Approach
Only if the partition contains a (weighted) majority of the replicas.

Disconnected Operation
Each can continue. Requires reconciliation when the network recovers (discussed later).
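
The majority partition test is a one-line comparison. The sketch below assumes each replica has a known integer weight; the function name and the example weights are hypothetical.

```python
def has_weighted_majority(partition, weights):
    """True if the replicas in `partition` hold more than half the total weight."""
    total = sum(weights.values())
    return 2 * sum(weights[site] for site in partition) > total


# Example weights (an assumption): S1 counts double.
weights = {"S1": 2, "S2": 1, "S3": 1, "S4": 1}
can_continue = has_weighted_majority({"S1", "S2"}, weights)
```

Since two disjoint partitions cannot both hold more than half the weight, at most one side keeps accepting updates. With an even split and equal weights, neither side does.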


Site Recovery

On restart
– Site contacts a sibling replica
– Obtains & processes the [relevant portion of the] log of all (sub)transactions committed while the site was down, taking care in case a new transaction completes while it is processing the log
– Makes itself available again (i.e. responds to reads and writes)

• Many variations of this protocol, esp. to accommodate
– Dynamic creation, removal and relocation of replica sites


Multiple Reads

Read from n replicas in parallel

• Allows fastest one to respond

• Avoids taking time for reading another replica if first one is unavailable

• Use Voting: Detect/correct errors/sabotage by comparing results of multiple reads

• Guarantees getting the latest value even if not all replicas were updated (Quorum Consensus Protocol: requires that the write set contains a weighted majority of replicas)
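
Two of the multiple-read ideas can be sketched briefly. The sketch assumes unweighted replicas (every weight is 1) and that each stored value carries a version number that writers increment; both are assumptions for illustration. `quorums_valid` states the usual overlap conditions for quorum consensus, `quorum_read` picks the newest reply, and `vote` returns the strict-majority value.

```python
from collections import Counter
from itertools import combinations


def quorums_valid(n, r, w):
    """Read and write quorums overlap, and any two write quorums overlap."""
    return r + w > n and 2 * w > n


def quorum_read(replies):
    """replies: list of (version, value) pairs from a read quorum."""
    version, value = max(replies, key=lambda reply: reply[0])
    return value


def vote(values):
    """Return the value reported by a strict majority of the reads."""
    value, count = Counter(values).most_common(1)[0]
    if 2 * count <= len(values):
        raise ValueError("no majority among the replies")
    return value


# Three replicas; the last write reached only S1 and S2 (w = 2).
stored = {"S1": (2, "new"), "S2": (2, "new"), "S3": (1, "old")}
results = [quorum_read(list(pair)) for pair in combinations(stored.values(), 2)]
```

With n = 3, r = 2 and w = 2, every pair of replicas includes at least one that saw the last write, so every entry in `results` is the new value.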



ROWA Summary

ROWA Advantages
– Global Consistency & 1SR
– Can read from any replica

ROWA Disadvantages
– Every write requires writing all (available) replicas
– High overhead for every write
– (Can trade off write all for quorum read, though it is generally more expensive)


Eager Approaches


Eager Update Model

Eager Consistency Model

Immediate Consistency
– Write-Synchronous (ROWA): Coordinator's write operation is not completed until every replica is updated

Transactional Consistency
– Commit-Synchronous (EAGER): All replicas commit (using 2PC) as part of coordinator's transaction
– As-Needed-Propagation (LAZY): After transaction ends, updates are propagated to other replicas as needed

Eventual Consistency
– Eventual-Propagation (EVENTUAL): After transaction ends, updates are eventually propagated to other replicas


Eager Overview

Read & Write
Coordinator uses a single replica for all reads & writes of replication group data. [If replicas hold partial snapshots, it may need to read/write from multiple ones.] There are variants that just write to the master.

Transactional Total Consistency
At the end of each transaction, all replicas in the group have a consistent model of the committed data.
Updates made at a single replica are consistently propagated to the others at/by commit time.

Topology
Either Group-based or Master/Snapshot

Concurrency
All concurrency mechanisms can be used

When might the Eager model be used?


Eager Uses

Hot Standby
Upon failure, another replica can be switched in immediately, although transactions which updated the failed replica will need to be aborted

Disconnected Operation
If the network is partitioned and a partition contains a replica, operations can continue
• Read-only transactions will have access to an up-to-date version of the data
• Read/write operations can continue if reconciliation is supported

Serializability
Ensuring transactional consistency ensures that concurrent transactions which use different replicas are serializable.

Commits can be expensive, since they require 2PC involving every replica


Eager Master/Snapshot

Coordinator interacts with a single replica (e.g. nearest one) chosen from the replica group

Read & write from a single replica

During 2PC
– Coordinator requests PREPARE from that replica
– (Unless the chosen replica is the master) the chosen replica requests PREPARE from the master, propagating all updates along with the request
– The master requests PREPARE from all the other snapshot replicas, propagating all updates along with the request

What happens in a hierarchical master topology?

If the transaction uses data from two replication groups which have replicas on the same machine, how does that affect 2PC?
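
The PREPARE chain for the eager master/snapshot topology (chosen replica, then master, then the other snapshots) can be sketched as a small simulation. The `Replica` class, its `healthy` flag and the `eager_commit` function are hypothetical, and the sketch runs the chain sequentially in one process instead of over a network.

```python
class Replica:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy  # an unhealthy replica votes "no" on PREPARE
        self.committed = {}
        self.pending = None

    def prepare(self, updates):
        if not self.healthy:
            return False
        self.pending = dict(updates)  # updates travel with the PREPARE request
        return True

    def commit(self):
        self.committed.update(self.pending)
        self.pending = None

    def abort(self):
        self.pending = None


def eager_commit(chosen, master, snapshots, updates):
    """Return (committed?, names of replicas in PREPARE order)."""
    path = [chosen]
    if chosen is not master:
        path.append(master)                            # chosen replica -> master
    path += [s for s in snapshots if s is not chosen]  # master -> other snapshots
    order = [replica.name for replica in path]
    prepared = []
    for replica in path:
        if not replica.prepare(updates):
            for p in prepared:
                p.abort()
            return False, order
        prepared.append(replica)
    for replica in path:
        replica.commit()
    return True, order
```

For example, with master `M` and snapshots `A` and `B`, a transaction that used `A` prepares in the order A, M, B. A single "no" vote aborts the transaction at every replica that had already prepared.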


Eager Master/Snapshot Concurrency

Can use either locking or non-locking concurrency

Lock-Based
Data is locked at the primary copy. Either the coordinator or the chosen replica requests those locks.

Non-Lock-Based
Validation/checking is done at the master, which acts as a commit gateway


Eager Propagation Models

When and how are updates propagated
– from chosen replica to master
– from master to other snapshot replicas

• Transactional Batch
Send batched information about writes to replicas along with the PREPARE message

• Continuous on Write
Propagate each update when it occurs (don't wait for the end of the transaction)

• Immediate Confirmation
Propagate each update when it occurs, and wait for an ACK. Similar to ROWA, but propagation is managed by the replication group, not by the coordinator.


Propagation Capture & Apply

In what format are updates "captured" where they are made, and how are they applied by the other replicas?

Log-Based
– Operations (logical log format; operation may need to be modified for partial replicas)
– Deltas (physiological log format: "before" & "after" values of rows)

Procedural
– Suppose each transaction is implemented by a stored DB procedure. Just propagate the identity of the procedure and the parameters to it
– May require that replicas be complete
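
To contrast delta-based and procedural propagation, the sketch below models a table as a dict from primary key to a single value, with `None` meaning "row absent". The `transfer` procedure, the `PROCEDURES` registry and the helper names are all hypothetical.

```python
def transfer(table, src, dst, amount):
    """A stand-in for a stored procedure that implements one transaction."""
    table[src] -= amount
    table[dst] += amount


PROCEDURES = {"transfer": transfer}


def capture_deltas(table, proc_name, params):
    """Run the procedure where the update is made; record before/after values."""
    before = dict(table)
    PROCEDURES[proc_name](table, **params)
    return [
        {"key": k, "before": before.get(k), "after": table.get(k)}
        for k in sorted(set(before) | set(table))
        if before.get(k) != table.get(k)
    ]


def apply_deltas(table, deltas):
    """Delta-based apply: install the 'after' value of each changed row."""
    for delta in deltas:
        if delta["after"] is None:
            table.pop(delta["key"], None)
        else:
            table[delta["key"]] = delta["after"]


def apply_procedure(table, proc_name, params):
    """Procedural apply: re-run the same procedure with the same parameters."""
    PROCEDURES[proc_name](table, **params)
```

Re-running `transfer` on a replica that holds only part of the accounts fails on the missing row, which illustrates why procedural propagation may require complete replicas. A partial replica applying deltas could instead skip rows it does not hold.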


Eager Group

Coordinator interacts with a single replica (e.g. nearest one) chosen from the replica group

Read & write from a single replica

During 2PC
– Coordinator requests PREPARE from that replica
– That replica requests PREPARE from all the other snapshot replicas, propagating all updates along with the request

No primary copy, so
– Must use a non-locking protocol
– Validation/checking must be done at every replica


Eager Variants

All reads and writes are to master only
– Other replicas used for hot standbys, or to support disconnected operation
– Used to implement 2-safe backup

All writes are to master only
– Queries of data unchanged by current transaction can be directed to any replica
– How about querying data affected by the transaction's updates?
  • Must either be directed to master
  • Or coordinator maintains a client-side cache with all changes, and queries use cache + any replica


Eager Summary

EAGER Advantages
– Global Consistency & 1SR
– Need not immediately propagate each write
– Can read/write from any single replica (except for variants)

EAGER Disadvantages
– Every commit requires propagating to all (available) replicas
– High overhead for every commit


MultiMaster Model

MultiMaster (Combines Group & Master)

What kind of update model should be used among the master sites?

What kind of update model should be used among a master and its snapshots?


Reconciliation


Failure and Partitioning

When eager replication is used, the replicas all need to be able to communicate with one another.

Failure prevents communication.
– Site failure: a site crashes
– Network failure: a link or links fail, partitioning the network.

A live replica can't tell which of these is responsible for its inability to communicate.

It can generally assume that it is in a partition with just the replicas it can communicate with.


Partitioning

Assume a set of replicas is partitioned.

[Figure: partitioned replicas, with labels C1 and C2]

Can each partition continue executing read-write transactions that update its set of replicas?

Majority Partition Approach
Only if the partition contains a (weighted) majority of the replicas.

Disconnected Operation
Each can continue. Requires merging (a.k.a. reconciliation) when the network recovers.


Primary Copy Election

In a master/snapshot topology, each partition needs a primary copy. What if a partition doesn’t have one?

• Majority Partition Approach
Use weights to ensure that the majority partition contains the primary copy. [But what if the primary copy itself crashed?]

• Elect a Primary Copy
Elect a primary copy using an election protocol [similar to the 3PC protocol for electing a new coordinator]

What should be done in a multimaster environment?


Discovering Transaction Conflicts

As part of healing (i.e. recovery from) a network partition, conflicts may be discovered between committed transactions that were in disconnected partitions.

Modification (W/W) Conflicts
Transactions in different partitions modified the same data item (inconsistently). Can lead to lost updates.

R/W Conflicts
A transaction in one partition read data that was modified in the other partition. Can lead to non-serializable results; however, because the results in each partition are consistent (w.r.t. the partition), it is sometimes acceptable to ignore pure R/W conflicts.
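
Given the read and write sets of the transactions committed in each partition, both kinds of conflict can be found by intersecting sets. The sketch below assumes each transaction is described by a dict with `id`, `reads` and `writes` (a hypothetical format), and it flags any overlap of write sets without checking whether the written values actually differ.

```python
def find_conflicts(partition_a, partition_b):
    """Return (W/W conflicts, pure R/W conflicts) as lists of id pairs."""
    ww, rw = [], []
    for a in partition_a:
        for b in partition_b:
            if a["writes"] & b["writes"]:
                ww.append((a["id"], b["id"]))
            elif a["reads"] & b["writes"] or b["reads"] & a["writes"]:
                rw.append((a["id"], b["id"]))  # no W/W overlap: a pure R/W conflict
    return ww, rw


side_a = [
    {"id": "T1", "reads": {"x"}, "writes": {"x"}},
    {"id": "T2", "reads": {"y"}, "writes": {"z"}},
]
side_b = [
    {"id": "T3", "reads": {"x"}, "writes": {"x"}},
    {"id": "T4", "reads": {"z"}, "writes": {"w"}},
]
ww_conflicts, rw_conflicts = find_conflicts(side_a, side_b)
```

Here T1 and T3 both modified `x` (a W/W conflict), while T4 read `z`, which T2 modified in the other partition (a pure R/W conflict that a system might choose to ignore).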


Eager Reconciliation Approaches

Compensation
"Undo" conflicting committed transactions by executing compensating transactions.

Tentative Commit
When disconnected, transactions only commit tentatively. During reconciliation, these are either fully committed or aborted.

Conflict Resolution
Conflicting modifications are resolved by "merging" the changes.


Primary vs Group Reconciliation

Eager Primary Reconciliation
During healing, the elected primary provides a description (typically the log) of the transactions committed during the partition to the original primary

The original primary identifies and reconciles all conflicts; the results are (in the normal course of things) propagated to all the replicas

Eager Group Reconciliation
During healing, a replica provides its changes to some or all of the replicas it was partitioned from. Each replica identifies, reconciles and propagates changes independently. To maintain consistency, this implies

– Symmetric reconciliation: The results of reconciliation must be identical at each replica, independent of the order in which changes and propagated updates are received

– A replica must be able to ignore changes and propagated updates it has already processed
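
One simple way to meet both group reconciliation requirements is sketched below. It assumes every change carries a unique id and a totally ordered stamp, here a `(timestamp, site)` pair, and it resolves conflicts by keeping the change with the largest stamp. That resolution rule is just one possible choice, picked because it is order-independent; the class and field names are hypothetical.

```python
from itertools import permutations


class GroupReplica:
    def __init__(self):
        self.state = {}    # key -> (stamp, value)
        self.seen = set()  # ids of changes already processed

    def receive(self, change):
        change_id, key, stamp, value = change
        if change_id in self.seen:
            return  # already processed: ignore the duplicate
        self.seen.add(change_id)
        current = self.state.get(key)
        if current is None or stamp > current[0]:
            self.state[key] = (stamp, value)


changes = [
    ("c1", "x", (5, "S1"), "a"),
    ("c2", "x", (7, "S2"), "b"),
    ("c3", "y", (6, "S1"), "c"),
]
final_states = []
for order in permutations(changes + changes[:1]):  # includes a duplicate of c1
    replica = GroupReplica()
    for change in order:
        replica.receive(change)
    final_states.append(replica.state)
```

Every delivery order, including ones that deliver a change twice, ends in the same state, because taking the maximum stamp does not depend on arrival order and duplicates are filtered by id.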


Compensation

Every transaction that might need to be "undone" has a compensating transaction associated with it.

A committed transaction that has a conflict is "undone" by executing its compensating transaction (often followed by re-executing the original transaction)

This can lead to cascading compensation. Any committed transaction which read data written by the original transaction may need to have its compensating transaction run as well.


Motivating Tentative Commit

If we can delay all commitments during partition, we can simply abort conflicting transactions during healing.

However, commitment is necessary for reducing resource conflicts

Long-running transactions that don't commit
– If lock-based: Can block other transactions for long periods
– If validation-based: Are more likely to fail validation


Tentative Commitment

During network partition, commits are tentative
– A tentatively committed transaction is not yet durable and may subsequently be aborted.
– However, other transactions may see its updates. This can lead to cascaded aborts, so their commits must be tentative as well.

Reconciliation resolves tentative commits
– Transactions without conflicts will be fully committed
– Transactions with conflicts will be aborted

Usually uses primary reconciliation
– All resolution is done at the primary copy
– If group reconciliation is used, it must be symmetric, otherwise transactions will be committed at some sites and aborted at others

A system might also allow a transaction to explicitly commit tentatively (even without using replicas), and then be either committed or aborted at a later time (forcing cascaded aborts)
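
Resolving tentative commits with cascaded aborts can be sketched as one pass over the tentatively committed transactions in commit order. The dict format (`id`, `reads`, `writes`) and the function name are assumptions. The rule is deliberately conservative: once an aborted transaction has written an item, every later reader of that item is aborted too.

```python
def resolve_tentative(tentative, conflicting):
    """tentative: transactions in tentative-commit order.
    conflicting: ids of transactions found to conflict during reconciliation.
    Returns (ids fully committed, ids aborted)."""
    aborted = set(conflicting)
    tainted = set()  # items written by aborted transactions
    committed = []
    for txn in tentative:
        if txn["id"] in aborted or txn["reads"] & tainted:
            aborted.add(txn["id"])
            tainted |= txn["writes"]  # its own writes are now suspect as well
        else:
            committed.append(txn["id"])
    return committed, aborted


log = [
    {"id": "T1", "reads": set(), "writes": {"x"}},
    {"id": "T2", "reads": {"x"}, "writes": {"y"}},
    {"id": "T3", "reads": {"y"}, "writes": set()},
    {"id": "T4", "reads": {"z"}, "writes": {"w"}},
]
committed, aborted = resolve_tentative(log, conflicting={"T1"})
```

Aborting T1 cascades to T2 (which read `x`) and then to T3 (which read `y`), while the unrelated T4 is fully committed.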


Clients & Tentative Commitment

Explicit Abort
A client may be able to explicitly abort a transaction that is still only tentatively committed

Triggering & Notification
A client may be able to arrange to
– execute a procedure when a tentatively committed transaction is about to be committed (and which could actually abort the transaction)
– notify the user or (more generally) execute a procedure after a tentatively committed transaction is committed or aborted.


Identifying Modification Conflicts

A site may receive an unprocessed update (transaction log entry) which conflicts with its current state

Update Conflict
Old value of log entry <> Current record state

Insert Conflict
Primary key of record to be inserted is already in the table

Delete Conflict
Primary key of record to be updated or deleted is not present in the table
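
The three conflict tests translate directly into a small classifier. The sketch assumes a table is a dict from primary key to the current record, and that a log entry is a dict with `op`, `key` and, for updates, the `old` value it expects to replace. These shapes are hypothetical.

```python
def classify(table, entry):
    """Return the kind of modification conflict, or None if the entry applies cleanly."""
    op, key = entry["op"], entry["key"]
    if op == "insert":
        return "insert conflict" if key in table else None
    if key not in table:
        return "delete conflict"  # row to update or delete is missing
    if op == "update" and table[key] != entry["old"]:
        return "update conflict"  # old value differs from current record state
    return None


table = {1: "alice", 2: "bob"}
kinds = [
    classify(table, {"op": "insert", "key": 1, "new": "carol"}),
    classify(table, {"op": "update", "key": 3, "old": "dave", "new": "dan"}),
    classify(table, {"op": "update", "key": 2, "old": "bobby", "new": "rob"}),
    classify(table, {"op": "update", "key": 2, "old": "bob", "new": "rob"}),
    classify(table, {"op": "delete", "key": 9}),
]
```

The five entries are classified as an insert conflict, a delete conflict, an update conflict, no conflict, and a delete conflict, in that order.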


Resolution Techniques for Modification Conflicts

Latest Timestamp
If update timestamp > data timestamp, do update, else discard
Example: Address Change

Max
If new value > current data value, do update, else discard
Example: Max daily temperature

Additive
Data value := current data value + update's new value - update's old value
Example: Bank account balance

These are built-in to Oracle; others may be defined by DBA.

These conflict resolution techniques can be used as a prelude to either compensation or tentative commit resolution.
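
The three resolution rules are short enough to write out as plain functions. The names and signatures below are illustrative only and are not the interface of Oracle or any other product.

```python
def resolve_latest_timestamp(cur_value, cur_ts, new_value, new_ts):
    """Keep whichever value carries the later timestamp."""
    return (new_value, new_ts) if new_ts > cur_ts else (cur_value, cur_ts)


def resolve_max(cur_value, new_value):
    """Keep the larger value."""
    return new_value if new_value > cur_value else cur_value


def resolve_additive(cur_value, old_value, new_value):
    """Apply the update's change (new - old) to the current value."""
    return cur_value + new_value - old_value


# Bank balance of 100 at both sites before the partition.
# Site A deposits 50 (100 -> 150); site B withdraws 30 (100 -> 70).
at_a = resolve_additive(150, old_value=100, new_value=70)   # A applies B's update
at_b = resolve_additive(70, old_value=100, new_value=150)   # B applies A's update
```

Both sites converge on the same balance because the additive rule applies each update's change, not its final value, so the order in which the two updates arrive does not matter.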
