
Page 1: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 1

CS 347: Parallel and Distributed Data Management
Notes07: Data Replication

Hector Garcia-Molina

Page 2: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 2

Data Replication

• Reliable net, fail-stop nodes
• The model: a database divided into fragments (made up of items), with copies at node 1, node 2, node 3

[diagram: database → fragment → item, replicated across nodes 1–3]

Page 3: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 3

• Study one fragment, for the time being
• Data replication → higher availability

Page 4: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 4

Outline

• Basic Algorithms
• Improved (Higher Availability) Algorithms
• Multiple Fragments & Other Issues

Page 5: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 5

Basic Solution (for concurrency control)

• Treat each copy as an independent data item

[diagram: object X has copies X1, X2, X3, each managed by its own lock manager; transactions Ti, Tj, Tk access the copies]

Page 6: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 6

• Read(X):
  – get shared X1 lock
  – get shared X2 lock
  – get shared X3 lock
  – read one of X1, X2, X3
  – at end of transaction, release X1, X2, X3 locks

[diagram: the reader locks all three copies, each at its own lock manager, but reads only one]

Page 7: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 7

• Write(X):
  – get exclusive X1 lock
  – get exclusive X2 lock
  – get exclusive X3 lock
  – write new value into X1, X2, X3
  – at end of transaction, release X1, X2, X3 locks

[diagram: the writer locks and updates all three copies]
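The Read(X)/Write(X) procedures above make up the read-all/write-all (RAWA) discipline. Below is a minimal Python sketch of that discipline; the LockManager class and its acquire/release calls are illustrative stand-ins (they only print), not a real locking API.

```python
# A minimal sketch of the read-all/write-all (RAWA) discipline above.
class LockManager:
    def __init__(self, name):
        self.name = name
    def acquire(self, txn, mode):            # mode: "S" (shared) or "X" (exclusive)
        print(f"{txn}: {mode}-lock {self.name}")
    def release(self, txn):
        print(f"{txn}: unlock {self.name}")

copies = {"X1": 0, "X2": 0, "X3": 0}         # object X has copies X1, X2, X3
lockmgrs = {name: LockManager(name) for name in copies}

def read_x(txn):
    # Read(X): shared-lock every copy, then read any one of them.
    for name in copies:
        lockmgrs[name].acquire(txn, "S")
    return copies["X1"]

def write_x(txn, value):
    # Write(X): exclusive-lock every copy, then install the new value everywhere.
    for name in copies:
        lockmgrs[name].acquire(txn, "X")
    for name in copies:
        copies[name] = value

def end_transaction(txn):
    # 2PL: locks are held to the end of the transaction, then released.
    for name in copies:
        lockmgrs[name].release(txn)
```

Under the ROWA improvement a few slides later, read_x would shared-lock just one available copy, while write_x stays unchanged.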

Page 8: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 8

• Correctness OK
  – 2PL → serializability
  – 2PC → atomic transactions
• Problem: low availability

[diagram: if even one copy, say X2, is down, no transaction can access X at all]

Page 9: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 9

Basic Solution — Improvement

• Readers lock and access a single copy
• Writers lock all copies and update all copies
• Good availability for reads
• Poor availability for writes

[diagram: a reader holding a lock on any one copy conflicts with a writer, which must lock X1, X2, and X3]

Page 10: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 10

Reminder

• With basic solution:
  – use standard 2PL
  – use standard commit protocols

Page 11: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 11

Variation on Basic: Primary copy

• Select primary site (static for now)
• Readers lock and access the primary copy
• Writers lock the primary copy and update all copies

[diagram: X1* is the primary copy; a reader locks X1*, a writer write-locks X1* and updates X1, X2, X3]

Page 12: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 12

Commit Options for Primary Site Scheme
• Local Commit

[diagram: the writer locks, writes, and commits at the primary X1*; the update is then propagated to X2 and X3]

Page 13: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 13

Commit Options for Primary Site Scheme
• Local Commit

[diagram: the writer locks, writes, and commits at the primary X1*; the update is then propagated to X2 and X3]

Write(X):
• Get exclusive X1* lock
• Write new value into X1*
• Commit at primary; get sequence number
• Perform X2, X3 updates in sequence-number order
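A toy sketch of this local-commit scheme, under the assumption stated above: the primary commits locally and assigns a commit sequence number, and backups apply propagated updates strictly in that order. The Primary and Backup classes are illustrative stand-ins, not part of any real system.

```python
class Backup:
    def __init__(self):
        self.data = {}
        self.next_seq = 1
        self.buffered = {}                    # seq -> updates that arrived early

    def receive(self, seq, updates):
        self.buffered[seq] = updates
        # Apply updates strictly in commit-sequence-number order,
        # even if messages arrive out of order.
        while self.next_seq in self.buffered:
            self.data.update(self.buffered.pop(self.next_seq))
            self.next_seq += 1

class Primary:
    def __init__(self, backups):
        self.data = {}
        self.seq = 0
        self.backups = backups

    def commit(self, updates):
        # Lock, write, and commit happen locally; propagation may lag behind.
        self.data.update(updates)
        self.seq += 1
        for b in self.backups:
            b.receive(self.seq, updates)      # in practice sent asynchronously
        return self.seq

# Usage mirroring the example on the following slides ("X <- 1" means assign 1 to X):
backup = Backup()
primary = Primary([backup])
primary.commit({"Z": 3})          # T3: Z <- 3         gets sequence number 1
primary.commit({"X": 1, "Y": 1})  # T1: X <- 1; Y <- 1  gets sequence number 2
primary.commit({"Y": 2})          # T2: Y <- 2          gets sequence number 3
print(backup.data)                # {'Z': 3, 'X': 1, 'Y': 2}
```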

Page 14: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 14

Example t = 0

Node 1 (primary): X1: 0, Y1: 0, Z1: 0
Node 2:           X2: 0, Y2: 0, Z2: 0

T1: X ← 1; Y ← 1
T2: Y ← 2
T3: Z ← 3

Page 15: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 15

Example t = 1

Node 1 (primary): X1: 1, Y1: 1, Z1: 3
Node 2:           X2: 0, Y2: 0, Z2: 0

T1: X ← 1; Y ← 1   — active at node 1
T2: Y ← 2          — waiting for lock at node 1
T3: Z ← 3          — active at node 1

Page 16: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 16

Example t = 2

Node 1 (primary): X1: 1, Y1: 2, Z1: 3
Node 2:           X2: 0, Y2: 0, Z2: 0

T1: X ← 1; Y ← 1   — committed (sequence #2: X ← 1; Y ← 1)
T2: Y ← 2          — active at node 1
T3: Z ← 3          — committed (sequence #1: Z ← 3)

Page 17: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 17

Example t = 3

Node 1 (primary): X1: 1, Y1: 2, Z1: 3
Node 2:           X2: 1, Y2: 2, Z2: 3

T1: X ← 1; Y ← 1   — committed (#2: X ← 1; Y ← 1)
T2: Y ← 2          — committed (#3: Y ← 2)
T3: Z ← 3          — committed (#1: Z ← 3)

Updates applied at node 2 in sequence-number order: #1, #2, #3

Page 18: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 18

What good is RPWP-LC?

[diagram: updates go to the primary; under RPWP the backup copies cannot be read directly]

Page 19: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 19

What good is RPWP-LC?

[diagram: updates go to the primary; under RPWP the backup copies cannot be read directly]

Answer: Can read “out-of-date” backup copy (also useful with 1-safe backups... later)

Page 20: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 20

Commit Options for Primary Site Scheme
• Distributed Commit

[diagram: the writer locks and computes the new value at the primary X1*; the primary sends "prepare to write new value" to X2 and X3 and collects "ok" votes]

Page 21: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 21

Commit Options for Primary Site Scheme
• Distributed Commit

[diagram: as on the previous slide, plus the final step: after collecting the "ok" votes, the primary sends "commit (write)" to X2 and X3]

Page 22: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 22

Example

Node 1 (primary): X1: 0, Y1: 0, Z1: 0
Node 2:           X2: 0, Y2: 0, Z2: 0

T1: X ← 1; Y ← 1
T2: Y ← 2
T3: Z ← 3

Page 23: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 23

Basic Solution

• Read lock all; write lock all: RAWA
• Read lock one; write lock all: ROWA
• Read and write lock primary: RPWP
  – local commit: LC
  – distributed commit: DC

Page 24: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 24

Comparison

N = number of nodes with copies
P = probability that a node is operational

           Probability   Probability
           can read      can write
RAWA
ROWA
RPWP:LC
RPWP:DC

Page 25: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 25

Comparison

N = number of nodes with copies
P = probability that a node is operational

           Probability    Probability
           can read       can write
RAWA       P^N            P^N
ROWA       1 - (1-P)^N    P^N
RPWP:LC    P              P
RPWP:DC    P              P^N
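Not from the slides: a small Python script that recomputes the availability tables on the next three slides directly from the formulas above.

```python
def availability(N, P):
    rows = {
        "RAWA":    (P ** N,            P ** N),  # read all copies, write all copies
        "ROWA":    (1 - (1 - P) ** N,  P ** N),  # read any one copy, write all copies
        "RPWP:LC": (P,                 P),       # read and write at the primary only
        "RPWP:DC": (P,                 P ** N),  # read at primary, commit involves all copies
    }
    print(f"N={N}, P={P}")
    for scheme, (rd, wr) in rows.items():
        print(f"  {scheme:8s} read {rd:.4f}  write {wr:.4f}")

availability(N=5, P=0.99)     # matches the N=5,   P=0.99 table
availability(N=100, P=0.99)   # matches the N=100, P=0.99 table
availability(N=5, P=0.90)     # matches the N=5,   P=0.90 table
```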

Page 26: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 26

Comparison

N = 5 = number of nodes with copies
P = 0.99 = probability that a node is operational

           Read Prob.   Write Prob.
RAWA       0.9510       0.9510
ROWA       1.0000       0.9510
RPWP:LC    0.9900       0.9900
RPWP:DC    0.9900       0.9510

Page 27: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 27

Comparison

N = 100 = number of nodes with copies
P = 0.99 = probability that a node is operational

           Read Prob.   Write Prob.
RAWA       0.3660       0.3660
ROWA       1.0000       0.3660
RPWP:LC    0.9900       0.9900
RPWP:DC    0.9900       0.3660

Page 28: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 28

Comparison

N = 5 = number of nodes with copies
P = 0.90 = probability that a node is operational

           Read Prob.   Write Prob.
RAWA       0.5905       0.5905
ROWA       1.0000       0.5905
RPWP:LC    0.9000       0.9000
RPWP:DC    0.9000       0.5905

Page 29: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 29

Outline

• Basic Algorithms
• Improved (Higher Availability) Algorithms
  – Mobile Primary
  – Available Copies
• Multiple Fragments & Other Issues

Page 30: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 30

Mobile Primary (with RPWP)

(1) Elect new primary
(2) Ensure new primary has seen all previously committed transactions
(3) Resolve pending transactions
(4) Resume processing

[diagram: the primary fails; one of the backups takes over]

Page 31: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 31

(1) Elections

• Can be tricky...
• One idea:
  – Nodes have IDs
  – Largest ID wins

Page 32: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 32

(1) Elections: One scheme

(a) Broadcast “I want to be primary, ID=X”

(b) Wait long enough so anyone with larger ID can stop my takeover

(c) If I see “I want to be primary” message with smaller ID, kill that takeover

(d) After wait without seeing bigger ID, I am new primary!
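A single-threaded sketch of steps (a) through (d); broadcast() and inbox() are hypothetical messaging helpers (not a real API), and TAKEOVER_WAIT stands for the "wait long enough" period in step (b). The epoch number from the next slide could be attached to the final "I am primary" announcement.

```python
import time

MY_ID = 7
TAKEOVER_WAIT = 5.0

def try_become_primary(broadcast, inbox):
    broadcast({"type": "want-primary", "id": MY_ID})             # (a)
    deadline = time.time() + TAKEOVER_WAIT                        # (b)
    while True:
        remaining = deadline - time.time()
        if remaining <= 0:
            return True                                           # (d) no bigger ID objected
        msg = inbox(timeout=remaining)                            # returns None on timeout
        if msg is None:
            return True                                           # (d)
        if msg["type"] == "want-primary" and msg["id"] > MY_ID:
            return False                                          # a larger ID wins; give up
        if msg["type"] == "want-primary" and msg["id"] < MY_ID:
            broadcast({"type": "stop-takeover", "id": msg["id"]}) # (c) kill smaller-ID takeover
        if msg["type"] == "stop-takeover" and msg["id"] == MY_ID:
            return False                                          # someone stopped our takeover
```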

Page 33: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 33

(1) Elections: Epoch Number

It is useful to attach an epoch or version number to messages:

primary: n3, epoch# = 1
primary: n5, epoch# = 2
primary: n3, epoch# = 3
...

Page 34: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 34

(2) Ensure new primary has previously committed transactions

[diagram: the failed primary had committed T1, T2; the new primary needs to get and apply T1, T2 before taking over]

How can we make sure the new primary is up to date? More on this coming up...

Page 35: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 35

(3) Resolve pending transactions

[diagram: T3's outcome is in question; T3 is in the "W" state at the new primary and at the backup and must be resolved]

Page 36: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 36

Failed Nodes: Example

now:          P1 (primary) is up; P2 and P3 are down. P1 commits T1.
later:        P1, P2, P3 are all down.
later still:  P1 is still down; P2 (new primary) and P3 (backup) are up and commit T2 (unaware of T1!)

Page 37: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 37

RPWP:DC & 3PC take care of the problem!

• Option A
  – Failed node waits for:
    • commit info from an active node, or
    • all nodes are up and recovering
• Option B
  – Majority voting

Page 38: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 38

RPWP: DC & 3PC

primary → backup 1, backup 2 (over time):
(1) T ends work
(2) send data
(3) get acks
(4) prepare
(5) get acks
(6) commit

• may use 2PC...

Page 39: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 39

Node Recovery

• All transactions have a commit sequence number
• Active nodes save update values "as long as necessary"
• Recovering node asks the active primary for missed updates; applies them in order
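A minimal sketch of this recovery step. The primary-side call get_updates_since(seq) is a hypothetical helper, shown here as a stub that returns the missed updates as (commit sequence number, updates) pairs.

```python
class StubPrimary:
    def __init__(self, history):
        self.history = history                      # list of (seq, updates), in commit order
    def get_updates_since(self, seq):
        return [(s, u) for s, u in self.history if s > seq]

class RecoveringNode:
    def __init__(self):
        self.data = {}
        self.last_applied_seq = 0

    def recover(self, primary):
        # Ask the active primary for everything missed; apply in sequence order.
        for seq, updates in sorted(primary.get_updates_since(self.last_applied_seq)):
            self.data.update(updates)
            self.last_applied_seq = seq

node = RecoveringNode()
node.recover(StubPrimary([(1, {"Z": 3}), (2, {"X": 1, "Y": 1}), (3, {"Y": 2})]))
print(node.data)   # {'Z': 3, 'X': 1, 'Y': 2}
```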

Page 40: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 40

Example: Majority Commit

state:           C1          C2        C3
C (committed):   T1,T2,T3    T1,T2     T1,T3
P (prepared):    —           T3        T2
W ("W" state):   T4          T4        T4

Page 41: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 41

Example: Majority Commit

t1: C1 fails
t2: C2 becomes new primary
t3: C2 commits T1, T2, T3; aborts T4
t4: C2 resumes processing
t5: C2 commits T5, T6
t6: C1 recovers; asks C2 for latest state
t7: C2 sends committed and pending transactions; C2 involves C1 in any future transactions
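The slides do not spell out the resolution rule, but a rule consistent with this example is: the new primary commits any transaction that is committed or prepared at some available node, and aborts transactions still in the "W" state everywhere. A sketch under that assumption:

```python
def resolve_pending(available_states):
    """available_states: {node: {"C": set, "P": set, "W": set}} for reachable nodes."""
    all_txns = set()
    for st in available_states.values():
        all_txns |= st["C"] | st["P"] | st["W"]

    decisions = {}
    for t in sorted(all_txns):
        committed = any(t in st["C"] for st in available_states.values())
        prepared = any(t in st["P"] for st in available_states.values())
        decisions[t] = "commit" if (committed or prepared) else "abort"
    return decisions

# C1 has failed; C2 and C3 are available (states from the table two slides back).
states = {
    "C2": {"C": {"T1", "T2"}, "P": {"T3"}, "W": {"T4"}},
    "C3": {"C": {"T1", "T3"}, "P": {"T2"}, "W": {"T4"}},
}
print(resolve_pending(states))
# {'T1': 'commit', 'T2': 'commit', 'T3': 'commit', 'T4': 'abort'}
```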

Page 42: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 42

2-safe vs. 1-safe Backups

• Up to now we have covered 2-safe backups (RPWP:DC):

primary → backup 1, backup 2 (over time):
(1) T ends work
(2) send data
(3) get acks
(4) commit
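A toy sketch of the 2-safe rule: the primary refuses to commit until every backup has acknowledged the transaction's data. The Backup class and its send method are illustrative stand-ins, not a real API.

```python
class Backup:
    def __init__(self, up=True):
        self.up = up
        self.log = []
    def send(self, updates):
        if self.up:
            self.log.append(updates)
        return self.up                                # ack only if the data was received

def two_safe_commit(updates, backups):
    acks = [b.send(updates) for b in backups]         # (2) send data, (3) get acks
    if not all(acks):
        raise RuntimeError("missing ack: a 2-safe commit must block (or abort)")
    return "committed"                                # (4) commit only after all acks

print(two_safe_commit({"X": 1}, [Backup(), Backup()]))   # 'committed'
```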

Page 43: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 43

Guarantee

• After transaction T commits at the primary, any future primary will "see" T

[diagram: now, the primary has committed T1, T2, T3; later, the next primary (a former backup) has T1, T2, T3, T4]

Page 44: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 44

Performance Hit

• 3PC is very expensive
  – many messages
  – locks held longer (less concurrency)
  [Note: group commit may help]
• Can use 2PC
  – may have blocking
  – 2PC still expensive [up to 1 second reported]

Page 45: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 45

Alternative: 1-safe (RPWP:LC)

• Commit transactions unilaterally at the primary
• Send updates to backups as soon as possible

primary → backup 1, backup 2 (over time):
(1) T ends work
(2) T commits
(3) send data
(4) purge data
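For contrast, a toy sketch of the 1-safe rule above, again with an illustrative Backup class: the primary commits immediately and ships updates afterwards, so a crash between committing and shipping loses the transaction at the backups.

```python
class Backup:
    def __init__(self):
        self.log = []
    def send(self, updates):
        self.log.append(updates)

def one_safe_commit(updates, backups, crash_before_ship=False):
    committed = True                          # (2) commit unilaterally at the primary
    if not crash_before_ship:
        for b in backups:                     # (3) send data as soon as possible
            b.send(updates)
    return committed

b = Backup()
one_safe_commit({"X": 1}, [b])
one_safe_commit({"Y": 2}, [b], crash_before_ship=True)   # committed, but never shipped
print(b.log)   # [{'X': 1}]  -- the Y update is lost if a backup must take over now
```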

Page 46: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 46

Problem: Lost Transactions

[diagram: now, the primary has committed T1, T2, T3 but the backups have received only T1; later, the primary fails and the next primary continues from T1, committing T4 and T5, so its history is T1, T4, T5. T2 and T3 are lost.]

Page 47: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 47

Claim

• Lost transaction problem tolerable
  – failures rare
  – only a "few" transactions lost

Page 48: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 48

Primary Recovery with 1-safe

• When the failed primary recovers, it needs to "compensate" for missed transactions

[diagram: now, the primary has committed T1, T2, T3 while the backups are behind; later, the primary is down and the next primary's history is T1, T4, T5. When the old primary recovers, it compensates: its history becomes T1, T2, T3, T3⁻¹, T2⁻¹, T4, T5 (T3⁻¹ and T2⁻¹ undo the locally committed but globally lost T3 and T2).]

Page 49: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 49

Log Shipping

• “Log shipping:” propagate updates to backup

• Backup replays the log
• How to replay the log efficiently?
  – e.g., elevator disk sweeps
  – e.g., avoid overwrites

[diagram: the primary ships its log to the backup]
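A minimal sketch of one way to "avoid overwrites" during replay: within a shipped batch, keep only the last write to each item, then install the survivors in item order (a stand-in for an elevator-style disk sweep). The record format (seq, item, value) is an assumption made for illustration.

```python
def replay_batch(log_records, database):
    """log_records: list of (seq, item, value) in commit-sequence order."""
    last_write = {}
    for seq, item, value in log_records:
        last_write[item] = value          # later records overwrite earlier ones in memory
    for item in sorted(last_write):       # apply in item order, one sweep over the data
        database[item] = last_write[item]

db = {}
replay_batch([(1, "Z", 3), (2, "X", 1), (2, "Y", 1), (3, "Y", 2)], db)
print(db)    # {'X': 1, 'Y': 2, 'Z': 3}  -- Y is written once, not twice
```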

Page 50: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 50

So Far in Data Replication

• RAWA
• ROWA
• Primary copy
  – static
    • local commit
    • distributed commit
  – mobile primary
    • 2-safe (distributed commit): blocking or non-blocking
    • 1-safe (local commit)

Page 51: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 51

Outline

• Basic Algorithms
• Improved (Higher Availability) Algorithms
  – Mobile Primary
  – Available Copies
• Multiple Fragments & Other Issues

Page 52: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 52

PC-lock available copies

• Transactions write-lock at all available copies
• Transactions read-lock at any available copy
• Primary site (static) manages U – the set of available copies

[diagram: X1* (primary), X2, X3 are available copies; X4 is down]

Page 53: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 53

Update Transaction

(1) Get U from primary
(2) Get write locks from U nodes
(3) Commit at U nodes

[diagram: C0 (primary), C1 (backup), C2 (backup, not in U); transaction T3 obtains U = {C0, C1} from the primary, then sends its updates and runs 2PC at the nodes in U]

Page 54: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 54

A potential problem - example

Now: U = {C0, C1}

[diagram: C0 (primary), C1 (backup), C2 (backup, recovering); transaction T3 is running with U = {C0, C1} while C2 tells the primary "I am recovering"]

Page 55: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 55

A potential problem - example

Later: U = {C0, C1, C2}

[diagram: the primary tells the recovering C2 "You missed T0, T1, T2" and adds it to U; meanwhile T3, which started with U = {C0, C1}, sends its updates only to C0 and C1, so C2 never sees T3's updates]

Page 56: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 56

Solution:

• Initially, transaction T gets a copy U′ of U from the primary (or uses a cached value)
• At commit of T, check U′ against the current U at the primary (if different, abort T)

Page 57: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 57

Solution Continued

• When CX recovers:
  – request missed and pending transactions from the primary (primary updates U)
  – set write locks for pending transactions
• Primary polls nodes to detect failures (and updates U)
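A minimal sketch of this commit-time check, with a toy Primary object standing in for the real coordinator: the transaction caches U′ when it starts and compares it against the primary's current U at commit time, aborting if the set of available copies changed.

```python
class Primary:
    def __init__(self, available):
        self.U = set(available)

class Transaction:
    def __init__(self, primary):
        self.primary = primary
        self.U_prime = set(primary.U)     # copy of U taken when the transaction starts

    def commit(self):
        if self.U_prime != self.primary.U:
            return "abort"                # U changed (a node failed or recovered)
        # ... write-lock and commit at every node in self.U_prime (2PC) ...
        return "commit"

p = Primary({"C0", "C1"})
t3 = Transaction(p)                       # T3 starts with U' = {C0, C1}
p.U.add("C2")                             # C2 recovers; primary updates U before T3 commits
print(t3.commit())                        # 'abort' -- T3 would have missed C2
```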

Page 58: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 58

Example Revisited

[diagram: C0 (primary), C1 (backup), C2 (backup, recovering). C2 tells the primary "I am recovering"; the primary replies "You missed T0, T1, T2" and changes U from {C0, C1} to {C0, C1, C2}. When T3, which started with U = {C0, C1}, sends its prepare messages, the commit-time check fails and T3 is rejected (aborted).]

Page 59: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 59

Available Copies — No Primary

• Let all nodes have a copy of U (not just the primary)
• To modify U, run a special atomic transaction at all available sites (use a commit protocol)
  – E.g.: U1 = {C1, C2} → U2 = {C1, C2, C3}: only C1, C2 participate in this transaction
  – E.g.: U2 = {C1, C2, C3} → U3 = {C1, C2}: only C1, C2 participate in this transaction

Page 60: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 60

• Details are tricky...
• What if the commit of a U-change blocks?

Page 61: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 61

Node Recovery (no primary)

• Get missed updates from any active node

• No unique sequence of transactions
• If all nodes fail, wait for:
  – all to recover, or
  – a majority to recover

Page 62: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 62

Example

How much information (update values) must be remembered? By whom?

[diagram: a recovering node has Committed: A, B; one active node has Committed: A, B, C, D, E, F and Pending: G; another active node has Committed: A, C, B, E, D and Pending: F, G, H]

Page 63: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 63

Correctness with replicated data

S1: r1[X1] r2[X2] w1[X1] w2[X2]
Is this schedule serializable?

[X has two copies, X1 and X2]

Page 64: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 64

S1: r1[X1] r2[X2] w1[X1] w2[X2]
Is this schedule serializable?

One idea: require transactions to update all copies
S1: r1[X1] r2[X2] w1[X1] w2[X2] w1[X2] w2[X1]

Page 65: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 65

S1: r1[X1] r2[X2] w1[X1] w2[X2]
Is this schedule serializable?

One idea: require transactions to update all copies
S1: r1[X1] r2[X2] w1[X1] w2[X2] w1[X2] w2[X1]
(not a good idea for high-availability algorithms)

Another idea: build copy semantics into the notion of serializability

Page 66: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 66

One copy serializable (1SR)

A schedule S on replicated data is 1SR if it is equivalent to a serial history of the same transactions on a one-copy database

Page 67: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 67

To check 1SR

• Take the schedule S
• Treat ri[Xj] as ri[X] and wi[Xj] as wi[X] (Xj is a copy of X)
• Compute the precedence graph P(S)
• If P(S) is acyclic, S is 1SR
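A sketch of this check as a conflict-graph test: map each operation on a copy to the logical item, add an edge Ti → Tj whenever two operations of different transactions conflict (same item, at least one write, Ti's operation first), and report 1SR if the graph is acyclic. The schedule encoding (txn, op, copy) is an assumption made here for illustration.

```python
from itertools import combinations

def is_1sr(schedule):
    """schedule: list of (txn, 'r' or 'w', copy) in execution order, e.g. (1, 'r', 'X1')."""
    item = lambda copy: copy.rstrip("0123456789")        # X1, X2 -> logical item X
    edges = set()
    for i, j in combinations(range(len(schedule)), 2):   # operation i precedes operation j
        t1, op1, c1 = schedule[i]
        t2, op2, c2 = schedule[j]
        if t1 != t2 and item(c1) == item(c2) and "w" in (op1, op2):
            edges.add((t1, t2))                           # conflict => precedence edge t1 -> t2

    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)

    def has_cycle(node, visiting, done):                  # depth-first cycle check on P(S)
        visiting.add(node)
        for nxt in graph.get(node, ()):
            if nxt in visiting or (nxt not in done and has_cycle(nxt, visiting, done)):
                return True
        visiting.discard(node)
        done.add(node)
        return False

    return not any(has_cycle(n, set(), set()) for n in graph)

# S1 from the next slide: r1[X1] r2[X2] w1[X1] w2[X2]  -- not 1SR (cycle T1 <-> T2)
print(is_1sr([(1, "r", "X1"), (2, "r", "X2"), (1, "w", "X1"), (2, "w", "X2")]))  # False
# S2 from a later slide: r1[X1] w1[X1] w1[X2] r2[X1] w2[X1] w2[X2]  -- 1SR
print(is_1sr([(1, "r", "X1"), (1, "w", "X1"), (1, "w", "X2"),
              (2, "r", "X1"), (2, "w", "X1"), (2, "w", "X2")]))                  # True
```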

Page 68: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 68

Example

S1:  r1[X1] r2[X2] w1[X1] w2[X2]
S1′: r1[X]  r2[X]  w1[X]  w2[X]

P(S1) has edges T1 → T2 and T2 → T1 (a cycle), so S1 is not 1SR!

Page 69: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 69

Second example

S2:  r1[X1] w1[X1] w1[X2]
            r2[X1] w2[X1] w2[X2]

S2′: r1[X] w1[X] w1[X]
           r2[X] w2[X] w2[X]

P(S2): T1 → T2

S2 is 1SR

Page 70: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 70

Second example

S2:  r1[X1] w1[X1] w1[X2]
            r2[X1] w2[X1] w2[X2]

S2′: r1[X] w1[X] w1[X]
           r2[X] w2[X] w2[X]

• Equivalent serial schedule
SS: r1[X] w1[X]
          r2[X] w2[X]

Page 71: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 71

Question: Is this a "good" schedule?
S3: r1[X1] w1[X1] w1[X2]
           r2[X1] w2[X1]

Page 72: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 72

Question: Is this a "good" schedule?
S3: r1[X1] w1[X1] w1[X2]
           r2[X1] w2[X1]

S3: r1[X] w1[X] w1[X]
          r2[X] w2[X]

S3: r1[X] w1[X]
          r2[X] w2[X]

To be a valid schedule, we need a precedence edge between w1[X] and w2[X].

Page 73: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 73

Question: Is this a "good" schedule?

S3: r1[X1] w1[X1] w1[X2]
           r2[X1] w2[X1]

We need to know how w2[X2] is resolved:

OK:     w2[X2] is performed after w1[X2], e.g.
        S3: r1[X1] w1[X1] w1[X2]
                   r2[X1] w2[X1] w2[X2]

Not OK: w2[X2] is performed before w1[X2] (the copies of X would then disagree, and no equivalent serial one-copy order exists)

Page 74: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 74

Question: Is this a "good" schedule?
S3: r1[X1] w1[X1] w1[X2]
           r2[X1] w2[X1]

Bottom line: when w2[X2] is missing because X2 is down, assume that X2's recovery will perform w2[X2] in the correct order.

S3: r1[X] w1[X]
          r2[X] w2[X]

Page 75: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 75

Another example:

• S3 continues with T3:

S4: r1[X1] w1[X1] w1[X2] r3[X2] w3[X1]

r2[X1] w2[X1] w3[X2]

Page 76: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 76

Another example:
S4: r1[X1] w1[X1] w1[X2] r3[X2] w3[X1]
           r2[X1] w2[X1]              w3[X2]

S4: r1[X] w1[X] w1[X] r3[X] w3[X]
          r2[X] w2[X]           w3[X]

S4: r1[X] w1[X] r3[X] w3[X]
          r2[X] w2[X]

Seems OK, but where do we place the missing w2[X2]?

Page 77: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 77

Another example:

S4: r1[X1] w1[X1] w1[X2] r3[X2] w3[X1]
           r2[X1] w2[X1]              w3[X2]

w2[X2] must be before w1[X2] or after r3[X2] (else T3 would read a different value)

Page 78: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 78

Another example:

S4: r1[X1] w1[X1] w1[X2] r3[X2] w3[X1]
           r2[X1] w2[X1]              w3[X2]

w2[X2] must be before w1[X2] or after r3[X2] (else T3 would read a different value)

One option: w2[X2] is performed by X2's recovery (after r3[X2])

Page 79: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 79

Outline

• Basic Algorithms
• Improved (Higher Availability) Algorithms
  – Mobile Primary
  – Available Copies (and 1SR)
• Multiple Fragments & Other Issues

Page 80: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 80

Multiple fragments

[diagram: two fragments, Fragment 1 and Fragment 2, each replicated at several of Nodes 1–4]

• A transaction spanning multiple fragments must:
  – follow the locking rules for each fragment
  – involve a "majority" in each fragment at commit

Page 81: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 81

• Careful with update transactions that read but do not modify a fragment

Example:

[diagram: fragment F1 has copies at C1 and C2; fragment F2 has copies at C3 and C4. T1 read-locks F1 at C1; T2 read-locks F2 at C3.]

Page 82: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 82

C1, C3 fail…

[diagram: with C1 and C3 down, T2 writes F1 at C2 and commits F1 at C2; T1 writes F2 at C4 and commits F2 at C4]

Page 83: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 83

Equivalent history: r1[F1] r2[F2] w1[F2] w2[F1]
not serializable!

Solution: commit at read sites too

Page 84: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 84

C1, C3 fail…

[diagram: the same scenario as before, but with commit at read sites the transaction cannot commit at F1 because its U = {C1} is out of date...]

Page 85: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 85

Read-Only Transactions

• Can provide "weaker correctness"
• Does not impact values stored in the DB

[diagram: C1 (primary) and C2 (backup) each hold A: 0 and B: 0]

T1: A ← 3
T2: B ← 5

Page 86: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 86

Later on:

[diagram: at C1 (primary), A: 0 → 3 and B: 0 → 5; at C2 (backup), A: 0 → 3 but B is still 0, since B ← 5 has not yet been propagated]

• R1, a read transaction at the primary, sees the current state
• R2, a read transaction at the backup, sees an "old" but "valid" state

Page 87: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 87

States Are Equivalent

• States at Primary:
  – no transactions
  – T1
  – T1, T2
  – T1, T2, T3
• States at Backup:
  – no transactions
  – T1
  – T1, T2
  – T1, T2, T3

Page 88: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 88

States Are Equivalent

• States at Primary:
  – no transactions
  – T1
  – T1, T2
  – T1, T2, T3
• States at Backup:
  – no transactions
  – T1
  – T1, T2
  – T1, T2, T3

At any given point in time, the backup may be behind the primary, but it passes through the same sequence of states.

Page 89: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 89

Schedule is Serializable

• S1 = T1 R1 T2 T3 R2 T4 ...(R1)...(R2)...

Page 90: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 90

Example 2

• A, B have different primaries now
• 1-safe protocol used

[diagram: C1 is the primary for A, C2 is the primary for B; both sites hold A: 0 and B: 0]

T1: A ← 3
T2: B ← 5

Page 91: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 91

[diagram: T1 commits A ← 3 at C1 (A: 0 → 3 there); T2 commits B ← 5 at C2 (B: 0 → 5 there); neither update has reached the other site yet]

Page 92: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 92

[diagram: as on the previous slide, A ← 3 is committed only at C1 and B ← 5 only at C2]

At this time:
• Q1 reads A, B at C1; it sees T1 → Q1 → T2
• Q2 reads A, B at C2; it sees T2 → Q2 → T1

Page 93: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 93

• Schedule of update transactions is OK: T1 → T2 or T2 → T1
• Each read-only transaction sees an OK schedule: T1 → Q1 → T2 or T2 → Q2 → T1
• But there is NO single complete schedule that is "OK"...

Eventually:

[diagram: both C1 and C2 end up with A: 3 and B: 5]

Page 94: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 94

• In many cases, such a scenario is OK
• Called weak serializability:
  – the update schedule is serializable
  – read-only transactions see committed data

Page 95: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 95

Data Replication

• RAWA, ROWA
• Primary copy
  – static [local commit or distributed commit]
  – mobile primary [2-safe (2PC or 3PC) or 1-safe]
• Available copies [with or without primary]
• Correctness (1SR)
• Multiple Fragments
• Read-Only Transactions

Page 96: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

CS 347 Notes07 96

Issues

• To centralize control or not?
• How much availability?
• "Weak" reads OK?

Page 97: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

Paxos

• Paxos is a consensus-based replicated data management algorithm; ZooKeeper, for example, uses a closely related protocol (ZAB).

CS 347 Notes07 97

Page 98: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

Paxos with 2 Phases

CS 347 Notes07 98

Page 99: CS 347:  Parallel and Distributed Data Management Notes07: Data Replication

Paxos with 2 Phases

CS 347 Notes07 99

[diagram: Paxos message flow driven by a primary (leader); compare with 3-phase commit]

• uses majority voting
• can handle partitions