Hector Garcia-Molina

CS 347 Notes06 1

CS 347: Parallel and Distributed

Data ManagementNotes06: Reliable

DistributedDatabase Management

CS 347 Notes06 2

Reliable distributed database management

• Reliability• Failure models• Scenarios

CS 347 Notes06 3

Reliability

• Correctness– Serializability– Atomicity– Persistence

• Availability

CS 347 Notes06 4

Types of failures

• Processor failures– Halt, delay, restart, bezerk, ...

• Storage failures– Volatile, non-volatile, atomic write,

transient errors, spontaneous failures

• Network failures– Lost message, out-of-order messages,

partitions, bounded delay

CS 347 Notes06 5

• Malevolent failures• Multiple failures• Detectable failures

More Types of failures

CS 347 Notes06 6

Failure models

• Cannot protect against everything• Unlikely failures (e.g., flooding in the

Sahara)

• Expensive to protect failures (e.g., earthquake)

• Failures we know how to protect against (e.g., message sequence numbers; stable storage)

CS 347 Notes06 7

Failure model:

DesiredEvents

Expected

Undesired

Unexpected

CS 347 Notes06 8

Node models

(1) Fail-stop nodestime

perfect halted recoveryperfect

Volatilememory lost

Stablestorage ok

CS 347 Notes06 9

(2) Byzantine nodesA Perfect

Perfect Arbitrary failure Recovery

At any given time, at most some fraction f of

nodesfailed (typically f < 1/2 or f < 1/3)

Node models

CS 347 Notes06 10

Network models

(1) Reliable network- in order messages- no spontaneous messages- timeout TD

I.e., no lost messages, except for node failures

If no ack in TD sec.Destination down

(not paused)

CS 347 Notes06 11

Variation of reliable net:

• Persistent messages– If destination down, net will

eventually deliver message– Simplifies node recovery, but leads to

inefficiencies (hides too much)– Not considered here

CS 347 Notes06 12

Network models

(2) Partitionable network- In order messages- No spontaneous messages

- no timeout; nodes can have different view of failures

CS 347 Notes06 13

Scenarios

• Reliable network– Fail-stop nodes

– No data replication (1)– Data replication (2)

• Partitionable network– Fail-stop nodes (3)

CS 347 Notes06 14

No Data Replication

• Reliable network, fail-stop nodes

• Basic idea: node P controls X

P Item Xnet

CS 347 Notes06 15

No Data Replication

• Reliable network, fail-stop nodes

• Basic idea: node P controls X

P Item Xnet

- Single control point simplifies concurrency control, recovery

- Not an availability hit: if P down, X unavailable too!

CS 347 Notes06 16

“P controls X” means- P does concurrency control for

X- P does recovery for X

CS 347 Notes06 17

Say transaction T wants to access X:

PT is process that represents T at this node

Local DMBS

Lockmgr

CS 347 Notes06 18

Process models

(A) Cohorts Spawn process

CommunicationData Access

LocalDMBS

CS 347 Notes06 19

Process models

(B) Transaction servers (manager)USER

LocalDMBS

TransMGR

LocalDMBS

TransMGR

LocalDMBS

TransMGR

CS 347 Notes06 20

• Cohorts: application code responsible for remote access

• Transaction manager: “system” handles distribution, remote access

CS 347 Notes06 21

Distributed commit problem

Action:a1,a2

Action:a3

Action:a4,a5

Transaction T

CS 347 Notes06 22

Centralized two-phase commit

Coordinator Participant

go exec*

nok abort*

ok* commit *

commit -

exec ok

exec nok

abort -

CS 347 Notes06 23

• Notation: Incoming message Outgoing message

( * = everyone)• When participant enters “W” state:

– it must have acquired all resources– it can only abort or commit if so

instructedby a coordinator

• Coordinator only enters “C” state if all participants are in “W”, i.e., it is certain that all will eventually commit

CS 347 Notes06 24

Handling node failures

• Coordinator and participant logs are used to reconstruct state before failure

CS 347 Notes06 25

-> Example: after participant fails: Log:

undo/redoinfo

info... ...

T1“W”state

CS 347 Notes06 26

At recovery:

• T1 is in “W” state• Obtain X,Y write locks (no read locks!)

• Wait for message from coordinator(or ask coordinator for outcome)

CS 347 Notes06 27

Other examples:

• No “W” record on log abort T1

• See “C” record on log finish T1

CS 347 Notes06 28

• Add timeouts to cope with messages lost during crashes

• Add finish (“F”) state for coordinator – all done, can forget outcome

CS 347 Notes06 29

Coordinator

_go_exec*

c-ok*-

nokabort*

ok*commit*

t=timeout

cping=coord. ping

CS 347 Notes06 30

Coordinator

_go_exec*

c-ok*-

ping- _t_

pingabort

pingcommit

_t_cping

_t_abort*

nokabort*

ok*commit*

t=timeout

cping=coord. ping

CS 347 Notes06 31

Participant

execok

execnok

commitc-ok

abortnok

CS 347 Notes06 32

Participant

execok

execnok

commitc-ok

abortnok

cping , _t - ping

cpingdone

“done” message counts as eitherc-ok or n-ok for coordinator

CS 347 Notes06 33

Participant

execok

equivalent to finish state

execnok

commitc-ok

abortnok

cping , _t - ping

cpingdone

“done” message counts as eitherc-ok or n-ok for coordinator

CS 347 Notes06 34

Presumed abort protocol

• “F” and “A” states combined in coordinator

• Saves persistent space (forget quicker)

• Presumed commit is analogous

CS 347 Notes06 35

Presumed abort-coordinator (participant unchanged)

_go_exec*

c-ok*-

pingabort

pingcommit _t_

nok, tabort*

ok*commit*

CS 347 Notes06 36

Remember: all state transitions must be loggedExample: tracking who has sent “OK” msgsLog at coord:

• After failure, we know still waiting for OK from node b

• Alternative: do not log receipts of “OK”s abort T1

T1start

part={a,b}

from a RCV

... ......

CS 347 Notes06 37

Example: logging receipt of C-OK messages

• If logged, can recover state• If not logged:

– resend commit *– participants reply “done” if duplicate

CS 347 Notes06 38

2PC is blocking

Sample scenario:Coord P2

W P1 P3

CS 347 Notes06 39

Case I:P1 “W”; coordinator sent commits

P1 “C”Case II:

P1 NOK; P1 A

P2, P3, P4 (surviving participants)

cannot safely abort or commit transaction

CS 347 Notes06 40

Variants of 2PC

• Linear

• Hierarchical

ok ok ok

commit commit commit

CS 347 Notes06 41

• Distributed

– Nodes broadcast all messages– Every node knows when to commit

Variants of 2PC

CS 347 Notes06 42

3PC = non-blocking commit

• Assume: failed node is down forever

• Key idea: before committing, coordinator tells participants everyone is ok

CS 347 Notes06 43

Coordinator Participant

_go_exec*

ack**commit*

nokabort*

ok*pre *

_exec_ok

commit-

execnok

preack

abort-

** means allnon-failed nodes

CS 347 Notes06 44

3PC recovery rules: termination protocol

• Survivors try to complete transaction, based on their current states

• Goal:– If dead nodes committed or aborted,

then survivors should not contradict!– Else, survivors can do as they please...

CS 347 Notes06 45

• Let {S1,S2,…Sn} be survivor sites• If one or more Si = COMMIT COMMIT T• If one or more Si = ABORT ABORT T• If one or more Si = PREPARE

T could not have aborted COMMIT T• If no Si = PREPARE (or COMMIT)

T could not have committed ABORT T

CS 347 Notes06 46

Example:

CS 347 Notes06 47

Example:

CS 347 Notes06 48

Example:

CS 347 Notes06 49

Example:

CS 347 Notes06 50

Once survivors make decision, they must select new coordinator to continue 3PC

P P C C

W P C C W P P C

Decide tocommit

CS 347 Notes06 51

Note: when survivors continue 3PC, failed nodes do not

countE.g., “ack**” when ack’s received

from all non-failed nodesP

ack**commit *

CS 347 Notes06 52

Note: 3PC unsafe with partitions!

abort commit

CS 347 Notes06 53

Node recovery:

• After node N recovers from failure:– do not participate in termination protocol

(why?)

CS 347 Notes06 54

Node recovery:

(why?)

later on...

CS 347 Notes06 55

Node recovery:

(why?)– wait until it hears commit or abort

decision from operational node

ping“C”(or “A”)

CS 347 Notes06 56

• Waiting for commit/abort decision from other node is ok, unless all fail:

? ? ? ? ?

CS 347 Notes06 57

Two options for all-failed problem:(A) Wait for all to recover(B) Majority commit

CS 347 Notes06 58

Option A

• Recovering node waits for either:(1) commit/abort outcome for T from other node(2) all nodes that participated in T are up and recovering:

then 3PC can continue(no danger that a failed node could haveaborted or committed)

Option B

• Want a “gang” of failed but recovered nodes to be able to terminate a transaction even when rest are failed...

CS 347 Notes06 59

CS 347 Notes06 60

Option B

• Nodes are assigned votes, total is VMajority is V+1 e.g., V=5

2 Maj=3 V=6 Maj=4

• To make state transitions, coordinator requires messages from nodes with a majority of votes

CS 347 Notes06 61

Example(1): Coord P2 W P1 P3 W

• Nodes P2, P3, P4 enter “W” state and fail• When they recover, coord. and P1 are

down• Each node has 1 vote, V=5, Maj=3

CS 347 Notes06 62

Example(1): Coord P2 W P1 P3 W

• Nodes P2, P3, P4 enter “W” state and fail• When they recover, coord. and P1 are

down• Each node has 1 vote, V=5, Maj=3

• Since P2, P3, P4 have majority, they know coord. could not have gone to “P” without at least one of their votes

• Therefore, T can be aborted!

CS 347 Notes06 63

Example(2): Coord P3 ”P” P1 P4 ”W” P2

• Each node has 1 vote; V=5, Maj=3• Nodes fail after entering states shown;

P3, P4 recover

CS 347 Notes06 64

Example(2): Coord P3 ”P” P1 P4 ”W” P2

• Each node has 1 vote; V=5, Maj=3• Nodes fail after entering states shown;

P3, P4 recover

• Termination rule says we can try to commit, but P3, P4 do not have enough votes, so they do nothing!

• P3, P4 doing nothing is good because later on, coord. P1, P2 could abort T

CS 347 Notes06 65

Summary: Majority rule ensures that any

decision (e.g., Preparing, committing) will be

knownto any future group making a decision

decision # 2

decision # 1

CS 347 Notes06 66

Important Detail for Majority 3PC

• Example:

CS 347 Notes06 67

Important Detail for Majority 3PC

• Example:

CS 347 Notes06 68

Need “Prepare To Abort” State

_go_exec*

ackC**commit*

nokpreA*ok*

preC *

_exec_ok

commit-

execnok

preCackC

preAackA

coordinator participant

ackA**abort*

abort-

** means participantswith majority votes preA

CS 347 Notes06 69

Example Revisited

CS 347 Notes06 70

Example Revisited

OK to commit sincetransaction could not have aborted

CS 347 Notes06 71

Example Revisited -II

CS 347 Notes06 72

Example Revisited -II

No decision:Transaction could have aborted orcould have committed... Block!

CS 347 Notes06 73

3PC with Majority Voting

• If survivors have majority andall states W try to abort

• If survivors have majority andstates in {W, PC, C} try to commit

• If survivors have majority andstates in {W, PA, A} try to abort

• Otherwise block

CS 347 Notes06 74

Comparison

Option A: only nodes that have not failed participate in 3PC

• Any size group can terminate(even one node)

• If all nodes fail, must wait for all to recover

CS 347 Notes06 75

Option B: Majority voting• A group of failed+recovering

nodes can terminate transaction (with majority of votes)

• Need majority for every commit blocking protocol!

Comparison

CS 347 Notes06 76

Reminder

• When node recovers, it uses its log in a normal fashion to determine status of transactions:– if commit found in log redo if

necessary– if abort found (or no “W” record)

rollback if necessary

CS 347 Notes06 77

– if in “W” state (or “P” state): • reclaim locks held by T before crash• try to terminate T (with other nodes)

– after locks claimed for “in doubt” transactions, start normal processing

Reminder - Continued

CS 347 Notes06 78

Final note

• If nodes use 2P locking, global deadlocks possible

Local WFG: Local WFG: no cycles no cycles!

CS 347 Notes06 79

• Need to “combine” WFGs to discover global deadlock

T2e.g., central

detection node

CS 347 Notes06 80

Problem: False deadlocks

infosent

at centralsite:

CS 347 Notes06 81

Problem: False deadlocks

infosent

at centralsite:

Note that T2 is not 2PL;it releases lock,then asks for another lock

Exercise

• Assume all waits are due to transaction lock requests

• Assume transactions well formed and 2PL; scheduler legal

• Show that false deadlocks are not possible

CS 347 Notes06 82

CS 347 Notes06 83

• Many deadlock solutions– Distributed vs. centralized– Detection vs. prevention

• timeouts• wait-die• wound-wait

• Covered in CS245

Hector Garcia-Molina

Documents

CS 245Notes 61 CS 245: Database System Principles Notes 6: Query Processing Hector Garcia-Molina

1 Finding Replicated Web Collections Junghoo Cho Narayanan Shivakumar Hector Garcia-Molina

Social Tag Prediction Paul Heymann, Daniel Ramage, and Hector Garcia- Molina Stanford University SIGIR 2008

Data Warehousing and OLAP · 1 Data Warehousing and OLAP Hector Garcia-Molina Stanford University 2 Hector Garcia Molina: Data Warehousing and OLAP Warehousing Growing industry: $8

CS347Notes131 CS 347: Parallel and Distributed Data Management Notes XX: Publish/Subscribe Hector Garcia-Molina

CS 245Notes 11 CS 245: Database System Principles Notes 01: Introduction Hector Garcia-Molina

Hector Garcia-Molina Zoltan Gyongyi

Peer-to-peer archival data trading Brian Cooper and Hector Garcia-Molina Stanford University

L3: DDBS Design (Tailored from the Slides by Prof. Hector Garcia-Molina )

1 Query Compilation Parsing Logical Query Plan Source: our textbook, slides by Hector Garcia-Molina

1 Failure Recovery Introduction Undo Logging Redo Logging Source: slides by Hector Garcia-Molina

Chapters 15-16a1 (Slides by Hector Garcia-Molina, hector/cs245/notes.htm) Chapters 15 and 16: Query Processing

IN RE ORACLE CORPORATION DERIVATIVE …...2019/12/06 · 2016.43 Defendant Hector Garcia-Molina is a director of Oracle.44 Garcia-Molina assumed this position in October 2001.45 Garcia-Molina

Data Warehousing and OLAP Hector Garcia-Molina Stanford University

CS 245Notes 21 CS 245: Database System Principles Notes 02: Hardware Hector Garcia-Molina

Data$Mining$MTAT.03.183$ … · – Hector Garcia-Molina, Stanford University ... " snowflake schema ... Solution: Data Warehouse

Routing Indices For Peer-to-Peer Systems Arturo Crespo, Hector Garcia-Molina Stanford ICDCS 2002

Query Compilation Contains slides by Hector Garcia-Molina

CS 245Notes 101 CS 245: Database System Principles Notes 10: More TP Hector Garcia-Molina

Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University