Transaction

1

Transaction

Principles of Computer System (2012 Fall)

All-or-nothing

2

Where are we?

• System Complexity• Modularity & Naming• Enforced Modularity• Network• Fault Tolerance• Transaction– All-or-nothing– Before-or-after

3

Review

• Goal– Building reliable system out of unreliable

components• Fault, Error, Failure• MTTF, MBTF, MTTR– Bath curve

• RAID

4

Review

• Disk Failure Tolerance– Use replication– Checksum used to detect faults– If cannot read one disk, read data from replica(s)

instead– Weaker (but still very effective): error-correcting

codes on disk platter– Replication allows masking component failure

All-or-Nothing and Before-or-After Atomicity

• An action is atomic – If there is no way for a higher layer to discover the internal

structure of its implementation

5


• From the point of view of a procedure that invokes an atomic action– The atomic action always appears either to complete as

anticipated, or to do nothing– This consequence is the one that makes atomic actions

useful in recovering from failures

6


• From the point of view of a concurrent thread– An atomic action acts as though it occurs either

completely before or completely after every other concurrent atomic action

– This consequence is the one that makes atomic actions useful for coordinating concurrent threads

• Atomicity hides – not just the details of which steps form the atomic action– but the very fact that it has structure

7


• 1. Data abstraction – Hide the internal structure of data

• 2. Client/server organization– Hide the internal structure of major subsystems

• 3. Atomicity – Hide the internal structure of an action

• Enforce industrial-strength modularity– Guarantee absence of unanticipated interactions among

components of a complex system• The implementer’s point of view– Painful

8

Atomic actions’ benevolent side effects• Audit log– atomic actions that run into trouble record

• the nature of the detected failure and • the recovery sequence

– for later analysis• Data management system when insert a record– rearrange the file into a better physical order

• Cache• Garbage collection

• They are all hidden from upper levels9

Overall system fault tolerance model• 1. Error-free operation: – All work goes according to expectations– The user initiates actions and the system confirms the

actions by displaying messages to the user• 2. Tolerated error:– The user who has initiated an action notices that the

system failed before it confirmed completion of the action – when the system is operating again, checks to see

whether or not it actually performed that action

10

Overall system fault tolerance model• 3. Un-tolerated error: – The system fails without the user noticing– the user does not realize that he or she should check or

retry an action that the system may not have completed

11

Disk Storage System Fault Tolerance Model

• A perfect-disk assumption– a disk never decays and that it has no hard errors– only one thing can go wrong: a system crash at just the

wrong time

12

Disk Storage System Fault Tolerance Model

• The fault tolerance model– Error-free operation: • CAREFUL_GET returns the result of the most recent call

to CAREFUL_PUT • at sector_number on track, with status = OK.

– Detectable error: • The operating system crashes during a CAREFUL_PUT • and corrupts the disk buffer in volatile storage• and CAREFUL_PUT writes corrupted data on one sector

of the disk.

13

ALL_OR_NOTHING_PUT

1. procedure ALMOST_ALL_OR_NOTHING_PUT (data, all_or_nothing_sector)

2. CAREFUL_PUT(data, all_or_nothing_sector.S1)

3. CAREFUL_PUT (data, all_or_nothing_sector.S2)

4. CAREFUL_PUT (data, all_or_nothing_sector.S3)

5. procedure ALL_OR_NOTHING_GET (reference date,all_or_nothing_sector)

6. CAREFUL_GET (data1, all_or_nothing_sector.S1)



9. if (data1 = data2) data ← data1

10. else data ← data3

14

ALL_OR_NOTHING_PUT

1. procedure ALL_OR_NOTHING_PUT (data, all_or_nothing_sector)

2. CHECK_AND_REPAIR (all_or_nothing_sector)

3. ALMOST_ALL_OR_NOTHING_PUT (data,

all_or_nothing_sector)

4. procedure CHECK_AND_REPAIR (all_or_nothing_sector)

// Ensure copies match



7. CAREFUL_GET (data3, all_or_nothing_sector.S3)15

ALL_OR_NOTHING_PUT

8. if (data1 = data2) and (data2 = data3) return // State 1 or 7, no repair

9. if (data1 = data2)10. CAREFUL_PUT (data1, all_or_nothing_sector.S3) return // State 5 or

6.11. if (data2 = data3)12. CAREFUL_PUT (data2, all_or_nothing_sector.S1) return // State 2 or

3.13. CAREFUL_PUT (data1, all_or_nothing_sector.S2) // State 4, go to state 514. CAREFUL_PUT (data1, all_or_nothing_sector.S3 // State 5, go to state 7

16

Atomicity

17

Commit

18

Commit• Pre-commit– identify all the resources needed – establish their availability– maintain the ability to abort at any instant• shared resources, once reserved, cannot be released until

the commit point is passed• should not do anything externally visible

• Post-commit– release reserved resources that are no longer needed– perform externally visible actions – cannot try to acquire additional resources

• Q: where’s commit in ALL_OR_NOTHING_PUT?19

Shadow Copy• Pre-commit: – Create a complete duplicate working copy of the file that is

to be modified– make all changes to the working copy

• Commit point: – Carefully exchange the working copy with the original– Typically this step is bootstrapped

• Post-commit: – Release the space that was occupied by the original

20

The golden rule of atomicity ： Never modify the only copy!

21

Bank account transfer

xfer(bank, a, b, amt): bank[a] = bank[a] – amt bank[b] = bank[b] + amt

22


xfer(bank, a, b, amt):

bank[a] = bank[a] – amt

bank[b] = bank[b] + amt

<- a=100, b=100

xfer(bank, a, b, 50):

<- a=50, b=100

<- a=50, b=150

23



audit(bank): sum = 0 for acct in bank: sum = sum + bank[acct] return sum

24



audit(bank): sum = 0 for acct in bank: sum = sum + bank[acct] return sum

<- sum=200

audit(bank):

<- sum=150

<- sum=200

25

Eventual goal: transactionsxfer(bank, a, b, amt): begin bank[a] = bank[a] – amt bank[b] = bank[b] + amt commit

audit(bank): begin sum = 0 for acct in bank: sum = sum + bank[acct] commit return sum

26

Strawman implementationxfer(bank, a, b, amt): bank[a] = read_accounts(bankfile) bank[a] = bank[a] – amt bank[b] = bank[b] + amt write_accounts(bankfile)

27

Shadow copyxfer(bank, a, b, amt): bank[a] = read_accounts(bankfile) bank[a] = bank[a] – amt bank[b] = bank[b] + amt write_accounts(#bankfile) rename(“#bankfile”, bankfile)

28

Rename

• Rename – unlink(bankname)– link("newfile", bankname)– unlink("newfile")

29

File system data structures

• directory data blocks:– filename “bank” → inode 12– filename “#bank” → inode 13

• inode 12:– data blocks: 3, 4, 5– refcount: 1


30

Rename• What needs to happen during rename?– Point "bank" directory entry at inode 13– Remove "newfile" directory entry– Remove refcount on inode 12

• First try at rename(x, y):– y's dirent gets x's inode #– decref(y's original inode)– remove x's dirent

• Problems– What happens if we crash after modifying y's dirent?– Two names point to inode 13, refcount is 1.– What if we increment refcount before, and decrement it

afterwards?

31

Increase ref-count before

rename(x, y): newino = lookup(x) oldino = lookup(y)

incref(newino) change y's dirent to newino decref(oldino) remove x's dirent decref(newino)

32

rename(“#bank”, “bank”)




33





34





35





36





37

Recovery after crashsalvage(disk): for inode in disk.inodes: inode.refcnt = find_all_refs(disk.root_dir, inode) if exists(“#bank”): unlink(“#bank”)

38

Shadow copy• Write to a copy of data, atomically switch to new

copy• Switching can be done with one all-or-nothing

operation (sector write)• Requires a small amount of all-or-nothing atomicity

from lower layer (disk)• Main rule: only make one write to current/live copy

of data– In our example, sector write for rename.– Creates a well-defined commit point.

39

Shadow copy• Does the shadow copy approach work in

general?– +: Works well for a single file.– -: Hard to generalize to multiple files or directories.

• Might have to place all files in a single directory, or rename subdirs.

– -: Requires copying the entire file for any (small) change.

– -: Only one operation can happen at a time.– -: Only works for operations that happen on a single

computer, single disk.

40

Summary• Not always possible to use replication to avoid failures.

– To recover from failure, need to reason about what happened to the system

• Atomicity makes it much easier to reason about possible system states:– All-or-nothing atomicity– Before-or-after atomicity

• Shadow copy: one approach to all-or-nothing atomicity– Make all modifications to a shadow copy of data.– Replace old copy with new copy with an all-or-nothing write to one

sector.• -> commit point

– Rule: never modify live data (unless one sector write is all you need)

41

Summary• Two kinds of atomicity: all-or-nothing, before-or-after

– Shadow copy can provide all-or-nothing atomicity

– Golden rule of atomicity: never modify the only copy!

• Typical way to achieve all-or-nothing atomicity.– Works because you can fall back to the old copy in case of failure.

– Software can also use all-or-nothing atomicity to abort in case of error.

• Drawbacks of shadow file approach:– only works for single file (maybe fixable with shadow dirs)

– copy the entire file for every all-or-nothing action (harder to avoid)

• Still, shadow copy is a simple and effective design when it suffices.– Many Unix applications (e.g., text editors) use it, owing to rename.

42

All-or-nothing: Logging

• A more general techniques for achieving all-or-nothing atomicity– Idea: keep a log of all changes, and whether each

change commits or aborts– We will start out with a simple scheme that's all-

or-nothing but slow– Then, we will optimize its performance while

preserving atomicity

Journal Storage• A store operation – Not overwrites old data– create a new, tentative version of the data • remains invisible to any reader outside this all-or-

nothing action • until the action commits

43

Journal Storage

44

Journal Storage

45

Journal Storage1. procedure NEW_ACTION ()2. id ← NEW_OUTCOME_RECORD ()3. id.outcome_record.state ← PENDING4. return id

5. procedure COMMIT (reference id)6. id.outcome_record.state ← COMMITTED

7. procedure ABORT (reference id)8. id.outcome_record.state ← ABORTED

46

Journal Storage1. procedure READ_CURRENT_VALUE (data_id, caller_id)2. starting at end of data_id repeat until beginning3. v ← previous version of data_id // Get next older

version4. a ← v.action_id // Identify the action a that created it5. s ← a.outcome_record.state // Check action a’s outcome

record

6. if s = COMMITTED then7. return v.value8. else skip v // Continue backward search9. signal (“Tried to read an uninitialized variable!”)

47

Journal Storage10. procedure WRITE_NEW_VALUE (reference data_id,

new_value, caller_id)11. if caller_id.outcome_record.state = PENDING12. append new version v to data_id13. v.value ← new_value14. v.action_id ← caller_id15. else signal (“Tried to write outside of an all-or-nothing action!”)

48

Journal Storage

49

Journal Storage1. procedure TRANSFER (reference debit_account, reference credit_account, amount)

2. my_id ← NEW_ACTION ()

3. xvalue ← READ_CURRENT_VALUE (debit_account, my_id)

4. xvalue ← xvalue - amount

5. WRITE_NEW_VALUE (debit_account, xvalue, my_id)

6. yvalue ← READ_CURRENT_VALUE (credit_account, my_id)

7. yvalue ← yvalue + amount

8. WRITE_NEW_VALUE (credit_account, yvalue, my_id)

9. if xvalue > 0 then

10. COMMIT (my_id)

11. else

12. ABORT (my_id)

13. signal (“Negative transfers are not allowed.”)50

Log

51

52

Example: Bank Account App

xfer(bank, a, b, amt):copy(bank, tmp)tmp[a] = tmp[a] – amttmp[b] = tmp[b] + amtrename(tmp, bank)

53

Shadow copy abort VS. commit

xfer(bank, a, b, amt):copy(bank, tmp)tmp[a] = tmp[a] – amttmp[b] = tmp[b] + amtif tmp[a] < 0:print “Not enough funds”unlink(tmp)else:rename(tmp, bank)

54

Transaction terminology

xfer(bank, a, b, amt): begin

bank[a] = bank[a] – amtbank[b] = bank[b] + amtif bank[a] < 0:print “Not enough funds”abortelse:commit

55

Consider our bank account example• Two accounts: A and B.

– Accounts start out empty.– Run these all-or-nothing actions:

begin A = 100 B = 50 commit

begin A = A - 20 B = B + 20 commit

begin A = A + 30 --CRASH--

A Log Sample +-------+------+--------+-------+------+--------+-------+TID | T1 | T1 | T1 | T2 | T2 | T2 | T3 |OLD | A=0 | B=0 | | A=100 | B=50 | | A=80 |NEW | A=100 | B=50 | COMMIT | A=80 | B=70 | COMMIT | A=110 | +-------+------+--------+-------+------+--------+-------+

• Begin• Write variable• Read variable• Commit• Abort

begin A = 100 B = 50 commit

begin A = A - 20 B = B + 20 commit

begin A = A + 30 --CRASH--

Logging

• What happens when a program runs now?– begin: allocate a new transaction ID.– write variable: append an entry to the log.– read variable: scan the log looking for last

committed value.• As an aside: how to see your own updates?• Read uncommitted values from your own tid.

Read with a logread(log, var): commits = { } for record r in log[len(log)-1] .. log[0]: if (r.type == commit): commits = commits + r.tid if (r.type == update) and (r.tid in commits) and (r.var == var): return r.new_val

Logging– Commit: write a commit record• Expectedly, writing a commit record is the "commit

point" for action, because of the way read works (looks for commit record)

• However, writing log records better be all-or-nothing– One approach, from last time: make each record fit within one

sector

– Abort: • Do nothing or write a record?

– could write an abort record, but not strictly needed• Recover from a crash: do nothing.

– Recovery?

Performance of Read

• What's the performance of this log-only approach?– Write performance is probably good: sequential

writes, instead of random.– (Since we aren't using the old values yet, we could

have skipped the read.)• Read performance is terrible: scan the log for

every read!– Crash recovery is instantaneous: nothing to do.

Performance Optimization

• How can we optimize read performance?– Keep both a log and "cell storage”– Log is just as before: authoritative, provides all-or-

nothing atomicity– Cell storage: provides fast reads, but cannot

provide all-or-nothing

• Terminology– "log" an update when it's written to the log– "install" an update when it's written to cell storage

Read / write with cell storage

62

Write-ahead-log protocol (WAL)Log the update before installing it

63

Read / write with cell storageread(var): return cell_read(var)

write(var, value): log.append(cur_tid, update, var, read(var), value) cell_write(var, value)

64

Read / write with cell storage

• Two questions we have to answer now:– How to update both the log and cell storage when

an update happens?– How to recover cell storage from the authoritative

log after crash?

Order Matters• Ordering of logging and installing

– The two together don't have all-or-nothing atomicity

– Can crash in-between, so just one of the two might have taken place

• What happens if we install first and then log?– If we crash, no idea what happened to cell storage, or how to fix it

– Bad idea, violates the golden rule ("Never modify the only copy”)

• The corresponding rule for logging is the "Write-ahead-log protocol" (WAL)– Log the update before installing it

• If we crash, log is authoritative and intact, can repair cell storage– Can think of it as not being the only copy, once it's in the log

Logging Protocol

• CHANGE Record– The identity of the all-or-nothing action– A component action redo• installs the intended value in cell storage

– After commit, if the system crashes , the recovery procedure can perform the install on behalf of the action

– A second component action undo• reverses the effect on cell storage of the install

– After aborts or the system crashes, it may be necessary for the recovery procedure to reverse the effect

66

Logging Protocol• The application• NEW_ACTION– log a BEGIN record that contains just the new identity– As the all-or-nothing action proceeds through its pre-commit

phase, it logs CHANGE records• To implement COMMIT or ABORT– logs an OUTCOME record– commit point

67

Logging Protocol1. procedure TRANSFER (debit_account, credit_account,

amount)2. my_id ← LOG (BEGIN_TRANSACTION)3. dbvalue.old ← GET (debit_account)4. dbvalue.new ← dbvalue.old - amount5. crvalue.old ← GET (credit_account, my_id)6. crvalue.new ← crvalue.old + amount7. LOG (CHANGE, my_id,8. “PUT (debit_account, dbvalue.new)”, //redo

action9. “PUT (debit_account, dbvalue.old)”) //undo

action

68

Logging Protocol10. LOG ( CHANGE, my_id,11. “PUT (credit_account, crvalue.new)”

//redo action

12. “PUT (credit_account, crvalue.old)”)//undo action

13. PUT (debit_account, dbvalue.new) // install14. PUT (credit_account, crvalue.new) // install15. if dbvalue.new > 0 then16. LOG ( OUTCOME, COMMIT, my_id)17. else18. LOG (OUTCOME, ABORT, my_id)19. signal(“Action not allowed. Would make debit

account negative.”)20. LOG (END_TRANSACTION, my_id)

69

Logging Protocol

70

Logging Protocol1. procedure ABORT (action_id)2. starting at end of log repeat until beginning3. log_record ← previous record of log4. if log_record.id = action_id then5. if (log_record.type = OUTCOME)6. then signal (“Can’t abort an already

completed action.”)7. if (log_record.type = CHANGE)8. then perform undo_action of log_record9. if (log_record.type = BEGIN)10. then break repeat11. LOG (action_id, OUTCOME, ABORTED) // Block future

undos.12. LOG (action_id, END)

71

Logging Protocol: non-volatile cell memory

72

Logging Protocol: non-volatile cell memory

73

74

75

winners: 1067, 1081, 1084, 1083, and 1086

losers: 1082, 1085, and 1087

Performance now• Writes might still be OK– but we do write twice: log & install

• Reads are fast– just look up in cell storage

• Recovery requires scanning the entire log

• Remaining performance problems:– We have to write to disk twice.– Scanning the log will take longer and longer, as the log

grows.

Optimization: truncate the log

• Current log grows without bound: not practical.– What part of the log can be discarded?

• Must know the outcome of every action in that part of log

• Cell storage must reflect all of those log records (commits, aborts).

– Truncating mechanism (assuming no pending actions):• Flush all cached updates to cell storage

• Write a checkpoint record, to save our place in the log.

• Truncate log prior to checkpoint record

• (Often log implemented as a series of files, so can delete old log files.)

– With pending actions, delete before checkpoint & earliest undecided record

Summary• Logging is a general technique for achieving all-or-

nothing atomicity.– Widely used: databases, file systems, ..

• Can achieve reasonable performance with logging.– Writes are always fast: sequential.– Reads can be fast with cell storage.– Key idea 1: write-ahead logging, makes it safe to update

cell storage.– Key idea 2: recovery protocol, undo losers / redo

winners.

Documents

Transaction