Upload
doane
View
29
Download
1
Tags:
Embed Size (px)
DESCRIPTION
Transaction. All-or-nothing. Principles of Computer System (2012 Fall). Where are we?. System Complexity Modularity & Naming Enforced Modularity Network Fault Tolerance Transaction All-or-nothing Before-or-after. Review. Goal Building reliable system out of unreliable components - PowerPoint PPT Presentation
Citation preview
1
Transaction
Principles of Computer System (2012 Fall)
All-or-nothing
2
Where are we?
• System Complexity• Modularity & Naming• Enforced Modularity• Network• Fault Tolerance• Transaction– All-or-nothing– Before-or-after
3
Review
• Goal– Building reliable system out of unreliable
components• Fault, Error, Failure• MTTF, MBTF, MTTR– Bath curve
• RAID
4
Review
• Disk Failure Tolerance– Use replication– Checksum used to detect faults– If cannot read one disk, read data from replica(s)
instead– Weaker (but still very effective): error-correcting
codes on disk platter– Replication allows masking component failure
All-or-Nothing and Before-or-After Atomicity
• An action is atomic – If there is no way for a higher layer to discover the internal
structure of its implementation
5
All-or-Nothing and Before-or-After Atomicity
• From the point of view of a procedure that invokes an atomic action– The atomic action always appears either to complete as
anticipated, or to do nothing– This consequence is the one that makes atomic actions
useful in recovering from failures
6
All-or-Nothing and Before-or-After Atomicity
• From the point of view of a concurrent thread– An atomic action acts as though it occurs either
completely before or completely after every other concurrent atomic action
– This consequence is the one that makes atomic actions useful for coordinating concurrent threads
• Atomicity hides – not just the details of which steps form the atomic action– but the very fact that it has structure
7
All-or-Nothing and Before-or-After Atomicity
• 1. Data abstraction – Hide the internal structure of data
• 2. Client/server organization– Hide the internal structure of major subsystems
• 3. Atomicity – Hide the internal structure of an action
• Enforce industrial-strength modularity– Guarantee absence of unanticipated interactions among
components of a complex system• The implementer’s point of view– Painful
8
Atomic actions’ benevolent side effects• Audit log– atomic actions that run into trouble record
• the nature of the detected failure and • the recovery sequence
– for later analysis• Data management system when insert a record– rearrange the file into a better physical order
• Cache• Garbage collection
• They are all hidden from upper levels9
Overall system fault tolerance model• 1. Error-free operation: – All work goes according to expectations– The user initiates actions and the system confirms the
actions by displaying messages to the user• 2. Tolerated error:– The user who has initiated an action notices that the
system failed before it confirmed completion of the action – when the system is operating again, checks to see
whether or not it actually performed that action
10
Overall system fault tolerance model• 3. Un-tolerated error: – The system fails without the user noticing– the user does not realize that he or she should check or
retry an action that the system may not have completed
11
Disk Storage System Fault Tolerance Model
• A perfect-disk assumption– a disk never decays and that it has no hard errors– only one thing can go wrong: a system crash at just the
wrong time
12
Disk Storage System Fault Tolerance Model
• The fault tolerance model– Error-free operation: • CAREFUL_GET returns the result of the most recent call
to CAREFUL_PUT • at sector_number on track, with status = OK.
– Detectable error: • The operating system crashes during a CAREFUL_PUT • and corrupts the disk buffer in volatile storage• and CAREFUL_PUT writes corrupted data on one sector
of the disk.
13
ALL_OR_NOTHING_PUT
1. procedure ALMOST_ALL_OR_NOTHING_PUT (data, all_or_nothing_sector)
2. CAREFUL_PUT(data, all_or_nothing_sector.S1)
3. CAREFUL_PUT (data, all_or_nothing_sector.S2)
4. CAREFUL_PUT (data, all_or_nothing_sector.S3)
5. procedure ALL_OR_NOTHING_GET (reference date,all_or_nothing_sector)
6. CAREFUL_GET (data1, all_or_nothing_sector.S1)
7. CAREFUL_GET (data2, all_or_nothing_sector.S2)
8. CAREFUL_GET (data3, all_or_nothing_sector.S3)
9. if (data1 = data2) data ← data1
10. else data ← data3
14
ALL_OR_NOTHING_PUT
1. procedure ALL_OR_NOTHING_PUT (data, all_or_nothing_sector)
2. CHECK_AND_REPAIR (all_or_nothing_sector)
3. ALMOST_ALL_OR_NOTHING_PUT (data,
all_or_nothing_sector)
4. procedure CHECK_AND_REPAIR (all_or_nothing_sector)
// Ensure copies match
5. CAREFUL_GET (data1, all_or_nothing_sector.S1)
6. CAREFUL_GET (data2, all_or_nothing_sector.S2)
7. CAREFUL_GET (data3, all_or_nothing_sector.S3)15
ALL_OR_NOTHING_PUT
8. if (data1 = data2) and (data2 = data3) return // State 1 or 7, no repair
9. if (data1 = data2)10. CAREFUL_PUT (data1, all_or_nothing_sector.S3) return // State 5 or
6.11. if (data2 = data3)12. CAREFUL_PUT (data2, all_or_nothing_sector.S1) return // State 2 or
3.13. CAREFUL_PUT (data1, all_or_nothing_sector.S2) // State 4, go to state 514. CAREFUL_PUT (data1, all_or_nothing_sector.S3 // State 5, go to state 7
16
Atomicity
17
Commit
18
Commit• Pre-commit– identify all the resources needed – establish their availability– maintain the ability to abort at any instant• shared resources, once reserved, cannot be released until
the commit point is passed• should not do anything externally visible
• Post-commit– release reserved resources that are no longer needed– perform externally visible actions – cannot try to acquire additional resources
• Q: where’s commit in ALL_OR_NOTHING_PUT?19
Shadow Copy• Pre-commit: – Create a complete duplicate working copy of the file that is
to be modified– make all changes to the working copy
• Commit point: – Carefully exchange the working copy with the original– Typically this step is bootstrapped
• Post-commit: – Release the space that was occupied by the original
20
The golden rule of atomicity : Never modify the only copy!
21
Bank account transfer
xfer(bank, a, b, amt): bank[a] = bank[a] – amt bank[b] = bank[b] + amt
22
Bank account transfer
xfer(bank, a, b, amt):
bank[a] = bank[a] – amt
bank[b] = bank[b] + amt
<- a=100, b=100
xfer(bank, a, b, 50):
<- a=50, b=100
<- a=50, b=150
23
Bank account transfer
xfer(bank, a, b, amt): bank[a] = bank[a] – amt bank[b] = bank[b] + amt
audit(bank): sum = 0 for acct in bank: sum = sum + bank[acct] return sum
24
Bank account transfer
xfer(bank, a, b, amt): bank[a] = bank[a] – amt bank[b] = bank[b] + amt
audit(bank): sum = 0 for acct in bank: sum = sum + bank[acct] return sum
<- sum=200
audit(bank):
<- sum=150
<- sum=200
25
Eventual goal: transactionsxfer(bank, a, b, amt): begin bank[a] = bank[a] – amt bank[b] = bank[b] + amt commit
audit(bank): begin sum = 0 for acct in bank: sum = sum + bank[acct] commit return sum
26
Strawman implementationxfer(bank, a, b, amt): bank[a] = read_accounts(bankfile) bank[a] = bank[a] – amt bank[b] = bank[b] + amt write_accounts(bankfile)
27
Shadow copyxfer(bank, a, b, amt): bank[a] = read_accounts(bankfile) bank[a] = bank[a] – amt bank[b] = bank[b] + amt write_accounts(#bankfile) rename(“#bankfile”, bankfile)
28
Rename
• Rename – unlink(bankname)– link("newfile", bankname)– unlink("newfile")
29
File system data structures
• directory data blocks:– filename “bank” → inode 12– filename “#bank” → inode 13
• inode 12:– data blocks: 3, 4, 5– refcount: 1
• inode 13:– data blocks: 6, 7, 8– refcount: 1
30
Rename• What needs to happen during rename?– Point "bank" directory entry at inode 13– Remove "newfile" directory entry– Remove refcount on inode 12
• First try at rename(x, y):– y's dirent gets x's inode #– decref(y's original inode)– remove x's dirent
• Problems– What happens if we crash after modifying y's dirent?– Two names point to inode 13, refcount is 1.– What if we increment refcount before, and decrement it
afterwards?
31
Increase ref-count before
rename(x, y): newino = lookup(x) oldino = lookup(y)
incref(newino) change y's dirent to newino decref(oldino) remove x's dirent decref(newino)
32
rename(“#bank”, “bank”)
• directory data blocks:– filename “bank” → inode 12– filename “#bank” → inode 13
• inode 12:– data blocks: 3, 4, 5– refcount: 1
• inode 13:– data blocks: 6, 7, 8– refcount: 2
33
rename(“#bank”, “bank”)
• directory data blocks:– filename “bank” → inode 13– filename “#bank” → inode 13
• inode 12:– data blocks: 3, 4, 5– refcount: 1
• inode 13:– data blocks: 6, 7, 8– refcount: 2
34
rename(“#bank”, “bank”)
• directory data blocks:– filename “bank” → inode 13– filename “#bank” → inode 13
• inode 12:– data blocks: 3, 4, 5– refcount: 0
• inode 13:– data blocks: 6, 7, 8– refcount: 2
35
rename(“#bank”, “bank”)
• directory data blocks:– filename “bank” → inode 13– filename “#bank” → inode 13
• inode 12:– data blocks: 3, 4, 5– refcount: 0
• inode 13:– data blocks: 6, 7, 8– refcount: 2
36
rename(“#bank”, “bank”)
• directory data blocks:– filename “bank” → inode 13– filename “#bank” → inode 13
• inode 12:– data blocks: 3, 4, 5– refcount: 0
• inode 13:– data blocks: 6, 7, 8– refcount: 1
37
Recovery after crashsalvage(disk): for inode in disk.inodes: inode.refcnt = find_all_refs(disk.root_dir, inode) if exists(“#bank”): unlink(“#bank”)
38
Shadow copy• Write to a copy of data, atomically switch to new
copy• Switching can be done with one all-or-nothing
operation (sector write)• Requires a small amount of all-or-nothing atomicity
from lower layer (disk)• Main rule: only make one write to current/live copy
of data– In our example, sector write for rename.– Creates a well-defined commit point.
39
Shadow copy• Does the shadow copy approach work in
general?– +: Works well for a single file.– -: Hard to generalize to multiple files or directories.
• Might have to place all files in a single directory, or rename subdirs.
– -: Requires copying the entire file for any (small) change.
– -: Only one operation can happen at a time.– -: Only works for operations that happen on a single
computer, single disk.
40
Summary• Not always possible to use replication to avoid failures.
– To recover from failure, need to reason about what happened to the system
• Atomicity makes it much easier to reason about possible system states:– All-or-nothing atomicity– Before-or-after atomicity
• Shadow copy: one approach to all-or-nothing atomicity– Make all modifications to a shadow copy of data.– Replace old copy with new copy with an all-or-nothing write to one
sector.• -> commit point
– Rule: never modify live data (unless one sector write is all you need)
41
Summary• Two kinds of atomicity: all-or-nothing, before-or-after
– Shadow copy can provide all-or-nothing atomicity
– Golden rule of atomicity: never modify the only copy!
• Typical way to achieve all-or-nothing atomicity.– Works because you can fall back to the old copy in case of failure.
– Software can also use all-or-nothing atomicity to abort in case of error.
• Drawbacks of shadow file approach:– only works for single file (maybe fixable with shadow dirs)
– copy the entire file for every all-or-nothing action (harder to avoid)
• Still, shadow copy is a simple and effective design when it suffices.– Many Unix applications (e.g., text editors) use it, owing to rename.
42
All-or-nothing: Logging
• A more general techniques for achieving all-or-nothing atomicity– Idea: keep a log of all changes, and whether each
change commits or aborts– We will start out with a simple scheme that's all-
or-nothing but slow– Then, we will optimize its performance while
preserving atomicity
Journal Storage• A store operation – Not overwrites old data– create a new, tentative version of the data • remains invisible to any reader outside this all-or-
nothing action • until the action commits
43
Journal Storage
44
Journal Storage
45
Journal Storage1. procedure NEW_ACTION ()2. id ← NEW_OUTCOME_RECORD ()3. id.outcome_record.state ← PENDING4. return id
5. procedure COMMIT (reference id)6. id.outcome_record.state ← COMMITTED
7. procedure ABORT (reference id)8. id.outcome_record.state ← ABORTED
46
Journal Storage1. procedure READ_CURRENT_VALUE (data_id, caller_id)2. starting at end of data_id repeat until beginning3. v ← previous version of data_id // Get next older
version4. a ← v.action_id // Identify the action a that created it5. s ← a.outcome_record.state // Check action a’s outcome
record
6. if s = COMMITTED then7. return v.value8. else skip v // Continue backward search9. signal (“Tried to read an uninitialized variable!”)
47
Journal Storage10. procedure WRITE_NEW_VALUE (reference data_id,
new_value, caller_id)11. if caller_id.outcome_record.state = PENDING12. append new version v to data_id13. v.value ← new_value14. v.action_id ← caller_id15. else signal (“Tried to write outside of an all-or-nothing action!”)
48
Journal Storage
49
Journal Storage1. procedure TRANSFER (reference debit_account, reference credit_account, amount)
2. my_id ← NEW_ACTION ()
3. xvalue ← READ_CURRENT_VALUE (debit_account, my_id)
4. xvalue ← xvalue - amount
5. WRITE_NEW_VALUE (debit_account, xvalue, my_id)
6. yvalue ← READ_CURRENT_VALUE (credit_account, my_id)
7. yvalue ← yvalue + amount
8. WRITE_NEW_VALUE (credit_account, yvalue, my_id)
9. if xvalue > 0 then
10. COMMIT (my_id)
11. else
12. ABORT (my_id)
13. signal (“Negative transfers are not allowed.”)50
Log
51
52
Example: Bank Account App
xfer(bank, a, b, amt):copy(bank, tmp)tmp[a] = tmp[a] – amttmp[b] = tmp[b] + amtrename(tmp, bank)
53
Shadow copy abort VS. commit
xfer(bank, a, b, amt):copy(bank, tmp)tmp[a] = tmp[a] – amttmp[b] = tmp[b] + amtif tmp[a] < 0:print “Not enough funds”unlink(tmp)else:rename(tmp, bank)
54
Transaction terminology
xfer(bank, a, b, amt): begin
bank[a] = bank[a] – amtbank[b] = bank[b] + amtif bank[a] < 0:print “Not enough funds”abortelse:commit
55
Consider our bank account example• Two accounts: A and B.
– Accounts start out empty.– Run these all-or-nothing actions:
begin A = 100 B = 50 commit
begin A = A - 20 B = B + 20 commit
begin A = A + 30 --CRASH--
A Log Sample +-------+------+--------+-------+------+--------+-------+TID | T1 | T1 | T1 | T2 | T2 | T2 | T3 |OLD | A=0 | B=0 | | A=100 | B=50 | | A=80 |NEW | A=100 | B=50 | COMMIT | A=80 | B=70 | COMMIT | A=110 | +-------+------+--------+-------+------+--------+-------+
• Begin• Write variable• Read variable• Commit• Abort
begin A = 100 B = 50 commit
begin A = A - 20 B = B + 20 commit
begin A = A + 30 --CRASH--
Logging
• What happens when a program runs now?– begin: allocate a new transaction ID.– write variable: append an entry to the log.– read variable: scan the log looking for last
committed value.• As an aside: how to see your own updates?• Read uncommitted values from your own tid.
Read with a logread(log, var): commits = { } for record r in log[len(log)-1] .. log[0]: if (r.type == commit): commits = commits + r.tid if (r.type == update) and (r.tid in commits) and (r.var == var): return r.new_val
Logging– Commit: write a commit record• Expectedly, writing a commit record is the "commit
point" for action, because of the way read works (looks for commit record)
• However, writing log records better be all-or-nothing– One approach, from last time: make each record fit within one
sector
– Abort: • Do nothing or write a record?
– could write an abort record, but not strictly needed• Recover from a crash: do nothing.
– Recovery?
Performance of Read
• What's the performance of this log-only approach?– Write performance is probably good: sequential
writes, instead of random.– (Since we aren't using the old values yet, we could
have skipped the read.)• Read performance is terrible: scan the log for
every read!– Crash recovery is instantaneous: nothing to do.
Performance Optimization
• How can we optimize read performance?– Keep both a log and "cell storage”– Log is just as before: authoritative, provides all-or-
nothing atomicity– Cell storage: provides fast reads, but cannot
provide all-or-nothing
• Terminology– "log" an update when it's written to the log– "install" an update when it's written to cell storage
Read / write with cell storage
62
Write-ahead-log protocol (WAL)Log the update before installing it
63
Read / write with cell storageread(var): return cell_read(var)
write(var, value): log.append(cur_tid, update, var, read(var), value) cell_write(var, value)
64
Read / write with cell storage
• Two questions we have to answer now:– How to update both the log and cell storage when
an update happens?– How to recover cell storage from the authoritative
log after crash?
Order Matters• Ordering of logging and installing
– The two together don't have all-or-nothing atomicity
– Can crash in-between, so just one of the two might have taken place
• What happens if we install first and then log?– If we crash, no idea what happened to cell storage, or how to fix it
– Bad idea, violates the golden rule ("Never modify the only copy”)
• The corresponding rule for logging is the "Write-ahead-log protocol" (WAL)– Log the update before installing it
• If we crash, log is authoritative and intact, can repair cell storage– Can think of it as not being the only copy, once it's in the log
Logging Protocol
• CHANGE Record– The identity of the all-or-nothing action– A component action redo• installs the intended value in cell storage
– After commit, if the system crashes , the recovery procedure can perform the install on behalf of the action
– A second component action undo• reverses the effect on cell storage of the install
– After aborts or the system crashes, it may be necessary for the recovery procedure to reverse the effect
66
Logging Protocol• The application• NEW_ACTION– log a BEGIN record that contains just the new identity– As the all-or-nothing action proceeds through its pre-commit
phase, it logs CHANGE records• To implement COMMIT or ABORT– logs an OUTCOME record– commit point
67
Logging Protocol1. procedure TRANSFER (debit_account, credit_account,
amount)2. my_id ← LOG (BEGIN_TRANSACTION)3. dbvalue.old ← GET (debit_account)4. dbvalue.new ← dbvalue.old - amount5. crvalue.old ← GET (credit_account, my_id)6. crvalue.new ← crvalue.old + amount7. LOG (CHANGE, my_id,8. “PUT (debit_account, dbvalue.new)”, //redo
action9. “PUT (debit_account, dbvalue.old)”) //undo
action
68
Logging Protocol10. LOG ( CHANGE, my_id,11. “PUT (credit_account, crvalue.new)”
//redo action
12. “PUT (credit_account, crvalue.old)”)//undo action
13. PUT (debit_account, dbvalue.new) // install14. PUT (credit_account, crvalue.new) // install15. if dbvalue.new > 0 then16. LOG ( OUTCOME, COMMIT, my_id)17. else18. LOG (OUTCOME, ABORT, my_id)19. signal(“Action not allowed. Would make debit
account negative.”)20. LOG (END_TRANSACTION, my_id)
69
Logging Protocol
70
Logging Protocol1. procedure ABORT (action_id)2. starting at end of log repeat until beginning3. log_record ← previous record of log4. if log_record.id = action_id then5. if (log_record.type = OUTCOME)6. then signal (“Can’t abort an already
completed action.”)7. if (log_record.type = CHANGE)8. then perform undo_action of log_record9. if (log_record.type = BEGIN)10. then break repeat11. LOG (action_id, OUTCOME, ABORTED) // Block future
undos.12. LOG (action_id, END)
71
Logging Protocol: non-volatile cell memory
72
Logging Protocol: non-volatile cell memory
73
74
75
winners: 1067, 1081, 1084, 1083, and 1086
losers: 1082, 1085, and 1087
Performance now• Writes might still be OK– but we do write twice: log & install
• Reads are fast– just look up in cell storage
• Recovery requires scanning the entire log
• Remaining performance problems:– We have to write to disk twice.– Scanning the log will take longer and longer, as the log
grows.
Optimization: truncate the log
• Current log grows without bound: not practical.– What part of the log can be discarded?
• Must know the outcome of every action in that part of log
• Cell storage must reflect all of those log records (commits, aborts).
– Truncating mechanism (assuming no pending actions):• Flush all cached updates to cell storage
• Write a checkpoint record, to save our place in the log.
• Truncate log prior to checkpoint record
• (Often log implemented as a series of files, so can delete old log files.)
– With pending actions, delete before checkpoint & earliest undecided record
Summary• Logging is a general technique for achieving all-or-
nothing atomicity.– Widely used: databases, file systems, ..
• Can achieve reasonable performance with logging.– Writes are always fast: sequential.– Reads can be fast with cell storage.– Key idea 1: write-ahead logging, makes it safe to update
cell storage.– Key idea 2: recovery protocol, undo losers / redo
winners.