132
Cer$fying a Crash-safe File System Haogang Chen Thesis Advisors Frans Kaashoek and Nickolai Zeldovich

Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

  • Upload
    vocong

  • View
    215

  • Download
    2

Embed Size (px)

Citation preview

Page 1: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Cer$fying a Crash-safe File System

Haogang Chen

Thesis AdvisorsFrans Kaashoek and Nickolai Zeldovich

Page 2: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

File systems should not lose data

• People use file systems to store permanent data

Page 3: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

File systems should not lose data

• People use file systems to store permanent data

• Computers can crash any$me • power failures

• hardware failures (unplug USB drive)

• soIware bugs

Page 4: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

File systems should not lose data

• People use file systems to store permanent data

• Computers can crash any$me • power failures

• hardware failures (unplug USB drive)

• soIware bugs

• File systems should not lose or corrupt data in case of crashes

Page 5: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

File systems are complex and have bugs

• Linux ext4: ~60,000 lines of code

Page 6: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

File systems are complex and have bugs

• Linux ext4: ~60,000 lines of code

Cumula&ve number of bug patches in Linux file systems [Lu et al., FAST’13]

# of

pat

ches

for b

ugs

0

150

300

450

600

Dec-03 Apr-04 Dec-04 Jan-06 Feb-07 Apr-08 Jun-09 Aug-10 May-11

ext3xfsjfsreiserfsext4btrfs

Page 7: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

File systems are complex and have bugs

• Linux ext4: ~60,000 lines of code

• Some bugs are serious: data loss, security exploits, etc.

Cumula&ve number of bug patches in Linux file systems [Lu et al., FAST’13]

# of

pat

ches

for b

ugs

0

150

300

450

600

Dec-03 Apr-04 Dec-04 Jan-06 Feb-07 Apr-08 Jun-09 Aug-10 May-11

ext3xfsjfsreiserfsext4btrfs

Page 8: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Researches in avoiding bugs in file systems

• Most research is on finding bugs • Crash injec$on (e.g., EXPLODE [OSDI’06])

• Symbolic execu$on (e.g., EXE [Oakland’06])

• Design modeling (e.g., in Alloy [ABZ’08])

• Some elimina$on of bugs by proving: • FS without directories [Arkoudas et al. 2004]

• BilbyFS [Keller 2014]

• UBIFS [Ernst et al. 2013]

Page 9: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

reduce# of bugs

Researches in avoiding bugs in file systems

• Most research is on finding bugs • Crash injec$on (e.g., EXPLODE [OSDI’06])

• Symbolic execu$on (e.g., EXE [Oakland’06])

• Design modeling (e.g., in Alloy [ABZ’08])

• Some elimina$on of bugs by proving: • FS without directories [Arkoudas et al. 2004]

• BilbyFS [Keller 2014]

• UBIFS [Ernst et al. 2013]

Page 10: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

incomplete + no crashes

reduce# of bugs

Researches in avoiding bugs in file systems

• Most research is on finding bugs • Crash injec$on (e.g., EXPLODE [OSDI’06])

• Symbolic execu$on (e.g., EXE [Oakland’06])

• Design modeling (e.g., in Alloy [ABZ’08])

• Some elimina$on of bugs by proving: • FS without directories [Arkoudas et al. 2004]

• BilbyFS [Keller 2014]

• UBIFS [Ernst et al. 2013]

Page 11: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Dealing with crashes is hard

• Crashes expose many par$ally-updated states • Reasoning about all failure cases is hard

Page 12: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Dealing with crashes is hard

• Crashes expose many par$ally-updated states • Reasoning about all failure cases is hard

• Performance op$miza$ons lead to more tricky par$al states • Disk I/O is expensive

• Buffer updates in memory

Page 13: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Dealing with crashes is hard

• Crashes expose many par$ally-updated states • Reasoning about all

failure cases is hard

• Performance op$miza$ons lead to more tricky par$al states • Disk I/O is expensive

• Buffer updates in memory

commit353b67d8ced4dc53281c88150ad295e24bc4b4c5Author:JanKara<[email protected]>Date:SatNov2600:35:392011+0100Title:jbd:Issuecacheflushaftercheckpointing

---a/fs/jbd/checkpoint.c+++b/fs/jbd/checkpoint.c@@-504,7+503,25@@intcleanup_journal_tail(journal_t*journal)spin_unlock(&journal->j_state_lock);return1;}+spin_unlock(&journal->j_state_lock);++/*+*Weneedtomakesurethatanyblocksthatwererecentlywrittenout+*---perhapsbylog_do_checkpoint()---areflushedoutbeforewe+*dropthetransactionsfromthejournal.It'sunlikelythiswillbe+*necessary,especiallywithanappropriatelysizedjournal,butwe+*needthistoguaranteecorrectness.Fortunately+*cleanup_journal_tail()doesn'tgetcalledallthatoften.+*/+if(journal->j_flags&JFS_BARRIER)+blkdev_issue_flush(journal->j_fs_dev,GFP_KERNEL,NULL);+spin_lock(&journal->j_state_lock);+if(!tid_gt(first_tid,journal->j_tail_sequence)){+spin_unlock(&journal->j_state_lock);+/*Someoneelsecleanedupjournalsoreturn0*/+return0;+}

A patch for Linux’s write-ahead logging (jbd) in 2012: “Is it safe to omit a disk write barrier here?”

It's unlikely this will be necessary, … but we need this to guarantee correctness. Fortunately this func;on doesn't get called all that o<en.

Page 14: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Goal: cer$fy a file system under crashes

Page 15: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Goal: cer$fy a file system under crashes

• FSCQ: first cer$fied crash-safe file system

A complete file system with a machine-checkable proof that its implementa$on meets its specifica$on, both under normal execuDon and under any sequence of crashes, including crashes during recovery.

Page 16: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Contribu$ons

• CHL: Crash Hoare Logic • Specifica$on framework for crash-safety of storage

• Crash condi$on and recovery seman$cs

• Automa$on to reduce proof effort

• FSCQ: the first cer$fied crash-safe file system • Basic Unix-like file system (no hard-links, no concurrency)

• Precise specifica$on for the core subset of POSIX

• I/O performance on par with Linux ext4

• CPU overhead is high

Page 17: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

FSCQ runs standard Unix programs

Crash Hoare Logic (CHL) Top-level specifica;on Internal specifica;ons

Program Proof

FSCQ (wriKen in Coq)

Program

Page 18: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

FSCQ runs standard Unix programs

Coq proof checker

Crash Hoare Logic (CHL) Top-level specifica;on Internal specifica;ons

Program Proof

FSCQ (wriKen in Coq)

OK

Program

Page 19: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

FSCQ runs standard Unix programs

Coq proof checker

Crash Hoare Logic (CHL) Top-level specifica;on Internal specifica;ons

Program Proof

FSCQ (wriKen in Coq)

FSCQ’s Haskell code

FSCQ’s FUSE serverOK

Code extrac$on

Haskell compiler

Page 20: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

FSCQ runs standard Unix programs

Coq proof checker

Crash Hoare Logic (CHL) Top-level specifica;on Internal specifica;ons

Program Proof

FSCQ (wriKen in Coq)

FSCQ’s Haskell code

FSCQ’s FUSE server

Haskell libraries & FUSE driver

OK

Linux kernel /dev/sda

Code extrac$on

Haskell compiler

Page 21: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

FSCQ runs standard Unix programs

Coq proof checker

Crash Hoare Logic (CHL) Top-level specifica;on Internal specifica;ons

Program Proof

FSCQ (wriKen in Coq)

FSCQ’s Haskell code

FSCQ’s FUSE server

Haskell libraries & FUSE driver

OK

Linux kernel /dev/sda

Code extrac$on

Haskell compiler

syscalls FUSE upcalls disk read(), write(), sync()

$mvsrcdest$gitclonerepo…$make

Page 22: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

FSCQ’s Trusted CompuDng Base

Coq proof checker

Crash Hoare Logic (CHL) Top-level specificaDon Internal specifica;ons

Program Proof

FSCQ (wriKen in Coq)

FSCQ’s Haskell code

FSCQ’s FUSE server

Haskell libraries & FUSE driver

OK

Linux kernel /dev/sda

Code extracDon

Haskell compiler

syscalls FUSE upcalls disk read(), write(), sync()

$mvsrcdest$gitclonerepo…$make

Page 23: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Outline

• Crash safety • What is the correct behavior aIer a crash?

• Challenge 1: formalizing crashes • Crash Hoare Logic (CHL)

• Challenge 2: incorpora$ng performance op$miza$ons • Disk sequences

• Building a complete file system

• Evalua$on

Page 24: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

What is crash safety?

• What guarantee should file system provide when it crashes and reboot?

• Look it up in the POSIX standard?

Page 25: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

POSIX is vague about crash behavior

[...] a power failure [...] can cause data to be lost. The data may be associated with a file that is s:ll open, with one that has been closed, with a directory, or with any other internal system data structures associated with permanent storage. This data can be lost, in whole or part, so that only careful inspec:on of file contents could determine that an update did not occur.

IEEE Std 1003.1, 2013 Edi$on

Page 26: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

POSIX is vague about crash behavior

• POSIX’s goal was to specify “common-denominator” behavior

• Gives freedom to file systems to implement their own op$miza$ons

[...] a power failure [...] can cause data to be lost. The data may be associated with a file that is s:ll open, with one that has been closed, with a directory, or with any other internal system data structures associated with permanent storage. This data can be lost, in whole or part, so that only careful inspec:on of file contents could determine that an update did not occur.

IEEE Std 1003.1, 2013 Edi$on

Page 27: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

What is crash safety?

• What guarantee should file system provide when it crashes and reboot?

• Look it up in the POSIX standard? (Too Vague)

Page 28: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

What is crash safety?

• What guarantee should file system provide when it crashes and reboot?

• Look it up in the POSIX standard? (Too Vague)

• A simple and useful defini$on is transacDonal • Atomicity: every file-system call is all-or-nothing

• Durability: every call persists on disk when it returns

Page 29: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

What is crash safety?

• What guarantee should file system provide when it crashes and reboot?

• Look it up in the POSIX standard? (Too Vague)

• A simple and useful defini$on is transacDonal • Atomicity: every file-system call is all-or-nothing

• Durability: every call persists on disk when it returns

• Run every file-system call inside a transac$on, using write-ahead logging.

Page 30: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Write-ahead logging

Disk

Page 31: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Write-ahead logging

Log0

➡ log_begin()

Disk

Page 32: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Write-ahead logging

Log0 2 a

8 b

5 c

➡ log_begin()➡ log_write(2,‘a’)➡ log_write(8,‘b’)➡ log_write(5,‘c’)

Disk

1. Append writes to the log

Page 33: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Write-ahead logging

Log0 2 a

8 b

5 c3

➡ log_begin()➡ log_write(2,‘a’)➡ log_write(8,‘b’)➡ log_write(5,‘c’)➡ log_commit()

Disk

1. Append writes to the log2. Set commit record

Page 34: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Write-ahead logging

a c b Log0 2 a

8 b

5 c3

➡ log_begin()➡ log_write(2,‘a’)➡ log_write(8,‘b’)➡ log_write(5,‘c’)➡ log_commit()

Disk

1. Append writes to the log2. Set commit record3. Apply the log to disk loca$ons

Page 35: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Write-ahead logging

a c b Log0

➡ log_begin()➡ log_write(2,‘a’)➡ log_write(8,‘b’)➡ log_write(5,‘c’)➡ log_commit()

Disk

1. Append writes to the log2. Set commit record3. Apply the log to disk loca$ons4. Truncate the log

Page 36: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Write-ahead logging

• Recovery: aIer crash, replay (apply) any commiRed transac$on in the log

a c b Log0

➡ log_begin()➡ log_write(2,‘a’)➡ log_write(8,‘b’)➡ log_write(5,‘c’)➡ log_commit()

Disk

1. Append writes to the log2. Set commit record3. Apply the log to disk loca$ons4. Truncate the log

Page 37: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Write-ahead logging

• Recovery: aIer crash, replay (apply) any commiRed transac$on in the log

• Atomicity: either all writes appear on disk or none do

a c b Log0

➡ log_begin()➡ log_write(2,‘a’)➡ log_write(8,‘b’)➡ log_write(5,‘c’)➡ log_commit()

Disk

1. Append writes to the log2. Set commit record3. Apply the log to disk loca$ons4. Truncate the log

Page 38: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Write-ahead logging

• Recovery: aIer crash, replay (apply) any commiRed transac$on in the log

• Atomicity: either all writes appear on disk or none do

• Durability: all changes are persisted on disk when log_commit() returns

a c b Log0

➡ log_begin()➡ log_write(2,‘a’)➡ log_write(8,‘b’)➡ log_write(5,‘c’)➡ log_commit()

Disk

1. Append writes to the log2. Set commit record3. Apply the log to disk loca$ons4. Truncate the log

Page 39: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Example: transac$onal crash safety

defcreate(dir,name):log_begin()newfile=allocate_inode()newfile.init()dir.add(name,newfile)log_commit()

Page 40: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Example: transac$onal crash safety

defcreate(dir,name):log_begin()newfile=allocate_inode()newfile.init()dir.add(name,newfile)log_commit()

deflog_recover():ifcommitted:

log_apply()log_truncate()

… aTer crash …

Page 41: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Example: transac$onal crash safety

• Q: How to formally define what happens when the computer crashes?

defcreate(dir,name):log_begin()newfile=allocate_inode()newfile.init()dir.add(name,newfile)log_commit()

deflog_recover():ifcommitted:

log_apply()log_truncate()

… aTer crash …

Page 42: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Example: transac$onal crash safety

• Q: How to formally define what happens when the computer crashes?

• Q: How to formally specify the behavior of “create” in presence of crash and recovery?

defcreate(dir,name):log_begin()newfile=allocate_inode()newfile.init()dir.add(name,newfile)log_commit()

deflog_recover():ifcommitted:

log_apply()log_truncate()

… aTer crash …

Page 43: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Approach: Crash Hoare Logic{pre} code {post}

SPEC disk write (a, v)PRE a 7! v0POST a 7! v

Page 44: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Approach: Crash Hoare Logic

• Crash condiDon: all intermediate disk states (plus two end-states)

{pre} code {post}

SPEC disk write (a, v)PRE a 7! v0POST a 7! vCRASH a 7! v0 _ a 7! v

{crash}

Page 45: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Approach: Crash Hoare Logic

• Crash condiDon: all intermediate disk states (plus two end-states)

• CHL’s disk model matches what most other file systems assume:

• Wri$ng a single block is an atomic opera$on, no data corrup$on

{pre} code {post}

SPEC disk write (a, v)PRE a 7! v0POST a 7! vCRASH a 7! v0 _ a 7! v

{crash}

Page 46: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Asynchronous disk I/O

Page 47: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Asynchronous disk I/O • For performance, hard drive caches

writes in its internal vola$le buffer • Writes do not persist immediately

Page 48: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Asynchronous disk I/O • For performance, hard drive caches

writes in its internal vola$le buffer • Writes do not persist immediately

• Disk flushes the buffer to media in background • Writes might be reordered

Page 49: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Asynchronous disk I/O • For performance, hard drive caches

writes in its internal vola$le buffer • Writes do not persist immediately

• Disk flushes the buffer to media in background • Writes might be reordered

• Use write barrier (disk_sync) to force flushing the buffer • Make data persistent & enforce ordering

• Disk syncs are expensive!

Page 50: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Formalizing asynchronous disk I/O • Challenge: when crashes, the disk might lose some of

the recent writes

Page 51: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Formalizing asynchronous disk I/O • Challenge: when crashes, the disk might lose some of

the recent writes a⟼0,b⟼0disk_write(a,1)disk_write(b,2)disk_write(a,3)

Q: What are the possible disk states ifcrashing aIer the 3 writes?

Page 52: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Formalizing asynchronous disk I/O • Challenge: when crashes, the disk might lose some of

the recent writes a⟼0,b⟼0disk_write(a,1)disk_write(b,2)disk_write(a,3)

Q: What are the possible disk states ifcrashing aIer the 3 writes?

A: 6 cases: a ⟼ 0 or 1 or 3, b ⟼ 0 or 2

Page 53: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Formalizing asynchronous disk I/O • Challenge: when crashes, the disk might lose some of

the recent writes

• Idea: use value-sets: • Read returns the latest value:

• Write adds a value to the set:

• Sync discards previous values:

• Reboot chooses a random value:

a 7! hv0, vsi

a⟼0,b⟼0disk_write(a,1)disk_write(b,2)disk_write(a,3)

Q: What are the possible disk states ifcrashing aIer the 3 writes?

A: 6 cases: a ⟼ 0 or 1 or 3, b ⟼ 0 or 2

Page 54: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Formalizing asynchronous disk I/O • Challenge: when crashes, the disk might lose some of

the recent writes

• Idea: use value-sets: • Read returns the latest value:

• Write adds a value to the set:

• Sync discards previous values:

• Reboot chooses a random value:

a 7! hv0, vsiv0

a⟼0,b⟼0disk_write(a,1)disk_write(b,2)disk_write(a,3)

Q: What are the possible disk states ifcrashing aIer the 3 writes?

A: 6 cases: a ⟼ 0 or 1 or 3, b ⟼ 0 or 2

Page 55: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Formalizing asynchronous disk I/O • Challenge: when crashes, the disk might lose some of

the recent writes

• Idea: use value-sets: • Read returns the latest value:

• Write adds a value to the set:

• Sync discards previous values:

• Reboot chooses a random value:

a 7! hv0, vsi

a 7! hv, {v0} [ vsiv0

a⟼0,b⟼0disk_write(a,1)disk_write(b,2)disk_write(a,3)

Q: What are the possible disk states ifcrashing aIer the 3 writes?

A: 6 cases: a ⟼ 0 or 1 or 3, b ⟼ 0 or 2

Page 56: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Formalizing asynchronous disk I/O • Challenge: when crashes, the disk might lose some of

the recent writes

• Idea: use value-sets: • Read returns the latest value:

• Write adds a value to the set:

• Sync discards previous values:

• Reboot chooses a random value:

a 7! hv0, vsi

a 7! hv, {v0} [ vsia 7! hv0, ?i

v0

a⟼0,b⟼0disk_write(a,1)disk_write(b,2)disk_write(a,3)

Q: What are the possible disk states ifcrashing aIer the 3 writes?

A: 6 cases: a ⟼ 0 or 1 or 3, b ⟼ 0 or 2

Page 57: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Formalizing asynchronous disk I/O • Challenge: when crashes, the disk might lose some of

the recent writes

• Idea: use value-sets: • Read returns the latest value:

• Write adds a value to the set:

• Sync discards previous values:

• Reboot chooses a random value:

a 7! hv0, vsi

a 7! hv, {v0} [ vsia 7! hv0, ?ia 7! hv0, ?i, v0 2 {v0} [ vs

v0

a⟼0,b⟼0disk_write(a,1)disk_write(b,2)disk_write(a,3)

Q: What are the possible disk states ifcrashing aIer the 3 writes?

A: 6 cases: a ⟼ 0 or 1 or 3, b ⟼ 0 or 2

Page 58: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

CHL asynchronous disk model

SPEC disk write (a, v)PRE disk |= a 7! hv0, vsiPOST disk |= a 7! hv, {v0} [ vsiCRASH disk |= a 7! hv0, vsi _

a 7! hv, {v0} [ vsi

Page 59: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

CHL asynchronous disk model

• Specifica$ons for disk_write, disk_read, and disk_sync are axioms

SPEC disk write (a, v)PRE disk |= a 7! hv0, vsiPOST disk |= a 7! hv, {v0} [ vsiCRASH disk |= a 7! hv0, vsi _

a 7! hv, {v0} [ vsi

Page 60: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

CHL asynchronous disk model

• Specifica$ons for disk_write, disk_read, and disk_sync are axioms

• “disk |= …” means the disk address space entails the predicate

SPEC disk write (a, v)PRE disk |= a 7! hv0, vsiPOST disk |= a 7! hv, {v0} [ vsiCRASH disk |= a 7! hv0, vsi _

a 7! hv, {v0} [ vsi

Page 61: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Abstrac$on layers• Each abstrac$on layer forms an address space

Page 62: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Abstrac$on layers• Each abstrac$on layer forms an address space

Physical disk log a 7! hv0, vsi

Page 63: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Abstrac$on layers• Each abstrac$on layer forms an address space

Physical disk log a 7! hv0, vsi

Logical disk a 7! v

Page 64: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Abstrac$on layers• Each abstrac$on layer forms an address space

Physical disk log a 7! hv0, vsi

Logical disk a 7! v

Files inum 7! filefile0 file1 file2 filen

Page 65: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Abstrac$on layers• Each abstrac$on layer forms an address space

Physical disk log a 7! hv0, vsi

Logical disk a 7! v

Files inum 7! filefile0 file1 file2 filen

Directory tree

Page 66: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Abstrac$on layers• Each abstrac$on layer forms an address space

• RepresentaDon invariants connect logical states between layers

Physical disk log a 7! hv0, vsi

Logical disk a 7! v

Files inum 7! filefile0 file1 file2 filen

Directory treedir_rep

files_rep

log_rep

Page 67: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Example: representa$on invariant SPEC log write (a, v)PRE

old state |= a 7! v0POST

new state |= a 7! v

• old_state and new_state are “logical disks” exposed by the logging system

Page 68: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Example: representa$on invariant

• old_state and new_state are “logical disks” exposed by the logging system

• log_rep connects transac$on state to an on-disk representa$on

• Describes the log’s on-disk layout using many ⟼ primi$ves

SPEC log write (a, v)PRE disk |= log rep (ActiveTxn, start state, old state)

old state |= a 7! v0POST disk |= log rep (ActiveTxn, start state, new state)

new state |= a 7! vCRASH disk |= log rep (ActiveTxn, start state, any state)

Page 69: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Cer$fying procedures

• bmap: return the block address at a given offset for an inode

defbmap(inode,bnum):ifbnum>=NDIRECT:indirect=log_read(inode.blocks[NDIRECT])returnindirect[bnum-NDIRECT]else:returninode.blocks[bnum]

Page 70: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Cer$fying procedures

• bmap: return the block address at a given offset for an inode

PRE POST

CRASH

defbmap(inode,bnum):ifbnum>=NDIRECT:indirect=log_read(inode.blocks[NDIRECT])returnindirect[bnum-NDIRECT]else:returninode.blocks[bnum]

Page 71: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Cer$fying procedures• Follow the control flow graph

PRE POST

CRASH

if

return

log_read returnprocedure bmap()

Page 72: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Cer$fying procedures• Follow the control flow graph

• Need pre/post/crash condi$ons for each called procedure

PRE POST

CRASH

if

return

log_read returnprocedure bmap()

Page 73: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Cer$fying procedures• Follow the control flow graph

• Need pre/post/crash condi$ons for each called procedure

• Chain pre- and postcondi$ons, forming proof obligaIons

PRE POST

CRASH

if

return

log_read returnprocedure bmap()

Page 74: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Cer$fying procedures• Follow the control flow graph

• Need pre/post/crash condi$ons for each called procedure

• Chain pre- and postcondi$ons, forming proof obligaIons

• CHL: combines crash condi$ons, get more proof obligaDons

PRE POST

CRASH

if

return

log_read returnprocedure bmap()

Page 75: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Proof automa$on• CHL follows the CFG, and generates proof obliga$ons

PRE POST

CRASH

if

return

log_read returnprocedure bmap()

Page 76: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Proof automa$on• CHL follows the CFG, and generates proof obliga$ons

• CHL solves trivial obliga$ons automa$cally (common case)

PRE POST

CRASH

if

return

log_read returnprocedure bmap()

Page 77: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Proof automa$on• CHL follows the CFG, and generates proof obliga$ons

• CHL solves trivial obliga$ons automa$cally (common case)

• Remaining proof effort: changing representaIon invariants

• Show that rep invariant holds at entry and exit

PRE POST

CRASH

if

return

log_read returnprocedure bmap()

inodes_rep

inodes_rep

inodes_rep

Page 78: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Specifying an en$re system call (simplified)

SPEC create (dnum, fn)PRE disk |= log rep(NoTxn, start state)

start state |= dir rep(tree) ^9 path, tree[path].node = dnum^fn /2 tree[path]

Page 79: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Specifying an en$re system call (simplified)

POST disk |= log rep(NoTxn, new state)new state |= dir rep(new tree) ^

new tree = tree.update(path, fn, EmptyFile)

SPEC create (dnum, fn)PRE disk |= log rep(NoTxn, start state)

start state |= dir rep(tree) ^9 path, tree[path].node = dnum^fn /2 tree[path]

Page 80: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Specifying an en$re system call (simplified)

POST disk |= log rep(NoTxn, new state)new state |= dir rep(new tree) ^

new tree = tree.update(path, fn, EmptyFile)

CRASH disk |= log rep(NoTxn, start state) _log rep(NoTxn, new state) _log rep(ActiveTxn, start state, any state) _log rep(CommittingTxn, start state, new state)

SPEC create (dnum, fn)PRE disk |= log rep(NoTxn, start state)

start state |= dir rep(tree) ^9 path, tree[path].node = dnum^fn /2 tree[path]

Page 81: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Specifying an en$re system call (simplified)

POST disk |= log rep(NoTxn, new state)new state |= dir rep(new tree) ^

new tree = tree.update(path, fn, EmptyFile)

CRASH disk |= log rep(NoTxn, start state) _log rep(NoTxn, new state) _log rep(ActiveTxn, start state, any state) _log rep(CommittingTxn, start state, new state)

would_recover_either (start_state, new_state)

SPEC create (dnum, fn)PRE disk |= log rep(NoTxn, start state)

start state |= dir rep(tree) ^9 path, tree[path].node = dnum^fn /2 tree[path]

Page 82: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

CRASH disk |= would recover either (start state, new state)

Specifying an en$re system call (simplified)

POST disk |= log rep(NoTxn, new state)new state |= dir rep(new tree) ^

new tree = tree.update(path, fn, EmptyFile)

SPEC create (dnum, fn)PRE disk |= log rep(NoTxn, start state)

start state |= dir rep(tree) ^9 path, tree[path].node = dnum^fn /2 tree[path]

Page 83: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Specifying log recovery

SPEC log recover ()PRE disk |= would recover either (last state, committed state)POST disk |= log rep(NoTxn, last state) _

log rep(NoTxn, committed state)CRASH disk |= would recover either (last state, committed state)

Page 84: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Specifying log recovery

• log_recover() is idempotent:

• Crash condi$on implies its own precondi$on

• OK to run log_recover() again aIer a crash in itself

SPEC log recover ()PRE disk |= would recover either (last state, committed state)POST disk |= log rep(NoTxn, last state) _

log rep(NoTxn, committed state)CRASH disk |= would recover either (last state, committed state)

Page 85: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

procedure bmap()

Recovery execu$on seman$cs

PRE POST

CRASH

if

return

log_read return

log_recover

Page 86: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

procedure bmap()

Recovery execu$on seman$cs

PRE POST

CRASH

if

return

log_read return

log_recover

Page 87: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

procedure bmap()

Recovery execu$on seman$cs

PRE POST

CRASH

if

return

log_read return

log_recover RECOVER

Page 88: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Joint execu$on of two procedures bmap ⨝ log_recover

Recovery execu$on seman$cs

• Whenever bmap (or log_recover) crashes, run log_recover aIer reboot

PRE POST

CRASH

if

return

log_read return

log_recover RECOVER

Page 89: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

End-to-end specifica$on

• create() is atomic, if log_recover() runs aIer every crash

• POST is stronger than RECOVER

SPEC create (drum, fn) on log recover ()PRE disk |= log rep(NoTxn, start state)

start state |= dir rep(tree) ^9 path, tree[path].node = drum^fn /2 tree[path]

POST disk |= log rep(NoTxn, new state)new state |= dir rep(new tree) ^

new tree = tree.update(path, fn, EmptyFile)RECOVER disk |= log rep(NoTxn, start state) _

log rep(NoTxn, new state)

Page 90: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

CHL summary

• Key ideas: crash condiDons and recovery semanDcs

• CHL benefit: enables precise failure specifica$ons • Allows for automa$c chaining of pre/post/crash condi$ons

• Reduces proof burden

• CHL cost: must write crash condi$on for every func$on, loop, etc. • Crash condi$ons are oIen simple (above logging layer)

Page 91: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Outline

• Crash safety • What is the correct behavior aIer a crash?

• Challenge 1: formalizing crashes • Crash Hoare Logic (CHL)

• Challenge 2: incorpora$ng performance op$miza$ons • Disk sequences

• Building a complete file system

• Evalua$on

Page 92: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

FSCQ implements many op$miza$ons

• Group commit • Buffer transac$ons in memory, and flush them in a single batch

• Relax durability guarantee

• Log-bypass writes • File data writes go to the disk (buffer cache) directly

• Log checksums • Checksum log entries to reduce write barriers

• Deferred apply • Apply the log only when the log is full

Page 93: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

disk

Example: group commit

logdata

Page 94: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

disk

Example: group commit

transac;on cache

logdata

1. Each file-system call forms a transac$on, and are buffered in the transacIon cache

Page 95: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

disk

Example: group commit

➡ mkdir(‘d’)➡ create(‘d/a’)➡ rename(‘d/a’,‘d/b’)

, ,transac;on cache

logdata

1. Each file-system call forms a transac$on, and are buffered in the transacIon cache

Page 96: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

disk

Example: group commit

➡ mkdir(‘d’)➡ create(‘d/a’)➡ rename(‘d/a’,‘d/b’)➡ fsync(‘d’)

transac;on cache

logdata

1. Each file-system call forms a transac$on, and are buffered in the transacIon cache

2. fsync() flushes cached transac$ons to the on-disk log in a batch • Preserve order

Page 97: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

➡ mkdir(‘d’)➡ create(‘d/a’)

Challenge: formalizing group commit• Many more crash states (e.g., before or aIer mkdir() )

• On-disk state can be irrelevant to create() itself, but to some previous opera$ons

SPEC create (dnum, fn)PRE disk |= log rep(NoTxn, start state)

start state |= dir rep(tree) ^9 path, tree[path].node = dnum^fn /2 tree[path]

POST disk |= log rep(NoTxn, new state)new state |= dir rep(new tree) ^

new tree = tree.update(path, fn, EmptyFile)CRASH disk |= would recover either (start state, new state)

Page 98: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Specifica$on idea: disk sequences

Page 99: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Specifica$on idea: disk sequences

flushed state

disk0

in-memory transac;ons write-ahead log

txn1 txn2 txnn

Page 100: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

disk sequence

Specifica$on idea: disk sequences

disk0

flushed state

disk0

in-memory transac;ons write-ahead log

txn1 txn2 txnn

Page 101: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

disk sequence

Specifica$on idea: disk sequences• Each (cached) system call adds a new logical disk to the sequence

disk0

flushed state

disk1 diskn

disk0

in-memory transac;ons

latest

write-ahead log

txn1 txn2 txnn

Page 102: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

disk sequence

Specifica$on idea: disk sequences• Each (cached) system call adds a new logical disk to the sequence

• Each logical disk has a corresponding tree

disk0

flushed state

disk1 diskn

disk0

in-memory transac;ons

latest

write-ahead log

tree_rep tree_reptree_rep

txn1 txn2 txnn

Page 103: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

disk sequence

Specifica$on idea: disk sequences• Each (cached) system call adds a new logical disk to the sequence

• Each logical disk has a corresponding tree

• Capture the idea that metadata updates must be ordered

disk0

flushed state

disk1 diskn

disk0

in-memory transac;ons

latest

write-ahead log

tree_rep tree_reptree_rep

txn1 txn2 txnn

Page 104: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

New specifica$on with disk sequence

• Specifica$on isn’t more complicated

SPEC create (dnum, fn)PRE disk |= log rep(NoTxn, disk seq)

disk seq.latest |= dir rep(tree) ^9 path, tree[path].node = dnum^fn /2 tree[path]

POST disk |= log rep(NoTxn, disk seq ++ {new state})new state |= dir rep(new tree) ^

new tree = tree.update(path, fn, EmptyFile)CRASH disk |= would recover any (disk seq ++ {new state})

Page 105: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Specifica$on for fsync on directories

• AIer fsync(), there is only one possible on-disk state (the latest one)

SPEC fsync (dir inum)PRE disk |= log rep(NoTxn, disk seq)

disk seq.latest |= tree rep(tree) ^IsDir(find inum(tree, dir inum))

POST disk |= log rep(NoTxn, {disk seq.latest})CRASH disk |= would recover any(disk seq)

Page 106: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Formaliza$on techniques for op$miza$ons

• Group commit • Disk sequences: captures ordered metadata updates

• Log-bypass writes • Disk relaDons: enforces safety w.r.t. metadata updates

• Log checksums • Checksum model: soundly reasons about hash collision

Page 107: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Outline

• Crash safety • What is the correct behavior aIer a crash?

• Challenge 1: formalizing crashes • Crash Hoare Logic (CHL)

• Challenge 2: incorpora$ng performance op$miza$ons • Disk sequences

• Building a complete file system

• Evalua$on

Page 108: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

FSCQ: building a complete file system • File system design is close

to v6 Unix (+ logging)FSCQ system calls

Directory

Directory tree

Block-level file

Inode

Bitmap allocator

Buffer cache

Write-ahead log

Page 109: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

FSCQ: building a complete file system • File system design is close

to v6 Unix (+ logging)

• Implementa$on aims to reduce proof effort

FSCQ system calls

Directory

Directory tree

Block-level file

Inode

Bitmap allocator

Buffer cache

Write-ahead log

Page 110: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

FSCQ: building a complete file system • File system design is close

to v6 Unix (+ logging)

• Implementa$on aims to reduce proof effort• Many precise internal

abstrac$on layers

• e.g., split File and Inode

FSCQ system calls

Directory

Directory tree

Block-level file

Inode

Bitmap allocator

Buffer cache

Write-ahead log

Block-level file

Inode

Page 111: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

FSCQ: building a complete file system • File system design is close

to v6 Unix (+ logging)

• Implementa$on aims to reduce proof effort• Many precise internal

abstrac$on layers

• e.g., split File and Inode

• Reuse proven components

• e.g., general bitmap allocator

FSCQ system calls

Directory

Directory tree

Block-level file

Inode

Bitmap allocator

Buffer cache

Write-ahead log

Bitmap allocator

Directory tree

Inode

Page 112: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

FSCQ: building a complete file system • File system design is close

to v6 Unix (+ logging)

• Implementa$on aims to reduce proof effort• Many precise internal

abstrac$on layers

• e.g., split File and Inode

• Reuse proven components

• e.g., general bitmap allocator

• Simpler specifica$ons

• e.g., no hard link ⇒ tree spec

FSCQ system calls

Directory

Directory tree

Block-level file

Inode

Bitmap allocator

Buffer cache

Write-ahead log

Directory tree

Page 113: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Evalua$on

• What bugs do FSCQ’s theorems eliminate?

• How much development effort is required for FSCQ?

• How well does FSCQ perform?

Page 114: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Does FSCQ eliminate bugs?

• One data point: once theorems proven, no implementa$on bugs in proven code • Did find some mistakes in spec, as a result of end-to-end checks

• E.g., forgot to specify that extending a file should zero-fill

Page 115: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Does FSCQ eliminate bugs?

• One data point: once theorems proven, no implementa$on bugs in proven code • Did find some mistakes in spec, as a result of end-to-end checks

• E.g., forgot to specify that extending a file should zero-fill

• Systema$c study • Categorize bugs from Linux kernel’s patch history

• Manually examine if FSCQ can eliminate bugs in each category

Page 116: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

FSCQ’s theorems eliminate many bugs Bug category Prevented?

Mistakes in logging logice.g., combining incompa:ble op:miza:ons ✔

Misuse of logging APIe.g., releasing indirect block in two transac:ons ✔

Mistakes in recovery protocol e.g., issuing write barrier in the wrong order ✔

Improper corner-case handling e.g., running out of blocks during rename ✔

Page 117: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

FSCQ’s theorems eliminate many bugs Bug category Prevented?

Mistakes in logging logice.g., combining incompa:ble op:miza:ons ✔

Misuse of logging APIe.g., releasing indirect block in two transac:ons ✔

Mistakes in recovery protocol e.g., issuing write barrier in the wrong order ✔

Improper corner-case handling e.g., running out of blocks during rename ✔

Low-level bugs e.g., double free, integer overflow Some (memory safe)

Returning incorrect error code Some

Page 118: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

FSCQ’s theorems eliminate many bugs Bug category Prevented?

Mistakes in logging logice.g., combining incompa:ble op:miza:ons ✔

Misuse of logging APIe.g., releasing indirect block in two transac:ons ✔

Mistakes in recovery protocol e.g., issuing write barrier in the wrong order ✔

Improper corner-case handling e.g., running out of blocks during rename ✔

Low-level bugs e.g., double free, integer overflow Some (memory safe)

Returning incorrect error code Some

Concurrency Not supported

Security Not supported

Page 119: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Development effort• Total of ~50,000 lines of verified code, specs, and proofs in Coq

• > 50% reusable infrastructure

4%12%

7%5%

21%8%

44%

CHL infrastructureGeneral data structuresWrite-ahead logBuffer cacheInodes and filesDirectoriesTop-level API

Page 120: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Development effort• Total of ~50,000 lines of verified code, specs, and proofs in Coq

• > 50% reusable infrastructure

• Comparison: ext4 has ~60,000 lines of C code (many more features)

4%12%

7%5%

21%8%

44%

CHL infrastructureGeneral data structuresWrite-ahead logBuffer cacheInodes and filesDirectoriesTop-level API

Page 121: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Development effort• Total of ~50,000 lines of verified code, specs, and proofs in Coq

• > 50% reusable infrastructure

• Comparison: ext4 has ~60,000 lines of C code (many more features)

• What’s the cost of adding new features to FSCQ?

4%12%

7%5%

21%8%

44%

CHL infrastructureGeneral data structuresWrite-ahead logBuffer cacheInodes and filesDirectoriesTop-level API

Page 122: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Change effort propor$onal to scope of change

FSCQ system calls

Directory

Directory tree

Block-level file

Inode

Bitmap allocator

Buffer cache

Write-ahead log

Page 123: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Change effort propor$onal to scope of change

• Indirect blocks: • + 1,500 lines in Inode

FSCQ system calls

Directory

Directory tree

Block-level file

Inode

Bitmap allocator

Buffer cache

Write-ahead log

Inode

Page 124: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Change effort propor$onal to scope of change

• Indirect blocks: • + 1,500 lines in Inode

• Write-back buffer cache: • + 2300 lines beneath log

~ 600 lines in rest of FSCQ

FSCQ system calls

Directory

Directory tree

Block-level file

Inode

Bitmap allocator

Buffer cache

Write-ahead log

Buffer cache

Page 125: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Change effort propor$onal to scope of change

• Indirect blocks: • + 1,500 lines in Inode

• Write-back buffer cache: • + 2300 lines beneath log

~ 600 lines in rest of FSCQ

• Group commit: • + 1800 lines in Log

~ 100 lines in rest of FSCQ

FSCQ system calls

Directory

Directory tree

Block-level file

Inode

Bitmap allocator

Buffer cache

Write-ahead logWrite-ahead log

Page 126: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Change effort propor$onal to scope of change

• Indirect blocks: • + 1,500 lines in Inode

• Write-back buffer cache: • + 2300 lines beneath log

~ 600 lines in rest of FSCQ

• Group commit: • + 1800 lines in Log

~ 100 lines in rest of FSCQ

• Changed lines include code, specs and proofs

FSCQ system calls

Directory

Directory tree

Block-level file

Inode

Bitmap allocator

Buffer cache

Write-ahead log

Page 127: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Performance comparison

• File-system-intensive workload • LFS “largefile” benchmark

• mailbench, a qmail-like mail server

• Compare with ext4 (non-cer$fied) in default mode • Mount op$on: async,data=ordered

• Use FUSE to forward and serialize requests (disable concurrency)

• Running on an hard disk on a desktop • Quad-core Intel i7-980X 3.33 GHz / 24 GB / Hitachi HDS721010CLA332

• Linux 3.11 / GHC 8.0.1 / all file systems run on a separate par$$on

Page 128: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

FSCQ Performance

• FSCQ’s CPU overhead is high

• FSCQ’s I/O performance is on par with ext4

Runn

ing

Tim

e (s

econ

ds)

0

5

10

15

20

25

30

largefile mailbench

FSCQ ext4

Page 129: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

FSCQ Performance

• FSCQ’s CPU overhead is high

• FSCQ’s I/O performance is on par with ext4

Runn

ing

Tim

e (s

econ

ds)

0

5

10

15

20

25

30

largefile mailbench

FSCQ ext4 Number of disk I/Os per opera;on

largefile mailbench

write sync write sync

FSCQ 1.0 1.0 50.0 9.8

ext4 1.0 1.0 38.0 12.3

Page 130: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Future direc$ons

• Extrac$ng to na$ve code • Reduce both CPU overhead and TCB

• Cer$fying crash-safe applica$ons • Use FSCQ’s top-level spec to cer$fy a mail server or a KV store

• Suppor$ng concurrency • Run FSCQ in a mul$-user environment

• Exploit both I/O concurrency and parallelism

Page 131: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Conclusion

• CHL helps specify and prove crash safety • Crash condi$ons

• Recovery execu$on seman$cs

• FSCQ: first cer$fied crash-safe file system • Precise specifica$on in presence of crashes

• I/O performance on par with Linux ext4

• Moderate development effort

Page 132: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the

Conclusion

• CHL helps specify and prove crash safety • Crash condi$ons

• Recovery execu$on seman$cs

• FSCQ: first cer$fied crash-safe file system • Precise specifica$on in presence of crashes

• I/O performance on par with Linux ext4

• Moderate development effort

h|ps://github.com/mit-pdos/fscq-impl