File Systems Top to Bottom and Back [email protected] LISA’13 Washington, DC November 3, 2013

S8 File Systems Tutorial USENIX LISA13


DESCRIPTION

Slides from the S8 File Systems Tutorial at USENIX LISA'13 conference in Washington, DC. The topic covers ext4, btrfs, and ZFS with an emphasis on Linux implementations.


Page 1: S8 File Systems Tutorial USENIX LISA13

File Systems: Top to Bottom and Back

[email protected]

LISA’13 Washington, DC

November 3, 2013

Page 2: S8 File Systems Tutorial USENIX LISA13

File Systems: Top to Bottom and Back — USENIX LISA’13, November 3, 2013

Agenda
• Introduction
• Installation
• Creation and Destruction
• Backup and Restore
• Migration
• Settings and Options
• Performance and Tuning

Page 3: S8 File Systems Tutorial USENIX LISA13

Introduction

Page 4: S8 File Systems Tutorial USENIX LISA13

File Systems

• Today’s discussion: emphasis on Linux
  • ext4, with a few comments on ext3
  • btrfs
  • ZFS

• Not in scope (maybe next year?)
  • ReFS
  • HFS+

ext4

ZFS

btrfs

File system discussed on slide

Page 5: S8 File Systems Tutorial USENIX LISA13

ext4 Highlights
• ext3 was limited
  • 16 TB filesystem size (32-bit block numbers)
  • 32k limit on subdirectories
  • Performance limitations
• ext4 is the natural successor
  • Easy migration from ext3
  • Replaces indirect blocks with extents
  • > 16 TB filesystem size
  • Preallocation
  • Journal checksums

• Now default on many Linux distros

ext4

Page 6: S8 File Systems Tutorial USENIX LISA13

ZFS Highlights
• Figure out why storage has become so complicated
• Blow away 20+ years of obsolete assumptions
• Sun had to replace UFS
• Opportunity to design an integrated system from scratch
• Widely ported: Linux, FreeBSD, OS X
• Builtin RAID
• Checksums
• Large scale (256 ZB)

ZFS

Page 7: S8 File Systems Tutorial USENIX LISA13

btrfs
• New copy-on-write file system
• Pooled storage model
• Snapshots
• Checksums
• Large scale (16 EB)
• Builtin RAID
• Clever in-place migration from ext3

btrfs

Page 8: S8 File Systems Tutorial USENIX LISA13

Pooled Storage Model
• Old school
  • 1 disk means:
    • 1 file system
    • 1 directory structure (directory tree)
  • File systems didn’t change when virtual disks (e.g. RAID) arrived
    • OK, so we could partition them... an ugly solution
• New school
  • Combine storage devices into a pool
  • Allow many file systems per pool

ZFS

btrfs

ReFS

Page 9: S8 File Systems Tutorial USENIX LISA13

Sysadmin’s View of Pools

[Diagram: a pool holds configuration information plus many datasets (file systems and volumes).]

ZFS

btrfs

Page 10: S8 File Systems Tutorial USENIX LISA13

Blocks and Extents
• Early file systems were block-based
  • ext3, UFS, FAT
  • Data blocks are fixed sizes
  • Difficult to scale due to indirection levels and allocation algorithms
• Extents solve many indirection issues
  • An extent is a contiguous area of storage reserved for a file
  • Data blocks are variable sizes
  • ext4, btrfs, ZFS, XFS, NTFS, VxFS

ext4

ZFS

btrfs

Page 11: S8 File Systems Tutorial USENIX LISA13

Blocks and Extents

[Diagram. Block-based: metadata is a list of direct pointers to fixed-size data blocks. Extent-based: metadata is a list of extent structures (offset + length) referencing variable-size runs of blocks.]

ext4

ZFS

btrfs
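The space savings of extents can be seen with a little arithmetic. A toy sketch (hypothetical sizes, not any specific on-disk format): a 1 GiB file needs a quarter-million 4 KiB block pointers, but only a handful of (offset, length) extent records if the allocator found contiguous runs.

```python
# Sketch: why extents shrink metadata. Hypothetical sizes, not any
# specific on-disk format.
FILE_SIZE = 1 << 30          # 1 GiB file
BLOCK_SIZE = 4096            # fixed-size blocks

# Block-based: one pointer per 4 KiB block
block_pointers = FILE_SIZE // BLOCK_SIZE

# Extent-based: if the allocator found 8 contiguous runs,
# metadata is just 8 (offset, length) pairs
extents = [(i * (FILE_SIZE // 8), FILE_SIZE // 8) for i in range(8)]

print(block_pointers)        # 262144 pointers
print(len(extents))          # 8 extent records
assert sum(length for _, length in extents) == FILE_SIZE
```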

Page 12: S8 File Systems Tutorial USENIX LISA13


Scalability

• Block-based: go with indirect blocks
  • Really just pointers to pointers
  • Gets ugly at triple indirection
  • Depth is a function of data size and block size
• Extent-based: grow trees
  • B-trees are popular
    • ext4, for more than 3 levels
    • btrfs
  • ZFS uses a Merkle tree

Problem: what happens when we need more metadata?

ext4

ZFS

btrfs
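The triple-indirection ceiling is easy to compute. A sketch, assuming an ext3-like layout with 12 direct pointers, 4 KiB blocks, and 4-byte block numbers (illustrative parameters, not quoted from the deck):

```python
# Max file size reachable at each indirection level, assuming an
# ext3-like layout: 12 direct pointers, 4 KiB blocks, 4-byte block numbers.
BLOCK = 4096
PTRS = BLOCK // 4            # pointers per indirect block = 1024

direct = 12 * BLOCK          # 48 KiB via direct pointers
single = PTRS * BLOCK        # 4 MiB via one indirect block
double = PTRS**2 * BLOCK     # 4 GiB via double indirection
triple = PTRS**3 * BLOCK     # 4 TiB via triple indirection

total = direct + single + double + triple
print(total / 2**40)         # ~4 TiB: the wall block-based designs hit
```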

Page 13: S8 File Systems Tutorial USENIX LISA13


Indirect Blocks

[Diagram: metadata holds direct pointers to data blocks, then indirect and double-indirect blocks of pointers for larger files.]
Problem 1: big files use lots of indirection
Problem 2: metadata size fixed at creation

ext3, UFS

Page 14: S8 File Systems Tutorial USENIX LISA13


Treed Metadata

• Trees can be large, yet efficiently searched and modified
• Enables copy-on-write (COW)
• Lots of good computer science here!

[Diagram: a root block points into a tree of metadata above the data blocks.]

ext4

ZFS

btrfs

Page 15: S8 File Systems Tutorial USENIX LISA13


Trees Allow Copy-on-Write


1. Initial block tree 2. COW some data

3. COW metadata 4. Update Uberblocks & free

ZFS

btrfs
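The four steps above can be sketched in a few lines. This is a toy model, not btrfs or ZFS code: an update rewrites only the changed leaf and the path up to a new root, so the old root keeps pointing at a fully consistent snapshot of the tree.

```python
def cow_set(node, path, value):
    """Copy-on-write update of a nested-tuple tree; returns a new root.
    Nodes on the path to the change are copied; everything else is shared."""
    if not path:
        return value
    head, rest = path[0], path[1:]
    children = list(node)
    children[head] = cow_set(children[head], rest, value)
    return tuple(children)              # new node; the old one is untouched

old_root = ((1, 2), (3, 4))
new_root = cow_set(old_root, (1, 0), 99)
print(new_root)                         # ((1, 2), (99, 4))
print(old_root)                         # ((1, 2), (3, 4)): the snapshot survives
assert new_root[0] is old_root[0]       # unchanged subtree is shared, not copied
```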

Page 16: S8 File Systems Tutorial USENIX LISA13


fsck
Problem: how do we know the metadata is correct?
• Keep redundant copies
• But what if the copies don’t agree?
1. File system check reconciles metadata inconsistencies
  • fsck (ext[234], btrfs, UFS), chkdsk (FAT), etc.
  • Repairs problems that are known to occur (!)
  • Does not repair data (!)
2. Build a transactional system with atomic updates
  • Databases (MySQL, Oracle, etc.)
  • ZFS


ext4

ZFS

btrfs
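The Merkle-tree idea mentioned earlier is what makes option 2 work: each parent stores a checksum of its children, so corruption is detected on read without a whole-filesystem fsck. A toy sketch of the principle, not the ZFS on-disk format:

```python
# Parent-stored checksums catch silent corruption on read.
import hashlib

def h(b: bytes) -> str:
    return hashlib.sha256(b).hexdigest()

blocks = [b"data0", b"data1", b"data2", b"data3"]
root = h("".join(h(b) for b in blocks).encode())   # checksum-of-checksums

def verify(blocks):
    """Recompute the tree and compare against the trusted root."""
    return h("".join(h(b) for b in blocks).encode()) == root

assert verify(blocks)
blocks[2] = b"dataX"            # silent corruption of one block
assert not verify(blocks)       # caught by the checksum tree, no fsck needed
```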

Page 17: S8 File Systems Tutorial USENIX LISA13

Installation

Page 18: S8 File Systems Tutorial USENIX LISA13


Ubuntu 12.04.3
• ext4 = default root file system
• btrfs v0.19 installed by default
• ZFS:
1. Install python-software-properties
   apt-get install python-software-properties
2. Add ZFSonLinux repo
   apt-add-repository --yes ppa:zfs-native/stable
   apt-get update
3. Install ZFS package
   apt-get install debootstrap ubuntu-zfs
4. Verify
   modprobe -l zfs
   dmesg | grep ZFS:

ext4

ZFS

btrfs

Page 19: S8 File Systems Tutorial USENIX LISA13

Fedora 19
• ext4 = default root file system
• btrfs v0.20-rc1 installed by default
• ZFS:
1. Update to latest package versions
2. Add ZFSonLinux repo
   yum localinstall --nogpgcheck http://archive.zfsonlinux.org/fedora/zfs-release-1-2$(rpm -E %dist).noarch.rpm
3. Install ZFS package
   yum install zfs
4. Verify
   modprobe -l zfs
   dmesg | grep ZFS:


ext4

ZFS

btrfs

Beware of word wrap

Page 20: S8 File Systems Tutorial USENIX LISA13

Creation and Destruction

Page 21: S8 File Systems Tutorial USENIX LISA13

But first... a brief discussion of RAID

Page 22: S8 File Systems Tutorial USENIX LISA13


RAID Basics
• Disks fail. Sometimes they lose data. Sometimes they completely die. Get over it.
• RAID = Redundant Array of Inexpensive Disks
• RAID = Redundant Array of Independent Disks
• Key word: Redundant
• Redundancy is good. More redundancy is better.
• Everything else fails, too. You’re over it by now, right?

Page 23: S8 File Systems Tutorial USENIX LISA13


RAID-0 or Striping
• RAID-0
  • SNIA definition: fixed-length sequences of virtual disk data addresses are mapped to sequences of member disk addresses in a regular rotating pattern
  • Good for space and performance
  • Bad for dependability
• ZFS Dynamic Stripe
  • Data is dynamically mapped to member disks
  • No fixed-length sequences
  • Allocate up to ~1 MByte/vdev before changing vdev
  • Good combination of the concatenation feature with RAID-0 performance

ZFS

btrfs

Page 24: S8 File Systems Tutorial USENIX LISA13

RAID-0 Example

[Diagram: a 2816 kByte write placed two ways. RAID-0: column size = 128 kBytes, stripe width = 384 kBytes. ZFS dynamic stripe: recordsize = 128 kBytes.]

ZFS

btrfs
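The fixed "regular rotating pattern" from the SNIA definition is simple address arithmetic. A sketch using the example's geometry (128 kByte columns, 3 disks, so a 384 kByte stripe width); the function name is mine:

```python
# Fixed-stripe RAID-0 address mapping: rotate 128 kByte columns
# across 3 member disks.
COL = 128 * 1024
DISKS = 3

def raid0_map(offset):
    """Map a virtual byte offset to (disk, byte offset on that disk)."""
    col = offset // COL                    # which column overall
    disk = col % DISKS                     # rotate across members
    row = col // DISKS                     # stripe (row) number
    return disk, row * COL + offset % COL

print(raid0_map(0))            # (0, 0)
print(raid0_map(128 * 1024))   # (1, 0): next column lands on the next disk
print(raid0_map(384 * 1024))   # (0, 131072): wraps back to disk 0, second row
```

A ZFS dynamic stripe has no such closed-form map: placement is decided at allocation time, which is what allows mixed-size vdevs to be concatenated.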

Page 25: S8 File Systems Tutorial USENIX LISA13


RAID-1 or Mirroring
• Straightforward: put N copies of the data on N disks
• Good for read performance and dependability
• Bad for space
• Arbitration: btrfs and ZFS do not blindly trust either side of a mirror
  • Most recent, correct view of data wins
  • Checksums validate data

ZFS

btrfs

Page 26: S8 File Systems Tutorial USENIX LISA13

Traditional Mirrors

[Diagram: the file system does a bad read and cannot tell. If it is a metadata block, the FS panics and a disk rebuild follows; otherwise we just get back bad data.]

Page 27: S8 File Systems Tutorial USENIX LISA13


Checksums for Mirrors
• What if a disk is (mostly) OK, but the data became corrupted?
• btrfs and ZFS improve dependability by checksumming data and storing the checksums in metadata

ZFS

btrfs
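The arbitration idea can be shown in miniature. A toy sketch, not btrfs or ZFS code: neither side of the mirror is trusted blindly; the copy whose checksum matches is served (and could then be used to repair the other side).

```python
# Toy checksum-arbitrated mirror read.
import zlib

def write_mirror(data):
    """Store two copies plus a checksum kept in 'metadata'."""
    return {"csum": zlib.crc32(data), "sides": [bytearray(data), bytearray(data)]}

def read_mirror(m):
    for side in m["sides"]:
        if zlib.crc32(bytes(side)) == m["csum"]:
            return bytes(side)          # first side that checks out wins
    raise IOError("both mirror sides corrupt")

m = write_mirror(b"important")
m["sides"][0][0] ^= 0xFF                # silently corrupt side A
print(read_mirror(m))                   # b'important', served from side B
```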

Page 28: S8 File Systems Tutorial USENIX LISA13


RAID-5 and RAIDZ
• N+1 redundancy
• Good for space and dependability
• Bad for performance
• RAID-5 (btrfs)
  • Parity check data is distributed across the RAID array's disks
  • Must read/modify/write when data is smaller than stripe width
• RAIDZ (ZFS)
  • Dynamic data placement
  • Parity added as needed
  • Writes are full-stripe writes
  • No read/modify/write (avoids the write hole)

ZFS

btrfs
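The single-parity case is plain XOR, which is all a sketch needs to show how any one lost column is rebuilt from the survivors:

```python
# N+1 XOR parity, the heart of RAID-5/RAIDZ single parity.
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

d = [b"AAAA", b"BBBB", b"CCCC"]        # three data columns
p = xor(xor(d[0], d[1]), d[2])         # parity column

# The disk holding d[1] dies; XOR the survivors with parity to rebuild it
rebuilt = xor(xor(d[0], d[2]), p)
assert rebuilt == b"BBBB"
```

The read/modify/write penalty follows directly: updating one column of an existing stripe forces a read of the old data and parity to recompute p, which full-stripe RAIDZ writes avoid.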

Page 29: S8 File Systems Tutorial USENIX LISA13


RAID-5 and RAIDZ

RAID-5:
DiskA  DiskB  DiskC  DiskD  DiskE
D0:0   D0:1   D0:2   D0:3   P0
P1     D1:0   D1:1   D1:2   D1:3
D2:3   P2     D2:0   D2:1   D2:2
D3:2   D3:3   P3     D3:0   D3:1

RAIDZ:
DiskA  DiskB  DiskC  DiskD  DiskE
P0     D0:0   D0:1   D0:2   D0:3
P1     D1:0   D1:1   P2:0   D2:0
D2:1   D2:2   D2:3   Gap    P2:1
D2:4   D2:5   P3     D3:0

ZFS

btrfs

Page 30: S8 File Systems Tutorial USENIX LISA13


RAID-6, RAIDZ2, RAIDZ3
• Adding more parity
  • Parity 1: XOR
  • Parity 2: another Reed-Solomon syndrome
  • Parity 3: yet another Reed-Solomon syndrome
• Double parity: N+2
  • RAID-6 (btrfs)
  • RAIDZ2 (ZFS)
• Triple parity: N+3
  • RAIDZ3 (ZFS)

ZFS

btrfs

Page 31: S8 File Systems Tutorial USENIX LISA13


Dependability vs Space

[Chart: MTTDL versus usable space for mirror and parity configurations.]
For this analysis, RAIDZ1/2 and RAID-5/6 are equivalent

ZFS

btrfs

Dependability model metric: MTTDL = mean time to data loss (bigger is better)
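The usual first-order MTTDL models make the ordering in the chart concrete. A sketch with hypothetical numbers (the MTBF and MTTR values below are illustrative, not from the deck):

```python
# First-order MTTDL models for N disks, MTBF hours each, MTTR hours to rebuild.
MTBF, MTTR, N = 1_000_000, 24, 5

raid0 = MTBF / N                                 # any single failure loses data
raid5 = MTBF**2 / (N * (N - 1) * MTTR)           # a second failure during rebuild
raid6 = MTBF**3 / (N * (N - 1) * (N - 2) * MTTR**2)

for name, hours in [("RAID-0", raid0), ("RAID-5", raid5), ("RAID-6", raid6)]:
    print(f"{name}: {hours / 8760:.1e} years")   # bigger is better
```

Each extra parity device multiplies MTTDL by roughly MTBF/MTTR, which is why the chart separates the families so dramatically.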

Page 32: S8 File Systems Tutorial USENIX LISA13

We now return you to your regularly scheduled program:

Creation and Destruction

Page 33: S8 File Systems Tutorial USENIX LISA13


Create a Simple Pool
1. Determine the name of an unused disk
   • /dev/sd* or /dev/hd*
   • /dev/disk/by-id
   • /dev/disk/by-path
   • /dev/disk/by-vdev (ZFS)
2. Create a simple pool
   • btrfs:
     mkfs.btrfs -m single /dev/sdb
   • ZFS:
     zpool create zwimming /dev/sdd
     Note: might need the -f flag to create an EFI label
3. Woohoo!

ZFS

btrfs

Page 34: S8 File Systems Tutorial USENIX LISA13


Verify Pool Status
• btrfs:
  btrfs filesystem show
• ZFS:
  zpool status

ZFS

btrfs

Page 35: S8 File Systems Tutorial USENIX LISA13


Destroy Pool
• btrfs
  • Unmount all btrfs file systems
• ZFS:
  zpool destroy zwimming
  • Unmounts file systems and volumes
  • Exports pool
  • Marks pool as destroyed
• Walk away...
  • Until overwritten, data is still OK and can be imported again
  • To see destroyed ZFS pools:
    zpool import -D

ZFS

btrfs

Page 36: S8 File Systems Tutorial USENIX LISA13


Create Mirrored Pool
1. Determine the names of two unused disks
2. Create a mirrored pool
   • btrfs:
     mkfs.btrfs -d raid1 /dev/sdb /dev/sdc
     (-d specifies redundancy for data; metadata is redundant by default)
   • ZFS:
     zpool create zwimming mirror /dev/sdd /dev/sde
3. Woohoo!
4. Verify

ZFS

btrfs

Page 37: S8 File Systems Tutorial USENIX LISA13

Creating Filesystems

Page 38: S8 File Systems Tutorial USENIX LISA13


Create & Mount File System
• Make some mount points for this example:
  mkdir /mnt.ext4
  mkdir /mnt.btrfs
• ext4:
  mkfs.ext4 /dev/sdf
  mount /dev/sdf /mnt.ext4
• btrfs:
  mount /dev/sdb /mnt.btrfs
• ZFS
  • zpool create already made a file system and mounted it at /zwimming
• Verify...

ext4

ZFS

btrfs

Page 39: S8 File Systems Tutorial USENIX LISA13

But first... a brief introduction to accounting principles

Page 40: S8 File Systems Tutorial USENIX LISA13


Verify Mounted File Systems
• df is a handy tool to verify mounted file systems
• WAT?
• Pool space accounting isn’t like traditional filesystem space accounting
• NB: the raw disk has 1,073,741,824 bytes

ext4

ZFS

btrfs

root@ubuntu:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
...
/dev/sdf        976M  1.3M  924M   1% /mnt.ext4
zwimming        976M     0  976M   0% /zwimming
/dev/sdb        1.0G   56K  894M   1% /mnt.btrfs

Page 41: S8 File Systems Tutorial USENIX LISA13


Again!
• Try again with our mirrored pool examples
• WAT, WAT, WAT?
• The accounting is correct; your understanding of the accounting might need a little bit of help
• Adding RAID-5, compression, copies, and deduplication makes accounting very confusing

ext4

ZFS

btrfs

root@ubuntu:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
...
/dev/sdf        976M  1.3M  924M   1% /mnt.ext4
zwimming        976M     0  976M   0% /zwimming
/dev/sdc        2.0G   56K  1.8G   1% /mnt.btrfs

Page 42: S8 File Systems Tutorial USENIX LISA13


Accounting Sanity
• A full explanation of the accounting for pools is an opportunity for aspiring writers!
• A more pragmatic view:
  • The accounting is correct
  • You can tell how much space is unallocated (free), but you can’t tell how much data you can put into it, until you do so

ZFS

btrfs

Page 43: S8 File Systems Tutorial USENIX LISA13

btrfs subvolumes and ZFS filesystems

Page 44: S8 File Systems Tutorial USENIX LISA13


One Pool, Many File Systems
• Good idea: create new file systems when you want a new policy
  • readonly, quota, snapshots/clones, etc.
• Act like directories, but slightly heavier

ZFS

btrfs

[Diagram: a pool holds configuration information plus many datasets (file systems and volumes).]

Page 45: S8 File Systems Tutorial USENIX LISA13


Create New File Systems
• Context: new file system in an existing pool
• btrfs:
  btrfs subvolume create /mnt.btrfs/sv1
• ZFS:
  zfs create zwimming/fs1
• Verify


ZFS

btrfs

root@ubuntu:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
...
/dev/sdf        976M  1.3M  924M   1% /mnt.ext4
zwimming        976M  128K  976M   1% /zwimming
/dev/sdb        1.0G   64K  894M   1% /mnt.btrfs
zwimming/fs1    976M  128K  976M   1% /zwimming/fs1
root@ubuntu:~# ls -l /mnt.btrfs
total 0
drwxr-xr-x 1 root root 0 Nov 2 20:30 sv1
root@ubuntu:~# ls -l /zwimming
total 2
drwxr-xr-x 2 root root 2 Nov 2 20:29 fs1
root@ubuntu:~# btrfs subvolume list /mnt.btrfs
ID 256 top level 5 path sv1

Page 46: S8 File Systems Tutorial USENIX LISA13


Nesting
• It is tempting to create deep, nested multiple file system structures
• But it increases management complexity
• Good idea: use a shallow file system hierarchy

ZFS

btrfs

Page 47: S8 File Systems Tutorial USENIX LISA13

Backup and Restore

Page 48: S8 File Systems Tutorial USENIX LISA13


Traditional Tools
• For file systems, the traditional tools work as you expect
  • cp, scp, tar, rsync, zip, ...
  • For ZFS volumes, dd
• But those are boring; let’s talk about snapshots and replication

ext4

ZFS

btrfs

Page 49: S8 File Systems Tutorial USENIX LISA13


Snapshots

• Create a snapshot by not free'ing COWed blocks
• Snapshot creation is fast and easy
• Number of snapshots determined by use; no hardwired limit
• Recursive snapshots also possible in ZFS
• Terminology: a btrfs “writable snapshot” is like a ZFS “clone”

ZFS

btrfs

[Diagram: current tree root and snapshot tree root sharing all unchanged blocks.]

Page 50: S8 File Systems Tutorial USENIX LISA13


Create Read-only Snapshot
• btrfs
  • btrfs v0.20-rc1 or later
  • Read-only needed for btrfs send
  btrfs subvolume snapshot -r /mnt.btrfs/sv1 \
    /mnt.btrfs/sv1_ro
• ZFS:
  zfs snapshot zwimming@snapme

ZFS

btrfs

Page 51: S8 File Systems Tutorial USENIX LISA13


Create Writable Snapshot
• btrfs:
  btrfs subvolume snapshot /mnt.btrfs/sv1 /mnt.btrfs/sv1_snap
• ZFS:
  zfs snapshot zwimming@snapme
  zfs clone zwimming@snapme zwimming/cloneme

ZFS

btrfs

root@ubuntu:~# btrfs subvolume snapshot /mnt.btrfs/sv1 /mnt.btrfs/sv1_snap
Create a snapshot of '/mnt.btrfs/sv1' in '/mnt.btrfs/sv1_snap'
root@ubuntu:~# btrfs subvolume list /mnt.btrfs
ID 256 top level 5 path sv1
ID 257 top level 5 path sv1_snap
root@ubuntu:~# zfs snapshot zwimming@snapme
root@ubuntu:~# zfs list -t snapshot
NAME             USED  AVAIL  REFER  MOUNTPOINT
zwimming@snapme     0      -    31K  -
root@ubuntu:~# ls -l /zwimming/.zfs/snapshot
total 0
dr-xr-xr-x 1 root root 0 Nov 2 21:02 snapme
root@ubuntu:~# zfs clone zwimming@snapme zwimming/cloneme
root@ubuntu:~# df -h
Filesystem        Size  Used Avail Use% Mounted on
...
zwimming          976M     0  976M   0% /zwimming
zwimming/cloneme  976M     0  976M   0% /zwimming/cloneme

Page 52: S8 File Systems Tutorial USENIX LISA13


btrfs Send and Receive
• New feature in v0.20-rc1
• Operates on read-only snapshots
  btrfs subvolume snapshot -r /mnt.btrfs/sv1 \
    /mnt.btrfs/sv1_ro
• Note: sent data must be on disk; either wait or use the sync command
• Send to stdout, receive from stdin

btrfs

root# btrfs subvolume snapshot -r /mnt.btrfs/sv1 /mnt.btrfs/sv1_ro
root# sync
root# btrfs subvolume create /mnt.btrfs/backup
root# btrfs send /mnt.btrfs/sv1_ro | btrfs receive /mnt.btrfs/backup
At subvol /mnt.btrfs/sv1_ro
At subvol sv1_ro
root# btrfs subvolume list /mnt.btrfs
ID 256 gen 8 top level 5 path sv1
ID 257 gen 8 top level 5 path sv1_ro
ID 258 gen 13 top level 5 path backup
ID 259 gen 14 top level 5 path backup/svr_ro

Page 53: S8 File Systems Tutorial USENIX LISA13


ZFS Send and Receive
• Works the same on file systems as volumes (datasets)
• Send a snapshot as a stream to stdout
  • Whole: single snapshot
  • Incremental: difference between two snapshots
• Receive a snapshot into a dataset
  • Whole: create a new dataset
  • Incremental: add to an existing, common snapshot
• Each snapshot has a GUID and creation time property
  • Good idea: avoid putting time in the snapshot name; use the properties for automation
• Example:
  zfs send zwimming@snap | zfs receive zbackup/zwimming

ZFS

Page 54: S8 File Systems Tutorial USENIX LISA13

Migration

Page 55: S8 File Systems Tutorial USENIX LISA13


Forward Migration
• But first... backup your data!
• And second... test your backup
• ext3 ➯ ext4
• ext3 or ext4 ➯ btrfs
  • Cleverly treats existing ext3 or ext4 data as a read-only snapshot
• btrfs seed devices
  • Read-only file system as the basis of a new file system
  • All writes are COWed into the new file system
• ZFS is fundamentally different
  • Use traditional copies: cp, tar, rsync, etc.

ext4

ZFS

btrfs

Page 56: S8 File Systems Tutorial USENIX LISA13


Reverting Migration
• Once you start to use ext4 features or add data to btrfs, the old ext3 filesystem doesn’t see the new data
  • Seems to be unallocated space
  • Reverting loses the changes made after migration
• But first... backup your data!
• And second... test your backup

ext4

btrfs

Page 57: S8 File Systems Tutorial USENIX LISA13

Settings and Options

Page 58: S8 File Systems Tutorial USENIX LISA13


ext4 Options
• Extends the function set available to ext2 and ext3
• Creation options
  • uninit_bg creates the file system without initializing all of the block groups
    • speeds filesystem creation
    • can speed fsck
• Mount options of note
  • barriers enabled by default
  • max_batch_time for coalescing synchronous writes
    • Adjusts dynamically by observing commit time
    • Use with caution; know your workload
  • discard/nodiscard for enabling TRIM for SSDs
    • Is TRIM actually useful? The jury is still out...

ext4

Page 59: S8 File Systems Tutorial USENIX LISA13


btrfs Options
• Mount options
  • degraded: useful when mounting redundant pools with broken or missing devices
  • compress: select zlib, lzo, or no compression
    • Note: by default, only data that actually compresses is written compressed
  • discard: enables TRIM (see ext4 option)
  • fatal_errors: choose the error-failure policy

btrfs

Page 60: S8 File Systems Tutorial USENIX LISA13


ZFS Properties
• Recall that ZFS doesn’t use fstab or mkfs
• Properties are stored in metadata for the pool or dataset
• By default, properties are inherited
• Some properties are common to all datasets, but a specific dataset type may have additional properties
• Easily set or retrieved via scripts
• Can be set at creation time, or later (restrictions apply)
• In general, properties affect future file system activity

ZFS

Page 61: S8 File Systems Tutorial USENIX LISA13


Managing ZFS Properties
• Pool properties:
  zpool get all poolname
  zpool get propertyname poolname
  zpool set propertyname=value poolname
• Dataset properties:
  zfs get all dataset
  zfs get propertyname [dataset]
  zfs set propertyname=value dataset

ZFS

Page 62: S8 File Systems Tutorial USENIX LISA13


User-defined Properties
• Useful for adding metadata to datasets
  • Limited to the description property on pools
  • Recall each pool has a dataset of the same name
• Names
  • Must include a colon ':'
  • Can contain lower-case alphanumerics or '+' '.' '_'
  • Max length = 256 characters
  • By convention, module:property
    • com.sun:auto-snapshot
• Values
  • Max length = 1024 characters
• Example:
  com.richardelling:important_files=true

ZFS
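The naming rules above are easy to encode for automation scripts. A sketch (my own helper, not a ZFS API; it also admits '-', since the conventional com.sun:auto-snapshot example uses one):

```python
# Validate a ZFS user-property name against the rules on this slide:
# a colon is mandatory, lower-case alphanumerics plus '+' '.' '_' ':'
# (and '-', as in com.sun:auto-snapshot), max 256 characters.
import re

def valid_user_prop(name: str) -> bool:
    return (len(name) <= 256
            and ":" in name
            and re.fullmatch(r"[a-z0-9+._:-]+", name) is not None)

assert valid_user_prop("com.sun:auto-snapshot")
assert valid_user_prop("com.richardelling:important_files")
assert not valid_user_prop("noColonHere")      # colon is mandatory
assert not valid_user_prop("Upper:Case")       # upper case rejected
```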

Page 63: S8 File Systems Tutorial USENIX LISA13


ZFS Pool Properties

Property     Change?   Brief Description
altroot                Alternate root directory (a la chroot)
autoexpand             Policy for expanding when vdev size changes
autoreplace            vdev replacement policy
available    readonly  Available storage space
bootfs                 Default bootable dataset for root pool
cachefile              Cache file to use other than /etc/zfs/zpool.cache
capacity     readonly  Percent of pool space used
dedupditto             Automatic copies for deduped data
dedupratio   readonly  Deduplication efficiency metric
delegation             Master pool delegation switch
failmode               Catastrophic pool failure policy

ZFS

Page 64: S8 File Systems Tutorial USENIX LISA13


More ZFS Pool Properties

Property Change? Brief Description

feature@async_destroy Reduce pain of dataset destroy workload

feature@empty_bpobj Improves performance for lots of snapshots

feature@lz4_compress lz4 compression

guid readonly Unique identifier

health readonly Current health of the pool

listsnapshots zfs list policy

size readonly Total size of pool

used readonly Amount of space used

version readonly Current on-disk version


ZFS

Page 65: S8 File Systems Tutorial USENIX LISA13


Common Dataset Properties

Property       Change?   Brief Description
available      readonly  Space available to dataset & children
checksum                 Checksum algorithm
compression              Compression algorithm
compressratio  readonly  Compression ratio: logical size to referenced physical
copies                   Number of copies of user data
creation       readonly  Dataset creation time
dedup                    Deduplication policy
logbias                  Separate log write policy
mlslabel                 Multilevel security label
origin         readonly  For clones, origin snapshot

ZFS

Page 66: S8 File Systems Tutorial USENIX LISA13


More Dataset Properties

Property        Change?   Brief Description
primarycache              ARC caching policy
readonly                  Is dataset in readonly mode?
referenced      readonly  Size of data accessible by this dataset
refreservation            Minimum space guaranteed to a dataset, excluding descendants (snapshots & clones)
reservation               Minimum space guaranteed to dataset, including descendants
secondarycache            L2ARC caching policy
sync                      Synchronous write policy
type            readonly  Type of dataset (filesystem, snapshot, volume)

ZFS

Page 67: S8 File Systems Tutorial USENIX LISA13


Still More Dataset Properties

Property              Change?   Brief Description
used                  readonly  Sum of usedby* (see below)
usedbychildren        readonly  Space used by descendants
usedbydataset         readonly  Space used by the dataset itself
usedbyrefreservation  readonly  Space used by a refreservation for this dataset
usedbysnapshots       readonly  Space used by all snapshots of this dataset
zoned                 readonly  Is dataset added to a non-global zone (Solaris)

ZFS

Page 68: S8 File Systems Tutorial USENIX LISA13


ZFS Volume Properties

Property      Change?   Brief Description
shareiscsi              iSCSI service (per-distro option)
volblocksize  creation  Fixed block size
volsize                 Implicit quota
zoned         readonly  Set if dataset delegated to non-global zone (Solaris)

ZFS

Page 69: S8 File Systems Tutorial USENIX LISA13


ZFS File System Properties

Property         Change?   Brief Description
aclinherit                 ACL inheritance policy, when files or directories are created
aclmode                    ACL modification policy, when chmod is used
atime                      Disable access time metadata updates
canmount                   Mount policy
casesensitivity  creation  Filename matching algorithm (CIFS client feature)
devices                    Device opening policy for dataset
exec                       File execution policy for dataset
mounted          readonly  Is file system currently mounted?

ZFS

Page 70: S8 File Systems Tutorial USENIX LISA13


ZFS File System Properties (2)

Property       Change?        Brief Description
nbmand         export/import  File system should be mounted with non-blocking mandatory locks (CIFS client feature)
normalization  creation       Unicode normalization of file names for matching
quota                         Max space dataset and descendants can consume
recordsize                    Suggested maximum block size for files
refquota                      Max space dataset can consume, not including descendants
setuid                        setuid mode policy
sharenfs                      NFS sharing options (per-distro)
sharesmb                      File system shared with SMB (per-distro)

ZFS

Page 71: S8 File Systems Tutorial USENIX LISA13


ZFS File System Properties (3)

Property  Change?   Brief Description
snapdir             Controls whether the .zfs directory is hidden
utf8only  creation  UTF-8 character file name policy
vscan               Virus scan enabled
xattr               Extended attributes policy

ZFS

Page 72: S8 File Systems Tutorial USENIX LISA13


ZFS Distro Properties

Pool Properties
Release     Property  Brief Description
illumos     comment   Human-readable comment field
ZFSonLinux  ashift    Sets default disk sector size

Dataset Properties
Release            Property    Brief Description
Solaris 11         encryption  Dataset encryption
Delphix/illumos    clones      Clone descendants
Delphix/illumos    refratio    Compression ratio for references
Solaris 11         share       Combines sharenfs & sharesmb
Solaris 11         shadow      Shadow copy
NexentaOS/illumos  worm        WORM feature
Delphix/illumos    written     Amount of data written since last snapshot

ZFS

Page 73: S8 File Systems Tutorial USENIX LISA13

Performance and Tuning

Page 74: S8 File Systems Tutorial USENIX LISA13


About Disks• Hard disk drives are slow. Get over it.


ext4

ZFS

btrfs

Disk     Size  RPM     Max Size (GBytes)  Avg Rotational Latency (ms)  Avg Seek (ms)
HDD      2.5”  5,400   1,000              5.5                          11
HDD      3.5”  5,900   4,000              5.1                          16
HDD      3.5”  7,200   4,000              4.2                          8 - 8.5
HDD      2.5”  10,000  300                3                            4.2 - 4.6
HDD      2.5”  15,000  146                2                            3.2 - 3.5
SSD (w)  2.5”  N/A     800                0                            0.02 - 0.25
SSD (r)  2.5”  N/A     1,000              0                            0.02 - 0.15
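The rotational latency column is not measured, just geometry: on average the head waits half a revolution. A quick sketch that closely reproduces the table's values from the spindle speed alone:

```python
# Average rotational latency is half a revolution, in milliseconds.
def avg_rot_latency_ms(rpm):
    return 0.5 * 60_000 / rpm      # 60,000 ms per minute / rpm, halved

for rpm in (5400, 7200, 10000, 15000):
    print(rpm, round(avg_rot_latency_ms(rpm), 1))   # matches the table's column
```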

Page 75: S8 File Systems Tutorial USENIX LISA13


btrfs Performance
• Move metadata to separate devices
  • Common option for distributed file systems
  • Attribute-intensive workloads can benefit from faster metadata management

btrfs

[Diagram: pool layouts from Minimal (single HDD) to Good (RAID-10 of mirrored HDD pairs) to Better (metadata on a mirrored SSD pair, data on mirrored HDD pairs).]

Page 76: S8 File Systems Tutorial USENIX LISA13


ZFS Performance

[Diagram: main pool / log / cache layouts from Minimal (single HDD) to Good (raidz, raidz2, or raidz3 over HDDs) to Better (striped HDD mirrors with a mirrored SSD log and striped SSD cache) to Best (striped SSD mirrors).]

ZFS

Page 77: S8 File Systems Tutorial USENIX LISA13


More ZFS Performance

[Diagram: performance per dollar per byte. Good: raidz over HDDs with mirrored SSD log and SSD cache. Better: striped HDD mirrors with SSD log and cache. Best: striped HDD+SSD mirrors.]

ZFS

Page 78: S8 File Systems Tutorial USENIX LISA13


Device Sector Optimization
• Problem: not all drive sectors are equal, and read-modify-write is inefficient
  • 512 bytes: legacy and enterprise
  • 4 KB: Advanced Format (AF), consumer and high-density
• ZFSonLinux
  • zpool create ashift option (sector size = 2^ashift)

    Sector size  ashift
    512 bytes    9
    4 kB         12

ZFS
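The table above is just a base-2 logarithm. A small helper (my own, for illustration) that derives the right ashift for any power-of-two sector size:

```python
# ashift is log2 of the sector size: 512-byte sectors -> 9, 4 KiB AF -> 12.
def ashift_for(sector_bytes):
    a = sector_bytes.bit_length() - 1
    assert 1 << a == sector_bytes, "sector size must be a power of two"
    return a

assert ashift_for(512) == 9
assert ashift_for(4096) == 12
```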


Page 79: S8 File Systems Tutorial USENIX LISA13


Wounded Soldier

[Chart: NFS service level over time. A failing ("wounded") disk degrades service until the bad disk is offlined; service returns to normal when the resilver completes.]

ext4

ZFS

btrfs

Page 80: S8 File Systems Tutorial USENIX LISA13

Summary

Woohoo!

Page 81: S8 File Systems Tutorial USENIX LISA13


Great File Systems!
• All of these file systems have great features and bright futures
• Now you know how to use them better!
• ext4 is now the default for many Linux distros
• btrfs takes it to the next level in the Linux ecosystem
• ZFS is widely ported to many different OSes
  • The OpenZFS organization recently launched to be the focal point for open-source ZFS
  • We’re always looking for more contributors!

ext4

ZFS

btrfs

Page 82: S8 File Systems Tutorial USENIX LISA13


Websites• www.Open-ZFS.org• www.ZFSonLinux.org

• github.com/zfsonlinux/pkg-zfs/wiki/HOWTO-install-Ubuntu-to-a-Native-ZFS-Root-Filesystem

• btrfs.wiki.kernel.org


ZFS

btrfs

Page 83: S8 File Systems Tutorial USENIX LISA13


Online Chats
• irc.freenode.net
  • #zfs: general ZFS discussions
  • #zfsonlinux: Linux-specific discussions
  • #btrfs: general btrfs discussions

ZFS

btrfs


Page 84: S8 File Systems Tutorial USENIX LISA13

Thank You!

[email protected]
@richardelling

Page 85: S8 File Systems Tutorial USENIX LISA13