Solaris Data Technologies The Zettabyte Filesystem Ulrich Gräf Consultant, OS Ambassador Core Technology Team Office Frankfurt, Sun Microsystems


Page 1: Solaris Data Technologies The Zettabyte Filesystem

Solaris Data Technologies: The Zettabyte Filesystem

Ulrich Gräf

Consultant, OS Ambassador

Core Technology Team

Office Frankfurt, Sun Microsystems

Page 2: Solaris Data Technologies The Zettabyte Filesystem

The Perfect Filesystem

Write my data

Keep it safe

Read it back

Do it fast

Don't hassle me

Page 3: Solaris Data Technologies The Zettabyte Filesystem

Existing Filesystems

Lots of limits: size, number of files, etc.

No data integrity checks

No defense against silent data corruption

No data security: spying, tampering, theft

Numerous performance and scaling problems

Excruciating to manage

Page 4: Solaris Data Technologies The Zettabyte Filesystem

The Next Generation Filesystem Objective

End the suffering!

Page 5: Solaris Data Technologies The Zettabyte Filesystem

The ZFS Filesystem

Write my data!

Immense capacity (128-bit)

Moore's Law: need 65th bit in 12 years

Zettabyte = 70-bit (a billion TB)

ZFS capacity: 256 quadrillion ZB

Quantum limit of Earth-based storage

Dynamic metadata

No limits on files, directory entries, etc.

No wacky knobs (e.g. inodes/cg)
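The capacity claims above can be checked with simple arithmetic. The sketch below (assuming binary units, so one zettabyte is 2^70 bytes and "quadrillion" reads as 2^50) confirms that a 128-bit address space holds the "256 quadrillion ZB" quoted on the slide:

```python
# Back-of-the-envelope check of the slide's capacity figures.
# Assumption: binary units throughout (1 ZB = 2**70 bytes,
# "quadrillion" = 2**50), which is how the numbers line up.

ZETTABYTE = 2 ** 70       # bytes in one (binary) zettabyte
MAX_128BIT = 2 ** 128     # bytes addressable with 128 bits

zettabytes = MAX_128BIT // ZETTABYTE
assert zettabytes == 2 ** 58          # 128-bit space = 2**58 ZB
assert zettabytes == 256 * 2 ** 50    # i.e. 256 "quadrillion" ZB
print(zettabytes)
```

The "need 65th bit in 12 years" point is the same arithmetic in reverse: if storage demand doubles on a Moore's-Law cadence, a 64-bit limit buys only one extra bit per doubling period.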

Page 6: Solaris Data Technologies The Zettabyte Filesystem

The ZFS Filesystem

Keep it safe!

Provable data integrity model

Complete end-to-end verification

Detects bit rot, phantom writes, misdirections, common administrative errors

Self-healing data

Disk scrubbing

Real-time remote replication

Data authentication and encryption

Page 7: Solaris Data Technologies The Zettabyte Filesystem

The ZFS Filesystem

Do it fast!

Write sequentialization

Dynamic striping across all disks

Multiple block sizes

Constant-time snapshots

Concurrent, constant-time directory ops

Byte-range locking for concurrent writes

Page 8: Solaris Data Technologies The Zettabyte Filesystem

The ZFS Filesystem

Don't hassle me!

Pooled storage – no more volumes!

Filesystems are cheap and easy to create

Grow and shrink are automatic

No raw device names to remember

No more fsck(1M) ever

No more editing /etc/vfstab

Unlimited snapshots and user undo

All administration online

Page 9: Solaris Data Technologies The Zettabyte Filesystem

Volumes vs. Pooled Storage

Traditional volumes

Partition per filesystem; painful to manage

Block-based FS/Volume interface slow, brittle

Pooled Storage

Filesystems share space; easy to manage

Transactional ZPL/DMU interface fast, robust

[Diagram: traditional model with one FS per volume (FS atop Volume, three times) vs. many filesystems (ZPL) sharing a single Storage Pool (DMU/SPA)]

Page 10: Solaris Data Technologies The Zettabyte Filesystem

POSIX isn't the only game in town

DMU provides transactional object storage model

Filesystems

Databases

Raw volume emulation

What else might we do?

Object-Based Storage

[Diagram: storage stack – consumers (ZPL, NFS, Oracle, UFS, other filesystems, raw devices) layered over the DMU, with raw volume emulation via Zvol, all atop the SPA]

Page 11: Solaris Data Technologies The Zettabyte Filesystem

UFS/SVM Administration, page 1

Set up partition tables for each disk

Format one disk at a time

Each partition must be the maximum expected size (how do we know?) of the user's filesystem

The partitions on each disk must match

Also, we must set aside partitions for SVM itself to store its configuration database.

Create the SVM metadb

# format

... (long interactive session omitted)

# metadb -a -f disk1:slice0 disk2:slice0

Page 12: Solaris Data Technologies The Zettabyte Filesystem

UFS/SVM Administration, page 2

Create Ann's volume (d20)

Volume names must be of the form dNNN

We can't name Ann's volume “ann” (fixed in another project)

# metainit d10 1 1 disk1:slice1

d10: Concat/Stripe is setup

# metainit d11 1 1 disk2:slice1

d11: Concat/Stripe is setup

# metainit d20 -m d10

d20: Mirror is setup

# metattach d20 d11

d20: submirror d11 is attached

Page 13: Solaris Data Technologies The Zettabyte Filesystem

UFS/SVM Administration, page 3

Create Bob's volume (d21)

# metainit d12 1 1 disk1:slice2

d12: Concat/Stripe is setup

# metainit d13 1 1 disk2:slice2

d13: Concat/Stripe is setup

# metainit d21 -m d12

d21: Mirror is setup

# metattach d21 d13

d21: submirror d13 is attached

Page 14: Solaris Data Technologies The Zettabyte Filesystem

UFS/SVM Administration, page 4

Create Sue's volume (d22)

# metainit d14 1 1 disk1:slice3

d14: Concat/Stripe is setup

# metainit d15 1 1 disk2:slice3

d15: Concat/Stripe is setup

# metainit d22 -m d14

d22: Mirror is setup

# metattach d22 d15

d22: submirror d15 is attached

Page 15: Solaris Data Technologies The Zettabyte Filesystem

UFS/SVM Administration, page 5

Create and mount Ann's filesystem

Like SVM, UFS can't name Ann's filesystem “ann”

It's on volume d20, so its name is /dev/md/dsk/d20

Manually add it to /etc/vfstab

# newfs /dev/md/rdsk/d20

newfs: construct a new file system /dev/md/rdsk/d20: (y/n)? y

... (many pages of 'superblock backup' output omitted)

# mount /dev/md/dsk/d20 /export/home/ann

# vi /etc/vfstab

... While in 'vi', type this exactly:

/dev/md/dsk/d20 /dev/md/rdsk/d20 /export/home/ann ufs 2 yes -

Page 16: Solaris Data Technologies The Zettabyte Filesystem

UFS/SVM Administration, page 6

Now do the same for Bob and Sue

# newfs /dev/md/rdsk/d21

newfs: construct a new file system /dev/md/rdsk/d21: (y/n)? y

... (many pages of 'superblock backup' output omitted)

# mount /dev/md/dsk/d21 /export/home/bob

# vi /etc/vfstab ... While in 'vi', type this exactly:

/dev/md/dsk/d21 /dev/md/rdsk/d21 /export/home/bob ufs 2 yes -

# newfs /dev/md/rdsk/d22

newfs: construct a new file system /dev/md/rdsk/d22: (y/n)? y

... (many pages of 'superblock backup' output omitted)

# mount /dev/md/dsk/d22 /export/home/sue

# vi /etc/vfstab ... While in 'vi', type this exactly:

/dev/md/dsk/d22 /dev/md/rdsk/d22 /export/home/sue ufs 2 yes -

Page 17: Solaris Data Technologies The Zettabyte Filesystem

UFS/SVM Administration, page 7

Later, add more space for Bob

Note: this space isn't available to Ann or Sue

# format

... (long interactive session omitted)

# metattach d12 disk3:slice1

d12: component is attached

# metattach d13 disk4:slice1

d13: component is attached

# metattach d21

# growfs -M /export/home/bob /dev/md/rdsk/d21

/dev/md/rdsk/d21:

... (many pages of 'superblock backup' output omitted)

Page 18: Solaris Data Technologies The Zettabyte Filesystem

ZFS Administration

Create a storage pool named “home”

# zpool create "home" mirror(disk1,disk2)

Create filesystems “ann”, “bob”, “sue”

# zfs mount -c home/ann /export/home/ann

# zfs mount -c home/bob /export/home/bob

# zfs mount -c home/sue /export/home/sue

Later, add space to the “home” pool

# zpool add "home" mirror(disk3,disk4)

Page 19: Solaris Data Technologies The Zettabyte Filesystem

Provable Data Integrity Model

Three Big Rules

All operations are copy-on-write

Never overwrite live data

On-disk state always valid

No need for fsck(1M)

All operations are transactional

Related changes succeed or fail as a whole

No need for journaling

All data is checksummed

No silent data corruption

No panics on bad metadata
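The three rules above can be sketched as a toy copy-on-write store: data always goes to fresh blocks, and a group of related changes becomes visible only when a single root pointer flips, so the committed state is always either the old tree or the new one. All names here are illustrative, not ZFS internals:

```python
# Toy copy-on-write, transactional store (illustrative names only).
# Rule 1: blocks are write-once, never overwritten in place.
# Rule 2: a commit switches one root pointer atomically, so all
#         changes in the transaction appear together or not at all.

class CowStore:
    def __init__(self):
        self.blocks = {}     # block id -> data, write-once
        self.root = {}       # committed view: name -> block id
        self.next_id = 0

    def _alloc(self, data):
        bid = self.next_id
        self.next_id += 1
        self.blocks[bid] = data      # always a fresh block
        return bid

    def commit(self, changes):
        """Apply {name: data} as one transaction group."""
        new_root = dict(self.root)   # copy; never mutate live state
        for name, data in changes.items():
            new_root[name] = self._alloc(data)
        self.root = new_root         # the single atomic switch

    def read(self, name):
        return self.blocks[self.root[name]]

s = CowStore()
s.commit({"a": "v1", "b": "v1"})
s.commit({"a": "v2"})                # rewrites "a"; old block kept
print(s.read("a"), s.read("b"))      # v2 v1
```

Because the root switch is the only mutation of live state, a crash before it leaves the old consistent tree intact, which is why neither fsck nor a journal is needed.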

Page 20: Solaris Data Technologies The Zettabyte Filesystem

Traditional Checksums

Checksums stored with data blocks

Fine for detecting bit rot, but:

Can't detect phantom writes, misdirections

Can't validate the checksum itself

Can't authenticate the data

Can't detect common administrative errors

[Diagram: four data blocks, each with its checksum stored alongside it in the same block]

Page 21: Solaris Data Technologies The Zettabyte Filesystem

ZFS Checksums

Checksums stored with indirect blocks

Self-validating, self-authenticating checksum tree

Detects phantom writes, misdirections, common administrative errors (e.g. swap on active ZFS disk)

[Diagram: checksum tree – each indirect block holds (address, checksum) pointers to its children, down to the data blocks, so every checksum is validated by its parent]
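The key difference from traditional checksums can be shown in a few lines: store the checksum in the parent's block pointer rather than next to the data, and the pointer then validates the block it references. This is a hedged sketch with made-up names, not the ZFS block-pointer format:

```python
import hashlib

# Sketch of checksums-in-the-parent (illustrative, not ZFS code):
# a block pointer carries (address, checksum-of-child), so a
# phantom or misdirected write fails validation at the parent.

def cksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

disk = {}                                  # address -> bytes

def write_block(addr: int, data: bytes) -> dict:
    disk[addr] = data
    return {"addr": addr, "cksum": cksum(data)}   # block pointer

def read_block(bp: dict) -> bytes:
    data = disk[bp["addr"]]
    if cksum(data) != bp["cksum"]:
        raise IOError("checksum mismatch: bad or misdirected block")
    return data

bp = write_block(0, b"user data")
assert read_block(bp) == b"user data"      # normal read validates

disk[0] = b"phantom write"                 # silent corruption...
try:
    read_block(bp)
except IOError as e:
    print("detected:", e)                  # ...is caught, not returned
```

A checksum stored with the data block would validate the corrupted copy happily; one stored in the parent cannot, and applying the same rule at every level yields the self-validating tree of the diagram.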

Page 22: Solaris Data Technologies The Zettabyte Filesystem

Self-Healing Data

Media error under traditional FS:

Bad user data causes silent data corruption

Bad metadata causes SDC, panic, or both

Media error under ZFS:

Checksum detects data corruption

SPA gets valid data from another replica

SPA repairs the damaged replica

ZFS returns valid data to application

No human intervention required

A non-event
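The self-healing sequence above reduces to a small loop over a mirror's replicas; the sketch below is illustrative, not the SPA implementation, and assumes the expected checksum is already known from the parent block pointer:

```python
import hashlib

# Sketch of a self-healing read over a mirror (illustrative):
# find a replica whose checksum matches, repair any replica that
# doesn't, and hand the valid data to the application.

def cksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def self_healing_read(replicas: list, expected: str) -> bytes:
    good = next(d for d in replicas if cksum(d) == expected)
    for i, d in enumerate(replicas):
        if cksum(d) != expected:
            replicas[i] = good        # repair the damaged replica
    return good                       # valid data, every time

data = b"important"
mirror = [data, b"bit rot!"]          # one replica corrupted
out = self_healing_read(mirror, cksum(data))
print(out == data, mirror[1] == data) # True True
```

The application never sees the corruption and the administrator never hears about it, which is what makes a media error "a non-event".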

Page 23: Solaris Data Technologies The Zettabyte Filesystem

ZFS S10 Content

Simple Administration

Pooled storage

Unlimited snapshots

NT-style ACLs and extended attributes

Quotas & reservations

Least privilege – lets end users do more

Online everything

Automatic grow/shrink

Dynamic metadata – no wacky knobs

Hot space

Host-neutral on-disk format (import/export)

Page 24: Solaris Data Technologies The Zettabyte Filesystem

ZFS S10 Content

Provable end-to-end data integrity

64-bit self-validating checksums on every block

Historically considered “too expensive”

Turns out, no it isn't

And the alternative is unacceptable

Always-consistent on-disk format

Never needs fsck

Doesn't depend on journaling

Self-healing data in mirrored configurations

Page 25: Solaris Data Technologies The Zettabyte Filesystem

ZFS S10 Content

Smokin' performance

Already faster than UFS on most benchmarks

Under 1 second file system creation

Dynamic striping – automatically maximizes bandwidth

Intrinsically hot-spot free due to COW model

Constant-time directory operations (create/delete)

Multiple block sizes

Page 26: Solaris Data Technologies The Zettabyte Filesystem

ZFS S10 Content

Scalability

16 exabytes per file system

Many zettabytes per pool

Unlimited files per file system

Unlimited file systems per pool

Unlimited directory size

Unlimited number of devices

No O(data) operations