ZFS in the Trenches
Ben Rockwood
Director of Systems Engineering
Joyent, Inc.
The Big Questions
Is node 5 of 150 struggling?
Is I/O as efficient as it can be?
How fast are requests being answered?
What is my mix of sync vs async I/O?
Do I need to tune?
How do I back it up?
Setting the Record Straight
Correcting Assumptions
Tuning ZFS is Evil (true)
ZFS doesn’t require tuning (false)
ZFS is a memory hog (true)
ZFS is slow (false)
ZFS won’t allow corruption (false)
Amazing Efficiency
ZFS ARC is extremely efficient
Example: 32 production zones (different customers), 120,000 reads per second... ZERO physical read I/O!
ZFS TXG Sync is extremely efficient, provides a clean and orderly flush of writes to disk
ZFS Prefetch intelligence is smarter than you... but true efficiency can vary based on workload.
Physical vs Logical I/O
Observability
Kstat
ZFS ARC Kstats are incredibly useful
Tools such as arcstat or arc_summary are good examples
More Kstats (hopefully) are coming
Things get more interesting when combined with physical disk kstats and VFS layer kstats
root@quadra ~$ kstat -p -n arcstats
zfs:0:arcstats:c 264750592
zfs:0:arcstats:c_max 268435456
zfs:0:arcstats:c_min 126918784
zfs:0:arcstats:class misc
zfs:0:arcstats:crtime 11.976322232
zfs:0:arcstats:data_size 105994240
zfs:0:arcstats:deleted 871464
zfs:0:arcstats:demand_data_hits 2690899
zfs:0:arcstats:demand_data_misses 90664
zfs:0:arcstats:demand_metadata_hits 6643919
zfs:0:arcstats:demand_metadata_misses 340739
zfs:0:arcstats:evict_skip 19395259
zfs:0:arcstats:hash_chain_max 6
zfs:0:arcstats:hash_chains 4508
zfs:0:arcstats:hash_collisions 297073
zfs:0:arcstats:hash_elements 28047
zfs:0:arcstats:hash_elements_max 71400
zfs:0:arcstats:hdr_size 5048208
zfs:0:arcstats:hits 12065410
zfs:0:arcstats:l2_abort_lowmem 0
zfs:0:arcstats:l2_cksum_bad 0
zfs:0:arcstats:l2_evict_lock_retry 0
zfs:0:arcstats:l2_evict_reading 0
zfs:0:arcstats:l2_feeds 0
zfs:0:arcstats:l2_free_on_write 0
zfs:0:arcstats:l2_hdr_size 0
zfs:0:arcstats:l2_hits 0
zfs:0:arcstats:l2_io_error 0
zfs:0:arcstats:l2_misses 0
zfs:0:arcstats:l2_read_bytes 0
zfs:0:arcstats:l2_rw_clash 0
zfs:0:arcstats:l2_size 0
zfs:0:arcstats:l2_write_bytes 0
zfs:0:arcstats:l2_writes_done 0
zfs:0:arcstats:l2_writes_error 0
zfs:0:arcstats:l2_writes_hdr_miss 0
zfs:0:arcstats:l2_writes_sent 0
zfs:0:arcstats:memory_throttle_count 3610
zfs:0:arcstats:mfu_ghost_hits 217488
zfs:0:arcstats:mfu_hits 7727741
zfs:0:arcstats:misses 820341
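As a rough sketch, the counters above reduce to an overall hit ratio with a one-liner (they are cumulative since boot, so sample deltas over an interval for live rates):

$ kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses | \
    awk '{v[NR]=$2} END {printf "%.1f%% ARC hit ratio\n", v[1]*100/(v[1]+v[2])}'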
MDB
mdb provides many useful features that aren’t as difficult to use as you might think (see zfs.c)
If you use just one, “::zfs_params” is handy to see all ZFS tunables in one shot
Several walkers are available, most handy when doing postmortem analysis (Solaris CAT has wrappers for them)
root@quadra ~$ mdb -k
Loading modules: [ unix genunix s.....
> ::zfs_params
arc_reduce_dnlc_percent = 0x3
zfs_arc_max = 0x10000000
zfs_arc_min = 0x0
arc_shrink_shift = 0x5
zfs_mdcomp_disable = 0x0
zfs_prefetch_disable = 0x0
zfetch_max_streams = 0x8
zfetch_min_sec_reap = 0x2
zfetch_block_cap = 0x100
zfetch_array_rd_sz = 0x100000
zfs_default_bs = 0x9
zfs_default_ibs = 0xe
metaslab_aliquot = 0x80000
spa_max_replication_override = 0x3
mdb: variable spa_mode not found: unknown symbol name
zfs_flags = 0x0
zfs_txg_synctime = 0x5
zfs_txg_timeout = 0x1e
zfs_write_limit_min = 0x2000000
zfs_write_limit_max = 0x1e428200
zfs_write_limit_shift = 0x3
zfs_write_limit_override = 0x0
zfs_no_write_throttle = 0x0
zfs_vdev_cache_max = 0x4000
zfs_vdev_cache_size = 0xa00000
zfs_vdev_cache_bshift = 0x10
vdev_mirror_shift = 0x15
zfs_vdev_max_pending = 0x23
zfs_vdev_min_pending = 0x4
zfs_scrub_limit = 0xa
zfs_vdev_time_shift = 0x6
zfs_vdev_ramp_rate = 0x2
zfs_vdev_aggregation_limit = 0x20000
fzap_default_block_shift = 0xe
zfs_immediate_write_sz = 0x8000
zfs_read_chunk_size = 0x100000
zil_disable = 0x0
zfs_nocacheflush = 0x0
metaslab_gang_bang = 0x20001
metaslab_df_alloc_threshold = 0x20000
metaslab_df_free_pct = 0x1e
zio_injection_enabled = 0x0
zvol_immediate_write_sz = 0x8000
DTrace
I live by the FBT provider
Watch entry and return of every ZFS function (w00t!)
Most powerful when used for timing or aggregating stacks to learn code flow (sketches below)
The fsinfo provider (the data behind the fsstat(1M) command) is a hidden gem.
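A few illustrative one-liners (zio_wait is just an example target, and the fsinfo arguments are as I understand that provider): count entries into every ZFS function, aggregate kernel stacks to learn code flow, and watch per-process logical I/O at the VFS layer:

$ dtrace -n 'fbt:zfs::entry { @[probefunc] = count(); }'
$ dtrace -n 'fbt:zfs:zio_wait:entry { @[stack()] = count(); }'
$ dtrace -n 'fsinfo:::read,fsinfo:::write /args[0]->fi_fs == "zfs"/ { @[execname, probename] = sum(arg1); }'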
zdb
Examine on-disk structures
Can find and fix issues
Breakdown of disk utilization can be telling
Extremely interesting, but rarely handy
Have used it to recover deleted files
root@quadra ~$ zdb
quadra
    version=18
    name='quadra'
    state=0
    txg=20039
    pool_guid=15851861834417040262
    hostid=12739881
    hostname='quadra'
    vdev_tree
        type='root'
        id=0
        guid=15851861834417040262
        children[0]
            type='mirror'
            id=0
            guid=7006308066956826552
            whole_disk=0
            metaslab_array=23
            metaslab_shift=33
            ashift=9
            asize=1000188936192
            is_log=0
            children[0]
                type='disk'
                id=0
                guid=916181432694315835
                path='/dev/dsk/c2d0s0'
                devid='id1,cmdk@AHitachi_HDT721010SLA360=______STF605MH1TK48W/a'
                phys_path='/pci@0,0/pci-ide@1f,5/ide@0/cmdk@0,0:a'
                whole_disk=1
                DTL=18
            children[1]
                type='disk'
                id=1
                guid=13343223963796600669
                path='/dev/dsk/c3d0s0'
                devid='id1,cmdk@AHitachi_HDT721010SLA360=______STF605MH1S14KW/a'
                phys_path='/pci@0,0/pci-ide@1f,5/ide@1/cmdk@0,0:a'
                whole_disk=1
                DTL=16
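The utilization breakdown mentioned above comes from zdb’s block statistics; it traverses every block in the pool, so expect it to take a while on anything large (pool name from the example above):

$ zdb -b quadra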
Keys to Observability
Use DTrace to hone your understanding of ZFS internals
Don’t over-focus your instrumentation; leverage VFS and ZFS together to get a holistic picture
Avoid getting obsessed with the DTrace syscall provider (you can do better)
Hint: study up on bdev_strategy & biodone (see the sketch after this list)
Kstats, kstats, kstats....
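A minimal sketch of the bdev_strategy/biodone hint: key on the buf pointer at strategy time and aggregate latency when biodone fires, which gives the true physical I/O picture underneath the TXG machinery:

$ dtrace -n '
  fbt::bdev_strategy:entry { start[arg0] = timestamp; }
  fbt::biodone:entry /start[arg0]/ {
      @["physical I/O latency (ns)"] = quantize(timestamp - start[arg0]);
      start[arg0] = 0;
  }'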
Inside ARC
The Cache Lists
Most Recently Used (MRU)
Most Frequently Used (MFU)
MRU Ghost
MFU Ghost
ARC & Prefetch
ARC Kstats can tell you how data arrived in cache: prefetch or direct
The mix of direct vs prefetched data can tell you whether prefetch is worth the I/O
If prefetch hits are less than 10%, just disable it
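A rough sketch of that check from the prefetch kstats (cumulative since boot, so watch the trend before deciding; the tunable is the standard one visible in the ::zfs_params output earlier):

$ kstat -p zfs:0:arcstats:prefetch_data_hits zfs:0:arcstats:prefetch_data_misses | \
    awk '{v[NR]=$2} END {printf "%.1f%% prefetch hit rate\n", v[1]*100/(v[1]+v[2])}'
# to disable prefetch: set zfs:zfs_prefetch_disable = 1 in /etc/system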
ARC Sizing
ARC is HUGE! 7/8ths of physical memory by default!
Min and max sizes are tunable
Watch the ghost lists if you limit it
A >25GB ARC on a 32GB node is extremely common
Ghost list hit rate seems to be the best indicator that you should consider L2ARC (see the kstat one-liner below)
$ ./arc_summary.pl
System Memory:
        Physical RAM:  16247 MB
        Free Memory :  1064 MB
        LotsFree:      253 MB

ZFS Tunables (/etc/system):

ARC Size:
        Current Size:             5931 MB (arcsize)
        Target Size (Adaptive):   5985 MB (c)
        Min Size (Hard Limit):    507 MB (zfs_arc_min)
        Max Size (Hard Limit):    15223 MB (zfs_arc_max)

ARC Size Breakdown:
        Most Recently Used Cache Size:   73%  4405 MB (p)
        Most Frequently Used Cache Size: 26%  1579 MB (c-p)

ARC Efficency:
        Cache Access Total:      30934237785
        Cache Hit Ratio:   99%   30933044424  [Defined State for buffer]
        Cache Miss Ratio:   0%   1193361      [Undefined State for Buffer]
        REAL Hit Ratio:    99%   30888972063  [MRU/MFU Hits Only]

        Data Demand Efficiency:    99%
        Data Prefetch Efficiency:  99%

        CACHE HITS BY CACHE LIST:
          Anon:                        0%  43145095                [ New Customer, First Cache Hit ]
          Most Recently Used:          0%  56656566 (mru)          [ Return Customer ]
          Most Frequently Used:       99%  30832315497 (mfu)       [ Frequent Customer ]
          Most Recently Used Ghost:    0%  280356 (mru_ghost)      [ Return Customer Evicted, Now Back ]
          Most Frequently Used Ghost:  0%  646910 (mfu_ghost)      [ Frequent Customer Evicted, Now Back ]
        CACHE HITS BY DATA TYPE:
          Demand Data:        78%  24388331448
          Prefetch Data:      19%  6013656064
          Demand Metadata:     1%  472836583
          Prefetch Metadata:   0%  58220329
        CACHE MISSES BY DATA TYPE:
          Demand Data:        21%  255389
          Prefetch Data:      32%  385838
          Demand Metadata:    31%  381534
          Prefetch Metadata:  14%  170600
---------------------------------------------
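To watch the ghost lists raw, without arc_summary, the kstats alone suffice; steadily climbing ghost hits mean the ARC keeps missing on buffers it was forced to evict, which is exactly the case L2ARC exists for:

$ kstat -p zfs:0:arcstats:mru_ghost_hits zfs:0:arcstats:mfu_ghost_hits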
Physical I/O
ZFS Breathing
Async writes grouped into Transaction Group (TXG) and “sync”ed to disk at regular intervals
txg_synctime represents expected time to flush a TXG (5 sec)
txg_timeout is typical frequency of flush (30 sec)
Pre-snv_87 txg_time flushed every 5 seconds (tunable)
Can be monitored via DTrace (FBT: spa_sync)
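A sketch of that monitoring: stamp each spa_sync and aggregate its duration; the wall-clock lines printed on entry show the breathing interval between syncs:

$ dtrace -n '
  fbt::spa_sync:entry {
      self->start = timestamp;
      printf("%Y: syncing txg %d", walltimestamp, arg1);
  }
  fbt::spa_sync:return /self->start/ {
      @["spa_sync duration (ns)"] = quantize(timestamp - self->start);
      self->start = 0;
  }'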
Read I/O
Reads are always synchronous
...but ARC absorbs it very, very well
ZFS Intent Log
ZIL throws a kink in the works
Without a Log Device (“SLOG”) the intent log is part of the pool
Can be monitored via DTrace (FBT: zil_commit_writer)
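Same pattern for the ZIL; latency here is what your synchronous writers actually feel:

$ dtrace -n '
  fbt::zil_commit_writer:entry { self->start = timestamp; }
  fbt::zil_commit_writer:return /self->start/ {
      @["zil_commit_writer latency (ns)"] = quantize(timestamp - self->start);
      self->start = 0;
  }'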
Disabling ZIL
Sync I/O treated as async
The fastest SSD won’t match the speed of zil_disable=1
ZFS is always consistent on disk, regardless
NFS corruption issues over-blown
If power is lost, in-flight data (uncommitted TXGs) is lost... this may cause logical corruption
In some situations, it’s a necessary evil... but have a good UPS
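For reference, the switch of this era was the kernel tunable visible in the ::zfs_params output earlier, set at boot via /etc/system (later builds replaced it with a per-dataset sync property):

* /etc/system: treat sync I/O as async -- dangerous, see caveats above
set zfs:zil_disable = 1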
IOSTAT IS DEAD
... deal with it.
iostat
asvc_t & %b have diminished meaning due to TXGs
Monitoring the physical devices is not as telling as watching from the VFS layer
What’s more interesting is the gap between TXG syncs... which is better monitored via DTrace than inferred
If you must, always use iostat & fsstat together, essentially to see what’s going in and out of ZFS (example below)
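For example, run the pair side by side (the 1-second intervals are illustrative) to compare what applications ask of ZFS with what ZFS actually asks of the disks:

$ fsstat zfs 1      # logical: VFS-level activity into ZFS
$ iostat -xnz 1     # physical: what actually reaches the devices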
The Write Throttle
Delays processes that are over-zealous
Tunable and can be disabled
Provides excellent hook to find out who is doing heavy I/O
Provides throughput numbers (fbt::dsl_pool_sync)
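Two sketches of those hooks (zfs_write is the VFS write entry point into ZFS, used here to attribute heavy writers; per-TXG throughput math varies by build, so only the timing skeleton is shown):

$ dtrace -n 'fbt::zfs_write:entry { @[execname] = count(); }'
$ dtrace -n '
  fbt::dsl_pool_sync:entry { self->start = timestamp; }
  fbt::dsl_pool_sync:return /self->start/ {
      @["dsl_pool_sync (ns)"] = quantize(timestamp - self->start);
      self->start = 0;
  }'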
Backing it up
Backup Architecture
Today’s data loads are getting too big for the traditional weekly-full, daily-incremental architecture
The full/incremental architecture was designed around tape rotation
Disk-to-disk backup is different, dump old assumptions
Backup is really an exercise in asynchronous replication
Joyent backup is rooted in NCDP (near-continuous data protection) ideology
Rsync & Co.
Useful for non-ZFS clients, but slow startup time can be a killer
Rsync to ZFS dataset, snapshot dataset, repeat. Instant backup retention solution.
For proper consistency with ZFS: snapshot, rsync from the snap, then release it. Example: DB data/logs.
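A sketch of that flow, with hypothetical dataset and host names (the .zfs directory is reachable by explicit path even when snapdir=hidden):

$ zfs snapshot tank/db@backup
$ rsync -a /tank/db/.zfs/snapshot/backup/ backuphost:/backups/db/
$ zfs destroy tank/db@backup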
ZFS Send/Recv
Exercise in snapshot replication
No startup lag like rsync.
Very fast; in head-to-head rsync vs zfs s/r, zfs s/r is on average 40% faster
Large improvements in performance added in snv_105
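The replication loop, sketched with hypothetical names: snapshot, send only the increment since the last common snapshot, repeat:

$ zfs snapshot tank/db@today
$ zfs send -i tank/db@yesterday tank/db@today | ssh backuphost zfs recv -F backup/db
$ zfs destroy tank/db@yesterday     # @today becomes the new common snapshot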
Block (Pool) Replication
AVS (SNDR) can replicate a pool’s block devices, but the replicated pool is unusable while replication runs
Same goes for similar block-level replication tools
NDMP
I’m watching NDMP with great interest
NDMPcopy native for Solaris is a big win
Solaris implementation will snapshot prior to copy and release behind
Supposedly can use zfs s/r for data in addition to dump/tar
Would allow for re-integration of traditional backup “managers” such as NetBackup
Parting Thoughts
Transactional Filesystem
Always bear in mind the transactional nature of ZFS
Throw away most of your UFS training... or at least update it
Consult the VFS layer before jumping to conclusions about device level activity
Re-Discover Storage
Spend time in the code with Dtrace
Attempt to challenge old assumptions
Benchmark for fun and profit (use FileBench!)
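A quick FileBench session might look like this (workload name and path are illustrative; command syntax varies across FileBench versions):

$ filebench
filebench> load varmail
filebench> set $dir=/tank/bench
filebench> run 60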
Focus on the Application
More than ever, benchmark application performance
Don’t assume that system level I/O metrics tell the whole story
Use DTrace to explore the interaction between your app and the VFS
Thank You.
Robert Milkowski
Jason Williams
Marcelo Leal
Max Bruning
James Dickens
Adrian Cockcroft
Jim Mauro
Richard McDougall
Tom Haynes
Peter Tribble
Jason King
Octave Orgeron
Brendan Gregg
Matty
Roch
Joerg Moellenkamp
...
These people are awesome!
Resources
Joyent: joyent.com
Cuddletech: cuddletech.com
Solaris Internals Wiki: solarisinternals.com/wiki
Also:
blogs.sun.com
planetsolaris.com