ZFS in the Trenches
Ben Rockwood
Director of Systems Engineering
Joyent, Inc.
The Big Questions
Is node 5 of 150 struggling?
Is I/O as efficient as it can be?
How fast are requests being answered?
What is my mix of sync vs async I/O?
Do I need to tune?
How do I back it up?
Setting the Record Straight
Correcting Assumptions
Tuning ZFS is Evil (true)
ZFS doesn’t require tuning (false)
ZFS is a memory hog (true)
ZFS is slow (false)
ZFS won’t allow corruption (false)
Amazing Efficiency
ZFS ARC is extremely efficient
Example: 32 production zones (different customers), 120,000 reads per second... ZERO physical read I/O!
ZFS TXG Sync is extremely efficient, provides a clean and orderly flush of writes to disk
ZFS Prefetch intelligence is smarter than you... but true efficiency can vary based on workload.
Physical vs Logical I/O
Observability
Kstat
ZFS ARC Kstats are incredibly useful
Tools such as arcstat or arc_summary are good examples
More Kstats (hopefully) are coming
Things get more interesting when combined with physical disk kstats and VFS layer kstats
root@quadra ~$ kstat -p -n arcstats
zfs:0:arcstats:c 264750592
zfs:0:arcstats:c_max 268435456
zfs:0:arcstats:c_min 126918784
zfs:0:arcstats:class misc
zfs:0:arcstats:crtime 11.976322232
zfs:0:arcstats:data_size 105994240
zfs:0:arcstats:deleted 871464
zfs:0:arcstats:demand_data_hits 2690899
zfs:0:arcstats:demand_data_misses 90664
zfs:0:arcstats:demand_metadata_hits 6643919
zfs:0:arcstats:demand_metadata_misses 340739
zfs:0:arcstats:evict_skip 19395259
zfs:0:arcstats:hash_chain_max 6
zfs:0:arcstats:hash_chains 4508
zfs:0:arcstats:hash_collisions 297073
zfs:0:arcstats:hash_elements 28047
zfs:0:arcstats:hash_elements_max 71400
zfs:0:arcstats:hdr_size 5048208
zfs:0:arcstats:hits 12065410
zfs:0:arcstats:l2_abort_lowmem 0
zfs:0:arcstats:l2_cksum_bad 0
zfs:0:arcstats:l2_evict_lock_retry 0
zfs:0:arcstats:l2_evict_reading 0
zfs:0:arcstats:l2_feeds 0
zfs:0:arcstats:l2_free_on_write 0
zfs:0:arcstats:l2_hdr_size 0
zfs:0:arcstats:l2_hits 0
zfs:0:arcstats:l2_io_error 0
zfs:0:arcstats:l2_misses 0
zfs:0:arcstats:l2_read_bytes 0
zfs:0:arcstats:l2_rw_clash 0
zfs:0:arcstats:l2_size 0
zfs:0:arcstats:l2_write_bytes 0
zfs:0:arcstats:l2_writes_done 0
zfs:0:arcstats:l2_writes_error 0
zfs:0:arcstats:l2_writes_hdr_miss 0
zfs:0:arcstats:l2_writes_sent 0
zfs:0:arcstats:memory_throttle_count 3610
zfs:0:arcstats:mfu_ghost_hits 217488
zfs:0:arcstats:mfu_hits 7727741
zfs:0:arcstats:misses 820341
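As a rough sketch, the counters above reduce to an overall hit ratio with a one-liner (they are cumulative since boot, so sample deltas over an interval for live rates):

$ kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses | \
    awk '{v[NR]=$2} END {printf "%.1f%% ARC hit ratio\n", v[1]*100/(v[1]+v[2])}'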
MDB
mdb provides many useful features that aren’t as difficult to use as you might think (see zfs.c)
If you use just one, “::zfs_params” is handy to see all ZFS tunables in one shot
Several walkers are available, most handy when doing postmortem analysis (Solaris CAT has wrappers for them)
root@quadra ~$ mdb -k
Loading modules: [ unix genunix s.....
> ::zfs_params
arc_reduce_dnlc_percent = 0x3
zfs_arc_max = 0x10000000
zfs_arc_min = 0x0
arc_shrink_shift = 0x5
zfs_mdcomp_disable = 0x0
zfs_prefetch_disable = 0x0
zfetch_max_streams = 0x8
zfetch_min_sec_reap = 0x2
zfetch_block_cap = 0x100
zfetch_array_rd_sz = 0x100000
zfs_default_bs = 0x9
zfs_default_ibs = 0xe
metaslab_aliquot = 0x80000
spa_max_replication_override = 0x3
mdb: variable spa_mode not found: unknown symbol name
zfs_flags = 0x0
zfs_txg_synctime = 0x5
zfs_txg_timeout = 0x1e
zfs_write_limit_min = 0x2000000
zfs_write_limit_max = 0x1e428200
zfs_write_limit_shift = 0x3
zfs_write_limit_override = 0x0
zfs_no_write_throttle = 0x0
zfs_vdev_cache_max = 0x4000
zfs_vdev_cache_size = 0xa00000
zfs_vdev_cache_bshift = 0x10
vdev_mirror_shift = 0x15
zfs_vdev_max_pending = 0x23
zfs_vdev_min_pending = 0x4
zfs_scrub_limit = 0xa
zfs_vdev_time_shift = 0x6
zfs_vdev_ramp_rate = 0x2
zfs_vdev_aggregation_limit = 0x20000
fzap_default_block_shift = 0xe
zfs_immediate_write_sz = 0x8000
zfs_read_chunk_size = 0x100000
zil_disable = 0x0
zfs_nocacheflush = 0x0
metaslab_gang_bang = 0x20001
metaslab_df_alloc_threshold = 0x20000
metaslab_df_free_pct = 0x1e
zio_injection_enabled = 0x0
zvol_immediate_write_sz = 0x8000
DTrace
I live by the FBT provider
Watch entry and return of every ZFS function (w00t!)
Most powerful when used for timing or aggregating stacks to learn code flow (sketches below)
The fsinfo provider (the data behind the fsstat(1M) command) is a hidden gem.
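A few illustrative one-liners (zio_wait is just an example target, and the fsinfo arguments are as I understand that provider): count entries into every ZFS function, aggregate kernel stacks to learn code flow, and watch per-process logical I/O at the VFS layer:

$ dtrace -n 'fbt:zfs::entry { @[probefunc] = count(); }'
$ dtrace -n 'fbt:zfs:zio_wait:entry { @[stack()] = count(); }'
$ dtrace -n 'fsinfo:::read,fsinfo:::write /args[0]->fi_fs == "zfs"/ { @[execname, probename] = sum(arg1); }'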
zdb
Examine on-disk structures
Can find and fix issues
Breakdown of disk utilization can be telling
Extremely interesting, but rarely handy
Have used it to recover deleted files
root@quadra ~$ zdb
quadra
    version=18
    name='quadra'
    state=0
    txg=20039
    pool_guid=15851861834417040262
    hostid=12739881
    hostname='quadra'
    vdev_tree
        type='root'
        id=0
        guid=15851861834417040262
        children[0]
            type='mirror'
            id=0
            guid=7006308066956826552
            whole_disk=0
            metaslab_array=23
            metaslab_shift=33
            ashift=9
            asize=1000188936192
            is_log=0
            children[0]
                type='disk'
                id=0
                guid=916181432694315835
                path='/dev/dsk/c2d0s0'
                devid='id1,cmdk@AHitachi_HDT721010SLA360=______STF605MH1TK48W/a'
                phys_path='/pci@0,0/pci-ide@1f,5/ide@0/cmdk@0,0:a'
                whole_disk=1
                DTL=18
            children[1]
                type='disk'
                id=1
                guid=13343223963796600669
                path='/dev/dsk/c3d0s0'
                devid='id1,cmdk@AHitachi_HDT721010SLA360=______STF605MH1S14KW/a'
                phys_path='/pci@0,0/pci-ide@1f,5/ide@1/cmdk@0,0:a'
                whole_disk=1
                DTL=16
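The utilization breakdown mentioned above comes from zdb’s block statistics; it traverses every block in the pool, so expect it to take a while on anything large (pool name from the example above):

$ zdb -b quadra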
Keys to Observability
Use DTrace to hone your understanding of ZFS internals
Don’t over-focus your instrumentation; leverage VFS and ZFS together to get a holistic picture
Avoid getting obsessed with the DTrace syscall provider (you can do better)
Hint: study up on bdev_strategy & biodone (see the sketch after this list)
Kstats, kstats, kstats....
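A minimal sketch of the bdev_strategy/biodone hint: key on the buf pointer at strategy time and aggregate latency when biodone fires, which gives the true physical I/O picture underneath the TXG machinery:

$ dtrace -n '
  fbt::bdev_strategy:entry { start[arg0] = timestamp; }
  fbt::biodone:entry /start[arg0]/ {
      @["physical I/O latency (ns)"] = quantize(timestamp - start[arg0]);
      start[arg0] = 0;
  }'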
Inside ARC
The Cache Lists
Most Recently Used (MRU)
Most Frequently Used (MFU)
MRU Ghost
MFU Ghost
ARC & Prefetch
ARC Kstats can tell you how data arrived in cache: prefetch or direct
The mix of direct vs prefetched data can tell you whether prefetch is worth the I/O
If prefetch hits are less than 10%, just disable it
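A rough sketch of that check from the prefetch kstats (cumulative since boot, so watch the trend before deciding; the tunable is the standard one visible in the ::zfs_params output earlier):

$ kstat -p zfs:0:arcstats:prefetch_data_hits zfs:0:arcstats:prefetch_data_misses | \
    awk '{v[NR]=$2} END {printf "%.1f%% prefetch hit rate\n", v[1]*100/(v[1]+v[2])}'
# to disable prefetch: set zfs:zfs_prefetch_disable = 1 in /etc/system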
ARC Sizing
ARC is HUGE! 7/8ths of physical memory by default!
Min and max sizes are tunable
Watch the ghost lists if you limit it
A >25GB ARC on a 32GB node is extremely common
Ghost list hit rate seems to be the best indicator that you should consider L2ARC (see the kstat one-liner below)
$ ./arc_summary.pl
System Memory:
        Physical RAM:  16247 MB
        Free Memory :  1064 MB
        LotsFree:      253 MB

ZFS Tunables (/etc/system):

ARC Size:
        Current Size:             5931 MB (arcsize)
        Target Size (Adaptive):   5985 MB (c)
        Min Size (Hard Limit):    507 MB (zfs_arc_min)
        Max Size (Hard Limit):    15223 MB (zfs_arc_max)

ARC Size Breakdown:
        Most Recently Used Cache Size:   73%  4405 MB (p)
        Most Frequently Used Cache Size: 26%  1579 MB (c-p)

ARC Efficency:
        Cache Access Total:      30934237785
        Cache Hit Ratio:   99%   30933044424  [Defined State for buffer]
        Cache Miss Ratio:   0%   1193361      [Undefined State for Buffer]
        REAL Hit Ratio:    99%   30888972063  [MRU/MFU Hits Only]

        Data Demand Efficiency:    99%
        Data Prefetch Efficiency:  99%

        CACHE HITS BY CACHE LIST:
          Anon:                        0%  43145095                [ New Customer, First Cache Hit ]
          Most Recently Used:          0%  56656566 (mru)          [ Return Customer ]
          Most Frequently Used:       99%  30832315497 (mfu)       [ Frequent Customer ]
          Most Recently Used Ghost:    0%  280356 (mru_ghost)      [ Return Customer Evicted, Now Back ]
          Most Frequently Used Ghost:  0%  646910 (mfu_ghost)      [ Frequent Customer Evicted, Now Back ]
        CACHE HITS BY DATA TYPE:
          Demand Data:        78%  24388331448
          Prefetch Data:      19%  6013656064
          Demand Metadata:     1%  472836583
          Prefetch Metadata:   0%  58220329
        CACHE MISSES BY DATA TYPE:
          Demand Data:        21%  255389
          Prefetch Data:      32%  385838
          Demand Metadata:    31%  381534
          Prefetch Metadata:  14%  170600
---------------------------------------------
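To watch the ghost lists raw, without arc_summary, the kstats alone suffice; steadily climbing ghost hits mean the ARC keeps missing on buffers it was forced to evict, which is exactly the case L2ARC exists for:

$ kstat -p zfs:0:arcstats:mru_ghost_hits zfs:0:arcstats:mfu_ghost_hits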
Physical I/O
ZFS Breathing
Async writes grouped into Transaction Group (TXG) and “sync”ed to disk at regular intervals
txg_synctime represents expected time to flush a TXG (5 sec)
txg_timeout is typical frequency of flush (30 sec)
Pre-snv_87 txg_time flushed every 5 seconds (tunable)
Can be monitored via DTrace (FBT: spa_sync)
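A sketch of that monitoring: stamp each spa_sync and aggregate its duration; the wall-clock lines printed on entry show the breathing interval between syncs:

$ dtrace -n '
  fbt::spa_sync:entry {
      self->start = timestamp;
      printf("%Y: syncing txg %d", walltimestamp, arg1);
  }
  fbt::spa_sync:return /self->start/ {
      @["spa_sync duration (ns)"] = quantize(timestamp - self->start);
      self->start = 0;
  }'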
Read I/O
Reads are always synchronous
...but ARC absorbs it very, very well
ZFS Intent Log
ZIL throws a kink in the works
Without a Log Device (“SLOG”) the intent log is part of the pool
Can be monitored via DTrace (FBT: zil_commit_writer)
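Same pattern for the ZIL; latency here is what your synchronous writers actually feel:

$ dtrace -n '
  fbt::zil_commit_writer:entry { self->start = timestamp; }
  fbt::zil_commit_writer:return /self->start/ {
      @["zil_commit_writer latency (ns)"] = quantize(timestamp - self->start);
      self->start = 0;
  }'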
Disabling ZIL
Sync I/O treated as async
The fastest SSD won’t match the speed of zil_disable=1
ZFS is always consistent on disk, regardless
NFS corruption issues over-blown
If power is lost, in-flight data (uncommitted TXGs) is lost... this may cause logical corruption
In some situations, it’s a necessary evil... but have a good UPS
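For reference, the switch of this era was the kernel tunable visible in the ::zfs_params output earlier, set at boot via /etc/system (later builds replaced it with a per-dataset sync property):

* /etc/system: treat sync I/O as async -- dangerous, see caveats above
set zfs:zil_disable = 1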
IOSTAT IS DEAD
... deal with it.
iostat
asvc_t & %b have diminished meaning due to TXGs
Monitoring the physical devices is not as telling as watching from the VFS layer
What’s more interesting is the gap between TXG syncs... which is better monitored via DTrace than inferred
If you must, always use iostat & fsstat together, essentially to see what’s going in and out of ZFS (example below)
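For example, run the pair side by side (the 1-second intervals are illustrative) to compare what applications ask of ZFS with what ZFS actually asks of the disks:

$ fsstat zfs 1      # logical: VFS-level activity into ZFS
$ iostat -xnz 1     # physical: what actually reaches the devices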
The Write Throttle
Delays processes that are over-zealous
Tunable and can be disabled
Provides excellent hook to find out who is doing heavy I/O
Provides throughput numbers (fbt::dsl_pool_sync)
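Two sketches of those hooks (zfs_write is the VFS write entry point into ZFS, used here to attribute heavy writers; per-TXG throughput math varies by build, so only the timing skeleton is shown):

$ dtrace -n 'fbt::zfs_write:entry { @[execname] = count(); }'
$ dtrace -n '
  fbt::dsl_pool_sync:entry { self->start = timestamp; }
  fbt::dsl_pool_sync:return /self->start/ {
      @["dsl_pool_sync (ns)"] = quantize(timestamp - self->start);
      self->start = 0;
  }'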
Backing it up
Backup Architecture
Today’s data loads are getting too big for the traditional weekly-full, daily-incremental architecture
The full/incremental architecture was designed around tape rotation
Disk-to-disk backup is different, dump old assumptions
Backup is really an exercise in asynchronous replication
Joyent backup is rooted in NCDP (near-continuous data protection) ideology
Rsync & Co.
Useful for non-ZFS clients, but slow startup time can be a killer
Rsync to ZFS dataset, snapshot dataset, repeat. Instant backup retention solution.
For proper consistency with ZFS: snapshot, rsync from the snap, then release it. Example: DB data/logs.
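A sketch of that flow, with hypothetical dataset and host names (the .zfs directory is reachable by explicit path even when snapdir=hidden):

$ zfs snapshot tank/db@backup
$ rsync -a /tank/db/.zfs/snapshot/backup/ backuphost:/backups/db/
$ zfs destroy tank/db@backup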
ZFS Send/Recv
Exercise in snapshot replication
No startup lag like rsync.
Very fast; in head-to-head rsync vs zfs s/r, zfs s/r is on average 40% faster
Large improvements in performance added in snv_105
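The replication loop, sketched with hypothetical names: snapshot, send only the increment since the last common snapshot, repeat:

$ zfs snapshot tank/db@today
$ zfs send -i tank/db@yesterday tank/db@today | ssh backuphost zfs recv -F backup/db
$ zfs destroy tank/db@yesterday     # @today becomes the new common snapshot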
Block (Pool) Replication
AVS (SNDR) can replicate a pool’s block devices, but the replicated pool is unusable while replication runs
Same goes for similar block-level replication tools
NDMP
I’m watching NDMP with great interest
NDMPcopy native for Solaris is a big win
Solaris implementation will snapshot prior to copy and release behind
Supposedly can use zfs s/r for data in addition to dump/tar
Would allow for re-integration of traditional backup “managers” such as NetBackup
Parting Thoughts
Transactional Filesystem
Always bear in mind the transactional nature of ZFS
Throw away most of your UFS training... or at least update it
Consult the VFS layer before jumping to conclusions about device level activity
Re-Discover Storage
Spend time in the code with Dtrace
Attempt to challenge old assumptions
Benchmark for fun and profit (use FileBench!)
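A quick FileBench session might look like this (workload name and path are illustrative; command syntax varies across FileBench versions):

$ filebench
filebench> load varmail
filebench> set $dir=/tank/bench
filebench> run 60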
Focus on the Application
More than ever, benchmark application performance
Don’t assume that system level I/O metrics tell the whole story
Use DTrace to explore the interaction between your app and the VFS
Thank You.
Robert Milkowski
Jason Williams
Marcelo Leal
Max Bruning
James Dickens
Adrian Cockcroft
Jim Mauro
Richard McDougall
Tom Haynes
Peter Tribble
Jason King
Octave Orgeron
Brendan Gregg
Matty
Roch
Joerg Moellenkamp
...
These people are awesome!
Resources
Joyent: joyent.com
Cuddletech: cuddletech.com
Solaris Internals Wiki: solarisinternals.com/wiki
Also:
blogs.sun.com
planetsolaris.com