Fixing XFS Filesystems Faster
Dave Chinner <[email protected]>
Barry Naujok <[email protected]>
01/30/08 Slide 2
Overview
• The “Repair Problem”
• The “First Attempt”
• An “Alternate Solution”
• Analysis of Failure and Success
• The “Final Design”
• Results
• Futures
01/30/08 Slide 3
The “Repair Problem”
• Filesystem capacity grows faster than disk capabilities
• Number of objects indexed grows faster than the rate we can read them
• Repair reads every object in the filesystem
• Therefore, if repair doesn't get smarter, it will take longer as capacity grows
• 4 years ago a customer was very unhappy with xfs_repair taking 8 days to complete
01/30/08 Slide 4
What Does xfs_repair Do?
• Phase 1 – finds and validates primary metadata
• Phase 2 – reads in free space and inode locations
• Phase 3 – inode discovery and checking
• Phase 4 – extent discovery and checking
• Phase 5 – rebuilds free space and inode indexes
• Phase 6 – checks directory structure
• Phase 7 – checks link counts
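The phases above form a strictly ordered pipeline. As a purely illustrative sketch (hypothetical structure, not the actual xfs_repair source), the flow amounts to:

```c
/*
 * Hypothetical sketch of the repair pipeline, not the actual
 * xfs_repair source: the seven phases run in a fixed order, each
 * building on state gathered by the earlier ones.
 */
#include <stdio.h>

struct repair_ctx;                              /* global repair state (AGs, caches, ...) */
typedef void (*phase_fn)(struct repair_ctx *);

static void phase1(struct repair_ctx *c) { /* find and validate primary metadata */ (void)c; }
static void phase2(struct repair_ctx *c) { /* read free space and inode locations */ (void)c; }
static void phase3(struct repair_ctx *c) { /* inode discovery and checking */ (void)c; }
static void phase4(struct repair_ctx *c) { /* extent discovery and checking */ (void)c; }
static void phase5(struct repair_ctx *c) { /* rebuild free space and inode indexes */ (void)c; }
static void phase6(struct repair_ctx *c) { /* check directory structure */ (void)c; }
static void phase7(struct repair_ctx *c) { /* check link counts */ (void)c; }

int main(void)
{
        phase_fn phases[] = { phase1, phase2, phase3, phase4,
                              phase5, phase6, phase7 };

        for (unsigned i = 0; i < sizeof(phases) / sizeof(phases[0]); i++) {
                printf("Phase %u\n", i + 1);
                phases[i](NULL);                /* a real run passes shared repair state */
        }
        return 0;
}
```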
01/30/08 Slide 5
The “First Attempt”
• Was aimed at improving xfs_repair on Irix
• No kernel block device caching in Irix
• Lots of relatively slow CPUs, but with high I/O throughput
• Phases 3 and 4 scan each Allocation Group (AG) sequentially, but each AG is mostly self-contained
01/30/08 Slide 6
The “First Attempt”, Part 2
• Add hash-based block caching to xfs_repair
• Use a thread per AG and process multiple AGs at once
• Little I/O optimisation
  – mainly relying on multiple CPUs being able to issue I/O faster than a single process
  – some optimisation by batching synchronous readahead I/O
• Block-based caching was released for Linux in version 2.8.0
• Multithreading was released in version 2.8.11
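A minimal sketch of the “thread per AG” model using POSIX threads; the AG count is arbitrary and the per-AG work is only a placeholder, not the real repair code:

```c
/*
 * Sketch of "thread per AG": since each AG is mostly self-contained,
 * phase 3/4 style processing can run concurrently across AGs.
 * NUM_AGS and the worker body are assumptions for illustration.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_AGS 16                      /* assumed AG count for the example */

static void *ag_worker(void *arg)
{
        long agno = (long)arg;

        /* A real worker would scan the inodes and extents in this AG. */
        printf("processing AG %ld\n", agno);
        return NULL;
}

int main(void)
{
        pthread_t tids[NUM_AGS];

        for (long agno = 0; agno < NUM_AGS; agno++)
                if (pthread_create(&tids[agno], NULL, ag_worker, (void *)agno))
                        exit(1);
        for (int i = 0; i < NUM_AGS; i++)
                pthread_join(tids[i], NULL);
        return 0;
}
```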
01/30/08 Slide 7
An “Alternate Solution”
• Patch to 2.7.18 created by Agami Systems
• Used intelligent object-based prefetch to prime the kernel buffer cache
• Processed inodes passed off to prefetch threads to read in associated metadata
• Processes only a single AG at a time
• Faster on a single disk than 2.7.18 until it ran out of memory
• Much faster than 2.7.18 on multi-disk arrays
01/30/08 Slide 8
Success and Failure
[Chart: runtime (sec), 250GB SATA disk, 1.65M inodes – xfs_repair 2.7.18, 2.7.18 + Agami, 2.8.0, 2.8.10, 2.8.20]
[Chart: runtime (sec), 5.5TB RAID5 array, 37M inodes – xfs_repair 2.7.18, 2.7.18 + Agami, 2.8.0, 2.8.10, 2.8.20]
01/30/08 Slide 9
Analysis of Success and Failure
• We started by comparing 2.7.18 + Agami's patch against 2.7.18 and 2.8.20
• Surprise! In almost all cases, 2.8.x was much slower than 2.7.18.
• Block caching in xfs_repair was not working well at all on Linux
• Threading across AGs made it even worse.
01/30/08 Slide 10
Analysis of Failure
• The optimisations for Irix focussed on CPU-level parallelism
  – CPU bound, not I/O bound
• Linux analysis was done on CPUs 2-3x faster and a smaller I/O subsystem
  – I/O bound, not CPU bound
• Adding more seeks into an already I/O bound setup makes it slower, not faster
01/30/08 Slide 11
Analysis of Success
• The Agami patch used 10 threads to prefetch objects from a queue of 100, adding 10 objects at a time to the prefetch queue
• Prefetch threads do no processing, they only prime the kernel block device cache
• Processing thread feeds the prefetch queue as it processes objects it has read
• Speed-up comes from removing I/O latency from the processing thread (see the sketch below)
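The mechanism is essentially a bounded producer/consumer queue. A minimal sketch follows, with the queue depth and thread count taken from the slide; the device path, block size and one-block-at-a-time feeding are simplifying assumptions:

```c
/*
 * Minimal sketch of the Agami-style prefetch: worker threads read
 * queued metadata blocks purely to warm the kernel's block device
 * cache, so the single processing thread rarely blocks on I/O.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

#define QUEUE_DEPTH  100
#define NUM_PREFETCH 10
#define BLOCK_SIZE   4096

static uint64_t queue[QUEUE_DEPTH];
static int q_head, q_tail, q_count, q_done;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_notempty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t q_notfull = PTHREAD_COND_INITIALIZER;
static int dev_fd;

/* Called by the processing thread as it discovers metadata it will need soon. */
static void prefetch_queue_block(uint64_t blkno)
{
        pthread_mutex_lock(&q_lock);
        while (q_count == QUEUE_DEPTH)
                pthread_cond_wait(&q_notfull, &q_lock);
        queue[q_tail] = blkno;
        q_tail = (q_tail + 1) % QUEUE_DEPTH;
        q_count++;
        pthread_cond_signal(&q_notempty);
        pthread_mutex_unlock(&q_lock);
}

/* Prefetch worker: read the block and discard the data; the only goal
 * is that the kernel buffer cache now holds it for the processing thread. */
static void *prefetch_worker(void *arg)
{
        char buf[BLOCK_SIZE];
        (void)arg;

        for (;;) {
                pthread_mutex_lock(&q_lock);
                while (q_count == 0 && !q_done)
                        pthread_cond_wait(&q_notempty, &q_lock);
                if (q_count == 0) {             /* done and drained */
                        pthread_mutex_unlock(&q_lock);
                        return NULL;
                }
                uint64_t blkno = queue[q_head];
                q_head = (q_head + 1) % QUEUE_DEPTH;
                q_count--;
                pthread_cond_signal(&q_notfull);
                pthread_mutex_unlock(&q_lock);

                (void)pread(dev_fd, buf, BLOCK_SIZE, (off_t)(blkno * BLOCK_SIZE));
        }
}

int main(void)
{
        pthread_t tids[NUM_PREFETCH];

        dev_fd = open("/dev/sdX", O_RDONLY);    /* hypothetical block device */
        for (int i = 0; i < NUM_PREFETCH; i++)
                pthread_create(&tids[i], NULL, prefetch_worker, NULL);

        for (uint64_t b = 0; b < 1024; b++)     /* stand-in for real metadata locations */
                prefetch_queue_block(b);

        pthread_mutex_lock(&q_lock);
        q_done = 1;
        pthread_cond_broadcast(&q_notempty);
        pthread_mutex_unlock(&q_lock);
        for (int i = 0; i < NUM_PREFETCH; i++)
                pthread_join(tids[i], NULL);
        return 0;
}
```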
01/30/08 Slide 12
Rejecting Success!
• The Agami patch was superior to the existing threading, but we rejected it
• Not a cross-platform solution
  – needs to run on Irix and FreeBSD as well, which lack raw block device caching in the kernel
• Other technical reasons:
  – non-trivial porting effort to 2.8.x
  – cannot control cache usage or low-memory readahead thrashing
  – does not optimise I/O patterns at all
01/30/08 Slide 13
Are We Crazy? (YES!)
• But we'd seen the light!
• Object-based prefetch reduces I/O latency within an AG to speed up per-AG processing
• Per-AG parallelism allows saturation of larger, more complex storage configurations
• We could combine the two methods and go even faster!
01/30/08 Slide 14
Further Analysis
• Further analysis on a single-threaded repair:
  – tracing exact order of I/O from the repair process
  – identifying common patterns of metadata
    • often contiguous
    • lots of single blocks separated by a small number of data blocks
  – identifying sub-optimal I/O patterns
    • backwards seeks
    • seeks across a large portion of the disk
• Looking for ways to sequentialise and reduce the number of I/Os the repair process issued
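The kind of pattern classification this analysis relies on can be sketched as a single pass over a trace of (offset, length) requests; the thresholds below are illustrative assumptions, not the values used at the time:

```c
/*
 * Rough sketch of the trace analysis: walk the I/O requests in the
 * order repair issued them and classify the seek pattern.
 */
#include <stdint.h>
#include <stdio.h>

struct io_req { uint64_t offset; uint32_t len; };        /* byte offset and length */

static void analyse_trace(const struct io_req *t, size_t n)
{
        const uint64_t LONG_SEEK = 1ULL << 30;   /* ~1GB: "a large portion of the disk" */
        const uint64_t MERGE_GAP = 64 * 1024;    /* small enough to read through instead of seeking */
        uint64_t backwards = 0, long_seeks = 0, mergeable = 0;

        for (size_t i = 1; i < n; i++) {
                uint64_t prev_end = t[i - 1].offset + t[i - 1].len;

                if (t[i].offset < prev_end)
                        backwards++;                     /* backwards seek */
                else if (t[i].offset - prev_end > LONG_SEEK)
                        long_seeks++;                    /* long-distance seek */
                else if (t[i].offset - prev_end <= MERGE_GAP)
                        mergeable++;                     /* candidate for one larger sequential I/O */
        }
        printf("%zu I/Os: %llu backwards, %llu long seeks, %llu mergeable\n",
               n, (unsigned long long)backwards,
               (unsigned long long)long_seeks, (unsigned long long)mergeable);
}

int main(void)
{
        /* Tiny hand-made trace: a first read, then a mergeable gap,
         * a backwards seek, and a long seek. */
        struct io_req trace[] = {
                { 0,          4096 }, { 8192,       4096 },
                { 4096,       4096 }, { 2ULL << 30, 4096 },
        };
        analyse_trace(trace, sizeof(trace) / sizeof(trace[0]));
        return 0;
}
```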
01/30/08 Slide 15
The “Final Solution”
• All patches included in xfs_repair version 2.9.4
• Added a pair of per-AG prefetch queues
  – one for blocks ahead of the current location
  – one for blocks behind the current location
  – a second pass for the “behind” blocks removes backwards seeks
• Prefetch threads process the queue
  – identify contiguous blocks and metadata-dense sparse ranges
  – issue a single large I/O and throw away the non-metadata blocks
  – use bandwidth instead of seeks to read metadata blocks that are close together
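A sketch of that coalescing step, assuming 4KB blocks, an arbitrary gap threshold and a hypothetical cache_insert() helper; the point is that one large sequential read replaces many small, seeky ones:

```c
/*
 * Sketch of range coalescing: given metadata block numbers sorted in
 * ascending order, read each dense run with one large I/O and keep
 * only the blocks we actually wanted. Block size, gap threshold and
 * cache_insert() are assumptions for illustration.
 */
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK_SIZE 4096
#define MAX_GAP    16           /* read through up to 16 non-metadata blocks */

extern void cache_insert(uint64_t blkno, const void *data);    /* hypothetical cache helper */

void prefetch_range(int fd, const uint64_t *blks, size_t n)
{
        size_t start = 0;

        while (start < n) {
                size_t end = start;

                /* Extend the run while the gap to the next wanted block is small. */
                while (end + 1 < n && blks[end + 1] - blks[end] <= MAX_GAP)
                        end++;

                uint64_t first = blks[start], last = blks[end];
                size_t span = (size_t)(last - first + 1) * BLOCK_SIZE;
                char *buf = malloc(span);

                /* One large sequential read instead of (end - start + 1) seeks. */
                if (buf && pread(fd, buf, span, (off_t)(first * BLOCK_SIZE)) == (ssize_t)span) {
                        for (size_t i = start; i <= end; i++)
                                cache_insert(blks[i],
                                             buf + (blks[i] - first) * BLOCK_SIZE);
                }
                free(buf);      /* the non-metadata filler is simply discarded */
                start = end + 1;
        }
}
```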
01/30/08 Slide 16
The “Final Solution”, Part 2
• Processing thread could stall on blocks in the “behind” queue
  – prefetch threads switch queues if the primary block queue starts to run low
• Block cache needed work:
  – needed locking to be thread-safe
  – different phases read metadata in different block sizes
    • used to purge the cache between phases and re-read blocks
    • made all I/O sizes the same -> no re-reads between phases
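A minimal sketch of those two cache properties, per-bucket locking and a single I/O size for every entry; this is illustrative only, not the libxfs cache implementation:

```c
/*
 * Sketch of a thread-safe, hash-indexed block cache where every entry
 * uses the same I/O size, so later phases get hits instead of forcing
 * a purge and re-read. Layout and sizes are assumptions.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_IO_SIZE 4096              /* one I/O size for all phases */
#define HASH_BUCKETS  1024

struct cache_entry {
        uint64_t blkno;
        char *data;                     /* CACHE_IO_SIZE bytes */
        struct cache_entry *next;
};

static struct cache_entry *buckets[HASH_BUCKETS];
static pthread_mutex_t bucket_locks[HASH_BUCKETS];

void cache_init(void)
{
        for (int i = 0; i < HASH_BUCKETS; i++)
                pthread_mutex_init(&bucket_locks[i], NULL);
}

static unsigned hash(uint64_t blkno)
{
        return (unsigned)(blkno % HASH_BUCKETS);
}

/* Look up a block; returns NULL on a miss. Safe from any thread. */
char *cache_lookup(uint64_t blkno)
{
        unsigned h = hash(blkno);
        char *data = NULL;

        pthread_mutex_lock(&bucket_locks[h]);
        for (struct cache_entry *e = buckets[h]; e; e = e->next) {
                if (e->blkno == blkno) {
                        data = e->data;
                        break;
                }
        }
        pthread_mutex_unlock(&bucket_locks[h]);
        return data;
}

/* Insert a block that was read with the common I/O size. */
void cache_add(uint64_t blkno, const void *data)
{
        unsigned h = hash(blkno);
        struct cache_entry *e = malloc(sizeof(*e));

        e->blkno = blkno;
        e->data = malloc(CACHE_IO_SIZE);
        memcpy(e->data, data, CACHE_IO_SIZE);

        pthread_mutex_lock(&bucket_locks[h]);
        e->next = buckets[h];
        buckets[h] = e;
        pthread_mutex_unlock(&bucket_locks[h]);
}
```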
01/30/08 Slide 17
The “Final Solution”, Part 3
• Phase 6 – directory scanning was improved
  – now uses the same inode scanning as Phases 3 and 4
  – visits each directory and inode, counting links in a more I/O-efficient manner
• Phase 7 – link count verification
  – used to need another inode scan to record link counts in inodes
  – counts are now recorded in Phase 3 and compared to the counts calculated in Phase 6
  – only does I/O if they differ
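A sketch of the Phase 7 change, assuming the recorded and calculated counts are held in parallel arrays and a hypothetical repair_inode_nlink() helper does the corrective I/O:

```c
/*
 * Sketch of the phase 7 comparison: the on-disk link counts noted
 * during phase 3 are checked against the counts calculated while
 * walking directories in phase 6; only mismatches cause further I/O.
 */
#include <stdint.h>
#include <stdio.h>

extern void repair_inode_nlink(uint64_t ino, uint32_t correct);   /* hypothetical: re-reads and rewrites the inode */

void phase7_check(const uint64_t *inos,
                  const uint32_t *recorded,      /* nlink seen on disk in phase 3 */
                  const uint32_t *calculated,    /* links counted in phase 6 */
                  size_t n)
{
        for (size_t i = 0; i < n; i++) {
                if (recorded[i] == calculated[i])
                        continue;                /* no I/O needed for this inode */
                printf("inode %llu: nlink %u should be %u\n",
                       (unsigned long long)inos[i], recorded[i], calculated[i]);
                repair_inode_nlink(inos[i], calculated[i]);
        }
}
```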
01/30/08 Slide 18
The “Final Solution”, Part 4
• Per-AG parallelism enhanced with “ag_stride”
  – avoids parallel processing of AGs on the same disks
  – if Phase 3 does not overflow the cache, Phase 4 is fully parallelised without needing I/O
• Low-memory behaviour optimised
  – cached blocks are given a priority based on:
    • how likely they are to be used again
    • how expensive they were to read in initially
  – low-priority blocks are purged first when the cache overflows
  – free blocks are reused to prevent heap fragmentation
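A sketch of the priority-based purging; the priority levels and one-list-per-priority layout are assumptions for illustration, not the actual libxfs cache code:

```c
/*
 * Sketch of priority-based purging when the cache overflows:
 * entries unlikely to be needed again, or cheap to re-read, go first.
 */
#include <stdlib.h>

enum cache_prio {
        PRIO_THROWAWAY = 0,     /* filler dragged in by a large I/O, unlikely to be needed */
        PRIO_CHEAP,             /* likely needed again but cheap to re-read */
        PRIO_EXPENSIVE,         /* likely needed again and expensive to fetch (isolated, long seek) */
        PRIO_LEVELS
};

struct prio_entry {
        void *data;
        struct prio_entry *next;
};

static struct prio_entry *prio_lists[PRIO_LEVELS];      /* one list per priority */
static size_t cache_used, cache_limit;                  /* entries in use vs. allowed */

/* Purge from the lowest priority upwards until we are back under the limit. */
void cache_shake(void)
{
        for (int p = PRIO_THROWAWAY; p < PRIO_LEVELS && cache_used > cache_limit; p++) {
                while (prio_lists[p] && cache_used > cache_limit) {
                        struct prio_entry *e = prio_lists[p];

                        prio_lists[p] = e->next;
                        free(e->data);
                        free(e);
                        cache_used--;
                }
        }
}
```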
01/30/08 Slide 19
Generating Test Filesystems
• Need to simulate aged filesystems
• Script runs at least 10 processes in parallel
• Each process
  – creates variable-sized files at varying directory depths
  – uses small direct I/Os to cause non-optimal allocation patterns
  – has a 10% probability of deleting a file instead of creating one
• Results in:
  – large and fragmented directory structures
  – physically separated inode chunks
  – fragmented files and hence randomly varying inode extent lists
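A rough sketch of such an ageing workload (the actual tool was a script; the paths, file counts and sizes here are arbitrary assumptions):

```c
/*
 * Sketch of the ageing workload: several processes creating variably
 * sized files at varying directory depths with small direct I/Os, and
 * occasionally deleting a file, to fragment free space, inode chunks
 * and extent lists.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROCS     10
#define FILES_EACH 10000
#define IO_SIZE    4096                 /* small direct I/Os -> poor allocation patterns */

static void age_worker(int id)
{
        char path[256], last[256] = "";
        void *buf;

        if (posix_memalign(&buf, IO_SIZE, IO_SIZE))     /* O_DIRECT wants aligned buffers */
                return;
        srandom(id);

        for (int i = 0; i < FILES_EACH; i++) {
                if (last[0] && random() % 10 == 0) {    /* 10%: delete instead of create */
                        unlink(last);
                        last[0] = '\0';
                        continue;
                }
                /* Build a path at a random depth, creating directories as we go. */
                int len = snprintf(path, sizeof(path), "/mnt/test/w%d", id);
                mkdir(path, 0755);
                int depth = random() % 4;
                for (int d = 0; d < depth; d++) {
                        len += snprintf(path + len, sizeof(path) - len, "/d%d", d);
                        mkdir(path, 0755);
                }
                snprintf(path + len, sizeof(path) - len, "/f%d", i);

                int fd = open(path, O_CREAT | O_WRONLY | O_DIRECT, 0644);
                if (fd < 0)
                        continue;
                int chunks = 1 + random() % 64;         /* variable file sizes */
                for (int c = 0; c < chunks; c++)
                        if (write(fd, buf, IO_SIZE) != IO_SIZE)
                                break;
                close(fd);
                snprintf(last, sizeof(last), "%s", path);
        }
        free(buf);
}

int main(void)
{
        for (int i = 0; i < NPROCS; i++)
                if (fork() == 0) {
                        age_worker(i);
                        _exit(0);
                }
        for (int i = 0; i < NPROCS; i++)
                wait(NULL);
        return 0;
}
```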
01/30/08 Slide 20
The Results
• Test system #1 – Desktop/Workstation
  – dual-processor x86_64, 2GB RAM, single 250GB SATA disk
• 100,000 inodes, 7% full
• 400,000 inodes, 100% full
• 815,000 inodes, 100% full
• 1.65M inodes, 100% full
• 5.7M inodes, 100% full
• 11M inodes, 37% full
• 17M inodes, 100% full
01/30/08 Slide 21
250GB SATA Disk 100,000 Inodes
[Chart: runtime (sec) by phase (Phases 1–7 and Total) – 2.7.18, Agami (2.7.18), 2.8.20, 2.9.4]
01/30/08 Slide 22
250GB SATA Disk 400,000 Inodes
[Chart: runtime (sec) by phase (Phases 1–7 and Total) – 2.7.18, Agami (2.7.18), 2.8.20, 2.9.4]
01/30/08 Slide 23
250GB SATA Disk 800,000 Inodes
[Chart: runtime (sec) by phase (Phases 1–7 and Total) – 2.7.18, Agami (2.7.18), 2.8.20, 2.9.4]
01/30/08 Slide 24
250GB SATA Disk – 1.65M Inodes
[Chart: runtime (sec) by phase (Phases 1–7 and Total) – 2.7.18, Agami (2.7.18), 2.8.20, 2.9.4]
01/30/08 Slide 25
250GB SATA Disk 5.7M Inodes
[Chart: runtime (sec) by phase (Phases 1–7 and Total) – 2.7.18, Agami (2.7.18), 2.8.20, 2.9.4, 2.9.4 + bhash=65536]
01/30/08 Slide 26
250GB SATA Disk 11M Inodes
[Chart: runtime (sec) by phase (Phases 1–7 and Total) – 2.7.18, Agami (2.7.18), 2.8.20, 2.9.4]
01/30/08 Slide 27
250GB SATA Disk 17M Inodes
[Chart: runtime (sec) by phase (Phases 1–7 and Total) – 2.7.18, Agami (2.7.18), 2.8.20 + bhash=4096, 2.9.4]
01/30/08 Slide 28
250GB SATA Disk – Runtime Scaling
[Chart: runtime (sec) vs. number of inodes in the filesystem, two panels – “Cache < Memory” (100k, 400k, 815k, 1.65M inodes) and “Cache > Memory” (5.7M, 11M, 17M inodes) – 2.7.18, Agami (2.7.18), 2.8.20, 2.9.4]
01/30/08 Slide 29
250GB SATA Disk – Inode Processing Rate
[Chart: scan rate (inodes/sec) vs. number of inodes in the filesystem (100k–17M) – 2.7.18, Agami (2.7.18), 2.8.20, 2.9.4, ext3 – bigger is better]
01/30/08 Slide 30
More Results
• Test system #2 – large server
  – 4p ia64, 48GB RAM
    • 5-way RAID0 stripe of 4+1 hardware RAID5 LUNs, 5.5TB capacity
  – 6M inodes, 80% full
  – 30M inodes, 100% full
  – 300M inodes, 60% full
01/30/08 Slide 31
5.5TB Volume 6M Inodes
[Chart: runtime (sec) by phase (Phases 1–7 and Total) – 2.7.18, Agami (2.7.18), 2.8.20, 2.9.4]
01/30/08 Slide 32
5.5TB Volume 30M Inodes
[Chart: runtime (sec) by phase (Phases 1–7 and Total) – 2.7.18, Agami (2.7.18), 2.8.20, 2.9.4]
01/30/08 Slide 33
5.5TB Volume 300M Inodes
[Chart: runtime (sec) by phase (Phases 1–7 and Total) – 2.7.18, Agami (2.7.18), 2.8.20, 2.9.4]
01/30/08 Slide 34
5.5TB 300M Inodes, Part 2
[Chart: runtime (sec) by phase (Phases 1–7 and Total), ag_stride comparison – 2.7.18, Agami (2.7.18), 2.9.4, 2.9.4 + ag_stride=1, 2.9.4+ags=1+pf=16]
01/30/08 Slide 35
5.5TB Volume – Runtime Scaling
[Chart: runtime (sec) vs. number of inodes in the filesystem (6M, 30M, 300M) – 2.7.18, Agami (2.7.18), 2.8.20, 2.9.4, 2.9.4+ags=1, 2.9.4+ags=1+pf]
01/30/08 Slide 36
5.5TB Volume – Runtime Scaling
[Chart: runtime (sec) vs. number of inodes in the filesystem (6M, 30M, 300M) – 2.7.18, Agami (2.7.18), 2.9.4, 2.9.4+ags=1, 2.9.4+ags=1+pf]
01/30/08 Slide 37
5.5TB Volume – Inode Processing Rate
[Chart: scan rate (inodes/sec) vs. number of inodes in the filesystem (6M, 30M, 300M) – 2.7.18, Agami (2.7.18), 2.8.20, 2.9.4, 2.9.4+ags=1, 2.9.4+ags=1+pf – bigger is better]
01/30/08 Slide 38
5.5TB Volume – Low Memory
[Chart: runtime (sec) for 2.7.18, Agami (2.7.18) and 2.9.4 with 48GB RAM vs. 2GB RAM – tests used the 30M inode filesystem]
01/30/08 Slide 39
Futures
• Memory usage reductions
  – allow larger filesystems to be checked in small RAM configs
  – introduce more efficient indexing structures
  – use extents for indexing free space
• Performance
  – multithreading of Phase 6
  – directory name hash checking scalability
  – trade memory usage savings for larger caches
• Robustness
  – Phase 1 on badly broken filesystems
  – preservation of broken directories
01/30/08 Slide 40
Questions?