Fixing XFS Filesystems Faster
Dave Chinner <[email protected]>
Barry Naujok <[email protected]>
01/30/08 Slide 2
Overview
• The “Repair Problem”
• The “First Attempt”
• An “Alternate Solution”
• Analysis of Failure and Success
• The “Final Design”
• Results
• Futures
01/30/08 Slide 3
The “Repair Problem”
• Filesystem capacity grows faster than disk capabilities
• Number of objects indexed grows faster than the rate we can read them
• Repair reads every object in the filesystem
• Therefore, if repair doesn't get smarter, it will take longer as capacity grows
• 4 years ago a customer was very unhappy with xfs_repair taking 8 days to complete
01/30/08 Slide 4
What Does xfs_repair Do?
• Phase 1 – finds and validates primary metadata
• Phase 2 – reads in free space and inode locations
• Phase 3 – inode discovery and checking
• Phase 4 – extent discovery and checking
• Phase 5 – rebuilds free space and inode indexes
• Phase 6 – checks directory structure
• Phase 7 – checks link counts
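The phases above form a strictly ordered pipeline. As a purely illustrative sketch (hypothetical structure, not the actual xfs_repair source), the flow amounts to:

```c
/*
 * Hypothetical sketch of the repair pipeline, not the actual
 * xfs_repair source: the seven phases run in a fixed order, each
 * building on state gathered by the earlier ones.
 */
#include <stdio.h>

struct repair_ctx;                              /* global repair state (AGs, caches, ...) */
typedef void (*phase_fn)(struct repair_ctx *);

static void phase1(struct repair_ctx *c) { /* find and validate primary metadata */ (void)c; }
static void phase2(struct repair_ctx *c) { /* read free space and inode locations */ (void)c; }
static void phase3(struct repair_ctx *c) { /* inode discovery and checking */ (void)c; }
static void phase4(struct repair_ctx *c) { /* extent discovery and checking */ (void)c; }
static void phase5(struct repair_ctx *c) { /* rebuild free space and inode indexes */ (void)c; }
static void phase6(struct repair_ctx *c) { /* check directory structure */ (void)c; }
static void phase7(struct repair_ctx *c) { /* check link counts */ (void)c; }

int main(void)
{
        phase_fn phases[] = { phase1, phase2, phase3, phase4,
                              phase5, phase6, phase7 };

        for (unsigned i = 0; i < sizeof(phases) / sizeof(phases[0]); i++) {
                printf("Phase %u\n", i + 1);
                phases[i](NULL);                /* a real run passes shared repair state */
        }
        return 0;
}
```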
01/30/08 Slide 5
The “First Attempt”
• Was aimed at improving xfs_repair on Irix
• No kernel block device caching in Irix
• Lots of relatively slow CPUs, but with high I/O throughput
• Phases 3 and 4 scan each Allocation Group (AG) sequentially, but each AG is mostly self-contained
01/30/08 Slide 6
The “First Attempt”, Part 2
• Add hash-based block caching to xfs_repair
• Use a thread per AG and process multiple AGs at once
• Little I/O optimisation
  – mainly relying on multiple CPUs being able to issue I/O faster than a single process
  – some optimisation by batching synchronous readahead I/O
• Block-based caching was released for Linux in version 2.8.0
• Multithreading was released in version 2.8.11
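A minimal sketch of the “thread per AG” model using POSIX threads; the AG count is arbitrary and the per-AG work is only a placeholder, not the real repair code:

```c
/*
 * Sketch of "thread per AG": since each AG is mostly self-contained,
 * phase 3/4 style processing can run concurrently across AGs.
 * NUM_AGS and the worker body are assumptions for illustration.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_AGS 16                      /* assumed AG count for the example */

static void *ag_worker(void *arg)
{
        long agno = (long)arg;

        /* A real worker would scan the inodes and extents in this AG. */
        printf("processing AG %ld\n", agno);
        return NULL;
}

int main(void)
{
        pthread_t tids[NUM_AGS];

        for (long agno = 0; agno < NUM_AGS; agno++)
                if (pthread_create(&tids[agno], NULL, ag_worker, (void *)agno))
                        exit(1);
        for (int i = 0; i < NUM_AGS; i++)
                pthread_join(tids[i], NULL);
        return 0;
}
```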
01/30/08 Slide 7
An “Alternate Solution”
• Patch to 2.7.18 created by Agami Systems
• Used intelligent object-based prefetch to prime the kernel buffer cache
• Processed inodes passed off to prefetch threads to read in associated metadata
• Processes only a single AG at a time
• Faster on a single disk than 2.7.18 until it ran out of memory
• Much faster than 2.7.18 on multi-disk arrays
01/30/08 Slide 8
Success and Failure
[Chart: runtime (sec), 250GB SATA disk, 1.65M inodes – xfs_repair 2.7.18, 2.7.18 + Agami, 2.8.0, 2.8.10, 2.8.20]
[Chart: runtime (sec), 5.5TB RAID5 array, 37M inodes – xfs_repair 2.7.18, 2.7.18 + Agami, 2.8.0, 2.8.10, 2.8.20]
01/30/08 Slide 9
Analysis of Success and Failure
• We started by comparing 2.7.18 + Agami's patch against 2.7.18 and 2.8.20
• Surprise! In almost all cases, 2.8.x was much slower than 2.7.18.
• Block caching in xfs_repair was not working well at all on Linux
• Threading across AGs made it even worse.
01/30/08 Slide 10
Analysis of Failure
• The optimisations for Irix focussed on CPU-level parallelism
  – CPU bound, not I/O bound
• Linux analysis was done on CPUs 2-3x faster and a smaller I/O subsystem
  – I/O bound, not CPU bound
• Adding more seeks into an already I/O bound setup makes it slower, not faster
01/30/08 Slide 11
Analysis of Success
• The Agami patch used 10 threads to prefetch objects from a queue of 100, adding 10 objects at a time to the prefetch queue
• Prefetch threads do no processing, they only prime the kernel block device cache
• Processing thread feeds the prefetch queue as it processes objects it has read
• Speed-up comes from removing I/O latency from the processing thread (see the sketch below)
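The mechanism is essentially a bounded producer/consumer queue. A minimal sketch follows, with the queue depth and thread count taken from the slide; the device path, block size and one-block-at-a-time feeding are simplifying assumptions:

```c
/*
 * Minimal sketch of the Agami-style prefetch: worker threads read
 * queued metadata blocks purely to warm the kernel's block device
 * cache, so the single processing thread rarely blocks on I/O.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

#define QUEUE_DEPTH  100
#define NUM_PREFETCH 10
#define BLOCK_SIZE   4096

static uint64_t queue[QUEUE_DEPTH];
static int q_head, q_tail, q_count, q_done;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_notempty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t q_notfull = PTHREAD_COND_INITIALIZER;
static int dev_fd;

/* Called by the processing thread as it discovers metadata it will need soon. */
static void prefetch_queue_block(uint64_t blkno)
{
        pthread_mutex_lock(&q_lock);
        while (q_count == QUEUE_DEPTH)
                pthread_cond_wait(&q_notfull, &q_lock);
        queue[q_tail] = blkno;
        q_tail = (q_tail + 1) % QUEUE_DEPTH;
        q_count++;
        pthread_cond_signal(&q_notempty);
        pthread_mutex_unlock(&q_lock);
}

/* Prefetch worker: read the block and discard the data; the only goal
 * is that the kernel buffer cache now holds it for the processing thread. */
static void *prefetch_worker(void *arg)
{
        char buf[BLOCK_SIZE];
        (void)arg;

        for (;;) {
                pthread_mutex_lock(&q_lock);
                while (q_count == 0 && !q_done)
                        pthread_cond_wait(&q_notempty, &q_lock);
                if (q_count == 0) {             /* done and drained */
                        pthread_mutex_unlock(&q_lock);
                        return NULL;
                }
                uint64_t blkno = queue[q_head];
                q_head = (q_head + 1) % QUEUE_DEPTH;
                q_count--;
                pthread_cond_signal(&q_notfull);
                pthread_mutex_unlock(&q_lock);

                (void)pread(dev_fd, buf, BLOCK_SIZE, (off_t)(blkno * BLOCK_SIZE));
        }
}

int main(void)
{
        pthread_t tids[NUM_PREFETCH];

        dev_fd = open("/dev/sdX", O_RDONLY);    /* hypothetical block device */
        for (int i = 0; i < NUM_PREFETCH; i++)
                pthread_create(&tids[i], NULL, prefetch_worker, NULL);

        for (uint64_t b = 0; b < 1024; b++)     /* stand-in for real metadata locations */
                prefetch_queue_block(b);

        pthread_mutex_lock(&q_lock);
        q_done = 1;
        pthread_cond_broadcast(&q_notempty);
        pthread_mutex_unlock(&q_lock);
        for (int i = 0; i < NUM_PREFETCH; i++)
                pthread_join(tids[i], NULL);
        return 0;
}
```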
01/30/08 Slide 12
Rejecting Success!
• The Agami patch was superior to the existing threading, but we rejected it
• Not a cross-platform solution
  – needs to run on Irix and FreeBSD as well, which lack raw block device caching in the kernel
• Other technical reasons:
  – non-trivial porting effort to 2.8.x
  – cannot control cache usage or low-memory readahead thrashing
  – does not optimise I/O patterns at all
01/30/08 Slide 13
Are We Crazy? (YES!)
• But we'd seen the light!
• Object-based prefetch reduces I/O latency within an AG to speed up per-AG processing
• Per-AG parallelism allows saturation of larger, more complex storage configurations
• We could combine the two methods and go even faster!
01/30/08 Slide 14
Further Analysis
• Further analysis on a single-threaded repair:
  – tracing exact order of I/O from the repair process
  – identifying common patterns of metadata
    • often contiguous
    • lots of single blocks separated by a small number of data blocks
  – identifying sub-optimal I/O patterns
    • backwards seeks
    • seeks across a large portion of the disk
• Looking for ways to sequentialise and reduce the number of I/Os the repair process issued
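The kind of pattern classification this analysis relies on can be sketched as a single pass over a trace of (offset, length) requests; the thresholds below are illustrative assumptions, not the values used at the time:

```c
/*
 * Rough sketch of the trace analysis: walk the I/O requests in the
 * order repair issued them and classify the seek pattern.
 */
#include <stdint.h>
#include <stdio.h>

struct io_req { uint64_t offset; uint32_t len; };        /* byte offset and length */

static void analyse_trace(const struct io_req *t, size_t n)
{
        const uint64_t LONG_SEEK = 1ULL << 30;   /* ~1GB: "a large portion of the disk" */
        const uint64_t MERGE_GAP = 64 * 1024;    /* small enough to read through instead of seeking */
        uint64_t backwards = 0, long_seeks = 0, mergeable = 0;

        for (size_t i = 1; i < n; i++) {
                uint64_t prev_end = t[i - 1].offset + t[i - 1].len;

                if (t[i].offset < prev_end)
                        backwards++;                     /* backwards seek */
                else if (t[i].offset - prev_end > LONG_SEEK)
                        long_seeks++;                    /* long-distance seek */
                else if (t[i].offset - prev_end <= MERGE_GAP)
                        mergeable++;                     /* candidate for one larger sequential I/O */
        }
        printf("%zu I/Os: %llu backwards, %llu long seeks, %llu mergeable\n",
               n, (unsigned long long)backwards,
               (unsigned long long)long_seeks, (unsigned long long)mergeable);
}

int main(void)
{
        /* Tiny hand-made trace: a first read, then a mergeable gap,
         * a backwards seek, and a long seek. */
        struct io_req trace[] = {
                { 0,          4096 }, { 8192,       4096 },
                { 4096,       4096 }, { 2ULL << 30, 4096 },
        };
        analyse_trace(trace, sizeof(trace) / sizeof(trace[0]));
        return 0;
}
```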
01/30/08 Slide 15
The “Final Solution”
• All patches included in xfs_repair version 2.9.4
• Added a pair of per-AG prefetch queues
  – one for blocks ahead of the current location
  – one for blocks behind the current location
  – a second pass for the “behind” blocks removes backwards seeks
• Prefetch threads process the queue
  – identify contiguous blocks and metadata-dense sparse ranges
  – issue a single large I/O and throw away the non-metadata blocks
  – use bandwidth instead of seeks to read metadata blocks that are close together
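A sketch of that coalescing step, assuming 4KB blocks, an arbitrary gap threshold and a hypothetical cache_insert() helper; the point is that one large sequential read replaces many small, seeky ones:

```c
/*
 * Sketch of range coalescing: given metadata block numbers sorted in
 * ascending order, read each dense run with one large I/O and keep
 * only the blocks we actually wanted. Block size, gap threshold and
 * cache_insert() are assumptions for illustration.
 */
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK_SIZE 4096
#define MAX_GAP    16           /* read through up to 16 non-metadata blocks */

extern void cache_insert(uint64_t blkno, const void *data);    /* hypothetical cache helper */

void prefetch_range(int fd, const uint64_t *blks, size_t n)
{
        size_t start = 0;

        while (start < n) {
                size_t end = start;

                /* Extend the run while the gap to the next wanted block is small. */
                while (end + 1 < n && blks[end + 1] - blks[end] <= MAX_GAP)
                        end++;

                uint64_t first = blks[start], last = blks[end];
                size_t span = (size_t)(last - first + 1) * BLOCK_SIZE;
                char *buf = malloc(span);

                /* One large sequential read instead of (end - start + 1) seeks. */
                if (buf && pread(fd, buf, span, (off_t)(first * BLOCK_SIZE)) == (ssize_t)span) {
                        for (size_t i = start; i <= end; i++)
                                cache_insert(blks[i],
                                             buf + (blks[i] - first) * BLOCK_SIZE);
                }
                free(buf);      /* the non-metadata filler is simply discarded */
                start = end + 1;
        }
}
```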
01/30/08 Slide 16
The “Final Solution”, Part 2
• Processing thread could stall on blocks in the “behind” queue
  – prefetch threads switch queues if the primary block queue starts to run low
• Block cache needed work:
  – needed locking to be thread-safe
  – different phases read metadata in different block sizes
    • used to purge the cache between phases and re-read blocks
    • made all I/O sizes the same -> no re-reads between phases
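A minimal sketch of those two cache properties, per-bucket locking and a single I/O size for every entry; this is illustrative only, not the libxfs cache implementation:

```c
/*
 * Sketch of a thread-safe, hash-indexed block cache where every entry
 * uses the same I/O size, so later phases get hits instead of forcing
 * a purge and re-read. Layout and sizes are assumptions.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_IO_SIZE 4096              /* one I/O size for all phases */
#define HASH_BUCKETS  1024

struct cache_entry {
        uint64_t blkno;
        char *data;                     /* CACHE_IO_SIZE bytes */
        struct cache_entry *next;
};

static struct cache_entry *buckets[HASH_BUCKETS];
static pthread_mutex_t bucket_locks[HASH_BUCKETS];

void cache_init(void)
{
        for (int i = 0; i < HASH_BUCKETS; i++)
                pthread_mutex_init(&bucket_locks[i], NULL);
}

static unsigned hash(uint64_t blkno)
{
        return (unsigned)(blkno % HASH_BUCKETS);
}

/* Look up a block; returns NULL on a miss. Safe from any thread. */
char *cache_lookup(uint64_t blkno)
{
        unsigned h = hash(blkno);
        char *data = NULL;

        pthread_mutex_lock(&bucket_locks[h]);
        for (struct cache_entry *e = buckets[h]; e; e = e->next) {
                if (e->blkno == blkno) {
                        data = e->data;
                        break;
                }
        }
        pthread_mutex_unlock(&bucket_locks[h]);
        return data;
}

/* Insert a block that was read with the common I/O size. */
void cache_add(uint64_t blkno, const void *data)
{
        unsigned h = hash(blkno);
        struct cache_entry *e = malloc(sizeof(*e));

        e->blkno = blkno;
        e->data = malloc(CACHE_IO_SIZE);
        memcpy(e->data, data, CACHE_IO_SIZE);

        pthread_mutex_lock(&bucket_locks[h]);
        e->next = buckets[h];
        buckets[h] = e;
        pthread_mutex_unlock(&bucket_locks[h]);
}
```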
01/30/08 Slide 17
The “Final Solution”, Part 3
• Phase 6 – directory scanning was improved
  – now uses the same inode scanning as Phases 3 and 4
  – visits each directory and inode, counting links in a more I/O-efficient manner
• Phase 7 – link count verification
  – used to need another inode scan to record link counts in inodes
  – counts are now recorded in Phase 3 and compared to the counts calculated in Phase 6
  – only does I/O if they differ
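A sketch of the Phase 7 change, assuming the recorded and calculated counts are held in parallel arrays and a hypothetical repair_inode_nlink() helper does the corrective I/O:

```c
/*
 * Sketch of the phase 7 comparison: the on-disk link counts noted
 * during phase 3 are checked against the counts calculated while
 * walking directories in phase 6; only mismatches cause further I/O.
 */
#include <stdint.h>
#include <stdio.h>

extern void repair_inode_nlink(uint64_t ino, uint32_t correct);   /* hypothetical: re-reads and rewrites the inode */

void phase7_check(const uint64_t *inos,
                  const uint32_t *recorded,      /* nlink seen on disk in phase 3 */
                  const uint32_t *calculated,    /* links counted in phase 6 */
                  size_t n)
{
        for (size_t i = 0; i < n; i++) {
                if (recorded[i] == calculated[i])
                        continue;                /* no I/O needed for this inode */
                printf("inode %llu: nlink %u should be %u\n",
                       (unsigned long long)inos[i], recorded[i], calculated[i]);
                repair_inode_nlink(inos[i], calculated[i]);
        }
}
```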
01/30/08 Slide 18
The “Final Solution”, Part 4
• Per-AG parallelism enhanced with “ag_stride”
  – avoids parallel processing of AGs on the same disks
  – if Phase 3 does not overflow the cache, Phase 4 is fully parallelised without needing I/O
• Low-memory behaviour optimised
  – cached blocks are given a priority based on:
    • how likely they are to be used again
    • how expensive they were to read in initially
  – low-priority blocks are purged first when the cache overflows
  – free blocks are reused to prevent heap fragmentation
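A sketch of the priority-based purging; the priority levels and one-list-per-priority layout are assumptions for illustration, not the actual libxfs cache code:

```c
/*
 * Sketch of priority-based purging when the cache overflows:
 * entries unlikely to be needed again, or cheap to re-read, go first.
 */
#include <stdlib.h>

enum cache_prio {
        PRIO_THROWAWAY = 0,     /* filler dragged in by a large I/O, unlikely to be needed */
        PRIO_CHEAP,             /* likely needed again but cheap to re-read */
        PRIO_EXPENSIVE,         /* likely needed again and expensive to fetch (isolated, long seek) */
        PRIO_LEVELS
};

struct prio_entry {
        void *data;
        struct prio_entry *next;
};

static struct prio_entry *prio_lists[PRIO_LEVELS];      /* one list per priority */
static size_t cache_used, cache_limit;                  /* entries in use vs. allowed */

/* Purge from the lowest priority upwards until we are back under the limit. */
void cache_shake(void)
{
        for (int p = PRIO_THROWAWAY; p < PRIO_LEVELS && cache_used > cache_limit; p++) {
                while (prio_lists[p] && cache_used > cache_limit) {
                        struct prio_entry *e = prio_lists[p];

                        prio_lists[p] = e->next;
                        free(e->data);
                        free(e);
                        cache_used--;
                }
        }
}
```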
01/30/08 Slide 19
Generating Test Filesystems
• Need to simulate aged filesystems
• Script runs at least 10 processes in parallel
• Each process
  – creates variable-sized files at varying directory depths
  – uses small direct I/Os to cause non-optimal allocation patterns
  – has a 10% probability of deleting a file instead of creating one
• Results in:
  – large and fragmented directory structures
  – physically separated inode chunks
  – fragmented files and hence randomly varying inode extent lists
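A rough sketch of such an ageing workload (the actual tool was a script; the paths, file counts and sizes here are arbitrary assumptions):

```c
/*
 * Sketch of the ageing workload: several processes creating variably
 * sized files at varying directory depths with small direct I/Os, and
 * occasionally deleting a file, to fragment free space, inode chunks
 * and extent lists.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROCS     10
#define FILES_EACH 10000
#define IO_SIZE    4096                 /* small direct I/Os -> poor allocation patterns */

static void age_worker(int id)
{
        char path[256], last[256] = "";
        void *buf;

        if (posix_memalign(&buf, IO_SIZE, IO_SIZE))     /* O_DIRECT wants aligned buffers */
                return;
        srandom(id);

        for (int i = 0; i < FILES_EACH; i++) {
                if (last[0] && random() % 10 == 0) {    /* 10%: delete instead of create */
                        unlink(last);
                        last[0] = '\0';
                        continue;
                }
                /* Build a path at a random depth, creating directories as we go. */
                int len = snprintf(path, sizeof(path), "/mnt/test/w%d", id);
                mkdir(path, 0755);
                int depth = random() % 4;
                for (int d = 0; d < depth; d++) {
                        len += snprintf(path + len, sizeof(path) - len, "/d%d", d);
                        mkdir(path, 0755);
                }
                snprintf(path + len, sizeof(path) - len, "/f%d", i);

                int fd = open(path, O_CREAT | O_WRONLY | O_DIRECT, 0644);
                if (fd < 0)
                        continue;
                int chunks = 1 + random() % 64;         /* variable file sizes */
                for (int c = 0; c < chunks; c++)
                        if (write(fd, buf, IO_SIZE) != IO_SIZE)
                                break;
                close(fd);
                snprintf(last, sizeof(last), "%s", path);
        }
        free(buf);
}

int main(void)
{
        for (int i = 0; i < NPROCS; i++)
                if (fork() == 0) {
                        age_worker(i);
                        _exit(0);
                }
        for (int i = 0; i < NPROCS; i++)
                wait(NULL);
        return 0;
}
```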
01/30/08 Slide 20
The Results
• Test system #1 – Desktop/Workstation
  – dual-processor x86_64, 2GB RAM, single 250GB SATA disk
• 100,000 inodes, 7% full
• 400,000 inodes, 100% full
• 815,000 inodes, 100% full
• 1.65M inodes, 100% full
• 5.7M inodes, 100% full
• 11M inodes, 37% full
• 17M inodes, 100% full
01/30/08 Slide 21
250GB SATA Disk 100,000 Inodes
[Chart: runtime (sec) by phase (Phases 1–7 and Total) – 2.7.18, Agami (2.7.18), 2.8.20, 2.9.4]
01/30/08 Slide 22
250GB SATA Disk 400,000 Inodes
[Chart: runtime (sec) by phase (Phases 1–7 and Total) – 2.7.18, Agami (2.7.18), 2.8.20, 2.9.4]
01/30/08 Slide 23
250GB SATA Disk 800,000 Inodes
[Chart: runtime (sec) by phase (Phases 1–7 and Total) – 2.7.18, Agami (2.7.18), 2.8.20, 2.9.4]
01/30/08 Slide 24
250GB SATA Disk – 1.65M Inodes
[Chart: runtime (sec) by phase (Phases 1–7 and Total) – 2.7.18, Agami (2.7.18), 2.8.20, 2.9.4]
01/30/08 Slide 25
250GB SATA Disk 5.7M Inodes
[Chart: runtime (sec) by phase (Phases 1–7 and Total) – 2.7.18, Agami (2.7.18), 2.8.20, 2.9.4, 2.9.4 + bhash=65536]
01/30/08 Slide 26
250GB SATA Disk 11M Inodes
[Chart: runtime (sec) by phase (Phases 1–7 and Total) – 2.7.18, Agami (2.7.18), 2.8.20, 2.9.4]
01/30/08 Slide 27
250GB SATA Disk 17M Inodes
[Chart: runtime (sec) by phase (Phases 1–7 and Total) – 2.7.18, Agami (2.7.18), 2.8.20 + bhash=4096, 2.9.4]
01/30/08 Slide 28
250GB SATA Disk – Runtime Scaling
[Chart: runtime (sec) vs. number of inodes in the filesystem, two panels – “Cache < Memory” (100k, 400k, 815k, 1.65M inodes) and “Cache > Memory” (5.7M, 11M, 17M inodes) – 2.7.18, Agami (2.7.18), 2.8.20, 2.9.4]
01/30/08 Slide 29
250GB SATA Disk – Inode Processing Rate
[Chart: scan rate (inodes/sec) vs. number of inodes in the filesystem (100k–17M) – 2.7.18, Agami (2.7.18), 2.8.20, 2.9.4, ext3 – bigger is better]
01/30/08 Slide 30
More Results
• Test system #2 – large server
  – 4p ia64, 48GB RAM
    • 5-way RAID0 stripe of 4+1 hardware RAID5 LUNs, 5.5TB capacity
  – 6M inodes, 80% full
  – 30M inodes, 100% full
  – 300M inodes, 60% full
01/30/08 Slide 31
5.5TB Volume 6M Inodes
[Chart: runtime (sec) by phase (Phases 1–7 and Total) – 2.7.18, Agami (2.7.18), 2.8.20, 2.9.4]
01/30/08 Slide 32
5.5TB Volume 30M Inodes
[Chart: runtime (sec) by phase (Phases 1–7 and Total) – 2.7.18, Agami (2.7.18), 2.8.20, 2.9.4]
01/30/08 Slide 33
5.5TB Volume 300M Inodes
[Chart: runtime (sec) by phase (Phases 1–7 and Total) – 2.7.18, Agami (2.7.18), 2.8.20, 2.9.4]
01/30/08 Slide 34
5.5TB 300M Inodes, Part 2
[Chart: runtime (sec) by phase (Phases 1–7 and Total), ag_stride comparison – 2.7.18, Agami (2.7.18), 2.9.4, 2.9.4 + ag_stride=1, 2.9.4+ags=1+pf=16]
01/30/08 Slide 35
5.5TB Volume – Runtime Scaling
[Chart: runtime (sec) vs. number of inodes in the filesystem (6M, 30M, 300M) – 2.7.18, Agami (2.7.18), 2.8.20, 2.9.4, 2.9.4+ags=1, 2.9.4+ags=1+pf]
01/30/08 Slide 36
5.5TB Volume – Runtime Scaling
[Chart: runtime (sec) vs. number of inodes in the filesystem (6M, 30M, 300M) – 2.7.18, Agami (2.7.18), 2.9.4, 2.9.4+ags=1, 2.9.4+ags=1+pf]
01/30/08 Slide 37
5.5TB Volume – Inode Processing Rate
[Chart: scan rate (inodes/sec) vs. number of inodes in the filesystem (6M, 30M, 300M) – 2.7.18, Agami (2.7.18), 2.8.20, 2.9.4, 2.9.4+ags=1, 2.9.4+ags=1+pf – bigger is better]
01/30/08 Slide 38
5.5TB Volume – Low Memory
[Chart: runtime (sec) for 2.7.18, Agami (2.7.18) and 2.9.4 with 48GB RAM vs. 2GB RAM – tests used the 30M inode filesystem]
01/30/08 Slide 39
Futures
• Memory usage reductions
  – allow larger filesystems to be checked in small RAM configs
  – introduce more efficient indexing structures
  – use extents for indexing free space
• Performance
  – multithreading of Phase 6
  – directory name hash checking scalability
  – trade memory usage savings for larger caches
• Robustness
  – Phase 1 on badly broken filesystems
  – preservation of broken directories
01/30/08 Slide 40
Questions?