Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
The Full Path to Full-Path Indexing
Yang Zhan, Alex Conway, Yizheng Jiao, Eric Knorr, Michael A. Bender, Martin Farach-Colton, William Jannen, Rob Johnson, Donald E. Porter, Jun Yuan
Talk Overview▪ What is full-path indexing and its benefits?
• locality▪ What are the challenges?
• renames▪ How do we overcome them?
• data structure techniques: tree surgery and lifting
Conventional file systems use inodes/
foo
bar
var
A
B
C
D
E
data of D...
...
...data of C
data of A
...data of E
...data of B
disk...
worst case: no relation between the logical and physical locations
Inode file systems show no locality in the worst case/
foo
bar
var
A
B
C
D
E
data of D...
...
...data of C
data of A
...data of E
...data of B
disk...
read all files under /foo
data of D
data of C
data of E
multiple random IOs
Full-path indexing file systems use full-paths▪ Full-path indexing file systems index metadata and
data in key-value stores using full-paths
Full-path indexing file systems use full-paths/
foo
bar
var
A
B
C
D
E
disk
logically related things are physically related
.../A: data/bar/B: data/foo/C: data/foo/var/D: data
.../foo/var/E: data
Full-path file systems ensure locality/
foo
bar
var
A
B
C
D
E
disk
read all files under /foo
one large IO
.../A: data/bar/B: data/foo/C: data/foo/var/D: data
.../foo/var/E: data
/foo/C: data/foo/var/D: data/foo/var/E: data
Scans are fast in full-path file systems
40
ext4 btrfs BetrFS-FPxfs zfs
seco
nds
Time to grep the linux source directory (lower is better)
full-path file systems are 3.3x faster than other file systems
60
20
0
Talk Overview▪ What is full-path indexing and its benefits?
• locality▪ What are the challenges?
• renames▪ How do we overcome them?
• data structure techniques: tree surgery and lifting
Renames are cheap in inode file systems/
foo
bar
var
A
B
C
D
E
data of D...
...
...data of C
data of A
...data of E
...data of B
disk...
rename /foo/var to /bar/var
one pointer swing
Renames are cheap in inode file systems/
foo
bar
var
A
B
C
D
E
data of D...
...
...data of C
data of A
...data of E
...data of B
disk...
rename /foo/var to /bar/var
no change
Renames seem expensive with full-path indexing/
foo
bar
var
A
B
C
D
E
.../A: data/bar/B: data/foo/C: data/foo/var/D: data
.../foo/var/E: data
disk
rename /foo/var to /bar/var
/foo/var/D: data/foo/var/E: data
need to maintain key order
Renames seem expensive with full-path indexing
.../A: data/bar/B: data/foo/C: data
...
/foo/var/D: data/foo/var/E: data
.../A: data/bar/B: data
/foo/C: data...
/bar/var/D: data/bar/var/E: data
1. move kv pairs2. change key prefixes
Expensive when rename size is large
Renaming big files are slow in full-path file systems
Renaming big files are slow in full-path file systems
slow
Renaming big files are slow in full-path file systems
Rename the Linux source directory takes 20 seconds
slow
Inode vs. Full-path indexing
rename locality
inode file systems
full-path file systems
We want to get decent renames with good locality
▪ In FAST 2016, zoning was introduced to BetrFS▪ Zoning tries to get both locality and fast renames
Zoning tries to solve the rename problem
Zoning Example/
foo
bar
var
A
B
C
D
E
full-path indexing within one zone
indirection between zones zone maintenance cost:a zone is too large: zone splita zone is too small: zone merge
Zoning tries to achieve both fast renames and locality
small zones(inode file systems)
big zones(full-path file systems)
scan cost rename cost
sweet spot
Zoning performance
rename locality other operations
zoning
Zoning achieves cheap renames
Zoning achieves cheap renames
big files form their own zones, renaming them is cheap
Zoning performance
rename locality other operations
zoning
Zoning has relatively good locality
0
20
40
60
ext4 btrfs BetrFS-FPxfs zfs
seco
nds
Time to grep the linux source directory (lower is better)
BetrFS-Zone
zoning is still 2.2x faster than other file systems, but 33% slower than full-path indexing
Zoning performance
rename locality other operations
zoning
Zone maintenance can be expensive
10k zone splits
50% drop
Zoning performance
rename locality other operations
zoning
Zoning is not the answer
Talk Overview▪ What is full-path indexing and its benefits?
• locality▪ What are the challenges?
• renames▪ How do we overcome them?
• data structure techniques: tree surgery and lifting
Moving is expensive in an array/
foo
bar
var
A
B
C
D
E
disk
.../A/bar/B/foo/C/foo/var/D
.../foo/var/E
Moving kv pairs looks hard in an array, but they are stored in a Bε-tree in BetrFS
Bε-trees sometimes allow easy movesCan be viewed as B-trees in this talk
/foo/var/D/foo/var/E
/foo/C
/bar/B
/A
leaves store kv pairs
/bar/B/foo/C
/A
interior nodes store pivots and pointers to children
Bε-trees sometimes allow easy moves
/foo/var/D/foo/var/E
/foo/C
/bar/B
/A
/bar/B/foo/C
/A
/foo/C
move /foo/C to /C
This move can be done by changing pivots and a pointer swing
/foo/var/E
Bε-trees sometimes allow easy moves
/foo/var/D/foo/var/E
/foo/C
/bar/B
/A
/bar/B/foo/C
/A
/foo/C
move /foo/C to /C
This move can be done by changing pivots and a pointer swing
/C
Problem 1: keys are not updated (should be /C)
/foo/var/E
Problem 2: not easy to move /foo/var/E
/C
▪ Two problems:• need to get an isolated subtree
▪ tree surgery in O(Bε-tree height) IOs• need to update keys
▪ lifting, no additional IO cost▪ The whole solution is called range-rename
A rename can be done by moving a subtree
Tree surgery slices out an isolated subtree...
...
......
...
......
......
......
......
...
.........
...
.........
...
get a subtree of range (/redmin, /redmax)
completely in range
need to take care of fringe nodes
Tree surgery slices out an isolated subtree...
...
......
...
......
......
......
......
...
.........
...
.........
...
get a subtree of range (/redmin, /redmax)
1. walk down the tree with min and max key
Tree surgery slices out an isolated subtree...
...
......
...
......
......
......
......
...
.........
...
.........
...
get a subtree of range (/redmin, /redmax)
1. walk down the tree with min and max key
2. start node splits from leaves
Tree surgery slices out an isolated subtree...
...
......
...
......
......
......
......
...
......
......
.........
...
get a subtree of range (/redmin, /redmax)
1. walk down the tree with min and max key
2. start node splits from leaves
3. keep splitting until two splits converge
Tree surgery slices out an isolated subtree...
...
......
...
......
......
......
......
...
......
......
.........
...
get a subtree of range (/redmin, /redmax)
1. walk down the tree with min and max key
2. start node splits from leaves
3. keep splitting until two splits converge
Tree surgery slices out an isolated subtree...
...
......
...
......
......
......
......
...
......
......
.........
...
get a subtree of range (/redmin, /redmax)
1. walk down the tree with min and max key
2. start node splits from leaves
3. keep splitting until two splits converge
we now have a isolated subtree
▪ Reasons:• to setup pivots for the source tree• POSIX allows renames to overwrite files
Tree surgery also slices at the destination
Tree surgery finishes with a pointer swing
......
...
▪ rename /red to /blue
...
......
.........
...
...
......
...
Tree surgery finishes with a pointer swing
......
...
▪ rename /red to /blue
...
......
.........
...
...
......
...
Tree surgery completes in O(Bε-tree height) IOs...
...
......
...
......
......
......
......
...
.........
...
.........
...
most data: not touched
each subtree slicing go through two root-to-leaf path
▪ Revert the IO cost to O(subtree size)
▪ Solution: lifting to convert Bε-trees to lifted Bε-trees• prefix updates are free
Updating all keys in the subtree is expensive
Lifting lifts the common prefix of two pivots
/redmax
/orange/redmin
/yellow/red/v/red/e
/red/d/red/c
/red/x/red/w
/red/z/red/y
/red/b/red/a
......
...
......
Common_prefix(/redmin, /redmax) = /red
all keys in the subtree must have prefix /red, there is no need to store that
lift /red
Lifting lifts the common prefix of two pivots
/redmax
/orange/redmin
/yellow/v/e
/d/c
/x/w
/z/y
/b/a
......
...
......
lift /red
Search for /red/z
/red lifted, search for /z
/red lifted, search for /z
/red lifted, search for /z
Lifting lifts the common prefix of two pivots
/redmax
/orange/redmin
/yellow/v/e
/d/c
/x/w
/z/y
/b/a
......
...
......
lift /red
/bluemax
/aqua/bluemin
/orange
lift /blue
Lifting lifts the common prefix of two pivots
/blue/v/blue/e
/blue/d/blue/c
/blue/x/blue/w
/blue/z/blue/y
/blue/b/blue/a
......
...
......
/bluemax
/aqua/bluemin
/orange
lift /blue
all keys in the subtree are updated automatically after the pointer swing
▪ Lifting happens at all times▪ Cost of other operations:
• collect lifted parts along the root-to-leaf path• no additional IO
▪ Cost of maintaining key lifting• key lifting can only change in node splits/merges• no additional IO
Lifting does not introduce additional IOs
▪ Range-rename performs tree surgery• O(Bε-tree height) IOs
▪ Key/value pairs are stored in lifted Bε-trees• keys are updated after tree surgery without cost
Range-rename completes in O(Bε-tree height) IOs
Evaluation
other operations rename applications
range-rename
▪ Dell optilex destop• 4-core 3.4 GHz i7, 4 GB RAM• 7200 RPM 500 GB Seagate Barrcuda
Experimental Setup
Tokubench
no cliff
Range-rename doesn’t charge other operations as much as zoning
Evaluation
other operations rename applications
range-rename
Rename Throughput
Rename Throughput
In normal cases, range-rename can rename as fast as other file systems
Evaluation
other operations rename applications
range-rename
IMAP benchmark
ext4 btrfs BetrFS-Zonexfs zfs BetrFS-RR
The throughput of 4 threads operating on the dovecot server (higher is better)
0
50
100
150
ops/
s
BetrFS-RR is 12% faster than BetrFS-Zone
rsync benchmark
ext4 btrfs BetrFS-Zonexfs zfs BetrFS-RR0
20
40
MB
/s
The throughput of rsync to copy the linux directory (higher is better)
BetrFS-RR is 13% faster than BetrFS-Zone
BetrFS-RR is faster than BetrFS-Zone in application benchmarks
Evaluation
other operations rename applications
range-rename
Evaluation
other operations rename applications
range-rename
● Lower taxes on non-rename operations○ A few regressions
● Worst case rename costs: logarithmic in size● Additional opportunities for range rename
in paper
▪ BetrFS with range-rename• maintain full-path indexing• decent rename performance• no tradeoff: locality, rename and other operations
Conclusion
Web: betrfs.orgCode: https://github.com/oscarlab/betrfs
Email: [email protected]