Upload
others
View
9
Download
0
Embed Size (px)
Citation preview
© 2006 IBM Corporation
IBM Linux Technology Center
The Linux ext2/3/4 Filesystem: Past, Present, and Future
Theodore Ts'oIBM Linux Technology Center
September 11, 2006
IBM Linux Technology Center
© 2006 IBM Corporation
Agenda
A brief history of the ext2/3 filesystemThe ext3 filesystem formatFeatures added to ext3 in Linux 2.6New features planned for ext3/4Why ext4?Conclusion
IBM Linux Technology Center
© 2006 IBM Corporation
A brief history of Linux filesystems
The Minix filesystem (1991, used to bootstrap Linux)Max FS size: 64MB
Max file size: 64MB
Max filename: 14/30 bytes (fixed-length directory entries)
Only supported modification timestampFirst attempt to improve on Minixfs: the ext filesystem (1992)
Max FS size: 2GB
Max file size: 2GB
Max filename: 255 bytes
Still only one timestamp
Linked lists for free block/inodes caused performance problems
IBM Linux Technology Center
© 2006 IBM Corporation
The xiafs and ext2fs filesystems
Xiafs: minimal changes from minix (January 1993)Max FS size: 2GB
Max file size: 64MB (instead of 2GB)
Max filename: 248 bytes (fixed-length directory entries)
ctime/mtime/atime timestampsExt2fs – improvements to extfs (January 1993)
Max FS extended to 4TB
Variable block sizes
ctime/mtime/atime timestamps
Improved block/inode allocation using bitmaps and block groups
IBM Linux Technology Center
© 2006 IBM Corporation
Competition between xiafs and ext2fs
Since xiafs only made minor changes to minix, it was initially (appeared) more stable.Frank Xia tried to rename xiafs to Linuxfs – negative reaction to marketing-driven changesExt2 had a larger development community (so more features added) and had a more scalable design. In the end it became the dominant “default” filesystemFeatures added to ext2 over the years
sparse superblocks
Large file support (> 2GB)
Extended attributes
ACL's
IBM Linux Technology Center
© 2006 IBM Corporation
The ext3 filesystem
Journalling added to ext2 in 2000 (work started in 1998).Since it required many changes to the code base, a new version of the filesystem code was created in the kernel.
Hence, ext3
But really just ext2 with the COMPAT_HAS_JOURNAL feature (from the filesystem format point of view)
Other Journaling FilesystemsReiserfs, JFS, XFS
Advantages of ext3Backwards compatibility with ext2
Robustness against hardware errors highest priority
IBM Linux Technology Center
© 2006 IBM Corporation
The ext2/3 filesystem format
Verfy similar to the BSD FFSCylinder groups have become “block groups”Compatibility feature sets allow controlled addition of new features via three bitmasks:
R/W Compat – The kernel may mount the filesystem even if it does not understand a feature in this bitmask. (E2fsck however will refuse to touch a filesystem it doesn't understand)
R/O Compat – The kernel may mount the filesystem read/only if it does not understand a feature in this bitmask
Incompat – The kernel must not mount the filesystem if it does not understand a feature in this bitmask
IBM Linux Technology Center
© 2006 IBM Corporation
Ext2 Filesystem Layout
Boot BG #0 BG #1 ... BG #N
SuperBlock
BlockBitmap
InodeBitmap
Data blocksInodeTable
FS des-criptors
IBM Linux Technology Center
© 2006 IBM Corporation
Ext2 Inode structure
ModeOwners
SizeTimestamps
...
direct blocks
indirectd. indirectt. indirect
data
datadata
datadata
datadata
datadatadatadata
IBM Linux Technology Center
© 2006 IBM Corporation
Ext2 Directory Layout
name1name2name3name4
I1i2I3I3
Inode Table
Directory
IBM Linux Technology Center
© 2006 IBM Corporation
Features added to Linux 2.6
BKL removal and other scalability improvements (Andrew Morton, Alex Thomas)Directory Indexing (Daniel Phillips, Theodore Ts'o)Extended Attributes (Andreas Gruenbacher)Online resizing (Andreas Dilger, Stephen Tweedie)Reservation-based block preallocation (Mingming Cao, Andrew Morton, Stephen Tweedie, Badari Pulvarty)
IBM Linux Technology Center
© 2006 IBM Corporation
Reducing Lock Contention
Motivation: scaling issues for 2.4's ext3/jbd under workloads with concurrent I/OTo address this problem:
replaced the per-filesystem superblock lock in ext3 with finer-grained locks
Removed the big (global) kernel lock from the JBD layerResult: SDET benchmark throughput improved by a factor of 10
IBM Linux Technology Center
© 2006 IBM Corporation
Directory Indexing
Motivation: large directories took a long time to searchSolution: Add a search tree indexed by the hash of the filename to the directory
Variation of a B+tree– Directory entries stored in only leaf nodes– The use of fixed-length, 64-bit hashes as keys results in a high
fanout factorFully backwards compatible with older kernels
Interior nodes look like deleted directory entries
Older kernels will clear the directory indexed bit when they modify a directory, thus invalidating the interior nodes until they can be regenerated.
IBM Linux Technology Center
© 2006 IBM Corporation
Extended Attributes
Motivation: need to store small amounts of custom metadata which is associated with files or directories
Also needed to support Access Control Lists (ACL's)EAs are stored in a single EA block, which can be shared by inodes have same extended attributesIn Linux 2.6.11+, EA's can be stored in the expanded inode as well.
This EA-in-inode makes the ext3 top filesystem on Samba4 benchmarks
IBM Linux Technology Center
© 2006 IBM Corporation
Online Resizing
Motivation: Taking advantage of new disk space after a logical volume has been grown by the LVM subsystem without needing to unmount the filesystemSolution: Reserve space so that the number of blocks needed for the block group descriptors can be grown
An additional 4k block is required for every 32 block groups
Block group descriptors must be contiguously stored after the superblock
Integrated into the kernel as of 2.6.10 and e2fsprogs 1.39
IBM Linux Technology Center
© 2006 IBM Corporation
Reservation based block preallocation
Block preallocation helps reduce file fragmentation caused by concurrent allocationExt3 added block preallocation since 2.6.10 kernel.Ext3 uses in-memory block reservation to support a large preallocation
Ext3 (before)
Ext3 (After)
file 1
file 2
file 3
file 4
IBM Linux Technology Center
© 2006 IBM Corporation
Reservation Tree
(8, 31)
(0, 7) (32, 63)
(64, 71)
file 3file 2file 1 file 4
disk blocks
Files
IBM Linux Technology Center
© 2006 IBM Corporation
4 threads 16threads 64threads0
5
10
15
20
25
30
35
40
tiobench sequential write
ext3 2.4.29ext3 2.6.11JFSXFS
Thro
ughp
ut(M
B/se
c)
IBM Linux Technology Center
© 2006 IBM Corporation
Features planned for ext3/4
ExtentsSupport for large disks (48 and 64 bit block numbers)Fine-grained timestampsAsynchornous (background) unlink/truncateSupport > 32,000 subdirectoriesFiner grained locking to support parallel directory operations
IBM Linux Technology Center
© 2006 IBM Corporation
01......11121314
200201......211212123765530
213
...1236
1238......
1239......
65531......
65532......
65533......
0......200201......213............1239.........6553......
direct blockindirect blockdouble indirect blocktriple indirect block
i_data Ext2/3 Indirect Block Map
disk blocksWhy Extents?
IBM Linux Technology Center
© 2006 IBM Corporation
Extents
● Extents are an efficient way to represent large files● An extent is a single descriptor for a range of contiguous
blocks
logical length physical
0 1000 200
IBM Linux Technology Center
© 2006 IBM Corporation
header
01000200100120006000
...
...
200201......1199.........60006001......6199......
i_data
Extent Map
disk blocks
IBM Linux Technology Center
© 2006 IBM Corporation
i_data
index node...
0
...
...
header
0
0
...
...
Extent Treeleaf node disk blocks
extents
extents index
node header
root
IBM Linux Technology Center
© 2006 IBM Corporation
Extent Related Works
Multiple block allocationAn efficient way to allocating a chunk of contiguous blocks at a time
Delayed allocationEnable multiple block allocation by deferring and clustering single block allocation
IBM Linux Technology Center
© 2006 IBM Corporation
Evaluation of Extents Patches
Improvements for large file creation/removal/sequential read/sequential rewriteBenchmarks used: dbench, tiobench, FFSB filemark, sqlbench, iozone, etc.
IBM Linux Technology Center
© 2006 IBM Corporation
4 threads 16threads 64threads0
5
10
15
20
25
30
35
40
Tiobench Sequential Write Comparison With Extents
ext3 2.6.11ext3+extetnsJFSXFS
Thro
ughp
ut(M
B/se
c)
IBM Linux Technology Center
© 2006 IBM Corporation
Large File Sequential I/O Comparison Using FFSB
Sequential Read Sequential write Sequential re-write0
20
40
60
80
100
120
140
160
180
127
7175.7
153.7
91.9
102.7
156.3
89.394.8
166.3
104.3 100 ext3 ext3+extentsJFSXFS
Thro
ughp
ut(M
B/se
c)
IBM Linux Technology Center
© 2006 IBM Corporation
Ext4: The next-generation ext3
When initial versions of the extents patches were sent out for comment, some Linux kernel developers expressed concern:
Ext3 was too important to risk destabilizing code quality
Backwards incompatible extensions could cause user confusionAfter much discussion, a consensus on moving forward
Ext3 cleanup patches would be applied
The ext3 code base would be forked to fs/ext4, with the filesystem name ext4-dev
New work would happen in ext4-dev, and when the feature set for ext4 is stablized it would be renamed from ext4-dev to ext4.
IBM Linux Technology Center
© 2006 IBM Corporation
Conclusion
The ext2/3/4 filesystem is oldest filesystem which is still being actively developed in LinuxHas served the Linux community well for over 10 yearsWith new improvements being constantly being proposed, implemented, and placed into production, ext3/4 development continues to remain vital and exciting!