File Systems: Designs
Kamen Yotov
CS 614 Lecture, 04/26/2001
Overview
The Design and Implementation of a Log-Structured File System
- Sequential log structure
- Speeds up writes & crash recovery
The Zebra Striped Network File System
- Striping across multiple servers
- RAID-equivalent data recovery
Log-structured FS: Intro
Order of magnitude faster!?!
- The future is dominated by writes
- Main memory keeps increasing
- Reads are handled by the cache
Logging is old, but now it is different
- NTFS, Linux kernel 2.4
Challenge – finding free space
- Bandwidth utilization 65-75% vs. 5-10%
Design issues of the 1990s
Importance factors
- CPU – exponential growth
- Main memory – caches, buffers
- Disks – bandwidth, access time
Workloads
- Small files – single random I/Os
- Large files – bandwidth vs. FS policies
Problems with current FS
Scattered information
- 5 I/O operations to access a file under BSD
Synchronous writes
- May be only the meta-data, but that's enough
- Hard to benefit from faster CPUs
Network file systems
- More synchrony, in the wrong place
Log-structured FS (LFS)
Fundamentals
- Buffering many small write operations
- Writing them out at once as a single, contiguous disk write (see the sketch below)
Simple or not?
- How to retrieve information?
- How to manage the free space?
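A minimal sketch of the buffering idea, assuming a 1 MB segment size as in the paper; disk_write_sequential is a hypothetical stand-in for the real driver call:

    #include <stdint.h>
    #include <string.h>

    #define SEGMENT_SIZE (1 << 20)   /* 1 MB segment, as in the paper */

    /* Hypothetical stand-in for the driver's sequential-write primitive. */
    void disk_write_sequential(const void *buf, size_t len);

    struct log_buffer {
        uint8_t data[SEGMENT_SIZE];
        size_t  used;
    };

    /* Accumulate small writes in memory; when the segment fills, flush it
     * to disk as one large sequential transfer. Assumes len <= SEGMENT_SIZE. */
    static void log_append(struct log_buffer *lb, const void *buf, size_t len)
    {
        if (lb->used + len > SEGMENT_SIZE) {
            disk_write_sequential(lb->data, lb->used);
            lb->used = 0;
        }
        memcpy(lb->data + lb->used, buf, len);
        lb->used += len;
    }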
LFS: File Location and Reading
Index structures permit random-access retrieval
Again inodes… but now at varying positions in the log!
inode map: indexed, memory-resident (sketched below)
Writes are better, while reads are at least as good!
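A hedged sketch of the inode map idea: a memory-resident table mapping an inode number to the log address of that inode's latest copy (names and types are illustrative, not the Sprite code):

    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t log_addr_t;   /* byte offset into the log */

    struct inode_map {
        log_addr_t *entries;       /* entries[ino] = location of latest inode */
        size_t      count;
    };

    /* Reading starts here: one memory lookup replaces the fixed inode
     * locations of a conventional FS. Returns 0 if the inode is unknown. */
    static log_addr_t imap_lookup(const struct inode_map *map, uint32_t ino)
    {
        return (ino < map->count) ? map->entries[ino] : 0;
    }

    /* When an inode is rewritten at the log head, only its map entry moves;
     * the map blocks themselves are flushed via the checkpoint region. */
    static void imap_update(struct inode_map *map, uint32_t ino, log_addr_t where)
    {
        if (ino < map->count)
            map->entries[ino] = where;
    }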
Example: Creating 2 Files
[Diagram: log contents after creating two files – data blocks, directory blocks, inodes, and inode map entries appended sequentially]
LFS: Free Space Management
Need large free chunks of space
Threading
- Excessive fragmentation of free space
- No better than other file systems
Copying
- Can be to different places…
- Big costs
Combination: threading & copying
Solution: Segmentation
Large fixed-size blocks (e.g. 1 MB) = segments
- Threading between segments
- Copying within segments
- Transfer time longer than seek time
Segment cleaning
- Which chunks are live?
- Which file do they belong to, and at what position? (needed for the inode update)
- Segment summary block(s) – an illustrative layout follows below
- File version stamps
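An illustrative layout for a segment summary entry (field names and widths are assumptions, not the Sprite on-disk format):

    #include <stdint.h>

    /* One entry per block in the segment: enough for the cleaner to decide
     * liveness and to update the owning inode after copying the block. */
    struct summary_entry {
        uint32_t inode_num;     /* owning file's inode number */
        uint32_t block_offset;  /* logical block number within that file */
        uint32_t version;       /* file version stamp; a stale version
                                   means the block is dead */
    };

    struct segment_summary {
        uint32_t nblocks;              /* number of entries that follow */
        struct summary_entry entry[];  /* C99 flexible array member */
    };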
Segment Cleaning Policies
When should the cleaner execute?
- Continuously, at night, when space is exhausted
How many segments to clean at once?
Which segments should be cleaned?
- The most fragmented ones, or…
How should the live blocks be grouped when written back?
- Locality for future reads…
Measuring & Analyzing
Write cost
- Average time the disk is busy per byte of new data written, including cleaning overhead
- 1.0 is perfect – full bandwidth, no overhead; bigger is worse
- In LFS, seek and rotational latency are negligible, so it is just (total bytes read and written) / (new data written) – worked out below
Performance trade-off: utilization vs. speed
The key: bimodal segment distribution!
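As a worked restatement of the paper's steady-state formula, assume every cleaned segment has live-data utilization u: cleaning reads the whole segment (1), rewrites its live data (u), and frees 1 − u of space for new data, so

    \text{write cost}
      = \frac{\text{total bytes read and written}}{\text{new data written}}
      = \frac{1 + u + (1 - u)}{1 - u}
      = \frac{2}{1 - u},  \qquad 0 < u < 1

(a completely empty segment can be reused without being read at all). For example, u = 0.8 gives a write cost of 10, while u = 0.2 gives 2.5 – hence the utilization vs. speed trade-off, and the appeal of a bimodal distribution where the cleaner works mostly on nearly empty segments.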
Simulation & Results
Purpose: analyze different cleaning policies
Harsh model
- File system is modeled as a set of 4 KB files
- At each step a file is chosen and rewritten
- Uniform: each file equally likely to be chosen
- Hot-and-cold: the 10-90 formula (both generators sketched below)
- Runs until the write cost stabilizes
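A small sketch of the two access-pattern generators just described (hypothetical helper names; assumes nfiles is a multiple of 10):

    #include <stdlib.h>

    /* Uniform: every file is equally likely to be rewritten. */
    static int pick_uniform(int nfiles)
    {
        return rand() % nfiles;
    }

    /* Hot-and-cold: 10% of the files (the hot set) receive 90% of the
     * writes; the remaining 90% of files share the other 10%. */
    static int pick_hot_and_cold(int nfiles)
    {
        int hot = nfiles / 10;
        if (rand() % 100 < 90)
            return rand() % hot;               /* hot file */
        return hot + rand() % (nfiles - hot);  /* cold file */
    }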
Write Cost vs. Disk Utilization
[Chart: write cost (y-axis, 2–18) vs. disk utilization (x-axis, 0.1–0.9) for FFS today, FFS improved, LFS uniform, LFS hot-and-cold, and the no-variance bound]
Hot & Cold Segments
Why is locality worse than no locality?
Free space is most valuable in cold segments
- Value is based on data stability
- Approximate stability with age
Cost-benefit policy: clean the segments with the highest benefit/cost ratio (sketched below)
- Benefit: space cleaned (1 − u, the inverse of segment utilization) × how long it stays free (approximated by the timestamp of the youngest block)
- Cost: read the segment + write back its live data (1 + u)
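The selection rule from this slide, as a one-line sketch (utilization u in [0, 1); age derived from the youngest block's timestamp):

    /* Cost-benefit score: benefit = space freed (1 - u) times how long it is
     * expected to stay free (age); cost = reading the segment (1) plus
     * writing back its live data (u). Clean the highest-scoring segments. */
    static double cost_benefit(double u, double age)
    {
        return ((1.0 - u) * age) / (1.0 + u);
    }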
Segment Utilization Distributions
[Chart: fraction of segments (×0.001, y-axis 1–9) vs. segment utilization (x-axis 0.1–0.9) for uniform, hot-and-cold (greedy), and hot-and-cold (cost-benefit)]
Write Cost vs. Disk Utilization (revisited)
[Chart: write cost (y-axis 2–18) vs. disk utilization (x-axis 0.1–0.9) for FFS today, FFS improved, LFS uniform, LFS cost-benefit, and the no-variance bound]
Crash Recovery
Current file systems require a full disk scan
Log-based systems are definitely better
Checkpointing (two-phase, trade-offs)
(Meta-)information is kept in the log
Checkpoint region at a fixed position, holding:
- inode map blocks
- segment usage table
- time & last segment written
Roll-forward
Crash Recovery (cont.)
Naïve method: On a crash, just use the latest checkpoint and go from there!
Roll-forward recovery
- Scan segment summary blocks for new inodes
- If there is only data but no inode, assume the write was incomplete and ignore it
- Adjust segment utilizations
- Restore consistency between directory entries and inodes (special records in the log serve this purpose)
Experience with Sprite LFS
Part of Sprite Network Operating System
All implemented, except roll-forward is disabled
Short 30-second checkpoint interval
No more complicated to implement than a normal "allocating" file system (NTFS and Ext2 even more so…)
No great improvement visible to the user, as few applications are disk-bound!
So, let’s go micro!
Micro-benchmarks were run
10 times faster when creating small files
Faster at reading too, provided the read order matches the write order
The only case where Sprite is slower:
- Write a file randomly
- Read it sequentially
The locality each system produces differs a lot!
Sprite LFS vs. Sun OS 4.0.3
                    Sprite LFS                  SunOS 4.0.3
Sizes               4 KB block, 1 MB segment    8 KB block
Writes/deletes      ×10 speed-up                slow
Locality            temporal                    logical
Resource use        saturates the CPU!!!        keeps the disk busy
Favored pattern     random write                sequential read
Related Work
WORM media – has always been logged
- Maintain indexes
- No deletion necessary
Garbage collection
- Scavenging = segment cleaning
- Generational = cost-benefit scheme
- Difference: random vs. sequential
Logging similar to database systems
- Use of the log differs (as in NTFS, Linux 2.4)
- Recovery is like "redo log"-ging
Zebra Networked FS: Intro
Multi-server networked file system
Clients stripe data across the servers
Redundancy ensures fault tolerance & recoverability
Suitable for multimedia & parallel tasks
Borrows from RAID and LFS principles
Achieves speed-ups from 20% to 5x
Zebra: Background
RAID
- Definitions: stripes, fragments
- Problems: bandwidth bottleneck, small files
Differences with Distributed File Systems
[Diagram: stripe layout with data and parity fragments – per-file vs. per-client striping]
Per-file striping (standard RAID)
- 4 I/Os for a small-file write: 2 reads + 2 writes (old and new data & parity)
Per-client striping (LFS-style)
- Data and parity distributed across servers
- Storage efficient: many small files batched into each stripe
Zebra: Network LFS
Logging between clients and servers (as opposed to between file server and disks)
Per-client striping
- More efficient use of storage space
- Simplified parity mechanism (sketched below)
  - No overhead for small files
  - Parity never needs to be modified
Typical distributed-computing problems remain
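A minimal sketch of the stripe parity computation: XOR across the stripe's data fragments, as in RAID. Because a Zebra client writes whole stripes log-style, parity is computed once per stripe and never patched in place (the fragment layout here is illustrative):

    #include <stddef.h>
    #include <stdint.h>

    /* XOR nfrags equally sized data fragments into a parity fragment.
     * Any single lost fragment can later be rebuilt by XOR-ing the rest. */
    static void compute_parity(const uint8_t *const frags[], size_t nfrags,
                               size_t frag_len, uint8_t *parity)
    {
        for (size_t i = 0; i < frag_len; i++) {
            uint8_t p = 0;
            for (size_t f = 0; f < nfrags; f++)
                p ^= frags[f][i];
            parity[i] = p;
        }
    }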
Zebra: Components
File Manager
Stripe Cleaner
Storage Servers
Clients
File Manager and Stripe cleaner may reside on a Storage Server as separate processes – useful for fault tolerance!
[Diagram: clients, the File Manager, the Stripe Cleaner, and the Storage Servers connected by a fast network]
Zebra: Component Dynamics
Clients
- Location, fetching & delivery of fragments
- Striping, parity computation, writing
Storage servers
- Bulk data repositories
- Fragment operations: Store, Append, Retrieve, Delete, Identify (interface sketched below)
- Synchronous, non-overwrite semantics
File Manager
- Meta-data repository: just pointers to blocks
- RPC bottleneck for many small files
- Can run as a separate process on a Storage Server
Stripe Cleaner
- Similar to the Sprite LFS cleaner we discussed
- Runs as a separate, user-mode process
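A hedged sketch of the storage-server interface implied by the five fragment operations above; all names and signatures are illustrative, not Zebra's actual RPC definitions:

    #include <stddef.h>
    #include <stdint.h>

    struct fragment_id {
        uint64_t stripe;   /* which stripe the fragment belongs to */
        uint32_t index;    /* position of the fragment within the stripe */
    };

    /* Non-overwrite semantics: Store fails if the fragment already exists,
     * and Append only grows a fragment; existing bytes are never rewritten. */
    int frag_store(struct fragment_id id, const void *data, size_t len);
    int frag_append(struct fragment_id id, const void *data, size_t len);
    int frag_retrieve(struct fragment_id id, size_t off, void *buf, size_t len);
    int frag_delete(struct fragment_id id);
    int frag_identify(struct fragment_id *out, size_t max); /* list fragments */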
Zebra: System Operation - Deltas
Communication via deltas
Fields (a possible layout is sketched below)
- File ID, file version, block #
- Old & new block pointers
Types
- Update, Cleaner, Reject
Reliable, because they are stored in the log
- Replayed after crashes
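An illustrative in-memory layout for a delta record, following the field list above (widths are assumptions):

    #include <stdint.h>

    enum delta_type { DELTA_UPDATE, DELTA_CLEANER, DELTA_REJECT };

    /* Deltas describe block-pointer changes and travel inside the client's
     * log, so they survive crashes and can be replayed during recovery. */
    struct delta {
        uint32_t        file_id;       /* which file changed */
        uint32_t        file_version;  /* for detecting conflicting updates */
        uint32_t        block_num;     /* block number within the file */
        uint64_t        old_ptr;       /* previous block location (0 if none) */
        uint64_t        new_ptr;       /* new block location in some stripe */
        enum delta_type type;          /* Update, Cleaner, or Reject */
    };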
Zebra: System Operation (cont.)
Writing files – flushes happen on:
- Threshold age (30 s)
- Cache full & dirty
- Application fsync
- File Manager request
- Striping, delta updates, concurrent transfers
Reading files
- Nearly identical to a conventional FS
- Good client caching
- Consistency
Stripe cleaning
- Choosing which stripes to clean
- Space utilization tracked through deltas
- Stripe Status File
Zebra: Advanced System Operations
Adding Storage Servers: scalable
Restoring from crashes
- Consistency & availability
- Specifics due to distributed system state:
  - Internally inconsistent stripes
  - Stripe information inconsistent with the File Manager
  - Stripe Cleaner state consistency with Storage Servers
Logging and checkpointing
- Fast recovery after failures
Prototyping
Most of the interesting parts are only on paper
Included
- All UNIX file commands, file-system semantics
- Functional cleaner
- Clients construct fragments and write parity
- File Manager and Storage Servers checkpoint
Omitted (some advanced crash-recovery methods)
- Metadata not yet stored on Storage Servers
- Clients do not automatically reconstruct fragments upon a Storage Server crash
- Storage Servers do not reconstruct fragments on recovery
- File Manager and Cleaner not automatically restarted
Measurements: Platform
Cluster of DECstation-5000 Model 200 machines
- 100 Mb/s FDDI local network ring
- 20 SPECint, 32 MB RAM
- 12 MB/s memory-to-memory copy
- 8 MB/s memory-to-controller copy
- RZ57 1 GB disks, 15 ms seek
  - 2 MB/s native transfer bandwidth
  - 1.6 MB/s real transfer bandwidth (due to the controller)
- Caching disk controllers (1 MB)
Measurements: Results (1)
Large File Writes
[Chart: total throughput (MB/s, 0–4) vs. number of servers (1–4) for 1, 2, and 3 clients, 1 client with parity, Sprite, and NFS]
Measurements: Results (2)
Large File Reads
[Chart: total throughput (MB/s, 0–6) vs. number of servers (1–4) for 1, 2, and 3 clients, 1 client (reconstruct), Sprite, and NFS]
Measurements: Results (3)
Small File Writes
[Chart: elapsed time (seconds, 0–70) for NFS, Sprite, Zebra, Sprite N.C., and Zebra N.C., broken down into server flush, client flush, write, and open/close]
Measurements: Results (4)
Resource Utilization
[Chart: resource utilization (0–100%) of File Manager CPU and disk, client CPU, and Storage Server CPU and disk for Zebra vs. Sprite under large writes (LW), large reads (LR), and small writes (SW)]
Zebra: Conclusions
Pros
- Applies parity and log structure to network file systems
- Performance, scalability
- Cost-effective servers
- Availability, simplicity
Cons
- Lacks name caching, causing severe performance degradation
- Not well suited for transaction processing
- Metadata problems
- Small reads are problematic again