April 27, 2010
HydraFS
C. Ungureanu, B. Atkin, A. Aranya, et al.
Slides: Joe Buck, CMPS 229, Spring 2010
Introduction
What is HydraFS?
Why is it necessary?
HydraFS is a file system on top of HYDRAstor, a scalable, distributed CAS. Applications don't write to a CAS interface, they write to a FS interface; an adapter layer is needed, thus HydraFS. The CAS uses a put/get model.
HYDRAstor
What is HYDRAstor?
Immutable data
High Latency
Jitter
Put / Get API
Note the inconsistent use of capitalization in the acronyms. Jitter, in this case, means the distance between writes to storage. Chunking is also worth mentioning here.
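As a concrete (hypothetical) illustration of the put/get model, here is a toy in-memory content-addressable store; the class name and the use of SHA-256 are assumptions for the sketch, not the actual HYDRAstor block access API:

```python
import hashlib


class ContentAddressableStore:
    """Toy in-memory stand-in for a put/get CAS interface."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> bytes:
        # The store derives the block's address from its content.
        address = hashlib.sha256(data).digest()
        self._blocks[address] = data  # immutable: same content, same slot
        return address

    def get(self, address: bytes) -> bytes:
        return self._blocks[address]


store = ContentAddressableStore()
addr = store.put(b"some immutable block")
assert store.get(addr) == b"some immutable block"
```

Unlike this toy, the real store returns the address only after the write completes, which is why clients cannot compute it in advance (see the later slide on CAS addresses).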
Hydra Diagram
[Diagram: on the access node, HydraFS (file server and commit server) sits on top of the HYDRAstor block access library; Hydra itself is a single-system content addressable store built from storage nodes.]
CAS
[Diagram: a client writes data, shown as a sequence of 4 KB blocks, into the CAS.]
CAS - continued
[Diagram: the same client data now passes through a chunker on its way into the CAS.]
The chunker uses a heuristic involving the content data, plus some hard-set limits, to chunk in variable sizes.
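A rough sketch of content-defined chunking with hard minimum and maximum sizes, in the spirit of what the notes describe; the size limits, the cut-point mask, and the toy hash are illustrative choices, not the values or algorithm HydraFS uses:

```python
MIN_CHUNK = 2 * 1024       # hard lower bound on chunk size (illustrative)
MAX_CHUNK = 64 * 1024      # hard upper bound on chunk size (illustrative)
CUT_MASK = (1 << 13) - 1   # expected chunk size around 8 KB (illustrative)


def chunk(data: bytes):
    """Yield variable-sized chunks whose boundaries depend on the content."""
    start = 0
    state = 0
    for i, byte in enumerate(data):
        # Toy content-dependent hash; a real chunker would use a rolling hash
        # such as a Rabin fingerprint over a sliding window.
        state = ((state << 1) ^ byte) & 0xFFFFFFFF
        length = i - start + 1
        at_cut_point = (state & CUT_MASK) == 0
        if length >= MAX_CHUNK or (length >= MIN_CHUNK and at_cut_point):
            yield data[start:i + 1]
            start = i + 1
            state = 0
    if start < len(data):
        yield data[start:]
```

Because boundaries depend on content rather than position, an insertion early in a file only changes the chunks near the edit instead of shifting every block that follows.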
CAS - continued
[Diagram: the chunker emits variable-sized chunks (2 KB and 4 KB); the CAS now holds object cas1, 10 KB in size.]
Objects in the CAS have IDs that are pointed to by metadata. cas1 is 10 KB in size.
CAS - continued
[Diagram: further chunks (1 KB and 4 KB) are written; the CAS now holds cas1 (10 KB) and cas2 (9 KB).]
cas2 is 9 KB in size.
A little more on CAS addresses
Same data doesn't mean the same address
Impossible to calculate prior to write
Foreground processing writes shallow trees
Root cannot be updated until all child nodes are set
Differing retention levels can produce different CAS addresses. Collisions can be detected but are unlikely. Writes are done asynchronously; the file system blocks on the root node commit.
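A sketch of the write ordering this implies: child blocks can be written asynchronously, but the root block embeds the addresses returned for its children, so it cannot be written until all child writes are acknowledged. The thread-pool structure and the `store` object (any put/get store, such as the toy above) are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor


def write_tree(store, leaves):
    """Write leaf blocks concurrently, then a root block that references them."""
    with ThreadPoolExecutor() as pool:
        # Child writes are independent, so they can be issued asynchronously.
        futures = [pool.submit(store.put, leaf) for leaf in leaves]
        child_addresses = [f.result() for f in futures]  # wait: children must be durable
    # Only now can the root exist: it embeds addresses that were unknown
    # before the child writes were acknowledged.
    root_block = b"".join(child_addresses)
    return store.put(root_block)
```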
Issues for a CAS FS
Updates are more expensive
Metadata cache misses cause significant performance issues
The combination of high latency and high throughput means lots of buffering
Updates must touch all metadata that points to affected data. Buffering allows for optimal write ordering, and the read cache is important as well.
Design Decisions
Decouple data and metadata processing
Fixed size caches with admission control
Second-order cache for metadata
From the previous three issues come three design decisions:
1) This is done via a log, which allows batching of metadata updates.
2) This prevents swapping and other resource over-allocation.
3) This removes operations from reads via cache hits and improves the metadata cache hit rate.
Issues - continued
Immutable Blocks
FS can only reference blocks already written
Forms DAGs
Height of DAGs needs to be minimized
The entire tree must be updated if a block contained in it is updated, which makes updates quite expensive.
Issues - continued
Variable sized blocks
Avoids the shifting window problem
Use a balanced tree structure
This is the chunking referred to in the paper. There is a min/max size for chunks. The tree helps minimize the height of the DAGs.
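A sketch of how a balanced, high-fanout tree over chunk addresses keeps the DAG shallow: height grows logarithmically with the fanout. The fanout of 64 and the packing of child addresses into internal nodes are illustrative choices, not the HydraFS format:

```python
FANOUT = 64  # illustrative; a high fanout keeps the tree shallow


def build_file_tree(store, chunks):
    """Build a balanced tree over chunk addresses; return the root address."""
    level = [store.put(c) for c in chunks]        # leaf level: data chunks
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), FANOUT):
            # An internal node is just the packed addresses of its children.
            node = b"".join(level[i:i + FANOUT])
            next_level.append(store.put(node))
        level = next_level
    return level[0] if level else None
```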
FS design
High Throughput
Minimize the number of dependent I/O operations
Availability guarantees no worse than standard Unix FS
Efficiently support both local and remote access
Close-to-open consistency (an fsync acknowledgment means the data is persisted). Remote access could be via NFS or CIFS.
File System Layout
[Diagram: super blocks point to an imap handle; an imap B-tree and segmented array map inode numbers (e.g. 321, 365, 442) to inodes; a directory inode points to directory blocks mapping filenames (Filename1, Filename2, Filename3) to inode numbers; a regular file inode points, via an inode B-tree, to the file contents.]
The inode map is similar to that of a log-structured file system. Files dedup across file systems.
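A toy model of the lookup chain this layout implies, using plain dictionaries in place of the imap and directory structures; every level of indirection here would be a separate CAS block in HydraFS, and all names and inode numbers are illustrative:

```python
def lookup(fs, dir_inode_no, name):
    """Resolve a filename to its inode, following the layout on this slide."""
    imap = fs["superblock"]["imap"]         # imap: inode number -> inode
    dir_inode = imap[dir_inode_no]          # directory inode
    inode_no = dir_inode["entries"][name]   # directory block: name -> inode number
    return imap[inode_no]                   # regular file inode -> file contents


fs = {
    "superblock": {
        "imap": {
            321: {"type": "dir", "entries": {"Filename1": 365}},
            365: {"type": "file", "contents": b"hello"},
        }
    }
}
assert lookup(fs, 321, "Filename1")["contents"] == b"hello"
```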
HydraFS Software Stack
Uses FUSE
Split into file server and commit server
Simplifies metadata locking
Amortizes the cost of metadata updates via batching
Each server has its own caching strategy
The file server manages the interface to the client, records file modifications in a transaction log stored in Hydra, and keeps an in-memory cache of recent file modifications. The commit server reads the transaction log, updates FS metadata, and generates new FS versions.
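A sketch of that split, assuming a simple in-memory queue in place of the Hydra-resident transaction log; `apply_update` and `publish_fs_version` are hypothetical placeholders for the metadata update and FS-version publication steps, and `stop` can be any object with an `is_set()` method (e.g. a threading.Event):

```python
import queue

log = queue.Queue()  # stand-in for the transaction log stored in Hydra


def file_server_write(inode_no, offset, chunk_addresses):
    # The file server only records what changed; it does not update FS metadata itself.
    log.put(("write", inode_no, offset, chunk_addresses))


def commit_server_loop(apply_update, publish_fs_version, stop):
    while not stop.is_set():
        batch = []
        try:
            batch.append(log.get(timeout=1.0))
            while True:
                batch.append(log.get_nowait())  # drain whatever has accumulated
        except queue.Empty:
            pass
        if batch:
            for record in batch:
                apply_update(record)            # amortize metadata updates over the batch
            publish_fs_version()                # a new FS root makes the batch visible
```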
Writing Data
Data stored in inode-specific buffer
Chunked, marked dirty and written to Hydra
After write confirmation, block freed and entered in uncommitted block table
Needed until metadata is flushed to storage
Designed for append writing, in-place updates are expensive
Chunks have a max size, at which point a chunk is created. Writes are cached in memory until Hydra confirms them (this allows responses to reads in the meantime, or handling of failures in Hydra). Data is not visible in Hydra until a new FS version is created.
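A sketch of the append-write path described above: data accumulates in a per-inode buffer, full chunks are written to the store, and confirmed blocks sit in an uncommitted block table until the metadata that references them is flushed. The fixed 64 KB cutoff and all names are illustrative (HydraFS chunks by content, as earlier slides note):

```python
class OpenFile:
    def __init__(self, store, chunk_size=64 * 1024):
        self.store = store
        self.chunk_size = chunk_size   # max chunk size (illustrative)
        self.buffer = bytearray()      # inode-specific write buffer
        self.uncommitted = []          # (offset, address) until metadata is flushed
        self.offset = 0

    def append(self, data: bytes):
        self.buffer.extend(data)
        while len(self.buffer) >= self.chunk_size:
            chunk = bytes(self.buffer[:self.chunk_size])
            del self.buffer[:self.chunk_size]
            address = self.store.put(chunk)            # write confirmed by the store
            self.uncommitted.append((self.offset, address))
            self.offset += len(chunk)                  # buffer freed, address retained

    def metadata_flushed(self):
        # Once the commit server persists metadata pointing at these blocks,
        # the uncommitted block table entries can be dropped.
        self.uncommitted.clear()
```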
Metadata Cleaning
Dirty data kept until the commit server applies changes
New versions of file systems are created periodically
Metadata in separate structures, tagged by time
Always clean (in Hydra), can be dropped from cache at any time
Cleaning allows file servers to drop changes in the new FS version
A new FS version allows a file server to clean its dirty metadata proactively.
Admission Control
Events assume worst-case memory usage
If insufficient resources are available, the event blocks
Limits the number of active events
Memory usage is tuned to the amount of physical memory
Not all memory used is freed when an action completes, for example the cache. This can be flushed if the system finds it needs to reclaim memory. Avoiding swapping is key to keeping latencies low and performance up.
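A sketch of admission control with worst-case reservations: each event reserves its worst-case memory need up front and blocks if the budget is exhausted, which bounds the number of active events. The condition-variable pattern and the budget parameter are assumptions, not the HydraFS implementation:

```python
import threading


class AdmissionController:
    def __init__(self, budget_bytes):
        self.available = budget_bytes   # tuned to physical memory in practice
        self.lock = threading.Condition()

    def admit(self, worst_case_bytes):
        with self.lock:
            # Block the event until its worst-case reservation fits in the budget.
            while self.available < worst_case_bytes:
                self.lock.wait()
            self.available -= worst_case_bytes

    def release(self, worst_case_bytes):
        with self.lock:
            self.available += worst_case_bytes
            self.lock.notify_all()
```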
Read Processing
Aggressive read-ahead
Multiple fetches to get metadata
Weighted caching to favor metadata over data
Fast range map
Metadata read-ahead
Primes FRM, cache
Read-ahead goes into an in-memory LRU cache; the default is 20 MB. HydraFS caches both metadata and data, using large leaf nodes and high-fanout parent nodes. The fast range map is a look-aside buffer that translates a file offset to a content address. The FRM and B-tree read-ahead add 36% performance for a small memory/CPU overhead.
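A sketch of a fast range map as a look-aside buffer: a sorted list of (offset, length, content address) entries searched by bisection, so a read can translate a file offset to a content address without descending the inode B-tree. The data structure and names are illustrative:

```python
import bisect


class FastRangeMap:
    def __init__(self):
        self.offsets = []   # sorted start offsets
        self.entries = []   # parallel list of (length, content_address)

    def insert(self, offset, length, address):
        i = bisect.bisect_left(self.offsets, offset)
        self.offsets.insert(i, offset)
        self.entries.insert(i, (length, address))

    def lookup(self, offset):
        """Return the content address covering `offset`, or None on a miss."""
        i = bisect.bisect_right(self.offsets, offset) - 1
        if i < 0:
            return None
        start = self.offsets[i]
        length, address = self.entries[i]
        return address if start <= offset < start + length else None
```

On a miss, the read falls back to the inode B-tree and the result can be inserted into the map to prime it for later reads.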
Deletion
File deletion removes the entry from the current FS
Data remains until there are no pointers to it
The data will remain in storage until all FS versions that reference it are garbage collected. A block may be pointed to by other files as well. The FS only marks roots for deletion; Hydra handles reference counting and storage reclamation.
Performance
[Bar chart: throughput of the file system normalized to the raw block device (0.0-1.0) for Read (iSCSI), Read (Hydra), Write (iSCSI), and Write (Hydra).]
Sequential throughput. iSCSI is 6 disks per node in software RAID 5 (likely the write hit iSCSI takes). Block size is 64 KB. HydraFS achieves 82% of raw throughput on reads, 88% on writes.
Metadata Intensive
Postmark
Generates files, then issues transactions.
File size: 512 B - 16 KB
          Create          Delete          Overall
          Alone    Tx     Alone    Tx
ext3      1,851    68     1,787    68     136
HydraFS      61    28       676    28      57
This is a worst case for HydraFS. It had to create FS versions on the fly due to the limit on outstanding metadata updates, and there are fewer operations to amortize costs over.
Write Performance vs Dedup
[Line chart: write throughput (MB/s, 0-350) vs. duplicate ratio (%, 0-80) for Hydra and HydraFS.]
HydraFS stays within 12% of Hydra throughout.
Write Behind
[Plot: file offset written (GB, roughly 6-10) vs. time (s, 0-20), showing write-behind progress.]
This helps with buffering; there is no I/O in the write critical path. There is a lot of jitter around 6 seconds; the biggest gap is 1.5 GB.
Hydra Latency
[CDF: Pr(t <= x) of Hydra latency vs. time (ms, 0-70).]
The 90th percentile is at 10 ms. The point: even though Hydra is jittery and high-latency, HydraFS still works (it smooths things out).
Future Work
Allow multiple nodes to manage the same FS, making failover transparent and automatic
Exposing snapshots to users
Incorporating SSD storage to lower latencies, making HydraFS usable as primary storage
Thank you
Questions? Comments?
email: [email protected]
Paper: http://www.usenix.org/events/fast10/tech/full_papers/ungureanu.pdf
Sample Operations
Block Write
Block Read
Searchable Block Write
Searchable Block Read
Writes trade blocks for CAS addresses; reads invert that. Labels can group data for retention or deletion; garbage collection reaps all the data that isn't part of a tree anchored by a retention block.
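A sketch of the four operations on top of a toy store: regular blocks are addressed by their content, while searchable blocks are written and read by a user-chosen key (for example an FS version root acting as a retention anchor). The key scheme, class, and use of SHA-256 are illustrative assumptions, not the HYDRAstor API:

```python
import hashlib


class SearchableStore:
    """Toy CAS with both content-addressed and key-addressed (searchable) blocks."""

    def __init__(self):
        self.blocks = {}       # content address -> data
        self.searchable = {}   # user key -> data (e.g. retention roots)

    def block_write(self, data: bytes) -> bytes:
        address = hashlib.sha256(data).digest()
        self.blocks[address] = data
        return address

    def block_read(self, address: bytes) -> bytes:
        return self.blocks[address]

    def searchable_block_write(self, key: bytes, data: bytes):
        # Retrievable later by key alone, without remembering a content address.
        self.searchable[key] = data

    def searchable_block_read(self, key: bytes) -> bytes:
        return self.searchable[key]


store = SearchableStore()
root_addr = store.block_write(b"fs root block for version 42")
store.searchable_block_write(b"hydrafs/version/42", root_addr)
assert store.block_read(store.searchable_block_read(b"hydrafs/version/42")) == b"fs root block for version 42"
```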