April 27, 2010
HydraFS
C. Ungureanu, B. Atkin, A. Aranya, et al.
Slides: Joe Buck, CMPS 229, Spring 2010
Introduction
What is HydraFS?
Why is it necessary?
HydraFS is a file system on top of HYDRAstor, a scalable, distributed CAS. Applications don't write to a CAS interface, they write to a FS interface; an adapter layer is needed, thus HydraFS. The CAS uses a put/get model.
HYDRAstor
What is HYDRAstor?
Immutable data
High Latency
Jitter
Put / Get API
Note the inconsistent use of capitalization in the acronyms. Jitter, in this case, means the distance between writes to storage. Chunking is also worth mentioning here.
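As a concrete (hypothetical) illustration of the put/get model, here is a toy in-memory content-addressable store; the class name and the use of SHA-256 are assumptions for the sketch, not the actual HYDRAstor block access API:

```python
import hashlib


class ContentAddressableStore:
    """Toy in-memory stand-in for a put/get CAS interface."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> bytes:
        # The store derives the block's address from its content.
        address = hashlib.sha256(data).digest()
        self._blocks[address] = data  # immutable: same content, same slot
        return address

    def get(self, address: bytes) -> bytes:
        return self._blocks[address]


store = ContentAddressableStore()
addr = store.put(b"some immutable block")
assert store.get(addr) == b"some immutable block"
```

Unlike this toy, the real store returns the address only after the write completes, which is why clients cannot compute it in advance (see the later slide on CAS addresses).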
Hydra Diagram
[Diagram: on the access node, HydraFS (file server and commit server) sits on top of the HYDRAstor block access library; Hydra itself is a single-system content addressable store built from storage nodes.]
CAS
[Diagram: a client writes data, shown as a sequence of 4 KB blocks, into the CAS.]
CAS - continued
[Diagram: the same client data now passes through a chunker on its way into the CAS.]
The chunker uses a heuristic involving the content data, plus some hard-set limits, to chunk in variable sizes.
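A rough sketch of content-defined chunking with hard minimum and maximum sizes, in the spirit of what the notes describe; the size limits, the cut-point mask, and the toy hash are illustrative choices, not the values or algorithm HydraFS uses:

```python
MIN_CHUNK = 2 * 1024       # hard lower bound on chunk size (illustrative)
MAX_CHUNK = 64 * 1024      # hard upper bound on chunk size (illustrative)
CUT_MASK = (1 << 13) - 1   # expected chunk size around 8 KB (illustrative)


def chunk(data: bytes):
    """Yield variable-sized chunks whose boundaries depend on the content."""
    start = 0
    state = 0
    for i, byte in enumerate(data):
        # Toy content-dependent hash; a real chunker would use a rolling hash
        # such as a Rabin fingerprint over a sliding window.
        state = ((state << 1) ^ byte) & 0xFFFFFFFF
        length = i - start + 1
        at_cut_point = (state & CUT_MASK) == 0
        if length >= MAX_CHUNK or (length >= MIN_CHUNK and at_cut_point):
            yield data[start:i + 1]
            start = i + 1
            state = 0
    if start < len(data):
        yield data[start:]
```

Because boundaries depend on content rather than position, an insertion early in a file only changes the chunks near the edit instead of shifting every block that follows.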
CAS - continued
[Diagram: the chunker emits variable-sized chunks (2 KB and 4 KB); the CAS now holds object cas1, 10 KB in size.]
Objects in the CAS have IDs that are pointed to by metadata. cas1 is 10 KB in size.
CAS - continued
[Diagram: further chunks (1 KB and 4 KB) are written; the CAS now holds cas1 (10 KB) and cas2 (9 KB).]
cas2 is 9 KB in size.
A little more on CAS addresses
Same data doesn't mean the same address
Impossible to calculate prior to write
Foreground processing writes shallow trees
Root cannot be updated until all child nodes are set
Differing retention levels can produce different CAS addresses. Collisions can be detected but are unlikely. Writes are done asynchronously; the file system blocks on the root node commit.
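A sketch of the write ordering this implies: child blocks can be written asynchronously, but the root block embeds the addresses returned for its children, so it cannot be written until all child writes are acknowledged. The thread-pool structure and the `store` object (any put/get store, such as the toy above) are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor


def write_tree(store, leaves):
    """Write leaf blocks concurrently, then a root block that references them."""
    with ThreadPoolExecutor() as pool:
        # Child writes are independent, so they can be issued asynchronously.
        futures = [pool.submit(store.put, leaf) for leaf in leaves]
        child_addresses = [f.result() for f in futures]  # wait: children must be durable
    # Only now can the root exist: it embeds addresses that were unknown
    # before the child writes were acknowledged.
    root_block = b"".join(child_addresses)
    return store.put(root_block)
```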
Issues for a CAS FS
Updates are more expensive
Metadata cache misses cause significant performance issues
The combination of high latency and high throughput means lots of buffering
Updates must touch all metadata that points to affected data. Buffering allows for optimal write ordering, and the read cache is important as well.
Design Decisions
Decouple data and metadata processing
Fixed size caches with admission control
Second-order cache for metadata
From the previous three issues come three design decisions:
1) This is done via a log, which allows batching of metadata updates.
2) This prevents swapping and other resource over-allocation.
3) This removes operations from reads via cache hits and improves the metadata cache hit rate.
Issues - continued
Immutable Blocks
FS can only reference blocks already written
Forms DAGs
Height of DAGs needs to be minimized
The entire tree must be updated if a block contained in it is updated, which makes updates quite expensive.
Issues - continued
Variable sized blocks
Avoids the shifting window problem
Use a balanced tree structure
This is the chunking referred to in the paper. There is a min/max size for chunks. The tree helps minimize the height of the DAGs.
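A sketch of how a balanced, high-fanout tree over chunk addresses keeps the DAG shallow: height grows logarithmically with the fanout. The fanout of 64 and the packing of child addresses into internal nodes are illustrative choices, not the HydraFS format:

```python
FANOUT = 64  # illustrative; a high fanout keeps the tree shallow


def build_file_tree(store, chunks):
    """Build a balanced tree over chunk addresses; return the root address."""
    level = [store.put(c) for c in chunks]        # leaf level: data chunks
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), FANOUT):
            # An internal node is just the packed addresses of its children.
            node = b"".join(level[i:i + FANOUT])
            next_level.append(store.put(node))
        level = next_level
    return level[0] if level else None
```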
FS design
High Throughput
Minimize the number of dependent I/O operations
Availability guarantees no worse than standard Unix FS
Efficiently support both local and remote access
Close-to-open consistency (an fsync acknowledgment means the data is persisted). Remote access could be via NFS or CIFS.
File System Layout
[Diagram: super blocks point to an imap handle; an imap B-tree and segmented array map inode numbers (e.g. 321, 365, 442) to inodes; a directory inode points to directory blocks mapping filenames (Filename1, Filename2, Filename3) to inode numbers; a regular file inode points, via an inode B-tree, to the file contents.]
The inode map is similar to that of a log-structured file system. Files dedup across file systems.
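A toy model of the lookup chain this layout implies, using plain dictionaries in place of the imap and directory structures; every level of indirection here would be a separate CAS block in HydraFS, and all names and inode numbers are illustrative:

```python
def lookup(fs, dir_inode_no, name):
    """Resolve a filename to its inode, following the layout on this slide."""
    imap = fs["superblock"]["imap"]         # imap: inode number -> inode
    dir_inode = imap[dir_inode_no]          # directory inode
    inode_no = dir_inode["entries"][name]   # directory block: name -> inode number
    return imap[inode_no]                   # regular file inode -> file contents


fs = {
    "superblock": {
        "imap": {
            321: {"type": "dir", "entries": {"Filename1": 365}},
            365: {"type": "file", "contents": b"hello"},
        }
    }
}
assert lookup(fs, 321, "Filename1")["contents"] == b"hello"
```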
HydraFS Software Stack
Uses FUSE
Split into file server and commit server
Simplifies metadata locking
Amortizes the cost of metadata updates via batching
Each server has its own caching strategy
The file server manages the interface to the client, records file modifications in a transaction log stored in Hydra, and keeps an in-memory cache of recent file modifications. The commit server reads the transaction log, updates FS metadata, and generates new FS versions.
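A sketch of that split, assuming a simple in-memory queue in place of the Hydra-resident transaction log; `apply_update` and `publish_fs_version` are hypothetical placeholders for the metadata update and FS-version publication steps, and `stop` can be any object with an `is_set()` method (e.g. a threading.Event):

```python
import queue

log = queue.Queue()  # stand-in for the transaction log stored in Hydra


def file_server_write(inode_no, offset, chunk_addresses):
    # The file server only records what changed; it does not update FS metadata itself.
    log.put(("write", inode_no, offset, chunk_addresses))


def commit_server_loop(apply_update, publish_fs_version, stop):
    while not stop.is_set():
        batch = []
        try:
            batch.append(log.get(timeout=1.0))
            while True:
                batch.append(log.get_nowait())  # drain whatever has accumulated
        except queue.Empty:
            pass
        if batch:
            for record in batch:
                apply_update(record)            # amortize metadata updates over the batch
            publish_fs_version()                # a new FS root makes the batch visible
```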
Writing Data
Data stored in inode-specific buffer
Chunked, marked dirty and written to Hydra
After write confirmation, block freed and entered in uncommitted block table
Needed until metadata is flushed to storage
Designed for append writing, in-place updates are expensive
Chunks have a max size, at which point a chunk is created. Writes are cached in memory until Hydra confirms them (this allows responses to reads in the meantime, or handling of failures in Hydra). Data is not visible in Hydra until a new FS version is created.
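A sketch of the append-write path described above: data accumulates in a per-inode buffer, full chunks are written to the store, and confirmed blocks sit in an uncommitted block table until the metadata that references them is flushed. The fixed 64 KB cutoff and all names are illustrative (HydraFS chunks by content, as earlier slides note):

```python
class OpenFile:
    def __init__(self, store, chunk_size=64 * 1024):
        self.store = store
        self.chunk_size = chunk_size   # max chunk size (illustrative)
        self.buffer = bytearray()      # inode-specific write buffer
        self.uncommitted = []          # (offset, address) until metadata is flushed
        self.offset = 0

    def append(self, data: bytes):
        self.buffer.extend(data)
        while len(self.buffer) >= self.chunk_size:
            chunk = bytes(self.buffer[:self.chunk_size])
            del self.buffer[:self.chunk_size]
            address = self.store.put(chunk)            # write confirmed by the store
            self.uncommitted.append((self.offset, address))
            self.offset += len(chunk)                  # buffer freed, address retained

    def metadata_flushed(self):
        # Once the commit server persists metadata pointing at these blocks,
        # the uncommitted block table entries can be dropped.
        self.uncommitted.clear()
```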
Metadata Cleaning
Dirty data kept until the commit server applies changes
New versions of file systems are created periodically
Metadata in separate structures, tagged by time
Always clean (in Hydra), can be dropped from cache at any time
Cleaning allows file servers to drop changes in the new FS version
A new FS version allows a file server to clean its dirty metadata proactively.
Admission Control
Events assume worst-case memory usage
If insufficient resources are available, the event blocks
Limits the number of active events
Memory usage is tuned to the amount of physical memory
Not all memory used is freed when an action completes, for example the cache. This can be flushed if the system finds it needs to reclaim memory. Avoiding swapping is key to keeping latencies low and performance up.
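A sketch of admission control with worst-case reservations: each event reserves its worst-case memory need up front and blocks if the budget is exhausted, which bounds the number of active events. The condition-variable pattern and the budget parameter are assumptions, not the HydraFS implementation:

```python
import threading


class AdmissionController:
    def __init__(self, budget_bytes):
        self.available = budget_bytes   # tuned to physical memory in practice
        self.lock = threading.Condition()

    def admit(self, worst_case_bytes):
        with self.lock:
            # Block the event until its worst-case reservation fits in the budget.
            while self.available < worst_case_bytes:
                self.lock.wait()
            self.available -= worst_case_bytes

    def release(self, worst_case_bytes):
        with self.lock:
            self.available += worst_case_bytes
            self.lock.notify_all()
```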
Read Processing
Aggressive read-ahead
Multiple fetches to get metadata
Weighted caching to favor metadata over data
Fast range map
Metadata read-ahead
Primes FRM, cache
Read-ahead goes into an in-memory LRU cache; the default is 20 MB. HydraFS caches both metadata and data, using large leaf nodes and high-fanout parent nodes. The fast range map is a look-aside buffer that translates a file offset to a content address. The FRM and B-tree read-ahead add 36% performance for a small memory/CPU overhead.
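A sketch of a fast range map as a look-aside buffer: a sorted list of (offset, length, content address) entries searched by bisection, so a read can translate a file offset to a content address without descending the inode B-tree. The data structure and names are illustrative:

```python
import bisect


class FastRangeMap:
    def __init__(self):
        self.offsets = []   # sorted start offsets
        self.entries = []   # parallel list of (length, content_address)

    def insert(self, offset, length, address):
        i = bisect.bisect_left(self.offsets, offset)
        self.offsets.insert(i, offset)
        self.entries.insert(i, (length, address))

    def lookup(self, offset):
        """Return the content address covering `offset`, or None on a miss."""
        i = bisect.bisect_right(self.offsets, offset) - 1
        if i < 0:
            return None
        start = self.offsets[i]
        length, address = self.entries[i]
        return address if start <= offset < start + length else None
```

On a miss, the read falls back to the inode B-tree and the result can be inserted into the map to prime it for later reads.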
Deletion
File deletion removes the entry from the current FS
Data remains until there are no pointers to it
The data will remain in storage until all FS versions that reference it are garbage collected. A block may be pointed to by other files as well. The FS only marks roots for deletion; Hydra handles reference counting and storage reclamation.
Performance
[Bar chart: throughput of the file system normalized to the raw block device (0.0-1.0) for Read (iSCSI), Read (Hydra), Write (iSCSI), and Write (Hydra).]
Sequential throughput. iSCSI is 6 disks per node in software RAID 5 (likely the write hit iSCSI takes). Block size is 64 KB. HydraFS achieves 82% of raw throughput on reads, 88% on writes.
Metadata Intensive
Postmark
Generates files, then issues transactions.
File size: 512 B - 16 KB
          Create          Delete          Overall
          Alone    Tx     Alone    Tx
ext3      1,851    68     1,787    68     136
HydraFS      61    28       676    28      57
This is a worst case for HydraFS. It had to create FS versions on the fly due to the limit on outstanding metadata updates, and there are fewer operations to amortize costs over.
Write Performance vs Dedup
[Line chart: write throughput (MB/s, 0-350) vs. duplicate ratio (%, 0-80) for Hydra and HydraFS.]
HydraFS stays within 12% of Hydra throughout.
Write Behind
[Plot: file offset written (GB, roughly 6-10) vs. time (s, 0-20), showing write-behind progress.]
This helps with buffering; there is no I/O in the write critical path. There is a lot of jitter around 6 seconds; the biggest gap is 1.5 GB.
Hydra Latency
[CDF: Pr(t <= x) of Hydra latency vs. time (ms, 0-70).]
The 90th percentile is at 10 ms. The point: even though Hydra is jittery and high-latency, HydraFS still works (it smooths things out).
Future Work
Allow multiple nodes to manage the same FS, making failover transparent and automatic
Exposing snapshots to users
Incorporating SSD storage to lower latencies, making HydraFS usable as primary storage
Thank you
Questions? Comments?
email: [email protected]
Paper: http://www.usenix.org/events/fast10/tech/full_papers/ungureanu.pdf
Sample Operations
Block Write
Block Read
Searchable Block Write
Searchable Block Read
Writes trade blocks for CAS addresses; reads invert that. Labels can group data for retention or deletion; garbage collection reaps all the data that isn't part of a tree anchored by a retention block.
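A sketch of the four operations on top of a toy store: regular blocks are addressed by their content, while searchable blocks are written and read by a user-chosen key (for example an FS version root acting as a retention anchor). The key scheme, class, and use of SHA-256 are illustrative assumptions, not the HYDRAstor API:

```python
import hashlib


class SearchableStore:
    """Toy CAS with both content-addressed and key-addressed (searchable) blocks."""

    def __init__(self):
        self.blocks = {}       # content address -> data
        self.searchable = {}   # user key -> data (e.g. retention roots)

    def block_write(self, data: bytes) -> bytes:
        address = hashlib.sha256(data).digest()
        self.blocks[address] = data
        return address

    def block_read(self, address: bytes) -> bytes:
        return self.blocks[address]

    def searchable_block_write(self, key: bytes, data: bytes):
        # Retrievable later by key alone, without remembering a content address.
        self.searchable[key] = data

    def searchable_block_read(self, key: bytes) -> bytes:
        return self.searchable[key]


store = SearchableStore()
root_addr = store.block_write(b"fs root block for version 42")
store.searchable_block_write(b"hydrafs/version/42", root_addr)
assert store.block_read(store.searchable_block_read(b"hydrafs/version/42")) == b"fs root block for version 42"
```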