1
Scalable metadata management at very large-scale file systems: a Survey
Presented by Viet-Trung TRAN, KerData Team
2
Outline
Overview of metadata management
Motivation: distributed metadata management
Metadata distribution strategies: static sub-tree partitioning, hashing, Lazy Hybrid, dynamic sub-tree partitioning, probabilistic lookup
Metadata partitioning granularity
Hierarchical file systems are dead
Gfarm/BlobSeer: a Versioning distributed file system
3
Overview of metadata management
Hierarchical file systems
Name space/directory service: maps a human-readable name to a file identifier (/a/b/c -> 232424)
File location service: maps a file identifier to distributed file parts (232424 -> {block, object addresses}); both mappings are sketched below
Access control service
Flat file systems (Amazon S3, some obsolete ones: CFS, PAST): no name space/directory service
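A minimal sketch of these two mappings; the tables, identifiers, and server addresses below are hypothetical, purely to illustrate the two-step resolution.

```python
# Minimal sketch of the two metadata services above (hypothetical data).

# Name space / directory service: human-readable path -> file identifier
namespace = {"/a/b/c": 232424}

# File location service: file identifier -> distributed file parts
locations = {232424: ["object-server-3:oid-17", "object-server-9:oid-42"]}

def resolve(path):
    """Resolve a path to the addresses of its distributed parts."""
    file_id = namespace[path]   # directory service lookup
    return locations[file_id]   # file location service lookup

print(resolve("/a/b/c"))  # ['object-server-3:oid-17', 'object-server-9:oid-42']
```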
4
Motivation
50% to 80% of file system accesses are to metadata [Ousterhout et al., 1985]
Current approach: distributed object-based file systems, where object-based storage servers manage data and metadata servers manage metadata
Provide the ability to efficiently handle scalable I/O
But the limited performance of the metadata server (MDS) remains critical to overall system performance
At very large scale: the number of files/directories can reach trillions; the number of concurrent users: ???
=> Scalable metadata management in a distributed manner is crucial
5
Metadata distribution strategies
Problem: multiple metadata servers, each handling a part of the whole namespace. Input: a file pathname. Output: the metadata server that owns the corresponding file metadata.
Constraints: load balancing, lookup time, migration cost upon reconfiguration, directory operations (e.g. ls), scalability, locking
Solutions ???
6
Static sub-tree partitioning (NFS, Sprite, AFS, Coda)
The name space is partitioned at system configuration stage
A system administrator decides how the system should be distributed
Manually assign sub-trees of the hierarchy to individual servers (a minimal lookup sketch follows the pros/cons below)
Pros: simple for clients to identify the server responsible for a piece of metadata; no inter-server communication; servers are independent of each other; concurrent processing among different sub-trees
Cons: no load balancing; partitioning granularity is very large
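A minimal sketch of the static lookup, assuming a hypothetical administrator-defined table that assigns whole sub-trees to servers and a longest-prefix match on the pathname.

```python
# Minimal sketch of static sub-tree partitioning: the administrator assigns
# sub-trees to servers at configuration time (hypothetical table below).
SUBTREE_TO_SERVER = {
    "/home": "mds1",
    "/projects": "mds2",
    "/projects/big": "mds3",
    "/scratch": "mds4",
}

def server_for(path):
    """Pick the server owning the longest sub-tree prefix of the path."""
    best = None
    for subtree, server in SUBTREE_TO_SERVER.items():
        if path == subtree or path.startswith(subtree + "/"):
            if best is None or len(subtree) > len(best[0]):
                best = (subtree, server)
    return best[1] if best else "mds1"   # hypothetical default server

print(server_for("/projects/big/data/file1"))  # mds3
```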
7
Hashing (Lustre, Vesta, zFS)
Hash the file pathname or the global file identifier to the location of the metadata (the corresponding metadata server); a sketch follows below
Pros: load balancing; O(1) lookup time
Cons: hashing eliminates all hierarchical locality; large migration cost (rehashing is needed when the configuration/number of servers changes); very slow for some directory operations (rename, link creation); hard to satisfy POSIX directory access semantics
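A minimal sketch of pathname hashing and of the migration cost: with modular hashing, changing the server count remaps most pathnames. The server counts and paths below are arbitrary.

```python
import hashlib

def mds_for(pathname: str, num_servers: int) -> int:
    """Hash the full pathname to a metadata server index (O(1) lookup)."""
    digest = hashlib.md5(pathname.encode()).hexdigest()
    return int(digest, 16) % num_servers

paths = [f"/dir{i}/file{i}" for i in range(10000)]

# With 8 servers, then after adding one server: count how many files move.
before = {p: mds_for(p, 8) for p in paths}
after = {p: mds_for(p, 9) for p in paths}
moved = sum(1 for p in paths if before[p] != after[p])
print(f"{moved}/{len(paths)} files change servers")  # typically ~8/9 of them
```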
8
Lazy Hybrid (Brandt et al., 2003)
Seeks to capitalize on the benefits of a hashed distribution while avoiding the problems associated with path traversal when checking permissions
Still relies on hashing the full pathname to distribute metadata
A small modification: uses the hash value as an index into the metadata lookup table (MLT) rather than as the metadata identifier, to facilitate the addition and removal of servers
9
Lazy Hybrid: avoid directory traversal for permission checking
=> Use a dual-entry access control list (ACL) structure for managing permissions. Each file or directory has two ACLs: file permissions and path permissions. Path permissions can be constructed recursively and are only updated when an ancestor directory's access permissions change (see the sketch below).
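A minimal sketch of the dual-entry ACL idea, with a deliberately simplified permission model (plain sets of allowed users, hypothetical names).

```python
# Minimal sketch of the dual-entry ACL idea: each file keeps its own
# permissions plus a precomputed "path ACL" summarizing its ancestors,
# so a permission check needs no directory traversal.

class Entry:
    def __init__(self, file_acl, path_acl):
        self.file_acl = file_acl    # users allowed by the file itself
        self.path_acl = path_acl    # users allowed by every ancestor directory

def can_access(entry, user):
    # Single local check instead of walking /a, /a/b, ... down to the file.
    return user in entry.file_acl and user in entry.path_acl

e = Entry(file_acl={"alice", "bob"}, path_acl={"alice"})
print(can_access(e, "alice"))  # True
print(can_access(e, "bob"))    # False: some ancestor directory denies bob
```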
10
Lazy Hybrid ‒ lazy policies
Four expensive operations: changing permissions on a directory, renaming a directory, removing a directory, changing the MLT
=> message exchanges, metadata migration
Lazy policies: execute the operation first; the metadata is updated later, upon the first access. Invalidation uses inter-server communication. Lazy metadata update and relocation (recursively updated up to the root).
11
Lazy Hybrid
Pros: avoids directory traversal in most cases; the pros of pure hashing (load balancing, lookup time, recovery); good scalability
Cons: no locality benefits (a small modification affects multiple metadata servers); individual hot spots on popular files (could be improved by dynamically replicating the associated metadata; is this possible on a DHT?)
12
Dynamic sub-tree partitioning - Ceph (Weil et al., 2004, 2006)
Ceph dynamically maps sub-trees of the directory hierarchy to metadata servers based on the current workload.
Individual directories are hashed across multiple nodes only when they become hot spots.
Key design: Partitioning by path name
13
Dynamic sub-tree partitioning - Ceph
No distributed locking (each piece of metadata has an authority MDS)
The authority MDS serializes accesses to the metadata
Collaborative caching
Pros: inodes embedded in dentries to speed up directory operations; locality benefits; load balancing (the number of replicas is dynamically adjusted)
Cons: needs an accurate load measurement; migration cost for the addition/removal of servers; the paper is not entirely clear
14
Dynamic sub-tree partitioning - Farsite (Douceur et al., 2006)
Partitioning by pathname complicates renames across partitions
Partitioning by file identifier, which is immutable, can be a better approach
Tree-structured file identifiers, encoded by a variant of Elias γ coding
Clients use a file map to know which servers manage which regions of the file-identifier space (see the sketch below)
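A minimal sketch of a file-map lookup. The tree-structured identifiers are shown as dotted prefixes rather than their real encoding, and the map itself is hypothetical.

```python
# Minimal sketch of a Farsite-style file map: servers own regions of the
# tree-structured file-identifier space. Identifiers appear here as dotted
# prefixes; the actual Elias-gamma-style encoding is not reproduced.
FILE_MAP = {
    "1": "serverA",        # root region
    "1.2": "serverB",      # sub-region delegated to another server
    "1.2.5": "serverC",
}

def owner(file_id: str) -> str:
    """Longest-prefix match of a file identifier against the file map."""
    best = ""
    for region in FILE_MAP:
        if (file_id == region or file_id.startswith(region + ".")) and len(region) > len(best):
            best = region
    return FILE_MAP[best]

print(owner("1.2.5.7"))  # serverC
print(owner("1.3.1"))    # serverA
```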
15
Dynamic sub-tree partitioning - Farsite
Two-phase locking ensures atomic rename operations (sketched below)
Leader: the destination directory. Two followers: the source directory and the file being renamed. Each follower validates its part, locks the relevant metadata, and notifies the leader; the leader decides whether the update is valid.
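A highly simplified sketch of that validation flow: local function calls stand in for the inter-server messages, and lock handling is reduced to a flag. This only illustrates the leader/follower voting pattern, not Farsite's actual protocol.

```python
def two_phase_rename(leader, followers):
    """Simplified leader/follower rename validation."""
    # Phase 1: each follower validates its part and takes a (symbolic) lock.
    votes = []
    for f in followers:
        ok = f["validate"]()
        if ok:
            f["locked"] = True
        votes.append(ok)
    # Phase 2: the leader commits only if every follower voted yes.
    decision = all(votes) and leader["validate"]()
    for f in followers:
        f["locked"] = False   # simplification: real systems release after applying the decision
    return "commit" if decision else "abort"

src_dir = {"validate": lambda: True, "locked": False}
renamed = {"validate": lambda: True, "locked": False}
dst_dir = {"validate": lambda: True}
print(two_phase_rename(dst_dir, [src_dir, renamed]))  # commit
```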
Dynamic partitioning policies: optimal policies have not yet been developed. A file is considered active if its metadata has been accessed within a 5-minute interval. The load on a region of the file-identifier space is the count of active files in the region. Transfer a few heavily loaded sub-trees rather than many lightly loaded ones.
16
Dynamic sub-tree partitioning - Farsite
Pros: fast file creation and renaming; dynamic partitioning; requires fewer multi-server operations
Cons: requires directory traversal from the root to the desired file, since the partitioning is not based on names (can be reduced by caching)
17
Probabilistic lookup ‒ Bloom filter based approaches (work from Chinese research groups)
[1] Y. Zhu, H. Jiang, J. Wang, and F. Xian, "HBA: Distributed Metadata Management for Large Cluster-Based Storage Systems," IEEE Trans. Parallel Distrib. Syst., vol. 19, 2008, pp. 750-763.
[2] Y. Hua, Y. Zhu, H. Jiang, D. Feng, and L. Tian, "Scalable and Adaptive Metadata Management in Ultra Large-Scale File Systems," 2008 The 28th International Conference on Distributed Computing Systems, 2008, pp. 403-410.
[3] Y. Hua, H. Jiang, Y. Zhu, D. Feng, and L. Tian, "SmartStore: A New Metadata Organization Paradigm with Semantic-Awareness for Next-Generation File Systems," System, 2009.
18
Probabilistic lookup ‒ Bloom filter based
A Bloom filter is a fast and space-efficient probabilistic data structure used to test whether an element is a member of a set. Elements can be added to the set, but not removed. The more elements that are added to the set, the larger the probability of false positives.
False positive: the test returns true but the element is not in the set.
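A minimal Bloom filter sketch (m bits, k hash positions derived from MD5); the parameters are arbitrary and chosen only for illustration.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bits, k hash positions derived from MD5."""
    def __init__(self, m=1 << 16, k=4):
        self.m, self.k, self.bits = m, k, bytearray(m // 8 + 1)

    def _positions(self, item: str):
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("/home/user/file.txt")
print("/home/user/file.txt" in bf)   # True
print("/home/user/other.txt" in bf)  # False (or, rarely, a false positive)
```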
19
Pure Bloom filter array approach
Each MDS holds an array of BFs
One BF has m bits and represents all files whose metadata is stored locally; the other BFs are replicas of the BFs of the other MDSs
A client randomly chooses an MDS and asks whether it owns the file's metadata, sending the full pathname. The MDS uses k hash functions to calculate k array positions, checks its array of BFs, and returns either the correct MDS or the metadata itself if it owns it.
20
Pure Bloom filter array approach
Pros: load balancing (claimed by the authors, but see the cons); low migration cost upon the addition/removal of an MDS (a single BF needs to be added or deleted)
Cons: no real load balancing, since the approach does not control how the metadata is distributed (placement is random); false positives, and the accuracy of PBA degrades quickly as the number of files increases; when a file or directory is renamed, only the BFs associated with the involved files need to be rebuilt, but this can still take a long time since all of them must be rehashed with the k hash functions
21
Hierarchical Bloom filter array approach
Assumption: a small portion of the files absorbs most of the I/O activity
Least recently used (LRU) BFs with a high bit/file ratio cache the recently visited files and are replicated globally among all MDSs => ensures a high hit rate
22
Hierarchical Bloom filter array approach
Pros: higher hit rate
Cons: simulation work only; hashing the full pathname still implies directory traversal for permission checking (could be improved by using a dual-entry access control list (ACL))
23
Metadata partitioning granularity
Problem: efficiently organize and maintain very large directories, each containing billions of files; high metadata performance requires the ability to modify metadata in parallel
Example: a directory with billions of entries can grow to 10-100 GB in size. A directory = {dentry | dentry = (name, inode)}. When a single huge directory is updated concurrently, the synchronization mechanism for updating metadata greatly restricts parallelization.
A scalable, distributed inner structure for directories to facilitate parallelization?
24
Scalable distributed directories
No partitioning (single metadata server)
Sub-tree partitioning: static (NFS, Sprite); dynamic (Ceph, Farsite)
File partitioning: hash-based metadata distribution; Bloom filter based (hashing again)
Partitioning within a single directory: GPFS from IBM (Schmuck et al., 2002); GIGA+ (Patil et al., 2007); a modification of GIGA+ (Xing et al., 2009)
25
GIGA+: scalable directories for shared file systems
Focuses on: maintaining UNIX file-system semantics; high throughput and scalability; incremental growth; minimal bottlenecks and shared-state synchronization
26
GIGA+: scalable directories for shared file systems
A directory is dynamically fragmented into a number of partitions
Extendible hashing provides incremental growth (sketched below)
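A minimal sketch of the incremental-growth idea under extendible hashing: a partition that overflows splits in two by looking at one more bit of the filename hash. The threshold and data structures below are illustrative, not GIGA+'s actual implementation.

```python
import hashlib

SPLIT_THRESHOLD = 4      # tiny threshold, purely for illustration

def h(name: str) -> int:
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

# partition index -> (radix depth, entries); the directory starts as one partition
partitions = {0: (0, set())}

def find_partition(name: str) -> int:
    """Return the deepest existing partition whose index matches hash(name)."""
    r = max(depth for depth, _ in partitions.values())
    while True:
        i = h(name) & ((1 << r) - 1)
        if i in partitions:
            return i
        r -= 1

def insert(name: str) -> None:
    i = find_partition(name)
    depth, entries = partitions[i]
    entries.add(name)
    if len(entries) > SPLIT_THRESHOLD:          # overflow: split using bit `depth`
        stay = {e for e in entries if ((h(e) >> depth) & 1) == 0}
        partitions[i] = (depth + 1, stay)
        partitions[i + (1 << depth)] = (depth + 1, entries - stay)

for n in range(40):
    insert(f"file-{n}.dat")
print({i: len(e) for i, (_, e) in sorted(partitions.items())})
```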
27
GIGA+: scalable directories for shared file systems
How are directory partitions mapped to MDSs?
Clients cache a partition-to-server map (P2SMap); clients may hold inconsistent copies of the P2SMap; GIGA+ keeps the history of the fragmenting process
28
GIGA+: Scalable directories for Shared file systems
29
GIGA+: scalable directories for shared file systems
Hash(directory full pathname) -> the home server
Hash(file name) -> the partition ID
The P2SMap is represented as a bitmap (see the lookup sketch below)
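A minimal sketch of a client-side lookup against such a bitmap: bit i says whether partition i exists, and the client probes from the deepest radix downward. Mapping a partition index to a server with a simple modulo is an assumption for illustration, not GIGA+'s actual placement policy.

```python
import hashlib

NUM_SERVERS = 4              # hypothetical number of metadata servers

def h(name: str) -> int:
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

def route(filename: str, bitmap: int, max_radix: int):
    """Find the deepest existing partition for hash(filename), then pick a server."""
    r = max_radix
    while True:
        i = h(filename) & ((1 << r) - 1)
        if (bitmap >> i) & 1:                 # partition i exists in the bitmap
            return i, i % NUM_SERVERS         # (partition ID, server); modulo is an assumption
        r -= 1

# Partitions 0, 1 and 2 exist (bitmap 0b0111); up to 2 hash bits are in use.
print(route("report.txt", 0b0111, 2))
```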
30
GIGA+: scalable directories for shared file systems
Two-level metadata
Infrequently updated metadata (owner, creation time, …) is managed at a centralized MDS
Only highly dynamic attributes (modification time, access time, …) are managed across all servers that manage the directory
Pros: parallel processing within a directory
Cons: only efficient for large directories
31
Summary
Some small notes on distributed metadata management: a dual-entry access control list, embedded inodes, two-level metadata
Can we build a system that is better than, or at least a combination of, GIGA+, dynamic sub-tree partitioning, and LH? At the very least, one idea that comes to mind is to improve the BlobSeer version manager
"Distributed BLOB management based on the GIGA+ approach": each version manager manages a part of the BLOB namespace. This idea must be refined.
32
Hierarchical file systems are dead
Context: the hierarchical file system namespace is over forty years old; a typical disk was 300 MB and is now closer to 300 GB; billions of files and directories
Proposals: new APIs, indexing, semantics
33
Hierarchical file systems are dead
[1] S. Ames, C. Maltzahn, and E. Miller, "Quasar: A Scalable Naming Language for Very Large File Collections," Citeseer, 2008.
[2] A. Leung, A. Parker-Wood, and E. Miller, "Copernicus: A Scalable, High-Performance Semantic File System," ssrc.ucsc.edu, 2009.
[3] A. Leung, I. Adams, and E. Miller, "Magellan: A Searchable Metadata Architecture for Large-Scale File Systems," Systems Research, 2009.
[4] A. Leung, M. Shao, T. Bisson, S. Pasupathy, and E. Miller, "Spyglass: Fast, scalable metadata search for large-scale storage systems," Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST’09), 2009.
[5] S. Patil, G.A. Gibson, G.R. Ganger, J. Lopez, M. Polte, W. Tantisiroj, and L. Xiao, "In search of an API for scalable file systems: Under the table or above it," 2009, pp. 1-5.
[6] M. Seltzer and N. Murphy, "Hierarchical file systems are dead," usenix.org, 2009.
[7] Y. Hua, H. Jiang, Y. Zhu, D. Feng, and L. Tian, "SmartStore: A New Metadata Organization Paradigm with Semantic-Awareness for Next-Generation File Systems," System, 2009.
34
Gfarm/BlobSeer: a versioning file system
Motivation
A versioning file system with: versioning access control; time-based access, which enables root to roll back the whole system to a given point in time; consistency semantics; version granularity (on-close, on-snapshot, on-write), which can be specified per file; a replication technique; efficient version management with respect to space and workload
35
Gfarm/BlobSeer: Key design
Versioned data access control
A versioning flag as an additional access mode in the standard file system interface; controls which users/groups can access versioned data; controls whether a file/directory must be versioned or not; built on ACLs (need to check whether Gfarm supports ACLs)
Normal user access: access control based on ACLs
Administrative access: only the root administrator can roll back the whole system, using dedicated root commands
36
Gfarm/BlobSeer: Key design
How to access the system
Version-based API: a file handle and a desired version as the API parameters
Time-based API: a file handle and a timestamp as the API parameters (hypothetical signatures are sketched below)
The system should also provide versionable directory operations
To allow time-based access, clocks must be synchronized between the system components. There are two cases for time-based access to BLOBs: either clock synchronization is assumed, or an algorithm ensures clock synchronization.
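Hypothetical API signatures (not the actual Gfarm/BlobSeer interface), only to make the two parameter sets concrete.

```python
# Hypothetical signatures, illustration only: a real implementation would
# talk to the metadata service and to BlobSeer instead of raising.

def read_version(handle: int, version: int, offset: int, size: int) -> bytes:
    """Version-based access: read `size` bytes at `offset` from a given version."""
    raise NotImplementedError("illustration only")

def read_at_time(handle: int, timestamp: float, offset: int, size: int) -> bytes:
    """Time-based access: read from the newest version created at or before `timestamp`."""
    raise NotImplementedError("illustration only")
```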
37
Gfarm/BlobSeer: Key design
Time-based access scheme
38
Gfarm/BlobSeer: Key design
Consistency semantics: there are three cases
Read-Read: nothing must be taken into consideration; caches and replicas are welcome.
Write-Write: write operations can concurrently generate new BLOB versions. Caches are fine, but only one replica may be written concurrently.
Read-Write: two sub-cases. (1) Reading a specific version: caches and replicas are fine. (2) Reading with the live up-to-date flag: caches are disabled on all writers and only on readers that use the live up-to-date flag; all replicas are disabled.
Versioning granularity: handled per file, and based on extended attributes
39
Gfarm/BlobSeer: Key design
Replication technique: rely on BlobSeer replication rather than on Gfarm replication
Versioning awareness: efficient, decoupled version management in order to save space and management workload. Versioned data lives on BlobSeer; for versioned metadata:
inodes and the directory structure on the metadata server; infrequently changing file attributes on the metadata server; file attributes associated with each single version on BlobSeer
40
Gfarm/BlobSeer: Implementation
Versioned directory structure: work on the Gfarm metadata server
The directory structure is reorganized as a multi-version B-tree
[1] B. Becker, S. Gschwind, T. Ohler, B. Seeger, and P. Widmayer, "An asymptotically optimal multiversion B-tree," The VLDB Journal, vol. 5, 1996, pp. 264-275.
[2] C. Soules, G. Goodson, J. Strunk, and G. Ganger, "Metadata efficiency in versioning file systems," Proceedings of the 2nd USENIX Conference on File and Storage Technologies, USENIX Association, 2003, p. 58.
[3] T. Haapasalo and I. Jaluta, "Transactions on the Multiversion B + -Tree," Technology, 2009.
41
Gfarm/BlobSeer: Implementation
Multi-version B-tree
Given a timestamp, a multi-version B-tree can efficiently return the entries that existed at that time (illustrated below)
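Not a real multi-version B-tree, just the lifetime-interval semantics it indexes: each entry records when it was created and deleted, and a time query returns the entries alive at that instant. The linear scan below answers the same query that an MVBT answers in logarithmic time; the data is hypothetical.

```python
import math

# Each directory entry records its lifetime as [t_create, t_delete).
entries = [
    ("a.txt", 10, 25),          # created at t=10, deleted at t=25
    ("b.txt", 12, math.inf),    # still alive
    ("c.txt", 30, math.inf),
]

def list_directory_at(t):
    """Return the entries that existed at time t."""
    return [name for name, t_create, t_delete in entries if t_create <= t < t_delete]

print(list_directory_at(20))  # ['a.txt', 'b.txt']
print(list_directory_at(40))  # ['b.txt', 'c.txt']
```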
42
Gfarm/BlobSeer: Implementation
Infrequently updated metadata will be handled with a journal-based approach?
Map versioned file attributes to versioned object attributes on BlobSeer: modify the BlobSeer version manager to handle object attributes, or distribute them somewhere on the DHT
43
Gfarm/BlobSeer: Implementation
Time-based access to BLOBs
Assign a creation time, together with the version number, to each newly created BLOB version
Given a pair {BLOB ID, timestamp}, the version manager can map it to a pair {BLOB ID, version number} (sketched below)
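A minimal sketch of that mapping, assuming each BLOB's versions are kept sorted by creation time so a binary search returns the newest version created at or before the timestamp; the data is hypothetical.

```python
from bisect import bisect_right

# BLOB ID -> list of (creation_time, version_number), sorted by creation_time
versions = {
    42: [(100.0, 1), (150.0, 2), (230.0, 3)],
}

def version_at(blob_id: int, timestamp: float) -> int:
    """Map {BLOB ID, timestamp} to the newest version created at or before timestamp."""
    times = [t for t, _ in versions[blob_id]]
    idx = bisect_right(times, timestamp) - 1
    if idx < 0:
        raise ValueError("no version existed at that time")
    return versions[blob_id][idx][1]

print(version_at(42, 200.0))  # 2
```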
Replication policy: cluster awareness in BlobSeer in order to improve access performance for each replica
44
Conclusion
What can I work on?
Versioning access control; multi-version B-tree; time-based access to BlobSeer; cluster awareness in BlobSeer; clock synchronization; some more recent papers on versioning file systems (found but not yet read)
Potential papers could be: security control in a versioning file system; Gfarm/BlobSeer: a versioning distributed file system; scalable dynamic distributed version/BLOB management
45
THANK YOU FOR YOUR ATTENTION!