Experiences on File Systems
Which is the best file system for you?

Jakob Blomer, CERN PH/SFT
CHEP 2015, Okinawa, Japan
Why Distributed File Systems?

Physics experiments store their files in a variety of file systems, for a good reason:

∙ The file system interface is portable: we can take our local analysis application and run it anywhere on a big data set
∙ The file system as a storage abstraction is a sweet spot between data flexibility and data organization

We like file systems for their rich and standardized interface.
We struggle to find an optimal implementation of that interface.
Shortlist of Distributed File Systems

[Logo collage; the named systems include the Quantcast File System and CernVM-FS]
Agenda

1 What do we want from a distributed file system?
2 Sorting and searching the file system landscape
3 Technology trends and future challenges
Can one size fit all?

[Radar chart, illustrative data: the axes are mean file size, change frequency, throughput (MB/s), throughput (IOPS), volume, cache hit rate, redundancy, confidentiality, and data value]

Data classes:
∙ Home folders
∙ Physics data: recorded, simulated, analysis results
∙ Software binaries
∙ Scratch area

Depending on the use case, the dimensions span orders of magnitude.
POSIX Interface

∙ No DFS is fully POSIX compliant
∙ It must provide just enough to not break applications
∙ Often this can only be discovered by testing

Missing APIs: physical file location, file replication properties, file temperature, . . .

Essential file system operations: create(), unlink(), stat(), open(), close(), read(), write(), seek()

Difficult for DFSs: file locks, write-through, atomic rename(), file ownership, extended attributes, unlinking opened files, symbolic and hard links, device files, IPC files
Application-Defined File Systems?

Mounted file system:

    FILE *f = fopen("susy.dat", "r");
    while (...) {
        fread(...);
        ...
    }
    fclose(f);

∙ Application independent from the file system
∙ Allows for standard tools (ls, grep, . . . )
∙ The system administrator selects the file system

File system library:

    hdfsFS fs = hdfsConnect("default", 0);
    hdfsFile f = hdfsOpenFile(fs, "susy.dat", ...);
    while (...) {
        hdfsRead(fs, f, ...);
        ...
    }
    hdfsCloseFile(fs, f);

∙ Performance-tuned API
∙ Requires code changes
∙ The application selects the file system

What we ideally want is an application-defined, mountable file system: Fuse, Parrot
Agenda

1 What do we want from a distributed file system?
2 Sorting and searching the file system landscape
3 Technology trends and future challenges
Distributed File Systems

Supercomputers: Lustre, GPFS, OrangeFS, BeeGFS, Panasas
∙ Fast parallel writes
∙ InfiniBand, Myrinet, . . .

Shared disk: GFS2, OCFS2
∙ High level of POSIX compliance

General purpose: (p)NFS, GlusterFS, Ceph, XtreemFS, MooseFS
∙ Incremental scalability
∙ Ease of administration

Personal files: AFS, dropbox/owncloud, Tahoe-LAFS
∙ Privacy
∙ Sharing
∙ Sync

Big Data: HDFS, QFS, MapR FS
∙ MapReduce workflows
∙ Commodity hardware

HEP: dCache, XRootD, CernVM-FS, EOS
∙ Tape access
∙ WAN federation
∙ Software distribution
∙ Fault tolerance

[In the original diagram, a number of these systems are additionally marked as “Used in HEP”]
File System Architecture

Object-based file system

[Diagram: clients send create()/delete() to a meta-data head node and read()/write() directly to the data nodes]

Target: incremental scaling, large & immutable files. Typical for Big Data applications.
Examples: Hadoop File System, Quantcast File System
The head node can help in job scheduling.
File System Architecture

Parallel file system

[Diagram: meta-data servers handle create()/delete(); clients stripe read()/write() across the data servers]

Target: maximum aggregated throughput, large files. Typical for High-Performance Computing.
Examples: Lustre, MooseFS, pNFS, XtreemFS
File System Architecture

Distributed meta-data

[Diagram: meta-data is spread across the server nodes; no dedicated head node handles create()/delete()]

Target: avoid a single point of failure and the meta-data bottleneck. Modern general-purpose distributed file systems.
Examples: Ceph, OrangeFS
File System Architecture

Symmetric, peer-to-peer

Distributed hash table: the hosts serving path_n are determined by hash(path_n)

Target: conceptual simplicity, inherently scalable
Examples: GlusterFS

Difficult to deal with node churn; lookup is slow beyond the LAN.
In HEP we use caching and catalog-based data management instead.
Agenda

1 What do we want from a distributed file system?
2 Sorting and searching the file system landscape
3 Technology trends and future challenges
Trends and Challenges

We are lucky: large data sets tend to be immutable everywhere. For instance: media, backups, VM images, scientific data sets, . . .
∙ Reflected in hardware: shingled magnetic recording drives
∙ Reflected in software: log-structured space management

We need to invest in scaling fault tolerance and speed together with the capacity:
∙ Replication becomes too expensive at the Petabyte and Exabyte scale → erasure codes
∙ Explicit use of SSDs: for meta-data (high IOPS requirements), as a fast storage pool, and as a node-local cache (the Amazon Elastic File System will be on SSD)
Log-Structured Data

Idea: store all modifications in a change log.

[Diagram: a classic Unix file system updates file inodes and directory indexes in place on disk, while a log-structured file system appends file data, inodes, and directory indexes to a single log]

Used by:
∙ the Zebra experimental DFS
∙ commercial filers (e. g. NetApp)
∙ key-value and BLOB stores
∙ file systems for flash memory

Advantages:
∙ Minimal seeking
∙ Fast and robust crash recovery
∙ Efficient allocation in DRAM, flash, and on disk
∙ Ideal for immutable data
Bandwidth Lags Behind Capacity

[Chart: relative improvement (×1 to ×1000) of HDD, DRAM, SSD, and Ethernet from 1995 to 2015]

∙ Bandwidth ≪ capacity: capacity and bandwidth of affordable storage scale at different paces
∙ It will be prohibitively costly to constantly “move data in and out”
∙ Ethernet bandwidth scaled similarly to capacity

We should use the high bisection bandwidth among worker nodes and make them part of the storage network.
Fault Tolerance at the Petascale

How to prevent data loss:
1 Replication: simple, but large storage overhead (similar to RAID 1)
2 Erasure codes: based on parity blocks (similar to RAID 5); all data and parity blocks are distributed; requires extra compute power

Available commercially: GPFS, NEC Hydrastor, Scality RING, . . .
Emerging in open-source file systems: EOS, Ceph, QFS

Engineering challenges:
∙ Fault detection
∙ De-correlation of failure domains instead of random placement
∙ Failure prediction, e. g. based on MTTF and Markov models
Fault Tolerance at the Petascale
Example from Facebook
[Excerpt from the f4 paper shown on the slide: storage-less, CPU-heavy rebuilder nodes detect failures by probing and reconstruct lost blocks in the background, throttling themselves to protect online requests; coordinator nodes schedule rebuilds and keep a stripe’s blocks on distinct racks, the largest failure domain. Geo-replicated XOR coding stores the XOR of blocks from volumes in two datacenters in a third datacenter, cutting the effective replication factor from 2.8 (double-replicated cells) to 2.1 = (1.4 × 2 + 1.4) / 2; after a datacenter failure, a BLOB is rebuilt from the local XOR block and the remote buddy block.]
Replication factor 2.1: erasure coding within every data center plus cross-data-center XOR encoding.
Muralidhar et al. (2014)
Towards a Better Implementation of File Systems

Workflows → distributed file systems:
∙ Integration with applications
∙ Feature-rich: quotas, permissions, links
∙ Globally federated namespace

Data provisioning → object stores:
∙ Very good scalability on local networks
∙ Flat namespace
∙ It turns out that certain types of data do not need a hierarchical namespace, e. g. cached objects, media, VM images

Both are building blocks for tailored storage systems.
Conclusion

File systems are at the heart of our distributed computing:
∙ the most convenient way of connecting our applications to data
∙ the hierarchical namespace is a natural way to organize data

We can tailor file systems to our needs using existing building blocks, with a focus on:
1 Wide-area data federation
2 A multi-level storage hierarchy from tape to memory
3 Erasure coding
4 Coalescing storage and compute nodes
Thank you for your time!
Backup Slides
Milestones in Distributed File Systems
Biased towards open-source, production file systems

1983 AFS · 1985 NFS · 1995 Zebra · 2000 OceanStore · 2002 Venti · 2003 GFS · 2005 XRootD · 2007 Ceph
1983: AFS, the Andrew File System (client-server)
∙ Roaming home folders
∙ Identity tokens and access control lists (ACLs)
∙ Decentralized operation (“Cells”)

“AFS was the first safe and efficient distributed computing system, available [. . . ] on campus. It was a clear precursor to the Dropbox-like software packages today. [. . . ] [It] allowed students (like Drew Houston and Arash Ferdowsi) access to all their stuff from any connected computer.”
http://www.wired.com/2011/12/backdrop-dropbox
1985: NFS, the Network File System (client-server)
∙ Focus on portability
∙ Separation of protocol and implementation
∙ Stateless servers
∙ Fast crash recovery

Sandberg, Goldberg, Kleiman, Walsh, Lyon (1985)
1995: Zebra (parallel)
∙ Striping and parity
∙ Redundant array of inexpensive nodes (RAIN)
∙ Log-structured data

[Excerpt from the paper shown on the slide: Zebra applies the log-structured approach at the client–server interface (“per-client striping”): each client batches its new file data into an append-only log, stripes that log across the storage servers, and computes parity for the log rather than for individual files, so small writes are batched into large transfers and parity never needs to be updated in place. A central file manager handles metadata, and a stripe cleaner reclaims free space from the logs.]
Hartman, Ousterhout (1995)
2000: OceanStore (peer-to-peer)
∙ “Global Scale”: 10^10 users, 10^14 files
∙ Untrusted infrastructure
∙ Based on a peer-to-peer overlay network
∙ Nomadic data through aggressive caching
∙ Foundation for today’s decentralized dropbox replacements

[Excerpt from the paper shown on the slide: OceanStore assumes a fundamentally untrusted infrastructure (only clients see cleartext; everything entering the system is encrypted) and supports nomadic data through “promiscuous caching”: data can be cached anywhere, anytime. Objects are modified through versioned updates, and archival versions are erasure-coded across hundreds or thousands of servers (“deep archival storage”).]
Kubiatowicz et al. (2000)
2002: Venti (archival storage)
∙ De-duplication through content-addressable storage
∙ Content hashes provide intrinsic file integrity
∙ Merkle trees verify the file system hierarchy
overlapped and do not benefit from the striping of the index. One possible solution is a form of read-ahead. When reading a block from the data log, it is feasible to also read several following blocks. These extra blocks can be added to the caches without referencing the index. If blocks are read in the same order they were written to the log, the latency of uncached index lookups will be avoided. This strategy should work well for streaming data such as multimedia files.
The basic assumption in Venti is that the growth in capacity of disks combined with the removal of duplicate blocks and compression of their contents enables a model in which it is not necessary to reclaim space by deleting archival data. To demonstrate why we believe this model is practical, we present some statistics derived from a decade’s use of the Plan 9 file system.
The computing environment in which we work includes two Plan 9 file servers named bootes and emelie. Bootes was our primary file repository from 1990 until 1997 at which point it was superseded by emelie. Over the life of these two file servers there have been 522 user accounts of which between 50 and 100 were active at any given time. The file servers have hosted numerous development projects and also contain several large data sets including chess end games, astronomical data, satellite imagery, and multimedia files.
Figure 6 depicts the size of the active file system as measured over time by du, the space consumed on the jukebox, and the size of the jukebox’s data if it were to be stored on Venti. The ratio of the size of the archival data and the active file system is also given. As can be seen, even without using Venti, the storage required to implement the daily snapshots in Plan 9 is relatively modest, a result of the block level incremental approach to generating a snapshot. When the archival data is stored to Venti the cost of retaining the snapshots is reduced significantly. In the case of the emelie file system, the size on Venti is only slightly larger than the active file system; the cost of retaining the daily snapshots is almost zero. Note that the amount of storage that Venti uses for the snapshots would be the same even if more conventional methods were used to back up the file system. The Plan 9 approach to snapshots is not a necessity, since Venti will remove duplicate blocks.
When stored on Venti, the size of the jukebox data is reduced by three factors: elimination of duplicate blocks, elimination of block fragmentation, and compression of the block contents. Table 2 presents the percent reduction for each of these factors. Note, bootes uses a 6 Kbyte block size while emelie uses 16 Kbyte, so the effect of removing fragmentation is more significant on emelie.
[Figure 6: Graphs of the various sizes of two Plan 9 file servers. Panels show the Bootes (1990–1998) and Emelie (1997–2001) storage size in GB over time for the jukebox, Venti, and the active file system, plus the ratio of archival to active data (Jukebox/Active, Venti/Active).]
Quinlan and Dorward (2002)
25 / 22
Milestones in Distributed File Systems
Biased towards open-source, production file systems
2003: GFS

Google File System: object-based
∙ Co-designed for map-reduce
∙ Coalesce storage and compute nodes
∙ Serialize data access
Google AppEngine documentation
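The map-reduce co-design can be illustrated with a toy word count (plain Python, no GFS involved; the chunk contents are made up). In GFS the map tasks would run on the nodes that store the chunks, which is what "coalesce storage and compute nodes" means.

```python
from collections import defaultdict

# Three file chunks, conceptually one per storage node.
chunks = ["a b a", "b c", "a c c"]

def map_phase(chunk):
    """Runs locally on the node holding the chunk: emit (key, value) pairs."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Aggregate all intermediate pairs by key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

intermediate = [kv for chunk in chunks for kv in map_phase(chunk)]
print(reduce_phase(intermediate))   # {'a': 3, 'b': 2, 'c': 3}
```

Because each map task reads only its local chunk sequentially, the workload matches GFS's strengths: large streaming reads, serialized access, no small random I/O.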
25 / 22
Milestones in Distributed File Systems
Biased towards open-source, production file systems
2005: XRootD

XRootD: namespace delegation
∙ Global tree of redirectors
∙ Flexibility through decomposition into pluggable components
∙ Namespace independent from data access
[Diagram: a client’s open() is redirected through a tree of cmsd/xrootd redirector pairs down to a data server that serves the open(); with a fan-out of 64, one level of redirection reaches 64¹ = 64 servers, two levels reach 64² = 4096.]
Hanushevsky (2013)
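The redirection idea can be sketched as a tree walk. This is a hypothetical model, not the XRootD protocol: the class names and the lookup logic are invented, and real redirectors answer from the cmsd's cached knowledge rather than by recursing on every open().

```python
# A redirector either delegates open() one level down or lands on a
# data server that actually holds the file; the namespace is resolved
# by the tree walk, independent of how the data is later accessed.
class DataServer:
    def __init__(self, name, files):
        self.name, self.files = name, set(files)

class Redirector:
    def __init__(self, name, children):
        self.name, self.children = name, children   # redirectors or data servers

    def open(self, path, hops=0):
        for child in self.children:
            if isinstance(child, DataServer):
                if path in child.files:
                    return child.name, hops          # served here
            else:
                found = child.open(path, hops + 1)   # redirect one level down
                if found:
                    return found
        return None                                  # not in this subtree

root = Redirector("root", [
    Redirector("eu", [DataServer("eu-ds1", {"/store/run1.root"})]),
    Redirector("us", [DataServer("us-ds1", {"/store/run2.root"})]),
])
server, hops = root.open("/store/run2.root")
assert server == "us-ds1" and hops == 1
```

With a per-redirector fan-out of 64 instead of 2, two such levels already address 64² = 4096 data servers, which is the scaling argument on the slide.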
25 / 22
Milestones in Distributed File Systems
Biased towards open-source, production file systems
2007: Ceph

Ceph File System and RADOS: parallel, distributed meta-data
∙ Peer-to-peer file system at the cluster scale
∙ Data placement across failure domains
∙ Adaptive workload distribution
[Figure 5.1: A partial view of a four-level cluster map hierarchy consisting of rows, cabinets, and shelves of disks. Bold lines illustrate items selected by each select operation in the placement rule and fictitious mapping described by Table 5.1: choose(1,row) → row2, choose(3,cabinet) → cab21, cab23, cab24, choose(1,disk).]
Action             Resulting i⃗
take(root)         root
select(1,row)      row2
select(3,cabinet)  cab21 cab23 cab24
select(1,disk)     disk2107 disk2313 disk2437
emit

Table 5.1: A simple rule that distributes three replicas across three cabinets in the same row.
descends through any intermediate buckets, pseudo-randomly selecting a nested item in each bucket using the function c(r,x) (defined for each kind of bucket in Section 5.2.4), until it finds an item of the requested type t. The resulting n|i⃗| distinct items are placed back into the input i⃗ and either form the input for a subsequent select(n,t) or are moved into the result vector with an emit operation.

As an example, the rule defined in Table 5.1 begins at the root of the hierarchy in Figure 5.1 and with the first select(1,row) chooses a single bucket of type “row” (it selects row2).
Weil (2007)
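The take/select/emit rule can be sketched as follows. This is a simplified stand-in, not CRUSH itself: the cluster map, the hash-based choice function c(r,x), and the collision handling are all invented for the example, whereas real CRUSH has several bucket types with their own c(r,x) definitions.

```python
import hashlib

# Hypothetical cluster map mirroring Figure 5.1: rows of cabinets of disks.
CLUSTER = {
    "type": "root", "id": "root",
    "children": [
        {"type": "row", "id": f"row{r}", "children": [
            {"type": "cabinet", "id": f"cab{r}{c}", "children": [
                {"type": "disk", "id": f"disk{r}{c}{d}", "children": []}
                for d in range(4)]}
            for c in range(1, 5)]}
        for r in range(1, 5)],
}

def c(r, x, bucket):
    """Deterministic pseudo-random choice, standing in for CRUSH's c(r,x)."""
    digest = hashlib.sha256(f"{x}:{r}:{bucket['id']}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def select(n, t, inputs, x):
    """For each input bucket, descend pseudo-randomly until n distinct
    items of type t are found, retrying with a new rank on collisions."""
    out = []
    for bucket in inputs:
        for r in range(n):
            attempt = 0
            while True:
                node, rank = bucket, r + n * attempt
                while node["type"] != t:                  # descend the hierarchy
                    kids = node["children"]
                    node = kids[c(rank, x, node) % len(kids)]
                if node["id"] not in {o["id"] for o in out}:
                    out.append(node)
                    break
                attempt += 1                              # collision: reselect
    return out

def place(x):
    """take(root); select(1,row); select(3,cabinet); select(1,disk); emit."""
    i = [CLUSTER]
    for n, t in [(1, "row"), (3, "cabinet"), (1, "disk")]:
        i = select(n, t, i, x)
    return i

replicas = [d["id"] for d in place("objectA")]
# three disks in three distinct cabinets of a single row, chosen
# deterministically from the object name alone -- no central lookup table
```

Because placement is a pure function of the object name and the map, any client can compute where the replicas live, which is what makes the peer-to-peer, metadata-light design work.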
25 / 22
Hardware Bandwidth and Capacity Numbers
Method and entries marked † from Patterson (2004)
Year      | Hard Disk Drives         | SSD (MLC Flash)             | DRAM                         | Ethernet
          | Capacity   Bandwidth     | Capacity  Bandwidth (r/w)   | Capacity        Bandwidth    | Bandwidth
1993      |                          |                             | 16 Mibit/chip†  267 MiB/s†   |
1994      | 4.3 GB†    9 MB/s†       |                             |                              |
1995      |                          |                             |                              | 100 Mbit/s†
2003      | 73.4 GB†   86 MB/s†      |                             |                              | 10 Gbit/s†
2004      |                          |                             | 512 Mibit/chip  3.2 GiB/s    |
2008      |                          | 160 GB    250/100 MB/s [3]  |                              |
2012      |                          | 800 GB    500/460 MB/s [4]  |                              |
2014      | 6 TB       220 MB/s [1]  | 2 TB      2.8/1.9 GB/s [5]  | 8 Gibit/chip    25.6 GiB/s   | 100 Gbit/s
2015      | 8 TB       195 MB/s [2]  |                             |                              |
Increase  | ×1860      ×24           | ×12.5     ×11.2             | ×512            ×98          | ×1000
[1] http://www.storagereview.com/seagate_enterprise_capacity_6tb_35_sas_hdd_review_v4
[2] http://www.storagereview.com/seagate_archive_hdd_review_8tb
[3] http://www.storagereview.com/intel_x25-m_ssd_review
[4] http://www.storagereview.com/intel_ssd_dc_s3700_series_enterprise_ssd_review
[5] http://www.tomshardware.com/reviews/intel-ssd-dc-p3700-nvme,3858.html
HDD: Seagate ST15150 (1994)†, Seagate 373453 (2004)†, Seagate ST6000NM0034 (2014), Seagate ST8000AS0012 [SMR] (2015)
SSD: Intel X25-M (2008), Intel SSD DC S3700 (2012), Intel SSD DC P3700 (2014)
DRAM: Fast Page DRAM (1993)†, DDR2-400 (2004), DDR4-3200 (2014)
Ethernet: Fast Ethernet IEEE 802.3u (1995)†, 10 GbitE IEEE 802.3ae (2003)†, 100 GbitE IEEE 802.3bj (2014)
26 / 22
Data Integrity and File System Snapshots
Root (/): contents A, B; hash(hA, hB)
  /A: contents 𝛼, 𝛽; hash(h𝛼, h𝛽) =: hA
    /A/𝛼: contents data𝛼; hash(data𝛼) =: h𝛼
    /A/𝛽: contents data𝛽; hash(data𝛽) =: h𝛽
  /B: contents ∅; hash(∅) =: hB
Merkle tree
∙ Hash tree with a cryptographic hash function provides a secure identifier for each subtree
∙ It is easy to sign a small hash value (data authenticity)
∙ Efficient calculation of changes (fast replication)
∙ Bonus: versioning and data de-duplication
∙ Full potential together with content-addressable storage
∙ Self-verifying data chunks, trivial to distribute and cache
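The bullets above can be demonstrated in a few lines. This is a toy model with an arbitrary serialization (sorted name:hash pairs, SHA-256), not any particular file system's on-disk format.

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def merkle(node) -> str:
    """Hash of a subtree: a file hashes its data, a directory hashes its
    sorted (name, child-hash) entries. Identical subtrees hash identically,
    which gives de-duplication and cheap change detection."""
    if isinstance(node, bytes):                       # a file
        return h(node)
    entries = sorted((name, merkle(child)) for name, child in node.items())
    return h("".join(f"{n}:{c}" for n, c in entries).encode())

# Toy tree matching the slide: /A holds two files, /B is empty.
tree = {"A": {"alpha": b"data_alpha", "beta": b"data_beta"}, "B": {}}
root_before = merkle(tree)

tree["A"]["alpha"] = b"tampered"          # change a single file ...
assert merkle(tree) != root_before        # ... and the root hash changes
assert merkle(tree["B"]) == merkle({})    # an unchanged subtree keeps its hash
```

Signing only the small root hash authenticates the whole hierarchy, and comparing subtree hashes finds exactly the changed directories, which is what makes replication fast.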
27 / 22
Data Integrity and File System Snapshots
hash(hA, hB): contents hA [A], hB [B]
  hash(h𝛼, h𝛽) =: hA: contents h𝛼 [𝛼], h𝛽 [𝛽]
    hash(data𝛼) =: h𝛼: contents data𝛼
    hash(data𝛽) =: h𝛽: contents data𝛽
  hash(∅) =: hB: contents ∅
Merkle tree ⊕ Content-addressable storage
27 / 22
System Integration
System-aided
∙ Kernel-level file system: fast, but intrusive
∙ Virtual NFS server: easy deployment, but limited by NFS semantics
Unprivileged options (mostly)
∙ Library (e.g. libhdfs, libXrd, ...): tuned API, but not transparent
∙ Interposition systems:
  ∙ Pre-loaded libraries
  ∙ ptrace based
  ∙ FUSE (“file system in user space”)
[Diagram: the application’s open() goes through libc as a system call into the OS kernel, which forwards it to the user-space file system via a FUSE upcall.]
28 / 22
Bibliography I
Survey Articles
Satyanarayanan, M. (1990). A survey of distributed file systems. Annual Review of Computer Science, 4(1):73–104.

Guan, P., Kuhl, M., Li, Z., and Liu, X. (2000). A survey of distributed file systems. University of California, San Diego.

Thanh, T. D., Mohan, S., Choi, E., Kim, S., and Kim, P. (2008). A taxonomy and survey on distributed file systems. In Proc. int. conf. on Networked Computing and Advanced Information Management (NCM’08), pages 144–149.

Peddemors, A., Kuun, C., Spoor, R., Dekkers, P., and den Besten, C. (2010). Survey of technologies for wide area distributed storage. Technical report, SURFnet.

Depardon, B., Séguin, C., and Mahec, G. L. (2013). Analysis of six distributed file systems. Technical Report hal-00789086, Université de Picardie Jules Verne.
29 / 22
Bibliography II
Donvito, G., Marzulli, G., and Diacono, D. (2014). Testing of several distributed file-systems (HDFS, Ceph and GlusterFS) for supporting the HEP experiment analysis. Journal of Physics: Conference Series, 513.
File Systems
Sandberg, R., Goldberg, D., Kleiman, S., Walsh, D., and Lyon, B. (1985). Design and implementation of the Sun network filesystem. In Proc. of the Summer USENIX conference, pages 119–130.

Morris, J. H., Satyanarayanan, M., Conner, M. H., Howard, J. H., Rosenthal, D. S. H., and Smith, F. D. (1986). Andrew: A distributed personal computing environment. Communications of the ACM, 29(3):184–201.

Hartman, J. H. and Osterhout, J. K. (1995). The Zebra striped network file system. ACM Transactions on Computer Systems, 13(3):274–310.
30 / 22
Bibliography III
Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Weimer, W., Wells, C., and Zhao, B. (2000). OceanStore: An architecture for global-scale persistent storage. ACM SIGPLAN Notices, 35(11):190–201.

Quinlan, S. and Dorward, S. (2002). Venti: a new approach to archival storage. In Proc. of the 1st USENIX Conf. on File and Storage Technologies (FAST’02), pages 89–102.

Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). The Google file system. ACM SIGOPS Operating Systems Review, 37(5):29–43.

Schwan, P. (2003). Lustre: Building a file system for 1,000-node clusters. In Proc. of the 2003 Linux Symposium, pages 380–386.

Dorigo, A., Elmer, P., Furano, F., and Hanushevsky, A. (2005). XROOTD - a highly scalable architecture for data access. WSEAS Transactions on Computers, 4(4):348–353.
31 / 22
Bibliography IV
Weil, S. A. (2007). Ceph: reliable, scalable, and high-performance distributed storage. PhD thesis, University of California Santa Cruz.
32 / 22