Experiences on File Systems
Which is the best file system for you?

Jakob Blomer, CERN PH/SFT
CHEP 2015, Okinawa, Japan
Why Distributed File Systems?

Physics experiments store their files in a variety of file systems, for a good reason:

∙ The file system interface is portable: we can take our local analysis application and run it anywhere on a big data set
∙ The file system as a storage abstraction is a sweet spot between data flexibility and data organization

We like file systems for their rich and standardized interface.
We struggle to find an optimal implementation of that interface.
Shortlist of Distributed File Systems

[Logo collage; the named systems include the Quantcast File System and CernVM-FS]
Agenda

1 What do we want from a distributed file system?
2 Sorting and searching the file system landscape
3 Technology trends and future challenges
Can one size fit all?

[Radar chart, illustrative data: the axes are mean file size, change frequency, throughput (MB/s), throughput (IOPS), volume, cache hit rate, redundancy, confidentiality, and data value]

Data classes:
∙ Home folders
∙ Physics data: recorded, simulated, analysis results
∙ Software binaries
∙ Scratch area

Depending on the use case, the dimensions span orders of magnitude.
POSIX Interface

∙ No DFS is fully POSIX compliant
∙ It must provide just enough to not break applications
∙ Often this can only be discovered by testing

Missing APIs: physical file location, file replication properties, file temperature, . . .

Essential file system operations: create(), unlink(), stat(), open(), close(), read(), write(), seek()

Difficult for DFSs: file locks, write-through, atomic rename(), file ownership, extended attributes, unlinking opened files, symbolic and hard links, device files, IPC files
Application-Defined File Systems?

Mounted file system:

    FILE *f = fopen("susy.dat", "r");
    while (...) {
        fread(...);
        ...
    }
    fclose(f);

∙ Application independent from the file system
∙ Allows for standard tools (ls, grep, . . . )
∙ The system administrator selects the file system

File system library:

    hdfsFS fs = hdfsConnect("default", 0);
    hdfsFile f = hdfsOpenFile(fs, "susy.dat", ...);
    while (...) {
        hdfsRead(fs, f, ...);
        ...
    }
    hdfsCloseFile(fs, f);

∙ Performance-tuned API
∙ Requires code changes
∙ The application selects the file system

What we ideally want is an application-defined, mountable file system: Fuse, Parrot
Agenda

1 What do we want from a distributed file system?
2 Sorting and searching the file system landscape
3 Technology trends and future challenges
Distributed File Systems

Supercomputers: Lustre, GPFS, OrangeFS, BeeGFS, Panasas
∙ Fast parallel writes
∙ InfiniBand, Myrinet, . . .

Shared disk: GFS2, OCFS2
∙ High level of POSIX compliance

General purpose: (p)NFS, GlusterFS, Ceph, XtreemFS, MooseFS
∙ Incremental scalability
∙ Ease of administration

Personal files: AFS, dropbox/owncloud, Tahoe-LAFS
∙ Privacy
∙ Sharing
∙ Sync

Big Data: HDFS, QFS, MapR FS
∙ MapReduce workflows
∙ Commodity hardware

HEP: dCache, XRootD, CernVM-FS, EOS
∙ Tape access
∙ WAN federation
∙ Software distribution
∙ Fault tolerance

[In the original diagram, a number of these systems are additionally marked as “Used in HEP”]
File System Architecture

Object-based file system

[Diagram: clients send create()/delete() to a meta-data head node and read()/write() directly to the data nodes]

Target: incremental scaling, large & immutable files. Typical for Big Data applications.
Examples: Hadoop File System, Quantcast File System
The head node can help in job scheduling.
File System Architecture

Parallel file system

[Diagram: meta-data servers handle create()/delete(); clients stripe read()/write() across the data servers]

Target: maximum aggregated throughput, large files. Typical for High-Performance Computing.
Examples: Lustre, MooseFS, pNFS, XtreemFS
File System Architecture

Distributed meta-data

[Diagram: meta-data is spread across the server nodes; no dedicated head node handles create()/delete()]

Target: avoid a single point of failure and the meta-data bottleneck. Modern general-purpose distributed file systems.
Examples: Ceph, OrangeFS
File System Architecture

Symmetric, peer-to-peer

Distributed hash table: the hosts serving path_n are determined by hash(path_n)

Target: conceptual simplicity, inherently scalable
Examples: GlusterFS

Difficult to deal with node churn; lookup is slow beyond the LAN.
In HEP we use caching and catalog-based data management instead.
Agenda

1 What do we want from a distributed file system?
2 Sorting and searching the file system landscape
3 Technology trends and future challenges
Trends and Challenges

We are lucky: large data sets tend to be immutable everywhere. For instance: media, backups, VM images, scientific data sets, . . .
∙ Reflected in hardware: shingled magnetic recording drives
∙ Reflected in software: log-structured space management

We need to invest in scaling fault tolerance and speed together with the capacity:
∙ Replication becomes too expensive at the Petabyte and Exabyte scale → erasure codes
∙ Explicit use of SSDs: for meta-data (high IOPS requirements), as a fast storage pool, and as a node-local cache (the Amazon Elastic File System will be on SSD)
Log-Structured Data

Idea: store all modifications in a change log.

[Diagram: a classic Unix file system updates file inodes and directory indexes in place on disk, while a log-structured file system appends file data, inodes, and directory indexes to a single log]

Used by:
∙ the Zebra experimental DFS
∙ commercial filers (e. g. NetApp)
∙ key-value and BLOB stores
∙ file systems for flash memory

Advantages:
∙ Minimal seeking
∙ Fast and robust crash recovery
∙ Efficient allocation in DRAM, flash, and on disk
∙ Ideal for immutable data
Bandwidth Lags Behind Capacity

[Chart: relative improvement (×1 to ×1000) of HDD, DRAM, SSD, and Ethernet from 1995 to 2015]

∙ Bandwidth ≪ capacity: capacity and bandwidth of affordable storage scale at different paces
∙ It will be prohibitively costly to constantly “move data in and out”
∙ Ethernet bandwidth scaled similarly to capacity

We should use the high bisection bandwidth among worker nodes and make them part of the storage network.
Fault Tolerance at the Petascale

How to prevent data loss:
1 Replication: simple, but large storage overhead (similar to RAID 1)
2 Erasure codes: based on parity blocks (similar to RAID 5); all data and parity blocks are distributed; requires extra compute power

Available commercially: GPFS, NEC Hydrastor, Scality RING, . . .
Emerging in open-source file systems: EOS, Ceph, QFS

Engineering challenges:
∙ Fault detection
∙ De-correlation of failure domains instead of random placement
∙ Failure prediction, e. g. based on MTTF and Markov models
Fault Tolerance at the Petascale
Example from Facebook
[Excerpt from the f4 paper shown on the slide: storage-less, CPU-heavy rebuilder nodes detect failures by probing and reconstruct lost blocks in the background, throttling themselves to protect online requests; coordinator nodes schedule rebuilds and keep a stripe’s blocks on distinct racks, the largest failure domain. Geo-replicated XOR coding stores the XOR of blocks from volumes in two datacenters in a third datacenter, cutting the effective replication factor from 2.8 (double-replicated cells) to 2.1 = (1.4 × 2 + 1.4) / 2; after a datacenter failure, a BLOB is rebuilt from the local XOR block and the remote buddy block.]
Replication factor 2.1: erasure coding within every data center plus cross-data-center XOR encoding.
Muralidhar et al. (2014)
Towards a Better Implementation of File Systems

Workflows → distributed file systems:
∙ Integration with applications
∙ Feature-rich: quotas, permissions, links
∙ Globally federated namespace

Data provisioning → object stores:
∙ Very good scalability on local networks
∙ Flat namespace
∙ It turns out that certain types of data do not need a hierarchical namespace, e. g. cached objects, media, VM images

Both are building blocks for tailored storage systems.
Conclusion

File systems are at the heart of our distributed computing:
∙ the most convenient way of connecting our applications to data
∙ the hierarchical namespace is a natural way to organize data

We can tailor file systems to our needs using existing building blocks, with a focus on:
1 Wide-area data federation
2 A multi-level storage hierarchy from tape to memory
3 Erasure coding
4 Coalescing storage and compute nodes
Thank you for your time!
Backup Slides
Milestones in Distributed File Systems
Biased towards open-source, production file systems

1983 AFS · 1985 NFS · 1995 Zebra · 2000 OceanStore · 2002 Venti · 2003 GFS · 2005 XRootD · 2007 Ceph
1983: AFS, the Andrew File System (client-server)
∙ Roaming home folders
∙ Identity tokens and access control lists (ACLs)
∙ Decentralized operation (“Cells”)

“AFS was the first safe and efficient distributed computing system, available [. . . ] on campus. It was a clear precursor to the Dropbox-like software packages today. [. . . ] [It] allowed students (like Drew Houston and Arash Ferdowsi) access to all their stuff from any connected computer.”
http://www.wired.com/2011/12/backdrop-dropbox
1985: NFS, the Network File System (client-server)
∙ Focus on portability
∙ Separation of protocol and implementation
∙ Stateless servers
∙ Fast crash recovery

Sandberg, Goldberg, Kleiman, Walsh, Lyon (1985)
1995: Zebra (parallel)
∙ Striping and parity
∙ Redundant array of inexpensive nodes (RAIN)
∙ Log-structured data

[Excerpt from the paper shown on the slide: Zebra applies the log-structured approach at the client–server interface (“per-client striping”): each client batches its new file data into an append-only log, stripes that log across the storage servers, and computes parity for the log rather than for individual files, so small writes are batched into large transfers and parity never needs to be updated in place. A central file manager handles metadata, and a stripe cleaner reclaims free space from the logs.]
Hartman, Ousterhout (1995)
2000: OceanStore (peer-to-peer)
∙ “Global Scale”: 10^10 users, 10^14 files
∙ Untrusted infrastructure
∙ Based on a peer-to-peer overlay network
∙ Nomadic data through aggressive caching
∙ Foundation for today’s decentralized dropbox replacements

[Excerpt from the paper shown on the slide: OceanStore assumes a fundamentally untrusted infrastructure (only clients see cleartext; everything entering the system is encrypted) and supports nomadic data through “promiscuous caching”: data can be cached anywhere, anytime. Objects are modified through versioned updates, and archival versions are erasure-coded across hundreds or thousands of servers (“deep archival storage”).]
Kubiatowicz et al. (2000)
2002: Venti (archival storage)
∙ De-duplication through content-addressable storage
∙ Content hashes provide intrinsic file integrity
∙ Merkle trees verify the file system hierarchy
overlapped and do not benefit from the striping of the index. One possible solution is a form of read-ahead. When reading a block from the data log, it is feasible to also read several following blocks. These extra blocks can be added to the caches without referencing the index. If blocks are read in the same order they were written to the log, the latency of uncached index lookups will be avoided. This strategy should work well for streaming data such as multimedia files.
The basic assumption in Venti is that the growth in capacity of disks combined with the removal of duplicate blocks and compression of their contents enables a model in which it is not necessary to reclaim space by deleting archival data. To demonstrate why we believe this model is practical, we present some statistics derived from a decade’s use of the Plan 9 file system.
The computing environment in which we work includes two Plan 9 file servers named bootes and emelie. Bootes was our primary file repository from 1990 until 1997 at which point it was superseded by emelie. Over the life of these two file servers there have been 522 user accounts of which between 50 and 100 were active at any given time. The file servers have hosted numerous development projects and also contain several large data sets including chess end games, astronomical data, satellite imagery, and multimedia files.
Figure 6 depicts the size of the active file system as measured over time by du, the space consumed on the jukebox, and the size of the jukebox’s data if it were to be stored on Venti. The ratio of the size of the archival data and the active file system is also given. As can be seen, even without using Venti, the storage required to implement the daily snapshots in Plan 9 is relatively modest, a result of the block level incremental approach to generating a snapshot. When the archival data is stored to Venti the cost of retaining the snapshots is reduced significantly. In the case of the emelie file system, the size on Venti is only slightly larger than the active file system; the cost of retaining the daily snapshots is almost zero. Note that the amount of storage that Venti uses for the snapshots would be the same even if more conventional methods were used to back up the file system. The Plan 9 approach to snapshots is not a necessity, since Venti will remove duplicate blocks.
When stored on Venti, the size of the jukebox data is reduced by three factors: elimination of duplicate blocks, elimination of block fragmentation, and compression of the block contents. Table 2 presents the percent reduction for each of these factors. Note, bootes uses a 6 Kbyte block size while emelie uses 16 Kbyte, so the effect of removing fragmentation is more significant on emelie.
[Figure 6: Graphs of the various sizes of two Plan 9 file servers. Panels show the Bootes (1990–1998) and Emelie (1997–2001) storage size in GB over time for the jukebox, Venti, and the active file system, plus the ratio of archival to active data (Jukebox/Active, Venti/Active).]
Quinlan and Dorward (2002)
25 / 22
Milestones in Distributed File Systems
Biased towards open-source, production file systems
2003: GFS

Google File System: object-based
∙ Co-designed for map-reduce
∙ Coalesce storage and compute nodes
∙ Serialize data access
Google AppEngine documentation
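The map-reduce co-design can be illustrated with a toy word count (plain Python, no GFS involved; the chunk contents are made up). In GFS the map tasks would run on the nodes that store the chunks, which is what "coalesce storage and compute nodes" means.

```python
from collections import defaultdict

# Three file chunks, conceptually one per storage node.
chunks = ["a b a", "b c", "a c c"]

def map_phase(chunk):
    """Runs locally on the node holding the chunk: emit (key, value) pairs."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Aggregate all intermediate pairs by key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

intermediate = [kv for chunk in chunks for kv in map_phase(chunk)]
print(reduce_phase(intermediate))   # {'a': 3, 'b': 2, 'c': 3}
```

Because each map task reads only its local chunk sequentially, the workload matches GFS's strengths: large streaming reads, serialized access, no small random I/O.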
25 / 22
Milestones in Distributed File Systems
Biased towards open-source, production file systems
2005: XRootD

XRootD: namespace delegation
∙ Global tree of redirectors
∙ Flexibility through decomposition into pluggable components
∙ Namespace independent from data access
[Diagram: a client’s open() is redirected through a tree of cmsd/xrootd redirector pairs down to a data server that serves the open(); with a fan-out of 64, one level of redirection reaches 64¹ = 64 servers, two levels reach 64² = 4096.]
Hanushevsky (2013)
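The redirection idea can be sketched as a tree walk. This is a hypothetical model, not the XRootD protocol: the class names and the lookup logic are invented, and real redirectors answer from the cmsd's cached knowledge rather than by recursing on every open().

```python
# A redirector either delegates open() one level down or lands on a
# data server that actually holds the file; the namespace is resolved
# by the tree walk, independent of how the data is later accessed.
class DataServer:
    def __init__(self, name, files):
        self.name, self.files = name, set(files)

class Redirector:
    def __init__(self, name, children):
        self.name, self.children = name, children   # redirectors or data servers

    def open(self, path, hops=0):
        for child in self.children:
            if isinstance(child, DataServer):
                if path in child.files:
                    return child.name, hops          # served here
            else:
                found = child.open(path, hops + 1)   # redirect one level down
                if found:
                    return found
        return None                                  # not in this subtree

root = Redirector("root", [
    Redirector("eu", [DataServer("eu-ds1", {"/store/run1.root"})]),
    Redirector("us", [DataServer("us-ds1", {"/store/run2.root"})]),
])
server, hops = root.open("/store/run2.root")
assert server == "us-ds1" and hops == 1
```

With a per-redirector fan-out of 64 instead of 2, two such levels already address 64² = 4096 data servers, which is the scaling argument on the slide.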
25 / 22
Milestones in Distributed File Systems
Biased towards open-source, production file systems
2007: Ceph

Ceph File System and RADOS: parallel, distributed meta-data
∙ Peer-to-peer file system at the cluster scale
∙ Data placement across failure domains
∙ Adaptive workload distribution
[Figure 5.1: A partial view of a four-level cluster map hierarchy consisting of rows, cabinets, and shelves of disks. Bold lines illustrate items selected by each select operation in the placement rule and fictitious mapping described by Table 5.1: choose(1,row) → row2, choose(3,cabinet) → cab21, cab23, cab24, choose(1,disk).]
Action             Resulting i⃗
take(root)         root
select(1,row)      row2
select(3,cabinet)  cab21 cab23 cab24
select(1,disk)     disk2107 disk2313 disk2437
emit

Table 5.1: A simple rule that distributes three replicas across three cabinets in the same row.
descends through any intermediate buckets, pseudo-randomly selecting a nested item in each bucket using the function c(r,x) (defined for each kind of bucket in Section 5.2.4), until it finds an item of the requested type t. The resulting n|i⃗| distinct items are placed back into the input i⃗ and either form the input for a subsequent select(n,t) or are moved into the result vector with an emit operation.

As an example, the rule defined in Table 5.1 begins at the root of the hierarchy in Figure 5.1 and with the first select(1,row) chooses a single bucket of type “row” (it selects row2).
Weil (2007)
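The take/select/emit rule can be sketched as follows. This is a simplified stand-in, not CRUSH itself: the cluster map, the hash-based choice function c(r,x), and the collision handling are all invented for the example, whereas real CRUSH has several bucket types with their own c(r,x) definitions.

```python
import hashlib

# Hypothetical cluster map mirroring Figure 5.1: rows of cabinets of disks.
CLUSTER = {
    "type": "root", "id": "root",
    "children": [
        {"type": "row", "id": f"row{r}", "children": [
            {"type": "cabinet", "id": f"cab{r}{c}", "children": [
                {"type": "disk", "id": f"disk{r}{c}{d}", "children": []}
                for d in range(4)]}
            for c in range(1, 5)]}
        for r in range(1, 5)],
}

def c(r, x, bucket):
    """Deterministic pseudo-random choice, standing in for CRUSH's c(r,x)."""
    digest = hashlib.sha256(f"{x}:{r}:{bucket['id']}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def select(n, t, inputs, x):
    """For each input bucket, descend pseudo-randomly until n distinct
    items of type t are found, retrying with a new rank on collisions."""
    out = []
    for bucket in inputs:
        for r in range(n):
            attempt = 0
            while True:
                node, rank = bucket, r + n * attempt
                while node["type"] != t:                  # descend the hierarchy
                    kids = node["children"]
                    node = kids[c(rank, x, node) % len(kids)]
                if node["id"] not in {o["id"] for o in out}:
                    out.append(node)
                    break
                attempt += 1                              # collision: reselect
    return out

def place(x):
    """take(root); select(1,row); select(3,cabinet); select(1,disk); emit."""
    i = [CLUSTER]
    for n, t in [(1, "row"), (3, "cabinet"), (1, "disk")]:
        i = select(n, t, i, x)
    return i

replicas = [d["id"] for d in place("objectA")]
# three disks in three distinct cabinets of a single row, chosen
# deterministically from the object name alone -- no central lookup table
```

Because placement is a pure function of the object name and the map, any client can compute where the replicas live, which is what makes the peer-to-peer, metadata-light design work.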
25 / 22
Hardware Bandwidth and Capacity Numbers
Method and entries marked † from Patterson (2004)
Year      | Hard Disk Drives         | SSD (MLC Flash)             | DRAM                         | Ethernet
          | Capacity   Bandwidth     | Capacity  Bandwidth (r/w)   | Capacity        Bandwidth    | Bandwidth
1993      |                          |                             | 16 Mibit/chip†  267 MiB/s†   |
1994      | 4.3 GB†    9 MB/s†       |                             |                              |
1995      |                          |                             |                              | 100 Mbit/s†
2003      | 73.4 GB†   86 MB/s†      |                             |                              | 10 Gbit/s†
2004      |                          |                             | 512 Mibit/chip  3.2 GiB/s    |
2008      |                          | 160 GB    250/100 MB/s [3]  |                              |
2012      |                          | 800 GB    500/460 MB/s [4]  |                              |
2014      | 6 TB       220 MB/s [1]  | 2 TB      2.8/1.9 GB/s [5]  | 8 Gibit/chip    25.6 GiB/s   | 100 Gbit/s
2015      | 8 TB       195 MB/s [2]  |                             |                              |
Increase  | ×1860      ×24           | ×12.5     ×11.2             | ×512            ×98          | ×1000
[1] http://www.storagereview.com/seagate_enterprise_capacity_6tb_35_sas_hdd_review_v4
[2] http://www.storagereview.com/seagate_archive_hdd_review_8tb
[3] http://www.storagereview.com/intel_x25-m_ssd_review
[4] http://www.storagereview.com/intel_ssd_dc_s3700_series_enterprise_ssd_review
[5] http://www.tomshardware.com/reviews/intel-ssd-dc-p3700-nvme,3858.html
HDD: Seagate ST15150 (1994)†, Seagate 373453 (2004)†, Seagate ST6000NM0034 (2014), Seagate ST8000AS0012 [SMR] (2015)
SSD: Intel X25-M (2008), Intel SSD DC S3700 (2012), Intel SSD DC P3700 (2014)
DRAM: Fast Page DRAM (1993)†, DDR2-400 (2004), DDR4-3200 (2014)
Ethernet: Fast Ethernet IEEE 802.3u (1995)†, 10 GbitE IEEE 802.3ae (2003)†, 100 GbitE IEEE 802.3bj (2014)
26 / 22
Data Integrity and File System Snapshots
Root (/): contents A, B; hash(hA, hB)
  /A: contents 𝛼, 𝛽; hash(h𝛼, h𝛽) =: hA
    /A/𝛼: contents data𝛼; hash(data𝛼) =: h𝛼
    /A/𝛽: contents data𝛽; hash(data𝛽) =: h𝛽
  /B: contents ∅; hash(∅) =: hB
Merkle tree
∙ Hash tree with a cryptographic hash function provides a secure identifier for each subtree
∙ It is easy to sign a small hash value (data authenticity)
∙ Efficient calculation of changes (fast replication)
∙ Bonus: versioning and data de-duplication
∙ Full potential together with content-addressable storage
∙ Self-verifying data chunks, trivial to distribute and cache
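The bullets above can be demonstrated in a few lines. This is a toy model with an arbitrary serialization (sorted name:hash pairs, SHA-256), not any particular file system's on-disk format.

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def merkle(node) -> str:
    """Hash of a subtree: a file hashes its data, a directory hashes its
    sorted (name, child-hash) entries. Identical subtrees hash identically,
    which gives de-duplication and cheap change detection."""
    if isinstance(node, bytes):                       # a file
        return h(node)
    entries = sorted((name, merkle(child)) for name, child in node.items())
    return h("".join(f"{n}:{c}" for n, c in entries).encode())

# Toy tree matching the slide: /A holds two files, /B is empty.
tree = {"A": {"alpha": b"data_alpha", "beta": b"data_beta"}, "B": {}}
root_before = merkle(tree)

tree["A"]["alpha"] = b"tampered"          # change a single file ...
assert merkle(tree) != root_before        # ... and the root hash changes
assert merkle(tree["B"]) == merkle({})    # an unchanged subtree keeps its hash
```

Signing only the small root hash authenticates the whole hierarchy, and comparing subtree hashes finds exactly the changed directories, which is what makes replication fast.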
27 / 22
Data Integrity and File System Snapshots
hash(hA, hB): contents hA [A], hB [B]
  hash(h𝛼, h𝛽) =: hA: contents h𝛼 [𝛼], h𝛽 [𝛽]
    hash(data𝛼) =: h𝛼: contents data𝛼
    hash(data𝛽) =: h𝛽: contents data𝛽
  hash(∅) =: hB: contents ∅
Merkle tree ⊕ Content-addressable storage
27 / 22
System Integration
System-aided
∙ Kernel-level file system: fast, but intrusive
∙ Virtual NFS server: easy deployment, but limited by NFS semantics
Unprivileged options (mostly)
∙ Library (e.g. libhdfs, libXrd, ...): tuned API, but not transparent
∙ Interposition systems:
  ∙ Pre-loaded libraries
  ∙ ptrace based
  ∙ FUSE (“file system in user space”)
[Diagram: the application’s open() goes through libc as a system call into the OS kernel, which forwards it to the user-space file system via a FUSE upcall.]
28 / 22
Bibliography I
Survey Articles
Satyanarayanan, M. (1990). A survey of distributed file systems. Annual Review of Computer Science, 4(1):73–104.

Guan, P., Kuhl, M., Li, Z., and Liu, X. (2000). A survey of distributed file systems. University of California, San Diego.

Thanh, T. D., Mohan, S., Choi, E., Kim, S., and Kim, P. (2008). A taxonomy and survey on distributed file systems. In Proc. int. conf. on Networked Computing and Advanced Information Management (NCM’08), pages 144–149.

Peddemors, A., Kuun, C., Spoor, R., Dekkers, P., and den Besten, C. (2010). Survey of technologies for wide area distributed storage. Technical report, SURFnet.

Depardon, B., Séguin, C., and Mahec, G. L. (2013). Analysis of six distributed file systems. Technical Report hal-00789086, Université de Picardie Jules Verne.
29 / 22
Bibliography II
Donvito, G., Marzulli, G., and Diacono, D. (2014). Testing of several distributed file-systems (HDFS, Ceph and GlusterFS) for supporting the HEP experiment analysis. Journal of Physics: Conference Series, 513.
File Systems
Sandberg, R., Goldberg, D., Kleiman, S., Walsh, D., and Lyon, B. (1985). Design and implementation of the Sun network filesystem. In Proc. of the Summer USENIX conference, pages 119–130.

Morris, J. H., Satyanarayanan, M., Conner, M. H., Howard, J. H., Rosenthal, D. S. H., and Smith, F. D. (1986). Andrew: A distributed personal computing environment. Communications of the ACM, 29(3):184–201.

Hartman, J. H. and Osterhout, J. K. (1995). The Zebra striped network file system. ACM Transactions on Computer Systems, 13(3):274–310.
30 / 22
Bibliography III
Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Weimer, W., Wells, C., and Zhao, B. (2000). OceanStore: An architecture for global-scale persistent storage. ACM SIGPLAN Notices, 35(11):190–201.

Quinlan, S. and Dorward, S. (2002). Venti: a new approach to archival storage. In Proc. of the 1st USENIX Conf. on File and Storage Technologies (FAST’02), pages 89–102.

Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). The Google file system. ACM SIGOPS Operating Systems Review, 37(5):29–43.

Schwan, P. (2003). Lustre: Building a file system for 1,000-node clusters. In Proc. of the 2003 Linux Symposium, pages 380–386.

Dorigo, A., Elmer, P., Furano, F., and Hanushevsky, A. (2005). XROOTD - a highly scalable architecture for data access. WSEAS Transactions on Computers, 4(4):348–353.
31 / 22
Bibliography IV
Weil, S. A. (2007). Ceph: reliable, scalable, and high-performance distributed storage. PhD thesis, University of California Santa Cruz.
32 / 22