
Distributed Access to Parallel File Systems

by

Dean Hildebrand

A dissertation submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy (Computer Science and Engineering)

in The University of Michigan 2007

Doctoral Committee:

Adjunct Professor Peter Honeyman, Chair
Professor Farnam Jahanian
Professor William R. Martin
Associate Professor Peter M. Chen
Professor Darrell Long, University of California, Santa Cruz


© Dean Hildebrand 2007

All Rights Reserved


To my family…

individually they exceed my wildest expectations,

together they give me strength to climb the highest mountains.


ACKNOWLEDGEMENTS

This dissertation is a testament to the dedication and abilities of many people. Their

support and guidance transformed me into who I am today.

My Ph.D. advisor Peter Honeyman is the main pillar of this dissertation. His insights

and criticisms taught me the fundamental elements of successful research. He has a true

love of computer science, which transfers to everyone around him. The other members

of my committee, Farnam Jahanian, Bill Martin, Pete Chen, and Darrell Long, made

many helpful suggestions at my proposal that helped guide the focus of this dissertation.

My work depends heavily upon many bleeding edge technologies, each of which

would not exist without the dedication of many brilliant and talented people. Garth Gib-

son, Lee Ward, and Gary Grider in particular championed pNFS to the wider storage and

high-performance community, sparking interest for its continued research. The IETF

pNFS working group, with all their bluster, raised many critical requirements and issues.

pNFS would still be an unfinished IETF specification without the support and tireless ef-

forts put forth by Marc Eshel at IBM and Rob Ross, Rob Latham, Murali Vilayannur, and

the entire PVFS2 development team. In addition, I cannot forget Andy Adamson, Bruce

Fields, Trond Myklebust, Olga Kornievskaia, David Richter, Jim Rees, and everyone else

at CITI that have transformed Linux NFSv4 into the best distributed file system in exis-

tence.

This material is based upon work supported by the Department of Energy under

Award Numbers DE-FC02-06ER25766 and B548853, Lawrence Livermore National

Laboratory under contract B523296, and by grants from Network Appliance and IBM. I

am eternally thankful for the endless wisdom and knowledge shared by Lee Ward, Gary

Grider, and James Nunez. They brought a Canadian to Albuquerque, New Mexico and

bestowed motivation for this dissertation. Moreover, the snakes, yucca, caverns, deserts,

mountains, and heat of New Mexico bestowed a reason for living.


Ann Arbor could not be a better place; its kind people share a thirst for knowledge

and a better world. Within Ann Arbor, late night beers with Jay, rock climbing, WCBN,

and the CBC all helped prevent me from burning out years ago.

Last but not least, without my family’s continued support, I would have lacked the

strength to tackle this phase of my life.

This report was prepared as an account of work sponsored by an agency of the United

States Government. Neither the United States Government nor any agency thereof, nor

any of their employees, makes any warranty, express or implied, or assumes any legal

liability or responsibility for the accuracy, completeness, or usefulness of any informa-

tion, apparatus, product, or process disclosed, or represents that its use would not infringe

privately owned rights. Reference herein to any specific commercial product, process, or

service by trade name, trademark, manufacturer, or otherwise does not necessarily consti-

tute or imply its endorsement, recommendation, or favoring by the United States Gov-

ernment or any agency thereof. The views and opinions of authors expressed herein do

not necessarily state or reflect those of the United States Government or any agency

thereof.


TABLE OF CONTENTS

DEDICATION................................................................................................................... ii

ACKNOWLEDGEMENTS ............................................................................................ iii

LIST OF FIGURES ........................................................................................................ vii

ABSTRACT...................................................................................................................... ix

CHAPTER

I. Introduction ............................................................................................................. 1

1.1. Motivation ................................................................................................... 2
1.2. Thesis statement .......................................................................................... 2
1.3. Overview of dissertation ............................................................................. 3

II. Background ............................................................................................................ 5

2.1. Storage infrastructures ................................................................................ 5
2.2. Remote data access ..................................................................................... 6
2.3. High-performance computing ................................................................... 14
2.4. Scaling data access .................................................................................... 20
2.5. NFS architectures ...................................................................................... 24
2.6. Additional NFS architectures .................................................................... 26
2.7. NFSv4 protocol ......................................................................................... 29
2.8. pNFS protocol ........................................................................................... 32

III. A Model of Remote Data Access ....................................................................... 34

3.1. Architecture for remote data access .......................................................... 34
3.2. Parallel file system data access architecture ............................................. 37
3.3. NFSv4 data access architecture ................................................................ 37
3.4. Remote data access requirements ............................................................. 38
3.5. Other general data access architectures .................................................... 39


IV. Remote Access to Unmodified Parallel File Systems....................................... 43

4.1. NFSv4 state maintenance .......................................................................... 44
4.2. Architecture ............................................................................................... 44
4.3. Fault tolerance ........................................................................................... 48
4.4. Security ..................................................................................................... 49
4.5. Evaluation ................................................................................................. 49
4.6. Related work ............................................................................................. 55
4.7. Conclusion ................................................................................................ 56

V. Flexible Remote Data Access .............................................................................. 57

5.1. pNFS architecture ..................................................................................... 58
5.2. Parallel virtual file system version 2 ......................................................... 62
5.3. pNFS prototype ......................................................................................... 63
5.4. Evaluation ................................................................................................. 65
5.5. Additional pNFS design and implementation issues ................................ 69
5.6. Related work ............................................................................................. 70
5.7. Conclusion ................................................................................................ 71

VI. Large Files, Small Writes, and pNFS ............................................................... 72

6.1. Small I/O requests ..................................................................................... 73
6.2. Small writes and pNFS ............................................................................. 74
6.3. Evaluation ................................................................................................. 79
6.4. Related work ............................................................................................. 85
6.5. Conclusion ................................................................................................ 86

VII. Direct Data Access with a Commodity Storage Protocol .............................. 87

7.1. Commodity high-performance remote data access ................................... 89
7.2. pNFS and storage protocol-specific layout drivers ................................... 90
7.3. Direct-pNFS .............................................................................................. 93
7.4. Direct-pNFS prototype .............................................................................. 96
7.5. Evaluation ................................................................................................. 96
7.6. Related work ........................................................................................... 104
7.7. Conclusion .............................................................................................. 106

VIII. Summary and Conclusion............................................................................. 108

8.1. Summary and supplemental remarks ...................................................... 108
8.2. Supplementary observations ................................................................... 111
8.3. Beyond NFSv4 ........................................................................................ 112
8.4. Extensions ............................................................................................... 114

BIBLIOGRAPHY ......................................................................................................... 117


LIST OF FIGURES

Figure

2.1: ASCI platform, data storage, and file system architecture ............................. 14

2.2: The ASCI BlueGene/L hierarchical architecture ............................................ 15

2.3: A typical high-performance application.......................................................... 16

2.4: Symmetric and asymmetric out-of-band parallel file systems........................ 22

2.5: NFS remote data access .................................................................................. 24

2.6: NFS with databases......................................................................................... 25

2.7: NFS exporting symmetric and asymmetric parallel file systems.................... 27

3.1: General architecture for remote data access ................................................... 35

3.2: Parallel file system data access architectures.................................................. 37

3.3: NFSv4-PFS data access architecture .............................................................. 38

3.4: Swift architecture ............................................................................................ 40

3.5: Reference Model for Open Storage Systems Interconnection ........................ 41

3.6: General data access architecture view of OSSI model ................................... 42

4.1: Split-Server NFSv4 data access architecture .................................................. 45

4.2: Split-Server NFSv4 design and process flow ................................................. 46

4.3: Split-Server NFSv4 experimental setup.......................................................... 49

4.4: Split-Server NFSv4 aggregate read throughput.............................................. 52

4.5: Split-Server NFSv4 aggregate write throughput............................................. 52

5.1: pNFS data access architecture ........................................................................ 58

5.2: pNFS design.................................................................................................... 59

5.3: PVFS2 architecture ......................................................................................... 62

5.4: pNFS prototype architecture ........................................................................... 65

5.6: Aggregate pNFS write throughput.................................................................. 67

5.7: Aggregate pNFS read throughput ................................................................... 67


6.1: pNFS small write data access architecture...................................................... 76

6.2: pNFS write threshold ...................................................................................... 77

6.3: Determining the write threshold value............................................................ 78

6.4: Write throughput with threshold, small write requests ................................... 80

6.5: ATLAS digitization write request size distribution with 500 events. ............. 82

6.6: ATLAS digitization write throughput for 50 and 500 events ......................... 84

7.1: pNFS file-based architecture with a parallel file system ................................ 91

7.2: pNFS file-based data access............................................................................ 92

7.3: Direct-pNFS data access architecture ............................................................. 93

7.4: Direct-pNFS with a parallel file system.......................................................... 94

7.5: Direct-pNFS prototype architecture with the PVFS2 parallel file system...... 97

7.6: Direct-pNFS aggregate write throughput........................................................ 99

7.7: Direct-pNFS aggregate read throughput....................................................... 100

7.8: Direct-pNFS scientific and macro benchmark performance ........................ 102

9.1: pNFS and inter-cluster data transfers across the WAN................................. 115


ABSTRACT

Large data stores are pushing the limits of modern technology. Parallel file systems

provide high I/O throughput to large data stores, but are limited to particular operating

system and hardware platforms, lack seamless integration and modern security features,

and suffer from slow offsite performance. Meanwhile, advanced research collaborations

are requiring higher bandwidth as well as concurrent and secure access to large datasets

across myriad platforms and parallel file systems, forming a schism between file systems

and their users.

It is my thesis that a distributed file system can improve I/O throughput to modern

parallel file system architectures, achieving new levels of scalability, performance, secu-

rity, heterogeneity, transparency, and independence. This dissertation describes and ex-

amines prototypes of three data access architectures that use the NFSv4 distributed filing

protocol as a foundation for remote data access to parallel file systems while maintaining

file system independence.

The first architecture, Split-Server NFSv4, targets parallel file system architectures

that disallow customization and/or direct storage access. Split-Server NFSv4 distributes

I/O across the available parallel file system nodes, offering secure, heterogeneous, and

transparent remote data access. While scalable, the Split-Server NFSv4 prototype dem-

onstrates that the absence of direct data access limits I/O throughput.

Remote data access performance can be increased for parallel file system architec-

tures that allow direct data access plus some customization. The second architecture ana-

lyzes the pNFS protocol, which uses storage-specific layout drivers to distribute I/O

across the bisectional bandwidth of a storage network between filing nodes and storage.

Storage-specific layout drivers allow universal storage protocol support and flexible secu-

rity and data access semantics, but can diminish the level of heterogeneity and transpar-

ency. The third architecture, Direct-pNFS, uses a commodity distributed file system for


direct access to a parallel file system’s storage nodes, bridging the gap between perform-

ance and transparency. The dissertation describes the importance and necessity for both

direct data access architectures depending on user and system requirements. I analyze

prototypes of both direct data access architectures and demonstrate their ability to match

and even exceed the performance of the underlying parallel file system.


CHAPTER I

Introduction

Modern research requires local and global access to massive data stores. Parallel file

systems, which provide direct and parallel access to storage, are highly specialized, lack

seamless integration and modern security features, are often limited to a single operating sys-

tem and hardware platform, and suffer from slow offsite performance. However, grid

computing, legacy software, and other factors are increasing the heterogeneity of clients,

creating a schism between file systems and their users.

Distributed filing protocols such as NFS [1] and CIFS [2] are widely used to bridge

the interoperability gap between storage systems. Unfortunately, implementations of

these protocols deliver only a fraction of the exported storage system's performance.

They continue to have limited network, CPU, memory, and disk I/O resources due to

their “single server” design, which binds one network endpoint to an entire file system.

By continuing to use RPC-based client/server architectures, distributed filing protocols

have an entrenched lack of scalability. The NFSv4 protocol [3] improves functionality

by providing integrated security and locking facilities, and migration and replication fea-

tures, but continues to use a client/server based architecture and consequently retains the

single server bottleneck.

Scalable file transfer protocols such as GridFTP [4] are also used to enable high

throughput, operating system independent, and secure WAN access to high-performance

file systems. The HTTP protocol is by far the most common way to access remote data stores.

Both are difficult to integrate with a local file

system and neither provides shared access to a single data copy, which unnecessarily cre-

ates a copy for each user and increases the complexity of maintaining single-copy seman-

tics.


1.1. Motivation

Many application domains demonstrate the need for high bandwidth, concurrent, and

secure access to large datasets across a variety of platforms and file systems. DNA se-

quences, other biometric data, and artwork databases have large data sets, ranging up to

tens of gigabytes in size, that are often loaded independently by concurrent clients [5, 6].

Full database scans of huge files are often unavoidable, even when using indexing [7].

The Earth Observing System Data and Information System (EOSDIS) [8, 9] manages

data from NASA's earth science research satellites and field measurement programs, pro-

viding data archive, distribution, and information management services. As of 2005,

EOSDIS had stored more than three petabytes of data while continuing to generate more

than seven terabytes per week. In 2004, EOSDIS supported more than 1.9 million unique

users and fulfilled more than 36 million product requests [10].

Digital movie studios that generate terabytes of data every day require access from

Sun, Windows, SGI, and Linux workstations, and compute clusters [11]. Users edit files

in place or copy files between heterogeneous data stores.

The scientific computing community connects large computational and data facilities

around the globe to perform physical simulations that generate petabytes of data. The

Advanced Simulation and Computing program (ASC) in the U.S. Department of Energy

estimates that one gigabyte per second of aggregate I/O throughput is necessary for every

teraflop of computing power [12], which suggests that file systems will need to support

terabyte per second data transfer rates by 2010 [13].

1.2. Thesis statement

To meet this diverse set of data access requirements, I set out to demonstrate the fol-

lowing thesis statement:

It is feasible to use a distributed file system to realize the I/O throughput metrics

of parallel file system architectures and to achieve different levels of scalability, per-

formance, security, heterogeneity, transparency, and independence.


1.3. Overview of dissertation

To validate this thesis I outline a general architecture for remote data access and use it

to describe and validate prototype implementations that use the NFSv4 distributed file

system protocol as a foundation for remote data access to modern parallel file system ar-

chitectures. I demonstrate that specialized reorganization of the general architecture to

suit a parallel file system can improve I/O throughput while maintaining operating sys-

tem, hardware platform, and parallel file system independence. The dissertation is organ-

ized as follows.

In Chapter II, I discuss the background of storing and accessing data. This includes a

history of remote data access techniques intended to accommodate the spiraling growth in

performance requirements. I also give an overview of supercomputing and its I/O re-

quirements, including a description of parallel file system architectures that require high-

performance remote data access.

Chapter II also details how organizations use NFS with UNIX file systems, parallel

file systems, and databases. I discuss the challenge of achieving full utilization of a stor-

age system’s available bandwidth with NFS, and discuss NFS architectures that attempt

to overcome the single server bottleneck with these data stores. I introduce NFSv4, the

emerging Internet standard for distributed filing, and discuss how its stateful server pre-

sents a new scalability obstacle. I also introduce pNFS, a high-performance extension of

NFSv4 under design by the Internet Engineering Task Force.

Chapter III presents a general architecture for remote data access to parallel file sys-

tems. I use this architecture to demonstrate the interactions of the data and metadata sub-

systems of NFSv4 with modern parallel file system architectures. The major components

of the data access architecture are the application interface, the metadata service, the data

and control components, and storage.

Chapter IV introduces Split-Server NFSv4, a variant of the general architecture that

targets parallel file system architectures that do not admit modification or prohibit direct

storage access. Split-Server NFSv4 distributes I/O across parallel file system nodes, of-

fering secure, heterogeneous, and transparent remote data access. The Split-Server

NFSv4 prototype scales I/O throughput linearly with available parallel file system band-

width, but lacks direct data access, which limits performance. The aggregate I/O per-


formance of the prototype achieves 70% of the maximum read bandwidth and 50% of the

maximum write bandwidth.

Empowering clients with the ability to access data directly from storage—eliminating

intermediary servers—is key to achieving high performance. The dissertation introduces

two more variants of the general architecture tailored for parallel file systems that allow

direct data access.

In Chapter V, I analyze the pNFS protocol, which uses storage-specific layout drivers

to distribute I/O across the bisectional bandwidth of a storage network between filing

nodes and storage. Storage-specific layout drivers allow universal storage protocol sup-

port, flexible security, and well-defined data access semantics, but can diminish the level

of heterogeneity and transparency offered by distributed file systems. Discussion of a

prototype implementation that demonstrates and validates the potential of pNFS con-

cludes this chapter.

Chapter VI demonstrates how pNFS can be engineered to improve the overall write

performance of parallel file systems by using direct, parallel I/O for large write requests

and a distributed file system for small write requests.

Chapter VII introduces Direct-pNFS, a final variant of the general architecture, which

offers high I/O throughput while retaining the security, heterogeneity, and transparency

of NFSv4. Direct-pNFS uses a commodity distributed file system for direct access to the

storage nodes of a parallel file system, bridging the gap between performance and trans-

parency. Experiments with my Direct-pNFS prototype demonstrate that its I/O through-

put matches or outperforms the native parallel file system client across a range of work-

loads.

Through the understanding and exploration of remote data access architectures, this

dissertation helps to bring scalable, secure, heterogeneous, transparent, and file system

independent data access to high-performance computing and its large data stores.


CHAPTER II

Background

This chapter presents background information on storing and accessing data. This in-

cludes a history of remote data access, a description of modern storage architectures, and

techniques invented to accommodate the spiraling growth in performance requirements. I

also describe the storage architectures of modern supercomputers and characterize the I/O

requirements of supercomputing applications. I conclude the chapter with a discussion of

modern NFS infrastructures and techniques used to scale NFS data access.

2.1. Storage infrastructures

In the beginning, access to storage entailed directly attaching disks to a single com-

puter. Even the fastest disks consistently fail to transfer data at the rate offered by their

enclosure’s interface, e.g., SATA [14], Ultra320 [15], and Fibre Channel [16]. Salem and

Garcia-Molina [17] introduced the term disk striping for splitting files across multiple

disks to improve I/O performance, a technique well known to the designers of I/O sub-

systems for early supercomputers [18]. The use of a Redundant Array of Inexpensive

Disks (RAID) [19], now prolific, also improves performance and, with some RAID lev-

els, fault tolerance.
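
To make the striping idea concrete, the following Python sketch (an illustration of mine, not code from any system cited here) maps a file byte offset onto a disk and an offset within that disk for a hypothetical round-robin layout with a fixed stripe unit; a sequential read larger than one stripe unit then spans several disks whose transfers can proceed in parallel.

    STRIPE_UNIT = 64 * 1024   # bytes per stripe unit (hypothetical value)
    NUM_DISKS = 4             # disks in the striped set (hypothetical value)

    def locate(offset):
        """Map a file byte offset to (disk index, byte offset within that disk)."""
        stripe = offset // STRIPE_UNIT             # which stripe unit the byte falls in
        disk = stripe % NUM_DISKS                  # round-robin placement across disks
        disk_offset = (stripe // NUM_DISKS) * STRIPE_UNIT + offset % STRIPE_UNIT
        return disk, disk_offset

    # A 256 KB sequential read touches every disk once, so the transfers can proceed in parallel.
    for offset in range(0, 256 * 1024, STRIPE_UNIT):
        print(offset, locate(offset))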

Although RAID systems can overcome the failure of a single disk, they continue to

suffer from the host server’s single point of failure and lack of scalability. Storage area

network (SAN) and network attached storage (NAS) architectures alleviate these prob-

lems by distributing data across multiple storage devices.

SAN is defined as a dedicated network, e.g., Fibre Channel, in which I/O requests ac-

cess storage devices directly using a block access protocol such as SCSI [20] or FCP


[21]. SAN delivers high-throughput data access, but is expensive, difficult to manage,

and lacks data sharing capabilities.

NAS is an IP-based network in which a NAS device—a processor plus disk storage—

handles I/O requests by using a file-access protocol such as NFS or CIFS. The NAS de-

vice translates file requests into storage device requests. NAS provides ease-of-

management, cost-effectiveness, and data sharing, but introduces performance bottle-

necks.

As users demand the performance of SAN with the cost and manageability of NAS,

the differences between SAN and NAS are disappearing thanks to storage virtualization,

which takes several forms. One form enables direct data access using block protocols on

IP-based networks, e.g., iSCSI [22, 23]. Another form connects NAS appliances to both

IP-based and private networks. The IP-based network supports data sharing while the

private network provides high throughput data access.

2.2. Remote data access

The Internet file transfer protocol, FTP [24], was first developed in 1971 to transfer

data in ARPANET. FTP had three primary objectives: to promote data sharing, to pro-

vide storage system and hardware platform independence, and to transfer data reliably

and efficiently. The transition from an ARPANET consisting of a few mainframe-based

timesharing machines to a global Internet made up of many smaller PCs, each with

its own hard drive, introduced a model of computing alien to the FTP design. Many in-

dependent name spaces replaced the mainframe’s monolithic name space, resulting in

multiple copies of shared data. Increased storage requirements, data inconsistencies, and

slow networks sparked the creation of distributed file systems.

The initial goals of distributed file systems [25] were:

• Efficient remote data access.

• Avoid whole file transfers by transferring only requested data.

• Seamless integration of remote data into a single file system.

• Avoid clients accessing stale data.


• Enable diskless clients.

• Provide file locking.

In 1988, the Portable Operating System Interface (POSIX) [26] standard was defined

to promote portability of application programs across UNIX system environments by de-

veloping a clear, consistent, and unambiguous standard for the interface specification of

UNIX-like operating systems. POSIX quickly became synonymous with UNIX seman-

tics, which greatly influenced file system design. POSIX-compliant file systems increase

application portability by guaranteeing a specific set of semantics. Unfortunately, these

semantics sometimes prove difficult, impossible, or unnecessary to implement. The fol-

lowing sections describe how some file systems choose to support a relaxed version.

2.2.1. Distributed filing protocols

This section gives an overview of several successful and innovative distributed filing

protocols and distributed file systems.

The nomenclature for systems that provide remote data access can be confusing.

These systems are known by several terms: file access protocol, distributed filing proto-

col, filing protocol, distributed file system protocol, or distributed file system. At the

core of every client/server architecture is a wire protocol to communicate data between

the client and the server. Sometimes the publication and distribution of this protocol is

convoluted and limited, but it always exists in some form, although that form might be

source code. This dissertation focuses on the Network File System (NFS) protocol,

which is distinguished by its precise definition by the IETF, availability of open source

implementations, and support on virtually every modern operating system.

2.2.1.1 Newcastle Connection

Newcastle Connection [27], one of the first distributed file systems, is a portable user-

level C library that enables data transfer and supports full UNIX semantics. To stitch re-

mote data stores together, a superroot directory contains the local root directory and the

host names of all the available remote systems. Newcastle Connection performance is

hampered by a lack of data and attribute caching. In addition, it required programs to be


relinked with a new C library that routes system calls between the local and remote file

systems. Many modern distributed filing implementations now use a kernel-based client,

which increases transparency to programs at the expense of portability. A kernel-based

client does not require programs to relink and allows process sharing of attribute and data

caches. Newcastle Connection is no longer supported.

2.2.1.2 Apollo Domain operating system

Developed for Apollo workstations in the early 1980s, Domain [28-30] is one of the

first distributed file systems. It is a peer-based system designed for tightly integrated

groups of collaborators. Each system object is identified by a tuple consisting of the object's

creation time and a unique number, set at time of manufacture, that identifies the Apollo

workstation on which the object was created. Domain stores an object on a single work-

station, supports data and attribute caching, and uses a lock manager to maintain consis-

tency. A user logged onto an Apollo workstation has access to all workstations in the

work group. Access lists enforce file access permissions. Domain’s tight integration into

the Apollo hardware and operating system has many benefits but also interferes with

adoption on other operating system platforms.

2.2.1.3 LOCUS operating system

Developed at UCLA in the early 1980s, the distributed file system in the LOCUS op-

erating system [31] supports location independence and uses a primary copy replication

scheme. It also focuses on improving fault tolerance semantics in comparison to other

distributed operating systems. Like Domain, the LOCUS distributed file system's tight

integration with the LOCUS operating system and use of specialized remote operation

protocols limits its portability and widespread use.

2.2.1.4 Remote File Sharing file system (RFS)

Developed by AT&T in the mid-1980s for UNIX System V release 3, the Remote

File Sharing (RFS) [32] distributed file system supports full UNIX semantics. A name

server advertises available file systems, allowing clients to mount a file system without


knowledge of its precise location, using only the file system’s identifier. RFS eventually

supported client caching using a stateful server for lock management, although caching is dis-

abled for multiple writers or on all readers when there is a single writer. RFS offers no

secure way for clients to authenticate, using standard UNIX file and directory protection

mechanisms instead. Lack of fault tolerance, sole support for UNIX System V version 3,

and the use of a specialized transport protocol limited the widespread adoption and com-

mercial success of RFS.

2.2.1.5 Network File System (NFS)

The Network File System (NFS) [1] was developed at Sun Microsystems in 1985.

Sun Microsystems distinguished NFS from previous distributed file systems by designing

a protocol instead of an implementation. The NFS protocol encouraged the development

of many implementations by being “agnostic” as to operating system, hardware platform,

underlying file system. This was accomplished by defining a virtual file system (VFS)

interface and by virtualizing file system metadata in the form of a Vnode definition [33].

In addition, NFS encapsulated the file system’s use of a network architecture and trans-

port protocols. Network architecture and transport protocol independence is achieved by

using the Open Network Computing Remote Procedure Call (ONC RPC) [34] for com-

munication. By hiding the transport protocol from the application, ONC RPC supports

heterogeneous transport protocols. NFS also uses an External Data Representation

(XDR) [35] format to ensure that data is understood by both the sender and recipient.

NFS versions 2 [36] and 3 [37, 38] are stateless, meaning that the NFS server does

not maintain state across client requests. This simplifies crash recovery and the protocol

itself, but weakens support for UNIX semantics and limits the features it can support.

For example, the POSIX “last close” requirement, which mandates that a removed file

not be physically removed until all clients have closed the file, is impossible to imple-

ment without keeping track of all clients that have the file open. The lack of server state

also interferes with client cache consistency. NFS supports close-to-open consistency

semantics: clients must flush all data blocks to the server when closing a file. When the

client opens (or re-opens) the file, it checks to see if its cached data is out of date, and if

necessary, retrieves the latest copy from the server. Many implementations put a timeout


of three seconds on cached file blocks and a timeout of thirty seconds on cached directory

attributes. This creates a trade-off between performance and data integrity, with per-

formance heavily favored. An additional protocol, the Network Lock Manager, was later

created to isolate the inherently stateful aspects of file locking.
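
The close-to-open model can be restated as a small simulation. The following Python sketch is an illustration of the logic only, not Linux NFS client code: the client revalidates its cached data against the server's attributes on open and flushes dirty data on close, so a later opener observes an earlier closer's writes.

    import itertools

    _clock = itertools.count(1)          # stand-in for the server's change attribute / mtime

    class ToyServer:
        """Minimal stand-in for an NFS server: one file's data plus its modification time."""
        def __init__(self):
            self.data, self.mtime = b"", 0
        def getattr(self):
            return self.mtime
        def read(self):
            return self.data
        def write(self, data):
            self.data, self.mtime = data, next(_clock)

    class ToyClient:
        """Caches file data; revalidates on open and flushes on close (close-to-open)."""
        def __init__(self, server):
            self.server, self.cache, self.cached_mtime, self.dirty = server, b"", None, False
        def open(self):
            # Revalidate: if the server's attributes changed since we cached, refetch.
            if self.cached_mtime != self.server.getattr():
                self.cache = self.server.read()
                self.cached_mtime = self.server.getattr()
        def write(self, data):
            self.cache, self.dirty = data, True      # buffered locally until close
        def close(self):
            if self.dirty:                           # flush so the next opener sees the update
                self.server.write(self.cache)
                self.cached_mtime = self.server.getattr()
                self.dirty = False

    server = ToyServer()
    writer, reader = ToyClient(server), ToyClient(server)
    writer.open(); writer.write(b"new contents"); writer.close()   # writer: open, modify, close
    reader.open()
    assert reader.cache == b"new contents"   # the later opener sees the earlier closer's writes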

A supporting MOUNT protocol performs the operating system-specific functions that

allow clients to attach remote file systems to a node within the local file system. The

mount process also allows the server to grant remote access privileges to a restricted set

of clients via export control. An Automounter mechanism [39] can be used to enable

read-only replication by allowing a client to evaluate NFS mount points and choose the

best server at the time data is requested.

As with most distributed file systems at that time, security in NFS depends on stan-

dard UNIX file protection mechanisms. The UNIX security model trusts the identity of

users without authentication. This security model is sometimes feasible within small or-

ganizations, but it is definitely not sufficient in larger organizations or across the Internet.

2.2.1.6 Andrew File System (AFS) and Coda

Andrew [40] is a distributed computing environment developed at Carnegie Mellon

University in 1983 on the 4.3 BSD version of UNIX. Andrew features a distributed file

system called Vice, later renamed Andrew File System (AFS), which became a commer-

cial product in 1989. An early goal was scalability, targeting support for up to 10,000

clients with approximately 200 clients for every server, an order of magnitude improve-

ment over NFS in the ratio of clients to servers.

Disks are divided into partitions, and partitions into volumes. AFS can automatically

migrate or replicate heavily used volumes across server nodes to balance load. Only one

mutable copy of a replica exists, with all updates forwarded to it. Special servers main-

tain a fully replicated location database that maps volumes to home servers, enabling full

location transparency. All clients see a shared namespace under /afs, which contains

links to groups (or cells) of AFS servers.

AFS caches large chunks of files on local disk; early versions cached entire files. If a

server allows a client to cache data, the server returns a promise that it will inform (or

“callback”) the client if the data is modified. Note that since clients flush data on file


close or an explicit call to fsync, the server discovers changes only after the client has

already modified its cached version. Servers, which are stateful, may revoke the promise,

i.e., issue a callback, at any time and for any reason (principally memory exhaustion).

Callbacks give AFS a scalable way to achieve the close-to-open semantics in NFS by

enabling aggressive client caching, which reduces client/server communication.

NFS and most other distributed file systems in the 1980s were targeted for use by a

small collection of trusted workstations. The large number of AFS clients breaks this

model, requiring a stronger security mechanism. AFS therefore abjures UNIX file pro-

tection semantics, instead requiring users to obtain Kerberos [41] tokens that map onto

access control lists, which control access at the granularity of a directory. Kerberos is a

network authentication protocol that provides strong authentication for client/server ap-

plications by using secret-key cryptography. AFS uses a secure RPC called Rx for all

communication.

Some work has begun to create an implementation of AFS that provides remote ac-

cess to existing data stores, although it appears such a system does not yet exist.

Vice, a predecessor of AFS, forms the basis of the high availability Coda file system

[42], which adds support for mutable server replication and disconnected operation.

Coda has three client access strategies: read-one-data, read data from a single preferred

server; read-all-status, obtain version and status information from all servers; and write-

all, write updates to all available servers. Clients can continue to work with cached cop-

ies of files when disconnected, with updates propagated to the servers when the client is

reconnected. If servers contain different versions of the same file, stale replicas are asyn-

chronously refreshed. If conflicting versions exist, user intervention is usually required.

2.2.1.7 DCE/DFS

The Open Software Foundation uses AFS [40] as the basis for the DEcorum file sys-

tem (DCE/DFS) [43], a major component of its Distributed Computing Environment

(DCE). It redesigned the AFS server with an extended virtual file system interface,

called VFS+, enabling it to support a range of underlying file systems. A specialized un-

derlying file system, Episode [44], supports data replication, data migration, and access

control lists (ACLs), which specify the users that can access a file system resource. DFS


supports single-copy consistency semantics, ensuring that clients see the latest changes to

a file. A token manager running on each server manages consistency by returning vari-

ous types of tokens to clients, e.g., read and write tokens, open tokens, and file attribute

tokens. A server can prohibit multiple clients from modifying cached copies of the same

file. Leases [45] are placed on the tokens to let the server revoke tokens that are not re-

newed by the client within a lease period, allowing quick recovery from a failed client

holding exclusive-access tokens. A recovered server enters a grace period—lasting for a

few minutes—which allows clients to detect server failure and reacquire tokens.

2.2.1.8 AppleTalk

Developed by Apple Computer in the early 1980s, the AppleTalk protocol suite [46]

facilitated file transfer, printer sharing, and mail service among Apple systems. Built

from the ground up, AppleTalk managed every layer of the OSI reference model [47].

AppleTalk currently includes a set of protocols to work with existing data link protocols

such as Ethernet, Token Ring, and FDDI.

The AppleTalk Filing Protocol (AFP) allows Macintosh clients to access remote files

in the same manner as local files. AFP uses several other protocols in the AppleTalk pro-

tocol suite including the AppleTalk Session Protocol, the AppleTalk Transaction Proto-

col, and the AppleTalk Echo Protocol. The Mac OS continues to use AFP as a primary

file sharing protocol, but Mac support for NFS is growing.

2.2.1.9 Common Internet File System

The Server Message Block (SMB) protocol, now known as the Common Internet File

System (CIFS) [2], was created for PCs in the 1980s by IBM and later extended by

3COM, Intel, and Microsoft. SMB was designed to provide remote access to the

DOS/FAT file system, but now NTFS forms the basis for CIFS. CIFS uses NetBIOS

(Network Basic Input Output System) sessions, a session management layer originally

designed to operate over a proprietary transport protocol (NetBEUI) but now operating

over TCP/IP and UDP [48, 49]. Once a session ends, the server may close all open files.


Server failure results in the loss of all server state, including all open files and current file

offsets.

CIFS has a unique cache consistency model that uses opportunistic locks (oplocks).

On file open, a client specifies the access it requires (read, write, or both) and the access

to deny others, and in return receives an oplock (if caching is available). There are three

types of oplocks: exclusive, level II, and batch. The first client to open a file receives an

exclusive lock. The server disables caching if two clients request write access to the

same file, forcing both clients to write through the server. When a client requests read access

to a file with an exclusive lock, the server disables caching on the requesting client and

changes the writing client's cache to read-only (level II). Batch locks allow a client to

retain an oplock across multiple file opens and closes.
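
The grant-and-break rules described above can be stated compactly in code. The following Python sketch is a deliberately simplified model that mirrors the text; it is not the CIFS oplock state machine and omits batch oplocks and acknowledged oplock breaks. The first opener receives an exclusive oplock, a competing writer forces all clients to write through the server, and a competing reader downgrades the exclusive holder to level II.

    class OplockTable:
        """Toy model of the oplock rules in the text; 'none' means no client-side caching."""
        def __init__(self):
            self.grants = {}                                   # path -> list of [client, oplock]

        def open(self, path, client, wants_write):
            holders = self.grants.setdefault(path, [])
            if not holders:
                holders.append([client, "exclusive"])          # first opener: exclusive oplock
                return "exclusive"
            if wants_write:
                for h in holders:                              # two writers: disable caching,
                    h[1] = "none"                              # both write through the server
                holders.append([client, "none"])
                return "none"
            for h in holders:                                  # reader arrives: exclusive holder
                if h[1] == "exclusive":                        # drops to level II (read-only cache)
                    h[1] = "level II"
            holders.append([client, "none"])                   # requesting reader gets no caching
            return "none"

    table = OplockTable()
    print(table.open("/share/report.doc", "client-1", wants_write=True))   # exclusive
    print(table.open("/share/report.doc", "client-2", wants_write=False))  # none; client-1 drops to level II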

Other features of CIFS include request batching and the use of the Microsoft DFS fa-

cility to stitch together several file servers into a single namespace. Security is handled

with password authentication on the server. Although CIFS is exclusively controlled by

Microsoft, Samba [50] is a suite of open source programs that provides file and print ser-

vices to PC clients using the CIFS protocol.

2.2.1.10 Sprite operating system

The Sprite operating system [51, 52] was developed at the University of California at

Berkeley for networked, diskless, and large main memory workstations. The Sprite dis-

tributed file system supports full UNIX semantics. To resolve a file location, each client

maintains a file prefix table, which caches the association of file paths and their home

server. Caching of file prefixes reduces recursive lookup traffic as a client walks through

the directory tree, but the broadcast mechanism used to distribute file prefix information

limits Sprite to LAN environments. Sprite supports single-copy semantics by tracking

open files on clients and whether clients are reading or writing, allowing only a single

writer or multiple readers to cache file data. The server uses callbacks to invalidate client

caches when conflicting open requests occur. Sprite uses a writeback cache, flushing

dirty blocks to the server after thirty seconds, and writing to disk in another thirty sec-

onds. This caching model was integrated into Spritely NFS [53] at the cost of file open

performance and a more complicated server recovery model, which stores the list of clients that


have opened a file on disk. Not Quite NFS (NQNFS) [54] avoids storing state on disk

and the use of open and close commands by using a lease mechanism on the client data

cache. This avoids having to introduce additional server recovery semantics to NFS by

simply allowing leases to expire before a failed server recovers.

2.3. High-performance computing

2.3.1. Supercomputers

This section gives an overview of modern supercomputers from a data perspective.

Figure 2.1 displays a typical supercomputer data access architecture, consisting of a pri-

mary I/O (storage) network that connects compute nodes, login/development nodes, visu-

alization facilities, archival facilities, and remote compute and data facilities.

The Thunderbird cluster at Sandia National Laboratories, currently the largest PC

cluster in the world, achieves 38 teraflops. Thunderbird consists of 4,096 dual-proces-

sor nodes using Infiniband for inter-node communication and Gigabit Ethernet for stor-

age access. This architecture supports direct storage access for all nodes.

ASCI Purple has 1,536 8-way nodes (12,288 CPUs) and achieves 75 teraflops [55].

ASCI Purple is a hybrid of commodity and specialized components. Custom designed

1 from ASCI Technology Prospectus, July 2001

Figure 2.1: ASCI platform, data storage, and file system architecture1.


compute nodes communicate via the IBM Federation interconnect, which has a peak bi-

directional bandwidth of 8 GB/s and a latency of 4.4 µs, and uses a commodity Infini-

band storage network for data access. To meet I/O throughput requirements, 128 nodes

are designated I/O nodes to act as a bridge between the Federation and Infiniband inter-

connects. Compute nodes trap I/O calls and automatically re-execute them on the I/O

nodes, with the results shipped back to the originating compute node. I/O nodes also

handle process authentication, accounting, and authorization on behalf of the compute

nodes.
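
This I/O-node bridge amounts to function shipping: a compute node packages a trapped I/O call, an I/O node executes it against storage, and the result is shipped back. The following Python sketch illustrates only that pattern, with hypothetical helper names of my own; it does not describe ASCI Purple's actual I/O forwarding software.

    import pickle

    def io_node_execute(request: bytes) -> bytes:
        """Runs on an I/O node: unpack a shipped call, perform it against storage, return the result."""
        op, args = pickle.loads(request)
        if op == "write":
            path, data = args
            with open(path, "ab") as f:
                return pickle.dumps(f.write(data))             # bytes written
        if op == "read":
            path, size = args
            with open(path, "rb") as f:
                return pickle.dumps(f.read(size))
        raise ValueError(f"unknown operation {op!r}")

    def forwarded_write(path, data, ship=io_node_execute):
        """Runs on a compute node: the trapped write never touches storage locally."""
        return pickle.loads(ship(pickle.dumps(("write", (path, data)))))

    # In a real system 'ship' would cross the machine's interconnect; here it is a local call.
    forwarded_write("/tmp/demo.out", b"result block\n")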

Figure 2.2 describes the hierarchical architecture of the IBM BlueGene/L hybrid ma-

chine [56, 57]. A full BlueGene/L system has 65,536 dual-processor compute nodes, or-

ders of magnitude more than contemporary systems such as ASCI White [58], Earth

Simulator [59], and ASCI Red [60]. BlueGene/L has one I/O node for every sixty-four

compute nodes, although this ratio is configurable. Currently, each node has a small

amount of memory, limiting the machine to highly parallelizable applications.

2 from www.llnl.gov/asc/computing_resources/bluegenel/configuration.html

Figure 2.2: The ASCI BlueGene/L hierarchical architecture2.


2.3.2. High-performance computing applications

The design of modern distributed file system architectures derives mainly from sev-

eral workload characterization studies [40, 61-64] of UNIX users and their applications.

Applications that use thousands of processors have entirely different workloads. This

section gives an overview of HPC applications and their I/O requirements.

High-performance applications aim to use every available processor. Each node cal-

culates the results on a piece of the analysis domain, with the results from every node be-

ing combined at some later point. The example parallel application in Figure 2.3 exe-

cutes on all cluster nodes in parallel. A physical simulation, e.g., propagation of a wave,

moves forward in time. The unit of time depends on the required resolution of the result.

In step 1, nodes load their inputs individually or designate one node to load the data

and distribute it among the nodes. At the beginning of each computation (step 3), nodes

synchronize among themselves. After a node completes its computation for a time step,

step 5 communicates dependent data, a ghost cell, to domain neighbors for the next time

step. Step 6 checkpoints (writes) the data to guard against system or application failures.

The next section discusses checkpointing in more detail. In step 9, all nodes write their

results directly to storage or have a single node gather the results and write the combined

result to storage. Flash [65] is an example of the former; mpiBLAST [66] exemplifies

the latter.

Figure 2.3: A typical high-performance application. Pseudo-code of a typical high-performance computing application that each node executes in parallel.

1. Load initialization data
2. Begin loop
3. BARRIER
4. Compute results for current time step
5. Distribute ghost cells to dependent nodes
6. If required, checkpoint data to storage
7. Advance time forward
8. End loop
9. Write result
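
A minimal executable rendering of the loop in Figure 2.3 follows, assuming the mpi4py package and an MPI launcher are available (for example, mpiexec -n 4 python app.py); the domain, the per-step computation, and the checkpoint format are placeholders, not drawn from any application discussed in this dissertation.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    local = [float(rank)] * 1024                      # step 1: each node loads its piece of the domain
    CHECKPOINT_EVERY = 10

    for step in range(1, 101):                        # steps 2-8: advance the simulation in time
        comm.Barrier()                                # step 3: synchronize before computing
        local = [x + 0.1 for x in local]              # step 4: placeholder computation for this time step
        left, right = (rank - 1) % size, (rank + 1) % size
        ghost = comm.sendrecv(local[-1], dest=right, source=left)   # step 5: exchange ghost cells
        local[0] = (local[0] + ghost) / 2.0           # fold the neighbor's boundary value into this edge
        if step % CHECKPOINT_EVERY == 0:              # step 6: defensive I/O, one file per rank
            with open(f"ckpt_{step:04d}.rank{rank}", "w") as f:
                f.write(repr(local))

    result = comm.gather(sum(local), root=0)          # step 9: rank 0 gathers and writes the result
    if rank == 0:
        with open("result.txt", "w") as f:
            f.write(repr(result))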


2.3.2.1 I/O

Miller and Katz [67] divided high-performance application I/O into three categories:

required, data staging, and checkpoint. Required I/O consists of loading initialization

data and storing the final results. Data staging supports applications whose data do not fit

in main memory. These data can be stored in virtual memory automatically by the oper-

ating system, or manually written to disk using out-of-core techniques for supercomput-

ers that do not support virtual memory. Checkpoint I/O, also known as defensive I/O,

stores intermediate results to prevent data loss. Checkpointing results after every compu-

tation increases application runtime to an unacceptable extent, but deferring checkpoint

I/O too long risks unacceptable data loss. The time between checkpoints depends on a

system’s mean time between hardware failures (MTBHF). Failure rates vary widely

across systems, depending mostly on the number of processors and the type and intensity

of application workloads [68].
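The cited studies do not prescribe a specific interval, but a commonly used first-order estimate, Young's approximation, illustrates the trade-off; it is included here only as a hedged illustration, with delta the time to write one checkpoint and M the mean time between hardware failures:

% Young's first-order approximation for the checkpoint interval
% (illustrative; not taken from the studies cited above).
\tau_{\mathrm{opt}} \approx \sqrt{2\,\delta\,M}

Checkpointing much more often than this wastes time writing state, while checkpointing much less often risks losing an unacceptable amount of computation.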

Applications write checkpoint and result files in different ways. These are the most

common three methods [69]:

1. Single file per process/node. Each node or process creates a unique file. This

method is clumsy since an application must use the same number of nodes to re-

start computation after failure or interruption. In addition, the files must later be

integrated or mined by special tools and post-processors to generate a final result.

2. Small number of files. An application can write a smaller number of files than

processes/nodes in the computation by performing some integration work at each

checkpoint (and for the final result). This method allows an application to restart

computation on a different number of processes/nodes. However, special process-

ing and tools are still required.

3. Single file. An application integrates all information into a single file. This

method allows applications to restart computation easily on any number of proc-

esses/nodes and obviates the need for special post-processing and tools (see the sketch following this list).
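The sketch below contrasts methods 1 and 3: each rank either writes its own checkpoint file or writes its portion at a disjoint offset of one shared file through MPI-IO. The file names, the fixed per-rank size, and the step number are assumptions made for the example.

/* Illustrative contrast of checkpoint file layouts (hypothetical names and sizes). */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define CKPT_BYTES_PER_RANK 4096

static void checkpoint_file_per_process(int rank, const char *buf)
{
    /* Method 1: each rank writes its own file; restart needs the same rank count. */
    char path[64];
    FILE *fp;

    snprintf(path, sizeof(path), "ckpt.step42.rank%05d", rank);
    fp = fopen(path, "wb");
    if (fp != NULL) {
        fwrite(buf, 1, CKPT_BYTES_PER_RANK, fp);
        fclose(fp);
    }
}

static void checkpoint_single_file(int rank, const char *buf)
{
    /* Method 3: all ranks write disjoint regions of one shared file. */
    MPI_File fh;
    MPI_Offset offset = (MPI_Offset)rank * CKPT_BYTES_PER_RANK;

    MPI_File_open(MPI_COMM_WORLD, "ckpt.step42",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at_all(fh, offset, buf, CKPT_BYTES_PER_RANK,
                          MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}

int main(int argc, char **argv)
{
    char buf[CKPT_BYTES_PER_RANK];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, rank & 0xff, sizeof(buf));   /* placeholder checkpoint data */

    checkpoint_file_per_process(rank, buf);
    checkpoint_single_file(rank, buf);

    MPI_Finalize();
    return 0;
}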

The overall performance of a supercomputer depends not only on its raw computa-

tional speed but also on job scheduling efficiency, reboot and recovery times (including

checkpoint and restart times), and the level of process management. The Effective Sys-


tem Performance (ESP) [70] of a supercomputer is a measure of its total system utiliza-

tion in a real-world operational environment. ESP is calculated by measuring the time it

takes to run a fixed number of parallel jobs through the batch scheduler of a supercom-

puter. The U.S. Department of Energy procures large systems with an ESP of at least

seventy percent [12].

Defensive I/O is very time consuming and decreases the ESP of a supercomputer.

Some systems report that up to seventy-five percent of I/O is defensive, with only twenty-

five percent being productive I/O, consisting of visualization dumps, diagnostic physics

data, traces of key physics variables over time, etc. To ensure defensive I/O does not re-

duce the ESP of a machine below seventy percent, Sandia National Laboratories uses a rule of thumb requiring a throughput of one gigabyte per second for

every teraflop of computing power [12].

2.3.2.2 I/O workload characterizations

Workload characterization studies of supercomputing applications on vector com-

puters in the early 1990s [67, 71] found that I/O is cyclic, predictable, and bursty, with

file access sizes relatively constant within each application. Files are read from begin-

ning to end, benefiting from increased read-ahead. Buffered writes on the client were

found to be of little benefit due to the large amount of data being written and the small

data cache typical of vector machines.

The CHARISMA study [72-74] instrumented I/O libraries to characterize the I/O ac-

cess patterns of distributed memory computers in the mid-1990s. The main difference

from the vector machine study is the large number of small I/O requests. CHARISMA

found that approximately 90% of file accesses are small, but approximately 90% of data

is transferred in large requests. The large number of small write requests, and the short

interval between them, indicates that buffered writes benefit performance in some in-

stances. Small requests often result from partitioning a data set across many processors

but may also be inherent in some applications.

Write requests dominate read requests in the applications studied by CHARISMA.

Write data sharing between clients is infrequent, since it is rarely useful to re-write the

same byte. Of the read-only files, approximately 24% are replicated on each client and


64% experience false sharing. A data set can be divided into a series of fixed size data

regions, or data blocks. With false sharing, nodes share data blocks, but do not access the

same byte ranges within the data blocks. This can occur when a data set is divided

among clients according to block divisions instead of the requested byte-ranges, creating

unnecessary data contention. Read-write files exhibit very little byte sharing between

clients due to the difficulty of maintaining consistency, but also experience false sharing.

Separate jobs never share files.

Clients interleave access to a single file, creating high inter-process spatial locality on

the I/O nodes, benefiting from I/O server data caching. Current strided data access usu-

ally uses standard UNIX I/O operations, with application developers citing a lack of port-

ability of the available, much more efficient strided interfaces provided by parallel

file systems. The rapid pace of technological change means that applications generally

outlast their targeted platform and must be portable to newer machines.

The Scalable I/O [75-77] initiative in the mid-1990s instrumented applications to

characterize I/O access patterns with distributed memory machines. The findings are

similar to CHARISMA, but emphasize the inefficiency of the UNIX I/O interface with

different file sizes, and spatial and temporal data access within a file. For example, one

sample application generates most of its I/O by seeking through a file. The study led to

improved application performance by suggesting file access hints, access pattern infor-

mation passed from the application to the file system, to improve interaction with the I/O

nodes. The Scalable I/O initiative also found that using a single and shared file for stor-

ing information was still common and slow.

In 2004, a group at the University of California, Santa Cruz, analyzed applications on

a large Linux cluster [78]. Like the findings from a decade earlier, this study found that

I/O is bursty, most requests consist of small data transfers, and most data is transferred in

a few large requests. It is common for a master node to collect results from other nodes

and write them to storage using many small requests. Each client reads back the data in

large chunks. In addition, use of a single file is still common and accessing that file—

even with modern parallel file systems—is slower than accessing separate files by a fac-

tor of five.


2.4. Scaling data access

This section discusses common techniques for increasing file system I/O throughput.

2.4.1. Standard parallel application interfaces

A parallel application’s lifespan may be ten years or longer, and may run on several

different supercomputer architectures. A standard communication interface ensures the

portability of applications. Three communication libraries dominate the development and

execution of supercomputer applications.

MPI (Message Passing Interface) [79] is an API for inter-node communication. MPI-

2 [80] includes a new interface for data access named MPI-IO, which defines standard

I/O interfaces and standard data access hints. MPICH2 [81] is an open-source implemen-

tation of MPI that includes an MPI-IO framework named ROMIO [82]. Specialized im-

plementations of MPI-IO also exist [83].

ROMIO improves single client performance by increasing client request size through

a technique called data sieving [84]. With data sieving, noncontiguous data requests are

converted into a single, large contiguous request consisting of the byte range from the

first to the last requested byte. Data sieving then places the requested data portions in the

user’s buffer. Access to large chunks of data usually outperforms access to smaller

chunks, but data sieving also increases the amount of transferred data and number of

read-modify-write sequences.
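The following sketch illustrates the idea behind data sieving on the read path: issue one contiguous read covering the full byte range and then scatter the requested pieces into the caller's buffers. It is a simplification of what ROMIO actually does, and the structure and function names are hypothetical; the pieces are assumed to be sorted by file offset.

/* Simplified illustration of read-side data sieving (not ROMIO's actual code). */
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

struct io_piece {
    off_t  offset;   /* file offset of a requested piece */
    size_t len;      /* length of the piece */
    void  *dest;     /* where the application wants the bytes */
};

/* Satisfy pieces[0..n-1] with one large contiguous read instead of n small ones. */
static int sieve_read(int fd, const struct io_piece *pieces, int n)
{
    off_t  first = pieces[0].offset;
    off_t  last  = pieces[n - 1].offset + (off_t)pieces[n - 1].len;
    size_t span  = (size_t)(last - first);
    char  *stage = malloc(span);
    int    i;

    if (stage == NULL)
        return -1;

    /* One big read covering the whole byte range (it may fetch unwanted bytes). */
    if (pread(fd, stage, span, first) != (ssize_t)span) {
        free(stage);
        return -1;
    }

    /* Scatter only the requested portions into the user's buffers. */
    for (i = 0; i < n; i++)
        memcpy(pieces[i].dest, stage + (pieces[i].offset - first), pieces[i].len);

    free(stage);
    return 0;
}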

MPI-IO can also improve I/O performance from multiple clients using collective I/O,

which can exist at the disk level (disk-directed I/O [85]), at the server level (server-

directed I/O [86]), or at the client level (two-phase I/O [87]). In disk-directed I/O, I/O

nodes use disk layout information to optimize the client I/O requests. In server-directed

I/O, a master client communicates the in-memory and on-disk distributions of the arrays to a master I/O node prior to client data access. The master I/O node then shares the

information with other I/O nodes. This communication allows clients and I/O nodes to

coordinate and improve access to logically sequential regions of files, not just physically

sequential regions. Finally, ROMIO uses two-phase I/O, which organizes noncontiguous

client requests into contiguous requests. For example, two-phase I/O converts inter-


leaved client read requests from a single file into contiguous read requests by having cli-

ents read contiguous data regions, re-distributing the data to the appropriate clients.

Writing a file is similar except clients first distribute data among themselves so that cli-

ents can write contiguous sections of the file. Accessing data in large chunks outweighs

the increased inter-process communication cost for data redistribution.
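As a hedged illustration of collective I/O at the MPI-IO level, the fragment below gives every process an interleaved (strided) view of a shared file and reads it with a single collective call, which ROMIO may service with two-phase I/O underneath. The file name, block size, and block count are assumptions for the example.

/* Illustrative collective read of an interleaved (strided) file layout. */
#include <mpi.h>

#define BLOCK_DOUBLES 1024   /* doubles per block owned by each rank (assumed) */
#define NBLOCKS       16     /* blocks per rank (assumed) */

int main(int argc, char **argv)
{
    int rank, nprocs;
    double buf[BLOCK_DOUBLES * NBLOCKS];
    MPI_Datatype filetype;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank owns every nprocs-th block: a strided view of the shared file. */
    MPI_Type_vector(NBLOCKS, BLOCK_DOUBLES, BLOCK_DOUBLES * nprocs,
                    MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "result.dat", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, (MPI_Offset)rank * BLOCK_DOUBLES * sizeof(double),
                      MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    /* Collective call: the library is free to reorganize this into large,
       contiguous accesses (two-phase I/O) across the participating ranks. */
    MPI_File_read_all(fh, buf, BLOCK_DOUBLES * NBLOCKS, MPI_DOUBLE,
                      MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}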

Other APIs for inter-process communication include OpenMP [88] and High-

performance Fortran (HPF) [89], but they do not include an I/O interface. OpenMP di-

vides computations among processors in a shared memory computer. Many applications

use a mixture of OpenMP and MPI, using MPI between nodes and

OpenMP to improve performance on a single SMP node. HPF first appeared in 1993.

HPF-2, released in 1997, includes support for data distribution, data and task parallelism,

data mapping, external language support, and asynchronous I/O, but has limited use due

to its restriction to programs written in Fortran.

2.4.2. Parallel file systems

Early high-performance computing installations connected monolithic computers to

monolithic storage systems. The emergence of low cost clusters broke this model by cre-

ating many more producers and consumers of data than the memory, CPU, and network

interface of a single file server could manage.

Initial efforts to increase performance included client caches, buffered writes on the

client (Sprite, AFS), and write gathering on the server [90], but they did not address the

single server bottleneck. Many system administrators manually stitch together several

file servers to create a larger and more scalable namespace. This has implications in ad-

ministration costs, backup creation, load balancing, and quota management. In addition,

re-organizing the namespace to meet increased demand is visible to users.

A more transparent way to combine file servers into a single namespace is to forward

requests between file servers [91]. Clients access a single server; requests for files not

stored on that server are forwarded to the file’s home server. Unfortunately, servers are

still potential bottlenecks since a directory or file is still bound to a single server. Fur-

thermore, data may now travel through two servers.


Another method aggregates all storage behind file servers using a storage network

[92]. This allows a client to access any server, with servers acting as intermediaries be-

tween clients and storage. This is an in-band solution since control and data both traverse

the same path. This design still requires clients to send all requests for a file to a single

server and can require an expensive storage network.

Out-of-band (OOB) solutions, which separate control and data message paths [93-96],

currently offer the best I/O performance in a LAN. They enable direct and parallel access

to storage from multiple endpoints. OOB parallel file systems stripe files across available

storage nodes, increasing the aggregate I/O throughput by distributing the I/O across the

bisectional bandwidth of the storage network between clients and storage. This technique

can reduce the likelihood of any one storage node becoming a bottleneck and offers scal-

able access to a single file. A single network connects clients, metadata servers, and stor-

age, with client-to-client communication occurring over this network or on an optional

host network.
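A minimal sketch of the striping arithmetic behind this technique is shown below, assuming simple round-robin placement with a fixed stripe size; real parallel file systems layer layout metadata, replication, and more flexible distribution functions on top of this.

/* Round-robin striping arithmetic (illustrative only). */
#include <stdint.h>

struct stripe_loc {
    uint32_t server;          /* which storage node holds the byte */
    uint64_t object_offset;   /* offset within that node's object or segment */
};

static struct stripe_loc locate(uint64_t file_offset,
                                uint64_t stripe_size, uint32_t num_servers)
{
    struct stripe_loc loc;
    uint64_t stripe_index = file_offset / stripe_size;

    loc.server = (uint32_t)(stripe_index % num_servers);
    /* Each server stores a compact sequence of the stripes it owns. */
    loc.object_offset = (stripe_index / num_servers) * stripe_size
                        + file_offset % stripe_size;
    return loc;
}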

Out-of-band separation of data and control paths has been advocated for decades [93,

94] because it allows an architecture to improve data transfer and control messages sepa-

rately. Note that inter-dependencies between data and control may restrict unbounded,

individual improvement.

Symmetric OOB parallel file systems, depicted in Figure 2.4a, combine clients and

metadata servers into single, fully capable servers, with metadata distributed among

them. Locks can be distributed or centralized. Maintaining consistent metadata informa-

tion among an increasing number of servers limits scalability. These systems generally

require a SAN. Examples include GPFS [97], GFS [98, 99], OCFS2 [100], and

PolyServe Matrix Server [101].

(a) Symmetric (b) Asymmetric

Figure 2.4: Symmetric and asymmetric out-of-band parallel file systems (PFS).


Asymmetric OOB parallel file systems, depicted in Figure 2.4b, divide nodes into cli-

ents and metadata servers. To perform I/O, clients first obtain a file layout map describ-

ing the placement of data in storage from a metadata server. Clients then use the file lay-

out map to access data directly and in parallel. These systems allow data to be accessed

at the block, object, or file level. Block-based systems access disks directly using the

SCSI protocol via FibreChannel or iSCSI. Object- and file-based systems have the po-

tential to improve scalability with a smaller file layout map, shifting the responsibility of

knowing the exact location of every block from clients to storage. Examples of block-

based systems include IBM TotalStorage SAN FS (also known as Storage Tank) [102],

and EMC HighRoad [103]. Examples of object-based systems include Lustre [104] and

Panasas ActiveScale [105]. Examples of file-based systems include Swift [96, 106] and

PVFS [107].

2.4.2.1 Parallel file systems and POSIX

The POSIX API and semantics impede efficient sharing of a single file’s address

space in a cluster of computers. The POSIX I/O programming model is a single system

in which processes employ fast synchronization and communication primitives to resolve

data access conflicts. Due to an increase in synchronization and communication time,

this model breaks down for multiple clients accessing data in a parallel file system.

Several semantics exemplify the mismatch between POSIX semantics and parallel

file systems.

• Single process/node. The POSIX API forces every application node to execute

the same operation when one node could perform operations such as name resolu-

tion on behalf of all nodes in the distributed application.

• Time stamp freshness. POSIX mandates that file metadata is kept current. Each

I/O operation on the storage nodes alters a time stamp, with single second resolu-

tion, which must be propagated to the metadata server. Examples include modifi-

cation time and access time.

• File size freshness. As part of file metadata, POSIX mandates that the size of a

file is kept current. Clients, storage nodes, and the metadata server must all coor-


dinate to determine the current size of a file while it is being updated and ex-

tended.

• Last writer wins. POSIX mandates that when two or more processes write to a

file at the same location, the file contains the last data written. While a parallel file

system can implement this requirement on each individual storage node, it is hard

to implement this requirement for write requests that span multiple storage nodes.

• Data visibility. POSIX mandates that modified data is visible to all processes immediately after a write operation. With each client maintaining a separate data cache, satisfying this requirement is difficult (see the sketch following this list).
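The sketch below illustrates the visibility and size-freshness guarantees on a single machine: once the child's write returns, POSIX requires the parent to observe both the new bytes and the updated size. A parallel file system must provide the same guarantee across clients that each maintain their own cache, which is where the expense arises. The temporary file path is an assumption.

/* Single-machine illustration of POSIX write visibility and size freshness. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tmp/posix_semantics_demo";
    const char  msg[] = "last writer wins";
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return 1;

    if (fork() == 0) {
        /* Child: write at offset 0.  Once write() returns, POSIX requires the
           data and the updated file size to be visible to every process. */
        pwrite(fd, msg, sizeof(msg), 0);
        _exit(0);
    }

    wait(NULL);   /* ensure the child's write has completed */

    struct stat st;
    char buf[sizeof(msg)];
    fstat(fd, &st);                    /* size freshness: reflects the write */
    pread(fd, buf, sizeof(buf), 0);    /* data visibility: sees the new bytes */
    printf("size=%lld contents=%s\n", (long long)st.st_size, buf);

    close(fd);
    return 0;
}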

The high-performance community is proposing extensions to the POSIX I/O API to

address the needs of the growing high-end computing sector [69]. These extensions lev-

erage the intensely cooperative nature of high-performance applications, which are capa-

ble of arranging data access to avoid file address space conflicts.

2.5. NFS architectures

Figure 2.5a displays the NFS architecture, in which trusted clients access a single disk

on a single server. Depending on the required performance, reliability, etc. of an NFS

installation, each component of this model can be realized in different ways using a wide-

range of technologies.

• NFS clients. NFS client implementations exist for nearly every operating system.

Clients can be diskless and can contain multiple network interface cards (NICs) for in-

creased bandwidth.

• Host network. IP-based networks use TCP, UDP, or SCTP [108]. Remote Di-

rect Memory Access (RDMA) support is currently under investigation [109].

(a) Simple NFS remote data access (b) NFS namespace partitioning

Figure 2.5: NFS remote data access.


• NFS server. The NFS server provides file, metadata, and lock services to the

NFSv4 client. The VFS/Vnode interface translates client requests to requests to

the underlying file system.

• Storage and storage network. NFS supports almost any modern storage system.

SCSI and SATA command sets are common with directly connected disks.

Hardware RAID systems are also common, and software RAID systems are

emerging as a cheaper (yet slower) alternative. iSCSI is emerging as a ubiquitous

storage protocol for IP-based networks. FCIP enables communication between

Fibre Channel SANs by tunneling FCP, the protocol for Fibre Channel networks,

over IP. iFCP, on the other hand, allows Fibre Channel devices to connect di-

rectly to IP-based networks by replacing the Fibre Channel transport with TCP/IP.

NFS works well with small groups, but is limited in every aspect of its design: con-

sumption of compute cycles, memory, bandwidth, storage capacity, etc. To scale up the

number of clients, Figure 2.5b shows how many enterprises, such as universities and

large organizations, partition a file system among several NFS servers. For example, par-

titioning student home directories among many servers spreads the load among the serv-

ers.

The Automounter automatically mounts file systems as clients access different parts

of the namespace [110]. High-demand read-only data may be replicated similarly, with

the Automounter automatically mounting the closest replica server. One problem with

the Automounter is a noticeable delay as users navigate into mounted-on-demand directo-

ries. Other disadvantages of this approach include administration costs, backup man-

agement, load balancing, visible namespace reorganization, and quota management. De-

spite these problems, many organizations use this technique to provide access to very

large data stores.

Figure 2.6: NFS with databases.


Database deployments are emerging as another environment for NFS. Figure 2.6 il-

lustrates database clients using a local disk with the database server storing object and log

files in NFS. Database systems manage caches on their own and depend on synchronous

writes; therefore, these NFS installations disable client caching and asynchronous writes.

To improve the performance of synchronous writes, some NFS hardware vendors (such

as Network Appliance) write to NVRAM synchronously, and then asynchronously write

this data to disk. It is common for database applications to have many servers accessing

a single file at the same time. CIFS is unsuitable for this type of database deployment

due to its lack of appropriate lock semantics, a specific write block size, and appropriate

commit semantics.

The original goals of distributed file systems were to provide distributed access to lo-

cal file systems, but NFS is now widely used to provide distributed access to other net-

work-based file systems. Although parallel file systems already have remote data access

capability, many lack heterogeneous clients, a strong security model, and satisfactory

WAN performance.

Figure 2.7 illustrates standard NFSv4 clients accessing symmetric and asymmetric

out-of-band parallel file systems. The NFSv4 server accesses a single parallel file system

client and translates all NFS requests to parallel file system specific operations. Symmet-

ric OOB file systems are often limited to a small number of nodes due to the high cost of

a SAN. NFS can increase the number of clients accessing a symmetric OOB file system

by attaching additional NFSv3 clients to each node in Figure 2.7a.

2.6. Additional NFS architectures

This section examines NFS architecture variants that attempt to scale one or more as-

pects of NFS. These architectures transform NFS into a type of parallel file system, in-

creasing scalability but eliminating the file system independence of NFS.


2.6.1. NFS-based asymmetric parallel file system

Many examples exist of systems that use the NFS protocol to create an asymmetric

parallel file system. In these systems, a directory service (metadata manager) typically

manages the namespace while files are striped across NFS data servers. NFS clients use

the directory service to retrieve file metadata and file location information. Data servers

store file segments (stripes) in a local file system such as Ext3 [111] or ReiserFS [112].

Several directory service strategies have been suggested, each offering different ad-

vantages.

Explicit metadata node. This strategy uses an NFS server as the metadata node that

manages the file system. Clients access the metadata node to retrieve a list of data serv-

ers and layout information describing how files are striped across them. Clients maintain

data consistency by applying advisory locks to the metadata node. Unmodified NFS

servers are used for storage. Support for mandatory locks or access control lists requires a

communication channel to coordinate state information among the metadata nodes and

data servers.

Store metadata information in files. The Expand file system [113] stores file metadata

and location information in regular files in the file system. Clients determine the NFS

data server and pathname of the metadata file by hashing the file pathname. To perform

I/O, a client opens the metadata file for a particular file to retrieve data access informa-

tion. Expand uses unmodified NFS servers but extends NFS clients to locate, parse, in-

terpret, and use file layout information. A major problem with file hashing based on

pathname is that renaming files requires migrating metadata between data servers. In addition, Expand uses data server names with the metadata file to describe file striping information, which complicates incremental expansion of data servers.

(a) Symmetric (b) Asymmetric

Figure 2.7: NFS exporting symmetric and asymmetric parallel file systems (PFS).

Directory service elimination. The Bigfoot-NFS file system [114] eliminates the direc-

tory service outright, instead requiring clients to gather file information by analyzing the

file system. Any given file is stored on a single unmodified NFS server. Clients discover

which data server stores a file by requesting file information from all servers. Clients ig-

nore failed responses and use the data server that returned a successful response to access

the file. Bigfoot-NFS reduces file discovery time by parallelizing NFS client requests.

The lack of a metadata service simplifies failure recovery, but the inability to stripe files

across multiple data servers and the increased network traffic limit the I/O bandwidth to a

given file.

2.6.2. NFS client request forwarding

The nfsp file system [91] architecture contains clients, data servers, and a metadata

node. Unmodified NFS clients mount and issue file metadata and I/O requests to the

metadata node. Each file is stored on a single NFS data server. Metadata nodes forward

client I/O requests to the data server containing the file, which replies directly to the cli-

ent by spoofing its IP address. The inability to stripe files across multiple data servers

and the forwarding of I/O requests through a single metadata node limit the I/O bandwidth to a given

file.

2.6.3. NFS-based peer-to-peer file system

The Kosha file system [115] uses the NFS protocol to create a peer-to-peer file sys-

tem. Kosha taps into available storage space on client nodes by placing an NFS client and

server on each node. To perform I/O, unmodified NFS clients mount and access their

local NFS server (through the loopback network device). An NFS server routes local cli-

ent requests to the remote NFS server containing the requested file. An NFS server de-

termines the correct NFS data server by hashing the file pathname. Each file is stored on

a single NFS data server, which limits the I/O bandwidth to a given file.
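Expand and Kosha both place a file by hashing its pathname, so every client independently computes the same data server without consulting a directory service. A minimal sketch of the idea follows; the hash function and server count are assumptions, not details taken from either system.

/* Illustrative pathname-hash placement in the spirit of Expand and Kosha. */
#include <stdint.h>

#define NUM_DATA_SERVERS 8   /* assumed cluster size */

static uint64_t hash_path(const char *path)
{
    uint64_t h = 5381;               /* djb2 string hash; any stable hash works */
    for (; *path != '\0'; path++)
        h = h * 33 + (uint8_t)*path;
    return h;
}

/* Every client computes the same server index for the same pathname, so no
   directory lookup is needed; note that renaming the file changes the hash
   and therefore forces metadata (or data) to migrate between servers. */
static unsigned server_for(const char *path)
{
    return (unsigned)(hash_path(path) % NUM_DATA_SERVERS);
}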


2.6.4. NFS request routing

The Slice file system prototype [116] divides NFS requests into three classes: large

I/O, small-file I/O, and namespace. A µProxy, interposed between clients and servers,

routes NFS client requests between storage servers, small-file servers, and directory serv-

ers. Large I/O flows directly to storage servers while small-file servers aggregate I/O op-

erations of small files and the initial segments of large files. (Chapter VI investigates the

small I/O problem in more depth, demonstrating that small I/O requests are not limited to

small files.)

Slice introduces two policies for transparent scaling of the name space among the di-

rectory servers. The first method uses a directory as the unit of distribution. This works

well when the number of active directories is large relative to the number of directory

servers, but it binds large directories to a single server. The second method uses a file

pathname as the unit of distribution. This balances request distributions independent of

workload by distributing them probabilistically, but increases the cost and complexity of

coordination among directory servers.

2.7. NFSv4 protocol

NFSv4 extends versions 2 and 3 with the following features:

• Fully integrated security. NFSv4 offers authentication, integrity, and privacy by

mandating support of RPCSEC_GSS [117], an API that allows support for a vari-

ety of security mechanisms to be used by the RPC layer. NFSv4 requires support

of the LIPKEY [118] and SPKM-3 [118] public key mechanisms and the Kerberos

V5 symmetric key mechanism [119]. NFSv4 also supports security flavors other

than RPCSEC_GSS, such as AUTH_NONE, AUTH_SYS, and AUTH_DH.

AUTH_NONE provides no authentication. AUTH_SYS provides a UNIX-style

authentication. AUTH_DH provides DES-encrypted authentication based on a

network-wide string name, with session keys exchanged via the Diffie-Hellman

public key scheme. The requirement of support for a base set of security proto-

cols is a departure from earlier NFS versions, which left data privacy and integrity

support as implementation details.


• Compound requests. Operation bundling, a feature supported in CIFS, allows

clients to combine multiple operations into a single RPC request. This feature re-

duces the number of round trips between the client and the server needed to accomplish a task, e.g., opening a file, and simplifies the specification of the protocol.

• Incremental protocol extensions. NFSv4 allows extensions that do not com-

promise backward compatibility through a series of minor versions.

• Stateful server. The introduction of OPEN and CLOSE commands creates a

stateful server. This allows enhancements such as mandatory locking and server

callbacks and opens the door to consistent client caching. See Section 2.7.1.

• Root file handles. NFSv4 does not use a separate mount protocol to provide the

initial mapping between a path name and file handle. Instead, a client uses a root

file handle and navigates through the file system from there.

• New attribute types. NFSv4 supports three new types of attributes: mandatory,

recommended, and named. NFSv4 also supports access control lists. This attrib-

ute model is extensible in that new attributes can be introduced in minor revisions

of the protocol.

• Internationalization. NFSv4 encodes file and directory names with UTF-8 to

accommodate international character sets.

• File system migration and replication. The fs_location attribute provides for

file system migration and replication.

• Cross-platform interoperability. NFSv4 enhances interoperability with the in-

troduction of recommended and named attributes, and by mandating support for

TCP and Windows share reservations.

2.7.1. Stateful server: the new NFSv4 scalability hurdle

The broadest architectural change for NFSv4 is the introduction of a stateful server to

support exclusive opens called share reservations, mandatory locking, and file delega-

tions. This change significantly increases the complexity of the protocol, its implementa-


tions, and most notably its fault tolerance semantics. In addition, a single shared data store can no longer be exported by multiple NFSv4 servers without a mecha-

nism for maintaining global state consistency among the servers.

A share reservation controls access to a file, based on the CIFS oplocks model [2]. A

client issuing an OPEN operation to a server specifies both the type of access required

(read, write, or both) and the types of access to deny others (deny none, deny read, deny

write, or deny both). The NFSv4 server maintains access/deny state to ensure that future

OPEN requests do not conflict with current share reservations. NFSv4 also supports

mandatory and advisory byte-range locks.

An NFSv4 server maintains information about clients and their currently open files,

and can therefore safely pass control of a file to the first client that opens it. A delegation

grants a client exclusive responsibility for consistent access to the file, allowing client

processes to acquire file locks without server communication. Delegations come in two

flavors. A read delegation guarantees the client that no other client has the ability to

write to the file. A write delegation guarantees the client that no other client has read or

write access to the file. If another client opens the file, it breaks these conditions, so the

server must revoke the delegation by way of a callback. The server places a lease on all

state, e.g., client connection information, file and byte-range locks, delegations. If the

server does not hear from a given client within the lease period, the server is permitted to

discard all of the client’s associated state. A failed server that recovers enters a grace pe-

riod, lasting up to a few minutes, which allows clients to detect the server failure and re-

acquire their previously acquired locks and delegations.
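A minimal sketch of the server-side lease check described above follows; the structure, field names, and 90-second lease period are assumptions rather than details of any particular NFSv4 implementation.

/* Hypothetical sketch of per-client lease bookkeeping on an NFSv4 server. */
#include <stdbool.h>
#include <time.h>

struct nfs_client_state {
    time_t last_renewal;     /* updated whenever the client is heard from */
    /* ... opens, byte-range locks, delegations ... */
};

#define LEASE_SECONDS 90     /* assumed lease period */

static bool lease_expired(const struct nfs_client_state *c, time_t now)
{
    /* Once the lease lapses, the server may discard the client's opens,
       locks, and delegations; a recovering server instead offers a grace
       period during which clients reclaim this state. */
    return (now - c->last_renewal) > LEASE_SECONDS;
}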

The NFSv4 caching model combines the efficient support for close-to-open semantics

of AFS with the block-based caching and easier recovery semantics provided by

DCE/DFS, Sprite, and Not Quite NFS. As with NFSv3, clients cache attribute and direc-

tory information for a duration determined by the client. However, a client holding a

delegation is assured that the cached data for that file is consistent.

Proposed extensions to NFSv4 include directory delegations, which grant clients ex-

clusive responsibility for consistent access to directory contents, and sessions, which pro-

vide exactly-once semantics, multipathing and trunking of transport connections, RDMA

support, and enhanced security.


2.8. pNFS protocol

To meet enterprise and grand challenge-scale performance and interoperability re-

quirements, the University of Michigan hosted a workshop on December 4, 2003, titled

“NFS Extensions for Parallel Storage (NEPS)” [120]. This workshop raised awareness of

the need for increasing the scalability of NFSv4 [121] and created a set of requirements

and design considerations [122]. The result is the pNFS protocol [123], which promises

file access scalability as well as operating system and storage system independence.

pNFS separates the control and data flows of NFSv4, allowing data to transfer in parallel

from many clients to many storage endpoints. This removes the single server bottleneck

by distributing I/O across the bisectional bandwidth of the storage network between the

clients and storage devices.

The goals of pNFS are to:

• Enable implementations to match or exceed the performance of the underlying

file system.

• Provide high per-file, per-directory and per-file system bandwidth and capacity.

• Support any storage protocol, including (but not limited to) block-, object-, and

file-based storage protocols.

• Obey NFSv4 minor versioning rules, which require that all future versions have

legacy support.

• Support existing storage protocols and infrastructures, e.g., SBC on Fibre Channel

[16] and iSCSI, OSD on Fibre Channel and iSCSI, and NFSv4.

For a file system to realize scalable data access, it must be able to achieve perform-

ance gains relative to the amount of additional hardware. For example, if physical disk

access is the I/O bottleneck, a truly scalable file system can realize benefits from increas-

ing the number of disks. The cycle of identifying bottlenecks, removing them, and in-

creasing performance is endless. pNFS provides a framework for continuous saturation

of system resources by separating the data and control flows and by not specifying a data

flow protocol. Focusing on the control flow and leaving the details of the data flow to

implementers allows continuous I/O throughput improvements without protocol modifi-


cation. Implementers are free to use the best storage protocol and data access strategy for

their system.

pNFS extensions to NFSv4 focus on device discovery and file layout management.

Device discovery informs clients of available storage devices. A file layout consists of

all information required by a client to access a byte range of a file. For example, a layout

for the block-based Fibre Channel Protocol may contain information about block size,

offset of the first block on each storage device, and an array of tuples that contains device

identifiers, block numbers, and block counts. To ensure the consistency of the file layout

and the data it describes, pNFS includes operations that synchronize the file layout

among the pNFS server and its clients. To ensure heterogeneous storage protocol support

and unlimited data layout strategies, the file layout is opaque in the protocol.
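Purely as an illustration of what such a layout might carry, the declarations below sketch an in-memory form of a block-based layout; the field names are hypothetical and are not taken from the pNFS specification or any layout driver.

/* Hypothetical in-memory form of a block-based file layout (illustrative only;
   field names are not taken from the pNFS or block-layout specifications). */
#include <stdint.h>

struct block_extent {
    uint64_t device_id;      /* which storage device holds this extent */
    uint64_t storage_block;  /* first block number on that device */
    uint64_t block_count;    /* number of consecutive blocks */
};

struct block_file_layout {
    uint64_t file_offset;          /* byte range of the file this layout covers */
    uint64_t length;
    uint32_t block_size;           /* bytes per storage block */
    uint64_t first_block_offset;   /* offset of the first block on each device */
    uint32_t extent_count;
    struct block_extent *extents;  /* device id, block number, block count tuples */
};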

pNFS does not address client caching or coherency of data stored in separate client

caches. Rather, it assumes that existing NFSv4 cache-coherency mechanisms suffice.

Separating the control and data paths in pNFS introduces new security concerns. Al-

though RPCSEC_GSS continues to secure the NFSv4 control path, securing the data path

may require additional effort. pNFS does not define a new security architecture but dis-

cusses general security considerations. For example, certain storage protocols cannot

provide protection against eavesdropping. Environments that require confidentiality must

either isolate the communication channel or use standard NFSv4.

In addition, pNFS does not define mechanisms to recover from errors along the data

path, but leaves their definition to the supporting data access protocols instead.


CHAPTER III

A Model of Remote Data Access

Modern data infrastructures are complex, consisting of numerous hardware and soft-

ware components. A simple question such as “What is the size of a file?” may

spark a flurry of network traffic and disk accesses, as each component gathers its portion

of the answer. A necessary condition for improving data access is a clear picture of a

system’s control and data flows. This picture includes the components that generate and

receive requests, the number of components that generate and receive requests, the timing

and sequence of requests, and the traversal path of the requests and responses. Successful

application of this knowledge helps to identify bottlenecks and inefficiencies that fetter

the scalability of the system.

To clarify the novel contributions of the NFSv4 and parallel file system architecture

variants discussed in the following chapters, this chapter presents a coarse architecture

that identifies the components and communication channels for remote data access. I use

this architecture to illustrate the performance bottlenecks of using NFSv4 with parallel

file systems. Finally, I detail the principal requirements for accessing remote data stores.

3.1. Architecture for remote data access

This section describes an architecture that encapsulates the components and commu-

nication channels of data access. Shown in Figure 3.1, remote data access consists of five

major components: application, data, control, metadata, and storage. In addition to the

flow of data, remote data access also consists of five major control paths that manage

data integrity and facilitate data access. These components and communication channels

form a coarse architecture for describing remote data access, which I use to describe and

analyze the remote data access architectures in subsequent chapters. I use circles,


squares, pentagons, hexagons, and disks to represent the application, data, control, meta-

data, and storage components of the data access architecture, respectively.

This dissertation applies the architecture at the granularity of a file system, providing

a clear picture of file system interactions. A file system contains each of the five compo-

nents, although some are “virtual”, and comprise other components. In addition, a single

machine may assume the roles of multiple components, which I portray by adjoining

components.

The following provides a detailed description of each component:

1. Application. Generate file system, file, and data requests. Typically, these

are nodes running applications.

2. Data. Fulfill application component I/O requests through communication

with storage. These support a specific storage protocol.

3. Control. Fulfill application component metadata requests through commu-

nication with metadata components. These support a specific metadata protocol.

Figure 3.1: General architecture for remote data access. Application components generate and analyze data. Data and control components fulfill application requests by accessing storage and metadata components. Metadata components describe and control access to storage. Storage is the persistent repository for data. Directional arrows originate at the node that initiated the communication.


4. Metadata. Describe and control access to storage, e.g., file and directory

location information, access control, and data consistency mechanisms. Examples

include an NFS server and parallel file system metadata nodes.

5. Storage. Persistent repository for data, e.g., Fibre Channel disk array, or

nodes with a directly attached disk.

With data components in the middle, application and storage components bookend

the flow of data. Control components support a metadata protocol for communication

with file system metadata component(s). These components connect to one or more net-

works, each supporting different types of traffic. A storage network, e.g., Fibre Channel,

Infiniband, Ethernet, or SCSI, connects data and storage components. A host network is

IP-based and uses metadata components to facilitate data sharing.

Independent control flows request, control, and manage different types of informa-

tion. The different types of control flows are as follows:

1. Control ↔ Metadata. To satisfy application component metadata requests, con-

trol components retrieve file and directory information, lock file system objects,

and authenticate. To ensure data consistency, metadata components update or re-

voke file system resources on clients.

2. Control ↔ Control. This flow coordinates data access, e.g., collective I/O.

3. Metadata ↔ Metadata. Systems with multiple metadata nodes use this flow to

maintain metadata consistency and to balance load.

4. Metadata ↔ Storage. This flow manages storage, synchronizing file and direc-

tory metadata information as well as access control information. It can also facili-

tate recovery.

5. Storage ↔ Storage. This flow facilitates data redistribution and migration.


3.2. Parallel file system data access architecture

Figure 3.2a depicts a symmetric parallel file system with data, control, metadata, and

application components all residing on the same machine. Storage consists of storage

devices accessed via a block-based protocol, e.g., GPFS or GFS. Figure 3.2b shows an

asymmetric parallel file system with application, control, and data components residing

on the same machine, metadata components on separate machines. Storage consists of

storage devices accessed via a block-, file-, or object-based protocol, e.g., Lustre or

PVFS2. Storage for block-based systems consists of a disk array while storage for ob-

ject- and file-based systems consists of a fully functional node formatted with a local file

system such as Ext3 or XFS [124].

3.3. NFSv4 data access architecture

Viewing the NFSv4 and parallel file system architectures as a single, integrated archi-

tecture allows the identification of performance bottlenecks and opens the door for devis-

ing mitigation strategies. Figure 3.3 displays my base instantiation of the general archi-

tecture, NFSv4 exporting a parallel file system. NFSv4 application metadata requests are

fulfilled by an NFSv4 control component that communicates with an NFSv4 metadata

component on the NFSv4 server, which in turn uses a PFS control component to commu-

nicate with the PFS metadata component.

(a) Symmetric (b) Asymmetric

Figure 3.2: Parallel file system data access architectures. (a) A symmetric parallel file system has data, control, metadata, and application components all on the same machine; storage consists of storage devices accessed via a block-based protocol. (b) An asymmetric parallel file system has data, control, and application components on the same machine, metadata components on separate machines; storage consists of storage devices accessed via a block-, file-, or object-based protocol.

NFSv4 application data requests are fulfilled by an NFSv4 data component that proxies requests through a PFS data component, which

in turn communicates directly with storage. The NFSv4 virtual storage component is the

entire PFS architecture. The PFS virtual application component is the entire NFSv4 ar-

chitecture. Figure 3.3 does not display the virtual components.

Figure 3.3 readily illustrates the NFSv4 “single server” bottleneck discussed in Sec-

tion 2.4.2. Data requests from every NFSv4 client must fit through a single NFSv4

server using a single PFS data component. Subsequent chapters vary this architecture to

attain different levels of scalability, performance, security, heterogeneity, transparency,

and independence.

3.4. Remote data access requirements

The utility of each remote data access architecture variant presented in this disserta-

tion derives from several data access requirements. These requirements fall into the fol-

lowing categories:

• I/O workload. A data access architecture must deliver satisfactory performance.

An application’s I/O workload determines the type of performance required, e.g.,

single and multiple client I/O throughput, small I/O request, file creation and

metadata management, etc.

Figure 3.3: NFSv4-PFS data access architecture. With NFSv4 exporting a parallel file system, NFSv4 application metadata requests are fulfilled by an NFSv4 control component that communicates with the NFSv4 metadata component, which in turn uses a PFS control component to communicate with the PFS metadata component. NFSv4 application data requests are fulfilled by an NFSv4 data component that proxies requests through a PFS data component, which in turn communicates directly with storage. The NFSv4 storage component is the entire PFS architecture. The PFS application component is the entire NFSv4 architecture. With a symmetric parallel file system, the PFS metadata and data components are coupled.


• Security and access control. Many high-end computers run applications that

deal with both private and public information. Systems must be able to handle

both types of applications, ensuring that sensitive data is separate and secure. Se-

curity can be realized through air-gap, encryption, node fencing, and numerous

other methods. In addition, cross-realm access control to encourage research col-

laboration must be transparent and foolproof.

• Wide area networks. Beyond heightened security and access control require-

ments, successful global collaborations require high performance, heterogeneous,

and transparent access to data, independent of the underlying storage system.

• Local area networks. Performance and scalability are key requirements of ap-

plications designed to run in LAN environments. Heterogeneous data access and

storage system independence are also becoming increasingly important. For ex-

ample, in many multimedia studios, designers using PCs and large UNIX render-

ing clusters access multiple on- and off-site storage systems [11].

• Development and management. With today’s increasing reliance on middle-

ware applications, reducing development and administrator training costs and

problem determination time is vital.

3.5. Other general data access architectures

3.5.1. Swift architecture

The Swift parallel file system was an early pioneer in achieving scalable I/O through-

put using distributed disk striping [96]. Figure 3.4 displays the Swift architecture com-

ponents. Swift did not define a specific architecture, but instead listed four optional

components. In general, client components perform I/O by using a distribution agent

component to retrieve a transfer plan from a storage mediator component and transfer

data to/from storage agent components. The original Swift prototype used a standard

transfer plan, obviating the need for storage mediators.


Swift architecture components map almost one-to-one with the general architecture

components introduced in this chapter. Swift storage mediators function as metadata

components, Swift storage agents function as storage components, and Swift clients func-

tion as application components. The architecture presented here splits the role of a Swift

distribution agent into control and data components. Separating I/O and metadata re-

quests into separate components lets us represent out-of-band systems that use different

protocols for each channel. In addition, applying the architecture components iteratively

and including all communication channels provides a holistic view of remote data access.

3.5.2. Mass storage system reference model

In the late 1980s, the IEEE Computer Society Mass Storage Systems and Technology

Technical Committee attempted to organize the evolving storage industry by creating a

Mass Storage System Reference Model [93, 94], now referred to as the IEEE Reference

Model for Open Storage Systems Interconnection (OSSI model) [125]. Shown in Figure

3.5, its goal is to provide a framework for the coordination of standards development for

storage systems interconnection and a common perspective for existing standards. One

system—perhaps the only one—based directly on the OSSI model is the High-

performance Storage System (HPSS) [126].

The OSSI model decomposes a complete storage system into the following storage

modules, which are defined by several IEEE P1244 standards documents:

Figure 3.4: Swift architecture [96]. The Swift architecture consists of four components: clients, distribution agents, a storage mediator, and storage agents. Clients perform I/O by using a distribution agent to retrieve a transfer plan from a storage mediator and transfer data to/from storage agents.


• Application Environment Profile. The environmental software interfaces re-

quired by open storage system services.

• Object Identifiers. The format and algorithms used to generate globally unique

and immutable identifiers for every element within an open storage system.

• Physical Volume Library. The software interfaces for services that manage re-

movable media cartridges and their optimization.

• Physical Volume Repository. The human and software interfaces for services

that stow cartridges and mount these cartridges onto devices, employing either ro-

botic or human transfer agents.

• Data Mover. The software interfaces for services that transfer data between two

endpoints.

• Storage System Management. A framework for consistent and portable services

to monitor and control storage system resources as motivated by site-specified

storage management policies.

• Virtual Storage Service. The software interfaces to access and organize persis-

tent storage.

The goals of the architecture presented in this chapter complement those of the OSSI

model. Figure 3.6 demonstrates how my architecture encompasses the OSSI modules

and protocols.

Figure 3.5: Reference Model for Open Storage Systems Interconnection (OSSI) [125]. The OSSI model diagram displays the software design relationships between the primary modules in a mass storage system to facilitate standards development for storage systems interconnection.

The IEEE developed the OSSI model to expose areas where standards are necessary (or need improvement), so they could be implemented and turned into commercial products. The OSSI model does not capture the physical nodes or data and con-

trol flows in a data architecture, but rather the design relationships between components.

For example, Figure 3.5 displays data movers and clients as separate objects connected

with a request flow, a representation more in line with modern software design tech-

niques than physical implementation. The architecture presented in this chapter focuses

on identifying potential bottlenecks by grouping a node’s components and identifying the

data and control flows that bind the nodes.

Figure 3.6: General data access architecture view of OSSI model. The OSSI modules in the general data access architecture. The Virtual Storage Service uses Data Movers to route data between application and storage components. Control components use the Physical Volume Library, Physical Volume Repository, and Virtual Storage Service to mount and obtain file metadata information. Metadata components use the Storage System Management protocol to manage storage.


CHAPTER IV

Remote Access to Unmodified Parallel File Systems

Collaborations such as TeraGrid [127] allow global access to massive data sets in a

nearly seamless environment distributed across several sites. Data access transparency

allows users to seamlessly access data from multiple sites using a common set of tools

and semantics. The degree of transparency between sites can determine the success of

these collaborations. Factors affecting data access transparency include latency, band-

width, security, and software interoperability.

To improve performance and transparency at each site, the use of parallel file systems

is on the rise, allowing applications high-performance access to a large data store using a

single set of semantics. Parallel file systems can adapt to spiraling storage needs and re-

duce management costs by aggregating all available storage into a single framework.

Unfortunately, parallel file systems are highly specialized, lack seamless integration and modern security features, are often limited to a single operating system and hardware platform, and suffer from slow offsite performance. In addition, many parallel file systems

are proprietary, which makes it almost impossible to add extensions for a user’s specific

needs or environment.

NFSv4 allows researchers access to remote files and databases using the same pro-

grams and procedures that they use to access local files, as well as obviating the need to

create and update local copies of a data set manually. To meet quality of service re-

quirements across metropolitan and wide-area networks, NFSv4 may need to use all

available bandwidth provided by the parallel file system. In addition, NFSv4 must be

able to provide parallel access to a single file from large numbers of clients, a common

requirement of scientific applications.

This chapter discusses the challenge of achieving full utilization of an unmodified

storage system’s available bandwidth while retaining the security, consistency, and het-


erogeneity features of NFSv4—features missing in many storage systems. I introduce

extensions that allow NFSv4 to scale beyond a single server by distributing data access

across the data components of the remote data store. These extensions include a new

server-to-server protocol and a file description and location mechanism. I refer to NFSv4

with these extensions as Split-Server NFSv4.

The remainder of this chapter is organized as follows. Section 4.1 discusses scaling

limitations of the NFSv4 protocol. Section 4.2 describes the NFSv4 protocol extensions

in Split-Server NFSv4. Sections 4.3 and 4.4 discuss fault tolerance and security implica-

tions of these extensions. Section 4.5 provides performance results of my Linux-based

prototype and discusses performance issues of NFS with parallel file systems. Section

4.5.5 reviews alternative Split-Server NFSv4 designs, Section 4.6 surveys related work, and Section 4.7 concludes this chapter.

4.1. NFSv4 state maintenance

NFSv4 server state is used to support exclusive opens (called share reservations),

mandatory locking, and file delegations. The need to manage consistency of state infor-

mation on multiple nodes fetters the ability to export an object via multiple NFSv4 serv-

ers. This “single server” constraint becomes a bottleneck if load increases while other

nodes in the parallel file system are underutilized. Partitioning the file system space

among multiple NFS servers helps, but it increases administrative complexity and management cost, and it fails to address scalable access to a single file or directory, a critical requirement of many high-performance applications [7].

4.2. Architecture

Figure 4.1 shows how Split-Server NFSv4 modifies the NFSv4-PFS architecture of

Figure 3.3 by exporting the file system from all available parallel file system clients.

NFSv4 clients use their data component to send data requests to every available PFS data

component, distributing data requests across the bisectional bandwidth of the client net-

work. Any increase or decrease in available throughput of the parallel file system, e.g.,

additional nodes or increased network bandwidth, is reflected in Split-Server NFSv4 I/O

throughput. NFSv4 access control components exist with each PFS data component to


ensure that data servers allow only authorized data requests. The PFS data component

uses a control component to retrieve PFS file layout information for storage access.

The Split-Server NFSv4 extensions have the following goals:

• Read and write performance that scales linearly as parallel file system nodes are

added or removed.

• Support for unmodified parallel file systems.

• Single file system image with no partitioning.

• Negligible impact to NFSv4 security model and fault tolerance semantics.

• No dependency on special features of the underlying parallel file system.

4.2.1. NFSv4 extensions

To export a file from multiple NFSv4 servers backed by shared storage, the servers

need a common view of their shared state. NFSv4 servers must therefore share state in-

formation and do so consistently, i.e., with single-copy semantics. Without an identical

view of the shared state, conflicting file and byte-range locks can cause data corruption or

allow malicious clients to read and write data without proper authorization.

Figure 4.1: Split-Server NFSv4 data access architecture. NFSv4 application components use a control component to obtain metadata information from an NFSv4 metadata component and an NFSv4 data component to fulfill I/O requests. The NFSv4 metadata component uses a PFS control component to retrieve PFS metadata and shares its access control information with the data servers to ensure the data servers allow only authorized data requests. The PFS data component uses a PFS control component to obtain PFS file layout information for storage access.


To provide a consistent view, I use a state server to copy the portions of state needed

to serve READ, WRITE, and COMMIT requests at I/O nodes (designated data servers).

Figure 4.2a shows the Split-Server NFSv4 architecture. Transforming NFSv4 into the

out-of-band protocol shown in Figure 4.2b unleashes the I/O scalability of the underlying

parallel file system.

Many clients performing simultaneous metadata operations can overburden a state

server, for example, when coordinated clients simultaneously open separate result files.

To reduce the load on the state server, a system administrator can partition file system

metadata among several state servers, ensuring that all state for a single file resides on a

single state server. In addition, control processing can be distributed by allowing data

servers to handle operations that do not affect NFSv4 server state, e.g., SETATTR and

GETATTR.

4.2.2. Configuration and setup

The mechanics of a client connection to a server are the same as in NFSv4: the client mounts the state server that manages the file space of interest. Data servers register

with the state server at start-up or any time thereafter and are immediately available to

Split-Server NFSv4 clients, allowing easy incremental growth.

(a) Design (b) Process flow

Figure 4.2: Split-Server NFSv4 design and process flow. Storage consists of a parallel file system such as GPFS. NFSv4 servers are divided into data servers, which handle READ, WRITE, and COMMIT requests, and a state server, which handles file system and stateful requests. The state server coordinates with the data servers to ensure only authorized client I/O requests are fulfilled.


4.2.3. Distribution of state information

On receiving an OPEN request, a state server picks a data server to service the data

request. The selection algorithm is implementation defined. In my prototype, I use

round-robin. The state server then places share reservation state for the request on the

selected data server. The following items constitute a unique identifier for share reserva-

tion state:

• Client name, IP address, and verifier

• Access/Deny authority

• File handle

• File open owner
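For concreteness, the replicated state can be pictured as a record keyed by these four items. The following is a minimal C sketch with assumed field names and sizes; it is illustrative only and does not reproduce the prototype's actual data structures.

#include <stdint.h>
#include <netinet/in.h>

/* Illustrative key for share reservation state replicated to a data
 * server; field names and sizes are assumptions, not the prototype's. */
struct share_state_key {
        char           client_name[64];    /* NFSv4 client name */
        struct in_addr client_addr;        /* client IP address */
        uint64_t       client_verifier;    /* client instance verifier */
        uint32_t       share_access;       /* access bits of the share reservation */
        uint32_t       share_deny;         /* deny bits of the share reservation */
        unsigned char  fh[128];            /* NFSv4 file handle */
        unsigned int   fh_len;
        char           open_owner[64];     /* file open owner */
        unsigned int   open_owner_len;
};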

When a client issues a CLOSE request, the state server first reclaims the state from

the data server. Once reclamation is complete, the standard NFSv4 close procedure pro-

ceeds.

Support for locks does not require distributing additional state beyond share reserva-

tions. NFSv4 uses POSIX locks and relies on the locking subsystem of the underlying

parallel file system. Delegations also require no additional state on the data servers as the

state server manages conflicting access requests for a delegated file.

4.2.4. Redirection of clients

Split-Server NFSv4 extends the NFSv4 protocol with a new attribute, FILE_LOCATION, that enables access to a single file via multiple nodes.

The FILE_LOCATION attribute specifies:

• Data server location information

• Root pathname

• Read-only flag

Clients use FILE_LOCATION information to direct READ, WRITE, and COMMIT re-

quests to the named server. The root pathname allows each data server to have its own


namespace. The read-only flag declares whether the data server will accept WRITE

commands.
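For illustration, a client might hold the attribute in a form like the following sketch; the field names and fixed sizes are assumptions, and the on-the-wire XDR encoding is not shown.

#include <stdbool.h>

/* Illustrative in-memory form of the FILE_LOCATION attribute. */
struct file_location {
        char server[256];      /* data server location, e.g., host name or address */
        char root_path[1024];  /* root pathname of the data server's namespace */
        bool read_only;        /* if true, the data server rejects WRITE commands */
};

/* A single file may be exported by several data servers. */
struct file_location_list {
        unsigned int          count;
        struct file_location *locations;
};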

4.3. Fault tolerance

The failure model for Split-Server NFSv4 follows that of NFSv4 with the following

modifications:

1. A failed state server can recover its runtime state by retrieving each part of the

state from the data servers.

2. The failure of a data server is not critical to system operation.

4.3.1. Client failure and recovery

An NFSv4 server places a lease on all share reservations, locks, and delegations is-

sued to a client. Clients must send RENEW operations, akin to heartbeat messages, to

the server to retain their leases. If a server does not receive a RENEW operation from a

client within the lease period, the server may unilaterally revoke all state associated with

the given client. Leases are also implicitly renewed as a side effect of a client request

that includes its identifier. However, Split-Server NFSv4 redirects READ, WRITE, and

COMMIT operations to the data servers, so the renewal implicit in these operations is no

longer visible to the state server. Therefore, RENEW operations are sent to a client’s

mounted state server either by modifying the client to send explicit RENEW operations,

or by engineering the data server that is actively fulfilling client requests to send them.

Enabling data servers to send RENEW messages on behalf of a client improves scalabil-

ity by limiting the maximum number of renewal messages received by a state server to

the number of data server nodes.
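A minimal sketch of data-server-driven renewal, assuming a hypothetical lease period and helper routines (active_clients, send_renew_to_state_server) supplied by the data server implementation:

#include <unistd.h>
#include <stdint.h>

#define LEASE_PERIOD_SECS 90   /* assumed lease period; the actual value is server policy */

struct client_entry {          /* one entry per client with outstanding I/O */
        uint64_t             clientid;
        struct client_entry *next;
};

extern struct client_entry *active_clients(void);
extern void send_renew_to_state_server(uint64_t clientid);

/* Renewal loop on a data server: well before leases expire, send a RENEW
 * for every client the data server is actively serving.  The state server
 * then receives renewal traffic from at most one source per data server
 * rather than from every client. */
void renewal_loop(void)
{
        for (;;) {
                sleep(LEASE_PERIOD_SECS / 3);
                for (struct client_entry *c = active_clients(); c != NULL; c = c->next)
                        send_renew_to_state_server(c->clientid);
        }
}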

4.3.2. State server failure and recovery

A recovering state server stops servicing requests and queries data servers to rebuild

its state.


4.3.3. Data server failure and recovery

A failed data server is discovered by the state server when it tries to replicate state

and by clients who issue requests. A client obtains a new data server by reissuing the re-

quest for the FILE_LOCATION attribute. A data server that experiences a network partition

from the state server immediately stops fulfilling client requests, preventing a state server

from granting conflicting file access requests.

4.4. Security

The addition of data servers to the NFSv4 protocol does not require extra security

mechanisms. The client uses the security protocol negotiated with a state server for all

nodes. Servers communicate over RPCSEC_GSS, the secure RPC mandated for all

NFSv4 commands.

4.5. Evaluation

This section compares unmodified NFSv4 with Split-Server NFSv4 as they export a

GPFS file system. The test environment is shown in Figure 4.3. All nodes are connected

via an IntraCore 35160 Gigabit Ethernet switch with 1500-byte Ethernet frames.

Server System: The five server nodes are equipped with Pentium 4 processors with a

clock rate of 850 MHz and a 256 KB cache; 2 GB of RAM; one Seagate 80 GB, 7200

RPM hard drive with an Ultra ATA/100 interface and a 2 MB cache; and two 3Com

3C996B-T Gigabit Ethernet cards. Servers run a modified Linux 2.4.18 kernel with Red

Hat 9.

Figure 4.3: Split-Server NFSv4 experimental setup. The system has four Split-Server NFSv4 clients and five GPFS servers exporting a common file system. The GPFS servers are exported by Split-Server NFSv4, consisting of a state server and at most four data servers.


Client System: Client nodes one through three are equipped with dual 1.7 GHz Pen-

tium 4 processors with a 256 KB cache; 2 GB of RAM; a Seagate 80 GB, 7200 RPM

hard drive with an Ultra ATA/100 interface and a 2 MB cache; and a 3Com 3C996B-T

Gigabit Ethernet card. Client node four is equipped with an Intel Xeon processor with a

clock rate of 1.4 GHz and a 256 KB cache; 1 GB RAM; an Adaptec 40 GB, 10K RPM

SCSI hard drive using an Ultra 160 host adapter; and an AceNIC Gigabit Ethernet card. All

clients run the Linux 2.6.1 kernel with a Red Hat 9 distribution.

Netapp FAS960 Filer: The storage device has two processors, 6 GB of RAM, and a

quad Gigabit Ethernet card. It is connected to eight disks running RAID4.

The five servers run the GPFS v1.3 parallel file system with a 40 GB file system and

a 16 KB block size. GPFS maintains a 32 MB file and metadata cache known as the

pagepool. All experiments use forty NFSv4 server threads except the Split-Server

NFSv4 write experiments, which use a single NFSv4 server thread to improve performance (discussed in Section 4.5.4).

4.5.1. Scalability experiments

To evaluate scalability, I measure the aggregate I/O throughput while increasing the

number of clients accessing GPFS, NFSv4, and Split-Server NFSv4. Since both standard

NFSv4 and Split-Server NFSv4 export a GPFS file system, the GPFS configuration con-

stitutes the theoretical ceiling on NFSv4 and Split-Server NFSv4 I/O throughput. The

extra hop between the GPFS server and the NFS client prevents the performance of

NFSv4 and Split-Server NFSv4 from equaling GPFS performance. The goal is for Split-

Server NFSv4 to scale linearly with GPFS.

GPFS is configured as a four node GPFS file system directly connected to the filer.

NFSv4 is configured with a single NFSv4 server running on a GPFS node and four cli-

ents. Split-Server NFSv4 is configured with a state server, four data servers (each run-

ning on a GPFS file system node), and four clients. At most one client accesses each data

server during an experiment.

To measure the aggregate I/O throughput, I use the IOZone [128] benchmark tool. In

the first set of experiments, each client reads/writes a separate 500 MB file. In the second

set of experiments, each client reads/writes disjoint 500 MB portions of a single pre-


existing file. The aggregate I/O throughput is calculated when the last client completes

its task. The value presented is the average over ten executions of the benchmark. The

write timing includes the time to flush the client’s cache to the server. Clients and serv-

ers purge their caches before each read experiment. All read experiments use a warm

filer cache to reduce the effect of disk access irregularities.

The experimental goal is to test whether Split-Server NFSv4 scales linearly with

additional resources. I engineered a server bottleneck in the system by using a small

GPFS pagepool and block size, and by cutting the number of server clock cycles in half.

This ensures that each server is fully utilized, which implies that the results are applicable

to any system that needs to scale with additional servers.

4.5.2. Read performance

First, I measure read performance while increasing the number of clients from one to

four. Figure 4.4a shows the results with separate files. Figure 4.4b presents the results

with a single file. GPFS imposes a ceiling on performance with an aggregate read

throughput of 23 MB/s with a single server. With four servers, GPFS reaches 94.1 MB/s

and 91.9 MB/s in multiple and single file experiments respectively. The decrease in per-

formance for the single file experiment arises because all servers must access a single

metadata server. With Split-Server NFSv4, as I increase the number of clients and data

servers the aggregate read throughput increases linearly, reaching 65.7 MB/s with multi-

ple files and 59.4 MB/s for the single file experiment. NFSv4 aggregate read throughput

remains flat at approximately 16 MB/s in both experiments, a consequence of the single

server bottleneck.

4.5.3. Write performance

The second experiment measures the aggregate write throughput as I increase the

number of clients from one to four. I first measure the performance of all clients writing

to separate files, shown in Figure 4.5a.

GPFS sets the upper limit with an aggregate write throughput of 16.7 MB/s with a

single server and 61.4 MB/s with four servers. The fourth server overloads the filer’s


CPU. NFSv4 and Split-Server NFSv4 initially have an aggregate write throughput of ap-

proximately 8 MB/s. The aggregate write throughput of Split-Server NFSv4 increases

linearly, reaching a maximum of 32 MB/s. As in the read experiments, the aggregate

write throughput of NFSv4 remains flat as the number of clients is increased.

Figure 4.5b shows the results of each client writing to different regions of a single

file. The write performance of GPFS and NFSv4 is similar to the separate file experi-

ments. The major difference occurs with Split-Server NFSv4, achieving an initial aggre-

gate throughput of 6.1 MB/s and increasing to 18.7 MB/s. Poor performance and lack of

scalability are the result of modification time (mtime) synchronization between GPFS

servers. GPFS avoids synchronizing the mtime attribute when accessed directly. GPFS

must synchronize the mtime attribute when accessed with NFSv4 to ensure NFSv4 client

[Figure 4.4 charts: x-axis Number of Nodes (1 to 4); y-axis Aggregate Throughput (MB/s), 0 to 100; series GPFS, Split-Server NFSv4, NFSv4.]

(a) Separate files (b) Single file Figure 4.4: Split-Server NFSv4 aggregate read throughput. GPFS consists of up to four file system nodes. NFSv4 is up to four clients accessing a single GPFS server. Split-Server NFSv4 consists of up to four clients accessing up to four data servers and a state server. Split-Server NFSv4 scales linearly as I increase the number of GPFS nodes but NFSv4 performance remains flat.

[Figure 4.5 charts: x-axis Number of Nodes (1 to 4); y-axis Aggregate Throughput (MB/s), 0 to 80; series GPFS, Split-Server NFSv4, NFSv4.]

(a) Separate files (b) Single file Figure 4.5: Split-Server NFSv4 aggregate write throughput. GPFS consists of up to four file system nodes. NFSv4 is up to four clients accessing a single GPFS server. Split-Server NFSv4 consists of up to four clients accessing up to four data servers and a state server. With separate files, Split-Server NFSv4 scales linearly as I increase the number of GPFS nodes but NFSv4 performance remains flat. With a single file, Split-Server NFSv4 performance is fettered by mtime synchronization.


cache consistency. Furthermore, GPFS includes the state server among servers that syn-

chronize the mtime attribute, further reducing performance.

4.5.4. Discussion

Split-Server NFSv4 scales linearly with the number of GPFS nodes except when mul-

tiple clients write to a single file, which experiences lower performance since GPFS syn-

chronizes the mtime attribute to comply with the NFS protocol. Client cache synchroni-

zation relies on the mtime attribute, but it is unnecessary in some environments. For ex-

ample, some programs cache data themselves and use the OPEN option O_DIRECT to

disable client caching for a file. Other programs require only non-conflicting write con-

sistency, handling data consistency without relying on locks or cache consistency mecha-

nisms. PVFS2 [129] is designed for such programs. To succeed in these environments,

the NFS protocol must relax its client cache consistency semantics.

NFS block sizes have tended to be small. Block sizes were 4 KB in NFSv2, and grew

to 8 KB in NFSv3. Most recent implementations now support 32 KB or 64 KB. Syn-

chronous writes along with hardware and kernel limitations are some of the original rea-

sons for small block sizes. Another is UDP, which relies on IP fragmentation to divide each block into multiple packets; consequently, the loss of a single packet means the loss of

the entire block. The introduction in 2002 of TCP and a larger buffer space to the Linux

implementation of NFS allows for larger block sizes, but the current Linux kernel has a

32 KB limit. This creates a disparity with many parallel file systems, which use a stripe

size of greater than 64 KB. To avoid this data request inefficiency, NFS implementations

need to catch up to parallel file systems like GPFS that support block sizes of greater than

1 MB.

Multiple NFS server threads can also reduce I/O throughput. Even with a single NFS

client, the parallel file system assumes all requests are from different sources and per-

forms locking between threads. In addition, server threads can process read and write

requests out of order, hampering the parallel file system’s ability to improve its interac-

tion with the physical disk.

In NFSv3, the lack of OPEN and CLOSE commands leads to an implicit open and

close of a file in the underlying file system on every request. This does not degrade per-


formance with local file systems such as Ext3, but the extra communication required to

contact a metadata server in parallel file systems restricts NFSv3 throughput.

4.5.5. Supplementary Split-Server NFSv4 designs

4.5.5.1 File system directed load balancing

Split-Server NFSv4 distributes clients among the data servers using a round-robin al-

gorithm. Allowing the underlying file system to direct clients to data servers may im-

prove efficiency of available resources since the parallel file system may have more in-

sight into the current client load. For example, coordinated use of the parallel file sys-

tem’s data cache may prove effective with certain I/O access patterns. Allowing the un-

derlying file system to direct client load may also facilitate the use of multiple metadata

servers without an additional server-to-server protocol. This suggests extending the inter-

face between an NFSv4 server and its exported parallel file system.

4.5.5.2 Client directed load balancing

Split-Server NFSv4’s use of the FILE_LOCATION attribute enables a centralized way to

balance client load. To avoid having to modify the NFSv4 protocol, random distribution

of client load among data servers may prove sufficient in certain cases. Once clients dis-

cover the available data servers through configuration files or specialized mount options,

they can randomly distribute requests among the data servers. Data servers can retrieve

required state from the state server as needed.

4.5.5.3 NFSv4 client directed state distribution

Split-Server NFSv4’s server-to-server communication increases the load on a central

resource and delays stateful operations. Clients can reduce load on the metadata server

by assuming responsibility for the distribution of state. After a client successfully exe-

cutes a state generating request, e.g., LOCK, on the state server, it sends the same opera-

tion to every data server in the FILE_LOCATION attribute. In addition to reducing load on

the state server, this design also isolates all modifications to the NFS client.
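A sketch of that fan-out for a LOCK request, with hypothetical helper names: the client executes the operation on the state server first and then replays it to every data server named in the file's FILE_LOCATION attribute.

#include <stdint.h>

/* Hypothetical RPC helper; issues a LOCK for the given file and range. */
extern int send_lock(const char *server, const unsigned char *fh, unsigned int fh_len,
                     uint64_t offset, uint64_t length);

int distribute_lock(const char *state_server,
                    const char *const data_servers[], unsigned int nservers,
                    const unsigned char *fh, unsigned int fh_len,
                    uint64_t offset, uint64_t length)
{
        /* Execute the state-generating operation on the state server first. */
        int err = send_lock(state_server, fh, fh_len, offset, length);
        if (err != 0)
                return err;

        /* Replay the same operation to each data server in FILE_LOCATION. */
        for (unsigned int i = 0; i < nservers; i++) {
                err = send_lock(data_servers[i], fh, fh_len, offset, length);
                if (err != 0)
                        return err;  /* recovery policy is outside this sketch */
        }
        return 0;
}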


4.6. Related work

Several systems aggregate partitioned NFSv3 servers into a single file system image

[91, 113, 114]. These systems, discussed in Section 2.6, transform NFS into a type of

parallel file system, which increases scalability but eliminates the file system independ-

ence of NFS.

The Storage Resource Broker (SRB) [130] aggregates storage resources, e.g., a file

system, an archival system, or a database, into a single data catalogue. The HTTP proto-

col is the most common and widespread way to access remote data stores. SRB and

HTTP also have some limits: they do not enable parallel I/O to multiple storage endpoints

and do not integrate with the local file system.

The notion of serverless or peer-to-peer file systems was popularized by xFS [131],

which eliminates the single server bottleneck and provides data redundancy through net-

work disk striping. More recently, wide-area file systems such as LegionFS [132] and

Gfarm [133] provide a fully integrated and distributed environment and a secure means

of cross-domain access. Targeted for the grid, these systems use data replication to pro-

vide reasonable performance to globally distributed data. The major drawback of these

systems is their lack of interoperability with other file systems, each mandating itself as

the only grid file system. Split-Server NFSv4 allows file system independent access to

remote data stores in the LAN or across the WAN.

GPFS-WAN [134, 135], used extensively in the TeraGrid [127], features exceptional

throughput across high-speed, long haul networks, but is focused on large I/O transfers

and is restricted to GPFS storage systems.

GridFTP [4] is also used extensively in Grid computing to enable high I/O through-

put, operating system independence, and secure WAN access to high-performance file

systems. Successful and popular, GridFTP nevertheless has some serious limitations: it

copies data instead of providing shared access to a single copy, which complicates its

consistency model and decreases storage capacity; it lacks direct data access and a global

namespace; and it runs as an application that cannot be accessed as a file system without oper-

ating system modification. Split-Server NFSv4 is not intended to replace GridFTP, but to

work alongside it. For example, in tiered projects such as ATLAS at CERN, GridFTP

remains a natural choice for long-haul scheduled transfers among the upper tiers, while


Split-Server NFSv4 offers advantages in the lower tiers by letting scientists work with

files directly, promoting effective data management.

4.7. Conclusion

This chapter introduces extensions to NFSv4 to utilize an unmodified parallel file sys-

tem’s available bandwidth while retaining NFSv4 features and semantics. Using a new

FILE_LOCATION attribute, Split-Server NFSv4 provides parallel and scalable access to ex-

isting parallel file systems. The I/O throughput of the prototype scales linearly with the

number of parallel file system nodes except when multiple clients write to a single file,

which experiences lower performance due to mtime synchronization in the underlying

parallel file system.


CHAPTER V

Flexible Remote Data Access

Parallel file systems teach us two important lessons. First, direct and parallel data ac-

cess techniques can fully utilize available hardware resources, and second, standard data

flow protocols such as iSCSI [22], OSD [136], and FCP [21] can increase interoperability

and reduce development and management costs. Unfortunately, standard protocols are

useful only if file systems use them, and most parallel file systems support only one pro-

tocol for their data channel. In addition, most parallel file systems use proprietary control

protocols.

Distributed file systems view storage through parallel file system data components,

intermediary nodes through which data must travel. This extra layer of processing pre-

vents distributed file systems from matching the performance of the exported file system,

even for a single client.

An architectural framework should be able to encompass all storage architectures,

i.e., symmetric or asymmetric; in-band or out-of-band; and block-, object-, or file-based,

without sacrificing performance. The NFSv4 file service, with its global namespace,

high level of interoperability and portability, simple and cost-effective management, and

integrated security provides an ideal base for such a framework.

This chapter analyzes a prototype implementation of pNFS [121, 122], an extension

of NFSv4 that provides file access scalability plus operating system, hardware platform,

and storage system independence. pNFS eliminates the performance bottlenecks of NFS

by enabling the NFSv4 client to access storage directly. pNFS facilitates interoperability

between standard protocols by providing a framework for the co-existence of NFSv4 and

other file access protocols. My prototype demonstrates and validates the potential of

pNFS. The I/O throughput of my prototype equals that of its exported file system

(PVFS2 [129]) and is dramatically better than standard NFSv4.


The remainder of this chapter is organized as follows. Section 5.1 describes the

pNFS architecture. Sections 5.2 and 5.3 present PVFS2 and my pNFS prototype. Sec-

tion 5.4 measures performance of my Linux-based prototype. Section 5.5 discusses addi-

tional pNFS design and implementation issues, including the impact of locking and secu-

rity support on the pNFS protocol. Section 5.6 surveys related work, and Section 5.7 summarizes and concludes this chapter.

5.1. pNFS architecture

To meet enterprise and grand challenge-scale performance and interoperability re-

quirements, a group of engineers—initially ad-hoc but now integrated into the IETF—is

designing extensions to NFSv4 that provide parallel access to storage systems. The result

is pNFS, which promises file access scalability as well as operating system and storage

system independence. pNFS separates the control and data flows of NFSv4, allowing

data to transfer in parallel from many clients to many storage endpoints. This removes

the single server bottleneck by distributing I/O across the bisectional bandwidth of the

storage network between the clients and storage devices.

Figure 5.1 shows how pNFS alters the original NFSv4-PFS data access architecture

(Figure 3.3) by integrating NFSv4 application and control components with the parallel

file system data component. The NFSv4 client (NFSv4 control component) continues to

send control operations to the NFSv4 server (NFSv4 metadata component), but shifts the

Figure 5.1: pNFS data access architecture. NFSv4 control components access the NFSv4 metadata component and use a parallel file system (PFS) data component to perform I/O directly to storage.


responsibility for achieving scalable I/O throughput to a storage-specific driver (PFS data

component).

Figure 5.2 depicts the architecture of pNFS, which adds a layout and I/O driver, and a

file layout retrieval interface to the standard NFSv4 architecture. pNFS clients send I/O

requests directly to storage and access the pNFS server for file metadata information.

A benefit of pNFS is its ability to match the performance of the underlying storage

system’s native client while continuing to support all standard NFSv4 features. This sup-

port is ensured by introducing pNFS extensions into a “minor version”, a standard exten-

sion mechanism of NFSv4. In addition, pNFS does not impose restrictions that might

limit the underlying file system’s ability to provide quality-enhancing features such as

usage statistics or storage management interfaces.

5.1.1. Layout and I/O driver

The layout driver understands the file layout and storage protocol of the storage sys-

tem. A layout consists of all information required to access any byte range of a file. For

example, a block layout may contain information about block size, offset of the first

block on each storage device, and an array of tuples that contains device identifiers, block

numbers, and block counts. An object layout specifies the storage devices for a file and

the information necessary to translate a logical byte sequence into a collection of objects.

A file layout is similar to an object layout but uses file handles instead of object identifi-

ers. The layout driver uses the layout to translate read and write requests from the pNFS


Figure 5.2: pNFS design. pNFS extends NFSv4 with the addition of a layout and I/O driver, and a file layout retrieval interface. The pNFS server obtains an opaque file layout map from the storage system and transfers it to the pNFS client and subsequently to its layout driver for direct and parallel data access.


client into I/O requests understood by the storage devices. The I/O driver performs raw

I/O, e.g., Myrinet GM [137], Infiniband [138], TCP/IP, to the storage nodes.
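As a concrete illustration of the block layout example above, the information might be organized along the following lines; the structure is a sketch with assumed names, not a layout format defined by the pNFS drafts.

#include <stdint.h>

/* Sketch of a block-based layout: block size, per-device starting offset,
 * and an array of (device, block number, block count) tuples. */
struct block_extent {
        uint32_t device_id;      /* storage device holding the extent */
        uint64_t block_number;   /* first block of the extent on that device */
        uint64_t block_count;    /* number of consecutive blocks */
};

struct block_layout {
        uint32_t             block_size;    /* bytes per block */
        uint64_t            *first_block;   /* offset of the first block, per device */
        unsigned int         num_extents;
        struct block_extent *extents;
};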

The layout driver can be specialized or (preferably) implement a standard protocol

such as the Fibre Channel Protocol (FCP), allowing multiple file systems to use the same

layout driver. Storage systems adopting this architecture reduce development and man-

agement obligations by obviating a specialized file system client. This holds the promise

of reducing development and maintenance of high-end storage systems.

5.1.2. NFSv4 protocol extensions

This section describes the NFSv4 protocol extensions to support pNFS.

File system attribute. A new file system attribute, LAYOUT_CLASSES, contains the

layout driver identifiers supported by the underlying file system. Upon encountering an

unknown file system identifier, a pNFS client retrieves this attribute and uses it to select

an appropriate layout driver. To prevent namespace collisions, a global registry main-

tainer such as IANA [139] specifies layout driver identifiers.

LAYOUTGET operation. The LAYOUTGET operation obtains file access information

for a byte-range of a file from the underlying storage system. The client issues a

LAYOUTGET operation after it opens a file and before it accesses file data. Implemen-

tations determine the frequency and byte range of the request.

The arguments are:

• File handle

• Layout type

• Access type

• Offset

• Extent

• Minimum size

• Maximum count


The file handle uniquely identifies the file. The layout type identifies the preferred

layout type. The offset and extent arguments specify the requested region of the file.

The access type specifies whether the requested file layout information is for reading,

writing, or both. This is useful for file systems that, for example, provide read-only rep-

licas of data. The minimum size specifies the minimum overlap with the requested offset

and length. The maximum count specifies the maximum number of bytes for the result,

including XDR overhead.

LAYOUTGET returns the requested layout as an opaque object and its associated

offset and extent. By returning file layout information to the client as an opaque object,

pNFS is able to support arbitrary file layout types. At no time does the pNFS client at-

tempt to interpret this object; it acts simply as a conduit between the storage system and

the layout driver. The byte range described by the returned layout may be larger than the

requested size due to block alignments, layout prefetching, etc.
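Collecting the arguments and results above, a client-side view of LAYOUTGET might resemble the following sketch; the concrete types are assumptions, since the operation itself is defined in XDR by the pNFS drafts.

#include <stdint.h>

struct layoutget_args {
        unsigned char fh[128];        /* file handle */
        unsigned int  fh_len;
        uint32_t      layout_type;    /* preferred layout type */
        uint32_t      access_type;    /* reading, writing, or both */
        uint64_t      offset;         /* start of the requested region */
        uint64_t      extent;         /* length of the requested region */
        uint64_t      minimum_size;   /* minimum overlap with the requested range */
        uint32_t      maximum_count;  /* maximum result bytes, including XDR overhead */
};

struct layoutget_res {
        uint64_t       offset;        /* region actually covered by the layout */
        uint64_t       extent;
        unsigned int   layout_len;
        unsigned char *layout;        /* opaque object passed through to the layout driver */
};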

LAYOUTCOMMIT operation. The LAYOUTCOMMIT operation commits changes

to the layout information. The client uses this operation to commit or discard provision-

ally allocated space, update the end of file, and fill in existing holes in the layout.

LAYOUTRETURN operation. The LAYOUTRETURN operation informs the NFSv4

server that layout information obtained earlier is no longer required. A client may return

a layout voluntarily or upon receipt of a server recall request.

CB_LAYOUTRECALL operation. If layout information is exclusive to a specific cli-

ent and other clients require conflicting access, the server can recall a layout from the cli-

ent using the CB_LAYOUTRECALL callback operation.3 The client should complete

any in-flight I/O operations using the recalled layout and write any buffered dirty data

directly to storage before returning the layout, or write it later using normal NFSv4 write

operations.
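The client's obligations on a recall can be summarized by the following sketch, with hypothetical helpers standing in for the prototype's internals:

struct layout;   /* opaque layout object managed by the layout driver */

/* Hypothetical helpers supplied by the client implementation. */
extern void wait_for_inflight_io(struct layout *lo);
extern int  flush_dirty_data_direct(struct layout *lo);
extern void mark_dirty_for_nfs_writeback(struct layout *lo);
extern void send_layoutreturn(struct layout *lo);

/* On CB_LAYOUTRECALL: finish in-flight I/O that uses the recalled layout,
 * write buffered dirty data directly to storage or fall back to normal
 * NFSv4 WRITEs, then return the layout. */
void handle_cb_layoutrecall(struct layout *lo)
{
        wait_for_inflight_io(lo);
        if (flush_dirty_data_direct(lo) != 0)
                mark_dirty_for_nfs_writeback(lo);
        send_layoutreturn(lo);
}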


GETDEVINFO and GETDEVLIST operations. The GETDEVINFO and

GETDEVLIST operations retrieve additional information about one or more storage

nodes. The layout driver issues these operations if the device information inside the file

layout does not provide enough information for file access, e.g., SAN volume label in-

formation or port numbers.

5.2. Parallel virtual file system version 2

This section presents an overview of PVFS2, a user-level, open-source, scalable,

asymmetric parallel file system designed as a research tool and for production environ-

ments. Despite its lack of locking and security support, I chose PVFS2 because its user

level design provides a streamlined architecture for rapid prototyping of new ideas.

Figure 5.3 depicts the PVFS2 architecture.

PVFS2 consists of clients, storage nodes, and metadata servers. Metadata servers

store all information about the file system in a Berkeley DB database [140], distributing

metadata via a hash on the file name. File data is striped across storage nodes, which can

be increased in number as needed.

PVFS2 uses algorithmic file layouts for distributing data among the storage nodes.

The data distribution algorithm is user defined, defaulting to round-robin striping. The

clients and storage nodes share the data distribution algorithm, which does not change

during the lifetime of the file. A series of file handles, one for each storage node,

[Footnote 3] NFSv4 already contains a callback operation infrastructure for delegation support.


Figure 5.3: PVFS2 architecture. PVFS2 consists of clients, metadata servers, and storage nodes. The PVFS2 kernel module enables integration with the local file system. Data is striped across storage nodes using a user-defined algorithm.


uniquely identifies the set of file data stripes. Data is not committed with the metadata

server; instead, the client ensures that all data is committed to storage by negotiating with

each individual storage node.

An operating system specific kernel module provides for integration into user envi-

ronments and for access by other VFS file systems. Users are thus able to mount and ac-

cess PVFS2 through a POSIX interface. Currently, only a Linux implementation of this module exists. Data is memory mapped between the kernel module and the PVFS2 client

program to avoid extra data copies.

Efficient lock management with large numbers of clients is a hard problem. Large

parallel applications generally avoid using locks and manage data consistency through

organized and cooperative clients. PVFS2 shuns POSIX consistency semantics, which

require sequential consistency of file system operations, and replaces them with noncon-

flicting writes semantics, guaranteeing that writes to non-overlapping file regions are

visible on all subsequent reads once the write completes.

5.3. pNFS prototype

Prototypes of new protocols are essential for their clarification and provide insight

and evidence of their viability. A minimum requirement for the fitness of pNFS is its

ability to provide parallel access to arbitrary storage systems. This agnosticism toward

storage system particulars is vital for widespread adoption. As such, my prototype fo-


Figure 5.4: pNFS prototype architecture. The pNFS server obtains the opaque file layout from the PVFS2 metadata server via the PVFS2 client, transferring it back to the pNFS client and subsequently to the PVFS2 layout driver for direct and parallel data access.


cuses on the retrieval and processing of the file layout to demonstrate that pNFS is agnos-

tic of the underlying storage system and can match the performance of the storage system

it exports. Figure 5.4 displays the architecture of my pNFS prototype with PVFS2 as the

exported file system.

5.3.1. PVFS2 layout

The PVFS2 file layout information consists of:

• File system id

• Set of file handles, one for each storage node

• Distribution id, uniquely defines layout algorithm

• Distribution parameters, e.g., stripe size

Since a PVFS2 layout applies to an entire file, no matter what byte range the pNFS

client requests using the LAYOUTGET operation, the returned byte range is the entire

file. Therefore, my prototype requests a layout once for each open file, incurring a single

additional round trip. If a pNFS client is eager with its requests, it can even eliminate this

single round trip time by including the LAYOUTGET in the same request as the OPEN

operation. The differences between these two designs are apparent in my evaluation.

The pNFS server obtains the layout from PVFS2 via a Linux VFS export operation.
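An approximate in-memory picture of the layout information listed above, with assumed names and types rather than PVFS2's own definitions:

#include <stdint.h>

struct pvfs2_layout_sketch {
        uint32_t      fs_id;              /* file system id */
        unsigned int  num_handles;        /* one file handle per storage node */
        uint64_t     *datafile_handles;
        uint32_t      distribution_id;    /* uniquely identifies the layout algorithm */
        uint64_t      stripe_size;        /* example distribution parameter */
};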

5.3.2. Extensible “Pluggable” layout and I/O drivers

My prototype facilitates interoperability by providing a framework for the co-

existence of the NFSv4 control protocol with all storage protocols. As shown in Figure

5.5, layout drivers are pluggable, using a standard set of interfaces for all storage proto-

cols. An I/O interface, based on the Linux file_operations interface (see footnote 4), facilitates managing layout information and performing I/O with storage. A policy interface informs the pNFS client of storage system specific policies, e.g., stripe and block size and lay-

[Footnote 4] The file_operations interface is the VFS interface that manages access to a file.


out retrieval timing. The policy interface also enables layout drivers to specify whether

they support NFSv4 data management services or use customized implementations. The

pNFS client can provide the following data management services: data cache, writeback

cache with write gathering, and readahead.

A layout driver registers with the pNFS client along with a unique identifier. The

pNFS client matches this identifier with the value of the LAYOUT_CLASSES attribute

to select the correct layout driver for file access. If there is no matching layout driver,

standard NFSv4 read and write mechanisms are used.

The PVFS2 layout driver supports three operations: read, write, and set_layout. To

inject the file layout map, the pNFS client passes the opaque layout as an argument to the

set_layout function. Once the layout driver has finished processing the layout, the pNFS

client is free to call the layout driver’s read and write functions. When data access is

complete, the pNFS client issues a standard NFSv4 close operation to the pNFS server.

The syntax for the PVFS2 layout driver I/O interface is:

ssize_t read(struct file *file, char __user *buf, size_t count, loff_t *offset);

ssize_t write(struct file *file, const char __user *buf, size_t count, loff_t *offset);

int set_layout(struct inode *ino, struct file *file, unsigned int cmd, unsigned long arg);
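To make the pluggable structure concrete, registration could proceed along the lines of the sketch below. The operations table mirrors the three PVFS2 layout driver operations; the structure name, registration call, and identifier value are hypothetical, not the prototype's actual interface.

#include <linux/fs.h>       /* struct file, struct inode, loff_t, __user */
#include <linux/init.h>
#include <linux/module.h>

/* PVFS2 layout driver operations (implemented elsewhere in the driver). */
extern ssize_t pvfs2_layout_read(struct file *file, char __user *buf,
                                 size_t count, loff_t *offset);
extern ssize_t pvfs2_layout_write(struct file *file, const char __user *buf,
                                  size_t count, loff_t *offset);
extern int pvfs2_set_layout(struct inode *ino, struct file *file,
                            unsigned int cmd, unsigned long arg);

/* Hypothetical pluggable operations table and registration interface. */
struct layout_driver_ops {
        ssize_t (*read)(struct file *, char __user *, size_t, loff_t *);
        ssize_t (*write)(struct file *, const char __user *, size_t, loff_t *);
        int     (*set_layout)(struct inode *, struct file *,
                              unsigned int, unsigned long);
};

extern int pnfs_register_layout_driver(unsigned int layout_class_id,
                                       const struct layout_driver_ops *ops);

static const struct layout_driver_ops pvfs2_layout_ops = {
        .read       = pvfs2_layout_read,
        .write      = pvfs2_layout_write,
        .set_layout = pvfs2_set_layout,
};

/* The identifier must match an entry in the server's LAYOUT_CLASSES
 * attribute; the value used here is purely illustrative. */
#define EXAMPLE_PVFS2_LAYOUT_CLASS 4

static int __init pvfs2_layout_driver_init(void)
{
        return pnfs_register_layout_driver(EXAMPLE_PVFS2_LAYOUT_CLASS,
                                           &pvfs2_layout_ops);
}
module_init(pvfs2_layout_driver_init);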

5.4. Evaluation

In this section, I describe experiments that assess the performance of my pNFS proto-

type. They demonstrate that pNFS can use the standard layout driver interface to scale

with PVFS2, and can achieve performance vastly superior to NFSv4.


Figure 5.5: Linux pNFS prototype internal structure. pNFS clients use I/O and policy interfaces to access storage nodes and determine file system policies. The pNFS server uses VFS export operations to communicate with the underlying file system.


5.4.1. Experimental Setup

The experiments are performed on a network of forty identical nodes partitioned into

twenty-three clients, sixteen storage nodes, and one metadata server. Each node is a 2

GHz dual-processor Opteron with 2 GB of DDR RAM and four Western Digital Caviar

Serial ATA disks, which have a nominal data rate of 150 MB/s and an average seek time

of 8.9 ms. The disks are configured with software RAID 0. The operating system kernel

is Linux 2.6.9-rc3. The version of PVFS2 is 1.0.1.

I test four configurations: two that access PVFS2 storage nodes directly via pNFS and

PVFS2 clients and two with unmodified NFSv4 clients. One NFSv4 configuration ac-

cesses an Ext3 file system. The other accesses a PVFS2 file system with an NFSv4

server, exported PVFS2 client, and PVFS2 metadata server all residing on the metadata

server. The metadata server runs eight pNFS or NFSv4 server threads when exporting

the PVFS2 or Ext3 file systems. I verified that varying the number of pNFS or NFSv4

server threads does not affect performance.

I compare the aggregate I/O throughput using the IOZone [128] benchmark tool while

increasing numbers of clients. The first set of experiments has two processes on each cli-

ent reading and writing separate 200 MB files. In the second set of experiments, each

client reads and writes disjoint 100 MB portions of a single pre-existing file. Aggregate

I/O throughput is calculated when the last client completes its task. The value presented

is the average over several executions of the benchmark. The write time includes a flush

of the client’s cache to the server. All read experiments use warm storage node caches to

reduce disk access irregularities.

5.4.2. LAYOUTGET performance

If a layout does not apply to an entire file, a LAYOUTGET request would be required

on every read or write. In the test environment, the time for a LAYOUTGET request is

0.85 ms. On a 1 MB transfer, this reduces I/O throughput by only 3-4 percent; with a 10

MB transfer, the relative cost is less than 0.5 percent.
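The relative cost is just the ratio of the layout round trip to the total transfer time:

overhead ≈ t_LAYOUTGET / (t_LAYOUTGET + S / B)

where S is the transfer size and B the per-client transfer rate. As a back-of-envelope check, assuming B is on the order of 40 MB/s (an illustrative figure, not a measurement reported here), a 1 MB transfer takes roughly 25 ms, so the 0.85 ms layout round trip adds about 3 percent; a 10 MB transfer takes roughly 250 ms, and the overhead falls to about 0.3 percent, consistent with the figures above.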


5.4.3. I/O throughput performance

In all experiments, NFSv4 exporting PVFS2 achieves an aggregate read and write throughput of only 1.9 MB/s and 0.9 MB/s, respectively. I discuss

this in Section 5.4.4.

Figure 5.6a shows the write performance with each client writing to separate files. In

Figure 5.6b, all clients write to a single file. NFSv4 with Ext3 achieves an average ag-

gregate write throughput of 38 MB/s and 68 MB/s for the separate and single file experi-

ments. pNFS performance tracks PVFS2, reaching a maximum aggregate write through-

put of 384 MB/s with sixteen processes for separate files and 240 MB/s with seven cli-

[Figure 5.6 charts: x-axis Number of Processes (a) and Number of Clients (b); y-axis Aggregate Throughput (MB/s), 0 to 400; series pNFS, PVFS2, NFSv4-PVFS2, NFSv4-Ext3.]

(a) Separate files (b) Single file Figure 5.6: Aggregate pNFS write throughput. pNFS scales with PVFS2 while NFSv4 performance remains flat. pNFS and PVFS2 use sixteen storage nodes. With separate files, each client spawns two write processes.

[Figure 5.7 charts: x-axis Number of Processes (a) and Number of Clients (b); y-axis Aggregate Throughput (MB/s), 0 to 600; series pNFS, pNFS-2 (b only), PVFS2, NFSv4-PVFS2, NFSv4-Ext3.]

(a) Separate files (b) Single file Figure 5.7: Aggregate pNFS read throughput. pNFS and PVFS2 scale linearly while NFSv4 performance remains flat. With a single file, pNFS performance is slightly below PVFS2 due to increasing layout retrieval congestion. pNFS-2, which removes the extra round trip time of LAYOUTGET, matches PVFS2 performance. pNFS and PVFS2 use sixteen storage nodes. With separate files, each client spawns two read processes.


ents writing to a single file. With separate files, the bottleneck is the number of storage

nodes. Metadata processing limits the performance with a single file.

Figure 5.7a shows the read performance with two processes on each client reading

separate files. NFSv4 with Ext3 achieves its maximum network bandwidth of 115 MB/s.

pNFS again achieves the same performance as PVFS2. Initially, the extra overhead re-

quired to access sixteen storage nodes reduces read throughput for two processes to 27

MB/s, but it scales almost linearly, reaching an aggregate read throughput of 550 MB/s

with 46 processes.

Figure 5.7b shows the read performance with each client reading from disjoint por-

tions of the same pre-existing file. NFSv4 with Ext3 again achieves its maximum net-

work bandwidth of 115 MB/s. PVFS2 scales linearly, starting with an aggregate read

throughput of 15 MB/s with a single client, increasing to 360 MB/s with twenty-three cli-

ents. The pNFS prototype, which incurs a single round trip time for the LAYOUTGET,

suffers slightly as the PVFS2 layout retrieval function takes longer with increasing num-

bers of clients, reaching an aggregate read throughput of 311 MB/s. A modified proto-

type combines the LAYOUTGET and OPEN operations into a single call. The prototype

labeled pNFS-2 excludes the LAYOUTGET operation from the measurements and

matches the performance of PVFS2.

5.4.4. Discussion

While these experiments offer convincing evidence that pNFS can match the per-

formance of the underlying file system, they also demonstrate that pNFS performance

can be adversely affected by a costly LAYOUTGET operation.

The poor performance of NFSv4 with PVFS2 stems from a difference in block sizes.

Per-read and per-write processing overhead is small in NFSv4, which justifies a small

block size—32 KB on Linux, but PVFS2 has a much larger per-read and per-write over-

head and therefore uses a block size of 4 MB. In addition, PVFS2 does not perform write

gathering on the client, assuming each data request to be a multiple of the block size. To

make matters worse, the Linux kernel breaks the NFSv4 client’s request on the NFSv4

server into 4 KB chunks before it issues the requests to the PVFS2 client. Data transfer


overhead, e.g., creating connections to the storage nodes, and determining stripe loca-

tions, dominates with 4 KB requests. The impact on performance is devastating.

Lack of a commit operation in the PVFS2 kernel module also reduces the write per-

formance of NFSv4 with PVFS2. To prevent data loss, PVFS2 commits every write op-

eration, ignoring the NFSv4 COMMIT operation. Write gathering [90] on the server

combined with a commit from the PVFS2 client would comply with NFSv4 fault toler-

ance semantics and improve the interaction of PVFS2 with the disk.

5.5. Additional pNFS design and implementation issues

5.5.1. Locking

NFSv4 supports mandatory locking, which requires an additional piece of shared state

between the NFSv4 client and server: a unique identifier of the locking process. An

NFSv4 client includes a locking identifier with every read and write operation.

How pNFS storage nodes support mandatory locks is not covered in the pNFS opera-

tions Internet Draft [123]. Several possibilities exist: enable the storage nodes to interpret

NFSv4 lock identifiers, bundle a new pNFS operation to retrieve file system specific lock

information with the NFSv4 LOCK operation, or include lock information in the existing

file layout.

5.5.2. Security considerations

Separating control and data paths in pNFS introduces new security concerns to

NFSv4. Although RPCSEC_GSS continues to secure the NFSv4 control path, securing

the data path requires additional care. The current pNFS operations Internet Draft de-

scribes the general mechanisms that will be required, but does not fully define the new security architecture.

A file-based layout driver uses the RPCSEC_GSS security mechanism between the

client and storage nodes.

Object storage uses revocable cryptographic capabilities for file system objects that

the metadata server passes to clients. For data access, the layout driver requires the cor-


rect capability to access the storage nodes. It is expected that the capability will be

passed to the layout driver within the opaque layout object.

Block storage access protocols rely on SAN-based security, which is perhaps a mis-

nomer, as clients are implicitly trusted to access only their allotted blocks. LUN mask-

ing/unmapping and zone-based security schemes can fence clients to specific data blocks.

Some systems employ IPsec to secure the data stream. Placing more trust in the client for

SAN file systems is a step backward relative to the NFSv4 trust model.

5.6. Related work

Several systems aggregate partitioned NFSv3 servers into a single file system image

[91, 113, 114]. These systems, discussed in Section 2.6, transform NFS into a type of

parallel file system, which increases scalability but eliminates the file system independ-

ence of NFS.

The Storage Resource Broker (SRB) [130] aggregates storage resources, e.g., a file

system, an archival system, or a database, into a single data catalogue. The HTTP proto-

col is the most common and widespread way to access remote data stores. SRB and

HTTP also have some limits: they do not enable parallel I/O to multiple storage endpoints

and do not integrate with the local file system.

EMC HighRoad [103] uses the NFS or CIFS protocol for its control operations and

stores data in an aggregated LAN and SAN environment. Its use of file semantics facili-

tates data sharing in SAN environments, but is limited to the EMC Symmetrix storage

system. A similar, non-commercial version is also available [141].

Several pNFS layout drivers are under development. At this writing, Sun Microsys-

tems, Inc. is developing file- and object-based layout implementations. Panasas object

and EMC block drivers are also in progress.

GPFS-WAN [134, 135], used extensively in the TeraGrid [127], features exceptional

throughput across high-speed, long haul networks, but is focused on large I/O transfers

and is restricted to GPFS storage systems.

GridFTP [4] is also used extensively in Grid computing to enable high I/O through-

put, operating system independence, and secure WAN access to high-performance file

systems. Successful and popular, GridFTP nevertheless has some serious limitations: it


copies data instead of providing shared access to a single copy, which complicates its

consistency model and decreases storage capacity; it lacks direct data access and a global

namespace; and it runs as an application, so it cannot be accessed as a file system without oper-

ating system modification.

Distributed replicas can be vital in reducing network latency when accessing data.

pNFS is not intended to replace GridFTP, but to work alongside it. For example, in

tiered projects such as ATLAS at CERN, GridFTP remains a natural choice for long-haul

scheduled transfers among the upper tiers, while the file system semantics of pNFS offers

advantages in the lower tiers by letting scientists work with files directly, promoting ef-

fective data management.

5.7. Conclusion

This chapter analyzes an early implementation of pNFS, an NFSv4 extension that

uses the storage protocol of the underlying file system to bypass the server bottleneck and

enable direct and parallel storage access. The prototype validates the viability of the

pNFS protocol by demonstrating that it is possible to achieve high throughput access to a

high-performance file system while retaining the benefits of NFSv4. Experiments dem-

onstrate that the aggregate throughput with the prototype equals that of its exported file

system and far exceeds NFSv4 performance.


CHAPTER VI

Large Files, Small Writes, and pNFS

Parallel file systems improve the aggregate throughput of bulk data transfers by scal-

ing disks, disk controllers, network, and servers—every aspect of the system architecture.

As system size increases, the cost of locating, managing, and protecting data increases the

per-request overhead. This overhead is small relative to the overall cost of large data

transfers, but considerable for smaller data requests. Many parallel file systems ignore

this high penalty for small I/O, focusing entirely on large data transfers.

Unfortunately, not all data comes in big packages. Numerous workload characteriza-

tion studies have highlighted the prevalence of small and sequential data requests in

modern scientific applications [72-78]. This trend will likely continue since many HPC

applications take years to develop, have a productive lifespan of ten years or more, and

are not easily re-architected for the latest file access paradigm [12]. Furthermore, many

current data access libraries such as HDF5 and NetCDF rely heavily on small data ac-

cesses to store individual data elements in a common (large) file [142, 143].

This chapter investigates the performance of parallel file systems with small writes.

Distributed file systems are optimized for small data accesses [2, 36]; not surprisingly,

studies demonstrate that small I/O is their middleware niche [63]. I demonstrate that dis-

tributed file systems can increase write throughput to parallel data stores—regardless of

file size—by overcoming small write inefficiencies in parallel file systems. By using di-

rect, parallel I/O for large write requests and a distributed file system for small write re-

quests, pNFS improves the overall write performance of parallel file systems. The pNFS

heterogeneous metadata protocol allows these improvements in write performance with

any parallel file system.

The remainder of this chapter is organized as follows. Section 6.1 explores the issues

that arise when writing small amounts of data in scientific applications. Section 6.2 de-


scribes how pNFS can improve the performance of these applications. Section 6.3 re-

ports the results of experiments with synthetic benchmarks and a real scientific applica-

tion. Section 6.5 summarizes and concludes the chapter.

6.1. Small I/O requests

Several scientific workload characterization studies demonstrate the need to improve

performance of small I/O requests to small and large files.

The CHARISMA study [72-74] finds that file sizes in scientific workloads are much

larger than those typically found in UNIX workstation environments and that most scien-

tific applications access only a few files. Approximately 90% of file accesses are

small—less than 4 KB—and represent a considerable portion of application execution

time, even though approximately 90% of the data is transferred in large accesses. In ad-

dition, most files are read-only or write-only and are accessed sequentially, but read-write

files are accessed primarily non-sequentially.

The Scalable I/O study [75-77] had similar findings, but remarked that most requests

are small writes into gigabyte sized files, consuming, for example, 98% of the execution

time of one application that was studied. Furthermore, it is common for a single node to

handle the majority of reads and writes, gathering the data from, or broadcasting the data

to the other nodes as necessary. This indicates that single node performance still requires

attention from parallel file systems. The study also notes that a lack of portability pre-

vents applications from using enhanced parallel file system interfaces.

A more recent study in 2004 of two physics applications [78] amplifies the earlier

findings. This study found that I/O is bursty, most requests consist of small data trans-

fers, and most data is transferred in a few large requests. It is common for a master node

to collect results from other nodes and write them to storage using many small requests.

Each client reads back the data in large chunks. In addition, use of a single file is still

common and accessing that file—even with modern parallel file systems—is slower than

accessing separate files by a factor of five.

NetCDF (Network Common Data Form) provides a portable and efficient mechanism

for sharing data between scientists and applications [142]. It is the predominant file for-

mat standard within many scientific communities [144]. NetCDF defines a file format


and an API for the storage and retrieval of a file’s contents. NetCDF stores data in a sin-

gle array-oriented file, which contains dimensions, variables, and attributes. Applications

individually define and write thousands of data elements, creating many sequential and

small write requests.

HDF5 is another popular portable file format and programming interface for storing

scientific data in a single file. It provides a rich data model, with emphasis on efficiency

of access, parallel I/O, and support for high-performance computing, but continues to de-

fine and store each data element separately, creating many small write requests.

This chapter demonstrates how pNFS can improve small write performance with par-

allel file systems for small and large files, regardless of whether an application or file

format library generates the write requests.

6.2. Small writes and pNFS

pNFS improves file access scalability by providing the NFSv4 client with support for

direct storage access. I now turn to an investigation of the relative costs of the direct I/O

path and the NFSv4 path.

6.2.1. File system I/O features

A single large I/O request can saturate a client’s network endpoint. Engineering a

parallel file system for large requests entails the use of large transfer buffers, synchronous

data requests, deploying many storage nodes, and the use of a write-through cache or no

cache at all.

NFS implementations have several features that are sometimes an advantage over the

direct write path:

• Asynchronous client requests. Many parallel file systems incur a per-request

overhead, which adds up for small requests. Directing requests to the NFSv4

server allows the server to absorb this overhead without delaying the client appli-

cation or consuming client CPU cycles. In addition, asynchrony allows request

pipelining on the NFSv4 server, reducing aggregate latency to the storage nodes.


• One server per request. Data written to a byte-range that spans multiple storage

nodes (e.g., multiple stripes) requires two separate requests, further increasing the

per-request overhead. The NFSv4 single server design can reduce client request

overhead for small requests in these instances.

• Open Network Computing Remote Procedure Call. NFSv4 uses ONC RPC

[34], a low-overhead, low-latency network protocol that is well suited for small

data transfers.

• Client writeback cache. NFSv4 gathers sequential write requests into a single

request, which lowers the aggregate cost of small write requests (see the sketch after this list).

• Server write gathering. Similarly, the NFSv4 server combines sequential write

requests into a single request to the exported parallel file system. This can be use-

ful, e.g., for applications performing strided access into a single file.
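
To make the write-gathering behavior concrete, the following sketch (my own rendering, with assumed names such as flush_to_server and GATHER_SIZE, not the Linux NFS client code) shows how sequential small writes can be accumulated and sent as one larger request:

    #include <stdint.h>
    #include <string.h>
    #include <stddef.h>

    #define GATHER_SIZE (32 * 1024)    /* 32 KB gather size, as on Linux */

    struct write_gather {
        uint64_t file_offset;          /* where the buffered run starts */
        size_t   used;                 /* bytes buffered so far */
        char     buf[GATHER_SIZE];
    };

    /* Stand-in for issuing one NFSv4 WRITE covering the gathered run. */
    static void flush_to_server(uint64_t offset, const void *data, size_t len)
    {
        (void)offset; (void)data; (void)len;
    }

    /* Buffer a small write; flush when the run breaks or the buffer fills.
     * For brevity this sketch assumes len < GATHER_SIZE. */
    static void gather_write(struct write_gather *g, uint64_t offset,
                             const void *data, size_t len)
    {
        int sequential = g->used > 0 && offset == g->file_offset + g->used;

        if (g->used > 0 && (!sequential || g->used + len > GATHER_SIZE)) {
            flush_to_server(g->file_offset, g->buf, g->used);
            g->used = 0;
        }
        if (g->used == 0)
            g->file_offset = offset;
        memcpy(g->buf + g->used, data, len);
        g->used += len;
    }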

6.2.2. Small write performance example: Postmark benchmark

Comparing the performance of the Postmark benchmark on my pNFS prototype, un-

modified NFSv4, and Ext3 demonstrates the performance mismatch of parallel file sys-

tems in a real-world computing environment. Postmark simulates applications that have

a large number of metadata and small I/O requests such as electronic mail, NetNews, and

Web-based services [145]. Postmark creates and performs transactions on a large num-

ber of small files (between 1 KB and 500 KB). Each transaction first deletes, creates, or

opens a file. If the transaction creates or opens a file, it then appends 1 KB. Data is sent

to stable storage before the file is closed. Postmark performs 2,000 transactions on 100

files. The experimental evaluation uses eight nodes with dual 1.7 GHz P4 processors and

File System        Write Throughput (MB/s)
Ext3               5.02
NFSv4/Ext3         4.03
pNFS/PVFS2         0.65
NFSv4/PVFS2        2.44

Table 6.1: Postmark write throughput with 1 KB block size. NFSv4 outperforms direct, parallel I/O for small writes.


a 3Com 3C996B-T Gigabit Ethernet card. PVFS2 has six storage nodes and one meta-

data server.

Table 6.1 shows the Postmark results for Ext3, NFSv4, and pNFS. Ext3 outperforms

remote clients, achieving a write throughput of 5.02 MB/s. NFSv4 achieves a write

throughput of 4.03 MB/s. pNFS exporting the PVFS2 parallel file system achieves a

write throughput of only 0.65 MB/s due to its inability to parallelize requests effectively

and its use of a write-through cache. By using the features discussed in Section 6.2.1,

NFSv4 raises the write throughput to the same PVFS2 file system by 1.79 MB/s, to 2.44 MB/s. This

demonstrates that the parallel, direct I/O path is not always the best choice and the indi-

rect path is not always the worst choice.

6.2.3. pNFS write threshold

To enable the indirect I/O path for small writes, I modified the pNFS client prototype

to allow it to choose between the NFSv4 storage protocol and the storage protocol of the

underlying file system. To switch between them, I added a write threshold to the layout

driver. Write requests smaller than the threshold follow the NFSv4 data path. Write re-

quests larger than the threshold follow the layout driver data path. Figure 6.1 shows how

the write threshold alters the pNFS data access architecture. Clients use a PFS data com-

ponent for large write requests and an NFSv4 data component for small write requests.

An additional PFS data component on the metadata server funnels small write requests to

Figure 6.1: pNFS small write data access architecture. Clients use a PFS data component for large write requests and an NFSv4 data component for small write requests.


storage. Figure 6.2 illustrates the implementation of the write threshold in both the gen-

eral pNFS architecture and in the prototype.
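
A minimal sketch of this dispatch logic, using hypothetical names for the two write paths rather than the prototype's actual kernel interfaces, is:

    #include <stddef.h>
    #include <sys/types.h>

    /* Hypothetical write-path function type. */
    typedef ssize_t (*pnfs_write_fn)(int fd, const void *buf,
                                     size_t count, off_t offset);

    struct pnfs_io_paths {
        size_t        write_threshold;  /* bytes; retrieved from the layout driver */
        pnfs_write_fn nfs4_write;       /* indirect path through the NFSv4 server */
        pnfs_write_fn layout_write;     /* direct path to the storage nodes */
    };

    static ssize_t pnfs_dispatch_write(const struct pnfs_io_paths *p, int fd,
                                       const void *buf, size_t count, off_t offset)
    {
        /* Requests smaller than the threshold follow the NFSv4 data path;
         * everything else follows the layout driver data path. */
        if (count < p->write_threshold)
            return p->nfs4_write(fd, buf, count, offset);
        return p->layout_write(fd, buf, count, offset);
    }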

pNFS features a heterogeneous metadata protocol that enables it to benefit from the

strengths of disparate storage protocols. A write threshold improves overall write per-

formance for pNFS by hitting the sweet spot of both the NFSv4 and underlying file sys-

tem storage protocols.

Just as any improvement to NFSv4 improves access to the file system it exports, these

improvements to pNFS are portable and benefit all parallel file systems equally by allow-

ing pNFS (and its exported parallel file systems) to concentrate on large data require-

ments, while NFSv4 efficiently processes small I/O.

6.2.4. Setting the write threshold

The advantage of a write threshold is that applications that mix small and large write

requests get the better performing I/O path automatically.

The optimal write threshold value depends on several factors, including

server capacity, network performance and capability, system load, and features specific to

the distributed and parallel file system. One way to choose a good threshold value is to

compare execution times for distributed and parallel file systems with various write sizes

and see where the performance indicators cross.
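
A rough sketch of this crossover search, operating on hypothetical measurement data rather than any code from the prototype, follows:

    #include <stddef.h>
    #include <stdint.h>

    struct io_sample {
        size_t request_size;   /* bytes */
        double dfs_time;       /* distributed file system (NFSv4) execution time */
        double pfs_time;       /* parallel file system execution time */
    };

    /* Return the smallest measured request size at which the parallel file
     * system beats the distributed file system; samples are assumed to be
     * sorted by increasing request size. */
    static size_t pick_write_threshold(const struct io_sample *s, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (s[i].pfs_time < s[i].dfs_time)
                return s[i].request_size;   /* crossover point (A or B in Figure 6.3) */
        return SIZE_MAX;   /* direct I/O never wins: route all writes via NFSv4 */
    }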

Figure 6.2: pNFS write threshold. (a) pNFS data paths: pNFS utilizes NFSv4 I/O along the small write path when the write request size is less than the write threshold. (b) pNFS prototype with write threshold: pNFS retrieves the write threshold from the PVFS2 layout driver to determine the correct data path for a write request.


Figure 6.3 abstracts write request execution time with increasing request size for a

parallel file system and for an idle and busy distributed file system. When the distributed

file system is lightly loaded, the transfer size at which the parallel file system outper-

forms the distributed file system, labeled B, is the optimal write threshold. When the dis-

tributed file system is heavily loaded, each request takes longer to complete, so the slope

increases and intersects the parallel file system at the smaller threshold size, labeled A.

(If the distributed file system is thoroughly overloaded, the threshold value tends to zero,

i.e., an overloaded distributed file system is never a better choice.)

The workload characterization studies mentioned in Section 6.1 state that scientific

applications usually have a large gap between small and large write request sizes, with

very few requests in the middle. Experiments reveal that small requests are smaller than

the “busy” write threshold value, shown as A in Figure 6.3, and the large requests are lar-

ger than the “idle” write threshold values, shown as B. Applications should reap large

gains for any write threshold value between A and B.

For example, the ATLAS digitization application (Section 6.3.3) achieves the same

performance for any write threshold between 32 KB and 274 KB. In addition, 87 percent

of the write requests are smaller than 4 KB, which suggests that the threshold could be

even smaller without hurting performance.

The write threshold can be set at any time, including compile time, when a module

loads, and run time. For example, system administrators can determine the write thresh-

old as part of a file system and network installation and optimization. A natural value for

the write threshold is the write gather size of the distributed file system.

Figure 6.3: Determining the write threshold value. Write execution time increases with larger request sizes. Application write requests are either small or large, with few requests in the middle. The write threshold can be any value in this middle region.


6.3. Evaluation

In this section, I evaluate the performance of the write threshold heuristic in my pNFS

prototype.

6.3.1. Experimental setup

IOR and random write IOZone experiments use a pair of sixteen node clusters con-

nected with Myrinet. One cluster consists of dual 1.1 GHz processor PIII Xeon nodes.

The other consists of dual 1 GHz processor PIII Xeon nodes. Each node has 1 GB of

memory. The PVFS2 1.1.0 file system has eight storage nodes and one metadata server.

Each storage node has an Ultra160 SCSI disk controller and one Seagate Cheetah 18 GB,

10,033 RPM drive, with an average seek time of 5.2 ms. The NFSv4 server, PVFS2 cli-

ent, and PVFS2 metadata server are installed on a single node. All nodes run Linux

2.6.12-rc4.

ATLAS experiments use an eight node cluster of 1.7 GHz dual P4 processors, 2 GB

of memory, a Seagate 80 GB 7200 RPM hard drive with an Ultra ATA/100 interface and

a 2 MB cache, and a 3Com 3C996B-T Gigabit Ethernet card. The PVFS2 1.1.0 file sys-

tem has six storage nodes and one metadata server. The NFSv4 server, PVFS2 client,

and PVFS2 metadata server are installed on a single node. All nodes run Linux 2.6.12-

rc4.

6.3.2. IOR and IOZone benchmarks

6.3.2.1 Experimental design

The first experiment consists of a single client issuing one thousand sequential write

requests to a file, using the IOR benchmark [146]. A test completes when data is com-

mitted to disk. I repeat this experiment with ten clients writing to disjoint portions of a

single file. The second experiment consists of a single client randomly writing a 32 MB

file using IOZone [128].
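
The sequential write pattern in the first experiment can be sketched as follows (a simplified stand-in for the IOR workload, not the benchmark's source; timing and error handling are omitted):

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Issue num_reqs back-to-back writes of req_size bytes, then commit to
     * disk; a timer around a call to this function yields the reported
     * throughput. */
    static void run_sequential_writes(const char *path, size_t req_size,
                                      int num_reqs)
    {
        char *buf = calloc(1, req_size);
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        for (int i = 0; i < num_reqs; i++)
            if (write(fd, buf, req_size) < 0)   /* sequential: offset advances */
                break;
        fsync(fd);                              /* test completes when data is on disk */

        close(fd);
        free(buf);
    }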

For each experiment, I first compare the aggregate write throughput of pNFS and

NFSv4 with a range of individual request sizes. I then set the write threshold to be the


request size at which pNFS and NFSv4 have the same performance, and re-execute the

benchmark.

6.3.2.2 Experimental evaluation

Our first experiment, shown in Figure 6.4a, examines single client performance. The

performance of NFSv4 writing to PVFS2 or Ext3 is comparable because the NFSv4 32

KB write size is less than the PVFS2 64 KB stripe size, which isolates writes to a single

disk.

With a single pNFS client, writing through the NFSv4 server to PVFS2 is superior to

writing directly to PVFS2 until the request size reaches 64 KB. For 16-byte writes,

NFSv4 has sixty-seven times the throughput, with the ratio decreasing to one at 64 KB.

Write performance through the NFSv4 server reaches its peak at 32 KB, the NFSv4 client

request size. At 64 KB, direct storage access begins to outperform indirect access. pNFS

with a write threshold of 32 KB offers the performance benefits of both storage protocols

by using NFSv4 I/O until 32 KB, then switching to direct storage access with the PVFS2

storage protocol.

Figure 6.4b shows the results of ten nodes writing to disjoint segments of the same

file. Ext3 performance is limited by random requests from the NFSv4 server daemons.

Figure 6.4: Write throughput with threshold (y-axis: throughput in MB/s; x-axis: individual write size). (a) Single/Consecutive: write throughput of a single client issuing consecutive small write requests. NFSv4 exporting PVFS2 outperforms pNFS until a write size of 64 KB; pNFS with a 32 KB write threshold achieves the best overall performance. (b) Multiple/Consecutive: aggregate write throughput of ten clients issuing consecutive small write requests to a single file. NFSv4 exporting PVFS2 outperforms pNFS until a write size of 8 KB; pNFS with a 4 KB write threshold achieves the best overall performance. (c) Single/Random: write throughput of a single client issuing random small write requests. NFSv4 exporting PVFS2 outperforms pNFS until a write size of 128 KB; pNFS with a 64 KB write threshold achieves the best overall performance. Data points are a power of two; lines are for readability.


Using NFSv4 I/O to access PVFS2 does not incur as many random accesses since the

writes are spread over eight disks.

PVFS2 throughput grows approximately linearly as the impact of the request over-

head diminishes. The aggregate performance of NFSv4 is the same as with a single cli-

ent, with the write performance crossover point between pNFS and NFSv4 occurring at 4

KB. With 16-byte writes, NFSv4 has twenty times the bandwidth, with the ratio decreas-

ing to one at just below 8 KB. The maximum bandwidth difference of 9 MB/s occurs at

1 KB. At 8 KB, direct storage access begins to outperform indirect access. pNFS with a

write threshold of 4 KB offers the performance benefits of both storage protocols.

Figure 6.4c shows the performance of writing to a 32 MB file in a random manner

with increasing request sizes. NFSv4 outperforms pNFS until the individual write size

reaches 128 KB, with a maximum difference of 13 MB/s occurring at 16 KB. pNFS us-

ing a write threshold of 64 KB experiences the performance benefits of both storage pro-

tocols.

6.3.3. ATLAS applications

Not every application behaves like the ones studied in Section 6.1. For example,

large writes dominate the FLASH I/O benchmark workload [147], with 99.7 percent of

requests greater than 163 KB (with default input parameters). However, beyond the

workload characterization studies, there is increasing anecdotal evidence to suggest that

small writes are quite common.

To assess the impact of the small write heuristic I use the ATLAS simulator, which

does make many small writes. ATLAS [148] is a particle physics experiment that seeks

new discoveries in head on collisions of high-energy protons using the Large Hadron

Collider accelerator [149]. Scheduled for completion in 2007, ATLAS will generate over

a petabyte of data each year to be distributed for analysis to a multi-tiered collection of

decentralized sites.

Currently, ATLAS physicists are performing large-scale simulations of the events

that will occur within its detector. These simulation efforts influence detector design and

the development of real-time event filtering algorithms for reducing the volume of data.

The ATLAS detector can detect one billion events with a combined data volume of forty


terabytes each second. After filtering, data from fewer than one hundred events per sec-

ond are stored for offline analysis.

The ATLAS simulation event data model consists of four stages. The Event Gen-

eration stage produces pseudo-random events drawn from a statistical distribution of

previous experiments. The Simulation stage then simulates the passage of particles

(events) through the detectors. The Digitization stage combines hit information

with estimates of internal noise, subjecting the hits to a parameterization of the known

response of the detectors to produce simulated digital output (digits). The Recon-

struction stage performs pattern recognition and track reconstruction algorithms on

the digits, converting raw digital data into meaningful physics quantities.

6.3.3.1 Experimental design

Experiments focus on the Digitization stage, the only stage that generates a

large amount of data. With 500 events, Digitization writes approximately 650 MB

of output data to a single file. Data are written randomly, with write request size distribu-

tions shown in Figure 6.5. Figure 6.5a shows that only 4 percent of write request sizes

are 275 KB or greater, with the rest below 32 KB. Figure 6.5b shows that 96 percent of

write requests are only responsible for 5 percent of the data, while 95 percent of the data

are written in requests whose size is greater than 275 KB. This distribution of write re-

quest size and total amount of data output closely matches the workload characterization

studies discussed in Section 6.1. Analysis of the Digitization write request distribu-

Figure 6.5: ATLAS digitization write request size distribution with 500 events. (a) Breakdown of total number of requests. (b) Breakdown of total amount of data output.


tion with varying numbers of events indicates that the distribution in Figure 6.5 is a rep-

resentative sample.

Analysis of the Digitization trace data reveals a large number of fsync system

calls. For example, executing Digitization with 50 events produces more than 900

synchronous fsync calls. Synchronously committing data to storage reduces request par-

allelism and the effectiveness of write gathering.

ATLAS developers explain that the overwhelming use of fsync is an implementation

issue rather than an application necessity [150]. Therefore, to evaluate Digitization

write throughput I used IOZone to replay the write trace data while omitting fsync calls

for 50 and 500 events.

6.3.3.2 Experimental evaluation

To evaluate pNFS with the ATLAS simulator, I analyze the Digitization write

throughput with several write threshold values.

First, I use the IOZone benchmark to determine the maximum PVFS2 write through-

put. The maximum write throughput for a single-threaded application and an entire client

is 18 MB/s and 54 MB/s respectively. The single threaded application maximum per-

formance value sets the upper limit for ATLAS write throughput. Increasing the number

of threads simultaneously writing to storage increases the maximum write throughput

three-fold. Since ATLAS Digitization is a single threaded application generating

output for serialized events, it cannot directly take advantage of this extra performance.

As shown in Figure 6.6, pNFS achieves a write throughput of 11.3 MB/s and 11.9

MB/s with 50 and 500 events respectively. The small write requests reduce the applica-

tion’s peak write throughput by approximately 6 MB/s.

With a write threshold of 1 KB, 49 percent of requests are re-directed to the NFSv4

server, increasing performance by 23 percent. With a write threshold of 32 KB, 96 per-

cent of write requests use the NFSv4 I/O path. With 50 events, the increase in write per-

formance is 57 percent, for a write throughput of 17.8 MB/s. With 500 events, the in-

crease in write performance is 100 percent, for a write throughput of 23.8 MB/s.


It is interesting to note that the 32 KB write threshold performance exceeds the single-

threaded application maximum write throughput. The NFSv4 server is multi-threaded, so

it can process multiple simultaneous write requests and outperform a single-threaded ap-

plication. This is another benefit of the increased parallelism available in distributed file

systems.

When pNFS funnels all Digitization output through the NFSv4 server, perform-

ance drops dramatically, but is still slightly better than the performance of pNFS with di-

rect I/O. In this experiment, the improved write performance of the smaller requests

overshadows the reduced performance of sending large write requests through the NFSv4

server.

The 50 and 500 event experiments have slightly different write request size and offset

distributions. In addition, the 500 event simulation has ten times the number of write re-

quests. The difference between the pNFS write threshold performance improvements in

the 50 and 500 event experiments seems to be due to a difference in behavior of the

NFSv4 writeback cache with these different write workloads.

6.3.4. Discussion

Experiments show that writing to the direct data path is not always the best choice.

Write request size plays an important role in determining the preferred data path.

Figure 6.6: ATLAS digitization write throughput for 50 and 500 events (series: pNFS, pNFS with 1 KB write threshold, pNFS with 32 KB write threshold, and NFSv4/PVFS2; application maximum write throughput = 18 MB/s, client maximum write throughput = 54 MB/s). pNFS with a 32 KB write threshold achieves the best overall performance by directing small requests through the NFSv4 server and the 275 KB and 1 MB requests to the PVFS2 storage nodes.


The Linux NFSv4 client gathers small writes into 32 KB requests. With very small

requests, the overhead of gathering requests diminishes the benefit. As the size of each

write request grows, the increase in throughput is considerable.

Performing an increased number of parallel asynchronous write requests also im-

proves performance. This is seen in both Figure 6.4a and Figure 6.4c, as the performance

of writing 32 KB requests exceeds that of writing directly to storage.

The Linux NFSv4 server does not perform write gathering. Our experiments clearly

show the benefit of increasing the write request size. The ability for the NFSv4 server to

combine small requests from multiple clients into a single large request should lead to

further advantages.

6.4. Related work

Log-structured file systems [151] increase the size of writes by appending small I/O

requests to a log and later flushing the log to disk. Zebra [152] extends this to distributed

environments. Side effects include large data layouts and erratic block sizes.

The Vesta parallel file system [153] improves I/O performance by using workload

characteristics provided by applications to optimize data layout on storage. Providing

this information can be difficult for applications that lack regular I/O patterns or whose

I/O access patterns change over time.

The Slice file system prototype [116] divides NFS requests into three classes: large

I/O, small-file I/O, and namespace. A µProxy, interposed between clients and servers,

routes NFS client requests between storage, small-file servers, and directory servers, re-

spectively. Large I/O flows directly to storage while small-file servers aggregate I/O op-

erations of small files and the initial segments of large files. This method benefits small

file performance, but ignores small I/O to large files.

Both the EMC HighRoad file system [103] and the RAID-II network file server [95]

transfer small files over a low-bandwidth network and use a high-bandwidth network for

large file requests, but differentiating small and large files does not help with small re-

quests to large files. This re-direction benefits only large requests, and may reduce the

performance of small requests.


GPFS [97] forwards data between I/O nodes for requests smaller than the block size.

This reduces the number of messages with the lock manager and possibly reduces the

number of read-modify-write sequences.

Both the Lustre [104] and the Panasas ActiveScale [105] file systems use a write-

behind cache to perform buffered writes. In addition, Lustre allows clients to place small

files on a single storage node to reduce access overhead.

Implementations of MPI-IO such as ROMIO [82] use application hints and file access

patterns to improve I/O request performance. The work reported here benefits and com-

plements MPI-IO and its implementations. MPI-IO is useful to applications that use its

API and have regular I/O access patterns, e.g., strided I/O, but MPI-IO small write per-

formance is limited by the deficiencies of the underlying parallel file system. Our pNFS

enhancements are beneficial for existing and unmodified applications. They are also

beneficial at the file system layer of MPI-IO implementations, to improve the perform-

ance of the underlying parallel file system.

6.5. Conclusion

Diverse file access patterns and computing environments in the high-performance

community make pNFS an indispensable tool for scalable data access. This chapter

demonstrates that pNFS can increase write throughput to parallel data stores—regardless

of file size—by overcoming the inefficient performance of parallel file systems when

write request sizes are small. pNFS improves the overall write performance of parallel

file systems by using direct, parallel I/O for large write requests and a distributed file sys-

tem for small write requests. Evaluation results using a real scientific application and

several benchmark programs demonstrate the benefits of this design. The pNFS hetero-

geneous metadata protocol allows any parallel file system to realize these write perform-

ance improvements.


CHAPTER VII

Direct Data Access with a Commodity Storage Protocol

Parallel file systems feature impressive throughput, but they are highly specialized,

have limited operating system and hardware platform support and poor cross-site perform-

ance, and often lack strong security mechanisms. In addition, while parallel file systems

excel at large data transfers, many do so at the expense of small I/O performance. While

large data transfers dominate many scientific applications, numerous workload charac-

terization studies have highlighted the prevalence of small, sequential data requests in

modern scientific applications [74, 77, 78].

Many application domains demonstrate the need for high bandwidth, concurrent, and

secure access to large datasets across a variety of platforms and file systems. Scientific

computing connects large computational and data facilities across the globe and can gen-

erate petabytes of data. Digital movie studios that generate terabytes of data every day

require access from Sun, Windows, SGI, and Linux workstations, and compute clusters

[11]. This need for heterogeneous data access creates a conflict between parallel file sys-

tems and application platforms. Distributed file systems such as NFS [38] and CIFS [2]

bridge the interoperability gap, but they are unable to deliver the superior performance of

a high-end storage system.

pNFS overcomes these enterprise- and grand challenge-scale obstacles by enabling

direct access to storage from clients while preserving operating system, hardware plat-

form, and parallel file system independence. pNFS provides file access scalability by

using the storage protocol of the underlying parallel file system to distribute I/O across

the bisectional bandwidth of the storage network between clients and storage devices,

removing the single server bottleneck that is so vexing to client/server-based systems. In

combination, the elimination of the single server bottleneck and the ability for direct ac-

cess to storage by clients yields superior file access performance and scalability.


Regrettably, pNFS does not retain NFSv4 file system access transparency and therefore

cannot shield applications from different parallel file system security protocols and

metadata and data consistency semantics. In addition, implementing pNFS support for

every storage protocol on every operating system and hardware platform is a colossal un-

dertaking. File systems that support standard storage protocols may be able to share de-

velopment costs, but full support for a particular protocol is often unrealized, hampering

interoperability. The pNFS file-based layout access protocol helps bridge this gap in

transparency with middle-tier data servers, but eliminates direct data access, which can

hurt performance.

This chapter introduces Direct-pNFS, a novel augmentation to pNFS that increases

portability and regains parallel file system access transparency while continuing to match

the performance of native parallel file system clients. Architecturally, Direct-pNFS uses

a standard distributed file system protocol for direct access to a parallel file system’s

storage nodes, bridging the gap between performance and transparency. Direct-pNFS

leverages the strengths of NFSv4 to improve I/O performance over the entire range of I/O

workloads. I know of no other distributed file system that offers this level of perform-

ance, scalability, file system access transparency, and file system independence.

Direct-pNFS makes the following contributions:

Heterogeneous and ubiquitous remote file system access. Direct-pNFS benefits are

available with a conventional pNFS client: Direct-pNFS uses the pNFS file-based layout

type, and does not require file system specific layout drivers, e.g., object [154] or PVFS2

[155].

Remote file system access transparency and independence. pNFS uses file system

specific storage protocols that can expose gaps in the underlying file system semantics

(such as security support). Direct-pNFS, on the other hand, retains NFSv4 file system

access transparency by using the NFSv4 storage protocol for data access. In addition,

Direct-pNFS remains independent of the underlying file system and does not interpret file

system-specific information.

I/O workload versatility. While distributed file systems are usually engineered to per-

form well on small data accesses [63], parallel file systems target scientific workloads


dominated by large data transfers. Direct-pNFS combines the strengths of both, provid-

ing versatile data access to manage efficiently a diversity of workloads.

Scalability and throughput. Direct-pNFS can match the I/O throughput and scalability

of the exported parallel file system without requiring the client to support any protocol

other than NFSv4. This chapter uses numerous benchmark programs to demonstrate that

Direct-pNFS matches the I/O throughput of a parallel file system and has superior per-

formance in workloads that contain many small I/O requests.

A case for commodity high-performance remote data access. Direct-pNFS complies

with emerging IETF standards and can use an unmodified pNFS client. This chapter

makes a case for open systems in the design of high-performance clients, demonstrating

that standards-compliant commodity software can deliver the performance of a custom

made parallel file system client. Using standard clients to access specialized storage sys-

tems offers ubiquitous data access and reduces development and support costs without

cramping storage system optimization.

The remainder of this chapter is organized as follows. Section 7.1 makes the case for

open systems in distributed data access. Section 7.2 reviews pNFS and its departure from

traditional client/server distributed file systems. Sections 7.3 and 7.4 describe the Direct-

pNFS architecture and Linux prototype. Section 7.5 reports the results of experiments

with micro-benchmarks and four different I/O workloads. I summarize and conclude in

Section 7.6.

7.1. Commodity high-performance remote data access

NFS owes its success to an open protocol, platform ubiquity, and transparent access

to file systems, independent of the underlying storage technology. Beyond performance

and scalability, standards-based high-performance data access needs all these properties

to be successful in Grid, cluster, enterprise, and personal computing.

The benefits of standards-based data access with these qualities are numerous. A sin-

gle client can access data within a LAN and across a WAN, which reduces the cost of

development, administration, and support. System administrators can select a storage so-

lution with confidence that no matter the operating system and hardware platform, users


are able to access the data. In addition, storage vendors are free to focus on advanced

data management features such as fault tolerance, archiving, manageability, and scalabil-

ity without having to custom tailor their products across a broad spectrum of client plat-

forms.

7.2. pNFS and storage protocol-specific layout drivers

This section revisits the pNFS architecture described in Chapter V and discusses the

drawbacks of using storage protocol-specific layout drivers.

7.2.1. Hybrid file system semantics

Although parallel file systems separate control and data flows, there is tight integra-

tion of the control and data protocols. Users must adapt to different semantics for each

data repository. pNFS, on the other hand, allows applications to realize common file sys-

tem semantics across data repositories. As users access heterogeneous data repositories

with pNFS, the NFSv4 metadata protocol provides a degree of consistency with respect

to the file system semantics within each repository.

Unfortunately, certain semantics are layout driver and storage protocol dependent,

and they can drastically change application behavior. For example, Panasas ActiveScale

[105] supports the OSD security protocol [136], while Lustre [104] uses a specialized security

protocol. This forces clients that need to access both parallel file systems to support mul-

tiple authentication, integrity, and privacy mechanisms. Additional examples of these

semantics include client caching and fault tolerance.

7.2.2. The burden of layout and I/O driver development

The pNFS layout and I/O drivers are the workhorses of pNFS high-performance data

access. These specialized components understand the storage system’s storage protocol,

security protocol, file system semantics, device identification, and layout description and

management. For pNFS to achieve broad heterogeneous data access, layout and I/O

drivers must be developed and supported on a multiplicity of operating system and hard-


ware platforms—an effort comparable in magnitude to the development of a parallel file

system client.

7.2.3. The pNFS file-based layout driver

Currently, the IETF is developing three layout specifications: file, object, and block.

The pNFS protocol includes only the file-based layout format, with object- and block-

based to follow in separate specifications. As such, all pNFS implementations will sup-

port the file-based layout format for remote data access, while support for the object- and

block-based access methods will be optional.

A pNFS file-based layout governs an entire file and is valid until recalled by the

pNFS server. To perform data access, the file-based layout driver combines the layout

information with a known list of data servers for the file system, and sends READ,

WRITE, and COMMIT operations to the correct data servers. Once I/O is complete, the

client sends updated file metadata, e.g., size or modification time, to the pNFS server.

Figure 7.1: pNFS file-based architecture with a parallel file system. The pNFS file-based layout architecture consists of pNFS data servers, clients, and a metadata server, plus parallel file system (PFS) storage nodes, clients, and metadata servers. The three-tier design prevents direct storage access and creates overlapping and redundant storage and metadata protocols. The two-tier design, with pNFS servers, PFS clients, and storage on the same node, suffers from these problems plus diminished single client bandwidth.


pNFS file-based layout information consists of the following (see the sketch after this list):

• Striping type and stripe size

• Data server identifiers

• File handles (one for each data server)

• Policy parameters
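
A rough C rendering of these fields (illustrative names only, not the NFSv4.1 XDR encoding or the prototype's structures) is:

    #include <stdint.h>
    #include <stddef.h>

    #define NFS4_FHSIZE 128              /* maximum NFSv4 file handle length */

    struct pnfs_file_handle {
        uint32_t len;
        uint8_t  data[NFS4_FHSIZE];      /* opaque file handle bytes */
    };

    struct pnfs_file_layout {
        uint32_t stripe_type;            /* striping (aggregation) type */
        uint32_t stripe_size;            /* bytes per stripe unit */
        uint32_t num_data_servers;       /* length of the two arrays below */
        uint64_t *data_server_ids;       /* data server identifiers */
        struct pnfs_file_handle *fhs;    /* one file handle per data server */
        uint32_t policy_flags;           /* policy parameters */
    };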

Figure 7.1 illustrates how the pNFS file-based layout provides access to an asymmet-

ric parallel file system. (Henceforth, I refer to this unspecified file system as PFS).

pNFS clients access pNFS data servers that export PFS clients, which in turn access data

from PFS storage nodes and metadata from PFS metadata servers. A PFS management

protocol binds metadata servers and storage, providing a consistent view of the file sys-

tem. pNFS clients use NFSv4 for I/O while PFS clients use the PFS storage protocol.

7.2.3.1 Performance issues

Architecturally, using a file-based layout offers some latitude. The architecture de-

picted in Figure 7.1 might have two tiers, or it might have three. The three-tier architec-

ture places PFS clients and storage on separate nodes, while the two-tier architecture

places PFS clients and storage on the same nodes. As shown in Figure 7.2, neither choice

features direct data access: the three-tier model has intermediary data servers while with

two tiers, tier-two PFS clients access data from other tier-two storage nodes. In addition,

the two-tier model transfers data between data servers, reducing the available bandwidth

Figure 7.2: pNFS file-based data access. Both 3-tier and 2-tier architectures lose direct data access. (a) 3-tier: intermediary pNFS data servers access PFS storage nodes. (b) 2-tier: pNFS data servers access both local and remote PFS storage nodes.


between clients and data servers. These architectures can improve NFS scalability, but

the lack of direct data access—a primary benefit of pNFS—scuttles performance.

Block size mismatches and overlapping metadata protocols also diminish perform-

ance. If the pNFS block size is greater than the PFS block size, a large pNFS data request

produces extra PFS data requests, each incurring a fixed amount of overhead. Con-

versely, a small pNFS data request forces a large PFS data request, unnecessarily taxing

storage resources and delaying the pNFS request. pNFS file system metadata requests to

the pNFS server, e.g., file size, layout information, become PFS client metadata requests

to the PFS metadata server. This ripple effect increases overhead and delay for pNFS

metadata requests.

It is hard to address these remote access inefficiencies with fully connected block-

based parallel file systems, e.g., GPFS [97], GFS [98, 99], and PolyServe Matrix Server

[101], but for parallel file systems whose storage nodes admit NFS servers, Direct-pNFS

offers a solution.

7.3. Direct-pNFS

Direct-pNFS supports direct data access—without requiring a storage system specific

layout driver on every operating system and hardware platform—by exploiting file-based

layouts to describe the exact distribution of data on the storage nodes. Since a Direct-

Figure 7.3: Direct-pNFS data access architecture. NFSv4 application components use an NFSv4 data component to perform I/O directly to a PFS data component bundled with storage. The NFSv4 metadata component shares its access control information with the data servers to ensure the data servers allow only authorized data requests.


pNFS client knows the exact location of a file’s contents, it can target I/O requests to the

correct data servers. Direct-pNFS supports direct data access to any parallel file system

that allows NFS servers on its storage nodes—such as object based [104, 105], PVFS2

[129], and IBRIX Fusion [156]—and inherits the operational, fault tolerance, and security

semantics of NFSv4.

7.3.1. Architecture

In the two- and three-tier pNFS architectures shown in Figure 7.1, the underlying data

layout is opaque to pNFS clients. This forces them to distribute I/O requests among data

servers without regard for the actual location of the data. To overcome this inefficient

data access, Direct-pNFS, shown in Figure 7.4, uses a layout translator to convert a par-

allel file system’s layout into a pNFS file-based layout. A pNFS server, which exists on

every PFS data server, can satisfy Direct-pNFS client data requests by accessing the local

PFS storage component. Direct-pNFS and PFS metadata components also co-exist on the

same node, which eliminates remote PFS metadata requests from the pNFS server.

The Direct-pNFS data access architecture, shown in Figure 7.3, alters the NFSv4-

PFS data access architecture (Figure 3.3) by using an NFSv4 data component to perform

direct I/O to a PFS data component bundled with storage. The PFS data component prox-

ies NFSv4 I/O requests to the local disk. An NFSv4 metadata component on storage

maintains NFSv4 access control semantics.

Figure 7.4: Direct-pNFS with a parallel file system. Direct-pNFS eliminates overlapping I/O and metadata protocols and uses the NFSv4 storage protocol to directly access storage. The PFS uses a layout translator to convert its layout into a pNFS file-based layout. A Direct-pNFS client may use an aggregation driver to support specialized file striping methods.


In combination, the use of accurate layout information and the placement of pNFS

servers on PFS storage and metadata nodes eliminates extra PFS data and metadata re-

quests and obviates the need for data servers to support the PFS storage protocol alto-

gether. The use of a single storage protocol also eliminates block size mismatches be-

tween storage protocols.

7.3.2. Layout translator

To give Direct-pNFS clients exact knowledge of the underlying data layout, a parallel

file system uses the layout translator to specify a file’s storage nodes, file handles, aggre-

gation type, and policy parameters. The layout translator is independent of the underly-

ing parallel file system and does not interpret PFS layout information. The layout trans-

lator simply gathers file-based layout information, as specified by the PFS, and creates a

pNFS file-based layout. The overhead for a PFS to use the layout translator is small and

confined to the PFS metadata server.
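
A hypothetical interface sketch for such a translator, in which the parallel file system supplies the per-file facts and receives an opaque pNFS file-based layout in return, might look like this (all names are assumptions, not the prototype's API):

    #include <stddef.h>
    #include <stdint.h>

    /* Per-file facts supplied by the parallel file system (PFS). */
    struct pfs_layout_facts {
        uint32_t        aggregation_type;  /* e.g., round-robin striping */
        uint32_t        stripe_size;       /* bytes per stripe unit */
        uint32_t        num_nodes;         /* storage nodes holding the file */
        const uint64_t *node_ids;          /* storage node identifiers */
        const void    **file_handles;      /* one opaque handle per node */
        const size_t   *handle_lengths;
        uint32_t        policy_flags;      /* policy parameters */
    };

    struct pnfs_file_layout;               /* opaque pNFS file-based layout */

    /* Pack the PFS-supplied facts into a pNFS file-based layout without
     * interpreting any PFS-specific information; returns NULL on failure. */
    struct pnfs_file_layout *
    pnfs_translate_layout(const struct pfs_layout_facts *facts);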

7.3.3. Optional aggregation drivers

It is impossible for the pNFS protocol to support every method of distributing data

among the storage nodes. At this writing, the pNFS protocol supports two aggregation

schemes: round-robin striping and a second method that specifies a list of devices that

form a cyclical pattern for all stripes in the file. To broaden support for unconventional

aggregation schemes such as variable stripe size [157] and replicated or hierarchical strip-

ing [19, 158], Direct-pNFS also supports optional “pluggable” aggregation drivers. An

aggregation driver provides a compact way for the Direct-pNFS client to understand how

the underlying parallel file system maps file data onto the storage nodes.

Aggregation drivers are operating system and platform independent, and are based on

the distribution drivers in PVFS2, which use a standard interface to adapt to most striping

schemes. Although aggregation drivers are non-standard components, their development

effort is minimal compared to the effort required to develop an entire layout driver.
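As an illustration, a minimal round-robin aggregation driver might look like the C sketch below. The interface, a single offset-to-node mapping function, is an assumption modeled loosely on the PVFS2 distribution-driver idea described above, not the prototype's actual API:

#include <stdint.h>

/* Hypothetical aggregation driver interface: given a logical file offset,
 * report which storage node holds it and the offset within that node's
 * portion of the file. */
struct agg_driver {
    void (*map)(uint64_t file_offset, uint32_t stripe_unit, uint32_t num_nodes,
                uint32_t *node_index, uint64_t *node_offset);
};

/* Round-robin striping: stripe units are dealt to nodes 0..N-1 in turn. */
static void round_robin_map(uint64_t file_offset, uint32_t stripe_unit,
                            uint32_t num_nodes, uint32_t *node_index,
                            uint64_t *node_offset)
{
    uint64_t stripe_no = file_offset / stripe_unit;   /* global stripe unit number */
    uint64_t in_stripe = file_offset % stripe_unit;   /* offset inside that unit   */

    *node_index  = (uint32_t)(stripe_no % num_nodes);
    /* Each node stores every num_nodes-th stripe unit contiguously. */
    *node_offset = (stripe_no / num_nodes) * stripe_unit + in_stripe;
}

static const struct agg_driver round_robin_driver = { .map = round_robin_map };

A variable stripe size or replicated scheme would supply a different map function, leaving the file-based layout driver itself unchanged.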


7.4. Direct-pNFS prototype

I implemented a Direct-pNFS prototype that maintains strict agnosticism of the un-

derlying storage system and, as we shall see, matches the performance of the storage sys-

tem that it exports. Figure 7.5 displays the architecture of my Direct-pNFS prototype,

using PVFS2 for the exported file system.

Scientific data is easily re-created, so PVFS2 buffers data on storage nodes and sends

the data to stable storage only when necessary or at the application’s request (fsync). To

match this behavior, my Direct-pNFS prototype departs from the NFSv4 protocol, com-

mitting data to stable storage only when an application issues an fsync or closes the file.

At this writing, the user-level PVFS2 storage daemon does not support direct VFS ac-

cess. Instead, the Direct-pNFS data servers simulate direct storage access by way of the

existing PVFS2 client and the loopback device. The PVFS2 client on the data servers

functions solely as a conduit between the NFSv4 server and the PVFS2 storage server on the same node.

My Direct-pNFS prototype uses special NFSv4 StateIDs for access to the data serv-

ers, round-robin striping as its aggregation scheme, and the GETDEVLIST, LAYOUTGET, and LAYOUTCOMMIT pNFS operations. A layout pertains to an en-

tire file, is stored in the file’s inode, and is valid for the lifetime of the inode.
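The following runnable C sketch summarizes this data path. Every function is a stub that merely prints the protocol step it stands for; none of the names correspond to real kernel or RPC interfaces, and the call sequence is only a simplified reading of the prototype description above:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

struct layout { int valid; };          /* whole-file layout cached in the inode */
static struct layout file_layout;

static struct layout *get_layout(void)
{
    if (!file_layout.valid) {
        puts("GETDEVLIST   -> learn the set of data servers");
        puts("LAYOUTGET    -> fetch the whole-file, file-based layout");
        file_layout.valid = 1;         /* valid for the lifetime of the inode */
    }
    return &file_layout;
}

static void direct_write(uint64_t offset, size_t len)
{
    get_layout();
    /* The layout maps (offset, len) to a data server; the write goes there
     * directly over NFSv4 using the special stateID and is left unstable. */
    printf("WRITE        -> data server for offset %llu, %zu bytes (unstable)\n",
           (unsigned long long)offset, len);
}

static void app_fsync(void)
{
    /* Data reaches stable storage only on fsync or close. */
    puts("COMMIT       -> data servers flush buffered data to stable storage");
    puts("LAYOUTCOMMIT -> metadata server records updated file attributes");
}

int main(void)
{
    direct_write(0, 2 << 20);          /* two 2 MB application writes */
    direct_write(2 << 20, 2 << 20);
    app_fsync();
    return 0;
}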

7.5. Evaluation

In this section I assess the performance and I/O workload versatility of Direct-pNFS. I

first use the IOR micro-benchmark [146] to demonstrate the scalability and performance

of Direct-pNFS compared with PVFS2, pNFS file-based layout with two and three tiers,

and NFSv4. To explore the versatility of Direct-pNFS, I use two scientific I/O bench-

marks and two macro benchmarks to represent a variety of access patterns to large stor-

age systems:

NAS Parallel Benchmark 2.4 – BTIO. The NAS Parallel Benchmarks (NPB) are used

to evaluate the performance of parallel supercomputers. The BTIO benchmark is based

on a CFD code that uses an implicit algorithm to solve the 3D compressible Navier-

Stokes equations. I use the class A problem set, which uses a 64x64x64 grid, performs


200 time steps, checkpoints data every five time steps, and generates a 400 MB check-

point file. The benchmark uses MPI-IO collective file operations to ensure large write

requests to the storage system. All parameters are left as default.

ATLAS Application. ATLAS [148] is a particle physics experiment that seeks new dis-

coveries in head-on collisions of high-energy protons using the Large Hadron Collider

accelerator [149] under construction at CERN. The ATLAS simulation runs in four

stages; the Digitization stage simulates detector data generation. With 500 events,

Digitization spreads approximately 650 MB randomly over a single file. Each cli-

ent writes to a separate file. More information regarding ATLAS can be found in Section

6.3.3.

OLTP: OLTP models a database workload as a series of transactions on a single large

file. Each transaction consists of a random 8 KB read, modify, and write. Each client

performs 20,000 transactions, with data sent to stable storage after each transaction.
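A minimal sketch of one such client in C, using plain POSIX calls; the file name, random seed, and file size are illustrative assumptions, and this is not the actual benchmark code:

#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK      8192            /* 8 KB read-modify-write unit           */
#define NUM_TRANS  20000           /* transactions per client               */
#define FILE_SIZE  (1ULL << 30)    /* illustrative 1 GB database file       */

int main(void)
{
    char buf[BLOCK];
    /* Assumes oltp.dat already exists and is at least FILE_SIZE bytes. */
    int fd = open("oltp.dat", O_RDWR);
    if (fd < 0)
        return 1;

    srand(42);
    for (int i = 0; i < NUM_TRANS; i++) {
        /* Pick a random block-aligned offset within the file. */
        uint64_t off = ((uint64_t)rand() % (FILE_SIZE / BLOCK)) * BLOCK;

        if (pread(fd, buf, BLOCK, (off_t)off) != BLOCK)   /* read   */
            break;
        memset(buf, i & 0xff, BLOCK);                     /* modify */
        if (pwrite(fd, buf, BLOCK, (off_t)off) != BLOCK)  /* write  */
            break;
        fsync(fd);   /* data sent to stable storage after each transaction */
    }
    close(fd);
    return 0;
}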

Postmark: The Postmark benchmark simulates metadata and small I/O intensive applica-

tions such as electronic mail, NetNews, and Web-based services [145]. Postmark per-

forms transactions on a large number of small randomly sized files (between 1 KB and

500 KB). Each transaction first deletes, creates, or opens a file, then reads or appends 512 bytes. Data are sent to stable storage before the file is closed. Postmark performs 2,000 transactions on 100 files in 10 directories. All other parameters are left as default.

Figure 7.5: Direct-pNFS prototype architecture with the PVFS2 parallel file system. The PVFS2 metadata server converts the PVFS2 layout into a pNFS file-based layout, which is passed to the pNFS server and then to the Direct-pNFS file-based layout driver. The pNFS data server uses the PVFS2 client as a conduit to retrieve data from the local PVFS2 storage server. Data servers do not communicate.

7.5.1. Experimental setup

All experiments use a sixteen-node cluster connected via Gigabit Ethernet with jumbo

frames. To ensure a fair comparison between architectures, we keep the number of nodes

and disks in the back end constant. The PVFS2 1.5.1 file system has six storage nodes,

with one storage node doubling as a metadata manager, and a 2 MB stripe size. The

pNFS three-tier architecture uses three NFSv4 servers and three PVFS2 storage nodes.

For the three-tier architecture, we move the disks from the data servers to the storage

nodes. All NFS experiments use eight server threads and 2 MB wsize and rsize. All

nodes run Linux 2.6.17.

Storage System: Each PVFS2 storage node is equipped with dual 1.7 GHz P4 proces-

sors, 2 GB of memory, one Seagate 80 GB 7200 RPM hard drive with Ultra ATA/100 inter-

face and 2 MB cache, and one 3Com 3C996B-T Gigabit Ethernet card.

Client System: Client nodes one through seven are equipped with dual 1.3 GHz P3

processors, 2 GB of memory, and an Intel Pro Gigabit Ethernet card. Client nodes eight and

nine have the same configuration as the storage nodes.

7.5.2. Scalability and performance

Our first set of experiments uses the IOR benchmark to compare the scalability and

performance of Direct-pNFS, PVFS2, pNFS file-based layout with two and three tiers,

and NFSv4. In the first set of experiments, clients sequentially read and write separate

500 MB files. In the second set of experiments, clients sequentially read and write a dis-

joint 500 MB portion of a single file. To view the effect of I/O request size on perform-

ance, the experiments use a large block size (2 to 4 MB) and a small block size (8 KB).

Read experiments use a warm server cache. The presented value is the average over sev-

eral executions of the benchmark.

Figure 7.6a and Figure 7.6b display the maximum aggregate write throughput with

separate files and a single file. Direct-pNFS matches the performance of PVFS2, reach-


ing a maximum aggregate write throughput of 119.2 MB/s and 110 MB/s for separate and

single file experiments, respectively.

pNFS-3tier write performance levels off at 83 MB/s with four clients. pNFS-3tier

must split the six available servers between data servers and storage nodes, which cuts

the maximum network bandwidth in half relative to the network bandwidth for the other

pNFS and PVFS2 architectures. In addition, using two disks in each storage node does

not offer twice the disk bandwidth of a single disk due to the constant level of CPU,

memory, and bus bandwidth.

Lacking direct data access, pNFS-2tier incurs a write delay and performs slightly

worse than Direct-pNFS and PVFS2. The additional transfer of data between data serv-

ers limits the maximum bandwidth between the pNFS clients and data servers. This is

not visible in Figure 7.6a and Figure 7.6b because network bandwidth exceeds disk

bandwidth, so Figure 7.6c repeats the multiple file write experiments with 100 Mbps

Ethernet. With this change, pNFS-2tier yields only half the performance of Direct-pNFS

and PVFS2, clearly demonstrating the network bottleneck of the pNFS-2tier architecture.

NFSv4 performance is unaffected by the number of clients, indicating a single server

bottleneck.

Figure 7.6: Direct-pNFS aggregate write throughput. (a) and (b) With a separate or single file and a large block size, Direct-pNFS scales with PVFS2 while pNFS-2tier suffers from a lack of direct file access. pNFS-3tier and NFSv4 are CPU limited. (c) With separate files and 100 Mbps Ethernet, pNFS-2tier is bandwidth limited due to its need to transfer data between data servers. (d) and (e) With a separate or single file and an 8 KB block size, all NFSv4 architectures outperform PVFS2.


Figure 7.6d and Figure 7.6e display the aggregate write throughput with separate files

and a single file using an 8 KB block size. The performance for all NFSv4-based archi-

tectures is unaffected in the large block size experiments due to the NFSv4 client write

back cache, which combines write requests until they reach the NFSv4 wsize (2 MB in

my experiments). However, the performance of PVFS2, a parallel file system designed

for large I/O, decreases dramatically with small block sizes, reaching a maximum aggre-

gate write throughput of 39.4 MB/s.

Figure 7.7a and Figure 7.7b display the maximum aggregate read throughput with

separate files and a single file. With separate files, Direct-pNFS matches the perform-

ance of PVFS2, reaching a maximum aggregate read throughput of 509 MB/s and 482

MB/s. With a single file, PVFS2 has lower throughput than Direct-pNFS with only a few

clients, but outperforms Direct-pNFS with eight clients, reaching a maximum aggregate

read throughput of 530.7 MB/s.

Direct-pNFS places the NFSv4 and PVFS2 server modules on the same node, inher-

ently placing a higher demand on server resources. In addition, PVFS2 uses a fixed

number of buffers to transfer data between the kernel and the user-level storage daemon, which creates an additional bottleneck. This is evident in Figure 7.7b, where PVFS2 achieves a higher aggregate I/O throughput than Direct-pNFS.

Figure 7.7: Direct-pNFS aggregate read throughput. (a) With separate files and a large block size, Direct-pNFS outperforms PVFS2 for some numbers of clients. pNFS-2tier and pNFS-3tier are bandwidth limited due to a lack of direct file access. NFSv4 is bandwidth and CPU limited. (b) With a single file and a large block size, PVFS2 eventually outperforms Direct-pNFS due to a prototype software limitation. pNFS-2tier and pNFS-3tier are bandwidth limited due to a lack of direct file access. NFSv4 is CPU limited. (c) and (d) With a separate or single file and an 8 KB block size, all NFSv4 architectures outperform PVFS2.

The division of the six available servers between data servers and storage nodes in

pNFS-3tier limits its maximum performance again, achieving a maximum aggregate

bandwidth of only 115 MB/s. NFSv4 aggregate performance is flat, limited to the band-

width of a single server.

The pNFS-2tier bandwidth bottleneck is readily visible in Figure 7.7a and Figure

7.7b, where disk bandwidth is no longer a factor. Each data server is responding to client

read requests and transferring data to other data servers so they can satisfy their client

read requests. Sending data to multiple targets limits each data server’s maximum read

bandwidth.

Figure 7.7c and Figure 7.7d display the aggregate read throughput with separate files

and a single file using an 8 KB block size. The performance for all NFSv4-based archi-

tectures is unchanged from the large block size experiments due to NFSv4 client read

gathering. The performance of PVFS2 again decreases dramatically with small block

sizes, reaching a maximum aggregate read throughput of 51 MB/s.

7.5.3. Micro-benchmark discussion

Direct-pNFS matches or outperforms the aggregate I/O throughput of PVFS2. In ad-

dition, the asynchronous, multi-threaded design of Linux NFSv4 combined with its write

back cache achieves superior performance with smaller block sizes.

In the write experiments, both Direct-pNFS and PVFS2 fully utilize the available disk

bandwidth. In the read experiments, data are read directly from server cache, so the disks

are not a bottleneck. Instead, client and server CPU performance becomes the limiting

factor. The pNFS-2tier architecture offers comparable performance with fewer clients,

but is limited by network bandwidth as I increase the number of clients. The pNFS-3tier

architecture demonstrates that using intermediary data servers to access data is ineffi-

cient: those resources are better used as storage nodes.

The remaining experiments further demonstrate the versatility of Direct-pNFS with

workloads that use a range of block sizes.


7.5.4. Scientific application benchmarks

This section uses two scientific benchmarks to assess the performance of Direct-

pNFS in high-end computing environments.

7.5.4.1 ATLAS

To evaluate ATLAS Digitization write throughput, I use IOZone to replay the

write trace data for 500 events. Each client writes to a separate file.

Figure 7.8a shows that Direct-pNFS can manage efficiently the mix of small and

large write requests, achieving an aggregate write throughput of 102.5 MB/s with eight

clients. While small write requests reduce the maximum write throughput achievable by

Direct-pNFS by approximately 14 percent, they severely reduce the performance of

PVFS2, which achieves only 41 percent of its maximum aggregate write throughput.

Figure 7.8: Direct-pNFS scientific and macro benchmark performance. (a) ATLAS. Direct-pNFS outperforms PVFS2 with a small and large write request workload. (b) BTIO. Direct-pNFS and PVFS2 achieve comparable performance with a large read and write workload. Lower time values are better. (c) OLTP. Direct-pNFS outperforms PVFS2 with an 8 KB read-modify-write request workload. (d) Postmark. Direct-pNFS outperforms PVFS2 in a small read and append workload.


7.5.4.2 NAS Parallel Benchmark 2.4 – BTIO

The Block-Tridiagonal I/O benchmark is an industry standard for measuring the I/O per-

formance of a cluster. Without optimization, BTIO I/O requests are small, ranging from

a few hundred bytes to eight kilobytes. The version of the benchmark used in this chap-

ter uses MPI-IO collective buffering [80], which increases the I/O request size to one MB

and greater. The benchmark times also include the ingestion and verification of the result

file.

BTIO performance experiments are shown in Figure 7.8b. BTIO running time is ap-

proximately the same for Direct-pNFS and PVFS2, with a maximum difference of five

percent with nine clients.

7.5.5. Synthetic workloads

This section uses two macro-benchmarks to analyze the performance of Direct-pNFS

in a more general setting.

7.5.5.1 OLTP

Figure 7.8c displays the OLTP experimental results. Direct-pNFS scales well with

the workload’s random 8 KB read-modify-write transactions, achieving 26 MB/s with

eight clients. As expected, PVFS2 performs poorly with small I/O requests, achieving an

aggregate I/O throughput of 6 MB/s.

7.5.5.2 Postmark

For the small I/O workload of Postmark, I reduce the stripe size, wsize, and rsize to

64 KB. This allows a more even distribution of requests among the storage nodes.

The Postmark experiments are shown in Figure 7.8d, with results given in transac-

tions per second. Direct-pNFS again leverages the asynchronous, multi-threaded Linux

NFSv4 implementation, designed for small I/O intensive workloads like Postmark, to

perform up to 36 times as many transactions per second as PVFS2.


7.5.6. Macro-benchmark discussion

This set of experiments demonstrates that Direct-pNFS performance compares well to

the exported parallel file system with the large I/O scientific application benchmark

BTIO. Direct-pNFS performance for ATLAS, for which 95% of the I/O requests are

smaller than 275 KB, far surpasses native file system performance. The Postmark and

OLTP benchmarks, also dominated by small I/O, yield similar results.

With Direct-pNFS demonstrating good performance on small I/O workloads, a natu-

ral next step is to explore performance with routine tasks such as a build/development

environment. Following the SSH build benchmark [159], I created a benchmark that un-

compresses, configures, and builds OpenSSH [160]. Using the same systems as above, I

compare the SSH build execution time using Direct-pNFS and PVFS2.

I find that Direct-pNFS reduces compilation time, a stage heavily dominated by small

read and write requests, but increases the time to uncompress and configure OpenSSH,

stages dominated by file creates and attribute updates. Tasks like file creation—

relatively simple for standalone file systems—become complex on parallel file systems,

leading to a lot of inter-node communication. Consequently, many parallel file systems

distribute metadata over many nodes and have clients gather and reconstruct the informa-

tion, relieving the overloaded metadata server. The NFSv4 metadata protocol relies on a

central metadata server, effectively recentralizing the decentralized parallel file system

metadata protocol. The sharp contrast in metadata management cost between NFSv4 and

parallel file systems—beyond the scope of this dissertation—merits further study.

7.6. Related work

Several pNFS layout drivers are under development. At this writing, Sun Microsys-

tems, Inc. is developing file- and object-based layout implementations. Panasas object and EMC block layout drivers are also in progress. pNFS file-based layout drivers

with the architecture in Figure 7.1 have been demonstrated with GPFS, Lustre, and

PVFS2.

Network Appliance is using the Linux file-based layout driver to bind disparate filers

(NFS servers) into a single file system image. The architecture differs from Figure 7.1 in


that the filers are not fully connected to storage; each filer is a standalone server attached

to Fibre Channel disk arrays. This continues previous work that aggregates partitioned

NFS servers into a single file system image [91, 113, 114]. Direct-pNFS generalizes

these architectures to be independent of the underlying parallel file system.

The Storage Resource Broker (SRB) [130] aggregates storage resources, e.g., a file

system, an archival system, or a database, into a single data catalogue. The HTTP proto-

col is the most common and widespread way to access remote data stores. SRB and

HTTP also have some limits: they do not enable parallel I/O to multiple storage endpoints

and do not integrate with the local file system.

EMC’s HighRoad [103] uses the NFS or CIFS protocol for its control operations and

stores data in an aggregated LAN and SAN environment. Its use of file semantics facili-

tates data sharing in SAN environments, but is limited to the EMC Symmetrix storage

system. A similar, non-commercial version is also available [141].

Another commodity protocol used along the high-performance data channel is the

Object Storage Device (OSD) command set, which transmits variable length storage ob-

jects over SCSI transports. Currently OSD can be used to access only OSD-based file

systems, and it cannot be used for file system independent remote data access.

GPFS-WAN [134, 135], used extensively in the TeraGrid [127], features exceptional

throughput across high-speed, long haul networks, but is focused on large I/O transfers

and is restricted to GPFS storage systems.

GridFTP [4] is also used extensively in Grid computing to enable high I/O through-

put, operating system independence, and secure WAN access to high-performance file

systems. Successful and popular, GridFTP nevertheless has some serious limitations: it

copies data instead of providing shared access to a single copy, which complicates its

consistency model and decreases storage capacity; it lacks direct data access and a global

namespace; and it runs as an application that cannot be accessed as a file system without oper-

ating system modification.

Distributed replicas can be vital in reducing network latency when accessing data.

Direct-pNFS is not intended to replace GridFTP, but to work alongside it. For example,

in tiered projects such as ATLAS at CERN, GridFTP remains a natural choice for long-

haul scheduled transfers among the upper tiers, while the file system semantics of Direct-


pNFS offers advantages in the lower tiers by letting scientists work with files directly,

promoting effective data management.

7.6.1. Small I/O performance

Log-structured file systems [151] increase the size of writes by appending small I/O

requests to a log and later flushing the log to disk. Zebra [152] extends this to distributed

environments. Side effects include large data layouts and erratic block sizes.

The Vesta parallel file system [153] improves I/O performance by using workload

characteristics provided by applications to optimize data layout on storage. Providing

this information can be difficult for applications that lack regular I/O patterns or whose

I/O access patterns change over time.

Both the Lustre [104] and the Panasas ActiveScale [105] file systems use a write-

behind cache to perform buffered writes. In addition, Lustre allows clients to place small

files on a single storage node to reduce access overhead.

Implementations of MPI-IO such as ROMIO [82] use application hints and file access

patterns to improve I/O request performance. The work reported here benefits and com-

plements MPI-IO and its implementations. MPI-IO is useful to applications that use its

API and have regular I/O access patterns, e.g., strided I/O, but MPI-IO small write per-

formance is limited by the deficiencies of the underlying parallel file system. Direct-

pNFS is beneficial for existing and unmodified applications. It is also beneficial at the

file system layer of MPI-IO implementations, to improve the performance of the underly-

ing parallel file system.

7.7. Conclusion

Universal and transparent data access is a critical enabling feature for high-

performance access to storage. Direct-pNFS enhances the heterogeneity and transpar-

ency of pNFS by using an unmodified NFSv4 client to support high-performance remote

data access. Experiments demonstrate that a commodity storage protocol can match the

I/O throughput of the specialized parallel file system that it exports. Furthermore, Direct-


pNFS also “scales down” to outperform the parallel file system client in diverse work-

loads.


CHAPTER VIII

Summary and Conclusion

This chapter summarizes my dissertation research on high-performance remote data

access and discusses possible extensions of the work.

8.1. Summary and supplemental remarks

Scientific collaboration requires scalable and widespread access to massive data sets.

This dissertation introduces data access architectures that use the NFSv4 distributed file

system to realize levels of scalability, performance, security, heterogeneity, transparency,

and independence that were heretofore unrealized.

Parallel file systems manage massive data sets by scaling disks, disk controllers, net-

work, and servers—every aspect of the system architecture. As the number of hardware

components increases, the difficulty of locating, managing, and protecting data grows

rapidly. Parallel file systems consist of a large number of increasingly complex and in-

terconnected components, e.g., a metadata service, data and control components, and

storage. As described in this dissertation, these components manage many different as-

pects of the system, including file system control information, data, metadata, data shar-

ing, security, and fault tolerance.

Providing distributed access to parallel file systems further increases the complexity

of interacting components. Additional components can include new data and control

components as well as a separate metadata service. Distributed access components lack

direct and coherent integration with parallel file system components, increasing overhead

and reducing system performance. For example, client/server-based distributed file sys-

tems require a server translation component to convert remote file system requests into

parallel file system requests.


This dissertation analyzed the organization and properties of distributed and parallel

file system components and how these components influence the performance and capa-

bilities of remote data access. For example, tight integration of their separate metadata

services can provide performance benefits through reduced communication and overhead,

but decreases the level of independence between the file systems and increases the devel-

opment effort required to support remote data access. Similarly, the placement (and/or

existence) of both data components can affect the performance, heterogeneity, transpar-

ency, and security levels of remote data access.

Chapter IV discusses possible remote data access architectures with existing and un-

modified parallel file system components. Lack of access to detailed parallel file system

information prevents the distributed and parallel file system components from optimizing

their interaction. I introduce Split-Server NFSv4, which scales I/O throughput by spread-

ing client load among existing ports of entry into the parallel file system. Split-Server

NFSv4 retains NFSv4 semantics and offers heterogeneous, secure, transparent, and inde-

pendent file system access.

The Split-Server NFSv4 prototype raises several issues with accessing existing and

unmodified parallel file systems, including:

1. Parallel file systems target single-threaded applications such as scientific MPI

codes. As a result, some parallel file systems serialize data access among threads

in a multi-threaded application. Overlooking the distinction between competing

and non-competing components can considerably reduce performance.

2. A similar issue arises when clients spread I/O requests among data servers. The

parallel file system’s failure to adapt to an understanding that the I/O requests are

from a single application leads to unnecessary coordination and communication.

In addition, synchronization of file attributes between data servers degrades re-

mote access performance.

3. Block size mismatches unnecessarily tax storage resources and delay application

requests.

4. Distributing I/O requests across multiple data servers inhibits the underlying file

system’s ability to perform effective readahead.


5. Overlapping metadata protocols lead to a communication ripple effect that in-

creases overhead and delay for remote metadata requests.

Chapter V investigates a remote data access architecture that allows full access to the

information stored by parallel file system components. This enables a distributed file

system to utilize resources using the same information parallel file systems use to scale to

large numbers of clients. I demonstrate that NFSv4 can use this information along with

the storage protocol of the underlying file system to increase performance and scalability

while retaining file system independence. In combination, the pNFS protocol, storage-

specific layout drivers, and some parallel file system customizations can overcome the

above I/O inefficiencies and enable customizable security and data access semantics.

The pNFS prototype matches the performance of the underlying parallel file system and

demonstrates that efficient layout generation is vital to achieve continuous scalability.

While components in distributed and parallel file systems may provide similar ser-

vices, their capabilities can be vastly different. Chapter VI demonstrates that pNFS can

increase the overall write throughput to parallel data stores—regardless of file size—by

using direct, parallel I/O for large write requests and a distributed file system for small

write requests. The large buffers, limited asynchrony, and high per-request overhead in-

herent to parallel file systems scuttle small I/O performance. By completely isolating and

separating the control and data protocols in pNFS, a single file system can use any com-

bination of storage and metadata protocols, each excelling at specific workloads or sys-

tem environments. Use of multiple storage protocols increases the overall write perform-

ance of the ATLAS Digitization application by 57 to 100 percent.

Beyond the interaction of distributed and parallel file system components, the physi-

cal location of a component can affect overall system capabilities. Chapter VII analyzes

the cost of having pNFS clients support a parallel file system data component. A storage-

specific layout driver must be developed for every platform and operating system, reduc-

ing heterogeneity and widespread access to parallel file systems. In addition, requiring

support for semantics that are layout driver and storage protocol dependent, e.g., security,

client caching, and fault tolerance, reduces the data access transparency of NFSv4.

To increase the heterogeneity and transparency of pNFS, Direct-pNFS removes the

requirement that clients incorporate parallel file system data components. Direct-pNFS


uses the NFSv4 storage protocol for direct access to NFSv4-enabled parallel file system

storage nodes. A single layout driver potentially reduces development effort while re-

taining NFSv4 file access and security semantics. To perform heterogeneous data access

with only a single layout driver, parallel file-system specific data layout information is

converted into the standard pNFS file-based layout format. Pluggable aggregation driv-

ers provide support for most file distribution schemes. While aggregation drivers can

limit widespread data access, their development effort is likely to be less than for a layout

driver and they can be shared across storage systems. Direct-pNFS experiments demon-

strate that a commodity storage protocol can match the I/O throughput of the exported

parallel file system. Furthermore, Direct-pNFS leverages the small I/O strengths of

NFSv4 to outperform the parallel file system client in diverse workloads.

The pNFS extensions to the NFSv4 protocol are included in the upcoming NFSv4.1

minor version specification [123]. Implementation of NFSv4.1 is under way on several

major operating systems, bringing effective global data access closer to reality.

8.2. Supplementary observations

From a practical standpoint, this dissertation can be used as a guide for capacity plan-

ning decisions. Organizations often allocate limited hardware resources to satisfy local

data access requirements; adding resources for remote access is an afterthought. Used

during the planning and acquisition phases, the analyses in this dissertation can serve as a

guide for improving remote data access.

Unfortunately, the storage community suffers from a narrow vision of data access.

As this dissertation explains, many parallel file system providers continue to believe that

a storage solution can exist in isolation. This idea is quickly becoming old-fashioned.

Specialization of parallel file systems limits the availability of data and reduces collabo-

ration, a vital part of innovation [161]. In addition, the bandwidth available for remote

access across the WAN is continuing to increase, with global collaborations now using

multiple ten Gigabit Ethernet networks [162].

Storage solutions need a holistic approach that accounts for every data access per-

sona. Each persona has a specific I/O workload, set of supported operating systems, and

required level of scalability, performance, transparency, security, and data sharing. For


example, the data access requirements of compute clusters, archival systems, and indi-

vidual users (local and remote) are all different, but they all need to access the same stor-

age system. Specializing file systems for a single persona widens the gap between appli-

cations and data.

pNFS attempts to address these diverse requirements by combining the strengths of

NFSv4 with direct storage access. As a result, pNFS uses NFSv4 semantics, which in-

clude a level of fault tolerance as well as close-to-open cache consistency. As described

in Section 7.4, NFSv4 fault tolerance semantics can decrease write performance by push-

ing data aggressively onto stable storage. In addition, some applications or data access

techniques, e.g., collective I/O, require more control over the NFSv4 client data cache.

Mandating that an application must close and re-open a file to refresh its data cache (or

acquire a lock on the file, which has the same effect) can increase delay and reduce per-

formance. Maintaining consistent semantics across remote parallel file systems is impor-

tant, but a distributed file system should allow applications to tune these semantics to

meet their needs.

8.3. Beyond NFSv4

This dissertation focuses on the Network File System (NFS) protocol, which is dis-

tinguished by its precise definition by the IETF, availability of open source implementa-

tions, and support on virtually every modern operating system. It is natural to ask

whether the scalability and performance benefits of the pNFS architecture can be realized

by other distributed file systems such as AFS or CIFS. In other words, is it possible to

design and engineer pAFS, pCIFS, or even pNFSv3? If so, what are the necessary re-

quirements of a distributed file system that permit this transformation?

To answer these questions, let us first review the pNFS architecture:

1. pNFS client. pNFS extends the standard NFSv4 client by delegating application

I/O requests to a storage-specific layout driver. Either the pNFS client or each in-

dividual layout driver can manage and cache layout information.

2. pNFS server. pNFS extends the standard NFSv4 server with the ability to relay

client file layout requests to the underlying parallel file system and respond with

the resultant opaque layout information. In addition, the pNFS server tracks out-


standing layout information so that it can be recalled in case a file is renamed or

its layout information is modified.5

3. pNFS metadata protocol. The base NFSv4 metadata protocol enables clients

and servers to request, update, and recall file system and file metadata informa-

tion, e.g., vnode information, and request file locks. pNFS extends this protocol

to request, return, and recall file layout information.
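One hypothetical way to express this division of labor in code is a table of layout-driver operations that the generic client invokes; the names in the C sketch below are illustrative assumptions and do not reflect any particular NFSv4.1 implementation's interface:

#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical layout-driver operations table, illustrating how a generic
 * pNFS client can delegate I/O to storage-specific code while the metadata
 * protocol remains storage-agnostic. */
struct pnfs_layout;                       /* opaque to the generic client   */

struct layoutdriver_ops {
    /* Decode the opaque layout returned by LAYOUTGET. */
    struct pnfs_layout *(*alloc_layout)(const void *opaque, size_t len);
    void (*free_layout)(struct pnfs_layout *lo);

    /* Perform I/O directly against storage using the decoded layout. */
    ssize_t (*read) (struct pnfs_layout *lo, void *buf, size_t len, uint64_t off);
    ssize_t (*write)(struct pnfs_layout *lo, const void *buf, size_t len, uint64_t off);

    /* Called when the server recalls a layout (rename, restriping, ...). */
    void (*return_layout)(struct pnfs_layout *lo);
};

/* A file-based driver, an object driver, and a block driver would each
 * provide their own table; the generic client code never changes. */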

Fundamentally, pNFS extends NFSv4 in its ability to retrieve, manage, and utilize file

layout information. Any distributed file system with a client, server, and a metadata pro-

tocol that can be extended to retrieve layout information is a candidate for pNFS trans-

formation (although the implementation details will vary). A single server is not neces-

sary along the control path, but a distributed file system must have the ability to fulfill

metadata requests. For example, some file systems do not centralize file size on a meta-

data server, but dynamically build the file size through queries to the data servers.
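For instance, with round-robin striping the logical file size can be reconstructed from the per-node object lengths. The C sketch below illustrates that calculation under the assumptions already described; it is not any particular file system's code:

#include <stdint.h>

/* Logical end offset contributed by one data server, given the length of
 * its local object. Node `node` holds global stripe units node,
 * node+num_nodes, node+2*num_nodes, ... */
static uint64_t logical_end(uint64_t local_len, uint32_t node,
                            uint32_t stripe_unit, uint32_t num_nodes)
{
    if (local_len == 0)
        return 0;
    uint64_t full  = local_len / stripe_unit;   /* complete local stripe units  */
    uint64_t rem   = local_len % stripe_unit;   /* bytes in a last, partial unit */
    uint64_t last  = rem ? full : full - 1;     /* index of the node's last unit */
    uint64_t bytes = rem ? rem : stripe_unit;   /* bytes held in that unit       */

    return (last * num_nodes + node) * (uint64_t)stripe_unit + bytes;
}

/* File size is the largest logical end offset across all data servers. */
static uint64_t file_size(const uint64_t *local_len, uint32_t num_nodes,
                          uint32_t stripe_unit)
{
    uint64_t size = 0;
    for (uint32_t i = 0; i < num_nodes; i++) {
        uint64_t end = logical_end(local_len[i], i, stripe_unit, num_nodes);
        if (end > size)
            size = end;
    }
    return size;
}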

pNFS does not rely on distributed file system support of a storage protocol since it

isolates the storage protocol in the layout driver. However, it is important to note that

Split-Server NFSv4, Direct-pNFS, and file-based pNFS use the NFSv4 storage protocol

along the data path, and a distributed file system must support its own storage protocol to

realize the benefits of these architectures.

The stateful nature of NFSv4, which enables server callbacks to the clients, is not

strictly required to support the pNFS protocol. Instead, a layout driver could discover

that its layout information is invalid through error messages returned by the data servers.

Unfortunately, block-based data servers such as Fibre Channel disks cannot run NFSv4

servers and therefore cannot return pNFS error codes. This would prohibit stateless dis-

tributed file systems, such as NFSv3, from supporting block-based layout drivers.

A core benefit of pNFS is its ability to provide remote access to parallel file systems.

pNFS achieves this level of independence by leveraging the NFS Vnode definition and

VFS interface. DFS, SMB, and the AppleTalk Filing Protocol all use their own mechanisms to achieve this level of independence. Other distributed file systems, such as AFS, do not currently support remote data access to existing data stores.6

5 File layout information can change if it is restriped due to load balancing, lack of free disk space on a particular data server, or server intervention.

Most NFSv4 implementations also include additional features such as locks, reada-

head, a write back cache, and a data cache. These features provide many benefits but are

unrelated to the support of the pNFS protocol.

8.4. Extensions

This section presents some research themes that hold the potential to further improve

the overall performance of remote data access.

8.4.1. I/O performance

MPI is the dominant tool for inter-node communication while MPI-IO is the nascent

tool for cluster I/O. Native support for MPI-IO implementations (e.g., ROMIO) by re-

mote access file systems such as pNFS is critical. Figure 8.1 offers an example of an in-

ter-cluster data transfer over the WAN. The application cluster is running an MPI appli-

cation that wants to read a large amount of data from the server cluster and perhaps write

to its backend. The MPI head node obtains the data location from the server cluster and

distributes portions of the data location information (via MPI) to other application cluster

nodes, enabling direct access to server cluster storage devices. The application then uses

MPI-IO to read data in parallel from the server cluster across the WAN, processes the

data, and directs output to the application cluster backend. A natural use case for this ar-

chitecture is a visualization application processing the results of a scientific application

run on the server cluster. Another use case is an application making a local copy of data

from the server cluster on the application cluster.
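A minimal MPI-IO sketch of the read side of this scenario, assuming the remote file system is reachable at a hypothetical path on every application-cluster node; the collective read lets each rank pull a disjoint slice in parallel:

#include <mpi.h>
#include <stdlib.h>

/* Each rank reads a disjoint slice of a remote file in parallel.
 * The mount path and slice size are hypothetical examples. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const MPI_Offset slice = 64 << 20;            /* 64 MB per rank */
    char *buf = malloc(slice);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/remote/pnfs/results.dat",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    /* Collective read: ranks fetch disjoint regions concurrently, so the
     * aggregate transfer is spread across the server cluster's data servers. */
    MPI_File_read_at_all(fh, rank * slice, buf, (int)slice, MPI_BYTE,
                         MPI_STATUS_IGNORE);

    MPI_File_close(&fh);

    /* ... process the data, then write results to the local backend ... */

    free(buf);
    MPI_Finalize();
    return 0;
}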

Regarding NFSv4, several improvements to the protocol could be beneficial to appli-

cations. For example, List I/O support has shown great application performance benefits

with other storage protocols [163, 164]. In addition, many NFSv4 implementations have a maximum data transfer size of 64 KB to 1 MB, while many parallel file systems support a maximum transfer size of 4 MB and larger. Efficient implementation of larger NFSv4 data transfer sizes is vital for high-performance access to parallel file systems.

6 HostAFSd, currently under development, will allow AFS servers to export the local file system.

Finally, this dissertation uses a distributed file system to improve I/O throughput to

parallel file systems, but leaves the architecture of the parallel file system largely un-

touched. Parallel file system architecture modifications may also improve remote data

access performance, e.g., data and metadata replication and caching [165, 166], and vari-

ous readahead algorithms.

8.4.2. Metadata management

This dissertation focuses on improving I/O throughput for scientific applications.

Some research has focused on improving metadata performance in high-performance

computing [116, 167, 168], but most parallel file systems implement proprietary metadata

management schemes.7

Figure 8.1: pNFS and inter-cluster data transfers across the WAN. A pNFS cluster retrieves data from a remote storage system, processes the data, and writes to its local storage system. The MPI head node distributes layout information to pNFS clients.

To offload overburdened metadata servers, many parallel file systems distribute metadata information among multiple servers and employ their clients to perform tasks that involve communication with storage, e.g., file creation. Most distributed file systems use a single metadata server, creating a metadata management bottleneck that threatens to

eliminate the benefits of the parallel file system’s metadata management distribution.

Stateful distributed file servers only exacerbate the problem. Additional research is

needed to explore how distributed file systems can optimize metadata management with-

out introducing these new bottlenecks and inefficiencies.

7 I am unaware of any additional research beyond this dissertation into the interaction between distributed and parallel file system metadata subsystems.


Bibliography

[1] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon, "Design and Im-plementation of the Sun Network Filesystem," in Proceedings of the Summer USENIX Technical Conference, Portland, OR, 1985.

[2] Common Internet File System File Access Protocol (CIFS), msdn.microsoft.com/library/en-us/cifs/protocol/cifs.asp.

[3] S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M. Eisler, and D. Noveck, NFS Version 4 Protocol Specification. RFC 3530, 2003.

[4] B. Allcock, J. Bester, J. Bresnahan, A.L. Chervenak, I. Foster, C. Kesselman, S. Meder, V. Nefedova, D. Quesnal, and S. Tuecke., "Data Management and Transfer in High-Performance Computational Grid Environments," Parallel Computing, 28(5):749-771, 2002.

[5] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker, "Query by Image and Video Content: The QBIC System," IEEE Computer, 28(9):23-32, 1995.

[6] Biowulf at the NIH, biowulf.nih.gov/apps/blast.

[7] S. Berchtold, C. Boehm, D.A. Keim, and H. Kriegel, "A Cost Model For Nearest Neighbor Search in High-Dimensional Data Space," in Proceedings of the 16th ACM PODS Symposium, Tucson, AZ, 1997.

[8] P. Caulk, "The Design of a Petabyte Archive and Distribution System for the NASA ECS Project," in Proceedings of the 4th NASA Goddard Conference on Mass Storage Systems and Technologies, College Park, MD, 1995.

[9] J. Behnke and A. Lake, "EOSDIS: Archive and Distribution Systems in the Year 2000," in Proceedings of the 17th IEEE/8th NASA Goddard Conference on Mass Storage Systems and Technologies, College Park, MD, 2000.

[10] J. Behnke, T.H. Watts, B. Kobler, D. Lowe, S. Fox, and R. Meyer, "EOSDIS Petabyte Archives: Tenth Anniversary," in Proceedings of the 22nd IEEE/13th NASA Goddard Conference on Mass Storage Systems and Technologies, Monterey, CA, 2005.


[11] D. Strauss, "Linux Helps Bring Titanic to Life," Linux Journal, 46, 1998.

[12] ASCI Purple RFP, www.llnl.gov/asci/platforms/purple/rfp.

[13] Petascale Data Storage Institute, www.pdl.cmu.edu/PDSI.

[14] Serial ATA Workgroup, "Serial ATA: High Speed Serialized AT Attachment," Rev. 1, 2001.

[15] Adaptec Inc., "Ultra320 SCSI: New Technology-Still SCSI," www.adaptec.com.

[16] T.M. Anderson and R.S. Cornelius, "High-Performance Switching with Fibre Channel," in Proceedings of the 37th IEEE International Conference on COMPCON, San Francisco, CA, 1992.

[17] K. Salem and H. Garcia-Molina, "Disk Striping," in Proceedings of the 2nd Inter-national Conference on Data Engineering, Los Angeles, CA, 1986.

[18] O.G. Johnson, "Three-Dimensional Wave Equation Computations on Vector Com-puters," IEEE Computer, 72(1):90-95, 1984.

[19] D.A. Patterson, G.A. Gibson, and R.H. Katz, "A Case for Redundant Arrays of In-expensive Disks (RAID)," in Proceedings of the ACM SIGMOD Conference on Management of Data, Chicago, IL, 1988.

[20] Small Computer Serial Interface (SCSI) Specification. ANSI X3.131-1986, www.t10.org, 1986.

[21] T10 Committee, "Draft Fibre Channel Protocol - 3 (FCP-3) Standard," 2005, www.t10.org/ftp/t10/drafts/fcp3/fcp3r03d.pdf.

[22] J. Satran, K. Meth, C. Sapuntzakis, M. Chadalapaka, and E. Zeidner, Internet Small Computer Systems Interface (iSCSI). RFC 3720, 2001.

[23] K.Z. Meth and J. Satran, "Design of the iSCSI Protocol," in Proceedings 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies, San Diego, CA, 2003.

[24] J. Postel and J. Reynolds, File Transfer Protocol (FTP). RFC 765, 1985.

[25] B. Callaghan, NFS Illustrated. Essex, UK: Addison-Wesley, 2000.

[26] PASC, "IEEE Standard Portable Operating System Interface for Computer Envi-ronments," IEEE Std 1003.1-1988, 1988.


[27] D.R. Brownbridge, L.F. Marshall, and B. Randell, "The Newcastle Connection or UNIXes of the World Unite!," Software-Practice and Experience, 12(12):1147-1162, 1982.

[28] P.J. Leach, P.H. Levine, J.A. Hamilton, and B.L. Stumpf, "The File System of an Integrated Local Network," in Proceedings of the 13th ACM Annual Conference on Computer Science, New Orleans, LA, 1985.

[29] P.H. Levine, The Apollo DOMAIN Distributed File System. New York, NY: Springer-Verlag, 1987.

[30] P.J. Leach, B.L. Stumpf, J.A. Hamilton, and P.H. Levine, "UIDs as Internal Names in a Distributed File System," in Proceedings of the 1st Symposium on Principles of Distributed Computing, Ottawa, Canada, 1982.

[31] G. Popek, The LOCUS Distributed System Architecture. Cambridge, MA: MIT Press, 1986.

[32] A.P. Rifkin, M.P. Forbes, R.L. Hamilton, M. Sabrio, S. Shah, and K. Yueh, "RFS Architectural Overview," in Proceedings of the Summer USENIX Technical Con-ference, Atlanta, GA, 1986.

[33] S.R. Kleiman, "Vnodes: An Architecture for Multiple File System Types in Sun UNIX," in Proceedings of the Summer USENIX Technical Conference, Altanta, GA, 1986.

[34] R. Srinivasan, RPC: Remote Procedure Call Protocol Specification Version 2. RFC 1831, 1995.

[35] R. Srinivasan, XDR: External Data Representation Standard. RFC 1832, 1995.

[36] Sun Microsystems Inc., NFS: Network File System Protocol Specification. RFC 1094, 1989.

[37] B. Pawlowski, C. Juszczak, P. Staubach, C. Smith, D. Lebel, and D. Hitz, "NFS Version 3 Design and Implementation," in Proceedings of the Summer USENIX Technical Conference, Boston, MA, 1994.

[38] B. Callaghan, B. Pawlowski, and P. Staubach, NFS Version 3 Protocol Specifica-tion. RFC 1813, 1995.

[39] B. Callaghan and T. Lyon, "The Automounter," in Proceedings of the Winter USENIX Technical Conference, San Diego, CA, 1989.

[40] J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satyanarayanan, R. Sidebotham, and M. West, "Scale and Performance in a Distributed File System," ACM Transac-tions on Computer Systems, 6(1):51-81, 1988.


[41] J.G. Steiner, C. Neuman, and J.I. Schiller, "Kerberos: An Authentication Service for Open Network Systems," in Proceedings of the Winter USENIX Technical Con-ference, Dallas, TX, 1988.

[42] M. Satyanarayanan, J.J. Kistler, P. Kumar, M.E. Okasaki, E.H. Siegel, and D.C. Steere, "Coda: A Highly Available File System for a Distributed Workstation Envi-ronment," IEEE Transactions on Computers, 39(4):447-459, 1990.

[43] M.L. Kazar, B.W. Leverett, O.T. Anderson, V. Apostolides, B.A. Bottos, S. Chu-tani, C.F. Everhart, W.A. Mason, S. Tu, and R. Zayas, "DEcorum File System Ar-chitectural Overview," in Proceedings of the Summer USENIX Technical Confer-ence, Anaheim, CA, 1990.

[44] S. Chutani, O.T. Anderson, M.L. Kazar, B.W. Leverett, W.A. Mason, and R.N. Sidebotham, "The Episode File System," in Proceedings of the Winter USENIX Technical Conference, Berkeley, CA, 1992.

[45] C. Gray and D. Cheriton, "Leases: an Efficient Fault-tolerant Mechanism for Dis-tributed File Cache Consistency," in Proceedings of the 12th ACM Symposium on Operating Systems Principles, Litchfield Park, AZ, 1989.

[46] G.S. Sidhu, R.F. Andrews, and A.B. Oppenheimer, Inside AppleTalk. Reading, MA: Addison-Wesley, 1989.

[47] International Standards Organization, Information Processing Systems - Open Sys-tems Interconnection - Basic Reference Model. Draft International Standard 7498, 1984.

[48] Protocol Standard for a NetBIOS Service on a TCP/UDP Transport: Detailed Specifications. RFC 1002, 1987.

[49] Protocol Standard for a NetBIOS Service on a TCP/UDP Transport: Concepts and Methods. RFC 1001, 1987.

[50] J.D. Blair, SAMBA, Integrating UNIX and Windows,. Specialized Systems Consult-ants Inc., 1998.

[51] J. K. Ousterhout, A. R. Cherenson, F. Douglis, M. N. Nelson, and B. B. Welch, "The Sprite Network Operating System," IEEE Computer, 21(2):23-36, 1988.

[52] M.N. Nelson, B.B. Welch, and J.K. Ousterhout, "Caching in the Sprite Network File System," ACM Transactions on Computer Systems, 6(1):134-154, 1988.

[53] V. Srinivasan and J. Mogul, "Spritely NFS: Experiments With Cache-Consistency Protocols," in Proceedings of the 12th ACM Symposium on Operating Systems Principles, 1989.


[54] R. Macklem, "Not Quite NFS, Soft Cache Consistency for NFS," in Proceedings of the Winter USENIX Technical Conference, San Fransisco, CA, 1994.

[55] "Purple: Fifth Generation ASC Platform," www.llnl.gov/asci/platforms/purple.

[56] The BlueGene/L Team, "An Overview of the BlueGene/L Supercomputer," in Pro-ceedings of Supercomputing '02, Baltimore, MD, 2002.

[57] BlueGene/L, www.llnl.gov/asc/computing_resources/bluegenel/ bluegene_home.html.

[58] ASCI White, www.llnl.gov/asci/platforms/white.

[59] S. Habata, M. Yokokawa, and S. Kitawaki, "The Earth Simulator System," NEC Research and Development Journal, 44(1):21-26, 2003.

[60] ASCI Red, www.sandia.gov/ASCI/Red.

[61] M. Satyanarayanan, "A Study of File Sizes and Functional Lifetimes," in Proceedings of the 8th ACM Symposium on Operating System Principles, Pacific Grove, CA, 1981.

[62] D. Ellard, J. Ledlie, P. Malkani, and M. Seltzer, "Passive NFS Tracing of Email and Research Workloads," in Proceedings of the USENIX Conference on File and Storage Technologies, San Francisco, CA, 2003.

[63] M.G. Baker, J.H. Hartman, M.D. Kupfer, K.W. Shirriff, and J.K. Ousterhout, "Measurements of a Distributed File System," in Proceedings of the 13th ACM Symposium on Operating Systems Principles, Pacific Grove, CA, 1991.

[64] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M. Kupfer, and J.G. Thompson, "A Trace-Driven Analysis of the UNIX 4.2 BSD File System," in Proceedings of the 10th ACM Symposium on Operating Systems Principles, Orcas Island, WA, 1985.

[65] B. Fryxell, K. Olson, P. Ricker, F.X. Timmes, M. Zingale, D.Q. Lamb, P. MacNeice, R. Rosner, and H. Tufo, "FLASH: An Adaptive Mesh Hydrodynamics Code for Modeling Astrophysical Thermonuclear Flashes," Astrophysical Journal Supplement, 131:273-334, 2000.

[66] A. Darling, L. Carey, and W. Feng, "The Design, Implementation, and Evaluation of mpiBLAST," in Proceedings of the ClusterWorld Conference and Expo, in conjunction with the 4th International Conference on Linux Clusters: The HPC Revolution, San Jose, CA, 2003.

[67] E.L. Miller and R.H. Katz, "Input/Output Behavior of Supercomputing Applications," in Proceedings of Supercomputing '91, Albuquerque, NM, 1991.

[68] B. Schroeder and G. Gibson, "A Large Scale Study of Failures in High-Performance Computing Systems," in Proceedings of the International Conference on Dependable Systems and Networks, Philadelphia, PA, 2006.

[69] G. Grider, L. Ward, R. Ross, and G. Gibson, "A Business Case for Extensions to the POSIX I/O API for High End, Clustered, and Highly Concurrent Computing," www.opengroup.org/platform/hecewg, 2006.

[70] A.T. Wong, L. Oliker, W.T.C. Kramer, T.L. Kaltz, and D.H. Bailey, "ESP: A System Utilization Benchmark," in Proceedings of Supercomputing '00, Dallas, TX, 2000.

[71] B.K. Pasquale and G.C. Polyzos, "Dynamic I/O Characterization of I/O Intensive Scientific Applications," in Proceedings of Supercomputing '94, Washington, D.C., 1994.

[72] D. Kotz and N. Nieuwejaar, "Dynamic File-Access Characteristics of a Production Parallel Scientific Workload," in Proceedings of Supercomputing '94, Washington, D.C., 1994.

[73] A. Purakayastha, C. Schlatter Ellis, D. Kotz, N. Nieuwejaar, and M. Best, "Characterizing Parallel File-Access Patterns on a Large-Scale Multiprocessor," in Proceedings of the Ninth International Parallel Processing Symposium, Santa Barbara, CA, 1995.

[74] N. Nieuwejaar, D. Kotz, A. Purakayastha, C. Schlatter Ellis, and M. Best, "File-Access Characteristics of Parallel Scientific Workloads," IEEE Transactions on Parallel and Distributed Systems, 7(10):1075-1089, 1996.

[75] E. Smirni and D.A. Reed, "Workload Characterization of Input/Output Intensive Parallel Applications," in Proceedings of the Conference on Modeling Techniques and Tools for Computer Performance Evaluation, Saint Malo, France, 1997.

[76] E. Smirni, R.A. Aydt, A.A. Chien, and D.A. Reed, "I/O Requirements of Scientific Applications: An Evolutionary View," in Proceedings of the 5th IEEE Conference on High Performance Distributed Computing, Syracuse, NY, 1996.

[77] P.E. Crandall, R.A. Aydt, A.A. Chien, and D.A. Reed, "Input/Output Characteristics of Scalable Parallel Applications," in Proceedings of Supercomputing '95, San Diego, CA, 1995.

[78] F. Wang, Q. Xin, B. Hong, S.A. Brandt, E.L. Miller, D.D.E. Long, and T.T. McLarty, "File System Workload Analysis For Large Scale Scientific Computing Applications," in Proceedings of the 21st IEEE/12th NASA Goddard Conference on Mass Storage Systems and Technologies, College Park, MD, 2004.

[79] M. Snir, S.W. Otto, S. Huss-Lederman, D.W. Walker, and J. Dongarra, MPI: The Complete Reference. Cambridge, MA: MIT Press, 1996.

[80] W. Gropp, S. Huss-Lederman, A. Lumsdaine, E. Lusk, B. Nitzberg, W. Saphir, and M. Snir, MPI: The Complete Reference, volume 2--The MPI-2 Extensions. Cambridge, MA: MIT Press, 1998.

[81] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, "A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard," Parallel Computing, 22(6):789-828, 1996.

[82] R. Thakur, W. Gropp, and E. Lusk, "Data Sieving and Collective I/O in ROMIO," in Proceedings of the 7th Symposium on the Frontiers of Massively Parallel Computation, 1999.

[83] J.P. Prost, R. Treumann, R. Hedges, B. Jia, and A.E. Koniges, "MPI-IO/GPFS, an Optimized Implementation of MPI-IO on top of GPFS," in Proceedings of Supercomputing '01, Denver, CO, 2001.

[84] R. Thakur, A. Choudhary, R. Bordawekar, S. More, and S. Kuditipudi, "Passion: Optimized I/O for Parallel Applications," IEEE Computer, 29(6):70-78, 1996.

[85] D. Kotz, "Disk-directed I/O for MIMD Multiprocessors," ACM Transactions on Computer Systems, 15(1):41-74, 1997.

[86] K. Seamons, Y. Chen, P. Jones, J. Jozwiak, and M. Winslett, "Server-Directed Collective I/O in Panda," in Proceedings of Supercomputing '95, 1995.

[87] J. del Rosario, R. Bordawekar, and A. Choudhary, "Improved Parallel I/O via a Two-Phase Run-time Access Strategy," in Proceedings of the Workshop on I/O in Parallel Computer Systems at IPPS '93, Newport Beach, CA, 1993.

[88] OpenMP Consortium, "OpenMP C and C++ Application Program Interface, Version 1.0," www.openmp.org, 1997.

[89] High Performance Fortran Forum, "High Performance Fortran language specification version 2.0," hpff.rice.edu/versions/hpf2/hpf-v20, 1997.

[90] C. Juszczak, "Improving the Write Performance of an NFS Server," in Proceedings of the Winter USENIX Technical Conference, San Francisco, CA, 1994.

[91] P. Lombard and Y. Denneulin, "nfsp: A Distributed NFS Server for Clusters of Workstations," in Proceedings of the 16th International Parallel and Distributed Processing Symposium, Fort Lauderdale, FL, 2002.

[92] C.A. Thekkath, T. Mann, and E.K. Lee, "Frangipani: A Scalable Distributed File System," in Proceedings of the 16th ACM Symposium on Operating Systems Principles, Saint-Malo, France, 1997.

[93] R.A. Coyne and H. Hulen, "An Introduction to the Mass Storage System Reference Model, Version 5," in Proceedings of the 12th IEEE Symposium on Mass Storage Systems, Monterey, CA, 1993.

[94] S.W. Miller, A Reference Model for Mass Storage Systems. San Diego, CA: Academic Press Professional, Inc., 1988.

[95] A.L. Drapeau, K. Shirriff, E.K. Lee, J.H. Hartman, E.L. Miller, S. Seshan, R.H. Katz, K. Lutz, D.A. Patterson, P.M. Chen, and G.A. Gibson, "RAID-II: A High-Bandwidth Network File Server," in Proceedings of the 21st International Symposium on Computer Architecture, Chicago, IL, 1994.

[96] L. Cabrera and D.D.E. Long, "SWIFT: Using Distributed Disk Striping To Provide High I/O Data Rates," Computing Systems, 4(4):405-436, 1991.

[97] F. Schmuck and R. Haskin, "GPFS: A Shared-Disk File System for Large Computing Clusters," in Proceedings of the USENIX Conference on File and Storage Technologies, San Francisco, CA, 2002.

[98] Red Hat Software Inc., "Red Hat Global File System," www.redhat.com/software/rha/gfs.

[99] S.R. Soltis, T.M. Ruwart, and M.T. O'Keefe, "The Global File System," in Proceedings of the 5th NASA Goddard Conference on Mass Storage Systems, College Park, MD, 1996.

[100] M. Fasheh, "OCFS2: The Oracle Clustered File System, Version 2," in Proceedings of the Linux Symposium, Ottawa, Canada, 2006.

[101] Polyserve Inc., "Matrix Server Architecture," www.polyserve.com.

[102] C. Brooks, H. Dachuan, D. Jackson, M.A. Miller, and M. Resichini, "IBM TotalStorage: Introducing the SAN File System," IBM Redbooks, 2003, www.redbooks.ibm.com/redbooks/pdfs/sg247057.pdf.

[103] EMC Corp., "EMC Celerra HighRoad," www.emc.com/pdf/products/celerra_file_server/HighRoad_wp.pdf, 2001.

[104] Cluster File Systems Inc., Lustre: A Scalable, High-Performance File System. www.lustre.org, 2002.

[105] Panasas Inc., "Panasas ActiveScale File System," www.panasas.com.

[106] D.D.E. Long, B.R. Montague, and L. Cabrera, "Swift/RAID: A Distributed RAID System," Computing Systems, 7(3):333-359, 1994.

[107] P.H. Carns, W.B. Ligon III, R.B. Ross, and R. Thakur, "PVFS: A Parallel File System for Linux Clusters," in Proceedings of the 4th Annual Linux Showcase and Conference, Atlanta, GA, 2000.

[108] R. Stewart, Q. Xie, K. Morneault, C. Sharp, H. Schwarzbauer, T. Taylor, I. Rytina, M. Kalla, L. Zhang, and V. Paxson, Stream Control Transmission Protocol. RFC 2960, 2000.

[109] RDMA Consortium, www.rdmaconsortium.org.

[110] B. Callaghan and S. Singh, "The Autofs Automounter," in Proceedings of the Summer USENIX Technical Conference, Cincinnati, OH, 1993.

[111] S. Tweedie, "Ext3, Journaling Filesystem," in Proceedings of the Linux Symposium, Ottawa, Canada, 2000.

[112] ReiserFS, www.namesys.com.

[113] F. Garcia-Carballeira, A. Calderon, J. Carretero, J. Fernandez, and J.M. Perez, "The Design of the Expand File System," International Journal of High Performance Computing Applications, 17(1):21-37, 2003.

[114] G.H. Kim, R.G. Minnich, and L. McVoy, "Bigfoot-NFS: A Parallel File-Striping NFS Server (Extended Abstract)," 1994, www.bitmover.com/lm.

[115] A. Butt, T.A. Johnson, Y. Zheng, and Y.C. Hu, "Kosha: A Peer-to-Peer Enhancement for the Network File System," in Proceedings of Supercomputing '04, Pittsburgh, PA, 2004.

[116] D.C. Anderson, J.S. Chase, and A.M. Vahdat, "Interposed Request Routing for Scalable Network Storage," in Proceedings of the 4th Symposium on Operating System Design and Implementation, San Diego, CA, 2000.

[117] M. Eisler, A. Chiu, and L. Ling, RPCSEC_GSS Protocol Specification. RFC 2203, 1997.

[118] M. Eisler, LIPKEY - A Low Infrastructure Public Key Mechanism Using SPKM. RFC 2847, 2000.

[119] J. Linn, The Kerberos Version 5 GSS-API Mechanism. RFC 1964, 1996.

[120] NFS Extensions for Parallel Storage (NEPS), www.citi.umich.edu/NEPS.

[121] G. Gibson and P. Corbett, pNFS Problem Statement. Internet Draft, 2004.

[122] G. Gibson, B. Welch, G. Goodson, and P. Corbett, Parallel NFS Requirements and Design Considerations. Internet Draft, 2004.

[123] S. Shepler, M. Eisler, and D. Noveck, NFSv4 Minor Version 1. Internet Draft, 2006.

[124] A. Sweeney, D. Doucette, W. Hu, C. Anderson, M. Nishimoto, and G. Peck, "Scalability in the XFS File System," in Proceedings of the USENIX Annual Technical Conference, San Diego, CA, USA, 1996.

[125] IEEE Storage System Standards Working Group (SSSWG) (Project 1244), "Reference Model for Open Storage Systems Interconnection - Mass Storage System Reference Model Version 5," ssswg.org/public_documents/MSSRM/V5-pref.html, 1994.

[126] R.A. Coyne, H. Hulen, and R. Watson, "The High Performance Storage System," in Proceedings of Supercomputing '93, Portland, OR, 1993.

[127] TeraGrid, www.teragrid.org.

[128] W.D. Norcott and D. Capps, "IOZone Filesystem Benchmark," 2003, www.iozone.org.

[129] Parallel Virtual File System - Version 2, www.pvfs.org.

[130] C. Baru, R. Moore, A. Rajasekar, and M. Wan, "The SDSC Storage Resource Broker," in Proceedings of the Conference of the Centre for Advanced Studies on Collaborative Research, Toronto, Canada, 1998.

[131] T.E. Anderson, M.D. Dahlin, J.M. Neefe, D.A. Patterson, D.S. Roselli, and R.Y. Wang, "Serverless Network File Systems," in Proceedings of the 15th ACM Symposium on Operating System Principles, Copper Mountain Resort, CO, 1995.

[132] B.S. White, M. Walker, M. Humphrey, and A.S. Grimshaw, "LegionFS: A Secure and Scalable File System Supporting Cross-Domain High-Performance Applications," in Proceedings of Supercomputing '01, Denver, CO, 2001.

[133] O. Tatebe, Y. Morita, S. Matsuoka, N. Soda, and S. Sekiguchi, "Grid Datafarm Architecture for Petascale Data Intensive Computing," in Proceedings of the 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid, Berlin, Germany, 2002.

[134] P. Andrews, C. Jordan, and W. Pfeiffer, "Marching Towards Nirvana: Configurations for Very High Performance Parallel File Systems," in Proceedings of the HiperIO Workshop, Barcelona, Spain, 2006.

[135] P. Andrews, C. Jordan, and H. Lederer, "Design, Implementation, and Production Experiences of a Global Storage Grid," in Proceedings of the 23rd IEEE/14th NASA Goddard Conference on Mass Storage Systems and Technologies, College Park, MD, 2006.

[136] R.O. Weber, SCSI Object-Based Storage Device Commands (OSD). Storage Networking Industry Association. ANSI/INCITS 400-2004, www.t10.org, 2004.

[137] N.J. Boden, D. Cohen, R.E. Felderman, A.E. Kulawik, and C.L. Seitz, "Myrinet: A Gigabit-per-Second Local-Area Network," IEEE Micro, 15(1):29-36, 1995.

[138] InfiniBand Architecture Specification, Volumes 1 & 2, Release 1.0, www.infinibandta.org/download_spec10.html, 2000.

[139] Internet Assigned Numbers Authority, www.iana.org.

[140] M. Olson, K. Bostic, and M. Seltzer, "Berkeley DB," in Proceedings of the Summer USENIX Technical Conference, FREENIX track, Monterey, CA, 1999.

[141] A. Bhide, A. Engineer, A. Kanetkar, A. Kini, C. Karamanolis, D. Muntz, Z. Zhang, and G. Thunquest, "File Virtualization with DirectNFS," in Proceedings of the 19th IEEE/10th NASA Goddard Conference on Mass Storage Systems and Technologies, College Park, MD, 2002.

[142] R. Rew and G. Davis, "The Unidata netCDF: Software for Scientific Data Access," in Proceedings of the 6th International Conference on Interactive Information and Processing Systems for Meteorology, Oceanography, and Hydrology, Anaheim, CA, 1990.

[143] NCSA, "HDF5," hdf.ncsa.uiuc.edu/HDF5.

[144] Unidata Program Center, "Where is NetCDF Used?," www.unidata.ucar.edu/software/netcdf/usage.html.

[145] J. Katcher, "PostMark: A New File System Benchmark," Network Appliance, Technical Report TR3022, 1997.

[146] IOR Benchmark, www.llnl.gov/asci/purple/benchmarks/limited/ior.

[147] FLASH I/O Benchmark, flash.uchicago.edu/~jbgallag/io_bench.

[148] ATLAS, atlasinfo.cern.ch.

[149] The Large Hadron Collider, lhc.web.cern.ch.

[150] ATLAS Development Team (private communication), 2005.

[151] M. Rosenblum and J.K. Ousterhout, "The Design and Implementation of a Log-Structured File System," ACM Transactions on Computer Systems, 10(1):26-52, 1992.

[152] J.H. Hartman and J.K. Ousterhout, "The Zebra Striped Network File System," ACM Transactions on Computer Systems, 13(3):274-310, 1995.

[153] P.F. Corbett and D.G. Feitelson, "The Vesta Parallel File System," ACM Transactions on Computer Systems, 14(3):225-264, 1996.

[154] B. Halevy, B. Welch, J. Zelenka, and T. Pisek, Object-based pNFS Operations. Internet Draft, 2006.

[155] D. Hildebrand and P. Honeyman, "Exporting Storage Systems in a Scalable Manner with pNFS," in Proceedings of the 22nd IEEE/13th NASA Goddard Conference on Mass Storage Systems and Technologies, Monterey, CA, 2005.

[156] IBRIX Fusion, www.ibrix.com.

[157] S.V. Anastasiadis, K. C. Sevcik, and M. Stumm, "Disk Striping Scalability in the Exedra Media Server," in Proceedings of the ACM/SPIE Multimedia Computing and Networking, San Jose, CA, 2001.

[158] F. Isaila and W.F. Tichy, "Clusterfile: A Flexible Physical Layout Parallel File System," in Proceedings of the IEEE International Conference on Cluster Computing, Newport Beach, CA, 2001.

[159] M. Seltzer, G. Ganger, M.K. McKusick, K. Smith, C. Soules, and C. Stein, "Journaling versus Soft Updates: Asynchronous Meta-data Protection in File Systems," in Proceedings of the USENIX Annual Technical Conference, San Diego, CA, 2000.

[160] OpenSSH, www.openssh.org.

[161] M. Enserink and G. Vogel, "Deferring Competition, Global Net Closes In on SARS," Science, 300(5617):224-225, 2003.

[162] H. Newman, J. Bunn, R. Cavanaugh, I. Legrand, S. Low, S. McKee, D. Nae, S. Ravot, C. Steenberg, X. Su, M. Thomas, F. van Lingen, and Y. Xia, "The UltraLight Project: The Network as an Integrated and Managed Resource in Grid Systems for High Energy Physics and Data Intensive Science," Computing in Science and Engineering, 7(6), 2005.

[163] A. Ching, W. Feng, H. Lin, X. Ma, and A. Choudhary, "Exploring I/O Strategies for Parallel Sequence Database Search Tools with S3aSim," in Proceedings of the International Symposium on High Performance Distributed Computing, Paris, France, 2006.

[164] A. Ching, A. Choudhary, W.K. Liao, R. Ross, and W. Gropp, "Noncontiguous I/O through PVFS," in Proceedings of the IEEE International Conference on Cluster Computing, Chicago, IL, 2002.

[165] S. Weil, S.A. Brandt, E.L. Miller, and C. Maltzahn, "CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data," in Proceedings of Supercomputing '06, Tampa, FL, 2006.

[166] S. Ghemawat, H. Gobioff, and S.T. Leung, "The Google File System," in Proceed-ings of the 19th ACM Symposium on Operating Systems Principles, Bolton Land-ing, NY, 2003.

[167] S. Weil, K.T. Pollack, S.A. Brandt, and E.L. Miller, "Dynamic Metadata Management for Petabyte-scale File Systems," in Proceedings of Supercomputing '04, Pittsburgh, PA, 2004.

[168] R.B. Avila, P.O.A. Navaux, P. Lombard, A. Lebre, and Y. Denneulin, "Performance Evaluation of a Prototype Distributed NFS Server," in Proceedings of the 16th Symposium on Computer Architecture and High Performance Computing, Foz do Iguaçu, Brazil, 2004.