
    Distributed File Systems: Concepts and Examples

    ELIEZER LEVY and ABRAHAM SILBERSCHATZ

    Department of Computer Sciences, University of Texas at Austin, Austin, Texas 78712-1188

    The purpose of a distributed file system (DFS) is to allow users of physically distributed computers to share data and storage resources by using a common file system. A typical configuration for a DFS is a collection of workstations and mainframes connected by a local area network (LAN). A DFS is implemented as part of the operating system of each of the connected computers. This paper establishes a viewpoint that emphasizes the dispersed structure and decentralization of both data and control in the design of such systems. It defines the concepts of transparency, fault tolerance, and scalability and discusses them in the context of DFSs. The paper claims that the principle of distributed operation is fundamental for a fault-tolerant and scalable DFS design. It also presents alternatives for the semantics of sharing and methods for providing access to remote files. A survey of contemporary UNIX-based systems, namely, UNIX United, Locus, Sprite, Sun's Network File System, and ITC's Andrew, illustrates the concepts and demonstrates various implementations and design alternatives. Based on the assessment of these systems, the paper makes the point that a departure from the approach of extending centralized file systems over a communication network is necessary to accomplish sound distributed file system design.

    Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems - distributed applications; network operating systems; D.4.2 [Operating Systems]: Storage Management - allocation/deallocation strategies; storage hierarchies; D.4.3 [Operating Systems]: File Systems Management - directory structures; distributed file systems; file organization; maintenance; D.4.4 [Operating Systems]: Communication Management - buffering; network communication; D.4.5 [Operating Systems]: Reliability - fault tolerance; D.4.7 [Operating Systems]: Organization and Design - distributed systems; F.5 [Files]: Organization/structure

    General Terms: Design, Reliability

    Additional Key Words and Phrases: Caching, client-server communication, network transparency, scalability, UNIX

    INTRODUCTION

    The need to share resources in a computer system arises due to economics or the nature of some applications. In such cases, it is necessary to facilitate sharing long-term storage devices and their data. This paper discusses Distributed File Systems (DFSs) as the means of sharing storage space and data.

    A file system is a subsystem of an operating system whose purpose is to provide long-term storage. It does so by implementing files: named objects that exist from their explicit creation until their explicit destruction and are immune to temporary failures in the system.

    UNIX is a trademark of AT&T Bell Laboratories.


    CONTENTS

    INTRODUCTION
    1. TRENDS AND TERMINOLOGY
    2. NAMING AND TRANSPARENCY
       2.1 Location Transparency and Independence
       2.2 Naming Schemes
       2.3 Implementation Techniques
    3. SEMANTICS OF SHARING
       3.1 UNIX Semantics
       3.2 Session Semantics
       3.3 Immutable Shared Files Semantics
       3.4 Transaction-like Semantics
    4. REMOTE-ACCESS METHODS
       4.1 Designing a Caching Scheme
       4.2 Cache Consistency
       4.3 Comparison of Caching and Remote Service
    5. FAULT TOLERANCE ISSUES
       5.1 Stateful Versus Stateless Service
       5.2 Improving Availability
       5.3 File Replication
    6. SCALABILITY ISSUES
       6.1 Guidelines by Negative Examples
       6.2 Lightweight Processes
    7. UNIX UNITED
       7.1 Overview
       7.2 Implementation - Newcastle Connection
       7.3 Summary
    8. LOCUS
       8.1 Overview
       8.2 Name Structure
       8.3 File Operations
       8.4 Synchronizing Accesses to Files
       8.5 Operation in a Faulty Environment
       8.6 Summary
    9. SUN NETWORK FILE SYSTEM
       9.1 Overview
       9.2 NFS Services
       9.3 Implementation
       9.4 Summary
    10. SPRITE
       10.1 Overview
       10.2 Looking Up Files with Prefix Tables
       10.3 Caching and Consistency
       10.4 Summary
    11. ANDREW
       11.1 Overview
       11.2 Shared Name Space
       11.3 File Operations and Sharing Semantics
       11.4 Implementation
       11.5 Summary
    12. OVERVIEW OF RELATED WORK
    13. CONCLUSIONS
    ACKNOWLEDGMENTS
    REFERENCES
    BIBLIOGRAPHY

    A DFS is a distributed implementation of the classical time-sharing model of a file system, where multiple users share files and storage resources. The UNIX time-sharing file system is usually regarded as the model [Ritchie and Thompson 1974]. The purpose of a DFS is to support the same kind of sharing when users are physically dispersed in a distributed system. A distributed system is a collection of loosely coupled machines (either mainframes or workstations) interconnected by a communication network. Unless specified otherwise, the network is a local area network (LAN). From the point of view of a specific machine in a distributed system, the rest of the machines and their respective resources are remote and the machine's own resources are local.

    To explain the structure of a DFS, we need to define service, server, and client [Mitchell 1982]. A service is a software entity running on one or more machines and providing a particular type of function to a priori unknown clients. A server is the service software running on a single machine. A client is a process that can invoke a service using a set of operations that form its client interface (see below). Sometimes, a lower-level interface is defined for the actual cross-machine interaction. When the need arises, we refer to this interface as the intermachine interface. Clients implement interfaces suitable for higher-level applications or direct access by humans.

    Using the above terminology, we say a file system provides file services to clients. A client interface for a file service is formed by a set of file operations. The most primitive operations are Create a file, Delete a file, Read from a file, and Write to a file. The primary hardware component a file server controls is a set of secondary storage devices (i.e., magnetic disks) on which files are stored and from which they are retrieved according to the clients' requests. We often say that a server, or a machine, stores a file, meaning the file resides on one of its attached devices. We refer to the file system offered by a uniprocessor, time-sharing operating system (e.g., UNIX 4.2 BSD) as a conventional file system.
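    To make the client interface concrete, the sketch below renders the four primitive operations as a Python interface. This is an illustration only: the operation names follow the primitives above, while the signatures and types are our own assumptions rather than any surveyed system's API.

        from abc import ABC, abstractmethod

        class FileService(ABC):
            """Client interface of a file service: the four primitive operations.

            A concrete subclass might serve each call from a local disk (a
            conventional file system) or forward it to a remote server (a DFS);
            network transparency means clients cannot tell which.
            """

            @abstractmethod
            def create(self, name: str) -> None: ...

            @abstractmethod
            def delete(self, name: str) -> None: ...

            @abstractmethod
            def read(self, name: str, offset: int, length: int) -> bytes: ...

            @abstractmethod
            def write(self, name: str, offset: int, data: bytes) -> None: ...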



    A DFS is a file system whose clients, servers, and storage devices are dispersed among the machines of a distributed system. Accordingly, service activity has to be carried out across the network, and instead of a single centralized data repository there are multiple and independent storage devices. As will become evident, the concrete configuration and implementation of a DFS may vary. There are configurations where servers run on dedicated machines, as well as configurations where a machine can be both a server and a client. A DFS can be implemented as part of a distributed operating system or, alternatively, by a software layer whose task is to manage the communication between conventional operating systems and file systems. The distinctive features of a DFS are the multiplicity and autonomy of clients and servers in the system.

    The paper is divided into two parts. In the first part, which includes Sections 1 to 6, the basic concepts underlying the design of a DFS are discussed. In particular, alternatives and trade-offs regarding the design of a DFS are pointed out. The second part surveys five DFSs: UNIX United [Brownbridge et al. 1982; Randell 1983], Locus [Popek and Walker 1985; Walker et al. 1983], Sun's Network File System (NFS) [Sandberg et al. 1985; Sun Microsystems Inc. 1988], Sprite [Nelson et al. 1988; Ousterhout et al. 1988], and Andrew [Howard et al. 1988; Morris et al. 1986; Satyanarayanan et al. 1985]. These systems exemplify the concepts and observations mentioned in the first part and demonstrate various implementations. A point in the first part is often illustrated by referring to a later section covering one of the surveyed systems.

    The fundamental concepts of a DFS can be studied without paying significant attention to the actual operating system of which it is a component. The first part of the paper adopts this approach. The second part reviews actual DFS architectures that serve to demonstrate approaches to integration of a DFS with an operating system and a communication network. To complement our discussion, we refer the reader to the survey paper by Tanenbaum and Van Renesse [1985], where the broader context of distributed operating systems and communication primitives are discussed.

    In light of the profusion of UNIX-based DFSs and the dominance of the UNIX file system model, five UNIX-based systems are surveyed. The first part of the paper is independent of this choice as much as possible. Since a vast majority of the actual DFSs (and all systems surveyed and mentioned in this paper) have some relation to UNIX, however, it is inevitable that the concepts are understood best in the UNIX context. The choice of the five systems and the order of their presentation demonstrate the evolution of DFSs in the last decade.

    Section 1 presents the terminology and concepts of transparency, fault tolerance, and scalability. Section 2 discusses transparency and how it is expressed in naming schemes in greater detail. Section 3 introduces notions that are important for the semantics of sharing files, and Section 4 compares methods of caching and remote service. Sections 5 and 6 discuss issues related to fault tolerance and scalability, respectively, pointing out observations based on the designs of the surveyed systems. Sections 7-11 describe each of the five systems mentioned above, including distinctive features of a system not related to the issues presented in the first part. Each description is followed by a summary of the prominent features of the corresponding system. A table compares the five systems and concludes the survey. Many important aspects of DFSs and systems are omitted from this paper; thus, Section 12 reviews related work not emphasized in our discussion. Finally, Section 13 provides conclusions, and a bibliography provides related literature not directly referenced.

    1. TRENDS AND TERMINOLOGY

    Ideally, a DFS should look to its clients like a conventional, centralized file system. That is, the multiplicity and dispersion of servers and storage devices should be transparent to clients. As will become evident,


    transparency has many dimensions and degrees. A fundamental property, called network transparency, implies that clients should be able to access remote files using the same set of file operations applicable to local files. That is, the client interface of a DFS should not distinguish between local and remote files. It is up to the DFS to locate the files and arrange for the transport of the data.

    Another aspect of transparency is user mobility, which implies that users can log in to any machine in the system; that is, they are not forced to use a specific machine. A transparent DFS facilitates user mobility by bringing the user's environment (e.g., home directory) to wherever he or she logs in.

    The most important performance measurement of a DFS is the amount of time needed to satisfy service requests. In conventional systems, this time consists of disk access time and a small amount of CPU processing time. In a DFS, a remote access has the additional overhead attributed to the distributed structure. This overhead includes the time needed to deliver the request to a server, as well as the time needed to get the response across the network back to the client. For each direction, in addition to the actual transfer of the information, there is the CPU overhead of running the communication protocol software. The performance of a DFS can be viewed as another dimension of its transparency; that is, the performance of a DFS should be comparable to that of a conventional file system.

    We use the term fault tolerance in a broad sense. Communication faults, machine failures (of type fail-stop), storage device crashes, and decays of storage media are all considered to be faults that should be tolerated to some extent. A fault-tolerant system should continue functioning, perhaps in a degraded form, in the face of these failures. The degradation can be in performance, functionality, or both, but should be proportional, in some sense, to the failures causing it. A system that grinds to a halt when a small number of its components fail is not fault tolerant.

    The capability of a system to adapt to increased service load is called scalability. Systems have bounded resources and can become completely saturated under increased load. Regarding a file system, saturation occurs, for example, when a server's CPU runs at a very high utilization rate or when disks are almost full. For a DFS in particular, server saturation is an even bigger threat because of the communication overhead associated with processing remote requests. Scalability is a relative property; a scalable system should react more gracefully to increased load than a nonscalable one. First, its performance should degrade more moderately than that of a nonscalable system. Second, its resources should reach a saturated state later, when compared with a nonscalable system.

    Even a perfect design cannot accommodate an ever-growing load. Adding new resources might solve the problem, but it might generate additional indirect load on other resources (e.g., adding machines to a distributed system can clog the network and increase service loads). Even worse, expanding the system can incur expensive design modifications. A scalable system should have the potential to grow without the above problems. In a distributed system, the ability to scale up gracefully is of special importance, since expanding the network by adding new machines or interconnecting two networks together is commonplace. In short, a scalable design should withstand high service load, accommodate growth of the user community, and enable simple integration of added resources.

    Fault tolerance and scalability are mutually related. A heavily loaded component can become paralyzed and behave like a faulty component. Also, shifting load from a faulty component to its backup can saturate the latter. Generally, having spare resources is essential for reliability, as well as for handling peak loads gracefully.

    An advantage of distributed systems over centralized systems is the potential for fault tolerance and scalability because of the multiplicity of resources. Inappropriate design can, however, obscure this potential and, worse, hinder the system's scalability and make it failure prone. Fault tolerance and scalability considerations call for a design demonstrating distribution of control



    and data. Any centralized entity, be it a central controller or a central data repository, introduces both a severe point of failure and a performance bottleneck. Therefore, a scalable and fault-tolerant DFS should have multiple and independent servers controlling multiple and independent storage devices.

    The fact that a DFS manages a set of dispersed storage devices is its key distinguishing feature. The overall storage space managed by a DFS consists of different and remotely located smaller storage spaces. Usually there is a correspondence between these constituent storage spaces and sets of files. We use the term component unit to denote the smallest set of files that can be stored on a single machine, independently from other units. All files belonging to the same component unit must reside in the same location. We illustrate the definition of a component unit by drawing an analogy from (conventional) UNIX, where multiple disk partitions play the role of distributed storage sites. There, an entire removable file system is a component unit, since a file system must fit within a single disk partition [Ritchie and Thompson 1974]. In all five systems, a component unit is a partial subtree of the UNIX hierarchy.

    Before we proceed, we stress that the distributed nature of a DFS is fundamental to our view. This characteristic lays the foundation for a scalable and fault-tolerant system. Yet, for a distributed system to be conveniently used, its underlying dispersed structure and activity should be made transparent to users. We confine ourselves to discussing DFS designs in the context of transparency, fault tolerance, and scalability. The aim of this paper is to develop an understanding of these three concepts on the basis of the experience gained with contemporary systems.

    2. NAMING AND TRANSPARENCY

    Naming is a mapping between logical and physical objects. Users deal with logical data objects represented by file names, whereas the system manipulates physical blocks of data stored on disk tracks. Usually, a user refers to a file by a textual name. The latter is mapped to a lower-level numerical identifier, which in turn is mapped to disk blocks. This multilevel mapping provides users with an abstraction of a file that hides the details of how and where the file is actually stored on the disk.

    In a transparent DFS, a new dimension is added to the abstraction: that of hiding where in the network the file is located. In a conventional file system the range of the name mapping is an address within a disk; in a DFS it is augmented to include the specific machine on whose disk the file is stored. Going further with the concept of treating files as abstractions leads to the notion of file replication. Given a file name, the mapping returns a set of the locations of this file's replicas [Ellis and Floyd 1983]. In this abstraction, both the existence of multiple copies and their locations are hidden.

    In this section, we elaborate on transparency issues regarding naming in a DFS. After introducing the properties in this context, we sketch approaches to naming and discuss implementation techniques.

    2.1 Location Transparency and Independence

    This section discusses transparency in the context of file names. First, two related notions regarding name mappings in a DFS need to be differentiated:

    • Location Transparency. The name of a file does not reveal any hint as to its physical storage location.

    • Location Independence. The name of a file need not be changed when the file's physical storage location changes.

    Both definitions are relative to the discussed level of naming, since files have different names at different levels (i.e., user-level textual names and system-level numerical identifiers). A location-independent naming scheme is a dynamic mapping, since it can map the same file name to different locations at two different instances of time. Therefore, location independence is a stronger property than location transparency. Location independence is often referred to as file migration or file mobility. When referring to file migration or mobility, one implicitly assumes


    that the movement of files is totally transparent to users. That is, files are migrated by the system without users being aware of it.
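    The distinction can be illustrated with a toy name mapper; everything in the sketch is hypothetical. Under both properties the name reveals nothing about placement, but only the location-independent scheme also lets the system rebind a name to a new location at run time, so the name survives migration.

        class NameMapper:
            """Maps user-level file names to (machine, disk_address) pairs.

            The name itself carries no hint of placement (location transparency).
            Because the binding may change while the system runs, the same name
            survives file migration (location independence); a merely
            location-transparent, static scheme would lack migrate().
            """

            def __init__(self):
                self._table = {}                 # name -> (machine, disk_address)

            def bind(self, name, machine, address):
                self._table[name] = (machine, address)

            def resolve(self, name):
                return self._table[name]

            def migrate(self, name, new_machine, new_address):
                # The name stays fixed; only the hidden binding changes.
                self._table[name] = (new_machine, new_address)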

    In practice, most current file systems (e.g., Locus, NFS, Sprite) provide a static, location-transparent mapping for user-level names. The notion of location independence is, however, irrelevant for these systems. Only Andrew and some experimental file systems support location independence and file mobility (e.g., Eden [Almes et al. 1983; Jessop et al. 1982]). Andrew supports file mobility mainly for administrative purposes. A protocol provides migration of Andrew's component units upon explicit request without changing the user-level or the low-level names of the corresponding files (see Section 11.2 for details).

    A few other aspects can further differentiate and contrast location independence and location transparency:

    • Divorcing data from location, as exhibited by location independence, provides a better abstraction for files. Location-independent files can be viewed as logical data containers not attached to a specific storage location. If only location transparency is supported, however, the file name still denotes a specific, though hidden, set of physical disk blocks.

    • Location transparency provides users with a convenient way to share data. Users may share remote files by naming them in a location-transparent manner as if they were local. Nevertheless, sharing the storage space is cumbersome, since logical names are still statically attached to physical storage devices. Location independence promotes sharing the storage space itself, as well as sharing the data objects. When files can be mobilized, the overall, systemwide storage space looks like a single, virtual resource. A possible benefit of such a view is the ability to balance the utilization of disks across the system. Load balancing of the servers themselves is also made possible by this approach, since files can be migrated from heavily loaded servers to lightly loaded ones.

    • Location independence separates the naming hierarchy from the storage devices hierarchy and the interserver structure. By contrast, if only location transparency is used (although names are transparent), one can easily expose the correspondence between component units and machines. The machines are configured in a pattern similar to the naming structure. This may restrict the architecture of the system unnecessarily and conflict with other considerations. A server in charge of a root directory is an example of a structure dictated by the naming hierarchy that contradicts decentralization guidelines. An excellent example of separation of the service structure from the naming hierarchy can be found in the design of the Grapevine system [Birrel et al. 1982; Schroeder et al. 1984].

    The concept of file mobility deserves more attention and research. We envision a future DFS that supports location independence completely and exploits the flexibility that this property entails.

    2.2 Naming Schemes

    There are three main approaches to naming schemes in a DFS [Barak et al. 1986]. In the simplest approach, files are named by some combination of their host name and local name, which guarantees a unique systemwide name. In Ibis, for instance, a file is uniquely identified by the name host:local-name, where local-name is a UNIX-like path [Tichy and Ruan 1984]. This naming scheme is neither location transparent nor location independent. Nevertheless, the same file operations can be used for both local and remote files; that is, at least the fundamental network transparency is provided. The structure of the DFS is a collection of isolated component units that are entire conventional file systems. In this first approach, component units remain isolated, although means are provided to refer to a remote file. We do not consider this scheme any further in this paper.

    The second approach, popularized by Sun's NFS, provides means for individual


    machines to attach (or mount, in UNIX jargon) remote directories to their local name spaces. Once a remote directory is attached locally, its files can be named in a location-transparent manner. The resulting name structure is versatile; usually it is a forest of UNIX trees, one for each machine, with some overlapping (i.e., shared) subtrees. A prominent property of this scheme is the fact that the shared name space may not be identical at all the machines. Usually this is perceived as a serious disadvantage; however, the scheme has the potential for creating customized name spaces for individual machines.

    Total integration between the component file systems is achieved using the third approach: a single global name structure that spans all the files in the system. Consequently, the same name space is visible to all clients. Ideally, the composed file system structure should be isomorphic to the structure of a conventional file system. In practice, however, there are many special files that make the ideal goal difficult to attain. (In UNIX, for example, I/O devices are treated as ordinary files and are represented in the directory /dev; object code of system programs resides in the directory /bin. These are special files specific to a particular hardware setting.) Different variations of this approach are examined in the sections on UNIX United, Locus, Sprite, and Andrew.

    An important criterion for evaluating the above naming structures is administrative complexity. The most complex structure and most difficult to maintain is the NFS structure. The effects of a failed machine, or of taking a machine off-line, are that some arbitrary set of directories on different machines becomes unavailable. Likewise, migrating files from one machine to another requires changes in the name spaces of all the affected machines. In addition, a separate accreditation mechanism had to be devised for controlling which machine is allowed to attach which directory to its name space.

    2.3 Implementation Techniques

    This section reviews commonly used techniques related to naming.

    2.3.1 Pathname Translation

    The mapping of textual names to low-level identifiers is typically done by a recursive lookup procedure based on the one used in conventional UNIX [Ritchie and Thompson 1974]. We briefly review how this procedure works in a DFS scenario by illustrating the lookup of the textual name /a/b/c of Figure 1. The figure shows a partial name structure constructed from three component units using the third scheme mentioned above. For simplicity, we assume that the location table is available to all the machines. Suppose that the lookup is initiated by a client on machine1. First, the root directory / (whose low-level identifier, and hence its location on disk, is known in advance) is searched to find the entry with the low-level identifier of a. Once the low-level identifier of a is found, the directory a itself can be fetched from disk. Now, b is looked for in this directory. Since b is remote, an indication that b belongs to cu2 is recorded in the entry of b in the directory a. The component of the name looked up so far is stripped off and the remainder (/b/c) is passed on to machine2. On machine2, the lookup is continued, and eventually machine3 is contacted and the low-level identifier of /a/b/c is returned to the client. All five systems mentioned in this paper use a variant of this lookup procedure. Joining component units together and recording the points where they are joined (e.g., b is such a point in the above example) is done by the mount mechanism discussed below.
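    The walk just described can be sketched as follows; the layout mirrors Figure 1, but the tables and helper names are invented for the illustration (a real implementation walks on-disk directories inside the kernel).

        LOCATION_TABLE = {"cu1": "machine1", "cu2": "machine2", "cu3": "machine3"}

        # For each machine: how it resolves the next pathname component.
        # ("forward", cu) means the rest of the path belongs to another
        # component unit; ("local", id) yields the low-level identifier.
        EXPORTS = {
            "machine1": {"a": ("forward", "cu2")},
            "machine2": {"b": ("forward", "cu3")},
            "machine3": {"c": ("local", 11)},
        }

        def lookup(machine, path):
            """Translate a pathname to its low-level identifier, hopping to
            another machine whenever a component-unit boundary is crossed."""
            head, *rest = path.strip("/").split("/", 1)
            kind, value = EXPORTS[machine][head]
            if kind == "local":
                return value
            # Strip the component resolved so far; pass the remainder on.
            return lookup(LOCATION_TABLE[value], "/" + rest[0])

        assert lookup("machine1", "/a/b/c") == 11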

    There are a few options to consider when machine boundaries are crossed in the course of a pathname traversal. We refer again to the above example. Once machine2 is contacted, it can look up b and respond immediately to machine1. Alternatively, machine2 can initiate the contact with machine3 on behalf of the client on machine1. This choice has ramifications for fault tolerance that are discussed in Section 5.2. Among the surveyed systems, only in UNIX United are lookups forwarded from machine to machine on behalf of the lookup initiator. If machine2 responds immediately, it can either respond with the low-level identifier of b or send as a reply the



    entire parent directory of b. In the former case it is the server (machine2 in the example) that performs the lookup, whereas in the latter it is the client that initiated the lookup that actually searches the directory. In case the server's CPU is loaded, this choice is of consequence. In Andrew and Locus, clients perform the lookups; in NFS and Sprite the servers perform it.

        component unit    server
        cu1               machine1
        cu2               machine2
        cu3               machine3

    Figure 1. Lookup example (location table).

    2.3.2 Structured Identifiers

    Implementing transparent naming requires the provision of a mapping of a file name to its location. Keeping this mapping manageable calls for aggregating sets of files into component units and providing the mapping on a component-unit basis rather than on a single-file basis. Typically, structured identifiers are used for this aggregation. These are bit strings that usually have two parts. The first part identifies the component unit to which the file belongs; the second identifies the particular file within the unit. Variants with more parts are possible. The invariant of structured names is, however, that individual parts of the name are unique at all times only within the context of the rest of the parts. Uniqueness at all times can be obtained by not reusing a name that is still in use, by allocating a sufficient number of bits for the names (this method is used in Andrew), or by using a time stamp as one of the parts of the name (as done in Apollo Domain [Leach et al. 1982]).

    To enhance the availability of the crucial name-to-location mapping information, methods such as replicating it or caching parts of it locally by clients are used. As was noted, location independence means that the mapping changes over time; hence, replicating the mapping makes consistently updating this information a complicated matter. Structured identifiers are location independent; they do not mention


    servers' locations at all. Hence, these identifiers can be replicated and cached freely without being invalidated by migration of component units. A smaller, second level of mapping that maps component units to locations is the only information that does change when files migrate. The use of these two techniques, aggregation of files into component units and lower-level, location-independent file identifiers, is exemplified in Andrew (Section 11) and Locus (Section 8).

    We illustrate the above techniques with the example in Figure 1. Suppose the pathname /a/b/c is translated to the structured, low-level identifier <cu3, 11>, where cu3 denotes that file's component unit and 11 identifies it in that unit. The only place where machine locations are recorded is in the location table. Hence, the correspondence between /a/b/c and <cu3, 11> is not invalidated once cu3 is migrated to machine2; only the location table needs to be updated.
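    The two-level scheme can be sketched as follows (all tables hypothetical). The structured identifier mentions no machine, so it can be replicated and cached freely; migrating cu3 touches only the small location table.

        from typing import NamedTuple

        class StructuredId(NamedTuple):
            component_unit: str       # first part: the unit the file belongs to
            file_number: int          # second part: the file within that unit

        # Replicated and cached widely; never invalidated by migration,
        # since it records no machine locations.
        NAME_TO_ID = {"/a/b/c": StructuredId("cu3", 11)}

        # The only mapping that changes when a component unit migrates.
        LOCATION_TABLE = {"cu1": "machine1", "cu2": "machine2", "cu3": "machine3"}

        def locate(path):
            sid = NAME_TO_ID[path]
            return LOCATION_TABLE[sid.component_unit], sid

        machine, sid = locate("/a/b/c")
        assert machine == "machine3"
        LOCATION_TABLE["cu3"] = "machine2"    # migrate cu3; the identifier is unchanged
        machine, sid = locate("/a/b/c")
        assert machine == "machine2" and sid == StructuredId("cu3", 11)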

    2.3.3 Hints

    A technique often used for location mapping in a DFS is that of hints [Lampson 1983; Terry 1987]. A hint is a piece of information that speeds up performance if it is correct and does not cause any semantically negative effects if it is incorrect. In essence, a hint improves performance similarly to cached information. A hint may be wrong, however; therefore, its correctness must be validated upon use. To illustrate how location information is treated as hints, assume there is a location server that always reflects the correct and complete mapping of files to locations. Also assume that clients cache parts of this mapping locally. The cached location information is treated as a hint. If a file is found using the hint, a substantial performance gain is obtained. On the other hand, if the hint was invalidated because the file had been migrated, the client's lookup fails. Consequently, the client must resort to the more expensive procedure of querying the location server; but, still, no semantically negative effects are caused. Examples of using hints abound: Clients in Andrew

    cache location information from servers and treat this information as hints (see Section 11.4). Sprite uses an effective form of hints called prefix tables and resorts to broadcasting when the hint is wrong (see Section 10.2). The location mechanism of Apollo Domain is based on hints and heuristics [Leach et al. 1982]. The Grapevine mail system counts on hints to locate mailboxes of mail recipients [Birrel et al. 1982].
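    The defining property of a hint (fast when right, harmless when wrong) reduces to a two-step lookup. In this hypothetical sketch, the cached location is tried first, and the authoritative but more expensive location server is consulted only when the hint proves stale; probe and location_server stand in for real protocol exchanges.

        location_hints = {}     # client-side cache: file name -> machine (may be stale)

        def locate_with_hint(file_name, location_server, probe):
            """Try the hinted location first; fall back to the location server.

            probe(machine, file_name) asks a machine whether it currently
            stores the file; location_server(file_name) is the authoritative,
            slower query. A wrong hint costs one failed probe, nothing more.
            """
            machine = location_hints.get(file_name)
            if machine is not None and probe(machine, file_name):
                return machine                          # hint correct: fast path
            machine = location_server(file_name)        # expensive, always correct
            location_hints[file_name] = machine         # refresh the hint
            return machine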

    2.3.4 Mount Mechanism

    Joining remote file systems to create a global name structure is often done by the mount mechanism. In conventional UNIX, the mount mechanism is used to join together several self-contained file systems to form a single hierarchical name space [Quarterman et al. 1985; Ritchie and Thompson 1974]. A mount operation binds the root of one file system to a directory of another file system. The former file system hides the subtree descending from the mounted-over directory and looks like an integral subtree of the latter file system. The directory that glues together the two file systems is called a mount point. All mount operations are recorded by the operating system kernel in a mount table. This table is used to redirect name lookups to the appropriate file systems. The same semantics and mechanisms are used to mount a remote file system over a local one. Once the mount is complete, files in the remote file system can be accessed locally as if they were ordinary descendants of the mount point directory. The mount mechanism is used with slight variations in Locus, NFS, Sprite, and Andrew. Section 9.2.1 presents a detailed example of the mount operation.
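    The mount table's role in redirecting lookups can be sketched as follows. The sketch is schematic: a real kernel records device or server handles rather than strings and matches whole pathname components, not raw string prefixes.

        # mount point -> root of the file system mounted there; a remote file
        # system is mounted with exactly the same mechanism as a local one.
        mount_table = {
            "/usr": "local-disk-fs",
            "/nfs/data": "server2:/export/data",
        }

        def redirect(path):
            """Return (file system, path within it), choosing the longest
            mount point that prefixes the path; the mounted-over subtree is
            hidden because its mount point always wins the match."""
            best = ""
            for mount_point in mount_table:
                if path.startswith(mount_point) and len(mount_point) > len(best):
                    best = mount_point
            if not best:
                return "root-fs", path
            return mount_table[best], path[len(best):] or "/"

        assert redirect("/nfs/data/report.txt") == ("server2:/export/data", "/report.txt")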

    3. SEMANTICS OF SHARING

    The semantics of sharing are an important criterion for evaluating any file system that allows multiple clients to share files. They are a characterization of the system that specifies the effects of multiple clients accessing a shared file simultaneously. In particular, these semantics should specify when


    modifications of data by a client are observable, if at all, by remote clients.

    For the following discussion we need to assume that a series of file accesses (i.e., Reads and Writes) attempted by a client on the same file is always enclosed between Open and Close operations. We denote such a series of accesses as a file session.

    It should be realized that applications that use the file system to store data and pose constraints on concurrent accesses in order to guarantee the semantic consistency of their data (i.e., database applications) should use special means (e.g., locks) for this purpose and not rely on the underlying semantics of sharing provided by the file system.

    To illustrate the concept, we sketch several examples of the semantics of sharing mentioned in this paper. We outline the gist of the semantics, not the whole detail.

    3.1 UNIX Semantics

    • Every Read of a file sees the effects of all previous Writes performed on that file in the DFS. In particular, Writes to an open file by a client are visible immediately to other (possibly remote) clients who have this file open at the same time.

    • It is possible for clients to share the pointer of current location into the file. Thus, the advancing of the pointer by one client affects all sharing clients.

    Consider a sequence interleaving all the accesses to the same file, regardless of the identity of the issuing client. Enforcing the above semantics guarantees that each successive access sees the effects of the ones that precede it in that sequence. In a file system context, such an interleaving can be totally arbitrary, since, in contrast to database management systems, sequences of accesses are not defined as transactions. These semantics lend themselves to an implementation where a file is associated with a single physical image that serves all accesses in some serial order (which is the order captured in the above sequence). Contention for this single image results in clients being delayed. The sharing of the location pointer mentioned above is an artifact of UNIX and is needed primarily for compatibility of distributed UNIX systems with conventional UNIX software. Most DFSs try to emulate these semantics to some extent (e.g., Locus, Sprite), mainly for compatibility reasons.

    3.2 Session Semantics

    • Writes to an open file are visible immediately to local clients but are invisible to remote clients who have the same file open simultaneously.

    • Once a file is closed, the changes made to it are visible only in sessions starting later. Already open instances of the file do not reflect these changes.

    According to these semantics, a file may be temporarily associated with several (possibly different) images at the same time. Consequently, multiple clients are allowed to perform both Read and Write accesses concurrently on their images of the file, without being delayed. Observe that when a file is closed, all remote active sessions are actually using a stale copy of the file.

    Here, it is evident that application programs that care about the serialization of accesses (e.g., a distributed database application) should coordinate their accesses explicitly and not rely on these semantics.
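    A minimal executable model of session semantics, under the simplifying assumption of whole-file images held in memory (all names hypothetical): each Open takes a private image, Writes touch only that image, and Close installs it as the new master, visible only to sessions that start later.

        master = {"f": b"v1"}                  # the server's master copies

        class Session:
            """One Open..Close series of accesses under session semantics."""

            def __init__(self, name):
                self.name = name
                self.image = master[name]      # private image taken at Open

            def read(self):
                return self.image

            def write(self, data):
                self.image = data              # invisible to concurrent sessions

            def close(self):
                master[self.name] = self.image # visible to later-starting sessions

        s1, s2 = Session("f"), Session("f")
        s1.write(b"v2")
        assert s2.read() == b"v1"              # remote concurrent session: old image
        s1.close()
        assert s2.read() == b"v1"              # already-open instance is now stale
        assert Session("f").read() == b"v2"    # a later session sees the change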

    3.3 Immutable Shared Files Semantics

    A different, quite unique approach is that of immutable shared files [Schroeder et al. 1985]. Once a file is declared as shared by its creator, it cannot be modified any more. An immutable file has two important properties: Its name may not be reused, and its contents may not be altered. Thus, the name of an immutable file signifies the fixed contents of the file, not the file as a container for variable information. The implementation of these semantics in a distributed system is simple, since the sharing is in read-only mode.

    3.4 Transaction-Like Semantics

    Identifying a file session with a transaction yields the following, familiar semantics: The effects of file sessions on a file and


    their output are equivalent to the effects and output of executing the same sessions in some serial order. Locking a file for the duration of a session implements these semantics. We refer to the rich literature on database management systems for the concepts of transactions and locking [Bernstein et al. 1987]. In the Cambridge File Server, the beginning and end of a transaction are implicit in the Open file and Close file operations, and transactions can involve only one file [Needham and Herbert 1982]. Thus, a file session in that system is actually a transaction.

    Variants of UNIX semantics and (to a lesser degree) session semantics are the most commonly used policies. An important trade-off emerges when evaluating these two extremes of sharing semantics: Simplicity of a distributed implementation is traded for the strength of the semantic guarantee. UNIX semantics guarantee the strong effect of making all accesses see the same version of the file, thereby ensuring that every access is affected by all previous ones. On the other hand, session semantics do not guarantee much when a file is accessed concurrently, since accesses at different machines may observe different versions of the accessed file. The ramifications for the ease of implementation are discussed in the next section.

    4. REMOTE-ACCESS METHODS

    Consider a client process that requests access (i.e., Read or Write) to a remote file. Assuming the server storing the file has been located by the naming scheme, the actual data transfer to satisfy the client's request for remote access must take place. There are two complementary methods for handling this type of data transfer.

    • Remote Service. Requests for accesses are delivered to the server. The server machine performs the accesses, and their results are forwarded back to the client. There is a direct correspondence between accesses and traffic to and from the server. Access requests are translated to messages for the servers, and server replies are packed as messages sent back to


    the clients. Every access is handled by the server and results in network traffic. For example, a Read corresponds to a request message sent to the server and a reply to the client with the requested data. A similar notion, called Remote Open, is defined in Howard et al. [1988].

    • Caching. If the data needed to satisfy the access request are not present locally, a copy of those data is brought from the server to the client. Usually the amount of data brought over is much larger than the data actually requested (e.g., whole files or pages versus a few blocks). Accesses are performed on the cached copy on the client side. The idea is to retain recently accessed disk blocks in the cache, so repeated accesses to the same information can be handled locally, without additional network traffic. Caching performs best when the stream of file accesses exhibits locality of reference. A replacement policy (e.g., Least Recently Used) is used to keep the cache size bounded. There is no direct correspondence between accesses and traffic to the server. Files are still identified with one master copy residing at the server machine, but copies of (parts of) the file are scattered in different caches. When a cached copy is modified, the changes need to be reflected on the master copy and, depending on the relevant sharing semantics, on any other cached copies. Therefore, Write accesses may incur substantial overhead. The problem of keeping the cached copies consistent with the master file is referred to as the cache consistency problem [Smith 1982].
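    The caching read path can be sketched as follows (fetch_from_server is a hypothetical stand-in for the client-server protocol). Under pure remote service every Read would cross the network; here only a miss does, and a Least Recently Used policy bounds the cache.

        from collections import OrderedDict

        CACHE_CAPACITY = 64                     # blocks kept per client

        cache = OrderedDict()                   # (file, block_no) -> data, LRU order

        def read_block(file, block_no, fetch_from_server):
            """Serve a Read from the local cache; go to the server only on a
            miss, caching the result and evicting the least recently used
            block when the cache is full."""
            key = (file, block_no)
            if key in cache:
                cache.move_to_end(key)          # repeated access, handled locally
                return cache[key]
            data = fetch_from_server(file, block_no)   # the only network traffic
            cache[key] = data
            if len(cache) > CACHE_CAPACITY:
                cache.popitem(last=False)       # evict the LRU block
            return data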

    It should be realized that there is a direct analogy between disk access methods in conventional file systems and remote-access methods in DFSs. A pure remote service method is analogous to performing a disk access for each and every access request. Similarly, a caching scheme in a DFS is an extension of caching or buffering techniques in conventional file systems (e.g., buffering block I/O in UNIX [McKusick et al. 1984]). In conventional file systems, the rationale behind caching is to reduce disk I/O, whereas in DFSs the goal is to reduce


    network traffic. For these reasons, a pure remote service method is not practical. Implementations must incorporate some form of caching for performance enhancement. Many implementations can be thought of as a hybrid of caching and remote service. In Locus and NFS, for instance, the implementation is based on remote service but is augmented with caching for performance (see Sections 8.3, 8.4, and 9.3.3). On the other hand, Sprite's implementation is based on caching, but under certain circumstances a remote service method is adopted (see Section 10.3). Thus, when we evaluate the two methods we actually evaluate to what degree one method should be emphasized over the other.

    An interesting study of the performance aspects of the remote access problem can be found in Cheriton and Zwaenepoel [1983]. This paper evaluates to what extent remote access (using the simplest remote service paradigm) is more expensive than local access.

    The remote service method is straightforward and does not require further explanation. Thus, the following material is primarily concerned with the method of caching.

    4.1 Designing a Caching Scheme

    The following discussion pertains to a (file data) caching scheme between a client's cache and a server. The latter is viewed as a uniform entity, and its main memory and disk are not differentiated. Thus, we abstract away the traditional caching scheme on the server side, between its own cache and disk.

    A caching scheme in a DFS should address the following design decisions [Nelson et al. 1988]:

    • The granularity of cached data.
    • The location of the client's cache (main memory or local disk).
    • How to propagate modifications of cached copies.
    • How to determine whether a client's cached data are consistent.

    The choices for these decisions are intertwined and related to the selected sharing semantics.

    4.1.1 Cache Unit Size

    The granularity of the cached data can vary from parts of a file to an entire file. Usually, more data are cached than needed to satisfy a single access, so many accesses can be served by the cached data. An early version of Andrew caches entire files. Currently, Andrew still performs caching in big chunks (64Kb). The rest of the systems support caching of individual blocks driven by clients' demand, where a block is the unit of transfer between disk and main memory buffers (see sample sizes below). Increasing the caching unit increases the likelihood that data for the next access will be found locally (i.e., the hit ratio is increased); on the other hand, the time required for the data transfer and the potential for consistency problems increase, too. Selecting the unit of caching involves parameters such as the network transfer unit and the Remote Procedure Call (RPC) protocol service unit (in case an RPC protocol is used) [Birrel and Nelson 1984]. The network transfer unit is relatively small (e.g., Ethernet packets are about 1.5Kb), so big units of cached data need to be disassembled for delivery and reassembled upon reception [Welch 1986].

    Typically, block-caching schemes use a technique called read-ahead. This technique is useful when reading a large file sequentially. Blocks are read from the server disk and buffered on both the server and client sides before they are actually needed, in order to speed up the reading.
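    Read-ahead is a small addition to that block-fetch path. In this hypothetical sketch, a miss fetches the requested block and buffers the next few blocks of the file as well, so a sequential reader finds them already cached.

        READ_AHEAD = 4    # extra blocks buffered on each miss

        def read_block_with_readahead(file, block_no, cache, fetch):
            """On a miss, fetch the requested block plus the next READ_AHEAD
            blocks before they are actually needed; sequential reads of a
            large file then hit the cache."""
            if (file, block_no) not in cache:
                for n in range(block_no, block_no + 1 + READ_AHEAD):
                    cache[(file, n)] = fetch(file, n)
            return cache[(file, block_no)]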

    One advantage of a large caching unit is reduced network overhead. Recall that running communication protocols accounts for a substantial portion of this overhead. Transferring data in bulk amortizes the protocol cost over many transfer units. At the sender side, one context switch (to load the communication software) suffices to format and transmit multiple packets. At the receiver side, there is no need to acknowledge each packet individually.



    Block size and the total cache size are important for block-caching schemes. In UNIX-like systems, common block sizes are 4Kb or 8Kb. For large caches (more than 1Mb), large block sizes (more than 8Kb) are beneficial, since the advantages of a large caching unit size are dominant [Lazowska et al. 1986; Ousterhout et al. 1985]. For smaller caches, large block sizes are less beneficial because they result in fewer blocks in the cache and most of the cache space is wasted due to internal fragmentation.

    4.1.2 Cache Location

    Regarding the second decision, disk caches have one clear advantage: reliability. Modifications to cached data are lost in a crash if the cache is kept in volatile memory. Moreover, if the cached data are kept on disk, the data are still there during recovery and there is no need to fetch them again. On the other hand, main-memory caches have several advantages. First, main-memory caches permit workstations to be diskless. Second, data can be accessed more quickly from a cache in main memory than from one on a disk. Third, the server caches (used to speed up disk I/O) will be in main memory regardless of where client caches are located; by using main-memory caches on clients, too, it is possible to build a single caching mechanism for use by both servers and clients (as is done in Sprite). It turns out that the two cache locations emphasize different functionality: Main-memory caches emphasize reduced access time; disk caches emphasize increased reliability and autonomy of single machines. Notice that the current technology trend is toward larger and cheaper memories. With large main-memory caches, and hence high hit ratios, the achieved performance speedup is predicted to outweigh the advantages of disk caches.

    4.1.3 Modification Policy

    In the sequel, we use the term dirty block to denote a block of data that has been modified by a client. In the context of caching, we use the term flush to denote the action of sending dirty blocks to be written on the master copy.

    The policy used to flush dirty blocks back to the server's master copy has a critical effect on the system's performance and reliability. (In this section we assume caches are held in main memory.) The simplest policy is to write data through to the server's disk as soon as they are written to any cache. The advantage of the write-through method is its reliability: Little information is lost when a client crashes. This policy requires, however, that each Write access wait until the information is sent to the server, which results in poor Write performance. Caching with write-through is equivalent to using remote service for Write accesses and exploiting caching only for Read accesses.

    An alternative write policy is to delay updates to the master copy. Modifications are written to the cache and then written through to the server later. This policy has two advantages over write-through. First, since writes are made to the cache, Write accesses complete more quickly. Second, data may be deleted before they are written back, in which case they need never be written at all. Unfortunately, delayed-write schemes introduce reliability problems, since unwritten data are lost whenever a client crashes.

    There are several variations of the delayed-write policy that differ in when dirty blocks are flushed to the server. One alternative is to flush a block when it is about to be ejected from the client's cache. This option can result in good performance, but some blocks can reside in the client's cache for a long time before they are written back to the server [Ousterhout et al. 1985]. A compromise between this alternative and the write-through policy is to scan the cache periodically, at regular intervals, and flush blocks that have been modified since the last scan. Sprite uses this policy with a 30-second interval.

    Yet another variation on delayed-write, called write-on-close, is to write data back to the server when the file is closed. For files open for very short periods or rarely modified, this policy does not


    significantly reduce network traffic. In addition, the write-on-close policy requires the closing process to delay while the file is written through, which reduces the performance advantages of delayed writes. The performance advantages of this policy over delayed-write with more frequent flushing are apparent for files that are both open for long periods and modified frequently.
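    The policies differ only in when dirty blocks are flushed. The schematic client-side writer below (all names hypothetical; a real system would drive the periodic scan from a timer rather than from Write calls) expresses write-through, delayed-write with Sprite's 30-second scan, and write-on-close as three triggers around one flush routine.

        import time

        FLUSH_INTERVAL = 30.0          # seconds, as in Sprite's delayed-write policy

        class CachedFile:
            def __init__(self, name, send_to_server, policy="delayed-write"):
                self.name, self.send, self.policy = name, send_to_server, policy
                self.dirty = {}                        # block_no -> data
                self.last_scan = time.monotonic()

            def write(self, block_no, data):
                self.dirty[block_no] = data
                if self.policy == "write-through":
                    self.flush()                       # reliable; every Write waits
                elif self.policy == "delayed-write":
                    # periodic scan: flush blocks still dirty since the last scan
                    if time.monotonic() - self.last_scan >= FLUSH_INTERVAL:
                        self.flush()
                        self.last_scan = time.monotonic()

            def close(self):
                self.flush()                           # write-on-close (final flush)

            def flush(self):
                for block_no, data in self.dirty.items():
                    self.send(self.name, block_no, data)
                self.dirty.clear()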

    As a reference, we present data regarding the utility of caching in UNIX 4.2 BSD. UNIX 4.2 BSD uses a cache of about 400Kb holding blocks of different sizes (the most common size is 4Kb). A delayed-write policy with 30-second intervals is used. A miss ratio (the ratio of real disk I/O to logical disk accesses) of 15 percent is reported in McKusick et al. [1984], and of 50 percent in Ousterhout et al. [1985]. The latter paper also provides the following statistics, which were obtained by simulations on UNIX: A 4Mb cache of 4Kb blocks eliminates between 65 and 90 percent of all disk accesses for file data. A write-through policy resulted in the highest miss ratio. A delayed-write policy with flushing when the block is ejected from the cache had the lowest miss ratio.

    There is a tight relation between the

    modification policy and the sharing semantics: Write-on-close is suitable for session semantics. By contrast, using any delayed-write policy when files are frequently updated concurrently under UNIX semantics is not reasonable and will result in long delays and complex mechanisms. A write-through policy is more suitable for UNIX semantics under such circumstances.

    4.1.4 Cache Validation

    A client is faced with the problem of deciding whether or not its locally cached copy of the data is consistent with the master copy. If the client determines that its cached data are out of date, accesses can no longer be served by that cached data. An up-to-date copy of the data must be brought over. There are basically two approaches to verifying the validity of cached data:

    • Client-initiated approach. The client initiates a validity check in which it contacts the server and checks whether the local data are consistent with the master copy (see the sketch at the end of this section). The frequency of the validity check is the crux of this approach and determines the resulting sharing semantics. It can range from a check before every single access to a check only on first access to a file (on file Open). Every access that is coupled with a validity check is delayed, compared with an access served immediately by the cache. Alternatively, a check can be initiated every fixed interval of time. Usually the validity check involves comparing file header information (e.g., the time stamp of the last update, maintained as i-node information in UNIX). Depending on its frequency, this kind of validity check can cause severe network traffic, as well as consume precious server CPU time. This phenomenon was the cause for Andrew's designers to withdraw from this approach (Howard et al. [1988] provide detailed performance data on this issue).

    • Server-initiated approach. The server records, for each client, the (parts of) files the client caches. Maintaining information on clients has significant fault tolerance implications (see Section 5.1). When the server detects a potential for inconsistency, it must react. A potential for inconsistency occurs when a file is cached in conflicting modes by two different clients (i.e., at least one of the clients specified a Write mode). If session semantics are implemented, whenever a server receives a request to close a file that has been modified, it should react by notifying the clients to discard their cached data and consider it invalid. Clients having this file open at that time discard their copy when the current session is over; other clients discard their copy at once. Under session semantics, the server need not be informed about Opens of already cached files. The server is informed about the Close of a writing session, however. On the other hand, if more restrictive sharing semantics are implemented, like UNIX semantics, the server must be more involved. The server must be notified whenever a file is opened, and the intended mode (Read or


    Write) must be indicated. Assuming such notification, the server can act when it detects a file that is opened simultaneously in conflicting modes by disabling caching for that particular file (as done in Sprite). Disabling caching results in switching to a remote service mode of operation.

    A problem with the server-initiated approach is that it violates the traditional client-server model, where clients initiate activities by requesting service. Such a violation can result in irregular and complex code for both clients and servers.

    In summary, the choice is between longer accesses and greater server load using the former method, versus the server maintaining information on its clients using the latter.
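
    The following C fragment is a rough sketch of the client-initiated approach with a fixed check interval; the names (cache_entry, server_get_mtime, refetch) and the interval are illustrative assumptions, not part of any surveyed system.

        /* Sketch of a client-initiated validity check.  The cached copy is
           revalidated at most once per CHECK_INTERVAL by comparing the
           modification time recorded when the data was cached against the
           one currently held by the server. */
        #include <time.h>

        #define CHECK_INTERVAL 3              /* seconds between checks (tunable) */

        struct cache_entry {
            time_t mtime;                     /* file's mtime when data was cached */
            time_t last_checked;              /* when we last asked the server */
        };

        extern time_t server_get_mtime(const char *path);  /* hypothetical RPC */
        extern void refetch(const char *path, struct cache_entry *e);
                                              /* brings fresh data, updates e->mtime */

        int cached_data_valid(const char *path, struct cache_entry *e)
        {
            time_t now = time(NULL);
            if (now - e->last_checked < CHECK_INTERVAL)
                return 1;                     /* trust the copy for a while */
            e->last_checked = now;
            if (server_get_mtime(path) == e->mtime)
                return 1;                     /* master copy unchanged */
            refetch(path, e);                 /* bring an up-to-date copy over */
            return 0;
        }

    The longer CHECK_INTERVAL is, the fewer messages are exchanged and the weaker the resulting semantics; an interval of zero approximates a check on every access.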

    4.2 Cache Consistency

    Before delving into the evaluation and comparison of remote service and caching, we relate these remote access methods to the examples of sharing semantics introduced in Section 3.

    • Session semantics are a perfect match for caching entire files. Read and Write accesses within a session can be handled by the cached copy, since the file can be associated with different images according to the semantics. The cache consistency problem diminishes to propagating the modifications performed in a session to the master copy at the end of the session. This model is quite attractive, since it has a simple implementation. Observe that coupling these semantics with caching parts of files may complicate matters, since a session is supposed to read the image of the entire file that corresponds to the time it was opened.

    • A distributed implementation of UNIX semantics using caching has serious consequences. The implementation must guarantee that at all times only one client is allowed to write to any of the cached copies of the same file. A distributed conflict resolution scheme must be used in order to arbitrate among clients wishing to access the same file in conflicting modes. In addition, once a cached copy is modified, the changes need to be propagated immediately to the rest of the cached copies. Frequent Writes can generate tremendous network traffic and cause long delays before requests are satisfied. This is why implementations (e.g., Sprite) disable caching altogether and resort to remote service once a file is concurrently open in conflicting modes. (A sketch of such conflict detection appears after this list.) Observe that such an approach implies some form of a server-initiated validation scheme, where the server makes a note of all Open calls. As was stated, UNIX semantics lend themselves to an implementation where a file is associated with a single physical image. A remote service approach, where all requests are directed to and served by a single server, fits nicely with these semantics.

    • The immutable shared files semantics were invented for a whole-file caching scheme [Schroeder et al. 1985]. With these semantics, the cache consistency problem vanishes totally.

    • Transaction-like semantics can be implemented in a straightforward manner using locking, when all the requests for the same file are served by the same server on the same machine, as done in remote service.
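
    As promised above, the following C sketch shows, loosely in the spirit of Sprite, how a server might detect conflicting opens and disable caching; the structures and the notification routine are hypothetical.

        /* Sketch of server-initiated conflict detection.  The server counts
           opens per file and mode; a writer together with any other client
           means conflicting modes, so caching is disabled and clients fall
           back to remote service. */
        struct file_state {
            int readers;              /* clients with the file open for reading */
            int writers;              /* clients with the file open for writing */
            int caching_enabled;
        };

        extern void notify_clients_disable_caching(struct file_state *f);

        void server_open(struct file_state *f, int write_mode)
        {
            if (write_mode)
                f->writers++;
            else
                f->readers++;

            if (f->writers > 0 && f->readers + f->writers > 1 &&
                f->caching_enabled) {
                f->caching_enabled = 0;          /* switch to remote service */
                notify_clients_disable_caching(f);
            }
        }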

    4.3 Comparison of Caching and Remote Service

    Essentially, the choice between caching and remote service is a choice between potential for improved performance and simplicity. We evaluate the trade-off by listing the merits and demerits of the two methods.

    • When caching is used, a substantial amount of the remote accesses can be handled efficiently by the local cache. Capitalizing on locality in file access patterns makes caching even more attractive. One ramification can be performance transparency: most of the remote accesses will be served as fast as local ones. Consequently, server load and network traffic are reduced, and the potential for scalability is enhanced. By contrast, when using the remote service method, each remote access is handled across the network. The penalty in network traffic, server load, and performance is obvious.

    • Total network overhead in transmitting big chunks of data, as done in caching, is lower than when a series of short responses to specific requests is transmitted (as in the remote service method).

    • Disk access routines on the server may be better optimized if it is known that requests are always for large, contiguous segments of data rather than for random disk blocks. This point and the previous one indicate the merits of transferring data in bulk, as done in Andrew.

    • The cache consistency problem is the major drawback to caching. In access patterns that exhibit infrequent writes, caching is superior. When writes are frequent, however, the mechanisms used to overcome the consistency problem incur substantial overhead in terms of performance, network traffic, and server load. It is hard to emulate the sharing semantics of a centralized system in a system using caching as its remote access method. The problem is cache consistency; namely, the fact that accesses are directed to distributed copies, not to a central data object. Observe that the two caching-oriented semantics, session semantics and immutable shared files semantics, are not restrictive and do not enforce serializability. On the other hand, when using remote service, the server serializes all accesses and, hence, is able to implement any centralized sharing semantics.

    • To use caching and benefit from its merits, clients must have either local disks or large main memories. Clients without disks can use remote-service methods without any problems.

    • Since, for caching, data are transferred en masse between the server and client, and not in response to the specific needs of a file operation, the lower intermachine interface is quite different from the upper client interface. The remote service paradigm, on the other hand, is just an extension of the local file system interface across the network. Thus, the intermachine interface mirrors the local client-file system interface.

    5. FAULT TOLERANCE ISSUES

    Fault tolerance is an important and broad subject in the context of DFSs. In this section we focus on the following fault tolerance issues. In Section 5.1 we examine two service paradigms in the context of faults occurring while servicing a client. In Section 5.2 we define the concept of availability and discuss how to increase the availability of files. In Section 5.3 we review file replication as another means for enhancing availability.

    5.1 Stateful Versus Stateless Service

    When a server holds on to information on its clients between servicing their requests, we say the server is stateful. Conversely, when the server does not maintain any information on a client once it has finished servicing its request, we say the server is stateless.

    The typical scenario of a stateful file service is as follows. A client must perform an Open on a file before accessing it. The server fetches some information about the file from its disk, stores it in its memory, and gives the client a connection identifier that is unique to the client and the open file. (In UNIX terms, the server fetches the i-node and gives the client a file descriptor, which serves as an index to an in-core table of i-nodes.) This identifier is used by the client for subsequent accesses until the session ends. Typically, the identifier serves as an index into an in-memory table that records relevant information the server needs to function properly (e.g., the timestamp of the last modification of the corresponding file and its access rights). A stateful service is characterized by a virtual circuit between the client and the server during a session; the connection identifier embodies this virtual circuit. Either upon closing of the file or through a garbage collection mechanism, the server must reclaim the main-memory space used by clients that are no longer active.
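
    A minimal C sketch of such a table follows; the structure, its fields, and the fixed table size are illustrative assumptions rather than the layout of any particular system.

        /* Sketch of a stateful server's open-files table.  Open fetches the
           file information once and returns a connection identifier that
           indexes this in-memory table; subsequent requests present only
           the identifier. */
        #define MAX_OPEN 256

        struct open_entry {
            int  in_use;
            int  inode_no;            /* which file this connection refers to */
            long offset;              /* current position, kept by the server */
        };

        static struct open_entry open_table[MAX_OPEN];

        int server_open_stateful(int inode_no)
        {
            for (int id = 0; id < MAX_OPEN; id++) {
                if (!open_table[id].in_use) {
                    open_table[id].in_use   = 1;
                    open_table[id].inode_no = inode_no;
                    open_table[id].offset   = 0;
                    return id;        /* the connection identifier */
                }
            }
            return -1;                /* table full */
        }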

    The advantage of stateful service is performance. File information is cached in main memory and can be easily accessed using the connection identifier, thereby saving disk accesses. The key point regarding fault tolerance in a stateful service approach is the main-memory information kept by the server on its clients.

    A stateless server avoids this state information by making each request self-contained. That is, each request identifies the file and the position in the file (for Read and Write accesses) in full. The server need not keep a table of open files in main memory, although this is usually done for efficiency reasons. Moreover, there is no need to establish and terminate a connection by Open and Close operations; they are totally redundant, since each file operation stands on its own and is not considered part of a session.
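
    For illustration, a self-contained Read request might look like the following C sketch; the field names and the helper read_from_disk are hypothetical, though the style is the one NFS made familiar.

        /* Sketch of a stateless, self-contained Read request.  The request
           names the file by a systemwide low-level identifier and gives an
           absolute position, so the server needs no per-client state and
           can serve it even immediately after a reboot. */
        struct read_request {
            unsigned long file_id;    /* systemwide low-level file name */
            unsigned long offset;     /* absolute byte position in the file */
            unsigned long count;      /* number of bytes to read */
        };

        extern int read_from_disk(unsigned long file_id, unsigned long offset,
                                  char *buf, unsigned long count);

        int serve_read(const struct read_request *r, char *buf)
        {
            return read_from_disk(r->file_id, r->offset, buf, r->count);
        }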

    The distinction between stateful and stateless service becomes evident when considering the effects of a crash during a service activity. A stateful server loses all its volatile state in a crash. A graceful recovery of such a server involves restoring this state, usually by a recovery protocol based on a dialog with clients. A less graceful recovery implies aborting the operations that were underway when the crash occurred. A different problem is caused by client failures. The server needs to become aware of such failures in order to reclaim the space allocated to record the state of crashed clients. These phenomena are sometimes referred to as orphan detection and elimination.

    A stateless server avoids these problems, since a newly reincarnated server can respond to a self-contained request without difficulty. Therefore, the effects of server failures and recovery are almost unnoticeable. From a client's point of view, there is no difference between a slow server and a recovering server; the client keeps retransmitting its request if it gets no response. Regarding client failures, no obsolete state needs to be cleaned up on the server side.

    The penalty for using the robust stateless service is longer request messages and slower processing of requests, since there is no in-core information to speed the processing. In addition, stateless service imposes other constraints on the design of the DFS. First, since each request identifies the target file, a uniform, systemwide, low-level naming scheme is advised. Translating remote to local names for each request would imply even slower processing of the requests. Second, since clients retransmit requests for file operations, these operations must be idempotent. An idempotent operation has the same effect and returns the same output if executed several times consecutively. Self-contained Read and Write accesses are idempotent, since they use an absolute byte count to indicate the position within a file and do not rely on an incremental offset (as done in UNIX Read and Write system calls). Care must be taken when implementing destructive operations (such as Delete a file) to make them idempotent, too.
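
    The following C sketch contrasts the two kinds of Write; store is a hypothetical low-level helper. Retransmitting the first call is harmless, whereas retransmitting the second would append the data twice.

        /* Idempotent versus non-idempotent Write. */
        extern int store(unsigned long file_id, unsigned long off,
                         const char *buf, unsigned long n);

        /* Same arguments, same final file state: safe to retransmit. */
        int write_absolute(unsigned long file_id, unsigned long off,
                           const char *buf, unsigned long n)
        {
            return store(file_id, off, buf, n);
        }

        /* Advances server-side state on every execution: NOT idempotent. */
        int write_at_current_offset(unsigned long *cur_off, unsigned long file_id,
                                    const char *buf, unsigned long n)
        {
            int rc = store(file_id, *cur_off, buf, n);
            *cur_off += n;
            return rc;
        }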

    In some environments, a stateful service is a necessity. If a Wide Area Network (WAN) or an internetwork is used, it is possible that messages are not received in the order they were sent. A stateful, virtual-circuit-oriented service would be preferable in such a case, since with the maintained state it is possible to order the messages correctly. Also observe that if the server uses the server-initiated method for cache validation, it cannot provide stateless service, since it maintains a record of which files are cached by which clients. On the other hand, it is easier to build a stateless service than a stateful service on top of a datagram communication protocol [Postel 1980].

    The way UNIX uses file descriptors and implicit offsets is inherently stateful. Servers must maintain tables to map the file descriptors to i-nodes and must store the current offset within a file. This is why NFS, which uses a stateless service, does not use file descriptors and includes an explicit offset in every access (see Section 9.2.2).

    5.2 Improving Availability

    Svobodova [1984] defines two file properties in the context of fault tolerance: A file is recoverable if it is possible to revert it to an earlier, consistent state when an operation on the file fails or is aborted by the client. A file is called robust if it is guaranteed to survive crashes of the storage device and decays of the storage medium. A robust file is not necessarily recoverable, and vice versa. Different techniques must be used to implement these two distinct concepts. Recoverable files are realized by atomic update techniques. (We do not give an account of atomic update techniques in this paper.) Robust files are implemented by redundancy techniques, such as mirrored files and stable storage [Lampson 1981].

    It is necessary to consider the additional criterion of availability. A file is called available if it can be accessed whenever needed, despite machine and storage device crashes and communication faults. Availability is often confused with robustness, probably because they both can be implemented by redundancy techniques. A robust file is guaranteed to survive failures, but it may not be available until the faulty component has recovered. Availability is a fragile and unstable property. First, it is temporal; availability varies as the system's state changes. Also, it is relative to a client; for one client a file may be available, whereas for another client on a different machine, the same file may be unavailable.

    Replicating files enhances their availability (see Section 5.3); however, merely replicating files is not sufficient. Several principles intended to ensure increased availability of files are described below.

    The number of machines involved in a file operation should be minimal, since the probability of failure grows with the number of involved parties. Most systems adhere to the client-server pair for all file operations. (This refers to a LAN environment, where no routing is needed.) Locus makes an exception, since its service model involves a triple: a client, a server, and a Centralized Synchronization site (CSS). The CSS is involved only in Open and Close operations; but if the CSS cannot be reached by a client, the file is not available to that particular client. In general, having more than two machines involved in a file operation can cause bizarre situations in which a file is available to some but not all clients.

    Once a file has been located, there is no reason to involve machines other than the client and the server machines. Identifying the server that stores the file and establishing the client-server connection is more problematic. A file location mechanism is an important factor in determining the availability of files. Traditionally, locating a file is done by a pathname traversal, which in a DFS may cross machine boundaries several times and hence involve more than two machines (see Section 2.3.1). In principle, most systems (e.g., Locus, NFS, Andrew) approach the problem by requiring that each component (i.e., directory) in the pathname be looked up directly by the client. Therefore, when machine boundaries are crossed, the server in the client-server pair changes, but the client remains the same. In UNIX United, partially because of routing concerns, this client-server model is not preserved in the pathname traversal. Instead, the pathname traversal request is forwarded from machine to machine along the pathname, without involving the client machine each time.

    Observe that if a file is located by pathname traversal, the availability of the file depends on the availability of all the directories in its pathname. A situation can arise whereby a file might be available to reading and writing clients, but cannot be located by new clients, since a directory in its pathname is unavailable. Replicating top-level directories can partially rectify the problem, and is indeed used in Locus to increase the availability of files.

    Caching directory information can both speed up the pathname traversal and avoid the problem of unavailable directories in the pathname (i.e., if caching occurs before the directory in the pathname becomes unavailable). Andrew and NFS use this technique. Sprite uses a better mechanism for quick and reliable pathname traversal. In Sprite, machines maintain prefix tables that map prefixes of pathnames to the servers that store the corresponding component units. Once a file in some component unit is open, all subsequent Opens of files within that same unit address the right server directly, without intermediate lookups at other servers. This mechanism is faster and guarantees better availability. (For a complete description of the prefix table mechanism, refer to Section 10.2.)
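
    A rough C sketch of a prefix-table lookup follows; the structure and the longest-match rule illustrate the general idea rather than Sprite's actual data structures (those are described in Section 10.2).

        /* Sketch of a prefix-table lookup: the longest prefix of the
           pathname present in the table names the server storing the
           corresponding component unit, so no per-directory lookups at
           other servers are needed. */
        #include <string.h>

        struct prefix_entry {
            const char *prefix;       /* e.g., "/usr/local" */
            const char *server;       /* server storing that component unit */
        };

        const char *lookup_server(const struct prefix_entry *tab, int n,
                                  const char *path)
        {
            const char *best = NULL;
            size_t best_len = 0;
            for (int i = 0; i < n; i++) {
                size_t len = strlen(tab[i].prefix);
                if (len > best_len && strncmp(path, tab[i].prefix, len) == 0) {
                    best = tab[i].server;
                    best_len = len;
                }
            }
            return best;              /* NULL => no entry; consult elsewhere */
        }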

    5.3 File Replication

    Replication of files is a useful redundancy for improving availability. We focus on replication of files on different machines rather than replication on different media on the same machine (such as mirrored disks [Lampson 1981]). Multimachine replication can benefit performance too, since selecting a nearby replica to serve an access request results in shorter service time.

    The basic requirement of a replication scheme is that different replicas of the same file reside on failure-independent machines; that is, the availability of one replica is not affected by the availability of the rest of the replicas. This obvious requirement implies that replication management is inherently a location-dependent activity. Provisions for placing a replica on a particular machine must be available.

    It is desirable to hide the details of replication from users. It is the task of the naming scheme to map a replicated file name to a particular replica. The existence of replicas should be invisible to higher levels. At some level, however, the replicas must be distinguished from one another by having different lower level names. This can be accomplished by first mapping a file name to an entity that is able to differentiate the replicas (as done in Locus). Another transparency issue is providing replication control at higher levels. Replication control includes determining the degree of replication and the placement of replicas. Under certain circumstances, it is desirable to expose these details to users. Locus, for instance, provides users and system administrators with mechanisms to control the replication scheme.

    The main problem associated with replicas is their update. From a user's point of view, replicas of a file denote the same logical entity; thus, an update to any replica must be reflected on all other replicas. More precisely, the relevant sharing semantics must be preserved when accesses to replicas are viewed as virtual accesses to their logical files. The analogous database term is One-Copy Serializability [Bernstein et al. 1987]. Davidson et al. [1985] survey approaches to replication for database systems, where consistency considerations are of major importance. If consistency is not of primary importance, it can be sacrificed for availability and performance. This is an incarnation of a fundamental trade-off in the area of fault tolerance: the choice is between preserving consistency at all costs, thereby creating a potential for indefinite blocking, and sacrificing consistency under some (we hope rare) circumstances of catastrophic failures for the sake of guaranteed progress. We illustrate this trade-off by considering (in a conceptual manner) the problem of updating a set of replicas of the same file. The atomicity of such an update is a desirable property; that is, a situation in which both updated and not updated replicas serve accesses should be prevented. The only way to guarantee the atomicity of such an update is by using a commit protocol (e.g., two-phase commit), which can lead to indefinite blocking in the face of machine and network failures [Bernstein et al. 1987]. On the other hand, if only the available replicas are updated, progress is guaranteed; stale replicas, however, are present.

    In most cases, the consistency of file data cannot be compromised, and hence the price paid for increased availability by replication is a complicated update propagation protocol. One case in which consistency can be traded for performance, as well as availability, is the replication of the location hints discussed in Section 2.3.2. Since hints are validated upon use, their replication does not require maintaining their consistency. When a location hint is correct, it results in quick location of the corresponding file without relying on a location server. Among the surveyed systems, Locus uses replication extensively and sacrifices consistency in a partitioned environment for the sake of availability of files for both Read and Write accesses (see Section 8.5 for details).
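
    A minimal C sketch of a hint that is validated upon use follows; probe and locate_via_server are hypothetical helpers.

        /* Sketch of using a replicated location hint.  A wrong hint costs
           only a failed probe, so hint replicas need not be kept
           consistent. */
        extern int probe(const char *server, unsigned long file_id);
                                      /* 1 if the server stores the file */
        extern const char *locate_via_server(unsigned long file_id);
                                      /* authoritative but slower lookup */

        const char *locate(unsigned long file_id, const char *hint)
        {
            if (hint && probe(hint, file_id))
                return hint;                   /* hint was correct: fast path */
            return locate_via_server(file_id); /* stale hint: fall back */
        }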

    Facing the problems associated with maintaining the consistency of replicas, a popular compromise is read-only replication. Files known to be frequently read and rarely modified are replicated using this restricted variant of replication. Usually, only one primary replica can be modified, and the propagation of the updates involves either taking the file off line or using some costly procedure that guarantees atomicity of the updates. Files containing the object code of system programs are good candidates for this kind of replication, as are system data files (e.g., location databases and user registries).

    As an illustration of the concepts discussed above, we describe the replication scheme of Ibis, which is quite unique [Tichy and Ruan 1984]. Ibis uses a variation of the primary copy approach. The domain of the name mapping is a pair: a primary replica identifier and a local replica identifier, if there is one. (If there is no replica locally, a special value is returned.) Thus, the mapping is relative to a machine. If the local replica is the primary one, the pair contains two identical identifiers. Ibis supports demand replication, which is an automatic replication control policy (similar to whole-file caching). Demand replication means that reading a nonlocal replica causes it to be cached locally, thereby generating a new nonprimary replica. Updates are performed only on the primary copy and cause all other replicas to be invalidated by sending appropriate messages. Atomic and serialized invalidation of all nonprimary replicas is not guaranteed; hence, it is possible that a stale replica is considered valid. Consistency of replicas is thus sacrificed for a simple update protocol. To satisfy remote Write accesses, the primary copy is migrated to the requesting machine.
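
    The following C sketch illustrates the pair-valued name mapping described above; the types and the sentinel value are assumptions made for the illustration, not the actual Ibis interface.

        /* Sketch of an Ibis-style, machine-relative name mapping: a name
           resolves to the primary replica's identifier plus the local
           replica's identifier, or a special value when there is no local
           replica. */
        #define NO_LOCAL_REPLICA 0

        struct replica_pair {
            unsigned long primary_id; /* identifier of the primary replica */
            unsigned long local_id;   /* local replica, or NO_LOCAL_REPLICA */
        };

        int is_primary_here(const struct replica_pair *p)
        {
            /* on the primary site the pair holds two identical identifiers */
            return p->local_id != NO_LOCAL_REPLICA &&
                   p->local_id == p->primary_id;
        }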

    6. SCALABILITY ISSUES

    Very large-scale DFSs, to a great extent, are still visionary. Andrew is the closest to being classified as a very large-scale system, with a planned configuration of thousands of workstations. There are no magic guidelines to ensure the scalability of a system. Examples of nonscalable designs, however, are abundant. In Section 6.1 we discuss several designs that pose problems and propose possible solutions, all in the context of scalability. In Section 6.2 we describe an implementation technique, Lightweight Processes, essential for high-performance and scalable designs.

    6.1 Guidelines by Negative Examples

    Barak and Kornatzky [1987] list several principles for designing very large-scale systems. The first is called Bounded Resources: the service demand from any component of the system should be bounded by a constant that is independent of the number of nodes in the system. Any server whose load is proportional to the size of the system is destined to become clogged once the system grows beyond a certain size. Adding more resources will not alleviate the problem; the capacity of this server simply limits the growth of the system. This is why the CSS of Locus is not a scalable design. In Locus, every filegroup (the Locus component unit, which is equivalent to a UNIX removable file system) is assigned a CSS, whose responsibility it is to synchronize accesses to files in that filegroup. Every Open request to a file within that filegroup must go through this machine. Beyond a certain system size, CSSs of frequently accessed filegroups are bound to become points of congestion, since they would need to satisfy a growing number of clients.

    The principle of bounded resources can be applied to channels and network traffic too, and hence prohibits the use of broadcasting. Broadcasting is an activity that involves every machine in the network. A mechanism that relies on broadcasting is simply not realistic for large-scale systems.

    The third example combines aspects of scalability and fault tolerance. It was already mentioned that if a stateless service is used, a server need not detect a client's crash, nor take any precautions because of it. Obviously, this is not the case with stateful service, since the server must detect clients' crashes and at least discard the state it maintains for them. It is interesting to contrast the ways MOS and Locus reclaim obsolete state storage on servers [Barak and Litman 1985; Barak and Paradise 1986].

    The approach taken in MOS is garbage collection. It is the client's responsibility to set, and later reset, an expiration date on state information the servers maintain for it. Clients reset this date whenever they access the server, or by special, infrequent messages. If this date has expired, a periodic garbage collector reclaims that storage. This way, the server need not detect clients' crashes. By contrast, Locus invokes a clean-up procedure whenever a server machine determines that a particular client machine is unavailable. Among other things, this procedure releases the space occupied by the state of clients from the crashed machine. Detecting crashes can be very expensive, since it is based on polling and time-out mechanisms that incur substantial network overhead. The MOS scheme incurs tolerable and scalable overhead, where every client signals a bounded number of objects (the objects it owns), whereas a failure detection mechanism is not scalable, since it depends on the size of the system.
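
    A minimal C sketch of such expiration-based reclamation follows; the table layout and the TTL constant are illustrative assumptions, not MOS internals.

        /* Sketch of expiration dates on per-client server state.  Clients
           refresh the date on each access; a periodic sweep reclaims
           entries whose date has passed, so the server never needs to
           detect client crashes. */
        #include <time.h>

        #define TTL 600                       /* seconds of validity per refresh */

        struct client_state {
            int    in_use;
            time_t expires;                   /* set and reset by the client */
        };

        void refresh(struct client_state *s)
        {
            s->expires = time(NULL) + TTL;
        }

        void garbage_collect(struct client_state *tab, int n)
        {
            time_t now = time(NULL);
            for (int i = 0; i < n; i++)
                if (tab[i].in_use && tab[i].expires < now)
                    tab[i].in_use = 0;        /* reclaim a dead client's storage */
        }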

    Network congestion and latency are major obstacles to large-scale systems. A guideline worth pursuing is to minimize cross-machine interactions by means of caching, hints, and enforcement of relaxed sharing semantics. There is, however, a trade-off between the strictness of the sharing semantics in a DFS and the network and server loads (and hence necessarily the scalability potential). The more stringent the semantics, the harder it is to scale the system up.

    Central control schemes and central resources should not be used to build scalable (and fault-tolerant) systems. Examples of centralized entities are a central authentication server, a central naming server, and a central file server. Centralization is a form of functional asymmetry among the machines comprising the system. The ideal alternative is a configuration that is functionally symmetric; that is, all the component machines have an equal role in the operation of the system, and hence each machine has some degree of autonomy. Practically, it is impossible to comply with such a principle. For instance, incorporating diskless machines violates functional symmetry. Autonomy and symmetry are, however, important goals to which to aspire.

    An important aspect of decentralization is system administration. Administrative responsibilities should be delegated to encourage autonomy and symmetry, without disturbing the coherence and uniformity of the distributed system. Andrew and Apollo Domain support decentralized system management [Leach et al. 1985].

    The practical approximation to a symmetric and autonomous configuration is clustering, where a system is partitioned into a collection of semiautonomous clusters. A cluster consists of a set of machines and a dedicated cluster server. To make cross-cluster file references relatively infrequent, most of the time each machine's requests should be satisfied by its own cluster server. Such a requirement depends on the ability to localize file references and on the appropriate placement of component units. If the cluster is well balanced, that is, if the server in charge suffices to satisfy a majority of the cluster demands, it can be used as a modular building block to scale up the system. Observe that clustering complies with the Bounded Resources Principle. In essence, clustering attempts to associate a server with a fixed set of clients and a set of files they access frequently, not just with an arbitrary set of files. Andrew's use of clusters, coupled with read-only replication of key files, is a good example of a scalable clustering scheme.

    UNIX United emphasizes the concept of autonomy. There, UNIX systems are joined together in a recursive manner to create a larger global system [Randell 1983]. Each component system is a complex UNIX system that can operate and be administered independently. Again, modular and autonomous components are combined to create a large-scale system. The emphasis on autonomy results in some negative effects, however, since component boundaries are visible to users.

    6.2 Lightweight Processes

    A major problem in the design of any service is the process structure of the server. Servers are supposed to operate efficiently in peak periods, when hundreds of active clients need to be served simultaneously. A single server process is certainly not a good choice, since whenever a request necessitates disk I/O, the whole service is delayed until the I/O is completed. Assigning a process to each client is a better choice; however, the overhead of multiplexing the CPU among these processes (i.e., the context switches) is an expensive price that must be paid.

    A related problem has to do with the fact that all the server processes need to share information, such as file headers and service tables. In UNIX 4.2 BSD, processes are not permitted to share address spaces; hence, sharing must be done externally by using files and other unnatural mechanisms.

    It appears that one of the best solutions for the server architecture is the use of Lightweight Processes (LWPs), or threads. A thread is a process that has very little nonshared state. A group of peer threads share code, address space, and operating system resources. An individual thread has at least its own register state. The extensive sharing makes context switches among peer threads, as well as thread creation, inexpensive compared with context switches among traditional, heavyweight processes. Thus, blocking a thread and switching to another thread is a reasonable solution to the problem of a server handling many requests. The abstraction presented by a group of LWPs is that of multiple threads of control associated with some shared resources.

    There are many alternatives regarding threads; we mention a few of them briefly. Threads can be supported above the kernel, at the user level (as done in Andrew), or by the kernel (as in Mach [Tevanian et al. 1987]). Usually, a lightweight process is not bound to a particular client; instead, it serves single requests of different clients. Scheduling of threads can be preemptive or nonpreemptive. If threads are allowed to run to completion, their shared data need not be explicitly protected. Otherwise, some explicit locking mechanism must be used to synchronize accesses to the shared data.

    Typically, when LWPs are used to implement a service, client requests accumulate in a common queue, and threads are assigned to requests from the queue. The advantages of using an LWP scheme to implement the service are twofold. First, an I/O request delays a single thread, not the entire service. Second, sharing common data structures (e.g., the request queue) among the threads is easily facilitated.
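
    The following C sketch shows this queue-plus-threads structure, using POSIX threads as a stand-in for the LWP packages of the surveyed systems; serve and the fixed-size queue are illustrative, and overflow handling is omitted.

        /* A fixed pool of peer threads removes requests from a shared
           queue; an I/O wait blocks one thread only, not the service. */
        #include <pthread.h>

        #define NREQ 128

        static int queue[NREQ], head = 0, tail = 0;   /* request identifiers */
        static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

        extern void serve(int request);               /* may block on disk I/O */

        void enqueue(int request)                     /* called on arrival */
        {
            pthread_mutex_lock(&m);
            queue[tail] = request;                    /* overflow check omitted */
            tail = (tail + 1) % NREQ;
            pthread_cond_signal(&nonempty);
            pthread_mutex_unlock(&m);
        }

        void *worker(void *arg)                       /* one per pool thread */
        {
            (void)arg;
            for (;;) {
                pthread_mutex_lock(&m);
                while (head == tail)                  /* queue empty: wait */
                    pthread_cond_wait(&nonempty, &m);
                int request = queue[head];
                head = (head + 1) % NREQ;
                pthread_mutex_unlock(&m);
                serve(request);                       /* delays this thread only */
            }
            return NULL;
        }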

    It is clear that some form of LWP scheme is essential for servers to be scalable. Locus, Sprite, Andrew, use suc