
    BlueSky: A Cloud-Backed File System for the Enterprise

Michael Vrable, Stefan Savage, and Geoffrey M. Voelker

Department of Computer Science and Engineering, University of California, San Diego

    Abstract

We present BlueSky, a network file system backed by cloud storage. BlueSky stores data persistently in a cloud storage provider such as Amazon S3 or Windows Azure, allowing users to take advantage of the reliability and large storage capacity of cloud providers and avoid the need for dedicated server hardware. Clients access the storage through a proxy running on-site, which caches data to provide lower-latency responses and additional opportunities for optimization. We describe some of the optimizations which are necessary to achieve good performance and low cost, including a log-structured design and a secure in-cloud log cleaner. BlueSky supports multiple protocols—both NFS and CIFS—and is portable to different providers.

    1 Introduction

The promise of third-party cloud computing services is a trifecta of reduced cost, dynamic scalability, and high availability. While there remains debate about the precise nature and limit of these properties, it is difficult to deny that cloud services offer real utility—evident in the large numbers of production systems now being cloud-hosted via services such as Amazon's AWS and Microsoft's Azure. However, thus far, services hosted in the cloud have largely fallen into two categories: consumer-facing Web applications (e.g., Netflix customer Web site and streaming control) and large-scale data crunching (e.g., Netflix media encoding pipeline).

Little of this activity, however, has driven widespread outsourcing of enterprise computing and storage applications. The reasons for this are many and varied, but they largely reflect the substantial inertia of existing client-server deployments. Enterprises have large capital and operational investments in client software and depend on the familiar performance, availability and security characteristics of traditional server platforms. In essence, cloud computing is not currently a transparent drop-in replacement for existing services.

Current affiliation: Google. The work in this paper was performed while a student at UC San Diego.

There are also substantive technical challenges to overcome, as the design points for traditional client-server applications (e.g., file systems, databases, etc.) frequently do not mesh well with the services offered by cloud providers. In particular, many such applications are designed to be bandwidth-hungry and latency-sensitive (a reasonable design in a LAN environment), while the remote nature of cloud service naturally increases latency and the cost of bandwidth.

Moreover, while cloud services typically export simple interfaces to abstract resources (e.g., "put file" for Amazon's S3), traditional server protocols can encapsulate significantly more functionality. Thus, until such applications are redesigned, much of the latent potential for outsourcing computing and storage services remains untapped. Indeed, at $115B/year, small and medium business (SMB) expenditures for servers and storage represent an enormous market should these issues be resolved [9]. Even if the eventual evolution is towards hosting all applications in the cloud, it will be many years before such a migration is complete. In the meantime, organizations will need to support a mix of local applications and use of the cloud.

In this paper, we explore an approach for bridging these domains for one particular application: network file service. In particular, we are concerned with the extent to which traditional network file service can be replaced with commodity cloud services. However, our design is purposely constrained by the tremendous investment (both in capital and training) in established file system client software; we take as a given that end-system software will be unchanged. Consequently, we focus on a proxy-based solution, one in which a dedicated proxy server provides the illusion of a single traditional file server in an enterprise setting, translating requests into appropriate cloud storage API calls over the Internet.

We explore this approach through a prototype system, called BlueSky, that supports both the NFS and CIFS network file system protocols and includes drivers for both the Amazon EC2/S3 environment and Microsoft's Azure. The engineering of such a system faces a number of design challenges, the most obvious of which revolve around performance (i.e., caching, hiding latency, and maximizing the use of Internet bandwidth), but which less intuitively also interact strongly with cost. In particular, the interaction between the storage interfaces and fee schedule provided by current cloud service providers conspires to favor large segment-based layout designs (as well as cloud-based file system cleaners). We demonstrate that ignoring these issues can dramatically inflate costs (as much as 30× in our benchmarks) without significantly improving performance. Finally, across a series of benchmarks we demonstrate that, when using such a design, commodity cloud-based storage services can provide performance competitive with local file servers for the capacity and working sets demanded by enterprise workloads, while still accruing the scalability and cost benefits offered by third-party cloud services.

    2 Related Work

Network storage systems have engendered a vast literature, much of it focused on the design and performance of traditional client-server systems such as NFS, AFS, CIFS, and WAFL [6, 7, 8, 25]. Recently, a range of efforts has considered other structures, including those based on peer-to-peer storage [16] among distributed sets of untrusted servers [12, 13], which have indirectly informed subsequent cloud-based designs.

Cloud storage is a newer topic, driven by the availability of commodity services from Amazon's S3 and other providers. The elastic nature of cloud storage is reminiscent of the motivation for the Plan 9 write-once file systems [19, 20], although cloud communication overheads and monetary costs argue against a block interface and no storage reclamation. Perhaps the closest academic work to our own is SafeStore [11], which stripes erasure-coded data objects across multiple storage providers, ultimately exploring access via an NFS interface. However, SafeStore is focused clearly on availability, rather than performance or cost, and thus its design decisions are quite different. A similar, albeit more complex, system is DepSky [2], which also focuses strongly on availability, proposing a "cloud of clouds" model to replicate across providers.

At a more abstract level, Chen and Sion create an economic framework for evaluating cloud storage costs and conclude that the computational costs of the cryptographic operations needed to ensure privacy can overwhelm other economic benefits [3]. However, this work predates Intel's AES-NI architecture extension, which significantly accelerates data encryption operations.

There have also been a range of non-academic attempts to provide traditional file system interfaces for the key-value storage systems offered by services like Amazon's S3. Most of these install new per-client file system drivers. Exemplars include s3fs [22], which tries to map the file system directly on to S3's storage model (which both changes file system semantics and can dramatically increase costs), and ElasticDrive [5], which exports a block-level interface (potentially discarding optimizations that use file-level knowledge, such as prefetching).

However, the systems closest to our own are cloud storage gateways, a new class of storage server that has emerged in the last few years (contemporaneous with our effort). These systems, exemplified by companies such as Nasuni, Cirtas, TwinStrata, StorSimple and Panzura, provide caching network file system proxies (or "gateways") that are, at least on the surface, very similar to our design. Pricing schedules for these systems generally reflect a 2× premium over raw cloud storage costs. While few details of these systems are public, in general they validate the design point we have chosen.

Of commercial cloud storage gateways, Nasuni [17] is perhaps most similar to BlueSky. Nasuni provides a "virtual NAS appliance" (or "filer"), software packaged as a virtual machine which the customer runs on their own hardware—this is very much like the BlueSky proxy software that we build. The Nasuni filer acts as a cache and writes data durably to the cloud. Because Nasuni does not publish implementation details it is not possible to know precisely how similar Nasuni is to BlueSky, though there are some external differences. In terms of cost, Nasuni charges a price based simply on the quantity of disk space consumed (around $0.30/GB/month, depending on the cloud provider)—and not at all a function of data transferred or operations performed. Presumably, Nasuni optimizes their system to reduce the network and per-operation overheads—otherwise those would eat into their profits—but the details of how they do so are unclear, other than by employing caching.

Cirtas [4] builds a cloud gateway as well, but sells it in appliance form: Cirtas's Bluejet is a rack-mounted computer which integrates software to cache file system data with storage hardware in a single package. Cirtas thus has a higher up-front cost than Nasuni's product, but is easier to deploy. Panzura [18] provides yet another CIFS/NFS gateway to cloud storage. Unlike BlueSky and the others, Panzura allows multiple customer sites to each run a cloud gateway. Each of these gateways accesses the same underlying file system, so Panzura is particularly appropriate for teams sharing data over a wide area. But again, implementation details are not provided. TwinStrata [29] and StorSimple [28] implement gateways that present a block-level storage interface, like ElasticDrive, and thus lose many potential file system-level optimizations as well.

In some respects BlueSky acts like a local storage server that backs up data to the cloud—a local NFS server combined with Mozy [15], Cumulus [30], or similar software could provide similar functionality. However, such backup tools may not support a high backup frequency (ensuring data reaches the cloud quickly) and efficient random access to files in the cloud. Further, they treat the local data (rather than the cloud copy) as authoritative, preventing the local server from caching just a subset of the files.

    3 Architecture

BlueSky provides service to clients in an enterprise using a transparent proxy-based architecture that stores data persistently on cloud storage providers (Figure 1). The enterprise setting we specifically consider consists of a single proxy cache colocated with enterprise clients, with a relatively high-latency yet high-bandwidth link to cloud storage, and with typical office and engineering request workloads to files totaling tens of terabytes. This section discusses the role of the proxy and cloud provider components, as well as the security model supported by BlueSky. Sections 4 and 5 then describe the layout and operation of the BlueSky file system and the BlueSky proxy, respectively.

Cloud storage acts much like another layer in the storage hierarchy. However, it presents new design considerations that, combined, make it distinct from other layers and strongly influence its use as a file service. The high latency to the cloud necessitates aggressive caching close to the enterprise. On the other hand, cloud storage has elastic capacity and provides operation service times independent of spatial locality, thus greatly easing free space management and data layout. Cloud storage interfaces often only support writing complete objects in an operation, preventing the efficient update of just a portion of a stored object. This constraint motivates an append rather than an overwrite model for storing data.

Monetary cost also becomes an explicit metric of optimization: cloud storage capacity might be elastic, but it still needs to be parsimoniously managed to minimize storage costs over time [30]. With an append model of storage, garbage collection becomes a necessity. Providers also charge a small cost for each operation. Although slight, these costs are sufficiently high to motivate aggregating small objects (metadata and small files) into larger units when writing data. Finally, outsourcing data storage makes security a primary consideration.

    3.1 Local Proxy

The central component of BlueSky is a proxy situated between clients and cloud providers. The proxy communicates with clients in an enterprise using a standard network file system protocol, and communicates with cloud providers using a cloud storage protocol. Our prototype supports both the NFS (version 3) and CIFS protocols for clients, and the RESTful protocols for the Amazon S3 and Windows Azure cloud services.

Figure 1: BlueSky architecture. (Diagram: NFS and CIFS front ends accept client requests and return responses; resource managers for memory, disk, network, and encryption sit between them and the S3 and WAS back ends. The proxy performs disk journal writes and disk cache reads locally, and segment writes and range reads against the cloud.)

Ideally, the proxy runs in the same enterprise network as the clients to minimize latency to them. The proxy caches data locally and manages sharing of data among clients without requiring an expensive round-trip to the cloud.

Clients do not require modification since they continue to use standard file-sharing protocols. They mount BlueSky file systems exported by the proxy just as if they were exported from an NFS or CIFS server. Further, the same BlueSky file system can be mounted by any type of client, with shared semantics equivalent to Samba.

As described in more detail later, BlueSky lowers cost and improves performance by adopting a log-structured data layout for the file system stored on the cloud provider. A cleaner reclaims storage space by garbage-collecting old log segments which do not contain any live objects, and by processing almost-empty segments, copying live data out of old segments into new segments.

As a write-back cache, the BlueSky proxy can fully satisfy client write requests with local network file system performance by writing to its local disk—as long as its cache capacity can absorb periods of write bursts as constrained by the bandwidth the proxy has to the cloud provider (Section 6.5). For read requests, the proxy can provide local performance to the extent that the proxy can cache the working set of the client read workload (Section 6.4).

    3.2 Cloud Provider

So that BlueSky can potentially use any cloud provider for persistent storage service, it makes minimal assumptions of the provider; in our experiments, we use both Amazon S3 and the Windows Azure blob service. BlueSky requires only a basic interface supporting get, put, list, and delete operations. If the provider also supports a hosting service, BlueSky can co-locate the file system cleaner at the provider to reduce cost and improve cleaning performance.
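To make this interface concrete, the following sketch (ours, not BlueSky's actual API) captures the four operations BlueSky assumes of a provider; the class and method signatures are illustrative assumptions.

    # Minimal sketch of the four-operation provider interface BlueSky assumes.
    # Class and method names are illustrative, not BlueSky's actual API.
    from abc import ABC, abstractmethod
    from typing import List, Optional

    class CloudBackend(ABC):
        @abstractmethod
        def put(self, key: str, data: bytes) -> None:
            """Store a complete object in a single operation."""

        @abstractmethod
        def get(self, key: str, offset: int = 0,
                length: Optional[int] = None) -> bytes:
            """Fetch an object, or a byte range of it (HTTP Range request)."""

        @abstractmethod
        def list(self, prefix: str) -> List[str]:
            """List keys under a prefix (e.g., one writer's log directory)."""

        @abstractmethod
        def delete(self, key: str) -> None:
            """Remove an object (used by the cleaner to reclaim segments)."""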

    3.3 Security

Security becomes a key concern when outsourcing critical functionality such as data storage. In designing BlueSky, our goal is to provide high assurances of data confidentiality and integrity. The proxy encrypts all client data before sending it over the network, so the provider cannot read private data. Encryption is at the level of objects (inodes, file blocks, etc.) and not entire log segments. Data stored at the provider also includes integrity checks to detect any tampering by the storage provider.

However, some trust in the cloud provider is unavoidable, particularly for data availability. The provider can always delete or corrupt stored data, rendering it unavailable. These actions could be intentional—e.g., if the provider is malicious—or accidental, for instance due to insufficient redundancy in the face of correlated hardware failures from disasters. Ultimately, the best guard against such problems is through auditing and the use of multiple independent providers [2, 11]. BlueSky could readily incorporate such functionality, but doing so remains outside the scope of our current work.

A buggy or malicious storage provider could also serve stale data. Instead of returning the most recent data, it could return an old copy of a data object that nonetheless has a valid signature (because it was written by the client at an earlier time). By authenticating pointers between objects starting at the root, however, BlueSky prevents a provider from selectively rolling back file data. A provider can only roll back the entire file system to an earlier state, which customers will likely detect.

BlueSky can also take advantage of computation in the cloud for running the file system cleaner. As with storage, we do not want to completely trust the computational service, yet doing so creates a tension in the design. To maintain confidentiality, data encryption keys should not be available on cloud compute nodes. Yet, if cloud compute nodes are used for file system maintenance tasks, the compute nodes must be able to read and manipulate file system data structures. For BlueSky, we make the tradeoff of encrypting file data while leaving the metadata necessary for cleaning the file system unencrypted. As a result, storage providers can understand the layout of the file system, but the data remains confidential and the proxy can still validate its integrity.

In summary, BlueSky provides strong confidentiality and slightly weaker integrity guarantees (some data rollback attacks might be possible but are largely prevented), but must rely on the provider for availability.

    4 BlueSky File System

This section describes the BlueSky file system layout. We present the object data structures maintained in the file system and their organization in a log-structured format. We also describe how BlueSky cleans the logs comprising the file system, and how the design conveniently lends itself to providing versioned backups of the data stored in the file system.

    4.1 Object Types

BlueSky uses four types of objects for representing data and metadata in its log-structured file system [23] format: data blocks, inodes, inode maps, and checkpoints. These objects are aggregated into log segments for storage. Figure 2 illustrates their relationship in the layout of the file system. On top of this physical layout BlueSky provides standard POSIX file system semantics, including atomic renames and hard links.

Data blocks store file data. Files are broken apart into fixed-size blocks (except the last block, which may be short). BlueSky uses 32 KB blocks instead of typical disk file system sizes like 4 KB to reduce overhead: block pointers as well as extra header information impose a higher per-block overhead in BlueSky than in an on-disk file system. In the evaluations in Section 6, we show the cost and performance tradeoffs of this decision. Nothing fundamental, however, prevents BlueSky from using variable-size blocks optimized for the access patterns of each file, but we have not implemented this approach.

Inodes for all file types include basic metadata: ownership and access control, timestamps, etc. For regular files, inodes include a list of pointers to data blocks with the file contents. Directory entries are stored inline within the directory inode to reduce the overhead of path traversals. BlueSky does not use indirect blocks for locating file data—inodes directly contain pointers to all data blocks (easy to do since inodes are not fixed-size).

Inode maps list the locations in the log of the most recent version of each inode. Since inodes are not stored at fixed locations, inode maps provide the necessary level of indirection for locating inodes.

A checkpoint object determines the root of a file system snapshot. A checkpoint contains pointers to the locations of the current inode map objects. On initialization the proxy locates the most recent checkpoint by scanning backwards in the log, since the checkpoint is always one of the last objects written. Checkpoints are useful for maintaining file system integrity in the face of proxy failures, for decoupling cleaning and file service, and for providing versioned backup.
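As a rough illustration of these four object types and the relationships between them, the following Python sketch models them as simple records. The field names are ours; the paper describes the content of each object, not its exact encoding.

    # Illustrative sketch of BlueSky's four object types. Field names are
    # assumptions for clarity, not BlueSky's actual on-disk encodings.
    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    Location = Tuple[str, int, int]  # (log directory, segment sequence number, byte offset)

    @dataclass
    class ObjectPointer:
        uid: bytes          # stable unique identifier; survives relocation by the cleaner
        location: Location  # physical position in the cloud log (updated by the cleaner)
        size: int           # size hint so the proxy can issue an exact range read

    @dataclass
    class DataBlock:
        payload: bytes      # up to 32 KB of file data

    @dataclass
    class Inode:
        inode_number: int
        metadata: dict      # ownership, access control, timestamps, ...
        blocks: List[ObjectPointer] = field(default_factory=list)  # direct pointers only

    @dataclass
    class InodeMap:
        locations: Dict[int, ObjectPointer]  # inode number -> latest inode version

    @dataclass
    class Checkpoint:
        inode_maps: List[ObjectPointer]      # root of a file system snapshot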

    4.2 Cloud Log

For each file system, BlueSky maintains a separate log for each writer to the file system. Typically there are two: the proxy managing the file system on behalf of clients, and a cleaner that garbage collects overwritten data. Each writer stores its log segments to a separate directory (a different key prefix), so writers can make updates to the file system independently.

Each log consists of a number of log segments, and each log segment aggregates multiple objects together into an approximately fixed-size container for storage and transfer.


Figure 2: BlueSky file system layout. The top portion shows the logical organization. Object pointers are shown with solid arrows. Shaded objects are encrypted (but pointers are always unencrypted). The bottom of the figure illustrates how these log items are packed into segments stored in the cloud.

In the current implementation segments are up to about 4 MB, large enough to avoid the overhead of dealing with many small objects. Though the storage interface requires that each log segment be written in a single operation, cloud providers typically allow partial reads of objects. As a result, BlueSky can read individual objects regardless of segment size. Section 6.6 quantifies the performance benefits of grouping data into segments and of selective reads, and Section 6.7 quantifies their cost benefits.

A monotonically-increasing sequence number identifies each log segment within a directory, and a byte offset identifies a specific object in the segment. Together, the triple (directory, sequence number, offset) describes the physical location of each object. Object pointers also include the size of the object; while not required, this hint allows BlueSky to quickly issue a read request for the exact bytes needed to fetch the object.
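A sketch of how such a pointer resolves to an exact byte-range read, reusing the hypothetical CloudBackend and ObjectPointer types sketched earlier (the key-naming scheme is our assumption):

    # Sketch: resolving an object pointer to an exact byte-range read.
    def fetch_object(backend, pointer):
        directory, seqnum, offset = pointer.location
        key = f"{directory}/segment-{seqnum:08d}"  # illustrative key naming
        # The size hint lets the proxy fetch exactly the bytes for this
        # object, rather than the whole (up to ~4 MB) segment.
        return backend.get(key, offset=offset, length=pointer.size)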

In support of BlueSky's security goals (Section 3.3), file system objects are individually encrypted (with AES) and protected with a keyed message authentication code (HMAC-SHA-256) by the proxy before uploading to the cloud service. Each object contains data with a mix of protections: some data is encrypted and authenticated, some data is authenticated plain-text, and some data is unauthenticated. The keys for encryption and authentication are not shared with the cloud, though we assume that customers keep a safe backup of these keys for disaster recovery. Figure 3 summarizes the fields included in objects.
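The following sketch illustrates this per-object packaging with AES and HMAC-SHA-256. The paper does not specify the cipher mode or field framing, so CTR mode and the layout below are assumptions for illustration only.

    # Sketch of per-object protection: AES encryption of the payload plus an
    # HMAC-SHA-256 over the authenticated fields. CTR mode and the framing
    # here are our assumptions; the paper specifies only AES and HMAC-SHA-256.
    import os
    import hmac
    import hashlib
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def seal_object(enc_key: bytes, mac_key: bytes,
                    authenticated_header: bytes,  # object type, UID, inode number
                    payload: bytes,                # file data and pointer UIDs
                    unauthenticated: bytes) -> bytes:  # pointer physical locations
        nonce = os.urandom(16)
        encryptor = Cipher(algorithms.AES(enc_key), modes.CTR(nonce)).encryptor()
        ciphertext = encryptor.update(payload) + encryptor.finalize()
        # The MAC covers the header and encrypted payload, but NOT the physical
        # locations, so an in-cloud cleaner can relocate objects without keys.
        tag = hmac.new(mac_key, authenticated_header + nonce + ciphertext,
                       hashlib.sha256).digest()
        return b"".join([authenticated_header, nonce, ciphertext, tag,
                         unauthenticated])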

BlueSky generates a unique identifier (UID) for each object when the object is written into the log. The UID remains constant if an item is simply relocated to a new log position. An object can contain pointers to other objects—for example, an inode pointing to data blocks—and the pointer lists both the UID and the physical location.

Authenticated:
    Object type
    Unique identifier (UID)
    Inode number
Encrypted:
    Object payload
    Object pointers: UIDs
Unauthenticated:
    Object pointers: physical locations

Figure 3: Data fields included in most objects.

A cleaner in the cloud can relocate objects and update pointers with the new locations; as long as the UID in the pointer and the object match, the proxy can validate that the data has not been tampered with.

    4.3 Cleaner

As with any log-structured file system, BlueSky requires a file system cleaner to garbage collect data that has been overwritten. Unlike in traditional disk-based systems, the elastic nature of cloud storage means that the file system can grow effectively unbounded. Thus, the cleaner is not necessary to make progress when writing out new data, only to reduce storage costs and defragment data for more efficient access.

We designed the BlueSky cleaner so that it can run either at the proxy or on a compute instance within the cloud provider, where it has faster, cheaper access to the storage. For example, when running the cleaner in Amazon EC2 and accessing storage in S3, Amazon does not charge for data transfers (though it still charges for operations). A cleaner running in the cloud does not need to be fully trusted—it will need permission to read and write cloud storage, but does not require the file system encryption and authentication keys.
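A greatly simplified sketch of one cleaning pass conveys the idea; the actual prototype cleaner is about 650 lines of Python (Section 5.4), and the helper names and liveness test below are our assumptions.

    # Much-simplified sketch of an in-cloud cleaner pass. It needs storage
    # access but no encryption keys, since object UIDs and pointer locations
    # are unencrypted metadata. Helper names are illustrative assumptions.
    def clean_pass(backend, snapshot, utilization_threshold=0.5):
        live = snapshot.live_object_locations()  # from the latest proxy checkpoint
        for key in backend.list(prefix="proxy/"):
            objects = parse_segment_index(backend, key)  # offsets/sizes per object
            live_objects = [o for o in objects if (key, o.offset) in live]
            if not live_objects:
                backend.delete(key)  # segment is all garbage: reclaim it
            elif len(live_objects) / len(objects) < utilization_threshold:
                # Mostly-empty segment: copy live objects into the cleaner's own
                # log (updating pointer locations but never UIDs); the old
                # segment becomes garbage once the proxy merges the changes.
                relocate(backend, live_objects, dest_prefix="cleaner/")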


The cleaner runs online with no synchronous interactions with the proxy: clients can continue to access and modify the file system even while the cleaner is running. Conflicting updates to the same objects are later merged by the proxy, as described in Section 5.3.

    4.4 Backups

The log-structured design allows BlueSky to easily integrate file system snapshots for backup purposes. In fact, so long as a cleaner is never run, any checkpoint record ever written to the cloud can be used to reconstruct the state of the file system at that point in time. Though not implemented in our prototype, the cleaner or a snapshot tool could record a list of checkpoints to retain and protect all required log segments from deletion. Those segments could also be archived elsewhere for safekeeping.

    4.5 Multi-Proxy Access

In the current BlueSky implementation only a single proxy can write to the file system, along with the cleaner, which can run in parallel. It would be desirable to have multiple proxies reading from and writing to the same BlueSky file system at the same time—either from a single site, to increase capacity and throughput, or from multiple sites, to optimize latency for geographically-distributed clients.

The support for multiple file system logs in BlueSky should make it easier to add support for multiple concurrent proxies. Two approaches are possible. Similar to Ivy [16], the proxies could be unsynchronized, offering loose consistency guarantees and assuming that only a single site updates a file most of the time. When conflicting updates occur in the uncommon case, the system would present the user with multiple file versions to reconcile.

A second approach is to provide stronger consistency by serializing concurrent access to files from multiple proxies. This approach adds the complexity of some type of distributed lock manager to the system. Since cloud storage itself does not provide the necessary locking semantics, a lock manager would either need to run on a cloud compute node or on the proxies (ideally, distributed across the proxies for fault tolerance).

    Exploring either option remains future work.

    5 BlueSky Proxy

This section describes the design and implementation of the BlueSky proxy, including how it caches data in memory and on disk, manages its network connections to the cloud, and indirectly cooperates with the cleaner.

    5.1 Cache Management

The proxy uses its local disk storage to implement a write-back cache. The proxy logs file system write requests from clients (both data and metadata) to a journal on local disk, and ensures that data is safely on disk before telling clients that it is committed. Writes are sent to the cloud asynchronously. Physically, the journal is broken apart into sequentially-numbered files on disk (journal segments) of a few megabytes each.
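A minimal sketch of this journaling discipline, assuming a simple length-prefixed record framing and file naming (neither is BlueSky's actual on-disk format):

    # Sketch of the write-back journal: append the record, force it to disk,
    # and only then acknowledge the client. Framing and naming are ours.
    import os
    import struct

    class Journal:
        SEGMENT_BYTES = 4 * 1024 * 1024  # roll to a new file every few MB

        def __init__(self, directory: str):
            self.directory = directory
            self.seqnum = 0
            self.out = None
            self._roll()

        def _roll(self):
            if self.out:
                self.out.close()
            path = os.path.join(self.directory, f"journal-{self.seqnum:08d}")
            self.out = open(path, "ab")
            self.seqnum += 1

        def append(self, record: bytes) -> None:
            self.out.write(struct.pack("<I", len(record)) + record)
            self.out.flush()
            os.fsync(self.out.fileno())  # durable before the client sees success
            if self.out.tell() >= self.SEGMENT_BYTES:
                self._roll()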

This write-back caching does mean that in the event of a catastrophic failure of the proxy—if the proxy's storage is lost—some data may not yet have been written to the cloud and will be lost. If the local storage is intact, no data will be lost; the proxy will replay the changes recorded in the journal. Periodically, the proxy snapshots the file system state, collects new file system objects and any inode map updates into one or more log segments, and uploads those log segments to cloud storage. Our prototype proxy implementation does not currently perform deduplication, and we leave exploring the tradeoffs of such an optimization for future work.

There are tradeoffs in choosing how quickly to flush data to the cloud. Writing data to the cloud quickly minimizes the window for data loss. However, a longer timeout has advantages as well: it enables larger log segment sizes, and it allows overlapping writes to be combined. In the extreme case of short-lived temporary files, no data need be uploaded to the cloud at all. Currently the BlueSky proxy commits data as frequently as once every five seconds. BlueSky does not start writing a new checkpoint until the previous one completes, so under a heavy write load checkpoints may commit less frequently.

The proxy keeps a cache on disk to satisfy many read requests without going to the cloud; this cache consists of old journal segments and log segments downloaded from cloud storage. Journal and log segments are discarded from the cache using an LRU policy, except that journal segments not yet committed to the cloud are kept pinned in the cache. At most half of the disk cache can be pinned in this way. The proxy sends HTTP byte-range requests to decrease latency and cost when only part of a log segment is needed, and it stores partially-downloaded segments as sparse files in the cache.
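The eviction policy can be sketched as LRU with pinning, where pinned (not-yet-uploaded) journal data is capped at half the cache; the interfaces below are our assumptions.

    # Sketch of the cache eviction policy: plain LRU, except pinned segments
    # (journal data not yet uploaded) are never evicted and may occupy at most
    # half the cache. Interfaces are assumptions for illustration.
    from collections import OrderedDict

    class SegmentCache:
        def __init__(self, capacity_bytes: int):
            self.capacity = capacity_bytes
            self.lru = OrderedDict()  # segment name -> (size, pinned)
            self.used = 0
            self.pinned_bytes = 0

        def insert(self, name: str, size: int, pinned: bool = False) -> None:
            if pinned:
                # The proxy must throttle client writes rather than let
                # uncommitted journal data exceed half the cache.
                assert self.pinned_bytes + size <= self.capacity // 2
                self.pinned_bytes += size
            self.lru[name] = (size, pinned)
            self.used += size
            self._evict()

        def touch(self, name: str) -> None:
            self.lru.move_to_end(name)  # on a read hit, mark most-recently used

        def unpin(self, name: str) -> None:  # after the segment reaches the cloud
            size, _ = self.lru[name]
            self.lru[name] = (size, False)
            self.pinned_bytes -= size

        def _evict(self) -> None:
            for name in list(self.lru):
                if self.used <= self.capacity:
                    break
                size, pinned = self.lru[name]
                if not pinned:  # never evict pinned segments
                    del self.lru[name]
                    self.used -= size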

    5.2 Connection Management

The BlueSky storage backends reuse HTTP connections when sending and receiving data from the cloud; the CURL library handles the details of this connection pooling. Separate threads perform each upload or download. BlueSky limits uploads to no more than 32 segments concurrently, to limit contention among TCP sessions and to limit memory usage in the proxy (it buffers each segment entirely in memory before sending).
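The concurrency limit amounts to a counting semaphore around uploads. The prototype implements this in C with libcurl; the Python rendering below is ours.

    # Sketch of bounding upload concurrency with a semaphore, as the proxy
    # does (32 in-flight segments).
    import threading

    MAX_INFLIGHT = 32
    upload_slots = threading.Semaphore(MAX_INFLIGHT)

    def upload_segment(backend, key: str, segment: bytes) -> None:
        with upload_slots:              # blocks when 32 uploads are in flight
            backend.put(key, segment)   # segment fully buffered in memory

    def upload_async(backend, key: str, segment: bytes) -> threading.Thread:
        t = threading.Thread(target=upload_segment, args=(backend, key, segment))
        t.start()
        return t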

    5.3 Merging System State

As discussed in Section 4.3, the proxy and the cleaner operate independently of each other. When the cleaner runs, it starts from the most recent checkpoint written by the proxy.

  • 8/10/2019 vrable2

    7/14

merge_inode(ino_p, ino_c):
    if ino_p.id = ino_c.id:
        return ino_c  // No conflicting changes
    // Start with proxy version and merge cleaner changes
    ino_m ← ino_p; ino_m.id ← fresh_uuid(); updated ← false
    for i in [0 ... num_blocks(ino_p) − 1]:
        b_p ← ino_p.blocks[i]; b_c ← ino_c.blocks[i]
        if b_c.id = b_p.id and b_c.loc ≠ b_p.loc:
            // Relocated data by cleaner is current
            ino_m.blocks.append(b_c); updated ← true
        else:
            // Take proxy's version of data block
            ino_m.blocks.append(b_p)
    return (ino_m if updated else ino_p)

Figure 4: Pseudocode for the proxy algorithm that merges state for possibly divergent inodes. Subscripts p and c indicate state written by the proxy and cleaner, respectively; m is used for a candidate merged version.

The cleaner only ever accesses data relative to this file system snapshot, even if the proxy writes additional updates to the cloud. As a result, the proxy and cleaner each may make updates to the same objects (e.g., inodes) in the file system. Since reconciling the updates requires unencrypted access to the objects, the proxy assumes responsibility for merging file system state.

When the cleaner finishes execution, it writes an updated checkpoint record to its log; this checkpoint record identifies the snapshot on which the cleaning was based. When the proxy sees a new checkpoint record from the cleaner, it begins merging updates made by the cleaner with its own updates.

BlueSky does not currently support the general case of merging file system state from many writers; it only supports the special case of merging updates from a single proxy and cleaner. This case is straightforward since only the proxy makes logical changes to the file system and the cleaner merely relocates data. In the worst case, if the proxy has difficulty merging changes by the cleaner, it can simply discard the cleaner's changes.

The persistent UIDs for objects can optimize the check for whether merging is needed. If both the proxy and cleaner logs use the same UID for an object, the cleaner's version may be used. The UIDs will differ if the proxy has made any changes to the object, in which case the objects must be merged or the proxy's version used. For data blocks, the proxy's version is always used. For inodes, the proxy merges file data block-by-block according to the algorithm shown in Figure 4. The proxy can similarly use inode map objects directly if possible, or write merged maps if needed.

Figure 5 shows an example of concurrent updates by the cleaner and proxy. State (a) includes a file with four blocks, stored in two segments written by the proxy. At (b) the cleaner runs and relocates the data blocks.

Figure 5: Example of concurrent updates by cleaner and proxy, and the resulting merged state.

Concurrently, in (c) the proxy writes an update to the file, changing the contents of block 4. When the proxy merges state in (d), it accepts the relocated blocks 1–3 written by the cleaner but keeps the updated block 4. At this point, when the cleaner runs again it can garbage collect the two unused proxy segments.

    5.4 Implementation

Our BlueSky prototype is implemented primarily in C, with small amounts of C++ and Python. The core BlueSky library, which implements the file system but not any of the front-ends, consists of 8500 lines of code (including comments and whitespace). BlueSky uses GLib for data structures and utility functions, libgcrypt for cryptographic primitives, and libs3 and libcurl for interaction with Amazon S3 and Windows Azure.

Our NFS server consists of another 3000 lines of code, not counting code entirely generated by the rpcgen RPC protocol compiler. The CIFS server builds on top of Samba 4, adding approximately 1800 lines of code in a new backend. These interfaces do not fully implement all file system features such as security and permissions handling, but are sufficient to evaluate the performance of the system. The prototype in-cloud file system cleaner is implemented in just 650 lines of portable Python code and does not depend on the BlueSky core library.

    6 Evaluation

In this section we evaluate the BlueSky proxy prototype implementation. We explore performance from the proxy to the cloud, the effect of various design choices on both performance and cost, and how BlueSky performance varies as a function of its ability to cache client working sets for reads and absorb bursts of client writes.


    6.1 Experimental Setup

We ran experiments on Dell PowerEdge R200 servers with 2.13 GHz Intel Xeon X3210 (quad-core) processors, a 7200 RPM 80 GB SATA hard drive, and gigabit network connectivity (internal and to the Internet). One machine, with 4 GB of RAM, is used as a load generator. The second machine, with 8 GB of RAM and an additional 1.5 TB 7200 RPM disk drive, acts as a standard file server or a BlueSky proxy. Both servers run Debian testing; the load generator machine is a 32-bit install (required for SPECsfs) while the proxy machine uses a 64-bit operating system. For comparison purposes we also ran a few tests against a commercial NAS filer in production use by our group. We focused our efforts on two providers: Amazon's Simple Storage Service (S3) [1] and Windows Azure storage [14]. For Amazon S3, we looked at both the standard US region (East Coast) as well as S3's West Coast (Northern California) region.

We use the SPECsfs2008 [27] benchmark in many of our performance evaluations. SPECsfs can generate both NFSv3 and CIFS workloads patterned after real-world traces. In these experiments, SPECsfs subjects the server to increasing loads (measured in operations per second) while simultaneously increasing the size of the working set of files accessed. Our use of SPECsfs for research purposes does not follow all rules for fully-compliant benchmark results, but should allow for relative comparisons. System load on the load generator machine remains low, and the load generator is not the bottleneck.

In several of the benchmarks, the load generator machine mounts the BlueSky file system with the standard Linux NFS client. In Section 6.4, we use a synthetic load generator which directly generates NFS read requests (bypassing the kernel NFS client) for better control.

    6.2 Cloud Provider Bandwidth

To understand the performance bounds on any implementation and to guide our specific design, we measured the performance our proxy is able to achieve writing data to Amazon S3. Figure 6 shows that the BlueSky proxy has the potential to fully utilize its gigabit link to S3 if it uses large request sizes and parallel TCP connections. The graph shows the total rate at which the proxy could upload data to S3 for a variety of request sizes and numbers of parallel connections. Network round-trip time from the proxy to the standard S3 region, shown in the graph, is around 30 ms. We do not pipeline requests—we wait for confirmation for each object on a connection before sending another one—so each connection is mostly idle when uploading small objects. Larger objects better utilize the network, but objects of one to a few megabytes are sufficient to capture most gains. A single connection utilizes only a fraction of the total bandwidth, so to fully make use of the network we need multiple parallel TCP connections.

Figure 6: Measured aggregate upload performance to Amazon S3, as a function of the size of the objects uploaded (x-axis) and number of parallel connections made (various curves, from 1 to 64 threads). A gigabit network link is available. Full use of the link requires parallel uploads of large objects.

These measurements helped inform the choice of 4 MB log segments (Section 4.1) and a pool size of 32 connections (Section 5.2).
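A simple back-of-the-envelope model (ours, not from the paper) explains the shape of Figure 6: without pipelining, each connection idles one round trip per object, so per-connection throughput is size / (size/link_rate + RTT).

    # Back-of-the-envelope model of non-pipelined uploads (our illustration).
    def upload_mbps(object_bytes, connections, rtt_s=0.030, link_mbps=1000.0):
        transfer_s = object_bytes * 8 / (link_mbps * 1e6)
        per_conn_mbps = object_bytes * 8 / (transfer_s + rtt_s) / 1e6
        return min(connections * per_conn_mbps, link_mbps)

    # 4 KB objects on one connection crawl; 4 MB objects across 32 connections
    # approach the gigabit link, consistent with the measured curves.
    print(upload_mbps(4 * 1024, 1))          # ~1 Mbps
    print(upload_mbps(4 * 1024 * 1024, 32))  # ~1000 Mbps (link-limited)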

The S3 US-West data center is closer to our proxy location and has a correspondingly lower measured round-trip time of 12 ms. The round-trip time to Azure from our location was substantially higher, around 85 ms. Yet network bandwidth was not a bottleneck in either case, with the achievable bandwidth again approaching 1 Gbps. In most benchmarks, we use the Amazon US-West region as the default cloud storage service.

    6.3 Impact of Cloud Latency

To underscore the impact latency can have on file system performance, we first run a simple, time-honored benchmark: unpacking and compiling a kernel source tree. We measure the time for three steps: (1) extract the sources for Linux 2.6.37, which consist of roughly 400 MB in 35,000 files (a write-only workload); (2) checksum the contents of all files in the extracted sources (a read-only workload); (3) build an i386 kernel using the default configuration and the -j4 flag for up to four parallel compiles (a mixed read/write workload). For a range of comparisons, we repeat this experiment on a number of system configurations. In all cases with a remote file server, we flushed the client's cache by unmounting the file system in between steps.

Table 1 shows the timing results of the benchmark steps for the various system configurations. Recall that the network links client ↔ proxy and proxy ↔ S3 are both 1 Gbps—the only difference is latency (12 ms from the proxy to BlueSky/S3-West and 30 ms to BlueSky/S3-East). Using a network file system, even locally, adds considerably to the execution time of the benchmark compared to a local disk.


                          Unpack   Check   Compile
Local file system
  warm client cache         0:30    0:02      3:05
  cold client cache                 0:27
Local NFS server
  warm server cache        10:50    0:26      4:23
  cold server cache                 0:49
Commercial NAS filer
  warm cache                2:18    3:16      4:32
NFS server in EC2
  warm server cache        65:39   26:26     74:11
BlueSky/S3-West
  warm proxy cache          5:10    0:33      5:50
  cold proxy cache                 26:12      7:10
  full segment                      1:49      6:45
BlueSky/S3-East
  warm proxy cache          5:08    0:35      5:53
  cold proxy cache                 57:26      8:35
  full segment                      3:50      8:07

Table 1: Kernel compilation benchmark times for various file server configurations. Steps are (1) unpack sources, (2) checksum sources, (3) build kernel. Times are given in minutes:seconds. Cache flushing and prefetching are only relevant in steps (2) and (3).

However, running an NFS server in EC2 compared to running it locally increases execution times by a factor of 6–30, due to the high latency between the client and server and a workload with operations on many small files. In our experiments we use a local Linux NFS server as a baseline. Our commercial NAS filer does give better write performance than a Linux NFS server, likely due in part to better hardware and an NVRAM write cache. Enterprises replacing such filers with BlueSky on generic rack servers would therefore experience a drop in write performance.

The substantial impact latency can have on workload performance motivates the need for a proxy architecture. Since clients interact with the BlueSky proxy with low latency, BlueSky with a warm disk cache is able to achieve performance similar to a local NFS server. (In this case, BlueSky performs slightly better than NFS because its log-structured design is better-optimized for some write-heavy workloads; however, we consider this difference incidental.) With a cold cache, it has to read small files from S3, incurring the latency penalty of reading from the cloud. Ancillary prefetching from fetching full 4 MB log segments when a client requests data in any part of the segment greatly improves performance, in part because this particular benchmark has substantial locality; later on we will see that, in workloads with little locality, full segment fetches hurt performance. However, execution times are still multiples of BlueSky with a warm cache. The differences in latencies between S3-West and S3-East for the cold cache and full segment cases again underscore the sensitivity to cloud latency.

Figure 7: Read latency as a function of working set captured by the proxy. (Plot: read latency in ms vs. proxy cache size as a percentage of the working set, for 32 KB, 128 KB, and 1024 KB requests in a single-client request stream.) Results are from a single run.

In summary, greatly masking the high latency to cloud storage—even with high-bandwidth connectivity to the storage service—requires a local proxy to minimize latency to clients, while fully masking high cloud latency further requires an effective proxy cache.

    6.4 Caching the Working Set

The BlueSky proxy can mask the high latency overhead of accessing data on a cloud service by caching data close to clients. For what kinds of file systems can such a proxy be an effective cache? Ideally, the proxy needs to cache the working set across all clients using the file system to maximize the number of requests that the proxy can satisfy locally. Although a number of factors can make generalizing difficult, previous studies have estimated that clients of a shared network file system typically have a combined working set that is roughly 10% of the entire file system in a day, and less at smaller time scales [24, 31]. For BlueSky to provide acceptable performance, it must have the capacity to hold this working set. As a rough back-of-the-envelope using this conservative daily estimate, a proxy with one commodity 3 TB disk of local storage could capture the daily working set for a 30 TB file system, and five such disks raise the file system size to 150 TB. Many enterprise storage needs fall well within this envelope, so a BlueSky proxy can comfortably capture working sets for such scenarios.
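The arithmetic behind this estimate, stated directly (a 10% daily working set implies the cache must be roughly one tenth of the file system size):

    # The back-of-the-envelope above as arithmetic (our restatement).
    WORKING_SET_FRACTION = 0.10

    def max_fs_size_tb(cache_tb):
        # Largest file system whose daily working set fits in the given cache.
        return cache_tb / WORKING_SET_FRACTION

    print(max_fs_size_tb(3))        # one 3 TB disk   -> 30.0 TB
    print(max_fs_size_tb(5 * 3))    # five 3 TB disks -> 150.0 TB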

In practice, of course, workloads are dynamic. Even if proxy cache capacity is not an issue, clients shift their workloads over time, and some fraction of the client workload cannot be satisfied by the cache. To evaluate these cases, we use synthetic read and write workloads, and do so separately because they interact with the cache in different ways.

We start with read workloads. Reads that hit in the cache achieve local performance, while reads that miss in the cache incur the full latency of accessing data in the cloud, stalling the clients accessing the data. The ratio of read hits and misses in the workload determines overall read performance, and fundamentally depends on how well the cache capacity is able to capture the file system working set across all clients in steady state.

We populate a BlueSky file system on S3 with 32 GB of data using 16 MB files.¹ We then generate a steady stream of fixed-size NFS read requests to random files through the BlueSky proxy. We vary the size of the proxy disk cache to represent different working set scenarios. In the best case, the capacity of the proxy cache is large enough to hold the entire working set: all read requests hit in the cache in steady state, minimizing latency. In the worst case, the cache capacity is zero, no part of the working set fits in the cache, and all requests go to the cloud service. In practice, a real workload falls in between these extremes. Since we make uniform random requests to any of the files, the working set is equivalent to the size of the entire file system.

Figure 7 shows that BlueSky with S3 provides good latency even when it is able to cache only 50% of the working set: with a local NFS latency of 21 ms for 32 KB requests, BlueSky is able to keep latency within 2× that value. Given that cache capacity is not an issue, this situation corresponds to clients dramatically changing the data they are accessing such that 50% of their requests are to new data objects not cached at the proxy. Larger requests take better advantage of bandwidth: 1024 KB requests are 32× larger than the 32 KB requests, but have latencies only 4× longer.

    6.5 Absorbing Writes

The BlueSky proxy represents a classic write-back cache scenario in the context of a cache for a wide-area storage backend. In contrast to reads, the BlueSky proxy can absorb bursts of write traffic entirely with local performance since it implements a write-back cache. Two factors determine the proxy's ability to absorb write bursts: the capacity of the cache, which determines the instantaneous size of a burst the proxy can absorb; and the network bandwidth between the proxy and the cloud service, which determines the rate at which the proxy can drain the cache by writing back data. As long as the write workload from clients falls within these constraints, the BlueSky proxy can entirely mask the high latency to the cloud service for writes. However, if clients instantaneously burst more data than can fit in the cache, or if the steady-state write workload is higher than the bandwidth to the cloud, client writes start to experience delays that depend on the performance of the cloud service.

¹For this and other experiments, we use relatively small file system sizes to keep the time for performing experiments manageable.

Figure 8: Write latencies when the proxy is uploading over a constrained (100 Mbps) uplink to S3, as a function of the write rate of the client and the size of the write cache used to temporarily absorb writes. (Plot: average write latency in ms per 1 MB write vs. client write rate in MB/s during a 2-minute burst, for 128 MB and 1 GB write buffers.)

We populate a BlueSky file system on S3 with 1 MB files and generate a steady stream of fixed-size 1 MB NFS write requests to random files in the file system. The client bursts writes at different rates for two minutes and then stops. So that we can overload the network between the BlueSky proxy and S3, we rate-limit traffic to S3 at 100 Mbps while keeping the client ↔ proxy link unlimited at 1 Gbps. We start with a rate of write requests well below the traffic limit to S3, and then steadily increase the rate until the offered load is well above the limit.

Figure 8 shows the average latency of the 1 MB write requests as a function of offered load, with error bars showing the standard deviation across three runs. At low write rates the latency is determined by the time to commit writes to the proxy's disk. The proxy can upload at up to about 12 MB/s to the cloud (due to the rate limiting), so beyond this point latency increases as the proxy must throttle writes by the client when the write buffer fills. With a 1 GB write-back cache the proxy can temporarily sustain write rates beyond the upload capacity. Over a 10 Mbps network (not shown), the write cache fills at correspondingly smaller client rates and latencies similarly increase quickly.

    6.6 More Elaborate Workloads

Using the SPECsfs2008 benchmark, we next examine the performance of BlueSky under more elaborate workload scenarios, both to subject BlueSky to more interesting workload mixes and to highlight the impact of different design decisions in BlueSky. We evaluate a number of different system configurations, including a native Linux nfsd in the local network (Local NFS) as well as BlueSky communicating with both Amazon S3's US-West region and Windows Azure's blob store. Unless otherwise noted, BlueSky evaluation results are for communication with Amazon S3. In addition to the base


Figure 10: Comparison of various file server configurations subjected to the SPECsfs benchmark, with a high degree of parallelism (16 client processes). Most tests have cryptography disabled, but the +crypto test re-enables it. (Plots: achieved operations per second, and operation latency in ms, vs. requested operations per second and working set size in GB, for Local NFS, BlueSky, BlueSky (crypto), BlueSky (noseg), BlueSky (norange), and BlueSky (100 Mbps).)

                 Down     Op   Total    (Up)
Baseline        $0.18  $0.09   $0.27   $0.56
4 KB blocks      0.09   0.07    0.16    0.47
Full segments   25.11   0.09   25.20    1.00
No segments      0.17   2.91    3.08    0.56

Table 2: Cost breakdown and comparison of various BlueSky configurations for using cloud storage. Costs are normalized to the cost per one million NFS operations in SPECsfs. Breakdowns include traffic costs for uploading data to S3 (Up), downloading data (Down), operation costs (Op), and their sum (Total). Amazon eliminated Up costs in mid-2011, but values using the old price are still shown for comparison.

6.7 Monetary Cost

Offloading file service to the cloud introduces monetary cost as another dimension for optimization. Figure 9 showed the relative performance of different variants of BlueSky using data from the low-parallelism SPECsfs benchmark runs. Table 2 shows the cost breakdown of each of the variants, normalized per SPECsfs operation (since the benchmark self-scales, different experiments have different numbers of operations). We use the September 2011 prices (in US dollars) from Amazon S3 as the basis for the cost analysis: $0.14/GB stored per month, $0.12/GB transferred out, and $0.01 per 10,000 get or 1,000 put operations. S3 also offers cheaper price tiers for higher use, but we use the base prices as a worst case. Overall prices are similar for other providers.
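The quoted price schedule can be encoded directly; the estimator below is ours, using base-tier prices only.

    # The September 2011 S3 price schedule quoted above, as a simple estimator.
    def monthly_cost_usd(stored_gb, downloaded_gb, gets, puts):
        return (0.14 * stored_gb          # storage, $/GB-month
                + 0.12 * downloaded_gb    # data transfer out
                + 0.01 * gets / 10_000    # GET requests
                + 0.01 * puts / 1_000)    # PUT requests

    # Example of why segments matter: writing 1 GB as 4 MB segments takes 256
    # PUTs (~$0.0026 in request fees); as individual 32 KB objects it takes
    # 32,768 PUTs (~$0.33), over 100x more in request fees.
    print(monthly_cost_usd(0, 0, 0, 256))
    print(monthly_cost_usd(0, 0, 0, 32_768))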

Unlike performance, Table 2 shows that comparing by cost changes the relative ordering of the different system variants. Using 4 KB blocks had very poor performance, but using them has the lowest cost since they effectively transfer only data that clients request. The BlueSky baseline uses 32 KB blocks, requiring more data transfers and higher costs overall. If a client makes a 4 KB request, the proxy will download the full 32 KB block; many times downloading the full block will satisfy future client requests with spatial locality, but not always. Finally, the range request optimization is essential in reducing cost: when the proxy downloads an entire 4 MB segment whenever a client requests any data in it, the cost for downloading data increases by 150×. If providers did not support range requests, BlueSky would have to use smaller segments in its file system layout.

Although 4 KB blocks have the lowest cost, we argue that using 32 KB blocks has the best cost-performance tradeoff. The costs with 32 KB blocks are higher, but the performance of 4 KB blocks is far too low for a system that relies upon wide-area transfers.

    6.8 Cleaning

As with other file systems that do not overwrite in place, BlueSky must clean the file system to garbage collect overwritten data—although less to recover critical storage space, and more to save on the cost of storing unnecessary data at the cloud service. Recall that we designed the BlueSky cleaner to operate in one of two locations: running on the BlueSky proxy or on a compute instance in the cloud service. Cleaning in the cloud has compelling advantages: it is faster, does not consume proxy network bandwidth, and is cheaper since cloud services like S3 and Azure do not charge for local network traffic.

The overhead of cleaning fundamentally depends on the workload. The amount of data that needs to be read and written back depends on the rate at which existing data is overwritten and the fraction of live data in cleaned segments, and the time it takes to clean depends on both. Rather than hypothesize a range of workloads, we describe the results of a simple experiment to detail how the cleaner operates.

We populate a small BlueSky file system with 64 MB of data, split across 8 files. A client randomly writes, every few seconds, to a small portion (0.5 MB) of one of these files.


Figure 11: Storage space consumed during a write experiment running concurrently with the cleaner. (Plot: cloud storage consumed in MB across 14 cleaner passes, broken down into reclaimed, wasted, rewritten, and used/unaltered space.)

Over the course of the experiment the client overwrites 64 MB of data. In parallel, a cleaner runs to recover storage space and defragment file contents; the cleaner runs every 30 seconds, after the proxy incorporates changes made by the previous cleaner run. In addition to providing data about cleaner performance, this experiment validates the design that allows for safe concurrent execution of both the proxy and cleaner.

Figure 11 shows the storage consumed during this cleaner experiment; each set of stacked bars shows storage after a pass by the cleaner. At any point in time, only 64 MB of data is live in the file system, some of which (bottom dark bar) consists of data left alone by the cleaner and some of which (lighter gray bar) was rewritten by the cleaner. Some wasted space (lightest gray) cannot be immediately reclaimed; this space is either in mixed useful-data/garbage segments, or is data whose relocation the proxy has yet to acknowledge. However, the cleaner deletes segments which it can establish the proxy no longer needs (white) to reclaim storage.

This workload causes the cleaner to write large amounts of data, because a small write to a file can cause the entire file to be rewritten to defragment the contents. Over the course of the experiment, even though the client only writes 64 MB of data, the cleaner writes out an additional 224 MB of data. However, all these additional writes happen within the cloud, where data transfers are free. The extra activity at the proxy, to merge updates written by the cleaner, adds only 750 KB in writes and 270 KB in reads.

Despite all the data being written out, the cleaner is able to reclaim space during experiment execution to keep the total space consumption bounded, and when the client write activity finishes at the end of the experiment the cleaner can repack the segment data to eliminate all remaining wasted space.

    6.9 Client Protocols: NFS and CIFS

Finally, we use the SPECsfs benchmark to confirm that the performance of the BlueSky proxy is independent of the client protocol (NFS or CIFS) that clients use.

Figure 12: Latencies for read operations in SPECsfs as a function of aggregate operations per second (for all operations) and working set size. (Plot: operation latency in ms vs. requested operations per second and working set size in GB, for Native NFS, BlueSky NFS, Samba (CIFS), and BlueSky CIFS.)

    the client protocol (NFS or CFS) that clients use. Theexperiments performed above use NFS for convenience,

    but the results hold for clients using CIFS as well.Figure 12 shows the latency of the read operations in

    the benchmark as a function of aggregate operations persecond (for all operations) and working set size. BecauseSPECsfs uses different operation mixes for its NFS andCIFS workloads, we focus on the latency of just the readoperations for a common point of comparison. We showresults for NFS and CIFS on the BlueSky proxy (Sec-tion 5.4) as well as standard implementations of both pro-tocols (Linux NFS and Samba for CIFS, on which ourimplementation is based). For the BlueSky proxy andstandard implementations, the performance of NFS andCIFS are broadly similar as the benchmark scales, andBlueSky mirrors any differences in the underlying stan-dard implementations. Since SPECsfs uses a workingset much larger than the BlueSky proxy cache capacityin this experiment, BlueSky has noticeably higher laten-cies than the standard implementations due to having toread data from cloud storage rather than local disk.

    7 Conclusion

The promise of the cloud is that computation and storage will one day be seamlessly outsourced on an on-demand basis to massive data centers distributed around the globe, while individual clients will effectively become transient access portals. This model of the future (ironically similar to the old big-iron mainframe model) may come to pass at some point, but today there are many hundreds of billions of dollars invested in the last disruptive computing model: client/server. Thus, in the interstitial years between now and a potential future built around cloud infrastructure, there will be a need to bridge the gap from one regime to the other.

In this paper, we have explored a solution to one such challenge: network file systems. Using a caching proxy architecture, we demonstrate that LAN-oriented workstation file system clients can be transparently served by cloud-based storage services with good performance for enterprise workloads. However, we show that exploiting the benefits of this arrangement requires that design choices (even low-level choices such as storage layout) are directly and carefully informed by the pricing models exported by cloud providers (this coupling ultimately favoring a log-structured layout with in-cloud cleaning).

    8 Acknowledgments

We would like to thank our shepherd Ted Wong and the anonymous reviewers for their insightful feedback, and Brian Kantor and Cindy Moore for research computing support. This work was supported in part by the UCSD Center for Networked Systems. Vrable was further supported in part by a National Science Foundation Graduate Research Fellowship.

References

[1] Amazon Web Services. Amazon Simple Storage Service. http://aws.amazon.com/s3/.
[2] A. Bessani, M. Correia, B. Quaresma, F. André, and P. Sousa. DepSky: Dependable and Secure Storage in a Cloud-of-Clouds. In EuroSys 2011, Apr. 2011.
[3] Y. Chen and R. Sion. To Cloud Or Not To Cloud? Musings On Costs and Viability. http://www.cs.sunysb.edu/~sion/research/cloudc2010-draft.pdf.
[4] Cirtas. Cirtas Bluejet Cloud Storage Controllers. http://www.cirtas.com/.
[5] Enomaly. ElasticDrive Distributed Remote Storage System. http://www.elasticdrive.com/.
[6] I. Heizer, P. Leach, and D. Perry. Common Internet File System Protocol (CIFS/1.0). http://tools.ietf.org/html/draft-heizer-cifs-v1-spec-00.
[7] D. Hitz, J. Lau, and M. Malcolm. File System Design for an NFS File Server Appliance. In Proceedings of the Winter USENIX Technical Conference, 1994.
[8] J. Howard, M. Kazar, S. Nichols, D. Nichols, M. Satyanarayanan, R. Sidebotham, and M. West. Scale and Performance in a Distributed File System. ACM Transactions on Computer Systems (TOCS), 6(1):51-81, Feb. 1988.
[9] IDC. Global market pulse. http://i.dell.com/sites/content/business/smb/sb360/en/Documents/0910-us-catalyst-2.pdf.
[10] Jungle Disk. http://www.jungledisk.com/.
[11] R. Kotla, L. Alvisi, and M. Dahlin. SafeStore: A Durable and Practical Storage System. In Proceedings of the 2007 USENIX Annual Technical Conference, June 2007.
[12] J. Li, M. Krohn, D. Mazières, and D. Shasha. Secure Untrusted Data Repository (SUNDR). In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), Dec. 2004.
[13] P. Mahajan, S. Setty, S. Lee, A. Clement, L. Alvisi, M. Dahlin, and M. Walfish. Depot: Cloud Storage with Minimal Trust. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI), Oct. 2010.
[14] Microsoft. Windows Azure. http://www.microsoft.com/windowsazure/.
[15] Mozy. http://mozy.com/.
[16] A. Muthitacharoen, R. Morris, T. M. Gil, and B. Chen. Ivy: A Read/Write Peer-to-Peer File System. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI), Dec. 2002.
[17] Nasuni. Nasuni: The Gateway to Cloud Storage. http://www.nasuni.com/.
[18] Panzura. Panzura. http://www.panzura.com/.
[19] R. Pike, D. Presotto, S. Dorward, B. Flandrena, K. Thompson, H. Trickey, and P. Winterbottom. Plan 9 From Bell Labs. USENIX Computing Systems, 8(3):221-254, Summer 1995.
[20] S. Quinlan and S. Dorward. Venti: A New Approach to Archival Storage. In Proceedings of the 1st USENIX Conference on File and Storage Technologies (FAST), 2002.
[21] Rackspace. Rackspace Cloud. http://www.rackspacecloud.com/.
[22] R. Rizun. s3fs: FUSE-based file system backed by Amazon S3. http://code.google.com/p/s3fs/wiki/FuseOverAmazon.
[23] M. Rosenblum and J. K. Ousterhout. The Design and Implementation of a Log-Structured File System. ACM Transactions on Computer Systems (TOCS), 10(1):26-52, 1992.
[24] C. Ruemmler and J. Wilkes. A Trace-Driven Analysis of Disk Working Set Sizes. Technical Report HPL-OSR-93-23, HP Labs, Apr. 1993.
[25] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon. Design and Implementation of the Sun Network Filesystem. In Proceedings of the Summer USENIX Technical Conference, pages 119-130, 1985.
[26] J. Satran, K. Meth, C. Sapuntzakis, M. Chadalapaka, and E. Zeidner. Internet Small Computer Systems Interface (iSCSI), Apr. 2004. RFC 3720, http://tools.ietf.org/html/rfc3720.
[27] Standard Performance Evaluation Corporation. SPECsfs2008. http://www.spec.org/sfs2008/.
[28] StorSimple. StorSimple. http://www.storsimple.com/.
[29] TwinStrata. TwinStrata. http://www.twinstrata.com/.
[30] M. Vrable, S. Savage, and G. M. Voelker. Cumulus: Filesystem Backup to the Cloud. In Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST), Feb. 2009.
[31] T. M. Wong and J. Wilkes. My Cache or Yours? Making Storage More Exclusive. In Proceedings of the 2002 USENIX Annual Technical Conference, June 2002.
[32] N. Zhu, J. Chen, and T.-C. Chiueh. TBBT: Scalable and Accurate Trace Replay for File Server Evaluation. In Proceedings of the 4th USENIX Conference on File and Storage Technologies (FAST), Dec. 2005.
