
Distributed File Systems



Page 1: Distributed File Systems

03-10-09. Some slides are taken from Professor Grimshaw, Ranveer Chandra, Krasimira Kapitnova, and others

Page 2: Distributed File Systems

3rd year graduate student working with Professor Grimshaw

Interests lie in Operating Systems, Distributed Systems, and more recently Cloud Computing

Also: trumpet, sporty things, hardware junkie. I like tacos … a lot


Page 3: Distributed File Systems

File system refresher

Basic issues
  Naming / transparency
  Caching
  Coherence
  Security
  Performance

Case studies
  NFS v3 and v4
  Lustre
  AFS 2.0

Page 4: Distributed File Systems

What is a file system?

Why have a file system?


Mmmm, refreshing File Systems

Page 5: Distributed File Systems

Must have
  Name, e.g. “/home/sosa/DFSSlides.ppt”
  Data – some structured sequence of bytes

Tend to also have
  Size
  Protection information
  Non-symbolic identifier
  Location
  Times, etc.
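As a rough illustration, a minimal sketch of such a per-file record; the field names are made up for this example and not taken from any particular file system:

    from dataclasses import dataclass

    @dataclass
    class FileRecord:
        name: str        # symbolic name, e.g. "/home/sosa/DFSSlides.ppt"
        data: bytes      # the structured sequence of bytes
        size: int        # length in bytes
        mode: int        # protection information (e.g. POSIX permission bits)
        file_id: int     # non-symbolic identifier (like an i-node number)
        location: str    # which disk or server holds the data
        mtime: float     # last-modified time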

Page 6: Distributed File Systems

A container abstraction to help organize files
Generally hierarchical (tree) structure
Often a special type of file
Directories have
  A name
  Files and directories (if hierarchical) within them

A large container for tourists

Page 7: Distributed File Systems

Two approaches to sharing files

Copy-based
  Application explicitly copies files between machines
  Examples: UUCP, FTP, GridFTP, {.*}FTP, rcp, scp, etc.

Access transparency – i.e. Distributed File Systems

Sharing is caring

Page 8: Distributed File Systems

Basic idea
  Find a copy
    Naming is based on machine name of source (viper.cs.virginia.edu), user id, and path
  Transfer the file to the local file system
    scp user@viper.cs.virginia.edu:fred.txt .
  Read/write
  Copy back if modified

Pros and cons?

Page 9: Distributed File Systems

Pros
  Semantics are clear
  No OS/library modification

Cons?
  Have to deal with the copy model explicitly
  Have to copy the whole file
  Inconsistent copies all over the place
  Others?

Page 10: Distributed File Systems

Mechanism to access remote files the same as local ones (i.e. through the file system hierarchy)

Why is this better?

… enter Distributed File Systems


Page 11: Distributed File Systems

A Distributed File System is a file system that may have files on more than one machine

Distributed File Systems take many forms
  Network file systems
  Parallel file systems
  Access-transparent distributed file systems

Why distribute?

Page 12: Distributed File Systems

Sharing files with other users
  Others can access your files
  You can have access to files you wouldn’t regularly have access to
Keeping files available for yourself on more than one computer
  Small amount of local resources
  High failure rate of local resources
Can eliminate version problems (same file copied around with local edits)

Page 13: Distributed File Systems

Naming
Performance
Caching
Consistency semantics
Fault tolerance
Scalability

Page 14: Distributed File Systems

What does a DFS look like to the user?
  Mount-like protocol, e.g. /../mntPointToBobsSharedFolder/file.txt
  Unified namespace: everything looks like it’s on the same namespace

Pros and cons?

Page 15: Distributed File Systems

Location transparency
  Name does not hint at physical location
  Mount points are not transparent

Location independence
  File name does not need to be changed when the file’s physical storage location changes

Independence without transparency?

Page 16: Distributed File Systems

Generally we trade off the benefits of DFS’s against some performance hit
  How much depends on workload
  Always look at the workload to figure out what mechanisms to use

What are some ways to improve performance?

Page 17: Distributed File Systems

The single architectural feature that contributes most to performance in a DFS!!!

Also the single greatest cause of heartache for programmers of DFS’s
  Maintaining consistency semantics is more difficult
  Has a large potential impact on scalability

Page 18: Distributed File Systems

Size of the cached units of data
  Larger sizes make more efficient use of the network – spatial locality, latency
  Whole files simplify semantics, but very large files can’t be stored locally
  Small files?

Who does what: push vs. pull
  Important for maintaining consistency

Page 19: Distributed File Systems

Different DFS’s have different consistency semantics
  UNIX semantics
  On-close semantics
  Timeout semantics (at most x seconds out of date)

Pros / cons?
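A minimal sketch of timeout semantics, assuming the client can ask the server for a change indicator via a hypothetical getattr() call; all names here are illustrative:

    import time

    ATTR_TIMEOUT = 3.0   # seconds a cached entry may be used without revalidation

    class CacheEntry:
        def __init__(self, data, change_id):
            self.data = data
            self.change_id = change_id      # server-side change indicator
            self.validated_at = time.time()

    def read_cached(cache, path, server):
        entry = cache.get(path)
        if entry and time.time() - entry.validated_at < ATTR_TIMEOUT:
            return entry.data               # inside the window: at most ATTR_TIMEOUT stale
        attrs = server.getattr(path)        # window expired: revalidate with the server
        if entry and attrs.change_id == entry.change_id:
            entry.validated_at = time.time()
            return entry.data               # unchanged on the server: keep the copy
        data = server.read_all(path)        # changed or missing: refetch
        cache[path] = CacheEntry(data, attrs.change_id)
        return data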

Page 20: Distributed File Systems

Can replicate for
  Fault tolerance
  Performance

Replication is inherently location-opaque, i.e. we need location independence in naming

Different replication mechanisms give different consistency semantics
  Tradeoffs, tradeoffs, tradeoffs

Page 21: Distributed File Systems

Mount-based DFS
  NFS version 3
  Others include SMB, CIFS, NFS version 4

Parallel DFS
  Lustre
  Others include HDFS, Google File System, etc.

Non-parallel unified-namespace DFS’s
  Sprite
  AFS version 2.0 (basis for many other DFS’s)
  Coda
  AFS 3.0

Page 22: Distributed File Systems


Page 23: Distributed File Systems

Most commonly used DFS ever!

Goals
  Machine & OS independent
  Crash recovery
  Transparent access
  “Reasonable” performance

Design
  All machines can be both clients and servers
  RPC (on top of UDP in v.1; v.2+ also on TCP)
  Open Network Computing Remote Procedure Call (ONC RPC)
  External Data Representation (XDR)
  Stateless protocol
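To make XDR concrete, here is a hand-rolled sketch of its two most common encodings (4-byte big-endian integers and padded opaque data); a real implementation would use an XDR library, and the READ-style payload at the end is only illustrative:

    import struct

    def xdr_uint(n):
        # XDR encodes unsigned integers as 4 big-endian bytes
        return struct.pack(">I", n)

    def xdr_opaque(data):
        # variable-length opaque data: length word, bytes, zero-padded to a 4-byte boundary
        pad = (4 - len(data) % 4) % 4
        return xdr_uint(len(data)) + data + b"\x00" * pad

    # e.g. encoding (file handle, offset, count) arguments of a READ-like call
    payload = xdr_opaque(b"filehandle") + xdr_uint(0) + xdr_uint(4096)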

Page 24: Distributed File Systems


Page 25: Distributed File Systems


Client sends a path name to the server with a request to mount

If the path is legal and exported, the server returns a file handle
  Contains FS type, disk, i-node number of the directory, security info
  Subsequent accesses use the file handle

Mount can happen either at boot or via automount
  Automount: directories are mounted on use. Why is this helpful?

Mount only affects the client’s view
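A toy sketch of that handshake; the message shapes and names here are hypothetical, not the real NFS MOUNT protocol:

    class MountServer:
        def __init__(self, exports):
            self.exports = exports          # exported path -> (fs_type, disk, inode, security)

        def mount(self, path):
            if path not in self.exports:
                raise PermissionError("path not legal or not exported")
            fs_type, disk, inode, security = self.exports[path]
            # an opaque handle the client presents on every subsequent access
            return ("fh", fs_type, disk, inode, security)

    server = MountServer({"/export/home": ("ufs", 0, 2, "sys")})
    file_handle = server.mount("/export/home")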

Page 26: Distributed File Systems

Mounting (part of) a remote file system in NFS.


Page 27: Distributed File Systems

Mounting nested directories from multiple servers in NFS.


Page 28: Distributed File Systems


Supports directory and file access via remote procedure calls (RPCs)

All UNIX system calls are supported other than open & close
  Open and close are intentionally not supported

For a read, the client sends a lookup message to the server
  Lookup returns a file handle but does not copy info into internal system tables
  Subsequently, each read contains the file handle, offset, and number of bytes

Each message is self-contained – flexible, but?
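A sketch of why statelessness works: every READ carries the file handle, offset, and count, so the server needs no per-client open-file table (names are illustrative):

    class StatelessServer:
        def __init__(self):
            self.files = {}                  # file handle -> file contents

        def lookup(self, path):
            # hands back a handle; records nothing about the caller
            return ("fh", path)

        def read(self, handle, offset, count):
            # self-contained request: everything needed is in the arguments
            return self.files[handle][offset:offset + count]

    srv = StatelessServer()
    srv.files[("fh", "/export/fred.txt")] = b"hello, nfs"
    fh = srv.lookup("/export/fred.txt")
    print(srv.read(fh, 0, 5))                # b'hello'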

Page 29: Distributed File Systems

a) Reading data from a file in NFS version 3. b) Reading data using a compound procedure in version 4.


Page 30: Distributed File Systems

Some general mandatory file attributes in NFS.

Attribute   Description
TYPE        The type of the file (regular, directory, symbolic link)
SIZE        The length of the file in bytes
CHANGE      Indicator for a client to see if and/or when the file has changed
FSID        Server-unique identifier of the file's file system

Page 31: Distributed File Systems

Some general recommended file attributes.

Attribute      Description
ACL            An access control list associated with the file
FILEHANDLE     The server-provided file handle of this file
FILEID         A file-system unique identifier for this file
FS_LOCATIONS   Locations in the network where this file system may be found
OWNER          The character-string name of the file's owner
TIME_ACCESS    Time when the file data were last accessed
TIME_MODIFY    Time when the file data were last modified
TIME_CREATE    Time when the file was created

Page 32: Distributed File Systems

All communication is done in the clear
  Client sends the userid and group id of the request to the NFS server

Discuss

Page 33: Distributed File Systems

Consistency semantics are dirty
  Non-dirty items are checked every 5 seconds
  Things marked dirty are flushed within 30 seconds
Performance under load is horrible. Why?
Cross-mount hell – paths to files differ on different machines
ID mismatch between domains

Page 34: Distributed File Systems

Goals
  Improved access and good performance on the Internet
  Better scalability
  Strong security
  Cross-platform interoperability and ease of extension

Page 35: Distributed File Systems

Stateful protocol (open + close)
Compound operations (fully utilize bandwidth)
Lease-based locks (locking built in)
“Delegation” to clients (less work for the server)
Close-open cache consistency (timeouts still used for attributes and directories)
Better security

Page 36: Distributed File Systems

Borrowed model from CIFS (Common Internet File System); see MS

Open/close
  Opens do lookup, create, and lock all in one (what a deal)!
  Locks / delegations (explained later) are released on file close
  There is always a notion of a “current file handle”, cf. pwd

Page 37: Distributed File Systems

Problem: normal file-system semantics require too many RPCs (boo)

Solution: group many calls into one call (yay)

Semantics
  Operations run sequentially
  Fails on the first failure
  Returns the status of each individual RPC in the compound response (up to the failure, or all on success)

Compound Kitty
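To make the semantics concrete, a small sketch of how a server might evaluate a compound: operations run in order, evaluation stops at the first failure, and a status is returned for each operation attempted (illustrative, not wire-accurate):

    def run_compound(ops):
        # ops: list of zero-argument callables standing in for NFS operations
        results = []
        for op in ops:
            try:
                results.append(("OK", op()))
            except Exception as err:
                results.append(("ERR", err))
                break                        # fail-fast: later ops never run
        return results

    # e.g. a LOOKUP + OPEN + READ issued as a single round trip
    print(run_compound([
        lambda: "lookup fred.txt",
        lambda: "open fred.txt",
        lambda: b"file contents"[:5],
    ]))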

Page 38: Distributed File Systems

Both byte-range and whole-file locks
Heartbeats keep locks alive (renew the lease)
If the server fails, it waits at least the agreed-upon lease time (a constant) before accepting any other lock requests
If a client fails, its locks are released by the server at the end of the lease period
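A minimal sketch of lease-based locking on the server side, assuming the client heartbeats by calling renew(); names are illustrative:

    import time

    LEASE = 30.0                             # agreed-upon lease time in seconds

    class LockTable:
        def __init__(self):
            self.locks = {}                  # resource -> (owner, expiry)

        def acquire(self, resource, owner):
            holder = self.locks.get(resource)
            if holder and holder[1] > time.time():
                return False                 # still held by a live lease
            self.locks[resource] = (owner, time.time() + LEASE)
            return True

        def renew(self, resource, owner):
            # the heartbeat: push the expiry forward
            if self.locks.get(resource, (None, 0))[0] == owner:
                self.locks[resource] = (owner, time.time() + LEASE)

    # a failed client simply stops renewing; its lock lapses after LEASE seconds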

Page 39: Distributed File Systems

Server tells the client that no one else has the file
Client exposes callbacks so the server can recall the delegation

Page 40: Distributed File Systems

Any open that happens after a close finishes is consistent with the data as of that last close
The last close wins the competition

Page 41: Distributed File Systems

Uses the GSS-API framework

All IDs are of the form user@domain or group@domain

Every implementation must support Kerberos v5

Every implementation must support LIPKEY

Meow

Page 42: Distributed File Systems

Replication / migration mechanism added
  Special error messages indicate migration
  A special attribute, for both replication and migration, gives the other / new location
May have read-only replicas

Page 43: Distributed File Systems


Page 44: Distributed File Systems

People don’t like to move
Requires Kerberos (the death of many good distributed file systems)
Looks just like v3 to the end user, and v3 is good enough

Page 45: Distributed File Systems


Page 46: Distributed File Systems

Need a file system for large clusters with the following attributes
  Highly scalable: > 10,000 nodes
  Provides petabytes of storage
  High throughput (100 GB/sec)

Datacenters have different needs, so we need a general-purpose back-end file system

Page 47: Distributed File Systems

Open-source object-based cluster file system

Fully POSIX compliant

Features (i.e. what I will discuss)
  Object protocols
  Intent-based locking
  Adaptive locking policies
  Aggressive caching

Page 48: Distributed File Systems


Page 49: Distributed File Systems


Page 50: Distributed File Systems


Page 51: Distributed File Systems


Page 52: Distributed File Systems


Page 53: Distributed File Systems

Policy depends on context

Mode 1: performing operations on something mostly used by a single client (e.g. /home/username)

Mode 2: performing operations on a highly contended resource (e.g. /tmp)

The DLM is capable of granting locks on an entire subtree as well as on whole files
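A sketch of the policy idea: grant a coarse subtree lock on a path one client mostly has to itself, and fall back to per-file locks on contended resources (purely illustrative; not Lustre's actual DLM interface):

    CONTENDED = ("/tmp",)                    # resources known to be fought over

    def choose_lock(path):
        if any(path.startswith(p) for p in CONTENDED):
            return ("file-lock", path)       # Mode 2: lock just this one file
        parts = path.lstrip("/").split("/")
        subtree = "/" + "/".join(parts[:2])  # e.g. /home/username
        return ("subtree-lock", subtree)     # Mode 1: one lock covers the subtree

    print(choose_lock("/home/sosa/notes.txt"))   # ('subtree-lock', '/home/sosa')
    print(choose_lock("/tmp/scratch.dat"))       # ('file-lock', '/tmp/scratch.dat')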

Page 54: Distributed File Systems

POSIX compliant
Keeps a local journal of updates for locked files
  One entry per file operation
  Hard-linked files get special treatment with subtree locks
Lock revoked -> updates are flushed and replayed
Uses subtree change times to validate cache entries
Additionally features collaborative caching -> referrals to other dedicated cache services

Page 55: Distributed File Systems

Security
  Supports GSS-API
  Supports (but does not require) Kerberos
  Supports PKI mechanisms
  Did not want to be tied down to one mechanism

Page 56: Distributed File Systems


Page 57: Distributed File Systems


Named after Andrew Carnegie and Andrew Mellon
Transarc Corp. and then IBM took over development of AFS
In 2000 IBM made OpenAFS available as open source

Goals
  Large scale (thousands of servers and clients)
  User mobility
  Scalability
  Heterogeneity
  Security
  Location transparency
  Availability

Page 58: Distributed File Systems

Features:
  Uniform name space
  Location-independent file sharing
  Client-side caching with cache consistency
  Secure authentication via Kerberos
  High availability through automatic switchover of replicas
  Scalability to span 5000 workstations

Page 59: Distributed File Systems


Based on the upload/download model
  Clients download and cache files
  Server keeps track of clients that cache the file
  Clients upload files at the end of a session

Whole-file caching is key
  Later amended to block operations (v3)
  Simple and effective

Kerberos for security

AFS servers are stateful
  Keep track of clients that have cached files
  Recall files that have been modified

Page 60: Distributed File Systems


Clients have a partitioned name space: a local name space and a shared name space
A cluster of dedicated servers (Vice) presents the shared name space
Clients run the Virtue protocol to communicate with Vice

Page 61: Distributed File Systems


Page 62: Distributed File Systems


AFS’s storage is arranged in volumes
  Usually associated with the files of a particular client

An AFS directory entry maps Vice files/dirs to a 96-bit fid
  Volume number
  Vnode number: index into the i-node array of a volume
  Uniquifier: allows reuse of vnode numbers

Fids are location transparent
  File movements do not invalidate fids

Location information is kept in a volume-location database
  Volumes are migrated to balance available disk space and utilization
  Volume movement is atomic; the operation is aborted on server crash
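A sketch of the 96-bit fid as three 32-bit fields, which is how the three components add up to 96 bits (field layout per the slide; the packing itself is illustrative):

    import struct

    def make_fid(volume, vnode, uniquifier):
        # three 32-bit big-endian fields -> 12 bytes = 96 bits
        return struct.pack(">III", volume, vnode, uniquifier)

    def parse_fid(fid):
        return struct.unpack(">III", fid)

    fid = make_fid(volume=7, vnode=42, uniquifier=3)
    assert len(fid) * 8 == 96
    # if vnode 42 is freed and reused, bumping the uniquifier (3 -> 4)
    # keeps stale fids from matching the new file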

Page 63: Distributed File Systems

User process –> open file F

The kernel resolves that it’s a Vice file –> passes it to Venus

For each directory D on the path:
  D is in the cache & has a callback –> use it without any network communication
  D is in the cache but has no callback –> contact the appropriate server for a new copy; establish a callback
  D is not in the cache –> fetch it from the server; establish a callback

File F is identified –> create a current cache copy

Venus returns to the kernel, which opens F and returns its handle to the process
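The same flow as a compact sketch (illustrative names; the real Venus is a userspace cache manager):

    def venus_open(path, cache, server):
        entry = cache.get(path)
        if entry is None or not entry["has_callback"]:
            data = server.fetch(path)             # fetch (or refresh) from Vice
            server.establish_callback(path)       # server will notify us of changes
            entry = {"data": data, "has_callback": True}
            cache[path] = entry
        # a cached copy with a valid callback needs no network traffic at all
        return entry["data"]                      # kernel opens this local copy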


Page 64: Distributed File Systems


AFS caches entire files from servers
  Client interacts with servers only during open and close

OS on the client intercepts calls and passes them to Venus
  Venus is a client process that caches files from servers
  Venus contacts Vice only on open and close
  Reads and writes bypass Venus

Works due to callbacks:
  Server updates state to record caching
  Server notifies the client before allowing another client to modify
  Clients lose their callback when someone writes the file

Venus caches dirs and symbolic links for path translation

Page 65: Distributed File Systems

The use of local copies when opening a session in Coda.


Page 66: Distributed File Systems

A descendant of AFS v2 (AFS v3 went another way, with large-chunk caching)

Goals
  More resilient to server and network failures
  Constant data availability
  Portable computing

Page 67: Distributed File Systems

Keeps whole-file caching, callbacks, and end-to-end encryption

Adds full server replication
  General update protocol, known as the Coda Optimistic Protocol
  COP1 (first phase) performs the actual semantic operation at the servers (using multicast if available)
  COP2 sends a data structure called an update set, which summarizes the client’s knowledge; these messages are piggybacked on later COP1’s

Page 68: Distributed File Systems

Disconnected operation (KEY)
  Hoarding
    Periodically reevaluates which objects merit retention in the cache (hoard walking)
    Relies on both implicit and a lot of explicit info (profiles, etc.)
  Emulating, i.e. maintaining a replay log
  Reintegration – replaying the replay log

Conflict resolution
  Provides a repair tool
  Gives the user a log to manually fix the issue

Page 69: Distributed File Systems

The state-transition diagram of a Coda client with respect to a volume.


Page 70: Distributed File Systems

AFS deployments in academia and government (100’s)

The security model required Kerberos
  Many organizations were not willing to make the costly switch

AFS (but not Coda) was not integrated into the Unix FS: separate “ls”, different – though similar – API

Session semantics are not appropriate for many applications

Page 71: Distributed File Systems

Goals
  Efficient use of large main memories
  Support for multiprocessor workstations
  Efficient network communication
  Diskless operation
  Exact emulation of UNIX FS semantics
  Location-transparent UNIX FS

Page 72: Distributed File Systems

Naming
  Local prefix table maps path-name prefixes to servers (see the sketch below)
  Caches locations
  Otherwise, location is embedded in remote stubs in the tree hierarchy

Caching
  Needs sequential consistency
  If one client wants to write, caching is disabled for all open clients. Assumes this isn’t very bad since it doesn’t happen often

No security between kernels; everything runs over a trusted network
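A small sketch of the prefix-table lookup, choosing the server whose prefix matches the longest part of the path (the table contents are made up):

    PREFIX_TABLE = {
        "/":           "root-server",
        "/users":      "server-a",
        "/users/sosa": "server-b",
    }

    def resolve(path):
        # longest matching prefix wins
        best = max((p for p in PREFIX_TABLE
                    if path == p or path.startswith(p.rstrip("/") + "/")),
                   key=len)
        return PREFIX_TABLE[best], path[len(best):].lstrip("/")

    print(resolve("/users/sosa/notes.txt"))   # ('server-b', 'notes.txt')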

Page 73: Distributed File Systems

The best way to implement something depends very heavily on the goals you want to achieve

Always start with goals before deciding on consistency semantics

Page 74: Distributed File Systems
