09/27/2007 File Systems 1
UCDavis, ecs251 Fall 2007
ecs251 Fall 2007: Operating System Models
#1: File Systems
Dr. S. Felix Wu
Computer Science Department
University of California, Davis
http://www.cs.ucdavis.edu/~wu/
File Systems
- Information organization, access, and storage
- Reliability
- Content-centric computing & networking
  – Database, Web/Google, BitTorrent, emails
System-call interface
Active file entries
VNODE Layer or VFS
Local naming (UFS)
FFS
Buffer cache
Block or character device driver
Hardware
File → Disk
- Separate the disk into blocks
- Separate the file into blocks as well
- "Paging" from file to disk
File blocks on disk: 4 – 7 – 2 – 10 – 12
How do we represent the file? How do we link these 5 pages together?
BitTorrent pieces
One big file (X gigabytes) with a number of pieces (5%) already downloaded (and being shared with others).
How much disk space do we need at this moment?
File → Disk blocks
file block 0 → disk block 4
file block 1 → disk block 7
file block 2 → disk block 2
file block 3 → disk block 10
file block 4 → disk block 12

What are the disadvantages?
1. Disk access can be slow for "random access".
2. How big is each block? 2^X bytes? 2^X + 8 bytes?
One Logical File → Physical Disk Blocks
Efficient representation & access
An i-node
??? entries in one disk block
A file
struct ufs2_dinode {
	u_int16_t	di_mode;	/*   0: IFMT, permissions; see below. */
	int16_t		di_nlink;	/*   2: File link count. */
	u_int32_t	di_uid;		/*   4: File owner. */
	u_int32_t	di_gid;		/*   8: File group. */
	u_int32_t	di_blksize;	/*  12: Inode blocksize. */
	u_int64_t	di_size;	/*  16: File byte count. */
	u_int64_t	di_blocks;	/*  24: Bytes actually held. */
	ufs_time_t	di_atime;	/*  32: Last access time. */
	ufs_time_t	di_mtime;	/*  40: Last modified time. */
	ufs_time_t	di_ctime;	/*  48: Last inode change time. */
	ufs_time_t	di_birthtime;	/*  56: Inode creation time. */
	int32_t		di_mtimensec;	/*  64: Last modified time. */
	int32_t		di_atimensec;	/*  68: Last access time. */
	int32_t		di_ctimensec;	/*  72: Last inode change time. */
	int32_t		di_birthnsec;	/*  76: Inode creation time. */
	int32_t		di_gen;		/*  80: Generation number. */
	u_int32_t	di_kernflags;	/*  84: Kernel flags. */
	u_int32_t	di_flags;	/*  88: Status flags (chflags). */
	int32_t		di_extsize;	/*  92: External attributes block. */
	ufs2_daddr_t	di_extb[NXADDR];/*  96: External attributes block. */
	ufs2_daddr_t	di_db[NDADDR];	/* 112: Direct disk blocks. */
	ufs2_daddr_t	di_ib[NIADDR];	/* 208: Indirect disk blocks. */
	int64_t		di_spare[3];	/* 232: Reserved; currently unused */
};
struct ufs1_dinode {
	u_int16_t	di_mode;	/*   0: IFMT, permissions; see below. */
	int16_t		di_nlink;	/*   2: File link count. */
	union {
		u_int16_t oldids[2];	/*   4: Ffs: old user and group ids. */
	} di_u;
	u_int64_t	di_size;	/*   8: File byte count. */
	int32_t		di_atime;	/*  16: Last access time. */
	int32_t		di_atimensec;	/*  20: Last access time. */
	int32_t		di_mtime;	/*  24: Last modified time. */
	int32_t		di_mtimensec;	/*  28: Last modified time. */
	int32_t		di_ctime;	/*  32: Last inode change time. */
	int32_t		di_ctimensec;	/*  36: Last inode change time. */
	ufs1_daddr_t	di_db[NDADDR];	/*  40: Direct disk blocks. */
	ufs1_daddr_t	di_ib[NIADDR];	/*  88: Indirect disk blocks. */
	u_int32_t	di_flags;	/* 100: Status flags (chflags). */
	int32_t		di_blocks;	/* 104: Blocks actually held. */
	int32_t		di_gen;		/* 108: Generation number. */
	u_int32_t	di_uid;		/* 112: File owner. */
	u_int32_t	di_gid;		/* 116: File group. */
	int32_t		di_spare[2];	/* 120: Reserved; currently unused */
};
BitTorrent pieces
File size: 10 GB
Pieces downloaded: 512 MB
How much disk space do we need?
Welcome to ecs251
Alagesan, Shri Vidhya; Anand, Manish K.; Apple, James B.; Bhattacharyya, Prantik; Brasseur, Brian A.; Cai, Liang; Chan, An; Chan, Kelcey; Ding, Dan; Gulati, Supriya S.; Hadke, Amit A.; Harve, Shruthi K.; Jones, Chad E.; Kim, Yejin; Kishore, Apoorv; Kwon, Taeho; Lorca-Martinez, Daniel A.; Mason, Blake; Muppala, Yugandhar; Nelson, Jarom R.; Schwarz, Christopher G.; Shafii, Sohail S.; Shi, Lei; Singh, Anhad P.; Stegers, Till J.; Sukharev, Jeffrey; Thomason, Russell D.; Xia, Ming; Xu, Dan
Subjects to cover
- File Systems
- Distributed File Systems (NFS/AFS/Coda)
- Peer-to-Peer Systems (BitTorrent, Chord, Web 2.0)
- Social Network Systems (Facebook, LinkedIn, Orkut, MySpace, …)
- Transaction, Recovery & Distributed Commit
- Trust and Reputation Systems
- Self-Stabilization Systems (if time available)
Class website
http://www.cs.ucdavis.edu/~wu/ecs251
About the Instructor
S. Felix Wu – [email protected]
– Facebook group (under UC Davis network): "ecs251"
Office: 3057 Engineering II
Phone: 530-754-7070
Office Hours:
– 10–11 a.m. on Monday and Friday
– by appointment
Why 3 email addresses?
– [email protected]
  Read and responded to during the quarter, especially before homework deadlines.
– [email protected]
  My main email contact for everything, all the time.
– [email protected]
  Read only once in the past three months…
Anti-Spam
[email protected]
subject: [ecs251_f2007]…
ecs251_f2007_0927 is the cyber social link between the instructor and the students in ecs251, Fall 2007.
Textbook
A set of papers on these subjects
Requirements
- Midterm: 30% (11/01/2007)
- Final Project: 70% (1–2 persons)
  – Pre-proposal (10%, 10/18/2007)
  – Proposal/Design/Prototype (30%, 11/20/2007)
  – Final Demo & Report (30%, 12/11/2007)
Projects (10/02/2007)
- On-demand/interactive BitTorrent
  – Kernel, BT client, & applications
- Transaction-oriented File System
- P2P-X over Internet
  – X = Facebook, Google, eBay, ud-DFS, Data Center
- "Internet-less information system"
  – MANET-based systems
- Social network information crawling
On-Demand/Interactive BT
- BT piece selection: rarest-first plus priority
- Downloading xyz.rm, and we want to fast-forward to a particular part of the file
- Can "the kernel file system" pass that particular need to the BT client?
  – E.g., the read system call will block until the BT client obtains the needed pieces.
- Advantage: we don't need to modify the applications!
- Disadvantages: ???
Transaction-oriented FS
- Metadata file system consistency:
  – FreeBSD: Soft Updates
  – Linux: Journaling
- Can we relax "All or Nothing"?
  – A better user interface for us to recover partial file modifications.
P2P over X
- Facebook
- Google
- eBay
- MegaUpload
Problems with centralized service providers???
DaceBook
- Davis version of FaceBook
- Peer-to-peer social networking
- Community gateways to other social networks such as Facebook
Socially here, information here
Internet-less Information Sharing
Ad Hoc Collaboration for Content Sharing in the Infrastructure-less Heterogeneous Network
[Figure: ad hoc nodes A (Bluetooth), B (WiFi + Bluetooth), C (WiMAX), D (WiFi + WiMAX), E (WiFi), F (WiFi + WiMAX), S (WiFi)]
Infrastructure-Less
- "No IP address, DNS, or Web servers"
- Purely P2P from Layers 3 to 7
Social Network Crawling
Social Networking
- Social network services
  – Friendster, MySpace, Facebook, Orkut, LinkedIn, …
- Social links, interest keywords, search & community services
Social Networks: # of Users
- FaceBook ~34M
- Friendster ~47M
- Hi5 ~50M
- LinkedIn ~12M
- MySpace ~200M
- Orkut ~47M
- Yahoo! 360° ~4M
- … among many others.
http://en.wikipedia.org/wiki/List_of_social_networking_sites
Social Network Access Interfaces
- FBML – Facebook Markup Language
- FQL – Facebook Query Language
- REST (REpresentational State Transfer) Web API
- FBJS – implementation of JavaScript on the Facebook platform
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
    FILE *f1 = fopen("./sss.txt", "w");
    int i;

    for (i = 0; i < 1000; i++) {
        fseek(f1, rand(), SEEK_SET);
        fprintf(f1, "%d%d%d%d", rand(), rand(), rand(), rand());
        if (i % 100 == 0)
            sleep(1);
    }
    fflush(f1);
    fclose(f1);
    return 0;
}

# ./t
# ls -l ./sss.txt
di_size vs. di_blocks
Logical size: fstat (di_size)
Physical size: du (di_blocks)
One Logical File → Physical Disk Blocks
Efficient representation & access
An i-node
Typical: each block 1 KB
??? entries in one disk block
A file
i-node
- How many disk blocks can a FS have?
- How many levels of i-node indirection will be necessary to store a file of 2 GB? (i.e., 0, 1, 2, or 3)
- What is the largest possible file size in i-node?
- What is the size of the i-node itself for a file of 10 GB with only 512 MB downloaded?
Answer
- How many disk blocks can a FS have?
  – 2^64 or 2^32: pointer (to blocks) size is 8/4 bytes.
- How many levels of i-node indirection will be necessary to store a file of 2 GB (2^31 bytes)? (i.e., 0, 1, 2, or 3)
  – 12·2^10 + 2^8·2^10 + 2^8·2^8·2^10 + 2^8·2^8·2^8·2^10 >? 2^31
- What is the largest possible file size in i-node?
  – 12·2^10 + 2^8·2^10 + 2^8·2^8·2^10 + 2^8·2^8·2^8·2^10
  – 2^64 − 1
  – 2^32 · 2^10
  You need to consider three issues and find the minimum!
Answer: Lower Bound
How many pointers?
– 512 MB divided by the block size (1 KB) = 512K pointers
– 512K pointers times 8 (4) bytes = 4 (2) MB
Answer: Upper Bound
In the worst case, EVERY indirection block has at least one entry!
How many indirection blocks?
– Single: 1 block
– Double: 1 + 2^8
– Triple: 1 + 2^8 + 2^16
Total ~ 2^16 blocks times 1 KB = 64 MB
– 2^14 times 1 KB = 16 MB (ufs2 inode)
Answer (4)
- 2 MB ~ 64 MB (ufs1)
- 4 MB ~ 16 MB (ufs2)
Answer: sss.txt ~ 17 MB
– ~16 MB (inode indirection blocks)
– 1000 writes times 1 KB ~ 1 MB
A File System
[Figure: a disk divided into partitions; each partition holds a superblock (sb), an i-list of i-nodes, and directory and data blocks.]
dirp = opendir(const char *filename);
struct dirent *direntp = readdir(dirp);

struct dirent {
    ino_t d_ino;
    char  d_name[NAME_MAX+1];
};

[Figure: a directory block holds dirent entries (i-node number, file name), each pointing to a file's i-node.]
[Figure: on-disk layout of a small tree. I-node 2 is the root directory "/" (drwxr-xr-x, root/wheel); its block holds entries ". → 2", ".. → 2", "usr → 4", "vmunix → 5". I-node 4 is directory "/usr" (drwxr-xr-x, root/wheel) with entries ". → 4", ".. → 2", "bin → 7", "foo → 6". I-node 5 is the file /vmunix (rwxr-xr-x, root/wheel; text and data). I-node 6 is the file /usr/foo (rw-rw-r--, kirk/staff; contents "Hello World!"). I-node 7 is directory "/usr/bin" (drwxr-xr-x, root/wheel) with entries ". → 7", ".. → 4", "ex → 9", "groff → 10", "vi → 9". I-node 9 is the file /usr/bin/vi (rwxr-xr-x, bin/bin; text and data) — note that "ex" and "vi" are two names for the same i-node.]
struct dirent {
    ino_t d_ino;
    char  d_name[NAME_MAX+1];
};

struct stat {
    …
    short nlinks;
    …
};

[Figure: directory blocks of dirent entries (i-node number, file name) pointing to files' i-nodes.]
What is the difference?
ln -s /usr/src/sys/sys/proc.h ppp.h
ln /usr/src/sys/sys/proc.h ppp.h
File System Buffer Cache
[Figure: layered view — application: read/write files; OS: translates files to disk blocks and maintains a buffer cache; driver: controls disk accesses, read/write blocks; hardware.]
Any problems?
File System Consistency
To maintain file system consistency, the ordering of updates from the buffer cache to disk is critical.
Example:
– if the directory block is written back before the i-node and the system crashes, the directory structure will be inconsistent
File System Consistency
- File systems almost always use a buffer/disk cache for performance reasons
- This problem is especially critical for the blocks that contain control information: i-node, free list, directory blocks
- Two copies of a disk block (buffer cache, disk) → consistency problem if the system crashes before all the modified blocks are written back to disk
- Write critical blocks from the buffer cache back to disk immediately
- Data blocks are also written back periodically: sync
Two Strategies
- Prevention
  – Use unbuffered I/O when writing i-nodes or pointer blocks
  – Use buffered I/O for other writes and force a sync every 30 seconds
- Detect and Fix
  – Detect the inconsistency
  – Fix it according to the "rules"
  – fsck (File System Checker)
File System Integrity
- Block consistency:
  – Block-in-use table
  – Free-list table
- File consistency:
  – how many directories point to that i-node (D)?
  – nlink (L)?
  – three cases: D == L, L > D, D > L. What to do with the latter two cases?

Block-in-use counts: 0 1 1 1 0 0 0 1 0 0 0 2
Free-list counts:    1 0 0 0 1 1 1 0 1 0 2 0
File System Integrity
File system states:
(a) consistent
(b) missing block
(c) duplicate block in free list
(d) duplicate data block
Metadata Operations
- Metadata operations modify the structure of the file system
  – Creating, deleting, or renaming files, directories, or special files
  – Directory & i-node
- Data must be written to disk in such a way that the file system can be recovered to a consistent state after a system crash
Metadata Integrity
- FFS uses synchronous writes to guarantee the integrity of metadata
  – Any operation modifying multiple pieces of metadata will write its data to disk in a specific order
  – These writes will be blocking
- Guarantees integrity and durability of metadata updates
Deleting a file (I)
[Figure: a directory block with entries abc, def, ghi pointing to i-node-1, i-node-2, i-node-3.]
Assume we want to delete file "def".
Deleting a file (II)
[Figure: entries abc, def, ghi remain, but i-node-2 is gone — "def" now points at nothing.]
Cannot delete the i-node before directory entry "def"!
Deleting a file (III)
Correct sequence is:
1. Write to disk the directory block containing the deleted directory entry "def"
2. Write to disk the i-node block containing the deleted i-node
Leaves the file system in a consistent state.
Creating a file (I)
[Figure: a directory block with entries abc, ghi pointing to i-node-1, i-node-3.]
Assume we want to create new file "tuv".
Creating a file (II)
[Figure: entries abc, ghi, tuv — but "tuv" has no i-node yet.]
Cannot write directory entry "tuv" before its i-node!
Creating a file (III)
Correct sequence is:
1. Write to disk the i-node block containing the new i-node
2. Write to disk the directory block containing the new directory entry
Leaves the file system in a consistent state.
Synchronous Updates
- Used by FFS to guarantee consistency of metadata:
  – All metadata updates are done through blocking writes
- Increases the cost of metadata updates
- Can significantly impact the performance of the whole file system
SOFT UPDATES
- Use delayed writes (write back)
- Maintain dependency information about cached pieces of metadata: this i-node must be updated before/after this directory entry
- Guarantee that metadata blocks are written to disk in the required order
3 Soft Update Rules
1. Never point to a structure before it has been initialized.
2. Never reuse a resource before nullifying all previous pointers to it.
3. Never reset the old pointer to a live resource before the new pointer has been set.
Problem #1 with S.U.
- Synchronous writes guaranteed that metadata operations were durable once the system call returned
- Soft Updates guarantee that the file system will recover into a consistent state, but not necessarily the most recent one
  – Some updates could be lost
We want to delete file "foo" and create new file "bar".
[Figure: directory Block A holds the entry for "foo" and the NEW entry for "bar"; i-node Block B holds i-node-2 (foo's) and the NEW i-node-3 (bar's).]
What are the dependency relationships?
We want to delete file "foo" and create new file "bar".
[Figure: the same Block A and Block B — the delete requires the directory block to be written before the i-node block, while the create requires the i-node block to be written before the directory block.]
Circular dependency!
Problem #2 with S.U.
- Cyclical dependencies:
  – The same directory block contains entries to be created and entries to be deleted
  – These entries point to i-nodes in the same block
- Brainstorming:
  – How to resolve this issue in S.U.?
How do we update? I-node first or directory block first?
Solution in S.U.
Roll back the metadata in one of the blocks to an earlier, safe state.
(The safe state does not contain the new directory entry.)
[Figure: Block A' — the directory block rolled back to hold only the old entry "def".]
- Write first the block with metadata that was rolled back (block A' of the example)
- Write the blocks that can be written after the first block has been written (block B of the example)
- Roll forward the block that was rolled back
- Write that block
- This breaks the cyclical dependency, but block A must now be written twice
Before any write operation: SU dependency checking (roll back if necessary)
After any write operation: SU dependency processing (task-list updating; roll forward if necessary)
The two most popular approaches for improving the performance of metadata operations and recovery:
– Journaling
– Soft Updates
Journaling systems record metadata operations on an auxiliary log.
Soft Updates uses ordered writes.
JOURNALING
- Journaling systems maintain an auxiliary log that records all metadata operations
- Write-ahead logging ensures that the log is written to disk before any blocks containing data modified by the corresponding operations
  – After a crash, the log can be replayed to bring the file system to a consistent state
Journaling
- Atomically updated
- Old and new versions of the data are held on disk until the update commits
- Undo logging:
  – Copy the old data to the log
  – Write the new data to disk
  – If you crash during the update, copy the old data back from the log
- Redo logging:
  – Write the new data to the log
  – The old data remains on disk
  – If you crash, copy the new data from the log
How does it work?
- Each disk update is a transaction (atomic update)
  – Write the new data to the disk (journal)
  – The update is not final until a commit
- Only after the commit block is written is the update final
  – The commit block is a single block of data on the disk
  – Not necessarily flushed to disk yet!
JOURNALING
- Log writes are performed in addition to the regular writes
- Journaling systems incur log-write overhead, but
  – Log writes can be performed efficiently because they are sequential (block-operation consideration)
  – Metadata blocks do not need to be written back after each update
JOURNALING
Journaling systems can provide
– the same durability semantics as FFS if the log is forced to disk after each metadata operation
– the laxer semantics of Soft Updates if log writes are buffered until entire buffers are full
Versus log-structured file system
- A log-structured file system contains ONLY a log; everything is written to the end of this log
- An LSFS dictates how the data is stored on disk
- A JFS (journaling file system) does not dictate how the data is stored on disk
Soft Updates vs. Journaling
Advantages? Disadvantages?
With Soft Updates??
Do we still need "FSCK"? At boot time?
Recover the Missing Resources
- In the background, in an active FS…
  – We don't want to wait for the lengthy FSCK process to complete…
- A related issue:
  – the virus-scanning process
  – what happens if we get a new virus signature?
Snapshot of the FS
- Backup and restore
- Dump an active file system reliably
  – what would we do today to dump our 40 GB FS?
- "Consistent" snapshots (at midnight…)
- "Background FSCK checks"
What is a snapshot? (I mean "conceptually".)
- Freeze all activities related to the FS.
- Copy everything to "some space".
- Resume the activities.
How do we efficiently implement this concept such that the activities will only be blocked for about 0.25 seconds, and we don't have to buy a really big hard drive?
Copy-on-Write
Snapshot: a file
Logical size versus physical size
Example
# mkdir /backups/usr/noon
# mount -u -o snapshot /usr/snap.noon /usr
# mdconfig -a -t vnode -u 0 -f /usr/snap.noon
# mount -r /dev/md0 /backups/usr/noon

/* do whatever you want to test it */

# umount /backups/usr/noon
# mdconfig -d -u 0
# rm -f /usr/snap.noon
A File System
??? entries in one disk block
A file
A Snapshot i-node
??? entries in one disk block
A file
Not used or not yet copied
Copy-on-write
??? entries in one disk block
A file
Not used or not yet copied
Multiple Snapshots
- About 20 snapshots
- Interactions/sharing among snapshots
Transaction-based FS
- Performance versus consistency
- "Atomic Writes" on multiple blocks
  – See the paper "Atomic Writes for Data Integrity and Consistency in Shared Storage Devices for Clusters" by Okun and Barak, FGCS, vol. 20, pages 539–547, 2004.
  – Modify SCSI handling
File System Mounting
- A file system must be mounted before it can be accessed.
- An unmounted file system is mounted at a mount point.
Mount Point
Logical disks
[Figure: two on-disk trees. fs0 (/dev/hd0a) holds the root / with usr, sys, dev, etc, bin; fs1 (/dev/hd0e) holds a tree with local, adm, home, lib, bin, attached at /usr.]
mount -t ufs /dev/hd0e /usr
mount -t nfs 152.1.23.12:/export/cdrom /mnt/cdrom
Distributed FS
Distributed File Systems:
– NFS (Network File System)
– AFS (Andrew File System)
– GFS (Google File System)
– CODA
Distributed FS
[Figure: the same two directory trees, now on remote servers — ftp.cs.ucdavis.edu exports fs0: /dev/hd0a, and Server.yahoo.com exports fs0: /dev/hd0e.]
Distributed File System
- Transparency and location independence
- Reliability and crash recovery
- Scalability and efficiency
- Correctness and consistency
- Security and safety
Correctness
One-copy Unix Semantics??
Correctness
One-copy Unix semantics:
– every modification to every byte of a file has to be immediately and permanently visible to every client.
– conceptually, FS-sequential access
  - makes sense in a local file system
  - single processor versus shared memory
Is this necessary?
DFS Architecture
- Server
  – storage for the distributed/shared files
  – provides an access interface for the clients
- Client
  – consumer of the files
  – runs applications in a distributed environment
[Figure: applications call open, close, read, write, opendir, stat, readdir against the client, which forwards operations to the server.]
NFS (Sun, 1985)
- Based on RPC (Remote Procedure Call) and XDR (eXternal Data Representation)
- The server maintains no state
  – a READ on the server opens, seeks, reads, and closes
  – a WRITE is similar, but the buffer is flushed to disk before closing
- Server crash: the client keeps retrying until the server reboots – no loss
- Client crash: the client must rebuild its own state – no effect on the server
RPC – XDR
- RPC: standard protocol for calling procedures on another machine
- The procedure call is packaged with authorization and admin info
- XDR: a standard format for data, because computer manufacturers cannot agree on byte ordering
rpcgen
[Figure: rpcgen takes an RPC program definition and generates RPC client.c, RPC server.c, and RPC.h, plus the shared data-structure (XDR) code on both sides.]
NFS Operations
- Every operation is independent: the server opens the file for every operation
- A file is identified by a handle – no state information retained by the server
- The client maintains the mount table, v-node, offset in the file table, etc.
What do these imply???
[Figure: NFS architecture. On the client computer, application programs issue UNIX system calls into the kernel's virtual file system, which dispatches each operation either to the local UNIX file system (and other local file systems) or to the NFS client. The NFS client speaks the NFS protocol (remote operations) to the NFS server on the server computer; the NFS server in turn calls into the server's virtual file system and UNIX file system. Operations on local files stay local; operations on remote files cross the network.]

mount -t nfs home.yahoo.com:/pub/linux /mnt/linux
Stateful vs. Stateless
- A stateful server is fully aware of its clients:
  – does the client have the newest copy?
  – what is the offset of an opened file?
  – "a session" between a client and a server!
- A stateless server is completely unaware of its clients:
  – memory-less: I do not remember you!!
  – just tell me what you want to get (and where)
  – I am not responsible for your offset values (the client needs to maintain that state)
The State
[Diagram: applications issue open/read/stat/lseek calls; the per-file offset is the state in question, kept at the server in the state-ful design and at each client in the state-less design.]
Network File Sharing – Server side:
– rpcbind (portmap)
– mountd – responds to mount requests (sometimes called rpc.mountd); relies on several files:
 /etc/dfs/dfstab
 /etc/exports
 /etc/netgroup
– nfsd – serves files; actually a call into kernel-level code
– lockd – file-locking daemon
– statd – manages locks for lockd
– rquotad – manages quotas for exported file systems
Network File Sharing – Client Side
– biod – client-side caching daemon
– mount must understand the hostname:directory convention
– filesystem entries in /etc/[v]fstab tell the client which filesystems to mount
Unix file semantics
 NFS:
– open a file in read-write mode
– later, the server’s copy becomes read-only mode
– now, the application tries to write it!!
Problems with NFS
 Performance is not scalable:
– maybe it is OK for a local office
– it will be horrible in large-scale systems
 Server caching: similar to UNIX file caching for local files:
– pages (blocks) from disk are held in a main-memory buffer cache until the space is required for newer pages; read-ahead and delayed-write optimisations
– for local files, writes are deferred to the next sync event (30-second intervals)
– this works well in the local context, where files are always accessed through the local cache, but in the remote case it doesn’t offer the necessary synchronization guarantees to clients
 NFS v3 servers offer two strategies for updating the disk:
– write-through – altered pages are written to disk as soon as they are received at the server; when a write() RPC returns, the NFS client knows that the page is on the disk
– delayed commit – pages are held in the cache until a commit() call is received for the relevant file; this is the default mode used by NFS v3 clients, and a commit() is issued by the client whenever a file is closed
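The two strategies can be contrasted with a toy cache model (illustrative only; the class and method names are invented, not NFS internals): write-through makes each page durable before the write returns, while delayed commit defers durability until commit().

```python
# Toy sketch contrasting the two server-side update strategies.
class ServerCache:
    def __init__(self, write_through):
        self.write_through = write_through
        self.cache = {}   # page number -> data (volatile)
        self.disk = {}    # what is durably stored
    def write(self, page, data):
        self.cache[page] = data
        if self.write_through:
            self.disk[page] = data   # on disk before write() returns
    def commit(self):
        self.disk.update(self.cache) # flush all dirty pages on commit()
```

Under delayed commit a server crash between write() and commit() loses the page, which is exactly why the client must retain the data until its commit() succeeds.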
 Server caching does nothing to reduce RPC traffic between client and server
– further optimisation is essential to reduce server load in large networks
– the NFS client module caches the results of read, write, getattr, lookup and readdir operations
– synchronization of file contents (one-copy semantics) is not guaranteed when two or more clients are sharing the same file
 Timestamp-based validity check
– reduces inconsistency, but doesn’t eliminate it
– validity condition for cache entries at the client: (T - Tc < t) ∨ (Tm_client = Tm_server)
– t is configurable (per file) but is typically set to 3 seconds for files and 30 seconds for directories
– it remains difficult to write distributed applications that share files with NFS
where:
– t: freshness guarantee
– Tc: time when the cache entry was last validated
– Tm: time when the block was last updated at the server
– T: current time
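The validity condition translates directly into code; the variable names follow the glossary above.

```python
# A cache entry is usable if it was validated recently enough
# (T - Tc < t), or if the client's and server's modification
# timestamps still agree (Tm_client == Tm_server).
def cache_entry_valid(T, Tc, t, Tm_client, Tm_server):
    return (T - Tc < t) or (Tm_client == Tm_server)
```

Note the "or": a recently validated entry is trusted without contacting the server, which is the source of the residual inconsistency window of up to t seconds.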
AFS
 State-ful clients and servers; files are cached at clients
– file close ==> check in the changes
 How to maintain consistency?
– using “callbacks” in AFS v2/v3 (Valid or Cancelled)
[Diagram: an application opens and reads its locally cached copy; when the server breaks the callback, the client invalidates the cached file and re-caches it.]
Why AFS?
 Shared files are infrequently updated
 Local cache of a few hundred megabytes
– now 50~100 gigabytes
 Unix workload:
– files are small; read operations dominate; sequential access is common; files are read/written by one user; reference bursts
– are these still true?
Fault Tolerance in AFS
 A server crashes
 A client crashes
– check for call-back tokens first
Problems with AFS
 Availability: what happens if the call-back itself is lost??
GFS – Google File System
 “Failures” are the norm
 Multi-GB files are common
 Append rather than overwrite
– random writes are rare
 Can we relax the consistency?
The Master
 Maintains all file system metadata: namespace, access control info, file-to-chunk mappings, chunk (including replica) locations, etc.
 Periodically communicates with chunkservers in HeartBeat messages to give instructions and check state
The Master
 Makes sophisticated chunk placement and replication decisions using global knowledge
 For reading and writing, a client contacts the master to get chunk locations, then deals directly with chunkservers
 The master is not a bottleneck for reads/writes
Chunkservers
 Files are broken into chunks; each chunk has an immutable, globally unique 64-bit chunk handle
– the handle is assigned by the master at chunk creation
 Chunk size is 64 MB
 Each chunk is replicated on 3 (default) servers
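Mapping a byte offset in a file to a chunk is simple arithmetic; a small sketch (`chunk_coords` is an invented helper, not a GFS API). The client computes the chunk index, asks the master for that chunk's handle and replica locations, then talks to a chunkserver directly.

```python
CHUNK_SIZE = 64 * 2**20   # 64 MB, the GFS chunk size

# A byte offset maps to (chunk index, offset within that chunk).
def chunk_coords(byte_offset):
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE
```

The large chunk size keeps the master's metadata small: a multi-GB file needs only a few dozen chunk entries.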
Clients
 Linked into applications via the file system API
 Communicate with the master and chunkservers for reading and writing
– master interactions only for metadata
– chunkserver interactions for data
 Only cache metadata information – data is too large to cache
Chunk Locations
 The master does not keep a persistent record of the locations of chunks and replicas
 It polls chunkservers at startup, and when new chunkservers join/leave
 It stays up to date by controlling the placement of new chunks and through HeartBeat messages (when monitoring chunkservers)
Atomic commit protocols
 One-phase atomic commit protocol
– the coordinator tells the participants whether to commit or abort
– what is the problem with that?
– it does not allow one of the servers to decide to abort – a server may have discovered a deadlock, or it may have crashed and been restarted
 Two-phase atomic commit protocol
– designed to allow any participant to choose to abort a transaction
– phase 1: each participant votes; if it votes to commit, it is prepared and cannot change its mind; in case it crashes, it must save its updates in permanent storage
– phase 2: the participants carry out the joint decision
The decision could be commit or abort – participants record it in permanent store
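The two phases can be sketched as follows (a toy, crash-free model with invented names; a real 2PC implementation must also log states in permanent storage so participants can recover): the coordinator commits only if every participant votes yes.

```python
# Toy two-phase commit: phase 1 collects votes, phase 2 broadcasts
# the joint decision. One "no" vote aborts the whole transaction.
class Participant:
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.state = "INIT"
    def vote(self):
        # A yes-vote means the participant is "prepared": it has saved
        # its updates durably and can no longer change its mind.
        self.state = "READY" if self.can_commit else "ABORT"
        return self.can_commit
    def decide(self, commit):
        self.state = "COMMIT" if commit else "ABORT"

def two_phase_commit(participants):
    votes = [p.vote() for p in participants]   # phase 1: voting
    decision = all(votes)
    for p in participants:                     # phase 2: completion
        p.decide(decision)
    return decision
```

The fragile spot is invisible here: between vote() and decide(), a READY participant is blocked waiting for the coordinator, which is exactly the window the crash scenarios below explore.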
Two-phase commit (2PC)
[Diagram: in phase 1 the coordinator asks every server “What is your result?”; in phase 2 the coordinator announces the final consensus to all servers.]
Failure model
 Commit protocols are designed to work in
– an asynchronous system (e.g. messages may take a very long time)
– servers/coordinator may crash
– messages may be lost
– assume corrupt and duplicated messages are removed
– no Byzantine faults – servers either crash or obey their requests
 2PC is an example of a protocol for reaching a consensus
– crash failures of processes are masked by replacing a crashed process with a new process whose state is set from information saved in permanent storage and from information held by other processes
2PC
 2PC
– voting phase: the coordinator asks all servers if they can commit; if yes, a server records its updates in permanent storage and then votes
– completion phase: the coordinator tells all servers to commit or abort
[Diagram: 2PC state machines. Coordinator: INIT → (send canCommit?) → WAIT → ABORT or COMMIT. Participant: INIT → (send vote) → READY → ABORT on doAbort, COMMIT on doCommit; the participant then sends haveCommitted.]
Failures
 Some servers missed “canCommit”
 The coordinator missed some “votes”
 Some servers missed “doAbort” or “doCommit”
Failures/Crashes
 Some servers crashed before/after “canCommit”
 The coordinator crashed before/after receiving some “votes”
 Some servers crashed before/after receiving “doAbort” or “doCommit”
[Diagram: the same 2PC state machines – coordinator INIT → WAIT → ABORT/COMMIT; participant INIT → READY → ABORT/COMMIT, driven by canCommit, vote, doAbort/doCommit, and haveCommitted.]
Assume the coordinator crashed after the “canCommit” messages have been sent:
(0) some servers have not received the vote requests: WAIT/INIT
(1) all good servers are in the WAIT state: WAIT/???
(2) some servers are in either the ABORT or the COMMIT state: ABORT/COMMIT
(3) all servers are in either the ABORT or the COMMIT state: ABORT/COMMIT
3PC
Skeen & Stonebraker, 1983
[Diagram: 3PC participant states – INIT → (vote) → WAIT (uncertain) → either ABORT (aborted) or Pre-COMMIT (committable) → (ACK) → COMMIT (committed).]
Consistency
 Read: are we reading the fresh copy?
 Write: have we updated all the copies?
 Failure/Partition
NFS, AFS, & GFS
 Read the newest copy
– NFS: as long as we don’t cache…
– AFS: the read callback might be broken…
– GFS: we can check the master…
NFS, AFS, & GFS
 Write to the master copy
– NFS: write-through
– AFS: you have to hold the callback…
– GFS: we will update all three replicas, but what happens if a read occurs while we are updating the copies? (namespace and file-to-chunk mapping)
Pessimistic versus Optimistic
 Locking the world while I am updating it…
– Scalable?
– “Serializable schedule”
– Assuming a closed system – “you can NOT fail at this moment, as I am updating this particular transaction” – we can use a “log” to handle the issue of “atomicity” (all or nothing)
Soft Update
 Create X (t1) and Delete Y (t2)
 Possible orderings: T1, T2 or T2, T1
 We will enforce the ordering when we resolve the problem of circular dependency!
SU & Background FSCK
 Soft Update guarantees that the file system is always in a “consistent” state.
 Essentially, Soft Update prevents any chance of inconsistency!
An Alternative Approach – “Optimistic”
 The chance for certain bad things to occur is very small (depending on what you are doing).
 And it is very expensive to pessimistically prevent that small probability.
 “Let it happen, and we try to detect and recover from it…”
The Optimistic Approach
 Regular execution with “recording”
 Conflict detection based on the “record”
 Conflict resolution
Example
 Allowing “inconsistencies”
– without Soft Update, we have to do FSCK in the foreground (before we can use the file system)
– i.e., we try to eliminate “inconsistencies”
 But do we really need a “perfectly consistent FS” for all applications?
– why not take a snapshot and then do background FSCK anyway!
Optimistic?
 NFS
 AFS
 GFS
Optimistic?
 NFS
– if the server changes the access mode in the middle of an open session from a client…
 AFS
– the “callback” is the check for inconsistencies
 GFS
– failure in the middle of a write/append
CODA
 Server replication:
– if one server goes down, I can get another
 Disconnected operation:
– if all go down, I will use my own cache
Disconnected Operation
 Continue critical work when the repository is inaccessible
 Key idea: caching data
– performance
– availability
 Server replication
Consistency
 If John updates file X on server A and Mary reads file X on server B…
 Read-one & write-all
Read x & Write (N-x+1)
[Diagram: a read quorum of x servers always overlaps a write quorum of N-x+1 servers.]
Example: R3W4 (6+1)
          S1 S2 S3 S4 S5 S6
Initial    0  0  0  0  0  0
Alice-W    2  2  0  2  2  0
Bob-W      2  3  3  3  3  0
Alice-R    2  3  3  3  3  0
Chris-W    2  1  1  1  1  0
Dan-R      2  1  1  1  1  0
Emily-W    7  7  1  1  1  7
Frank-R    7  7  1  1  1  7
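The quorum rule behind the example can be sketched with explicit version numbers (an illustrative model; the helper names are invented). Since R + W = 3 + 4 > 6, every read quorum intersects every write quorum, so a read always sees at least one replica carrying the newest version.

```python
import random

N, R = 6, 3
W = N - R + 1   # 4: write quorum size

def quorum_write(servers, value):
    quorum = random.sample(range(N), W)
    # Any two write quorums overlap (4 + 4 > 6), so the highest version
    # seen inside this quorum is the globally newest version.
    new_ver = max(servers[i][0] for i in quorum) + 1
    for i in quorum:
        servers[i] = (new_ver, value)

def quorum_read(servers):
    quorum = random.sample(range(N), R)
    newest = max(quorum, key=lambda i: servers[i][0])
    return servers[newest][1]   # the highest-versioned replica wins
```

The quorums are chosen at random on every call, yet a read is still guaranteed fresh: that is exactly the overlap argument, not luck.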
Recovery
 The paper describes two recovery experiments on a cluster containing 15,000–16,000 chunks with 600–660 GB of data.
(a) One chunkserver was killed. Recovery time was 23 minutes. Describe the process.
(b) Two chunkservers were killed. Explain how the recovery process changed.
(c) Suppose that the master server had been killed. How would the recovery have taken place?
Consistency Model
(a) What is a consistency model?
(b) How does the consistency model used by the Google File System differ from the model that might be used by a general-purpose file system?
(c) How does the file system guarantee correctness of the namespace?
(d) Explain what happens, from a consistency viewpoint, when an application attempts to append to a file.
[Diagram: flat file service architecture. Application programs on the client call a client module; the server provides a directory service (Lookup, AddName, UnName, GetNames) and a flat file service (Read, Write, Create, Delete, GetAttributes, SetAttributes).]