
Operating Systems: Filesystems


Page 1: Operating Systems: Filesystems

Operating Systems:

Filesystems

Shankar

April 25, 2015

Page 2: Operating Systems: Filesystems

Outline fs interface

Filesystem Interface

Persistent Storage Devices

Filesystem Implementation

FAT Filesystem

FFS: Unix Fast Filesystem

NTFS: Microsoft Filesystem

Copy-On-Write: Slides from OS:P&P text

Page 3: Operating Systems: Filesystems

Overview: Filesystem Interface fs interface

Persistent structure of directories and files

File: named sequence of bytes

Directory: zero or more pointers to directories/files

Persistent: non-volatile storage device: disk, SSD, ...

Structure organized as a tree or acyclic graph

nodes: directories and files
root directory: "/"
path to a node: "/a/b/..."
acyclic: more than one path to a file (but not a directory)

problematic for disk utilities, backup

Node metadata: owner, creation time, allowed access, ...

Users can create/delete nodes, modify content/metadata

Examples: FAT, UFS, NTFS, ZFS, ..., PFAT, GOSFS, GSFS2

Page 4: Operating Systems: Filesystems

Node fs interface

Name(s)

a name is the last entry in a path to the node
eg, path "/a/b" vs name "b"
subject to size and char limits

Type: directory or file or device or ...

Size: subject to limit

Directory may have a separate limit on number of entries.

Time of creation, modification, last access, ...

Content type (if file): eg, text, pdf, jpeg, mpeg, executable, ...

Owner

Access for owner, other users, ...: eg, r, w, x, setuid, ...

Page 5: Operating Systems: Filesystems

Operations on Filesystems fs interface

Format(dev)
create an empty filesystem on device dev (eg, disk, flash, ...)

Mount(fstype, dev)
attach (to computer) filesystem of type fstype on device dev
returns a path to the filesystem (eg, volume, mount point, ...)
after this, processes can operate on the filesystem

Unmount(path)
detach (from computer) filesystem at path // finish all io
after this, the filesystem is inert on its device, inaccessible
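
For concreteness, here is a minimal sketch of how Mount/Unmount look through the Linux mount(2) and umount(2) system calls; the device /dev/sdb1, the mount point /mnt/data, and the ext4 type are invented examples, and the program must run with root privilege.

/* Sketch: attaching and detaching a filesystem on Linux.
   Device, mount point, and fs type below are only examples. */
#include <stdio.h>
#include <sys/mount.h>

int main(void) {
    /* Mount(fstype, dev): attach the filesystem at a mount point */
    if (mount("/dev/sdb1", "/mnt/data", "ext4", MS_RDONLY, NULL) != 0) {
        perror("mount");
        return 1;
    }
    /* ... processes can now operate on files under /mnt/data ... */

    /* Unmount(path): finish all io and detach the filesystem */
    if (umount("/mnt/data") != 0) {
        perror("umount");
        return 1;
    }
    return 0;
}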

Page 6: Operating Systems: Filesystems

Operations on Attached Filesystem - 1 fs interface

Create(path), CreateDir(path)
create a file/directory node at given path

Link(existingPath, newPath)
create a (hard) link to an existing file node

Delete(path) // aka Unlink(path)
delete the given path to the file node at path
delete the file node if no more links to it

DeleteDir(path)
delete the directory node at path // must be empty

Change attributes (name, metadata) of node at path

eg, stat, touch, chown/chgrp, chmod, rename/mv

Page 7: Operating Systems: Filesystems

Operations on Attached Filesystem - 2 fs interface

Open(path, access), OpenDir(path, access)
open the node at path with given access (r, w, ...)
returns a file descriptor // prefer: node descriptor
after this, node can be operated on

Close(nd), CloseDir(nd)
close the node associated with node descriptor nd

Read(nd, file range, buffer), ReadDir(nd, dir range, buffer)
read the given range from open node nd into the given buffer
returns number of bytes/entries read

Write(nd, file range, buffer)
write buffer contents into the given range of open file node nd
returns number of bytes written

Page 8: Operating Systems: Filesystems

Operations on Attached Filesystem - 3 fs interface

Seek(nd, file location), SeekDir(nd, entry)
move "r/w head" to given location/entry

MemMap(nd, file range, mem range)
map the given range of file nd to given range of memory

MemUnmap(nd, file range, mem range)
unmap the given range of file nd from given range of memory

Sync(nd)
complete all pending io for open node nd
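
These per-node operations map closely onto the POSIX calls open/write/fsync/lseek/read/close (and mmap for MemMap). A minimal sketch, with /tmp/example.txt as an invented path:

/* Sketch: Open / Write / Sync / Seek / Read / Close via POSIX calls. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int nd = open("/tmp/example.txt", O_RDWR | O_CREAT, 0644); /* Open(path, access) */
    if (nd < 0) { perror("open"); return 1; }

    const char *msg = "hello, filesystem\n";
    write(nd, msg, strlen(msg));                 /* Write(nd, range, buffer)     */
    fsync(nd);                                   /* Sync(nd): flush pending io   */

    lseek(nd, 0, SEEK_SET);                      /* Seek(nd, location): to start */
    char buf[64];
    ssize_t n = read(nd, buf, sizeof buf - 1);   /* Read(nd, range, buffer)      */
    if (n >= 0) { buf[n] = '\0'; printf("read %zd bytes: %s", n, buf); }

    close(nd);                                   /* Close(nd) */
    return 0;
}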

Page 9: Operating Systems: Filesystems

Consistency of Shared Files fs interface

Shared file

one that is opened by several processes concurrently

Consistency of the shared file:

when does a write by one process become visible in the reads by the other processes

Various types of consistency (from strong to weak)

visible when write returns
visible when sync returns
visible when close returns

Single-processor system

all types of consistency easily achieved

Multi-processor system

strong notions are expensive/slow to achieve

Page 10: Operating Systems: Filesystems

Reliability fs interface

Disks, though non-volatile, are susceptible to failures

magnetic / mechanical / electronic parts wear out
power loss in the middle of a filesystem operation

Filesystem should work in spite of failures to some extent

Small disk-area failures

no loss/degradation

Large disk-area failures

filesystem remains a consistent subtree
eg, files may be lost, but no undetected corrupted files/links

Power outage

each ongoing filesystem operation remains atomic
ie, either completes or has no effect on filesystem

Page 11: Operating Systems: Filesystems

Outline storage devices

Filesystem Interface

Persistent Storage Devices

Filesystem Implementation

FAT Filesystem

FFS: Unix Fast Filesystem

NTFS: Microsoft Filesystem

Copy-On-Write: Slides from OS:P&P text

Page 12: Operating Systems: Filesystems

Disk Drive: Geometry storage devices

Platter(s) fixed to rotating spindle

Spindle speed

Platter has two surfaces

Surface has concentric tracks

Track has sectors

more in outer than inner

Sector: fixed capacity

Movable arm with fixed r/w heads, one per surface

Buffer memory

Eg, laptop disk (2011)

1 or 2 platters

4200-15000 rpm, 15-4 ms/rotation

diameter: 2.5 in

track width: < 1 micron

sector: 512 bytes

buffer: 16 MB

Page 13: Operating Systems: Filesystems

Disk Access Time storage devices

Disk access time = seek + rotation + transfer

Seek time: (moving rw head) + (electronic settling)

minimum: target is next track; settling only
maximum: target is at other end

rotation delay: half-rotational delay on avg

transfer time: platter ↔ buffer

transfer time: buffer ↔ host memory

Eg, laptop disk (2011)

min seek: 0.3-1.5 ms

max seek: 10-20 ms

rotation delay: 7.5-2 ms

platter ↔ buffer: 50 (inner) to 100 (outer) MB/s

buffer ↔ host memory: 100-300 MB/s
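
As a rough worked example with mid-range values from the laptop-disk numbers above (the figures are only illustrative), a random 4 KB read is dominated by seek and rotation rather than the transfer itself:

/* Back-of-the-envelope access time for one random 4 KB read. */
#include <stdio.h>

int main(void) {
    double seek_ms     = 8.0;                  /* between min (~1 ms) and max (~15 ms) seek */
    double rotation_ms = 4.0;                  /* about half a rotation at ~7200 rpm        */
    double transfer_ms = 4.0 / 75e3 * 1e3;     /* 4 KB at ~75 MB/s platter-to-buffer        */

    printf("approx access time: %.2f ms\n", seek_ms + rotation_ms + transfer_ms);
    return 0;                                  /* ~12 ms, almost all seek + rotation */
}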

Page 14: Operating Systems: Filesystems

Disk IO storage devices

Disk IO is in blocks (sectors): slower, more bursty (wrt memory)

Causes the filesystem to be implemented in blocks also

Disk geometry strongly affects filesystem layout

Disk blocks usually become corrupted/unusable over time

Filesystem layout must compensate for this

redundancy (duplication)
error-correcting codes
migration without stopping

Page 15: Operating Systems: Filesystems

Disk Scheduling storage devices

FIFO: terrible: lots of head movement

SSTF (shortest seek time first)

favors "middle" requests; can starve "edge" requests

SCAN (elevator)

sweep from inner to outer, until no requests in this direction
sweep from outer to inner, until no requests in this direction

CSCAN: like SCAN but in only one direction

fairer, less chance of sparsely-requested track

R-SCAN / R-CSCAN

allow minor deviations in direction to exploit rotational delays
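
A minimal sketch of the SCAN (elevator) order, assuming pending requests are identified only by track number; the request list and head position are invented:

/* Sketch: SCAN ordering: serve requests at or beyond the head while
   sweeping outward, then reverse and serve the remaining ones.       */
#include <stdio.h>
#include <stdlib.h>

static int cmp_asc(const void *a, const void *b) { return *(const int *)a - *(const int *)b; }

int main(void) {
    int requests[] = { 98, 183, 37, 122, 14, 124, 65, 67 };
    int n = sizeof requests / sizeof requests[0];
    int head = 53;                                    /* current head position */

    qsort(requests, n, sizeof(int), cmp_asc);
    printf("SCAN order from track %d:", head);
    for (int i = 0; i < n; i++)                       /* outward sweep */
        if (requests[i] >= head) printf(" %d", requests[i]);
    for (int i = n - 1; i >= 0; i--)                  /* reverse sweep */
        if (requests[i] < head) printf(" %d", requests[i]);
    printf("\n");
    return 0;
}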

Page 16: Operating Systems: Filesystems

Flash Devices storage devices

RAM with a floating insulated gate
No moving parts
more robust than disk to impacts
consumes less energy
uniform access time // good for random access

Wears out with repeated usage
Unused device can lose charge over time (years?)

Read/write operations
done in blocks (12KB to 128KB)
tens of microseconds // if erased area available for writes

Write requires erased area
Erase operation
erasure block (few MB)
tens of milliseconds
so maintain a large erased area

Page 17: Operating Systems: Filesystems

Outline fs implementation

Filesystem Interface

Persistent Storage Devices

Filesystem Implementation

FAT Filesystem

FFS: Unix Fast Filesystem

NTFS: Microsoft Filesystem

Copy-On-Write: Slides from OS:P&P text

Page 18: Operating Systems: Filesystems

Overview of Fs Implementation - 1 fs implementation

Define a mapping of filesystem to device blocks and implement filesystem operations

performance: random and sequential data access
reliability: in spite of device failures
accommodate different devices (blocks, size, speed, geometry, ...)

The term "filesystem" refers to two distinct entities

1. filesystem interface, ie, filesystem seen by users
2. filesystem implementation

Here are some abbreviations (used as qualifiers)

fs: filesystem
fs-int: filesystem interface // eg, fs-int node is a directory or file
fs-imp: filesystem implementation

Page 19: Operating Systems: Filesystems

Overview of Fs Implementation - 2 fs implementation

Fs implementation usually starts at the second block in the device

[The first device block is usually reserved for the boot sector]

Fs implementation organizes the device in fs blocks 0, 1, · · ·
Fs block = one or more device blocks, or vice versa

henceforth, "block" without qualifier means "fs block"

Fs implementation is a graph over blocks, rooted at a special block (the "superblock")

[In contrast, fs interface is a graph over files and directories]

Each file or directory x maps to a subgraph of the fs-imp graph, attached to the fs-imp graph at a "root" block

subgraph's blocks hold

x's data, x's metadata, pointers to the subgraph's blocks
pointers to the roots of other subgraphs if x is a directory

Page 20: Operating Systems: Filesystems

Overview of Fs Implementation - 3 fs implementation

[Figure: the fs interface (a graph over nodes, tree or acyclic, rooted at "/", with operations mount/unmount, create/link/unlink/delete, open/read/write/sync/close) maps to the fs implementation: a graph over fs blocks laid out on the device blocks, containing the boot block, the superblock, the "/" subgraph, a's subgraph, and so on; each fs block holds metadata, data, pointers to data, and, for a dir-node subgraph, pointers to other subgraphs, and the operations map to their implementations.]

Page 21: Operating Systems: Filesystems

Superblock fs implementation

First fs block read by OS when mounting a filesystem

Usually in the first few device blocks after the boot sector

Identifies the filesystem type and the info to mount it

magic number, indicating filesystem type
fs blocksize (vs device blocksize)
size of disk (in fs blocks)
pointer (block #) to the root of the "/" directory's subgraph
pointer to list of free blocks
(perhaps) pointer to array of roots of subgraphs
...
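
A sketch of what such a superblock might look like on disk; the field names and widths below are invented for illustration and do not match any particular filesystem:

/* Illustrative on-disk superblock layout (names/widths invented). */
#include <stdint.h>

struct superblock {
    uint32_t magic;             /* identifies the filesystem type               */
    uint32_t fs_block_size;     /* fs blocksize, in bytes                       */
    uint64_t fs_blocks;         /* size of the disk, in fs blocks               */
    uint64_t root_dir_block;    /* block # of the root of "/"'s subgraph        */
    uint64_t free_list_block;   /* block # of the head of the free-block list   */
    uint64_t node_table_block;  /* (perhaps) block # of array of subgraph roots */
};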

Page 22: Operating Systems: Filesystems

Subgraph for fs-int node (file or directory) fs implementation

Node name(s) // multiple names/paths due to links

Node metadata

size
owner, access, ...
creation time, ...
...

Pointers to fs blocks containing node's data

If node is a file, the data is the file's data

If node is a directory, the data is a table of directory entries

table may be unordered, sorted, hashed, ...
depends on number and size of entries, desired performance, ...

A directory entry points to the entry's subgraph

Page 23: Operating Systems: Filesystems

Data Pointers Organization - 1 fs implementation

Organization of the pointers from subgraph root to data blocks

Contiguous

data is placed in consecutive fs blocks
subgraph root has starting fs block and data size
pros: random access
cons: external fragmentation; poor tolerance to bad blocks

Linked-list

each fs block of data has a pointer to the next fs block of data (or nil, if no more data)
subgraph root has starting fs block and data size
pros: no fragmentation; good tolerance to bad blocks
cons: poor random access

Page 24: Operating Systems: Filesystems

Data Pointers Organization - 2 fs implementation

Separate linked-list

keep all the pointers (w/o data) in one or more fs blocks
first load those fs blocks into memory
then the fs block for any data chunk is located by traversing a linked list in memory (rather than on disk)
pros: no fragmentation; good tolerance to bad blocks; good random access (as long as file size is not too big)

Indexed

have an array of pointers
have multiple levels of pointers to accommodate large files

ie, a pointer points to a fs block of pointers

pros: no fragmentation; good tolerance to bad blocks; good random access (logarithmic cost in data size)
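
A minimal sketch of indexed lookup with two levels of pointers: translating a byte offset into a data block number costs two block reads regardless of where in the file the offset lies; read_block() and the layout constants are hypothetical:

/* Sketch: offset -> data block # through a two-level index.
   read_block() is a hypothetical block-cache helper.         */
#include <stdint.h>

#define BLOCK_SIZE     4096
#define PTRS_PER_BLOCK (BLOCK_SIZE / sizeof(uint32_t))

extern void read_block(uint32_t block_no, void *buf);

/* top_index: block # of the first-level pointer block (stored in the
   subgraph root); returns the data block holding byte 'offset'.       */
uint32_t lookup_block(uint32_t top_index, uint64_t offset) {
    uint32_t ptrs[PTRS_PER_BLOCK];
    uint64_t blk = offset / BLOCK_SIZE;               /* logical data block #        */

    read_block(top_index, ptrs);                      /* level 1: ptrs to ptr blocks */
    read_block(ptrs[blk / PTRS_PER_BLOCK], ptrs);     /* level 2: ptrs to data       */
    return ptrs[blk % PTRS_PER_BLOCK];
}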

Page 25: Operating Systems: Filesystems

For Good Performance fs implementation

Want large numbers of consecutive reads or writes

So put related info in nearby blocks

Large buffers (to minimize read misses)

Batched writes: need a large queue of writes

Page 26: Operating Systems: Filesystems

Reliability: overcoming block failures fs implementation

Redundancy within a disk

error-detection/correction code (EDC/ECC) in each disk block
map each fs block to multiple disk blocks
dynamically remap within a disk to bypass failing areas

Redundancy across disks

map each fs block to blocks in different disks
EDC/ECC in each disk block
or EDC/ECC for a fs block across disks // eg, RAID
· · ·

Page 27: Operating Systems: Filesystems

Reliability: Atomicity via Copy-On-Write fs implementation

When a user modifies a file or directory f

identify the part, say X, of the fs-imp graph to be modified
write the new value of X in fresh blocks, say Y
attach Y to the fs-imp graph // step with low prob of failure
detach X and garbage collect its blocks

note that X typically includes part of the path(s) from the fs-imp root to f's subgraph

Consider a path a0, · · · , aN, where a0 is the fs-imp root and aN holds data modified by the user

Then for some point aJ in the path

the new values of the suffix aJ, · · · , aN are in new blocks
the new values of the prefix a0, · · · , aJ−1 are in the old blocks

Page 28: Operating Systems: Filesystems

Reliability: Atomicity via Logging fs implementation

Maintain log of requested operations

When user issues a sequence of disk operations

add records (one for each operation) to the log
add a "commit" record after the last operation
asynchronously write them to disk
erase from the log after completion

Upon recovery from crash, (re)do all operations in log

operations are idempotent (repeating is ok)
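
A hedged sketch of such redo logging, assuming each log record describes a whole-block write; the record layout and the write_block()/log_read_next() helpers are invented for illustration:

/* Sketch: redo logging of whole-block writes. */
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE 4096
#define MAX_BATCH  64

struct log_record {
    uint64_t target_block;        /* where the data eventually goes */
    uint8_t  data[BLOCK_SIZE];    /* new contents of that block     */
};

/* hypothetical helpers supplied by the block/log layers */
extern void write_block(uint64_t block_no, const void *buf);
extern bool log_read_next(struct log_record *rec, bool *is_commit);

/* Recovery: (re)do every batch of logged writes that reached its "commit"
   record; an uncommitted tail (crash mid-batch) is dropped.  Redoing a
   block write is idempotent, so replaying after repeated crashes is safe. */
void recover(void) {
    static struct log_record batch[MAX_BATCH];
    int pending = 0;
    struct log_record rec;
    bool is_commit;

    while (log_read_next(&rec, &is_commit)) {
        if (is_commit) {
            for (int i = 0; i < pending; i++)
                write_block(batch[i].target_block, batch[i].data);
            pending = 0;
        } else if (pending < MAX_BATCH) {
            batch[pending++] = rec;     /* buffer until the commit record */
        }
    }
}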

Page 29: Operating Systems: Filesystems

Hierarchy in Filesystem fs implementation

Virtual filesystem: optional

memory-only framework on which to mount real filesystems

Mounted Filesystem(s)

real filesystems, perhaps of different types

Block cache

cache filesystem blocks: performance, sharing, ...

Block device

wrapper for the various block devices with filesystems

Device drivers for the various block devices

Page 30: Operating Systems: Filesystems

GeekOS: Hierarchy in Filesystem fs implementation

Page 31: Operating Systems: Filesystems

Outline FAT

Filesystem Interface

Persistent Storage Devices

Filesystem Implementation

FAT Filesystem

FFS: Unix Fast Filesystem

NTFS: Microsoft Filesystem

Copy-On-Write: Slides from OS:P&P text

Page 32: Operating Systems: Filesystems

FAT32 Overview - 1 FAT

FAT: MS-DOS filesystem
simple, but no good for scalability, hard links, reliability, ...
currently used only on simple storage devices: floppy, flash, ...

Disk divided into following regions
Boot sector: device block 0
BIOS parameter block
OS boot loader code

Filesystem info sector: device block 1
signatures, fs type, pointers to other sections
fs blocksize, # free fs blocks, # last allocated fs block
...

FAT: fs blocks 0 and 1; corresponds to the superblock

Data region: rest of the disk, organized as an array of fs blocks
holds the data of the fs-int nodes (ie, files and directories)

Page 33: Operating Systems: Filesystems

FAT32 Overview - 2 FAT

Each block in the data region is either free or bad or holds data (of a file or directory)

FAT: array with an entry for each block in data region

entries j0, j1, · · · form a chain iff blocks j0, j1, · · · hold successive data (of a fs-int node)

Entry n contains

constant, say FREE, if block n is free

constant, say BAD, if block n is bad (ie, unusable)

32-bit number, say x, if block n holds data (of a fs-int node) and block x holds the succeeding data (of the fs-int node)

constant, say END, if block n holds the last data chunk

Root directory table: typically at start of data region (block 2)
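
A minimal sketch of following a FAT chain to locate the n-th data block of a node, assuming the FAT has already been loaded into memory; the FREE/BAD/END constants are illustrative rather than the exact FAT32 encodings:

/* Sketch: walk a FAT chain from a node's starting block. */
#include <stdint.h>

#define FAT_FREE 0x00000000u
#define FAT_BAD  0x0FFFFFF7u
#define FAT_END  0x0FFFFFFFu          /* "last chunk" marker (illustrative) */

extern uint32_t fat[];                /* FAT in memory, one entry per block */

/* returns the block # of data chunk n, or FAT_END if the chain is shorter */
uint32_t fat_nth_block(uint32_t start_block, uint32_t n) {
    uint32_t blk = start_block;
    while (n-- > 0) {
        uint32_t next = fat[blk];
        if (next == FAT_END || next == FAT_FREE || next == FAT_BAD)
            return FAT_END;           /* chain ends (or is corrupt) early */
        blk = next;
    }
    return blk;
}

Note that reaching chunk n costs n FAT lookups, which is cheap once the whole FAT is cached in memory.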

Page 34: Operating Systems: Filesystems

FAT32 Overview - 3 FAT

Directory entry: 32 bytes

name (8)
extension (3)
attributes (1)

read-only, hidden, system, volume label, subdirectory, archive, device

reserved (10)
last modification time (2)
last modification date (2)
fs block # of starting fs block of the entry's data (2)
size of entry's data (4)

Hard links??
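
The 32-byte layout above maps naturally onto a packed C struct; this sketch follows the classic layout in which the starting-block field is 2 bytes (FAT32 actually splits the starting block number across two fields):

/* Classic 32-byte FAT directory entry (simplified sketch). */
#include <stdint.h>

#pragma pack(push, 1)
struct fat_dirent {
    char     name[8];        /* name, space padded             */
    char     ext[3];         /* extension                      */
    uint8_t  attributes;     /* read-only, hidden, system, ... */
    uint8_t  reserved[10];
    uint16_t mtime;          /* last modification time         */
    uint16_t mdate;          /* last modification date         */
    uint16_t start_block;    /* first fs block of the data     */
    uint32_t size;           /* size of the data, in bytes     */
};                           /* 8+3+1+10+2+2+2+4 = 32 bytes    */
#pragma pack(pop)

Since the entry itself holds the metadata and the only pointer to the data chain, and there is no reference count, there is no natural way to support hard links.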

Page 35: Operating Systems: Filesystems

Outline FFS

Filesystem Interface

Persistent Storage Devices

Filesystem Implementation

FAT Filesystem

FFS: Unix Fast Filesystem

NTFS: Microsoft Filesystem

Copy-On-Write: Slides from OS:P&P text

Page 36: Operating Systems: Filesystems

FFS Layout FFS

Boot blocks // few blocks at start

Superblock // after boot blocks

magic number
filesystem geometry (eg, locations of groups)
filesystem statistics/tuning params

Groups, each consisting of // cylinder groups

backup copy of superblock
header with statistics
free space bit map
...
array of inodes

holds metadata and data pointers of fs-int nodes
# inodes fixed at format time

array of data blocks

Page 37: Operating Systems: Filesystems

FFS Inodes FFS

Inodes are numbered sequentially starting at 0

inodes 0 and 1 are reserved
inode 2 is the root directory's inode

An inode is either free or not

Non-free inode holds metadata and data pointers of a fs-int node

owner id
type (directory, file, device, ...)
access modes
reference count // # of hard links
size and # blocks of data
15 pointers to data blocks

12 direct
1 single-indirect, 1 double-indirect, 1 triple-indirect
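
A trimmed-down sketch of an FFS-style on-disk inode with the fields listed above; names and widths are simplified rather than the exact BSD definitions:

/* Illustrative FFS-style on-disk inode (simplified). */
#include <stdint.h>

#define NDIRECT   12     /* direct data-block pointers        */
#define NINDIRECT  3     /* single-, double-, triple-indirect */

struct dinode {
    uint16_t mode;                  /* type + access bits (rwx, setuid, ...) */
    uint16_t nlink;                 /* reference count: # of hard links      */
    uint32_t uid;                   /* owner id                              */
    uint64_t size;                  /* size of data, in bytes                */
    uint32_t atime, mtime, ctime;   /* access / modification / change times  */
    uint32_t direct[NDIRECT];       /* blocks 0..11 of the data              */
    uint32_t indirect[NINDIRECT];   /* single, double, triple indirect       */
};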

Page 38: Operating Systems: Filesystems

FFS: fs-int node → asymmetric tree (max depth 4) FFS

[Figure: an inode's metadata and its 15 pointers form an asymmetric tree over the data blocks: the 12 direct pointers reach data blocks immediately, while the single-, double-, and triple-indirect pointers go through 1, 2, or 3 levels of indirect blocks, giving a maximum depth of 4.]

Page 39: Operating Systems: Filesystems

FFS Directory entries FFS

The data blocks of a directory node hold directory entries

A directory entry is not split across data blocks

Directory entry −→ inode −→ data/pointer blocks

Directory entry for a fs-int node

# of the inode of the fs-int node // hard link
length of this directory entry's record
length of node name (up to 255 bytes)
entry name

Multiple directory entries can point to the same inode
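
A sketch of a BSD-style variable-length directory entry (cf. struct direct in UFS); field names are simplified:

/* Sketch: FFS directory entry; variable length, never split across blocks. */
#include <stdint.h>

struct ffs_dirent {
    uint32_t inode;      /* # of the inode this entry points to (hard link) */
    uint16_t reclen;     /* total length of this entry record               */
    uint8_t  type;       /* file, directory, device, ...                    */
    uint8_t  namelen;    /* length of the name (up to 255)                  */
    char     name[];     /* entry name                                      */
};

Because the entry stores only an inode number, several entries (in the same or different directories) can name the same inode; the inode's reference count tracks how many do.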

Page 40: Operating Systems: Filesystems

User Ids FFS

Every user account has a user id (uid)

Root user (aka superuser, admin) has uid of 0

Processes and �lesystem entries have associated uids

indicates owners
determines access processes have to filesystem entries
determines which processes can be signalled by a process

Page 41: Operating Systems: Filesystems

Process Ids FFS

Every process has two associated uids

effective user id (euid)

uid of user on whose behalf it is currently executing
determines its access to filesystem entries

real uid (ruid)

uid of the process's owner
determines which processes it can signal:
x can signal y only if x is superuser or x.ruid = y.ruid

Page 42: Operating Systems: Filesystems

Process Ids FFS

Process is created: ruid/euid ← creating process's euid

Process with euid 0 executes SetUid(z): ruid/euid ← z
no effect if process has non-zero euid

Example SetUid usage

login process has euid 0 (to access auth info files)
upon successful login, it starts a shell process (with euid 0)
the shell executes SetUid(authenticated user's uid)

When a process executes a file f with the "setuid bit" set:
its euid is set to f's owner's uid while it is executing f.

Upon bootup, the first process ("init") runs with uid of 0

it spawns all other processes directly or indirectly
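
A minimal sketch of the login-style pattern above: a root process drops to the authenticated user's uid with setuid() and then execs a shell; authentication is omitted and uid 1000 is an invented example:

/* Sketch: root-owned login-like process dropping privileges. */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    uid_t user = 1000;                 /* uid of the authenticated user (example) */

    if (setuid(user) != 0) {           /* only changes ruid/euid while euid == 0  */
        perror("setuid");
        return 1;
    }
    /* ruid and euid are now both 'user'; root privileges are gone for good */
    execl("/bin/sh", "sh", "-l", (char *)NULL);   /* start the user's shell */
    perror("execl");
    return 1;
}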

Page 43: Operating Systems: Filesystems

Directory entry's uids and permissions FFS

Every directory entry has three classes of users:
owner (aka "user")
group (owner need not be in this group)
others (users other than owner or group)

Each class's access is defined by three bits: r, w, x

For a file:
r: read the file
w: modify the file
x: execute the file

For a directory:
r: read the names (but not attributes) of entries in the directory
w: modify entries in the directory (create, delete, rename)
x: access an entry's contents and metainfo

When a directory entry is created: attributes are set according to the creating process's attributes (euid, umask, etc)
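
For example, the per-class bits can be read back with stat(2); a minimal sketch, with an invented path:

/* Sketch: print the owner/group/others permission bits of a path. */
#include <stdio.h>
#include <sys/stat.h>

int main(void) {
    struct stat st;
    if (stat("/tmp/example.txt", &st) != 0) { perror("stat"); return 1; }

    printf("owner: %c%c%c  group: %c%c%c  others: %c%c%c\n",
           (st.st_mode & S_IRUSR) ? 'r' : '-',
           (st.st_mode & S_IWUSR) ? 'w' : '-',
           (st.st_mode & S_IXUSR) ? 'x' : '-',
           (st.st_mode & S_IRGRP) ? 'r' : '-',
           (st.st_mode & S_IWGRP) ? 'w' : '-',
           (st.st_mode & S_IXGRP) ? 'x' : '-',
           (st.st_mode & S_IROTH) ? 'r' : '-',
           (st.st_mode & S_IWOTH) ? 'w' : '-',
           (st.st_mode & S_IXOTH) ? 'x' : '-');
    return 0;
}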

Page 44: Operating Systems: Filesystems

Directory entry's setuid bit FFS

Each directory entry also has a "setuid" bit.

If an executable file has setuid set and a process (with execute access) executes it, the process's euid changes to the file's owner's uid while executing the file.

Typically, the executable file's owner is root, allowing a normal user to get root privileges while executing the file

This is a high-level analog of system calls

Page 45: Operating Systems: Filesystems

Directory entry's sticky bit FFS

Each directory entry also has a sticky bit.

Executable file with sticky bit set: hint to the OS to retain the text segment in swap space after the process executes

An entry x in a directory with sticky bit set:

a user with wx access to the directory can rename/delete an entry x in the directory only if it is x's owner (or superuser)
Usually set on the /tmp directory.

Page 46: Operating Systems: Filesystems

Directory entry's setgid bit FFS

Unix has the notion of groups of users

A group is identified by a group id, abbreviated gid

A gid defines a set of uids

A user account can be in different groups, i.e., have multiple gids

Process has effective gid (egid) and real gid (rgid)

play a similar role as euid and ruid

A directory entry has a setgid bit

plays a similar role to setuid for executables
plays an entirely different role for directories

Page 47: Operating Systems: Filesystems

Outline NTFS

Filesystem Interface

Persistent Storage Devices

Filesystem Implementation

FAT Filesystem

FFS: Unix Fast Filesystem

NTFS: Microsoft Filesystem

Copy-On-Write: Slides from OS:P&P text

Page 48: Operating Systems: Filesystems

NTFS Index Structure NTFS

Master File Table (MFT)
corresponds to FFS inode array
holds an array of 1 KB MFT records

MFT Record: sequence of variable-size attribute records

Std info attribute: owner id, creation/mod/... times, security, ...

File name attribute: file name and number

Data attribute record
data itself (if small enough), or // resident
list of data "extents" (if not small enough) // non-resident

Attribute list
pointers to attributes in this or other MFT records
pointers to attribute extents
needed if attributes do not fit in one MFT record
eg, highly-fragmented and/or highly-linked

Page 49: Operating Systems: Filesystems

Example: Files with single MFT record NTFS

[Figure: two files, each described by a single MFT record in the Master File Table. File A's record holds std info, file name, and the data itself (resident). File B's record holds std info, file name, and a non-resident data attribute listing two data extents, each given by a start block and a length; the rest of each record is free space.]

Page 50: Operating Systems: Filesystems

Example: File with three MFT records NTFS

[Figure: File C spans three MFT records. The base record holds std info and an attribute list whose pointers locate the file-name and data attributes stored in the two other records; each of those data attributes points to its own set of data extents.]

Page 51: Operating Systems: Filesystems

Outline COW

Filesystem Interface

Persistent Storage Devices

Filesystem Implementation

FAT Filesystem

FFS: Unix Fast Filesystem

NTFS: Microsoft Filesystem

Copy-On-Write: Slides from OS:P&P text

[Pages 52-58: Copy-On-Write slides reproduced from the OS:P&P text (figures only).]