34
CSC 660: Advanced Operating Systems Slide #1 CSC 660: Advanced OS Filesystem Implementation

CSC 660: Advanced OS

  • Upload
    brand

  • View
    38

  • Download
    0

Embed Size (px)

DESCRIPTION

CSC 660: Advanced OS. Filesystem Implementation. Topics. Disks Ext2 Filesystem Layout Inode Allocation Block Addressing Block Allocation e2fsck Journaling Stackable Filesystems. Filesystem Layering. Hard Drive Components. Hard Drive Components. Actuator - PowerPoint PPT Presentation

Citation preview

Page 1: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #1

CSC 660: Advanced OS

Filesystem Implementation

Page 2: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #2

Topics

1. Disks

2. Ext2 Filesystem Layout

3. Inode Allocation

4. Block Addressing

5. Block Allocation

6. e2fsck

7. Journaling

8. Stackable Filesystems

Page 3: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #3

Filesystem Layering

Page 4: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #4

Hard Drive Components

Page 5: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #5

Hard Drive ComponentsActuator

Moves arm across disk to read/write data.Arm has multiple read/write heads (often 2/platter.)

PlattersRigid substrate material + magnetic coating.Divided into many concentric tracks.

Spindle MotorSpins platters from 3600-15,000 rpm.Speed determines disk latency.

Cache2-16MB of cache memoryReliability: write-back vs. write-through

Page 6: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #6

Disk Information: hdparm# hdparm -i /dev/hde

/dev/hde:

Model=WDC WD1200JB-00CRA1, FwRev=17.07W17, SerialNo=WD-WMA8C4533667 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq } RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=40 BuffType=DualPortCache, BuffSize=8192kB, MaxMultSect=16, MultSect=off CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=234441648 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio1 pio2 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5 AdvancedPM=no WriteCache=enabled Drive conforms to: device does not report version:

* signifies the current active mode

Page 7: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #7

Disk Performance

Seek TimeTime to move head to desired track (3-8 ms)

Rotational DelayTime until head over desired block (8ms for 7200)

LatencySeek Time + Rotational Delay

ThroughputData transfer rate (20-80 MB/s)

Page 8: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #8

Latency vs. Throughput

Which is more important?Depends on the type of load.

Sequential access – ThroughputMultimedia on a single user PC

Random access – LatencyMost servers

How to improve performance Faster disks CachingMore spindles (disks).More disk controllers.

Page 9: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #9

Ext2 Disk Data Structures

Page 10: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #10

Block Groups

• Each block group contains– Block bitmap– Inode bitmap– Inode blocks– Data blocks

• How many blocks in a group?– Bitmaps are only 1 block in size.– Block bitmap can map 8 x blocksize blocks.– 4Kbyte blocks => # data blocks = 32K (128M)

Page 11: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #11

Inode Table

• Consecutive set of inode blocks.– Inodes are 128 bytes in size.

– A 4K block contains 32 inodes.

• Extended Inode Attributes– Problem: inodes are fixed size, must be 2n

– Solution: i_file_acl attribute points to non-inode block containing extended attributes (immutable bit, ACLs)

– System calls: setxattr(), getxattr()– ACLs are most common application and have their own

system calls: setfacl(), getfacl()

Page 12: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #12

Disk Block Usage• Regular files

– Zero-size files use no blocks.

• Symlinks– Pathnames < 60 chars stored in i_block field of inode.

• Directories– Format:

• Dev/IPC Files– No data

blocks

Page 13: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #13

Creating an ext2 Filesystem

1. Initializes superblock + group descriptors.2. Checks for bad blocks and creates list.3. For each block group

1. Reserves space: super,desc,inodes,bitmaps.2. Initializes inode and block bitmaps to zero.3. Initializes inode table.

4. Creates root (/) directory.5. Creates lost+found directory for e2fsck.6. Updates bitmaps with two directories.7. Groups bad blocks in lost+found.

Page 14: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #14

Managing Disk Space

• How to avoid file fragmentation? – If file blocks aren’t contiguous on disk,

expensive seeks are required

• How to access blocks quickly? – The kernel should be able to quickly convert file

offset into a logical block number on disk with few disk accesses.

Page 15: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #15

Creating Inodes

1. Allocate VFS inode with new_inode()

2. If inode is a dir, find a suitable block group– If subdirectory of /, find block group with above

average free inodes + free blocks.

– If not subdir of /, use block group of parent dir if• Group does not have too many directories.

• Group has enough free inodes.

• Group has small “debt” value (+dirs, -files)

– Else use 1st block group with free inodes > avg.

Page 16: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #16

Creating Inodes

3. If new inode not a directory– Logarithmic search for free inodes starting with block

group of parent directory.

– Ex: searches i % n, (i+1)%n, (i+1+2)%n

– If log search fails, perform linear search.

4. Reads bitmap of selected block group.– Searches for 1st unused bit to get inode #.

5. Allocates disk inode.– Sets bit, marks inode bitmap block dirty.

– Sets inode fields and writes to disk.

Page 17: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #17

Data Block Addressing• Block Numbers

– File: relative position of block within file.• File Offset -> File Block: (int) (offset / blocksize).

– Logical: position within disk partition.• File Block -> Logical Block: use inode to translate.

• Inodes– Direct blocks: 12 logical block numbers (48K)– Indirect: points to block of block #s (4M)– Double-indirect (4G)– Triple-indirect (4T)

Page 18: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #18

Inode Block Addressing

Page 19: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #19

File Holes

• Portion of file not stored on disk.– Can contain only null bytes.– echo –n “x” | dd of=/tmp/hole bs=1024 seek=6– Used by databases and similar hashing apps.

• How big is a file with a hole?– i_size includes null bytes in hole.– i_blocks stores data blocks actually used.

Page 20: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #20

File Holes

Page 21: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #21

Allocating Data Blocks

• Goal parameter– Preferred logical block number.– If prev 2 blocks consecutive, goal = prev+1– Else if at least 1 block alloc, goal = prev– Else goal = first logical block of group

• ext2_alloc_block()– If goal block pre-allocated to file, allocates.– Else, discards remaining pre-allocated and

invokes ext2_new_block().

Page 22: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #22

ext2_new_block()

• If goal is free, allocates.

• Otherwise checks for nearby free blocks.

• If no nearby free blocks, checks all groups– Starts with block group of goal block.– Searches for group of 8+ adjacent free blocks.– If no such group, looks for single free block.

• Will allocate up to 8 adjacent free blocks.

Page 23: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #23

e2fsck1. Checks validity of all inodes.

Is file mode valid? Are blocks valid?Are blocks used by multiple inodes?

2. Checks validity of all directories.Valid format? Do all entries refer to inodes from 1?

3. Checks directory connectivity.Is there a path from / to each directory?

4. Checks inode reference counts.Compares link counts with values calculated in 1+2.Moves undeleted 0 link count files to /lost+found.

5. Checks filesystem summary validity.Do on-disk inode/block bitmaps match e2fsck ones?

Page 24: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #24

ext3 = ext2 + journaling

• ext3 adds a journal to the filesystem.– Journal (log) does sequential writes.– Just blocks, no inodes, bitmaps, etc.– Kernel thread writes log blocks to ext2 format.– Why? Eliminate need for e2fsck after crash.

Page 25: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #25

Journaling

Perform system call-level changes by:1. Write blocks to journal.

2. Wait for write to be committed to journal.

3. Write blocks to filesystem.

4. Discard blocks from journal.

Page 26: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #26

System Failure Resolution

• Failure before journal commit– Ignore missing or incomplete journal blocks.– Change is lost, but filesystem is consistent.

• Failure after commit– Journal blocks are written to filesystem.

Page 27: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #27

Journal Types

• Journal– All data and metadata logged to journal.

– Safest and slowest ext3 mode.

• Ordered (default)– Only metadata changes are logged.

– Ensures data blocks written before metadata.

– Guarantees writes that enlarge are safe.

• Writeback– Only metadata logged, no re-ordered.

– Fastest and least safe ext3 mode.

Page 28: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #28

Stackable Filesystems• Filesystems useful for enhancing OS.

– File encryption.– Secure deletion.– Virus detection.– File versioning.– UnionFS.

• But, filesystems are difficult to develop.– 10,000+ lines of C code is typical.– Most of which you don’t want to change.

Page 29: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #29

Stackable Filesystems

• Solution #1– Copy ext3fs + add your code.– Problem: maintenance, keeping up with ext3.

• Solution #2– Add a layer of indirection: stackable filesystems.

Page 30: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #30

Stackable Filesystems

Page 31: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #31

File Data API

• encode_data– Called by write calls before data sent to lower-

level filesystem.

• decode_data– Called by read calls after data received from

lower-level filesystem.

• Arguments– I/O blocks: cannot change size.– File attributes, user credentials.

Page 32: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #32

Filename API

• encode_filename– Modifies filename from user system call that is

sent to lower-level filesystem.

• decode_filename– Modifies filename from filesystem before

returning to user.

• Arguments– Filenames: can change length, but no invalid chars– File’s vnode, user credentials.

Page 33: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #33

File Attributes API

• No specific API.

• Must modify wrapfs calls directly.

Page 34: CSC 660: Advanced OS

CSC 660: Advanced Operating Systems Slide #34

References1. Daniel P. Bovet and Marco Cesati, Understanding the Linux Kernel,

3rd edition, O’Reilly, 2005.2. Remy Card, Theodore T’so, Stephen Tweedie, “Design and

Impementation of the Second Extended Filesystem,” http://web.mit.edu/tytso/www/linux/ext2intro.html, 1994.

3. Robert Love, Linux Kernel Development, 2nd edition, Prentice-Hall, 2005.

4. Claudia Rodriguez et al, The Linux Kernel Primer, Prentice-Hall, 2005.

5. Mendel Rosenblum and John K. Osterhout, “The Design and Implementation of a Log-structured Filesystem,” 13th ACM SOSP, 1991.

6. Andrew S. Tanenbaum, Modern Operating Systems, 3rd edition, Prentice-Hall, 2005.

7. Erek Zadok and Jason Nieh, “FIST: A Language for Stackable Filesystems,” http://www.filesystems.org/docs/fist-lang/fist.pdf, 2000.