Sogang University

Advanced Operating Systems (UNIX & Linux File System)

Sang Gue Oh, Ph.D.

Email: [email protected]

UNIX File System

File System Framework

[Figure: layered file system framework. User level: Directory API and File Operations API (file system interface; labels "text name" and "file id"). Kernel level, below the system call interface: Directory Service, File Storage Service, and Device Driver (file system implementation). Devices: hard disk, floppy disk, CD-ROM.]


Two UNIX File Systems (Local File System)

System V File System
  Original file system for UNIX.
  All versions of System V as well as several commercial UNIX systems support this file system.

FFS (Fast File System)
  Introduced by Berkeley UNIX in release 4.2 BSD.
  Provides better performance, robustness, and functionality.
  Gained wide commercial acceptance; SVR4 includes this file system.
  After integration with VFS (Virtual File System), this file system is known as the UNIX file system (ufs).


Inode (Index Node)

Contains a description of the disk layout of the file data and other administrative information such as the file owner, access permissions, file size, block size, and access times.

Every file has one inode associated with it (the internal representation of a file), and the inode number is unique within its file system.

When a process refers to a file by name, the kernel maps the name to an inode for further operations.

Two different structures.
  On-disk inode: stored on the disk.
  In-core inode: maintained in the kernel after being read from the disk. Has more fields in addition to the fields of the disk inode.


Block Addressing Using Inode


Example - Accessing the Block

Block size = 1024 bytes; the inode holds 10 direct pointers and 1 each of single, double, and triple indirect pointers.
Case 1: access byte offset 9000 of a file.
Case 2: access byte offset 350,000 of a file.

[Figure: inode block pointers and the blocks they reference. Values legible in the figure include direct entries 4096, 228, ..., 367 (entry 8); single indirect = 428; double indirect = 9156; triple indirect = 824.]

Case 1: 9000 - 1024 * 8 = 808, so the byte is the 808th byte of the data block named by direct entry 8 (block 367).

Case 2: the double indirect range starts at 10K + 256K = 272,384. 350,000 - 272,384 = 77,616 = 75 * 1024 + 816, so the kernel follows the double indirect pointer (block 9156), then entry 0 of that block (block 331), then entry 75 of block 331 (block 3333), and reads the 816th byte of data block 3333.
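The two cases on this slide can be checked mechanically. The following is a minimal sketch, not kernel code: the type and function names (`block_path`, `map_offset`) are made up for illustration, and the 4-byte pointer size is taken from the maximum file size example, which gives the 256 pointers per indirect block implied by the "256K" term.

```c
#include <stdint.h>

/* Which class of inode pointer starts the lookup. */
enum ptr_level { LVL_DIRECT, LVL_SINGLE, LVL_DOUBLE, LVL_TRIPLE, LVL_TOO_BIG };

struct block_path {
    enum ptr_level level;   /* inode pointer class used */
    uint64_t index[3];      /* index followed at each step (direct slot for LVL_DIRECT) */
    uint32_t byte_in_block; /* offset inside the final data block */
};

/* Translate a byte offset into the chain of indexes the kernel would follow. */
struct block_path map_offset(uint64_t offset, uint32_t block_size,
                             uint32_t ndirect, uint32_t ptr_size)
{
    struct block_path p = { LVL_TOO_BIG, {0, 0, 0}, 0 };
    uint64_t n = block_size / ptr_size;   /* pointers per indirect block */
    uint64_t lbn = offset / block_size;   /* logical block number in the file */

    p.byte_in_block = (uint32_t)(offset % block_size);

    if (lbn < ndirect) {
        p.level = LVL_DIRECT;
        p.index[0] = lbn;
        return p;
    }
    lbn -= ndirect;
    if (lbn < n) {
        p.level = LVL_SINGLE;
        p.index[0] = lbn;
        return p;
    }
    lbn -= n;
    if (lbn < n * n) {
        p.level = LVL_DOUBLE;
        p.index[0] = lbn / n;   /* entry in the double indirect block */
        p.index[1] = lbn % n;   /* entry in the single indirect block */
        return p;
    }
    lbn -= n * n;
    if (lbn < n * n * n) {
        p.level = LVL_TRIPLE;
        p.index[0] = lbn / (n * n);
        p.index[1] = (lbn / n) % n;
        p.index[2] = lbn % n;
        return p;
    }
    return p;                   /* offset beyond the maximum file size */
}
```

With the slide's parameters, `map_offset(9000, 1024, 10, 4)` selects direct slot 8 at byte 808, and `map_offset(350000, 1024, 10, 4)` selects the double indirect pointer with indexes 0 and 75 at byte 816.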


Example - Maximum File Size

Total size = (# of direct block pointers * block size) + (# of block pointers in one block * block size) + ((# of block pointers in one block)**2 * block size) + ((# of block pointers in one block)**3 * block size)

Case: block size = 1024 bytes (1 Kbyte), pointer size = 4 bytes, 10 direct, 1 indirect, 1 double indirect, and 1 triple indirect pointers.
Maximum file size = (10 * 1024) + (1024/4 * 1024) + ((1024/4)**2 * 1024) + ((1024/4)**3 * 1024) = 1024 * (10 + 256 + 256**2 + 256**3) bytes (about 16 Gbytes).

Question: increase the block size, or add one more level of indirection (quadruple indirect)?
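The formula can be evaluated directly. This is a small sketch; the function name `max_file_size` is hypothetical, and it assumes one pointer each of single, double, and triple indirection, as in the slide's case.

```c
#include <stdint.h>

/* Maximum file size in bytes for an inode with `ndirect` direct pointers
 * and one single, one double, and one triple indirect pointer. */
uint64_t max_file_size(uint64_t block_size, uint64_t ptr_size, uint64_t ndirect)
{
    uint64_t n = block_size / ptr_size;   /* block pointers in one block */
    return block_size * (ndirect + n + n * n + n * n * n);
}
```

Calling the function with a 4096-byte block instead of 1024 bytes is one way to explore the block size question: a larger block increases both the number of pointers per block and the data each pointer covers, so the limit grows by a factor of roughly 256.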


Directory Structure (System V Case)

Inode Number (2 bytes)   File Name (14 bytes)
73                       .
38                       ..
9                        file1
0                        deleted file
110                      subdirectory1
65                       archana

The 2-byte inode number (16 bits) restricts the maximum number of files in the partition to 65535. The maximum file name length is 14 characters.
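The fixed 16-byte entry described on this slide maps naturally onto a C structure. The sketch below is illustrative only: the names `sysv_dirent` and `sysv_lookup` are made up, and treating inode number 0 as a deleted entry follows the "deleted file" row in the table.

```c
#include <stdint.h>
#include <string.h>

#define SYSV_NAME_LEN 14

/* One directory entry: 2-byte inode number + 14-byte name = 16 bytes. */
struct sysv_dirent {
    uint16_t d_ino;                 /* 0 marks an unused (deleted) entry */
    char     d_name[SYSV_NAME_LEN]; /* NUL-padded, not necessarily NUL-terminated */
};

/* Linear search of a directory; returns the inode number or 0 if not found. */
uint16_t sysv_lookup(const struct sysv_dirent *dir, size_t nentries,
                     const char *name)
{
    for (size_t i = 0; i < nentries; i++) {
        if (dir[i].d_ino == 0)
            continue;               /* skip deleted entries */
        /* Only the first 14 characters are significant. */
        if (strncmp(dir[i].d_name, name, SYSV_NAME_LEN) == 0)
            return dir[i].d_ino;
    }
    return 0;
}
```

Because entries have a fixed size, a directory is simply an array, which is easy to scan but caps names at 14 characters.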


File System Layout (System V Case)

A file system resides on a single logical disk or partition, and each logical disk may hold at most one file system.

A partition is viewed as a linear array of blocks. The size of a block is a multiple of 512 bytes (e.g., 512/1024/2048). This represents the granularity of space allocation for a file.

The physical block number is an index into this array, which the device driver translates into cylinder, track, and sector numbers.

B (Boot Area): only one partition needs to contain it.
S (Super Block): contains metadata about the file system.
Inode List: has a fixed size, which limits the maximum number of files. The size of an inode is 64 bytes in System V UNIX.

< Layout of a disk partition >

B (block 0) | S (block 1) | Inode List | Data Blocks


Super-Block

The superblock contains the following administrative information.
  Size in blocks of the file system
  Size in blocks of the inode list
  Number of free blocks and inodes
  Free inode list
  Index of the next free inode in the free inode list
  Free block list
  Index of the next free block in the free block list

It is impractical to keep either free list completely in the superblock. The superblock therefore holds only part of each list, and the kernel must manage the free inode and free block lists.


Free Inode List Management (1)

Assigning a free inode.
  Assign from the superblock free inode list.
  When the list becomes empty, the kernel scans the disk from the remembered inode to replenish the list (an inode with di_mode = 0 is free).

[Figure: superblock free inode list. Top: the list holds free inodes ..., 48, 83, 99 followed by empty slots; the inode at the next-free-inode pointer is assigned. Bottom: an empty list with remembered inode 470 is refilled by the scan with free inodes 471, 475, 476, ..., 535.]


Free Inode List Management (2)

Freeing an allocated inode.
  If the superblock free inode list has room, place the inode number on the list and increment the index.
  Otherwise, compare the inode number with the remembered inode number. If the freed inode number is less than the remembered one, it replaces the remembered inode.

[Figure: superblock free inode list. Top: the list has room (..., 48, 83, 99, empty slots), so the freed inode is placed at the next-free-inode pointer. Bottom: the list is full (471, 475, 476, ..., 535) with remembered inode 535; freeing inode 500 changes the remembered inode to 500.]
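The assign and free rules for the superblock free inode list can be sketched as a small user-space simulation. Everything here is an assumption made for illustration: the names (`sb`, `ialloc`, `ifree`), the tiny list capacity `NFREE`, and the `di_mode` array standing in for the on-disk inode list. Inode 0 is treated as "no inode".

```c
#include <stdint.h>

#define NFREE 4   /* capacity of the in-superblock list (tiny, for illustration) */

struct sb {
    uint16_t free_ino[NFREE]; /* cached free inode numbers */
    int      nfree;           /* number of valid entries */
    uint16_t remembered;      /* inode number where the next disk scan starts */
};

/* Scan the simulated inode list (di_mode[i] == 0 means inode i is free),
 * starting at the remembered inode, until the cache is full. */
static void refill(struct sb *sb, const uint16_t *di_mode, uint16_t ninodes)
{
    for (uint16_t i = sb->remembered; i < ninodes && sb->nfree < NFREE; i++) {
        if (i != 0 && di_mode[i] == 0) {
            sb->free_ino[sb->nfree++] = i;
            sb->remembered = (uint16_t)(i + 1);
        }
    }
}

/* Assign a free inode; returns 0 if none is left. */
uint16_t ialloc(struct sb *sb, uint16_t *di_mode, uint16_t ninodes)
{
    if (sb->nfree == 0)
        refill(sb, di_mode, ninodes);
    if (sb->nfree == 0)
        return 0;
    uint16_t ino = sb->free_ino[--sb->nfree];
    di_mode[ino] = 1;         /* mark as in use */
    return ino;
}

/* Free an inode: cache it if there is room, otherwise only lower the
 * remembered position so that a later scan will find it. */
void ifree(struct sb *sb, uint16_t *di_mode, uint16_t ino)
{
    di_mode[ino] = 0;
    if (sb->nfree < NFREE)
        sb->free_ino[sb->nfree++] = ino;
    else if (ino < sb->remembered)
        sb->remembered = ino;
}
```

The point of the remembered position is that the kernel never needs to scan inodes below it: any free inode with a lower number is either in the cached list or has pulled the remembered position down when it was freed.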


Free Block List Management (1)

Allocating a free block.
  Allocate the next available block in the superblock free block list.
  If the block being allocated is the last block in the list, the kernel treats it as a pointer to a block that contains a list of free blocks. It reads that block and populates the superblock array with the new list of block numbers.

[Figure: chain of free-list blocks. The superblock list links to block a, which links to block b, which links to block c. Blocks are assigned from the end of the superblock list; when the list runs out, the list stored in the link block is copied into the superblock.]


Free Block List Management (2)

Freeing an allocated block.
  If the superblock list is not full, place the block number on the superblock free list.
  Otherwise, the newly freed block becomes a link block: the kernel writes the superblock list into the block and writes the block to disk. It then places the block number of the newly freed block in the superblock list.

[Figure: freeing block a. Case 1 (not full): a is added to the superblock list. Case 2 (full): the superblock list is copied into block a, and a becomes the only entry in the superblock list, linking to blocks b, c, ...]
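The linked free block list can be simulated in a few lines. This is a sketch under stated assumptions, not the System V source: the names (`free_list`, `balloc`, `bfree`), the list capacity `NLIST`, and the in-memory `disk` array are all hypothetical. One deliberate simplification: an empty superblock list is handled like a full one on free, so that entry 0 of the list always names a block holding a valid (possibly empty) link list. Block 0 is never handed out and doubles as the "no block" result.

```c
#include <stdint.h>

#define NLIST   4    /* entries per list (tiny, for illustration) */
#define NBLOCKS 32   /* size of the simulated disk */

struct free_list {
    uint32_t n;              /* number of valid entries */
    uint32_t blk[NLIST];     /* blk[0] is the link block */
};

/* Simulated disk: the list stored in each block when it serves as a link block. */
struct disk { struct free_list link[NBLOCKS]; };

struct sb { struct free_list fl; };

/* Allocate a block; returns 0 when no free block is left. */
uint32_t balloc(struct sb *sb, struct disk *d)
{
    if (sb->fl.n == 0)
        return 0;
    uint32_t b = sb->fl.blk[--sb->fl.n];
    if (sb->fl.n == 0)
        sb->fl = d->link[b]; /* last entry: copy the list it holds into the superblock */
    return b;
}

/* Free a block; a full (or empty) list turns the freed block into a link block. */
void bfree(struct sb *sb, struct disk *d, uint32_t b)
{
    if (sb->fl.n == NLIST || sb->fl.n == 0) {
        d->link[b] = sb->fl; /* write the superblock list into the freed block */
        sb->fl.n = 0;
    }
    sb->fl.blk[sb->fl.n++] = b;
}
```

Freeing blocks 1 through 6 and then allocating six times returns them in reverse order; the switch from block 5 to block 4 is the moment the link block's list is copied back into the superblock.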


Analysis of System V File System (1)

Distinguished by its simple design. This simplicity creates problems in the areas of reliability, performance, and functionality.

Reliability problem.
  One copy of the superblock: superblock corruption problem.

Performance problem.
  Long seeks between two areas (superblock and data blocks) increase I/O times.
  Inodes are allocated randomly with no attempt to group related inodes (files in the same directory): random disk access.
  As files are created and deleted, the order of blocks in the free list becomes completely random -> disk block allocation problem. This slows down sequential access operations.


Analysis of System V File System (2)

Disk block size problem.
  SVR2 (512 bytes), SVR3 (1024 bytes).
  Small blocks require many indirect block accesses. Increasing the block size allows more data to be read in a single disk access -> improves performance. On the other hand, this also wastes more disk space.
  Need for a more flexible approach to allocating space to files.

Functionality problem.
  File name size: 14 bytes.
  Number of inodes: 65535 -> restricts the maximum number of files in the file system.


The Fast File System (FFS)

Retained the old file system abstraction.
Changed the underlying implementation.
  Bigger blocks (4096 bytes or larger).
  Group related information using cylinder groups (reduces long seeks).
  Use a bitmap for maintaining free blocks.
  Variable-length directory entries and long file names (max 255 bytes).

Up to 10 times faster than the old file system.

Functional enhancements
  Long file names -> 255 bytes.
  Symbolic links (possibly referring to a file on a different file system).
  Rename (previously required link and unlink commands - hard link).
  Quotas


Hard Disk Structure

[Figure: hard disk geometry showing platters, heads 0-2, cylinders 0-1, tracks 0-2, and sectors 0-1.]

UNIX views the disk as a linear array of blocks (each a multiple of the sector size). Block addresses increase first by sector #, then by head #, and then by cylinder #.
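The linear-array view implies a simple translation from a linear sector number to a (cylinder, head, sector) triple, which is the kind of work the device driver does. The sketch below assumes an idealized geometry with a fixed number of sectors per track and zero-based sector numbers, as in the figure's labels; the names `chs` and `sector_to_chs` are made up.

```c
#include <stdint.h>

struct chs { uint32_t cylinder, head, sector; };

/* Sector number varies fastest, then head, then cylinder. */
struct chs sector_to_chs(uint32_t sector_no, uint32_t heads,
                         uint32_t sectors_per_track)
{
    struct chs a;
    a.sector   = sector_no % sectors_per_track;
    a.head     = (sector_no / sectors_per_track) % heads;
    a.cylinder = sector_no / (sectors_per_track * heads);
    return a;
}
```

Consecutive sector numbers therefore stay on the same cylinder for as long as possible, which is why sequential blocks can be read without moving the disk arm.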


File System Layout (FFS)

FFS further divides the partition into one or more cylinder groups, each containing a small set of consecutive cylinders.

This allows UNIX to store related data in the same cylinder group (e.g., an inode and its data blocks: avoids long seeks).

The fields of a cylinder group
  A redundant copy of the superblock, at a varying offset.
  Space for a static number of inodes (default: one inode per 2048 bytes).
  The bitmap of available blocks (cf. the free list).
  Summary information describing the usage of data blocks.

[Figure: partition layout. B (boot area), then Cylinder Group 0 ... Cylinder Group n, each containing a copy of the superblock S.]


Blocks and Fragments (1)

Different file systems on the same machine can have different block sizes.

The block size is a power of two, with a minimum of 4096 bytes. Most implementations add an upper limit of 8192 bytes (compare with System V: 512 or 1024 bytes).

With 4096-byte blocks and 4-byte block pointers, 2**32 bytes (4 gigabytes) can be addressed with only two levels of indirection, since 1024 * 1024 * 4096 = 2**32 (i.e., FFS does not use the triple indirect block, although some variants use it to support file sizes greater than 4 gigabytes).

Typical UNIX systems have numerous small files that need to be stored efficiently.

FFS solves this problem by allowing each block to be divided into one or more fragments (1, 2, 4, or 8 fragments per block, with a lower bound of 512 bytes each).


Blocks and Fragments (2)

The last block of a file is usually not a complete disk block.

Write system call
  If enough space is left in an already allocated fragment => write into the available space.
  If no fragmented block is available => full blocks are written, and the remaining new data is written to a block with the necessary fragments or to a full block.

As the file grows, this scheme generates frequent copying. FFS allows only direct blocks to contain fragments.

Bits in map        XXXX   XXOO   OOXX   OOOO
Fragment numbers   0-3    4-7    8-11   12-15
Block numbers      0      1      2      3
Allocation Policies

FFS aims to colocate related information on the disk and to optimize sequential access.

Allocation Policies
  Use the next available block rotationally closest to the requested block on the same cylinder.
  If there are no blocks available on the same cylinder, use a block within the same cylinder group.
  If that cylinder group is entirely full, hash the cylinder group number to choose another cylinder group to look for a free block.
  Finally, if the hash fails, apply an exhaustive search to all cylinder groups.
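The last two fallback steps (hash to another cylinder group, then exhaustive search) can be sketched as follows. The slide does not say which hash is used; the quadratic rehash below (offsets 1, 2, 4, ... from the preferred group) is an assumption chosen for illustration, and `pick_group` and its `free_blocks` array are hypothetical names.

```c
/* Choose a cylinder group with free blocks, starting from `preferred`.
 * free_blocks[g] is the number of free blocks in group g.
 * Returns the chosen group, or -1 if every group is full. */
int pick_group(const int *free_blocks, int ngroups, int preferred)
{
    int cg = preferred;

    /* 1. The preferred cylinder group. */
    if (free_blocks[cg] > 0)
        return cg;

    /* 2. Rehash: probe groups at increasing power-of-two steps. */
    for (int step = 1; step < ngroups; step *= 2) {
        cg += step;
        if (cg >= ngroups)
            cg -= ngroups;
        if (free_blocks[cg] > 0)
            return cg;
    }

    /* 3. Exhaustive search of the remaining groups
     *    (preferred and preferred + 1 were already tried). */
    cg = (preferred + 2) % ngroups;
    for (int i = 2; i < ngroups; i++) {
        if (free_blocks[cg] > 0)
            return cg;
        if (++cg == ngroups)
            cg = 0;
    }
    return -1;
}
```

The rehash spreads overflow allocations across the disk instead of piling them into the neighbouring group, and the exhaustive pass guarantees that a free block is found if one exists.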


History of Linux File Systems

First Linux file system: Minix File System.
  14-character file names
  Maximal file size: 64 MB
  Lacking in performance

April 1992: Extended File System (Extfs).
  Variable-length file names (up to 255 characters)
  Maximal file size: 2 GB
  Most successful in the Linux community

January 1993: New Extended File System (renamed as Second Extended File System - Ext2fs)

January 1993: XIA File System


The Virtual File System (VFS) (1)

Software layer in the kernel that provides the file system interface to user space programs.

Manages kernel-level file abstractions in one format for all file systems (implements file system independent operations).

Receives file-oriented system calls from user level: write, open, stat, link.

VFS switch: translates them into internal calls that interface with a specific file system module (the file system dependent part).

Each file system provides methods that the VFS switch can call; many are optional.

Also receives requests from other parts of the kernel, mostly from memory management.


[Figure: Before the VFS: Minix File System -> Buffer Cache -> Device Driver. After the VFS: VFS -> Minix / Ext2fs -> Buffer Cache -> Device Driver.]


VFS maintains its own superblock/inode (i.e., in-core inode) structure, and each file system dependent translator converts the contents of the external superblock/file descriptor (i.e., disk inode) from/to the VFS superblock/inode.

VFS assumes the following about the file system stored on the disk:
  The first sector of the disk is a boot block, although VFS does not use it.
  A superblock contains disk-specific information, such as the number of bytes in a disk block.
  External file descriptors (disk inodes) on the disk describe the characteristics of each file.
  Data blocks linked into each file contain the data.

Before VFS can manage a particular file system type, the type has to be registered with the register_filesystem() call (fs/super.c); this is usually done when the machine is booted.
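Registration amounts to adding a file system type to a list that the VFS later searches by name. The following user-space sketch shows the idea only; it is not the kernel's `register_filesystem()`, and the names `fs_type`, `register_fs`, and `find_fs` are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* A registered file system type, kept on a singly linked list. */
struct fs_type {
    const char     *name;
    struct fs_type *next;
};

static struct fs_type *fs_list;   /* head of the registry */

/* Append a type to the registry; returns -1 if the name is already taken. */
int register_fs(struct fs_type *fs)
{
    struct fs_type **p;

    for (p = &fs_list; *p != NULL; p = &(*p)->next)
        if (strcmp((*p)->name, fs->name) == 0)
            return -1;
    fs->next = NULL;
    *p = fs;
    return 0;
}

/* Look up a type by name, as a mount request would. */
struct fs_type *find_fs(const char *name)
{
    for (struct fs_type *p = fs_list; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}
```

A mount of an unregistered type fails at the lookup step, before any disk access takes place.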


Mounting a File System (1)

Before a file system can be used, it has to be mounted.

Mounting attaches a new file system to an existing directory hierarchy, which allows heterogeneous file systems to be combined in the system's directory hierarchy.

When a file system is mounted, VFS creates an instance of the super_block data structure (defined in include/linux/fs.h) to hold information for the file system.

VFS then calls the new file system's read_super() function (defined in fs/super.c) to retrieve the information contained in the new file system's superblock and save it into the super_block structure.

After the mount, the VFS can use the super_operations (defined in include/linux/fs.h) functions to handle the on-disk superblock and inodes.


[Figure: a super_block holds FS independent data, FS dependent data, and a pointer to its super_operations table: read_inode, notify_change, write_inode, put_inode, put_super, write_super, statfs, remount_fs.]


struct super_block {
    …..
    unsigned long s_blocksize;
    …..
    struct file_system_type *s_type;
    struct super_operations *s_op;
    …..
    union {    /* File system specific information */
        struct minix_sb_info minix_sb;
        struct ext2_sb_info ext2_sb;
        struct hpfs_sb_info hpfs_sb;
        struct ntfs_sb_info ntfs_sb;
        struct msdos_sb_info msdos_sb;
    } u;
    …..
};

/* Disk dependent functions such as handling
   on-disk superblock, inodes, etc */
struct super_operations {
    void (*read_inode) (struct inode *);
    void (*write_inode) (struct inode *);
    void (*put_inode) (struct inode *);
    void (*delete_inode) (struct inode *);
    int (*notify_change) (struct dentry *, …);
    void (*put_super) (struct super_block *);
    void (*write_super) (struct super_block *);
    int (*statfs) (struct super_block *, … );
    int (*remount_fs) (struct super_block *, ..);
    void (*clear_inode) (struct inode *);
    void (*umount_begin) (struct super_block *);
};
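The operations table is what lets file system independent code call file system dependent code. The user-space sketch below mimics that dispatch with a cut-down table; all names (`sblock`, `sb_ops`, `vfs_statfs`, `vfs_sync_super`, the `toy_` tables) are hypothetical stand-ins, not the kernel's definitions. It also shows how an optional method is handled: the caller checks for a NULL pointer before calling.

```c
#include <stddef.h>

struct sblock;

/* Cut-down operations table: one required and one optional method. */
struct sb_ops {
    int  (*statfs)(struct sblock *sb, unsigned long *bsize);
    void (*write_super)(struct sblock *sb);   /* optional, may be NULL */
};

struct sblock {
    unsigned long        s_blocksize;  /* FS independent data */
    const struct sb_ops *s_op;         /* FS dependent operations */
    int                  dirty;
};

static int generic_statfs(struct sblock *sb, unsigned long *bsize)
{
    *bsize = sb->s_blocksize;
    return 0;
}

static void toy_write_super(struct sblock *sb)
{
    sb->dirty = 0;   /* pretend the superblock was written to disk */
}

/* Two "file systems": one implements write_super, the other does not. */
const struct sb_ops toy_ext2_ops  = { generic_statfs, toy_write_super };
const struct sb_ops toy_minix_ops = { generic_statfs, NULL };

/* FS independent entry points: dispatch through the table. */
int vfs_statfs(struct sblock *sb, unsigned long *bsize)
{
    if (sb->s_op == NULL || sb->s_op->statfs == NULL)
        return -1;
    return sb->s_op->statfs(sb, bsize);
}

void vfs_sync_super(struct sblock *sb)
{
    if (sb->dirty && sb->s_op != NULL && sb->s_op->write_super != NULL)
        sb->s_op->write_super(sb);
}
```

The caller never names a concrete file system; adding a new one only requires a new table.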


VFS Inode (1)

Every file operation is made on an inode.
The kernel translates file pathnames into inode numbers.
The VFS maintains a table of inodes in use.
Inodes are referenced by the structure inode (defined in include/linux/fs.h):
  FS independent data
  Pointer to FS dependent operations (inode_operations - defined in include/linux/fs.h)
  FS dependent data

Usually one operation table per inode type (regular file, directory, symbolic link, …).


[Figure: an inode holds FS independent data, FS dependent data, a pointer to its super_block, and a pointer to its inode_operations table: create, lookup, link, unlink, symlink, mkdir, rmdir, mknod, rename, readlink, follow_link, bmap, truncate, permission.]


struct inode_operations {
    struct file_operations * default_file_ops;
    int (*create) (struct inode *, …);
    struct dentry * (*lookup) (struct inode *, …);
    int (*link) (struct dentry *, …);
    int (*unlink) (struct inode *, struct dentry *);
    int (*symlink) (struct inode *, …);
    int (*mkdir) (struct inode *, …);
    int (*rmdir) (struct inode *, struct dentry *);
    int (*mknod) (struct inode *, …);
    int (*rename) (struct inode *, …);
    int (*readlink) (struct dentry *, char *, int);
    …..
};

struct inode {
    …..
    uid_t i_uid;
    gid_t i_gid;
    …..
    time_t i_atime;
    …..
    unsigned long i_blksize;
    unsigned long i_blocks;
    …..
    struct inode_operations *i_op;
    struct super_block *i_sb;
    …..
    union {    /* File system specific information */
        struct pipe_inode_info pipe_i;
        struct minix_inode_info minix_i;
        struct ext2_inode_info ext2_i;
        …..
    } u;
    …..
};


Opening a File - VFS to Disk Translation

[Figure: POSIX API (system calls) -> fd table (struct files_struct, reached from task_struct) -> file structure table (struct file) -> Linux inode (struct inode) -> disk inode.]


Inode Cache (1)

The inode cache is implemented as a hash table (fs/inode.c).
  Hash value: inode number, device identifier

Inodes are connected by two separate lists.
  Hash list
  Type list
    in_use: valid inode, hashed if i_nlink > 0.
    dirty: valid inode, hashed if i_nlink > 0, dirty.
    unused: ready to be re-used. Not hashed.

Inode State Transitions
  unused -> hash: allocate a new inode. iget() calls ext2_read_inode().
  dirty -> in_use: synchronize inodes. Calls ext2_write_inode().
  hash -> unused: clear inodes. If i_nlink = 0, iput() calls ext2_delete_inode() when i_count falls to 0.


Inode Cache (2)

[Figure: inode_hashtable[] (HASH_SIZE buckets) chains inodes through i_hash; each inode is also linked through i_list onto one of the lists inode_in_use, inode_dirty, or inode_unused. Labeled transitions: iget() / read_inode; mark_inode_dirty (used -> dirty); write_inode (sync) (dirty -> used); iput() / clear_inode / delete_inode.]


Directory Cache

Speeds up access to commonly used directories (fs/dcache.c).

The directory cache consists of a hash table.
  Hash key: device number, directory's name

Two-level LRU list.
  When an entry is first looked up, it is added onto the end of the first-level list.
  If the entry is accessed again, it is promoted to the end of the second-level list.


Buffer Cache Management (1)

Hashed by device number and block number.

Two functional parts
  Free block list (free_list).
    One list per buffer size (512, 1K, …, 8K).
  Hash table (lru_list: LRU list for each buffer type).
    Three LRU lists - CLEAN, LOCKED, DIRTY.
    If found in the hash list, put_last_lru().
    If not in the hash list, allocate from the free list according to its size.
      – remove from the free list, insert into the lru list.


[Figure: free_list[0] ... free_list[6], each a doubly linked list of buffer_head structures chained through b_next_free and b_prev_free.]

[Figure: hash_table buckets, selected by the hash function, each chain buffer_head structures through b_next and b_pprev and end in NULL; lru_list[BUF_DIRTY] links the dirty buffers.]


Ext2 File System

Influenced by BSD FFS (also called UFS - Unix File System).
Divides the partition into block groups (cf. cylinder groups).
Keeps a) data blocks close to their inodes and b) file inodes close to their directory inode -> reduces seek time and speeds up access to data.


Ext2fs Super Block

Contains a description of the basic size and shape of the file system.
Usually the superblock in Block Group 0 is read (size: 1024 bytes).
Each block group contains a duplicate copy in case of file system corruption.
The superblock contains (struct ext2_super_block: include/linux/ext2_fs.h):
  number of blocks (total and reserved)
  number of inodes
  number of free blocks and inodes
  block and fragment size
  number of blocks and inodes per group
  last mount and write times
  FS state
  current and maximal mount count
  last check time and check interval
  mount option default values


Group Descriptors

Provide information on the block groups.
All descriptors are duplicated in each group (size: 32 bytes).
Each descriptor describes a block group (struct ext2_group_desc: include/linux/ext2_fs.h):
  block bitmap location
  inode bitmap location
  inode table location
  number of free blocks
  number of free inodes
  number of allocated directories (used by the allocation routines)


Bitmaps and Inode Table

The size of each bitmap is one block.
Since one bit describes one block, this restricts the size of a block group to 8192 blocks for blocks of 1024 bytes.
The inode table is a vector of inodes, each 128 bytes in size.
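The 8192-block limit follows from the bitmap occupying exactly one block, with one bit per block. A short sketch of the arithmetic, using hypothetical helper names and the 128-byte inode size quoted on this slide:

```c
#include <stdint.h>

#define INODE_SIZE 128u   /* bytes per on-disk inode, as stated on the slide */

/* A one-block bitmap has 8 bits per byte, one bit per block in the group. */
uint32_t max_blocks_per_group(uint32_t block_size)
{
    return 8u * block_size;
}

/* Number of inodes that fit in one block of the inode table. */
uint32_t inodes_per_block(uint32_t block_size)
{
    return block_size / INODE_SIZE;
}
```

For 1024-byte blocks this gives 8192 blocks per group and 8 inodes per inode table block; larger blocks raise both figures proportionally.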


Directory Structure in Ext2fs

Directory: linked list of variable-length entries.
Each entry contains (struct ext2_dir_entry: include/linux/ext2_fs.h):
  the inode number
  the entry length (rounded up to a multiple of 4)
  the file name length
  the file name (maximum of 255)

Example

Offset   Inode   Entry length   Name length   File name
0        i1      16             5             file1
16       i2      40             14            long_file_name
56       i3      12             2             f2
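The entry lengths in the example can be reproduced from the layout. The sketch below assumes the classic entry header of a 4-byte inode number, a 2-byte entry length, and a 2-byte name length (8 bytes in total) followed by the name; the helper names are made up.

```c
#include <stdint.h>

#define DIRENT_HEADER 8u   /* inode (4) + entry length (2) + name length (2) */

/* Smallest legal entry length for a name: header plus name,
 * rounded up to a multiple of 4. */
uint32_t min_entry_len(uint32_t name_len)
{
    return (DIRENT_HEADER + name_len + 3u) & ~3u;
}

/* The entry length doubles as the link to the next entry. */
uint32_t next_entry_offset(uint32_t offset, uint32_t entry_len)
{
    return offset + entry_len;
}
```

For "file1" and "f2" the minimum matches the example (16 and 12). For "long_file_name" the minimum is 24, while the example shows 40: an entry may be longer than its minimum, and the surplus is unused space that a later entry can take over.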


Inode Structure in Ext2fs

Block size: 1K ~ 4K
Data block: i_data[15]
  direct block: 12
  indirect block: 3
  cf. - Sys V: i_addr[13]
      - UFS: i_db[12], i_ib[3]
struct ext2_inode: include/linux/ext2_fs.h


Data Allocation in Ext2fs (1)

Block groups are used to cluster together related inodes and data.

Uses a "next-32-bits" search (within 32 blocks) for allocating a block's successor (target-oriented allocation).

If that fails, searches forward:
  within the group for an entire free byte in the bitmap (8 free blocks).
  within the group for any free bit in the bitmap, if that fails.
  in subsequent groups in a similar manner, if that fails.

Pre-allocates up to 8 adjacent blocks when allocating a new block:
  de-allocates the extra blocks on file close.
  pre-allocation achieves good performance.
  pre-allocation hit rates are around 75% even on very full file systems.
  #define EXT2_PREALLOCATE 8 -> include/linux/ext2_fs.h.
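The search order within one block group (near the goal, then a whole free byte, then any free bit) can be sketched on a plain byte-array bitmap. This is a simplified illustration, not the ext2 source: the function name is hypothetical, a set bit is taken to mean "block in use", and the number of bits is assumed to be a multiple of 8.

```c
#include <stddef.h>
#include <stdint.h>

static int bit_is_set(const uint8_t *bm, size_t i)
{
    return (bm[i >> 3] >> (i & 7)) & 1;
}

/* Returns the chosen block index inside the group, or -1 if the group is
 * full (the caller would then try subsequent groups). */
long find_block_in_group(const uint8_t *bm, size_t nbits, size_t goal)
{
    /* 1. A free bit within the next 32 bits after the goal. */
    size_t end = goal + 32 < nbits ? goal + 32 : nbits;
    for (size_t i = goal; i < end; i++)
        if (!bit_is_set(bm, i))
            return (long)i;

    /* 2. An entirely free byte (8 free blocks), searching forward. */
    for (size_t b = goal >> 3; b < nbits / 8; b++)
        if (bm[b] == 0)
            return (long)(b * 8);

    /* 3. Any free bit, searching forward. */
    for (size_t i = goal; i < nbits; i++)
        if (!bit_is_set(bm, i))
            return (long)i;

    return -1;
}
```

Preferring a free byte over an isolated free bit is what leaves room for the following blocks of the same file, and for the 8-block pre-allocation, to be contiguous.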


Data Allocation in Ext2fs (2)

Achieves good locality:
  of related files through block groups.
  of related blocks through 8-bit clustering of block allocations.

Reduces CPU overhead:
  The size of the bitmaps that must be searched is limited.
  Block pre-allocation reduces allocation overhead.
  The physical and logical location of the last block is recorded in each inode. Rapidly detects and deals with sequential allocations.


Extension of Ext2fs

File Undeletion

Access Control List
  File protection per user and/or per group

Automatic File Compression
  Files stored in gzipped format
  Decompression on the fly during reads


Need for New File System (1) : Technologies

Three important components in file system design: processors, disks, and main memory.

CPU speed is increasing at an exponential rate, while the improvement in disk speed is slower.
  Disk improvements have focused mainly on transfer bandwidth, with no major improvement in access time (e.g., seek time). -> No major speed-up in applications.

Main memory is increasing in size, which makes large file caches possible.
  They absorb a greater fraction of the read requests (80 ~ 90 % hit ratio).
  Disk traffic will become more and more dominated by writes.
  The file cache can be used as a write buffer that allows blocks to be collected before being written in a single transfer.
  A large buffer may result in large data loss. -> Solution: periodic update or synchronous write.


Need for New File System (2) : Workloads

Among different file system workloads, one of the most difficult workloads for file system design is found in office and engineering environments.

Office and engineering applications:
  Tend to be dominated by accesses to small files (only a few kilobytes).
  Small files usually result in small random disk I/Os.
  The creation and deletion times for such files are often dominated by updates to file system metadata.

Workloads dominated by sequential accesses to large files: supercomputing applications or multimedia applications.
  A number of techniques exist to ensure that files are laid out sequentially on disk (e.g., FFS, Ext2, etc.).
  Use large block sizes.
  I/O performance tends to be limited by the disk I/O bandwidth.


Problems with Traditional File Systems (1)

Performance Problems
  Kernel algorithms force a large number of synchronous I/O operations, resulting in extremely long completion times.
  FFS writes data blocks asynchronously, while metadata are written synchronously. -> Simplified crash recovery but reduced write performance.

The disk layout in FFS restricts it to using only a fraction of the total disk bandwidth.
  FFS is designed to read or write a single block in each I/O request.
  If two blocks are on consecutive sectors on the disk, the disk would rotate past the next block (due to kernel processing time).
  FFS therefore introduces the concept of rotational delay (or rotdelay): block interleaving.
  Typically a complete rotation takes around 15 ms and the kernel needs 4 ms.
  If the block size is 4 Kbytes and each track has 8 blocks, rotdelay = 2, i.e., only one block out of every three on a track is transferred per rotation.
  This restricts the disk bandwidth to about 1/3 of the total bandwidth.


Problems with Traditional File Systems (2)

Solution:
  1) Read/write the entire track in each operation.
  2) On-disk cache (disk reads store the entire track in the cache).
Write operations still suffer from the rotational delay problem.


Predominance of disk writes.
  A large buffer cache absorbs many of the disk read requests.
  For consistency, the update daemon periodically flushes dirty blocks to disk.
  Some disk operations require synchronous disk updates.

Many of the synchronous writes turn out to be quite unnecessary.
  Due to strong locality, the same block is very likely to be modified again.
  Many files have a very short lifetime.

Disk head seek problem.

Since writes account for most of the disk activity, the operating system needs to find other ways to solve these problems.


Metadata Updates
  In order to prevent file system corruption, metadata updates need to be written in a precise order.
  For example, if a file is deleted, the kernel must remove the directory entry, free the inode, and free the disk blocks used by the file.

In traditional file systems, such ordering is achieved through synchronous writes.

Since the attributes (e.g., inode) of a file are stored separately from the file's contents, it takes several disk I/Os, each preceded by a seek, to do disk operations (e.g., 5 I/Os to create a new file in FFS).

When writing small files, less than 5 % of the disk's potential bandwidth is used for new data; the rest of the time is spent seeking.


Crash Recovery
  Ordering metadata writes helps control the damage caused by a system crash but does not eliminate it.

Sequence of operations to rebuild the file system by fsck:
  Read and check all inodes and build a bitmap of used data blocks.
  Record inode numbers and block addresses of all directories.
  Validate the structure of the directory tree, making sure that all links are accounted for.
  Validate directory contents to account for all the files.
  If any directories could not be attached to the tree in phase 2, put them in the lost+found directory.
  If any file could not be attached to a directory, put it in the lost+found directory.
  Check the bitmaps and summary counts for each cylinder group.

Because fsck takes a long time, systems may experience a long delay before they can restart after a crash.


Security
  The UNIX access control mechanism (permission bits) is not enough in large computing environments.
  Need a finer-granularity control scheme.
  ACL (Access Control List): allows the file owner to explicitly allow or restrict different types of access to specific users and groups.
  UNIX inodes are not designed to hold such a list, so the file system must find other ways of implementing ACLs.

Size
  Unnecessary restrictions on the size of the file system and of individual files.


Journaling Approach (1)

Record all file system changes in an append-only log file.

Use a database logging technique: keep track of changes to make sure that all updates on the disk are done safely.

The log is written sequentially, in large chunks at a time, which results in efficient disk utilization and high performance.

After a crash, only the log needs to be examined, which means quicker recovery and higher reliability.

Write performance can be improved because log writes require no seek operations.

Basic Characteristics:
  What to log? All modifications including data blocks (Logging File System) vs. only metadata changes (Journaling File System).
  Log operations or values?


Journaling Approach (2)

Log-enhanced File Systems vs. Log-structured File Systems.
  Log-enhanced File System: retains the traditional on-disk structures, such as inodes and superblocks, and uses the log as a supplementary record.
  Log-structured File System: the log is the only representation of the file system on disk -> requires full logging (data as well as metadata).

Garbage collection: finite-sized log (logically circular file).

Group commit: need to write the log in large chunks. There is a tradeoff between performance and reliability.

Retrieval: need an efficient way (indexing technique) of retrieving data from the log in case of a cache miss.


Static File System vs. Logging File System

[Figure: a file header and file data blocks 1, 2, and 3, shown before and after a write that updates blocks 2 and 3, for a static file system and for a logging file system.]