Sogang University
Advanced Operating Systems
(UNIX & Linux File System)
Sang Gue Oh, Ph.D.
Email : [email protected]
UNIX File System
File System Framework

[Figure: layered file system framework. User level: Directory API and File Operations API (File System Interface), entering the kernel through the System Call Interface. Kernel level (File System Implementation): Directory Service (text name -> file id), File Storage Service, and Device Driver. Devices: Hard Disk, Floppy Disk, CD-ROM.]

Two UNIX File Systems (Local File System)

- System V File System
  - Original file system for UNIX.
  - All versions of System V, as well as several commercial UNIX systems, support this file system.
- FFS (Fast File System)
  - Introduced by Berkeley UNIX in release 4.2BSD.
  - Provides better performance, robustness, and functionality.
  - Gained wide commercial acceptance; SVR4 includes this file system.
  - After integration with VFS (Virtual File System), this file system is known as the UNIX file system (ufs).

Inode (Index Node)

- Contains a description of the disk layout of the file data and other administrative information such as the file owner, access permissions, file size, block size, and access times.
- Every file has one inode associated with it (the internal representation of the file), and the inode number is unique within the file system.
- When a process refers to a file by name, the kernel maps the name to an inode for further operations.
- Two different structures:
  - On-disk inode: stored on the disk.
  - In-core inode: maintained in the kernel after being read from the disk. It has more fields in addition to the fields of the disk inode.

Block Addressing Using Inode

[Figure: block addressing through the inode.]

Example - Accessing the Block

Block size = 1024 bytes; the inode has 10 direct pointers and 1 each of single, double, and triple indirect pointers.
- Case 1: access byte offset 9000 of a file.
- Case 2: access byte offset 350,000 of a file.

[Figure: inode block list. Direct entries: 0 -> 4096, 1 -> 228, ..., 8 -> 367, 9 -> 0; single indirect -> 428; double indirect -> 9156; triple indirect -> 824.]

Case 1: 9000 = 1024 * 8 + 808, so the byte lies in direct entry 8 (data block 367), at byte 808 (9000 - 1024*8) of that block.

Case 2: the double indirect range starts at 10K + 256K = 272,384 bytes, and 350,000 - 272,384 = 77,616 = 1024 * 75 + 816. Follow entry 0 of the double indirect block (9156) to single indirect block 331, then entry 75 of block 331 to data block 3333. The byte is at offset 816 of that block.
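
The two cases above can be checked mechanically. The following is a minimal sketch, not kernel code: `map_offset`, `struct block_path`, and the `enum level` names are hypothetical, and the 4-byte pointer size (256 pointers per 1024-byte block) is an assumption implied by the "10K + 256K" arithmetic in the example.

```c
#include <assert.h>
#include <stdint.h>

enum level { DIRECT = 0, SINGLE = 1, DOUBLE = 2, TRIPLE = 3 };

/* Hypothetical result type: the chain of indices that reaches a data block. */
struct block_path {
    enum level level;   /* which class of inode pointer is used */
    uint32_t index[4];  /* index[0]: slot in the inode; index[1..level]: slot in
                           each indirect block, outermost first */
    uint32_t offset;    /* byte offset inside the data block */
};

/* Map a byte offset to its pointer chain, assuming n_direct direct pointers
 * followed by one single, one double, and one triple indirect pointer. */
static struct block_path map_offset(uint64_t byte_offset, uint32_t block_size,
                                    uint32_t ptr_size, uint32_t n_direct)
{
    assert(block_size > 0 && ptr_size > 0 && block_size % ptr_size == 0);

    struct block_path p = { DIRECT, {0, 0, 0, 0}, 0 };
    uint64_t ppb = block_size / ptr_size;     /* pointers per block */
    uint64_t lbn = byte_offset / block_size;  /* logical block number */
    p.offset = (uint32_t)(byte_offset % block_size);

    if (lbn < n_direct) {
        p.index[0] = (uint32_t)lbn;
        return p;
    }
    lbn -= n_direct;

    uint64_t span = ppb;  /* blocks reachable through the current level */
    for (int lvl = SINGLE; lvl <= TRIPLE; lvl++) {
        if (lbn < span) {
            p.level = (enum level)lvl;
            p.index[0] = n_direct + (uint32_t)(lvl - 1);
            /* Peel off one index per indirect block, innermost first. */
            for (int i = lvl; i >= 1; i--) {
                p.index[i] = (uint32_t)(lbn % ppb);
                lbn /= ppb;
            }
            return p;
        }
        lbn -= span;
        span *= ppb;
    }
    assert(0 && "offset beyond the maximum file size");
    return p;
}
```

With `map_offset(9000, 1024, 4, 10)` the sketch reports direct slot 8 and byte 808; with `map_offset(350000, 1024, 4, 10)` it reports the double indirect pointer, entry 0, then entry 75, and byte 816, matching Case 1 and Case 2.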

Example - Maximum File Size

Total size = (# of direct block pointers * block size)
           + (# of block pointers in one block * block size)
           + ((# of block pointers in one block)**2 * block size)
           + ((# of block pointers in one block)**3 * block size)

Case: block size = 1024 bytes (1 Kbyte), pointer size = 4 bytes, 10 direct, 1 single indirect, 1 double indirect, and 1 triple indirect pointers.

Maximum file size = (10 * 1024) + (1024/4 * 1024) + ((1024/4)**2 * 1024) + ((1024/4)**3 * 1024)
                  = 1024 * (10 + 256 + 256**2 + 256**3) bytes
                  = 17,247,250,432 bytes (about 16 GB)

Increasing the block size vs. adding one more level of indirection (quadruple)?
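
To explore the closing question, the formula can be written as a small function. This is a sketch under the slide's assumptions; `max_file_size` and its parameters are hypothetical names, and `n_levels` counts the indirect pointers (1 = single only, 3 = single, double, and triple).

```c
#include <assert.h>
#include <stdint.h>

/* Maximum file size in bytes for n_direct direct pointers plus one indirect
 * pointer at each of n_levels levels of indirection. */
static uint64_t max_file_size(uint64_t block_size, uint64_t ptr_size,
                              uint64_t n_direct, int n_levels)
{
    assert(ptr_size > 0 && block_size % ptr_size == 0);

    uint64_t ppb = block_size / ptr_size;  /* block pointers in one block */
    uint64_t blocks = n_direct;
    uint64_t span = 1;
    for (int i = 0; i < n_levels; i++) {
        span *= ppb;      /* ppb, ppb**2, ppb**3, ... */
        blocks += span;
    }
    return blocks * block_size;
}
```

`max_file_size(1024, 4, 10, 3)` gives 17,247,250,432 bytes, the value above. Doubling the block size, `max_file_size(2048, 4, 10, 3)`, gives roughly 256 GB, while adding a fourth level, `max_file_size(1024, 4, 10, 4)`, gives roughly 4 TB: each extra level multiplies the reach by the number of pointers per block.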

Directory Structure (System V Case)

Inode Number (2 bytes)   File Name (14 bytes)
73                       .
38                       ..
9                        file1
0                        deleted file
110                      subdirectory1
65                       archana

- The 2-byte inode number (16 bits) restricts the maximum number of files in the partition to 65,535.
- The maximum file name length is 14 characters.
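
The fixed 16-byte entry can be pictured as a C structure. This is an illustrative sketch, not the System V header: `struct sysv_dirent`, `SYSV_NAME_MAX`, and `sysv_lookup` are hypothetical names. An inode number of 0 marks a deleted entry, as in the table above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SYSV_NAME_MAX 14

/* One directory entry: 2-byte inode number + 14-byte name = 16 bytes. */
struct sysv_dirent {
    uint16_t d_ino;                  /* 0 means the entry is unused or deleted */
    char     d_name[SYSV_NAME_MAX];  /* NUL-padded; no terminator when 14 long */
};

/* Linear scan of a directory; returns the inode number, or 0 if not found. */
static uint16_t sysv_lookup(const struct sysv_dirent *dir, size_t n,
                            const char *name)
{
    assert(dir != NULL && name != NULL);
    if (strlen(name) > SYSV_NAME_MAX)
        return 0;  /* such a name cannot be stored at all */
    for (size_t i = 0; i < n; i++) {
        if (dir[i].d_ino != 0 &&
            strncmp(dir[i].d_name, name, SYSV_NAME_MAX) == 0)
            return dir[i].d_ino;
    }
    return 0;
}
```

Because every entry has the same size, a directory is just an array, and both limits on the slide (65,535 files, 14-character names) fall directly out of the field widths.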

File System Layout (System V Case)

- A file system resides on a single logical disk or partition, and each logical disk may hold at most one file system.
- A partition is viewed as a linear array of blocks.
  - The size of a block is a multiple of 512 bytes (e.g., 512/1024/2048).
  - This is the granularity of space allocation for a file.
- A physical block number is an index into this array; the device driver translates it into cylinder, track, and sector numbers.

< Layout of a disk partition >

Block:   0   1
       | B | S | Inode List | Data Blocks |

- B (Boot Area): only one partition needs to contain boot information.
- S (Super Block): contains metadata about the file system.
- Inode List: has a fixed size, which limits the maximum number of files. The size of an inode is 64 bytes in System V UNIX.

Super-Block

The superblock contains the following administrative information:
- Size in blocks of the file system
- Size in blocks of the inode list
- Number of free blocks and inodes
- Free inode list
- Index of the next free inode in the free inode list
- Free block list
- Index of the next free block in the free block list

It is impractical to keep either free list completely in the superblock. The superblock therefore holds only part of each list, and the kernel must manage the free inode list and the free block list.

Free Inode List Management (1)

Assigning a free inode:
- Assign from the superblock free inode list.
- When the list becomes empty, the kernel scans the disk from the remembered inode to replenish the list (an inode with di_mode = 0 is free).

[Figure: (a) the superblock free inode list holds free inodes 48, 83, 99, and the next-free-inode pointer marks the entry to assign. (b) The list is empty and the remembered inode is 470; the scan refills the list with free inodes 471, 475, 476, ..., 535.]

Free Inode List Management (2)

Freeing an allocated inode:
- If the superblock free inode list has room, place the inode number on the list and increment the index.
- Otherwise, compare the inode number with the remembered inode. If the freed inode number is less than the remembered inode number, it replaces the remembered inode, so that the next disk scan does not miss it.

[Figure: (a) the list 48, 83, 99 has room: the freed inode is placed at the next-free-inode pointer. (b) The list is full (471, 475, 476, ..., 535): freeing inode 500 changes the remembered inode from 535 to 500.]
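
The assign and free rules can be simulated in user space. The sketch below is a simplified model, not the System V source: `struct toy_fs`, `ialloc`, `ifree`, `refill`, and the sizes `NINODES` and `SB_NFREE` are hypothetical, and here `remembered` is kept as the inode number at which the next disk scan starts.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NINODES  64  /* hypothetical: inodes on the toy disk, numbered 1..64 */
#define SB_NFREE 4   /* hypothetical: capacity of the superblock free list */

struct toy_fs {
    uint16_t di_mode[NINODES + 1];  /* di_mode[i] == 0: inode i is free */
    uint32_t s_inode[SB_NFREE];     /* superblock free inode list */
    int      s_ninode;              /* valid entries in s_inode */
    uint32_t remembered;            /* next disk scan starts here */
};

static void toy_fs_init(struct toy_fs *fs)
{
    memset(fs, 0, sizeof *fs);
    fs->remembered = 1;
}

/* Scan the disk from the remembered inode to replenish the list. */
static void refill(struct toy_fs *fs)
{
    uint32_t i = fs->remembered;
    for (; i <= NINODES && fs->s_ninode < SB_NFREE; i++)
        if (fs->di_mode[i] == 0)
            fs->s_inode[fs->s_ninode++] = i;
    fs->remembered = i;
}

/* Assign a free inode; returns 0 when none is left. */
static uint32_t ialloc(struct toy_fs *fs, uint16_t mode)
{
    assert(mode != 0);
    if (fs->s_ninode == 0)
        refill(fs);
    if (fs->s_ninode == 0)
        return 0;
    uint32_t ino = fs->s_inode[--fs->s_ninode];
    fs->di_mode[ino] = mode;
    return ino;
}

/* Free an inode: use the list if it has room, else adjust the scan start. */
static void ifree(struct toy_fs *fs, uint32_t ino)
{
    assert(ino >= 1 && ino <= NINODES);
    fs->di_mode[ino] = 0;
    if (fs->s_ninode < SB_NFREE)
        fs->s_inode[fs->s_ninode++] = ino;
    else if (ino < fs->remembered)
        fs->remembered = ino;
}
```

The invariant that makes this work is that every free inode not on the list has a number at or above `remembered`, which is why a lower freed inode must pull the scan start back.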

Free Block List Management (1)

Allocating a free block:
- Allocate the next available block from the free block list in the superblock.
- If the block being allocated is the last one in the list, the kernel treats it as a pointer to a block that contains a list of free blocks. It reads that block and populates the superblock array with the new list of block numbers.

[Figure: chain of free-list blocks. The superblock list links to block a, which links to block b, which links to block c, and so on. Blocks are assigned from the superblock list; when its last entry (a) is assigned, the list stored in block a is copied into the superblock.]

Free Block List Management (2)

Freeing an allocated block:
- If the superblock list is not full, place the block number on the superblock free list.
- Otherwise, the newly freed block becomes a link block: the kernel writes the superblock list into the block and writes the block to disk. It then places the block number of the newly freed block in the superblock list, as its only entry.

[Figure: Case 1 (not full): the freed block's number is added to the superblock list. Case 2 (full): the superblock list is copied into the freed block, and the superblock list then holds only that block's number, which links to the rest of the chain.]
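
The linked free block list can be modeled the same way. This is a user-space sketch with hypothetical names (`struct toy_disk`, `balloc`, `bfree`, `NICFREE`, `NBLOCKS`); it assumes the classic convention that entry 0 of each list is the link to the next list block and that block number 0 is an end-of-chain sentinel.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NBLOCKS 32  /* hypothetical toy disk size */
#define NICFREE 4   /* hypothetical capacity of one free list */

struct free_list {
    uint32_t nfree;
    uint32_t free[NICFREE];  /* free[0] links to the next list block */
};

struct toy_disk {
    struct free_list sb;               /* list kept in the superblock */
    struct free_list blocks[NBLOCKS];  /* block contents when used as a link */
};

static void toy_disk_init(struct toy_disk *d)
{
    memset(d, 0, sizeof *d);
    d->sb.nfree = 1;  /* free[0] == 0 is the end-of-chain sentinel */
}

/* Free a block: if the superblock list is full, the block becomes a link
 * block that stores the current list. */
static void bfree(struct toy_disk *d, uint32_t blk)
{
    assert(blk > 0 && blk < NBLOCKS);
    if (d->sb.nfree == NICFREE) {
        d->blocks[blk] = d->sb;  /* write the superblock list into the block */
        d->sb.nfree = 0;
    }
    d->sb.free[d->sb.nfree++] = blk;
}

/* Allocate a block; returns 0 when the file system is full. */
static uint32_t balloc(struct toy_disk *d)
{
    assert(d->sb.nfree > 0);
    uint32_t blk = d->sb.free[--d->sb.nfree];
    if (blk == 0) {           /* hit the sentinel: no free blocks */
        d->sb.nfree = 1;      /* keep the sentinel in place */
        return 0;
    }
    if (d->sb.nfree == 0)     /* last entry: it is a link block */
        d->sb = d->blocks[blk];
    return blk;
}
```

Freeing blocks 1 through 6 and then allocating returns 6, 5, 4, 3, 2, 1: block 4 became the link block when the list filled, and allocating it reloads the older list into the superblock.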

Analysis of System V File System (1)

- Distinguished by its simple design. This simplicity creates problems in the areas of reliability, performance, and functionality.
- Reliability problem
  - Only one copy of the superblock: superblock corruption problem.
- Performance problem
  - Long seeks between two areas (superblock and data blocks) increase I/O times.
  - Inodes are allocated randomly, with no attempt to group related inodes (files in the same directory): random disk access.
  - As files are created and deleted, the order of blocks in the free list becomes completely random -> disk block allocation problem. This slows down sequential access operations.

Analysis of System V File System (2)

- Disk block size problem
  - SVR2 (512 bytes), SVR3 (1024 bytes).
  - Small blocks require many indirect block accesses.
  - Increasing the block size allows more data to be read in a single disk access -> improves performance. On the other hand, it also wastes more disk space.
  - Need for a more flexible approach to allocating space to files.
- Functionality problem
  - File name size: 14 bytes.
  - Number of inodes: 65,535 -> restricts the maximum number of files in the file system.

The Fast File System (FFS)

- Retained the old file system abstraction.
- Changed the underlying implementation.
  - Bigger blocks (4096 bytes or larger).
  - Groups related information using cylinder groups (reduces long seeks).
  - Uses a bitmap for maintaining free blocks.
  - Variable-length directory entries and long file names (max 255 bytes).
- Up to 10 times faster than the old file system.
- Functional enhancements
  - Long file names -> 255 bytes.
  - Symbolic links (may refer to a file on a different file system).
  - Rename (previously required link and unlink commands - hard link).
  - Quotas.

Hard Disk Structure

[Figure: hard disk with stacked platters, labeled with sector 0 and sector 1; head 0, head 1, head 2; track 0, track 1, track 2; cylinder 0 and cylinder 1.]

- UNIX views the disk as a linear array of blocks (each a multiple of sectors).
- Block addresses increase first by sector #, then by head #, then by cylinder #.

File System Layout (FFS)

- FFS further divides the partition into one or more cylinder groups, each containing a small set of consecutive cylinders.
- This allows UNIX to store related data in the same cylinder group (e.g., an inode and its data blocks: avoids long seeks).
- The fields of a cylinder group:
  - A redundant copy of the superblock, at a varying offset.
  - Space for a static number of inodes (default: one inode per 2048 bytes).
  - The bitmap of available blocks (cf. the free list).
  - Summary information describing the usage of data blocks.

[Figure: partition layout. B (boot area), then Cylinder Group 0 ... Cylinder Group n, each containing a copy of the superblock (S).]

Blocks and Fragments (1)

- Different file systems on the same machine can have different block sizes.
- The block size is a power of two, with a minimum of 4096 bytes. Most implementations add an upper limit of 8192 bytes (compare with System V: 512 or 1024 bytes).
- With such blocks, 2**32 bytes (4 gigabytes) can be addressed with only two levels of indirection, so FFS does not use the triple indirect block (although some variants use it to support file sizes greater than 4 gigabytes).
- Typical UNIX systems have numerous small files that need to be stored efficiently.
- FFS solves this problem by allowing each block to be divided into fragments (1, 2, 4, or 8 fragments per block, with a lower bound of 512 bytes each).

Blocks and Fragments (2)

- The last block of a file is typically not a complete disk block.
- Write system call:
  - If enough space is left in an already allocated fragment => write into the available space.
  - If no fragmented block is available => full blocks are written, and the remaining new data is written to a block with the necessary fragments or to a full block.
- As the file grows, this scheme generates frequent copying.
- FFS allows only direct blocks to contain fragments.

Bits in map        XXXX   XXOO   OOXX   OOOO     (X = in use, O = free)
Fragment numbers   0-3    4-7    8-11   12-15
Block numbers      0      1      2      3
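
A search over the fragment map above can be sketched as follows. This is an illustration, not FFS code: `find_frag_run` is a hypothetical helper, the map is held as one byte per fragment (nonzero = in use) rather than as packed bits, and it assumes that a run of fragments must lie inside a single block.

```c
#include <assert.h>
#include <stdint.h>

#define FRAGS_PER_BLOCK 4  /* as in the map above: 4 fragments per block */

/* Return the first fragment number of a run of `want` free fragments that
 * lies entirely inside one block, or -1 if no block has such a run. */
static int find_frag_run(const uint8_t *in_use, int nfrags, int want)
{
    assert(in_use != 0 && want >= 1 && want <= FRAGS_PER_BLOCK);

    for (int base = 0; base + FRAGS_PER_BLOCK <= nfrags;
         base += FRAGS_PER_BLOCK) {
        int run = 0;  /* length of the current free run in this block */
        for (int i = 0; i < FRAGS_PER_BLOCK; i++) {
            run = in_use[base + i] ? 0 : run + 1;
            if (run == want)
                return base + i - want + 1;
        }
    }
    return -1;
}
```

On the map above, a request for 2 fragments is satisfied at fragment 6, but a request for 3 or 4 fragments only at fragment 12: fragments 6-9 are contiguous and free, yet they straddle blocks 1 and 2 and so cannot be used as one unit.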

Allocation Policies

- FFS aims to colocate related information on the disk and to optimize sequential access.
- Block allocation policies, tried in order:
  1. Use the next available block rotationally closest to the requested block on the same cylinder.
  2. If no blocks are available on the same cylinder, use a block within the same cylinder group.
  3. If that cylinder group is entirely full, hash the cylinder group number to choose another cylinder group in which to look for a free block.
  4. Finally, if the hash fails, apply an exhaustive search to all cylinder groups.

History of Linux File Systems

- First Linux file system: Minix File System
  - 14-character file names
  - Maximal file size: 64 MB
  - Lacking in performance
- April 1992: Extended File System (Extfs)
  - Variable-length file names (up to 255 characters)
  - Maximal file size: 2 GB
  - Most successful in Linux community
- January 1993: New Extended File System (renamed as Second Extended File System - Ext2fs)
- January 1993: XIA File System

The Virtual File System (VFS) (1)

- Software layer in the kernel that provides the file system interface to user-space programs.
- Manages kernel-level file abstractions in one format for all file systems (implements file-system-independent operations).
- Receives file-oriented system calls from user level: write, open, stat, link.
- VFS switch: translates them into internal calls that interface with a specific file system module (the file-system-dependent part).
- Each file system provides methods to the VFS that the VFS switch can call; many are optional.
- Also receives requests from other parts of the kernel, mostly from memory management.

The Virtual File System (VFS) (2)

[Figure: kernel layering before and after the VFS. Before the VFS: Minix File System -> Buffer Cache -> Device Driver. After the VFS: VFS -> Minix and Ext2fs -> Buffer Cache -> Device Driver.]

The Virtual File System (VFS) (3)

- The VFS maintains its own superblock and inode (i.e., in-core inode) structures; each file-system-dependent translator converts the contents of the external superblock and file descriptor (i.e., disk inode) to and from the VFS superblock and inode.
- The VFS assumes the following about the file system stored on the disk:
  - The first sector of the disk is a boot block, although the VFS does not use it.
  - A superblock contains disk-specific information, such as the number of bytes in a disk block.
  - External file descriptors (disk inodes) on the disk describe the characteristics of each file.
  - Data blocks linked into each file contain the data.
- Before the VFS can manage a particular file system type, the type has to be registered with the register_filesystem() call (fs/super.c); this is usually done when the machine is booted.

Mounting a File System (1)

- Before a file system can be used, it has to be mounted.
- Mounting attaches a new file system to an existing directory hierarchy, which allows heterogeneous file systems to be combined in the system’s directory hierarchy.
- When a file system is mounted, the VFS creates an instance of the super_block data structure (defined in include/linux/fs.h) to hold information about the file system.
- The VFS then calls the new file system’s read_super() function (defined in fs/super.c) to retrieve the information contained in the new file system’s superblock and save it into the super_block structure.
- After the mount, the VFS can use the super_operations (defined in include/linux/fs.h) functions to handle the on-disk superblock and inodes.

Mounting a File System (2)

[Figure: the super_block structure holds FS-independent data, a pointer to the super operations, and FS-dependent data. The super_operations table contains read_inode, notify_change, write_inode, put_inode, put_super, write_super, statfs, and remount_fs.]

Mounting a File System (3)

struct super_block {
    …..
    unsigned long s_blocksize;
    …..
    struct file_system_type *s_type;
    struct super_operations *s_op;
    …..
    union {   /* File system specific information */
        struct minix_sb_info minix_sb;
        struct ext2_sb_info ext2_sb;
        struct hpfs_sb_info hpfs_sb;
        struct ntfs_sb_info ntfs_sb;
        struct msdos_sb_info msdos_sb;
    } u;
    …..
};

/* Disk dependent functions such as handling
   on-disk superblock, inodes, etc */
struct super_operations {
    void (*read_inode) (struct inode *);
    void (*write_inode) (struct inode *);
    void (*put_inode) (struct inode *);
    void (*delete_inode) (struct inode *);
    int (*notify_change) (struct dentry *, …);
    void (*put_super) (struct super_block *);
    void (*write_super) (struct super_block *);
    int (*statfs) (struct super_block *, … );
    int (*remount_fs) (struct super_block *, ..);
    void (*clear_inode) (struct inode *);
    void (*umount_begin) (struct super_block *);
};
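
The way the VFS switch reaches file-system-dependent code through `s_op` can be illustrated with a self-contained user-space model. Everything below is hypothetical (`toy_super_block`, `toy_super_operations`, `toyfs_statfs`, `vfs_statfs`); the structures are heavily simplified and the signatures do not match the kernel's.

```c
#include <assert.h>
#include <stddef.h>

struct toy_super_block;

/* Table of FS-dependent methods; entries may be NULL (optional methods). */
struct toy_super_operations {
    int  (*statfs)(struct toy_super_block *sb, unsigned long *free_blocks);
    void (*put_super)(struct toy_super_block *sb);
};

struct toy_super_block {
    unsigned long s_blocksize;                /* FS-independent data */
    const struct toy_super_operations *s_op;  /* FS-dependent methods */
    union {                                   /* FS-dependent data */
        struct { unsigned long free_blocks; } toyfs;
    } u;
};

/* FS-dependent part: knows how to interpret the union member. */
static int toyfs_statfs(struct toy_super_block *sb, unsigned long *free_blocks)
{
    *free_blocks = sb->u.toyfs.free_blocks;
    return 0;
}

static const struct toy_super_operations toyfs_sops = {
    .statfs    = toyfs_statfs,
    .put_super = NULL,  /* optional method left unimplemented */
};

/* FS-independent part: dispatches through the operations table. */
static int vfs_statfs(struct toy_super_block *sb, unsigned long *free_blocks)
{
    assert(sb != NULL && free_blocks != NULL);
    if (sb->s_op == NULL || sb->s_op->statfs == NULL)
        return -1;  /* this file system does not support the operation */
    return sb->s_op->statfs(sb, free_blocks);
}
```

The caller of `vfs_statfs` never names the file system; adding another file system type only requires another operations table and another union member.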

VFS Inode (1)

- Every file operation is made on an inode. The kernel translates file pathnames into inode numbers.
- The VFS maintains a table of inodes in use.
- Inodes are referenced through the structure inode (defined in include/linux/fs.h), which holds:
  - FS-independent data
  - A pointer to FS-dependent operations (inode_operations, defined in include/linux/fs.h)
  - FS-dependent data
- Usually there is one operation table per inode type (regular file, directory, symbolic link, …).

VFS Inode (2)

[Figure: the inode structure holds FS-independent data, a pointer to the inode operations, a pointer to its super_block, and FS-dependent data. The inode_operations table contains create, lookup, link, unlink, symlink, mkdir, rmdir, mknod, rename, readlink, follow_link, bmap, truncate, and permission.]

VFS Inode (3)

struct inode_operations {
    struct file_operations * default_file_ops;
    int (*create) (struct inode *, …);
    struct dentry * (*lookup) (struct inode *, …);
    int (*link) (struct dentry *, …);
    int (*unlink) (struct inode *,struct dentry *);
    int (*symlink) (struct inode *, …);
    int (*mkdir) (struct inode *, …);
    int (*rmdir) (struct inode *,struct dentry *);
    int (*mknod) (struct inode *, …);
    int (*rename) (struct inode *, …);
    int (*readlink) (struct dentry *, char *,int);
    …..
};

struct inode {
    …..
    uid_t i_uid;
    gid_t i_gid;
    …..
    time_t i_atime;
    …..
    unsigned long i_blksize;
    unsigned long i_blocks;
    …..
    struct inode_operations *i_op;
    struct super_block *i_sb;
    …..
    union {   /* File system specific information */
        struct pipe_inode_info pipe_i;
        struct minix_inode_info minix_i;
        struct ext2_inode_info ext2_i;
        …..
    } u;
    …..
};

Opening a File - VFS to Disk Translation

[Figure: path from a POSIX API system call to the disk. From task_struct, the fd table (struct files_struct) points into the file structure table (struct file), which references the Linux inode (struct inode), which is backed by the disk inode on the disk.]

Inode Cache (1)

- The inode cache is implemented as a hash table (fs/inode.c).
  - Hash key: inode number and device identifier.
- Inodes are connected by two separate lists:
  - Hash list
  - Type list
    - in_use: valid inode, hashed if i_nlink > 0.
    - dirty: valid inode, hashed if i_nlink > 0, dirty.
    - unused: ready to be re-used; not hashed.
- Inode state transitions:
  - unused -> hash: allocate a new inode. iget() calls ext2_read_inode().
  - dirty -> in_use: synchronize inodes. Calls ext2_write_inode().
  - hash -> unused: clear inodes. If i_nlink = 0, iput() calls ext2_delete_inode() when i_count falls to 0.

Inode Cache (2)

[Figure: inode_hashtable[ ] (HASH_SIZE buckets) chains inodes through i_hash, while the inode_in_use, inode_dirty, and inode_unused lists chain them through i_list. Labeled transitions: Iget() with read_inode; mark_inode_dirty (used -> dirty); write_inode (sync) (dirty -> used); Iput() with clear_inode and delete_inode.]

Directory Cache

- Speeds up access to commonly used directories (fs/dcache.c).
- The directory cache consists of a hash table.
  - Hash key: device number and the directory’s name.
- Two-level LRU list:
  - When an entry is first looked up, it is added to the end of the first-level list.
  - If the entry is accessed again, it is promoted to the end of the second-level list.

Buffer Cache Management (1)

- Hashed by device number and block number.
- Two functional parts:
  - Free block list (free_list)
    - One list for each buffer size (512, 1K, …, 8K).
  - Hash table (lru_list: LRU list for each buffer type)
    - Three LRU lists: CLEAN, LOCKED, DIRTY.
    - If the buffer is found in the hash list, put_last_lru().
    - If it is not in the hash list, allocate a buffer from the free list for its size: remove it from the free list and insert it into the LRU list.

Buffer Cache Management (2)

[Figure: free_list[0], free_list[1], ..., free_list[6], each heading a doubly linked list of buffer_head structures chained through b_prev_free and b_next_free.]

Buffer Cache Management (3)

[Figure: the hash function selects a hash_table bucket; each bucket chains Buffer_head structures through b_next and b_pprev, ending in NULL. The same Buffer_head structures are also linked through b_prev_free and b_next_free on an LRU list such as lru_list[BUF_DIRTY].]

Ext2 File System

- Influenced by BSD FFS (also called UFS - UNIX File System).
- Divides the partition into block groups (cf. cylinder groups).
- Keeps a) data blocks close to their inodes and b) file inodes close to their directory inode -> reduces seek time and speeds up access to data.

Ext2fs Super Block

- Contains a description of the basic size and shape of the file system.
- Usually the superblock in Block Group 0 is read (size: 1024 bytes).
- Each block group contains a duplicate copy in case of file system corruption.
- The superblock contains (struct ext2_super_block: include/linux/ext2_fs.h):
  - number of blocks (total and reserved)
  - number of inodes
  - number of free blocks and inodes
  - block and fragment size
  - number of blocks and inodes per group
  - last mount and write times
  - FS state
  - current and maximal mount count
  - last check time and check interval
  - mount option default values

Group Descriptors

- Provide information on the block groups.
- All descriptors are duplicated in each group (size: 32 bytes).
- Each descriptor describes a block group (struct ext2_group_desc: include/linux/ext2_fs.h):
  - block bitmap location
  - inode bitmap location
  - inode table location
  - number of free blocks
  - number of free inodes
  - number of allocated directories (used by the allocation routines)

Bitmaps and Inode Table

- The size of each bitmap is one block.
  - This restricts the size of a block group to 8192 blocks for blocks of 1024 bytes (1024 bytes * 8 bits per byte).
- The inode table is a vector of inodes, each 128 bytes in size.
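
The limit follows from one bit per block in a one-block bitmap. A small sketch of the arithmetic, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* A one-block bitmap has block_size * 8 bits, one bit per block. */
static uint32_t max_blocks_per_group(uint32_t block_size)
{
    assert(block_size > 0);
    return block_size * 8;
}

/* Largest block group in bytes for a given block size. */
static uint64_t max_group_bytes(uint32_t block_size)
{
    return (uint64_t)max_blocks_per_group(block_size) * block_size;
}

/* Number of inodes that fit in one block of the inode table. */
static uint32_t inodes_per_block(uint32_t block_size, uint32_t inode_size)
{
    assert(inode_size > 0);
    return block_size / inode_size;
}
```

For 1024-byte blocks this gives 8192 blocks (8 MB) per group and 8 inodes of 128 bytes per inode table block; for 4096-byte blocks the group limit grows to 32,768 blocks (128 MB).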

Directory Structure in Ext2fs

- Directory: linked list of variable-length entries.
- Each entry contains (struct ext2_dir_entry: include/linux/ext2_fs.h):
  - the inode number
  - the entry length (rounded up to a multiple of 4)
  - the file name length
  - the file name (maximum of 255 characters)

Example

Offset   Inode   Entry length   Name length   Name
0        i1      16             5             file1
16       i2      40             14            long_file_name
56       i3      12             2             f2
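
The entry lengths in the example can be related to the name lengths. The sketch below assumes an 8-byte fixed header (4-byte inode number, 2-byte entry length, 2-byte name length) in front of the name; `min_rec_len` and `TOY_DIRENT_HEADER` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_DIRENT_HEADER 8u  /* assumed: inode (4) + rec_len (2) + name_len (2) */

/* Smallest entry length that can hold a name of name_len bytes:
 * header plus name, rounded up to a multiple of 4. */
static uint16_t min_rec_len(uint16_t name_len)
{
    assert(name_len >= 1 && name_len <= 255);
    return (uint16_t)((TOY_DIRENT_HEADER + name_len + 3u) & ~3u);
}
```

This yields 16 for file1 and 12 for f2, as in the example. For long_file_name the minimum is 24, while the example shows 40: the stored entry length is the distance to the next entry, so an entry may be longer than its minimum, presumably because it has absorbed unused space.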

Inode Structure in Ext2fs

- Block size: 1K ~ 4K
- Data block pointers: i_data[15]
  - direct blocks: 12
  - indirect blocks: 3
- cf. Sys V: i_addr[13]; UFS: i_db[12] and i_ib[3]
- struct ext2_inode: include/linux/ext2_fs.h

Data Allocation in Ext2fs (1)

- Block groups are used to cluster together related inodes and data.
- Uses a “next-32-bits” search (within 32 blocks) when allocating a block’s successor (target-oriented allocation).
- If that fails, searches forward:
  - within the group for an entire free byte in the bitmap (8 free blocks);
  - if that fails, within the group for any free bit in the bitmap;
  - if that fails, through subsequent groups in a similar manner.
- Pre-allocates up to 8 adjacent blocks when allocating a new block:
  - de-allocates the extra blocks on file close;
  - pre-allocation achieves good performance;
  - pre-allocation hit rates are around 75% even on very full file systems;
  - #define EXT2_PREALLOCATE 8 -> include/linux/ext2_fs.h.
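
The three-stage search inside one block group can be sketched as follows. This is a simplified model rather than the ext2 source: `find_free_in_group` and `test_bit` are hypothetical, a set bit means "block in use", and the first stage is interpreted as looking at the 32 blocks starting at the goal.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if block i is marked in use in the bitmap. */
static int test_bit(const uint8_t *bitmap, int i)
{
    return (bitmap[i >> 3] >> (i & 7)) & 1;
}

/* Find a free block in one group near `goal`; returns its index within the
 * group, or -1 so that the caller can move on to the next group. */
static int find_free_in_group(const uint8_t *bitmap, int nblocks, int goal)
{
    assert(nblocks > 0 && nblocks % 8 == 0 && goal >= 0 && goal < nblocks);

    /* Stage 1: a free block within the next 32 blocks from the goal. */
    int limit = goal + 32 < nblocks ? goal + 32 : nblocks;
    for (int i = goal; i < limit; i++)
        if (!test_bit(bitmap, i))
            return i;

    /* Stage 2: search forward for an entirely free byte (8 free blocks). */
    for (int b = goal >> 3; b < (nblocks >> 3); b++)
        if (bitmap[b] == 0)
            return b << 3;

    /* Stage 3: search forward for any free bit. */
    for (int i = goal; i < nblocks; i++)
        if (!test_bit(bitmap, i))
            return i;

    return -1;
}
```

Preferring a whole free byte over the first free bit trades a slightly longer seek for eight contiguous blocks, which is what makes the 8-block pre-allocation likely to succeed.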

Data Allocation in Ext2fs (2)

- Achieves good locality:
  - of related files, through block groups;
  - of related blocks, through 8-bit clustering of block allocations.
- Reduces CPU overhead:
  - The size of the bitmaps that must be searched is limited.
  - Block pre-allocation reduces allocation overhead.
  - The physical and logical location of the last block is recorded in each inode, so sequential allocations are detected and handled rapidly.

Extension of Ext2fs

- File undeletion
- Access Control List
  - File protection per user and/or per group
- Automatic file compression
  - Files stored in gzipped format
  - Decompression on the fly during reads

Need for New File System (1) : Technologies

- Three important components in file system design: processors, disks, and main memory.
- CPU speed is increasing at an exponential rate, while the improvement in disk speed is slower.
  - Disk improvements have focused mainly on transfer bandwidth, with no major improvements in access time (e.g., seek time) -> no major speedup in applications.
- Main memory is increasing in size, which makes a large file cache possible.
  - The cache absorbs a greater fraction of the read requests (80 ~ 90 % hit ratio).
  - Disk traffic will become more and more dominated by writes.
  - The file cache can be used as a write buffer that allows blocks to be collected and then written in a single transfer.
  - A large buffer may result in large data loss on a crash -> solution: periodic update or synchronous write.

Need for New File System (2) : Workloads

- Among different file system workloads, one of the most difficult for file system design is found in office and engineering environments.
- Office and engineering applications:
  - Tend to be dominated by accesses to small files (only a few kilobytes).
  - Small files usually result in small random disk I/Os.
  - The creation and deletion times for such files are often dominated by updates to file system metadata.
- Workloads dominated by sequential accesses to large files: supercomputing applications or multimedia applications.
  - A number of techniques exist to ensure that files are laid out sequentially on disk (e.g., FFS, Ext2, etc.).
  - Use large block sizes.
  - I/O performance tends to be limited by the disk I/O bandwidth.

Problems with Traditional File Systems (1)

- Performance problems
  - Kernel algorithms force a large number of synchronous I/O operations, resulting in extremely long completion times.
    - FFS writes data blocks asynchronously, while metadata is written synchronously -> simplified crash recovery but reduced write performance.
  - The disk layout in FFS restricts it to using only a fraction of the total disk bandwidth.
    - FFS is designed to read or write a single block in each I/O request.
    - If two blocks are on consecutive sectors on the disk, the disk rotates past the start of the next block before the kernel can issue the next request (due to kernel processing time).
    - This introduces the concept of rotational delay (or rotdelay): block interleaving.
    - Typically a complete rotation takes around 15 ms and the kernel needs 4 ms. If the block size is 4 Kbytes and each track has 8 blocks, rotdelay = 2.
    - With rotdelay = 2, only every third block is transferred in one pass, which restricts the disk bandwidth to about 1/3 of the total bandwidth.

Problems with Traditional File Systems (2)

- Solutions: 1) read/write the entire track in each operation; 2) use an on-disk cache (disk reads store the entire track in the cache).
- Write operations still suffer from the rotational delay problem.

Problems with Traditional File Systems (3)

- Predominance of disk writes.
  - A large buffer cache absorbs many of the disk read requests.
  - For consistency, the update daemon periodically flushes dirty blocks to disk.
  - Some disk operations require synchronous disk updates.
- Many of the synchronous writes turn out to be quite unnecessary.
  - Due to strong locality, the same block is very likely to be modified again.
  - Many files have a very short lifetime.
- Disk head seek problem.
- Since writes account for most of the disk activity, the operating system needs to find other ways to solve these problems.

Problems with Traditional File Systems (4)

- Metadata updates
  - In order to prevent file system corruption, metadata updates need to be written in a precise order.
  - For example, if a file is deleted, the kernel must remove the directory entry, free the inode, and free the disk blocks used by the file.
  - In traditional file systems, such ordering is achieved through synchronous writes.
  - Since the attributes (e.g., inode) of a file are stored separately from the file’s contents, it takes several disk I/Os, each preceded by a seek, to complete one file operation (e.g., 5 I/Os to create a new file in FFS).
  - When writing small files, less than 5 % of the disk’s potential bandwidth is used for new data; the rest of the time is spent seeking.

Problems with Traditional File Systems (5)

- Crash recovery
  - Ordering metadata writes helps control the damage caused by a system crash but does not eliminate it.
  - Sequence of operations to rebuild the file system by fsck:
    - Read and check all inodes and build a bitmap of used data blocks.
    - Record inode numbers and block addresses of all directories.
    - Validate the structure of the directory tree, making sure that all links are accounted for.
    - Validate directory contents to account for all the files.
    - If any directories could not be attached to the tree in phase 2, put them in the lost+found directory.
    - If any file could not be attached to a directory, put it in the lost+found directory.
    - Check the bitmaps and summary counts for each cylinder group.
  - Because fsck must examine the whole file system, systems may experience a long delay before they can restart after a crash.

Problems with Traditional File Systems (6)

- Security
  - The UNIX access control mechanism (permission bits) is not enough in large computing environments.
  - A finer-granularity control scheme is needed.
  - ACL (Access Control List): allows the file owner to explicitly allow or restrict different types of access for specific users and groups.
  - UNIX inodes are not designed to hold such a list, so the file system must find other ways of implementing ACLs.
- Size
  - Unnecessary restrictions on the size of the file system and of individual files.

Journaling Approach (1)

- Record all file system changes in an append-only log file.
- Uses a database logging technique: keep track of changes to make sure that all updates on the disk are done safely.
- The log is written sequentially, in large chunks at a time, which results in efficient disk utilization and high performance.
- After a crash, only the log needs to be examined, which means quicker recovery and higher reliability.
- Write performance can be improved because sequential log writes need no seek operations.
- Basic characteristics:
  - What to log? All modifications including data blocks (Logging File System) vs. only metadata changes (Journaling File System).
  - Log operations or values?

Journaling Approach (2)

- Log-enhanced file systems vs. log-structured file systems.
  - Log-enhanced file system: retains the traditional on-disk structures, such as inodes and superblocks, and uses the log as a supplementary record.
  - Log-structured file system: the log is the only representation of the file system on disk -> requires full logging (data as well as metadata).
- Garbage collection: the log has a finite size (logically circular file).
- Group commit: the log needs to be written in large chunks. There is a tradeoff between performance and reliability.
- Retrieval: an efficient way (indexing technique) of retrieving data from the log is needed in case of a cache miss.

Static File System vs. Logging File System

[Figure: file header and file data (blocks 1, 2, 3) before and after a write that updates blocks 2 and 3, compared for a static file system and a logging file system.]