File system concepts

Source: csl.skku.edu/uploads/SWE3015S14/swe3015s14fs.pdf


File system concepts

• Ease of searching for specific data

– File to group data: variable size, naming

– Directory to group files

[Figure: a directory maps each file name to the offset of that file's data]

Unix file systems history

• Unix file system (System V, 1974)
• Berkeley fast file system (BSD 4.2, 1984)
• HFS (1985)
• Minix file system (Minix, 1987)
• Journaling file system (AIX, 1990)
• Log-structured file system (1991)
• Extended file system (Linux, 1992)
• Ext2 file system (1993)
• XFS (IRIX, 1994)
• HFS+ (1998)
• Journaling file system (OS/2, 1999)
• Journaling file system (Linux, 2001)
• Ext3 file system (2001)
• XFS (Linux, 2002)
• Ext4 file system (2008)
• BTRFS (2009)
• F2FS (2012)

DOS/Windows file systems history

• File Allocation Table

– FAT (8bit, 1977) / FAT12 (1980) / FAT16 (1984)

Originally targeted at floppy disks

– HPFS (OS/2, 1989)

– FAT32/VFAT (1996)

– exFAT (2006)

• NTFS

– Since Windows NT 3.1 (1993)

Network/distributed file systems

• Network file systems

– Mount remote file system to local directory

– Network File System

– Server Message Block / CIFS (Samba)

– AppleTalk Filing Protocol

• Distributed file system

– Pool the storage devices of several machines to build one large file system

– Andrew File System

– Google file system

– Hadoop file system (HDFS)

File system interfaces

• R. C. Daley, P. G. Neumann, A General-Purpose File System For Secondary Storage, 1965
– Defined what a file system is and how it works
– Concepts of user, file, directory, directory hierarchy
– Backup storage and its usage
  • Incremental backup / weekly full backup recovery

• POSIX [IEEE 1003, 1988; name suggested by Richard Stallman]

– Standardized file system interfaces

– Standard I/O API

– Direct I/O API

– Memory mapped I/O API

File system interface : stream I/O

• Buffered and line-by-line I/O interface

• Header: <stdio.h>

• Handle: FILE *f;

• Functions

– fopen, fclose

– fprintf, fscanf

– fgets, fputs

– fread, fwrite

– fseek, ftell

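#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *fp;
    char *str;

    /* fopen returns NULL on failure. */
    if ((fp = fopen("main.c", "r")) != NULL)
    {
        str = malloc(4096);
        /* fgets stores at most size-1 characters and NUL-terminates the line. */
        while (str != NULL && fgets(str, 4096, fp))
            printf("%s", str);
        fclose(fp);
        free(str);
    }
    return 0;
}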

File system interface : direct I/O

• Header: <fcntl.h>, <unistd.h>, …

• Handle: int fd;

• Functions

– open, creat, close

– read, write

– lseek, lseek64

– posix_fallocate, posix_fadvise

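#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd;
    void *buf;
    ssize_t n;

    /* open takes flag constants, not a mode string, and returns -1 on failure. */
    if ((fd = open("main.c", O_RDONLY)) >= 0)
    {
        buf = malloc(4096);
        /* Write only the n bytes that read actually returned. */
        while (buf != NULL && (n = read(fd, buf, 4096)) > 0)
            write(STDOUT_FILENO, buf, n);
        close(fd);
        free(buf);
    }
    return 0;
}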

File system interface : mmap I/O

• Memory access to read/write a file

• Header: <sys/mman.h>

• Handle: void *ptr;

• Functions

– void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset)

– int munmap(void *addr, size_t length)

File system interface : mmap I/O

• Example

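#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    int fd;
    off_t length;
    void *buf;

    if ((fd = open("main.c", O_RDONLY)) >= 0)
    {
        /* Seeking to the end yields the file size in bytes. */
        length = lseek(fd, 0, SEEK_END);
        /* mmap reports failure with MAP_FAILED, not NULL; a zero length is invalid. */
        if (length > 0 &&
            (buf = mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0)) != MAP_FAILED)
        {
            write(STDOUT_FILENO, buf, length);
            munmap(buf, length);
        }
        close(fd);
    }
    return 0;
}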

Stream I/O illustrated

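[Figure: call flow across the Application, libc, VFS, and page cache layers. fopen → open → sys_open; fgets → read → sys_read fills the libc buffer ("Hello, Guys"), and a later fgets is served from that buffer; fprintf changes the buffered text to "Hello, World", and fflush → write → sys_write pushes it down; fclose → close → sys_close.]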

Memory mapped I/O illustrated

[Figure: call flow across the Application, libc, VFS, and page cache layers. mmap → sys_mmap maps the file (sample content: the opening lines of the Korean national anthem); reading c=buf[0] raises a page fault that is served by aops->readpage(); writing buf[1]='\n' dirties the page, which aops->writepage() writes back on replacement; munmap → sys_munmap.]

File system design

File system design elements

• Space allocation

– Contiguous allocation vs. fragmented allocation

– File to block mapping management

– Managing free space

• Name space management

– File naming: name length, case sensitivity, …
  • e.g., FAT uses the 8.3 naming system; early UNIX file systems limit names to 14 characters
– Directory hierarchy
  • Single-level array
  • Tree-structured multi-level directory
  • Graph-structured directory

Disk layout and file abstraction

• Abstractions in file system

– File data

– Inode: per-file metadata
  • name, size, data location, modified time, owner, …
– Directory hierarchy
– Superblock
– Metadata for free space management

[Figure: disk layout holding file data blocks (File a, 0 and File a, 1), Inode a, Dir b, and the superblock]

Allocated/free space management

• Bitmap approach (ext*fs)
– Low storage overhead: one bit per block
– High free space search cost: the bitmap has to be scanned
• Linked list approach (FAT)
– Low free space search cost

Example bitmap: 11011000
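The bitmap approach can be sketched in a few lines of C. This is a minimal illustration rather than the allocator of any real file system: the names `bitmap_test`, `bitmap_alloc`, and `bitmap_free` are hypothetical, and bit n of the map is assumed to describe block n, least significant bit first. The linear scan in `bitmap_alloc` is where the high free space search cost comes from.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if block `n` is marked allocated in the bitmap, else 0. */
int bitmap_test(const uint8_t *bitmap, size_t n)
{
    return (bitmap[n / 8] >> (n % 8)) & 1;
}

/*
 * Linear scan for the first free block (bit == 0), marking it allocated.
 * Returns the block number, or -1 if every block is in use.
 */
long bitmap_alloc(uint8_t *bitmap, size_t nblocks)
{
    for (size_t n = 0; n < nblocks; n++) {
        if (!bitmap_test(bitmap, n)) {
            bitmap[n / 8] |= (uint8_t)(1u << (n % 8));
            return (long)n;
        }
    }
    return -1;
}

/* Marks block `n` free again. */
void bitmap_free(uint8_t *bitmap, size_t n)
{
    bitmap[n / 8] &= (uint8_t)~(1u << (n % 8));
}
```

If the example bitmap 11011000 is read left to right as blocks 0 to 7, it is the byte 0x1B under this bit order, and `bitmap_alloc` returns block 2, the first clear bit.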

Allocated/free space management

• Tree-based approach

– Inode and indirect blocks

– Extents: (start block number, contiguous blocks)

[Figure: an inode holds the file name and attributes, direct block pointers, and single, double, and triple indirect pointers; each indirect block points either to data blocks or to further indirect blocks]
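To make the inode-and-indirect-block tree concrete, the sketch below computes how many indirect blocks must be read to reach a given logical block. The constants are assumptions that the slides do not state: 12 direct pointers and 1024 pointers per indirect block (4 KB blocks with 4-byte block numbers, as in an ext2-style layout). The function name `indirection_level` is hypothetical.

```c
#include <stdint.h>

#define NDIRECT        12u     /* assumed number of direct pointers in the inode */
#define PTRS_PER_BLOCK 1024u   /* assumed: 4 KB block / 4-byte block pointer */

/*
 * Returns how many indirect blocks must be read to reach logical block `lbn`:
 * 0 = direct, 1 = single, 2 = double, 3 = triple indirect, -1 = beyond the limit.
 */
int indirection_level(uint64_t lbn)
{
    uint64_t span = 1;             /* blocks reachable at the current level */

    if (lbn < NDIRECT)
        return 0;
    lbn -= NDIRECT;

    for (int level = 1; level <= 3; level++) {
        span *= PTRS_PER_BLOCK;    /* 1024, 1024^2, 1024^3 */
        if (lbn < span)
            return level;
        lbn -= span;
    }
    return -1;
}
```

Under these assumptions the first 48 KB of a file need no extra reads, while blocks deep in a large file cost up to three additional block reads, which is one motivation for extents.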

Allocated/free space management

• Tree-based approach

– B-Tree (XFS, btrfs, …) • Useful for extent-based allocation

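[Figure: two B-trees of (start block, length) extents. File allocation: extents (1, 3), (7, 1), (10, 4) cover file blocks 1 to 8, keyed by the cumulative block counts 3, 4, 8. Free space: extents (14, 5), (4, 3), (8, 2), (0, 1), keyed by extent lengths 5, 3, 2.]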
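Extent-based mapping can be sketched as a lookup from a logical file block to a disk block. A real file system would keep the extents in a B-tree; the sketch below swaps in a sorted array with binary search, which performs the same key comparison on a simpler structure. The struct layout and the name `extent_lookup` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* One extent: `len` contiguous disk blocks starting at `start`,
 * holding the file's logical blocks [file_off, file_off + len). */
struct extent {
    uint32_t file_off;
    uint32_t start;
    uint32_t len;
};

/*
 * Binary search over extents sorted by file_off.
 * Returns the disk block for logical block `lbn`, or -1 for a hole or past EOF.
 */
long extent_lookup(const struct extent *ext, size_t n, uint32_t lbn)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (lbn < ext[mid].file_off)
            hi = mid;
        else if (lbn >= ext[mid].file_off + ext[mid].len)
            lo = mid + 1;
        else
            return (long)(ext[mid].start + (lbn - ext[mid].file_off));
    }
    return -1;
}
```

With the file allocation extents from the slide, (1, 3), (7, 1), (10, 4), stored at file offsets 0, 3, and 4 (counting from zero), logical block 3 maps to disk block 7 and logical block 7 maps to disk block 13.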

Directory implementation

• Array

– Easy to manage

– File name length limit

• Linear list

– Variable length file name

– Hard to manage

• Hash table

– Indexed by file name: fast search

– Hash collision

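[Figure: the same entries (RUN.EXE, README.TXT, DATA.DB) stored as a fixed-size array and as a linear list; only the list can also hold a long name such as "Long named file.docx"]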
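The hash table variant can be sketched as a chained hash table keyed by file name. Everything here is illustrative: the table size, the djb2 string hash, and the names `dir_add` and `dir_lookup` are assumptions, not the layout of any particular file system. A collision costs a walk along the bucket's chain.

```c
#include <stddef.h>
#include <string.h>

#define NBUCKETS 8   /* assumed small table size for illustration */

struct dir_entry {
    const char *name;
    unsigned long ino;
    struct dir_entry *next;      /* chain for names that hash to the same bucket */
};

struct directory {
    struct dir_entry *bucket[NBUCKETS];
};

/* djb2 string hash reduced to a bucket index. */
static unsigned name_hash(const char *name)
{
    unsigned long h = 5381;
    while (*name)
        h = h * 33 + (unsigned char)*name++;
    return (unsigned)(h % NBUCKETS);
}

/* Links a caller-owned entry at the head of its bucket's chain. */
void dir_add(struct directory *dir, struct dir_entry *e)
{
    unsigned b = name_hash(e->name);
    e->next = dir->bucket[b];
    dir->bucket[b] = e;
}

/* Returns the entry for `name`, or NULL if the directory has no such name. */
struct dir_entry *dir_lookup(const struct directory *dir, const char *name)
{
    for (struct dir_entry *e = dir->bucket[name_hash(name)]; e; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}
```

Because entries are linked rather than stored in fixed slots, names of any length fit, and a lookup compares only the names that share a bucket.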

Example: FAT

Characteristics

• Background: 1970s

– Personal computer

– Floppy disks (~ 1MB)

• 8.3 name space

– Case insensitive

– Long name format extension

• No protection mechanism

• No consistency guarantee

– Repaired after the fact with chkdsk or scandisk

• File data location management

– Linked list approach

• FAT entry (1 entry / 1 cluster)

– Next cluster number (cluster: 512 bytes ~ 32 KB)

– 0: free, -1: end of file

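[Figure: on-disk layout: boot block, FAT, backup FAT, root directory, data area. The directory entry for A.EXE points to the file's first cluster; the FAT entries 00003 → 00005 → 00006 → -1 chain the remaining clusters, and free entries hold 0.]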
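Reading a file through the FAT means following next-cluster links until the end marker. The sketch below uses the slide's simplified convention (0 for a free entry, -1 for end of file) rather than the reserved values of a real FAT12/16/32 table; the name `fat_chain` and the in-memory `int32_t` table are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define FAT_FREE 0            /* entry value for an unused cluster */
#define FAT_EOF  (-1)         /* entry value marking the last cluster of a file */

/*
 * Follows the chain that starts at `first`, storing up to `max` cluster
 * numbers in `out`. Returns the number of clusters in the chain, or -1 if
 * the chain leaves the table, hits a free entry, or loops (a corrupt FAT).
 */
long fat_chain(const int32_t *fat, size_t nclusters, int32_t first,
               int32_t *out, size_t max)
{
    size_t count = 0;
    int32_t c = first;

    while (c != FAT_EOF) {
        if (c < 0 || (size_t)c >= nclusters || fat[c] == FAT_FREE)
            return -1;
        if (count >= nclusters)     /* more links than clusters: a cycle */
            return -1;
        if (count < max)
            out[count] = c;
        count++;
        c = fat[c];
    }
    return (long)count;
}
```

For a table in which cluster 2 links to 3, 3 to 5, 5 to 6, and 6 holds -1, a file whose directory entry names cluster 2 occupies clusters 2, 3, 5, and 6. Reaching block k of a file takes k link hops, which is why random access is slow with linked allocation.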

Directory

• A special file made of 32-byte directory entries

• Entries

– File name: 11 bytes (name 8, extension 3)

– Attributes
  • Read-only, hidden, system, sub-directory, archive, long file name
– ctime, atime, mtime
  • Bit widths: year (7), month (4), day (5), hour (5), min (6), second/2 (5)

– First data cluster

– File size (max. 4 GB)
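The packed timestamp fields can be decoded with shifts and masks. The sketch assumes the usual FAT encoding, in which the date and time are two 16-bit words and the 7-bit year counts from 1980; the epoch is not stated in the slides. The struct and the name `fat_decode` are hypothetical.

```c
#include <stdint.h>

struct fat_timestamp {
    int year, month, day, hour, minute, second;
};

/* Unpacks the 16-bit date and 16-bit time words of a FAT directory entry. */
struct fat_timestamp fat_decode(uint16_t date, uint16_t time)
{
    struct fat_timestamp t;

    t.year   = 1980 + (date >> 9);     /* 7 bits, years since 1980 */
    t.month  = (date >> 5) & 0x0F;     /* 4 bits */
    t.day    = date & 0x1F;            /* 5 bits */
    t.hour   = time >> 11;             /* 5 bits */
    t.minute = (time >> 5) & 0x3F;     /* 6 bits */
    t.second = (time & 0x1F) * 2;      /* 5 bits, stored in 2-second units */
    return t;
}
```

Storing second/2 in 5 bits is why FAT timestamps have a two-second resolution.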

Long name extension

• Combining consecutive directory entries

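– First entry: normal directory entry (first 11 characters)
– LFN entries
  • File name segment: 26 bytes
  • Reserved critical entries
    – First data cluster
    – File type, sequence number, etc.

[Figure: the name "Introduction to File System.pptx" spread over consecutive entries. The normal entry holds "Introductio" with attributes, ctime, atime, mtime, first data cluster (FDC), and length; the LFN entries hold the segments "n to File" and "System.pptx" together with a sequence number, a file type, and a zeroed first-cluster field kept for compatibility.]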

Boot sector

• Bootstrap code

• File system summary

– File system size (sectors)

– Logical sector size

– Cluster size

– # of FATs

– Root directory entries
  • Root directory first cluster

– Volume label

– Drive number

Free space management

• Next free cluster pointer
– FAT32 maintains the last allocated cluster number
  • Makes it possible to undelete recently deleted files
– Produces fragmentation

[Figure: FAT entries 0 0 0 0 00003 00005 00006 -1, with a pointer marking the last allocated cluster]
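The next-free-cluster policy can be sketched as a scan that starts just after the last allocated cluster and wraps around once. As before, the table uses the slide's simplified values (0 for free, -1 for end of chain), and `fat_alloc` is a hypothetical name rather than a function of any FAT driver.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Scans for a free entry (value 0) starting just after *last and wrapping
 * around once. On success marks the cluster as end-of-chain (-1), updates
 * *last, and returns the cluster number; returns -1 if the table is full.
 */
long fat_alloc(int32_t *fat, size_t nclusters, size_t *last)
{
    for (size_t i = 1; i <= nclusters; i++) {
        size_t c = (*last + i) % nclusters;
        if (fat[c] == 0) {
            fat[c] = -1;
            *last = c;
            return (long)c;
        }
    }
    return -1;
}
```

Because the scan moves forward from the last allocation, a cluster that was just freed is not reused until the scan wraps around, so its old contents survive for a while. The same forward movement scatters files across the disk, which is the fragmentation noted above.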

Example: ext3

Characteristics

• Background

– Linux operating system: multi-user

– Evolving from desktop use toward servers and real-time systems

• Based on block groups

– Each block group works as an independent file system

– Inode, directory, file data

• Inodes for allocation and attribute management

• Journaling supported since ext3

Block group

• Ext file system = an array of block groups

• Block group size: determined by block size
– 4 KB block → 128 MB block group
– Why? The data block bitmap must fit in one block

Group descriptor fields: bg_block_bitmap, bg_inode_bitmap, bg_inode_table, bg_free_blocks_count, bg_free_inodes_count, …
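The 128 MB figure follows from the bitmap constraint: a bitmap that fits in one block has 8 bits per byte, one bit per block, so a group spans 8 × block size blocks. The sketch below only restates that arithmetic; the function names are hypothetical.

```c
#include <stdint.h>

/* Blocks per group: one bitmap block, 8 bits per byte, one bit per block. */
uint64_t blocks_per_group(uint64_t block_size)
{
    return 8 * block_size;
}

/* Bytes covered by one block group. */
uint64_t block_group_bytes(uint64_t block_size)
{
    return blocks_per_group(block_size) * block_size;
}
```

For 4096-byte blocks this gives 32768 blocks per group and 32768 × 4096 bytes = 128 MB; with 1024-byte blocks the same rule gives 8 MB groups.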

Inode

• Size: 128-byte / 256-byte (ext4)

Directory

• ext3 and later support HTree: a hashed index for directory entry lookup [Daniel Phillips, A Directory Index for Ext2, Linux Symposium '02]

Free space management

• Data block bitmap / inode bitmap in each block group

• Block allocation rule
– Top-level directory's inode
  • In an empty block group, if possible
  • Otherwise, in the block group with the most free inodes
– Other inodes and data blocks
  • In the block group where its inode or parent resides, if possible
  • Otherwise, in the nearest following block group with more free blocks than average

[Figure: top-level directories /usr, /home, /var, /etc placed in separate block groups]

Linux virtual file system

Storage implementation layers

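[Figure: storage implementation layers, top to bottom: Virtual File System; file systems (Ext4, FAT, NFS, FUSE, YAFFS); page cache; block device; device mapper; I/O scheduler (CFQ, noop, anticipatory); device driver. Side paths: network stack, MTD, FTL.]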

Introduction to VFS

• Holds different file system implementations under one interface

– IPC mechanisms (PIPE, FIFO, socket, …) too

• Abstracts generic file system implementations

– Directory traversal

– Page cache

• Interfaces with POSIX system call APIs

– File descriptor management

[Figure: a single directory tree under / whose entries live on different file systems: usr (ext4), home (btrfs), boot (squashfs), local (xfs), vmware.socket (socket)]

VFS implementation

• Maps system calls to file system specific methods

• Generic objects

– Superblock: specific file system

– Inode: specific file

– Dentry: a directory entry

– File: an open file

VFS operations

• File system specific operations
– Pseudo object-oriented programming model
  • File system specific object: e.g., sb->s_fs_info
  • File system specific operations: e.g., sb->s_op->sync_fs()
– Generic object + FS specific object + operations = VFS
• VFS internal objects
– To handle file system status
  • struct file_system_type: file system mounting
  • struct vfsmount: file system mount point
  • struct files_struct: file descriptor management (struct file *fd_array[NR_OPEN_DEFAULT])
  • struct fs_struct: process status (working dir, …)

– Dentry cache
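The pseudo object-oriented model, a generic object plus a file system specific object plus an operations table, can be imitated in user space. The sketch below is not kernel code: `struct super_ops`, `toyfs_info`, `toyfs_sync_fs`, and `generic_sync` are hypothetical names, and only the field names `s_op` and `s_fs_info` are borrowed from the slides.

```c
#include <stddef.h>

struct super_block;

/* Table of operations a file system fills in (illustrative, one entry only). */
struct super_ops {
    int (*sync_fs)(struct super_block *sb);
};

/* Generic object: the fields the common layer understands. */
struct super_block {
    const struct super_ops *s_op;   /* FS specific operations */
    void *s_fs_info;                /* FS specific object */
};

/* Private state of a toy file system. */
struct toyfs_info {
    int dirty;
    int sync_count;
};

/* The toy file system's implementation of sync_fs. */
static int toyfs_sync_fs(struct super_block *sb)
{
    struct toyfs_info *info = sb->s_fs_info;
    info->dirty = 0;
    info->sync_count++;
    return 0;
}

const struct super_ops toyfs_ops = { .sync_fs = toyfs_sync_fs };

/* Generic layer: dispatches without knowing which file system is mounted. */
int generic_sync(struct super_block *sb)
{
    if (sb->s_op == NULL || sb->s_op->sync_fs == NULL)
        return -1;
    return sb->s_op->sync_fs(sb);
}
```

The generic layer only ever touches `s_op` and passes the object back to the file system, which casts `s_fs_info` to its own type. Adding another file system means supplying another operations table, not changing `generic_sync`.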

VFS operations example

• sys_open (fs/open.c)

– Main routine: do_sys_open()

– Allocate fd: get_unused_fd_flags()

– Walk path and open a file: do_filp_open()
  • lookup_fast(), __d_lookup(): dentry cache lookup
  • i_op->lookup(dir, dentry): repeated until the target inode is reached
  • d_op->d_hash(dentry, name), d_op->d_compare(dentry, name1, name2)
  • f_op->open(inode, file)
  • i_op->create(dir, dentry, mode)
  • s_op->alloc_inode(sb)

Superblock API

• Superblock object

– Per mounted file system instance

• Superblock operations

– Superblock management
  • write_super(), put_super()
– Inode management
  • alloc_inode(), write_inode()
– File system management
  • sync_fs(), free_fs()

• Initialization: get_sb() function

Type Name

list_head s_list

dev_t s_dev

list_head s_inodes

list_head s_files

super_operations s_op

void * s_fs_info

… …

Inode API

• Inode object

• Inode operations

– Inode management
  • create, truncate, setattr, fallocate, …
– Directory management
  • lookup, link, unlink, symlink, mkdir, …

Type Name

super_block i_sb

list_head i_dentry

unsigned long i_ino

atomic_t i_count

uid_t i_uid

struct timespec i_atime

loff_t i_size

address_space i_mapping

inode_operations i_op

file_operations i_fop

void * i_private

File API

• File object

• File operations

– llseek(), read(), write(), mmap()

– open(), release(), flush()

Type Name

struct path f_path

int f_flags

loff_t f_pos

address_space f_mapping

file_operations f_op

void * private_data

Directory API

• Dentry object

• Operations

– d_revalidate

– d_hash

– d_compare

– d_delete, d_release

– d_iput, d_dname

• A file system rarely needs to implement these itself
– Exception: case-insensitive file names

Type Name

struct inode d_inode

hlist_node d_hash

struct dentry d_parent

struct qstr d_name

list_head d_subdirs

list_head d_alias

dentry_operations d_op

void* d_fsdata

char d_iname[]

VFS objects relationships

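[Figure: links between VFS objects: superblock to its inodes (s_inode_list); inode to superblock (i_sb) and to its dentries (i_dentry); file to inode (f_mapping->host) and to superblock (f_sb); dentry to inode (d_inode) and to superblock (d_sb)]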

Page cache implementation

• Integrated with process virtual memory module

• Page associated with inode

– page->mapping, index

• address_space object

– Target file

– Page cache tree (radix)

• address_space operations

– I/O: writepage, readpage, writepages, readpages

– Block allocation and mapping: bmap()

Type Name

struct inode host

radix_tree_root page_tree

long nrpages

address_space_operations a_ops

… …
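A cached page is identified by its address_space and an index, and the index is simply the file offset divided by the page size. The sketch below shows that arithmetic under the assumption of 4 KB pages; the macro and function names are hypothetical and chosen so that they do not clash with system headers.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12                           /* assumed 4 KB pages */
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)

/* Index of the cached page that holds byte offset `pos` of a file. */
uint64_t page_index(uint64_t pos)
{
    return pos >> SKETCH_PAGE_SHIFT;
}

/* Offset of byte `pos` within its page. */
uint64_t offset_in_page(uint64_t pos)
{
    return pos & (SKETCH_PAGE_SIZE - 1);
}
```

Under this assumption, byte 10000 of a file lives in the page with index 2 at offset 1808, so a read at that position looks up index 2 in the file's page cache tree.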

Delayed allocation

Page cache implementation

• Useful API

– page = find_get_page(mapping, index)

– SetPageDirty(page), ClearPageDirty(page)

• Flush thread

– Write-back dirty pages
  • On-demand, dirty threshold, free threshold
– Per allocation-group
  • Per disk if device mapper is not used
  • Per top-level virtual device if device mapper is applied

Implementing file system

• superblock, inode, file, dentry, address space

[Figure: object links a file system implementation has to set up: superblock (s_inode_list); inode (i_sb, i_mapping, i_dentry); address space (host); file (f_mapping, f_sb); dentry (d_inode, d_sb, d_parent)]