34
© Microsoft Corporation 2 004 1 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating Systems Group Windows Core Operating Systems Division Microsoft Corporation

© Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

Embed Size (px)

Citation preview

Page 1: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 1

Windows Kernel Internals IIAdvanced File Systems

University of Tokyo – July 2004

Dave Probert, Ph.D.Advanced Operating Systems Group

Windows Core Operating Systems DivisionMicrosoft Corporation

Page 2: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 2

Disk Basics

Volume exported via device object

Addressed by byte offset and length

Enforced on sector boundaries

NTFS allocation unit - clusters

Round size down to clusters

Page 3: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 3

Storage ManagementVolumes may span multiple logical disks

Partitioning Description Benefits

spanned logical catenation of arbitrary sized volumes

size

striped(RAID-0)

interleaved same-sized volumes

read/write perf

mirrored(RAID-1)

redundant writes to same-sized volume, alternate reads

reliability, read perf

RAID-5 striped volumes w/ parity reliability, size, read perf

Page 4: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 4

File System Device Stack

NT I/O Manager

File System Filters

File System DriverCache Manager

Virtual MemoryManager

Application

Kernel32 / ntdlluserkernel

Partition/VolumeStorage Manager

Disk Class Manager

Disk Driver

DISK

Page 5: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 5

NTFS Deals with files

Partition is collection of files

Common routines for all meta-data

Utilizes MM and Cache Manager

No specific on-disk locations

Page 6: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 6

CacheManager overviewCache manager

– kernel-mode routines– asynchronous worker routines– interface between filesystems and VM mgr

Functionality– access methods for pages of file data on opened files– automatic asynchronous read ahead– automatic asynchronous write behind (lazy write)– supports “Fast I/O” – IRP bypass

Page 7: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 7

Datastructure Layout

File Object == Handle (U or K), not one per file

Section Object Pointers and FS File Context shared/stream

Kernel

Handle

File ObjectFilesystem File Context

FS Handle Context (2)

Section Object Pointers

Data Section (Mm)

Image Section (Mm)

Shared Cache Map (Cc)

Private Cache Map (Cc)

Page 8: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 8

DatastructuresFile Object

– FsContext – per physical stream context– FsContext2 – per user handle stream context, not all

streams have handle context (metadata)– SectionObjectPointers – the point of “single

instancing”• DataSection – exists if the stream has had a mapped section

created (for use by Cc or user)• SharedCacheMap – exists if the stream has been set up for

the cache manager• ImageSection – exists for executables

– PrivateCacheMap – per handle Cc context (readahead) that also serves as reference from this file object to the shared cache map

Page 9: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 9

Cache View Management

A Shared Cache Map has an array of View Access Control Block (VACB) pointers which record the base cache address of each view– promoted to a sparse form for files > 32MB

Access interfaces map File+FileOffset to a cache address

Taking a view miss results in a new mapping, possibly unmapping an unreferenced view in another file (views are recycled LRU)

Since a view is fixed size, mapping across a view is impossible – Cc returns one address

Fixed size means no fragmentation …

Page 10: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 10

View Mapping

0-256KB 256KB-512KB 512KB-768KB

c1000000

File Offfset

<NULL> cf0c0000

VACB Array

Page 11: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 11

CacheManager Interface Summary

File objects start out unadornedCcInitializeCacheMap to initiate caching via Cc on a file

object– setup the Shared/Private Cache Map & Mm if

neccesaryAccess methods (Copy, Mdl, Mapping/Pinning)Maintenance FunctionsCcUninitializeCacheMap to terminate caching on a file

object– teardown S/P Cache Maps– Mm lives on. Its data section is the cache!

Page 12: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 12

CacheManager / FS Diagram

Cache Manager

Memory Manager

Filesystem

Storage Drivers

Disk

Fast IO Read/Write IRP-based Read/Write

Page Fault

Cache Access, Flush, Purge

Noncached IO

Cached IO

Page 13: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 13

File System Notes

Three basic types of IO– cached, non-cached, paging

Three file sizes– file size, allocation size, valid data length

Three worker threads– Mm’s modified page writer (paging file)– Mm’s mapped page writer (mapped files)– Cc’s lazy writer pool (flushes views)

Page 14: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 14

Cache Manager Summary

Virtual block cache for files not logical block cache for disksMemory manager is the ACTUAL cache managerCache Manager context integrated into FileObjectsCache Manager manages views on files in kernel virtual

address spaceI/O has special fast path for cached accessesThe Lazy Writer periodically flushes dirty data to diskFilesystems need two interfaces to CC: map and pin

Page 15: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 15

NTFS on-disk structure

Some NTFS system files

$Bitmap

$BadClus

$Boot

. (root directory)

$Logfile

$Volume

$Mft

$MftMirr

$Secure

Page 16: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 16

$Mft File

Data is entirely File Records

File Records are fixed size

Every file on volume has a File Record

File records are recycled

Reserved area for system files

Critical file records mirrored in $MftMirr

Page 17: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 17

File Records

‘Base’ file record for each file

Header followed by ‘Attributes’

Additional file records as needed

Update Sequence Array

ID by offset and sequence number

Page 18: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 18

P Q R S TA B C D E F G H IJ K L M N O U V

A B C D E F G H I J K L M N O P Q R S T U V

File D:\Letters (File ID 0x200)

File \$Mft

100200

2000

280200

P Q R S T A B C D E FG H I J KL M N OU V

Physical Disk

Page 19: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 19

File Basics

TimestampsFile attributes (DOS + NTFS)

Filename (+ hard links)

Data streams

ACL

Indexes

File Building BlocksFile RecordsNtfs AttributesAllocated clusters

Page 20: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 20

File Record Header

USA Header

Sequence Number

First Attribute Offset

First Free Byte and Size

Base File Record

IN_USE bit

Page 21: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 21

NTFS Attributes

Type code and optional name

Resident or non-resident

Header followed by value

Sorted within file record

Common code for operations

Page 22: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 22

$STANDARD_INFORMATION (Time Stamps, DOS Attributes)

MFT File Record

$FILE_NAME - VeryLongFileName.Txt

$DATA (Default Data Stream)

$DATA - “VeryLongFileName.Txt:A named stream”

$END (Available for attribute growth or new attribute)

$FILE_NAME - VERYLO~1.TXT

Page 23: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 23

Attribute Header

Length

Form

Name and name length

Flags (Compressed, Encrypted, Sparse)

Page 24: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 24

Resident Attributes

Data follows attribute header

‘Allocation Size’ on 8-byte boundary

May grow or shrink

Convert to non-resident

Page 25: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 25

Non-Resident Attributes

Data stored in allocated disk clusters

May describe sub-range of stream

Sizes and stream properties

Mapping pairs for on-disk runs

Page 26: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 26

Some Attribute Types

$STANDARD_INFORMATION $FILE_NAME $SECURITY_DESCRIPTOR $DATA $INDEX_ROOT $INDEX_ALLOCATION $BITMAP $EA

Page 27: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 27

Mapping Pairs

Stored in a byte optimal format

Represents allocation and holes

Each pair is relative to prior run

Used to represent compression/sparse

Page 28: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 28

Indexes

File name and view indexes

Indexes are B-trees

Entries stored at each level

Intermediate nodes have down pointers

$INDEX_ROOT

$INDEX_ALLOCATION

$BITMAP

Page 29: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 29

Index Implementation

Top level - $INDEX_ROOT

Index buckets - $INDEX_ALLOCATION

Available buckets - $BITMAP

Page 30: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 30

A B CG I N P QZunused data

A B C G I N P Q Z

0x36 (00110110)

$BITMAP

$INDEX_ALLOCATION

$INDEX_ROOT

E J endR

Page 31: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 31

$ATTRIBUTE_LIST

Needed for multi-file record file

Entry for each attribute in file

Resident or non-resident form

Must be in base file record

Page 32: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 32

Attribute List (example)

• Base Record - 0x200

• 0x10 - Standard• 0x20 - Attribute List• 0x30 - FileName• 0x80 - Default Data• 0x80 - Data1 “Owner”

• Aux Record - 0x180

• 0x30 - FileName• 0x80 - Data “Author”• 0x80 - Data0 “Owner”• 0x80 - Data “Writer”

Page 33: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 33

Attribute List (example cont.)

Code FR VCN Name (Not Present)0x10 0x200 $Standard0x30 0x200 $Filename0x30 0x180 $Filename0x80 0x200 0 $Data0x80 0x180 0 “Author” $Data0x80 0x180 0 “Owner” $Data0x80 0x200 40 “Owner” $Data0x80 0x180 “Writer” $Data

Page 34: © Microsoft Corporation 20041 Windows Kernel Internals II Advanced File Systems University of Tokyo – July 2004 Dave Probert, Ph.D. Advanced Operating

© Microsoft Corporation 2004 34

Discussion