Naming
Principles of Computer System (2012 Fall)
Bus & File System
NAMING MODEL
Module use of names
• Two ways a module can use a named object:
  – By value
    • The module gets a copy of the named object
  – By reference
    • The module operates directly on the named object
• Purpose #1: Sharing and organization
  – Most communication happens using names
• Purpose #2: Delayed binding to an object
  – Supports replaceability and indirection
Modules and names
Naming schemes
• Three parts
  – Name space: symbols and syntax rules for generating names
  – Name-mapping algorithm: maps names to values
  – Universe of values: all possible values
• Terminology
  – Binding: a mapping from a name to a value
  – A name that has a mapping is bound
  – A name-mapping algorithm resolves a name
Naming model
Naming Context
• Name lookup is typically done in a context
  – Examples:
    • Mail: [email protected]
    • Dial: 51355355
    • “Hey, you!”
• Name spaces with only one possible context are called universal name spaces
  – Example: US social security numbers
Determining Context - 1
• Hard-code it in the resolver
  – Example: many universal name spaces work this way
• Embed it in the name itself
  – cse@sjtu.edu.cn:
    • Name = “cse”
    • Context = “sjtu.edu.cn”
  – /ipads.se.sjtu.edu.cn/courses/cse-g/2012f/README:
    • Name = “README”
    • Context = “/ipads.se.sjtu.edu.cn/courses/cse-g/2012f”
Determining Context - 2
• Take it from the environment (dynamic)
  – Unix command “rm foo”:
    • Name = “foo”
    • Context = the current directory
  – Read 0x7c911109:
    • Name = “0x7c911109”
    • Context = the thread’s address space
• Many errors in systems are due to using the wrong context
Interpreter naming API
• value ← RESOLVE(name, context)
  – Return the mapping of name in the context
• status ← BIND(name, value, context)
  – Establish a name-to-value mapping in the context
• status ← UNBIND(name, context)
  – Delete name from the context
• list ← ENUMERATE(context)
  – Return a list of all bindings
• result ← COMPARE(name1, name2)
  – Check whether name1 and name2 are equal
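The five operations above can be sketched in C as a table-lookup context. This is only an illustration under assumptions that are not in the slides: values are plain integers, a context is a fixed-size table, and the ctx_* function names and MAX_BINDINGS are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_BINDINGS 16   /* hypothetical capacity of one context */

/* A context is modeled as a small table of name -> value bindings. */
struct binding { const char *name; int value; int used; };
struct context { struct binding table[MAX_BINDINGS]; };

/* COMPARE: two names are equal if their strings match. */
int name_compare(const char *n1, const char *n2)
{
    return strcmp(n1, n2) == 0;
}

/* RESOLVE: return a pointer to the value bound to name, or NULL if unbound. */
const int *ctx_resolve(const struct context *ctx, const char *name)
{
    for (int i = 0; i < MAX_BINDINGS; i++)
        if (ctx->table[i].used && name_compare(ctx->table[i].name, name))
            return &ctx->table[i].value;
    return NULL;
}

/* BIND: create or replace a binding; 0 on success, -1 if the table is full. */
int ctx_bind(struct context *ctx, const char *name, int value)
{
    int free_slot = -1;
    assert(ctx != NULL && name != NULL);
    for (int i = 0; i < MAX_BINDINGS; i++) {
        if (ctx->table[i].used && name_compare(ctx->table[i].name, name)) {
            ctx->table[i].value = value;   /* rebinding an existing name */
            return 0;
        }
        if (!ctx->table[i].used && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -1;
    ctx->table[free_slot] = (struct binding){ name, value, 1 };
    return 0;
}

/* UNBIND: delete name from the context; 0 on success, -1 if it was not bound. */
int ctx_unbind(struct context *ctx, const char *name)
{
    for (int i = 0; i < MAX_BINDINGS; i++) {
        if (ctx->table[i].used && name_compare(ctx->table[i].name, name)) {
            ctx->table[i].used = 0;
            return 0;
        }
    }
    return -1;
}

/* ENUMERATE: copy up to max bound names into out; return how many were copied. */
size_t ctx_enumerate(const struct context *ctx, const char **out, size_t max)
{
    size_t n = 0;
    for (int i = 0; i < MAX_BINDINGS && n < max; i++)
        if (ctx->table[i].used)
            out[n++] = ctx->table[i].name;
    return n;
}
```

Passing the context explicitly to every call mirrors the API on the slide: the same name can resolve to different values in different contexts.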
Uniqueness
• Many naming systems are not unique
  – A name can map to 0, 1, or many values
    • RESOLVE can return NULL or a list of values
  – A value may have 0, 1, or many names
    • Reverse RESOLVE can return NULL or a list
• Unique identity name space
  – Names are never reused: called a stable binding
    • Your SJTU student number
    • Many billing systems have “Customer #s”
Name mapping algorithms
• Table lookup
  – Find the name in a table
    • Examples: phone book, old /etc/hosts
  – Context: specifies which table to use
• Recursive lookup
• Multiple lookup
Addresses
• Addresses are used as both names and locators
  – E.g. IP address 171.64.64.64, 1950s phone numbers, I/O device addresses
  – Highly useful but fragile
• Work-arounds when an object moves
  – Change all references: can be painful
  – Make it work for both the new and the old address
  – Have the client search if the resolve fails
• Indirection is frequently the solution
  – Update the indirection map to handle moves
  – Examples: host names, cell phone numbers, etc.
BUS: A HARDWARE LAYER
Booting
• 3 abstractions in a computer
  – Interpreter
  – Memory
  – Communication link
• Naming in booting
  – Linux booting sequence
  – Bus address
  – Memory load
  – Memory-mapped I/O & DMA
Keep two questions in mind
• What is the memory, the interpreter, and the communication link respectively?
• What is the name, the context, the name mapping algorithm?
Linux booting: 5 stages
• System startup: BIOS on flash
• Stage 1 bootloader: GRUB on MBR (disk)
• Stage 2 bootloader: GRUB/LILO on disk
• Kernel: Linux on disk
• Init: user app on disk
1. BIOS
• BIOS’s job
  – 1st instruction: 0xFFFF0
  – POST (Power-On Self Test)
  – Manage resources: name space
  – Enumerate bus devices
  – Load the boot loader into memory & give control to it
• Three abstractions
  – Interpreter: CPU, BIOS controller, memory controller
  – Memory: flash memory & RAM
  – Communication link: system bus
2. Bootloader stage 1 (MBR)
• MBR (Master Boot Record)
  – The first 512 bytes on the disk (the first block)
• Bootloader stage 1’s job
  – Load stage 2 into memory & give control to it
• Three abstractions
  – I: CPU & DC & MC
  – M: disk & RAM
  – C: system bus
[Figure: the 512-byte MBR, containing the bootloader, the partition table, and a magic number]
3. Bootloader stage 2
• Bootloader stage 2’s job
  – List the boot menu
  – Load the user-selected kernel into memory & give control to it
• Three abstractions
  – I: CPU & DC & MC
  – M: disk & RAM
  – C: system bus
4. Kernel
• Kernel’s job
  – Change the CPU to protected mode
  – Initialize the system…
  – Load init into memory and run it
• Three abstractions
  – I: CPU & DC & MC
  – M: disk & RAM
  – C: system bus
5. init
• init process
  – The first user-space program, pid=1
  – The root & parent of all other processes
• init’s job
  – Run /etc/rc.d/rc.sysinit
    • Start the system processes listed in /etc/inittab
    • Start multiple “getty” processes, which wait for console logins
• Three abstractions
  – I: CPU & DC & MC
  – M: disk & RAM
  – C: system bus
Question
• How does the CPU find the 1st instruction in the BIOS?
  – 0xFFFF0 is hard-wired into the PC after reset
• What happens during a memory load?
Booting sequence
• Three abstractions
  – Interpreter: CPU, memory controller, disk controller
  – Memory: BIOS’s flash memory, RAM, disk
  – Communication link: system bus
• Common patterns
  – The processor reads from memory (LOAD) and interprets
    • Memory cell naming: bus address
  – I/O devices transfer data to memory
    • Disk sector naming: block number
    • DMA & memory-mapped I/O
Specific Operations
[Figure: the processor reaches memory with load/store and I/O devices with PIO/MMIO; I/O devices reach memory with DMA]
• Memory load/store
  – Between CPU and memory
  – Physical memory address space
• I/O operations
  – MMIO: maps device memory and registers into the physical address space
  – E.g., a frame buffer
• DMA
  – Also uses physical addresses
Bus: Hardware Layer
• Bus features
  – A set of wires, comprising address, data, and control lines, that connect to a bus interface on each module
  – Bus arbitration protocol: decides which module may send or receive a message at any particular time
    • Bus arbiter (optional): a circuit that chooses which modules can use the bus
  – Broadcast link: every module hears every message
    • Bus address: identifies the intended recipient
Split-transaction
1. The source module requests exclusive use of the bus
2. The source module places the bus address of the destination module on the bus
3. The source module signals the READY wire to alert the other module
4. The destination module signals the ACKNOWLEDGE wire after copying the data
   – If the bus is synchronous, READY and ACKNOWLEDGE are not needed: each module just checks the address lines on each clock cycle
5. The source module releases the bus
Memory load example: LOAD 1742, R1
• Processor #2 => all bus modules: {1742, READ, 102}
• Memory #1 recognizes that the address is within its range
  – By examining just a few high-order address bits
• Memory #1 acknowledges and processor #2 releases the bus
• Memory #1 performs the internal operation to get the value
  – value ← READ(1742)
• Memory #1 => all bus modules: {102, value}
• Processor #2 is waiting for this result and just copies the data on the bus to its register R1
• Processor #2 acknowledges and memory #1 releases the bus
Bus Address
• Bus address space (physical address)
  – Each module has its own bus address range
  – The BIOS is in charge of managing it at boot time
  – 1MB in the past, 4GB today, larger in the future
  – Basic unit: byte
• Each module examines the bus address field
  – For every message
  – Ignores those not intended for it
  – What about sniffing?
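The address check that every module performs can be sketched as a range comparison. This is a minimal illustration, not how real bus hardware is built: the struct and function names are hypothetical, and the address ranges used in the usage note are borrowed from the DMA example later in the deck.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One module attached to the bus, with the bus address range it owns. */
struct bus_module {
    const char *label;
    uint32_t base;    /* first bus address owned by the module */
    uint32_t limit;   /* last bus address owned by the module (inclusive) */
};

/* Each module examines the address field and ignores messages not meant for it. */
static int module_claims(const struct bus_module *m, uint32_t addr)
{
    return addr >= m->base && addr <= m->limit;
}

/* Broadcast an address to all modules; return the index of the module
 * that recognizes it, or -1 if no module claims the address. */
int bus_decode(const struct bus_module *mods, size_t n, uint32_t addr)
{
    assert(mods != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (module_claims(&mods[i], addr))
            return (int)i;
    return -1;
}
```

With modules for a BIOS at 256-511, memory at 3072-4095, and a disk at 121-124, address 3328 is claimed by memory and address 122 by the disk; an address outside every range is claimed by nobody.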
Simple I/O Device in a Similar Way
• Example: keyboard
  – When the user presses a key, the keyboard SENDs a message to the processor containing the key value
  – As the processor is not ready, its bus interface:
    • copies the data into a temporary register,
    • acknowledges the keyboard,
    • SENDs an interrupt signal to the processor
  – The processor handles the interrupt in the next cycle
    • SENDs the value over the bus to the memory module
  – Suitable for slow devices, not suitable for disks
DMA for Disk Device
• DMA (Direct Memory Access)
  – A processor SENDs a request to a disk controller to READ a block of data
  – The request includes the address of a buffer in memory
• The disk SENDs the data directly to memory
  – Incrementing the memory address appropriately
• Benefits of DMA
  – Relieves the CPU, which can execute other programs
  – Needs one bus transfer instead of two
  – Takes better advantage of long messages, if the bus supports them
  – Amortizes the overhead of the bus protocol
Memory Mapped I/O
• Use LOAD and STORE instructions to address the registers and buffers of the I/O modules
  – Just like accessing memory
  – An address is an overloaded name that carries location information
• Provide a uniform interface to bus modules
  – The MMU translates virtual addresses to physical addresses
    • The physical address is the system bus address
  – I/O modules translate the bus address to a register address internally
[Figure: the processor issues a virtual address; the MMU translates it to a physical address (system bus address); memory, disk, and keyboard sit on the bus and internally translate the bus address to a register address]
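Volatile Address
#include <stdio.h>

int main(void)
{
    int i = 10;   /* declaring this as "volatile int i" forces a reload on every read */
    int a = i;
    printf("i = %d\n", a);

    /* Change the value of i behind the compiler's back.
       MSVC x86 inline assembly; assumes i is stored at [ebp-4]. */
    __asm { mov dword ptr [ebp-4], 20h }

    int b = i;    /* without volatile, an optimizing compiler may reuse the old value 10 */
    printf("i = %d\n", b);
    return 0;
}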
Memory Mapped I/O combined with DMA
DMA example
[Figure: modules on the bus and their bus addresses: Processor 1 (101), Processor 2 (102), BIOS (256-511), Memory (3072-4095), Disk (121-124)]
• Processor #1 => all bus modules: {121, WRITE, 11742}
  – The disk acknowledges and writes the value 11742 to its control register
• Processor #1 => all bus modules: {122, WRITE, 3328}
• Processor #1 => all bus modules: {123, WRITE, 256}
• Processor #1 => all bus modules: {124, WRITE, 1}
• Disk => all bus modules: {3328, WRITE, data[11742]}
  – Memory acknowledges and saves data[11742]
• Disk => all bus modules: {3329, WRITE, data[11743]}
• ... (loop)
• Disk => all bus modules: {3583, WRITE, data[11997]}
• When the transfer is finished, the disk controller SENDs a message to the processor
  – Just as the keyboard controller does when a key is pressed
• The processor will enter the interrupt handler in the next cycle
• Now the processor knows that the DMA is done
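The DMA sequence can be simulated in a few lines of C. This is a sketch under assumptions: the slides do not name the four disk control registers, so reading 121 to 124 as source block, memory address, count, and "go" is inferred from the values written to them; the struct layout, the 16384-entry disk, and the byte-sized transfer unit are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_BASE   3072u    /* memory owns bus addresses 3072-4095 */
#define MEM_SIZE   1024u
#define DISK_UNITS 16384u   /* size of the simulated disk (an assumption) */

/* Disk control registers at bus addresses 121-124 (roles are inferred). */
#define REG_BLOCK 121u      /* first disk unit to read */
#define REG_ADDR  122u      /* destination bus address in memory */
#define REG_COUNT 123u      /* number of units to transfer */
#define REG_GO    124u      /* writing 1 starts the transfer */

struct machine {
    uint8_t  memory[MEM_SIZE];
    uint8_t  disk[DISK_UNITS];
    uint32_t disk_regs[4];
    int      interrupt_pending;   /* set by the disk when the DMA is done */
};

/* Disk side: copy the data straight into memory, incrementing the memory
 * address for each unit, then interrupt the processor. */
static void disk_run_dma(struct machine *m)
{
    uint32_t src   = m->disk_regs[REG_BLOCK - REG_BLOCK];
    uint32_t dst   = m->disk_regs[REG_ADDR - REG_BLOCK];
    uint32_t count = m->disk_regs[REG_COUNT - REG_BLOCK];

    for (uint32_t i = 0; i < count; i++) {
        assert(src + i < DISK_UNITS);
        assert(dst + i >= MEM_BASE && dst + i < MEM_BASE + MEM_SIZE);
        m->memory[dst + i - MEM_BASE] = m->disk[src + i];
    }
    m->disk_regs[REG_GO - REG_BLOCK] = 0;
    m->interrupt_pending = 1;
}

/* Processor side: a bus WRITE addressed to one of the disk's registers. */
void bus_write(struct machine *m, uint32_t addr, uint32_t value)
{
    assert(addr >= REG_BLOCK && addr <= REG_GO);
    m->disk_regs[addr - REG_BLOCK] = value;
    if (addr == REG_GO && value == 1)
        disk_run_dma(m);
}
```

Issuing the four writes from the slide (11742, 3328, 256, 1) fills memory addresses 3328 through 3583 and leaves interrupt_pending set, while the processor itself never touches the data.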
Questions
• Why not map the whole disk to memory?
  – So that the CPU could access a byte on the disk directly over the system bus
  – 1. Too large
  – 2. Too slow
The principle of least astonishment:
People are part of the system. The design should match the user’s experience, expectations, and mental models
FILE SYSTEM: A SOFTWARE LAYER
Outline
• UNIX file system
  – 7 layers in the file system (3 + 1 + 3)
• FS API implementation
  – OPEN, READ, WRITE, CLOSE, FSYNC
• UNIX shell
  – Implied context, search path, name discovery
• Review of the naming model
File
• A file is a high-level version of the memory abstraction
• A file has two key properties
  – It is durable & has a name
• The system layer implements files using modules from the hardware layer
  – Divide-and-conquer strategy
  – Makes use of several hidden layers of machine-oriented names (addresses), one on another, to implement files
  – Maps user-friendly names to these files
• In UNIX, everything is a file - KISS
API of the UNIX file system
• OPEN, READ, WRITE, SEEK, CLOSE
• FSYNC
• STAT, CHMOD, CHOWN
• RENAME, LINK, UNLINK, SYMLINK
• MKDIR, CHDIR, CHROOT
• MOUNT, UNMOUNT
The naming layers of the UNIX file system (version 6)
Disk structure
[Figure: disk geometry with platters, heads 0-2, cylinders 0-1, tracks 0-2, and sectors 0-1]
• Platter
• Track
• Sector
• Head
• Cylinder
Block layer
• Block size: a trade-off
  – Neither too small nor too big
• Name mapping: block number -> block
• Context: the storage device (e.g. disk) itself
  – Binds block numbers to physical blocks
• Name-mapping algorithm
• Name discovery: super block
  – Keeps track of block usage, e.g. with a free list or a bitmap
Super block
• One superblock per file system
  – The kernel reads the superblock when mounting the FS
• The superblock contains
  – Size of the blocks
  – Number of free blocks
  – A list of free blocks
  – Index to the next free block
  – Lock field for the free block and free inode lists
  – Flag to indicate modification of the superblock
  – Size of the inode list
  – Number of free inodes
  – A list of free inodes
  – Index to the next free inode
File layer
• File requirements
  – Store items that are larger than one block
  – May grow or shrink over time
  – A file is a linear array of bytes of arbitrary length
  – Record which blocks belong to each file
• Inode (index node)
  – A container for metadata about the file
• Name mapping: index number -> block number
• Context: the inode itself
• Name-mapping algorithm
• Max length of an offset is 3 bytes in UNIX version 6
• What about large files?
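One way to picture the file-layer mapping is the sketch below, which turns a byte offset into a block index and then into a block number, falling back to a single indirect block for larger files. The sizes (512-byte blocks, 4 direct entries, 8 indirect entries), the struct layout, and the function names are assumptions chosen for illustration, not the UNIX version 6 on-disk format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 512u   /* illustrative block size */
#define N_DIRECT   4u     /* block numbers stored in the inode itself */
#define N_INDIRECT 8u     /* block numbers stored in one indirect block */

struct inode {
    uint32_t size;                /* file length in bytes */
    uint32_t direct[N_DIRECT];    /* block numbers of the first blocks */
    const uint32_t *indirect;     /* stands in for the contents of an
                                     indirect block; NULL if there is none */
};

/* Name mapping of the file layer: index number -> block number.
 * The inode is the context. Returns -1 if the index is not mapped. */
int64_t index_to_block_number(const struct inode *ino, uint32_t index)
{
    assert(ino != NULL);
    if (index < N_DIRECT)
        return ino->direct[index];
    index -= N_DIRECT;
    if (ino->indirect != NULL && index < N_INDIRECT)
        return ino->indirect[index];
    return -1;
}

/* A byte offset selects a block index by integer division. */
int64_t offset_to_block_number(const struct inode *ino, uint32_t offset)
{
    assert(ino != NULL);
    if (offset >= ino->size)
        return -1;
    return index_to_block_number(ino, offset / BLOCK_SIZE);
}
```

The indirect block is what lets a fixed-size inode name more blocks than it can hold directly; a double indirect block repeats the same idea one level further.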
Inode for larger files
[Figure: an inode pointing to data blocks directly, through an indirect block, and through a double indirect block]
Inode number layer
• Name mapping: inode number -> inode
• Context: the inode table
• Name-mapping algorithm: inode table
  – At a fixed location on storage
• Name discovery
  – Track which inode numbers are in use
  – E.g. a free list, a field in the inode
Put layers so far together
• A more user-friendly name is needed
  – Numbers are convenient names only for computers
    • Numbers change on different storage devices
File name layer
• File name
  – Hides the metadata of file management
  – Files and I/O devices
• Name-mapping algorithm
  – The mapping table is saved in a directory
  – Default context: the current working directory
  – The context reference is also an inode number
• The directory itself is a file
  – Max length of a name is 14 bytes in UNIX version 6
LOOKUP in a directory
• Name compare method: STRING_MATCH
• LOOKUP(“program”, dir) will return 10
Path name layer
• Hierarchy of directories and files
  – Structured naming: e.g. “projects/paper”
• Name-mapping algorithm
  – PLAIN_NAME returns true if there is no ‘/’ in the path
• Context: the working directory
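A recursive resolver for the path name layer might look like the sketch below. It is an assumption-laden illustration: the whole file system is one flat, hard-coded table of (directory inode, name, inode) rows, the function names are made up, and only the inode numbers 1, 7, and 9 come from the “/programs/pong.c” example in this deck; the “projects/paper” numbers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NOT_FOUND (-1)

/* One row of a directory's mapping table: file name -> inode number. */
struct dir_entry { int dir_inode; const char *name; int inode; };

/* A toy file system holding every directory's bindings in one table. */
static const struct dir_entry entries[] = {
    { 1, "programs", 7 },
    { 1, "projects", 5 },    /* hypothetical inode number */
    { 7, "pong.c",   9 },
    { 5, "paper",   12 },    /* hypothetical inode number */
};

/* LOOKUP: find a plain name of length len in one directory by string match. */
int lookup(const char *name, size_t len, int dir)
{
    for (size_t i = 0; i < sizeof entries / sizeof entries[0]; i++)
        if (entries[i].dir_inode == dir &&
            strlen(entries[i].name) == len &&
            strncmp(entries[i].name, name, len) == 0)
            return entries[i].inode;
    return NOT_FOUND;
}

/* PLAIN_NAME: true if the path contains no '/'. */
int plain_name(const char *path)
{
    return strchr(path, '/') == NULL;
}

/* Resolve a relative path, starting in directory dir (the context). */
int path_to_inode_number(const char *path, int dir)
{
    assert(path != NULL);
    if (plain_name(path))
        return lookup(path, strlen(path), dir);

    /* Resolve the first component, then recurse on the rest of the path
     * with the directory just found as the new context. */
    const char *slash = strchr(path, '/');
    int next = lookup(path, (size_t)(slash - path), dir);
    if (next == NOT_FOUND)
        return NOT_FOUND;
    return path_to_inode_number(slash + 1, next);
}
```

Resolving “programs/pong.c” in directory 1 first looks up “programs”, then resolves “pong.c” in directory 7. Each step changes the context, which is the recursive lookup pattern from the naming model.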
Links
• LINK: shortcut for long names
  – LINK(“Mail/inbox/new-assignment”, “assignment”)
  – Turns the strict hierarchy into a directed graph
    • Users cannot create links to directories -> acyclic graph
  – Different names, same inode number
• UNLINK
  – Removes the binding of a file name to an inode number
  – If UNLINK removes the last binding, the inode and its blocks are put on the free list
    • A reference counter is needed
Links
• Reference count
  – An inode can be bound to multiple file names
  – +1 on LINK, -1 on UNLINK
  – A file is deleted when its reference count is 0
    • WARN: violation of the principle of least astonishment
  – No cycles allowed
    • Except for ‘.’ and ‘..’
    • These name the current and parent directory with no need to know their names
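The reference-count rule can be sketched with a single in-memory directory. Everything here is a simplification made for illustration: the struct layout, table sizes, and fs_* function names are hypothetical, and freeing an inode only clears a flag.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_INODES 8
#define MAX_NAMES  8
#define NAME_LEN   14   /* name length limit, as in UNIX version 6 */

struct inode        { int in_use; int refcnt; };
struct name_binding { int used; char name[NAME_LEN + 1]; int inode; };

struct fs {
    struct inode        inodes[MAX_INODES];
    struct name_binding dir[MAX_NAMES];   /* a single directory */
};

static struct name_binding *find_name(struct fs *fs, const char *name)
{
    for (int i = 0; i < MAX_NAMES; i++)
        if (fs->dir[i].used && strcmp(fs->dir[i].name, name) == 0)
            return &fs->dir[i];
    return NULL;
}

static int add_name(struct fs *fs, const char *name, int inode)
{
    if (strlen(name) > NAME_LEN || find_name(fs, name) != NULL)
        return -1;
    for (int i = 0; i < MAX_NAMES; i++) {
        if (!fs->dir[i].used) {
            fs->dir[i].used = 1;
            strcpy(fs->dir[i].name, name);
            fs->dir[i].inode = inode;
            return 0;
        }
    }
    return -1;
}

/* Create a file: allocate an inode and bind its first name (refcnt = 1). */
int fs_create(struct fs *fs, const char *name)
{
    for (int n = 0; n < MAX_INODES; n++) {
        if (!fs->inodes[n].in_use) {
            if (add_name(fs, name, n) != 0)
                return -1;
            fs->inodes[n].in_use = 1;
            fs->inodes[n].refcnt = 1;
            return n;
        }
    }
    return -1;
}

/* LINK(from, to): bind the new name "to" to the inode of "from"; refcnt + 1. */
int fs_link(struct fs *fs, const char *from, const char *to)
{
    struct name_binding *b = find_name(fs, from);
    if (b == NULL || add_name(fs, to, b->inode) != 0)
        return -1;
    fs->inodes[b->inode].refcnt++;
    return 0;
}

/* UNLINK(name): remove the binding; free the inode when refcnt reaches 0. */
int fs_unlink(struct fs *fs, const char *name)
{
    struct name_binding *b = find_name(fs, name);
    if (b == NULL)
        return -1;
    struct inode *ino = &fs->inodes[b->inode];
    b->used = 0;
    assert(ino->refcnt > 0);
    if (--ino->refcnt == 0)
        ino->in_use = 0;
    return 0;
}
```

After creating “inbox” and linking it as “assignment”, unlinking “inbox” leaves the inode alive with a count of 1; only unlinking “assignment” as well frees it.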
No cycle for LINK
[Figure: three snapshots of the directory graph, labeled with the inode number and reference count of a: 25:1, 25:2, 25:1]
• Initially
  – /a/b is a directory
  – The refcnt of a is 1
  – a’s inode number is 25
• LINK(“/a/b/c”, “a”)
  – Causes a cycle!
  – The refcnt of a is 2
• UNLINK(“/a”)
  – The refcnt of a is 1, so inode 25 is not deleted
  – Now inode 25 is disconnected from the graph
Renaming - 1
• Text editors usually save the file being edited in a temporary file
• What if the computer fails between steps 1 & 2?
  – to_name will be lost, which surprises the user
  – Needs atomic actions (chapter 9)
• Weaker specification
  – If to_name already exists, it will still exist even if the machine fails between steps 1 & 2
Absolute path name layer
• HOME directory
  – Every user’s default working directory
  – Problem: no sharing of HOME files between users
• Context: the root directory
  – A universal context for all users
  – Well-known name: ‘/’
  – Both ‘/.’ and ‘/..’ are linked to ‘/’
An example: find blocks of “/programs/pong.c”
• The ‘/’ root directory: its inode number is 1
• Find the first directory in ‘/’ by block number
• Find ‘/programs’ by comparing names
• Find the inode of ‘/programs’ by its inode number, 7
• Find the first file in ‘/programs/’
• Find ‘/programs/pong.c’ by comparing names
• Find the inode of ‘/programs/pong.c’ by its inode number, 9
• Find the block numbers of ‘/programs/pong.c’
• Find the data of block 61 by its block number
  – And the data of blocks 44 & 15
Symbolic link layer
• MOUNT
  – Records the device and the root inode number of the file system in memory
  – Records, in the in-memory version of the inode for “/dev/fd1”, its parent’s inode
  – UNMOUNT undoes the mount
• Change to the file name layer
  – If LOOKUP runs into an inode on which a file system is mounted, it uses the root inode of that file system for the lookup
Symbolic link layer
• Name files on other disks
  – Inode numbers are different on other disks
  – Supports attaching new disks to the name space
• Two options
  – Make inodes unique across all disks
  – Create synonyms for the files on the other disks
• Soft link (symbolic link)
  – SYMLINK
  – Adds another type of inode
  – Context: the directory hierarchy
Two types of links (synonyms)
• Add link “assignment” to “Mail/new-assignment”
• Hard link
  – No new file is created; a binding is just added between a string and an existing inode
  – The target inode’s reference count is increased
  – If the target file is deleted, the link is still valid
• Soft link
  – A new file is created, whose data is the string “Mail/new-assignment”
  – The target inode’s reference count is not increased
  – If the target file is deleted, the link is no longer valid
• A soft link can create a cycle with SYMLINK(“a”, “a”)
Symbolic link layer
• Another interesting behavior of soft links
  – The current directory is “/Scholarly/programs/www”
  – This working directory contains a soft link
    • “CSE2012-web” -> “Scholarly/programs/www”
  – Run the following commands
    • CHDIR(“CSE2012-web”)
    • CHDIR(“..”)
  – What is the current directory?
    • “..” is resolved in the new default context
Decouple modules with indirection
Implementing the file system API
• Review
  – CHDIR, MKDIR
  – LINK, UNLINK, RENAME
  – SYMLINK
  – MOUNT, UNMOUNT
• Next
  – OPEN, READ, WRITE, CLOSE
  – FSYNC
File meta-data
• Owner ID
  – The user ID and group ID that own this inode
• Types of permission
  – Owner, group, other
  – Read, write, execute
• Time stamps
  – Last access (by OPEN)
  – Last modification (by WRITE)
  – Last change of inode (by LINK)
OPEN file
• Check the user’s permission
• Update the last access time
• Return a short name for the file
  – fd: file descriptor
  – Used by READ, WRITE, CLOSE
File descriptor
• Each process starts with three open files
  – Standard in: fd = 0
  – Standard out: fd = 1
  – Standard error: fd = 2
• An fd can also name an opened device
  – Keyboard, display, etc.
  – Allows a designer not to worry about input/output
    • Just read from fd 0 and write to fd 1
• Each process has its own fd name space
File cursor
• File cursor
  – Keeps track of the operation position within a file
• Sharing a cursor
  – A parent passes its fd to its child
    • In UNIX, a child inherits all open fds from its parent
  – Allows parent and child to share an output file
• Not sharing a cursor
  – Two processes open the same file
fd_table & file_table
• One file_table for the whole system
  – Records information about opened files
    • Inode number, file cursor, reference count of opening processes
    • Children can share the cursor with their parent
• One fd_table for each process
  – Records the mapping of an fd to an index into the file_table
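The two tables can be sketched as follows. This is a toy model under assumptions: fixed-size tables, hypothetical function names, a "fork" that only copies the fd_table, and a file_advance helper that moves the cursor without transferring any data.

```c
#include <assert.h>

#define MAX_FDS   8
#define MAX_FILES 8
#define FD_FREE   (-1)

/* System-wide table: one entry per OPEN, holding the shared cursor. */
struct file_entry { int inode; long cursor; int refcnt; };
static struct file_entry file_table[MAX_FILES];

/* Per-process table: fd -> index into file_table. */
struct process { int fd_table[MAX_FDS]; };

void proc_init(struct process *p)
{
    for (int fd = 0; fd < MAX_FDS; fd++)
        p->fd_table[fd] = FD_FREE;
}

/* OPEN: allocate a fresh file_table entry with its own cursor. */
int file_open(struct process *p, int inode)
{
    for (int idx = 0; idx < MAX_FILES; idx++) {
        if (file_table[idx].refcnt != 0)
            continue;                        /* entry in use */
        for (int fd = 0; fd < MAX_FDS; fd++) {
            if (p->fd_table[fd] == FD_FREE) {
                file_table[idx] = (struct file_entry){ inode, 0, 1 };
                p->fd_table[fd] = idx;
                return fd;
            }
        }
        return -1;                           /* no free fd in this process */
    }
    return -1;                               /* file_table is full */
}

/* Fork: the child inherits every open fd and shares the parent's entries. */
void proc_fork(const struct process *parent, struct process *child)
{
    for (int fd = 0; fd < MAX_FDS; fd++) {
        child->fd_table[fd] = parent->fd_table[fd];
        if (child->fd_table[fd] != FD_FREE)
            file_table[child->fd_table[fd]].refcnt++;
    }
}

/* Move the cursor as READ or WRITE would; return the new position. */
long file_advance(struct process *p, int fd, long nbytes)
{
    assert(fd >= 0 && fd < MAX_FDS && p->fd_table[fd] != FD_FREE);
    file_table[p->fd_table[fd]].cursor += nbytes;
    return file_table[p->fd_table[fd]].cursor;
}

/* CLOSE: free the fd; free the file_table entry when nobody refers to it. */
void file_close(struct process *p, int fd)
{
    assert(fd >= 0 && fd < MAX_FDS && p->fd_table[fd] != FD_FREE);
    struct file_entry *e = &file_table[p->fd_table[fd]];
    p->fd_table[fd] = FD_FREE;
    if (--e->refcnt == 0) {
        e->inode = 0;
        e->cursor = 0;
    }
}
```

Two processes that each call file_open on inode 23 get separate entries and separate cursors, while a forked child sees every cursor movement its parent makes, and vice versa.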
File cursor sharing
fd_table (one per process):
  Process A:             fd 3 -> index 115
  Process B:             fd 3 -> index 116
  Process C (B’s child): fd 3 -> index 116
file_table (system-wide):
  index   inode num   file cursor   refcnt
  115     23          128           1
  116     23          240           2
  ...
• Processes A, B and C each have just one open file, with inode number 23
• Processes A and B opened the same file separately and do not share a file cursor
• Processes B and C share the file cursor
WRITE & CLOSE
• WRITE is similar to READ
  – Allocate a new block if necessary
  – Update the inode’s size and mtime
• CLOSE
  – Free the entry in the fd_table
  – Decrease the reference counter in the file table
  – Free the entry in the file table if the counter is 0
• Failures in the middle may cause inconsistency
  – E.g. a block is allocated from the on-disk free list, but no inode records that block yet, so the block is lost
Question
• When writing, which order is preferred?
  – Allocate new blocks, write new data, update size
  – Allocate new blocks, update size, write new data
  – Update size, allocate new blocks, write new data
Delete after OPEN but before CLOSE
• One process has a file open
• Another process removes the last name pointing to that file
  – The reference counter is now 0
• The inode isn’t freed until the first process calls CLOSE
FSYNC
• Block cache
  – Cache of recently used disk blocks
  – Read from disk on a cache miss
  – Delay writes for batching
  – Improves performance
  – Problem: may cause inconsistency if the system fails before the write reaches the disk
• FSYNC
  – Ensures all changes to the file have been written to the storage device
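From a program's point of view, using FSYNC might look like the sketch below, which relies on the POSIX open, write, fsync, and close calls. The helper name write_durably and the 0644 permission bits are choices made for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Write len bytes of buf to path and return only after the data has been
 * pushed from the block cache to the storage device.
 * Returns 0 on success and -1 on any error. */
int write_durably(const char *path, const char *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write() may accept fewer bytes than requested, so loop until done. */
    size_t done = 0;
    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n < 0) {
            close(fd);
            return -1;
        }
        done += (size_t)n;
    }

    /* Without this call the data may sit in the block cache for a while,
     * and a crash at that point could lose it. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

The trade-off is the one on the slide: skipping fsync lets the cache batch writes for speed, while calling it trades that speed for durability.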
Questions
• What about the virtual address space?
• Where is the cache?
• Who assigns the physical addresses?
• What is the address of the disk?
• How can DMA security be ensured?
ABOUT THE LAB
Distributed File System
• Components
  – Extent Server, Lock Server, Client
  – Shift the complexity from the server to the client
    • Unlike NFS
FUSE
Lab-1: Lock Server
• RPC
  – How to pass arguments?
  – Read rpc/rpc.cc
• Lock Server and Client
  – Acquire/release lock
  – Multi-threaded
• At-most-once
  – How to identify duplicated requests?
Collaboration Policy
• You must write all the code you hand in
• You are not allowed to look at others’ code
• You may discuss with other students
Program Environment
• We provide a VM image
  – Runs on VMware
  – Available on our web site
  – username: cse
  – password: cselab
Hand In
• Hand-in process
  – $ make handin
  – Rename the tgz file with your student ID
  – Email the tgz file to xiayubin at gmail.com
• Hard deadline
  – Hand in before the deadline: x 100%
  – Within 24 hours after the deadline: x 80%
  – Within 48 hours after the deadline: x 60%
  – Within 72 hours after the deadline: x 40%
  – Within 96 hours after the deadline: x 20%
BACKUP