Naming

Principles of Computer System (2012 Fall)

Bus & File System


NAMING MODEL

Module use of names

• Two ways a module uses a named object:
  – By value: the module gets a copy of the named object
  – By reference: the module operates directly on the named object
• Purpose #1: Sharing and organization
  – Most communication happens using names
• Purpose #2: Delayed binding to an object
  – Supports replaceability and indirection

Modules and names


Naming schemes

• Three parts
  – Name space: symbols and syntax rules for generating names
  – Name-mapping algorithm: maps names to values
  – Universe of values: all possible values
• Terminology
  – Binding: a mapping from a name to a value
  – A name that has a mapping is bound
  – A name-mapping algorithm resolves a name

Naming model


Naming Context

• Name lookup is typically done in a context
  – Examples:
    • Mail: [email protected]
    • Dial: 51355355
    • “Hey, you!”
• Name spaces with only one possible context are called universal name spaces
  – Example: US social security numbers

Determining Context - 1

• Hard-code it in the resolver
  – Example: many universal name spaces work this way
• Embedded in the name itself
  – [email protected]:
    • Name = “cse”
    • Context = “sjtu.edu.cn”
  – /ipads.se.sjtu.edu.cn/courses/cse-g/2012f/README:
    • Name = “README”
    • Context = “/ipads.se.sjtu.edu.cn/courses/cse-g/2012f”

Determining Context - 2

• Taken from the environment (dynamic)
  – Unix command “rm foo”:
    • Name = “foo”
    • Context is the current directory
  – Read 0x7c911109:
    • Name = “0x7c911109”
    • Context is the thread’s address space
• Many errors in systems are due to using the wrong context

Interpreter naming API

• value ← RESOLVE(name, context)
  – Return the mapping of name in the context
• status ← BIND(name, value, context)
  – Establish a name-to-value mapping in the context
• status ← UNBIND(name, context)
  – Delete name from the context
• list ← ENUMERATE(context)
  – Return a list of all bindings in the context
• result ← COMPARE(name1, name2)
  – Check whether name1 and name2 are equal
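The five operations above can be sketched as a small Python class. This is only an illustration under stated assumptions: a context is modeled as a dictionary, COMPARE is plain string equality, and the names "alice" and "bob" and the second phone number are hypothetical.

```python
class Context:
    """A naming context: a table of name -> value bindings."""

    def __init__(self):
        self._bindings = {}

    def bind(self, name, value):
        """BIND: establish a name-to-value mapping; returns a status."""
        self._bindings[name] = value
        return True

    def unbind(self, name):
        """UNBIND: delete name from the context; False if it was not bound."""
        if name in self._bindings:
            del self._bindings[name]
            return True
        return False

    def resolve(self, name):
        """RESOLVE: return the value bound to name, or None if unbound."""
        return self._bindings.get(name)

    def enumerate(self):
        """ENUMERATE: return a sorted list of all (name, value) bindings."""
        return sorted(self._bindings.items())


def compare(name1, name2):
    """COMPARE: in this sketch two names are equal when the strings match."""
    return name1 == name2


# Hypothetical context: a phone book.
phone_book = Context()
phone_book.bind("alice", "51355355")
phone_book.bind("bob", "51355356")
```

Resolving an unbound name returns None here, which corresponds to RESOLVE reporting "no binding" rather than raising an error.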


Uniqueness

• Many naming systems are not unique
  – A name can map to 0, 1, or many values
    • RESOLVE can return NULL or a list of values
  – A value may have 0, 1, or many names
    • Reverse RESOLVE can return NULL or a list of names
• Unique identity name space
  – Names are never reused: called a stable binding
    • Your SJTU student number
    • Many billing systems have “Customer #s”

Name mapping algorithms

• Table lookup
  – Find the name in a table
    • Examples: phone book, old /etc/hosts
  – Context: specifies which table to use
• Recursive lookup
• Multiple lookup
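A minimal sketch of two of these algorithms, assuming that "multiple lookup" means trying several contexts in order (a search path) and taking the first binding found. The host tables and addresses are hypothetical.

```python
def table_lookup(name, table):
    """Table lookup: the context is the table that is searched."""
    return table.get(name)


def multiple_lookup(name, search_path):
    """Multiple lookup: try each context in order; the first binding wins."""
    for table in search_path:
        value = table_lookup(name, table)
        if value is not None:
            return value
    return None


# Hypothetical contexts, in the spirit of an old /etc/hosts file.
local_hosts = {"printer": "10.0.0.7"}
site_hosts = {"printer": "10.0.9.9", "mail": "10.0.0.25"}
search_path = [local_hosts, site_hosts]
```

Recursive lookup (resolving one path component at a time) is the algorithm used later by the path name layer.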


Addresses

• Addresses are used as both names and locators
  – E.g. IP address 171.64.64.64, 1950s phone numbers, I/O device addresses
  – Highly useful but fragile
• Workarounds when an object moves
  – Change all references: can be painful
  – Make the object reachable under both the new and the old address
  – Have the client search if RESOLVE fails
• Indirection is frequently the solution
  – Update the indirection map to handle moves
  – Examples: host names, cell phone numbers, etc.
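The indirection idea can be sketched as follows: clients hold a stable name, and only the map changes when the object moves. The host name "www" and the second address are hypothetical.

```python
# Hypothetical indirection map: stable host name -> current address.
host_table = {"www": "171.64.64.64"}


def connect(host_name):
    """Clients hold the name; the address is looked up at use time."""
    return host_table[host_name]


before_move = connect("www")

# The server moves: only the map is updated, no client reference changes.
host_table["www"] = "171.64.10.10"
after_move = connect("www")
```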


BUS: A HARDWARE LAYER

Booting

• Three abstractions in a computer
  – Interpreter
  – Memory
  – Communication link
• Naming in booting
  – Linux booting sequence
  – Bus address
  – Memory load
  – Memory-mapped I/O & DMA

Keep two questions in mind

• What is the memory, the interpreter, and the communication link respectively?

• What is the name, the context, the name mapping algorithm?


Linux booting: 5 stages

• System startup: BIOS on flash
• Stage 1 bootloader: GRUB on MBR (disk)
• Stage 2 bootloader: GRUB/LILO on disk
• Kernel: Linux on disk
• Init: user app on disk

1. BIOS

• BIOS’s job
  – 1st instruction: at address 0xFFFF0
  – POST (Power-On Self Test)
  – Manage resources: name space
  – Enumerate bus devices
  – Load the boot loader into memory & give control to it
• Three abstractions
  – Interpreter: CPU, BIOS controller, memory controller
  – Memory: flash memory & RAM
  – Communication link: system bus

2. Bootloader stage 1 (MBR)

• MBR (Master Boot Record)
  – First 512 bytes on the disk (the first block)
• Bootloader stage 1’s job
  – Load stage 2 into memory & give control to it
• Three abstractions
  – I: CPU & DC & MC (disk controller & memory controller)
  – M: disk & RAM
  – C: system bus

[Figure: layout of the 512-byte MBR: bootloader, partition table, magic number]
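The MBR layout in the figure can be sketched in code. The field sizes below (446 bytes of boot code, four 16-byte partition entries, and the two magic bytes 0x55 0xAA) come from the classic PC MBR format and are not stated on the slide; `parse_mbr` and `fake_mbr` are hypothetical names used only for this illustration.

```python
import struct

SECTOR_SIZE = 512
BOOT_CODE_SIZE = 446        # stage 1 bootloader code
PARTITION_ENTRY_SIZE = 16   # four entries form the 64-byte partition table
PARTITION_ENTRIES = 4
MAGIC = 0xAA55              # bytes 0x55, 0xAA read as a little-endian integer


def parse_mbr(sector):
    """Split a 512-byte sector into boot code, partition entries, magic check."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("an MBR is exactly one 512-byte sector")
    boot_code = sector[:BOOT_CODE_SIZE]
    table_end = BOOT_CODE_SIZE + PARTITION_ENTRY_SIZE * PARTITION_ENTRIES
    table = sector[BOOT_CODE_SIZE:table_end]
    entries = [
        table[i:i + PARTITION_ENTRY_SIZE]
        for i in range(0, len(table), PARTITION_ENTRY_SIZE)
    ]
    (magic,) = struct.unpack("<H", sector[table_end:])
    return boot_code, entries, magic == MAGIC


# A made-up, empty MBR that carries only the magic number.
fake_mbr = bytes(BOOT_CODE_SIZE) + bytes(64) + b"\x55\xaa"
boot_code, entries, is_bootable = parse_mbr(fake_mbr)
```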

3. Bootloader stage 2

• Bootloader stage 2’s job
  – List the boot menu
  – Load the user-selected kernel into memory & give control to it
• Three abstractions
  – I: CPU & DC & MC
  – M: disk & RAM
  – C: system bus

4. Kernel

• Kernel’s job
  – Change the CPU to protected mode
  – Initialize the system…
  – Load init into memory and run it
• Three abstractions
  – I: CPU & DC & MC
  – M: disk & RAM
  – C: system bus

5. init

• init process
  – The first user-space program, pid = 1
  – The root & parent of all other processes
• init’s job
  – Run /etc/rc.d/rc.sysinit
    • Start the system processes listed in /etc/inittab
    • Start multiple “getty” processes, which wait for console logins
• Three abstractions
  – I: CPU & DC & MC
  – M: disk & RAM
  – C: system bus

Question

• How does the CPU find the 1st instruction in the BIOS?
  – 0xFFFF0 is hard-wired into the PC (program counter) after reset
• What happens during a memory load?

Booting sequence

• Three abstractions
  – Interpreter: CPU, memory controller, disk controller
  – Memory: BIOS’s flash memory, RAM, disk
  – Communication link: system bus
• Common patterns
  – Processor reads from memory (LOAD) and interprets
    • Memory cell naming: bus address
  – I/O devices transfer data to memory
    • Disk sector naming: block number
    • DMA & memory-mapped I/O

Specific Operations

[Figure: “A Hardware Layer: the bus”. Processor to memory: load/store; processor to I/O device: PIO/MMIO; I/O device to memory: DMA]

• Memory load/store
  – Between CPU and memory
  – Uses the physical memory address space
• I/O operations
  – MMIO: map device memory and registers into the physical address space
  – E.g., frame buffer
• DMA
  – Also uses physical addresses

Bus: Hardware Layer

• Bus features
  – A set of wires, comprising address, data, and control lines, that connect to a bus interface on each module
  – Bus arbitration protocol: decides which module may send or receive a message at any particular time
    • Bus arbiter (optional): a circuit that chooses which module can use the bus
  – Broadcast link: every module hears every message
    • Bus address: identifies the intended recipient

Split-transaction

1. The source module requests exclusive use of the bus
2. The source module places the bus address of the destination module on the bus
3. The source module signals on the READY wire to alert the other modules
4. The destination module signals on the ACKNOWLEDGE wire after copying the data
   – If the bus is synchronous, READY & ACKNOWLEDGE are not needed; each module just checks the address lines on each clock cycle
5. The source module releases the bus

Memory load example: LOAD 1742, R1

• Processor #2 => all bus modules: {1742, READ, 102}
• Memory #1 recognizes that the address is within its range
  – By examining just a few high-order address bits
• Memory #1 acknowledges and processor #2 releases the bus
• Memory #1 performs the internal operation to get the value
  – value ← READ(1742)
• Memory #1 => all bus modules: {102, value}
• Processor #2, which is waiting for this result, copies the data on the bus into its register R1
• Processor #2 acknowledges and memory #1 releases the bus
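The message exchange above can be imitated with a toy broadcast bus. This sketch omits arbitration and the READY/ACKNOWLEDGE handshake. It assumes that memory #1 owns the addresses whose bits above the low 10 equal 1 (that is, 1024 to 2047); that range and the stored value 77 are hypothetical, since the slides do not give them.

```python
LOW_BITS = 10  # assumption: a memory module decodes only the bits above these


class Bus:
    """Broadcast link: every attached module hears every message."""

    def __init__(self):
        self.modules = []

    def attach(self, module):
        self.modules.append(module)

    def broadcast(self, message):
        for module in self.modules:
            module.on_message(message)


class Memory:
    def __init__(self, bus, high_bits):
        self.bus = bus
        self.high_bits = high_bits
        self.cells = [0] * (1 << LOW_BITS)

    def on_message(self, message):
        if len(message) != 3:
            return  # a reply, not a request
        address, operation, reply_to = message
        # Range check that looks only at the high-order address bits.
        if operation == "READ" and address >> LOW_BITS == self.high_bits:
            value = self.cells[address & ((1 << LOW_BITS) - 1)]
            self.bus.broadcast((reply_to, value))


class Processor:
    def __init__(self, bus, bus_address):
        self.bus = bus
        self.bus_address = bus_address
        self.r1 = None

    def load(self, address):
        # {address, READ, reply address}, as in {1742, READ, 102}
        self.bus.broadcast((address, "READ", self.bus_address))

    def on_message(self, message):
        # Ignore everything except replies addressed to this processor.
        if len(message) == 2 and message[0] == self.bus_address:
            self.r1 = message[1]


bus = Bus()
processor2 = Processor(bus, bus_address=102)
memory1 = Memory(bus, high_bits=1)    # covers addresses 1024-2047
bus.attach(processor2)
bus.attach(memory1)
memory1.cells[1742 & 0x3FF] = 77      # arbitrary stored value
processor2.load(1742)
```

Every module receives every message and filters by address, which is the broadcast behavior the slides describe.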


Bus Address

• Bus address space (physical addresses)
  – Each module has its own bus address range
  – The BIOS is in charge of managing it at boot time
  – 1 MB in the past, 4 GB today, larger in the future
  – Basic unit: byte
• Each module examines the bus address field
  – For every message
  – Ignores messages not intended for it
  – What about sniffing?

Simple I/O Device in a Similar Way

• Example: keyboard
  – When the user presses a key, the keyboard SENDs a message containing the key value to the processor
  – As the processor is not ready, its bus interface:
    • copies the data into a temporary register,
    • acknowledges the keyboard,
    • SENDs an interrupt signal to the processor
  – The processor handles the interrupt in the next cycle
    • SENDs the value over the bus to the memory module
  – Suitable for slow devices, not suitable for disks

DMA for Disk Device

• DMA (Direct Memory Access)
  – A processor SENDs a request to a disk controller to READ a block of data
  – The request includes the address of a buffer in memory
• The disk SENDs the data directly to memory
  – Incrementing the memory address appropriately

DMA for Disk Device

• Benefits of DMA
  – Relieves the CPU so that it can execute other programs
  – Eliminates one transfer (originally two: device to processor, then processor to memory)
  – Takes better advantage of long messages, if the bus supports them
  – Amortizes the overhead of the bus protocol

Memory Mapped I/O

• Use LOAD and STORE instructions to address the registers and buffers of the I/O modules
  – Just like accessing memory
  – The address is an overloaded name that carries location info
• Provides a uniform interface to bus modules
  – The MMU translates virtual addresses to physical addresses
    • A physical address is a system bus address
  – I/O modules translate bus addresses to register addresses internally

Memory Mapped I/O

[Figure: the processor issues a virtual address; the MMU translates it to a physical address (system bus address); the memory, disk, and keyboard modules on the bus each translate the bus address internally to a register address]

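Volatile Address

    #include <stdio.h>

    /* MSVC x86 inline assembly; assumes i is stored at [ebp-4]. */
    int main(void)
    {
        int i = 10;
        int a = i;
        printf("i= %d\n", a);

        /* Change the value of i behind the compiler's back. */
        __asm { mov dword ptr [ebp-4], 20h }

        /* Unless i is declared volatile, an optimizing compiler
           may reuse the old value of i here. */
        int b = i;
        printf("i= %d\n", b);
        return 0;
    }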

Memory Mapped I/O combined with DMA

DMA example

[Figure: modules on the bus and their bus addresses: Processor 1 (101), Processor 2 (102), BIOS (256-511), Memory (3072-4095), Disk (121-124)]

• Processor #1 => all bus modules: {121, WRITE, 11742}
  – The disk acknowledges and writes the value 11742 to its control register
• Processor #1 => all bus modules: {122, WRITE, 3328}
• Processor #1 => all bus modules: {123, WRITE, 256}
• Processor #1 => all bus modules: {124, WRITE, 1}

• Disk => all bus modules: {3328, WRITE, data[11742]}
  – Memory acknowledges and saves data[11742]
• Disk => all bus modules: {3329, WRITE, data[11743]}
• ... (loop)
• Disk => all bus modules: {3583, WRITE, data[11997]}

• When the transfer is finished, the disk controller SENDs a message to the processor
  – Just as the keyboard controller does when a key is pressed
• The processor enters the interrupt handler in the next cycle
• Now the processor knows that the DMA is done
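The DMA example can be replayed in a few lines. The meaning of the four control registers is inferred from the example values and is an assumption (121 = first disk address, 122 = memory buffer address, 123 = transfer count, 124 = start command); the slides only say that 11742 is written to a control register. Memory is modeled as a dictionary and the completion interrupt as a flag.

```python
class Disk:
    """Disk controller with four memory-mapped control registers (121-124)."""

    BASE = 121

    def __init__(self, data, memory):
        self.data = data        # disk contents, indexed by disk address
        self.memory = memory    # stands in for the memory module on the bus
        self.regs = [0, 0, 0, 0]
        self.done = False       # stands in for the completion interrupt

    def write(self, bus_address, value):
        # The module translates the bus address to an internal register index.
        self.regs[bus_address - self.BASE] = value
        if bus_address == self.BASE + 3 and value == 1:
            self._transfer()

    def _transfer(self):
        start, buffer_address, count, _ = self.regs
        for i in range(count):
            # Disk => all bus modules: {buffer_address + i, WRITE, data[start + i]}
            self.memory[buffer_address + i] = self.data[start + i]
        self.done = True


memory = {}
disk_data = {address: address % 251 for address in range(11742, 11742 + 256)}
disk = Disk(disk_data, memory)

# Processor #1 programs the controller with ordinary WRITE messages.
for address, value in [(121, 11742), (122, 3328), (123, 256), (124, 1)]:
    disk.write(address, value)
```

After the last WRITE the processor is free to do other work; it only learns about completion through the interrupt, modeled here by `disk.done`.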

Questions

• Why not map the whole disk into memory?
  – So that the CPU could access any byte on the disk directly over the system bus
  – 1. Too large
  – 2. Too slow

The principle of least astonishment:

People are part of the system. The design should match the user’s experience, expectations, and mental models


FILE SYSTEM: A SOFTWARE LAYER

Outline

• UNIX file system
  – 7 layers in the file system (3 + 1 + 3)
• FS API implementation
  – OPEN, READ, WRITE, CLOSE, FSYNC
• UNIX shell
  – Implied context, search path, name discovery
• Review of the naming model

File

• A file is a high-level version of the memory abstraction
• A file has two key properties
  – It is durable & has a name
• The system layer implements files using modules from the hardware layer
  – Divide-and-conquer strategy
  – Makes use of several hidden layers of machine-oriented names (addresses), one on top of another, to implement files
  – Maps user-friendly names to these files
• In UNIX, everything is a file: KISS

API of the UNIX file system

• OPEN, READ, WRITE, SEEK, CLOSE
• FSYNC
• STAT, CHMOD, CHOWN
• RENAME, LINK, UNLINK, SYMLINK
• MKDIR, CHDIR, CHROOT
• MOUNT, UNMOUNT

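The naming layers of the UNIX file system (version 6)

• Symbolic link layer
• Absolute path name layer
• Path name layer
• File name layer
• Inode number layer
• File layer
• Block layer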

Disk structure

[Figure: a disk with several platters, showing heads 0 to 2, tracks 0 to 2, cylinders 0 and 1, and sectors 0 and 1]

• Platter
• Track
• Sector
• Head
• Cylinder

Block layer

• Block size: a trade-off
  – Neither too small nor too big
• Name mapping: block number -> block
• Context: the storage device (e.g. disk) itself
  – Binds block numbers to physical blocks
• Name-mapping algorithm
• Name discovery: super block
  – Keeps track of block usage: e.g. free list, bitmap
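A minimal sketch of the block layer, assuming the device is a linear array of fixed-size blocks and that free blocks are tracked with a simple free list. The 512-byte block size, the `Device` class, and its method names are hypothetical.

```python
BLOCK_SIZE = 512  # hypothetical block size in bytes


class Device:
    """A storage device modeled as a linear array of fixed-size blocks."""

    def __init__(self, n_blocks):
        self.blocks = [bytes(BLOCK_SIZE) for _ in range(n_blocks)]
        self.free_list = list(range(n_blocks))  # name discovery: unused numbers

    def block_number_to_block(self, b):
        """Name mapping: block number -> block contents."""
        return self.blocks[b]

    def allocate(self):
        """Take a block number off the free list."""
        return self.free_list.pop(0)

    def write_block(self, b, data):
        # Pad or truncate so that every block keeps the same size.
        self.blocks[b] = data.ljust(BLOCK_SIZE, b"\0")[:BLOCK_SIZE]


device = Device(n_blocks=8)
b = device.allocate()
device.write_block(b, b"hello")
```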

Super block

• One superblock per file system
  – The kernel reads the superblock when mounting the FS
• Superblock contains
  – Size of the blocks
  – Number of free blocks
  – A list of free blocks
  – Index to the next free block
  – Lock fields for the free block and free inode lists
  – Flag to indicate modification of the superblock
  – Size of the inode list
  – Number of free inodes
  – A list of free inodes
  – Index to the next free inode

File layer

• File requirements
  – Store items that are larger than one block
  – May grow or shrink over time
  – A file is a linear array of bytes of arbitrary length
  – Record which blocks belong to each file
• Inode (index node)
  – A container for metadata about the file

File layer

• Name mapping: index number -> block number
• Context: the inode itself
• Name-mapping algorithm
• Max length of an offset is 3 bytes in UNIX version 6
• What about large files?
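A sketch of the file layer's mapping, assuming an inode that simply holds an ordered list of block numbers plus the file size. The 512-byte block size and the 1200-byte file are hypothetical; the block numbers 61, 44, and 15 are borrowed from the “/programs/pong.c” example later in the deck.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 512            # hypothetical block size in bytes
MAX_FILE_SIZE = 2 ** 24     # a 3-byte offset can address at most 2^24 bytes


@dataclass
class Inode:
    block_numbers: list = field(default_factory=list)
    size: int = 0


def index_to_block_number(inode, index):
    """Name mapping: index of a block within the file -> block number."""
    return inode.block_numbers[index]


def offset_to_block_number(inode, offset):
    """Find the block that holds the byte at the given offset."""
    if not 0 <= offset < inode.size:
        raise ValueError("offset outside the file")
    return index_to_block_number(inode, offset // BLOCK_SIZE)


pong = Inode(block_numbers=[61, 44, 15], size=1200)
```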


Inode for larger files

[Figure: an inode pointing to data blocks directly, through an indirect block, and through a double indirect block]

Max length of an offset is 3 bytes in UNIX version 6
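How indirect blocks extend the index-to-block mapping can be sketched as below. The layout is deliberately tiny and hypothetical (2 direct pointers, 4 pointers per indirect block); it is not the actual UNIX version 6 inode format, and all block numbers are made up.

```python
N_DIRECT = 2          # hypothetical number of direct pointers in the inode
PTRS_PER_BLOCK = 4    # hypothetical number of block numbers per indirect block


def index_to_block_number(inode, index, read_pointers):
    """Map a block index within the file to a block number.

    read_pointers(b) returns the list of block numbers stored in block b.
    """
    if index < N_DIRECT:
        return inode["direct"][index]
    index -= N_DIRECT
    if index < PTRS_PER_BLOCK:
        return read_pointers(inode["indirect"])[index]
    index -= PTRS_PER_BLOCK
    if index < PTRS_PER_BLOCK ** 2:
        outer = read_pointers(inode["double_indirect"])
        inner = read_pointers(outer[index // PTRS_PER_BLOCK])
        return inner[index % PTRS_PER_BLOCK]
    raise IndexError("file too large for this inode layout")


# Made-up pointer blocks: block number -> block numbers stored inside it.
pointer_blocks = {
    20: [30, 31, 32, 33],   # indirect block
    21: [22, 23, 0, 0],     # double indirect block
    22: [40, 41, 42, 43],
    23: [44, 45, 46, 47],
}
inode = {"direct": [10, 11], "indirect": 20, "double_indirect": 21}
read_pointers = pointer_blocks.__getitem__
```

Each extra level of indirection multiplies the number of reachable blocks by the number of pointers per block, at the cost of one more block read per lookup.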

Inode number layer

• Name mapping: inode number -> inode
• Context: the inode table
• Name-mapping algorithm: lookup in the inode table
  – The table is at a fixed location on storage
• Name discovery
  – Track which inode numbers are in use
  – E.g. free list, a field in the inode

Put layers so far together

• Need more user-friendly names
  – Numbers are convenient names only for computers
• Numbers change on different storage devices

File name layer

• File name
  – Hides the metadata of file management
  – Used for files and I/O devices
• Name-mapping algorithm
  – The mapping table is saved in a directory
  – Default context: the current working directory
  – The context reference is also an inode number
• The directory itself is a file
  – Max length of a name is 14 bytes in UNIX version 6

LOOKUP in a directory

• Name comparison method: STRING_MATCH
• LOOKUP(“program”, dir) will return 10
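A sketch of LOOKUP, assuming a directory is a table of (file name, inode number) entries that is scanned linearly. The entry “program” -> 10 is from the slide; the other entry is hypothetical.

```python
def string_match(a, b):
    """STRING_MATCH: the name comparison method of the file name layer."""
    return a == b


def lookup(filename, directory):
    """LOOKUP: scan the directory table and return the inode number."""
    for name, inode_number in directory:
        if string_match(filename, name):
            return inode_number
    return None  # the name is not bound in this directory


directory = [("program", 10), ("paper", 12)]
```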


Path name layer

• Hierarchy of directories and files
  – Structured naming: e.g. “projects/paper”
• Name-mapping algorithm
  – PLAIN_NAME returns true if there is no ‘/’ in the path
• Context: the working directory
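The name-mapping algorithm is not shown in the extracted slide. A recursive sketch consistent with the PLAIN_NAME description could look like this; directories are modeled as dictionaries keyed by inode number, and all inode numbers are hypothetical.

```python
def plain_name(path):
    """PLAIN_NAME: true if there is no '/' in the path."""
    return "/" not in path


def path_to_inode_number(path, dir_inode, directories):
    """Resolve a relative path, one component at a time."""
    if plain_name(path):
        return directories[dir_inode][path]
    first, rest = path.split("/", 1)
    # The first component names the context for the rest of the path.
    next_dir = directories[dir_inode][first]
    return path_to_inode_number(rest, next_dir, directories)


# Hypothetical directories: inode number -> {file name: inode number}.
directories = {
    5: {"projects": 8},   # the working directory
    8: {"paper": 14},
}
```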


Links

• LINK: shortcut for long names
  – LINK(“Mail/inbox/new-assignment”, “assignment”)
  – Turns the strict hierarchy into a directed graph
    • Users cannot create links to directories -> acyclic graph
  – Different names, same inode number
• UNLINK
  – Removes the binding of a file name to an inode number
  – If UNLINK removes the last binding, put the inode and its blocks on the free list
    • A reference counter is needed

Links

• Reference count
  – An inode can be bound to multiple file names
  – +1 on LINK, -1 on UNLINK
  – A file is deleted when its reference count reaches 0
    • WARN: violation of the principle of least astonishment
  – No cycles allowed
    • Except for ‘.’ and ‘..’
    • These name the current and parent directories with no need to know their names
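Reference counting under LINK and UNLINK can be sketched as follows. This simplification uses a single flat directory; the inode number 23, the file contents, and the helper `create` are hypothetical.

```python
inodes = {}      # inode number -> {"refcnt": int, "data": str}
directory = {}   # file name -> inode number


def create(name, inode_number, data):
    inodes[inode_number] = {"refcnt": 1, "data": data}
    directory[name] = inode_number


def link(existing_name, new_name):
    """LINK: bind another name to the same inode, refcnt + 1."""
    inode_number = directory[existing_name]
    directory[new_name] = inode_number
    inodes[inode_number]["refcnt"] += 1


def unlink(name):
    """UNLINK: remove one binding, refcnt - 1; free the inode at 0."""
    inode_number = directory.pop(name)
    inodes[inode_number]["refcnt"] -= 1
    if inodes[inode_number]["refcnt"] == 0:
        del inodes[inode_number]   # inode and blocks return to the free list


create("new-assignment", 23, "homework text")
link("new-assignment", "assignment")
unlink("new-assignment")
```

After the UNLINK the file is still reachable as “assignment”, because the inode's reference count is 1 rather than 0.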


No cycle for LINK

[Figure: three directory graphs rooted at ‘/’, showing inode 25 with refcnt 1, then 2, then 1]

1. Initial state
   • /a/b is a directory
   • The refcnt of a is 1
   • a’s inode number is 25
2. LINK(“/a/b/c”, “a”)
   • Causes a cycle!
   • The refcnt of a is 2
3. UNLINK(“/a”)
   • The refcnt of a is 1, so inode 25 is not deleted
   • Now inode 25 is disconnected from the graph

Renaming - 1

• Text editors usually save the file being edited in a temporary file
• What if the computer fails between steps 1 & 2?
  – to_name will be lost, which surprises the user
  – Needs an atomic action (chapter 9)
• Weaker specification
  – If to_name already exists, it will still exist even if the machine fails between steps 1 & 2
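The numbered steps that the bullets refer to are not in the extracted text. The sketch below assumes an ordering that is consistent with the failure described (UNLINK the old to_name first, then LINK, then UNLINK from_name) and contrasts it with a LINK-first variant that meets the weaker specification. The `crash_after` parameter is a hypothetical device for simulating a failure, and the file names and inode numbers are made up.

```python
def rename_unlink_first(directory, from_name, to_name, crash_after=None):
    """Assumed ordering: a crash after step 1 loses to_name."""
    directory.pop(to_name, None)                # step 1: UNLINK(to_name)
    if crash_after == 1:
        return
    directory[to_name] = directory[from_name]   # step 2: LINK(from_name, to_name)
    if crash_after == 2:
        return
    del directory[from_name]                    # step 3: UNLINK(from_name)


def rename_link_first(directory, from_name, to_name, crash_after=None):
    """Weaker specification: to_name exists at every point in time."""
    directory[to_name] = directory[from_name]   # step 1: LINK replaces the binding
    if crash_after == 1:
        return
    del directory[from_name]                    # step 2: UNLINK(from_name)


crashed = {"paper.tmp": 31, "paper": 30}
rename_unlink_first(crashed, "paper.tmp", "paper", crash_after=1)

survived = {"paper.tmp": 31, "paper": 30}
rename_link_first(survived, "paper.tmp", "paper", crash_after=1)

completed = {"paper.tmp": 31, "paper": 30}
rename_link_first(completed, "paper.tmp", "paper")
```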


Absolute path name layer

• HOME directory
  – Every user’s default working directory
  – Problem: no sharing of HOME files between users
• Context: the root directory
  – A universal context for all users
  – Well-known name: ‘/’
  – Both ‘/.’ and ‘/..’ are linked to ‘/’

An example: find blocks of “/programs/pong.c”

1. The root directory ‘/’ has inode number 1
2. Find the first directory entry in ‘/’ by its block number
3. Find ‘/programs’ by comparing names
4. Find the inode of ‘/programs’ by its inode number, 7
5. Find the first file in ‘/programs/’
6. Find ‘/programs/pong.c’ by comparing names
7. Find the inode of ‘/programs/pong.c’ by its inode number, 9
8. Find the block numbers of ‘/programs/pong.c’
9. Find the data of block 61 by its block number
   – And the data of blocks 44 & 15
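The walk above can be reproduced on a toy file system. Inode numbers 1, 7, and 9 and data blocks 61, 44, and 15 are from the slides; the directory block numbers 14 and 23 and the block contents are hypothetical, and directory blocks are modeled as dictionaries for brevity.

```python
ROOT_INODE = 1

# Block number -> contents. Directory blocks hold name -> inode number.
blocks = {
    14: {"programs": 7},    # directory data of '/' (block number assumed)
    23: {"pong.c": 9},      # directory data of '/programs' (assumed)
    61: b"part 1 ",
    44: b"part 2 ",
    15: b"part 3",
}

# Inode number -> inode (only the list of block numbers is modeled).
inode_table = {
    1: {"blocks": [14]},
    7: {"blocks": [23]},
    9: {"blocks": [61, 44, 15]},
}


def lookup(name, dir_inode_number):
    """Scan the directory's blocks, comparing names."""
    for b in inode_table[dir_inode_number]["blocks"]:
        if name in blocks[b]:
            return blocks[b][name]
    raise FileNotFoundError(name)


def absolute_path_to_inode_number(path):
    inode_number = ROOT_INODE
    for component in [c for c in path.split("/") if c]:
        inode_number = lookup(component, inode_number)
    return inode_number


def read_file(path):
    inode = inode_table[absolute_path_to_inode_number(path)]
    return b"".join(blocks[b] for b in inode["blocks"])
```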

Symbolic link layer

• MOUNT
  – Records the device and the root inode number of the file system in memory
  – Records, in the in-memory version of the inode for “/dev/fd1”, its parent’s inode
  – UNMOUNT undoes the mount
• Change to the file name layer
  – If LOOKUP runs into an inode on which a file system is mounted, it uses the root inode of that file system for the lookup

Symbolic link layer

• Name files on other disks
  – Inode numbers are different on other disks
  – Supports attaching new disks to the name space
• Two options
  – Make inode numbers unique across all disks
  – Create synonyms for the files on the other disks
• Soft link (symbolic link)
  – SYMLINK
  – Adds another type of inode
  – Context: the directory hierarchy

Two types of links (synonyms)

• Add link “assignment” to “Mail/new-assignment”
• Hard link
  – No new file is created; just add a binding between a string and an existing inode
  – The target inode’s reference count is increased
  – If the target file is deleted, the link is still valid
• Soft link
  – A new file is created whose data is the string “Mail/new-assignment”
  – The target inode’s reference count is not increased
  – If the target file is deleted, the link is no longer valid
• A soft link can create a cycle with SYMLINK(“a”, “a”)
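The difference between the two kinds of link can be observed on a POSIX file system with Python's `os.link` and `os.symlink`. This is a sketch that assumes the temporary directory supports both hard and symbolic links; the file names are hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "new-assignment")
    hard = os.path.join(d, "assignment-hard")
    soft = os.path.join(d, "assignment-soft")
    with open(target, "w") as f:
        f.write("homework")

    os.link(target, hard)       # hard link: same inode, reference count + 1
    os.symlink(target, soft)    # soft link: a new file that stores the path
    links_after_link = os.stat(target).st_nlink
    same_inode = os.stat(target).st_ino == os.stat(hard).st_ino

    os.remove(target)           # UNLINK the original name
    with open(hard) as f:
        hard_still_readable = f.read() == "homework"
    # The soft link itself still exists, but its target no longer resolves.
    soft_dangling = os.path.islink(soft) and not os.path.exists(soft)
```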


Symbolic link layer

• Another interesting behavior of soft links
  – The current directory is “/Scholarly/programs/www”
  – This working directory contains a soft link
    • “CSE2012-web” -> “Scholarly/programs/www”
  – Run the following commands
    • CHDIR(“CSE2012-web”)
    • CHDIR(“..”)
  – What is the current directory?
    • “..” is resolved in the new default context

Decouple modules with indirection


Implementing the file system API

• Review
  – CHDIR, MKDIR
  – LINK, UNLINK, RENAME
  – SYMLINK
  – MOUNT, UNMOUNT
• Next
  – OPEN, READ, WRITE, CLOSE
  – FSYNC

File meta-data

• Owner ID
  – User ID and group ID that own this inode
• Types of permission
  – Owner, group, other
  – Read, write, execute
• Time stamps
  – Last access (by OPEN)
  – Last modification (by WRITE)
  – Last change of inode (by LINK)

OPEN file

• Check the user’s permission
• Update the last access time
• Return a short name for the file
  – fd: file descriptor
  – Used by READ, WRITE, CLOSE

File descriptor

• Each process starts with three open files
  – Standard in: fd = 0
  – Standard out: fd = 1
  – Standard error: fd = 2
• Can also use an fd to name opened devices
  – Keyboard, display, etc.
  – Allows a designer not to worry about where input comes from or where output goes
    • Just read from fd 0 and write to fd 1
• Each process has its own fd name space

File cursor

• File cursor
  – Keeps track of the operation position within a file
• Sharing a cursor
  – A parent passes its fd to its child
    • In UNIX, a child inherits all open fds from its parent
  – Allows parent and child to share an output file
• Not sharing a cursor
  – Two processes open the same file

fd_table & file_table

• One file_table for the whole system
  – Records information about opened files
    • Inode number, file cursor, reference count of opening processes
    • Children can share the cursor with their parent
• One fd_table for each process
  – Records the mapping from fd to an index into the file_table

File cursor sharing

fd_table (one per process)
  Process          fd   index
  A                3    115
  B                3    116
  C (B’s child)    3    116

file_table (one for the whole system)
  index   inode num   file cursor   refcnt
  115     23          128           1
  116     23          240           2

• Processes A, B, and C each have just one open file, with inode number 23
• Processes A and B open the same file but do not share a file cursor
• Processes B and C share the file cursor
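The two tables can be sketched as below, reproducing the state in the figure (indices 115 and 116, cursors 128 and 240). The `Process` class, the `fork` method, and the way indices and fds are allocated are hypothetical simplifications; READ only advances the cursor and transfers no data.

```python
file_table = {}        # index -> {"inode": n, "cursor": pos, "refcnt": k}
next_index = [115]     # next free file_table index (start value from the figure)


class Process:
    def __init__(self, fd_table=None):
        self.fd_table = dict(fd_table or {})   # fd -> file_table index

    def open(self, inode_number):
        index = next_index[0]
        next_index[0] += 1
        file_table[index] = {"inode": inode_number, "cursor": 0, "refcnt": 1}
        fd = max(self.fd_table, default=2) + 1  # fds 0-2 are stdin/stdout/stderr
        self.fd_table[fd] = index
        return fd

    def fork(self):
        """The child inherits all open fds, so it shares the cursors."""
        child = Process(self.fd_table)
        for index in child.fd_table.values():
            file_table[index]["refcnt"] += 1
        return child

    def read(self, fd, n):
        file_table[self.fd_table[fd]]["cursor"] += n

    def close(self, fd):
        index = self.fd_table.pop(fd)
        file_table[index]["refcnt"] -= 1
        if file_table[index]["refcnt"] == 0:
            del file_table[index]


a, b = Process(), Process()
fd_a = a.open(23)      # A opens inode 23: entry 115
fd_b = b.open(23)      # B opens the same file: separate entry 116
c = b.fork()           # C is B's child and shares entry 116
a.read(fd_a, 128)
b.read(fd_b, 200)
c.read(fd_b, 40)       # advances the cursor that B also sees
```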

WRITE & CLOSE

• WRITE is similar to READ
  – Allocate a new block if necessary
  – Update the inode’s size and mtime
• CLOSE
  – Free the entry in the fd_table
  – Decrease the reference counter in the file_table
  – Free the entry in the file_table if the counter is 0
• Failures in the middle may cause inconsistency
  – E.g. a block is allocated from the on-disk free list, but no inode records that block yet, so the block is lost

Question

• When writing, which order is preferred?
  – Allocate new blocks, write new data, update size
  – Allocate new blocks, update size, write new data
  – Update size, allocate new blocks, write new data

Delete after OPEN but before CLOSE

• One process has a file open
• Another process removes the last name pointing to that file
  – The reference counter is now 0
• The inode isn’t freed until the first process calls CLOSE

FSYNC

• Block cache
  – Cache of recently used disk blocks
  – Read from disk on a cache miss
  – Delay writes for batching
  – Improves performance
  – Problem: may cause inconsistency if the system fails before the write
• FSYNC
  – Ensures all changes to the file have been written to the storage device
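In Python the same idea is exposed through `os.fsync`. The sketch below first flushes the user-space buffer and then asks the operating system to push the file's cached blocks to the storage device; `durable_write` and the file name are hypothetical.

```python
import os
import tempfile


def durable_write(path, data):
    with open(path, "wb") as f:
        f.write(data)           # lands in Python's user-space buffer
        f.flush()               # hand the data to the OS (block cache)
        os.fsync(f.fileno())    # ask the OS to write it to the storage device


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "journal.bin")
    durable_write(path, b"committed")
    with open(path, "rb") as f:
        contents = f.read()
```

Without the fsync call the write may sit in the block cache for a while, which is the window in which a failure causes the inconsistency mentioned above.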


Questions

• What about the virtual address space?
• Where is the cache?
• Who assigns the physical addresses?
• What is the address of the disk?
• How to ensure DMA security?

ABOUT THE LAB

Distributed File System

• Components
  – Extent Server, Lock Server, Client
  – Shifts the complexity from server to client
    • Unlike NFS

FUSE

Lab-1: Lock Server

• RPC
  – How to pass arguments?
  – Read rpc/rpc.cc
• Lock Server and Client
  – acquire/release lock
  – Multi-threaded
• At-most-once
  – How to identify duplicated requests?

Collaboration Policy

• You must write all the code you hand in
• You are not allowed to look at others’ code
• You may discuss with other students

Program Environment

• We provide a VM image
  – Runs on VMware
  – Available on our web site
  – username: cse
  – password: cselab

Hand In

• Hand-in process
  – $ make handin
  – Rename the tgz file with your student ID
  – Email the tgz file to xiayubin at gmail.com
• Hard deadline
  – Hand in before the deadline: x 100%
  – Within 24 hours after the deadline: x 80%
  – Within 48 hours after the deadline: x 60%
  – Within 72 hours after the deadline: x 40%
  – Within 96 hours after the deadline: x 20%
