Operating Systems Design
Réseaux et Systèmes Avancés (RSA)

Martin Quinson <[email protected]>
École Supérieure d'Informatique et Applications de Lorraine – 2nd year
2008-2009
Module Presentation

Module Focus
- Study of Operating Systems (OS)
- Focus on Design, not Usage
  (System Programming → RS module)

Module Prerequisites
- C language
- Unix usage (shell usage, basic commands)
- System Programming (forking, process control, shell programming)

Module Objectives
- Understand the challenges to solve when writing an OS
- Know and understand the main components of an OS
- Know and understand the considerations behind their design
- Be able to compare solutions to these classical challenges
- Know how they are solved under Unix

Martin Quinson RSA (2008-2009) Module Presentation (2/355)
Practical Information

Module split in two parts

Part on Systems
- Lecturer: Martin Quinson
- 6 lectures, 4 TD (on-table labs), 2 TP (practical labs)
- Exam on 6/3/2009

Part on Networks
- Lecturer: Isabelle Chrisment
- 6 lectures (one is wrongly scheduled next week), 3 TD, 3 TP
- Exam on 20/4/2009
Bibliography (for this part only)

Books
- Silberschatz, Peterson, Galvin: Operating System Concepts (7th edition)
  Good introduction to the concepts
- Tanenbaum, Woodhull: Operating Systems: Design and Implementation
  The Minix book; Minix is one of the rare pedagogical operating systems
- Lamiroy, Najman, Talbot: Systèmes d'exploitation
  In French; not only Unix but also a bit of Windows
- Leffler et al.: The Design and Implementation of the 4.3BSD UNIX Operating System
  Dissection of a classical version of Unix. Somewhat dated, but instructive.

Courses available on the Internet
- Introduction aux systèmes et aux réseaux (S. Krakowiak, Grenoble – in French)
  http://sardes.inrialpes.fr/~krakowia/Enseignement/L3/SR-L3.html/
- Operating Systems and Systems Programming (M. Rosenblum, Stanford)
  http://www.stanford.edu/class/cs140/

URL of this course
http://www.loria.fr/~quinson/teach-RSA.html (empty for now, but . . . )
Agenda of this Course

Operating System Design and Advanced Usage
1. Introduction: What is an OS, Computer Architecture, Main Components, Recurring Themes
2. Process Handling: Process creation (implementing fork), Scheduling (goals, algorithms, real cases)
3. Memory Subsystem: Goals, Paging, Segmentation, Real Cases, Thrashing, User-level Management
4. Input/Output Subsystem (disks): Main Concepts, Implementation, Performance Concerns, Security
5. Security: Protection, Security
First Chapter: Introduction

Computer Architecture
- How Modern Computers Work
- Executing Programs
- Storage Hierarchy
- Data Movements
- Current Trend: Multi-processors / Multi-cores

Operating System Introduction
- What is an Operating System?
- Roles and Subsystems
- Protection
- Recurring Themes for OS

Operating System Design

Case of Linux
What is an Operating System?

Software between Applications and Reality
- Shields Applications from Hardware complexity: makes them portable
- Shields Applications from Hardware limitations: makes the finite into the (near) infinite
- Shields Hardware from Applications: provides protection

...and these are difficult goals.

[Figure: applications (Firefox, Emacs, KDE) sit on top of the Operating System, which sits on top of the Hardware (disks, graphic cards, sound)]
Computer Architecture Basics

What is a Computer anyway?

Von Neumann Model
- No separation of code and data in memory
- Revolutionary back in 1945
- A bit outdated by now

[Figure: Von Neumann model – Input and Output connected to a Control Unit, an Arithmetic & Logic Unit with its Accumulator, and Memory]

Modern Computer Systems
- Control & Computation merged into the CPU (Central Processing Unit)
- Elements communicate through a bus (data transfer facility)
- Memory is not uniform:
  - Registers within the CPU
  - Caches close to the elements (to avoid the bus's cost when possible)
  - Speeds and capacities differ greatly

[Figure: CPU with its registers, main memory, graphics controller, USB and disk controllers, each with its own cache, all connected to a common bus]
Executing Programs

Main CPU loop
1. Get the address of the next instruction to execute
   (address stored in a specific register: the Instruction Pointer, noted %eip on x86)
2. Fetch the instruction through the bus: opcode | options | parameters
   - opcode: operation code, identifies the instruction kind
   - options: set of flags configuring the instruction
   - parameters: some operands (register, address, value)
3. Run the instruction, and increment the instruction pointer
   (unless the instruction changes the IP, such as branching or function call/return)

Examples of instruction semantics with addl (adds two integers):

addl %edx, (%eax)  ; adds the content of %edx to the value stored at address %eax,
                   ; and stores the result at address %eax
addl %eax, %edx    ; adds the content of %eax to the content of %edx,
                   ; and stores the result in register %edx
addl $10, (%eax)   ; adds 10 to the value pointed to by %eax,
                   ; and stores the result back at that address
; Option flags are used to specify the semantics of the operands
Storage Hierarchy

Memory is not uniform, but hierarchical
- Huge differences between the kinds of memory, in terms of speed, size, price, etc.
- New technologies introduced recently (non-volatile main memory, flash disks)

Hierarchy, from fastest to slowest: Registers → Cache → Main Memory → Electronic disk

Level            Registers    Cache      Main Memory   Disk storage
Typical size     < 1 KB       few MB     few GB        100s of GB
Access time (ns) 0.25 - 0.5   0.5 - 25   80 - 250      5,000,000
Bandwidth (MB/s) 20k - 100k   5k - 10k   1k - 5k       20 - 150
Volatile?        Yes          Yes        Yes           No
Managed by       Compiler     Hardware   OS            OS
Backed by        Cache        Main Mem.  Disk          CD or tape

The network may be seen as a 5th level (or more)
- But the variety of networked technologies complicates the picture
Buses

Allow data movement between computer components.

[Figure: a bus board, with connectors to plug cards and links for information transport]

Bus classifications
- Synchronous (fast, but every component must run at the same pace) or Asynchronous
- Classification depending on what they interconnect:
  - Processor bus: within a chip, between its elements
  - Memory bus: between CPU and main memory (synchronous, for performance)
  - I/O bus: connects devices to main memory (asynchronous, for portability)
- On a bus, each link is specialized depending on what it conveys:
  - Address link: conveys the address of the data to transfer
  - Data link: conveys the actual data
  - Control link: used to synchronize operations and the like

[Figure: CPU and Memory connected by Address, Data and Control links]
Computer Architecture History

"Archaic" design

[Figure: CPU with its cache on a CPU–Memory bus to main memory; a bus adapter connects the I/O bus, where I/O controllers drive the screen, disk and network]

Current design

[Figure: CPU with its cache connected to a North Bridge (main memory, screen) and a South Bridge (disk, network, and controllers for printer, scanner, keyboard and mouse)]
Speaking with the Devices

What are they?
- Devices are all the input/output elements in the computer
- Hard disk, network, keyboard, mouse, digital camera, etc.

Problems
- The OS needs to handle the data movements between CPU and devices
- Devices are slow compared to the CPU (getting data from disk: ≈ 5 ms → 200 Hz)
- Devices can produce data asynchronously (keyboard, mouse, network)

First solution: Polling
- Ask for new data regularly (but wastes resources, and response time is suboptimal)

Used solution: Interrupts
- Asynchronous communication: devices interrupt the CPU to start a handler
- Similar to signals between processes, but from devices to CPU
Interrupt Handling in the OS

The big lines
1. A device is ready to send data: it sends an Interrupt ReQuest (IRQ) to the CPU
   through a specific control bus, via the Programmable Interrupt Controller (PIC)
2. After the current instruction, the CPU reads the IRQ (a number),
   notifies the controller to release it,
   and retrieves the corresponding Interrupt Handler function from the interrupt vector table
3. The current context (registers + instruction pointer) is saved, and the handler executes
4. The context is restored, and the previous activity resumes

Notes
- This behavior is hardwired in the CPU, out of the control of programs
- Interrupts can be temporarily masked (like signals); their handling is then deferred
- Installing new handlers and masking interrupts require specific privileges
- Check cat /proc/interrupts to see your mapping under Linux
How a Modern Computer Works (summary)

[Figure: summary diagram of a modern computer]
Computer Architecture Current Trend: Multi-*

Motivation: endless need for more computing power
- Modeling and simulating natural phenomena (genes, meteorology, finance)
- Gaming realism
- Web servers handling thousands of hits per second

Past solution
- Increase the clock speed, put in more electronic gates
- We are reaching the physical limits

Current and future solution
- Multiply cores, processors and machines
- Systems become far more complex to use efficiently
  → The OS needs to evolve to help
Multi-Processors

Shared Memory Processor (SMP)

[Figure: several CPUs (C) connected to one shared memory]

Cluster System

[Figure: several full systems, each with its CPUs (C) and memories (M), connected by a local network]

Distributed Systems

[Figure: several full systems, each with its CPUs (C) and memories (M), connected through the Internet]

- SMPs communicate through shared memory
- Clusters and Distributed Systems communicate through a classical network (and are thus out of scope here)
UMA (Uniform Memory Access)

[Figure: three designs – classical UMA (CPUs and shared memory on a bus), UMA with a cache per CPU, and advanced UMA adding a private memory to each CPU besides the caches]

- Every processor accesses the memory at the same speed
- But memory is too slow in the classical design, hence the addition of a cache
- One can go further by adding a private memory to each processor
Implementing UMA: Crossbar Switch

- Non-blocking network: several memory accesses are possible in parallel
NUMA: Non-Uniform Memory Access

- Biggest challenge: feeding the CPU with data (memory is slower than the CPU)
- Idea: put several CPUs per card, and plug the cards into a mainboard

[Figure: three cards, each with several CPUs (with caches) and a shared memory, plugged into the mainboard's memory network along with the disks]

Issues
- Memory access is non-uniform (slower when far away)
  → a specific programming approach is needed to stay efficient
- Cache consistency can turn into a nightmare
Multi-core: Parallelism on Chip

- Idea: reduce the distance between elements (and thus the latency)
- How: put several computing elements on the same chip

AMD/Intel dual-core chips

[Figure: two computing cores, each with its own L1 cache, sharing an L2 cache]

Cell Processor

[Figure: a 64-bit PowerPC core (the Power Processor Element, PPE) and 8 SPEs connected by the EIB (Element Interconnect Bus), plus memory and I/O controllers to the RAM – (c) Nicolas Blachford 2005]

Current trend
- Put more and more cores on the chip
- Even put non-symmetric cores: the PPE is a classical RISC core, the SPEs are SIMD
Computer Architecture Future

Put more and more cores on chips
- Intel Research produced an 80-core chip (delivering 1 Tflop)
- Complete Cluster-on-Chip designs are envisioned to come soon

Increase the architecture hierarchy even further
- Researchers build NUMAs of Cells, or Clusters of Cells

Change the paradigm
- GPUs have several memory caches, with differing performance
- Flash disks are radically different from classical hard disks,
  and other disk technologies are on the radar
- Embedded Systems and Sensor Networks radically change the goals

The Operating System must deal with this complexity
- Computer Architecture is a very active research area, led by industry
- Operating Systems are thus also an active research area
  (this is all a bit out of scope, but you need to understand the underlying complexity)
History of Operating Systems

Step 0: the OS as a standard library
- One machine, one user, one piece of software
- Still used in embedded systems
- The OS is simple (but the applications are complex)

[Figure: one Application on top of the OS, on top of the Hardware]

Step 1: multiple programs
- The previous step is inefficient: when the process blocks, the machine is wasted
- Hack: allow more than one process, and switch when one blocks
- Problems: what about infinite loops, or random writes in memory?
- OS's protection: Interposition, Privileges, Preemption

[Figure: gcc and emacs on top of the OS, on top of the Hardware]

Step 2: multiple users
- A simple OS is expensive: one machine per user
- Hack: allow more than one user at the same time
- Problems: what if users are gluttons, evil, or too numerous?
- OS's protection: Authentication, Rights Management

[Figure: Jim's and Bob's gcc and emacs on top of the OS, on top of the Hardware]
Roles of an Operating System

Roles
- Starts up the computer at boot time, shuts it down at the end
- Passive role: offers functions that the applications may call (API)
  - Access to devices (display, storing data to disks), starting new processes, etc.
- Active role: interposition when an application requests to use a resource
  - Process Scheduling, Virtual Memory, etc. (not in step 0 of the previous slide)

System Calls (syscalls): functions callable by applications to request a service from the OS
Kernel: the system part playing the active role, and implementing the system calls
Command Interface: textual (shell) or graphical (mouse) – regular applications using the API
Firmware: software running on the device controllers

[Figure: Applications, system tools and the command interface sit on the System Calls API; below is the Kernel (the Operating System), then the Firmware, then the Hardware]
Main OS Sub-Systems

Process Handling
- Process creation (fork, exec) and termination (wait, waitpid)
- Suspend, resume (sleep, pause)
- IPC (signals, pipes, semaphores, shared memory, etc.)

Memory Handling
- Motivations:
  - Memory is the only storage directly accessible from the CPU
    ⇒ applications must be loaded in memory to run
  - Applications must be protected from each other ⇒ bulletproof partitioning
- The OS knows which memory zone is leased, and to whom
- It allocates memory, and takes it back, on demand

I/O Handling
- Controls every device (through the controllers)
- Unifies the device ↔ OS interface (portability)

Other Sub-Systems
- File System: stable storage (naming, robustness – cf. RS module)
- Networking: communicating with other machines (cf. second half of this module)
Protection

Motivation
- An OS has to protect some resources:
  - Hardware: memory, CPU time, devices (fair sharing; no hardware misuse)
  - Software: data on disk, in memory, elsewhere (privacy, access management)
- Particularly true for multi-user OSes

Hardware-aided protection
- Modern CPUs provide at least two execution levels:
  - User mode: not privileged → peasant
  - Privileged mode: privileged → god (also called supervisor, superuser or kernel mode)
- Applications run in user mode; the kernel runs in privileged mode
  (switched on syscalls, or by an interrupt giving control back to the OS)
- Some instructions are said to be privileged: they are only usable in the corresponding mode (I/O)
- User-level code requests privileged operations from the kernel through syscalls

[Figure: a user application runs in User Space (mode bit = 1); calling a syscall or a hardware interrupt switches context into Kernel Space (mode bit = 0), which runs the syscall and then resumes the application's execution]
Protection Examples

I/O protection
- All I/O instructions are privileged
- Every I/O request must transit through the kernel
- (Before that, on MS-DOS on the 80386, a virus could destroy your floppy disk)

Memory protection
- Examples of regions you don't want the user to mess with:
  - The interrupt vector (they could install their own handlers)
  - The authentication tables (they could pretend to be anyone)
  - Other users' data (no confidentiality)
- Hardware-level Memory Management Unit (MMU):
  - Two specific registers, base and limit, bound the area accessible to the application
  - The assembly code changing them is privileged
  - Requesting memory out of the bounds gives control back to the OS
  - The bounds are not effective in kernel mode

CPU time (no infinite loops)
- Regular clock interrupts give control back to the OS
OS Theme #1: Finite Pie, Infinite Demand

How to make the pie go further?
- Key: resource usage is bursty, so give resources to others when idle
- Not new: rather than one classroom, instructor or restaurant per person, we share

But more utilization = more complexity
- How to manage? (e.g. one road per car vs. a freeway)
  → abstraction (lanes), synchronization (traffic lights), capacity increase (build more)
- What happens when the illusion breaks? (resource really exhausted)
  Refuse service (busy signal), give up (VM swapping), back off and retry (TCP/IP), break down (freeway)

How to share the pie?
- Ask the users? Yeah, right.
- Usually: monitor usage, and attempt to be fair by re-apportioning

How to handle pigs?
- Quotas (disk), ejection (swap), buying more resources, breaking down (network), laws (road)
- It is hard to distinguish responsibly busy programs from stupidly selfish pigs
OS Theme #2: Performance

Trick #1: Exploit bursty applications
- Take resources from the idle guy and give them to the busy one; both are happy

Trick #2: Exploit skew
- 80% of the time is spent in 20% of the code
- 90% of the memory accesses touch only 10% of the total
- The idea of caches:
  - Put 10% of the memory in fast, expensive memory, and the rest in slow, cheap memory
  - The whole looks like one big, fast memory

Trick #3: Exploit history
- The past predicts the future (because future = past)
- What is the best cache entry to evict? If future = past, the least recently used one
- Works all the time (weather forecast, stock market, etc.)
Operating System Design

Introduction
- There is no "perfect" solution, but some approaches have proven successful
- The internal structures of different OSes vary widely

Goals
- User goals: easy to use and learn, reliable, safe, fast
- System goals: easy to design, implement and maintain; flexible, reliable, error-free, and efficient

Policy and Mechanism
- Classical Software Engineering consideration:
  separate what will be done (the policy) from how it is done (the mechanism)
- Allows maximum flexibility, and portability across implementations
Simple Structure: MS-DOS

Main design goal
- Stuff more functionality into 640 KB

Implications
- Not well structured
- Layers are bypassed when needed
- Hard to maintain, and to code for
Layered Operating System

Similar to TCP/IP or OSI
- Build your OS as a stack of layers
- Layer 0 is the Hardware, the highest layer is the UI
- Layer N only uses the services of layer N-1

Example: traditional UNIX
Monolithic Operating Systems

Definition
- Every function of the OS is in one big binary
  (processes, memory, IPC, file systems, network stack, device drivers)
- Everything runs in kernel mode

Benefits
- Easier to design and implement
- Better performance¹

Drawbacks
- Ever-growing code base (as drivers are added)
- Memory waste (even unused elements are loaded)
- Hard to maintain (multiple interactions)
- Security not enforced (a bug in one driver → system crash)

¹ This point is commonly accepted, but has very strong opponents.
Micro-kernel Operating Systems

Idea: move all you can to user space
- Only low-level address space and thread management remain in the kernel, plus IPC
- Scheduling, Virtual Memory mapping, File Systems, Drivers, etc. run as daemons

[Figure: in a monolithic system, application syscalls are trapped by a kernel containing the scheduler, file system, VM and drivers; in a micro-kernel, the kernel only keeps IPC, threads and low-level memory, while the file system, scheduler, drivers and VM run as user-space daemons reached through IPC]
Do Micro-kernels Suck?

It is a neat idea
- A micro-kernel is a few dozen kilobytes; Linux is a few hundred megabytes
- A small code base is easier to trust (kernel-mode bugs are disasters)
- It is easier to optimize (for example on ARM, where the MMU is hard to deal with)

Why didn't it work yet?
- The first implementation was... not a technical success (Mach 1)
- The idea spread that IPC times between daemons must be a performance killer:
  more IPCs instead of function calls, plus context switches for each IPC...
- But recent micro-kernels prove this wrong:
  L4 has a 4-5% performance overhead on most benchmarks

Some examples
- L4 (Wombat, Darbat, ...), GNU/HURD, Minix
- Mac OS X (cheater!), QNX
- Still waiting for the big day
Modular Operating Systems

Definition
- Everything runs in kernel space, but parts are loaded on demand
- The elements are well partitioned, and communicate through interfaces

Goal: some advantages of micro-kernels, without the performance loss
- The code is modular, so Software Engineers are happy
- We still have function calls between OS components, instead of IPC

Almost every modern OS is architected this way.
Virtual Machines

Virtualization idea
- Push the layered approach to its extreme:
  Hardware + (host) OS = some kind of hardware
- Guest OSes (running on top) have the illusion of running on real hardware
- The host OS is in charge of sharing the real resources between the several guest OSes
  (first implemented by IBM in 1972 on mainframes)

Para-virtualization idea
- Quite the same, but the guest OS is not presented exactly the same interface as the real hardware
- It thus needs to be modified, but the result proves faster
- The host OS is then called a Hypervisor
Architecture of the Unix Kernel

Moufida Maimour Systemes d'exploitation II (06/07) (47/216)
Kernel architecture: descriptive approach

The kernel is made of 3 big parts:
- the system call interface, between the user programs and the kernel
- the process management subsystem:
  - process management: creation, termination, suspension, synchronization and communication
  - scheduling: handles time sharing and priorities
  - memory management: handles the sharing of objects, inter-process protection, and swapping or paging
- the file management subsystem:
  - buffer cache management: handles the allocation of I/O buffers
  - file management: handles protection, disk space allocation, and file naming
  - device management: handles character-mode and block-mode files, and access to the devices, including the network
Kernel architecture: functional approach

The UNIX kernel is split into 2 big parts, which cooperate to share the system resources and to implement some services:

Upper part: provides services to the user processes, in response to system calls and exceptions
- Synchronous execution in kernel mode, to be able to access both the kernel data structures and the contexts of the user processes

Lower part: a set of subroutines invoked to handle hardware interrupts
- Activities happening asynchronously, executing in kernel mode
Invoking System Services

Hardware interrupts and exceptions
- An interrupt is caused by a signal coming from the world outside the processor, and modifies its behavior. The goal is to warn the processor that an external event occurred:
  - end of an I/O, clock tick, ...
  - On the 80x86: vectors 32-238. Linux uses vector 128 (0x80) for system calls.
- An exception is a signal caused by a malfunction of the currently running program:
  - division by zero, page fault, ...
  - On the 80x86: 20 different exceptions, 0..19. The values 20 to 31 are reserved by Intel for the future.
- Each interrupt or exception has a subroutine (handler) in charge of the corresponding event: the interrupt vector table, or IDT (Interrupt Descriptor Table) in Linux parlance.
Invoking System Services

Handling an exception or an interrupt
1. The interrupt/exception arrives
2. The current context (PC, ...) is saved, using the kernel stack
3. The interrupt vector table is consulted to find the address of the interrupt's subroutine (the handler), and the PC is loaded with that address
4. The subroutine executes in kernel mode
5. The old context is restored, and the old program resumes in user mode
Invoking System Services

System calls
- The interface between the OS and the user programs is defined by the set of system calls the OS provides.
- A system call can be seen as a call to a classical function, performed in kernel mode.
- A system call is generally implemented with a software interrupt, trapping to a specific entry of the interrupt vector table.
- A software interrupt is triggered by a program, using a special instruction (trap, syscall).
- There is no process switch (no preemption).
- The handler executes using the resources of the interrupted process's context (its kernel stack).
- The information needed by the request can be passed through registers, the stack, or memory.
Invoking System Services

Implementation of system calls

[Figure: system call implementation diagram]
Invoking System Services: the standard C library
- The code of a system call is often in assembly, but a C library function wrapping it is usually provided.
Invoking System Services: system call example, read

count = read(df, tampon, nbOctets)

[Figure: the path of a read() call between user space and kernel space: the user program pushes nbOctets, &tampon and df onto the stack, then calls the read() library function; the library function places the code of the read system call in a register and traps into the kernel; the kernel branches to the system call's code; the handler then returns to the library function, which returns to the caller, which increments SP to clean the stack]
Invoking System Services

Exception handling under Linux
- Most exceptions issued by the CPU are interpreted by Linux as error cases
- When an exception occurs, the kernel sends a signal to the process that caused it
- Example: division by zero → the SIGFPE signal is sent
- An exception handler:
  1. Saves the contents of most registers onto the kernel stack (assembly)
  2. Handles the exception (C function)
  3. Leaves the handler by invoking the ret_from_exception function
Invoking System Services

Interrupt handling under Linux
- Difference with exceptions: a signal cannot be sent to the current process ⇒ different handling
- Kinds: timer interrupts, inter-processor interrupts, I/O interrupts
- An interrupt handler:
  1. Saves the IRQ value and the register contents onto the kernel stack
  2. Sends an ACK to the PIC, allowing it to process further interrupts
  3. Executes the Interrupt Service Routines (ISR) associated with the devices sharing the IRQ line
  4. Invokes the ret_from_intr() function
Invoking System Services

System calls under Linux

[Figure: an application program calls xyz(); the libc wrapper routine xyz() executes a SYSCALL instruction, switching from user mode to kernel mode; the kernel's system_call handler dispatches to the system call service routine sys_xyz(), then returns to user mode with SYSEXIT]
Second Chapter: Process Handling

Introduction

Process Implementation
- Process Memory Layout
- Process Control Block

Process Scheduling: Theoretical Concepts
- Context Switching
- OS Scheduling Infrastructure
- Scheduling Algorithms

Scheduling in Real OSes
- UNIX: Solaris, HP-UX, 4.4BSD, Linux 2.6
- Windows XP

Process Creation
- UNIX
- Windows
Introduction to Processes

What is a Process?
- Fact: the computer has to deal with a variety of programs
  - jobs on batch systems, user or system programs on time-shared systems
  - "job" and "task" are used interchangeably in the following
- Process: a dynamic entity executing a program on a processor
- OS point of view:
  - Program counter and stack: the active part, doing stuff (thread)
  - Address space (memory protection): the passive part, the thread's environment
  - Internal state (open files, etc.): the environment on the OS side

Process != Program
- Program: code + data (passive)

  int i;
  int main() {
    printf("Salut\n");
  }

- Process: the program running
  (the code of main(), the data int i, plus a heap and a stack)

- Even if you use the same program as me, it won't be the same process
Why Processes?

To deal with complexity
- Allow activities to coexist simply:
  each one lives in a separate box and only deals with the OS,
  and the OS handles all of them uniformly

For efficiency
- When a process blocks, execute the next one

[Figure: without overlap, gcc waits while emacs is blocked waiting for the user, and time is wasted; with overlap, gcc runs while emacs waits, and that time is saved]
Process Memory Layout

Process point of view (big picture)
- From low to high addresses: Code, Global data, Heap (cf. malloc),
  a hole in the addressing space, the execution Stack, then a reserved area
- The addressing space spans MAXINT bytes (4 GB / 4 TB)

UNIX details
- From 0x00000000 to 0xefffffff: Code, Constants, Globals, Heap (cf. malloc),
  Dynamic Libraries, then the Stack (one frame per function call),
  with holes in the addressing space in between; the Kernel area is protected
- The sections of the program binary (Text, Data, BSS) map onto the process
  segments (Text Segment, Data Segment, Stack Segment)

OS point of view
- User Mode: access only to the private addressing space
- Kernel Mode: idem, plus
  - the protected address space
  - its own kernel-mode stack for the calls made in kernel mode
    (one per process, for reentrance)

Remarks
- The memory of each process is isolated → protection
- Code is shared between processes (done automatically by the OS)
- The data segment is not shared (unless you use shm & mmap)
- Threads share everything (but the stack)
Process Control Block (PCB)

Information associated with each process:
- Process state (running, ready, blocked, etc.)
- Program counter
- CPU registers
- CPU scheduling information
- Memory-management information
- Accounting information
- I/O status information
- ...

[Figure: a PCB containing the process state, the process ID, the program counter, the registers, the memory limits, and the list of open files]
PCB Data Structures
PCB classically split in two parts
I Memory was very expensive back in the day
; reduce size of resident areas
Process tableI Always in memory
I Contains info on every process(even swapped ones)
I What’s needed for scheduling(amongst other)
User StructureI Part of process virtual memory
(can be swapped away)
I What’s needed when process active
[Figure: the process structure lives in the process table, which always resides in memory; the user structure, kernel stack, data, stack and text live in the process's virtual memory and can be swapped out]
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (46/355)
PCB in 4.4BSD (partial view)
User StructureI Execution state:
general registers, SP, PC
I Pointer to entry in process table
I Information on syscall currently run
I Open file descriptors
I Current directory
I Accounting information
  I Time spent in user/kernel modes
  I Limits (CPU time, memory, . . . )
  I Maximal stack size
I Kernel stack of this process
Process StructureI Identification: PID, PPID, UID
I Scheduling: priority, blocked time
I Memory: pointer to pages table
I Synchro: blocking event description
I Signals: pending ones, handlers
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (47/355)
Second Chapter
Process Handling
Introduction
Process Implementation
  Process Memory Layout
  Process Control Block
Process Scheduling: Theoretical Concepts
  Context Switching
  OS Scheduling Infrastructure
  Scheduling Algorithms
Scheduling in Real OSes
  UNIX: Solaris, HP-UX, 4.4BSD, Linux 2.6
  Windows XP
Process Creation
  UNIX
  Windows
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (48/355)
Process States
Existing states
I new: just created
I running: instructions get executed
I waiting: blocked, waiting some event to occur
I ready: waiting to be assigned some processor
I terminated: finished execution
Transition diagram
I admitted: new ; ready
I scheduled: ready ; running
I interrupt: running ; ready
I I/O or event wait: running ; waiting
I event completion: waiting ; ready
I exit: running ; terminated
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (49/355)
Context Switching
Process ContextI User Context: Stack, Data and Text segments
I Hardware Context: CPU registers and pointers
I System Context: User Structure part of PCB (process structure, kernel stack)
Context Switching
I Needed to change running process (interrupt, I/O request, etc)
I Save one process’s context and restore the one of another
I Synchronous causes
  I Explicit: call to sleep()
  I Implicit: time elapsed, I/O request
I Asynchronous causes
  I For example, a hardware interrupt
I All this is overhead: keep it fast (timing is hardware-dependent)
[Diagram: P0 is running when a syscall, interrupt or trap occurs; the OS saves its state in PCB0 and restores the state of P1 from PCB1; P1 runs while P0 is inactive, until the reverse switch saves P1's state in PCB1 and restores P0's state from PCB0]
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (50/355)
Deciding which Process gets Scheduled
First Ideas
I Scan the process table for the first runnable
  I Expensive, weird priority (small pids get more)
    At least separate runnable and blocked threads!
I FIFO? (put threads on the back of the list, pull them off the front)
  (some toy OSes do so)
I Priority? (give some threads more chances to get the CPU)

Scheduling Challenges
I Fairness: don't starve processes
I Prioritize: more important first
I Deadline: must be finished before 'x' (car brakes, music & voice)
I Optimizations: some schedules are way faster than others

No Optimal Policy
I Many variables, can't optimize them all (multi-objective optimization)
I Conflicting goals:
  I I want to finish soonish, who cares about you?
  I Less important jobs should not completely starve
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (51/355)
OS Scheduling Infrastructure
QueuesI Processes placed in several queues depending on their state
I Job Queue: all jobs in the system
I Ready Queue: jobs in main memory, ready and waiting
I Device Queue: jobs waiting for an I/O device
I Processes migrate among the different queues
Big Picture
[Figure: the ready queue and the device queues (disk 0, terminal, tape 0), each a head/tail linked list of PCBs (PCB1..PCB7)]
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (52/355)
OS Scheduling Queues
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (53/355)
OS Schedulers
Short Term Scheduler
I Decides which job from the ready queue gets scheduled
I Runs often (ms) ; must be fast

Long Term Scheduler
I Decides which jobs get into the ready queue
I Runs less often (second to minute) ; can be slow
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (54/355)
Process Scheduling
Process Categorization
I CPU-bound: only uses the CPU (would go faster with a bigger CPU)
I I/O-bound: limited by I/O speed (would go faster with faster disks/memory)

Remarks
I Very few processes are CPU-bound for a long time
I In real code, the same program is alternately CPU-bound and I/O-bound

Usage Bursts
I CPU burst = code section being CPU-bound (same for I/O)
I Improving scheduling requires understanding the distribution of bursts
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (55/355)
CPU Bursts Distribution
I Interactive systems ; shorter CPU bursts
I Scientific code ; (very) long CPU bursts (CPU burners)
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (56/355)
Scheduling Criteria
Scheduling Goal
I User perspective: Reduce completion time
I Owner perspective: Maximize resource utilization
Criteria
I CPU utilization: keep the CPU busy ; max
I Throughput: number of jobs completed per unit of time ; max
I Turnaround time: makespan of a particular job ; min
I Waiting time: amount of time a job waited in the ready state ; min
I Response time: time between submission and first action (time-shared) ; min

A whole load of algorithms exist
I Some are simple (silly?)
I Some are clever, specifically designed to improve one criterion
I Impossible to satisfy all criteria at the same time
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (57/355)
First Come First Served (FCFS) Algorithm
I Simply implemented with a linked list
Workload 1
Process  Burst time  Arrival  Waiting Time
P1       24          0        0
P2        3          1        24
P3        3          2        28
Gantt chart: | P1 | P2 | P3 |
             0    24   27   30
Average Waiting Time: 17.3

Workload 2
Process  Burst time  Arrival  Waiting Time
P1       24          2        10
P2        3          0        0
P3        3          1        5
Gantt chart: | P2 | P3 | P1 |
             0    3    6    30
Average Waiting Time: 3.3

I This effect is called the Convoy Effect (short jobs placed after long ones suffer)
I This is not adapted to interactive systems (I/O-bound jobs suffer)
I What about prioritizing short jobs?
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (58/355)
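The computation behind these averages can be sketched in a few lines of C (a sketch, not course material: the function name is ours, and for simplicity all jobs are assumed to arrive at t = 0, which makes the averages come out slightly different from the tables above):

```c
#include <assert.h>

/* FCFS: a job's waiting time is the sum of the burst times of the
   jobs queued before it (all jobs assumed to arrive at t = 0) */
static double fcfs_avg_wait(const int burst[], int n) {
    int elapsed = 0, total_wait = 0;
    for (int i = 0; i < n; i++) {
        total_wait += elapsed;   /* job i waited this long */
        elapsed += burst[i];     /* then runs for its whole burst */
    }
    return (double)total_wait / n;
}
```

Running the long job first (bursts 24, 3, 3) gives an average wait of 17; short jobs first (3, 3, 24) gives 3: the convoy effect in two calls.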
Shortest Job First (SJF) Algorithm

Process  Burst time  Waiting Time
P1       6           3
P2       8           16
P3       7           9
P4       3           0
Gantt chart: | P4 | P1 | P3 | P2 |
             0    3    9    16   24
Average Waiting Time: 7

SJF is as optimal as unrealistic
I Impossible to achieve a lower average waiting time (but long jobs suffer)
I But how to know the burst time in advance?

Guessing Burst Time
I Use the past to predict the future! (as usual)
I Exponential averaging:
  I tn: actual length of the nth CPU burst
  I τn: guess for the nth CPU burst
  I α: parameter between 0 and 1
  I τn+1 = α tn + (1 − α) τn
I α = 0 =⇒ τn+1 = τn: recent measurements ignored
I α = 1 =⇒ τn+1 = tn: only the last measurement used
I Expanding the recurrence: τn+1 = α tn + α(1 − α) tn−1 + · · · + α(1 − α)^j tn−j + · · ·
  so the coefficient of a measurement decreases with its age
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (59/355)
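The exponential-averaging update is a one-liner; a minimal C sketch (the function name is ours):

```c
#include <assert.h>

/* tau_{n+1} = alpha * t_n + (1 - alpha) * tau_n */
static double next_guess(double alpha, double t_n, double tau_n) {
    return alpha * t_n + (1.0 - alpha) * tau_n;
}
```

With α = 0 the guess never moves; with α = 1 it tracks the last burst exactly; α = 0.5 is a common compromise between the two.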
Round-Robin (RR) Algorithm
Big lines
I Interrupt the process after a while (regardless of whether it's done or not)
I Schedule someone else

Advantages
I No convoy effect: small jobs not blocked forever behind big jobs
I Big jobs do not starve by yielding for small jobs

Picking the right quantum
I Quantum too big ; good throughput, bad interactivity
  [Timeline: P1 (running) and P2 (doing I/O) alternate long slices; the reactivity of P1 is very bad (lags) and the I/O device is underused]
I Quantum too small ; good reactivity, high overhead
  [Timeline: processes continuously interrupted]
I Quantum = ∞ ; FCFS
I Classical value: 10-100 milliseconds
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (60/355)
Scheduling with Priority
Process Priority
I Associate a priority (an integer) to each process
I CPU allocated to ready process with highest priority
I Can be preemptive or not (whether we interrupt a running process before it is done)
ProblemI Low priority processes may never get to the resource (starvation)
I Solution: Aging (priority increases when not served)
Particular casesI FCFS: give the same priority to anyone
I SJF: priority inversely proportional to burst length
RemarkI On UNIX, processes are traditionally given a nice value
(inversely proportional to priority: nice processes give CPU to others)
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (61/355)
Multi-Level Scheduling
Split the ready queue into sub-queues, each with its specific scheduling policy
I Foreground (interactive jobs): RR
I Background (batch jobs): FCFS
Need to schedule between queues
I Any foreground first (but possible starvation)
I Preemptive to share 80%/20% of CPU
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (62/355)
Scheduling Algorithms
Feedback Multilevel Scheduling
I Here, processes can move between queues, which separates processes with different CPU-burst characteristics
I If a process has long CPU bursts, move it to a lower-priority queue ⇒ interactive processes end up with the highest priority
I If a process waits for a long time, move it to a higher-priority queue
[Figure: three queues, from highest to lowest priority: quantum = 8, quantum = 16, FCFS]
Moufida Maimour Systemes d’exploitation II (06/07) (90/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (63/355)
Scheduling Algorithms
Feedback Multilevel Scheduling is defined by
I the number of queues
I the scheduling algorithm of each queue
I the method used to decide when to change the priority of a process
[Figure: same three queues: quantum = 8, quantum = 16, FCFS]
Moufida Maimour Systemes d’exploitation II (06/07) (91/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (64/355)
SOLARIS Scheduling

3 scheduling classes
I Timesharing and interactive (TS & IA): RR with priority (more priority to the most interactive processes)
I System (SYS): FCFS with preemption and fixed priorities
I Realtime (RT): RR with priority, where an RT process keeps a fixed priority for its whole life
[Figure: global priorities, from low to high: 0-59 time-shared and interactive, 60-99 system, 100-159 real-time, 160-169 interrupts]
Moufida Maimour Systemes d’exploitation II (06/07) (92/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (65/355)
SOLARIS Scheduling

Dispatch table: interactive processes
I time quantum: the default length of the quantum assigned to the process
I time quantum expired: the new priority of a process that used its whole quantum
I return from sleep: the new priority of a process that blocked before using its whole quantum
Moufida Maimour Systemes d’exploitation II (06/07) (93/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (66/355)
HP-UX Scheduling

2 types of schedulers:

Real-time (RT)
I FIFO or RR
I fixed priorities, which cannot be changed by the kernel
I non-preemptive: a process runs until it finishes or blocks

Time-sharing (TS)
I RR
I the priority value increases (so the priority decreases) with CPU usage, and decreases while the process waits
I preemptive
Moufida Maimour Systemes d’exploitation II (06/07) (94/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (67/355)
HP-UX Scheduling: the time-sharing scheduler

The kernel differentiates, in terms of priority, user processes from system processes (kernel mode, waiting for an event). The latter have a higher priority.
I in user mode, a process can be preempted, stopped or even swapped out to secondary memory
I in kernel mode, a process runs until it blocks, an interrupt occurs or it terminates
[Figure: priority bands from highest to lowest: real-time processes (0-127), system processes (128-177), user processes (178-255)]
Moufida Maimour Systemes d’exploitation II (06/07) (95/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (68/355)
4.4BSD

Process states
I SIDL: intermediate state during process creation (idle)
I SRUN: ready (runnable)
I SSLEEP: waiting for an event
I SSTOP: stopped by its parent or a signal
I SZOMB: awaiting termination (zombie)

Remarks
I There is no "currently running" state
I A number of flags complete the information on the state of a process
Moufida Maimour Systemes d’exploitation II (06/07) (96/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (69/355)
4.4BSD

Scheduling (1)
I A process has 2 priorities:
  I user mode: p_usrpri ∈ [PUSER,127], where PUSER=50 is the priority given to the highest-priority user process.
  I kernel mode: p_priority ∈ [0,PUSER], giving more chances to a process in kernel mode, so that it releases the system resources it holds as soon as possible.
I a quantum = 0.1s (empirical value)
I the priority of a process is adjusted dynamically:

    p_usrpri = PUSER + p_cpu/4 + 2 p_nice    (1)

I p_nice lets the user modulate the priority of the process,
I p_cpu is incremented every 10 ms and estimates the CPU consumption of the active process.
⇒ The priority of a process degrades with its CPU consumption.
Moufida Maimour Systemes d’exploitation II (06/07) (97/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (70/355)
4.4BSD

Scheduling (2)
I every second, p_cpu is readjusted with the formula:

    p_cpu = (2 load / (2 load + 1)) p_cpu + p_nice    (2)

  where load is an estimation of the system load, namely the length of the queue of ready processes.
I When a sleeping user process is reactivated, the scheduler readjusts p_cpu:

    p_cpu = p_cpu (2 load / (2 load + 1))^p_slptime    (3)

  where p_slptime counts the waiting time of the process
⇒ this fades out the distant past.
Moufida Maimour Systemes d’exploitation II (06/07) (98/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (71/355)
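Formulas (1)-(3) can be transcribed directly; the sketch below is ours and uses doubles for clarity (the real kernel works on scaled integers):

```c
#include <assert.h>

#define PUSER 50

/* (1): user-mode priority; a higher value means a lower priority */
static double p_usrpri(double p_cpu, double p_nice) {
    return PUSER + p_cpu / 4.0 + 2.0 * p_nice;
}

/* (2): once per second, decay p_cpu by 2*load / (2*load + 1) */
static double decay(double p_cpu, double load, double p_nice) {
    return (2.0 * load) / (2.0 * load + 1.0) * p_cpu + p_nice;
}

/* (3): on wakeup, apply one decay step per second slept */
static double wakeup(double p_cpu, double load, int p_slptime) {
    double f = (2.0 * load) / (2.0 * load + 1.0);
    while (p_slptime-- > 0)
        p_cpu *= f;
    return p_cpu;
}
```

With load = 1 the decay factor is 2/3 ≈ 0.66, which is exactly the filter applied in the worked example of Scheduling (3).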
4.4BSD

Scheduling (3)
Example. Consider a single process, which is active and consumes all the CPU. This process consumes T_i clock ticks during second i, and load = 1.
Every second, the filter is applied with the formula p_cpu = 0.66 p_cpu:

    p_cpu = 0.66 T0
    p_cpu = 0.66 T1 + 0.44 T0
    p_cpu = 0.66 T2 + 0.44 T1 + 0.30 T0
    p_cpu = 0.66 T3 + ... + 0.20 T0
    p_cpu = 0.66 T4 + ... + 0.13 T0

Note how the effect of T0 fades away over time.
Moufida Maimour Systemes d’exploitation II (06/07) (99/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (72/355)
4.4BSD: the runqueue
I The set of processes in the Runnable state forms the queue of ready processes: the runqueue
I Scheduling is implemented with a linked list of processes for each group of floating priorities.
I qs: the table of heads and tails of the lists
I whichqs: a table associated with qs, indicating whether each list is occupied
[Figure: whichqs is a bit vector (one 0/1 flag per priority number) pointing into qs; each qs entry holds the head and tail of a list of proc structures]
Moufida Maimour Systemes d’exploitation II (06/07) (100/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (73/355)
Linux 2.6: process implementation

Process descriptor (task_struct)
contains all the information related to a process

Process state (state field)
I TASK_RUNNING: the process is ready to run, or currently running.
I TASK_INTERRUPTIBLE: the process is suspended, waiting for some condition:
  I a hardware interrupt,
  I the release of a resource the process is waiting for,
  I the reception of a signal, . . .
I TASK_STOPPED: process stopped by a SIGSTOP, SIGTSTP, SIGTTIN or SIGTTOU signal.
I TASK_ZOMBIE: the process has terminated but its parent has not yet issued a wait()-like system call to get information about the dead process.
I . . .
Moufida Maimour Systemes d’exploitation II (06/07) (101/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (74/355)
Linux 2.6: process implementation
Process descriptor (task_struct)

Lightweight processes
I A lightweight process corresponds to a thread
I A thread group is a set of lightweight processes implementing a single multithreaded application:
  I they share the addressing space
  I they act as a whole with respect to some system calls: getpid(), kill(), . . .
  I they can be scheduled separately
  I each has its own pid, but a single group pid: the pid of the first thread of the group

Process identification
I pid: process identifier from 0 to 32767 = PID_MAX_DEFAULT − 1
  (see /proc/sys/kernel/pid_max; the pid_map array tracks which pids are already assigned)
I tgid (thread group leader pid): pid of the first lightweight process of the group
  getpid() returns tgid, not pid (POSIX compatible)
Moufida Maimour Systemes d’exploitation II (06/07) (102/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (75/355)
Linux 2.6: process implementation
Process descriptor (task_struct)
[Figure: the task_struct holds the process state and points to the thread_info (low-level, per-process information: addr_limit, cpu, *task), the memory information (mm_struct), the open-file information (files_struct) and the list of received signals (signal_struct)]
Moufida Maimour Systemes d’exploitation II (06/07) (103/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (76/355)
Linux 2.6: process implementation

Process descriptor on the 80x86

thread_union

    union thread_union {
        struct thread_info thread_info;
        unsigned long stack[2048];
    };

[Figure: the 8KB union (0x015fa000-0x015fbfff) holds the 52-byte thread_info at its base (up to 0x015fa034), whose task field points to the process descriptor (task_struct); the kernel stack grows down from the top and esp points into it]
Moufida Maimour Systemes d’exploitation II (06/07) (104/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (77/355)
Linux 2.6: process implementation
Process descriptor on the 80x86
[Figure: same thread_union layout as on the previous slide]

From esp, the kernel can find, for the current process:

the address of the thread_info structure (current_thread_info):

    movl $0xffffe000, %ecx
    andl %esp, %ecx
    movl %ecx, p

the address of its descriptor (the current macro):

    movl $0xffffe000, %ecx
    andl %esp, %ecx
    movl (%ecx), p
Moufida Maimour Systemes d’exploitation II (06/07) (105/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (78/355)
Linux 2.6: process implementation

Lists of process descriptors
I list of all processes,
I list of ready processes, one list per priority level, using the prio_array_t structure:
  I int nr_active: number of descriptors in the lists,
  I unsigned long[5]: bitmap; if a flag is 1, the corresponding list is non-empty,
  I struct list_head[140] queue: the heads of the 140 priority lists.
I list of waiting processes, one list per event
  I a flag indicates whether to wake up a single process or all of them.
Moufida Maimour Systemes d’exploitation II (06/07) (106/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (79/355)
Linux 2.6: process scheduling

Scheduling classes
I SCHED_FIFO (FIFO real-time process): a process keeps running as long as no other process has a higher priority.
I SCHED_RR (Round-Robin real-time process): ensures fairness among processes of the same priority.
I SCHED_NORMAL (conventional, time-shared process)
Moufida Maimour Systemes d’exploitation II (06/07) (107/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (80/355)
Linux 2.6: process scheduling
Principle
I Two separate ranges of static priorities:
  I conventional priorities: 100-139, corresponding to nice values from -20 to 19. The nice value can be changed with the nice() or setpriority() system calls
  I real-time priorities: 0-99
I Dynamic-priority scheduling: each process has an initial priority that can decrease (if CPU-bound) or increase (if I/O-bound)
I A variable quantum (timeslice) is used, which can be consumed in several chunks.
I Timeslices are recomputed once all processes have consumed their whole timeslice.
I Preemptive scheduling, triggered when:
  I a new process arrives with a higher priority
  I the timeslice drops to zero
[Scale: timeslice as a function of priority: Min 5ms, Default 100ms, Max 800ms]
Moufida Maimour Systemes d’exploitation II (06/07) (108/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (81/355)
Linux 2.6: conventional processes

Computing the timeslice

    timeslice = (140 − staticP) × 20   if staticP < 120
                (140 − staticP) × 5    otherwise

where staticP is the static priority of the process.

Static priority   Nice value   Timeslice (ms)
100               -20          800
110               -10          600
120                 0          100
130               +10           50
139               +19            5
Moufida Maimour Systemes d’exploitation II (06/07) (109/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (82/355)
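The table can be checked against the formula with a tiny C function (a sketch of the computation, not kernel code):

```c
#include <assert.h>

/* timeslice in ms from the static priority (100..139) */
static int timeslice_ms(int staticP) {
    return (140 - staticP) * (staticP < 120 ? 20 : 5);
}
```

Each row of the table above is one call: high-priority (low-number) processes get timeslices up to 160 times longer than the nicest ones.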
Linux 2.6: conventional processes

Dynamic priority
I the number the scheduler actually uses to elect the next process to run:

    dynamicP = max(100, min(staticP − bonus + 5, 139))

I bonus ∈ [0..10]: a bonus < 5 amounts to a penalty
I the bonus depends on the history of the process ("average sleep time")
I a process is considered interactive if

    dynamicP ≤ 3 × staticP/4 + 28

  or, equivalently,

    bonus − 5 ≥ staticP/4 − 28 = interactiveDelta

Avg sleep time    bonus
0-100 ms          0
100-200 ms        1
200-300 ms        2
300-400 ms        3
400-500 ms        4
500-600 ms        5
600-700 ms        6
700-800 ms        7
800-900 ms        8
900-1000 ms       9
1 second          10
Moufida Maimour Systemes d’exploitation II (06/07) (110/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (83/355)
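The bonus and the clamped dynamic priority, as a C sketch (helper names are ours):

```c
#include <assert.h>

static int imax(int a, int b) { return a > b ? a : b; }
static int imin(int a, int b) { return a < b ? a : b; }

/* one bonus point per 100 ms of average sleep time, capped at 10 */
static int bonus(int avg_sleep_ms) {
    int b = avg_sleep_ms / 100;
    return b > 10 ? 10 : b;
}

/* dynamicP = max(100, min(staticP - bonus + 5, 139)) */
static int dynamicP(int staticP, int b) {
    return imax(100, imin(staticP - b + 5, 139));
}
```

A CPU hog at nice 0 (bonus 0) runs at dynamic priority 125; a heavy sleeper (bonus 9) at 116. Lower numbers win, so the sleeper is favored.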
Linux 2.6: conventional processes

To avoid starvation and to optimize the recomputation of timeslices, two lists are kept:
I active processes, which have not finished their timeslice
I expired processes, already served

Remark 1: real-time processes are always placed in the list of active processes.
Remark 2: the 2.6 scheduler finds the next process to run in constant time (O(1)), unlike the 2.4 one.

Reminder: list of ready processes, one per priority level, using the prio_array_t structure:
I int nr_active: number of descriptors in the lists,
I unsigned long[5]: bitmap; if a flag is 1, the corresponding list is non-empty,
I struct list_head[140] queue: the heads of the 140 priority lists.
Moufida Maimour Systemes d’exploitation II (06/07) (111/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (84/355)
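The O(1) pick relies on the bitmap: finding the highest-priority non-empty list is a find-first-set over the 140 bits, independent of the number of processes. A sketch (our naive loop; the kernel uses an optimized find-first-bit primitive, often a single machine instruction):

```c
#include <assert.h>

#define WORD_BITS (8 * (int)sizeof(unsigned long))

struct prio_bitmap { unsigned long w[5]; };  /* >= 140 bits */

static void mark_ready(struct prio_bitmap *bm, int prio) {
    bm->w[prio / WORD_BITS] |= 1UL << (prio % WORD_BITS);
}

/* lowest set bit = highest-priority non-empty runqueue list */
static int first_ready(const struct prio_bitmap *bm) {
    for (int i = 0; i < 5; i++)
        if (bm->w[i])
            for (int b = 0; b < WORD_BITS; b++)
                if (bm->w[i] & (1UL << b))
                    return i * WORD_BITS + b;
    return -1;  /* no runnable process */
}
```

Scheduling a process is then: read the bitmap, take the head of the corresponding list. Both steps cost the same whether 3 or 3000 processes are runnable.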
Linux 2.6: system call implementation

Reminder
[Figure: in user mode, the application program calls xyz() in the libc standard library; the wrapper routine issues SYSCALL; in kernel mode, the system_call handler dispatches to the system call service routine sys_xyz(), then returns to user mode with SYSEXIT]
Moufida Maimour Systemes d’exploitation II (06/07) (112/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (85/355)
Linux 2.6: system call implementation

Two invocation methods
1. Interrupt 0x80 (interrupt vector 128; Intel reserves vectors 32-238 for hardware interrupts). Uses the iret assembly instruction to return.
2. The sysenter and sysexit instructions, introduced with the Pentium II

The int 0x80 method
I At system startup, interrupt vector 128 is initialized with the address of the handler: set_system_gate(0x80, &system_call)
I The registers are saved
I The system call number is passed in the EAX register
I Arguments may also be passed (in registers)
I On error, a system call returns a negative value whose absolute value is the error code errno; errno itself is set by the wrapper routine.
I The saved registers are restored; control returns to the calling process and to user mode.
Moufida Maimour Systemes d’exploitation II (06/07) (113/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (86/355)
Linux 2.6: system call implementation

Assembly code

    ENTRY(system_call)
        pushl %eax              # save the syscall number
        SAVE_ALL
        movl $0xffffe000, %ebp  # locate thread_info from esp
        andl %esp, %ebp
        cmpl $(nr_syscalls), %eax
        jae syscall_badsys
    syscall_call:
        call *sys_call_table(0,%eax,4)
        movl %eax, 24(%esp)     # store the return value
    syscall_exit:
        cli
        movl 8(%ebp), %ecx
        testw $0xffff, %cx
    restore_all:
        RESTORE_ALL
Moufida Maimour Systemes d’exploitation II (06/07) (114/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (87/355) 5.18 Silberschatz, Galvin and Gagne ©2005Operating System Concepts – 7th Edition, Feb 2, 2005
Windows XP Scheduling
■ Windows XP schedules threads using a priority-based, preemptive scheduling algorithm.
■ The Windows XP scheduler ensures that the highest-priority thread will always run.
■ The portion of the Windows XP kernel that handles scheduling is called the dispatcher.
■ A thread selected to run by the dispatcher will run until it is preempted by a higher-priority thread, until it terminates, until its time quantum ends, or until it calls a blocking system call (such as an Input/Output operation)
■ If a higher-priority real-time thread becomes ready while a lower-priority thread is running, the lower-priority thread will be preempted.
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (88/355)
Windows XP Scheduling
■ There are two priority classes:
● The variable class contains threads having priorities from 1 to 15
● The real-time class contains threads with priorities from 16 to 31
● A single thread running at priority 0 is used for memory management
■ Each scheduling priority has a separate queue of the corresponding threads
■ The dispatcher uses a queue for each scheduling priority and traverses the set of queues from highest to lowest until it finds a thread that is ready to run.
■ If no ready thread is found, the dispatcher executes a special thread called the idle thread.
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (89/355)
Windows XP Priorities
[Table: the relative priorities within each class, against the priority classes]
By default, the base priority is the value of the Normal relative priority for the specific class
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (90/355)
Windows XP Priorities: Some Rules
■ Processes are typically members of the NORMAL_PRIORITY_CLASS.
■ A process belongs to this class unless its parent was of the IDLE_PRIORITY_CLASS, or unless another class was specified when the process was created.
■ The initial priority of a thread is typically the base priority of the process the thread belongs to.
■ When a thread's time quantum runs out, the thread is interrupted; if the thread is in the variable-priority class, its priority is lowered. However, the priority is never lowered below the base priority.
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (91/355)
Windows XP Priorities: Some Rules
■ Lowering the thread's priority tends to limit the CPU consumption of compute-bound threads.
■ When a variable-priority thread is released from a wait operation, the dispatcher boosts its priority. The amount of boost depends on what the thread was waiting for:
● A thread that was waiting for keyboard I/O gets a large increase
● A thread that was waiting for a disk operation gets a moderate increase
■ Windows XP distinguishes between the foreground process currently selected on the screen and the background processes that are not. When a process moves into the foreground, Windows XP increases its scheduling quantum by some factor, typically 3.
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (92/355)
Second Chapter
Process Handling
Introduction
Process Implementation
  Process Memory Layout
  Process Control Block
Process Scheduling: Theoretical Concepts
  Context Switching
  OS Scheduling Infrastructure
  Scheduling Algorithms
Scheduling in Real OSes
  UNIX: Solaris, HP-UX, 4.4BSD, Linux 2.6
  Windows XP
Process Creation
  UNIX
  Windows
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (93/355)
Process Creation

fork()
Under Unix, there is a separation between:
I creating a process (fork())
I executing a program (exec())

fork()
I duplicates the complete context of the parent to generate the child
I returns the pid of the created child to the parent
I returns 0 to the child
Moufida Maimour Systemes d’exploitation II (06/07) (66/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (94/355)
Process Creation

fork(): problems to solve
I Allocating resources for the child process:
  I system: entry in the process table, pid
  I memory: text, data, user stack, kernel stack and the user structure
I Creating an execution context for the child process from the parent's context
I Starting the new process:
  I double return of the fork() function
  I scheduling of the child process
Moufida Maimour Systemes d’exploitation II (06/07) (67/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (95/355)
Process Creation: fork()
[Figure: fork() returns in both the parent and the child; the child's code is either duplicated from, or shared with, the parent's code]
Moufida Maimour Systemes d’exploitation II (06/07) (68/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (96/355)
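The double return of fork() can be seen in a few lines of C (a minimal sketch; the helper name is ours):

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* fork() once: it returns 0 in the child and the child's pid in the parent */
static int run_child_and_get_status(void) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;               /* no child was created */
    if (pid == 0)
        _exit(42);               /* child: fork() returned 0 */
    int status;                  /* parent: fork() returned the child's pid */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The waitpid() call is what reaps the zombie: without it, the child would stay in the SZOMB state until its parent exits (cf. the exit(status) algorithm on the termination slide).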
Process Termination

Algorithm of exit(status)
I ignore signals
I reset the timers
I state = SZOMB
I close the files opened by this process
I decrement the counters in the system open-file table
I . . .
I free the virtual and physical memory, the U structure and the kernel stack
I remove the process from the queue of ready processes and put it in the queue of zombies
I have all the children of the process adopted by the init process
I store the "status" value in the (zombie) process structure
I send the SIGCHLD signal to the parent (which this signal will wake up)
I call the context-switching function
Moufida Maimour Systemes d’exploitation II (06/07) (69/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (97/355)
Process Creation on Windows
No exact Windows equivalent of fork() and exec()

Windows has the CreateProcess function
I Does both fork+exec in one step
I Creates a new process + loads the specified program into it
I Many more parameters than fork+exec
  I More precisely: 10
  I Ok to put NULL for most of them

Example

    #include <windows.h>
    #include <iostream>
    using namespace std;

    int main() {
        PROCESS_INFORMATION pi;        // Filled by CreateProcess
        STARTUPINFO si;                // Read by CreateProcess, ok to zero it
        ZeroMemory(&si, sizeof(si));
        si.cb = sizeof(si);
        char cmd[] = "toto.exe 5 10";  // command line must be writable
        if (!CreateProcess(NULL, cmd, NULL, NULL, TRUE, 0, NULL, NULL, &si, &pi))
            cerr << "CreateProcess failed." << endl;
        WaitForSingleObject(pi.hProcess, INFINITE); // Wait for process termination
        CloseHandle(pi.hProcess);                   // cleanups
        CloseHandle(pi.hThread);
        return 0;
    }
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (98/355)
CreateProcess Syntax

    BOOL CreateProcess(
        LPCTSTR lpApplicationName,    // pointer to name of executable module
        LPTSTR lpCommandLine,         // pointer to command line string
        LPSECURITY_ATTRIBUTES lpPA,   // process security attributes
        LPSECURITY_ATTRIBUTES lpTA,   // thread security attributes
        BOOL bInheritHandles,         // handle inheritance flag
        DWORD dwCreationFlags,        // creation flags
        LPVOID lpEnvironment,         // pointer to new environment block
        LPCTSTR lpCurrentDirectory,   // pointer to current directory name
        LPSTARTUPINFO lpStartupInfo,  // pointer to STARTUPINFO
        LPPROCESS_INFORMATION lpPI    // pointer to PROCESS_INFORMATION
    );

I Two ways to specify the program to start
  (first arg ; program location; second arg ; command line)
I Creation flags are combined with |
  I 0 ; run in the same window
  I CREATE_NEW_CONSOLE is useful;
  I specify priority, linkage to parent, etc.
I Structures pi and si used for process communication (how to start, basic info)
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (99/355)
Third Chapter
Memory Handling²
Hardware Memory Management
  Introduction
  Virtual Memory
  Segmentation
  Paging
  Examples: PDP-11, x86, MIPS and DEC Alpha
Swapping
Virtual Memory Operating System
Memory Allocation
2 Greatly inspired from David Mazieres course at Stanford.
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (100/355)
We want processes to coexist in memory
What about simply sharing memory between processes?
[Figure: physical memory shared directly between processes, with the OS, gcc, firefox and emacs stacked between 0x5000 and 0x9000]
What if...I emacs needs more memory than allocated?
I firefox needs more memory than exists on machine?
I gcc has a bug and writes into 0x6500?
I emacs does not use all its memory?
Other open question
I When does emacs learn that it runs at 0x5000? (at compile, link or run time)
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (101/355)
Issues in sharing physical memory
ProtectionI A bug in one process can corrupt memory in another
I Must somehow prevent process A from trashing B’s memory
I Also prevent A from even observing B's memory (ssh-agent contains secrets)
Transparency
I A process shouldn’t require particular memory locations
I Processes often require large amounts of contiguous memory (for stack, large data structures, etc.)
Resource exhaustion
I Programmers typically assume the machine has "enough" memory
I Sum of sizes of all processes often greater than physical memory
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (102/355)
Virtual Memory Goals
[Figure: the CPU issues loads and stores to virtual addresses; the MMU translation box checks whether each access is legal and translates it to a physical address in memory, possibly backed by disk]
Give each program its own ”virtual” address space
I At run time, relocate each load and store to its actual memory
I So app doesn’t care what physical memory it’s using
Also enforce protection
I Prevent one app from messing with another’s memory
And allow programs to see more memory than exists
I Somehow relocate some memory accesses to disk
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (103/355)
Virtual Memory Advantages
Can re-locate program while running
I Run partially in memory, partially on disk
Most of a process’s memory will be idle
I Think of the 80/20 rule
[Figure: physical memory holding Process 1 and Process 2, each made of busy and idle regions]
I Write idle parts to disk until needed
I Let other processes use memory for idle part
I Like CPU virtualization: when a process is not using the CPU, switch it out. When a process is not using a page, give that page to another process.
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (104/355)
Virtual Memory Implementation (1/2)
Challenge: VM = extra layer, could be slow
First Idea: Load-time linking
[Figure: static a.out at 0x1000 with a "jump 0x2000" relocated at load time into a.out' at 0x4000, whose jump becomes "jump 0x5000", placed above the OS in memory]
I Link as usual, but keep the list of references
I Fix up process when actually executed
I Determine where process will reside in memory
I Adjust all references within program (using addition)
Problems
I How to enforce protection
I How to move once in memory (Consider: data pointers)
I What if no contiguous free region fits program?
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (105/355)
Virtual Memory Implementation (2/2)
Challenge: VM = extra layer, could be slow
Better Idea: base+bound registers
[Figure: the same a.out relocated to 0x4000 above the OS, but now unchanged; base and bound registers perform the relocation at run time]
I Two special privileged registers: base and bound
I On each load/store:
I Physical address = virtual address + base register
I Check 0 <= virtual address < bound, else trap to kernel
I How to move process in memory?
I Change base register
I What happens on context switch?
I OS must re-load base and bound registers
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (106/355)
Virtual Memory Actual Implementation
DefinitionsI Programs load/store to virtual (or logical) addresses
I Actual memory uses physical (or real) addresses
[Figure: the CPU emits virtual addresses to the MMU, which emits physical addresses to memory]
Memory Management Unit (MMU)
I Usually part of CPU
I Accessed with privileged instructions (e.g., load bound registers)
I Translates from virtual to physical addresses
I Gives per-process view of memory called address space
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (107/355)
Address Space
[Figure: processes P1, P2 and P3 each see a virtual address space starting at 0; the MMU maps these views onto the physical addresses seen by the OS]
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (108/355)
Base and bound trade-offs
Advantages
I Cheap in terms of hardware: only two registers
I Cheap in terms of cycles: do add and compare in parallel
I Example: the Cray-1 used this scheme
Disadvantages
I Growing a process is expensive or impossible
I No way to share code or data (e.g., two copies of gcc)
[Figure: physical memory holding emacs, two copies of gcc, sh and free space]
One solution: Multiple segments per process
I E.g., separate code, stack, data segments
I Possibly multiple data segments
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (109/355)
Segmentation
Let processes have many base/bounds regs
I Address space built from many segments
I Can share/protect memory on segment granularity
Must specify segment as part of virtual address
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (110/355)
Segmentation mechanics
Implementation
I Each process has a segment table
I Each virtual address indicates a segment and offset:
I Top bits of addr select seg, low bits select offset (PDP-10)
I Seg selected by instruction or operand (pc selects text)
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (111/355)
Segmentation Example
2-bit segment number (1st hex digit), 12-bit offset (last 3 hex digits)
I Where is 0x0240? 0x1108? 0x265c? 0x3002? 0x1600?
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (112/355)
Segmentation trade-offs
Advantages
I Multiple segments per process
I Allows sharing (how?)
I Don’t need entire process in memory
Disadvantages
I Requires translation hardware, which could limit performance
I Segments not completely transparent to program (e.g., default segment faster or uses shorter instruction)
I An n-byte segment needs n contiguous bytes of physical memory
I Makes fragmentation a real problem.
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (113/355)
Fragmentation
What is it?I Inability to use free memory
Where does it come from?I Variable-sized pieces ; many small holes
(external fragmentation)
I Fixed-sized pieces ; no external holes, but forces internal waste (internal fragmentation)
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (114/355)
Alternatives to hardware MMU
Language-level protection (Java)
I Single address space for different modules
I Language enforces isolation
I Singularity OS does this (OS with type-checking and design by contract in place of hardware protection)
Software fault isolationI Instrument compiler output
I Checks before every store operation prevent modules from trashing each other
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (115/355)
Paging
Big Idea
I Divide memory up into small pages
I Map virtual pages to physical pages (each process has separate mapping)
Hardware gives control to OS on certain operations
I Read-only pages trap to OS on write
I Invalid pages trap to OS on read or write
I OS can change mapping and resume application
Other features sometimes found
I Hardware can set "accessed" and "dirty" bits
I Control page execute permission separately from read/write
I Control caching of page
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (116/355)
Paging trade-offs
Trade-offs
I Eliminates external fragmentation
I Simplifies allocation, free, and swap
I Internal fragmentation of 0.5 page per "segment" on average
Simplified Allocation
I Allocate any physical page to any process
I Can store idle virtual pages on disk
[Figure: gcc and emacs pages spread over physical memory and disk]
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (117/355)
Paging data structures
Pages are fixed size (typically 4K)
I Least significant 12 (log2 4K) bits of address are the page offset
I Most significant bits are the page number

Each process has a page table
I Maps Virtual Page Numbers to Physical Page Numbers
I Also includes bits for protection, validity, etc.

On memory access
I Translate virtual page number to physical page number, then add offset
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (118/355)
Example: Paging on PDP-11
64K virtual memory, 8K pages
I Separate address space for instructions & data
I I.e., can’t read your own instructions with a load
Entire page table stored in registers
I 8 Instruction page translation registers
I 8 Data page translations
I Drawback: must swap 16 machine registers on each context switch
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (119/355)
x86 Paging
Basics
I Normally 4KB pages
I Paging enabled by bits in a control register (%cr0)
I Only privileged OS code can manipulate control registers
I %cr3: points to 4KB page directory
I Page directory: 1024 PDEs (page directory entries)
I Each contains physical address of a page table
I Page table: 1024 PTEs (page table entries)
I Each contains physical address of a 4K page
I Page table covers 4 MB of virtual memory
Page Translation Mechanics
[Figure: %cr3 (PDBR) points to the page directory; the top 10 bits of the linear address select a directory entry, which gives the physical address of a page table; the next 10 bits select a page-table entry, which gives the physical address of a 4-KByte page; the low 12 bits are the offset within the page. Each table is aligned on a 4-Kbyte boundary; 1024 PDE x 1024 PTE = 2^20 pages = 4GB]
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (120/355)
x86 Page Directory Entry (4 Kb page)
[Figure: PDE layout. Bits 31-12: page table base address; 11-9: available for system programmer's use; 8: global page (ignored); 7: page size (0 indicates 4K); 6: reserved (set to 0); 5: accessed; 4: cache disabled; 3: write-through; 2: user/supervisor; 1: read/write; 0: present]
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (121/355)
x86 Page Table Entry (4 Kb page)
[Figure: PTE layout. Bits 31-12: page base address; 11-9: available for system programmer's use; 8: global page; 7: page table attribute index; 6: dirty; 5: accessed; 4: cache disabled; 3: write-through; 2: user/supervisor; 1: read/write; 0: present]
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (122/355)
Making Paging Fast
x86 paging translation requires 3 memory references per load/store
I Look up page table address in page directory
I Look up physical page number in page table
I Actually access physical page corresponding to virtual address
[Figure: two-level page translation walk, repeated from the Page Translation Mechanics slide]
Translation Lookaside Buffer (TLB)
I For speed, CPU caches recently used translations
I Typical: 64-2K entries, 4-way to fully associative, 95% hit rate
I Each entry maps virtual page number → PPN + protection information
I On each memory reference:
I Check TLB; if present, get physical address fast
I If not, walk page tables, insert translation in TLB for next time (must evict some entry)
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (123/355)
TLB details
TLB operates at CPU pipeline speed
⇒ small, fast
Complication
I What to do when switch address space?
I x86 solution: Flush TLB on context switch
I MIPS solution: Tag each entry with associated process’s ID
In general, OS must manually keep TLB valid
I e.g., x86 INVLPG instruction
I Invalidates a page translation in TLB
I Must execute after changing a possibly used page table entry
I Otherwise, hardware will miss page table change
I More complex on a multiprocessor (TLB shootdown)
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (124/355)
x86 Paging Extensions
PSE: Page Size Extension
I Setting bit 7 in PDE (and bit 4 of %cr4) makes a 4MB translation (no page table, direct translation)
I Useful for big chunks (less meta-data, but more internal fragmentation)
PAE: Physical Address Extensions
I Physical addresses are 36 bits (up to 64GB); virtual addresses still 32 bits (more 4GB apps per box)
I Three-level translation walk (table entries are 64bits)
[Figure: PAE translation. %cr3 points to a 4-entry page-directory-pointer table; bits 31-30 of the address select a directory pointer entry, bits 29-21 a 64-bit page directory entry, bits 20-12 a 64-bit page table entry, and bits 11-0 the offset in the 4K page]
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (125/355)
Long Mode PAE
CharacteristicsI Physical memory: Up to 1Tb currently (4Pb in future)
I Virtual memory: up to 256Tb currently (16Eb in future)
I Four-level translation walk
[Figure: long-mode translation. %cr3 points to the Page-Map Level-4 table; bits 47-39 of the address select a PML4E, bits 38-30 a PDPE in a page-directory-pointer table, bits 29-21 a PDE, bits 20-12 a PTE, and bits 11-0 the offset in the 4K page]
I Why are the upper 16 bits not used?
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (126/355)
Where do the OS live?
In its own address space?
I Can't do this on most hardware (e.g., syscall instruction won't switch address spaces)
I Also would make it harder to parse syscall arguments passed as pointers
So in the same address space as process
I Use protection bits to prohibit user code from writing kernel
Typically all kernel text, most data at same VA in every address space
I On x86, must manually set up page tables for this
I Usually just map the kernel contiguously, since the boot loader puts it into contiguous physical memory
I Some hardware puts physical memory (kernel-only) somewhere in virtualaddress space
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (127/355)
Example Memory Layout
[Figure: example layout. From 4 Gig down: kernel text & most data at 0xf000000 (mapping the first 256MB of physical memory), memory-mapped kernel data, the user stack below USTACKTOP, invalid memory, [mmaped regions] and the heap up to the break point, BSS, program data, and program text (read-only) down to 0]
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (128/355)
Very different MMUs exist
MIPSI Hardware has 64-entry TLB (references to addresses not in TLB trap to kernel)
I Each TLB entry has the following fields:Virtual page, Pid, Page frame, NC, D, V, Global
I Kernel itself unpaged
I All of physical memory contiguously mapped in high VM
I Kernel uses these pseudo-physical addresses
I User TLB fault handler very efficient
I Two hardware registers reserved for it
I utlb miss handler can itself fault, allowing paged page tables
I OS is free to choose page table format!
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (129/355)
Very different MMUs exist
DEC AlphaI Software managed TLB (like MIPS)
I 8KB, 64KB, 512KB, 4MB pages all available
I TLB supports 128 instruction/128 data entries of any size
I But TLB miss handler not part of OS
I Processor ships with special "PAL code" in ROM
I Processor-specific, but provides uniform interface to OS
I Basically firmware that runs from main memory like OS
I Various events vector directly to PAL code: CALL PAL instruction, TLB miss/fault, FP disabled
I PAL code runs in special privileged processor mode: interrupts always disabled; has access to special instructions and registers
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (130/355)
Paging to disk
Example of swapping
I gcc needs a new page of memory
I OS re-claims an idle page from emacs
I If page is clean (i.e., also stored on disk):
I E.g., page of text from emacs binary on disk
I Can always re-read same page from binary
I So okay to discard contents now & give page to gcc
I If page is dirty (meaning memory is the only copy):
I Must write page to disk first before giving to gcc
I Either way:
I Mark page invalid in emacs
I emacs will fault on next access to virtual page
I On fault, OS reads page data back from disk into new page, maps new page into emacs, resumes executing
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (131/355)
Third Chapter
Memory Handling
Hardware Memory Management
Introduction
Virtual Memory
Segmentation
Paging
Examples
PDP-11
x86
MIPS and DEC Alpha
Swapping
Virtual Memory Operating System
Memory Allocation
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (132/355)
Paging
• Use disk to simulate larger virtual than physical memory
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (133/355)
Working set model
• Disk much, much slower than memory
- Goal: Run at memory, not disk speeds
• 90/10 rule: 10% of memory gets 90% of memory refs
- So, keep that 10% in real memory, the other 90% on disk
- How to pick which 10%?
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (134/355)
Paging challenges
• How to resume a process after a fault?
- Need to save state and resume
- Process might have been in the middle of an instruction!
• What to fetch?
- Just needed page or more?
• What to eject?
- How to allocate physical pages amongst processes?
- Which of a particular proc’s pages to keep in memory?
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (135/355)
Re-starting instructions
• Hardware provides kernel w. info about page fault
- Faulting virtual address (e.g., in %cr2 reg on x86)
- Address of instruction that caused fault
- Was the access a read or write? Was it an instruction fetch?
Was it caused by user access to kernel-only memory?
• Hardware must allow resuming after a fault
• Idempotent instructions are easy
- E.g., simple load or store instruction can be restarted
- Just re-execute any instruction that only accesses one address
• Complex instructions must be re-started, too
- E.g., x86 move string instructions
- Specify src, dst, count in %esi, %edi, %ecx registers
- On fault, registers adjusted to resume where move left off
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (136/355)
What to fetch
• Bring in page that caused page fault
• Pre-fetch surrounding pages?
- Reading two disk blocks approximately as fast as reading one
- As long as no track/head switch, seek time dominates
- If application exhibits spatial locality, then big win to store and
read multiple contiguous pages
• Also pre-zero unused pages in idle loop
- Need 0-filled pages for stack, heap, anonymously mmapped
memory
- Zeroing them only on demand is slower
- So many OSes zero freed pages while CPU is idle
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (137/355)
Selecting physical pages
• May need to eject some pages
- More on eviction policy in two slides
• May also have a choice of physical pages
• Direct-mapped physical caches
- Virtual → Physical mapping can affect performance
- Applications can conflict with each other or themselves
- Scientific applications benefit if consecutive virtual pages do not
conflict in the cache
- Many other applications do better with random mapping
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (138/355)
Superpages
• How should OS make use of “large” mappings
- x86 has 2/4MB pages that might be useful
- Alpha has even more choices: 8KB, 64KB, 512KB, 4MB
• Sometimes more pages in L2 cache than TLB entries
- Don’t want costly TLB misses going to main memory
• Transparent superpage support [Navarro]
- “Reserve” appropriate physical pages if possible
- Promote contiguous pages to superpages
- Does complicate evicting (esp. dirty pages) – demote
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (139/355)
Straw man: FIFO eviction
• Evict oldest fetched page in system
• Example—reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
• 3 physical pages: 9 page faults
• 4 physical pages: 10 page faults
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (141/355)
Belady’s Anomaly
• More phys. mem. doesn’t always mean fewer faults
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (142/355)
Optimal page replacement
• What is optimal (if you knew the future)?
- Replace page that will not be used for longest period of time
• Example—reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
• With 4 physical pages:
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (144/355)
LRU page replacement
• Approximate optimal with least recently used
- Because past often predicts the future
• Example—reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
• With 4 physical pages: 8 page faults
• Problem 1: Can be pessimal – example?
- Looping over memory (then want MRU eviction)
• Problem 2: How to implement?
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (146/355)
Straw man LRU implementations
• Stamp PTEs with timer value
- E.g., CPU has cycle counter
- Automatically writes value to PTE on each page access
- Scan page table to find oldest counter value = LRU page
- Problem: Would double memory traffic!
• Keep doubly-linked list of pages
- On access remove page, place at tail of list
- Problem: again, very expensive
• What to do?
- Just approximate LRU, don’t try to do it exactly
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (147/355)
Clock algorithm
• Use accessed bit supported by most hardware
- E.g., Pentium will write 1 to A bit in PTE on first access
- Software managed TLBs like MIPS can do the same
• Do FIFO but skip accessed pages
• Keep pages in circular FIFO list
• Scan:
- page’s A bit = 1, set to 0 & skip
- else if A == 0, evict
• A.k.a. second-chance replacement
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (148/355)
Clock alg. (continued)
• Large memory may be a problem
- Most pages referenced in a long interval
• Add a second clock hand
- Leading edge clears A bits
- Trailing edge evicts pages with A=0
• Can also take advantage of hardware Dirty bit
- Each page can be (Unaccessed, Clean), (Unaccessed, Dirty),
(Accessed, Clean), or (Accessed, Dirty)
- Consider clean pages for eviction before dirty
• Or use n-bit accessed count instead of just the A bit
- On sweep: count = (A << (n-1)) | (count >> 1)
- Evict page with lowest count
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (149/355)
Other replacement algorithms
• Random eviction
- Dirt simple to implement
- Not overly horrible (avoids Belady & pathological cases)
• LFU (least frequently used) eviction
- instead of just A bit, count # times each page accessed
- least frequently accessed must not be very useful
(or maybe was just brought in and is about to be used)
- decay usage counts over time (for pages that fall out of usage)
• MFU (most frequently used) algorithm
- because page with the smallest count was probably just
brought in and has yet to be used
• Neither LFU nor MFU used very commonly
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (150/355)
Naïve paging
• Naïve paging requires 2 disk I/Os per page fault
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (151/355)
Page buffering
• Idea: reduce # of I/Os on the critical path
• Keep pool of free page frames
- On fault, still select victim page to evict
- But read fetched page into already free page
- Can resume execution while writing out victim page
- Then add victim page to free pool
• Can also yank pages back from free pool
- Contains only clean pages, but may still have data
- If page fault on page still in free pool, recycle
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (152/355)
Page allocation
• Allocation can be global or local
• Global allocation doesn’t consider page ownership
- E.g., with LRU, evict least recently used page of any proc
- Works well if P1 needs 20% of memory and P2 needs 70%:
- Doesn’t protect you from memory pigs
(imagine P2 keeps looping through array that is size of mem)
• Local allocation isolates processes (or users)
- Separately determine how much mem each proc should have
- Then use LRU/clock/etc. to determine which pages to evict
within each process
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (153/355)
Thrashing
• Thrashing: processes on system require more memory than it has
- Each time one page is brought in, another page, whose contents
will soon be referenced, is thrown out
- Processes will spend all of their time blocked, waiting for pages
to be fetched from disk
- I/O devs at 100% utilization but system not getting much
useful work done
• What we wanted: virtual memory the size of disk
with access time of physical memory
• What we have: memory with access time = disk
access
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (154/355)
Reasons for thrashing
• Process doesn't reuse memory, so caching doesn't
work (past != future)
• Process does reuse memory, but it does not “fit”
• Individually, all processes fit and reuse memory, but
too many for system
- At least this case is possible to address
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (155/355)
Multiprogramming & Thrashing
• Need to shed load when thrashing
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (156/355)
Dealing with thrashing
• Approach 1: working set
- Thrashing viewed from a caching perspective: given locality of
reference, how big a cache does the process need?
- Or: how much memory does process need in order to make
reasonable progress (its working set)?
- Only run processes whose memory requirements can be
satisfied
• Approach 2: page fault frequency
- Thrashing viewed as poor ratio of fetch to work
- PFF = page faults / instructions executed
- If PFF rises above threshold, process needs more memory
not enough memory on the system? Swap out.
- If PFF sinks below threshold, memory can be taken away
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (157/355)
Working sets
• Working set changes across phases
- Balloons during transitions
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (158/355)
Calculating the working set
• Working set: all pages proc. will access in next T time
- Can’t calculate without predicting future
• Approximate by assuming past predicts future
- So working set ≈ pages accessed in last T time
• Keep idle time for each page
• Periodically scan all resident pages in system
- A bit set? Clear it and clear the page’s idle time
- A bit clear? Add CPU consumed since last scan to idle time
- Working set is pages with idle time < T
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (159/355)
Two-level scheduler
• Divide processes into active & inactive
- Active – means working set resident in memory
- Inactive – working set intentionally not loaded
• Balance set: union of all active working sets
- Must keep balance set smaller than physical memory
• Use long-term scheduler
- Moves procs from active → inactive until balance set small
enough
- Periodically allows inactive to become active
- As working set changes, must update balance set
• Complications
- How to chose T?
- How to pick processes for active set
- How to count shared memory (e.g., libc)
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (160/355)
Some complications of paging
• What happens to available memory?
- Some physical memory tied up by kernel VM structures
• What happens to user/kernel crossings?
- More crossings into kernel
- Pointers in syscall arguments must be checked
• What happens to IPC?
- Must change hardware address space
- Increases TLB misses
- Context switch flushes TLB entirely on x86
(But not on MIPS. . . Why?)
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (161/355)
64-bit address spaces
• Straight hierarchical page tables not efficient
• Solution 1: Guarded page tables [Liedtke]
- Omit intermediary tables with only one entry
- Add predicate in high level tables, stating the only virtual
address range mapped underneath + # bits to skip
• Solution 2: Hashed page tables
- Store Virtual → Physical translations in hash table
- Table size proportional to physical memory
- Clustering makes this more efficient
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (162/355)
Typical virtual address space
[Figure: typical virtual address space. From 0 up: program text (read-only), program data, BSS, heap up to the breakpoint, invalid memory, user stack below USTACKTOP, invalid memory, then kernel memory up to 4 Gig]
• Dynamically allocated memory goes in heap
- Typically right above BSS (uninitialized data) section
• Top of heap called breakpoint
- Memory between breakpoint and stack is invalid
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (163/355)
Early VM system calls
• OS keeps "breakpoint" – top of heap
- Memory regions between breakpoint & stack fault
• char *brk(const char *addr);
- Set and return new value of breakpoint
• char *sbrk(int incr);
- Increment value of the breakpoint & return old value
• Can implement malloc in terms of sbrk
- But hard to "give back" physical memory to system
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (164/355)
Memory mapped files
[Figure: memory-mapped files placed between the heap (above the breakpoint) and the user stack below USTACKTOP, with kernel memory at 4 Gig]
• Other memory objects between heap and stack
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (165/355)
mmap system call
• void *mmap(void *addr, size_t len, int prot, int flags, int fd, off_t offset)
- Map file specified by fd at virtual address addr
- If addr is NULL, let kernel choose the address
• prot – protection of region
- OR of PROT_EXEC, PROT_READ, PROT_WRITE, PROT_NONE
• flags- MAP_ANON – anonymous memory (fd should be -1)
- MAP_PRIVATE – modifications are private
- MAP_SHARED – modifications seen by everyone
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (166/355)
More VM system calls
• int msync(void *addr, size_t len, int flags);
- Flush changes of mmapped file to backing store
• int munmap(void *addr, size_t len);
- Removes memory-mapped object
• int mprotect(void *addr, size_t len, int prot);
- Changes protection on pages to OR of PROT_...
• int mincore(void *addr, size_t len, char *vec);
- Returns in vec which pages are present
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (167/355)
Catching page faults
struct sigaction {
union { /* signal handler */
void (*sa_handler)(int);
void (*sa_sigaction)(int, siginfo_t *, void *);
};
sigset_t sa_mask; /* signal mask to apply */
int sa_flags;
};
int sigaction (int sig, const struct sigaction *act,
struct sigaction *oact)
• Can specify function to run on SIGSEGV
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (168/355)
Example: OpenBSD/i386 siginfo
struct sigcontext {
    int sc_gs; int sc_fs; int sc_es; int sc_ds;
    int sc_edi; int sc_esi; int sc_ebp; int sc_ebx;
    int sc_edx; int sc_ecx; int sc_eax;
    int sc_eip; int sc_cs;      /* instruction pointer */
    int sc_eflags;              /* condition codes, etc. */
    int sc_esp; int sc_ss;      /* stack pointer */
    int sc_onstack;             /* sigstack state to restore */
    int sc_mask;                /* signal mask to restore */
    int sc_trapno;
    int sc_err;
};
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (169/355)
4.4 BSD VM system
• Each process has a vmspace structure containing
- vm_map – machine-independent virtual address space
- vm_pmap – machine-dependent data structures
- statistics – e.g. for syscalls like getrusage ()
• vm_map is a linked list of vm_map_entry structs
- vm_map_entry covers contiguous virtual memory
- points to vm_object struct
• vm_object is source of data
- e.g. vnode object for memory mapped file
- points to list of vm_page structs (one per mapped page)
- shadow objects point to other objects for copy on write
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (170/355)
[Figure: a vmspace (vm_map, vm_pmap, stats) heads a list of vm_map_entry structs; each entry points to a vnode or shadow object, shadow objects chain to vnode objects, and each object holds its own vm_page structs]
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (171/355)
Pmap (machine-dependent) layer
• Pmap layer holds architecture-specific VM code
• VM layer invokes pmap layer
- On page faults to install mappings
- To protect or unmap pages
- To ask for dirty/accessed bits
• Pmap layer is lazy and can discard mappings
- No need to notify VM layer
- Process will fault and VM layer must reinstall mapping
• Pmap handles restrictions imposed by cache
– p. 38/40David Mazieres RSA (2008-2009) Chap 3: Memory Handling (172/355)
Example uses
• vm_map_entry structs for a process
- r/o text segment → file object
- r/w data segment → shadow object → file object
- r/w stack → anonymous object
• New vm_map_entry objects after a fork:
- Share text segment directly (read-only)
- Share data through two new shadow objects
(must share pre-fork but not post-fork changes)
- Share stack through two new shadow objects
• Must discard/collapse superfluous shadows
- E.g., when child process exits
– p. 39/40David Mazieres RSA (2008-2009) Chap 3: Memory Handling (173/355)
What happens on a fault?
• Traverse vm_map_entry list to get appropriate entry
- No entry? Protection violation? Send process a SIGSEGV
• Traverse list of [shadow] objects
• For each object, traverse vm_page structs
• Found a vm_page for this object?
- If first vm_object in chain, map page
- If read fault, install page read only
- Else if write fault, install copy of page
• Else get page from object
- Page in from file, zero-fill new page, etc.
– p. 40/40David Mazieres RSA (2008-2009) Chap 3: Memory Handling (174/355)
Third Chapter
Memory Handling
Hardware Memory Management
  Introduction
  Virtual Memory
  Segmentation
  Paging
  Examples: PDP-11, x86, MIPS and DEC Alpha
Swapping
Virtual Memory Operating System
Memory Allocation
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (175/355)
Dynamic memory allocation
• Almost every useful program uses it
- Gives wonderful functionality benefits
- Don’t have to statically specify complex data structures
- Can have data grow as a function of input size
- Allows recursive procedures (stack growth)
- But, can have a huge impact on performance
• Today: how to implement it
• Some interesting facts:
- A two- or three-line code change can have a huge, non-obvious
impact on how well the allocator works (examples to come)
- Proven: impossible to construct an "always good" allocator
- Surprising result: after 35 years, memory management still
poorly understood
– p. 2/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (176/355)
Why is it hard?
• Satisfy an arbitrary sequence of allocations and frees.
• Easy without free: set a pointer to the beginning of
some big chunk of memory (“heap”) and increment
on each allocation:
• Problem: free creates holes (“fragmentation”). Result:
Lots of free space, but cannot satisfy the request!
– p. 3/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (177/355)
More abstractly
• What an allocator must do:
- Track which parts of memory in use, which parts are free.
- Ideal: no wasted space, no time overhead.
• What the allocator cannot do:
- Control the order, number, or size of requested blocks.
- Can’t move blocks once the user has pointers ⇒ (bad) placement decisions are permanent.
• The core fight: minimize fragmentation
- App frees blocks in any order, creating holes in “heap”.
- Holes too small? cannot satisfy future requests.
– p. 4/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (178/355)
What is fragmentation really?
• Inability to use memory that is free
• Two causes
- Different lifetimes—if adjacent objects die at different times,
then fragmentation:
- If they die at the same time, then no fragmentation:
- Different sizes: If all requests the same size, then no
fragmentation (paging artificially creates this):
– p. 5/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (179/355)
Important decisions
• Placement choice: where in free memory to put a
requested block?
- Freedom: can select any memory in the heap
- Ideal: put block where it won’t cause fragmentation later.
(impossible in general: requires future knowledge)
• Splitting free blocks to satisfy smaller requests
- Fights internal fragmentation.
- Freedom: can choose any larger block to split.
- One way: choose block with smallest remainder (best fit).
• Coalescing free blocks to yield larger blocks
- Freedom: when coalescing is done (deferring can be good); fights
external fragmentation.
– p. 6/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (180/355)
Impossible to “solve” fragmentation
• If you read allocation papers to find the best allocator
- All discussions revolve around tradeoffs.
- The reason? There cannot be a best allocator.
• Theoretical result:
- For any possible allocation algorithm, there exist streams of
allocation and deallocation requests that defeat the allocator
and force it into severe fragmentation.
• What is bad?
- Good allocator: requires gross memory M · log(n_max/n_min), where
M = bytes of live data, n_min = smallest allocation, n_max = largest
- Bad allocator: M · (n_max/n_min)
– p. 7/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (181/355)
Pathological examples
• Given allocation of 7 20-byte chunks
- What’s a bad stream of frees and then allocates?
• Given 100 bytes of free space
- What’s a really bad combination of placement decisions and
malloc & frees?
• Next: two allocators (best fit, first fit) that, in practice, work pretty well.
- “pretty well” = ∼20% fragmentation under many workloads
– p. 8/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (182/355)
Best fit
• Strategy: minimize fragmentation by allocating space from the block that leaves the smallest fragment
- Data structure: heap is a list of free blocks, each has a header
holding block size and a pointer to the next free block
- Code: Search freelist for block closest in size to the request.
(Exact match is ideal)
- During free (usually) coalesce adjacent blocks
• Problem: Sawdust
- Remainder so small that over time left with “sawdust”
everywhere.
- Fortunately not a problem in practice.
– p. 9/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (183/355)
Best fit gone wrong
• Simple bad case: allocate n, m (m < n) in alternating
order, free all the ms, then try to allocate an m + 1.
• Example: start with 100 bytes of memory
- alloc 19, 21, 19, 21, 19
- free 19, 19, 19:
- alloc 20? Fails! (wasted space = 57 bytes)
• However, doesn’t seem to happen in practice (though
the way real programs behave suggest it easily could)
– p. 10/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (184/355)
First fit
• Strategy: pick the first block that fits
- Data structure: free list, sorted LIFO, FIFO, or by address
- Code: scan list, take the first one.
• LIFO: put free object on front of list.
- Simple, but causes higher fragmentation
• Address sort: order free blocks by address.
- Makes coalescing easy (just check if next block is free)
- Also preserves empty/idle space (locality good when paging)
• FIFO: put free object at end of list.
- Gives similar fragmentation as address sort, but unclear why
– p. 11/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (185/355)
Subtle pathology: LIFO FF
• Storage management example of subtle impact of
simple decisions
• LIFO first fit seems good:
- Put object on front of list (cheap), hope same size used again
(cheap + good locality).
• But, has big problems for simple allocation patterns:
- Repeatedly intermix short-lived large allocations, with
long-lived small allocations.
- Each time large object freed, a small chunk will be quickly
taken. Pathological fragmentation.
– p. 12/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (186/355)
First fit: Nuances
• First fit + address order in practice:
- Blocks at front preferentially split, ones at back only split when
no larger one found before them
- Result? Seems to roughly sort free list by size
- So? Makes first fit operationally similar to best fit: a first fit of a
sorted list = best fit!
• Problem: sawdust at beginning of the list
- Sorting of list forces large requests to skip over many small
blocks. Need to use a scalable heap organization
• When better than best fit?
- Suppose memory has free blocks:
- Suppose allocation ops are 10 then 20 (best fit best)
- Suppose allocation ops are 8, 12, then 12 (first fit best)
– p. 13/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (187/355)
First/best fit: weird parallels
• Both seem to perform roughly equivalently
• In fact the placement decisions of both are roughlyidentical under both randomized and real workloads!
- No one knows why.
- Pretty strange since they seem pretty different.
• Possible explanations:
- First fit like best fit because over time its free list becomes
sorted by size: the beginning of the free list accumulates small
objects and so fits tend to be close to best.
- Both have implicit “open space heuristic”: try not to cut into
large open spaces: large blocks at end only used when they have to
be (e.g., first fit skips over all smaller blocks).
– p. 14/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (188/355)
Some worse ideas
• Worst-fit:
- Strategy: fight against sawdust by splitting blocks to maximize
leftover size
- In real life seems to ensure that no large blocks around.
• Next fit:
- Strategy: use first fit, but remember where we found the last
thing and start searching from there.
- Seems like a good idea, but tends to break down entire list.
• Buddy systems:
- Round up allocations to power of 2 to make management faster.
- Result? Heavy internal fragmentation.
– p. 15/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (189/355)
Slab allocation
• Kernel allocates many instances of same structures
- E.g., a 1.7 KB task_struct for every process on system
• Often want contiguous physical memory (for DMA)
• Slab allocation optimizes for this case:
- A slab is multiple pages of contiguous physical memory
- A cache contains one or more slabs
- Each cache stores only one kind of object (fixed size)
• Each slab is full, empty, or partial
• E.g., need new task_struct?
- Look in the task_struct cache
- If there is a partial slab, pick free task_struct in that
- Else, use empty, or may need to allocate new slab for cache
• Advantages: speed, and no internal fragmentation
– p. 16/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (190/355)
Known patterns of real programs
• So far we’ve treated programs as black boxes.
• Most real programs exhibit 1 or 2 (or all 3) of the
following patterns of alloc/dealloc:
- ramps: accumulate data monotonically over time
- peaks: allocate many objects, use briefly, then free all
- plateaus: allocate many objects, use for a long time
– p. 17/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (191/355)
Pattern 1: ramps
• In a practical sense: ramp = no free!
- Implication for fragmentation?
- What happens if you evaluate allocator with ramp programs
only?
– p. 18/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (192/355)
Pattern 2: peaks
• Peaks: allocate many objects, use briefly, then free all
- Fragmentation a real danger.
- Interleave peak & ramp? Interleave two different peaks?
- What happens if peak allocated from contiguous memory?
– p. 19/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (193/355)
Exploiting peaks
• Peak phases: alloc a lot, then free everything
- So have new allocation interface: alloc as before, but only
support free of everything.
- Called “arena allocation”, “obstack” (object stack), or
procedure call (by compiler people).
• arena = a linked list of large chunks of memory.
- Advantages: alloc is a pointer increment, free is “free”.
No wasted space for tags or list pointers.
– p. 20/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (194/355)
Pattern 3: Plateaus
• Plateaus: allocate many objects, use for a long time
- what happens if overlap with peak or different plateau?
– p. 21/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (195/355)
Fighting fragmentation
• Segregation = reduced fragmentation:
- Allocated at same time ∼ freed at same time
- Different type ∼ freed at different time
• Implementation observations:
- Programs allocate small number of different sizes.
- Fragmentation at peak use more important than at low use.
- Most allocations small (< 10 words)
- Work done with allocated memory increases with size.
- Implications?
– p. 22/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (196/355)
Simple, fast segregated free lists
• Array of free lists for small sizes, tree for larger
- Place blocks of same size on same page.
- Have count of allocated blocks: if goes to zero, can return page
• Pro: segregate sizes, no size tag, fast small alloc
• Con: worst case waste: 1 page per size even w/o free,
after pessimal free waste 1 page per object
– p. 23/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (197/355)
Typical space overheads
• Free list bookkeeping + alignment determine
minimum allocatable size:
- Store size of block.
- Pointers to next and previous freelist element.
- Machine enforced overhead: alignment. Allocator doesn’t
know type. Must align memory to conservative boundary.
- Minimum allocation unit? Space overhead when allocated?
– p. 24/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (198/355)
Getting more space from OS
• On Unix, can use sbrk
- E.g., to activate a new zero-filled page:
• For large allocations, sbrk a bad idea
- May want to give memory back to OS
- Can’t with sbrk unless the big chunk was the last thing allocated
- So allocate large chunks using mmap’s MAP_ANON
– p. 25/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (199/355)
Faults + resumption = power
• Resuming after fault lets us emulate many things
- “every problem can be solved with layer of indirection”
• Example: sub-page protection
• To protect sub-page region in paging system:
- Set entire page to weakest permission; record in PT
- Any access that violates perm will cause an access fault
- Fault handler checks if page special, and if so, if access allowed.
Continue or raise error, as appropriate
– p. 26/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (200/355)
More fault resumption examples
• Emulate accessed bits:
- Set page permissions to “invalid”.
- On any access will get a fault: Mark as accessed
• Avoid save/restore of FP registers
- Make first FP operation fault to detect usage
• Emulate non-existent instructions:
- Give inst an illegal opcode; OS fault handler detects and
emulates fake instruction
• Run OS on top of another OS!
- Slam OS into normal process
- When it does something “privileged,” the real
OS gets woken up with a fault.
- If op allowed, do it, otherwise kill.
- IBM’s VM/370. VMware (sort of)
– p. 27/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (201/355)
Not just for kernels
• User-level code can resume after faults, too
• mprotect – protects memory
• sigaction – catches signal after page fault
- Return from signal handler restarts faulting instruction
• Many applications detailed by Appel & Li
• Example: concurrent snapshotting of process
- Mark all of the process’s memory read-only with mprotect
- One thread starts writing all of memory to disk
- Other thread keeps executing
- On fault – write that page to disk, make writable, resume
– p. 28/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (202/355)
Distributed shared memory
• Virtual memory allows us to go to memory or disk
- But, can use the same idea to go anywhere! Even to another
computer. Page across network rather than to disk. Faster, and
allows network of workstations (NOW)
– p. 29/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (203/355)
Persistent stores
• Idea: Objects that persist across program invocations
- E.g., object-oriented database; useful for CAD/CAM type apps
• Achieve by memory-mapping a file
• But only write changes to file at end if commit
- Use dirty bits to detect which pages must be written out
- Or with mprotect/sigaction emulated dirty bits on write faults
• On 32-bit machine, store can be larger than memory
- But single run of program won’t access > 4GB of objects
- Keep mapping between 32-bit mem ptrs and 64-bit disk offsets
- Use faults to bring in pages from disk as necessary
- After reading page, translate pointers—known as swizzling
– p. 30/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (204/355)
Garbage collection
• In safe languages, run time knows about all pointers
- So can move an object if you change all the pointers
• What memory locations might a program access?
- Any objects whose pointers are currently in registers
- Recursively, any pointers in objects it might access
- Anything else is unreachable, or garbage; memory can be re-used
• Example: stop-and-copy garbage collection
- Memory full? Temporarily pause program, allocate new heap
- Copy all objects pointed to by registers into new heap
- Mark old copied objects as copied, record new location
- Start scanning through new heap. For each pointer:
- Copied already? Adjust pointer to new location
- Not copied? Then copy it and adjust pointer
- Free old heap—program will never access it—and continue
– p. 31/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (205/355)
Concurrent garbage collection
• Idea: Stop & copy, but without the stop
- Mutator thread runs program, collector concurrently does GC
• When collector invoked:
- Protect from-space & unscanned to-space from mutator
- Copy objects in registers into to-space, resume mutator
- All pointers in scanned to-space point into to-space
- If mutator accesses unscanned area: fault, scan page, resume
[Figure: from-space, and to-space split into a scanned area and an unscanned area; the mutator faults on access to unscanned pages.]
– p. 32/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (206/355)
Heap overflow detection
• Many GCed languages need fast allocation
- E.g., in lisp, constantly allocating cons cells
- Allocation can be as often as every 50 instructions
• Fast allocation is just to bump a pointer
char *next_free;
char *heap_limit;

void *alloc (unsigned size) {
    if (next_free + size > heap_limit)   /* 1 */
        invoke_garbage_collector ();     /* 2 */
    char *ret = next_free;
    next_free += size;
    return ret;
}
• But would be even faster to eliminate lines 1 & 2!
– p. 33/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (207/355)
Heap overflow detection 2
• Mark page at end of heap inaccessible
- mprotect (heap_limit, PAGE_SIZE, PROT_NONE);
• Program will allocate memory beyond end of heap
• Program will use memory and fault
- Note: Depends on specifics of language
- But many languages will touch allocated memory immediately
• Invoke garbage collector
- Must now put just allocated object into new heap
• Note: requires more than just resumption
- Faulting instruction must be resumed
- But must resume with different target virtual address
- Doable on most architectures since GC updates registers
– p. 34/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (208/355)
Reference counting
• Seemingly simpler GC scheme:
- Each object has “ref count” of pointers to it
- Increment when pointer set to it
- Decrement when pointer killed
- ref count == 0? Free object
• Works well for hierarchical data structures
- E.g., pages of physical memory
– p. 35/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (209/355)
Reference counting pros/cons
• Circular data structures always have ref count > 0
- No external pointers means lost memory
• Can do manually w/o PL support, but error-prone
• Potentially more efficient than real GC
- No need to halt program to run collector
- Avoids weird unpredictable latencies
• Potentially less efficient than real GC
- With real GC, copying a pointer is cheap
- With reference counting, must write ref count each time
– p. 36/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (210/355)
Fourth Chapter
I/O subsystem³
  Disks
    I/O subsystem of the OS
    Disk Control Algorithms
  Files and directories
    Basics
    Consistency and Resilience
³From David Mazieres’ course at Stanford.
Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (211/355)
Memory and I/O buses
[Figure: CPU and memory connected through a crossbar; an I/O bus (1880 Mbps and 1056 Mbps links) hangs off it for devices.]
• CPU accesses physical memory over a bus
• Devices access memory over I/O bus with DMA
• Devices can appear to be a region of memory
– p. 1/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (212/355)
Realistic PC architecture
[Figure: CPUs on the front-side bus to the North Bridge, which connects main memory and the AGP bus; the South Bridge hangs off the PCI bus and connects the ISA bus, USB, and IRQ lines routed through the I/O APIC (Advanced Programmable Interrupt Controller).]
– p. 2/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (213/355)
What is memory?
• SRAM – Static RAM
- Like two NOT gates circularly wired input-to-output
- 4–6 transistors per bit, actively holds its value
- Very fast, used to cache slower memory
• DRAM – Dynamic RAM
- A capacitor + gate, holds charge to indicate bit value
- 1 transistor per bit – extremely dense storage
- Charge leaks—need slow comparator to decide if bit 1 or 0
- Must re-write charge after reading, and periodically refresh
• VRAM – “Video RAM”
- Dual ported, can write while someone else reads
– p. 3/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (214/355)
What is I/O bus? E.g., PCI
– p. 4/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (215/355)
Communicating with a device
• Memory-mapped device registers
- Certain physical addresses correspond to device registers
- Load/store gets status/sends instructions – not real memory
• Device memory – device may have memory OS can
write to directly on other side of I/O bus
• Special I/O instructions
- Some CPUs (e.g., x86) have special I/O instructions
- Like load & store, but asserts special I/O pin on CPU
- OS can allow user-mode access to I/O ports with finer
granularity than page
• DMA – place instructions to card in main memory
- Typically then need to “poke” card by writing to register
- Overlaps unrelated computation with moving data over
(typically slower than memory) I/O bus
– p. 5/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (216/355)
DMA buffers
[Figure: a buffer descriptor list (entry lengths 100, 1400, 1500, 1500, 1500, ... bytes) pointing to scattered memory buffers in main memory.]
• Include list of buffer locations in main memory
• Card reads list then accesses buffers (w. DMA)
- Allows for scatter/gather I/O
– p. 6/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (217/355)
Example: Network Interface Card
[Figure: host connected over the I/O bus to the adaptor; the adaptor’s bus interface faces the host, its link interface faces the network link.]
• Link interface talks to wire/fiber/antenna
- Typically does framing, link-layer CRC
• FIFOs on card provide small amount of buffering
• Bus interface logic uses DMA to move packets to and
from buffers in main memory
– p. 7/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (218/355)
Example: IDE disk with DMA
– p. 8/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (219/355)
Driver architecture
• Device driver provides several entry points to kernel
- Reset, ioctl, output, interrupt, read, write, strategy . . .
• How should driver synchronize with card?
- E.g., Need to know when transmit buffers free or packets arrive
- Need to know when disk request complete
• One approach: Polling
- Sent a packet? Loop asking card when buffer is free
- Waiting to receive? Keep asking card if it has packet
- Disk I/O? Keep looping until disk ready bit set
• Disadvantages of polling
- Can’t use CPU for anything else while polling
- Or schedule poll in future and do something else, but then high
latency to receive packet or process disk block
– p. 9/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (220/355)
Interrupt driven devices
• Instead, ask card to interrupt CPU on events
- Interrupt handler runs at high priority
- Asks card what happened (xmit buffer free, new packet)
- This is what most general-purpose OSes do
• Bad under high network packet arrival rate
- Packets can arrive faster than OS can process them
- Interrupts are very expensive (context switch)
- Interrupt handlers have high priority
- In worst case, can spend 100% of time in interrupt handler and
never make any progress – receive livelock
- Best: Adaptive switching between interrupts and polling
• Very good for disk requests
• Rest of today: Disks (network devices in 1.5 weeks)
– p. 10/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (221/355)
Anatomy of a disk
• Stack of magnetic platters
- Rotate together on a central spindle @3,600-15,000 RPM
- Drive speed drifts slowly over time
- Can’t predict rotational position after 100-200 revolutions
• Disk arm assembly
- Arms rotate around pivot, all move together
- Pivot offers some resistance to linear shocks
- Arms contain disk heads–one for each recording surface
- Heads read and write data to platters
– p. 11/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (222/355)
Disk
– p. 12/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (223/355)
Storage on a magnetic platter
• Platters divided into concentric tracks
• A stack of tracks of fixed radius is a cylinder
• Heads record and sense data along cylinders
- Significant fractions of encoded stream for error correction
• Generally only one head active at a time
- Disks usually have one set of read-write circuitry
- Must worry about cross-talk between channels
- Hard to keep multiple heads exactly aligned
– p. 13/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (226/355)
Cylinders, tracks, & sectors
– p. 14/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (227/355)
Disk positioning system
• Move head to specific track and keep it there
- Resist physical shocks, imperfect tracks, etc.
• A seek consists of up to four phases:
- speedup–accelerate arm to max speed or halfway point
- coast–at max speed (for long seeks)
- slowdown–stops arm near destination
- settle–adjusts head to actual desired track
• Very short seeks dominated by settle time (∼1 ms)
• Short (200-400 cyl.) seeks dominated by speedup
- Accelerations of 40g
– p. 15/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (228/355)
Seek details
• Head switches comparable to short seeks
- May also require head adjustment
- Settles take longer for writes than reads
• Disk keeps table of pivot motor power
- Maps seek distance to power and time
- Disk interpolates over entries in table
- Table set by periodic “thermal recalibration”
- 500 ms recalibration every 25 min, bad for AV
• “Average seek time” quoted can be many things
- Time to seek 1/3 of the disk, 1/3 of the time to seek the whole disk, . . .
– p. 16/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (229/355)
Sectors
• Disk interface presents linear array of sectors
- Generally 512 bytes, written atomically
• Disk maps logical sector #s to physical sectors
- Zoning–puts more sectors on longer tracks
- Track skewing–sector 0 pos. varies by track (sequential access speed)
- Sparing–flawed sectors remapped elsewhere
• OS doesn’t know logical to physical sector mapping
- Larger logical sector # difference means larger seek
- Highly non-linear relationship (and depends on zone)
- OS has no info on rotational positions
- Can empirically build table to estimate times
– p. 17/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (231/355)
Disk interface
• Controls hardware, mediates access
• Computer, disk often connected by bus (e.g., SCSI)
- Multiple devices may contend for bus
• Possible disk/interface features:
• Disconnect from bus during requests
• Command queuing: Give disk multiple requests
- Disk can schedule them using rotational information
• Disk cache used for read-ahead
- Otherwise, sequential reads would incur whole revolution
- Cross track boundaries? Can’t stop a head-switch
• Some disks support write caching
- But data not stable–not suitable for all requests
– p. 18/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (232/355)
Disk performance
• Placement & ordering of requests a huge issue
- Sequential I/O much, much faster than random
- Long seeks much slower than short ones
- Power might fail any time, leaving inconsistent state
• Must be careful about order for crashes
- More on this in next two lectures
• Try to achieve contiguous accesses where possible
- E.g., make big chunks of individual files contiguous
• Try to order requests to minimize seek times
- OS can only do this if it has multiple requests to order
- Requires disk I/O concurrency
- High-performance apps try to maximize I/O concurrency
• Next: How to schedule concurrent requests
– p. 23/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (233/355)
Scheduling: FCFS
• “First Come First Served”
- Process disk requests in the order they are received
• Advantages
- Easy to implement
- Good fairness
• Disadvantages
- Cannot exploit request locality
- Increases average latency, decreasing throughput
– p. 24/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (235/355)
Shortest positioning time first (SPTF)
- Always pick request with shortest seek time
• Advantages
- Exploits locality of disk requests
- Higher throughput
• Disadvantages
- Starvation
- Don’t always know what request will be fastest
• Improvement: Aged SPTF
- Give older requests higher priority
- Adjust “effective” seek time with weighting factor:
Teff = Tpos − W · Twait
• Also called Shortest Seek Time First (SSTF)
– p. 25/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (237/355)
“Elevator” scheduling (SCAN)
• Sweep across disk, servicing all requests passed
- Like SPTF, but next seek must be in same direction
- Switch directions only if no further requests
• Advantages
- Takes advantage of locality
- Bounded waiting
• Disadvantages
- Cylinders in the middle get better service
- Might miss locality SPTF could exploit
• CSCAN: Only sweep in one direction
Very commonly used algorithm in Unix
• Also called LOOK/CLOOK in textbook
- (Textbook uses [C]SCAN to mean scan entire disk uselessly)
– p. 26/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (239/355)
VSCAN(r)
• Continuum between SPTF and SCAN
- Like SPTF, but uses a slightly different “effective” positioning time
If request in same direction as previous seek: Teff = Tpos
Otherwise: Teff = Tpos + r · Tmax
- when r = 0, get SPTF, when r = 1, get SCAN
- E.g., r = 0.2 works well
• Advantages and disadvantages
- Those of SPTF and SCAN, depending on how r is set
– p. 27/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (240/355)
CS 140 Lecture: files and directories
Dawson Engler Stanford CS department
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (241/355)
File system fun
◆ File systems = the hardest part of OS
– More papers on FSes than any other single topic
◆ Main tasks of file system:
– don’t go away (ever)
– associate bytes with names (files)
– associate names with each other (directories)
– Can implement file systems on disk, over network, in
memory, in non-volatile RAM (NVRAM), on tape, w/ paper.
– We’ll focus on disk and generalize later
◆ Today: files and directories + a bit of speed.
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (242/355)
The medium is the message
◆ Disk = First thing we’ve seen that doesn’t go away (survives a crash, unlike memory)
– So: where everything important lives. Failure matters.
◆ Slow (ms access vs ns for memory)
◆ Huge (100x bigger than memory)
– How to organize large collection of ad hoc information?
Taxonomies! (Basically FS = general way to make these)
◆ And the gap widens: processor speed ~2x/yr, disk access time ~7%/yr
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (243/355)
Memory vs. Disk

                    Disk                            Memory
Smallest write      sector                          (usually) bytes
Atomic write        sector                          byte, word
Access time         ~10ms (not on a good curve)     nanosecs (faster all the time)
Sequential access   ~20MB/s                         200-1000MB/s
Uniformity          NUMA                            UMA
Crash?              contents not gone               contents gone (“volatile”)
                    (“non-volatile”);               lose + start over = ok
                    lose? corrupt? not ok
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (244/355)
Some useful facts
◆ Disk reads/writes in terms of sectors, not bytes
– read/write single sector or adjacent groups
◆ How to write a single byte? “Read-modify-write”
– read in sector containing the byte
– modify that byte
– write entire sector back to disk
– key: if cached, don’t need to read in
◆ Sector = unit of atomicity.
– sector write done completely, even if crash in middle
» (disk saves up enough momentum to complete)
– larger atomic units have to be synthesized by OS
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (245/355)
The equation that ruled the world.
◆ Approximate time to get data:
seek time(ms) + rotational delay(ms) + bytes / disk bandwidth
◆ So?
– Each touch of disk = tens of ms.
– Touch 50-100 times = 1 *second*
– Can do *billions* of ALU ops in same time.
◆ This fact = Huge social impact on OS research
– Most pre-2000 research based on speed.
– Publishable speedup = ~30%
– Easy to get > 30% by removing just a few accesses.
– Result: more papers on FSes than any other single topic
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (246/355)
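The equation is worth evaluating once. The constants here (10ms seek, 8ms rotational delay, 10MB/s bandwidth) are the illustrative numbers used later in these slides, not properties of any particular disk:

```python
def access_ms(nbytes, seek_ms=10, rot_ms=8, bw_bytes_per_ms=10_000):
    """The slide's equation: seek + rotational delay + transfer.
    Bandwidth of 10MB/s = 10,000 bytes/ms (illustrative values)."""
    return seek_ms + rot_ms + nbytes / bw_bytes_per_ms

one_sector = access_ms(512)         # ~18.05 ms: almost all positioning
big_read = access_ms(50 * 1024)     # ~23.1 ms: 100x the data, ~1.3x the cost
assert round(one_sector, 1) == 18.1
assert big_read < 2 * one_sector
```

The lesson the slide draws falls straight out: positioning dominates, so the wins come from touching the disk fewer times and moving more bytes per touch.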
Files: named bytes on disk
◆ File abstraction:
– user’s view: named sequence of bytes
– FS’s view: collection of disk blocks
– file system’s job: translate name & offset to disk blocks
◆ File operations:
– create a file, delete a file
– read from file, write to file
◆ Want: operations to have as few disk accesses as possible & have minimal space overhead
[diagram: “foo.c” + offset:int → disk addr:int]
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (247/355)
What’s so hard about grouping blocks???
◆ In some sense, the problems we will look at are no different than those in virtual memory
– like page tables, file system meta data are simply data structures used to construct mappings.
– Page table: map virtual page # to physical page #
– file meta data: map byte offset to disk block address
– directory: map name to disk address or file #
[diagram: page table 28 → 33; Unix inode 418 → 8003121; directory foo.c → 44]
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (248/355)
FS vs VM
◆ In some ways problem similar:
– want location transparency, oblivious to size, & protection
◆ In some ways the problem is easier:
– CPU time to do FS mappings not a big deal (= no TLB)
– Page tables deal with sparse address spaces and random access, files are dense (0 .. filesize-1) & ~sequential
◆ In some ways problem is harder:
– Each layer of translation = potential disk access
– Space a huge premium! (But disk is huge?!?!) Reason? Cache space never enough; the amount of data you can get into one fetch never enough.
– Range very extreme: many <10k, some more than GB.
– Implications?
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (249/355)
Problem: how to track file’s data?
◆ Disk management:
– Need to keep track of where file contents are on disk
– Must be able to use this to map byte offset to disk block
◆ Things to keep in mind while designing file structure:
– Most files are small
– Much of the disk is allocated to large files
– Many of the I/O operations are made to large files
– Want good sequential and good random access (what do these require?)
◆ Just like VM: data structures recapitulate cs107
– Arrays, linked lists, trees (of arrays), hash tables.
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (250/355)
Simple mechanism: contiguous allocation
◆ “Extent-based”: allocate files like segmented memory
– When creating a file, make the user pre-specify its length and allocate all space at once
– File descriptor contents: location and size
– Example: IBM OS/360
– Pro: simple, fast access, both sequential and random.
– Cons: external fragmentation, hard to grow files (same problems as segmentation)
[diagram: file a (base=1, len=3), file b (base=5, len=2) — what happens if file c needs 2 sectors???]
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (252/355)
Linked files
◆ Basically a linked list on disk.
– Keep a linked list of all free blocks
– file descriptor contents: a pointer to file’s first block
– in each block, keep a pointer to the next one
– Pro: easy dynamic growth & sequential access, no fragmentation
– Con: terrible random access — how do you find the last block in a? (chase every pointer)
– Examples (sort-of): Alto, TOPS-10, DOS FAT
[diagram: file a (base=1), file b (base=5)]
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (254/355)
Example: DOS FS (simplified)
◆ Uses linked files. Cute: links reside in a fixed-sized “file allocation table” (FAT) rather than in the blocks.
– Still do pointer chasing, but can cache entire FAT so it can be cheap compared to disk access.
[diagram: directory (inode 5) holds a: 6, b: 2; in the FAT (16-bit entries), file a chains through blocks 6 → 4 → 3 → eof and file b through 2 → 1 → eof; remaining entries free]
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (255/355)
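Following a FAT chain is just linked-list traversal through the table. A toy model of the slide's example layout (block numbers and names taken from the diagram; `EOF` is a stand-in for the real FAT end-of-chain marker):

```python
EOF = -1

def fat_blocks(fat, directory, name):
    """Enumerate a file's blocks: the directory gives the first block,
    and fat[b] gives the block after b, until the eof marker."""
    blocks, b = [], directory[name]
    while b != EOF:
        blocks.append(b)
        b = fat[b]              # pointer chasing, but in the cached FAT
    return blocks

# Slide's example: file a occupies blocks 6 -> 4 -> 3, file b 2 -> 1.
fat = {6: 4, 4: 3, 3: EOF, 2: 1, 1: EOF}
directory = {"a": 6, "b": 2}
assert fat_blocks(fat, directory, "a") == [6, 4, 3]
assert fat_blocks(fat, directory, "b") == [2, 1]
```

Random access to block i of a file still costs i table lookups, but since the whole FAT fits in memory, those lookups cost no disk I/O.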
FAT discussion
◆ Entry size = 16 bits
– What’s the maximum size of the FAT?
– Given a 512 byte block, what’s the maximum size of FS?
– One attack: go to bigger blocks. Pro? Con?
◆ Space overhead of FAT is trivial:
– 2 bytes / 512 byte block = ~.4% (Compare to Unix)
◆ Reliability: how to protect against errors?
– Create duplicate copies of FAT on disk.
– State duplication a very common theme in reliability
◆ Bootstrapping: where is root directory?
– Fixed location on disk: [FAT (opt) | FAT | root dir | …]
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (257/355)
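The sizing questions above have a quick back-of-the-envelope answer (ignoring the handful of reserved entry values like free and eof):

```python
# 16-bit FAT entries -> at most 2^16 addressable blocks.
entries = 2 ** 16
fat_bytes = entries * 2         # the whole table is tiny: cache it
max_fs_512 = entries * 512      # max FS size with 512-byte blocks
max_fs_4k = entries * 4096      # bigger blocks raise the ceiling...

assert fat_bytes == 128 * 1024              # 128KB FAT
assert max_fs_512 == 32 * 1024 * 1024       # only 32MB!
assert max_fs_4k == 256 * 1024 * 1024       # 256MB
# ...at the cost of more internal fragmentation per small file.
```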
Indexed files
◆ Each file has an array holding all of its block pointers
– (purpose and issues = those of a page table)
– max file size fixed by array’s size (static or dynamic?)
– create: allocate array to hold all file’s blocks, but allocate on demand using free list
– pro: both sequential and random access easy
– Con: mapping table = large contiguous chunk of space. Same problem we were trying to initially solve.
file a file b
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (259/355)
Indexed files
◆ Issues same as in page tables
– Large possible file size = lots of unused entries
– Large actual size? table needs large contiguous disk chunk
– Solve identically: small regions with index array, index this array with another array, … Downside?
– Example: 2^32-byte max file size with 4K blocks = 2^20 entries, mostly idle for small files
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (260/355)
Multi-level indexed files: ~4.3 BSD
◆ File descriptor (inode) = 14 block pointers + “stuff”
[diagram: the first pointers (ptr 1, ptr 2, …) go straight to data blocks; ptr 13 → an indirect block holding 128 more pointers to data blocks; ptr 14 → a double indirect block holding 128 pointers to indirect blocks]
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (261/355)
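The offset-to-pointer-chain mapping can be sketched directly. The split of 12 direct pointers plus one indirect and one double-indirect is an assumption for illustration (the slide only says 14 pointers total), as is the 128-pointers-per-block figure from the diagram:

```python
NDIRECT = 12      # assumed: 12 direct + 1 indirect + 1 double indirect
PTRS = 128        # pointers per indirect block (from the diagram)
BLK = 512

def block_for_offset(off):
    """Which pointer chain maps byte `off`?
    Returns (level, indices into the pointer arrays along the way)."""
    bn = off // BLK
    if bn < NDIRECT:                       # cheap: one pointer in the inode
        return ("direct", [bn])
    bn -= NDIRECT
    if bn < PTRS:                          # one extra disk read
        return ("indirect", [bn])
    bn -= PTRS                             # two extra disk reads
    return ("double", [bn // PTRS, bn % PTRS])

assert block_for_offset(0) == ("direct", [0])
assert block_for_offset(12 * 512) == ("indirect", [0])
assert block_for_offset((12 + 128) * 512) == ("double", [0, 0])
```

This shape is the point of the design: small files (the common case) stay entirely in the cheap direct zone, while each indirection level costs one more potential disk access.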
Unix discussion
◆ Pro?
– simple, easy to build, fast access to small files
– Maximum file length fixed, but large. (With 4k blks?)
◆ Cons:
– what’s the worst case # of accesses?
– What’s some bad space overheads?
◆ An empirical problem:
– because you allocate blocks by taking them off an unordered freelist, meta data and data get strewn across disk
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (262/355)
More about inodes
◆ Inodes are stored in a fixed sized array
– Size of array determined when disk is initialized and can’t be changed. Array lives in known location on disk. Originally at one side of disk: [inode array | file blocks …]
– Now is smeared across it (why?)
– The index of an inode in the inode array is called an i-number. Internally, the OS refers to files by i-number
– When a file is opened, the inode is brought into memory; when closed, it is flushed back to disk.
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (263/355)
Example: (oversimplified) Unix file system
◆ Want to modify byte 4 in /a/b.c:
◆ read in root directory (inode 2)
◆ lookup a (inode 12); read in
◆ lookup inode for b.c (13); read in
◆ use inode to find blk for byte 4 (blksize = 512, so offset = 0 gives blk 14); read in and modify
[diagram: root directory (inode 2) holds <., 2> <a, 12>; directory a (inode 12) holds <., 12> <.., 2> <b.c, 13>; inode 13 (refcnt=1) has block pointers 14, 0, …, 0; block 14 holds “int main() { …”]
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (264/355)
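The lookup walk above can be sketched as a tiny namei-style resolver. A toy model: `dirs` maps a directory's inode number to its <name, inode#> entries, with the numbers taken from the slide's example:

```python
def namei(path, dirs, root=2):
    """Resolve an absolute path by walking directories:
    look up each component in the current directory's entries.
    Each level would cost (at least) one disk read."""
    ino = root                            # root directory is inode 2
    for part in path.strip("/").split("/"):
        ino = dirs[ino][part]
    return ino

dirs = {2:  {".": 2, "a": 12},                # root (inode 2)
        12: {".": 12, "..": 2, "b.c": 13}}    # /a   (inode 12)
assert namei("/a/b.c", dirs) == 13
```

One directory read per component is why deep paths are expensive, and why real kernels cache both inodes and name translations.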
Directories
◆ Problem:
– “spend all day generating data, come back the next morning, want to use it.” F. Corbato, on why files/dirs invented.
◆ Approach 0: have user remember where on disk the file is.
– (e.g., social security numbers)
◆ Yuck. People want human digestible names
– we use directories to map names to file blocks
◆ Next: What is in a directory and why?
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (265/355)
A short history of time
◆ Approach 1: have a single directory for entire system.
– put directory at known location on disk
– directory contains <name, index> pairs
– if one user uses a name, no one else can
– many ancient PCs work this way. (cf “hosts.txt”)
◆ Approach 2: have a single directory for each user
– still clumsy. And ls on 10,000 files is a real pain
– (many older mathematicians work this way)
◆ Approach 3: hierarchical name spaces
– allow directory to map names to files or other dirs
– file system forms a tree (or graph, if links allowed)
– large name spaces tend to be hierarchical (ip addresses, domain names, scoping in programming languages, etc.)
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (266/355)
Hierarchical Unix
◆ Used since CTSS (1960s)
– Unix picked it up and used it really nicely.
◆ Directories stored on disk just like regular files
– inode contains special flag bit set
– users can read just like any other file
– only special programs can write (why?)
– Inodes at fixed disk location
– File pointed to by the index may be another directory
– makes FS into hierarchical tree (what needed to make a DAG?)
◆ Simple. Plus speeding up file ops = speeding up dir ops!
[diagram: / holds <name, inode#> pairs <afs, 1021> <tmp, 1020> <bin, 1022> <cdrom, 4123> <dev, 1001> <sbin, 1011> …; bin holds awk, chmod, chown]
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (267/355)
Naming magic
◆ Bootstrapping: Where do you start looking?
– Root directory
– inode #2 on the system
– 0 and 1 used for other purposes
◆ Special names:
– Root directory: “/”
– Current directory: “.”
– Parent directory: “..”
– user’s home directory: “~”
◆ Using the given names, only need two operations to navigate the entire name space:
– cd ‘name’: move into (change context to) directory “name”
– ls : enumerate all names in current directory (context)
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (268/355)
Unix example: /a/b/c.c
[diagram — name space: / → a → b → c.c, with “.” and “..” entries at each level; physical organization: inode table (inodes 2, 3, 4, 5, …) on disk; root directory (inode 2) holds <a, 3>; directory a (inode 3) holds <b, 5>; directory b (inode 5) holds <c.c, 14>]
◆ What inode holds file for a? b? c.c?
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (269/355)
Default context: working directory
◆ Cumbersome to constantly specify full path names
– in Unix, each process associated with a “current working directory”
– file names that do not begin with “/” are assumed to be relative to the working directory, otherwise translation happens as before
◆ Shells track a default list of active contexts
– a “search path”
– given a search path { A, B, C } a shell will check in A, then check in B, then check in C
– can escape using explicit paths: “./foo”
◆ Example of locality
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (270/355)
Creating synonyms: Hard and soft links
◆ More than one dir entry can refer to a given file
– Unix stores count of pointers (“hard links”) to inode
– to make: “ln foo bar” creates a synonym (‘bar’) for ‘foo’
◆ Soft links:
– also point to a file (or dir), but object can be deleted from underneath it (or never even exist).
– Unix builds like directories: normal file holds pointed-to name, with special “sym link” bit set
– When the file system encounters a symbolic link it automatically translates it (if possible).
[diagram: foo and bar both point at the same inode (ref = 2); a sym link “baz” is a file holding the name “/bar”]
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (271/355)
Micro-case study: speeding up a FS
◆ Original Unix FS: Simple and elegant:
[disk layout: superblock | inodes | data blocks (512 bytes)]
◆ Nouns:
– data blocks
– inodes (directories represented as files)
– hard links
– superblock (specifies number of blks in FS, counts of max # of files, pointer to head of free list)
◆ Problem: slow
– only gets 20KB/sec (2% of disk maximum) even for sequential disk transfers!
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (272/355)
A plethora of performance costs
◆ Blocks too small (512 bytes)
– file index too large
– too many layers of mapping indirection
– transfer rate low (get one block at a time)
◆ Sucky clustering of related objects:
– Consecutive file blocks not close together
– Inodes far from data blocks
– Inodes for directory not close together
– poor enumeration performance: e.g., “ls”, “grep foo *.c”
◆ Next: how FFS fixes these problems (to a degree)
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (273/355)
Problem 1: Too small block size
◆ Why not just make bigger?

Block size   space wasted   file bandwidth
512          6.9%           2.6%
1024         11.8%          3.3%
2048         22.4%          6.4%
4096         45.6%          12.0%
1MB          99.0%          97.2%

◆ Bigger block increases bandwidth, but how to deal with wastage (“internal fragmentation”)?
– Use idea from malloc: split unused portion.
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (274/355)
Handling internal fragmentation
◆ BSD FFS:
– has large block size (4096 or 8192)
– allow large blocks to be chopped into small ones (“fragments”)
– Used for little files and pieces at the ends of files
◆ Best way to eliminate internal fragmentation?
– Variable sized splits of course
– Why does FFS use fixed-sized fragments (1024, 2048)?
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (275/355)
Prob’ 2: Where to allocate data?
◆ Our central fact:
– Moving disk head expensive
◆ So? Put related data close
– Fastest: adjacent sectors (can span platters)
– Next: in same cylinder (can also span platters)
– Next: in cylinder close by
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (276/355)
Clustering related objects in FFS
◆ Group 1 or more consecutive cylinders into a “cylinder group”
– Key: can access any block in a cylinder without performing a seek. Next fastest place is adjacent cylinder.
– Tries to put everything related in same cylinder group
– Tries to put everything not related in different group (?!)
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (277/355)
Clustering in FFS
◆ Tries to put sequential blocks in adjacent sectors
– (access one block, probably access next)
◆ Tries to keep inode in same cylinder as file data:
– (if you look at inode, most likely will look at data too)
◆ Tries to keep all inodes in a dir in same cylinder group
– (access one name, frequently access many)
– “ls -l”
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (278/355)
What’s a cylinder group look like?
◆ Basically a mini-Unix file system:
[layout: superblock | inodes | data blocks (512 bytes)]
◆ How to ensure there’s space for related stuff?
– Place different directories in different cylinder groups
– Keep a “free space reserve” so can allocate near existing things
– when a file grows too big (1MB) send its remainder to a different cylinder group.
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (279/355)
Prob’ 3: Finding space for related objects
◆ Old Unix (& DOS): Linked list of free blocks
– Just take a block off of the head. Easy.
– Bad: free list gets jumbled over time. Finding adjacent blocks hard and slow
◆ FFS: switch to bit-map of free blocks
– 1010101111111000001111111000101100
– easier to find contiguous blocks.
– Small, so usually keep entire thing in memory
– key: keep a reserve of free blocks. Makes finding a close block easier
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (280/355)
Using a bitmap
◆ Usually keep entire bitmap in memory:
– 4G disk / 4K byte blocks. How big is map?
◆ Allocate block close to block x?
– check for blocks near bmap[x/32]
– if disk almost empty, will likely find one near
– as disk becomes full, search becomes more expensive and less effective.
◆ Trade space for time (search time, file access time)
– keep a reserve (e.g., 10%) of disk always free, ideally scattered across disk
– don’t tell users (df → 110% full)
– N platters = N adjacent blocks
– with 10% free, can almost always find one of them free
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (281/355)
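The “allocate near x” search can be sketched as an outward scan through the free-block bitmap. A toy version (one Python int per bit instead of a packed word array):

```python
def alloc_near(bitmap, x):
    """Allocate a free block close to block x: scan outward from x
    in both directions through the free-block bitmap (1 = free).
    Returns the block number, or None if the disk is full."""
    n = len(bitmap)
    for d in range(n):
        for b in (x + d, x - d):
            if 0 <= b < n and bitmap[b]:
                bitmap[b] = 0           # mark allocated
                return b
    return None

bitmap = [0, 0, 1, 0, 0, 0, 1, 0]
assert alloc_near(bitmap, 4) == 6       # nearest free block wins
assert alloc_near(bitmap, 4) == 2       # next-nearest once 6 is taken
```

The free-space reserve is what keeps this loop short: with ~10% of bits guaranteed set and scattered, the scan almost always terminates within a few words of x.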
So what did we gain?
◆ Performance improvements:
– able to get 20-40% of disk bandwidth for large files
– 10-20x original Unix file system!
– Better small file performance (why?)
◆ Is this the best we can do? No.
◆ Block based rather than extent based
– name contiguous blocks with single pointer and length
– (Linux ext2fs)
◆ Writes of meta data done synchronously
– really hurts small file performance
– make asynchronous with write-ordering (“soft updates”) or logging (the Episode file system, ~LFS)
– play with semantics (/tmp file systems)
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (282/355)
Other hacks?
◆ Obvious:
– Big file cache.
◆ Fact: no rotation delay if get whole track.
– How to use?
◆ Fact: transfer cost negligible.
– Can get 20x the data for only ~5% more overhead
– 1 sector = 10ms + 8ms + 50us (512 / 10MB/s) = 18ms
– 20 sectors = 10ms + 8ms + 1ms = 19ms
– How to use?
◆ Fact: if transfer huge, seek + rotation negligible
– Mendel: LFS. Hoard data, write out MB at a time.
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (283/355)
Review: FFS background
• 1980s improvement to original Unix FS, which had:
- 512-byte blocks
- Free blocks in linked list
- All inodes at beginning of disk
- Low throughput: 512 bytes per average seek time
• Unix FS performance problems:
- Transfers only 512 bytes per disk access
- Eventually random allocation → 512 bytes / disk seek
- Inodes far from directory and file data
- Within directory, inodes far from each other
• Also had some usability problems:
- 14-character file names a pain
- Can’t atomically update file in crash-proof way
– p. 2/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (284/355)
Review: FFS [McKusick] basics
• Change block size to at least 4K
- To avoid wasting space, use “fragments” for ends of files
• Cylinder groups spread inodes around disk
• Bitmaps replace free list
• FS reserves space to improve allocation
- Tunable parameter, default 10%
- Only superuser can use space when over 90% full
• Usability improvements:
- File names up to 255 characters
- Atomic rename system call
- Symbolic links assign one file name to another
– p. 3/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (285/355)
FFS disk layout
[diagram: disk divided into cylinder groups; each group holds a superblock, bookkeeping information, inodes, and data blocks]
• Each cylinder group has its own:
- Superblock
- Bookkeeping information
- Set of inodes
- Data/directory blocks
– p. 4/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (286/355)
Basic FFS data structures
[diagram: directory contents map name → i-number; the inode holds metadata plus data pointers, an indirect pointer, and a double-indirect pointer, reaching data blocks through indirect blocks]
• Inode is key data structure for each file
- Has permissions and access/modification/inode-change times
- Has link count (# directories containing file); file deleted when 0
- Points to data blocks of file (and indirect blocks)
• By convention, inode #2 always root directory
– p. 5/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (287/355)
FFS superblock
• Superblock contains file system parameters
- Disk characteristics, block size, CG info
- Information necessary to get inode given i-number
• Replicated once per cylinder group
- At shifting offsets, so as to span multiple platters
- Contains magic number to find replicas if 1st superblock dies
• Contains non-replicated “summary info”
- # blocks, fragments, inodes, directories in FS
- Flag stating if FS was cleanly unmounted
– p. 6/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (288/355)
Cylinder groups
• Groups related inodes and their data
• Contains a number of inodes (set when FS created)
- Default one inode per 2K data
• Contains file and directory blocks
• Contains bookkeeping information
- Block map – bit map of available fragments
- Summary info within CG – # free inodes, blocks/frags, files,
directories
- # free blocks by rotational position (8 positions)
[In 1980s, disks weren’t commonly zoned, so this was
reasonable]
– p. 7/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (289/355)
Inode allocation
• Allocate inodes in same CG as directory if possible
• New directories put in new cylinder groups
- Consider CGs with greater than average # free inodes
- Chose CG with smallest # directories
• Within CG, inodes allocated randomly (next free)
- Would like related inodes as close as possible
- OK, because one CG doesn’t have that many inodes
– p. 8/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (290/355)
Fragment allocation
• Allocate space when user writes beyond end of file
• Want last block to be a fragment if not full-size
- If already a fragment, may contain space for write – done
- Else, must deallocate any existing fragment, allocate new
• If no appropriate free fragments, break full block
• Problem: Slow for many small writes
- (Partial) solution: new stat struct field st_blksize
- Tells applications file system block size
- stdio library can buffer this much data
– p. 9/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (291/355)
Block allocation
• Try to optimize for sequential access
- If available, use rotationally close block in same cylinder
- Otherwise, use block in same CG
- If CG totally full, find other CG with quadratic hashing
- Otherwise, search all CGs for some free space
• Problem: Don’t want one file filling up whole CG
- Otherwise other inodes will have data far away
• Solution: Break big files over many CGs
- But large extents in each CGs, so sequential access doesn’t
require many seeks
– p. 10/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (292/355)
Directories
• Inodes like files, but with different type bits
• Contents considered as 512-byte chunks
• Each chunk has direct structure(s) with:
- 32-bit inumber
- 16-bit size of directory entry
- 8-bit file type (NEW)
- 8-bit length of file name
• Coalesce when deleting
- If first direct in chunk deleted, set inumber = 0
• Periodically compact directory chunks
– p. 11/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (293/355)
Updating FFS for the 90s
• No longer want to assume rotational delay
- With disk caches, want data contiguously allocated
• Solution: Cluster writes
- FS delays writing a block back to get more blocks
- Accumulates blocks into 64K clusters, written at once
• Allocation of clusters similar to fragments/blocks
- Summary info
- Cluster map has one bit per 64K cluster, set if it is entirely free
• Also read in 64K chunks when doing read ahead
– p. 12/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (294/355)
Fixing corruption – fsck
• Must run FS check (fsck) program after crash
• Summary info usually bad after crash
- Scan to check free block map, block/inode counts
• System may have corrupt inodes (not simple crash)
- Bad block numbers, cross-allocation, etc.
- Do sanity check, clear inodes with garbage
• Fields in inodes may be wrong
- Count number of directory entries to verify link count; if no entries but count ≠ 0, move to lost+found
- Make sure size and used data counts match blocks
• Directories may be bad
- Holes illegal, . and .. must be valid, . . .
- All directories must be reachable
– p. 13/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (295/355)
Crash recovery permeates FS code
• Have to ensure fsck can recover file system
• Example: Suppose all data written asynchronously
• Delete/truncate a file, append to other file, crash
- New file may reuse block from old
- Old inode may not be updated
- Cross-allocation!
- Often inode with older mtime wrong, but can’t be sure
• Append to file, allocate indirect block, crash
- Inode points to indirect block
- But indirect block may contain garbage
– p. 14/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (296/355)
Ordering of updates
• Must be careful about order of updates
- Write new inode to disk before directory entry
- Remove directory name before deallocating inode
- Write cleared inode to disk before updating CG free map
• Solution: Many metadata updates synchronous
- Of course, this hurts performance
- E.g., untar much slower than disk b/w
• Note: Cannot update buffers on the disk queue
- E.g., say you make two updates to same directory block
- But crash recovery requires first to be synchronous
- Must wait for first write to complete before doing second
– p. 15/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (297/355)
Performance vs. consistency
• FFS crash recoverability comes at huge cost
- Makes tasks such as untar easily 10-20 times slower
- All because you might lose power or reboot at any time
• Even while slowing ordinary usage, recovery slow
- If fsck takes one minute, then disks get 10× bigger . . .
• One solution: battery-backed RAM
- Expensive (requires specialized hardware)
- Often don’t learn battery has died until too late
- A pain if computer dies (can’t just move disk)
- If OS bug causes crash, RAM might be garbage
• Better solution: Advanced file system techniques
- Topic of rest of lecture
– p. 16/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (298/355)
First attempt: Ordered updates
• Must follow three rules in ordering updates:
1. Never write pointer before initializing the structure it points to
2. Never reuse a resource before nullifying all pointers to it
3. Never clear last pointer to live resource before setting new one
• If you do this, file system will be recoverable
• Moreover, can recover quickly
- Might leak free disk space, but otherwise correct
- So start running after reboot, scavenge for space in background
• How to achieve?
- Keep a partial order on buffered blocks
– p. 17/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (299/355)
Ordered updates (continued)
• Example: Create file A
- Block X contains an inode
- Block Y contains a directory block
- Create file A in inode block X, dir block Y
• We say Y → X meaning X must be written before Y
• Can delay both writes, so long as order preserved
- Say you create a second file B in blocks X and Y
- Only have to write each out once for both creates
– p. 18/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (300/355)
Problem: Cyclic dependencies
• Suppose you create file A, unlink file B
- Both files in same directory block & inode block
• Can’t write directory until inode A initialized
- Otherwise, after crash directory will point to bogus inode
- Worse yet, same inode # might be re-allocated
- So could end up with file name A being an unrelated file
• Can’t write inode block until dir entry B cleared
- Otherwise, B could end up with too small a link count
- File could be deleted while links to it still exist
• Otherwise, fsck has to be very slow
- Check every directory entry and inode link count
– p. 19/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (301/355)
Cyclic dependencies illustrated
(a) Original organization: inode block holds inodes #4–#7; directory block holds < −−,#0 >, < B,#5 >, < C,#7 >
(b) Create file A: inode #4 initialized; empty directory entry becomes < A,#4 >
(c) Remove file B: inode #5 freed; entry < B,#5 > becomes < −−,#0 >
– p. 20/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (302/355)
More problems
• Crash might occur between ordered but related writes
- E.g., summary information wrong after block freed
• Block aging
- Block that always has dependency will never get written back
• Solution: “Soft updates” [Ganger]
- Write blocks in any order
- But keep track of dependencies
- When writing a block, temporarily roll back any changes you
can’t yet commit to disk
– p. 21/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (303/355)
Breaking dependencies w. rollback
(a) After metadata updates:
- Main memory: inode block has #4 initialized for A; directory block holds < A,#4 >, < C,#7 >, < −−,#0 > (B removed)
- Disk: inode block unchanged; directory block still holds < −−,#0 >, < C,#7 >, < B,#5 >
• Now say we decide to write directory block. . .
– p. 22/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (304/355)
Breaking dependencies w. rollback
(b) Safe version of directory block written:
- A’s entry is rolled back to < −−,#0 > before the write, so the on-disk directory block holds < −−,#0 >, < C,#7 >, < −−,#0 >
- Main memory still holds < A,#4 >, < C,#7 >, < −−,#0 >
• Note: Directory block still dirty
• But now inode block has no dependencies
• Say we write inode block out. . .
– p. 22/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (305/355)
Breaking dependencies w. rollback
(c) Inode block written:
- On-disk inode block now has #4 initialized; on-disk directory block still holds < −−,#0 >, < C,#7 >, < −−,#0 >
• Now inode block clean (same in memory as on disk)
• But have to write directory block a second time. . .
– p. 22/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (306/355)
Breaking dependencies w. rollback
(d) Directory block written:
- On-disk directory block now holds < A,#4 >, < C,#7 >, < −−,#0 >, matching memory
• All data stably on disk
– p. 22/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (307/355)
Soft updates
• Structure for each updated field or pointer, contains:
- old value
- new value
- list of updates on which this update depends (dependees)
• Can write blocks in any order
- But must temporarily undo updates with pending
dependencies
- Must lock rolled-back version so applications don’t see it
- Choose ordering based on disk arm scheduling
• Some dependencies better handled by postponing in-memory updates
- E.g., Just mark block as free in bitmap after pointer cleared
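A sketch of such an update record and the rollback cycle (hypothetical names; real soft updates track much more per dependency):

```c
/* Hypothetical soft-updates dependency record: each tracked field keeps
 * its old and new values plus the update it depends on. */
struct dep {
    unsigned *field;       /* location of the updated pointer/field      */
    unsigned old_value;    /* value to restore when rolling back         */
    unsigned new_value;    /* value to reapply after the block is written */
    struct dep *dependee;  /* update that must reach disk first          */
    int dependee_on_disk;  /* set once the dependee's block is written   */
};

/* Before writing the block: undo any update whose dependee is not yet
 * stably on disk, so the image written out is always consistent. */
void roll_back(struct dep *d)
{
    if (d->dependee && !d->dependee_on_disk)
        *d->field = d->old_value;
}

/* After the write: reapply the new value in memory.  The block stays
 * dirty and is written again once the dependee is on disk. */
void roll_forward(struct dep *d)
{
    *d->field = d->new_value;
}
```

Once `dependee_on_disk` is set, `roll_back` leaves the new value in place, so the next write of the block finally makes it clean.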
Simple example
• Create zero-length file A
• Depender: Directory entry for A
- Can’t be written until dependees on disk
• Dependees:
- Inode – must be initialized before dir entry written
- Bitmap – must mark inode allocated before dir entry written
• Old value: empty directory entry
• New value: 〈filename A, inode #〉
Operations requiring soft updates (1)
1. Block allocation
- Must write the disk block, the free map, & a pointer
- Disk block & free map must be written before pointer
- Use Undo/redo on pointer (& possibly file size)
2. Block deallocation
- Must write the cleared pointer & free map
- Just update free map after pointer written to disk
- Or just immediately update free map if pointer not on disk
• Say you quickly append block to file then truncate
- You will know pointer to block not written because of the
allocated dependency structure
- So both operations together require no disk I/O
Operations requiring soft updates (2)
3. Link addition (see simple example)
- Must write the directory entry, inode, & free map (if new inode)
- Inode and free map must be written before dir entry
- Use undo/redo on i# in dir entry (ignore entries w. i# 0)
4. Link removal
- Must write directory entry, inode & free map (if nlinks==0)
- Must decrement nlinks only after pointer cleared
- Clear directory entry immediately
- Decrement in-memory nlinks once pointer written
- If directory entry was never written, decrement immediately
(again will know by presence of dependency structure)
• Note: Quick create/delete requires no disk I/O
Soft update issues
• fsync – syscall to flush file changes to disk
- Must also flush directory entries, parent directories, etc.
• unmount – flush all changes to disk on shutdown
- Some buffers must be flushed multiple times to get clean
• Deleting large directory trees is frighteningly fast
- unlink syscall returns even if inode/indir block not cached!
- Dependencies allocated faster than blocks written
- Cap # dependencies allocated to avoid exhausting memory
• Useless write-backs
- Syncer flushes dirty buffers to disk every 30 seconds
- Writing all at once means many dependencies unsatisfied
- Fix syncer to write blocks one at a time
- Fix LRU buffer eviction to know about dependencies
Soft updates fsck
• Split into foreground and background parts
• foreground must be done before remounting FS
- Need to make sure per-cylinder summary info makes sense
- Recompute free block/inode counts from bitmaps – very fast
- Will leave FS consistent, but might leak disk space
• Background does traditional fsck operations
- Can do in background after mounting to recuperate free space
- Must be done in foreground after a media failure
• Difference from traditional FFS fsck:
- May have many, many inodes with non-zero link counts
- Don’t stick them all in lost+found (unless media failure)
An alternative: Journaling
• Reserve a portion of disk for write-ahead log
- Write any metadata operation first to log, then to disk
- After crash/reboot, re-play the log (efficient)
- May re-do an already committed change, but won’t miss anything
• Performance advantage:
- Log is consecutive portion of disk
- Multiple log writes very fast (at disk b/w)
- Consider updates committed when written to log
• Example: delete directory tree
- Record all freed blocks, changed directory entries in log
- Return control to user
- Write out changed directories, bitmaps, etc. in background
(sort for good disk arm scheduling)
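A toy write-ahead log illustrating the commit-then-apply order, and why replay may harmlessly re-do committed work (all names are ours, not any real FS's):

```c
/* Toy write-ahead log: record each metadata update in the log first,
 * then apply it to "disk".  After a crash, replay() reapplies every
 * committed record in order. */
#define LOG_CAP 64

struct logrec { int block; unsigned value; int committed; };

static struct logrec logbuf[LOG_CAP];
static int log_len = 0;

/* Append the intended update to the log before touching disk.  The
 * update counts as committed as soon as the record is in the log. */
void log_update(int block, unsigned value)
{
    logbuf[log_len].block = block;
    logbuf[log_len].value = value;
    logbuf[log_len].committed = 1;   /* commit point */
    log_len++;
}

/* Replay after a crash: reapply every committed record.  Re-doing an
 * already applied update is harmless because records store absolute
 * values, so replay is idempotent. */
void replay(unsigned *disk)
{
    for (int i = 0; i < log_len; i++)
        if (logbuf[i].committed)
            disk[logbuf[i].block] = logbuf[i].value;
}
```

In a real journal the records describe blocks/bytes rather than single words, and the log lives on a reserved, contiguous region of the disk, which is what makes the log writes fast.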
Journaling details
• Must find oldest relevant log entry
- Otherwise, redundant and slow to replay whole log
• Use checkpoints
- Once all records up to log entry N have been processed and
affected blocks stably committed to disk. . .
- Record N to disk either in reserved checkpoint location, or in
checkpoint log record
- Never need to go back before most recent checkpointed N
• Must also find end of log
- Typically circular buffer; don’t play old records out of order
- Can include begin transaction/end transaction records
- Also typically have checksum in case some sectors bad
Case study: XFS [Sweeney]
• Main idea: Think big
- Big disks, files, large # of files, 64-bit everything
- Yet maintain very good performance
• Break disk up into Allocation Groups (AGs)
- 0.5 – 4 GB regions of disk
- New directories go in new AGs
- Within directory, inodes of files go in same AG
- Unlike cylinder groups, AGs too large to minimize seek times
- Unlike cylinder groups, no fixed # of inodes per AG
• Advantages of AGs:
- Parallelize allocation of blocks/inodes on multiprocessor
(independent locking of different free space structures)
- Can use 32-bit block pointers within AGs
(keeps data structures smaller)
B+-trees
[Figure: a B+-tree; interior nodes hold keys and child pointers, leaf nodes hold 〈key, value〉 pairs.]
• XFS makes extensive use of B+-trees
- Indexed data structure stores ordered Keys & Values
- Keys must have an ordering defined on them
- Stores data in blocks for efficient disk access
• For B+-tree w. n items, all operations O(log n):
- Retrieve closest 〈key, value〉 to target key k
- Insert a new 〈key, value〉 pair
- Delete 〈key, value〉 pair
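The leaf-level half of that first operation, binary search for the closest key ≤ k, sketched over a sorted array (a real B+-tree repeats this at every level of the tree):

```c
/* Binary-search a sorted key array for the largest key <= k, i.e. the
 * "closest" entry at or below the target.  O(log n). */
int closest_leq(const int *keys, int n, int k)
{
    int lo = 0, hi = n - 1, best = -1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (keys[mid] <= k) { best = mid; lo = mid + 1; }
        else                  hi = mid - 1;
    }
    return best;   /* index of closest key <= k, or -1 if none */
}
```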
B+-trees continued
• See any algorithms book for details (e.g., [Cormen])
• Some operations on B-tree are complex:
- E.g., insert item into completely full B+-tree
- May require “splitting” nodes, adding new level to tree
- Would be bad to crash & leave B+tree in inconsistent state
• Journal enables atomic complex operations
- First write all changes to the log
- If crash while writing log, incomplete log record will be
discarded, and no change made
- Otherwise, if crash while updating B+-tree, will replay entire
log record and write everything
B+-trees in XFS
• B+-trees are complex to implement
- But once you’ve done it, might as well use everywhere
• Use B+-trees for directories (keyed on filename hash)
- Makes large directories efficient
• Use B+-trees for inodes
- No more FFS-style fixed block pointers
- Instead, B+-tree maps: file offset → 〈start block, # blocks〉
- Ideally file is one or a small number of contiguous extents
- Allows small inodes & no indirect blocks even for huge files
• Use to find inode based on inumber
- High bits of inumber specify AG
- B+-tree in AG maps: starting i# → 〈block #, free-map〉
- So free inodes tracked right in leaf of B+-tree
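A toy stand-in for the offset → extent mapping above, with the B+-tree replaced by a sorted extent array (struct and function names are ours):

```c
/* XFS-style extent mapping sketch: instead of per-block pointers, map
 * file offset -> (start block, length).  All quantities in blocks. */
struct extent { unsigned long off, start, len; };

/* Translate a file block number to a disk block number, or ~0UL if the
 * offset falls in a hole.  A real implementation searches a B+-tree
 * rather than scanning linearly. */
unsigned long map_block(const struct extent *ex, int n, unsigned long fblock)
{
    for (int i = 0; i < n; i++)
        if (fblock >= ex[i].off && fblock < ex[i].off + ex[i].len)
            return ex[i].start + (fblock - ex[i].off);
    return ~0UL;
}
```

A contiguous 1 GB file needs only one such entry, which is why huge files need neither big inodes nor indirect blocks.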
More B+-trees in XFS
• Free extents tracked by two B+-trees
1. start block # → # free blocks
2. # free blocks → start block #
• Use journal to update both atomically & consistently
• #1 allows you to coalesce adjacent free regions
• #1 allows you to allocate near some target
- E.g., when extending file, put next block near previous one
- When first writing to file, put data near inode
• #2 allows you to do best fit allocation
- Leave large free extents for large files
Contiguous allocation
• Ideally want each file contiguous on disk
- Sequential file I/O should be as fast as sequential disk I/O
• But how do you know how large a file will be?
• Idea: delayed allocation
- write syscall only affects the buffer cache
- Allow write into buffers before deciding where to place on disk
- Assign disk space only when buffers are flushed
• Other advantages:
- Short-lived files never need disk space allocated
- mmaped files often written in random order in memory, but will
be written to disk mostly contiguously
- Write clustering: find other nearby stuff to write to disk
Fifth Chapter
Security⁴
⁴ From David Mazières’ course at Stanford.
Dawson Engler RSA (2008-2009) Chap 5: Security (322/355)
View access control as a matrix
• Subjects (processes/users) access objects (e.g., files)
• Each cell of matrix has allowed permissions
Specifying policy
• Manually filling out matrix would be tedious
• Use tools such as groups or role-based access control:
Two ways to slice the matrix
• Along columns:
- Kernel stores list of who can access object along with object
- Most systems you’ve used probably do this
- Examples: Unix file permissions, Access Control Lists (ACLs)
• Along rows:
- Capability systems do this
- More on these later. . .
Example: Unix protection
• Each process has a User ID & one or more group IDs
• System stores with each file:
- User who owns the file and group file is in
- Permissions for user, any one in file group, and other
• Shown by output of ls -l command:
    rwxr-xr-x  dm  cs140  ...  index.html
  (user bits rwx for owner dm; group bits r-x for group cs140; other bits r-x for everyone else)
- User permissions apply to processes with same user ID
- Else, group permissions apply to processes in same group
- Else, other permissions apply
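A minimal sketch of that check order (not kernel code; names are ours). Note the strict precedence: exactly one class of bits applies, decided before looking at what it grants:

```c
/* Unix-style permission check: owner bits if UID matches, else group
 * bits if GID matches, else other bits. */
#define PERM_R 4
#define PERM_W 2
#define PERM_X 1

int allowed(unsigned mode,                      /* e.g. 0754 */
            unsigned f_uid, unsigned f_gid,     /* file owner & group  */
            unsigned p_uid, unsigned p_gid,     /* process UID & GID   */
            unsigned want)                      /* requested PERM_ mask */
{
    unsigned bits;
    if (p_uid == f_uid)      bits = (mode >> 6) & 7;  /* user class  */
    else if (p_gid == f_gid) bits = (mode >> 3) & 7;  /* group class */
    else                     bits = mode & 7;         /* other class */
    return (bits & want) == want;
}
```

A consequence of the precedence: with mode 0477 the owner has only read access, even though everyone else has rwx.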
Unix continued
• Directories have permission bits, too
- Need write perm. on directory to create or delete a file
• Special user root (UID 0) has all privileges
- E.g., Read/write any file, change owners of files
- Required for administration (backup, creating new users, etc.)
• Example:
- drwxr-xr-x 56 root wheel 4096 Apr 4 10:08 /etc
- Directory writable only by root, readable by everyone
- Means non-root users cannot directly delete files in /etc
Non-file permissions in Unix
• Many devices show up in file system
- E.g., /dev/tty1 permissions just like for files
• Other access controls not represented in file system
• E.g., must usually be root to do the following:
- Bind any TCP or UDP port number less than 1,024
- Change the current process’s user or group ID
- Mount or unmount file systems
- Create device nodes (such as /dev/tty1) in the file system
- Change the owner of a file
- Set the time-of-day clock; halt or reboot machine
Example: Login runs as root
• Unix users typically stored in files in /etc
- Files passwd, group, and often shadow or master.passwd
• For each user, files contain:
- Textual username (e.g., “dm”, or “root”)
- Numeric user ID, and group ID(s)
- One-way hash of user’s password: {salt, H(passwd, salt)}
- Other information, such as user’s full name, login shell, etc.
• /usr/bin/login runs as root
- Reads username & password from terminal
- Looks up username in /etc/passwd, etc.
- Computes H(typed password, salt) & checks that it matches
- If matches, sets group ID & user ID for username
- Execute user’s shell with exec system call
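The password check above, in miniature. A toy FNV-1a hash stands in for the real crypt(3); the function names are ours. The point is only that login never stores or compares the cleartext password: it recomputes H(typed, salt) and compares hashes:

```c
/* Toy illustration of {salt, H(passwd, salt)} checking.  FNV-1a is NOT
 * a password hash; real systems use crypt(3)-style slow hashes. */
unsigned long pw_hash(const char *passwd, const char *salt)
{
    unsigned long v = 1469598103934665603UL;          /* FNV offset basis */
    for (const char *p = salt; *p; p++) {
        v ^= (unsigned char)*p;
        v *= 1099511628211UL;                         /* FNV prime */
    }
    for (const char *p = passwd; *p; p++) {
        v ^= (unsigned char)*p;
        v *= 1099511628211UL;
    }
    return v;
}

/* What login does, in miniature: recompute and compare. */
int pw_check(const char *typed, const char *salt, unsigned long stored)
{
    return pw_hash(typed, salt) == stored;
}
```

The salt ensures two users with the same password get different stored hashes, defeating precomputed dictionary tables.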
Setuid
• Some legitimate actions require more privileges than the user’s UID grants
- E.g., how should users change their passwords?
- Stored in root-owned /etc/passwd & /etc/shadow files
• Solution: Setuid/setgid programs
- Run with privileges of file’s owner or group
- Each process has real and effective UID/GID
- real is user who launched setuid program
- effective is owner/group of file, used in access checks
- E.g., /usr/bin/passwd – changes user’s password
- E.g., /bin/su – acquire new user ID with correct password
• Have to be very careful when writing setuid code
- Attackers can run setuid programs any time (no need to wait
for root to run a vulnerable job)
- Attacker controls many aspects of program’s environment
Other permissions
• When can process A send a signal to process B w. kill?
- Allow if sender and receiver have same effective UID
- But need ability to kill processes you launch even if suid
- So allow if real UIDs match, as well
- Can also send SIGCONT w/o UID match if in same session
• Debugger system call ptrace
- Lets one process modify another’s memory
- Setuid gives a program more privilege than invoking user
- So don’t let process ptrace more privileged process
- E.g., Require sender to match real & effective UID of target
- Also disable setuid if ptraced target calls exec
- Exception: root can ptrace anyone
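The kill() rules at the top of this slide (omitting the same-session SIGCONT case) amount to a small predicate. This is a simplification of what real kernels check, with hypothetical names:

```c
/* Simplified signal-permission check: root may signal anyone;
 * otherwise allow if effective UIDs match, or if real UIDs match
 * (so you can kill setuid programs you launched yourself).
 * The same-session SIGCONT exception is omitted. */
struct cred { unsigned ruid, euid; };

int may_signal(struct cred sender, struct cred target)
{
    if (sender.euid == 0)
        return 1;                               /* root */
    return sender.euid == target.euid           /* effective UIDs match */
        || sender.ruid == target.ruid;          /* or real UIDs match   */
}
```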
A security hole
• Even without root or setuid, attackers can trick root
owned processes into doing things. . .
• Example: Want to clear unused files in /tmp
• Every night, automatically run this command as root:
    find /tmp -atime +3 -exec rm -f -- {} \;
• find identifies files not accessed in 3 days
- executes rm, replacing {} with file name
• rm -f -- path deletes file path
- Note “--” prevents path from being parsed as option
• What’s wrong here?
An attack
Attacker: creat (“/tmp/etc/passwd”)
find: readdir (“/tmp”) → “etc”
find: lstat (“/tmp/etc”) → DIRECTORY
Attacker: rename (“/tmp/etc” → “/tmp/x”)
Attacker: symlink (“/etc”, “/tmp/etc”)
find: readdir (“/tmp/etc”) → “passwd”
rm: unlink (“/tmp/etc/passwd”)
• Time-of-check-to-time-of-use (TOCTTOU) bug
- find checks that /tmp/etc is not symlink
- But meaning of file name changes before it is used
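One way to close this window on modern systems is to stop re-resolving the path: open the directory itself with O_NOFOLLOW|O_DIRECTORY, then delete relative to that descriptor. This is a hedged sketch (function name is ours); the *at() calls are POSIX.1-2008 and were not available when tools like the find example above were written:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Delete name inside dir without trusting the path twice.  If an
 * attacker swaps dir for a symlink, the open fails (O_NOFOLLOW)
 * instead of following the link; unlinkat() then resolves name
 * relative to the directory we actually opened. */
int delete_in_dir(const char *dir, const char *name)
{
    int dfd = open(dir, O_RDONLY | O_DIRECTORY | O_NOFOLLOW);
    if (dfd < 0)
        return -1;            /* missing, or dir was really a symlink */
    int r = unlinkat(dfd, name, 0);
    close(dfd);
    return r;
}
```

The check (open) and the use (unlinkat) now refer to the same kernel object, so no rename/symlink race can change what "dir" means in between.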
xterm command
• Provides a terminal window in X-windows
• Used to run with setuid root privileges
- Requires kernel pseudo-terminal (pty) device
- Required root privs to change ownership of pty to user
- Also writes protected utmp/wtmp files to record users
• Had feature to log terminal session to file
fd = open (logfile, O_CREAT|O_WRONLY|O_TRUNC, 0666);
/* ... */
xterm command
• Provides a terminal window in X-windows
• Used to run with setuid root privileges
- Requires kernel pseudo-terminal (pty) device
- Required root privs to change ownership of pty to user
- Also writes protected utmp/wtmp files to record users
• Had feature to log terminal session to file
if (access (logfile, W_OK) < 0)
    return ERROR;
fd = open (logfile, O_CREAT|O_WRONLY|O_TRUNC, 0666);
/* ... */
• access call avoids dangerous security hole
- Does permission check with real, not effective UID
- Wrong: Another TOCTTOU bug
An attack
Attacker: creat (“/tmp/X”)
xterm: access (“/tmp/X”) → OK
Attacker: unlink (“/tmp/X”)
Attacker: symlink (“/tmp/X” → “/etc/passwd”)
xterm: open (“/tmp/X”)
• Attacker changes /tmp/X between check and use
- xterm unwittingly overwrites /etc/passwd
- Time-of-check-to-time-of-use (TOCTTOU) bug
• OpenBSD man page: “CAVEATS: access() is a
potential security hole and should never be used.”
SSH configuration files
• SSH 1.2.12 – secure login program, runs as root
- Needs to bind TCP port under 1,024 (privileged operation)
- Needs to read client private key (for host authentication)
• Also needs to read & write files owned by user
- Read configuration file ~/.ssh/config
- Record server keys in ~/.ssh/known_hosts
• Author wanted to avoid TOCTTOU bugs:
- First binds socket & reads root-owned secret key file
- Then drops all privileges before accessing user files
- Idea: avoid using any user-controlled arguments/files until
you have no more privileges than the user
Trick question: ptrace bug
• Dropping privs allows user to “debug” SSH
- Depends on OS, but at the time several had ptrace
implementations that made SSH vulnerable
• Once in debugger
- Could use privileged port to connect anywhere
- Could read secret host key from memory
- Could overwrite local user name to get privs of other user
• The fix: restructure into 3 processes!
- Perhaps overkill, but really wanted to avoid problems
A linux security hole
• Some programs acquire then release privileges
- E.g., the setuid su program becomes user if password correct
• Consider the following:
- A and B unprivileged processes owned by attacker
- A ptraces B
- A executes “su user” to its own identity
- While su is superuser, B execs su root
(A is superuser, so this is not disabled)
- A types password, gets shell, and is attached to su root
- Can manipulate su root’s memory to get root shell
Editorial
• Previous examples show two limitations of Unix
• Many OS security policies subjective not objective
- When can you signal/debug process? Re-bind network port?
- Rules for non-file operations somewhat incoherent
- Even some file rules weird (Creating hard links to files)
• Correct code is much harder to write than incorrect
- Delete file without traversing symbolic link
- Read SSH configuration file (requires 3 processes??)
- Write mailbox owned by user in dir owned by root/mail
• Don’t just blame the application writers
- Must also blame the interfaces they program to
Another security problem [Hardy]
• Setting: A multi-user time sharing system
- This time it’s not Unix
• Wanted fortran compiler to keep statistics
- Modified compiler /sysx/fort to record stats in /sysx/stat
- Gave compiler “home files license”, which allows writing to
anything in /sysx (kind of like Unix setuid)
• What’s wrong here?
A confused deputy
• Attacker could overwrite any files in /sysx
- System billing records kept in /sysx/bill got wiped
- Probably command like fort -o /sysx/bill file.f
• Is this a compiler bug?
- Original implementors did not anticipate extra rights
- Can’t blame them for unchecked output file
• Compiler is a “confused deputy”
- Inherits privileges from invoking user (e.g., read file.f)
- Also inherits from home files license
- Which master is it serving on any given system call?
- OS doesn’t know if it just sees open ("/sysx/bill", ...)
Capabilities
• Slicing matrix along rows yields capabilities
- E.g., For each process, store a list of objects it can access
- Process explicitly invokes particular capabilities
• Can help avoid confused deputy problem
- E.g., Must give compiler an argument that both specifies the
output file and conveys the capability to write the file
(think about passing a file descriptor, not a file name)
- So compiler uses no ambient authority to write file
• Three general approaches to capabilities:
- Hardware enforced (Tagged architectures like M-machine)
- Kernel-enforced (Hydra, KeyKOS)
- Self-authenticating capabilities (like Amoeba)
Hydra
• Machine & programming env. built at CMU in ’70s
• OS enforced object modularity with capabilities
- Could only call object methods with a capability
• Amplification let methods manipulate objects
- A method executes with the capability list of the object, not the
caller
• Template methods take capabilities from caller
- So method can access objects specified by caller
KeyKOS
• Capability system developed in the early 1980s
• Goal: Extreme security, reliability, and availability
• Structured as a “nanokernel”
- Kernel proper only 20,000 lines of C, 100KB footprint
- Avoids many problems with traditional kernels
- Traditional OS interfaces implemented outside the kernel
(including binary compatibility with existing OSes)
• Basic idea: No privileges other than capabilities
- Means kernel provides purely objective security mechanism
- As objective as pointers to objects in OO languages
- In fact, partition system into many processes akin to objects
Unique features of KeyKOS
• Single-level store
- Everything is persistent: memory, processes, . . .
- System periodically checkpoints its entire state
- After power outage, everything comes back up as it was
(may just lose the last few characters you typed)
• “Stateless” kernel design only caches information
- All kernel state reconstructible from persistent data
• Simplifies kernel and makes it more robust
- Kernel never runs out of space in memory allocation
- No message queues, etc. in kernel
- Run out of memory? Just checkpoint system
KeyKOS capabilities
• Referred to as “keys” for short
• Types of keys:
- devices – Low-level hardware access
- pages – Persistent page of memory (can be mapped)
- nodes – Container for 16 capabilities
- segments – Pages & segments glued together with nodes
- meters – right to consume CPU time
- domains – a thread context
• Anyone possessing a key can grant it to others
- But creating a key is a privileged operation
- E.g., requires “prime meter” to divide it into submeters
Capability details
• Each domain has a number of key “slots”:
- 16 general-purpose key slots
- address slot – contains segment with process VM
- meter slot – contains key for CPU time
- keeper slot – contains key for exceptions
• Segments also have an associated keeper
- Process that gets invoked on invalid reference
• Meter keeper (allows creative scheduling policies)
• Calls generate return key for calling domain
- (Not required–other forms of message don’t do this)
KeyNIX: UNIX on KeyKOS
• “One kernel per process” architecture
- Hard to crash kernel
- Even harder to crash system
• A process’s kernel is its keeper
- Unmodified Unix binary makes Unix syscall
- Invalid KeyKOS syscall, transfers control to Unix keeper
• Of course, kernels need to share state
- Use shared segment for process and file tables
KeyNIX overview
[Figure: KeyNIX architecture diagram, not reproduced.]
KeyNIX I/O
• Every file is a different process
- Elegant, and fault isolated
- Small files can live in a node, not a segment
- Makes the namei() function very expensive
• Pipes require queues
- This turned out to be complicated and inefficient
- Interaction with signals complicated
• Other OS features perform very well, though
- E.g., fork is six times faster than Mach 2.5
Self-authenticating capabilities
• Every access must be accompanied by a capability
- For each object, OS stores random check value
- Capability is: {Object, Rights, MAC(check, Rights)}
• OS gives processes capabilities
- Process creating resource gets full access rights
- Can ask OS to generate capability with restricted rights
• Makes sharing very easy in distributed systems
• To revoke rights, must change check value
- Need some way for everyone else to reacquire capabilities
• Hard to control propagation
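A toy version of such a capability (FNV-1a standing in for a real MAC, and all names ours). Holding a valid MAC is the proof of authority; restricting rights without the secret check value is impossible because the MAC covers them:

```c
/* Toy self-authenticating capability: {object, rights, MAC(check, rights)}.
 * FNV-1a is illustrative only, NOT a cryptographic MAC. */
struct cap { unsigned obj, rights; unsigned long tag; };

static unsigned long cap_mac(unsigned long check, unsigned rights)
{
    unsigned long v = 1469598103934665603UL ^ check;
    v *= 1099511628211UL;
    v ^= rights;
    v *= 1099511628211UL;
    return v;
}

/* Server side: mint a capability for an object using the object's
 * secret check value. */
struct cap mint(unsigned obj, unsigned rights, unsigned long check)
{
    struct cap c = { obj, rights, cap_mac(check, rights) };
    return c;
}

/* Verify a presented capability: the tag must match (not forged) and
 * its rights must cover the requested operation. */
int verify(struct cap c, unsigned long check, unsigned want)
{
    return c.tag == cap_mac(check, c.rights)
        && (c.rights & want) == want;
}
```

Revocation is the weak point the slide mentions: the server can only change the check value, which invalidates every outstanding capability for the object at once.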
Amoeba
• A distributed OS, based on capabilities of form:
- server port, object ID, rights, check
• Any server can listen on any machine
- Server port is hash of secret
- Kernel won’t let you listen if you don’t know secret
• Many types of object have capabilities
- files, directories, processes, devices, servers (E.g., X windows)
• Separate file and directory servers
- Can implement your own file server, or store other object types
in directories, which is cool
• Check is like a secret password for the object
- Server records check value for capabilities w. all rights
- Restricted capability’s check is hash of old check, rights
Limitations of capabilities
• IPC performance a losing battle with CPU makers
- CPUs optimized for “common” code, not context switches
- Capability systems usually involve many IPCs
• Capability programming model never took off
- Requires changes throughout application software
- Call capabilities “file descriptors” or “Java pointers” and
people will use them
- But discipline of pure capability system challenging so far
- People sometimes quip that capabilities are an OS concept of
the future and always will be