Operating Systems Design
Réseaux et Systèmes Avancés (RSA)

Martin Quinson <[email protected]>
École Supérieure d'Informatique et Applications de Lorraine – 2nd year
2008-2009
Module Presentation

Module Focus
- Study of Operating Systems (OS)
- Focus on Design, not Usage
  (System Programming → RS module)

Module Prerequisites
- C language
- Unix usage (shell usage, basic commands)
- System Programming (forking, process control, shell programming)

Module Objectives
- Understand the challenges to solve when writing an OS
- Know and understand the main components of an OS
- Know and understand the considerations behind their design
- Be able to compare solutions to these classical challenges
- Know how they are solved under Unix

Martin Quinson RSA (2008-2009) Module Presentation (2/355)
Practical Information

Module split in two parts

Part on Systems
- Lecturer: Martin Quinson
- 6 lectures, 4 TD (on-table labs), 2 TP (practical labs)
- Exam on 6/3/2009

Part on Networks
- Lecturer: Isabelle Chrisment
- 6 lectures (one is wrongly scheduled next week), 3 TD, 3 TP
- Exam on 20/4/2009
Bibliography (for this part only)

Books
- Silberschatz, Peterson, Galvin: Operating System Concepts (7th edition)
  Good introduction to the concepts
- Tanenbaum, Woodhull: Operating Systems: Design and Implementation
  The Minix book; Minix is one of the rare pedagogical operating systems
- Lamiroy, Najman, Talbot: Systèmes d'exploitation
  In French; not only Unix but also a bit of Windows
- Leffler et al.: The Design and Implementation of the 4.3BSD UNIX Operating System
  Dissection of a classical version of Unix. Somewhat dated, but instructive.

Courses available on the Internet
- Introduction aux systèmes et aux réseaux (S. Krakowiak, Grenoble – in French)
  http://sardes.inrialpes.fr/~krakowia/Enseignement/L3/SR-L3.html/
- Operating Systems and Systems Programming (M. Rosenblum, Stanford)
  http://www.stanford.edu/class/cs140/

URL of this course
http://www.loria.fr/~quinson/teach-RSA.html (empty for now, but . . . )
Agenda of this Course

Operating System Design and Advanced Usage
1. Introduction: What is an OS, Computer Architecture, Main Components, Recurring Themes
2. Process Handling: Process creation (implementing fork), Scheduling (goals, algorithms, real cases)
3. Memory Subsystem: Goals, Paging, Segmentation, Real Cases, Thrashing, User-level Management
4. Input/Output Subsystem (disks): Main Concepts, Implementation, Performance Concerns, Security
5. Security: Protection, Security
First Chapter: Introduction

Computer Architecture
- How Modern Computers Work
- Executing Programs
- Storage Hierarchy
- Data Movements
- Current Trend: Multi-processors / Multi-cores

Operating System Introduction
- What is an Operating System?
- Roles and Subsystems
- Protection
- Recurring Themes for OS

Operating System Design

Case of Linux
What is an Operating System?

Software between Applications and Reality
- Shields Applications from Hardware complexity: makes them portable
- Shields Applications from Hardware limitations: makes the finite into the (near) infinite
- Shields Hardware from Applications: provides protection

...and these are difficult goals.

[Figure: applications (Firefox, Emacs, KDE) sit on top of the Operating System, which sits on top of the Hardware (disks, graphic cards, sound)]
Computer Architecture Basics

What is a Computer anyway?

Von Neumann Model
- No separation of code and data in memory
- Revolutionary back in 1945
- A bit outdated by now

[Figure: Von Neumann model – Input and Output connected to a Control Unit, an Arithmetic & Logic Unit with its Accumulator, and Memory]

Modern Computer Systems
- Control & Computation merged into the CPU (Central Processing Unit)
- Elements communicate through a bus (data transfer facility)
- Memory is not uniform:
  - Registers within the CPU
  - Caches close to the elements (to avoid the bus's cost when possible)
  - Speeds and capacities differ greatly

[Figure: CPU with its registers, main memory, graphics controller, USB and disk controllers, each with its own cache, all connected to a common bus]
Executing Programs

Main CPU loop
1. Get the address of the next instruction to execute
   (address stored in a specific register: the Instruction Pointer, noted %eip on x86)
2. Fetch the instruction through the bus: opcode | options | parameters
   - opcode: operation code, identifies the instruction kind
   - options: set of flags configuring the instruction
   - parameters: some operands (register, address, value)
3. Run the instruction, and increment the instruction pointer
   (unless the instruction changes the IP, such as branching or function call/return)

Examples of instruction semantics with addl (adds two integers):

addl %edx, (%eax)  ; adds the content of %edx to the value stored at address %eax,
                   ; and stores the result at address %eax
addl %eax, %edx    ; adds the content of %eax to the content of %edx,
                   ; and stores the result in register %edx
addl $10, (%eax)   ; adds 10 to the value pointed to by %eax,
                   ; and stores the result back at that address
; Option flags are used to specify the semantics of the operands
Storage Hierarchy

Memory is not uniform, but hierarchical
- Huge differences between the kinds of memory, in terms of speed, size, price, etc.
- New technologies introduced recently (non-volatile main memory, flash disks)

Hierarchy, from fastest to slowest: Registers → Cache → Main Memory → Electronic disk

Level            Registers    Cache      Main Memory   Disk storage
Typical size     < 1 KB       few MB     few GB        100s of GB
Access time (ns) 0.25 - 0.5   0.5 - 25   80 - 250      5,000,000
Bandwidth (MB/s) 20k - 100k   5k - 10k   1k - 5k       20 - 150
Volatile?        Yes          Yes        Yes           No
Managed by       Compiler     Hardware   OS            OS
Backed by        Cache        Main Mem.  Disk          CD or tape

The network may be seen as a 5th level (or more)
- But the variety of networked technologies complicates the picture
Buses

Allow data movement between computer components.

[Figure: a bus board, with connectors to plug cards and links for information transport]

Bus classifications
- Synchronous (fast, but every component must run at the same pace) or Asynchronous
- Classification depending on what they interconnect:
  - Processor bus: within a chip, between its elements
  - Memory bus: between CPU and main memory (synchronous, for performance)
  - I/O bus: connects devices to main memory (asynchronous, for portability)
- On a bus, each link is specialized depending on what it conveys:
  - Address link: conveys the address of the data to transfer
  - Data link: conveys the actual data
  - Control link: used to synchronize operations and the like

[Figure: CPU and Memory connected by Address, Data and Control links]
Computer Architecture History

"Archaic" design

[Figure: CPU with its cache on a CPU–Memory bus to main memory; a bus adapter connects the I/O bus, where I/O controllers drive the screen, disk and network]

Current design

[Figure: CPU with its cache connected to a North Bridge (main memory, screen) and a South Bridge (disk, network, and controllers for printer, scanner, keyboard and mouse)]
Speaking with the Devices

What are they?
- Devices are all the input/output elements in the computer
- Hard disk, network, keyboard, mouse, digital camera, etc.

Problems
- The OS needs to handle the data movements between CPU and devices
- Devices are slow compared to the CPU (getting data from disk: ≈ 5 ms → 200 Hz)
- Devices can produce data asynchronously (keyboard, mouse, network)

First solution: Polling
- Ask for new data regularly (but wastes resources, and response time is suboptimal)

Used solution: Interrupts
- Asynchronous communication: devices interrupt the CPU to start a handler
- Similar to signals between processes, but from devices to CPU
Interrupt Handling in the OS

The big lines
1. A device is ready to send data: it sends an Interrupt ReQuest (IRQ) to the CPU
   through a specific control bus, via the Programmable Interrupt Controller (PIC)
2. After the current instruction, the CPU reads the IRQ (a number),
   notifies the controller to release it,
   and retrieves the corresponding Interrupt Handler function from the interrupt vector table
3. The current context (registers + instruction pointer) is saved, and the handler executes
4. The context is restored, and the previous activity resumes

Notes
- This behavior is hardwired in the CPU, out of the control of programs
- Interrupts can be temporarily masked (like signals); their handling is then deferred
- Installing new handlers and masking interrupts require specific privileges
- Check cat /proc/interrupts to see your mapping under Linux
How a Modern Computer Works (summary)

[Figure: summary diagram of a modern computer]
Computer Architecture Current Trend: Multi-*

Motivation: endless need for more computing power
- Modeling and simulating natural phenomena (genes, meteorology, finance)
- Gaming realism
- Web servers handling thousands of hits per second

Past solution
- Increase the clock speed, put in more electronic gates
- We are reaching the physical limits

Current and future solution
- Multiply cores, processors and machines
- Systems become far more complex to use efficiently
  → The OS needs to evolve to help
Multi-Processors

Shared Memory Processor (SMP)

[Figure: several CPUs (C) connected to one shared memory]

Cluster System

[Figure: several full systems, each with its CPUs (C) and memories (M), connected by a local network]

Distributed Systems

[Figure: several full systems, each with its CPUs (C) and memories (M), connected through the Internet]

- SMPs communicate through shared memory
- Clusters and Distributed Systems communicate through a classical network (and are thus out of scope here)
UMA (Uniform Memory Access)

[Figure: three designs – classical UMA (CPUs and shared memory on a bus), UMA with a cache per CPU, and advanced UMA adding a private memory to each CPU besides the caches]

- Every processor accesses the memory at the same speed
- But memory is too slow in the classical design, hence the addition of a cache
- One can go further by adding a private memory to each processor
Implementing UMA: Crossbar Switch

- Non-blocking network: several memory accesses are possible in parallel
NUMA: Non-Uniform Memory Access

- Biggest challenge: feeding the CPU with data (memory is slower than the CPU)
- Idea: put several CPUs per card, and plug the cards into a mainboard

[Figure: three cards, each with several CPUs (with caches) and a shared memory, plugged into the mainboard's memory network along with the disks]

Issues
- Memory access is non-uniform (slower when far away)
  → a specific programming approach is needed to stay efficient
- Cache consistency can turn into a nightmare
Multi-core: Parallelism on Chip

- Idea: reduce the distance between elements (and thus the latency)
- How: put several computing elements on the same chip

AMD/Intel dual-core chips

[Figure: two computing cores, each with its own L1 cache, sharing an L2 cache]

Cell Processor

[Figure: a 64-bit PowerPC core (the Power Processor Element, PPE) and 8 SPEs connected by the EIB (Element Interconnect Bus), plus memory and I/O controllers to the RAM – (c) Nicolas Blachford 2005]

Current trend
- Put more and more cores on the chip
- Even put non-symmetric cores: the PPE is a classical RISC core, the SPEs are SIMD
Computer Architecture Future

Put more and more cores on chips
- Intel Research produced an 80-core chip (delivering 1 Tflop)
- Complete Cluster-on-Chip designs are envisioned to come soon

Increase the architecture hierarchy even further
- Researchers build NUMAs of Cells, or Clusters of Cells

Change the paradigm
- GPUs have several memory caches, with differing performance
- Flash disks are radically different from classical hard disks,
  and other disk technologies are on the radar
- Embedded Systems and Sensor Networks radically change the goals

The Operating System must deal with this complexity
- Computer Architecture is a very active research area, led by industry
- Operating Systems are thus also an active research area
  (this is all a bit out of scope, but you need to understand the underlying complexity)
History of Operating Systems

Step 0: the OS as a standard library
- One machine, one user, one piece of software
- Still used in embedded systems
- The OS is simple (but the applications are complex)

[Figure: one Application on top of the OS, on top of the Hardware]

Step 1: multiple programs
- The previous step is inefficient: when the process blocks, the machine is wasted
- Hack: allow more than one process, and switch when one blocks
- Problems: what about infinite loops, or random writes in memory?
- OS's protection: Interposition, Privileges, Preemption

[Figure: gcc and emacs on top of the OS, on top of the Hardware]

Step 2: multiple users
- A simple OS is expensive: one machine per user
- Hack: allow more than one user at the same time
- Problems: what if users are gluttons, evil, or too numerous?
- OS's protection: Authentication, Rights Management

[Figure: Jim's and Bob's gcc and emacs on top of the OS, on top of the Hardware]
Roles of an Operating System

Roles
- Starts up the computer at boot time, shuts it down at the end
- Passive role: offers functions that the applications may call (API)
  - Access to devices (display, storing data to disks), starting new processes, etc.
- Active role: interposition when an application requests to use a resource
  - Process Scheduling, Virtual Memory, etc. (not in step 0 of the previous slide)

System Calls (syscalls): functions callable by applications to request a service from the OS
Kernel: the system part playing the active role, and implementing the system calls
Command Interface: textual (shell) or graphical (mouse) – regular applications using the API
Firmware: software running on the device controllers

[Figure: Applications, system tools and the command interface sit on the System Calls API; below is the Kernel (the Operating System), then the Firmware, then the Hardware]
Main OS Sub-Systems

Process Handling
- Process creation (fork, exec) and termination (wait, waitpid)
- Suspend, resume (sleep, pause)
- IPC (signals, pipes, semaphores, shared memory, etc.)

Memory Handling
- Motivations:
  - Memory is the only storage directly accessible from the CPU
    ⇒ applications must be loaded in memory to run
  - Applications must be protected from each other ⇒ bulletproof partitioning
- The OS knows which memory zone is leased, and to whom
- It allocates memory, and takes it back, on demand

I/O Handling
- Controls every device (through the controllers)
- Unifies the device ↔ OS interface (portability)

Other Sub-Systems
- File System: stable storage (naming, robustness – cf. RS module)
- Networking: communicating with other machines (cf. second half of this module)
Protection

Motivation
- An OS has to protect some resources:
  - Hardware: memory, CPU time, devices (fair sharing; no hardware misuse)
  - Software: data on disk, in memory, elsewhere (privacy, access management)
- Particularly true for multi-user OSes

Hardware-aided protection
- Modern CPUs provide at least two execution levels:
  - User mode: not privileged → peasant
  - Privileged mode: privileged → god (also called supervisor, superuser or kernel mode)
- Applications run in user mode; the kernel runs in privileged mode
  (switched on syscalls, or by an interrupt giving control back to the OS)
- Some instructions are said to be privileged: they are only usable in the corresponding mode (I/O)
- User-level code requests privileged operations from the kernel through syscalls

[Figure: a user application runs in User Space (mode bit = 1); calling a syscall or a hardware interrupt switches context into Kernel Space (mode bit = 0), which runs the syscall and then resumes the application's execution]
Protection Examples

I/O protection
- All I/O instructions are privileged
- Every I/O request must transit through the kernel
- (Before that, on MS-DOS on the 80386, a virus could destroy your floppy disk)

Memory protection
- Examples of regions you don't want the user to mess with:
  - The interrupt vector (they could install their own handlers)
  - The authentication tables (they could pretend to be anyone)
  - Other users' data (no confidentiality)
- Hardware-level Memory Management Unit (MMU):
  - Two specific registers, base and limit, bound the area accessible to the application
  - The assembly code changing them is privileged
  - Requesting memory out of the bounds gives control back to the OS
  - The bounds are not effective in kernel mode

CPU time (no infinite loops)
- Regular clock interrupts give control back to the OS
OS Theme #1: Finite Pie, Infinite Demand

How to make the pie go further?
- Key: resource usage is bursty, so give resources to others when idle
- Not new: rather than one classroom, instructor or restaurant per person, we share

But more utilization = more complexity
- How to manage? (e.g. one road per car vs. a freeway)
  → abstraction (lanes), synchronization (traffic lights), capacity increase (build more)
- What happens when the illusion breaks? (resource really exhausted)
  Refuse service (busy signal), give up (VM swapping), back off and retry (TCP/IP), break down (freeway)

How to share the pie?
- Ask the users? Yeah, right.
- Usually: monitor usage, and attempt to be fair by re-apportioning

How to handle pigs?
- Quotas (disk), ejection (swap), buying more resources, breaking down (network), laws (road)
- It is hard to distinguish responsibly busy programs from stupidly selfish pigs
OS Theme #2: Performance

Trick #1: Exploit bursty applications
- Take resources from the idle guy and give them to the busy one; both are happy

Trick #2: Exploit skew
- 80% of the time is spent in 20% of the code
- 90% of the memory accesses touch only 10% of the total
- The idea of caches:
  - Put 10% of the memory in fast, expensive memory, and the rest in slow, cheap memory
  - The whole looks like one big, fast memory

Trick #3: Exploit history
- The past predicts the future (because future = past)
- What is the best cache entry to evict? If future = past, the least recently used one
- Works all the time (weather forecast, stock market, etc.)
Operating System Design

Introduction
- There is no "perfect" solution, but some approaches have proven successful
- The internal structures of different OSes vary widely

Goals
- User goals: easy to use and learn, reliable, safe, fast
- System goals: easy to design, implement and maintain; flexible, reliable, error-free, and efficient

Policy and Mechanism
- Classical Software Engineering consideration:
  separate what will be done (the policy) from how it is done (the mechanism)
- Allows maximum flexibility, and portability across implementations
Simple Structure: MS-DOS

Main design goal
- Stuff more functionality into 640 KB

Implications
- Not well structured
- Layers are bypassed when needed
- Hard to maintain, and to code for
Layered Operating System

Similar to TCP/IP or OSI
- Build your OS as a stack of layers
- Layer 0 is the Hardware, the highest layer is the UI
- Layer N only uses the services of layer N-1

Example: traditional UNIX
Monolithic Operating Systems

Definition
- Every function of the OS is in one big binary
  (processes, memory, IPC, file systems, network stack, device drivers)
- Everything runs in kernel mode

Benefits
- Easier to design and implement
- Better performance¹

Drawbacks
- Ever-growing code base (as drivers are added)
- Memory waste (even unused elements are loaded)
- Hard to maintain (multiple interactions)
- Security not enforced (a bug in one driver → system crash)

¹ This point is commonly accepted, but has very strong opponents.
Micro-kernel Operating Systems

Idea: move all you can to user space
- Only low-level address space and thread management remain in the kernel, plus IPC
- Scheduling, Virtual Memory mapping, File Systems, Drivers, etc. run as daemons

[Figure: in a monolithic system, application syscalls are trapped by a kernel containing the scheduler, file system, VM and drivers; in a micro-kernel, the kernel only keeps IPC, threads and low-level memory, while the file system, scheduler, drivers and VM run as user-space daemons reached through IPC]
Do Micro-kernels Suck?

It is a neat idea
- A micro-kernel is a few dozen kilobytes; Linux is a few hundred megabytes
- A small code base is easier to trust (kernel-mode bugs are disasters)
- It is easier to optimize (for example on ARM, where the MMU is hard to deal with)

Why didn't it work yet?
- The first implementation was... not a technical success (Mach 1)
- The idea spread that IPC times between daemons must be a performance killer:
  more IPCs instead of function calls, plus context switches for each IPC...
- But recent micro-kernels prove this wrong:
  L4 has a 4-5% performance overhead on most benchmarks

Some examples
- L4 (Wombat, Darbat, ...), GNU/HURD, Minix
- Mac OS X (cheater!), QNX
- Still waiting for the big day
Modular Operating Systems

Definition
- Everything runs in kernel space, but parts are loaded on demand
- The elements are well partitioned, and communicate through interfaces

Goal: some advantages of micro-kernels, without the performance loss
- The code is modular, so Software Engineers are happy
- We still have function calls between OS components, instead of IPC

Almost every modern OS is architected this way.
Virtual Machines

Virtualization idea
- Push the layered approach to its extreme:
  Hardware + (host) OS = some kind of hardware
- Guest OSes (running on top) have the illusion of running on real hardware
- The host OS is in charge of sharing the real resources between the several guest OSes
  (first implemented by IBM in 1972 on mainframes)

Para-virtualization idea
- Quite the same, but the guest OS is not presented exactly the same interface as the real hardware
- It thus needs to be modified, but the result proves faster
- The host OS is then called a Hypervisor
Architecture of the Unix Kernel

Moufida Maimour Systemes d'exploitation II (06/07) (47/216)
Kernel architecture: descriptive approach

The kernel is made of 3 big parts:
- the system call interface, between the user programs and the kernel
- the process management subsystem:
  - process management: creation, termination, suspension, synchronization and communication
  - scheduling: handles time sharing and priorities
  - memory management: handles the sharing of objects, inter-process protection, and swapping or paging
- the file management subsystem:
  - buffer cache management: handles the allocation of I/O buffers
  - file management: handles protection, disk space allocation, and file naming
  - device management: handles character-mode and block-mode files, and access to the devices, including the network
Kernel architecture: functional approach

The UNIX kernel is split into 2 big parts, which cooperate to share the system resources and to implement some services:

Upper part: provides services to the user processes, in response to system calls and exceptions
- Synchronous execution in kernel mode, to be able to access both the kernel data structures and the contexts of the user processes

Lower part: a set of subroutines invoked to handle hardware interrupts
- Activities happening asynchronously, executing in kernel mode
Invoking System Services

Hardware interrupts and exceptions
- An interrupt is caused by a signal coming from the world outside the processor, and modifies its behavior. The goal is to warn the processor that an external event occurred:
  - end of an I/O, clock tick, ...
  - On the 80x86: vectors 32-238. Linux uses vector 128 (0x80) for system calls.
- An exception is a signal caused by a malfunction of the currently running program:
  - division by zero, page fault, ...
  - On the 80x86: 20 different exceptions, 0..19. The values 20 to 31 are reserved by Intel for the future.
- Each interrupt or exception has a subroutine (handler) in charge of the corresponding event: the interrupt vector table, or IDT (Interrupt Descriptor Table) in Linux parlance.
Invoking System Services

Handling an exception or an interrupt
1. The interrupt/exception arrives
2. The current context (PC, ...) is saved, using the kernel stack
3. The interrupt vector table is consulted to find the address of the interrupt's subroutine (the handler), and the PC is loaded with that address
4. The subroutine executes in kernel mode
5. The old context is restored, and the old program resumes in user mode
Invoking System Services

System calls
- The interface between the OS and the user programs is defined by the set of system calls the OS provides.
- A system call can be seen as a call to a classical function, performed in kernel mode.
- A system call is generally implemented with a software interrupt, trapping to a specific entry of the interrupt vector table.
- A software interrupt is triggered by a program, using a special instruction (trap, syscall).
- There is no process switch (no preemption).
- The handler executes using the resources of the interrupted process's context (its kernel stack).
- The information needed by the request can be passed through registers, the stack, or memory.
Invoking System Services

Implementation of system calls

[Figure: system call implementation diagram]
Invoking System Services: the standard C library
- The code of a system call is often in assembly, but a C library function wrapping it is usually provided.
Invoking System Services: system call example, read

count = read(df, tampon, nbOctets)

[Figure: the path of a read() call between user space and kernel space: the user program pushes nbOctets, &tampon and df onto the stack, then calls the read() library function; the library function places the code of the read system call in a register and traps into the kernel; the kernel branches to the system call's code; the handler then returns to the library function, which returns to the caller, which increments SP to clean the stack]
Invoking System Services

Exception handling under Linux
- Most exceptions issued by the CPU are interpreted by Linux as error cases
- When an exception occurs, the kernel sends a signal to the process that caused it
- Example: division by zero → the SIGFPE signal is sent
- An exception handler:
  1. Saves the contents of most registers onto the kernel stack (assembly)
  2. Handles the exception (C function)
  3. Leaves the handler by invoking the ret_from_exception function
Invoking System Services

Interrupt handling under Linux
- Difference with exceptions: a signal cannot be sent to the current process ⇒ different handling
- Kinds: timer interrupts, inter-processor interrupts, I/O interrupts
- An interrupt handler:
  1. Saves the IRQ value and the register contents onto the kernel stack
  2. Sends an ACK to the PIC, allowing it to process further interrupts
  3. Executes the Interrupt Service Routines (ISR) associated with the devices sharing the IRQ line
  4. Invokes the ret_from_intr() function
Invoking System Services

System calls under Linux

[Figure: an application program calls xyz(); the libc wrapper routine xyz() executes a SYSCALL instruction, switching from user mode to kernel mode; the kernel's system_call handler dispatches to the system call service routine sys_xyz(), then returns to user mode with SYSEXIT]
Second Chapter: Process Handling

Introduction

Process Implementation
- Process Memory Layout
- Process Control Block

Process Scheduling: Theoretical Concepts
- Context Switching
- OS Scheduling Infrastructure
- Scheduling Algorithms

Scheduling in Real OSes
- UNIX: Solaris, HP-UX, 4.4BSD, Linux 2.6
- Windows XP

Process Creation
- UNIX
- Windows
Introduction to Processes

What is a Process?
- Fact: the computer has to deal with a variety of programs
  - jobs on batch systems, user or system programs on time-shared systems
  - "job" and "task" are used interchangeably in the following
- Process: a dynamic entity executing a program on a processor
- OS point of view:
  - Program counter and stack: the active part, doing stuff (thread)
  - Address space (memory protection): the passive part, the thread's environment
  - Internal state (open files, etc.): the environment on the OS side

Process != Program
- Program: code + data (passive)

  int i;
  int main() {
    printf("Salut\n");
  }

- Process: the program running
  (the code of main(), the data int i, plus a heap and a stack)

- Even if you use the same program as me, it won't be the same process
Why Processes?

To deal with complexity
- Allow activities to coexist simply:
  each one lives in a separate box and only deals with the OS,
  and the OS handles all of them uniformly

For efficiency
- When a process blocks, execute the next one

[Figure: without overlap, gcc waits while emacs is blocked waiting for the user, and time is wasted; with overlap, gcc runs while emacs waits, and that time is saved]
Process Memory Layout

Process point of view (big picture)
- From low to high addresses: Code, Global data, Heap (cf. malloc),
  a hole in the addressing space, the execution Stack, then a reserved area
- The addressing space spans MAXINT bytes (4 GB / 4 TB)

UNIX details
- From 0x00000000 to 0xefffffff: Code, Constants, Globals, Heap (cf. malloc),
  Dynamic Libraries, then the Stack (one frame per function call),
  with holes in the addressing space in between; the Kernel area is protected
- The sections of the program binary (Text, Data, BSS) map onto the process
  segments (Text Segment, Data Segment, Stack Segment)

OS point of view
- User Mode: access only to the private addressing space
- Kernel Mode: idem, plus
  - the protected address space
  - its own kernel-mode stack for the calls made in kernel mode
    (one per process, for reentrance)

Remarks
- The memory of each process is isolated → protection
- Code is shared between processes (done automatically by the OS)
- The data segment is not shared (unless you use shm & mmap)
- Threads share everything (but the stack)
Process Control Block (PCB)

Information associated with each process:
- Process state (running, ready, blocked, etc.)
- Program counter
- CPU registers
- CPU scheduling information
- Memory-management information
- Accounting information
- I/O status information
- ...

[Figure: a PCB containing the process state, the process ID, the program counter, the registers, the memory limits, and the list of open files]
PCB Data Structures
PCB classically split in two parts
I Memory was very expensive back in the day
; reduce size of resident areas
Process tableI Always in memory
I Contains info on every process(even swapped ones)
I What’s needed for scheduling(amongst other)
User StructureI Part of process virtual memory
(can be swapped away)
I What’s needed when process active
[Figure: the process structure lives in the process table, which always resides in memory; the user structure, kernel stack, data, stack and text live in the process's virtual memory and can be swapped out]
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (46/355)
PCB in 4.4BSD (partial view)
User StructureI Execution state:
general registers, SP, PC
I Pointer to entry in process table
I Information on syscall currently run
I Open file descriptors
I Current directory
I Accounting information
  I Time spent in user/kernel modes
  I Limits (CPU time, memory, . . . )
  I Maximal stack size
I Kernel stack of this process
Process StructureI Identification: PID, PPID, UID
I Scheduling: priority, blocked time
I Memory: pointer to pages table
I Synchro: blocking event description
I Signals: pending ones, handlers
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (47/355)
Second Chapter
Process Handling
Introduction
Process Implementation
  Process Memory Layout
  Process Control Block
Process Scheduling: Theoretical Concepts
  Context Switching
  OS Scheduling Infrastructure
  Scheduling Algorithms
Scheduling in Real OSes
  UNIX: Solaris, HP-UX, 4.4BSD, Linux 2.6
  Windows XP
Process Creation
  UNIX
  Windows
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (48/355)
Process States
Existing states
I new: just created
I running: instructions get executed
I waiting: blocked, waiting some event to occur
I ready: waiting to be assigned some processor
I terminated: finished execution
Transition diagram
I admitted: new ; ready
I scheduled: ready ; running
I interrupt: running ; ready
I I/O or event wait: running ; waiting
I event completion: waiting ; ready
I exit: running ; terminated
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (49/355)
Context Switching
Process ContextI User Context: Stack, Data and Text segments
I Hardware Context: CPU registers and pointers
I System Context: User Structure part of PCB (process structure, kernel stack)
Context Switching
I Needed to change running process (interrupt, I/O request, etc)
I Save one process’s context and restore the one of another
I Synchronous causes
  I Explicit: call to sleep()
  I Implicit: time elapsed, I/O request
I Asynchronous causes
  I For example, a hardware interrupt
I All this is overhead: keep it fast (timing is hardware-dependent)
[Diagram: P0 is running when a syscall, interrupt or trap occurs; the OS saves its state in PCB0 and restores the state of P1 from PCB1; P1 runs while P0 is inactive, until the reverse switch saves P1's state in PCB1 and restores P0's state from PCB0]
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (50/355)
Deciding which Process gets Scheduled
First Ideas
I Scan the process table for the first runnable
  I Expensive, weird priority (small pids get more)
    At least separate runnable and blocked threads!
I FIFO? (put threads on the back of the list, pull them off the front)
  (some toy OSes do so)
I Priority? (give some threads more chances to get the CPU)

Scheduling Challenges
I Fairness: don't starve processes
I Prioritize: more important first
I Deadline: must be finished before 'x' (car brakes, music & voice)
I Optimizations: some schedules are way faster than others

No Optimal Policy
I Many variables, can't optimize them all (multi-objective optimization)
I Conflicting goals:
  I I want to finish soonish, who cares about you?
  I Less important jobs should not completely starve
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (51/355)
OS Scheduling Infrastructure
QueuesI Processes placed in several queues depending on their state
I Job Queue: all jobs in the system
I Ready Queue: jobs in main memory, ready and waiting
I Device Queue: jobs waiting for an I/O device
I Processes migrate among the different queues
Big Picture
[Figure: the ready queue and the device queues (disk 0, terminal, tape 0), each a head/tail linked list of PCBs (PCB1..PCB7)]
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (52/355)
OS Scheduling Queues
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (53/355)
OS Schedulers
Short Term Scheduler
I Decides which job from the ready queue gets scheduled
I Runs often (ms) ; must be fast

Long Term Scheduler
I Decides which jobs get into the ready queue
I Runs less often (second to minute) ; can be slow
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (54/355)
Process Scheduling
Process Categorization
I CPU-bound: only uses the CPU (would go faster with a bigger CPU)
I I/O-bound: limited by I/O speed (would go faster with faster disks/memory)

Remarks
I Very few processes are CPU-bound for a long time
I In real code, the same program is alternately CPU-bound and I/O-bound

Usage Bursts
I CPU burst = code section being CPU-bound (same for I/O)
I Improving scheduling requires understanding the distribution of bursts
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (55/355)
CPU Bursts Distribution
I Interactive systems ; shorter CPU bursts
I Scientific code ; (very) long CPU bursts (CPU burners)
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (56/355)
Scheduling Criteria
Scheduling Goal
I User perspective: Reduce completion time
I Owner perspective: Maximize resource utilization
Criteria
I CPU utilization: keep the CPU busy ; max
I Throughput: number of jobs completed per unit of time ; max
I Turnaround time: makespan of a particular job ; min
I Waiting time: amount of time a job waited in the ready state ; min
I Response time: time between submission and first action (time-shared) ; min

A whole load of algorithms exist
I Some are simple (silly?)
I Some are clever, specifically designed to improve one criterion
I Impossible to satisfy all criteria at the same time
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (57/355)
First Come First Served (FCFS) Algorithm
I Simply implemented with a linked list
Workload 1
Process  Burst time  Arrival  Waiting Time
P1       24          0        0
P2        3          1        24
P3        3          2        28
Gantt chart: | P1 | P2 | P3 |
             0    24   27   30
Average Waiting Time: 17.3

Workload 2
Process  Burst time  Arrival  Waiting Time
P1       24          2        10
P2        3          0        0
P3        3          1        5
Gantt chart: | P2 | P3 | P1 |
             0    3    6    30
Average Waiting Time: 3.3

I This effect is called the Convoy Effect (short jobs placed after long ones suffer)
I This is not adapted to interactive systems (I/O-bound jobs suffer)
I What about prioritizing short jobs?
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (58/355)
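The computation behind these averages can be sketched in a few lines of C (a sketch, not course material: the function name is ours, and for simplicity all jobs are assumed to arrive at t = 0, which makes the averages come out slightly different from the tables above):

```c
#include <assert.h>

/* FCFS: a job's waiting time is the sum of the burst times of the
   jobs queued before it (all jobs assumed to arrive at t = 0) */
static double fcfs_avg_wait(const int burst[], int n) {
    int elapsed = 0, total_wait = 0;
    for (int i = 0; i < n; i++) {
        total_wait += elapsed;   /* job i waited this long */
        elapsed += burst[i];     /* then runs for its whole burst */
    }
    return (double)total_wait / n;
}
```

Running the long job first (bursts 24, 3, 3) gives an average wait of 17; short jobs first (3, 3, 24) gives 3: the convoy effect in two calls.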
Shortest Job First (SJF) Algorithm

Process  Burst time  Waiting Time
P1       6           3
P2       8           16
P3       7           9
P4       3           0
Gantt chart: | P4 | P1 | P3 | P2 |
             0    3    9    16   24
Average Waiting Time: 7

SJF is as optimal as unrealistic
I Impossible to achieve a lower average waiting time (but long jobs suffer)
I But how to know the burst time in advance?

Guessing Burst Time
I Use the past to predict the future! (as usual)
I Exponential averaging:
  I tn: actual length of the nth CPU burst
  I τn: guess for the nth CPU burst
  I α: parameter between 0 and 1
  I τn+1 = α tn + (1 − α) τn
I α = 0 =⇒ τn+1 = τn: recent measurements ignored
I α = 1 =⇒ τn+1 = tn: only the last measurement used
I Expanding the recurrence: τn+1 = α tn + α(1 − α) tn−1 + · · · + α(1 − α)^j tn−j + · · ·
  so the coefficient of a measurement decreases with its age
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (59/355)
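The exponential-averaging update is a one-liner; a minimal C sketch (the function name is ours):

```c
#include <assert.h>

/* tau_{n+1} = alpha * t_n + (1 - alpha) * tau_n */
static double next_guess(double alpha, double t_n, double tau_n) {
    return alpha * t_n + (1.0 - alpha) * tau_n;
}
```

With α = 0 the guess never moves; with α = 1 it tracks the last burst exactly; α = 0.5 is a common compromise between the two.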
Round-Robin (RR) Algorithm
Big lines
I Interrupt the process after a while (regardless of whether it's done or not)
I Schedule someone else

Advantages
I No convoy effect: small jobs not blocked forever behind big jobs
I Big jobs do not starve by yielding for small jobs

Picking the right quantum
I Quantum too big ; good throughput, bad interactivity
  [Timeline: P1 (running) and P2 (doing I/O) alternate long slices; the reactivity of P1 is very bad (lags) and the I/O device is underused]
I Quantum too small ; good reactivity, high overhead
  [Timeline: processes continuously interrupted]
I Quantum = ∞ ; FCFS
I Classical value: 10-100 milliseconds
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (60/355)
Scheduling with Priority
Process Priority
I Associate a priority (an integer) to each process
I CPU allocated to ready process with highest priority
I Can be preemptive or not (whether we interrupt a running process before it is done)
ProblemI Low priority processes may never get to the resource (starvation)
I Solution: Aging (priority increases when not served)
Particular casesI FCFS: give the same priority to anyone
I SJF: priority inversely proportional to burst length
RemarkI On UNIX, processes are traditionally given a nice value
(inversely proportional to priority: nice processes give CPU to others)
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (61/355)
Multi-Level Scheduling
Split the ready queue into sub-queues, each with its specific scheduling policy
I Foreground (interactive jobs): RR
I Background (batch jobs): FCFS
Need to schedule between queues
I Any foreground first (but possible starvation)
I Preemptive to share 80%/20% of CPU
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (62/355)
Scheduling Algorithms
Feedback Multilevel Scheduling
I Here, processes can move between queues, which separates processes with different CPU-burst characteristics
I If a process has long CPU bursts, move it to a lower-priority queue ⇒ interactive processes end up with the highest priority
I If a process waits for a long time, move it to a higher-priority queue
[Figure: three queues, from highest to lowest priority: quantum = 8, quantum = 16, FCFS]
Moufida Maimour Systemes d’exploitation II (06/07) (90/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (63/355)
Scheduling Algorithms
Feedback Multilevel Scheduling is defined by
I the number of queues
I the scheduling algorithm of each queue
I the method used to decide when to change the priority of a process
[Figure: same three queues: quantum = 8, quantum = 16, FCFS]
Moufida Maimour Systemes d’exploitation II (06/07) (91/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (64/355)
SOLARIS Scheduling

3 scheduling classes
I Timesharing and interactive (TS & IA): RR with priority (more priority to the most interactive processes)
I System (SYS): FCFS with preemption and fixed priorities
I Realtime (RT): RR with priority, where an RT process keeps a fixed priority for its whole life
[Figure: global priorities, from low to high: 0-59 time-shared and interactive, 60-99 system, 100-159 real-time, 160-169 interrupts]
Moufida Maimour Systemes d’exploitation II (06/07) (92/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (65/355)
SOLARIS Scheduling

Dispatch table: interactive processes
I time quantum: the default length of the quantum assigned to the process
I time quantum expired: the new priority of a process that used its whole quantum
I return from sleep: the new priority of a process that blocked before using its whole quantum
Moufida Maimour Systemes d’exploitation II (06/07) (93/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (66/355)
HP-UX Scheduling

2 types of schedulers:

Real-time (RT)
I FIFO or RR
I fixed priorities, which cannot be changed by the kernel
I non-preemptive: a process runs until it finishes or blocks

Time-sharing (TS)
I RR
I the priority value increases (so the priority decreases) with CPU usage, and decreases while the process waits
I preemptive
Moufida Maimour Systemes d’exploitation II (06/07) (94/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (67/355)
HP-UX Scheduling: the time-sharing scheduler

The kernel differentiates, in terms of priority, user processes from system processes (kernel mode, waiting for an event). The latter have a higher priority.
I in user mode, a process can be preempted, stopped or even swapped out to secondary memory
I in kernel mode, a process runs until it blocks, an interrupt occurs or it terminates
[Figure: priority bands from highest to lowest: real-time processes (0-127), system processes (128-177), user processes (178-255)]
Moufida Maimour Systemes d’exploitation II (06/07) (95/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (68/355)
4.4BSD

Process states
I SIDL: intermediate state during process creation (idle)
I SRUN: ready (runnable)
I SSLEEP: waiting for an event
I SSTOP: stopped by its parent or a signal
I SZOMB: awaiting termination (zombie)

Remarks
I There is no "currently running" state
I A number of flags complete the information on the state of a process
Moufida Maimour Systemes d’exploitation II (06/07) (96/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (69/355)
4.4BSD

Scheduling (1)
I A process has 2 priorities:
  I user mode: p_usrpri ∈ [PUSER,127], where PUSER=50 is the priority given to the highest-priority user process.
  I kernel mode: p_priority ∈ [0,PUSER], giving more chances to a process in kernel mode, so that it releases the system resources it holds as soon as possible.
I a quantum = 0.1s (empirical value)
I the priority of a process is adjusted dynamically:

    p_usrpri = PUSER + p_cpu/4 + 2 p_nice    (1)

I p_nice lets the user modulate the priority of the process,
I p_cpu is incremented every 10 ms and estimates the CPU consumption of the active process.
⇒ The priority of a process degrades with its CPU consumption.
Moufida Maimour Systemes d’exploitation II (06/07) (97/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (70/355)
4.4BSD

Scheduling (2)
I every second, p_cpu is readjusted with the formula:

    p_cpu = (2 load / (2 load + 1)) p_cpu + p_nice    (2)

  where load is an estimation of the system load, namely the length of the queue of ready processes.
I When a sleeping user process is reactivated, the scheduler readjusts p_cpu:

    p_cpu = p_cpu (2 load / (2 load + 1))^p_slptime    (3)

  where p_slptime counts the waiting time of the process
⇒ this fades out the distant past.
Moufida Maimour Systemes d’exploitation II (06/07) (98/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (71/355)
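Formulas (1)-(3) can be transcribed directly; the sketch below is ours and uses doubles for clarity (the real kernel works on scaled integers):

```c
#include <assert.h>

#define PUSER 50

/* (1): user-mode priority; a higher value means a lower priority */
static double p_usrpri(double p_cpu, double p_nice) {
    return PUSER + p_cpu / 4.0 + 2.0 * p_nice;
}

/* (2): once per second, decay p_cpu by 2*load / (2*load + 1) */
static double decay(double p_cpu, double load, double p_nice) {
    return (2.0 * load) / (2.0 * load + 1.0) * p_cpu + p_nice;
}

/* (3): on wakeup, apply one decay step per second slept */
static double wakeup(double p_cpu, double load, int p_slptime) {
    double f = (2.0 * load) / (2.0 * load + 1.0);
    while (p_slptime-- > 0)
        p_cpu *= f;
    return p_cpu;
}
```

With load = 1 the decay factor is 2/3 ≈ 0.66, which is exactly the filter applied in the worked example of Scheduling (3).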
4.4BSD

Scheduling (3)
Example. Consider a single process, which is active and consumes all the CPU. This process consumes T_i clock ticks during second i, and load = 1.
Every second, the filter is applied with the formula p_cpu = 0.66 p_cpu:

    p_cpu = 0.66 T0
    p_cpu = 0.66 T1 + 0.44 T0
    p_cpu = 0.66 T2 + 0.44 T1 + 0.30 T0
    p_cpu = 0.66 T3 + ... + 0.20 T0
    p_cpu = 0.66 T4 + ... + 0.13 T0

Note how the effect of T0 fades away over time.
Moufida Maimour Systemes d’exploitation II (06/07) (99/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (72/355)
4.4BSD: the runqueue
I The set of processes in the Runnable state forms the queue of ready processes: the runqueue
I Scheduling is implemented with a linked list of processes for each group of floating priorities.
I qs: the table of heads and tails of the lists
I whichqs: a table associated with qs, indicating whether each list is occupied
[Figure: whichqs is a bit vector (one 0/1 flag per priority number) pointing into qs; each qs entry holds the head and tail of a list of proc structures]
Moufida Maimour Systemes d’exploitation II (06/07) (100/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (73/355)
Linux 2.6: process implementation

Process descriptor (task_struct)
contains all the information related to a process

Process state (state field)
I TASK_RUNNING: the process is ready to run, or currently running.
I TASK_INTERRUPTIBLE: the process is suspended, waiting for some condition:
  I a hardware interrupt,
  I the release of a resource the process is waiting for,
  I the reception of a signal, . . .
I TASK_STOPPED: process stopped by a SIGSTOP, SIGTSTP, SIGTTIN or SIGTTOU signal.
I TASK_ZOMBIE: the process has terminated but its parent has not yet issued a wait()-like system call to get information about the dead process.
I . . .
Moufida Maimour Systemes d’exploitation II (06/07) (101/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (74/355)
Linux 2.6: process implementation
Process descriptor (task_struct)

Lightweight processes
I A lightweight process corresponds to a thread
I A thread group is a set of lightweight processes implementing a single multithreaded application:
  I they share the addressing space
  I they act as a whole with respect to some system calls: getpid(), kill(), . . .
  I they can be scheduled separately
  I each has its own pid, but a single group pid: the pid of the first thread of the group

Process identification
I pid: process identifier from 0 to 32767 = PID_MAX_DEFAULT − 1
  (see /proc/sys/kernel/pid_max; the pid_map array tracks which pids are already assigned)
I tgid (thread group leader pid): pid of the first lightweight process of the group
  getpid() returns tgid, not pid (POSIX compatible)
Moufida Maimour Systemes d’exploitation II (06/07) (102/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (75/355)
Linux 2.6: process implementation
Process descriptor (task_struct)
[Figure: the task_struct holds the process state and points to the thread_info (low-level, per-process information: addr_limit, cpu, *task), the memory information (mm_struct), the open-file information (files_struct) and the list of received signals (signal_struct)]
Moufida Maimour Systemes d’exploitation II (06/07) (103/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (76/355)
Linux 2.6: process implementation

Process descriptor on the 80x86

thread_union

    union thread_union {
        struct thread_info thread_info;
        unsigned long stack[2048];
    };

[Figure: the 8KB union (0x015fa000-0x015fbfff) holds the 52-byte thread_info at its base (up to 0x015fa034), whose task field points to the process descriptor (task_struct); the kernel stack grows down from the top and esp points into it]
Moufida Maimour Systemes d’exploitation II (06/07) (104/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (77/355)
Linux 2.6: process implementation
Process descriptor on the 80x86
[Figure: same thread_union layout as on the previous slide]

From esp, the kernel can find, for the current process:

the address of the thread_info structure (current_thread_info):

    movl $0xffffe000, %ecx
    andl %esp, %ecx
    movl %ecx, p

the address of its descriptor (the current macro):

    movl $0xffffe000, %ecx
    andl %esp, %ecx
    movl (%ecx), p
Moufida Maimour Systemes d’exploitation II (06/07) (105/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (78/355)
Linux 2.6: process implementation

Lists of process descriptors
I list of all processes,
I list of ready processes, one list per priority level, using the prio_array_t structure:
  I int nr_active: number of descriptors in the lists,
  I unsigned long[5]: bitmap; if a flag is 1, the corresponding list is non-empty,
  I struct list_head[140] queue: the heads of the 140 priority lists.
I list of waiting processes, one list per event
  I a flag indicates whether to wake up a single process or all of them.
Moufida Maimour Systemes d’exploitation II (06/07) (106/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (79/355)
Linux 2.6: process scheduling

Scheduling classes
I SCHED_FIFO (FIFO real-time process): a process keeps running as long as no other process has a higher priority.
I SCHED_RR (Round-Robin real-time process): ensures fairness among processes of the same priority.
I SCHED_NORMAL (conventional, time-shared process)
Moufida Maimour Systemes d’exploitation II (06/07) (107/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (80/355)
Linux 2.6: process scheduling
Principle
I Two separate ranges of static priorities:
  I conventional priorities: 100-139, corresponding to nice values from -20 to 19. The nice value can be changed with the nice() or setpriority() system calls
  I real-time priorities: 0-99
I Dynamic-priority scheduling: each process has an initial priority that can decrease (if CPU-bound) or increase (if I/O-bound)
I A variable quantum (timeslice) is used, which can be consumed in several chunks.
I Timeslices are recomputed once all processes have consumed their whole timeslice.
I Preemptive scheduling, triggered when:
  I a new process arrives with a higher priority
  I the timeslice drops to zero
[Scale: timeslice as a function of priority: Min 5ms, Default 100ms, Max 800ms]
Moufida Maimour Systemes d’exploitation II (06/07) (108/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (81/355)
Linux 2.6: conventional processes

Computing the timeslice

    timeslice = (140 − staticP) × 20   if staticP < 120
                (140 − staticP) × 5    otherwise

where staticP is the static priority of the process.

Static priority   Nice value   Timeslice (ms)
100               -20          800
110               -10          600
120                 0          100
130               +10           50
139               +19            5
Moufida Maimour Systemes d’exploitation II (06/07) (109/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (82/355)
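The table can be checked against the formula with a tiny C function (a sketch of the computation, not kernel code):

```c
#include <assert.h>

/* timeslice in ms from the static priority (100..139) */
static int timeslice_ms(int staticP) {
    return (140 - staticP) * (staticP < 120 ? 20 : 5);
}
```

Each row of the table above is one call: high-priority (low-number) processes get timeslices up to 160 times longer than the nicest ones.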
Linux 2.6: conventional processes

Dynamic priority
I the number the scheduler actually uses to elect the next process to run:

    dynamicP = max(100, min(staticP − bonus + 5, 139))

I bonus ∈ [0..10]: a bonus < 5 amounts to a penalty
I the bonus depends on the history of the process ("average sleep time")
I a process is considered interactive if

    dynamicP ≤ 3 × staticP/4 + 28

  or, equivalently,

    bonus − 5 ≥ staticP/4 − 28 = interactiveDelta

Avg sleep time    bonus
0-100 ms          0
100-200 ms        1
200-300 ms        2
300-400 ms        3
400-500 ms        4
500-600 ms        5
600-700 ms        6
700-800 ms        7
800-900 ms        8
900-1000 ms       9
1 second          10
Moufida Maimour Systemes d’exploitation II (06/07) (110/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (83/355)
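The bonus and the clamped dynamic priority, as a C sketch (helper names are ours):

```c
#include <assert.h>

static int imax(int a, int b) { return a > b ? a : b; }
static int imin(int a, int b) { return a < b ? a : b; }

/* one bonus point per 100 ms of average sleep time, capped at 10 */
static int bonus(int avg_sleep_ms) {
    int b = avg_sleep_ms / 100;
    return b > 10 ? 10 : b;
}

/* dynamicP = max(100, min(staticP - bonus + 5, 139)) */
static int dynamicP(int staticP, int b) {
    return imax(100, imin(staticP - b + 5, 139));
}
```

A CPU hog at nice 0 (bonus 0) runs at dynamic priority 125; a heavy sleeper (bonus 9) at 116. Lower numbers win, so the sleeper is favored.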
Linux 2.6: conventional processes

To avoid starvation and to optimize the recomputation of timeslices, two lists are kept:
I active processes, which have not finished their timeslice
I expired processes, already served

Remark 1: real-time processes are always placed in the list of active processes.
Remark 2: the 2.6 scheduler finds the next process to run in constant time (O(1)), unlike the 2.4 one.

Reminder: list of ready processes, one per priority level, using the prio_array_t structure:
I int nr_active: number of descriptors in the lists,
I unsigned long[5]: bitmap; if a flag is 1, the corresponding list is non-empty,
I struct list_head[140] queue: the heads of the 140 priority lists.
Moufida Maimour Systemes d’exploitation II (06/07) (111/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (84/355)
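The O(1) pick relies on the bitmap: finding the highest-priority non-empty list is a find-first-set over the 140 bits, independent of the number of processes. A sketch (our naive loop; the kernel uses an optimized find-first-bit primitive, often a single machine instruction):

```c
#include <assert.h>

#define WORD_BITS (8 * (int)sizeof(unsigned long))

struct prio_bitmap { unsigned long w[5]; };  /* >= 140 bits */

static void mark_ready(struct prio_bitmap *bm, int prio) {
    bm->w[prio / WORD_BITS] |= 1UL << (prio % WORD_BITS);
}

/* lowest set bit = highest-priority non-empty runqueue list */
static int first_ready(const struct prio_bitmap *bm) {
    for (int i = 0; i < 5; i++)
        if (bm->w[i])
            for (int b = 0; b < WORD_BITS; b++)
                if (bm->w[i] & (1UL << b))
                    return i * WORD_BITS + b;
    return -1;  /* no runnable process */
}
```

Scheduling a process is then: read the bitmap, take the head of the corresponding list. Both steps cost the same whether 3 or 3000 processes are runnable.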
Linux 2.6: system call implementation

Reminder
[Figure: in user mode, the application program calls xyz() in the libc standard library; the wrapper routine issues SYSCALL; in kernel mode, the system_call handler dispatches to the system call service routine sys_xyz(), then returns to user mode with SYSEXIT]
Moufida Maimour Systemes d’exploitation II (06/07) (112/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (85/355)
Linux 2.6: system call implementation

Two invocation methods
1. Interrupt 0x80 (interrupt vector 128; Intel reserves vectors 32-238 for hardware interrupts). Uses the iret assembly instruction to return.
2. The sysenter and sysexit instructions, introduced with the Pentium II

The int 0x80 method
I At system startup, interrupt vector 128 is initialized with the address of the handler: set_system_gate(0x80, &system_call)
I The registers are saved
I The system call number is passed in the EAX register
I Arguments may also be passed (in registers)
I On error, a system call returns a negative value whose absolute value is the error code errno; errno itself is set by the wrapper routine.
I The saved registers are restored; control returns to the calling process and to user mode.
Moufida Maimour Systemes d’exploitation II (06/07) (113/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (86/355)
Linux 2.6: system call implementation

Assembly code

    ENTRY(system_call)
        pushl %eax              # save the syscall number
        SAVE_ALL
        movl $0xffffe000, %ebp  # locate thread_info from esp
        andl %esp, %ebp
        cmpl $(nr_syscalls), %eax
        jae syscall_badsys
    syscall_call:
        call *sys_call_table(0,%eax,4)
        movl %eax, 24(%esp)     # store the return value
    syscall_exit:
        cli
        movl 8(%ebp), %ecx
        testw $0xffff, %cx
    restore_all:
        RESTORE_ALL
Moufida Maimour Systemes d’exploitation II (06/07) (114/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (87/355) 5.18 Silberschatz, Galvin and Gagne ©2005Operating System Concepts – 7th Edition, Feb 2, 2005
Windows XP Scheduling
■ Windows XP schedules threads using a priority-based, preemptive scheduling algorithm.
■ The Windows XP scheduler ensures that the highest-priority thread will always run.
■ The portion of the Windows XP kernel that handles scheduling is called the dispatcher.
■ A thread selected to run by the dispatcher will run until it is preempted by a higher-priority thread, until it terminates, until its time quantum ends, or until it calls a blocking system call (such as an Input/Output operation)
■ If a higher-priority real-time thread becomes ready while a lower-priority thread is running, the lower-priority thread will be preempted.
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (88/355)
Windows XP Scheduling
■ There are two priority classes:
● The variable class contains threads having priorities from 1 to 15
● The real-time class contains threads with priorities from 16 to 31
● A single thread running at priority 0 is used for memory management
■ Each scheduling priority has a separate queue of the corresponding threads
■ The dispatcher uses a queue for each scheduling priority and traverses the set of queues from highest to lowest until it finds a thread that is ready to run.
■ If no ready thread is found, the dispatcher executes a special thread called the idle thread.
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (89/355)
Windows XP Priorities
[Table: the relative priorities within each class, against the priority classes]
By default, the base priority is the value of the Normal relative priority for the specific class
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (90/355)
Windows XP Priorities: Some Rules
■ Processes are typically members of the NORMAL_PRIORITY_CLASS.
■ A process belongs to this class unless its parent was of the IDLE_PRIORITY_CLASS, or unless another class was specified when the process was created.
■ The initial priority of a thread is typically the base priority of the process the thread belongs to.
■ When a thread's time quantum runs out, the thread is interrupted; if the thread is in the variable-priority class, its priority is lowered. However, the priority is never lowered below the base priority.
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (91/355)
Windows XP Priorities: Some Rules
■ Lowering the thread's priority tends to limit the CPU consumption of compute-bound threads.
■ When a variable-priority thread is released from a wait operation, the dispatcher boosts its priority. The amount of boost depends on what the thread was waiting for:
● A thread that was waiting for keyboard I/O gets a large increase
● A thread that was waiting for a disk operation gets a moderate increase
■ Windows XP distinguishes between the foreground process currently selected on the screen and the background processes that are not. When a process moves into the foreground, Windows XP increases its scheduling quantum by some factor, typically 3.
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (92/355)
Second Chapter
Process Handling
Introduction
Process Implementation
  Process Memory Layout
  Process Control Block
Process Scheduling: Theoretical Concepts
  Context Switching
  OS Scheduling Infrastructure
  Scheduling Algorithms
Scheduling in Real OSes
  UNIX: Solaris, HP-UX, 4.4BSD, Linux 2.6
  Windows XP
Process Creation
  UNIX
  Windows
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (93/355)
Process Creation

fork()
Under Unix, there is a separation between:
I creating a process (fork())
I executing a program (exec())

fork()
I duplicates the complete context of the parent to generate the child
I returns the pid of the created child to the parent
I returns 0 to the child
Moufida Maimour Systemes d’exploitation II (06/07) (66/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (94/355)
Process Creation

fork(): problems to solve
I Allocating resources for the child process:
  I system: entry in the process table, pid
  I memory: text, data, user stack, kernel stack and the user structure
I Creating an execution context for the child process from the parent's context
I Starting the new process:
  I double return of the fork() function
  I scheduling of the child process
Moufida Maimour Systemes d’exploitation II (06/07) (67/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (95/355)
Process Creation: fork()
[Figure: fork() returns in both the parent and the child; the child's code is either duplicated from, or shared with, the parent's code]
Moufida Maimour Systemes d’exploitation II (06/07) (68/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (96/355)
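The double return of fork() can be seen in a few lines of C (a minimal sketch; the helper name is ours):

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* fork() once: it returns 0 in the child and the child's pid in the parent */
static int run_child_and_get_status(void) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;               /* no child was created */
    if (pid == 0)
        _exit(42);               /* child: fork() returned 0 */
    int status;                  /* parent: fork() returned the child's pid */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The waitpid() call is what reaps the zombie: without it, the child would stay in the SZOMB state until its parent exits (cf. the exit(status) algorithm on the termination slide).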
Process Termination

Algorithm of exit(status)
I ignore signals
I reset the timers
I state = SZOMB
I close the files opened by this process
I decrement the counters in the system open-file table
I . . .
I free the virtual and physical memory, the U structure and the kernel stack
I remove the process from the queue of ready processes and put it in the queue of zombies
I have all the children of the process adopted by the init process
I store the "status" value in the (zombie) process structure
I send the SIGCHLD signal to the parent (which this signal will wake up)
I call the context-switching function
Moufida Maimour Systemes d’exploitation II (06/07) (69/216)Martin Quinson RSA (2008-2009) Chap 2: Process Handling (97/355)
Process Creation on Windows
No exact Windows equivalent of fork() and exec()

Windows has the CreateProcess function
I Does both fork+exec in one step
I Creates a new process + loads the specified program into it
I Many more parameters than fork+exec
  I More precisely: 10
  I Ok to put NULL for most of them

Example

    #include <windows.h>
    #include <iostream>
    using namespace std;

    int main() {
        PROCESS_INFORMATION pi;        // Filled by CreateProcess
        STARTUPINFO si;                // Read by CreateProcess, ok to zero it
        ZeroMemory(&si, sizeof(si));
        si.cb = sizeof(si);
        char cmd[] = "toto.exe 5 10";  // command line must be writable
        if (!CreateProcess(NULL, cmd, NULL, NULL, TRUE, 0, NULL, NULL, &si, &pi))
            cerr << "CreateProcess failed." << endl;
        WaitForSingleObject(pi.hProcess, INFINITE); // Wait for process termination
        CloseHandle(pi.hProcess);                   // cleanups
        CloseHandle(pi.hThread);
        return 0;
    }
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (98/355)
CreateProcess Syntax

    BOOL CreateProcess(
        LPCTSTR lpApplicationName,    // pointer to name of executable module
        LPTSTR lpCommandLine,         // pointer to command line string
        LPSECURITY_ATTRIBUTES lpPA,   // process security attributes
        LPSECURITY_ATTRIBUTES lpTA,   // thread security attributes
        BOOL bInheritHandles,         // handle inheritance flag
        DWORD dwCreationFlags,        // creation flags
        LPVOID lpEnvironment,         // pointer to new environment block
        LPCTSTR lpCurrentDirectory,   // pointer to current directory name
        LPSTARTUPINFO lpStartupInfo,  // pointer to STARTUPINFO
        LPPROCESS_INFORMATION lpPI    // pointer to PROCESS_INFORMATION
    );

I Two ways to specify the program to start
  (first arg ; program location; second arg ; command line)
I Creation flags are combined with |
  I 0 ; run in the same window
  I CREATE_NEW_CONSOLE is useful;
  I specify priority, linkage to parent, etc.
I Structures pi and si used for process communication (how to start, basic info)
Martin Quinson RSA (2008-2009) Chap 2: Process Handling (99/355)
Third Chapter
Memory Handling²
Hardware Memory Management
  Introduction
  Virtual Memory
  Segmentation
  Paging
  Examples: PDP-11, x86, MIPS and DEC Alpha
Swapping
Virtual Memory Operating System
Memory Allocation
2 Greatly inspired from David Mazieres course at Stanford.
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (100/355)
We want processes to coexist in memory
What about simply sharing memory between processes?
[Figure: physical memory shared directly between processes, with the OS, gcc, firefox and emacs stacked between 0x5000 and 0x9000]
What if...I emacs needs more memory than allocated?
I firefox needs more memory than exists on machine?
I gcc has a bug and writes into 0x6500?
I emacs does not use all its memory?
Other open question
I When does emacs learn that it runs at 0x5000? (at compile, link or run time)
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (101/355)
Issues in sharing physical memory
ProtectionI A bug in one process can corrupt memory in another
I Must somehow prevent process A from trashing B’s memory
I Also prevent A from even observing B's memory (ssh-agent contains secrets)
Transparency
I A process shouldn’t require particular memory locations
I Processes often require large amounts of contiguous memory (for stack, large data structures, etc.)
Resource exhaustion
I Programmers typically assume the machine has "enough" memory
I Sum of sizes of all processes often greater than physical memory
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (102/355)
Virtual Memory Goals
[Figure: the CPU issues loads and stores to virtual addresses; the MMU translation box checks whether each access is legal and translates it to a physical address in memory, possibly backed by disk]
Give each program its own ”virtual” address space
I At run time, relocate each load and store to its actual memory
I So app doesn’t care what physical memory it’s using
Also enforce protection
I Prevent one app from messing with another’s memory
And allow programs to see more memory than exists
I Somehow relocate some memory accesses to disk
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (103/355)
Virtual Memory Advantages
Can re-locate program while running
I Run partially in memory, partially on disk
Most of a process’s memory will be idle
I Think of the 80/20 rule
[Figure: physical memory holding Process 1 and Process 2, each made of busy and idle regions]
I Write idle parts to disk until needed
I Let other processes use memory for idle part
I Like CPU virtualization: when a process is not using the CPU, switch it out. When a process is not using a page, give that page to another process.
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (104/355)
Virtual Memory Implementation (1/2)
Challenge: VM = extra layer, could be slow
First Idea: Load-time linking
[Figure: static a.out at 0x1000 with a "jump 0x2000" relocated at load time into a.out' at 0x4000, whose jump becomes "jump 0x5000", placed above the OS in memory]
I Link as usual, but keep the list of references
I Fix up process when actually executed
I Determine where process will reside in memory
I Adjust all references within program (using addition)
Problems
I How to enforce protection
I How to move once in memory (Consider: data pointers)
I What if no contiguous free region fits program?
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (105/355)
Virtual Memory Implementation (2/2)
Challenge: VM = extra layer, could be slow
Better Idea: base+bound registers
[Figure: the same a.out relocated to 0x4000 above the OS, but now unchanged; base and bound registers perform the relocation at run time]
I Two special privileged registers: base and bound
I On each load/store:
I Physical address = virtual address + base register
I Check 0 <= virtual address < bound, else trap to kernel
I How to move process in memory?
I Change base register
I What happens on context switch?
I OS must re-load base and bound registers
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (106/355)
Virtual Memory Actual Implementation
DefinitionsI Programs load/store to virtual (or logical) addresses
I Actual memory uses physical (or real) addresses
[Figure: the CPU emits virtual addresses to the MMU, which emits physical addresses to memory]
Memory Management Unit (MMU)
I Usually part of CPU
I Accessed with privileged instructions (e.g., load bound registers)
I Translates from virtual to physical addresses
I Gives per-process view of memory called address space
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (107/355)
Address Space
[Figure: processes P1, P2 and P3 each see a virtual address space starting at 0; the MMU maps these views onto the physical addresses seen by the OS]
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (108/355)
Base and bound trade-offs
Advantages
I Cheap in terms of hardware: only two registers
I Cheap in terms of cycles: do add and compare in parallel
I Example: the Cray-1 used this scheme
Disadvantages
I Growing a process is expensive or impossible
I No way to share code or data (e.g., two copies of gcc)
[Figure: physical memory holding emacs, two copies of gcc, sh and free space]
One solution: Multiple segments per process
I E.g., separate code, stack, data segments
I Possibly multiple data segments
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (109/355)
Segmentation
Let processes have many base/bounds regs
I Address space built from many segments
I Can share/protect memory on segment granularity
Must specify segment as part of virtual address
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (110/355)
Segmentation mechanics
Implementation
I Each process has a segment table
I Each virtual address indicates a segment and offset:
I Top bits of addr select seg, low bits select offset (PDP-10)
I Seg selected by instruction or operand (pc selects text)
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (111/355)
Segmentation Example
2-bit segment number (1st hex digit), 12-bit offset (last 3 hex digits)
I Where is 0x0240? 0x1108? 0x265c? 0x3002? 0x1600?
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (112/355)
Segmentation trade-offs
Advantages
I Multiple segments per process
I Allows sharing (how?)
I Don’t need entire process in memory
Disadvantages
I Requires translation hardware, which could limit performance
I Segments not completely transparent to program (e.g., default segment faster or uses shorter instruction)
I An n-byte segment needs n contiguous bytes of physical memory
I Makes fragmentation a real problem.
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (113/355)
Fragmentation
What is it?I Inability to use free memory
Where does it come from?I Variable-sized pieces ; many small holes
(external fragmentation)
I Fixed-sized pieces ; no external holes, but forces internal waste (internal fragmentation)
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (114/355)
Alternatives to hardware MMU
Language-level protection (Java)
I Single address space for different modules
I Language enforces isolation
I Singularity OS does this (OS with type-checking and design by contract in place of hardware protection)
Software fault isolationI Instrument compiler output
I Checks before every store operation prevent modules from trashing each other
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (115/355)
Paging
Big Idea
I Divide memory up into small pages
I Map virtual pages to physical pages (each process has separate mapping)
Hardware gives control to OS on certain operations
I Read-only pages trap to OS on write
I Invalid pages trap to OS on read or write
I OS can change mapping and resume application
Other features sometimes found
I Hardware can set "accessed" and "dirty" bits
I Control page execute permission separately from read/write
I Control caching of page
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (116/355)
Paging trade-offs
Trade-offs
I Eliminates external fragmentation
I Simplifies allocation, free, and swap
I Internal fragmentation of 0.5 page per "segment" on average
Simplified Allocation
I Allocate any physical page to any process
I Can store idle virtual pages on disk
[Figure: gcc and emacs pages spread over physical memory and disk]
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (117/355)
Paging data structures
Pages are fixed size (typically 4K)
I Least significant 12 (log2 4K) bits of address are the page offset
I Most significant bits are the page number

Each process has a page table
I Maps Virtual Page Numbers to Physical Page Numbers
I Also includes bits for protection, validity, etc.

On memory access
I Translate virtual page number to physical page number, then add offset
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (118/355)
Example: Paging on PDP-11
64K virtual memory, 8K pages
I Separate address space for instructions & data
I I.e., can’t read your own instructions with a load
Entire page table stored in registers
I 8 Instruction page translation registers
I 8 Data page translations
I Drawback: must swap 16 machine registers on each context switch
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (119/355)
x86 Paging
Basics
I Normally 4KB pages
I Paging enabled by bits in a control register (%cr0)
I Only privileged OS code can manipulate control registers
I %cr3: points to 4KB page directory
I Page directory: 1024 PDEs (page directory entries)
I Each contains physical address of a page table
I Page table: 1024 PTEs (page table entries)
I Each contains physical address of a 4K page
I Page table covers 4 MB of virtual memory
Page Translation Mechanics
[Figure: %cr3 (PDBR) points to the page directory; the top 10 bits of the linear address select a directory entry, which gives the physical address of a page table; the next 10 bits select a page-table entry, which gives the physical address of a 4-KByte page; the low 12 bits are the offset within the page. Each table is aligned on a 4-Kbyte boundary; 1024 PDE x 1024 PTE = 2^20 pages = 4GB]
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (120/355)
x86 Page Directory Entry (4 Kb page)
[Figure: PDE layout. Bits 31-12: page table base address; 11-9: available for system programmer's use; 8: global page (ignored); 7: page size (0 indicates 4K); 6: reserved (set to 0); 5: accessed; 4: cache disabled; 3: write-through; 2: user/supervisor; 1: read/write; 0: present]
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (121/355)
x86 Page Table Entry (4 Kb page)
[Figure: PTE layout. Bits 31-12: page base address; 11-9: available for system programmer's use; 8: global page; 7: page table attribute index; 6: dirty; 5: accessed; 4: cache disabled; 3: write-through; 2: user/supervisor; 1: read/write; 0: present]
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (122/355)
Making Paging Fast
x86 paging translation requires 3 memory references per load/store
I Look up page table address in page directory
I Look up physical page number in page table
I Actually access physical page corresponding to virtual address
[Figure: two-level page translation walk, repeated from the Page Translation Mechanics slide]
Translation Lookaside Buffer (TLB)
I For speed, CPU caches recently used translations
I Typical: 64-2K entries, 4-way to fully associative, 95% hit rate
I Each entry maps virtual page number → PPN + protection information
I On each memory reference:
I Check TLB; if present, get physical address fast
I If not, walk page tables, insert translation in TLB for next time (must evict some entry)
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (123/355)
TLB details
TLB operates at CPU pipeline speed
⇒ small, fast
Complication
I What to do when switch address space?
I x86 solution: Flush TLB on context switch
I MIPS solution: Tag each entry with associated process’s ID
In general, OS must manually keep TLB valid
I e.g., x86 INVLPG instruction
I Invalidates a page translation in TLB
I Must execute after changing a possibly used page table entry
I Otherwise, hardware will miss page table change
I More complex on a multiprocessor (TLB shootdown)
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (124/355)
x86 Paging Extensions
PSE: Page Size Extension
I Setting bit 7 in PDE (and bit 4 of %cr4) makes a 4MB translation (no page table, direct translation)
I Useful for big chunks (less meta-data, but more internal fragmentation)
PAE: Physical Address Extensions
I Physical addresses are 36 bits (up to 64GB); virtual addresses still 32 bits (more 4GB apps per box)
I Three-level translation walk (table entries are 64bits)
[Figure: PAE translation. %cr3 points to a 4-entry page-directory-pointer table; bits 31-30 of the address select a directory pointer entry, bits 29-21 a 64-bit page directory entry, bits 20-12 a 64-bit page table entry, and bits 11-0 the offset in the 4K page]
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (125/355)
Long Mode PAE
CharacteristicsI Physical memory: Up to 1Tb currently (4Pb in future)
I Virtual memory: up to 256Tb currently (16Eb in future)
I Four-level translation walk
[Figure: long-mode translation. %cr3 points to the Page-Map Level-4 table; bits 47-39 of the address select a PML4E, bits 38-30 a PDPE in a page-directory-pointer table, bits 29-21 a PDE, bits 20-12 a PTE, and bits 11-0 the offset in the 4K page]
I Why are the upper 16 bits not used?
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (126/355)
Where do the OS live?
In its own address space?
I Can't do this on most hardware (e.g., syscall instruction won't switch address spaces)
I Also would make it harder to parse syscall arguments passed as pointers
So in the same address space as process
I Use protection bits to prohibit user code from writing kernel
Typically all kernel text, most data at same VA in every address space
I On x86, must manually set up page tables for this
I Usually just map the kernel contiguously, since the boot loader puts it into contiguous physical memory
I Some hardware puts physical memory (kernel-only) somewhere in virtualaddress space
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (127/355)
Example Memory Layout
[Figure: example layout. From 4 Gig down: kernel text & most data at 0xf000000 (mapping the first 256MB of physical memory), memory-mapped kernel data, the user stack below USTACKTOP, invalid memory, [mmaped regions] and the heap up to the break point, BSS, program data, and program text (read-only) down to 0]
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (128/355)
Very different MMUs exist
MIPSI Hardware has 64-entry TLB (references to addresses not in TLB trap to kernel)
I Each TLB entry has the following fields:Virtual page, Pid, Page frame, NC, D, V, Global
I Kernel itself unpaged
I All of physical memory contiguously mapped in high VM
I Kernel uses these pseudo-physical addresses
I User TLB fault handler very efficient
I Two hardware registers reserved for it
I utlb miss handler can itself fault, allowing paged page tables
I OS is free to choose page table format!
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (129/355)
Very different MMUs exist
DEC AlphaI Software managed TLB (like MIPS)
I 8KB, 64KB, 512KB, 4MB pages all available
I TLB supports 128 instruction/128 data entries of any size
I But TLB miss handler not part of OS
I Processor ships with special "PAL code" in ROM
I Processor-specific, but provides uniform interface to OS
I Basically firmware that runs from main memory like OS
I Various events vector directly to PAL code: CALL PAL instruction, TLB miss/fault, FP disabled
I PAL code runs in special privileged processor mode: interrupts always disabled; has access to special instructions and registers
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (130/355)
Paging to disk
Example of swapping
I gcc needs a new page of memory
I OS re-claims an idle page from emacs
I If page is clean (i.e., also stored on disk):
I E.g., page of text from emacs binary on disk
I Can always re-read same page from binary
I So okay to discard contents now & give page to gcc
I If page is dirty (meaning memory is the only copy):
I Must write page to disk first before giving to gcc
I Either way:
I Mark page invalid in emacs
I emacs will fault on next access to virtual page
I On fault, OS reads page data back from disk into new page, maps new page into emacs, resumes executing
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (131/355)
Third Chapter
Memory Handling
Hardware Memory Management
Introduction
Virtual Memory
Segmentation
Paging
Examples
PDP-11
x86
MIPS and DEC Alpha
Swapping
Virtual Memory Operating System
Memory Allocation
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (132/355)
Paging
• Use disk to simulate larger virtual than physical memory
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (133/355)
Working set model
• Disk much, much slower than memory
- Goal: Run at memory, not disk speeds
• 90/10 rule: 10% of memory gets 90% of memory refs
- So, keep that 10% in real memory, the other 90% on disk
- How to pick which 10%?
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (134/355)
Paging challenges
• How to resume a process after a fault?
- Need to save state and resume
- Process might have been in the middle of an instruction!
• What to fetch?
- Just needed page or more?
• What to eject?
- How to allocate physical pages amongst processes?
- Which of a particular proc’s pages to keep in memory?
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (135/355)
Re-starting instructions
• Hardware provides kernel w. info about page fault
- Faulting virtual address (e.g., in %cr2 reg on x86)
- Address of instruction that caused fault
- Was the access a read or write? Was it an instruction fetch?
Was it caused by user access to kernel-only memory?
• Hardware must allow resuming after a fault
• Idempotent instructions are easy
- E.g., simple load or store instruction can be restarted
- Just re-execute any instruction that only accesses one address
• Complex instructions must be re-started, too
- E.g., x86 move string instructions
- Specify src, dst, count in %esi, %edi, %ecx registers
- On fault, registers adjusted to resume where move left off
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (136/355)
What to fetch
• Bring in page that caused page fault
• Pre-fetch surrounding pages?
- Reading two disk blocks approximately as fast as reading one
- As long as no track/head switch, seek time dominates
- If application exhibits spatial locality, then big win to store and
read multiple contiguous pages
• Also pre-zero unused pages in idle loop
- Need 0-filled pages for stack, heap, anonymously mmapped
memory
- Zeroing them only on demand is slower
- So many OSes zero freed pages while CPU is idle
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (137/355)
Selecting physical pages
• May need to eject some pages
- More on eviction policy in two slides
• May also have a choice of physical pages
• Direct-mapped physical caches
- Virtual → Physical mapping can affect performance
- Applications can conflict with each other or themselves
- Scientific applications benefit if consecutive virtual pages do not
conflict in the cache
- Many other applications do better with random mapping
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (138/355)
Superpages
• How should OS make use of “large” mappings
- x86 has 2/4MB pages that might be useful
- Alpha has even more choices: 8KB, 64KB, 512KB, 4MB
• Sometimes more pages in L2 cache than TLB entries
- Don’t want costly TLB misses going to main memory
• Transparent superpage support [Navarro]
- “Reserve” appropriate physical pages if possible
- Promote contiguous pages to superpages
- Does complicate evicting (esp. dirty pages) – demote
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (139/355)
Straw man: FIFO eviction
• Evict oldest fetched page in system
• Example—reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
• 3 physical pages: 9 page faults
• 4 physical pages: 10 page faults
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (141/355)
Belady’s Anomaly
• More phys. mem. doesn’t always mean fewer faults
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (142/355)
Optimal page replacement
• What is optimal (if you knew the future)?
- Replace page that will not be used for longest period of time
• Example—reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
• With 4 physical pages:
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (144/355)
LRU page replacement
• Approximate optimal with least recently used
- Because past often predicts the future
• Example—reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
• With 4 physical pages: 8 page faults
• Problem 1: Can be pessimal – example?
- Looping over memory (then want MRU eviction)
• Problem 2: How to implement?
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (146/355)
Straw man LRU implementations
• Stamp PTEs with timer value
- E.g., CPU has cycle counter
- Automatically writes value to PTE on each page access
- Scan page table to find oldest counter value = LRU page
- Problem: Would double memory traffic!
• Keep doubly-linked list of pages
- On access remove page, place at tail of list
- Problem: again, very expensive
• What to do?
- Just approximate LRU, don’t try to do it exactly
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (147/355)
Clock algorithm
• Use accessed bit supported by most hardware
- E.g., Pentium will write 1 to A bit in PTE on first access
- Software managed TLBs like MIPS can do the same
• Do FIFO but skip accessed pages
• Keep pages in circular FIFO list
• Scan:
- page’s A bit = 1, set to 0 & skip
- else if A == 0, evict
• A.k.a. second-chance replacement
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (148/355)
Clock alg. (continued)
• Large memory may be a problem
- Most pages referenced in a long interval
• Add a second clock hand
- Leading edge clears A bits
- Trailing edge evicts pages with A=0
• Can also take advantage of hardware Dirty bit
- Each page can be (Unaccessed, Clean), (Unaccessed, Dirty),
(Accessed, Clean), or (Accessed, Dirty)
- Consider clean pages for eviction before dirty
• Or use n-bit accessed count instead of just the A bit
- On sweep: count = (A << (n-1)) | (count >> 1)
- Evict page with lowest count
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (149/355)
Other replacement algorithms
• Random eviction
- Dirt simple to implement
- Not overly horrible (avoids Belady & pathological cases)
• LFU (least frequently used) eviction
- instead of just A bit, count # times each page accessed
- least frequently accessed must not be very useful
(or maybe was just brought in and is about to be used)
- decay usage counts over time (for pages that fall out of usage)
• MFU (most frequently used) algorithm
- because page with the smallest count was probably just
brought in and has yet to be used
• Neither LFU nor MFU used very commonly
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (150/355)
Naïve paging
• Naïve paging requires 2 disk I/Os per page fault
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (151/355)
Page buffering
• Idea: reduce # of I/Os on the critical path
• Keep pool of free page frames
- On fault, still select victim page to evict
- But read fetched page into already free page
- Can resume execution while writing out victim page
- Then add victim page to free pool
• Can also yank pages back from free pool
- Contains only clean pages, but may still have data
- If page fault on page still in free pool, recycle
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (152/355)
Page allocation
• Allocation can be global or local
• Global allocation doesn’t consider page ownership
- E.g., with LRU, evict least recently used page of any proc
- Works well if P1 needs 20% of memory and P2 needs 70%:
- Doesn’t protect you from memory pigs
(imagine P2 keeps looping through array that is size of mem)
• Local allocation isolates processes (or users)
- Separately determine how much mem each proc should have
- Then use LRU/clock/etc. to determine which pages to evict
within each process
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (153/355)
Thrashing
• Thrashing: processes on system require more memory than it has
- Each time one page is brought in, another page, whose contents
will soon be referenced, is thrown out
- Processes will spend all of their time blocked, waiting for pages
to be fetched from disk
- I/O devs at 100% utilization but system not getting much
useful work done
• What we wanted: virtual memory the size of disk
with access time of physical memory
• What we have: memory with access time = disk
access
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (154/355)
Reasons for thrashing
• Process doesn't reuse memory, so caching doesn't
work (past != future)
• Process does reuse memory, but it does not “fit”
• Individually, all processes fit and reuse memory, but
too many for system
- At least this case is possible to address
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (155/355)
Multiprogramming & Thrashing
• Need to shed load when thrashing
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (156/355)
Dealing with thrashing
• Approach 1: working set
- Thrashing viewed from a caching perspective: given locality of
reference, how big a cache does the process need?
- Or: how much memory does process need in order to make
reasonable progress (its working set)?
- Only run processes whose memory requirements can be
satisfied
• Approach 2: page fault frequency
- Thrashing viewed as poor ratio of fetch to work
- PFF = page faults / instructions executed
- If PFF rises above threshold, process needs more memory
not enough memory on the system? Swap out.
- If PFF sinks below threshold, memory can be taken away
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (157/355)
Working sets
• Working set changes across phases
- Balloons during transitions
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (158/355)
Calculating the working set
• Working set: all pages proc. will access in next T time
- Can’t calculate without predicting future
• Approximate by assuming past predicts future
- So working set ≈ pages accessed in last T time
• Keep idle time for each page
• Periodically scan all resident pages in system
- A bit set? Clear it and clear the page’s idle time
- A bit clear? Add CPU consumed since last scan to idle time
- Working set is pages with idle time < T
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (159/355)
Two-level scheduler
• Divide processes into active & inactive
- Active – means working set resident in memory
- Inactive – working set intentionally not loaded
• Balance set: union of all active working sets
- Must keep balance set smaller than physical memory
• Use long-term scheduler
- Moves procs from active → inactive until balance set small
enough
- Periodically allows inactive to become active
- As working set changes, must update balance set
• Complications
- How to chose T?
- How to pick processes for active set
- How to count shared memory (e.g., libc)
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (160/355)
Some complications of paging
• What happens to available memory?
- Some physical memory tied up by kernel VM structures
• What happens to user/kernel crossings?
- More crossings into kernel
- Pointers in syscall arguments must be checked
• What happens to IPC?
- Must change hardware address space
- Increases TLB misses
- Context switch flushes TLB entirely on x86
(But not on MIPS. . . Why?)
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (161/355)
64-bit address spaces
• Straight hierarchical page tables not efficient
• Solution 1: Guarded page tables [Liedtke]
- Omit intermediary tables with only one entry
- Add predicate in high level tables, stating the only virtual
address range mapped underneath + # bits to skip
• Solution 2: Hashed page tables
- Store Virtual → Physical translations in hash table
- Table size proportional to physical memory
- Clustering makes this more efficient
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (162/355)
Typical virtual address space
[Figure: typical virtual address space. From 0 up: program text (read-only), program data, BSS, heap up to the breakpoint, invalid memory, user stack below USTACKTOP, invalid memory, then kernel memory up to 4 Gig]
• Dynamically allocated memory goes in heap
- Typically right above BSS (uninitialized data) section
• Top of heap called breakpoint
- Memory between breakpoint and stack is invalid
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (163/355)
Early VM system calls
• OS keeps "breakpoint" – top of heap
- Memory regions between breakpoint & stack fault
• char *brk(const char *addr);
- Set and return new value of breakpoint
• char *sbrk(int incr);
- Increment value of the breakpoint & return old value
• Can implement malloc in terms of sbrk
- But hard to "give back" physical memory to system
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (164/355)
Memory mapped files
[Figure: memory-mapped files placed between the heap (above the breakpoint) and the user stack below USTACKTOP, with kernel memory at 4 Gig]
• Other memory objects between heap and stack
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (165/355)
mmap system call
• void *mmap(void *addr, size_t len, int prot, int flags, int fd, off_t offset)
- Map file specified by fd at virtual address addr
- If addr is NULL, let kernel choose the address
• prot – protection of region
- OR of PROT_EXEC, PROT_READ, PROT_WRITE, PROT_NONE
• flags- MAP_ANON – anonymous memory (fd should be -1)
- MAP_PRIVATE – modifications are private
- MAP_SHARED – modifications seen by everyone
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (166/355)
More VM system calls
• int msync(void *addr, size_t len, int flags);
- Flush changes of mmapped file to backing store
• int munmap(void *addr, size_t len);
- Removes memory-mapped object
• int mprotect(void *addr, size_t len, int prot);
- Changes protection on pages to OR of PROT_...
• int mincore(void *addr, size_t len, char *vec);
- Returns in vec which pages are present
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (167/355)
Catching page faults
struct sigaction {
union { /* signal handler */
void (*sa_handler)(int);
void (*sa_sigaction)(int, siginfo_t *, void *);
};
sigset_t sa_mask; /* signal mask to apply */
int sa_flags;
};
int sigaction (int sig, const struct sigaction *act,
struct sigaction *oact)
• Can specify function to run on SIGSEGV
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (168/355)
Example: OpenBSD/i386 siginfo
struct sigcontext {
    int sc_gs; int sc_fs; int sc_es; int sc_ds;
    int sc_edi; int sc_esi; int sc_ebp; int sc_ebx;
    int sc_edx; int sc_ecx; int sc_eax;
    int sc_eip; int sc_cs;      /* instruction pointer */
    int sc_eflags;              /* condition codes, etc. */
    int sc_esp; int sc_ss;      /* stack pointer */
    int sc_onstack;             /* sigstack state to restore */
    int sc_mask;                /* signal mask to restore */
    int sc_trapno;
    int sc_err;
};
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (169/355)
4.4 BSD VM system
• Each process has a vmspace structure containing
- vm_map – machine-independent virtual address space
- vm_pmap – machine-dependent data structures
- statistics – e.g. for syscalls like getrusage ()
• vm_map is a linked list of vm_map_entry structs
- vm_map_entry covers contiguous virtual memory
- points to vm_object struct
• vm_object is source of data
- e.g. vnode object for memory mapped file
- points to list of vm_page structs (one per mapped page)
- shadow objects point to other objects for copy on write
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (170/355)
[Figure: a vmspace (vm_map, vm_pmap, stats) heads a list of vm_map_entry structs; each entry points to a vnode or shadow object, shadow objects chain to vnode objects, and each object holds its own vm_page structs]
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (171/355)
Pmap (machine-dependent) layer
• Pmap layer holds architecture-specific VM code
• VM layer invokes pmap layer
- On page faults to install mappings
- To protect or unmap pages
- To ask for dirty/accessed bits
• Pmap layer is lazy and can discard mappings
- No need to notify VM layer
- Process will fault and VM layer must reinstall mapping
• Pmap handles restrictions imposed by cache
– p. 38/40David Mazieres RSA (2008-2009) Chap 3: Memory Handling (172/355)
Example uses
• vm_map_entry structs for a process
- r/o text segment → file object
- r/w data segment → shadow object → file object
- r/w stack → anonymous object
• New vm_map_entry objects after a fork:
- Share text segment directly (read-only)
- Share data through two new shadow objects
(must share pre-fork but not post-fork changes)
- Share stack through two new shadow objects
• Must discard/collapse superfluous shadows
- E.g., when child process exits
– p. 39/40David Mazieres RSA (2008-2009) Chap 3: Memory Handling (173/355)
What happens on a fault?
• Traverse vm_map_entry list to get appropriate entry
- No entry? Protection violation? Send process a SIGSEGV
• Traverse list of [shadow] objects
• For each object, traverse vm_page structs
• Found a vm_page for this object?
- If first vm_object in chain, map page
- If read fault, install page read only
- Else if write fault, install copy of page
• Else get page from object
- Page in from file, zero-fill new page, etc.
– p. 40/40David Mazieres RSA (2008-2009) Chap 3: Memory Handling (174/355)
Third Chapter
Memory Handling
Hardware Memory Management
  Introduction
  Virtual Memory
  Segmentation
  Paging
  Examples: PDP-11, x86, MIPS and DEC Alpha
Swapping
Virtual Memory Operating System
Memory Allocation
David Mazieres RSA (2008-2009) Chap 3: Memory Handling (175/355)
Dynamic memory allocation
• Almost every useful program uses it
- Gives wonderful functionality benefits
- Don’t have to statically specify complex data structures
- Can have data grow as a function of input size
- Allows recursive procedures (stack growth)
- But, can have a huge impact on performance
• Today: how to implement it
• Some interesting facts:
- A two- or three-line code change can have a huge, non-obvious
impact on how well the allocator works (examples to come)
- Proven: impossible to construct an "always good" allocator
- Surprising result: after 35 years, memory management still
poorly understood
– p. 2/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (176/355)
Why is it hard?
• Satisfy an arbitrary sequence of allocations and frees.
• Easy without free: set a pointer to the beginning of
some big chunk of memory (“heap”) and increment
on each allocation:
• Problem: free creates holes (“fragmentation”). Result:
Lots of free space, but cannot satisfy the request!
– p. 3/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (177/355)
More abstractly
• What an allocator must do:
- Track which parts of memory in use, which parts are free.
- Ideal: no wasted space, no time overhead.
• What the allocator cannot do:
- Control the order, number, or size of requested blocks.
- Can’t move blocks once the user has pointers ⇒ (bad) placement decisions are permanent.
• The core fight: minimize fragmentation
- App frees blocks in any order, creating holes in “heap”.
- Holes too small? cannot satisfy future requests.
– p. 4/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (178/355)
What is fragmentation really?
• Inability to use memory that is free
• Two causes
- Different lifetimes—if adjacent objects die at different times,
then fragmentation:
- If they die at the same time, then no fragmentation:
- Different sizes: If all requests the same size, then no
fragmentation (paging artificially creates this):
– p. 5/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (179/355)
Important decisions
• Placement choice: where in free memory to put a
requested block?
- Freedom: can select any memory in the heap
- Ideal: put block where it won’t cause fragmentation later.
(impossible in general: requires future knowledge)
• Splitting free blocks to satisfy smaller requests
- Fights internal fragmentation.
- Freedom: can choose any larger block to split.
- One way: choose block with smallest remainder (best fit).
• Coalescing free blocks to yield larger blocks
- Freedom: when coalescing is done (deferring can be good); fights
external fragmentation.
– p. 6/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (180/355)
Impossible to “solve” fragmentation
• If you read allocation papers to find the best allocator
- All discussions revolve around tradeoffs.
- The reason? There cannot be a best allocator.
• Theoretical result:
- For any possible allocation algorithm, there exist streams of
allocation and deallocation requests that defeat the allocator
and force it into severe fragmentation.
• What is bad?
- Good allocator: requires gross memory M · log(n_max/n_min), where
M = bytes of live data, n_min = smallest allocation, n_max = largest
- Bad allocator: M · (n_max/n_min)
– p. 7/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (181/355)
Pathological examples
• Given allocation of 7 20-byte chunks
- What’s a bad stream of frees and then allocates?
• Given 100 bytes of free space
- What’s a really bad combination of placement decisions and
malloc & frees?
• Next: two allocators (best fit, first fit) that, in practice, work pretty well.
- “pretty well” = ∼20% fragmentation under many workloads
– p. 8/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (182/355)
Best fit
• Strategy: minimize fragmentation by allocating space from the block that leaves the smallest fragment
- Data structure: heap is a list of free blocks, each has a header
holding block size and a pointer to the next free block
- Code: Search freelist for block closest in size to the request.
(Exact match is ideal)
- During free (usually) coalesce adjacent blocks
• Problem: Sawdust
- Remainder so small that over time left with “sawdust”
everywhere.
- Fortunately not a problem in practice.
– p. 9/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (183/355)
Best fit gone wrong
• Simple bad case: allocate n, m (m < n) in alternating
order, free all the ms, then try to allocate an m + 1.
• Example: start with 100 bytes of memory
- alloc 19, 21, 19, 21, 19
- free 19, 19, 19:
- alloc 20? Fails! (wasted space = 57 bytes)
• However, doesn’t seem to happen in practice (though
the way real programs behave suggest it easily could)
– p. 10/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (184/355)
First fit
• Strategy: pick the first block that fits
- Data structure: free list, sorted LIFO, FIFO, or by address
- Code: scan list, take the first one.
• LIFO: put free object on front of list.
- Simple, but causes higher fragmentation
• Address sort: order free blocks by address.
- Makes coalescing easy (just check if next block is free)
- Also preserves empty/idle space (locality good when paging)
• FIFO: put free object at end of list.
- Gives similar fragmentation as address sort, but unclear why
– p. 11/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (185/355)
Subtle pathology: LIFO FF
• Storage management example of subtle impact of
simple decisions
• LIFO first fit seems good:
- Put object on front of list (cheap), hope same size used again
(cheap + good locality).
• But, has big problems for simple allocation patterns:
- Repeatedly intermix short-lived large allocations, with
long-lived small allocations.
- Each time large object freed, a small chunk will be quickly
taken. Pathological fragmentation.
– p. 12/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (186/355)
First fit: Nuances
• First fit + address order in practice:
- Blocks at front preferentially split, ones at back only split when
no larger one found before them
- Result? Seems to roughly sort free list by size
- So? Makes first fit operationally similar to best fit: a first fit of a
sorted list = best fit!
• Problem: sawdust at beginning of the list
- Sorting of list forces large requests to skip over many small
blocks. Need to use a scalable heap organization
• When better than best fit?
- Suppose memory has free blocks:
- Suppose allocation ops are 10 then 20 (best fit best)
- Suppose allocation ops are 8, 12, then 12 (first fit best)
– p. 13/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (187/355)
First/best fit: weird parallels
• Both seem to perform roughly equivalently
• In fact the placement decisions of both are roughlyidentical under both randomized and real workloads!
- No one knows why.
- Pretty strange since they seem pretty different.
• Possible explanations:
- First fit like best fit because over time its free list becomes
sorted by size: the beginning of the free list accumulates small
objects and so fits tend to be close to best.
- Both have implicit “open space heuristic”: try not to cut into
large open spaces: large blocks at end only used when they have to
be (e.g., first fit skips over all smaller blocks).
– p. 14/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (188/355)
Some worse ideas
• Worst-fit:
- Strategy: fight against sawdust by splitting blocks to maximize
leftover size
- In real life seems to ensure that no large blocks around.
• Next fit:
- Strategy: use first fit, but remember where we found the last
thing and start searching from there.
- Seems like a good idea, but tends to break down entire list.
• Buddy systems:
- Round up allocations to power of 2 to make management faster.
- Result? Heavy internal fragmentation.
– p. 15/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (189/355)
Slab allocation
• Kernel allocates many instances of same structures
- E.g., a 1.7 KB task_struct for every process on system
• Often want contiguous physical memory (for DMA)
• Slab allocation optimizes for this case:
- A slab is multiple pages of contiguous physical memory
- A cache contains one or more slabs
- Each cache stores only one kind of object (fixed size)
• Each slab is full, empty, or partial
• E.g., need new task_struct?
- Look in the task_struct cache
- If there is a partial slab, pick free task_struct in that
- Else, use empty, or may need to allocate new slab for cache
• Advantages: speed, and no internal fragmentation
– p. 16/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (190/355)
Known patterns of real programs
• So far we’ve treated programs as black boxes.
• Most real programs exhibit 1 or 2 (or all 3) of the
following patterns of alloc/dealloc:
- ramps: accumulate data monotonically over time
- peaks: allocate many objects, use briefly, then free all
- plateaus: allocate many objects, use for a long time
– p. 17/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (191/355)
Pattern 1: ramps
• In a practical sense: ramp = no free!
- Implication for fragmentation?
- What happens if you evaluate allocator with ramp programs
only?
– p. 18/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (192/355)
Pattern 2: peaks
• Peaks: allocate many objects, use briefly, then free all
- Fragmentation a real danger.
- Interleave peak & ramp? Interleave two different peaks?
- What happens if peak allocated from contiguous memory?
– p. 19/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (193/355)
Exploiting peaks
• Peak phases: alloc a lot, then free everything
- So have new allocation interface: alloc as before, but only
support free of everything.
- Called “arena allocation”, “obstack” (object stack), or
procedure call (by compiler people).
• arena = a linked list of large chunks of memory.
- Advantages: alloc is a pointer increment, free is “free”.
No wasted space for tags or list pointers.
– p. 20/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (194/355)
Pattern 3: Plateaus
• Plateaus: allocate many objects, use for a long time
- what happens if overlap with peak or different plateau?
– p. 21/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (195/355)
Fighting fragmentation
• Segregation = reduced fragmentation:
- Allocated at same time ∼ freed at same time
- Different type ∼ freed at different time
• Implementation observations:
- Programs allocate small number of different sizes.
- Fragmentation at peak use more important than at low use.
- Most allocations small (< 10 words)
- Work done with allocated memory increases with size.
- Implications?
– p. 22/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (196/355)
Simple, fast segregated free lists
• Array of free lists for small sizes, tree for larger
- Place blocks of same size on same page.
- Have count of allocated blocks: if goes to zero, can return page
• Pro: segregate sizes, no size tag, fast small alloc
• Con: worst case waste: 1 page per size even w/o free,
after pessimal free waste 1 page per object
– p. 23/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (197/355)
Typical space overheads
• Free list bookkeeping + alignment determine
minimum allocatable size:
- Store size of block.
- Pointers to next and previous freelist element.
- Machine enforced overhead: alignment. Allocator doesn’t
know type. Must align memory to conservative boundary.
- Minimum allocation unit? Space overhead when allocated?
– p. 24/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (198/355)
Getting more space from OS
• On Unix, can use sbrk
- E.g., to activate a new zero-filled page:
• For large allocations, sbrk a bad idea
- May want to give memory back to OS
- Can’t with sbrk unless the big chunk was the last thing allocated
- So allocate large chunks using mmap’s MAP_ANON
– p. 25/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (199/355)
Faults + resumption = power
• Resuming after fault lets us emulate many things
- “every problem can be solved with layer of indirection”
• Example: sub-page protection
• To protect sub-page region in paging system:
- Set entire page to weakest permission; record in PT
- Any access that violates perm will cause an access fault
- Fault handler checks if page special, and if so, if access allowed.
Continue or raise error, as appropriate
– p. 26/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (200/355)
More fault resumption examples
• Emulate accessed bits:
- Set page permissions to “invalid”.
- On any access will get a fault: Mark as accessed
• Avoid save/restore of FP registers
- Make first FP operation fault to detect usage
• Emulate non-existent instructions:
- Give inst an illegal opcode; OS fault handler detects and
emulates fake instruction
• Run OS on top of another OS!
- Slam OS into normal process
- When it does something “privileged,” the real
OS gets woken up with a fault.
- If op allowed, do it, otherwise kill.
- IBM’s VM/370. VMware (sort of)
– p. 27/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (201/355)
Not just for kernels
• User-level code can resume after faults, too
• mprotect – protects memory
• sigaction – catches signal after page fault
- Return from signal handler restarts faulting instruction
• Many applications detailed by Appel & Li
• Example: concurrent snapshotting of process
- Mark all of the process’s memory read-only with mprotect
- One thread starts writing all of memory to disk
- Other thread keeps executing
- On fault – write that page to disk, make writable, resume
– p. 28/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (202/355)
Distributed shared memory
• Virtual memory allows us to go to memory or disk
- But, can use the same idea to go anywhere! Even to another
computer. Page across network rather than to disk. Faster, and
allows network of workstations (NOW)
– p. 29/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (203/355)
Persistent stores
• Idea: Objects that persist across program invocations
- E.g., object-oriented database; useful for CAD/CAM type apps
• Achieve by memory-mapping a file
• But only write changes to file at end if commit
- Use dirty bits to detect which pages must be written out
- Or with mprotect/sigaction emulated dirty bits on write faults
• On 32-bit machine, store can be larger than memory
- But single run of program won’t access > 4GB of objects
- Keep mapping between 32-bit mem ptrs and 64-bit disk offsets
- Use faults to bring in pages from disk as necessary
- After reading page, translate pointers—known as swizzling
– p. 30/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (204/355)
Garbage collection
• In safe languages, run time knows about all pointers
- So can move an object if you change all the pointers
• What memory locations might a program access?
- Any objects whose pointers are currently in registers
- Recursively, any pointers in objects it might access
- Anything else is unreachable, or garbage; memory can be re-used
• Example: stop-and-copy garbage collection
- Memory full? Temporarily pause program, allocate new heap
- Copy all objects pointed to by registers into new heap
- Mark old copied objects as copied, record new location
- Start scanning through new heap. For each pointer:
- Copied already? Adjust pointer to new location
- Not copied? Then copy it and adjust pointer
- Free old heap—program will never access it—and continue
– p. 31/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (205/355)
Concurrent garbage collection
• Idea: Stop & copy, but without the stop
- Mutator thread runs program, collector concurrently does GC
• When collector invoked:
- Protect from-space & unscanned to-space from mutator
- Copy objects in registers into to-space, resume mutator
- All pointers in scanned to-space point into to-space
- If mutator accesses unscanned area: fault, scan page, resume
[Figure: from-space, and to-space split into a scanned area and an unscanned area; the mutator faults on access to unscanned pages.]
– p. 32/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (206/355)
Heap overflow detection
• Many GCed languages need fast allocation
- E.g., in lisp, constantly allocating cons cells
- Allocation can be as often as every 50 instructions
• Fast allocation is just to bump a pointer
char *next_free;
char *heap_limit;

void *alloc (unsigned size) {
    if (next_free + size > heap_limit)   /* 1 */
        invoke_garbage_collector ();     /* 2 */
    char *ret = next_free;
    next_free += size;
    return ret;
}
• But would be even faster to eliminate lines 1 & 2!
– p. 33/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (207/355)
Heap overflow detection 2
• Mark page at end of heap inaccessible
- mprotect (heap_limit, PAGE_SIZE, PROT_NONE);
• Program will allocate memory beyond end of heap
• Program will use memory and fault
- Note: Depends on specifics of language
- But many languages will touch allocated memory immediately
• Invoke garbage collector
- Must now put just allocated object into new heap
• Note: requires more than just resumption
- Faulting instruction must be resumed
- But must resume with different target virtual address
- Doable on most architectures since GC updates registers
– p. 34/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (208/355)
Reference counting
• Seemingly simpler GC scheme:
- Each object has “ref count” of pointers to it
- Increment when pointer set to it
- Decrement when pointer killed
- ref count == 0? Free object
• Works well for hierarchical data structures
- E.g., pages of physical memory
– p. 35/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (209/355)
Reference counting pros/cons
• Circular data structures always have ref count > 0
- No external pointers means lost memory
• Can do manually w/o PL support, but error-prone
• Potentially more efficient than real GC
- No need to halt program to run collector
- Avoids weird unpredictable latencies
• Potentially less efficient than real GC
- With real GC, copying a pointer is cheap
- With reference counting, must write ref count each time
– p. 36/36David Mazieres RSA (2008-2009) Chap 3: Memory Handling (210/355)
Fourth Chapter
I/O subsystem³
  Disks
    I/O subsystem of the OS
    Disk Control Algorithms
  Files and directories
    Basics
    Consistency and Resilience
³From David Mazieres’ course at Stanford.
Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (211/355)
Memory and I/O buses
[Figure: CPU and memory connected through a crossbar; an I/O bus (1880 Mbps and 1056 Mbps links) hangs off it for devices.]
• CPU accesses physical memory over a bus
• Devices access memory over I/O bus with DMA
• Devices can appear to be a region of memory
– p. 1/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (212/355)
Realistic PC architecture
[Figure: CPUs on the front-side bus to the North Bridge, which connects main memory and the AGP bus; the South Bridge hangs off the PCI bus and connects the ISA bus, USB, and IRQ lines routed through the I/O APIC (Advanced Programmable Interrupt Controller).]
– p. 2/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (213/355)
What is memory?
• SRAM – Static RAM
- Like two NOT gates circularly wired input-to-output
- 4–6 transistors per bit, actively holds its value
- Very fast, used to cache slower memory
• DRAM – Dynamic RAM
- A capacitor + gate, holds charge to indicate bit value
- 1 transistor per bit – extremely dense storage
- Charge leaks—need slow comparator to decide if bit 1 or 0
- Must re-write charge after reading, and periodically refresh
• VRAM – “Video RAM”
- Dual ported, can write while someone else reads
– p. 3/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (214/355)
What is I/O bus? E.g., PCI
– p. 4/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (215/355)
Communicating with a device
• Memory-mapped device registers
- Certain physical addresses correspond to device registers
- Load/store gets status/sends instructions – not real memory
• Device memory – device may have memory OS can
write to directly on other side of I/O bus
• Special I/O instructions
- Some CPUs (e.g., x86) have special I/O instructions
- Like load & store, but asserts special I/O pin on CPU
- OS can allow user-mode access to I/O ports with finer
granularity than page
• DMA – place instructions to card in main memory
- Typically then need to “poke” card by writing to register
- Overlaps unrelated computation with moving data over
(typically slower than memory) I/O bus
– p. 5/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (216/355)
DMA buffers
[Figure: a buffer descriptor list (entry lengths 100, 1400, 1500, 1500, 1500, ... bytes) pointing to scattered memory buffers in main memory.]
• Include list of buffer locations in main memory
• Card reads list then accesses buffers (w. DMA)
- Allows for scatter/gather I/O
– p. 6/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (217/355)
Example: Network Interface Card
[Figure: host connected over the I/O bus to the adaptor; the adaptor’s bus interface faces the host, its link interface faces the network link.]
• Link interface talks to wire/fiber/antenna
- Typically does framing, link-layer CRC
• FIFOs on card provide small amount of buffering
• Bus interface logic uses DMA to move packets to and
from buffers in main memory
– p. 7/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (218/355)
Example: IDE disk with DMA
– p. 8/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (219/355)
Driver architecture
• Device driver provides several entry points to kernel
- Reset, ioctl, output, interrupt, read, write, strategy . . .
• How should driver synchronize with card?
- E.g., Need to know when transmit buffers free or packets arrive
- Need to know when disk request complete
• One approach: Polling
- Sent a packet? Loop asking card when buffer is free
- Waiting to receive? Keep asking card if it has packet
- Disk I/O? Keep looping until disk ready bit set
• Disadvantages of polling
- Can’t use CPU for anything else while polling
- Or schedule poll in future and do something else, but then high
latency to receive packet or process disk block
– p. 9/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (220/355)
Interrupt driven devices
• Instead, ask card to interrupt CPU on events
- Interrupt handler runs at high priority
- Asks card what happened (xmit buffer free, new packet)
- This is what most general-purpose OSes do
• Bad under high network packet arrival rate
- Packets can arrive faster than OS can process them
- Interrupts are very expensive (context switch)
- Interrupt handlers have high priority
- In worst case, can spend 100% of time in interrupt handler and
never make any progress – receive livelock
- Best: Adaptive switching between interrupts and polling
• Very good for disk requests
• Rest of today: Disks (network devices in 1.5 weeks)
– p. 10/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (221/355)
Anatomy of a disk
• Stack of magnetic platters
- Rotate together on a central spindle @3,600-15,000 RPM
- Drive speed drifts slowly over time
- Can’t predict rotational position after 100-200 revolutions
• Disk arm assembly
- Arms rotate around pivot, all move together
- Pivot offers some resistance to linear shocks
- Arms contain disk heads–one for each recording surface
- Heads read and write data to platters
– p. 11/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (222/355)
Disk
– p. 12/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (223/355)
Storage on a magnetic platter
• Platters divided into concentric tracks
• A stack of tracks of fixed radius is a cylinder
• Heads record and sense data along cylinders
- Significant fractions of encoded stream for error correction
• Generally only one head active at a time
- Disks usually have one set of read-write circuitry
- Must worry about cross-talk between channels
- Hard to keep multiple heads exactly aligned
– p. 13/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (226/355)
Cylinders, tracks, & sectors
– p. 14/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (227/355)
Disk positioning system
• Move head to specific track and keep it there
- Resist physical shocks, imperfect tracks, etc.
• A seek consists of up to four phases:
- speedup–accelerate arm to max speed or halfway point
- coast–at max speed (for long seeks)
- slowdown–stops arm near destination
- settle–adjusts head to actual desired track
• Very short seeks dominated by settle time (∼1 ms)
• Short (200-400 cyl.) seeks dominated by speedup
- Accelerations of 40g
– p. 15/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (228/355)
Seek details
• Head switches comparable to short seeks
- May also require head adjustment
- Settles take longer for writes than reads
• Disk keeps table of pivot motor power
- Maps seek distance to power and time
- Disk interpolates over entries in table
- Table set by periodic “thermal recalibration”
- 500 ms recalibration every 25 min, bad for AV
• “Average seek time” quoted can be many things
- Time to seek 1/3 of the disk, 1/3 of the time to seek the whole disk, . . .
– p. 16/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (229/355)
Sectors
• Disk interface presents linear array of sectors
- Generally 512 bytes, written atomically
• Disk maps logical sector #s to physical sectors
- Zoning–puts more sectors on longer tracks
- Track skewing–sector 0 pos. varies by track (sequential access speed)
- Sparing–flawed sectors remapped elsewhere
• OS doesn’t know logical to physical sector mapping
- Larger logical sector # difference means larger seek
- Highly non-linear relationship (and depends on zone)
- OS has no info on rotational positions
- Can empirically build table to estimate times
– p. 17/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (231/355)
Disk interface
• Controls hardware, mediates access
• Computer, disk often connected by bus (e.g., SCSI)
- Multiple devices may contend for bus
• Possible disk/interface features:
• Disconnect from bus during requests
• Command queuing: Give disk multiple requests
- Disk can schedule them using rotational information
• Disk cache used for read-ahead
- Otherwise, sequential reads would incur whole revolution
- Cross track boundaries? Can’t stop a head-switch
• Some disks support write caching
- But data not stable–not suitable for all requests
– p. 18/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (232/355)
Disk performance
• Placement & ordering of requests a huge issue
- Sequential I/O much, much faster than random
- Long seeks much slower than short ones
- Power might fail any time, leaving inconsistent state
• Must be careful about order for crashes
- More on this in next two lectures
• Try to achieve contiguous accesses where possible
- E.g., make big chunks of individual files contiguous
• Try to order requests to minimize seek times
- OS can only do this if it has multiple requests to order
- Requires disk I/O concurrency
- High-performance apps try to maximize I/O concurrency
• Next: How to schedule concurrent requests
– p. 23/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (233/355)
Scheduling: FCFS
• “First Come First Served”
- Process disk requests in the order they are received
• Advantages
- Easy to implement
- Good fairness
• Disadvantages
- Cannot exploit request locality
- Increases average latency, decreasing throughput
– p. 24/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (235/355)
Shortest positioning time first (SPTF)
- Always pick request with shortest seek time
• Advantages
- Exploits locality of disk requests
- Higher throughput
• Disadvantages
- Starvation
- Don’t always know what request will be fastest
• Improvement: Aged SPTF
- Give older requests higher priority
- Adjust “effective” seek time with weighting factor:
Teff = Tpos − W · Twait
• Also called Shortest Seek Time First (SSTF)
– p. 25/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (237/355)
“Elevator” scheduling (SCAN)
• Sweep across disk, servicing all requests passed
- Like SPTF, but next seek must be in same direction
- Switch directions only if no further requests
• Advantages
- Takes advantage of locality
- Bounded waiting
• Disadvantages
- Cylinders in the middle get better service
- Might miss locality SPTF could exploit
• CSCAN: Only sweep in one direction
Very commonly used algorithm in Unix
• Also called LOOK/CLOOK in textbook
- (Textbook uses [C]SCAN to mean scan entire disk uselessly)
– p. 26/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (239/355)
VSCAN(r)
• Continuum between SPTF and SCAN
- Like SPTF, but uses a slightly different “effective” positioning time
If request in same direction as previous seek: Teff = Tpos
Otherwise: Teff = Tpos + r · Tmax
- when r = 0, get SPTF, when r = 1, get SCAN
- E.g., r = 0.2 works well
• Advantages and disadvantages
- Those of SPTF and SCAN, depending on how r is set
– p. 27/27Martin Quinson RSA (2008-2009) Chap 4: I/O subsystem (240/355)
CS 140 Lecture: files and directories
Dawson Engler Stanford CS department
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (241/355)
File system fun
◆ File systems = the hardest part of OS
– More papers on FSes than any other single topic
◆ Main tasks of file system:
– don’t go away (ever)
– associate bytes with names (files)
– associate names with each other (directories)
– Can implement file systems on disk, over network, in
memory, in non-volatile RAM (NVRAM), on tape, w/ paper.
– We’ll focus on disk and generalize later
◆ Today: files and directories + a bit of speed.
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (242/355)
The medium is the message
◆ Disk = First thing we’ve seen that doesn’t go away (survives a crash, unlike memory)
– So: where everything important lives. Failure matters.
◆ Slow (ms access vs ns for memory)
◆ Huge (100x bigger than memory)
– How to organize large collection of ad hoc information?
Taxonomies! (Basically FS = general way to make these)
◆ And the gap widens: processor speed ~2x/yr, disk access time ~7%/yr
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (243/355)
Memory vs. Disk

                    Disk                            Memory
Smallest write      sector                          (usually) bytes
Atomic write        sector                          byte, word
Access time         ~10ms (not on a good curve)     nanosecs (faster all the time)
Sequential access   ~20MB/s                         200-1000MB/s
Uniformity          NUMA                            UMA
Crash?              contents not gone               contents gone (“volatile”)
                    (“non-volatile”);               lose + start over = ok
                    lose? corrupt? not ok
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (244/355)
Some useful facts
◆ Disk reads/writes in terms of sectors, not bytes
– read/write single sector or adjacent groups
◆ How to write a single byte? “Read-modify-write”
– read in sector containing the byte
– modify that byte
– write entire sector back to disk
– key: if cached, don’t need to read in
◆ Sector = unit of atomicity.
– sector write done completely, even if crash in middle
» (disk saves up enough momentum to complete)
– larger atomic units have to be synthesized by OS
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (245/355)
The equation that ruled the world.
◆ Approximate time to get data:
seek time(ms) + rotational delay(ms) + bytes / disk bandwidth
◆ So?
– Each touch of disk = tens of ms.
– Touch 50-100 times = 1 *second*
– Can do *billions* of ALU ops in same time.
◆ This fact = Huge social impact on OS research
– Most pre-2000 research based on speed.
– Publishable speedup = ~30%
– Easy to get > 30% by removing just a few accesses.
– Result: more papers on FSes than any other single topic
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (246/355)
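The equation is worth evaluating once. The constants here (10ms seek, 8ms rotational delay, 10MB/s bandwidth) are the illustrative numbers used later in these slides, not properties of any particular disk:

```python
def access_ms(nbytes, seek_ms=10, rot_ms=8, bw_bytes_per_ms=10_000):
    """The slide's equation: seek + rotational delay + transfer.
    Bandwidth of 10MB/s = 10,000 bytes/ms (illustrative values)."""
    return seek_ms + rot_ms + nbytes / bw_bytes_per_ms

one_sector = access_ms(512)         # ~18.05 ms: almost all positioning
big_read = access_ms(50 * 1024)     # ~23.1 ms: 100x the data, ~1.3x the cost
assert round(one_sector, 1) == 18.1
assert big_read < 2 * one_sector
```

The lesson the slide draws falls straight out: positioning dominates, so the wins come from touching the disk fewer times and moving more bytes per touch.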
Files: named bytes on disk
◆ File abstraction:
– user’s view: named sequence of bytes
– FS’s view: collection of disk blocks
– file system’s job: translate name & offset to disk blocks
◆ File operations:
– create a file, delete a file
– read from file, write to file
◆ Want: operations to have as few disk accesses as possible & have minimal space overhead
[diagram: “foo.c” + offset:int → disk addr:int]
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (247/355)
What’s so hard about grouping blocks???
◆ In some sense, the problems we will look at are no different than those in virtual memory
– like page tables, file system meta data are simply data structures used to construct mappings.
– Page table: map virtual page # to physical page #
– file meta data: map byte offset to disk block address
– directory: map name to disk address or file #
[diagram: page table 28 → 33; Unix inode 418 → 8003121; directory foo.c → 44]
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (248/355)
FS vs VM
◆ In some ways problem similar:
– want location transparency, oblivious to size, & protection
◆ In some ways the problem is easier:
– CPU time to do FS mappings not a big deal (= no TLB)
– Page tables deal with sparse address spaces and random access, files are dense (0 .. filesize-1) & ~sequential
◆ In some ways problem is harder:
– Each layer of translation = potential disk access
– Space a huge premium! (But disk is huge?!?!) Reason? Cache space never enough; the amount of data you can get into one fetch never enough.
– Range very extreme: many <10k, some more than GB.
– Implications?
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (249/355)
Problem: how to track file’s data?
◆ Disk management:
– Need to keep track of where file contents are on disk
– Must be able to use this to map byte offset to disk block
◆ Things to keep in mind while designing file structure:
– Most files are small
– Much of the disk is allocated to large files
– Many of the I/O operations are made to large files
– Want good sequential and good random access (what do these require?)
◆ Just like VM: data structures recapitulate cs107
– Arrays, linked lists, trees (of arrays), hash tables.
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (250/355)
Simple mechanism: contiguous allocation
◆ “Extent-based”: allocate files like segmented memory
– When creating a file, make the user pre-specify its length and allocate all space at once
– File descriptor contents: location and size
– Example: IBM OS/360
– Pro: simple, fast access, both sequential and random.
– Cons: external fragmentation, hard to grow files (same problems as segmentation)
[diagram: file a (base=1, len=3), file b (base=5, len=2) — what happens if file c needs 2 sectors???]
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (252/355)
Linked files
◆ Basically a linked list on disk.
– Keep a linked list of all free blocks
– file descriptor contents: a pointer to file’s first block
– in each block, keep a pointer to the next one
– Pro: easy dynamic growth & sequential access, no fragmentation
– Con: terrible random access — how do you find the last block in a? (chase every pointer)
– Examples (sort-of): Alto, TOPS-10, DOS FAT
[diagram: file a (base=1), file b (base=5)]
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (254/355)
Example: DOS FS (simplified)
◆ Uses linked files. Cute: links reside in a fixed-sized “file allocation table” (FAT) rather than in the blocks.
– Still do pointer chasing, but can cache entire FAT so it can be cheap compared to disk access.
[diagram: directory (inode 5) holds a: 6, b: 2; in the FAT (16-bit entries), file a chains through blocks 6 → 4 → 3 → eof and file b through 2 → 1 → eof; remaining entries free]
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (255/355)
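Following a FAT chain is just linked-list traversal through the table. A toy model of the slide's example layout (block numbers and names taken from the diagram; `EOF` is a stand-in for the real FAT end-of-chain marker):

```python
EOF = -1

def fat_blocks(fat, directory, name):
    """Enumerate a file's blocks: the directory gives the first block,
    and fat[b] gives the block after b, until the eof marker."""
    blocks, b = [], directory[name]
    while b != EOF:
        blocks.append(b)
        b = fat[b]              # pointer chasing, but in the cached FAT
    return blocks

# Slide's example: file a occupies blocks 6 -> 4 -> 3, file b 2 -> 1.
fat = {6: 4, 4: 3, 3: EOF, 2: 1, 1: EOF}
directory = {"a": 6, "b": 2}
assert fat_blocks(fat, directory, "a") == [6, 4, 3]
assert fat_blocks(fat, directory, "b") == [2, 1]
```

Random access to block i of a file still costs i table lookups, but since the whole FAT fits in memory, those lookups cost no disk I/O.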
FAT discussion
◆ Entry size = 16 bits
– What’s the maximum size of the FAT?
– Given a 512 byte block, what’s the maximum size of FS?
– One attack: go to bigger blocks. Pro? Con?
◆ Space overhead of FAT is trivial:
– 2 bytes / 512 byte block = ~.4% (Compare to Unix)
◆ Reliability: how to protect against errors?
– Create duplicate copies of FAT on disk.
– State duplication a very common theme in reliability
◆ Bootstrapping: where is root directory?
– Fixed location on disk: [FAT (opt) | FAT | root dir | …]
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (257/355)
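The sizing questions above have a quick back-of-the-envelope answer (ignoring the handful of reserved entry values like free and eof):

```python
# 16-bit FAT entries -> at most 2^16 addressable blocks.
entries = 2 ** 16
fat_bytes = entries * 2         # the whole table is tiny: cache it
max_fs_512 = entries * 512      # max FS size with 512-byte blocks
max_fs_4k = entries * 4096      # bigger blocks raise the ceiling...

assert fat_bytes == 128 * 1024              # 128KB FAT
assert max_fs_512 == 32 * 1024 * 1024       # only 32MB!
assert max_fs_4k == 256 * 1024 * 1024       # 256MB
# ...at the cost of more internal fragmentation per small file.
```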
Indexed files
◆ Each file has an array holding all of its block pointers
– (purpose and issues = those of a page table)
– max file size fixed by array’s size (static or dynamic?)
– create: allocate array to hold all file’s blocks, but allocate on demand using free list
– pro: both sequential and random access easy
– Con: mapping table = large contiguous chunk of space. Same problem we were trying to initially solve.
file a file b
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (259/355)
Indexed files
◆ Issues same as in page tables
– Large possible file size = lots of unused entries
– Large actual size? table needs large contiguous disk chunk
– Solve identically: small regions with index array, index this array with another array, … Downside?
– Example: 2^32-byte max file size with 4K blocks = 2^20 entries, mostly idle for small files
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (260/355)
Multi-level indexed files: ~4.3 BSD
◆ File descriptor (inode) = 14 block pointers + “stuff”
[diagram: the first pointers (ptr 1, ptr 2, …) go straight to data blocks; ptr 13 → an indirect block holding 128 more pointers to data blocks; ptr 14 → a double indirect block holding 128 pointers to indirect blocks]
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (261/355)
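The offset-to-pointer-chain mapping can be sketched directly. The split of 12 direct pointers plus one indirect and one double-indirect is an assumption for illustration (the slide only says 14 pointers total), as is the 128-pointers-per-block figure from the diagram:

```python
NDIRECT = 12      # assumed: 12 direct + 1 indirect + 1 double indirect
PTRS = 128        # pointers per indirect block (from the diagram)
BLK = 512

def block_for_offset(off):
    """Which pointer chain maps byte `off`?
    Returns (level, indices into the pointer arrays along the way)."""
    bn = off // BLK
    if bn < NDIRECT:                       # cheap: one pointer in the inode
        return ("direct", [bn])
    bn -= NDIRECT
    if bn < PTRS:                          # one extra disk read
        return ("indirect", [bn])
    bn -= PTRS                             # two extra disk reads
    return ("double", [bn // PTRS, bn % PTRS])

assert block_for_offset(0) == ("direct", [0])
assert block_for_offset(12 * 512) == ("indirect", [0])
assert block_for_offset((12 + 128) * 512) == ("double", [0, 0])
```

This shape is the point of the design: small files (the common case) stay entirely in the cheap direct zone, while each indirection level costs one more potential disk access.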
Unix discussion
◆ Pro?
– simple, easy to build, fast access to small files
– Maximum file length fixed, but large. (With 4k blks?)
◆ Cons:
– what’s the worst case # of accesses?
– What’s some bad space overheads?
◆ An empirical problem:
– because you allocate blocks by taking them off an unordered freelist, meta data and data get strewn across disk
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (262/355)
More about inodes
◆ Inodes are stored in a fixed sized array
– Size of array determined when disk is initialized and can’t be changed. Array lives in known location on disk. Originally at one side of disk: [inode array | file blocks …]
– Now is smeared across it (why?)
– The index of an inode in the inode array is called an i-number. Internally, the OS refers to files by i-number
– When a file is opened, the inode is brought into memory; when closed, it is flushed back to disk.
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (263/355)
Example: (oversimplified) Unix file system
◆ Want to modify byte 4 in /a/b.c:
◆ read in root directory (inode 2)
◆ lookup a (inode 12); read in
◆ lookup inode for b.c (13); read in
◆ use inode to find blk for byte 4 (blksize = 512, so offset = 0 gives blk 14); read in and modify
[diagram: root directory (inode 2) holds <., 2> <a, 12>; directory a (inode 12) holds <., 12> <.., 2> <b.c, 13>; inode 13 (refcnt=1) has block pointers 14, 0, …, 0; block 14 holds “int main() { …”]
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (264/355)
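The lookup walk above can be sketched as a tiny namei-style resolver. A toy model: `dirs` maps a directory's inode number to its <name, inode#> entries, with the numbers taken from the slide's example:

```python
def namei(path, dirs, root=2):
    """Resolve an absolute path by walking directories:
    look up each component in the current directory's entries.
    Each level would cost (at least) one disk read."""
    ino = root                            # root directory is inode 2
    for part in path.strip("/").split("/"):
        ino = dirs[ino][part]
    return ino

dirs = {2:  {".": 2, "a": 12},                # root (inode 2)
        12: {".": 12, "..": 2, "b.c": 13}}    # /a   (inode 12)
assert namei("/a/b.c", dirs) == 13
```

One directory read per component is why deep paths are expensive, and why real kernels cache both inodes and name translations.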
Directories
◆ Problem:
– “spend all day generating data, come back the next morning, want to use it.” F. Corbato, on why files/dirs invented.
◆ Approach 0: have user remember where on disk the file is.
– (e.g., social security numbers)
◆ Yuck. People want human digestible names
– we use directories to map names to file blocks
◆ Next: What is in a directory and why?
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (265/355)
A short history of time
◆ Approach 1: have a single directory for entire system.
– put directory at known location on disk
– directory contains <name, index> pairs
– if one user uses a name, no one else can
– many ancient PCs work this way. (cf “hosts.txt”)
◆ Approach 2: have a single directory for each user
– still clumsy. And ls on 10,000 files is a real pain
– (many older mathematicians work this way)
◆ Approach 3: hierarchical name spaces
– allow directory to map names to files or other dirs
– file system forms a tree (or graph, if links allowed)
– large name spaces tend to be hierarchical (ip addresses, domain names, scoping in programming languages, etc.)
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (266/355)
Hierarchical Unix
◆ Used since CTSS (1960s)
– Unix picked it up and used it really nicely.
◆ Directories stored on disk just like regular files
– inode contains special flag bit set
– users can read just like any other file
– only special programs can write (why?)
– Inodes at fixed disk location
– File pointed to by the index may be another directory
– makes FS into hierarchical tree (what needed to make a DAG?)
◆ Simple. Plus speeding up file ops = speeding up dir ops!
[diagram: / holds <name, inode#> pairs <afs, 1021> <tmp, 1020> <bin, 1022> <cdrom, 4123> <dev, 1001> <sbin, 1011> …; bin holds awk, chmod, chown]
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (267/355)
Naming magic
◆ Bootstrapping: Where do you start looking?
– Root directory
– inode #2 on the system
– 0 and 1 used for other purposes
◆ Special names:
– Root directory: “/”
– Current directory: “.”
– Parent directory: “..”
– user’s home directory: “~”
◆ Using the given names, only need two operations to navigate the entire name space:
– cd ‘name’: move into (change context to) directory “name”
– ls : enumerate all names in current directory (context)
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (268/355)
Unix example: /a/b/c.c
[diagram — name space: / → a → b → c.c, with “.” and “..” entries at each level; physical organization: inode table (inodes 2, 3, 4, 5, …) on disk; root directory (inode 2) holds <a, 3>; directory a (inode 3) holds <b, 5>; directory b (inode 5) holds <c.c, 14>]
◆ What inode holds file for a? b? c.c?
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (269/355)
Default context: working directory
◆ Cumbersome to constantly specify full path names
– in Unix, each process associated with a “current working directory”
– file names that do not begin with “/” are assumed to be relative to the working directory, otherwise translation happens as before
◆ Shells track a default list of active contexts
– a “search path”
– given a search path { A, B, C } a shell will check in A, then check in B, then check in C
– can escape using explicit paths: “./foo”
◆ Example of locality
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (270/355)
Creating synonyms: Hard and soft links
◆ More than one dir entry can refer to a given file
– Unix stores count of pointers (“hard links”) to inode
– to make: “ln foo bar” creates a synonym (‘bar’) for ‘foo’
◆ Soft links:
– also point to a file (or dir), but object can be deleted from underneath it (or never even exist).
– Unix builds like directories: normal file holds pointed-to name, with special “sym link” bit set
– When the file system encounters a symbolic link it automatically translates it (if possible).
[diagram: foo and bar both point at the same inode (ref = 2); a sym link “baz” is a file holding the name “/bar”]
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (271/355)
Micro-case study: speeding up a FS
◆ Original Unix FS: Simple and elegant:
[disk layout: superblock | inodes | data blocks (512 bytes)]
◆ Nouns:
– data blocks
– inodes (directories represented as files)
– hard links
– superblock (specifies number of blks in FS, counts of max # of files, pointer to head of free list)
◆ Problem: slow
– only gets 20KB/sec (2% of disk maximum) even for sequential disk transfers!
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (272/355)
A plethora of performance costs
◆ Blocks too small (512 bytes)
– file index too large
– too many layers of mapping indirection
– transfer rate low (get one block at a time)
◆ Sucky clustering of related objects:
– Consecutive file blocks not close together
– Inodes far from data blocks
– Inodes for directory not close together
– poor enumeration performance: e.g., “ls”, “grep foo *.c”
◆ Next: how FFS fixes these problems (to a degree)
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (273/355)
Problem 1: Too small block size
◆ Why not just make bigger?

Block size   space wasted   file bandwidth
512          6.9%           2.6%
1024         11.8%          3.3%
2048         22.4%          6.4%
4096         45.6%          12.0%
1MB          99.0%          97.2%

◆ Bigger block increases bandwidth, but how to deal with wastage (“internal fragmentation”)?
– Use idea from malloc: split unused portion.
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (274/355)
Handling internal fragmentation
◆ BSD FFS:
– has large block size (4096 or 8192)
– allow large blocks to be chopped into small ones (“fragments”)
– Used for little files and pieces at the ends of files
◆ Best way to eliminate internal fragmentation?
– Variable sized splits of course
– Why does FFS use fixed-sized fragments (1024, 2048)?
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (275/355)
Prob’ 2: Where to allocate data?
◆ Our central fact:
– Moving disk head expensive
◆ So? Put related data close
– Fastest: adjacent sectors (can span platters)
– Next: in same cylinder (can also span platters)
– Next: in cylinder close by
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (276/355)
Clustering related objects in FFS
◆ Group 1 or more consecutive cylinders into a “cylinder group”
– Key: can access any block in a cylinder without performing a seek. Next fastest place is adjacent cylinder.
– Tries to put everything related in same cylinder group
– Tries to put everything not related in different group (?!)
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (277/355)
Clustering in FFS
◆ Tries to put sequential blocks in adjacent sectors
– (access one block, probably access next)
◆ Tries to keep inode in same cylinder as file data:
– (if you look at inode, most likely will look at data too)
◆ Tries to keep all inodes in a dir in same cylinder group
– (access one name, frequently access many)
– “ls -l”
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (278/355)
What’s a cylinder group look like?
◆ Basically a mini-Unix file system:
[layout: superblock | inodes | data blocks (512 bytes)]
◆ How to ensure there’s space for related stuff?
– Place different directories in different cylinder groups
– Keep a “free space reserve” so can allocate near existing things
– when a file grows too big (1MB) send its remainder to a different cylinder group.
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (279/355)
Prob’ 3: Finding space for related objects
◆ Old Unix (& DOS): Linked list of free blocks
– Just take a block off of the head. Easy.
– Bad: free list gets jumbled over time. Finding adjacent blocks hard and slow
◆ FFS: switch to bit-map of free blocks
– 1010101111111000001111111000101100
– easier to find contiguous blocks.
– Small, so usually keep entire thing in memory
– key: keep a reserve of free blocks. Makes finding a close block easier
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (280/355)
Using a bitmap
◆ Usually keep entire bitmap in memory:
– 4G disk / 4K byte blocks. How big is map?
◆ Allocate block close to block x?
– check for blocks near bmap[x/32]
– if disk almost empty, will likely find one near
– as disk becomes full, search becomes more expensive and less effective.
◆ Trade space for time (search time, file access time)
– keep a reserve (e.g., 10%) of disk always free, ideally scattered across disk
– don’t tell users (df → 110% full)
– N platters = N adjacent blocks
– with 10% free, can almost always find one of them free
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (281/355)
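The “allocate near x” search can be sketched as an outward scan through the free-block bitmap. A toy version (one Python int per bit instead of a packed word array):

```python
def alloc_near(bitmap, x):
    """Allocate a free block close to block x: scan outward from x
    in both directions through the free-block bitmap (1 = free).
    Returns the block number, or None if the disk is full."""
    n = len(bitmap)
    for d in range(n):
        for b in (x + d, x - d):
            if 0 <= b < n and bitmap[b]:
                bitmap[b] = 0           # mark allocated
                return b
    return None

bitmap = [0, 0, 1, 0, 0, 0, 1, 0]
assert alloc_near(bitmap, 4) == 6       # nearest free block wins
assert alloc_near(bitmap, 4) == 2       # next-nearest once 6 is taken
```

The free-space reserve is what keeps this loop short: with ~10% of bits guaranteed set and scattered, the scan almost always terminates within a few words of x.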
So what did we gain?
◆ Performance improvements:
– able to get 20-40% of disk bandwidth for large files
– 10-20x original Unix file system!
– Better small file performance (why?)
◆ Is this the best we can do? No.
◆ Block based rather than extent based
– name contiguous blocks with single pointer and length
– (Linux ext2fs)
◆ Writes of meta data done synchronously
– really hurts small file performance
– make asynchronous with write-ordering (“soft updates”) or logging (the Episode file system, ~LFS)
– play with semantics (/tmp file systems)
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (282/355)
Other hacks?
◆ Obvious:
– Big file cache.
◆ Fact: no rotation delay if get whole track.
– How to use?
◆ Fact: transfer cost negligible.
– Can get 20x the data for only ~5% more overhead
– 1 sector = 10ms + 8ms + 50us (512 / 10MB/s) = 18ms
– 20 sectors = 10ms + 8ms + 1ms = 19ms
– How to use?
◆ Fact: if transfer huge, seek + rotation negligible
– Mendel: LFS. Hoard data, write out MB at a time.
Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (283/355)
Review: FFS background
• 1980s improvement to original Unix FS, which had:
- 512-byte blocks
- Free blocks in linked list
- All inodes at beginning of disk
- Low throughput: 512 bytes per average seek time
• Unix FS performance problems:
- Transfers only 512 bytes per disk access
- Eventually random allocation → 512 bytes / disk seek
- Inodes far from directory and file data
- Within directory, inodes far from each other
• Also had some usability problems:
- 14-character file names a pain
- Can’t atomically update file in crash-proof way
– p. 2/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (284/355)
Review: FFS [McKusick] basics
• Change block size to at least 4K
- To avoid wasting space, use “fragments” for ends of files
• Cylinder groups spread inodes around disk
• Bitmaps replace free list
• FS reserves space to improve allocation
- Tunable parameter, default 10%
- Only superuser can use space when over 90% full
• Usability improvements:
- File names up to 255 characters
- Atomic rename system call
- Symbolic links assign one file name to another
– p. 3/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (285/355)
FFS disk layout
[diagram: disk divided into cylinder groups; each group holds a superblock, bookkeeping information, inodes, and data blocks]
• Each cylinder group has its own:
- Superblock
- Bookkeeping information
- Set of inodes
- Data/directory blocks
– p. 4/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (286/355)
Basic FFS data structures
[diagram: directory contents map name → i-number; the inode holds metadata plus data pointers, an indirect pointer, and a double-indirect pointer, reaching data blocks through indirect blocks]
• Inode is key data structure for each file
- Has permissions and access/modification/inode-change times
- Has link count (# directories containing file); file deleted when 0
- Points to data blocks of file (and indirect blocks)
• By convention, inode #2 always root directory
– p. 5/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (287/355)
FFS superblock
• Superblock contains file system parameters
- Disk characteristics, block size, CG info
- Information necessary to get inode given i-number
• Replicated once per cylinder group
- At shifting offsets, so as to span multiple platters
- Contains magic number to find replicas if 1st superblock dies
• Contains non-replicated “summary info”
- # blocks, fragments, inodes, directories in FS
- Flag stating if FS was cleanly unmounted
– p. 6/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (288/355)
Cylinder groups
• Groups related inodes and their data
• Contains a number of inodes (set when FS created)
- Default one inode per 2K data
• Contains file and directory blocks
• Contains bookkeeping information
- Block map – bit map of available fragments
- Summary info within CG – # free inodes, blocks/frags, files,
directories
- # free blocks by rotational position (8 positions)
[In 1980s, disks weren’t commonly zoned, so this was
reasonable]
– p. 7/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (289/355)
Inode allocation
• Allocate inodes in same CG as directory if possible
• New directories put in new cylinder groups
- Consider CGs with greater than average # free inodes
- Chose CG with smallest # directories
• Within CG, inodes allocated randomly (next free)
- Would like related inodes as close as possible
- OK, because one CG doesn’t have that many inodes
– p. 8/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (290/355)
Fragment allocation
• Allocate space when user writes beyond end of file
• Want last block to be a fragment if not full-size
- If already a fragment, may contain space for write – done
- Else, must deallocate any existing fragment, allocate new
• If no appropriate free fragments, break full block
• Problem: Slow for many small writes
- (Partial) solution: new stat struct field st_blksize
- Tells applications file system block size
- stdio library can buffer this much data
– p. 9/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (291/355)
Block allocation
• Try to optimize for sequential access
- If available, use rotationally close block in same cylinder
- Otherwise, use block in same CG
- If CG totally full, find other CG with quadratic hashing
- Otherwise, search all CGs for some free space
• Problem: Don’t want one file filling up whole CG
- Otherwise other inodes will have data far away
• Solution: Break big files over many CGs
- But large extents in each CGs, so sequential access doesn’t
require many seeks
– p. 10/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (292/355)
Directories
• Inodes like files, but with different type bits
• Contents considered as 512-byte chunks
• Each chunk has direct structure(s) with:
- 32-bit inumber
- 16-bit size of directory entry
- 8-bit file type (NEW)
- 8-bit length of file name
• Coalesce when deleting
- If first direct in chunk deleted, set inumber = 0
• Periodically compact directory chunks
– p. 11/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (293/355)
Updating FFS for the 90s
• No longer want to assume rotational delay
- With disk caches, want data contiguously allocated
• Solution: Cluster writes
- FS delays writing a block back to get more blocks
- Accumulates blocks into 64K clusters, written at once
• Allocation of clusters similar to fragments/blocks
- Summary info
- Cluster map has one bit per 64K cluster, set if it is entirely free
• Also read in 64K chunks when doing read ahead
– p. 12/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (294/355)
Fixing corruption – fsck
• Must run FS check (fsck) program after crash
• Summary info usually bad after crash
- Scan to check free block map, block/inode counts
• System may have corrupt inodes (not simple crash)
- Bad block numbers, cross-allocation, etc.
- Do sanity check, clear inodes with garbage
• Fields in inodes may be wrong
- Count number of directory entries to verify link count; if no entries but count ≠ 0, move to lost+found
- Make sure size and used data counts match blocks
• Directories may be bad
- Holes illegal, . and .. must be valid, . . .
- All directories must be reachable
– p. 13/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (295/355)
Crash recovery permeates FS code
• Have to ensure fsck can recover file system
• Example: Suppose all data written asynchronously
• Delete/truncate a file, append to other file, crash
- New file may reuse block from old
- Old inode may not be updated
- Cross-allocation!
- Often inode with older mtime wrong, but can’t be sure
• Append to file, allocate indirect block, crash
- Inode points to indirect block
- But indirect block may contain garbage
– p. 14/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (296/355)
Ordering of updates
• Must be careful about order of updates
- Write new inode to disk before directory entry
- Remove directory name before deallocating inode
- Write cleared inode to disk before updating CG free map
• Solution: Many metadata updates synchronous
- Of course, this hurts performance
- E.g., untar much slower than disk b/w
• Note: Cannot update buffers on the disk queue
- E.g., say you make two updates to same directory block
- But crash recovery requires first to be synchronous
- Must wait for first write to complete before doing second
– p. 15/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (297/355)
Performance vs. consistency
• FFS crash recoverability comes at huge cost
- Makes tasks such as untar easily 10-20 times slower
- All because you might lose power or reboot at any time
• Even while slowing ordinary usage, recovery slow
- If fsck takes one minute, then disks get 10× bigger . . .
• One solution: battery-backed RAM
- Expensive (requires specialized hardware)
- Often don’t learn battery has died until too late
- A pain if computer dies (can’t just move disk)
- If OS bug causes crash, RAM might be garbage
• Better solution: Advanced file system techniques
- Topic of rest of lecture
– p. 16/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (298/355)
First attempt: Ordered updates
• Must follow three rules in ordering updates:
1. Never write pointer before initializing the structure it points to
2. Never reuse a resource before nullifying all pointers to it
3. Never clear last pointer to live resource before setting new one
• If you do this, file system will be recoverable
• Moreover, can recover quickly
- Might leak free disk space, but otherwise correct
- So start running after reboot, scavenge for space in background
• How to achieve?
- Keep a partial order on buffered blocks
– p. 17/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (299/355)
Ordered updates (continued)
• Example: Create file A
- Block X contains an inode
- Block Y contains a directory block
- Create file A in inode block X, dir block Y
• We say Y → X meaning X must be written before Y
• Can delay both writes, so long as order preserved
- Say you create a second file B in blocks X and Y
- Only have to write each out once for both creates
– p. 18/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (300/355)
Problem: Cyclic dependencies
• Suppose you create file A, unlink file B
- Both files in same directory block & inode block
• Can’t write directory until inode A initialized
- Otherwise, after crash directory will point to bogus inode
- Worse yet, same inode # might be re-allocated
- So could end up with file name A being an unrelated file
• Can’t write inode block until dir entry B cleared
- Otherwise, B could end up with too small a link count
- File could be deleted while links to it still exist
• Otherwise, fsck has to be very slow
- Check every directory entry and inode link count
– p. 19/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (301/355)
Cyclic dependencies illustrated
(a) Original organization: inode block holds inodes #4–#7; directory block holds < −−,#0 >, < B,#5 >, < C,#7 >
(b) Create file A: inode #4 initialized; empty directory entry becomes < A,#4 >
(c) Remove file B: inode #5 freed; entry < B,#5 > becomes < −−,#0 >
– p. 20/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (302/355)
More problems
• Crash might occur between ordered but related writes
- E.g., summary information wrong after block freed
• Block aging
- Block that always has dependency will never get written back
• Solution: “Soft updates” [Ganger]
- Write blocks in any order
- But keep track of dependencies
- When writing a block, temporarily roll back any changes you
can’t yet commit to disk
– p. 21/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (303/355)
Breaking dependencies w. rollback
(a) After metadata updates:
- Main memory: inode block has #4 initialized for A; directory block holds < A,#4 >, < C,#7 >, < −−,#0 > (B removed)
- Disk: inode block unchanged; directory block still holds < −−,#0 >, < C,#7 >, < B,#5 >
• Now say we decide to write directory block. . .
– p. 22/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (304/355)
Breaking dependencies w. rollback
(b) Safe version of directory block written:
- A’s entry is rolled back to < −−,#0 > before the write, so the on-disk directory block holds < −−,#0 >, < C,#7 >, < −−,#0 >
- Main memory still holds < A,#4 >, < C,#7 >, < −−,#0 >
• Note: Directory block still dirty
• But now inode block has no dependencies
• Say we write inode block out. . .
– p. 22/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (305/355)
Breaking dependencies w. rollback
(c) Inode block written:
- On-disk inode block now has #4 initialized; on-disk directory block still holds < −−,#0 >, < C,#7 >, < −−,#0 >
• Now inode block clean (same in memory as on disk)
• But have to write directory block a second time. . .
– p. 22/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (306/355)
Breaking dependencies w. rollback
(d) Directory block written:
- On-disk directory block now holds < A,#4 >, < C,#7 >, < −−,#0 >, matching memory
• All data stably on disk
– p. 22/36Dawson Engler RSA (2008-2009) Chap 4: I/O subsystem (307/355)
Soft updates
• Structure for each updated field or pointer, contains:
- old value
- new value
- list of updates on which this update depends (dependees)
• Can write blocks in any order
- But must temporarily undo updates with pending
dependencies
- Must lock rolled-back version so applications don’t see it
- Choose ordering based on disk arm scheduling
• Some dependencies better handled by postponing in-memory updates
- E.g., Just mark block as free in bitmap after pointer cleared
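A sketch of such an update record and the rollback cycle (hypothetical names; real soft updates track much more per dependency):

```c
/* Hypothetical soft-updates dependency record: each tracked field keeps
 * its old and new values plus the update it depends on. */
struct dep {
    unsigned *field;       /* location of the updated pointer/field      */
    unsigned old_value;    /* value to restore when rolling back         */
    unsigned new_value;    /* value to reapply after the block is written */
    struct dep *dependee;  /* update that must reach disk first          */
    int dependee_on_disk;  /* set once the dependee's block is written   */
};

/* Before writing the block: undo any update whose dependee is not yet
 * stably on disk, so the image written out is always consistent. */
void roll_back(struct dep *d)
{
    if (d->dependee && !d->dependee_on_disk)
        *d->field = d->old_value;
}

/* After the write: reapply the new value in memory.  The block stays
 * dirty and is written again once the dependee is on disk. */
void roll_forward(struct dep *d)
{
    *d->field = d->new_value;
}
```

Once `dependee_on_disk` is set, `roll_back` leaves the new value in place, so the next write of the block finally makes it clean.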
Simple example
• Create zero-length file A
• Depender: Directory entry for A
- Can’t be written until dependees on disk
• Dependees:
- Inode – must be initialized before dir entry written
- Bitmap – must mark inode allocated before dir entry written
• Old value: empty directory entry
• New value: 〈filename A, inode #〉
Operations requiring soft updates (1)
1. Block allocation
- Must write the disk block, the free map, & a pointer
- Disk block & free map must be written before pointer
- Use Undo/redo on pointer (& possibly file size)
2. Block deallocation
- Must write the cleared pointer & free map
- Just update free map after pointer written to disk
- Or just immediately update free map if pointer not on disk
• Say you quickly append block to file then truncate
- You will know pointer to block not written because of the
allocated dependency structure
- So both operations together require no disk I/O
Operations requiring soft updates (2)
3. Link addition (see simple example)
- Must write the directory entry, inode, & free map (if new inode)
- Inode and free map must be written before dir entry
- Use undo/redo on i# in dir entry (ignore entries w. i# 0)
4. Link removal
- Must write directory entry, inode & free map (if nlinks==0)
- Must decrement nlinks only after pointer cleared
- Clear directory entry immediately
- Decrement in-memory nlinks once pointer written
- If directory entry was never written, decrement immediately
(again will know by presence of dependency structure)
• Note: Quick create/delete requires no disk I/O
Soft update issues
• fsync – syscall to flush file changes to disk
- Must also flush directory entries, parent directories, etc.
• unmount – flush all changes to disk on shutdown
- Some buffers must be flushed multiple times to get clean
• Deleting large directory trees is frighteningly fast
- unlink syscall returns even if inode/indir block not cached!
- Dependencies allocated faster than blocks written
- Cap # dependencies allocated to avoid exhausting memory
• Useless write-backs
- Syncer flushes dirty buffers to disk every 30 seconds
- Writing all at once means many dependencies unsatisfied
- Fix syncer to write blocks one at a time
- Fix LRU buffer eviction to know about dependencies
Soft updates fsck
• Split into foreground and background parts
• foreground must be done before remounting FS
- Need to make sure per-cylinder summary info makes sense
- Recompute free block/inode counts from bitmaps – very fast
- Will leave FS consistent, but might leak disk space
• Background does traditional fsck operations
- Can do in background after mounting to recuperate free space
- Must be done in foreground after a media failure
• Difference from traditional FFS fsck:
- May have many, many inodes with non-zero link counts
- Don’t stick them all in lost+found (unless media failure)
An alternative: Journaling
• Reserve a portion of disk for write-ahead log
- Write any metadata operation first to log, then to disk
- After crash/reboot, re-play the log (efficient)
- May re-do an already committed change, but won’t miss anything
• Performance advantage:
- Log is consecutive portion of disk
- Multiple log writes very fast (at disk b/w)
- Consider updates committed when written to log
• Example: delete directory tree
- Record all freed blocks, changed directory entries in log
- Return control to user
- Write out changed directories, bitmaps, etc. in background
(sort for good disk arm scheduling)
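A toy write-ahead log illustrating the commit-then-apply order, and why replay may harmlessly re-do committed work (all names are ours, not any real FS's):

```c
/* Toy write-ahead log: record each metadata update in the log first,
 * then apply it to "disk".  After a crash, replay() reapplies every
 * committed record in order. */
#define LOG_CAP 64

struct logrec { int block; unsigned value; int committed; };

static struct logrec logbuf[LOG_CAP];
static int log_len = 0;

/* Append the intended update to the log before touching disk.  The
 * update counts as committed as soon as the record is in the log. */
void log_update(int block, unsigned value)
{
    logbuf[log_len].block = block;
    logbuf[log_len].value = value;
    logbuf[log_len].committed = 1;   /* commit point */
    log_len++;
}

/* Replay after a crash: reapply every committed record.  Re-doing an
 * already applied update is harmless because records store absolute
 * values, so replay is idempotent. */
void replay(unsigned *disk)
{
    for (int i = 0; i < log_len; i++)
        if (logbuf[i].committed)
            disk[logbuf[i].block] = logbuf[i].value;
}
```

In a real journal the records describe blocks/bytes rather than single words, and the log lives on a reserved, contiguous region of the disk, which is what makes the log writes fast.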
Journaling details
• Must find oldest relevant log entry
- Otherwise, redundant and slow to replay whole log
• Use checkpoints
- Once all records up to log entry N have been processed and
affected blocks stably committed to disk. . .
- Record N to disk either in reserved checkpoint location, or in
checkpoint log record
- Never need to go back before most recent checkpointed N
• Must also find end of log
- Typically circular buffer; don’t play old records out of order
- Can include begin transaction/end transaction records
- Also typically have checksum in case some sectors bad
Case study: XFS [Sweeney]
• Main idea: Think big
- Big disks, files, large # of files, 64-bit everything
- Yet maintain very good performance
• Break disk up into Allocation Groups (AGs)
- 0.5 – 4 GB regions of disk
- New directories go in new AGs
- Within directory, inodes of files go in same AG
- Unlike cylinder groups, AGs too large to minimize seek times
- Unlike cylinder groups, no fixed # of inodes per AG
• Advantages of AGs:
- Parallelize allocation of blocks/inodes on multiprocessor
(independent locking of different free space structures)
- Can use 32-bit block pointers within AGs
(keeps data structures smaller)
B+-trees
[Figure: a B+-tree; interior nodes hold keys and child pointers, leaf nodes hold 〈key, value〉 pairs.]
• XFS makes extensive use of B+-trees
- Indexed data structure stores ordered Keys & Values
- Keys must have an ordering defined on them
- Stores data in blocks for efficient disk access
• For B+-tree w. n items, all operations O(log n):
- Retrieve closest 〈key, value〉 to target key k
- Insert a new 〈key, value〉 pair
- Delete 〈key, value〉 pair
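The leaf-level half of that first operation, binary search for the closest key ≤ k, sketched over a sorted array (a real B+-tree repeats this at every level of the tree):

```c
/* Binary-search a sorted key array for the largest key <= k, i.e. the
 * "closest" entry at or below the target.  O(log n). */
int closest_leq(const int *keys, int n, int k)
{
    int lo = 0, hi = n - 1, best = -1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (keys[mid] <= k) { best = mid; lo = mid + 1; }
        else                  hi = mid - 1;
    }
    return best;   /* index of closest key <= k, or -1 if none */
}
```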
B+-trees continued
• See any algorithms book for details (e.g., [Cormen])
• Some operations on B-tree are complex:
- E.g., insert item into completely full B+-tree
- May require “splitting” nodes, adding new level to tree
- Would be bad to crash & leave B+tree in inconsistent state
• Journal enables atomic complex operations
- First write all changes to the log
- If crash while writing log, incomplete log record will be
discarded, and no change made
- Otherwise, if crash while updating B+-tree, will replay entire
log record and write everything
B+-trees in XFS
• B+-trees are complex to implement
- But once you’ve done it, might as well use everywhere
• Use B+-trees for directories (keyed on filename hash)
- Makes large directories efficient
• Use B+-trees for inodes
- No more FFS-style fixed block pointers
- Instead, B+-tree maps: file offset → 〈start block, # blocks〉
- Ideally file is one or a small number of contiguous extents
- Allows small inodes & no indirect blocks even for huge files
• Use to find inode based on inumber
- High bits of inumber specify AG
- B+-tree in AG maps: starting i# → 〈block #, free-map〉
- So free inodes tracked right in leaf of B+-tree
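A toy stand-in for the offset → extent mapping above, with the B+-tree replaced by a sorted extent array (struct and function names are ours):

```c
/* XFS-style extent mapping sketch: instead of per-block pointers, map
 * file offset -> (start block, length).  All quantities in blocks. */
struct extent { unsigned long off, start, len; };

/* Translate a file block number to a disk block number, or ~0UL if the
 * offset falls in a hole.  A real implementation searches a B+-tree
 * rather than scanning linearly. */
unsigned long map_block(const struct extent *ex, int n, unsigned long fblock)
{
    for (int i = 0; i < n; i++)
        if (fblock >= ex[i].off && fblock < ex[i].off + ex[i].len)
            return ex[i].start + (fblock - ex[i].off);
    return ~0UL;
}
```

A contiguous 1 GB file needs only one such entry, which is why huge files need neither big inodes nor indirect blocks.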
More B+-trees in XFS
• Free extents tracked by two B+-trees
1. start block # → # free blocks
2. # free blocks → start block #
• Use journal to update both atomically & consistently
• #1 allows you to coalesce adjacent free regions
• #1 allows you to allocate near some target
- E.g., when extending file, put next block near previous one
- When first writing to file, put data near inode
• #2 allows you to do best fit allocation
- Leave large free extents for large files
Contiguous allocation
• Ideally want each file contiguous on disk
- Sequential file I/O should be as fast as sequential disk I/O
• But how do you know how large a file will be?
• Idea: delayed allocation
- write syscall only affects the buffer cache
- Allow write into buffers before deciding where to place on disk
- Assign disk space only when buffers are flushed
• Other advantages:
- Short-lived files never need disk space allocated
- mmaped files often written in random order in memory, but will
be written to disk mostly contiguously
- Write clustering: find other nearby stuff to write to disk
Fifth Chapter
Security⁴
⁴ From David Mazières’ course at Stanford.
Dawson Engler RSA (2008-2009) Chap 5: Security (322/355)
View access control as a matrix
• Subjects (processes/users) access objects (e.g., files)
• Each cell of matrix has allowed permissions
Specifying policy
• Manually filling out matrix would be tedious
• Use tools such as groups or role-based access control:
Two ways to slice the matrix
• Along columns:
- Kernel stores list of who can access object along with object
- Most systems you’ve used probably do this
- Examples: Unix file permissions, Access Control Lists (ACLs)
• Along rows:
- Capability systems do this
- More on these later. . .
Example: Unix protection
• Each process has a User ID & one or more group IDs
• System stores with each file:
- User who owns the file and group file is in
- Permissions for user, any one in file group, and other
• Shown by output of ls -l command:
    rwxr-xr-x  dm  cs140  ...  index.html
  (user bits rwx for owner dm; group bits r-x for group cs140; other bits r-x for everyone else)
- User permissions apply to processes with same user ID
- Else, group permissions apply to processes in same group
- Else, other permissions apply
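A minimal sketch of that check order (not kernel code; names are ours). Note the strict precedence: exactly one class of bits applies, decided before looking at what it grants:

```c
/* Unix-style permission check: owner bits if UID matches, else group
 * bits if GID matches, else other bits. */
#define PERM_R 4
#define PERM_W 2
#define PERM_X 1

int allowed(unsigned mode,                      /* e.g. 0754 */
            unsigned f_uid, unsigned f_gid,     /* file owner & group  */
            unsigned p_uid, unsigned p_gid,     /* process UID & GID   */
            unsigned want)                      /* requested PERM_ mask */
{
    unsigned bits;
    if (p_uid == f_uid)      bits = (mode >> 6) & 7;  /* user class  */
    else if (p_gid == f_gid) bits = (mode >> 3) & 7;  /* group class */
    else                     bits = mode & 7;         /* other class */
    return (bits & want) == want;
}
```

A consequence of the precedence: with mode 0477 the owner has only read access, even though everyone else has rwx.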
Unix continued
• Directories have permission bits, too
- Need write perm. on directory to create or delete a file
• Special user root (UID 0) has all privileges
- E.g., Read/write any file, change owners of files
- Required for administration (backup, creating new users, etc.)
• Example:
- drwxr-xr-x 56 root wheel 4096 Apr 4 10:08 /etc
- Directory writable only by root, readable by everyone
- Means non-root users cannot directly delete files in /etc
Non-file permissions in Unix
• Many devices show up in file system
- E.g., /dev/tty1 permissions just like for files
• Other access controls not represented in file system
• E.g., must usually be root to do the following:
- Bind any TCP or UDP port number less than 1,024
- Change the current process’s user or group ID
- Mount or unmount file systems
- Create device nodes (such as /dev/tty1) in the file system
- Change the owner of a file
- Set the time-of-day clock; halt or reboot machine
Example: Login runs as root
• Unix users typically stored in files in /etc
- Files passwd, group, and often shadow or master.passwd
• For each user, files contain:
- Textual username (e.g., “dm”, or “root”)
- Numeric user ID, and group ID(s)
- One-way hash of user’s password: {salt, H(passwd, salt)}
- Other information, such as user’s full name, login shell, etc.
• /usr/bin/login runs as root
- Reads username & password from terminal
- Looks up username in /etc/passwd, etc.
- Computes H(typed password, salt) & checks that it matches
- If matches, sets group ID & user ID for username
- Execute user’s shell with exec system call
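The password check above, in miniature. A toy FNV-1a hash stands in for the real crypt(3); the function names are ours. The point is only that login never stores or compares the cleartext password: it recomputes H(typed, salt) and compares hashes:

```c
/* Toy illustration of {salt, H(passwd, salt)} checking.  FNV-1a is NOT
 * a password hash; real systems use crypt(3)-style slow hashes. */
unsigned long pw_hash(const char *passwd, const char *salt)
{
    unsigned long v = 1469598103934665603UL;          /* FNV offset basis */
    for (const char *p = salt; *p; p++) {
        v ^= (unsigned char)*p;
        v *= 1099511628211UL;                         /* FNV prime */
    }
    for (const char *p = passwd; *p; p++) {
        v ^= (unsigned char)*p;
        v *= 1099511628211UL;
    }
    return v;
}

/* What login does, in miniature: recompute and compare. */
int pw_check(const char *typed, const char *salt, unsigned long stored)
{
    return pw_hash(typed, salt) == stored;
}
```

The salt ensures two users with the same password get different stored hashes, defeating precomputed dictionary tables.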
Setuid
• Some legitimate actions require more privileges than the user’s UID grants
- E.g., how should users change their passwords?
- Stored in root-owned /etc/passwd & /etc/shadow files
• Solution: Setuid/setgid programs
- Run with privileges of file’s owner or group
- Each process has real and effective UID/GID
- real is user who launched setuid program
- effective is owner/group of file, used in access checks
- E.g., /usr/bin/passwd – changes user’s password
- E.g., /bin/su – acquire new user ID with correct password
• Have to be very careful when writing setuid code
- Attackers can run setuid programs any time (no need to wait
for root to run a vulnerable job)
- Attacker controls many aspects of program’s environment
Other permissions
• When can process A send a signal to process B w. kill?
- Allow if sender and receiver have same effective UID
- But need ability to kill processes you launch even if suid
- So allow if real UIDs match, as well
- Can also send SIGCONT w/o UID match if in same session
• Debugger system call ptrace
- Lets one process modify another’s memory
- Setuid gives a program more privilege than invoking user
- So don’t let process ptrace more privileged process
- E.g., Require sender to match real & effective UID of target
- Also disable setuid if ptraced target calls exec
- Exception: root can ptrace anyone
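The kill() rules at the top of this slide (omitting the same-session SIGCONT case) amount to a small predicate. This is a simplification of what real kernels check, with hypothetical names:

```c
/* Simplified signal-permission check: root may signal anyone;
 * otherwise allow if effective UIDs match, or if real UIDs match
 * (so you can kill setuid programs you launched yourself).
 * The same-session SIGCONT exception is omitted. */
struct cred { unsigned ruid, euid; };

int may_signal(struct cred sender, struct cred target)
{
    if (sender.euid == 0)
        return 1;                               /* root */
    return sender.euid == target.euid           /* effective UIDs match */
        || sender.ruid == target.ruid;          /* or real UIDs match   */
}
```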
A security hole
• Even without root or setuid, attackers can trick root
owned processes into doing things. . .
• Example: Want to clear unused files in /tmp
• Every night, automatically run this command as root:
    find /tmp -atime +3 -exec rm -f -- {} \;
• find identifies files not accessed in 3 days
- executes rm, replacing {} with file name
• rm -f -- path deletes file path
- Note “--” prevents path from being parsed as option
• What’s wrong here?
An attack
Attacker: creat (“/tmp/etc/passwd”)
find: readdir (“/tmp”) → “etc”
find: lstat (“/tmp/etc”) → DIRECTORY
Attacker: rename (“/tmp/etc” → “/tmp/x”)
Attacker: symlink (“/etc”, “/tmp/etc”)
find: readdir (“/tmp/etc”) → “passwd”
rm: unlink (“/tmp/etc/passwd”)
• Time-of-check-to-time-of-use (TOCTTOU) bug
- find checks that /tmp/etc is not symlink
- But meaning of file name changes before it is used
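One way to close this window on modern systems is to stop re-resolving the path: open the directory itself with O_NOFOLLOW|O_DIRECTORY, then delete relative to that descriptor. This is a hedged sketch (function name is ours); the *at() calls are POSIX.1-2008 and were not available when tools like the find example above were written:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Delete name inside dir without trusting the path twice.  If an
 * attacker swaps dir for a symlink, the open fails (O_NOFOLLOW)
 * instead of following the link; unlinkat() then resolves name
 * relative to the directory we actually opened. */
int delete_in_dir(const char *dir, const char *name)
{
    int dfd = open(dir, O_RDONLY | O_DIRECTORY | O_NOFOLLOW);
    if (dfd < 0)
        return -1;            /* missing, or dir was really a symlink */
    int r = unlinkat(dfd, name, 0);
    close(dfd);
    return r;
}
```

The check (open) and the use (unlinkat) now refer to the same kernel object, so no rename/symlink race can change what "dir" means in between.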
xterm command
• Provides a terminal window in X-windows
• Used to run with setuid root privileges
- Requires kernel pseudo-terminal (pty) device
- Required root privs to change ownership of pty to user
- Also writes protected utmp/wtmp files to record users
• Had feature to log terminal session to file
fd = open (logfile, O_CREAT|O_WRONLY|O_TRUNC, 0666);
/* ... */
xterm command
• Provides a terminal window in X-windows
• Used to run with setuid root privileges
- Requires kernel pseudo-terminal (pty) device
- Required root privs to change ownership of pty to user
- Also writes protected utmp/wtmp files to record users
• Had feature to log terminal session to file
if (access (logfile, W_OK) < 0)
    return ERROR;
fd = open (logfile, O_CREAT|O_WRONLY|O_TRUNC, 0666);
/* ... */
• access call avoids dangerous security hole
- Does permission check with real, not effective UID
- Wrong: Another TOCTTOU bug
An attack
Attacker: creat (“/tmp/X”)
xterm: access (“/tmp/X”) → OK
Attacker: unlink (“/tmp/X”)
Attacker: symlink (“/tmp/X” → “/etc/passwd”)
xterm: open (“/tmp/X”)
• Attacker changes /tmp/X between check and use
- xterm unwittingly overwrites /etc/passwd
- Time-of-check-to-time-of-use (TOCTTOU) bug
• OpenBSD man page: “CAVEATS: access() is a
potential security hole and should never be used.”
SSH configuration files
• SSH 1.2.12 – secure login program, runs as root
- Needs to bind TCP port under 1,024 (privileged operation)
- Needs to read client private key (for host authentication)
• Also needs to read & write files owned by user
- Read configuration file ~/.ssh/config
- Record server keys in ~/.ssh/known_hosts
• Author wanted to avoid TOCTTOU bugs:
- First binds socket & reads root-owned secret key file
- Then drops all privileges before accessing user files
- Idea: avoid using any user-controlled arguments/files until
you have no more privileges than the user
Trick question: ptrace bug
• Dropping privs allows user to “debug” SSH
- Depends on OS, but at the time several had ptrace
implementations that made SSH vulnerable
• Once in debugger
- Could use privileged port to connect anywhere
- Could read secret host key from memory
- Could overwrite local user name to get privs of other user
• The fix: restructure into 3 processes!
- Perhaps overkill, but really wanted to avoid problems
A linux security hole
• Some programs acquire then release privileges
- E.g., the setuid su program becomes user if password correct
• Consider the following:
- A and B unprivileged processes owned by attacker
- A ptraces B
- A executes “su user” to its own identity
- While su is superuser, B execs su root
(A is superuser, so this is not disabled)
- A types password, gets shell, and is attached to su root
- Can manipulate su root’s memory to get root shell
Editorial
• Previous examples show two limitations of Unix
• Many OS security policies subjective not objective
- When can you signal/debug process? Re-bind network port?
- Rules for non-file operations somewhat incoherent
- Even some file rules weird (Creating hard links to files)
• Correct code is much harder to write than incorrect
- Delete file without traversing symbolic link
- Read SSH configuration file (requires 3 processes??)
- Write mailbox owned by user in dir owned by root/mail
• Don’t just blame the application writers
- Must also blame the interfaces they program to
Another security problem [Hardy]
• Setting: A multi-user time sharing system
- This time it’s not Unix
• Wanted fortran compiler to keep statistics
- Modified compiler /sysx/fort to record stats in /sysx/stat
- Gave compiler “home files license”, which allows writing to
anything in /sysx (kind of like Unix setuid)
• What’s wrong here?
A confused deputy
• Attacker could overwrite any files in /sysx
- System billing records kept in /sysx/bill got wiped
- Probably command like fort -o /sysx/bill file.f
• Is this a compiler bug?
- Original implementors did not anticipate extra rights
- Can’t blame them for unchecked output file
• Compiler is a “confused deputy”
- Inherits privileges from invoking user (e.g., read file.f)
- Also inherits from home files license
- Which master is it serving on any given system call?
- OS doesn’t know if it just sees open ("/sysx/bill", ...)
Capabilities
• Slicing matrix along rows yields capabilities
- E.g., For each process, store a list of objects it can access
- Process explicitly invokes particular capabilities
• Can help avoid confused deputy problem
- E.g., Must give compiler an argument that both specifies the
output file and conveys the capability to write the file
(think about passing a file descriptor, not a file name)
- So compiler uses no ambient authority to write file
• Three general approaches to capabilities:
- Hardware enforced (Tagged architectures like M-machine)
- Kernel-enforced (Hydra, KeyKOS)
- Self-authenticating capabilities (like Amoeba)
Hydra
• Machine & programming env. built at CMU in ’70s
• OS enforced object modularity with capabilities
- Could only call object methods with a capability
• Amplification let methods manipulate objects
- A method executes with the capability list of the object, not the
caller
• Template methods take capabilities from caller
- So method can access objects specified by caller
KeyKOS
• Capability system developed in the early 1980s
• Goal: Extreme security, reliability, and availability
• Structured as a “nanokernel”
- Kernel proper only 20,000 lines of C, 100KB footprint
- Avoids many problems with traditional kernels
- Traditional OS interfaces implemented outside the kernel
(including binary compatibility with existing OSes)
• Basic idea: No privileges other than capabilities
- Means kernel provides purely objective security mechanism
- As objective as pointers to objects in OO languages
- In fact, partition system into many processes akin to objects
Unique features of KeyKOS
• Single-level store
- Everything is persistent: memory, processes, . . .
- System periodically checkpoints its entire state
- After power outage, everything comes back up as it was
(may just lose the last few characters you typed)
• “Stateless” kernel design only caches information
- All kernel state reconstructible from persistent data
• Simplifies kernel and makes it more robust
- Kernel never runs out of space in memory allocation
- No message queues, etc. in kernel
- Run out of memory? Just checkpoint system
KeyKOS capabilities
• Referred to as “keys” for short
• Types of keys:
- devices – Low-level hardware access
- pages – Persistent page of memory (can be mapped)
- nodes – Container for 16 capabilities
- segments – Pages & segments glued together with nodes
- meters – right to consume CPU time
- domains – a thread context
• Anyone possessing a key can grant it to others
- But creating a key is a privileged operation
- E.g., requires “prime meter” to divide it into submeters
Capability details
• Each domain has a number of key “slots”:
- 16 general-purpose key slots
- address slot – contains segment with process VM
- meter slot – contains key for CPU time
- keeper slot – contains key for exceptions
• Segments also have an associated keeper
- Process that gets invoked on invalid reference
• Meter keeper (allows creative scheduling policies)
• Calls generate return key for calling domain
- (Not required–other forms of message don’t do this)
KeyNIX: UNIX on KeyKOS
• “One kernel per process” architecture
- Hard to crash kernel
- Even harder to crash system
• A process’s kernel is its keeper
- Unmodified Unix binary makes Unix syscall
- Invalid KeyKOS syscall, transfers control to Unix keeper
• Of course, kernels need to share state
- Use shared segment for process and file tables
KeyNIX overview
[Figure: KeyNIX architecture diagram, not reproduced.]
KeyNIX I/O
• Every file is a different process
- Elegant, and fault isolated
- Small files can live in a node, not a segment
- Makes the namei() function very expensive
• Pipes require queues
- This turned out to be complicated and inefficient
- Interaction with signals complicated
• Other OS features perform very well, though
- E.g., fork is six times faster than Mach 2.5
Self-authenticating capabilities
• Every access must be accompanied by a capability
- For each object, OS stores random check value
- Capability is: {Object, Rights, MAC(check, Rights)}
• OS gives processes capabilities
- Process creating resource gets full access rights
- Can ask OS to generate capability with restricted rights
• Makes sharing very easy in distributed systems
• To revoke rights, must change check value
- Need some way for everyone else to reacquire capabilities
• Hard to control propagation
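A toy version of such a capability (FNV-1a standing in for a real MAC, and all names ours). Holding a valid MAC is the proof of authority; restricting rights without the secret check value is impossible because the MAC covers them:

```c
/* Toy self-authenticating capability: {object, rights, MAC(check, rights)}.
 * FNV-1a is illustrative only, NOT a cryptographic MAC. */
struct cap { unsigned obj, rights; unsigned long tag; };

static unsigned long cap_mac(unsigned long check, unsigned rights)
{
    unsigned long v = 1469598103934665603UL ^ check;
    v *= 1099511628211UL;
    v ^= rights;
    v *= 1099511628211UL;
    return v;
}

/* Server side: mint a capability for an object using the object's
 * secret check value. */
struct cap mint(unsigned obj, unsigned rights, unsigned long check)
{
    struct cap c = { obj, rights, cap_mac(check, rights) };
    return c;
}

/* Verify a presented capability: the tag must match (not forged) and
 * its rights must cover the requested operation. */
int verify(struct cap c, unsigned long check, unsigned want)
{
    return c.tag == cap_mac(check, c.rights)
        && (c.rights & want) == want;
}
```

Revocation is the weak point the slide mentions: the server can only change the check value, which invalidates every outstanding capability for the object at once.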
Amoeba
• A distributed OS, based on capabilities of form:
- server port, object ID, rights, check
• Any server can listen on any machine
- Server port is hash of secret
- Kernel won’t let you listen if you don’t know secret
• Many types of object have capabilities
- files, directories, processes, devices, servers (E.g., X windows)
• Separate file and directory servers
- Can implement your own file server, or store other object types
in directories, which is cool
• Check is like a secret password for the object
- Server records check value for capabilities w. all rights
- Restricted capability’s check is hash of old check, rights
Limitations of capabilities
• IPC performance a losing battle with CPU makers
- CPUs optimized for “common” code, not context switches
- Capability systems usually involve many IPCs
• Capability programming model never took off
- Requires changes throughout application software
- Call capabilities “file descriptors” or “Java pointers” and
people will use them
- But discipline of pure capability system challenging so far
- People sometimes quip that capabilities are an OS concept of
the future and always will be