10/25/2007 ecs150, Fall 2007 1
UC Davis, ecs150 Fall 2007
ecs150 Fall 2007: Operating System #4: Memory Management (Chapter 5)
Dr. S. Felix Wu
Computer Science Department
University of California, Davis
http://www.cs.ucdavis.edu/~wu/
[Diagram: process address space (text, data, BSS, user stack, args/env, kernel data) backed by a file volume with executable programs]
Fetches for clean text or data are typically fill-from-file.
Modified (dirty) pages are pushed to backing store (swap) on eviction.
Paged-out pages are fetched from backing store when needed.
Initial references to user stack and BSS are satisfied by zero-fill on demand.
Logical vs. Physical Address
The concept of a logical address space that is bound to a separate physical address space is central to proper memory management.
– Logical address: generated by the CPU; also referred to as virtual address.
– Physical address: address seen by the memory unit.
Logical and physical addresses are the same in compile-time and load-time address-binding schemes; logical (virtual) and physical addresses differ in execution-time address-binding schemes.
Memory-Management Unit (MMU)
Hardware device that maps virtual to physical address.
In the MMU scheme, the value in the relocation register is added to every address generated by a user process at the time it is sent to memory.
The user program deals with logical addresses; it never sees the real physical addresses.
[Diagram: CPU sends a virtual address to the MMU; the MMU sends the corresponding physical address to memory; data flows back to the CPU]
Paging: Page and Frame
Logical address space of a process can be noncontiguous; the process is allocated physical memory whenever the latter is available.
Divide physical memory into fixed-sized blocks called frames (size is a power of 2, between 512 bytes and 8192 bytes).
Divide logical memory into blocks of the same size called pages.
Keep track of all free frames.
To run a program of size n pages, find n free frames and load the program.
Set up a page table to translate logical to physical addresses.
Internal fragmentation.
Address Translation Architecture
Address Translation Scheme
Address generated by the CPU is divided into:
– Page number (p): used as an index into a page table which contains the base address of each page in physical memory.
– Page offset (d): combined with the base address to define the physical memory address that is sent to the memory unit.
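The split into (p, d) is just bit manipulation. A minimal sketch, assuming the 4K pages used in these slides and a simple flat page table (the function names are illustrative):

```c
#include <stdint.h>

#define PAGE_SHIFT 12                        /* 4K pages: offset is the low 12 bits */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

/* Split a 32-bit logical address into page number (p) and offset (d). */
static inline uint32_t page_number(uint32_t va) { return va >> PAGE_SHIFT; }
static inline uint32_t page_offset(uint32_t va) { return va & PAGE_MASK; }

/* Translate via a flat page table: frame base combined with the offset. */
static inline uint32_t translate(const uint32_t *page_table, uint32_t va)
{
    return (page_table[page_number(va)] << PAGE_SHIFT) | page_offset(va);
}
```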
Virtual Memory
MAPPING in MMU
shared by all user processes
kernel
[Diagram: an executable file (header, program sections: text, idata, wdata, symbol table, etc.) is mapped into process segments (text, data, BSS, user stack, args/env, kernel data) in a big virtual memory; virtual-to-physical translations (MAPPING in MMU) place these onto the small set of physical page frames, with page fetch from the executable file and pageout/eviction to backing storage]
How to represent?
Paging
Advantages? Disadvantages?
Fragmentation
External Fragmentation – total memory space exists to satisfy a request, but it is not contiguous.
Internal Fragmentation – allocated memory may be slightly larger than requested memory; this size difference is memory internal to a partition, but not being used.
Reduce external fragmentation by compaction
– Shuffle memory contents to place all free memory together in one large block.
– Compaction is possible only if relocation is dynamic, and is done at execution time.
– I/O problem: latch the job in memory while it is involved in I/O, or do I/O only into OS buffers.
Page size? Page table size?
32-bit address bus: 2^32 bytes of addressable space
1 page = 4K bytes = 2^12 bytes
256M bytes of main memory
2^32 / 2^12 = 2^20 pages
Page table: 2^20 entries x 4 bytes = 2^22 bytes = 4 MB
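The same sizing arithmetic, written out as a check (the 4-byte entry size matches the 4 MB figure on the slide; the helper names are made up for illustration):

```c
#include <stdint.h>

/* Number of pages in an address space: 2^(addr_bits - page_shift). */
static uint64_t num_pages(unsigned addr_bits, unsigned page_shift)
{
    return 1ull << (addr_bits - page_shift);
}

/* Flat page table size in bytes, given the bytes per entry. */
static uint64_t table_bytes(unsigned addr_bits, unsigned page_shift,
                            uint64_t entry_bytes)
{
    return num_pages(addr_bits, page_shift) * entry_bytes;
}
```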
Page Table Entry
Fields: caching disabled | referenced | modified | protection | present/absent | page frame number
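The fields above can be pictured as a C bitfield. This is only a sketch: real MMUs fix the exact bit positions and widths, and the layout below is illustrative, not any particular architecture's PTE:

```c
#include <stdint.h>

/* One possible 32-bit PTE layout holding the fields named above. */
struct pte {
    uint32_t frame      : 20;  /* page frame number */
    uint32_t present    : 1;   /* present/absent */
    uint32_t protection : 3;   /* e.g. read/write/execute bits */
    uint32_t modified   : 1;   /* dirty: page was written */
    uint32_t referenced : 1;   /* set by hardware on access */
    uint32_t nocache    : 1;   /* caching disabled */
    uint32_t unused     : 5;
};
```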
Free Frames
Before allocation After allocation
Page Faults
Page table access
Load the missing page (replace one)
Re-do the page table access
How large is the page table?
– 2^32 address space, 4K (2^12) page size.
– How many entries? 2^20 entries.
– If 2^46, you need to access both a segment table and a page table…. (2^26 GB or 2^16 TB)
Cache the page table!!
Page Faults
Hardware Trap
– /usr/src/sys/i386/i386/trap.c
VM page fault handler vm_fault()
– /usr/src/sys/vm/vm_fault.c
/usr/src/sys/vm/vm_map.h
How to implement?
On the hard disk or Cache – Page Faults
Implementation of Page Table
Page table is kept in main memory.
Page-table base register (PTBR) points to the page table.
Page-table length register (PTLR) indicates the size of the page table.
In this scheme every data/instruction access requires two memory accesses: one for the page table and one for the data/instruction.
The two-memory-access problem can be solved by a special fast-lookup hardware cache called associative memory or translation look-aside buffer (TLB).
Two Issues
Virtual address access overhead
The size of the page table
TLB (Translation Lookaside Buffer)
Associative Memory:– expensive, but fast -- parallel searching
TLB: select a small number of page table entries and store them in TLB
virt-page  modified  protection  page frame
140        1         RW          31
20         0         RX          38
130        1         RW          29
129        1         RW          62
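A software model of the lookup the TLB hardware does in parallel (here the "parallel" compare is just a loop; struct and function names are invented for illustration):

```c
#include <stdint.h>
#include <stddef.h>

/* One entry of a tiny fully associative TLB. */
struct tlb_entry {
    uint32_t virt_page;
    uint32_t frame;
    int      valid;
};

/* Compare the virtual page number against every valid entry.
 * Returns the frame number, or -1 on a TLB miss (fall back to the
 * page table in memory). */
static int tlb_lookup(const struct tlb_entry *tlb, size_t n, uint32_t vpage)
{
    for (size_t i = 0; i < n; i++)
        if (tlb[i].valid && tlb[i].virt_page == vpage)
            return (int)tlb[i].frame;
    return -1;
}
```

Using the four entries from the table above, looking up virtual page 140 yields frame 31, and an unlisted page misses.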
Associative Memory
Associative memory: parallel search over (Page #, Frame #) pairs
Address translation (A´, A´´)
– If A´ is in an associative register, get the frame # out.
– Otherwise get the frame # from the page table in memory.
Paging Hardware With TLB
TLB Miss versus Page Fault
Hardware or Software
TLB is part of the MMU (hardware):
– Automated page table entry (pte) update
– OS handling TLB misses
Why software????
– Reduce HW complexity
– Flexibility in Paging/TLB content management for different applications
Inverted Page Table
2^64 address space with 4K pages
– page table: 2^52 entries ~ a million gigabytes
Inverted Page Table (iPT)
2^64 address space with 4K pages
– page table: 2^52 entries ~ a million gigabytes
One entry per page of real memory.
– 128 MB with 4K pages ==> 2^15 entries
Disadvantage:
– For every memory access, we need to search the whole paging hash list.
Page Table
Inverted Page Table
Brainstorming
How to design an “inverted page table” such that we can do it “faster”?
Hashed Page Tables
Common in address spaces > 32 bits.
The virtual page number is hashed into a page table. This page table contains a chain of elements hashing to the same location.
Virtual page numbers are compared in this chain searching for a match. If a match is found, the corresponding physical frame is extracted.
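The chain walk described above, as a minimal sketch (bucket count and names are illustrative; a real kernel would also lock and would insert/delete entries):

```c
#include <stdint.h>
#include <stddef.h>

/* One element of a hash chain: pages whose numbers hash to the
 * same bucket are linked together. */
struct hpt_elem {
    uint32_t vpage;          /* virtual page number */
    uint32_t frame;          /* physical frame number */
    struct hpt_elem *next;
};

#define NBUCKETS 1024        /* illustrative table size */

/* Hash the virtual page number, then walk the chain comparing
 * virtual page numbers. Returns the frame, or -1 if unmapped. */
static int hpt_lookup(struct hpt_elem *buckets[NBUCKETS], uint32_t vpage)
{
    for (struct hpt_elem *e = buckets[vpage % NBUCKETS]; e; e = e->next)
        if (e->vpage == vpage)
            return (int)e->frame;
    return -1;               /* not mapped: page fault */
}
```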
[Diagram: the virtual page# is hashed to a bucket; the chain of (virtual page#, physical page#) entries is searched for a match]
iPT/Hash Performance
Issues
still do TLB (hw/sw)
– if we can hit the TLB, we do NOT need to access the iPT and hash.
caching the iPT and/or Hash Table??
– any benefits under regular on-demand caching schemes?
hardware support for iPT/Hash
Paging  Virtual Memory
CPU addressability: 32 bits -- 2^32 bytes!!
– 2^32 is 4 Gigabytes (un-segmented).
– Pentium II can support up to 2^46 (64 Tera) bytes: 32 bits address, 14 bits segment#, 2 bits protection.
Very large addressable space (64 bits), and relatively smaller physical memory available…
– Let the programs/processes enjoy a much larger virtual space!!
VM with 1 Segment
MAPPING in MMU
Eventually…
MAPPING in MMU
???
On-Demand Paging
On-demand paging:
– we have to kick someone out…. But which one?
– Triggered by page faults.
Loading in advance (Predictive/Proactive):
– try to avoid page faults at all.
Demand Paging
On a page fault the OS:
– Save user registers and process state.
– Determine that the exception was a page fault.
– Find a free page frame.
– Issue a read from disk into the free page frame.
– Wait for seek and latency, and transfer the page into memory.
– Restore process state and resume execution.
Page Replacement
1. Find the location of the desired page on disk.
2. Find a free frame:
- If there is a free frame, use it.
- If there is no free frame, use a page replacement algorithm to select a victim frame.
3. Read the desired page into the (newly) free frame. Update the page and frame tables.
4. Restart the process.
Page Replacement Algorithms
Goal: minimize the page-fault rate
Page Replacement
Optimal
FIFO
Least Recently Used (LRU)
Not Recently Used (NRU)
Second Chance
Clock Paging
Optimal
Estimate when each page will next be referenced in the future.
Select the page whose next reference is farthest away.
LRU
an implementation issue
– need to keep track of the last modification or access time for each page
– timestamp: 32 bits
How to implement LRU efficiently?
LRU Approximation
Reference bit (one-bit timestamp)
– With each page associate a bit, initially = 0.
– When the page is referenced, the bit is set to 1.
– Replace a page whose bit is 0 (if one exists). We do not know the order, however.
Second chance
– Needs the reference bit.
– Clock replacement.
– If the page to be replaced (in clock order) has reference bit = 1, then: set the reference bit to 0, leave the page in memory, and consider the next page (in clock order), subject to the same rules.
NRU
Not Recently Used
Clear the bits every 20 milliseconds.
Bits: referenced, modified
What is the problem??
Page Replacement??
Efficient approximation of LRU
No periodic refreshing
How to do that?
Second Chance/Clock Paging
Do not need any “periodic” bit clearing
Have a “current candidate pointer” moving along the “clock”
Choose the first page with zero flag(s)
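The hand movement above can be sketched in a few lines. This is a minimal model (names invented): the hand sweeps, clearing set reference bits, and evicts the first page whose bit is already zero.

```c
#include <stddef.h>

struct frame {
    int page;         /* page currently in this frame */
    int referenced;   /* reference bit, set on access */
};

/* Clock (second chance) victim selection. Returns the index of the
 * victim frame and leaves *hand pointing just past it. */
static size_t clock_victim(struct frame *frames, size_t n, size_t *hand)
{
    for (;;) {
        size_t i = *hand;
        *hand = (*hand + 1) % n;
        if (frames[i].referenced == 0)
            return i;                 /* second chance used up: evict */
        frames[i].referenced = 0;     /* give it a second chance */
    }
}
```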
Clock Pages
[Diagram sequence: six slides animate clock replacement over pages A, B, C, D, E, F arranged on a clock face. The hand sweeps past pages with the reference bit set, clearing it, and evicts the first page with a zero bit: G replaces A, then H replaces C, then I replaces B.]
Evaluation
Metric: the page-fault rate.
Evaluate an algorithm by running it on a particular string of memory references (reference string) and computing the number of page faults on that string.
In all our examples, the reference string is
2, 3, 2, 1, 5, 2, 4, 5, 3, 2, 5, 2.
10/25/2007 ecs150, Fall 2007 63
UCDavis, ecs150Fall 2007
FIFO, 3 physical pages
[Diagram: step-by-step FIFO trace on 2, 3, 2, 1, 5, 2, 4, 5, 3, 2, 5, 2: the frames fill with 2, 3, 1; then 5 replaces 2, 2 replaces 3, and so on, marking the page faults at each step.]
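The FIFO trace can be reproduced mechanically. A small simulator (the function name is made up); for the lecture's reference string with 3 frames it counts 9 faults:

```c
#include <stddef.h>

/* Count page faults under FIFO replacement with nframes frames. */
static int fifo_faults(const int *refs, size_t nrefs, size_t nframes)
{
    int frames[16];                  /* assumes nframes <= 16 */
    size_t used = 0, next = 0;       /* next = oldest frame (FIFO victim) */
    int faults = 0;

    for (size_t r = 0; r < nrefs; r++) {
        int hit = 0;
        for (size_t f = 0; f < used; f++)
            if (frames[f] == refs[r]) { hit = 1; break; }
        if (hit)
            continue;
        faults++;
        if (used < nframes) {
            frames[used++] = refs[r];        /* fill an empty frame */
        } else {
            frames[next] = refs[r];          /* evict the oldest page */
            next = (next + 1) % nframes;
        }
    }
    return faults;
}
```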
Page Replacement
2, 3, 2, 1, 5, 2, 4, 5, 3, 2, 5, 2
OPT/LRU/FIFO/CLOCK and 3 pages: how many page faults?
Thrashing
If a process does not have “enough” pages, the page-fault rate is very high. This leads to:
– low CPU utilization.
– operating system thinks that it needs to increase the degree of multiprogramming.
– another process added to the system.
Thrashing: a process is busy swapping pages in and out.
Thrashing
Why does paging work? Locality model
– Process migrates from one locality to another.
– Localities may overlap.
Why does thrashing occur?
size of locality > total memory size
How to Handle Thrashing?
Brainstorming!!
Locality In A Memory-Reference Pattern
FreeBSD VM
/usr/src/sys/vm/vm_map.h
How to implement?
Text
Initialized Data (Copy on Write)
Uninitialized Data (Zero-Fill, Anonymous Object)
Stack (Zero-Fill, Anonymous Object)
Page-level Allocation
• Kernel maintains a list of free physical pages.
• Two principal clients: the paging system and the kernel memory allocator.
Memory allocation
[Diagram: the page-level allocator hands physical pages to two clients: the kernel memory allocator (network buffers, data structures, temp storage) and the paging system (process pages, buffer cache)]
[Diagram: kernel memory layout: kernel text, initialized/un-initialized data, kernel malloc arena, network buffers, kernel I/O]
Why Kernel MA is special?
Typical request is for less than 1 page.
Originally, the kernel used statically allocated, fixed-size tables, but that is too limited.
Kernel requires a general purpose allocator for both large and small chunks of memory.
Handles memory requests from kernel modules, not user-level applications
– pathname translation routine, STREAMS or I/O buffers, zombie structures, table entries (proc structure, etc.)
KMA Requirements
utilization factor = requested/required memory
– Useful metric that factors in fragmentation.
– 50% considered good
KMA must be fast since it is used extensively
Simple API similar to malloc and free.
• desirable to free portions of allocated space; this is different from the typical user-space malloc and free interface
Properly aligned allocations: for example 4-byte alignment
Support burst-usage patterns
Interaction with paging system: able to borrow pages from the paging system if running low
KMA Schemes
Resource Map Allocator
Simple Power-of-Two Free Lists
The McKusick-Karels Allocator
– FreeBSD
The Buddy System
– Linux
SVR4 Lazy Buddy Allocator
Mach-OSF/1 Zone Allocator
Solaris Slab Allocator
– FreeBSD, Linux, Solaris
Resource Map Allocator
Resource map is a set of <base,size> pairs that monitor areas of free memory
Initially, pool described by a single map entry = <pool_starting_address, pool_size>
Allocations result in pool fragmenting with one map entry for each contiguous free region
Entries sorted in order of increasing base address
Requests satisfied using one of three policies:
– First fit: allocates from the first free region with sufficient space. Used in UNIX; fastest, but fragmentation is a concern
– Best fit: allocates from the smallest region that satisfies the request. May leave several regions that are too small to be useful
– Worst fit: allocates from the largest region unless a perfect fit is found. Goal is to leave behind larger regions after allocation
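The first-fit policy can be sketched in a few lines. This is only the allocation side, using the slide's rmalloc name; freeing with coalescing is omitted, and the fixed-size map array is an assumption for illustration:

```c
#include <stddef.h>

/* A resource map entry: a contiguous free region <base,size>.
 * size == 0 marks an unused slot. */
struct map_entry { long base; long size; };

/* First fit: scan the (base-sorted) map and carve the request out of
 * the first region that is big enough. Returns the base offset of the
 * allocation, or -1 if no contiguous region satisfies the request. */
static long rmalloc(struct map_entry *map, size_t n, long size)
{
    for (size_t i = 0; i < n; i++) {
        if (map[i].size >= size) {
            long base = map[i].base;
            map[i].base += size;     /* shrink the region from the front */
            map[i].size -= size;
            return base;
        }
    }
    return -1;
}
```

Running the slide's example, rmalloc(256) then rmalloc(320) on an initial <0,1024> map leaves the free region <576,448>.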
offset_t rmalloc(size)
void rmfree(base, size)

Initially: <0,1024>
after rmalloc(256), rmalloc(320), rmfree(256,128): <256,128> <576,448>
after rmfree(128,128): <128,256> <576,448>
Resource Map: Good/Bad
Advantages:
– simple, easy to implement
– not restricted to memory allocation; works for any collection of objects that are sequentially ordered and require allocation and freeing in contiguous chunks
– can allocate the exact size within any alignment restrictions, thus no internal fragmentation
– client may release a portion of the allocated memory
– adjacent free regions are coalesced
Resource Map: Good/Bad
• Disadvantages:
– Map may become highly fragmented, resulting in low utilization. Poor for performing large requests.
– Resource map size increases with fragmentation: a static table will overflow, and a dynamic table needs its own allocator.
– Map must be sorted for free-region coalescing. Sorting operations are expensive.
– Requires a linear search of the map to find a free region that matches the allocation request.
– Difficult to return borrowed pages to the paging system.
Simple Power of Twos
has been used to implement malloc() and free() in the user-level C library (libc).
Do you know how it is implemented?
Simple Power of Twos
has been used to implement malloc() and free() in the user-level C library (libc).
Uses a set of free lists, with each list storing a particular size of buffer. Buffer sizes are a power of two.
Each buffer has a one-word header
– when free, the header stores a pointer to the next free-list element
– when allocated, the header stores a pointer to the associated free list (where it is returned when freed); alternatively, the header may contain the size of the buffer
How to allocate?
– char *ptr = (char *) malloc(100);
How to free?
– char *ptr = (char *) malloc(100);
– free(ptr);
Extra FOUR bytes for a pointer or size: a Free buffer holds a pointer to the next free block; a Used buffer holds the size.
free list
One-word header per buffer (pointer)
– malloc(X): size = roundup(X + sizeof(header))
– roundup(Y) = 2^n, where 2^(n-1) < Y <= 2^n
free(buf) must free the entire buffer.
Simple and reasonably fast; eliminates linear searches and fragmentation.
– Bounded time for allocations when buffers are available
familiar API
simple to share buffers between kernel modules, since freeing a buffer does not require knowing its size
Rounding requests to a power of 2 results in wasted memory and poor utilization.
– aggravated by requiring buffer headers, since it is not unusual for memory requests to already be a power of two
no provision for coalescing free buffers, since buffer sizes are generally fixed
no provision for borrowing pages from the paging system, although some implementations do this
no provision for returning unused buffers to the page allocator
Simple Power of Two
void *malloc(size)
{
    int ndx = 0;                      /* free list index */
    int bufsize = 1 << MINPOWER;      /* size of smallest buffer */

    size += 4;                        /* add for header */
    assert(size <= MAXBUFSIZE);
    while (bufsize < size) {
        ndx++;
        bufsize <<= 1;
    }
    /* ndx is the index on the freelist array from which a buffer
     * will be allocated */
}
Can we eliminate the need for the Extra FOUR bytes?
McKusick-Karels Allocator
/usr/src/sys/kern/kern_malloc.c
Improved power-of-twos implementation
All buffers within a page must be of equal size
Adds a page usage array, kmemsizes[], to manage pages
Managed memory must be contiguous pages
Does not require buffer headers to indicate size.
When freeing memory, free(buf) simply masks off the low-order bits to get the page address (actually the page offset, pg), which is used as an index into the kmemsizes array.
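The masking trick above can be sketched in a few lines. This is a simplified model, not the FreeBSD code: kmembase, kmemsizes[], and size_of are illustrative names, and setting them up is assumed to have been done by the allocator.

```c
#include <stdint.h>

#define PAGE_SHIFT 12    /* 4K pages */

static uintptr_t kmembase;           /* start of the managed contiguous pages */
static uint16_t  kmemsizes[1024];    /* buffer size used within each page */

/* No per-buffer header: mask off the low-order bits of the pointer to
 * find which page it lies in, then look the buffer size up per page. */
static uint16_t size_of(void *buf)
{
    uintptr_t pg = ((uintptr_t)buf - kmembase) >> PAGE_SHIFT;
    return kmemsizes[pg];
}
```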
1 page = 2^12 (4K) bytes: separate into 16 blocks of 2^8 bytes each
1 page = 2^12 (4K) bytes: separate into 64 blocks of 2^6 bytes each
On-Demand Page/kmem allocation
How would we know the size of this piece of memory?
free(ptr);
How to point to the next free block?
Used blocks: check the page#. Free blocks: pointer.
McKusick-Karels Allocator
• Advantages:
– eliminates space wastage in the common case where the allocation request is a power of two
– optimizes the round-up computation, and eliminates it if the size is known at compile time
• Disadvantages:
– similar drawbacks to the simple power-of-twos allocator
– vulnerable to burst-usage patterns, since there is no provision for moving buffers between lists
The Buddy System
Another interesting power-of-2 memory allocation used in Linux Kernel
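A common way to implement the buddy relationship, not spelled out on the slides: every block of size 2^k starts at an offset that is a multiple of 2^k, so a block's buddy is found by flipping the single bit corresponding to its size. On release, if the buddy is also free, the two merge into one 2^(k+1) block.

```c
#include <stdint.h>

/* Offset of the buddy of the block at `offset` with power-of-two
 * `size`: flip the bit that distinguishes the two halves. */
static uint32_t buddy_of(uint32_t offset, uint32_t size)
{
    return offset ^ size;
}
```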
Buddy System
[Diagram sequence: a 1024-byte pool managed with a bitmap of 32-byte chunks and free lists for 32/64/128/256/512-byte blocks, animated over the operation sequence allocate(256), allocate(128), allocate(64), allocate(128), release(C, 128), release(D, 64). Each allocate splits a free block into buddy pairs (A/A', B/B', C/C', D/D', …) and sets the bitmap bits; releasing a block clears its bitmap bits (the size argument tells the allocator how much to free: "Why SIZE?"); merging free blocks recombines a freed block with its buddy when both are free (B and B' merge).]
sizeof(struct proc)?
452 bytes
How should we allocate the memory?
– Power of 2: 512 bytes, not so bad
– IF (Internal Fragmentation): 60 bytes, or 12%
struct with 300 bytes
IF?
How many pages (4K per page) are needed for hosting 16 entries?
“Slab”
One or more pages for one slab
One slab dedicated to ONE TYPE of objects (with the same size)
– Breaking the power-of-2 rule
– Example: a 2-page slab can hold 27 entities of 300 bytes (versus 16 entities using 512-byte blocks).
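The 27-versus-16 comparison is straightforward division, checked here (helper names are illustrative): a power-of-two allocator rounds 300 up to 512, fitting 16 objects in 8192 bytes, while a slab packs objects back to back, fitting 27.

```c
/* Objects per region with power-of-two rounding (300 -> 512). */
static int objs_pow2(int region_bytes, int obj_bytes)
{
    int rounded = 1;
    while (rounded < obj_bytes)
        rounded <<= 1;                 /* round up to a power of two */
    return region_bytes / rounded;
}

/* Objects per region with slab-style tight packing. */
static int objs_slab(int region_bytes, int obj_bytes)
{
    return region_bytes / obj_bytes;
}
```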
Slab Allocator
Design of Slab Allocator
[Diagram: per-object-type caches (vnode cache, proc cache, mbuf cache, msgb cache) form the front end, handing out objects in use by the kernel; the page-level allocator is the back end supplying pages to each cache]
cachep = kmem_cache_create (name, size, align, ctor, dtor);
VM
“mmap”
Memory Mapped File– Read/write versus direct memory access– Sharing a file among multiple processes
Two modes: Shared or Private
– Applications?
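The two modes can be demonstrated with a minimal sketch (error handling trimmed; the helper name is made up): a store through a MAP_SHARED mapping reaches the file, while MAP_PRIVATE gives the process a copy-on-write page and leaves the file untouched.

```c
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

/* Map the first byte of a file read/write and store into it through
 * memory: direct memory access instead of write(2). */
static void poke(const char *path, int flags, char c)
{
    int fd = open(path, O_RDWR);
    char *p = mmap(NULL, 1, PROT_READ | PROT_WRITE, flags, fd, 0);
    p[0] = c;          /* visible in the file only with MAP_SHARED */
    munmap(p, 1);
    close(fd);
}
```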
FORK
Private Mapping: Debugging