Computer Architecture
Virtual Memory (VM)
By Dan Tsafrir, 10/6/2011. Presentation based on slides by Lihu Rappoport
http://www.youtube.com/watch?v=3ye2OXj32DM (funny beginning)
DRAM (dynamic random-access memory)
Corsair 1333 MHz DDR3 laptop memory
Price (at amazon.com):
– $43 for 4 GB
– $79 for 8 GB
“The physical memory”
VM – motivation
Provides isolation between processes
– Processes can concurrently run on a single machine
– VM prevents them from accessing the memory of one another
– (But still allows for convenient sharing when required)
Provides illusion of large memory
– VM size can be bigger than physical memory size
– VM decouples the program from the real memory size (which can differ across machines)
Provides illusion of contiguous memory
– Programmers need not worry about where data is placed exactly
Allows for dynamic memory growth
– Can add memory to processes at runtime as needed
Allows for memory overcommitment
– Sum of VM spaces (across all processes) can be >= physical memory
– DRAM is often one of the most costly parts in the system
VM – terminology
Virtual address space
– Space used by the programmer
– “Ideal” = contiguous & as big as you’d like
Physical address
– The real, underlying physical memory address
– Completely abstracted away by OS/HW
VM – basic idea
Divide memory (virtual & physical) into fixed size blocks
– “page” = chunk of contiguous data in virtual space
– “frame” = physical memory exactly enough to hold one page
– |page| = |frame| (= size)
– page size = power of 2 = 2^k bytes
– By default, k=12 almost always => page size is 4KB
While virtual address space is contiguous
– Pages can be mapped into arbitrary frames
Pages can reside
– In memory or on disk (hence, overcommitment)
All programs are written using the VM address space
– HW does on-the-fly translation from virtual to physical addresses
– Use a page table to translate between virtual and physical addresses
VM – simplistic illustration
Memory acts as a cache for the secondary storage (disk)
Immediate advantages
– Illusion of contiguity & of having more physical memory
– Program’s actual location is unimportant
– Dynamic growth, isolation, & sharing are easy to obtain
[Figure: pages of the virtual space are mapped via address translation to frames in DRAM or to the disk]
Translation – use a “page table”
virtual address (64 bit) = virtual page number (52 bit) + page offset (12 bit)
physical address (32 bit) = physical frame number (20 bit) + page offset (12 bit)
(page size is typically 2^12 bytes = 4KB)
How to map the virtual page number to a physical frame number?
The page table base register points to the page table in memory; the virtual page number indexes into it.
Each entry, a “PTE” (page table entry), holds: a valid bit (V), a dirty bit (D), access-control bits (AC), and the physical frame number.
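To make the mapping concrete, here is a minimal C sketch of the lookup, assuming a flat (single-level) page table, 4KB pages, and the illustrative PTE layout above; a real 64-bit machine uses multi-level tables, and all names here are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12                          /* page size = 2^12 = 4KB */
#define PAGE_OFFSET_MASK ((1u << PAGE_SHIFT) - 1)

/* Illustrative PTE layout: valid/dirty/access-control bits + frame number */
typedef struct {
    uint32_t valid : 1;
    uint32_t dirty : 1;
    uint32_t ac    : 3;     /* access-control bits (e.g., R/W/X) */
    uint32_t frame : 20;    /* physical frame number             */
} pte_t;

/* Translate a virtual address using a flat page table; returns false on a
 * page fault (valid == 0), in which case the OS must bring the page in. */
bool translate(const pte_t *page_table, uint64_t va, uint32_t *pa)
{
    uint64_t vpn    = va >> PAGE_SHIFT;         /* virtual page number */
    uint32_t offset = va &  PAGE_OFFSET_MASK;   /* page offset         */
    pte_t    pte    = page_table[vpn];          /* one memory access   */

    if (!pte.valid)
        return false;                           /* page fault */

    *pa = (pte.frame << PAGE_SHIFT) | offset;   /* 20-bit frame + offset */
    return true;
}
```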
Page tables
[Figure: for each virtual page number, a valid PTE (valid=1) points to a frame in physical memory; an invalid PTE (valid=0) holds the page’s disk address]
Checks
If ( valid == 1 )
  page is in main memory, at the frame address stored in the table
  data is readily available (e.g., can be copied to the cache)
else /* page fault */
  need to fetch the page from disk
  causes a trap, usually accompanied by a context switch:
  the current process is suspended while the page is fetched from disk
Access control
– R=read-only, R/W=read/write, X=execute
– If ( access type incompatible with specified access rights )
  protection violation fault
  traps to the fault handler
Demand paging
– Pages are fetched from secondary memory only upon the first fault
– Rather than, e.g., upon file open
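The two checks above, sketched in C (reusing pte_t from the translation sketch; the AC bit encoding and the enum names are assumptions for illustration):

```c
/* Illustrative access types and outcomes */
typedef enum { ACC_READ, ACC_WRITE, ACC_EXEC } access_t;
typedef enum { OK, PAGE_FAULT, PROTECTION_FAULT } check_t;

#define AC_R 0x1   /* read allowed    */
#define AC_W 0x2   /* write allowed   */
#define AC_X 0x4   /* execute allowed */

check_t check_access(pte_t pte, access_t type)
{
    if (!pte.valid)
        return PAGE_FAULT;            /* trap to OS: fetch the page from disk */

    /* compare the requested access type against the AC bits */
    if ((type == ACC_READ  && !(pte.ac & AC_R)) ||
        (type == ACC_WRITE && !(pte.ac & AC_W)) ||
        (type == ACC_EXEC  && !(pte.ac & AC_X)))
        return PROTECTION_FAULT;      /* trap to the fault handler */

    return OK;                        /* translation may proceed */
}
```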
Page replacement
Page replacement policy
– Decides which page to evict to disk
LRU (least recently used)
– Typically too wasteful (must be updated upon each memory reference)
FIFO (first in first out)
– Simplest: no need to update upon references, but ignores usage
Second-chance
– Set a per-page “was it referenced?” bit (can be done by HW or SW)
– Swap out the first page with bit = 0, in FIFO order
– When traversed, if bit = 1, set it to 0 and push the associated page to the end of the list (in FIFO terms, the page becomes newest)
Clock (sketched below)
– More efficient variant of second-chance
– Pages are cyclically ordered (no FIFO); search clockwise for the first page with bit=0; set bit=0 for pages that have bit=1
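A minimal C sketch of the clock variant, assuming a hypothetical per-frame referenced bit (the frame_t layout and function name are illustrative):

```c
#include <stdbool.h>
#include <stddef.h>

/* Per-frame bookkeeping for the clock algorithm (illustrative layout) */
typedef struct {
    bool referenced;   /* set by HW/SW on every access to the page */
    /* ... frame number, owning page, etc. ... */
} frame_t;

/* Clock = second-chance over a circular list: starting at *hand, skip
 * frames whose referenced bit is set (clearing it as we go) and evict
 * the first frame found with referenced == 0. */
size_t clock_pick_victim(frame_t *frames, size_t nframes, size_t *hand)
{
    for (;;) {
        frame_t *f = &frames[*hand];
        if (!f->referenced)
            break;                       /* victim found */
        f->referenced = false;           /* give it a second chance */
        *hand = (*hand + 1) % nframes;   /* advance the clock hand   */
    }
    size_t victim = *hand;
    *hand = (*hand + 1) % nframes;       /* next search starts after the victim */
    return victim;
}
```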
Page replacement – cont.
NRU (not recently used)
– More sophisticated LRU approximation
– HW or SW maintains per-page ‘referenced’ & ‘modified’ bits
– Periodically (clock interrupt), SW turns ‘referenced’ off
– Replacement algorithm partitions pages into:
  Class 0: not referenced, not modified
  Class 1: not referenced, modified
  Class 2: referenced, not modified
  Class 3: referenced, modified
– Choose at random a page from the lowest nonempty class for removal
– Underlying principles (order is important):
  Prefer keeping referenced over unreferenced
  Prefer keeping modified over unmodified
– Can a page be modified but not referenced?
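A rough C sketch of the NRU victim selection under these assumptions (the page_bits_t layout and function names are illustrative):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-page status bits, maintained by HW or SW */
typedef struct {
    bool referenced;
    bool modified;
} page_bits_t;

/* NRU class: 2*referenced + modified (class 0 = not referenced, not
 * modified ... class 3 = referenced and modified). */
static int nru_class(page_bits_t p)
{
    return ((p.referenced ? 1 : 0) << 1) | (p.modified ? 1 : 0);
}

/* Pick a victim: a random page from the lowest nonempty class. */
size_t nru_pick_victim(const page_bits_t *pages, size_t npages)
{
    for (int cls = 0; cls <= 3; cls++) {
        size_t count = 0;
        for (size_t i = 0; i < npages; i++)
            if (nru_class(pages[i]) == cls)
                count++;
        if (count == 0)
            continue;
        size_t pick = (size_t)rand() % count;   /* random page within the class */
        for (size_t i = 0; i < npages; i++)
            if (nru_class(pages[i]) == cls && pick-- == 0)
                return i;
    }
    return 0;   /* not reached when npages > 0 */
}
```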
Page replacement – advanced
ARC (adaptive replacement cache)
– Factors in not only recency (when last accessed), but also frequency (how many times accessed)
– User determines which factor has more weight
– Better (but more wasteful) than LRU
– Developed by IBM: Nimrod Megiddo & Dharmendra Modha
– Details: http://www.usenix.org/events/fast03/tech/full_papers/megiddo/megiddo.pdf
CAR (clock with adaptive replacement)
– Similar to ARC, and comparable in performance
– But, unlike ARC, doesn’t require user-specified parameters
– Likewise developed by IBM: Sorav Bansal & Dharmendra Modha
– Details: http://www.usenix.org/events/fast04/tech/full_papers/bansal/bansal.pdf
Page faults
Page faults: the data is not in memory => retrieve it from disk
– CPU detects the situation (valid=0)
– But it cannot remedy the situation (it doesn’t know the disk; that’s the OS’s job)
– Thus, it must trap to the OS
– OS loads the page from disk
  Possibly writing a victim page to disk (if no room & if dirty)
  Possibly avoiding the disk read thanks to the OS “buffer cache”
– OS updates the page table (valid=1)
– OS resumes the process; now, HW will retry & succeed!
A page fault incurs a significant penalty
– “Major” page fault = must go get the page from disk
– “Minor” page fault = the page already resides in the OS buffer cache
  Possible only for files; not for “anonymous” spaces like the stack
– => pages shouldn’t be too small (as noted, typically 4KB)
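A schematic C sketch of these steps; the helper functions are hypothetical placeholders rather than a real OS API, and pte_t is the illustrative layout from the translation sketch:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical OS helpers -- placeholders, not a real kernel API */
extern size_t pick_victim_frame(void);
extern bool   frame_is_dirty(size_t frame);
extern void   write_page_to_disk(size_t frame);
extern bool   in_buffer_cache(uint64_t vpn);
extern void   copy_from_buffer_cache(uint64_t vpn, size_t frame);
extern void   read_page_from_disk(uint64_t vpn, size_t frame);

/* Schematic page-fault handler */
void handle_page_fault(pte_t *pte, uint64_t faulting_vpn)
{
    size_t frame = pick_victim_frame();          /* make room (may pick a free frame) */

    if (frame_is_dirty(frame))
        write_page_to_disk(frame);               /* write back the dirty victim */

    if (in_buffer_cache(faulting_vpn))
        copy_from_buffer_cache(faulting_vpn, frame);   /* minor fault: no disk read */
    else
        read_page_from_disk(faulting_vpn, frame);      /* major fault: go to disk   */

    pte->frame = (uint32_t)frame;                /* update the page table ...       */
    pte->valid = 1;                              /* ... and mark the page present   */
    /* resume the process: HW retries the faulting access and now succeeds */
}
```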
Page size
Smaller page size (typically 4KB)
– PROS: minimizes internal fragmentation
– CONS: increases the size of the page table
Bigger page size (called “superpages” if > 4K)
– PROS:
  Amortizes disk access cost
  May prefetch useful data
  May discard useless data early
– CONS:
  Increased fragmentation
  Might transfer unnecessary info at the expense of useful info
Lots of work to increase page size beyond 4K
– HW has supported it for years; the OS is the “bottleneck”
– Attractive because:
  Bigger DRAMs, increasing memory/disk performance gap
TLB (translation lookaside buffer)
Page table resides in memory
– Each translation requires a memory access
– Might be required for each load/store!
TLB
– Caches recently used PTEs
– Speeds up translation
– typically 128 to 256 entries
– usually 4 to 8 way associative
– TLB access time is comparable to L1 cache access time
[Flow: virtual address → TLB access; on a TLB hit, the physical address is produced immediately; on a miss, access the page table in memory]
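A rough C sketch of a TLB lookup, assuming an illustrative 4-way, 64-set organization (sizes, names, and the entry layout are assumptions):

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_WAYS 4
#define TLB_SETS 64   /* 4 ways x 64 sets = 256 entries */

typedef struct {
    bool     valid;
    uint64_t tag;     /* upper bits of the virtual page number */
    uint32_t frame;   /* cached physical frame number          */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_SETS][TLB_WAYS];

/* Look up a VPN in the TLB; on a hit return the frame number, otherwise
 * the caller falls back to the in-memory page table (an extra access). */
bool tlb_lookup(uint64_t vpn, uint32_t *frame)
{
    uint64_t set = vpn % TLB_SETS;    /* low VPN bits select the set */
    uint64_t tag = vpn / TLB_SETS;    /* remaining bits form the tag */

    for (int way = 0; way < TLB_WAYS; way++) {
        tlb_entry_t *e = &tlb[set][way];
        if (e->valid && e->tag == tag) {
            *frame = e->frame;        /* TLB hit: no page-table access */
            return true;
        }
    }
    return false;                     /* TLB miss: walk the page table */
}
```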
Making Address Translation Fast
The TLB is a cache for recent address translations:
[Figure: the TLB holds (valid, tag, physical page) entries for recently used translations; the full page table, indexed by virtual page number, holds a valid bit and either a physical page in memory or a disk address]
TLB Access
[Figure: 4-way set-associative TLB lookup — the virtual page number is split into tag and set; the set bits select a TLB set, the tag is compared against all ways, and on a hit a way MUX selects the matching PTE]
Unified L2
L2 is unified (no separation for data/instructions) – like the main memory
– In case of a miss in either d-L1, i-L1, d-TLB, or i-TLB => try to get the missed data from L2
– PTEs can and do reside in L2
[Figure: the instruction TLB and L1 instruction cache, and the data TLB and L1 data cache, all fetch missing translations/data from the unified L2 cache, which is backed by memory]
VM & cache
TLB access is serial with cache access => performance is crucial!
Page table entries can be cached in the L2 cache (as data)
[Flow: virtual address → TLB access; on a TLB miss, access the page table in memory (its entries may themselves hit in L2); with the physical address, access the L1 cache; on an L1 miss access L2, and on an L2 miss access memory]
Overlapped TLB & cache access
#Set is not contained within the Page Offset
The #Set is not known until the physical page number is known
Cache can be accessed only after address translation is done
VM view of a physical address: Physical Page Number [29:12], Page offset [11:0]
Cache view of a physical address: tag [29:14], set [13:6], disp [5:0]
Overlapped TLB & cache access (cont)
In the example below, the #Set is contained within the Page Offset
The #Set is known immediately
Cache can be accessed in parallel with address translation
Once translation is done, match upper bits with tags
Limitation: Cache ≤ (page size × associativity)
Virtual Memory view of a physical address: Physical Page Number [29:12], Page offset [11:0]
Cache view of a physical address: tag [29:12], set [11:6], disp [5:0]
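For example, with 4KB pages an 8-way set-associative cache can be at most 8 × 4KB = 32KB while keeping all of its set bits inside the page offset; a larger (or less associative) cache needs set bits above bit 11, which are known only after translation.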
Overlapped TLB & cache access (cont)
[Figure: the set bits of the page offset select the cache set while, in parallel, the virtual page number is looked up in the TLB; the physical page number from the TLB is then compared against the cache tags (way MUX) to determine hit/miss and select the data]
Overlapped TLB & cache access (cont)
Assume the cache is 32K byte, 2-way set-associative, 64 byte/line
– (2^15 / 2 ways) / (2^6 bytes/line) = 2^(15-1-6) = 2^8 = 256 sets
In order to still allow overlap between set access and TLB access
– Take the upper two bits of the set number from bits [1:0] of the VPN
  Physical_addr[13:12] may be different than virtual_addr[13:12]
– The tag is comprised of bits [31:12] of the physical address
  The tag may mis-match bits [13:12] of the physical address
– Cache miss: allocate the missing line according to its virtual set address and physical tag
Cache view of the physical address: tag [29:14], set [13:6], disp [5:0]; set bits [13:12] are taken from VPN[1:0], while set bits [11:6] and disp come from the page offset
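A small C sketch of this set-index trick (the bit positions follow the 32KB / 2-way / 64 byte/line example above; the function name is illustrative):

```c
#include <stdint.h>

/* For a 32KB, 2-way, 64 byte/line cache (256 sets, set bits [13:6]): the
 * lower six set bits come from the page offset, and the top two come from
 * VPN[1:0] of the *virtual* address, so the set can be selected before
 * translation completes. */
static unsigned cache_set_index(uint64_t va)
{
    unsigned offset_bits = (va >> 6)  & 0x3F;  /* bits [11:6]: same in VA and PA   */
    unsigned vpn_low2    = (va >> 12) & 0x3;   /* VPN[1:0], i.e. virtual bits [13:12] */
    return (vpn_low2 << 6) | offset_bits;      /* 8-bit set index, 256 sets        */
}
```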
DMA (direct memory access)
DMA copies a page from/to, e.g., the disk controller (or another I/O device)
– Accesses memory without requiring CPU involvement
– Assume we copy from memory to disk (swap out a page)
– Read each relevant block:
  Snoop-invalidate if it resides in a cache (L1, L2), meaning:
    if it is modified, copy the line from the cache into memory
    invalidate the cache line
– Write the line to the disk controller
– This means that when a page is swapped out of memory
  All data in the caches that belongs to that page is invalidated
  The page on the disk is up-to-date
In the page table
– Assign 0 to the valid bit in the PTEs of swapped-out pages
– The rest of the PTE bits may be used by the OS for keeping the location of the page on disk
– The TLB entry of a swapped-out page is likewise invalidated
Context switch
Each process has its own address space
– Akin to saying “each process has its own page table”
– OS allocates frames for a process => updates the process’s page table
– If only one PTE points to a frame throughout the system
  Only the associated process can access the corresponding frame
– Shared memory
  Two PTEs of two processes point to the same frame
Upon context switching
– Save the current architectural state to memory:
  Architectural registers, including
  The register that holds the page table base address in memory
– Flush the TLB
  As the same virtual addresses are routinely reused
  (Recently, “VPID” was added to the TLBs of some x86s => no need to flush)
– Load the new architectural state from memory
  Architectural registers
  The register that holds the page table base address in memory
Virtually-addressed cache
The cache uses virtual addresses (tags are virtual)
Only requires address translation on a cache miss
– TLB is not in the path to a cache hit! But…
Aliasing: >=2 virtual addresses mapped to the same physical address
– => >=2 cache lines holding data of the same physical address
– => Must update all cache entries with the same physical address
[Figure: the CPU sends the VA directly to the cache; on a hit, data is returned without translation; only on a miss is the VA translated to a PA to access main memory]
Virtually-addressed cache (cont.)
Cache must be flushed at task switch
– Possible solution: include a unique process ID (PID) in the tag (like the VPID we discussed earlier)