Appendix C: Memory Hierarchy
Why care about memory hierarchies?
[Figure: processor vs. memory performance, 1980-2010, log scale. The processor-memory performance gap keeps growing.]
Major source of stall cycles: memory accesses
Levels of the Memory Hierarchy
Upper levels are faster; lower levels are larger.
- CPU registers: 100s of bytes, <0.5 ns access
- Cache: KBytes, 1 ns, 1-0.1 cents/bit
- Main memory: MBytes, 100 ns, 0.0001-0.00001 cents/bit
- Disk: GBytes, 10 ms (10,000,000 ns), 10^-5 - 10^-6 cents/bit
- Tape: infinite capacity, sec-min access, 10^-8 cents/bit
Staging / transfer units between levels:
- Registers <-> cache: instruction operands (1-8 bytes), managed by the program/compiler
- Cache <-> memory: blocks (8-128 bytes), managed by the cache controller
- Memory <-> disk: pages (512 B-4 KB), managed by the OS
- Disk <-> tape: files (MBytes), managed by the user/operator
Motivating memory hierarchies
Two structures hold data:
- Registers: a small array of storage
- Memory: a large array of storage
What characteristics would we like memory to have?
- High capacity
- Low latency
- Low cost
We can't satisfy all three requirements with one memory technology.
Memory hierarchy
Solution: use a little bit of everything!
- Small SRAM array (cache): small means fast and cheap
- Larger DRAM array (main memory): hope you rarely have to use it
- Extremely large disk: costs are decreasing at a faster rate than we fill them
Terminology
- Hit: the data you want is found at a given level
- Miss: the data is not present at that level; in this case, check the next lower level
- Hit rate: the fraction of accesses that hit at a given level; miss rate = (1 - hit rate)
- Another performance measure, average memory access time:
  AMAT = (hit time) + (miss rate) x (miss penalty)
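To make the AMAT formula concrete, here is a minimal Python sketch (function and variable names are ours, not from the slides):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in cycles: the hit time plus the
    miss rate weighted by the miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Example: 1-cycle hit, 5% miss rate, 100-cycle miss penalty -> 6.0 cycles
print(amat(1, 0.05, 100))
```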
Memory hierarchy operation
- We'd like most accesses to use the cache, the fastest level of the hierarchy
- But the cache is much smaller than the address space
- Even so, most caches have a hit rate > 80%
- How is that possible? The cache holds the data most likely to be accessed
Principle of locality
Programs don't access data randomly; they display locality in two forms:
- Temporal locality: if you access a memory location (e.g., 1000), you are more likely to re-access that location than some random location
- Spatial locality: if you access a memory location (e.g., 1000), you are more likely to access a location near it (e.g., 1001) than some random location
Cache Basics
A cache is a fast (but small) memory close to the processor. When data is referenced:
- If it is in the cache, use the cache instead of memory
- If not, bring it into the cache (actually, bring in the entire containing block)
- We may have to kick something else out to do it!
Important decisions:
- Placement: where in the cache can a block go?
- Identification: how do we find a block in the cache?
- Replacement: what do we kick out to make room?
- Write policy: what do we do about stores?
4 Questions for the Memory Hierarchy
- Q1: Where can a block be placed in the upper level? (Block placement)
- Q2: How is a block found if it is in the upper level? (Block identification)
- Q3: Which block should be replaced on a miss? (Block replacement)
- Q4: What happens on a write? (Write strategy)
Q1: Cache Placement
Placement: which memory blocks are allowed into which cache lines.
Placement policies:
- Direct mapped: a block can go to only one line
- Fully associative: a block can go to any line
- Set-associative: a block can go to one of N lines
  - E.g., if N = 4, the cache is 4-way set associative
  - The other two policies are extremes of this (e.g., N = 1 gives a direct-mapped cache)
Q1: Block placement example
Where can memory block 12 go in an 8-block cache?
(Set-associative mapping: set = block number mod number of sets.)
- Fully associative: any of the 8 lines
- Direct mapped: line (12 mod 8) = 4
- 2-way set associative: set (12 mod 4) = 0, i.e., either line of set 0
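The same mapping can be expressed as a short Python sketch (our own illustration, not from the slides; the lines of a set are assumed contiguous):

```python
def candidate_lines(block, num_lines, ways):
    """Cache lines a memory block may occupy for a given associativity.
    ways=1 -> direct mapped; ways=num_lines -> fully associative."""
    num_sets = num_lines // ways
    s = block % num_sets                      # set index
    return list(range(s * ways, (s + 1) * ways))

print(candidate_lines(12, 8, 1))   # direct mapped: [4]
print(candidate_lines(12, 8, 2))   # 2-way: set 0 -> [0, 1]
print(candidate_lines(12, 8, 8))   # fully associative: [0..7]
```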
Q2: Cache Identification
When an address is referenced, we need to:
- Find whether its data is in the cache
- If it is, find where in the cache
This is called a cache lookup.
Each cache line must have:
- A valid bit (1 if the line has data, 0 if the line is empty); we also say the line is valid or invalid
- A tag to identify which block is in the line (if the line is valid)
Q2: Block identification
- A tag is kept on each block; there is no need to check the index or block offset bits
- Increasing associativity shrinks the index and expands the tag
Block address = [ Tag | Index ]; full address = [ Tag | Index | Block offset ]
Address breakdown
- Block offset: byte address within the block
  # block offset bits = log2(block size)
- Index: line (or set) number within the cache
  # index bits = log2(# of cache lines)
- Tag: the remaining bits
Address layout: [ Tag | Index | Block offset ]
Address breakdown example
Given the following:
- 32-bit address
- 32 KB direct-mapped cache
- 64-byte blocks
What are the sizes of the tag, index, and block offset fields?
- Index = 9 bits, since there are 32 KB / 64 B = 2^9 lines
- Block offset = 6 bits, since each block holds 64 B = 2^6 bytes
- Tag = 32 - 9 - 6 = 17 bits
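The same arithmetic as a small Python sketch (function name and layout are ours, for illustration):

```python
from math import log2

def address_fields(addr_bits, cache_bytes, block_bytes):
    """Field widths for a direct-mapped cache."""
    offset_bits = int(log2(block_bytes))                 # byte within block
    index_bits = int(log2(cache_bytes // block_bytes))   # which line
    tag_bits = addr_bits - index_bits - offset_bits      # remaining bits
    return tag_bits, index_bits, offset_bits

print(address_fields(32, 32 * 1024, 64))  # -> (17, 9, 6)
```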
Q3: Block replacement
When we need to evict a line, which do we choose?
- The choice is trivial for direct-mapped; what about set-associative or fully associative?
- We want to evict the data least likely to be used next
- Temporal locality suggests that is the line accessed farthest in the past: least recently used (LRU)
  - LRU is hard to implement exactly in hardware, so it is often approximated
- Alternatives: random (a randomly selected line) or FIFO (the line that has been in the cache the longest)
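To make the LRU policy concrete, here is a minimal Python sketch of exact LRU bookkeeping for one set (an illustration of the policy, not of a hardware implementation; all names are ours):

```python
from collections import OrderedDict

class LRUSet:
    """Tracks the blocks in one set; evicts the least recently used."""
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()   # keys: tags, ordered oldest -> newest

    def access(self, tag):
        if tag in self.blocks:               # hit: mark as most recent
            self.blocks.move_to_end(tag)
            return None
        victim = None
        if len(self.blocks) == self.ways:    # set full: evict the LRU tag
            victim, _ = self.blocks.popitem(last=False)
        self.blocks[tag] = True
        return victim                        # tag evicted, or None

s = LRUSet(ways=2)
for t in [0, 1, 0, 2]:
    print(t, "evicts", s.access(t))   # accessing 2 evicts 1, not 0
```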
Q4: What happens on a write?
- Write-through: data written to the cache block is also written to lower-level memory
- Write-back: data is written only to the cache; the lower level is updated when the block falls out of the cache
Comparison (write-through vs. write-back):
- Debugging: easy vs. hard
- Do read misses produce writes? No vs. yes
- Do repeated writes make it to the lower level? Yes vs. no
Write Policy
Do we allocate cache lines on a write?
- Write-allocate: a write miss brings the block into the cache
- No-write-allocate: a write miss leaves the cache as it was
Do we update memory on writes?
- Write-through: memory is immediately updated on each write
- Write-back: memory is updated only when the line is replaced
Write Buffers for Write-Through Caches
[Figure: processor writes go to the cache and into a write buffer, which drains to lower-level memory.]
A write buffer holds data awaiting write-through to lower-level memory.
- Q: Why a write buffer? A: So the CPU doesn't stall on writes.
- Q: Why a buffer, why not just one register? A: Bursts of writes are common.
Write-Back Caches
Need a dirty bit for each line:
- A dirty line has more recent data than memory
- A line starts clean (not dirty) and becomes dirty on the first write to it
- Memory is not updated yet; the cache holds the only up-to-date copy of a dirty line's data
Replacing a dirty line: its data must be written back to memory (hence "write-back").
Basic cache design
Cache memory can hold a copy of data from any part of main memory:
- Tag: the memory address
- Block: the actual data
On each access, compare the address with the tag:
- If they match → hit! Get the data from the cache block.
- If they don't → miss. Get the data from main memory.
Cache organization
- A cache consists of multiple tag/block pairs, called cache lines (or blocks)
  - Lines can be searched in parallel (within reason)
  - Each line also has a valid bit
  - Write-back caches additionally have a dirty bit
- Block sizes can vary
  - Most systems use between 32 and 128 bytes
  - Larger blocks exploit spatial locality
  - Larger block size → smaller tag size
Direct-mapped cache example
Assume the following simple setup:
- Only 2 levels in the hierarchy
- 16-byte memory, so 4-bit addresses
- Cache organization: direct-mapped, 8 total bytes with 2 bytes per block (4 lines), write-back
This leads to the following address breakdown:
- Offset: 1 bit
- Index: 2 bits
- Tag: 1 bit
Direct-mapped cache example: initial state
Instruction sequence:
  lb $t0, 1($zero)
  lb $t1, 8($zero)
  sb $t1, 4($zero)
  sb $t0, 13($zero)
  lb $t1, 9($zero)
Memory (address: value; two-byte blocks numbered 0-7, block # = address / 2):
   0: 78    1: 29    2: 120   3: 123
   4: 71    5: 150   6: 162   7: 173
   8: 18    9: 21   10: 33   11: 28
  12: 19   13: 200  14: 210  15: 225
Cache (one row per line: V D Tag | Data):
  line 00: 0 0 0 | 0 0
  line 01: 0 0 0 | 0 0
  line 10: 0 0 0 | 0 0
  line 11: 0 0 0 | 0 0
Registers: $t0 = ?, $t1 = ?
Hits: 0, Misses: 0
Direct-mapped cache example: access #1
lb $t0, 1($zero)
Address = 1 = 0001₂ → Tag = 0, Index = 00, Offset = 1
Line 00 is invalid → miss.
Block 0 (memory[0-1] = 78, 29) is brought into line 00:
  line 00: 1 0 0 | 78 29
$t0 = memory[1] = 29
Hits: 0, Misses: 1; Registers: $t0 = 29, $t1 = ?
Direct-mapped cache example: access #2
lb $t1, 8($zero)
Address = 8 = 1000₂ → Tag = 1, Index = 00, Offset = 0
Line 00 is valid but its tag (0) does not match → miss. The line is clean, so nothing is written back.
Block 4 (memory[8-9] = 18, 21) replaces the old block in line 00:
  line 00: 1 0 1 | 18 21
$t1 = memory[8] = 18
Hits: 0, Misses: 2; Registers: $t0 = 29, $t1 = 18
Direct-mapped cache example: access #3
sb $t1, 4($zero)
Address = 4 = 0100₂ → Tag = 0, Index = 10, Offset = 0
Line 10 is invalid → write miss. With write-allocate, the block is first brought into the cache.
Block 2 (memory[4-5] = 71, 150) is loaded into line 10, then the store writes $t1 = 18 at offset 0 and sets the dirty bit:
  line 10: 1 1 0 | 18 150
Hits: 0, Misses: 3; Registers: $t0 = 29, $t1 = 18
Direct-mapped cache example: access #4
sb $t0, 13($zero)
Address = 13 = 1101₂ → Tag = 1, Index = 10, Offset = 1
Line 10 is valid but its tag (0) does not match → miss, and the line is dirty.
Must write back the dirty block before replacing it: memory[4-5] ← 18, 150, so memory[4] changes from 71 to 18.
Hits: 0, Misses: 4
Block 6 (memory[12-13] = 19, 200) is then loaded into line 10, and the store writes $t0 = 29 at offset 1, setting the dirty bit again:
  line 10: 1 1 1 | 19 29
Hits: 0, Misses: 4; Registers: $t0 = 29, $t1 = 18
Direct-mapped cache example: access #5
lb $t1, 9($zero)
Address = 9 = 1001₂ → Tag = 1, Index = 00, Offset = 1
Line 00 is valid and its tag (1) matches → hit!
$t1 = byte at offset 1 of line 00 = 21
Final tally: Hits: 1, Misses: 4; Registers: $t0 = 29, $t1 = 21
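The whole trace can be reproduced in a few lines. Below is a minimal Python sketch of this direct-mapped, write-back, write-allocate cache (structure and names are our own, not from the slides); running it ends with 1 hit, 4 misses, and memory[4] = 18, matching the slides:

```python
BLOCK = 2                      # bytes per block
LINES = 4                      # cache lines

memory = [78, 29, 120, 123, 71, 150, 162, 173,
          18, 21, 33, 28, 19, 200, 210, 225]
# Each line: valid bit, dirty bit, tag, and a copy of one block.
cache = [{"valid": False, "dirty": False, "tag": 0, "data": [0] * BLOCK}
         for _ in range(LINES)]
hits = misses = 0

def access(addr, value=None):
    """Load one byte (value is None) or store one byte."""
    global hits, misses
    offset = addr % BLOCK
    index = (addr // BLOCK) % LINES
    tag = addr // (BLOCK * LINES)
    line = cache[index]
    if line["valid"] and line["tag"] == tag:
        hits += 1
    else:
        misses += 1
        if line["valid"] and line["dirty"]:      # write back the old block
            base = (line["tag"] * LINES + index) * BLOCK
            memory[base:base + BLOCK] = line["data"]
        base = (addr // BLOCK) * BLOCK           # fetch the new block
        line.update(valid=True, dirty=False, tag=tag,
                    data=memory[base:base + BLOCK])
    if value is None:
        return line["data"][offset]
    line["data"][offset] = value                 # store: mark line dirty
    line["dirty"] = True

t0 = access(1)          # miss: t0 = 29
t1 = access(8)          # miss: t1 = 18
access(4, t1)           # write miss, allocate
access(13, t0)          # miss on a dirty line: write back memory[4-5]
t1 = access(9)          # hit: t1 = 21
print(f"$t0={t0}, $t1={t1}, hits={hits}, misses={misses}, mem[4]={memory[4]}")
```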
Cache performance
Simplified model:
  CPU time = (CPU clock cycles + memory stall cycles) x cycle time
  Memory stall cycles = # of misses x miss penalty
                      = IC x misses/instruction x miss penalty
                      = IC x memory accesses/instruction x miss rate x miss penalty
  Average CPI = CPI(without stalls) + memory accesses/instruction x miss rate x miss penalty
  AMAT = hit time + miss rate x miss penalty
Example
A computer has CPI = 1 when all accesses hit, and loads and stores are 50% of instructions. If the miss penalty is 25 cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits?
- If all accesses hit: CPU time = (IC x CPI + 0) x CCT = IC x 1.0 x CCT
- Real cache with stalls:
  Memory stall cycles = IC x (1 + 0.5) x 0.02 x 25 = IC x 0.75
  CPU time = (IC x 1.0 + IC x 0.75) x CCT = 1.75 x IC x CCT
- Speedup = (1.75 x IC x CCT) / (IC x CCT) = 1.75
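A quick Python check of this arithmetic (variable names are ours):

```python
accesses_per_instr = 1 + 0.5   # 1 instruction fetch + 0.5 data accesses
miss_rate, miss_penalty = 0.02, 25
base_cpi = 1.0

stall_cpi = accesses_per_instr * miss_rate * miss_penalty   # 0.75
real_cpi = base_cpi + stall_cpi                             # 1.75
print("speedup if all hits:", real_cpi / base_cpi)          # -> 1.75
```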
Average memory access time
For a unified cache:
  AMAT = (hit time) + (miss rate) x (miss penalty)
For a split cache:
  AMAT = %instructions x (hit time + instruction miss rate x miss penalty)
       + %data x (hit time + data miss rate x miss penalty)
For a multi-level cache:
  AMAT = hit time(L1) + miss rate(L1) x miss penalty(L1)
       = hit time(L1) + miss rate(L1) x (hit time(L2) + miss rate(L2) x miss penalty(L2))
Note: miss rate(L2) is measured on the leftovers from the L1 cache (a local miss rate).
Example (split cache vs. unified cache)
Which has the lower miss rate: a 16 KB instruction cache plus a 16 KB data cache, or a 32 KB unified cache? The misses per 1000 instructions for the instruction, data, and unified caches are 3.82, 40.9, and 43.3, respectively. Assume 36% of instructions are data transfer instructions, a hit takes 1 clock cycle, and the miss penalty is 200 cycles. A load or store hit takes 1 extra cycle on a unified cache. What is the AMAT?
Find each miss rate = (misses/instruction) / (memory accesses/instruction):
- Miss rate(I) = (3.82/1000) / 1 = 0.004
- Miss rate(D) = (40.9/1000) / 0.36 = 0.114
- Miss rate(U) = (43.3/1000) / (1 + 0.36) = 0.0318
Of all memory accesses, 1/1.36 ≈ 74% are instruction references and 0.36/1.36 ≈ 26% are data references, so:
- Miss rate(split) = 74% x 0.004 + 26% x 0.114 = 0.0326
The 32 KB unified cache has a slightly lower miss rate.
Example (cont.)
AMAT = %instructions x (hit time + instruction miss rate x miss penalty)
     + %data x (hit time + data miss rate x miss penalty)
AMAT(split) = 74% x (1 + 0.004 x 200) + 26% x (1 + 0.114 x 200) = 7.52
AMAT(unified) = 74% x (1 + 0.0318 x 200) + 26% x (1 + 1 + 0.0318 x 200) = 7.62
So the split cache has the lower AMAT despite its slightly higher miss rate, because it avoids the extra cycle on load/store hits.
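A quick Python check of both AMATs, using the slide's rounded percentages (a sketch; names are ours):

```python
# Split vs. unified AMAT; 74%/26% instruction/data access split as above.
penalty = 200
split = 0.74 * (1 + 0.004 * penalty) + 0.26 * (1 + 0.114 * penalty)
unified = 0.74 * (1 + 0.0318 * penalty) + 0.26 * (1 + 1 + 0.0318 * penalty)
print(round(split, 2), round(unified, 2))   # -> 7.52 7.62
```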
Another example (multilevel cache)
Suppose that per 1000 memory references there are 40 misses in the L1 cache and 20 misses in the L2 cache. What are the miss rates? Assume the miss penalty from the L2 cache to memory is 200 cycles, the hit time of the L2 cache is 10 cycles, and the hit time of L1 is 1 cycle. What is the AMAT?
- Miss rate(L1) = 40/1000 = 4%
- Miss rate(L2) = 20/40 = 50% (local: measured on L1's leftovers)
AMAT = hit time(L1) + miss rate(L1) x (hit time(L2) + miss rate(L2) x miss penalty(L2))
     = 1 + 4% x (10 + 50% x 200) = 5.4 cycles
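The same formula as a small Python helper (a sketch; names are ours):

```python
def amat_two_level(hit_l1, mr_l1, hit_l2, mr_l2_local, penalty_l2):
    """Two-level AMAT; mr_l2_local is measured on L1's leftovers."""
    return hit_l1 + mr_l1 * (hit_l2 + mr_l2_local * penalty_l2)

print(amat_two_level(1, 0.04, 10, 0.50, 200))   # -> 5.4
```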
Reasons for cache misses
AMAT = (hit time) + (miss rate) x (miss penalty); reducing misses improves performance.
The three C's:
- Compulsory miss: the first reference to an address. Remedy: increase the block size.
- Capacity miss: the cache is too small to hold the data. Remedy: increase the cache size.
- Conflict miss: a block replaced from a busy line or set; it would have been a hit in a fully associative cache. Remedy: increase associativity.
Six Basic Cache Optimizations
Reducing miss rate:
1. Larger block size (compulsory misses)
2. Larger cache size (capacity misses)
3. Higher associativity (conflict misses)
Reducing miss penalty:
4. Multilevel caches
5. Giving read misses priority over writes (e.g., a read completes before earlier writes still sitting in the write buffer)
Reducing hit time:
6. Avoiding address translation during cache indexing
Problems with memory
- DRAM is too expensive to buy many gigabytes
- We need our programs to work even if they require more memory than we have
  (a program that works on a machine with 512 MB should still work on a machine with 256 MB)
- Most systems run multiple programs
Solutions
- Leave the problem to the programmer: assume the programmer knows the exact configuration
- Overlays: the compiler identifies mutually exclusive regions
- Virtual memory: use hardware and software to automatically translate references from virtual addresses (what the programmer sees) to physical addresses (indexes into DRAM or disk)
Benefits of virtual memory
[Figure: the CPU issues virtual addresses; address translation hardware turns them into physical addresses before they reach memory.]
- User programs run in a standardized virtual address space
- Address translation hardware, managed by the operating system (OS), maps virtual addresses to physical memory
- The hardware supports "modern" OS features: protection, translation, and sharing
Managing virtual memory
- Effectively treat main memory as a cache for disk
  - Blocks are called pages
  - Misses are called page faults
- A virtual address consists of a virtual page number and a page offset:
  [ Virtual page number (bits 31-12) | Page offset (bits 11-0) ]
Page tables encode virtual address spaces
[Figure: a virtual address space and a physical memory space, with virtual pages mapped to physical frames through the page table.]
- A virtual address space is divided into blocks of memory called pages
- A machine usually supports pages of a few sizes (e.g., MIPS R4000)
- A valid page table entry codes the physical memory "frame" address for the page
- The page table is indexed by the virtual address and maps virtual page numbers to physical frames ("PTE" = page table entry)
- Virtual memory => treat main memory as a cache for disk
Details of the page table
[Figure: a virtual address (virtual page number, 12-bit offset) indexes the page table, yielding a physical address (physical page number, same 12-bit offset).]
- The page table is located in physical memory; a page table base register points to its start
- The virtual page number indexes into the page table
- Each entry holds a valid bit, access rights, and the physical frame address (PA)
- The physical address is the physical page number concatenated with the unchanged page offset
Paging the page table
- A table for 4 KB pages in a 32-bit address space has 1M entries
- Each process needs its own address space!
Two-level page tables:
- 32-bit virtual address: [ P1 index (bits 31-22) | P2 index (bits 21-12) | Page offset (bits 11-0) ]
- The top-level table is wired (always resident) in main memory
- Only a subset of the 1024 second-level tables is in main memory; the rest are on disk or unallocated
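A minimal Python sketch of a two-level lookup under these assumptions (dictionaries stand in for the in-memory tables, and a missing second-level table models "on disk or unallocated"; all names are ours):

```python
PAGE_BITS = 12                      # 4 KB pages
LEVEL_BITS = 10                     # 1024 entries per table

def translate(vaddr, top_table):
    """Walk a two-level page table; return a physical address or None."""
    p1 = vaddr >> (PAGE_BITS + LEVEL_BITS)                  # bits 31-22
    p2 = (vaddr >> PAGE_BITS) & ((1 << LEVEL_BITS) - 1)     # bits 21-12
    offset = vaddr & ((1 << PAGE_BITS) - 1)                 # bits 11-0
    second = top_table.get(p1)
    if second is None or p2 not in second:
        return None                                         # page fault
    return (second[p2] << PAGE_BITS) | offset               # frame || offset

top = {0: {3: 0x2A}}                # virtual page (0, 3) -> frame 0x2A
print(hex(translate(0x3ABC, top)))  # -> 0x2aabc
```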
VM and disk: page replacement policy
[Figure: a clock over the set of all pages in memory, with head and tail pointers sweeping the used bits; freed pages go on the free list.]
- Each page table entry has a used bit (set to 1 on any reference) and a dirty bit (set when the page is written)
- Head pointer: clears the used bit in the page table
- Tail pointer: if the used bit is still clear when it arrives, place the page on the free list; schedule pages with the dirty bit set to be written to disk
- The free list holds free pages ready for reuse
- Architect's role: support setting the dirty and used bits
Virtual memory performance
- Address translation requires a physical memory access to read the page table
- We must then access physical memory again to actually get the data
- So each load performs at least 2 memory reads, and each store performs at least 1 memory read followed by a write
Improving virtual memory performance
Use a cache for common translations: the translation lookaside buffer (TLB).
[Figure: each TLB entry holds a valid bit, a tag (the virtual page number), and a physical page number; the page offset bypasses translation.]
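A minimal Python sketch of a TLB in front of the page-table walk (a dictionary stands in for a fully associative TLB with no capacity limit; all names are ours):

```python
PAGE_BITS = 12
tlb = {}                            # virtual page number -> physical frame

def lookup(vaddr, page_table):
    """A TLB hit avoids the page-table walk; a miss fills the TLB."""
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    frame = tlb.get(vpn)
    if frame is None:               # TLB miss: walk the page table
        frame = page_table[vpn]     # (would fault if unmapped)
        tlb[vpn] = frame            # cache the translation
    return (frame << PAGE_BITS) | offset

pt = {3: 0x2A}
print(hex(lookup(0x3ABC, pt)))      # miss, fills the TLB -> 0x2aabc
print(hex(lookup(0x3F00, pt)))      # TLB hit -> 0x2af00
```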
Caches and virtual memory
We now have two different addresses, virtual and physical. Which should we use to access the cache?
- Physical address: simpler to manage, but slower access (the address must be translated first)
- Virtual address: faster access, but aliasing problems and difficult management
- Use both: virtually indexed, physically tagged
Three Advantages of Virtual Memory
Translation:
- A program can be given a consistent view of memory, even though physical memory is scrambled
- Makes multithreading reasonable (now used a lot!)
- Only the most important part of a program (the "working set") must be in physical memory
- Contiguous structures (like stacks) use only as much physical memory as necessary, yet can still grow later
Protection:
- Different threads (or processes) are protected from each other
- Different pages can be given special behavior (read-only, invisible to user programs, etc.)
- Kernel data is protected from user programs
- Very important for protection against malicious programs
Sharing:
- The same physical page can be mapped to multiple users ("shared memory")
- Programs can share physical memory without knowing what else is there
- Makes memory appear larger than it actually is
Average memory access time
AMAT = (hit time) + (miss rate) x (miss penalty)
Given the following:
- Cache: 1-cycle access time
- Memory: 100-cycle access time
- Disk: 10,000-cycle access time
What is the average memory access time if the cache hit rate is 90% and the memory hit rate is 80%?
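The slide leaves this as an exercise; applying the multilevel formula from earlier, one way to work it out in Python (our working, not part of the slides):

```python
# The memory's miss penalty is the disk access time; the cache's miss
# penalty is the average time to service a request from memory and below.
memory_amat = 100 + 0.20 * 10_000    # memory misses 20% of the time
amat = 1 + 0.10 * memory_amat        # cache misses 10% of the time
print(amat)                          # -> 211.0 cycles
```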