CS1104 – Computer Organization
PART 2: Computer Architecture
Lecture 10
Memory Hierarchy
Topics
- Memories: types
- Memory hierarchy: why?
- Basics of caches
- Measuring cache performance
- Improving cache performance
- Framework for memory hierarchies
Memories: Review

SRAM:
- value is stored on a pair of inverting gates
- very fast, but takes up more space than DRAM (4 to 6 transistors)
- access time: 5-25 ns
- cost (US$) per MByte in 1997: 100 to 250

DRAM:
- value is stored as a charge on a capacitor (must be refreshed)
- very small, but slower than SRAM (by a factor of 5 to 10)
- access time: 60-120 ns
- cost (US$) per MByte in 1997: 5 to 10
Memory Hierarchy: why?

Users want large and fast memories!
- SRAM access times are 2-25 ns at a cost of $100 to $250 per MByte.
- DRAM access times are 60-120 ns at a cost of $5 to $10 per MByte.
- Disk access times are 10 to 20 million ns at a cost of $0.10 to $0.20 per MByte.

Try to give it to them anyway: build a memory hierarchy.

[Figure: the memory hierarchy as a pyramid from the CPU through Level 1, Level 2, ..., Level n; size grows and speed drops moving away from the CPU.]
Memory Hierarchy: requirements

If a level is closer to the processor, it must:
- be smaller
- be faster
- contain a subset (the most recently used data) of the levels beneath it
- contain all the data held in the levels above it

The lowest level (usually disk or main memory) contains all the available data.
Locality

The principle that makes having a memory hierarchy a good idea. If an item is referenced:
- temporal locality: it will tend to be referenced again soon
- spatial locality: nearby items will tend to be referenced soon

Our initial focus: two levels (upper, lower)
- block: the minimum unit of data
- hit: the data requested is in the upper level
- miss: the data requested is not in the upper level
Caches: exploiting locality
Cache

Two issues:
- How do we know if a data item is in the cache?
- If it is, how do we find it?

Our first example: block size is one word of data, "direct mapped": for each item of data at the lower level, there is exactly one location in the cache where it might be (i.e., lots of items at the lower level share locations in the upper level).
Direct Mapped Cache

Cache location 0 can be occupied by data from memory locations 0, 4, 8, ... In general: any memory location that is a multiple of 4.

[Figure: a 4-byte direct mapped cache (indices 0-3) and memory addresses 0-F; each memory address maps to cache index (address modulo 4).]
Direct Mapped Cache

Mapping: the address is taken modulo the number of blocks in the cache.

[Figure: an 8-block cache (indices 000-111) and memory; addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, 11101 map to cache indices 001 and 101 (address modulo 8).]
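The mapping rule can be checked with a short sketch (plain Python, not from the slides): the cache index is the memory address modulo the number of cache blocks, which for a power-of-two cache is just the low bits of the address.

```python
# Direct-mapped placement: index = address mod number of blocks.
# For an 8-block cache, the index is the low 3 bits of the address.
NUM_BLOCKS = 8

for addr in [0b00001, 0b00101, 0b01001, 0b01101,
             0b10001, 0b10101, 0b11001, 0b11101]:
    index = addr % NUM_BLOCKS
    print(f"{addr:05b} -> index {index:03b}")
```

Every address in the list ends in 001 or 101, so only cache indices 001 and 101 are ever used, exactly as in the figure.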
Issues with Direct Mapped Caches

1. Since multiple memory addresses map to the same cache index, how do we tell which one is in there?
2. What if we have a block size > 1 byte?

Solution: divide the memory address into three fields:

    ttttttttttttttttt iiiiiiiiii oooo
    tag               index      byte offset
    (to check if we   (to select (within the
    have the correct  the block) block)
    block)
Direct Mapped Caches: Terminology

All fields are read as unsigned integers.
- Index: specifies the cache index (which "row" of the cache we should look in)
- Offset: once we've found the correct block, specifies which byte within the block we want
- Tag: the remaining bits after the offset and index are determined; these are used to distinguish between all the memory addresses that map to the same location
Direct Mapped Cache: Example

Suppose we have a 16 KB direct-mapped cache with 4-word blocks. Determine the size of the tag, index and offset fields if we're using a 32-bit architecture.

Offset:
- we need to specify the correct byte within a block
- a block contains 4 words = 16 bytes = 2^4 bytes
- so we need 4 bits to specify the correct byte
Index:
- we need to specify the correct row in the cache
- the cache contains 16 KB = 2^14 bytes
- a block contains 2^4 bytes (4 words)
- # rows/cache = # blocks/cache (since there is one block per row)
  = (bytes/cache) / (bytes/row)
  = (2^14 bytes/cache) / (2^4 bytes/row)
  = 2^10 rows/cache
- so we need 10 bits to specify this many rows
Tag:
- use the remaining bits as the tag
- tag length = memory address length - offset - index = 32 - 4 - 10 = 18 bits
- so the tag is the leftmost 18 bits of the memory address
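The three fields can be extracted with a few shifts and masks. The sketch below (illustrative Python, not part of the slides; the function name split_address is made up) hard-codes the widths just derived:

```python
# Field widths for a 16 KB direct-mapped cache with 4-word (16-byte)
# blocks on a 32-bit architecture, as derived above.
OFFSET_BITS = 4                            # 16 bytes per block
INDEX_BITS = 10                            # 2^14 / 2^4 = 2^10 rows
TAG_BITS = 32 - INDEX_BITS - OFFSET_BITS   # 18

def split_address(addr):
    """Split a 32-bit byte address into (tag, index, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(split_address(0x00008014))  # -> (2, 1, 4)
```

The printed result matches the later worked breakdown of 0x00008014: tag 0b10 = 2, index 0b0000000001 = 1, offset 0b0100 = 4.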
Direct Mapped Cache for MIPS

[Figure: a MIPS direct-mapped cache with one-word blocks and 1024 entries. Address bit positions 31-0 are split into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2) and a 2-bit byte offset (bits 1-0). The index selects one of entries 0-1023, each holding a valid bit, a 20-bit tag and 32 bits of data; Hit is asserted when the entry is valid and its stored tag matches the address tag, and Data delivers the stored word.]
#Bits required (example)

- 32-bit byte addresses
- a direct-mapped cache of size 2^n words with one-word (4-byte) blocks

What is the size of the tag field?
- 32 - (n + 2) bits (2 bits for the byte offset and n bits for the index)

What is the total number of bits in the cache?
- 2^n x (block size + tag size + valid field size)
  = 2^n x (32 + (32 - n - 2) + 1)   (because the block size is 32 bits)
  = 2^n x (63 - n)
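The formula is easy to sanity-check numerically; a small sketch (Python, the function name is mine):

```python
def total_cache_bits(n):
    """Total bits in a 2^n-word direct-mapped cache with one-word
    blocks and 32-bit byte addresses: 2^n * (63 - n)."""
    data_bits = 32                 # one-word block
    tag_bits = 32 - (n + 2)        # 2 bits byte offset, n bits index
    valid_bits = 1
    return (1 << n) * (data_bits + tag_bits + valid_bits)

# e.g. n = 10: a 1K-word (4 KB of data) cache needs 2^10 * 53 bits
print(total_cache_bits(10))  # -> 54272
```

Note that the overhead (tag + valid bits) is a sizeable fraction of the data bits for small block sizes, which is one motivation for larger blocks.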
Direct Mapped Cache: taking advantage of spatial locality

[Figure: a direct-mapped cache with 4K entries and four-word (128-bit) blocks. Address bit positions 31-0 are split into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit block offset (bits 3-2) and a 2-bit byte offset (bits 1-0). Each entry holds a valid bit, a 16-bit tag and 128 bits of data; a multiplexor uses the block offset to select the requested 32-bit word, and Hit is asserted on a valid tag match.]
Accessing data in a direct mapped cache

Example: a 16 KB, direct-mapped cache with 4-word blocks. Read 4 addresses: 0x00000014, 0x0000001C, 0x00000034, 0x00008014. Memory values are shown below (only the cache/memory level of the hierarchy):

Address (hex)   Value of word
0x00000010      a
0x00000014      b
0x00000018      c
0x0000001C      d
...             ...
0x00000030      e
0x00000034      f
0x00000038      g
0x0000003C      h
...             ...
0x00008010      i
0x00008014      j
0x00008018      k
0x0000801C      l
...             ...
The 4 addresses divided (for convenience) into tag, index and byte offset fields:

Tag                 Index       Offset
000000000000000000  0000000001  0100   (0x00000014)
000000000000000000  0000000001  1100   (0x0000001C)
000000000000000000  0000000011  0100   (0x00000034)
000000000000000010  0000000001  0100   (0x00008014)
Hits vs. Misses

Read hits:
- this is what we want!

Read misses:
- stall the CPU, fetch the block from memory, deliver it to the cache, restart the load instruction

Write hits:
- replace the data in both the cache and memory (write-through)
- write the data only into the cache, and write the block back to memory later (write-back)

Write misses:
- read the entire block into the cache, then write the word (allocate on write miss)
- do not read the cache line; just write to memory (no allocate on write miss)
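Replaying the four example addresses from the earlier slides through a tags-only model shows which reads hit. This is an illustrative sketch (Python; the function name is mine), assuming the 16 KB / 4-word-block cache above and a cold cache:

```python
OFFSET_BITS, INDEX_BITS = 4, 10   # 16-byte blocks, 2^10 rows

def run_accesses(addresses):
    """Return 'hit'/'miss' per access for a direct-mapped cache."""
    cache = {}                    # index -> tag currently stored there
    results = []
    for addr in addresses:
        index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
        tag = addr >> (OFFSET_BITS + INDEX_BITS)
        if cache.get(index) == tag:
            results.append("hit")
        else:
            results.append("miss")
            cache[index] = tag    # fetch the block on a miss
    return results

print(run_accesses([0x00000014, 0x0000001C, 0x00000034, 0x00008014]))
# -> ['miss', 'hit', 'miss', 'miss']
```

The last access misses even though index 1 is occupied: 0x00000014 and 0x00008014 share index 1 but differ in tag, so the new block evicts the old one.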
Hardware Issues

Make reading multiple words easier by using banks of memory. It can get a lot more complicated...

[Figure: three organizations of the CPU-cache-bus-memory path: a. one-word-wide memory organization; b. wide memory organization, with a multiplexor between the cache and the wide memory; c. interleaved memory organization with four memory banks (bank 0 to bank 3).]
Performance

Increasing the block size tends to decrease the miss rate.

[Figure: miss rate (0% to 40%) versus block size in bytes, with one curve per cache size: 1 KB, 8 KB, 16 KB, 64 KB and 256 KB.]
Performance (cont.)

Use split caches because there is more spatial locality in code:

Program  Block size (words)  Instruction miss rate  Data miss rate  Effective combined miss rate
gcc      1                   6.1%                   2.1%            5.4%
gcc      4                   2.0%                   1.7%            1.9%
spice    1                   1.2%                   1.3%            1.2%
spice    4                   0.3%                   0.6%            0.4%
Memory access times

[Figure: the same three memory organizations as under Hardware Issues: a. one-word-wide, b. wide, c. interleaved with four banks.]

Assume:
- 1 clock cycle to send the address
- 15 clock cycles to initiate each DRAM access
- 1 clock cycle to transfer a word of data

Clock cycles required to access 4 words:
- a. one-word-wide memory: 1 + 4x15 + 4x1 = 65
- b. wide (four-word) memory: 1 + 1x15 + 1 = 17
- c. interleaved memory (4 banks): 1 + 1x15 + 4x1 = 20
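The three totals follow from one timing model (1 cycle to send the address, 15 cycles per DRAM access, 1 cycle per word moved over the bus); a quick arithmetic check in Python (function name is mine):

```python
def access_cycles(dram_accesses, bus_transfers):
    """Cycles to read a 4-word block: 1 (send address)
    + 15 per DRAM access + 1 per bus transfer."""
    return 1 + 15 * dram_accesses + 1 * bus_transfers

print(access_cycles(4, 4))  # one-word-wide memory: 65
print(access_cycles(1, 1))  # wide (4-word) memory: 17
print(access_cycles(1, 4))  # interleaved, 4 banks: 20
```

Interleaving gets most of the benefit of a wide memory (one DRAM latency instead of four) while keeping the cheap one-word bus.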
Improving performance

Two ways of improving performance:
- decreasing the miss ratio: associativity
- decreasing the miss penalty: multilevel caches
Decreasing miss ratio with associativity

[Figure: four organizations of an eight-block cache. One-way set associative (direct mapped): 8 blocks (0-7), one tag/data pair per block. Two-way set associative: 4 sets (0-3) with 2 blocks per set. Four-way set associative: 2 sets (0-1) with 4 blocks per set. Eight-way set associative (fully associative): one set of 8 blocks.]
4-way set-associative cache

[Figure: a 4-way set-associative cache with 256 sets (0-255). Address bit positions 31-0 are split into a 22-bit tag (bits 31-10) and an 8-bit index (bits 9-2). The index selects a set; the address tag is compared in parallel against the four tags of that set (each entry holding a valid bit, a tag and 32 bits of data), and a 4-to-1 multiplexor selects the data of the matching way. Hit is asserted on a valid match.]
Tag size versus associativity

Consider a cache of 4K blocks, a four-word block size (four-word cache lines) and 32-bit addresses.

Direct mapped:
- byte offset = 4 bits (each block = 4 words = 16 bytes)
- index + tag = 32 - 4 = 28 bits
- for 4K blocks, 12 index bits are required
- #tag bits for each block = 28 - 12 = 16
- total #tag bits = 16 x 4K = 64 Kbits

4-way set-associative:
- #sets = 1K, so 10 index bits are required
- #tag bits for each block = 28 - 10 = 18
- total #tag bits = 4 x 18 x 1K = 72 Kbits
Block replacement policy

In a direct-mapped cache, when a miss occurs, the requested block can go in only one position. In a set-associative cache, there are multiple positions in a set where a block can be stored. If all the positions are filled, which block should be replaced?
- Least Recently Used (LRU) policy: replace the block that has gone unused the longest
- Random: choose a block at random and replace it
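An LRU set-associative cache can be modelled by keeping each set's tags in most-recently-used order. The sketch below (Python; the names and the tiny 2-set, 2-way geometry are mine, for illustration only) evicts from the front of the list:

```python
OFFSET_BITS = 4                            # 16-byte blocks, as before

def lru_accesses(addresses, num_sets=2, ways=2):
    """Return 'hit'/'miss' per access for an LRU set-associative cache."""
    sets = [[] for _ in range(num_sets)]   # each list: LRU ... MRU tags
    results = []
    for addr in addresses:
        block = addr >> OFFSET_BITS
        index, tag = block % num_sets, block // num_sets
        tags = sets[index]
        if tag in tags:
            tags.remove(tag)               # refresh: move to the MRU end
            tags.append(tag)
            results.append("hit")
        else:
            if len(tags) == ways:
                tags.pop(0)                # evict the LRU block
            tags.append(tag)
            results.append("miss")
    return results

# Blocks 0x00, 0x40, 0x80 all map to set 0; LRU keeps the re-used block.
print(lru_accesses([0x00, 0x40, 0x00, 0x80, 0x40]))
# -> ['miss', 'hit' appears only for the re-referenced block]
```

The third access hits because block 0x00 was touched again before the set filled; when 0x80 arrives the set is full, and LRU evicts 0x40 (the least recently used), so the final access to 0x40 misses.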
Common framework for memory hierarchies

Q1: where can a block be placed? (note: a block is in this case a cache line)
- direct mapped cache: one position
- n-way set-associative cache: n positions (typically 2-8)
- fully associative: anywhere

Q2: how is a block found?
- direct mapped: the index part of the address indicates the entry
- n-way: use the index to search all n cache blocks of the set
- fully associative: check all tags
Q3: which block should be replaced on a miss?
- direct mapped: no choice
- associative caches: use a replacement algorithm, such as LRU

Q4: what happens on a write?
- write-through or write-back
- on a write miss: allocate or no-allocate
Understanding (cache) misses: the three Cs
- compulsory miss: the first access to a block, which cannot yet be in the cache
- capacity miss: the cache cannot contain all the blocks the program needs
- conflict miss: multiple blocks compete for the same position (cannot occur in a fully associative cache)

[Figure: miss rate versus cache size for direct mapped (1-way), 2-way, 4-way and fully associative caches; the fully associative curve separates the compulsory and capacity components of the miss rate.]
Reading

Recommended:
- 3rd edition of the textbook: Chapter 7, Sections 7.1-7.3 and Section 7.5
- 2nd edition of the textbook: Chapter 7, Sections 7.1-7.3 and Section 7.5