Chapter 7b: Cache Memory Performance
Ch7b: EE/CS/CPE 3760 - Computer Organization, Seattle Pacific University
Direct Mapping Review
7.2
Figure: a 16-word main memory with 6-bit addresses (00 00 00 through 11 11 00) mapped onto a 4-entry direct-mapped cache.

Memory address (6 bits): Tag (2 bits) | Index (2 bits) | Byte offset (2 bits, always zero for words)
The split depends on the cache size.

Cache contents:
Index  Valid  Tag  Data
00     Y      00   5600
01     Y      11   775
10     Y      01   845
11     N      00   33234

Example: the word 5600 at memory address 00 00 00 is cached at index 00 with tag 00; the word 845 at address 01 10 00 is cached at index 10 with tag 01.

Each word has only one place it can be in the cache: the index must match exactly.
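The direct-mapped lookup on this slide can be sketched in a few lines of Python. This is only an illustration: the names split_address and lookup are made up for the example, and the cache contents are the four entries shown in the figure.

```python
def split_address(addr):
    """Split a 6-bit byte address into (tag, index, byte_offset), 2 bits each."""
    byte_offset = addr & 0b11      # always 0 for word accesses
    index = (addr >> 2) & 0b11     # selects one of the 4 cache entries
    tag = (addr >> 4) & 0b11       # identifies which memory word is stored there
    return tag, index, byte_offset

# cache[index] = (valid, tag, data), copied from the slide's figure
cache = {
    0b00: (True, 0b00, 5600),
    0b01: (True, 0b11, 775),
    0b10: (True, 0b01, 845),
    0b11: (False, 0b00, 33234),
}

def lookup(addr):
    """Return the cached word on a hit, or None on a miss."""
    tag, index, _ = split_address(addr)
    valid, stored_tag, data = cache[index]
    if valid and stored_tag == tag:
        return data
    return None

print(lookup(0b011000))  # tag 01, index 10: hit
print(lookup(0b001100))  # index 11 is not valid: miss
```

Only the entry selected by the index is ever checked, which is why a valid entry with the wrong tag counts as a miss even if other entries are free.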
Missed me, Missed me...
7.2
• What to do on a hit:
• Carry on... (Hits should take one cycle or less)
• What to do on an instruction fetch miss:
• Undo PC increment (PC <-- PC-4)
• Do a memory read
• Stall until memory returns the data
• Update the cache (data, tag and valid) at index
• Un-stall
• What to do on a load miss:
• Same thing, except don’t mess with the PC
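A minimal sketch of the load-miss steps above, assuming a word-addressed direct-mapped cache with one-word blocks. The class name, the stalls counter and the dictionary used as main memory are all hypothetical, and the PC handling for instruction fetches is not modeled.

```python
class DirectMappedCache:
    def __init__(self, num_entries, memory):
        self.num_entries = num_entries
        self.memory = memory                 # hypothetical backing store: word address -> value
        self.valid = [False] * num_entries
        self.tags = [0] * num_entries
        self.data = [0] * num_entries
        self.stalls = 0                      # counts misses that would stall the pipeline

    def read(self, word_addr):
        index = word_addr % self.num_entries
        tag = word_addr // self.num_entries
        if self.valid[index] and self.tags[index] == tag:
            return self.data[index]          # hit: carry on
        # Miss: stall, read memory, then update data, tag and valid at this index.
        self.stalls += 1
        value = self.memory[word_addr]
        self.data[index] = value
        self.tags[index] = tag
        self.valid[index] = True
        return value                         # un-stall


memory = {0: 5600, 4: 0}
cache = DirectMappedCache(4, memory)
cache.read(0)   # miss, fills index 0
cache.read(0)   # hit
cache.read(4)   # miss, same index, replaces the entry
print(cache.stalls)
```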
Missed me, Missed me...
7.2
• What to do on a store (hit or miss)
• Won’t do to just write it to the cache
• The cache would have a different (newer) value than main memory
• Simple Write-Through
• Write both the cache and memory
• Works correctly, but slowly
• Buffered Write-Through
• Write the cache
• Buffer a write request to main memory
• 1 to 10 buffer slots are typical
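Buffered write-through can be sketched as a small queue between the cache and memory. The code below is an illustrative model only: WriteBuffer, store and the dictionaries standing in for the cache and memory are assumptions, not part of the slides, and the "stall" is modeled by draining one entry when the buffer is full.

```python
from collections import deque

class WriteBuffer:
    def __init__(self, slots=4):
        self.slots = slots          # the slide suggests 1 to 10 slots
        self.pending = deque()      # queued (address, value) writes

    def push(self, addr, value):
        """Queue a write; return False if the buffer is full."""
        if len(self.pending) >= self.slots:
            return False
        self.pending.append((addr, value))
        return True

    def drain_one(self, memory):
        """Retire the oldest queued write to main memory."""
        if self.pending:
            addr, value = self.pending.popleft()
            memory[addr] = value

def store(cache, buffer, memory, addr, value):
    cache[addr] = value                    # write the cache immediately
    while not buffer.push(addr, value):    # buffer full: the CPU would stall here
        buffer.drain_one(memory)


cache, memory = {}, {}
buf = WriteBuffer(slots=2)
for addr, value in [(0, "a"), (1, "b"), (2, "c")]:
    store(cache, buf, memory, addr, value)
print(memory, list(buf.pending))
```

With two slots, the third store has to wait for the oldest write to reach memory, which is the only time the CPU pays the memory latency.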
Splitting up
• It is common to use two separate caches for Instructions and for Data
• All Instruction fetches use the I-cache
• All data accesses (loads and stores) use the D-cache
• This allows the CPU to access the I-cache at the same time it is accessing the D-cache
• Still have to share a single memory
Pipeline stages: IF | RF | EX | M | WB
7.2
Note: The hit rate will probably be lower than for a combined cache of the same total size.
What about Spatial Locality?
7.2
• Spatial locality says that data that is physically close (at nearby addresses) is likely to be accessed close together in time
• On a cache miss, don’t just grab the word needed, but also the words nearby
• The easiest way to do this is to increase the block size

Figure: address layout for a cache with 4-word blocks.
Address: Tag (bits 31-14, 18 bits) | Index (bits 13-4, 10 bits) | Block offset (bits 3-2, 2 bits) | Byte offset (bits 1-0, 2 bits)
Cache entry: V | Tag | Data (Word 3, Word 2, Word 1, Word 0)
One 4-word block: all words in the same block have the same index and tag
Note: 2^2 = 4, so a 2-bit block offset selects one of the 4 words
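32KByte/4-Word Block D.M. Cache
7.2
Figure: 32 KB direct-mapped cache with 4-word blocks.
Address: Tag (bits 31-15, 17 bits) | Index (bits 14-4, 11 bits) | Block offset (bits 3-2) | Byte offset (bits 1-0)
Cache: 2048 entries (index 0 through 2047), each with V, Tag (17 bits) and Data (one 4-word block)
32 KB / 4 Words/Block / 4 Bytes/Word --> 2K blocks; 2^11 = 2K, so the index is 11 bits
Hit: the selected entry is valid and its stored 17-bit tag equals the tag from the address
Data: the block offset drives a mux that selects one 32-bit word (3, 2, 1 or 0) from the block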
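The sizing arithmetic for this cache (32 KB, 4-word blocks, 2K blocks, 11 index bits, 17 tag bits) can be checked with a short script. The function name cache_fields and its parameters are hypothetical, and the sketch assumes every size is a power of two.

```python
def cache_fields(cache_bytes, block_words, addr_bits=32, word_bytes=4, ways=1):
    """Return the number of blocks and the width of each address field.

    Assumes cache size, block size and associativity are powers of two.
    """
    block_bytes = block_words * word_bytes
    num_blocks = cache_bytes // block_bytes
    num_sets = num_blocks // ways
    # For a power of two n, (n - 1).bit_length() equals log2(n).
    byte_offset_bits = (word_bytes - 1).bit_length()
    block_offset_bits = (block_words - 1).bit_length()
    index_bits = (num_sets - 1).bit_length()
    tag_bits = addr_bits - index_bits - block_offset_bits - byte_offset_bits
    return {
        "blocks": num_blocks,
        "tag_bits": tag_bits,
        "index_bits": index_bits,
        "block_offset_bits": block_offset_bits,
        "byte_offset_bits": byte_offset_bits,
    }

print(cache_fields(32 * 1024, 4))
```

The same function with ways=2 or ways=4 shows how each doubling of associativity removes one index bit and adds one tag bit.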
How Much Change?
7.2
Miss rates for DEC 3100 (MIPS machine)
Separate 64KB Instruction/Data Caches (16K 1-word blocks or 4K 4-word blocks)

Benchmark  Block size (words)  Instruction miss rate  Data miss rate  Combined miss rate
spice      1                   1.2%                   1.3%            1.2%
gcc        1                   6.1%                   2.1%            5.4%
spice      4                   0.3%                   0.6%            0.4%
gcc        4                   2.0%                   1.7%            1.9%
The cost of a cache miss
7.2
• For a memory access, assume:
• 1 clock cycle to send address to memory
• 40 clock cycles for each DRAM access (clock cycle 0.5 ns, 20 ns access time)
• 1 clock cycle to send each resulting data word
• Note: the 1-cycle transfer times actually depend on the bus speed
• Miss access time (4-word block)
• 4 x (Address + access + sending data word)
• 4 x (1 + 40 + 1) = 168 cycles for each miss
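The miss-penalty arithmetic can be written as a one-line function, which makes it easy to try other block sizes or DRAM timings. The name miss_cycles and its parameter names are just labels chosen for this sketch; the default values are the slide's assumptions.

```python
def miss_cycles(block_words=4, addr_cycles=1, access_cycles=40, transfer_cycles=1):
    """Cycles to fetch a block when each word needs a full, separate memory access."""
    return block_words * (addr_cycles + access_cycles + transfer_cycles)

print(miss_cycles())               # 4-word block with the slide's timings
print(miss_cycles(block_words=1))  # one-word block for comparison
```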
Memory Interleaving
7.2
Default
Figure: CPU - Bus - Cache - Bus (4 bytes) - Memory
Must finish accessing one word before starting the next access
Timing: four back-to-back accesses of 1 + 40 + 1 cycles: (1+40+1)x4 = 168 cycles

Interleaving
Figure: CPU - Bus - Cache - Bus (4 bytes) - Memory0, Memory1, Memory2, Memory3 (each memory on its own bus)
Begin accessing one word, and while waiting, start accessing other three words (pipelining)
Requires 4 separate memories, each 1/4 size
Spread out addresses among the memories
Timing: four overlapped accesses of 1 + 40 + 1 cycles, each starting one cycle after the previous one: 45 cycles

Interleaving works perfectly with caches
Sophisticated DRAMs (EDO, SDRAM, etc.) provide support for this
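The 45-cycle figure follows if each of the four bank accesses starts one cycle after the previous one, so the last word finishes 3 cycles after the first. The sketch below encodes that reading of the timing diagram; the one-cycle stagger and the function name are assumptions made for the example.

```python
def interleaved_miss_cycles(words=4, addr=1, access=40, transfer=1, stagger=1):
    """Cycles to fetch a block from interleaved banks.

    Assumes one bank per word and that consecutive accesses start
    `stagger` cycles apart, so they overlap almost completely.
    """
    one_access = addr + access + transfer
    return (words - 1) * stagger + one_access

def speedup_over_sequential(words=4, addr=1, access=40, transfer=1):
    """Ratio of the non-interleaved miss time to the interleaved one."""
    sequential = words * (addr + access + transfer)
    return sequential / interleaved_miss_cycles(words, addr, access, transfer)

print(interleaved_miss_cycles())
print(round(speedup_over_sequential(), 2))
```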
The issue of Writes
7.2
Perform a write to a location with index 1000, tag 2420, word 1 (value 4334)
On a read miss, we read the entire block from memory into the cache
On a write hit, we write one word into the block. The other words in the block are unchanged.
On a write miss, we write one word into the block and update the tag.

Block with index 1000:
         V  tag   word 3  word 2  word 1  word 0
Before:  1  3000  23      322     355     2
After:   1  2420  23      322     4334    2
The other words are still the old data (for tag 3000). Bad news!
Solution 1: Don’t update the cache on a write miss. Write only to memory.
Solution 2: On a write miss, first read the referenced block in (including the old value of the word being written), then write the new word into the cache and write-through to memory.
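Both solutions can be sketched with dictionaries standing in for the cache and main memory. Everything here is illustrative: the function names, the (tag, index) memory key and the memory contents for tag 2420 are invented for the example, while the old block (tag 3000) uses the values from the slide's figure as best they can be read. Word lists are ordered word 0 first.

```python
def is_hit(entry, tag):
    return entry is not None and entry["valid"] and entry["tag"] == tag

def write_no_allocate(cache, memory, index, tag, word, value):
    """Solution 1: on a write miss, leave the cache alone and write only to memory."""
    entry = cache.get(index)
    if is_hit(entry, tag):
        entry["words"][word] = value      # write hit: update just that word
    memory[(tag, index)][word] = value    # write-through in both cases

def write_allocate(cache, memory, index, tag, word, value):
    """Solution 2: on a write miss, read the whole block in, then write the word."""
    entry = cache.get(index)
    if not is_hit(entry, tag):
        entry = {"valid": True, "tag": tag, "words": list(memory[(tag, index)])}
        cache[index] = entry
    entry["words"][word] = value
    memory[(tag, index)][word] = value    # write-through


memory = {(3000, 1000): [2, 355, 322, 23],   # old block, from the slide
          (2420, 1000): [10, 20, 30, 40]}    # hypothetical contents
cache = {1000: {"valid": True, "tag": 3000, "words": [2, 355, 322, 23]}}
write_allocate(cache, memory, 1000, 2420, 1, 4334)
print(cache[1000])
```

After the call, every word in the cached block belongs to tag 2420, so the mixed-block problem described above cannot occur.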
Choosing a block size
7.2
• Large block sizes help with spatial locality, but...
• It takes time to read the memory in
• Larger block sizes increase the time for misses
• It reduces the number of blocks in the cache
• Number of blocks = cache size/block size
• Need to find a middle ground
• 16-64 bytes works nicely
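The two costs of a larger block can be put side by side with a small calculation. This sketch reuses the earlier timing assumptions (1 + 40 + 1 cycles per word, no interleaving) and a 32 KB cache; the function name block_tradeoff is hypothetical.

```python
def block_tradeoff(cache_bytes, block_bytes, addr=1, access=40, transfer=1, word_bytes=4):
    """Return (number of blocks, miss cycles) for a given block size."""
    num_blocks = cache_bytes // block_bytes           # cache size / block size
    words = block_bytes // word_bytes
    miss_cycles = words * (addr + access + transfer)  # one full access per word
    return num_blocks, miss_cycles

for block_bytes in (4, 16, 64, 256):
    print(block_bytes, block_tradeoff(32 * 1024, block_bytes))
```

Each doubling of the block size halves the number of blocks and, under these assumptions, doubles the time for a miss.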
Other Cache organizations
7.3
Direct Mapped
Address = Tag | Index | Block offset
Cache: 16 entries (index 0 through 15), each with V, Tag and Data
Each address has only one possible location

Fully Associative
Address = Tag | Block offset
No Index
Cache: entries with V, Tag and Data only
Fully Associative vs. Direct Mapped
7.3
• Fully associative caches provide much greater flexibility
• Nothing gets “thrown out” of the cache until it is completely full
• Direct-mapped caches are more rigid
• Any cached data goes directly where the index says to, even if the rest of the cache is empty
• A problem, though...
• Fully associative caches require a complete search through all the tags to see if there’s a hit
• Direct-mapped caches only need to look one place
A Compromise
7.3
2-Way set associative
Address = Tag | Index | Block offset
Cache: 8 sets (index 0 through 7), each holding two entries with V, Tag and Data
Each address has two possible locations with the same index
One fewer index bit: 1/2 the indexes

4-Way set associative
Address = Tag | Index | Block offset
Cache: 4 sets (index 0 through 3), each holding four entries with V, Tag and Data
Each address has four possible locations with the same index
Two fewer index bits: 1/4 the indexes
Set Associative Example
7.3
128-byte cache, 4-word blocks, 10-bit addresses, 1-4 way associativity
Address fields: Tag (3-5 bits) | Index (1-3 bits) | Block offset (2 bits) | Byte offset (2 bits)

Organization      Sets                 Index bits  Tag bits
Direct-Mapped     8 (000 through 111)  3           3
2-Way Set Assoc.  4 (00 through 11)    2           4
4-Way Set Assoc.  2 (0 and 1)          1           5

Address     Direct-Mapped  2-Way Set Assoc.  4-Way Set Assoc.
0100111000  Miss           Miss              Miss
1100110100  Miss           Miss              Miss
0100111100  Miss           Hit               Hit
0110110000  Miss           Miss              Miss
1100111000  Miss           Miss              Hit

All three caches start empty (every valid bit is 0).
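The hit/miss pattern in this example can be reproduced with a short simulation. The slide does not say which replacement policy it uses; the sketch below assumes LRU within each set, which gives the same results as the figure. The function name simulate is hypothetical.

```python
def simulate(addresses, cache_bytes=128, block_bytes=16, ways=1):
    """Return a list of "Hit"/"Miss" results for a set-associative cache (LRU assumed)."""
    num_sets = cache_bytes // block_bytes // ways
    # Each set is a list of tags, ordered from least to most recently used.
    sets = [[] for _ in range(num_sets)]
    results = []
    for addr in addresses:
        block = addr // block_bytes      # drop the block and byte offsets
        index = block % num_sets
        tag = block // num_sets
        tags = sets[index]
        if tag in tags:
            tags.remove(tag)             # re-appended below as most recent
            results.append("Hit")
        else:
            if len(tags) == ways:
                tags.pop(0)              # set is full: evict the least recently used
            results.append("Miss")
        tags.append(tag)
    return results

trace = [0b0100111000, 0b1100110100, 0b0100111100, 0b0110110000, 0b1100111000]
for ways in (1, 2, 4):
    print(ways, simulate(trace, ways=ways))
```

All five addresses map to the same set in every organization, so the only thing that changes between the three runs is how many blocks that set can hold.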
New Performance Numbers
7.3
Miss rates for DEC 3100 (MIPS machine)
Separate 64KB Instruction/Data Caches (4K 4-word blocks)

Benchmark  Associativity  Instruction miss rate  Data miss rate  Combined miss rate
gcc        Direct         2.0%                   1.7%            1.9%
gcc        2-way          1.6%                   1.4%            1.5%
gcc        4-way          1.6%                   1.4%            1.5%
spice      Direct         0.3%                   0.6%            0.4%
spice      2-way          0.3%                   0.6%            0.4%
spice      4-way          0.3%                   0.6%            0.4%
Block Replacement Strategies
7.5
• We have to replace a block when there is a collision
• Collisions occur whenever the selected set is full
• Strategy 1: Ideal (Oracle)
• Replace the block that won’t be used again for the longest time
• Drawback - Requires knowledge of the future
• Strategy 2: Least Recently Used (LRU)
• Replace the block that was last used (hit) the longest time ago
• Drawback - Requires difficult bookkeeping
• Strategy 3: Approximate LRU
• Set a use bit for each block every time it is hit, clear all periodically
• Replace a block without its use bit set
• Strategy 4: Random
• Pick a block at random (works almost as well as approx. LRU)
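Approximate LRU (Strategy 3) can be sketched for a single set as follows. The class name UseBitSet is hypothetical, and two details are assumptions not stated on the slide: a newly filled block starts with its use bit clear, and if every block has its use bit set the bits are cleared and the first way is replaced.

```python
class UseBitSet:
    """One cache set with a use bit per block (approximate LRU)."""

    def __init__(self, ways):
        self.tags = [None] * ways
        self.use = [False] * ways

    def clear_use_bits(self):
        """Called periodically by the hardware in a real design."""
        self.use = [False] * len(self.use)

    def access(self, tag):
        """Return True on a hit, False on a miss (after filling the block)."""
        if tag in self.tags:
            self.use[self.tags.index(tag)] = True   # set the use bit on every hit
            return True
        if None in self.tags:
            victim = self.tags.index(None)          # an empty way needs no eviction
        else:
            unused = [w for w, used in enumerate(self.use) if not used]
            if not unused:                          # assumption: all used, so start over
                self.clear_use_bits()
                unused = list(range(len(self.tags)))
            victim = unused[0]
        self.tags[victim] = tag
        self.use[victim] = False                    # assumption: bit is set only on hits
        return False


s = UseBitSet(ways=2)
for tag in ["A", "B", "A", "C"]:
    s.access(tag)
print(s.tags)   # "B" was never hit, so it is the block that gets replaced
```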
The Three C’s of Misses
7.5
• Compulsory Misses
• The first time a memory location is accessed, it is always a miss
• Also known as cold-start misses
• The only way to decrease the compulsory miss rate is to increase the block size
• Capacity Misses
• Occur when a program is using more data than can fit in the cache
• Some misses will result because the cache isn’t big enough
• Increasing the size of the cache solves this problem
• Conflict Misses
• Occur when a block forces out another block with the same index
• Increasing Associativity reduces conflict misses
• Worst in Direct-Mapped, non-existent in Fully Associative
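One common way to separate the three kinds of misses is to compare the real cache against a fully associative LRU cache of the same size: a miss on a never-seen block is compulsory, a miss that the fully associative cache also takes is capacity, and the rest are conflict misses. The sketch below applies that convention to a direct-mapped cache; the function name and the use of block numbers as input are assumptions made for the example.

```python
from collections import OrderedDict

def classify_misses(block_trace, num_blocks):
    """Return (compulsory, capacity, conflict) miss counts for a direct-mapped cache."""
    seen = set()                 # every block referenced so far
    fully_assoc = OrderedDict()  # same-size fully associative cache, LRU order
    direct = {}                  # direct-mapped cache: index -> block number
    compulsory = capacity = conflict = 0
    for block in block_trace:
        # Reference model: fully associative with LRU replacement.
        fa_hit = block in fully_assoc
        if fa_hit:
            fully_assoc.move_to_end(block)
        else:
            if len(fully_assoc) == num_blocks:
                fully_assoc.popitem(last=False)   # evict least recently used
            fully_assoc[block] = True
        # Cache under test: direct mapped.
        index = block % num_blocks
        dm_hit = direct.get(index) == block
        direct[index] = block
        if not dm_hit:
            if block not in seen:
                compulsory += 1
            elif not fa_hit:
                capacity += 1
            else:
                conflict += 1
        seen.add(block)
    return compulsory, capacity, conflict

# Block numbers of the five addresses in the set associative example (address // 16).
print(classify_misses([19, 51, 19, 27, 51], num_blocks=8))
```

For that trace the direct-mapped cache takes five misses, of which three are compulsory and two are conflict misses; a fully associative cache of the same size would remove the two conflict misses.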