Computer Architecture CSE 3322
Lecture 20
Web Site: crystal.uta.edu/~jpatters/cse3322
Phase II Project due Monday Dec 1
Problems: 7.20, 7.22, 7.27, 7.28 Due Nov 17
DECStation 3100

Program   Block Size   Instruction   Data        Effective
          (words)      Miss Rate     Miss Rate   Miss Rate
gcc       1            6.1%          2.1%        5.4%
gcc       4            2.0%          1.7%        1.9%
spice     1            1.2%          1.3%        1.2%
spice     4            0.3%          0.6%        0.4%

Write misses are included in the 4 word block figures, but not in the 1 word figures.
Remember: Miss Penalty goes UP!
Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty
[Figure: Miss Penalty vs. Block Size, made up of Access Time and Transfer Time]
[Figure: Miss Rate vs. Block Size for a constant size cache, where larger blocks mean fewer blocks]
Reducing the Miss Penalty
Reduce the time to read the multiple words from Main Memory to the cache block.
Don’t wait for the complete block to be transferred: “Early Restart”
• Access and transfer each word sequentially.
• As soon as the requested word is in the cache, restart the processor to access the cache, and finish the block transfer while the cache is available.
Variation: “Requested Word First”
Disadvantage: Complex Control. The processor is likely to access the cache block before the transfer is complete.
Reducing the Miss Penalty
Reduce the time to read the multiple words from Main Memory to the cache block.
Assume Memory Access times:
• 1 clock cycle to send address
• 10 clock cycles to access DRAM
• 1 clock cycle to send a word of data
For sequential transfer of 4 data words:
Miss Penalty = 1 + 4 * (10 + 1) = 45 clock cycles
What if we could read a block of words simultaneously from the Main Memory?
[Figure: Cache Entry (Valid, Tag, Word3, Word2, Word1, Word0) connected to Main Memory by four 32-bit paths, one per word]
Miss Penalty = 1 + 10 + 1 = 12 clock cycles
Miss Penalty for Sequential = 45 clock cycles
What about 4 banks of Memory? “Interleaved Memory”
[Figure: Cache connected to Bank 3, Bank 2, Bank 1, Bank 0, all receiving the same Address]
Banks are accessed in parallel. Words are transferred serially.
Miss Penalty = 1 + 10 + 4 * 1 = 15 clock cycles
Miss Penalty for Parallel = 12 clock cycles
Miss Penalty for Sequential = 45 clock cycles
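The three memory organizations can be compared side by side with a small sketch. The function names and the default parameter values (1, 10 and 1 clock cycles, taken from the assumed access times in this lecture) are illustrative only, and the interleaved case assumes one bank per word of the block.

```python
def miss_penalty_sequential(words, send_addr=1, dram_access=10, send_word=1):
    """Send the address once, then access and transfer each word in turn."""
    return send_addr + words * (dram_access + send_word)


def miss_penalty_wide(send_addr=1, dram_access=10, send_word=1):
    """Read the whole block in one access and transfer it in one step."""
    return send_addr + dram_access + send_word


def miss_penalty_interleaved(words, send_addr=1, dram_access=10, send_word=1):
    """Banks overlap their accesses; words still cross the bus one at a time.

    Assumes the number of banks equals the number of words in the block.
    """
    return send_addr + dram_access + words * send_word


print("sequential :", miss_penalty_sequential(4))   # 1 + 4 * (10 + 1)
print("wide       :", miss_penalty_wide())          # 1 + 10 + 1
print("interleaved:", miss_penalty_interleaved(4))  # 1 + 10 + 4 * 1
```

With 4 word blocks these evaluate to 45, 12 and 15 clock cycles respectively. Interleaving gets close to the wide organization without needing a 128-bit path to memory.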
Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty
[Figure: Average Access Time vs. Block Size. Annotations: Increase Cache size; Increase Block size; Main Memory Organization]
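As a rough illustration of the formula, the sketch below plugs in the spice effective miss rate for 4 word blocks (0.4%) from the DECStation 3100 table and the miss penalties that the three memory organizations give for a 10-cycle DRAM access. The hit time of 1 clock cycle is an assumption made here for illustration; the lecture does not state one.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty."""
    return hit_time + miss_rate * miss_penalty


HIT_TIME = 1          # clock cycles; ASSUMED, not given in the lecture
MISS_RATE = 0.004     # spice, 4 word blocks, effective miss rate

# Miss penalties in clock cycles for a 4 word block
PENALTIES = {
    "sequential": 1 + 4 * (10 + 1),
    "wide": 1 + 10 + 1,
    "interleaved": 1 + 10 + 4 * 1,
}

results = {name: amat(HIT_TIME, MISS_RATE, p) for name, p in PENALTIES.items()}
for name, value in results.items():
    print(f"{name:12s} {value:.3f} clock cycles")
```

Under these assumptions the average access time stays close to the hit time, because the miss rate is small. The memory organization only affects the miss-penalty term.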
CPU Performance with Cache Memory
For a program (assuming no penalty for a Hit):
CPU time = CPU execution time + CPU Hold time
CPU Hold time = Memory Stall Clock Cycles * Clock Cycle time
Memory Stall Clock Cycles = Read Stall Cycles + Write Stall Cycles
Read Stall Cycles = (Reads / Program) * Read Miss Rate * Read Miss Penalty
CPU Performance with Cache Memory
Write Stall Cycles = (Writes / Program) * Write Miss Rate * Write Miss Penalty + Write Buffer Stalls
Write Buffer Stalls should be << Write Miss Stalls
So, approximately,
Write Stall Cycles = (Writes / Program) * Write Miss Rate * Write Miss Penalty
CPU Performance with Cache Memory
Memory Stall Clock Cycles = Read Stall Cycles + Write Stall Cycles
  = (Reads / Program) * Read Miss Rate * Read Miss Penalty
  + (Writes / Program) * Write Miss Rate * Write Miss Penalty
The Miss Penalties are approximately the same (fetch the block). So, combining the Reads and Writes together into a weighted Miss Rate:
Memory Stall Cycles = (Memory Accesses / Program) * Miss Rate * Miss Penalty
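The step from separate read and write terms to a single weighted miss rate can be written out explicitly. The symbols are shorthand introduced here, not notation from the lecture: R and W are reads and writes per program, m_r and m_w are the read and write miss rates, and P is the common miss penalty.

```latex
\begin{align*}
\text{Memory Stall Cycles}
  &= R \, m_r \, P + W \, m_w \, P \\
  &= (R + W) \cdot \frac{R \, m_r + W \, m_w}{R + W} \cdot P \\
  &= A \cdot m \cdot P,
  \qquad A = R + W, \quad m = \frac{R \, m_r + W \, m_w}{R + W}
\end{align*}
```

Here A is the number of memory accesses per program and m is the miss rate weighted by how often reads and writes occur.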
CPU Performance with Cache Memory
For a program (assuming no penalty for a Hit):
CPU time = CPU execution time + CPU Hold time
CPU Hold time = Memory Stall Clock Cycles * Clock Cycle time
CPU time = CPU execution time + (Memory Accesses / Program) * Miss Rate * Miss Penalty * Clock Cycle time
Dividing both sides by Instructions / Program and Clock Cycle time:
Effective CPI = Execution CPI + (Memory Accesses / Instruction) * Miss Rate * Miss Penalty
CPU Performance with Cache Memory
Effective CPI = Execution CPI + (Memory Accesses / Instruction) * Miss Rate * Miss Penalty
Consider the DECStation 3100 with 4 word blocks running spice:
• CPI = 1.2 without misses
• Instruction Miss Rate = 0.3%
• Data Miss Rate = 0.6%
• For spice, frequency of loads and stores = 9%
1.) Sequential Memory: Miss penalty = 65 clock cycles
2.) 4 Bank Interleaved: Miss penalty = 20 clock cycles
Eff CPI = 1.2 + (1 * .003 + .09 * .006) * Miss Penalty
        = 1.2 + .00354 * Miss Penalty
1.) Eff CPI = 1.2 + .00354 * 65 = 1.2 + 0.2301 = 1.43
2.) Eff CPI = 1.2 + .00354 * 20 = 1.2 + 0.071 = 1.271
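The same calculation as a short sketch. The function and parameter names are made up for illustration. It assumes, as the slide does, one instruction fetch per instruction plus 0.09 data accesses per instruction.

```python
def effective_cpi(base_cpi, inst_miss_rate, data_miss_rate,
                  data_refs_per_inst, miss_penalty):
    """Execution CPI plus memory stall cycles per instruction."""
    # Every instruction is fetched once; only loads and stores touch data.
    misses_per_inst = 1 * inst_miss_rate + data_refs_per_inst * data_miss_rate
    return base_cpi + misses_per_inst * miss_penalty


# spice on the DECStation 3100 with 4 word blocks
sequential = effective_cpi(1.2, 0.003, 0.006, 0.09, miss_penalty=65)
interleaved = effective_cpi(1.2, 0.003, 0.006, 0.09, miss_penalty=20)

print(f"sequential memory : {sequential:.3f}")
print(f"4 bank interleaved: {interleaved:.3f}")
```

This gives about 1.430 and 1.271, matching the two hand calculations.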
CPU Performance with Cache Memory
Consider the DECStation 3100 with 4 word blocks running spice:
• CPI = 1.2 without misses
• Instruction Miss Rate = 0.3%
• Data Miss Rate = 0.6%
• For spice, frequency of loads and stores = 9%
4 Bank Interleaved: Miss penalty = 20 clock cycles
Eff CPI = 1.271
What if we get a new processor and cache that runs at twice the clock frequency, but keep the same main memory speed?
Miss penalty = 40 clock cycles
Eff CPI = 1.2 + .00354 * 40 = 1.2 + 0.1416 = 1.342
Performance Fast clock / Performance Slow clock = (1.271 * 2 * clock cycle time) / (1.342 * clock cycle time) = 1.89
(Here clock cycle time is that of the fast clock, so the slow clock's cycle is 2 * clock cycle time.)
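A sketch of the comparison in terms of time per instruction. The names are illustrative, and the cycle times are in arbitrary units: 1.0 for the slow clock and 0.5 for the fast one.

```python
MISSES_PER_INST = 1 * 0.003 + 0.09 * 0.006   # spice, 4 word blocks


def time_per_instruction(base_cpi, miss_penalty_cycles, cycle_time):
    """Effective CPI multiplied by the clock cycle time."""
    return (base_cpi + MISSES_PER_INST * miss_penalty_cycles) * cycle_time


# Main memory is unchanged, so the same delay costs twice as many fast cycles.
slow = time_per_instruction(1.2, miss_penalty_cycles=20, cycle_time=1.0)
fast = time_per_instruction(1.2, miss_penalty_cycles=40, cycle_time=0.5)

speedup = slow / fast
print(f"speedup from doubling the clock: {speedup:.2f}")
```

Doubling the clock frequency yields a speedup of about 1.89 rather than 2, because the memory stall time does not shrink with the processor clock.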
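[Figure: Direct mapped cache with 4 word blocks and 4K entries]
Address (32 bits):
  bits 31 . . . 16   Tag (16 bits)
  bits 15 . . . 4    Index (12 bits)
  bits 3, 2          Block Offset
  bits 1, 0          Byte Offset
Each entry holds v (valid), Tag (16 bits), Word3, Word2, Word1, Word0 (32 bits each).
The Index selects one of the 4K entries. The stored Tag is compared (=) with the address Tag to produce Hit. The 2-bit Block Offset drives a Mux that selects one 32-bit word as Data.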
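The address fields in the figure can be extracted with shifts and masks. This is a minimal sketch for the 16-bit tag, 12-bit index, 2-bit block offset and 2-bit byte offset layout shown above. The function name and the sample address are hypothetical.

```python
def split_address(addr):
    """Split a 32-bit byte address into (tag, index, block_offset, byte_offset)."""
    byte_offset = addr & 0x3             # bits 1..0
    block_offset = (addr >> 2) & 0x3     # bits 3..2, selects Word0..Word3
    index = (addr >> 4) & 0xFFF          # bits 15..4, selects one of 4K entries
    tag = (addr >> 16) & 0xFFFF          # bits 31..16
    return tag, index, block_offset, byte_offset


tag, index, block_offset, byte_offset = split_address(0x12345678)
print(hex(tag), hex(index), block_offset, byte_offset)
```

For the sample address 0x12345678 the tag is 0x1234, the index is 0x567, the block offset is 2 and the byte offset is 0.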
Mapping word addresses to block addresses and cache addresses (4 word blocks, 8 block cache):

Block Address   Word Addresses               Cache Address
0               3     2     1     0          0
1               7     6     5     4          1
2               11    10    9     8          2
3               15    14    13    12         3
7               31    30    29    28         7
8               35    34    33    32         0
15              63    62    61    60         7
X               4X+3  4X+2  4X+1  4X         X modulo 8

Block Address = Word Addr / 4 (integer division)
Consider a Direct Mapped Cache with 4 word blocks with size of 8 blocks or 32 words.
Cache Address = (Word Addr / 4) modulo 8

Reference Sequence
Word Address   Block Address   Cache Address   Hit or Miss
6              1               1               Miss
7              1               1               Hit
8              2               2               Miss
9              2               2               Hit
80             20              4               Miss
6              1               1               Hit
7              1               1               Hit
8              2               2               Hit
9              2               2               Hit
81             20              4               Hit
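The reference sequence can be replayed with a small simulator. This is a sketch, not hardware-accurate: it assumes the cache starts empty and stores the full block address in each entry instead of a separate tag and valid bit. The function name is made up.

```python
def simulate_direct_mapped(word_addrs, words_per_block=4, num_blocks=8):
    """Return (word_addr, block_addr, cache_addr, outcome) for each reference."""
    cache = [None] * num_blocks          # None means the entry is not valid
    trace = []
    for word_addr in word_addrs:
        block_addr = word_addr // words_per_block
        cache_addr = block_addr % num_blocks
        hit = cache[cache_addr] == block_addr
        if not hit:
            cache[cache_addr] = block_addr   # fetch the block, replacing any old one
        trace.append((word_addr, block_addr, cache_addr, "Hit" if hit else "Miss"))
    return trace


for row in simulate_direct_mapped([6, 7, 8, 9, 80, 6, 7, 8, 9, 81]):
    print("%4d %4d %4d  %s" % row)
```

Word addresses 6 and 80 fall in different cache entries (1 and 4), so neither evicts the other, and the second pass through the sequence hits every time.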
Consider a Direct Mapped Cache with 4 word blocks with size of 8 blocks or 32 words.
Cache Address = (Word Addr / 4) modulo 8

Reference Sequence
Word Address   Block Address   Cache Address   Hit or Miss
6              1               1               Miss
7              1               1               Hit
8              2               2               Miss
9              2               2               Hit
68             17              1               Miss
6              1               1               Miss
7              1               1               Hit
8              2               2               Hit
9              2               2               Hit
69             17              1               Miss
How about putting a block in any unused block of the eight blocks?
[Cache entry: Tag, Word3, Word2, Word1, Word0]
How can you find it?
Expand the Tag to the block address and compare.
Fully Associative Memory – Addressed by its contents
Address: Block Address (28 bits), Block Offset, Byte Offset
• For practical Hit time, must have parallel comparisons of the Tag and the Block Address
• Only feasible for small number of blocks
[Figure: four entries, each with Tag and Data (Valid bit not shown). The Block Address is compared (=) with every Tag in parallel. The comparison results are combined (+) to give Hit and drive a Mux that selects the Data. The Block Offset selects the Word.]
Hardware not feasible for large Cache
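A software sketch of the fully associative lookup. Hardware compares all tags at the same time; the loop below is only a sequential stand-in for those parallel comparators. The entry layout (a dict with `valid`, `tag` and `words`) and the sample contents are assumptions made for illustration.

```python
def lookup(entries, addr):
    """Return (hit, word) for a 32-bit byte address in a fully associative cache."""
    block_offset = (addr >> 2) & 0x3     # selects the word within the block
    block_addr = addr >> 4               # remaining 28 bits act as the tag
    for entry in entries:                # hardware does these compares in parallel
        if entry["valid"] and entry["tag"] == block_addr:
            return True, entry["words"][block_offset]
    return False, None


entries = [
    {"valid": True,  "tag": 0x01, "words": [10, 11, 12, 13]},
    {"valid": True,  "tag": 0x11, "words": [20, 21, 22, 23]},
    {"valid": False, "tag": 0x00, "words": [0, 0, 0, 0]},
    {"valid": False, "tag": 0x00, "words": [0, 0, 0, 0]},
]

print(lookup(entries, 0x118))   # block address 0x11, block offset 2
print(lookup(entries, 0x000))   # tag 0 matches only invalid entries
```

The first lookup hits and returns word 2 of the second entry. The second misses because the only matching tags belong to invalid entries, which is why the valid bit must take part in the comparison.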
Make sets of Blocks Associative
Two-way set associative
Address: Tag, Index, Block Offset, Byte Offset
[Figure: table addressed by Index, with rows 0, 1, ... 2^k - 1. Each row holds Tag0, Data0, Tag1, Data1 (Valid bit not shown).]
• Addr by Index
• Compare Two Tags in parallel for Hit
Block replacement strategies
For each Index there are 2, 4, ... n options for replacement
Strategies
1. LRU – Least Recently Used
• Replace the block that has been unused for the longest time
2. Random
• Select the block to be replaced randomly
• Implementation
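One possible software sketch of the two strategies for a single set. The class names are hypothetical, and real hardware typically tracks recency with a few status bits per set, not an ordered container. The random policy takes a seeded generator so that runs are repeatable.

```python
import random
from collections import OrderedDict


class LRUSet:
    """One cache set that evicts the least recently used block."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()          # least recently used first

    def access(self, tag):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)     # mark as most recently used
            return True
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)  # drop the least recently used
        self.blocks[tag] = None
        return False


class RandomSet:
    """One cache set that evicts a randomly chosen block."""

    def __init__(self, ways, rng=None):
        self.ways = ways
        self.blocks = []
        self.rng = rng or random.Random(0)   # seeded for repeatability

    def access(self, tag):
        if tag in self.blocks:
            return True
        if len(self.blocks) >= self.ways:
            self.blocks[self.rng.randrange(self.ways)] = tag
        else:
            self.blocks.append(tag)
        return False


lru = LRUSet(ways=2)
print([lru.access(t) for t in "ABACB"])
```

In the two-way LRU example, the access to C evicts B instead of A, because A was touched more recently. The final access to B therefore misses.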
Consider a Two Way Associative Cache with 4 word blocks with size of 8 blocks or 32 words.
Each set holds two entries (Entry 0 and Entry 1).
Cache Address (Set) = (Word Addr / 4) modulo 4

Reference Sequence
Word Address   Block Address   Cache Address (Set)   Hit or Miss
6              1               1                     Miss
7              1               1                     Hit
8              2               2                     Miss
9              2               2                     Hit
68             17              1                     Miss
6              1               1                     Hit
7              1               1                     Hit
8              2               2                     Hit
9              2               2                     Hit
69             17              1                     Hit
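The two-way result can be checked with a small set associative simulator. This sketch assumes an empty cache at the start and LRU replacement. The slide does not say which policy is used, and no replacement is actually triggered by this sequence. The function name is made up.

```python
def simulate_set_associative(word_addrs, words_per_block=4, num_blocks=8, ways=2):
    """Return (word_addr, block_addr, set_index, outcome) for each reference."""
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]     # each list: least recently used first
    trace = []
    for word_addr in word_addrs:
        block_addr = word_addr // words_per_block
        set_index = block_addr % num_sets
        blocks = sets[set_index]
        if block_addr in blocks:
            blocks.remove(block_addr)        # re-appended below as most recent
            outcome = "Hit"
        else:
            if len(blocks) == ways:
                blocks.pop(0)                # evict the least recently used block
            outcome = "Miss"
        blocks.append(block_addr)
        trace.append((word_addr, block_addr, set_index, outcome))
    return trace


for row in simulate_set_associative([6, 7, 8, 9, 68, 6, 7, 8, 9, 69]):
    print("%4d %4d %4d  %s" % row)
```

Blocks 1 and 17 both map to set 1, but the set has room for both, so the second pass hits on every reference. Calling the same function with `ways=1` turns it into a direct mapped cache, where the two blocks keep evicting each other.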