CS6461 – Computer Architecture
Fall 2015

Morris Lancaster
Adapted from Professor Stephen Kaisler's Notes

Lecture 4 – Memory Systems

(Some material extracted from slides by Arvind (MIT), Krste Asanovic (MIT/UCB), Joel Emer (Intel/MIT), James Hoe (CMU), John Kubiatowicz (UCB), and David Patterson (UCB))
The Ideal Memory
• Size: infinitely large
• Speed: infinitely fast, e.g., no latency
• Cost: free (well, infinitesimal)
• However, once reality sets in, we realize these features are mutually exclusive and not attainable with the technology available today.
• Tomorrow, however, is another story.
Memory Hierarchy
[Figure: the memory hierarchy – Processor → L1$ → L2$ → Main Memory → Secondary Memory. Access time and the (relative) size of the memory at each level both increase with distance from the processor. Labeled transfer units: 4-64 bytes (word), 16-128 bytes (block), 1 to 8 blocks, 1,024 – 4M bytes (disk sector = page).]

The hierarchy is inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in Main Memory, which is a subset of what is in Secondary Memory (usually disk).
We show only two levels of cache here, but as we will see in later lectures, some processors have three levels of cache.
Note: page sizes may be much larger, e.g., 64 KBytes or even 1 Mbyte.
Memory Hierarchy
• Memory is much slower than the processor.
• Faster memories are more expensive.
• Due to longer decoding time and other factors, larger memories are always slower.
– Therefore, a small but very fast memory (SRAM: the L1 cache) is located very close to the processor.
– The L2 cache is larger and slower, sitting between L1 and main memory.
– Many processors now have an L3 cache (see Multicores).
– Main memory is usually GBs in size and is made up of DRAMs, with access times of tens of nanoseconds.
• Secondary memory is on disks (hard disk, CD, DVD, Flash, SSD), holds hundreds of GBs to TBs, and takes microseconds to access.
• CPU registers are the closest storage to the CPU, but they are not referenced by memory addresses – they have separate identifiers.
Memory History - 0
See @ Computer History Museum, Mountain View, CA
Memory History - I
Information introduced to the memory in the form of electric pulses was transduced into mechanical waves that propagated relatively slowly through a medium (delay-line memory).

A relay is an electrically operated switch. Many relays use an electromagnet to operate a switching mechanism mechanically.
Memory History - II
A drum is a large metal cylinder that is coated on the outside surface with a ferromagnetic recording material.
A vacuum tube is a glass enclosure inside which are an anode, a cathode, and other elements; the enclosure is evacuated.
Memory History - III
The Williams tube depends on an effect called secondary emission. When a dot is drawn on a cathode ray tube, the area of the dot becomes slightly positively charged and the area immediately around it becomes slightly negatively charged, creating a charge well. The charge well remains on the surface of the tube for a fraction of a second, allowing the device to act as a computer memory. The lifetime of the charge well depends on the electrical resistance of the inside of the tube.
Magnetic cores are little ‘donuts’ with three wires passing through them: two select the x and y coordinates, while the third sets or senses the magnetization of the core.
Who invented magnetic cores?
Core Memory
• Core memory was the first large-scale reliable main memory
• Invented by Forrester in the late 40s/early 50s at MIT for the Whirlwind project
• Bits stored as magnetization polarity on small ferrite cores threaded onto a 2-dimensional grid of wires
• Coincident current pulses on the X and Y wires would write a cell and also sense its original state (destructive reads)
• Robust, non-volatile storage
• Core access time ~ 1 µs
Semiconductor (various types)
• DRAM: Dynamic Random Access Memory
– DRAM needs its cells recharged, or given a new charge, every few milliseconds.
• SRAM: Static Random Access Memory
– SRAM does not need recharging: each cell is a bistable circuit that steers current in one of two directions, rather than a storage cell that holds a charge in place.
Modern DRAM
Parity versus Non-Parity
• Parity is an error-detection scheme:
– developed to detect errors in data – principally over communications lines, but also applied to storage.
– By adding a single bit to each byte of data, one can check the integrity of the other 8 bits while the data is transmitted or moved from storage.
• This led to error-correcting codes, which are a topic in themselves.
• Today, memory errors are rare because of the very high quality of the manufacturing process, so most memory is non-parity.
• Parity is still used in some mission-critical systems.
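As a concrete illustration (not from the original slides), here is a minimal C sketch of even-parity generation and checking for a single byte; parity_bit and parity_error are illustrative names:

```c
#include <stdint.h>

/* Compute the even-parity bit for one byte: the XOR of all eight
   data bits. Stored alongside the byte, it lets the receiver or
   memory controller detect any single-bit error. */
static uint8_t parity_bit(uint8_t byte) {
    uint8_t p = 0;
    for (int i = 0; i < 8; i++)
        p ^= (byte >> i) & 1;
    return p;            /* 1 if the byte has an odd number of 1s */
}

/* Check a stored (byte, parity) pair: a nonzero result means a
   single-bit error was detected. An even number of flipped bits
   goes unnoticed -- hence the move to error-correcting codes. */
static int parity_error(uint8_t byte, uint8_t stored_parity) {
    return parity_bit(byte) != stored_parity;
}
```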
Semiconductor Memory Evolution
Year   Technology                 Speed
1970   RAM                        4.77 MHz
1987   Fast-Page Mode DRAM        20 MHz
1995   Extended Data Output       20 MHz
1997   PC66 Synchronous DRAM      66 MHz
1998   PC100 Synchronous DRAM     100 MHz
1999   Rambus DRAM                800 MHz
1999   PC133 Synchronous DRAM     133 MHz
2000   DDR Synchronous DRAM       266 MHz
2002   Enhanced DRAM              450 MHz
2005   DDR2                       660 MHz
2009   DDR3                       800 MHz
And so on …
Memory Interleaving/Banking
• Memory interleaving divides memory into banks.
– Addresses are distributed across the banks – with 4-way interleaving, consecutive addresses rotate through 4 banks.
– Interleaving allows simultaneous access to words in memory if the words are in separate banks (see the sketch below).
– As we will see, this may conflict with caching.
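A minimal sketch of the address-to-bank mapping for the 4-way case (illustrative names; assumes word addressing and a power-of-two bank count):

```c
#include <stdint.h>

#define NUM_BANKS 4u   /* 4-way interleaving; must be a power of two */

/* The low-order word-address bits select the bank; the remaining
   bits select the word within that bank. Words 0,1,2,3 land in
   banks 0,1,2,3; word 4 wraps back to bank 0, and so on, so
   sequential accesses can proceed in all banks simultaneously. */
static unsigned bank_of(uint32_t word_addr)        { return word_addr % NUM_BANKS; }
static uint32_t offset_in_bank(uint32_t word_addr) { return word_addr / NUM_BANKS; }
```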
Basic Memory
• You can think of computer memory as being one big array of data.
– The address serves as an array index.
– Each address refers to one word of data.
• You can read or modify the data at any given memory address, just like you can read or modify the contents of an array at any given index.
• If you’ve worked with pointers in C or C++, then you’ve already worked with memory addresses.
[Figure: a 2^k x n memory block – inputs ADRS (k bits), DATA (n bits), control signals CS and WR; output OUT (n bits).]

CS   WR   Memory operation
0    x    None
1    0    Read selected word
1    1    Write selected word
Basic Memory - II
• The above depicts the main interface to RAM:
– A Chip Select, CS, enables or disables the RAM.
– ADRS specifies the address or location to read from or write to.
– WR selects between reading from or writing to the memory.
• To read from memory, WR should be set to 0.
– OUT will be the n-bit value stored at ADRS.
• To write to memory, we set WR = 1.
– DATA is the n-bit value to save in memory.
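A small C model of this interface (a sketch, not from the slides; k = 10 and n = 32 are arbitrary choices) makes the CS/WR truth table above concrete:

```c
#include <stdint.h>

#define K     10                  /* k address bits -> 2^k words */
#define WORDS (1u << K)

static uint32_t mem[WORDS];       /* the 2^k x n storage array (n = 32 here) */

/* One access, following the truth table:
   CS = 0         -> no operation (OUT modeled as 0)
   CS = 1, WR = 0 -> read:  OUT = word at ADRS
   CS = 1, WR = 1 -> write: word at ADRS = DATA        */
static uint32_t ram_cycle(int cs, int wr, uint32_t adrs, uint32_t data) {
    if (!cs)
        return 0;                 /* chip not selected */
    if (wr) {
        mem[adrs % WORDS] = data; /* write selected word */
        return 0;
    }
    return mem[adrs % WORDS];     /* read selected word */
}
```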
Basic Memory - III
[Figure: internal organization of a memory chip – the N+M address bits are split: a Row Address Decoder drives the word lines (rows, N bits), and a Column Decoder & Sense Amplifiers select among the bit lines (columns, M bits); each memory cell at a row/column intersection stores one bit, and the selected bit appears on the data line D.]

Bits are stored in 2-dimensional arrays on the chip. Modern chips have around 4 logical banks on each chip.
DRAM Packaging
• DIMM (Dual Inline Memory Module) contains multiple chips with clock/control/address signals connected in parallel (sometimes need buffers to drive signals to all chips)
• Data pins work together to return wide word (e.g., 64-bit data bus using 16x4-bit parts)
Memory System Design: Key Ideas
• The Principle of Locality:
– Programs access a relatively small portion of the address space at any instant of time.
• Instructions and data both exhibit spatial and temporal locality:
– Temporal locality: if a particular instruction or data item is used now, there is a good chance that it will be used again in the near future.
– Spatial locality: if a particular instruction or data item is used now, there is a good chance that the instructions or data items located in memory immediately following or preceding it will soon be used.
• Therefore, it is a good idea to move instruction and data items that are expected to be used soon from slow memory to fast memory (cache). Both kinds of locality appear in even a simple loop, as sketched below.
• BUT! This is prediction, and therefore will not always be correct – it depends on the extent of locality.
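An illustrative C fragment (not from the slides):

```c
/* 'sum' and 'i' are reused on every iteration (temporal locality),
   and a[0], a[1], a[2], ... are adjacent in memory (spatial
   locality), so each cache block fetched on a miss supplies the
   next several iterations for free. */
double sum_array(const double *a, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```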
Simple Example
Simple calculation assuming just the application program:
• Assume a 1 GHz processor (1 ns cycle) using 10 ns memory.
– 35% of all executed instructions are loads or stores.
– The application runs 1 billion instructions.
• Straight memory execution time = (1×10^9 + 0.35×10×10^9) × 10^-9 = 4.5 s.
• Assume all instructions and data that are required are stored in a perfect cache that operates within the clock period.
– Execution time with perfect cache = 1 s.
• Now, assume that the cache has a hit rate of 90%.
– Execution time with cache = (1 + 0.35×0.1×10) = 1.35 s.
(The 0.1 comes from the 10% cache misses.)
• Caches are 95-99% successful in having the required instructions, and 75-90% successful for data.
Cache - I
• A cache is a small (size << main memory), fast memory that temporarily holds data and instructions and makes them available to the processor much faster than main memory can.
• Cache space (~MBytes) is smaller than main memory (~GBytes).
• Why do caches succeed in improving performance? LOCALITY!
• Hit: the data appears in some block in the upper level (example: Block X).
– Hit Rate: the fraction of memory accesses found in the upper level.
– Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss.
• Miss: the data needs to be retrieved from a block in the lower level (Block Y).
– Miss Rate = 1 - (Hit Rate)
– Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor.
• Hit Time << Miss Penalty (500 instructions on the Alpha 21264!)
Cache - II
Cache - III
Cache Algorithm (READ)
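The figure for this slide did not survive conversion. As a stand-in, here is a hedged C sketch of the read algorithm for a direct-mapped cache (sizes and names are illustrative; a set-associative cache would add more tag comparisons and a replacement choice):

```c
#include <stdint.h>
#include <string.h>

#define NLINES 256                /* direct-mapped: 256 lines */
#define BLOCK  16                 /* 16-byte blocks           */

typedef struct {
    int      valid;
    uint32_t tag;
    uint8_t  data[BLOCK];
} line_t;

static line_t  cache[NLINES];
static uint8_t memory[1 << 20];   /* 1 MB backing "main memory";
                                     addr is assumed to lie within it */

/* READ: index with the middle address bits and compare the tag.
   On a hit, return the byte; on a miss, first fetch the whole
   block from memory into the selected line (direct-mapped, so
   the old contents are simply overwritten). */
uint8_t cache_read(uint32_t addr) {
    uint32_t index = (addr / BLOCK) % NLINES;
    uint32_t tag   = (addr / BLOCK) / NLINES;
    line_t  *line  = &cache[index];

    if (!line->valid || line->tag != tag) {                    /* MISS */
        memcpy(line->data, &memory[addr & ~(uint32_t)(BLOCK - 1)], BLOCK);
        line->tag   = tag;
        line->valid = 1;
    }
    return line->data[addr % BLOCK];                           /* HIT path */
}
```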
Cache Performance
• Average memory access time = hit time + miss rate × miss penalty.
• To improve performance, reduce memory access time:
=> we need to reduce hit time, miss rate, and miss penalty.
• As L1 caches are in the critical path of instruction execution, hit time is the most important parameter.
• When one parameter is improved, others might suffer.
• Misses:
– Compulsory miss: always occurs on the first access to a block.
– Capacity miss: reduces with an increase in cache size.
– Conflict miss: reduces with the level of associativity.
• Types:
– Instruction or Data Cache: 1-way or 2-way
– Data Cache: write-through & write-back
Cache Topology
• Determines the number and interconnection of caches.
• Early caches focused on instructions, while data were fetched directly from memory.
• Then unified caches, holding both instructions and data, were used.
• Then split caches – one for instructions and one for data.
• Today, either unified or split is used, depending on the processor.
Split vs. Unified Caches
• Advantages of unified caches:
– They balance the load between instruction and data fetches depending on the dynamics of the program execution;
– Design and implementation are cheaper.
• Advantages of split caches (Harvard architectures):
– Competition for the cache between instruction processing (which fetches from the instruction cache) and the execution functional units (which fetch from the data cache) is eliminated;
– Instruction fetch can proceed in parallel with memory access from the execution unit.
Cache Writing
• A Write Buffer is needed between the Cache and Memory:
– Processor: writes data into the cache and the write buffer.
– Memory controller: writes the contents of the buffer to memory.
• The write buffer is just a FIFO:
– Typical number of entries: 4
– Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle.
• The memory system designer’s nightmare:
– Store frequency (w.r.t. time) > 1 / DRAM write cycle.
– Write buffer saturation.

[Figure: Processor → Cache → Write Buffer → DRAM]
Cache Writing - II
Q. Why a write buffer?
A. So the CPU doesn’t stall.

Q. Why a buffer – why not just one register?
A. Bursts of writes are common.

Q. Are Read After Write (RAW) hazards an issue for the write buffer?
A. Yes! Either drain the buffer before the next read, or check the write buffer for a match and only then send the read on.
Note: We will discuss RAW hazards in a future lecture.
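A minimal sketch of such a 4-entry write buffer in C (illustrative; the memory controller’s actual write path is omitted), including the RAW check on the read side:

```c
#include <stdint.h>

#define WB_ENTRIES 4                      /* typical depth, per the slide */

typedef struct { uint32_t addr, data; } wb_entry_t;

static wb_entry_t wb[WB_ENTRIES];
static int wb_head = 0, wb_count = 0;     /* simple FIFO state */

/* CPU side: a store enters the buffer; the CPU stalls only when
   the buffer is full (write buffer saturation). */
static int wb_push(uint32_t addr, uint32_t data) {
    if (wb_count == WB_ENTRIES)
        return 0;                                  /* saturated: stall */
    wb[(wb_head + wb_count) % WB_ENTRIES] = (wb_entry_t){ addr, data };
    wb_count++;
    return 1;
}

/* Read side: to avoid the RAW hazard, check buffered stores (newest
   first) before letting a load go to memory; the alternative is to
   drain the whole buffer before any read. */
static int wb_match(uint32_t addr, uint32_t *data_out) {
    for (int i = wb_count - 1; i >= 0; i--) {
        wb_entry_t *e = &wb[(wb_head + i) % WB_ENTRIES];
        if (e->addr == addr) { *data_out = e->data; return 1; }
    }
    return 0;                       /* no match: safe to read memory */
}
```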
Cache Writing - III
• Write: we need to update the upper cache(s) and main memory whenever a store instruction modifies the L1 cache.
– Write Hit: the item to be modified is in L1.
– Write Through: as if there were no L1, write to L2 as well.
– Write Back: set a Dirty Bit, and update L2 only when the block is replaced.
– Although write-through is a less efficient strategy, most L1s and some upper-level caches follow this approach, so that read hit time is not lengthened by the complicated logic needed to maintain the dirty bit.
• Write Miss: the item to be modified is not in L1.
– Write allocate: exploit locality, and bring the block into L1.
– Write no-allocate: do not fetch the missing block.
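A hedged C sketch contrasting the two write-hit policies (line_t and next_level are illustrative stand-ins for the L1 line and the next lower level of the hierarchy):

```c
#include <stdint.h>

typedef struct { int valid, dirty; uint32_t tag, data; } line_t;

static uint32_t next_level[1 << 16];   /* stand-in for L2/main memory */

/* Write HIT under the two policies:
   write-through: update the line AND the next level immediately
                  (in practice via the write buffer);
   write-back:    update only the line and set the dirty bit; the
                  next level is updated when the line is replaced. */
static void write_hit(line_t *line, uint32_t addr, uint32_t value,
                      int write_through) {
    line->data = value;
    if (write_through)
        next_level[addr & 0xFFFF] = value;
    else
        line->dirty = 1;
}
```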
Cache Writing - IV
Cache Organization: Summary
Direct-Mapped Cache
• A block can be placed in one location only, given by:
(Block address) MOD (Number of blocks in cache)
Direct Mapped Cache - II
[Figure: a direct-mapped cache with 8 cache block frames (indices 000–111) in front of a memory of 32 cacheable blocks; memory blocks whose addresses end in the same 3 bits (e.g., 00101, 01101, 10101, 11101) all map to the same cache frame (101).]

A block can be placed in one location only, given by: (Block address) MOD (Number of blocks in cache). In this case: (Block address) MOD (8). For example, (11101)₂ MOD (1000)₂ = (101)₂, i.e., 29 MOD 8 = 5.
Direct Mapped Cache - III
• A memory block is mapped into a unique cache line, depending on the memory address of the respective block.
• A memory address is considered to be composed of three fields:
– the least significant bits (2 in our example) identify the byte within the block [assume four bytes/block];
– the rest of the address (22 bits in our example) identifies the block in main memory; for the cache logic, this part is interpreted as two fields:
– 2a. the least significant bits (14 in our example) specify the cache line;
– 2b. the most significant bits (8 in our example) represent the tag, which is stored in the cache together with the line.
• Tags are stored in the cache in order to distinguish among blocks that fit into the same cache line.
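With the field widths from this example (2 offset bits, 14 index bits, 8 tag bits in a 24-bit address), the cache logic’s view of an address can be sketched in C as:

```c
#include <stdint.h>

#define OFFSET_BITS 2    /* 4 bytes per block */
#define INDEX_BITS  14   /* 2^14 cache lines  */

static uint32_t byte_offset(uint32_t addr) { return addr & 0x3; }
static uint32_t line_index(uint32_t addr)  { return (addr >> OFFSET_BITS) & 0x3FFF; }
static uint32_t tag_field(uint32_t addr)   { return addr >> (OFFSET_BITS + INDEX_BITS); }

/* Equivalently, the index is (block address) MOD (number of lines):
   line_index(addr) == (addr / 4) % (1 << INDEX_BITS). */
```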
Direct Mapped Cache - IV
• Advantages:
– simple and cheap;
– the tag field is short: only those bits that are not used to address the cache have to be stored (compare with the following approaches);
– access is very fast.
• Disadvantages:
– a given block fits into only one fixed cache location;
– a given cache line will be replaced whenever there is a reference to another memory block that maps to the same line, regardless of the status of the other cache lines;
– this can produce a low hit ratio, even if only a very small part of the cache is effectively used.
2-Way Associative Cache
• A block can be placed in a restricted set of places, or cache block frames.
– A set is a group of block frames in the cache.
– A block is first mapped onto a set, and then it can be placed anywhere within that set.
– The set in this case is chosen by: (Block address) MOD (Number of sets in cache)
Two-Way Set Associative Cache - II
• Location 0 can be occupied by data from:
– memory locations 0, 2, 4, 6, 8, ... etc.
– in general: any memory location whose LSB of the address is 0.
– Address<0> => cache index.
• On a miss, the block will be placed in one of the two cache lines belonging to the set that corresponds to the 13-bit index field in the memory address.
• The replacement algorithm decides which line to use.
• A memory block is mapped into any of the lines of a set:
– the set is determined by the memory address, but the line inside the set can be any one.
Two-Way, Set Associative Cache - III
• Several tags (corresponding to all lines in the set) have to be checked in order to determine whether we have a hit or a miss. If we have a hit, the cache logic finally points to the actual line in the cache.
• The number of lines in a set is determined by the designer:
– 2 lines/set: two-way set-associative mapping;
– 4 lines/set: four-way set-associative mapping.
• Set-associative mapping keeps most of the advantages of direct mapping:
– short tag field;
– fast access;
– relatively simple.
Two-Way Set Associative Cache - IV
• Set-associative mapping tries to eliminate the main shortcoming of direct mapping:
– a certain flexibility is gained concerning the line to be replaced when a new block is read into the cache.
• Cache hardware is more complex for set-associative mapping than for direct mapping.
• In practice, 2- and 4-way set-associative mapping are used, with very good results.
• Larger sets do not produce further significant performance improvement.
– Interesting thesis topic: is this true for multicore architectures?
Fully Associative Cache

• A block can be placed anywhere in the cache.
• Lookup hardware for many tags can be large and slow.
Cache Replacement Strategy
• How to choose the block to replace on a cache miss?
– Easy for Direct Mapped: there is no choice.
– Set Associative or Fully Associative:
• Random
• LRU (Least Recently Used)
• FIFO (First In First Out)
• LFU (Least Frequently Used)
• LRU is the most effective: relatively simple to implement and good results (see the sketch below).
• FIFO is simple to implement.
• Random replacement is the simplest to implement, and its results are surprisingly good.
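As referenced above, a sketch of how cheap LRU is for a two-way set-associative cache (illustrative C; a single bit per set suffices):

```c
#include <stdint.h>

#define NSETS 128

typedef struct { int valid; uint32_t tag; } way_t;

static way_t ways[NSETS][2];
static int   lru[NSETS];   /* per set: which way was used least recently */

/* On a hit in way w, the OTHER way becomes least recently used.
   On a miss, evict the way the LRU bit points to and mark the
   freshly filled way as most recently used. Returns 1 on a hit. */
static int access_set(uint32_t set, uint32_t tag) {
    for (int w = 0; w < 2; w++) {
        if (ways[set][w].valid && ways[set][w].tag == tag) {
            lru[set] = 1 - w;                    /* hit */
            return 1;
        }
    }
    int victim = lru[set];                       /* miss: replace LRU way */
    ways[set][victim].tag   = tag;
    ways[set][victim].valid = 1;
    lru[set] = 1 - victim;
    return 0;
}
```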
Comparison by Size
• Associativity:      2-way            4-way            8-way
  Size        LRU    Random    LRU    Random    LRU    Random
  16 KB       5.2%   5.7%      4.7%   5.3%      4.4%   5.0%
  64 KB       1.9%   2.0%      1.5%   1.7%      1.4%   1.5%
  256 KB      1.15%  1.17%     1.13%  1.13%     1.12%  1.12%
• So, what does this tell us?
• Replacement policy matters less and less as the cache grows – at 256 KB, LRU and random are essentially tied – and each further increase in size or associativity buys a smaller improvement.
Cache Performance Revisited
• Suppose a processor executes at:
– Clock rate = 1000 MHz (1 ns per cycle)
– CPI = 1.0
– 50% arithmetic/logic, 30% load/store, 20% control
• Suppose that 10% of memory operations incur a 100-cycle miss penalty.
• CPI = ideal CPI + average stall cycles per instruction
= 1.0 (cycle) + 0.30 (data operations/instruction) × 0.10 (misses/data operation) × 100 (cycles/miss)
= 1.0 cycle + 3.0 cycles
= 4.0 cycles
• 75% of the time the processor is stalled waiting for memory!
• A 1% instruction miss rate would add an additional 1.0 cycle to the CPI!
Lower Miss Rate: 16 KB D/I or 32 KB Unified Cache?
• Assume a hit takes 1 clock cycle and the miss penalty is 50 cycles.
• Assume a load or store takes 1 extra clock cycle on a unified cache, since there is only one cache port.
• Assume 75% of memory accesses are instruction references.
• Overall miss rate for the split cache:
– (75% × 0.64%) + (25% × 6.47%) = 2.10%
• Average memory access time (split):
– = 75% × (1 + 0.64% × 50) + 25% × (1 + 6.47% × 50) = 0.990 + 1.059 = 2.05
• Average memory access time (unified, miss rate 1.99%):
– = 75% × (1 + 1.99% × 50) + 25% × (1 + 1 + 1.99% × 50) = 1.496 + 0.749 = 2.24
Cache Addressing
• Access caches by virtual address or physical address
Cache Addressing
• A cache can use the physical or the virtual address – for the index, the tag, or both:

                    Tag: Physical                    Tag: Virtual
Index: Physical     PP – must translate the          PV – must translate the address
                    address first                    first; for complex, unusual systems
Index: Virtual      VP – operates concurrently       VV – very fast
                    with the MMU; the MMU
                    checks the tag
Summary: Cache Issues
• How many caches?
• Write-Through or Write-Back?
• Direct-Mapped or Set-Associative?
• How do we determine on a read whether we have a miss or a hit?
• If there is a miss and there is no room for a new slot in the cache, which information should be replaced?
• How do we preserve consistency between the cache and main memory on a write?
• Replacement strategy?
Cache Optimizations
(After Henk Corperaal, www.ics.ele.tue.nl/~heco/courses/aca – good topics for a term paper!)

• Reducing hit time
– Small and simple caches
– Way prediction
– Trace caches
• Increasing cache bandwidth
– Pipelined caches
– Multibanked caches
– Nonblocking caches
• Reducing miss penalty
– Critical word first
– Merging write buffers
• Reducing miss rate
– Compiler optimizations
• Reducing miss penalty or miss rate via parallelism
– Hardware prefetching
– Compiler prefetching
Cache Optimizations - I
1. Fast Hit via Small and Simple Caches
Indexing the tag memory and then comparing tags takes time.

A small cache is faster. Also, an L2 cache small enough to fit on the chip with the processor avoids the time penalty of going off chip.

Simple direct mapping helps: the tag check can be overlapped with data transmission, since there is no choice of way.
Cache Optimizations - II
2. Fast Hit via Way Prediction
Makes set-associative caches faster.

Keep extra bits in the cache to predict the “way,” or block within the set, of the next cache access.

The multiplexor is set early to select the desired block, and only 1 tag comparison is performed.

On a misprediction, check the other blocks for matches in the next clock cycle.

Accuracy ~85%.

Drawback: pipeline design is harder, since a hit can now take 1 or 2 cycles.
Cache Optimizations - III
3. Fast Hit via Trace Cache
Key idea: pack multiple non-contiguous basic blocks into one contiguous trace cache line.

A single fetch brings in multiple basic blocks.

The trace cache is indexed by the start address and the next n branch predictions.

+ Better utilization of long blocks (execution doesn’t exit in the middle of a block or enter at a label in the middle of a block).
- Complicated address mapping, since addresses are no longer aligned to power-of-2 multiples of the word size.
- Instructions may appear multiple times in multiple dynamic traces due to different branch outcomes.
Cache Optimizations - IV
4. Increase Cache Bandwidth by Pipelining
Pipeline the cache access to maintain bandwidth, at the cost of higher latency.

Number of instruction cache access pipeline stages:
1: Pentium
2: Pentium Pro through Pentium III
4: Pentium 4

- Greater penalty on mispredicted branches.
- More clock cycles between the issue of a load and the use of its data.
Cache Optimizations - V
5. Increasing Cache Bandwidth: Non-blocking cache or lockup-free cache
Allow the data cache to continue to supply cache hits during a miss; requires an out-of-order execution CPU.

“Hit under miss” reduces the effective miss penalty by continuing to work during a miss.

“Hit under multiple miss” or “miss under miss” may further lower the effective miss penalty by overlapping multiple misses:
- Requires that the memory system can service multiple misses.
- Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses.
- Requires multiple memory banks (otherwise multiple misses cannot be supported).

The Pentium Pro allows 4 outstanding memory misses.
Cache Optimizations - VI
6. Increase Cache Bandwidth via Multiple Banks
Divide the cache into independent banks that can support simultaneous accesses.
E.g., the T1 (“Niagara”) L2 has 4 banks.

Banking works best when the accesses naturally spread themselves across the banks, so the mapping of addresses to banks affects the behavior of the memory system.

A simple mapping that works well is “sequential interleaving”: spread block addresses sequentially across the banks.
E.g., with 4 banks, bank 0 has all blocks whose address % 4 = 0; bank 1 has all blocks whose address % 4 = 1; …
Cache Optimizations - VII
7. Early Restart/Critical Word First to reduce miss penalty
Don’t wait for full block to be loaded before restarting CPUEarly restart—As soon as the requested word of the block arrives,
send it to the CPU and continue
Critical Word First—Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue while filling the rest of the words in the block
Generally useful only when blocks are large
Cache Optimizations - VIII
8. Merging Write Buffer to Reduce Miss Penalty

The write buffer allows the processor to continue while waiting for writes to memory to complete. Merging lets, e.g., four writes to the same block be combined into one buffer entry rather than occupying four separate entries, resulting in less frequent write-backs.
Cache Optimizations - IX
9. Reducing Misses by Compiler Optimizations

Instructions:
– Reorder procedures in memory so as to reduce conflict misses.
– Use profiling to look at conflicts (using developed tools).

Data (sketches of each follow on the next slides):
– Merging arrays: improve spatial locality by using a single array of compound elements instead of 2 separate arrays.
– Loop interchange: change the nesting of loops to access data in the order it is stored in memory.
– Loop fusion: combine 2 independent loops that have the same looping structure and some overlapping variables.
– Blocking: improve temporal locality by accessing “blocks” of data repeatedly instead of going down whole columns or rows.
Example: Merging Arrays
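The code for this example was lost in conversion; the following is a sketch of the classic transformation (after Hennessy & Patterson; names and sizes are illustrative):

```c
#define SIZE 10000

/* Before: two parallel arrays. key[i] and val[i] live far apart in
   memory, so touching both for the same i can cost two misses. */
int val[SIZE];
int key[SIZE];

/* After: one array of records. The key and value for the same i sit
   in the same cache block, improving spatial locality. */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];
```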
Example: Loop Interchange
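The example itself was lost in conversion; a sketch of the standard one (after Hennessy & Patterson; the 5000 × 100 shape is illustrative):

```c
#define ROWS 5000
#define COLS 100
static double x[ROWS][COLS];

/* Before: the inner loop walks DOWN a column, striding COLS doubles
   between accesses -- a different cache block nearly every time. */
void before(void) {
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2.0 * x[i][j];
}

/* After interchange: the inner loop walks ALONG a row, in the order
   the data is stored (row-major in C), so every word of each fetched
   block is used before moving on. */
void after(void) {
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2.0 * x[i][j];
}
```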
Example: Loop Fusion
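Again the example was lost in conversion; a sketch of the classic one (after Hennessy & Patterson; sizes illustrative):

```c
#define N 512
static double a[N][N], b[N][N], c[N][N], d[N][N];

/* Before: two independent loop nests with the same bounds. a[i][j]
   and c[i][j] are touched by both, but may be evicted between them. */
void before(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0 / b[i][j] * c[i][j];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            d[i][j] = a[i][j] + c[i][j];
}

/* After fusion: both statements execute per iteration, so a[i][j]
   and c[i][j] are reused while still cached (temporal locality). */
void after(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = 1.0 / b[i][j] * c[i][j];
            d[i][j] = a[i][j] + c[i][j];
        }
}
```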
Blocking Applied to Array Multiplication
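The code and figures for this example were lost in conversion; a sketch of blocked matrix multiplication (after Hennessy & Patterson; N and the blocking factor B are illustrative):

```c
#define N 512
#define B 24    /* blocking factor; cf. the Lam et al. result below */
static double x[N][N], y[N][N], z[N][N];   /* x starts zeroed (static) */

/* Compute x = y * z one B x B tile at a time, so the pieces of y and
   z being reused stay resident in the cache instead of streaming
   whole rows and columns through it (temporal locality). */
void matmul_blocked(void) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B && j < N; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B && k < N; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;   /* accumulate partial sums across kk tiles */
                }
}
```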
• Conflict misses in caches vs. blocking size:
– Lam et al. [1991] found that a blocking factor of 24 had one-fifth the misses of a factor of 48, even though both fit in the cache.
Cache Optimizations - X
10. Reducing Cache Misses by Hardware Prefetching

Use extra memory bandwidth (if available).

Instruction prefetching:
– Typically, the CPU fetches 2 blocks on a miss: the requested block and the next consecutive block.
– The requested block is placed in the instruction cache when it returns, and the prefetched block is placed into an instruction stream buffer.

Data prefetching:
– The Pentium 4 can prefetch data into the L2 cache from up to 8 streams from 8 different 4 KB pages.
– Prefetching is invoked if there are 2 successive L2 cache misses to a page and the distance between those cache blocks is < 256 bytes.
Cache Optimizations - XI
11. Reducing Cache Misses by Software-Controlled Prefetching

Data prefetch:
– Load data into a register (HP PA-RISC loads).
– Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v9).
– Special prefetching instructions cannot cause faults; they are a form of speculative execution.

Issuing prefetch instructions takes time:
– Is the cost of issuing prefetches < the savings from reduced misses?
– A wider superscalar reduces the difficulty of finding issue bandwidth for them.
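As a concrete (compiler-specific) illustration of cache prefetching from software: GCC and Clang expose it as __builtin_prefetch(addr, rw, locality); the loop below, with an illustrative prefetch distance, shows what compiler-inserted prefetches might look like written by hand:

```c
/* rw = 0 requests a read prefetch; locality = 3 asks to keep the
   data in all cache levels. Prefetch instructions do not fault on
   mainstream ISAs, so running a little past the end of 'a' is
   typically harmless. */
double sum_with_prefetch(const double *a, int n) {
    const int DIST = 16;             /* elements ahead; tune per machine */
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + DIST], 0, 3);
        sum += a[i];
    }
    return sum;
}
```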
Cache Optimizations - Summary