Memory Hierarchy Design
Adapted from CS 252, Spring 2006, UC Berkeley. Copyright (C) 2006 UCB.
Since 1980, CPU has outpaced DRAM ...
[Figure: processor and DRAM performance (1/latency) vs. year, 1980-2000]
CPU: 60% per year (2x in 1.5 years). DRAM: 9% per year (2x in 10 years).
The gap grew 50% per year.
Q. How do architects address this gap?
A. Put smaller, faster "cache" memories between the CPU and DRAM, creating a "memory hierarchy".
1977: DRAM faster than microprocessors
Apple ][ (1977)
Steve Wozniak, Steve Jobs
CPU: 1000 ns DRAM: 400 ns
Levels of the Memory Hierarchy
Level         Capacity     Access time        Cost
Registers     100s bytes   < 10s ns           -
Cache         K bytes      10-100 ns          1-0.1 cents/bit
Main Memory   M bytes      200-500 ns         0.0001-0.00001 cents/bit
Disk          G bytes      10 ms (10^7 ns)    10^-5 - 10^-6 cents/bit
Tape          infinite     sec-min            10^-8 cents/bit

Staging transfer units between levels:
Registers <-> Cache: instr. operands (prog./compiler, 1-8 bytes)
Cache <-> Memory: blocks (cache controller, 8-128 bytes)
Memory <-> Disk: pages (OS, 512 B - 4 KB)
Disk <-> Tape: files (user/operator, MBytes)

Upper levels: faster. Lower levels: larger.
Memory Hierarchy: Apple iMac G5
iMac G5, 1.6 GHz

            Reg       L1 Inst   L1 Data   L2        DRAM      Disk
Size        1K        64K       32K       512K      256M      80G
Latency     1 cyc,    3 cyc,    3 cyc,    11 cyc,   88 cyc,   10^7 cyc,
            0.6 ns    1.9 ns    1.9 ns    6.9 ns    55 ns     12 ms
Managed by  compiler  hardware  hardware  hardware  OS, hardware, application

Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.
Goal: Illusion of large, fast, cheap memory
iMac’s PowerPC 970: All caches on-chip
[Die photo: registers (1K), L1 instruction cache (64K), L1 data cache (32K), and 512K L2, all on one chip]
Memory Hierarchy
• Provides (virtually) unlimited amounts of fast memory.
• Takes advantage of locality.
– Principle of locality (p. 47): Programs tend to reuse data and instructions they have used recently (temporal and spatial).
– Most programs do not access all code or data uniformly.
The Principle of Locality
• The Principle of Locality: Programs access a relatively small portion of the address space at any instant of time.
• Two different types of locality:
– Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
– Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).
• For the last 15 years, hardware has relied on locality for speed.
• Locality is a property of programs that is exploited in machine design (see the sketch below).
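A minimal Python sketch of both kinds of locality (the function names and row-major layout assumption are illustrative, not from the slides; treat this as a model of the access pattern):

```python
# Temporal locality: `total` is reused on every iteration, so it
# stays in a register or the cache.
# Spatial locality: data[i] touches consecutive addresses, so each
# fetched cache block brings in the next few elements "for free".
def sum_array(data):
    total = 0
    for i in range(len(data)):
        total += data[i]
    return total

# Poor spatial locality: traversing a row-major 2-D array by columns
# makes consecutive accesses a whole row apart in memory.
def sum_by_columns(matrix):
    total = 0
    for col in range(len(matrix[0])):
        for row in range(len(matrix)):
            total += matrix[row][col]
    return total
```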
Programs with locality cache well ...
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)
[Figure: memory address (one dot per access) vs. time, with regions labeled "Spatial Locality", "Temporal Locality", and "Bad locality behavior"]
Multi-Level Memory Hierarchy
(Typical values: upper levels are smaller, faster, and more expensive per byte.)
• Goal: to provide a memory system with cost almost as low as the cheapest level of memory and speed almost as fast as the fastest level.
• All data in one level are also found in the level below (the upper level is a subset of the lower level).
Memory Hierarchy: Why Important?
• Microprocessor performance improved 35% per year until 1986, and 55% per year since 1987.
• Memory: 7% per year performance improvement in latency
• Processor-memory performance gap
Memory Hierarchy: Terminology
• Hit: data appears in some block in the upper level (example: Block X).
– Hit rate: the fraction of memory accesses found in the upper level.
– Hit time: time to access the upper level, which consists of RAM access time + time to determine hit/miss.
• Miss: data needs to be retrieved from a block in the lower level (example: Block Y).
– Miss rate = 1 - (hit rate).
– Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor.
• Hit time << miss penalty (500 instructions on the Alpha 21264!)
[Diagram: Block X moving between the processor and upper-level memory; Block Y moving between upper-level and lower-level memory]
Cache Measures
• Hit rate: fraction of accesses found in that level.
– Usually so high that we talk about the miss rate instead.
– Miss-rate fallacy: miss rate can mislead about memory performance, just as MIPS can mislead about CPU performance; average memory access time is the better measure.
• Average memory access time = Hit time + Miss rate x Miss penalty (ns or clocks)
• Miss penalty: time to replace a block from the lower level, including time to deliver it to the CPU.
– Access time: time to reach the lower level = f(latency to lower level)
– Transfer time: time to transfer the block = f(BW between upper and lower levels)
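A small Python sketch of these definitions (the numbers are made-up placeholders, not measurements from the slides):

```python
def miss_penalty(latency_ns, block_bytes, bandwidth_bytes_per_ns):
    """Miss penalty = access time (latency) + transfer time (block / BW)."""
    return latency_ns + block_bytes / bandwidth_bytes_per_ns

def amat(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time_ns + miss_rate * miss_penalty_ns

# Made-up values: 1 ns hit, 2% miss rate, 60 ns lower-level latency,
# 64-byte blocks, 8 bytes/ns of bandwidth between the levels.
penalty = miss_penalty(60, 64, 8)   # 60 + 8 = 68 ns
print(amat(1.0, 0.02, penalty))     # 1 + 0.02 * 68 = 2.36 ns
```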
Cache Performance
Memory stall cycles = Number of misses x Miss penalty
                    = IC x (Misses / Instruction) x Miss penalty
                    = IC x (Memory accesses / Instruction) x Miss rate x Miss penalty
Example
• Assume we have a computer where the clocks per instruction (CPI) is 1.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits?
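A sketch of the arithmetic in Python (the 1.5 memory accesses per instruction comes from 1 instruction fetch plus 0.5 data accesses):

```python
cpi_base          = 1.0        # CPI when every access hits
mem_acc_per_instr = 1.0 + 0.5  # instruction fetch + loads/stores (50%)
miss_rate         = 0.02
miss_penalty      = 25         # clock cycles

# Stall cycles per instruction, then effective CPI with misses.
stalls_per_instr = mem_acc_per_instr * miss_rate * miss_penalty  # 0.75
cpi_real = cpi_base + stalls_per_instr                           # 1.75

# The all-hits machine is faster by the ratio of the CPIs.
print(cpi_real / cpi_base)   # 1.75x faster
```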
4 Questions for Memory Hierarchy
• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level? (Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)
Block Placement
• Direct mapped:
– (block address) mod (number of blocks in cache)
• Fully associative: a block can be placed anywhere in the cache.
• Set associative (sketch below):
– (block address) mod (number of sets in cache)
– n-way set associative: n blocks per set
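A minimal Python sketch of the mapping (the parameter values are illustrative):

```python
def placement(block_address, num_blocks, ways):
    """Which set a block maps to under each placement policy."""
    if ways == num_blocks:            # fully associative: one big set
        return 0
    num_sets = num_blocks // ways     # direct mapped is ways == 1
    return block_address % num_sets

print(placement(12, 8, 1))   # direct mapped: 12 mod 8 = frame 4
print(placement(12, 8, 2))   # 2-way: 12 mod 4 = set 0
print(placement(12, 8, 8))   # fully associative: any frame (set 0)
```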
Block Placement
• The vast majority of processor caches today are direct mapped, two-way set associative, or four-way set associative.
Block Identification
An address is divided into three fields:
Tag (used to check all the blocks in the set) | Index (used to select the set) | Block offset (address of the desired data within the block)
[Figure: DECstation 3100 cache, address showing bit positions 31-0: 16-bit tag (bits 31-16), 14-bit index (bits 15-2) selecting one of 16K entries, 2-bit byte offset (bits 1-0); each entry holds a valid bit, a 16-bit tag, and 32 bits of data; block size = one word]
[Figure: 64 KB cache using 4-word (16-byte) blocks, address showing bit positions 31-0: 16-bit tag (bits 31-16), 12-bit index (bits 15-4) selecting one of 4K entries, 2-bit block offset (bits 3-2) driving a mux that picks one of four 32-bit words, byte offset (bits 1-0); each entry holds a valid bit, a 16-bit tag, and 128 bits of data]
Block Size vs. Miss Rate
• Increasing the block size generally decreases the miss rate:
– Instruction references have better spatial locality.
• Serious problem with just increasing the block size: the miss penalty increases.
– Miss penalty: the time required to fetch the block from the next lower level of the hierarchy and load it into the cache.
– Miss penalty = (latency to the first word) + (transfer time for the rest of the block)
Block Size vs. Miss Rate
[Figure: miss rate (0%-40%) vs. block size (16-256 bytes), one curve per cache size from 1 KB to 256 KB]
Block Identification
• If the total cache size is kept the same, increasing associativity:
– increases the number of blocks per set,
– decreases the size of the index,
– increases the size of the tag.
Block Replacement
• With direct mapped:
– Only one block frame is checked for a hit, and only that block can be replaced.
• With fully associative or set associative:
– Random
– Least-recently used (LRU)
– First in, first out (FIFO)
• See Figure 5.6 on p. 400.
Write Strategy
• Write through
– The information is written both to the block in the cache and to the block in memory.
– Easier to implement; the cache is always clean.
• Write back
– The information is written only to the block in the cache. The modified block is written to main memory only when it is replaced.
– Less memory bandwidth, less power.
• Write through with a write buffer (see below).
What happens on a write?

                                Write-Through                    Write-Back
Policy                          Data written to the cache        Write data only to the cache;
                                block is also written to         update the lower level when a
                                lower-level memory               block falls out of the cache
Debug                           Easy                             Hard
Do read misses produce writes?  No                               Yes
Do repeated writes make it
to the lower level?             Yes                              No

Additional option: let writes to an un-cached address allocate a new cache line ("write allocate").
Write Buffers for Write-Through Caches

[Diagram: Processor -> Cache -> Write Buffer -> Lower Level Memory; the write buffer holds data awaiting write-through to lower-level memory]

Q. Why a write buffer?
A. So the CPU doesn't stall while the write completes.

Q. Why a buffer, why not just one register?
A. Bursts of writes are common.

Q. Are Read After Write (RAW) hazards an issue for the write buffer?
A. Yes! Either drain the buffer before the next read, or check the write buffer and send the read first when there is no conflict.
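A toy Python sketch of the RAW check (the class and method names are invented for illustration; real buffers are small hardware FIFOs, and coalescing repeated writes to the same address is one common design choice):

```python
from collections import OrderedDict

class WriteBuffer:
    """Holds pending write-throughs; reads must check it first (RAW)."""
    def __init__(self):
        self.pending = OrderedDict()   # address -> data, in arrival order

    def write(self, addr, data):
        self.pending[addr] = data      # coalesce repeated writes to addr

    def read(self, addr, memory):
        # RAW hazard check: if the address has a pending write,
        # forward that data instead of reading stale memory.
        if addr in self.pending:
            return self.pending[addr]
        return memory[addr]

    def drain_one(self, memory):
        if self.pending:
            addr, data = self.pending.popitem(last=False)
            memory[addr] = data        # perform the oldest write-through
```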
Write Miss Policy
• Write allocate
– The block is allocated on a write miss, and the write then proceeds as a write hit.
• No-write allocate
– Write misses do not affect the cache (unusual).
– The block is modified only in memory.
Example
• Assume a fully associative write-back cache with many cache entries that starts empty. For the trace below, how many hits and how many misses occur under no-write allocate versus write allocate?
Write Mem[100];
Write Mem[100];
Read Mem[200];
Write Mem[200];
Write Mem[100];
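A small Python simulation of the trace under both policies (a sketch, assuming a cache large enough that nothing is evicted):

```python
def run_trace(trace, write_allocate):
    cached = set()          # addresses currently in the cache
    hits = misses = 0
    for op, addr in trace:
        if addr in cached:
            hits += 1
        else:
            misses += 1
            # Reads always allocate; writes allocate only under
            # the write-allocate policy.
            if op == "read" or write_allocate:
                cached.add(addr)
    return hits, misses

trace = [("write", 100), ("write", 100), ("read", 200),
         ("write", 200), ("write", 100)]

print(run_trace(trace, write_allocate=False))  # (1, 4): 1 hit, 4 misses
print(run_trace(trace, write_allocate=True))   # (3, 2): 3 hits, 2 misses
```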
Write Miss Policy
• Write-back caches normally use write allocate (hoping that subsequent writes to the block will hit in the cache).
• Write-through caches often use no-write allocate (subsequent writes go to memory anyway).
Example
• Assume a cache of 16K blocks (1 block = 1 word) and 32-bit addresses, with byte addressing. Find the total number of sets and the total number of tag bits for caches that are direct mapped, 2-way set associative, 4-way set associative, and fully associative.
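A Python sketch of the calculation (1 word = 4 bytes gives a 2-bit byte offset, leaving 30 bits of block address):

```python
import math

TOTAL_BLOCKS    = 16 * 1024  # 16K one-word blocks
BLOCK_ADDR_BITS = 32 - 2     # 32-bit address minus 2-bit byte offset

for ways in (1, 2, 4, TOTAL_BLOCKS):  # direct mapped ... fully associative
    num_sets   = TOTAL_BLOCKS // ways
    index_bits = int(math.log2(num_sets))      # 0 for fully associative
    tag_bits   = BLOCK_ADDR_BITS - index_bits  # per block
    total_tag  = tag_bits * TOTAL_BLOCKS       # over the whole cache
    print(f"{ways:>5}-way: {num_sets:>5} sets, "
          f"{tag_bits} tag bits/block, {total_tag // 1024}K tag bits total")

# Output: direct mapped 16K sets / 256K tag bits, 2-way 8K sets / 272K,
# 4-way 4K sets / 288K, fully associative 1 set / 480K.
```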
Cache Performance
Average memory access time = Hit time + Miss rate x Miss penalty