Slide 1
Hitting the Memory Wall
Memory density and capacity have grown along with the CPU power and complexity, but memory speed has not kept pace.
[Figure: relative performance of processor and memory vs. calendar year, 1980-2010, on a logarithmic scale (1 to 10^6); the processor curve rises far faster than the memory curve.]
Slide 2
The Need for a Memory Hierarchy
The widening speed gap between CPU and main memory
Processor operations take on the order of 1 ns
Memory access requires 10s or even 100s of ns
Memory bandwidth limits the instruction execution rate
Each instruction executed involves at least one memory access
Hence, a few to 100s of MIPS is the best that can be achieved
A fast buffer memory can help bridge the CPU-memory gap
The fastest memories are expensive and thus not very large
A second (third?) intermediate cache level is thus often used
Slide 3
Typical Levels in a Hierarchical Memory
Names and key characteristics of levels in a memory hierarchy.
Level        Capacity   Access latency   Cost per GB
Reg's        100s B     ns               $Millions
Cache 1      10s KB     a few ns         $100s Ks
Cache 2      MBs        10s ns           $10s Ks
Main         100s MB    100s ns          $1000s
Secondary    10s GB     10s ms           $10s
Tertiary     TBs        min+             $1s
(Speed gap: access latency jumps by several orders of magnitude across the hierarchy.)
Data movement in a memory hierarchy.
[Figure: data moves between adjacent levels of the memory hierarchy in different units - words between registers and cache (transferred explicitly via load/store), cache lines between cache and main memory (transferred automatically upon a cache miss), and pages between main memory and virtual memory (transferred automatically upon a page fault).]
Cache memory: provides illusion of very high speed
Virtual memory: provides illusion of very large size
Main memory: reasonable cost, but slow & small
Slide 4
Slide 5
The Need for a Cache
Cache memories act as intermediaries between the superfast processor and the much slower main memory.
[Figure: two placements of the level-2 cache between the CPU (with its registers and level-1 cache) and main memory: (a) level 2 between level 1 and main memory; (b) level 2 connected to a "backside" bus.]
One level of cache with hit rate h
Ceff = hCfast + (1 – h)(Cslow + Cfast) = Cfast + (1 – h)Cslow
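As a quick check of the one-level formula above, here is a minimal Python sketch; the hit rate and cycle values used in the example call are illustrative assumptions, not numbers from the slides.

```python
# Minimal sketch of the one-level effective access time formula:
#   Ceff = Cfast + (1 - h) * Cslow
# The example values below (1-cycle cache, 100-cycle memory, 95% hit rate)
# are illustrative assumptions.

def effective_access_time(h, c_fast, c_slow):
    """Effective access time of a single cache level with hit rate h."""
    return c_fast + (1 - h) * c_slow

print(effective_access_time(0.95, 1, 100))   # 1 + 0.05 * 100 = 6.0 cycles
```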
Slide 6
Performance of a Two-Level Cache System
Example
• CPU with CPI_execution = 1.1 running at a clock rate of 500 MHz
• 1.3 memory accesses per instruction
• L1 cache operates at 500 MHz with a miss rate of 5%
• L2 cache operates at 250 MHz with a local miss rate of 40% (T2 = 2 cycles)
• Memory access penalty M = 100 cycles. Find CPI.
CPI = CPI_execution + memory stall cycles per instruction
With no cache: CPI = 1.1 + 1.3 x 100 = 131.1
With a single L1: CPI = 1.1 + 1.3 x 0.05 x 100 = 7.6
Memory stall cycles per instruction = memory accesses per instruction x stall cycles per access
Stall cycles per memory access = (1 - H1) x H2 x T2 + (1 - H1)(1 - H2) x M
  = 0.05 x 0.6 x 2 + 0.05 x 0.4 x 100 = 0.06 + 2 = 2.06
Memory stall cycles per instruction = 2.06 x 1.3 = 2.678
CPI = 1.1 + 2.678 = 3.778
Speedup over the single-L1 case = 7.6 / 3.778 = 2
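The same arithmetic, laid out as a short Python sketch; the values come from the example above, the variable names are mine.

```python
# Reproduces the two-level cache CPI example above.
cpi_execution = 1.1          # base CPI
accesses_per_instr = 1.3     # memory accesses per instruction
miss_rate_l1 = 0.05          # 1 - H1
local_miss_rate_l2 = 0.40    # 1 - H2 (local)
t2 = 2                       # L2 hit time in cycles
m = 100                      # main-memory access penalty in cycles

# Stall cycles per memory access: (1-H1)*H2*T2 + (1-H1)*(1-H2)*M
stall_per_access = (miss_rate_l1 * (1 - local_miss_rate_l2) * t2
                    + miss_rate_l1 * local_miss_rate_l2 * m)
stall_per_instr = stall_per_access * accesses_per_instr
cpi = cpi_execution + stall_per_instr
cpi_l1_only = cpi_execution + accesses_per_instr * miss_rate_l1 * m

print(stall_per_access)      # 2.06
print(cpi)                   # 3.778
print(cpi_l1_only / cpi)     # speedup over L1-only, about 2
```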
Slide 7
Cache Memory Design Parameters (assuming a single cache level)
Cache size (in bytes or words). A larger cache can hold more of the program’s useful data but is more costly and likely to be slower.
Block or cache-line size (unit of data transfer between cache and main memory). With a larger cache line, more data is brought into the cache on each miss. This can improve the hit rate but may also bring in low-utility data.
Placement policy. Determining where an incoming cache line is stored. More flexible policies imply higher hardware cost and may or may not have performance benefits (due to more complex data location).
Replacement policy. Determining which of several existing cache blocks (into which a new cache line can be mapped) should be overwritten. Typical policies: choosing a random or the least recently used block.
Write policy. Determining if updates to cache words are immediately forwarded to main (write-through) or modified blocks are copied back to main if and when they must be replaced (write-back or copy-back).
Slide 8
What Makes a Cache Work?
Assuming no conflict in address mapping, the cache will hold a small program loop in its entirety, leading to fast execution.
[Figure: a 9-instruction program loop held entirely in the cache. Address mapping from main memory to cache is many-to-one; a cache line/block is the unit of transfer between the main and cache memories. The loop exhibits both temporal and spatial locality.]
Slide 9
Temporal and Spatial Localities
[Figure: memory accesses plotted as address vs. time, showing clustering along both axes. From Peter Denning's CACM paper, July 2005 (Vol. 48, No. 7, pp. 19-24).]
Temporal: Accesses to the same address are typically clustered in time
Spatial:When a location is accessed, nearby locations tend to be accessed also
Slide 10
Desktop, Drawer, and File Cabinet Analogy
Items on a desktop (register) or in a drawer (cache) are more readily accessible than those in a file cabinet (main memory).
[Figure: register file = desktop (access in 2 s), cache memory = drawer (access in 5 s), main memory = file cabinet (access in 30 s).]
Once the “working set” is in the drawer, very few trips to the file cabinet are needed.
Slide 11
Caching Benefits Related to Amdahl’s Law
Example
In the drawer & file cabinet analogy, assume a hit rate h in the drawer. Formulate the situation shown in previous figure in terms of Amdahl’s law.
Solution
Without the drawer, a document is accessed in 30 s. So fetching 1000 documents, say, would take 30 000 s. The drawer causes a fraction h of the cases to be handled 6 times as fast, with access time unchanged for the remaining 1 - h. Speedup is thus 1/(1 - h + h/6) = 6/(6 - 5h). Improving the drawer access time can increase the speedup factor, but as long as the miss rate remains at 1 - h, the speedup can never exceed 1/(1 - h). Given h = 0.9, for instance, the speedup is 4, with the upper bound being 10 for an extremely short drawer access time.
Note: Some would place everything on their desktop, thinking that this yields even greater speedup. This strategy is not recommended!
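A minimal Python sketch of this speedup calculation; the 30 s and 5 s access times and the hit rate h = 0.9 are the example's values.

```python
# Amdahl's-law speedup for the desktop/drawer/file-cabinet analogy.
def drawer_speedup(h, cabinet_time=30.0, drawer_time=5.0):
    """Speedup when a fraction h of accesses hit the drawer instead of the cabinet."""
    avg_time = h * drawer_time + (1 - h) * cabinet_time
    return cabinet_time / avg_time

print(drawer_speedup(0.9))   # 30 / (0.9*5 + 0.1*30) = 4.0
print(1 / (1 - 0.9))         # upper bound for h = 0.9, even with a 0 s drawer: 10.0
```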
Slide 12
Compulsory, Capacity, and Conflict Misses
Compulsory misses: With on-demand fetching, first access to any item is a miss. Some “compulsory” misses can be avoided by prefetching.
Capacity misses: We have to oust some items to make room for others. This leads to misses that are not incurred with an infinitely large cache.
Conflict misses: Occasionally, there is free room, or space occupied by useless data, but the mapping/placement scheme forces us to displace useful items to bring in other items. This may lead to misses in the future.
Given a fixed-size cache, dictated, e.g., by cost factors or availability of space on the processor chip, compulsory and capacity misses are pretty much fixed. Conflict misses, on the other hand, are influenced by the data mapping scheme which is under our control.
We study two popular mapping schemes: direct and set-associative.
Slide 13
Direct-Mapped Cache
Direct-mapped cache holding 32 words within eight 4-word lines. Each line is associated with a tag and a valid bit.
[Figure: the word address splits into a tag, a 3-bit line index in the cache, and a 2-bit word offset in the line. Main memory locations (0-3, 4-7, 8-11, ..., 32-35, 36-39, 40-43, 64-67, 68-71, 72-75, 96-99, 100-103, 104-107, ...) map many-to-one onto the eight cache lines, each of which stores a valid bit and a tag. On an access, the tag and the specified word are read out; the stored tag is compared with the address tag (1 if equal), and a mismatch or an invalid line signals a cache miss before data is driven out.]
Slide 14
Accessing a Direct-Mapped Cache
Example 1
Components of the 32-bit address in an example direct-mapped cache with byte addressing.
Show cache addressing for a byte-addressable memory with 32-bit addresses. Cache line width 2^W = 16 B. Cache size 2^L = 4096 lines (64 KB).
Solution
Byte offset in line is log2(16) = 4 bits. Cache line index is log2(4096) = 12 bits.
This leaves 32 - 12 - 4 = 16 bits for the tag.
32-bit address breakdown: 16-bit line tag | 12-bit line index in cache | 4-bit byte offset in line; the line index and byte offset together form the byte address in the cache.
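A minimal Python sketch of this address decomposition for the example cache (4096 lines of 16 B each); the helper name and the sample address are illustrative.

```python
# Split a 32-bit byte address into tag / line index / byte offset
# for the direct-mapped cache of Example 1 above.
OFFSET_BITS = 4    # log2(16 B per line)
INDEX_BITS = 12    # log2(4096 lines)

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)   # remaining 16 bits
    return tag, index, offset

tag, index, offset = split_address(0x1234ABCD)
print(hex(tag), hex(index), hex(offset))       # 0x1234 0xabc 0xd
```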
Slide 15
1 KB Direct Mapped Cache, 32 B blocks
• For a 2^N byte cache:
– The uppermost (32 - N) bits are always the Cache Tag
– The lowest M bits are the Byte Select (Block Size = 2^M)
[Figure: the 32-bit address divides into a Cache Tag (e.g., 0x50), a Cache Index (e.g., 0x01) that selects one of the 32 lines, and a Byte Select (e.g., 0x00) within the 32-byte block. Each line stores a Valid Bit and the Cache Tag as part of the cache "state", plus 32 bytes of Cache Data (Byte 0 ... Byte 31 in line 0, Byte 32 ... Byte 63 in line 1, ..., Byte 992 ... Byte 1023 in line 31).]
Example 2
Slide 16
Set-Associative Cache
Two-way set-associative cache holding 32 words of data within 4-word lines and 2-line sets.
[Figure: main memory locations (0-3, 16-19, 32-35, 48-51, 64-67, 80-83, 96-99, 112-115, ...) map onto the cache's 2-line sets. The word address splits into a tag, a 2-bit set index in the cache, and a 2-bit word offset in the line; each line stores a valid bit and a tag. On an access, the tag and the specified word are read from each of the set's two options (option 0 and option 1); both stored tags are compared with the address tag (1 if equal), and the matching option's data is driven out, or a cache miss is signaled if neither valid line matches.]
Slide 17
Accessing a Set-Associative Cache
Example 1
Components of the 32-bit address in an example two-way set-associative cache.
Show the cache addressing scheme for a byte-addressable memory with 32-bit addresses. Cache line width 2^W = 16 B. Set size 2^S = 2 lines. Cache size 2^L = 4096 lines (64 KB).
Solution
Byte offset in line is log2(16) = 4 bits. Cache set index is log2(4096/2) = 11 bits.
This leaves 32 - 11 - 4 = 17 bits for the tag.
32-bit address breakdown: 17-bit line tag | 11-bit set index in cache | 4-bit byte offset in line. The set index is the address in the cache used to read out the two candidate items and their control info.
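A short Python sketch deriving these field widths from the cache parameters; names are illustrative, and the parameter values are those of Example 1.

```python
# Derive the set-associative address-field widths used in Example 1.
from math import log2

line_bytes = 16       # 2^W bytes per line
ways = 2              # 2^S lines per set
total_lines = 4096    # 2^L lines in the cache
addr_bits = 32

offset_bits = int(log2(line_bytes))                   # 4
set_index_bits = int(log2(total_lines // ways))       # log2(2048) = 11
tag_bits = addr_bits - set_index_bits - offset_bits   # 17

print(offset_bits, set_index_bits, tag_bits)          # 4 11 17
```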
Slide 18
Two-way Set Associative Cache
• N-way set associative: N entries for each Cache Index
– N direct-mapped caches operate in parallel (N is typically 2 to 4)
• Example: Two-way set associative cache
– Cache Index selects a "set" from the cache
– The two tags in the set are compared in parallel
– Data is selected based on the tag result (see the sketch at the end of this slide)
[Figure: two-way set-associative organization. The Cache Index selects one entry from each way; each way holds a Valid bit, a Cache Tag, and Cache Data (Cache Block 0, ...). The address tag (Adr Tag) is compared against both stored tags in parallel; the compare results are ORed to form Hit, and Sel1/Sel0 drive a mux that selects the hitting way's Cache Block.]
Example 2
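A minimal Python sketch of the two-way lookup just described: the set index has already selected one entry from each way, both tags are compared "in parallel", and the matching way supplies the data. The class structure and field names are illustrative, not a hardware model.

```python
# One set of a two-way set-associative cache: parallel tag compare + select.
class TwoWaySet:
    def __init__(self):
        # one (valid, tag, block) entry per way
        self.ways = [{"valid": False, "tag": None, "block": None} for _ in range(2)]

    def lookup(self, tag):
        # compare the address tag against both stored tags ("in parallel")
        hits = [w["valid"] and w["tag"] == tag for w in self.ways]
        if any(hits):
            return self.ways[hits.index(True)]["block"]   # mux select by hit signal
        return None                                       # cache miss

s = TwoWaySet()
s.ways[1] = {"valid": True, "tag": 0x50, "block": b"example data"}
print(s.lookup(0x50))   # b'example data'
print(s.lookup(0x51))   # None (miss)
```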
Slide 19
Disadvantage of Set Associative Cache
• N-way Set Associative Cache vs. Direct Mapped Cache:
– N comparators vs. 1
– Extra MUX delay for the data
– Data comes AFTER Hit/Miss
• In a direct-mapped cache, the Cache Block is available BEFORE Hit/Miss:
– Possible to assume a hit and continue; recover later if it was a miss.
Advantage of Set Associative Cache
• Improves cache performance by reducing conflict misses
• In practice, the degree of associativity is often kept at 4 or 8
Slide 20
Effect of Associativity on Cache Performance
Performance improvement of caches with increased associativity.
[Figure: miss rate (0 to 0.3) vs. associativity (direct-mapped, 2-way, 4-way, 8-way, 16-way, 32-way, 64-way); miss rate falls as associativity increases.]
Slide 21
Cache Write Strategies
Write through: Data is written to both the cache block and to a block of main memory.
– The lower level always has the most up-to-date data, an important feature for I/O and multiprocessing.
– Easier to implement than write back.
– A write buffer is often used to reduce CPU write stalls while data is written to memory.
[Figure: Processor, Cache, Write Buffer, DRAM; writes pass through the write buffer on their way to DRAM.]
Slide 22
Cache Write Strategies cont.
Write back: Data is written or updated only in the cache block. The modified (dirty) cache block is written to main memory when it is replaced in the cache.
– Writes occur at the speed of the cache.
– A status bit called the dirty bit is used to indicate whether the block was modified while in the cache; if not, the block is not written back to main memory.
– Uses less memory bandwidth than write through.
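The contrast between the two policies in miniature, as a Python sketch. The "cache" here is a single block, and the class and field names are illustrative only, not a full cache model.

```python
# Write-through: every write also updates memory.
class WriteThroughBlock:
    def __init__(self, memory, addr):
        self.memory, self.addr = memory, addr
        self.data = memory[addr]

    def write(self, value):
        self.data = value
        self.memory[self.addr] = value   # forwarded to memory immediately

# Write-back: writes stay in the cache; memory is updated only on eviction.
class WriteBackBlock:
    def __init__(self, memory, addr):
        self.memory, self.addr = memory, addr
        self.data = memory[addr]
        self.dirty = False

    def write(self, value):
        self.data = value
        self.dirty = True                # just mark the block as modified

    def evict(self):
        if self.dirty:                   # copy back only if modified
            self.memory[self.addr] = self.data

mem_wt = {0x10: 0}
wt = WriteThroughBlock(mem_wt, 0x10)
wt.write(42)
print(mem_wt[0x10])   # 42: memory updated on every write

mem_wb = {0x10: 0}
wb = WriteBackBlock(mem_wb, 0x10)
wb.write(42)
print(mem_wb[0x10])   # still 0: memory not yet updated
wb.evict()
print(mem_wb[0x10])   # 42: updated only when the dirty block is replaced
```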
Slide 23
Cache and Main Memory
Harvard architecture: separate instruction and data memories
von Neumann architecture: one memory for instructions and data
Split cache: separate instruction and data caches (L1)
Unified cache: holds instructions and data (L1, L2, L3)
Slide 24
Cache and Main Memory cont.
(16 KB instruction cache + 16 KB data cache) vs. a 32 KB unified cache
Hit cycles: 1, miss cycles: 50, 75% of accesses are instruction accesses
16 KB I & D: instruction miss rate = 0.64%, data miss rate = 6.47%
32 KB unified: miss rate = 1.99%
Average memory access time = % instructions x (read hit time + read miss rate x miss penalty) + % data x (write hit time + write miss rate x miss penalty)
Split = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
Unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1* + 1.99% x 50) = 2.24
*: 1 extra clock cycle since there is only one cache port to satisfy two simultaneous requests
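The same comparison as a short Python sketch; the percentages and miss rates are the example's values, and the variable names are mine.

```python
# Split vs. unified cache: average memory access time (AMAT) in cycles.
miss_penalty = 50
p_instr, p_data = 0.75, 0.25

# Split 16 KB instruction + 16 KB data caches
amat_split = (p_instr * (1 + 0.0064 * miss_penalty)
              + p_data * (1 + 0.0647 * miss_penalty))

# Unified 32 KB cache; +1 cycle on data accesses because a single cache port
# must serve two simultaneous requests
amat_unified = (p_instr * (1 + 0.0199 * miss_penalty)
                + p_data * (1 + 1 + 0.0199 * miss_penalty))

print(amat_split)     # about 2.05
print(amat_unified)   # about 2.24: slower despite the lower miss rate
```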
Slide 25
Improving Cache Performance
For a given cache size, the following design issues and tradeoffs exist:
Line width (2^W). Too small a value for W causes a lot of main memory accesses; too large a value increases the miss penalty and may tie up cache space with low-utility items that are replaced before being used.
Set size or associativity (2^S). Direct mapping (S = 0) is simple and fast; greater associativity leads to more complexity, and thus slower access, but tends to reduce conflict misses.
Line replacement policy. Usually LRU (least recently used) algorithm or some approximation thereof; not an issue for direct-mapped caches. Somewhat surprisingly, random selection works quite well in practice.
Write policy. Write through or write back.
Slide 26
2:1 cache rule of thumb
A direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2.
E.g., Ref. p. 424, Fig. 5.14: miss rate of 8 KB direct-mapped (0.068%) ≈ 4 KB 2-way set-associative (0.076%)
16 KB (0.049%) ≈ 8 KB (0.049%)
32 KB (0.042%) ≈ 16 KB (0.041%)
64 KB (0.037%) ≈ 32 KB (0.038%)
The rule does not hold for caches larger than 128 KB.
Slide 27
90/10 locality rule
A program executes about 90% of its instructions in 10% of its code
Slide 28
Four classic memory hierarchy questions
• Where can a block be placed in the upper level? (block placement): direct mapped, set-associative..
• How is a block found if it is in the upper level? (block identification): tag, index, offset
• Which block should be replaced on a miss? (block replacement): Random, LRU, FIFO..
• What happens on a write? (write strategy): write through, write back
Slide 29
Reducing cache miss penalty (literature review 1)
• Multilevel caches
• Critical word first and early restart
• Giving priority to read misses over writes
• Merging write buffer
• Victim caches
• ..
Slide 30
Reducing cache miss rate (literature review 2)
• Larger block size
• Larger caches
• Higher associativity
• Way prediction and pseudoassociative caches
• Compiler optimizations
• ..
Slide 31
Reducing cache miss penalty or miss rate via parallelism (literature review 3)
• Nonblocking caches
• H/W prefetching of instructions and data
• Compiler controlled prefetching
• ..
Slide 32
Reducing hit time (literature review 4)
• Small and simple caches
• Avoiding address translation during indexing of the cache
• Pipelined cache access
• Trace caches
• ..