CMPE 421 Parallel Computer Architecture
PART 4: Caching with Associativity
Fully Associative Cache: Reducing Cache Misses by More Flexible Block Placement

Instead of direct mapping, we allow any memory block to be placed in any cache slot, using any available entry to store memory elements. Remember: direct mapped caches are more rigid. Many different memory addresses map to each index, and cached data goes exactly where the index says, even if the rest of the cache is empty. In a fully associative cache, by contrast, nothing gets "thrown out" of the cache until it is completely full.

The cost is that it is harder to check for a hit (hit time will increase), and it requires much more hardware: a comparator for each cache slot. Each tag is a complete block address (no index bits are used).
Fully Associative Cache
Must compare the tags of all entries in parallel to find the desired one (if there is a hit), whereas a direct mapped cache only needs to look in one place.
There are no conflict misses, only capacity misses.
Practical only for caches with a small number of blocks, since the parallel search increases the hardware cost.
Direct Mapped vs Fully Associative
Direct mapped: Address = Tag | Index | Block offset. Each address has only one possible location; the index (0 through 15 in the figure) selects a single V/Tag/Data entry.

Fully associative: Address = Tag | Block offset. There is no index; a block may occupy any V/Tag/Data entry.
Trade-off

Fully associative is much more flexible, so the miss rate will be lower. Direct mapped requires less hardware (cheaper) and will also be faster! This is a trade-off of miss rate vs. hit time, so we might find the best solution in a compromise between a direct mapped cache and a fully associative cache.

We can also provide more flexibility without going to a fully associative placement policy: for each memory location, provide a small number of cache slots that can hold the memory element. This is much more flexible than direct mapped, but requires less hardware than fully associative.

SOLUTION: Set Associative
Set Associative Cache

There is a fixed number of locations where each block can be placed. N-way set associative means there are N places (slots) where each block can be placed: the cache is divided into a number of sets, each of size N "ways" (N-way set associative). A memory block therefore maps to a unique set (specified by the index field) and can be placed in any way of that set, so there are N choices. The set a memory block maps to in a set associative cache is (a worked example follows):
- (Block address) modulo (Number of sets in the cache)
- Remember that in a direct mapped cache the position of a memory block is given by (Block address) modulo (Number of cache blocks)
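As a small worked example (sizes assumed for illustration): in a cache with 4 sets, block address 13 maps to set 13 modulo 4 = 1 and may be placed in any of that set's N ways, whereas in a direct mapped cache with 8 blocks the same block could go only to block 13 modulo 8 = 5.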
A Compromise
2-way set associative: Address = Tag | Index | Block offset. Each address has two possible locations with the same index. One fewer index bit than direct mapped means half the indexes (sets 0 through 7 in the figure).

4-way set associative: Address = Tag | Index | Block offset. Each address has four possible locations with the same index. Two fewer index bits means one quarter the indexes (sets 0 through 3 in the figure).
Range of Set Associative Caches
Address layout: Tag | Index | Block offset | Byte offset. The tag is used for the compare, the index is the set number and selects the set the block can be placed in, and the block offset selects the word in the block.

Decreasing associativity moves toward direct mapped (only one way, smaller tags); increasing associativity moves toward fully associative (only one set, where the tag is all the bits except the block and byte offsets).
Range of Set Associative Caches
For a fixed-size cache, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets; it decreases the size of the index by 1 bit and increases the size of the tag by 1 bit (see the sketch below).
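The following C sketch makes the rule concrete; the 16 KB size, 16-byte blocks, and 32-bit addresses are assumptions for illustration, not parameters from the slides.

    /* Address field widths for a fixed-size cache as associativity grows. */
    #include <stdio.h>

    static unsigned log2u(unsigned x) {  /* x is assumed to be a power of two */
        unsigned n = 0;
        while (x >>= 1) n++;
        return n;
    }

    int main(void) {
        const unsigned addr_bits   = 32;         /* assumed address width */
        const unsigned cache_bytes = 16 * 1024;  /* assumed 16 KB cache */
        const unsigned block_bytes = 16;         /* assumed 16-byte blocks */

        for (unsigned ways = 1; ways <= 8; ways *= 2) {
            unsigned sets        = cache_bytes / block_bytes / ways;
            unsigned offset_bits = log2u(block_bytes);   /* block + byte offset */
            unsigned index_bits  = log2u(sets);
            unsigned tag_bits    = addr_bits - index_bits - offset_bits;
            printf("%u-way: %4u sets, index=%2u bits, tag=%2u bits\n",
                   ways, sets, index_bits, tag_bits);
        }
        return 0;
    }

Each doubling of the number of ways prints one fewer index bit and one more tag bit, exactly as stated above.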
Set Associative Cache
[Figure: a two-set associative cache holding V/Tag/Data entries for sets 0 and 1, mapped against a 16-word main memory (addresses 0000xx through 1111xx). One-word blocks; the two low-order address bits define the byte in the 32-bit word.]

Q1: How do we find it? Use the next low-order memory address bit to determine which cache set: (block address) modulo (# of sets in the cache).

Q2: Is it there? Compare all the cache tags in the set to the high-order 3 memory address bits to tell if the memory block is in the cache. The valid bit indicates whether an entry contains valid information; if the bit is not set, there cannot be a match for this block.
Set Associative Cache Organization
FIGURE 7.17 The implementation of a four-way set-associative cache requires four comparators and a 4-to-1 multiplexor. The comparators determine which element of the selected set (if any) matches the tag. The output of the comparators is used to select the data from one of the four blocks of the indexed set, using a multiplexor with a decoded select signal. In some implementations, the Output enable signals on the data portions of the cache RAMs can be used to select the entry in the set that drives the output. The Output enable signal comes from the comparators, causing the element that matches to drive the data outputs.
Set Associative Cache Organization
This is called a 4-way set associative cache because there are four cache entries for each cache index. Essentially, you have four direct mapped caches working in parallel.

This is how it works: the cache index selects a set from the cache, and the four tags in the set are compared in parallel with the upper bits of the memory address. If no tag matches the incoming address tag, we have a cache miss; otherwise, we have a cache hit and select the data from the way where the tag match occurred. (A software sketch of this lookup follows below.)

This is simple enough. What are its disadvantages?
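The following C sketch mirrors that lookup in software; the structure and function names are hypothetical, the loop stands in for the four parallel comparators, and the selected way stands in for the 4-to-1 multiplexor of Figure 7.17.

    #define WAYS 4

    struct cache_line {
        int      valid;
        unsigned tag;
        unsigned data;   /* one word per block, for simplicity */
    };

    /* Returns 1 on a hit (writing the word to *out), 0 on a miss. */
    int set_lookup(const struct cache_line set[WAYS], unsigned tag, unsigned *out)
    {
        for (int way = 0; way < WAYS; way++) {
            /* In hardware this comparison happens in all ways at once. */
            if (set[way].valid && set[way].tag == tag) {
                *out = set[way].data;   /* mux selects the matching way */
                return 1;
            }
        }
        return 0;   /* no tag matched: cache miss */
    }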
N-way Set Associative Cache versus Direct Mapped Cache

An N-way set associative cache will also be slower than a direct mapped cache because:
- N comparators vs. 1
- Extra MUX delay for the data
- Data comes AFTER the Hit/Miss decision and set selection
In a direct mapped cache, the cache block is available BEFORE Hit/Miss, so it is possible to assume a hit and continue, recovering later if it was a miss.
Remember the Example for Direct Mapping (ping-pong effect)

Consider the main memory word reference string 0 4 0 4 0 4 0 4, starting with an empty cache (all blocks initially marked as not valid). Blocks 0 and 4 map to the same cache index, so each reference evicts the other: the entry alternates between Mem(0) with tag 00 and Mem(4) with tag 01, and every reference misses.

Ping-pong effect due to conflict misses: two memory locations map into the same cache block.

8 requests, 8 misses
Solution: Use set associative cache
Consider the same main memory word reference string 0 4 0 4 0 4 0 4, starting with an empty cache (all blocks initially marked as not valid). The first two references miss and fill the two ways of the set, with Mem(0) under tag 000 and Mem(4) under tag 010; every remaining reference hits.

This solves the ping-pong effect of the direct mapped cache due to conflict misses, since the two memory locations that map into the same cache set can now co-exist! (A simulation sketch follows below.)

8 requests, 2 misses
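To make the counts concrete, here is a minimal C sketch that replays the reference string against a 4-block direct mapped cache and a single 2-way set (sizes assumed to match the examples) and reproduces the 8-miss vs. 2-miss results.

    #include <stdio.h>

    int main(void) {
        const int refs[8] = {0, 4, 0, 4, 0, 4, 0, 4};

        /* Direct mapped, 4 one-word blocks: 0 and 4 both map to index 0. */
        int dm_tag[4], dm_valid[4] = {0}, dm_miss = 0;
        for (int i = 0; i < 8; i++) {
            int idx = refs[i] % 4, tag = refs[i] / 4;
            if (!dm_valid[idx] || dm_tag[idx] != tag) {
                dm_miss++;                       /* miss: evict and refill */
                dm_valid[idx] = 1;
                dm_tag[idx]   = tag;
            }
        }

        /* 2-way set associative, one set: blocks 0 and 4 co-exist. */
        int sa_tag[2], sa_valid[2] = {0}, sa_miss = 0;
        for (int i = 0; i < 8; i++) {
            int tag = refs[i], hit = 0;
            for (int w = 0; w < 2; w++)
                if (sa_valid[w] && sa_tag[w] == tag) hit = 1;
            if (!hit) {
                sa_miss++;
                int w = sa_valid[0] ? 1 : 0;     /* fill an empty way; this trace
                                                    never needs a real eviction */
                sa_valid[w] = 1;
                sa_tag[w]   = tag;
            }
        }

        printf("direct mapped: %d misses; 2-way: %d misses\n", dm_miss, sa_miss);
        return 0;
    }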
Set Associative Example
The same trace of five 10-bit addresses (2 byte-offset bits and 2 block-offset bits each; the index is 1 to 3 bits and the tag 3 to 5 bits, depending on associativity) is run against three caches of equal size:

0100111000, 1100110100, 0100111100, 0110110000, 1100111000

Direct mapped (3-bit index, 3-bit tag): all five references map to index 011, with tags 010, 110, 010, 011, 110. Result: Miss, Miss, Miss, Miss, Miss.

2-way set associative (2-bit index, 4-bit tag): all five references map to set 11, with tags 0100, 1100, 0100, 0110, 1100. Result: Miss, Miss, Hit, Miss, Miss. The third reference hits because tags 0100 and 1100 co-exist in the set; the fourth evicts one of them.

4-way set associative (1-bit index, 5-bit tag): all five references map to set 1, with tags 01001, 11001, 01001, 01101, 11001. Result: Miss, Miss, Hit, Miss, Hit.

With the same trace, higher associativity converts conflict misses into hits.
New Performance Numbers
Miss rates for the DEC 3100 (MIPS machine), separate 64KB instruction/data caches:

Benchmark   Associativity   Instruction miss rate   Data miss rate   Combined miss rate
gcc         Direct          2.0%                    1.7%             1.9%
gcc         2-way           1.6%                    1.4%             1.5%
gcc         4-way           1.6%                    1.4%             1.5%
spice       Direct          0.3%                    0.6%             0.4%
spice       2-way           0.3%                    0.6%             0.4%
spice       4-way           0.3%                    0.6%             0.4%
Benefits of Set Associative Caches

The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation.

[Chart: miss rate (0 to 12%) vs. associativity (1-way, 2-way, 4-way, 8-way) for cache sizes from 4KB to 512KB. Data from Hennessy & Patterson, Computer Architecture, 2003.]

Largest gains are in going from direct mapped to 2-way (20%+ reduction in miss rate).
Benefits of Set Associative Caches

As the cache size grows, the relative improvement from associativity increases only slightly. Since the overall miss rate of a larger cache is lower, the opportunity for improving the miss rate decreases, and the absolute improvement in miss rate from associativity shrinks significantly.
Cache Block Replacement Policy

For deciding which block to replace when a new entry comes in:

Random Replacement: hardware randomly selects a cache entry and throws it out.

First In First Out (FIFO): equally fair / equally unfair to all frames.

Least Recently Used (LRU): uses the idea of temporal locality to select the entry that has not been accessed recently.
- Additional bit(s) are required in the cache entry to track access order; they must be updated on each access, and all must be scanned on a replacement.
- For a two-way set associative cache, one bit per set suffices for LRU replacement.

A common approach is a pseudo-LRU strategy. Example of a simple "pseudo" least recently used implementation (assume 64 fully associative entries): a hardware replacement pointer points to one cache entry.
- Whenever an access is made to the entry the pointer points to: move the pointer to the next entry
- Otherwise: do not move the pointer
(A sketch of this scheme follows below.)
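A minimal C sketch of that pointer scheme follows; the names are hypothetical, and advancing the pointer after a replacement is an assumption, since the slide only specifies the behavior on accesses.

    #define ENTRIES 64

    static unsigned ptr = 0;   /* hardware replacement pointer */

    /* On every access: if the accessed entry is the one the pointer names,
       advance the pointer so a recently used entry is not the next victim;
       otherwise leave the pointer where it is. */
    void on_access(unsigned entry) {
        if (entry == ptr)
            ptr = (ptr + 1) % ENTRIES;
    }

    /* On a miss, the victim is whatever entry the pointer currently names. */
    unsigned pick_victim(void) {
        unsigned victim = ptr;
        ptr = (ptr + 1) % ENTRIES;   /* assumed: step past the freshly filled entry */
        return victim;
    }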
Source of Cache Misses
                   Direct Mapped    N-way Set Associative    Fully Associative
Cache Size         Big              Medium                   Small
Compulsory Miss    Same             Same                     Same
Conflict Miss      High             Medium                   Zero
Capacity Miss      Low(er)          Medium                   High
Designing a cache
Design change            Effect on miss rate               Negative performance effect
Increase size            Decreases capacity misses         May increase access time
Increase associativity   Decreases conflict misses         May increase access time
Increase block size      May decrease compulsory misses;   May increase miss penalty
                         may increase capacity misses

Note: if you are running "billions" of instructions, compulsory misses are insignificant.
Key Cache Design Parameters
                             L1 typical      L2 typical
Total size (blocks)          250 to 2000     4000 to 250,000
Total size (KB)              16 to 64        500 to 8000
Block size (B)               32 to 64        32 to 128
Miss penalty (clocks)        10 to 25        100 to 1000
Miss rates (global for L2)   2% to 5%        0.1% to 2%
Two Machines’ Cache Parameters
                    Intel P4                                 AMD Opteron
L1 organization     Split I$ and D$                          Split I$ and D$
L1 cache size       8KB for D$, 96KB for trace cache (~I$)   64KB for each of I$ and D$
L1 block size       64 bytes                                 64 bytes
L1 associativity    4-way set assoc.                         2-way set assoc.
L1 replacement      ~LRU                                     LRU
L1 write policy     write-through                            write-back
L2 organization     Unified                                  Unified
L2 cache size       512KB                                    1024KB (1MB)
L2 block size       128 bytes                                64 bytes
L2 associativity    8-way set assoc.                         16-way set assoc.
L2 replacement      ~LRU                                     ~LRU
L2 write policy     write-back                               write-back
Where can a block be placed/found?
                    # of sets                                Blocks per set
Direct mapped       # of blocks in cache                     1
Set associative     (# of blocks in cache) / associativity   Associativity (typically 2 to 16)
Fully associative   1                                        # of blocks in cache

                    Location method                          # of comparisons
Direct mapped       Index                                    1
Set associative     Index the set; compare the set's tags    Degree of associativity
Fully associative   Compare all blocks' tags                 # of blocks
Multilevel caches
A two-level cache structure allows the primary cache (L1) to focus on reducing hit time to yield a shorter clock cycle, while the second-level cache (L2) focuses on reducing the penalty of long memory access times.

Compared to the cache of a single-cache machine, L1 on a multilevel cache machine is usually smaller, has a smaller block size, and has a higher miss rate. L2, by contrast, is often larger with a larger block size, and its access time is less critical than that of a single-cache machine's cache.
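As a hedged illustration with assumed numbers (not from the slides): with an L1 hit time of 1 cycle, an L1 miss rate of 5%, an L2 hit time of 10 cycles, an L2 local miss rate of 20%, and a 100-cycle main memory penalty, the average memory access time is AMAT = 1 + 0.05 × (10 + 0.20 × 100) = 2.5 cycles. Without the L2, the same L1 would give 1 + 0.05 × 100 = 6 cycles, which is why L2 can tolerate a longer access time while still sharply cutting the effective miss penalty.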