Lecture 16: Large Cache Innovations
• Today: Large cache design and other cache innovations
• Midterm scores:
  91-80: 17 students
  79-75: 14 students
  74-68: 8 students
  63-54: 5 students
• Q1 (HM), Q2 (bpred), Q3 (stalls), Q7 (loops): mostly correct
• Q4 (ooo): 50% correct; many didn't stall renaming
• Q5 (multi-core): less than a handful got 8+ points
• Q6 (memdep): less than a handful got part (b) right and correctly articulated the effect on power/energy
Shared Vs. Private Caches in Multi-Core
• Advantages of a shared cache:
  - Space is dynamically allocated among cores
  - No wastage of space because of replication
  - Potentially faster cache coherence (and easier to locate data on a miss)
• Advantages of a private cache:
  - Small L2 → faster access time
  - Private bus to L2 → less contention
UCA and NUCA
• The small caches discussed so far are all uniform cache access (UCA): every access has the same latency, no matter where in the cache the data is found
• For a large multi-megabyte cache, it is too expensive to make every access pay the worst-case delay: hence, non-uniform cache architecture (NUCA), where latency depends on the bank being accessed (a toy latency model follows below)
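To make the non-uniformity concrete, here is a toy latency model (my sketch, not from the lecture): total latency is a fixed bank lookup time plus a per-hop network delay, so nearby banks respond far sooner than distant ones. The constants and the name nuca_latency are illustrative, loosely echoing the 13-17 vs. 65 cycle figures on a later slide.

/* Toy NUCA latency model: all constants are illustrative assumptions. */
static int nuca_latency(int hops_to_bank) {
    const int bank_cycles = 13;  /* lookup time within one bank */
    const int hop_cycles  = 2;   /* wire + router delay per hop */
    return bank_cycles + hop_cycles * hops_to_bank;
}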
Large NUCA
[Figure: a CPU in one corner of a large array of cache banks]

Issues to be addressed for Non-Uniform Cache Access:
• Mapping
• Migration
• Search
• Replication
Static and Dynamic NUCA
• Static NUCA (S-NUCA)
  - The address index bits determine where the block is placed (see the mapping sketch below)
  - Page coloring can help here as well to improve locality
• Dynamic NUCA (D-NUCA)
  - Blocks are allowed to move between banks
  - The block can be anywhere: need some search mechanism
  - Each core can maintain a partial tag structure so it has an idea of where the data might be (complex!)
  - Alternatively, every possible bank is looked up and the search propagates either in series or in parallel (complex!)
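A minimal sketch of S-NUCA bank selection (my illustration; the 64-byte lines, 16 banks, and the name snuca_bank are assumptions): the bank index is taken directly from the block address, so every block has exactly one home bank and no search is needed.

#include <stdint.h>

#define LINE_BITS 6    /* 64-byte cache lines (assumed)     */
#define NUM_BANKS 16   /* power-of-two bank count (assumed) */

/* Bits just above the line offset pick the bank, so a block's
   home is fixed by its address alone. */
static unsigned snuca_bank(uint64_t addr) {
    return (addr >> LINE_BITS) & (NUM_BANKS - 1);
}

D-NUCA gives up this one-to-one mapping in order to place hot blocks closer to the requester, which is exactly why it needs partial tags or a propagating search.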
Example Organization
[Figure: NUCA bank grid; the banks nearest the CPU have a latency of 13-17 cycles, the farthest 65 cycles. Data must be placed close to the center-of-gravity of requests.]
Examples: Frequency of Accesses
[Figure: per-bank access frequency maps, darker = more accesses, for two workloads: OLTP (on-line transaction processing) and Ocean (scientific code)]
[Figure: eight cores (Core 0-7), each with private L1 I- and D-caches, connected through a scalable non-broadcast interconnect to a shared L2 cache with directory state and an L2 cache controller]
[Figure: eight cores, each with private L1 caches and a private L2 bank; a central structure holds replicated tags of all L2 and L1 caches, and a controller handles L2 misses and off-chip accesses over a scalable non-broadcast interconnect]
[Figure: tiled organization with eight tiles and a memory controller for off-chip access. A single tile is composed of a core, L1 caches, and a bank (slice) of the shared L2 cache. The cache controller forwards address requests to the appropriate L2 bank and handles coherence operations.]
[Figure: 3D-stacked organization: a bottom die with the cores and L1 caches, and a top die with the L2 cache banks and the memory controller for off-chip access. Each core has low-latency access to one L2 bank.]
The Best of S-NUCA and D-NUCA
• Employ S-NUCA (no search required) and use page coloring to influence the block's cache index bits, and hence the bank the block is placed in (a page-coloring sketch follows below)
• Page migration enables block movement, just as in D-NUCA
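A sketch of how page coloring steers S-NUCA placement, assuming 4 KB pages and 16 banks (illustrative, as is the name bank_of_page): when the bank-index bits fall inside the physical page number, the OS controls the bank simply by choosing which physical page frame to allocate, and can migrate a page by re-copying it to a frame of a different color.

#include <stdint.h>

#define PAGE_BITS 12   /* 4 KB pages (assumed)              */
#define NUM_BANKS 16   /* power-of-two bank count (assumed) */

/* The bank index lies in the physical page number, so it is
   under OS control at allocation time. */
static unsigned bank_of_page(uint64_t paddr) {
    return (paddr >> PAGE_BITS) & (NUM_BANKS - 1);
}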
Prefetching
• Hardware prefetching can be employed for any of the cache levels
• It can introduce cache pollution – prefetched data is often placed in a separate prefetch buffer to avoid pollution – this buffer must be looked up in parallel with the cache access
• Aggressive prefetching increases "coverage" (the fraction of misses eliminated), but reduces "accuracy" (the fraction of prefetches that are useful) and wastes memory bandwidth (see the sketch after this list)
• Prefetches must be timely: they must be issued sufficiently in advance to hide the latency, but not too early (to avoid pollution and eviction before use)
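The standard definitions behind these two terms (well-established, though the slide does not spell them out): coverage is the fraction of baseline misses that prefetching eliminates, and accuracy is the fraction of issued prefetches that are actually used. A tiny sketch with made-up numbers:

#include <stdio.h>

int main(void) {
    long baseline_misses = 1000;  /* misses without prefetching (made up) */
    long issued          = 800;   /* prefetches issued (made up)          */
    long useful          = 600;   /* prefetches used before eviction      */

    printf("coverage = %.2f\n", (double)useful / baseline_misses);  /* 0.60 */
    printf("accuracy = %.2f\n", (double)useful / issued);           /* 0.75 */
    return 0;
}

Issuing more prefetches tends to raise coverage, but the extra issues are increasingly speculative, so accuracy drops.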
Stream Buffers
• Simplest form of prefetch: on every miss, bring in multiple cache lines
• When you read the top of the queue, bring in the next line
[Figure: L1 cache with an adjacent stream buffer holding sequential lines]
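A minimal stream-buffer sketch; the 4-entry depth, the single stream, and the function names are my simplifications. An L1 miss allocates the buffer with the next sequential line addresses; a hit at the head hands that line to the L1, shifts the FIFO, and fetches one more sequential line into the tail.

#include <stdint.h>
#include <stdbool.h>

#define SB_DEPTH 4     /* buffer depth is an assumption */

typedef struct {
    uint64_t line[SB_DEPTH];  /* FIFO of prefetched line addresses */
    int      valid;           /* number of valid entries           */
} stream_buf;

/* On an L1 miss: start a new stream of sequential lines. */
static void sb_alloc(stream_buf *sb, uint64_t miss_line) {
    for (int i = 0; i < SB_DEPTH; i++)
        sb->line[i] = miss_line + 1 + i;
    sb->valid = SB_DEPTH;
}

/* On an L1 lookup: only the head is checked; a hit shifts the
   queue and prefetches one more line after the current tail. */
static bool sb_probe(stream_buf *sb, uint64_t want_line) {
    if (sb->valid == 0 || sb->line[0] != want_line)
        return false;
    for (int i = 0; i < SB_DEPTH - 1; i++)
        sb->line[i] = sb->line[i + 1];
    sb->line[SB_DEPTH - 1] = sb->line[SB_DEPTH - 2] + 1;
    return true;
}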
Stride-Based Prefetching
• For each load, keep track of the last address accessed by the load and a possibly consistent stride
• An FSM detects a consistent stride and issues prefetches accordingly
[Figure: stride prefetcher hardware: a PC-indexed table whose entries hold (tag, prev_addr, stride, state), and a four-state FSM (init, trans, steady, no-pred); correct stride predictions move an entry toward steady, incorrect ones update the stride and move it toward no-pred]
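A sketch of the per-PC table and FSM, using the state names and transitions from the figure; the table size, the indexing, and the prefetch-only-in-steady policy are my assumptions.

#include <stdint.h>

enum state { INIT, TRANS, STEADY, NOPRED };

typedef struct {
    uint64_t   tag;        /* load PC                          */
    uint64_t   prev_addr;  /* last address touched by the load */
    int64_t    stride;     /* last recorded stride             */
    enum state st;
} rpt_entry;

#define RPT_SIZE 256       /* table size is an assumption */
static rpt_entry rpt[RPT_SIZE];

/* Called on every executed load; returns an address to prefetch,
   or 0 when no prediction is made. */
static uint64_t rpt_access(uint64_t pc, uint64_t addr) {
    rpt_entry *e = &rpt[(pc >> 2) % RPT_SIZE];

    if (e->tag != pc) {               /* new load: allocate an entry */
        e->tag = pc; e->prev_addr = addr;
        e->stride = 0; e->st = INIT;
        return 0;
    }

    int64_t stride  = (int64_t)(addr - e->prev_addr);
    int     correct = (stride == e->stride);

    switch (e->st) {                  /* transitions as in the figure */
    case INIT:
        if (correct) e->st = STEADY;
        else { e->stride = stride; e->st = TRANS; }
        break;
    case TRANS:
        if (correct) e->st = STEADY;
        else { e->stride = stride; e->st = NOPRED; }
        break;
    case STEADY:
        e->st = correct ? STEADY : INIT;   /* stride kept on a miss */
        break;
    case NOPRED:
        if (correct) e->st = TRANS;
        else e->stride = stride;           /* stay in no-pred */
        break;
    }
    e->prev_addr = addr;

    /* Prefetch only while the stride is confirmed. */
    return (e->st == STEADY) ? addr + (uint64_t)e->stride : 0;
}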
Compiler Optimizations
• Loop interchange: nested loops can be re-ordered to exploit spatial locality; since C arrays are row-major, the innermost loop should walk the rightmost index

for (j=0; j<100; j++)
  for (i=0; i<5000; i++)
    x[i][j] = 2 * x[i][j];

is converted to…

for (i=0; i<5000; i++)
  for (j=0; j<100; j++)
    x[i][j] = 2 * x[i][j];
Exercise
for (i=0; i<N; i++)
  for (j=0; j<N; j++) {
    r = 0;
    for (k=0; k<N; k++)
      r = r + y[i][k] * z[k][j];
    x[i][j] = r;
  }

for (jj=0; jj<N; jj+=B)
 for (kk=0; kk<N; kk+=B)
  for (i=0; i<N; i++)
   for (j=jj; j<min(jj+B,N); j++) {
     r = 0;
     for (k=kk; k<min(kk+B,N); k++)
       r = r + y[i][k] * z[k][j];
     x[i][j] = x[i][j] + r;
   }

[Figure: the elements of x, y, and z touched by the original (top) and blocked (bottom) loops]
• Re-organize data accesses so that a piece of data is used a number of times before moving on… in other words, artificially create temporal locality. The block size B should be small enough that the working set (roughly a B×B tile of z plus strips of y and x) fits in the cache; a sizing sketch follows below
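A rough way to size B, under my assumption (not stated on the slide) that a B×B tile of z plus B-long strips of y and x should stay cache-resident:

#include <math.h>

/* Aim for roughly B*B + 2*B doubles in cache; using a third of
   the capacity leaves slack for the strips and conflict misses. */
static int pick_block(int cache_bytes) {
    int capacity = cache_bytes / (int)sizeof(double);
    return (int)sqrt(capacity / 3.0);
}

For example, a 32 KB cache holds 4096 doubles, giving B ≈ 36.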
Exercise
[Figure: the blocked code from the previous slide, with successive snapshots showing which elements of x, y, and z are touched as the computation sweeps through the blocks]
Exercise
• The original code could have 2N³ + N² memory accesses (computing each of the N² elements of x reads N elements of y and N of z), while the blocked version has roughly 2N³/B + N², since each block of y and z brought into the cache is reused about B times