
1

Lecture 16: Large Cache Innovations

• Today: Large cache design and other cache innovations

• Midterm scores:
  91-80: 17 students
  79-75: 14 students
  74-68: 8 students
  63-54: 5 students

• Q1 (HM), Q2 (bpred), Q3 (stalls), Q7 (loops): mostly correct
• Q4 (ooo): 50% correct; many didn’t stall renaming
• Q5 (multi-core): less than a handful got 8+ points
• Q6 (memdep): less than a handful got part (b) right and correctly articulated the effect on power/energy

2

Shared Vs. Private Caches in Multi-Core

• Advantages of a shared cache:
  Space is dynamically allocated among cores
  No wastage of space because of replication
  Potentially faster cache coherence (and easier to locate data on a miss)

• Advantages of a private cache:
  A small L2 means faster access time
  A private bus to L2 means less contention

3

UCA and NUCA

• The small caches we have seen so far have all had uniform cache access: the latency of an access is a constant, no matter where in the cache the data is found

• For a large multi-megabyte cache, it is expensive to bound every access time by the worst-case delay; hence, non-uniform cache architecture (NUCA), where latency depends on the bank that holds the data
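For perspective, in the example organization a few slides ahead, banks near the CPU respond in 13-17 cycles while the farthest bank takes 65; a UCA design would have to charge every access the worst-case 65 cycles, while NUCA lets accesses to nearby banks complete several times faster.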

4

Large NUCA

[Figure: the CPU attached to a large array of NUCA cache banks.]

Issues to be addressed for Non-Uniform Cache Access:

• Mapping

• Migration

• Search

• Replication

5

Static and Dynamic NUCA

• Static NUCA (S-NUCA):
  The address index bits determine where the block is placed
  Page coloring can help here as well to improve locality

• Dynamic NUCA (D-NUCA):
  Blocks are allowed to move between banks
  The block can be anywhere: need some search mechanism
  Each core can maintain a partial tag structure so it has an idea of where the data might be (complex!)
  Alternatively, every possible bank is looked up and the search propagates (either in series or in parallel) (complex!)
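As a minimal sketch of the S-NUCA mapping (the block size and bank count here are assumptions, not from the slides), bank selection is just a bit-field extraction from the block address, so no search is ever needed:

    #include <stdint.h>

    /* Assumed geometry: 64-byte blocks, 16 banks. */
    #define BLOCK_OFFSET_BITS 6
    #define BANK_BITS         4

    /* S-NUCA: the home bank is a fixed function of the address bits
       just above the block offset, so requests route directly to it. */
    static inline unsigned snuca_bank(uint64_t paddr)
    {
        return (unsigned)((paddr >> BLOCK_OFFSET_BITS) & ((1u << BANK_BITS) - 1));
    }

D-NUCA gives up exactly this property: once blocks can migrate, the home bank is no longer a function of the address, which is what forces the partial-tag or multi-bank search mechanisms above.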

6

Example Organization

[Figure: a grid of NUCA banks around the CPU; the nearest banks have a latency of 13-17 cycles, the farthest 65 cycles. Data must be placed close to the center-of-gravity of requests.]

7

Examples: Frequency of Accesses

[Figure: heat maps of bank accesses (darker means more accesses) for OLTP (on-line transaction processing) and Ocean (a scientific code).]

[Figure: organization 1: eight cores, each with private L1 D and I caches, connected by a scalable non-broadcast interconnect to a shared L2 cache with directory state and an L2 cache controller.]

[Figure: organization 2: eight cores, each with private L1 caches and a private L2 cache; replicated tags of all L2 and L1 caches, plus a controller that handles L2 misses, sit on a scalable non-broadcast interconnect and initiate off-chip accesses.]

[Figure: organization 3: a tiled design, where a single tile is composed of a core, L1 caches, and a bank (slice) of the shared L2 cache, and a memory controller handles off-chip access. The cache controller in each tile forwards address requests to the appropriate L2 bank and handles coherence operations.]

[Figure: organization 4: a 3D-stacked design, with a bottom die holding the cores and L1 caches and a top die holding the L2 cache banks, plus a memory controller for off-chip access. Each core has low-latency access to one L2 bank.]

12

The Best of S-NUCA and D-NUCA

• Employ S-NUCA (no search required) and use page coloring to influence the block’s cache index bits and hence the bank that the block gets placed in

• Page migration enables block movement just as in D-NUCA
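A hedged sketch of how page coloring steers placement (the page size, bank count, and bit positions are assumptions): if the bank-select bits lie inside the physical page number, the OS chooses a block's bank simply by choosing the color of the physical frame it maps:

    #include <stdint.h>

    #define PAGE_BITS 12   /* 4 KB pages (assumed) */
    #define BANK_BITS 4    /* 16 banks   (assumed) */

    /* Every block in a page shares the page's color, i.e., its bank. */
    static inline unsigned page_color(uint64_t paddr)
    {
        return (unsigned)((paddr >> PAGE_BITS) & ((1u << BANK_BITS) - 1));
    }

Migration then amounts to copying a page to a free frame of the target color and updating the page table: the block moves between banks as in D-NUCA, but every lookup remains search-free S-NUCA.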

13

Prefetching

• Hardware prefetching can be employed for any of the cache levels

• It can introduce cache pollution; prefetched data is therefore often placed in a separate prefetch buffer, which must be looked up in parallel with the cache access

• Aggressive prefetching increases “coverage”, but leads to a reduction in “accuracy” and hence wasted memory bandwidth

• Prefetches must be timely: they must be issued sufficiently in advance to hide the latency, but not too early (to avoid pollution and eviction before use)
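As an illustrative timeliness calculation (the numbers are assumptions, not from the slides): if a memory access takes about 300 cycles and a loop consumes a new cache line every 50 cycles, prefetches must be launched at least ceil(300/50) = 6 lines ahead of the demand stream; running much further ahead risks the prefetched lines being evicted before they are used.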

14

Stream Buffers

• Simplest form of prefetch: on every miss, bring in multiple cache lines

• When you read the top of the queue, bring in the next line

[Figure: a stream buffer, holding sequential lines, sits alongside the L1 cache.]
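A minimal software model of this policy (the FIFO depth, line size, and the prefetch_line stub are assumptions of this sketch):

    #include <stdint.h>
    #include <stdbool.h>

    #define DEPTH 4     /* FIFO depth (assumed)         */
    #define BLOCK 64    /* line size in bytes (assumed) */

    typedef struct {
        uint64_t line[DEPTH];  /* addresses of prefetched lines, oldest first */
        int      head;         /* index of the next-expected line             */
        bool     valid;
    } stream_buf;

    /* Stand-in for launching a line fill from the next level. */
    static void prefetch_line(uint64_t addr) { (void)addr; }

    /* Called on an L1 miss to a BLOCK-aligned address. */
    static bool stream_lookup(stream_buf *sb, uint64_t addr)
    {
        if (sb->valid && sb->line[sb->head] == addr) {
            /* Hit at the head of the queue: hand this line to L1 and
               stream in the next sequential line to keep the FIFO full. */
            uint64_t tail = sb->line[(sb->head + DEPTH - 1) % DEPTH];
            sb->line[sb->head] = tail + BLOCK;  /* old head slot becomes new tail */
            sb->head = (sb->head + 1) % DEPTH;
            prefetch_line(tail + BLOCK);
            return true;
        }
        /* Miss in the buffer too: restart the stream just past the miss. */
        for (int i = 0; i < DEPTH; i++) {
            sb->line[i] = addr + (uint64_t)(i + 1) * BLOCK;
            prefetch_line(sb->line[i]);
        }
        sb->head  = 0;
        sb->valid = true;
        return false;
    }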

15

Stride-Based Prefetching

• For each load, keep track of the last address accessed by the load and a possibly consistent stride

• An FSM detects a consistent stride and issues prefetches ahead of the load

[Figure: per-entry FSM with four states: init, trans(ient), steady, and no-pred. A correct prediction moves init or trans to steady, keeps steady in steady, and moves no-pred to trans. An incorrect prediction updates the stride and moves init to trans, trans to no-pred, and keeps no-pred in no-pred; an incorrect prediction in steady falls back to init without updating the stride.]

Each table entry, indexed by the load PC, holds: tag, prev_addr, stride, state.
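A sketch of one table entry and its update in C, following the FSM just described (the per-PC table lookup and the choice to prefetch only in the steady state are assumptions of this sketch):

    #include <stdint.h>

    typedef enum { INIT, TRANS, STEADY, NOPRED } rpt_state;

    typedef struct {
        uint64_t  tag;        /* tag of the load's PC     */
        uint64_t  prev_addr;  /* last address it accessed */
        int64_t   stride;     /* last observed stride     */
        rpt_state state;
    } rpt_entry;

    /* Update one entry on a new access by this load; returns nonzero
       if a prefetch of addr + e->stride should be issued. */
    static int rpt_update(rpt_entry *e, uint64_t addr)
    {
        int64_t delta   = (int64_t)(addr - e->prev_addr);
        int     correct = (delta == e->stride);

        switch (e->state) {
        case INIT:
            if (correct) e->state = STEADY;
            else { e->stride = delta; e->state = TRANS; }
            break;
        case TRANS:
            if (correct) e->state = STEADY;
            else { e->stride = delta; e->state = NOPRED; }
            break;
        case STEADY:
            e->state = correct ? STEADY : INIT;  /* stride kept on this miss */
            break;
        case NOPRED:
            if (correct) e->state = TRANS;
            else e->stride = delta;              /* stay in no-pred */
            break;
        }
        e->prev_addr = addr;
        return e->state == STEADY;
    }

In steady state the prefetcher would fetch addr + stride, or several strides ahead, per the earlier timeliness discussion.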

16

Compiler Optimizations

• Loop interchange: loops can be re-ordered to exploit spatial locality (C arrays are row-major, so the column index j should vary in the innermost loop to touch consecutive elements)

for (j = 0; j < 100; j++)
    for (i = 0; i < 5000; i++)
        x[i][j] = 2 * x[i][j];

is converted to…

for (i = 0; i < 5000; i++)
    for (j = 0; j < 100; j++)
        x[i][j] = 2 * x[i][j];

17

Exercise

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        r = 0;
        for (k = 0; k < N; k++)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

for (jj = 0; jj < N; jj += B)
    for (kk = 0; kk < N; kk += B)
        for (i = 0; i < N; i++)
            for (j = jj; j < min(jj+B, N); j++) {
                r = 0;
                for (k = kk; k < min(kk+B, N); k++)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }

[Figure: the elements of x, y, and z touched by the two versions.]

• Re-organize data accesses so that a piece of data is used a number of times before moving on… in other words, artificially create temporal locality
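A common rule of thumb (an assumption here, not stated on the slide) is to pick the blocking factor B so that the inner loops' working set (a BxB block of z plus the B-element strips of y and x touched for each i) fits comfortably in the cache, i.e., roughly B² + 2B words.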

18

Exercise

for (jj = 0; jj < N; jj += B)
    for (kk = 0; kk < N; kk += B)
        for (i = 0; i < N; i++)
            for (j = jj; j < min(jj+B, N); j++) {
                r = 0;
                for (k = kk; k < min(kk+B, N); k++)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }

[Figure: snapshots of the blocked access patterns of x, y, and z as the computation proceeds.]

19

Exercise

• The original code could incur 2N³ + N² memory word accesses (up to N³ each for y and z, plus N² for x), while the blocked version incurs about 2N³/B + N², since each block of y and z is reused B times from the cache

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        r = 0;
        for (k = 0; k < N; k++)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

for (jj = 0; jj < N; jj += B)
    for (kk = 0; kk < N; kk += B)
        for (i = 0; i < N; i++)
            for (j = jj; j < min(jj+B, N); j++) {
                r = 0;
                for (k = kk; k < min(kk+B, N); k++)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }

