Cache Memories

Seoul National University

Cache Memories
• Cache memory organization and operation
• Performance impact of caches
  • The memory mountain
  • Rearranging loops to improve spatial locality

Cache Memories
• Cache memories are small, fast SRAM-based memories managed automatically in hardware.
  • They hold frequently accessed blocks of main memory.
• The CPU looks first for data in the caches (e.g., L1, L2, and L3), then in main memory.

Typical system structure:
[Figure: block diagram with the labels CPU chip, register file, cache memories, bus interface, system bus, I/O bridge, memory bus, and main memory.]

General Cache Organization (S, E, B)
• S = 2^s sets
• E = 2^e lines per set
• B = 2^b bytes per cache block (the data)
• Each line holds a valid bit (v), a tag, and a block of data bytes numbered 0, 1, 2, ..., B-1.
• Cache size: C = S x E x B data bytes

Cache Read
Address of word: t bits (tag) | s bits (set index) | b bits (block offset)
• Locate the set using the set index.
• Check whether any line in the set has a matching tag.
• Yes + line valid: hit.
• Locate the data starting at the block offset.
[Figure: S = 2^s sets, E = 2^e lines per set, each line with a valid bit (v), a tag, and B = 2^b bytes of data; the data begins at the block offset.]

Example: Direct Mapped Cache (E = 1)
Direct mapped: one line per set. Assume: cache block size 8 bytes.
Address of int: t bits | 0…01 | 100
[Figure: S = 2^s sets, each with one line holding a valid bit (v), a tag, and data bytes 0 through 7.]
• Find the set: the set index is 0…01.
• Valid? + tag match (assume yes) = hit.
• Block offset 100 (binary) = 4: the int (4 bytes) occupies bytes 4 through 7 of the block.
• No match: the old line is evicted and replaced.

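Direct-Mapped Cache Simulation
M = 16 byte addresses, B = 2 bytes/block, S = 4 sets, E = 1 block/set
Address bits: t = 1, s = 2, b = 1

Address trace (reads, one byte per read):
• 0 [0000₂]: miss; set 0 loads M[0-1] (tag 0)
• 1 [0001₂]: hit in set 0
• 7 [0111₂]: miss; set 3 loads M[6-7] (tag 0)
• 8 [1000₂]: miss; set 0 evicts M[0-1] and loads M[8-9] (tag 1)
• 0 [0000₂]: miss; set 0 evicts M[8-9] and reloads M[0-1] (tag 0)

Final cache contents (v, tag, block):
• Set 0: 1, 0, M[0-1]
• Set 1: 0, ?, ?
• Set 2: 0, ?, ?
• Set 3: 1, 0, M[6-7]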

A Higher Level Example
Assume: cold (empty) cache; block size 32 B = 4 doubles; a[0][0] goes in the first cache line. Ignore the variables sum, i, j.

double sum_array_rows(double a[16][16])
{
    int i, j;
    double sum = 0;

    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++)
            sum += a[i][j];
    return sum;
}

double sum_array_cols(double a[16][16])
{
    int i, j;
    double sum = 0;

    for (j = 0; j < 16; j++)
        for (i = 0; i < 16; i++)
            sum += a[i][j];
    return sum;
}

E-way Set Associative Cache (Here: E = 2)
E = 2: two lines per set. Assume: cache block size 8 bytes.
Address of short int: t bits | 0…01 | 100
[Figure: each set holds two lines, each with a valid bit (v), a tag, and data bytes 0 through 7.]
• Find the set: the set index is 0…01.
• Compare both lines in the set: valid? + tag match: yes = hit.
• Block offset 100 (binary) = 4: the short int (2 bytes) occupies bytes 4 and 5 of the block.
• No match:
  • One line in the set is selected for eviction and replacement.
  • Replacement policies: random, least recently used (LRU), …

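2-Way Set Associative Cache Simulation
M = 16 byte addresses, B = 2 bytes/block, S = 2 sets, E = 2 blocks/set
Address bits: t = 2, s = 1, b = 1

Address trace (reads, one byte per read):
• 0 [0000₂]: miss; set 0 loads M[0-1] (tag 00)
• 1 [0001₂]: hit in set 0
• 7 [0111₂]: miss; set 1 loads M[6-7] (tag 01)
• 8 [1000₂]: miss; set 0 loads M[8-9] (tag 10) into its second line
• 0 [0000₂]: hit in set 0

Final cache contents (v, tag, block):
• Set 0: 1, 00, M[0-1] and 1, 10, M[8-9]
• Set 1: 1, 01, M[6-7] and 0, ?, ?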

A Higher Level Example
Assume: cold (empty) cache; block size 32 B = 4 doubles; a[0][0] goes in the first cache line. Ignore the variables sum, i, j.

double sum_array_rows(double a[16][16])
{
    int i, j;
    double sum = 0;

    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++)
            sum += a[i][j];
    return sum;
}

double sum_array_cols(double a[16][16])
{
    int i, j;
    double sum = 0;

    for (j = 0; j < 16; j++)
        for (i = 0; i < 16; i++)
            sum += a[i][j];
    return sum;
}

What about writes?
• Multiple copies of data exist: L1, L2, main memory, disk.
• What to do on a write-hit?
  • Write-through (write immediately to memory)
  • Write-back (defer write to memory until replacement of line)
    • Needs a dirty bit (line different from memory or not)
• What to do on a write-miss?
  • Write-allocate (load into cache, update line in cache)
    • Good if more writes to the location follow
  • No-write-allocate (writes immediately to memory)
• Typical combinations:
  • Write-through + No-write-allocate
  • Write-back + Write-allocate

A Common Framework for Memory Hierarchies
• Question 1: Where can a block be placed? One place (direct-mapped), a few places (set associative), or any place (fully associative).
• Question 2: How is a block found? Indexing (direct-mapped), limited search (set associative), full search (fully associative).
• Question 3: Which block is replaced on a miss? Typically LRU or random.
• Question 4: How are writes handled? Write-through or write-back.

Intel Core i7 Cache Hierarchy
[Figure: processor package containing Core 0 through Core 3; each core has an L1 d-cache, an L1 i-cache, and an L2 unified cache; the L3 unified cache is shared by all cores; main memory lies outside the package.]
• L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles
• L2 unified cache: 256 KB, 8-way, access: 11 cycles
• L3 unified cache: 8 MB, 16-way, access: 30-40 cycles
• Block size: 64 bytes for all caches.

Cache Performance Metrics
• Miss Rate
  • Fraction of memory references not found in cache (misses / accesses) = 1 – hit rate
  • Typical numbers (in percentages):
    • 3-10% for L1
    • can be quite small (e.g., < 1%) for L2, depending on size, etc.
• Hit Time
  • Time to deliver a line in the cache to the processor
    • includes time to determine whether the line is in the cache
  • Typical numbers:
    • 1-2 clock cycles for L1
    • 5-20 clock cycles for L2
• Miss Penalty
  • Additional time required because of a miss
    • typically 50-200 cycles for main memory (Trend: increasing!)

Let's think about those numbers
• Huge difference between a hit and a miss
  • Could be 100x, if just L1 and main memory
• Would you believe 99% hits is twice as good as 97%?
  • Consider: cache hit time of 1 cycle, miss penalty of 100 cycles
  • Average access time:
    • 97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
    • 99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles
• This is why “miss rate” is used instead of “hit rate”

Writing Cache Friendly Code
• Make the common case go fast
  • Focus on the inner loops of the core functions
• Minimize the misses in the inner loops
  • Repeated references to variables are good (temporal locality)
  • Stride-1 reference patterns are good (spatial locality)
• Key idea: Our qualitative notion of locality is quantified through our understanding of cache memories.


The Memory Mountain
• Read throughput (read bandwidth)
  • Number of bytes read from memory per second (MB/s)
• Memory mountain: measured read throughput as a function of spatial and temporal locality.
  • Compact way to characterize memory system performance.

Memory Mountain Test Function

/* The test function */
/* data is a global int array assumed to be defined elsewhere (not shown) */
void test(int elems, int stride)
{
    int i, result = 0;
    volatile int sink;

    for (i = 0; i < elems; i += stride)
        result += data[i];

    sink = result; /* So compiler doesn't optimize away the loop */
}

/* Run test(elems, stride) and return read throughput (MB/s) */
double run(int size, int stride, double Mhz)
{
    double cycles;
    int elems = size / sizeof(int);

    test(elems, stride);                     /* warm up the cache */
    cycles = fcyc2(test, elems, stride, 0);  /* call test(elems,stride) */
    return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */
}

The Memory Mountain
[Figure: memory mountain for an Intel Core i7 with a 32 KB L1 i-cache, 32 KB L1 d-cache, 256 KB unified L2 cache, and 8 MB unified L3 cache; all caches on-chip. Axes: stride (x8 bytes), ticks s1 through s9; working set size (bytes), 4K through 64M. Annotations: slopes of spatial locality; ridges of temporal locality.]


Miss Rate Analysis for Matrix Multiply
• Assume:
  • Line size = 32 B (big enough for four 64-bit words)
  • Matrix dimension (N) is very large
    • Approximate 1/N as 0.0
  • Cache is not even big enough to hold multiple rows
• Analysis method:
  • Look at access pattern of inner loop

Matrix Multiplication Example
Description:
• Multiply N x N matrices
• O(N^3) total operations
• N reads per source element
• N values summed per destination
  • but may be able to hold in register

/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;                      /* variable sum held in register */
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Layout of C Arrays in Memory (review)
• C arrays are allocated in row-major order
  • each row in contiguous memory locations
• Stepping through columns in one row:

    for (i = 0; i < N; i++)
        sum += a[0][i];

  • accesses successive elements
  • if block size (B) > 4 bytes, exploit spatial locality
    • compulsory miss rate = 4 bytes / B
• Stepping through rows in one column:

    for (i = 0; i < n; i++)
        sum += a[i][0];

  • accesses distant elements
  • no spatial locality!
    • compulsory miss rate = 1 (i.e. 100%)

Matrix Multiplication (ijk)

/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop:
• A (i,*): row-wise
• B (*,j): column-wise
• C (i,j): fixed

Misses per inner loop iteration:
• A: 0.25
• B: 1.0
• C: 0.0

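Matrix Multiplication (jik)

/* jik */
for (j=0; j<n; j++) {
    for (i=0; i<n; i++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop:
• A (i,*): row-wise
• B (*,j): column-wise
• C (i,j): fixed

Misses per inner loop iteration (same access pattern as ijk):
• A: 0.25
• B: 1.0
• C: 0.0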

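Matrix Multiplication (kij)

/* kij */
for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop:
• A (i,k): fixed
• B (k,*): row-wise
• C (i,*): row-wise

Misses per inner loop iteration (from the 32 B line assumption, four 64-bit words per line):
• A: 0.0
• B: 0.25
• C: 0.25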

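Matrix Multiplication (ikj)

/* ikj */
for (i=0; i<n; i++) {
    for (k=0; k<n; k++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop:
• A (i,k): fixed
• B (k,*): row-wise
• C (i,*): row-wise

Misses per inner loop iteration (same access pattern as kij):
• A: 0.0
• B: 0.25
• C: 0.25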

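Matrix Multiplication (jki)

/* jki */
for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop:
• A (*,k): column-wise
• B (k,j): fixed
• C (*,j): column-wise

Misses per inner loop iteration (each column-wise access touches a new line):
• A: 1.0
• B: 0.0
• C: 1.0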

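Matrix Multiplication (kji)

/* kji */
for (k=0; k<n; k++) {
    for (j=0; j<n; j++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop:
• A (*,k): column-wise
• B (k,j): fixed
• C (*,j): column-wise

Misses per inner loop iteration (same access pattern as jki):
• A: 1.0
• B: 0.0
• C: 1.0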

Summary of Matrix Multiplication

ijk (& jik):
• 2 loads, 0 stores
• misses/iter = 1.25

for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

kij (& ikj):
• 2 loads, 1 store
• misses/iter = 0.5

for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

jki (& kji):
• 2 loads, 1 store
• misses/iter = 2.0

for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Core i7 Matrix Multiply Performance
[Figure: performance of the six loop orders versus array size (n), from 50 to 750. The curves fall into three groups: jki / kji, ijk / jik, and kij / ikj.]

Summary
• Memory hierarchies are an optimization resulting from a perfect match between memory technology and two types of program locality
  • Temporal locality
  • Spatial locality
• The goal is to provide a “virtual” memory technology (an illusion) that has the access time of the highest-level memory with the size and cost of the lowest-level memory
• Cache memory is an instance of a memory hierarchy
  • exploits both temporal and spatial locality
  • direct-mapped caches are simple and fast but have higher miss rates
  • set-associative caches have lower miss rates but are complex and slow
  • multilevel caches are becoming increasingly popular
• Programmers can optimize for cache performance
  • How data structures are organized
  • How data are accessed (nested loop structure)
