University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 1 Computer Systems...

University of Amsterdam

Computer Systems – cache characteristics Arnoud Visser 1

Computer Systems

Cache characteristics

How do you know?(ow133)cpuid

This system has a Genuine Intel(R) Pentium(R) 4 processorProcessor Family: F, Extended Family: 0, Model: 2, Stepping: 7

Pentium 4 core C1 (0.13 micron): core-speed 2 Ghz - 3.06 GHz (bus-speed 400/533 MHz)

Instruction TLB: 4K, 2M or 4M pages, fully associative, 128 entriesData TLB: 4K or 4M pages, fully associative, 64 entries1st-level data cache: 8K-bytes, 4-way set associative, sectored cache, 64-byte line sizeNo 2nd-level cache or, if processor contains a valid 2nd-level cache, no3rd-level cacheTrace cache: 12K-uops, 8-way set associative2nd-level cache: 512K-bytes, 8-way set associative, sectored cache, 64-byte line size

CPUID-instruction

asm("push %ebx;movl $2,%eax;CPUID;movl %eax,cache_eax;movl %ebx,cache_ebx;movl %edx,cache_edx;movl %ecx,cache_ecx;pop %ebx");

A limited number of answers

http://www.sandpile.org/ia32/cpuid.htm

Direct-Mapped Cache

• Simplest kind of cache

• Characterized by exactly one line per set.

E=1 lines per setvalid

• • •

set 0:

set 1:

set S-1:

cache block

Fully Associative Cache

• Difficult and expensive to build one that is both large and fast

TagSet 0: E = C/B lines in

the one and only set

Valid Tag

• • •

Cache block

E – way Set Associative Caches

• Characterized by more than one line per set

valid tagset 0: E=2 lines per set

set 1:

set S-1:

• • •

cache block

valid tag cache block

Accessing Set Associative Caches

• Set selection– Use the set index bits to determine the set

of interest.

t bits s bits0 0 0 0 1

b bits

tag set index block offset

selected set

s = log2 (S)

tagset 0:

tagset 1:

tagset S-1:

• • •

cache block

S = C / (B x E)

Accessing Set Associative Caches• Line matching and word selection

– must compare the tag in each valid line in the selected set.

1 0110 w3w0 w1 w2

1 1001

t bits s bits100i0110

b bits

tag set index block offset

selected set (i):

=1? (1) The valid bit must be set.

= ?(2) The tag bits in one of the cache lines must

match the tag bits inthe address

(3) If (1) and (2), then cache hit, and

block offset selects starting byte.

30 1 2 74 5 6

b = log2 (B)

Observations

• Contiguously region of addresses form a block

• Multiple address blocks share the same set-index

• Blocks that map to the same set can be uniquely identified by the tag

Caching in a Memory Hierarchy

0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15

Larger, slower, cheaper storagedevice at level k+1 is partitionedinto blocks.

Data is copied betweenlevels in block-sized transfer units

8 9 14 3Smaller, faster, more expensivedevice at level k caches a subset of the blocks from level k+1

Level k:

Level k+1:

Same set-index

Request14

Request12

General Caching Concepts• Program needs object d, which

is stored in some block b.• Cache hit

– Program finds b in the cache at level k. E.g., block 14.

• Cache miss– b is not at level k, so level k

cache must fetch it from level k+1. E.g., block 12.

– If level k cache is full, then some current block must be replaced (evicted). Which one is the “victim”?

0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15

Level k:

Level k+1:

0 1 2 3

Request12

4*4*12

Conflict miss

Intel Processors CacheSRAM

Instruction Data

Pentium II 1997 4-way, 32B, 128 sets

4-way,32B, 128 sets

4-way,32B, 4096 sets

Celeron A 1998 ,, ,, 4-way,32B, 1028 sets

Pentium III Coppermine

2000 ,, ,, 4-way,32B, 2048 sets

Pentium 4Willamette

2000 8-way 4-way,64B, 64 sets

8-way,64B, 512 sets

Pentium 4Northwood

2002 ,, ,, 8-way,64B, 1028 sets

http://www11.brinkster.com/bayup/dodownload.asp?f=37

Impact

• Associativitymore lines decrease the vulnerability on thrashing. It requires more tag-bits and control logic, which increases the hit-time (1-2 cycles for L1)

• Block sizelarger blocks exploit spatial locality (not temporal locality) to increase the hit-rate. Larger blocks increase the miss-penalty (5- 100 cycles for L1).

Design of DRAM cache– Line size?

• Large, since disk better at transferring large blocks– Associativity?

• High, to mimimize miss rate– Write through or write back?

• Write back, since can’t afford to perform small writes to disk

What would the impact of these choices be on:– miss rate

• Extremely low. << 1%– hit time

• Must match cache/DRAM performance– miss latency

• Very high. ~20ms– tag storage overhead

• Low, relative to block size

Locating an Object in a “Cache”• SRAM Cache

– Tag stored with cache line– Maps from cache block to memory blocks

• From cached to uncached form• Save a few bits by only storing tag

– No tag for block not in cache– Hardware retrieves information

• can quickly match against multiple tags

Object NameTag Data

•••

“Cache”

Locating an Object in “Cache”

•••

Object Name

Location

•••

On Disk

“Cache”Page Table

• DRAM Cache– Each allocated page of virtual memory has entry in

page table– Mapping from virtual pages to physical pages

• From uncached form to cached form– Page table entry even if page not in memory

• Specifies disk address

A System with Virtual Memory

Address Translation: Hardware converts virtual addresses to physical addresses via lookup table (page table)

Memory

Page Table

VirtualAddresses

PhysicalAddresses

Page Faults (like “Cache Misses”)What if an object is on disk rather than in memory?

– Page table entry indicates virtual address not in memory– OS exception handler invoked to move data from disk into

memory• current process suspends, others can resume• OS has full control over placement, etc.

Memory

Page Table

VirtualAddresses

PhysicalAddresses

Memory

Page Table

VirtualAddresses

PhysicalAddresses

Before fault After fault

VM Address Translation:Hardware vs Software

Processor

HardwareAddr TransMechanism

faulthandler

MainMemory

Secondary memorya

page fault

physical addressOS performsthis transfer(only if miss)

virtual address part of the on-chipmemory mgmt unit (MMU)

virtual page number page offset virtual address

physical page number page offset physical address0p–1

address translation

pm–1

n–1 0p–1p

Page offset bits don’t change as a result of translation

VM Address Translation• Parameters

– P = 2p = page size (bytes). – N = 2n = Virtual address limit– M = 2m = Physical address limit

Address Translation via Page TableAddress Translation via Page Table

virtual page number (VPN) page offset

virtual address

physical page number (PPN) page offset

physical address

0p–1pm–1

n–1 0p–1ppage table base register

if valid=0then pagenot in memory

valid physical page number (PPN)access

VPN acts astable index

PTEA= PTE

CPUTrans-lation

Cache MainMemory

VA PA miss

hitdata

Integrating VM and Cache

• Most Caches “Physically Addressed”– Allows multiple processes to have blocks in cache at

same time– Cache doesn’t need to be concerned with protection

issues (Access rights checked as part of address translation)

• Perform Address Translation Before Cache – But this involves a memory access itself (of the PTE)– Of course, page table entries can also become cached

CPUTLB

LookupCache

MainMemory

VA PA miss

Trans-lation

Speeding up Translation with a dedicated cache

• “Translation Lookaside Buffer” (TLB)– Small hardware cache in MMU– Maps virtual page to physical page numbers– Contains complete PTEs for small number of pages

What do you know?(ow133)cpuid

This system has a Genuine Intel(R) Pentium(R) 4 processorProcessor Family: F, Extended Family: 0, Model: 2, Stepping: 7

Pentium 4 core C1 (0.13 micron): core-speed 2 Ghz - 3.06 GHz (bus-speed 400/533 MHz)

Instruction TLB: 4K, 2M or 4M pages, fully associative, 128 entriesData TLB: 4K or 4M pages, fully associative, 64 entries1st-level data cache: 8K-bytes, 4-way set associative, sectored cache, 64-byte line sizeNo 2nd-level cache or, if processor contains a valid 2nd-level cache, no3rd-level cacheTrace cache: 12K-uops, 8-way set associative2nd-level cache: 512K-bytes, 8-way set associative, sectored cache, 64-byte line size

Matrix Multiplication Example• Major Cache Effects to Consider

– Total cache size• Exploit temporal locality and keep the working set small (e.g., by

using blocking)– Block size

• Exploit spatial locality

• Description:– Multiply N x N matrices– O(N3) total operations– Accesses

• N reads per source element• N values summed per destination

• Assumption – No so large that cache not big enough to hold multiple rows

/* ijk */for (i=0; i<n; i++) { for (j=0; j<n; j++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum; }}

Variable sumheld in register

Matrix Multiplication (ijk)/* ijk */for (i=0; i<n; i++) { for (j=0; j<n; j++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum; }}

/* ijk */for (i=0; i<n; i++) { for (j=0; j<n; j++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum; }}

(*,j)(i,j)

Inner loop:

Column-wise

Row-wise Fixed

• Misses per Inner Loop Iteration:A B C

0.25 1.0 0.0 = 1.25

Matrix Multiplication (jik)/* jik */for (j=0; j<n; j++) { for (i=0; i<n; i++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum }}

/* jik */for (j=0; j<n; j++) { for (i=0; i<n; i++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum }}

(*,j)(i,j)

Inner loop:

Row-wise Column-wise

0.25 1.0 0.0 = 1.25

Matrix Multiplication (kij)/* kij */for (k=0; k<n; k++) { for (i=0; i<n; i++) { r = a[i][k]; for (j=0; j<n; j++) c[i][j] += r * b[k][j]; }}

/* kij */for (k=0; k<n; k++) { for (i=0; i<n; i++) { r = a[i][k]; for (j=0; j<n; j++) c[i][j] += r * b[k][j]; }}

(i,*)(i,k) (k,*)

Inner loop:

Row-wise Row-wiseFixed

0.0 0.25 0.25 = 0.5

Matrix Multiplication (ikj)/* ikj */for (i=0; i<n; i++) { for (k=0; k<n; k++) { r = a[i][k]; for (j=0; j<n; j++) c[i][j] += r * b[k][j]; }}

/* ikj */for (i=0; i<n; i++) { for (k=0; k<n; k++) { r = a[i][k]; for (j=0; j<n; j++) c[i][j] += r * b[k][j]; }}

(i,*)(i,k) (k,*)

Inner loop:

Row-wise Row-wiseFixed

0.0 0.25 0.25 = 0.5

Matrix Multiplication (jki)/* jki */for (j=0; j<n; j++) { for (k=0; k<n; k++) { r = b[k][j]; for (i=0; i<n; i++) c[i][j] += a[i][k] * r; }}

/* jki */for (j=0; j<n; j++) { for (k=0; k<n; k++) { r = b[k][j]; for (i=0; i<n; i++) c[i][j] += a[i][k] * r; }}

(*,j)(k,j)

Inner loop:

Column -wise

Column-wise

1.0 0.0 1.0 = 2.0

Matrix Multiplication (kji)/* kji */for (k=0; k<n; k++) { for (j=0; j<n; j++) { r = b[k][j]; for (i=0; i<n; i++) c[i][j] += a[i][k] * r; }}

/* kji */for (k=0; k<n; k++) { for (j=0; j<n; j++) { r = b[k][j]; for (i=0; i<n; i++) c[i][j] += a[i][k] * r; }}

(*,j)(k,j)

Inner loop:

FixedColumn-wise

Column-wise

1.0 0.0 1.0 = 2.0

Summary of Matrix Multiplication

for (i=0; i<n; i++) {

for (j=0; j<n; j++) {

sum = 0.0;

for (k=0; k<n; k++)

sum += a[i][k] * b[k][j];

c[i][j] = sum;

ijk (& jik): • 2 loads, 0 stores• misses/iter = 1.25

for (k=0; k<n; k++) {

for (i=0; i<n; i++) {

r = a[i][k];

for (j=0; j<n; j++)

c[i][j] += r * b[k][j];

for (j=0; j<n; j++) {

for (k=0; k<n; k++) {

r = b[k][j];

for (i=0; i<n; i++)

c[i][j] += a[i][k] * r;

kij (& ikj): • 2 loads, 1 store• misses/iter = 0.5

jki (& kji): • 2 loads, 1 store• misses/iter = 2.0

Matrix Multiply Performance• Miss rates are helpful but not perfect predictors.

• Code scheduling matters, too.

25 50 75 100 125 150 175 200 225 250 275 300 325 350 375 400

Array size (n)

kjijkikijikjjikijk

Improving Temporal Locality by Blocking• Example: Blocked matrix multiplication

– do not mean “cache block”.– Instead, it mean a sub-block within the matrix.– Example: N = 8; sub-block size = 4

C11 = A11B11 + A12B21 C12 = A11B12 + A12B22

C21 = A21B11 + A22B21 C22 = A21B12 + A22B22

A11 A12

A21 A22

B11 B12

B21 B22

X = C11 C12

C21 C22

Key idea: Sub-blocks (i.e., Axy) can be treated just like scalars.

Blocked Matrix Multiply (bijk)for (jj=0; jj<n; jj+=bsize) { for (i=0; i<n; i++) for (j=jj; j < min(jj+bsize,n); j++) c[i][j] = 0.0; for (kk=0; kk<n; kk+=bsize) { for (i=0; i<n; i++) { for (j=jj; j < min(jj+bsize,n); j++) { sum = 0.0 for (k=kk; k < min(kk+bsize,n); k++) { sum += a[i][k] * b[k][j]; } c[i][j] += sum; } } }}

Blocked Matrix Multiply Analysis• Innermost loop pair multiplies

– a 1 X bsize sliver of A by – a bsize X bsize block of B and

accumulates into 1 X bsize sliver of C– Loop over i steps through n row slivers of A & C,

using same B

block reused n times in succession

row sliver accessedbsize times

Update successiveelements of sliver

kk jjjj

Matrix Multiply Performance• Blocking (bijk and bikj) improves performance

by a factor of two over unblocked versions– relatively insensitive to array size.

Array size (n)

bijk (bsize = 25)

bikj (bsize = 25)

Concluding Observations• Programmer can optimize for cache performance

– How data structures are organized– How data are accessed

• Nested loop structure• Blocking is a general technique

• All systems favor “cache friendly code”– Getting absolute optimum performance is very

platform specific• Cache sizes, line sizes, associativities, etc.

– Can get most of the advantage with generic code• Keep working set reasonably small (temporal locality)• Use small strides (spatial locality)

Assignment

• Optimize your ‘image-processing’ code further with blocking

• Make your code a function of the L1-size

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 1 Computer Systems...

Documents

Kees Verruijt, Arnoud Roebers, Anjo de Heus · Kees Verruijt, Arnoud Roebers, Anjo de Heus Job Scheduling for SAP®

INF3380: Parallel Programming for Natural Sciences · Cache Cache Core Core Core Core Cache Cache Bus Compute Node Memory Core Core Core Core Cache Cache Core Core Core Core ... Compute

04 Cache Memoryinfosec.pusan.ac.kr/.../2019/03/CH04-cache-memory.pdfTable 4.1 Key Characteristics of Computer Memory Systems Location Internal (e.g. processor registers, cache, main

Prime-Cache / Pro-Cache / Power-Cache Archive Appliance User

System Cache v1.00 - Xilinx · The Cache memory provides the actual cache functionality in the System Cache. The cache is configurable in terms of size and associativity. The cache

Delta AC Motor Drives Arnoud de Bok / April 2012

cache. ... cache

LogiCORE IP System Cache v1The Cache memory provides the actual cache functionality in the System Cache. The cache is configurable in terms of size and associativity. The cache size

Arnoud Raskin @ the next 10 years in employer marketing

Cache Memory. Characteristics Location Capacity Unit of transfer Access method Performance Physical type Physical characteristics Organisation

03-04 Cache Memory Computer Organization. Characteristics Location Capacity Unit of transfer Access method Performance Physical type Physical characteristics

Computer Architecture Lecture 3 Cache Memory. Characteristics Location Capacity Unit of transfer Access method Performance Physical type Physical characteristics

Proefschrift Strategie Implementatie | Arnoud van der Maas

Cache in Chromium: Disk Cache

Cache me outside - Umbraco Spark · Cache me outside ANTHONY DANG HEAD OF TECHNOLOGY, THE COGWORKS ... Partial Cache Output Cache / Donut Cache Custom Inline (method-level) Cache

Weibo / Cache-Cache

Arnoud Mouwen Ruud van der Ploeg 1. © 2012 Strategy Development Partners Arnoud Mouwen, Ruud van der Ploeg, Amsterdam Metropolitan Region, EMTA General

D CACHE: A Deep Learning Based Framework For …...A Framework that leverages state-of-the-art ML algorithms to improve cache efficiency Predict object characteristics to develop new

Kees Verruijt, Arnoud Roebers, Anjo de Heus

Cache memory and cache