View
228
Download
1
Category
Preview:
Citation preview
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 1
Computer Systems
Cache characteristics
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 2
How do you know?(ow133)cpuid
This system has a Genuine Intel(R) Pentium(R) 4 processorProcessor Family: F, Extended Family: 0, Model: 2, Stepping: 7
Pentium 4 core C1 (0.13 micron): core-speed 2 Ghz - 3.06 GHz (bus-speed 400/533 MHz)
Instruction TLB: 4K, 2M or 4M pages, fully associative, 128 entriesData TLB: 4K or 4M pages, fully associative, 64 entries1st-level data cache: 8K-bytes, 4-way set associative, sectored cache, 64-byte line sizeNo 2nd-level cache or, if processor contains a valid 2nd-level cache, no3rd-level cacheTrace cache: 12K-uops, 8-way set associative2nd-level cache: 512K-bytes, 8-way set associative, sectored cache, 64-byte line size
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 3
CPUID-instruction
asm("push %ebx;movl $2,%eax;CPUID;movl %eax,cache_eax;movl %ebx,cache_ebx;movl %edx,cache_edx;movl %ecx,cache_ecx;pop %ebx");
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 4
A limited number of answers
http://www.sandpile.org/ia32/cpuid.htm
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 5
Direct-Mapped Cache
• Simplest kind of cache
• Characterized by exactly one line per set.
E=1 lines per setvalid
valid
valid
tag
tag
tag
• • •
set 0:
set 1:
set S-1:
cache block
cache block
cache block
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 6
Fully Associative Cache
• Difficult and expensive to build one that is both large and fast
Valid
Valid
Tag
TagSet 0: E = C/B lines in
the one and only set
Valid Tag
• • •
Cache block
Cache block
Cache block
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 7
E – way Set Associative Caches
• Characterized by more than one line per set
valid tagset 0: E=2 lines per set
set 1:
set S-1:
• • •
cache block
valid tag cache block
valid tag cache block
valid tag cache block
valid tag cache block
valid tag cache block
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 8
Accessing Set Associative Caches
• Set selection– Use the set index bits to determine the set
of interest.
m-1
t bits s bits0 0 0 0 1
0
b bits
tag set index block offset
selected set
s = log2 (S)
valid
valid
tag
tagset 0:
valid
valid
tag
tagset 1:
valid
valid
tag
tagset S-1:
• • •
cache block
cache block
cache block
cache block
cache block
cache block
S = C / (B x E)
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 9
Accessing Set Associative Caches• Line matching and word selection
– must compare the tag in each valid line in the selected set.
1 0110 w3w0 w1 w2
1 1001
t bits s bits100i0110
0m-1
b bits
tag set index block offset
selected set (i):
=1? (1) The valid bit must be set.
= ?(2) The tag bits in one of the cache lines must
match the tag bits inthe address
(3) If (1) and (2), then cache hit, and
block offset selects starting byte.
30 1 2 74 5 6
b = log2 (B)
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 10
Observations
• Contiguously region of addresses form a block
• Multiple address blocks share the same set-index
• Blocks that map to the same set can be uniquely identified by the tag
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 11
Caching in a Memory Hierarchy
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
Larger, slower, cheaper storagedevice at level k+1 is partitionedinto blocks.
Data is copied betweenlevels in block-sized transfer units
8 9 14 3Smaller, faster, more expensivedevice at level k caches a subset of the blocks from level k+1
Level k:
Level k+1:
4
4
4 10
10
10
Same set-index
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 12
Request14
Request12
General Caching Concepts• Program needs object d, which
is stored in some block b.• Cache hit
– Program finds b in the cache at level k. E.g., block 14.
• Cache miss– b is not at level k, so level k
cache must fetch it from level k+1. E.g., block 12.
– If level k cache is full, then some current block must be replaced (evicted). Which one is the “victim”?
9 3
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
Level k:
Level k+1:
1414
12
14
4*
4*12
12
0 1 2 3
Request12
4*4*12
14
12
Conflict miss
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 13
Intel Processors CacheSRAM
L1 L2
Instruction Data
Pentium II 1997 4-way, 32B, 128 sets
4-way,32B, 128 sets
4-way,32B, 4096 sets
Celeron A 1998 ,, ,, 4-way,32B, 1028 sets
Pentium III Coppermine
2000 ,, ,, 4-way,32B, 2048 sets
Pentium 4Willamette
2000 8-way 4-way,64B, 64 sets
8-way,64B, 512 sets
Pentium 4Northwood
2002 ,, ,, 8-way,64B, 1028 sets
http://www11.brinkster.com/bayup/dodownload.asp?f=37
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 14
Impact
• Associativitymore lines decrease the vulnerability on thrashing. It requires more tag-bits and control logic, which increases the hit-time (1-2 cycles for L1)
• Block sizelarger blocks exploit spatial locality (not temporal locality) to increase the hit-rate. Larger blocks increase the miss-penalty (5- 100 cycles for L1).
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 15
Design of DRAM cache– Line size?
• Large, since disk better at transferring large blocks– Associativity?
• High, to mimimize miss rate– Write through or write back?
• Write back, since can’t afford to perform small writes to disk
What would the impact of these choices be on:– miss rate
• Extremely low. << 1%– hit time
• Must match cache/DRAM performance– miss latency
• Very high. ~20ms– tag storage overhead
• Low, relative to block size
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 16
Locating an Object in a “Cache”• SRAM Cache
– Tag stored with cache line– Maps from cache block to memory blocks
• From cached to uncached form• Save a few bits by only storing tag
– No tag for block not in cache– Hardware retrieves information
• can quickly match against multiple tags
X
Object NameTag Data
D 243
X 17
J 105
•••
•••
0:
1:
N-1:
= X?
“Cache”
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 17
Locating an Object in “Cache”
Data
243
17
105
•••
0:
1:
N-1:
X
Object Name
Location
•••
D:
J:
X: 1
0
On Disk
“Cache”Page Table
• DRAM Cache– Each allocated page of virtual memory has entry in
page table– Mapping from virtual pages to physical pages
• From uncached form to cached form– Page table entry even if page not in memory
• Specifies disk address
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 18
A System with Virtual Memory
Address Translation: Hardware converts virtual addresses to physical addresses via lookup table (page table)
CPU
0:1:
N-1:
Memory
0:1:
P-1:
Page Table
Disk
VirtualAddresses
PhysicalAddresses
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 19
Page Faults (like “Cache Misses”)What if an object is on disk rather than in memory?
– Page table entry indicates virtual address not in memory– OS exception handler invoked to move data from disk into
memory• current process suspends, others can resume• OS has full control over placement, etc.
Memory
Page Table
Disk
VirtualAddresses
PhysicalAddresses
CPU
Memory
Page Table
Disk
VirtualAddresses
PhysicalAddresses
Before fault After fault
CPU
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 20
VM Address Translation:Hardware vs Software
VM Address Translation:Hardware vs Software
Processor
HardwareAddr TransMechanism
faulthandler
MainMemory
Secondary memorya
a'
page fault
physical addressOS performsthis transfer(only if miss)
virtual address part of the on-chipmemory mgmt unit (MMU)
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 21
virtual page number page offset virtual address
physical page number page offset physical address0p–1
address translation
pm–1
n–1 0p–1p
Page offset bits don’t change as a result of translation
VM Address Translation• Parameters
– P = 2p = page size (bytes). – N = 2n = Virtual address limit– M = 2m = Physical address limit
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 22
Address Translation via Page TableAddress Translation via Page Table
virtual page number (VPN) page offset
virtual address
physical page number (PPN) page offset
physical address
0p–1pm–1
n–1 0p–1ppage table base register
if valid=0then pagenot in memory
valid physical page number (PPN)access
VPN acts astable index
PTEA= PTE
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 23
CPUTrans-lation
Cache MainMemory
VA PA miss
hitdata
Integrating VM and Cache
• Most Caches “Physically Addressed”– Allows multiple processes to have blocks in cache at
same time– Cache doesn’t need to be concerned with protection
issues (Access rights checked as part of address translation)
• Perform Address Translation Before Cache – But this involves a memory access itself (of the PTE)– Of course, page table entries can also become cached
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 24
CPUTLB
LookupCache
MainMemory
VA PA miss
hit
data
Trans-lation
hit
miss
Speeding up Translation with a dedicated cache
• “Translation Lookaside Buffer” (TLB)– Small hardware cache in MMU– Maps virtual page to physical page numbers– Contains complete PTEs for small number of pages
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 25
What do you know?(ow133)cpuid
This system has a Genuine Intel(R) Pentium(R) 4 processorProcessor Family: F, Extended Family: 0, Model: 2, Stepping: 7
Pentium 4 core C1 (0.13 micron): core-speed 2 Ghz - 3.06 GHz (bus-speed 400/533 MHz)
Instruction TLB: 4K, 2M or 4M pages, fully associative, 128 entriesData TLB: 4K or 4M pages, fully associative, 64 entries1st-level data cache: 8K-bytes, 4-way set associative, sectored cache, 64-byte line sizeNo 2nd-level cache or, if processor contains a valid 2nd-level cache, no3rd-level cacheTrace cache: 12K-uops, 8-way set associative2nd-level cache: 512K-bytes, 8-way set associative, sectored cache, 64-byte line size
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 26
Matrix Multiplication Example• Major Cache Effects to Consider
– Total cache size• Exploit temporal locality and keep the working set small (e.g., by
using blocking)– Block size
• Exploit spatial locality
• Description:– Multiply N x N matrices– O(N3) total operations– Accesses
• N reads per source element• N values summed per destination
• Assumption – No so large that cache not big enough to hold multiple rows
/* ijk */for (i=0; i<n; i++) { for (j=0; j<n; j++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum; }}
/* ijk */for (i=0; i<n; i++) { for (j=0; j<n; j++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum; }}
Variable sumheld in register
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 27
Matrix Multiplication (ijk)/* ijk */for (i=0; i<n; i++) { for (j=0; j<n; j++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum; }}
/* ijk */for (i=0; i<n; i++) { for (j=0; j<n; j++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum; }}
A B C
(i,*)
(*,j)(i,j)
Inner loop:
Column-wise
Row-wise Fixed
• Misses per Inner Loop Iteration:A B C
0.25 1.0 0.0 = 1.25
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 28
Matrix Multiplication (jik)/* jik */for (j=0; j<n; j++) { for (i=0; i<n; i++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum }}
/* jik */for (j=0; j<n; j++) { for (i=0; i<n; i++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum }}
A B C
(i,*)
(*,j)(i,j)
Inner loop:
Row-wise Column-wise
Fixed
• Misses per Inner Loop Iteration:A B C
0.25 1.0 0.0 = 1.25
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 29
Matrix Multiplication (kij)/* kij */for (k=0; k<n; k++) { for (i=0; i<n; i++) { r = a[i][k]; for (j=0; j<n; j++) c[i][j] += r * b[k][j]; }}
/* kij */for (k=0; k<n; k++) { for (i=0; i<n; i++) { r = a[i][k]; for (j=0; j<n; j++) c[i][j] += r * b[k][j]; }}
A B C
(i,*)(i,k) (k,*)
Inner loop:
Row-wise Row-wiseFixed
• Misses per Inner Loop Iteration:A B C
0.0 0.25 0.25 = 0.5
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 30
Matrix Multiplication (ikj)/* ikj */for (i=0; i<n; i++) { for (k=0; k<n; k++) { r = a[i][k]; for (j=0; j<n; j++) c[i][j] += r * b[k][j]; }}
/* ikj */for (i=0; i<n; i++) { for (k=0; k<n; k++) { r = a[i][k]; for (j=0; j<n; j++) c[i][j] += r * b[k][j]; }}
A B C
(i,*)(i,k) (k,*)
Inner loop:
Row-wise Row-wiseFixed
• Misses per Inner Loop Iteration:A B C
0.0 0.25 0.25 = 0.5
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 31
Matrix Multiplication (jki)/* jki */for (j=0; j<n; j++) { for (k=0; k<n; k++) { r = b[k][j]; for (i=0; i<n; i++) c[i][j] += a[i][k] * r; }}
/* jki */for (j=0; j<n; j++) { for (k=0; k<n; k++) { r = b[k][j]; for (i=0; i<n; i++) c[i][j] += a[i][k] * r; }}
A B C
(*,j)(k,j)
Inner loop:
(*,k)
Column -wise
Column-wise
Fixed
• Misses per Inner Loop Iteration:A B C
1.0 0.0 1.0 = 2.0
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 32
Matrix Multiplication (kji)/* kji */for (k=0; k<n; k++) { for (j=0; j<n; j++) { r = b[k][j]; for (i=0; i<n; i++) c[i][j] += a[i][k] * r; }}
/* kji */for (k=0; k<n; k++) { for (j=0; j<n; j++) { r = b[k][j]; for (i=0; i<n; i++) c[i][j] += a[i][k] * r; }}
A B C
(*,j)(k,j)
Inner loop:
(*,k)
FixedColumn-wise
Column-wise
• Misses per Inner Loop Iteration:A B C
1.0 0.0 1.0 = 2.0
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 33
Summary of Matrix Multiplication
for (i=0; i<n; i++) {
for (j=0; j<n; j++) {
sum = 0.0;
for (k=0; k<n; k++)
sum += a[i][k] * b[k][j];
c[i][j] = sum;
}
}
ijk (& jik): • 2 loads, 0 stores• misses/iter = 1.25
for (k=0; k<n; k++) {
for (i=0; i<n; i++) {
r = a[i][k];
for (j=0; j<n; j++)
c[i][j] += r * b[k][j];
}
}
for (j=0; j<n; j++) {
for (k=0; k<n; k++) {
r = b[k][j];
for (i=0; i<n; i++)
c[i][j] += a[i][k] * r;
}
}
kij (& ikj): • 2 loads, 1 store• misses/iter = 0.5
jki (& kji): • 2 loads, 1 store• misses/iter = 2.0
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 34
Matrix Multiply Performance• Miss rates are helpful but not perfect predictors.
• Code scheduling matters, too.
0
10
20
30
40
50
60
25 50 75 100 125 150 175 200 225 250 275 300 325 350 375 400
Array size (n)
Cyc
les
/ite
rati
on
kjijkikijikjjikijk
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 35
Improving Temporal Locality by Blocking• Example: Blocked matrix multiplication
– do not mean “cache block”.– Instead, it mean a sub-block within the matrix.– Example: N = 8; sub-block size = 4
C11 = A11B11 + A12B21 C12 = A11B12 + A12B22
C21 = A21B11 + A22B21 C22 = A21B12 + A22B22
A11 A12
A21 A22
B11 B12
B21 B22
X = C11 C12
C21 C22
Key idea: Sub-blocks (i.e., Axy) can be treated just like scalars.
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 36
Blocked Matrix Multiply (bijk)for (jj=0; jj<n; jj+=bsize) { for (i=0; i<n; i++) for (j=jj; j < min(jj+bsize,n); j++) c[i][j] = 0.0; for (kk=0; kk<n; kk+=bsize) { for (i=0; i<n; i++) { for (j=jj; j < min(jj+bsize,n); j++) { sum = 0.0 for (k=kk; k < min(kk+bsize,n); k++) { sum += a[i][k] * b[k][j]; } c[i][j] += sum; } } }}
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 37
Blocked Matrix Multiply Analysis• Innermost loop pair multiplies
– a 1 X bsize sliver of A by – a bsize X bsize block of B and
accumulates into 1 X bsize sliver of C– Loop over i steps through n row slivers of A & C,
using same B
A B C
block reused n times in succession
row sliver accessedbsize times
Update successiveelements of sliver
i ikk
kk jjjj
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 38
Matrix Multiply Performance• Blocking (bijk and bikj) improves performance
by a factor of two over unblocked versions– relatively insensitive to array size.
0
10
20
30
40
50
60
Array size (n)
Cy
cle
s/it
era
tio
n
kji
jki
kij
ikj
jik
ijk
bijk (bsize = 25)
bikj (bsize = 25)
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 39
Concluding Observations• Programmer can optimize for cache performance
– How data structures are organized– How data are accessed
• Nested loop structure• Blocking is a general technique
• All systems favor “cache friendly code”– Getting absolute optimum performance is very
platform specific• Cache sizes, line sizes, associativities, etc.
– Can get most of the advantage with generic code• Keep working set reasonably small (temporal locality)• Use small strides (spatial locality)
University of Amsterdam
Computer Systems – cache characteristics Arnoud Visser 40
Assignment
• Optimize your ‘image-processing’ code further with blocking
• Make your code a function of the L1-size
Recommended