Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
CSE351, Autumn 2019L19: Caches IV
Caches IVCSE 351 Autumn 2019
Instructor: Teaching Assistants:Justin Hsia Andrew Hu Antonio Castelli Cosmo Wang
Diya Joy Ivy Yu Kaelin LaundryMaurice Montag Melissa Birchfield Millicent LiSuraj Jagadeesh
http://xkcd.com/908/
CSE351, Autumn 2019L19: Caches IV
Administrivia
Lab 3 due tonight Late days count as normal: Sunday is 1 day, Monday is 2
No lecture Monday ā Veteranās Day! Lab 4 released over the long weekend Cache parameter puzzles and code optimizations
hw16 due Wednesday (11/13)
2
CSE351, Autumn 2019L19: Caches IV
What about writes?
Multiple copies of data may exist: multiple levels of cache and main memory
What to do on a writeāhit? Writeāthrough: write immediately to next level Writeāback: defer write to next level until line is evicted (replaced)
ā¢ Must track which cache lines have been modified (ādirty bitā)
What to do on a writeāmiss? Write allocate: (āfetch on writeā) load into cache, then execute the
writeāhit policyā¢ Good if more writes or reads to the location follow
Noāwrite allocate: (āwrite aroundā) just write immediately to next level
Typical caches: Writeāback + Write allocate, usually Writeāthrough + Noāwrite allocate, occasionally
3
CSE351, Autumn 2019L19: Caches IV
Writeāback, Write Allocate Example
4
Note: While unrealistic, this example assumes that all requests have offset 0 and are for a blockās worth of data.
0xBEEFCache: G01
Valid Dirty Tag Block Contents
Memory:0xCAFE
0xBEEF
F
G
āāā
āāā
āāā
BlockNum
There is only one set in this tiny cache, so the tag is the entire block number!
CSE351, Autumn 2019L19: Caches IV
Writeāback, Write Allocate Example
5
0xBEEFCache: G01
Valid Dirty Tag Block Contents
Memory:0xCAFE
0xBEEF
F
G
āāā
āāā
āāā
BlockNum
1) mov 0xFACE, F
Step 1: Bring F into cache
Write Miss!
CSE351, Autumn 2019L19: Caches IV
Writeāback, Write Allocate Example
6
0xCAFECache: F01
Valid Dirty Tag Block Contents
Memory:0xCAFE
0xBEEF
F
G
āāā
āāā
āāā
BlockNum
1) mov 0xFACE, F
Step 1: Bring F into cache
Step 2: Write 0xFACE to cache only and set the dirty bit
Write Miss
CSE351, Autumn 2019L19: Caches IV
Writeāback, Write Allocate Example
7
0xFACECache: F11
Valid Dirty Tag Block Contents
Memory:0xCAFE
0xBEEF
F
G
āāā
āāā
āāā
BlockNum
1) mov 0xFACE, F
Step 1: Bring F into cache
Step 2: Write 0xFACE to cache only and set the dirty bit
Write Miss
CSE351, Autumn 2019L19: Caches IV
Writeāback, Write Allocate Example
8
0xFACECache: F11
Valid Dirty Tag Block Contents
Memory:0xCAFE
0xBEEF
F
G
āāā
āāā
āāā
BlockNum
1) mov 0xFACE, F
Step: Write 0xFEED to cache only (and set the dirty bit)
2) mov 0xFEED, FWrite Miss Write Hit!
CSE351, Autumn 2019L19: Caches IV
Writeāback, Write Allocate Example
9
0xFEEDCache: F11
Valid Dirty Tag Block Contents
Memory:0xCAFE
0xBEEF
F
G
āāā
āāā
āāā
BlockNum
1) mov 0xFACE, F 2) mov 0xFEED, FWrite Miss Write Hit
CSE351, Autumn 2019L19: Caches IV
Writeāback, Write Allocate Example
10
0xFEEDCache: F11
Valid Dirty Tag Block Contents
Memory:0xCAFE
0xBEEF
F
G
āāā
āāā
āāā
BlockNum
1) mov 0xFACE, F 2) mov 0xFEED, F 3) mov G, %axWrite Miss Write Hit Read Miss!
Step 1: Write F back to memory since it is dirty
CSE351, Autumn 2019L19: Caches IV
Writeāback, Write Allocate Example
11
0xFEEDCache: F11
Valid Dirty Tag Block Contents
Memory:0xFEED
0xBEEF
F
G
āāā
āāā
āāā
BlockNum
1) mov 0xFACE, F 2) mov 0xFEED, F 3) mov G, %axWrite Miss Write Hit Read Miss
Step 1: Write F back to memory since it is dirty
Step 2: Bring G into the cache so that we can copy it into %ax
0 G 0xBEEF
CSE351, Autumn 2019L19: Caches IV
Cache Simulator
Want to play around with cache parameters and policies? Check out our cache simulator! https://courses.cs.washington.edu/courses/cse351/cachesim/
Way to use: Take advantage of āexplain modeā and navigable history to test your own hypotheses and answer your own questions Selfāguided Cache Sim Demo posted along with Section 7 Will be used in hw17 ā Lab 4 Preparation
12
CSE351, Autumn 2019L19: Caches IV
Polling Question
Which of the following cache statements is FALSE? Vote at http://PollEv.com/justinh
A. We can reduce compulsory misses by decreasing our block size
B. We can reduce conflict misses by increasing associativity
C. A writeāback cache will save time for code with good temporal locality on writes
D. A writeāthrough cache will always match data with the memory hierarchy level below it
E. Weāre lostā¦
CSE351, Autumn 2019L19: Caches IV
Optimizations for the Memory Hierarchy
Write code that has locality! Spatial: access data contiguously Temporal: make sure access to the same data is not too far apart in time
How can you achieve locality? Adjust memory accesses in code (software) to improve miss rate (MR)ā¢ Requires knowledge of both how caches work as well as your systemās parameters
Proper choice of algorithm Loop transformations
14
CSE351, Autumn 2019L19: Caches IV
Example: Matrix Multiplication
15
C
= Ć
A B
ai* b*j
cij
CSE351, Autumn 2019L19: Caches IV
Matrices in Memory
How do cache blocks fit into this scheme? Row major matrix in memory:
16
Cache blocks
COLUMN of matrix (blue) is spread among cache blocks shown in red
CSE351, Autumn 2019L19: Caches IV
NaĆÆve Matrix Multiply
# move along rows of Afor (i = 0; i < n; i++)
# move along columns of Bfor (j = 0; j < n; j++)# EACH k loop reads row of A, col of B# Also read & write c(i,j) n timesfor (k = 0; k < n; k++)c[i*n+j] += a[i*n+k] * b[k*n+j];
17
C(i,j) A(i,:)B(:,j)
C(i,j)
CSE351, Autumn 2019L19: Caches IV
Cache Miss Analysis (NaĆÆve)
Scenario Parameters: Square matrix ( ), elements are doubles Cache block size = 64 B = 8 doubles Cache size (much smaller than )
Each iteration:
misses
18
Ignoring matrix c
CSE351, Autumn 2019L19: Caches IV
Cache Miss Analysis (NaĆÆve)
Scenario Parameters: Square matrix ( ), elements are doubles Cache block size = 64 B = 8 doubles Cache size (much smaller than )
Each iteration:
misses
Afterwards in cache:(schematic)
198 doubles wide
Ignoring matrix c
CSE351, Autumn 2019L19: Caches IV
Cache Miss Analysis (NaĆÆve)
Scenario Parameters: Square matrix ( ), elements are doubles Cache block size = 64 B = 8 doubles Cache size (much smaller than )
Each iteration:
misses
Total misses: 2 3
20
Ignoring matrix c
once per product matrix element
CSE351, Autumn 2019L19: Caches IV
Linear Algebra to the Rescue (1)
Can get the same result of a matrix multiplication by splitting the matrices into smaller submatrices (matrix āblocksā)
For example, multiply two 4Ć4 matrices:
21
This is extra (nonātestable)
material
CSE351, Autumn 2019L19: Caches IV
Linear Algebra to the Rescue (2)
22
Matrices of size , split into 4 blocks of size ( = )
C22 = A21B12 + A22B22 + A23B32 + A24B42 = k A2k*Bk2
Multiplication operates on small āblockā matrices Choose size so that they fit in the cache! This technique called ācache blockingā
C11 C12 C13 C14
C21 C22 C23 C24
C31 C32 C43 C34
C41 C42 C43 C44
A11 A12 A13 A14
A21 A22 A23 A24
A31 A32 A33 A34
A41 A42 A43 A144
B11 B12 B13 B14
B21 B22 B23 B24
B32 B32 B33 B34
B41 B42 B43 B44
This is extra (nonātestable)
material
CSE351, Autumn 2019L19: Caches IV
Blocked Matrix Multiply
Blocked version of the naĆÆve algorithm:
= block matrix size (assume divides evenly)
23
# move by rxr BLOCKS nowfor (i = 0; i < n; i += r)for (j = 0; j < n; j += r)for (k = 0; k < n; k += r)# block matrix multiplicationfor (ib = i; ib < i+r; ib++)for (jb = j; jb < j+r; jb++)for (kb = k; kb < k+r; kb++)c[ib*n+jb] += a[ib*n+kb]*b[kb*n+jb];
CSE351, Autumn 2019L19: Caches IV
Cache Miss Analysis (Blocked)
Scenario Parameters: Cache block size = 64 B = 8 doubles Cache size (much smaller than ) Three blocks ( ) fit into cache: 2
Each block iteration: misses per block 2
24
š/š blocksš2elements per block, 8 per cache block
š/š blocks in row and column
Ignoring matrix c
CSE351, Autumn 2019L19: Caches IV
Cache Miss Analysis (Blocked)
Scenario Parameters: Cache block size = 64 B = 8 doubles Cache size (much smaller than ) Three blocks ( ) fit into cache: 2
Each block iteration: misses per block 2
Afterwards in cache(schematic)
25
š/š blocksš2elements per block, 8 per cache block
š/š blocks in row and column
Ignoring matrix c
CSE351, Autumn 2019L19: Caches IV
Cache Miss Analysis (Blocked)
Scenario Parameters: Cache block size = 64 B = 8 doubles Cache size (much smaller than ) Three blocks ( ) fit into cache: 2
Each block iteration: misses per block 2
Total misses: 3
26
š/š blocksš2elements per block, 8 per cache block
š/š blocks in row and column
Ignoring matrix c
CSE351, Autumn 2019L19: Caches IV
Matrix Multiply Visualization
Here = 100, = 32 KiB, = 30
27
NaĆÆve:
Blocked:
ā 1,020,000cache misses
ā 90,000cache misses
CSE351, Autumn 2019L19: Caches IV
CacheāFriendly Code
Programmer can optimize for cache performance How data structures are organized How data are accessed
ā¢ Nested loop structureā¢ Blocking is a general technique
All systems favor ācacheāfriendly codeā Getting absolute optimum performance is very platform specificā¢ Cache size, cache block size, associativity, etc.
Can get most of the advantage with generic codeā¢ Keep working set reasonably small (temporal locality)ā¢ Use small strides (spatial locality)ā¢ Focus on inner loop code
28
CSE351, Autumn 2019L19: Caches IV
The Memory Mountain
29
128m32m
8m2m
512k128k
32k0
2000
4000
6000
8000
10000
12000
14000
16000
s1s3
s5s7
s9s11
Size (bytes)
Read
throughp
ut (M
B/s)
Stride (x8 bytes)
Core i7 Haswell2.1 GHz32 KB L1 dācache256 KB L2 cache8 MB L3 cache64 B block size
Slopes of spatial locality
Ridges of temporal locality
L1
Mem
L2
L3
Aggressive prefetching
CSE351, Autumn 2019L19: Caches IV
Learning About Your Machine
Linux: lscpu ls /sys/devices/system/cpu/cpu0/cache/index0/
ā¢ Example: cat /sys/devices/system/cpu/cpu0/cache/index*/size
Windows: wmic memcache get <query> (all values in KB) Example: wmic memcache get MaxCacheSize
Modern processor specs: http://www.7ācpu.com/
30