Memory Hierarchy

Jin-Soo Kim ([email protected])
Computer Systems Laboratory
Sungkyunkwan University
http://csl.skku.edu
The CPU-Memory Gap

The increasing gap between DRAM, disk, and CPU speeds.

[Figure: access time in ns (log scale, 1 to 100,000,000) vs. year (1980 to 2000) for disk seek time, DRAM access time, SRAM access time, and CPU cycle time.]
Principle of Locality (1)

Temporal locality
• Recently referenced items are likely to be referenced in the near future.

Spatial locality
• Items with nearby addresses tend to be referenced close together in time.
Principle of Locality (2)

Locality example:

sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;

• Data
  – Reference array elements in succession: spatial locality
  – Reference sum each iteration: temporal locality
• Instructions
  – Reference instructions in sequence: spatial locality
  – Cycle through loop repeatedly: temporal locality
Principle of Locality (3)

How to exploit temporal locality?
• Speed up data accesses by caching data in faster storage.
• Caching in multiple levels forms a memory hierarchy.
  – The lower levels of the memory hierarchy tend to be slower, but larger and cheaper.

How to exploit spatial locality?
• Larger cache line size
  – Cache nearby data together.
Memory Hierarchy

[Figure: memory hierarchy pyramid. Storage devices are smaller, faster, and costlier (per byte) toward the top; larger, slower, and cheaper (per byte) toward the bottom.]

• L0: CPU registers hold words retrieved from the L1 cache.
• L1: on-chip L1 cache (SRAM) holds cache lines retrieved from the L2 cache.
• L2: off-chip L2 cache (SRAM) holds cache lines retrieved from main memory.
• L3: main memory (DRAM) holds disk blocks retrieved from local disks.
• L4: local secondary storage (local disks) holds files retrieved from disks on remote network servers.
• L5: remote secondary storage (distributed file systems, Web servers).
Caching (1)

Cache
• A smaller, faster storage
• Improves the average access time
• Exploits both temporal and spatial locality
Caching (2)

Cache performance metrics
• Average memory access time = Thit + Rmiss * Tmiss (see the worked example below)
• Hit time (Thit)
  – Time to deliver a line in the cache to the processor
  – Includes time to determine whether the line is in the cache
  – 1 clock cycle for L1, 3 ~ 8 clock cycles for L2
• Miss rate (Rmiss)
  – Fraction of memory references not found in cache (misses/references)
  – 3 ~ 10% for L1, < 1% for L2
• Miss penalty (Tmiss)
  – Additional time required because of a miss
  – Typically 25 ~ 100 cycles for main memory
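For example (illustrative values within the ranges above, not measurements): with Thit = 1 cycle, Rmiss = 5%, and Tmiss = 30 cycles, the average memory access time is 1 + 0.05 * 30 = 2.5 cycles. Even a small miss rate more than doubles the effective access time, which is why the loop orderings later in these slides matter so much.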
Caching (3)

Cache design issues
• Cache size
  – 8KB ~ 64KB for L1
• Cache line size
  – Typically, 32B or 64B for L1
• Lookup (address breakdown sketched below)
  – Fully associative
  – Set associative: 2-way, 4-way, 8-way, 16-way, etc.
  – Direct mapped
• Replacement
  – LRU (Least Recently Used)
  – FIFO (First-In First-Out), etc.
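To make the lookup schemes concrete, here is a minimal sketch of how a direct-mapped cache splits an address into tag, set index, and block offset. The geometry (32KB cache with 64B lines, hence 512 sets) is an assumed example within the ranges above, not a parameter from these slides:

#include <stdint.h>
#include <stdio.h>

/* Assumed geometry: 32KB direct-mapped cache with 64B lines -> 512 sets. */
#define LINE_SIZE   64
#define NUM_SETS    512
#define OFFSET_BITS 6    /* log2(LINE_SIZE) */
#define INDEX_BITS  9    /* log2(NUM_SETS)  */

int main(void)
{
    uint64_t addr   = 0x7ffe1234;                              /* example address  */
    uint64_t offset = addr & (LINE_SIZE - 1);                  /* byte within line */
    uint64_t index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1);  /* which set        */
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);      /* identifies line  */

    /* A hit means the tag stored in set 'index' matches 'tag'. A k-way
       set-associative cache keeps k tags per set and compares all of them;
       a fully associative cache compares against every stored tag. */
    printf("tag=%llx index=%llu offset=%llu\n",
           (unsigned long long)tag, (unsigned long long)index,
           (unsigned long long)offset);
    return 0;
}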
Matrix Multiplication (1)

Description
• Multiply N x N matrices
• O(N³) total operations
• Variable sum held in register

/* ijk */
for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    sum = 0.0;
    for (k = 0; k < n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

[Diagram: C(i,j) = A(i,*) X B(*,j)]

Assumptions
• Line size = 32B (big enough for 4 64-bit words)
• Matrix dimension (N) is very large
Matrix Multiplication (2)

Matrix multiplication (ijk)

/* ijk */
for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    sum = 0.0;
    for (k = 0; k < n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop: A(i,*) row-wise, B(*,j) column-wise, C(i,j) fixed

Misses per inner loop iteration (derived below):
  A: 0.25   B: 1.0   C: 0.0
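Where these numbers come from (a short derivation from the stated assumptions, 32B lines holding four 8-byte doubles):
• a[i][k] walks along a row with stride 1, so a new cache line is fetched only every fourth iteration: 8B / 32B = 0.25 misses/iteration.
• b[k][j] walks down a column with stride n; for very large N each access lands on a different line: 1.0 misses/iteration.
• c[i][j] is accumulated in the register variable sum, so the inner loop makes no memory references to C: 0.0.
The same reasoning yields the miss counts on the following slides.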
Matrix Multiplication (3)

Matrix multiplication (jik)

/* jik */
for (j = 0; j < n; j++) {
  for (i = 0; i < n; i++) {
    sum = 0.0;
    for (k = 0; k < n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop: A(i,*) row-wise, B(*,j) column-wise, C(i,j) fixed

Misses per inner loop iteration:
  A: 0.25   B: 1.0   C: 0.0
Matrix Multiplication (4)

Matrix multiplication (kij)

/* kij */
for (k = 0; k < n; k++) {
  for (i = 0; i < n; i++) {
    r = a[i][k];
    for (j = 0; j < n; j++)
      c[i][j] += r * b[k][j];
  }
}

Inner loop: A(i,k) fixed, B(k,*) row-wise, C(i,*) row-wise

Misses per inner loop iteration (derived below):
  A: 0.0   B: 0.25   C: 0.25
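Here the derivation flips (same assumptions as before): a[i][k] is held in the register r, so A costs nothing in the inner loop; b[k][j] and c[i][j] are both accessed row-wise with stride 1, so each costs 8B / 32B = 0.25 misses/iteration. Note that, unlike ijk, the inner loop now also stores to c[i][j] on every iteration (2 loads, 1 store).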
Matrix Multiplication (5)

Matrix multiplication (ikj)

/* ikj */
for (i = 0; i < n; i++) {
  for (k = 0; k < n; k++) {
    r = a[i][k];
    for (j = 0; j < n; j++)
      c[i][j] += r * b[k][j];
  }
}

Inner loop: A(i,k) fixed, B(k,*) row-wise, C(i,*) row-wise

Misses per inner loop iteration:
  A: 0.0   B: 0.25   C: 0.25
Matrix Multiplication (6)

Matrix multiplication (jki)

/* jki */
for (j = 0; j < n; j++) {
  for (k = 0; k < n; k++) {
    r = b[k][j];
    for (i = 0; i < n; i++)
      c[i][j] += a[i][k] * r;
  }
}

Inner loop: A(*,k) column-wise, B(k,j) fixed, C(*,j) column-wise

Misses per inner loop iteration:
  A: 1.0   B: 0.0   C: 1.0
Matrix Multiplication (7)

Matrix multiplication (kji)

/* kji */
for (k = 0; k < n; k++) {
  for (j = 0; j < n; j++) {
    r = b[k][j];
    for (i = 0; i < n; i++)
      c[i][j] += a[i][k] * r;
  }
}

Inner loop: A(*,k) column-wise, B(k,j) fixed, C(*,j) column-wise

Misses per inner loop iteration:
  A: 1.0   B: 0.0   C: 1.0
Matrix Multiplication (8)

Summary

ijk (& jik):
• 2 loads, 0 stores
• misses/iter = 1.25 (A: 0.25, B: 1.0, C: 0.0)

for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    sum = 0.0;
    for (k = 0; k < n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

kij (& ikj):
• 2 loads, 1 store
• misses/iter = 0.5 (A: 0.0, B: 0.25, C: 0.25)

for (k = 0; k < n; k++) {
  for (i = 0; i < n; i++) {
    r = a[i][k];
    for (j = 0; j < n; j++)
      c[i][j] += r * b[k][j];
  }
}

jki (& kji):
• 2 loads, 1 store
• misses/iter = 2.0 (A: 1.0, B: 0.0, C: 1.0)

for (j = 0; j < n; j++) {
  for (k = 0; k < n; k++) {
    r = b[k][j];
    for (i = 0; i < n; i++)
      c[i][j] += a[i][k] * r;
  }
}
Matrix Multiplication (9)

Performance on a Pentium

[Figure: cycles/iteration (0 to 60) vs. array size n (25 to 400) for the six loop orderings. The curves group by miss rate, consistent with the summary above: jki/kji highest, ijk/jik in the middle, kij/ikj lowest.]
Blocked Matrix Multiplication (1)

Improving temporal locality by blocking
• "Block" means a sub-block within the matrix.
• Example: N = 8, sub-block size = 4

  [ A11 A12 ]   [ B11 B12 ]   [ C11 C12 ]
  [ A21 A22 ] X [ B21 B22 ] = [ C21 C22 ]

• Key idea: sub-blocks (i.e., Axy) can be treated just like scalars (a code sketch follows).

  C11 = A11B11 + A12B21    C12 = A11B12 + A12B22
  C21 = A21B11 + A22B21    C22 = A21B12 + A22B22
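A minimal sketch of a blocked version (the bijk ordering named in the next slide's legend), assuming n is a multiple of bsize and using flat row-major arrays so the snippet is self-contained:

/* Blocked matrix multiplication (bijk): each bsize x bsize sub-block is
   treated like a scalar, so the blocks of A, B, and C touched by the
   inner loops can stay resident in the cache. Assumes n % bsize == 0
   and that c has been zero-initialized, since each (kk, jj) block pass
   accumulates into it. */
void bijk(const double *a, const double *b, double *c, int n, int bsize)
{
    for (int kk = 0; kk < n; kk += bsize)
        for (int jj = 0; jj < n; jj += bsize)
            for (int i = 0; i < n; i++)
                for (int j = jj; j < jj + bsize; j++) {
                    double sum = c[i*n + j];
                    for (int k = kk; k < kk + bsize; k++)
                        sum += a[i*n + k] * b[k*n + j];
                    c[i*n + j] = sum;
                }
}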
Blocked Matrix Multiplication (2)

Performance on a Pentium
• Blocking improves performance by a factor of two over the unblocked versions (ijk and jik).

[Figure: cycles/iteration (0 to 60) vs. array size n (25 to 400) for kji, jki, kij, ikj, jik, and ijk, plus the blocked variants bijk and bikj (bsize = 25); the blocked curves are the lowest.]
Calculating Pi (1)

tan(π/4) = 1, so atan(1) = π/4

4 ∫₀¹ 1/(1+x²) dx = 4 (atan(1) − atan(0)) = 4 (π/4 − 0) = π

[Figure: graphs of y = 1/(1+x²) and y = atan(x) over [0, 1]; the area under y = 1/(1+x²) equals π/4.]
Calculating Pi (2)

Pi: Sequential version

#define N    20000000
#define STEP (1.0 / (double) N)

double compute ()
{
  int i;
  double x;
  double sum = 0.0;

  for (i = 0; i < N; i++) {
    x = (i + 0.5) * STEP;
    sum += 4.0 / (1.0 + x * x);
  }
  return sum * STEP;
}

The loop evaluates the midpoint rule for the integral on the previous slide:

  π ≈ (1/N) · Σ from i = 0 to N−1 of 4 / (1 + ((i + 0.5)/N)²)
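A minimal driver (not part of the original slides) to exercise it:

#include <stdio.h>

int main(void)
{
    printf("pi = %.10f\n", compute());  /* converges to ~3.1415926536 */
    return 0;
}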
Calculating Pi (3)

False sharing

struct thread_stat {
  int count;
  char pad[PADS];
} stat[8];

double compute () {
  int i;
  double x, sum = 0.0;

#pragma omp parallel for private(x) reduction(+:sum)
  for (i = 0; i < N; i++) {
    int id = omp_get_thread_num();
    stat[id].count++;
    x = (i + 0.5) * STEP;
    sum += 4.0 / (1.0 + x * x);
  }
  return sum * STEP;
}

[Figure: speedup vs. the number of cores (1 to 8) for Pi (OpenMP), N = 2e7, with Pad = 0, 28, 60, and 124; larger pads scale better.]

[Diagram: stat[0], stat[1], ..., stat[7] laid out contiguously in memory, so with no padding several count fields share one cache line.]
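Why the padding helps: each thread increments only its own stat[id].count, but without padding the eight 4-byte counters all fit in one or two cache lines, which then ping-pong between cores on every increment. A minimal sketch of sizing the pad to a full line (the 64-byte line size is an assumption, not stated on the slide):

#define CACHE_LINE 64                      /* assumed cache line size */
#define PADS (CACHE_LINE - sizeof(int))    /* = 60, matching the Pad = 60 curve */

struct thread_stat {
    int  count;
    char pad[PADS];   /* sizeof(struct thread_stat) == 64: each counter
                         now occupies its own cache line */
} stat[8];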
Observations

Programmer can optimize for cache performance.
• How data structures are organized
• How data are accessed
  – Nested loop structure
  – Blocking is a general technique.

All systems favor "cache-friendly code."
• Getting absolute optimum performance is very platform-specific.
  – Cache sizes, line sizes, associativities, etc.
• Can get most of the advantage with generic code (see the sketch below):
  – Keep working set reasonably small (temporal locality).
  – Use small strides (spatial locality).
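For example (a minimal sketch, not from the slides), the stride of the inner loop is what decides spatial locality when summing a 2D array:

#define ROWS 1024
#define COLS 1024

/* Stride-1: visits memory in row-major order; a new cache line is
   needed only once per line's worth of elements. */
long sum_rows(int a[ROWS][COLS]) {
    long sum = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += a[i][j];
    return sum;
}

/* Stride-COLS: jumps a full row between consecutive accesses, so for a
   large array nearly every access touches a new cache line. */
long sum_cols(int a[ROWS][COLS]) {
    long sum = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += a[i][j];
    return sum;
}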