Memory Hierarchy
Jin-Soo Kim ([email protected]), Computer Systems Laboratory, Sungkyunkwan University
http://csl.skku.edu

Source: csl.skku.edu/uploads/SSE2030F10/16-memory.pdf (SSE2030: Introduction to Computer Systems, Fall 2010)




Page 1: Memory Hierarchy

Memory Hierarchy
Jin-Soo Kim ([email protected])
Computer Systems Laboratory
Sungkyunkwan University
http://csl.skku.edu

Page 2: The CPU-Memory Gap

The increasing gap between DRAM, disk, and CPU speeds

[Figure: access time in ns (log scale, 1 to 100,000,000) vs. year (1980 to 2000) for disk seek time, DRAM access time, SRAM access time, and CPU cycle time.]


Page 3: Principle of Locality (1)

Temporal locality
• Recently referenced items are likely to be referenced in the near future.

Spatial locality
• Items with nearby addresses tend to be referenced close together in time.


Page 4: Principle of Locality (2)

Locality example:

sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;

• Data
  – Reference array elements in succession: spatial locality
  – Reference sum each iteration: temporal locality
• Instructions
  – Reference instructions in sequence: spatial locality
  – Cycle through loop repeatedly: temporal locality
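As a complementary sketch (an illustration, not from the slides): the same distinction shows up when traversing a 2D array. Row-wise traversal touches consecutive addresses and exploits spatial locality; column-wise traversal strides through memory and loses it. The array name and sizes below are hypothetical.

#define ROWS 1024
#define COLS 1024

double a[ROWS][COLS];

/* Row-wise traversal: consecutive addresses, unit stride, good spatial locality. */
double sum_rowwise(void)
{
    double sum = 0.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += a[i][j];
    return sum;
}

/* Column-wise traversal: stride of COLS * sizeof(double) bytes between
   successive accesses, poor spatial locality on a row-major C array. */
double sum_colwise(void)
{
    double sum = 0.0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += a[i][j];
    return sum;
}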


Page 5: Principle of Locality (3)

How to exploit temporal locality?
• Speed up data accesses by caching data in faster storage.
• Caching in multiple levels forms a memory hierarchy.
  – The lower levels of the memory hierarchy tend to be slower, but larger and cheaper.

How to exploit spatial locality?
• Larger cache line size
  – Cache nearby data together.


Page 6: Memory Hierarchy

L0: Registers. CPU registers hold words retrieved from the L1 cache.
L1: On-chip L1 cache (SRAM). The L1 cache holds cache lines retrieved from the L2 cache.
L2: Off-chip L2 cache (SRAM). The L2 cache holds cache lines retrieved from main memory.
L3: Main memory (DRAM). Main memory holds disk blocks retrieved from local disks.
L4: Local secondary storage (local disks). Local disks hold files retrieved from disks on remote network servers.
L5: Remote secondary storage (distributed file systems, Web servers).

Moving toward L0: smaller, faster, and costlier (per byte) storage devices.
Moving toward L5: larger, slower, and cheaper (per byte) storage devices.

Page 7: h i M HM emory Hierarchycsl.skku.edu/uploads/SSE2030F10/16-memory.pdf · 2011. 2. 14. · jki kij ikj 20 Cycles/it 30 jik ijk bijk (bsize = 25) 0 10 5 05 5 bikj (bsize = 25) SSE2030:

Caching (1)Caching (1)Caching (1)Caching (1) Cache

• A smaller, faster storage• Improves the average access timep g• Exploits both temporal and spatial locality


Page 8: Caching (2)

Cache performance metrics
• Average memory access time = Thit + Rmiss * Tmiss (a worked example follows this list)
• Hit time (Thit)
  – Time to deliver a line in the cache to the processor
  – Includes the time to determine whether the line is in the cache
  – 1 clock cycle for L1, 3 ~ 8 clock cycles for L2
• Miss rate (Rmiss)
  – Fraction of memory references not found in the cache (misses / references)
  – 3 ~ 10% for L1, < 1% for L2
• Miss penalty (Tmiss)
  – Additional time required because of a miss
  – Typically 25 ~ 100 cycles for main memory
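As a quick worked example with illustrative numbers taken from the ranges above (not measured values): Thit = 1 cycle, Rmiss = 5%, and Tmiss = 100 cycles give an average access time of 1 + 0.05 * 100 = 6 cycles, so even a single-digit miss rate dominates the average. A one-line check:

#include <stdio.h>

int main(void)
{
    double t_hit  = 1.0;    /* L1 hit time in cycles, from the range above   */
    double r_miss = 0.05;   /* miss rate, within the quoted 3 ~ 10% for L1   */
    double t_miss = 100.0;  /* miss penalty to main memory, in cycles        */

    /* Average memory access time = Thit + Rmiss * Tmiss */
    printf("AMAT = %.1f cycles\n", t_hit + r_miss * t_miss);   /* prints 6.0 */
    return 0;
}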

Page 9: Caching (3)

Cache design issues
• Cache size
  – 8KB ~ 64KB for L1
• Cache line size
  – Typically, 32B or 64B for L1
• Lookup (see the address-splitting sketch after this list)
  – Fully associative
  – Set associative: 2-way, 4-way, 8-way, 16-way, etc.
  – Direct mapped
• Replacement
  – LRU (Least Recently Used)
  – FIFO (First-In First-Out), etc.

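As a small illustration of how these parameters interact (not from the slides): a cache lookup splits the address into a block offset, a set index, and a tag. The sketch below assumes a direct-mapped cache with 32KB capacity and 64B lines, values chosen from the ranges above.

#include <stdio.h>
#include <stdint.h>

#define CACHE_SIZE (32 * 1024)               /* assumed: within the 8KB ~ 64KB range   */
#define LINE_SIZE  64                        /* assumed: one of the typical line sizes */
#define NUM_SETS   (CACHE_SIZE / LINE_SIZE)  /* direct mapped: one line per set        */

int main(void)
{
    uint64_t addr = 0x12345678;              /* an arbitrary example address */

    uint64_t offset = addr % LINE_SIZE;                         /* byte within the line */
    uint64_t set    = (addr / LINE_SIZE) % NUM_SETS;            /* set it maps to       */
    uint64_t tag    = addr / ((uint64_t)LINE_SIZE * NUM_SETS);  /* identifies the line  */

    printf("addr 0x%llx -> tag 0x%llx, set %llu, offset %llu\n",
           (unsigned long long)addr, (unsigned long long)tag,
           (unsigned long long)set, (unsigned long long)offset);
    return 0;
}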

Page 10: Matrix Multiplication (1)

Description
• Multiply N x N matrices
• O(N^3) total operations
• Variable sum held in a register

/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Access pattern: A(i,*) x B(*,j) = C(i,j)

Assumptions
• Line size = 32B (big enough for 4 64-bit words)
• Matrix dimension (N) is very large


Page 11: Matrix Multiplication (2)

Matrix multiplication (ijk)

/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: A(i,*) row-wise, B(*,j) column-wise, C(i,j) fixed

Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0
(Row-wise access misses once per cache line, i.e., once every 4 elements; column-wise access misses on every element; sum stays in a register.)

Page 12: Matrix Multiplication (3)

Matrix multiplication (jik)

/* jik */
for (j=0; j<n; j++) {
    for (i=0; i<n; i++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: A(i,*) row-wise, B(*,j) column-wise, C(i,j) fixed

Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0

Page 13: Matrix Multiplication (4)

Matrix multiplication (kij)

/* kij */
for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: A(i,k) fixed, B(k,*) row-wise, C(i,*) row-wise

Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25

Page 14: Matrix Multiplication (5)

Matrix multiplication (ikj)

/* ikj */
for (i=0; i<n; i++) {
    for (k=0; k<n; k++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: A(i,k) fixed, B(k,*) row-wise, C(i,*) row-wise

Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25

Page 15: Matrix Multiplication (6)

Matrix multiplication (jki)

/* jki */
for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: A(*,k) column-wise, B(k,j) fixed, C(*,j) column-wise

Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0

Page 16: Matrix Multiplication (7)

Matrix multiplication (kji)

/* kji */
for (k=0; k<n; k++) {
    for (j=0; j<n; j++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: A(*,k) column-wise, B(k,j) fixed, C(*,j) column-wise

Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0

Page 17: Matrix Multiplication (8)

Summary

ijk (& jik): 2 loads, 0 stores; misses/iter = 1.25

for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Misses per iteration: A = 0.25, B = 1.0, C = 0.0

kij (& ikj): 2 loads, 1 store; misses/iter = 0.5

for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

Misses per iteration: A = 0.0, B = 0.25, C = 0.25

jki (& kji): 2 loads, 1 store; misses/iter = 2.0

for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Misses per iteration: A = 1.0, B = 0.0, C = 1.0
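To reproduce this comparison, the sketch below times one loop order with clock(); it is an illustration under assumed names (naive_ijk, N = 400), not the measurement setup behind the Pentium numbers on the next slide. The other orders can be timed by swapping in the corresponding loop nest.

#include <stdio.h>
#include <time.h>

#define N 400

static double a[N][N], b[N][N], c[N][N];

static void naive_ijk(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}

int main(void)
{
    /* Arbitrary, deterministic input data. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = (double)(i + j);
            b[i][j] = (double)(i - j);
        }

    clock_t start = clock();
    naive_ijk();
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

    /* Print one result element so the work is not optimized away. */
    printf("ijk, n = %d: %.3f s (c[0][0] = %.1f)\n", N, secs, c[0][0]);
    return 0;
}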

Page 18: Matrix Multiplication (9)

Performance on a Pentium

[Figure: cycles per inner-loop iteration (0 to 60) vs. array size n (25 to 400) for the six loop orders kji, jki, kij, ikj, jik, and ijk.]

Page 19: Blocked Matrix Multiplication (1)

Improving temporal locality by blocking
• "Block" means a sub-block within the matrix.
• Example: N = 8, sub-block size = 4

    [ A11 A12 ]   [ B11 B12 ]   [ C11 C12 ]
    [ A21 A22 ] x [ B21 B22 ] = [ C21 C22 ]

• Key idea: sub-blocks (i.e., Axy) can be treated just like scalars.

    C11 = A11*B11 + A12*B21    C12 = A11*B12 + A12*B22
    C21 = A21*B11 + A22*B21    C22 = A21*B12 + A22*B22

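The blocked code itself is not shown in the deck; the following is a minimal sketch of a bijk-style blocked multiply in the spirit of the bijk curve on the next slide, assuming bsize divides n evenly and that c has been zero-initialized.

/* Blocked (bijk) matrix multiply: iterate over bsize x bsize blocks of b
   (indexed by kk, jj) so the working set of each inner phase fits in the
   cache, and accumulate partial products into c. */
void bijk(int n, int bsize, double a[n][n], double b[n][n], double c[n][n])
{
    for (int kk = 0; kk < n; kk += bsize)
        for (int jj = 0; jj < n; jj += bsize)
            for (int i = 0; i < n; i++)
                for (int j = jj; j < jj + bsize; j++) {
                    double sum = c[i][j];
                    for (int k = kk; k < kk + bsize; k++)
                        sum += a[i][k] * b[k][j];
                    c[i][j] = sum;
                }
}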

Page 20: Blocked Matrix Multiplication (2)

Performance on a Pentium
• Blocking improves performance by a factor of two over the unblocked versions (ijk and jik).

[Figure: cycles per inner-loop iteration (0 to 60) vs. array size n (25 to 400) for kji, jki, kij, ikj, jik, ijk, bijk (bsize = 25), and bikj (bsize = 25).]

Page 21: Calculating Pi (1)

tan(π/4) = 1, so atan(1) = π/4.

4 ∫₀¹ 1/(1 + x²) dx = 4 (atan(1) - atan(0)) = 4 (π/4 - 0) = π

[Figure: the curves y = atan(x) and y = 1/(1 + x²) on the interval [0, 1].]


Page 22: Calculating Pi (2)

Pi: Sequential version

#define N       20000000
#define STEP    (1.0 / (double) N)

double compute ()
{
    int i;
    double x;
    double sum = 0.0;

    for (i = 0; i < N; i++) {
        x = (i + 0.5) * STEP;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * STEP;
}

π ≈ (1/N) * Σ_{i=0..N-1} 4 / (1 + ((i + 0.5)/N)²)

[Figure: midpoint-rule approximation of the area under the curve on [0, 1]; each rectangle has width 1/N and is evaluated at x = (i + 0.5)/N with height 1 / (1 + ((i + 0.5)/N)²).]

Page 23: Calculating Pi (3)

False sharing

struct thread_stat {
    int count;
    char pad[PADS];
} stat[8];

double compute ()
{
    int i;
    double x, sum = 0.0;

#pragma omp parallel for private(x) reduction(+:sum)
    for (i = 0; i < N; i++) {
        int id = omp_get_thread_num();
        stat[id].count++;
        x = (i + 0.5) * STEP;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * STEP;
}

• The per-thread counters stat[0].count, stat[1].count, ..., stat[7].count are laid out contiguously; without enough padding, counters updated by different threads fall on the same cache line, and every increment invalidates that line in the other cores' caches.

[Figure: Pi (OpenMP), N = 2e7. Speedup vs. the number of cores (1 to 8) for pad sizes PADS = 0, 28, 60, and 124 bytes.]
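One common remedy, sketched below as an illustration (this exact variant is not in the slides): keep the per-thread count in a local variable inside the parallel region and write it to stat[id] once, so the hot counter never shares a cache line across threads while it is being updated. The pad field is unnecessary in this version; other names follow the slide's code.

#include <omp.h>

#define N    20000000
#define STEP (1.0 / (double) N)

struct thread_stat { int count; } stat[8];

double compute (void)
{
    double sum = 0.0;

#pragma omp parallel
    {
        int id = omp_get_thread_num();
        int local_count = 0;              /* thread-private: no shared cache line */

#pragma omp for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            double x = (i + 0.5) * STEP;
            local_count++;
            sum += 4.0 / (1.0 + x * x);
        }

        stat[id].count = local_count;     /* one write per thread, after the loop */
    }
    return sum * STEP;
}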

Page 24: Observations

Programmers can optimize for cache performance.
• How data structures are organized
• How data are accessed
  – Nested loop structure
  – Blocking is a general technique

All systems favor "cache-friendly code".
• Getting absolute optimum performance is very platform specific.
  – Cache sizes, line sizes, associativities, etc.
• Can get most of the advantage with generic code.
  – Keep the working set reasonably small (temporal locality)
  – Use small strides (spatial locality)