Memory Hierarchy
Jin-Soo Kim ([email protected]), Computer Systems Laboratory, Sungkyunkwan University
http://csl.skku.edu

Source: csl.skku.edu/uploads/SSE2030F10/16-memory.pdf (SSE2030: Introduction to Computer Systems, Fall 2010)




Page 1: Memory Hierarchy

Memory Hierarchy
Jin-Soo Kim ([email protected])
Computer Systems Laboratory
Sungkyunkwan University
http://csl.skku.edu

Page 2: The CPU-Memory Gap

The increasing gap between DRAM, disk, and CPU speeds

[Figure: access time in ns (log scale, 1 to 100,000,000) vs. year (1980 to 2000) for disk seek time, DRAM access time, SRAM access time, and CPU cycle time.]


Page 3: Principle of Locality (1)

Temporal locality
• Recently referenced items are likely to be referenced in the near future.

Spatial locality
• Items with nearby addresses tend to be referenced close together in time.


Page 4: Principle of Locality (2)

Locality example:

sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;

• Data
  – Reference array elements in succession: spatial locality
  – Reference sum each iteration: temporal locality
• Instructions
  – Reference instructions in sequence: spatial locality
  – Cycle through loop repeatedly: temporal locality
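As a complementary sketch (an illustration, not from the slides): the same distinction shows up when traversing a 2D array. Row-wise traversal touches consecutive addresses and exploits spatial locality; column-wise traversal strides through memory and loses it. The array name and sizes below are hypothetical.

#define ROWS 1024
#define COLS 1024

double a[ROWS][COLS];

/* Row-wise traversal: consecutive addresses, unit stride, good spatial locality. */
double sum_rowwise(void)
{
    double sum = 0.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += a[i][j];
    return sum;
}

/* Column-wise traversal: stride of COLS * sizeof(double) bytes between
   successive accesses, poor spatial locality on a row-major C array. */
double sum_colwise(void)
{
    double sum = 0.0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += a[i][j];
    return sum;
}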


Page 5: Principle of Locality (3)

How to exploit temporal locality?
• Speed up data accesses by caching data in faster storage.
• Caching in multiple levels forms a memory hierarchy.
  – The lower levels of the memory hierarchy tend to be slower, but larger and cheaper.

How to exploit spatial locality?
• Larger cache line size
  – Cache nearby data together.


Page 6: Memory Hierarchy

L0: Registers. CPU registers hold words retrieved from the L1 cache.
L1: On-chip L1 cache (SRAM). The L1 cache holds cache lines retrieved from the L2 cache.
L2: Off-chip L2 cache (SRAM). The L2 cache holds cache lines retrieved from main memory.
L3: Main memory (DRAM). Main memory holds disk blocks retrieved from local disks.
L4: Local secondary storage (local disks). Local disks hold files retrieved from disks on remote network servers.
L5: Remote secondary storage (distributed file systems, Web servers).

Moving toward L0: smaller, faster, and costlier (per byte) storage devices.
Moving toward L5: larger, slower, and cheaper (per byte) storage devices.

Page 7: h i M HM emory Hierarchycsl.skku.edu/uploads/SSE2030F10/16-memory.pdf · 2011. 2. 14. · jki kij ikj 20 Cycles/it 30 jik ijk bijk (bsize = 25) 0 10 5 05 5 bikj (bsize = 25) SSE2030:

Caching (1)Caching (1)Caching (1)Caching (1) Cache

• A smaller, faster storage• Improves the average access timep g• Exploits both temporal and spatial locality


Page 8: Caching (2)

Cache performance metrics
• Average memory access time = Thit + Rmiss * Tmiss (a worked example follows this list)
• Hit time (Thit)
  – Time to deliver a line in the cache to the processor
  – Includes the time to determine whether the line is in the cache
  – 1 clock cycle for L1, 3 ~ 8 clock cycles for L2
• Miss rate (Rmiss)
  – Fraction of memory references not found in the cache (misses / references)
  – 3 ~ 10% for L1, < 1% for L2
• Miss penalty (Tmiss)
  – Additional time required because of a miss
  – Typically 25 ~ 100 cycles for main memory
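As a quick worked example with illustrative numbers taken from the ranges above (not measured values): Thit = 1 cycle, Rmiss = 5%, and Tmiss = 100 cycles give an average access time of 1 + 0.05 * 100 = 6 cycles, so even a single-digit miss rate dominates the average. A one-line check:

#include <stdio.h>

int main(void)
{
    double t_hit  = 1.0;    /* L1 hit time in cycles, from the range above   */
    double r_miss = 0.05;   /* miss rate, within the quoted 3 ~ 10% for L1   */
    double t_miss = 100.0;  /* miss penalty to main memory, in cycles        */

    /* Average memory access time = Thit + Rmiss * Tmiss */
    printf("AMAT = %.1f cycles\n", t_hit + r_miss * t_miss);   /* prints 6.0 */
    return 0;
}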

Page 9: Caching (3)

Cache design issues
• Cache size
  – 8KB ~ 64KB for L1
• Cache line size
  – Typically, 32B or 64B for L1
• Lookup (see the address-splitting sketch after this list)
  – Fully associative
  – Set associative: 2-way, 4-way, 8-way, 16-way, etc.
  – Direct mapped
• Replacement
  – LRU (Least Recently Used)
  – FIFO (First-In First-Out), etc.

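As a small illustration of how these parameters interact (not from the slides): a cache lookup splits the address into a block offset, a set index, and a tag. The sketch below assumes a direct-mapped cache with 32KB capacity and 64B lines, values chosen from the ranges above.

#include <stdio.h>
#include <stdint.h>

#define CACHE_SIZE (32 * 1024)               /* assumed: within the 8KB ~ 64KB range   */
#define LINE_SIZE  64                        /* assumed: one of the typical line sizes */
#define NUM_SETS   (CACHE_SIZE / LINE_SIZE)  /* direct mapped: one line per set        */

int main(void)
{
    uint64_t addr = 0x12345678;              /* an arbitrary example address */

    uint64_t offset = addr % LINE_SIZE;                         /* byte within the line */
    uint64_t set    = (addr / LINE_SIZE) % NUM_SETS;            /* set it maps to       */
    uint64_t tag    = addr / ((uint64_t)LINE_SIZE * NUM_SETS);  /* identifies the line  */

    printf("addr 0x%llx -> tag 0x%llx, set %llu, offset %llu\n",
           (unsigned long long)addr, (unsigned long long)tag,
           (unsigned long long)set, (unsigned long long)offset);
    return 0;
}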

Page 10: Matrix Multiplication (1)

Description
• Multiply N x N matrices
• O(N^3) total operations
• Variable sum held in a register

/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Access pattern: A(i,*) x B(*,j) = C(i,j)

Assumptions
• Line size = 32B (big enough for 4 64-bit words)
• Matrix dimension (N) is very large


Page 11: Matrix Multiplication (2)

Matrix multiplication (ijk)

/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: A(i,*) row-wise, B(*,j) column-wise, C(i,j) fixed

Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0
(Row-wise access misses once per cache line, i.e., once every 4 elements; column-wise access misses on every element; sum stays in a register.)

Page 12: Matrix Multiplication (3)

Matrix multiplication (jik)

/* jik */
for (j=0; j<n; j++) {
    for (i=0; i<n; i++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: A(i,*) row-wise, B(*,j) column-wise, C(i,j) fixed

Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0

Page 13: Matrix Multiplication (4)

Matrix multiplication (kij)

/* kij */
for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: A(i,k) fixed, B(k,*) row-wise, C(i,*) row-wise

Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25

Page 14: Matrix Multiplication (5)

Matrix multiplication (ikj)

/* ikj */
for (i=0; i<n; i++) {
    for (k=0; k<n; k++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: A(i,k) fixed, B(k,*) row-wise, C(i,*) row-wise

Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25

Page 15: Matrix Multiplication (6)

Matrix multiplication (jki)

/* jki */
for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: A(*,k) column-wise, B(k,j) fixed, C(*,j) column-wise

Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0

Page 16: Matrix Multiplication (7)

Matrix multiplication (kji)

/* kji */
for (k=0; k<n; k++) {
    for (j=0; j<n; j++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: A(*,k) column-wise, B(k,j) fixed, C(*,j) column-wise

Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0

Page 17: Matrix Multiplication (8)

Summary

ijk (& jik): 2 loads, 0 stores; misses/iter = 1.25

for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Misses per iteration: A = 0.25, B = 1.0, C = 0.0

kij (& ikj): 2 loads, 1 store; misses/iter = 0.5

for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

Misses per iteration: A = 0.0, B = 0.25, C = 0.25

jki (& kji): 2 loads, 1 store; misses/iter = 2.0

for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Misses per iteration: A = 1.0, B = 0.0, C = 1.0
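To reproduce this comparison, the sketch below times one loop order with clock(); it is an illustration under assumed names (naive_ijk, N = 400), not the measurement setup behind the Pentium numbers on the next slide. The other orders can be timed by swapping in the corresponding loop nest.

#include <stdio.h>
#include <time.h>

#define N 400

static double a[N][N], b[N][N], c[N][N];

static void naive_ijk(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}

int main(void)
{
    /* Arbitrary, deterministic input data. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = (double)(i + j);
            b[i][j] = (double)(i - j);
        }

    clock_t start = clock();
    naive_ijk();
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

    /* Print one result element so the work is not optimized away. */
    printf("ijk, n = %d: %.3f s (c[0][0] = %.1f)\n", N, secs, c[0][0]);
    return 0;
}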

Page 18: Matrix Multiplication (9)

Performance on a Pentium

[Figure: cycles per inner-loop iteration (0 to 60) vs. array size n (25 to 400) for the six loop orders kji, jki, kij, ikj, jik, and ijk.]

Page 19: Blocked Matrix Multiplication (1)

Improving temporal locality by blocking
• "Block" means a sub-block within the matrix.
• Example: N = 8, sub-block size = 4

    [ A11 A12 ]   [ B11 B12 ]   [ C11 C12 ]
    [ A21 A22 ] x [ B21 B22 ] = [ C21 C22 ]

• Key idea: sub-blocks (i.e., Axy) can be treated just like scalars.

    C11 = A11*B11 + A12*B21    C12 = A11*B12 + A12*B22
    C21 = A21*B11 + A22*B21    C22 = A21*B12 + A22*B22

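The blocked code itself is not shown in the deck; the following is a minimal sketch of a bijk-style blocked multiply in the spirit of the bijk curve on the next slide, assuming bsize divides n evenly and that c has been zero-initialized.

/* Blocked (bijk) matrix multiply: iterate over bsize x bsize blocks of b
   (indexed by kk, jj) so the working set of each inner phase fits in the
   cache, and accumulate partial products into c. */
void bijk(int n, int bsize, double a[n][n], double b[n][n], double c[n][n])
{
    for (int kk = 0; kk < n; kk += bsize)
        for (int jj = 0; jj < n; jj += bsize)
            for (int i = 0; i < n; i++)
                for (int j = jj; j < jj + bsize; j++) {
                    double sum = c[i][j];
                    for (int k = kk; k < kk + bsize; k++)
                        sum += a[i][k] * b[k][j];
                    c[i][j] = sum;
                }
}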

Page 20: Blocked Matrix Multiplication (2)

Performance on a Pentium
• Blocking improves performance by a factor of two over the unblocked versions (ijk and jik).

[Figure: cycles per inner-loop iteration (0 to 60) vs. array size n (25 to 400) for kji, jki, kij, ikj, jik, ijk, bijk (bsize = 25), and bikj (bsize = 25).]

Page 21: Calculating Pi (1)

tan(π/4) = 1, so atan(1) = π/4.

4 ∫₀¹ 1/(1 + x²) dx = 4 (atan(1) - atan(0)) = 4 (π/4 - 0) = π

[Figure: the curves y = atan(x) and y = 1/(1 + x²) on the interval [0, 1].]


Page 22: Calculating Pi (2)

Pi: Sequential version

#define N       20000000
#define STEP    (1.0 / (double) N)

double compute ()
{
    int i;
    double x;
    double sum = 0.0;

    for (i = 0; i < N; i++) {
        x = (i + 0.5) * STEP;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * STEP;
}

π ≈ (1/N) * Σ_{i=0..N-1} 4 / (1 + ((i + 0.5)/N)²)

[Figure: midpoint-rule approximation of the area under the curve on [0, 1]; each rectangle has width 1/N and is evaluated at x = (i + 0.5)/N with height 1 / (1 + ((i + 0.5)/N)²).]

Page 23: Calculating Pi (3)

False sharing

struct thread_stat {
    int count;
    char pad[PADS];
} stat[8];

double compute ()
{
    int i;
    double x, sum = 0.0;

#pragma omp parallel for private(x) reduction(+:sum)
    for (i = 0; i < N; i++) {
        int id = omp_get_thread_num();
        stat[id].count++;
        x = (i + 0.5) * STEP;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * STEP;
}

• The per-thread counters stat[0].count, stat[1].count, ..., stat[7].count are laid out contiguously; without enough padding, counters updated by different threads fall on the same cache line, and every increment invalidates that line in the other cores' caches.

[Figure: Pi (OpenMP), N = 2e7. Speedup vs. the number of cores (1 to 8) for pad sizes PADS = 0, 28, 60, and 124 bytes.]
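One common remedy, sketched below as an illustration (this exact variant is not in the slides): keep the per-thread count in a local variable inside the parallel region and write it to stat[id] once, so the hot counter never shares a cache line across threads while it is being updated. The pad field is unnecessary in this version; other names follow the slide's code.

#include <omp.h>

#define N    20000000
#define STEP (1.0 / (double) N)

struct thread_stat { int count; } stat[8];

double compute (void)
{
    double sum = 0.0;

#pragma omp parallel
    {
        int id = omp_get_thread_num();
        int local_count = 0;              /* thread-private: no shared cache line */

#pragma omp for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            double x = (i + 0.5) * STEP;
            local_count++;
            sum += 4.0 / (1.0 + x * x);
        }

        stat[id].count = local_count;     /* one write per thread, after the loop */
    }
    return sum * STEP;
}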

Page 24: Observations

Programmers can optimize for cache performance.
• How data structures are organized
• How data are accessed
  – Nested loop structure
  – Blocking is a general technique

All systems favor "cache-friendly code".
• Getting absolute optimum performance is very platform specific.
  – Cache sizes, line sizes, associativities, etc.
• Can get most of the advantage with generic code.
  – Keep the working set reasonably small (temporal locality)
  – Use small strides (spatial locality)