Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout


Page 1: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Processor Architectures and Program Mapping

5KK70 TU/e

Henk Corporaal

Bart Mesman

Data Management Part c:

SCBD, MAA, and Data Layout

Page 2: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

H.C. Platform-based Design 5KK70 2

Part 3 overview

• Recap on design flow

• Platform dependent steps
  – SCBD: Storage Cycle Budget Distribution
  – MAA: Memory Allocation and Assignment
  – Data layout techniques for RAM
  – Data layout techniques for Caches

• Results

• Conclusions

Thanks to the IMEC DTSE people

Page 3: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

[Figure: design flow overview with the following stages]

• Concurrent OO spec
• Remove OO overhead
• Dynamic memory mgmt
• Task concurrency mgmt
• Physical memory mgmt
• Address optimization
• SW/HW co-design
• SW design flow
• HW design flow

Page 4: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

DM steps

C-in
• Preprocessing
• Dataflow transformations
• Loop transformations
• Data reuse
• Memory hierarchy layer assignment
• Cycle budget distribution
• Memory allocation and assignment
• Data layout
C-out
• Address optimization

Today: the platform dependent steps (cycle budget distribution, memory allocation and assignment, data layout)

Page 5: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Result of Memory hierarchy assignment for cavity detection

[Figure: the arrays image_in, gauss_x, gauss_xy, comp_edge and image_out mapped onto three memory layers: L3 (1 MB SDRAM), L2 (16 KB cache) and L1 (128 B register file). The copies and transfers are annotated with sizes and access counts such as N*M, M*3, 3*1, 3*3, 1*1, N*M*3, N*M*8 and 0.]

Page 6: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Data-reuse - cavity detection code

Code after reuse transformation (partly):

for (y=0; y<M+3; ++y) {
  for (x=0; x<N+2; ++x) {
    /* first in_pixel initialized */
    if (x==0 && y>=1 && y<=M-2)
      in_pixels[x%3] = image_in[x][y];

    /* copy rest of in_pixel's in row */
    if (x>=0 && x<=N-2 && y>=1 && y<=M-2)
      in_pixels[(x+1)%3] = image_in[x+1][y];

    if (x>=1 && x<=N-1-1 && y>=1 && y<=M-2) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; ++k)
        gauss_x_tmp += in_pixels[(x+k)%3]*Gauss[Abs(k)];
      gauss_x_lines[x][y%3] = foo(gauss_x_tmp);
    } else if (x<N && y<M)
      gauss_x_lines[x][y%3] = 0;
    /* remainder of the loop body not shown on the slide */

Page 7: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Storage Cycle Budget Distribution &

Memory Allocation and Assignment

Page 8: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Define the memory organization which can provide enough bandwidth with minimal cost

for (i = 1 to 100) { tmp += A[i] + B[i] + C[i]; A[i] = D[i]; }

for (j = 1 to 100) B[i] += D[j] + f(A[j]);

for (k = 1 to 100) tmp3 = max(tmp3, g(C[k] + B[j]));

[Figure: blocks A, B, C, D, A' and B' connected to the data-path; cycle budget of 500 cycles.]

Page 9: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Balancing memory bandwidth

Reduce the maximum number of loads/stores per cycle:

[Figure: two plots of required memory bandwidth over time, one labeled High and one labeled Low.]

Page 10: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Data management approach

Data-flow analysis of the loop body:

  C[i] = f(C[i-1]);
  B[i] = C[i];
  a = A[i];
  B[i+1] = B[i+1] + a;
  d = D[i]+1;
  A[a] = A[a] + d;
  D[i] = d;

Memory accesses: R(A), R(A), R(C), R(D), W(D), R(B), W(B), W(A), W(C), R(C), W(B)

Flow graph balancing: cycle budget = 6 (time slots 1 to 6)

[Figure: flow graph of the accesses and one of the many possible schedules.]

Page 11: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Data management approach

[Figure: the same loop body and memory accesses as on the previous slide. Flow graph balancing is followed by memory allocation and assignment, which maps the arrays A, B, C and D onto memories.]

Page 12: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Conflict cost calculation

[Figure: final schedule (A B, B C, C D, D A, A C, B), the conflict graph over A, B, C, D derived from it, and a valid memory configuration. Max clique = 3.]

Key issues:
• Number of conflicts
• Self conflicts
• Chromatic number (at least the size of the maximum clique)
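The step from a schedule to a conflict graph and a memory count can be sketched in C as below. This is a minimal sketch under stated assumptions, not the SCBD/MAA tool itself: the schedule in EXAMPLE_SCHEDULE is read off the extracted slide text as far as it allows, the type and function names are hypothetical, and the greedy coloring only gives an upper bound on the chromatic number (it happens to reach 3 for this graph).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NUM_ARRAYS 4      /* arrays A..D, indexed 0..3 */
#define MAX_PER_CYCLE 4   /* assumed limit on accesses per cycle */

typedef struct {
    int count;                  /* number of accesses in this cycle */
    int arrays[MAX_PER_CYCLE];  /* which array each access touches */
} Cycle;

typedef struct {
    bool edge[NUM_ARRAYS][NUM_ARRAYS];  /* two arrays accessed in one cycle */
    bool self_conflict[NUM_ARRAYS];     /* one array accessed twice in a cycle */
} ConflictGraph;

/* Assumed final schedule: A B | B C | C D | D A | A C | B */
const Cycle EXAMPLE_SCHEDULE[6] = {
    {2, {0, 1}}, {2, {1, 2}}, {2, {2, 3}},
    {2, {3, 0}}, {2, {0, 2}}, {1, {1}}
};

/* Every pair of accesses sharing a cycle becomes a conflict. */
void build_conflict_graph(const Cycle *schedule, int num_cycles, ConflictGraph *g)
{
    memset(g, 0, sizeof *g);
    for (int c = 0; c < num_cycles; ++c) {
        assert(schedule[c].count <= MAX_PER_CYCLE);
        for (int p = 0; p < schedule[c].count; ++p) {
            for (int q = p + 1; q < schedule[c].count; ++q) {
                int a = schedule[c].arrays[p];
                int b = schedule[c].arrays[q];
                if (a == b)
                    g->self_conflict[a] = true;  /* would need a multi-port memory */
                else
                    g->edge[a][b] = g->edge[b][a] = true;
            }
        }
    }
}

/* Greedy coloring: each color is one single-port memory. */
int assign_memories(const ConflictGraph *g, int memory_of[NUM_ARRAYS])
{
    int num_memories = 0;
    for (int a = 0; a < NUM_ARRAYS; ++a) {
        bool used[NUM_ARRAYS] = { false };
        for (int b = 0; b < a; ++b)
            if (g->edge[a][b])
                used[memory_of[b]] = true;
        int m = 0;
        while (used[m])
            ++m;
        memory_of[a] = m;
        if (m + 1 > num_memories)
            num_memories = m + 1;
    }
    return num_memories;
}
```

Calling build_conflict_graph on EXAMPLE_SCHEDULE and then assign_memories puts B and D in the same memory, because they are never accessed in the same cycle.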

Page 13: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Self conflict -> dual port memory

[Figure: final schedules, conflict graphs over A, B, C, D and valid memory configurations. When an array is accessed twice in the same cycle, its node gets a self conflict and the array needs a dual port memory.]

Page 14: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Chromatic number -> minimum # single port memories

[Figure: two final schedules with their conflict graphs over A, B, C, D and valid memory configurations; one has Chrom. Nr. = 3, the other Chrom. Nr. = 2.]

Page 15: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Low number of conflicts -> large assignment freedom

[Figure: two final schedules with their conflict graphs over A, B, C, D and valid memory configurations; the schedule with more conflicts allows one solution, the schedule with fewer conflicts allows multiple solutions.]

Page 16: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Conflict Directed Ordering is used for flat graph scheduling

[Figure: the accesses R(A), R(A), R(C), R(D), W(D), R(B), W(B), W(A), W(C), R(C), W(B) with their scheduling intervals over time slots 1 to 6.]

• Reduce intervals until all conflicts known
• Driven by cost of conflicts
• Constructive algorithm

Page 17: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Local optimization is not good for global optimization

for (i = 1 to 5)
  A[i] = A[i] + B[i] + C[i] + D[i];

for (j = 1 to 5) {
  B[j] = f(A[j]);
  tmp += g(C[j] + D[j]);
}

for (k = 1 to 10) {
  B[k] = g(D[k], A[k]);
  tmp2 += B[k],C[k];
}

[Figure: per-loop schedules and conflict graphs over A, B, C, D and their combination, once for local optimization and once for global optimization.]

Page 18: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Budget distribution has large impact on memory cost

for (i = 1 to 100)
  A[i] = A[i] + B[i] + C[i];

for (j = 1 to 100)
  B[j] = f(A[j]);

for (k = 1 to 100)
  C[k] = g(B[j]);

Cycle budget = 500 cycles, distributed over the three loops as 200-200-100, 200-100-200 or 100-200-200 (2 or 1 cycles per loop iteration).

[Figure: for each distribution, the schedule of the accesses R(A), R(B), R(C), W(A), W(B), W(C) per iteration and the resulting conflict graph over A, B, C.]

Page 19: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Decreasing basic block length until target cycle budget is met

for (i = 1 to 5)
  A[i] = A[i] + B[i] + C[i] + D[i];

for (j = 1 to 5) {
  B[j] = f(A[j]);
  D[j] = g(C[j]);
}

for (k = 1 to 10) {
  B[k] = g(B[k], A[k]);
  D[k] = C[k];
}

Target cycle budget: 75 cycles

[Figure: the loop bodies are shortened step by step. The three loops start at 25, 20 and 50 cycles (95 cycles in total). Shortening a body from 5 to 4 cycles (25 -> 20 and 50 -> 40) and from 4 to 3 cycles (40 -> 30) gives totals of 90, 80 and finally 70 cycles, which meets the target.]

Page 20: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Obtain more freedom by merging loops

Before:

for (i = 1 to 100) {
  b_tmp = B[i];
  A[2*i] = b_tmp + C[i];
  A[2*i+1] = b_tmp;
}
for (j = 1 to 100)
  D[j] = f(C[j+100]);

After merging:

for (i = 1 to 100) {
  b_tmp = B[i];
  A[2*i] = b_tmp + C[i];
  A[2*i+1] = b_tmp;
  D[i] = f(C[i+100]);
}

Cycle budget: 400 cycles

[Figure: schedules and conflict graphs over A, B, C, D for both versions.]

• More scheduling freedom
• Extension to different threads
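The loop merge on this slide can be checked with a small C sketch. It is an illustration only: f() is a hypothetical stand-in for the slide's function, the input values are arbitrary, and the arrays are indexed from 1 as in the slide's pseudocode. Merging is legal here because the second loop only reads C and only writes D, so it has no dependence on the first loop.

```c
#include <assert.h>
#include <string.h>

#define N 100               /* loop bound from the slide */
#define A_LEN (2 * N + 2)   /* A[2*i+1] is written up to index 2*N+1 */
#define C_LEN (2 * N + 1)   /* C[j+100] is read up to index 2*N */

/* Hypothetical stand-in for f(); any side-effect-free function works. */
static int f(int v) { return 3 * v + 1; }

/* Original version: two separate loops. */
void run_separate(const int *B, const int *C, int *A, int *D)
{
    assert(B && C && A && D);
    for (int i = 1; i <= N; ++i) {
        int b_tmp = B[i];
        A[2 * i] = b_tmp + C[i];
        A[2 * i + 1] = b_tmp;
    }
    for (int j = 1; j <= N; ++j)
        D[j] = f(C[j + N]);
}

/* Merged version: one loop body with all four array accesses. */
void run_merged(const int *B, const int *C, int *A, int *D)
{
    assert(B && C && A && D);
    for (int i = 1; i <= N; ++i) {
        int b_tmp = B[i];
        A[2 * i] = b_tmp + C[i];
        A[2 * i + 1] = b_tmp;
        D[i] = f(C[i + N]);
    }
}

/* Returns 1 when both versions produce identical A[] and D[]. */
int merged_matches_separate(void)
{
    int B[N + 1], C[C_LEN];
    int A1[A_LEN] = { 0 }, D1[N + 1] = { 0 };
    int A2[A_LEN] = { 0 }, D2[N + 1] = { 0 };
    for (int i = 0; i <= N; ++i)
        B[i] = 7 * i + 2;          /* arbitrary test data */
    for (int i = 0; i < C_LEN; ++i)
        C[i] = 5 * i - 3;          /* arbitrary test data */
    run_separate(B, C, A1, D1);
    run_merged(B, C, A2, D2);
    return memcmp(A1, A2, sizeof A1) == 0 && memcmp(D1, D2, sizeof D1) == 0;
}
```

In the merged body the accesses to A, B, C and D belong to one basic block, so a scheduler can spread them over the same cycles instead of balancing each loop separately.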

Page 21: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout


Memory allocation and assignment

Page 22: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Memory Allocation and Assignment Substeps

1. Memory allocation
2. Array-to-memory assignment
3. Port assignment and bus sharing

[Figure: arrays A, B, C and D shown for each substep.]

Page 23: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Influence of MAA

• Bitwidth
• Address range
• Nr. memories
• Nr. ports

• Assign arrays to memory
• Memory interconnect
• Minimize power & area

[Figure: memories MEMORY-1 to MEMORY-N, each characterized by bitwidth, (maximum) size and nr. ports (R/W/RW), holding arrays such as A, B and K, L. Example words 1001001110101001 / 100100111010XXXX and 1001XXXXXX / 0101110010, where X marks bit positions not used by the stored word.]
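How the assignment of arrays fixes the parameters of one memory can be sketched as follows. This is a simplified illustration, assuming that arrays in the same memory do not overlap in place (so sizes add up) and that the memory word is as wide as the widest array; the type and function names are hypothetical.

```c
#include <assert.h>

typedef struct {
    const char *name;
    int bitwidth;   /* bits per element */
    int words;      /* number of elements */
} Array;

typedef struct {
    int bitwidth;       /* widest array assigned to this memory */
    int words;          /* address range: sum of the array sizes */
    long wasted_bits;   /* the X positions: bits stored but never used */
} MemorySpec;

/* Derive bitwidth, size and unused bits of one memory from its arrays. */
MemorySpec size_memory(const Array *arrays, int count)
{
    assert(count > 0);
    MemorySpec m = { 0, 0, 0 };
    for (int k = 0; k < count; ++k) {
        if (arrays[k].bitwidth > m.bitwidth)
            m.bitwidth = arrays[k].bitwidth;
        m.words += arrays[k].words;
    }
    for (int k = 0; k < count; ++k)
        m.wasted_bits += (long)(m.bitwidth - arrays[k].bitwidth) * arrays[k].words;
    return m;
}
```

For example, a 16-bit array of 100 words and a 12-bit array of 50 words in one memory give a 16-bit memory of 150 words in which 4 bits of each of the 50 narrow words stay unused, which is why assignment influences both area and power.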

Page 24: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

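Example of bus sharing possibilities

Accesses per cycle: R(A) R(B); R(B) W(A); W(C) R(A); R(A) W(B); W(A) W(B); W(A) W(C)

[Figure: three alternative ways to connect the memories m1, m2 and m3, holding the arrays A, B and C, to buses; some memory/bus combinations are marked with X.]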

Page 25: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Decreasing cycle budget limits freedom and raises cost

Example: A[i] = A[i] + B[i]; A[i+1] = C[i];

The accesses R(A), R(B), W(A), W(A), R(C) are scheduled in 3 cycles (minimum budget) or in 5 cycles (fully sequential).

                              Minimum budget     Fully sequential
Needed bandwidth              Large              Low
Nr. memories                  Many obligatory    Free
Self conflicts (mult. port)   Could be many      None
Assignment                    Limited freedom    Full freedom

[Figure: cost versus cycle budget, with conflict graphs (Chrom Nr = 3 at the tight end) and a region of uninteresting alternatives.]

Page 26: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Resulting Pareto curve for DAB synchro application

[Figure: energy cost versus cycle budget, from the minimum budget to the sequential budget. Annotations along the curve: conflict graph changed, but no impact on assignment; conflict graph changed, change in assignment; self conflict, forcing dual port mem.]

Page 27: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout


Example conflict graph for cavity detection

Page 28: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

MAA result

[Figure: MAA result reporting power and on-chip area; the values are not recoverable from the extracted text.]

Page 29: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout


Data layout

how to put data into memory

Page 30: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

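Memory data layout for custom and cache architectures

[Figure: two processing elements (PE) with a cache holding the copies A' and B', and two memories: MEM1 with arrays A, B, C and MEM2 with arrays F, G, H. Before data layout the positions of the arrays inside the cache and the memories are open (marked "?"); after data layout each array has a fixed place.]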

Page 31: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Intra-array in-place mapping reduces size of one array

for (i=1; i<5; i++)
  for (j=0; j<5; j++)
    a[i][j] = f(a[i-1][j]);

[Figure: the i-j iteration space with the window of elements between row i-1 and row i, and the addresses of a over time; the window is bounded by the max nr. of live elements.]

Page 32: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Two-phase mapping of array elements onto addresses

[Figure: the variable domains of arrays A, B and C are first mapped onto abstract addresses a_A, a_B and a_C (storage order), which are then mapped onto real addresses (allocation).]

Page 33: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Exploration of storage orders for 2-dimensional array

[Figure: a variable domain with dimensions a1 (0 to 1) and a2 (0 to 2) is mapped onto memory addresses a.]

Possible storage orders:

a=3a1+a2
a=3(1-a1)+a2
a=3a1+(2-a2)
a=3(1-a1)+(2-a2)
a=2a2+a1
a=2a2+(1-a1)
a=2(2-a2)+a1
a=2(2-a2)+(1-a1)
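The eight storage orders can be generated from three choices: which dimension is the outer one, and whether each dimension is traversed forwards or backwards. The C sketch below enumerates them for the 2 x 3 domain and checks that every order maps the six elements onto six distinct addresses. The struct and function names are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define N1 2   /* a1 ranges over 0..1 */
#define N2 3   /* a2 ranges over 0..2 */

typedef struct {
    bool a1_is_outer;  /* true: a = 3*a1' + a2', false: a = 2*a2' + a1' */
    bool reverse_a1;   /* true: a1' = 1 - a1 */
    bool reverse_a2;   /* true: a2' = 2 - a2 */
} StorageOrder;

/* Address of element (a1, a2) under the given storage order. */
int address(StorageOrder o, int a1, int a2)
{
    assert(a1 >= 0 && a1 < N1 && a2 >= 0 && a2 < N2);
    int d1 = o.reverse_a1 ? (N1 - 1 - a1) : a1;
    int d2 = o.reverse_a2 ? (N2 - 1 - a2) : a2;
    return o.a1_is_outer ? N2 * d1 + d2 : N1 * d2 + d1;
}

/* True when the order maps the six elements onto addresses 0..5 one-to-one. */
bool is_bijective(StorageOrder o)
{
    bool seen[N1 * N2] = { false };
    for (int a1 = 0; a1 < N1; ++a1) {
        for (int a2 = 0; a2 < N2; ++a2) {
            int a = address(o, a1, a2);
            if (a < 0 || a >= N1 * N2 || seen[a])
                return false;
            seen[a] = true;
        }
    }
    return true;
}

/* Counts how many of the 2 * 2 * 2 candidate orders are valid. */
int count_valid_orders(void)
{
    int valid = 0;
    for (int outer = 0; outer < 2; ++outer)
        for (int r1 = 0; r1 < 2; ++r1)
            for (int r2 = 0; r2 < 2; ++r2) {
                StorageOrder o = { outer != 0, r1 != 0, r2 != 0 };
                if (is_bijective(o))
                    ++valid;
            }
    return valid;
}
```

All eight candidates are valid layouts for the full array; they differ in which elements end up close together, which matters once a window is applied.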

Page 34: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Chosen storage order determines window size

for (i=1; i<5; i++)
  for (j=0; j<5; j++)
    a[i][j] = f(a[i-1][j]);

row-major ordering: a=5i+j

for (i=1; i<5; i++)
  for (j=0; j<5; j++)
    a[5*i+j] = f(a[5*i+j-5]);

Highest live address: 5*i+j
Lowest live address: 5*i+j-5
Difference + 1 = Window: 6

column-major ordering: a=5j+i

for (i=1; i<5; i++)
  for (j=0; j<5; j++)
    a[5*j+i] = f(a[5*j+i-1]);

Highest live address: 5*4+i-1
Lowest live address: 5*0+i-1
Difference + 1 = Window: 21
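With the row-major order, the window of 6 means that the 25-element array can be folded into 6 locations by taking every address modulo 6. The following C sketch compares the full array against the windowed one. It is an illustration under assumptions: f() is a hypothetical stand-in, and row 0 is supplied as input because the slide does not show how it is initialized.

```c
#include <assert.h>

#define ROWS 5
#define COLS 5
#define WINDOW 6   /* row-major window from the slide */

/* Hypothetical stand-in for f(). */
static int f(int v) { return 2 * v + 1; }

/* Reference: full array of ROWS*COLS elements, row-major (a = 5i+j). */
void run_full(const int first_row[COLS], int last_row[COLS])
{
    int a[ROWS * COLS];
    for (int j = 0; j < COLS; ++j)
        a[j] = first_row[j];
    for (int i = 1; i < ROWS; ++i)
        for (int j = 0; j < COLS; ++j)
            a[COLS * i + j] = f(a[COLS * i + j - COLS]);
    for (int j = 0; j < COLS; ++j)
        last_row[j] = a[COLS * (ROWS - 1) + j];
}

/* Same computation in WINDOW locations: every address is taken modulo WINDOW. */
void run_windowed(const int first_row[COLS], int last_row[COLS])
{
    assert(WINDOW >= COLS + 1);  /* highest minus lowest live address, plus 1 */
    int a[WINDOW];
    for (int j = 0; j < COLS; ++j)
        a[j % WINDOW] = first_row[j];
    for (int i = 1; i < ROWS; ++i)
        for (int j = 0; j < COLS; ++j)
            a[(COLS * i + j) % WINDOW] = f(a[(COLS * i + j - COLS) % WINDOW]);
    for (int j = 0; j < COLS; ++j)
        last_row[j] = a[(COLS * (ROWS - 1) + j) % WINDOW];
}
```

The write to address 5i+j never hits a live element, because the live addresses 5i+j-5 to 5i+j-1 and the written address are six consecutive numbers and therefore distinct modulo 6. With the column-major order the same trick would need 21 locations.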

Page 35: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

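Static allocation: no in-place mapping

$a_{real}^{x} = a_{abstr}^{x} + c^{x}$

[Figure: address versus time for arrays A, B, C, D and E with abstract addresses a_A to a_E. Each array x gets its own fixed address range, so the memory size is the sum of the array sizes.]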

Page 36: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Windowed Allocation: intra-array in-place mapping

$a_{real}^{x} = (a_{abstr}^{x} \bmod W^{x}) + c^{x}$

[Figure: two address versus time diagrams for arrays A, B, C, D and E with the resulting memory size: "Static, windowed" and "Dynamic, windowed".]

Page 37: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

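Dynamic allocation: inter-array in-place mapping

$a_{real}^{x} = a_{abstr}^{x} + c^{x}$

[Figure: address versus time for arrays A, B, C, D and E with abstract addresses a_A to a_E. Arrays whose lifetimes do not overlap share address ranges, which reduces the memory size.]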

Page 38: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Dynamic allocation strategy with common window

$a_{real}^{x} = (a_{abstr}^{x} + c^{x}) \bmod W$

[Figure: address versus time for arrays A, B, C, D and E sharing one window ("Dynamic, common window"), with the resulting memory size.]

Page 39: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Expressing memory data layout in source code

Example: array of 10x20 elements

Before:

bit8 B[10][20];
bit6 A[30];
for (x=0; x<10; ++x)
  for (y=0; y<20; ++y) {
    … = A[3*x-y];
    B[x][y] = …;
  }

After:

bit8 memory[334];
bit8* B = (bit8*)&memory[134];
bit6* A = (bit6*)&memory[120];
for (x=0; x<10; ++x)
  for (y=0; y<20; ++y) {
    … = A[3*x-y];
    B[(x*20+y*2)%78] = …;
  }

A: offset 120, no window
B: storage order [20, 2], offset 134, window 78

Page 40: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Example of memory data layout for storage size reduction

int x[W], y[W];
for (i1=0; i1 < W; i1++)
  x[i1] = getInput();
for (i2=0; i2 < W; i2++) {
  sum = 0;
  for (di2=-N; di2 <=N; di2++) {
    sum += c[N+di2] * x[wrap(i2+di2,W)];
  }
  y[i2] = sum;
}
for (i3=0; i3 < W; i3++)
  putOutput(y[i3]);

Page 41: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Occupied address-time domain of x[] and y[]

[Figure: addresses over time for the three loops (i1 = 0 to W-1, i2 = 0 to W-1 with i2 = N marked, i3 = 0 to W-1). x[] occupies x[0] to x[W-1] and y[] occupies y[0] to y[W-1], giving W+W locations in total.]

Page 42: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Optimized source code after memory data layout

int mem1[N+W];
for (i1=0; i1 < W; i1++)
  mem1[N+i1] = getInput();
for (i2=0; i2 < W; i2++) {
  sum = 0;
  for (di2=-N; di2 <=N; di2++) {
    sum += c[N+di2] * mem1[N+wrap(i2+di2,W)];
  }
  mem1[i2] = sum;
}
for (i3=0; i3 < W; i3++)
  putOutput(mem1[i3]);
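A small C sketch can confirm that the shared buffer of N+W locations produces the same output as the two arrays of W locations each. Several things here are assumptions made for the illustration: getInput() and putOutput() are replaced by plain arrays, the coefficients c[] are arbitrary, and wrap() is assumed to clamp the index to the range 0..W-1, since the slides do not define it.

```c
#include <assert.h>

#define W 16   /* number of samples (arbitrary for this sketch) */
#define N 2    /* filter half-width (arbitrary for this sketch) */

/* Hypothetical filter coefficients. */
static const int c[2 * N + 1] = { 1, 2, 3, 2, 1 };

/* ASSUMPTION: wrap() clamps the index at the borders. */
static int wrap(int i, int w)
{
    if (i < 0) return 0;
    if (i >= w) return w - 1;
    return i;
}

/* Original version: separate x[] and y[], 2*W locations. */
void filter_two_arrays(const int input[W], int output[W])
{
    int x[W], y[W];
    for (int i1 = 0; i1 < W; i1++)
        x[i1] = input[i1];
    for (int i2 = 0; i2 < W; i2++) {
        int sum = 0;
        for (int di2 = -N; di2 <= N; di2++)
            sum += c[N + di2] * x[wrap(i2 + di2, W)];
        y[i2] = sum;
    }
    for (int i3 = 0; i3 < W; i3++)
        output[i3] = y[i3];
}

/* In-place version: x[] at offset N, y[] at offset 0, N+W locations. */
void filter_in_place(const int input[W], int output[W])
{
    assert(N < W);
    int mem1[N + W];
    for (int i1 = 0; i1 < W; i1++)
        mem1[N + i1] = input[i1];
    for (int i2 = 0; i2 < W; i2++) {
        int sum = 0;
        for (int di2 = -N; di2 <= N; di2++)
            sum += c[N + di2] * mem1[N + wrap(i2 + di2, W)];
        mem1[i2] = sum;   /* overwrites x[i2-N], which is no longer needed */
    }
    for (int i3 = 0; i3 < W; i3++)
        output[i3] = mem1[i3];
}
```

The overlap is safe under the clamping assumption because iteration i2 and all later iterations read only locations at or above max(N, i2), while the results written so far sit below i2. If wrap() were a modulo wrap-around instead, the first N inputs would be needed again in the last iterations and this overlap would no longer be safe.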

Page 43: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

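Optimized OAT domain after memory data layout

[Figure: addresses over time for the three loops (i1 = 0 to W-1, i2 = 0 to W-1 with i2 = N marked, i3 = 0 to W-1). x[] and y[] share mem1[0] to mem1[W+N-1], with mem1[N] and mem1[W-1] marked, giving N+W locations in total.]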

Page 44: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

In-place mapping for cavity detection example

• Input image is partly consumed by the time first results for output image are ready

[Figure: index versus time for Image_in and for Image_out, and address versus time for the shared array Image.]

Page 45: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

In-place - cavity detection code

Separate input and output arrays:

for (y=0; y<=M+3; ++y) {
  for (x=0; x<N+5; ++x) {
    image_out[x-5][y-3] = …;
    /* code removed */
    … = image_in[x+1][y];
  }
}

In-place, with one shared array:

for (y=0; y<=M+3; ++y) {
  for (x=0; x<N+5; ++x) {
    image[x-5][y-3] = …;
    /* code removed */
    … = image[x+1][y];
  }
}

Page 46: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Cavity detection summary

[Figure: bar chart (scale 0 to 600) comparing accesses, size and cycles.]

Overall result:

• Local accesses reduced by factor 3
• Memory size reduced by factor 5
• Power reduced by factor 5
• System bus load reduced by factor 12
• Performance worsened by factor 6

Page 47: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Data layout for caches

• Caches are hardware controlled
• Therefore: no explicit copy code needed!
• What can we do?

Page 48: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

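Cache principles

[Figure: the CPU issues a p-bit address that is split into a tag (p-k-m bits), an index (k bits) and a byte address (m bits). The cache has 2^k lines; each cache line / block holds a (p-k-m)-bit tag and 2^m bytes of data. A tag comparison decides "Hit?"; misses go to main memory.]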

Page 49: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Cache Architecture Fundamentals

• Block placement
  – Where in the cache will a new block be placed?
• Block identification
  – How is a block found in the cache?
• Block replacement policy
  – Which block is evicted from the cache?
• Updating policy
  – How is a block written from cache to memory?

Page 50: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Block placement policies

[Figure: memory blocks 0 to 15 and a cache with lines 0 to 7. Mapping? Fully associative (one-to-many): a block can go anywhere in cache. Direct mapped (one-to-one): a block can go to one line only ("Here only!").]

Page 51: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

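Direct mapped cache

[Figure: a 32-bit address (bit positions 31 to 0) is split into a 20-bit tag (bits 31 to 12), a 10-bit index (bits 11 to 2) and a 2-bit byte offset (bits 1 to 0). The index selects one of 1024 entries (0 to 1023), each with a valid bit, a 20-bit tag and 32 bits of data. Comparing the stored tag with the address tag produces Hit; the entry delivers Data.]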

Page 52: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

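Direct mapped cache: larger blocks

• Taking advantage of spatial locality:

[Figure: a 32-bit address (bit positions 31 to 0) is split into a 16-bit tag (bits 31 to 16), a 12-bit index (bits 15 to 4), a 2-bit block offset (bits 3 to 2) and a 2-bit byte offset (bits 1 to 0). The cache has 4K entries, each with a valid bit, a 16-bit tag and 128 bits of data. A multiplexer selects one of the four 32-bit words using the block offset; the outputs are Hit and Data.]

Address (bit positions)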
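The address split used by the two direct mapped caches can be written down in a few lines of C. This is a sketch for illustration: the type and constant names are hypothetical, and the two geometries encode the bit counts from the slides (a 2-bit offset with a 10-bit index, and a 4-bit combined block and byte offset with a 12-bit index).

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    unsigned offset_bits;  /* byte offset plus block offset inside a line */
    unsigned index_bits;   /* selects the cache entry */
} CacheGeometry;

typedef struct {
    uint32_t tag;
    uint32_t index;
    uint32_t offset;
} AddressFields;

/* 1024 entries of one 32-bit word: tag 20, index 10, offset 2 bits. */
static const CacheGeometry ONE_WORD_BLOCKS = { 2, 10 };
/* 4K entries of four 32-bit words: tag 16, index 12, offset 4 bits. */
static const CacheGeometry FOUR_WORD_BLOCKS = { 4, 12 };

/* Split a 32-bit address into tag, index and offset. */
AddressFields split_address(CacheGeometry g, uint32_t addr)
{
    assert(g.offset_bits + g.index_bits < 32);
    AddressFields f;
    f.offset = addr & ((1u << g.offset_bits) - 1u);
    f.index = (addr >> g.offset_bits) & ((1u << g.index_bits) - 1u);
    f.tag = addr >> (g.offset_bits + g.index_bits);
    return f;
}
```

Two addresses hit the same cache entry exactly when their index fields are equal; they are distinguished by the tag, which is what the stored tag comparison checks.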

Page 53: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Performance

• Increasing the block size tends to decrease miss rate:

[Figure: miss rate (0% to 40%) versus block size (4, 16, 64 and 256 bytes), with one curve per cache size: 1 KB, 8 KB, 16 KB, 64 KB and 256 KB.]

Page 54: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

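4-way associative cache

[Figure: a 32-bit address (bit positions 31 to 0) is split into a 22-bit tag, an 8-bit index and a 2-bit byte offset. The index selects one of 256 sets (0 to 255); each set has four ways with a valid bit, a tag and 32 bits of data. Four tag comparisons drive a 4-to-1 multiplexor; the outputs are Hit and Data.]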

Page 55: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Performance

[Figure: miss rate (0% to 15%) versus associativity (one-way, two-way, four-way, eight-way), with one curve per cache size from 1 KB to 128 KB.]

Page 56: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Cache Fundamentals

The "Three C's"

• Compulsory Misses
  – 1st access to a block: never in the cache
• Capacity Misses
  – Cache cannot contain all the blocks
  – Blocks are discarded and retrieved later
  – Avoided by increasing cache size
• Conflict Misses
  – Too many blocks mapped to same set
  – Avoided by increasing associativity

Page 57: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Compulsory miss example

for (i=0; i<10; i++)
  A[i] = f(B[i]);

Cache (@ i=2): A[0], B[0], A[1], B[1], A[2], B[2]

Cache (@ i=3):
• B[3], A[3] required
• B[3] never loaded before: loaded into cache
• A[3] never loaded before: allocates new line

Page 58: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Capacity miss example

for (i=0; i<N; i++)
  A[i] = B[i+3]+B[i];

Cache size: 8 blocks of 1 word, fully associative

Cache contents after each iteration:

i=0: B[3] B[0] A[0]
i=1: B[3] B[0] A[0] B[4] B[1] A[1]
i=2: A[2] B[0] A[0] B[4] B[1] A[1] B[5] B[2]
i=3: A[2] B[6] B[3] A[3] B[1] A[1] B[5] B[2]
i=4: A[2] B[6] B[3] A[3] B[7] B[4] A[4] B[2]
i=5: B[5] A[5] B[3] A[3] B[7] B[4] A[4] B[8]
i=6: B[5] A[5] B[9] B[6] A[6] B[4] A[4] B[8]
i=7: B[5] A[5] B[9] B[6] A[6] B[10] B[7] A[7]

• 11 compulsory misses (+8 write misses)
• 5 capacity misses

Page 59: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Conflict miss example

for (j=0; j<10; j++)
  for (i=0; i<4; i++)
    A[i] = A[i]+B[i][j];

[Figure: main memory holds A[0] to A[3] at memory addresses 0 to 3, followed by B[0][0], B[1][0], B[2][0], B[3][0], B[0][1], B[1][1], ... up to B[3][9]. Each memory address maps to one of the cache addresses 0 to 7. Cache (@ i=0): cache address 0 is shared by A[0] and B[0][j].]

• A[i] read 10 times
• j=even: A[0] and B[0][j] use different cache addresses
• j=odd: A[0] flushed in favor of B[0][j] -> Miss
• A[0] multiply loaded

Page 60: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

"Three C's" vs Cache size [Gee93]

[Figure: relative absolute misses (0.00 to 0.15) versus cache size in KB (1, 2, 4, 8, 16, 32, 64), with curves for total misses, compulsory misses, capacity misses and conflict misses.]

Page 61: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Data layout may reduce cache misses

Page 62: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Example 1: Capacity & Compulsory miss reduction

for (i=0; i<N; i++)
  A[i] = B[i+3]+B[i];

[Figure: cache contents for i = 0 to 7, identical to the capacity miss example (8 blocks of 1 word, fully associative).]

• 11 compulsory misses (+8 write misses)
• 5 capacity misses

Page 63: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Fit data in cache with in-place mapping

for (i=0; i<12; i++)
  A[i] = B[i+3]+B[i];

• Traditional analysis: max = 27 words
• Detailed analysis: max = 15 words

[Figure: number of live words of A[] and B[] versus i (0 to 12), peaking at 15. Main memory (16 words) and cache memory (16 words) both hold the merged array AB[new].]

Page 64: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Remove capacity / compulsory misses with in-place mapping

for (i=0; i<N; i++)
  AB[i] = AB[i+3]+AB[i];

Cache contents after each iteration:

i=0: AB[3] AB[0]
i=1: AB[3] AB[0] AB[4] AB[1]
i=2: AB[3] AB[0] AB[4] AB[1] AB[5] AB[2]
i=3: AB[3] AB[0] AB[4] AB[1] AB[5] AB[2] AB[6]
i=4: AB[3] AB[0] AB[4] AB[1] AB[5] AB[2] AB[6] AB[7]
i=5: AB[3] AB[8] AB[4] AB[1] AB[5] AB[2] AB[6] AB[7]
i=6: AB[3] AB[8] AB[4] AB[9] AB[5] AB[2] AB[6] AB[7]
i=7: AB[3] AB[8] AB[4] AB[9] AB[5] AB[10] AB[6] AB[7]

• 11 compulsory misses
• 5 cache hits (+8 write hits)
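The miss counts for the separate arrays and for the in-place array AB can be reproduced with a small cache simulation in C. This is a sketch under assumptions that the slides do not state explicitly: least-recently-used replacement, write-allocate on a write miss, one word per block, and array base addresses chosen only for the illustration. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define CACHE_BLOCKS 8     /* 8 blocks of 1 word, fully associative */
#define ADDRESS_SPACE 64   /* large enough for this sketch */

typedef struct {
    int compulsory;    /* read miss on a word never touched before */
    int capacity;      /* read miss on a word that was evicted earlier */
    int read_hits;
    int write_hits;
    int write_misses;
} Stats;

typedef struct {
    int addr[CACHE_BLOCKS];
    int last_use[CACHE_BLOCKS];
    int used;                   /* number of valid blocks */
    int clock;                  /* access counter for LRU */
    bool seen[ADDRESS_SPACE];   /* words touched at least once */
} Cache;

/* Returns true on a hit; on a miss the word is loaded, evicting the LRU block. */
static bool touch(Cache *c, int addr)
{
    assert(addr >= 0 && addr < ADDRESS_SPACE);
    ++c->clock;
    for (int b = 0; b < c->used; ++b) {
        if (c->addr[b] == addr) {
            c->last_use[b] = c->clock;
            return true;
        }
    }
    int victim = 0;
    if (c->used < CACHE_BLOCKS) {
        victim = c->used++;
    } else {
        for (int b = 1; b < CACHE_BLOCKS; ++b)
            if (c->last_use[b] < c->last_use[victim])
                victim = b;
    }
    c->addr[victim] = addr;
    c->last_use[victim] = c->clock;
    return false;
}

/* Simulates A[i] = B[i+3] + B[i]; equal base addresses model the array AB. */
Stats simulate(int a_base, int b_base, int iterations)
{
    Cache c = { 0 };
    Stats s = { 0 };
    for (int i = 0; i < iterations; ++i) {
        int reads[2] = { b_base + i + 3, b_base + i };
        for (int r = 0; r < 2; ++r) {
            bool first_touch = !c.seen[reads[r]];
            c.seen[reads[r]] = true;
            if (touch(&c, reads[r]))
                ++s.read_hits;
            else if (first_touch)
                ++s.compulsory;
            else
                ++s.capacity;
        }
        c.seen[a_base + i] = true;
        if (touch(&c, a_base + i))
            ++s.write_hits;
        else
            ++s.write_misses;
    }
    return s;
}
```

With separate arrays, for example simulate(0, 32, 8), the writes to A push B elements out of the cache before they are reused. With a shared base, simulate(0, 0, 8), each element is written in the block it was just read from, so the reuse of AB[i+3] three iterations later still finds the word in the cache.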

Page 65: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Example 2: Conflict miss reduction

for (j=0; j<10; j++)
  for (i=0; i<4; i++)
    A[i] = A[i]+B[i][j];

[Figure: identical to the conflict miss example. Main memory holds A[0] to A[3] at memory addresses 0 to 3, followed by B[0][0], B[1][0], B[2][0], B[3][0], B[0][1], ... up to B[3][9]; each memory address maps to one of the cache addresses 0 to 7. Cache (@ i=0): cache address 0 is shared by A[0] and B[0][j].]

• A[i] read 10 times
• j=even: A[0] and B[0][j] use different cache addresses
• j=odd: A[0] flushed in favor of B[0][j] -> Miss
• A[0] multiply loaded

Page 66: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Avoid conflict miss with main memory data layout

for (j=0; j<10; j++)
  for (i=0; i<4; i++)
    A[i] = A[i]+B[i][j];

[Figure: main memory holds A[0] to A[3], then B[0][0] to B[3][0], then a gap ("Leave gap"), then B[0][1] to B[3][1], another gap, B[0][2], ... up to B[3][9]. With the gaps, A[] maps to cache addresses 0 to 3 and B[][] always maps to cache addresses 4 to 7. Cache (@ i=0): A[0] and B[0][j] use different cache addresses.]

• A[i] multiple x read
• j=any: no conflict
• A[0] no longer multiply loaded

© imec 2001
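The effect of the gaps can be checked with a direct mapped cache simulation in C. The sketch assumes 8 cache lines of one word, write-allocate on a write miss, and a layout in which column j of B starts at memory address 4 + stride*j; a stride of 4 models the original layout and a stride of 8 models the layout with gaps. The names and the stride parameter are hypothetical.

```c
#include <assert.h>

#define CACHE_LINES 8
#define ROWS 4      /* i = 0..3 */
#define COLS 10     /* j = 0..9 */
#define A_BASE 0    /* A[0..3] at memory addresses 0..3 */
#define B_BASE 4    /* B[0][0] at memory address 4 */

/* Direct mapped access: returns 1 on a miss and loads the word. */
static int access_word(int resident[CACHE_LINES], int addr)
{
    int line = addr % CACHE_LINES;
    if (resident[line] == addr)
        return 0;
    resident[line] = addr;
    return 1;
}

/* Counts all misses of the loop nest for a given distance between B columns. */
int count_misses(int b_column_stride)
{
    assert(b_column_stride >= ROWS);
    int resident[CACHE_LINES];
    for (int l = 0; l < CACHE_LINES; ++l)
        resident[l] = -1;   /* cache starts empty */
    int misses = 0;
    for (int j = 0; j < COLS; ++j) {
        for (int i = 0; i < ROWS; ++i) {
            int a = A_BASE + i;
            int b = B_BASE + b_column_stride * j + i;
            misses += access_word(resident, a);  /* read A[i] */
            misses += access_word(resident, b);  /* read B[i][j] */
            misses += access_word(resident, a);  /* write A[i] */
        }
    }
    return misses;
}
```

Under these assumptions the original layout gives 64 misses and the layout with gaps gives 44. The 44 are the first accesses to the 4 elements of A and the 40 elements of B, so the 20 misses that disappear are the conflict misses on A[i] for odd j.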

Page 67: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Data Layout Organization for Direct Mapped Caches

[Figure: miss rate (%, scale 0 to 16) versus cache size (512 bytes, 1 KB, 2 KB) for three cases: Initial - Direct Mapped; Data Layout Org - Direct Mapped; Initial - Fully Assoc.]

Page 68: Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

Conclusion on Data Management

• In multi-media applications, exploring data transfer and storage issues should be done at source code level

• DMM method
  – Reducing number of external memory accesses
  – Reducing external memory size
  – Trade-offs between internal memory complexity and speed
  – Platform independent high-level transformations
  – Platform dependent transformations exploit platform characteristics (efficient use of memory, cache, …)

– Substantial energy reduction