Flexible Compiler-Managed L0 Buffers for Clustered VLIW Processors

Enric Gibert (1), Jesús Sánchez (2), Antonio González (1,2)

(1) Dept. d’Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC), Barcelona

(2) Intel Barcelona Research Center, Intel Labs - UPC, Barcelona

MICRO36, San Diego, December 2003


Motivation

[Figure: baseline clustered VLIW processor. Clusters 1 to 4, each with a register file and functional units (FUs), are linked by register-to-register communication buses. Memory buses connect the clusters to a unified L1 cache, backed by an L2 cache.]

OPTION 1: Distribute L1

[Figure: clusters 1 to n, each with a register file, FUs and its own L1 cache module; memory buses; shared L2 cache.]

OPTION 2: Memory Buffers

[Figure: clusters 1 to n, each with a register file, FUs and a memory buffer; memory buses connect the buffers to a unified L1 cache, backed by an L2 cache.]


Contributions

Small L0 Buffer in each cluster
– Flexible mechanisms to map data to the buffers
– Compiler-controlled memory inst. hints

Instruction scheduling techniques (VLIW)
– Mark “critical” instructions to use the buffers
– Use appropriate memory instruction hints

Data coherence among buffers [CGO’03]
– 3 mechanisms: same cluster, partial store replication and not use buffers


Talk Outline

Flexible Compiler-Managed L0 Buffers

Instruction Scheduling Techniques

Evaluation

Conclusions


L0 Buffers

[Figure: clusters 1 to 4, each with a register file, INT, FP and MEM units and an L0 buffer. Register-to-register communication buses link the clusters. Unpack logic sits between the L0 buffers and the unified L1 cache.]


Mapping Flexibility

[Figure: an L1 block (16 bytes, bytes 1 to 16) in the L1 cache holds a[0] to a[7]. Unpack logic sits between the L1 cache and the L0 buffers of clusters 1 to 4.]

Linear mapping
– load a[0] with stride 1 element
– L1 block split into subblocks 1 to 4 of 4 bytes each; subblock a[0] a[1] is placed in an L0 buffer

Interleaved mapping (1 cycle penalty)
– load a[0], load a[1], load a[2], load a[3]: all loads with a 4-element stride
– L0 buffers of clusters 1 to 4 hold a[0] a[4], a[1] a[5], a[2] a[6] and a[3] a[7]
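The two mappings can be sketched in a few lines of Python. This is only an illustration of the slide's example (8 elements, 4 clusters); the function names `linear_subblock` and `interleaved_subblocks` are hypothetical, and the sketch ignores byte sizes, latencies and the unpack logic.

```python
def linear_subblock(block, n_subblocks, index):
    """Return the contiguous subblock of `block` that contains element `index`."""
    size = len(block) // n_subblocks          # elements per subblock
    start = (index // size) * size
    return block[start:start + size]


def interleaved_subblocks(block, n_clusters):
    """Distribute the block round-robin: cluster k gets elements k, k+n, k+2n, ..."""
    return [block[k::n_clusters] for k in range(n_clusters)]


# The slide's example: an L1 block holding a[0]..a[7], four clusters.
block = [f"a[{i}]" for i in range(8)]

print(linear_subblock(block, 4, 0))      # subblock that a load of a[0] brings in
for cluster, data in enumerate(interleaved_subblocks(block, 4), start=1):
    print(f"cluster {cluster}: {data}")
```

With interleaved mapping, a loop unrolled by 4 can place each of its four strided loads in a different cluster and still find its data locally.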


Memory Hints

Access Directives: no access, sequential, parallel
Mapping Hints: linear, interleaved
Prefetching Hints: none, positive, negative

[Figure: clusters 1 to 4 (INT, FP and MEM units, one L0 buffer each) above the L1 cache, showing a load (sequential access) and a load (parallel access).]

Example: access directives
cycle i      load sequential a[0]
cycle i+1    load no access *p

[Figure: the L0 buffer holds a[0] a[1] a[2] a[3]; load a[0] uses the buffer, load *p is marked no access.]

Example: prefetching hints
cycle i      load *a (positive pref.)
cycle i+1    a++
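One way to picture the three hint families is as fields attached to each memory instruction. The sketch below is an assumption about representation only: the class and field names are hypothetical, `None` stands for "not stated on the slide", and nothing here models what the hardware does with each hint.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Access(Enum):
    NO_ACCESS = "no access"
    SEQUENTIAL = "sequential"
    PARALLEL = "parallel"


class Mapping(Enum):
    LINEAR = "linear"
    INTERLEAVED = "interleaved"


class Prefetch(Enum):
    NONE = "none"
    POSITIVE = "positive"
    NEGATIVE = "negative"


@dataclass(frozen=True)
class MemOp:
    """A memory instruction plus its compiler hints (hypothetical layout)."""
    opcode: str
    operand: str
    access: Optional[Access] = None      # None: not stated in the example
    mapping: Optional[Mapping] = None
    prefetch: Optional[Prefetch] = None


# The two examples from the slide, keyed by issue cycle.
access_example = {
    "i": MemOp("load", "a[0]", access=Access.SEQUENTIAL),
    "i+1": MemOp("load", "*p", access=Access.NO_ACCESS),
}
prefetch_example = {
    "i": MemOp("load", "*a", prefetch=Prefetch.POSITIVE),
}

# Upper bound on distinct hint combinations; not all need be meaningful.
print(len(Access) * len(Mapping) * len(Prefetch))
```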


L0 - L1 Interaction

L0 Buffers are write-through

1) Simplifies replacements
• no bus arbitration
• flush instruction

2) No pack logic

3) Data consistency

[Figure: cluster 1 with its register file, INT, FP and MEM units and L0 buffer, clusters 2 to 4, and the L1 cache. Paths are labeled load (through the unpack logic), store, replacement and pack logic.]
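The effect of the write-through policy can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the hardware design: `WriteThroughL0` is a hypothetical name, the L1 cache is a plain dictionary, replacement is FIFO, and stores do not allocate a buffer entry.

```python
class WriteThroughL0:
    """Toy write-through buffer in front of a dict that stands in for L1."""

    def __init__(self, l1, capacity):
        self.l1 = l1
        self.capacity = capacity
        self.entries = {}                 # address -> value, insertion-ordered

    def load(self, addr):
        if addr in self.entries:          # L0 hit
            return self.entries[addr]
        value = self.l1[addr]             # L0 miss: fetch from L1
        if len(self.entries) >= self.capacity:
            # Replacement: L1 is always up to date, so the victim is simply
            # dropped. No write-back traffic and no pack logic are needed.
            del self.entries[next(iter(self.entries))]
        self.entries[addr] = value
        return value

    def store(self, addr, value):
        if addr in self.entries:          # keep the local copy consistent
            self.entries[addr] = value
        self.l1[addr] = value             # write-through: L1 always updated

    def flush(self):
        """Invalidate every entry; nothing has to be written back."""
        self.entries.clear()


l1 = {0: 10, 4: 20, 8: 30}
buf = WriteThroughL0(l1, capacity=2)
buf.load(0)
buf.store(0, 99)      # updates both L0 and L1
buf.load(4)
buf.load(8)           # evicts address 0 without any write-back
print(l1[0], sorted(buf.entries))
```

Because L1 always holds the latest value, both replacement and `flush` reduce to discarding entries.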


Memory Coherence

[Figure: memory dependence graph over load A, load B, store C, load D and store E. Clusters 1 and 2, each with an L0 buffer, share the L1 cache.]

Not use buffers (NB)
SCHEDULE
cycle i      load B, load A
cycle i+1    store C
cycle i+2    load D
cycle i+3    store E

1 cluster (1C)
SCHEDULE
cycle i      load A
cycle i+1    store C
cycle i+2    load D
cycle i+3    store E
load B (drawn apart from these rows in the figure)


Scheduling Algorithm (I)

Overview
– Candidate instructions: strided mem. insts.
– Assign “critical” candidate instructions to buffers
– Global comms. + workload
– Do not overflow buffers

Benchmark    S       Benchmark    S
epicdec      99%     mpeg2dec     96%
g721dec      100%    pegwitdec    50%
g721enc      100%    pegwitenc    56%
gsmdec       97%     pgpdec       99%
gsmenc       99%     pgpenc       86%
jpegdec      60%     rasta        95%
jpegenc      49%

Loop unrolling
– Factors: 1 or N
– Unroll N: may benefit from interleaved mapping


Scheduling Algorithm (II)

[Figure: flowchart of the scheduler, built on Swing MS. Boxes: Sort Nodes; Initialize Data; Next Node; P = Possible Clusters; Sort P and Compute Latencies; Schedule in a Cluster of P; NFreeEntries + Recompute Criticality + Reassign Latencies; II=II+1. Edge labels: empty / ! empty, and impossible / possible.]

Data shown for two clusters, each with an L0 buffer
– NFreeEntries = {2, 2}
– Latencies (slack)
– mem deps

Sort P
• L0 availability
• Min. global comms.
• Max. workload

Compute latencies
• NFreeEntries
• Coherence

[Figure: example graphs over load A to load H, store C, load a[i], load a[i+1] and add, with the register file (RF).]
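The interplay between NFreeEntries and the assigned latencies can be sketched with a toy cluster-assignment loop. This is a simplification under loud assumptions, not the scheduler from the talk: it leaves out modulo scheduling, II, coherence and criticality recomputation; it reads "Max. workload" as "prefer the lighter-loaded cluster"; and the function names are hypothetical. The 1-cycle and 6-cycle latencies are the L0 and L1 figures from the architecture configuration.

```python
L0_LATENCY = 1   # cycles, from the architecture configuration
L1_LATENCY = 6   # cycles, from the architecture configuration


def sort_possible_clusters(possible, n_free, comms, workload):
    """Order candidates: free L0 entry first, then fewer comms, then lighter load."""
    return sorted(possible, key=lambda c: (n_free[c] == 0, comms[c], workload[c]))


def schedule_critical_loads(loads, n_free_entries):
    """Assign each critical load to a cluster and pick the latency it is given."""
    n_free = list(n_free_entries)
    workload = [0] * len(n_free)
    comms = [0] * len(n_free)        # assumption: no register communications
    plan = []
    for name in loads:
        cluster = sort_possible_clusters(range(len(n_free)), n_free, comms, workload)[0]
        if n_free[cluster] > 0:
            n_free[cluster] -= 1     # the load occupies one L0 entry
            latency = L0_LATENCY
        else:
            latency = L1_LATENCY     # do not overflow the buffer: use L1 latency
        workload[cluster] += 1
        plan.append((name, cluster, latency))
    return plan


plan = schedule_critical_loads(["loadA", "loadB", "loadC", "loadD", "loadE"], [2, 2])
for row in plan:
    print(row)
```

With NFreeEntries = {2, 2}, the first four loads receive the L0 latency and the fifth falls back to the L1 latency, which is the "do not overflow buffers" rule in miniature.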


Evaluation Framework (I)

IMPACT C compiler
• Compile + optimize + memory disambiguation
• Extended with proposed instruction scheduler

Mediabench benchmark suite

Benchmark    Input       Benchmark    Input
epicdec      titanic     mpeg2dec     tek6
g721dec      S_16_44     pegwitdec    techrep
g721enc      S_16_44     pegwitenc    techrep
gsmdec       S_16_44     pgpdec       techrep
gsmenc       S_16_44     pgpenc       techrep
jpegdec      monalisa    rasta        ex5_c1
jpegenc      monalisa


Evaluation Framework (II)

Architecture configuration

# Clusters                 4
Functional Units           1 FP / cluster + 1 integer / cluster + 1 memory / cluster
L0 Buffers                 8-byte subblocks, fully-associative
                           1-cycle latency
L1 Cache                   8KB total size, 32-byte blocks
                           2-way set associative
                           6-cycle latency
                           1 extra cycle for interleaved mapping (unpack logic)
L2 Cache                   10-cycle latency, always hits
Register Communications    4 buses with a 2-cycle latency


Number of L0 Entries

[Figure: bar chart of execution time (y axis from 0 to 1.4), split into stall time and compute time, for 4-entry, 8-entry, 16-entry and unbounded L0 buffers. Benchmarks shown: epicdec, g721dec, gsmenc, jpegdec, rasta and MEAN.]


L0 Hit Rate

[Figure: stacked bar chart of L0 hits and L0 misses (y axis from 0% to 100%).]


Improving L0 Hit Rate

[Figure: timeline for cluster 1 with II=2. Iterations 1 to 4 issue load a[0], load a[1], load a[2] and load a[3]. The L0 buffer holds a[0] a[1] and a prefetch of a[2] a[3] is issued, but a[2] is needed before a[2] reaches L0.]

Solution: prefetch two blocks in advance
– Use more L0 buffer entries
– Speedups: 1.12 in epicdec (+7% HR) and 1.04 in rasta (+12% HR)
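A back-of-the-envelope calculation shows why one block of lookahead is too little in this example. The sketch assumes a simplified timing model that is not taken from the talk: a strided loop consumes one element per iteration, so a subblock lasts `ii * elems_per_subblock` cycles, and a prefetch must cover the fetch latency. The function name `blocks_in_advance` is hypothetical; the 6-cycle figure is the L1 latency from the architecture configuration.

```python
import math


def blocks_in_advance(ii, elems_per_subblock, fetch_latency):
    """Smallest prefetch distance (in subblocks) that hides `fetch_latency` cycles."""
    cycles_per_subblock = ii * elems_per_subblock   # time spent consuming one subblock
    return math.ceil(fetch_latency / cycles_per_subblock)


# Slide example: II=2, two elements per L0 entry (a[0] a[1]), L1 latency of 6 cycles.
# One subblock lasts 4 cycles, which is shorter than the 6-cycle fetch.
print(blocks_in_advance(ii=2, elems_per_subblock=2, fetch_latency=6))
```

Under this model the example needs a distance of two subblocks, in line with the slide's "prefetch two blocks in advance", and each extra block of lookahead occupies one more L0 entry.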


Distributed Cache

Word-interleaved [MICRO35]

[Figure: clusters 1 and 2, each with a register file, functional units and an L1 module. An L2 cache block holds words W0 to W7; the L1 module of cluster 1 holds W0 W2 W4 W6 and the L1 module of cluster 2 holds W1 W3 W5 W7.]

MultiVLIW [MICRO33]

[Figure: clusters 1 and 2, each with a register file, functional units and an L1 module holding whole L1 cache blocks; L2 cache; cache-coherent protocol.]
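The word-interleaved distribution in the figure amounts to a modulo on the word index. The sketch below reproduces only that placement for the two-cluster example; `home_module` is a hypothetical name and the code does not model the cited designs in any other respect.

```python
def home_module(word_index, n_clusters):
    """Cluster (0-based) whose L1 module holds the given word of a block."""
    return word_index % n_clusters


words = [f"W{i}" for i in range(8)]        # one L2 cache block, as in the figure
n_clusters = 2
modules = {
    c: [w for i, w in enumerate(words) if home_module(i, n_clusters) == c]
    for c in range(n_clusters)
}
for c, held in modules.items():
    print(f"cluster {c + 1}: {' '.join(held)}")
```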


Performance Results

[Figure: bar chart of execution time (y axis from 0 to 1.4), split into stall time and compute time, for 8-entry buffers, MultiVLIW, Interleaved 1 and Interleaved 2. Benchmarks shown: epicdec, g721dec, gsmenc, jpegdec, rasta and MEAN.]


Conclusions

Flexible Compiler-Managed L0 Buffers
– Mapping flexibility
– Memory instruction hints

Instruction Scheduling Techniques
– Mark “critical” insts. + do not overflow buffers
– Memory coherence solutions [CGO’03]

Performance Results
– 16% better than unified L1 cache without buffers
– Outperforms word-interleaved cache [MICRO35]
– Competitive compared to MultiVLIW [MICRO33]


Questions?
