MICRO-36, San Diego, December 2003
Flexible Compiler-Managed L0 Buffers for Clustered VLIW Processors

Enric Gibert (1), Jesús Sánchez (2), Antonio González (1,2)
(1) Dept. d'Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC), Barcelona
(2) Intel Barcelona Research Center, Intel Labs - UPC, Barcelona
Motivation
[Figure: baseline 4-cluster VLIW processor. Each cluster (1-4) has a Reg. File and FUs; clusters are connected by register-to-register communication buses and access a unified L1 cache, backed by an L2 cache, through shared memory buses.]
OPTION 1: Distribute L1
[Figure: each of the n clusters (Reg. File + FUs) has its own L1 cache module; the modules reach the L2 cache through memory buses.]
OPTION 2: Memory Buffers
[Figure: each of the n clusters (Reg. File + FUs) keeps a small memory buffer; a unified L1 cache, backed by L2, is accessed through memory buses.]
Contributions
Small L0 buffer in each cluster
– Flexible mechanisms to map data to the buffers
– Compiler-controlled memory instruction hints
Instruction scheduling techniques (VLIW)
– Mark "critical" instructions to use the buffers
– Use appropriate memory instruction hints
Data coherence among buffers [CGO'03]
– 3 mechanisms: same cluster, partial store replication, and not using the buffers
Talk Outline
Flexible Compiler-Managed L0 Buffers
Instruction Scheduling Techniques
Evaluation
Conclusions
L0 Buffers
[Figure: clusters 1 and 2 (Reg. File; INT, FP, and MEM units) shown in detail, clusters 3 and 4 abbreviated. A small L0 buffer sits between each cluster's MEM unit and the shared L1 cache; unpack logic sits between the L1 cache and the buffers. Register-to-register communication buses connect the clusters.]
Mapping Flexibility
[Figure: a 16-byte L1 cache block holding 4-byte array elements, and the four clusters' L0 buffers.]
Linear mapping: "load a[0] with stride 1 element" brings consecutive elements into the requesting cluster's buffer, so cluster 1 holds a[0] a[1].
Interleaved mapping (1-cycle penalty through the unpack logic): "load a[0], load a[1], load a[2], load a[3], all with a 4-element stride" scatters the data across the clusters, so cluster 1 holds a[0] a[4], cluster 2 holds a[1] a[5], cluster 3 holds a[2] a[6], and cluster 4 holds a[3] a[7].
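The two mappings above can be sketched as address-to-cluster functions. This is an illustrative sketch, not from the talk; it assumes 4 clusters and 4-byte elements, matching the slide's example.

```python
# Which cluster's L0 buffer receives array element a[i] under the two
# mappings described above. Assumes 4 clusters and 4-byte elements.

N_CLUSTERS = 4

def linear_mapping(i, requesting_cluster):
    # Linear: the stride-1 load pulls consecutive elements into the
    # buffer of the cluster that issued it.
    return requesting_cluster

def interleaved_mapping(i):
    # Interleaved: the unpack logic scatters the block element by
    # element, so a[i] lands in cluster (i mod 4).
    return i % N_CLUSTERS

# Interleaved spreads a[0..7] as in the slide:
# cluster 1: a[0] a[4]; cluster 2: a[1] a[5]; cluster 3: a[2] a[6]; ...
assert [interleaved_mapping(i) for i in range(8)] == [0, 1, 2, 3, 0, 1, 2, 3]
```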
Memory Hints
[Figure: clusters 1-4 with INT, FP, and MEM units, their L0 buffers, and the shared L1 cache; the figure contrasts a "load (sequential access)" with a "load (parallel access)".]
Access directives: no access, sequential, parallel
Mapping hints: linear, interleaved
Prefetching hints: none, positive, negative
Access-directive example: cycle i: "load sequential a[0]" (the buffer receives a[0] a[1] a[2] a[3]); cycle i+1: "load no access *p" (bypasses the buffer).
Prefetching example: cycle i: "load *a (positive pref.)"; cycle i+1: "a++".
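The three hint fields are orthogonal, so a compiler could carry them per memory instruction roughly as below. This is an illustrative sketch, not the actual instruction encoding; the class and field names are hypothetical.

```python
# Hypothetical per-instruction hint record matching the slide's three
# hint categories (access directive, mapping hint, prefetching hint).
from dataclasses import dataclass
from enum import Enum

class Access(Enum):
    NO_ACCESS = 0   # bypass the L0 buffer, go straight to L1
    SEQUENTIAL = 1  # strided access, fill the L0 buffer
    PARALLEL = 2    # strided access using the interleaved layout

class Mapping(Enum):
    LINEAR = 0
    INTERLEAVED = 1

class Prefetch(Enum):
    NONE = 0
    POSITIVE = 1    # next block will be needed (ascending stride)
    NEGATIVE = 2    # previous block will be needed (descending stride)

@dataclass
class MemHints:
    access: Access
    mapping: Mapping
    prefetch: Prefetch

# The slide's examples: 'load sequential a[0]' vs. 'load no access *p'.
strided = MemHints(Access.SEQUENTIAL, Mapping.LINEAR, Prefetch.POSITIVE)
pointer = MemHints(Access.NO_ACCESS, Mapping.LINEAR, Prefetch.NONE)
```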
L0 - L1 Interaction
L0 buffers are write-through.
[Figure: in cluster 1, loads fill the L0 buffer from the L1 cache through the unpack logic; stores go directly to the L1 cache, so a replacement needs no pack logic or write-back path.]
1) Simplifies replacements: no bus arbitration, no flush instruction
2) No pack logic
3) Data consistency
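The write-through property can be illustrated with a small behavioral model. This is a hypothetical sketch, not the hardware: stores update both levels, so a replacement victim is simply dropped.

```python
# Behavioral model of a write-through L0 buffer: because L1 always
# holds the up-to-date data, eviction is a silent discard (no bus
# arbitration, no pack logic, no flush).

class L0Buffer:
    def __init__(self, n_entries, l1):
        self.n_entries = n_entries
        self.entries = {}   # address -> data (8-byte subblocks)
        self.l1 = l1        # backing L1 cache, modeled as a dict

    def store(self, addr, data):
        # Write-through: update L1 unconditionally, and the L0 copy
        # only if the subblock is currently buffered.
        self.l1[addr] = data
        if addr in self.entries:
            self.entries[addr] = data

    def fill(self, addr):
        # A load fill may need to evict; the victim is just dropped.
        if len(self.entries) >= self.n_entries and addr not in self.entries:
            victim = next(iter(self.entries))
            del self.entries[victim]  # L1 already has the latest data
        self.entries[addr] = self.l1[addr]

l1 = {0: "a[0..1]", 8: "a[2..3]", 16: "b[0..1]"}
l0 = L0Buffer(2, l1)
l0.fill(0); l0.fill(8)
l0.store(0, "new")   # both copies updated
l0.fill(16)          # replacement silently discards a victim
assert l1[0] == "new"
```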
Talk Outline
Flexible Compiler-Managed L0 Buffers
Instruction Scheduling Techniques
Evaluation
Conclusions
Memory Coherence
[Figure: two clusters with L0 buffers above the shared L1 cache, and two possible schedules for dependent memory instructions load A, load B, store C, load D, store E.]
Not use buffers (NB): cycle i: load A, load B; cycle i+1: store C; cycle i+2: load D; cycle i+3: store E. The conflicting instructions bypass the L0 buffers and access the L1 cache directly.
1 cluster (1C): the dependent instructions are all assigned to the same cluster, so the dependence is satisfied through that cluster's L0 buffer.
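A scheduler has to pick one of the three coherence mechanisms for each group of possibly dependent memory instructions. The sketch below is a hypothetical selection heuristic, not the policy from [CGO'03]; the threshold on the number of stores is invented for illustration.

```python
# Hypothetical heuristic for choosing among the three coherence
# mechanisms named on the slide: same cluster (1C), partial store
# replication, and not using the buffers (NB).

def choose_mechanism(insts, can_colocate):
    """insts: memory instructions with a possible data dependence.
    can_colocate: True if the scheduler can put them all on one cluster."""
    if can_colocate:
        # 1C: the dependence flows through a single L0 buffer.
        return "1C"
    stores = [i for i in insts if i.startswith("store")]
    if len(stores) <= 1:
        # Partial store replication: replicate the few stores to the
        # L0 buffers that may hold a stale copy (invented threshold).
        return "replicate-stores"
    # NB: replication would be too costly; bypass the L0 buffers.
    return "NB"

assert choose_mechanism(["load A", "store C"], can_colocate=True) == "1C"
assert choose_mechanism(["load A", "store C"], can_colocate=False) == "replicate-stores"
```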
Scheduling Algorithm (I)
Overview
– Candidate instructions: strided memory instructions
– Fraction of memory instructions that are strided: epicdec 99%, g721dec 100%, g721enc 100%, gsmdec 97%, gsmenc 99%, jpegdec 60%, jpegenc 49%, mpeg2dec 96%, pegwitdec 50%, pegwitenc 56%, pgpdec 99%, pgpenc 86%, rasta 95%
– Assign "critical" candidate instructions to the buffers
– Criticality: global communications + workload
– Do not overflow the buffers
Loop unrolling
– Factors: 1 or N; unrolling by N may benefit from the interleaved mapping
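Candidate selection boils down to recognizing accesses whose address is an affine function of the loop induction variable. This is an illustrative sketch over a hypothetical IR, not IMPACT's actual analysis.

```python
# Hypothetical strided-access check: an address of the form
# base + stride * induction_var is a candidate for the L0 buffers.

def is_strided(addr_expr, induction_var):
    """addr_expr is a (base, var, stride) triple meaning base + stride*var;
    (p, None, None) models an address the compiler cannot analyze."""
    base, var, stride = addr_expr
    return var == induction_var and stride is not None

# a[i] -> base_a + 4*i : strided, a candidate for the buffers.
# *p   -> unknown      : not a candidate (gets the 'no access' hint).
assert is_strided(("base_a", "i", 4), "i")
assert not is_strided(("p", None, None), "i")
```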
Scheduling Algorithm (II)
[Figure: flowchart of the modulo scheduler, based on Swing MS: Sort Nodes → Initialize Data → Next Node → P = Possible Clusters → Sort P and Compute Latencies → Schedule in a Cluster of P. If P is empty, II = II + 1 and the loop is rescheduled; after placing a node, NFreeEntries is updated, criticality is recomputed, and latencies are reassigned.]
Sort P by: L0 availability, minimum global communications, maximum workload.
Compute latencies (slack) using: NFreeEntries and the memory-dependence (coherence) constraints.
[Example: a dependence graph of strided loads and stores scheduled on two clusters; NFreeEntries starts at {2, 2} and drops, e.g. to {1, 0}, as instructions are assigned to the L0 buffers.]
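The per-node cluster-selection step of the flowchart can be sketched as below. This is a simplified, hypothetical model rather than the actual Swing-MS implementation; the 1-cycle L0 and 6-cycle L1 latencies follow the evaluation setup later in the talk.

```python
# Simplified cluster selection for one candidate memory instruction:
# sort the possible clusters P, then assign the access to an L0 buffer
# if the chosen cluster still has free entries.

L0_LAT, L1_LAT = 1, 6

def pick_cluster(node, possible, n_free_entries, comm_cost, workload):
    """Return (cluster, latency) for the best cluster in P, or None
    if P is empty (caller then does II = II + 1 and restarts)."""
    if not possible:
        return None
    ranked = sorted(possible,
                    key=lambda c: (n_free_entries[c] == 0,  # L0 availability
                                   comm_cost[c],            # min. global comms.
                                   -workload[c]))           # max. workload
    best = ranked[0]
    if n_free_entries[best] > 0:
        n_free_entries[best] -= 1   # instruction assigned to the L0 buffer
        return best, L0_LAT
    return best, L1_LAT             # buffers full: schedule as an L1 access

# Two clusters; cluster 0 has no free L0 entries left.
free = {0: 0, 1: 2}
cluster, lat = pick_cluster("load a[i]", [0, 1], free,
                            comm_cost={0: 0, 1: 1}, workload={0: 3, 1: 3})
assert (cluster, lat) == (1, L0_LAT) and free[1] == 1
```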
Talk Outline
Flexible Compiler-Managed L0 Buffers
Instruction Scheduling Techniques
Evaluation
Conclusions
Evaluation Framework (I)
IMPACT C compiler
• Compile + optimize + memory disambiguation
• Extended with the proposed instruction scheduler
Mediabench benchmark suite (benchmark: input): epicdec: titanic; g721dec: S_16_44; g721enc: S_16_44; gsmdec: S_16_44; gsmenc: S_16_44; jpegdec: monalisa; jpegenc: monalisa; mpeg2dec: tek6; pegwitdec: techrep; pegwitenc: techrep; pgpdec: techrep; pgpenc: techrep; rasta: ex5_c1
Evaluation Framework (II)
Architecture configuration:
# Clusters: 4
Functional units: 1 FP + 1 integer + 1 memory unit per cluster
L0 buffers: 8-byte subblocks, fully associative, 1-cycle latency
L1 cache: 8 KB total, 32-byte blocks, 2-way set-associative, 6-cycle latency; 1 extra cycle for the interleaved mapping (unpack logic)
L2 cache: 10-cycle latency, always hits
Register communications: 4 buses, 2-cycle latency
Number of L0 Entries
[Figure: normalized execution time, split into compute time and stall time, for epicdec, g721dec, gsmenc, jpegdec, rasta, and the MEAN, with 4-entry, 8-entry, 16-entry, and unbounded L0 buffers.]
L0 Hit Rate
[Figure: per-benchmark breakdown of L0 hits vs. L0 misses, from 0% to 100%.]
Improving L0 Hit Rate
Solution: prefetch two blocks in advance
– Uses more L0 buffer entries
– Speedups: 1.12 in epicdec (+7% hit rate) and 1.04 in rasta (+12% hit rate)
[Figure: with II = 2, cluster 1's L0 buffer holds a[0] a[1] while iterations 1-4 issue load a[0] through load a[3]; a prefetch of a[2] issued in iteration 1 reaches the L0 buffer, bringing in a[2] a[3], just before a[2] is needed in iteration 3.]
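The "two blocks in advance" distance in the slide's example follows from simple latency arithmetic. The formula below is an illustrative reconstruction under stated assumptions (6-cycle L1 latency, II = 2, two 4-byte elements per 8-byte subblock), not a formula given in the talk.

```python
# How many subblocks ahead a prefetch must be issued so the data
# arrives before its first use, assuming one element consumed per
# loop iteration of length II.
import math

def prefetch_distance(l1_latency, ii, elems_per_subblock):
    # Iterations needed to cover the miss latency, rounded up, then
    # converted to whole subblocks.
    iters_to_cover = math.ceil(l1_latency / ii)
    return math.ceil(iters_to_cover / elems_per_subblock)

# 6-cycle L1, II = 2, two elements per subblock: request each subblock
# at least two subblocks ahead of its use, as in the slide's example.
assert prefetch_distance(6, 2, 2) == 2
```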
Distributed Cache
[Figure: the two distributed-cache schemes used for comparison, each shown with two clusters (Reg. File + Func. Units) and per-cluster L1 modules backed by a shared L2 cache.]
Word-interleaved [MICRO35]: the words of an L1 cache block (W0-W7) are interleaved across the clusters' L1 modules (cluster 1: W0 W2 W4 W6; cluster 2: W1 W3 W5 W7).
MultiVLIW [MICRO33]: each L1 module holds whole L1 cache blocks, kept consistent by a cache-coherence protocol.
Performance Results
[Figure: normalized execution time, split into compute time and stall time, for epicdec, g721dec, gsmenc, jpegdec, rasta, and the MEAN, comparing 8-entry L0 buffers, MultiVLIW, and two word-interleaved configurations (Interleaved1, Interleaved2).]
Talk Outline
Flexible Compiler-Managed L0 Buffers
Instruction Scheduling Techniques
Evaluation
Conclusions
Conclusions
Flexible compiler-managed L0 buffers
– Mapping flexibility
– Memory instruction hints
Instruction scheduling techniques
– Mark "critical" instructions + do not overflow the buffers
– Memory coherence solutions [CGO'03]
Performance results
– 16% better than a unified L1 cache without buffers
– Outperforms the word-interleaved cache [MICRO35]
– Competitive with MultiVLIW [MICRO33]