Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
•Juergen Ributzka•David Stephenson•Timothy Kong•Dee Lee•Fred Chow•Guang R. Gao
Overview
• SiCortex Multiprocessor
• Software Pipelining Framework
• Results
• Future Work
3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 2
SiCortex Multiprocessor
• RISC
• 6 cores (MIPS 5KF)
• 500 – 700 MHz
• In-Order Execution
• Limited-Dual Issue
• Low Power (12 – 16 Watt)
3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 3
Software Pipelining Framework
3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 4
DDG
MII
MS
MVE
RA
CG
Front-End
Middle-End
Back-End
FORTRAN C/C++ Java
Loop NestOptimizer
GlobalOptimizer
CodeGenerator
Data Dependence Graph (DDG)
• Cyclic Data Dependence Graph
• No register anti- and output-dependencies
• Only single BBs
3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 5
DDG
MII
MS
MVE
RA
CG
S1
S2
S1
S2
Minimum Initiation Interval
• Limited by:
– Resource Requirements
– Recurrence Requirements
3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 6
DDG
MII
MS
MVE
RA
CG
Modulo Scheduling (MS)
• Huff Modulo Scheduling
– Lifetime sensitive
– Uses backtracking
• Hyper Node Reduction Scheduling
– Lifetime sensitive
– No backtracking
3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 7
DDG
MII
MS
MVE
RA
CG
Modulo Variable Expansion (MVE)
• Separate TN set for every iteration
• # of unrollings depends on longest lifetime
3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 8
DDG
MII
MS
MVE
RA
CG
TN10 op1 TNXX…TN20 op2 TN10 TNYY
TN11 op1 TNXX…TN21 op2 TN11 TNYY
TN10 op1 TNXX…TN20 op2 TN10 TNYY
Iteration
C
y
c
l
e
Register Allocation (RA)
• Only kernels are fully register allocated
• No regions
– SWP kernels are not a black box for GRA and LRA anymore
• Currently no register spilling support
3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 9
DDG
MII
MS
MVE
RA
CG
Code Generation (CG)
• Prologues and epilogues are partially register allocated
• Need several epilogues due to different register sets
3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 10
DDG
MII
MS
MVE
RA
CG
P
P0
K0
K1 E0
E1
E2
E
Benchmark Results
3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 11
0.9
0.95
1
1.05
1.1
1.15
1.2
BT CG EP FT IS LU MG SP UA
spe
ed
up
NAS Parallel Benchmark
ABI n32
ABI 64
Benchmark Results
3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 12
0.95
0.96
0.97
0.98
0.99
1
1.01
1.02
1.03
1.04
1.05
41
0.b
wav
es
43
3.m
ilc
43
4.z
eusm
p
43
5.g
rom
acs
43
6.c
actu
sAD
M
43
7.le
slie
3d
44
4.n
amd
44
7.d
ealII
45
0.s
op
lex
45
3.p
ovr
ay
45
4.c
alcu
lix
45
9.G
emsF
DTD
46
5.t
on
to
47
0.lb
m
48
1.w
rf
48
2.s
ph
inx3
40
0.p
erlb
ench
40
1.b
zip
2
40
3.g
cc
42
9.m
cf
44
5.g
ob
mk
45
6.h
mm
er
45
8.s
jen
g
46
2.li
bq
uan
tum
46
4.h
26
ref
47
1.o
mn
etp
p
47
3.a
star
spe
ed
up
SPEC2006
ABI n32
ABI 64
Conclusion and Future Work
• Spilling support needed to enable more loops for SWP
• Screen out loops with small trip count during runtime to reduce SWP overhead
• Hit-under-miss support to overcome current hardware limitations
3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 13
Questions ?
3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 14