Platform-based Design
TU/e 5kk70
Henk Corporaal
Bart Mesman
ILP compilation (part b)
04/19/23 Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman
Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering
• Code generation / scheduling
• Design Space Exploration: TTA framework
Scheduling: Overview
Transforming a sequential program into a parallel program:
read sequential program
read machine description file
for each procedure do
    perform function inlining
for each procedure do
    transform an irreducible CFG into a reducible CFG
    perform control flow analysis
    perform loop unrolling
    perform data flow analysis
    perform memory reference disambiguation
    perform register allocation
    for each scheduling scope do
        perform instruction scheduling
write parallel program
Extended basic block scheduling: Code Motion
A: a) add r3, r4, 4
   b) beq . . .
B: c) add r1, r1, r2
C: d) sub r3, r3, r2
D: e) mul r1, r1, r3
• Downward code motions?
  – a → B, a → C, a → D, c → D, d → D
• Upward code motions?
  – c → A, d → A, e → B, e → C, e → A
Extended Scheduling scope
Code:
A;
If cond Then B Else C;
D;
If cond Then E Else F;
G;
CFG (Control Flow Graph): A → {B, C} → D → {E, F} → G
Scheduling scopes
[Figure: four scheduling scopes: trace, superblock, decision tree, hyperblock/region]
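Create and Enlarge Scheduling Scope
[Figure: the CFG with blocks A, B, C, D, E, F, G drawn twice, once with a trace and once with a superblock. The superblock is formed by tail duplication, which adds the copies D', E', G'.]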
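Create and Enlarge Scheduling Scope
[Figure: the CFG with blocks A, B, C, D, E, F, G drawn twice, once with a hyperblock/region and once with a decision tree. The decision tree is formed by tail duplication, which adds the copies D', E', F', G', G''.]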
Comparing scheduling scopes

                           Trace   Sup.block   Hyp.block   Dec.Tree   Region
Multiple exc. paths        No      No          Yes         Yes        Yes
Side-entries allowed       Yes     No          No          No         No
Join points allowed        Yes     No          Yes         No         Yes
Code motion down joins     Yes     No          No          No         No
Must be if-convertible     No      No          Yes         No         No
Tail dup. before sched.    No      Yes         No          Yes        No
Code movement (upwards) within regions
[Figure: an add operation moves upward from its source block, through intermediate blocks, to a destination block. Legend: code movement; copy needed; intermediate block; check for off-liveness.]
Extended basic block scheduling: Code Motion
• A dominates B ⇒ A is always executed before B
  – Consequently:
    • A does not dominate B ⇒ code motion from B to A requires code duplication
• B post-dominates A ⇒ B is always executed after A
  – Consequently:
    • B does not post-dominate A ⇒ code motion from B to A is speculative
[Figure: example CFG with blocks A, B, C, D, E, F]
Q1: does C dominate E?
Q2: does C dominate D?
Q3: does F post-dominate D?
Q4: does D post-dominate B?
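Dominance and post-dominance can be computed with a small iterative fixed point. The sketch below is only an illustration: the function names are hypothetical, and the CFG is an assumed example (A branches to B and C, which join in D; D branches to E and F, which join in G), not the one drawn on this slide.

```python
def dominators(succ, entry):
    """Return {node: set of nodes that dominate it} by iterating to a fixed point."""
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for n, targets in succ.items():
        for t in targets:
            pred[t].add(n)
    # Start from "everything dominates everything" and shrink the sets.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in sorted(nodes - {entry}):
            preds = [dom[p] for p in pred[n]]
            new = {n} | (set.intersection(*preds) if preds else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def post_dominators(succ, exit_node):
    """Post-dominators are the dominators of the reversed CFG, rooted at the exit."""
    rev = {n: [] for n in succ}
    for n, targets in succ.items():
        for t in targets:
            rev[t].append(n)
    return dominators(rev, exit_node)


# Assumed example CFG (not taken from the slide's figure).
CFG = {"A": ["B", "C"], "B": ["D"], "C": ["D"],
       "D": ["E", "F"], "E": ["G"], "F": ["G"], "G": []}
DOM = dominators(CFG, "A")
PDOM = post_dominators(CFG, "G")
```

In this assumed CFG, B does not dominate D, so moving code from D up into B would also need a copy in C; D does post-dominate B, so moving code from D up into B is not speculative.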
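Scheduling: Loops
Loop Optimizations:
• Loop peeling
• Loop unrolling
[Figure: a loop over the blocks A, B, C, D, shown in its original form, after loop peeling, and after loop unrolling; the transformed versions contain the copies C' and C''.]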
Scheduling: Loops
Problems with unrolling:
• Exploits only parallelism within sets of n iterations
• Iteration start-up latency
• Code expansion
[Figure: resource utilization versus time for basic block scheduling, basic block scheduling and unrolling, and software pipelining]
Software pipelining
• Software pipelining a loop is:
  – Scheduling the loop such that iterations start before preceding iterations have finished
  Or:
  – Moving operations across the backedge
Example: y = a.x
• Sequential code (LD; ML; ST): 3 cycles/iteration
• Unrolling (3 iterations overlapped in 5 cycles): 5/3 cycles/iteration
    LD
    LD ML
    LD ML ST
    ML ST
    ST
• Software pipelining (the row LD ML ST repeats as the loop body): 1 cycle/iteration
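The three cycles-per-iteration figures can be reproduced with a few lines of Python. This is a sketch under stated assumptions: LD, ML and ST each take one cycle, a new iteration can start every cycle inside one unrolled body, and nothing overlaps across bodies. The function names are hypothetical.

```python
from fractions import Fraction

STAGES = ["LD", "ML", "ST"]  # one cycle each, as in the y = a.x example


def overlapped_schedule(n_iter, stages=STAGES, ii=1):
    """Rows of operations per cycle when iteration k starts at cycle k * ii."""
    length = (n_iter - 1) * ii + len(stages)
    rows = [[] for _ in range(length)]
    # Visit the newest iteration first so each row reads LD, ML, ST left to right.
    for k in reversed(range(n_iter)):
        for s, op in enumerate(stages):
            rows[k * ii + s].append(op)
    return rows


def cycles_per_iteration(unroll):
    """Cost per iteration of a body holding `unroll` overlapped iterations."""
    return Fraction(len(overlapped_schedule(unroll)), unroll)


print(cycles_per_iteration(1))   # no unrolling
print(cycles_per_iteration(3))   # unrolled three times
print(cycles_per_iteration(30))  # larger bodies approach 1 cycle/iteration
```

Unrolling by n costs (n + 2)/n cycles per iteration in this model, so it only approaches the 1 cycle/iteration that software pipelining reaches in its steady state.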
Software pipelining (cont’d)
Basic techniques:
• Modulo scheduling (Rau, Lam)
  – list scheduling with modulo resource constraints
• Kernel recognition techniques
  – unroll the loop
  – schedule the iterations
  – identify a repeating pattern
  – Examples:
    • Perfect pipelining (Aiken and Nicolau)
    • URPR (Su, Ding and Xia)
    • Petri net pipelining (Allan)
• Enhanced pipeline scheduling (Ebcioğlu)
  – fill first cycle of iteration
  – copy this instruction over the backedge
Software pipelining: Modulo scheduling
Example: Modulo scheduling a loop
(a) Example loop
for (i = 0; i < n; i++)
    a[i+6] = 3* a[i] - 1;
(b) Code without loop control
ld r1,(r2)
mul r3,r1,3
sub r4,r3,1
st r4,(r5)
(c) Software pipeline: four iterations, each starting one cycle after the previous one
            iteration 1    iteration 2    iteration 3    iteration 4
Prologue    ld r1,(r2)
            mul r3,r1,3    ld r1,(r2)
            sub r4,r3,1    mul r3,r1,3    ld r1,(r2)
Kernel      st r4,(r5)     sub r4,r3,1    mul r3,r1,3    ld r1,(r2)
Epilogue                   st r4,(r5)     sub r4,r3,1    mul r3,r1,3
                                          st r4,(r5)     sub r4,r3,1
                                                         st r4,(r5)
• Prologue fills the SW pipeline with iterations
• Epilogue drains the SW pipeline
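The prologue/kernel/epilogue split can be generated mechanically from the loop body. The sketch below assumes single-cycle operations, one pipeline stage per operation, and II = 1; `software_pipeline` and `expand` are hypothetical helper names, not part of any tool mentioned in these slides.

```python
def software_pipeline(ops):
    """Split an II = 1 pipeline into prologue rows, one kernel row, epilogue rows.

    Each row lists the operations issued in one cycle, oldest iteration first.
    """
    depth = len(ops)
    prologue = [ops[:c + 1][::-1] for c in range(depth - 1)]
    kernel = ops[::-1]
    epilogue = [ops[j:][::-1] for j in range(1, depth)]
    return prologue, kernel, epilogue


def expand(ops, n_iter):
    """Cycle-by-cycle schedule for n_iter iterations (requires n_iter >= len(ops))."""
    if n_iter < len(ops):
        raise ValueError("this sketch needs at least len(ops) iterations")
    prologue, kernel, epilogue = software_pipeline(ops)
    repeats = n_iter - len(ops) + 1  # how often the kernel row executes
    return prologue + [kernel] * repeats + epilogue


BODY = ["ld", "mul", "sub", "st"]
for row in expand(BODY, 5):
    print(" | ".join(row))
```

For n iterations the expanded schedule has n + 3 rows, and every operation is issued exactly n times, which is a quick sanity check that the prologue and epilogue complete the partial iterations.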
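Software pipelining: determine II, the Initiation Interval
Cyclic data dependences
For (i=0;.....)
A[i+6]= 3*A[i]-1
Dependence graph, edges labeled (delay, distance):
ld r1, (r2)
mul r3, r1, 3
sub r4, r3, 1
st r4, (r5)
• between ld and mul: (0,1) and (1,0)
• between mul and sub: (0,1) and (1,0)
• between sub and st: (0,1) and (1,0)
• from st back to ld: (1,6)
For every dependence edge (u,v):
\[ \mathit{cycle}(v) \ge \mathit{cycle}(u) + \mathit{delay}(u,v) - II \cdot \mathit{distance}(u,v) \]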
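The dependence constraint can be checked directly for a candidate schedule. The edge list below is one reading of the slide's graph and should be treated as an assumption: forward edges with (delay, distance) = (1,0), backward edges with (0,1), and the loop-carried memory edge from st to ld with (1,6). The helper name `violations` is hypothetical.

```python
# Edges as (u, v, delay, distance); an assumed reading of the slide's graph.
EDGES = [
    ("ld", "mul", 1, 0), ("mul", "ld", 0, 1),
    ("mul", "sub", 1, 0), ("sub", "mul", 0, 1),
    ("sub", "st", 1, 0), ("st", "sub", 0, 1),
    ("st", "ld", 1, 6),
]


def violations(cycle, ii, edges=EDGES):
    """Edges for which cycle(v) >= cycle(u) + delay - II * distance does not hold."""
    return [(u, v) for u, v, delay, dist in edges
            if cycle[v] < cycle[u] + delay - ii * dist]


good = {"ld": 0, "mul": 1, "sub": 2, "st": 3}
bad = {"ld": 0, "mul": 0, "sub": 2, "st": 3}  # mul issued too early
print(violations(good, ii=1))
print(violations(bad, ii=1))
```

Under these assumptions the schedule ld, mul, sub, st in consecutive cycles satisfies every edge at II = 1, while issuing mul in the same cycle as ld breaks the ld-to-mul edge.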
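Modulo scheduling constraints
MII, the minimum initiation interval, is bounded by cyclic dependences and resources:
\[ \mathrm{MII} = \max\{ \mathrm{ResMII}, \mathrm{RecMII} \} \]
Resources:
\[ \mathrm{ResMII} = \max_{r \in \mathit{resources}} \left\lceil \frac{\mathit{used}(r)}{\mathit{available}(r)} \right\rceil \]
Cycles (for every dependence cycle c through an operation v):
\[ \mathit{cycle}(v) \ge \mathit{cycle}(v) + \sum_{e \in c} \bigl( \mathit{delay}(e) - II \cdot \mathit{distance}(e) \bigr) \]
Therefore:
\[ \mathrm{RecMII} = \min \Bigl\{ II \in \mathbb{N} \;\Big|\; \forall c \in \mathit{cycles}: \; 0 \ge \sum_{e \in c} \bigl( \mathit{delay}(e) - II \cdot \mathit{distance}(e) \bigr) \Bigr\} \]
Or:
\[ \mathrm{RecMII} = \max_{c \in \mathit{cycles}} \left\lceil \frac{\sum_{e \in c} \mathit{delay}(e)}{\sum_{e \in c} \mathit{distance}(e)} \right\rceil \]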
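Both bounds are easy to compute for a small loop. The sketch below finds RecMII as the smallest II for which no dependence cycle has a positive sum of delay - II.distance, using a Floyd-Warshall style longest-path pass instead of enumerating cycles. The edge list is an assumed reading of the example loop's dependence graph, and the resource counts are hypothetical.

```python
def res_mii(used, available):
    """max over resources of ceil(used / available), in integer arithmetic."""
    return max(-(-used[r] // available[r]) for r in used)


def has_positive_cycle(nodes, edges, ii):
    """True if some cycle has sum(delay - ii * distance) > 0."""
    neg = float("-inf")
    longest = {u: {v: neg for v in nodes} for u in nodes}
    for u, v, delay, dist in edges:
        longest[u][v] = max(longest[u][v], delay - ii * dist)
    for k in nodes:
        for u in nodes:
            for v in nodes:
                via = longest[u][k] + longest[k][v]
                if via > longest[u][v]:
                    longest[u][v] = via
    return any(longest[n][n] > 0 for n in nodes)


def rec_mii(nodes, edges, max_ii=64):
    """Smallest II (up to max_ii) that satisfies all cyclic dependences."""
    for ii in range(1, max_ii + 1):
        if not has_positive_cycle(nodes, edges, ii):
            return ii
    raise ValueError("no feasible II up to max_ii")


NODES = ["ld", "mul", "sub", "st"]
# (u, v, delay, distance); assumed reading of the example's dependence graph.
EDGES = [("ld", "mul", 1, 0), ("mul", "ld", 0, 1),
         ("mul", "sub", 1, 0), ("sub", "mul", 0, 1),
         ("sub", "st", 1, 0), ("st", "sub", 0, 1),
         ("st", "ld", 1, 6)]
USED = {"load/store": 2, "alu": 2}       # ld + st, mul + sub
AVAILABLE = {"load/store": 2, "alu": 2}  # hypothetical machine

mii = max(res_mii(USED, AVAILABLE), rec_mii(NODES, EDGES))
print(mii)
```

With a single load/store unit in this hypothetical machine, ResMII would rise to 2 and become the binding constraint, even though the recurrences alone would allow II = 1.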
The Role of the Compiler
9 steps required to translate an HLL program (see online book chapter)
• Front-end compilation
• Determine dependencies
• Graph partitioning: make multiple threads (or tasks)
• Bind partitions to compute nodes
• Bind operands to locations
• Bind operations to time slots: Scheduling
• Bind operations to functional units
• Bind transports to buses
• Execute operations and perform transports
Division of responsibilities between hardware and compiler
[Figure: the translation steps from Application to Execute (Frontend, Determine Dependencies, Binding of Operands, Scheduling, Binding of Operations, Binding of Transports), each marked as the responsibility of the compiler or of the hardware, for these architecture classes: Superscalar, Dataflow, Multi-threaded, Indep. Arch, VLIW, TTA]
Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering
• Code generation
• Design Space Exploration: TTA framework
Mapping applications to processors
MOVE framework
[Figure: MOVE framework for a TTA based system. Components: architecture parameters, user interaction, optimizer, parametric compiler (output: parallel object code), hardware generator (output: chip), with feedback to the optimizer. A scatter plot of design points shows execution time versus cost, with the Pareto curve marking the solution space.]
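The Pareto curve of such an exploration keeps only the design points that no other point beats on both cost and execution time. A minimal sketch follows; the function name and the sample points are made up for illustration and are not results from the MOVE framework.

```python
def pareto_front(points):
    """Keep (cost, exec_time) points not dominated by another point (lower is better)."""
    front = []
    for p in points:
        dominated = any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in points)
        if not dominated:
            front.append(p)
    return sorted(set(front))


# Hypothetical design points: (cost, execution time).
DESIGNS = [(1, 10), (2, 7), (3, 8), (4, 4), (5, 5), (6, 3)]
print(pareto_front(DESIGNS))
```

The quadratic scan is fine for a few hundred design points; a larger exploration would sort by cost first and sweep once.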
TTA (MOVE) organization
[Figure: units attached through sockets to the transport buses: instruction unit, immediate unit, two load/store units, two integer ALUs, float ALU, integer RF, float RF, boolean RF; Data Memory and Instruction Memory sit outside the data path.]
Code generation trajectory for TTAs
[Figure: Application (C) → Compiler frontend → Sequential code → Compiler backend → Parallel code. Sequential code goes through sequential simulation and parallel code through parallel simulation, each with Input/Output. Other labels: Architecture description, Profiling data.]
• Frontend: GCC or SUIF (adapted)
Exploration: TTA resource reduction
Exploration: TTA connectivity reduction
[Figure: execution time versus number of connections removed. Annotations: reducing bus delay; FU stage constrains cycle time; critical connections disappear.]
Can we do better?
How?
• Transformations
• SFUs: Special Function Units
• Multiple Processors
[Figure: execution time versus cost]
Transforming the specification
[Figure: two addition trees, each built from three + operations]
Based on associativity of the + operation:
a + (b + c) = (a + b) + c
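Associativity lets a chain of additions be rebalanced into a shallower tree, which shortens the critical path when several adders are available. The sketch below represents expressions as nested tuples; all helper names are hypothetical. The rewrite is exact for integers; floating-point addition is not associative, so a compiler may only apply it there when allowed to.

```python
def chain(operands):
    """Left-to-right chain ((a + b) + c) + d as nested tuples."""
    expr = operands[0]
    for x in operands[1:]:
        expr = ("+", expr, x)
    return expr


def balanced(operands):
    """Balanced tree (a + b) + (c + d) over the same operands."""
    if len(operands) == 1:
        return operands[0]
    mid = len(operands) // 2
    return ("+", balanced(operands[:mid]), balanced(operands[mid:]))


def depth(expr):
    """Number of additions on the longest path (the critical path length)."""
    if not isinstance(expr, tuple):
        return 0
    return 1 + max(depth(expr[1]), depth(expr[2]))


def evaluate(expr, env):
    if not isinstance(expr, tuple):
        return env[expr]
    return evaluate(expr[1], env) + evaluate(expr[2], env)


names = ["a", "b", "c", "d"]
env = {"a": 1, "b": 2, "c": 3, "d": 4}
print(depth(chain(names)), depth(balanced(names)))
```

For four operands the chain needs three dependent additions while the balanced tree needs two levels; the operation count stays at three in both cases.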
Transforming the specification
Before:
d = a * b;
e = a + d;
f = 2 * b + d;
r = f - e;
x = z + y;
After:
r = 2*b - a;
x = z + y;
[Figure: dataflow graph of the transformed code: r = (b << 1) - a and x = z + y]
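The simplification follows from r = f - e = (2*b + d) - (a + d) = 2*b - a, so d cancels and the multiplication disappears. A small Python check, assuming integer operands (the function names are hypothetical):

```python
def original(a, b, y, z):
    d = a * b
    e = a + d
    f = 2 * b + d
    r = f - e
    x = z + y
    return r, x


def transformed(a, b, y, z):
    r = (b << 1) - a  # 2*b written as a shift, as in the dataflow graph
    x = z + y
    return r, x


# Exhaustive comparison over a small range of integer inputs.
same = all(original(a, b, 3, 4) == transformed(a, b, 3, 4)
           for a in range(-8, 9) for b in range(-8, 9))
print(same)
```

The transformed code needs one shift, one subtraction and one addition, and the two results r and x no longer share any operation, so they can be computed in parallel.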
Changing the architecture
adding SFUs: special function units
[Figure: addition trees built from + operations, and a 4-input adder]
4-input adder: why is this faster?
Changing the architecture
adding SFUs: special function units
In the extreme case put everything into one unit!
Spatial mapping: no control flow
However: no flexibility / programmability!
SFUs: fine grain patterns
• Why use fine-grain SFUs:
  – Code size reduction
  – Register file #ports reduction
  – Could be cheaper and/or faster
  – Transport reduction
  – Power reduction (avoid charging non-local wires)
  – Supports whole application domain!
Which patterns need support?
• Detection of recurring operation patterns needed
SFUs: covering results
Exploration: resulting architecture
[Figure: resulting TTA with 9 buses, 4 RFs, 4 Addercmp FUs, 2 Multiplier FUs, 2 Diffadd FUs, stream input, and stream output]
Architecture for image processing
• Note the reduced connectivity
Conclusions
• Billions of embedded processing systems
  – how to design these systems quickly, cheaply, correctly, at low power, ...?
  – what will their processing platform look like?
• VLIWs are very powerful and flexible
  – can be easily tuned to application domain
• TTAs are even more flexible, scalable, and lower power
Conclusions
• Compilation for ILP architectures is mature, and
• Enters the commercial area.
• However
  – Great discrepancy between available and exploitable parallelism
• Advanced code scheduling techniques needed to exploit ILP
Bottom line:
Hands-on (not this year)
• Map JPEG to a TTA processor
  – see web page: http://www.ics.ele.tue.nl/~heco/courses/pam
• Install TTA tools (compiler and simulator)
• Go through all listed steps
• Perform DSE: design space exploration
• Add SFU
• 1 or 2 page report in 2 weeks
Hands-on
• Let’s look at DSE: Design Space Exploration
• We will use the Imagine processor
• http://cva.stanford.edu/projects/imagine/