Platform-based Design TU/e 5kk70 Henk Corporaal Bart Mesman ILP compilation (part b)


04/19/23 Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman

Overview

• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering
• Code generation / scheduling
• Design Space Exploration: TTA framework


Scheduling: Overview

Transforming a sequential program into a parallel program:

read sequential program
read machine description file
for each procedure do
    perform function inlining
for each procedure do
    transform an irreducible CFG into a reducible CFG
    perform control flow analysis
    perform loop unrolling
    perform data flow analysis
    perform memory reference disambiguation
    perform register allocation
    for each scheduling scope do
        perform instruction scheduling
write parallel program


Extended basic block scheduling: Code Motion

Basic blocks:

A:  a) add r3, r4, 4
    b) beq . . .
B:  c) add r1, r1, r2
C:  d) sub r3, r3, r2
D:  e) mul r1, r1, r3

• Downward code motions?
  – a → B, a → C, a → D, c → D, d → D
• Upward code motions?
  – c → A, d → A, e → B, e → C, e → A


Extended Scheduling scope

Code:

A;
If cond Then B Else C;
D;
If cond Then E Else F;
G;

CFG (Control Flow Graph):

[Figure: A branches to B and C, which join at D; D branches to E and F, which join at G]



Scheduling scopes

Trace Superblock Decision tree Hyperblock/region


Create and Enlarge Scheduling Scope

[Figure: the CFG with blocks A to G, shown as a trace and as a superblock. Forming the superblock uses tail duplication, which adds the copies D’, E’ and G’.]
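Tail duplication for superblock formation can be sketched as follows. This is a minimal illustration, not the course tooling: the function name `tail_duplicate`, the successor-map representation, and the chosen trace A, B, D, E, G are assumptions, and the CFG is the one given by the code "A; If cond Then B Else C; D; If cond Then E Else F; G;". Loop back edges into the trace are not handled.

```python
def tail_duplicate(succ, trace):
    """Return a new successor map in which `trace` has no side entries.

    succ  : dict mapping each block name to its list of successor blocks
    trace : list of block names forming the selected trace (hypothetical input)
    """
    preds = {n: [] for n in succ}
    for u, vs in succ.items():
        for v in vs:
            preds[v].append(u)

    # First trace block that is entered from anywhere but its trace predecessor.
    first = next((i for i in range(1, len(trace))
                  if any(p != trace[i - 1] for p in preds[trace[i]])), None)
    if first is None:
        return {u: list(vs) for u, vs in succ.items()}

    tail = trace[first:]
    copy = {n: n + "'" for n in tail}
    trace_pred = {trace[i]: trace[i - 1] for i in range(1, len(trace))}

    new_succ = {}
    for u, vs in succ.items():
        # Only the on-trace edge keeps pointing at the original tail block;
        # every side entry is redirected to the duplicated block.
        new_succ[u] = [copy[v] if v in copy and trace_pred[v] != u else v
                       for v in vs]
    for n in tail:
        # A copy branches to copies of tail blocks, otherwise to the original.
        new_succ[copy[n]] = [copy.get(v, v) for v in succ[n]]
    return new_succ


cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"],
       "D": ["E", "F"], "E": ["G"], "F": ["G"], "G": []}
superblock_cfg = tail_duplicate(cfg, ["A", "B", "D", "E", "G"])
```

In the result, C branches to D’ and F branches to G’, so the trace A, B, D, E, G is entered only at A.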


Create and Enlarge Scheduling Scope

[Figure: the same CFG with blocks A to G, shown as a hyperblock/region and as a decision tree. Forming the decision tree uses tail duplication, which adds the copies D’, E’, F’, G’ and G’’.]


Comparing scheduling scopes

                           Trace  Superblock  Hyperblock  Decision tree  Region
Multiple execution paths   No     No          Yes         Yes            Yes
Side-entries allowed       Yes    No          No          No             No
Join points allowed        Yes    No          Yes         No             Yes
Code motion down joins     Yes    No          No          No             No
Must be if-convertible     No     No          Yes         No             No
Tail dup. before sched.    No     Yes         No          Yes            No


Code movement (upwards) within regions

[Figure: an add operation moves upwards from its source block to its destination block, passing intermediate blocks (I). Legend: code movement; copy needed; check for off-liveness.]


Extended basic block scheduling: Code Motion

• A dominates B ⇒ A is always executed before B
  – Consequently:
    • A does not dominate B ⇒ code motion from B to A requires code duplication
• B post-dominates A ⇒ B is always executed after A
  – Consequently:
    • B does not post-dominate A ⇒ code motion from B to A is speculative

[Figure: CFG with blocks A, B, C, D, E, F]

Q1: does C dominate E?
Q2: does C dominate D?
Q3: does F post-dominate D?
Q4: does D post-dominate B?
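Dominance questions like these can be answered mechanically with the standard iterative data-flow formulation. The sketch below is illustrative only: the edges of the quiz CFG are not recoverable from the slide text, so it uses the CFG of the earlier example "A; If cond Then B Else C; D; If cond Then E Else F; G;". The helper names `dominators` and `reverse` are assumptions, and every block is assumed reachable from the entry.

```python
def dominators(succ, entry):
    """Map each block to the set of blocks that dominate it."""
    nodes = set(succ)
    preds = {n: set() for n in nodes}
    for u, vs in succ.items():
        for v in vs:
            preds[v].add(u)

    dom = {n: set(nodes) for n in nodes}   # start from "everything dominates"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            # n is dominated by itself plus whatever dominates all predecessors.
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def reverse(succ):
    """Reverse all edges; dominators of the reversed CFG are post-dominators."""
    rev = {n: [] for n in succ}
    for u, vs in succ.items():
        for v in vs:
            rev[v].append(u)
    return rev


cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"],
       "D": ["E", "F"], "E": ["G"], "F": ["G"], "G": []}
dom = dominators(cfg, "A")
pdom = dominators(reverse(cfg), "G")
```

Here `"D" in dom["G"]` holds (D dominates G), while `"B" in dom["D"]` does not, so moving code from D up into B would need a copy in C. Likewise `"D" in pdom["A"]` holds, whereas `"E" in pdom["D"]` does not, so moving code from E up into D is speculative.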


Scheduling: Loops

Loop Optimizations:

[Figure: a loop over blocks A, B, C and D, shown after loop peeling and after loop unrolling; both transformed versions contain the copies C’ and C’’.]


Scheduling: Loops

Problems with unrolling:

• Exploits only parallelism within sets of n iterations
• Iteration start-up latency
• Code expansion

[Figure: resource utilization over time for basic block scheduling, basic block scheduling and unrolling, and software pipelining]


Software pipelining

• Software pipelining a loop is:
  – Scheduling the loop such that iterations start before preceding iterations have finished
  Or:
  – Moving operations across the backedge

Example: y = a.x (each iteration executes LD, ML, ST)

• Without overlap: 3 cycles/iteration
• Unrolling: 5/3 cycles/iteration
• Software pipelining: 1 cycle/iteration

Three overlapped iterations:

cycle 1: LD
cycle 2: LD ML
cycle 3: LD ML ST
cycle 4: ML ST
cycle 5: ST
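The three cycle counts follow from simple arithmetic, sketched below. The sketch assumes, as the example suggests, that LD, ML and ST each take one cycle and that the machine can issue one of each per cycle; the function names are illustrative.

```python
from fractions import Fraction

STAGES = 3  # LD, ML, ST: one cycle each (assumption taken from the example)


def cycles_no_overlap(n):
    """Every iteration runs LD, ML, ST back to back."""
    return STAGES * n


def cycles_unrolled(n, factor):
    """Groups of `factor` iterations overlap internally, but groups do not.

    A group takes STAGES + factor - 1 cycles; a partial last group is
    charged as a full one to keep the sketch simple.
    """
    groups = -(-n // factor)  # ceiling division
    return groups * (STAGES + factor - 1)


def cycles_pipelined(n):
    """A new iteration starts every cycle; the last one needs STAGES cycles."""
    return n + STAGES - 1


per_iter_unrolled = Fraction(cycles_unrolled(3, 3), 3)         # 5/3
per_iter_pipelined = Fraction(cycles_pipelined(1000), 1000)    # close to 1
```

With unrolling, the start-up and drain cycles are paid once per group of n iterations; with software pipelining they are paid once for the whole loop, which is why the cost approaches 1 cycle/iteration.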


Software pipelining (cont’d)

Basic techniques:

• Modulo scheduling (Rau, Lam)
  – list scheduling with modulo resource constraints
• Kernel recognition techniques
  – unroll the loop
  – schedule the iterations
  – identify a repeating pattern
  – Examples:
    • Perfect pipelining (Aiken and Nicolau)
    • URPR (Su, Ding and Xia)
    • Petri net pipelining (Allan)
• Enhanced pipeline scheduling (Ebcioğlu)
  – fill first cycle of iteration
  – copy this instruction over the backedge


Software pipelining: Modulo scheduling

Example: Modulo scheduling a loop

(a) Example loop

for (i = 0; i < n; i++)
    a[i+6] = 3 * a[i] - 1;

(b) Code without loop control

ld  r1,(r2)
mul r3,r1,3
sub r4,r3,1
st  r4,(r5)

(c) Software pipeline (four overlapped iterations, one per column)

Prologue   ld  r1,(r2)
           mul r3,r1,3   ld  r1,(r2)
           sub r4,r3,1   mul r3,r1,3   ld  r1,(r2)
Kernel     st  r4,(r5)   sub r4,r3,1   mul r3,r1,3   ld  r1,(r2)
Epilogue                 st  r4,(r5)   sub r4,r3,1   mul r3,r1,3
                                       st  r4,(r5)   sub r4,r3,1
                                                     st  r4,(r5)

• Prologue fills the SW pipeline with iterations
• Epilogue drains the SW pipeline
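The prologue, kernel and epilogue can be generated from the loop body. The sketch below assumes that every operation takes one cycle, that the operations of one iteration issue in consecutive cycles, and that the initiation interval is 1; `software_pipeline` is an illustrative name, not part of any tool mentioned in the course.

```python
OPS = ["ld", "mul", "sub", "st"]  # loop body without loop control


def software_pipeline(ops, iterations, ii=1):
    """Return one row per cycle; a row lists (iteration, op) pairs issued then.

    Iteration k starts in cycle k * ii and issues ops[s] in cycle k * ii + s.
    """
    depth = len(ops)
    total_cycles = (iterations - 1) * ii + depth
    rows = []
    for t in range(total_cycles):
        row = []
        for k in range(iterations):
            s = t - k * ii
            if 0 <= s < depth:
                row.append((k, ops[s]))
        rows.append(row)
    return rows


rows = software_pipeline(OPS, iterations=4)
for t, row in enumerate(rows):
    print(t, " ".join(op for _, op in row))
```

For four iterations this gives seven rows. Rows 0 to 2 fill the pipeline (prologue), row 3 contains all four operations (the kernel, which repeats when there are more iterations), and rows 4 to 6 drain it (epilogue).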


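Software pipelining: determine II, the Initiation Interval

For (i=0;.....)
    A[i+6] = 3*A[i] - 1

Cyclic data dependences, with edges labeled (delay, distance):

[Figure: dependence graph over ld r1,(r2), mul r3,r1,3, sub r4,r3,1 and st r4,(r5). Each pair of consecutive operations is connected by two edges labeled (1,0) and (0,1); one further edge is labeled (1,6), matching the distance of 6 iterations between the store to A[i+6] and the load of A[i].]

Every dependence edge from u to v must satisfy:

$$\mathit{cycle}(v) \ge \mathit{cycle}(u) + \mathit{delay}(u,v) - II \cdot \mathit{distance}(u,v)$$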


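Modulo scheduling constraints

MII, the minimum initiation interval, is bounded by cyclic dependences and resources:

MII = max{ ResMII, RecMII }

Resources:

$$\mathit{ResMII} = \max_{r \in \mathit{resources}} \left\lceil \frac{\mathit{used}(r)}{\mathit{available}(r)} \right\rceil$$

Cycles: for every dependence cycle c through an operation v,

$$\mathit{cycle}(v) \ge \mathit{cycle}(v) + \sum_{e \in c} \left( \mathit{delay}(e) - II \cdot \mathit{distance}(e) \right)$$

Therefore:

$$\mathit{RecMII} = \min \left\{ II \in \mathbb{N} \;\middle|\; \forall c \in \mathit{cycles}: 0 \ge \sum_{e \in c} \left( \mathit{delay}(e) - II \cdot \mathit{distance}(e) \right) \right\}$$

Or:

$$\mathit{RecMII} = \max_{c \in \mathit{cycles}} \left\lceil \frac{\sum_{e \in c} \mathit{delay}(e)}{\sum_{e \in c} \mathit{distance}(e)} \right\rceil$$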
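The two bounds can be computed as sketched below. RecMII uses the "smallest II for which no dependence cycle has positive total weight" formulation, checked with a Bellman-Ford style longest-path relaxation. Two things are assumptions rather than facts from the slides: the edge directions in `EDGES` are one reading of the dependence graph for the loop A[i+6] = 3*A[i] - 1 (forward edges (1,0), backward edges (0,1), store-to-load edge (1,6)), and the resource counts are hypothetical.

```python
from math import ceil


def res_mii(used, available):
    """ResMII: the busiest resource class limits the initiation interval."""
    return max(ceil(used[r] / available[r]) for r in used)


def has_positive_cycle(nodes, edges, ii):
    """True if some dependence cycle has sum(delay - ii * distance) > 0."""
    dist = {n: 0 for n in nodes}
    for _ in range(len(nodes)):
        changed = False
        for u, v, delay, distance in edges:
            w = delay - ii * distance
            if dist[u] + w > dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return False      # longest paths converged: no positive cycle
    return True               # still relaxing after |V| passes


def rec_mii(nodes, edges, max_ii=64):
    """RecMII: smallest II >= 1 that satisfies every dependence cycle."""
    for ii in range(1, max_ii + 1):
        if not has_positive_cycle(nodes, edges, ii):
            return ii
    raise ValueError("no feasible II up to max_ii")


NODES = ["ld", "mul", "sub", "st"]
EDGES = [  # (from, to, delay, distance); directions are an assumed reading
    ("ld", "mul", 1, 0), ("mul", "ld", 0, 1),
    ("mul", "sub", 1, 0), ("sub", "mul", 0, 1),
    ("sub", "st", 1, 0), ("st", "sub", 0, 1),
    ("st", "ld", 1, 6),
]
USED = {"load_store": 2, "multiplier": 1, "alu": 1}       # ops per iteration
AVAILABLE = {"load_store": 2, "multiplier": 1, "alu": 1}  # hypothetical machine

mii = max(res_mii(USED, AVAILABLE), rec_mii(NODES, EDGES))
```

Under these assumptions both bounds are 1, so MII = 1. With only one load/store unit, ResMII would rise to 2 because the load and the store compete for the same unit.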


The Role of the Compiler

9 steps required to translate an HLL program (see online book chapter)

• Front-end compilation

• Determine dependencies

• Graph partitioning: make multiple threads (or tasks)

• Bind partitions to compute nodes

• Bind operands to locations

• Bind operations to time slots: Scheduling

• Bind operations to functional units

• Bind transports to buses

• Execute operations and perform transports


Division of responsibilities between hardware and compiler

[Figure: starting from the Application, the translation steps Frontend, Determine Dependencies, Binding of Operands, Scheduling, Binding of Operations, Binding of Transports and Execute. For each architecture class (Superscalar, Dataflow, Multi-threaded, Indep. Arch, VLIW, TTA) the figure marks which steps are the responsibility of the compiler and which are the responsibility of the hardware.]


Overview

• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering
• Code generation
• Design Space Exploration: TTA framework


Mapping applications to processors: MOVE framework

[Figure: Move framework. Elements: user interaction, optimizer, architecture parameters, parametric compiler (output: parallel object code), hardware generator (output: chip), and feedback paths; together these yield a TTA based system. A Pareto curve (solution space) plots execution time against cost.]
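The Pareto curve in the solution space consists of the design points that no other point beats on both cost and execution time. A minimal sketch of that filter follows; the function name and the sample points are made up for illustration and are not results from the MOVE framework.

```python
def pareto_front(points):
    """Keep the (cost, exec_time) points not dominated by any other point.

    A point q dominates p if q is no worse in both dimensions and differs
    from p; lower is better for both cost and execution time.
    """
    front = []
    for p in points:
        dominated = any(q != p and q[0] <= p[0] and q[1] <= p[1]
                        for q in points)
        if not dominated:
            front.append(p)
    return sorted(front)


# Hypothetical design points: (cost, execution time)
designs = [(1, 10), (2, 7), (3, 8), (4, 3), (5, 3)]
front = pareto_front(designs)
```

Here (3, 8) is dropped because (2, 7) is both cheaper and faster, and (5, 3) is dropped because (4, 3) is cheaper at the same speed.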


TTA (MOVE) organization

[Figure: function units and register files, each attached through sockets: two load/store units, two integer ALUs, a float ALU, an integer RF, a float RF, a boolean RF, an instruction unit and an immediate unit, plus Data Memory and Instruction Memory.]


Code generation trajectory for TTAs

[Figure: Application (C) → Compiler frontend → Sequential code → Compiler backend → Parallel code. Also shown: architecture description, profiling data, sequential simulation and parallel simulation, each simulation with Input/Output.]

• Frontend: GCC or SUIF (adapted)



Exploration: TTA resource reduction


Exploration: TTA connectivity reduction

[Figure: execution time against number of connections removed, starting at 0. Annotations: reducing bus delay; FU stage constrains cycle time; critical connections disappear.]


Can we do better?

How?

• Transformations
• SFUs: Special Function Units
• Multiple Processors

[Figure: execution time against cost]


Transforming the specification

[Figure: two arrangements of + operations computing the same sum]

Based on associativity of the + operation: a + (b + c) = (a + b) + c
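Why reassociation helps scheduling can be seen by comparing the depth (critical path length) of a chain of additions with that of a balanced tree. The sketch below represents expressions as nested tuples, which is an illustrative choice. It holds for integer addition; floating-point addition is not associative in general, so a compiler may not be free to apply it there.

```python
def chain(xs):
    """((a + b) + c) + d: every addition waits for the previous one."""
    expr = xs[0]
    for x in xs[1:]:
        expr = ("+", expr, x)
    return expr


def balanced(xs):
    """(a + b) + (c + d): independent additions can run in parallel."""
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return ("+", balanced(xs[:mid]), balanced(xs[mid:]))


def depth(expr):
    """Number of additions on the longest dependence path."""
    if not isinstance(expr, tuple):
        return 0
    return 1 + max(depth(expr[1]), depth(expr[2]))


def value(expr, env):
    """Evaluate the expression with variable values taken from env."""
    if not isinstance(expr, tuple):
        return env[expr]
    return value(expr[1], env) + value(expr[2], env)


names = ["a", "b", "c", "d"]
env = {"a": 1, "b": 2, "c": 3, "d": 4}
chain_depth = depth(chain(names))        # 3
tree_depth = depth(balanced(names))      # 2
```

Both forms use three additions and give the same result, but the balanced form has a critical path of 2 instead of 3, so a machine with two adders can finish one cycle earlier.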


Transforming the specification

Original:

d = a * b;
e = a + d;
f = 2 * b + d;
r = f - e;
x = z + y;

Transformed:

r = 2*b - a;
x = z + y;

[Figure: data flow graph of the transformed code, computing r = (b << 1) - a and x = z + y]
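The simplification follows from r = f - e = (2*b + d) - (a + d) = 2*b - a, so the multiplication a * b is no longer needed, and 2*b can be computed as a shift. The sketch below checks the equivalence on random integers; it uses Python's unbounded integers, so overflow behaviour of a real machine word is outside its scope, and the function names are illustrative.

```python
import random


def original(a, b, y, z):
    d = a * b
    e = a + d
    f = 2 * b + d
    r = f - e
    x = z + y
    return r, x


def transformed(a, b, y, z):
    r = (b << 1) - a   # 2*b as a shift; d cancels out of f - e
    x = z + y
    return r, x


random.seed(0)
samples = [tuple(random.randint(-1000, 1000) for _ in range(4))
           for _ in range(1000)]
all_equal = all(original(*s) == transformed(*s) for s in samples)
```

The transformed code needs one shift, one subtraction and one addition instead of two multiplications, three additions and one subtraction, and r no longer depends on a chain of four operations.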


Changing the architecture: adding SFUs (special function units)

[Figure: additions built from 2-input + operations, replaced by a single 4-input adder]

4-input adder: why is this faster?


Changing the architecture: adding SFUs (special function units)

In the extreme case, put everything into one unit!

Spatial mapping: no control flow

However: no flexibility / programmability!


SFUs: fine grain patterns

• Why use fine grain SFUs:
  – Code size reduction
  – Register file #ports reduction
  – Could be cheaper and/or faster
  – Transport reduction
  – Power reduction (avoid charging non-local wires)
  – Supports whole application domain!

Which patterns need support?

• Detection of recurring operation patterns needed
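A very simple form of pattern detection counts how often one operation type feeds another in the data flow graph; frequent pairs are candidates for a fine grain SFU. The sketch below only handles two-operation patterns, and both the graph and the function name are made up for illustration. It is not the detection method used in the course tools.

```python
from collections import Counter


def count_op_pairs(ops, edges):
    """Count (producer op, consumer op) pairs over all data flow edges.

    ops   : dict mapping node id to operation name
    edges : list of (producer id, consumer id) pairs
    """
    return Counter((ops[u], ops[v]) for u, v in edges)


# Hypothetical data flow graph
ops = {1: "mul", 2: "add", 3: "mul", 4: "add", 5: "sub"}
edges = [(1, 2), (3, 4), (2, 5)]

pairs = count_op_pairs(ops, edges)
best_pattern, occurrences = pairs.most_common(1)[0]
```

In this graph the pair ("mul", "add") occurs twice, which would make a multiply-add unit the first candidate.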



SFUs: covering results


Exploration: resulting architecture

[Figure: 9 buses, 4 RFs, 4 Addercmp FUs, 2 Multiplier FUs, 2 Diffadd FUs, stream input, stream output]

Architecture for image processing

• Note the reduced connectivity


Conclusions

• Billions of embedded processing systems
  – how to design these systems quickly, cheaply, correctly, at low power, ...?
  – what will their processing platform look like?
• VLIWs are very powerful and flexible
  – can be easily tuned to application domain
• TTAs are even more flexible, scalable, and lower power


Conclusions

• Compilation for ILP architectures is mature and is entering the commercial arena
• However
  – Great discrepancy between available and exploitable parallelism
• Advanced code scheduling techniques needed to exploit ILP



Bottom line:


Hands-on (not this year)

• Map JPEG to a TTA processor
  – see web page: http://www.ics.ele.tue.nl/~heco/courses/pam
• Install TTA tools (compiler and simulator)
• Go through all listed steps
• Perform DSE: design space exploration
• Add SFU
• 1 or 2 page report in 2 weeks



Hands-on

• Let’s look at DSE: Design Space Exploration

• We will use the Imagine processor

• http://cva.stanford.edu/projects/imagine/
