Platform-based Design TU/e 5kk70 Henk Corporaal Bart Mesman ILP compilation (part b)

Platform-based Design

TU/e 5kk70

Henk Corporaal

Bart Mesman

ILP compilation (part b)


04/19/23 Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman


Overview

• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering
• Code generation / scheduling
• Design Space Exploration: TTA framework


Scheduling: Overview

Transforming a sequential program into a parallel program:

read sequential program
read machine description file
for each procedure do
    perform function inlining
for each procedure do
    transform an irreducible CFG into a reducible CFG
    perform control flow analysis
    perform loop unrolling
    perform data flow analysis
    perform memory reference disambiguation
    perform register allocation
    for each scheduling scope do
        perform instruction scheduling
write parallel program


Extended basic block scheduling: Code Motion

A:  a) add r3, r4, 4
    b) beq . . .
B:  c) add r1, r1, r2
C:  d) sub r3, r3, r2
D:  e) mul r1, r1, r3

• Downward code motions?
  – a → B, a → C, a → D, c → D, d → D

• Upward code motions?
  – c → A, d → A, e → B, e → C, e → A


Extended Scheduling scope

Code:

A;
If cond Then B Else C;
D;
If cond Then E Else F;
G;

CFG (Control Flow Graph): A → {B, C} → D → {E, F} → G


Scheduling scopes

[Figure: four kinds of scheduling scope: Trace, Superblock, Decision tree, Hyperblock/region]


Create and Enlarge Scheduling Scope

[Figure: the CFG with blocks A to G shown twice. Left: Trace. Right: Superblock, in which tail duplication adds the copies D’, E’ and G’.]


Create and Enlarge Scheduling Scope

[Figure: the CFG with blocks A to G shown twice. Left: Hyperblock/region. Right: Decision Tree, in which tail duplication adds the copies D’, E’, F’, G’ and G’’.]


Comparing scheduling scopes

                          Trace  Sup.block  Hyp.block  Dec.Tree  Region
Multiple exec. paths      No     No         Yes        Yes       Yes
Side-entries allowed      Yes    No         No         No        No
Join points allowed       Yes    No         Yes        No        Yes
Code motion down joins    Yes    No         No         No        No
Must be if-convertible    No     No         Yes        No        No
Tail dup. before sched.   No     Yes        No         Yes       No


Code movement (upwards) within regions

[Figure: an add operation moved upward from its source block, through intermediate blocks, to a destination block. Legend: code movement; copy needed; intermediate block; check for off-liveness.]


Extended basic block scheduling: Code Motion

• A dominates B ⇒ A is always executed before B
  – Consequently: A does not dominate B ⇒ code motion from B to A requires code duplication

• B post-dominates A ⇒ B is always executed after A
  – Consequently: B does not post-dominate A ⇒ code motion from B to A is speculative

[Figure: CFG with blocks A, B, C, D, E, F]

Q1: does C dominate E?

Q2: does C dominate D?

Q3: does F post-dominate D?

Q4: does D post-dominate B?
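
Dominance and post-dominance can be computed mechanically. The following is a minimal sketch, assuming the CFG of the earlier "Extended Scheduling scope" slide (A → {B, C} → D → {E, F} → G), because the edges of the CFG on this slide are not recoverable from the text. The function names `dominators` and `reverse` are illustrative, not part of any course tooling.

```python
def dominators(succ, entry):
    """Iteratively compute dom[b], the set of blocks that dominate block b.

    succ maps every block to the list of its successors.
    """
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for n, targets in succ.items():
        for t in targets:
            pred[t].add(n)

    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in sorted(nodes - {entry}):
            pred_doms = [dom[p] for p in pred[n]]
            common = set.intersection(*pred_doms) if pred_doms else set()
            new = {n} | common
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def reverse(succ):
    """Return the CFG with all edges reversed (used for post-dominance)."""
    rev = {n: [] for n in succ}
    for n, targets in succ.items():
        for t in targets:
            rev[t].append(n)
    return rev


# Assumed example CFG: A; if .. then B else C; D; if .. then E else F; G
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"],
       "D": ["E", "F"], "E": ["G"], "F": ["G"], "G": []}

dom = dominators(cfg, "A")             # dominators from the entry block
pdom = dominators(reverse(cfg), "G")   # post-dominators from the exit block

print("B dominates D:", "B" in dom["D"])        # moving code from D up to B needs a copy in C
print("D post-dominates B:", "D" in pdom["B"])  # moving code from D up to B is not speculative
```

Post-dominators are simply dominators of the reversed graph, which is why one function serves both questions.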


Scheduling: Loops

Loop Optimizations:

[Figure: a loop CFG with blocks A, B, C, D, and the same CFG after loop peeling and after loop unrolling; both transformed versions contain the copies C’ and C’’.]


Scheduling: Loops

Problems with unrolling:

• Exploits only parallelism within sets of n iterations

• Iteration start-up latency

• Code expansion

[Figure: resource utilization versus time for three cases: basic block scheduling, basic block scheduling and unrolling, software pipelining]


Software pipelining

• Software pipelining a loop is:
  – Scheduling the loop such that iterations start before preceding iterations have finished
  Or:
  – Moving operations across the backedge

Example: y = a.x

Sequential: 3 cycles/iteration

LD
ML
ST

Unrolling: 5/3 cycles/iteration

LD
LD ML
LD ML ST
ML ST
ST

Software pipelining: 1 cycle/iteration

LD
LD ML
LD ML ST
ML ST
ST
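
The three cycle counts can be reproduced with a few lines of arithmetic. This is a minimal sketch under stated assumptions: every operation takes one cycle, one LD, one ML and one ST can issue in the same cycle, and the unroll factor is 3 (which is what the 5/3 figure implies). The function names are illustrative.

```python
def sequential_cycles(n_iter, n_ops=3):
    """Each iteration runs its n_ops operations back to back."""
    return n_iter * n_ops


def unrolled_cycles(n_iter, unroll=3, n_ops=3):
    """Overlap `unroll` iterations inside one unrolled body.

    A body that overlaps k iterations needs k + n_ops - 1 cycles,
    but consecutive bodies do not overlap with each other.
    """
    full, rest = divmod(n_iter, unroll)
    cycles = full * (unroll + n_ops - 1)
    if rest:
        cycles += rest + n_ops - 1
    return cycles


def pipelined_cycles(n_iter, n_ops=3):
    """Start a new iteration every cycle; pay the fill/drain cost once."""
    return n_iter + n_ops - 1 if n_iter else 0


n = 300
for name, fn in [("sequential", sequential_cycles),
                 ("unrolled x3", unrolled_cycles),
                 ("software pipelined", pipelined_cycles)]:
    total = fn(n)
    print(f"{name:20s} {total:4d} cycles, {total / n:.2f} cycles/iteration")
```

Unrolling pays the start-up latency once per unrolled body, whereas software pipelining pays it once per loop, so its cost per iteration approaches 1 as the iteration count grows.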


Software pipelining (cont’d)

Basic techniques:

• Modulo scheduling (Rau, Lam)
  – list scheduling with modulo resource constraints

• Kernel recognition techniques
  – unroll the loop
  – schedule the iterations
  – identify a repeating pattern
  – Examples:
    • Perfect pipelining (Aiken and Nicolau)
    • URPR (Su, Ding and Xia)
    • Petri net pipelining (Allan)

• Enhanced pipeline scheduling (Ebcioğlu)
  – fill first cycle of iteration
  – copy this instruction over the backedge


Software pipelining: Modulo scheduling

Example: Modulo scheduling a loop

for (i = 0; i < n; i++)
    a[i+6] = 3 * a[i] - 1;

(a) Example loop

ld  r1,(r2)
mul r3,r1,3
sub r4,r3,1
st  r4,(r5)

(b) Code without loop control

Prologue   ld  r1,(r2)
           mul r3,r1,3    ld  r1,(r2)
           sub r4,r3,1    mul r3,r1,3    ld  r1,(r2)
Kernel     st  r4,(r5)    sub r4,r3,1    mul r3,r1,3    ld  r1,(r2)
Epilogue                  st  r4,(r5)    sub r4,r3,1    mul r3,r1,3
                                         st  r4,(r5)    sub r4,r3,1
                                                        st  r4,(r5)

(c) Software pipeline (each row is one cycle, each column one iteration)

• Prologue fills the SW pipeline with iterations

• Epilogue drains the SW pipeline
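
The prologue, kernel and epilogue of the example can be generated by starting one iteration per cycle. The sketch below assumes an initiation interval of 1, one-cycle operations and at least as many iterations as operations per iteration. `software_pipeline` and `phase` are illustrative names, not part of any scheduler described in these slides.

```python
OPS = ["ld r1,(r2)", "mul r3,r1,3", "sub r4,r3,1", "st r4,(r5)"]


def software_pipeline(ops, n_iter, ii=1):
    """Return one list per cycle with the (operation, iteration) pairs issued."""
    n_cycles = (n_iter - 1) * ii + len(ops)
    schedule = [[] for _ in range(n_cycles)]
    for it in range(n_iter):
        for stage, op in enumerate(ops):
            # Iteration `it` starts at cycle it * ii; stage k runs k cycles later.
            schedule[it * ii + stage].append((op, it))
    return schedule


def phase(cycle, n_cycles, depth):
    """Classify a cycle (valid for ii == 1 and n_iter >= depth)."""
    if cycle < depth - 1:
        return "prologue"
    if cycle >= n_cycles - (depth - 1):
        return "epilogue"
    return "kernel"


sched = software_pipeline(OPS, n_iter=4)
for cycle, issued in enumerate(sched):
    row = " | ".join(f"{op} [i{it}]" for op, it in issued)
    print(f"{cycle}: {phase(cycle, len(sched), len(OPS)):8s} {row}")
```

With more iterations only the number of kernel cycles grows; the prologue and epilogue keep their length of three cycles each.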


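Software pipelining: determine II, the Initiation Interval

for (i = 0; ...)
    A[i+6] = 3*A[i] - 1

Cyclic data dependences, each edge labeled (delay, distance):

ld r1,(r2)   → mul r3,r1,3 : (1,0)      mul r3,r1,3 → ld r1,(r2)  : (0,1)
mul r3,r1,3  → sub r4,r3,1 : (1,0)      sub r4,r3,1 → mul r3,r1,3 : (0,1)
sub r4,r3,1  → st r4,(r5)  : (1,0)      st r4,(r5)  → sub r4,r3,1 : (0,1)
st r4,(r5)   → ld r1,(r2)  : (1,6)

For every dependence edge (u,v):

cycle(v) ≥ cycle(u) + delay(u,v) - II.distance(u,v)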
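
A candidate schedule can be checked against the dependence constraint cycle(v) ≥ cycle(u) + delay(u,v) - II.distance(u,v) edge by edge. The sketch below assumes the following reading of the dependence graph: forward edges carry (1,0), the reverse register-reuse edges carry (0,1), and the memory dependence from the store back to the load carries (1,6). The name `violations` is illustrative.

```python
# (u, v, delay, distance); edge assignment is an assumed reading of the figure.
EDGES = [
    ("ld", "mul", 1, 0), ("mul", "ld", 0, 1),
    ("mul", "sub", 1, 0), ("sub", "mul", 0, 1),
    ("sub", "st", 1, 0), ("st", "sub", 0, 1),
    ("st", "ld", 1, 6),   # a[i+6] written now is read six iterations later
]


def violations(cycle, ii, edges=EDGES):
    """Return the edges (u, v) whose constraint is broken for this schedule."""
    return [(u, v) for u, v, delay, dist in edges
            if cycle[v] < cycle[u] + delay - ii * dist]


good = {"ld": 0, "mul": 1, "sub": 2, "st": 3}
bad = {"ld": 0, "mul": 0, "sub": 1, "st": 2}   # mul issued together with its ld

print(violations(good, ii=1))
print(violations(bad, ii=1))
```

The first schedule satisfies every edge with II = 1, so the recurrences do not prevent starting one iteration per cycle; the second breaks the ld → mul dependence.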


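Modulo scheduling constraints

MII, the minimum initiation interval, is bounded by cyclic dependences and resources:

MII = max{ ResMII, RecMII }

Resources:

\[ \mathit{ResMII} = \max_{r \in \mathit{resources}} \left\lceil \frac{\mathit{used}(r)}{\mathit{available}(r)} \right\rceil \]

Cycles (for every dependence cycle c through a node v):

\[ \mathit{cycle}(v) \ge \mathit{cycle}(v) + \sum_{e \in c} \bigl( \mathit{delay}(e) - II \cdot \mathit{distance}(e) \bigr) \]

Therefore:

\[ \mathit{RecMII} = \min \Bigl\{ II \in \mathbb{N} \;\Big|\; \forall c \in \mathit{cycles}:\; 0 \ge \sum_{e \in c} \bigl( \mathit{delay}(e) - II \cdot \mathit{distance}(e) \bigr) \Bigr\} \]

Or:

\[ \mathit{RecMII} = \max_{c \in \mathit{cycles}} \left\lceil \frac{\sum_{e \in c} \mathit{delay}(e)}{\sum_{e \in c} \mathit{distance}(e)} \right\rceil \]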
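
Both bounds, ResMII and RecMII, are easy to evaluate for the small example loop a[i+6] = 3*a[i] - 1. The sketch below makes several assumptions: the dependence edges are my reading of the (delay, distance) graph, the resource counts describe a hypothetical machine with a single load/store unit, and cycles are enumerated by brute force, which is only reasonable for tiny graphs. All function names are illustrative.

```python
from math import ceil

# (u, v, delay, distance); assumed reading of the dependence graph.
EDGES = [
    ("ld", "mul", 1, 0), ("mul", "ld", 0, 1),
    ("mul", "sub", 1, 0), ("sub", "mul", 0, 1),
    ("sub", "st", 1, 0), ("st", "sub", 0, 1),
    ("st", "ld", 1, 6),
]


def res_mii(used, available):
    """ResMII = max over resources of ceil(used / available)."""
    return max(ceil(used[r] / available[r]) for r in used)


def cycle_sums(edges):
    """Return (total delay, total distance) for every simple cycle."""
    graph = {}
    for u, v, delay, dist in edges:
        graph.setdefault(u, []).append((v, delay, dist))
    found = []

    def walk(start, node, visited, delay_sum, dist_sum):
        for nxt, delay, dist in graph.get(node, []):
            if nxt == start:
                found.append((delay_sum + delay, dist_sum + dist))
            elif nxt > start and nxt not in visited:
                # Only visit nodes "larger" than start, so each cycle is
                # reported once, from its smallest node.
                walk(start, nxt, visited | {nxt},
                     delay_sum + delay, dist_sum + dist)

    for start in graph:
        walk(start, start, {start}, 0, 0)
    return found


def rec_mii(edges):
    """RecMII = max over cycles of ceil(sum of delays / sum of distances)."""
    sums = cycle_sums(edges)
    if any(dist == 0 for _, dist in sums):
        raise ValueError("cycle with zero distance: loop cannot be scheduled")
    return max(ceil(delay / dist) for delay, dist in sums)


# Hypothetical machine: ld and st share one load/store unit.
used = {"lsu": 2, "mul": 1, "alu": 1}
available = {"lsu": 1, "mul": 1, "alu": 1}

mii = max(res_mii(used, available), rec_mii(EDGES))
print("ResMII =", res_mii(used, available))
print("RecMII =", rec_mii(EDGES))
print("MII    =", mii)
```

On this hypothetical machine the resource bound dominates. A schedule with one iteration per cycle would need a second load/store unit.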


The Role of the Compiler

9 steps required to translate an HLL program (see online book chapter)

• Front-end compilation

• Determine dependencies

• Graph partitioning: make multiple threads (or tasks)

• Bind partitions to compute nodes

• Bind operands to locations

• Bind operations to time slots: Scheduling

• Bind operations to functional units

• Bind transports to buses

• Execute operations and perform transports


Division of responsibilities between hardware and compiler

[Figure: the steps between Application and Execute (Frontend, Determine Dependencies, Binding of Operands, Scheduling, Binding of Operations, Binding of Transports). For each architecture class (Superscalar, Dataflow, Multi-threaded, Indep. Arch, VLIW, TTA) the figure marks where the responsibility of the compiler ends and the responsibility of the hardware begins.]


Overview

• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering
• Code generation
• Design Space Exploration: TTA framework


Mapping applications to processors

MOVE framework

[Figure: MOVE framework. Inputs: architecture parameters and user interaction. Components: Optimizer, Parametric compiler and Hardware generator, connected by feedback paths. Outputs: parallel object code and chip, together forming a TTA based system. Inset: Pareto curve (solution space), execution time versus cost.]
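
The Pareto curve consists of the design points that are not beaten on both cost and execution time by any other point. The following is a minimal sketch of that filtering step. The (cost, execution time) pairs are made-up illustration data, not results from the MOVE framework, and `pareto_front` is an illustrative name.

```python
def pareto_front(points):
    """Keep the (cost, exec_time) points that no other point dominates.

    A point q dominates p if q is no worse in both dimensions and differs
    from p; lower is better for both cost and execution time.
    """
    front = []
    for p in points:
        dominated = any(q != p and q[0] <= p[0] and q[1] <= p[1]
                        for q in points)
        if not dominated:
            front.append(p)
    return sorted(set(front))


# Hypothetical design points: (cost, execution time)
designs = [(1, 10), (2, 7), (3, 8), (4, 4), (5, 5), (6, 3)]
print(pareto_front(designs))
```

The quadratic scan is adequate for the few hundred design points a typical exploration produces. Sorting by cost and sweeping once would be the usual refinement for larger sets.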


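TTA (MOVE) organization

[Figure: TTA organization. Units connected through sockets to the transport buses: instruction unit, immediate unit, two load/store units, two integer ALUs, float ALU, integer RF, float RF, boolean RF. The load/store units access the Data Memory; the instruction unit accesses the Instruction Memory.]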


Code generation trajectory for TTAs

[Figure: Application (C) → Compiler frontend → Sequential code → Compiler backend → Parallel code. Also shown: Sequential simulation and Parallel simulation (each with Input/Output), Profiling data, and the Architecture description.]

• Frontend: GCC or SUIF (adapted)


Exploration: TTA resource reduction


Exploration: TTA connectivity reduction

[Figure: execution time versus number of connections removed. Annotations: reducing bus delay; FU stage constrains cycle time; critical connections disappear.]


Can we do better?

How?

• Transformations

• SFUs: Special Function Units

• Multiple Processors

[Figure: execution time versus cost]


Transforming the specification

[Figure: a chain of three additions rewritten as a balanced tree of three additions]

Based on associativity of the + operation: a + (b + c) = (a + b) + c
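
Reassociating a sum shortens its critical path, which is what gives the scheduler more parallelism. The sketch below builds both shapes for four operands and measures their depth. The tuple-based expression format and the function names are assumptions made for illustration.

```python
def chain(operands):
    """Left-to-right sum: ((a + b) + c) + d."""
    expr = operands[0]
    for x in operands[1:]:
        expr = ("+", expr, x)
    return expr


def balanced(operands):
    """Reassociated sum: (a + b) + (c + d)."""
    if len(operands) == 1:
        return operands[0]
    mid = len(operands) // 2
    return ("+", balanced(operands[:mid]), balanced(operands[mid:]))


def depth(expr):
    """Length of the longest dependence chain, in additions."""
    if not isinstance(expr, tuple):
        return 0
    return 1 + max(depth(expr[1]), depth(expr[2]))


ops = ["a", "b", "c", "d"]
print(chain(ops), "depth", depth(chain(ops)))
print(balanced(ops), "depth", depth(balanced(ops)))
```

Both forms use three additions, but the balanced form needs two steps instead of three when two adders are available. Note that the rewrite is exact for integers only; floating-point addition is not associative.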


Transforming the specification

d = a * b;
e = a + d;
f = 2 * b + d;
r = f - e;
x = z + y;

is transformed into:

r = 2*b - a;
x = z + y;

[Figure: dataflow graph of the transformed code: r = (b << 1) - a and x = z + y]
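
The simplification follows from r = f - e = (2*b + d) - (a + d) = 2*b - a, so the multiplication a * b drops out, and 2*b becomes a shift by one. The following sketch checks the equivalence on a range of integers. The function names `original` and `transformed` are illustrative.

```python
from itertools import product


def original(a, b, y, z):
    d = a * b
    e = a + d
    f = 2 * b + d
    r = f - e
    x = z + y
    return r, x


def transformed(a, b, y, z):
    r = (b << 1) - a   # 2*b as a shift; the product a*b cancelled out
    x = z + y
    return r, x


values = range(-4, 5)
equivalent = all(original(a, b, y, z) == transformed(a, b, y, z)
                 for a, b, y, z in product(values, repeat=4))
print("equivalent on all tested inputs:", equivalent)
```

The check uses Python's unbounded integers. On fixed-width hardware the identity still holds under wrap-around arithmetic, but not under saturating arithmetic.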


Changing the architecture

adding SFUs: special function units

[Figure: a tree of three two-input additions replaced by a single 4-input adder]

4-input adder: why is this faster?


Changing the architecture

adding SFUs: special function units

In the extreme case put everything into one unit!

Spatial mapping: no control flow

However: no flexibility / programmability!


SFUs: fine grain patterns

• Why use fine grain SFUs:
  – Code size reduction
  – Register file #ports reduction
  – Could be cheaper and/or faster
  – Transport reduction
  – Power reduction (avoid charging non-local wires)
  – Supports whole application domain!

Which patterns need support?

• Detection of recurring operation patterns needed


SFUs: covering results


Exploration: resulting architecture

[Figure: resulting TTA with 9 buses, 4 RFs, 4 Addercmp FUs, 2 Multiplier FUs, 2 Diffadd FUs, a stream input and a stream output]

Architecture for image processing

• Note the reduced connectivity


Conclusions

• Billions of embedded processing systems
  – how to design these systems quickly, cheaply, correctly, at low power, ...?
  – what will their processing platform look like?

• VLIWs are very powerful and flexible
  – can be easily tuned to application domain

• TTAs are even more flexible, scalable, and lower power


Conclusions

• Compilation for ILP architectures is mature, and is entering the commercial area.

• However
  – Great discrepancy between available and exploitable parallelism

• Advanced code scheduling techniques needed to exploit ILP


Bottom line:


Hands-on (not this year)

• Map JPEG to a TTA processor
  – see web page: http://www.ics.ele.tue.nl/~heco/courses/pam

• Install TTA tools (compiler and simulator)

• Go through all listed steps

• Perform DSE: design space exploration

• Add SFU

• 1 or 2 page report in 2 weeks


Hands-on

• Let’s look at DSE: Design Space Exploration

• We will use the Imagine processor

• http://cva.stanford.edu/projects/imagine/