Fast, Quasi-Optimal, and Pipelined Instruction-Set Extensions

Ajay K. Verma, Philip Brisk and Paolo Ienne

Processor Architecture Laboratory (LAP) & Centre for Advanced Digital Systems (CSDA)

Ecole Polytechnique Fédérale de Lausanne (EPFL)


Custom ISE Identification

[Figure: processor datapath with a register file, ALU, MUL and LD/ST units, data memory, and an AFU computing out1 = F (in1, in2, in3, in4) and out2 = G (in1, in2, in3, in4).]

Limited number of I/O ports

Outline

Problem formulation: ISE selection, I/O serialisation

Related work

Non-optimality of earlier work

Integer Linear Programming (ILP) formulation

Results

Conclusions

Problem Formulation

Given

a dataflow graph

a set of forbidden nodes

Find a subgraph S which is convex and free of forbidden nodes,

and has the largest gain M (S) = Nexec * (SW (S) – HW (S))

[Figure: example dataflow graph with nodes a, b, c, d, e, f, g, h, x1, x2 and x3.]

Convex Subgraph

[Figure: dataflow graph with nodes a, b, c and d.]

In order to execute the AFU we need the output of node b

Computation of node b requires the output of AFU

A non-convex AFU cannot be scheduled without creating a deadlock
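The deadlock argument above amounts to a reachability test: a subgraph is convex when no path leaves it and later re-enters it. Below is a minimal Python sketch of that test, assuming the dataflow graph is given as a successor map. The function name `is_convex` and the edges of the four-node example are illustrative assumptions, not taken from the slides.

```python
def is_convex(succ, subgraph):
    """Return True if no path leaves `subgraph` and later re-enters it.

    succ: dict mapping each node to a list of its direct successors.
    subgraph: iterable of nodes forming the candidate AFU.
    """
    subgraph = set(subgraph)
    # Start from every outside node that consumes an output of the subgraph.
    stack = [v for u in subgraph for v in succ.get(u, ()) if v not in subgraph]
    seen = set()
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        for v in succ.get(u, ()):
            if v in subgraph:
                return False  # an outside node feeds back into the subgraph
            stack.append(v)
    return True


# Hypothetical edges: a feeds b and c, and both b and c feed d.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

print(is_convex(succ, {"a", "c", "d"}))       # False: path a -> b -> d leaves and re-enters
print(is_convex(succ, {"a", "b", "c", "d"}))  # True
```

In the first call the AFU {a, c, d} needs the output of b, while b needs the output of a, which is the circular dependency described above.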

I/O Serialisation

[Figure: subgraph with nodes b, c, d, e and f, shown twice.]

2 inputs, 4 outputs

Available I/O ports: (1, 2)

ISE Merit Estimation

M (S) = Nexec * (SW (S) – HW (S))

[Figure: the example dataflow graph (nodes a, b, c, d, e, f, g, h, x1, x2, x3) alongside a subgraph with nodes b, c, d, e and f.]
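The merit formula can be evaluated mechanically once SW (S) and HW (S) are estimated. The sketch below assumes one common estimate that the slides do not spell out: SW (S) is the sum of per-node software latencies, and HW (S) is the critical-path delay of the subgraph rounded up to whole cycles. The edges, latencies and delays in the example are hypothetical.

```python
import math
from functools import lru_cache


def merit(n_exec, sw_latency, hw_delay, succ, subgraph):
    """M(S) = Nexec * (SW(S) - HW(S)) under the assumptions stated above."""
    subgraph = frozenset(subgraph)
    # Assumed software cost: nodes execute one after another.
    sw = sum(sw_latency[n] for n in subgraph)

    @lru_cache(maxsize=None)
    def longest_from(n):
        # Longest accumulated delay of any path starting at n inside S.
        tails = [longest_from(v) for v in succ.get(n, ()) if v in subgraph]
        return hw_delay[n] + max(tails, default=0.0)

    # Assumed hardware cost: critical path rounded up to whole cycles.
    hw = math.ceil(max(longest_from(n) for n in subgraph))
    return n_exec * (sw - hw)


# Hypothetical subgraph: b and c feed d; d and e feed f.
succ = {"b": ["d"], "c": ["d"], "d": ["f"], "e": ["f"], "f": []}
sw_latency = {n: 1 for n in succ}  # one cycle per node in software
hw_delay = {"b": 0.5, "c": 0.5, "d": 0.6, "e": 0.2, "f": 0.3}  # fractions of a cycle

gain = merit(100, sw_latency, hw_delay, succ, succ.keys())
print(gain)  # 300: five software cycles against two hardware cycles, executed 100 times
```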

Related Work

ISE identification under I/O constraints

  Search space pruning using I/O and convexity constraints [Atasu03, Clark03, Yu04, Pozzi06, Yu07, Chen07]
  ILP based approach [Atasu05]
  Pseudo-polynomial time algorithm [Bonzini07]

ISE identification under relaxed I/O constraints
  Restricted search space exploration [Pozzi05]
  Generation of a semi compact set of connected ISEs [Pothineni07]

I/O serialisation
  Exponential time algorithms [Pozzi05, Pothineni07]

Algorithms for specific processor models
  Single-issue RISC processor model [Verma07]

Earlier Work

ISE Selection: Atasu03, Yu07, Chen07, Bonzini07
  Optimal ISE selection under various I/O constraints

I/O Serialisation: Pozzi05, Pothineni07
  Exponential time I/O serialisation algorithm

Non-Optimality of Earlier Work

[Figure: an example dataflow graph, shown twice, with nodes annotated .5, .6, .3 and .2, and "cycle saved" labels reading 23.36, 15.02, 066 and 112.]

Our Contributions

Optimal ILP formulation for a large class of processor models
  Earlier work considers the RISC processor model only

Single run
  In earlier work, ISE selection was done for various I/O constraints

ISE selection and I/O scheduling together
  Handling them separately is another source of non-optimality in earlier work

Integer Linear Programming

Objective function

Linear constraints

ILP Formulation

Linear constraints
  No forbidden nodes
  Convexity constraints
  I/O serialisation based constraints
  I/O access per cycle based constraints

Objective function
  Saving in cycles should be maximum

ISE Selection Constraints (1 of 2)

Variable: For each node ni a Boolean variable xi

  xi is true iff node ni is in the selected ISE

Constraint: No forbidden node should be in the ISE
  If ni is a forbidden node, then xi = 0

Variable: For each node ni two Boolean variables pi and si
  pi (si) is true iff at least one predecessor (successor) of ni is in the selected ISE

Constraint: The subgraph corresponding to the selected ISE must be convex
  If pi and si are both true, then xi must be true (i.e., pi + si – xi ≤ 1)

ISE Selection Constraints (2 of 2)

Relationship between pi, si and xi (∪ denotes logical OR)

pi = 0 if ni has no children; otherwise pi = ∪ (xj ∪ pj), where the nj's are the children of ni

si = 0 if ni has no parents; otherwise si = ∪ (xj ∪ sj), where the nj's are the parents of ni
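The selection constraints can be checked on a small graph by evaluating pi and si recursively from a given assignment of the xi and then testing pi + si – xi ≤ 1 for every node. The Python sketch below does that. It assumes the recursion for pi runs over the immediate predecessors of ni and the recursion for si over its immediate successors, which matches the definitions of pi and si. The helper names and the four-node graph are hypothetical.

```python
def reach_flags(neigh, x):
    """For each node i, return 1 if some node reachable through `neigh`
    (direct neighbour, or neighbour of a neighbour, ...) has x = 1."""
    memo = {}

    def visit(i):
        if i not in memo:
            memo[i] = int(any(x[j] or visit(j) for j in neigh.get(i, ())))
        return memo[i]

    return {i: visit(i) for i in x}


def convexity_violations(pred, succ, x):
    """Nodes that break p_i + s_i - x_i <= 1 for the selection x."""
    p = reach_flags(pred, x)  # some predecessor of n_i is selected
    s = reach_flags(succ, x)  # some successor of n_i is selected
    return [i for i in x if p[i] + s[i] - x[i] > 1]


# Hypothetical graph: a feeds b and c, and both b and c feed d.
pred = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

print(convexity_violations(pred, succ, {"a": 1, "b": 0, "c": 1, "d": 1}))  # ['b']
print(convexity_violations(pred, succ, {"a": 1, "b": 1, "c": 1, "d": 1}))  # []
```

In the first selection node b has a selected predecessor (a) and a selected successor (d) but is itself excluded, so the constraint is violated.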

I/O Serialisation Based Constraints (1 of 3)

[Figure: example dataflow graph with nodes n1 to n5.]

Variable: An integer variable intDelayi
  Denotes the cycle in which node ni is executed, e.g., intDelay1 = 0, intDelay4 = 1, intDelay5 = 2

Variable: A real variable fractionalDelayi
  Denotes the smallest time after cycle intDelayi at which the output of ni is available, e.g., fractionalDelay3 = HW (n3), fractionalDelay4 = HW (n3) + HW (n4)

Variable: An integer variable ρij
  Denotes the number of stages across the edge between the nodes ni and nj, e.g., ρ13 = 1, ρ34 = 0, ρ25 = 2

I/O Serialisation Based Constraints (2 of 3)

Constraint: The difference between the cycles of a predecessor and a successor node equals the number of latches on the edge connecting them, e.g.,
  intDelay4 = intDelay3 + ρ34
  intDelay5 = intDelay2 + ρ25

Constraint: The total number of stages equals the last cycle in which an output node is computed, e.g.,
  R = intDelay5 + ρ57
  R = intDelay2 + ρ26

[Figure: the example dataflow graph with nodes n1 to n7.]

Extra latches on output edges are created in order to realize an imaginary sink node
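The two equalities above can be verified directly on the example values. The Python sketch below uses the values stated on the slides (intDelay1 = 0, intDelay4 = 1, intDelay5 = 2, ρ13 = 1, ρ34 = 0, ρ25 = 2) and derives intDelay2 = 0 and intDelay3 = 1 from them. The latch counts on the output edges (ρ57 and ρ26) are not given in the slides, so the values used here are assumptions chosen so that both outputs agree on R.

```python
# Cycle in which each node executes; entries for n2 and n3 are derived
# from the edge constraint rather than stated on the slides.
int_delay = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2}

# Latches on internal edges (n_i, n_j), as given on the slides.
rho = {(1, 3): 1, (3, 4): 0, (2, 5): 2}

# Latches on the edges to the sink nodes: rho_57 and rho_26.
# These two values are hypothetical.
sink_rho = {5: 1, 2: 3}


def edges_consistent(int_delay, rho):
    """Check intDelay_j == intDelay_i + rho_ij on every edge."""
    return all(int_delay[j] == int_delay[i] + r for (i, j), r in rho.items())


def total_stages(int_delay, sink_rho):
    """Return R, requiring every output edge to yield the same value."""
    totals = {int_delay[i] + r for i, r in sink_rho.items()}
    if len(totals) != 1:
        raise ValueError("output edges disagree on the total number of stages")
    return totals.pop()


print(edges_consistent(int_delay, rho))   # True
print(total_stages(int_delay, sink_rho))  # 3
```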

I/O Serialisation Based Constraints (3 of 3)

Constraint: fractionalDelay of a node depends on the fractionalDelay of its predecessor nodes, e.g.,
  Case 1: if the node is the first node in the cycle
    fractionalDelay3 = HW (n3)
  Case 2: if the node is not the first node in the cycle
    fractionalDelay4 = fractionalDelay3 + HW (n4)

Constraint: fractionalDelay of a node should never exceed the cycle time λ, e.g.,
  fractionalDelay3 ≤ λ
  fractionalDelay4 ≤ λ

[Figure: the example dataflow graph with nodes n1 to n7.]
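The two cases can be folded into one rule: a node starts from the latest fractionalDelay among the predecessors that share its cycle (edges with ρ = 0), or from zero if it has none. The Python sketch below applies that rule. Taking the maximum over several same-cycle predecessors is an assumption that generalises the single-predecessor examples, and the HW delays and cycle time are hypothetical values.

```python
def fractional_delays(order, pred, rho, hw):
    """Compute fractionalDelay for nodes listed in topological order.

    A predecessor contributes only if the connecting edge carries no latch
    (rho == 0), i.e. both nodes execute in the same cycle.
    """
    frac = {}
    for j in order:
        same_cycle = [frac[i] for i in pred.get(j, ()) if rho[(i, j)] == 0]
        frac[j] = max(same_cycle, default=0.0) + hw[j]
    return frac


# Chain n1 -> n3 -> n4 with rho_13 = 1 and rho_34 = 0, as on the slides.
pred = {3: [1], 4: [3]}
rho = {(1, 3): 1, (3, 4): 0}
hw = {1: 0.5, 3: 0.6, 4: 0.3}  # hypothetical delays, as fractions of a cycle
cycle_time = 1.0               # hypothetical value of lambda

frac = fractional_delays([1, 3, 4], pred, rho, hw)
fits = all(delay <= cycle_time for delay in frac.values())

print(frac[3])  # 0.6, Case 1: n3 is the first node in its cycle
print(fits)     # True: no node exceeds the cycle time
```

Here fractionalDelay4 evaluates to HW (n3) + HW (n4), which is Case 2.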

I/O Access Per Cycle Based Constraints

Variable: Boolean variables cikIN and cikOUT
  cikIN is true iff ni is an input of the ISE and is accessed in the kth stage of execution (similarly for cikOUT)

Constraint: In each stage no more than m inputs should be accessed, and no more than n outputs should be written back, i.e., for each k
  ∑ cikIN ≤ m
  ∑ cikOUT ≤ n

cikIN and cikOUT can be computed using the intDelay and fractionalDelay of the nodes and the ρ values of the incoming and outgoing edges of the AFU
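The per-stage port limit can be checked by counting accesses per stage. The Python sketch below represents the cikIN variables compactly as a map from each input to the stage k in which it is read, so cikIN = 1 exactly when the map sends input i to k. The function name and the schedule are hypothetical, and the same check applies to outputs with the limit n.

```python
from collections import Counter


def ports_respected(access_stage, limit):
    """Check that no stage accesses more than `limit` operands.

    access_stage: dict mapping each ISE input (or output) to the stage
    in which it is read (or written back).
    """
    per_stage = Counter(access_stage.values())
    return all(count <= limit for count in per_stage.values())


# Hypothetical schedule: four inputs read over two stages.
read_stage = {"in1": 0, "in2": 0, "in3": 1, "in4": 1}

print(ports_respected(read_stage, 2))  # True: two reads in each stage
print(ports_respected(read_stage, 1))  # False: stage 0 alone needs two read ports
```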

Objective Function

Saving in cycles should be maximized
  SW (S) – HW (S) should be maximum
  SW (S) = ∑ xi SW (ni)
  HW (S) = R

Any processor model where SW (S) and HW (S) can be computed using linear inequalities can be handled using ILP
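For very small graphs the objective can be maximised by exhaustive enumeration, which is a convenient way to cross-check a formulation. The Python sketch below is a brute-force stand-in for an ILP solver, not the formulation from the slides: it enforces only the forbidden-node and convexity constraints, ignores I/O port limits and serialisation, and assumes R is the critical-path delay rounded up to whole cycles. All names and numbers are hypothetical.

```python
import math
from itertools import combinations


def best_ise(succ, sw, hw, forbidden, cycle_time=1.0):
    """Exhaustively maximise SW(S) - HW(S) over convex, forbidden-free subgraphs."""
    nodes = list(succ)  # assumed to be listed in topological order

    def convex(s):
        # No path may leave s and re-enter it.
        stack = [v for u in s for v in succ[u] if v not in s]
        seen = set()
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            for v in succ[u]:
                if v in s:
                    return False
                stack.append(v)
        return True

    def stages(s):
        # Assumed HW(S): critical path through s, in whole cycles.
        finish = {}
        for u in reversed(nodes):  # successors are visited before predecessors
            if u in s:
                tails = (finish[v] for v in succ[u] if v in s)
                finish[u] = hw[u] + max(tails, default=0.0)
        return math.ceil(max(finish.values()) / cycle_time)

    allowed = [n for n in nodes if n not in forbidden]
    best, best_gain = frozenset(), 0
    for k in range(1, len(allowed) + 1):
        for combo in combinations(allowed, k):
            s = frozenset(combo)
            if not convex(s):
                continue
            gain = sum(sw[n] for n in s) - stages(s)
            if gain > best_gain:
                best, best_gain = s, gain
    return best, best_gain


# Hypothetical graph: a feeds b and c, both feed d; a is forbidden.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
sw = {n: 1 for n in succ}
hw = {"a": 0.5, "b": 0.6, "c": 0.3, "d": 0.2}

chosen, saved = best_ise(succ, sw, hw, forbidden={"a"})
print(sorted(chosen), saved)  # ['b', 'c', 'd'] 2
```

The enumeration is exponential in the number of nodes, so it only serves as a reference on toy inputs.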

Experimental Setup

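[Figure: experimental flows starting from the input dataflow graph: ISE selection (Atasu03) with no serialisation; ISE selection (Atasu03) followed by I/O serialisation (Pozzi05); and the ILP method. Labels: "exp / subopt", "exp / opt".]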

Results (1 of 3)

[Figure: results for the viterbi, adpcmdecoder and adpcmcoder benchmarks, comparing no pipelining, Pozzi's algorithm and the ILP method.]

Results (2 of 3)

Pozzi’s algorithm takes several hours on this benchmark, and produces inferior results

Benchmark: aes

Biggest dataflow graph: 703

[Figure: results after 3 minutes and after an hour.]

Results (3 of 3)

The best AFU with 22 inputs and 22 outputs

Conclusions

[Figure: earlier work, split into ISE selection (Atasu03, Yu07, Chen07, Bonzini07) and I/O serialisation (Pozzi05, Pothineni07).]

The methodology can be generalized for a large class of processor models

Optimal, single run algorithm