System Partitioning
Dr. Tobias Kenter, Prof. Dr. Christian Plessl
SS 2017
High-Performance IT Systems Group, Paderborn University
Version 1.6.0 – 2017-06-25
Overview
• graph models for system synthesis
• the partitioning problem
• partitioning methods
2
Overview
• graph models for system synthesis
• the partitioning problem
• partitioning methods
3
Models for System Synthesis
• allocation + binding = partitioning
• problem graph
  – nodes: functional and communication objects
  – edges: dependencies
• architecture graph
  – nodes: functional and communication resources
  – edges: directed communication paths
• specification graph
  – problem graph + architecture graph + possible mappings
4
Problem Graph
[Figure: data-flow graph (DFG) with functional nodes 1-4, and the derived problem graph in which communication nodes 5-7 are inserted on the dependency edges]
5
Architecture Graph
[Figure: target architecture with a RISC processor, two hardware modules HWM1 and HWM2, a shared bus, and a point-to-point link; the corresponding architecture graph has the nodes vRISC, vHWM1, vHWM2, vbus, vptp]
6
Specification Graph
[Figure: specification graph combining the problem graph (nodes 1-7) and the architecture graph (vRISC, vHWM1, vHWM2, vbus, vptp), with mapping edges indicating the possible bindings]
7
Allocation, Binding
[Figure: one allocation and binding in the specification graph; a subset of the architecture nodes is allocated, and each problem graph node (1-7) is bound to exactly one allocated resource]
8
Example: Homogeneous Multiprocessor
• optimization goals
  – minimize latency
  – meet deadlines
  – …

[Figure: homogeneous multiprocessor with three processing elements PE1-PE3, each with local memory M, attached to a shared bus; architecture graph nodes vPE1, vPE2, vPE3, vbus]
9
Example: HW/SW Bi-Partitioning
• bi-partitioning means that there are only two blocks: SW and HW (accelerator)
[Figure: HW/SW architecture with a processor and an FPGA connected by a bus; architecture graph nodes vprocessor, vFPGA, vbus]
10
Synthesis and Constraints
• system synthesis modeling
  – create problem graph
  – define architectural template
  – identify possible bindings
  – refinements: communication, memory, …
• system synthesis with constraints Cmax, Lmax
  – allocation: determines cost C (first approximation)
  – binding
  – scheduling: determines latency L
  – feasible schedule: L ≤ Lmax
  – feasible binding: leads to at least one feasible schedule
  – feasible allocation: C ≤ Cmax and leads to at least one feasible binding
11
Overview
• graph models for system synthesis
• the partitioning problem
• partitioning methods
12
Partitioning Problem
• definition: the partitioning problem is to assign n objects O = {o1, ..., on} to k partitions P = {p1, ..., pk}, such that
  – p1 ⋃ p2 ⋃ ... ⋃ pk = O,
  – pi ⋂ pj = { } ∀ i, j: i ≠ j, and
  – the cost c(P) is minimized
• the general partitioning problem is NP-complete
• in system synthesis:
  – objects: problem graph nodes
  – partitions: architecture graph nodes
13
Partitioning – Abstraction Levels
• structural partitioning
  – at the RTL or netlist level
  – e.g. map a digital circuit onto two chips (FPGAs, ASICs)
  – system parameters are relatively well known (area and delay of functional units, registers, gates)
  – no comparison of design alternatives
  – cost: e.g. number of wires between partitions, maximum delay between all partitions, average size of each partition, ...
• functional partitioning
  – at the system level
  – system parameters are not known → estimation required
  – comparison of design alternatives → design space exploration
  – cost: e.g. required communication bandwidth or connections between partitions, hardware implementation cost for all partitions, ...
14
Cost Functions
• measure the quality of a design point
  – may include:
      C   system cost in [$]
      L   latency in [sec]
      P   power consumption in [W]
  – requires estimation to find C, L, P
• example: linear cost function with penalty

  f(C, L, P) = k1·hC(C, Cmax) + k2·hL(L, Lmax) + k3·hP(P, Pmax)

  – hC, hL, hP denote how strongly C, L, P violate the design constraints Cmax, Lmax, Pmax
  – k1, k2, k3: weighting and normalization factors
15
✎ Example: Cost functions
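As a sketch of how such a penalty-based cost function might look in code (the relative-violation shape of the penalty terms and the default weights are illustrative assumptions, not taken from the slides):

```python
def penalty(value, limit):
    # h(value, limit): zero if the constraint is met,
    # otherwise the relative amount of violation
    return max(0.0, (value - limit) / limit)

def cost(C, L, P, Cmax, Lmax, Pmax, k1=1.0, k2=1.0, k3=1.0):
    # linear cost function with penalty terms: f = k1*hC + k2*hL + k3*hP
    return (k1 * penalty(C, Cmax)
            + k2 * penalty(L, Lmax)
            + k3 * penalty(P, Pmax))

# a design point that meets all constraints has cost 0
print(cost(C=900, L=10, P=4, Cmax=1000, Lmax=20, Pmax=5))  # 0.0
# violating the latency constraint by 50% adds a penalty of 0.5
print(cost(C=900, L=30, P=4, Cmax=1000, Lmax=20, Pmax=5))  # 0.5
```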
Overview
• graph models for system synthesis
• the partitioning problem
• partitioning methods
16
Partitioning Methods – Overview
• exact methods
  – enumeration of solutions
  – integer linear programs (ILP)
• heuristic methods
  – constructive methods
    § random mapping
    § hierarchical clustering
  – iterative methods (refinement methods)
    § greedy partitioners
    § Kernighan-Lin
    § simulated annealing
    § evolutionary algorithms (design space exploration)
17
Enumeration
• Enumeration is not feasible even for small partitioning problems, since the number of candidate solutions grows exponentially with the number of objects
18
✎ Example: Enumeration
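The blow-up is easy to quantify: for a pure binding problem, each of the n objects can be mapped to any of m partitions, which yields m^n candidates. A quick sketch:

```python
# number of candidate bindings: each of n objects can be mapped
# to any of m partitions
def num_bindings(n, m):
    return m ** n

print(num_bindings(6, 2))    # 64: still enumerable (the ILP example below)
print(num_bindings(100, 2))  # about 1.3e30: hopeless to enumerate
```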
ILP – Example
19
[Figure: problem graph with tasks 1-6 and communication edges, and a target architecture consisting of a processor and an ASIC connected by a bus]
task SW cost HW cost
1 80 320
2 240 170
3 710 120
4 130 20
5 100 400
6 80 260
cost table
→ find optimal allocation & binding
(all bindings possible)
ILP – Basic Problem Formulation
• binary variables xi,k: xi,k = 1 means object oi is in partition pk
• costs ci,k (constants), incurred if object oi is in partition pk
• integer linear program:
20
minimize:  Σ_{i=1..n} Σ_{k=1..m} x_{i,k}·c_{i,k}

subject to:
  Σ_{k=1..m} x_{i,k} = 1    for 1 ≤ i ≤ n
  x_{i,k} ∈ {0, 1}          for 1 ≤ i ≤ n, 1 ≤ k ≤ m
ILP – Illustration Binding
[Figure: incomplete specification graph; tasks 1-6 on one side, architecture nodes vprocessor and vASIC on the other, with binding edges labeled x11, x12, x21, x22, ...; cost table as on the previous slide]

n = 6, m = 2 → n·m = 12 binary variables
21
ILP – Basic ILP in LP Format
/* objective function */
/* c1s = 80  c1h = 320 */
/* c2s = 240 c2h = 170 */
/* c3s = 710 c3h = 120 */
/* c4s = 130 c4h = 20  */
/* c5s = 100 c5h = 400 */
/* c6s = 80  c6h = 260 */
min: 80 x11 + 320 x12 + 240 x21 + 170 x22 + 710 x31 + 120 x32
   + 130 x41 + 20 x42 + 100 x51 + 400 x52 + 80 x61 + 260 x62;

/* unique mapping constraints */
x11 + x12 = 1;
x21 + x22 = 1;
x31 + x32 = 1;
x41 + x42 = 1;
x51 + x52 = 1;
x61 + x62 = 1;

/* variables in {0,1} */
x11 <= 1; x12 <= 1; x21 <= 1; x22 <= 1; x31 <= 1; x32 <= 1;
x41 <= 1; x42 <= 1; x51 <= 1; x52 <= 1; x61 <= 1; x62 <= 1;

/* integer variables */
int x11; int x12; int x21; int x22; int x31; int x32;
int x41; int x42; int x51; int x52; int x61; int x62;
22
ILP – Solution of Basic ILP w/ Cost Constraints
[Figure: optimal allocation & binding for the basic ILP; tasks 1, 5, 6 bound to vprocessor, tasks 2, 3, 4 bound to vASIC; cost table as before; n = 6, m = 2, 12 binary variables]

total cost = 570
23
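Since this instance has only 2^6 = 64 possible bindings, the ILP result can be cross-checked by plain enumeration. A sketch using the cost table from the example (0 = SW, 1 = HW):

```python
from itertools import product

# cost table from the example: (SW cost, HW cost) for tasks 1..6
costs = [(80, 320), (240, 170), (710, 120), (130, 20), (100, 400), (80, 260)]

best_cost, best_binding = min(
    (sum(costs[i][b] for i, b in enumerate(binding)), binding)
    for binding in product((0, 1), repeat=len(costs))
)
print(best_cost)     # 570, matching the ILP solution
print(best_binding)  # (0, 1, 1, 1, 0, 0): tasks 2, 3, 4 in HW
```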
ILP – Additional Constraints (1)
• at most h_k objects in partition p_k:

  Σ_{i=1..n} x_{i,k} ≤ h_k    for 1 ≤ k ≤ m

• maximal cost H_k of the objects in partition p_k:

  Σ_{i=1..n} c_{i,k}·x_{i,k} ≤ H_k    for 1 ≤ k ≤ m

24
ILP – Additional Constraints (2)
• constraint on the hardware cost
  – the cost of all tasks mapped to hardware must not exceed 300

  Σ_{i=1..6} c_{i,2}·x_{i,2} ≤ 300

/* hardware cost constraint */
320 x12 + 170 x22 + 120 x32 + 20 x42 + 400 x52 + 260 x62 <= 300;

25
ILP – Solution w/ Cost and Capacity Constraints
[Figure: optimal allocation & binding under the constraint HW cost ≤ 300; tasks 1, 2, 5, 6 bound to vprocessor, tasks 3, 4 bound to vASIC]

constraint: HW cost <= 300
total cost = 640
26
ILP – Communication Cost Modeling (1)
• including communication cost
  – assume local communication is free (SW-SW or HW-HW) but bus communication incurs cost (SW-HW or HW-SW)
  – for simplicity, we assume additive costs
edge i →j local communication bus communication
1→2 0 20
1→3 0 250
2→4 0 200
2→5 0 10
3→5 0 100
4→6 0 5
5→6 0 50
communication cost: com_{i,j}
27
ILP – Communication Cost Modeling (2)
• introduce additional binary variables
  – g_{i,j} = 1 iff the communication from node i to node j uses the bus
• define g_{i,j} in terms of the primary variables x_{i,k}:

  x_{i,1}·x_{j,2} + x_{i,2}·x_{j,1} = g_{i,j}    ∀ edges (v_i, v_j)    → non-linear!

  linearization:

  x_{i,1} + x_{j,2} - 1 ≤ g_{i,j}
  x_{i,2} + x_{j,1} - 1 ≤ g_{i,j}    ∀ edges (v_i, v_j)

  (the inequalities only force g_{i,j} to 1 when the two nodes are bound to different partitions; since g_{i,j} carries a positive cost, minimization drives it to 0 otherwise)

• new cost/objective function:

  min Σ_{i=1..n} (c_{i,1}·x_{i,1} + c_{i,2}·x_{i,2}) + Σ_{edges (v_i,v_j)} com_{i,j}·g_{i,j}

28

✎ Excursus: Linearization of constraints
ILP – Extended ILP in LP Format
/* communication costs */
x12 + x21 - 1 <= g12;
x11 + x22 - 1 <= g12;

x12 + x31 - 1 <= g13;
x11 + x32 - 1 <= g13;

x22 + x41 - 1 <= g24;
x21 + x42 - 1 <= g24;

x22 + x51 - 1 <= g25;
x21 + x52 - 1 <= g25;

x32 + x51 - 1 <= g35;
x31 + x52 - 1 <= g35;

x42 + x61 - 1 <= g46;
x41 + x62 - 1 <= g46;

x52 + x61 - 1 <= g56;
x51 + x62 - 1 <= g56;

/* variables in {0,1} */
g12 <= 1; g13 <= 1; g24 <= 1; g25 <= 1; g35 <= 1; g46 <= 1; g56 <= 1;

/* integer variables */
int g12; int g13; int g24; int g25; int g35; int g46; int g56;

/* objective function */
/* c1s = 80  c1h = 320 */
/* c2s = 240 c2h = 170 */
/* c3s = 710 c3h = 120 */
/* c4s = 130 c4h = 20  */
/* c5s = 100 c5h = 400 */
/* c6s = 80  c6h = 260 */
/* g12 = 20  g13 = 250  g24 = 200  g25 = 10  g35 = 100  g46 = 5  g56 = 50 */
min: 80 x11 + 320 x12 + 240 x21 + 170 x22 + 710 x31 + 120 x32
   + 130 x41 + 20 x42 + 100 x51 + 400 x52 + 80 x61 + 260 x62
   + 20 g12 + 250 g13 + 200 g24 + 10 g25 + 100 g35 + 5 g46 + 50 g56;
29
ILP – Solution w/ HW and Communication Costs
[Figure: optimal allocation & binding with HW and communication costs under the constraint HW cost ≤ 300; only task 3 is bound to vASIC, tasks 1, 2, 4, 5, 6 run on vprocessor, and vbus carries the communication on the edges 1→3 (cost 250) and 3→5 (cost 100)]

constraint: HW cost <= 300
total cost = 1100
30
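All three variants of the example (basic, with HW cost cap, with communication costs) can again be cross-checked by enumerating the 64 bindings. A sketch using the cost and communication tables from the slides (0 = SW, 1 = HW):

```python
from itertools import product

# cost and communication tables from the example
costs = [(80, 320), (240, 170), (710, 120), (130, 20), (100, 400), (80, 260)]
com = {(1, 2): 20, (1, 3): 250, (2, 4): 200, (2, 5): 10,
       (3, 5): 100, (4, 6): 5, (5, 6): 50}

def total_cost(binding, hw_cap=None, with_com=False):
    hw_cost = sum(c[1] for c, b in zip(costs, binding) if b == 1)
    if hw_cap is not None and hw_cost > hw_cap:
        return None  # infeasible: HW cost constraint violated
    cost = sum(c[b] for c, b in zip(costs, binding))
    if with_com:
        # an edge pays the bus cost iff its endpoints are bound differently
        cost += sum(w for (i, j), w in com.items()
                    if binding[i - 1] != binding[j - 1])
    return cost

def best(**kw):
    return min(c for b in product((0, 1), repeat=6)
               if (c := total_cost(b, **kw)) is not None)

print(best())                           # 570: basic ILP
print(best(hw_cap=300))                 # 640: with HW cost constraint
print(best(hw_cap=300, with_com=True))  # 1100: with HW and communication costs
```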
Partitioning Methods – Overview
• exact methods
  – enumeration of solutions
  – integer linear programs (ILP)
• heuristic methods
  – constructive methods
    § random mapping
    § hierarchical clustering
  – iterative methods (refinement methods)
    § greedy partitioners
    § Kernighan-Lin
    § simulated annealing
    § evolutionary algorithms (design space exploration)
31
Constructive Methods
• examples
  – random mapping
    § each object is randomly assigned to some block
    § disadvantage: the probability to generate a near-optimal partitioning is extremely low
  – hierarchical clustering
    § define a closeness function that determines how desirable the grouping of two objects is
    § perform a stepwise grouping of close objects or object groups
    § disadvantage: no easy way to balance the size of the partitions
• constructive methods …
  – lead to valid, but suboptimal results
  – are often used to generate a start partition for iterative (refinement) methods
  – clustering often shows the difficulty of finding a suitable closeness function
32
Hierarchical Clustering (1)
[Figure: initial graph with objects v1-v4 and pairwise closeness values (largest: v1-v3 = 20); merging the closest pair yields v5 = v1 ⋃ v3 with the updated closeness values v5-v2 = 10, v5-v4 = 7, v2-v4 = 4]

closeness between hierarchical objects: arithmetic mean
33
Hierarchical Clustering (2)
[Figure: merging the next closest pair yields v6 = v2 ⋃ v5 (closeness 10); the updated closeness to v4 is (7 + 4)/2 = 5.5]
34
Hierarchical Clustering (3)
[Figure: the final merge v7 = v6 ⋃ v4 takes place at closeness 5.5]
35
Hierarchical Clustering (4)
[Figure: resulting cluster hierarchy; step 1: v5 = v1 ⋃ v3, step 2: v6 = v2 ⋃ v5, step 3: v7 = v6 ⋃ v4; cut lines through the hierarchy correspond to possible partitions]
36
✎ Exercise 6.1: Hierarchical Clustering ✎ Exercise 6.2: Closeness Function
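The merge sequence of the example can be sketched in a few lines. The initial pairwise closeness values are taken from the figure (the split of the values 8 and 6 between the pairs v1-v4 and v3-v4 is an assumption; only their mean 7 is visible in the example), and groups are merged greedily using the arithmetic-mean update from the slides:

```python
# initial pairwise closeness values (from the example figure; the split of
# 8 and 6 between v1-v4 and v3-v4 is an assumption, only their mean matters)
closeness = {
    frozenset(("v1", "v2")): 10, frozenset(("v1", "v3")): 20,
    frozenset(("v2", "v3")): 10, frozenset(("v1", "v4")): 8,
    frozenset(("v3", "v4")): 6, frozenset(("v2", "v4")): 4,
}
objects = ["v1", "v2", "v3", "v4"]
merges = []

while len(objects) > 1:
    # greedily merge the closest pair of (possibly hierarchical) objects
    a, b = max(((x, y) for i, x in enumerate(objects) for y in objects[i + 1:]),
               key=lambda p: closeness[frozenset(p)])
    merges.append((a, b, closeness[frozenset((a, b))]))
    group = a + "+" + b
    objects = [o for o in objects if o not in (a, b)]
    for o in objects:
        # closeness to the new group: arithmetic mean of the two old values
        closeness[frozenset((group, o))] = (closeness[frozenset((a, o))] +
                                            closeness[frozenset((b, o))]) / 2
    objects.append(group)

for a, b, c in merges:
    print(a, "+", b, "merged at closeness", c)
```

The three merges happen at closeness 20, 10, and 5.5, exactly the sequence shown on the slides.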
Greedy HW/SW Partitioning (1)

• bi-partitioning (simplest case): P = {pSW, pHW}
• software-oriented approach: P = {O, {}}
  – in SW, all functions can be realized
  – performance might be too low → migrate objects to HW
• hardware-oriented approach: P = {{}, O}
  – in HW, the performance goals are satisfied (assumes HW is always faster than SW)
  – cost might be too high → migrate objects to SW
37
Greedy HW/SW Partitioning (2)

• migration of objects into the other partition (HW/SW), until there is no more improvement of the cost function f()

REPEAT {
  P_old = P;
  FOR i = 1 TO n {
    IF (f(Move(P, oi)) < f(P)) {
      P = Move(P, oi);
    }
  }
} UNTIL (P == P_old)
38
✎ Example: Greedy structural partitioning ✎ Exercise 7.1: Greedy partitioning
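A minimal sketch of this greedy loop on the cost table of the ILP example (software-oriented start, no communication costs; with independent per-task costs, a single greedy pass already reaches the optimum of 570 here):

```python
# SW/HW cost table from the ILP example; partition 0 = SW, 1 = HW
costs = [(80, 320), (240, 170), (710, 120), (130, 20), (100, 400), (80, 260)]

def f(P):
    # cost function: implementation cost of the current binding
    return sum(costs[i][p] for i, p in enumerate(P))

P = [0] * len(costs)  # software-oriented start: all objects in SW
while True:
    P_old = list(P)
    for i in range(len(P)):
        moved = list(P)
        moved[i] = 1 - moved[i]  # migrate object i to the other partition
        if f(moved) < f(P):      # keep the move only if the cost improves
            P = moved
    if P == P_old:               # no more improvement: stop
        break

print(P)     # [0, 1, 1, 1, 0, 0]
print(f(P))  # 570
```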
Kernighan-Lin (1)
• iterative heuristic method for generating balanced bi-partitions
• problem definition
  – given: graph G(V,E) with weighted edges (V: set of vertices, E: set of edges)
  – find a partitioning into two balanced disjoint partitions A, B such that the weight of the edges between A and B is minimized
[Figure: example graph with vertices v1-v9 and weighted edges, partitioned into two blocks A and B]

39
B. W. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal, 49:291–307, 1970.
Kernighan-Lin (2)
• definitions
  – Ia: internal cost of vertex a (sum of all edge weights between a in A and all other nodes in A)
  – Ea: external cost of vertex a (sum of all edge weights between a and any node b in B)
  – Da = Ea - Ia: difference between external and internal cost
  – ca,b: cost of the edge between a and b
• lemma
  – if two nodes a and b are exchanged, the cost is reduced by
    Told - Tnew = Da + Db - 2ca,b
40
Kernighan-Lin (3)
• proof
  – let z be the cost due to all connections between A and B that do not involve a or b; then:
  – cost before swapping a and b:
    T = z + Ea + Eb - ca,b    (subtract ca,b, which would otherwise be counted twice)
  – cost after swapping a and b:
    T' = z + Ia + Ib + ca,b    (add ca,b because it is not contained in the internal costs)
  – gain in cost: T - T' = (Ea - Ia) + (Eb - Ib) - 2ca,b = Da + Db - 2ca,b
41
[Figure: nodes a ∈ A and b ∈ B joined by an edge of weight ca,b, with internal costs Ia, Ib and external costs Ea, Eb]
Kernighan-Lin (4)
42
function Kernighan-Lin(G(V,E)):
  determine a balanced initial partition of the nodes into sets A and B
  do
    compute D values for all a in A and b in B
    let gv, av, and bv be empty lists
    for (n := 1 to |V|/2)
      find a from A and b from B, such that g = D[a] + D[b] - 2*c[a,b] is maximal
      move a to B and b to A
      remove a and b from further consideration in this pass
      add g to gv, a to av, and b to bv
      update D values for the elements of A = A \ a and B = B \ b
    end for
    find k which maximizes g_max, the sum of gv[1],...,gv[k]
    if (g_max > 0) then
      exchange av[1],av[2],...,av[k] with bv[1],bv[2],...,bv[k]
  until (g_max <= 0)
  return G(V,E)
source: Wikipedia (en, 2.7.2015)
Kernighan-Lin Example (1)

• balanced initial partition A = {v1, v2, v3, v7} and B = {v4, v5, v6, v8}

[Figure: example graph with weighted edges; internal to A: v1-v2 = 3, v2-v3 = 1, v3-v7 = 4; internal to B: v4-v8 = 2, v5-v6 = 2; cut edges: v1-v5 = 1, v2-v6 = 1, v3-v4 = 1]

• compute D values for all a in A and b in B

      E  I  D = E - I
v1    1  3  -2
v2    1  4  -3
v3    1  5  -4
v7    0  4  -4
v4    1  2  -1
v5    1  2  -1
v6    1  2  -1
v8    0  2  -2

44
Kernighan-Lin Example (2)

• find a from A and b from B, such that g = D[a] + D[b] - 2*c[a,b] is maximal

g         v4              v5              v6              v8
v1   -2-1    = -3   -2-1-2  = -5   -2-1    = -3   -2-2 = -4
v2   -3-1    = -4   -3-1    = -4   -3-1-2  = -6   -3-2 = -5
v3   -4-1-2  = -7   -4-1    = -5   -4-1    = -5   -4-2 = -6
v7   -4-1    = -5   -4-1    = -5   -4-1    = -5   -4-2 = -6

→ the maximal gain is g = -3 for the pair (v1, v4)

46
Kernighan-Lin Example (3)
• move a to B and b to A
• remove a and b from further consideration in this pass
• add g to gv, a to av, and b to bv
  – gv = [-3]
  – av = [v1]
  – bv = [v4]

[Figure: graph after tentatively exchanging v1 and v4]

47
Kernighan-Lin Example (4)
• update D values for the elements of A = A \ a and B = B \ b

[Figure: graph after tentatively exchanging v1 and v4]

old D values:
      E  I  D = E - I
v2    1  4  -3
v3    1  5  -4
v7    0  4  -4
v5    1  2  -1
v6    1  2  -1
v8    0  2  -2

updated D values:
      E  I  D = E - I
v2    4  1   3
v3    0  6  -6
v7    0  4  -4
v5    0  3  -3
v6    1  2  -1
v8    2  0   2

48
Kernighan-Lin Example (5)
• iteration 2: find a from A and b from B, such that g = D[a] + D[b] - 2*c[a,b] is maximal

      E  I  D = E - I
v2    4  1   3
v3    0  6  -6
v7    0  4  -4
v5    0  3  -3
v6    1  2  -1
v8    2  0   2

g        v5             v6             v8
v2   3-3    = 0    3-1-2  = 0    3+2  = 5
v3   -6-3   = -9   -6-1   = -7   -6+2 = -4
v7   -4-3   = -7   -4-1   = -5   -4+2 = -2

• gv = [-3, 5]
• av = [v1, v2]
• bv = [v4, v8]

49
Kernighan-Lin Example (6)
• iteration 3: find a from A and b from B, such that g = D[a] + D[b] - 2*c[a,b] is maximal

      E  I  D = E - I
v3    1  5  -4
v7    0  4  -4
v5    0  3  -3
v6    0  3  -3

g        v5            v6
v3   -4-3 = -7   -4-3 = -7
v7   -4-3 = -7   -4-3 = -7

• gv = [-3, 5, -7]
• av = [v1, v2, v3]
• bv = [v4, v8, v5]

50
Kernighan-Lin Example (7)
• iteration 4: find a from A and b from B, such that g = D[a] + D[b] - 2*c[a,b] is maximal

      E  I  D = E - I
v7    4  0   4
v6    2  1   1

g        v6
v7   4+1 = 5

• gv = [-3, 5, -7, 5]
• av = [v1, v2, v3, v7]
• bv = [v4, v8, v5, v6]

(after the complete pass, A and B are fully swapped)

• find k which maximizes g_max, the sum of gv[1],...,gv[k]
  – k=1: g_sum = -3
  – k=2: g_sum = 2
  – k=3: g_sum = -5
  – k=4: g_sum = 0
→ select k=2 and exchange v1, v2 with v4, v8

51
Kernighan-Lin (5)
• algorithmic complexity of one update loop: O(n² log(n))
• widely used in partitioning of netlists for digital circuits
• there exist several extensions
  – unequal partition sizes
  – n-way partitioning
  – finding near-optimal partitions
52
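One pass of the algorithm on the example graph can be sketched as follows; the edge weights are not printed explicitly on the slides and are reconstructed here from the E/I/D tables of the example:

```python
from itertools import accumulate

# edge weights of the example graph, reconstructed from the E/I/D tables
w = {frozenset(e): weight for e, weight in [
    (("v1", "v2"), 3), (("v2", "v3"), 1), (("v3", "v7"), 4),
    (("v4", "v8"), 2), (("v5", "v6"), 2),
    (("v1", "v5"), 1), (("v2", "v6"), 1), (("v3", "v4"), 1)]}

def c(a, b):
    return w.get(frozenset((a, b)), 0)

def cut(A, B):
    # total weight of the edges between A and B
    return sum(c(a, b) for a in A for b in B)

def D(x, own, other):
    # external minus internal cost of vertex x
    return sum(c(x, y) for y in other) - sum(c(x, y) for y in own)

def kl_pass(A, B):
    A, B = set(A), set(B)
    curA, curB = set(A), set(B)      # partition during tentative exchanges
    availA, availB = set(A), set(B)  # vertices not yet locked in this pass
    gv, av, bv = [], [], []
    for _ in range(len(A)):
        # pair with maximal gain g = D[a] + D[b] - 2*c(a,b), ties by name
        neg_g, a, b = min(
            (-(D(x, curA, curB) + D(y, curB, curA) - 2 * c(x, y)), x, y)
            for x in availA for y in availB)
        gv.append(-neg_g); av.append(a); bv.append(b)
        availA.remove(a); availB.remove(b)
        curA.remove(a); curB.remove(b)  # tentatively exchange a and b
        curA.add(b); curB.add(a)
    sums = list(accumulate(gv))
    k = max(range(len(sums)), key=lambda i: sums[i]) + 1
    if sums[k - 1] > 0:                 # apply the best prefix of exchanges
        A = (A - set(av[:k])) | set(bv[:k])
        B = (B - set(bv[:k])) | set(av[:k])
    return A, B, gv

A, B, gv = kl_pass({"v1", "v2", "v3", "v7"}, {"v4", "v5", "v6", "v8"})
print(gv)         # [-3, 5, -7, 5] as in the example
print(sorted(A))  # ['v3', 'v4', 'v7', 'v8']
print(cut(A, B))  # cut weight reduced from 3 to 1
```

With k = 2, v1 and v2 are exchanged with v4 and v8, improving the cut from 3 to 1, in agreement with the worked example.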
Simulated Annealing (1)
• simulated annealing
  – nature-inspired optimization method (meta-heuristic)
  – metal and glass take on minimal energy states when they are cooled down under certain conditions:
    § for each temperature, thermodynamic equilibrium is reached
    § the temperature is decreased arbitrarily slowly
  – probability for a particle jumping from energy state ei into a higher energy state ej:

    P(ei, ej, T) = e^((ei - ej) / (kB·T))

• application to combinatorial optimization
  – energy = cost of a solution
  – reduction of cost with simulated temperature; sometimes increases in cost are accepted

53
Simulated Annealing (2)
temp = temp_start;
cost = c(P);
WHILE (Frozen() == FALSE) {
  WHILE (Equilibrium() == FALSE) {
    P' = RandomMove(P);
    cost' = c(P');
    deltacost = cost' - cost;
    IF (Accept(deltacost, temp) > random[0,1)) {
      P = P';
      cost = cost';
    }
  }
  temp = DecreaseTemp(temp);
}

deltacost < 0 → improvement

Accept(deltacost, temp) = min(1, e^(-deltacost / (k·temp)))

54
Simulated Annealing (3)
• accept function

[Plot: accept(deltacost, t) = min(1, e^(-deltacost/(k·t))) for k = 1 and t ∈ {0.1, 0.5, 1.0, 1.5}; deltacost < 0 (improvement) is always accepted, and the probability of accepting a cost increase grows with the temperature t]

55
Simulated Annealing (4)
• annealing schedule: DecreaseTemp(), Frozen()
  – temp_start = 1.0
  – temp = a · temp (typical: 0.8 ≤ a ≤ 0.99)
  – stop at temp < temp_min or if there is no more improvement
• equilibrium: Equilibrium()
  – after a certain number of iterations or if there is no more improvement
• complexity
  – ranges from exponential to constant, depending on the implementation of the functions Equilibrium(), DecreaseTemp(), Frozen()
  – the longer the runtime, the better the results
  – usually the functions are constructed to obtain polynomial runtime
56
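The pseudocode can be turned into a small self-contained sketch; here it minimizes the cut of the Kernighan-Lin example graph by swapping random vertex pairs (the schedule parameters and the move operator are illustrative choices, not taken from the slides):

```python
import math
import random

random.seed(0)

# edge weights of the Kernighan-Lin example graph (reconstructed)
w = {frozenset(e): weight for e, weight in [
    (("v1", "v2"), 3), (("v2", "v3"), 1), (("v3", "v7"), 4),
    (("v4", "v8"), 2), (("v5", "v6"), 2),
    (("v1", "v5"), 1), (("v2", "v6"), 1), (("v3", "v4"), 1)]}

def cut(A, B):
    return sum(w.get(frozenset((a, b)), 0) for a in A for b in B)

def random_move(A, B):
    # neighbor solution: swap one random vertex pair to keep the sizes balanced
    a, b = random.choice(sorted(A)), random.choice(sorted(B))
    return (A - {a}) | {b}, (B - {b}) | {a}

A, B = {"v1", "v2", "v3", "v7"}, {"v4", "v5", "v6", "v8"}
cost = best = cut(A, B)
temp = 1.0                               # temp_start
while temp > 0.01:                       # Frozen()
    for _ in range(20):                  # Equilibrium(): fixed number of moves
        A2, B2 = random_move(A, B)
        delta = cut(A2, B2) - cost
        # Accept(): improvements always, increases with probability e^(-delta/temp)
        if delta <= 0 or math.exp(-delta / temp) > random.random():
            A, B, cost = A2, B2, cost + delta
            best = min(best, cost)
    temp *= 0.9                          # DecreaseTemp()

print(best)  # best cut found; the optimum for this graph is 1
```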
Partitioning Methods – Recap
• exact methods
  – enumeration of solutions
  – integer linear programs (ILP)
• heuristic methods
  – constructive methods
    § random mapping
    § hierarchical clustering
  – iterative methods (refinement methods)
    § greedy partitioners
    § Kernighan-Lin
    § simulated annealing
    § evolutionary algorithms (design space exploration)
57
Changes
• 2017-07-10 (v1.6.1)
  – added extensive example on Kernighan-Lin
• 2017-06-25 (v1.6.0)
  – updated for SS2017
• 2016-07-05 (v1.5.2)
  – fixed typo on equation on slide 43
• 2016-06-21 (v1.5.1)
  – fixed many equations that were broken due to a PowerPoint update
• 2016-06-05 (v1.5.0)
  – updated for SS2016
• 2015-07-02 (v1.4.1)
  – corrected Kernighan-Lin pseudocode (slide 41)
• 2015-06-18 (v1.4.0)
  – updated for SS2015
58
Changes
• 2014-06-26 (v1.3.1)
  – minor cosmetic changes
• 2014-06-11 (v1.3.0)
  – updated for SS2014 (added detailed description of the Kernighan-Lin algorithm)
• 2013-06-12 (v1.2.0)
  – updated for SS2013
• 2012-05-12 (v1.1.0)
  – updated for SS2012
• 2011-06-07 (v1.0.0)
  – initial version
• 2011-06-09 (v1.0.1)
  – cosmetic changes
59