System Partitioning
Dr. Tobias Kenter, Prof. Dr. Christian Plessl
SS 2017
High-Performance IT Systems Group, Paderborn University
Version 1.6.0 – 2017-06-25
Overview
• graph models for system synthesis
• the partitioning problem
• partitioning methods
2
Overview
• graph models for system synthesis
• the partitioning problem
• partitioning methods
3
Models for System Synthesis
• allocation + binding = partitioning
• problem graph
  – nodes: functional and communication objects
  – edges: dependencies
• architecture graph
  – nodes: functional and communication resources
  – edges: directed communication paths
• specification graph
  – problem graph + architecture graph + possible mappings
4
Problem Graph
[Figure: data-flow graph (DFG) with functional nodes 1-4, and the derived problem graph in which communication nodes 5-7 are inserted on the dependency edges]
5
Architecture Graph
[Figure: target architecture with a RISC processor, two hardware modules HWM1 and HWM2, a shared bus, and a point-to-point link; the corresponding architecture graph has the nodes vRISC, vHWM1, vHWM2, vbus, vptp]
6
Specification Graph
[Figure: specification graph combining the problem graph (nodes 1-7) and the architecture graph (vRISC, vHWM1, vHWM2, vbus, vptp), with mapping edges indicating the possible bindings]
7
Allocation, Binding
[Figure: one allocation and binding in the specification graph; a subset of the architecture nodes is allocated, and each problem graph node (1-7) is bound to exactly one allocated resource]
8
Example: Homogeneous Multiprocessor
• optimization goals
  – minimize latency
  – meet deadlines
  – …

[Figure: homogeneous multiprocessor with three processing elements PE1-PE3, each with local memory M, attached to a shared bus; architecture graph nodes vPE1, vPE2, vPE3, vbus]
9
Example: HW/SW Bi-Partitioning
• bi-partitioning means that there are only two blocks: SW and HW (accelerator)
[Figure: HW/SW architecture with a processor and an FPGA connected by a bus; architecture graph nodes vprocessor, vFPGA, vbus]
10
Synthesis and Constraints
• system synthesis modeling
  – create problem graph
  – define architectural template
  – identify possible bindings
  – refinements: communication, memory, …
• system synthesis with constraints Cmax, Lmax
  – allocation: determines cost C (first approximation)
  – binding
  – scheduling: determines latency L
  – feasible schedule: L ≤ Lmax
  – feasible binding: leads to at least one feasible schedule
  – feasible allocation: C ≤ Cmax and leads to at least one feasible binding
11
Overview
• graph models for system synthesis
• the partitioning problem
• partitioning methods
12
Partitioning Problem
• definition: the partitioning problem is to assign n objects O = {o1, ..., on} to k partitions P = {p1, ..., pk}, such that
  – p1 ⋃ p2 ⋃ ... ⋃ pk = O,
  – pi ⋂ pj = { } ∀ i, j: i ≠ j, and
  – the cost c(P) is minimized
• the general partitioning problem is NP-complete
• in system synthesis:
  – objects: problem graph nodes
  – partitions: architecture graph nodes
13
Partitioning – Abstraction Levels
• structural partitioning
  – at the RTL or netlist level
  – e.g. map a digital circuit onto two chips (FPGAs, ASICs)
  – system parameters are relatively well known (area and delay of functional units, registers, gates)
  – no comparison of design alternatives
  – cost: e.g. number of wires between partitions, maximum delay between all partitions, average size of each partition, ...
• functional partitioning
  – at the system level
  – system parameters are not known → estimation required
  – comparison of design alternatives → design space exploration
  – cost: e.g. required communication bandwidth or connections between partitions, hardware implementation cost for all partitions, ...
14
Cost Functions
• measure the quality of a design point
  – may include:
      C   system cost in [$]
      L   latency in [sec]
      P   power consumption in [W]
  – requires estimation to find C, L, P
• example: linear cost function with penalty

  f(C, L, P) = k1·hC(C, Cmax) + k2·hL(L, Lmax) + k3·hP(P, Pmax)

  – hC, hL, hP denote how strongly C, L, P violate the design constraints Cmax, Lmax, Pmax
  – k1, k2, k3: weighting and normalization factors
15
✎ Example: Cost functions
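As a sketch of how such a penalty-based cost function might look in code (the relative-violation shape of the penalty terms and the default weights are illustrative assumptions, not taken from the slides):

```python
def penalty(value, limit):
    # h(value, limit): zero if the constraint is met,
    # otherwise the relative amount of violation
    return max(0.0, (value - limit) / limit)

def cost(C, L, P, Cmax, Lmax, Pmax, k1=1.0, k2=1.0, k3=1.0):
    # linear cost function with penalty terms: f = k1*hC + k2*hL + k3*hP
    return (k1 * penalty(C, Cmax)
            + k2 * penalty(L, Lmax)
            + k3 * penalty(P, Pmax))

# a design point that meets all constraints has cost 0
print(cost(C=900, L=10, P=4, Cmax=1000, Lmax=20, Pmax=5))  # 0.0
# violating the latency constraint by 50% adds a penalty of 0.5
print(cost(C=900, L=30, P=4, Cmax=1000, Lmax=20, Pmax=5))  # 0.5
```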
Overview
• graph models for system synthesis
• the partitioning problem
• partitioning methods
16
Partitioning Methods – Overview
• exact methods
  – enumeration of solutions
  – integer linear programs (ILP)
• heuristic methods
  – constructive methods
    § random mapping
    § hierarchical clustering
  – iterative methods (refinement methods)
    § greedy partitioners
    § Kernighan-Lin
    § simulated annealing
    § evolutionary algorithms (design space exploration)
17
Enumeration
• Enumeration is not feasible even for small partitioning problems, since the number of candidate solutions grows exponentially with the number of objects
18
✎ Example: Enumeration
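The blow-up is easy to quantify: for a pure binding problem, each of the n objects can be mapped to any of m partitions, which yields m^n candidates. A quick sketch:

```python
# number of candidate bindings: each of n objects can be mapped
# to any of m partitions
def num_bindings(n, m):
    return m ** n

print(num_bindings(6, 2))    # 64: still enumerable (the ILP example below)
print(num_bindings(100, 2))  # about 1.3e30: hopeless to enumerate
```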
ILP – Example
19
[Figure: problem graph with tasks 1-6 and communication edges, and a target architecture consisting of a processor and an ASIC connected by a bus]
task SW cost HW cost
1 80 320
2 240 170
3 710 120
4 130 20
5 100 400
6 80 260
cost table
→ find optimal allocation & binding
(all bindings possible)
ILP – Basic Problem Formulation
• binary variables xi,k: xi,k = 1 means object oi is in partition pk
• costs ci,k (constants), incurred if object oi is in partition pk
• integer linear program:
20
minimize:  Σ_{i=1..n} Σ_{k=1..m} x_{i,k}·c_{i,k}

subject to:
  Σ_{k=1..m} x_{i,k} = 1    for 1 ≤ i ≤ n
  x_{i,k} ∈ {0, 1}          for 1 ≤ i ≤ n, 1 ≤ k ≤ m
ILP – Illustration Binding
[Figure: incomplete specification graph; tasks 1-6 on one side, architecture nodes vprocessor and vASIC on the other, with binding edges labeled x11, x12, x21, x22, ...; cost table as on the previous slide]

n = 6, m = 2 → n·m = 12 binary variables
21
ILP – Basic ILP in LP Format
/* objective function */
/* c1s = 80  c1h = 320 */
/* c2s = 240 c2h = 170 */
/* c3s = 710 c3h = 120 */
/* c4s = 130 c4h = 20  */
/* c5s = 100 c5h = 400 */
/* c6s = 80  c6h = 260 */
min: 80 x11 + 320 x12 + 240 x21 + 170 x22 + 710 x31 + 120 x32
   + 130 x41 + 20 x42 + 100 x51 + 400 x52 + 80 x61 + 260 x62;

/* unique mapping constraints */
x11 + x12 = 1;
x21 + x22 = 1;
x31 + x32 = 1;
x41 + x42 = 1;
x51 + x52 = 1;
x61 + x62 = 1;

/* variables in {0,1} */
x11 <= 1; x12 <= 1; x21 <= 1; x22 <= 1; x31 <= 1; x32 <= 1;
x41 <= 1; x42 <= 1; x51 <= 1; x52 <= 1; x61 <= 1; x62 <= 1;

/* integer variables */
int x11; int x12; int x21; int x22; int x31; int x32;
int x41; int x42; int x51; int x52; int x61; int x62;
22
ILP – Solution of Basic ILP w/ Cost Constraints
[Figure: optimal allocation & binding for the basic ILP; tasks 1, 5, 6 bound to vprocessor, tasks 2, 3, 4 bound to vASIC; cost table as before; n = 6, m = 2, 12 binary variables]

total cost = 570
23
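Since this instance has only 2^6 = 64 possible bindings, the ILP result can be cross-checked by plain enumeration. A sketch using the cost table from the example (0 = SW, 1 = HW):

```python
from itertools import product

# cost table from the example: (SW cost, HW cost) for tasks 1..6
costs = [(80, 320), (240, 170), (710, 120), (130, 20), (100, 400), (80, 260)]

best_cost, best_binding = min(
    (sum(costs[i][b] for i, b in enumerate(binding)), binding)
    for binding in product((0, 1), repeat=len(costs))
)
print(best_cost)     # 570, matching the ILP solution
print(best_binding)  # (0, 1, 1, 1, 0, 0): tasks 2, 3, 4 in HW
```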
ILP – Additional Constraints (1)
• at most h_k objects in partition p_k:

  Σ_{i=1..n} x_{i,k} ≤ h_k    for 1 ≤ k ≤ m

• maximal cost H_k of the objects in partition p_k:

  Σ_{i=1..n} c_{i,k}·x_{i,k} ≤ H_k    for 1 ≤ k ≤ m

24
ILP – Additional Constraints (2)
• constraint on the hardware cost
  – the cost of all tasks mapped to hardware must not exceed 300

  Σ_{i=1..6} c_{i,2}·x_{i,2} ≤ 300

/* hardware cost constraint */
320 x12 + 170 x22 + 120 x32 + 20 x42 + 400 x52 + 260 x62 <= 300;

25
ILP – Solution w/ Cost and Capacity Constraints
[Figure: optimal allocation & binding under the constraint HW cost ≤ 300; tasks 1, 2, 5, 6 bound to vprocessor, tasks 3, 4 bound to vASIC]

constraint: HW cost <= 300
total cost = 640
26
ILP – Communication Cost Modeling (1)
• including communication cost
  – assume local communication is free (SW-SW or HW-HW) but bus communication incurs cost (SW-HW or HW-SW)
  – for simplicity, we assume additive costs
edge i →j local communication bus communication
1→2 0 20
1→3 0 250
2→4 0 200
2→5 0 10
3→5 0 100
4→6 0 5
5→6 0 50
communication cost: com_{i,j}
27
ILP – Communication Cost Modeling (2)
• introduce additional binary variables
  – g_{i,j} = 1 iff the communication from node i to node j uses the bus
• define g_{i,j} in terms of the primary variables x_{i,k}:

  x_{i,1}·x_{j,2} + x_{i,2}·x_{j,1} = g_{i,j}    ∀ edges (v_i, v_j)    → non-linear!

  linearization:

  x_{i,1} + x_{j,2} - 1 ≤ g_{i,j}
  x_{i,2} + x_{j,1} - 1 ≤ g_{i,j}    ∀ edges (v_i, v_j)

  (the inequalities only force g_{i,j} to 1 when the two nodes are bound to different partitions; since g_{i,j} carries a positive cost, minimization drives it to 0 otherwise)

• new cost/objective function:

  min Σ_{i=1..n} (c_{i,1}·x_{i,1} + c_{i,2}·x_{i,2}) + Σ_{edges (v_i,v_j)} com_{i,j}·g_{i,j}

28

✎ Excursus: Linearization of constraints
ILP – Extended ILP in LP Format
/* communication costs */
x12 + x21 - 1 <= g12;
x11 + x22 - 1 <= g12;

x12 + x31 - 1 <= g13;
x11 + x32 - 1 <= g13;

x22 + x41 - 1 <= g24;
x21 + x42 - 1 <= g24;

x22 + x51 - 1 <= g25;
x21 + x52 - 1 <= g25;

x32 + x51 - 1 <= g35;
x31 + x52 - 1 <= g35;

x42 + x61 - 1 <= g46;
x41 + x62 - 1 <= g46;

x52 + x61 - 1 <= g56;
x51 + x62 - 1 <= g56;

/* variables in {0,1} */
g12 <= 1; g13 <= 1; g24 <= 1; g25 <= 1; g35 <= 1; g46 <= 1; g56 <= 1;

/* integer variables */
int g12; int g13; int g24; int g25; int g35; int g46; int g56;

/* objective function */
/* c1s = 80  c1h = 320 */
/* c2s = 240 c2h = 170 */
/* c3s = 710 c3h = 120 */
/* c4s = 130 c4h = 20  */
/* c5s = 100 c5h = 400 */
/* c6s = 80  c6h = 260 */
/* g12 = 20  g13 = 250  g24 = 200  g25 = 10  g35 = 100  g46 = 5  g56 = 50 */
min: 80 x11 + 320 x12 + 240 x21 + 170 x22 + 710 x31 + 120 x32
   + 130 x41 + 20 x42 + 100 x51 + 400 x52 + 80 x61 + 260 x62
   + 20 g12 + 250 g13 + 200 g24 + 10 g25 + 100 g35 + 5 g46 + 50 g56;
29
ILP – Solution w/ HW and Communication Costs
[Figure: optimal allocation & binding with HW and communication costs under the constraint HW cost ≤ 300; only task 3 is bound to vASIC, tasks 1, 2, 4, 5, 6 run on vprocessor, and vbus carries the communication on the edges 1→3 (cost 250) and 3→5 (cost 100)]

constraint: HW cost <= 300
total cost = 1100
30
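All three variants of the example (basic, with HW cost cap, with communication costs) can again be cross-checked by enumerating the 64 bindings. A sketch using the cost and communication tables from the slides (0 = SW, 1 = HW):

```python
from itertools import product

# cost and communication tables from the example
costs = [(80, 320), (240, 170), (710, 120), (130, 20), (100, 400), (80, 260)]
com = {(1, 2): 20, (1, 3): 250, (2, 4): 200, (2, 5): 10,
       (3, 5): 100, (4, 6): 5, (5, 6): 50}

def total_cost(binding, hw_cap=None, with_com=False):
    hw_cost = sum(c[1] for c, b in zip(costs, binding) if b == 1)
    if hw_cap is not None and hw_cost > hw_cap:
        return None  # infeasible: HW cost constraint violated
    cost = sum(c[b] for c, b in zip(costs, binding))
    if with_com:
        # an edge pays the bus cost iff its endpoints are bound differently
        cost += sum(w for (i, j), w in com.items()
                    if binding[i - 1] != binding[j - 1])
    return cost

def best(**kw):
    return min(c for b in product((0, 1), repeat=6)
               if (c := total_cost(b, **kw)) is not None)

print(best())                           # 570: basic ILP
print(best(hw_cap=300))                 # 640: with HW cost constraint
print(best(hw_cap=300, with_com=True))  # 1100: with HW and communication costs
```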
Partitioning Methods – Overview
• exact methods
  – enumeration of solutions
  – integer linear programs (ILP)
• heuristic methods
  – constructive methods
    § random mapping
    § hierarchical clustering
  – iterative methods (refinement methods)
    § greedy partitioners
    § Kernighan-Lin
    § simulated annealing
    § evolutionary algorithms (design space exploration)
31
Constructive Methods
• examples
  – random mapping
    § each object is randomly assigned to some block
    § disadvantage: the probability to generate a near-optimal partitioning is extremely low
  – hierarchical clustering
    § define a closeness function that determines how desirable the grouping of two objects is
    § perform a stepwise grouping of close objects or object groups
    § disadvantage: no easy way to balance the size of the partitions
• constructive methods …
  – lead to valid, but suboptimal results
  – are often used to generate a start partition for iterative (refinement) methods
  – clustering often shows the difficulty of finding a suitable closeness function
32
Hierarchical Clustering (1)
[Figure: initial graph with objects v1-v4 and pairwise closeness values (largest: v1-v3 = 20); merging the closest pair yields v5 = v1 ⋃ v3 with the updated closeness values v5-v2 = 10, v5-v4 = 7, v2-v4 = 4]

closeness between hierarchical objects: arithmetic mean
33
Hierarchical Clustering (2)
[Figure: merging the next closest pair yields v6 = v2 ⋃ v5 (closeness 10); the updated closeness to v4 is (7 + 4)/2 = 5.5]
34
Hierarchical Clustering (3)
[Figure: the final merge v7 = v6 ⋃ v4 takes place at closeness 5.5]
35
Hierarchical Clustering (4)
[Figure: resulting cluster hierarchy; step 1: v5 = v1 ⋃ v3, step 2: v6 = v2 ⋃ v5, step 3: v7 = v6 ⋃ v4; cut lines through the hierarchy correspond to possible partitions]
36
✎ Exercise 6.1: Hierarchical Clustering ✎ Exercise 6.2: Closeness Function
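The merge sequence of the example can be sketched in a few lines. The initial pairwise closeness values are taken from the figure (the split of the values 8 and 6 between the pairs v1-v4 and v3-v4 is an assumption; only their mean 7 is visible in the example), and groups are merged greedily using the arithmetic-mean update from the slides:

```python
# initial pairwise closeness values (from the example figure; the split of
# 8 and 6 between v1-v4 and v3-v4 is an assumption, only their mean matters)
closeness = {
    frozenset(("v1", "v2")): 10, frozenset(("v1", "v3")): 20,
    frozenset(("v2", "v3")): 10, frozenset(("v1", "v4")): 8,
    frozenset(("v3", "v4")): 6, frozenset(("v2", "v4")): 4,
}
objects = ["v1", "v2", "v3", "v4"]
merges = []

while len(objects) > 1:
    # greedily merge the closest pair of (possibly hierarchical) objects
    a, b = max(((x, y) for i, x in enumerate(objects) for y in objects[i + 1:]),
               key=lambda p: closeness[frozenset(p)])
    merges.append((a, b, closeness[frozenset((a, b))]))
    group = a + "+" + b
    objects = [o for o in objects if o not in (a, b)]
    for o in objects:
        # closeness to the new group: arithmetic mean of the two old values
        closeness[frozenset((group, o))] = (closeness[frozenset((a, o))] +
                                            closeness[frozenset((b, o))]) / 2
    objects.append(group)

for a, b, c in merges:
    print(a, "+", b, "merged at closeness", c)
```

The three merges happen at closeness 20, 10, and 5.5, exactly the sequence shown on the slides.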
Greedy HW/SW Partitioning (1)

• bi-partitioning (simplest case): P = {pSW, pHW}
• software-oriented approach: P = {O, {}}
  – in SW, all functions can be realized
  – performance might be too low → migrate objects to HW
• hardware-oriented approach: P = {{}, O}
  – in HW, the performance goals are satisfied (assumes HW is always faster than SW)
  – cost might be too high → migrate objects to SW
37
Greedy HW/SW Partitioning (2)

• migration of objects into the other partition (HW/SW), until there is no more improvement of the cost function f()

REPEAT {
  P_old = P;
  FOR i = 1 TO n {
    IF (f(Move(P, oi)) < f(P)) {
      P = Move(P, oi);
    }
  }
} UNTIL (P == P_old)
38
✎ Example: Greedy structural partitioning ✎ Exercise 7.1: Greedy partitioning
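A minimal sketch of this greedy loop on the cost table of the ILP example (software-oriented start, no communication costs; with independent per-task costs, a single greedy pass already reaches the optimum of 570 here):

```python
# SW/HW cost table from the ILP example; partition 0 = SW, 1 = HW
costs = [(80, 320), (240, 170), (710, 120), (130, 20), (100, 400), (80, 260)]

def f(P):
    # cost function: implementation cost of the current binding
    return sum(costs[i][p] for i, p in enumerate(P))

P = [0] * len(costs)  # software-oriented start: all objects in SW
while True:
    P_old = list(P)
    for i in range(len(P)):
        moved = list(P)
        moved[i] = 1 - moved[i]  # migrate object i to the other partition
        if f(moved) < f(P):      # keep the move only if the cost improves
            P = moved
    if P == P_old:               # no more improvement: stop
        break

print(P)     # [0, 1, 1, 1, 0, 0]
print(f(P))  # 570
```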
Kernighan-Lin (1)
• iterative heuristic method for generating balanced bi-partitions
• problem definition
  – given: graph G(V,E) with weighted edges (V: set of vertices, E: set of edges)
  – find a partitioning into two balanced disjoint partitions A, B such that the weight of the edges between A and B is minimized
[Figure: example graph with vertices v1-v9 and weighted edges, partitioned into two blocks A and B]

39
B. W. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal, 49:291–307, 1970.
Kernighan-Lin (2)
• definitions
  – Ia: internal cost of vertex a (sum of all edge weights between a in A and all other nodes in A)
  – Ea: external cost of vertex a (sum of all edge weights between a and any node b in B)
  – Da = Ea - Ia: difference between external and internal cost
  – ca,b: cost of the edge between a and b
• lemma
  – if two nodes a and b are exchanged, the cost is reduced by
    Told - Tnew = Da + Db - 2ca,b
40
Kernighan-Lin (3)
• proof
  – let z be the cost due to all connections between A and B that do not involve a or b; then:
  – cost before swapping a and b:
    T = z + Ea + Eb - ca,b    (subtract ca,b, which would otherwise be counted twice)
  – cost after swapping a and b:
    T' = z + Ia + Ib + ca,b    (add ca,b because it is not contained in the internal costs)
  – gain in cost: T - T' = (Ea - Ia) + (Eb - Ib) - 2ca,b = Da + Db - 2ca,b
41
[Figure: nodes a ∈ A and b ∈ B joined by an edge of weight ca,b, with internal costs Ia, Ib and external costs Ea, Eb]
Kernighan-Lin (4)
42
function Kernighan-Lin(G(V,E)):
  determine a balanced initial partition of the nodes into sets A and B
  do
    compute D values for all a in A and b in B
    let gv, av, and bv be empty lists
    for (n := 1 to |V|/2)
      find a from A and b from B, such that g = D[a] + D[b] - 2*c[a,b] is maximal
      move a to B and b to A
      remove a and b from further consideration in this pass
      add g to gv, a to av, and b to bv
      update D values for the elements of A = A \ a and B = B \ b
    end for
    find k which maximizes g_max, the sum of gv[1],...,gv[k]
    if (g_max > 0) then
      exchange av[1],av[2],...,av[k] with bv[1],bv[2],...,bv[k]
  until (g_max <= 0)
  return G(V,E)
source: Wikipedia (en, 2.7.2015)
Kernighan-Lin Example (1)

• balanced initial partition A = {v1, v2, v3, v7} and B = {v4, v5, v6, v8}

[Figure: example graph with weighted edges; internal to A: v1-v2 = 3, v2-v3 = 1, v3-v7 = 4; internal to B: v4-v8 = 2, v5-v6 = 2; cut edges: v1-v5 = 1, v2-v6 = 1, v3-v4 = 1]

• compute D values for all a in A and b in B

      E  I  D = E - I
v1    1  3  -2
v2    1  4  -3
v3    1  5  -4
v7    0  4  -4
v4    1  2  -1
v5    1  2  -1
v6    1  2  -1
v8    0  2  -2

44
Kernighan-Lin Example (2)

• find a from A and b from B, such that g = D[a] + D[b] - 2*c[a,b] is maximal

g         v4              v5              v6              v8
v1   -2-1    = -3   -2-1-2  = -5   -2-1    = -3   -2-2 = -4
v2   -3-1    = -4   -3-1    = -4   -3-1-2  = -6   -3-2 = -5
v3   -4-1-2  = -7   -4-1    = -5   -4-1    = -5   -4-2 = -6
v7   -4-1    = -5   -4-1    = -5   -4-1    = -5   -4-2 = -6

→ the maximal gain is g = -3 for the pair (v1, v4)

46
Kernighan-Lin Example (3)
• move a to B and b to A
• remove a and b from further consideration in this pass
• add g to gv, a to av, and b to bv
  – gv = [-3]
  – av = [v1]
  – bv = [v4]

[Figure: graph after tentatively exchanging v1 and v4]

47
Kernighan-Lin Example (4)
• update D values for the elements of A = A \ a and B = B \ b

[Figure: graph after tentatively exchanging v1 and v4]

old D values:
      E  I  D = E - I
v2    1  4  -3
v3    1  5  -4
v7    0  4  -4
v5    1  2  -1
v6    1  2  -1
v8    0  2  -2

updated D values:
      E  I  D = E - I
v2    4  1   3
v3    0  6  -6
v7    0  4  -4
v5    0  3  -3
v6    1  2  -1
v8    2  0   2

48
Kernighan-Lin Example (5)
• iteration 2: find a from A and b from B, such that g = D[a] + D[b] - 2*c[a,b] is maximal

      E  I  D = E - I
v2    4  1   3
v3    0  6  -6
v7    0  4  -4
v5    0  3  -3
v6    1  2  -1
v8    2  0   2

g        v5             v6             v8
v2   3-3    = 0    3-1-2  = 0    3+2  = 5
v3   -6-3   = -9   -6-1   = -7   -6+2 = -4
v7   -4-3   = -7   -4-1   = -5   -4+2 = -2

• gv = [-3, 5]
• av = [v1, v2]
• bv = [v4, v8]

49
Kernighan-Lin Example (6)
• iteration 3: find a from A and b from B, such that g = D[a] + D[b] - 2*c[a,b] is maximal

      E  I  D = E - I
v3    1  5  -4
v7    0  4  -4
v5    0  3  -3
v6    0  3  -3

g        v5            v6
v3   -4-3 = -7   -4-3 = -7
v7   -4-3 = -7   -4-3 = -7

• gv = [-3, 5, -7]
• av = [v1, v2, v3]
• bv = [v4, v8, v5]

50
Kernighan-Lin Example (7)
• iteration 4: find a from A and b from B, such that g = D[a] + D[b] - 2*c[a,b] is maximal

      E  I  D = E - I
v7    4  0   4
v6    2  1   1

g        v6
v7   4+1 = 5

• gv = [-3, 5, -7, 5]
• av = [v1, v2, v3, v7]
• bv = [v4, v8, v5, v6]

(after the complete pass, A and B are fully swapped)

• find k which maximizes g_max, the sum of gv[1],...,gv[k]
  – k=1: g_sum = -3
  – k=2: g_sum = 2
  – k=3: g_sum = -5
  – k=4: g_sum = 0
→ select k=2 and exchange v1, v2 with v4, v8

51
Kernighan-Lin (5)
• algorithmic complexity of one update loop: O(n² log(n))
• widely used in partitioning of netlists for digital circuits
• there exist several extensions
  – unequal partition sizes
  – n-way partitioning
  – finding near-optimal partitions
52
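One pass of the algorithm on the example graph can be sketched as follows; the edge weights are not printed explicitly on the slides and are reconstructed here from the E/I/D tables of the example:

```python
from itertools import accumulate

# edge weights of the example graph, reconstructed from the E/I/D tables
w = {frozenset(e): weight for e, weight in [
    (("v1", "v2"), 3), (("v2", "v3"), 1), (("v3", "v7"), 4),
    (("v4", "v8"), 2), (("v5", "v6"), 2),
    (("v1", "v5"), 1), (("v2", "v6"), 1), (("v3", "v4"), 1)]}

def c(a, b):
    return w.get(frozenset((a, b)), 0)

def cut(A, B):
    # total weight of the edges between A and B
    return sum(c(a, b) for a in A for b in B)

def D(x, own, other):
    # external minus internal cost of vertex x
    return sum(c(x, y) for y in other) - sum(c(x, y) for y in own)

def kl_pass(A, B):
    A, B = set(A), set(B)
    curA, curB = set(A), set(B)      # partition during tentative exchanges
    availA, availB = set(A), set(B)  # vertices not yet locked in this pass
    gv, av, bv = [], [], []
    for _ in range(len(A)):
        # pair with maximal gain g = D[a] + D[b] - 2*c(a,b), ties by name
        neg_g, a, b = min(
            (-(D(x, curA, curB) + D(y, curB, curA) - 2 * c(x, y)), x, y)
            for x in availA for y in availB)
        gv.append(-neg_g); av.append(a); bv.append(b)
        availA.remove(a); availB.remove(b)
        curA.remove(a); curB.remove(b)  # tentatively exchange a and b
        curA.add(b); curB.add(a)
    sums = list(accumulate(gv))
    k = max(range(len(sums)), key=lambda i: sums[i]) + 1
    if sums[k - 1] > 0:                 # apply the best prefix of exchanges
        A = (A - set(av[:k])) | set(bv[:k])
        B = (B - set(bv[:k])) | set(av[:k])
    return A, B, gv

A, B, gv = kl_pass({"v1", "v2", "v3", "v7"}, {"v4", "v5", "v6", "v8"})
print(gv)         # [-3, 5, -7, 5] as in the example
print(sorted(A))  # ['v3', 'v4', 'v7', 'v8']
print(cut(A, B))  # cut weight reduced from 3 to 1
```

With k = 2, v1 and v2 are exchanged with v4 and v8, improving the cut from 3 to 1, in agreement with the worked example.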
Simulated Annealing (1)
• simulated annealing
  – nature-inspired optimization method (meta-heuristic)
  – metal and glass take on minimal energy states when they are cooled down under certain conditions:
    § for each temperature, thermodynamic equilibrium is reached
    § the temperature is decreased arbitrarily slowly
  – probability for a particle jumping from energy state ei into a higher energy state ej:

    P(ei, ej, T) = e^((ei - ej) / (kB·T))

• application to combinatorial optimization
  – energy = cost of a solution
  – reduction of cost with simulated temperature; sometimes increases in cost are accepted

53
Simulated Annealing (2)
temp = temp_start;
cost = c(P);
WHILE (Frozen() == FALSE) {
  WHILE (Equilibrium() == FALSE) {
    P' = RandomMove(P);
    cost' = c(P');
    deltacost = cost' - cost;
    IF (Accept(deltacost, temp) > random[0,1)) {
      P = P';
      cost = cost';
    }
  }
  temp = DecreaseTemp(temp);
}

deltacost < 0 → improvement

Accept(deltacost, temp) = min(1, e^(-deltacost / (k·temp)))

54
Simulated Annealing (3)
• accept function

[Plot: accept(deltacost, t) = min(1, e^(-deltacost/(k·t))) for k = 1 and t ∈ {0.1, 0.5, 1.0, 1.5}; deltacost < 0 (improvement) is always accepted, and the probability of accepting a cost increase grows with the temperature t]

55
Simulated Annealing (4)
• annealing schedule: DecreaseTemp(), Frozen()
  – temp_start = 1.0
  – temp = a · temp (typical: 0.8 ≤ a ≤ 0.99)
  – stop at temp < temp_min or if there is no more improvement
• equilibrium: Equilibrium()
  – after a certain number of iterations or if there is no more improvement
• complexity
  – ranges from exponential to constant, depending on the implementation of the functions Equilibrium(), DecreaseTemp(), Frozen()
  – the longer the runtime, the better the results
  – usually the functions are constructed to obtain polynomial runtime
56
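The pseudocode can be turned into a small self-contained sketch; here it minimizes the cut of the Kernighan-Lin example graph by swapping random vertex pairs (the schedule parameters and the move operator are illustrative choices, not taken from the slides):

```python
import math
import random

random.seed(0)

# edge weights of the Kernighan-Lin example graph (reconstructed)
w = {frozenset(e): weight for e, weight in [
    (("v1", "v2"), 3), (("v2", "v3"), 1), (("v3", "v7"), 4),
    (("v4", "v8"), 2), (("v5", "v6"), 2),
    (("v1", "v5"), 1), (("v2", "v6"), 1), (("v3", "v4"), 1)]}

def cut(A, B):
    return sum(w.get(frozenset((a, b)), 0) for a in A for b in B)

def random_move(A, B):
    # neighbor solution: swap one random vertex pair to keep the sizes balanced
    a, b = random.choice(sorted(A)), random.choice(sorted(B))
    return (A - {a}) | {b}, (B - {b}) | {a}

A, B = {"v1", "v2", "v3", "v7"}, {"v4", "v5", "v6", "v8"}
cost = best = cut(A, B)
temp = 1.0                               # temp_start
while temp > 0.01:                       # Frozen()
    for _ in range(20):                  # Equilibrium(): fixed number of moves
        A2, B2 = random_move(A, B)
        delta = cut(A2, B2) - cost
        # Accept(): improvements always, increases with probability e^(-delta/temp)
        if delta <= 0 or math.exp(-delta / temp) > random.random():
            A, B, cost = A2, B2, cost + delta
            best = min(best, cost)
    temp *= 0.9                          # DecreaseTemp()

print(best)  # best cut found; the optimum for this graph is 1
```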
Partitioning Methods – Recap
• exact methods
  – enumeration of solutions
  – integer linear programs (ILP)
• heuristic methods
  – constructive methods
    § random mapping
    § hierarchical clustering
  – iterative methods (refinement methods)
    § greedy partitioners
    § Kernighan-Lin
    § simulated annealing
    § evolutionary algorithms (design space exploration)
57
Changes
• 2017-07-10 (v1.6.1)
  – added extensive example on Kernighan-Lin
• 2017-06-25 (v1.6.0)
  – updated for SS2017
• 2016-07-05 (v1.5.2)
  – fixed typo on equation on slide 43
• 2016-06-21 (v1.5.1)
  – fixed many equations that were broken due to a PowerPoint update
• 2016-06-05 (v1.5.0)
  – updated for SS2016
• 2015-07-02 (v1.4.1)
  – corrected Kernighan-Lin pseudocode (slide 41)
• 2015-06-18 (v1.4.0)
  – updated for SS2015
58
Changes
• 2014-06-26 (v1.3.1)
  – minor cosmetic changes
• 2014-06-11 (v1.3.0)
  – updated for SS2014 (added detailed description of the Kernighan-Lin algorithm)
• 2013-06-12 (v1.2.0)
  – updated for SS2013
• 2012-05-12 (v1.1.0)
  – updated for SS2012
• 2011-06-07 (v1.0.0)
  – initial version
• 2011-06-09 (v1.0.1)
  – cosmetic changes
59