Upload
others
View
19
Download
0
Embed Size (px)
Citation preview
1
Shankar BalachandranAssistant Professor, Dept. of CSE
High Level SynthesisHigh Level Synthesis
2
References and Copyright• Textbooks referred (none required)
[Mic94] G. De Micheli“Synthesis and Optimization of Digital Circuits”McGraw-Hill, 1994.
• Slides used:Giovanni de Micheli’s Slides on SynthesisKia Bazargan’s Material on High Level Synthesis[©Gupta] © Rajesh Gupta
UC-Irvinehttp://www.ics.uci.edu/~rgupta/ics280.html
3
High Level Synthesis (HLS)• The process of converting a high-level description
of a design to a netlistInput:
o High-level languages (e.g., C)o Behavioral hardware description languages (e.g., VHDL)o Structural HDLs (e.g., VHDL)o State diagrams / logic networks
Tools:o Parsero Library of modules
Constraints:o Area constraints (e.g., # modules of a certain type)o Delay constraints (e.g., set of operations should finish in λ
clock cycles)Output:
o Operation scheduling (time) and binding (resource)o Control generation and detailed interconnections
4
Architectural-level synthesis motivation• Raise input abstraction level
Reduce specification of detailsExtend designer baseSelf-documenting design specificationsEase modifications and extensions
• Reduce design time• Explore and optimize macroscopic structure:
Series/parallel execution of operations
5
Synthesis
• Transform behavioral into structural view• Architectural-level synthesis:
Architectural abstraction levelDetermine macroscopic structureExample: major building blocks
• Logic-level synthesis:Logic abstraction levelDetermine microscopic structureExample: logic gate interconnection
6
High-Level Synthesis Compilation Flow
Lex
Parse
Compilationfront-end
BehavioralOptimization
Intermediateform
Arch synthLogic synth
Lib Binding HLS backend
x = a + ( b × c )+ d
++
×
a b c d
+
+ ×
a d b c
7
Example
diffeq {read (x; y; u; dx; a);repeat
xl = x+dx;ul = u –(3 . x . u . dx) – (3 . y . dx)yl = y + u . dx ;c = xl < a;X = xl; u = ul; y = yl;
until ( c )write (y);
}
8
Compilation and behavioral optimization
• Software compilation:Compile program into intermediate formOptimize intermediate formGenerate target code for an architecture
• Hardware compilation:Compile programs/HDL into sequencing graphOptimize sequencing graphGenerate gate-level interconnection for a cell library
9
Behavioral-level optimization
• Semantic-preserving transformations aiming at simplifying the model
• Applied to parse-trees or during their generation
• Taxonomy:Data-flow based transformationsControl-flow based transformations
10
Architectural synthesis and optimization• Synthesize macroscopic structure in terms of
building-blocks• Explore area/performance trade-off:
maximize performance of implementations subject to area constraintsminimize area implementations subject to performance constraints
• Determine an optimal implementation• Create logic model for data-path and control
11
Design space and objectives• Design space:
Set of all feasible implementations
• Implementation parameters:AreaPerformance:
o Cycle-timeo Latencyo Throughput (for pipelined implementations)
Power consumption
12
Design evaluation space
Area
Area
Area
Latency
Latency
Latency
Latency Max
Area Max
Cycle-time
13
Dependency Graph (Sequencing Graph)
diffeq {read (x; y; u; dx; a);repeat
xl = x+dx;ul = u –(3 . x . u . dx) – (3 . y . dx)yl = y + u . dx ;c = xl < a;X = xl; u = ul; y = yl;
until ( c )write (y);
}
+××××
× × + <-
-
x dxayu 3
ul vl c xl
14
Hardware modeling
• Circuit behavior:Sequencing graphs
• Building blocks:Resources
• Constraints:Timing and resource usage
15
Resources• Functional resources:
Perform operations on dataExample: arithmetic and logic blocks
• Storage resources:Store dataExample: memory and registers
• Interface resources:Example: busses and ports
16
Resources and circuit families
• Resource-dominated circuits.Area and performance depend on few, well-characterized blocksThe most common in DSP Circuits
• Non resource-dominated circuitsArea and performance are strongly influenced by sparse logic, control and wiringExample: some ASIC circuits
17
Implementation constraints• Timing constraints:
Cycle-timeLatency of a set of operationsTime spacing between operation pairs
• Resource constraints:Resource usage (or allocation)Partial binding
18
Sequence Graph• Remove all nodes
corresponding to constants (with respect to the loop)
• Remove edges from those constants as well
• Remove nodes that are being written in the loop and the corresponding edges
• Add NOP nodes at both ends to represent reads and writes of the loop
• Such a graph is much simpler to operate on
+××××
× × + <-
-
x dxayu 3
ul vl c xlNOP
NOP
19
Synthesis in the temporal domain
• Scheduling:Associate a start-time with each operationDetermine latency and parallelism of the implementationThe schedule is called φ
• Scheduled sequencing graph:Sequencing graph with start-time annotation
20
Synthesis in Temporal Domain
• Schedule:Mapping of operations to time slots (cycles)A scheduled sequencing graph is a labeled graph
+
NOP
××××
× × + <-
-NOP
1
23
4
+
NOP
×
×
××
×
×
+
<-
-
NOP
1
23
4Schedule 1 Schedule 2
21
Operation Types• For each operation, define its type.• For each resource, define a resource type,
and a delay (in terms of # cycles)• T is a relation that maps an operation to a
resource type that can implement itT : V {1, 2, ..., nres}.
• More general case:A resource type may implement more than one operation type (e.g., ALU)
• Resource binding:Map each operation to a resource with the same typeMight have multiple options
22
Synthesis in the spatial domain• Binding:
Associate a resource with each operation with the same typeDetermine the area of the implementation
• Sharing:Bind a resource to more than one operationOperations must not execute concurrently
• Bound sequencing graph:Sequencing graph with resource annotation
23
Schedule in Spatial Domain• Resource sharing
More than one operation bound to same resourceOperations have to be serializedCan be represented using hyperedges (define vertex partition)
+
NOP
××××
× × + <
-
-
NOP
1
2
3
4
24
Binding Should Change with Schedules
+
NOP
×
×
××
×
×
+
<-
-
NOP
1
23
4
+
NOP
××××
× × + <-
-NOP
1
23
4
4 Multipliers2 Adders
2 Multipliers2 Adders
25
Scheduling and Binding
Resource dominated
Control dominated
• Resource constraints:Number of resource instances of each type{ak : k=1, 2, ..., nres}.
• Scheduling:Labeled vertices φ (v3)=1.
• Binding:Hyperedges (or vertex partitions) β (v2)=adder1.
• Cost:Number of resources ≈ areaRegisters, steering logic (Muxes, busses), wiring, control unit
• Delay:Start time of the “sink” nodeMight be affected by steering logic and schedule (control logic) – resource-dominated vs. ctrl-dominated
26
Architectural Optimization• Optimization in view of design space flexibility• A multi-criteria optimization problem:
Determine schedule φ and binding β.Under area Α, latency λ and cycle time τ objectives
• Find non-dominated points in solution space• Solution space tradeoff curves:
Non-linear, discontinuousArea / latency / cycle time (more?)
27
Area/latency trade-off
1 2 3 4 5 6 7 8
5
10
15
78
1213
(3,2)
(2,1)
(3,1)
Area
Latency
20
1817
30
40
(2,2)(2,1)
(1,2)
(1,1)
Cycle-time
X
28
Scheduling and Binding• Cost λ and Α determined by both φ and β.
Also affected by floorplan and detailed routing
• β affected by φ:Resources cannot be shared among concurrent ops
• φ affected by β:Resources cannot be shared among concurrent opsWhen register and steering logic delays added to execution delays, might violate cycle time.
• Order?Apply either one (scheduling, binding) first
29
How Is the Datapath Implemented?• In the following schedule and binding, every
operation has two inputs. If an input is not shown explicitly, it comes from a unique register
+
×
×
××
×
×
+
<
-
-
1
2
3
4
30
Operation Scheduling• Input:
Sequencing graph G(V, E), with n verticesCycle time τ.Operation delays D = {di: i=0..n}.
• Output:Schedule φ determines start time ti of operation vi.Latency λ = tn – t0.
• Goal: determine area / latency tradeoff• Classes:
Non-hierarchical and unconstrainedLatency constrainedResource constrainedHierarchical
31
Min Latency Unconstrained Scheduling• Simplest case: no constraints, find min latency• Given set of vertices V, delays D and a partial
order > on operations E, find an integer labeling of operations φ: V Z+ Such that:
ti = φ(vi).ti ≥ tj + dj ∀ (vj, vi) ∈ E.λ = tn – t0 is minimum.
• Solvable in polynomial time• Bounds on latency for resource constrained
problems• ASAP algorithm used: topological order
32
ASAP SchedulesSchedule v0 at t0=0.While (vn not scheduled)
o Select vi with all scheduled predecessorso Schedule vi at ti = max {tj+dj}, vj being a predecessor of vi.
Return tn.
+
NOP
××××
× × + <-
-NOP
1
23
4
33
ALAP SchedulesSchedule vn at tn=λ.While (v0 not scheduled)
o Select vi with all scheduled successorso Schedule vi at ti = min {tj-dj}, vj being a succecessor of vi.
+
NOP
×
×××
×
×
+ <-
-NOP
1
23
4
34
Remarks• ALAP solves a latency-constrained problem• Latency bound can be set to latency computed by
ASAP algorithm• Mobility:
Defined for each operationDifference between ALAP and ASAP schedule
• Slack on the start time
35
Example
• Operations with zero mobility:{ v1, v2, v3, v4, v5 }Critical path
• Operations with mobility one: { v6, v7 }• Operations with mobility two: { v8, v9, v10, v11 }
* * + <
-
-
* * * * +
NOP
NOP
0
1 2
3
4
5
6
7
8
9
10
11
n
TIME 1
TIME 2
TIME 3
TIME 4
*
*
+ <
-
-
* *
*
* +
NOP
NOP
0
1 2
3
4
5
6
7 8
9
10
11
n
36
Scheduling under Relative Timing constraints• Motivation:
Deadlines and Release times are absoluteAlso makes sense to have relative constraintso Eg: Memory fetch must be done within 6 cycles and takes a
minimum of 2 cycles
• Constraints:Upper/lower bounds on start-time difference of any operation pairA minimum timing constraint lij ≥ 0 for specified i,jpairsA maximum timing constraint uij ≥ 0 for specified i,jpairs
• Feasibility of a solution
37
Constraint graph model• Start from sequencing graph
Model delays as weights on edges
• Add forward edges for minimum constraints:Edge ( vi , vj ) with weight lij → tj ≥ ti + lij
• Add backward edges for maximum constraints:That is, for constraint from vi to vjadd backward edge ( vj , vi ) with weight: -uij
o because tj ≤ ti + uij→ ti ≥ tj - uij
38
Example
NOP
NOP
* *
+ +
0
1 3
2 4
n
NOP
NOP
* *
+ +
0
1 3
2 4
n
MAX TIME
3
MIN TIME
4
-3
4
0 0
222
1 1
6vn
5v4
1v3
3v2
1v1
1v0
Start timeVertex1. Ensure that there are no positive cycles in the graph
2. Find longest paths in the graph between 2 nodes i and j and use as delay separations
39
Resource Constraint Scheduling• Constrained scheduling
General case NP-completeMinimize latency given constraints on area orthe resources (ML-RCS)Minimize resources subject to bound on latency (MR-LCS)
• Exact solution methodsILP: Integer Linear ProgrammingHu’s heuristic algorithm for identical processors
• HeuristicsList schedulingForce-directed scheduling
40
• Use binary decision variablesi = 0, 1, ..., nl = 1, 2, ..., λ’+1 λ’ given upper-bound on latency xil = 1 if operation i starts at step l, 0 otherwise.
• Set of linear inequalities (constraints),and an objective function (min latency)
• Observations
ti = start time of op i.
is op vi (still) executing at step l?
ILP Formulation of ML-RCS
[Mic94] p.198
))(),((
0
iLii
Si
Li
Siil
vALAPtvASAPt
tlandtlforx
==
><=
ill
i xlt ∑= .
⇒=∑+−=
11
l
dlmim
i
x ?
41
Constraints• Operations start only once
Σ xil = 1 i = 1, 2,…, n
• Sequencing relations must be satisfiedti ≥ tj + dj ti - tj - dj ≥ 0 for all (vj, vi) є E
• Resource bounds must be satisfiedSimple case (unit delay)Σ l xil ≤ ak k = 1,2,…nres ; for all l
• Equation 2 can be rewritten as Σ l • xil – Σ l • xjl – dj ≥ 0 for all (vj, vi) є E
i:T(vi)=k
42
Start Time vs. Execution Time• For each operation vi , only one start time• If di=1, then the following questions are the
same:Does operation vi start at step l?Is operation vi running at step l?
• But if di>1, then the two questions should be formulated as:
Does operation vi start at step l?o Does xil = 1 hold?
Is operation vi running at step l?o Does the following hold?
11
=∑+−=
l
dlmim
i
x ?
43
Operation vi Still Running at Step l ?• Is v9 running at step 6?
Is x9,6 + x9,5 + x9,4 = 1 ?
• Note:Only one (if any) of the above three cases can happenTo meet resource constraints, we have to ask the same question for ALL steps, and ALL operations of that type
v9
456
x9,4=1
v9
456
x9,5=1
v9
456
x9,6=1
44
Operation vi Still Running at Step l ?• Is vi running at step l ?
Is xi,l + xi,l-1 + ... + xi,l-di+1 = 1 ?
vi
l
l-1
l-di+1
...
xi,l-di+1=1
vil
l-1
l-di+1
...
xi,l-1=1
vil
l-1
l-di+1
...
xi,l=1
. . .
45
• Constraints:Unique start times:
Sequencing (dependency) relations must be satisfied
Resource constraints
• Objective: min cTt.t =start times vector, c =cost weight (e.g., [0 0 ... 1])When c =[0 0 ... 1], cTt =
ILP Formulation of ML-RCS (cont.)
∑ ==l
il nix ,...,1,0,1
jl
jll
ilijjji dxlxlEvvdtt ∑∑ +≥⇒∈∀+≥ ..),(
1,...,1,,...,1,)(: 1
+==≤∑ ∑= +−=
λlnkax reskkvTi
l
dlmim
i i
nll
xl∑ .
46
ILP Example• Assume λ = 4• First, perform ASAP and ALAP
(we can write the ILP without ASAP and ALAP, but using ASAP and ALAP will simplify the inequalities)
+
NOP
××××
× × + <-
-NOP
1
23
4
+
NOP
×
×××
×
×
+ <-
-NOP
1
23
4
v2v1
v3
v4
v5
vn
v6
v7
v8
v9
v10
v11
v2v1
v3
v4
v5
vn
v6
v7 v8
v9
v10
v11
47
ILP Example: Unique Start Times Constraint• Without using ASAP and
ALAP values:• Using ASAP and ALAP:
1.........
11
4,113,112,111,11
4,23,22,21,2
4,13,12,11,1
=+++
=+++
=+++
xxxx
xxxxxxxx
....11
11
11111
4,93,92,9
3,82,81,8
3,72,7
2,61,6
4,5
3,4
2,3
1,2
1,1
=++
=++
=+
=+
=
=
=
=
=
xxxxxx
xxxx
xxxxx
* * + <
-
-
* * * * +
NOP
NOP
0
1 2
3
4
5
6
7
8
9
10
11
n
48
ILP Example: Dependency Constraints• Using ASAP and ALAP, the non-trivial inequalities
are: (assuming unit delay for + and *)
01.4.3.2.501.4.3.2.501.3.2.401.3.2.4.3.201.3.2.4.3.201.2.3.2
4,113,112,115,
4,93,92,95,
3,72,74,5
3,102,101,104,113,112,11
3,82,81,84,93,92,9
2,61,63,72,7
≥−−−−
≥−−−−
≥−−−
≥−−−−++
≥−−−−++
≥−−−+
xxxxxxxxxxx
xxxxxxxxxxxxxxxx
n
n* * + <
-
-
* * * * +
NOP
NOP
0
1 2
3
4
5
6
7
8
9
10
11
n
49
ILP Example: Resource Constraints• Resource constraints (assuming 2 adders and 2
multipliers)
• Objective:Since λ=4 and sink has no mobility, any feasible solution is optimum, but we can use the following anyway:
2222222
4,114,94,5
3,113,103,93,4
2,112,102,9
1,10
3,83,7
2,82,72,62,3
1,81,61,21,1
≤++
≤+++
≤++
≤
≤+
≤+++
≤+++
xxxxxxxxxxxxxxxxxxxxx
4,3,2,1, .4.3.2 nnnn xxxxMin +++
50
Example
*
*
+
<
-
-
* *
*
*
+
NOP
NOP
0
1 2
3
4
5
6
78
9
10
11
n
TIME 1
TIME 2
TIME 3
TIME 4
Note that the schedule is different from both ALAP and ASAP schedules.
51
ILP Formulation of MR-LCS• Dual problem to ML-RCS• Objective:
Goal is to optimize total resource usage, a.Objective function is cTa , where entries in c are respective area costs of resources
• Constraints:Same as ML-RCS constraints, plus:Latency constraint added:
Note: unknown ak appears in constraints.• Resource usage is unknown in the constraints• Resource usage is the objective to minimize
1. +≤∑ λnll
xl
52
ILP Solution• Use standard ILP packages• Transform into LP problem • Advantages:
Exact methodOthers constraints can be incorporated
• Disadvantages:Works well only up to few thousand variables
53
Hu’s Algorithm• Simple case of the scheduling problem
Operations of unit delayOperations (and resources) of the same type
• Hu’s algorithmGreedyPolynomial AND optimalComputes lower bound on number of resources for a given latencyOR: computes lower bound on latency subject to resource constraints
• Basic idea:Label operations based on their distances from the sinkTry to schedule nodes with higher labels first(i.e., most “critical” operations have priority)
54
Hu’s algorithm• Assumptions:
Graph is a forestAll operations have unit delayAll operations have the same type
• Algorithm:Greedy strategyExact solution
55
Example
• Assumptions:One resource type onlyAll operations have unit delay
• Labels:Distance to sink
3 2 1 1
2
1
4 4 3 2 2
0
1 2
3
4
5
6
7
8
9
10
11
n
56
Hu’s AlgorithmHU (G(V,E), a) {
Label the vertices // label = length of longest pathpassing through the vertex
l = 1repeat {
U = unscheduled vertices in V whosepredecessors have been scheduled(or have no predecessors)
Select S ⊆ U such that |S| ≤ a and labels in Sare maximal
Schedule the S operations at step l by settingti=l, i: vi ∈ S.
l = l + 1} until vn is scheduled.
}
57
3 11
Example
Step 1: Op 1,2,6
Step 2: Op 3,7,8
Step 3: Op 4,9,10
Step 4: Op 5,11
2 1
2
4 4 3 2 2
0
1 2
3
4
5
6
7
8
9
10
11
n
_a = 3
4 4 3 2
23
2
1
2
11
1
58
Hu’s Algorithm: Example (a=3)
59
List Scheduling• Greedy algorithm for ML-RCS and MR-LCS
Does NOT guarantee optimum solution
• Similar to Hu’s algorithmOperation selection decided by criticalityO(n) time complexity
• More general inputResource constraints on different resource types
60
List Scheduling Algorithm: ML-RCS
LIST_L (G(V,E), a) {l = 1repeat {
for each resource type k {Ul,k = available vertices in V.Tl,k = operations in progress.
Select Sk ⊆ Ul,k such that |Sk| + |Tl,k| ≤ akSchedule the Sk operations at step l
}l = l + 1
} until vn is scheduled.}
61
Example
* *
+
<
-
-
* * *
*
+
NOP
NOP
0
1 2
3
4
5
6
7 8
9
10
11
n
TIME 1
TIME 2
TIME 3
TIME 4
TIME 5
TIME 6
TIME 7
Resource bounds:3 multipliers with delay 2
1 ALU with delay 1
* * + <
-
-
* * * * +
NOP
NOP
0
1 2
3
4
5
6
7
8
9
10
11
n
62
List Scheduling Algorithm: MR-LCSLIST_R (G(V,E), λ’) {
a = 1, l = 1Compute the ALAP times tL.if t0
L < 0return (not feasible)
repeat {for each resource type k {
Ul,k = available vertices in V.Compute the slacks { si = ti
L - l, ∀ vi∈ Ul,k }.Schedule operations with zero slack, update aSchedule additional Sk ⊆ Ul,k under a constraints
}l = l + 1
} until vn is scheduled.}
63
Example
TIME 1
TIME 2
TIME 3
TIME 4
*
*
+
<
-
-
* *
*
*
+
NOP
NOP
0
1 2
3
4
5
6
7 8
9
10
11
n
* * + <
-
-
* * * * +
NOP
NOP
0
1 2
3
4
5
6
7
8
9
10
11
n
AssumptionsUnit-delay resourcesMaximum latency = 4
Start with :a1 = 1 multipliera2 = 1 ALUs
Step 1Two multiplications on CPSet a1 = 2 Schedule Mult 1,2 Schedule ALU 10
Step 2Schedule Mult 3, 6Schedule ALU 11
Step 3Schedule Mult 7,8Schedule ALU 4
Step 4Set a2=2Schedule ALU 5, 9
64
Summary• Scheduling algorithms are used by tools
Compilers use them when you write code for DSP processorsTools like Xilinx ISE, Synopsys DC etc. use them when you compiler HDL models
• Good understanding of “under the hood”operations of tools is useful
• The constraint solving techniques can be used directly for your custom designs
Eg: In DSP software, if you know the resources, you write assembly code to minimize latency
65
Force-Directed Scheduling• Similar to list scheduling
Can handle ML-RCS and MR-LCSFor ML-RCS, schedules step-by-stepBUT, selection of the operations tries to find the globally best set of operations
• Idea:Find the mobility μi = ti
L – tiS of operations
Look at the operation type probability distributionsTry to flatten the operation type distributions
• Definition: operation probability densitypi ( l ) = Pr { vi starts at step l }.Assume uniform distribution:
],[1
1)( Li
Si
ii ttlforlp ∈
+=
μ
66
Force-Directed Scheduling: Definitions• Operation-type distribution (NOT normalized to 1)
• Operation probabilities over control steps:
• Distribution graph of type k over all steps:
qk ( l ) can be thought of as expected operator costfor implementing operations of type k at step l.
∑=
=kvTi
iki
lplq)(:
)()(
)}(,),1(),0({ npppp iiii Κ=
)}(,),1(),0({ nqqq kkk Κ
67
Example
+
NOP
××××
× × + <-
-NOP
1
23
4
0)4(
83.031
21)3(
33.231
21
211)2(
83.231
2111)1(
=
=+=
=+++=
=+++=
mult
mult
mult
mult
q
q
q
q
2.83
2.33
.83
66.131
311)4(
231
31
311)3(
131
31
31)2(
33.031)1(
=++=
=+++=
=++=
==
add
add
add
add
q
q
q
q
0
1
2
1.66
0.33
68
Force-Directed Scheduling Algorithm: Idea• Very similar to LIST_L(G(V,E), a)
Compute mobility of operations using ASAP and ALAPComputer operation probabilities and type distributionsSelect and schedule operationsUpdate operation probabilities and type distributionsGo to next control step
• Difference with list sched in selecting operationsSelect operations with least forceConsider the effect on the type distributionConsider the effect on successor nodes and their type distributions