2014-02-13
TDTS 01 Lecture 7
High-Level Synthesis II
Zebo Peng
Embedded Systems Laboratory
IDA, Linköping University
Zebo Peng, IDA, LiTH. TDTS01 Lecture Notes – Lecture 7
Lecture 7
Allocation and binding
Control unit synthesis
Advanced HLS issues
Allocation and Binding
Allocation (unit selection): determine the type and number of hardware resources required, including
• Functional units
• Storage elements
• Buses
Binding: assign to resource instances
• Operations to functional unit instances
• Values to be stored to instances of storage elements
• Data transfers to bus instances
Allocation and binding generate the datapath of the design.
Allocation and Binding Principle
Resource sharing: Allow multiple non-concurrent operations to share the same hardware as much as possible.
Optimization goal:
Minimize total cost of functional units, registers, bus drivers, and multiplexers.
Minimize total interconnection length (placement info needed).
Constraint on critical path delay.
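[Figure: a data-flow graph with four additions (o1 to o4, operands a to h) scheduled in control steps s1 and s2, and the resulting datapath in which one adder executes +1 and +3 and a second adder executes +2 and +4.]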
Allocation/Binding: Approach 1
Constructive: start with an empty datapath and add functional, storage and interconnection components as needed.
Greedy algorithms: perform allocation/binding for one control step at a time.
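Rule-based: used to select the type and number of functional units, especially prior to scheduling.
[Figure: a data-flow graph with additions a1 to a4 and multiplications m1 and m2 scheduled in control steps 1 to 3, and a datapath with a register, one multiplier executing m1 and m2, and two adders executing a1, a3 and a2, a4.]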
Allocation/Binding: Approach 2
Graph-theoretical formulations: sub-tasks are mapped into well-defined problems in graph theory.
Clique partitioning.
Left-edge algorithm.
Graph coloring.
Clique Partitioning
G = (V, E) is an undirected graph with a set V of vertices and a set E of edges.
A clique is a set of vertices that form a complete subgraph of G.
The Clique Partitioning Problem: partition G into a minimal number of cliques such that each vertex belongs to exactly one clique.
[Figure: a clique partitioning example on the vertices a1 to a4, which are split into two cliques.]
Allocation as Clique Partitioning
Functional unit allocation:
Each vertex represents an operation.
An edge connects two vertices iff:
The two operations are scheduled into different control steps, and
There exists a functional unit that is capable of carrying out both operations.
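As a minimal sketch of the edge rule above, the following Python builds the compatibility graph for functional unit allocation. The schedule, the operation names and the `capable_units` table are hypothetical and are not read from the figure; every operation is assumed to take exactly one control step.

```python
from itertools import combinations

def compatibility_edges(schedule, capable_units):
    """Return the compatibility graph edges as a set of (op, op) pairs."""
    edges = set()
    for a, b in combinations(sorted(schedule), 2):
        different_steps = schedule[a] != schedule[b]
        common_units = capable_units[a] & capable_units[b]
        # An edge needs both conditions: no time conflict and a shared unit type.
        if different_steps and common_units:
            edges.add((a, b))
    return edges

# Hypothetical schedule: operation -> control step.
schedule = {"a1": 1, "a2": 1, "m1": 1, "a3": 2, "m2": 2, "a4": 3}
# Hypothetical capability table: additions run on adders,
# multiplications on multipliers.
capable_units = {
    op: {"adder"} if op.startswith("a") else {"multiplier"}
    for op in schedule
}

edges = compatibility_edges(schedule, capable_units)
```

Each clique of the resulting graph is a set of operations that a single functional unit instance can execute.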
[Figure: the compatibility graph for the operations a1 to a4, m1 and m2, next to the data-flow graph scheduled in control steps 1 to 3.]
Storage Allocation as Clique Partitioning
Storage allocation as a clique partitioning problem:
Each value that needs to be stored is mapped to a vertex.
Two vertices are connected iff the life-times of the two values do not intersect.
The clique partitioning problem is NP-complete.
Efficient heuristics must be developed.
Example: Tseng developed a polynomial-time algorithm, based on step-wise grouping, which generates very good results.
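The edge rule for storage allocation can be sketched in the same way. The life-times below are hypothetical, and each one is assumed to be a half-open interval [birth, death), so a value that dies in the step where another is born may share its register.

```python
from itertools import combinations

def storage_compatibility_edges(lifetimes):
    """lifetimes maps a value name to (birth, death); returns compatible pairs."""
    edges = set()
    for u, v in combinations(sorted(lifetimes), 2):
        birth_u, death_u = lifetimes[u]
        birth_v, death_v = lifetimes[v]
        # The life-times do not intersect if one value dies
        # no later than the other is born.
        if death_u <= birth_v or death_v <= birth_u:
            edges.add((u, v))
    return edges

# Hypothetical life-times, in control steps.
lifetimes = {"a": (0, 2), "b": (0, 3), "c": (2, 4), "d": (3, 5)}
storage_edges = storage_compatibility_edges(lifetimes)
```

Every clique in this graph is a set of values that can be kept in one register.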
Tseng’s Algorithm
A super-graph is derived from the original graph.
Find two connected super-nodes such that they have the maximum number of common neighbors.
Merge the two nodes and repeat from the first step until no more mergers can be carried out.

Example graph with the vertices V1 to V5:

Edge       Common neighbors
(V1,V3)    1
(V1,V4)    1
(V2,V3)    0
(V2,V5)    0
(V3,V4)    1
(V4,V5)    0
Tseng’s Algorithm (Cont’d)
After V1 and V3 are merged into the super-node S1-3:

Edge        Common neighbors
(S1-3,V4)   0
(V2,V5)     0
(V4,V5)     0

After S1-3 and V4 are merged into the super-node S1-3-4:

Edge        Common neighbors
(V2,V5)     0
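The merging procedure can be sketched in Python as follows. This is a simplified reading of the heuristic, not Tseng's full algorithm: ties between equally good pairs are broken by sorted order, which is an assumption made here, and a merged super-node keeps only the common neighbors of its two parts. The function name and the tuple representation of super-nodes are illustrative.

```python
def tseng_clique_partition(vertices, edges):
    """Greedy clique partitioning by repeatedly merging super-nodes."""
    # Each super-node is a tuple of original vertices; adj holds its neighbors.
    adj = {(v,): set() for v in vertices}
    for u, v in edges:
        adj[(u,)].add((v,))
        adj[(v,)].add((u,))

    while True:
        # Find the connected pair with the most common neighbors.
        best = None
        for u in sorted(adj):
            for v in sorted(adj[u]):
                if u < v:
                    common = len(adj[u] & adj[v])
                    if best is None or common > best[0]:
                        best = (common, u, v)
        if best is None:
            break  # no edge is left, so nothing more can be merged
        _, u, v = best
        merged = tuple(sorted(u + v))
        # Only the common neighbors stay connected to the merged super-node.
        new_neighbors = adj[u] & adj[v]
        for n in adj[u] | adj[v]:
            adj[n].discard(u)
            adj[n].discard(v)
        del adj[u], adj[v]
        adj[merged] = set(new_neighbors)
        for n in new_neighbors:
            adj[n].add(merged)
    return sorted(adj)

vertices = ["V1", "V2", "V3", "V4", "V5"]
edges = [("V1", "V3"), ("V1", "V4"), ("V2", "V3"),
         ("V2", "V5"), ("V3", "V4"), ("V4", "V5")]
cliques = tseng_clique_partition(vertices, edges)
# cliques == [("V1", "V3", "V4"), ("V2", "V5")]
```

On the example graph the sketch merges V1 with V3, then S1-3 with V4, and finally V2 with V5, giving two cliques.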
Left-Edge (LE) Algorithm
Used in channel routing to minimize the number of tracks used to connect points (layout design).
To minimize the number of needed tracks.
To reduce wire lengths.
To avoid wire crossings.
LE Algorithm for Register Allocation
Map the birth time of a value to the left (top) edge of a wire, and its death time to the right (bottom) edge.
[Figure: a scheduled data-flow graph with operations 1 to 10, inputs i1 to i5 and outputs o1 to o3, together with the life-time chart of the values i1 to i5, a to g and o1 to o3.]
The Left-Edge Algorithm
1. The values are sorted in increasing order of their birth times.
2. The first value is assigned to the first register.
3. The list is then scanned for the next value whose birth time is equal to or larger than the death time of the previous value.
4. This value is assigned to the current register.
5. The list is scanned until no more values can share the same register.
6. A new register is then introduced to hold the next value in the sorted list, and the algorithm iterates from step 3.
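The six steps above can be sketched in Python as follows. The life-times in the example are hypothetical and are not taken from the figure; values with equal birth times are ordered by name, which is an assumption.

```python
def left_edge(lifetimes):
    """lifetimes maps a value name to (birth, death).

    Returns a list of registers, each a list of the values it holds.
    """
    # Step 1: sort by birth time (ties broken by name).
    pending = sorted(lifetimes, key=lambda v: (lifetimes[v][0], v))
    registers = []
    while pending:
        current, remaining = [], []
        last_death = None
        # Steps 2 to 5: scan the list once and pack the current register.
        for value in pending:
            birth, death = lifetimes[value]
            if last_death is None or birth >= last_death:
                current.append(value)
                last_death = death
            else:
                remaining.append(value)
        # Step 6: open a new register for the values that did not fit.
        registers.append(current)
        pending = remaining
    return registers

# Hypothetical life-times, in control steps.
lifetimes = {"a": (0, 2), "b": (0, 3), "c": (2, 4), "d": (3, 5), "e": (1, 4)}
registers = left_edge(lifetimes)
# registers == [["a", "c"], ["b", "d"], ["e"]]
```

Three registers are needed here because a, b and e are all alive between steps 1 and 2.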
LE Algorithm Example
[Figure: three panels titled "Original life-times", "Sorted list based on birth times" and "Allocated registers"; the values i1 to i5, a to g and o1 to o3 are packed into the five registers R1 to R5.]
LE Algorithm Discussions
The algorithm is guaranteed to allocate the minimum number of registers.
However, it has two disadvantages:
Not all life-time tables can be interpreted as intersecting intervals on a line.
• Loops
• Conditional branches
The assignment is neither unique nor necessarily optimal in terms of, for example, the number of multiplexers.
Allocation/Binding: Approach 3
Transformational allocation: starting from an initial allocation and binding, a final design is obtained by successive transformations.
It usually starts with a maximal allocation (each operation has its dedicated physical unit).
The design is then improved by merging physical units, step by step, so that hardware resources are shared as much as possible.
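[Figure: an adder used in control step Si and an adder used in control step Sj are merged into one adder used in both Si and Sj.]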
Control-Unit Synthesis
Two basic approaches are widely used:
Microcode.
Hard-wired.
The basic assumptions:
A synchronous controller is used.
A schedule is given with the set of activation signals
• for enabling, multiplexer input selection, bus control, etc.
The controller is modeled as a finite-state machine.
Microcoded Control Synthesis
To store the control information in an organized fashion.
A microcode ROM of size λ is used, where λ is the number of schedule steps.
The ROM must have ⌈log2 λ⌉ address bits (note: ⌈x⌉ denotes the ceiling function).
A synchronous counter with a reset signal is used to address the ROM.
The counter is controlled by the system clock.
The ROM contents can be implemented as horizontal or vertical microcode.
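As a small worked example of the address-width rule, the helper below computes ⌈log2 λ⌉ with integer arithmetic only. The function name is illustrative.

```python
def rom_address_bits(num_steps):
    """Number of address bits for a ROM with one word per schedule step."""
    if num_steps < 1:
        raise ValueError("a schedule needs at least one step")
    # (n - 1).bit_length() equals ceil(log2(n)) for every n >= 1.
    return (num_steps - 1).bit_length()

bits_for_4_steps = rom_address_bits(4)   # 2 address bits: 00, 01, 10, 11
bits_for_5_steps = rom_address_bits(5)   # 3 address bits
```

A schedule with four steps therefore needs a 2-bit counter, and a fifth step already costs one more address bit.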
Horizontal Microcode
Each activation signal is associated with one bit of the word in the microcode.

Address   Microword (one bit per activation signal)
00        1 1 0 0 0 1 0 1 0 1 0
01        0 0 1 0 0 0 1 0 1 0 1
10        0 0 0 1 0 0 0 0 0 0 0
11        0 0 0 0 1 0 0 0 0 0 0

[Figure: a counter with Reset and Clock inputs addresses the ROM, whose output bits are the activation signals.]
The word length is usually much larger than λ, so the ROM is wider than it is high.
Each bit is connected directly to an activation signal: high performance.
There are many zeros: wasted storage resource.
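A minimal sketch of how a horizontal microcode ROM is read, using the four 11-bit microwords of the example. The counter is modeled as a plain loop over the addresses, and the function names are illustrative.

```python
# The four microwords of the example, one per schedule step.
ROM = [
    "11000101010",  # address 00
    "00100010101",  # address 01
    "00010000000",  # address 10
    "00001000000",  # address 11
]

def active_signals(word):
    """Bit i of the word (counted from 1, left to right) drives signal i."""
    return [i + 1 for i, bit in enumerate(word) if bit == "1"]

def run_controller(rom):
    """Step a counter through all addresses and collect the active signals."""
    return [active_signals(rom[address]) for address in range(len(rom))]

signals_per_step = run_controller(ROM)
# signals_per_step == [[1, 2, 6, 8, 10], [3, 7, 9, 11], [4], [5]]

zero_bits = sum(word.count("0") for word in ROM)
total_bits = sum(len(word) for word in ROM)
```

No decoding is needed, but 33 of the 44 stored bits are zeros, which illustrates the wasted storage.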
Vertical Microcode
A fully vertical microcode encodes the n activation signals with ⌈log2 n⌉ bits to reduce the width of the ROM.
Several words may be needed for a schedule step.

Example (n = 11, so each code word has 4 bits). The horizontal microwords 11000101010, 00100010101, 00010000000 and 00001000000 become the following code words, which a decoder translates into activation signals:

Step   Active signals    Vertical code words
1      1, 2, 6, 8, 10    0001 0010 0110 1000 1010
2      3, 7, 9, 11       0011 0111 1001 1011
3      4                 0100
4      5                 0101
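The translation from horizontal microwords to fully vertical code words can be sketched as follows. As in the example, signal i is assumed to be encoded as the binary number i, with code 0 left unused; the width is then ⌈log2 (n + 1)⌉ bits, which equals ⌈log2 n⌉ for n = 11. The function name is illustrative.

```python
def vertical_encode(microwords):
    """Turn each horizontal microword into a list of binary signal codes."""
    n = len(microwords[0])
    width = n.bit_length()  # enough bits for the codes 1..n
    steps = []
    for word in microwords:
        codes = [
            format(i + 1, f"0{width}b")
            for i, bit in enumerate(word)
            if bit == "1"
        ]
        steps.append(codes)
    return steps

horizontal = ["11000101010", "00100010101", "00010000000", "00001000000"]
vertical = vertical_encode(horizontal)
# vertical[0] == ["0001", "0010", "0110", "1000", "1010"]

rom_words = sum(len(codes) for codes in vertical)
```

The ROM shrinks from 4 words of 11 bits to 11 words of 4 bits, but the first schedule step now needs five words.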
Vertical Microcode Issues
A decoder is needed, which can be implemented by another ROM to form a two-stage control store.
Operation concurrency may not be fully supported.
Reserve code-words for concurrent operations.
• e.g., using “1100” to denote activation of the first group of activation signals.
Vertical control schemes can be implemented by:
Lengthening the schedule, or
Reading multiple ROM words in each step.
Both have, however, some disadvantages.
[Figure: the vertical code words grouped by schedule step (0001 0010 0110 1000 1010; 0011 0111 1001 1011; 0100; 0101) feed a decoder that drives the activation signals.]
Microcode Optimization
To find the shortest encoding of the words such that full concurrency is preserved: the microcode compaction (MC) problem, which is intractable.
MC can be approached by partitioning the operations into groups such that only one operation is active in each group, so that vertical encoding can be used within each group.

Example: the 11 activation signals are reordered as 1 3 4 | 2 | 6 7 5 | 8 9 | 10 11 and form the five groups A to E. Each group is encoded separately (all zeros means that no signal of the group is active), which gives 9-bit words instead of 11-bit words:

Step   A (1,3,4)   B (2)   C (6,7,5)   D (8,9)   E (10,11)
1      01          1       01          01        01
2      10          0       10          10        10
3      11          0       00          00        00
4      00          0       11          00        00

[Figure: the original microwords, the microwords with reordered columns, and the grouped encoding with the decoders D1 to D4.]
Microcode Compaction
To minimize the number of groups:
Construct a conflict graph, where the vertices correspond to the operations and the edges represent concurrency.
A minimum coloring of this graph gives the minimum number of groups needed.
Note: this does not necessarily lead to the minimum number of word bits (e.g., 10 can be divided as 5+5, or 7+3).
[Figure: a conflict graph with the vertices 1 to 6, shown before and after coloring.]
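The grouping step can be sketched with a conflict graph and a greedy coloring. Greedy coloring is a swapped-in heuristic here: it does not guarantee a minimum coloring in general. Each group is assumed to reserve one code for "no signal active", so a group of k signals needs ⌈log2 (k + 1)⌉ bits. All function names are illustrative.

```python
def conflict_graph(microwords):
    """Two signals conflict if they are active in the same microword."""
    n = len(microwords[0])
    conflicts = {s: set() for s in range(1, n + 1)}
    for word in microwords:
        active = [i + 1 for i, bit in enumerate(word) if bit == "1"]
        for a in active:
            conflicts[a].update(x for x in active if x != a)
    return conflicts

def greedy_groups(conflicts):
    """Give each signal the lowest color not used by its conflicting signals."""
    color = {}
    for s in sorted(conflicts):
        used = {color[t] for t in conflicts[s] if t in color}
        c = 0
        while c in used:
            c += 1
        color[s] = c
    groups = {}
    for s, c in color.items():
        groups.setdefault(c, []).append(s)
    return [groups[c] for c in sorted(groups)]

def word_bits(groups):
    """Bits per word when each group also encodes 'nothing active'."""
    return sum(len(g).bit_length() for g in groups)

words = ["11000101010", "00100010101", "00010000000", "00001000000"]
groups = greedy_groups(conflict_graph(words))
greedy_bits = word_bits(groups)
slide_bits = word_bits([[1, 3, 4], [2], [6, 7, 5], [8, 9], [10, 11]])
```

Both partitions use five groups, yet the greedy one needs 10 bits per word while the grouping 1 3 4 | 2 | 6 7 5 | 8 9 | 10 11 needs 9. In the same way, ten signals split as 5+5 need 3 + 3 = 6 bits, while 7+3 needs 3 + 2 = 5 bits.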
Hard-Wired Control Synthesis
Generate a Moore-type finite-state machine from a schedule.
Synthesize the FSM model.
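Each schedule step (one row of the microword table) becomes one state whose outputs are the signals active in that step:

State   Active activation signals
S1      1, 2, 6, 8, 10
S2      3, 7, 9, 11
S3      4
S4      5

[Figure: state diagram of the FSM with the states S1 to S4 and a Reset input.]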
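A minimal simulation of such a Moore machine, in which the outputs depend only on the current state. The transition from S4 back to S1 and the rule that Reset forces state S1 are assumptions made for this sketch; the slides only show the states, their outputs and the Reset input.

```python
# Outputs per state, taken from the schedule.
OUTPUTS = {
    "S1": [1, 2, 6, 8, 10],
    "S2": [3, 7, 9, 11],
    "S3": [4],
    "S4": [5],
}
# Assumed transitions: the states follow the schedule and then wrap around.
NEXT_STATE = {"S1": "S2", "S2": "S3", "S3": "S4", "S4": "S1"}

def simulate(num_cycles, reset_cycles=()):
    """Return (state, active signals) for each clock cycle."""
    state = "S1"
    trace = []
    for cycle in range(num_cycles):
        if cycle in reset_cycles:
            state = "S1"  # assumed: Reset forces the initial state
        trace.append((state, OUTPUTS[state]))
        state = NEXT_STATE[state]
    return trace

trace = simulate(5)
states = [state for state, _ in trace]
# states == ["S1", "S2", "S3", "S4", "S1"]
```

In a hard-wired controller the two tables become next-state logic and output logic instead of ROM contents.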
Advanced Issues of HLS
Many-to-many mapping between operations and physical components.
Re-use of previous designs (partial structure).
Synthesis with commercially available sub-systems, IP-based synthesis.
HLS with testability consideration.
[Figure: many-to-many mapping between the operations +, x and - and the components Adder, Mult, ALU and Subs, with a note on bit-width compatibility.]
Summary
High-level synthesis is one of the most important design steps in the design process of electronic systems.
The use of efficient HLS tools has led to great improvements in design productivity.
The two most important tasks are scheduling and allocation/binding, which are interdependent.
Controller design is also an important task, and its interaction with datapath design should be considered.
The HLS tasks are usually formulated as optimization problems and heuristic algorithms are used.