High-Level Synthesis II - ida.liu.seTDTS01/lectures/14/lec7.pdf

2014-02-13


TDTS01 Lecture 7

High-Level Synthesis II

Zebo Peng

Embedded Systems Laboratory

IDA, Linköping University

Lecture 7

Advanced HLS issues

Control unit synthesis

Allocation and binding



Allocation and Binding

Allocation (unit selection): to determine the type and number of hardware resources required, including:

Functional units

Storage elements

Buses

Binding: assignment to resource instances:

Operations to functional unit instances

Values to be stored to instances of storage elements

Data transfers to bus instances

Allocation and binding generate the datapath of the design.


Allocation and Binding Principle

Resource sharing: Allow multiple non-concurrent operations to share the same hardware as much as possible.

Optimization goal:

Minimize total cost of functional units, registers, bus drivers, and multiplexers.

Minimize total interconnection length (placement info needed).

Constraint on critical path delay.

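[Figure: resource-sharing example. Four additions o1 to o4, scheduled in control steps s1 and s2 with operands a to h, are bound to two shared adders: one executes +1 and +3 (inputs a and b,e,g), the other executes +2 and +4 (inputs c,f,h and d).]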



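Allocation/Binding: Approach 1

Constructive: start with an empty datapath and add functional, storage and interconnection components as needed.

Greedy algorithms: perform allocation/binding for one control step at a time.

Rule-based: used to select the type and number of functional units, especially prior to scheduling.

[Figure: a scheduled data-flow graph over control steps 1 to 3 with additions a1 to a4 and multiplications m1 and m2, and the resulting datapath in which a1 and a3 share one adder, a2 and a4 share a second adder, and m1 and m2 share a multiplier, together with a register (Reg).]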
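The greedy, step-by-step idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `greedy_bind`, the schedule format and the example schedule are hypothetical and are not taken from the slides, and interconnect and storage are ignored.

```python
from collections import defaultdict

def greedy_bind(schedule):
    """Bind operations to functional units one control step at a time.

    schedule maps an operation name to (control_step, unit_type).
    Returns (binding, allocated), where binding maps each operation to
    (unit_type, instance_index) and allocated counts instances per type.
    """
    by_step = defaultdict(list)
    for op, (step, kind) in schedule.items():
        by_step[step].append((op, kind))

    allocated = defaultdict(int)   # unit_type -> number of instances so far
    binding = {}
    for step in sorted(by_step):
        used = defaultdict(int)    # instances of each type busy in this step
        for op, kind in sorted(by_step[step]):
            index = used[kind]     # reuse the lowest-numbered free instance
            used[kind] += 1
            # Grow the datapath only when no existing instance is free.
            allocated[kind] = max(allocated[kind], used[kind])
            binding[op] = (kind, index)
    return binding, dict(allocated)

# Hypothetical schedule (an assumption, not the one drawn on the slide).
schedule = {
    "a1": (1, "+"), "a2": (1, "+"), "m1": (1, "*"),
    "a3": (2, "+"), "m2": (2, "*"),
    "a4": (3, "+"),
}
binding, allocated = greedy_bind(schedule)
```

With this schedule the sketch allocates two adders and one multiplier. Because it only looks at one control step at a time, it makes no attempt to reduce multiplexer or interconnect cost.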


Allocation/Binding: Approach 2

Graph-theoretical formulations: sub-tasks are mapped into well-defined problems in graph theory.

Clique partitioning.

Left-edge algorithm.

Graph coloring.



Clique Partitioning

G = (V, E), an undirected graph with a set V of vertices and a set E of edges.

A clique is a set of vertices that form a complete subgraph of G.

The Clique Partitioning Problem: to partition G into a minimal number of cliques such that each vertex belongs to exactly one clique.

[Figure: a clique partitioning example on the vertices a1, a2, a3 and a4, which are partitioned into two cliques.]


Allocation as Clique Partitioning

Functional unit allocation:

Each vertex represents an operation.

An edge connects two vertices iff:

The two operations are scheduled into different control steps, and

There exists a functional unit that is capable of carrying out both operations.

[Figure: the graph with vertices a1 to a4, m1 and m2, shown next to the scheduled data-flow graph (control steps 1 to 3) from which it is derived.]
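The two edge conditions above translate directly into code. The sketch below builds such a graph (often called a compatibility graph); the function name, the data layout and the example schedule are assumptions made for illustration, not values read from the slide.

```python
from itertools import combinations

def compatibility_graph(schedule, capable):
    """Return the edge set of the graph used for functional unit allocation.

    schedule: {operation: control_step}
    capable:  {operation: set of unit types able to execute it}
    """
    edges = set()
    for u, v in combinations(sorted(schedule), 2):
        different_steps = schedule[u] != schedule[v]
        shared_unit = bool(capable[u] & capable[v])
        # An edge exists iff both conditions hold.
        if different_steps and shared_unit:
            edges.add((u, v))
    return edges

# Hypothetical schedule and unit capabilities (assumptions).
schedule = {"a1": 1, "a2": 1, "m1": 1, "a3": 2, "m2": 2, "a4": 3}
capable = {op: {"adder"} if op.startswith("a") else {"multiplier"}
           for op in schedule}
edges = compatibility_graph(schedule, capable)
```

Every clique in the resulting graph is a set of operations that can share a single functional unit, so a partition into few cliques corresponds to an allocation with few units.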



Storage Allocation as Clique Partitioning

Storage allocation as a clique partitioning problem:

Each value that needs to be stored is mapped to a vertex.

Two vertices are connected iff the life-times of the two values do not intersect.

The clique partitioning problem is NP-complete.

Efficient heuristics must be developed.

For example, Tseng developed a polynomial-time algorithm, based on step-wise grouping, which generates very good results.


Tseng’s Algorithm

A super-graph is derived from the original graph.

Find two connected super-nodes such that they have the maximum number of common neighbors.

Merge the two nodes and repeat the search until no more mergers can be carried out.

Edge       Common neighbors
(V1,V3)    1
(V1,V4)    1
(V2,V3)    0
(V2,V5)    0
(V3,V4)    1
(V4,V5)    0

[Figure: the example graph with vertices V1 to V5 and the edges listed in the table.]



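Tseng’s Algorithm (Cont’d)

After merging V1 and V3 into the super-node S1-3:

Edge        Common neighbors
(S1-3,V4)   0
(V2,V5)     0
(V4,V5)     0

After merging S1-3 and V4 into the super-node S1-3-4:

Edge        Common neighbors
(V2,V5)     0

[Figure: the graph on V1 to V5 after each merger, with the super-nodes S1-3 and S1-3-4 marked.]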
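The merging steps can be sketched in Python as follows. This is a minimal illustration of the procedure described on the slides, not Tseng's full algorithm: the tie-breaking rule (smallest vertex names first) is an assumption, since the slides do not say how ties are resolved.

```python
def tseng_clique_partition(vertices, edges):
    """Greedy clique partitioning by merging connected super-nodes."""
    # Each super-node is a frozenset of original vertices.
    adj = {frozenset([v]): set() for v in vertices}
    for u, v in edges:
        adj[frozenset([u])].add(frozenset([v]))
        adj[frozenset([v])].add(frozenset([u]))

    while True:
        # Candidate pairs are connected super-nodes; pick the pair with
        # the most common neighbors (ties broken by sorted vertex names,
        # which is an assumption of this sketch).
        best = None
        for a in adj:
            for b in adj[a]:
                if sorted(a) < sorted(b):
                    key = (-len(adj[a] & adj[b]), sorted(a), sorted(b))
                    if best is None or key < best[0]:
                        best = (key, a, b)
        if best is None:
            break
        _, a, b = best
        merged = a | b
        # Only common neighbors stay connected to the merged node, which
        # keeps every super-node a clique of the original graph.
        common = adj[a] & adj[b]
        for n in adj[a] | adj[b]:
            adj[n].discard(a)
            adj[n].discard(b)
        del adj[a], adj[b]
        adj[merged] = set(common)
        for n in common:
            adj[n].add(merged)
    return sorted(sorted(node) for node in adj)

# The example graph from the slides.
vertices = ["V1", "V2", "V3", "V4", "V5"]
edges = [("V1", "V3"), ("V1", "V4"), ("V2", "V3"),
         ("V2", "V5"), ("V3", "V4"), ("V4", "V5")]
cliques = tseng_clique_partition(vertices, edges)
```

On the example graph the sketch first merges V1 and V3, then adds V4, and finally merges V2 and V5, giving the two cliques {V1, V3, V4} and {V2, V5}.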


Left-Edge (LE) Algorithm

Used in channel routing to minimize the number of tracks used to connect points (layout design).

To minimize the number of needed tracks.

To reduce wire lengths.

To avoid wire crossings.



LE Algorithm for Register Allocation

Map the birth time of a value to the left (top) edge, and its death time to the right (bottom) edge of a wire.

[Figure: a data-flow graph with ten operations (numbered 1 to 10), inputs i1 to i5, intermediate values a to g and outputs o1 to o3, together with the life-time chart of these values.]


The Left-Edge Algorithm

1. The values are sorted in increasing order of their birth times.

2. The first value is assigned to the first register.

3. The list is then scanned for the next value whose birth time is equal to or larger than the death time of the previous value.

4. This value is assigned to the current register.

5. The list is scanned until no more values can share the same register.

6. A new register is then introduced to hold the next value in the sorted list, and the algorithm iterates from step 3.
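The six steps can be sketched compactly in Python. The function name `left_edge`, the `(birth, death)` tuple format and the example life-times are assumptions made for illustration; they are not the values from the slides.

```python
def left_edge(lifetimes):
    """Assign values to registers with the left-edge algorithm.

    lifetimes: {value: (birth, death)} with birth < death.
    Returns a list of registers, each a list of value names.
    """
    # Step 1: sort by birth time (ties broken by name for determinism).
    pending = sorted(lifetimes, key=lambda v: (lifetimes[v][0], v))
    registers = []
    while pending:
        # Steps 2 and 6: open a register; its first value is the first
        # value still in the sorted list.
        current, last_death, remaining = [], None, []
        for v in pending:
            birth, death = lifetimes[v]
            # Steps 3 to 5: a value fits if it is born no earlier than
            # the death of the previous value placed in this register.
            if last_death is None or birth >= last_death:
                current.append(v)
                last_death = death
            else:
                remaining.append(v)
        registers.append(current)
        pending = remaining
    return registers

# Hypothetical life-times (an assumption, not the slide example).
lifetimes = {"a": (0, 3), "b": (1, 4), "c": (3, 6), "d": (4, 7), "e": (2, 5)}
registers = left_edge(lifetimes)
```

For these life-times three values (a, b and e) are alive at the same time, so at least three registers are needed, and the sketch uses exactly three.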



LE Algorithm Example

[Figure: three panels for the values i1 to i5, a to g and o1 to o3: the original life-times, the list sorted by birth time, and the allocated registers R1 to R5.]


LE Algorithm Discussions

The algorithm is guaranteed to allocate the minimum number of registers.

However, it has two disadvantages:

Not all life-time tables can be interpreted as intersecting intervals on a line.

• Loops

• Conditional branches

The assignment is neither unique nor necessarily optimal in terms of, for example, the number of multiplexers.



Allocation/Binding: Approach 3

Transformational allocation: starting from an initial allocation and binding, a final design is obtained by successive transformations.

Usually it starts with a maximal allocation (each operation has its dedicated physical unit).

The design is then improved by merging, step-by-step, physical units so that hardware resources are shared as much as possible.

[Figure: two adders, used in states Si and Sj respectively, are merged into a single adder shared by Si and Sj.]




Control-Unit Synthesis

Two basic approaches are widely used:

Microcode.

Hard-wired.

The basic assumptions:

A synchronous controller is used.

A schedule is given with the set of activation signals (for enabling, multiplexer input selection, bus control, etc.).

The controller is modeled as a finite-state machine.


Microcoded Control Synthesis

To store the control information in an organized fashion.

A microcode ROM of size λ is used, where λ is the number of schedule steps.

The ROM must have ⌈log2 λ⌉ address bits (note: ⌈x⌉ denotes the ceiling function).

A synchronous counter with a reset signal is used to address the ROM.

The counter is controlled by the system clock.

The ROM contents can be implemented as horizontal or vertical microcode.



Horizontal Microcode

Each activation signal is associated with one bit of the word in the microcode.

Address   Microword (one bit per activation signal)
00        1 1 0 0 0 1 0 1 0 1 0
01        0 0 1 0 0 0 1 0 1 0 1
10        0 0 0 1 0 0 0 0 0 0 0
11        0 0 0 0 1 0 0 0 0 0 0

[Figure: a counter with Reset and Clock inputs addresses the ROM, which holds λ microwords; the ROM outputs drive the activation signals.]

The word length is usually much larger than λ, and the ROM therefore has a width larger than its height.

Each bit is connected directly to an activation signal: high performance.

There are many zeros: wasted storage resource.
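A small Python model makes the ROM organization concrete. It uses the four microwords of the example; the names `ROM`, `active_signals` and `run` are hypothetical, and the counter is modeled simply as an address that counts up after a reset.

```python
import math

# Horizontal microcode: one ROM word per schedule step, one bit per
# activation signal (signals are numbered 1..11 from left to right).
ROM = [
    [1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
]

num_steps = len(ROM)                             # lambda in the text
address_bits = math.ceil(math.log2(num_steps))   # ceil(log2(lambda))

def active_signals(address):
    """Return the signal numbers asserted by the word at this address."""
    return [i + 1 for i, bit in enumerate(ROM[address]) if bit]

def run(counter=0):
    """Model the counter: step through the ROM addresses after a reset."""
    return [active_signals(addr) for addr in range(counter, num_steps)]

# Fraction of ROM bits that are zero, i.e. storage that carries no signal.
zero_fraction = sum(row.count(0) for row in ROM) / (num_steps * len(ROM[0]))
```

For this example two address bits suffice, and three quarters of the stored bits are zero, which illustrates the wasted storage mentioned above.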


Vertical Microcode

A fully vertical microcode encodes the n activation signals with ⌈log2 n⌉ bits to reduce the width of the ROM. Several words may be needed for a schedule step.

Horizontal microwords (activation signals 1 to 11, n = 11):

1 1 0 0 0 1 0 1 0 1 0
0 0 1 0 0 0 1 0 1 0 1
0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0

Vertical microwords (4 bits each):

0 0 0 1
0 0 1 0
0 1 1 0
1 0 0 0
1 0 1 0
0 0 1 1
0 1 1 1
1 0 0 1
1 0 1 1
0 1 0 0
0 1 0 1

[Figure: the vertical microwords are fed to a decoder that drives the activation signals.]
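The conversion from horizontal to fully vertical words can be sketched as follows. The sketch assumes that signal k is encoded as the binary number k and that code 0 is left unused, which matches the code words of the example; the function names are hypothetical.

```python
def to_vertical(horizontal_rom):
    """Encode each active signal as its own binary code word.

    Returns (word_width, steps) where steps[i] lists the code words that
    replace horizontal word i. Code 0 is left unused (an assumption).
    """
    n = len(horizontal_rom[0])
    width = n.bit_length()          # bits needed for the codes 1..n
    steps = []
    for word in horizontal_rom:
        codes = [format(i + 1, f"0{width}b")
                 for i, bit in enumerate(word) if bit]
        steps.append(codes)
    return width, steps

def decode(code, n):
    """Model the decoder: turn one code word back into a one-hot word."""
    signal = int(code, 2)
    return [1 if i + 1 == signal else 0 for i in range(n)]

HORIZONTAL = [
    [1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
]
width, steps = to_vertical(HORIZONTAL)
```

The first schedule step expands into five 4-bit words, so the narrower ROM is paid for with more words per step.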



Vertical Microcode Issues

A decoder is needed, which can be implemented by another ROM to form a two-stage control store.

Operation concurrency may not be fully supported.

Reserve code-words for concurrent operations.

• e.g., using “1100” to denote activation of the first group of activation signals.

Vertical control schemes can be implemented by:

Lengthening the schedule, or

Reading multiple ROM words in each step.

Both have, however, some disadvantages.

Schedule step   Vertical microwords
1               0001 0010 0110 1000 1010
2               0011 0111 1001 1011
3               0100
4               0101

[Figure: the vertical microwords are fed to a decoder that drives the activation signals.]


Microcode Optimization

To find the shortest encoding of the words such that full concurrency is preserved: the microcode compaction problem (an intractable problem).

Microcode compaction (MC) can be approached by partitioning the operations into groups such that only one operation is active in each group, so that vertical encoding can be used within each group.

Original microwords (columns are the activation signals):

 1  2  3  4  5  6  7  8  9 10 11
 1  1  0  0  0  1  0  1  0  1  0
 0  0  1  0  0  0  1  0  1  0  1
 0  0  0  1  0  0  0  0  0  0  0
 0  0  0  0  1  0  0  0  0  0  0

Columns reordered into groups A to E:

 A        | B  | C        | D     | E
 1  3  4  | 2  | 6  7  5  | 8  9  | 10 11
 1  0  0  | 1  | 1  0  0  | 1  0  | 1  0
 0  1  0  | 0  | 0  1  0  | 0  1  | 0  1
 0  0  1  | 0  | 0  0  0  | 0  0  | 0  0
 0  0  0  | 0  | 0  0  1  | 0  0  | 0  0

Encoded microwords (9 bits instead of 11):

 A   B   C   D   E
 01  1   01  01  01
 10  0   10  10  10
 11  0   00  00  00
 00  0   11  00  00

[Figure: decoders D1 to D4 expand the encoded fields back into activation signals.]



Microcode Compaction

To minimize the number of groups.

Construct a conflict graph, where the vertices correspond to the operations and the edges represent concurrency.

A minimum coloring of this graph gives the minimum number of groups needed.

Note: this does not necessarily lead to the minimum number of word bits (e.g., 10 can be divided as 5+5, or 7+3).

[Figure: a conflict graph with six vertices (1 to 6), shown before and after coloring.]
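The grouping step can be sketched with a conflict graph and a simple first-fit coloring. Note that first-fit is a heuristic swapped in for illustration; it does not guarantee a minimum coloring. The field-width rule (each group reserves one extra code meaning "no signal active") is an assumption of this sketch.

```python
from itertools import combinations

MICROWORDS = [
    [1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
]

def conflict_graph(microwords):
    """Signals conflict when some microword activates both of them."""
    n = len(microwords[0])
    conflicts = {s: set() for s in range(1, n + 1)}
    for word in microwords:
        active = [i + 1 for i, bit in enumerate(word) if bit]
        for u, v in combinations(active, 2):
            conflicts[u].add(v)
            conflicts[v].add(u)
    return conflicts

def greedy_coloring(conflicts):
    """Put each signal into the first group with no conflicting member."""
    groups = []
    for s in sorted(conflicts):
        for group in groups:
            if not (conflicts[s] & group):
                group.add(s)
                break
        else:
            groups.append({s})
    return groups

def encoded_width(groups):
    """Bits per word: each group needs codes for its signals plus 'none'."""
    return sum(len(g).bit_length() for g in groups)

groups = greedy_coloring(conflict_graph(MICROWORDS))
```

On the example microwords, first-fit finds five groups, which is the minimum here because five signals are active in the first word. Its grouping needs 10 bits per word under the assumed width rule, whereas the grouping {1,3,4}, {2}, {5,6,7}, {8,9}, {10,11} needs only 9, so the same number of groups can give different word lengths.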


Hard-Wired Control Synthesis

Generate a Moore-type finite-state machine from a schedule.

Synthesize the FSM model.

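State   Microword (signals 1 to 11)   Active signals
S1      1 1 0 0 0 1 0 1 0 1 0         1, 2, 6, 8, 10
S2      0 0 1 0 0 0 1 0 1 0 1         3, 7, 9, 11
S3      0 0 0 1 0 0 0 0 0 0 0         4
S4      0 0 0 0 1 0 0 0 0 0 0         5

[Figure: state diagram of the resulting FSM with states S1 to S4 and a Reset input.]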
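A Moore machine for this schedule can be modeled in a few lines of Python. The state-to-signal mapping follows the four schedule steps of the example; the transition order and, in particular, the wrap-around from S4 back to S1 are assumptions of this sketch.

```python
# Moore machine: the outputs depend only on the current state.
OUTPUTS = {
    "S1": {1, 2, 6, 8, 10},
    "S2": {3, 7, 9, 11},
    "S3": {4},
    "S4": {5},
}
# Assumed transition order; wrapping from S4 back to S1 is a guess.
NEXT_STATE = {"S1": "S2", "S2": "S3", "S3": "S4", "S4": "S1"}

def step(state, reset=False):
    """Return the state after one clock edge."""
    return "S1" if reset else NEXT_STATE[state]

def simulate(cycles):
    """Return the activation signals asserted in each clock cycle."""
    state, trace = step("S1", reset=True), []
    for _ in range(cycles):
        trace.append(sorted(OUTPUTS[state]))
        state = step(state)
    return trace

trace = simulate(4)
```

Synthesizing the FSM then amounts to choosing a state encoding and deriving the next-state and output logic from these two tables.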







Advanced Issues of HLS

Many-to-many mapping between operations and physical components.

Re-use of previous designs (partial structure).

Synthesis with commercially available sub-systems, IP-based synthesis.

HLS with testability consideration.

[Figure: many-to-many mapping between operations (+, x, -) and components (Adder, Mult, ALU, Subs).]

Bit-width compatibility.



Summary

High-level synthesis is one of the most important design steps in the design process of electronic systems.

The use of efficient HLS tools has led to a great improvement in design productivity.

The two most important tasks are scheduling and allocation/binding, which are interdependent.

Controller design is also an important task, and its interaction with datapath design should be considered.

The HLS tasks are usually formulated as optimization problems, and heuristic algorithms are used to solve them.
