Heuristic based throughput analysis and optimization of asynchronous pipelines

Alexander Smirnov, Alexander Taubin

Determine
◦ max throughput
◦ causes of throughput limit
◦ max achievable throughput
◦ cost of achieving a given throughput level

Data independent token flow
◦ No early evaluation
◦ DEMUXes send data all ways

Cells across library/design implement the same handshaking protocol

Previous work
Cell characterization
Protocol characterization
Throughput of asynchronous pipelines (reminder)
Throughput analysis
Throughput optimization

Early works on the throughput of async. pipelines: M. Greenstreet, K. Steiglitz; T. Williams; A. Lines

Time separation of events (TSE) based approaches to throughput analysis: T. Amon, H. Hulgaard, S. Burns, G. Borriello; S. Chakraborty, D. Dill; P. McGee, S. Nowick

Simulation based approaches: C. Brej; K. Fazel

Slack matching (throughput optimization) approaches: P. Prakash, A. Martin; P. Beerel, M. Davies, A. Lines, N. Kim

Cell characterization example (in Liberty)


Cell (in ASIC) is a physical implementation of a gate

Characterization is a way of abstracting away the details and specifying the parameters needed on the higher level of hierarchy

Cell characterization
◦ abstracts away cell implementation details
◦ specifies functionality, timing, area, power consumption, etc.
◦ necessary and sufficient for efficient synthesis, optimization and simulation

De-facto standard – Synopsys “Liberty”

Conventional gate:
◦ Implements function of input wires
◦ Special signals: clock, set, clear, etc.

Asynchronous stage:
◦ Implements function of input channels
◦ Special signals: request, acknowledge, data0, data1, reset

[Figure: a conventional AND2 gate (C = A & B) and a D flip-flop with clock, set, clear and reset, shown next to an asynchronous stage (C = A & B) whose channels A, B and C carry req/ack signals and a reset]

Reuse Synopsys Liberty whenever possible

Use attributes to specify roles of pins in handshaking, channel, etc

Specify functionality in terms of channels (abstract out control functionality)

Use Data → Data timing arcs to specify channel → channel attributes: slack, number of tokens at initialization

* PCHB stage example

Abstract channel: forward/backward control and forward data propagation

Assumption: handshake protocol is the same across the library/design


L - Left/Right; F - Forward/Backward; C - Control/Data; E - Evaluation/Reset


Use cell characterization to infer handshake protocol

Abstraction and characterization allow identifying protocol loops in every stage for every pair of channels


Goal: enumerate all handshake cycles
◦ handshake cycles are the same across the design (assumption)
◦ for practical protocols a handshake cycle covers 3 stages
◦ enumerate all possible cycles in a full timing graph of a 4-stage FIFO, normalize cycles and remove identical ones


* PCHB stage example

Complexity negligible
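The enumerate, normalize and de-duplicate step can be sketched as follows. This is a minimal illustration only: the timing graph built by `fifo_graph` is a hypothetical toy protocol (forward propagation, backward acknowledge, stage re-enable), not the PCHB timing graph, and all function names are assumptions. Cycles are normalized by shifting stage indices so the lowest stage is 0 and by choosing a canonical rotation.

```python
def fifo_graph(n):
    """Toy timing graph of an n-stage FIFO (hypothetical protocol).

    Nodes are (stage, event) pairs; 'f' is a forward event, 'b' a backward one.
    """
    g = {}
    for i in range(n - 1):
        g.setdefault((i, "f"), []).append((i + 1, "f"))  # data moves forward
        g.setdefault((i + 1, "f"), []).append((i, "b"))  # successor acknowledges
        g.setdefault((i, "b"), []).append((i, "f"))      # stage is re-enabled
    return g


def simple_cycles(graph):
    """Enumerate every simple cycle once, rooted at its smallest node."""
    nodes = sorted(graph)
    index = {node: k for k, node in enumerate(nodes)}
    cycles = []

    def dfs(start, node, path, on_path):
        for nxt in graph.get(node, ()):
            if nxt == start:
                cycles.append(list(path))
            elif index[nxt] > index[start] and nxt not in on_path:
                path.append(nxt)
                on_path.add(nxt)
                dfs(start, nxt, path, on_path)
                on_path.remove(nxt)
                path.pop()

    for start in nodes:
        dfs(start, start, [start], {start})
    return cycles


def normalize(cycle):
    """Make a cycle independent of its position in the pipeline and its start node."""
    base = min(stage for stage, _ in cycle)
    shifted = [(stage - base, event) for stage, event in cycle]
    rotations = [tuple(shifted[k:] + shifted[:k]) for k in range(len(shifted))]
    return min(rotations)


raw_cycles = simple_cycles(fifo_graph(4))
unique_cycles = {normalize(c) for c in raw_cycles}
print(len(raw_cycles), "raw cycles,", len(unique_cycles), "distinct handshake cycle(s)")
```

In this toy graph the same cycle appears once per pair of adjacent stages, so normalization collapses the raw cycles into a single protocol-level cycle.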

Asynchronous pipeline throughput is determined by
◦ loops: handshaking, algorithmic (rings)
◦ congestion

Pipeline throughput is known for basic pipeline compositions

Bottleneck based – pipeline compositions are bottleneck candidates

T. Williams (1990), A. Lines (1995):

Throughput T

◦ x – token count
◦ s – slack
◦ d – dynamic slack
◦ c – cycle time

x is invariant for a ring in a pipeline with deterministic (data independent) token flow
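How these four quantities interact can be illustrated with a minimal sketch. It assumes the piecewise-linear "canopy" shape commonly used for such models: throughput grows with x while the ring is short of tokens, peaks at 1/c when x = d, and falls to zero when the ring is full (x = s). The function name and this exact form are assumptions, not the equation from the slide.

```python
def ring_throughput(x, s, d, c):
    """Throughput of a ring holding x tokens (assumed canopy model).

    x: token count, s: slack (capacity), d: dynamic slack, c: cycle time.
    """
    if not 0 < d < s:
        raise ValueError("expected 0 < d < s")
    if not 0 <= x <= s:
        raise ValueError("token count must lie in [0, s]")
    peak = 1.0 / c                          # limited by the local cycle time
    data_limited = x / (d * c)              # too few tokens in the ring
    hole_limited = (s - x) / ((s - d) * c)  # too few free slots in the ring
    return min(peak, data_limited, hole_limited)


# Example with s = 10, d = 4, c = 2 time units
for tokens in (2, 4, 7, 10):
    print(tokens, ring_throughput(tokens, s=10, d=4, c=2.0))
```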

[Plot: throughput T (Hz) versus number of tokens x, with d and s marked]

For a serial composition of pipelines with throughputs T1, T2 the resulting throughput is Tresulting = min{T1, T2}

Tresulting is observed at dmin ≤ x ≤ dmax

[Figure: serial composition of pipelines T1 and T2]


for parallel composition of pipelines with throughputs T1, T2 the resulting throughput

Tresulting is observed at

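[Plot: throughput T (Hz) versus number of tokens x for the two pipelines, with T1, T2, T1bal, T2bal, d1, d2, s1, s2 and Tresulting marked]

[Figure: parallel composition of pipelines T1 and T2]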


Peak throughput of a pipeline is limited by its slowest component. To determine the throughput of a pipeline it is sufficient to find that slowest combination of stages: the throughput bottleneck

Bottleneck candidates (BCs):
◦ Handshake (h/s) cycle
◦ Re-converging paths
◦ Algorithmic cycle (ring)

A BC is characterized by a cycle time range

The length of each h/s cycle in the protocol is computed for each window of length 2 ≤ m ≤ 3 (HB stages).

Handshake cycles are known from protocol analysis

Lengths of each cycle (i min and i max) are computed for each cycle “in place” and then

Heuristic: cycles involving multiple branches are not considered

Complexity or where vi are primary outputs of a stage

Environment reaction times

* PCHB stages example

Theorem: if a BC is a bottleneck, reaction times on its borders never exceed those used to compute

It follows from the theorem that BC can be analyzed in isolation to determine

BCs are sorted with respect to BC with the highest is a bottleneck

– it defines the throughput of the design
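The ranking step can be sketched as below. The sort key is lost from the slide text, so this sketch assumes BCs are ranked by the upper bound of their cycle time range and that the predicted throughput is the reciprocal of that bound. The class, its fields and the candidate names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BottleneckCandidate:
    name: str
    cycle_min: float  # lower bound of the cycle time range
    cycle_max: float  # upper bound of the cycle time range


def rank_candidates(candidates):
    """Sort BCs from slowest to fastest by worst-case cycle time (assumed key)."""
    return sorted(candidates, key=lambda bc: bc.cycle_max, reverse=True)


def design_throughput(candidates):
    """Return the bottleneck and the throughput it imposes on the design."""
    bottleneck = rank_candidates(candidates)[0]
    return bottleneck, 1.0 / bottleneck.cycle_max


candidates = [
    BottleneckCandidate("handshake cycle", 1.5, 2.0),
    BottleneckCandidate("ring", 3.0, 4.0),
    BottleneckCandidate("re-converging paths", 2.0, 2.5),
]
bottleneck, throughput = design_throughput(candidates)
print(bottleneck.name, throughput)
```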

Requires results of handshake cycle analysis

Identify pairs of re-converging paths, compute

Reduce the number of pairs of re-converging paths:
◦ one pair of re-converging paths is identified per fork-join
◦ the pipeline is assumed to have deterministic (data independent) token flow, so the number of initial tokens in any two re-converging paths is the same

The number of BCs can be reduced if optimization is not needed

Heuristics for identifying rings and re-converging paths include:
◦ consider only two of any set of rings with common arc(s) (the longest and the shortest)

Throughput of rings and re-converging path pairs is computed using the equations from T. Williams, A. Lines, BUT
◦ If a handshake cycle covers re-converging paths (if the length of the shorter branch is 0-2 half-buffer stages) the equations from T. Williams, A. Lines do not apply

The throughput of such a bottleneck candidate is determined by the handshake cycles

...

Identify handshake bottlenecks (slide window)

Optimize handshake bottlenecks (if necessary)

Identify BCs due to algorithmic loops and dynamic slack imbalance
◦ CPM, modified to handle loops
◦ Trade memory for time – store arrival times, significant predecessors
◦ Eliminate unnecessary graph exploration
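The bookkeeping behind the critical path method (CPM) step can be sketched on an acyclic graph. This is plain CPM that stores the arrival time and the significant (latest-arriving) predecessor of every node; the modification for loops mentioned above is not reproduced, and `TopologicalSorter` raises `CycleError` if the graph contains one. Node names and delays are hypothetical.

```python
from graphlib import TopologicalSorter


def arrival_times(delays, edges):
    """Longest-path arrival time and significant predecessor for every node."""
    preds = {node: [] for node in delays}
    for src, dst in edges:
        preds[dst].append(src)

    arrival, significant = {}, {}
    # static_order() yields each node after all of its predecessors
    for node in TopologicalSorter(preds).static_order():
        latest = max(preds[node], key=lambda p: arrival[p], default=None)
        significant[node] = latest
        arrival[node] = delays[node] + (arrival[latest] if latest is not None else 0.0)
    return arrival, significant


def critical_path(arrival, significant):
    """Walk stored predecessors back from the latest node; no graph re-exploration."""
    node = max(arrival, key=arrival.get)
    path = []
    while node is not None:
        path.append(node)
        node = significant[node]
    return path[::-1]


delays = {"a": 1, "b": 2, "c": 5, "d": 1}
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
arrival, significant = arrival_times(delays, edges)
print(arrival["d"], critical_path(arrival, significant))
```

Storing the significant predecessor per node is what trades memory for time: the critical path is recovered by a backward walk instead of a second search.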

Predicted throughput variation range (% of the actual simulated throughput)

Predicted throughput variation depends on:
◦ asymmetry in library cells: throughput varies depending on the data (actual throughput variation)
◦ uncertainty introduced by heuristics (currently, incomplete synchronization trees introduce height uncertainty)

Throughput estimation is heuristic based, i.e. error is possible

Shown is the % difference between the actual throughput and the predicted variation range bound, weighted by the actual throughput

In 92.5% of test cases the measured throughput is within the predicted variation range; the maximum error observed is 27%

Alleviate bottlenecks with throughput less than the goal by
◦ Handshake pipelining
◦ Ring padding, slack matching

Iteratively
◦ insert stages
◦ update all BCs
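The iterative loop can be illustrated for a single ring under simplifying assumptions. The sketch reuses an assumed piecewise-linear throughput model, treats every stage as contributing the same slack `s0` and dynamic slack `d0` (hypothetical per-stage parameters), and pads the ring one stage at a time until the goal is met. A real flow would update all BCs after each insertion, which is not modelled here.

```python
def ring_throughput(x, n, s0, d0, c):
    """Assumed canopy model for a ring of n identical stages holding x tokens."""
    s, d = n * s0, n * d0
    if x >= s:
        return 0.0  # ring is full, nothing can move
    return min(1.0 / c, x / (d * c), (s - x) / ((s - d) * c))


def pad_ring(x, n, s0, d0, c, goal):
    """Insert stages until the ring meets the throughput goal; return stage count."""
    while ring_throughput(x, n, s0, d0, c) < goal:
        if n * d0 >= x:
            # No longer short of free slots: more stages only add latency.
            raise ValueError("goal cannot be reached by padding this ring")
        n += 1  # insert one stage, then re-evaluate
    return n


# A 5-stage ring with 2 tokens is short of free slots; pad it to reach 0.8 / c
stages = pad_ring(x=2, n=5, s0=0.5, d0=0.25, c=1.0, goal=0.8)
print(stages)
```

The `ValueError` branch corresponds to rings that padding cannot fix, such as those limited by too few tokens or by the cell cycle time itself.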


The approach allows automatic optimization of the throughput up to the level limited by:
◦ library cells
◦ data deficient (long non-pipelined) rings

Fully optimized throughput is higher (cycle time smaller) for
◦ FIFOs
◦ circuits without synchronization trees (fan-out 1)

Developed an asynchronous cell/stage characterization, based on Synopsys Liberty, used for synthesis and throughput analysis/optimization

Protocol characterization automatically inferred from cell characterization

Support for hierarchical designs (with possible loss of precision)

All bottlenecks are identified

All bottlenecks except for data deficient rings are automatically alleviated

Optimization tested with stage insertion, but other optimizations can be used

Analysis results are easily adjusted to reflect non-structural changes

Currently not considering handshake cycles involving branches

Unless merges/forks are properly characterized, analysis in hierarchical designs is imprecise

Currently synchronization trees are assumed balanced; for incomplete trees one sync cell delay is added to the variation range
