Heuristic based throughput analysis and optimization of asynchronous pipelines
Alexander Smirnov, Alexander Taubin
Goals and assumptions
Determine:
◦ max throughput
◦ causes of throughput limit
◦ max achievable throughput
◦ cost of achieving a given throughput level
Data independent token flow:
◦ No early evaluation
◦ DEMUXes send data all ways
Cells across library/design implement the same handshaking protocol
Previous work
Cell characterization
Protocol characterization
Throughput of asynchronous pipelines (reminder)
Throughput analysis
Throughput optimization
Early works on the throughput of async. pipelines: M. Greenstreet, K. Steiglitz; T. Williams; A. Lines
Time separation of events (TSE) based approaches to throughput analysis: T. Amon, H. Hulgaard, S. Burns, G. Borriello; S. Chakraborty, D. Dill; P. McGee, S. Nowick
Simulation based approaches: C. Brej; K. Fazel
Slack matching (throughput optimization) approaches: P. Prakash, A. Martin; P. Beerel, M. Davies, A. Lines, N. Kim
Cell characterization example (in Liberty)
Cell (in ASIC) is a physical implementation of a gate
Characterization is a way of abstracting away the details and specifying the parameters needed on the higher level of hierarchy
Cell characterization:
◦ abstracts away cell implementation details
◦ specifies functionality, timing, area, power consumption, etc.
◦ necessary and sufficient for efficient synthesis, optimization and simulation
De-facto standard – Synopsys “Liberty”
Conventional gate:
◦ Implements function of input wires
◦ Special signals: clock, set, clear, etc.
Asynchronous stage:
◦ Implements function of input channels
◦ Special signals: request, acknowledge, data0, data1, reset
[Figure: a conventional AND2 gate (C = A & B) and a D flip-flop (D, Q, SET, CLR, clock, reset) compared with an asynchronous stage whose channels A, B, C each carry req/ack signals, plus reset]
Reuse Synopsys Liberty whenever possible
Use attributes to specify roles of pins in handshaking, channel, etc
Specify functionality in terms of channels (abstract out control functionality)
Use Data → Data timing arcs to specify channel → channel attributes: slack, number of tokens at initialization
* PCHB stage example
Abstract channel: forward/backward control and forward data propagation
Assumption: handshake protocol is the same across the library/design
Legend: L - Left/Right, F - Forward/Backward, C - Control/Data, E - Evaluation/Reset
Use cell characterization to infer handshake protocol
Abstraction and characterization allow identifying protocol loops in every stage for every pair of channels
Goal: enumerate all handshake cycles
◦ handshake cycles are the same across the design (assumption)
◦ for practical protocols a handshake cycle covers 3 stages
◦ enumerate all possible cycles in a full timing graph of a 4-stage FIFO, normalize cycles and remove identical ones
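The enumerate, normalize and de-duplicate step can be illustrated with a minimal sketch. The timing graph below is a deliberately simplified, hypothetical model (one `req` and one `ack` event per stage, so its cycles span only two stages); it is not the PCHB timing graph from the slides, and the function names are illustrative only.

```python
from typing import Dict, List, Set, Tuple

Node = Tuple[int, str]  # (stage index, event name); event names are hypothetical


def build_fifo_graph(n_stages: int) -> Dict[Node, List[Node]]:
    """Toy timing graph of an n-stage FIFO (assumed model, not a real protocol)."""
    graph: Dict[Node, List[Node]] = {}
    for i in range(n_stages):
        graph[(i, "req")] = []
        graph[(i, "ack")] = []
    for i in range(n_stages - 1):
        graph[(i, "req")].append((i + 1, "req"))      # data moves forward
        graph[(i + 1, "req")].append((i + 1, "ack"))  # receiver acknowledges
        graph[(i + 1, "ack")].append((i, "req"))      # sender may fire again
    return graph


def simple_cycles(graph: Dict[Node, List[Node]]) -> List[List[Node]]:
    """Enumerate each simple cycle once, rooted at its smallest node."""
    cycles: List[List[Node]] = []
    for start in sorted(graph):
        stack = [(start, [start])]
        while stack:
            node, path = stack.pop()
            for nxt in graph[node]:
                if nxt == start:
                    cycles.append(path)
                elif nxt > start and nxt not in path:
                    stack.append((nxt, path + [nxt]))
    return cycles


def normalize(cycle: List[Node]) -> Tuple[Node, ...]:
    """Shift stage indices to start at 0 and rotate to a canonical start node."""
    base = min(stage for stage, _ in cycle)
    shifted = [(stage - base, event) for stage, event in cycle]
    pivot = shifted.index(min(shifted))
    return tuple(shifted[pivot:] + shifted[:pivot])


graph = build_fifo_graph(4)
raw_cycles = simple_cycles(graph)
unique_cycles: Set[Tuple[Node, ...]] = {normalize(c) for c in raw_cycles}
print(len(raw_cycles), "cycles found,", len(unique_cycles), "after normalization")
```

Because every stage is assumed to follow the same protocol, cycles that differ only by their position in the FIFO collapse to one normalized representative.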
Complexity negligible
* PCHB stage example
Asynchronous pipeline throughput is determined by loops and congestion. Loops:
◦ Handshaking
◦ Algorithmic (rings)
Pipeline throughput is known for basic pipeline compositions
Bottleneck based – pipeline compositions are bottleneck candidates
T. Williams (1990), A. Lines (1995): throughput T as a function of
◦ x – token count
◦ s – slack
◦ d – dynamic slack
◦ c – cycle time
x is invariant for a ring in a pipeline with deterministic (data independent) token flow
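The equation itself did not survive in this transcript. As a rough illustration only, the sketch below uses a common piecewise-linear ("canopy") model in terms of the same symbols x, s, d and c. This model is an assumption and may differ from the formula on the original slide: throughput rises linearly with the token count up to a peak of 1/c at x = d, then falls linearly to zero at x = s.

```python
def ring_throughput(x: float, s: float, d: float, c: float) -> float:
    """Assumed canopy model of ring throughput.

    x: token count, s: slack, d: dynamic slack, c: cycle time.
    This is an illustrative approximation, not the slide's exact equation.
    """
    if not (0 < d < s) or c <= 0:
        raise ValueError("expected 0 < d < s and c > 0")
    if x <= 0 or x >= s:
        return 0.0                      # no tokens, or no free space left
    peak = 1.0 / c
    if x <= d:
        return peak * x / d             # data-limited region
    return peak * (s - x) / (s - d)     # space-limited region


for tokens in (2, 4, 6):
    print(tokens, ring_throughput(tokens, s=8, d=4, c=2.0))
```

Under this model a ring with too few tokens is limited by data, and a ring with too many tokens is limited by free space.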
[Figure: throughput T (Hz) versus number of tokens x, with dynamic slack d and slack s marked]
For serial composition of pipelines with throughputs T1, T2 the resulting throughput is Tresulting = min{T1, T2}
Tresulting is observed at dmin ≤ x ≤ dmax
[Figure: serial composition of pipelines with throughputs T1, T2 (in general Ti, Tj)]
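The serial composition rule can be written down directly. The helper below is a minimal sketch with a hypothetical name; it returns both the resulting throughput and the index of the limiting pipeline.

```python
from typing import Sequence, Tuple


def serial_throughput(throughputs: Sequence[float]) -> Tuple[float, int]:
    """Return (min throughput, index of the limiting pipeline)."""
    if not throughputs:
        raise ValueError("at least one pipeline is required")
    index = min(range(len(throughputs)), key=lambda i: throughputs[i])
    return throughputs[index], index


# Example values in Hz are made up for illustration.
resulting, limiting = serial_throughput([250e6, 180e6, 300e6])
print(resulting, limiting)
```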
for parallel composition of pipelines with throughputs T1, T2 the resulting throughput
Tresulting is observed at
[Figure: throughput T (Hz) versus number of tokens x for parallel composition, showing T1, T2, T1bal, T2bal, d1, d2, s1, s2 and Tresulting]
Peak throughput of a pipeline is limited by its slowest component
To determine the throughput of a pipeline it is sufficient to find that slowest combination of stages: the throughput bottleneck
Bottleneck candidates (BCs):
◦ Handshake (h/s) cycle
◦ Re-converging paths
◦ Algorithmic cycle (ring)
Each BC is characterized by a cycle time range
The length of each h/s cycle in the protocol is computed for each window of length 2 ≤ m ≤ 3 (HB stages)
Handshake cycles are known from protocol analysis
Lengths of each cycle (imin and imax) are computed for each cycle “in place”
Heuristic: cycles involving multiple branches are not considered
Complexity depends on vi, the primary outputs of a stage
Environment reaction times
* PCHB stages example
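The sliding-window evaluation of handshake cycles can be sketched as follows. The delay names (`fwd`, `ack`), the cycle template and the numbers are hypothetical; a real analysis would take arcs and delays from the cell characterization and would track minimum and maximum lengths separately.

```python
from typing import Dict, List, Tuple

Arc = Tuple[int, str]  # (stage offset inside the window, delay name)


def window_cycle_times(
    stage_delays: List[Dict[str, float]],
    cycle: List[Arc],
    window: int,
) -> List[Tuple[int, float]]:
    """Evaluate one handshake cycle template at every window position."""
    results = []
    for start in range(len(stage_delays) - window + 1):
        total = sum(stage_delays[start + off][name] for off, name in cycle)
        results.append((start, total))
    return results


# Hypothetical per-stage delays and a cycle template covering 2 stages.
stages = [
    {"fwd": 2.0, "ack": 1.0},
    {"fwd": 3.0, "ack": 1.0},
    {"fwd": 2.0, "ack": 4.0},
    {"fwd": 1.0, "ack": 1.0},
]
template = [(0, "fwd"), (1, "fwd"), (1, "ack")]

times = window_cycle_times(stages, template, window=2)
worst_start, worst_time = max(times, key=lambda item: item[1])
print(times, "worst window starts at stage", worst_start)
```

Because the template is fixed by the protocol, each window costs a constant amount of work, which is consistent with evaluating cycles "in place".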
Theorem: if a BC is a bottleneck, reaction times on its borders never exceed those used to compute its cycle time
It follows from the theorem that a BC can be analyzed in isolation to determine its cycle time
BCs are sorted with respect to cycle time; the BC with the highest cycle time is a bottleneck: it defines the throughput of the design
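Ranking the candidates can be sketched as below. The `BottleneckCandidate` type and the example values are hypothetical, and the sketch assumes each candidate's cycle time is already expressed per token, so that throughput is its reciprocal.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BottleneckCandidate:
    name: str
    cycle_time: float  # per-token cycle time, analyzed in isolation (assumed)


def rank_candidates(
    candidates: List[BottleneckCandidate],
) -> Tuple[List[BottleneckCandidate], float]:
    """Sort BCs by decreasing cycle time; the first one sets the throughput."""
    ranked = sorted(candidates, key=lambda bc: bc.cycle_time, reverse=True)
    return ranked, 1.0 / ranked[0].cycle_time


bcs = [
    BottleneckCandidate("handshake cycle, stages 3-5", 4.0),
    BottleneckCandidate("ring A", 8.0),
    BottleneckCandidate("fork-join B", 5.0),
]
ranked, design_throughput = rank_candidates(bcs)
print(ranked[0].name, design_throughput)
```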
Requires results of handshake cycle analysis
Identify pairs of re-converging paths, compute their cycle time range
Reduce the number of pairs of re-converging paths:
◦ one pair of re-converging paths is identified per fork-join
◦ the pipeline is assumed to have deterministic (data independent) token flow, so the number of initial tokens in any two re-converging paths is the same
Number of BCs can be reduced if optimization not needed
Heuristics for identifying rings, re-converging paths include:
◦ consider two of any set of rings with common arc(s) (longest and shortest)
Throughput of rings, re-converging path pairs is computed using the equations from T. Williams, A. Lines, BUT
◦ If a handshake cycle covers re-converging paths (if the length of the shorter branch is 0-2 half-buffer stages) the equations from T. Williams, A. Lines do not apply
Throughput of such a bottleneck candidate is determined by the handshake cycles
...
Identify handshake bottlenecks (slide window)
Optimize handshake bottlenecks (if necessary)
Identify BCs due to algorithmic loops and dynamic slack imbalance
◦ CPM, modified to handle loops
◦ Trade memory for time – store arrival times, significant predecessors
◦ Eliminate unnecessary graph exploration
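For reference, a plain critical path method (CPM) pass that stores arrival times and the significant (latest-arriving) predecessor of each node might look like the sketch below. It handles acyclic graphs only and rejects cycles. The authors' modification for loops is not described in the slides and is not reproduced here; node names and delays are made up.

```python
from collections import deque
from typing import Dict, Hashable, Optional, Tuple

Edge = Tuple[Hashable, Hashable]


def cpm_arrival_times(
    edges: Dict[Edge, float],
) -> Tuple[Dict[Hashable, float], Dict[Hashable, Optional[Hashable]]]:
    """Longest-path arrival times over a DAG, with significant predecessors."""
    succ: Dict[Hashable, list] = {}
    indeg: Dict[Hashable, int] = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
        succ.setdefault(v, [])
        indeg[v] = indeg.get(v, 0) + 1
        indeg.setdefault(u, 0)

    arrival = {n: 0.0 for n in succ}
    pred: Dict[Hashable, Optional[Hashable]] = {n: None for n in succ}
    ready = deque(n for n in succ if indeg[n] == 0)
    visited = 0
    while ready:
        u = ready.popleft()
        visited += 1
        for v in succ[u]:
            t = arrival[u] + edges[(u, v)]
            if t > arrival[v]:
                arrival[v], pred[v] = t, u  # remember the critical predecessor
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if visited != len(succ):
        raise ValueError("graph has a cycle; plain CPM needs a DAG")
    return arrival, pred


# A fork-join with two re-converging branches (hypothetical delays).
delays = {("fork", "a"): 2.0, ("a", "join"): 3.0,
          ("fork", "b"): 1.0, ("b", "join"): 1.0}
arrival, pred = cpm_arrival_times(delays)
print(arrival["join"], pred["join"])
```

Storing `pred` lets the critical path be traced back without exploring the graph again, which is the memory-for-time trade mentioned above.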
Predicted throughput variation range (% of the actual simulated throughput)
Predicted throughput variation depends on:
◦ Actual throughput variation: due to asymmetry in library cells, throughput varies depending on the data
◦ Uncertainty introduced by heuristics (currently incomplete synchronization trees introduce height uncertainty)
Throughput estimation is heuristic based, i.e. error is possible
Shown is the % difference between the actual throughput and the predicted variation range bound, weighted by the actual throughput
In 92.5% of test cases the measured throughput is within the predicted variation range; the maximum error observed is 27%
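One way to read the error metric described above is sketched below: the distance from the actual throughput to the nearest bound of the predicted range, as a percentage of the actual throughput, and zero when the actual value lies inside the range. The function name and the unsigned convention are assumptions.

```python
def range_error_percent(actual: float, predicted_min: float,
                        predicted_max: float) -> float:
    """Percent distance from `actual` to the predicted range (0 if inside)."""
    if actual <= 0:
        raise ValueError("actual throughput must be positive")
    if predicted_min > predicted_max:
        raise ValueError("predicted_min must not exceed predicted_max")
    if actual < predicted_min:
        return 100.0 * (predicted_min - actual) / actual
    if actual > predicted_max:
        return 100.0 * (actual - predicted_max) / actual
    return 0.0


# Example numbers are made up for illustration.
print(range_error_percent(100.0, 90.0, 110.0))
print(range_error_percent(100.0, 110.0, 120.0))
print(range_error_percent(100.0, 60.0, 80.0))
```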
Alleviate bottlenecks with throughput less than the goal by
◦ Handshake pipelining
◦ Ring padding, slack matching
Iteratively
◦ insert stages
◦ update all BCs
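The iterative loop can be sketched with a toy cost model. The model is an assumption made only for illustration: each candidate has a fixed per-cycle overhead set by the library cells, plus a logic delay that is divided evenly among its stages. The slides do not specify how stage insertion changes a candidate's cycle time.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    name: str
    work: float      # logic delay to be pipelined (hypothetical)
    overhead: float  # fixed handshake overhead per cycle (hypothetical)
    stages: int = 1

    def cycle_time(self) -> float:
        return self.overhead + self.work / self.stages


def optimize(candidates: List[Candidate], goal: float,
             max_insertions: int = 100) -> Tuple[bool, List[str]]:
    """Insert stages into the worst candidate until the goal is met."""
    inserted: List[str] = []
    for _ in range(max_insertions):
        # Cycle times are recomputed on every pass ("update all BCs").
        worst = max(candidates, key=lambda bc: bc.cycle_time())
        if 1.0 / worst.cycle_time() >= goal:
            return True, inserted
        if 1.0 / worst.overhead <= goal:
            return False, inserted  # limited by the library cells
        worst.stages += 1
        inserted.append(worst.name)
    return False, inserted


design = [Candidate("ring", work=8.0, overhead=2.0),
          Candidate("fifo", work=2.0, overhead=2.0)]
reached, log = optimize(design, goal=0.25)
print(reached, log, design[0].stages)
```

The early exit on `overhead` mirrors the point that optimization cannot exceed the level set by the library cells.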
The approach allows the throughput to be optimized automatically up to the level limited by:
◦ library cells
◦ data deficient (long non-pipelined) rings
Fully optimized throughput is higher (cycle time smaller) for
◦ FIFOs
◦ circuits without synchronization trees (fan-out 1)
An asynchronous cell/stage characterization based on Synopsys Liberty was developed; it is used for synthesis and throughput analysis/optimization
Protocol characterization is automatically inferred from cell characterization
Support for hierarchical designs (with possible loss of precision)
All bottlenecks are identified
All bottlenecks except for data deficient rings are automatically alleviated
Optimization tested with stage insertion but other optimizations can be used
Analysis results easily adjusted to reflect non-structural changes
Currently not considering handshake cycles involving branches
Unless merges/forks are properly characterized, analysis in hierarchical designs is imprecise
Currently synchronization trees are assumed balanced; for incomplete trees one sync cell delay is added to the variation range