Variation-Aware Chip Design for Reliability and
Performance
Deming Chen, ECE, UIUC
Acknowledgement: work partially supported by Sun Microsystems
Students: Christine Chen, Greg Lucas, Lu Wan
Outline
Background
  Process variation
  Motivation of variation-aware chip design
SSTA with multiple clock domains
Telescopic logic for processor performance
Clock tree design with skew reduction
Conclusion
Process Variation
Increases as device and interconnect feature sizes are scaled down
Can be within-die (intra-die) or between dies (inter-die)
(Source: Intel)
Traditional Solutions
Speed/power binning: measure chips and bin them into performance categories; sell slower or power-hungry chips at a lower price
Guard-band the design to achieve the desired yield
  Uses pessimistic worst-case process corners
  Inefficient as the variability increases with scaling
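Example
Worst-case execution time (WCET) of X + Y, where X and Y are independent normal random variables with means $\mu_X$, $\mu_Y$ and standard deviations $\sigma_X$, $\sigma_Y$
Deterministic analysis:
  $WCET_X = \mu_X + 3\sigma_X$, $WCET_Y = \mu_Y + 3\sigma_Y$
  $WCET_{X+Y} = WCET_X + WCET_Y = \mu_X + \mu_Y + 3(\sigma_X + \sigma_Y)$
Statistical analysis:
  $\mu_{X+Y} = \mu_X + \mu_Y$, $\sigma_{X+Y} = \sqrt{\sigma_X^2 + \sigma_Y^2}$
  $WCET_{X+Y} = \mu_{X+Y} + 3\sigma_{X+Y} = \mu_X + \mu_Y + 3\sqrt{\sigma_X^2 + \sigma_Y^2}$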
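The gap between the two analyses in this example can be checked numerically. The following is a minimal sketch, assuming independent, normally distributed stage delays; the function names and the sample numbers are illustrative and do not come from the slides.

```python
import math

def wcet_deterministic(stages):
    """Sum of per-stage worst cases: each stage contributes mu + 3*sigma."""
    return sum(mu + 3.0 * sigma for mu, sigma in stages)

def wcet_statistical(stages):
    """mu + 3*sigma of the total delay, assuming independent normal stages."""
    mean = sum(mu for mu, _ in stages)
    sigma = math.sqrt(sum(s * s for _, s in stages))
    return mean + 3.0 * sigma

# Hypothetical (mean, sigma) values for X and Y.
stages = [(10.0, 3.0), (20.0, 4.0)]
det = wcet_deterministic(stages)   # 19 + 32
stat = wcet_statistical(stages)    # 30 + 3 * sqrt(9 + 16)
print(det, stat)
```

With these numbers the deterministic bound is 51 and the statistical bound is 45, so the worst-case corner sum overestimates the 3-sigma point of the combined delay.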
Statistical Static Timing Analysis (SSTA) for Multiple Clock
Domains
Introduction
Increased process variation in deep-submicron (DSM) technologies demands SSTA
Many SSTA algorithms have been proposed, but they all focus on simple timing graphs and the traversal algorithm
In industrial designs, there are multi-cycle paths, multiple clock domains, and false paths
To meet the demands of industrial-strength designs, SSTA must be extended to handle complex timing graphs
WID Variation Modeling
Two components of process variation
  Systematic variation: Lg, Wg
  Random variation: Na, tox
40%-60% of the total variation is systematic [Nassif, ISQED00]
Therefore, correlation must be considered
Utilize a grid-based correlation model [Chang & Sapatnekar, ICCAD03]
Correlation is a function of distance; all cells within a grid have correlation = 1
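A grid-based correlation lookup can be sketched as follows. The slides state only that correlation depends on distance and equals 1 inside a grid; the exponential decay, the `corr_length` parameter, and the function names are assumptions made for illustration.

```python
import math

def grid_of(x, y, grid_size):
    """Map a cell location to the index of the grid square containing it."""
    return (int(x // grid_size), int(y // grid_size))

def grid_correlation(cell_a, cell_b, grid_size, corr_length):
    """Correlation between two cells under a grid model.

    Cells in the same grid are fully correlated. Across grids the
    correlation decays with grid-to-grid distance (the exponential form
    is an assumed placeholder, not taken from the slides).
    """
    ga = grid_of(cell_a[0], cell_a[1], grid_size)
    gb = grid_of(cell_b[0], cell_b[1], grid_size)
    if ga == gb:
        return 1.0
    dist = math.hypot(ga[0] - gb[0], ga[1] - gb[1]) * grid_size
    return math.exp(-dist / corr_length)

same = grid_correlation((1.0, 1.0), (9.0, 9.0), grid_size=10.0, corr_length=50.0)
near = grid_correlation((1.0, 1.0), (15.0, 1.0), grid_size=10.0, corr_length=50.0)
far = grid_correlation((1.0, 1.0), (95.0, 1.0), grid_size=10.0, corr_length=50.0)
print(same, near, far)
```

Working at grid granularity keeps the correlation matrix small: its size depends on the number of grids, not the number of cells.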
MCSSTA Algorithm Overview
MCSSTA extends SSTA to handle
  Multi-cycle paths
  Multiple clock domains
  False paths
Adder A: 1 cycle
Multiplier B: 2 cycles
Multiplier C has both single-cycle and multi-cycle paths through it
Extending the Max Equations to Multi-Clock Domains
Clock Domain Decomposition
Find: pdfFF5 = max(P1,P2) considering the timing constraints and
correlation
Step 1: Normalization
where $\mu$ = mean, $\sigma$ = standard deviation, and n = cycle constraint
Step 2: Correlation Correction
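The two steps can be illustrated with a small numeric sketch. Two things here are assumptions: normalization is taken to mean dividing the mean and standard deviation of an arrival time by its cycle constraint n (consistent with the later remark about dividing principal components by the cycle constraint), and the max of two correlated normal variables is approximated with Clark's moment-matching formulas, which the slides do not name. The correlation coefficient is unchanged by this positive scaling, while the covariance scales by 1/(n1*n2).

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normalize(mu, sigma, n):
    """Scale an arrival time by its cycle constraint n (assumed form)."""
    return mu / n, sigma / n

def clark_max(mu1, s1, mu2, s2, rho):
    """Mean and sigma of max(X1, X2) for jointly normal X1, X2 (Clark)."""
    theta = math.sqrt(max(s1 * s1 + s2 * s2 - 2.0 * rho * s1 * s2, 0.0))
    if theta == 0.0:
        # The two variables differ only by a constant offset.
        return max(mu1, mu2), s1
    alpha = (mu1 - mu2) / theta
    m1 = mu1 * Phi(alpha) + mu2 * Phi(-alpha) + theta * phi(alpha)
    m2 = ((mu1 * mu1 + s1 * s1) * Phi(alpha)
          + (mu2 * mu2 + s2 * s2) * Phi(-alpha)
          + (mu1 + mu2) * theta * phi(alpha))
    return m1, math.sqrt(max(m2 - m1 * m1, 0.0))

# Hypothetical paths: P1 has a 1-cycle constraint, P2 a 2-cycle constraint.
p1 = normalize(4.5, 0.30, 1)
p2 = normalize(8.0, 0.80, 2)
mean_max, sigma_max = clark_max(p1[0], p1[1], p2[0], p2[1], rho=0.5)
print(mean_max, sigma_max)
```

After normalization both paths are expressed in per-cycle units, so a single max operation can compare a single-cycle path against a multi-cycle one.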
MCSSTA Timing Graph Setup
Each node/edge contains a list of the cycle constraints that go through the node/edge
False paths and other complicated timing constraints can be accounted for by removing the cycle constraint from a node/edge
Circuit and timing graph for multiplexer C
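One way to picture this setup is a graph whose edges carry sets of cycle constraints. The class below is a hypothetical sketch, not the authors' data structure; in particular, treating the constraints valid for a path as the intersection of its edge sets is an assumption.

```python
class TimingGraph:
    """Timing graph whose edges record the cycle constraints passing through."""

    def __init__(self):
        self.constraints = {}  # edge (u, v) -> set of cycle constraints

    def add_edge(self, u, v, cycles):
        self.constraints.setdefault((u, v), set()).update(cycles)

    def remove_constraint(self, u, v, n):
        """Drop constraint n from an edge, e.g. to model a false path."""
        self.constraints[(u, v)].discard(n)

    def path_constraints(self, path):
        """Cycle constraints that appear on every edge along the path."""
        edges = list(zip(path, path[1:]))
        common = set(self.constraints[edges[0]])
        for edge in edges[1:]:
            common &= self.constraints[edge]
        return common

# Hypothetical block C with both single-cycle and two-cycle paths.
g = TimingGraph()
g.add_edge("in", "C", {1, 2})
g.add_edge("C", "out", {1, 2})
before = g.path_constraints(["in", "C", "out"])
g.remove_constraint("in", "C", 1)   # the 1-cycle path through this edge is false
after = g.path_constraints(["in", "C", "out"])
print(before, after)
```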
Principal Components Timing Traversal
PCA transforms a set of correlated random variables into a set of independent random variables
Significantly simplifies traversal of the timing graph since correlation does not have to be tracked
Two properties of PCA:
1.
2.
The normalization and correlation correction operations can be
performed simultaneously by dividing the principal components by
the cycle constraint
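The effect of dividing by the cycle constraint can be sketched with delays written as a mean plus a weighted sum of independent unit-variance principal components. This canonical form and the class below are assumptions for illustration; the slides do not show the exact representation.

```python
import math

class Canonical:
    """Delay = mean + sum_i coeffs[i] * p_i, with p_i independent N(0, 1)."""

    def __init__(self, mean, coeffs):
        self.mean = mean
        self.coeffs = list(coeffs)

    def sigma(self):
        # Independent components: variances of the terms simply add.
        return math.sqrt(sum(a * a for a in self.coeffs))

    def add(self, other):
        """Sum of two delays, e.g. arrival time plus gate delay."""
        return Canonical(self.mean + other.mean,
                         [a + b for a, b in zip(self.coeffs, other.coeffs)])

    def scaled(self, n):
        """Divide mean and every component coefficient by cycle constraint n."""
        return Canonical(self.mean / n, [a / n for a in self.coeffs])

def covariance(x, y):
    """Covariance follows directly from the shared component coefficients."""
    return sum(a * b for a, b in zip(x.coeffs, y.coeffs))

x = Canonical(10.0, [3.0, 4.0])   # hypothetical two-cycle path delay
y = Canonical(6.0, [1.0, 0.0])    # hypothetical one-cycle path delay
xn = x.scaled(2)
print(xn.mean, xn.sigma(), covariance(xn, y))
```

Because the covariance is computed from the coefficients, scaling the coefficients adjusts the spread and the covariance with other delays in one operation.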
Modified PCA Traversal
Normalization and Correlation Correction are performed during the
timing graph traversal
Experimental Results
ISCAS benchmarks, slowest 70% of paths set to 2 cycles
0.207% error in mean
2.0% error in standard deviation
Summary
SSTA is closer to maturity
It has been extended to consider complex timing constraints for normal distributions
In the future, we plan to extend the method to handle non-normal distributions
Telescopic Logic for Microprocessor Performance
Traditional View of Circuit Optimization
Quality metrics for circuit optimization
  Cycle time: $t_{cycle} > AT_{longest\ path}$
  Power consumption: $P_{overall} = P_{dynamic} + P_{static}$
Circuit optimization is static
  Static timing analysis
  Longest path receives most optimization effort
    Decomposition
    Re-synthesis
    Sizing up/down
    Dual threshold voltage
Power optimization creates a critical path wall
Critical path wall makes timing optimization more difficult
Recent Innovation: RAZOR Logic
Tolerate timing errors
Data correct: one cycle
Data error: n cycles to recover (n > 1)
$Perf = F_{max} \cdot (p + (1-p)/n)$, where p is the probability that data is latched correctly
RAZOR logic
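The performance expression for RAZOR can be evaluated directly. This is a minimal sketch of that formula only; the function name and the sample values are hypothetical.

```python
def razor_performance(f_max, p, n):
    """Perf = Fmax * (p + (1 - p) / n).

    f_max: clock frequency, p: probability that data is latched correctly,
    n: cycles needed to recover from a timing error (n > 1).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if n <= 1:
        raise ValueError("recovery takes more than one cycle")
    return f_max * (p + (1.0 - p) / n)

# Hypothetical operating points: frequency in GHz, 5-cycle recovery.
no_errors = razor_performance(2.0, 1.0, 5)
some_errors = razor_performance(2.0, 0.9, 5)
print(no_errors, some_errors)
```

Overclocking raises Fmax but lowers p, so the product decides whether speculation pays off.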
A Promising Alternative: Telescopic Logic
One-cycle class: set of input vectors that make the circuit stable before $t_{cycle}$.
Two-cycle class: set of input vectors that make the circuit stable after $t_{cycle}$.
$f_h$ asserted when the input vector belongs to the two-cycle class.
Throughput:
Telescopic unit
Concept of Dynamic Circuit Optimization
Classification of primary outputs (POs)
  Critical (C) / non-critical (NC)
  High-activity (HA) / low-activity (LA)
  Four possible combinations: C+HA, NC+HA, C+LA, NC+LA
Question: should the optimization be constrained by paths that are rarely exercised?
Dynamic optimization:
  Timing speculation: 1. allow a few POs to be slower than $t_{cycle}$; 2. perform data recovery when an error is latched.
  Instead of spending equal optimization effort on C+HA and C+LA, dynamic optimization biases optimization effort towards C+HA
Dynamic Optimization with Telescopic Logic
ROBDD: a Reduced Ordered Binary Decision Diagram is used to encode the functionality of the circuit.
TCF: a Timed Characteristic Function (BDD + timing), which encodes the relationship between time and function, is built for the circuit.
1. Represent function with ROBDD
2. Using TCF to derive sensitization probability
Dynamic Optimization with Telescopic Logic (Cont.)
PROB: given the signal probability of each PI and the TCF of a circuit, the probability of sensitizing the POs can be derived.
LowVT: accelerate certain nodes in the circuit by assigning them lowVT.
MINCUT: use the maxflow-mincut algorithm to find candidates for lowVT assignment.
3. Maxflow-mincut chooses candidates to assign lowVT
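The min-cut step can be illustrated on a toy graph. The slides do not say how the circuit is mapped to a flow network, so the graph, node names, and capacities below are hypothetical; the sketch only shows a standard Edmonds-Karp max-flow followed by extraction of the cut edges.

```python
from collections import deque

def min_cut(capacity, source, sink):
    """Return (max flow value, sorted list of min-cut edges)."""
    # Residual capacities, with zero-capacity reverse edges added.
    residual = {}
    for (u, v), c in capacity.items():
        residual.setdefault(u, {})
        residual.setdefault(v, {})
        residual[u][v] = residual[u].get(v, 0) + c
        residual[v].setdefault(u, 0)

    def reachable_parents():
        """Breadth-first search over edges with remaining capacity."""
        parents = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parents:
                    parents[v] = u
                    queue.append(v)
        return parents

    flow = 0
    parents = reachable_parents()
    while sink in parents:
        # Walk back from the sink to collect the augmenting path.
        path, v = [], sink
        while parents[v] is not None:
            path.append((parents[v], v))
            v = parents[v]
        bottleneck = min(residual[u][w] for u, w in path)
        for u, w in path:
            residual[u][w] -= bottleneck
            residual[w][u] += bottleneck
        flow += bottleneck
        parents = reachable_parents()

    # Cut edges go from the source side of the final search to the other side.
    side = set(parents)
    cut = sorted(e for e in capacity if e[0] in side and e[1] not in side)
    return flow, cut

# Hypothetical graph; capacity stands in for the cost of speeding up an edge.
capacity = {("src", "g1"): 3, ("src", "g2"): 2,
            ("g1", "g3"): 1, ("g2", "g3"): 1, ("g3", "out"): 5}
flow, cut = min_cut(capacity, "src", "out")
print(flow, cut)
```

In this toy case the cheapest set of edges separating source from sink is the pair feeding g3, which would be the candidates for acceleration under the assumed cost model.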
Use TCF+BDD to evaluate functional bias
Given different input probabilities
  Case 1: each PI has static prob. = 0.5
  Case 2: each PI has static prob. = 0.2
When overclocked by the same amount, the probability of getting correct outputs is higher in case 2 than in case 1.
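The dependence on input statistics can be reproduced with a toy model. Everything in this sketch is an assumption: a 4-bit ripple-carry adder stands in for the circuit, an input vector is counted as fast when its longest carry-propagate run does not exceed a threshold, "static prob." is read as the probability that an input is 1, and brute-force enumeration replaces the BDD/TCF machinery.

```python
from itertools import product

def longest_propagate_run(a_bits, b_bits):
    """Longest run of bit positions where a carry would propagate (a != b)."""
    best = run = 0
    for a, b in zip(a_bits, b_bits):
        run = run + 1 if a != b else 0
        best = max(best, run)
    return best

def prob_fast(n_bits, p_one, max_fast_run):
    """Probability that a random input vector settles within the clock period.

    Each primary input is 1 with probability p_one, independently.
    """
    total = 0.0
    for bits in product((0, 1), repeat=2 * n_bits):
        weight = 1.0
        for bit in bits:
            weight *= p_one if bit else 1.0 - p_one
        a_bits, b_bits = bits[:n_bits], bits[n_bits:]
        if longest_propagate_run(a_bits, b_bits) <= max_fast_run:
            total += weight
    return total

case1 = prob_fast(4, 0.5, max_fast_run=2)
case2 = prob_fast(4, 0.2, max_fast_run=2)
print(case1, case2)
```

In this model, skewed inputs make long carry chains rarer, so the same overclocked period is met more often in case 2 than in case 1.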
Dynamic Optimization Effect
Blue is the Synopsys dualVT optimized result.
Red is the dynamically optimized dualVT result with the same number of lowVT cells.
Though the longest path for red is longer than for blue, red has a higher probability of producing a correct output than blue.
Summary
Telescopic logic can be a promising approach for dynamic circuit optimization to improve performance.
Techniques such as BDD, TCF, maxflow-mincut, and lowVT assignment can be used to achieve dynamic optimization.
Compared to a circuit optimized in the traditional way, dynamic optimization increases the overall throughput.
Clock Tree Design under Process Variation
Zero-Skew Clock Tree Synthesis
Clock skews are differences in clock arrival times and hurt circuit frequency
There are existing clock tree synthesis algorithms for zero skew!
  Tsay, Exact zero skew, 1991
  Chao et al., Zero Skew Clock Routing With Minimum Wirelength, 1992
  others
However, exact zero skew cannot be achieved in the presence of process variation
Bounded Skew Clock Tree Synthesis
Delays from the clock source to all of the clock sinks are within a certain bound
The skew bound is defined as the maximum difference in clock arrival times
There are existing clock tree synthesis algorithms
  Cong et al., Bounded-Skew Clock and Steiner Routing, 1998
  others
Some tradeoffs can be obtained among wirelength, power, and skew bounds
However, these works still deal with deterministic delays
Buffered Clock Tree Synthesis
In most buffered clock tree synthesis algorithms, buffers are inserted after clock tree routing by selecting potential buffer positions in the tree
Simultaneous clock tree routing and buffer insertion is done in [3] and [4]
[4] is designed to construct a balanced buffered clock tree, which is better suited to future improvement, e.g. link insertion
[3] Chen and Wong, An Algorithm for Zero-Skew Clock Tree Routing with Buffer Insertion, 1996
[4] Rajaram and Pan, Variation Tolerant Buffered Clock Network Synthesis with Cross Links, 2006
Motivation for This Work
There are tradeoffs among clock tree topology, number of buffers, delay, and skew bounds
Can we capture these properties and explore the design space during the synthesis of the clock tree?
Furthermore, can we utilize the information to make future improvements?
Also, the delays are actually probabilistic distributions instead of deterministic values
Can we utilize the delay distributions to make the clock trees more robust?
Algorithm Overview
Construct buffered clock trees in a bottom-up fashion, passing delay and skew bound information along the way
At the top (root) level, several clock trees with different delay and skew bound properties are available for the user to choose from
Tradeoffs among delay, skew bounds, buffer distribution, and tree topology can be seen at this stage
After the target tree is chosen, the final topology is built in a top-down way
Merging Region [1/2]
Given two subtrees A and B, we want to connect them to a merging point M1 and obtain the delay and skew bounds from M1 to any of the clock sinks in A and B
Several other points Mi can be chosen such that the delay and skew bounds from Mi to the sink nodes are the same as those of M1
These points form a merging region
[Figure: subtrees A and B with merging points M1 and Mi]
Merging Region [2/2]
Several different merging regions for subtrees A and B can be constructed by defining different delay and skew bounds
The delay and skew bounds associated with the merging region can be passed to the upper level when merging the new subtree M with another subtree
Buffer Insertion
After the merging region is constructed, we make it a potential buffer position
A buffer library is given, defining the characteristics (e.g. intrinsic delay, capacitance, resistance) of different types of buffers
Try inserting different types of buffers (or no buffer)
Pass the solutions to the upper level
Illustration: Merging
[Figure: subtrees A and B merged at node M, annotated "No delay at the sink nodes" and "Wire delays"]
Two distributions describe the delay at node M
Illustration: Skew Bounds
Four quantities are defined for a merged node: the upper skew bound, the lower skew bound, the overall skew bound, and the maximum delay
For each selected merging region, we can obtain a pair (b, d) to characterize it
Buffer Insertion
Once a merging region is determined, we try to insert buffers at that position
New delays are calculated from the intrinsic delay and the resistance of buffer type i
New (b, d) pairs are also calculated accordingly
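A buffer-insertion trial can be sketched with a simple linear buffer model. The delay formula (intrinsic delay plus resistance times load capacitance), the `Buffer` fields, and the library values are assumptions; the slides work with delay distributions, whereas this sketch uses single deterministic values to keep the arithmetic visible.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Buffer:
    name: str
    intrinsic_delay: float  # ps
    resistance: float       # kOhm
    input_cap: float        # fF

def buffered_delay(downstream_delay: float, load_cap: float,
                   buf: Optional[Buffer]) -> Tuple[float, float]:
    """Return (delay seen at the merging point, capacitance seen upstream).

    Assumed linear model: a buffer adds intrinsic_delay + resistance * load_cap
    and hides the downstream load behind its own input capacitance.
    """
    if buf is None:
        return downstream_delay, load_cap
    delay = buf.intrinsic_delay + buf.resistance * load_cap + downstream_delay
    return delay, buf.input_cap

# Hypothetical library; None represents the "no buffer" option.
library = [None, Buffer("x1", 20.0, 0.5, 5.0), Buffer("x4", 25.0, 0.2, 12.0)]
options = [buffered_delay(100.0, 50.0, buf) for buf in library]
print(options)
```

Each option would then be turned into its own (b, d) pair and passed upward, so the choice of buffer type is deferred until the top level.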
Pruning
Pruning can be performed to eliminate redundant solutions
The blue points are redundant because they have larger (b, d) pairs than at least one other point
The red points are kept in the solution space
[Figure: solution points plotted with axes b and d]
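Pruning of this kind amounts to keeping the non-dominated (b, d) pairs. The sketch below is one straightforward way to do that, assuming smaller is better for both values; the function name and the sample points are hypothetical.

```python
def prune(solutions):
    """Keep only (b, d) pairs that no other pair beats in both coordinates."""
    kept = []
    best_d = float("inf")
    # Sorting by b first means any later point can only win by having smaller d.
    for b, d in sorted(set(solutions)):
        if d < best_d:
            kept.append((b, d))
            best_d = d
    return kept

candidates = [(1, 9), (2, 7), (3, 8), (4, 4), (5, 6), (2, 10)]
front = prune(candidates)
print(front)
```

Here (3, 8), (5, 6), and (2, 10) are dropped because another candidate is at least as good in both b and d.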
Next Iteration
[Figure: tree with nodes A, B, M, C, D, N, and R]
Top-Level Decision
When we get to the topmost level, several (b, d) pairs representing different clock tree designs are available
Choose the design that is most desirable
Trace back in a top-down way to build the clock tree
[Figure: candidate designs plotted with axes b and d]
Summary
Our algorithm explores the design space of buffered clock trees, capturing their different delays, skew bounds, buffer distributions, and tree topologies
The synthesized clock trees can be further improved according to their properties
  For example, a clock tree with loose skew bounds can be improved using link insertion or other techniques
More accurate delay models, such as SSTA with spatial correlation or non-normally distributed delays, can be adopted
Conclusions
Process variation is shifting chip design into the statistical domain
Various analysis/optimization approaches can be taken
  Process variation modeling
  Statistical timing analysis
  Statistical gate-level optimization
  Statistical physical design
  Variation-aware architecture
Performance/power efficiency can be effectively improved
Some remaining challenges
  Further validation of variation modeling
    How systematic variation can be correlated well to the actual measured data
  Novel joint CAD/architecture work
  Consider temperature (T) and voltage (V) variation together with process variation
Thank you!