Variation-Aware Chip Design for Reliability and Performance

Deming Chen, ECE, UIUC
Acknowledgement: work partially supported by Sun Microsystems
Students: Christine Chen, Greg Lucas, Lu Wan

Outline
Background
  Process variation
  Motivation of variation-aware chip design
SSTA with multiple clock domains
Telescopic logic for processor performance
Clock tree design with skew reduction
Conclusion

Process Variation
Increases as device and interconnect feature sizes are scaled down
Can be within-die (intra-die) and between dies (inter-die)
(Source: Intel)

Traditional Solutions
Speed/power binning: measure chips and bin them into performance categories; sell lower-performing or power-hungry chips at a lower price
Guard-band the design to achieve the desired yield
  Uses pessimistic worst-case process corners
  Inefficient as the variability increases with scaling

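Example
Worst-case execution time (WCET) of X + Y, where $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$
Deterministic analysis:
$WCET_X = \mu_X + 3\sigma_X$, $WCET_Y = \mu_Y + 3\sigma_Y$
$WCET_{X+Y} = WCET_X + WCET_Y = \mu_X + \mu_Y + 3(\sigma_X + \sigma_Y)$
Statistical analysis:
$WCET_{X+Y} = \mu_{X+Y} + 3\sigma_{X+Y} = \mu_X + \mu_Y + 3\sqrt{\sigma_X^2 + \sigma_Y^2}$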
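
The gap between the two analyses can be checked numerically. The following is a minimal sketch, assuming X and Y are independent normal delays; the function names and the sample numbers are illustrative and do not come from the slides.

```python
import math

def wcet_deterministic(mu_x, sigma_x, mu_y, sigma_y, k=3.0):
    """Corner-based bound: add the per-stage worst cases."""
    return (mu_x + k * sigma_x) + (mu_y + k * sigma_y)

def wcet_statistical(mu_x, sigma_x, mu_y, sigma_y, k=3.0):
    """Bound on the sum itself; assumes X and Y are independent normals."""
    sigma_sum = math.sqrt(sigma_x ** 2 + sigma_y ** 2)
    return mu_x + mu_y + k * sigma_sum

# Illustrative numbers (arbitrary time units).
det = wcet_deterministic(10.0, 1.0, 20.0, 2.0)
stat = wcet_statistical(10.0, 1.0, 20.0, 2.0)
print(f"deterministic: {det:.3f}, statistical: {stat:.3f}")
```

Because $\sqrt{\sigma_X^2 + \sigma_Y^2} \le \sigma_X + \sigma_Y$, the statistical bound is never larger than the deterministic one, which is the pessimism the guard-banding slide refers to.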

Statistical Static Timing Analysis (SSTA) for Multiple Clock Domains

Introduction
Increased process variation in deep-submicron (DSM) technologies demands SSTA
Many SSTA algorithms have been proposed, but they all focus on simple timing graphs and the traversal algorithm
Industrial designs contain multi-cycle paths, multiple clock domains, and false paths
To meet the demands of industry-strength designs, SSTA must be extended to handle complex timing graphs

Within-Die (WID) Variation Modeling
Two components of process variation
  Systematic variation: $L_g$, $W_g$
  Random variation: $N_a$, $t_{ox}$
40%-60% of the total variation is systematic [Nassif, ISQED00]
Therefore, correlation must be considered
Utilize a grid-based correlation model [Chang & Sapatnekar, ICCAD03]
Correlation is a function of distance; all cells within a grid have correlation = 1
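
The grid idea can be illustrated with a small sketch. It assumes square grids and an exponential decay of correlation with grid distance; the decay law, the `decay` parameter, and all function names are assumptions made for illustration, not the model from the cited paper.

```python
import math

def grid_index(x, y, grid_size):
    """Map a cell location to the grid square that contains it."""
    return (int(x // grid_size), int(y // grid_size))

def grid_correlation(g1, g2, decay=0.5):
    """Assumed law: correlation falls off exponentially with grid distance."""
    dist = math.hypot(g1[0] - g2[0], g1[1] - g2[1])
    return math.exp(-decay * dist)

def cell_correlation(c1, c2, grid_size=10.0, decay=0.5):
    """Cells in the same grid get correlation 1; others depend on distance."""
    return grid_correlation(grid_index(c1[0], c1[1], grid_size),
                            grid_index(c2[0], c2[1], grid_size), decay)

same_grid = cell_correlation((1.0, 1.0), (9.0, 9.0))
neighbor = cell_correlation((1.0, 1.0), (15.0, 1.0))
far = cell_correlation((1.0, 1.0), (35.0, 1.0))
print(same_grid, neighbor, far)
```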

MCSSTA Algorithm Overview
MCSSTA extends SSTA to handle
  Multi-cycle paths
  Multiple clock domains
  False paths
Adder A: 1 cycle; Multiplier B: 2 cycles
Multiplexer C has both single-cycle and multi-cycle paths through it

Extending the Max Equations to Multi-Clock Domains
Clock Domain Decomposition
Find: $pdf_{FF5} = \max(P_1, P_2)$, considering the timing constraints and correlation
Step 1: Normalization
where $\mu$ = mean, $\sigma$ = standard deviation, and n = cycle constraint
Step 2: Correlation Correction
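
The following sketch shows one way a max over paths with different cycle constraints could be evaluated. It rests on two assumptions that are not stated on this slide: normalization is taken to mean dividing a path's mean and standard deviation by its cycle constraint n, and the max of two correlated normals is approximated with Clark's moment-matching formulas. The path numbers and the correlation value are made up.

```python
import math

def normalize(mu, sigma, n):
    """Assumed normalization: express a path delay per clock cycle."""
    return mu / n, sigma / n

def _pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def _cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def clark_max(mu1, s1, mu2, s2, rho):
    """Mean and sigma of max(X1, X2) for jointly normal X1, X2 (Clark)."""
    a = math.sqrt(max(s1 * s1 + s2 * s2 - 2.0 * rho * s1 * s2, 0.0))
    if a < 1e-12:
        # Same spread and fully correlated: the larger mean always wins.
        return max(mu1, mu2), s1
    alpha = (mu1 - mu2) / a
    first = mu1 * _cdf(alpha) + mu2 * _cdf(-alpha) + a * _pdf(alpha)
    second = ((mu1 * mu1 + s1 * s1) * _cdf(alpha)
              + (mu2 * mu2 + s2 * s2) * _cdf(-alpha)
              + (mu1 + mu2) * a * _pdf(alpha))
    return first, math.sqrt(max(second - first * first, 0.0))

# Hypothetical paths into FF5: P1 has a 1-cycle constraint, P2 a 2-cycle one.
p1 = normalize(0.90, 0.05, 1)
p2 = normalize(1.70, 0.08, 2)
mean_ff5, sigma_ff5 = clark_max(p1[0], p1[1], p2[0], p2[1], rho=0.5)
print(mean_ff5, sigma_ff5)
```

Scaling a delay by a positive constant leaves its correlation coefficient unchanged but scales its covariance, which is why the covariance terms need adjusting once paths are normalized by different n.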

MCSSTA Timing Graph Setup
Each node/edge contains a list of the cycle constraints that go through the node/edge
Can account for false paths and other complicated timing constraints by removing the cycle constraint from a node/edge
Circuit and timing graph for multiplexer C

Principal Components Timing Traversal
PCA transforms a set of correlated random variables into a set of independent random variables
  Significantly simplifies traversal of the timing graph, since correlation does not have to be tracked
Two properties of PCA:
1.
2.
The normalization and correlation correction operations can be performed simultaneously by dividing the principal components by the cycle constraint
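
To see why dividing the principal components by the cycle constraint handles both steps at once, consider a delay written as a mean plus coefficients on independent, unit-variance principal components. The sketch below assumes that canonical form; the `PCDelay` class and the numbers are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class PCDelay:
    mean: float
    coeffs: tuple  # sensitivities to independent, unit-variance components

    def sigma(self):
        return math.sqrt(sum(c * c for c in self.coeffs))

    def __add__(self, other):
        # Summing delays along a path adds means and coefficients.
        return PCDelay(self.mean + other.mean,
                       tuple(a + b for a, b in zip(self.coeffs, other.coeffs)))

    def scaled(self, n):
        # Dividing by the cycle constraint scales mean and every coefficient.
        return PCDelay(self.mean / n, tuple(c / n for c in self.coeffs))

def covariance(d1, d2):
    """With independent components, covariance is a dot product."""
    return sum(a * b for a, b in zip(d1.coeffs, d2.coeffs))

g1 = PCDelay(1.0, (0.3, 0.4))
g2 = PCDelay(2.0, (0.0, 0.5))
path = g1 + g2
two_cycle = path.scaled(2)
print(two_cycle.mean, two_cycle.sigma(), covariance(two_cycle, g1))
```

Because the covariance is recomputed from the scaled coefficients, no separate correlation bookkeeping is needed after normalization.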

Modified PCA Traversal
Normalization and Correlation Correction are performed during the timing graph traversal

Experimental Results
ISCAS benchmarks, slowest 70% of paths set to 2 cycles
  0.207% error in mean
  2.0% error in standard deviation

Summary
SSTA is closer to maturity
It has been extended to consider complex timing constraints for normal distributions
In the future, we plan to extend the method to handle non-normal distributions

Telescopic Logic for Microprocessor Performance

Traditional View of Circuit Optimization
Quality metrics for circuit optimization
  Cycle time: $t_{cycle} > AT_{longest\ path}$
  Power consumption: $P_{overall} = P_{dynamic} + P_{static}$
Circuit optimization is static
  Static timing analysis
  Longest path receives most optimization effort
    Decomposition
    Re-synthesis
    Sizing up/down
    Dual threshold voltage
Power optimization creates a critical path wall
The critical path wall makes timing optimization more difficult

Recent Innovation: RAZOR Logic
Tolerates timing errors
  Data correct: one cycle
  Data error: n cycles to recover (n > 1)
Perf = Fmax * (p + (1 - p)/n), where p is the probability that data is latched correctly
RAZOR logic
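
The performance expression on this slide can be evaluated directly. This is a small sketch of that formula only; the function name and the sample values are illustrative.

```python
def razor_performance(f_max, p, n):
    """Perf = Fmax * (p + (1 - p) / n), as given on the slide.

    p: probability that data is latched correctly
    n: cycles needed to recover from an error (n > 1)
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if n <= 1:
        raise ValueError("recovery takes more than one cycle")
    return f_max * (p + (1.0 - p) / n)

# Illustrative values: 2 GHz clock, 99% correct latching, 10-cycle recovery.
perf = razor_performance(2.0e9, 0.99, 10)
print(f"{perf:.4g}")
```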

A Promising Alternative: Telescopic Logic
One-cycle class: set of input vectors that make the circuit stable before $t_{cycle}$
Two-cycle class: set of input vectors that make the circuit stable after $t_{cycle}$
$f_h$ is asserted when the input vector belongs to the two-cycle class
Throughput:
Telescopic unit
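
A rough way to reason about the benefit is to count expected cycles per operation. The sketch below assumes that every two-cycle-class input takes exactly two cycles, so the expected cycle count is 1 + P(f_h = 1); this model, the function names, and the numbers are assumptions for illustration.

```python
def telescopic_throughput(t_cycle, p_hold):
    """Operations per unit time, assuming 1 cycle normally and 2 when f_h = 1."""
    if t_cycle <= 0.0:
        raise ValueError("t_cycle must be positive")
    if not 0.0 <= p_hold <= 1.0:
        raise ValueError("p_hold must be a probability")
    expected_cycles = 1.0 + p_hold
    return 1.0 / (t_cycle * expected_cycles)

def fixed_throughput(t_worst):
    """Conventional unit clocked at its worst-case delay."""
    return 1.0 / t_worst

# Hypothetical unit: worst-case delay 1.6 ns, telescopic clock 1.0 ns,
# 10% of input vectors fall into the two-cycle class.
tele = telescopic_throughput(1.0, 0.10)
fixed = fixed_throughput(1.6)
print(tele, fixed)
```

Under this model the telescopic unit wins whenever t_cycle * (1 + P(f_h = 1)) is smaller than the worst-case delay.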

Concept of Dynamic Circuit Optimization
Classification of primary outputs (POs)
  Critical (C) / non-critical (NC)
  High-activity (HA) / low-activity (LA)
  Four possible combinations: C+HA, NC+HA, C+LA, NC+LA
Question: should the optimization be constrained by paths that are rarely exercised?
Dynamic optimization
  Timing speculation: 1. allow a few POs to be slower than $t_{cycle}$; 2. perform data recovery when an error is latched
  Instead of spending equal optimization effort on C+HA and C+LA, dynamic optimization biases the effort towards C+HA

Dynamic Optimization with Telescopic Logic
ROBDD: a Reduced Ordered Binary Decision Diagram is used to encode the functionality of the circuit.
TCF: a Timed Characteristic Function (BDD + timing), which encodes the relationship between time and function, is built for the circuit.
1. Represent the function with an ROBDD
2. Use the TCF to derive the sensitization probability

Dynamic Optimization with Telescopic Logic (cont.)
PROB: given the signal probability of each PI and the TCF of a circuit, the probability of sensitizing the POs can be derived.
LowVT: accelerate certain nodes in the circuit by assigning lowVT.
MINCUT: use the maxflow-mincut algorithm to find candidates to assign lowVT.
3. Maxflow-mincut chooses candidates to assign lowVT
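
As a generic illustration of the maxflow-mincut step, the sketch below runs an Edmonds-Karp max-flow on a small directed graph and reports the saturated cut edges. How the slides map gates and lowVT costs onto the graph is not stated here, so the graph, the edge weights (treated as hypothetical speed-up costs), and the node names are all assumptions.

```python
from collections import deque

def min_cut(capacity, source, sink):
    """Return (max flow value, sorted list of min-cut edges)."""
    residual = {}
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)  # reverse edge for flow cancellation

    def reachable_from_source():
        # Breadth-first search over edges with remaining capacity.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    parent = reachable_from_source()
    while sink in parent:
        # Find the bottleneck along the augmenting path, then push flow.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            residual[parent[v]][v] -= bottleneck
            residual[v][parent[v]] += bottleneck
            v = parent[v]
        flow += bottleneck
        parent = reachable_from_source()

    # Edges leaving the source-reachable side form a minimum cut.
    cut = sorted((u, v) for u, nbrs in capacity.items() for v in nbrs
                 if u in parent and v not in parent)
    return flow, cut

# Hypothetical graph: s and t are virtual source/sink, g1..g3 are gates.
graph = {"s": {"g1": 3, "g2": 2}, "g1": {"g3": 1},
         "g2": {"g3": 2}, "g3": {"t": 4}}
flow_value, cut_edges = min_cut(graph, "s", "t")
print(flow_value, cut_edges)
```

Every source-to-sink path crosses the cut, so speeding up the cut set touches all of the paths in the graph at the lowest total weight.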

Use TCF+BDD to evaluate functional bias
Given different input probabilities
  Case 1: each PI has static probability = 0.5
  Case 2: each PI has static probability = 0.2
When overclocked by the same amount, the probability of getting correct outputs is higher in case 2 than in case 1
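
The effect of input probabilities can be reproduced on a toy function. The sketch below uses exhaustive enumeration as a stand-in for the BDD/TCF computation, and assumes a hypothetical 4-input circuit whose slow path is sensitized only when all inputs are 1; neither the function nor the numbers come from the slides.

```python
from itertools import product

def prob_true(func, num_inputs, p_one):
    """P(func = 1) when each input is 1 independently with probability p_one."""
    total = 0.0
    for bits in product((0, 1), repeat=num_inputs):
        weight = 1.0
        for b in bits:
            weight *= p_one if b else (1.0 - p_one)
        if func(*bits):
            total += weight
    return total

def slow_path_sensitized(a, b, c, d):
    # Hypothetical two-cycle class: all four inputs are 1.
    return a and b and c and d

# Probability of a correct output when overclocked past the slow path.
correct_case1 = 1.0 - prob_true(slow_path_sensitized, 4, 0.5)
correct_case2 = 1.0 - prob_true(slow_path_sensitized, 4, 0.2)
print(correct_case1, correct_case2)
```

For this toy function the slow path is exercised less often at a static probability of 0.2, so case 2 gives the higher probability of a correct output; a different function could show the opposite bias.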

Dynamic Optimization Effect
Blue: dualVT result optimized by Synopsys
Red: dynamically optimized dualVT result with the same number of lowVT cells
Although the longest path for red is longer than for blue, red has a higher probability of producing the correct output

Summary
Telescopic logic is a promising approach for dynamic circuit optimization to improve performance.
Techniques such as BDD, TCF, maxflow-mincut, and lowVT assignment can be used to achieve dynamic optimization.
Compared with a circuit optimized in the traditional way, dynamic optimization increases the overall throughput.

Clock Tree Design under Process Variation

Zero-Skew Clock Tree Synthesis
Clock skews are differences in clock arrival times and hurt circuit frequency
There are existing clock tree synthesis algorithms for zero skew
  Tsay, Exact zero skew, 1991
  Chao et al., Zero Skew Clock Routing With Minimum Wirelength, 1992
  others
However, exact zero skew cannot be achieved in the presence of process variation

Bounded Skew Clock Tree Synthesis
Delays from the clock source to all of the clock sinks are within a certain bound
The skew bound is defined as the maximum difference in clock arrival times
There are existing clock tree synthesis algorithms
  Cong et al., Bounded-Skew Clock and Steiner Routing, 1998
  others
Some tradeoffs can be obtained among wirelength, power, and skew bounds
However, these works still deal with deterministic delays
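
For deterministic delays, checking a skew bound is a one-line computation. The sketch below is illustrative; the function names and the arrival times are made up.

```python
def clock_skew(arrival_times):
    """Maximum difference in clock arrival times across all sinks."""
    return max(arrival_times) - min(arrival_times)

def meets_skew_bound(arrival_times, bound):
    return clock_skew(arrival_times) <= bound

# Hypothetical sink arrival times in ns.
arrivals = [1.00, 1.05, 0.98, 1.02]
skew = clock_skew(arrivals)
print(skew, meets_skew_bound(arrivals, 0.10), meets_skew_bound(arrivals, 0.05))
```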

Buffered Clock Tree Synthesis
In most buffered clock tree synthesis algorithms, buffers are inserted after clock tree routing by selecting potential buffer positions in the tree
Simultaneous clock tree routing and buffer insertion is done in [3] and [4]
[4] is designed to construct a balanced buffered clock tree, which is more amenable to future improvement, e.g. link insertion
[3] Chen and Wong, An Algorithm for Zero-Skew Clock Tree Routing with Buffer Insertion, 1996
[4] Rajaram and Pan, Variation Tolerant Buffered Clock Network Synthesis with Cross Links, 2006

Motivation for This Work
There are tradeoffs among clock tree topology, number of buffers, delay, and skew bounds
  Can we capture these properties and explore the design space during the synthesis of the clock tree?
  Furthermore, can we use this information to make future improvements?
Also, the delays are actually probability distributions rather than deterministic values
  Can we use the delay distributions to make the clock trees more robust?

Algorithm Overview
Construct buffered clock trees in a bottom-up fashion, passing delay and skew bound information along the way
At the top (root) level, several clock trees with different delay and skew bound properties are available for the user to choose from
Tradeoffs among delay, skew bounds, buffer distribution, and tree topology can be seen at this stage
After the target tree is chosen, the final topology is built in a top-down way

Merging Region [1/2]
Given two subtrees A and B, we want to connect them to a merging point M1 and obtain the delay and skew bounds from M1 to any of the clock sinks in A and B
Several other points Mi can be chosen such that the delay and skew bounds from Mi to the sink nodes are the same as those of M1
These points form a merging region
[Figure: subtrees A and B with merging points M1 and Mi]

Merging Region [2/2]
Several different merging regions for subtrees A and B can be constructed by defining different delay and skew bounds
The delay and skew bounds associated with the merging region can be passed to the upper level when merging the new subtree M with another subtree

Buffer Insertion
After the merging region is constructed, we make it a potential buffer position
A buffer library is given, defining the characteristics (e.g. intrinsic delay, capacitance, resistance) of different types of buffers
Try to insert different types of buffers (or no buffer at all)
Pass the solutions to an upper level
[Figure: three panels, each showing merging point M with subtrees A and B]

Illustration: Merging
[Figure: subtrees A and B merged at M; labels: "No delay at the sink nodes", "Wire delays"]
Two distributions describe the delay at node M

Illustration: Skew Bounds
Define the upper skew bound
Define the lower skew bound
Define the overall skew bound
Define the maximum delay

For each selected merging region, we can obtain a pair (b, d) to characterize it
[Figure: subtrees A and B merged at M]

Buffer Insertion
Once a merging region is determined, we try to insert buffers at that position
New delays are calculated from the intrinsic delay and the resistance of buffer type i
New (b, d) pairs are also calculated accordingly
[Figure: subtrees A and B merged at M]

Pruning
Pruning can be performed to eliminate redundant solutions
The blue points are redundant because they have larger (b, d) pairs than at least one other point
The red points are kept in the solution space
[Figure: scatter plot of candidate solutions; axes: d and b]
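
The pruning rule can be sketched as a standard non-dominated filter. The code assumes that smaller b and smaller d are both better and that a point is redundant when another point is no worse in both coordinates; the function name and the sample points are illustrative.

```python
def prune(solutions):
    """Keep only non-dominated (b, d) pairs, minimizing both coordinates."""
    kept = []
    best_d = float("inf")
    # After sorting by b (then d), a point survives only if its d beats
    # every point with a smaller-or-equal b seen so far.
    for b, d in sorted(set(solutions)):
        if d < best_d:
            kept.append((b, d))
            best_d = d
    return kept

# Hypothetical candidate solutions at one merging region.
candidates = [(1, 9), (2, 7), (3, 8), (4, 4), (5, 5), (2, 10)]
front = prune(candidates)
print(front)
```

Sorting first keeps the filter at O(n log n), which matters because the candidate set grows at every merge and buffer-insertion step.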

Next Iteration
[Figure: clock tree with nodes A, B, M, C, D, N, and R]

Top-Level Decision
When we get to the topmost level, several (b, d) pairs representing different clock tree designs are available
Choose the design that is most desirable
Trace back in a top-down way to build the clock tree
[Figure: scatter plot of top-level solutions; axes: d and b]
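
What counts as "most desirable" is left to the user. As one hypothetical policy, the sketch below picks the design with the smallest delay d among those whose skew bound b fits a given budget; the policy, the function name, and the numbers are assumptions.

```python
def choose_design(candidates, max_skew_bound):
    """Return the (b, d) pair with minimum d subject to b <= max_skew_bound."""
    feasible = [(b, d) for b, d in candidates if b <= max_skew_bound]
    if not feasible:
        return None  # no design meets the skew budget
    # Break ties on delay by preferring the tighter skew bound.
    return min(feasible, key=lambda bd: (bd[1], bd[0]))

# Hypothetical top-level (b, d) pairs.
top_level = [(1, 9), (2, 7), (4, 4)]
print(choose_design(top_level, 3), choose_design(top_level, 10))
```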

Summary
Our algorithm explores the design space of buffered clock trees, capturing their different delays, skew bounds, buffer distributions, and tree topologies
The synthesized clock trees can be further improved according to their properties
  For example, a clock tree with loose skew bounds can be improved using link insertion or other techniques
More accurate delay models, such as SSTA with spatial correlation or non-normally distributed delays, can be adopted

Conclusions
Process variation is shifting chip design into the statistical domain
Various analysis/optimization approaches can be taken
  Process variation modeling
  Statistical timing analysis
  Statistical gate-level optimization
  Statistical physical design
  Variation-aware architecture
Performance/power efficiency can be effectively improved
Some remaining challenges
  Further validation of variation modeling
  How systematic variation can be correlated well to the actual measured data
  Novel joint CAD/architecture works
  Consider T and V variation together with process variation

Thank you!
