Upload
aquarius
View
40
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Large Scale Circuit Placement: Gap and Promise. Jason Cong UCLA VLSI CAD LAB 1 Joint work with Chin-Chih Chang, Tim Kong, Michail Romesis, Joseph R. Shinnerl, Min Xie and Xin Yuan. Outline. Introduction Gap Analysis of Existing Placement Algorithms Scalable Paradigm – Multilevel Placement. - PowerPoint PPT Presentation
Citation preview
Large Scale Circuit Placement: Gap and Promise
Jason CongUCLA VLSI CAD LAB1
Joint work with Chin-Chih Chang, Tim Kong, Michail Romesis, Joseph R. Shinnerl, Min Xie and Xin Yuan
Outline
Introduction Gap Analysis of Existing Placement
Algorithms Scalable Paradigm – Multilevel
Placement
Why Still Placement Problem
True, it has been studied over 30 years, but … We need good solutions more then ever
One of most important steps in IC implementation flow Directly defines interconnects
Difficult Problem size grows 2X every 18-24 months
Moore’s Law Cannot place hierarchically without quality degradation
Example of Logic Hierarchy in Final Layout
By courtesy of IBM (Tony Drumm)
Why Still Placement
True, it has been studied over 30 years, but … We need good solutions more then ever
One of most important steps in IC implementation flow Directly defines interconnects
Difficult Problem size grows 2X every 18-24 months
Moore’s Law Cannot place hierarchically without quality degradation
We are not very good at it …
Outline
Introduction Gap Analysis of Existing Placement
Algorithms Scalable Paradigm – Multilevel
Placement
Motivation
Lack of significant progress in wirelength reduction Rate of reduction is about 5-10% every 2-3 years Latest developments in placement differ mainly
in runtime Most work compare only with known heuristics
Use real design based benchmarks Use synthetic benchmarks
Little understanding about the divergence from the optimal
Placement Examples with Known Optimal Wirelength [Chang et al, 2003]
All the modules are of equal size, and there is no space between rows and adjacent modules
For 22-pin nets , connect any two adjacent modules
/ 2n n n
For each nn-pin net , connect the nn modules in a rectangular region close to a square, i.e., the length of each side is close to sqrt(n)
The wirelength is of each nn-pin net is given by
Given a (real) netlist N Construct netlist N’ with known opt. WL and match the net distribution of N
Placement Examples with Known Upperbounds [Cong et al, 2003]
Limitations of PEKO All the nets are local
Wirelength contribution by global connections in real designs can be significant
Extend PEKO by introducing non-local nets to mimic global connections Method 1: Generate a subset of i-i-pin
nets by randomly connecting ii modules on the chip
Method 2: Generate a subset of ii-pin nets according to wirelength distribution vector (WDV)
Illustration:PEKU Example Construction
Input : t = 64, D = {d2=35,d3=21,d4=7,d5=4,d6=2, d7=1} =0.2
Total WL = 184
Generate 28 2-pin optimally
Generate 16 3-pin optimally
Generate 6 4-pin optimally
Generate 1 4-pin randomly
Generate 4 5-pin optimally
Generate 2 6-pin optimally
Generate 1 7-pin optimally
W = {w1…w3=0, w4=3, w5=3, w6= 0,w7 =2,w8 =2,w9=1, w10=0, w11=1, w12=1}
Generate 7 2-pin randomly
Generate 5 3-pin randomly
Studied Five State-of-the-Art Placers Capo [Caldwell et al, 2000]
Based on multilevel partitioner Aims to enhance the routability
Dragon [Wang et al, 2000] Uses hMetis for initial partition SA with bin-based swapping
mPL [Chan et al, 2000] Nonlinear programming on the coarsest level Discrete relaxation at finer levels
mPG [Chang et al, 2002] Uses FC clustering and hierarchical density control Incremental A-tree for routability
Qplace [Cadence Inc.] Leading edge industrial placer Component of Silicon Ensemble
Experimental Results on PEKO
0
10000
20000
30000
40000
50000
0 50000 100000 150000 200000 250000
#cells
Ru
nti
me(
s)
Capo 8.6 Dragon 2.20 mPG 1.0 mPL 3.0 Qplace 5.1
1.20
1.40
1.60
1.80
2.00
2.20
2.40
2.60
0 50000 100000 150000 200000 250000
#cells
Qu
alit
y R
atio
Capo 8.6 Dragon 2.20 mPG 1.0 mPL 3.0 Qplace 5.1
Existing Algorithms can be 59% to 140% away from the optimal on PEKO
On Examples with pads mPG and Qplace show improvement of 12% and 10% repectively Dragon, mPL, and Capo do not benefit much from the additional information
There is significant room for improvement in placement algorithms
Experimental Results on PEKO
Capo, QPlace and mPL scales well in runtime Average solution quality of each tool shows deterioration by an
additional 9% to 17% when the problem size increases by a factor of 10
0
20000
40000
60000
80000
100000
10000 100000 1000000 10000000#cells
Ru
nti
me(
s)
Capo 8.6 Dragon 2.20 mPG 1.0 mPL 3.0 Qplace 5.1
1.201.401.601.802.002.202.402.602.80
10000 100000 1000000 10000000#cells
Qu
alit
y R
atio
Capo 8.6 Dragon 2.20 mPG 1.0 mPL 3.0 QPLace 5.1
Experimental Results on PEKU
The effectiveness of existing placers can vary significantly for circuits of similar size but different characteristics
Comparing QRs helps to identify the technique that works best under each scenario
1.10
1.30
1.50
1.70
1.90
2.10
2.30
0.00 0.00 0.50 0.75 1.00 2.00 5.00 10.00% of non-local nets
Qua
lity
Rat
io
Dragon 2.20 Capo 8.6 mPG 1.0 mPL 3.0 Qplace 5.1
QR (Placed Wirelength vs Upperbound) may not be tight
High Interest in the Community
Timing-driven Placement Examples with Known Optimal (TPEKO)
Obtain a placement for the circuit from any available tool
Perform timing analysis on the circuit
Create an artificial combinational path with equal or larger delay than the longest path
Guarantee the cells in the path are adjacent to each other
Make necessary modifications
Original longest path
Artificial path
Evaluating Timing-Driven Placement Algorithms Using TPEKO
Evaluating two state-of-the-art FPGA placement algorithms VPR [Marquardt et al.
2000] PATH [Kong 2002]
Can be far away from the optimal for difficult examples 35% on average 54% in the worst case
1.00
1.05
1.10
1.15
1.20
1.25
1.30
1.35
1.40
1 2 3 4 5#longest path
Qua
lity
Rat
io
VPR PATH
Observations from Gap Analysis
Significant opportunity in placement Existing algorithms may produce solutions far away
from the optimal The quality result of the same placer varies for
circuits of similar size but different characteristic Scalability problem in runtime and solution quality
Significant ROI Benefit equal to one to two generations of process
scaling But without requiring multi-billion dollar investment
(hopefully!)
Outline
IntroductionIntroduction Gap Analysis of Existing Placement AlgorithmsGap Analysis of Existing Placement Algorithms Scalable ParadigmScalable Paradigm Timing OptimizationTiming Optimization Routability OptimizationRoutability Optimization ConcludingConcluding RemarksRemarks ApplicationApplication
Multi-Million Gate FPGA PlacementMulti-Million Gate FPGA Placement
Paradigm 2: Multilevel Placement
Coarsening: build the hierarchy by recursive aggregation (generalized clustering)
Relaxation: improve the placement at each level by localized optimization
Interpolation: transfer coarse-level solution to adjacent, finer level (generalized declustering)
Multilevel Flow: multiple traversals over multiple hierarchies (V-cycle variations)
Multi-Level Optimization Framework
Interpolation &Relaxation (optimization)
Coarsening(Clustering)
Pro
ble
m s
ize
dec
reas
es
• Multilevel coarsening generates smaller problem sizes at coarser levels faster optimization at coarser levels
• May explore different aspects of the solution space at different levels• Gradual refinement on good solutions from coarser levels is very efficient• Successful in many applications
•Originally developed for PDEs•Recent success in VLSI CAD: partitioning, placement, routing
Given problem
Multilevel Coarse Placement
Coarsening by clustering
Refinement by placement
Initial Placement
A bin grid structure at each level
Hierarchical area density control
Optimization by SA, QP, RDFL, etc.
Multilevel Methods: Coarsening by Recursive Aggregation
Recursive aggregation defines the hierarchy. Different aggregation algorithms can be used on different levels
and/or in different V-cycles. Clustering methods
First-Choice Clustering (hMetis [Karypis 1999]). AMG based aggregation
An aggregate need not be a cluster. A cell can be fractionally associated to more than one aggregate
Merge each vertex with its “best” neighborMerged Nets
Multilevel Methods: Relaxation(Intralevel Optimization)
Iterative improvement at each level by fast, localized computation Discrete permutation enumerations; swapping Unconstrained quadratic wirelength minimization on
subsets Network-flow based improvement on subsets (RDFL)
Local relaxation is sufficient. Global improvement comes from the multilevel hierarchy.
Relaxations at finer levels may be quite different, e.g., more discrete, than relaxations at coarser levels.
Relaxation on Local Subsets
Original Subnetlist with Subproblem
Move the red cells to their optimal positions, holding all other cells fixed and (perhaps) ignoring overlap
Unrelated Cell
Fixed Neighbor
Movable Cell
Example: Goto-based Discrete Relaxation
Each cell’s optimal location is readily calculated when all other cells are held fixed.
Compute a chain A, B, C, D, E, whereB is a randomly selected neighbor of A’s optimal location, etc.
Examine all permutations of the chain and take the best one.
Problem: the chain is not closed (A is not necessarily near any other cell’s optimal location).
Example: Quadratic Relaxation on Noncontiguous Subsets (QRS)
Select a subset M of cells to move Identify other cells and pads, F, connected to
M by nets in
Decouple the horizontal and vertical problems. M is obtained as segments of length k along a
DFS vertex traversal of the netlist
}.|{ MeEeEM
Solving the QRS subproblem
Problem formulation (horizontal case):
Iteratively solve the weighted quadratic minimization problem, using the current solution to determine the weight (as in Gordian-L)
May result in cell overlap!
number. small is , )(||
1 where
|)(|
))((min
)()(
2
eve
Ee evke
ke
vxe
x
xvx
xvx
M
Ripple-move legalization [Hur and Lillis, 2000]Because many forms of subset relaxation ignore overlap, post-relaxation cell swaps may be needed to remove overlap.
Define a DAG on neighboring bins. Edge cost reflects the best wirelength gain over all cell swaps between two bins.Calculate a max-gain monotone path on the bin-grid graph
Multilevel Methods: Interpolation(Generalized Declustering)
Goal: transfer a partial solution from a coarser level to its adjacent finer level
Simplest approach: place all components of a cluster at its center
Better approach: place each component of an aggregate at the weighted average of the aggregates to which it is strongly connected.
Optionally: impose constraints; e.g., the average location of the components can be held fixed.
AMG-style Linear Interpolation
Place the C-Pt representatives
The inherited position of a cluster component ( ) can be determined by several cluster positions, not just its own.
Place the F-pts by weighted interpolation
AMG-based Linear Interpolation [A. Brandt 1986]
interpolation
constantAMG
jijvF
jijvC
i vavavjj points points
clusterNext finer level cells
Within each cluster, select the one with
maximum degree as C-point; others are
considered as F-points
C-point
Iterated Multilevel Flow
Make use of placement solution from 1st V-cycle
First Choice (FC)clustering
Geometric basedFC clustering
Iterated Multilevel Flow
Iterated V-Cycles F-Cycle
Backtracking V-Cycle
Sample Impact of the Multilevel Components to mPL’s overall quality
First-Choice Clustering: 3—4% reduced WL QRS Relaxation: 5—6% reduced WL AMG Interpolation: 2—3% reduced WL Iterated V-cycles: 2—8% reduced WL
mPL 3.0 vs. mPL1.0 and Gordian-LUniform-Cell IBM/ISPD 98 Circuits
mPL1.0+Dom. Gordian-L+Dom.
Circuit Wirelength CPU time Wirelength CPU time
Ibm04 1.18 0.31 1.05 1.90
Ibm07 1.14 0.34 1.05 3.77
Ibm09 1.14 0.33 1.04 4.90
Ibm10 1.11 0.31 0.99 6.54
Ibm14 1.11 0.41 1.04 8.28
Ibm16 1.16 0.46 1.00 11.76
Ibm17 1.07 0.44 0.98 10.41
Ibm18 1.18 0.42 1.03 13.43
Average 1.14 0.38 1.02 7.62
(12% better than mPL1.0 with 2x longer runtime; 2% better than Gordian-L and 7x faster)
mPL 3.0 vs. Capo 8.5 and DragonUniform-Cell IBM/ISPD 98 Circuits
Capo 8.5 / mPL3.0 Dragon / mPL3.0
Circuit Wirelength CPU time Wirelength CPU time
Ibm04 1.12 0.53 0.97 3.03
Ibm07 1.12 0.60 0.95 3.33
Ibm09 1.12 0.67 1.01 5.40
Ibm10 1.10 0.55 0.99 4.70
Ibm14 1.08 0.57 0.95 3.02
Ibm16 1.06 0.54 0.90 6.83
Ibm17 1.10 0.43 0.98 6.82
Ibm18 1.10 0.43 0.96 6.10
Average 1.10 0.54 0.96 4.91
(10% better than Capo with 2x longer runtime;4% worse than Dragon but 4x faster)
mPL3.0 vs. mPL1.0, Capo8.5, Dragon and Gordian-L
0.00
1.00
2.00
3.00
4.00
5.00
6.00
0.95 1.00 1.05 1.10 1.15
scaled wirelength
scal
ed r
un
tim
e mPL3.0
mPL1.0
Capo8.5
Dragon
Gordian-L
QRS2nd V-cycle
AMG
Extension: Multilevel Mixed-size Placement Level 0
Level kCoarsest level
placement
Big objects legalization
big object
small object
fixed big object
cluster
Simultaneous place big and small objectsGradually fix the locations of big objects and generate overlap-free placement for big objects during multilevel placement
Example: Final Placement of ibm02 by mPG-ms
Concluding Remarks
There is significant opportunity to improve the placement technologies
Multilevel placement is a promising scalable solution