Large Scale Circuit Placement: Gap and Promise

Large Scale Circuit Placement: Gap and Promise

Jason CongUCLA VLSI CAD LAB1

Joint work with Chin-Chih Chang, Tim Kong, Michail Romesis, Joseph R. Shinnerl, Min Xie and Xin Yuan

Outline

Introduction Gap Analysis of Existing Placement

Algorithms Scalable Paradigm – Multilevel

Placement

Why Still Placement Problem

True, it has been studied over 30 years, but … We need good solutions more then ever

One of most important steps in IC implementation flow Directly defines interconnects

Difficult Problem size grows 2X every 18-24 months

Moore’s Law Cannot place hierarchically without quality degradation

Example of Logic Hierarchy in Final Layout

By courtesy of IBM (Tony Drumm)

Why Still Placement

True, it has been studied over 30 years, but … We need good solutions more then ever

One of most important steps in IC implementation flow Directly defines interconnects

Difficult Problem size grows 2X every 18-24 months

Moore’s Law Cannot place hierarchically without quality degradation

We are not very good at it …

Outline

Introduction Gap Analysis of Existing Placement

Algorithms Scalable Paradigm – Multilevel

Placement

Motivation

Lack of significant progress in wirelength reduction Rate of reduction is about 5-10% every 2-3 years Latest developments in placement differ mainly

in runtime Most work compare only with known heuristics

Use real design based benchmarks Use synthetic benchmarks

Little understanding about the divergence from the optimal

Placement Examples with Known Optimal Wirelength [Chang et al, 2003]

All the modules are of equal size, and there is no space between rows and adjacent modules

For 22-pin nets , connect any two adjacent modules

/ 2n n n

For each nn-pin net , connect the nn modules in a rectangular region close to a square, i.e., the length of each side is close to sqrt(n)

The wirelength is of each nn-pin net is given by

Given a (real) netlist N Construct netlist N’ with known opt. WL and match the net distribution of N

Placement Examples with Known Upperbounds [Cong et al, 2003]

Limitations of PEKO All the nets are local

Wirelength contribution by global connections in real designs can be significant

Extend PEKO by introducing non-local nets to mimic global connections Method 1: Generate a subset of i-i-pin

nets by randomly connecting ii modules on the chip

Method 2: Generate a subset of ii-pin nets according to wirelength distribution vector (WDV)

Illustration:PEKU Example Construction

Input : t = 64, D = {d2=35,d3=21,d4=7,d5=4,d6=2, d7=1} =0.2

Total WL = 184

Generate 28 2-pin optimally



Generate 1 4-pin randomly




W = {w1…w3=0, w4=3, w5=3, w6= 0,w7 =2,w8 =2,w9=1, w10=0, w11=1, w12=1}



Studied Five State-of-the-Art Placers Capo [Caldwell et al, 2000]

Based on multilevel partitioner Aims to enhance the routability

Dragon [Wang et al, 2000] Uses hMetis for initial partition SA with bin-based swapping

mPL [Chan et al, 2000] Nonlinear programming on the coarsest level Discrete relaxation at finer levels

mPG [Chang et al, 2002] Uses FC clustering and hierarchical density control Incremental A-tree for routability

Qplace [Cadence Inc.] Leading edge industrial placer Component of Silicon Ensemble

Experimental Results on PEKO

0

10000

20000

30000

40000

50000

0 50000 100000 150000 200000 250000

#cells

Ru

nti

me(

s)

Capo 8.6 Dragon 2.20 mPG 1.0 mPL 3.0 Qplace 5.1

1.20

1.40

1.60

1.80

2.00

2.20

2.40

2.60

0 50000 100000 150000 200000 250000

#cells

Qu

alit

y R

atio


Existing Algorithms can be 59% to 140% away from the optimal on PEKO

On Examples with pads mPG and Qplace show improvement of 12% and 10% repectively Dragon, mPL, and Capo do not benefit much from the additional information

There is significant room for improvement in placement algorithms

Experimental Results on PEKO

Capo, QPlace and mPL scales well in runtime Average solution quality of each tool shows deterioration by an

additional 9% to 17% when the problem size increases by a factor of 10

0

20000

40000

60000

80000

100000

10000 100000 1000000 10000000#cells

Ru

nti

me(

s)


1.201.401.601.802.002.202.402.602.80

10000 100000 1000000 10000000#cells

Qu

alit

y R

atio

Capo 8.6 Dragon 2.20 mPG 1.0 mPL 3.0 QPLace 5.1

Experimental Results on PEKU

The effectiveness of existing placers can vary significantly for circuits of similar size but different characteristics

Comparing QRs helps to identify the technique that works best under each scenario

1.10

1.30

1.50

1.70

1.90

2.10

2.30

0.00 0.00 0.50 0.75 1.00 2.00 5.00 10.00% of non-local nets

Qua

lity

Rat

io

Dragon 2.20 Capo 8.6 mPG 1.0 mPL 3.0 Qplace 5.1

QR (Placed Wirelength vs Upperbound) may not be tight

High Interest in the Community

Timing-driven Placement Examples with Known Optimal (TPEKO)

Obtain a placement for the circuit from any available tool

Perform timing analysis on the circuit

Create an artificial combinational path with equal or larger delay than the longest path

Guarantee the cells in the path are adjacent to each other

Make necessary modifications

Original longest path

Artificial path

Evaluating Timing-Driven Placement Algorithms Using TPEKO

Evaluating two state-of-the-art FPGA placement algorithms VPR [Marquardt et al.

2000] PATH [Kong 2002]

Can be far away from the optimal for difficult examples 35% on average 54% in the worst case

1.00

1.05

1.10

1.15

1.20

1.25

1.30

1.35

1.40

1 2 3 4 5#longest path

Qua

lity

Rat

io

VPR PATH

Observations from Gap Analysis

Significant opportunity in placement Existing algorithms may produce solutions far away

from the optimal The quality result of the same placer varies for

circuits of similar size but different characteristic Scalability problem in runtime and solution quality

Significant ROI Benefit equal to one to two generations of process

scaling But without requiring multi-billion dollar investment

(hopefully!)

Outline

IntroductionIntroduction Gap Analysis of Existing Placement AlgorithmsGap Analysis of Existing Placement Algorithms Scalable ParadigmScalable Paradigm Timing OptimizationTiming Optimization Routability OptimizationRoutability Optimization ConcludingConcluding RemarksRemarks ApplicationApplication

Multi-Million Gate FPGA PlacementMulti-Million Gate FPGA Placement

Paradigm 2: Multilevel Placement

Coarsening: build the hierarchy by recursive aggregation (generalized clustering)

Relaxation: improve the placement at each level by localized optimization

Interpolation: transfer coarse-level solution to adjacent, finer level (generalized declustering)

Multilevel Flow: multiple traversals over multiple hierarchies (V-cycle variations)

Multi-Level Optimization Framework

Interpolation &Relaxation (optimization)

Coarsening(Clustering)

Pro

ble

m s

ize

dec

reas

es

• Multilevel coarsening generates smaller problem sizes at coarser levels faster optimization at coarser levels

• May explore different aspects of the solution space at different levels• Gradual refinement on good solutions from coarser levels is very efficient• Successful in many applications

•Originally developed for PDEs•Recent success in VLSI CAD: partitioning, placement, routing

Given problem

Multilevel Coarse Placement

Coarsening by clustering

Refinement by placement

Initial Placement

A bin grid structure at each level

Hierarchical area density control

Optimization by SA, QP, RDFL, etc.

Multilevel Methods: Coarsening by Recursive Aggregation

Recursive aggregation defines the hierarchy. Different aggregation algorithms can be used on different levels

and/or in different V-cycles. Clustering methods

First-Choice Clustering (hMetis [Karypis 1999]). AMG based aggregation

An aggregate need not be a cluster. A cell can be fractionally associated to more than one aggregate

Merge each vertex with its “best” neighborMerged Nets

Multilevel Methods: Relaxation(Intralevel Optimization)

Iterative improvement at each level by fast, localized computation Discrete permutation enumerations; swapping Unconstrained quadratic wirelength minimization on

subsets Network-flow based improvement on subsets (RDFL)

Local relaxation is sufficient. Global improvement comes from the multilevel hierarchy.

Relaxations at finer levels may be quite different, e.g., more discrete, than relaxations at coarser levels.

Relaxation on Local Subsets

Original Subnetlist with Subproblem

Move the red cells to their optimal positions, holding all other cells fixed and (perhaps) ignoring overlap

Unrelated Cell

Fixed Neighbor

Movable Cell

Example: Goto-based Discrete Relaxation

Each cell’s optimal location is readily calculated when all other cells are held fixed.

Compute a chain A, B, C, D, E, whereB is a randomly selected neighbor of A’s optimal location, etc.

Examine all permutations of the chain and take the best one.

Problem: the chain is not closed (A is not necessarily near any other cell’s optimal location).

Example: Quadratic Relaxation on Noncontiguous Subsets (QRS)

Select a subset M of cells to move Identify other cells and pads, F, connected to

M by nets in

Decouple the horizontal and vertical problems. M is obtained as segments of length k along a

DFS vertex traversal of the netlist

}.|{ MeEeEM

Solving the QRS subproblem

Problem formulation (horizontal case):

Iteratively solve the weighted quadratic minimization problem, using the current solution to determine the weight (as in Gordian-L)

May result in cell overlap!

number. small is , )(||

1 where

|)(|

))((min

)()(

2

eve

Ee evke

ke

vxe

x

xvx

xvx

M

Ripple-move legalization [Hur and Lillis, 2000]Because many forms of subset relaxation ignore overlap, post-relaxation cell swaps may be needed to remove overlap.

Define a DAG on neighboring bins. Edge cost reflects the best wirelength gain over all cell swaps between two bins.Calculate a max-gain monotone path on the bin-grid graph

Multilevel Methods: Interpolation(Generalized Declustering)

Goal: transfer a partial solution from a coarser level to its adjacent finer level

Simplest approach: place all components of a cluster at its center

Better approach: place each component of an aggregate at the weighted average of the aggregates to which it is strongly connected.

Optionally: impose constraints; e.g., the average location of the components can be held fixed.

AMG-style Linear Interpolation

Place the C-Pt representatives

The inherited position of a cluster component ( ) can be determined by several cluster positions, not just its own.

Place the F-pts by weighted interpolation

AMG-based Linear Interpolation [A. Brandt 1986]

interpolation

constantAMG

jijvF

jijvC

i vavavjj points points

clusterNext finer level cells

Within each cluster, select the one with

maximum degree as C-point; others are

considered as F-points

C-point

Iterated Multilevel Flow

Make use of placement solution from 1st V-cycle

First Choice (FC)clustering

Geometric basedFC clustering

Iterated Multilevel Flow

Iterated V-Cycles F-Cycle

Backtracking V-Cycle

Sample Impact of the Multilevel Components to mPL’s overall quality

First-Choice Clustering: 3—4% reduced WL QRS Relaxation: 5—6% reduced WL AMG Interpolation: 2—3% reduced WL Iterated V-cycles: 2—8% reduced WL

mPL 3.0 vs. mPL1.0 and Gordian-LUniform-Cell IBM/ISPD 98 Circuits

mPL1.0+Dom. Gordian-L+Dom.

Circuit Wirelength CPU time Wirelength CPU time

Ibm04 1.18 0.31 1.05 1.90

Ibm07 1.14 0.34 1.05 3.77

Ibm09 1.14 0.33 1.04 4.90

Ibm10 1.11 0.31 0.99 6.54

Ibm14 1.11 0.41 1.04 8.28

Ibm16 1.16 0.46 1.00 11.76

Ibm17 1.07 0.44 0.98 10.41

Ibm18 1.18 0.42 1.03 13.43

Average 1.14 0.38 1.02 7.62

(12% better than mPL1.0 with 2x longer runtime; 2% better than Gordian-L and 7x faster)

mPL 3.0 vs. Capo 8.5 and DragonUniform-Cell IBM/ISPD 98 Circuits

Capo 8.5 / mPL3.0 Dragon / mPL3.0

Circuit Wirelength CPU time Wirelength CPU time

Ibm04 1.12 0.53 0.97 3.03

Ibm07 1.12 0.60 0.95 3.33

Ibm09 1.12 0.67 1.01 5.40

Ibm10 1.10 0.55 0.99 4.70

Ibm14 1.08 0.57 0.95 3.02

Ibm16 1.06 0.54 0.90 6.83

Ibm17 1.10 0.43 0.98 6.82

Ibm18 1.10 0.43 0.96 6.10

Average 1.10 0.54 0.96 4.91

(10% better than Capo with 2x longer runtime;4% worse than Dragon but 4x faster)

mPL3.0 vs. mPL1.0, Capo8.5, Dragon and Gordian-L

0.00

1.00

2.00

3.00

4.00

5.00

6.00

0.95 1.00 1.05 1.10 1.15

scaled wirelength

scal

ed r

un

tim

e mPL3.0

mPL1.0

Capo8.5

Dragon

Gordian-L

QRS2nd V-cycle

AMG

Extension: Multilevel Mixed-size Placement Level 0

Level kCoarsest level

placement

Big objects legalization

big object

small object

fixed big object

cluster

Simultaneous place big and small objectsGradually fix the locations of big objects and generate overlap-free placement for big objects during multilevel placement

Example: Final Placement of ibm02 by mPG-ms

Concluding Remarks

There is significant opportunity to improve the placement technologies

Multilevel placement is a promising scalable solution

Documents

Large Scale Circuit Placement: Gap and Promise