John A. Nestor, Kavon Nasabzadeh, and Oliver Bowen ECE Department Lafayette College Easton, Pennsylvania 18042 [email protected] Supporting Rip-Up

John A. Nestor, Kavon Nasabzadeh, and Oliver Bowen

ECE DepartmentLafayette College

Easton, Pennsylvania18042

[email protected]

Supporting Rip-Up and Reroute in an FPGA-Based Multilayer Maze Routing

Accelerator

2 MAPLD 2005/209Nestor

Outline

Background Overview of VLSI CAD Maze Routing with the Lee Algorithm Routing Accelerators The L3 Acclerator Maze Routing with Multiple Nets (Rip up and Reroute)

L4 - Hardware Acclerator with Rip up and Reroute Design Improvements Etching - a new feature to support Rip up and Reroute Control Software Implementation

Results Conclusion


The Routing Problem

Given a set desired connections (a netlist) A set of layers available to make connections

Create a set of connections that: Completely connects the terminals of each net Meets timing constraints on delay for critical nets Minimizes the area consumed by routing Minimizes the crossings between each layer Resolves crosstalk and noise issues


Maze Routing - The Lee Algorithm

Treat routing surface as a grid

Algorithm Operation Expansion - starting with “source” terminal, label neighboring

nodes with shortest path back to source; stop when target reached

Backtrace - follow shortest path from target back to source Cleanup - remove labels

Easily extended to multiple layers

S

T

N

N


Lee Algorithm Operation

S

TS

T

S

T

S

T

S

TS

T

S

T

S

T

S

T

S

T

S

T

S

T

Expansion - O(d2) steps


Lee Algorithm Operation (continued)

S

T

S

T

S

T

S

T T

Backtrace - O(d) steps

Cleanup - O(N2) steps

T T T T


Lee Algorithm Tradeoffs

Advantage: Guaranteed to find a shortest path connection if one exists

Disadvantages: Slow

• Expansion: O(d2) for connection of distance d

• Backtrace: O(d) for connection of distance d

• Cleanup: O(N2) for an N X N grid

Shortest-path guarantee only for a single connection


Multiple Net Routing

Lee Algorithm: routes nets one at a time Problem: Early connections may block later ones Solution: “Rip-up-and-reroute”

Rip up (remove) blocking nets Reroute nets in different order Key question: how to select net order?

S2

T1

S1

T2

Net 1 routed firstand blocks Net 2

S2

T1

S1

T2

Net 1 ripped up

S2

T1

S1

T2

Net 2 routed firstfollowed by Net 1


Software-Based Rip-Up and Reroute

Label unoccupied gridpoints with distance from source Allow expansion of occupied cells with penalty (e.g., 50) Path through obstacle found when no no-obstacle path exists Usage matrix sums connections in each gridpoint; identifies

violations where sum > 1 - identifies nets for rip-up

3

2

1

2

3

2

1

S2

1

2

T1

52

51

52

S1

53

52

53

T2

Expansion: Path foundthrough net N1

S2

T1

S1

T2

Backtrace: Routed withconflicting paths

1

1

1

2

1

1

1 1

Usage matrixIdentifies violations

Violation


Maze Routing Accelerators

Full grid accelerators(e.g. L-Machine, [Breuer & Shamsa 1981]) One processing element per gridpoint Custom implementation Reduces execution time from O(d2) to O(d) Impractical for multiple layers

Virtual grid acclerators (e.g. HAM [Venkataswaran & Mazumder, 1993]) Each processing element handles multiple gridpoints Custom implementatoin Complex PE design


The L3 Maze Routing Accelerator

Modified Direct Grid Approach Use two-dimensional grid to support multiple layers by

time multiplexing PE Operation on each clock cycle

Expansion: • if the cell represented the PE is unexpanded and a

neighboring PE is expanded, enter an “expansion state” that marks the direction of the neighbor (shortest path to source)

• Quit when target enters expansion state Backtrace

• Start with target and follow marked direction• Quite when source is reached

Cleanup: clear expansion states (but not obstacles) of all PEs in parallel - one layer per clock cycle


L3 Accelerator Organization

PE PE PE PE

PE PE PE PE

PE PE PE PE

PE PE PE PE

Column Decoder

ControlUnit

STATUS Signal (AND of all cell STATUS outputs)

CMD Signal (to all PEs)

Row

Dec.

S

T


L3 - PE Details

Efficient implementation - 32 LUTs in a Xilinx Spartan/Virtex FPGA for each PE Large Xilinx FPGAs can support grids of 32 X 32 PEs

Execution time comparsion: hw vs sw: Expansion: O(L X d) for L layers vs. O(L X d2) Backtrace: O(L X d) vs. O(L X d) Cleanup: O(L) vs. O(L X N2)

Single-net routing speedup of 93X over software for a 16 X 16 4 layer array (2.5GHz Pentum 4)

Drawbacks Limited clock speed due to long timing paths in design No support for multiple net routing


L4 - An Improved Routing Accelerator

Support for rip up and reroute using etching Hardware can’t support cost-based routing (unlike SW) Instead, allow expansion of “obstacle” cells Analogy - solvent “etching” through a barrier

Pipelined control unit / array interaction for increased clock rate (50 MHz vs. 24MHz)

Increased implementation cost (44 LUTs), but still feasible for 32 X 32 routing array

Control algorithm in Host Processor supports multiple net routing


L4 Processing Element Design

Register ST0stores state for layercurrently being processed

Shift Register SREGstores state for layersabove/below current layer

Layers processedbottom-to-top

D QXL

XH/TOP

CLK

SEQUENCERST0

SREG

NSCS

HI

LI NI SI EI WI

RSEL CSEL CMD

ST

to adjacentcells

fromdecoders CMD

STATUS

PFV

EN

1 0

PF

PFEWPFNS

EN

XO

D Q

ETCHETCH


L4 Processing Element - States

EMPTY Cell unoccupied and unexpanded

BLOCKED Cell occupied by routed net

XE Expanded - shortest backtrace path to east

XW Expanded - shortest backtrace path to west

XN Expanded - shortest backtrace path to north

XS Expanded - shortest backtrace path to south

XU Expanded - shortest backtrace path upward

XD Expanded - shortest backtrace path downard

Each PE stores the state of each layer at a specific (x,y) coordinate

States:


L4 Processing Element - Commands

READ Return state of selected cell(s)

WRITE Write state of selected cell(s)

EXPAND

IF EMPTY or (BLOCKED and ETCH enabled, AND a neighboring cell is expandedTHEN enter corresponding expand state (XN, …)

Commands are broadcast to all cells Decoders select rectangular regions of cells Cell outputs ORed together

CLEARX

Reset expanded cells to EMPTY state Reset etched cells to BLOCKED state


Etching Example

ETCH=0 - expansion fails

Etch=1 - expansion succeeds

S2

T1

S1

T2 S2

T1

S1

T2 S2

T1

S1

T2

S2

T1

S1

T2 S2

T1

S1

T2 S2

T1

S1

T2 S2

T1

S1

T2

After BacktraceExpansion


Overall System

Column Dec.

Row Dec.

ControlUnit

arraystatus

array cmd.HostProcessor(Control

Algorithm)

routing commands(src, tgt)

routing results(segments)

rs1

rs2

cs1 cs2

PE Array

FIFO

FIFO


Control Algorithm

Based on the SILK Simulated Evolution router [Lin et. al., 1989]

Key idea: score nets by fitness Score = * (number of violations)

+ * (number of vias – minimum number of vias)

+ * (actual wire length / lower bound of wire length)

Probabilistically select less fit nets for rip-up Use hardware router to find connections Usage matrix (maintained in software) identifies

routing violationsYoun-Long Lin, Yu-Chin Hsu and Fur-Shing Tsai, “SILK: A Simulated Evolution Router”,IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 8, No. 10, pp. 1108-1114, October 1989.


Control Algorithm

SE AlgorithmRip up Nets

Reroute Nets

HW/SWRouter

Initial Routing(Greedy Algorithm)

HW - L4Accelerator

UsageMatrix

PostOptimization

InputNetlist

RoutingResult

SW - ModifiedLee Algorithm

Success Fail

Feasible Solution

FinalRouting


Evaluating the System

Goal: evaluate the best case speedup over software (interface overhead excluded)

Method: Cycle counters in accelerator, host processors

Experiements on two grids 4-layer 8 X 8 4-layer 16 X 16


Results

Ź Software-only sys. Hardware-software sys. Ź Ź

Number time Number ofRoutes

incurringtime taken by

softwaretime taken by

hardware total time number of Speed-

of Nets Ź(us) routes violations (us) (us) (us) routes Źup

8 7049 13 0 468 17 485 13 14.53

16 20979 40 2 1428 52 1480 39 14.18

24 36651 76 14 3438 93 3531 70 10.38

30 51911 129 33 8450 191 8641 150 6.01

Ź Software-only sys. Hardware-software sys. Ź Ź

Number time Number ofRoutes

incurringtime taken by

softwaretime taken by

hardware total time number of Speed-

of Nets Ź(us) routes violations (us) (us) (us) routes Źup

4 1375 8 1 743 18 761 8 1.81

8 2053 15 3 1332 29 1361 15 1.51

12 2776 26 3 2616 52 2668 26 1.04

15 8957 90 30 4249 81 4330 38 2.07

16 10301 98 30 21426 602 22028 305 0.47

8 X 8 X 4 Grid

16 X 16 X 4 Grid


Conclusions

Extended accelerator for Maze Routing Support for Rip-Up and Reroute Support for multiple net routing Faster clock speed Control algorithm based on simulated evolution

Future Work Larger arrays in a PCI-based accelerator Performance measurements including interface overhead Hardware support to accelerate control algorithm

Documents

John A. Nestor, Kavon Nasabzadeh, and Oliver Bowen ECE Department Lafayette College Easton, Pennsylvania 18042 [email protected] Supporting Rip-Up