A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation


Roman Lysecky (a), Frank Vahid (a)*, Sheldon X.-D. Tan (b)

(a) Department of Computer Science and Engineering
(b) Department of Electrical Engineering
University of California, Riverside
*Also with the Center for Embedded Computer Systems at UC Irvine

This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and Xilinx

Introduction: Standard Binary - Separating Function and Architecture

[Figure: SW source passes through profiling and a standard compiler to produce an x86 binary]

Software binaries of the past
  Binary reflected the specific language of the underlying architecture; limited portability

Current “standard binary”
  Concept: separate function from detailed architecture
  Develop new architectures for existing applications
  Trend towards dynamic translation and optimization

Introduction: But Today’s Binaries Are More than Just Software

[Figure: SW passes through profiling and a standard compiler to a SW binary for a processor; SW and HW pass through profiling and compiler/synthesis to a binary for platforms that combine processors (Processor1, Processor2, Processor3) with FPGAs]

Introduction: Just-in-Time FPGA Compilation?

JIT FPGA compilation
  Idea: standard binary for FPGA
  Similar benefits as standard binary for microprocessor
    Portability, transparency, standard tools
  Embedded JIT compilation tools optimized for each FPGA

[Figure: VHDL/Verilog passes through profiling and standard CAD tools to a standard HW binary; a JIT FPGA compiler on each device maps that binary onto its FPGA]

Introduction: One Use of JIT FPGA Compilation

[Figure: a cable TV company ships a feature upgrade as one SW binary that runs on ARM7, ARM9, ARM10, and ARM11 processors]

[Figure: once the upgrade includes HW, the cable TV company maintains four HW designs (HW1 to HW4) and ships the SW binary together with a separate HW netlist (Netlist1 to Netlist4) for devices built around FPGA 1 to FPGA 4]

[Figure: with a standard HW binary, the cable TV company ships one SW binary and one HW binary; JIT FPGA compilation on each device maps the HW binary onto FPGA 1 to FPGA 4]

Introduction: Another Use - Warp Processors (Dynamic HW/SW Partitioning)

1. Initially execute application in software only
2. Profile application to determine critical regions
3. Partition critical regions to hardware
4. Program configurable logic & update software binary
5. Partitioned application executes faster with lower energy consumption

[Figure: warp processor with µP, I$, D$, FPGA, profiler, and dynamic partitioning module (DPM); bar chart comparing time and energy for SW only versus HW/SW]

Lysecky/Vahid, DATE’04; Stitt/Lysecky/Vahid, DAC’03; Stitt/Vahid, ICCAD’02
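Step 2 above can be illustrated with a small Python sketch. It assumes, purely for illustration, that the profiler sees a trace of taken branches and treats frequently taken backward branches as loops worth moving to hardware; the function name `find_critical_regions` and the trace format are hypothetical and do not come from the slides.

```python
from collections import Counter


def find_critical_regions(branch_trace, top_n=1):
    """Rank loops by how often their backward branch was taken.

    branch_trace: iterable of (branch_address, target_address) pairs for
    taken branches. A branch whose target is at or below its own address
    is treated as a loop back-edge (an assumption of this sketch).
    """
    counts = Counter(
        (target, branch) for branch, target in branch_trace if target <= branch
    )
    # Each region is reported as (loop_start, loop_end).
    return [region for region, _ in counts.most_common(top_n)]


# Hypothetical trace: one hot loop, one cold loop, and a forward branch.
trace = (
    [(0x120, 0x100)] * 1000
    + [(0x220, 0x200)] * 10
    + [(0x300, 0x340)] * 5
)
hottest = find_critical_regions(trace)
```

Here the loop spanning 0x100 to 0x120 dominates the trace, so it would be the first candidate for partitioning to hardware in step 3.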

Introduction: Another Use - Warp Processors (Dynamic HW/SW Partitioning)

[Figure: warp processor with µP, I$, D$, FPGA, profiler, and DPM (CAD). DPM tool flow labels: Binary, Decompilation, Partitioning, RT Synthesis, Std. HW Binary, JIT FPGA Compilation (Logic Synthesis, Tech. Mapping/Packing, Placement, Routing), HW Bitstream, Binary Updater, Updated Binary]

Lysecky/Vahid, DATE’04; Stitt/Lysecky/Vahid, DAC’03; Stitt/Vahid, ICCAD’02

Introduction: Existing FPGAs Not Suitable for JIT FPGA Compilation

Existing FPGAs require extremely complex CAD tools
  Designed to handle large arbitrary circuits, ASIC prototyping, etc.
  Require long execution times and very large memory usage
  Not suitable for dynamic on-chip execution

[Figure: execution time of existing CAD tools per stage: Log. Syn. 1 min; Tech. Map 1 min; Place 1-2 mins; Route 2-30 mins. Memory labels: 50 MB, 60 MB, 10 MB, 10 MB]

JIT FPGA Compilation: CAD-Oriented FPGA

Solution: develop a custom CAD-oriented FPGA
  Careful simultaneous design of FPGA and CAD
  FPGA features evaluated for impact on CAD
  Enables development of fast, lean JIT FPGA compilation tools

[Figure: JIT FPGA compilation flow (Logic Synthesis, Tech. Mapping/Packing, Placement, Routing) annotated with execution time labels (1s, <1s, <1s, 10s) and memory labels (.5 MB, 1 MB, 1 MB, 3.6 MB)]

Lysecky/Vahid, DATE’04

Simple Configurable Logic Fabric: CAD-Oriented FPGA

[Figure: array of CLBs, each surrounded by switch matrices (SMs)]

Simple Configurable Logic Fabric (CLF)
  Hundreds of existing commercial and research FPGA fabrics
    Most designed to balance circuit density and speed
  Analyzed FPGA features to determine their impact on CAD
  Designed our CLF in conjunction with JIT FPGA compilation tools
  Array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs)
  Each CLB is directly connected to an SM
    Along with SM design, allows for design of lean JIT routing


Simple Configurable Logic Fabric: Combinational Logic Block

Combinational Logic Block
  Incorporates two 3-input, 2-output LUTs
    Equivalent to four 3-input LUTs with fixed internal routing
    Allows for good quality circuit while reducing JIT technology mapping complexity
  Provides routing resources between adjacent CLBs to support carry chains
    Reduces number of nets we need to route

FPGAs: Flexibility/Density: large CLBs, various internal routing resources
SCLF: Simplicity: limited internal routing, reduces on-chip CAD complexity

[Figure: CLB with two LUTs, inputs a to f, outputs o1 to o4, and connections to adjacent CLBs]
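As a rough illustration of the CLB described above, the following Python sketch models a 3-input, 2-output LUT as two 8-entry truth tables that share the same three inputs, which is why two such LUTs correspond to four 3-input LUTs with fixed internal routing. The class name `Lut3x2`, the bit ordering, and the full-adder configuration are assumptions made for this example and are not taken from the slides.

```python
from itertools import product


class Lut3x2:
    """Hypothetical model of a 3-input, 2-output LUT (names are illustrative)."""

    def __init__(self, table_o1, table_o2):
        # Each output has its own 8-entry truth table; both share inputs a, b, c.
        if len(table_o1) != 8 or len(table_o2) != 8:
            raise ValueError("a 3-input LUT needs 8 truth-table entries per output")
        self.tables = (tuple(table_o1), tuple(table_o2))

    def evaluate(self, a, b, c):
        # Assumed bit ordering: a is the most significant address bit.
        index = (a << 2) | (b << 1) | c
        return self.tables[0][index], self.tables[1][index]


def full_adder_lut():
    """Configure one LUT as a 1-bit full adder: o1 = sum, o2 = carry-out."""
    sums, carries = [], []
    for a, b, c in product((0, 1), repeat=3):
        total = a + b + c
        sums.append(total & 1)
        carries.append(total >> 1)
    return Lut3x2(sums, carries)


adder = full_adder_lut()
results = {bits: adder.evaluate(*bits) for bits in product((0, 1), repeat=3)}
```

In this configuration a single LUT produces both the sum and the carry of one adder bit. The carry is the kind of signal that the dedicated connections between adjacent CLBs could carry, so it would not need to be routed as a regular net.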


Simple Configurable Logic Fabric: Switch Matrix

[Figure: switch matrix with short channels 0, 1, 2, 3 and long channels 0L, 1L, 2L, 3L on each side]

Switch Matrix
  All nets are routed using only a single pair of channels throughout the configurable logic fabric
    Each short channel is associated with a single long channel
  Designed for fast, lean JIT FPGA routing

FPGAs: Flexibility/Speed: large routing resources, various routing options
SCLF: Simplicity: allows for design of fast, lean routing algorithm


JIT FPGA Compilation: Routing

FPGA routing
  Find a path within the FPGA to connect the source and sinks of each net within our hardware circuit

Pathfinder [Ebeling, et al., 1995]
  Introduced negotiated congestion
  During each routing iteration, route nets using shortest path
    Allows overuse (congestion) of resources
  If congestion exists (illegal routing)
    Update cost of congested resources
    Rip-up all routes and reroute all nets

VPR [Betz, et al., 1997]
  Provides various improvements over Pathfinder
    Routability-driven: use fewest tracks possible
    Timing-driven: optimize circuit speed
  Many techniques are used in commercial FPGA CAD tools

[Figure: routing example in which resource costs start at 1 and rise to 2 where congestion occurs]
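The negotiated congestion loop outlined above can be sketched in a few lines of Python. This is a simplified illustration, not Pathfinder or VPR themselves: it routes two-terminal nets over nodes with unit base cost, and the cost formula, the function names, and the small example graph are all assumptions chosen for the example.

```python
import heapq


def shortest_path(graph, cost, source, sink):
    """Dijkstra over node costs; returns the node list from source to sink."""
    best, parent, heap = {source: 0.0}, {source: None}, [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == sink:
            break
        if dist > best[node]:
            continue  # stale heap entry
        for nxt in graph[node]:
            cand = dist + cost(nxt)
            if cand < best.get(nxt, float("inf")):
                best[nxt], parent[nxt] = cand, node
                heapq.heappush(heap, (cand, nxt))
    if sink not in parent:
        raise ValueError("sink is unreachable")
    path, node = [], sink
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def route_negotiated(graph, capacity, nets, max_iters=20):
    """Route all nets, allowing overuse, until no resource is congested."""
    history = {n: 0.0 for n in graph}  # grows on repeatedly congested nodes
    for iteration in range(1, max_iters + 1):
        usage = {n: 0 for n in graph}  # rip up all routes
        routes = {}
        for name, (src, dst) in nets.items():
            def cost(n):
                # Assumed cost: (base + history) * present-congestion penalty.
                over = max(0, usage[n] + 1 - capacity[n])
                return (1.0 + history[n]) * (1.0 + over * iteration)
            routes[name] = shortest_path(graph, cost, src, dst)
            for n in routes[name]:
                usage[n] += 1
        congested = [n for n in graph if usage[n] > capacity[n]]
        if not congested:
            return routes, iteration  # legal routing
        for n in congested:
            history[n] += usage[n] - capacity[n]  # update cost of congested nodes
    raise RuntimeError("no legal routing found")


def undirected(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    return graph


# Both nets prefer the shared node "m"; each also has a longer private detour.
graph = undirected([
    ("s1", "m"), ("m", "t1"), ("s2", "m"), ("m", "t2"),
    ("s1", "a1"), ("a1", "a2"), ("a2", "a3"), ("a3", "t1"),
    ("s2", "b1"), ("b1", "b2"), ("b2", "b3"), ("b3", "t2"),
])
capacity = {n: 1 for n in graph}
routes, iterations = route_negotiated(
    graph, capacity, {"n1": ("s1", "t1"), "n2": ("s2", "t2")}
)
```

In the first iteration both nets use the shared node, which is illegal. After its cost is raised, the second iteration keeps one net on the shared node and sends the other along its detour, giving a legal routing.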

JIT FPGA Compilation: ROCR - Riverside On-Chip Router

[Figure: "Routing Resource Graph" panel showing the fabric's SMs and CLBs, and "Resource Graph" panel showing only SMs, with edges labeled 0/4]

ROCR - Riverside On-Chip Router resource graph
  Nodes correspond to SMs
  Edges correspond to channels between SMs
  Capacity of an edge is equal to the number of wires within the channel
  Requires much less memory as resource graph is smaller

[Figure: routing flow: route; if the routing is illegal, rip up and route again; otherwise done]

Lysecky/Vahid/Tan, DAC’04; Lysecky/Vahid, DATE’04
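A minimal Python sketch of a resource graph of this kind is shown below, assuming a rectangular grid of SMs in which each channel joins two adjacent SMs. The function names, the dictionary layout, and the 3x3 grid with four wires per channel are illustrative assumptions; this is not ROCR's actual data structure.

```python
def build_resource_graph(rows, cols, channel_width):
    """Nodes are SM grid coordinates; edges are channels between adjacent SMs."""
    edges = {}
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:  # horizontal channel to the SM on the right
                edges[((r, c), (r, c + 1))] = {"used": 0, "capacity": channel_width}
            if r + 1 < rows:  # vertical channel to the SM below
                edges[((r, c), (r + 1, c))] = {"used": 0, "capacity": channel_width}
    return edges


def occupy(edges, path):
    """Record one net using every channel along a path of adjacent SMs."""
    for a, b in zip(path, path[1:]):
        key = (a, b) if (a, b) in edges else (b, a)
        edges[key]["used"] += 1  # raises KeyError if a and b are not adjacent


def is_illegal(edges):
    """A routing is illegal when any channel carries more nets than it has wires."""
    return any(e["used"] > e["capacity"] for e in edges.values())


fabric = build_resource_graph(rows=3, cols=3, channel_width=4)
path = [(0, 0), (0, 1), (1, 1)]
for _ in range(4):
    occupy(fabric, path)
legal_after_four = not is_illegal(fabric)  # channels along the path are at 4/4
occupy(fabric, path)
legal_after_five = not is_illegal(fabric)  # 5/4 on the same channels
```

A graph like this needs one node per SM and one edge per channel, regardless of how many wires each channel holds, which is one way to see why it stays small compared with a graph that models every wire separately.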

Scalability of On-chip Routing: Experimental Setup

[Figure: configurable logic fabric of CLBs and SMs]

Experimental setup
  100x100 configurable logic fabric array
  Routing channel width of 34
    Large enough to support all HW circuits
  123 MCNC benchmark circuits
    Circuit complexity ranges from a few LUTs to tens of thousands of LUTs
  Performed technology mapping, packing, and placement using FlowMap, T-VPack, and VPR’s bounding box placement
  Routed each HW benchmark circuit using:
    VPR’s timing-driven router
    VPR’s fast timing-driven router (-fast option)
    Riverside On-Chip Router (ROCR)

Scalability of On-chip Routing: Memory Usage

[Figure: bar chart of minimum, average, and maximum memory usage (KBytes, axis 0 to 140000) for VPR, VPR (Fast), and ROCR; data labels 126602, 113235, and 8352]

VPR requires over 100 MB of memory on average
ROCR requires at most 8.3 MB
VPR requires 18X more memory than ROCR on average

Algorithm Performance

[Figure: execution time (s, axis 0 to 200) versus circuit size (CLBs, axis 0 to 4000) for VPR, VPR (Fast), and ROCR]

ROCR is over 40X faster than VPR for small HW circuits
ROCR is 2X-3X faster than VPR for large HW circuits

Critical Path

[Figure: critical path (ns, axis 0 to 200) versus circuit size (CLBs) for VPR, VPR (Fast), and ROCR]

Largest HW circuit: 19% longer critical path than VPR, 2.6% shorter than VPR (Fast)
On average: 30%/27% longer critical path than VPR/VPR (Fast)

Wire Segments

[Figure: wire segments (axis 0 to 90000) versus circuit size (nets) for VPR, VPR (Fast), and ROCR]

ROCR requires 2%/8% fewer wire segments than VPR/VPR (Fast) for larger HW circuits

Conclusions and Future Work

Conclusions
  Demonstrated ROCR scales well as circuit size increases
    On average 2X faster than VPR’s fast timing-driven router
    Requires 18X less memory than VPR
  Produces good circuit quality
    Critical path 27% longer than VPR (Fast) on average
    2.6% shorter critical path for largest HW circuit
    Requires on average 5% fewer wire segments

Future work
  Current project: a major microprocessor vendor is fabricating our custom FPGA
  Improvements to Riverside On-Chip Router (ROCR)
    Improve ROCR’s performance for large HW circuits
    Incorporating timing information to achieve
    Analyze the scalability of ROCR as circuit size approaches FPGA capacity
  JIT FPGA compilation
    Development of standard HW binary
    Support more complex FPGA architectures
