A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation. Roman Lysecky (a), Frank Vahid (a,*), Sheldon X.-D. Tan (b). (a) Department of Computer Science and Engineering, (b) Department of Electrical Engineering, University of California, Riverside. (*) Also with the Center for Embedded Computer Systems at UC Irvine. This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and Xilinx.


Page 1: A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation

A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation

Roman Lysecky (a), Frank Vahid (a,*), Sheldon X.-D. Tan (b)

(a) Department of Computer Science and Engineering
(b) Department of Electrical Engineering
University of California, Riverside
(*) Also with the Center for Embedded Computer Systems at UC Irvine

This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and Xilinx

Page 2: A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation

Introduction
Standard Binary - Separating Function and Architecture

[Figure: SW source passes through a standard compiler, with profiling, to produce an x86 binary]

  • Software binaries of the past
    • Binary reflected the specific language of the underlying architecture, which limited portability
  • Current “standard binary”
    • Concept: separate function from detailed architecture
    • Develop new architectures for existing applications
    • Trend towards dynamic translation and optimization

Page 3: A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation

Introduction
But Today’s Binaries Are More than Just Software

[Figure: SW sources pass through a standard compiler, with profiling, to a SW binary; SW and HW sources pass through compiler/synthesis to binaries; targets are labeled Processor1, Processor2, Processor3, and platforms that pair processors with an FPGA]

Page 4: A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation

Introduction
Just-in-Time FPGA Compilation?

  • JIT FPGA compilation
    • Idea: standard binary for FPGA
    • Similar benefits as standard binary for microprocessor
      • Portability, transparency, standard tools
  • Embedded JIT compilation tools optimized for each FPGA

[Figure: VHDL/Verilog passes through standard CAD tools, with profiling, to a Std. HW Binary; a JIT FPGA Comp. on each device maps that binary onto its FPGA; MEM]

Page 5: A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation

Introduction
One Use of JIT FPGA Compilation

[Figure: a cable TV company distributes a feature upgrade written in SW as one SW binary to processors ARM7, ARM9, ARM10, and ARM11]

Page 6: A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation

Introduction
One Use of JIT FPGA Compilation

[Figure: the cable TV company's feature upgrade now includes HW; each of the four platforms (processor plus FPGA 1, FPGA 2, FPGA 3, or FPGA 4) receives the SW binary together with its own HW netlist (Netlist1 to Netlist4), built from separate HW descriptions HW1 to HW4]

Page 7: A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation

Introduction
One Use of JIT FPGA Compilation

[Figure: with JIT FPGA compilation, the cable TV company distributes one SW binary and one HW binary for the feature upgrade; each platform (processor plus FPGA 1, FPGA 2, FPGA 3, or FPGA 4) runs its own JIT FPGA Comp.]

Page 8: A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation

Introduction
Another Use - Warp Processors (Dynamic HW/SW Partitioning)

  1. Initially execute application in software only
  2. Profile application to determine critical regions
  3. Partition critical regions to hardware
  4. Program configurable logic and update software binary
  5. Partitioned application executes faster with lower energy consumption

[Figure: warp processor with µP, I$, D$, Profiler, Dynamic Part. Module (DPM), and FPGA; bar chart comparing Time and Energy for SW Only versus HW/SW]

Lysecky/Vahid, DATE’04; Stitt/Lysecky/Vahid, DAC’03; Stitt/Vahid, ICCAD’02
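Steps 2 and 3 above can be illustrated with a small sketch. The slides do not say how the profiler chooses critical regions, so the selection rule here (take the most frequently executed regions until a coverage threshold is met), the function name `select_critical_regions`, and the profile data are all hypothetical.

```python
def select_critical_regions(profile, coverage=0.8):
    """Pick the most frequently executed regions until `coverage`
    (a fraction of all profiled executions) is accounted for.

    `profile` maps a region name to its execution count. This greedy
    rule is an assumption for illustration, not the warp processor's
    actual partitioning heuristic.
    """
    total = sum(profile.values())
    selected, covered = [], 0
    ranked = sorted(profile.items(), key=lambda kv: kv[1], reverse=True)
    for region, count in ranked:
        if covered >= coverage * total:
            break
        selected.append(region)
        covered += count
    return selected


# Hypothetical profile: execution counts per code region.
profile = {"loop_fir": 9000, "loop_crc": 600, "init": 50, "main": 350}
hw_candidates = select_critical_regions(profile, coverage=0.8)
```

Raising `coverage` moves more regions to hardware, so the threshold trades FPGA area against the fraction of execution that is accelerated.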

Page 9: A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation

Introduction
Another Use - Warp Processors (Dynamic HW/SW Partitioning)

[Figure: warp processor with µP, I$, D$, Profiler, DPM (CAD), and FPGA. The DPM tool flow has the stages Decompilation, Partitioning, RT Synthesis, JIT FPGA Compilation (Logic Synthesis, Tech. Mapping/Packing, Placement, Routing), and Binary Updater; its inputs and outputs are labeled Binary, Std. HW Binary, HW Bitstream, and Updated Binary]

Lysecky/Vahid, DATE’04; Stitt/Lysecky/Vahid, DAC’03; Stitt/Vahid, ICCAD’02

Page 10: A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation

Introduction
Existing FPGAs Not Suitable for JIT FPGA Compilation

  • Existing FPGAs require extremely complex CAD tools
    • Designed to handle large arbitrary circuits, ASIC prototyping, etc.
    • Require long execution times and very large memory usage
    • Not suitable for dynamic on-chip execution

[Figure: execution time per CAD stage: Log. Syn. 1 min, Tech. Map 1 min, Place 1-2 mins, Route 2-30 mins; memory usage labels 50 MB, 60 MB, 10 MB, 10 MB]

Page 11: A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation

JIT FPGA Compilation
CAD-Oriented FPGA

  • Solution: Develop a custom CAD-oriented FPGA
    • Careful simultaneous design of FPGA and CAD
    • FPGA features evaluated for impact on CAD
  • Enables development of fast, lean JIT FPGA compilation tools

[Figure: JIT FPGA Comp. running next to the FPGA, with stages Logic Synthesis, Tech. Mapping/Packing, Placement, and Routing; time labels 1s, <1s, <1s, 10s; memory labels .5 MB, 1 MB, 1 MB, 3.6 MB]

Lysecky/Vahid, DATE’04

Page 12: A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation

Simple Configurable Logic Fabric
CAD-Oriented FPGA

[Figure: array of CLBs, each surrounded by SMs]

  • Simple Configurable Logic Fabric (CLF)
    • Hundreds of existing commercial and research FPGA fabrics
      • Most designed to balance circuit density and speed
    • Analyzed FPGA features to determine their impact on CAD
    • Designed our CLF in conjunction with JIT FPGA compilation tools
    • Array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs)
    • Each CLB is directly connected to an SM
      • Along with the SM design, this allows for design of lean JIT routing

Lysecky/Vahid, DATE’04

Page 13: A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation

Simple Configurable Logic Fabric
Combinational Logic Block

  • Combinational Logic Block
    • Incorporates two 3-input 2-output LUTs
      • Equivalent to four 3-input LUTs with fixed internal routing
      • Allows for good quality circuit while reducing JIT technology mapping complexity
    • Provides routing resources between adjacent CLBs to support carry chains
      • Reduces number of nets we need to route

  • FPGAs: Flexibility/Density: Large CLBs, various internal routing resources
  • SCLF: Simplicity: Limited internal routing, reduces on-chip CAD complexity

[Figure: CLB containing two LUTs with inputs a, b, c, d, e, f and outputs o1, o2, o3, o4, plus connections to the adjacent CLBs]

Lysecky/Vahid, DATE’04
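The claim that two 3-input 2-output LUTs are equivalent to four 3-input LUTs can be checked with a small behavioral model. This is a sketch only: the class names `Lut3x2` and `Clb`, the assignment of inputs a, b, c to the first LUT and d, e, f to the second, and the full-adder example are assumptions made for illustration; carry-chain links to adjacent CLBs are not modeled.

```python
from itertools import product


class Lut3x2:
    """3-input, 2-output LUT stored as an 8-entry table of output pairs."""

    def __init__(self, func):
        # Enumerate inputs in binary order so the index is (x << 2 | y << 1 | z).
        self.table = [func(x, y, z) for x, y, z in product((0, 1), repeat=3)]

    def eval(self, x, y, z):
        return self.table[(x << 2) | (y << 1) | z]

    def config_bits(self):
        return 2 * len(self.table)  # two output bits per table entry


class Clb:
    """Two 3-input 2-output LUTs with fixed (assumed) input wiring."""

    def __init__(self, lut_abc, lut_def):
        self.lut_abc = lut_abc
        self.lut_def = lut_def

    def eval(self, a, b, c, d, e, f):
        o1, o2 = self.lut_abc.eval(a, b, c)
        o3, o4 = self.lut_def.eval(d, e, f)
        return o1, o2, o3, o4

    def config_bits(self):
        return self.lut_abc.config_bits() + self.lut_def.config_bits()


def full_adder(x, y, cin):
    """Sum and carry-out: two functions of the same three inputs."""
    total = x + y + cin
    return total & 1, total >> 1


clb = Clb(Lut3x2(full_adder), Lut3x2(full_adder))
```

Each `Lut3x2` holds 16 configuration bits, so the CLB holds 32, the same as four independent 8-bit 3-input LUTs. The difference is that each pair of outputs shares its three inputs, which is the "fixed internal routing" the slide refers to.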

Page 14: A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation

Simple Configurable Logic Fabric
Switch Matrix

[Figure: switch matrix with short channels 0, 1, 2, 3 and long channels 0L, 1L, 2L, 3L on each of its four sides]

  • Switch Matrix
    • All nets are routed using only a single pair of channels throughout the configurable logic fabric
      • Each short channel is associated with a single long channel
    • Designed for fast, lean JIT FPGA routing

  • FPGAs: Flexibility/Speed: Large routing resources, various routing options
  • SCLF: Simplicity: Allows for design of a fast, lean routing algorithm

Lysecky/Vahid, DATE’04

Page 15: A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation

JIT FPGA Compilation
Routing

  • FPGA Routing
    • Find a path within the FPGA to connect the source and sinks of each net within our hardware circuit
  • Pathfinder [Ebeling, et al., 1995]
    • Introduced negotiated congestion
    • During each routing iteration, route nets using shortest path
      • Allows overuse (congestion) of resources
    • If congestion exists (illegal routing)
      • Update cost of congested resources
      • Rip-up all routes and reroute all nets
  • VPR [Betz, et al., 1997]
    • Provides various improvements over Pathfinder
    • Routability-driven: Use fewest tracks possible
    • Timing-driven: Optimize circuit speed
    • Many techniques are used in commercial FPGA CAD tools

[Figure: routing example with resource costs labeled 1 and 2 and one resource marked as congested]
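The negotiated-congestion loop described above can be sketched in a few lines. This is a deliberately simplified illustration, not Pathfinder or VPR as published: it handles only two-terminal nets, charges a unit base cost plus a history term per node, and omits the present-congestion and timing terms. The graph, node names, and function names are hypothetical.

```python
import heapq


def shortest_path(graph, cost, src, dst):
    """Dijkstra over node costs; assumes dst is reachable from src."""
    best = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == dst:
            break
        if dist > best[node]:
            continue  # stale heap entry
        for nxt in graph[node]:
            new_dist = dist + cost[nxt]
            if new_dist < best.get(nxt, float("inf")):
                best[nxt] = new_dist
                prev[nxt] = node
                heapq.heappush(heap, (new_dist, nxt))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def negotiated_congestion(graph, capacity, nets, max_iters=20):
    """Route every net, then penalize overused nodes and repeat."""
    history = {n: 0 for n in graph}
    for iteration in range(1, max_iters + 1):
        cost = {n: 1 + history[n] for n in graph}
        usage = {n: 0 for n in graph}
        routes = {}  # starting empty each iteration is the "rip-up all"
        for name, (src, dst) in nets.items():
            routes[name] = shortest_path(graph, cost, src, dst)
            for n in routes[name]:
                usage[n] += 1
        congested = [n for n in graph if usage[n] > capacity[n]]
        if not congested:
            return routes, iteration  # legal routing
        for n in congested:
            history[n] += 1  # update cost of congested resources
    raise RuntimeError("no legal routing found")


# Both nets initially prefer node "m"; net B has a longer detour via x, y.
graph = {
    "s1": ["m"], "t1": ["m"],
    "s2": ["m", "x"], "t2": ["m", "y"],
    "m": ["s1", "t1", "s2", "t2"],
    "x": ["s2", "y"], "y": ["x", "t2"],
}
capacity = {n: 1 for n in graph}
nets = {"A": ("s1", "t1"), "B": ("s2", "t2")}
routes, iterations = negotiated_congestion(graph, capacity, nets)
```

Net A has no alternative to `m`, so the rising history cost eventually pushes net B onto the detour, which is the negotiation the slide describes.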

Page 16: A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation

JIT FPGA Compilation
ROCR - Riverside On-Chip Router

[Figure: a fabric of CLBs and SMs shown as a routing resource graph, next to ROCR's smaller resource graph in which SM nodes are joined by edges labeled 0/4; flowchart: Route, then "illegal?", yes leads to Rip-up and back to Route, no leads to Done!]

  • ROCR - Riverside On-Chip Router
  • Resource Graph
    • Nodes correspond to SMs
    • Edges correspond to channels between SMs
    • Capacity of an edge is equal to the number of wires within the channel
  • Requires much less memory because the resource graph is smaller

Lysecky/Vahid/Tan, DAC’04; Lysecky/Vahid, DATE’04
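A resource graph of the kind described above can be sketched as a dictionary of edges, each carrying a used/capacity pair like the 0/4 labels in the figure. The grid shape, the function names, and the list-based edge record are assumptions for illustration; ROCR's actual data structures are not given in the slides.

```python
def build_resource_graph(rows, cols, channel_width):
    """Return {(sm_a, sm_b): [used, capacity]} for a rows x cols grid of SMs.

    Each SM is a node named by its (row, col) position; each channel
    between neighboring SMs is one edge whose capacity is the number
    of wires in that channel.
    """
    edges = {}
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:  # horizontal channel to the right neighbor
                edges[((r, c), (r, c + 1))] = [0, channel_width]
            if r + 1 < rows:  # vertical channel to the neighbor below
                edges[((r, c), (r + 1, c))] = [0, channel_width]
    return edges


def use_channel(edges, a, b):
    """Record one more net on the channel between SMs a and b.

    Returns False once the channel is overused (used > capacity).
    """
    key = (a, b) if (a, b) in edges else (b, a)
    edges[key][0] += 1
    return edges[key][0] <= edges[key][1]


graph = build_resource_graph(3, 3, channel_width=4)
```

Because each channel is a single edge with a counter rather than one graph node per wire, the number of edges depends only on the grid size and not on the channel width.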

Page 17: A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation

Scalability of On-chip Routing
Experimental Setup

[Figure: array of CLBs, each surrounded by SMs]

  • Experimental Setup
    • 100x100 configurable logic fabric array
    • Routing channel width of 34
      • Large enough to support all HW circuits
    • 123 MCNC benchmark circuits
      • Circuit complexity ranges from a few LUTs to tens of thousands of LUTs
    • Performed technology mapping, packing, and placement using FlowMap, T-VPack, and VPR’s bounding box placement
    • Routed each HW benchmark circuit using:
      • VPR’s timing-driven router
      • VPR’s fast timing-driven router (-fast option)
      • Riverside On-Chip Router (ROCR)

Page 18: A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation

Scalability of On-chip Routing

[Figure: Memory Usage. Bar chart of minimum, average, and maximum Memory Usage (KBytes), axis 0 to 140000, for VPR, VPR (Fast), and ROCR; labeled values 126602, 8352, and 113235]

  • VPR requires over 100 MB of memory on average
  • ROCR requires at most 8.3 MB
  • VPR requires 18X more memory than ROCR on average

Page 19: A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation

Scalability of On-chip Routing

[Figure: Algorithm Performance. Execution Time (s), axis 0 to 200, versus Circuit Size (CLBs), axis 0 to 4000, for VPR, VPR (Fast), and ROCR]

  • ROCR is over 40X faster than VPR for small HW circuits
  • ROCR is 2X-3X faster than VPR for large HW circuits

Page 20: A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation

Scalability of On-chip Routing

[Figure: Critical Path. Critical Path (ns), axis 0 to 200, versus Circuit Size (CLBs), for VPR, VPR (Fast), and ROCR]

  • 19% longer critical path than VPR, 2.6% shorter than VPR (Fast)
  • 30%/27% longer critical path than VPR/VPR (Fast)

Page 21: A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation

Scalability of On-chip Routing

[Figure: Wire Segments. Wire Segments, axis 0 to 90000, versus Circuit Size (Nets), for VPR, VPR (Fast), and ROCR]

  • ROCR requires 2%/8% fewer wire segments than VPR/VPR (Fast) for larger HW circuits

Page 22: A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation

Conclusions and Future Work

  • Conclusions
    • Demonstrated ROCR scales well as circuit size increases
      • On average 2X faster than VPR’s fast timing-driven router
      • Requires 18X less memory than VPR
    • Produces good circuit quality
      • Critical path 27% longer than VPR (Fast) on average
      • 2.6% shorter critical path for largest HW circuit
      • Requires on average 5% fewer wire segments

  • Future Work
    • Current project: Major microprocessor vendor is fabricating our custom FPGA
    • Improvements to Riverside On-Chip Router (ROCR)
      • Improve ROCR’s performance for large HW circuits
      • Incorporating timing information to achieve
      • Analyze the scalability of ROCR as circuit size approaches FPGA capacity
    • JIT FPGA Compilation
      • Development of standard HW binary
      • Support more complex FPGA architectures