Dynamic FPGA Routing for Just-in-Time Compilation

Roman Lysecky (a), Frank Vahid (a)*, Sheldon X.-D. Tan (b)

(a) Department of Computer Science and Engineering
(b) Department of Electrical Engineering
University of California, Riverside
*Also with the Center for Embedded Computer Systems at UC Irvine

This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and a Department of Education GAANN fellowship.

Introduction: Just-in-Time Compilation has Become Commonplace

Just-in-Time Compilation
  Modern Pentium processors
    Dynamically translate instructions onto underlying RISC architecture
  Transmeta Crusoe & Efficeon
    Dynamic code morphing
    Translate x86 instructions to underlying VLIW processor
  Interpreted languages
    Distribute SW as processor-independent bytecode/source
    SW typically executed on a virtual machine
    JIT compile bytecode to processor’s native instructions
    Java, Python, etc.

[Figure: SW passes through profiling and a standard compiler to produce a SW binary; a JIT recompiles the binary for the processor.]

Introduction: Just-in-Time Compilation also Performs Optimization

Dynamic optimizations are increasingly common
  Dynamically recompile binary during execution
  Dynamo [Bala, et al., 2000] - Dynamic software optimizations
    Identify frequently executed code segments (hotpaths)
    Recompile with higher optimization
  BOA [Gschwind, et al., 2000] - Dynamic optimizer for PowerPC

Advantages
  Transparent optimizations
    No designer effort
    No tool restrictions
  Adapts to actual usage
  Speedups of up to 20%-30% (up to 1.3X)

JIT compilation operates on software binaries

Introduction: But Today’s Binaries are More than just Software

[Figure: in the traditional flow, SW passes through profiling and a standard compiler to produce a SW binary that runs on several processors. In the newer flow, SW and HW descriptions pass through profiling and compiler/synthesis to produce a binary for platforms that combine a processor with an FPGA.]

Introduction: Just-in-Time FPGA Compilation?

JIT FPGA compilation
  Idea: standard binary for FPGA
  Similar benefits as standard binary for microprocessor
    Portability, transparency, standard tools
  Embedded JIT compilation tools optimized for each FPGA

[Figure: VHDL/Verilog passes through profiling and standard CAD tools to produce a standard HW binary; a JIT FPGA compiler on each device maps that binary onto its own FPGA.]

Introduction: One Use of JIT FPGA Compilation

[Figure: a cable TV company distributes a feature upgrade. With software only, a single SW binary serves devices built on ARM7, ARM9, ARM10, and ARM11 processors.]

[Figure: when the devices also contain FPGAs (FPGA 1 to FPGA 4) and no standard HW binary exists, the company must create HW1 to HW4 and ship a separate HW netlist (HW Netlist1 to HW Netlist4) alongside the SW binary for each device.]

[Figure: with JIT FPGA compilation, the company ships one SW binary and one HW binary; the JIT FPGA compiler on each device maps the HW binary onto that device’s FPGA.]

Introduction: Another Use - Warp Processors (Dynamic HW/SW Partitioning)

1. Initially execute application in software only (µP, I$, D$)
2. Profile application to determine critical regions (Profiler)
3. Partition critical regions to hardware (Dynamic Part. Module (DPM))
4. Program configurable logic & update software binary (Warp Config. Logic Architecture)
5. Partitioned application executes faster with lower energy consumption

Lysecky/Vahid, DATE’04; Stitt/Lysecky/Vahid, DAC’03; Stitt/Vahid, ICCAD’02

Introduction: Another Use - Warp Processors (Dynamic HW/SW Partitioning)

[Figure: warp processor consisting of ARM, I$, D$, WCLA, Profiler, and DPM.]

[Figure: DPM tool flow. Stages shown: Binary, Decompilation, Partitioning, RT Synthesis, Std. HW Binary, JIT FPGA Compilation (Logic Synthesis, Tech. Mapping/Packing, Placement, Routing), HW Bitstream, Binary Updater, Updated Binary.]

Lysecky/Vahid, DATE’04; Stitt/Lysecky/Vahid, DAC’03; Stitt/Vahid, ICCAD’02

Introduction: All that CAD on-chip?

CAD people may first think Just-in-Time FPGA compilation is “absurd”

CAD tools are extremely complex
  Require long execution times on powerful desktop workstations
  Require very large memory resources
  Usually require GBytes of hard drive space
  Costs of complete CAD tools package can exceed $1 million

All that CAD on-chip?

[Figure: desktop CAD flow with run time per stage: Logic Synthesis 1 min, Technology Mapping 1 min, Placement 1-2 mins, Routing 2-30 mins. Memory labels shown in the figure: 50 MB, 60 MB, 10 MB, 10 MB.]

Simultaneous FPGA/CAD Design

Careful simultaneous design of configurable logic fabric and CAD tools

Analyze architectural features as to their impacts on on-chip Just-in-Time CAD tools
  Fast execution time
  Very low data memory
  Produce reasonable (good) hardware circuits

Simultaneous FPGA/CAD Design: Configurable Logic Fabric

[Figure: array of CLBs, each surrounded by SMs.]

Array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs)
Each CLB is directly connected to a SM
Switch matrix connections
  Four short wires connect adjacent SMs
  Four long wires connect every other SM together

Lysecky/Vahid, DATE’04

Simultaneous FPGA/CAD Design: Combinational Logic Block Design

Incorporate two 3-input 2-output LUTs
  Corresponds to four 3-input LUTs
  Allows for good quality circuit while reducing on-chip CAD tools complexity
Provide routing resources between adjacent CLBs to support carry chains

[Figure: CLB containing two LUTs with inputs a-f and outputs o1-o4, plus connections to the adjacent CLBs.]
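To illustrate why a 3-input 2-output LUT suits carry chains, here is a minimal sketch that configures one such LUT as a full adder (sum and carry-out) and chains copies of it into a ripple adder. The class and function names (Lut3x2, full_adder_lut, ripple_add) are hypothetical and only model behavior; they do not describe the actual fabric or its configuration format.

```python
class Lut3x2:
    """Hypothetical behavioral model of a 3-input, 2-output LUT."""

    def __init__(self, table_o1, table_o2):
        # One 8-entry truth table per output; index = a*4 + b*2 + c.
        assert len(table_o1) == 8 and len(table_o2) == 8
        self.tables = (tuple(table_o1), tuple(table_o2))

    def eval(self, a, b, c):
        idx = (a << 2) | (b << 1) | c
        return self.tables[0][idx], self.tables[1][idx]


def full_adder_lut():
    # Output 1 is the sum bit (XOR of the inputs).
    sums = [((i >> 2) & 1) ^ ((i >> 1) & 1) ^ (i & 1) for i in range(8)]
    # Output 2 is the carry-out (majority of the inputs).
    carries = [1 if bin(i).count("1") >= 2 else 0 for i in range(8)]
    return Lut3x2(sums, carries)


def ripple_add(x, y, width):
    """Add two unsigned integers bit by bit, passing the carry between LUTs."""
    lut = full_adder_lut()
    carry, result = 0, 0
    for bit in range(width):
        s, carry = lut.eval((x >> bit) & 1, (y >> bit) & 1, carry)
        result |= s << bit
    return result, carry
```

In this model the carry returned by one LUT evaluation feeds the next one, which is the kind of connection the dedicated routing between adjacent CLBs would carry.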


Simultaneous FPGA/CAD Design: Switch Matrix

[Figure: switch matrix with short channels 0-3 and long channels 0L-3L on each of its four sides.]

Switch Matrix
  SM connected using eight channels per side
    Four short channels
    Four long channels
  Routes wires from different sides using the same channel
    Each short channel is associated with a single long channel
    Wires are routed using a single pair of channels through configurable logic fabric


FPGA Routing

FPGA Routing
  Find a path within FPGA to connect source and sinks of each net within our hardware circuit
  Typically use a form of maze routing [Lee, 1961]
    Routes each net using Dijkstra’s shortest path algorithm
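The maze routing idea above can be sketched as Dijkstra's algorithm on a small grid. This is a generic illustration, not the routing code of any particular tool: the function name maze_route, the unit cost per step, and the blocked-cell set are all assumptions made for the example.

```python
import heapq


def maze_route(grid_w, grid_h, source, sink, blocked=frozenset()):
    """Return a lowest-cost list of (x, y) cells from source to sink, or None."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == sink:
            # Walk the predecessor links back to the source.
            path = [cell]
            while cell in prev:
                cell = prev[cell]
                path.append(cell)
            return path[::-1]
        if d > dist[cell]:
            continue  # stale heap entry
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            if not (0 <= nx < grid_w and 0 <= ny < grid_h) or nxt in blocked:
                continue
            nd = d + 1  # assumed unit cost per routing step
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = cell
                heapq.heappush(heap, (nd, nxt))
    return None
```

A multi-sink net would repeat the search for each sink; that extension is left out to keep the sketch short.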

FPGA Routing: Pathfinder [Ebeling, et al., 1995]

Introduced negotiated congestion
  During each routing iteration, route nets using shortest path
    Allows overuse (congestion) of routing resources
  If congestion exists (illegal routing)
    Update cost of congested resources based on the amount of overuse
    Rip-up all routes and reroute all nets

[Figure: routing resource graph in which every resource starts with cost 1; the congested resource has its cost raised to 2.]
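The negotiated congestion loop described above can be sketched as follows. This is a simplified illustration: it keeps only the accumulated cost update named on the slide, whereas the published Pathfinder cost also includes a present-congestion term. The graph representation, the function names, and the assumption that every net is routable are choices made for this example.

```python
import heapq


def shortest_path(adj, cost, src, dst):
    """Dijkstra over node costs; returns the node list from src to dst, or None."""
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            path = [node]
            while node in prev:
                node = prev[node]
                path.append(node)
            return path[::-1]
        if d > dist[node]:
            continue
        for nxt in adj[node]:
            nd = d + cost[nxt]
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    return None


def negotiate(adj, nets, capacity=1, max_iters=20):
    """Route all nets repeatedly, raising the cost of overused nodes each round."""
    cost = {n: 1 for n in adj}  # every resource starts with cost 1
    for iteration in range(1, max_iters + 1):
        # Rip up everything and reroute every net (assumes each net is routable).
        routes = {name: shortest_path(adj, cost, s, t)
                  for name, (s, t) in nets.items()}
        usage = {}
        for path in routes.values():
            for n in path:
                usage[n] = usage.get(n, 0) + 1
        overused = {n: u - capacity for n, u in usage.items() if u > capacity}
        if not overused:
            return routes, iteration  # legal routing found
        for n, over in overused.items():
            cost[n] += over  # penalty grows with the amount of overuse
    return None, max_iters
```

With two nets that both prefer the same middle node, the cost of that node rises until the net that has an alternative path moves away from it.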

FPGA Routing: VPR - Versatile Place and Route [Betz, et al., 1997]

Uses modified Pathfinder algorithm
  Increased performance over original Pathfinder algorithm
  Routability-driven routing
    Goal: Use fewest tracks possible
  Timing-driven routing
    Goal: Optimize circuit speed

[Figure: routing flow. Build the routing resource graph, then route; if the routing is congested (illegal), rip-up and route again; otherwise done.]

JIT FPGA Routing: Riverside On-Chip Router (ROCR)

Represent routing nets between CLBs as routing between SMs
Resource Graph
  Nodes correspond to SMs
  Edges correspond to channels between SMs
  Capacity of edge equal to the number of wires within the channel
Requires much less memory than VPR as resource graph is much smaller

[Figure: 3x3 grid of SMs; every edge is labeled 0/4 (wires used / capacity).]
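A minimal sketch of such a resource graph, assuming SMs are identified by (row, column) and each edge stores [used, capacity]. The function name build_sm_graph and the dictionary layout are hypothetical; short edges join adjacent SMs and long edges join every other SM, each with four wires as on the fabric slides.

```python
def build_sm_graph(rows, cols, wires_per_channel=4):
    """Return {(sm_a, sm_b): [used, capacity]} for short and long channels."""
    edges = {}
    for r in range(rows):
        for c in range(cols):
            # step 1 = short channel (adjacent SM), step 2 = long channel
            for step in (1, 2):
                if c + step < cols:
                    edges[((r, c), (r, c + step))] = [0, wires_per_channel]
                if r + step < rows:
                    edges[((r, c), (r + step, c))] = [0, wires_per_channel]
    return edges
```

For a 3x3 array this yields 18 edges, so the graph grows with the number of SMs rather than with the number of individual wires.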

JIT FPGA Routing: Riverside On-Chip Router (ROCR) - Global Routing

Based on VPR’s routability-driven router
  Utilizes similar cost model consisting of base, historical congestion, and current congestion costs
Routes nets between SMs using greedy, depth-first routing algorithm
  Faster than traditional VPR’s breadth-first routing method
  Requires addition of adjustment cost to direct ROCR to re-route illegal nets using different initial routing path
  Ignores illegal routing within SMs
If congestion exists, rip-up and re-route only the illegal routes
  Reduces computation time during successive routing iterations
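The greedy, depth-first idea can be sketched as below. The slides do not give ROCR's exact cost formula or its adjustment cost, so edge_cost uses an assumed Pathfinder-style combination of base, historical, and current congestion costs, and the caller supplies the ordering cost. All names here are hypothetical.

```python
def edge_cost(base, historical, used, capacity):
    """Assumed cost model: (base + historical) scaled by current congestion."""
    # Overuse that would result if one more wire were routed through this edge.
    present = 1 + max(0, used + 1 - capacity)
    return (base + historical) * present


def greedy_dfs_route(neighbors, cost, source, sink):
    """Depth-first search that always tries the cheapest unvisited neighbor first."""
    path, visited = [source], {source}

    def visit(node):
        if node == sink:
            return True
        for nxt in sorted(neighbors[node], key=lambda n: cost(node, n)):
            if nxt in visited:
                continue
            visited.add(nxt)
            path.append(nxt)
            if visit(nxt):
                return True
            path.pop()  # dead end: backtrack and try the next neighbor
        return False

    return path if visit(source) else None
```

Unlike a breadth-first wavefront, this search commits to one direction at a time and only explores alternatives when it hits a dead end, which is where the speed advantage claimed above would come from. In practice the cost passed in would need a term that steers toward the sink, for example the remaining distance.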

JIT FPGA Routing: Riverside On-Chip Router (ROCR) - Detailed Routing

Assign specific channels to each route
Construct routing conflict graph
  Routes conflict if assigning same channel results in an illegal routing within any SM
Use Brelaz’s greedy vertex coloring algorithm [Brelaz, 1979]
If illegal routes exist, rip-up illegal routes and repeat global routing

[Figure: switch matrix carrying routes R1-R4 and the corresponding routing conflict graph over R1, R2, R3, and R4.]
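Channel assignment by Brelaz's method (often called DSATUR) can be sketched as follows. Colors stand for channels, and a route that cannot be colored is marked None so that it can be ripped up. The function name dsatur_assign and the example conflict graph are hypothetical; the conflict edges in the figure are not recoverable here.

```python
def dsatur_assign(conflicts, num_channels):
    """Greedy DSATUR coloring: returns {route: channel}, None if no channel is free."""
    assignment = {}
    uncolored = set(conflicts)
    while uncolored:
        def saturation(route):
            # Number of distinct channels already used by conflicting routes.
            return len({assignment[n] for n in conflicts[route]
                        if assignment.get(n) is not None})

        # Pick the most constrained route; break ties by conflict count.
        route = max(sorted(uncolored),
                    key=lambda r: (saturation(r), len(conflicts[r])))
        taken = {assignment.get(n) for n in conflicts[route]}
        free = [ch for ch in range(num_channels) if ch not in taken]
        assignment[route] = free[0] if free else None
        uncolored.remove(route)
    return assignment


# Hypothetical conflict graph: R3 conflicts with every other route.
example = {
    "R1": {"R2", "R3"},
    "R2": {"R1", "R3"},
    "R3": {"R1", "R2", "R4"},
    "R4": {"R3"},
}
channels = dsatur_assign(example, num_channels=4)
```

Routes left at None correspond to the illegal routes that would be ripped up before global routing is repeated.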

Experiments: Memory Usage

[Figure: bar chart of memory usage (KB, axis 0 to 70000) per benchmark for VPR (RD), VPR (TD), and ROCR.]

VPR requires over 50 MB of memory in the worst case, with an average of over 20 MB

ROCR requires at most 3.6 MB
  13X less than VPR on average

Experiments: Algorithm Performance

[Figure: bar chart of execution time (s, axis 0 to 60) per benchmark for VPR (TD) and ROCR.]

ROCR is on average 10X faster than VPR (TD)
  Up to 21X faster for ex5p

Experiments: Critical Path Results

[Figure: bar chart of critical path (ns, axis 0 to 150) per benchmark for VPR (RD), VPR (TD), and ROCR.]

32% longer critical path than VPR (TD)

But 10% shorter critical path than VPR (RD)

Experiments: Wire Segments

[Figure: bar chart of wire segments (axis 0 to 45000) per benchmark for VPR (RD), VPR (TD), and ROCR.]

10% more wire segments than VPR (TD/RD)

Conclusions

Developed Riverside On-Chip Router (ROCR)
  Fast, lean on-chip router for JIT FPGA compilation
    Order of magnitude less memory required
    On average 10X faster than VPR’s faster routing algorithm
  Produces acceptable circuit quality
    Uses only 10% more routing resources
    Critical path 10% shorter than VPR’s routability-driven router

JIT FPGA Compilation
  Enables development of a standard HW binary
    Brings portability of SW design to HW designers
  Presently requires custom FPGA fabric
    Future work - Overhead of mapping simple fabric onto commercial fabric?
