Dynamic FPGA Routing for Just-in-Time Compilation
Roman Lysecky (a), Frank Vahid (a,*), Sheldon X.-D. Tan (b)
(a) Department of Computer Science and Engineering, (b) Department of Electrical Engineering
University of California, Riverside
(*) Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and a Department of Education GAANN fellowship.

Introduction: Just-in-Time Compilation has Become Commonplace
Just-in-Time Compilation
  Modern Pentium processors
    Dynamically translate instructions onto underlying RISC architecture
  Transmeta Crusoe & Efficeon
    Dynamic code morphing
    Translate x86 instructions to underlying VLIW processor
  Interpreted languages
    Distribute SW as processor-independent bytecode/source
    SW typically executed on a virtual machine
    JIT compile bytecode to processor's native instructions
    Java, Python, etc.
[Figure: SW, Profiling, Standard Compiler, SW Binary, JIT Recompile, Processor]

Introduction: Just-in-Time Compilation also Performs Optimization
Dynamic optimizations are increasingly common
  Dynamically recompile binary during execution
  Dynamo [Bala, et al., 2000] - Dynamic software optimizations
    Identify frequently executed code segments (hotpaths)
    Recompile with higher optimization
  BOA [Gschwind, et al., 2000] - Dynamic optimizer for PowerPC
Advantages
  Transparent optimizations
    No designer effort
    No tool restrictions
  Adapts to actual usage
  Speedups of up to 20%-30% (up to 1.3X)
JIT compilation operates on software binaries

Introduction: But Today's Binaries are More than just Software
[Figure: the standard flow (SW, Profiling, Standard Compiler, SW Binary, Processor) shown next to a flow in which SW and HW descriptions pass through Profiling and Compiler/Synthesis to produce a Binary for a Processor plus FPGA]

Introduction: Just-in-Time FPGA Compilation?
JIT FPGA compilation
  Idea: standard binary for FPGA
  Similar benefits as standard binary for microprocessor
    Portability, transparency, standard tools
  Embedded JIT compilation tools optimized for each FPGA
[Figure: VHDL/Verilog, Profiling, Standard CAD Tools, Std. HW Binary, JIT FPGA Comp., FPGA]

Introduction: One Use of JIT FPGA Compilation
[Figure: a cable TV company distributes a feature upgrade as a single SW Binary to boxes with ARM7, ARM9, ARM10, and ARM11 processors]

Introduction: One Use of JIT FPGA Compilation
[Figure: the same feature upgrade with HW: boxes with FPGA 1 to FPGA 4 each receive the SW Binary plus their own HW netlist (HW Netlist1 to HW Netlist4), produced from separate designs HW1 to HW4]

Introduction: One Use of JIT FPGA Compilation
[Figure: the same feature upgrade with JIT FPGA compilation: one SW Binary and one HW Binary are distributed, and a JIT FPGA Comp. step on each of FPGA 1 to FPGA 4 handles the HW Binary]

Introduction: Another Use - Warp Processors (Dynamic HW/SW Partitioning)
1. Initially execute application in software only (µP, I$, D$)
2. Profile application to determine critical regions (Profiler)
3. Partition critical regions to hardware (Dynamic Part. Module (DPM))
4. Program configurable logic & update software binary (Warp Config. Logic Architecture)
5. Partitioned application executes faster with lower energy consumption
Lysecky/Vahid, DATE'04; Stitt/Lysecky/Vahid, DAC'03; Stitt/Vahid, ICCAD'02

Introduction: Another Use - Warp Processors (Dynamic HW/SW Partitioning)
[Figure: warp processor with ARM, I$, D$, WCLA, Profiler, and DPM. Tool flow labels: Binary, Decompilation, Partitioning, RT Synthesis, Std. HW Binary, JIT FPGA Compilation (Logic Synthesis, Tech. Mapping/Packing, Placement, Routing), HW Bitstream, Binary Updater, Updated Binary]
Lysecky/Vahid, DATE'04; Stitt/Lysecky/Vahid, DAC'03; Stitt/Vahid, ICCAD'02

Introduction: All that CAD on-chip?
CAD people may first think Just-in-Time FPGA compilation is "absurd"
  CAD tools are extremely complex
  Require long execution times on powerful desktop workstations
  Require very large memory resources
  Usually require GBytes of hard drive space
  Cost of a complete CAD tools package can exceed $1 million
All that CAD on-chip?
[Figure: CAD flow run times: Log. Syn. 1 min, Tech. Map 1 min, Place 1-2 mins, Route 2-30 mins; memory labels: 50 MB, 60 MB, 10 MB, 10 MB]

Simultaneous FPGA/CAD Design
Careful simultaneous design of configurable logic fabric and CAD tools
  Analyze architectural features as to their impact on on-chip Just-in-Time CAD tools
    Fast execution time
    Very low data memory
    Produce reasonable (good) hardware circuits

Simultaneous FPGA/CAD Design: Configurable Logic Fabric
Array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs)
  Each CLB is directly connected to a SM
Switch matrix connections
  Four short wires connect adjacent SMs
  Four long wires connect every other SM together
[Figure: grid of CLBs surrounded by SMs]
Lysecky/Vahid, DATE'04

Simultaneous FPGA/CAD Design: Combinational Logic Block Design
Incorporate two 3-input 2-output LUTs
  Corresponds to four 3-input LUTs
  Allows for good quality circuit while reducing on-chip CAD tools complexity
Provide routing resources between adjacent CLBs to support carry chains
[Figure: CLB with two LUTs, inputs a-f, outputs o1-o4, and connections to adjacent CLBs]
Lysecky/Vahid, DATE'04

Simultaneous FPGA/CAD Design: Switch Matrix
Switch Matrix
  SM connected using eight channels per side
    Four short channels
    Four long channels
  Routes wires from different sides using the same channel
    Each short channel is associated with a single long channel
    Wires are routed using a single pair of channels through the configurable logic fabric
[Figure: switch matrix with short channels 0-3 and long channels 0L-3L on each side]
Lysecky/Vahid, DATE'04

FPGA Routing
FPGA Routing
  Find a path within the FPGA to connect the source and sinks of each net within our hardware circuit
Typically use a form of maze routing [Lee, 1961]
  Routes each net using Dijkstra's shortest path algorithm
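
A minimal sketch of maze routing for one two-pin net, using Dijkstra's algorithm over a uniform grid. The grid model, the unit step cost, and the names `maze_route` and `blocked` are assumptions made for illustration; they are not taken from the slides.

```python
import heapq

def maze_route(width, height, blocked, source, sink):
    """Return a cheapest list of (x, y) cells from source to sink, or None."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == sink:
            # Walk the predecessor links back to the source.
            path = [cell]
            while cell in prev:
                cell = prev[cell]
                path.append(cell)
            return path[::-1]
        if d > dist[cell]:
            continue  # stale heap entry
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            if not (0 <= nx < width and 0 <= ny < height) or nxt in blocked:
                continue
            nd = d + 1  # assumed unit cost per grid step
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = cell
                heapq.heappush(heap, (nd, nxt))
    return None  # sink unreachable

# Route around a two-cell obstacle on a 3x3 grid.
path = maze_route(3, 3, {(1, 0), (1, 1)}, (0, 0), (2, 0))
```

A multi-sink net would repeat the search from the partial route to each remaining sink; that extension is omitted here.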

FPGA Routing: Pathfinder [Ebeling, et al., 1995]
Introduced negotiated congestion
  During each routing iteration, route nets using shortest path
    Allows overuse (congestion) of routing resources
  If congestion exists (illegal routing)
    Update cost of congested resources based on the amount of overuse
    Rip-up all routes and reroute all nets
[Figure: routing example in which resource costs of 1 rise to 2 where congestion occurs]
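
The negotiated congestion loop above can be sketched as follows. This is a simplified illustration, not the published Pathfinder implementation: the node cost is assumed to be (base + history) times a present-overuse penalty, the base cost is fixed at 1, and `negotiated_route`, `graph`, `capacity`, and `nets` are hypothetical names. The graph is assumed to be connected.

```python
import heapq

def shortest_path(graph, cost, src, dst):
    """Dijkstra over node costs; returns a list of nodes from src to dst."""
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            path = [node]
            while node in prev:
                node = prev[node]
                path.append(node)
            return path[::-1]
        if d > dist[node]:
            continue
        for nxt in graph[node]:
            nd = d + cost(nxt)
            if nd < dist.get(nxt, float("inf")):
                dist[nxt], prev[nxt] = nd, node
                heapq.heappush(heap, (nd, nxt))
    return None

def negotiated_route(graph, capacity, nets, max_iters=20):
    """Return {net: path} with no overused node, or None if none is found."""
    history = {n: 0.0 for n in graph}
    for _ in range(max_iters):
        usage = {n: 0 for n in graph}  # rip up all routes
        routes = {}
        for name, (src, dst) in nets.items():
            def cost(n):
                over = max(0, usage[n] + 1 - capacity[n])
                return (1.0 + history[n]) * (1.0 + over)
            routes[name] = shortest_path(graph, cost, src, dst)
            for n in routes[name]:
                usage[n] += 1  # overuse is allowed within an iteration
        overused = [n for n in graph if usage[n] > capacity[n]]
        if not overused:
            return routes  # legal routing
        for n in overused:
            history[n] += usage[n] - capacity[n]
    return None

graph = {
    "a1": ["m"], "b1": ["m"], "m": ["a1", "b1", "a2", "b2"],
    "a2": ["m", "x"], "x": ["a2", "y"], "y": ["x", "b2"], "b2": ["m", "y"],
}
capacity = {n: 1 for n in graph}
routes = negotiated_route(graph, capacity, {"n1": ("a1", "b1"), "n2": ("a2", "b2")})
```

In this example both nets initially want node `m`; once its history cost grows, net `n2` takes the longer detour through `x` and `y`.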

FPGA Routing: VPR - Versatile Place and Route [Betz, et al., 1997]
Uses modified Pathfinder algorithm
  Increased performance over original Pathfinder algorithm
  Routability-driven routing
    Goal: Use fewest tracks possible
  Timing-driven routing
    Goal: Optimize circuit speed
[Figure: flowchart: Routing Resource Graph, Route, congestion (illegal)? If yes, Rip-up and route again; if no, Done]

JIT FPGA Routing: Riverside On-Chip Router (ROCR)
Represent routing nets between CLBs as routing between SMs
Resource Graph
  Nodes correspond to SMs
  Edges correspond to channels between SMs
  Capacity of edge equal to the number of wires within the channel
Requires much less memory than VPR as resource graph is much smaller
[Figure: 3x3 grid of SMs with every edge labeled 0/4]
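
A minimal sketch of such a resource graph, with one node per SM and one capacity-limited edge per channel. Modeling long channels as separate edges between SMs two positions apart is an assumption based on the fabric description (four long wires connect every other SM); the function name and edge encoding are hypothetical.

```python
def build_sm_resource_graph(rows, cols, wires_per_channel=4):
    """Map each SM-to-SM edge (a, b, kind) to a [used, capacity] pair."""
    edges = {}
    for r in range(rows):
        for c in range(cols):
            # step 1: short channel to the adjacent SM
            # step 2: long channel to every other SM (assumed model)
            for step, kind in ((1, "short"), (2, "long")):
                if c + step < cols:
                    edges[((r, c), (r, c + step), kind)] = [0, wires_per_channel]
                if r + step < rows:
                    edges[((r, c), (r + step, c), kind)] = [0, wires_per_channel]
    return edges

graph = build_sm_resource_graph(3, 3)
```

For a 3x3 array this gives 9 nodes and 18 edges (12 short, 6 long), each starting at 0 of 4 wires used. The graph stays small because individual wires are counted per edge rather than stored as separate nodes.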

JIT FPGA Routing: Riverside On-Chip Router (ROCR) - Global Routing
Based on VPR's routability-driven router
  Utilizes similar cost model consisting of base, historical congestion, and current congestion costs
Routes nets between SMs using greedy, depth-first routing algorithm
  Faster than traditional VPR's breadth-first routing method
  Requires addition of adjustment cost to direct ROCR to re-route illegal nets using different initial routing path
Ignores illegal routing within SMs
If congestion exists, rip-up and re-route only the illegal routes
  Reduces computation time during successive routing iterations
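
A rough sketch of a greedy, depth-first route between two SMs. The slides do not give ROCR's exact cost formula or neighbor ordering, so several parts here are assumptions: the edge cost takes a Pathfinder-style form built from base, historical, and current congestion terms, neighbors are tried in order of edge cost plus Manhattan distance to the sink, only short channels are modeled, and the adjustment cost is omitted. All names are hypothetical.

```python
def edge_cost(used, capacity, history, base=1.0):
    """(base + historical congestion) scaled by current congestion."""
    over = max(0, used + 1 - capacity)
    return (base + history) * (1 + over)

def greedy_dfs_route(rows, cols, usage, history, src, dst, capacity=4):
    """Depth-first search that always tries the cheapest-looking neighbor first."""
    path, visited = [src], {src}

    def visit(node):
        if node == dst:
            return True
        r, c = node
        options = []
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols) or nxt in visited:
                continue
            edge = (min(node, nxt), max(node, nxt))
            cost = edge_cost(usage.get(edge, 0), capacity, history.get(edge, 0.0))
            remaining = abs(nxt[0] - dst[0]) + abs(nxt[1] - dst[1])
            options.append((cost + remaining, nxt))
        for _, nxt in sorted(options):
            visited.add(nxt)
            path.append(nxt)
            if visit(nxt):
                return True
            path.pop()  # dead end: backtrack
        return False

    return path if visit(src) else None

route = greedy_dfs_route(3, 3, usage={}, history={}, src=(0, 0), dst=(2, 2))
```

Unlike a breadth-first wavefront, this search commits to the first path it completes, so it does less work per net but may return a route that is not the cheapest one.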

JIT FPGA Routing: Riverside On-Chip Router (ROCR) - Detailed Routing
Assign specific channels to each route
Construct routing conflict graph
  Routes conflict if assigning same channel results in an illegal routing within any SM
Use Brelaz's greedy vertex coloring algorithm [Brelaz, 1979]
If illegal routes exist, rip-up illegal routes and repeat global routing
[Figure: switch matrix with channels 0-3 and 0L-3L carrying routes R1-R4, and the conflict graph over those routes]
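
Channel assignment by greedy vertex coloring can be sketched as below, following the saturation-degree ordering of Brelaz's heuristic: each color stands for a channel, and a route that finds no free channel is reported as illegal. The conflict graph in the example is invented for illustration, and the function name and tie-breaking rule are assumptions.

```python
def assign_channels(conflicts, num_channels):
    """Color the conflict graph; returns (channel per route, illegal routes)."""
    channel = {}
    illegal = []
    uncolored = set(conflicts)
    while uncolored:
        def saturation(v):
            # Number of distinct channels already taken by conflicting routes.
            return len({channel[u] for u in conflicts[v] if u in channel})
        # Most-constrained route first; ties broken by degree, then by name.
        v = max(uncolored, key=lambda r: (saturation(r), len(conflicts[r]), r))
        taken = {channel[u] for u in conflicts[v] if u in channel}
        free = [ch for ch in range(num_channels) if ch not in taken]
        if free:
            channel[v] = free[0]
        else:
            illegal.append(v)  # would be ripped up and globally re-routed
        uncolored.remove(v)
    return channel, illegal

# Hypothetical conflict graph: R1, R2, R3 conflict pairwise; R4 conflicts with R3.
conflicts = {
    "R1": {"R2", "R3"},
    "R2": {"R1", "R3"},
    "R3": {"R1", "R2", "R4"},
    "R4": {"R3"},
}
channels, illegal = assign_channels(conflicts, num_channels=4)
```

With four channels available every route in this example receives a channel; with only two, one route of the R1-R2-R3 triangle is left illegal.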

Experiments: Memory Usage
[Figure: bar chart of memory usage (KB, axis 0-70000) per benchmark for VPR (RD), VPR (TD), and ROCR]
VPR requires over 50 MB of memory at its peak, with an average of over 20 MB
ROCR requires at most 3.6 MB
  13X less than VPR on average

Experiments: Algorithm Performance
[Figure: bar chart of execution time (s, axis 0-60) per benchmark for VPR (TD) and ROCR]
ROCR is on average 10X faster than VPR (TD)
  Up to 21X faster for ex5p

Experiments: Critical Path Results
[Figure: bar chart of critical path (ns, axis 0-150) per benchmark for VPR (RD), VPR (TD), and ROCR]
ROCR: 32% longer critical path than VPR (TD)
  But 10% shorter critical path than VPR (RD)

Experiments: Wire Segments
[Figure: bar chart of wire segments (axis 0-45000) per benchmark for VPR (RD), VPR (TD), and ROCR]
ROCR: 10% more wire segments than VPR (TD/RD)

Conclusions
Developed Riverside On-Chip Router (ROCR)
  Fast, lean on-chip router for JIT FPGA compilation
    Order of magnitude less memory required
    On average 10X faster than VPR's faster routing algorithm
  Produces acceptable circuit quality
    Uses only 10% more routing resources
    Critical path 10% shorter than VPR's routability-driven router
JIT FPGA Compilation
  Enables development of a standard HW binary
  Brings portability of SW design to HW designers
  Presently requires custom FPGA fabric
    Future work - Overhead of mapping simple fabric onto commercial fabric?