A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation
Roman Lysecky (a), Frank Vahid (a)*, Sheldon X.-D. Tan (b)
(a) Department of Computer Science and Engineering
(b) Department of Electrical Engineering
University of California, Riverside
*Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and Xilinx
2/22
Introduction: Standard Binary - Separating Function and Architecture
[Figure: SW source goes through profiling and a standard compiler to produce an x86 binary.]
- Software binaries of the past
  - Binary reflected specific language of underlying architecture: limited portability
- Current "standard binary"
  - Concept: separate function from detailed architecture
  - Develop new architectures for existing applications
  - Trend towards dynamic translation and optimization
3/22
Introduction: But Today's Binaries are More than just Software
[Figure: SW source goes through profiling and a standard compiler to produce a SW binary that runs on Processor1, Processor2, or Processor3. Mixed SW and HW source goes through profiling and compiler/synthesis to produce a binary for platforms that combine processors with an FPGA.]
4/22
Introduction: Just-in-Time FPGA Compilation?
- JIT FPGA compilation
  - Idea: standard binary for FPGA
  - Similar benefits as standard binary for microprocessor
    - Portability, transparency, standard tools
  - Embedded JIT compilation tools optimized for each FPGA
[Figure: VHDL/Verilog goes through profiling and standard CAD tools to produce a standard HW binary, which an on-chip JIT FPGA compiler maps onto each FPGA (with MEM).]
5/22
Introduction: One Use of JIT FPGA Compilation
[Figure: a cable TV company ships a feature upgrade as a single SW binary that runs on ARM7, ARM9, ARM10, and ARM11 processors.]
6/22
Introduction: One Use of JIT FPGA Compilation
[Figure: when the feature upgrade also contains HW, the cable TV company ships the SW binary together with a separate HW netlist (Netlist1 to Netlist4, built from HW1 to HW4) for each of FPGA 1 to FPGA 4.]
7/22
Introduction: One Use of JIT FPGA Compilation
[Figure: with a JIT FPGA compiler on each device, the cable TV company ships one SW binary and one HW binary for FPGA 1 to FPGA 4.]
8/22
Introduction: Another Use - Warp Processors (Dynamic HW/SW Partitioning)
[Figure: warp processor containing a µP with I$ and D$, a profiler, a Dynamic Partitioning Module (DPM), and an FPGA; bar chart comparing time and energy for SW only versus HW/SW.]
1. Initially execute application in software only
2. Profile application to determine critical regions
3. Partition critical regions to hardware
4. Program configurable logic & update software binary
5. Partitioned application executes faster with lower energy consumption
Lysecky/Vahid, DATE’04; Stitt/Lysecky/Vahid, DAC’03; Stitt/Vahid, ICCAD’02
9/22
Introduction: Another Use - Warp Processors (Dynamic HW/SW Partitioning)
[Figure: warp processor (µP, I$, D$, profiler, FPGA) with the CAD flow executed by the DPM. Flow stages shown: Binary, Decompilation, Partitioning, RT Synthesis, JIT FPGA Compilation (Logic Synthesis, Tech. Mapping/Packing, Placement, Routing), Binary Updater. Outputs shown: HW Bitstream and Updated Binary. The Std. HW Binary is labeled at the JIT FPGA Compilation stage.]
Lysecky/Vahid, DATE’04; Stitt/Lysecky/Vahid, DAC’03; Stitt/Vahid, ICCAD’02
10/22
Introduction: Existing FPGAs Not Suitable for JIT FPGA Compilation
- Existing FPGAs require extremely complex CAD tools
  - Designed to handle large arbitrary circuits, ASIC prototyping, etc.
  - Require long execution times and very large memory usage
  - Not suitable for dynamic on-chip execution
[Figure: execution time per CAD stage: Log. Syn. 1 min; Tech. Map 1 min; Place 1-2 mins; Route 2-30 mins. Memory labels shown: 50 MB, 60 MB, 10 MB, 10 MB.]
11/22
JIT FPGA Compilation: CAD-Oriented FPGA
- Solution: Develop a custom CAD-oriented FPGA
  - Careful simultaneous design of FPGA and CAD
  - FPGA features evaluated for impact on CAD
- Enables development of fast, lean JIT FPGA compilation tools
[Figure: JIT FPGA compilation stages (Logic Synthesis, Tech. Mapping/Packing, Placement, Routing). Execution time labels shown: 1s, <1s, <1s, 10s. Memory labels shown: .5 MB, 1 MB, 1 MB, 3.6 MB.]
Lysecky/Vahid, DATE’04
12/22
Simple Configurable Logic Fabric: CAD-Oriented FPGA
[Figure: array of CLBs surrounded by SMs.]
- Simple Configurable Logic Fabric (CLF)
  - Hundreds of existing commercial and research FPGA fabrics
    - Most designed to balance circuit density and speed
  - Analyzed FPGA features to determine their impact on CAD
  - Designed our CLF in conjunction with JIT FPGA compilation tools
  - Array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs)
  - Each CLB is directly connected to an SM
    - Along with SM design, allows for design of lean JIT routing
Lysecky/Vahid, DATE’04
13/22
Simple Configurable Logic Fabric: Combinational Logic Block
- Combinational Logic Block
  - Incorporate two 3-input 2-output LUTs
    - Equivalent to four 3-input LUTs with fixed internal routing
    - Allows for good quality circuit while reducing JIT technology mapping complexity
  - Provide routing resources between adjacent CLBs to support carry chains
    - Reduces number of nets we need to route
- Existing FPGAs: Flexibility/Density: Large CLBs, various internal routing resources
- SCLF: Simplicity: Limited internal routing, reduce on-chip CAD complexity
[Figure: CLB with two LUTs, inputs a b c d e f, outputs o1 o2 o3 o4, and connections to adjacent CLBs.]
Lysecky/Vahid, DATE’04
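The CLB organization on this slide can be made concrete with a small model. The following is a minimal sketch, assuming each 3-input 2-output LUT is stored as two 8-entry truth tables, and that inputs a, b, c feed the first LUT while d, e, f feed the second. All class and function names are hypothetical, the input/output ordering is an assumption, and the carry-chain connections to adjacent CLBs are not modeled.

```python
from itertools import product


class Lut3x2:
    """Hypothetical model of a 3-input, 2-output LUT: two 8-entry truth tables."""

    def __init__(self, table0, table1):
        assert len(table0) == 8 and len(table1) == 8
        self.tables = (tuple(table0), tuple(table1))

    def evaluate(self, x, y, z):
        # The three input bits select one row of both truth tables.
        index = (x << 2) | (y << 1) | z
        return self.tables[0][index], self.tables[1][index]


def program(func):
    """Build a Lut3x2 from a Python function that returns two output bits."""
    rows = [func(x, y, z) for x, y, z in product((0, 1), repeat=3)]
    return Lut3x2([r[0] for r in rows], [r[1] for r in rows])


def full_adder(x, y, carry_in):
    total = x + y + carry_in
    return total & 1, total >> 1  # (sum bit, carry-out bit)


class Clb:
    """Two LUTs side by side: inputs a..f, outputs o1..o4 (assumed ordering)."""

    def __init__(self, lut_a, lut_b):
        self.lut_a, self.lut_b = lut_a, lut_b

    def evaluate(self, a, b, c, d, e, f):
        o1, o2 = self.lut_a.evaluate(a, b, c)
        o3, o4 = self.lut_b.evaluate(d, e, f)
        return o1, o2, o3, o4


# One CLB holding two independent full adders.
clb = Clb(program(full_adder), program(full_adder))
outputs = clb.evaluate(1, 1, 0, 1, 1, 1)
```

A full adder is a convenient example because it needs exactly three inputs and two outputs, so one LUT covers a function that would otherwise occupy two single-output 3-input LUTs.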
14/22
Simple Configurable Logic Fabric: Switch Matrix
[Figure: switch matrix with short channels 0, 1, 2, 3 and long channels 0L, 1L, 2L, 3L on each side.]
- Switch Matrix
  - All nets are routed using only a single pair of channels throughout the configurable logic fabric
    - Each short channel is associated with a single long channel
  - Designed for fast, lean JIT FPGA routing
- Existing FPGAs: Flexibility/Speed: Large routing resources, various routing options
- SCLF: Simplicity: Allow for design of fast, lean routing algorithm
Lysecky/Vahid, DATE’04
15/22
JIT FPGA Compilation: Routing
- FPGA Routing
  - Find a path within FPGA to connect source and sinks of each net within our hardware circuit
- Pathfinder [Ebeling, et al., 1995]
  - Introduced negotiated congestion
  - During each routing iteration, route nets using shortest path
    - Allows overuse (congestion) of resources
  - If congestion exists (illegal routing)
    - Update cost of congested resources
    - Rip-up all routes and reroute all nets
- VPR [Betz, et al., 1997]
  - Provides various improvements over Pathfinder
  - Routability-driven: Use fewest tracks possible
  - Timing-driven: Optimize circuit speed
  - Many techniques are used in commercial FPGA CAD tools
[Figure: routing example with per-resource usage counts, highlighting a congested resource.]
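The route, update-cost, rip-up loop described for Pathfinder can be sketched in a few lines. This is a simplified illustration, not Pathfinder's or VPR's actual cost function: it assumes an additive node cost of 1 plus a history penalty, omits the present-congestion term, and uses a hypothetical toy graph, net list, and `history_step` value chosen so the example converges.

```python
import heapq


def shortest_path(graph, cost, source, sink):
    """Dijkstra over nodes; a path costs the sum of the costs of nodes entered."""
    best = {source: 0.0}
    parent = {}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == sink:
            break
        if dist > best[node]:
            continue
        for nxt in graph[node]:
            cand = dist + cost(nxt)
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                parent[nxt] = node
                heapq.heappush(heap, (cand, nxt))
    path = [sink]
    while path[-1] != source:
        path.append(parent[path[-1]])
    return path[::-1]


def route_all(graph, capacity, nets, history_step=2.0, max_iterations=10):
    """Route every net, then penalize overused nodes and retry until legal."""
    history = {n: 0.0 for n in graph}
    for iteration in range(1, max_iterations + 1):
        usage = {n: 0 for n in graph}  # rip-up: every iteration starts empty
        routes = {}
        for name, (source, sink) in nets.items():
            path = shortest_path(graph, lambda n: 1.0 + history[n], source, sink)
            routes[name] = path
            for n in path:
                usage[n] += 1
        congested = [n for n in graph if usage[n] > capacity[n]]
        if not congested:
            return routes, iteration  # legal routing
        for n in congested:
            history[n] += history_step  # make congested resources more expensive
    raise RuntimeError("no legal routing found")


# Toy graph: both nets prefer the shared node M, which has capacity 1.
edges = [("s1", "M"), ("M", "t1"), ("s2", "M"), ("M", "t2"),
         ("s1", "X1"), ("X1", "X2"), ("X2", "t1"),
         ("s2", "Y1"), ("Y1", "Y2"), ("Y2", "Y3"), ("Y3", "Y4"), ("Y4", "t2")]
graph = {}
for u, v in edges:
    graph.setdefault(u, []).append(v)
    graph.setdefault(v, []).append(u)
capacity = {n: 1 for n in graph}
nets = {"A": ("s1", "t1"), "B": ("s2", "t2")}

routes, iterations = route_all(graph, capacity, nets)
```

In the first iteration both nets route through M and overuse it. After M's cost is raised, net A moves to its short detour while net B, whose detour is longer, keeps M, so the second iteration is legal.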
16/22
JIT FPGA Compilation: ROCR - Riverside On-Chip Router
- ROCR - Riverside On-Chip Router
- Resource Graph
  - Nodes correspond to SMs
  - Edges correspond to channels between SMs
  - Capacity of edge equal to the number of wires within the channel
  - Requires much less memory as resource graph is smaller
[Figure: fabric of SMs and CLBs next to its routing resource graph, in which SM nodes are joined by edges labeled 0/4.]
[Figure: flowchart: Route, then "illegal?"; if yes, Rip-up and route again; if no, Done!]
Lysecky/Vahid/Tan, DAC’04; Lysecky/Vahid, DATE’04
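The resource graph described on this slide is small enough to sketch directly. The code below is a minimal illustration, assuming an array of R x C CLBs is surrounded by (R + 1) x (C + 1) switch matrices and that each channel between neighboring SMs is one undirected edge storing a usage count and a capacity, matching the 0/4 labels in the figure. The function names and the data layout are hypothetical and are not taken from ROCR itself.

```python
def build_resource_graph(clb_rows, clb_cols, channel_width):
    """Return {edge: [usage, capacity]} for a grid of switch matrices (SMs).

    Nodes are SM coordinates (row, col); each edge is the channel between two
    neighboring SMs, and its capacity is the number of wires in the channel.
    """
    sm_rows, sm_cols = clb_rows + 1, clb_cols + 1
    edges = {}
    for r in range(sm_rows):
        for c in range(sm_cols):
            if c + 1 < sm_cols:
                edges[((r, c), (r, c + 1))] = [0, channel_width]
            if r + 1 < sm_rows:
                edges[((r, c), (r + 1, c))] = [0, channel_width]
    return edges


def occupy(edges, path):
    """Record one net routed along a sequence of adjacent SMs."""
    for a, b in zip(path, path[1:]):
        key = (a, b) if (a, b) in edges else (b, a)
        edges[key][0] += 1


def overused(edges):
    """Edges whose usage exceeds capacity, i.e. what makes a routing illegal."""
    return [e for e, (used, cap) in edges.items() if used > cap]


# 2x2 CLB fabric with 4 wires per channel, as drawn in the figure.
small = build_resource_graph(2, 2, 4)
occupy(small, [(0, 0), (0, 1), (1, 1)])

# Fabric size and channel width from the experimental setup.
large = build_resource_graph(100, 100, 34)
```

Under these assumptions the 100x100 fabric needs 10,201 nodes and 20,200 edges regardless of channel width, because the 34 wires of a channel are folded into one capacity value instead of being represented individually.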
17/22
Scalability of On-chip Routing: Experimental Setup
[Figure: array of CLBs surrounded by SMs.]
- Experimental Setup
  - 100x100 configurable logic fabric array
  - Routing channel width of 34
    - Large enough to support all HW circuits
  - 123 MCNC benchmark circuits
    - Circuit complexity ranges from a few LUTs to tens of thousands of LUTs
  - Performed technology mapping, packing, and placement using FlowMap, T-VPack, and VPR's bounding box placement
  - Routed each HW benchmark circuit using:
    - VPR's timing-driven router
    - VPR's fast timing-driven router (-fast option)
    - Riverside On-Chip Router (ROCR)
18/22
Scalability of On-chip Routing: Memory Usage
[Figure: bar chart of memory usage (KBytes, axis 0 to 140000) showing minimum, average, and maximum for VPR, VPR (Fast), and ROCR. Data labels shown: 126602, 8352, 113235.]
- VPR requires over 100 MB of memory on average
- ROCR requires at most 8.3 MB
- VPR requires 18X more memory than ROCR on average
19/22
Scalability of On-chip Routing: Algorithm Performance
[Figure: execution time (s, axis 0 to 200) versus circuit size (CLBs, axis 0 to 4000) for VPR, VPR (Fast), and ROCR.]
- ROCR is over 40X faster than VPR for small HW circuits
- ROCR is 2X-3X faster than VPR for large HW circuits
20/22
Scalability of On-chip Routing: Critical Path
[Figure: critical path (ns, axis 0 to 200) versus circuit size (CLBs) for VPR, VPR (Fast), and ROCR.]
- Largest HW circuit: 19% longer critical path than VPR, 2.6% shorter than VPR (Fast)
- On average: 30%/27% longer critical path than VPR/VPR (Fast)
21/22
Scalability of On-chip Routing: Wire Segments
[Figure: wire segments (axis 0 to 90000) versus circuit size (nets) for VPR, VPR (Fast), and ROCR.]
- ROCR requires 2%/8% fewer wire segments than VPR/VPR (Fast) for larger HW circuits
22/22
Conclusions and Future Work
- Conclusions
  - Demonstrated ROCR scales well as circuit size increases
    - On average 2X faster than VPR's fast timing-driven router
    - Requiring 18X less memory than VPR
  - Produces good circuit quality
    - Critical path 27% longer than VPR (Fast) on average
    - 2.6% shorter critical path for largest HW circuit
    - Requires on average 5% fewer wire segments
- Future Work
  - Current project: Major microprocessor vendor is fabricating our custom FPGA
  - Improvements to Riverside On-Chip Router (ROCR)
    - Improve ROCR's performance for large HW circuits
    - Incorporating timing information to achieve
    - Analyze the scalability of ROCR as circuit size approaches FPGA capacity
  - JIT FPGA Compilation
    - Development of standard HW binary
    - Support more complex FPGA architectures in JIT FPGA compilation