Clustering of Large Designs for Channel-Width Constrained FPGAs
Marvin Tom, Guy Lemieux
University of British Columbia
Department of Electrical and Computer Engineering
Vancouver, BC, Canada
Overview
• Introduction, Goals and Motivation
– Reduce channel width, lower cost, make circuits “routable”
• Reducing Channel Width By Depopulation
• Large Benchmark Circuits
• New Clustering Technique
– Selective Depopulation
• Conclusions and Future Work
Mesh-Based FPGA Architecture
• Channel width
– Number of routing tracks per channel
• Larger FPGA devices: more tiles
– Channel width is fixed
[Figure: mesh (grid) of logic tiles, each labeled L]
Motivation: Area of FPGA Devices
[Figure: MCNC Circuits Mapped onto an FPGA. Routed channel width (10 to 90) vs. CLB count (0 to 300) for alu4, apex2, apex4, bigkey, des, diffeq, dsip, elliptic, ex1010, ex5p, frisc, misex3, pdc, s298, s38417, s38584, seq, spla, tseng]
• Number of layout tiles
• SIZE of layout tile
• Total layout AREA = SIZE * Number
Motivation: Channel Width Demand
[Figure: MCNC Circuits Mapped onto an FPGA. Routed channel width (10 to 90) vs. CLB count (0 to 300)]
• Logic range: user buys bigger device.
• Interconnect range: user has no choice!
• Devices built for worst-case channel width (fixed width)
• Interconnect cost dominates (>70%)
Goal: Reduce Channel Width
[Figure: routed channel width (10 to 90) vs. CLB count (0 to 300) for the MCNC circuits]
Altera Cyclone
• Channel width constraint of 80 routing tracks
Constrained FPGA
• Channel width constraint of 60 routing tracks
• Smaller area, lower cost for low-channel-width circuits
But { apex4, elliptic, frisc, ex1010, spla, pdc } are unroutable…
Can we make them routable in a Constrained FPGA?
[Figure: routed channel width (10 to 90) vs. CLB count (0 to 700) for the MCNC circuits, now including clma; pdc, ex1010, frisc, spla, apex4 and elliptic are labeled]
Possible Solution
• Trade off logic utilization for channel width
– User can always buy more logic… (not more wires)
[Figure: tile grids of FPGA 1 and FPGA 2]
• Trade-off: CLB count for channel width
But… can we achieve lower Total Area? ( = SIZE * CLB Count)
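The Total Area question can be made concrete with a toy model. The sketch below is an illustration only: it assumes (this is not from the talk) that the SIZE of a tile is a fixed logic part plus a routing part proportional to channel width, and the constants `logic_area` and `area_per_track` are hypothetical. With these constants routing is about 73% of a tile at 80 tracks, roughly in line with the ">70%" interconnect share quoted earlier.

```python
def tile_size(channel_width, logic_area=30.0, area_per_track=1.0):
    """Toy tile model (an assumption): logic area plus routing area
    that grows linearly with the number of tracks per channel."""
    return logic_area + area_per_track * channel_width


def total_area(clb_count, channel_width):
    """Total layout area = SIZE of one tile * number of tiles (CLB count)."""
    return tile_size(channel_width) * clb_count


# Hypothetical devices: FPGA 1 has fewer CLBs and wide channels,
# FPGA 2 has 20% more CLBs and narrower channels.
fpga1 = total_area(clb_count=100, channel_width=80)
fpga2 = total_area(clb_count=120, channel_width=60)
print(fpga1, fpga2)
```

Under these made-up numbers the larger CLB count is more than paid for by the smaller tile, which is the outcome the slide is asking about. Whether it happens for a real circuit depends on how many extra CLBs the narrower channel costs.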
Logic Element: BLE and CLB
• Basic Logic Element (BLE)
– ‘k’-input LUT + FF
• Clustered Logic Block (CLB)
– ‘N’ BLEs, ‘N’ outputs
– ‘I’ shared inputs
[Figure: CLB with ‘I’ inputs and ‘N’ outputs containing BLE #1 to BLE #5, shown within a grid of logic tiles]
Note: I < k*N
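A minimal data-structure sketch of the BLE/CLB relationship, to show why I < k*N can still work: BLEs in one CLB may share input nets, and nets produced inside the CLB need no input pin. The class and method names are hypothetical, and two things are assumptions rather than statements from the talk: that a BLE's output net is named after the BLE, and that BLE outputs can feed other BLEs in the same CLB without using an input pin.

```python
from dataclasses import dataclass, field


@dataclass
class BLE:
    """Basic Logic Element: one k-input LUT plus a flip-flop."""
    name: str            # also used as the name of its output net (assumption)
    inputs: frozenset    # names of the nets feeding the LUT (at most k)


@dataclass
class CLB:
    """Clustered Logic Block: up to n BLEs sharing at most i input pins."""
    n: int
    i: int
    bles: list = field(default_factory=list)

    def external_inputs(self, extra=None):
        """Distinct nets that must enter through CLB input pins."""
        members = self.bles + ([extra] if extra is not None else [])
        used = set()
        produced = set()
        for ble in members:
            used |= ble.inputs
            produced.add(ble.name)
        return used - produced  # locally produced nets need no pin

    def can_add(self, ble):
        """True if the BLE fits within both the BLE and input-pin limits."""
        return (len(self.bles) < self.n
                and len(self.external_inputs(ble)) <= self.i)

    def add(self, ble):
        if not self.can_add(ble):
            raise ValueError("BLE does not fit in this CLB")
        self.bles.append(ble)


# Example with hypothetical sizes k=4, N=5, I=12 (so I < k*N = 20).
clb = CLB(n=5, i=12)
clb.add(BLE("a", frozenset({"x1", "x2", "x3", "x4"})))
clb.add(BLE("b", frozenset({"a", "x1", "x5", "x6"})))
print(len(clb.external_inputs()))
```

The two BLEs have eight LUT inputs between them but need only six CLB input pins, because `x1` is shared and `a` is produced locally.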
CLB Depopulation
• Normally: CLBs fully packed
– Reduces total # of CLBs needed for circuit
• CLB Depopulation: Tessier, DeHon
– Do not use all BLEs
– Increase # CLBs used
– Decrease channel width
– Decrease overall area
• Problem
– Increase in # CLBs high for large circuits
– Our work: limits # CLB increase
[Figure: CLB with ‘I’ inputs, ‘N’ outputs, and BLE #1 to BLE #5]
Uniform Depopulation
• Previous work
– Depopulate each CLB by equal amount
• But… circuit observations
– Regions of high routing demand
– Regions of low routing demand
• Depopulate in low congestion areas ??
– Unnecessary increase in area
Non-Uniform Depopulation
• Our depopulation method:
– Assume congestion is localized
– Depopulate only congested areas
• We show non-uniform depopulation
– Effective method of channel width reduction
– Graceful tradeoff between channel width and area
– Makes unroutable circuits routable
Depopulation Methods to Reduce Channel Width
CLB Depopulation
• General Approach
– Use existing clustering tools
– Do not fill CLB while clustering
1. Input-Limited
• E.g., maximum 67% input utilization per CLB
• Might use all BLEs
2. BLE-Limited
• E.g., maximum 60% BLE utilization per CLB
• Might use all inputs
[Figure: CLB with ‘I’ inputs, ‘N’ outputs, and BLE #1 to BLE #5]
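The two limits can be read as caps handed to an existing clustering tool. The sketch below only computes those caps; the function name `depopulation_caps` and the architecture values (N = 5 BLEs, I = 12 inputs per CLB) are hypothetical, and how a particular clusterer accepts such caps is not covered by the slides.

```python
def depopulation_caps(n, i, ble_pct=100, input_pct=100):
    """Return (max BLEs, max inputs) a clusterer may use per CLB.

    n, i      -- BLE slots and input pins per CLB in the architecture
    ble_pct   -- allowed BLE utilization in percent (BLE-limited mode)
    input_pct -- allowed input utilization in percent (input-limited mode)
    Integer arithmetic rounds down; at least one BLE and one input stay usable.
    """
    return max(1, n * ble_pct // 100), max(1, i * input_pct // 100)


# Hypothetical architecture: N = 5 BLEs and I = 12 inputs per CLB.
input_limited = depopulation_caps(5, 12, input_pct=67)  # all BLEs, fewer inputs
ble_limited = depopulation_caps(5, 12, ble_pct=60)      # fewer BLEs, all inputs
print(input_limited, ble_limited)
```

In input-limited mode every BLE slot may still fill up, and in BLE-limited mode every input pin may still be used, matching the two "might use all" notes above.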
Reducing Channel Width Results (max cluster size 16)
• Input-Limited
– No channel width control
• BLE-Limited
– (Almost) monotonically increasing: good channel width control
[Figure: routed channel width (30 to 90) for clma vs. cluster size (BLE-limit) and number of inputs (input-limit, 6 to 54); curves: input-limited clma, BLE-limited clma]
Benchmark Circuit Creation
(We want BIG circuits!)
(What do REALLY BIG circuits look like?)
Benchmarking Circuits: Some Observations
• Altera has bigger benchmarks than academics
– We noted similar characteristics:
• Some LARGE circuits routable with NARROW routing channels
• Some SMALL circuits need WIDE routing channels
• What if each circuit is IP Block in larger system… ??
Benchmark set                            LUT Range                    Channel Width Range
20 Largest MCNC Benchmarks               10:1 (1,000..10,000 LUTs)    4:1 (20..80 tracks)
Altera Cyclone Benchmarks [CICC 2003]    10:1 (2,500..25,000 LUTs)    3:1 (40..120 tracks)
Benchmark Creation – IP Blocks
• Mimic process of creating large designs
– “IP Blocks” <==> MCNC Circuits
– SoC <==> Randomly integrate/stitch together “IP Blocks”
– IP Blocks have varied interconnect needs
• Real-life large designs: System-on-Chip Methodology
– IP blocks (own, 3rd party)
• Re-use improves productivity
– Primarily integration and verification effort
Benchmark Creation – Large Designs
• Considered 3 stitching schemes…
– Independent
• IP Blocks are not connected to each other
– Pipeline
• Outputs of one IP block connected to inputs of next IP block
– Clique
• Outputs of each IP block are uniformly distributed to inputs of all other IP blocks
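The three schemes differ only in which block-to-block connections exist. A minimal sketch at block granularity follows; the function name `stitch` is hypothetical, and the net-level detail of how individual outputs are spread over the inputs of other blocks is deliberately left out because the slides do not specify it.

```python
def stitch(blocks, scheme):
    """Return the list of (source_block, sink_block) connections.

    blocks -- ordered list of IP block names
    scheme -- "independent", "pipeline" or "clique"
    """
    if scheme == "independent":
        return []  # blocks are not connected to each other
    if scheme == "pipeline":
        # outputs of each block feed the inputs of the next block
        return [(blocks[j], blocks[j + 1]) for j in range(len(blocks) - 1)]
    if scheme == "clique":
        # outputs of each block feed the inputs of every other block
        return [(a, b) for a in blocks for b in blocks if a != b]
    raise ValueError(f"unknown scheme: {scheme}")


ip_blocks = ["alu4", "tseng", "clma"]
for scheme in ("independent", "pipeline", "clique"):
    print(scheme, stitch(ip_blocks, scheme))
```

For B blocks this gives 0, B-1 and B*(B-1) directed block connections respectively.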
MetaCircuit: Reducing Routed Channel Width?
• Observations
– IP blocks are tightly connected internally
– IP blocks have varied channel width needs
• Hypotheses
1. Placement keeps each “IP block” together
2. An IP block has large routed channel width => MetaCircuit has large routed channel width
Hypothesis Testing: MetaCircuit P&R Results
• Use VPR FPGA tools from University of Toronto
• Hypothesis 1
– VPR placer successfully groups IP blocks from random initial placement
• Hypothesis 2
– VPR router confirms channel width of MetaCircuit is dominated by a few IP blocks { pdc, clma, ex1010 }
Consequences of Hypothesis 2
• Question
– Shrink channel width of a few IP blocks => shrink channel width of MetaCircuit?
• How to shrink channel widths?
– Selective CLB Depopulation !!
– Depopulate hard-to-route IP blocks the most
• How much to depopulate?
– Channel width profiling of IP block…
Meeting Channel Width Constraints: Selective Depopulation
• Step 1: Channel Width Profiling of IP Blocks (Congestion Estimation)
• Step 2: Re-cluster Only Congested IP Blocks (Selective Depopulation)
IP Block Properties
• Cluster IP Blocks into N=16, k=6
• VPR: determine minimum channel width for each IP Block
• Sort IP Blocks based on channel width
[Figure: channel width (0 to 90) of IP blocks, sorted by channel width, with regions marked Easy-to-Route Circuits and Hard-to-Route Circuits. Order on the axis: alu4, s298, tseng, misex3, s38417, ex5p, s38584, apex2, seq, diffeq, apex4, dsip, bigkey, des, spla, frisc, clma, ex1010, pdc, elliptic]
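One common way to find a minimum channel width is a binary search over the width, re-running the router at each step. The sketch below shows that search and the sort that follows it. It is not VPR's interface: `routes_at` is a hypothetical callback standing in for a full place-and-route run, routability is assumed to be monotonic in width, and the two stand-in routers use made-up thresholds rather than measured results.

```python
def min_channel_width(routes_at, lo=1, hi=256):
    """Smallest width in [lo, hi] for which routes_at(width) is True.

    Assumes that if a width routes, every larger width routes too.
    Returns None if even the widest channel fails.
    """
    if not routes_at(hi):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if routes_at(mid):
            hi = mid        # mid works, so the minimum is mid or smaller
        else:
            lo = mid + 1    # mid fails, so the minimum is larger
    return lo


# Stand-in routers with made-up thresholds (not measured results).
fake_routers = {
    "block_b": lambda w: w >= 70,
    "block_a": lambda w: w >= 25,
}
widths = {name: min_channel_width(r) for name, r in fake_routers.items()}
easy_to_hard = sorted(widths, key=widths.get)
print(widths, easy_to_hard)
```

Each probe of `routes_at` corresponds to one routing attempt, so the search needs about log2(hi - lo) attempts per IP block.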
Channel Width Profiling of IP Block
• Cluster sizes
– NA = FPGA Architecture Cluster Size (fixed)
– NC = BLE-Limit Size (variable)
• Sweep NC for each IP block
[Figure: routed channel width (10 to 70) vs. BLE-limit size NC for clma and tseng]
Analysis with Constraint
• Given channel-width constraint of 60 tracks
– tseng routable (easy)
– clma routable for NC <= 10
– clma not routable for NC > 10
[Figure: routed channel width (10 to 70) vs. BLE-limit size NC for clma and tseng]
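Once a block has been profiled, picking its BLE-limit is a lookup: take the largest NC whose routed channel width fits the constraint. A minimal sketch follows, with a hypothetical function name. The two profiles are invented so that they reproduce the qualitative statements above (clma fits up to NC = 10 at 60 tracks, tseng fits everywhere); they are not read from the plot.

```python
def max_routable_nc(profile, constraint):
    """Largest BLE-limit NC whose routed channel width meets the constraint.

    profile    -- dict mapping NC to routed channel width from a sweep
    constraint -- channel width available in the target FPGA
    Returns None if no NC in the profile fits.
    """
    fitting = [nc for nc, width in profile.items() if width <= constraint]
    return max(fitting) if fitting else None


# Invented profiles for NC = 1..16 (illustrative shapes, not measured data).
clma_profile = {nc: 40 + 2 * nc for nc in range(1, 17)}
tseng_profile = {nc: 20 + nc for nc in range(1, 17)}

print(max_routable_nc(clma_profile, 60), max_routable_nc(tseng_profile, 60))
```

Checking every NC individually, rather than stopping at the first failure, keeps the lookup correct even when the profile is only almost monotonic.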
Our Technique: Selective Depopulation
• Step 1: Channel Width Profiling of IP Blocks (Congestion Estimation)
• Step 2: Re-cluster Only Congested IP Blocks (Selective Depopulation)
Uniform Depopulation
• Minimum NC Cluster Size
– Depopulate all clusters equally
– E.g., use NC=10 for both IP Blocks
[Figure: routed channel width (10 to 70) vs. cluster size NC for clma and tseng]
Non-Uniform Depopulation
• Maximal NC Cluster Size
– Depopulate each IP block according to maximal cluster size
– E.g., clma NC=10, tseng NC=16
[Figure: routed channel width (10 to 70) vs. cluster size NC for clma and tseng]
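The CLB cost of the two policies can be compared with a few lines of arithmetic. The sketch below uses the NC values from the examples (clma NC=10, tseng NC=16) but hypothetical BLE counts, and it assumes ideal packing, where a block with B BLEs and limit NC needs ceil(B / NC) CLBs. Real clustering tools will not always pack this tightly.

```python
def clbs_needed(num_bles, nc):
    """CLBs used when each CLB holds at most nc BLEs (ideal packing assumed)."""
    return (num_bles + nc - 1) // nc  # integer ceiling division


def compare(ble_counts, max_nc, na=16):
    """Return ((uniform CLBs, LUT utilization), (non-uniform CLBs, utilization)).

    ble_counts -- dict: IP block -> number of BLEs (hypothetical here)
    max_nc     -- dict: IP block -> largest NC that meets the constraint
    na         -- architecture cluster size NA
    """
    total_bles = sum(ble_counts.values())
    uniform_nc = min(max_nc.values())  # every block gets the tightest limit
    uniform = sum(clbs_needed(b, uniform_nc) for b in ble_counts.values())
    non_uniform = sum(clbs_needed(ble_counts[k], max_nc[k]) for k in ble_counts)
    return ((uniform, total_bles / (uniform * na)),
            (non_uniform, total_bles / (non_uniform * na)))


# Hypothetical BLE counts; NC limits taken from the slide examples.
ble_counts = {"clma": 8000, "tseng": 1000}
max_nc = {"clma": 10, "tseng": 16}
uniform, non_uniform = compare(ble_counts, max_nc)
print(uniform, non_uniform)
```

With these numbers the uniform policy needs 900 CLBs and the non-uniform policy 863, so non-uniform depopulation gives the lower CLB count and the higher LUT utilization.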
Uniform vs. Non-Uniform
• Non-Uniform depopulation better than Uniform
– Lower CLB count
– Higher LUT utilization
[Figure: total CLBs needed (x 1,000; 2 to 20) vs. channel width constraint (40 to 100); series: Uniform, Non-Uniform]
[Figure: LUT utilization (0 to 1) vs. channel width constraint (40 to 100); series: Uniform, Non-Uniform]
MetaCircuit Clustering Results
• Depopulate the most-congested IP blocks
– BLE-limit of each IP block shown (max=16)
– Some IP blocks are depopulated more than others
[Figure: normalized area (0.8 to 2) vs. channel width constraint (40 to 100)]
MetaCircuit P&R Results
• Clique MetaCircuit
– P&R channel width results closely match “constraints”
• Shrink channel width
– by ~20% (from 95 to 75), NO AREA INCREASE
– by ~50% (from 95 to 50), 1.7x area increase
[Figure: routed channel width vs. channel width constraint (both 40 to 100); series: Constraint, Routed]
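The two percentages follow directly from the quoted track counts. A short check, using only the numbers stated for the Clique MetaCircuit (the function name is hypothetical):

```python
def percent_reduction(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


small = percent_reduction(95, 75)  # the "~20%" case, no area increase
large = percent_reduction(95, 50)  # the "~50%" case, 1.7x area increase
print(round(small, 1), round(large, 1))
```

The exact values are about 21% and 47%, consistent with the rounded figures on the slide.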
Other MetaCircuit Results
Circuit         Clustering Tool    Channel Width Decrease ( < 1.05 x Area )    Channel Width Decrease ( 1.7 x – 3.5 x Area )
Clique          T-VPack            20%                                         50%
Clique          iRAC Rep.          7%                                          29%
Independent*    T-VPack            24%                                         42%
Independent*    iRAC Rep.          27%                                         30%
Pipeline*       T-VPack            25%                                         55%
Pipeline*       iRAC Rep.          11%                                         27%
* These latest results are better than those given in paper
Critical Path Delay and Average Wirelength
• Expect critical path delay to increase under tighter constraints
– Delay “noise” due to instability of floorplan locations
• Average wirelength / net increases under tighter constraints
[Figure: critical path (ns, 23 to 30) and average routed wirelength per net (13 to 20) vs. channel width constraint (40 to 100)]
Conclusion
• System-level technique to map large System-on-Chip (SoC) designs to channel-width constrained FPGAs using fewer routing resources
• Depopulating CLBs effective at reducing channel width
• Non-uniform depopulation important to limit area inflation
• Channel width reduced
– by 0-20% with < 5% area increase
– by up to 50% with 3.3 X area increase
• Effective solution to trade off CLBs for interconnect !!!
– UNROUTABLE circuits (channel width TOO LARGE) can be made ROUTABLE (reduced channel width) by buying an FPGA with MORE LOGIC!!!
End of Talk
Future Work
• Real-Life SoC Benchmark
– Licensed IP: Bluetooth baseband processor
– 325,000 ASIC gates
– Numerous IP blocks of varying complexity
– Needed to authenticate “Synthetic” results
• Automated technique to find “hard” IP blocks
– Granularity is based on design hierarchy (?)
– Replaces time-consuming Step 1 of process