Conjoining Soft-Core FPGA Processors
David Sheldon (a), Rakesh Kumar (b), Frank Vahid (a, *), Dean Tullsen (b), Roman Lysecky (c)
(a) Department of Computer Science and Engineering, University of California, Riverside
(b) Department of Computer Science and Engineering, University of California, San Diego
(c) Department of Electrical and Computer Engineering, University of Arizona
(*) Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and by hardware and software
donations from Xilinx
David Sheldon, UC Riverside 2 of 22
FPGA Soft Core Processors
Soft-core processor: HDL description
Flexible implementation
FPGA or ASIC
Technology independent
[Figure: one HDL description targets either an FPGA (Spartan 3, Virtex 2, Virtex 4) or an ASIC]
FPGA Soft Core Processors
Soft-core processors can have configurable options
Datapath units
Cache
Bus architecture
Current commercial FPGA soft-core processors
Xilinx MicroBlaze
Altera Nios
[Figure: FPGA containing a microprocessor (μP) with optional cache, FPU, and MAC units]
Conjoinment Overview
[Figure: two base microprocessors, one running Application 1 and one running Application 2; first each has its own FPU, then both share a single conjoined FPU unit]
"Conjoining"
Add necessary units to both processors
Conjoin the FPU unit
Conjoinment Background
Conjoinment proposed for multicore desktop processing (Kumar 2004)
Reduces size with reasonable performance overhead
e.g., cache conjoinment overhead: 1%-13%
[Figure: two panels, ICache Sharing and DCache Sharing]
Outline
Conjoinment for soft-core FPGA processors
Area savings
Performance overhead
Tuning heuristic for two configurable soft-cores with conjoin option
Area Savings
Significant potential area savings
Limitations
Does not consider multiplexing costs, due to the absence of FPGA synthesis tools supporting conjoinment
But good potential justifies further investigation
Unit sizes (equivalent LUTs), each unit attached to the base MicroBlaze:

Unit             Size
Multiplier       1331
Barrel Shifter   228
Divider          122
FPU              2738

[Figure: bar chart of equivalent LUTs (0 to 10000) for the unit instantiated with the base processor (bs, div, mul, fpu), unconjoined versus conjoined; savings labels read 6%, 4%, 23%, 32%]
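The area comparison can be sketched as simple arithmetic. The following is a minimal sketch, not the authors' tool: it assumes the unconjoined design holds two base processors each with its own copy of a unit, the conjoined design holds two base processors sharing one copy, and multiplexing costs are ignored (as the slide states). The base processor size `BASE_LUTS` is a hypothetical placeholder, since the slides do not give it; the unit sizes come from the table above.

```python
# Unit sizes in equivalent LUTs, taken from the slide's table.
UNIT_LUTS = {"bs": 228, "div": 122, "mul": 1331, "fpu": 2738}

# Hypothetical base MicroBlaze size; the slides do not state this value.
BASE_LUTS = 1500


def area_savings(unit_luts, base_luts=BASE_LUTS):
    """Return (unconjoined, conjoined, fractional savings) for one unit type.

    Unconjoined: two processors, each with a private copy of the unit.
    Conjoined: two processors sharing a single copy (no multiplexer cost).
    """
    unconjoined = 2 * (base_luts + unit_luts)
    conjoined = 2 * base_luts + unit_luts
    savings = (unconjoined - conjoined) / unconjoined
    return unconjoined, conjoined, savings


for name, luts in UNIT_LUTS.items():
    unconj, conj, frac = area_savings(luts)
    print(f"{name}: {unconj} -> {conj} LUTs ({frac:.1%} saved)")
```

Because the shared unit is counted once instead of twice, the savings grow with the unit's size relative to the base processor, which is why the FPU benefits most.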
Performance Overhead
No simulator exists for conjoined processors
We developed our own: a trace-based conjoined processor simulator
Simulation uses pessimistic performance assumptions
Kumar's techniques can improve
Simulator outputs contention information
Final cycles can be compared to unconjoined to determine performance overhead
[Figure: app1 and app2 (e.g., brev and bitmnp) run on the Xilinx simulator to produce trace1 and trace2; the conjoined simulator consumes both traces and reports access stalls and contention stalls]
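The idea of a trace-based conjoined simulator can be illustrated with a toy model. This is a sketch under stated assumptions, not the authors' simulator: each trace is a list of booleans (True when the instruction uses the shared unit), non-unit instructions take one cycle, the shared unit is not pipelined, every shared access pays a fixed `access_delay`, and processor 1 always wins ties. The function names and the parameters `unit_latency` and `access_delay` are hypothetical.

```python
def unconjoined_cycles(trace, unit_latency=3):
    """Cycles for one processor with a private unit (no sharing penalties)."""
    return sum(unit_latency if uses_unit else 1 for uses_unit in trace)


def simulate_conjoined(trace1, trace2, unit_latency=3, access_delay=1):
    """Replay two traces against one shared unit and count stall cycles."""
    traces = (trace1, trace2)
    pc = [0, 0]                 # next instruction index per processor
    ready = [0, 0]              # cycle at which each processor can issue again
    access_stalls = [0, 0]      # extra cycles paid to reach the shared unit
    contention_stalls = [0, 0]  # cycles spent waiting for the other processor
    unit_free_at = 0
    cycle = 0
    while pc[0] < len(trace1) or pc[1] < len(trace2):
        for p in (0, 1):        # fixed priority: processor index 0 goes first
            if pc[p] >= len(traces[p]) or ready[p] > cycle:
                continue
            if traces[p][pc[p]]:
                if unit_free_at > cycle:
                    contention_stalls[p] += 1   # unit busy, retry next cycle
                    continue
                unit_free_at = cycle + access_delay + unit_latency
                ready[p] = unit_free_at
                access_stalls[p] += access_delay
            else:
                ready[p] = cycle + 1
            pc[p] += 1
        cycle += 1
    return {
        "cycles": ready,
        "access_stalls": access_stalls,
        "contention_stalls": contention_stalls,
    }


result = simulate_conjoined([False, True, False], [True, True])
base = [unconjoined_cycles([False, True, False]), unconjoined_cycles([True, True])]
overhead = [c / b - 1 for c, b in zip(result["cycles"], base)]
print(result, overhead)
```

Comparing the final cycle counts against `unconjoined_cycles` gives the performance overhead per processor, mirroring the comparison described on the slide.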
Performance Overhead
[Figure: bar chart of speedup (0 to 4.5), conjoined versus unconjoined, for pairings of brev, bitmnp, and canrdr; overhead labels read 17% and 2.4%]
Speedup: Application time on optimally configured processor / avg. app. time on base processor
Compared configuration with conjoinment versus without
Performance overhead usually small, averaged just 4.2%
Overhead caused by access delays and contention of the hardware units
Tuning Heuristic
5 choices per unit, e.g., FPU: no unit, 1 only, 2 only, 1 & 2, and conjoined
4 units: 5^4 = 625 possible configurations
Simulation: ~30 minutes per configuration
Need search heuristic to tune
[Figure: two base MicroBlaze processors (1 and 2) with candidate multiplier, barrel shifter, divider, and FPU units; the FPU is shown as FPU 1, FPU 2, or FPU conjoined]
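The size of the search space follows directly from the counts above. As a minimal sketch (the unit and choice labels are just illustrative names), the configurations can be enumerated and the cost of exhaustive simulation estimated from the ~30 minutes per configuration quoted on the slide:

```python
from itertools import product

# Five placement choices for each unit, as listed on the slide.
CHOICES = ("none", "1 only", "2 only", "1 & 2", "conjoined")
UNITS = ("mul", "bs", "div", "fpu")
MINUTES_PER_SIMULATION = 30  # approximate figure from the slide


def all_configurations():
    """Enumerate every assignment of a placement choice to each unit."""
    return [dict(zip(UNITS, combo)) for combo in product(CHOICES, repeat=len(UNITS))]


configs = all_configurations()
total_hours = len(configs) * MINUTES_PER_SIMULATION / 60
print(len(configs), "configurations,", total_hours, "hours of simulation")
```

At roughly 312 hours (about 13 days) for a single pair of applications, exhaustive search is impractical, which motivates the heuristic.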
Map to 0-1 Knapsack Problem
Creating the model
[Figure: the base MicroBlaze and the MicroBlaze with each unit added (multiplier, divider, barrel shifter, FPU) are synthesized to obtain size and run with the application to obtain perf]

                 BS     FPU    MUL    DIV
Perf increment   1.1    0.9    1.2    1.0
Size increment   1.4    2.7    1.8    1.1
Perf/Size        0.96   0.34   0.63   0.93
Map to 0-1 Knapsack Problem
First consider tuning without conjoinment
Problem of instantiating units to limited FPGA size can be mapped to the 0-1 knapsack problem
Add items, each with weight and benefit, to a weight-constrained knapsack such that profit is maximized
Knapsack: the available FPGA, already holding the two base MicroBlaze processors

Items:     MUL 1   BS 1   DIV 1   FPU 1   MUL 2   BS 2   DIV 2   FPU 2
Weights:   1331    228    121     2738    1331    228    121     2738
Benefits:  0.08    0.62   0.00    0.00    0.22    0.76   0.00    0.00

[Figure: example knapsack contents showing MUL 1, FPU 1, and MUL 2 placed alongside the two base processors]
Note: Mapping inexact, since weights/benefits are not strictly additive
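To make the mapping concrete, here is a minimal sketch of the textbook 0-1 knapsack dynamic program applied to the weights and benefits listed above. It is not the authors' implementation. The item labels BS and DIV are inferred from the unit sizes, and the capacity of 2000 LUTs (area left after the two base processors) is a hypothetical value chosen only for illustration.

```python
# (name, weight in LUTs, benefit) from the slide; BS/DIV labels are inferred.
ITEMS = [
    ("MUL 1", 1331, 0.08), ("BS 1", 228, 0.62), ("DIV 1", 121, 0.00), ("FPU 1", 2738, 0.00),
    ("MUL 2", 1331, 0.22), ("BS 2", 228, 0.76), ("DIV 2", 121, 0.00), ("FPU 2", 2738, 0.00),
]


def knapsack_01(items, capacity):
    """Classic DP: best[i][c] is the max benefit using the first i items within c LUTs."""
    n = len(items)
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i, (_, weight, benefit) in enumerate(items, start=1):
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip item i
            if weight <= c and best[i - 1][c - weight] + benefit > best[i][c]:
                best[i][c] = best[i - 1][c - weight] + benefit  # option 2: take it
    # Walk back through the table to recover which items were taken.
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            name, weight, _ = items[i - 1]
            chosen.append(name)
            c -= weight
    return best[n][capacity], chosen[::-1]


SPARE_LUTS = 2000  # hypothetical capacity, not a figure from the slides
total_benefit, selected = knapsack_01(ITEMS, SPARE_LUTS)
print(round(total_benefit, 2), selected)
```

The DP treats benefits as additive, which the slide notes is only an approximation, so its answer is a starting point rather than a guaranteed optimum for the real design.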
Disjunctively Constrained Knapsack
Problem: If a conjoined unit is included, the standalone unit can't also be included
Solution: Map to disjunctively constrained 0-1 knapsack
Yamada T., "Heuristic and Exact Algorithms for the Disjunctively Constrained Knapsack Problem", 2002
Prohibits specific item pairs from being in the knapsack
ILP solution, running time is pseudo-polynomial
[Figure: knapsack of available FPGA area holding the two base MicroBlaze processors; candidate items are MUL 1, BS 1, DIV 1, FPU 1, MUL 2, BS 2, DIV 2, FPU 2, plus the conjoined items MUL C, BS C, DIV C, FPU C]
Disjunctively Constrained Knapsack
Knapsack: the available FPGA, already holding the two base MicroBlaze processors

Standalone items:
Items:     MUL 1   BS 1   DIV 1   FPU 1   MUL 2   BS 2   DIV 2   FPU 2
Weights:   1331    228    121     2738    1331    228    121     2738
Benefits:  0.08    0.62   0       0       0.22    0.76   0       0

Conjoined items:
Items:       MUL C   BS C   DIV C   FPU C
Weights:     1331    228    121     2738
Benefits 1:  0.06    0.54   0       0
Benefits 2:  0.21    0.71   0       0

Conjoined benefits show a small decrease in benefit from the unconjoined unit (e.g., MUL C versus MUL 1)
Conjoined units provide benefits to both processors
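The disjunctive constraint can be illustrated with a small exact search over the twelve items above. This is a sketch under stated assumptions, not the heuristic used in the talk: it enumerates all subsets (feasible only because there are 12 items), forbids pairing a conjoined unit with either standalone copy of the same unit, and assumes a conjoined item's benefit is the sum of its two per-processor benefits. The capacities passed in are hypothetical, and the BS/DIV labels are inferred from the unit sizes.

```python
from itertools import combinations

UNITS = ("MUL", "BS", "DIV", "FPU")
WEIGHT = {"MUL": 1331, "BS": 228, "DIV": 121, "FPU": 2738}
BENEFIT_1 = {"MUL": 0.08, "BS": 0.62, "DIV": 0.0, "FPU": 0.0}   # standalone on processor 1
BENEFIT_2 = {"MUL": 0.22, "BS": 0.76, "DIV": 0.0, "FPU": 0.0}   # standalone on processor 2
BENEFIT_C1 = {"MUL": 0.06, "BS": 0.54, "DIV": 0.0, "FPU": 0.0}  # conjoined, seen by processor 1
BENEFIT_C2 = {"MUL": 0.21, "BS": 0.71, "DIV": 0.0, "FPU": 0.0}  # conjoined, seen by processor 2


def build_items():
    """Return item -> (weight, benefit) and the set of forbidden item pairs."""
    items = {}
    for u in UNITS:
        items[f"{u} 1"] = (WEIGHT[u], BENEFIT_1[u])
        items[f"{u} 2"] = (WEIGHT[u], BENEFIT_2[u])
        # Assumption: a conjoined unit's benefit is the sum over both processors.
        items[f"{u} C"] = (WEIGHT[u], BENEFIT_C1[u] + BENEFIT_C2[u])
    conflicts = {(f"{u} C", f"{u} {p}") for u in UNITS for p in "12"}
    return items, conflicts


def solve(capacity):
    """Exhaustively pick the feasible subset with max benefit, then min weight."""
    items, conflicts = build_items()
    names = list(items)
    best = (0.0, 0, ())
    for r in range(len(names) + 1):
        for subset in combinations(names, r):
            chosen = set(subset)
            if any(a in chosen and b in chosen for a, b in conflicts):
                continue  # violates a disjunctive constraint
            weight = sum(items[n][0] for n in subset)
            if weight > capacity:
                continue
            benefit = round(sum(items[n][1] for n in subset), 9)
            if (benefit, -weight) > (best[0], -best[1]):
                best = (benefit, weight, subset)
    return best


for spare_luts in (1600, 2000):  # hypothetical capacities
    print(spare_luts, solve(spare_luts))
```

With the tighter capacity the search prefers conjoined units, while the looser one mixes standalone and conjoined copies, which shows why the conjoined items must compete with their standalone counterparts inside the same knapsack.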
Disjunctively Constrained Knapsack
Running Time
Modeling
5 synthesis runs for each processor
At most 4 runs of the conjoined simulator
Disjunctively constrained 0-1 knapsack
NP-complete problem
Solved with a heuristic
Heuristic takes < 1 min
Results
Data gathered for the Xilinx MicroBlaze soft-core processor
10 EEMBC and Powerstone benchmarks: aifir, BaseFP01, bitmnp, brev, canrdr, g3fax, g721_ps, idct, matmul, tblook, ttsprk
Obtained results for all possible pairwise conjoinments
We only show conjoinment data when both applications use the unit, to avoid making conjoinment appear better than it is
Results
[Figure: speedup (0 to 8) versus size in equivalent LUTs (0 to 9000), with a knapsack point and an optimal point for each application pair: (bitmnp, bitmnp), (canrdr, canrdr), (BaseFP01, BaseFP01), (BaseFP01, bitmnp), (BaseFP01, canrdr), (tblook, tblook), (tblook, bitmnp), (tblook, canrdr)]
Knapsack approach finds near-optimal in most cases
Results
Knapsack heuristic finds near-optimal in most cases (versus exhaustive with conjoinment)
Runs in seconds
One example had sub-optimal results (2.9 times slower)
Performance overhead due to conjoinment just a few percent on average
[Figure: bar chart (axis 0 to 8) comparing knapsack, exhaustive w/ conj., and exhaustive w/o conj.]
Results
[Figure: bar chart of size (0 to 12000) comparing knapsack, exhaustive w/ conj., and exhaustive w/o conj.]
On average the knapsack approach yields the same size as the exhaustive with conjoinment
Average size savings of 16%
Conclusions
Conjoining two soft-core FPGA processors reduces average size by 16%
Performance overhead just a few percent in most cases
Disjunctively constrained 0-1 knapsack approach finds near-optimal in most cases
But could be improved for some examples
Future
Consider multiplexing size and delay overheads
Apply Kumar's advanced conjoining techniques to reduce overheads