Resource Optimal Design of Large Multipliers for FPGAs
Martin Kumm*, Johannes Kappauf*,
Matei Istoan† and Peter Zipf*
*University of Kassel, Germany; †University of Lyon, France
24th IEEE Symposium on Computer Arithmetic
25.07.2017
Motivation
Multiplication is a fundamental arithmetic operation
Embedded multipliers available in the FPGA fabric are limited in size (& quantity)
Larger multipliers can be decomposed into smaller multipliers realized by DSP blocks or logic resources
Question of interest: How to do the decomposition in a (resource) optimal way?
Outline
1. How to formulate the problem as a tiling problem?
2. What do the tiles look like?
3. How to solve the problem?
1. How to formulate the problem as a tiling problem?
Multiplier Decomposition
A large multiplier can be decomposed into several smaller multipliers:
\[
A \times B = (A_H 2^{n} + A_L)(B_H 2^{m} + B_L)
= \underbrace{A_H B_H}_{M_4}\, 2^{n+m} + \underbrace{A_H B_L}_{M_3}\, 2^{n} + \underbrace{A_L B_H}_{M_2}\, 2^{m} + \underbrace{A_L B_L}_{M_1}
\]
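The decomposition can be checked numerically. The following is a minimal sketch for unsigned operands; the function names `split` and `multiply_decomposed` are illustrative and not taken from the talk.

```python
from typing import Tuple


def split(value: int, low_bits: int) -> Tuple[int, int]:
    """Split an unsigned integer into (high, low) parts at bit position low_bits."""
    return value >> low_bits, value & ((1 << low_bits) - 1)


def multiply_decomposed(a: int, b: int, n: int, m: int) -> int:
    """Compute a*b from four smaller products M1..M4, splitting a at bit n and b at bit m."""
    a_h, a_l = split(a, n)
    b_h, b_l = split(b, m)
    m4 = a_h * b_h  # weighted by 2^(n+m)
    m3 = a_h * b_l  # weighted by 2^n
    m2 = a_l * b_h  # weighted by 2^m
    m1 = a_l * b_l  # weighted by 2^0
    return (m4 << (n + m)) + (m3 << n) + (m2 << m) + m1


if __name__ == "__main__":
    a, b = 0xDEADBEEF, 0x12345678
    print(multiply_decomposed(a, b, 16, 16) == a * b)
```

The split positions n and m do not have to be equal; any pair of non-negative split positions gives the same product.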
Multiplier Tiling
The multiplier can be represented graphically as an X×Y board that is tiled by smaller multipliers, drawn as rectangles [de Dinechin 2009]
The required left shift can be obtained from the sum of the tile coordinates (x,y)
[Figure: 32×32 board tiled with four n = m = 16 bit multipliers M1 to M4; x increases to the left, y increases upward; axis ticks at 0, 16 and 32]
\[
A \times B = (A_H 2^{16} + A_L)(B_H 2^{16} + B_L)
= \underbrace{A_H B_H}_{M_4}\, 2^{32} + \underbrace{A_H B_L}_{M_3}\, 2^{16} + \underbrace{A_L B_H}_{M_2}\, 2^{16} + \underbrace{A_L B_L}_{M_1}
\]
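The rule that the left shift equals the sum of the tile coordinates can be sketched as follows. The sketch assumes unsigned operands and assumes that x indexes the bits of A and y the bits of B; the slides do not state which operand lies on which axis, and the names `bits`, `multiply_with_tiles` and `BOARD_32` are illustrative.

```python
from typing import List, Tuple

Tile = Tuple[int, int, int, int]  # (x, y, width, height), all in bits


def bits(value: int, start: int, width: int) -> int:
    """Extract `width` bits of `value`, starting at bit position `start`."""
    return (value >> start) & ((1 << width) - 1)


def multiply_with_tiles(a: int, b: int, tiles: List[Tile]) -> int:
    """Sum the partial products of all tiles, each shifted left by x + y."""
    total = 0
    for x, y, w, h in tiles:
        partial = bits(a, x, w) * bits(b, y, h)  # small multiplier of size w×h
        total += partial << (x + y)              # shift from the tile coordinates
    return total


# 32×32 board covered by four 16×16 tiles (n = m = 16).
BOARD_32: List[Tile] = [(0, 0, 16, 16), (16, 0, 16, 16), (0, 16, 16, 16), (16, 16, 16, 16)]

if __name__ == "__main__":
    a, b = 0xCAFEBABE, 0x0BADF00D
    print(multiply_with_tiles(a, b, BOARD_32) == a * b)
```

The result is only correct when the tiles cover every cell of the board exactly once, which is the validity condition of a tiling.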
Multiplier Tiling
A valid multiplier tiling satisfies the following rules:
The board must be completely covered, without overlaps between the tiles
Overlaps with the border of the board are allowed
[Figure: tiling of a 53×53 multiplier [de Dinechin 2009]; x increases to the left, y increases upward; tile boundaries at 0, 17, 24, 34, 41 and 58, board edge at 53]
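The two validity rules can be checked mechanically by counting how often each board cell is covered. The sketch below is illustrative only: the names `coverage` and `is_valid_tiling` are hypothetical, and the pinwheel of four 24×17 tiles around a 7×7 centre is an example chosen for this sketch, not the tiling shown in the figure.

```python
from typing import List, Tuple

Tile = Tuple[int, int, int, int]  # (x, y, width, height)


def coverage(X: int, Y: int, tiles: List[Tile]) -> List[List[int]]:
    """Count how many tiles cover each cell; parts outside the board are ignored."""
    count = [[0] * Y for _ in range(X)]
    for x0, y0, w, h in tiles:
        for x in range(max(x0, 0), min(x0 + w, X)):
            for y in range(max(y0, 0), min(y0 + h, Y)):
                count[x][y] += 1
    return count


def is_valid_tiling(X: int, Y: int, tiles: List[Tile]) -> bool:
    """True when every board cell is covered exactly once."""
    return all(c == 1 for column in coverage(X, Y, tiles) for c in column)


# Illustrative 41×41 pinwheel: four 24×17 tiles around a 7×7 centre tile.
PINWHEEL: List[Tile] = [
    (0, 0, 24, 17),
    (24, 0, 17, 24),
    (17, 24, 24, 17),
    (0, 17, 17, 24),
    (17, 17, 7, 7),
]

if __name__ == "__main__":
    print(is_valid_tiling(41, 41, PINWHEEL))      # complete cover, no overlaps
    print(is_valid_tiling(41, 41, PINWHEEL[:4]))  # centre left uncovered
```

Because tile parts outside the board are clipped, the same pinwheel also passes on a slightly smaller board, which mirrors the rule that overlaps with the border are allowed.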
2. What do the tiles look like?
Logic-based Tiles
Several LUT-based multipliers can be used:
3×3 multiplier, which can be mapped to six 6-input LUTs (LUT6) [Brunie 2013]
2×3 multiplier, which can be mapped to three LUT6 (realizing five LUT5) [Kumm 2015]
1×2 multiplier, which uses a single LUT6 (realizing two LUT5)
In addition, LUT/carry-chain multipliers are used:
Single row of an FPGA-optimized Baugh-Wooley multiplier [Parandeh-Afshar 2011]
Shapes of the Logic-based Tiles
[Figure: shapes of the logic-based tiles: (a) 3×3, (b) 3×2 / 2×3, (c) 2×1 / 1×2, (d) k×2, (e) 2×k]
LUT Requirements in the Compressor Tree
[Plot: #LUTs (0 to 1,000) versus input bits (#bits, 0 to 1,600) for multi-input addition and the x3 operation, together with the linear fit 0.65 × #bits]
Logic-based Multipliers
The cost is composed of the LUTs of the multiplier itself and the compressor-tree LUTs for its output word:
\[ \mathrm{cost}_s = \#\mathrm{LUT}_m + 0.65\, w_s \]
To get the "quality" of a multiplier, an efficiency metric is defined as benefit/cost ratio:
\[ E_s = \frac{\mathrm{area}_s}{\mathrm{cost}_s} \]
Shape  Tile area  Word size (w_s)  #LUT_m  Total cost (cost_s)  Efficiency (E_s)
1×1    1          1                1       1.65                 0.625
1×2    2          2                1       2.3                  0.87
2×3    6          5                3       6.25                 0.96
3×3    9          6                6       9.9                  0.91
2×k    2k         k + 2            k + 1   1.65k + 2.3          2k/(1.65k + 2.3) (= 1.21 for k → ∞)
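The cost and efficiency columns follow directly from the two formulas. The sketch below recomputes them; the function names and the `SHAPES` dictionary are illustrative. Note that for the 1×1 tile the formula gives about 0.61, slightly below the 0.625 listed in the table.

```python
from typing import Dict, Tuple


def cost(luts: float, word_size: float) -> float:
    """cost_s = #LUT_m + 0.65 * w_s (multiplier LUTs plus compressor-tree share)."""
    return luts + 0.65 * word_size


def efficiency(area: float, luts: float, word_size: float) -> float:
    """E_s = area_s / cost_s."""
    return area / cost(luts, word_size)


def efficiency_2xk(k: int) -> float:
    """Efficiency of the 2×k tile: area 2k, k + 1 LUTs, word size k + 2."""
    return efficiency(2 * k, k + 1, k + 2)


# shape name -> (tile area, #LUT_m, word size w_s), taken from the table
SHAPES: Dict[str, Tuple[int, int, int]] = {
    "1x1": (1, 1, 1),
    "1x2": (2, 1, 2),
    "2x3": (6, 3, 5),
    "3x3": (9, 6, 6),
}

if __name__ == "__main__":
    for name, (area, luts, ws) in SHAPES.items():
        print(name, round(cost(luts, ws), 2), round(efficiency(area, luts, ws), 3))
    for k in (4, 16, 64):
        print("2x%d" % k, round(efficiency_2xk(k), 3))
```

The 2×k tile approaches 2/1.65 ≈ 1.21 for large k, which is why long single-row tiles are attractive for the logic-based part of a tiling.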
DSP-based Tiles
Xilinx DSP blocks contain 18×25 bit (signed)/17×24 bit (unsigned) multipliers
They contain additional post-adders
These can be used to add a multiplier result already obtained
This reduces the size of the compressor tree
Graphically, this can be represented as a so-called super-tile [Banescu 2010]
Super-Tiles of Xilinx FPGAs
[Figure: twelve super-tile shapes, panels (a) to (l)]
3. How to solve the problem?
Formalizing the Problem
Constant/Variable                  Meaning
$x, y \in \mathbb{N}_0$            Coordinates
$X, Y \in \mathbb{N}_0$            Outer bounds of the multiplier to be designed
$M_{x,y} \in \{0, 1\}$             Shape of the multiplier to be designed; true when $(x, y)$ is within the area of the multiplier
$\mathcal{S}$                      Set of small multipliers with different shapes
$S = |\mathcal{S}|$                Number of available smaller multipliers
$s \in \{0, 1, \ldots, S-1\}$      Shape index of the smaller multiplier
$m^s_{x,y} \in \{0, 1\}$           Boolean constant describing each small multiplier; true when $(x, y)$ is within the area of the multiplier of shape $s$
$\mathrm{cost}_s \in \mathbb{R}$   Cost of a small multiplier of shape $s$
$d^s_{x,y} \in \{0, 1\}$           Decision variable, which is true when a multiplier of shape $s$ is placed at coordinate $(x, y)$
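Specification of a Tile
Setting
\[ m^0_{0,0} = m^0_{0,1} = m^0_{0,2} = m^0_{1,0} = m^0_{1,1} = 1 \]
with all other m's zero would define the following tile:
[Figure: the resulting tile on a grid with x ticks 0 to 2 (increasing to the left) and y ticks 0 to 3 (increasing upward)]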
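The tile specification can be written down as a set of cells and drawn in the orientation used on the slides. This is a minimal sketch; the names `TILE_0`, `m` and `render` are illustrative.

```python
from typing import List, Set, Tuple

Cell = Tuple[int, int]

# Cells (x, y) for which m^0_{x,y} = 1; all other entries are zero.
TILE_0: Set[Cell] = {(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)}


def m(cells: Set[Cell], x: int, y: int) -> int:
    """Boolean constant m_{x,y} of a tile given as a set of cells."""
    return 1 if (x, y) in cells else 0


def render(cells: Set[Cell], width: int, height: int) -> List[str]:
    """Draw the tile with x increasing to the left and y increasing upward."""
    rows = []
    for y in range(height - 1, -1, -1):  # top row first
        rows.append("".join("#" if m(cells, x, y) else "."
                            for x in range(width - 1, -1, -1)))  # x = 0 on the right
    return rows


if __name__ == "__main__":
    print("\n".join(render(TILE_0, 2, 3)))
```

Describing tiles by cell sets rather than by width and height also covers non-rectangular shapes such as this one.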
ILP Formulation
The multiplier tiling problem can be formulated as an integer linear program (ILP) as follows:
\[
\text{minimize} \quad \sum_{s=0}^{S-1} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} \mathrm{cost}_s \, d^s_{x,y}
\]
subject to
\[
\sum_{s=0}^{S-1} \sum_{x'=0}^{X-1} \sum_{y'=0}^{Y-1} m^s_{x-x',\,y-y'} \, d^s_{x',y'} = 1
\quad \text{for } 0 \le x \le X,\ 0 \le y \le Y \text{ with } M_{x,y} = 1
\]
The ILP problem can be solved by using standard solvers
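For very small boards the same optimum can be found without an ILP solver. The sketch below is not the ILP approach of the talk: it is an exhaustive branch-and-bound search over the same 0/1 placement decisions, restricted to rectangular tiles on a rectangular board. All names are illustrative, and the tile costs are the fixed-size entries of the cost table.

```python
from typing import List, Optional, Tuple

Shape = Tuple[int, int, float]    # (width in x, height in y, cost_s)
Placement = Tuple[int, int, int]  # (shape index s, x, y), i.e. d^s_{x,y} = 1


def solve_tiling(X: int, Y: int, shapes: List[Shape]) -> Tuple[float, List[Placement]]:
    """Cheapest exact cover of an X×Y board; tiles may stick out over the border."""
    covered = [[False] * Y for _ in range(X)]
    best_cost = float("inf")
    best: List[Placement] = []
    current: List[Placement] = []

    def first_free() -> Optional[Tuple[int, int]]:
        for y in range(Y):
            for x in range(X):
                if not covered[x][y]:
                    return x, y
        return None

    def search(cost: float) -> None:
        nonlocal best_cost, best
        if cost >= best_cost:  # bound: cannot improve on the best solution so far
            return
        cell = first_free()
        if cell is None:       # board completely covered
            best_cost, best = cost, list(current)
            return
        x0, y0 = cell          # some tile must have its origin at this cell
        for s, (w, h, c) in enumerate(shapes):
            cells = [(x, y) for x in range(x0, min(x0 + w, X))
                     for y in range(y0, min(y0 + h, Y))]
            if any(covered[x][y] for x, y in cells):
                continue       # would overlap an already placed tile
            for x, y in cells:
                covered[x][y] = True
            current.append((s, x0, y0))
            search(cost + c)
            current.pop()
            for x, y in cells:
                covered[x][y] = False

    search(0.0)
    return best_cost, best


SHAPES: List[Shape] = [(1, 1, 1.65), (1, 2, 2.3), (2, 1, 2.3),
                       (2, 3, 6.25), (3, 2, 6.25), (3, 3, 9.9)]

if __name__ == "__main__":
    print(solve_tiling(4, 2, SHAPES))
```

The search grows exponentially with the board size, so it only serves to illustrate the objective and the cover constraint; practical multiplier sizes need an ILP solver as described above.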
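ILP Formulation
Graphical representation of the left-hand side of the ILP constraint:
[Figure: tile 0 placed at (1, 2) on a grid with ticks 0 to 5, x increasing to the left and y increasing upward. Inside the tile, $m^0_{0,0} d^0_{1,2} = m^0_{0,1} d^0_{1,2} = m^0_{0,2} d^0_{1,2} = m^0_{1,0} d^0_{1,2} = m^0_{1,1} d^0_{1,2} = 1$; outside the tile, $m^0_{0,3} d^0_{1,2} = 0$]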
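The left-hand side of the constraint can be evaluated cell by cell for a given set of placements. The sketch below reproduces the situation of the figure, with tile 0 from the tile specification placed at (1, 2); the names `constraint_lhs`, `MASKS` and `PLACED` are illustrative.

```python
from typing import Dict, Set, Tuple

Cell = Tuple[int, int]

# m^0: the tile from the tile specification slide.
MASKS: Dict[int, Set[Cell]] = {0: {(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)}}

# Decision variables that are set: d^0_{1,2} = 1.
PLACED: Dict[Tuple[int, int, int], int] = {(0, 1, 2): 1}


def constraint_lhs(x: int, y: int,
                   masks: Dict[int, Set[Cell]],
                   placed: Dict[Tuple[int, int, int], int]) -> int:
    """Sum of m^s_{x-x', y-y'} * d^s_{x',y'} over all placements (s, x', y')."""
    total = 0
    for (s, x0, y0), d in placed.items():
        m = 1 if (x - x0, y - y0) in masks[s] else 0
        total += m * d
    return total


if __name__ == "__main__":
    for y in range(5, -1, -1):  # y increases upward, x increases to the left
        print("".join(str(constraint_lhs(x, y, MASKS, PLACED)) for x in range(5, -1, -1)))
```

A tiling is valid when this sum equals 1 for every cell with M_{x,y} = 1.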
Additional DSP Constraint
The cost of DSP blocks is hard to compare with the cost of LUTs
It is better to constrain the DSP count for a given application
A single additional constraint can be used to specify the number of DSPs (#DSP):
\[
\sum_{s=0}^{S-1} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} D_s \, d^s_{x,y} = \#\mathrm{DSP}
\]
where $D_s$ specifies the number of DSPs in multiplier shape $s$
Results
Four important cases were considered:
24×24 (single precision)
32×32
53×53 (double precision)
64×64
Each was evaluated for a varying DSP count, up to a DSP-only implementation
Resulting Tilings 24/32 Bit
[Figure: resulting tilings for 24×24 with 0, 1 and 2 DSPs and for 32×32 with 0, 1, 2, 3 and 4 DSPs]
Figure annotation: Baugh-Wooley multiplier [Parandeh-Afshar 2011]
Figure annotation: 2×k and 1×2 tiles perform best for LUT-based multiplication
Figure annotation: efficient solution utilizing two super-tiles
Resulting Tilings 53 Bit
[Figure: resulting tilings for 53×53 with 5, 6, 7, 8 and 9 DSPs]
Figure annotations: pinwheel inside of a pinwheel; the logic-based multipliers consume 1/4 of the area compared to the previous hand-optimized design [de Dinechin 2009]
Resulting Tilings 64 Bit
[Figure: resulting tilings for 64×64 with 7, 8, 9, 10 and 11 DSPs]
Optimization & Synthesis Results
Mult.  Method          #DSP  Area (geom.) logic mult.  Slices  Slice red.  fclk [MHz]
24×24  [Brunie 2013]   1     216                       65      –           212.4
24×24  proposed        1     168                       58      10.8%       287.4
24×24  [Brunie 2013]   2     0                         0       –           418.9
24×24  proposed        2     0                         0       0.0%        418.9
32×32  [Banescu 2010]  0     1024                      339     –           275.8
32×32  proposed        0     1024                      276     18.6%       304.4
32×32  [Brunie 2013]   1     648                       205     –           192.8
32×32  [Banescu 2010]  1     616                       234     –           352.6
32×32  proposed        1     616                       180     12.2%       302.5
32×32  [Brunie 2013]   2     288                       94      –           270.1
32×32  proposed        2     256                       82      12.8%       338.0
32×32  [Brunie 2013]   3     135                       75      –           194.0
32×32  [Banescu 2010]  3     176                       75      –           426.6
32×32  proposed        3     64                        44      41.3%       314.5
32×32  [Brunie 2013]   4     0                         17      –           314.7
32×32  [Banescu 2010]  4     40                        38      –           379.4
32×32  proposed        4     0                         13      23.5%       181.7
Table annotation: fewer slices because of the better logic-based multipliers and compressor tree
Table annotation: fewer slices because of better super-tile usage
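The "Slice red." column of the synthesis results is consistent with a reduction measured against the smallest slice count among the previous methods at the same DSP count. This reading is an assumption inferred from the numbers, not something the slides state; the function name `slice_reduction` is illustrative.

```python
from typing import Dict


def slice_reduction(previous: Dict[str, int], proposed: int) -> float:
    """Percentage reduction of `proposed` relative to the best previous slice count."""
    best_previous = min(previous.values())
    if best_previous == 0:  # e.g. the DSP-only 24×24 case with 0 slices
        return 0.0
    return 100.0 * (best_previous - proposed) / best_previous


if __name__ == "__main__":
    # 32×32 with 1 DSP: slice counts taken from the results table
    print(round(slice_reduction({"Brunie 2013": 205, "Banescu 2010": 234}, 180), 1))
```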
Optimization & Synthesis Results
Mult.  Method          #DSP  Area (geom.) logic mult.  Slices  Slice red.  fclk [MHz]
53×53  [Banescu 2010]  5     1029                      350     –           298.2
53×53  proposed        5     769                       295     15.7%       313.2
53×53  [Brunie 2013]   6     468                       196     –           214.1
53×53  [Banescu 2010]  6     721                       220     –           298.2
53×53  proposed        6     361                       180     8.2%        263.2
53×53  [Banescu 2010]  7     313                       223     –           378.9
53×53  proposed        7     193                       137     38.6%       290.2
53×53  [Banescu 2010]  8     265                       145     –           356.4
53×53  proposed        8     25                        81      44.1%       272.7
53×53  [Brunie 2013]   9     162                       125     –           195.6
53×53  [Banescu 2010]  9     215                       174     –           255.8
53×53  proposed        9     0                         72      42.4%       348.8
64×64  [Banescu 2010]  7     1504                      614     –           245.0
64×64  proposed        7     1191                      430     30.0%       270.5
64×64  [Brunie 2013]   8     1188                      420     –           194.2
64×64  [Banescu 2010]  8     1096                      449     –           280.7
64×64  proposed        8     652                       348     17.1%       261.2
64×64  [Banescu 2010]  9     864                       413     –           262.9
64×64  proposed        9     475                       217     47.5%       249.6
64×64  [Banescu 2010]  10    592                       341     –           250.7
64×64  proposed        10    187                       179     47.5%       267.7
64×64  [Brunie 2013]   11    270                       196     –           162.8
64×64  [Banescu 2010]  11    592                       268     –           225.3
64×64  proposed        11    0                         108     44.9%       265.4
Table annotation: DSP-only solutions with fewer DSPs were found
Conclusion
A method was proposed to optimally solve the multiplier tiling problem using ILP
The method allows trading off DSP blocks against logic resources
The problem is tractable for practical multiplier sizes
Combined with carefully selected logic-based multipliers and DSP super-tiles, significant resource reductions could be achieved
Thank You!
References
[de Dinechin 2009] Large Multipliers with Fewer DSP Blocks, FPL 2009
[Kumm 2015] An Efficient Softcore Multiplier Architecture for Xilinx FPGAs, ARITH 2015
[Parandeh-Afshar 2011] Measuring and Reducing the Performance Gap between Embedded and Soft Multipliers on FPGAs, FPL 2011
[Banescu 2010] Multipliers for Floating-Point Double Precision and Beyond on FPGAs, SIGARCH 2010
[Brunie 2013] Arithmetic Core Generation Using Bit Heaps, FPL 2013
Resulting LUT Cost
24×24 (single precision floating point)
#DSP      2       1       0
LUT cost  31.2    179.95  502.8
∆LUT      –       148.75  322.85
CPU [s]   22.7    129     8
32×32 (unsigned)
#DSP      4       3       2       1       0
LUT cost  57.85   119.2   256.8   567.95  881.6
∆LUT      –       61.35   137.6   311.15  313.65
CPU [s]   146     320     187     382     19
53×53 (double precision floating point)
#DSP      9       8       7       6       5
LUT cost  144.3   164.45  307     450.5   759.7
∆LUT      –       20.15   142.55  143.5   309.2
CPU [s]   1433    701     4331    2112    27215
64×64 (unsigned)
#DSP      11      10      9       8       7
LUT cost  198.25  354.8   570.7   862.5   1192.35
∆LUT      –       156.55  215.9   291.15  329.9
CPU [s]   43031   81149   21382   54001   TO
Efficiency Comparison
[Plot: efficiency E (0.6 to 1.2) versus tile area (0 to 70) for the tile shapes 2×k, 1×1, 1×2, 2×3 and 3×3]
Problem Shapes Considered
(a) Multi-input addition of 10 numbers with 10 bit each
(b) x3 operation for an input word size of 6 bit
DSP-based Tiles
[Figure: block diagram of the Xilinx DSP48E1 block]
Previous Work
A Baugh-Wooley-like multiplier that can be efficiently mapped to FPGAs was proposed in [Parandeh-Afshar 2011]
Two partial products are generated and added using the carry chain
A compressor tree for the already reduced partial products is still necessary
[Figure: four LUTs feeding the carry logic]
Figure annotation: full adder
Literature
[Walters 2014] Partial-Product Generation and Addition for Multiplication in FPGAs with 6-Input LUTs, ASILOMAR 2014
[Kumm 2015] An Efficient Softcore Multiplier Architecture for Xilinx FPGAs, ARITH 2015
[Walters 2016] Array Multipliers for High Throughput in Xilinx FPGAs with 6-Input LUTs, Computers, MDPI
[Parandeh-Afshar 2011] Measuring and Reducing the Performance Gap between Embedded and Soft Multipliers on FPGAs, FPL 2011
[de Dinechin 2009] Large Multipliers with Fewer DSP Blocks, FPL 2009
[Banescu 2010] Multipliers for Floating-Point Double Precision and Beyond on FPGAs, SIGARCH 2010
[Brunie 2013] Arithmetic Core Generation Using Bit Heaps, FPL 2013