Resource Optimal Design of Large Multipliers for FPGAs


Martin Kumm*, Johannes Kappauf*, Matei Istoan† and Peter Zipf*

*University of Kassel, Germany
†University Lyon, France

24th IEEE Symposium on Computer Arithmetic

25.07.2017

Motivation

Multiplication is a fundamental arithmetic operation

Embedded multipliers available in the FPGA fabric are limited in size (& quantity)

Larger multipliers can be decomposed into smaller multipliers realized by DSP blocks or logic resources

Question of interest: How to do the decomposition in a (resource) optimal way?


Outline

1. How to formulate the problem as a tiling problem?

2. What do the tiles look like?

3. How to solve the problem?


Multiplier Decomposition

A large multiplier can be decomposed into several smaller multipliers:

$$A \times B = (A_H 2^{n} + A_L)(B_H 2^{m} + B_L) = \underbrace{A_H B_H}_{M_4} 2^{n+m} + \underbrace{A_H B_L}_{M_3} 2^{n} + \underbrace{A_L B_H}_{M_2} 2^{m} + \underbrace{A_L B_L}_{M_1}$$

Multiplier Tiling

The multiplier can be represented graphically as an X×Y board that is tiled by smaller multipliers, represented as rectangles [de Dinechin 2009]

The required left shift can be obtained from the sum of the tile coordinates (x,y)

[Figure: 32×32 board tiled with n = m = 16 bit multipliers M1 to M4; x axis pointing left, y axis pointing up, ticks at 0, 16 and 32]

$$A \times B = (A_H 2^{16} + A_L)(B_H 2^{16} + B_L) = \underbrace{A_H B_H}_{M_4} 2^{32} + \underbrace{A_H B_L}_{M_3} 2^{16} + \underbrace{A_L B_H}_{M_2} 2^{16} + \underbrace{A_L B_L}_{M_1}$$
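The relation between tile coordinates and shifts can be checked numerically. The following is a minimal sketch, assuming that x indexes the bits of A and y the bits of B; the function name `tiled_multiply` is hypothetical and not part of the original slides.

```python
def tiled_multiply(a: int, b: int, n: int = 16, m: int = 16) -> int:
    """Multiply a and b with four smaller products placed on a tiling board."""
    a_lo, a_hi = a & ((1 << n) - 1), a >> n   # A_L, A_H
    b_lo, b_hi = b & ((1 << m) - 1), b >> m   # B_L, B_H
    # Each tile is (x, y, partial product); assumption: x is the bit offset
    # within A and y the bit offset within B.
    tiles = [
        (0, 0, a_lo * b_lo),  # M1
        (0, m, a_lo * b_hi),  # M2
        (n, 0, a_hi * b_lo),  # M3
        (n, m, a_hi * b_hi),  # M4
    ]
    # The left shift of every tile is the sum of its coordinates.
    return sum(p << (x + y) for x, y, p in tiles)
```

With the defaults n = m = 16 this reproduces the 32×32 example: every partial product is shifted by x + y before the final addition.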

Multiplier Tiling

A multiplier tiling is valid under the following rules:

The board must be completely covered without overlaps of the tiles

Overlaps with the border of the board are allowed

[Figure: tiling of a 53×53 multiplier [de Dinechin 2009]; x axis pointing left, y axis pointing up]
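The two validity rules can be expressed as a small check. This is only an illustrative sketch for rectangular tiles with non-negative origins; the function name and the `(x, y, width, height)` placement format are assumptions made here, not notation from the slides.

```python
from collections import Counter

def is_valid_tiling(X, Y, placements):
    """Return True if every board cell is covered by exactly one tile.

    placements: iterable of (x, y, width, height) rectangles (hypothetical format).
    """
    cover = Counter()
    for x, y, w, h in placements:
        for i in range(x, x + w):
            for j in range(y, y + h):
                if i < X and j < Y:       # parts beyond the border are ignored
                    cover[(i, j)] += 1
    return all(cover[(i, j)] == 1 for i in range(X) for j in range(Y))
```

For example, four 2×2 tiles at (0,0), (2,0), (0,2) and (2,2) form a valid tiling of a 3×3 board, because the parts that stick out over the border do not count as overlaps.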


Logic-based Tiles


Several LUT-based multipliers can be used:

3×3 Mult., which can be mapped to six 6-input LUTs (LUT6) [Brunie 2013]

2×3 Mult. which can be mapped to three LUT6 (realizing five LUT5) [Kumm 2015]

1×2 Mult., uses a single LUT6 (realizing two LUT5)

In addition, LUT/carry-chain multipliers are used:

Single row of an FPGA-optimized Baugh-Wooley multiplier [Parandeh-Afshar 2011]

Shapes of the Logic-based Tiles

[Figure: shapes of the logic-based tiles: (a) 3×3, (b) 3×2 / 2×3, (c) 2×1 / 1×2, (d) k×2, (e) 2×k]

LUT Requirements in the Compressor Tree

[Figure: number of LUTs (#LUTs, 0 to 1,000) in the compressor tree versus number of input bits (#bits, 0 to 1,600), plotted for multi-input addition and for the x3 operation, together with the line 0.65 × #bits]

Logic-based Multipliers

To get the "quality" of a multiplier, an efficiency metric is defined as benefit/cost ratio:

$$E_s = \frac{\mathrm{area}_s}{\mathrm{cost}_s}$$

The cost is composed of the LUTs of the multiplier itself (#LUT_m) and the compressor-tree LUTs for its w_s output bits:

$$\mathrm{cost}_s = \#\mathrm{LUT}_m + 0.65\, w_s$$

Shape   Tile area   Word size (w_s)   #LUT_m   Total cost (cost_s)   Efficiency (E_s)
1×1     1           1                 1        1.65                  0.625
1×2     2           2                 1        2.3                   0.87
2×3     6           5                 3        6.25                  0.96
3×3     9           6                 6        9.9                   0.91
2×k     2k          k + 2             k + 1    1.65k + 2.3           2k / (1.65k + 2.3)  (= 1.21 for k → ∞)
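The efficiency values can be recomputed from the cost model. The snippet below is a small sketch; the function and constant names are chosen here for illustration, and only the 0.65 LUTs-per-bit factor and the table entries come from the slides.

```python
COMPRESSOR_LUTS_PER_BIT = 0.65  # slope of the compressor-tree LUT estimate

def cost(luts_mult: float, word_size: float) -> float:
    """cost_s = #LUT_m + 0.65 * w_s"""
    return luts_mult + COMPRESSOR_LUTS_PER_BIT * word_size

def efficiency(area: float, luts_mult: float, word_size: float) -> float:
    """E_s = area_s / cost_s"""
    return area / cost(luts_mult, word_size)

def efficiency_2xk(k: int) -> float:
    """2×k tile: area 2k, word size k + 2, k + 1 LUTs."""
    return efficiency(2 * k, k + 1, k + 2)

for shape, args in {"1x2": (2, 1, 2), "2x3": (6, 3, 5), "3x3": (9, 6, 6)}.items():
    print(shape, round(efficiency(*args), 2))
print("2xk, k = 10^6:", round(efficiency_2xk(10**6), 2))
```

The last line illustrates the limit 2/1.65 ≈ 1.21 that the 2×k tile approaches for large k.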

DSP-based Tiles


Xilinx DSP blocks contain 18×25 bit (signed)/17×24 bit (unsigned) multipliers

They contain additional post-adders

These can be used to add a multiplier result already obtained

This reduces the size of the compressor tree

Graphically, this can be represented as a so-called super-tile [Banescu 2010]
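How a post-adder can absorb an already computed product is sketched below for two unsigned 17×24 tiles that together form a 34×24 multiplier. The `dsp` function is a hypothetical, simplified model of a DSP block (product plus post-adder with an optional 17-bit right shift of the cascaded input); it is an illustration under these assumptions, not the vendor's primitive.

```python
MASK17 = (1 << 17) - 1

def dsp(a17: int, b24: int, pcin: int = 0, shift17: bool = False) -> int:
    """Hypothetical DSP model: unsigned 17x24 product plus post-adder input."""
    assert 0 <= a17 < (1 << 17) and 0 <= b24 < (1 << 24)
    return a17 * b24 + ((pcin >> 17) if shift17 else pcin)

def super_tile_34x24(a34: int, b24: int) -> int:
    """Two cascaded DSP tiles; no external adder is needed between them."""
    p1 = dsp(a34 & MASK17, b24)                       # lower tile: A_L * B
    p2 = dsp(a34 >> 17, b24, pcin=p1, shift17=True)   # upper tile adds p1 >> 17
    # The low 17 bits of p1 are final result bits and bypass the second DSP.
    return (p2 << 17) | (p1 & MASK17)
```

In this model only a single value leaves the pair of DSP blocks, which is why a super-tile contributes fewer bits to the compressor tree than two independent tiles.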

Super-Tiles of Xilinx FPGAs

[Figure: super-tile shapes (a) to (l) for Xilinx FPGAs]


Formalizing the Problem

Constant/Variable            Meaning
x, y ∈ N_0                   Coordinates
X, Y ∈ N_0                   Outer bounds of the multiplier to be designed
M_{x,y} ∈ {0, 1}             Shape of the multiplier to be designed; true when (x, y) is within the area of the multiplier
𝒮                            Set of small multipliers with different shape
S = |𝒮|                      Number of available smaller multipliers
s ∈ {0, 1, ..., S − 1}       Shape index of smaller multiplier
m^s_{x,y} ∈ {0, 1}           Boolean constant describing each small multiplier; true when (x, y) is within the area of the multiplier of shape s
cost_s ∈ R                   Cost of a small multiplier of shape s
d^s_{x,y} ∈ {0, 1}           Decision variable, which is true when a multiplier of shape s is placed at coordinate (x, y)

Specification of a Tile

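Setting

$$m^0_{0,0} = m^0_{0,1} = m^0_{0,2} = m^0_{1,0} = m^0_{1,1} = 1$$

with all other m's zero would define the following tile:

[Figure: the resulting tile of shape 0; x axis pointing left with ticks 0 to 2, y axis pointing up with ticks 0 to 3]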
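One possible way to store such a tile specification in software is as the set of offsets at which m is one. This is a minimal sketch; the names `TILE_0`, `m` and `rectangle` are illustrative choices, not taken from the slides.

```python
# Offsets (x, y) at which m^0_{x,y} = 1; every other entry is zero.
TILE_0 = {(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)}

def m(tile, x, y):
    """Boolean constant m_{x,y} of a tile, returned as 0 or 1."""
    return 1 if (x, y) in tile else 0

def rectangle(width, height):
    """Tile specification of a plain width × height multiplier."""
    return {(x, y) for x in range(width) for y in range(height)}

# The tile area is the number of ones in the specification.
area_0 = sum(m(TILE_0, x, y) for x in range(3) for y in range(4))
```

Storing offsets rather than a dense matrix makes out-of-range lookups such as m(tile, -1, 0) return zero without extra bounds checks.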

ILP Formulation

The multiplier tiling problem can be formulated as an integer linear program (ILP) as follows:

$$\text{minimize} \quad \sum_{s=0}^{S-1} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} \mathrm{cost}_s \, d^s_{x,y}$$

$$\text{subject to} \quad \sum_{s=0}^{S-1} \sum_{x'=0}^{X-1} \sum_{y'=0}^{Y-1} m^s_{x-x',\,y-y'} \, d^s_{x',y'} = 1 \quad \text{for } 0 \le x \le X,\ 0 \le y \le Y \text{ with } M_{x,y} = 1$$

The ILP problem can be solved by using standard solvers
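To make the objective and the coverage constraint concrete for very small boards, the sketch below swaps the ILP solver for a plain exhaustive search with memoization. It is not the method of the slides: it only handles rectangular LUT-based tiles on a full rectangular board (M = 1 everywhere), and the tile names in `TILES` are hypothetical labels attached to the cost values of the logic-based tiles.

```python
from functools import lru_cache

# Hypothetical tile library: name -> (width in x, height in y, cost in LUTs).
TILES = {
    "1x1": (1, 1, 1.65),
    "1x2": (1, 2, 2.3), "2x1": (2, 1, 2.3),
    "2x3": (2, 3, 6.25), "3x2": (3, 2, 6.25),
    "3x3": (3, 3, 9.9),
}

def min_cost_tiling(X, Y, tiles=TILES):
    """Return (cost, placements) of a cheapest exact cover of an X×Y board."""
    cells = [(x, y) for y in range(Y) for x in range(X)]

    @lru_cache(maxsize=None)
    def solve(covered):
        # First uncovered cell in row-major order; for rectangular tiles the
        # tile covering it must have its origin exactly there.
        free = next((c for c in cells if c not in covered), None)
        if free is None:
            return 0.0, ()
        x0, y0 = free
        best = (float("inf"), ())
        for name, (w, h, cost) in tiles.items():
            # Cells beyond the border are dropped (border overlap is allowed).
            footprint = frozenset(
                (x0 + i, y0 + j) for i in range(w) for j in range(h)
                if x0 + i < X and y0 + j < Y)
            if footprint & covered:        # would violate "covered exactly once"
                continue
            sub_cost, sub_plan = solve(covered | footprint)
            if cost + sub_cost < best[0]:
                best = (cost + sub_cost, ((name, x0, y0),) + sub_plan)
        return best

    return solve(frozenset())
```

For a 4×4 board this tile set yields a cost of 17.1 (two six-cell tiles plus two two-cell tiles). The search grows exponentially with the board size, which is the reason a real ILP solver is used for practical multiplier sizes.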

ILP Formulation

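Graphical representation of the left-hand side of the ILP constraint for a tile of shape 0 placed at (1, 2), i.e. with d^0_{1,2} = 1:

[Figure: board with x axis pointing left and y axis pointing up, ticks 0 to 5; each cell around the placed tile is labeled with its product term]

$$m^0_{0,0}\, d^0_{1,2} = m^0_{0,1}\, d^0_{1,2} = m^0_{0,2}\, d^0_{1,2} = m^0_{1,0}\, d^0_{1,2} = m^0_{1,1}\, d^0_{1,2} = 1, \qquad m^0_{0,3}\, d^0_{1,2} = 0$$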

Additional DSP Constraint

The cost of DSP blocks is hard to compare with the cost of LUTs

It is better to constrain the DSP count for a given application

A single additional constraint can be used to specify the number of DSPs (#DSP):

$$\sum_{s=0}^{S-1} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} D_s \, d^s_{x,y} = \#\mathrm{DSP}$$

where D_s specifies the number of DSPs in multiplier shape s
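The DSP constraint simply counts the DSP blocks of all placed tiles. A minimal sketch follows, assuming a placement list of `(shape, x, y)` entries; the shape names and their DSP counts in `DSPS_PER_SHAPE` are hypothetical examples.

```python
# Hypothetical shape library: D_s, the number of DSPs in each shape s.
DSPS_PER_SHAPE = {"dsp_24x17": 1, "super_tile_2dsp": 2, "lut_2x3": 0}

def dsp_count(placements):
    """Left-hand side of the constraint: sum of D_s over all placed tiles."""
    return sum(DSPS_PER_SHAPE[shape] for shape, _x, _y in placements)

def satisfies_dsp_constraint(placements, num_dsp):
    """True when the placement uses exactly num_dsp DSP blocks."""
    return dsp_count(placements) == num_dsp
```

Because the constraint is an equality, the optimization can be repeated for each DSP count of interest to obtain one tiling per count.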

Results

Four important cases were considered:

24×24 (single precision)

32×32

53×53 (double precision)

64×64

Each was evaluated for varying DSP counts, up to a DSP-only implementation

Resulting Tilings 24/32 Bit

[Figure: resulting tilings for 24×24 with 0, 1 and 2 DSPs and for 32×32 with 0, 1, 2, 3 and 4 DSPs]

Callouts in the figure:

Baugh-Wooley multiplier [Parandeh-Afshar 2011]

The 2×k and 1×2 tiles perform best for LUT-based multiplication

Efficient solution utilizing two super-tiles

Resulting Tilings 53 Bit

[Figure: resulting tilings for 53×53 with 5, 6, 7, 8 and 9 DSPs]

Callouts in the figure:

Pinwheel inside of a pinwheel

The logic-based multiplier consumes 1/4 of the area compared to the previous hand-optimized design [de Dinechin 2009]


Resulting Tilings 64 Bit

[Figure: resulting tilings for 64×64 with 7, 8, 9, 10 and 11 DSPs]

Optimization & Synthesis Results

Mult.   Method           #DSP   Area (geom.) logic mult.   Slices   Slice red.   fclk [MHz]
24×24   [Brunie 2013]    1      216                        65       –            212.4
        proposed         1      168                        58       10.8%        287.4
        [Brunie 2013]    2      0                          0        –            418.9
        proposed         2      0                          0        0.0%         418.9
32×32   [Banescu 2010]   0      1024                       339      –            275.8
        proposed         0      1024                       276      18.6%        304.4
        [Brunie 2013]    1      648                        205      –            192.8
        [Banescu 2010]   1      616                        234      –            352.6
        proposed         1      616                        180      12.2%        302.5
        [Brunie 2013]    2      288                        94       –            270.1
        proposed         2      256                        82       12.8%        338.0
        [Brunie 2013]    3      135                        75       –            194.0
        [Banescu 2010]   3      176                        75       –            426.6
        proposed         3      64                         44       41.3%        314.5
        [Brunie 2013]    4      0                          17       –            314.7
        [Banescu 2010]   4      40                         38       –            379.4
        proposed         4      0                          13       23.5%        181.7

Fewer slices because of the better logic-based multiplier/compressor tree

Fewer slices because of better super-tile usage


Mult.   Method           #DSP   Area (geom.) logic mult.   Slices   Slice red.   fclk [MHz]
53×53   [Brunie 2013]    6      468                        196      –            214.1
        [Banescu 2010]   6      721                        220      –            298.2
        proposed         6      361                        180      8.2%         263.2
        [Brunie 2013]    9      162                        125      –            195.6
        [Banescu 2010]   9      215                        174      –            255.8
        proposed         9      0                          72       42.4%        348.8
64×64   [Brunie 2013]    8      1188                       420      –            194.2
        [Banescu 2010]   8      1096                       449      –            280.7
        proposed         8      652                        348      17.1%        261.2
        [Brunie 2013]    11     270                        196      –            162.8
        [Banescu 2010]   11     592                        268      –            225.3
        proposed         11     0                          108      44.9%        265.4

DSP-only solutions with fewer DSPs were found

Conclusion

A method was proposed to optimally solve the multiplier tiling problem using ILP

The method allows trading DSP resources against logic resources

The problem is tractable for practical multiplier sizes

Combined with carefully selected logic-based multipliers and DSP super-tiles, significant resource reductions could be achieved

Thank You!


References

[de Dinechin 2009] Large Multipliers with Fewer DSP Blocks, FPL 2009

[Kumm 2015] An Efficient Softcore Multiplier Architecture for Xilinx FPGAs, ARITH 2015

[Parandeh-Afshar 2011] Measuring and Reducing the Performance Gap between Embedded and Soft Multipliers on FPGAs, FPL 2011

[Banescu 2010] Multipliers for Floating-Point Double Precision and Beyond on FPGAs, SIGARCH 2010

[Brunie 2013] Arithmetic Core Generation Using Bit Heaps, FPL 2013

Resulting LUT Cost

24×24 (single precision floating point)

#DSP        2        1        0
LUT cost    31.2     179.95   502.8
∆LUT        –        148.75   322.85
CPU [s]     22.7     129      8

32×32 (unsigned)

#DSP        4        3        2        1        0
LUT cost    57.85    119.2    256.8    567.95   881.6
∆LUT        –        61.35    137.6    311.15   313.65
CPU [s]     146      320      187      382      19

53×53 (double precision floating point)

#DSP        9        8        7        6        5
LUT cost    144.3    164.45   307      450.5    759.7
∆LUT        –        20.15    142.55   143.5    309.2
CPU [s]     1433     701      4331     2112     27215

64×64 (unsigned)

#DSP        11       10       9        8        7
LUT cost    198.25   354.8    570.7    862.5    1192.35
∆LUT        –        156.55   215.9    291.15   329.9
CPU [s]     43031    81149    21382    54001    TO

Efficiency Comparison

[Figure: efficiency E (0.6 to 1.2) versus tile area (0 to 70) for the 2×k, 1×1, 1×2, 2×3 and 3×3 tiles]

Problem Shapes Considered

(a) Multi-input addition of 10 numbers with 10 bit each

(b) x3 operation for an input word size of 6 bit

DSP-based Tiles

[Figure: Xilinx DSP48E1 block diagram, showing the dual A, D and pre-adder stage, the dual B register, the 25×18 multiplier and the 17-bit shift paths into the post-adder]

Previous Work

A Baugh-Wooley-like multiplier that can be efficiently mapped to FPGAs was proposed in [Parandeh-Afshar 2011]

Two partial products are generated and added using the carry chain

A compression tree for the already reduced partial products is still necessary

[Figure: four LUTs feeding the carry logic; callout: full adder]

Literature

[Walters 2014] Partial-Product Generation and Addition for Multiplication in FPGAs with 6-Input LUTs, ASILOMAR 2014

[Kumm 2015] An Efficient Softcore Multiplier Architecture for Xilinx FPGAs, ARITH 2015

[Walters 2016] Array Multipliers for High Throughput in Xilinx FPGAs with 6-Input LUTs, Computers, MDPI

[Parandeh-Afshar 2011] Measuring and Reducing the Performance Gap between Embedded and Soft Multipliers on FPGAs, FPL 2011

[de Dinechin 2009] Large Multipliers with Fewer DSP Blocks, FPL 2009

[Banescu 2010] Multipliers for Floating-Point Double Precision and Beyond on FPGAs, SIGARCH 2010

[Brunie 2013] Arithmetic Core Generation Using Bit Heaps, FPL 2013