[IEEE ICSE 2002. IEEE International Conference on Semiconductor Electronics. Proceedings - Penang, Malaysia (19-21 Dec. 2002)] ICONIP '02. Proceedings of the 9th International Conference


ICSE2002 Proc. 2002, Penang, Malaysia

High Performance Parallel Multiplier using Wallace-Booth Algorithm

Lakshmanan, Masuri Othman and Mohamad Alauddin Mohd. Ali

Signal Processing Research Group, Department of Electrical, Electronic and Systems Engineering, Universiti Kebangsaan Malaysia, 43600 UKM, Bangi, Selangor DE, Malaysia

E-mail: lakshmanan@vlsi.eng.ukm.my

Abstract This paper presents an efficient implementation of a VLSI high-speed parallel multiplier using the Radix-4 modified Booth algorithm and the Wallace Tree structure. The design is structured for an n x m multiplication where n can be up to 126 bits. The Wallace Tree structure serves to compress the partial product terms in a ratio of 3:2 [1]. To enhance the speed of operation, carry-look-ahead (CLA) adders are used, whose delay is largely independent of the number of bits of the two operands [2]. Efficient VHDL code was written and successfully simulated and synthesised using Altera's MAX+plus II (10.0) and ModelSim 3.4 CAD tools.

I. INTRODUCTION

Arithmetic operations dominate the execution time of most Digital Signal Processing (DSP) algorithms, and the time taken to execute a multiplication is still the dominating factor in determining the instruction cycle time of DSP chips and Reduced Instruction Set Computers (RISC) [3]. Among the many methods of implementing high-speed parallel multipliers, there are two basic approaches, namely the Booth algorithm [4] and Wallace Tree compressors.

This paper describes an n x m bit signed parallel multiplier where n is less than or equal to 126. The Radix-4 modified Booth algorithm is used rather than the Radix-2 Booth algorithm because in Radix-4 the number of partial products is reduced to n/2. Wallace Tree structure multipliers alone could be used, but in that form the multiplier array becomes very large and requires large numbers of logic gates and interconnecting wires, which makes the chip design large and slows down the operating speed.

0-7803-7578-5/02/$17.00 ©2002 IEEE

The paper is organized as follows: Section II describes the proposed algorithm. Section III describes the architecture and the VHDL implementation techniques. The simulation and synthesis results and the conclusion are given in Section IV.

II. ALGORITHM

One way of realizing high-speed multipliers is to enhance parallelism, which helps to decrease the number of subsequent calculation stages. The original version of the Booth algorithm (Radix-2) had two drawbacks [6]:

(i) The number of add-subtract operations and the number of shift operations is variable, which is inconvenient in designing parallel multipliers.

(ii) The algorithm becomes inefficient when there are isolated 1's.

These problems are overcome by using the modified Radix-4 Booth algorithm, which scans strings of three bits using the algorithm given below:

1) Extend the sign bit by one position if necessary to ensure that n is even.

2) Append a 0 to the right of the LSB of the multiplier.

3) According to the value of each vector (see Table 1), each partial product will be 0, +y, -y, +2y or -2y.

The negative values of y are formed by taking the 2's complement; in this paper, carry-look-ahead (CLA) fast adders are used for the addition.

The multiplication of y by 2 is done by shifting y by one bit to the left. Thus, in designing an n-bit parallel multiplier, only n/2 partial products are generated.

Table 1. Radix-4 modified Booth algorithm scheme for odd values of i.
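The three recoding steps can be sketched as a small behavioural model. This is a minimal Python sketch and not the authors' VHDL: the function name `booth_radix4_digits` is hypothetical, and the digit table is the standard textbook radix-4 Booth encoding, assumed here to match the scheme of Table 1 (it agrees with the recoding of 101 and 110 to -y described in Section III).

```python
# Standard radix-4 modified Booth encoding, indexed by the three scanned bits
# (x[i+1], x[i], x[i-1]). Assumed to match the paper's Table 1.
BOOTH_DIGIT = {
    0b000: 0, 0b001: +1, 0b010: +1, 0b011: +2,
    0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0,
}


def booth_radix4_digits(x, n):
    """Recode an n-bit signed multiplier x into radix-4 digits, LSB digit first.

    Each digit d selects the partial product d * y, i.e. 0, +y, -y, +2y or -2y.
    """
    if n % 2:
        n += 1                        # step 1: extend the sign bit so n is even
    pattern = x & ((1 << n) - 1)      # n-bit two's-complement bit pattern of x
    padded = pattern << 1             # step 2: append a 0 to the right of the LSB
    digits = []
    for i in range(n // 2):           # step 3: scan overlapping groups of 3 bits
        triplet = (padded >> (2 * i)) & 0b111
        digits.append(BOOTH_DIGIT[triplet])
    return digits


if __name__ == "__main__":
    digits = booth_radix4_digits(55, 8)
    print(digits)
    # The digits are a radix-4 representation of the multiplier.
    print(sum(d * 4 ** i for i, d in enumerate(digits)))
```

An 8-bit multiplier yields four digits, which is the n/2 partial products stated above.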

The process applied by Wallace can be summarized as follows: after generating the partial products, a set of carry-save adders (CSAs) is used to reduce the partial product terms in a ratio of 3:2, in the arrangement shown in Fig. 1.

Fig. 1. Wallace tree multiplier.

A CSA is a collection of full adders. The CSA accepts three n-bit operands and generates two n-bit results, an n-bit partial sum and an n-bit carry. A second CSA accepts these two bit sequences and another input operand and generates a new partial sum and carry. The term carry-propagate adder (CPA) is used to denote an adder which is not a CSA. A propagate adder may propagate its carry using ripple-carry adders, carry-look-ahead (CLA) or some other method; in this design, CLA is used. In the general case, the delay accumulation of ripple-carry adders is given as $n \Delta_{FA}$, where $\Delta_{FA}$ is the delay of a single full adder.
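The behaviour of one CSA stage can be illustrated with a short bit-level model. This is a sketch under stated assumptions: the function name `carry_save_add` and the explicit `width` parameter are hypothetical, and words are modelled as Python integers truncated to `width` bits.

```python
def carry_save_add(a, b, c, width):
    """3:2 compression of three width-bit words into (partial_sum, carry).

    No carry ripples between bit positions: every bit position is an
    independent full adder, so the delay does not grow with width.
    """
    mask = (1 << width) - 1
    partial_sum = (a ^ b ^ c) & mask                       # full-adder sum bits
    carry = (((a & b) | (a & c) | (b & c)) << 1) & mask    # carry bits, weight x2
    return partial_sum, carry


if __name__ == "__main__":
    s, cy = carry_save_add(5, 3, 6, 8)
    # The two outputs together still represent the sum of the three inputs.
    print(s, cy, s + cy)
```

A carry-propagate adder is needed only once, at the end, to add the final partial sum and carry words.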

The number of levels of the Wallace tree for k partial product terms can be approximately given as:

$$\text{number of levels} \approx \frac{\log(k/2)}{\log(3/2)}$$

Number of partial products (k)    Number of levels
10 ≤ k ≤ 13                       5
29 ≤ k ≤ 42                       8

Table 2. The number of levels in the Wallace tree for k partial product terms.
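The level counts can be checked by simulating the 3:2 reduction directly. The following is a minimal sketch, assuming each level replaces every full group of three terms by two and passes leftover terms through unchanged; the function names are hypothetical.

```python
import math


def wallace_levels(k):
    """Number of 3:2 CSA levels needed to reduce k terms to two."""
    levels = 0
    while k > 2:
        k = 2 * (k // 3) + k % 3    # groups of 3 become 2, leftovers pass through
        levels += 1
    return levels


def wallace_levels_estimate(k):
    """Closed-form approximation log(k/2) / log(3/2) from the text."""
    return math.log(k / 2) / math.log(3 / 2)


if __name__ == "__main__":
    for k in (3, 4, 13, 29, 42, 63):
        print(k, wallace_levels(k), round(wallace_levels_estimate(k), 2))
```

The closed form is only approximate: for k = 29 it gives about 6.6, while the reduction needs 8 levels. For the largest case considered in this paper, n = 126, the k = 63 partial products need 9 levels.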

III. ARCHITECTURE AND VHDL

The simplified block diagram of the architecture is shown in Fig. 2.

Fig. 2. A simplified block diagram of the Wallace-Booth architecture (block labels: 3:2 compressor, result).



The VHDL code for the entire Wallace-Booth multiplier is grouped into two parts. The common operating components, namely (i) the partial product generator, (ii) the shift register, (iii) the CSA, (iv) the CLA and (v) the Wallace tree structure, are declared in a subprogram package called wallace-booth-lib.vhd. These components take the form of procedures or functions. The second part forms the main entity of the Wallace-Booth multiplier.

One main reason for declaring these components as procedures and functions is that, for the partial product generator, which is made up of an n/2 array register, the interlinks between the subcomponents are handled much more elegantly than with separate VHDL codes, which would require dealing with too many interconnect wires and add to the complexity of the design. To reduce the interconnects, variables are used instead of signals, because variables represent local storage whereas signals represent circuit interconnects.

In the partial product generator component, each recoded operation is performed by a simple structure consisting of a counter, a multiplexer and a register, as shown in Fig. 3. All of the recoded operations (+0, +y, +2y, -y and -2y) are done using the for-loop statement.

Fig. 3. Circuit to implement a counter and a storage.

The 2's complement for the recoded operation -y, for x equal to 101 or 110, is shown in Fig. 4, and the corresponding portion of the VHDL code is shown in Listing 1.

    WHEN "101" =>
        FOR k IN 0 TO (m-1) LOOP
            Intermediate(k) := y(k) XOR '1';  -- take 1's complement of y
        END LOOP;
        -- add 1 to the previous result to obtain the 2's complement
        Intermediate := CLA_adder(Intermediate, Null_Vector, '1');
        FOR k IN 0 TO (m-1) LOOP
            PP(i)(k) := Intermediate(k);
        END LOOP;

Listing 1. VHDL code for two's complement.

Fig. 4. Circuit to implement 2's complement arithmetic.
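The invert-then-add-one scheme of Listing 1 can be modelled at bit level as follows. This is a hedged sketch, not the package code: `cla_add` is a hypothetical stand-in for the paper's CLA adder function, taking two words, a carry input and an assumed explicit `width`. It evaluates the look-ahead carry recurrence c[i+1] = g[i] OR (p[i] AND c[i]) in a loop, whereas a hardware CLA expands that recurrence so the carries are produced in parallel.

```python
def cla_add(a, b, carry_in, width):
    """Add two width-bit words using generate/propagate carries.

    Returns (sum, carry_out). Behavioural model only: the carries are
    computed one after another here, in parallel in real CLA logic.
    """
    carries = [carry_in]
    for i in range(width):
        generate = (a >> i) & (b >> i) & 1
        propagate = ((a >> i) ^ (b >> i)) & 1
        carries.append(generate | (propagate & carries[i]))
    total = 0
    for i in range(width):
        bit = ((a >> i) ^ (b >> i) ^ carries[i]) & 1
        total |= bit << i
    return total, carries[width]


def twos_complement(y, width):
    """Form -y as in Listing 1: XOR every bit with '1', then add 1."""
    ones_complement = y ^ ((1 << width) - 1)
    result, _ = cla_add(ones_complement, 0, 1, width)   # the +1 enters as carry-in
    return result


if __name__ == "__main__":
    print(twos_complement(5, 8))     # bit pattern of -5 in 8 bits
    print(cla_add(200, 100, 0, 8))
```

Feeding the constant 1 through the carry input means no separate incrementer is needed, which matches the three-argument call in Listing 1.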

The following listing of VHDL code ensures that the number of bits n used is even. If n is even, the number of partial products is n/2; otherwise it is (n/2+1).

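    IF (n MOD 2 = 0) THEN                 -- n even (condition inferred from the text above)
        nb_pp := n/2;                     -- calculate the number of partial products
    ELSE
        nb_pp := n/2 + 1;
        x_coded(n+1) := x_coded(n);       -- extend the sign bit by one position
    END IF;

Listing 2. VHDL code to ensure an even number of bits.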
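The effect of this step can be sketched on a bit string. The helper name `pad_to_even` and the MSB-first string representation are assumptions made for illustration only.

```python
def pad_to_even(x_bits):
    """Sign-extend a two's-complement bit string (MSB first) to even length.

    Returns the padded string and the number of partial products.
    """
    if len(x_bits) % 2:
        x_bits = x_bits[0] + x_bits   # repeat the sign bit once
    return x_bits, len(x_bits) // 2


if __name__ == "__main__":
    print(pad_to_even("101"))    # 3-bit value -3 becomes the 4-bit value -3
    print(pad_to_even("0110"))   # already even, unchanged
```

Repeating the sign bit leaves the two's-complement value unchanged, so the padded multiplier recodes to the same product.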

A recoded operation such as +2y multiplies the multiplicand by 2. This is achieved by left-shifting y by one bit and setting the LSB to zero, as implemented in Listing 3.

    pp(i)(0) := '0';
    for k in 0 to (m-1) loop
        pp(i)(k+1) := y(k);  -- shift one position to multiply by 2
    end loop;

Listing 3. VHDL for shift operation.
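The same shift can be mirrored in a few lines of Python. This is a sketch only: `times_two` is a hypothetical name, and the bit list is indexed with position 0 as the LSB to match the indexing used in Listing 3.

```python
def times_two(y_bits):
    """Shift an m-bit word left by one position; index 0 is the LSB.

    Returns m + 1 bits so that the shifted-out MSB is not lost.
    """
    pp = [0] * (len(y_bits) + 1)
    pp[0] = 0                        # LSB of the result is set to zero
    for k in range(len(y_bits)):
        pp[k + 1] = y_bits[k]        # every bit moves up one position
    return pp


if __name__ == "__main__":
    print(times_two([1, 0, 1]))      # 5 (LSB first) becomes 10
```

The result is one bit wider than y, so the partial product register has to provide at least m + 1 bits for the +2y and -2y cases.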

IV. SIMULATION, SYNTHESIS AND CONCLUSION

The VHDL codes were compiled successfully and functionally verified in MAX+plus II. Some random numbers were used for verification, and Fig. 6 shows the simulated results. For the three arithmetic operations tested, the simulation yielded:

125 x 55 = 6875
20 x 50 = 1000
45 x 33 = 1485
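The three test products can be reproduced with an end-to-end behavioural model of the Booth-Wallace data path. This is a minimal sketch under stated assumptions, not the synthesized design: the function name is hypothetical, the digit table is the standard radix-4 Booth encoding, partial products are held as (n+m)-bit two's-complement words, and Python's built-in addition stands in for the final CLA.

```python
def booth_wallace_multiply(x, y, n, m):
    """Multiply an n-bit signed multiplier x by an m-bit signed multiplicand y."""
    if n % 2:
        n += 1                                    # sign-extend x to an even width
    width = n + m                                 # width of the product
    mask = (1 << width) - 1
    padded = (x & ((1 << n) - 1)) << 1            # append a 0 below the LSB
    digit_of = {0: 0, 1: 1, 2: 1, 3: 2, 4: -2, 5: -1, 6: -1, 7: 0}

    # Booth stage: one partial product per bit pair, shifted to weight 4**i.
    terms = []
    for i in range(n // 2):
        digit = digit_of[(padded >> (2 * i)) & 0b111]
        terms.append(((digit * y) << (2 * i)) & mask)

    # Wallace stage: 3:2 compress level by level until two words remain.
    while len(terms) > 2:
        reduced = []
        for j in range(0, len(terms) - 2, 3):
            a, b, c = terms[j:j + 3]
            reduced.append((a ^ b ^ c) & mask)
            reduced.append((((a & b) | (a & c) | (b & c)) << 1) & mask)
        reduced.extend(terms[len(terms) - len(terms) % 3:])   # leftover terms
        terms = reduced

    total = sum(terms) & mask                     # final carry-propagate addition
    # Interpret the result as a signed two's-complement number.
    return total - (1 << width) if total >> (width - 1) else total


if __name__ == "__main__":
    for x, y in ((125, 55), (20, 50), (45, 33)):
        print(x, "x", y, "=", booth_wallace_multiply(x, y, 8, 8))
```

The model also accepts negative operands, for example `booth_wallace_multiply(-125, 55, 8, 8)`, since all intermediate words are two's-complement patterns.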


Fig. 5. Successful compilation of the VHDL code.

Fig. 6. Simulation results.

The synthesized result of the Booth-Wallace chip with FPGA Express is shown in Fig. 7.

The operating frequency was about 19 MHz when tested with 16-bit data inputs. This multiplier was designed to multiply n x m bits, where n can range up to 126 bits. Hence, using the Radix-4 modified Booth algorithm and the Wallace Tree structure, signed and unsigned numbers can be multiplied successfully.

REFERENCES

[1] C.S. Wallace, A suggestion for fast multipliers, IEEE Trans. Electron. Comput., vol. EC-13, pp. 14-17, Feb. 1964.
[2] Israel Koren, Computer Arithmetic Algorithms, Prentice Hall, 1993.
[3] S.Y. Kung, VLSI Array Processors, Prentice Hall, 1998.
[4] A.D. Booth, A signed binary multiplication technique, Quart. J. Mech. Appl. Math., vol. 4, pp. 236-240, 1951.
[5] L.P. Rubinfield, A proof of the modified Booth algorithm for multiplication, IEEE Trans. on Computers, C-24, pp. 1014-1015, Oct. 1975.
[6] David A. Patterson & John L. Hennessy, Computer Architecture - A Quantitative Approach, Morgan Kaufmann, 1996.
[7] Douglas L. Perry, VHDL, 2nd Edition, McGraw-Hill Inc.
[8] Kevin Skahill, VHDL for Programmable Logic, Addison-Wesley Publishing Co., Inc., 1996.
[9] K.C. Chang, Digital Design and Modelling with VHDL and Synthesis, IEEE Computer Society Press, Los Alamitos, California, 1997.
[10] Ulrich Heinkel et al., The VHDL Reference - A Practical Guide to Computer-Aided Integrated Circuit Design including VHDL-AMS, John Wiley & Sons, Ltd., 2000.

Fig. 7. Synthesized output of the Booth-Wallace multiplier.
