[IEEE 2013 International Conference on Green Computing, Communication and Conservation of Energy (ICGCE) - CHENNAI, India (2013.12.12-2013.12.14)] 2013 International Conference on

Abstract—The concept of reconfigurable computing combines the flexibility of software with the high performance of hardware. FPGA-based hardware provides bit-level, or fine-grain, reconfigurability, whereas ASIC-based hardware is capable of coarse-grain reconfiguration, which leads to accelerated hardware performance with less reprogramming time. This paper presents the implementation of a multiply-accumulate (MAC) unit that is reconfigurable with respect to the word length of its operands. The unit can process signed or unsigned numbers as required. The sub-units (multiplier, adder and sign selection unit) are reconfigurable and can function either as independent units or together as a multiply-accumulate unit. Reconfiguration of word length, throughput or data type is implemented with a set of multiplexers, demultiplexers and pipeline registers. Two implementations using different adders are compared: one uses carry-save addition in the adder module and the other uses ripple-carry addition. The implementation using the ripple carry adder shows a significant improvement in area and power consumption over the other. However, the carry save adder gives about a 2% improvement in speed over its counterpart.

Index Terms—MAC, Reconfiguration, Carry Save Adder, Ripple Carry Adder.

I. INTRODUCTION

FPGA-based systems have been a reasonable solution for fine-grain (bit-level) reconfigurable computing. These systems have a shorter time to market, but they offer lower performance and are unsuitable when power is a design constraint. ASIC-based systems are suitable for accelerated performance and low-power needs. Applications requiring data-intensive computations on variable word lengths, together with good performance under power constraints, could choose ASIC implementations with coarse granularity. Coarse-grain reconfigurable systems are mostly used in digital signal processing systems. Many efforts have already been put into the design and implementation of reconfigurable systems [1]. In a reconfigurable system, the hardware utilization is raised or the functionality is altered at runtime [2]. The implementation of such a system comprises design reconfigurations, control reconfigurations and connection reconfigurations [3]. Many implementations of reconfigurable multipliers and reconfigurable adders are available [4-6], but all of these act as an individual adder or multiplier unit. A reconfigurable MAC unit has previously been implemented that can function as a MAC, as an individual reconfigurable multiplier or as an individual adder, depending on the reconfiguration signal given [7-10]. When the unit is operated as a MAC, it uses the same reconfigurable multiplier and adder. Carry-save addition has been used to implement the reconfigurable adder [11].

In this paper, a coarse-grain reconfigurable multiply-accumulate (MAC) unit is implemented [12]. Two architectures are considered, each incorporating a different adder unit: one uses a carry save adder and the other a ripple carry adder. The MAC is reconfigurable with respect to the word length of the operands, the type of data and its functionality. The following sections discuss the key points of the MAC architecture, the individual sub-units of the MAC and an alternate implementation. Finally, the two implementations are compared with respect to area, power and performance.

II. RECONFIGURABLE MAC

The architecture of the reconfigurable multiply-accumulate unit (MAC) is shown in Fig. 1. It consists of the following sub-units: 1) Input register (IR), 2) Sign Selection unit (SSU), 3) Multiplication unit (MU), 4) Addition unit (AU), 5) Accumulation unit (AcU) and 6) Reconfiguration Control unit (RCU). The input register is controlled by the system clock and samples data from the system bus every cycle. It provides data to the modules that follow it in the system. The SSU interprets the incoming data as signed or unsigned. It comprises two sub-units. One converts the incoming signed numbers into unsigned numbers and stores the signs. The other evaluates the sign of the result using the stored signs of the operands. The multiplication unit computes products of unsigned numbers of varying word length. The word length of the operands can be 4, 8 or 16 bits. The MU has a 4x4 array multiplier as its basic module. The addition unit computes sums of varying word length. The word length of the operands to the adder unit can be 8, 16 or 32 bits. The accumulation unit collects each product from the multiplication unit and adds it to the previous result. The RCU is a distributed network of control signals that reconfigure the modules above to provide different functionalities: multiplication, addition or multiply-accumulation. The RCU also configures the type (signed or unsigned) and word length of the data fed to the sub-units.

Comparison of architectures of a coarse-grain reconfigurable multiply-accumulate unit

C B Bidhul, Naveen Hampannavar, Sajeevan Joseph (M.Tech VLSI students), Jayakrishnan P, Sivasankaran Kumaravel (Assistant Professor Senior)

VLSI Division, School of Electronics Engineering, VIT University, Vellore, Tamil Nadu, 632014, India [email protected]

978-1-4673-6126-2/13/$31.00 ©2013 IEEE

Figure 1. Reconfigurable Multiply and Accumulate Unit (MAC)

A. Operation of the MAC

The MAC architecture outlined in the previous section allows the unit to function in three modes: 1) reconfigurable multiplier, 2) reconfigurable adder and 3) multiply-accumulator.

When it functions as a multiplier, the data reaches the multiplication unit from the system bus through the input register, DEMUX1 and the SSU at the input. The SSU converts signed numbers into unsigned ones and passes them to the MU. The product reaches the system bus from the MU through DEMUX2, the SSU at the output, MUX1 and MUX2.

When it functions as a reconfigurable adder, the data reaches the AU from the input register through DEMUX1. The computed sum is sent to the output bus through MUX2.

The third function, multiply-accumulation, takes the products computed at the MU through DEMUX2 and adds them at the AcU to the previously computed products. The result is sent out to the system bus through MUX1 and MUX2.

B. Sign Selection Unit

The SSU converts the incoming signed data into unsigned data so that the following modules operate on unsigned data. It also stores the signs of the input operands in registers and uses them to evaluate the sign of the result once the corresponding unsigned result has been computed. The data flow through the SSU is shown in Fig. 2. The SSU at the input side comprises circuitry that acts on a 32-bit data word. When the data is two 16-bit numbers, bits 31 and 15 represent the signs. When it is four 8-bit numbers, bits 31, 23, 15 and 7 represent the signs. Similarly, when it is eight 4-bit numbers, the sign bits are 31, 27, 23, 19, 15, 11, 7 and 3. To convert to unsigned number format, the sign bits are reset. For example, if two 16-bit numbers are input as 92378543h, then the unsigned data will be 12370543h. As the data word length varies, the unsigned product computation time also varies. Hence, the signs of the input operands need to be stored in registers for different lengths of time. The stored signs are made available at the SSU output sub-unit, where the sign of the product is evaluated and appended to the result if it is a signed multiplication.
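The sign-bit reset described above can be illustrated with a small behavioural model. This is only a sketch in Python, not the authors' Verilog; the function name `strip_signs` and the most-significant-lane-first ordering of the returned signs are assumptions made for illustration.

```python
def strip_signs(word, width):
    """Clear the sign bit of every `width`-bit lane of a 32-bit word.

    Returns (unsigned_word, signs), where `signs` lists the stored sign
    bits starting from the most significant lane (hypothetical ordering).
    """
    if width not in (4, 8, 16):
        raise ValueError("word length must be 4, 8 or 16 bits")
    signs = []
    mask = 0
    # Sign bit positions are 31, 31 - width, 31 - 2*width, ...
    for top in range(31, -1, -width):
        signs.append((word >> top) & 1)
        mask |= 1 << top
    return word & ~mask & 0xFFFFFFFF, signs


# The example from the text: two 16-bit numbers packed as 92378543h.
unsigned, signs = strip_signs(0x92378543, 16)
print(f"{unsigned:08X}h", signs)  # prints "12370543h [1, 1]"
```

The same call with `width=8` or `width=4` clears bits 31, 23, 15, 7 or bits 31, 27, ..., 3 respectively, matching the bit positions listed in the text.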

Figure 2. Sign Selection Unit

C. Multiplication Unit

The multiplication unit multiplies two unsigned numbers of varying word length. The word length of the operands is reconfigurable as 4, 8 or 16 bits [13]. As a 32-bit data word is fetched into the system every cycle, it corresponds to two 16-bit numbers, four 8-bit numbers or eight 4-bit numbers. With this word length reconfiguration in place, the multiplier is expected to perform four 4-bit multiplications, two 8-bit multiplications or a single 16-bit multiplication. A multiplier with varying data word length can be implemented using multipliers of smaller word length [14]. The smaller multipliers compute partial products, which are added suitably by a set of adders to obtain the final product. A simple approach to the implementation of an 8x8 multiplier using four 4x4 multipliers is shown in Fig. 3. Each 8-bit number can be divided into two 4-bit numbers. If $A = A_1 \cdot 2^4 + A_0$ and $B = B_1 \cdot 2^4 + B_0$, where $A$ and $B$ are 8-bit numbers and $A_1$, $A_0$, $B_1$, $B_0$ are 4-bit numbers, then $A \cdot B = A_1 B_1 \cdot 2^8 + (A_1 B_0 + A_0 B_1) \cdot 2^4 + A_0 B_0$.


Hence, four partial products are computed using 4x4 multipliers. The weighted bits of these partial products are added by two adders: a carry save adder for three 8-bit operands and a 4-bit adder. In a similar way, the 16x16 multiplier is implemented using four 8x8 multipliers.
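The decomposition of a wide product into four narrower partial products can be checked with a short behavioural model. The sketch below is written in Python for illustration only; the function names are hypothetical, and the plain `+` operators stand in for the carry-save and 4-bit adders of the actual hardware.

```python
def mul4x4(a, b):
    """Stand-in for the 4x4 array multiplier (4-bit unsigned operands)."""
    assert 0 <= a < 16 and 0 <= b < 16
    return a * b


def mul8x8(a, b):
    """8x8 product built from four 4x4 partial products."""
    a1, a0 = a >> 4, a & 0xF
    b1, b0 = b >> 4, b & 0xF
    high = mul4x4(a1, b1) << 8                        # weight 2^8
    mid = (mul4x4(a1, b0) + mul4x4(a0, b1)) << 4      # weight 2^4
    low = mul4x4(a0, b0)                              # weight 2^0
    return high + mid + low


def mul16x16(a, b):
    """16x16 product built the same way from four 8x8 products."""
    a1, a0 = a >> 8, a & 0xFF
    b1, b0 = b >> 8, b & 0xFF
    return (mul8x8(a1, b1) << 16) + ((mul8x8(a1, b0) + mul8x8(a0, b1)) << 8) + mul8x8(a0, b0)


print(mul8x8(0xAB, 0xCD) == 0xAB * 0xCD)          # prints True
print(mul16x16(0xBEEF, 0x1234) == 0xBEEF * 0x1234)  # prints True
```

Because the 16x16 multiplier reuses the 8x8 structure, which in turn reuses the 4x4 module, the same basic block serves all three word lengths.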

Figure 3. An 8x8 multiplication decomposed into four 4x4 multiplications

The complexity of the multiplier increases with the data word length. This also elongates the critical path as more combinational stages add up. A pipeline stage can be introduced in between in order to reduce this critical path [15]. Implementation of the pipeline stage in an 8x8 multiplier is shown in Fig. 4. Pipeline stages can be enabled in order to increase the throughput of the circuit. Thus, the multiplier gives four 8-bit products after a single pipeline stage, two 16-bit products after two pipeline stages and a single 32-bit product after three pipeline stages.

D. Addition Unit

A carry save adder is used for addition. When the number of summands is large or the inputs have a large word length, carry-save addition is the fastest [16-17]. The basic unit of the adder module is the carry save adder, which can add four 4-bit numbers. The carry-save block (abbreviated as Csv-Block) depicted in Fig. 5 can be divided into two sections: the first consists of four blocks of half and full adders, and the second consists of the fifth column of half and full adders and the OR gate at the end. The first section varies depending on the bit width of the summands. The second section determines the sum of the final carry bits obtained from the last column of the first section [9]. To construct an adder for a larger number of summands, for example eight 4-bit summands, two Csv-Blocks and a 6-bit adder are used. Each Csv-Block, which adds four 4-bit numbers, gives a 6-bit sum. These 6-bit sums are added using the 6-bit adder. Bigger structures can be created using more Csv-Blocks, with the 6-bit adder replaced by an appropriate unit. Fig. 6 shows an adder for four 8-bit numbers built from two Csv-Blocks and a 6-bit adder. The 6-bit adder adds a 6-bit number and a 2-bit number, so the most significant bit (the seventh sum bit) of the 6-bit adder need not be considered.
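The principle of carry-save addition used by the Csv-Block can be sketched as follows. This Python model uses generic 3:2 compression followed by one carry-propagate addition; it is an assumption-laden illustration and does not reproduce the gate-level arrangement of half adders, full adders and the OR gate in Fig. 5. All function names are hypothetical.

```python
def csa(x, y, z):
    """3:2 carry-save compressor: x + y + z == sum_bits + carry_bits."""
    sum_bits = x ^ y ^ z
    carry_bits = ((x & y) | (y & z) | (x & z)) << 1
    return sum_bits, carry_bits


def csv_block(a, b, c, d):
    """Add four 4-bit numbers; the result fits in 6 bits (max 4 * 15 = 60)."""
    s1, c1 = csa(a, b, c)      # first compression level, no carry propagation
    s2, c2 = csa(s1, c1, d)    # second compression level
    return s2 + c2             # single final carry-propagate addition


def add_eight_4bit(nums):
    """Eight 4-bit summands: two Csv-Blocks followed by a 6-bit adder."""
    return csv_block(*nums[:4]) + csv_block(*nums[4:])


print(csv_block(15, 15, 15, 15))   # prints 60
print(add_eight_4bit([15] * 8))    # prints 120
```

Carries are only propagated once, in the final addition, which is why the approach scales well as the number of summands grows.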

Figure 4. Reconfigurable 8x8 multiplier

Figure 5. Carry-save block (Csv-Block) which can add four 4-bit numbers

To add summands of greater width, the same method as in Fig. 6 can be used. The reconfigurable adder can operate in different configurations depending on the configuration signals (cb1 and cb2) used: it adds either four 32-bit numbers, eight 16-bit numbers or sixteen 8-bit numbers. Its architecture is given in Fig. 7. It uses pipelining to increase the operating clock frequency and to ensure synchronization of all stages [18].

Figure 6. An adder with four 8-bit inputs

When cb1 = 0 and cb2 = 0, 10-bit adder1_1 and 10-bit adder1_2 each add eight 8-bit numbers and produce an 11-bit sum. The resulting sums are added in the 11-bit adder, producing a 12-bit final sum. Under this condition the adder functions as four 32-bit adders.

Figure 7. Reconfigurable adder which can add sixteen 8-bit summands, eight 16-bit summands or four 32-bit summands

When cb1 = 1 and cb2 = 0, 10-bit adder2_1 takes its inputs from DMUX1_1 and DMUX1_2 and produces an 18-bit sum, as shown in Fig. 7. Similarly, 10-bit adder2_2 takes its inputs from DMUX1_3 and DMUX1_4 and produces an 18-bit sum. These two 18-bit sums are taken through DMUX2_1 and DMUX2_2 and added in 18-bit adder1, producing a 19-bit result. Under this condition the adder functions as eight 16-bit adders which produce a 19-bit sum.
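The sum widths quoted for the adder stages follow from the largest value that a given number of summands can reach. The short Python sketch below checks that arithmetic; the helper name `sum_width` is hypothetical.

```python
def sum_width(n_summands, width):
    """Bits needed to hold the sum of `n_summands` unsigned `width`-bit numbers."""
    max_sum = n_summands * ((1 << width) - 1)
    return max_sum.bit_length()


# (number of summands, word length) pairs that appear in the text
for n, w in [(8, 8), (16, 8), (4, 16), (8, 16)]:
    print(f"{n} x {w}-bit -> {sum_width(n, w)}-bit sum")
# 8 x 8-bit -> 11-bit sum
# 16 x 8-bit -> 12-bit sum
# 4 x 16-bit -> 18-bit sum
# 8 x 16-bit -> 19-bit sum
```

These values agree with the 11-bit and 12-bit sums of the 8-bit configuration and with the 18-bit and 19-bit sums of the 16-bit configuration.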

E. Accumulation Unit (AcU)

The implementation of the AcU is shown in Fig. 8. It consists of a two-summand adder and a delay register. The delay register holds the present sum of the adder and feeds it back to the adder to be added to the next available input. In our implementation we chose a carry-lookahead adder because it is one of the fastest adders for two summands. The maximum word length of the sum that exits the adder is 35 bits.
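The adder-plus-delay-register structure can be modelled behaviourally as below. This is a minimal Python sketch, not the authors' design: the class and method names are hypothetical, and wrapping the sum at 35 bits on overflow is an assumption, since the text only states the maximum output word length.

```python
class Accumulator:
    """Two-summand adder with a delay register holding the running sum."""

    def __init__(self, width=35):
        self.mask = (1 << width) - 1   # 35-bit output width taken from the text
        self.reg = 0                   # delay register, cleared at start

    def clock(self, product):
        """One clock cycle: add the incoming product to the stored sum."""
        self.reg = (self.reg + product) & self.mask  # wrap-around is assumed
        return self.reg


acc = Accumulator()
for a, b in [(3, 4), (5, 6), (7, 8)]:
    total = acc.clock(a * b)
print(total)  # prints 98
```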

Figure 8. Accumulator Unit

F. Reconfiguration Control Logic

The RCL gives out control signals that the modules use to determine the functionality of the system and the word length and type of the operands, and to control the pipeline stages. The control signals are as follows:
1) recon1 and recon2 determine the word length of the operands. For 4-bit operands recon1=0 and recon2=0; for 8-bit operands recon1=1 and recon2=0; for 16-bit operands recon1=1 and recon2=1.
2) recon11 and recon12 determine the functionality of the whole system, that is, whether it works as a multiplier, an adder or a multiply-accumulate unit. It is an adder when recon12=1; it is a multiplier when recon12=0 and recon11=0; it is a MAC when recon12=0 and recon11=1.
3) recon5 determines whether the data is interpreted as signed or unsigned. When recon5=1 it is a signed number, otherwise an unsigned number.
4) recon7 and recon8 are the two pipeline stage control signals. The pipeline stage in an 8x8 multiplier is enabled by recon7, whereas the one in the 16x16 multiplier is enabled by recon8.
These control signals can be set depending on the functionality, data type and throughput required of the system.
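The encoding of the control signals listed above can be summarised in a small decoder. The Python sketch below is illustrative only: the function name, the returned dictionary keys and the decision to reject the combination recon1=0, recon2=1 (which the text does not define) are assumptions. The pipeline controls recon7 and recon8 are not modelled.

```python
def decode_config(recon1, recon2, recon11, recon12, recon5):
    """Translate control bits into a readable configuration (hypothetical helper)."""
    word_lengths = {(0, 0): 4, (1, 0): 8, (1, 1): 16}
    key = (recon1, recon2)
    if key not in word_lengths:
        # recon1=0, recon2=1 is not described in the text; rejected here by assumption
        raise ValueError("undefined word length encoding")

    if recon12 == 1:
        mode = "adder"
    elif recon11 == 0:
        mode = "multiplier"
    else:
        mode = "mac"

    return {
        "word_length": word_lengths[key],
        "mode": mode,
        "signed": recon5 == 1,
    }


print(decode_config(recon1=1, recon2=0, recon11=1, recon12=0, recon5=1))
# prints {'word_length': 8, 'mode': 'mac', 'signed': True}
```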


III. RESULTS AND DISCUSSION

The units of the MAC have been designed using Verilog HDL, simulated using ModelSim and synthesized using Cadence RTL Compiler for 180 nm technology. Fig. 9 shows the simulation results for all the configurations. The 16x16 multiplier shows a latency of 5 clock cycles, whereas the 8x8 multiplier shows a latency of 3 clock cycles, after which a result is available every clock cycle. The adder shows a latency of 6 clock cycles. The MAC takes an additional clock cycle after multiplication.

Figure 9. Simulation result of reconfigurable multiplier, adder and MAC

To compare the MAC architecture with an alternative, we replaced the innermost module of the addition unit, i.e. the Carry Save Adder block (Csv-Block) that adds four 4-bit numbers, with a Ripple Carry Adder (RCA) [19-20]. Both architectures were compared for maximum clock frequency, area utilization and power consumption in 180 nm technology; the values are tabulated below.

TABLE I. COMPARISON OF THE CARRY SAVE AND RIPPLE CARRY MAC IMPLEMENTATIONS

From the table it is evident that the implementation using the ripple carry adder shows a significant improvement in area and power consumption over the other. However, the carry save adder gives about a 2% improvement in speed over its counterpart. Fig. 10 shows the physical view of the architecture implemented using Cadence SoC Encounter.
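The relative differences between the two implementations can be worked out directly from the values reported in Table I. The Python sketch below only restates that arithmetic; the variable and function names are hypothetical, and the area figures are used without assuming a unit.

```python
# Values reported in Table I for the 180 nm implementations
carry_save = {"fmax_mhz": 344.81, "area": 17028, "power_mw": 17.134}
ripple_carry = {"fmax_mhz": 338.04, "area": 16612, "power_mw": 11.102}


def percent_diff(value, reference):
    """Difference of `value` from `reference`, as a percentage of `reference`."""
    return 100.0 * (value - reference) / reference


# Speed gain of carry save over ripple carry
speed_gain = percent_diff(carry_save["fmax_mhz"], ripple_carry["fmax_mhz"])
# Savings of ripple carry relative to carry save
area_saving = -percent_diff(ripple_carry["area"], carry_save["area"])
power_saving = -percent_diff(ripple_carry["power_mw"], carry_save["power_mw"])

print(f"speed gain:   {speed_gain:.1f} %")    # prints "speed gain:   2.0 %"
print(f"area saving:  {area_saving:.1f} %")   # prints "area saving:  2.4 %"
print(f"power saving: {power_saving:.1f} %")  # prints "power saving: 35.2 %"
```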

IV. CONCLUSION

A reconfigurable MAC was implemented in two different architectures. The system is reconfigurable with respect to the word length of the operands, the data type and the functionality. Reconfigurability makes variable speed and throughput achievable. The architecture incorporating the Ripple Carry Adder was more area and power efficient, whereas the one incorporating the Carry Save Adder performed better. Hence, it is concluded that the Ripple Carry Adder can be used when the design requires less area and power, and the Carry Save Adder can be used for high data rate applications.

Figure 10. Physical view of the Reconfigurable MAC

REFERENCES

[1] Kim, Yoonjin, et al. "Resource sharing and pipelining in coarse-grained reconfigurable architecture for domain-specific optimization." Design, Automation and Test in Europe, 2005. Proceedings. IEEE, 2005.

[2] Chun-Hsian Huang, Pao-Ann Hsiung, "Hardware Resource Virtualization for Dynamically Partially Reconfigurable Systems", IEEE Embedded Systems Letters, Volume 1, Issue 1, 2009.

[3] Ligon, W.B. III, Ramachandran, U., "Exploration of reconfigurable architectures: an empirical approach", 3rd Symposium on the Frontiers of Massively Parallel Computation, 1990 Proceedings.

[4] Shaolei Quan, Qiang Qiang, and Chin-Long Wey, “A Novel Reconfigurable Architecture of Low-Power Unsigned Multiplier for Digital Signal Processing”, IEEE

[5] Kim, Suhwan, and Marios C. Papaefthymiou. "Reconfigurable low energy multiplier for multimedia system design." VLSI, 2000. Proceedings. IEEE Computer Society Workshop on. IEEE, 2000.

[6] S. Perri, P. Corsonello and G. Cocorullo, “64-bit reconfigurable adder for low power media processing”, Electronics Letters, 25th April 2002, Vol. 38, No. 9.

[7] Pratap Kumar Dakua, Anamika Sinha, Shivdhari & Gourab, “Hardware Implementation of MAC Unit”, International Journal of Electronics Communication and Computer Engineering, Volume 3, Issue 1, NCRTCST, Pages 79-82

[8] Hong, Sangjin, and S-S. Chin. "Reconfigurable embedded MAC core design for low-power coarse-grain FPGA." Electronics Letters 39, no. 7 (2003): 606-608.

[9] K. Tatas, G. Koutroumpezis, D. Soudris, A. Thanailakis, “Architecture design of a coarse-grain reconfigurable multiply-accumulate unit for data-intensive applications”, INTEGRATION, the VLSI journal 40 (2007) 74–93.

[10] Wey, Chin-Long, and Jin-Fu Li. "Design of reconfigurable array multipliers and multiplier-accumulators." Circuits and Systems, 2004. Proceedings. The 2004 IEEE Asia-Pacific Conference on. Vol. 1. IEEE, 2004.

[11] Prof. Loh, “Carry-Save Addition”, CS3220 - Processor Design - Spring 2005 February 2, 2005

[12] Ying Li, Jie Chen, “A Reconfigurable Architecture of a High Performance 32-bit MAC Unit for Embedded DSP”, ASIC Proceedings of 5th International Conference on ASIC, Volume 2, Pages 1285-1288, 2003.

[13] Kamal Rajagopalan, Sutton Peter, “A Flexible Multiplication Unit for an FPGA Logic Block”, The 2001 IEEE International Symposium on Circuits and Systems, Volume 4, Pages 546 – 549, 2001.

[14] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, Oxford University Press, 2000.

Module              Technology  Max. Clock Frequency (MHz)  Area (µm²)  Power (mW)
MAC (Carry Save)    180 nm      344.81                      17028       17.134
MAC (Ripple Carry)  180 nm      338.04                      16612       11.102

[15] Shanthala S, Cyril Prasanna Raj, Dr.S.Y.Kulkarni, “Design and VLSI Implementation of Pipelined Multiply Accumulate Unit”, Second International Conference on Emerging Trends in Engineering and Technology, ICETET-09.

[16] R.P.P. Singh, Parveen Kumar, Balwinder Singh, “Performance Analysis of Fast Adders using VHDL”, Advances in Recent Technologies in Communication and Computing, ARTCom '09, Pages 189 – 193, 2009.

[17] A.F. Tenca, “Multi-operand Floating-Point Addition”, 19th IEEE Symposium on Computer Arithmetic, ARITH 2009, Pages 161 – 168, 2009

[18] Luigi Dadda, Vincenzo Piuri, “Pipelined Adders”, IEEE Transactions on Computers, Vol. 45, No. 3, March 1996.

[19] Nagaraj Y, Shrinivas K, Veeresh K, Veeresh A, Madhu Patil, Dr.Chirag Sharma, “FPGA Implementation of Different Adder Architectures”, International Journal of Emerging Technology and Advanced Engineering, ISSN 2250-2459, Volume 2, Issue 8, August 2012.

[20] Hamid M. Kamboh, Shoab A. Khan, “FPGA Implementation of Fast Adder”, 7th International Conference on Computing and Convergence Technology (ICCCT), Pages 1324 – 1327, 2012

