Processor Array Architectures for Scalable Radix 4 Montgomery Modular Multiplication Algorithm
Atef Ibrahim, Fayez Gebali, Senior Member, IEEE, Hamed Elsimary, and Amin Nassar
Abstract—This paper presents a systematic methodology for exploring possible processor arrays of the scalable radix 4 Montgomery modular multiplication algorithm. In this methodology, the algorithm is first expressed as a regular iterative expression; then the algorithm data dependence graph and a suitable affine scheduling function are obtained. Four possible processor arrays are obtained and analyzed in terms of speed, area, and power consumption. To reduce power consumption, we applied low power techniques for reducing the glitches and the Expected Switching Activity (ESA) of high fan-out signals in our processor array architectures. The resulting processor arrays are compared to other efficient ones in terms of area, speed, and power consumption.
Index Terms—Processor array, Montgomery multiplication, scalability, cryptography, secure communications, low power modular multipliers.
1 INTRODUCTION
MANY cryptography applications, such as the encryption/decryption operations of the RSA algorithm [1], the Digital Signature Standard [2], the Diffie-Hellman key exchange algorithm [3], and elliptic curve cryptography [4], make extensive use of modular multiplication and modular exponentiation. The modular exponentiation operation applies the modular multiplication operation repeatedly [5], [6], [7], [8], [9]. Therefore, the performance of any cryptography application critically depends on the efficiency of the modular multiplication operation.
There are several approaches for computing the modular multiplication operation. The most efficient is the Montgomery modular multiplication algorithm [10], [11]. Its main advantage over ordinary modular multiplication algorithms is that the modulus reduction of the partial product is done by shift operations, which are easy to implement in hardware.
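The shift-based reduction is easiest to see in a small software model. The following Python sketch (our illustration, not the paper's hardware) implements the classic bit-serial radix-2 Montgomery loop: each iteration adds one multiplier bit's partial product, adds M once if needed to make the sum even, and shifts right, so the modulus reduction never divides by M. The function computes A·B·2^(−n) mod M.

```python
def mont_mul_radix2(a, b, m, n):
    """Bit-serial Montgomery multiplication: returns a*b*2^(-n) mod m.

    Requires m odd and a, b < m <= 2^n. The reduction per iteration is
    only one conditional addition of m and a right shift -- no division.
    """
    s = 0
    for j in range(n):
        s += ((a >> j) & 1) * b      # add partial product a_j * B
        if s & 1:                    # make s even (m is odd)
            s += m
        s >>= 1                      # exact division by the radix 2
    return s - m if s >= m else s    # final correction: s < 2m before it
```

To compute a true modular product with this routine, the operands are first mapped into the Montgomery domain (multiplied by 2^n mod M) and the result is mapped back, which is why the algorithm pays off when many multiplications are chained, as in exponentiation.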
Several papers in the literature have addressed the hardware implementation of Montgomery modular multipliers for large operands of thousands of bits. The most important of them are those published by A. Tenca, C. Koc, T. Todorov, E. Savas, and M. Ciftcibasi [12], [13], [14], [15], who introduced what is called the word-based (scalable) Montgomery multiplication algorithm and extended it to high-radix implementations. However, they did not consider their hardware from the low power consumption point of view.
The goal of this work is to present a systematic technique for exploring possible processor arrays for the scalable radix 4 Montgomery modular multiplication algorithm and to compare these processor arrays with other efficient ones in terms of speed, area, and power consumption.
This paper is organized as follows: Section 2 presents the Scalable Multiple-Word Radix 4 Montgomery Multiplication (MWR4MM) algorithm with recoding [16]. Section 3 presents the systematic methodology we employed to design the processor array architectures. Section 4 describes several techniques to decrease power dissipation. Section 5 compares the resulting processor arrays to other recent efficient ones in terms of area, speed, and power consumption. Finally, Section 6 concludes the paper.
2 MWR4MM ALGORITHM
The notations used in this paper are as follows:
. M: modulus.
. m_j: a single bit of M at position j.
. A: multiplier operand.
. a_j: a single bit of A at position j.
. B: multiplicand operand.
. n: operand size (in number of bits).
. R: a constant (called the Montgomery parameter), R = 2^n.
. qa_j: coefficient that determines the multiple of the multiplicand B (qa_j × B).
. qm_j: coefficient that determines the multiple of the modulus M (qm_j × M).
. S: intermediate partial product, or final result of the modular multiplication.
. w: word size (in number of bits) of B, M, or S.
1142 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 22, NO. 7, JULY 2011
. A. Ibrahim and H. Elsimary are with the Department of Microelectronics, Electronics Research Institute, Dokki, Cairo 12622, Egypt. E-mail: [email protected], [email protected].
. F. Gebali is with the Department of Electrical and Computer Engineering, University of Victoria, Victoria, BC V8W 3P6, Canada. E-mail: [email protected].
. A. Nassar is with the Electronics and Electrical Communications Department, Cairo University, El-Giza, Cairo 12613, Egypt. E-mail: [email protected].
Manuscript received 6 June 2009; revised 14 Mar. 2010; accepted 1 July 2010; published online 8 Nov. 2010. Recommended for acceptance by S.-Q. Zhang. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPDS-2009-06-0256. Digital Object Identifier no. 10.1109/TPDS.2010.196.
1045-9219/11/$26.00 © 2011 IEEE. Published by the IEEE Computer Society.
. e = ⌈n/w⌉: number of words in B, M, or S.
. Ca, Cb: carry bits.
. (B^(e−1), ..., B^(1), B^(0)): word vector of B.
. (M^(e−1), ..., M^(1), M^(0)): word vector of M.
. (S^(e−1), ..., S^(1), S^(0)): word vector of S.
. S^(i)_(k−1..0): bits k−1 down to 0 of the ith word of S.
. S^(i)_(j(k−1..0)): bits k−1 down to 0 of the ith word of S at iteration j.
Algorithm 1 shows the MWR4MM algorithm. This algorithm is an extension of the Multiple-Word High-Radix (radix 2^k) Montgomery Multiplication (MWR2kMM) algorithm presented in [14], but it uses a recoding scheme to recode qm_j [16].
Algorithm 1. Scalable MWR4MM algorithm
1: S = 0
2: for j = 0 to ⌈n/2⌉ − 1 do
3:   qa_j = Booth(a_(2j+1..2j−1)) = Booth(A_j)
4:   (Ca, S^(0)) = S^(0) + (qa_j × B)^(0)
5:   qm_j = Montg(S^(0)_(1..0) × (4 − (M^(0)_(1..0))^(−1)) mod 4)
6:   (Cb, S^(0)) = S^(0) + (qm_j × M)^(0)
7:   for i = 1 to e − 1 do
8:     (Ca, S^(i)) = S^(i) + (qa_j × B)^(i)
9:     (Cb, S^(i)) = S^(i) + (qm_j × M)^(i)
10:    S^(i−1) = (S^(i)_(1..0), S^(i−1)_(w−1..2))
11:  end for
12:  Ca = Ca or Cb
13:  S^(e−1) = signext(Ca, S^(e−1)_(w−1..2))
14: end for
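As a functional reference for the radix-4 recurrence, the following Python model (our illustrative sketch, not the hardware) executes the main loop at full precision: per iteration it consumes one base-4 digit qa_j of A and one Montgomery quotient digit qm_j, then divides by the radix. The Booth recoding of qa_j and the word-serial scan over i, which are the point of the scalable hardware algorithm, are deliberately omitted; only the arithmetic is modeled.

```python
def mwr4mm_model(a, b, m, n):
    """Radix-4 Montgomery multiplication model: a*b*4^(-ceil(n/2)) mod m.

    High-level model of the radix-4 loop: qa_j is the plain base-4 digit
    of A (no Booth recoding) and qm_j = S * (-M^(-1) mod 4) mod 4 is the
    quotient digit that makes S divisible by 4. Requires m odd, a, b < m.
    """
    steps = (n + 1) // 2              # ceil(n/2) radix-4 digits
    m_inv = (-pow(m, -1, 4)) % 4      # -M^(-1) mod 4
    s = 0
    for j in range(steps):
        qa = (a >> (2 * j)) & 3       # digit of A
        s += qa * b                   # S = S + qa_j * B
        qm = (s * m_inv) % 4          # quotient digit
        s += qm * m                   # S = S + qm_j * M, now 4 | S
        s >>= 2                       # exact division by the radix 4
    return s - m if s >= m else s
```

The final conditional subtraction corresponds to the correction step that appears as line 20 of Algorithm 2.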
3 A SYSTEMATIC METHODOLOGY FOR PROCESSOR ARRAY DESIGN
Systematic methodologies for designing processor arrays allow design space exploration to optimize performance according to certain specifications while satisfying design constraints. Several methodologies were proposed earlier [17], [18], [19], [20]. However, most of these methodologies are not able to deal with algorithms that have more than two dimensions. The second author proposed a systematic methodology that deals with algorithms of arbitrary dimensions [20]: a formal algebraic procedure for processor array design starting from a Regular Iterative Algorithm (RIA) for a 3D digital filter, which gives rise to a dependency graph (DG) in 6D space. In this work, we use this formal technique to develop processor arrays for the scalable radix 4 Montgomery modular multiplication algorithm. We also extend the methodology by using nonlinear processor projection operations.
3.1 Expressing the Algorithm as a Regular Iterative Algorithm
To develop a processor array, we must first be able to describe the MWR4MM algorithm using recursions that convert the algorithm into an RIA. We can rewrite the MWR4MM algorithm, Algorithm 1, as Algorithm 2, where j represents the iteration index and i represents the word index.
Algorithm 2. Scalable MWR4MM algorithm as RIA
1: for i = 0 to e do "initialization"
2:   S^(i)_(−1) = 0
3: end for
4: for j = 0 to ⌈n/2⌉ − 2 do
5:   S^(e)_j = 0
6: end for
7: for j = 0 to ⌈n/2⌉ − 1 do "main algorithm"
8:   qa_j = Booth(a_(2j+1..2j−1)) = Booth(A_j)
9:   (Ca, S^(0)_j) = S^(0)_(j−1) + (qa_j × B)^(0)
10:  qm_j = Montg(S^(0)_(j(1..0)) × (4 − (M^(0)_(1..0))^(−1)) mod 4)
11:  (Cb, S^(0)_j) = S^(0)_j + (qm_j × M)^(0)
12:  for i = 1 to e − 1 do
13:    (Ca, S^(i)_j) = S^(i)_(j−1) + (qa_j × B)^(i)
14:    (Cb, S^(i)_j) = S^(i)_j + (qm_j × M)^(i)
15:    S^(i−1)_j = (S^(i)_(j(1..0)), S^(i−1)_(j(w−1..2)))
16:  end for
17:  Ca = Ca or Cb
18:  S^(e−1)_j = signext(Ca, S^(e−1)_(j(w−1..2)))
19: end for
20: if S ≥ M then S = S − M
21: end if
3.2 Obtaining the Algorithm Dependency Graph
The MWR4MM algorithm, Algorithm 2, can easily be defined on a 2D domain since there are two indices (j, i). The DG is shown in Fig. 1. The computation domain is the convex hull in the 2D space where the algorithm operations are defined, as indicated by the gray and black circles in the 2D plane [20], [21]. From Fig. 1, we notice that the black circle operation corresponds to the computation of steps 8, 9, 10, and 11, and the gray circle operation corresponds to the computation of steps 13, 14, and 15 of Algorithm 2. Also, from this figure,
Fig. 1. MWR4MM algorithm dependency graph for n = 10 and w = 2.
we notice that the input variables B^(i) and M^(i) are represented by horizontal lines, the variables qa_j and qm_j are represented by vertical lines, and the output variable S_j is represented by vertical lines.
3.3 Data Scheduling
Pipelining or broadcasting the variables of an algorithm is determined by the choice of a timing function that assigns a time value to each node in the DG. A simple but useful timing function is an affine scheduling function of the form [20]
t(p) = Cp − c,  (1)
where the function t(p) associates a time value t with a point p in the DG. The value of c is an integer chosen to ensure that only positive time index values are obtained, and the row vector C = [c1 c2] is the scheduling vector. The affine scheduling function must satisfy several conditions. From Fig. 1, we observe that in each column, the input of each gray node depends on the output of the black node of the same column; thus, we can write
t(p(j, 0)) < t(p(j, i))  for all 0 ≤ j ≤ ⌈n/2⌉ − 1 and 1 ≤ i ≤ e.  (2)
Applying our scheduling function (1) to this inequality, we get
[c1 c2] [j 0]^t < [c1 c2] [j i]^t,  (3)
jc1 < jc1 + ic2,  (4)
which can be simplified to
c2 > 0.  (5)
Similarly, from Fig. 1, we observe that the output variable S^(i)_j depends on the previous output value S^(i+1)_(j−1); thus, we can write
[c1 c2] [j i]^t > [c1 c2] [j−1 i+1]^t,  (6)
jc1 + ic2 > (j − 1)c1 + (i + 1)c2,  (7)
which can be simplified to
c2 < c1;  (8)
hence, (5) and (8) can be merged as
0 < c2 < c1.  (9)
From (9), there are many solutions for C; the most reasonable and simplest one is
C = [2 1].  (10)
If we want to pipeline a variable whose nullvector is ν, we must have
Cν^t ≠ 0,  (11)
where ν is the nullvector of the variable dependence matrix [20]. On the other hand, if we want to broadcast a variable whose nullvector is ν, we must have [20]
Cν^t = 0.  (12)
To study the timing of the variables B^(i), M^(i), A_j, and S^(i)_j, we first find their nullvectors:
ν_B = [1 0],  (13)
ν_M = [1 0],  (14)
ν_A = [0 1],  (15)
ν_S = [−1 1].  (16)
The product of C with each of these nullvectors gives
Cν_B^t = 2,  (17)
Cν_M^t = 2,  (18)
Cν_A^t = 1,  (19)
Cν_S^t = −1;  (20)
therefore, all input and output variables will be pipelined.
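The pipelining decision in (17)-(20) can be checked mechanically. The short script below (our illustration) recomputes C·ν^t for the four nullvectors under C = [2 1] and confirms that none of the products is zero, so every variable is pipelined rather than broadcast.

```python
# Pipelining test Cν^t != 0 (Eq. (11)) for the nullvectors of B, M, A, S
# under the scheduling vector C = [2 1] chosen in Eq. (10).
C = (2, 1)

nullvectors = {
    "B": (1, 0),
    "M": (1, 0),
    "A": (0, 1),
    "S": (-1, 1),
}

for name, nu in nullvectors.items():
    dot = C[0] * nu[0] + C[1] * nu[1]   # C . nu^t
    verdict = "pipelined" if dot != 0 else "broadcast"
    print(f"C.nu_{name}^t = {dot:+d} -> {verdict}")
```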
3.4 DG Node Projection
The projection operation is a many-to-one function that maps several nodes of the DG onto a single node of the resulting processor array. Thus, several operations in the DG are mapped to a single PE. The projection operation allows hardware economy by multiplexing several operations in the DG on a single PE. The second author [20] explained how to perform the projection operation using a projection matrix P. To obtain the projection matrix, we need to define a desired projection direction d. The vector d belongs to the null space of P. Since we are dealing with a 2D DG, matrix P is a row vector and d is a column vector [20]. A valid projection direction d must satisfy the inequality [20]
Cd ≠ 0.  (21)
In the following, we discuss design space explorations based on the timing function C obtained in (10). There are many projection vectors that satisfy (21) for the scheduling function in (10). For simplicity, we choose four of them as follows:
d1 = [1 0]^t,  (22)
d2 = [0 1]^t,  (23)
d3 = [2 1]^t,  (24)
d4 = [1 1]^t.  (25)
The corresponding projection matrices are given by
P1 = [0 1],  (26)
P2 = [1 0],  (27)
P3 = [−1 2],  (28)
P4 = [−1 1].  (29)
Our processor design space now allows four processor array configurations, one for each projection vector, under the timing function C. In the following sections, we study the processor array associated with each design option.
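These choices can be sanity-checked numerically: each d must satisfy Cd ≠ 0 from (21), and each P must be orthogonal to its d, since d spans the null space of P. A small script (ours, for illustration):

```python
# Validity check for the four projection directions of Eqs. (22)-(25)
# and their projection matrices of Eqs. (26)-(29), with C = [2 1]:
# Cd != 0 (Eq. (21)) and P·d = 0 (d spans the null space of P).
C = (2, 1)

designs = {                      # name: (d as column vector, P as row vector)
    "d1/P1": ((1, 0), (0, 1)),
    "d2/P2": ((0, 1), (1, 0)),
    "d3/P3": ((2, 1), (-1, 2)),
    "d4/P4": ((1, 1), (-1, 1)),
}

for name, (d, P) in designs.items():
    Cd = C[0] * d[0] + C[1] * d[1]
    Pd = P[0] * d[0] + P[1] * d[1]
    assert Cd != 0 and Pd == 0, name   # valid direction, P orthogonal to d
    print(f"{name}: Cd = {Cd}, Pd = {Pd}")
```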
3.4.1 Design1: Using d1 = [1 0]^t
A point p = [j i]^t in the DG will be mapped by the projection matrix P1 = [0 1] onto the point
p′ = P1 p = i.  (30)
The resulting processor array corresponding to the projection matrix P1 consists of e + 1 PEs. Only z = ⌈(e + 1)/2⌉ PEs are active at each time step. Fig. 2 shows the processor activity for the case n = 10 and w = 2, where the black nodes represent active PEs and the white nodes represent idle PEs. Since at most z PEs are active at a given time step, the PEs are not well utilized. To improve PE utilization, we need to reduce the number of processors. We note from Fig. 2 that PE_2i and PE_(2i+1) are active at nonoverlapping time steps. Thus, each pair of PEs (PE_2i and PE_(2i+1)) can be mapped to a single PE without causing time conflicts. This can be achieved using the following nonlinear projection operation:
l = ⌊i/2⌋ mod z.  (31)
The resulting processor array for different values of n and w is shown in Fig. 3. This processor array is new and was not reported before. The processor array now consists of z PEs and the calculation latency is 2⌈n/2⌉ clock cycles. Input words B^(2i), B^(2i+1), M^(2i), and M^(2i+1) are allocated to processor PE_i, and input A_j is allocated to processor PE_0. qa_j and qm_j are generated inside PE_0 and pipelined to the PEs with higher indices. The intermediate output words S^(i)_j of each PE are pipelined between adjacent PEs. A tristate buffer at the output of each PE ensures that it is the only output fed to the output bus. The PE hardware implementation details for Design1 are given in the supplementary material provided with this paper, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPDS.2010.196.
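The claim that the folding in (31) causes no time conflicts can be verified exhaustively for the example DG. The sketch below (our check, with the DG extents read off Figs. 1 and 2 for n = 10 and w = 2) schedules every node at t(p) = 2j + i and asserts that no folded PE receives two nodes at the same time step.

```python
import math

# Exhaustive conflict check for Design1's folding l = floor(i/2) mod z
# (Eq. (31)) under the affine schedule t(p) = 2j + i with C = [2 1].
# DG extents assumed from Fig. 1: j = 0..ceil(n/2)-1, i = 0..e.
n, w = 10, 2
e = math.ceil(n / w)                 # e = 5 words
z = math.ceil((e + 1) / 2)           # z = 3 PEs after folding

schedule = {}                        # (PE index l, time t) -> DG node
for j in range(math.ceil(n / 2)):
    for i in range(e + 1):
        t = 2 * j + i                # t(p) = Cp
        l = (i // 2) % z             # nonlinear projection, Eq. (31)
        assert (l, t) not in schedule, "time conflict!"
        schedule[(l, t)] = (j, i)

print(f"{len(schedule)} DG nodes mapped onto {z} PEs with no conflicts")
```

The check succeeds because PE_2i and PE_(2i+1) fire at time steps of opposite parity, exactly the observation made from Fig. 2.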
3.4.2 Design2: Using d2 = [0 1]^t
A point p = [j i]^t in the DG will be mapped by the projection matrix P2 = [1 0] onto the point
p′ = P2 p = j.  (32)
The resulting processor array corresponding to the projection matrix P2 consists of ⌈n/2⌉ PEs. Only z = ⌈(e + 1)/2⌉ PEs are active at each time step. Fig. 4 shows the processor activity for the case n = 10 and w = 2, where the black nodes represent active PEs and the white nodes represent idle PEs. Since at most z PEs are active at a given time step, the PEs are not well utilized. To improve PE utilization, we need to reduce the number of processors. We note from Fig. 4 that PE_j and PE_(j+z) are active at nonoverlapping time steps. Thus, each pair of PEs (PE_j and PE_(j+z)) can be mapped to a single PE without causing time conflicts. This can be achieved using the following nonlinear projection operation:
l = j mod z.  (33)
The resulting processor array, for different values of n and w, after applying the above modulo operation on the activity graph in Fig. 4, is shown in Fig. 5. This processor array is identical to the one reported in [23], [24]. The processor array now consists of z PEs and the calculation latency is 2⌈n/2⌉ clock cycles. Input words B^(i) and M^(i) are pipelined through the PEs and input A_j is allocated to each PE. qa_j and qm_j are generated and used inside each PE. The intermediate output words S^(i)_j of each PE are pipelined to the next PE with higher
Fig. 2. Processor activity at different time steps for d1 = [1 0]^t, n = 10, and w = 2.
Fig. 3. Processor array for d1 = [1 0]^t for different values of n and w.
Fig. 4. Processor activity at different time steps for d2 = [0 1]^t, n = 10, and w = 2.
index. The PE hardware implementation details for Design2 are given in the supplementary material provided with this paper, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPDS.2010.196.
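Design2's folding admits the same exhaustive verification. This illustrative script asserts that l = j mod z in (33) never assigns two DG nodes with equal schedule time t = 2j + i to the same folded PE, using the Fig. 1 DG extents for n = 10 and w = 2.

```python
import math

# Exhaustive conflict check for Design2's folding l = j mod z (Eq. (33))
# under the schedule t(p) = 2j + i, for the n = 10, w = 2 example DG.
n, w = 10, 2
e = math.ceil(n / w)                 # e = 5
z = math.ceil((e + 1) / 2)           # z = 3

busy = set()
for j in range(math.ceil(n / 2)):
    for i in range(e + 1):
        slot = (j % z, 2 * j + i)    # (folded PE index, time step)
        assert slot not in busy, "time conflict!"
        busy.add(slot)

print(f"no conflicts: {len(busy)} node slots on {z} PEs")
```

Here PE_j and PE_(j+z) never collide because folding j by z shifts the activity window by 2z time steps, more than the e + 1 steps any column occupies.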
3.4.3 Design3: Using d3 = [2 1]^t
A point p = [j i]^t in the DG will be mapped by the projection matrix P3 = [−1 2] onto the point
p′ = P3 p = −j + 2i.  (34)
The resulting processor array corresponding to the projection matrix P3 consists of 2e + ⌈n/2⌉ + 1 PEs, after adding a fixed increment to all PE indices to ensure nonnegative PE index values [20]. Only z = ⌈(e + 1)/2⌉ PEs are active at each time step. Fig. 6 shows the processor activity for the case n = 10 and w = 2, where the black nodes represent active PEs and the white nodes represent idle PEs. Since at most z PEs are active at a given time step, the PEs are not well utilized. To improve PE utilization, we need to reduce the number of processors. We notice from Fig. 6 that all PEs whose indices are given by (34) can be mapped to PEs with indices l as
l = (−j + 2i) mod z  (35)
without any timing conflicts. This statement is true as long as the inequality Cd3 ≠ z is satisfied. Applying the above nonlinear projection operation and using the activity graph of Fig. 6 produces the processor array shown in Fig. 7. For different values of z, the nonlinear mapping of (35) will produce different processor arrays. This is due to the mixing of the indices i and j in the nonlinear mapping operation; this was not the case for Designs 1 and 2. In the processor array of Fig. 7, input words B^(i) and M^(i) are loaded into the PEs during the first e + 1 clock cycles and then pipelined between them. Input A_j is allocated to each PE. qa_j and qm_j are generated inside each PE and pipelined to the next PEs. The intermediate output words S^(i)_j of each PE are pipelined to the next PE. A tristate buffer at the output of each PE ensures that it is the only output fed to the output bus.
3.4.4 Design4: Using d4 = [1 1]^t
A point p = [j i]^t in the DG will be mapped by the projection matrix P4 = [−1 1] onto the point
p′ = P4 p = −j + i.  (36)
The resulting processor array corresponding to the projection matrix P4 consists of e + ⌈n/2⌉ + 1 PEs, after adding a fixed increment to all PE indices to ensure nonnegative PE index values [20]. Only z = ⌈(e + 1)/2⌉ PEs are active at each time step. Fig. 8 shows the processor array activity for the case n = 14 and w = 2, where the black nodes represent active PEs and the white nodes represent idle PEs. Since at most z PEs are active at a given time step, the PEs are not well utilized. To improve PE utilization, we need to reduce the number of processors. We note from Fig. 8 that each PE is active once every four time steps. Thus, all PEs whose indices are given by (36) can be mapped to PEs with indices l as
l = (−j + i) mod z  (37)
without any timing conflicts. This statement is true as long as the inequality Cd4 ≠ z is satisfied. When n = 10 and w = 2, we get z = 3; the inequality is then violated, so all cases that have z = 3 must be excluded. In practice, z > 3 and the inequality is always satisfied. Applying the above nonlinear projection operation and using the activity graph of Fig. 8 produces the processor array shown in Fig. 9. For different values of z, the nonlinear mapping of (37) will produce different processor arrays. This is due to the mixing of the indices i and j in the nonlinear mapping operation. In the processor array of Fig. 9, input words B^(i) and M^(i) are loaded into the PEs during the first e + 1 clock cycles and then pipelined
Fig. 5. Processor array for d2 = [0 1]^t for different values of n and w.
Fig. 6. Processor activity at different time steps for d3 = [2 1]^t, n = 10, and w = 2.
Fig. 7. Processor array for d3 = [2 1]^t, n = 10, and w = 2.
between them. Input A_j is allocated to each PE. qa_j and qm_j are generated inside each PE and pipelined to the next PEs. The intermediate output words S^(i)_j of each PE are pipelined between the PEs. A tristate buffer at the output of each PE ensures that it is the only output fed to the output bus.
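Because (35) and (37) mix the indices i and j, their conflict-freedom is less obvious than for Designs 1 and 2, but it can still be checked exhaustively. The sketch below (our illustration) does so for n = 14 and w = 2, where z = 4, so C·d3 = 5 and C·d4 = 3 both differ from z as the text requires.

```python
import math

# Exhaustive conflict checks for the index-mixing foldings of Design3
# (Eq. (35)) and Design4 (Eq. (37)), schedule t(p) = 2j + i, DG extents
# j = 0..ceil(n/2)-1, i = 0..e as in Fig. 1. Case n = 14, w = 2 => z = 4.
n, w = 14, 2
e = math.ceil(n / w)                     # e = 7
z = math.ceil((e + 1) / 2)               # z = 4

for name, pe_index in [("Design3", lambda j, i: (-j + 2 * i) % z),
                       ("Design4", lambda j, i: (-j + i) % z)]:
    busy = set()
    for j in range(math.ceil(n / 2)):
        for i in range(e + 1):
            slot = (pe_index(j, i), 2 * j + i)   # (PE index, time step)
            assert slot not in busy, f"{name}: time conflict!"
            busy.add(slot)
    print(f"{name}: no conflicts on {z} PEs")
```

Rerunning the Design4 check with n = 10 (z = 3, so C·d4 = z) does produce a collision, matching the exclusion of the z = 3 case stated above.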
4 LOW POWER TECHNIQUES
In this section, we further improve the hardware of the previous designs to dissipate less power. The block diagrams of the modified first PE (PE_0) of Design1 and the modified PE of Design2, provided in the supplementary material (available on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPDS.2010.196), each have two circuit modules in charge of the Booth and Montgomery recodings. These modules comprise only combinational logic circuits [16]. The outputs of these modules have unbalanced path delays and consequently introduce glitches, which cause unnecessary dynamic power dissipation. Furthermore, the fan-outs of the glitchy signals are large, and this leads to significant dissipated power.
To reduce the glitching power dissipation, we insert latches and force the outputs of these modules (PP and MM) to pass through them. If all flip-flops and registers capture their inputs at the clock's rising edge, then the latches are transparent when the clock is in a low state. If the outputs of the two recoding modules reach their stable values before the clock's falling edge, none of the glitches can propagate to the fan-out modules driven by these outputs. We name these latches "glitch blockers" [25]. The glitch blockers are also very effective for reducing the glitches appearing in the Carry Save Adder (CSA), since they synchronize the arrival of PP and MM at the CSA's inputs. The same technique can be applied to Design3 and Design4.
The PP-Generator module (see the supplementary material, available on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPDS.2010.196) generates its output (PP) by modifying a word of B according to the Booth recoder's outputs, SEL_PP and EN_PP. Similarly, the MM-Generator module (see the supplementary material) generates its output (MM) by modifying a word of M according to the Montgomery recoder's outputs, SEL_MM and EN_MM. When PP is zeroed by EN_PP, the PP output of the PP-Generator does not depend on SEL_PP. Thus, keeping SEL_PP frozen at that time is effective for reducing power dissipation. The same reasoning applies to SEL_MM. We place two 1-bit flip-flops [25] and construct feedback loops for SEL_PP and SEL_MM to implement this idea. The same technique can be applied to Design3 and Design4.
The experimental results show that the power consumption of Design1 and Design2 before applying the low power techniques was 56.20 mW and 57.23 mW (at 10 MHz and 1.8 V), respectively. After applying the low power techniques, these values are reduced to 34.23 mW and 40.20 mW, respectively. Table 3 also shows the significance of applying the low power techniques to our designs. The results in this table show that Design1 has a significant gain in reducing power consumption over the other designs. The power reduction of Design1 ranges between 15 and 50 percent compared to the other designs reported here or in the literature.
5 DESIGN COMPARISON
In this section, we compare the designs developed in this paper with previous efficient Montgomery multiplier designs in terms of area, speed, and power consumption [14], [16], [26].
For Design1, we observe that the projection operation results in PE_0 performing more complex operations than the other PEs in the processor array. Hence, the remaining PEs have a simpler structure and perform fewer operations, consuming less power than PE_0. This results in simpler hardware for the array as a whole. Therefore, we expect Design1 to require fewer hardware resources and to consume less power.
For Designs 2, 3, and 4, however, all PEs perform identical operations, which are the same operations done by PE_0 in Design1. Hence, all PEs have identical hardware structures and nearly identical power
Fig. 9. Processor array for d4 = [1 1]^t, n = 14, and w = 2.
Fig. 8. Processor activity at different time steps for d4 = [1 1]^t, n = 14, and w = 2.
consumption. The only difference between the designs is the paths taken by the signals, which are coordinated by the control unit in each PE. Therefore, we expect higher hardware resource requirements and more power consumption compared to Design1. To illustrate these points, we implemented Design1 and Design2 as representatives of the two classes of processor arrays proposed in this work.
The estimated area and speed figures are generated by synthesizing the VHDL code of each design using the Leonardo synthesis tool from Mentor Graphics, with the xc3s1600e-5fg484 (Spartan-3E) FPGA as the target technology. For power estimation, we used the XPower Analyzer from Xilinx for the same target part.
5.1 Area Estimation
The area of each design depends mainly on two design parameters: the number of architecture stages (z) and the word size (w) of the operands. Table 1 compares the total area needed for the first two designs to that of other efficient Montgomery multiplier designs for n = 1024 and for different values of w and z. The area improvement of Design1 over the other compared designs is shown in the table. We conclude from Table 1 that Design1 has a significant gain in reducing area over the other designs for different values of w and z.
5.2 Time Estimation
The total computation time for each design is equal to the product of the number of clock cycles it takes and the clock period. The critical path delay determines the clock period and depends on the number of PEs (z) and the word size (w) of the operands. Equation (38) gives the total number of clock cycles needed for Design1, while (39) gives the total number of clock cycles needed for Design2 and all other compared designs (where k = 2 for radix 4 designs and k = 3 for radix 8 designs).
T_design1 = 2⌈n/2⌉ + 2(z − 1),  (38)
T_design2 = ⌈n/kz⌉(2z + 1) + e + 1.  (39)
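Equations (38) and (39) are easy to tabulate. The following illustrative script prints the cycle counts of both designs for n = 1024 at several word sizes; note that these are cycle counts only, since the actual computation time also depends on the synthesized clock period.

```python
import math

# Tabulate the clock-cycle counts of Eqs. (38) and (39) for n = 1024
# with radix 4 (k = 2), for several word sizes w.
n, k = 1024, 2
for w in (8, 16, 32, 64):
    e = math.ceil(n / w)                       # words per operand
    z = math.ceil((e + 1) / 2)                 # number of PEs
    t1 = 2 * math.ceil(n / 2) + 2 * (z - 1)    # Eq. (38), Design1
    t2 = math.ceil(n / (k * z)) * (2 * z + 1) + e + 1   # Eq. (39), Design2
    print(f"w={w:3d}  z={z:3d}  Design1: {t1} cycles  Design2: {t2} cycles")
```

For w = 32 (so z = 17) this gives 1056 cycles for Design1 versus 1118 for Design2, consistent with the small cycle-count gain of Design1 reported below.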
Table 2 compares the total computation time in μs needed for the first two designs to that of other efficient Montgomery multiplier designs for n = 1024 and for different values of w and z. The time improvement of Design1 over the other designs is shown in the table. We conclude from Table 2 that Design1 has a small gain in reducing the total computation time over other radix 4 designs and a moderate gain over radix 8 designs for different values of w and z.
5.3 Power Estimation
Table 3 compares the power consumption in mW of the first two designs, after applying the low power techniques, to that of other efficient Montgomery multiplier designs for n = 1024, w = 32, and z = 17. We notice from Table 3 that Design1 has a significant gain in reducing power consumption over Design2 after applying the low power techniques. This is because the low power techniques add hardware only to the first PE of Design1, while this hardware is replicated in every PE of Design2. This extra hardware adds area overhead and contributes to the increased power consumption of Design2 over Design1. The power improvement of Design1 over the other designs is also shown in the table. We conclude from Table 3 that Design1 has a significant gain in reducing the total power consumption over the other designs.
6 SUMMARY AND CONCLUSION
This paper presented a systematic technique to explore possible processor arrays for the scalable MWR4MM algorithm. This technique first converts the MWR4MM algorithm into a regular iterative algorithm (RIA). Having obtained the RIA, we were able to develop a data dependency graph, which allowed us to explore all possible data timing options. Earlier approaches did not explain how their designs were obtained, at best developed a single design, and did not allow for design space exploration. Four possible processor array structures were obtained. The Design2
TABLE 1. Total Area Comparison in Number of CLBs for n = 1024 and Different Values of w and z
TABLE 2. Total Computation Time (μs) for n = 1024
TABLE 3. Power Consumption (mW) for n = 1024, w = 32, and z = 17 (at 10 MHz and 1.8 V)
processor array is identical to the one obtained by Tawalbeh and Tenca [16] and has processor architectures similar to those of Design3 and Design4. Design1 is a novel design that was not reported before in the literature. To reduce power consumption, we applied effective low power techniques for reducing the glitches and the Expected Switching Activity (ESA) of high fan-out signals in each processor array. From the analysis in Section 5, we conclude that Design1 has better performance than the other designs in terms of area, speed, and power consumption.
REFERENCES
[1] R. Rivest, A. Shamir, and L. Adleman, "A Method for Obtaining Digital Signatures and Public-Key Cryptosystems," Comm. ACM, vol. 21, no. 2, pp. 120-126, Feb. 1978.
[2] Nat'l Inst. for Standards and Technology, "Digital Signature Standard (DSS)," Fed. Information Processing Standards Publication (FIPS PUB 186-2), Jan. 2000.
[3] W. Diffie and M. Hellman, "New Directions in Cryptography," IEEE Trans. Information Theory, vol. IT-22, no. 6, pp. 644-654, Nov. 1976.
[4] N. Koblitz, "Elliptic Curve Cryptosystems," Math. of Computation, vol. 48, no. 177, pp. 203-209, Jan. 1987.
[5] A. Menezes, Applications of Finite Fields. Kluwer Academic Publishers, 1993.
[6] B. Kaliski, C. Koc, and T. Acar, "Analyzing and Comparing Montgomery Multiplication Algorithms," IEEE Micro, vol. 16, no. 3, pp. 26-33, June 1996.
[7] T. Hamano, "O(n)-Depth Circuit Algorithm for Modular Exponentiation," Proc. IEEE 12th Symp. Computer Arithmetic, pp. 188-192, 1995.
[8] C. Paar and T. Blum, "Montgomery Modular Exponentiation on Reconfigurable Hardware," Proc. 14th IEEE Symp. Computer Arithmetic, pp. 70-77, 1999.
[9] J. Bajard, L. Didier, and P. Kornerup, "An RNS Montgomery Modular Multiplication Algorithm," IEEE Trans. Computers, vol. 47, no. 7, pp. 766-776, July 1998.
[10] P. Montgomery, "Modular Multiplication without Trial Division," Math. of Computation, vol. 44, no. 170, pp. 519-521, Apr. 1985.
[11] T. Yanik, E. Savas, and C. Koc, "Incomplete Reduction in Modular Arithmetic," Math. of Computation, vol. 149, no. 2, pp. 46-54, Mar. 2002.
[12] A. Tenca and C. Koc, "A Scalable Architecture for Modular Multiplication Based on Montgomery's Algorithm," IEEE Trans. Computers, vol. 52, no. 9, pp. 1215-1221, Sept. 2003.
[13] A. Tenca, E. Savas, and C. Koc, "A Design Framework for Scalable and Unified Architectures that Perform Multiplication in GF(p) and GF(2^m)," Int'l J. Computer Research, vol. 13, no. 1, pp. 68-83, 2004.
[14] T. Todorov, A. Tenca, and C. Koc, "High-Radix Design of a Scalable Modular Multiplier," Cryptographic Hardware and Embedded Systems, C. Koc, D. Naccache, and C. Paar, eds., pp. 189-205, Springer-Verlag, 2001.
[15] E. Savas, A. Tenca, M. Ciftcibasi, and C. Koc, "Multiplier Architectures for GF(p) and GF(2^n)," Proc. IEE Computers and Digital Techniques, vol. 151, no. 2, pp. 147-160, Mar. 2004.
[16] L. Tawalbeh and A. Tenca, "Radix-4 ASIC Design of a Scalable Montgomery Modular Multiplier Using Encoding Techniques," master's thesis, Oregon State Univ., 2002.
[17] S. Rao and T. Kailath, "Regular Iterative Algorithms and Their Implementation on Processor Arrays," Proc. IEEE, vol. 76, no. 3, pp. 259-269, Mar. 1988.
[18] S. Kung, VLSI Array Processors. Prentice-Hall, 1988.
[19] E. Abdel-Raheem, "Design and VLSI Implementation of Multirate Filter Banks," PhD dissertation, Univ. of Victoria, 1995.
[20] F. El-Guibaly and A. Tawfik, "Mapping 3D IIR Digital Filter onto Systolic Arrays," Multidimensional Systems and Signal Processing, vol. 7, no. 1, pp. 7-26, Jan. 1996.
[21] A. Refiq and F. Gebali, "Processor Array Architectures for Deep Packet Classification," IEEE Trans. Parallel and Distributed Systems, vol. 17, no. 3, pp. 241-252, Mar. 2006.
[22] H. Orup, "Simplifying Quotient Determination in High-Radix Modular Multiplication," Proc. 12th IEEE Symp. Computer Arithmetic, pp. 193-199, July 1995.
[23] G. Todorov and A. Tenca, "ASIC Design, Implementation and Analysis of a Scalable High-Radix Montgomery Multiplier," master's thesis, Oregon State Univ., 2000.
[24] C. Koc and A. Tenca, "A Word-Based Algorithm and Architecture for Montgomery Multiplication," Cryptographic Hardware and Embedded Systems, C. Koc, D. Naccache, and C. Paar, eds., pp. 94-108, Springer, 1999.
[25] H. Son and S. Oh, "Design and Implementation of Scalable Low-Power Montgomery Multiplier," Proc. Int'l Conf. Computer Design (ICCD '04), pp. 524-531, 2004.
[26] N. Pinckney and D. Harris, "Parallel High-Radix Montgomery Multipliers," Proc. 42nd Asilomar Conf. Signals, Systems and Computers, pp. 772-776, 2008.
Atef Ibrahim received the BSc degree in electronics engineering from Mansoura University, Egypt, in 1998, and the MSc degree in electronics and electrical communications from Cairo University, Egypt, in 2004. He is currently working toward the PhD degree in the Electronics and Electrical Communications Department of Cairo University, Egypt, and is a visiting student in the Electrical and Computer Engineering Department of the University of Victoria, Canada. His research interests include computer arithmetic, cryptography, and VLSI design.
Fayez Gebali received the BSc degree in electrical engineering (first class honors) from Cairo University, the BSc degree in mathematics (first class honors) from Ain Shams University, and the PhD degree in electrical engineering from the University of British Columbia, where he was a holder of an NSERC postgraduate scholarship. He is a professor of computer engineering at the University of Victoria. His research interests include multicore processors, computer communications, and computer arithmetic. He is a senior member of the IEEE.
Hamed Elsimary is a professor at the VLSI Department, Electronics Research Institute, Cairo, Egypt, and is currently on leave, working as a professor at the College of Computer Engineering and Science, King Saud University, Alkharj, Saudi Arabia. His research interests include low power circuit design and computer architecture.
Amin Nassar is a professor at the Electronics and Electrical Communications Department, Faculty of Engineering, Cairo University, Egypt. His research interests include computer-aided design, microprocessors and interfacing, and industrial electronics.