Processor Array Architectures for Scalable Radix 4 Montgomery Modular Multiplication Algorithm
Atef Ibrahim, Fayez Gebali, Senior Member, IEEE, Hamed Elsimary, and Amin Nassar
Abstract—This paper presents a systematic methodology for exploring possible processor arrays of the scalable radix 4 Montgomery modular multiplication algorithm. In this methodology, the algorithm is first expressed as a regular iterative expression; then the algorithm data dependence graph and a suitable affine scheduling function are obtained. Four possible processor arrays are obtained and analyzed in terms of speed, area, and power consumption. To reduce power consumption, we applied low power techniques for reducing the glitches and the Expected Switching Activity (ESA) of high fan-out signals in our processor array architectures. The resulting processor arrays are compared to other efficient ones in terms of area, speed, and power consumption.
Index Terms—Processor array, Montgomery multiplication, scalability, cryptography, secure communications, low power modular multipliers.
1 INTRODUCTION
MANY cryptography applications, such as the encryption/decryption operations of the RSA algorithm [1], the Digital Signature Standard [2], the Diffie-Hellman key exchange algorithm [3], and elliptic curve cryptography [4], make extensive use of modular multiplication and modular exponentiation. The modular exponentiation operation applies the modular multiplication operation repeatedly [5], [6], [7], [8], [9]. Therefore, the performance of any cryptography application critically depends on the efficiency of the modular multiplication operation.
There are several approaches for computing the modular multiplication operation. The most efficient is the Montgomery modular multiplication algorithm [10], [11]. Its main advantage over ordinary modular multiplication algorithms is that the modulus reduction of the partial product is done by shift operations, which are easy to implement in hardware.
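The shift-based reduction is easiest to see in a small software model. The following Python sketch (our illustration, not the paper's hardware) implements the classic bit-serial radix-2 Montgomery loop: each iteration adds one multiplier bit's partial product, adds M once if needed to make the sum even, and shifts right, so the modulus reduction never divides by M. The function computes A·B·2^(−n) mod M.

```python
def mont_mul_radix2(a, b, m, n):
    """Bit-serial Montgomery multiplication: returns a*b*2^(-n) mod m.

    Requires m odd and a, b < m <= 2^n. The reduction per iteration is
    only one conditional addition of m and a right shift -- no division.
    """
    s = 0
    for j in range(n):
        s += ((a >> j) & 1) * b      # add partial product a_j * B
        if s & 1:                    # make s even (m is odd)
            s += m
        s >>= 1                      # exact division by the radix 2
    return s - m if s >= m else s    # final correction: s < 2m before it
```

To compute a true modular product with this routine, the operands are first mapped into the Montgomery domain (multiplied by 2^n mod M) and the result is mapped back, which is why the algorithm pays off when many multiplications are chained, as in exponentiation.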
Several papers in the literature have addressed the hardware implementation of Montgomery modular multipliers for large operands of thousands of bits. The most important of them are those published by A. Tenca, C. Koc, T. Todorov, E. Savas, and M. Ciftcibasi [12], [13], [14], [15], who introduced what is called the word-based (scalable) Montgomery multiplication algorithm and extended it to high-radix implementations. However, they did not consider their hardware from the low power consumption point of view.
The goal of this work is to present a systematic technique for exploring possible processor arrays for the scalable radix 4 Montgomery modular multiplication algorithm and to compare these processor arrays with other efficient ones in terms of speed, area, and power consumption.
This paper is organized as follows: Section 2 presents the Scalable Multiple-Word Radix 4 Montgomery Multiplication (MWR4MM) algorithm with recoding [16]. Section 3 presents the systematic methodology we employed to design the processor array architectures. Section 4 describes several techniques to decrease power dissipation. Section 5 compares the resulting processor arrays to other recent efficient ones in terms of area, speed, and power consumption. Finally, Section 6 concludes the paper.
2 MWR4MM ALGORITHM
The notations used in this paper are as follows:
. M: modulus.
. m_j: a single bit of M at position j.
. A: multiplier operand.
. a_j: a single bit of A at position j.
. B: multiplicand operand.
. n: operand size (in number of bits).
. R: a constant (called the Montgomery parameter), R = 2^n.
. qa_j: coefficient that determines the multiple of the multiplicand B (qa_j × B).
. qm_j: coefficient that determines the multiple of the modulus M (qm_j × M).
. S: intermediate partial product, or final result of the modular multiplication.
. w: word size (in number of bits) of B, M, or S.
1142 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 22, NO. 7, JULY 2011
. A. Ibrahim and H. Elsimary are with the Department of Microelectronics, Electronics Research Institute, Dokki, Cairo 12622, Egypt. E-mail: [email protected], [email protected].
. F. Gebali is with the Department of Electrical and Computer Engineering, University of Victoria, Victoria, BC V8W 3P6, Canada. E-mail: [email protected].
. A. Nassar is with the Electronics and Electrical Communications Department, Cairo University, El-Giza, Cairo 12613, Egypt. E-mail: [email protected].
Manuscript received 6 June 2009; revised 14 Mar. 2010; accepted 1 July 2010; published online 8 Nov. 2010. Recommended for acceptance by S.-Q. Zhang. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPDS-2009-06-0256. Digital Object Identifier no. 10.1109/TPDS.2010.196.
1045-9219/11/$26.00 © 2011 IEEE. Published by the IEEE Computer Society.
. e = ⌈n/w⌉: number of words in B, M, or S.
. Ca, Cb: carry bits.
. (B^(e−1), ..., B^(1), B^(0)): word vector of B.
. (M^(e−1), ..., M^(1), M^(0)): word vector of M.
. (S^(e−1), ..., S^(1), S^(0)): word vector of S.
. S^(i)_(k−1..0): bits k−1 down to 0 of the ith word of S.
. S^(i)_(j(k−1..0)): bits k−1 down to 0 of the ith word of S at iteration j.
Algorithm 1 shows the MWR4MM algorithm. This algorithm is an extension of the Multiple-Word High-Radix (radix 2^k) Montgomery Multiplication (MWR2kMM) algorithm presented in [14], but it uses a recoding scheme to recode qm_j [16].
Algorithm 1. Scalable MWR4MM algorithm
1: S = 0
2: for j = 0 to ⌈n/2⌉ − 1 do
3:   qa_j = Booth(a_(2j+1..2j−1)) = Booth(A_j)
4:   (Ca, S^(0)) = S^(0) + (qa_j × B)^(0)
5:   qm_j = Montg(S^(0)_(1..0) × (4 − (M^(0)_(1..0))^(−1)) mod 4)
6:   (Cb, S^(0)) = S^(0) + (qm_j × M)^(0)
7:   for i = 1 to e − 1 do
8:     (Ca, S^(i)) = S^(i) + (qa_j × B)^(i)
9:     (Cb, S^(i)) = S^(i) + (qm_j × M)^(i)
10:    S^(i−1) = (S^(i)_(1..0), S^(i−1)_(w−1..2))
11:  end for
12:  Ca = Ca or Cb
13:  S^(e−1) = signext(Ca, S^(e−1)_(w−1..2))
14: end for
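As a functional reference for the radix-4 recurrence, the following Python model (our illustrative sketch, not the hardware) executes the main loop at full precision: per iteration it consumes one base-4 digit qa_j of A and one Montgomery quotient digit qm_j, then divides by the radix. The Booth recoding of qa_j and the word-serial scan over i, which are the point of the scalable hardware algorithm, are deliberately omitted; only the arithmetic is modeled.

```python
def mwr4mm_model(a, b, m, n):
    """Radix-4 Montgomery multiplication model: a*b*4^(-ceil(n/2)) mod m.

    High-level model of the radix-4 loop: qa_j is the plain base-4 digit
    of A (no Booth recoding) and qm_j = S * (-M^(-1) mod 4) mod 4 is the
    quotient digit that makes S divisible by 4. Requires m odd, a, b < m.
    """
    steps = (n + 1) // 2              # ceil(n/2) radix-4 digits
    m_inv = (-pow(m, -1, 4)) % 4      # -M^(-1) mod 4
    s = 0
    for j in range(steps):
        qa = (a >> (2 * j)) & 3       # digit of A
        s += qa * b                   # S = S + qa_j * B
        qm = (s * m_inv) % 4          # quotient digit
        s += qm * m                   # S = S + qm_j * M, now 4 | S
        s >>= 2                       # exact division by the radix 4
    return s - m if s >= m else s
```

The final conditional subtraction corresponds to the correction step that appears as line 20 of Algorithm 2.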
3 A SYSTEMATIC METHODOLOGY FOR PROCESSOR ARRAY DESIGN
Systematic methodologies for designing processor arrays allow design space exploration to optimize performance according to certain specifications while satisfying design constraints. Several methodologies were proposed earlier [17], [18], [19], [20]. However, most of these methodologies are not able to deal with algorithms that have more than two dimensions. The second author proposed a systematic methodology that deals with algorithms of arbitrary dimensions [20]: a formal algebraic procedure for processor array design starting from a Regular Iterative Algorithm (RIA) for a 3D digital filter, which gives rise to a dependency graph (DG) in 6D space. In this work, we use this formal technique to develop processor arrays for the scalable radix 4 Montgomery modular multiplication algorithm. We also extend the methodology by using nonlinear processor projection operations.
3.1 Expressing the Algorithm as a Regular Iterative Algorithm
To develop a processor array, we must first be able to describe the MWR4MM algorithm using recursions that convert the algorithm into an RIA. We can rewrite the MWR4MM algorithm, Algorithm 1, as Algorithm 2, where j represents the iteration index and i represents the word index.
Algorithm 2. Scalable MWR4MM algorithm as RIA
1: for i = 0 to e do "initialization"
2:   S^(i)_(−1) = 0
3: end for
4: for j = 0 to ⌈n/2⌉ − 2 do
5:   S^(e)_j = 0
6: end for
7: for j = 0 to ⌈n/2⌉ − 1 do "main algorithm"
8:   qa_j = Booth(a_(2j+1..2j−1)) = Booth(A_j)
9:   (Ca, S^(0)_j) = S^(0)_(j−1) + (qa_j × B)^(0)
10:  qm_j = Montg(S^(0)_(j(1..0)) × (4 − (M^(0)_(1..0))^(−1)) mod 4)
11:  (Cb, S^(0)_j) = S^(0)_j + (qm_j × M)^(0)
12:  for i = 1 to e − 1 do
13:    (Ca, S^(i)_j) = S^(i)_(j−1) + (qa_j × B)^(i)
14:    (Cb, S^(i)_j) = S^(i)_j + (qm_j × M)^(i)
15:    S^(i−1)_j = (S^(i)_(j(1..0)), S^(i−1)_(j(w−1..2)))
16:  end for
17:  Ca = Ca or Cb
18:  S^(e−1)_j = signext(Ca, S^(e−1)_(j(w−1..2)))
19: end for
20: if S ≥ M then S = S − M
21: end if
3.2 Obtaining the Algorithm Dependency Graph
The MWR4MM algorithm, Algorithm 2, can easily be defined on a 2D domain since there are two indices (j, i). The DG is shown in Fig. 1. The computation domain is the convex hull in the 2D space where the algorithm operations are defined, as indicated by the gray and black circles in the 2D plane [20], [21]. From Fig. 1, we notice that the black circle operation corresponds to the computation of steps 8, 9, 10, and 11, and the gray circle operation corresponds to the computation of steps 13, 14, and 15 of Algorithm 2. Also, from this figure,
Fig. 1. MWR4MM algorithm dependency graph for n = 10 and w = 2.
we notice that the input variables B^(i) and M^(i) are represented by horizontal lines, the variables qa_j and qm_j are represented by vertical lines, and the output variable S_j is represented by vertical lines.
3.3 Data Scheduling
Pipelining or broadcasting the variables of an algorithm is determined by the choice of a timing function that assigns a time value to each node in the DG. A simple but useful timing function is an affine scheduling function of the form [20]
t(p) = Cp − c,  (1)
where the function t(p) associates a time value t with a point p in the DG. The value of c is an integer chosen to ensure that only positive time index values are obtained, and the row vector C = [c1 c2] is the scheduling vector. The affine scheduling function must satisfy several conditions. From Fig. 1, we observe that in each column, the input of each gray node depends on the output of the black node of the same column; thus, we can write
t(p(j, 0)) < t(p(j, i))  for all 0 ≤ j ≤ ⌈n/2⌉ − 1 and 1 ≤ i ≤ e.  (2)
Applying our scheduling function (1) to this inequality, we get
[c1 c2] [j 0]^t < [c1 c2] [j i]^t,  (3)
jc1 < jc1 + ic2,  (4)
which can be simplified to
c2 > 0.  (5)
Similarly, from Fig. 1, we observe that the output variable S^(i)_j depends on the previous output value S^(i+1)_(j−1); thus, we can write
[c1 c2] [j i]^t > [c1 c2] [j−1 i+1]^t,  (6)
jc1 + ic2 > (j − 1)c1 + (i + 1)c2,  (7)
which can be simplified to
c2 < c1;  (8)
hence, (5) and (8) can be merged as
0 < c2 < c1.  (9)
From (9), there are many solutions for C; the most reasonable and simplest one is
C = [2 1].  (10)
If we want to pipeline a variable whose nullvector is ν, we must have
Cν^t ≠ 0,  (11)
where ν is the nullvector of the variable dependence matrix [20]. On the other hand, if we want to broadcast a variable whose nullvector is ν, we must have [20]
Cν^t = 0.  (12)
To study the timing of the variables B^(i), M^(i), A_j, and S^(i)_j, we first find their nullvectors:
ν_B = [1 0],  (13)
ν_M = [1 0],  (14)
ν_A = [0 1],  (15)
ν_S = [−1 1].  (16)
The product of C with each of these nullvectors gives
Cν_B^t = 2,  (17)
Cν_M^t = 2,  (18)
Cν_A^t = 1,  (19)
Cν_S^t = −1;  (20)
therefore, all input and output variables will be pipelined.
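The pipelining decision in (17)-(20) can be checked mechanically. The short script below (our illustration) recomputes C·ν^t for the four nullvectors under C = [2 1] and confirms that none of the products is zero, so every variable is pipelined rather than broadcast.

```python
# Pipelining test Cν^t != 0 (Eq. (11)) for the nullvectors of B, M, A, S
# under the scheduling vector C = [2 1] chosen in Eq. (10).
C = (2, 1)

nullvectors = {
    "B": (1, 0),
    "M": (1, 0),
    "A": (0, 1),
    "S": (-1, 1),
}

for name, nu in nullvectors.items():
    dot = C[0] * nu[0] + C[1] * nu[1]   # C . nu^t
    verdict = "pipelined" if dot != 0 else "broadcast"
    print(f"C.nu_{name}^t = {dot:+d} -> {verdict}")
```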
3.4 DG Node Projection
The projection operation is a many-to-one function that maps several nodes of the DG onto a single node of the resulting processor array. Thus, several operations in the DG are mapped to a single PE. The projection operation allows hardware economy by multiplexing several operations in the DG on a single PE. The second author [20] explained how to perform the projection operation using a projection matrix P. To obtain the projection matrix, we need to define a desired projection direction d. The vector d belongs to the null space of P. Since we are dealing with a 2D DG, matrix P is a row vector and d is a column vector [20]. A valid projection direction d must satisfy the inequality [20]
Cd ≠ 0.  (21)
In the following, we discuss design space explorations based on the timing function C obtained in (10). There are many projection vectors that satisfy (21) for the scheduling function in (10). For simplicity, we choose four of them as follows:
d1 = [1 0]^t,  (22)
d2 = [0 1]^t,  (23)
d3 = [2 1]^t,  (24)
d4 = [1 1]^t.  (25)
The corresponding projection matrices are given by
P1 = [0 1],  (26)
P2 = [1 0],  (27)
P3 = [−1 2],  (28)
P4 = [−1 1].  (29)
Our processor design space now allows four processor array configurations, one for each projection vector, under the timing function C. In the following sections, we study the processor array associated with each design option.
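These choices can be sanity-checked numerically: each d must satisfy Cd ≠ 0 from (21), and each P must be orthogonal to its d, since d spans the null space of P. A small script (ours, for illustration):

```python
# Validity check for the four projection directions of Eqs. (22)-(25)
# and their projection matrices of Eqs. (26)-(29), with C = [2 1]:
# Cd != 0 (Eq. (21)) and P·d = 0 (d spans the null space of P).
C = (2, 1)

designs = {                      # name: (d as column vector, P as row vector)
    "d1/P1": ((1, 0), (0, 1)),
    "d2/P2": ((0, 1), (1, 0)),
    "d3/P3": ((2, 1), (-1, 2)),
    "d4/P4": ((1, 1), (-1, 1)),
}

for name, (d, P) in designs.items():
    Cd = C[0] * d[0] + C[1] * d[1]
    Pd = P[0] * d[0] + P[1] * d[1]
    assert Cd != 0 and Pd == 0, name   # valid direction, P orthogonal to d
    print(f"{name}: Cd = {Cd}, Pd = {Pd}")
```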
3.4.1 Design1: Using d1 = [1 0]^t
A point p = [j i]^t in the DG will be mapped by the projection matrix P1 = [0 1] onto the point
p′ = P1 p = i.  (30)
The resulting processor array corresponding to the projection matrix P1 consists of e + 1 PEs. Only z = ⌈(e + 1)/2⌉ PEs are active at each time step. Fig. 2 shows the processor activity for the case n = 10 and w = 2, where the black nodes represent active PEs and the white nodes represent idle PEs. Since at most z PEs are active at a given time step, the PEs are not well utilized. To improve PE utilization, we need to reduce the number of processors. We note from Fig. 2 that PE_2i and PE_(2i+1) are active at nonoverlapping time steps. Thus, each pair of PEs (PE_2i and PE_(2i+1)) can be mapped to a single PE without causing time conflicts. This can be achieved using the following nonlinear projection operation:
l = ⌊i/2⌋ mod z.  (31)
The resulting processor array for different values of n and w is shown in Fig. 3. This processor array is new and was not reported before. The processor array now consists of z PEs and the calculation latency is 2⌈n/2⌉ clock cycles. Input words B^(2i), B^(2i+1), M^(2i), and M^(2i+1) are allocated to processor PE_i, and input A_j is allocated to processor PE_0. qa_j and qm_j are generated inside PE_0 and pipelined to the PEs with higher indices. The intermediate output words S^(i)_j of each PE are pipelined between adjacent PEs. A tristate buffer at the output of each PE ensures that it is the only output fed to the output bus. The PE hardware implementation details for Design1 are given in the supplementary material provided with this paper, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPDS.2010.196.
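The claim that the folding in (31) causes no time conflicts can be verified exhaustively for the example DG. The sketch below (our check, with the DG extents read off Figs. 1 and 2 for n = 10 and w = 2) schedules every node at t(p) = 2j + i and asserts that no folded PE receives two nodes at the same time step.

```python
import math

# Exhaustive conflict check for Design1's folding l = floor(i/2) mod z
# (Eq. (31)) under the affine schedule t(p) = 2j + i with C = [2 1].
# DG extents assumed from Fig. 1: j = 0..ceil(n/2)-1, i = 0..e.
n, w = 10, 2
e = math.ceil(n / w)                 # e = 5 words
z = math.ceil((e + 1) / 2)           # z = 3 PEs after folding

schedule = {}                        # (PE index l, time t) -> DG node
for j in range(math.ceil(n / 2)):
    for i in range(e + 1):
        t = 2 * j + i                # t(p) = Cp
        l = (i // 2) % z             # nonlinear projection, Eq. (31)
        assert (l, t) not in schedule, "time conflict!"
        schedule[(l, t)] = (j, i)

print(f"{len(schedule)} DG nodes mapped onto {z} PEs with no conflicts")
```

The check succeeds because PE_2i and PE_(2i+1) fire at time steps of opposite parity, exactly the observation made from Fig. 2.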
3.4.2 Design2: Using d2 = [0 1]^t
A point p = [j i]^t in the DG will be mapped by the projection matrix P2 = [1 0] onto the point
p′ = P2 p = j.  (32)
The resulting processor array corresponding to the projection matrix P2 consists of ⌈n/2⌉ PEs. Only z = ⌈(e + 1)/2⌉ PEs are active at each time step. Fig. 4 shows the processor activity for the case n = 10 and w = 2, where the black nodes represent active PEs and the white nodes represent idle PEs. Since at most z PEs are active at a given time step, the PEs are not well utilized. To improve PE utilization, we need to reduce the number of processors. We note from Fig. 4 that PE_j and PE_(j+z) are active at nonoverlapping time steps. Thus, each pair of PEs (PE_j and PE_(j+z)) can be mapped to a single PE without causing time conflicts. This can be achieved using the following nonlinear projection operation:
l = j mod z.  (33)
The resulting processor array, for different values of n and w, after applying the above modulo operation on the activity graph in Fig. 4, is shown in Fig. 5. This processor array is identical to the one reported in [23], [24]. The processor array now consists of z PEs and the calculation latency is 2⌈n/2⌉ clock cycles. Input words B^(i) and M^(i) are pipelined through the PEs and input A_j is allocated to each PE. qa_j and qm_j are generated and used inside each PE. The intermediate output words S^(i)_j of each PE are pipelined to the next PE with higher
Fig. 2. Processor activity at different time steps for d1 = [1 0]^t, n = 10, and w = 2.
Fig. 3. Processor array for d1 = [1 0]^t for different values of n and w.
Fig. 4. Processor activity at different time steps for d2 = [0 1]^t, n = 10, and w = 2.
index. The PE hardware implementation details for Design2 are given in the supplementary material provided with this paper, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPDS.2010.196.
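Design2's folding admits the same exhaustive verification. This illustrative script asserts that l = j mod z in (33) never assigns two DG nodes with equal schedule time t = 2j + i to the same folded PE, using the Fig. 1 DG extents for n = 10 and w = 2.

```python
import math

# Exhaustive conflict check for Design2's folding l = j mod z (Eq. (33))
# under the schedule t(p) = 2j + i, for the n = 10, w = 2 example DG.
n, w = 10, 2
e = math.ceil(n / w)                 # e = 5
z = math.ceil((e + 1) / 2)           # z = 3

busy = set()
for j in range(math.ceil(n / 2)):
    for i in range(e + 1):
        slot = (j % z, 2 * j + i)    # (folded PE index, time step)
        assert slot not in busy, "time conflict!"
        busy.add(slot)

print(f"no conflicts: {len(busy)} node slots on {z} PEs")
```

Here PE_j and PE_(j+z) never collide because folding j by z shifts the activity window by 2z time steps, more than the e + 1 steps any column occupies.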
3.4.3 Design3: Using d3 = [2 1]^t
A point p = [j i]^t in the DG will be mapped by the projection matrix P3 = [−1 2] onto the point
p′ = P3 p = −j + 2i.  (34)
The resulting processor array corresponding to the projection matrix P3 consists of 2e + ⌈n/2⌉ + 1 PEs, after adding a fixed increment to all PE indices to ensure nonnegative PE index values [20]. Only z = ⌈(e + 1)/2⌉ PEs are active at each time step. Fig. 6 shows the processor activity for the case n = 10 and w = 2, where the black nodes represent active PEs and the white nodes represent idle PEs. Since at most z PEs are active at a given time step, the PEs are not well utilized. To improve PE utilization, we need to reduce the number of processors. We notice from Fig. 6 that all PEs whose indices are given by (34) can be mapped to PEs with indices l as
l = (−j + 2i) mod z  (35)
without any timing conflicts. This statement is true as long as the inequality Cd3 ≠ z is satisfied. Applying the above nonlinear projection operation and using the activity graph of Fig. 6 produces the processor array shown in Fig. 7. For different values of z, the nonlinear mapping of (35) will produce different processor arrays. This is due to the mixing of the indices i and j in the nonlinear mapping operation; this was not the case for Designs 1 and 2. In the processor array of Fig. 7, input words B^(i) and M^(i) are loaded into the PEs during the first e + 1 clock cycles and then pipelined between them. Input A_j is allocated to each PE. qa_j and qm_j are generated inside each PE and pipelined to the next PEs. The intermediate output words S^(i)_j of each PE are pipelined to the next PE. A tristate buffer at the output of each PE ensures that it is the only output fed to the output bus.
3.4.4 Design4: Using d4 = [1 1]^t
A point p = [j i]^t in the DG will be mapped by the projection matrix P4 = [−1 1] onto the point
p′ = P4 p = −j + i.  (36)
The resulting processor array corresponding to the projection matrix P4 consists of e + ⌈n/2⌉ + 1 PEs, after adding a fixed increment to all PE indices to ensure nonnegative PE index values [20]. Only z = ⌈(e + 1)/2⌉ PEs are active at each time step. Fig. 8 shows the processor array activity for the case n = 14 and w = 2, where the black nodes represent active PEs and the white nodes represent idle PEs. Since at most z PEs are active at a given time step, the PEs are not well utilized. To improve PE utilization, we need to reduce the number of processors. We note from Fig. 8 that each PE is active once every four time steps. Thus, all PEs whose indices are given by (36) can be mapped to PEs with indices l as
l = (−j + i) mod z  (37)
without any timing conflicts. This statement is true as long as the inequality Cd4 ≠ z is satisfied. When n = 10 and w = 2, we get z = 3; the inequality is then violated, so all cases that have z = 3 must be excluded. In practice, z > 3 and the inequality is always satisfied. Applying the above nonlinear projection operation and using the activity graph of Fig. 8 produces the processor array shown in Fig. 9. For different values of z, the nonlinear mapping of (37) will produce different processor arrays. This is due to the mixing of the indices i and j in the nonlinear mapping operation. In the processor array of Fig. 9, input words B^(i) and M^(i) are loaded into the PEs during the first e + 1 clock cycles and then pipelined
Fig. 5. Processor array for d2 = [0 1]^t for different values of n and w.
Fig. 6. Processor activity at different time steps for d3 = [2 1]^t, n = 10, and w = 2.
Fig. 7. Processor array for d3 = [2 1]^t, n = 10, and w = 2.
between them. Input A_j is allocated to each PE. qa_j and qm_j are generated inside each PE and pipelined to the next PEs. The intermediate output words S^(i)_j of each PE are pipelined between the PEs. A tristate buffer at the output of each PE ensures that it is the only output fed to the output bus.
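Because (35) and (37) mix the indices i and j, their conflict-freedom is less obvious than for Designs 1 and 2, but it can still be checked exhaustively. The sketch below (our illustration) does so for n = 14 and w = 2, where z = 4, so C·d3 = 5 and C·d4 = 3 both differ from z as the text requires.

```python
import math

# Exhaustive conflict checks for the index-mixing foldings of Design3
# (Eq. (35)) and Design4 (Eq. (37)), schedule t(p) = 2j + i, DG extents
# j = 0..ceil(n/2)-1, i = 0..e as in Fig. 1. Case n = 14, w = 2 => z = 4.
n, w = 14, 2
e = math.ceil(n / w)                     # e = 7
z = math.ceil((e + 1) / 2)               # z = 4

for name, pe_index in [("Design3", lambda j, i: (-j + 2 * i) % z),
                       ("Design4", lambda j, i: (-j + i) % z)]:
    busy = set()
    for j in range(math.ceil(n / 2)):
        for i in range(e + 1):
            slot = (pe_index(j, i), 2 * j + i)   # (PE index, time step)
            assert slot not in busy, f"{name}: time conflict!"
            busy.add(slot)
    print(f"{name}: no conflicts on {z} PEs")
```

Rerunning the Design4 check with n = 10 (z = 3, so C·d4 = z) does produce a collision, matching the exclusion of the z = 3 case stated above.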
4 LOW POWER TECHNIQUES
In this section, we further improve the hardware of the previous designs to dissipate less power. The block diagrams of the modified first PE (PE_0) of Design1 and the modified PE of Design2, provided in the supplementary material (available on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPDS.2010.196), each have two circuit modules in charge of the Booth and Montgomery recodings. These modules comprise only combinational logic circuits [16]. The outputs of these modules have unbalanced path delays and consequently introduce glitches, which cause unnecessary dynamic power dissipation. Furthermore, the fan-outs of the glitchy signals are large, and this leads to significant dissipated power.
To reduce the glitching power dissipation, we insert latches and force the outputs of these modules (PP and MM) to pass through them. If all flip-flops and registers capture their inputs at the clock's rising edge, then the latches are transparent when the clock is in a low state. If the outputs of the two recoding modules reach their stable values before the clock's falling edge, none of the glitches can propagate to the fan-out modules driven by these outputs. We name these latches "glitch blockers" [25]. The glitch blockers are also very effective for reducing the glitches appearing in the Carry Save Adder (CSA), since they synchronize the arrival of PP and MM at the CSA's inputs. The same technique can be applied to Design3 and Design4.
The PP-Generator module (see the supplementary material, available on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPDS.2010.196) generates its output (PP) by modifying a word of B according to the Booth recoder's outputs, SEL_PP and EN_PP. Similarly, the MM-Generator module (see the supplementary material) generates its output (MM) by modifying a word of M according to the Montgomery recoder's outputs, SEL_MM and EN_MM. When PP is zeroed by EN_PP, the PP output of the PP-Generator does not depend on SEL_PP. Thus, keeping SEL_PP frozen at that time is effective for reducing power dissipation. The same reasoning applies to SEL_MM. We place two 1-bit flip-flops [25] and construct feedback loops for SEL_PP and SEL_MM to implement this idea. The same technique can be applied to Design3 and Design4.
The experimental results show that the power consumption of Design1 and Design2 before applying the low power techniques was 56.20 mW and 57.23 mW (at 10 MHz and 1.8 V), respectively. After applying the low power techniques, these values are reduced to 34.23 mW and 40.20 mW, respectively. Table 3 also shows the significance of applying the low power techniques to our designs. The results in this table show that Design1 has a significant gain in reducing power consumption over the other designs. The power reduction of Design1 ranges between 15 and 50 percent compared to the other designs reported here or in the literature.
5 DESIGN COMPARISON
In this section, we compare the designs developed in this paper with previous efficient Montgomery multiplier designs in terms of area, speed, and power consumption [14], [16], [26].
For Design1, we observe that the projection operation results in PE_0 performing more complex operations than the other PEs in the processor array. Hence, the remaining PEs have a simpler structure and perform fewer operations, consuming less power than PE_0. This results in simpler hardware for the array as a whole. Therefore, we expect Design1 to require fewer hardware resources and to consume less power.
For Designs 2, 3, and 4, however, all PEs perform identical operations, which are the same operations done by PE_0 in Design1. Hence, all PEs have identical hardware structures and nearly identical power
Fig. 9. Processor array for d4 = [1 1]^t, n = 14, and w = 2.
Fig. 8. Processor activity at different time steps for d4 = [1 1]^t, n = 14, and w = 2.
consumption. The only difference between the designs is the paths taken by the signals, which are coordinated by the control unit in each PE. Therefore, we expect higher hardware resource requirements and more power consumption compared to Design1. To illustrate these points, we implemented Design1 and Design2 as representatives of the two classes of processor arrays proposed in this work.
The estimated area and speed figures are generated by synthesizing the VHDL code of each design using the Leonardo synthesis tool from Mentor Graphics, with the xc3s1600e-5fg484 (Spartan-3E) FPGA as the target technology. For power estimation, we used the XPower Analyzer from Xilinx for the same target part.
5.1 Area Estimation
The area of each design depends mainly on two design parameters: the number of architecture stages (z) and the word size (w) of the operands. Table 1 compares the total area needed for the first two designs to that of other efficient Montgomery multiplier designs for n = 1024 and for different values of w and z. The area improvement of Design1 over the other compared designs is shown in the table. We conclude from Table 1 that Design1 has a significant gain in reducing area over the other designs for different values of w and z.
5.2 Time Estimation
The total computation time for each design is equal to the product of the number of clock cycles it takes and the clock period. The critical path delay determines the clock period and depends on the number of PEs (z) and the word size (w) of the operands. Equation (38) gives the total number of clock cycles needed for Design1, while (39) gives the total number of clock cycles needed for Design2 and all other compared designs (where k = 2 for radix 4 designs and k = 3 for radix 8 designs).
T_design1 = 2⌈n/2⌉ + 2(z − 1),  (38)
T_design2 = ⌈n/kz⌉(2z + 1) + e + 1.  (39)
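Equations (38) and (39) are easy to tabulate. The following illustrative script prints the cycle counts of both designs for n = 1024 at several word sizes; note that these are cycle counts only, since the actual computation time also depends on the synthesized clock period.

```python
import math

# Tabulate the clock-cycle counts of Eqs. (38) and (39) for n = 1024
# with radix 4 (k = 2), for several word sizes w.
n, k = 1024, 2
for w in (8, 16, 32, 64):
    e = math.ceil(n / w)                       # words per operand
    z = math.ceil((e + 1) / 2)                 # number of PEs
    t1 = 2 * math.ceil(n / 2) + 2 * (z - 1)    # Eq. (38), Design1
    t2 = math.ceil(n / (k * z)) * (2 * z + 1) + e + 1   # Eq. (39), Design2
    print(f"w={w:3d}  z={z:3d}  Design1: {t1} cycles  Design2: {t2} cycles")
```

For w = 32 (so z = 17) this gives 1056 cycles for Design1 versus 1118 for Design2, consistent with the small cycle-count gain of Design1 reported below.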
Table 2 compares the total computation time in μs needed for the first two designs to that of other efficient Montgomery multiplier designs for n = 1024 and for different values of w and z. The time improvement of Design1 over the other designs is shown in the table. We conclude from Table 2 that Design1 has a small gain in reducing the total computation time over other radix 4 designs and a moderate gain over radix 8 designs for different values of w and z.
5.3 Power Estimation
Table 3 compares the power consumption in mW of the first two designs, after applying the low power techniques, to that of other efficient Montgomery multiplier designs for n = 1024, w = 32, and z = 17. We notice from Table 3 that Design1 has a significant gain in reducing power consumption over Design2 after applying the low power techniques. This is because the low power techniques add hardware only to the first PE of Design1, while this hardware is replicated in every PE of Design2. This extra hardware adds area overhead and contributes to the increased power consumption of Design2 over Design1. The power improvement of Design1 over the other designs is also shown in the table. We conclude from Table 3 that Design1 has a significant gain in reducing the total power consumption over the other designs.
6 SUMMARY AND CONCLUSION
This paper presented a systematic technique to explore possible processor arrays for the scalable MWR4MM algorithm. This technique first converts the MWR4MM algorithm into a regular iterative algorithm (RIA). Having obtained the RIA, we were able to develop a data dependency graph, which allowed us to explore all possible data timing options. Earlier approaches did not explain how their designs were obtained, at best developed a single design, and did not allow for design space exploration. Four possible processor array structures were obtained. The Design2
TABLE 1. Total Area Comparison in Number of CLBs for n = 1024 and Different Values of w and z
TABLE 2. Total Computation Time (μs) for n = 1024
TABLE 3. Power Consumption (mW) for n = 1024, w = 32, and z = 17 (at 10 MHz and 1.8 V)
processor array is identical to the one obtained by Tawalbeh and Tenca [16] and has processor architectures similar to those of Design3 and Design4. Design1 is a novel design that was not reported before in the literature. To reduce power consumption, we applied effective low power techniques for reducing the glitches and the Expected Switching Activity (ESA) of high fan-out signals in each processor array. From the analysis in Section 5, we conclude that Design1 has better performance than the other designs in terms of area, speed, and power consumption.
REFERENCES
[1] R. Rivest, A. Shamir, and L. Adleman, "A Method for Obtaining Digital Signatures and Public-Key Cryptosystems," Comm. ACM, vol. 21, no. 2, pp. 120-126, Feb. 1978.
[2] Nat'l Inst. for Standards and Technology, "Digital Signature Standard (DSS)," Fed. Information Processing Standards Publication (FIPS PUB 186-2), Jan. 2000.
[3] W. Diffie and M. Hellman, "New Directions in Cryptography," IEEE Trans. Information Theory, vol. IT-22, no. 6, pp. 644-654, Nov. 1976.
[4] N. Koblitz, "Elliptic Curve Cryptosystems," Math. of Computation, vol. 48, no. 177, pp. 203-209, Jan. 1987.
[5] A. Menezes, Applications of Finite Fields. Kluwer Academic Publishers, 1993.
[6] B. Kaliski, C. Koc, and T. Acar, "Analyzing and Comparing Montgomery Multiplication Algorithms," IEEE Micro, vol. 16, no. 3, pp. 26-33, June 1996.
[7] T. Hamano, "O(n)-Depth Circuit Algorithm for Modular Exponentiation," Proc. IEEE 12th Symp. Computer Arithmetic, pp. 188-192, 1995.
[8] C. Paar and T. Blum, "Montgomery Modular Exponentiation on Reconfigurable Hardware," Proc. 14th IEEE Symp. Computer Arithmetic, pp. 70-77, 1999.
[9] J. Bajard, L. Didier, and P. Kornerup, "An RNS Montgomery Modular Multiplication Algorithm," IEEE Trans. Computers, vol. 47, no. 7, pp. 766-776, July 1998.
[10] P. Montgomery, "Modular Multiplication without Trial Division," Math. of Computation, vol. 44, no. 170, pp. 519-521, Apr. 1985.
[11] T. Yanik, E. Savas, and C. Koc, "Incomplete Reduction in Modular Arithmetic," Math. of Computation, vol. 149, no. 2, pp. 46-54, Mar. 2002.
[12] A. Tenca and C. Koc, "A Scalable Architecture for Modular Multiplication Based on Montgomery's Algorithm," IEEE Trans. Computers, vol. 52, no. 9, pp. 1215-1221, Sept. 2003.
[13] A. Tenca, E. Savas, and C. Koc, "A Design Framework for Scalable and Unified Architectures that Perform Multiplication in GF(p) and GF(2^m)," Int'l J. Computer Research, vol. 13, no. 1, pp. 68-83, 2004.
[14] T. Todorov, A. Tenca, and C. Koc, "High-Radix Design of a Scalable Modular Multiplier," Cryptographic Hardware and Embedded Systems, C. Koc, D. Naccache, and C. Paar, eds., pp. 189-205, Springer-Verlag, 2001.
[15] E. Savas, A. Tenca, M. Ciftcibasi, and C. Koc, "Multiplier Architectures for GF(p) and GF(2^n)," Proc. IEE Computers and Digital Techniques, vol. 151, no. 2, pp. 147-160, Mar. 2004.
[16] L. Tawalbeh and A. Tenca, "Radix-4 ASIC Design of a Scalable Montgomery Modular Multiplier Using Encoding Techniques," master's thesis, Oregon State Univ., 2002.
[17] S. Rao and T. Kailath, "Regular Iterative Algorithms and Their Implementation on Processor Arrays," Proc. IEEE, vol. 76, no. 3, pp. 259-269, Mar. 1988.
[18] S. Kung, VLSI Array Processors. Prentice-Hall, 1988.
[19] E. Abdel-Raheem, "Design and VLSI Implementation of Multirate Filter Banks," PhD dissertation, Univ. of Victoria, 1995.
[20] F. El-Guibaly and A. Tawfik, "Mapping 3D IIR Digital Filter onto Systolic Arrays," Multidimensional Systems and Signal Processing, vol. 7, no. 1, pp. 7-26, Jan. 1996.
[21] A. Refiq and F. Gebali, "Processor Array Architectures for Deep Packet Classification," IEEE Trans. Parallel and Distributed Systems, vol. 17, no. 3, pp. 241-252, Mar. 2006.
[22] H. Orup, "Simplifying Quotient Determination in High-Radix Modular Multiplication," Proc. 12th IEEE Symp. Computer Arithmetic, pp. 193-199, July 1995.
[23] G. Todorov and A. Tenca, "ASIC Design, Implementation and Analysis of a Scalable High-Radix Montgomery Multiplier," master's thesis, Oregon State Univ., 2000.
[24] C. Koc and A. Tenca, "A Word-Based Algorithm and Architecture for Montgomery Multiplication," Cryptographic Hardware and Embedded Systems, C. Koc, D. Naccache, and C. Paar, eds., pp. 94-108, Springer, 1999.
[25] H. Son and S. Oh, "Design and Implementation of Scalable Low-Power Montgomery Multiplier," Proc. Int'l Conf. Computer Design (ICCD '04), pp. 524-531, 2004.
[26] N. Pinckney and D. Harris, "Parallel High-Radix Montgomery Multipliers," Proc. 42nd Asilomar Conf. Signals, Systems and Computers, pp. 772-776, 2008.
Atef Ibrahim received the BSc degree in electronics engineering from Mansoura University, Egypt, in 1998, and the MSc degree in electronics and electrical communications from Cairo University, Egypt, in 2004. He is currently working toward the PhD degree in the Electronics and Electrical Communications Department of Cairo University, Egypt, and is a visiting student in the Electrical and Computer Engineering Department of the University of Victoria, Canada. His research interests include computer arithmetic, cryptography, and VLSI design.
Fayez Gebali received the BSc degree in electrical engineering (first class honors) from Cairo University, the BSc degree in mathematics (first class honors) from Ain Shams University, and the PhD degree in electrical engineering from the University of British Columbia, where he was a holder of an NSERC postgraduate scholarship. He is a professor of computer engineering at the University of Victoria. His research interests include multicore processors, computer communications, and computer arithmetic. He is a senior member of the IEEE.
Hamed Elsimary is a professor at the VLSI Department, Electronics Research Institute, Cairo, Egypt, and is currently on leave, working as a professor at the College of Computer Engineering and Science, King Saud University, Alkharj, Saudi Arabia. His research interests include low power circuit design and computer architecture.
Amin Nassar is a professor at the Electronics and Electrical Communications Department, Faculty of Engineering, Cairo University, Egypt. His research interests include computer-aided design, microprocessors and interfacing, and industrial electronics.