
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 54, NO. 5, MAY 2007

A High-Performance Elliptic Curve Cryptographic Processor for General Curves Over GF(p) Based on a Systolic Arithmetic Unit

Gang Chen, Guoqiang Bai, and Hongyi Chen

Abstract—This brief presents a high-performance elliptic curve cryptographic processor for general curves over GF(p), which features a systolic arithmetic unit. We propose a new unified systolic array that efficiently implements addition, subtraction, multiplication, and division over GF(p). At the system level, the control dependencies in the operation sequence and the mismatched communication between the systolic array and the separate storage would stall the pipeline in the systolic array. These pipeline stalls are successfully avoided by using two optimization methods. Synthesized in 0.13-µm standard-cell technology, the processor requires 1.01 ms to compute a 256-bit scalar multiplication for general curves over GF(p).

Index Terms—Elliptic curve cryptography (ECC), finite field, hardware implementation, systolic array.

I. INTRODUCTION

ELLIPTIC curve cryptography (ECC) [1], [2] is an emerging public-key cryptography. Many hardware implementations of ECC have been proposed [3]–[12]. Most of the proposed processors that support general curves over GF(p) are multiplier-based [7], [11], [12]. The datapath is composed of a $w_1 \times w_2$-bit multiplier that performs $w_1 \times w_2$-bit multiplications repeatedly to accomplish a full-precision Montgomery multiplication, where $w_1$ and $w_2$ are digit sizes chosen according to the tradeoff between area and performance. The disadvantage of this architecture is the long delay of the carry chain in the multiplier. Recently, another architecture of ECC processor that is based on a systolic arithmetic unit appeared in the literature [8], [10]. In [8] the processor for general curves over GF(p) is built on a systolic array for Montgomery modular multiplication, and in [10] the processor for general curves over GF(2^m) is built on a systolic array for both multiplication and division. In [8], although the systolic array exhibits a high inherent throughput, the processor suffers from system-level performance degradation due to the inefficient utilization of the array. In this brief, we propose a high-performance ECC processor for general curves over GF(p) that is based on a systolic arithmetic unit. Not only are GF(p) arithmetic operations,

Manuscript received July 26, 2006; revised September 27, 2006. This work was supported by the National Natural Science Foundation of China under Grant 60273004, Grant 60576027, Grant 60206023, and Grant 60544008, and by the Hi-Tech Research and Development Program of China under Grant 2006AA01Z415. This paper was recommended by Associate Editor V. G. Oklobdzija.

The authors are with the Institute of Microelectronics, Tsinghua University, Beijing 100084, China (e-mail: [email protected]; [email protected]; [email protected]).

Digital Object Identifier 10.1109/TCSII.2006.889459

including multiplication and division, efficiently implemented in a unified systolic array, but the array is also almost fully utilized so as to maximize the system-level performance.

We introduce a modified Montgomery modular multiplication algorithm [13] and propose an extension of the modular division algorithm proposed in [14]. A systolic array for multiplication and division is derived from these algorithms, and the common hardware is shared. At the system level, the control dependencies brought into the operation sequence by the conditional operations, and the mismatched communication between the systolic array and the separate storage, would stall the pipeline in the systolic array, degrading its utilization efficiency. For the control dependencies, we delay the conditional operations, and, due to the lack of independent operations, we schedule the operations that are dependent on the conditional operations into the delay slots, which is enabled by an observation on the underlying mathematics. The mismatched communication is eliminated by distributing the registers used for storage into individual processing elements (PEs) of the systolic array. The pipeline stalls are almost completely avoided by using these methods. Synthesized in 0.13-µm standard-cell technology, the processor requires 1.01 ms to compute a 256-bit scalar multiplication over GF(p).

II. SYSTOLIC IMPLEMENTATION OF ARITHMETIC OPERATIONS OVER GF(p)

In this section we present the algorithms for arithmetic operations over the underlying finite field and derive a systolic array from the algorithms.

A. Algorithms

GF(p) is regarded as $\mathbb{Z}/(p)$. An element of GF(p) is a residue class modulo $p$, denoted as $[x] = \{x + kp \mid k \in \mathbb{Z}\}$. $[x]$ is represented by any representative chosen from the residue class (e.g., $x$). Similarly, an arithmetic operation over GF(p) is implemented as an equivalent operation on the representatives. Let $\circ$ denote an arithmetic operation over GF(p), and $\ast$ denote an operation on the representatives. $\ast$ implements $\circ$ if for any $[x]$ and $[y]$, and for any $x' \in [x]$ and any $y' \in [y]$, $[x' \ast y'] = [x] \circ [y]$.

In a hardware implementation, elements of GF(p) are stored and processed in the form of their representatives. Only representatives with small absolute values are appropriate due to the practical limitation of the hardware area. Certain proper bounds, within which representatives are allowable, are predefined for an implementation. If a certain operation produces a representative beyond the allowable bounds, the representative must be substituted with a representative in the bounds, i.e., it must be reduced modulo $p$.



Fig. 1. Algorithm MONMODMUL: Montgomery modular multiplication.

Fig. 2. Algorithm MODDIV2: Modular division by 2.

Note that the conditional operations in modular reductions have a disadvantageous impact on the utilization efficiency of the systolic array, so the number of the modular reductions required by a specific computation should be minimized. In traditional implementations the allowable bounds are $(-p, p)$. We relax the bounds to $(-\beta p, \beta p)$, where $\beta > 1$, and consequently many modular reductions are saved.
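To make the representative convention concrete, the following is a minimal software sketch of it; the bound factor BETA, the modulus value, and all function names are our own illustrative choices, not taken from the brief.

    # Representative-based GF(p) arithmetic with relaxed bounds (illustrative).
    p = 101            # a small odd modulus, for demonstration only
    BETA = 4           # representatives are kept inside (-BETA*p, BETA*p)

    def reduce_into_bounds(x):
        # Substitute x with another representative of the same residue
        # class only when it drifts outside the allowable bounds.
        while x >= BETA * p:
            x -= p
        while x <= -BETA * p:
            x += p
        return x

    def add_rep(x, y):
        # Integer addition implements addition over GF(p); the bounds on
        # the result are the sums of the bounds on the two operands.
        return reduce_into_bounds(x + y)

    # Different representatives of the same classes yield equivalent results.
    assert (add_rep(3, -2) - add_rep(3 + p, -2 - p)) % p == 0

Because the bounds are relaxed, reduce_into_bounds fires only rarely, which is exactly why many modular reductions are saved.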

We use addition and subtraction over $\mathbb{Z}$ to implement addition and subtraction over GF(p), respectively. The lower and upper bounds on the result are the respective sums of the lower and upper bounds on the two operands.

The well-known Montgomery modular multiplication [13] is used to implement multiplication over GF(p). Fig. 1 shows a Montgomery modular multiplication algorithm that is modified to deal with data in the 2's complement representation. In Fig. 1, $R$ is the Montgomery radix, and the multiplicand is sign-extended. A Montgomery modular multiplication can yield a result whose bounds are tighter than the bounds on either operand. The subroutine MODDIV2 is shown in Fig. 2, where the auxiliary variable is initialized to either $1$ or $-1$ when MODDIV2 is invoked for the first time, and in each subsequent invocation to MODDIV2 it is negated if the intermediate result is odd.
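Fig. 1 itself is not reproduced in this text, so as a reference model we give a textbook radix-2 Montgomery multiplication in the same spirit, with one MODDIV2-style halving per multiplier bit; the function name, the parameterization, and the restriction to a nonnegative multiplier are our assumptions.

    def mon_mod_mul(x, y, p, n):
        # Textbook radix-2 Montgomery multiplication: returns a value
        # congruent to x * y * 2**(-n) modulo the odd modulus p.
        # x may be negative, mirroring the 2's complement data of Fig. 1;
        # y is assumed to be a nonnegative n-bit multiplier, used LSB first.
        s = 0
        for i in range(n):
            s += ((y >> i) & 1) * x   # conditionally accumulate the multiplicand
            if s & 1:                 # MODDIV2-style step: if s is odd,
                s += p                #   adding the odd modulus makes it even
            s //= 2                   # exact division by 2
        return s                      # |s| stays within roughly |x| + p

    # Usage: undo the 2**(-n) factor to check against plain modular arithmetic.
    p, n = 101, 8
    t = mon_mod_mul(7, 9, p, n)
    assert (t * pow(2, n, p)) % p == (7 * 9) % p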

We extend the modular division algorithm proposed in [14] to perform division over GF(p). Fig. 3 shows the algorithm, where “$S_1 \parallel S_2$” means that the two statements, $S_1$ and $S_2$, are executed in parallel. The bounds on the quotient are the same as the bounds on the dividend. A simulation of the algorithm, with exactly the same configuration as when the algorithm is applied in our processor, was used to measure the average number of iterations.
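The exact Algorithm DIV of Fig. 3 cannot be recovered from this text, but the classical binary modular division that [14] builds on can serve as a behavioral reference; the variable names are ours, and the systolic-friendly parallel statements of Fig. 3 are not modeled.

    def mod_div(y, x, p):
        # Binary modular division: returns y / x mod p for an odd prime p
        # and x not divisible by p. Invariants (mod p): x*u == a*y, x*v == b*y.
        a, b, u, v = x % p, p, y % p, 0
        while a != 0:
            while a % 2 == 0:                                 # halve a, keeping
                a //= 2                                       # the invariant:
                u = u // 2 if u % 2 == 0 else (u + p) // 2    # u := u/2 mod p
            while b % 2 == 0:
                b //= 2
                v = v // 2 if v % 2 == 0 else (v + p) // 2    # v := v/2 mod p
            if a >= b:
                a, u = a - b, (u - v) % p
            else:
                b, v = b - a, (v - u) % p
        return v   # here b == gcd(x, p) == 1, so x*v == y (mod p)

    assert (7 * mod_div(10, 7, 101)) % 101 == 10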

B. Systolic Array

We map Algorithm MONMODMUL and Algorithm DIV into systolic arrays and combine the two arrays using the techniques described in [14]. The computing logic in the multiplication array is fully shared with the division array, and the combination does not enlarge the critical path delay. The array also implements addition and subtraction without any modification to the computing logic. Fig. 4 shows the combined systolic array. The $i$th PE processes the $2i$th bits and the $(2i+1)$th bits of data. The rightmost PE processes the least significant two bits and also generates the control signals. For the computation of an iteration of either algorithm, the computation flow issues from the rightmost PE and propagates toward the left, passing one PE

Fig. 3. Algorithm DIV: Division over $\mathbb{Z}/(p)$.

Fig. 4. The combined multiplication/division systolic array.

(correspondingly, processing two bits) per cycle, and the computation of the iteration is completed once the computation flow reaches the leftmost PE. The computation of an iteration can be started once the least significant two bits of the results of the previous iteration are computed, and these bits are computed in the first cycle of the computation of the previous iteration. Therefore, the computations of iterations can be started continuously at the rate of one iteration per cycle.
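This issue pattern gives the usual linear-pipeline cycle count; the following one-line model and the parameter choices in the example are our own back-of-the-envelope illustration, not figures from the brief.

    def systolic_cycles(iterations, num_pes):
        # One iteration is issued per cycle; the flow of the last iteration
        # still needs num_pes - 1 cycles to cross the remaining PEs, which
        # is the classic pipeline fill/drain term.
        return iterations + num_pes - 1

    # E.g., an operation needing 256 iterations on a 130-PE array:
    print(systolic_cycles(256, 130))   # -> 385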

In Fig. 4, the 2-bit ports I are used to input the two operands, and the 2-bit ports O output the results of multiplication and division. The subscripts represent the bit positions. The input/output data are directly routed to/from the PEs and are consumed/produced in the same manner as data processing. The only exception is the multiplier operand of multiplication. This operand, denoted as I2S, is pumped into the rightmost PE serially at the rate of one bit per cycle. The small black boxes illustrate the input and output pattern with regard to time.

The details of a PE in Fig. 4 are shown in Fig. 5, where the small blank boxes represent the pipeline registers. V1, S1, V2, S2, SWAP, MUL, DIV, and INIT are control signals. O1 and O2 are the results of multiplication and division, respectively.

III. METHODS FOR AVOIDING PIPELINE STALLS

The systolic array described above constitutes the datapath of the ECC processor. In order to achieve the maximum system-level throughput, the systolic datapath must be fully utilized, so all the pipeline stalls must be avoided.


Fig. 5. The details of a PE. (a) A PE. (b) The computing logic (COMPUT) in (a). (c) The multiplexers to the registers (REGMUX) in (a).

A. Delaying Modular Reductions and Scheduling Dependent Operations

We notice that modular reductions must be invoked frequently in the computation of an elliptic curve scalar multiplication. The cause is the iterative execution of point additions and doublings. For example, consider that the base point is doubled iteratively in the right-to-left double-and-add scalar multiplication algorithm. A coordinate of the doubled point is computed as a recurrence whose update contains a subtraction, which implies an iterative subtraction sequence performed on that coordinate. Note that subtractions are implemented as integer subtractions. This iterative subtraction sequence would cause the bounds on the coordinate to expand gradually. Hence, the value of the coordinate must be periodically reduced modulo $p$ to prohibit this expansion. Modular reduction, in the simplest form, is implemented as a sequence of pairs of comparisons and conditional operations that depend on the comparisons. The latency of one comparison is equal to the number of the pipeline stages, and each conditional operation must wait for that long latency, stalling the pipeline. These pipeline stalls waste 14% of the total cycles.

In the operation sequence for scalar multiplication, the conditional operations in modular reductions are the sole source of control dependencies that stall the pipeline. A popular technique for avoiding the pipeline stalls caused by control dependencies is delaying the conditional operations and scheduling some useful operations into the delay slots. The efficiency of this technique relies on whether there are sufficient useful operations that can be scheduled to fill the delay slots.

In order to demonstrate our scheduling method, consider an example: reducing a variable $a$ from $[0, 5p)$ into $[0, p)$ modulo $p$. In the following, the comparisons and the right-hand sides of the conditional assignments reference the value of $a$ produced by the operation O1, before any conditional update. The following operation sequence is used to implement the modular reduction.

An operation producing $a$. Denoted as O1.
Compare $a$ with $p$. Denoted as C1.
If C1 yields “$\ge$”, $a := a - p$.
If C1 yields “$<$”, quit the modular reduction.
Compare $a$ with $2p$. Denoted as C2.
If C2 yields “$\ge$”, $a := a - 2p$.
If C2 yields “$<$”, quit the modular reduction.
Compare $a$ with $3p$. Denoted as C3.
If C3 yields “$\ge$”, $a := a - 3p$.
If C3 yields “$<$”, quit the modular reduction.
Compare $a$ with $4p$. Denoted as C4.
If C4 yields “$\ge$”, $a := a - 4p$.
If C4 yields “$<$”, quit the modular reduction.
An operation depending on $a$. Denoted as O2.
Another operation depending on $a$. Denoted as O3.

Note that the comparisons (i.e., testing the value produced by O1) are independent of the conditional additions/subtractions and assignments (i.e., updating $a$), benefiting operation scheduling. The conditional operations are delayed, and for each comparison a delay slot is formed after it. Also note that the operations preceding and succeeding the modular reduction (i.e., O1, O2, and O3) are given. As in the typical context of a modular reduction applied in the operation sequence for scalar multiplication, the modular reduction depends on the preceding operation O1, and the succeeding operations O2 and O3 depend on the modular reduction.

First, schedule the comparisons C2, C3, and C4 into the delay slot after C1. The operation sequence is as follows.

An operation producing $a$. Denoted as O1.
Compare $a$ with $p$. Denoted as C1.
Compare $a$ with $2p$. Denoted as C2.
Compare $a$ with $3p$. Denoted as C3.
Compare $a$ with $4p$. Denoted as C4.
If C1 yields “$\ge$”, $a := a - p$.
If C1 yields “$<$”, quit the modular reduction.
If C2 yields “$\ge$”, $a := a - 2p$.
If C2 yields “$<$”, quit the modular reduction.
If C3 yields “$\ge$”, $a := a - 3p$.
If C3 yields “$<$”, quit the modular reduction.
If C4 yields “$\ge$”, $a := a - 4p$.
If C4 yields “$<$”, quit the modular reduction.
An operation depending on $a$. Denoted as O2.
Another operation depending on $a$. Denoted as O3.


However, since the depth of the delay slot is equal to the latency of the comparison, i.e., the number of the pipeline stages, and the computation time of one comparison is one cycle, the delay slot cannot be filled with only three comparisons. We attempt to schedule the operations preceding or succeeding the modular reduction into the delay slot. It is obviously unreasonable to schedule the operation O1, which produces $a$. Fortunately, we can show that it does not destroy the functionality of the operation sequence to schedule the operations O2 and O3, although they depend on the modular reduction. Suppose the scheduling is performed as follows.

An operation producing $a$. Denoted as O1.
Compare $a$ with $p$. Denoted as C1.
Compare $a$ with $2p$. Denoted as C2.
Compare $a$ with $3p$. Denoted as C3.
Compare $a$ with $4p$. Denoted as C4.
An operation depending on $a$. Denoted as O2.
Another operation depending on $a$. Denoted as O3.
If C1 yields “$\ge$”, $a := a - p$.
If C1 yields “$<$”, quit the modular reduction.
If C2 yields “$\ge$”, $a := a - 2p$.
If C2 yields “$<$”, quit the modular reduction.
If C3 yields “$\ge$”, $a := a - 3p$.
If C3 yields “$<$”, quit the modular reduction.
If C4 yields “$\ge$”, $a := a - 4p$.
If C4 yields “$<$”, quit the modular reduction.

The operations O2 and O3 would use the reduced value of $a$ if not scheduled, while after scheduling they use the unreduced value. Recalling Section II-A, we notice that the reduced value and the unreduced value are in the same residue class modulo $p$ and represent the same element of GF(p). Therefore, the results yielded in those two cases are equivalent over GF(p).
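This residue-class argument can be checked mechanically; the concrete numbers and the sample operation o2 below are arbitrary, chosen only to illustrate the equivalence.

    # The unreduced and the reduced value are the same element of GF(p),
    # so any subsequent arithmetic operation yields equivalent results.
    p = 101
    a_unreduced = 3 + 4 * p        # value of a before the modular reduction
    a_reduced = a_unreduced % p    # value of a after the modular reduction

    def o2(t):
        # stands for any arithmetic operation on representatives
        return 5 * t - 7

    assert (o2(a_unreduced) - o2(a_reduced)) % p == 0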

This method is successfully applied to the operation sequence for scalar multiplication. The delay slots in all the modular reductions, except the final one performed on the output datum, are filled with useful operations.

B. Distributing Registers Into Individual PEs

In the processor, the data produced by the systolic datapath are stored into a separate storage, and the data consumed by the datapath are loaded from the storage. In the systolic datapath, as shown in Fig. 4, two bits of a datum are produced or consumed by a PE in one cycle, and the more significant two bits are produced or consumed by the PE at the next stage in the next cycle, except for the multiplier operand of a multiplication. Additionally, many different operations may be executed in an overlapped manner, and each operation is performed inside a collection of consecutive PEs at any one time. Consequently, many reads and writes occur on many different bits of different data in parallel. The data transfer mode of the datapath is of bit granularity and of high concurrency. In contrast, the peer of the data transfer, the separate storage, can afford only a limited number of reads and writes of different words of different data at one time. Its data transfer mode is of word granularity and of low concurrency. This mismatched communication forces the datapath to wait for the storage until the required bits are available and the produced bits are saved.

In order to address this problem, we remove the separate storage and combine the datapath with the storage by distributing the registers used for storage into individual PEs. In the combined datapath and storage (abbreviated to CDS hereafter), the $i$th PE contains a set of two-bit registers that hold the $2i$th bits and the $(2i+1)$th bits of the full-precision registers. These bits are also produced and consumed by the computing logic in the $i$th PE. The computing logic in the PE and the corresponding two-bit registers are tightly connected. Each PE receives and decodes the register addresses, loads/stores the input/output data from/into the specified registers, and latches the register addresses for the PE at the next stage. In this way the data can be accessed by the PEs in time and in place. The data movement among registers also takes place inside the PEs of the CDS. The liabilities of this method are the enlargement of the hardware area and the increase of the critical path delay.
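As a toy behavioral model of the CDS idea (our own construction; the register count, class name, and method names are invented for illustration), each PE keeps only its two-bit slice of every full-precision register and forwards the decoded address to the next stage.

    class PE:
        # One stage of the combined datapath and storage (CDS).
        def __init__(self, index, num_regs):
            self.index = index
            self.slices = [0] * num_regs   # 2-bit slice of each register
            self.latched_addr = None       # address latched for the next PE

        def access(self, addr, write_bits=None):
            # Decode the address locally; read or write the 2-bit slice,
            # then latch the address so the PE at the next stage can act
            # on the more significant bits in the next cycle.
            if write_bits is not None:
                self.slices[addr] = write_bits & 0b11
            self.latched_addr = addr
            return self.slices[addr]

    # A full-precision register is the concatenation of one slice per PE:
    pes = [PE(i, num_regs=8) for i in range(130)]
    value = sum(pe.access(addr=3) << (2 * pe.index) for pe in pes)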

Note that the multiplier operand of multiplication must be pumped into the rightmost PE serially at the rate of one bit per cycle. This operand can be perfectly provided using only local connections. In the CDS, the bits of a datum spread in both time and space. Suppose that a multiplication is issued at cycle $u$. The $2i$th and $(2i+1)$th bits of the multiplier reside in the $i$th PE, and the $2i$th bit is required by the rightmost PE at cycle $u + 2i$. The register address of the multiplier passes one PE per cycle and thus reaches the $i$th PE at cycle $u + i$. Therefore, the $2i$th bit must be transferred from the $i$th PE to the 0th PE in $i$ cycles. Once a PE accepts the register address of the multiplier, it loads the two bits of the multiplier and passes them to the PE on its right. Then, these two bits are passed to the right at the rate of one PE per cycle. Once they reach the rightmost PE, the lower bit is consumed immediately and the higher bit is consumed at the next cycle. From the view of the $j$th PE ($0 < j \le i$), it transfers the $2i$th bit of the multiplier to the $(j-1)$th PE at cycle $u + 2i - j$. In each PE a two-bit register is required for shifting the multiplier to the right. A register which is used for division but is spare during multiplication is used for this purpose.
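The timing above can be sanity-checked with a tiny discrete simulation; the cycle bookkeeping below follows our reading of the schedule (address at PE i at cycle u + i, one hop to the right per cycle) and is not taken verbatim from the brief.

    def cycle_bit_reaches_pe0(u, i):
        # The register address of the multiplier reaches PE i at cycle u + i,
        # where the bit pair (2i, 2i+1) is loaded; the pair then shifts one
        # PE to the right per cycle until it reaches the rightmost PE.
        cycle, pe = u + i, i
        while pe > 0:
            cycle += 1
            pe -= 1
        return cycle

    # Bit 2i arrives exactly when the rightmost PE needs it: cycle u + 2i.
    u = 40
    assert all(cycle_bit_reaches_pe0(u, i) == u + 2 * i for i in range(130))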

IV. PERFORMANCE ESTIMATION AND COMPARISON

Putting the above ideas together, we developed an ECC processor that integrates 130 PEs. Elliptic curve points are represented and processed in affine coordinates. For scalar multiplication, the scalar is recoded into the nonadjacent form (NAF) from the LSB to the MSB. We synthesized the ECC processor using a 0.13-µm CMOS standard cell library. Table I compares our processor with some other ECC processors for general curves over GF(p). The sixth and the seventh columns show the cycle counts and the times for one scalar multiplication, where the scalar has the same bit length as $p$.
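The LSB-to-MSB NAF recoding mentioned here is the standard one; the following is a textbook implementation (not the brief's recoding circuit), producing signed digits in {-1, 0, 1} with no two adjacent nonzero digits.

    def naf(k):
        # Nonadjacent form of a positive scalar, least significant digit first.
        digits = []
        while k > 0:
            if k & 1:
                d = 2 - (k % 4)   # k mod 4 == 1 -> digit 1; == 3 -> digit -1
                k -= d
            else:
                d = 0
            digits.append(d)
            k //= 2
        return digits

    # Example: 7 = 8 - 1 recodes to [-1, 0, 0, 1] (LSB first).
    assert naf(7) == [-1, 0, 0, 1]
    assert sum(d << i for i, d in enumerate(naf(7))) == 7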

The processors proposed in [7], [11], [12] are based on nonsystolic multipliers. The processor in [7] employs an optimized 64 × 64-bit multiplier, and the reported result is typical of multiplier-based standard-cell designs. The design in [11] is an extension to a Sparc processor using a fully pipelined 64 × 64-bit multiplier. By pipelining the multiplier, this design improves the performance of the architecture presented in [7] at the expense of hardware area. The authors reported a 1.5 GHz operating frequency but did not give the details about the pipeline.


TABLE I
COMPARISON OF THE AREAS AND THE PERFORMANCES

Thus, the area is hard to estimate. The processor in [12] employs a full-word multiplier that is built on the embedded multipliers and the fast carry look-ahead logic on the adopted FPGAs. This processor is the fastest FPGA implementation of ECC over GF(p). To our knowledge, only one ECC processor for general curves over GF(p) based on a systolic arithmetic unit has been reported in the public literature so far. It is the processor described in [8], which employs a linear systolic array for Montgomery modular multiplication.

A performance comparison between our processor and the processor in [7] under a similar condition shows that our processor can run at a higher frequency but requires more cycles per computation due to the finely pipelined carry chain, and that in total our processor is 2.5 times faster. Although it is difficult to fairly compare our processor to the other three processors because of the use of different design methodologies, some observations make sense. The processors based on nonsystolic multipliers, except for that in [11], require fewer cycles than ours, but operate at accordingly lower frequencies and thus have lower overall performances. The processor in [11] is 4 times faster than ours. However, the area cost resulting from the pipelining and the effect of the full-custom design should be noted. The processor in [8] requires many more cycles than ours, which is primarily due to two factors. First, the authors did not develop a systolic modular divider or inverter, and thus the point addition and doubling formulae in projective coordinates were used, involving more modular multiplications. Second, the authors did not consider any system-level optimizations, and thus many pipeline stalls existed, wasting many cycles. These two issues are resolved successfully in our processor.

V. CONCLUSION

In this brief, we proposed a high-performance ECC processor for general curves over GF(p). GF(p) arithmetic operations are efficiently implemented in a unified systolic array. We delayed the modular reductions and distributed the registers into individual PEs such that the systolic array can be almost fully utilized in the processor. Synthesized in 0.13-µm standard-cell technology, the processor requires 1.01 ms for a 256-bit scalar multiplication. The processor can be inexpensively extended to support general curves over GF(2^m). The implementation of the proposed processor does not rely on any specific property of the parameters of the elliptic curves or the finite fields, and the processor is thereby suitable for server-side applications that demand high flexibility and high performance.

REFERENCES

[1] N. Koblitz, “Elliptic curve cryptosystems,” Math. Comput., vol. 48, no. 177, pp. 203–209, Jan. 1987.

[2] V. Miller, “Uses of elliptic curves in cryptography,” in Proc. Advances in Cryptology (CRYPTO ’85), 1985, pp. 417–426.

[3] S. Sutikno, R. Effendi, and A. Surya, “Design and implementation of arithmetic processor $\mathbb{F}_{2^{155}}$ for elliptic curve cryptosystems,” in Proc. IEEE Asia Pacific Conf. Circuits Syst. (APCCAS ’98), Nov. 1998, pp. 647–650.

[4] G. Orlando and C. Paar, “A high-performance reconfigurable elliptic curve processor for GF(2^m),” in Proc. Cryptographic Hardware and Embedded Systems (CHES ’00), 2000, pp. 41–56.

[5] J. Goodman and A. P. Chandrakasan, “An energy-efficient reconfigurable public-key cryptography processor,” IEEE J. Solid-State Circuits, vol. 36, no. 11, pp. 1808–1820, Nov. 2001.

[6] P. H. Leong and I. K. Leung, “A microcoded elliptic curve processor using FPGA technology,” IEEE Trans. VLSI Syst., vol. 10, no. 5, pp. 550–559, Oct. 2002.

[7] A. Satoh and K. Takano, “A scalable dual-field elliptic curve cryptographic processor,” IEEE Trans. Comput., vol. 52, no. 4, pp. 449–460, Apr. 2003.

[8] S. B. Örs, L. Batina, B. Preneel, and J. Vandewalle, “Hardware implementation of an elliptic curve processor over GF(p),” in Proc. 14th IEEE Int. Conf. Application-Specific Systems, Architectures, and Processors (ASAP ’03), Jun. 2003, pp. 433–443.

[9] E. Öztürk, B. Sunar, and E. Savaş, “Low-power elliptic curve cryptography using scaled modular arithmetic,” in Proc. Cryptographic Hardware and Embedded Systems (CHES ’04), Aug. 2004, pp. 92–106.

[10] A. K. Daneshbeh and M. A. Hasan, “Area efficient high speed elliptic curve cryptoprocessor for random curves,” in Proc. Int. Conf. Information Technology: Coding and Computing (ITCC ’04), Apr. 2004, pp. 588–592.

[11] H. Eberle, S. Shantz, V. Gupta, N. Gura, L. Rarick, and L. Spracklen, “Accelerating next-generation public-key cryptosystems on general-purpose CPUs,” IEEE Micro, vol. 25, no. 2, pp. 52–59, Mar.–Apr. 2005.

[12] C. J. McIvor, M. McLoone, and J. V. McCanny, “Hardware elliptic curve cryptographic processor over GF(p),” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 53, no. 9, pp. 1946–1957, Sep. 2006.

[13] P. L. Montgomery, “Modular multiplication without trial division,” Math. Comput., vol. 44, no. 170, pp. 519–521, Apr. 1985.

[14] G. Chen, G. Bai, and H. Chen, “A new systolic architecture for modular division,” IEEE Trans. Comput., vol. 56, no. 2, pp. 282–286, Feb. 2007.