266 IEEE TRANSACTIONS ON COMPUTERS, VOL. …fayez/research/gf3m/beuchat_j11.pdfIndex Terms—Tate pairing, T pairing, elliptic curve, finite field arithmetic, Karatsuba multiplier,

Fast Architectures for the �TPairing over Small-Characteristic

Supersingular Elliptic CurvesJean-Luc Beuchat, Jeremie Detrey, Nicolas Estibals,

Eiji Okamoto, Member, IEEE, and Francisco Rodrıguez-Henrıquez

Abstract—This paper is devoted to the design of fast parallel accelerators for the cryptographic �T pairing on supersingular elliptic

curves over finite fields of characteristics two and three. We propose here a novel hardware implementation of Miller’s algorithm based

on a parallel pipelined Karatsuba multiplier. After a short description of the strategies that we considered to design our multiplier, we

point out the intrinsic parallelism of Miller’s loop and outline the architecture of coprocessors for the �T pairing over F2m and F3m .

Thanks to a careful choice of algorithms for the tower field arithmetic associated with the �T pairing, we manage to keep the pipelined

multiplier at the heart of each coprocessor busy. A final exponentiation is still required to obtain a unique value, which is desirable in

most cryptographic protocols. We supplement our pairing accelerators with a coprocessor responsible for this task. An improved

exponentiation algorithm allows us to save hardware resources. According to our place-and-route results on Xilinx FPGAs, our designs

improve both the computation time and the area–time trade-off compared to previously published coprocessors.

Index Terms—Tate pairing, �T pairing, elliptic curve, finite field arithmetic, Karatsuba multiplier, hardware accelerator, FPGA.

Ç

1 INTRODUCTION

IN 2000, Mitsunari et al. [36], Sakai et al. [41], and Joux [25]independently showed how to use bilinear pairings

defined over algebraic curves to solve cryptographicproblems of long standing. This discovery ignited anintensive research that, until today, has produced animpressive number of pairing-based cryptographic protocolproposals [13]. Practice has shown that one of the mostefficient options to compute bilinear pairings is to resort tothe Tate pairing operating on supersingular elliptic curvesof low embedding degrees.

Back in 1986, Miller [33], [34] presented an iterative

algorithm that can be adapted to compute the Tate pairing

with linear complexity with respect to the size of the input.

Since then, significant improvements of Miller’s algorithm

were independently proposed in 2002 by Barreto et al. [4]

and Galbraith et al. [17]. One year later, Duursma and Lee

presented a radix-3 variant of Miller’s algorithm especially

targeted at the case of characteristic three [14]. In 2004,Barreto et al. [3] introduced the �T approach, which furthershortens the loop of Miller’s algorithm. More recently,Hess et al. generalized these results to ordinary curves[22], [23], [46].

We extend here the work presented in [10] and proposenovel hardware architectures for computing the �T pairingover binary and ternary fields based on parallel pipelinedKaratsuba multipliers and enhanced unified arithmeticoperators. We stress that the modified Tate pairing can bedirectly computed from the reduced �T pairing at almost noextra cost [7]. Our hardware accelerators are able tocompute the �T pairing operating on supersingular ellipticcurves defined over F2691 and F3313 in just 18.8 and 16:9 �s,respectively (Table 5). We note that these field sizes enjoyan associated security equivalent to that of 105- and 109-bitsymmetric-key cryptosystems, respectively (Table 4).

The main strategies considered to design our parallelpipelined multiplier are described in Section 2. They areincluded in a VHDL code generator that allows us toexperiment on a wide range of operators as well as a varietyof design parameters. Thanks to a judicious choice ofalgorithms for performing tower field arithmetic and acareful analysis of the scheduling, we managed to keep ourpipelined units always busy. This allows us to compute oneiteration of Miller’s algorithm over ternary and binary fieldsin only 17 and 7 clock cycles, respectively (Sections 3 and 4).We summarize the results obtained from our FPGAimplementation and provide the reader with a thoroughcomparison against previously published coprocessors inSection 5.

For the sake of concision, we are forced to skip thedescription of many important concepts of elliptic curve

266 IEEE TRANSACTIONS ON COMPUTERS, VOL. 60, NO. 2, FEBRUARY 2011

. J.-L. Beuchat and E. Okamoto are with the Graduate School of Systems andInformation Engineering, University of Tsukuba, 1-1-1 Tennodai,Tsukuba, Ibaraki 305-8573, Japan.E-mail: [email protected], [email protected].

. J. Detrey and N. Estibals are with the CARAMEL Project-Team, INRIANancy–Grand Est, Batiment A, 615 rue du Jardin Botanique, 54602Villers-les-Nancy Cedex, France.E-mail: {jeremie.detrey, nicolas.estibals}@loria.fr.

. F. Rodrıguez-Henrıquez is with the Computer Science Department, Centrode Investigacion y de Estudios Avanzados del IPN, Av. InstitutoPolitecnico Nacional No. 2508, 07300 Mexico City, Mexico.E-mail: [email protected].

Manuscript received 19 Aug. 2009; revised 27 Mar. 2010; accepted 19 Apr.2010; published online 24 June 2010.Recommended for acceptance by J. Bruguera, M. Cornea, and D. Das Sarma.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference IEEECS Log Number TCSI-2009-08-0395.Digital Object Identifier no. 10.1109/TC.2010.163.

0018-9340/11/$26.00 � 2011 IEEE Published by the IEEE Computer Society

theory. We suggest the interested reader to review [44], [47]

for an in-depth coverage of this topic.

2 PARALLEL KARATSUBA MULTIPLIERS

Before delving into the specifics of our pairing coprocessor

architectures, we first detail here the Karatsuba multipliers

on which they extensively rely.We define the p-ary extension field Fpm as Fp½x�=ðfðxÞÞ,

where f is an irreducible degree-m monic polynomial over

Fp. The product of two arbitrary elements of Fpm repre-

sented as p-ary polynomials of degree at most m� 1 is

computed as the polynomial multiplication of the two

elements modulo f . Carefully selecting an irreducible

polynomial with low Hamming weight (i.e., trinomial,

tetranomial, etc.) and low subdegree allows for a simple

modular reduction step.In this work, due to its subquadratic space complexity,

we opted for a variant of the classical Karatsuba multiplier

to implement the polynomial product, while a few extra

adders and subtracters over Fp are dedicated to performing

the final reduction modulo f .

2.1 Variations on the Karatsuba Algorithm

The Karatsuba multiplication [27] is based on the observa-

tion that the polynomial product c ¼ a � b, for a and b 2 Fpm ,

can be computed as

c ¼ aLbL þ ½ðaH þ aLÞðbL þ bHÞ � ðaHbH þ aLbLÞ�xn

þ aHbHx2n;

where n ¼ dm2e, a ¼ aL þ xnaH , and b ¼ bL þ xnbH .Note that since we are working with polynomials, there

is no carry propagation. This allows one to split the

operands in a slightly different way: for instance, Hanrot

and Zimmermann [21] suggested to split them into odd- and

even-degree parts. It was adapted to multiplication over F2m

by Fan et al. [15]. Since there is no overlap between the odd

and even parts at the reconstruction step, this different

method of splitting saves approximately m additions over

Fp during the reconstruction of the product.Another natural way to generalize the Karatsuba multi-

plication is to split the operands into three or more parts, in

a classical way (i.e., splitting each operand into contiguous

parts from the lowest to the highest powers of x) or using a

generalized odd/even split (i.e., according to the degree

modulo the number of split parts). By applying this strategy

recursively, in each iteration, each polynomial multiplica-

tion is transformed into three or more subproducts of

smaller degree, until all the polynomial operands are

reduced to single coefficients. Nevertheless, practice has

shown that it is better to prune the recursion earlier,

performing the lowest-level multiplications using alterna-

tive techniques that are more compact and/or faster for

low-degree operands, such as the so-called schoolbook

method with quadratic complexity, which has been selected

for this work.

2.2 A Pipelined Architecture for the KaratsubaMultiplier

We pipelined our multiplier architecture by means ofoptional registers inserted between the computations ofthe required subproducts, where the depth of the pipelinecan be adjusted according to the complexity of theapplication at hand. This approach allows us to split thecritical path of the whole multiplier structure and, therefore,increase its operating frequency.

In order to study a wide range of implementationstrategies, we wrote a VHDL code generator, which auto-matically produces the description of different variants ofKaratsuba multipliers according to several parameters (fieldextension degree, irreducible polynomial, splitting method,etc.). Our automatic tool was extremely useful for selectingthe operator that showed the highest clock frequency, thesmallest area, or a good trade-off between them.

3 REDUCED �T PAIRING IN CHARACTERISTIC THREE

In the following, we consider the computation of thereduced �T pairing in characteristic three. Table 1 sum-marizes the parameters of the algorithm and of the super-singular curves. We refer the reader to [3], [8] for moredetails about the computation of the �T pairing. Recall that afinal exponentiation is required to obtain a unique value,which is desirable in the context of cryptographic protocols.As pointed out by Beuchat et al. [9], the computations of thenonreduced pairing (i.e., Miller’s algorithm) and of the finalexponentiation do not share the same datapath, and it seemsjudicious to pipeline these two tasks using two distinctcoprocessors in order to reduce the computation time andincrease the throughput.

3.1 Computation of Miller’s Algorithm

We rewrote in Algorithm 1 the reversed-loop algorithm incharacteristic three described in [8], denoting each iterationwith parenthesized indices in superscript in order to

BEUCHAT ET AL.: FAST ARCHITECTURES FOR THE �T PAIRING OVER SMALL-CHARACTERISTIC SUPERSINGULAR ELLIPTIC CURVES 267

emphasize the intrinsic parallelism of the �T pairing. Ateach iteration of Miller’s algorithm, two tasks are performedin parallel, namely, a sparse multiplication over F36m (lines 6and 7) and the computation of the coefficients for the nextsparse operation (lines 8–10). We say that an operand inF36m is sparse when some of its coefficients are trivial (i.e.,either zero, one, or minus one).

3.1.1 Sparse Multiplication over F36m

The intermediate result Rði�1Þ is multiplied by the sparseoperand SðiÞ (Algorithm 1, lines 6 and 7). This operation iseasier than a standard multiplication over F36m , but thechoice of the sparse multiplication algorithm requires carefulattention. Bertoni et al. [6] and Gorla et al. [18] tookadvantage of Karatsuba multiplication and Lagrange inter-polation, respectively, to reduce the number of multiplica-tions over F3m at the expense of several additions. (Note thatGorla et al. study standard multiplication over F36m in [18],but extending their approach to sparse multiplication isstraightforward.) In order to keep the pipeline of a Karatsubamultiplier busy, we would have to embed in our processor alarge multioperand adder (up to 12 operands for the schemeproposed by Gorla et al.) and several multiplexers to dealwith the irregular datapath. This would negatively impactthe area and the clock frequency, and we prefer consideringthe algorithm discussed by Beuchat et al. in [11] which gives

a better trade-off between the number of multiplications andadditions over the underlying field (Algorithm 2): it involves17 multiplications and 29 additions over F3m to compute SðiÞ

and Rði�1Þ � SðiÞ.We suggest to take advantage of a parallel Karatsuba

multiplier with seven pipeline stages to implement Miller’salgorithm. Since the algorithm we selected for sparsemultiplication over F36m requires at most the addition offour elements of F3m , it suffices to complement the multiplierwith a four-operand adder to compute s

ðiÞ3 , a

ðiÞj , and r

ðiÞj ,

0 � j � 5, as shown in Fig. 1. The second loop of Algorithm 2(lines 13–16) requires a small amount of additional hard-ware. Since the first two multiplications of the loop involverði�1Þ2j and r

ði�1Þ2jþ1 , respectively, we compute s

ðiÞj on-the-fly by

means of an accumulator. Furthermore, it seems convenientto store p

ðiÞ6 , p

ðiÞ7 , and s

ðiÞ3 in a circular shift register.

We managed to find a scheduling that allows us to start anew multiplication over F3m at each clock cycle, thuskeeping the pipeline busy and computing an iteration ofMiller’s algorithm in 17 clock cycles, as depicted in Fig. 2. Itis worth noticing that the cost of additions over F3m ishidden and the number of clock cycles depends only on theamount of multiplications over F3m . We easily identify fivedatapaths (denoted by the numerals to in Figs. 1 and 2)between the output of the four-operand adder and theinputs of the parallel multiplier.


TABLE 1Supersingular Curves over F3m and F2m

Specific attention is needed to design the register file

storing the coefficients of RðiÞ and the intermediate

variables aðiÞj , 0 � j � 6, of the sparse multiplication algo-

rithm. According to our scheduling scheme, we have to

read simultaneously up to three variables from the register

file. Thus, we decided to implement it by means of two

blocks of Dual-Ported RAM (DPRAM):

. The first one is connected to input M0 of the parallelmultiplier and input A0 of the four-operand adder,and stores the coefficients of RðiÞ.

. According to our scheduling (Fig. 2), the second

DPRAM block provides the four-operand adder

with its fourth input, namely, aðiÞj , 0 � j � 5, and

rði�1Þj , 0 � j � 2.

3.1.2 Computation of the Sparse Operand

The second task consists in computing the coefficients of the

sparse operand Sðiþ1Þ required for the next iteration of

Miller’s algorithm (Algorithm 1, lines 8–10). Two cubings

and an addition over F3m allow us to update the coordinates

of point P and to determine the coefficient tðiþ1Þ of the sparse

operand Sðiþ1Þ, respectively.

Recall that the �T pairing over F3m comes in two flavors:

the original one involves a cubing over F36m after each

sparse multiplication. Barreto et al. [3] explained how to get

rid of that cubing at the price of two cube roots over F3m to

update the coordinates of point Q. It is essential to consider

such an algorithm here, as an extra cubing over F36m would

put even more strain on the first task (which is already the

most expensive one). According to our results, the critical

path of the circuit is never located in a cube root operator

when pairing-friendly irreducible trinomials or pentano-

mials [2], [20] are used to define F3m . If, by any chance, such

polynomials are not available for the considered extension

of F3 and the critical path is in the cube root, it is always

possible to pipeline this operation. Therefore, the cost of

cube roots is hidden by the first task.The hardware implementation is rather straightforward

(Fig. 1): four registers, a cubing operator, and a cube root

operator allow us to store and update the coordinates of

points P and Q. Then, a two-operand adder computes the

sum of xðiÞP and x

ðiÞQ , and the result tðiÞ is memorized in a fifth

register. Multiplexers select the inputs of the parallel

multiplier according to our scheduling.

3.1.3 Initialization

The initialization step of the �T pairing (Algorithm 1, lines 1

and 2) involves a small amount of specific hardware in

order to compute xð0ÞP , y

ð0ÞP , x

ð0ÞQ , and y

ð0ÞQ . Note that we are

able to send tðiÞ, �yðiÞP , and ��yðiÞQ to input M1 of the parallel

multiplier (Fig. 1). Assuming that the constants 0, 1, and 2

are stored in the DPRAM block connected to input M0, we

compute the coefficients of Rð�1Þ by means of six multi-

plications over F3m :

rð�1Þ0 ¼ tð0Þ � �yð0ÞP ; r

ð�1Þ3 ¼ 0 � tð0Þ ¼ 0;

rð�1Þ1 ¼ 1 �

�� yð0ÞQ

�; r

ð�1Þ4 ¼ 0 � tð0Þ ¼ 0; and

rð�1Þ2 ¼ 2 � �yð0ÞP ¼ ��y

ð0ÞP ; r

ð�1Þ5 ¼ 0 � tð0Þ ¼ 0:

Loading the coordinates of points P and Q and performing

the initialization step involve 17 clock cycles (i.e., exactly the

same number of clock cycles as an iteration of Miller’s

algorithm). Therefore, our coprocessor returns Rððm�1Þ=2Þ

after 17 � ðmþ 3Þ=2 clock cycles.

3.2 Final Exponentiation

The second and last stage in the computation of the �T pairing

is the final exponentiation, where the result of Miller’s

algorithm Rððm�1Þ=2Þ ¼ �T ðP;QÞ is raised to the Mth power

(Algorithm 1, line 12). This exponentiation is necessary since

the nonreduced pairing �T ðP;QÞ is only defined up to

Nth powers in F�36m .

3.2.1 Improved Algorithm

In order to compute this final exponentiation, we use the

algorithm presented by Beuchat et al. [8]. This method

exploits the special form of the exponent M (see Table 1)

to achieve better performances than with a classical

square-and-multiply algorithm. Among other computa-

tions, this final exponentiation involves the raising of an

element of F�36m to the power of 3ðmþ1Þ=2, which Beuchat

et al. [8] perform by computing ðmþ 1Þ=2 successive

cubings over F�36m . Since each of these cubings requires six


cubings and six additions over F3m , the total cost of this

step is 3mþ 3 cubings and 3mþ 3 additions.

We present here a new method for computing U3ðmþ1Þ=2

for U ¼ u0 þ u1�þ u2�þ u3��þ u4�2 þ u5��

2 2 F�36m by ex-

ploiting the linearity of the Frobenius map (i.e., cubing

in characteristic three) to reduce the number of addi-

tions. Indeed, noting that �3i ¼ ð�1Þi�, �3i ¼ �þ ib, and

ð�2Þ3i

¼�2 � ib�þ i2, we obtain the following formula for

U3i , depending on the value of i:

U3i ¼ ðu0 � �1u2 þ �2u4Þ3i

þ �3ðu1 � �1u3 þ �2u5Þ3i

�

þ ðu2 þ �1u4Þ3i

� þ �3ðu3 þ �1u5Þ3i

��þ u3i

4 �2 þ �3u3i

5 ��2;


Fig. 1. Coprocessor for the �T pairing in characteristic three. (N.B. All control bits ci belong to f0; 1g.)

with �1 ¼ �ibmod 3, �2 ¼ i2 mod 3, and �3 ¼ ð�1Þi. Thus,according to the value of ðmþ 1Þ=2 modulo 6, thecomputation of U3ðmþ1Þ=2

will still require 3mþ 3 cubingsbut at most only six additions or subtractions over F3m .

3.2.2 Hardware Implementation

Our first attempt at computing the final exponentiation wasto use the unified arithmetic operator introduced by Beuchatet al. [8]. Unfortunately, due to the sequential schedulinginherent to this operator, it turned out that the finalexponentiation algorithm required more clock cycles thanthe computation of Miller’s algorithm by our coprocessor.We, therefore, had to consider a slightly more parallelarchitecture.

Noticing that the critical operations in the final exponen-tiation algorithm were multiplications and long sequences ofcubings over F3m , we designed the coprocessor for arithmeticover F3m depicted in Fig. 3. Besides a register file imple-mented by means of DPRAM, our coprocessor embeds aparallel–serial multiplier [45] processing D coefficients of anoperand at each clock cycle (typically, D ¼ 13 or 14), alongwith a novel unified operator supporting addition, subtrac-tion, accumulation, Frobenius map (i.e., cubing), and doubleFrobenius map (i.e., raising to the ninth power). Thisarchitecture allowed us to efficiently implement the finalexponentiation algorithm described, for instance, in [8],while taking advantage of the improvement proposed above.

3.2.3 Using Inverse Frobenius Maps

We adapt here the idea behind the square-root-and-multi-

ply algorithm for exponentiation over binary finite fields

given by Rodrıguez-Henrıquez et al. in [37].From the final exponentiation algorithm given in [8], it

can be noticed that Frobenius maps over F3m (i.e., cubings)are needed only to perform an inversion over F�3m [8, Algo-rithms 10 and 11] and for raising an element of F�36m to thepower of 3ðmþ1Þ=2, as already discussed in Section 3.2.1.

As far as the inversion is concerned, note that the first

two lines of [8, Algorithm 10] are dedicated to computing

u ¼ dð3m�3Þ=2 thanks to successive cubings and multi-

plications, where d 2 F�3m is the element to be inverted. It

is actually also possible to compute that same element by

means of only cube roots and multiplications by making

use of the identityffiffiffiz3ip ¼ z3m�i for all z 2 F3m , derived

from Fermat’s little theorem. Indeed, considering the

family of elements ðwiÞ0�i<m such that wi ¼ dð3m�3m�iÞ=2,

we note that w1 ¼ d3m�1 ¼ffiffiffid3p

and u ¼ dð3m�3Þ=2 ¼ wm�1.

Since, for any two integers n and n0, we also have

wnþn0 ¼ dð3m�3m�nÞ=2 � dð3m�n�3m�n�n

0 Þ=2 ¼ wn �ffiffiffiffiffiffiffiwn03np

;

it follows that, given an addition chain of length l for

m� 1, we can compute u ¼ wm�1 in l multiplications and

m� 1 cube roots over F3m . This has to be compared to


Fig. 2. Scheduling of Miller’s algorithm in characteristic three. (N.B. The numerals �1 to �5 refer to the datapaths between the output of the four-

operand adder and the inputs of the parallel multiplier in Fig. 1.)

the l multiplications and m� 1 cubings required in[8, Algorithm 10] to obtain the same u.

As for the raising of an element of F�36m to the power of

3ðmþ1Þ=2, also part of the final exponentiation algorithm, we

simply apply Fermat’s little theorem once more to see that

z3ðmþ1Þ=2 ¼ ffiffiffiz3ðm�1Þ=2p

for all z 2 F3m . Thus, we can directly trade

the 3mþ 3 required cubings (as explained in the analysis

given in Section 3.2.1) for 3m� 3 cube roots over F3m .

Hence, from the previous considerations, it is possible

to replace all Frobenius maps (cubings) by inverse

Frobenius maps (cube roots) in the final exponentiation.

This is particularly interesting since the irreducible

polynomial used to represent that F3m was carefully

chosen to allow for low-complexity cubings and cube

roots, as both are required for the computation of the

nonreduced �T pairing. Furthermore, it appears that for

the considered irreducible trinomials, the complexity of the

cube root is always lower than that of the cubing. This is

shown in Table 2, where the third column reports the total

number of required additions/subtractions over F3, and

the fourth column indicates the largest number of elements

of F3 that need to be added/subtracted to one another to

compute a coefficient of the result (for instance, cubing

over F397 requires summing at most four elements of F3 at

a time).In order to assess the impact of replacing the Frobenius

and double-Frobenius operators by inverse-Frobenius (cuberoot) and double-inverse-Frobenius (ninth root) operators inthe architecture presented in Fig. 3, we implemented thedifferent variants on Xilinx Virtex-4 LX FPGAs (xc4vlx40-11).The place-and-route results, reported in the last two columnsof Table 2, show that the use of cube roots usually shortensthe critical path, even though the circuits are then slightlylarger, as the cubing formulas generally involve morecommon subexpressions which can then share the samelogic and decrease the total resource usage. All in all, itappears that relying on inverse Frobenius maps to computethe final exponentiation is by and large an effectiveoptimization.

4 REDUCED �T PAIRING IN CHARACTERISTIC TWO

An approach similar to that of characteristic three allowedus to design a parallel coprocessor for the reduced


Fig. 3. Coprocessor for the final exponentiation of the �T pairing in characteristic three. (N.B. All control bits ci belong to f0; 1g.)

�T pairing in characteristic two, as depicted in Fig. 4. Thesupersingular curves and the parameters of the algorithmare summarized in Table 1.

4.1 Computation of Miller’s Algorithm

Applying the strategy that we used for characteristic threeto the case of characteristic two, we adopted the reversed-loop algorithm described in [7], which we recall here inAlgorithm 3. However, the scheduling turns out to beslightly more difficult than in characteristic three since wehave to perform three tasks in parallel at each iteration ofMiller’s algorithm.

4.1.1 Sparse Multiplication over F24m

The intermediate result F ði�1Þ computed during the

previous iteration is multiplied by the sparse operand

GðiÞ ¼ gðiÞ0 þ gðiÞ1 sþ t by means of a parallel Karatsuba

multiplier (Algorithm 3, lines 13 and 14). This operation is

easier than the standard multiplication over F24m and

requires only 6 multiplications and 14 additions over the

base field F2m , as explained in Algorithm 4. Thanks to the

careful scheduling given in Fig. 5, the intermediate

variables aðiÞ0 , a

ðiÞ1 , and a

ðiÞ2 (Algorithm 4, lines 1 and 2) are

generated on-the-fly by means of two accumulators. (Note

that in order to share the control bits c15 and c16 between

both accumulators, we compute aðiÞ0 twice at each iteration

of Miller’s algorithm.) The addition of at most four

operands belonging to F2m allows for computing fðiÞj , 0 �

j � 3 (Algorithm 4, lines 6–9).

It is worth mentioning that the datapath between the

output of the four-operand adder and the parallel multiplier

is much simpler than in characteristic three: it suffices to

delay fðiÞj , 0 � j � 3, by one clock cycle and there is,

therefore, no need for a memory block to store the operands

of the multiplier. Dealing with inputs A0 and A1 of the four-

operand adder is unfortunately more difficult because of

data dependencies between the coefficients of F ði�1Þ and F ðiÞ

in Algorithm 4. According to our scheduling, we update

the coefficients of F ðiÞ in the following order: fðiÞ0 , f

ðiÞ1 , f

ðiÞ2 ,

and, eventually, fðiÞ3 . Since f

ðiÞ2 and f

ðiÞ3 depend onf

ði�1Þ0 and

fði�1Þ1 , respectively, we have to keep a copy of those values

until the end of the ith iteration of Miller’s algorithm.

Instead of including a DPRAM block in our design, we

propose a solution based on two small FIFOs (see Fig. 6 for

details). An advantage of characteristic two over character-

istic three is that the register file is smaller in terms of circuit

area and requires fewer control bits.

4.1.2 Computation of g0 and g1

Both coefficients gðiþ1Þ0 and g

ðiþ1Þ1 (Algorithm 3, lines 15 and

16) are involved in the next sparse multiplication and,

thus, have to be computed beforehand. The product uðiþ1Þ �vðiþ1Þ is evaluated by means of our parallel Karatsuba

multiplier. Then, two-operand adders allow for working

out gðiþ1Þ0 and g

ðiþ1Þ1 .

4.1.3 Computation of u, v, and w

The three values uðiþ2Þ, vðiþ2Þ, and wðiþ2Þ (Algorithm 3, lines

17–20) will be required to calculate gðiþ2Þ0 and g

ðiþ2Þ1 during

the next iteration of Miller’s algorithm. Here, we have to

work with the coordinates of points P and Q that are stored

in a FIFO and updated by means of squaring and square

root operators, respectively (see Section 4.1.6). Then, an

accumulator allows for computing wðiþ2Þ in two clock

cycles. Depending on the values of � and , 1-bit adders

(i.e., XOR gates) are necessary to calculate the least

significant bit of uðiþ2Þ, vðiþ2Þ, and wðiþ2Þ.

4.1.4 Choosing the Adequate Karatsuba Multiplier

In these settings, a Karatsuba multiplier with five pipeline

stages can be kept busy during the computation of the main

loop, as shown in Fig. 5. Since we have to carry out seven

multiplications over F2m at each iteration, the calculation

time for the full loop is equal to 7 � ðmþ 1Þ=2 clock cycles. It is

again crucial to consider an algorithm with inverse Frobe-

nius maps (i.e., square roots) in order to avoid squaring F ðiÞ

at each iteration of Miller’s algorithm (see, for instance, [7] for

a survey of algorithms for the Tate pairing over super-

singular curves in characteristic two). Such an operation

would lengthen the computation time and pipeline bubbles

would be inserted in the multiplier.


TABLE 2Frobenius versus Inverse Frobenius Maps in Characteristic

Three and Their Influence on the Coprocessor of Fig. 3

4.1.5 Initialization

The initialization step requires specific attention. In order to

start multiplying uð0Þ by vð0Þ as soon as possible (Algorithm 3,

line 5), we load the coordinates of points P and Q in the

following order: xP , xQ, yP , and yQ. Thus, uð0Þ and vð0Þ

are available after two clock cycles. Thanks to this scheduling,

we complete the initialization step in 15 clock cycles.

4.1.6 Irreducible Pentanomials Suitable for Low-

Complexity Square Root Computation

Although irreducible trinomials allowing for simple compu-tations of squarings and square roots exist in some binaryfinite fields, as detailed in [37], this was not the case forseveral fields considered in this work (see Table 3). To tacklethis issue, we present here a novel family of irreducible


Fig. 4. Coprocessor for the �T pairing in characteristic two. (N.B. All control bits ci belong to f0; 1g.)

square-root-friendly pentanomials that, to the best of our

knowledge, has not been proposed before in the literature.Let m and d be two odd positive integers with d < m=2

and such that the degree-m monic polynomial

fðxÞ ¼ xm þ xm�d þ xm�2d þ xd þ 1

is irreducible over F2. We then represent the binary

extension field F2m as F2½x�=ðfðxÞÞ.Note that reducing modulo f , we have xmþ2dþ1 þ xmþ1 þ

xm�2dþ1 þ x3dþ1 ¼ x, where all the exponents on the left-

hand side are even. It then follows that

ffiffiffixp¼ xmþ2dþ1

2 þ xmþ12 þ xm�2dþ1

2 þ x3dþ12 :

Therefore, using this expression forffiffiffixp

, we can compute the

square root of an element a 2 F2m as [16]

ffiffiffiap¼Xm�1

2

i¼0

a2ixi þ

ffiffiffixp Xm�3

2

i¼0

a2iþ1xi mod fðxÞ:

Furthermore, one can show that if ð2m� 1Þ=7 � d �ð2mþ 1Þ=5, then the complete computation of the square

root (i.e., including the reduction modulo f) will require the

addition of at most three elements of F2 at a time for each

coefficient of the result. And finally, choosing d � ðm� 1Þ=6ensures that a squaring will involve only additions of at

most four operands.As reported in Table 3, three pentanomials of this family

have been selected to represent the finite fields F2557 , F2613 ,

and F2691 , with d ¼ 197, 185, and 243, respectively.

4.2 Final Exponentiation

The final exponentiation (Algorithm 3, line 22) is carried outaccording to the algorithms proposed in [7] and [42], [43] for ¼ 1 and ¼ �1, respectively. We took advantage of thealgorithm introduced in [12] when raising to the ð2m þ 1Þstpower over F24m . Here, again, the linearity of the Frobeniusmap allows us to reduce the number of additions whencomputing U2ðmþ1Þ=2

for U ¼ u0 þ u1sþ u2t þ u3st 2 F�24m .Noting that s2i ¼ sþ �1 and t2

i ¼ tþ �1sþ �2, where �1 ¼imod 2 and �2 ¼ bi2cmod 2, we obtain the following formulafor U2i , depending on the value of i modulo 4:

U2i ¼ ðu0 þ �1u1 þ �2u2 þ �3u3Þ2i

þ ðu1 þ �1u2 þ �2u3Þ2i

s

þ ðu2 þ �1u3Þ2i

tþ u2i

3 st;

where �3 ¼ 1 when imod 4 ¼ 1, and �3 ¼ 0 otherwise.

According to the value of ðmþ 1Þ=2 mod 4, the computation

of U2mþ1

2 requires 2mþ 2 squarings and at most four

additions over F2m .Here, again, a hybrid arithmetic operator, similar to the

one used in the case of characteristic three (see Fig. 3),allows us to perform the final exponentiation in slightly lessclock cycles than Miller’s algorithm without impactingtoo much on the resource usage. The architecture is verysimilar to that of characteristic three, except that weremoved the multiplications by 1 and �1, which are uselessin characteristic two, and replaced the double cubing by atriple-squaring operator, to accommodate for the longerchains of successive Frobenius maps in the final exponen-tiation algorithm. The parallel–serial multiplier processeshere between D ¼ 15 and D ¼ 17 coefficients of its secondoperand per clock cycle.

It is worth noting that the trick of trading Frobenius forinverse Frobenius maps used for the final exponentiationin characteristic three can also be applied to the case ofcharacteristic two. Indeed, as reported in Table 3, thecomplexity of the square root is always lower than that ofthe squaring over the considered finite fields.

However, putting this optimization into practice hap-pens to be more complex than in characteristic three. Apartfrom the Frobenius maps required for the inversion overF�2m and for the raising to the power of 2ðmþ1Þ=2 over F�24m ,further squarings are actually necessary to raise elementsof F�24m to the ð22m � 1Þst and ð2m þ 1Þst powers, as per[7, Algorithm 3] and [12, Algorithm 3], respectively.

These few squarings could be replaced by actualmultiplications, which would then slightly increase thenumber of clock cycles required to compute the finalexponentiation. Alternatively, we could take the fourth rootof the result, which by linearity of the square root wouldthen cancel all the extra Frobenius maps, but we wouldthen end up computing a fixed power of the �T pairing andnot the �T pairing itself.

However, observing that in characteristic two, the criticalpath of the whole pairing accelerator lies in the nonreducedpairing coprocessor and not in the final exponentiation one,this is actually a moot point as there is no use trying toshorten further the critical path in the final exponentiation.


Fig. 5. Scheduling of Miller’s algorithm in characteristic two.

We, therefore, decided against using this optimization

altogether in the case of characteristic two.

5 RESULTS AND COMPARISONS

5.1 Comparison with Previous Works

Thanks to our automatic VHDL code generator, we

designed several versions of the proposed architectures

and prototyped our coprocessors on Xilinx Virtex-II Pro and

Virtex-4 LX FPGAs with average speedgrade. Table 4 details

the specifics of the considered supersingular curves, while

Table 5 provides the reader with a comparison between our

work and accelerators for the Tate and �T pairings over

supersingular (hyper)elliptic curves published in the open

literature. (Note that our comparison remains fair since the

Tate pairing can be computed from the �T pairing at no

extra cost [7].) Finally, these results are summarized in

Fig. 7, where post-place-and-route computation time and

area–time product estimations are plotted against the

achieved level of security.

In the presented benchmarks, the logic resource usage is

given in terms of slices, which is the usual metric on Xilinx

FPGAs. Each slice comprises two four-input lookup tables

and two 1-bit flip-flops. Furthermore, it is worth noting

that even though our coprocessors also make use of some

embedded memory blocks as register files, they are by far

not a critical resource and are, therefore, not reported in

the benchmarks.

Our architectures are also much faster than software

implementations. Mitsunari wrote a very careful multi-

threaded implementation of the �T pairing over F397 and

F3193 [35]. He reported a computation time of 92 and 553 �s,

respectively, on an Intel Core 2 Duo processor (2.66 GHz).


Fig. 6. Contents of the register file of our coprocessor for the �T pairing in characteristic two.

TABLE 3Frobenius versus Inverse Frobenius Maps in Characteristic Two

Interestingly enough, his software library outperforms

several hardware architectures proposed by other research

ers for low levels of security. When we compare his results

with our work, we note that the gap between software and

hardware increases when considering larger values of m.

The computation of the �T pairing over F3193 on a Virtex-4

LX FPGA with a medium speedgrade is, for instance,

roughly 50 times faster than software. This speedup

justifies the use of large FPGAs which are now available

in servers and supercomputers such as the SGI Altix 4700

platform.

Kammler et al. [26] reported the first hardware im-

plementation of the Optimal Ate pairing [46] over a

Barreto–Naehrig (BN) curve [5], that is an ordinary curve

defined over a prime field Fp with embedding degree

k ¼ 12. The proposed design is implemented with a 130 nm

standard cell library and computes a pairing in 15.8 ms

over a 256-bit BN curve. It is, however, difficult to make a

fair comparison between our respective works since the

level of security and the target technology are not the same.

5.2 Characteristic Two versus Characteristic Three

It is worth noting that, in order to achieve the same level of

security for the �T pairing over supersingular curves in

characteristics two and three, the extension degree m of F2m

has to be larger than that of F3m0 . More precisely, we have

the ratio

m

m0¼ 3 log 3

2 log 2 2:4;

since the embedding degree is six in characteristic three,

against four in characteristic two. This ratio also applies

asymptotically to the number of iterations in Miller’s

algorithm, which is ðmþ 1Þ=2 and ðm0 þ 1Þ=2, respectively.

However, the arithmetic over F24m required for the

computation of the pairing in characteristic two is much

simpler than that the arithmetic over F36m0 : one iteration of

Miller’s algorithm requires only 7 multiplications over F2m ,

against 17 multiplications over F3m0 in the case of

characteristic three. Coincidentally, the ratio between the

two is also 17=7 2:4.

Thus, although necessitating 2.4 times as many iterations

as in characteristic three, the �T pairing over F2m requires

almost exactly as many products over the base field as the

�T pairing over F3m0 . Furthermore, a smaller extension

degree m0 compensates for the arithmetic over F3 being

more expensive than that over F2.

That close similarity in terms of performances between

characteristics two and three at a constant level of security,

as hinted at by this short analysis, can actually be observed

in the place-and-route results of our coprocessors (Fig. 7),

even though characteristic two appears to have a slight

advantage for low security.

6 CONCLUSION

We proposed novel architectures based on a parallel

pipelined Karatsuba multiplier for the �T pairing in

characteristics two and three. The main design challenge

we faced was to keep the pipeline continuously busy.

Accordingly, we modified the scheduling of Miller’s algo-

rithm in order to introduce more parallelism in the pairing

computation. We also presented a faster way to perform

the final exponentiation by exploiting the linearity of the

Frobenius map and/or taking advantage of a simpler inverse

Frobenius map in certain cases. Both software and hardware

implementations can benefit from these techniques.

To our knowledge, the implementation of our designs on

several Xilinx FPGA devices improved both the computation


TABLE 4Supersingular Elliptic Curves Considered in This Paper

time and the area–time trade-off of all the hardware pairing

coprocessors previously published in the open literature [1],

[7], [11], [19], [24], [28], [29], [30], [31], [38], [39], [40], [42], [43].

However, as of today, the design of pairing accelerators

providing a level of security equivalent to that of AES-128

remains a problem of major interest. Although Kammler et al.

[26] proposed a first solution over a Barreto–Naehrig curve,

several questions remain open. For instance, is it possible to

achieve such a level of security in hardware with super-

singular (hyper)elliptic curves at a reasonable cost in terms of

computation time and circuit area? Since several protocols

rely on such curves, it seems crucial to us to address this topic

in a near future.

Another interesting direction for further work is to

investigate the use of the hybrid operator (Fig. 3) to compute

the complete Tate pairing and not only the final exponentia-

tion. From our experiments, this operator should offer a

competitive balance between the area-efficient unified

operators of [7] and the latency-oriented architectures

presented here.


TABLE 5Hardware Accelerators for the Tate and �T Pairings

ACKNOWLEDGMENTS

The authors would like to thank Nidia Cortez-Duartealong with the anonymous referees for their valuablecomments. This work was supported by the Japan-FranceIntegrated Action Program (Ayame Junior/Sakura). Theauthors would also like to express their warmest thanks toTonito for his tasty contribution to this paper. Muchasgracias por tus riquısimos tacos!

REFERENCES

[1] A. Barenghi, G. Bertoni, L. Breveglieri, and G. Pelosi, “AFPGA Coprocessor for the Cryptographic Tate Pairing OverFp,” Proc. Fourth Int’l Conf. Information Technology: NewGenerations (ITNG ’08), 2008.

[2] P.S.L.M. Barreto, “A Note on Efficient Computation of Cube Rootsin Characteristic 3,” Report 2004/305, Cryptology ePrint Archive,2004.

[3] P.S.L.M. Barreto, S.D. Galbraith, C. �Oh�Eigeartaigh, and M.Scott, “Efficient Pairing Computation on Supersingular AbelianVarieties,” Designs, Codes and Cryptography, vol. 42, pp. 239-271, 2007.

[4] P.S.L.M. Barreto, H.Y. Kim, B. Lynn, and M. Scott, “EfficientAlgorithms for Pairing-Based Cryptosystems,” Advances in Cryp-tology—CRYPTO ’02, M. Yung, ed., pp. 354-368, 2002.

[5] P.S.L.M. Barreto and M. Naehrig, “Pairing-Friendly EllipticCurves of Prime Order,” Proc. Int’l Workshop Selected Areas inCryptography (SAC ’05), B. Preneel and S. Tavares, eds., pp. 319-331, 2006.

[6] G. Bertoni, L. Breveglieri, P. Fragneto, and G. Pelosi, “ParallelHardware Architectures for the Cryptographic Tate Pairing,” Proc.Third Int’l Conf. Information Technology: New Generations (ITNG ’06),2006.

[7] J.-L. Beuchat, N. Brisebarre, J. Detrey, E. Okamoto, and F.Rodrıguez-Henrıquez, “A Comparison between HardwareAccelerators for the Modified Tate Pairing over F2m andF3m ,” Proc. Int’l Conf. Pairing-Based Cryptography—Pairing ’08,S.D. Galbraith and K.G. Paterson, eds., pp. 297-315, 2008.

[8] J.-L. Beuchat, N. Brisebarre, J. Detrey, E. Okamoto, M. Shirase, andT. Takagi, “Algorithms and Arithmetic Operators for Computingthe �T Pairing in Characteristic Three,” IEEE Trans. Computers,vol. 57, no. 11, pp. 1454-1468, Nov. 2008.

[9] J.-L. Beuchat, N. Brisebarre, M. Shirase, T. Takagi, and E.Okamoto, “A Coprocessor for the Final Exponentiation of the�T Pairing in Characteristic Three,” Proc. Int’l Workshop Arith-metic of Finite Fields (Waifi ’07), C. Carlet and B. Sunar, eds.,pp. 25-39, 2007.

[10] J.-L. Beuchat, J. Detrey, N. Estibals, E. Okamoto, and F.Rodrıguez-Henrıquez, “Hardware Accelerator for the TatePairing in Characteristic Three Based on Karatsuba-OfmanMultipliers,” Proc. Workshop Cryptographic Hardware and EmbeddedSystems (CHES ’09), C. Clavier and K. Gaj, eds., pp. 225-239,2009.

[11] J.-L. Beuchat, H. Doi, K. Fujita, A. Inomata, P. Ith, A. Kanaoka, M.Katouno, M. Mambo, E. Okamoto, T. Okamoto, T. Shiga, M.Shirase, R. Soga, T. Takagi, A. Vithanage, and H. Yamamoto,“FPGA and ASIC Implementations of the �T Pairing in Char-acteristic Three,” Computers and Electrical Eng., vol. 36, no. 1,pp. 73-87, Jan. 2010.


Fig. 7. Performance comparison in terms of (a) Pairing computation time (top) and (b) area–time (AT) product (bottom) between the proposed

architectures and the coprocessors published in the literature, on Virtex-II Pro (left) and Virtex-4 (right) FPGAs.

[12] J.-L. Beuchat, E. Lopez-Trejo, L. Martınez-Ramos, S. Mitsunari,and F. Rodrıguez-Henrıquez, “Multi-Core Implementation of theTate Pairing over Supersingular Elliptic Curves,” Proc. Int’l Conf.Cryptology and Network Security (CANS ’09), J.A. Garay, A. Miyaji,and A. Otsuka, eds., pp. 413-432, 2009.

[13] R. Dutta, R. Barua, and P. Sarkar, “Pairing-Based CryptographicProtocols: A Survey,” Report 2004/64, Cryptology ePrint Archive,2004.

[14] I. Duursma and H.S. Lee, “Tate Pairing Implementation forHyperelliptic Curves y2 ¼ xp � xþ d,” Advances in Cryptology—A-SIACRYPT ’03, C.S. Laih, ed., pp. 111-123, 2003.

[15] H. Fan, J. Sun, M. Gu, and K.-Y. Lam, “Overlap-Free Karatsuba-Ofman Polynomial Multiplication Algorithm,” Report 2007/393,Cryptology ePrint Archive, 2007.

[16] K. Fong, D. Hankerson, J. Lopez, and A. Menezes, “FieldInversion and Point Halving Revisited,” IEEE Trans. Computers,vol. 53, no. 8, pp. 1047-1059, Aug. 2004.

[17] S.D. Galbraith, K. Harrison, and D. Soldera, “Implementing theTate Pairing,” Proc. Int’l Symp. Algorithmic Number Theory—ANTSV, C. Fieker and D.R. Kohel, eds., pp. 324-337, 2002.

[18] E. Gorla, C. Puttmann, and J. Shokrollahi, “Explicit Formulas forEfficient Multiplication in F36m ,” Proc. Int’l Workshop Selected Areasin Cryptography (SAC ’07), C. Adams, A. Miri, and M. Wiener, eds.,pp. 173-183, 2007.

[19] P. Grabher and D. Page, “Hardware Acceleration of the TatePairing in Characteristic Three,” Proc. Workshop CryptographicHardware and Embedded Systems (CHES ’05), J.R. Rao andB. Sunar, eds., pp. 398-411, 2005.

[20] D. Hankerson, A. Menezes, and M. Scott, Software Implementationof Pairings, chapter 12, pp. 188-206. IOS Press, 2009.

[21] G. Hanrot and P. Zimmermann, “A Long Note on Mulders’Short Product,” J. Symbolic Computation, vol. 37, no. 3, pp. 391-401, 2004.

[22] F. Hess, “Pairing Lattices,” Proc. Int’l Conf. Pairing-Based Crypto-graphy—Pairing ’08, S.D. Galbraith and K.G. Paterson, eds., pp. 18-38, 2008.

[23] F. Hess, N. Smart, and F. Vercauteren, “The Eta PairingRevisited,” IEEE Trans. Information Theory, vol. 52, no. 10,pp. 4595-4602, Oct. 2006.

[24] J. Jiang, “Bilinear Pairing (Eta_T Pairing) IP Core,” technical report,Dept. of Computer Science, City Univ. of Hong Kong, May 2007.

[25] A. Joux, “A One Round Protocol for Tripartite Diffie-Hellman,”Proc. Int’l Symp. Algorithmic Number Theory—ANTS IV,W. Bosma, ed., pp. 385-394, 2000.

[26] D. Kammler, D. Zhang, P. Schwabe, H. Scharwaechter, M.Langenberg, D. Auras, G. Ascheid, R. Leupers, R. Mathar, andH. Meyr, “Designing an ASIP for Cryptographic Pairings overBarreto-Naehrig Curves,” Report 2009/056, Cryptology ePrintArchive, 2009.

[27] A. Karatsuba and Y. Ofman, “Multiplication of MultidigitNumbers on Automata,” Soviet Physics Doklady (English Transla-tion), vol. 7, no. 7, pp. 595-596, Jan. 1963.

[28] M. Keller, T. Kerins, F. Crowe, and W.P. Marnane, “FPGAImplementation of a GF ð2mÞ Tate Pairing Architecture,” Proc. Int’lWorkshop Applied Reconfigurable Computing (ARC ’06), K. Bertels,J.M.P. Cardoso, and S. Vassiliadis, eds., pp. 358-369, 2006.

[29] M. Keller, R. Ronan, W.P. Marnane, and C. Murphy, “HardwareArchitectures for the Tate Pairing over GF ð2mÞ,” Computers andElectrical Eng., vol. 33, nos. 5/6, pp. 392-406, 2007.

[30] T. Kerins, W.P. Marnane, E.M. Popovici, and P.S.L.M. Barreto,“Efficient Hardware for the Tate Pairing Calculation in Char-acteristic Three,” Proc. Int’l Workshop Cryptographic Hardware andEmbedded Systems (CHES ’05), J.R. Rao and B. Sunar, eds., pp. 412-426, 2005.

[31] G. Komurcu and E. Savas, “An Efficient Hardware Implementationof the Tate Pairing in Characteristic Three,” Proc. Third Int’l Conf.Systems (ICONS ’08), E. Prasolova-Førland and M. Popescu, eds.,pp. 23-28, 2008.

[32] H. Li, J. Huang, P. Sweany, and D. Huang, “FPGA Implementa-tions of Elliptic Curve Cryptography and Tate Pairing over aBinary Field,” J. Systems Architecture, vol. 54, pp. 1077-1088, 2008.

[33] V.S. Miller, “Short Programs for Functions on Curves,” http://crypto.stanford.edu/miller, 1986.

[34] V.S. Miller, “The Weil Pairing, and Its Efficient Calculation,”J. Cryptology, vol. 17, no. 4, pp. 235-261, 2004.

[35] S. Mitsunari, “A Fast implementation of �T Pairing in Character-istic Three on Intel Core 2 Duo Processor,” Report 2009/032,Cryptology ePrint Archive, 2009.

[36] S. Mitsunari, R. Sakai, and M. Kasahara, “A New Traitor Tracing,”IEICE Trans. Fundamentals, vol. 2, pp. 481-484, Feb. 2002.

[37] F. Rodrıguez-Henrıquez, G. Morales-Luna, and J. Lopez, “Low-Complexity Bit-Parallel Square Root Computation Over GF ð2mÞfor All Trinomials,” IEEE Trans. Computers, vol. 57, no. 4, pp. 472-480, Apr. 2008.

[38] R. Ronan, C. Murphy, T. Kerins, C. �O h�Eigeartaigh, and P.S.L.M.Barreto, “A Flexible Processor for the Characteristic 3 �T Pairing,”Int’l J. High Performance Systems Architecture, vol. 1, no. 2, pp. 79-88,2007.

[39] R. Ronan, C. �Oh�Eigeartaigh, C. Murphy, M. Scott, and T. Kerins,“FPGA Acceleration of the Tate Pairing in Characteristic 2,” Proc.IEEE Int’l Conf. Field Programmable Technology (FPT ’06), pp. 213-220, 2006.

[40] R. Ronan, C. �Oh�Eigeartaigh, C. Murphy, M. Scott, and T. Kerins,“Hardware Acceleration of the Tate Pairing on a Genus 2Hyperelliptic Curve,” J. Systems Architecture, vol. 53, pp. 85-98,2007.

[41] R. Sakai, K. Ohgishi, and M. Kasahara, “Cryptosystems Based onPairing” Proc. 2000 Symp. Cryptography and Information Security(SCIS ’00), pp. 26-28, Jan. 2000.

[42] C. Shu, S. Kwon, and K. Gaj, “FPGA Accelerated Tate PairingBased Cryptosystem over Binary Fields,” Proc. IEEE Int’l Conf.Field Programmable Technology (FPT ’06), pp. 173-180, 2006.

[43] C. Shu, S. Kwon, and K. Gaj, “Reconfigurable ComputingApproach for Tate Pairing Cryptosystems over Binary Fields,”IEEE Trans. Computers, vol. 58, no. 9, pp. 1221-1237, Sept. 2009.

[44] J.H. Silverman, The Arithmetic of Elliptic Curves. Springer-Verlag,1986.

[45] L. Song and K.K. Parhi, “Low Energy Digit-Serial/Parallel FiniteField Multipliers,” J. VLSI Signal Processing, vol. 19, no. 2, pp. 149-166, July 1998.

[46] F. Vercauteren, “Optimal Pairings,” Report 2008/096, CryptologyePrint Archive, 2008.

[47] L.C. Washington, Elliptic Curves—Number Theory and Cryptography,second ed., CRC Press, 2008.

Jean-Luc Beuchat received the MSc and PhDdegrees in computer science from Swiss Fed-eral Institute of Technology, Lausanne, in 1997and 2001, respectively. He is an associateprofessor in the Graduate School of Systemsand Information Engineering at the University ofTsukuba. His current research interests includecomputer arithmetic and cryptography.

Jeremie Detrey received the MSc and PhDdegrees in computer science from the LIP of �ENSLyon, France, in 2003 and 2007, respectively,under the supervision of Florent de Dinechin andJean-Michel Muller. He is now a junior researcherin the CARAMEL Project-Team at INRIA inNancy, France. His research interests cover thevarious hardware aspects of computer arith-metic, with a special focus on finite fields, ellipticcurves, and their applications in cryptography.

Nicolas Estibals received the MSc degree incomputer science from the �ENS Lyon, France, in2009. He is currently working toward the PhDdegree at the CARAMEL Project-Team in Nancy,France, under the supervision of Jeremie Detreyand Pierrick Gaudry. His research interests arecryptography, computer arithmetic, and hard-ware development.


Eiji Okamoto received the BS, MS, and PhDdegrees in electronics engineering from TokyoInstitute of Technology in 1973, 1975, and 1978,respectively. He been working and studyingcommunication theory and cryptography forNEC Central Research Laboratories since1978. In 1991, he was a professor at JapanAdvanced Institute of Science and Technology,then at Toho University. Now he is a professor inthe Graduate School of Systems and Information

Engineering, University of Tsukuba. His research interests are crypto-graphy and information security. He is a member of the IEEE and acoeditor in chief of the International Journal of Information Security.

Francisco Rodrıguez-Henrıquez received theBSc degree in electrical engineering from theUniversity of Puebla, Mexico, the MSc degree inelectrical and computer engineering from theNational Institute of Astrophysics, Optics andElectronics (INAOE), Mexico, and the PhDdegree in electrical and computer engineeringfrom Oregon State University, in 1989, 1992,and 2000, respectively. Currently, he is aprofessor (CINVESTAV-3B researcher) in the

Computer Science Department, CINVESTAV-IPN, Mexico City, Mexico,which he joined in 2002. His major research interests are incryptography, finite field arithmetic, and hardware implementation ofcryptographic algorithms.

. For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.


Documents

266 IEEE TRANSACTIONS ON COMPUTERS, VOL. …fayez/research/gf3m/beuchat_j11.pdfIndex Terms—Tate pairing, T pairing, elliptic curve, finite field arithmetic, Karatsuba multiplier,