
Montgomery Arithmetic from a Software Perspective*

Joppe W. Bos^1 and Peter L. Montgomery^2

^1 NXP Semiconductors, ^2 Self

Abstract

This chapter describes Peter L. Montgomery’s modular multiplication method and the various improvements to reduce the latency for software implementations on devices which have access to many computational units.

We propose a representation of residue classes so as to speed modular multiplication without affecting the modular addition and subtraction algorithms.

Peter L. Montgomery [55]

1 Introduction

Modular multiplication is a fundamental arithmetic operation, for instance when computing in a finite field or a finite ring, and forms one of the basic operations underlying almost all currently deployed public-key cryptographic protocols. The efficient computation of modular multiplication is an important research area since optimizations, even ones resulting in a constant-factor speedup, have a direct impact on our day-to-day information security life. In this chapter we review the computational aspects of Peter L. Montgomery’s modular multiplication method [55] (known as Montgomery multiplication) from a software perspective (while the next chapter highlights the hardware perspective).

Throughout this chapter we use the terms digit and word interchangeably. To be more precise, we typically assume that a b-bit non-negative multi-precision integer X is represented by an array of n = ⌈b/w⌉ computer words as

X = Σ_{i=0}^{n−1} x_i·r^i

(the so-called radix-r representation), where r = 2^w for the word size w of the target computer architecture and 0 ≤ x_i < r. Here x_i is the i-th word of the integer X.

*This material has been published as Chapter 2 in “Topics in Computational Number Theory Inspired by Peter L. Montgomery” edited by Joppe W. Bos and Arjen K. Lenstra and published by Cambridge University Press. See www.cambridge.org/9781107109353.


Let N be the positive modulus consisting of n digits while the input values to the modular multiplication method (A and B) are non-negative integers smaller than N and consist of up to n digits. When computing the modular multiplication C = AB mod N, the definitional approach is first to compute the product P = AB. Next, a division is computed to obtain P = NQ + C such that both C and Q are non-negative integers less than N. Knuth studies such algorithms for multi-precision non-negative integers [48, Alg. 4.3.1.D]. Counting word-by-word instructions, the method described by Knuth requires O(n^2) multiplications and O(n) divisions when implemented on a computer platform. However, on almost all computer platforms divisions are expensive (relative to the cost of a multiplication). Is it possible to perform modular multiplication without using any division instructions?

If one is allowed to perform some precomputation which only depends on the modulus N, then this question can be answered affirmatively. When computing the division step, the idea is to use only “cheap” divisions and “cheap” modular reductions when computing the modular multiplication in combination with a precomputed constant (the computation of which may require “expensive” divisions). These “cheap” operations are computations which either come for free or at a low cost on computer platforms. Virtually all modern computer architectures internally store and compute on data in binary format using some fixed word-size radix r = 2^w as above. In practice, this means that all arithmetic operations are implicitly computed modulo 2^w (i.e., for free) and divisions or multiplications by (powers of) 2 can be computed by simply shifting the content of the register which holds this value.

Barrett introduced a modular multiplication approach (known as Barrett multiplication [6]) using this idea. This approach can be seen as a Newton method which uses a precomputed scaled variant of the modulus’ reciprocal in order to use only such “cheap” divisions when computing (estimating and adjusting) the division step. After precomputing a single (multi-precision) value, an implementation of Barrett multiplication does not use any division instructions and requires O(n^2) multiplication instructions.
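To make the idea concrete, here is a minimal Python sketch of a Barrett-style reduction for a binary radix. The parameter choices (shifting by k − 1 and k + 1 bits around a precomputed µ = ⌊2^{2k}/N⌋) follow the common textbook variant rather than necessarily the exact formulation of [6], and the function names are ours.

def barrett_precompute(N):
    k = N.bit_length()
    return k, (1 << (2 * k)) // N          # mu = floor(2^(2k) / N), computed once

def barrett_reduce(x, N, k, mu):
    """Reduce 0 <= x < N^2 modulo N without a division instruction."""
    q = ((x >> (k - 1)) * mu) >> (k + 1)   # cheap estimate of floor(x / N)
    r = x - q * N                          # congruent to x, at most a few multiples of N too large
    while r >= N:                          # at most two corrective subtractions
        r -= N
    return r

N = 0xF123456789ABCDEF1                    # an arbitrary odd modulus for illustration
k, mu = barrett_precompute(N)
assert barrett_reduce((N - 5) * (N - 7), N, k, mu) == ((N - 5) * (N - 7)) % N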

Another and earlier approach based on precomputation is the main topic of this chapter: Montgomery multiplication. This method is the preferred choice in cryptographic applications when the modulus has no “special” form (besides being an odd positive integer) that would allow more efficient modular reduction techniques. See Section 3 on page 11 for applications of Montgomery multiplication in the “special” setting. In practice, Montgomery multiplication is the most efficient method when a generic modulus is used (see e.g., the comparison performed by Bosselaers, Govaerts, and Vandewalle [19]) and has a very regular structure which speeds up the implementation. Moreover, the structure of the algorithm (especially if its single branch, the notorious conditional “subtraction step”, can be avoided, cf. page 9 in Section 2.4) has advantages when guarding against certain types of cryptographic attacks (for more information on differential power analysis attacks see the seminal paper by Kocher, Jaffe, and Jun [51]). In the next chapter, Montgomery’s method is compared with a version of Barrett multiplication in order to be more precise about the computational advantages of the former technique.

As observed by Shand and Vuillemin in [66], Montgomery multiplication can be seen as a generalization of Hensel’s odd division [40] for 2-adic numbers. In this chapter we explain the motivation behind Montgomery arithmetic. More specifically, we show how a change of the residue class representatives used improves the performance of modular multiplication. Next, we summarize some of the proposed modifications to Montgomery multiplication which can further speed up the algorithm in certain settings. Finally, we show how to implement Montgomery arithmetic in software. We especially study how to compute a single Montgomery multiplication concurrently using either vector instructions or when many computational units are available which can compute in parallel.


2 Montgomery multiplication

Let N be an odd b-bit integer and P a 2b-bit integer such that 0 ≤ P < N^2. The idea behind Montgomery multiplication is to change the representatives of the residue classes and change the modular multiplication accordingly. More precisely, we are not going to compute P mod N but 2^{−b}·P mod N instead. This explains the requirement that N needs to be odd (otherwise 2^{−b} mod N does not exist). It turns out that this approach is more efficient (by a constant factor) on computer platforms.

Let us start with a basic example to illustrate the strategy used. A first idea is to reduce the value P one bit at a time and repeat this for b bits such that the result has been reduced from 2b to b bits (as required). This can be achieved without computing any expensive modular divisions by noting that

2^{−1}·P mod N = P/2 if P is even, and (P + N)/2 if P is odd.

When P is even, the division by two can be computed with a basic operation on computer architectures: shift the number one position to the right. When P is odd one cannot simply compute this division by shifting. A computationally efficient approach to compute this division by two is to make this number P even by adding the odd modulus N, since obviously modulo N this is the same. This allows one to compute 2^{−1}·P mod N at the cost of (at most) a single addition and a single shift.

Let us compute D < 2N and Q < 2^b such that P = 2^b·D − N·Q, since then D ≡ 2^{−b}·P mod N. Initially set D equal to P and Q equal to zero. We denote by q_i the i-th digit when Q is written in binary (radix-2), i.e., Q = Σ_{i=0}^{b−1} q_i·2^i. Next perform the following two steps b times, starting at i = 0 until the last time when i = b − 1:

(Step 1) q_i = D mod 2,   (Step 2) D = (D + q_i·N)/2.

This procedure gradually builds the desired Q and at the start of every iteration

P = 2^i·D − N·Q

remains invariant. The process is illustrated in the example below.

For N = 7 (3 bits) and P = 20 < 7^2 we compute D ≡ 2^{−3}·P mod N. At the start of the algorithm, set D = P = 20 and Q = 0.

i = 0, 20 = 2^0 · 20 − 7 · 0 ⇒ 2^{−0} · 20 ≡ 20 mod 7
(Step 1) q_0 = 20 mod 2 = 0,   (Step 2) D = (20 + 0 · 7)/2 = 10

i = 1, 20 = 2^1 · 10 − 7 · 0 ⇒ 2^{−1} · 20 ≡ 10 mod 7
(Step 1) q_1 = 10 mod 2 = 0,   (Step 2) D = (10 + 0 · 7)/2 = 5

i = 2, 20 = 2^2 · 5 − 7 · 0 ⇒ 2^{−2} · 20 ≡ 5 mod 7
(Step 1) q_2 = 5 mod 2 = 1,   (Step 2) D = (5 + 1 · 7)/2 = 6

Since Q = q_0·2^0 + q_1·2^1 + q_2·2^2 = 4 and P = 20 = 2^3·D − N·Q = 2^3 · 6 − 7 · 4, we have computed 2^{−3} · 20 ≡ 6 mod 7.

Example 1: Radix-2 Montgomery reduction
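The following short Python sketch (ours, for illustration) mirrors the two steps above and checks the invariant P = 2^b·D − N·Q on the numbers of Example 1.

def montgomery_reduce_radix2(P, N, b):
    """Return D with D congruent to P * 2^(-b) modulo the odd modulus N."""
    D, Q = P, 0
    for i in range(b):
        qi = D & 1                    # Step 1: q_i = D mod 2
        Q += qi << i
        D = (D + qi * N) // 2         # Step 2: D = (D + q_i * N) / 2
    assert P == (D << b) - N * Q      # the invariant P = 2^b * D - N * Q
    return D

# The numbers of Example 1: N = 7, P = 20, b = 3 gives D = 6 = 2^(-3) * 20 mod 7.
assert montgomery_reduce_radix2(20, 7, 3) == 6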


Algorithm 1 The Montgomery reduction algorithm. Compute P·R^{−1} modulo the odd modulus N given the Montgomery radix R > N and using the precomputed Montgomery constant µ = −N^{−1} mod R.

Input: N, P such that 0 ≤ P < N^2.
Output: C ≡ P·R^{−1} mod N such that 0 ≤ C < N.

1: q ← µ(P mod R) mod R
2: C ← (P + Nq)/R
3: if C ≥ N then
4:   C ← C − N
5: end if
6: return C

The approach behind Montgomery multiplication [55] generalizes this idea. Instead of dividing by two at every iteration the idea is to divide by a Montgomery radix R which needs to be coprime to, and should be larger than, N. By precomputing the value

µ = −N^{−1} mod R,

adding a specific multiple of the modulus N to the current value P ensures that

P + N(µP mod R) ≡ P − N(N^{−1}P mod R) ≡ P − P ≡ 0 (mod R).   (1)

Hence, P + N(µP mod R) is divisible by the Montgomery radix R while P does not change modulo N. Let P be the product of two non-negative integers that are both less than N. After applying Equation (1) and dividing by R, the value P (bounded by N^2) has been reduced to at most 2N since

0 ≤ (P + N(µP mod R))/R < (N^2 + NR)/R < 2N   (2)

(since R was chosen larger than N). This approach is summarized in Algorithm 1: given an integer P bounded by N^2, it computes P·R^{−1} mod N, bounded by N, without using any “expensive” division instructions when assuming the reductions modulo R and divisions by R can be computed (almost) for free. On most computer platforms, where one chooses R as a power of two, this assumption is indeed true.

Equation (2) guarantees that the output is bounded by 2N. Hence, a conditional subtraction needs to be computed at the end of Algorithm 1 to ensure the output is less than N. The process is illustrated in the example below, where P is the product of integers A, B with 0 ≤ A, B < N.


Algorithm 2 The radix-r interleaved Montgomery multiplication algorithm. Compute (AB)·R^{−1} modulo the odd modulus N given the Montgomery radix R = r^n and using the precomputed Montgomery constant µ = −N^{−1} mod r. The modulus N is such that r^{n−1} ≤ N < r^n and r and N are coprime.

Input: A = Σ_{i=0}^{n−1} a_i·r^i, B, N such that 0 ≤ a_i < r, 0 ≤ A, B < R.
Output: C ≡ (AB)·R^{−1} mod N such that 0 ≤ C < N.

1: C ← 0
2: for i = 0 to n − 1 do
3:   C ← C + a_i·B
4:   q ← µ·C mod r
5:   C ← (C + N·q)/r
6: end for
7: if C ≥ N then
8:   C ← C − N
9: end if
10: return C

Exact divisions by 10^2 = 100 are visually convenient when using a decimal system: just shift the number two places to the right (or “erase” the two least significant digits). Assume the following modular reduction approach: use the Montgomery radix R = 100 when computing modulo N = 97. This example computes the Montgomery product of A = 42 with B = 17. First, precompute the Montgomery constant

µ = −N^{−1} mod R = −97^{−1} mod 100 = 67.

After computing the product P = AB = 42 · 17 = 714, compute the first two steps of Algorithm 1 omitting the division by R:

P + N(µP mod R) = 714 + 97(67 · 714 mod 100)
                = 714 + 97(67 · 14 mod 100)
                = 714 + 97(938 mod 100)
                = 714 + 97 · 38
                = 4400.

Indeed, 4400 is divisible by R = 100 and we have computed

(AB)R^{−1} ≡ 42 · 17 · 100^{−1} ≡ 44 (mod 97)

without using any “difficult” modular divisions.

Example 2: Montgomery multiplication
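A direct Python transcription of Algorithm 1 (a sketch with illustrative names; R may be any radix coprime to N, as in Example 2 where R = 100) reproduces the computation above.

def montgomery_reduce(P, N, R, mu):
    """Compute P * R^(-1) mod N per Algorithm 1, with mu = -N^(-1) mod R."""
    q = (mu * (P % R)) % R        # line 1
    C = (P + N * q) // R          # line 2: exact division, P + N*q is divisible by R
    if C >= N:                    # lines 3-4: the conditional subtraction
        C -= N
    return C

# The numbers of Example 2:
N, R = 97, 100
mu = (-pow(N, -1, R)) % R         # 67
assert mu == 67
assert montgomery_reduce(42 * 17, N, R, mu) == 44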


2.1 Interleaved Montgomery multiplication

When working with multi-precision integers, integers consisting of n digits of w bits each, it is common to write the Montgomery radix R as

R = r^n = 2^{wn},

where w is the word size of the architecture where Montgomery multiplication is implemented. The Montgomery multiplication method (as outlined in Algorithm 1) assumes the multiplication is computed before performing the Montgomery reduction. This has the advantage that one can use asymptotically fast multiplication methods (like e.g., Karatsuba [44], Toom-Cook [74, 25], Schönhage-Strassen [64], or Fürer [30] multiplication). However, this has the disadvantage that the intermediate results can be as large as r^{2n+1}. Or, stated differently, when using a machine word size of w bits the intermediate results are represented using at most 2n + 1 computer words.

The multi-precision setting was already handled in Montgomery’s original paper [55, Section 2] and the reduction and multiplication were meant to be interleaved by design. When representing the integers in radix-r representation,

A = Σ_{i=0}^{n−1} a_i·r^i, such that 0 ≤ a_i < r,

then the radix-r interleaved Montgomery multiplication (see also the work by Dussé and Kaliski Jr. in [29]) ensures the intermediate results never exceed n + 2 computer words. This approach is presented in Algorithm 2. Note that this interleaves the naive schoolbook multiplication algorithm with the Montgomery reduction and therefore does not make use of any asymptotically faster multiplication algorithm. The idea is that every iteration divides by the value r (instead of dividing once by R = r^n in the “non-interleaved” Montgomery multiplication algorithm). Hence, the value for µ is adjusted accordingly. In [22], Koç, Acar, and Kaliski Jr. compare different approaches to implementing multi-precision Montgomery multiplication. According to this analysis, the interleaved radix-r approach, referred to as coarsely integrated operand scanning in [22], performs best in practice.
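As an illustration, the following Python sketch implements the interleaved method of Algorithm 2 word by word; the word size w = 32 and the test modulus are arbitrary choices of ours, and the result is checked against a direct computation of A·B·r^{−n} mod N.

W = 32
r = 1 << W

def interleaved_montgomery(A, B, N, n):
    """Compute A * B * r^(-n) mod N for an odd n-word modulus N (Algorithm 2)."""
    mu = (-pow(N, -1, r)) % r
    C = 0
    for i in range(n):
        a_i = (A >> (W * i)) & (r - 1)
        C = C + a_i * B               # line 3
        q = (mu * C) % r              # line 4
        C = (C + N * q) // r          # line 5: exact division by r
    if C >= N:                        # lines 7-8
        C -= N
    return C

# Consistency check against the definition C = A * B * r^(-n) mod N:
N = (1 << 127) - 1                    # odd, fits in n = 4 words of 32 bits
A, B, n = 2**100 + 12345, 2**90 + 6789, 4
assert interleaved_montgomery(A, B, N, n) == (A * B * pow(r**n, -1, N)) % N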

2.2 Using Montgomery arithmetic in practice

As we have seen earlier in this section and in Algorithm 1 on page 4, Montgomery multiplication computes C ≡ P·R^{−1} mod N. It follows that, in order to use Montgomery multiplication in practice, one should transform the input operands A and B to Ã = A·R mod N and B̃ = B·R mod N: this is called the Montgomery representation. The transformed inputs (converted to the Montgomery representation) are used in the Montgomery multiplication algorithm. At the end of a series of modular multiplications the result, in Montgomery representation, is transformed back. This works correctly since Montgomery multiplication M(A, B, N) computes (AB)·R^{−1} mod N and it is indeed the case that the Montgomery representation C̃ of C = AB mod N is computed from the Montgomery representations of A and B since

C̃ ≡ M(Ã, B̃, N) ≡ (ÃB̃)R^{−1} ≡ (AR)(BR)R^{−1} ≡ (AB)R ≡ CR (mod N).


Converting an integer A to its Montgomery representation Ã = A·R mod N can be performed using Montgomery multiplication with the help of the precomputed constant R^2 mod N since

M(A, R^2, N) ≡ (A·R^2)R^{−1} ≡ A·R ≡ Ã (mod N).

Converting (the result) back from the Montgomery representation to the regular representation is the same as computing a Montgomery multiplication with the integer value one since

M(Ã, 1, N) ≡ (Ã · 1)R^{−1} ≡ (AR)R^{−1} ≡ A (mod N).

As mentioned earlier, due to the overhead of changing representations, Montgomery arithmetic is best when used to replace a sequence of modular multiplications, since this overhead is amortized. A typical use-case scenario is when computing a modular exponentiation as required in the RSA cryptosystem [63].
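The following Python fragment (ours) illustrates the conversions just described; the helper M(a, b) stands in for a real Montgomery multiplication routine and is written here directly with Python integers.

N = (1 << 127) - 1
R = 1 << 128                                  # Montgomery radix, R > N, gcd(R, N) = 1
R2 = (R * R) % N                              # precomputed once
R_inv = pow(R, -1, N)

def M(a, b):
    return (a * b * R_inv) % N                # stand-in for a real Montgomery multiply

def to_mont(a):   return M(a, R2)             # M(A, R^2, N) = A * R mod N
def from_mont(a): return M(a, 1)              # M(A~, 1, N) = A mod N

A, B = 123456789, 987654321
A_m, B_m = to_mont(A), to_mont(B)
assert from_mont(M(A_m, B_m)) == (A * B) % N  # products stay in Montgomery form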

As noted in the original paper [55] (see the quote at the start of this chapter) computing with numbers in Montgomery representation does not affect the modular addition and subtraction algorithms. This can be seen from

Ã ± B̃ ≡ A·R ± B·R ≡ (A ± B)·R (mod N),

which is the Montgomery representation of A ± B.

Computing the Montgomery inverse is, however, affected. The Montgomery inverse of a value Ã in Montgomery representation is the Montgomery representation of A^{−1}, i.e., A^{−1}·R mod N. This is different from computing the inverse of Ã modulo N since Ã^{−1} ≡ (AR)^{−1} ≡ A^{−1}R^{−1} (mod N) is the Montgomery representation of the value A^{−1}R^{−2}. One of the correct ways of computing the Montgomery inverse is to invert the number in its Montgomery representation and Montgomery multiply this result by R^3 since

M(Ã^{−1}, R^3, N) ≡ ((AR)^{−1}R^3)R^{−1} ≡ A^{−1}·R (mod N),

which is the Montgomery representation of A^{−1}. Another approach, which does not require any precomputed constant, is to compute the Montgomery reduction of a Montgomery residue Ã twice before inverting since

M(M(Ã, 1, N), 1, N)^{−1} ≡ M((AR)R^{−1}, 1, N)^{−1} ≡ M(A, 1, N)^{−1} ≡ (A·R^{−1})^{−1} ≡ A^{−1}·R (mod N).
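Both inversion routes can be checked with a few lines of Python; as before, M(a, b) is a stand-in for a Montgomery multiplication computing a·b·R^{−1} mod N, and the modulus is an arbitrary example.

N = (1 << 127) - 1
R = 1 << 128
Rinv = pow(R, -1, N)
R3 = pow(R, 3, N)                                   # precomputed for the first route

def M(a, b):
    return (a * b * Rinv) % N

A = 0x1234567890ABCDEF
A_m = (A * R) % N                                   # Montgomery representation of A

# Route 1: invert in Montgomery form, then Montgomery multiply by R^3.
inv1 = M(pow(A_m, -1, N), R3)
# Route 2: Montgomery reduce twice, then invert (no precomputed constant needed).
inv2 = pow(M(M(A_m, 1), 1), -1, N)

assert inv1 == inv2 == (pow(A, -1, N) * R) % N      # both give the representation of A^(-1)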

2.3 Computing the Montgomery constants µ and R^2

In order to use Montgomery multiplication one has to precompute the Montgomery constant µ = −N^{−1} mod r. This can be computed with, for instance, the extended Euclidean algorithm. A particularly efficient algorithm to compute µ when r is a power of two and N is odd, the typical setting used in cryptology, is given by Dussé and Kaliski Jr. and presented in [29]. This approach is recalled in Algorithm 3.

To show that this approach is correct, it suffices to show that at the start of Algorithm 3 and at the end of every iteration we have N·y_i ≡ 1 (mod 2^i). This can be shown by induction as follows. At the start of the algorithm we set y to one, denote this start setting with y_1, and the condition holds since


Algorithm 3 Compute the Montgomery constant µ = −N^{−1} mod r for odd values N and r = 2^w as presented by Dussé and Kaliski Jr. in [29].
Input: Odd integer N and r = 2^w for w ≥ 1.
Output: µ = −N^{−1} mod r

y ← 1
for i = 2 to w do
  if (N·y mod 2^i) ≠ 1 then
    y ← y + 2^{i−1}
  end if
end for
return µ ← r − y

N is odd by assumption. Denote with y_i, for 2 ≤ i ≤ w, the value of y in Algorithm 3 at the end of iteration i. When i > 1, our induction hypothesis is that N·y_{i−1} = 1 + 2^{i−1}·m for some positive integer m, at the end of iteration i − 1.

We consider two cases:

• (m is even) Since N·y_{i−1} = 1 + (m/2)·2^i ≡ 1 (mod 2^i) we can simply update y_i to y_{i−1} and the condition holds.

• (m is odd) Since N·y_{i−1} = 1 + 2^{i−1} + ((m−1)/2)·2^i ≡ 1 + 2^{i−1} (mod 2^i), we update y_i with y_{i−1} + 2^{i−1}. We obtain N·(y_{i−1} + 2^{i−1}) = 1 + 2^{i−1}(1 + N) + ((m−1)/2)·2^i ≡ 1 (mod 2^i) since N is odd.

Hence, after the for loop y_w is such that N·y_w ≡ 1 (mod 2^w) and the returned value µ = r − y_w ≡ 2^w − N^{−1} ≡ −N^{−1} (mod 2^w) has been computed correctly.
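A direct Python transcription of Algorithm 3 (illustrative only) can be checked against the inverse computed with Python's built-in modular inverse:

def montgomery_mu(N, w):
    """Return mu = -N^(-1) mod 2^w for odd N (Algorithm 3)."""
    y = 1
    for i in range(2, w + 1):
        if (N * y) % (1 << i) != 1:   # the i low bits of N*y must equal 1
            y += 1 << (i - 1)
    return (1 << w) - y               # mu = r - y

N, w = (1 << 127) - 1, 64
assert montgomery_mu(N, w) == (-pow(N, -1, 1 << w)) % (1 << w)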

The precomputed constant R^2 mod N is required when converting a residue modulo N from its regular to its Montgomery representation (see Section 2.2 on page 6). When R = r^n is a power of two, which in practice is typically the case since r = 2^w, then this precomputed value R^2 mod N can also be computed efficiently. For convenience, assume R = r^n = 2^{wn} and 2^{wn−1} ≤ N < 2^{wn} (but this approach can easily be adapted when N is smaller than 2^{wn−1}). Commence by setting the initial c_0 = 2^{wn−1} < N. Next, start at i = 1 and compute

c_i ≡ c_{i−1} + c_{i−1} mod N

and increase i until i = wn + 1. The final value

c_{wn+1} ≡ 2^{wn+1}·c_0 ≡ 2^{wn+1}·2^{wn−1} ≡ 2^{2wn} ≡ (2^{wn})^2 ≡ R^2 mod N

as required and can be computed with wn + 1 modular additions.
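As a sketch (with an arbitrary modulus satisfying 2^{wn−1} ≤ N < 2^{wn}, and names of our choosing), the doubling chain can be written as:

def r_squared(N, w, n):
    c = 1 << (w * n - 1)              # c_0 = 2^(wn-1), already reduced modulo N
    for _ in range(w * n + 1):        # wn + 1 modular doublings
        c = (c + c) % N               # c_i = c_{i-1} + c_{i-1} mod N
    return c                          # equals 2^(2wn) mod N = R^2 mod N

N, w, n = (1 << 127) + 1, 64, 2
assert r_squared(N, w, n) == pow(1 << (w * n), 2, N)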

2.4 On the final conditional subtraction

It is possible to alter or even completely remove the conditional subtraction from lines 3–4 in Algorithm 1 on page 4. This is often motivated by either performance considerations or turning the (software) implementation into straight-line code that requires no conditional branches. This is one of the basic requirements for cryptographic implementations which need to protect themselves against a variety of (simple) side-channel attacks as introduced by Kocher, Jaffe, and Jun [51] (those attacks which use physical information, such as elapsed time, obtained when executing an implementation, to deduce information about the secret key material used). Ensuring constant running-time is a first


step in achieving this goal. In order to change or remove this final conditional subtraction the general idea is to bound the input and output of the Montgomery multiplication in such a way that they can be re-used in a subsequent Montgomery multiplication computation. This means using a redundant representation, in which the representation of the residues used is not unique and can be larger than N.

2.4.1 Subtraction-less Montgomery multiplication

The conditional subtraction can be omitted when the size of the modulus N is appropriately selected with respect to the Montgomery radix R. (This is a result presented by Shand and Vuillemin in [66] but see also the sequence of papers by Walter, Hachez, and Quisquater [77, 36, 78].) The idea is to select the modulus N such that 4N < R and to use a redundant representation for the input and output values of the algorithm. More specifically, we assume A, B ∈ Z/2NZ (residues modulo 2N), where 0 ≤ A, B < 2N, since then the outputs of Algorithm 1 on page 4 and Algorithm 2 on page 5 are bounded by

0 ≤ (AB + N(µAB mod R))/R < ((2N)^2 + NR)/R < (NR + NR)/R = 2N.   (3)

Hence, the result can be reused as input to the same Montgomery multiplication algorithm. This avoids the need for the conditional subtraction except in a final correction step (after having computed a sequence of Montgomery multiplications) where one reduces the value to a unique representation with a single conditional subtraction.

In practice, this might reduce the number of arithmetic operations whenever the modulus can be chosen beforehand and, moreover, simplifies the code. However, in the popular use-cases in cryptography, e.g., in the setting of computing modular exponentiations when using schemes based on RSA [63] where the bit-length of the modulus must be a power of two due to compliance with cryptographic standards, the condition 4N < R results in a Montgomery radix R which is represented using one additional computer word (compared to the number of words needed to represent the modulus N). Hence, in this setting, such a multi-precision implementation without a conditional subtraction needs one more iteration (when using the interleaved Montgomery multiplication algorithm) to compute the result compared to a version which computes the conditional subtraction.

2.4.2 Montgomery multiplication with a simpler final comparison

Another approach is not to remove the subtraction but make this operation computationally cheaper. See the analysis by Walter and Thompson in [79, Section 2.2] which is introduced again by Pu and Zhao in [62]. In practice, the Montgomery radix R = r^n is often chosen as a multiple of the word-size of the computer architecture used (e.g., r = 2^w for w ∈ {8, 16, 32, 64}). The idea is to reduce the output of the Montgomery multiplication to {0, 1, ..., R − 1} instead of to the smaller range {0, 1, ..., N − 1}. Just as above, this is a redundant representation but working with residues from Z/RZ. This representation does not need more computer words to represent the result and therefore does not increase the number of iterations one needs to compute; something which might be the case when the Montgomery radix is increased to remove the conditional subtraction. Computing the comparison if an integer x = Σ_{i=0}^{n} x_i·r^i is at least R = r^n can be done efficiently by verifying if the most significant word x_n is non-zero. This is significantly more efficient compared with computing a full multi-precision comparison.

This approach is correct since if the input values A and B are bounded by R then the output of the Montgomery multiplication, before the conditional subtraction, is bounded by

0 ≤ (AB + N(µAB mod R))/R < (R^2 + NR)/R = R + N.   (4)

Subtracting N whenever the result is at least R ensures that the output is also less than R. Hence, one still requires to evaluate the condition for subtraction in every Montgomery multiplication. However, the greater-or-equal comparison becomes significantly cheaper and the number of iterations required to compute the interleaved Montgomery multiplication algorithm remains unchanged. In the setting where a constant running-time is required this approach does not seem to bring a significant advantage (see the security analysis by Walter and Thompson in [79, Section 2.2] for more details). A simple constant-running-time solution is to compute the subtraction and select this result if no borrow occurred. However, when constant running-time is not an issue this approach (using a cheaper comparison) can speed up the Montgomery multiplication algorithm.

2.5 Montgomery multiplication in F_{2^k}

The idea behind Montgomery multiplication carries over to finite fields of cardinality 2^k as well. Such finite fields are known as binary fields or characteristic-two finite fields. The application of Montgomery multiplication to this setting is outlined by Koç and Acar in [50]. Let n(x) be an irreducible polynomial of degree k. Then an element a(x) ∈ F_{2^k} ≅ F_2[x]/(n(x)) can be represented in the polynomial-basis representation by a polynomial of degree at most k − 1,

a(x) = Σ_{i=0}^{k−1} a_i·x^i where a_i ∈ F_2.

The equivalent of the Montgomery radix is the polynomial r(x) ∈ F_2[x]/(n(x)) which in practice is chosen as r(x) = x^k. Since n(x) is irreducible this ensures that the inverse r^{−1}(x) mod n(x) exists and the Montgomery multiplication

a(x)·b(x)·r^{−1}(x) mod n(x)

is well-defined. Let a(x), b(x) ∈ F_2[x]/(n(x)) and let their product be p(x) = a(x)·b(x), of degree at most 2(k − 1).

Computing the Montgomery reduction p(x)·r^{−1}(x) mod n(x) of p(x) can be done using the same steps as presented in Algorithm 1 on page 4 given the precomputed Montgomery constant µ(x) = −n(x)^{−1} mod r(x). Hence, one computes

q(x) = p(x)·µ(x) mod r(x)
c(x) = (p(x) + q(x)·n(x))/r(x).

Note that the final conditional subtraction step is not required since

deg(c(x)) ≤ max(2(k − 1), k − 1 + k) − k = k − 1

(because r(x) is a degree-k polynomial). An interleaved version of this approach, analogous to the interleaved Montgomery multiplication for finite fields of large prime characteristic from Section 2.1 on page 6, works here as well.
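The following Python sketch (ours) models this reduction for binary fields by encoding polynomials over F_2 as integers, with bit i holding the coefficient of x^i; clmul, the inverse helper, and the small irreducible polynomial are illustrative choices, not taken from [50].

def clmul(a, b):
    """Carry-less product of two bit-polynomials over F_2."""
    res = 0
    while b:
        if b & 1:
            res ^= a
        a, b = a << 1, b >> 1
    return res

def poly_inv_mod_xk(n, k):
    """n(x)^(-1) mod x^k for n(0) = 1, lifted one coefficient at a time."""
    y = 1
    for i in range(1, k):
        if (clmul(n, y) >> i) & 1:
            y ^= 1 << i
    return y

def mont_reduce_gf2(p, n, k):
    """Return p(x) * x^(-k) mod n(x); note -n(x)^(-1) = n(x)^(-1) over F_2."""
    mask = (1 << k) - 1
    mu = poly_inv_mod_xk(n, k)
    q = clmul(p & mask, mu) & mask        # q(x) = p(x) * mu(x) mod x^k
    return (p ^ clmul(q, n)) >> k         # c(x) = (p(x) + q(x) * n(x)) / x^k

nx = 0b10011                              # n(x) = x^4 + x + 1, irreducible, k = 4
a, b = 0b0110, 0b1011                     # two elements of F_2[x]/(n(x))
c = mont_reduce_gf2(clmul(a, b), nx, 4)   # a(x) * b(x) * x^(-4) mod n(x)
assert c.bit_length() <= 4                # no final subtraction is ever needed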


3 Using primes of a special form

In some settings in cryptography, most notably in elliptic curve cryptography (introduced independently by Miller and Koblitz in [54, 49]), the (prime) modulus can be chosen freely and is fixed for a large number of modular arithmetic operations. In order to gain a constant factor speedup when computing the modular multiplication, Solinas suggested [68] a specific set of special primes which were subsequently included in the FIPS 186 standard [75] used in public-key cryptography. More recently, prime moduli of the form 2^s ± δ have gained popularity where s, δ ∈ Z_{>0} and δ < 2^s such that δ is a (very) small integer. More precisely, the constant δ is small compared to the typical word-size of computer architectures used (e.g., less than 2^32) and often is chosen as the smallest integer such that one of 2^s ± δ is prime. One should be aware that the usage of such primes of a special form not only accelerates the cryptographic implementations, the cryptanalytic methods benefit as well. See, for instance, the work by this chapter’s authors, Kleinjung, and Lenstra related to efficient arithmetic to factor Mersenne numbers (numbers of the form 2^M − 1) in [15]. An example of one of the primes, suggested by Solinas, is 2^256 − 2^224 + 2^192 + 2^96 − 1 where the exponents are selected to be a multiple of 32 to speed up implementations on 32-bit platforms (but see for instance the work by Käsper how to implement such primes efficiently on 64-bit platforms [45]). A more recent example proposed by Bernstein [7] is to use the prime 2^255 − 19 to implement efficient modular arithmetic in the setting of elliptic curve cryptography.

3.1 Faster modular reduction with primes of a special form

We use the Mersenne prime 2^127 − 1 as an example to illustrate the various modular reduction techniques in this section. Given two integers a and b, such that 0 ≤ a, b < 2^127 − 1, the modular product ab mod 2^127 − 1 can be computed efficiently as follows. (We follow the description given by the first author of this chapter, Costello, Hisil, and Lauter from [10].) First compute the product with one’s preferred multiplication algorithm and write the result in radix-2^128 representation

c = ab = c_1·2^128 + c_0, where 0 ≤ c_0 < 2^128 and 0 ≤ c_1 < 2^126.

This product can be almost fully reduced by subtracting 2c_1 times the modulus since

c ≡ c_1·2^128 + c_0 − 2c_1(2^127 − 1) ≡ c_0 + 2c_1 (mod 2^127 − 1).

Moreover, we can subtract 2^127 − 1 one more time if the bit ⌊c_0/2^127⌋ (the most significant bit of c_0) is set. When combining these two observations a first reduction step can be computed as

c′ = (c_0 mod 2^127) + 2c_1 + ⌊c_0/2^127⌋ ≡ c (mod 2^127 − 1).   (5)

This already ensures that the result satisfies 0 ≤ c′ < 2^128 since

c′ ≤ 2^127 − 1 + 2(2^126 − 1) + 1 < 2^128.

One can then reduce c′ further using conditional subtractions. Reduction modulo 2^127 − 1 can therefore be computed without using any multiplications or expensive divisions by taking advantage of the form of the modulus.
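A short Python sketch of the reduction step of Equation (5) (function name ours):

P127 = (1 << 127) - 1

def reduce_mod_p127(c):
    """Reduce a product c = a*b with 0 <= a, b < 2^127 - 1 modulo 2^127 - 1."""
    c1, c0 = c >> 128, c & ((1 << 128) - 1)          # radix-2^128 split
    c_prime = (c0 & P127) + 2 * c1 + (c0 >> 127)     # Equation (5), c_prime < 2^128
    while c_prime >= P127:                           # a few conditional subtractions
        c_prime -= P127
    return c_prime

a, b = P127 - 3, P127 - 11
assert reduce_mod_p127(a * b) == (a * b) % P127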


3.2 Faster Montgomery reduction with primes of a special form

Along the same lines one can select moduli which speed up the operations when computing Montgomery reduction. Such special moduli have been proposed many times in the literature to reduce the number of arithmetic operations (see for instance the work by Lenstra [52], Acar and Shumow [1], Knezevic, Vercauteren, and Verbauwhede [47], Hamburg [37], and the first author of this chapter, Costello, Hisil, and Lauter [10, 11]). They are sometimes referred to as Montgomery-friendly moduli or Montgomery-friendly primes. Techniques to scale an existing modulus such that this scaled modulus has a special shape which reduces the number of arithmetic operations, using the same techniques as for the Montgomery-friendly moduli, are also known and called Montgomery tail tailoring by Hars in [39]. Following the description in the book by Brent and Zimmermann [20], this can be seen as a form of preconditioning as suggested by Svoboda in [70] in the setting of division.

When one is free to select the modulus N beforehand, then the number of arithmetic operations can be reduced if the modulus is selected such that

µ = −N^{−1} mod r = ±1

in the setting of interleaved Montgomery multiplication (as also used by Dixon and Lenstra in [28]). This ensures that the multiplication by µ can be avoided (since µ = ±1) in every iteration of the interleaved Montgomery multiplication algorithm. This puts a first requirement on the modulus, namely N ≡ ∓1 mod r. In practice, r = 2^w where w is the word-size of the computer architecture. Hence, this requirement puts a restriction on the least significant word of the modulus (which equals either 1 or −1 ≡ 2^w − 1 mod 2^w).

Combining lines 4 and 5 from the interleaved Montgomery multiplication (Algorithm 2 on page 5) we see that one has to compute (C + N(µC mod r))/r. Besides the multiplication by µ one has to compute a multi-word multiplication with the (fixed) modulus N. In the same vein as the techniques from Section 3.1 above, one can require N to have a special shape such that this multiplication can be computed faster in practice. This can be achieved, for instance, when one of the computer words of the modulus is small or has some special shape while the remainder of the digits are zero except for the most significant word (e.g., when µ = 1). Along these lines the first author of this chapter, Costello, Longa, and Naehrig select primes for usage in elliptic curve cryptography where

N = 2^α(2^β − γ) ± 1   (6)

where α, β, and γ are integers such that γ < 2^β ≤ r. The final requirement on the modulus is to ensure that 4N < R = r^n since this avoids the final conditional subtraction (as shown on page 9 in Section 2.4). Examples of such Montgomery-friendly moduli include

2^252 − 2^232 − 1 = 2^192(2^60 − 2^40) − 1 = 2^224(2^28 − 2^8) − 1

(written in different form to show the usage on different architectures which can compute with β-bit integers) proposed by Hamburg in [37] and the modulus

2^240(2^14 − 127) − 1 = 2^222(2^32 − 2^18 · 127) − 1

proposed by the first author of this chapter, Costello, Longa, and Naehrig in [12]. The approach is illustrated in the example below. Other examples of Montgomery-friendly moduli are given in [32, Table 4] based on Chung-Hasan arithmetic [24].


Let us consider Montgomery reduction modulo 2^127 − 1 on a 64-bit computer architecture (w = 64). This means we have α = 64, β = 63, and γ = 0 in Equation (6) on page 12 (2^127 − 1 = 2^64(2^63 − 0) − 1). Multiplication by µ can be avoided since µ = −(2^127 − 1)^{−1} mod 2^64 = 1. Furthermore, due to the special form of the modulus the multiplication by 2^127 − 1 can be simplified. The computation of (C + N(µC mod r))/r, which needs to be done for each computer word (twice in this setting), can be simplified when using the Montgomery interleaved multiplication algorithm.

Write C = c_2·2^128 + c_1·2^64 + c_0 (see line 3 in Algorithm 2 on page 5) with 0 ≤ c_2, c_1, c_0 < 2^64, then

(C + N(µC mod r))/r = (C + (2^127 − 1)(C mod 2^64))/2^64
                    = ((c_2·2^128 + c_1·2^64 + c_0) + (2^127 − 1)·c_0)/2^64
                    = (c_2·2^128 + c_1·2^64 + c_0·2^127)/2^64
                    = c_2·2^64 + c_1 + c_0·2^63.

Hence, only two additions and two shift operations are needed in this computation.

Example 3: Montgomery-friendly reduction modulo 2^127 − 1

4 Concurrent computing of Montgomery multiplication

Since the seminal paper by the second author introducing modular multiplication without trial division, people have studied ways to obtain better performance on different computer architectures. Many of these techniques are specifically tailored towards a (family of) platforms motivated by the desire to enhance the practical performance of public-key cryptosystems.

One approach focuses on reducing the latency of the Montgomery multiplication operation. This might be achieved by computing the Montgomery product using many computational units in parallel. One example is to use the single instruction, multiple data (SIMD) programming paradigm. In this setting a single vector instruction applies to multiple data elements in parallel. Many modern computer architectures have access to vector instruction set extensions to perform SIMD operations. Example platforms include the popular high-end x86 architecture as well as the embedded ARM platform which can be found in the majority of modern smartphones and tablets. To highlight the potential, Gueron and Krasnov were the first to show in [35] that the computation of Montgomery multiplication on the 256-bit wide vector instruction set AVX2 is faster than the same computation on the classical arithmetic logic unit (ALU) on the x86_64 platform.

In Section 4.1 below we outline the approach by the authors of this chapter, Shumow, and Zaverucha from [17] for computing a single Montgomery multiplication using vector instruction set extensions which support 2-way SIMD operations (i.e., perform the same operation on two data points simultaneously). This approach allows one to split the computation of the interleaved Montgomery multiplication into two parts which can be computed in parallel. Note that in a follow-up work [65] by Seo, Liu, Großschädl, and Choi it is shown how to improve the performance on 2-way SIMD architectures even further. Instead of computing the two multiplications concurrently, as is presented in Section 4.1, they compute every multiplication using 2-way SIMD instructions. By careful scheduling of the instructions they manage to significantly reduce the read-after-write dependencies which reduces the number of bubbles (execution delays in the instruction pipeline). This results in a software implementation which outperforms the one presented in [17]. It would be interesting to see if these two approaches (from [17] and [65]) can be merged on platforms which support efficient 4-way SIMD instructions.

In Section 4.3 on page 21 we show how to compute Montgomery multiplication when integers are represented in a residue number system. This approach can be used to compute Montgomery arithmetic efficiently on highly parallel computer architectures which have hundreds of computational cores or more and when large moduli are used (such as in the RSA cryptosystem).

4.0.1 Related work on concurrent computing of Montgomery multiplication

A parallel software approach describing systolic Montgomery multiplication is described by Dixon and Lenstra in [28], by Iwamura, Matsumoto, and Imai in [42], and Walter in [76]. See Chapter 3 for more information about systolic Montgomery multiplication. Another approach is to use the SIMD vector instructions to compute multiple Montgomery multiplications in parallel. This can be useful in applications where many computations need to be processed in parallel such as batch-RSA or cryptanalysis. This approach is studied by Page and Smart in [58] using the SSE2 vector instructions on a Pentium 4 and by the first author of this chapter in [9] on the Cell Broadband Engine (see Section 4.2.1 on page 18 for more details about this platform).

An approach based on Montgomery multiplication which allows one to split the operand into two parts, which can then be processed in parallel, is called bipartite modular multiplication and was introduced by Kaihara and Takagi in [43]. The idea is to use a Montgomery radix R = r^{αn} where α is a rational number such that αn is an integer and 0 < αn < n: hence, the radix R is smaller than the modulus N. For example, one can choose α such that αn = ⌊n/2⌋. In order to compute x·y·r^{−αn} mod N (where 0 ≤ x, y < N) write y = y_1·r^{αn} + y_0 and compute

x·y_1 mod N and x·y_0·r^{−αn} mod N

in parallel using a regular interleaved modular multiplication algorithm (see, e.g., the work by Brickell [21]) and the interleaved Montgomery multiplication algorithm, respectively. The sum of the two products gives the correct Montgomery product of x and y since

(x·y_1 mod N) + (x·y_0·r^{−αn} mod N) ≡ x(y_1·r^{αn} + y_0)·r^{−αn} ≡ x·y·r^{−αn} (mod N).
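The splitting identity is easy to check numerically; in the sketch below (ours) both halves are computed with plain Python modular arithmetic rather than with the two interleaved algorithms that would run in parallel in the actual method.

def bipartite_product(x, y, N, r, k):
    """Return x*y*r^(-k) mod N by splitting y = y1*r^k + y0."""
    R = r ** k
    y1, y0 = divmod(y, R)
    hi = (x * y1) % N                          # classical (non-Montgomery) half
    lo = (x * y0 * pow(R, -1, N)) % N          # Montgomery half: x*y0*r^(-k) mod N
    return (hi + lo) % N

# Check against a direct computation:
N, r, k = 2**127 - 1, 2**64, 1
x, y = 12345678901234567890 % N, 98765432109876543210 % N
assert bipartite_product(x, y, N, r, k) == (x * y * pow(r**k, -1, N)) % N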

4.1 Montgomery multiplication using SIMD extensions

This section is an extended version of the description of the idea presented by this chapter’s authors, Shumow, and Zaverucha in [17] where an algorithm is presented to compute the interleaved Montgomery multiplication using two threads running in parallel performing identical arithmetic steps. Hence, this algorithm runs efficiently when using 2-way SIMD vector instructions as frequently found on modern computer architectures. For illustrative purposes we assume a radix-2^32 system, but this can be adjusted accordingly to any other radix system. Note that for efficiency considerations the choice of the radix system depends on the vector instructions available.

Algorithm 2 on page 5 outlines the interleaved Montgomery multiplication algorithm and computes two 1 × n → (n + 1) computer word multiplications, namely a_i·B and q·N, and a single 1 × 1 → 1 computer word multiplication (µC mod 2^32) every iteration. Unfortunately, these three multiplications depend on each other and therefore can not be computed in parallel. Every iteration computes (see Algorithm 2)

1. C ← C + a_i·B
2. q ← µ·C mod 2^32
3. C ← (C + q·N)/2^32

In order to reduce latency, we would like to compute the two 1 × n → (n + 1) word multiplications in parallel using vector instructions. This can be achieved by removing the dependency between these two multi-word multiplications by computing the value of q first. The first word c_0 of C = Σ_{i=0}^{n−1} c_i·2^{32i} is computed twice: once for the computation of q (in µC mod 2^32) and then again in the parallel computation of C + a_i·B. This is a relatively minor penalty of one additional one-word multiplication and addition per iteration to make these two multi-word multiplications independent of each other. This means an iteration i can be computed as

1. q ← µ(c_0 + a_i·b_0) mod 2^32
2. C ← C + a_i·B
3. C ← (C + q·N)/2^32

and the 1 × n → (n + 1) word multiplications in steps 2 and 3 (a_i·B and q·N) can be computed in parallel using, for instance, 2-way SIMD vector instructions.

In order to rewrite the remaining operations, besides the multiplication, the authors of [17] suggest inverting the sign of the Montgomery constant µ, i.e., instead of using −N^{−1} mod 2^32 use µ = N^{−1} mod 2^32. When computing the Montgomery product C = A·B·2^{−32n} mod N, one can compute D (which contains the sum of the products a_i·B) and E (which contains the sum of the products q·N), separately and in parallel using the same arithmetic operations. Due to the modified choice of the Montgomery constant µ we have C = D − E ≡ A·B·2^{−32n} (mod N), where 0 ≤ D, E < N: the maximum values of both D and E fit in an n-word integer. This approach is presented in Algorithm 4 on the next page.

At the start of every iteration in the for-loop iterating with j, the two separate computational streams running in parallel need to communicate information to compute the value of q. More precisely, this requires the knowledge of both d_0 and e_0, the least significant words of D and E respectively. Once the values of both d_0 and e_0 are known to one of the computational streams, the updated value of q can be computed as

q = ((µ·a_0)·b_j + µ·(d_0 − e_0)) mod 2^32 = µ·(c_0 + b_j·a_0) mod 2^32

since c_0 = d_0 − e_0.

Except for this communication cost between the two streams, to compute the value of q, all arithmetic computations performed by computation 1 and computation 2 in the outer for-loop are identical but work on different data. This makes this approach suitable for computation using 2-way 32-bit SIMD vector instructions. This technique benefits from 2-way SIMD 32 × 32 → 64-bit multiplication and matches exactly the 128-bit wide vector instructions as present in many modern computer architectures.


Algorithm 4 A parallel radix-2^32 interleaved Montgomery multiplication algorithm. Except for the computation of q, the arithmetic steps in the outer for-loop, performed by computation 1 and computation 2, are identical. This approach is suitable for 32-bit 2-way SIMD vector instruction units.

Input: A, B, M, µ such that A = Σ_{i=0}^{n−1} a_i·2^{32i}, B = Σ_{i=0}^{n−1} b_i·2^{32i}, M = Σ_{i=0}^{n−1} m_i·2^{32i}, 0 ≤ a_i, b_i < 2^32, 0 ≤ A, B < M, 2^{32(n−1)} ≤ M < 2^{32n}, 2 ∤ M, µ = M^{−1} mod 2^32.
Output: C ≡ A·B·2^{−32n} mod M such that 0 ≤ C < M.

Computation 1                               Computation 2
d_i = 0 for 0 ≤ i < n                       e_i = 0 for 0 ≤ i < n
for j = 0 to n − 1 do                       for j = 0 to n − 1 do
  q ← ((µ·b_0)·a_j + µ·(d_0 − e_0)) mod 2^32
  t_0 ← a_j·b_0 + d_0                       t_1 ← q·m_0 + e_0      // where t_0 ≡ t_1 (mod 2^32)
  t_0 ← ⌊t_0/2^32⌋                          t_1 ← ⌊t_1/2^32⌋
  for i = 1 to n − 1 do                     for i = 1 to n − 1 do
    p_0 ← a_j·b_i + t_0 + d_i               p_1 ← q·m_i + t_1 + e_i
    t_0 ← ⌊p_0/2^32⌋                        t_1 ← ⌊p_1/2^32⌋
    d_{i−1} ← p_0 mod 2^32                  e_{i−1} ← p_1 mod 2^32
  end for                                   end for
  d_{n−1} ← t_0                             e_{n−1} ← t_1
end for                                     end for

C ← D − E    // where D = Σ_{i=0}^{n−1} d_i·2^{32i}, E = Σ_{i=0}^{n−1} e_i·2^{32i}
if C < 0 do C ← C + M end if
return C

By changing the radix used in Algorithm 4, larger or smaller vector instructions can be supported.

Note that as opposed to a conditional subtraction in Algorithm 1 on page 4 and Algorithm 2 on page 5, Algorithm 4 computes a conditional addition because of the inverted sign of the precomputed Montgomery constant µ. This condition is based on the fact that if D − E is negative (produces a borrow), then the modulus is added to make the result positive.
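The following sequential Python model (ours) follows the data flow of Algorithm 4 with radix 2^32, running the two computations one after the other instead of in two SIMD lanes; it is meant only to make the word-level steps and the final conditional addition concrete, and all names are illustrative.

W = 32
R = 1 << W
MASK = R - 1

def mont_mul_2way(A, B, M, n):
    """Return A * B * 2^(-32n) mod M for an odd n-word modulus M (Algorithm 4 data flow)."""
    mu = pow(M, -1, R)                      # inverted sign: +M^(-1) mod 2^32
    a = [(A >> (W * i)) & MASK for i in range(n)]
    b = [(B >> (W * i)) & MASK for i in range(n)]
    m = [(M >> (W * i)) & MASK for i in range(n)]
    d = [0] * n                             # computation 1: accumulates the a_j * B terms
    e = [0] * n                             # computation 2: accumulates the q * M terms
    for j in range(n):
        q = (mu * b[0] * a[j] + mu * (d[0] - e[0])) & MASK
        t0 = (a[j] * b[0] + d[0]) >> W      # the two discarded low words are equal...
        t1 = (q * m[0] + e[0]) >> W         # ...so D - E is unaffected by dropping them
        for i in range(1, n):
            p0 = a[j] * b[i] + t0 + d[i]
            p1 = q * m[i] + t1 + e[i]
            t0, d[i - 1] = p0 >> W, p0 & MASK
            t1, e[i - 1] = p1 >> W, p1 & MASK
        d[n - 1], e[n - 1] = t0, t1
    D = sum(di << (W * i) for i, di in enumerate(d))
    E = sum(ei << (W * i) for i, ei in enumerate(e))
    C = D - E
    return C + M if C < 0 else C            # conditional addition instead of subtraction

M_mod = (1 << 127) - 1                      # odd, n = 4 words of 32 bits
A, B, n = 2**100 + 3141592653, 2**120 + 2718281828, 4
assert mont_mul_2way(A, B, M_mod, n) == (A * B * pow(1 << (32 * n), -1, M_mod)) % M_mod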

4.1.1 Expected performance

We follow the analysis of the expected performance from [17], which just considers execution time. The idea is to perform an analysis of the number of arithmetic instructions as an indicator of the expected performance when using a 2-way SIMD implementation instead of a regular (non-SIMD) implementation for the classical ALU. We assume the 2-way SIMD implementation works on pairs of 32-bit words in parallel and has access to a 2-way SIMD 32 × 32 → 64-bit multiplication instruction. A comparison to a regular implementation is not straightforward since the word-size can be different, the platform might be able to compute multiple instructions in parallel (on different ALUs) and the number of instructions per arithmetic operation might differ. This is why we present a simplified comparison based on the number of arithmetic operations when computing Montgomery multiplication using a 32n-bit modulus for a positive even integer n.


Table 1: A simplified comparison, only stating the number of word-level instructions required, to compute the Montgomery multiplication when using a 32n-bit modulus for a positive even integer n. Two approaches are shown, a sequential one on the classical ALU (Algorithm 2 on page 5) and a parallel one using 2-way SIMD instructions (performing two operations in parallel, cf. Algorithm 4 on the previous page).

Instruction      | Classical 32-bit | Classical 64-bit | 2-way SIMD 32-bit
add              | -                | -                | n
sub              | -                | -                | n
shortmul         | n                | n/2              | 2n
muladd           | 2n               | n                | -
muladdadd        | 2n(n − 1)        | n(n/2 − 1)       | -
SIMD muladd      | -                | -                | n
SIMD muladdadd   | -                | -                | n(n − 1)

We denote by muladd_w(e, a, b, c) and muladdadd_w(e, a, b, c, d) the computation of e = ab + c and e = ab + c + d, respectively, for 0 ≤ a, b, c, d < 2^w (and thus 0 ≤ e < 2^{2w}). These are basic operations on a computer architecture which works on w-bit words. Some platforms have these operations as a single instruction (e.g., on some ARM architectures) while others implement this using separate multiplication and addition instructions (as on the x86 platform). Furthermore, let shortmul_w(e, a, b) denote e = ab mod 2^w: this w × w → w-bit multiplication only computes the least significant w bits of the result and is faster than computing a full double-word product on most modern computer platforms.

Table 1 summarizes the expected performance of Algorithm 2 on page 5 and Algorithm 4 on the preceding page in terms of arithmetic operations only. The shifting and masking operations are omitted for simplicity as well as the operations required to compute the final conditional subtraction or addition. When just taking the muladd and muladdadd instructions into account it becomes clear from Table 1 that the SIMD approach uses exactly half the number of instructions compared to the 32-bit classical implementation and almost twice as many operations compared to the classical 64-bit implementations. However, the SIMD approach requires more operations to compute the value of q every iteration and has various other overheads (e.g., inserting and extracting values from the large vector registers). Hence, when assuming that all the characteristics of the SIMD and classical (non-SIMD) instructions are identical, which is most likely not the case on most platforms, then we expect Algorithm 4 running on a 2-way 32-bit SIMD unit to outperform a classical 32-bit implementation using Algorithm 2 by at most a factor of two while being roughly half as fast as a classical 64-bit implementation.

4.2 A column-wise SIMD approach

A different approach, suitable for computing Montgomery multiplication on architectures supporting 4-way SIMD instructions, is outlined by the first chapter author and Kaihara in [13]. This approach is particularly efficient on the Cell Broadband Engine (see a brief introduction to this architecture below), since it was designed for usage on this platform, but can be used on any platform supporting the SIMD instructions used in this approach. This approach differs from the one described in the previous section in that it uses the SIMD instructions to compute the multi-precision arithmetic in parallel, so it works on a lower level, while the approach from Section 4.1 above computes the arithmetic operations itself sequentially but divides the work into two steps which can be computed concurrently.

4.2.1 The Cell broadband engine

The Cell Broadband Engine (cf. the introductions given by Hofstee [41] and Gschwind [34]), denoted by “Cell” and jointly developed by Sony, Toshiba, and IBM, is a powerful heterogeneous multiprocessor which was released in 2005. The Cell contains a Power Processing Element, a dual-threaded Power architecture-based 64-bit processor with access to a 128-bit AltiVec/VMX single instruction, multiple data (SIMD) unit (which is not considered in this chapter). Its main processing power, however, comes from eight Synergistic Processing Elements (SPEs). For an introduction to the circuit design see the work by Takahashi et al. [73]. Each SPE consists of a Synergistic Processing Unit (SPU), 256 KB of private memory called Local Store (LS), and a Memory Flow Controller (MFC). To avoid the complexity of sending explicit direct memory access requests to the MFC, all code and data must fit within the LS.

Each SPU runs independently from the others at 3.192 GHz and is equipped with a large register file containing 128 registers of 128 bits each. Most SPU instructions work on 128-bit operands denoted as quadwords. The instruction set is partitioned into two sets: one set consists of (mainly) 4- and 8-way SIMD arithmetic instructions on 32-bit and 16-bit operands respectively, while the other set consists of instructions operating on the whole quadword (including the load and store instructions) in a single instruction, single data (SISD) manner. The SPU is an asymmetric processor; each of these two sets of instructions is executed in a separate pipeline, denoted by the even and odd pipeline for the SIMD and SISD instructions, respectively. For instance, the {4, 8}-way SIMD left-rotate instruction is an even instruction, while the instruction left-rotating the full quadword is dispatched into the odd pipeline. When dependencies are avoided, a single pair consisting of one odd and one even instruction can be dispatched every clock cycle.

One of the first applications of the Cell processor was to serve as the brain of Sony’s PlayStation 3 game console. Due to the wide-spread availability of this game console and the fact that one could install and run one’s own software this platform has been used to accelerate cryptographic operations [27, 26, 23, 57, 18, 9] as well as cryptanalytic algorithms [69, 16, 14].

4.2.2 Montgomery multiplication on the Cell broadband engine

In this section we outline the approach presented by the first author of this chapter and Kaihara tailored towards the instruction set of the Cell Broadband Engine. Most notably, the presented techniques rely on an efficient 4-way SIMD instruction to multiply two 16-bit integers and add another 16-bit integer to the 32-bit result, and a large register file. Therefore, the approach described here uses a radix r = 2^16 to divide the large numbers into words that match the input sizes of the 4-SIMD multipliers of the Cell. This can easily be adapted to any other radix size for different platforms with different SIMD instructions.

The idea is to represent integers X in a radix-2^16 system, i.e., X = Σ_{i=0}^{n} x_i·2^{16i} where 0 ≤ x_i < 2^16. However, in order to use the 4-way SIMD instructions of this platform efficiently these 16-bit digits x_i are stored in a 32-bit datatype. The usage of this 32-bit space is to ensure that intermediate values of the form ab + c do not produce any carries since when 0 ≤ a, b, c < 2^16 then 0 ≤ ab + c < 2^32. Hence, given an odd 16n-bit modulus N, then a Montgomery residue X, such that 0 ≤ X < 2N < 2^{16(n+1)}, is represented using s = ⌈(n+1)/4⌉ vectors of 128 bits. Note that this representation uses roughly twice the number of bits when compared to storing it in a “normal” radix representation.


Figure 1: The 16-bit words x_i of a 16(n + 1)-bit positive integer X = Σ_{i=0}^{n} x_i·2^{16i} < 2N are stored column-wise using s = ⌈(n+1)/4⌉ 128-bit vectors X_j on the SPE architecture.

The single additional 16-bit word is required because the intermediate accumulating result of Montgomery multiplication can be almost as large as 2N (see page 9 in Section 2.4).

The 16-bit digits x_i are placed column-wise in the four 32-bit datatypes of the 128-bit vectors. This representation is illustrated in Figure 1. The four 32-bit parts of the j-th 128-bit vector X_j are denoted by

X_j = {X_j[3], X_j[2], X_j[1], X_j[0]}.

Each of the (n + 1) 16-bit words x_i of X is stored in the most significant 16 bits of X_{i mod s}[⌊i/s⌋].

The motivation for using this column-wise representation is that a division by 2^16 can be computed efficiently: simply move the digits in vector X_0 “one position to the right”, which in practice means a logical 32-bit right shift, and relabel the indices such that X_j becomes X_{j−1}, for 1 ≤ j ≤ s − 1, and the modified vector X_0 becomes the new X_{s−1}. Algorithm 5 on the next page computes Montgomery multiplication using such a 4-way column-wise SIMD representation.
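A purely illustrative Python model of this packing and of the division by 2^16 may help; each 128-bit vector is modelled as a list of four 32-bit lanes, and all names are ours.

from math import ceil

def pack(X, n):
    s = ceil((n + 1) / 4)
    vecs = [[0] * 4 for _ in range(s)]
    for i in range(n + 1):
        vecs[i % s][i // s] = ((X >> (16 * i)) & 0xFFFF) << 16   # high half of the lane
    return vecs, s

def unpack(vecs, s, n):
    return sum((vecs[i % s][i // s] >> 16) << (16 * i) for i in range(n + 1))

def div_by_2_16(vecs):
    # X_0 is shifted "one position to the right" (its lanes move down one place,
    # dropping the lane that held x_0) and becomes the new X_{s-1}; every other
    # X_j is relabelled as X_{j-1}.
    shifted = [vecs[0][1], vecs[0][2], vecs[0][3], 0]
    return vecs[1:] + [shifted]

X, n = 0x0123_4567_89AB_CDEF_0011_2233_4455_6677, 7     # eight 16-bit words
vecs, s = pack(X, n)
assert unpack(vecs, s, n) == X
assert unpack(div_by_2_16(vecs), s, n) == X >> 16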

In each iteration, the indices of the vectors that contain the accumulating partial product U change cyclically among the s registers. In Algorithm 5, each 16-bit word of the inputs X, Y and N and the output Z is stored in the upper part (the most significant 16 bits) of each of the four 32-bit words in a 128-bit vector. The vector µ contains the replicated values of −N^{−1} mod 2^16 in the lower 16-bit positions of the four 32-bit words. In its most significant 16-bit positions, the temporary vector K stores the replicated values of y_i, i.e., each of the parsed coefficients of the multiplier Y corresponding to the i-th iteration of the main loop. The operation A ← muladd(B, c, D), which is a single instruction on the SPE, represents the operation of multiplying the vector B (where data are stored in the higher 16-bit positions of 32-bit words) by a vector with replicated 16-bit values of c across all higher positions of the 32-bit words. This product is added to D (in 4-way SIMD manner) and the overall result is placed into A.

The temporary vector V stores the replicated values of u_0 in the least significant 16-bit words. This u_0 refers to the least significant 16-bit word of the updated value of U, where U = Σ_{j=0}^{n} u_j·2^{16j} and is stored as s vectors of 128 bits U_{i mod s}, U_{(i+1) mod s}, ..., U_{(i+n) mod s} (where i refers to the index of the main loop). The vector Q is computed as an element-wise logical left shift by 16 bits of the 4-SIMD product of vectors V and µ.

The propagation of the higher 16-bit carries of U_{(i+j) mod s} as stated in lines 10 and 18 of Algorithm 5 consists of extracting the higher 16-bit words of these vectors and placing them into the lower 16-bit positions of temporary vectors.


Algorithm 5 Montgomery multiplication algorithm for the Cell

Input: N represented by s 128-bit vectors N_{s−1}, ..., N_0, such that 2^{16(n−1)} ≤ N < 2^{16n}, 2 ∤ N;
X, Y each represented by s 128-bit vectors X_{s−1}, ..., X_0 and Y_{s−1}, ..., Y_0, such that 0 ≤ X, Y < 2N;
µ: a 128-bit vector containing (−N)^{−1} (mod 2^16) replicated in all 4 elements.
Output: Z represented by s 128-bit vectors Z_{s−1}, ..., Z_0, such that Z ≡ X·Y·2^{−16(n+1)} mod N, 0 ≤ Z < 2N.

1: for j = 0 to s − 1 do
2:   U_j = 0
3: end for
4: for i = 0 to n do
5:   /* lines 6–9 compute U = y_i·X + U */
6:   K = {y_i, y_i, y_i, y_i}
7:   for j = 0 to s − 1 do
8:     U_{(i+j) mod s} = muladd(X_j, K, U_{(i+j) mod s})
9:   end for
10:  Carry propagation on U_{(i+j) mod s} for j = 0, ..., s − 1 (see text)
11:  /* lines 12–13 compute Q = µ·V mod 2^16 */
12:  V = {u_0, u_0, u_0, u_0}
13:  Q = shiftleft(mul(V, µ), 16)
14:  /* lines 15–17 compute U = N·Q + U */
15:  for j = 0 to s − 1 do
16:    U_{(i+j) mod s} = muladd(N_j, Q, U_{(i+j) mod s})
17:  end for
18:  Carry propagation on U_{(i+j) mod s} for j = 0, ..., s − 1 (see text)
19:  /* line 20 computes the division by 2^16 */
20:  U_{i mod s} = vshiftright(U_{i mod s}, 32)
21: end for
22: Carry propagation on U_{i mod s} for i = n + 1, ..., 2n + 1 (see text)
23: for j = 0 to s − 1 do
24:   Z_j = U_{(n+j+1) mod s}
25: end for

16-bit positions of temporary vectors. These vectors are then added to the “next” vector U(i+ j+1) mod s

correspondingly. The operation is carried out for the vectors with indices j ∈ {0, 1, . . . , s − 2}. Forj = s − 1, the last index, the temporary vector that contains the words is logically shifted 32 bitsto the left and added to the vector Ui mod s. Similarly, the carry propagation of the higher wordsof U(i+ j) mod s in line 22 of Algorithm 5 is performed with 16-bit word extraction and addition, butrequires a sequential parsing over the (n + 1) 16-bit words.

Hence, the approach outlined in Algorithm 5 computes Montgomery multiplication by computing the multi-word multiplications using SIMD instructions and representing the integers using a column-wise approach (see Figure 1). This approach comes at the cost that a single 16n-bit integer is represented by 128⌈(n+1)/4⌉ bits: requiring slightly over twice the amount of storage.


Note, however, that an implementation of this technique outperforms the native multi-precision big-number library on the Cell processor by a factor of about 2.5, as summarized in [13].
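Abstracting away the vector layout and the explicit carry handling, Algorithm 5 performs an interleaved radix-2^16 Montgomery multiplication with n + 1 iterations whose result stays in [0, 2N). The following scalar Python sketch is our own simplification (no SIMD lanes, carries absorbed by Python's arbitrary-precision integers, hypothetical function name); it mirrors that loop and can serve as a reference when testing a vectorized implementation.

# Scalar model of the radix-2^16 interleaved loop that Algorithm 5 vectorizes.
# This is our simplification: no SIMD lanes, carries handled by Python integers.

def montgomery_mul_16(x, y, n_mod, n):
    """Return z with z = x*y*2^(-16*(n+1)) mod n_mod and 0 <= z < 2*n_mod,
    for odd n_mod < 2^(16*n) and inputs 0 <= x, y < 2*n_mod."""
    mu = (-pow(n_mod, -1, 1 << 16)) & 0xFFFF   # -n_mod^(-1) mod 2^16
    u = 0
    for i in range(n + 1):                     # n+1 iterations since inputs may be as large as 2N
        y_i = (y >> (16 * i)) & 0xFFFF         # i-th 16-bit digit of y
        u += y_i * x                           # U = y_i * X + U
        q = ((u & 0xFFFF) * mu) & 0xFFFF       # Q = mu * (U mod 2^16) mod 2^16
        u = (u + q * n_mod) >> 16              # U = (U + Q*N) / 2^16 (exact division)
    return u                                   # u < 2*n_mod whenever 4*n_mod < 2^(16*(n+1))

# Example: a 128-bit toy modulus, so n = 8 and the Montgomery radix is 2^(16*9).
N = (1 << 127) + 1                             # an odd toy modulus
n = 8
R = 1 << (16 * (n + 1))
x, y = 2**100 + 12345, 2**90 + 6789
z = montgomery_mul_16(x, y, N, n)
assert z < 2 * N and z % N == (x * y * pow(R, -1, N)) % N

The extra iteration (n + 1 instead of n) is what keeps the output below 2N for inputs below 2N, matching the input and output conditions of Algorithm 5.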

4.3 Montgomery multiplication using the residue number system representation

The residue number system (RNS), as introduced by Garner [31] and Merrill [53], is an approach, based on the Chinese remainder theorem, to represent an integer as a number of residues modulo smaller (coprime) integers. The advantage of RNS is that additions, subtractions and multiplications can be performed independently and concurrently on these smaller residues. Given an RNS basis βn = {r1, r2, . . . , rn}, where gcd(ri, rj) = 1 for i ≠ j, the RNS modulus is defined as R = ∏_{i=1}^{n} ri. Given an integer x ∈ Z/RZ and the RNS basis βn, this integer x is represented as an n-tuple ~x = (x1, x2, . . . , xn) where xi = x mod ri for 1 ≤ i ≤ n. In order to convert an n-tuple back to its integer value one can apply the Chinese remainder theorem (CRT)

x = ( ∑_{i=1}^{n} xi · ((R/ri)^{−1} mod ri) · (R/ri) ) mod R.    (7)
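As a small illustration, the forward map to RNS and the CRT-based reconstruction of Equation (7) can be written in a few lines of Python (the helper names to_rns and from_rns are ours, for illustration only):

# RNS conversion and CRT reconstruction following Equation (7).
from math import prod

def to_rns(x, base):
    """Represent x by its residues modulo the pairwise coprime r_i in `base`."""
    return [x % r for r in base]

def from_rns(residues, base):
    """Reconstruct x mod R (R = product of the r_i) via the CRT, Equation (7)."""
    R = prod(base)
    x = 0
    for x_i, r_i in zip(residues, base):
        R_i = R // r_i
        x += x_i * pow(R_i, -1, r_i) * R_i     # x_i * ((R/r_i)^(-1) mod r_i) * (R/r_i)
    return x % R

base = [3, 7, 13]                              # the basis used in Example 4 below
assert from_rns(to_rns(189, base), base) == 189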

Modular multiplication using Montgomery multiplication in the RNS setting has been studied, for instance, by Posch and Posch in [61] and by Bajard, Didier, and Kornerup in [4] and subsequent work. In this section we outline how to achieve this. First note that for the applications in which we are interested, we cannot use the modulus N as the RNS modulus, since N is either prime (in the setting of elliptic curve cryptography) or a product of two large primes (when using RSA). When computing the Montgomery reduction one has to perform arithmetic modulo the Montgomery radix. One possible approach is to use the RNS modulus R = ∏_{i=1}^{n} ri as the Montgomery radix. This has the advantage that whenever one computes with integers represented in this residue number system they are reduced modulo R implicitly. However, since we are performing arithmetic in the ring Z/RZ, this means that division by R, as required in the Montgomery reduction, is not well-defined.

One way this problem can be circumvented is by introducing an auxiliary basis β′n = {r′1, r′2, . . . , r′n} with auxiliary RNS modulus R′ = ∏_{i=1}^{n} r′i such that

gcd(R′, R) = gcd(R, N) = gcd(R′, N) = 1

(and both R and R′ are larger than 4N). The idea is to convert the intermediate result represented in βn to the auxiliary basis β′n and perform the division by R there (since R and R′ are coprime, this inverse exists).

The concept of base extension, converting the representation from one basis to another, goes back to Szabo and Tanaka [71] (but see also the work by Gregory and Matula in [33]). Methods are either based on the CRT, as used by Shenoy and Kumaresan [67], Posch and Posch [60], and Kawamura, Koike, Sano, and Shimbo [46], or on an intermediate representation known as a mixed-radix system, as presented by Szabo and Tanaka in [71]. Carefully selected RNS bases can significantly impact the performance in practice, as shown by Bajard, Kaihara, and Plantard in [3] and Bigou and Tisserand in [8]. Another RNS approach is presented by Phillips, Kong, and Lim [59].

With these two RNS bases defined we can compute the Montgomery product modulo N. Let R = ∏_{i=1}^{n} ri and R′ = ∏_{i=1}^{n} r′i be the two RNS moduli for the RNS bases βn and β′n, respectively. The representations of N in βn and β′n are denoted by ~N and ~N′. Let A, B ∈ Z/NZ be represented in both RNS bases: βn (~a and ~b) and β′n (~a′ and ~b′). Then we can compute the Montgomery product in both these RNS bases using the following steps.


1. Compute the product of A and B in both RNS bases βn and β′n:

~d = ~a · ~b, where di = ai bi mod ri, for 1 ≤ i ≤ n,
~d′ = ~a′ · ~b′, where d′i = a′i b′i mod r′i, for 1 ≤ i ≤ n.

2. Compute (ab)(−N^{−1} mod R) mod R. This is realized by computing ~q = −~N^{−1} · ~d in basis βn as qi = −Ni^{−1} di mod ri for 1 ≤ i ≤ n.

3. Convert ~q in basis βn to ~q′ in basis β′n (for instance using Equation (7)).

4. Compute the final part ~c′ = (~d′ + ~q′ · ~N′) · ~R^{−1} of the Montgomery multiplication (including the division by R) in basis β′n by computing c′i = (d′i + q′i N′i) R^{−1} mod r′i for 1 ≤ i ≤ n.

5. Convert ~c′ in basis β′n to ~c in basis βn (for instance using Equation (7)).

After step 5 we have ~c = {c1, c2, . . . , cn} and ~c′ = {c′1, c′2, . . . , c′n} such that

ci ≡ (abR^{−1} mod N) mod ri and c′i ≡ (abR^{−1} mod N) mod r′i.

This approach has been used to implement asymmetric cryptography on highly parallel computer architectures like graphics processing units (e.g., as in [5, 56, 72, 38, 2]). The results presented in these papers show that when multiple Montgomery multiplications are computed concurrently using RNS, the latency can be reduced significantly while the throughput is increased (compared to computation on a multi-core CPU) when computing with thousands of threads on the hundreds of cores on a graphics processing unit. This highlights the potential of using graphics cards as cryptographic accelerators when large batches of work require processing (and a low latency is required). The process is illustrated in the example below.

Example 4: RNS Montgomery multiplication

Let A = 42, B = 17 and N = 67. In this example we show how to compute the Montgomery product ABR^{−1} mod N using a residue number system. Let us first define the two coprime RNS bases

β3 = {3, 7, 13} with RNS modulus R = 3 · 7 · 13 = 273,
β′3 = {5, 11, 17} with RNS modulus R′ = 5 · 11 · 17 = 935,

such that both RNS moduli are larger than 4N. Recall that R plays the role of both the RNS modulus as well as of the Montgomery radix. First, we need to represent the inputs A and B in both RNS bases: A = 42 is represented as

• ~a = {42 mod 3, 42 mod 7, 42 mod 13} = {0, 0, 3} in basis β3,

• ~a′ = {42 mod 5, 42 mod 11, 42 mod 17} = {2, 9, 8} in basis β′3,

and B = 17 is represented as

• ~b = {17 mod 3, 17 mod 7, 17 mod 13} = {2, 3, 4} in basis β3,

• ~b′ = {17 mod 5, 17 mod 11, 17 mod 17} = {2, 6, 0} in basis β′3.


Furthermore, we need to represent the precomputed Montgomery constant µ = −N^{−1} mod R = −67^{−1} mod 273 = 110 in basis β3

~µ = {110 mod 3, 110 mod 7, 110 mod 13} = {2, 5, 6},

as well as the modulus N in basis β′3

~N′ = {67 mod 5, 67 mod 11, 67 mod 17} = {2, 1, 16}.

Compute the first step: the product of these two numbers in both bases. This can be done in parallel for all the individual moduli

• ~d = ~a · ~b = {0 · 2 mod 3, 0 · 3 mod 7, 3 · 4 mod 13} = {0, 0, 12},

• ~d′ = ~a′ · ~b′ = {2 · 2 mod 5, 9 · 6 mod 11, 8 · 0 mod 17} = {4, 10, 0}.

Next, compute ~q = ~d · ~µ in basis β3

~q = ~d · ~µ = {0 · 2 mod 3, 0 · 5 mod 7, 12 · 6 mod 13} = {0, 0, 7}.

Change the representation of q: convert ~q = {0, 0, 7}, which is represented in basis β3, to ~q′ which is represented in basis β′3. This can be done by first converting ~q back to its integer representation following Equation (7) on page 21

q = ( 0 · ((273/3)^{−1} mod 3) · (273/3) + 0 · ((273/7)^{−1} mod 7) · (273/7) + 7 · ((273/13)^{−1} mod 13) · (273/13) ) mod 273
  = 0 + 0 + 7 · 5 · 21 mod 273 = 189.

From q obtain ~q′ = {189 mod 5, 189 mod 11, 189 mod 17} = {4, 2, 2}. The final step computes the result c in basis β′3 as ~c′ = (~d′ + ~q′ · ~N′) · ~R^{−1}:

~c′ = ({4, 10, 0} + {4, 2, 2} · {2, 1, 16}) · {2, 5, 1} = {4, 5, 15}.

When converting this to the integer representation we obtain

c = ( 4 · ((935/5)^{−1} mod 5) · (935/5) + 5 · ((935/11)^{−1} mod 11) · (935/11) + 15 · ((935/17)^{−1} mod 17) · (935/17) ) mod 935
  = 374 + 170 + 440 mod 935 = 49.


This is indeed correct since

c = ABR^{−1} mod N = 42 · 17 · (3 · 7 · 13)^{−1} mod 67 = 49.
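The five steps can also be traced in code. The following Python sketch reproduces the numbers of Example 4; it is our own illustration with hypothetical function names, and it performs the two base extensions by full CRT reconstruction (Equation (7)) rather than by one of the dedicated base-extension methods cited above.

# RNS Montgomery multiplication following steps 1-5, traced on Example 4.
from math import prod

def crt(res, base):
    """Reconstruct the integer (mod prod(base)) from its residues via Equation (7)."""
    R = prod(base)
    return sum(x * pow(R // r, -1, r) * (R // r) for x, r in zip(res, base)) % R

def rns_montgomery(A, B, N, base, base2):
    R = prod(base)
    a, a2 = [A % r for r in base], [A % r for r in base2]
    b, b2 = [B % r for r in base], [B % r for r in base2]
    # Step 1: d = a*b in both bases.
    d  = [x * y % r for x, y, r in zip(a, b, base)]
    d2 = [x * y % r for x, y, r in zip(a2, b2, base2)]
    # Step 2: q = d * (-N^{-1}) in the basis beta_n (i.e. modulo R).
    q = [x * (-pow(N, -1, r)) % r for x, r in zip(d, base)]
    # Step 3: base extension of q from beta_n to beta'_n via the CRT.
    q_int = crt(q, base)
    q2 = [q_int % r for r in base2]
    # Step 4: c' = (d' + q'*N') * R^{-1} in beta'_n (the exact division by R).
    c2 = [(x + y * (N % r)) * pow(R, -1, r) % r for x, y, r in zip(d2, q2, base2)]
    # Step 5: base extension of c' back to beta_n.
    c_int = crt(c2, base2)
    return c_int, [c_int % r for r in base], c2

c, c_vec, c2_vec = rns_montgomery(42, 17, 67, [3, 7, 13], [5, 11, 17])
assert c == 49 and c2_vec == [4, 5, 15]        # matches Example 4
assert c % 67 == 42 * 17 * pow(273, -1, 67) % 67

As in the text, all the per-residue operations inside steps 1, 2 and 4 are independent and could be executed concurrently; only the base extensions require combining the residues.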

Acknowledgements

The authors want to thank Robert Granger, Arjen K. Lenstra, Paul Leyland, and Colin D. Walter for feedback and suggestions. Peter Montgomery’s authorship is honoris causa; all errors and inconsistencies in this chapter are the sole responsibility of the first author.

References

[1] T. Acar and D. Shumow. Modular reduction without pre-computation for special moduli. Technical report, Microsoft Research, 2010. (Cited on page 12.)

[2] S. Antão, J.-C. Bajard, and L. Sousa. RNS-based elliptic curve point multiplication for massive parallel architectures. The Computer Journal, 55(5):629–647, 2012. (Cited on page 22.)

[3] J. Bajard, M. E. Kaihara, and T. Plantard. Selected RNS bases for modular multiplication. In J. D. Bruguera, M. Cornea, D. D. Sarma, and J. Harrison, editors, 19th IEEE Symposium on Computer Arithmetic – ARITH 2009, pages 25–32. IEEE Computer Society, 2009. (Cited on page 21.)

[4] J.-C. Bajard, L.-S. Didier, and P. Kornerup. An RNS Montgomery modular multiplication algorithm. IEEE Trans. Computers, 47(7):766–776, 1998. (Cited on page 21.)

[5] J.-C. Bajard and L. Imbert. A full RNS implementation of RSA. IEEE Transactions on Computers, 53(6):769–774, June 2004. (Cited on page 22.)

[6] P. Barrett. Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor. In A. M. Odlyzko, editor, Advances in Cryptology – CRYPTO’86, volume 263 of Lecture Notes in Computer Science, pages 311–323. Springer, Heidelberg, Aug. 1987. (Cited on page 2.)

[7] D. J. Bernstein. Curve25519: New Diffie-Hellman speed records. In M. Yung, Y. Dodis, A. Kiayias, and T. Malkin, editors, PKC 2006: 9th International Conference on Theory and Practice of Public Key Cryptography, volume 3958 of Lecture Notes in Computer Science, pages 207–228. Springer, Heidelberg, Apr. 2006. (Cited on page 11.)

[8] K. Bigou and A. Tisserand. Single base modular multiplication for efficient hardware RNS implementations of ECC. In T. Güneysu and H. Handschuh, editors, Cryptographic Hardware and Embedded Systems – CHES 2015, volume 9293 of Lecture Notes in Computer Science, pages 123–140. Springer, Heidelberg, Sept. 2015. (Cited on page 21.)

[9] J. W. Bos. High-performance modular multiplication on the Cell processor. In M. A. Hasan and T. Helleseth, editors, Workshop on the Arithmetic of Finite Fields – WAIFI 2010, volume 6087 of Lecture Notes in Computer Science, pages 7–24. Springer, 2010. (Cited on pages 14 and 18.)


[10] J. W. Bos, C. Costello, H. Hisil, and K. Lauter. Fast cryptography in genus 2. In T. Johansson and P. Q. Nguyen, editors, Advances in Cryptology – EUROCRYPT 2013, volume 7881 of Lecture Notes in Computer Science, pages 194–210. Springer, Heidelberg, May 2013. (Cited on pages 11 and 12.)

[11] J. W. Bos, C. Costello, H. Hisil, and K. Lauter. High-performance scalar multiplication using 8-dimensional GLV/GLS decomposition. In G. Bertoni and J.-S. Coron, editors, Cryptographic Hardware and Embedded Systems – CHES 2013, volume 8086 of Lecture Notes in Computer Science, pages 331–348. Springer, Heidelberg, Aug. 2013. (Cited on page 12.)

[12] J. W. Bos, C. Costello, P. Longa, and M. Naehrig. Selecting elliptic curves for cryptography: an efficiency and security analysis. J. Cryptographic Engineering, 6(4):259–286, 2016. (Cited on page 12.)

[13] J. W. Bos and M. E. Kaihara. Montgomery multiplication on the Cell. In R. Wyrzykowski, J. Dongarra, K. Karczewski, and J. Wasniewski, editors, Parallel Processing and Applied Mathematics – PPAM 2009, volume 6067 of Lecture Notes in Computer Science, pages 477–485. Springer, Heidelberg, 2010. (Cited on pages 17 and 21.)

[14] J. W. Bos, M. E. Kaihara, T. Kleinjung, A. K. Lenstra, and P. L. Montgomery. Solving a 112-bit prime elliptic curve discrete logarithm problem on game consoles using sloppy reduction. International Journal of Applied Cryptography, 2(3):212–228, 2012. (Cited on page 18.)

[15] J. W. Bos, T. Kleinjung, A. K. Lenstra, and P. L. Montgomery. Efficient SIMD arithmetic modulo a Mersenne number. In E. Antelo, D. Hough, and P. Ienne, editors, IEEE Symposium on Computer Arithmetic – ARITH-20, pages 213–221. IEEE Computer Society, 2011. (Cited on page 11.)

[16] J. W. Bos, T. Kleinjung, R. Niederhagen, and P. Schwabe. ECC2K-130 on Cell CPUs. In D. J. Bernstein and T. Lange, editors, AFRICACRYPT 10: 3rd International Conference on Cryptology in Africa, volume 6055 of Lecture Notes in Computer Science, pages 225–242. Springer, Heidelberg, May 2010. (Cited on page 18.)

[17] J. W. Bos, P. L. Montgomery, D. Shumow, and G. M. Zaverucha. Montgomery multiplication using vector instructions. In T. Lange, K. Lauter, and P. Lisonek, editors, SAC 2013: 20th Annual International Workshop on Selected Areas in Cryptography, volume 8282 of Lecture Notes in Computer Science, pages 471–489. Springer, Heidelberg, Aug. 2014. (Cited on pages 13, 14, 15, and 16.)

[18] J. W. Bos and D. Stefan. Performance analysis of the SHA-3 candidates on exotic multi-core architectures. In S. Mangard and F.-X. Standaert, editors, Cryptographic Hardware and Embedded Systems – CHES 2010, volume 6225 of Lecture Notes in Computer Science, pages 279–293. Springer, Heidelberg, Aug. 2010. (Cited on page 18.)

[19] A. Bosselaers, R. Govaerts, and J. Vandewalle. Comparison of three modular reduction functions. In D. R. Stinson, editor, Advances in Cryptology – CRYPTO’93, volume 773 of Lecture Notes in Computer Science, pages 175–186. Springer, Heidelberg, Aug. 1994. (Cited on page 2.)

[20] R. P. Brent and P. Zimmermann. Modern Computer Arithmetic. Cambridge University Press, 2010. (Cited on page 12.)


[21] E. F. Brickell. A fast modular multiplication algorithm with application to two key cryptography. In D. Chaum, R. L. Rivest, and A. T. Sherman, editors, Advances in Cryptology – CRYPTO’82, pages 51–60. Plenum Press, New York, USA, 1982. (Cited on page 14.)

[22] Ç. K. Koç, T. Acar, and B. S. Kaliski Jr. Analyzing and comparing Montgomery multiplication algorithms. IEEE Micro, 16(3):26–33, 1996. (Cited on page 6.)

[23] H.-C. Chen, C.-M. Cheng, S.-H. Hung, and Z.-C. Lin. Integer number crunching on the Cell processor. International Conference on Parallel Processing, pages 508–515, 2010. (Cited on page 18.)

[24] J. Chung and M. A. Hasan. Montgomery reduction algorithm for modular multiplication using low-weight polynomial form integers. In 18th IEEE Symposium on Computer Arithmetic (ARITH-18), pages 230–239. IEEE Computer Society, 2007. (Cited on page 12.)

[25] S. Cook. On the minimum computation time of functions. PhD thesis, Harvard University, 1966. (Cited on page 6.)

[26] N. Costigan and P. Schwabe. Fast elliptic-curve cryptography on the Cell broadband engine. In B. Preneel, editor, AFRICACRYPT 09: 2nd International Conference on Cryptology in Africa, volume 5580 of Lecture Notes in Computer Science, pages 368–385. Springer, Heidelberg, June 2009. (Cited on page 18.)

[27] N. Costigan and M. Scott. Accelerating SSL using the vector processors in IBM’s Cell broadband engine for Sony’s PlayStation 3. Cryptology ePrint Archive, Report 2007/061, 2007. http://eprint.iacr.org/2007/061. (Cited on page 18.)

[28] B. Dixon and A. K. Lenstra. Massively parallel elliptic curve factoring. In R. A. Rueppel, editor, Advances in Cryptology – EUROCRYPT’92, volume 658 of Lecture Notes in Computer Science, pages 183–193. Springer, Heidelberg, May 1993. (Cited on pages 12 and 14.)

[29] S. R. Dussé and B. S. Kaliski Jr. A cryptographic library for the Motorola DSP56000. In I. Damgård, editor, Advances in Cryptology – EUROCRYPT’90, volume 473 of Lecture Notes in Computer Science, pages 230–244. Springer, Heidelberg, May 1991. (Cited on pages 6, 7, and 8.)

[30] M. Fürer. Faster integer multiplication. In D. S. Johnson and U. Feige, editors, 39th Annual ACM Symposium on Theory of Computing, pages 57–66. ACM Press, June 2007. (Cited on page 6.)

[31] H. L. Garner. The residue number system. In Papers Presented at the March 3-5, 1959, Western Joint Computer Conference, IRE-AIEE-ACM ’59 (Western), pages 146–153, New York, NY, USA, 1959. ACM. (Cited on page 21.)

[32] R. Granger and A. Moss. Generalised Mersenne numbers revisited. Math. Comput., 82(284):2389–2420, 2013. (Cited on page 12.)

[33] R. T. Gregory and D. W. Matula. Base conversion in residue number systems. In T. R. N. Rao and D. W. Matula, editors, 3rd IEEE Symposium on Computer Arithmetic – ARITH 1975, pages 117–125. IEEE Computer Society, 1975. (Cited on page 21.)


[34] M. Gschwind. The Cell broadband engine: Exploiting multiple levels of parallelism in a chip multiprocessor. International Journal of Parallel Programming, 35:233–262, 2007. (Cited on page 18.)

[35] S. Gueron and V. Krasnov. Fast prime field elliptic-curve cryptography with 256-bit primes. J. Cryptographic Engineering, 5(2):141–151, 2015. (Cited on page 13.)

[36] G. Hachez and J.-J. Quisquater. Montgomery exponentiation with no final subtractions: Improved results. In Ç. K. Koç and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES 2000, volume 1965 of Lecture Notes in Computer Science, pages 293–301. Springer, Heidelberg, Aug. 2000. (Cited on page 9.)

[37] M. Hamburg. Fast and compact elliptic-curve cryptography. Cryptology ePrint Archive, Report 2012/309, 2012. http://eprint.iacr.org/2012/309. (Cited on page 12.)

[38] O. Harrison and J. Waldron. Efficient acceleration of asymmetric cryptography on graphics hardware. In B. Preneel, editor, AFRICACRYPT 09: 2nd International Conference on Cryptology in Africa, volume 5580 of Lecture Notes in Computer Science, pages 350–367. Springer, Heidelberg, June 2009. (Cited on page 22.)

[39] L. Hars. Long modular multiplication for cryptographic applications. In M. Joye and J.-J. Quisquater, editors, Cryptographic Hardware and Embedded Systems – CHES 2004, volume 3156 of Lecture Notes in Computer Science, pages 45–61. Springer, Heidelberg, Aug. 2004. (Cited on page 12.)

[40] K. Hensel. Theorie der algebraischen Zahlen. Teubner, Leipzig, 1908. (Cited on page 2.)

[41] H. P. Hofstee. Power efficient processor architecture and the Cell processor. In High-Performance Computer Architecture – HPCA 2005, pages 258–262. IEEE, 2005. (Cited on page 18.)

[42] K. Iwamura, T. Matsumoto, and H. Imai. Systolic-arrays for modular exponentiation using Montgomery method (extended abstract) (rump session). In R. A. Rueppel, editor, Advances in Cryptology – EUROCRYPT’92, volume 658 of Lecture Notes in Computer Science, pages 477–481. Springer, Heidelberg, May 1993. (Cited on page 14.)

[43] M. E. Kaihara and N. Takagi. Bipartite modular multiplication. In J. R. Rao and B. Sunar, editors, Cryptographic Hardware and Embedded Systems – CHES 2005, volume 3659 of Lecture Notes in Computer Science, pages 201–210. Springer, Heidelberg, Aug. / Sept. 2005. (Cited on page 14.)

[44] A. A. Karatsuba and Y. Ofman. Multiplication of many-digital numbers by automatic computers. Doklady Akad. Nauk SSSR, 145(2):293–294, 1962. Translation in Physics-Doklady 7, pp. 595–596, 1963. (Cited on page 6.)

[45] E. Käsper. Fast elliptic curve cryptography in OpenSSL. In G. Danezis, S. Dietrich, and K. Sako, editors, FC 2011 Workshops, volume 7126 of Lecture Notes in Computer Science, pages 27–39. Springer, Heidelberg, Feb. / Mar. 2012. (Cited on page 11.)

[46] S. Kawamura, M. Koike, F. Sano, and A. Shimbo. Cox-Rower architecture for fast parallel Montgomery multiplication. In B. Preneel, editor, Advances in Cryptology – EUROCRYPT 2000, volume 1807 of Lecture Notes in Computer Science, pages 523–538. Springer, Heidelberg, May 2000. (Cited on page 21.)

[47] M. Kneževic, F. Vercauteren, and I. Verbauwhede. Speeding up bipartite modular multiplication. In M. A. Hasan and T. Helleseth, editors, Arithmetic of Finite Fields – WAIFI, volume 6087 of Lecture Notes in Computer Science, pages 166–179. Springer, 2010. (Cited on page 12.)

[48] D. E. Knuth. Seminumerical Algorithms. The Art of Computer Programming. Addison-Wesley, Reading, Massachusetts, USA, 3rd edition, 1997. (Cited on page 2.)

[49] N. Koblitz. Elliptic curve cryptosystems. Mathematics of Computation, 48(177):203–209, 1987. (Cited on page 11.)

[50] Ç. K. Koç and T. Acar. Montgomery multiplication in GF(2^k). Designs, Codes and Cryptography, 14(1):57–69, 1998. (Cited on page 10.)

[51] P. C. Kocher, J. Jaffe, and B. Jun. Differential power analysis. In M. J. Wiener, editor, Advances in Cryptology – CRYPTO’99, volume 1666 of Lecture Notes in Computer Science, pages 388–397. Springer, Heidelberg, Aug. 1999. (Cited on pages 2 and 8.)

[52] A. K. Lenstra. Generating RSA moduli with a predetermined portion. In K. Ohta and D. Pei, editors, Advances in Cryptology – ASIACRYPT’98, volume 1514 of Lecture Notes in Computer Science, pages 1–10. Springer, Heidelberg, Oct. 1998. (Cited on page 12.)

[53] R. D. Merrill. Improving digital computer performance using residue number theory. Electronic Computers, IEEE Transactions on, EC-13(2):93–101, April 1964. (Cited on page 21.)

[54] V. S. Miller. Use of elliptic curves in cryptography. In H. C. Williams, editor, Advances in Cryptology – CRYPTO’85, volume 218 of Lecture Notes in Computer Science, pages 417–426. Springer, Heidelberg, Aug. 1986. (Cited on page 11.)

[55] P. L. Montgomery. Modular multiplication without trial division. Mathematics of Computation, 44(170):519–521, April 1985. (Cited on pages 1, 4, 6, and 7.)

[56] A. Moss, D. Page, and N. P. Smart. Toward acceleration of RSA using 3D graphics hardware. In S. D. Galbraith, editor, 11th IMA International Conference on Cryptography and Coding, volume 4887 of Lecture Notes in Computer Science, pages 364–383. Springer, Heidelberg, Dec. 2007. (Cited on page 22.)

[57] D. A. Osvik, J. W. Bos, D. Stefan, and D. Canright. Fast software AES encryption. In S. Hong and T. Iwata, editors, Fast Software Encryption – FSE 2010, volume 6147 of Lecture Notes in Computer Science, pages 75–93. Springer, Heidelberg, Feb. 2010. (Cited on page 18.)

[58] D. Page and N. P. Smart. Parallel cryptographic arithmetic using a redundant Montgomery representation. IEEE Trans. Computers, 53(11):1474–1482, 2004. (Cited on page 14.)

[59] B. J. Phillips, Y. Kong, and Z. Lim. Highly parallel modular multiplication in the residue number system using sum of residues reduction. Appl. Algebra Eng. Commun. Comput., 21(3):249–255, 2010. (Cited on page 21.)

[60] K. Posch and R. Posch. Base extension using a convolution sum in residue number systems. Computing, 50(2):93–104, 1993. (Cited on page 21.)


[61] K. C. Posch and R. Posch. Modulo reduction in residue number systems. IEEE Trans. Parallel Distrib. Syst., 6(5):449–454, 1995. (Cited on page 21.)

[62] Q. Pu and X. Zhao. Montgomery exponentiation with no final comparisons: Improved results. In Pacific-Asia Conference on Circuits, Communications and Systems, pages 614–616. IEEE, 2009. (Cited on page 9.)

[63] R. L. Rivest, A. Shamir, and L. M. Adleman. A method for obtaining digital signatures and public-key cryptosystems. Communications of the Association for Computing Machinery, 21(2):120–126, 1978. (Cited on pages 7 and 9.)

[64] A. Schönhage and V. Strassen. Schnelle Multiplikation großer Zahlen. Computing, 7(3-4):281–292, 1971. (Cited on page 6.)

[65] H. Seo, Z. Liu, J. Großschädl, J. Choi, and H. Kim. Montgomery modular multiplication on ARM-NEON revisited. In J. Lee and J. Kim, editors, Information Security and Cryptology – ICISC 2014, volume 8949 of Lecture Notes in Computer Science, pages 328–342. Springer, 2015. (Cited on pages 13 and 14.)

[66] M. Shand and J. Vuillemin. Fast implementations of RSA cryptography. In E. E. S. Jr., M. J. Irwin, and G. A. Jullien, editors, 11th Symposium on Computer Arithmetic, pages 252–259. IEEE Computer Society, 1993. (Cited on pages 2 and 9.)

[67] A. Shenoy and R. Kumaresan. Fast base extension using a redundant modulus in RNS. Computers, IEEE Transactions on, 38(2):292–297, 1989. (Cited on page 21.)

[68] J. A. Solinas. Generalized Mersenne numbers. Technical Report CORR 99–39, Centre for Applied Cryptographic Research, University of Waterloo, 1999. (Cited on page 11.)

[69] M. Stevens, A. Sotirov, J. Appelbaum, A. K. Lenstra, D. Molnar, D. A. Osvik, and B. de Weger. Short chosen-prefix collisions for MD5 and the creation of a rogue CA certificate. In S. Halevi, editor, Advances in Cryptology – CRYPTO 2009, volume 5677 of Lecture Notes in Computer Science, pages 55–69. Springer, Heidelberg, Aug. 2009. (Cited on page 18.)

[70] A. Svoboda. An algorithm for division. Information processing machines, 9(25-34):28, 1963. (Cited on page 12.)

[71] N. S. Szabo and R. I. Tanaka. Residue arithmetic and its applications to computer technology. McGraw-Hill, 1967. (Cited on page 21.)

[72] R. Szerwinski and T. Güneysu. Exploiting the power of GPUs for asymmetric cryptography. In E. Oswald and P. Rohatgi, editors, Cryptographic Hardware and Embedded Systems – CHES 2008, volume 5154 of Lecture Notes in Computer Science, pages 79–99. Springer, Heidelberg, Aug. 2008. (Cited on page 22.)

[73] O. Takahashi, R. Cook, S. Cottier, S. H. Dhong, B. Flachs, K. Hirairi, A. Kawasumi, H. Murakami, H. Noro, H. Oh, S. Onish, J. Pille, and J. Silberman. The circuit design of the synergistic processor element of a Cell processor. In International Conference on Computer-Aided Design – ICCAD 2005, pages 111–117. IEEE Computer Society, 2005. (Cited on page 18.)

[74] A. Toom. The complexity of a scheme of functional elements realizing the multiplication of integers. Soviet Mathematics Doklady, 3(4):714–716, 1963. (Cited on page 6.)


[75] U.S. Department of Commerce/National Institute of Standards and Technology. Digital Signature Standard (DSS). FIPS-186-4, 2013. http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.186-4.pdf. (Cited on page 11.)

[76] C. D. Walter. Systolic modular multiplication. IEEE Transactions on Computers, 42(3):376–378, Mar. 1993. (Cited on page 14.)

[77] C. D. Walter. Montgomery exponentiation needs no final subtractions. Electronics Letters, 35(21):1831–1832, Oct. 1999. (Cited on page 9.)

[78] C. D. Walter. Precise bounds for Montgomery modular multiplication and some potentially insecure RSA moduli. In B. Preneel, editor, Topics in Cryptology – CT-RSA 2002, volume 2271 of Lecture Notes in Computer Science, pages 30–39. Springer, Heidelberg, Feb. 2002. (Cited on page 9.)

[79] C. D. Walter and S. Thompson. Distinguishing exponent digits by observing modular subtractions. In D. Naccache, editor, Topics in Cryptology – CT-RSA 2001, volume 2020 of Lecture Notes in Computer Science, pages 192–207. Springer, Heidelberg, Apr. 2001. (Cited on pages 9 and 10.)
