Accelerating Fully Homomorphic Encryption in Hardware

0018-9340 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TC.2014.2345388, IEEE Transactions on Computers


Yarkın Doroz, Erdinc Ozturk, Berk Sunar

Abstract—We present a custom architecture for realizing the Gentry-Halevi fully homomorphic encryption (FHE) scheme. This contribution presents the first full realization of FHE in hardware. The architecture features an optimized multi-million bit multiplier based on the Schönhage-Strassen multiplication algorithm. Moreover, a number of optimizations, including spectral techniques and a precomputation strategy, are used to significantly improve the performance of the overall design. When synthesized using 90 nm technology, the presented architecture realizes the encryption, decryption, and recryption operations in 18.1 msec, 16.1 msec, and 3.1 sec, respectively, and occupies a footprint of less than 30 million gates.

Index Terms—Fully homomorphic encryption, cryptographic accelerators, large-number multiplication.


1 INTRODUCTION

One of the most significant developments in cryptography in the last few years has been the introduction of the first fully homomorphic encryption (FHE) scheme by Gentry [5]. Gentry's lattice-based scheme appears to be secure and hence settles an open problem posed by Rivest et al. in 1978 [6].

Since addition and multiplication in any non-trivial ring constitute a universal set of gates, a fully homomorphic encryption scheme allows one to employ untrusted computing resources without risk of revealing sensitive data. Using FHE, computation carried out directly on ciphertexts carries over to the underlying plaintexts; therefore private computation becomes possible. FHE holds great promise for numerous applications including private information retrieval and search, data aggregation, electronic voting, biometrics, etc. For instance, using FHE we may keep sensitive databases, e.g. medical and financial records, in encrypted form and perform private queries without any risk of compromising privacy. In general, by deploying FHE we may mitigate the vulnerabilities stemming from imperfect software.

Unfortunately, FHE has not yet sufficiently matured to be used in real-life deployments. The bandwidth due to ciphertext expansion is prohibitive. More significantly, after every few bit operations the ciphertext needs to be homomorphically re-encrypted to manage the growth in noise. Recryption is a computationally expensive operation that takes on the order of seconds even on high-end platforms. In [3], an FPGA implementation draft for improving the speed of FHE primitives was proposed without any implementation results. Here we take a snapshot of the attainable hardware performance of the Gentry-Halevi FHE variants. We are motivated by the fact that the first commercial implementations of most public key cryptographic schemes, e.g. RSA and Elliptic Curve DSA, have been via hardware products due to the inefficiency of general purpose computers. We speculate that a similar growth pattern will emerge in the maturation process of FHEs. While the efficiency shortcomings of FHEs are being worked out, we would like to pose the following question: How far are FHE schemes from being offered as a hardware product? Clearly, answering this question would be a major undertaking, deserving the evaluation of all FHE variants with numerous implementation techniques. Here we only provide initial results. During the progression of this work, potentially more efficient schemes appeared in the literature, e.g. see [7], [8], [9], [10]. Since most of these schemes are defined over ideal rings, they also rely on arithmetic with large numbers or polynomials with large coefficients. The IP cores we developed for fast number theoretic transform based multipliers with efficient modular reduction may also be used to accelerate these newer schemes.

This work was in part supported by NSF Awards #1117590 and #1319130. Y. Doroz and B. Sunar are with Worcester Polytechnic Institute. E-mail: {ydoroz, sunar}@wpi.edu. E. Ozturk is with Istanbul Commerce University. E-mail: [email protected].

In this paper, we tackle the performance problem head-on by introducing a custom ASIC design for the Gentry-Halevi FHE. To the best of our knowledge this is the first ASIC realization of the full scheme (excluding key generation). Our hardware architecture supports the encryption, decryption and recryption primitives for the 2048-dimension instantiation of the Gentry-Halevi scheme. We utilize a number of optimizations, including reformulation of the operations, use of spectral techniques and precomputation to speed up the arithmetic operations. Another contribution of independent interest is the number theoretic transform based fast million-bit multiplier, which lies at the heart of all the primitives.

2 BACKGROUND

2.1 Gentry's Fully Homomorphic Scheme

A high-level description of Gentry's scheme is as follows. The scheme is based on identifying ideals $I$ in polynomial quotient rings $\mathbb{Z}[x]/(f(x))$ (with $\deg(f) = n$) with Euclidean lattices $L_I \subseteq \mathbb{R}^n$ by mapping each residue polynomial $r(x) = a_0 + \cdots + a_{n-1}x^{n-1}$ to its vector of coefficients $(a_0, \ldots, a_{n-1})$. Gentry calls these objects ideal lattices. Ideal lattices provide additive and multiplicative homomorphisms modulo a public-key ideal. We obtain an encryption procedure Encrypt such that $\mathrm{Encrypt}(x_1) + \mathrm{Encrypt}(x_2) = \mathrm{Encrypt}(x_1 + x_2)$ and $\mathrm{Encrypt}(x_1) \cdot \mathrm{Encrypt}(x_2) = \mathrm{Encrypt}(x_1 \cdot x_2)$. Therefore, any circuit $C$ with an efficient description can be evaluated homomorphically. However, this somewhat homomorphic encryption scheme (SWHE) is not perfect. Due to the noisy nature of the scheme, the noise term in the partial result grows with each homomorphic gate evaluation. After the evaluation of only a logarithmic depth circuit, the decryption fails to recover the correct result. To make the scheme work, Gentry uses a number of tricks. He introduces a re-encryption procedure called Recrypt that takes a noisy ciphertext and returns a noise-reduced version. In a brilliant move, Gentry manages to obtain Recrypt again from the SWHE scheme by simply homomorphically evaluating the decryption circuit using encrypted secret key bits on the noisy ciphertext. To make this work, the SWHE needs to be able to handle circuits that are deeper than its own decryption circuit before the level of noise becomes too large. SWHE schemes with this property are called bootstrappable.

2.2 The Gentry-Halevi FHE

Smart and Vercauteren specialized Gentry's scheme to principal-ideal lattices, and forced the determinant of the lattice to be a prime number [15]. While this specialization improves the efficiency, it does not allow construction of the full scheme including bootstrapping and Recrypt for practical key sizes [15]. Gentry and Halevi remove the primality restriction by introducing a special Hermite normal form for the bases. Further optimizations, such as choosing sparse polynomials, batching polynomial evaluations, and customized resultant and inversion algorithms for $f(x) = x^{2^l} \pm 1$, allowed the first software implementation of an FHE scheme. We give a high-level description of the primitives as follows. Let $\lceil N \rfloor$ denote the round-to-nearest-integer operation, $[N]_d = (N \bmod d) - d$, and $[N] = \{0, 1, \ldots, N-1\}$.

Key Generation. The key generation phase is rather involved but can be summarized in the following steps:

1) Set $f_n(x) = x^n \pm 1$. Choose a random $n = 2^\theta$-dimensional integer lattice represented by a randomly chosen polynomial $v(x)$, where the coefficients $v_i$ are chosen from the set of $t$-bit signed integers.

2) Compute $w(x)$ such that $w(x)v(x) = d \pmod{f_n(x)}$, where $d$ represents a constant integer. This task may be achieved by using the polynomial version of the Extended Euclidean Algorithm.¹

3) Compute $r = w_0/w_1 \pmod d$ and check whether $w_i = w_{i+1} r \pmod d$ for all $i = 1, \ldots, n-2$. If the inverse of $w_1$ does not exist, restart the key generation procedure by picking a new random polynomial $v(x)$.

4) In order to facilitate reencryption, randomly choose bit-vectors $\sigma_i$ for $i = 0, \ldots, S-1$, where each vector has Hamming weight one. Choose $w'$ as any one of the odd coefficients of $w(x)$. Randomly choose $x_i \in \mathbb{Z}_d$ for $i = 0, \ldots, s-1$ such that $\sum_{j=0}^{s-1} \sum_{i=0}^{S-1} \sigma_i(j)\, x_j R^i \pmod d = w'$. The parameter $R \in \mathbb{Z}$ may be chosen as a power of 2.

5) Let $l = \lceil 2\sqrt{S} \rceil$. For reencryption, pick bits $\eta_{i,j}$ for $i \in [s]$, $j \in [l]$, where $\eta_{i,j}$ has Hamming weight 2 when viewed as an $l$-dimensional vector. Then encrypt each to obtain $\beta_{i,j} = \mathrm{Encrypt}(\eta_{i,j})$.

6) The public key is $PK = (r, d, \{x_i : i \in [s]\}, \{\beta_{i,j} : i \in [s], j \in [l]\})$ and the secret key is $SK = (w', \sigma_0, \sigma_1, \ldots, \sigma_{S-1})$.

1. Note that Section 4 of [15] presents a significantly more efficient technique for computing $w(x)$.

Encryption. To encrypt a bit $m \in \{0, 1\}$, first choose an $n$-th degree sparse random polynomial $u(x)$ with coefficients from $\{0, 1, -1\}$, where each coefficient is 0 with probability $\rho$. Using the PK parameters $(r, d)$, encryption is computed as follows:

$$\mathrm{Encrypt}(m) = \left[ m + 2 \sum_{i=0}^{n-1} u_i r^i \right]_d .$$

When multiple bits are to be encrypted, one may batch the computation, yielding a significant speedup: e.g. $k$ encryptions may be computed at a cost only $O(\sqrt{k})$ times more than a single bit encryption using simultaneous polynomial evaluation.

Decryption. We decrypt a ciphertext $c \in \mathbb{Z}_d$ using the secret key $SK = (w_i)$ simply by computing a modular multiplication as $\mathrm{Decrypt}(c) = [c\,w_i]_d \pmod 2$.

Recryption. The goal of recryption is to remove the noise buildup experienced during homomorphic circuit evaluations. We may only evaluate circuits to a constant (small) depth depending on the specific choice of parameters. To continue homomorphic evaluations we apply the recrypt procedure. Informally, recrypt works by homomorphically decrypting the ciphertext using encrypted secret key bits. A given ciphertext $c$ is recrypted by taking the following steps:

1) Compute $y_{j,i} = c\,x_j R^i \pmod d$ for $i = 0, \ldots, S-1$ and $j = 0, \ldots, s-1$.

2) Compute $z_{j,i} = y_{j,i}/d$ as the $p = \lceil \log_2(s+1) \rceil$ bit approximation to the right of the binary point.

3) For $j \in [s]$ compute the quotients $q_j = \sum_{a \in [l]} \beta_{j,a} \left( \sum_{b \in [l]} \beta_{j,b}\, z_{j,i(a,b)} \right) \pmod d$, where the index function is defined as $i(a,b) = a\,l - \binom{a+1}{2} + (b - a)$. Note that the $\beta_{j,a} z_{j,i}$ products are realized as conditional additions in $\mathbb{Z}_d$ (since the $z_{j,i}$ are bits in cleartext), and only the product of the result of the inner summation with $\beta_{j,a}$ requires multiplication in $\mathbb{Z}_d$.

4) Finally, the reencryption of $c \in \mathbb{Z}_d$ is achieved by homomorphically evaluating the decryption circuit on $c$ in encrypted form. Using a number of optimizations, the decryption operation is expressed in the following form:

$$\mathrm{Decrypt}_{SK}(c) = \sum_{j \in [s]} \sum_{i \in [S]} \sigma_j(i)\, z_{j,i} + \sum_{j \in [s],\, i \in [l]} \sigma_j(i)\,(y_{j,i}) \pmod 2 .$$

Note that the inputs $d$ and $y_{j,i}$ are in cleartext form, while the secret key bits $\sigma_j$ are in encrypted form during the evaluation of the decryption circuit. The first summation is homomorphically computed on the individual bits (in encrypted form) via grade school addition of $s$ fixed point numbers expressed using $p$ bits. Therefore, in the computation of the actual recrypt operation, the $\sigma_j(i)$ are replaced with their recoded and encrypted form, i.e. $\beta_{j,i}$, and the inner summation of the first term is replaced with $q_j$. During homomorphic evaluation the (mod 2) additions and multiplications turn into additions and multiplications in $\mathbb{Z}_d$, respectively. The depth of the circuit evaluating the carry output may be shown to be bounded by $O(s^2)$. Hence, we end up computing on the order of $O(s^2)$ multiplications in $\mathbb{Z}_d$ to figure out the carry bit and reflect it to the LSB in encrypted form using a $\mathbb{Z}_d$ addition. The second sum multiplies bits by ciphertexts in $\mathbb{Z}_d$.

2.3 Number Theoretic Transform Based Arithmetic

The Number Theoretic Transform (NTT) is a special form of the Fourier Transform over rings. Because it works with exact modular arithmetic, it eliminates the rounding errors that floating point arithmetic introduces into the conventional Fourier Transform. We use the NTT as the backbone of the million-bit arithmetic operations. It is especially effective for large integer multiplication (in the million-bit range), which lies at the heart of all the primitives. Common multiplication schemes (the Karatsuba algorithm [24], the schoolbook multiplication method) become infeasible for large integer multiplications. The Schönhage-Strassen algorithm [18] is one of the asymptotically fastest algorithms for very large numbers. It has been shown that it outperforms classic schemes for operand sizes larger than $2^{17}$ bits [20]. FFT-based large integer multiplier architectures were presented in [21], [22], [17].

Schönhage-Strassen Algorithm. The Schönhage-Strassen algorithm is an NTT-based large integer multiplication algorithm with a runtime of $O(N \log N \log\log N)$ [18]. For an $N$-digit number, the NTT is computed using the ring $R_N = \mathbb{Z}/(2^N + 1)\mathbb{Z}$, where $N$ is a power of 2. A summary of the algorithm is as follows; for an in-depth review of the Schönhage-Strassen algorithm see [19]. We split the numbers $A$ and $B$ into $N$ digits $a_j$ and $b_j$ using a sampling size $\varepsilon$. The selected $p$ is a prime number with a primitive root $w$, i.e. $w^N \equiv 1 \pmod p$. Then, we can represent the NTT forms of the numbers as $A_k = \sum_{j=0}^{N-1} w^{jk} a_j$ and $B_k = \sum_{j=0}^{N-1} w^{jk} b_j$. Next, the components are multiplied to form $c_k = A_k \cdot B_k \bmod p$. Using the inverse NTT (INTT) we compute $C_k = \sum_{j=0}^{N-1} w^{-jk} c_j$. In the last step, we accumulate the carry additions to finalize the evaluation of $C$. To realize the Schönhage-Strassen algorithm efficiently, it is crucial to employ fast NTT and INTT computation techniques. We adopted the most common method for computing Fast Fourier Transforms (FFTs), i.e. the Cooley-Tukey FFT algorithm [16]. The algorithm computes the Fourier Transform of a sequence $X$ as $X_k = \sum_{j=0}^{N-1} x_j e^{-2\pi i k j / N}$ by turning the length-$N$ transform computation into two Fourier Transform computations of size $N/2$ over the even-indexed and odd-indexed inputs as follows:

$$X_k = \sum_{m=0}^{N/2-1} x_{2m}\,\theta^m + e^{-2\pi i k / N} \sum_{m=0}^{N/2-1} x_{2m+1}\,\theta^m ,$$

where $\theta = e^{-2\pi i k/(N/2)}$. We replace $e^{-2\pi i k / N}$ with powers of $w$ and recursively divide the input into two halves with odd and even indices. With the use of this fast transform technique, we can evaluate the Schönhage-Strassen multiplication algorithm in $O(N \log N \log\log N)$ time.
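The flow described above (split into digits, transform, multiply pointwise, inverse transform, accumulate carries) can be illustrated at toy scale. The following is a minimal Python sketch, not the hardware design: it assumes the prime $p = 2^{64} - 2^{32} + 1$ used later in the paper, a transform length of 64 with 16-bit digits, and the root $w = 8$ (which has order 64 because 2 has order 192 modulo $p$). The function names `ntt` and `ntt_multiply` are hypothetical.

```python
P = 2**64 - 2**32 + 1   # prime modulus used later in the paper
N_DIGITS = 64           # toy transform length (assumption; the paper uses 3 * 2**15)
DIGIT_BITS = 16         # toy digit size (assumption)
W = 8                   # 2 has order 192 mod P, so 8 = 2**3 has order 64


def ntt(a, w):
    """Recursive Cooley-Tukey transform of a (length a power of two) with root w."""
    n = len(a)
    if n == 1:
        return list(a)
    w2 = w * w % P
    even = ntt(a[0::2], w2)   # transform of even-indexed digits
    odd = ntt(a[1::2], w2)    # transform of odd-indexed digits
    out = [0] * n
    t = 1                     # running power w**k
    for k in range(n // 2):
        x = t * odd[k] % P
        out[k] = (even[k] + x) % P
        out[k + n // 2] = (even[k] - x) % P
        t = t * w % P
    return out


def ntt_multiply(a, b):
    """Multiply two integers below 2**512 with a length-64 NTT over Z_P."""
    limit = 1 << (DIGIT_BITS * N_DIGITS // 2)
    assert 0 <= a < limit and 0 <= b < limit
    mask = (1 << DIGIT_BITS) - 1
    da = [(a >> (DIGIT_BITS * i)) & mask for i in range(N_DIGITS)]
    db = [(b >> (DIGIT_BITS * i)) & mask for i in range(N_DIGITS)]
    fa, fb = ntt(da, W), ntt(db, W)
    fc = [x * y % P for x, y in zip(fa, fb)]      # pointwise products c_k
    c = ntt(fc, pow(W, -1, P))                    # inverse transform, unscaled
    n_inv = pow(N_DIGITS, -1, P)
    result, carry = 0, 0
    for i, v in enumerate(c):                     # scale and accumulate carries
        t = v * n_inv % P + carry
        result |= (t & mask) << (DIGIT_BITS * i)
        carry = t >> DIGIT_BITS
    return result
```

With these toy parameters each convolution coefficient is at most $32 \cdot (2^{16}-1)^2 < p$, so no overflow occurs; for example, `ntt_multiply(3**300, 7**180)` agrees with the built-in product.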

Modular Reduction. We may use the Barrett Modular Reduction (BMR) algorithm [4] to compute $r \equiv x \pmod M$ as follows:

$$r \equiv \underbrace{x \pmod{b^{k+1}}}_{r_1} - \underbrace{\left\lfloor \frac{\lfloor x / b^{k-1} \rfloor\, \mu}{b^{k+1}} \right\rfloor M \pmod{b^{k+1}}}_{r_2} .$$

In the equation, $b$ is the radix and the other parameters are $k = \lfloor \log_b M \rfloor + 1$ and $\mu = \lfloor b^{2k}/M \rfloor$. According to [4], $r$ satisfies the inequality $r < 3M$. Therefore, after evaluating $r$, we first check whether it is negative and, if so, perform $r = r + b^{k+1}$; we then subtract $M$ from $r$ while $r \geq M$.
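To make the reduction steps concrete, here is a minimal Python sketch of the Barrett reduction as formulated above. It is an illustration under the assumption $0 \le x < b^{2k}$ (the usual precondition of the algorithm); the helper names `barrett_setup` and `barrett_reduce` are hypothetical, and Python's big integers stand in for the multi-digit hardware datapath.

```python
def barrett_setup(M, b=2**64):
    """Return (k, mu) with k = floor(log_b M) + 1 and mu = floor(b^(2k) / M)."""
    k = 1
    while b**k <= M:
        k += 1
    mu = b ** (2 * k) // M
    return k, mu


def barrett_reduce(x, M, k, mu, b=2**64):
    """Compute x mod M for 0 <= x < b^(2k) without dividing by M."""
    assert 0 <= x < b ** (2 * k)
    q = ((x // b ** (k - 1)) * mu) // b ** (k + 1)   # quotient estimate
    r1 = x % b ** (k + 1)
    r2 = (q * M) % b ** (k + 1)
    r = r1 - r2
    if r < 0:                 # wrap-around correction
        r += b ** (k + 1)
    while r >= M:             # at most two subtractions, since r < 3M
        r -= M
    return r
```

Divisions and remainders by powers of $b$ are only digit selections, so the two multiplications (by $\mu$ and by $M$) dominate the cost.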

Block-wise Arithmetic. In the later sections of the paper, we refer to block-wise (or block) computations. The term denotes the separation of a large integer in NTT form into computational blocks, each of which contains $l$ digits. In other words, each large integer in NTT form consists of $N/l$ blocks. Since large integers in NTT form are suitable for parallel arithmetic, these blocks are distributed among the arithmetic units.

3 OVERVIEW OF OUR ARCHITECTURE

The overall architecture presented in Figure 1 contains five components: LARGE INTEGER MULTIPLIER, BARRETT REDUCTION UNIT, DECRYPTION COMPLETION UNIT, ENCRYPTION UNIT and RECRYPTION UNIT. These are controlled by the MASTER CONTROL UNIT (MCU). Each of the Encryption, Decryption and Recryption primitives requires large integer multiplications. However, providing a dedicated LARGE INTEGER MULTIPLIER for each primitive is too costly. In our design we incorporate one LARGE INTEGER MULTIPLIER that is shared between these primitives.

To realize each primitive, the MCU controls the units to complete its operation, handling the order of operations and I/O between the units and the external memory. Since the operands are in the range of millions of bits, data transactions between units are impractical. We assume an external memory unit (RAM) in the design for storage. We utilize a 64-bit bus for I/O transactions between the units and the external RAM. The RAM holds the public keys and acts as a shared memory between the units when the primitive operations are realized. The operation time of a unit is then the sum of the time to transfer the data from RAM to the computation unit, the time of the arithmetic operations, and the time needed to write the result back to RAM. With effective addressing and prefetching from RAM, the initial address decoding overhead can be eliminated.

Fig. 1. Overview of the Full Architecture

3.1 Parameter Selection

In the following we explain the details of the parameter selection for Large Integer Arithmetic to support million-bit multiplication. Next, we give parameters for the FHE primitives and show some potential trade-offs.

Large Integer Arithmetic Parameters. The parameters for the NTT-based implementation are based on [23], [1]. We choose a 64-bit word size and a sampling size of $\varepsilon = 2^{24}$ with modulus $p = 2^{64} - 2^{32} + 1$, a Solinas prime [2]. This allows us to realize a modular reduction using a few primitive arithmetic operations. A 128-bit number is denoted as $z = 2^{96}a + 2^{64}b + 2^{32}c + d$. Using the selected $p$, we perform the $z \pmod p$ operation as $2^{32}(b + c) - a - b + d$. The large integer size parameter $N$ is chosen to satisfy $\frac{N}{2}(\varepsilon - 1)^2 < p$ to prevent overflow. Also, $N$ should be the smallest value big enough to cover million-bit multiplication, i.e. 2 million bits $< N \cdot \log_2 \varepsilon$. The best candidate for $N$ is determined as $3 \cdot 2^{15}$.² Given the parameters and the equation $w^N \equiv 1 \bmod p$, we choose $w = 3511764839390700819$.
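The reduction identity follows from $2^{64} \equiv 2^{32} - 1$ and $2^{96} \equiv -1 \pmod p$. The short Python sketch below checks it and the two constraints on $N$; the function name `reduce128` is hypothetical, and the final `% P` stands in for the few conditional additions or subtractions of $p$ that a hardware datapath would use (an assumption about the implementation, not a detail stated in the text).

```python
P = 2**64 - 2**32 + 1      # Solinas prime
EPS_BITS = 24              # sampling size epsilon = 2**24
N = 3 * 2**15              # number of digits


def reduce128(z):
    """Reduce a 128-bit value modulo P using 32-bit limbs a, b, c, d."""
    assert 0 <= z < 1 << 128
    m32 = 0xFFFFFFFF
    d = z & m32
    c = (z >> 32) & m32
    b = (z >> 64) & m32
    a = z >> 96
    r = ((b + c) << 32) - a - b + d   # lies between -2**33 and about 2**65
    return r % P                      # final small correction


# Overflow bound and size requirement quoted in the text.
assert (N // 2) * (2**EPS_BITS - 1) ** 2 < P
assert N * EPS_BITS > 2_000_000
```

The intermediate value `r` needs only shifts, additions and subtractions, which is why this prime is attractive for a multiplier datapath.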

In the Cooley-Tukey FFT, each recursive halving operation is referred to as a stage and is denoted as $S_i$, where $i$ is the stage index. The size of the smallest NTT block is selected as 12 digits and it is referred to as the 0th stage, i.e. $S_0$. The remaining 13 stages are reconstruction stages and require different arithmetic operations from the ones in $S_0$. In terms of INTT operations, every stage and operation is identical to the NTT. The only difference is the selection of $w'$, which is computed as $w' = w^{-1} \bmod p$.

FHE Primitive Parameters. We instantiate the scheme for the smallest parameters in [15]: $n = 2048$, $l = 46$, $s = 15$, $p' = 5$, $S = 512$, $\rho = \frac{2032}{2048}$ and $\log d = 785000$. In the FHE primitives, we include additions in NTT form, i.e. $\frac{u}{2} \sum_{i=1}^{j} u_i (\varepsilon - 1)^2 < p$. The value of $j$ is equal to $S$ and $(1 - \rho) \cdot n$ in Recryption and Encryption, respectively. Therefore, we choose a different sample rate $\varepsilon$ to prevent overflows. Selecting $\varepsilon = 2^{16}$, we support primitive operations up to 786432 bits, which is larger than $\log d$.

4 LARGE INTEGER ARCHITECTURE

The FHE primitives are based on efficient large integer arithmetic. In the following, we give the design details of a large integer multiplier and a modular reduction architecture.

2. We choose the digit size as a power of two, which eases arithmetic computations.

4.1 Large Integer Multiplier

Fig. 2. Overview of the Large Integer Multiplier

Architecture Overview. Our architecture, illustrated in Figure 2, is composed of a data cache, a multiplier control unit, two routing units and a function unit. The architecture is designed to perform a restricted set of special functions. There are four functions for handling the input/output transactions and three functions for arithmetic operations:

• Sequential Load: Stores a million-bit number to the cache.

• Sequential Unload: The cache releases its contents starting from the least significant to the most significant digit.

• Butterfly Load: In the NTT, an important step is the distribution of the digits into the right indices using the butterfly operation.

• Scale & Unload: The function overlaps scaling, i.e. multiplication by $N^{-1} \pmod p$, with carry accumulations and outputs the result.

• 12x12 NTT/INTT: The smallest NTT/INTT computation is for 12 digits. The NTT UNIT³ takes the digits sequentially and computes the 12-digit NTT/INTT by using simple shift and addition operations.

• Stage-Reconstruction: This function is used for the reconstruction of a stage, with the stage index given as input. In order to complete a full reconstruction, it is called for 13 stages.

• Inner-Multiplication: Computes the digit-wise modular multiplications. For this we utilize the multipliers used in the STAGE-RECONSTRUCTION UNITS.

Using the functions outlined above, we can compute the product of million-bit numbers A and B using the following sequence of operations:

1) A is loaded into the cache by using BUTTERFLY LOAD.

2) The NTT of the number A, i.e. NTT(A), is computed by calling first the 12X12 NTT function and afterwards the STAGE-RECONSTRUCTION function for all stages.

3) NTT(A) is stored to RAM using SEQUENTIAL UNLOAD.

4) Using steps 1-2-3 above, we also compute NTT(B).

5) The cache can only hold half of the digits of NTT(A) and NTT(B) together. Therefore, the numbers are divided into lower and upper halves: $\mathrm{NTT}(A) = \{\mathrm{NTT}(A)_h, \mathrm{NTT}(A)_l\}$ and $\mathrm{NTT}(B) = \{\mathrm{NTT}(B)_h, \mathrm{NTT}(B)_l\}$.

6) SEQUENTIAL LOAD stores $\mathrm{NTT}(A)_h$ and $\mathrm{NTT}(B)_h$.

7) INNER MULTIPLICATION computes the modular multiplication of the upper half: $C_h[i] = \mathrm{NTT}(A)_h[i] * \mathrm{NTT}(B)_h[i]$.

8) The result is stored to the RAM by SEQUENTIAL UNLOAD.

9) We repeat the above three steps to compute the lower part: $C_l[i] = \mathrm{NTT}(A)_l[i] * \mathrm{NTT}(B)_l[i]$.

10) The result digits, i.e. $C[i]$, are loaded into the cache by SEQUENTIAL LOAD. At this point the cache contains the multiplication result, but still in NTT form.

11) The result is converted to integer form by using the 12X12 INTT function followed by a complete run of the STAGE-RECONSTRUCTION functions: $C' = \mathrm{INTT}(C[i])$.

12) In the last step, the result is scaled and the carries are accumulated by the SCALE & UNLOAD function to finalize the computation of C: $C[i+1] = C'[i+1] + \lfloor C'[i]/p \rfloor$ and $C[i] = C'[i] \pmod R$.

3. We refer to the 12x12 NTT/INTT Unit as the NTT UNIT.

Multiplier Cache System. The size of the cache is important for the timing of multiplications. In each STAGE-RECONSTRUCTION process of the NTT algorithm, we need to match the indices of odd and even digits. The index difference of the odd and even digits in a reconstruction stage is $S_{i,\mathrm{diff}} = 12 \cdot 2^{i-1}$, where $i$ is the index of the reconstruction stage, i.e. $1 \le i \le 13$. Since in later stages we require digits from distant indices, an adequately sized cache is chosen to reduce the number of input/output transactions between the cache and RAM.

Let $N'$ denote the chosen cache size. Then, we can divide the $N$ digits into $2^t = N/N'$ blocks, i.e. $N = \{N_{2^t-1}, N_{2^t-2}, \ldots, N_0\}$. Once a block is given as input, we can compute the reconstruction stages until $N' < S_{i,\mathrm{diff}}$ for the $i$th stage. Then, starting from the $i$th stage, block $N_j$ requires digits from $N_{j+1}$, where $j$ is the block index. So, we need to divide $N_j$ and $N_{j+1}$ into halves and match the upper halves of $N_j$ with $N_{j+1}$, and the lower halves of $N_j$ with $N_{j+1}$. This matching process adds $2N'$ clock cycles for each block. The total input/output overhead is then evaluated as $2N \cdot \log_2(N/N')$, where $\log_2(N/N')$ is the number of stages that require digit matching from different blocks. In our implementation, we aim to optimize the speed by selecting $N'$ as $N$.
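The trade-off between cache size and I/O overhead can be tabulated directly from the two formulas in this paragraph. The following Python sketch is illustrative only; the function names are hypothetical, and it assumes that $N/N'$ is a power of two as in the text.

```python
def stage_index_difference(i):
    """Index distance between matched odd and even digits in stage i (1..13)."""
    return 12 * 2 ** (i - 1)


def io_overhead_cycles(n_digits, cache_digits):
    """Extra I/O cycles 2N * log2(N / N') caused by cross-block digit matching."""
    blocks = n_digits // cache_digits
    assert blocks * cache_digits == n_digits
    assert (blocks & (blocks - 1)) == 0          # N / N' must be a power of two
    cross_block_stages = blocks.bit_length() - 1  # log2(N / N')
    return 2 * n_digits * cross_block_stages


N = 3 * 2**15
# A full-size cache (N' = N) removes the overhead entirely, as chosen in the paper.
full_cache_overhead = io_overhead_cycles(N, N)
# A quarter-size cache would add 2N cycles for each of the last two stages.
quarter_cache_overhead = io_overhead_cycles(N, N // 4)
```

With $N' = N$ even the largest distance, `stage_index_difference(13)`, stays inside the cache.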

Although a very large cache is important for our design, a straightforward cache implementation is not sufficient to support parallelism. The main arithmetic functions utilized in the multiplication process, such as 12 × 12 NTT/INTT⁴, STAGE-RECONSTRUCTION and INNER MULTIPLICATION, are highly suitable for parallelization.

4. We used one 12 × 12 NTT/INTT unit for this function. For a small number of arithmetic units in the multiplier, e.g. m = 4, the performance gain is 3%. However, for larger m such as 64, the performance gain goes up to 20%.



TABLE 1
Assignment Table

| Stage  | arith0  | arith1  | arith2  | arith3  |
| S1-S10 | sc0-sc1 | sc2-sc3 | sc4-sc5 | sc6-sc7 |
| S11    | sc0-sc1 | sc2-sc3 | sc4-sc5 | sc6-sc7 |
| S12    | sc0-sc2 | sc1-sc3 | sc4-sc6 | sc5-sc7 |
| S13    | sc0-sc4 | sc1-sc5 | sc2-sc6 | sc3-sc7 |

To achieve parallelization, the cache should be able to sustain the required bandwidth for multiple units. In order to sustain the bandwidth, we build the cache by combining small, equal-size caches, which we refer to as sub-caches. By combining these sub-caches at the top level, we can operate the cache either as a single-cache or as a multi-cache system. For linear functions, such as SEQUENTIAL LOAD, BUTTERFLY LOAD, etc., the cache works as a single cache with one input/output port, whereas for parallel functions it works as a multi-cache system with multiple input/output ports. The number of sub-caches should be equal to $2 \times m$ (double the number of STAGE-RECONSTRUCTION UNITS) to eliminate simultaneous read/write accesses to the same sub-cache in the reconstruction processes. Each sub-cache has a size of $N/(2 \times m)$ and we denote them as $\{sc_0, sc_1, \ldots, sc_{2m-1}\}$.

Routing Unit. The ROUTING UNIT matches the odd and even digits to the arithmetic units. As stated previously, the index difference of the digits is $12 \cdot 2^{i-1}$. Therefore, in the last $\log_2 2m$ reconstruction stages, odd and even digits fall into different sub-caches. The assignment of sub-caches to the proper arithmetic units⁵ for each STAGE-RECONSTRUCTION is shown in Table 1. In the table, arithmetic units are referred to as $arith_i$, where $i$ is the index number.
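The pairing pattern in Table 1 can be derived from the index difference and the sub-cache size. The Python sketch below is one way to reproduce the table for $m = 4$; the function name `subcache_pairs` and the pairing rule (two sub-caches whose indices differ by the stage distance, listed in increasing order) are assumptions inferred from the table rather than a description of the actual routing logic.

```python
def subcache_pairs(stage, m=4, n_digits=3 * 2**15):
    """Return the (sub-cache, sub-cache) pair assigned to arith0..arith(m-1)."""
    sub_size = n_digits // (2 * m)          # digits per sub-cache
    diff = 12 * 2 ** (stage - 1)            # index distance of matched digits
    dist = max(1, diff // sub_size)         # distance measured in sub-caches
    return [(a, a + dist) for a in range(2 * m) if (a & dist) == 0]


# Stage 12: digits are two sub-caches apart, so arith0 reads sc0 and sc2, etc.
stage12 = subcache_pairs(12)
```

For stages 1 to 10 the matched digits lie within one sub-cache, and the sketch simply assigns two neighbouring sub-caches to each unit, as in the first row of the table.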

Function Unit. The FUNCTION UNIT is divided into three parts, i.e. the SCALER UNIT, the NTT UNIT and multiple STAGE-RECONSTRUCTION UNITS.

SCALER UNIT: Denoting the digits as $d_i$ and the carries as $c_i$, the digits of the result are computed as $d_i \times N^{-1} + c_i \pmod p = \{c_{i+1}, r_i\}$. As $N^{-1} \pmod p = \texttt{0xFFFF555455560001}$ is a constant with a special form, we implemented the product using a simple shift-and-add circuit.
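The scaling constant can be checked directly, and the combined scale-and-carry step can be modelled in a few lines. The Python sketch below is a software model only; the function name `scale_and_carry` and the 24-bit digit split of $\{c_{i+1}, r_i\}$ are assumptions for illustration.

```python
P = 2**64 - 2**32 + 1
N = 3 * 2**15
N_INV = pow(N, -1, P)          # modular inverse of N (Python 3.8+)


def scale_and_carry(digits, digit_bits=24):
    """Scale unscaled INTT outputs by N^-1 and fold the carries into an integer."""
    mask = (1 << digit_bits) - 1
    carry, value = 0, 0
    for i, d in enumerate(digits):
        t = (d * N_INV + carry) % P                # d_i * N^-1 + c_i (mod p)
        value |= (t & mask) << (digit_bits * i)    # r_i: low bits stay in place
        carry = t >> digit_bits                    # c_{i+1}: remaining high bits
    return value + (carry << (digit_bits * len(digits)))
```

In this model `N_INV` evaluates to `0xFFFF555455560001`, matching the constant quoted in the text.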

NTT UNIT: The unit computes the 12-digit NTT and INTT using the formula $x_k = \sum_{i=0}^{11} (w')^{ik} \times d_i \bmod p$, where $d_i$ is the given 12-digit input. The parameter $w'$ is set as $w' = w^{2^{13}} \pmod p$ and $w' = (w^{-1})^{2^{13}} \pmod p$ for NTT and INTT operations, respectively. Note that in the NTT $w' = \texttt{0x10000}$ and in the INTT $w' = \texttt{0xFFFEFFFF00010001}$, the inverse of $\texttt{0x10000}$ modulo $p$. These constant multiplications are implemented using a simple add-and-shift circuit. These simple operations can be squeezed into a few clock cycles and pipelined to optimize throughput. Due to pipelining, the 12-digit NTT of the large integer is completed in $N$ clock cycles.
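Because the NTT constant is a power of two, every twiddle multiplication in the 12-digit transform is a shift followed by a reduction. The Python sketch below models the transform as a direct sum; it computes the inverse constant rather than hard-coding it, and includes the $12^{-1}$ scaling in `intt12` so that the round trip can be checked (in the full design the scaling is deferred to the SCALER UNIT). The function names are hypothetical.

```python
P = 2**64 - 2**32 + 1
W12 = 0x10000                   # 2**16, a 12th root of unity since 2**192 = 1 mod P
W12_INV = pow(W12, -1, P)       # inverse root used by the INTT


def ntt12(d, w=W12):
    """Direct 12-point transform x_k = sum_i w^(i*k) * d_i mod P."""
    assert len(d) == 12
    return [sum(pow(w, i * k, P) * d[i] for i in range(12)) % P
            for k in range(12)]


def intt12(x):
    """Inverse 12-point transform, including the 1/12 scaling."""
    inv12 = pow(12, -1, P)
    return [v * inv12 % P for v in ntt12(x, W12_INV)]
```

A round trip such as `intt12(ntt12(list(range(12))))` returns the original digits.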

5. Arithmetic units are the STAGE-RECONSTRUCTION UNITS

STAGE RECONSTRUCTION UNIT: The unit in Figure 3 is responsible for two functions: STAGE-RECONSTRUCTION and INNER MULTIPLICATION. One of these functions is selected by the input sequence given to the control unit. The Arithmetic Logic Unit (ALU) in Figure 4 consists of 32-bit multipliers, adders and a reduction circuit to complete 64-bit modular multiplications.

In INNER MULTIPLICATION, 64-bit numbers are fed into the Odd and Coeff buses. Even is fed with zero, so that the ALU only performs modular multiplication. The ALU can output a modular multiplication product every two clock cycles after the initial startup cost of the pipeline. The whole function takes $\frac{N}{2}$ multiplications, and with $m$ multipliers it costs $\frac{N}{m}$ clock cycles.

Fig. 3. Stage Reconstruction Unit

In STAGE-RECONSTRUCTION we compute

$$O_{i,j} = E_{i-1,j} - O_{i-1,j} \times w_{i-1}^{\,j \bmod n_{i-1}} \pmod p \quad \text{and} \quad E_{i,j} = E_{i-1,j} + O_{i-1,j} \times w_{i-1}^{\,j \bmod n_{i-1}} \pmod p ,$$

where $i$ denotes the stage index from 1 to 13, $j$ denotes the index of the digits, $w_{i-1}$ is the coefficient of stage $i-1$, and $n_{i-1}$ is the modulus used to select the appropriate power of the coefficient. The parameters $n_i$ satisfy $n_{i+1} = 2 \times n_i$ with the initial setting $n_0 = 12$. Ideally, we would store all the coefficients along with the odd digits. However, this would require another large cache of size $(12 + 24 + 48 + \cdots + 49152) \approx N$ digits. Although we save half of the memory by reusing the powers for different stages, the necessity of storing the powers of $w^{-1}$ doubles the size requirement. We reduce the memory requirement by using a memory-time trade-off. The coefficients are computed efficiently as follows:

1) The coefficients required in two consecutive stages are as follows: $S_{i+1}: w_i^0, w_i^1, \ldots, w_i^{n_i}$ and $S_{i+2}: w_{i+1}^0, w_{i+1}^1, \ldots, w_{i+1}^{n_{i+1}}$.

2) Then $S_{i+2}: w_{i+1}^0, w_{i+1}^1, \ldots, w_{i+1}^{2n_i}$, since it holds that $n_{i+1} = 2 \times n_i$.

3) Further, $S_{i+2}: w_i^{0/2}, w_i^{1/2}, \ldots, w_i^{2n_i/2}$, since $w_i = w_{i+1}^2$.

4) This shows that half of the coefficients of $S_{i+2}$ are the same as those of $S_{i+1}$ and the other half are the square roots of the coefficients of $S_{i+1}$.

5) We compute the square roots by multiplying each $w_i^j$ with $w_{i+1}^1$.
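The recurrence in steps 3 to 5 amounts to interleaving the previous stage's powers with the same powers multiplied once by the new root. A minimal Python sketch, assuming the modulus $p = 2^{64} - 2^{32} + 1$ and using small power-of-two roots purely for illustration (the function name `next_stage_powers` is hypothetical):

```python
P = 2**64 - 2**32 + 1


def next_stage_powers(prev_powers, w_next):
    """Given powers w_i^0.. of w_i = w_next^2, return the powers of w_next.

    Even powers of w_next are reused from the previous stage; odd powers
    cost one extra multiplication by w_next each.
    """
    out = []
    for c in prev_powers:
        out.append(c)                  # w_next^(2j)   = w_i^j
        out.append(c * w_next % P)     # w_next^(2j+1) = w_i^j * w_next
    return out


# Toy check: w_next = 8 (order 64 mod P) and w_i = 8**2 = 64 (order 32).
prev = [pow(64, j, P) for j in range(16)]
nxt = next_stage_powers(prev, 8)
```

Only half of each stage's coefficients therefore require a fresh multiplication, which is what the table-based scheme described next exploits.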

Thus, we construct the COEFFICIENT TABLE by storing two columns of coefficients. In the first column, since our smallest computation block is 12, we compute and store all $w_0^j$ coefficients for $11 \ge j \ge 0$. We denote these coefficients as $w_{\mathrm{first},i}$, where $i$ denotes the index of the coefficient. In the second column, for each of the remaining stages we compute and store $w_i^1$. The second column coefficients are denoted by $w_{\mathrm{second},i}$. This makes a total of 24 coefficients which we can use to compute any of the $w_i^j$ values. When we also include the coefficients for the INTT operations, our table contains 48 coefficients. The computation of an arbitrary coefficient using the table can be achieved as $w_i^j = w_{\mathrm{first},l} \times \prod_{t=0}^{i} w_{\mathrm{second},t}^{e}$.

The values of $l$ and $e$ are functions of $i$ and $j$. Also, $e$ is a value equal to 1 or 0. Therefore we can omit the multiplications whenever $e = 0$. The total number of multiplications for evaluating $w_i^j \times O_i$ is computed as follows:

1) In every reconstruction stage we start by multiplying the odd digits with the $w_{\mathrm{first},l}$ coefficients. This step makes a total of $\frac{N}{2}$ multiplications.

2) Apart from the first reconstruction stage, in each stage we also require the coefficients from $w_{\mathrm{second},0}$ to $w_{\mathrm{second},i-1}$. Since we cannot store the coefficients, in each stage we need to rebuild the previous stage coefficients to build up the current ones. We are using half of the previous stage values, so in each stage we need $\frac{N}{4}$ additional multiplications.

3) The total number of multiplications then becomes $\sum_{i=0}^{12} \left( \frac{N}{2} + i \times \frac{N}{4} \right) = 26 \times N$.
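For reference, the closed form in step 3 can be verified by splitting the sum into its constant and linear parts:

```latex
\sum_{i=0}^{12}\left(\frac{N}{2} + i\cdot\frac{N}{4}\right)
  = 13\cdot\frac{N}{2} + \frac{N}{4}\sum_{i=0}^{12} i
  = \frac{13N}{2} + \frac{N}{4}\cdot 78
  = 6.5N + 19.5N
  = 26N .
```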

Fig. 4. ALU of the Stage Reconstruction Unit

Multiplier Control Unit. The MULTIPLIER CONTROL UNIT contains a state machine to complete a large integer multiplication operation. Its main job is to send the correct indices to the Function Units to complete arithmetic functions such as INNER MULTIPLICATION and STAGE-RECONSTRUCTION. The NTT UNIT and SCALE UNIT consist only of datapath, so they do not require any control signals to operate.

The MULTIPLIER CONTROL UNIT also handles input/output addressing of the sub-caches. Sequential functions require incremental addressing for each sub-cache. In the STAGE-RECONSTRUCTION function, the addressing is computed according to the stage level, which is updated with the index range of the dependent odd and even digits.

Performance Analysis. The latency of each functional block in cycles is given in Table 2. In order to perform a complete multiplication we require two complete NTT operations, two INNER MULTIPLICATION operations, and one INTT operation.

4.2 Modular Reduction

In Barrett Reduction, the result is evaluated using two large integer multiplications and a few subtractions. The values $\mu$ and $M$ are stored in NTT form to avoid conversion costs. Selecting $b = 2^{64}$ simplifies the arithmetic. Division and modular operations, such as $\lfloor x/b^{k-1} \rfloor$ and $x \pmod{b^{k+1}}$, are accomplished by reading from different memory addresses. For the division, the bits are read starting from $b^{k-1}$ up to the most significant bit; for the modular operation, the bits are read from the least significant bit up to $b^{k+1}$.

TABLE 2
Clock Cycle Counts of Functional Blocks

Function          Operation                 Cycles
NTT(A), NTT(B)    2 BUTTERFLY LOAD          2N
                  2 12×12 NTT               2N
                  2 STAGE-RECON             26N
                  2 SEQUENTIAL UNLOAD       2N
A×B               2 SEQUENTIAL LOAD         2N
                  2 INNER MULTIPLICATION    N/2
                  2 SEQUENTIAL UNLOAD       2N
INTT(A×B)         BUTTERFLY LOAD            N
                  12×12 INTT                N
                  STAGE-RECON               13N
                  SCALE UNLOAD              N
TOTAL                                       52.5N

Once a Barrett Reduction is requested, the state machine controls the million-bit multiplier, performs the multiplications with $M$ and $\mu$, and computes the values $r_1$ and $r_2$. Then $r_1$ and $r_2$ are loaded for the subtraction, which is evaluated digit by digit by a simple subtracter with a word length of 64 bits. Since reading both $r_1$ and $r_2$ uses a large portion of the bandwidth, a $\kappa$-digit local cache is added to store partial results and prevent I/O collisions from/to the RAM. The local cache is based on two parallel $\kappa/2$-digit FIFOs so that we can output 2 digits per clock. Next, $r$ is checked for being negative, and if so it is corrected by setting the digits above $b^{k+1}$ to zero. For the comparison, $r$ and $M$ are loaded into the comparator unit to decide whether $r \geq M$. The decision is made by a 512-bit comparator. If the first 8 digits are equal, the unit loads the next 8 digits, and continues until the loaded digits differ. If the comparison is true, $r$ is updated as $r = r - M$.

The time for a Barrett Reduction depends heavily on the multiplications; the subtractions have only a small effect. The multiplications are completed in $2 \cdot 36.5N$ clock cycles. The subtractions take $\approx 3 \times 23500$ clock cycles, which we omit. The total time required for a Barrett Reduction is therefore $\approx 73N$.
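As a software reference for the steps above, the following is a minimal sketch of the textbook Barrett reduction with $b = 2^{64}$, written with Python integers. The function names are hypothetical, and the sketch works on plain integers rather than on operands stored in NTT form, so it shows the arithmetic only, not the hardware schedule.

```python
DIGIT_BITS = 64  # b = 2**64, as selected in the text

def barrett_setup(M):
    """Return (k, mu) with k = number of base-b digits of M and mu = floor(b^(2k) / M)."""
    k = (M.bit_length() + DIGIT_BITS - 1) // DIGIT_BITS
    mu = (1 << (2 * k * DIGIT_BITS)) // M
    return k, mu

def barrett_reduce(x, M, k, mu):
    """Compute x mod M for 0 <= x < b^(2k) using two multiplications."""
    assert 0 <= x < (1 << (2 * k * DIGIT_BITS))
    mask = (1 << ((k + 1) * DIGIT_BITS)) - 1
    # floor(x / b^(k-1)) is a read from a higher memory address in hardware.
    q = ((x >> ((k - 1) * DIGIT_BITS)) * mu) >> ((k + 1) * DIGIT_BITS)
    r1 = x & mask                  # x mod b^(k+1)
    r2 = (q * M) & mask            # (q * M) mod b^(k+1)
    r = r1 - r2
    if r < 0:                      # wrap-around correction modulo b^(k+1)
        r += 1 << ((k + 1) * DIGIT_BITS)
    while r >= M:                  # at most two corrective subtractions
        r -= M
    return r

M = (1 << 511) + 12345
k, mu = barrett_setup(M)
x = pow(3, 640, 1 << 1024)
print(barrett_reduce(x, M, k, mu) == x % M)
```

Adding $b^{k+1}$ to a negative difference has the same effect as keeping only the low $k+1$ digits of the two's-complement result, which matches the correction described above.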

5 FHE PRIMITIVES

5.1 Decryption

The decryption operation is a rather simple operation that requires a modular multiplication followed by a modulo 2 reduction, i.e. $\mathrm{Decrypt}(c) = [c \cdot w_i]_d \pmod 2$. During the decryption operation, the MASTER CONTROL UNIT uses the LARGE INTEGER MULTIPLIER and the BARRETT REDUCTION units to realize $[c \cdot w_i]_d$. This is followed by the application of the DECRYPTION COMPLETION UNIT, which contains a simple arithmetic circuit that takes the least significant digit of the modular multiplication result and pads it with zeroes to match the operand length.




We can reduce the large integer multiplication time by storing $w_i$ in NTT form. The conversion operation is then only applied to the ciphertext $c$, so the multiplication takes on the order of $36.5N$ clock cycles. The modulo 2 reduction is realized by reading the last digit, and forming the large integer result with padding takes less than 8,000 clock cycles. Since $8{,}000 \ll 36.5N$ we neglect this quantity. Including the Barrett Reduction, the overall decryption operation takes $109.5N$ clock cycles.
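The cycle counts quoted so far (52.5N, 36.5N, 73N and 109.5N) all follow from the per-block latencies in Table 2. The sketch below reproduces that bookkeeping in units of N; the grouping into `NTT`, `INNER` and `INTT` constants is my own arrangement of the table rows, not a structure taken from the design.

```python
from fractions import Fraction as F

# Per-operation latencies in units of N, read from Table 2.
NTT = F(1) + 1 + 13 + 1        # butterfly load, 12x12 NTT, stage recon, unload
INNER = F(2) + F(1, 2) + 2     # sequential load, inner multiplication, unload
INTT = F(1) + 1 + 13 + 1       # butterfly load, 12x12 INTT, stage recon, scale unload

def multiplication_cycles(operands_already_in_ntt=0):
    """Cycles (in N) for one large multiplication with 0 or 1 operands pre-stored in NTT form."""
    return (2 - operands_already_in_ntt) * NTT + INNER + INTT

FULL_MULT = multiplication_cycles(0)    # both operands converted
HALF_MULT = multiplication_cycles(1)    # one operand stored in NTT form
BARRETT = 2 * HALF_MULT                 # two multiplications with stored mu and M
DECRYPT = HALF_MULT + BARRETT           # multiplication by w_i, then reduction

for name, value in [("mult", FULL_MULT), ("mult, one NTT operand", HALF_MULT),
                    ("Barrett", BARRETT), ("decrypt", DECRYPT)]:
    print(f"{name}: {float(value)}N")
```

Storing one operand in NTT form removes one 16N conversion, which is where the difference between 52.5N and 36.5N comes from.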

5.2 Encryption

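Fig. 5. Encryption Architecture [block diagram: a Control Unit (inputs m, start; storage U holding u_i) drives the signals double, add_bit, operation and read/write to the elements EPE[0], EPE[1], EPE[2], ..., EPE[β−1]; each EPE has a 64-bit input and a 64-bit output, for an aggregate bus width of 64·β bits]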

The most time-consuming part of encryption is evaluating the powers of $r$. In [11], these are computed using a recursive algorithm. While asymptotically faster, such a recursive approach is not suitable for a hardware implementation. Instead we utilize a window-based serial evaluation scheme. Algorithm A′ proposed in [12] is suitable for efficient polynomial evaluation in hardware.

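Algorithm 1: Algorithm A′

1) Define $u(r) = u_0 r^0 + u_1 r^1 + \cdots + u_{n-1} r^{n-1}$ and a window size $k$ such that $k < n$ and $k \mid n$.

2) Group the coefficients of $u(r)$ using powers of $r^k$ as:
   $(u_0 r^0 + \cdots + u_{k-1} r^{k-1})\, r^0 + (u_k r^0 + \cdots + u_{2k-1} r^{k-1})\, r^k + \cdots + (u_{n-k} r^0 + \cdots + u_{n-1} r^{k-1})\, r^{n-k}$.

3) Define the Inner Polynomials as:
   $P^{(j)} = r^{j \cdot k} \left( \sum_{i=0}^{k-1} u_{(j \cdot k + i)}\, r^i \right)$

4) Then the $u(r)$ polynomial can be rewritten as:
   $u(r) = \sum_{j=0}^{\frac{n}{k}-1} P^{(j)} = \sum_{j=0}^{\frac{n}{k}-1} r^{j \cdot k} \left( \sum_{i=0}^{k-1} u_{(j \cdot k + i)}\, r^i \right)$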

The algorithm divides the evaluation into three steps. First, the polynomial terms are grouped into windows of $k$ digits, where each grouping is multiplied by increasing powers of $r^k$. After the summations in each window, the window sums are scaled by the proper power of $r^k$. The last step is to aggregate the scaled window sums. The algorithm reduces the number of multiplications to $k + 2n/k$. A further speed-up is achieved by storing two tables: $\{r^0, r^1, \ldots, r^{k-1}\}$ and $\{r^k, r^{2k}, \ldots, r^{n-k}\}$. Since $r$ is set during the KeyGen step, the lookup tables can be precomputed. With the introduction of the lookup tables, the only multiplication operations needed are the ones computed when the window sums are multiplied by the power of $r^k$. Using the lookup tables, the number of multiplications is reduced further to $n/k - 1$ with a storage requirement of $n/k + k - 2$.
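To make the table-based evaluation concrete, here is a small sketch of the windowed scheme on ordinary integers modulo a prime. The modulus `P`, the function names and the toy sizes are assumptions chosen for illustration; the actual design works on million-bit operands in NTT form.

```python
P = 2**64 - 2**32 + 1  # illustrative 64-bit prime modulus (an assumption)

def build_tables(r, n, k, p=P):
    """Precompute {r^0, ..., r^(k-1)} and {r^k, r^(2k), ..., r^(n-k)}."""
    low = [pow(r, i, p) for i in range(k)]
    high = [pow(r, j * k, p) for j in range(1, n // k)]
    return low, high

def evaluate_windowed(u, low, high, p=P):
    """Evaluate u(r) mod p for coefficients in {-1, 0, 1}.

    Returns (value, multiplications); window sums need only additions
    and subtractions, so multiplications are counted for scaling only.
    """
    k = len(low)
    total, mults = 0, 0
    for j in range(len(high) + 1):
        window = 0
        for i in range(k):
            c = u[j * k + i]
            if c == 1:
                window += low[i]
            elif c == -1:
                window -= low[i]
        if j > 0:                        # window 0 is scaled by r^0 = 1
            window = (window % p) * high[j - 1] % p
            mults += 1
        total = (total + window) % p
    return total, mults

u = [1, 0, -1, 0, 0, 0, 1, 0, 0, -1, 0, 0, 1, 0, 0, 1]   # n = 16
low, high = build_tables(r=3, n=16, k=4)
value, mults = evaluate_windowed(u, low, high)
print(value, mults)
```

With n = 16 and k = 4 the scaling step runs for three of the four windows, in line with the n/k − 1 count, and the two tables hold n/k + k − 2 entries once the trivial entry r^0 = 1 is left out.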

Furthermore, the algorithm can be improved by realizing the operations entirely in the NTT domain. By storing the table elements in NTT form, an encryption operation may be realized as

$\mathrm{Encrypt}(m) = \left[ \mathrm{INTT}\left( M + 2 \sum_{j=0}^{\frac{n}{k}-1} R^{j \cdot k} \left( \sum_{i=0}^{k-1} u_{(j \cdot k + i)}\, R^i \right) \right) \right]_d$

where we use uppercase symbols to denote the NTT form of the variables, e.g. $R = \mathrm{NTT}(r)$ and $R^j = \mathrm{NTT}(r^j)$. Since the message $m$ is a bit, we simplify $M = (0, \ldots, 0)$ if $m = 0$, else $M = (1, \ldots, 1)$ if $m = 1$. The equation eliminates NTT conversions and requires only one INTT and a single modular reduction at the end.

The NTT based arithmetic operations in aggregate, referred to as NTT-Encryption, are evaluated with what we call the ENCRYPTION UNIT. The remainder of the operations are completed by utilizing the LARGE INTEGER MULTIPLIER UNIT and the BARRETT REDUCTION UNIT. To realize the encryption primitive, the MASTER CONTROL UNIT runs the ENCRYPTION UNIT, the LARGE INTEGER MULTIPLIER and the BARRETT REDUCTION units in order.

Encryption Unit. The ENCRYPTION UNIT is designed as a semi-systolic architecture as illustrated in Figure 5. The architecture contains a Control Unit, a storage unit for $u$ and ENCRYPTION PROCESSING ELEMENTS (EPEs). Since NTT based arithmetic can be parallelized rather efficiently, the RAM access latency becomes the bottleneck in the design. We can achieve the maximum throughput by incorporating #EPEs = bandwidth/frequency processing elements into the design.

Encryption Processing Element (EPE). The EPE as shown in Figure 6 is designed to evaluate the NTT-Encryption of a block of size $\kappa$. The parameter $\kappa$ also represents the size of the local cache. The local cache acts as a temporary variable $t$ and is used to reduce the number of I/O transactions. It is also important to note that the unit is fully pipelined with a delay of 10 clock cycles. Therefore each block operation incurs an extra 10 clock cycles in the time evaluation, i.e. we multiply the total timing by $(1 + 10/\kappa)$. The unit evaluates encryption in the following two steps:

• The first step evaluates the window summations and the scaling operation, as shown in Algorithm 2. With the built-in local cache, a window summation can be evaluated in at most $k$ input and 1 output transactions. If a $u_i$ value is zero, then $R^i_{(l)}$ is not loaded into the system. Therefore, the probability of the coefficients of the $u$ polynomial being 0 directly gives us the cost of the operations. The total number of I/O transactions is $k \cdot (1 - \rho) + 2$, where $1 - \rho$ is the probability of non-zero terms and the additional 2 accounts for the input of the scaler and the output term.

• The second step computes the window summations along with the doubling and addition of the message bit to finalize the NTT-Encryption. The algorithm is shown in Algorithm 3. The total number of I/O transactions for the window summations also depends on the probability $\rho$. If $\rho$ is large enough, the probability of $P^{(j)} = 0$, i.e. $\rho^k$, will be sufficiently large that such windows can be ignored during the additions. Then the total number of I/O transactions is $\frac{n}{k} \cdot (1 - \rho^k) + 1$, where $\frac{n}{k}$ represents the number of windows and the additional 1 is for the output term.

Algorithm 2: Window Sum & Scale Operation

Input: $r = \{\{R^0_{(l)}, R^1_{(l)}, \ldots, R^{k-1}_{(l)}\}, \{R^k_{(l)}, R^{2k}_{(l)}, \ldots, R^{n-k}_{(l)}\}\}$, $u = \{u_0, u_1, \ldots, u_{n-1}\}$
Output: Inner Polynomial Block $P^{(j)}_l = R^{j \cdot k}_{(l)} \sum_{0}^{k-1} u_i R^i_{(l)}$

for $j = 0 \to \frac{n}{k} - 1$ do
    $t \leftarrow 0$
    for $i = 0 \to k - 1$ do
        if $u_i \neq 0$ then $t \leftarrow t + u_i R^i_{(l)}$
    $t \leftarrow t \cdot R^{j \cdot k}_{(l)}$
    $P^{(j)}_l \leftarrow t$

Fig. 6. Encryption Processing Element [block diagram: 64-bit Input and Output; control signals double, m, add_bit, operation, read/write; blocks: two Mux, two Mod Add, a Modular Arithmetic unit and the EPE Local Cache holding t]

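Algorithm 3: NTT Encryption

Input: $P = \{P^{(0)}_l, P^{(1)}_l, \ldots, P^{(\frac{n}{k}-1)}_l\}$
Output: $R_l = 2 \sum_{0}^{\frac{n}{k}-1} P^{(i)}_l + M_{(l)}$

$t \leftarrow 0$
for $j = 0 \to \frac{n}{k} - 2$ do
    if $P^{(j)} \neq 0$ then $t \leftarrow t + P^{(j)}_l + P^{(j)}_l$
if $P^{(\frac{n}{k}-1)} \neq 0$ then $t \leftarrow t + P^{(\frac{n}{k}-1)}_l + P^{(\frac{n}{k}-1)}_l + M_{(l)}$
$R_l \leftarrow t$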
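The two EPE steps can be mimicked in software for a single digit position $l$. The sketch below is an illustration under several assumptions: the digit modulus `P` is chosen by me, the coefficient index inside window $j$ is taken as $u_{j \cdot k + i}$ (as in Algorithm 1), and the message digit is added unconditionally so that the result always equals $2u(R_l) + M_l$.

```python
P = 2**64 - 2**32 + 1  # assumed digit modulus, for illustration only

def window_sum_and_scale(R_l, u, k, p=P):
    """First step: return the scaled window sums for digit l."""
    low = [pow(R_l, i, p) for i in range(k)]     # table R^0 .. R^(k-1)
    blocks = []
    for j in range(len(u) // k):
        t = 0                                    # local cache cleared
        for i in range(k):
            c = u[j * k + i]
            if c == 1:                           # zero coefficients are skipped
                t = (t + low[i]) % p
            elif c == -1:
                t = (t - low[i]) % p
        t = t * pow(R_l, j * k, p) % p           # scale by R^(j*k)
        blocks.append(t)
    return blocks

def aggregate(blocks, M_l, p=P):
    """Second step: double every non-zero window sum, then add the message digit."""
    t = 0
    for block in blocks:
        if block != 0:                           # all-zero windows cost no transfer
            t = (t + block + block) % p          # doubling done as P + P
    return (t + M_l) % p

u = [0, 1, 0, 0, 0, 0, 0, 0, -1, 0, 0, 1]        # n = 12, k = 4
blocks = window_sum_and_scale(R_l=7, u=u, k=4)
print(aggregate(blocks, M_l=1))
```

Only additions, subtractions and one multiplication per window appear in the first function, which mirrors the adder/subtracter and multiplier roles described for the EPE.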

The EPE is controlled by the signals double, operation, add_bit, read, write and clear. Each EPE is connected to a 64-bit bus, which is utilized to load the powers of $r$ into the system. During the computation of the first algorithm, the double and add_bit signals are inactive and the input is fed directly to the MODULAR ARITHMETIC UNIT. The unit consists of a 64-bit modular subtracter, an adder and a multiplier. The operation signal enables the required modular arithmetic operation. In the case of $u_i = \pm 1$, the modular adder/subtracter is used to compute $t = t \pm R^i$. For the scaling operation, the modular multiplier is enabled by the operation signal to compute $t = t \cdot R^{j \cdot k}$. The second algorithm is realized using two 64-bit modular adders that are controlled by the double and add_bit signals. If the double signal is active, the input is added to itself to double the window summations: $2P = P + P$. If the add_bit signal is active, the message bit $m$ is added to the summation in NTT form. Using these two signals, the final equation is evaluated as $2u(R) + M = 2 \sum_{0}^{\frac{n}{k}-1} P^{(i)} + M$. Additionally, it is important to note that the clear signal is used to set $t = 0$ and the read/write signals are used to read and update the $t$ values.

Control Unit. The CONTROL UNIT is a state machine that manages the encryption operation. Its inputs are the message bit $m$ and the random polynomial $u$, and its outputs are the operation, double, add_bit and clear signals. Once the $u$ polynomial is loaded into the storage, the CONTROL UNIT performs an encryption operation as follows:

1) Take the message bit $m$ as input.

2) Request the $u$ polynomial for the first window, $\{u_0, u_1, \ldots, u_{k-1}\}$.

3) Using the clear signal, reset the cache units: $t \leftarrow 0$.

4) Check the values of $u_i$ iteratively and skip each one that is zero. In the case of $\pm 1$, $\beta$ blocks of the powers $r^i_l$ are loaded onto the bus and the operation signal is selected. Each arithmetic core is assigned a different block to evaluate $t = t + r^i_l$.

5) Iterate the index $i$ until the computation of the window sum is completed. To scale the sum, the necessary $\beta$ blocks (powers of $r^k_l$) are loaded into the cache and the operation signal is set to enable multiplication. The term $t$ is updated as $t = t \cdot R^k_l$. Now $t$ holds the result of a scaled window summation: $P^{(j)}_l = R^{j \cdot k}_l \sum_{0}^{k-1} u_i R^i_l$. Since there are $\beta$ blocks, the window sum is evaluated for the $\beta$ blocks as $P^{(j)}_{sub} = \{P^{(j)}_{\beta-1}, \ldots, P^{(j)}_1, P^{(j)}_0\}$.

6) Sequentially write the results $P^{(j)}_{sub}$ back to the main memory.

7) Using steps 3–5, process each block to finish the computation of a window: $P^{(j)} = \{P^{(j)}_{bs-1}, \ldots, P^{(j)}_1, P^{(j)}_0\}$, where $bs$ is the block size.

8) Repeating steps 3–7, compute all of the windows: $\{P^{(0)}, P^{(1)}, \ldots, P^{(\frac{n}{k}-1)}\}$.

9) Using the clear signal, clear all caches to 0.

10) Assert the double signal. Starting from $j = 0$, $\beta$ blocks of $P^{(j)}$ are loaded if $P^{(j)} \neq 0$. This evaluates $t = t + 2 \cdot P^{(j)}_l$ up to $j = \frac{n}{k} - 1$. The add_bit signal is activated for the case $j = \frac{n}{k} - 1$. This adds the message bit $m$: $t = t + 2 \cdot P^{(\frac{n}{k}-1)}_l + m$.

11) Every arithmetic core unit writes its result sequentially to the main memory.

12) Using steps 9–11, process each block to finish the computation of the equation $R = 2 \cdot \sum_{j=0}^{\frac{n}{k}-1} P^{(j)} + m$.

Parameter selection makes a significant difference for the time efficiency of the architecture. Since an $r^i$ term is included only if $u_i$ is not 0, the probability distribution of $u_i \in \{0, 1, -1\}$ is important for evaluating the timings.




We select the window size as 64, and recall that the probability is selected as $\rho = 16/2048$. Since we only have 16 non-zero values, we need to evaluate 16 of the 32 windows in the worst-case scenario (see footnote 6). Evaluating these 16 windows costs $16 \cdot 3N$ cycles. The addition of these 16 windows takes $17N$ clock cycles. Including the number of EPEs, the INTT and the Barrett Reduction operations, the total cost of the operations amounts to $\frac{65N}{\beta} \cdot (1 + 10/\kappa) + 89N$ cycles.
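The encryption cost formula can be evaluated directly. The sketch below expresses it in units of N and uses β = 2 and κ = 256 as defaults, which are the values selected later in the implementation results; splitting 17N as "16 window loads plus one output" is my reading of the text, and the function name is hypothetical.

```python
def encryption_cycles_in_N(nonzero_windows=16, beta=2, kappa=256, pipeline_delay=10):
    """Return (EPE part, INTT + Barrett part, total), all as multiples of N."""
    window_cost = nonzero_windows * 3        # 16 * 3N for window sum and scale
    aggregation = nonzero_windows + 1        # 17N to add the windows (assumed split)
    epe = (window_cost + aggregation) / beta * (1 + pipeline_delay / kappa)
    intt_and_barrett = 16 + 73               # one INTT and one Barrett Reduction
    return epe, intt_and_barrett, epe + intt_and_barrett

epe, tail, total = encryption_cycles_in_N()
print(f"EPE: {epe:.2f}N, INTT + reduction: {tail}N, total: {total:.2f}N")
```

The fixed 89N tail dominates, so adding more EPEs (a larger β) shortens only the smaller first term.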

5.3 Recryption

Recryption is evaluated as

$\mathrm{Decrypt}_{SK}(c) = \sum_{j \in [s]} \sum_{i \in [S]} \sigma_j(i)\, z_{j,i} + \sum_{j \in [s],\, i \in [l]} \sigma_j(i)\,(y_{j,i}) \pmod 2$.

The first summation has the following form: $q_j = \sum_{a \in [l]} \beta_{j,a} \left( \sum_{b \in [l]} \beta_{j,b}\, z_{j,i(a,b)} \right) \pmod d$. We can take advantage of the fact that the public keys $\beta_{j,l}$ are known ahead of time after KEYGEN. By precomputing and storing the keys in NTT form, i.e. $B = \mathrm{NTT}(\beta)$, we can eliminate many costly large integer multiplications. The equation is rewritten as

$q_j = \mathrm{INTT}\left( \sum_{a \in [l]} B_{(j,a)} \left( \sum_{b \in [l]} B_{(j,b)}\, z_{j,i(a,b)} \right) \right) \pmod d$

The new equation eliminates most of the NTT and INTT conversions. Only one inverse conversion and a single modular reduction are required at the end. Furthermore, we benefit from computing the $z_{j,i}$ terms first and storing them in a table. This allows us to compute the NTT based arithmetic parts in blocks, since we are able to re-read the $z_{j,i}$ for each block. We divide the above equation into four steps:

• Evaluation of the precision bits $z_{j,i}$.
• Sum of PKs: $S_j = \sum_{a \in [l]} B_{(j,a)} \left( \sum_{b \in [l]} B_{(j,b)}\, z_{j,i(a,b)} \right)$.
• INTT conversion and Barrett Reduction of $S_j$.
• Grade-School Addition.

The first two steps are computed by the RECRYPTION UNIT, which is illustrated in Figure 9. The last two steps are computed by the MASTER CONTROL UNIT using the LARGE INTEGER MULTIPLIER and the BARRETT REDUCTION UNIT.

5.3.1 Evaluating Precision Bits

The precision bits $z_{j,i}$ are the $p'$-bit result of the quotient $y_{j,i}/d$. We divide the computation of the $z_{j,i}$ terms into two units. First, the BINARY COMPUTATION UNIT evaluates the precision bits $z_{j,i} = \{b^{(0)}_{j,i}, \ldots, b^{(p'-1)}_{j,i}\} = \frac{y_{j,i}}{d}$. In the equation, $j$ is the public key index, $i$ is the Hamming weight index and $p' = \lceil \log_2(s+1) \rceil + 1$ is the number of precision bits. The second unit used in the computation is the MODULAR COMPUTATION UNIT, which evaluates $y_{j,i} = c \cdot x_j \cdot R^i \pmod d$ by computing $y_{j,i} = y_{j,i-1} \cdot R \pmod d$. The units are designed to make the evaluations for one public key. Therefore, for a public key of size $s$ we reuse the units for each $x_j$. In the following we give the design details.

6. We divide the degree 2048 into 64 windows of equal degree polynomials.

Binary Computation Unit. The BINARY COMPUTATION UNIT is illustrated in Figure 7. It consists of a $p'$-bit quotient evaluation unit, a $p'$-bit buffer and a storage table of size $p' \cdot S$, denoted as the Precision Bit Table. As shown in the figure, the quotient evaluation unit is an architecture that performs binary division by shift and subtraction operations. This design keeps the area small, and its timing overhead remains small compared to the overall timing. The evaluation of the precision bits for a specific pair $j, i$ is as follows:

1) The QUOTIENT EVALUATION UNIT takes the first $k_1$ bits of the values $y_{j,i}$ and $d$, which are loaded into storage denoted as $y_{head}$ and $d_{head}$.

2) Using a comparator, $y_{head}$ and $d_{head}$ are compared: if $y_{head} \geq d_{head}$ then $b^{(l)}_{i,j} = 1$, else $b^{(l)}_{i,j} = 0$.

3) The precision bit $b^{(l)}_{i,j}$ is loaded into the buffer. The value $d_{head}$ is updated as $d_{head} = d_{head} \gg 1$ using a 1-bit shifter. Also, $y_{head}$ is updated according to the value of $b^{(l)}_{i,j}$: if $b^{(l)}_{i,j} = 1$ then $y_{head} = y_{head} - d_{head}$, else $y_{head} = y_{head}$.

4) We iterate Steps 2 and 3 until all the precision bits are calculated.

5) The Precision Bit Table has $S$ rows, and the evaluated precision bits are loaded from the buffer into the $i$th row of the table.

With 64-bit word size arithmetic, the loading operation takes $k_1/64$ cycles, each update of the values takes $k_1/64$ cycles, and each precision bit evaluation with comparison takes 1 cycle. The process of one precision bit evaluation takes $\left( \frac{(p'+1)\,k_1}{64} + 1 \right)$ cycles.
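The compare, shift and subtract loop above is ordinary restoring division applied to a fraction. The sketch below is a software analogue under two assumptions of mine: it works on full-precision integers instead of the leading $k_1$ bits, and it doubles $y$ rather than halving $d$, which yields the same bits without discarding low-order bits of $d$. The function name is hypothetical.

```python
def precision_bits(y, d, p_bits):
    """Return the first p_bits binary digits of y/d, for 0 <= y < d.

    bits[0] is the integer bit, bits[1:] are the fractional bits.
    """
    assert 0 <= y < d
    bits = []
    for _ in range(p_bits):
        if y >= d:          # comparator decides the current bit
            bits.append(1)
            y -= d          # subtract the value that was compared
        else:
            bits.append(0)
        y <<= 1             # same effect as d >> 1, but exact
    return bits

print(precision_bits(3, 4, 4))   # 3/4 = 0.110 in binary
```

Each iteration uses one comparison and at most one subtraction, which is why the unit can be built from a comparator, a 1-bit shifter and a subtracter.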

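Fig. 7. Binary Computation Unit [block diagram: inputs y and d are loaded through multiplexers into y_head and d_head inside the Quotient Evaluation Unit, which contains a Compare block, >>1 shifters, a Mux and a Sub block; the output bits $b^{(l)}_{i,j}$ go to a p-bit Buffer and then to row $z_i$ of the Precision Bit Table]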

Modular Computation Unit. The MODULAR COMPUTATION UNIT is used to evaluate $y_{j,i}$ using the equation $y_{j,i} = y_{j,i-1} \cdot R \pmod d$ and setting $y_{j,0} = c \cdot x_j \pmod d$. The value $R$ is a special number equal to $2^{103}$. This simplifies the multiplication into a simple shift operation. Also, the modular reduction operation can be evaluated by a scaled subtraction operation, i.e. $y_{j,i} - t \cdot d = y_{j,i} \pmod d$, where $t$ is the largest coefficient that ensures $t \cdot d < y_{j,i}$. By combining these two, the computation can be expressed as $y_{i,j} = (y_{i-1,j} \ll 103) - t \cdot d$. In the equation, the coefficient $t$ is at most a 103-bit value, so we are able to design a fast modular reduction unit that computes the small coefficients and multiplies them with million-bit numbers. The design in Figure 8 consists of a 103-bit quotient evaluation unit, a 64x128-bit multiplier unit, a carry accumulate unit, a 103-bit shifter, a 64-bit subtracter and a local storage. The evaluation of $(y_{i-1,j} \ll 103) - t \cdot d$ is performed with the following steps:

1) The 103-bit QUOTIENT EVALUATION UNIT takes the first $k_2$ bits of the values $y_{j,i-1}$ and $d$.

2) The QUOTIENT EVALUATION UNIT evaluates the $t$ value as a 103-bit number, as explained for the BINARY COMPUTATION UNIT.

3) Since the evaluations output one bit at a time, the bits are loaded into a 128-bit buffer. The buffer holds the 103-bit $t$ with a 25-bit zero padding, so that $t$ can be fed as a constant value.

4) After computing $t$, we can evaluate $y_{j,i} = y_s - t \cdot d = (y_{j,i-1} \ll 103) - t \cdot d$.

The evaluation of $t$ by the QUOTIENT EVALUATION UNIT takes $\left( \frac{104\,k_2}{64} + 1 \right)$ cycles. In the rest of the computations, the design inputs two million-bit numbers and outputs a million-bit result. The design is fully pipelined and able to generate a result at each clock. Also, the pipeline delay is small enough that we can neglect it in the timings. Therefore, the transactions take 47000/bandwidth cycles to finish an evaluation, where bandwidth is the rate of digits per clock cycle.
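The shift-and-subtract recurrence $y_{j,i} = (y_{j,i-1} \ll 103) - t \cdot d$ can be checked against ordinary modular exponentiation. In the sketch below the quotient $t$ is obtained by exact integer division, whereas the hardware estimates it from the leading $k_2$ bits; the function names and the toy modulus are assumptions for illustration.

```python
SHIFT = 103  # R = 2**103

def next_y(y_prev, d):
    """One step of the recurrence; returns (y_new, t)."""
    shifted = y_prev << SHIFT          # multiplication by R is a shift
    t = shifted // d                   # fits in 103 bits because y_prev < d
    return shifted - t * d, t

def y_sequence(y0, d, count):
    """Return [y_0, y_1, ..., y_(count-1)] with y_i = y0 * R^i mod d."""
    ys = [y0 % d]
    for _ in range(count - 1):
        y_new, _ = next_y(ys[-1], d)
        ys.append(y_new)
    return ys

d = (1 << 521) - 1                     # toy odd modulus, far smaller than the real d
ys = y_sequence(y0=123456789, d=d, count=5)
print(all(y == 123456789 * pow(2, SHIFT * i, d) % d for i, y in enumerate(ys)))
```

Because $y_{j,i-1} < d$, the shifted value is below $2^{103} \cdot d$, which is why $t$ never exceeds 103 bits.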

Filling the Precision Bits Table. In order to complete the operations and fill the PRECISION BITS TABLE, we use the MODULAR COMPUTATION UNIT and the BINARY COMPUTATION UNIT in turns. Using a local CONTROL UNIT, we iterate the modular computation and the binary computation $S$ times to complete the table for a single public key. Each public key has an initial modular multiplication ($c \cdot w_i \pmod d$), for which we can eliminate extra operations by converting $c$ to NTT form once (see footnote 7) and reusing it in each public key evaluation, and by pre-storing the public keys in NTT form. Therefore, we only perform the digit multiplications, the INTT conversion and the Barrett Reduction. Then, completing the table for a single public key takes

$\tau = 93.5N + \left( \frac{104\,k_2 + (p'+1)\,k_1}{64} + 2 + \frac{47000}{\mathrm{bandwidth}} \right) \times S$

cycles. Using the same units for the other public keys adds a factor of $s$ to the overall timing, i.e. $s \cdot \tau + 16N$. However, each public key is processed independently, so we can benefit from using several of these units in parallel. To achieve the speedup, we still need to increase the bandwidth by a factor equal to the number of units.

5.3.2 Evaluating the Sum of Public Keys

Recall the equation for the summation of the public keys:

$s_j = \sum_{a \in [l]} \beta_{j,a} \sum_{b \in [l]} \beta_{j,b}\, z_{j,i(a,b)} \pmod d$.

As before, we chose to store the $\beta$'s in NTT form to eliminate the conversions and rewrite the equation as $s_j = \sum_{a \in [l]} B_{(j,a)} \left( \sum_{b \in [l]} B_{(j,b)}\, z_{j,i(a,b)} \right)$. Since $z_{j,i(a,b)}$ is a $p'$-bit value, it is denoted as $z_{j,i} = \{b^{(p'-1)}_{j,i}, \ldots, b^{(1)}_{j,i}, b^{(0)}_{j,i}\}$ (see footnote 8). Then $s_j$ turns into a $p'$-sized array in which each bit computation is performed separately. By denoting $s_j = \{s^{(p'-1)}_j, \ldots, s^{(1)}_j, s^{(0)}_j\}$, we can expand the equations as:

$s^{(k)}_j = \sum_{a \in [l]} B_{(j,a)} \sum_{b \in [l]} B_{(j,b)}\, b^{(k)}_{(j,i)}$

where $k$ is the bit index. For the evaluation of the equation, we designed a RECRYPTION PROCESSING ELEMENT (RPE) which includes small local storage for rapid calculations. As the equation shows, the same $B$ inputs are used in all the evaluations. Therefore, we formed an array of $p'$ RPE units and distribute each $b^{(k)}_{j,i}$ to a unit. By doing that, we compute all $s^{(k)}_j$ evaluations with a single transaction rather than $p'$. Furthermore, since the evaluations are in NTT form, we can replicate the RPE array to process multiple blocks and speed up the computations, as we did for encryption. The bandwidth limits the number of RPE arrays we can utilize. The design illustrated in Figure 9 consists of the RPE Arrays, the PRECISION BITS TABLE (see footnote 9) and a CONTROL UNIT. In the following we give design details of the units and describe their working mechanisms.

7. Adds an initial 16N clock cycles.

Fig. 8. Modular Computation Unit [block diagram: d_head and y_head feed the Quotient Eval. Unit (t = y/d), which outputs t0 and t1; digits of d pass through two 64x64 multipliers forming the 64x128 Multiplier, followed by a Carry Adder and a Carry Accum. block; y_old from the Local Cache is shifted by <<103 and a 64-bit Sub block produces y_new]

Recryption Processing Element. The design of the RPE is illustrated in Figure 10. The unit consists of two 64-bit modular adders, one 64-bit modular multiplier, a multiplexer and two local storage units. The local storage units are referred to as up_c and low_c and have a size of $\kappa$ digits each. The unit is fully pipelined with a total depth of 11 clock cycles. A complete block evaluation of the equation, as performed by the RPE, is shown in Algorithm 4.

Control Unit. The CONTROL UNIT incorporates a state machine to handle the transactions and compute the Recryption operation. It controls the PRECISION BITS TABLE by requesting the required $z_{j,i}$ bits with index $i$, which are fed directly to the RPEs. The unit controls the request bits, read/write and clear signals. Including the output transactions, the operation requires $\frac{(S + x + p') \cdot N \cdot (1 + 11/\kappa)}{\phi} \cdot s$ clock cycles, in which the factor of $s$ comes from the number of public keys and $\phi$ comes from the number of RPE arrays. Given the FHE primitive parameters and $x$ being 14, the timing is $\frac{531 \cdot N \cdot (1 + 11/\kappa)}{\phi} \cdot s$.

Algorithm 4: Recryption Algorithm

Input: $B_j = \{B_{(j,l-1)}, \ldots, B_{(j,1)}, B_{(j,0)}\}$, $b^{(k)}_{(j,i)} \in \{b^{(p'-1)}_{(j,i)}, \ldots, b^{(1)}_{(j,i)}, b^{(0)}_{(j,i)}\}$
Output: $s_j = \sum_{a} B_{(j,a)} \left( \sum_{b} B_{(j,b)} \cdot b^{(k)}_{j,i} \right)$

low_c = 0
for $a = 0 \to l - 1$ do
    up_c = 0
    for $b = a + 1 \to l - 1$ do
        if $b^{(k)}_{j,i} == 1$ then
            up_c = up_c + $B_{(j,b)}$
    low_c = low_c + up_c $\cdot B_{(j,a)}$

Fig. 9. Recryption Architecture [block diagram: a Control Unit issues Request_bit to the Precision Bits Table and Read/Write and Clear signals to a p × φ grid of elements RPE[0][0] ... RPE[p−1][φ]; row k receives the bit $b^{(k)}_{i,j}$; every RPE has a 64-bit input and a 64-bit output]

Fig. 10. Recryption Processing Element [block diagram: Input feeds a Mux selected by $b^{(x)}_{i,j}$, two Mod Add blocks and a Mod Mul block; upper cache and lower cache each have Read/Write and Clear controls; Output]

8. The values a, b are omitted for simplicity.
9. The table formed in Evaluating Precision Bits.
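Algorithm 4 translates almost line by line into software for a single digit position. The sketch below makes two assumptions that the text does not fix: the digit modulus `P`, and a hypothetical `pair_index` that ranks the pairs $a < b$ in lexicographic order to stand in for $i(a,b)$.

```python
P = 2**64 - 2**32 + 1  # assumed digit modulus, for illustration only

def pair_index(a, b, l):
    """Hypothetical i(a, b): rank of the pair a < b in lexicographic order."""
    return a * l - a * (a + 1) // 2 + (b - a - 1)

def rpe_block(B, bits, p=P):
    """One RPE evaluation for bit index k on one digit.

    B    : the l key digits B_(j,0) .. B_(j,l-1)
    bits : the bits b^(k)_(j,i) for i = 0 .. l*(l-1)/2 - 1
    """
    l = len(B)
    low_c = 0
    for a in range(l):
        up_c = 0                                   # upper cache cleared per a
        for b in range(a + 1, l):
            if bits[pair_index(a, b, l)] == 1:     # mux selects B_(j,b) or nothing
                up_c = (up_c + B[b]) % p
        low_c = (low_c + up_c * B[a]) % p          # one multiplication per a
    return low_c

B = [3, 5, 7, 11]
bits = [1, 0, 1, 1, 0, 1]
print(rpe_block(B, bits))
```

The inner loop needs only additions, so the single modular multiplier is used once per value of $a$, which is the reason for keeping two separate caches.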

5.3.3 Conversion and Reduction of Sum of Public Keys

Once the precision bits for each public key are evaluated, they need to be converted back from the NTT domain.

Using an INTT and the Barrett Reduction algorithm, the conversion takes $89N$ clock cycles. Having $s$ public keys and $p'$ precision bits, the total operation takes $s \cdot p' \cdot 89N$ cycles. The operations can be parallelized by using multiple large multiplier units. This increases the area by the number of multiplier units, but reduces the computation time by the same factor.

5.3.4 Grade School Addition

In this section we explain the method we used to add fifteen 5-bit numbers, where each bit of every number is in encrypted form. Each bit is represented by a very large number, so a conventional addition algorithm does not apply. Assume we are realizing the bitwise addition operation $\{c, s\} = x + y$, where $c$ is the resulting carry bit and $s$ is the sum. The logic realizing this operation is called a half adder. The results $c$ and $s$ can be represented as follows: $c = x$ AND $y$ and $s = x$ XOR $y$. Given that we realize this half adder on ciphertexts, we modify the equations as follows: for $\{C, S\} = X + Y$ we write $C = X \times Y$ and $S = X + Y$, where $X$, $Y$, $C$ and $S$ are ciphertexts and the multiplication and addition operations are large-number modular arithmetic.

Since 15 5-bit numbers in ciphertext form need to be added, we utilize the Wallace tree [25] approach to minimize the number of large multiplications. For this operation, we utilize a total of 78 large multiplication operations and a total of $33 + 33 + 32 + 27 + 14 = 139$ large addition operations. The large multiplication operations require modular reductions, so that bit growth is prevented. In the evaluation of $C$, we use two multiplications followed by additions, so the products are reduced after the additions. This reduces the number of Barrett Reduction operations by half. Then, the total timing is equal to $78 \cdot 52.5N + 39 \cdot 73N + 25N$, in which $25N$ is the approximate cost of the additions.
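The arithmetic form of the adders can be tried out on plain bits, with multiplication and addition modulo 2 standing in for the homomorphic operations on ciphertexts. The full-adder carry formula below, $xy + z(x + y)$, is one common form that uses two multiplications followed by additions; the text does not spell out its exact formula, so treat it as an assumption. The last function simply evaluates the timing expression in units of N.

```python
def half_adder(x, y):
    """Return (carry, sum): C = X * Y, S = X + Y, all modulo 2."""
    return (x * y) % 2, (x + y) % 2

def full_adder(x, y, z):
    """Return (carry, sum) using two multiplications followed by additions."""
    s = (x + y + z) % 2
    c = (x * y + z * (x + y)) % 2      # assumed carry form
    return c, s

def grade_school_cycles_in_N(mults=78, reductions=39, additions=25):
    """78 multiplications at 52.5N, 39 reductions at 73N, about 25N of additions."""
    return mults * 52.5 + reductions * 73 + additions

for x in (0, 1):
    for y in (0, 1):
        for z in (0, 1):
            c, s = full_adder(x, y, z)
            print(x, y, z, "->", c, s)
print(grade_school_cycles_in_N())
```

Because the two products in the carry are summed before any reduction, one Barrett Reduction covers both of them, which is consistent with 39 reductions for 78 multiplications.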

6 IMPLEMENTATION RESULTS

The design was synthesized with Synopsys Design Compiler using the 90 nm TSMC library. Timing analysis shows a maximum frequency of 666 MHz. A moderate memory speed of 1333 MTps (megatransfers per second) is selected for the Main Memory (RAM). The ratio between the memory and the main circuit frequency, i.e. $\beta = \phi = \frac{1333}{666} = 2$, results in an I/O speed of 2 transactions (digits in our case) per clock cycle, which led us to set the numbers of EPE and RPE Arrays to 2 each. Also, the local cache sizes of the RPE, the EPE and the Barrett Reduction Unit were selected as 256 digits to have a smaller pipeline delay, i.e. $\kappa = 256$. Finally, as mentioned before, we incorporated a single multiplier into our design. Under these settings, the timings are found to be as shown in Table 3. The Large Integer Multiplication, Barrett Reduction and Decryption operations share a single Large Integer Multiplier. Therefore their latencies are directly related to the latency of the multiplier. By increasing the number of EPE units as well as the bandwidth, we can reduce the latency of the Encryption operation. However, the encryption latency is already small, and a single EPE takes only about 26% of the entire Encryption operation.

TABLE 3
Arithmetic Operation Timings

Operation                          # of Clock Cycles            Timing
Large Integer Multiplication       52.5N                        7.75 msec
Barrett Reduction                  73N                          10.70 msec
Decryption                         109.5N                       16.16 msec
Encryption
  EPE                              (65N/2) · (1.039)            4.98 msec
  INTT & Reduction                 89N                          13.12 msec
  Total                                                         18.10 msec
Recryption
  Evaluation of z_{j,i}            (93N + 840S) · s + 16N       0.488 sec
  Sum of PK                        (531·N/2) · (1.042) · s      0.612 sec
  INTT & Reduction                 s · p · 89N                  0.985 sec
  Grade School                     6967N                        1.027 sec
  Total                                                         3.112 sec

The recryption operation is the most significant yet slowest operation. Therefore, we aimed at optimizing this operation as much as possible. Using the recommended bandwidth settings, we reduced the total time of the recryption operation to 3.1 seconds. For the initial two steps of the operation, the evaluation of $z_{j,i}$ and the sum of PK, the same elements can be utilized to realize the evaluation for each public key. Since they are independent, the timing of the first two steps can be reduced to $1.075/\lambda$, in which $\lambda$ is the number of arithmetic units utilized for these steps. This will increase the area and bandwidth requirements for those operations by a factor of $\lambda$. For the last two steps of the operation, we can reduce the timing by adding more LARGE INTEGER MULTIPLIER units. Using $\lambda$ multipliers, the latency of the INTT & Reduction operation can be reduced by a factor of $\lambda$ and the delay of the Wallace tree can be reduced by close to a factor of $\lambda$. In multiplication, I/O transactions take 20% of the total operation. Therefore, with 5 multipliers we can perform multiple operations without increasing the bandwidth. This will reduce the operation latency by a factor of 5.

The design is synthesized with local cache sizes of 256, 128 and 64 digits, and the area results are shown in Table 4. The local cache sizes affect the area of the ENCRYPTION, RECRYPTION and BARRETT REDUCTION units only. Among these three units, the RECRYPTION Unit covers the largest area, with 1.17 million gates for a 256-digit cache size. The DECRYPTION Unit, whose only purpose is to take the least significant bit and to augment it with zero bits, does not have any local cache and consumes only about 200 gates and additional rewiring. The LARGE INTEGER MULTIPLIER has a fixed size cache that makes it the largest hardware unit in the design, with 26.5 million gates for the cache and 0.2 million gates for $m = 4$ arithmetic units. By increasing the number of multipliers $m$ we may further reduce the execution time. The time for the large integer multiplication, $\left( \frac{158N}{m} + 13N \right)$, is tabulated for various $m$ in Table 5. Note that the time-area product is optimal for $m = 64$, improving the multiplication speed by 3.4 times while the area is increased by only 11%. We estimate a 2 times speedup in Recryption.

TABLE 4
Area of Hardware Blocks (Millions of Gates)

                      64-digit Cache   128-digit Cache   256-digit Cache
Large Integer Mul.    26.7             26.7              26.7
Barrett Reduction     0.0209           0.0356            0.0647
Decryption            0.0002           0.0002            0.0002
Encryption            0.1047           0.1360            0.2060
Recryption            0.6740           0.7741            1.1770

TABLE 5
Time-Area trade-off

m               4       8       16      32      64      128
Time (in N)     52.5    32.7    22.8    17.9    15.4    14.2
Area            26.7    26.9    27.3    28.1    29.7    32.9
Time × Area     1401    879.6   622.4   502.9   457.3   467.2
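The trade-off in Table 5 can be regenerated from the multiplication time formula. In the sketch below the time model $158N/m + 13N$ is taken from the text, while the linear area model (26.5 million gates of cache plus 0.05 million gates per arithmetic unit) is my own fit to the Area row and to the 0.2 million gates quoted for $m = 4$, so it is an assumption.

```python
def mult_time_in_N(m):
    """Large integer multiplication time, in units of N, for m arithmetic units."""
    return 158 / m + 13

def area_mgates(m):
    """Assumed area model in millions of gates: fixed cache plus m arithmetic units."""
    return 26.5 + 0.05 * m

def best_time_area(candidates=(4, 8, 16, 32, 64, 128)):
    """Return the m with the smallest time-area product."""
    return min(candidates, key=lambda m: mult_time_in_N(m) * area_mgates(m))

for m in (4, 8, 16, 32, 64, 128):
    t, a = mult_time_in_N(m), area_mgates(m)
    print(f"m={m:3d}  time={t:6.2f}N  area={a:5.1f}  product={t * a:8.1f}")
print("best m:", best_time_area())
```

The 13N term does not shrink with $m$, so beyond $m = 64$ the added area outweighs the remaining time savings.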

7 COMPARISON

We presented the first hardware implementation of a fully homomorphic encryption system with the goal of exploring the limits of the GH-FHE scheme in hardware. Hence a direct comparison with other FHE hardware implementations is not possible. While a comparison to software implementations would not be fair, we find it useful to summarize these results alongside ours in Table 6.

In LARGE INTEGER MULTIPLICATION, the time performance of our design is close to that of the Xeon software implementation and ∼10 times slower than the GPU implementation. Decryption is 20% faster than the Xeon software but 6.5 times slower than the GPU implementation. Encryption is 101 times faster than the Xeon software and 12.3 times faster than the GPU implementation. However, the most critical operation is Recryption, where we are 10 times faster than the Xeon software implementation. Moreover, our design is still 1.1 seconds faster than the GPU implementation. We should note, however, that our hardware runs at a slower frequency and has a much lower gate count of less than 30 million equivalent gates. In contrast, the NVidia GPU in [14] contains approximately 900 million gates and the Xeon processor contains 205 million gates. This shows the benefit of our ASIC design compared to general purpose CPU and GPU implementations, with much higher performance at a lower area cost. In Table 6 (bottom) we normalize the timings by the clock rates of the chips. In the most critical primitive, i.e. Recrypt, our implementation requires fewer than half the clock cycles compared to the GPU and 46.6 times fewer clock cycles compared to the Xeon implementation, at a much lower footprint. Our implementation could gain a speedup of at least a few times if synthesized with a smaller technology than 90 nm, such as that of the Xeon and GPU processors (40-45 nm).




TABLE 6
Times in msec (top) and in million cycles (bottom)

              Multiplication   Decrypt   Encrypt   Recrypt
Time (msec)
Ours          7.750            16.1      18.1      3100
GPU [14]      0.765            2.5       220       4200
Xeon [15]     6.667            20.0      1800      32000
Million cycles
Ours          5.1              10.7      12        2000
GPU [14]      0.8              2.8       253       4800
Xeon [15]     20               60        5400      96000

8 CONCLUSION

In this work we took initial steps to remedy the efficiency bottleneck of FHE schemes by introducing the first custom FHE architecture. For this we introduced a novel large integer modular multiplier design realizing the Schönhage-Strassen algorithm and Barrett's reduction in hardware. Using this core we implemented the Gentry-Halevi FHE primitives, i.e. encryption, decryption, and recryption. Among these primitives we managed to improve the efficiency of the challenging recryption operation to the point where we surpass the performance of its software implementation on a high-end GPU processor at a fraction of the footprint.

REFERENCES

[1] N. Emmart and C. Weems, High precision integer addition, subtraction and multiplication with a graphics processing unit, Parallel Processing Letters, vol. 20, no. 4, pp. 293–306, 2010.

[2] J. Solinas, Generalized Mersenne numbers, Technical Reports, 1999.

[3] D.B. Cousins, K. Rohloff, C. Peikert, and R. Schantz, SIPHER: Scalable Implementation of Primitives for Homomorphic EncRyption – FPGA implementation using Simulink, HPEC, 2011.

[4] P. Barrett, Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor, Advances in Cryptology - CRYPTO ’86, Santa Barbara, California, USA, 1986.

[5] C. Gentry, Fully homomorphic encryption using ideal lattices, STOC, 2009, pp. 169–178.

[6] R.L. Rivest, L. Adleman, and M.L. Dertouzos, On data banks and privacy homomorphisms, In Foundations of Secure Computation, 1978.

[7] Z. Brakerski, C. Gentry, and V. Vaikuntanathan, Fully homomorphic encryption without bootstrapping, Innovations in Theoretical Computer Science (ITCS), pp. 309–325, 2012.

[8] Z. Brakerski and V. Vaikuntanathan, Efficient fully homomorphic encryption from (standard) LWE, In Foundations of Computer Science (FOCS), pp. 97–106, 2011.

[9] Z. Brakerski and V. Vaikuntanathan, Fully homomorphic encryption from ring-LWE and security for key dependent messages, In Advances in Cryptology - CRYPTO 2011, vol. 6841, pp. 505–524.

[10] A. Lopez-Alt, E. Tromer, and V. Vaikuntanathan, On-the-fly multiparty computation on the cloud via multikey fully homomorphic encryption, In Proceedings of the 44th STOC, pp. 1219–1234, ACM, 2012.

[11] C. Gentry, A Fully Homomorphic Encryption Scheme, Ph.D. thesis, Department of Computer Science, Stanford University, 2009.

[12] I. Munro and M. Paterson, Optimal algorithms for parallel polynomial evaluation, Switching and Automata Theory, 1971.

[13] M. van Dijk, C. Gentry, S. Halevi, and V. Vaikuntanathan, Fully Homomorphic Encryption over the Integers, EUROCRYPT 2010, pp. 24–43.

[14] W. Wang, Y. Hu, L. Chen, X. Huang, and B. Sunar, Accelerating Fully Homomorphic Encryption Using GPU, Proc. of HPEC 2012.

[15] C. Gentry and S. Halevi, Implementing Gentry’s Fully-Homomorphic Encryption Scheme, In EUROCRYPT 2011, LNCS vol. 6632, pp. 129–148, Springer, 2011.

[16] J.W. Cooley and J.W. Tukey, An algorithm for the machine calculation of complex Fourier series, Math. of Comp., 19(90):297–301, 1965.

[17] K. Kalach and J.P. David, Hardware implementation of large number multiplication by FFT with modular arithmetic, The 3rd International IEEE-NEWCAS Conference, pp. 267–270, 19-22 June 2005.

[18] A. Schonhage and V. Strassen, Schnelle Multiplikation grosser Zahlen, Computing, vol. 7, no. 3, pp. 281–292, Springer, 1971.

[19] J. von zur Gathen and J. Gerhard, Modern Computer Algebra, Cambridge University Press, 1999.

[20] L.C.C. García, Can Schonhage multiplication speed up the RSA decryption or encryption?, MoraviaCrypt, 2007.

[21] S. Craven, C. Patterson, and P. Athanas, Super-sized multiplies: how do FPGAs fare in extended digit multipliers?, Proc. of Military and Aerospace Programmable Logic Devices (MAPLD’04).

[22] S. Yazaki and K. Abe, An Optimum Design of FFT Multi-Digit Multiplier and Its VLSI Implementation, Bulletin of the University of Electro-Communications, vol. 18, no. 1 and 2, pp. 39–46, Jan. 2006.

[23] N. Emmart and C.C. Weems, High Precision Integer Multiplication with a GPU Using Strassen’s Algorithm with Multiple FFT Sizes, Parallel Processing Letters, pp. 359–375, 2011.

[24] A. Karatsuba and Y. Ofman, Multiplication of Many-Digital Numbers by Automatic Computers, Proceedings of the USSR Academy of Sciences, vol. 145, pp. 293–294, 1962.

[25] C.S. Wallace, A Suggestion for a Fast Multiplier, IEEE Transactions on Electronic Computers, vol. 13, no. 1, pp. 14–17, Feb. 1964.

Yarkın Doroz received a BSc. degree in Electronics Engineering in 2009 and an MSc. degree in Computer Science in 2011 from Sabanci University. Currently he is working towards a Ph.D. degree in Electrical and Computer Engineering at Worcester Polytechnic Institute. His research is focused on developing hardware/software designs for Fully Homomorphic Encryption Schemes.

Erdinc Ozturk received his BS degree in Microelectronics from Sabanci University in 2003. He received his MS degree in Electrical Engineering in 2005 and his PhD degree in Electrical and Computer Engineering in 2009 from Worcester Polytechnic Institute. His research field was Cryptographic Hardware Design and he focused on efficient Identity Based Encryption. After receiving his PhD degree, he worked at Intel in Massachusetts for 4 years as a hardware engineer, then he joined Istanbul Commerce University as an assistant professor.

Berk Sunar received his BSc degree in Electrical and Electronics Engineering from Middle East Technical University in 1995 and his Ph.D. degree in Electrical and Computer Engineering from Oregon State University in 1998. After briefly working as a member of the research faculty at Oregon State University, Sunar joined the Worcester Polytechnic Institute faculty. He is currently heading the Vernam Applied Cryptography Group. Sunar received the prestigious National Science Foundation Young Faculty Early CAREER award in 2002 and the IBM Research Pat Goldberg Memorial Best Paper Award in 2007.
