Microprocessors and Microsystems xxx (2014) xxx–xxx
MICPRO 2151 No. of Pages 10, Model 5G
24 June 2014
Contents lists available at ScienceDirect
Microprocessors and Microsystems
journal homepage: www.elsevier.com/locate/micpro
A million-bit multiplier architecture for fully homomorphic encryption
http://dx.doi.org/10.1016/j.micpro.2014.06.003
0141-9331/© 2014 Published by Elsevier B.V.
⇑ Corresponding author.
E-mail addresses: [email protected] (Y. Doröz), [email protected] (E. Öztürk), [email protected] (B. Sunar).
1 An earlier short conference version of this paper appeared in [41].
Please cite this article in press as: Y. Doröz et al., A million-bit multiplier architecture for fully homomorphic encryption, Microprocess. Microsyst. (2014), http://dx.doi.org/10.1016/j.micpro.2014.06.003
Yarkın Doröz a, Erdinç Öztürk b,⇑, Berk Sunar a
a Worcester Polytechnic Institute, 100 Institute Road, Worcester, MA 01609, USA
b Istanbul Commerce University, Kucukyali, Istanbul 34840, Turkey
Article history:
Received 6 December 2013
Revised 16 May 2014
Accepted 11 June 2014
Available online xxxx
Keywords:
Fully homomorphic encryption
Very-large number multiplication
Number theoretic transform
In this work we present a full and complete evaluation of a very large multiplication scheme in custom hardware. We designed a novel architecture to realize a million-bit multiplication scheme based on the Schönhage–Strassen Algorithm. We constructed our scheme using the Number Theoretic Transform (NTT). The construction makes use of an innovative cache architecture along with processing elements customized to match the computation and access patterns of the NTT-based recursive multiplication algorithm. We realized our architecture in Verilog and, using a 90 nm TSMC library, obtained a maximum clock frequency of 666 MHz. At this frequency, our architecture is able to compute the product of two million-bit integers in 7.74 ms. Our data shows that the performance of our design matches that of previously reported software implementations on a high-end 3 GHz Intel Xeon processor, while requiring only a tiny fraction of the area.1
© 2014 Published by Elsevier B.V.
1. Introduction
Arbitrary precision arithmetic has been studied for a long time and is a well-established research field. Since multiplication is the most compute-intensive basic operation, numerous algorithms for fast large-number multiplication have been introduced. Until 1960, the school-book multiplication method was thought to be the fastest algorithm for multiplying two large numbers. In 1960, the Karatsuba Algorithm was discovered, and it was published in 1962 [16]. Although the Karatsuba Algorithm is much more efficient than school-book multiplication for large numbers, it is possible to optimize multiplication for very-large numbers even further. In 1971, the Schönhage–Strassen Algorithm [10], an FFT-based large integer multiplication algorithm, was introduced. The FFT-based algorithm uses floating-point arithmetic, which leads to rounding errors. For exact multiplication, we use the Number Theoretic Transform (NTT) instead of the FFT.
Very-large integer arithmetic is used by only a few applications. However, this does not change the fact that efficient implementation of very-large multiplication algorithms is an important open problem. For some important and popular research fields, such as primality testing and homomorphic encryption, the runtime and throughput of the multiplication of two very large numbers is crucial for the practicality and efficiency of the scheme.
As stated above, very-large integer multiplication is important for the field of Fully Homomorphic Encryption (FHE). The first proposed FHE scheme, which permits an untrusted party to evaluate arbitrary logic operations directly on ciphertexts, was introduced by Gentry in [4]. This was an exciting development for the field of cryptographic research and led to a tremendous amount of work searching for more efficient FHE schemes. FHE holds great promise for numerous applications including private information retrieval, private search and data aggregation, electronic voting, and biometrics. FHE's role is further amplified in the cloud, where a server is shared by multiple users.
Although FHE schemes hold great promise, widespread application of FHE will only be possible when efficient FHE implementations are available. Unfortunately, current FHE schemes, especially the one introduced by Gentry in [4], have significant performance issues. None of the existing FHE proposals are considered practical yet, due to their extreme computational resource requirements. For instance, the Gentry and Halevi FHE [5] reports a ∼30 s latency for the re-encryption primitive on an Intel Xeon server. Their scheme requires multiplication of operands that are a few million bits long, and the performance of encryption, decryption and re-encryption evaluations is primarily determined by this operation. Existing software packages can handle integer operations at this magnitude (e.g. the GNU Multiple Precision Arithmetic Library [7]). However, the prohibitively high cost of the very-large multiplier motivates the study of custom designed architectures.
This is the goal of this work; we are not aware of any custom designed architecture implementation for very-large integer multiplication.
While more efficient schemes are appearing in the literature, we would like to take a snapshot of the attainable hardware performance of the Gentry–Halevi FHE variant by implementing a cost-efficient very-large integer multiplier using the FFT-based multiplication algorithm. We are motivated by the fact that the first commercial implementations of most public-key cryptographic schemes, e.g. RSA and Elliptic Curve DSA, were limited-functionality hardware products, due to the efficiency shortcomings of general purpose computers. We speculate that a similar growth pattern will emerge in the maturation of FHE. While the obvious efficiency shortcomings of FHE schemes are being worked out, we would like to pose the following question: How far are FHE schemes from being offered as a hardware product? Clearly, answering this question would be a major undertaking, requiring the evaluation of all FHE variants with numerous implementation techniques. Here, by providing strong area and performance numbers for the multiplier that we designed, we provide initial results for the FHE scheme of Gentry and Halevi.
In this work we present an implementation of our architecture realizing the multiplication of two million-bit integers. We used a variation of the FFT-based Schönhage–Strassen Algorithm using the NTT approach. Given the massive amount of data that needs to be efficiently processed, we identify data input/output and the associated routing as one of the challenges in our effort. A secondary challenge is the recursive nature of NTT computations, which makes it hard to parallelize the architecture without impeding data flow. We achieve a high-speed custom hardware architecture with a reasonable footprint that may be easily extended to realize the primitives of the Gentry–Halevi FHE scheme.
The rest of the paper is organized as follows. We present a brief review of the previous work in Section 2 and introduce the basic background in Section 3. We describe our approach in Section 4 and present the details of our design in Section 5. After the performance analysis in Section 6, we share the implementation results in Section 7.
2. Previous work
Until recently, large integer multiplication implementations were mostly limited to a few thousand bits, as required by classical encryption schemes. The introduction of Fully Homomorphic Encryption (FHE) schemes, especially Gentry and Halevi's FHE scheme, changed the focus to very-large integer multiplication algorithms. More groups started working on multiplication designs for different platforms with different optimization interests. In this section, we give background information on previous implementations of traditional large integer multiplication algorithms in Section 2.1. Also, in Section 2.2, we give background information on previous implementations of FHE schemes.
2.1. Traditional large integer multipliers
For cryptographic applications, key sizes differ across algorithms and schemes providing the same amount of security. Among mainstream cryptographic applications, the largest key sizes are used in public-key cryptography, such as the RSA algorithm. These keys are currently in the range of 1024 to 2048 bits for reasonable security. However, FHE schemes are quite unusual, requiring support for extremely long integer operands, from hundreds of thousands to millions of bits.
As we are discussing very-large integer arithmetic, which is not a well-studied research field, the natural place to turn for high performance large integer multiplication algorithms is the field of public-key cryptography. If we examine the efficiency and the methods of the algorithms used for large-integer multiplication, we can extrapolate our findings to very-large integer multiplication schemes. Public key schemes such as RSA and Diffie–Hellman commonly use efficient multiplication algorithms that can handle operands of thousands of bits. Numerous efficient implementations supporting RSA are available in the literature [1–3]. The thousand-bit multiplication algorithms are generally realized using two methods: classical multiplication and Karatsuba multiplication. For software implementations, due to the constant-time requirements of cryptographic schemes and the long and inefficient carry chains in the CPU pipeline, the classical multiplication algorithm gives faster results. For hardware implementations, however, the Karatsuba Algorithm is widely used. These implementations typically break down the long operands using a few iterations of the Karatsuba–Ofman algorithm [16]. The multiplication complexity of the Karatsuba–Ofman algorithm is sub-quadratic, i.e. O(n^{log_2 3}), where n denotes the length of the operands. While presenting great savings, even the Karatsuba–Ofman algorithm becomes insufficient when operand sizes grow beyond tens of thousands of bits.
Although there is a significant amount of literature on FFT-based multiplication algorithms, we encountered very few works that gave results for a detailed and full implementation in hardware. One of the very few implementation examples is the architecture presented in [9]. The authors implement an FFT-based multiplication scheme with scalable inputs. This architecture consists of only one stage of the butterfly unit, which is not a complete design, and one FFT unit that achieves a 1024-bit FFT computation in 448 clock cycles. In the follow-up paper [12] the authors present an area-cost comparison of several multiplier architectures. Their implementation realizes a modular multiplication in S clock cycles, using an architecture built around (3/2) log S butterfly operators and one accumulator, where S is the degree of the polynomial that is used for FFT multiplication. This polynomial is the result of the FFT-conversion of the integer to be multiplied.
In [13], the authors implement a large squaring operation utilizing a more complicated Number Theoretic Transform (NTT) based architecture. The authors claim that their squaring architecture can also be used for multiplication with minor modifications. They take the square of a 12-million digit number in 34 ms using a XC2VP100 with an 80 MHz clock frequency. Their result is 1.76 times faster than a software implementation on an Intel Pentium 4. However, since their design costs ten times more than a Pentium processor, the authors note that the cost of their design outweighs the performance gain.
Another design focusing on hardware implementation of very-large multiplication was proposed in [14]. This algorithm is more similar to the algorithms that we used in our implementation. For numbers of digits ranging from 2^5 to 2^13, their implementation is 25.7 times faster than software-based implementations, with an area cost of 9.05 mm². When the number of digits is increased to 2^21 (with a digit size of 4 bits) their implementation is 35.7 times faster than software-based implementations, with an area of 16.1 mm². Although this architecture achieves fast multiplication, the computations are performed over floating-point numbers, which introduces the possibility of an error. Indeed, the authors report error rates steadily increasing with increasing Hamming weight. They report error rates of about 0.025 and 0.06 for operand lengths of 4000 and 8000 bits, respectively. Clearly, the error rates will reach unacceptable levels for million-bit operands. Furthermore, for cryptographic applications, even a single bit error will have disastrous effects on the overall scheme.
Another very-large number multiplication implementation on an FPGA was presented by Wang and Huang [28]. The authors propose
an architecture for a 768 K-bit FFT multiplier using a 64 K-point finite field FFT as the key component. This component was implemented on both a Stratix V FPGA and an NVIDIA Tesla C2050 GPU, and the FPGA implementation achieved twice the speed of the GPU implementation [23].
2.2. GPU and FPGA implementations of homomorphic encryption schemes
There are currently several research groups working towards the development of optimized FHE architectures targeting platforms alternative to software implementations, to improve the efficiency of Somewhat Homomorphic Encryption (SHE) and FHE schemes. Recent developments in this area show that efficient hardware implementations will drastically increase the efficiency of fully homomorphic encryption schemes. The performance of very-large multiplication, which is an important primitive of some FHE schemes, could be significantly improved through the use of FPGA technology. Compared to ASIC design, FPGA technology offers flexibility at low cost. In addition, modern FPGAs include embedded hardware blocks, which are optimized for multiply-accumulate (MAC) operations, and thus can be exploited when implementing very-large multiplications.
Another alternative to software implementation that can improve the performance of SHE schemes is GPU implementation. An efficient GPU implementation of an FHE scheme was presented by Wang et al. [23]. The authors implemented the small parameter size version of Gentry and Halevi's lattice-based FHE scheme [18] on an NVIDIA C2050 GPU using the FFT algorithm, achieving speed-up factors of 7.68, 7.4 and 6.59 for the encryption, decryption and recryption operations, respectively. The FFT algorithm was used to target the bottleneck of this lattice-based scheme, namely the modular multiplication of very-large numbers, and the authors took advantage of the highly parallelized GPU architecture to accelerate the performance of the FHE scheme. However, the authors state that even with this speed-up, the implemented FHE scheme remains impractical. In [42], the authors present an optimized version of their previous implementation. This modified method, implemented on an NVIDIA GTX 690, achieves speed-up factors of 174, 7.6 and 13.5 for the encryption, decryption and recryption operations, respectively, when compared to an implementation of Gentry and Halevi's FHE scheme [18] running on an Intel Core i7 3770K machine.
Other work has also focused on efficient large integer multiplication implementations for various FHE schemes [18,39,40], in order to achieve faster hardware implementations. The use of a Comba multiplier [43], which can be implemented efficiently using the DSP slices of an FPGA [44], was proposed by Moore et al. [26] to improve the performance of integer-based FHE schemes [38,40]. An FPGA implementation of an RLWE SHE scheme was also targeted by Cousins et al. [24,25], in which Matlab Simulink is used to design the FHE primitives.
Lastly, Cao et al. [27] proposed a large-integer multiplier using integer-FFT multipliers combined with Barrett reduction to target the multiplication and modular reduction bottlenecks featured in many FHE schemes. The encryption step of the integer-based FHE schemes proposed by Coron et al. [39,40] was designed and implemented on a Xilinx Virtex-7 FPGA. The synthesis results show that speed-up factors of over 40 are achieved compared to existing software implementations of this encryption step [27]. This speed-up illustrates that further research into hardware implementations could greatly improve the performance of these FHE schemes. An overview of the above mentioned multipliers for different platforms is given in Table 4.
3. Background
Common multiplication schemes (the Karatsuba Algorithm, the classical school-book multiplication method) become infeasible for very-large number multiplications. For very-large operand sizes, efficient NTT-based multiplication schemes have been developed. The Schönhage–Strassen Algorithm is currently the asymptotically fastest algorithm for very-large numbers. It has been shown to outperform classical schemes for operand sizes larger than 2^17 bits [11].
3.1. Schönhage–Strassen Algorithm
The Schönhage–Strassen Algorithm is an NTT-based large integer multiplication algorithm with a runtime of O(N log N log log N) [10]. For an N-digit number, the NTT is computed using the ring R_N = Z/(2^N + 1)Z, where N is a power of 2.
In [15], the algorithm is explained as follows:
Algorithm 1. Schönhage–Strassen Algorithm
In the algorithm, for the NTT evaluation we sample the numbers A and B into N digits with a digit size of b. By this sampling, we represent the numbers A and B with tuples (a_0, a_1, ..., a_{N−1}) and (b_0, b_1, ..., b_{N−1}), respectively, where each a_i is b bits. The NTT evaluation on these numbers is done as follows:

A^{NTT}_k = Σ_{i=0}^{N−1} w^{ik} a_i   and   B^{NTT}_k = Σ_{i=0}^{N−1} w^{ik} b_i,

where A^{NTT} and B^{NTT} are the NTT-mapped tuples of A and B, respectively. The multiplications are modular multiplications with a selected prime p, and w is a primitive Nth root of unity modulo p, i.e. w^N ≡ 1 (mod p). Here, the prime p has to be selected large enough to prevent overflow errors during the next phase of the NTT-based multiplication algorithm. For integer multiplication, the tuples in NTT form are multiplied to form another tuple C^{NTT} with elements

C^{NTT}_k = A^{NTT}_k · B^{NTT}_k (mod p).
Using the inverse-NTT we compute

C_k = Σ_{i=0}^{N−1} w^{−ik} C^{NTT}_i.
As the last step, we accumulate the carry additions to finalize the evaluation of C.
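The transform, pointwise multiplication, inverse transform and carry accumulation described above can be sketched end to end in a few lines of Python. This is a toy-scale illustration, not the paper's data path: it uses a direct O(N²) NTT and deliberately small parameters (N = 8 digits of b = 2 bits, p = 257, with w an 8th primitive root of unity) chosen so that the no-overflow condition is satisfied.

```python
# Toy NTT-based multiplication in the style of the Schonhage-Strassen
# Algorithm, using a direct O(N^2) NTT. Illustrative parameters only:
# N = 8 digits of b = 2 bits each, p = 257 (so (N/2)*(2^b - 1)^2 = 36 < p,
# preventing overflow), and w an 8th primitive root of unity mod 257.
N, b, p = 8, 2, 257
w = pow(3, (p - 1) // N, p)  # 3 is a primitive root mod 257, so w has order 8

def ntt(t, root):
    # Direct evaluation of T_k = sum_i root^(i*k) * t_i (mod p).
    return [sum(pow(root, i * k, p) * t[i] for i in range(N)) % p
            for k in range(N)]

def multiply(x, y):
    # Sample x and y into N digits of b bits (upper halves are zero padding).
    a = [(x >> (b * i)) & (2**b - 1) for i in range(N)]
    c = [(y >> (b * i)) & (2**b - 1) for i in range(N)]
    A, C = ntt(a, w), ntt(c, w)
    # Digit-wise (inner) multiplication in the NTT domain.
    D = [(A[k] * C[k]) % p for k in range(N)]
    # Inverse NTT with w^-1, then scale each digit by N^-1 (mod p).
    d = ntt(D, pow(w, p - 2, p))
    n_inv = pow(N, p - 2, p)
    digits = [(n_inv * dk) % p for dk in d]
    # Accumulate the carries to finalize the result.
    result, carry = 0, 0
    for i, dk in enumerate(digits):
        t = dk + carry
        result += (t & (2**b - 1)) << (b * i)
        carry = t >> b
    return result + (carry << (b * N))

assert multiply(201, 123) == 201 * 123  # operands must fit in N*b/2 = 8 bits
```

At these toy sizes the operands are limited to N·b/2 = 8 bits; scaling the same structure to the paper's N = 3 · 2^15 and b = 24 is exactly what the custom architecture does in hardware.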
To realize the Schönhage–Strassen Algorithm efficiently, it is crucial to employ fast NTT and inverse-NTT computation
techniques. We adopted the most common method for computing Fast Fourier Transforms (FFTs), i.e. the Cooley–Tukey FFT Algorithm [8]. The algorithm computes the Fourier Transform of a sequence x:

X_k = Σ_{j=0}^{N−1} x_j e^{−i2πkj/N},

by turning the length-N transform computation into two N/2-size Fourier Transform computations as follows:

X_k = Σ_{m=0}^{N/2−1} x_{2m} e^{−i2πkm/(N/2)} + e^{−i2πk/N} Σ_{m=0}^{N/2−1} x_{2m+1} e^{−i2πkm/(N/2)}.
We replace e^{−i2πk/N} with powers of w and perform the division into two halves with odd and even indices recursively. With the use of this fast transform technique, we can evaluate the Schönhage–Strassen Multiplication Algorithm in O(N log N log log N) time.
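The even/odd split can be written directly as a recursive NTT, with the twiddle factor w^k taking the place of e^{−i2πk/N}. A minimal sketch with an illustrative small prime (p = 257), not the paper's parameters:

```python
# Recursive Cooley-Tukey NTT: a length-n transform splits into two length-n/2
# transforms over the even- and odd-indexed entries, recombined with the
# twiddle factor w^k (the modular stand-in for e^{-i*2*pi*k/N}). Illustrative
# prime p = 257; n must be a power of two and w a primitive nth root mod p.
p = 257

def ntt_ct(x, w):
    n = len(x)
    if n == 1:
        return x[:]
    # Half-size transforms use w^2, a primitive (n/2)th root of unity.
    even = ntt_ct(x[0::2], (w * w) % p)
    odd = ntt_ct(x[1::2], (w * w) % p)
    out = [0] * n
    t = 1  # running twiddle factor w^k
    for k in range(n // 2):
        out[k] = (even[k] + t * odd[k]) % p
        out[k + n // 2] = (even[k] - t * odd[k]) % p  # uses w^(n/2) = -1
        t = (t * w) % p
    return out

# Agrees with the direct O(n^2) evaluation of the transform:
w8 = pow(3, (p - 1) // 8, p)  # an 8th primitive root of unity mod 257
x = [5, 1, 3, 2, 7, 0, 4, 6]
direct = [sum(pow(w8, i * k, p) * x[i] for i in range(8)) % p for k in range(8)]
assert ntt_ct(x, w8) == direct
```

The recursion bottoms out at length-1 blocks here; the hardware design instead stops the recursion at 12-digit blocks handled by a dedicated unit, as described in Section 5.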
4. Our approach
Our literature review has revealed that existing multiplier schemes lack some of the essential features we need for FHE schemes. A few references propose FFT-based implementations of large integer multiplication techniques. These techniques exploit the convolution theorem to gain significant speedup over other multiplication techniques. To overcome the inefficiencies experienced in [13] and the noise problem experienced in [14], we chose to develop an application-specific custom architecture based on the NTT transform.
Fig. 1. A part of the fully parallel circuit realizing the Schönhage–Strassen Algorithm.
The Schönhage–Strassen Algorithm is the fastest and most efficient option for large integer multiplications. The algorithm has an asymptotic complexity of O(N log N log log N). Given the size of our operands, the Schönhage–Strassen Algorithm is perfectly suited to fulfill our performance needs. Another advantage of the algorithm is that it lends itself to a high degree of parallelization. A high-level illustration of part of the parallel realization of the algorithm for short operands is shown in Fig. 1. In the figure, the registers shown are the same registers, drawn separately for ease of viewing. We can add more reconstruction units in parallel, up to a factor of the digit size.
In our implementation of the Schönhage–Strassen Algorithm, the number of digits is 3 × 2^15 = 98,304. With the Cooley–Tukey approach, we decompose our butterfly circuit down to blocks of 3 × 2^2 = 12 digits. This approach provides ample opportunities for parallelization. However, there are limits on how far we can go in exploiting the parallelism in a hardware realization. There are two reasons for this:
• The first reason is that the stage reconstruction computations are simple operations that take a few clock cycles, whereas there is a huge amount of data to process. To overcome this obstacle we need an architecture that can handle a higher data bandwidth with rapid data access to perform calculations in parallel. Otherwise, merely increasing the number of computational blocks while keeping the bandwidth restricted (low) will not improve the performance.
• The second reason is that the index range of the dependent digits doubles in each stage, which requires
significantly more routing on the digits. As the number of computational blocks and the stage number increase, too many collisions and overlaps occur in the routing for the VLSI design tools to handle.
Having a well designed, high capacity cache is the key to achieving a high performance large integer multiplier. Furthermore, the cache must be tailored to match the access patterns of the computational elements. In our design we chose to incorporate m = 4 computation units, so that we gain some performance while avoiding the routing problems that might otherwise occur. Even with m = 4 the bus width reaches 512 bits. In order to supply the computation blocks with sufficient data, we have chosen the cache size of the architecture as N = 3 × 2^15 = 98,304. Moreover, to enable parallel reads without impeding the bandwidth, we divided the cache into 2m sub-caches. It is possible to incorporate more computation units, but this would make the control logic harder to design and create additional routing complications.
5. Design details
5.1. Parameter selection
The parameters for the implementation are based on the parameters in [15]. We choose a 64-bit word size for fast and efficient computations, the sampling (digit) size as b = 24, and the modulus as p = 2^64 − 2^32 + 1. The reason for this particular choice of p is that it allows us to realize a modular reduction using only a few primitive arithmetic operations. A 128-bit number is denoted as z = 2^96 a + 2^64 b + 2^32 c + d. Using the selected p, we perform the z (mod p) operation as 2^32(b + c) − a − b + d.
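The reduction identity can be checked directly in a few lines (a sketch; the function name `solinas_reduce` is ours, not the paper's):

```python
# Modular reduction for p = 2^64 - 2^32 + 1 using only shifts, additions and
# subtractions. For a 128-bit z = 2^96*a + 2^64*b + 2^32*c + d (32-bit limbs),
# the identities 2^96 = -1 (mod p) and 2^64 = 2^32 - 1 (mod p) give
# z = 2^32*(b + c) - a - b + d (mod p).
P = 2**64 - 2**32 + 1
MASK32 = 2**32 - 1

def solinas_reduce(z):
    # Split the 128-bit value into four 32-bit limbs.
    d = z & MASK32
    c = (z >> 32) & MASK32
    b = (z >> 64) & MASK32
    a = (z >> 96) & MASK32
    return (((b + c) << 32) - a - b + d) % P

# Matches a full reduction, e.g. for the largest product of two elements < p:
assert solinas_reduce((P - 1) ** 2) == ((P - 1) ** 2) % P
```

In hardware the intermediate 2^32(b + c) − a − b + d fits in a handful of adders and subtractors, which is precisely why this Solinas-style prime was selected.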
For the parameter N, we need to satisfy (N/2)(2^b − 1)^2 < p to prevent digit-wise overflows in the intermediary computations of the NTT. Also, N should be large enough to cover million-bit multiplication with the smallest possible value, so that the computation time is small. The best candidate is determined as N = 3 · 2^15 = 98,304. As we use 24 bits for the digit length, 3 · 2^15 digits hold a 1,179,648-bit number for NTT-based multiplication. (Although 3 · 2^15 · 24 is double this size, each number to be multiplied has to be zero-padded to double its length, therefore we can only multiply two 1,179,648-bit numbers using 3 · 2^15 digits of 24 bits.) This is the closest we could get to one million bits without losing efficiency in the NTT transform. Therefore, this parameter achieves one-million-bit multiplication with minimum memory overhead.
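These parameter choices can be sanity-checked numerically (a quick sketch using the values above):

```python
# Sanity check of the parameter selection: N = 3 * 2^15 digits of b = 24 bits,
# p = 2^64 - 2^32 + 1. The no-overflow condition (N/2)*(2^b - 1)^2 < p must
# hold, and half the digits (the rest being zero padding) carry operand bits.
N = 3 * 2**15  # 98,304 digits
b = 24         # digit (sampling) size in bits
p = 2**64 - 2**32 + 1

assert (N // 2) * (2**b - 1) ** 2 < p  # NTT intermediate sums cannot overflow
assert N * b // 2 == 1_179_648         # maximum operand length in bits
assert N == 12 * 2**13                 # 12-digit base blocks, 13 halvings
```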
Finally, we determine the primitive Nth root of unity, i.e. w. From the equation w^N ≡ 1 (mod p), we determine w = 3511764839390700819.
In the Cooley–Tukey FFT Algorithm, each recursive halving operation is referred to as a stage and is denoted S_i, where i is the stage index. The size of the smallest NTT block is selected to be 12 entries, and it is referred to as the 0th stage, i.e. S_0. The remaining stages are reconstruction stages and require different arithmetic operations from those in S_0. By the nature of the NTT algorithm, the processing block size doubles in each reconstruction stage. Therefore, it takes 13 reconstruction stages to complete the NTT operation.

In terms of INTT operations, every stage and operation is identical to the NTT. The only difference is the selection of w′, which is computed as w′ = w^{−1} mod p.
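Since p is prime, both w′ = w^{−1} mod p and the scaling factor N^{−1} mod p used later follow from Fermat's little theorem; a short check using the paper's values:

```python
# The INTT mirrors the NTT but uses w' = w^-1 mod p; the final scaling uses
# N^-1 mod p. Since p is prime, both inverses come from Fermat's little
# theorem: x^-1 = x^(p-2) (mod p). w below is the root reported in the paper.
p = 2**64 - 2**32 + 1
N = 3 * 2**15
w = 3511764839390700819

w_prime = pow(w, p - 2, p)  # w' used by the INTT
n_inv = pow(N, p - 2, p)    # scaling factor applied in the SCALE step

assert (w * w_prime) % p == 1
assert (N * n_inv) % p == 1
```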
5.2. Architecture overview
Our architecture is composed of a data cache, a central control unit, two routing units and a function unit. The multiplication
architecture is illustrated in Fig. 2. The architecture is designedto perform a restricted set of special functions. There are fourfunctions for handling the input/output transactions and threefunctions for arithmetic operations:
Sequential load. It is used to store a million-bit number into the cache. The number is given in digits starting from the least significant digit and stored in sequence starting from the beginning of the cache.
Sequential unload. The cache releases its contents starting from the least significant to the most significant digit.
Butterfly load. In the NTT an important step is the distribution of the digits into the right indices using the butterfly operation. Starting from the least significant digit, the digits are inserted into the cache locations identified by the butterfly indices.
Scale & unload. In the last step of the NTT-based multiplication computation, as a part of the INTT operation, the digits need to be scaled by N^{−1} (mod p). After this operation, the carries are accumulated and the digits are scaled back to 24 bits. The SCALE UNIT overlaps the scaling and carry accumulation of the digits with the output, since this is the final step in million-bit multiplication.
12 × 12 NTT/INTT. The smallest NTT/INTT computation is for 12 digits. The 12 × 12 NTT/INTT UNIT takes the digits sequentially and computes the 12-digit NTT/INTT using simple shifts and addition operations.
Stage-reconstruction. This function is used for the reconstruction of the stages. According to the given input, it reconstructs an NTT or INTT stage. A reconstruction only completes one stage at a time, with the given stage index as input. In order to complete a full reconstruction, i.e. a complete evaluation of the NTT, it is invoked for all 13 stages.
Inner-multiplication. This operation is used to compute the digit-wise modular multiplications. For this we utilize the multipliers used in the STAGE-RECONSTRUCTION UNIT.
Using the functions outlined above, we can compute the product of two million-bit numbers A and B using the following sequence of operations:
1. A is loaded into the cache by using BUTTERFLY LOAD.
2. The NTT of the number A, i.e. NTT(A), is computed by calling first the 12 × 12 NTT function, and afterwards the STAGE-RECONSTRUCTION function for all stages.
3. NTT(A) is stored back to the RAM using SEQUENTIAL UNLOAD.
4. Using the same sequence of functions as above, we also compute NTT(B).
5. The cache can only hold half of the digits of NTT(A) and NTT(B). Therefore, the numbers are divided into lower and upper halves: NTT(A) = {NTT(A)_h, NTT(A)_l} and NTT(B) = {NTT(B)_h, NTT(B)_l}.
6. We use the SEQUENTIAL LOAD function to store NTT(A)_h and NTT(B)_h.
7. Then, an INNER MULTIPLICATION function is used to calculate the digit-wise modular multiplications: C_h[i] = NTT(A)_h[i] · NTT(B)_h[i] (mod p).
8. The result is stored back to the RAM by SEQUENTIAL UNLOAD.
9. We repeat the above three steps to compute the lower part: C_l[i] = NTT(A)_l[i] · NTT(B)_l[i] (mod p).
10. The result digits, i.e. C[i], are loaded into the cache by SEQUENTIAL LOAD. At this point the cache contains the multiplication result, but still in NTT form.
Fig. 2. Overview of the large integer multiplier (function unit, routing units, stage-reconstruction units, 12 × 12 NTT/INTT unit, scale unit, multiplier control unit, and cache).
11. The result is converted into integer form by using the 12 × 12 INTT function, which is followed by the complete set of STAGE-RECONSTRUCTION functions:
C′ = INTT(C[i]).
12. In the last step, the result is scaled and the carries are accumulated by the SCALE & UNLOAD function to finalize the computation of C:
C[i + 1] = C′[i + 1] + ⌊C′[i]/p⌋.
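The cache-splitting in steps 5–10, i.e. the two-pass inner multiplication forced by the half-size cache, can be modeled at a high level as follows (a sketch with plain Python lists standing in for the cache and RAM; the NTT itself is abstracted away and the function name is ours):

```python
# High-level model of steps 5-10: the cache holds only half of NTT(A) and
# NTT(B) at a time, so the digit-wise (inner) multiplication is done in two
# load/multiply/unload passes over the upper and lower halves.
def inner_multiplication(ntt_a, ntt_b, p):
    h = len(ntt_a) // 2
    # Pass 1: SEQUENTIAL LOAD the upper halves, multiply, SEQUENTIAL UNLOAD.
    upper = [(x * y) % p for x, y in zip(ntt_a[h:], ntt_b[h:])]
    # Pass 2: repeat for the lower halves.
    lower = [(x * y) % p for x, y in zip(ntt_a[:h], ntt_b[:h])]
    # RAM now holds C in NTT form, ready for the INTT and scaling steps.
    return lower + upper

assert inner_multiplication([1, 2, 3, 4], [5, 6, 7, 8], 257) == [5, 12, 21, 32]
```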
Table 1
Assignment table.

         arith0    arith1    arith2    arith3
S1–S10   sc0–sc1   sc2–sc3   sc4–sc5   sc6–sc7
S11      sc0–sc1   sc2–sc3   sc4–sc5   sc6–sc7
S12      sc0–sc2   sc1–sc3   sc4–sc6   sc5–sc7
S13      sc0–sc4   sc1–sc5   sc2–sc6   sc3–sc7
5.3. Cache system
The size of the cache is important for the timing of multiplications. In each stage reconstruction process of the NTT algorithm, we need to match the indices of odd and even digits. The index difference of the odd and even digits for a reconstruction stage is computed as S_{i,diff} = 12 · 2^{i−1}, where i is the index of the reconstruction stage, i.e. 1 ≤ i ≤ 13. Since later stages require digits from distant indices, an adequately sized cache should be chosen to reduce the number of input/output transactions between the cache and the RAM.

Let N′ denote the chosen cache size. Then, we can divide the N digits into 2^t = N/N′ blocks, i.e. N = {N_{2^t−1}, N_{2^t−2}, ..., N_0}. Once a block is given as input, we can compute the reconstruction stages until N′ < S_{i,diff} for the ith stage. Then, starting from the ith stage, N_j requires digits from N_{j+1}, where j is the block index. So, we need to divide N_j and N_{j+1} into halves and match the upper half of N_j with that of N_{j+1}, and likewise for the lower halves. This matching process adds 2N′ clock cycles for each block. The total input/output overhead then evaluates to 2N · log2(N/N′), where log2(N/N′) is the number of stages that require digit matching across blocks. In our implementation, we optimize for speed by selecting N′ = N.
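These quantities are easy to tabulate. The sketch below uses the paper's N = 98,304 together with an assumed smaller cache of N′ = N/8 purely for illustration (the actual implementation selects N′ = N), listing the per-stage index differences and counting the stages that force cross-block matching.

```python
# Per-stage odd/even index differences and the cross-block stage count.
# N is the paper's digit count; N_PRIME = N/8 is an illustrative cache
# (block) size -- the actual implementation selects N' = N.
N = 98304
N_PRIME = N // 8

def stage_diff(i):
    """Index difference between matched odd and even digits in stage i."""
    return 12 * 2 ** (i - 1)

# A stage needs digits from a neighbouring block once the matching
# distance reaches the block size.
cross_block = [i for i in range(1, 14) if N_PRIME <= stage_diff(i)]

assert stage_diff(13) == N // 2      # the final stage matches the two halves
assert len(cross_block) == 3         # = log2(N / N_PRIME) stages
```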
Although a large cache is important for our design, a straightforward cache implementation is not sufficient to support parallelism. The main arithmetic functions utilized in the multiplication process, such as 12 × 12 NTT/INTT,² STAGE-RECONSTRUCTION and INNER MULTIPLICATION, are highly suitable for parallelization. To achieve parallelization, the cache should be able to sustain the required bandwidth for multiple units. To sustain this bandwidth, we build up the cache by combining small, equal-size caches, or sub-caches as we refer to them. Combining these sub-caches at the top level, we can operate the cache either as a single-cache or as a multi-cache system. For linear functions, such as SEQUENTIAL LOAD and BUTTERFLY LOAD, the cache works as a single cache with one input/output port, whereas for parallel functions it works as a multi-cache system with multiple input/output ports. The number of sub-caches should be equal to 2·m (double the number of multipliers in the STAGE-
² In the design we choose to implement one unit, because its benefit is not worth the area cost.
RECONSTRUCTION UNIT) to eliminate read/write accesses to the same sub-cache during the reconstruction processes. Each sub-cache has a size of N/(2·m) and we denote them as {sc_0, sc_1, ..., sc_{2m−1}}.
5.4. Routing unit
The ROUTING UNITS play an important role in matching the odd and even digits to the arithmetic units. As stated previously, the index difference of the digits is 12 · 2^{i−1}. Therefore, in the last log2(2m) reconstruction stages, odd and even digits fall into different sub-caches. The assignment of sub-caches to the arithmetic units for each stage reconstruction is shown in Table 1, where the arithmetic units are referred to as arith_i, with i the index number.

Clearly, each arithmetic unit is assigned to two sub-caches in each stage. From S1 to S10, odd and even indices are matched from the same sub-cache. Once an evaluation in sc_{2i} is finished, the unit continues to process sc_{2i+1}. In the remaining stages, odd and even values reside in different sub-caches, so they are accessed by selecting sc_{i1} and sc_{i2} interchangeably. These assignments between sub-caches and STAGE RECONSTRUCTION UNITS are made by the CENTRAL CONTROL UNIT.
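The pairings of Table 1 for the last three stages follow a simple rule that can be stated compactly: in stage S_{10+k}, sub-cache sc_i is matched with sc_{i XOR 2^{k−1}}. The sketch below is an illustrative model of this routing rule (with m = 4, i.e. 8 sub-caches), not the RTL itself.

```python
# Reproduce the sub-cache pairings of Table 1 for the last three
# reconstruction stages (m = 4 arithmetic units, 2m = 8 sub-caches).
M = 4
NUM_SC = 2 * M

def pairs(stage_k):
    """Sub-cache pairs for stage S_(10+k): sc_i matched with sc_(i XOR 2^(k-1))."""
    d = 1 << (stage_k - 1)
    return [(i, i ^ d) for i in range(NUM_SC) if i < (i ^ d)]

table = {10 + k: pairs(k) for k in range(1, 4)}
assert table[11] == [(0, 1), (2, 3), (4, 5), (6, 7)]
assert table[12] == [(0, 2), (1, 3), (4, 6), (5, 7)]
assert table[13] == [(0, 4), (1, 5), (2, 6), (3, 7)]
```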
5.5. Function unit
The function unit can be divided into three parts: the SCALE UNIT, the 12 × 12 NTT/INTT UNIT and the STAGE RECONSTRUCTION UNIT.

SCALE UNIT: As mentioned earlier, after the 12 × 12 INTT operation, a final step needs to be performed to compute the product. This final operation can be written as d_i · N^{−1} + c_i (mod p) = {c_{i+1}, r_i}. For our parameter choice, N^{−1} (mod p) = 0xFFFF555455560001, a constant with a special form whose product can be computed by simple shift and add operations using only a small number of arithmetic operations. The unit takes one digit per clock cycle, starting from the least significant digit d_0. The carry c_0 is set to zero at startup. After each scaling operation, the digit is added to the carry from the previous operation, and the least significant b = 24 bits of the 64-bit result are taken as the result r_i. The remaining bits are used as the carry c_{i+1} in the next iteration.
12 × 12 NTT/INTT UNIT: This unit can be used either for computing the first stage of the NTT or for the INTT, according to the required operation. This first NTT/INTT stage is computed directly from the NTT formula:
Fig. 4. ALU of the stage reconstruction unit.
X_j = Σ_{i=0}^{n−1} (w′)^{ji} · d_i (mod p).   (1)
In the equation, j is the index of the digit to be computed, and d_i refers to the digit value with index i. The parameter w′ = w^{2^{13}} (mod p), since the coefficient is squared in each stage. The parameter n is the smallest NTT block size to be evaluated, and in our architecture it is chosen as n = 12. The computation can be performed in two ways. First, all the powers (w′)^{ij} can be pre-computed into a lookup table and reused during the multiplications. However, this is a costly operation, since we need 121 multiplications per block (coefficients with (w′)^0 are ignored). When the entire million-bit operand is considered, this takes 121 · N/n = 991,232 multiplication operations. Having m = 4 multipliers decreases the computation time by a factor of four. However, each multiplication takes two clock cycles, so the total computation takes 495,616 clock cycles.

The overhead of the first stage can be decreased significantly by exploiting the special structure of the constant coefficients. Indeed, we can realize the modular multiplications with simple shift and add operations followed by a modular reduction at the end. Furthermore, note that in the NTT w′ = 0x10000 and in the INTT w′ = 0xFFFFEFFFF00010001, and therefore only a few additions and shifts are needed. These simple operations can be squeezed into a few clock cycles and pipelined to optimize throughput. Also, the coefficients repeat after a while, since (w′)^{12} ≡ w^{3·2^{15}} ≡ 1 (mod p). This eases the construction of the NTT/INTT units, since we can reuse the same circuit in the process. Due to pipelining, all operations can be performed in N clock cycles if we neglect the initial cycles needed to fill the pipeline.
STAGE RECONSTRUCTION UNIT: This unit holds the 64-bit multipliers and is responsible for the operation of two functions: STAGE-RECONSTRUCTION and INNER MULTIPLICATION. Its architecture is more complex than those of the other units in the FUNCTION UNIT. It is composed of a control unit, a coefficient block and an arithmetic unit. The architecture is illustrated in Fig. 3.

In an INNER MULTIPLICATION function, two digits are read from the cache and fed to the Coeff and Odd input lines. Since we are reading digits from each sub-cache one at a time, we have two clock cycles to perform a 64-bit multiplication. In order to reach higher frequencies, we used two 32-bit multipliers and extended them to construct the 64-bit multipliers. Using the classical schoolbook approach, we are able to perform a 64-bit multiplication in 2 cycles. Since the cache I/O speed is the bottleneck here and we cannot read faster than once per 2 cycles, utilizing faster multiplication algorithms, such as the Karatsuba algorithm, would not have improved the performance. The INNER MULTIPLICATION function computes a 64-bit multiplication, additions and a modular reduction. We pipelined the architecture to increase the clock frequency and the throughput. The unit can output a modular multiplication product every two clock cycles after the initial startup cost of the pipeline. The whole function takes N/2 multiplications and, with m multipliers, costs N/m clock cycles (see Fig. 4).

The STAGE-RECONSTRUCTION function requires a more complex control mechanism than the other functions. In each stage we need to compute
Fig. 3. Stage reconstruction unit.
O_{i,j} = E_{i−1,j} − O_{i−1,j} · w_{i−1}^{j (mod n_{i−1})} (mod p),
E_{i,j} = E_{i−1,j} + O_{i−1,j} · w_{i−1}^{j (mod n_{i−1})} (mod p).   (2)
In the equation, i denotes the stage index from 1 to 13, j denotes the index of the digits, w_{i−1} is the coefficient of stage i−1 and n_{i−1} is the modulus used to select the appropriate power of w_{i−1}. We may compute the terms n_i iteratively as n_{i+1} = 2 · n_i with n_0 = 12. Ideally we would store all the coefficients along with the odd digits. However, this would require another large cache of size (12 + 24 + 48 + ⋯ + 49,152) ≈ N digits. The cache can be reduced to almost half this size, since each stage uses all of the coefficients from the previous stage. Therefore, the size can be reduced to 12 + (24 + 48 + ⋯ + 49,152)/2 = 49,152 = N/2. To reduce the storage requirement further, we can instead compute the coefficients efficiently from a small set of pre-stored coefficients as follows:
1. The coefficients required in two consecutive stages are as follows: S_{i+1}: w_i^0, w_i^1, ..., w_i^{n_i} and S_{i+2}: w_{i+1}^0, w_{i+1}^1, ..., w_{i+1}^{n_{i+1}}.
2. Then S_{i+2}: w_{i+1}^0, w_{i+1}^1, ..., w_{i+1}^{2n_i}, since it holds that n_{i+1} = 2 · n_i.
3. Further, S_{i+2}: w_i^{0/2}, w_i^{1/2}, ..., w_i^{2n_i/2}, since w_i = w_{i+1}^2.
4. This shows that half of the coefficients of S_{i+2} are the same as those of S_{i+1}, and the other half are the square roots of the coefficients of S_{i+1}.
5. We can compute the square roots by multiplying each w_i^j with w_{i+1}^1.
Using the properties described above, we construct the COEFFICIENT TABLE by storing two columns of coefficients. In the first column, since our smallest computation block is 12, we compute and store all w_0^j coefficients for 0 ≤ j ≤ 11. We denote these coefficients as w_{first,i}, where i denotes the index of the coefficient. In the second column, for each of the remaining stages we compute and store w_i^1. The second-column coefficients are denoted by w_{second,i}. This makes a total of 24 coefficients, from which we can compute any of the w_i^j values. When we also include the coefficients for the INTT operations, our table contains 48 coefficients. The computation of an arbitrary coefficient using the table can be achieved as
w_i^j = w_{first,l} · Π_{t=0}^{i} w_{second,t}^{e}.
The values of l and e are functions of i and j, where e is either 1 or 0. Therefore we can omit the multiplications whenever e = 0. The total number of multiplications for computing w_i^j · O_i can be calculated as follows:

1. In every reconstruction stage we start by multiplying the odd digits with the w_{first,l}'s. This step makes a total of N/2 multiplications.
2. Apart from the first reconstruction stage, in each stage we also require coefficients from w_{second,0} to w_{second,i−1}. Since we cannot store all the coefficients, in each stage we need to rebuild the previous stage's coefficients. We are
using half of the previous-stage values, so in each stage we need N/4 additional multiplications.
3. The total number of multiplications then becomes Σ_{i=0}^{12} (N/2 + i · N/4) = 26 · N.
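The text leaves the exact index functions l and e implicit; one consistent choice, sketched below with small hypothetical parameters (p = 97, n_0 = 12 and three reconstruction stages), writes j = q · 2^i + r and takes l = q, with the bits of r selecting which w_{second,t} factors enter the product.

```python
# On-the-fly coefficient generation from a small stored table.
# Hypothetical small parameters: p = 97 (so p - 1 = 96 = 12 * 2^3),
# n_0 = 12 and K = 3 reconstruction stages; the (l, e) index mapping
# below is one consistent realization, not spelled out in the text.
P = 97
K = 3   # stage i uses a root w_i of order 12 * 2^i

def order(x):
    """Multiplicative order of x mod P."""
    o, y = 1, x
    while y != 1:
        y, o = y * x % P, o + 1
    return o

# Find w_K of exact order 12 * 2^K = 96, then derive w_i = w_(i+1)^2.
w = {K: next(g for g in range(2, P) if order(g) == 12 * 2 ** K)}
for i in range(K - 1, -1, -1):
    w[i] = w[i + 1] ** 2 % P

w_first = [pow(w[0], l, P) for l in range(12)]    # first column: w_0^l
w_second = {i: w[i] for i in range(1, K + 1)}     # second column: w_i^1

def coeff(i, j):
    """Rebuild w_i^j from the stored table via j = q * 2^i + r."""
    q, r = divmod(j, 1 << i)
    acc = w_first[q]                 # w_i^(q * 2^i) = w_0^q
    for t in range(i):               # bit t of r contributes w_i^(2^t) = w_(i-t)
        if (r >> t) & 1:
            acc = acc * w_second[i - t] % P
    return acc

assert all(coeff(i, j) == pow(w[i], j, P)
           for i in range(1, K + 1) for j in range(12 * 2 ** i))
```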
The STAGE RECONSTRUCTION architecture works as follows. The CENTRAL CONTROL UNIT sends an input bit sequence {opcode, stage_level, digit_index} to the STAGE RECONSTRUCTION CONTROL UNIT. The opcode defines whether the function is an NTT, an INTT or an INNER MULTIPLICATION operation. If it is a STAGE-RECONSTRUCTION function, the unit checks the stage_level bits to determine the computing stage and the digit_index bits to determine the starting point of the computation within the N-digit range. These bits are necessary especially if there are multiple STAGE RECONSTRUCTION UNITS: each unit needs to determine its own startup coefficient, i.e. w_i^j, so they initialize different w_{first,l} and w_{second,t}. Another important initialization is the mul_only signal (for multiplication only), which is crucial for computing w_i^j · O_i. Each multiplication of an odd digit with a coefficient means that the computation of w_i^j · O_i is either completed or still needs more multiplications. Accordingly, the mul_only signal is used to enable the E_{i,j} ∓ O_{i,j} · w_i^{j (mod n_i)} calculation. If the coefficient does not need any more processing, the new odd and even values are stored in the cache. Otherwise, the even digit is not updated, and the odd digit is updated using the rule O′_{i+1,j} = O′_{i,j} · w_{second,l} and stored into the cache. Here O′_{i,j} denotes an odd digit previously updated with a coefficient. To complete the computation of partially completed coefficients, the system is fed with the even and O′ digits, and with a new coefficient. This process continues until all the values for a stage are computed.
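The recombination performed by Eq. (2) can be checked in software on a toy case: given the length-n transforms of the even- and odd-indexed digits, one stage produces the length-2n transform. The sketch below uses assumed small parameters (p = 97, combining two length-6 transforms into a length-12 one); it models only the arithmetic of the butterfly, not the unit's control flow.

```python
# One stage of Eq. (2): combine the NTTs of the even- and odd-indexed
# digits (length n) into the NTT of the full sequence (length 2n).
# Small hypothetical parameters: p = 97, 2n = 12 (12 divides p - 1 = 96).
P = 97
N2 = 12                                  # combined transform length 2n
W = next(g for g in range(2, P)          # primitive 12th root of unity
         if pow(g, N2, P) == 1 and
            all(pow(g, N2 // q, P) != 1 for q in (2, 3)))

def ntt(d, w):
    """Direct O(n^2) transform of d with root w."""
    return [sum(x * pow(w, i * j, P) for i, x in enumerate(d)) % P
            for j in range(len(d))]

def reconstruct(E, O, w):
    """Butterfly of Eq. (2): E' = E + O*w^j and O' = E - O*w^j (mod p)."""
    n = len(E)
    X = [0] * (2 * n)
    for j in range(2 * n):
        t = O[j % n] * pow(w, j, P) % P
        # For j < n this is the '+' line of Eq. (2); since w^(j+n) = -w^j,
        # the '-' line falls out of the same expression for j >= n.
        X[j] = (E[j % n] + t) % P
    return X

d = [5, 17, 42, 3, 88, 1, 60, 7, 29, 96, 11, 54]
E = ntt(d[0::2], pow(W, 2, P))           # transforms of the sub-sequences
O = ntt(d[1::2], pow(W, 2, P))
assert reconstruct(E, O, W) == ntt(d, W)
```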
5.6. Central control unit
The CENTRAL CONTROL UNIT contains a state machine to handle the functions that are given as inputs to the system, and it is mainly responsible for starting the requested function. Since the 12 × 12 NTT/INTT UNIT and the SCALE UNIT consist only of datapaths, they are controlled with start and ready signals alone. As mentioned earlier, the STAGE RECONSTRUCTION UNITS require more complex operations, so each one is managed by its own control logic and only requires a bit sequence {opcode, stage_level, digit_index} from the CENTRAL CONTROL UNIT. These bit sequences are pre-determined according to the number of STAGE RECONSTRUCTION UNITS and the cache size.

The CENTRAL CONTROL UNIT also handles the input/output addressing of the sub-caches. Sequential functions require incremental addressing for each sub-cache. In the STAGE-RECONSTRUCTION function, the addressing is computed according to the stage level and is updated with the index range of the dependent odd and even digits.
The addressing for the BUTTERFLY LOAD function requires a slightly more complex approach than incremental addressing. At each level, the butterfly operation recursively divides the given input block into even and odd indices. The N digits are divided into smaller blocks until the block size reduces to 12 digits. This operation creates 8192 blocks of 12 digits each, which form a regular pattern that can easily be constructed. In our construction, with N = 98,304, the pattern can be generated using lookup tables consisting of 35 elements.
6. Performance analysis
The latency of each functional block in clock cycles is given in Table 2. Any LOAD or UNLOAD function takes N clock cycles, where N is the number of digits. Since the architecture contains one 12 × 12 NTT/INTT function block, these functions take N clock cycles per function. For the intermediate NTT/INTT operations, each stage takes a varying number of clock cycles to complete. For the stage with index i, i.e. S_i, the total number of clock cycles is ⌈N/m⌉ + ⌈N/2m⌉ · (i−1), where m denotes the total number of multipliers. The INNER MULTIPLICATION operation takes ⌈N/m⌉ clock cycles.

In order to perform a complete multiplication, two complete
NTT operations, two INNER MULTIPLICATION operations and one INTT
operation need to be performed. A complete NTT or INTT operation can be performed as follows:
1. The number is loaded into the cache using the BUTTERFLY LOAD operation.
2. The 12 × 12 NTT/INTT operation is performed.
3. STAGE RECONSTRUCTION operations are performed for stages 1–13.
4. A SCALE & UNLOAD is used to carry out the digit-wise additions and carry propagations, and to scale the product.
The total number of clock cycles for an NTT/INTT operation is thus N cycles for each load and unload operation, N cycles for the 12 × 12 NTT/INTT operation, and ⌈N/m⌉ + ⌈N/2m⌉ · (i−1) cycles for each stage reconstruction, which sum to

Σ_{i=1}^{13} [⌈N/m⌉ + ⌈N/2m⌉ · (i−1)] = 13 ⌈N/m⌉ + 78 ⌈N/2m⌉.
We need an additional 3N cycles to load one operand, to perform the 12 × 12 NTT computation and to unload the final result. The total number of cycles to compute the NTT of one operand thus becomes 52 ⌈N/m⌉ + 3N. During the multiplication, we need to run the NTT/INTT three times. An INNER MULTIPLICATION operation takes N cycles per load and unload operation and ⌈N/m⌉ clock cycles for the intermediate multiplication operations. Note that since our cache can only hold one operand at a time, we need to perform two INNER MULTIPLICATION operations. Hence, one complete large-integer multiplication takes two NTT, two INNER MULTIPLICATION and one INTT operations, resulting in a total of 158 ⌈N/m⌉ + 13N cycles. For instance, for a design with m = 4 multiplier units we need in total 2 × 16N + 2 × (9N/4) + 16N = 52.5N clock cycles. The cycle counts for the functional blocks and for a complete large-integer multiplication are given in Table 2.
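These counts are easy to tabulate in software. The helper below reproduces the totals behind Tables 2 and 3 as well as the 5.16-million-cycle figure of Section 7; the ceilings are exact here since m divides N.

```python
# Total clock cycles for one large multiplication: 158*ceil(N/m) + 13N.
from math import ceil

def ntt_cycles(N, m):
    """One complete NTT or INTT: load, 12x12 stage, 13 reconstruction
    stages and unload: 3N + 13*ceil(N/m) + 78*ceil(N/(2m))."""
    return 3 * N + 13 * ceil(N / m) + 78 * ceil(N / (2 * m))

def mult_cycles(N, m):
    """Two NTTs, two INNER MULTIPLICATION passes and one INTT."""
    inner = 2 * (2 * N + ceil(N / m))   # load + digit-wise multiply + unload
    return 3 * ntt_cycles(N, m) + inner

N = 98304
assert ntt_cycles(N, 4) == 16 * N       # 52*(N/4) + 3N
assert mult_cycles(N, 4) == 5160960     # 52.5N, i.e. ~7.75 ms at 666 MHz
```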
6.1. Scalability of the proposed multiplier
Here we briefly study the scalability of the proposed multiplier. We would like to reach an optimal design point without unnecessarily crippling the time performance. In Table 3 we list the number of cycles and the cycle count times the number of multiplier units, in other words the time-area product, for various choices of the number of multiplier units incorporated in the large multiplier design. Clearly, increasing m decreases the time required to perform the computation. However, we see diminishing returns due to the constant term in the complexity expression 158 ⌈N/m⌉ + 13N. To obtain this small improvement, we pay significantly in area, reducing overall efficiency, as evidenced by the time-area product shown in the last column. This suggests that a better optimization strategy is to instantiate the large NTT multiplier with a small number of multiplier units, e.g. m = 4 or m = 8, and then spend more area by replicating the large multipliers to gain a linear
Table 2
Clock cycle counts of functional blocks.

NTT(A), NTT(B):  2 BUTTERFLY LOAD          2N
                 2 12 × 12 NTT             2N
                 2 STAGE-RECON             104 ⌈N/m⌉
                 2 SEQUENTIAL UNLOAD       2N
A·B:             2 SEQUENTIAL LOAD         2N
                 2 INNER MULTIPLICATION    2 ⌈N/m⌉
                 2 SEQUENTIAL UNLOAD       2N
INTT(A·B):       BUTTERFLY LOAD            N
                 12 × 12 INTT              N
                 STAGE-RECON               52 ⌈N/m⌉
                 SCALE & UNLOAD            N
Total:           158 ⌈N/m⌉ + 13N
Table 3
Clock cycle counts and time-area product for various numbers of multiplier units.

m     Total cycles    Time-area (m × cycles)
4     52.5N           210N
8     32.7N           262N
16    22.8N           366N
32    17.9N           574N
64    15.4N           990N
Table 5
FHE performance estimates for m = 4 multiplier units (M = mult., MM = modular mult.).

           Mul       Barrett red.  Dec       Enc      Recrypt
# of Ops   1 M       2 M           1 MM      90 MM    1354 MM
Ours       7.74 ms   15.4 ms       23.2 ms   2.09 s   31.4 s
GH [5]     6.67 ms   13.3 ms       20 ms     1.8 s    32 s
improvement in the overall FHE primitive execution times for alinear investment in area.
7. Implementation results
In our design we fixed the number of digits at N = 98,304. The resulting architecture is capable of computing the product of two 1,179,648-bit integers in 5.16 million clock cycles. We modeled the proposed architecture in Verilog and synthesized it with the Synopsys Design Compiler using the TSMC 90 nm cell library. The architecture reaches a maximum clock frequency of 666 MHz and can compute the product of two 1,179,648-bit integers in 7.74 ms.

While it is impossible to make a fair comparison between an application-specific architecture and other types of platforms, i.e. general-purpose microprocessors, GPUs and FPGAs, we find it useful to state the results side by side, as shown in Table 4. The NVIDIA C2050 and NVIDIA GTX 690 GPUs are high-end, expensive processors and consume significantly more area than our design; to be precise, the GPUs contain about 18.7 and 43.7 times more equivalent gates, respectively. Put another way, we are able to fit 18 or 43 of our multipliers into the same area that the GPUs occupy. Similarly, the Stratix V FPGA implementation [28] uses an expensive, high-end, DSP-oriented FPGA with significant on-chip memory; compared to the proposed design, a rough back-of-the-envelope calculation reveals a 23.7 times larger footprint. The proposed design occupies only a tiny fraction of the footprint of the other proposals; in operations such as recryption, we can surpass their timings by implementing more of our multipliers and
Table 4
Comparison of timings for FFT multipliers.

Design                           Platform            Mult. (in ms)   Area (in million equiv. gates)
Proposed FFT multiplier          90 nm TSMC          7.74            26.7
FFT multiplier/reduction [23]    NVIDIA C2050 GPU    0.765           500
FFT multiplier/reduction [42]    NVIDIA GTX 690      0.583           1166
Only FFT transform [28]          Stratix V FPGA      0.125           633
running them in parallel, or the extra multipliers can be used to run multiple recryption processes. Although our multiplier is approximately 12 times slower, additional multipliers amortize the timing by up to a factor of 3.5.

To gain more insight, we look more closely at the architecture. The main contributor to the footprint of the architecture is the 768 Kbyte cache, which requires 26.5 million gates. In contrast, the footprint of the functional blocks is only 0.2 million gates. The total area of the design comes to 26.7 million equivalent gates. With the time performance summarized in Table 5, our architecture matches the speed of the Intel Xeon software implementation [5] running the NTL software package [17], which features a very similar NTT-based multiplication algorithm.
8. Conclusion
We presented the first full evaluation of a million-bit multiplication scheme in custom hardware. For this, we introduced a novel architecture for efficiently realizing multi-million-bit multiplications. Our design implements the Schönhage–Strassen Algorithm optimized using the Cooley–Tukey FFT technique to speed up the NTT conversions. When implemented in 90 nm technology, the architecture is capable of computing a million-bit multiplication in 7.74 ms while consuming an area of only 26.7 million gates. Our architecture presents a viable alternative to previous proposals, which either lack the required performance level or cannot provide error-free multiplication for the operand lengths we are seeking to support. Finally, we presented performance estimates for the Gentry–Halevi FHE primitives, assuming the large integer multiplier is used as a coprocessor. Our estimates show that the performance of our design matches the performance of previously reported software implementations on a high-end Intel Xeon processor, while requiring only a tiny fraction of the area. Therefore, our evaluation shows that much higher levels of performance may be reached by implementing FHE in custom hardware, thereby bringing FHE closer to deployment.
Acknowledgments
Funding for this research was in part provided by the USNational Science Foundation CNS Award #1117590. Funding wasalso provided by The Scientific and Technological Research Councilof Turkey, project number 113C019.
References
[1] C.K. Koc., High-Speed RSA Implementation, TR 201, RSA Laboratories,November 1994, 73p.
[2] C.K. Koc, RSA Hardware Implementation, TR 801, RSA Laboratories, April 1996,30p.
[3] Shay Gueron, Efficient software implementations of modular exponentiation, J.Cryptogr. Eng. 2 (1) (2012) 31–43.
[4] C. Gentry, A Fully Homomorphic Encryption Scheme, Ph.D. Thesis, Departmentof Computer Science, Stanford University, 2009.
ecture for fully homomorphic encryption, Microprocess. Microsyst. (2014),
888889890891892893894895896897898899900901902903904905906907908909910911912913914915916917918919920921922923924925926927928929930931932933934935936937938939940941942943944945946947948949950951952953954955956957958959960961
962963964965966967968969970971972973974975976977978979980981982
983
985985
986987988989990991992993994
996996
99799899910001001100210031004100510061007100810091010
10121012
1013101410151016101710181019102010211022102310241025
10 Y. DorözQ1 et al. / Microprocessors and Microsystems xxx (2014) xxx–xxx
MICPRO 2151 No. of Pages 10, Model 5G
24 June 2014
Q1
[5] Craig Gentry, Shai Halevi, Implementing gentry’s fully-homomorphicencryption scheme, in: EUROCRYPT 2011, LNCS, vol. 6632, Springer, 2011,pp. 129–148.
[6] D. Cousins, K. Rohloff, C. Peikert, R. Schantz, SIPHER: Scalable Implementationof Primitives for Homomorphic Encryption-FPGA Implementation usingSimulink, HPEC 2011.
[7] The GNU Multiple Precision Arithmetic Library, GMP 5.0.5. <http://gmplib.org/>.
[8] J.W. Cooley, J.W. Tukey, An algorithm for the machine calculation of complexFourier series, Math. Comput. 19 (90) (1965) 297–301.
[9] K. Kalach, J.P. David, Hardware implementation of large number multiplicationby FFT with modular arithmetic, in: IEEE-NEWCAS Conference, 2005. The 3rdInternational, 19–22 June, 2005, pp. 267–270.
[10] Arnold Schönhage, Volker Strassen, Schnelle multiplikation grosser zahlen,Computing 7(3), vol. 7, Springer, Wien, 1971.
[11] Luis Carlos Coronado Garcìa, Can Schönhage multiplication speed up the RSAdecryption or encryption?, MoraviaCrypt (2007)
[12] J.P. David, K. Kalach, N. Tittley, Hardware complexity of modularmultiplication and exponentiation, IEEE Trans. Comput. 56 (10) (2007)1308–1319.
[13] S. Craven, C. Patterson, P. Athanas, Super-sized multiplies: how do FPGAs farein extended digit multipliers? in: Proceedings of the 7th InternationalConference on Military and Aerospace Programmable Logic Devices(MAPLD’04), Washington, DC, USA, September 2004.
[14] S. Yazaki, K. Abe, An optimum design of FFT multi-digit multiplier and its VLSIimplementation, Bull. Univ. Electro-Commun. 18 (1 and 2) (2006) 39–46.
[15] N. Emmart, C.C. Weems, High precision integer multiplication with a GPUusing Strassen’s algorithm with multiple FFT sizes, Parallel Process. Lett.(2011) 359–375.
[16] A. Karatsuba, Yu. Ofman, Multiplication of many-digital numbers by automaticcomputers, Proc. USSR Acad. Sci. 145 (1962) 293–294.
[17] V. Shoup, NTL: A Library for doing Number Theory, Library Version 5.5.2.<http://www.shoup.net/ntl/>.
[18] C. Gentry, S. Halevi, Implementing Gentry’s fully-homomorphic encryptionscheme, in: EUROCRYPT, 2011, pp. 129–148.
[19] C. Gentry, S. Halevi, N.P. Smart, Homomorphic evaluation of the AES circuit,IACR Cryptol. ePrint Arch. 2012 (2012) 99.
[20] Z. Brakerski, C. Gentry, V. Vaikuntanathan, Fully homomorphic encryptionwithout bootstrapping, Electr. Colloq. Comput. Complex. (ECCC) 18 (2011)111.
[21] N.P. Smart, F. Vercauteren, Fully homomorphic SIMD operations, IACR Cryptol.ePrint Arch. 2011 (2011) 33.
[22] J.-S. Coron, T. Lepoint, M. Tibouchi, Batch fully homomorphic encryption overthe integers, IACR Cryptol. ePrint Arch. 2013 (2013) 36.
[23] W. Wang, Y. Hu, L. Chen, X. Huang, B. Sunar, Accelerating fully homomorphicencryption using GPU, in: HPEC, 2012, pp. 1–5.
[24] D. Cousins, K. Rohloff, R. Schantz, C. Peikert, SIPHER: scalable implementationof primitives for homomorphic encryption, Internet Source, September 2011.
[25] D. Cousins, K. Rohloff, C. Peikert, R.E. Schantz, An update on SIPHER (scalableimplementation of primitives for homomorphic encRyption) -FPGAimplementation using simulink, in: HPEC, 2012, pp. 1–5.
[26] C. Moore, N. Hanley, J. McAllister, M. O’Neill, E. O’Sullivan, X. Cao, TargetingFPGA DSP slices for a large integer multiplier for integer based FHE, in:Workshop on Applied Homomorphic Cryptography, vol. 7862, 2013.
[27] Xiaolin Cao, Ciara Moore, Máire O’Neill, Elizabeth O’Sullivan, Neil Hanley, Accelerating fully homomorphic encryption over the integers with super-size hardware multiplier and modular reduction, IACR Cryptol. ePrint Arch. (2013).
[28] W. Wang, X. Huang, FPGA implementation of a large-number multiplier forfully homomorphic encryption, in: ISCAS, 2013, pp. 2589–2592.
[29] C. Gentry, S. Halevi, Fully homomorphic encryption without squashing usingdepth-3 arithmetic circuits, IACR Cryptol. ePrint Arch. 2011 (2011) 279.
[30] N.P. Smart, F. Vercauteren, Fully homomorphic encryption with relativelysmall key and ciphertext sizes, in: Public Key Cryptography, 2010, pp. 420–443.
[31] Z. Brakerski, V. Vaikuntanathan, Efficient fully homomorphic encryption from(standard)LWE, in: FOCS, 2011, pp. 97–106.
[32] C. Gentry, S. Halevi, N.P. Smart, Better bootstrapping in fully homomorphicencryption, IACR Cryptol. ePrint Arch. 2011 (2011) 680.
[33] Craig Gentry, Shai Halevi, Nigel P. Smart, Fully homomorphic encryption withpolylog overhead, IACR Cryptol. ePrint Arch. 2011 (2011) 566.
[34] O. Regev, On lattices, learning with errors, random linear codes, andcryptography, in: STOC, 2005, pp. 84–93.
[35] V. Lyubashevsky, C. Peikert, O. Regev, On ideal lattices and learning with errorsover rings, IACR Cryptol. ePrint Arch. 2012 (2012) 230.
Please cite this article in press as: Y. Doröz et al., A million-bit multiplier archithttp://dx.doi.org/10.1016/j.micpro.2014.06.003
[36] Z. Brakerski, V. Vaikuntanathan, Fully homomorphic encryption from ring-LWE and security for key dependent messages, in: CRYPTO, 2011, pp. 505–524.
[37] K. Lauter, M. Naehrig, V. Vaikuntanathan, Can homomorphic encryption bepractical? in: Cloud Computing Security Workshop, 2011, pp. 113–124.
[38] M. van Dijk, C. Gentry, S. Halevi, V. Vaikuntanathan, Fully homomorphicencryption over the integers, in: EUROCRYPT, 2010, pp. 24–43.
[39] J.-S. Coron, A. Mandal, D. Naccache, M. Tibouchi, Fully homomorphicencryption over the integers with shorter public keys, in: CRYPTO, 2011, pp.487–504.
[40] J.-S. Coron, D. Naccache, M. Tibouchi, Public key compression and modulusswitching for fully homomorphic encryption over the integers, in:EUROCRYPT, 2012, pp. 446–464.
[41] Y. Doröz, E. Öztürk, B. Sunar, Evaluating the hardware performance of amillion-bit multiplier, in: Digital System Design (DSD), 2013 16th EuromicroConference on, 2013.
[42] W. Wang, Y. Hu, L. Chen, X. Huang, B. Sunar, Exploring the feasibility of fullyhomomorphic encryption, IEEE Trans. Comput. 99 (PrePrints) (2013) 1.
[43] P.G. Comba, Exponentiation cryptosystems on the IBM PC, IBM Syst. J. 29 (4)(1990) 526–538.
[44] T. Güneysu, Utilizing hard cores of modern FPGA devices for high-performancecryptography, J. Cryptogr. Eng. 1 (1) (2011) 37–55.
Yarkın Doröz received a BSc degree in Electronics Engineering in 2009 and an MSc degree in Computer Science in 2011, both from Sabanci University. He is currently working towards a PhD degree in Electrical and Computer Engineering at Worcester Polytechnic Institute. His research focuses on developing hardware/software designs for fully homomorphic encryption schemes.
Erdinç Öztürk received his BS degree in Microelectronics from Sabanci University in 2003, his MS degree in Electrical Engineering in 2005, and his PhD degree in Electrical and Computer Engineering in 2009, the latter two from Worcester Polytechnic Institute. His research field was cryptographic hardware design, with a focus on efficient implementations of identity-based encryption schemes. After receiving his PhD degree, he worked at Intel in Massachusetts for 4 years as a hardware engineer, then joined Istanbul Commerce University as an assistant professor. He is currently the head of the Electrical-Electronic Engineering program at Istanbul Commerce University.
Berk Sunar received his BSc degree in Electrical and Electronics Engineering from Middle East Technical University in 1995 and his PhD degree in Electrical and Computer Engineering (ECE) from Oregon State University in December 1998. After briefly working as a member of the research faculty at Oregon State University's Information Security Laboratory, Sunar joined Worcester Polytechnic Institute as an assistant professor. He currently heads the Cryptography and Information Security Laboratory (CRIS). Sunar received the prestigious National Science Foundation Faculty Early Career Development (CAREER) award in 2002.