Microprocessors and Microsystems xxx (2014) xxx–xxx
MICPRO 2151 No. of Pages 10, Model 5G
24 June 2014
Contents lists available at ScienceDirect
Microprocessors and Microsystems
journal homepage: www.elsevier.com/locate/micpro
A million-bit multiplier architecture for fully homomorphic encryption
http://dx.doi.org/10.1016/j.micpro.2014.06.003
0141-9331/© 2014 Published by Elsevier B.V.
⇑ Corresponding author.
E-mail addresses: [email protected] (Y. Doröz), [email protected] (E. Öztürk), [email protected] (B. Sunar).
1 An earlier short conference version of this paper appeared in [41].
Please cite this article in press as: Y. Doröz et al., A million-bit multiplier architecture for fully homomorphic encryption, Microprocess. Microsyst. (2014), http://dx.doi.org/10.1016/j.micpro.2014.06.003
Yarkın Doröz a, Erdinç Öztürk b,⇑, Berk Sunar a
a Worcester Polytechnic Institute, 100 Institute Road, Worcester, MA 01609, USA
b Istanbul Commerce University, Kucukyali, Istanbul 34840, Turkey
Article history:
Received 6 December 2013
Revised 16 May 2014
Accepted 11 June 2014
Available online xxxx
Keywords:
Fully homomorphic encryption
Very-large number multiplication
Number theoretic transform
In this work we present a full and complete evaluation of a very large multiplication scheme in custom hardware. We designed a novel architecture to realize a million-bit multiplication scheme based on the Schönhage–Strassen Algorithm. We constructed our scheme using the Number Theoretic Transform (NTT). The construction makes use of an innovative cache architecture along with processing elements customized to match the computation and access patterns of the NTT-based recursive multiplication algorithm. We realized our architecture in Verilog and, using a 90 nm TSMC library, obtained a maximum clock frequency of 666 MHz. At this frequency, our architecture is able to compute the product of two million-bit integers in 7.74 ms. Our data shows that the performance of our design matches that of previously reported software implementations on a high-end 3 GHz Intel Xeon processor, while requiring only a tiny fraction of the area.1
© 2014 Published by Elsevier B.V.
1. Introduction
Arbitrary precision arithmetic has been studied for a long time and is a well-established research field. Since multiplication is the most compute-intensive basic operation, numerous algorithms for fast large-number multiplication have been introduced. Until 1960, the school-book multiplication method was thought to be the fastest algorithm for multiplying two large numbers. In 1960, the Karatsuba Algorithm was discovered, and it was published in 1962 [16]. Although the Karatsuba Algorithm is much more efficient than school-book multiplication for large numbers, it is possible to optimize multiplication for very-large numbers even further. In 1971, the Schönhage–Strassen Algorithm [10], an FFT-based large integer multiplication algorithm, was introduced. The FFT-based algorithm uses floating-point arithmetic, which leads to rounding errors. For exact multiplication, we use the Number Theoretic Transform (NTT) instead of the FFT.
Very-large integer arithmetic is used by only a few applications. However, this does not change the fact that efficient implementation of very-large multiplication algorithms is an important open problem. For some important and popular research fields, such as primality testing and homomorphic encryption, the runtime and throughput of the multiplication of two very large numbers is crucial for the practicality and efficiency of the scheme.
As stated above, very-large integer multiplication is important for the field of Fully Homomorphic Encryption (FHE). The first proposed FHE scheme, which permits an untrusted party to evaluate arbitrary logic operations directly on ciphertexts, was introduced by Gentry in [4]. This was an exciting development for the field of cryptographic research and led to a tremendous amount of work searching for more efficient FHE schemes. FHE holds great promise for numerous applications including private information retrieval, private search and data aggregation, electronic voting, and biometrics. FHE's role is further amplified in the cloud, where a server is shared by multiple users.
Although FHE schemes hold great promise, widespread application of FHE will only be possible when efficient FHE implementations are available. Unfortunately, current FHE schemes, especially the one introduced by Gentry in [4], have significant performance issues. None of the existing FHE proposals are considered practical yet, due to their extreme computational resource requirements. For instance, the Gentry and Halevi FHE [5] reports a ∼30 s latency for the re-encryption primitive on an Intel Xeon server. Their scheme requires multiplication of operands that are a few million bits long, and the performance of encryption, decryption and re-encryption evaluations is primarily determined by this operation. Existing software packages can handle integer operations at this magnitude (e.g. the GNU Multiple Precision Arithmetic Library [7]). However, the prohibitively high cost of the very-large multiplier motivates the study of custom designed architectures.
This is the goal of this work; we are not aware of any custom designed architecture implementation for very-large integer multiplication.
While more efficient schemes are appearing in the literature, we would like to take a snapshot of the attainable hardware performance of the Gentry–Halevi FHE variant by implementing a cost-efficient very-large integer multiplier using the FFT-based multiplication algorithm. We are motivated by the fact that the first commercial implementations of most public-key cryptographic schemes, e.g. RSA and Elliptic Curve DSA, were limited-functionality hardware products, due to the efficiency shortcomings of general purpose computers. We speculate that a similar growth pattern will emerge in the maturation of FHE. While the obvious efficiency shortcomings of FHE schemes are being worked out, we would like to pose the following question: How far are FHE schemes from being offered as a hardware product? Clearly, answering this question would be a major undertaking, requiring the evaluation of all FHE variants with numerous implementation techniques. Here, by providing strong area and performance numbers for the multiplier that we designed, we provide initial results for the FHE scheme of Gentry and Halevi.
In this work we present an implementation of our architecture realizing the multiplication of two million-bit integers. We used a variation of the FFT-based Schönhage–Strassen Algorithm using the NTT approach. Given the massive amount of data that needs to be efficiently processed, we identify data input/output and the associated routing as one of the challenges in our effort. A secondary challenge is the recursive nature of NTT computations, which makes it hard to parallelize the architecture without impeding data flow. We achieve a high-speed custom hardware architecture with a reasonable footprint that may be easily extended to realize the primitives of the Gentry–Halevi FHE scheme.
The rest of the paper is organized as follows. We present a brief review of the previous work in Section 2 and introduce the basic background in Section 3. We describe our approach in Section 4 and present the details of our design in Section 5. After the performance analysis in Section 6, we share the implementation results in Section 7.
2. Previous work
Until recently, large integer multiplication implementations were mostly limited to a few thousand bits, as required by classical encryption schemes. The introduction of Fully Homomorphic Encryption (FHE) schemes, especially Gentry and Halevi's FHE scheme, changed the focus to very-large integer multiplication algorithms. More groups started working on multiplication designs for different platforms with different optimization interests. In this section, we give background information on previous implementations of traditional large integer multiplication algorithms in Section 2.1. Also, in Section 2.2, we give background information on previous implementations of FHE schemes.
2.1. Traditional large integer multipliers
For cryptographic applications, key sizes differ across algorithms and schemes providing the same amount of security. Among mainstream cryptographic applications, the largest key sizes are used in public-key cryptography, such as the RSA algorithm. These keys are currently in the range of 1024 to 2048 bits for reasonable security. However, FHE schemes are quite unusual, requiring support for extremely long integer operands, from hundreds of thousands to millions of bits.
As we are discussing very-large integer arithmetic, which is not a well-studied research field, the natural place to turn for high performance large integer multiplication algorithms is the field of public-key cryptography. If we examine the efficiency and the methods of the algorithms used for large-integer multiplication, we can extrapolate our findings to very-large integer multiplication schemes. Public key schemes such as RSA and Diffie–Hellman commonly use efficient multiplication algorithms that can handle operands of thousands of bits. Numerous efficient implementations supporting RSA are available in the literature [1–3]. The thousand-bit multiplication algorithms are generally realized using two methods: classical multiplication and Karatsuba multiplication. For software implementations, due to the constant-time requirements of cryptographic schemes and the long and inefficient carry chains in the CPU pipeline, the classical multiplication algorithm gives faster results. For hardware implementations, however, the Karatsuba Algorithm is widely used. These implementations typically break down the long operands using a few iterations of the Karatsuba–Ofman algorithm [16]. The multiplication complexity of the Karatsuba–Ofman algorithm is sub-quadratic, i.e. O(n^{log_2 3}), where n denotes the length of the operands. While presenting great savings, even the Karatsuba–Ofman algorithm becomes insufficient when operand sizes grow beyond tens of thousands of bits.
Although there is a significant amount of literature on FFT-based multiplication algorithms, we encountered very few works that gave results for a detailed and full implementation in hardware. One of the very few implementation examples is the architecture presented in [9]. The authors implement an FFT-based multiplication scheme with scalable inputs. This architecture consists of only one stage of the butterfly unit, which is not a complete design, and one FFT unit that achieves a 1024-bit FFT computation in 448 clock cycles. In the follow-up paper [12] the authors present an area-cost comparison of several multiplier architectures. Their implementation realizes a modular multiplication in S clock cycles, using an architecture built around (3/2) log S butterfly operators and one accumulator, where S is the degree of the polynomial that is used for FFT multiplication. This polynomial is the result of the FFT-conversion of the integer to be multiplied.
In [13], the authors implement a large squaring operation utilizing a more complicated Number Theoretic Transform (NTT) based architecture. The authors claim that their squaring architecture can also be used for multiplication with minor modifications. They take the square of a 12-million digit number in 34 ms using a XC2VP100 with an 80 MHz clock frequency. Their result is 1.76 times faster than a software implementation on an Intel Pentium 4. However, since their design costs ten times more than a Pentium processor, the authors note that the cost of their design outweighs the performance gain.
Another design focusing on hardware implementation of very-large multiplication was proposed in [14]. This algorithm is more similar to the algorithms that we used in our implementation. For numbers of digits ranging from 2^5 to 2^13, their implementation is 25.7 times faster than software-based implementations, with an area cost of 9.05 mm². When the number of digits is increased to 2^21 (with a digit size of 4 bits) their implementation is 35.7 times faster than software-based implementations, with an area of 16.1 mm². Although this architecture achieves fast multiplication, the computations are performed over floating-point numbers, which introduces the possibility of an error. Indeed, the authors report error rates steadily increasing with increasing Hamming weight. They report error rates of about 0.025 and 0.06 for operand lengths of 4000 and 8000 bits, respectively. Clearly, the error rates will reach unacceptable levels for million-bit operands. Furthermore, for cryptographic applications, even a single bit error will have disastrous effects on the overall scheme.
Another very-large number multiplication implementation on an FPGA was presented by Wang and Huang [28]. The authors propose
an architecture for a 768 K-bit FFT multiplier using a 64 K-point finite field FFT as the key component. This component was implemented on both a Stratix V FPGA and an NVIDIA Tesla C2050 GPU, and the FPGA implementation achieved twice the speed of the GPU implementation [23].
2.2. GPU and FPGA implementations of homomorphic encryption schemes
There are currently several research groups working towards the development of optimized FHE architectures targeting platforms alternative to software implementations, to improve the efficiency of Somewhat Homomorphic Encryption (SHE) and FHE schemes. Recent developments in this area show that efficient hardware implementations will drastically increase the efficiency of fully homomorphic encryption schemes. The performance of very-large multiplication, which is an important primitive of some FHE schemes, could be significantly improved through the use of FPGA technology. Compared to ASIC design, FPGA technology offers flexibility at low cost. In addition, modern FPGAs include embedded hardware blocks, which are optimized for multiply-accumulate (MAC) operations, and thus can be exploited when implementing very-large multiplications.
Another alternative to software implementation that can improve the performance of SHE schemes is GPU implementation. An efficient GPU implementation of an FHE scheme was presented by Wang et al. [23]. The authors implemented the small parameter size version of Gentry and Halevi's lattice-based FHE scheme [18] on an NVIDIA C2050 GPU using the FFT algorithm, achieving speed-up factors of 7.68, 7.4 and 6.59 for the encryption, decryption and recryption operations, respectively. The FFT algorithm was used to target the bottleneck of this lattice-based scheme, namely the modular multiplication of very-large numbers, and the authors took advantage of the highly parallelized GPU architecture to accelerate the performance of the FHE scheme. However, the authors state that even with this speed-up, the implemented FHE scheme remains impractical. In [42], the authors present an optimized version of their previous implementation. This modified method, implemented on an NVIDIA GTX 690, achieves speed-up factors of 174, 7.6 and 13.5 for the encryption, decryption and recryption operations, respectively, when compared to an implementation of Gentry and Halevi's FHE scheme [18] running on an Intel Core i7 3770K machine.
Other work has also focused on efficient large integer multiplication implementations for various FHE schemes [18,39,40], in order to achieve faster hardware implementations. The use of a Comba multiplier [43], which can be implemented efficiently using the DSP slices of an FPGA [44], was proposed by Moore et al. [26] to improve the performance of integer-based FHE schemes [38,40]. An FPGA implementation of an RLWE SHE scheme was also targeted by Cousins et al. [24,25], in which Matlab Simulink is used to design the FHE primitives.
Lastly, Cao et al. [27] proposed a large-integer multiplier using integer-FFT multipliers combined with Barrett reduction to target the multiplication and modular reduction bottlenecks featured in many FHE schemes. The encryption step of the integer-based FHE schemes proposed by Coron et al. [39,40] was designed and implemented on a Xilinx Virtex-7 FPGA. The synthesis results show that speed-up factors of over 40 are achieved compared to existing software implementations of this encryption step [27]. This speed-up illustrates that further research into hardware implementations could greatly improve the performance of these FHE schemes. An overview of the above mentioned multipliers for different platforms is given in Table 4.
3. Background
Common multiplication schemes (the Karatsuba Algorithm, the classical school-book multiplication method) become infeasible for very-large number multiplications. For very-large operand sizes, efficient NTT-based multiplication schemes have been developed. The Schönhage–Strassen Algorithm is currently the asymptotically fastest algorithm for very-large numbers. It has been shown to outperform classical schemes for operand sizes larger than 2^17 bits [11].
3.1. Schönhage–Strassen Algorithm
The Schönhage–Strassen Algorithm is an NTT-based large integer multiplication algorithm with a runtime of O(N log N log log N) [10]. For an N-digit number, the NTT is computed using the ring R_N = Z/(2^N + 1)Z, where N is a power of 2.
In [15], the algorithm is explained as follows:
Algorithm 1. Schönhage–Strassen Algorithm
In the algorithm, for the NTT evaluation we sample the numbers A and B into N digits with a digit size of b. By this sampling, we represent the numbers A and B with tuples (a_0, a_1, ..., a_{N−1}) and (b_0, b_1, ..., b_{N−1}), respectively, where each a_i is b bits. The NTT evaluation on these numbers is done as follows:

A^{NTT}_k = Σ_{i=0}^{N−1} w^{ik} a_i   and   B^{NTT}_k = Σ_{i=0}^{N−1} w^{ik} b_i,

where A^{NTT} and B^{NTT} are the NTT-mapped tuples of A and B, respectively. The multiplications are modular multiplications with a selected prime p, and w is a primitive Nth root of unity modulo p, i.e. w^N ≡ 1 (mod p). Here, the prime p has to be selected large enough to prevent overflow errors during the next phase of the NTT-based multiplication algorithm. For integer multiplication, the tuples in NTT form are multiplied to form another tuple C^{NTT} with elements

C^{NTT}_k = A^{NTT}_k · B^{NTT}_k (mod p).
Using the inverse-NTT we compute

C_k = Σ_{i=0}^{N−1} w^{−ik} C^{NTT}_i.
As the last step, we accumulate the carry additions to finalize the evaluation of C.
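The transform, pointwise multiplication, inverse transform and carry accumulation described above can be sketched end to end in a few lines of Python. This is a toy-scale illustration, not the paper's data path: it uses a direct O(N²) NTT and deliberately small parameters (N = 8 digits of b = 2 bits, p = 257, with w an 8th primitive root of unity) chosen so that the no-overflow condition is satisfied.

```python
# Toy NTT-based multiplication in the style of the Schonhage-Strassen
# Algorithm, using a direct O(N^2) NTT. Illustrative parameters only:
# N = 8 digits of b = 2 bits each, p = 257 (so (N/2)*(2^b - 1)^2 = 36 < p,
# preventing overflow), and w an 8th primitive root of unity mod 257.
N, b, p = 8, 2, 257
w = pow(3, (p - 1) // N, p)  # 3 is a primitive root mod 257, so w has order 8

def ntt(t, root):
    # Direct evaluation of T_k = sum_i root^(i*k) * t_i (mod p).
    return [sum(pow(root, i * k, p) * t[i] for i in range(N)) % p
            for k in range(N)]

def multiply(x, y):
    # Sample x and y into N digits of b bits (upper halves are zero padding).
    a = [(x >> (b * i)) & (2**b - 1) for i in range(N)]
    c = [(y >> (b * i)) & (2**b - 1) for i in range(N)]
    A, C = ntt(a, w), ntt(c, w)
    # Digit-wise (inner) multiplication in the NTT domain.
    D = [(A[k] * C[k]) % p for k in range(N)]
    # Inverse NTT with w^-1, then scale each digit by N^-1 (mod p).
    d = ntt(D, pow(w, p - 2, p))
    n_inv = pow(N, p - 2, p)
    digits = [(n_inv * dk) % p for dk in d]
    # Accumulate the carries to finalize the result.
    result, carry = 0, 0
    for i, dk in enumerate(digits):
        t = dk + carry
        result += (t & (2**b - 1)) << (b * i)
        carry = t >> b
    return result + (carry << (b * N))

assert multiply(201, 123) == 201 * 123  # operands must fit in N*b/2 = 8 bits
```

At these toy sizes the operands are limited to N·b/2 = 8 bits; scaling the same structure to the paper's N = 3 · 2^15 and b = 24 is exactly what the custom architecture does in hardware.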
To realize the Schönhage–Strassen Algorithm efficiently, it is crucial to employ fast NTT and inverse-NTT computation
techniques. We adopted the most common method for computing Fast Fourier Transforms (FFTs), i.e. the Cooley–Tukey FFT Algorithm [8]. The algorithm computes the Fourier Transform of a sequence x:

X_k = Σ_{j=0}^{N−1} x_j e^{−i2πkj/N},

by turning the length-N transform computation into two N/2-size Fourier Transform computations as follows:

X_k = Σ_{m=0}^{N/2−1} x_{2m} e^{−i2πkm/(N/2)} + e^{−i2πk/N} Σ_{m=0}^{N/2−1} x_{2m+1} e^{−i2πkm/(N/2)}.
We replace e^{−i2πk/N} with powers of w and perform the division into two halves with odd and even indices recursively. With the use of this fast transform technique, we can evaluate the Schönhage–Strassen Multiplication Algorithm in O(N log N log log N) time.
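The even/odd split can be written directly as a recursive NTT, with the twiddle factor w^k taking the place of e^{−i2πk/N}. A minimal sketch with an illustrative small prime (p = 257), not the paper's parameters:

```python
# Recursive Cooley-Tukey NTT: a length-n transform splits into two length-n/2
# transforms over the even- and odd-indexed entries, recombined with the
# twiddle factor w^k (the modular stand-in for e^{-i*2*pi*k/N}). Illustrative
# prime p = 257; n must be a power of two and w a primitive nth root mod p.
p = 257

def ntt_ct(x, w):
    n = len(x)
    if n == 1:
        return x[:]
    # Half-size transforms use w^2, a primitive (n/2)th root of unity.
    even = ntt_ct(x[0::2], (w * w) % p)
    odd = ntt_ct(x[1::2], (w * w) % p)
    out = [0] * n
    t = 1  # running twiddle factor w^k
    for k in range(n // 2):
        out[k] = (even[k] + t * odd[k]) % p
        out[k + n // 2] = (even[k] - t * odd[k]) % p  # uses w^(n/2) = -1
        t = (t * w) % p
    return out

# Agrees with the direct O(n^2) evaluation of the transform:
w8 = pow(3, (p - 1) // 8, p)  # an 8th primitive root of unity mod 257
x = [5, 1, 3, 2, 7, 0, 4, 6]
direct = [sum(pow(w8, i * k, p) * x[i] for i in range(8)) % p for k in range(8)]
assert ntt_ct(x, w8) == direct
```

The recursion bottoms out at length-1 blocks here; the hardware design instead stops the recursion at 12-digit blocks handled by a dedicated unit, as described in Section 5.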
4. Our approach
Our literature review has revealed that existing multiplier schemes lack some of the essential features we need for FHE schemes. A few references propose FFT-based implementations of large integer multiplication techniques. These techniques exploit the convolution theorem to gain significant speedup over other multiplication techniques. To overcome the inefficiencies experienced in [13] and the noise problem experienced in [14], we chose to develop an application-specific custom architecture based on the NTT transform.
Fig. 1. A part of the fully parallel circuit realizing the Schönhage–Strassen Algorithm.
The Schönhage–Strassen Algorithm is the fastest and most efficient option for large integer multiplications. The algorithm has an asymptotic complexity of O(N log N log log N). Given the size of our operands, the Schönhage–Strassen Algorithm is perfectly suited to fulfill our performance needs. Another advantage of the algorithm is that it lends itself to a high degree of parallelization. A high-level illustration of part of the parallel realization of the algorithm for short operands is shown in Fig. 1. In the figure, the registers shown are the same registers, drawn separately for ease of viewing. We can add more reconstruction units in parallel, up to a factor of the digit size.
In our implementation of the Schönhage–Strassen Algorithm, the number of digits is 3 × 2^15 = 98,304. With the Cooley–Tukey approach, we decompose our butterfly circuit down to blocks of 3 × 2^2 = 12 digits. This approach provides ample opportunities for parallelization. However, there are limits on how far we can go in exploiting the parallelism in a hardware realization. There are two reasons for this:
• The first reason is that the stage reconstruction computations are simple operations that take a few clock cycles, whereas there is a huge amount of data to process. To overcome this obstacle we need an architecture that can handle a higher data bandwidth with rapid data access to perform calculations in parallel. Otherwise, merely increasing the number of computational blocks while keeping the bandwidth restricted (low) will not improve the performance.
• The second reason is that the index range of the dependent digits doubles in each stage, which requires
significantly more routing on the digits. As the number of computational blocks and the stage number increase, too many collisions and overlaps occur in the routing for the VLSI design tools to handle.
Having a well designed, high capacity cache is the key to achieving a high performance large integer multiplier. Furthermore, the cache must be tailored to match the access patterns of the computational elements. In our design we chose to incorporate m = 4 computation units, so that we gain some performance while avoiding the routing problems that might otherwise occur. Even with m = 4 the bus width reaches 512 bits. In order to supply the computation blocks with sufficient data, we have chosen the cache size of the architecture as N = 3 × 2^15 = 98,304. Moreover, to enable parallel reads without impeding the bandwidth, we divided the cache into 2m sub-caches. It is possible to incorporate more computation units, but this would make the control logic harder to design and create additional routing complications.
5. Design details
5.1. Parameter selection
The parameters for the implementation are based on the parameters in [15]. We choose a 64-bit word size for fast and efficient computations, the sampling (digit) size as b = 24, and the modulus as p = 2^64 − 2^32 + 1. The reason for this particular choice of p is that it allows us to realize a modular reduction using only a few primitive arithmetic operations. A 128-bit number is denoted as z = 2^96 a + 2^64 b + 2^32 c + d. Using the selected p, we perform the z (mod p) operation as 2^32(b + c) − a − b + d.
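The reduction identity can be checked directly in a few lines (a sketch; the function name `solinas_reduce` is ours, not the paper's):

```python
# Modular reduction for p = 2^64 - 2^32 + 1 using only shifts, additions and
# subtractions. For a 128-bit z = 2^96*a + 2^64*b + 2^32*c + d (32-bit limbs),
# the identities 2^96 = -1 (mod p) and 2^64 = 2^32 - 1 (mod p) give
# z = 2^32*(b + c) - a - b + d (mod p).
P = 2**64 - 2**32 + 1
MASK32 = 2**32 - 1

def solinas_reduce(z):
    # Split the 128-bit value into four 32-bit limbs.
    d = z & MASK32
    c = (z >> 32) & MASK32
    b = (z >> 64) & MASK32
    a = (z >> 96) & MASK32
    return (((b + c) << 32) - a - b + d) % P

# Matches a full reduction, e.g. for the largest product of two elements < p:
assert solinas_reduce((P - 1) ** 2) == ((P - 1) ** 2) % P
```

In hardware the intermediate 2^32(b + c) − a − b + d fits in a handful of adders and subtractors, which is precisely why this Solinas-style prime was selected.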
For the parameter N, we need to satisfy (N/2)(2^b − 1)^2 < p to prevent digit-wise overflows in the intermediary computations of the NTT. Also, N should be large enough to cover million-bit multiplication with the smallest possible value, so that the computation time is small. The best candidate is determined as N = 3 · 2^15 = 98,304. As we use 24 bits for the digit length, 3 · 2^15 digits hold a 1,179,648-bit number for NTT-based multiplication. (Although 3 · 2^15 · 24 is double this size, each number to be multiplied has to be zero-padded to double its length, therefore we can only multiply two 1,179,648-bit numbers using 3 · 2^15 digits of 24 bits.) This is the closest we could get to one million bits without losing efficiency in the NTT transform. Therefore, this parameter achieves one-million-bit multiplication with minimum memory overhead.
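These parameter choices can be sanity-checked numerically (a quick sketch using the values above):

```python
# Sanity check of the parameter selection: N = 3 * 2^15 digits of b = 24 bits,
# p = 2^64 - 2^32 + 1. The no-overflow condition (N/2)*(2^b - 1)^2 < p must
# hold, and half the digits (the rest being zero padding) carry operand bits.
N = 3 * 2**15  # 98,304 digits
b = 24         # digit (sampling) size in bits
p = 2**64 - 2**32 + 1

assert (N // 2) * (2**b - 1) ** 2 < p  # NTT intermediate sums cannot overflow
assert N * b // 2 == 1_179_648         # maximum operand length in bits
assert N == 12 * 2**13                 # 12-digit base blocks, 13 halvings
```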
Finally, we determine the primitive Nth root of unity, i.e. w. From the equation w^N ≡ 1 (mod p), we determine w = 3511764839390700819.
In the Cooley–Tukey FFT Algorithm, each recursive halving operation is referred to as a stage and is denoted S_i, where i is the stage index. The size of the smallest NTT block is selected to be 12 entries, and it is referred to as the 0th stage, i.e. S_0. The remaining stages are reconstruction stages and require different arithmetic operations from those in S_0. By the nature of the NTT algorithm, the processing block size doubles in each reconstruction stage. Therefore, it takes 13 reconstruction stages to complete the NTT operation.

In terms of INTT operations, every stage and operation is identical to the NTT. The only difference is the selection of w′, which is computed as w′ = w^{−1} mod p.
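Since p is prime, both w′ = w^{−1} mod p and the scaling factor N^{−1} mod p used later follow from Fermat's little theorem; a short check using the paper's values:

```python
# The INTT mirrors the NTT but uses w' = w^-1 mod p; the final scaling uses
# N^-1 mod p. Since p is prime, both inverses come from Fermat's little
# theorem: x^-1 = x^(p-2) (mod p). w below is the root reported in the paper.
p = 2**64 - 2**32 + 1
N = 3 * 2**15
w = 3511764839390700819

w_prime = pow(w, p - 2, p)  # w' used by the INTT
n_inv = pow(N, p - 2, p)    # scaling factor applied in the SCALE step

assert (w * w_prime) % p == 1
assert (N * n_inv) % p == 1
```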
5.2. Architecture overview
Our architecture is composed of a data cache, a central control unit, two routing units and a function unit. The multiplication
architecture is illustrated in Fig. 2. The architecture is designedto perform a restricted set of special functions. There are fourfunctions for handling the input/output transactions and threefunctions for arithmetic operations:
Sequential load. It is used to store a million-bit number into the cache. The number is given in digits starting from the least significant digit and stored in sequence starting from the beginning of the cache.
Sequential unload. The cache releases its contents starting from the least significant to the most significant digit.
Butterfly load. In the NTT an important step is the distribution of the digits into the right indices using the butterfly operation. Starting from the least significant digit, the digits are inserted into the cache locations identified by the butterfly indices.
Scale & unload. In the last step of the NTT-based multiplication computation, as a part of the INTT operation, the digits need to be scaled by N^{−1} (mod p). After this operation, the carries are accumulated and the digits are scaled back to 24 bits. The SCALE UNIT overlaps the scaling and carry accumulation of the digits with the output, since this is the final step in million-bit multiplication.
12 × 12 NTT/INTT. The smallest NTT/INTT computation is for 12 digits. The 12 × 12 NTT/INTT UNIT takes the digits sequentially and computes the 12-digit NTT/INTT using simple shifts and addition operations.
Stage-reconstruction. This function is used for the reconstruction of the stages. According to the given input, it reconstructs an NTT or INTT stage. A reconstruction only completes one stage at a time, with the given stage index as input. In order to complete a full reconstruction, i.e. a complete evaluation of the NTT, it is invoked for all 13 stages.
Inner-multiplication. This operation is used to compute the digit-wise modular multiplications. For this we utilize the multipliers used in the STAGE-RECONSTRUCTION UNIT.
Using the functions outlined above, we can compute the product of two million-bit numbers A and B using the following sequence of operations:
1. A is loaded into the cache by using BUTTERFLY LOAD.
2. The NTT of the number A, i.e. NTT(A), is computed by calling first the 12 × 12 NTT function, and afterwards the STAGE-RECONSTRUCTION function for all stages.
3. NTT(A) is stored back to the RAM using SEQUENTIAL UNLOAD.
4. Using the same sequence of functions as above, we also compute NTT(B).
5. The cache can only hold half of the digits of NTT(A) and NTT(B). Therefore, the numbers are divided into lower and upper halves: NTT(A) = {NTT(A)_h, NTT(A)_l} and NTT(B) = {NTT(B)_h, NTT(B)_l}.
6. We use the SEQUENTIAL LOAD function to store NTT(A)_h and NTT(B)_h.
7. Then, an INNER MULTIPLICATION function is used to calculate the digit-wise modular multiplications: C_h[i] = NTT(A)_h[i] · NTT(B)_h[i] (mod p).
8. The result is stored back to the RAM by SEQUENTIAL UNLOAD.
9. We repeat the above three steps to compute the lower part: C_l[i] = NTT(A)_l[i] · NTT(B)_l[i] (mod p).
10. The result digits, i.e. C[i], are loaded into the cache by SEQUENTIAL LOAD. At this point the cache contains the multiplication result, but still in NTT form.
Fig. 2. Overview of the large integer multiplier (function unit, routing units, stage-reconstruction units, 12 × 12 NTT/INTT unit, scale unit, multiplier control unit, and cache).
11. The result is converted into integer form by using the 12 × 12 INTT function, which is followed by the complete set of STAGE-RECONSTRUCTION functions:
C′ = INTT(C[i]).
12. In the last step, the result is scaled and the carries are accumulated by the SCALE & UNLOAD function to finalize the computation of C:
C[i + 1] = C′[i + 1] + ⌊C′[i]/p⌋.
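The cache-splitting in steps 5–10, i.e. the two-pass inner multiplication forced by the half-size cache, can be modeled at a high level as follows (a sketch with plain Python lists standing in for the cache and RAM; the NTT itself is abstracted away and the function name is ours):

```python
# High-level model of steps 5-10: the cache holds only half of NTT(A) and
# NTT(B) at a time, so the digit-wise (inner) multiplication is done in two
# load/multiply/unload passes over the upper and lower halves.
def inner_multiplication(ntt_a, ntt_b, p):
    h = len(ntt_a) // 2
    # Pass 1: SEQUENTIAL LOAD the upper halves, multiply, SEQUENTIAL UNLOAD.
    upper = [(x * y) % p for x, y in zip(ntt_a[h:], ntt_b[h:])]
    # Pass 2: repeat for the lower halves.
    lower = [(x * y) % p for x, y in zip(ntt_a[:h], ntt_b[:h])]
    # RAM now holds C in NTT form, ready for the INTT and scaling steps.
    return lower + upper

assert inner_multiplication([1, 2, 3, 4], [5, 6, 7, 8], 257) == [5, 12, 21, 32]
```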
Table 1
Assignment table.

         arith0    arith1    arith2    arith3
S1–S10   sc0–sc1   sc2–sc3   sc4–sc5   sc6–sc7
S11      sc0–sc1   sc2–sc3   sc4–sc5   sc6–sc7
S12      sc0–sc2   sc1–sc3   sc4–sc6   sc5–sc7
S13      sc0–sc4   sc1–sc5   sc2–sc6   sc3–sc7
5.3. Cache system
The size of the cache is important for the timing of multiplications. In each stage reconstruction process of the NTT algorithm, we need to match the indices of odd and even digits. The index difference of the odd and even digits for a reconstruction stage is computed as S_{i,diff} = 12 · 2^{i−1}, where i is the index of the reconstruction stage, i.e. 1 ≤ i ≤ 13. Since later stages require digits from distant indices, an adequately sized cache should be chosen to reduce the number of input/output transactions between the cache and the RAM.

Let N′ denote the chosen cache size. Then, we can divide the N digits into 2^t = N/N′ blocks, i.e. N = {N_{2^t−1}, N_{2^t−2}, ..., N_0}. Once a block is given as input, we can compute the reconstruction stages until N′ < S_{i,diff} for the ith stage. Then, starting from the ith stage, N_j requires digits from N_{j+1}, where j is the block index. So, we need to divide N_j and N_{j+1} into halves and match the upper half of N_j with that of N_{j+1}, and likewise for the lower halves. This matching process adds 2N′ clock cycles for each block. The total input/output overhead then evaluates to 2N · log2(N/N′), where log2(N/N′) is the number of stages that require digit matching across blocks. In our implementation, we optimize for speed by selecting N′ = N.
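These quantities are easy to tabulate. The sketch below uses the paper's N = 98,304 together with an assumed smaller cache of N′ = N/8 purely for illustration (the actual implementation selects N′ = N), listing the per-stage index differences and counting the stages that force cross-block matching.

```python
# Per-stage odd/even index differences and the cross-block stage count.
# N is the paper's digit count; N_PRIME = N/8 is an illustrative cache
# (block) size -- the actual implementation selects N' = N.
N = 98304
N_PRIME = N // 8

def stage_diff(i):
    """Index difference between matched odd and even digits in stage i."""
    return 12 * 2 ** (i - 1)

# A stage needs digits from a neighbouring block once the matching
# distance reaches the block size.
cross_block = [i for i in range(1, 14) if N_PRIME <= stage_diff(i)]

assert stage_diff(13) == N // 2      # the final stage matches the two halves
assert len(cross_block) == 3         # = log2(N / N_PRIME) stages
```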
Although a large cache is important for our design, a straightforward cache implementation is not sufficient to support parallelism. The main arithmetic functions utilized in the multiplication process, such as 12 × 12 NTT/INTT,² STAGE-RECONSTRUCTION and INNER MULTIPLICATION, are highly suitable for parallelization. To achieve parallelization, the cache should be able to sustain the required bandwidth for multiple units. To sustain this bandwidth, we build up the cache by combining small, equal-size caches, or sub-caches as we refer to them. Combining these sub-caches at the top level, we can operate the cache either as a single-cache or as a multi-cache system. For linear functions, such as SEQUENTIAL LOAD and BUTTERFLY LOAD, the cache works as a single cache with one input/output port, whereas for parallel functions it works as a multi-cache system with multiple input/output ports. The number of sub-caches should be equal to 2·m (double the number of multipliers in the STAGE-
² In the design we choose to implement one unit, because its benefit is not worth the area cost.
RECONSTRUCTION UNIT) to eliminate read/write accesses to the same sub-cache during the reconstruction processes. Each sub-cache has a size of N/(2·m) and we denote them as {sc_0, sc_1, ..., sc_{2m−1}}.
5.4. Routing unit
The ROUTING UNITS play an important role in matching the odd and even digits to the arithmetic units. As stated previously, the index difference of the digits is 12 · 2^{i−1}. Therefore, in the last log2(2m) reconstruction stages, odd and even digits fall into different sub-caches. The assignment of sub-caches to the arithmetic units for each stage reconstruction is shown in Table 1, where the arithmetic units are referred to as arith_i, with i the index number.

Clearly, each arithmetic unit is assigned to two sub-caches in each stage. From S1 to S10, odd and even indices are matched from the same sub-cache. Once an evaluation in sc_{2i} is finished, the unit continues to process sc_{2i+1}. In the remaining stages, odd and even values reside in different sub-caches, so they are accessed by selecting sc_{i1} and sc_{i2} interchangeably. These assignments between sub-caches and STAGE RECONSTRUCTION UNITS are made by the CENTRAL CONTROL UNIT.
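The pairings of Table 1 for the last three stages follow a simple rule that can be stated compactly: in stage S_{10+k}, sub-cache sc_i is matched with sc_{i XOR 2^{k−1}}. The sketch below is an illustrative model of this routing rule (with m = 4, i.e. 8 sub-caches), not the RTL itself.

```python
# Reproduce the sub-cache pairings of Table 1 for the last three
# reconstruction stages (m = 4 arithmetic units, 2m = 8 sub-caches).
M = 4
NUM_SC = 2 * M

def pairs(stage_k):
    """Sub-cache pairs for stage S_(10+k): sc_i matched with sc_(i XOR 2^(k-1))."""
    d = 1 << (stage_k - 1)
    return [(i, i ^ d) for i in range(NUM_SC) if i < (i ^ d)]

table = {10 + k: pairs(k) for k in range(1, 4)}
assert table[11] == [(0, 1), (2, 3), (4, 5), (6, 7)]
assert table[12] == [(0, 2), (1, 3), (4, 6), (5, 7)]
assert table[13] == [(0, 4), (1, 5), (2, 6), (3, 7)]
```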
5.5. Function unit
The function unit can be divided into three parts: the SCALE UNIT, the 12 × 12 NTT/INTT UNIT and the STAGE RECONSTRUCTION UNIT.

SCALE UNIT: As mentioned earlier, after the 12 × 12 INTT operation, a final step needs to be performed to compute the product. This final operation can be written as d_i · N^{−1} + c_i (mod p) = {c_{i+1}, r_i}. For our parameter choice, N^{−1} (mod p) = 0xFFFF555455560001, a constant with a special form whose product can be computed by simple shift and add operations using only a small number of arithmetic operations. The unit takes one digit per clock cycle, starting from the least significant digit d_0. The carry c_0 is set to zero at startup. After each scaling operation, the digit is added to the carry from the previous operation, and the least significant b = 24 bits of the 64-bit result are taken as the result r_i. The remaining bits are used as the carry c_{i+1} in the next iteration.
12 × 12 NTT/INTT UNIT: This unit can be used either for computing the first stage of the NTT or for the INTT, according to the required operation. This first NTT/INTT stage is computed directly from the NTT formula:
Fig. 4. ALU of the stage reconstruction unit.
X_j = Σ_{i=0}^{n−1} (w′)^{ji} · d_i (mod p).   (1)
In the equation, j is the index of the digit to be computed, and d_i refers to the digit value with index i. The parameter w′ = w^{2^{13}} (mod p), since the coefficient is squared in each stage. The parameter n is the smallest NTT block size to be evaluated, and in our architecture it is chosen as n = 12. The computation can be performed in two ways. First, all the powers (w′)^{ij} can be pre-computed into a lookup table and reused during the multiplications. However, this is a costly operation, since we need 121 multiplications per block (coefficients with (w′)^0 are ignored). When the entire million-bit operand is considered, this takes 121 · N/n = 991,232 multiplication operations. Having m = 4 multipliers decreases the computation time by a factor of four. However, each multiplication takes two clock cycles, so the total computation takes 495,616 clock cycles.

The overhead of the first stage can be decreased significantly by exploiting the special structure of the constant coefficients. Indeed, we can realize the modular multiplications with simple shift and add operations followed by a modular reduction at the end. Furthermore, note that in the NTT w′ = 0x10000 and in the INTT w′ = 0xFFFFEFFFF00010001, and therefore only a few additions and shifts are needed. These simple operations can be squeezed into a few clock cycles and pipelined to optimize throughput. Also, the coefficients repeat after a while, since (w′)^{12} ≡ w^{3·2^{15}} ≡ 1 (mod p). This eases the construction of the NTT/INTT units, since we can reuse the same circuit in the process. Due to pipelining, all operations can be performed in N clock cycles if we neglect the initial cycles needed to fill the pipeline.
STAGE RECONSTRUCTION UNIT: This unit holds the 64-bit multipliers and is responsible for the operation of two functions: STAGE-RECONSTRUCTION and INNER MULTIPLICATION. Its architecture is more complex than those of the other units in the FUNCTION UNIT. It is composed of a control unit, a coefficient block and an arithmetic unit. The architecture is illustrated in Fig. 3.

In an INNER MULTIPLICATION function, two digits are read from the cache and fed to the Coeff and Odd input lines. Since we are reading digits from each sub-cache one at a time, we have two clock cycles to perform a 64-bit multiplication. In order to reach higher frequencies, we used two 32-bit multipliers and extended them to construct the 64-bit multipliers. Using the classical schoolbook approach, we are able to perform a 64-bit multiplication in 2 cycles. Since the cache I/O speed is the bottleneck here and we cannot read faster than once per 2 cycles, utilizing faster multiplication algorithms, such as the Karatsuba algorithm, would not have improved the performance. The INNER MULTIPLICATION function computes a 64-bit multiplication, additions and a modular reduction. We pipelined the architecture to increase the clock frequency and the throughput. The unit can output a modular multiplication product every two clock cycles after the initial startup cost of the pipeline. The whole function takes N/2 multiplications and, with m multipliers, costs N/m clock cycles (see Fig. 4).

The STAGE-RECONSTRUCTION function requires a more complex control mechanism than the other functions. In each stage we need to compute
Fig. 3. Stage reconstruction unit.
O_{i,j} = E_{i−1,j} − O_{i−1,j} · w_{i−1}^{j (mod n_{i−1})} (mod p),
E_{i,j} = E_{i−1,j} + O_{i−1,j} · w_{i−1}^{j (mod n_{i−1})} (mod p).   (2)
In the equation, i denotes the stage index from 1 to 13, j denotes the index of the digits, w_{i−1} is the coefficient of stage i−1 and n_{i−1} is the modulus used to select the appropriate power of w_{i−1}. We may compute the terms n_i iteratively as n_{i+1} = 2 · n_i with n_0 = 12. Ideally we would store all the coefficients along with the odd digits. However, this would require another large cache of size (12 + 24 + 48 + ⋯ + 49,152) ≈ N digits. The cache can be reduced to almost half this size, since each stage uses all of the coefficients from the previous stage. Therefore, the size can be reduced to 12 + (24 + 48 + ⋯ + 49,152)/2 = 49,152 = N/2. To reduce the storage requirement further, we can instead compute the coefficients efficiently from a small set of pre-stored coefficients as follows:
1. The coefficients required in two consecutive stages are as follows: S_{i+1}: w_i^0, w_i^1, ..., w_i^{n_i} and S_{i+2}: w_{i+1}^0, w_{i+1}^1, ..., w_{i+1}^{n_{i+1}}.
2. Then S_{i+2}: w_{i+1}^0, w_{i+1}^1, ..., w_{i+1}^{2n_i}, since it holds that n_{i+1} = 2 · n_i.
3. Further, S_{i+2}: w_i^{0/2}, w_i^{1/2}, ..., w_i^{2n_i/2}, since w_i = w_{i+1}^2.
4. This shows that half of the coefficients of S_{i+2} are the same as those of S_{i+1}, and the other half are the square roots of the coefficients of S_{i+1}.
5. We can compute the square roots by multiplying each w_i^j with w_{i+1}^1.
Using the properties described above, we construct the COEFFICIENT TABLE by storing two columns of coefficients. In the first column, since our smallest computation block is 12, we compute and store all w_0^j coefficients for 0 ≤ j ≤ 11. We denote these coefficients as w_{first,i}, where i denotes the index of the coefficient. In the second column, for each of the remaining stages we compute and store w_i^1. The second-column coefficients are denoted by w_{second,i}. This makes a total of 24 coefficients, from which we can compute any of the w_i^j values. When we also include the coefficients for the INTT operations, our table contains 48 coefficients. The computation of an arbitrary coefficient using the table can be achieved as
w_i^j = w_{first,l} · Π_{t=0}^{i} w_{second,t}^{e}.
The values of l and e are functions of i and j, where e is either 1 or 0. Therefore we can omit the multiplications whenever e = 0. The total number of multiplications for computing w_i^j · O_i can be calculated as follows:

1. In every reconstruction stage we start by multiplying the odd digits with the w_{first,l}'s. This step makes a total of N/2 multiplications.
2. Apart from the first reconstruction stage, in each stage we also require coefficients from w_{second,0} to w_{second,i−1}. Since we cannot store all the coefficients, in each stage we need to rebuild the previous stage's coefficients. We are
using half of the previous-stage values, so in each stage we need N/4 additional multiplications.
3. The total number of multiplications then becomes Σ_{i=0}^{12} (N/2 + i · N/4) = 26 · N.
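The text leaves the exact index functions l and e implicit; one consistent choice, sketched below with small hypothetical parameters (p = 97, n_0 = 12 and three reconstruction stages), writes j = q · 2^i + r and takes l = q, with the bits of r selecting which w_{second,t} factors enter the product.

```python
# On-the-fly coefficient generation from a small stored table.
# Hypothetical small parameters: p = 97 (so p - 1 = 96 = 12 * 2^3),
# n_0 = 12 and K = 3 reconstruction stages; the (l, e) index mapping
# below is one consistent realization, not spelled out in the text.
P = 97
K = 3   # stage i uses a root w_i of order 12 * 2^i

def order(x):
    """Multiplicative order of x mod P."""
    o, y = 1, x
    while y != 1:
        y, o = y * x % P, o + 1
    return o

# Find w_K of exact order 12 * 2^K = 96, then derive w_i = w_(i+1)^2.
w = {K: next(g for g in range(2, P) if order(g) == 12 * 2 ** K)}
for i in range(K - 1, -1, -1):
    w[i] = w[i + 1] ** 2 % P

w_first = [pow(w[0], l, P) for l in range(12)]    # first column: w_0^l
w_second = {i: w[i] for i in range(1, K + 1)}     # second column: w_i^1

def coeff(i, j):
    """Rebuild w_i^j from the stored table via j = q * 2^i + r."""
    q, r = divmod(j, 1 << i)
    acc = w_first[q]                 # w_i^(q * 2^i) = w_0^q
    for t in range(i):               # bit t of r contributes w_i^(2^t) = w_(i-t)
        if (r >> t) & 1:
            acc = acc * w_second[i - t] % P
    return acc

assert all(coeff(i, j) == pow(w[i], j, P)
           for i in range(1, K + 1) for j in range(12 * 2 ** i))
```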
The STAGE RECONSTRUCTION architecture works as follows. The CENTRAL CONTROL UNIT sends an input bit sequence {opcode, stage_level, digit_index} to the STAGE RECONSTRUCTION CONTROL UNIT. The opcode defines whether the function is an NTT, an INTT or an INNER MULTIPLICATION operation. If it is a STAGE-RECONSTRUCTION function, the unit checks the stage_level bits to determine the computing stage and the digit_index bits to determine the starting point of the computation within the N-digit range. These bits are necessary especially if there are multiple STAGE RECONSTRUCTION UNITS: each unit needs to determine its own startup coefficient, i.e. w_i^j, so they initialize different w_{first,l} and w_{second,t}. Another important initialization is the mul_only signal (for multiplication only), which is crucial for computing w_i^j · O_i. Each multiplication of an odd digit with a coefficient means that the computation of w_i^j · O_i is either completed or still needs more multiplications. Accordingly, the mul_only signal is used to enable the E_{i,j} ∓ O_{i,j} · w_i^{j (mod n_i)} calculation. If the coefficient does not need any more processing, the new odd and even values are stored in the cache. Otherwise, the even digit is not updated, and the odd digit is updated using the rule O′_{i+1,j} = O′_{i,j} · w_{second,l} and stored into the cache. Here O′_{i,j} denotes an odd digit previously updated with a coefficient. To complete the computation of partially completed coefficients, the system is fed with the even and O′ digits, and with a new coefficient. This process continues until all the values for a stage are computed.
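The recombination performed by Eq. (2) can be checked in software on a toy case: given the length-n transforms of the even- and odd-indexed digits, one stage produces the length-2n transform. The sketch below uses assumed small parameters (p = 97, combining two length-6 transforms into a length-12 one); it models only the arithmetic of the butterfly, not the unit's control flow.

```python
# One stage of Eq. (2): combine the NTTs of the even- and odd-indexed
# digits (length n) into the NTT of the full sequence (length 2n).
# Small hypothetical parameters: p = 97, 2n = 12 (12 divides p - 1 = 96).
P = 97
N2 = 12                                  # combined transform length 2n
W = next(g for g in range(2, P)          # primitive 12th root of unity
         if pow(g, N2, P) == 1 and
            all(pow(g, N2 // q, P) != 1 for q in (2, 3)))

def ntt(d, w):
    """Direct O(n^2) transform of d with root w."""
    return [sum(x * pow(w, i * j, P) for i, x in enumerate(d)) % P
            for j in range(len(d))]

def reconstruct(E, O, w):
    """Butterfly of Eq. (2): E' = E + O*w^j and O' = E - O*w^j (mod p)."""
    n = len(E)
    X = [0] * (2 * n)
    for j in range(2 * n):
        t = O[j % n] * pow(w, j, P) % P
        # For j < n this is the '+' line of Eq. (2); since w^(j+n) = -w^j,
        # the '-' line falls out of the same expression for j >= n.
        X[j] = (E[j % n] + t) % P
    return X

d = [5, 17, 42, 3, 88, 1, 60, 7, 29, 96, 11, 54]
E = ntt(d[0::2], pow(W, 2, P))           # transforms of the sub-sequences
O = ntt(d[1::2], pow(W, 2, P))
assert reconstruct(E, O, W) == ntt(d, W)
```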
5.6. Central control unit
The CENTRAL CONTROL UNIT contains a state machine to handle the functions that are given as inputs to the system, and it is mainly responsible for starting the requested function. Since the 12 × 12 NTT/INTT UNIT and the SCALE UNIT consist only of datapaths, they are controlled with start and ready signals alone. As mentioned earlier, the STAGE RECONSTRUCTION UNITS require more complex operations, so each one is managed by its own control logic and only requires a bit sequence {opcode, stage_level, digit_index} from the CENTRAL CONTROL UNIT. These bit sequences are pre-determined according to the number of STAGE RECONSTRUCTION UNITS and the cache size.

The CENTRAL CONTROL UNIT also handles the input/output addressing of the sub-caches. Sequential functions require incremental addressing for each sub-cache. In the STAGE-RECONSTRUCTION function, the addressing is computed according to the stage level and is updated with the index range of the dependent odd and even digits.
The addressing for the BUTTERFLY LOAD function requires a slightly more complex approach than incremental addressing. At each level, the butterfly operation recursively divides the given input block into even and odd indices. The N digits are divided into smaller blocks until the block size reduces to 12 digits. This operation creates 8192 blocks of 12 digits each, which form a regular pattern that can easily be constructed. In our construction, with N = 98,304, the pattern can be generated using lookup tables consisting of 35 elements.
6. Performance analysis
The latency of each functional block in clock cycles is given in Table 2. Any LOAD or UNLOAD function takes N clock cycles, where N is the number of digits. Since the architecture contains one 12 × 12 NTT/INTT function block, these functions take N clock cycles per function. For the intermediate NTT/INTT operations, each stage takes a varying number of clock cycles to complete. For the stage with index i, i.e. S_i, the total number of clock cycles is ⌈N/m⌉ + ⌈N/2m⌉ · (i−1), where m denotes the total number of multipliers. The INNER MULTIPLICATION operation takes ⌈N/m⌉ clock cycles.

In order to perform a complete multiplication, two complete
NTT operations, two INNER MULTIPLICATION operations and one INTT
operation need to be performed. A complete NTT or INTT operation can be performed as follows:
1. The number is loaded into the cache using the BUTTERFLY LOAD operation.
2. The 12 × 12 NTT/INTT operation is performed.
3. STAGE RECONSTRUCTION operations are performed for stages 1–13.
4. A SCALE & UNLOAD is used to carry out the digit-wise additions and carry propagations, and to scale the product.
The total number of clock cycles for an NTT/INTT operation is thus N cycles for each load and unload operation, N cycles for the 12 × 12 NTT/INTT operation, and ⌈N/m⌉ + ⌈N/2m⌉ · (i−1) cycles for each stage reconstruction, which sum to

Σ_{i=1}^{13} [⌈N/m⌉ + ⌈N/2m⌉ · (i−1)] = 13 ⌈N/m⌉ + 78 ⌈N/2m⌉.
We need an additional 3N cycles to load one operand, to perform the 12 × 12 NTT computation and to unload the final result. The total number of cycles to compute the NTT of one operand thus becomes 52 ⌈N/m⌉ + 3N. During the multiplication, we need to run the NTT/INTT three times. An INNER MULTIPLICATION operation takes N cycles per load and unload operation and ⌈N/m⌉ clock cycles for the intermediate multiplication operations. Note that since our cache can only hold one operand at a time, we need to perform two INNER MULTIPLICATION operations. Hence, one complete large-integer multiplication takes two NTT, two INNER MULTIPLICATION and one INTT operations, resulting in a total of 158 ⌈N/m⌉ + 13N cycles. For instance, for a design with m = 4 multiplier units we need in total 2 × 16N + 2 × (9N/4) + 16N = 52.5N clock cycles. The cycle counts for the functional blocks and for a complete large-integer multiplication are given in Table 2.
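These counts are easy to tabulate in software. The helper below reproduces the totals behind Tables 2 and 3 as well as the 5.16-million-cycle figure of Section 7; the ceilings are exact here since m divides N.

```python
# Total clock cycles for one large multiplication: 158*ceil(N/m) + 13N.
from math import ceil

def ntt_cycles(N, m):
    """One complete NTT or INTT: load, 12x12 stage, 13 reconstruction
    stages and unload: 3N + 13*ceil(N/m) + 78*ceil(N/(2m))."""
    return 3 * N + 13 * ceil(N / m) + 78 * ceil(N / (2 * m))

def mult_cycles(N, m):
    """Two NTTs, two INNER MULTIPLICATION passes and one INTT."""
    inner = 2 * (2 * N + ceil(N / m))   # load + digit-wise multiply + unload
    return 3 * ntt_cycles(N, m) + inner

N = 98304
assert ntt_cycles(N, 4) == 16 * N       # 52*(N/4) + 3N
assert mult_cycles(N, 4) == 5160960     # 52.5N, i.e. ~7.75 ms at 666 MHz
```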
6.1. Scalability of the proposed multiplier
Here we briefly study the scalability of the proposed multiplier. We would like to reach an optimal design point without unnecessarily crippling the time performance. In Table 3 we list the number of cycles and the cycle count times the number of multiplier units, in other words the time-area product, for various choices of the number of multiplier units incorporated in the large multiplier design. Clearly, increasing m decreases the time required to perform the computation. However, we see diminishing returns due to the constant term in the complexity expression 158 ⌈N/m⌉ + 13N. To obtain this small improvement, we pay significantly in area, reducing overall efficiency, as evidenced by the time-area product shown in the last column. This suggests that a better optimization strategy is to instantiate the large NTT multiplier with a small number of multiplier units, e.g. m = 4 or m = 8, and then spend more area by replicating the large multipliers to gain a linear
Table 2
Clock cycle counts of functional blocks.

NTT(A), NTT(B):  2 BUTTERFLY LOAD          2N
                 2 12 × 12 NTT             2N
                 2 STAGE-RECON             104 ⌈N/m⌉
                 2 SEQUENTIAL UNLOAD       2N
A·B:             2 SEQUENTIAL LOAD         2N
                 2 INNER MULTIPLICATION    2 ⌈N/m⌉
                 2 SEQUENTIAL UNLOAD       2N
INTT(A·B):       BUTTERFLY LOAD            N
                 12 × 12 INTT              N
                 STAGE-RECON               52 ⌈N/m⌉
                 SCALE & UNLOAD            N
Total:           158 ⌈N/m⌉ + 13N
Table 3
Clock cycle counts and time-area product for various numbers of multiplier units.

m     Total cycles    Time-area (m × cycles)
4     52.5N           210N
8     32.7N           262N
16    22.8N           366N
32    17.9N           574N
64    15.4N           990N
Table 5
FHE performance estimates for m = 4 multiplier units (M = mult., MM = modular mult.).

           Mul       Barrett red.  Dec       Enc      Recrypt
# of Ops   1 M       2 M           1 MM      90 MM    1354 MM
Ours       7.74 ms   15.4 ms       23.2 ms   2.09 s   31.4 s
GH [5]     6.67 ms   13.3 ms       20 ms     1.8 s    32 s
improvement in the overall FHE primitive execution times for alinear investment in area.
7. Implementation results
In our design we fixed the number of digits at N = 98,304. The resulting architecture is capable of computing the product of two 1,179,648-bit integers in 5.16 million clock cycles. We modeled the proposed architecture in Verilog and synthesized it with the Synopsys Design Compiler using the TSMC 90 nm cell library. The architecture reaches a maximum clock frequency of 666 MHz and can compute the product of two 1,179,648-bit integers in 7.74 ms.

While it is impossible to make a fair comparison between an application-specific architecture and other types of platforms, i.e. general-purpose microprocessors, GPUs and FPGAs, we find it useful to state the results side by side, as shown in Table 4. The NVIDIA C2050 and NVIDIA GTX 690 GPUs are high-end, expensive processors and consume significantly more area than our design; to be precise, the GPUs contain about 18.7 and 43.7 times more equivalent gates, respectively. Put another way, we are able to fit 18 or 43 of our multipliers into the same area that the GPUs occupy. Similarly, the Stratix V FPGA implementation [28] uses an expensive, high-end, DSP-oriented FPGA with significant on-chip memory; compared to the proposed design, a rough back-of-the-envelope calculation reveals a 23.7 times larger footprint. The proposed design occupies only a tiny fraction of the footprint of the other proposals; in operations such as recryption, we can surpass their timings by implementing more of our multipliers and
Table 4
Comparison of timings for FFT multipliers.

Design                           Platform            Mult. (in ms)   Area (in million equiv. gates)
Proposed FFT multiplier          90 nm TSMC          7.74            26.7
FFT multiplier/reduction [23]    NVIDIA C2050 GPU    0.765           500
FFT multiplier/reduction [42]    NVIDIA GTX 690      0.583           1166
Only FFT transform [28]          Stratix V FPGA      0.125           633
running them in parallel, or the extra multipliers can be used to run multiple recryption processes. Although our multiplier is approximately 12 times slower, additional multipliers amortize the timing by up to a factor of 3.5.

To gain more insight, we look more closely at the architecture. The main contributor to the footprint of the architecture is the 768 Kbyte cache, which requires 26.5 million gates. In contrast, the footprint of the functional blocks is only 0.2 million gates. The total area of the design comes to 26.7 million equivalent gates. With the time performance summarized in Table 5, our architecture matches the speed of the Intel Xeon software implementation [5] running the NTL software package [17], which features a very similar NTT-based multiplication algorithm.
8. Conclusion
We presented the first full evaluation of a million-bit multiplication scheme in custom hardware. For this, we introduced a novel architecture for efficiently realizing multi-million-bit multiplications. Our design implements the Schönhage–Strassen Algorithm optimized using the Cooley–Tukey FFT technique to speed up the NTT conversions. When implemented in 90 nm technology, the architecture is capable of computing a million-bit multiplication in 7.74 ms while consuming an area of only 26.7 million gates. Our architecture presents a viable alternative to previous proposals, which either lack the required performance level or cannot provide error-free multiplication for the operand lengths we are seeking to support. Finally, we presented performance estimates for the Gentry–Halevi FHE primitives, assuming the large integer multiplier is used as a coprocessor. Our estimates show that the performance of our design matches the performance of previously reported software implementations on a high-end Intel Xeon processor, while requiring only a tiny fraction of the area. Therefore, our evaluation shows that much higher levels of performance may be reached by implementing FHE in custom hardware, thereby bringing FHE closer to deployment.
Acknowledgments
Funding for this research was in part provided by the USNational Science Foundation CNS Award #1117590. Funding wasalso provided by The Scientific and Technological Research Councilof Turkey, project number 113C019.
References
[1] C.K. Koc., High-Speed RSA Implementation, TR 201, RSA Laboratories,November 1994, 73p.
[2] C.K. Koc, RSA Hardware Implementation, TR 801, RSA Laboratories, April 1996,30p.
[3] Shay Gueron, Efficient software implementations of modular exponentiation, J.Cryptogr. Eng. 2 (1) (2012) 31–43.
[4] C. Gentry, A Fully Homomorphic Encryption Scheme, Ph.D. Thesis, Departmentof Computer Science, Stanford University, 2009.
ecture for fully homomorphic encryption, Microprocess. Microsyst. (2014),
888889890891892893894895896897898899900901902903904905906907908909910911912913914915916917918919920921922923924925926927928929930931932933934935936937938939940941942943944945946947948949950951952953954955956957958959960961
962963964965966967968969970971972973974975976977978979980981982
983
985985
986987988989990991992993994
996996
99799899910001001100210031004100510061007100810091010
10121012
1013101410151016101710181019102010211022102310241025
10 Y. DorözQ1 et al. / Microprocessors and Microsystems xxx (2014) xxx–xxx
MICPRO 2151 No. of Pages 10, Model 5G
24 June 2014
Q1
[5] Craig Gentry, Shai Halevi, Implementing gentry’s fully-homomorphicencryption scheme, in: EUROCRYPT 2011, LNCS, vol. 6632, Springer, 2011,pp. 129–148.
[6] D. Cousins, K. Rohloff, C. Peikert, R. Schantz, SIPHER: Scalable Implementationof Primitives for Homomorphic Encryption-FPGA Implementation usingSimulink, HPEC 2011.
[7] The GNU Multiple Precision Arithmetic Library, GMP 5.0.5. <http://gmplib.org/>.
[8] J.W. Cooley, J.W. Tukey, An algorithm for the machine calculation of complexFourier series, Math. Comput. 19 (90) (1965) 297–301.
[9] K. Kalach, J.P. David, Hardware implementation of large number multiplicationby FFT with modular arithmetic, in: IEEE-NEWCAS Conference, 2005. The 3rdInternational, 19–22 June, 2005, pp. 267–270.
[10] Arnold Schönhage, Volker Strassen, Schnelle multiplikation grosser zahlen,Computing 7(3), vol. 7, Springer, Wien, 1971.
[11] Luis Carlos Coronado Garcìa, Can Schönhage multiplication speed up the RSAdecryption or encryption?, MoraviaCrypt (2007)
[12] J.P. David, K. Kalach, N. Tittley, Hardware complexity of modularmultiplication and exponentiation, IEEE Trans. Comput. 56 (10) (2007)1308–1319.
[13] S. Craven, C. Patterson, P. Athanas, Super-sized multiplies: how do FPGAs farein extended digit multipliers? in: Proceedings of the 7th InternationalConference on Military and Aerospace Programmable Logic Devices(MAPLD’04), Washington, DC, USA, September 2004.
[14] S. Yazaki, K. Abe, An optimum design of FFT multi-digit multiplier and its VLSIimplementation, Bull. Univ. Electro-Commun. 18 (1 and 2) (2006) 39–46.
[15] N. Emmart, C.C. Weems, High precision integer multiplication with a GPUusing Strassen’s algorithm with multiple FFT sizes, Parallel Process. Lett.(2011) 359–375.
[16] A. Karatsuba, Yu. Ofman, Multiplication of many-digital numbers by automaticcomputers, Proc. USSR Acad. Sci. 145 (1962) 293–294.
[17] V. Shoup, NTL: A Library for doing Number Theory, Library Version 5.5.2.<http://www.shoup.net/ntl/>.
[18] C. Gentry, S. Halevi, Implementing Gentry’s fully-homomorphic encryptionscheme, in: EUROCRYPT, 2011, pp. 129–148.
[19] C. Gentry, S. Halevi, N.P. Smart, Homomorphic evaluation of the AES circuit,IACR Cryptol. ePrint Arch. 2012 (2012) 99.
[20] Z. Brakerski, C. Gentry, V. Vaikuntanathan, Fully homomorphic encryptionwithout bootstrapping, Electr. Colloq. Comput. Complex. (ECCC) 18 (2011)111.
[21] N.P. Smart, F. Vercauteren, Fully homomorphic SIMD operations, IACR Cryptol.ePrint Arch. 2011 (2011) 33.
[22] J.-S. Coron, T. Lepoint, M. Tibouchi, Batch fully homomorphic encryption overthe integers, IACR Cryptol. ePrint Arch. 2013 (2013) 36.
[23] W. Wang, Y. Hu, L. Chen, X. Huang, B. Sunar, Accelerating fully homomorphicencryption using GPU, in: HPEC, 2012, pp. 1–5.
[24] D. Cousins, K. Rohloff, R. Schantz, C. Peikert, SIPHER: scalable implementationof primitives for homomorphic encryption, Internet Source, September 2011.
[25] D. Cousins, K. Rohloff, C. Peikert, R.E. Schantz, An update on SIPHER (scalableimplementation of primitives for homomorphic encRyption) -FPGAimplementation using simulink, in: HPEC, 2012, pp. 1–5.
[26] C. Moore, N. Hanley, J. McAllister, M. O’Neill, E. O’Sullivan, X. Cao, TargetingFPGA DSP slices for a large integer multiplier for integer based FHE, in:Workshop on Applied Homomorphic Cryptography, vol. 7862, 2013.
[27] Xiaolin Cao, Ciara Moore, Máire O’Neill, Elizabeth O’Sullivan, Neil Hanley, Accelerating fully homomorphic encryption over the integers with super-size hardware multiplier and modular reduction, IACR Cryptol. ePrint Arch. (2013).
[28] W. Wang, X. Huang, FPGA implementation of a large-number multiplier forfully homomorphic encryption, in: ISCAS, 2013, pp. 2589–2592.
[29] C. Gentry, S. Halevi, Fully homomorphic encryption without squashing usingdepth-3 arithmetic circuits, IACR Cryptol. ePrint Arch. 2011 (2011) 279.
[30] N.P. Smart, F. Vercauteren, Fully homomorphic encryption with relativelysmall key and ciphertext sizes, in: Public Key Cryptography, 2010, pp. 420–443.
[31] Z. Brakerski, V. Vaikuntanathan, Efficient fully homomorphic encryption from(standard)LWE, in: FOCS, 2011, pp. 97–106.
[32] C. Gentry, S. Halevi, N.P. Smart, Better bootstrapping in fully homomorphicencryption, IACR Cryptol. ePrint Arch. 2011 (2011) 680.
[33] Craig Gentry, Shai Halevi, Nigel P. Smart, Fully homomorphic encryption withpolylog overhead, IACR Cryptol. ePrint Arch. 2011 (2011) 566.
[34] O. Regev, On lattices, learning with errors, random linear codes, andcryptography, in: STOC, 2005, pp. 84–93.
[35] V. Lyubashevsky, C. Peikert, O. Regev, On ideal lattices and learning with errorsover rings, IACR Cryptol. ePrint Arch. 2012 (2012) 230.
Please cite this article in press as: Y. Doröz et al., A million-bit multiplier archithttp://dx.doi.org/10.1016/j.micpro.2014.06.003
[36] Z. Brakerski, V. Vaikuntanathan, Fully homomorphic encryption from ring-LWE and security for key dependent messages, in: CRYPTO, 2011, pp. 505–524.
[37] K. Lauter, M. Naehrig, V. Vaikuntanathan, Can homomorphic encryption bepractical? in: Cloud Computing Security Workshop, 2011, pp. 113–124.
[38] M. van Dijk, C. Gentry, S. Halevi, V. Vaikuntanathan, Fully homomorphicencryption over the integers, in: EUROCRYPT, 2010, pp. 24–43.
[39] J.-S. Coron, A. Mandal, D. Naccache, M. Tibouchi, Fully homomorphicencryption over the integers with shorter public keys, in: CRYPTO, 2011, pp.487–504.
[40] J.-S. Coron, D. Naccache, M. Tibouchi, Public key compression and modulusswitching for fully homomorphic encryption over the integers, in:EUROCRYPT, 2012, pp. 446–464.
[41] Y. Doröz, E. Öztürk, B. Sunar, Evaluating the hardware performance of amillion-bit multiplier, in: Digital System Design (DSD), 2013 16th EuromicroConference on, 2013.
[42] W. Wang, Y. Hu, L. Chen, X. Huang, B. Sunar, Exploring the feasibility of fullyhomomorphic encryption, IEEE Trans. Comput. 99 (PrePrints) (2013) 1.
[43] P.G. Comba, Exponentiation cryptosystems on the IBM PC, IBM Syst. J. 29 (4)(1990) 526–538.
[44] T. Güneysu, Utilizing hard cores of modern FPGA devices for high-performancecryptography, J. Cryptogr. Eng. 1 (1) (2011) 37–55.
Yarkın Doröz received a BSc degree in Electronics Engineering in 2009 and an MSc degree in Computer Science in 2011, both from Sabanci University. He is currently working towards a PhD degree in Electrical and Computer Engineering at Worcester Polytechnic Institute. His research focuses on developing hardware/software designs for fully homomorphic encryption schemes.
Erdinç Öztürk received his BS degree in Microelectronics from Sabanci University in 2003, his MS degree in Electrical Engineering in 2005, and his PhD degree in Electrical and Computer Engineering in 2009, the latter two from Worcester Polytechnic Institute. His research field was cryptographic hardware design, with a focus on efficient implementations of identity-based encryption schemes. After receiving his PhD degree, he worked at Intel in Massachusetts for 4 years as a hardware engineer, then joined Istanbul Commerce University as an assistant professor. He is currently the head of the Electrical-Electronic Engineering program at Istanbul Commerce University.
Berk Sunar received his BSc degree in Electrical and Electronics Engineering from Middle East Technical University in 1995 and his PhD degree in Electrical and Computer Engineering (ECE) from Oregon State University in December 1998. After briefly working as a member of the research faculty at Oregon State University's Information Security Laboratory, Sunar joined Worcester Polytechnic Institute as an assistant professor. He currently heads the Cryptography and Information Security Laboratory (CRIS). Sunar received the prestigious National Science Foundation Faculty Early Career Development (CAREER) award in 2002.