DESIGN AND ANALYSIS OF PAIRING BASED CRYPTOGRAPHIC HARDWARE FOR PRIME FIELDS

Santosh Ghosh

DESIGN AND ANALYSIS OF PAIRING BASED

CRYPTOGRAPHIC HARDWARE FOR PRIME FIELDS

Thesis submitted to the

Indian Institute of Technology Kharagpur

For award of the degree

of

Doctor of Philosophy

by

Santosh Ghosh

Under the guidance of

Professor Debdeep Mukhopadhyay

and

Professor Dipanwita Roy Chowdhury

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR

JUNE 2011

© 2011 Santosh Ghosh. All Rights Reserved

APPROVAL OF THE VIVA-VOCE BOARD

June 23, 2011

Certified that the thesis entitled DESIGN AND ANALYSIS OF PAIRING BASED CRYPTOGRAPHIC HARDWARE FOR PRIME FIELDS submitted by SANTOSH GHOSH to the Indian Institute of Technology, Kharagpur, for the award of the degree Doctor of Philosophy has been accepted by the external examiners and that the student has successfully defended the thesis in the viva-voce examination held today.

Member of DSC
Dr. Arobinda Gupta
Professor
CSE Department
IIT Kharagpur, India

Member of DSC
Dr. Shamik Sural
Associate Professor
School of Information Technology
IIT Kharagpur, India

Member of DSC
Dr. Abhijit Das
Assistant Professor
CSE Department
IIT Kharagpur, India

Supervisor
Dr. Debdeep Mukhopadhyay
Assistant Professor
CSE Department, IIT Kharagpur

Supervisor
Dr. Dipanwita Roy Chowdhury
Professor
CSE Department, IIT Kharagpur

External Examiner
Dr. Bimal K. Roy
Director
Indian Statistical Institute, Kolkata

Chairman
Dr. Jayanta Mukhopadhyay
Professor and Head
CSE Department, IIT Kharagpur

CERTIFICATE

This is to certify that the thesis entitled Design and Analysis of Pairing Based Cryptographic Hardware for Prime Fields, submitted by Santosh Ghosh to Indian Institute of Technology, Kharagpur, is a record of bona fide research work under our joint supervision and we consider it worthy of consideration for the award of the degree of Doctor of Philosophy of the Institute.

Supervisor Supervisor

Dr. Debdeep Mukhopadhyay Dr. Dipanwita Roy Chowdhury

Assistant Professor Professor

CSE, IIT Kharagpur CSE, IIT Kharagpur

Date: Date:

to Sweta

Acknowledgements

THROUGH THIS LITTLE note and limited space, I try my best to express my sincere gratitude to some of those people without whose help this thesis simply would not have come about. Foremost, I would like to thank the mysterious and divine Nature that has coursed my life across so many people and so many opportunities, and has given me the strength and health that have contributed in shaping up this thesis.

With all my heart I thank my supervisors Professor Debdeep Mukhopadhyay and Professor Dipanwita Roy Chowdhury for introducing me to this wonderful world of Cryptography and more so for making amply available their immense support, advice and encouragement in a number of ways. Along with them I thank Professor Indranil Sen Gupta for his continuous support in my research carried out since 2005, starting from my MS degree. I also thank the members of my Doctoral Scrutiny Committee, Professor Shamik Sural, Professor Avijit Das, and Professor Arabinda Gupta, for giving me timely directions.

The role of supportive and welcoming friends and colleagues in the life of a researcher is undeniable. To this end, I would like to thank all the present and former members of the Embedded Systems Laboratory and the Department of Computer Science and Engineering at IIT Kharagpur. I am grateful to the Central Library of IIT Kharagpur for offering such a vast resource of research material and making it so easily accessible. I offer my special thanks to Ghatal, Rohan, Dhiman, Chester, Subidh, Bodhi, Debi da, Joydeb da, and Bivas da for their friendship and support.

My father has always been my greatest inspiration. I owe him my heart-filled benediction for guiding me through a path of knowledge and truth, for being the strongest support in the journey towards my dreams and aspirations. I thank my mother for her unconditional consecration in making a happy and adorable home, and for keeping me in a healthy mental and physical state. I am also thankful to all of my relatives for their love and support.

Most importantly, I thank my wife Sweta. I thank you dear for your immutable patience and love, for carving a blissful end to each of my tiring days. Your emotional and moral support pulled me through this journey. I am grateful to God for having you by my side forever.

Santosh Ghosh
CSE, IIT-Kharagpur,

June 2011

DECLARATION

I certify that

a. The work contained in this thesis is original and has been done by myself under the general supervision of my supervisor.

b. The work has not been submitted to any other Institute for any degree or diploma.

c. I have followed the guidelines provided by the Institute in writing the thesis.

d. I have conformed to the norms and guidelines given in the Ethical Code of Conduct of the Institute.

e. Whenever I have used materials (data, theoretical analysis, and text) from other sources, I have given due credit to them by citing them in the text of the thesis and giving their details in the references.

f. Whenever I have quoted written materials from other sources, I have put them under quotation marks and given due credit to the sources by citing them and giving required details in the references.

Santosh Ghosh
Department of CSE,

IIT Kharagpur

Date:

Abstract

THE PRIMARY CHALLENGE in modern cryptographic hardware development lies in coping with progressively stronger physical attacks, commonly referred to as side-channel analysis. This research deals with practical implementations and the physical security analysis of pairing based cryptographic operations on prime fields. Pairing computation and elliptic curve scalar multiplication are the two major operations in pairing based cryptography. These operations in turn rely on arithmetic in finite fields, here prime fields (Fp). Hence, this work first designs a portable and compact architecture for Fp arithmetic. Subsequently, the work proposes an efficient dual-core cryptoprocessor for elliptic curve scalar multiplication based on this compact Fp core. The Field Programmable Gate Array (FPGA) is a relevant platform that provides various in-built features for optimizing arithmetic operations. A configurable core for F_{p^k} arithmetic has been developed on an FPGA device, based on an Fp primitive optimized for these features. Two such configurable cores are used to build a pairing cryptoprocessor that computes pairings over Barreto-Naehrig curves. The security of pairing computations against fault and power attacks is subsequently addressed. The work studies existing as well as new vulnerabilities of pairing computations to these attacks, and proposes suitable countermeasures to resist them.

Keywords: Pairing based cryptography, Elliptic curve cryptography, Field programmable gate array, Prime field, Side-channel attacks.


Contents

Title Page
Certificate of Approval
Certificate
Acknowledgements
Declaration
Abstract
Table of Contents
Symbols and Abbreviations

1 Introduction
1.1 Motivation and Objective
1.2 Contributions
1.3 Organization of the Thesis
1.4 Conclusion

2 Mathematical Background and Preliminaries
2.1 Finite Field Arithmetic
2.1.1 Addition and Subtraction in Fp
2.1.2 Multiplication in Fp
2.1.3 Inversion and Division in Fp
2.1.4 Montgomery Ladder for Exponentiation in Fp
2.2 Elliptic Curve Cryptography
2.2.1 Operations on Elliptic Curve
2.3 Cryptographic Pairings
2.3.1 Tate Pairing
2.4 FPGA Architecture
2.5 Side-channel Attacks
2.5.1 Timing Attacks
2.5.2 Power Consumption Attacks
2.5.2.1 Simple Power Analysis (SPA) Attacks
2.5.2.2 Differential Power Analysis (DPA) Attacks
2.6 Fault Attacks
2.6.1 Fault Induction Technique
2.7 Terminologies
2.8 Conclusion

3 Survey of Related Work
3.1 Hardware Implementation of ECSM on Prime Fields
3.2 The ECSM Against Side-channel Attacks
3.2.1 Indistinguishable Point Add and Point Double
3.2.2 Regular Point Multiplication Algorithms
3.2.3 Base Point Randomization Techniques
3.2.3.1 Point Blinding
3.2.3.2 Randomized Projective Representation
3.2.3.3 Randomized Elliptic Curve Isomorphisms
3.2.3.4 Randomized Field Isomorphisms
3.2.4 Scalar Multiplier Randomization Techniques
3.3 Implementation of Cryptographic Pairings
3.3.1 Software Library for 128-bit-secret Pairings
3.3.2 Hardware Design for 128-bit-secret Pairings
3.4 Fault and Side-channel Attacks on Pairings
3.5 Conclusion

4 Design and Analysis of Elliptic Curve Cryptoprocessor
4.1 Introduction
4.2 Motivation and Objective
4.3 Programmable GF(p) Arithmetic Unit (PGAU)
4.3.1 Motivation of PGAU
4.3.2 Proposed Programmable GF(p) Arithmetic Unit
4.3.3 Programmable Data Path Block
4.3.4 Hardware Cost and Performance
4.4 Elliptic Curve Cryptoprocessor Resistant to Timing and Power Attacks
4.4.1 Modified Montgomery Ladder Against DPA and DA
4.4.2 The ECSM on Single PGAU-core
4.4.3 The ECSM on Dual PGAU-core
4.5 Security Analysis of the Proposed Cryptoprocessor
4.5.1 Timing Attacks
4.5.2 Simple Power Analysis (SPA)
4.5.3 Differential Power Analysis (DPA)
4.5.4 Doubling Attack (DA)
4.5.5 Security of the Random Generator
4.6 ECSM Implementation Result and Comparison
4.7 Conclusion

5 Fast Prime Field Adders and Multipliers on FPGA Platform
5.1 Introduction
5.2 Fast Additions on FPGA
5.2.1 Proposed Addition Technique
5.2.2 Cost and Performance
5.3 Fast Fp Multipliers on FPGA
5.3.1 Proposed Multiplication Technique
5.3.2 Cost and Performance of Multiplier
5.3.3 Security Against Timing and Power Attacks
5.4 The PGAU and ECSM Hardware Based on Fast Adder
5.5 Conclusion

6 High Speed Flexible Pairing Cryptoprocessor
6.1 Introduction
6.2 Prior Work
6.2.1 Choice of Elliptic Curve
6.2.2 Pairing Computation
6.3 Programmable Fp-Primitive
6.3.1 Architecture Description
6.3.1.1 Computation of Fp-multiplication
6.3.1.2 Computation of Fp-addition
6.3.1.3 Computation of Fp-subtraction
6.4 A Configurable F_{p^k} Arithmetic Unit (CAU)
6.5 The Pairing Cryptoprocessor (PCP)
6.5.1 The Datapath Design
6.6 Computation of Tate Pairing on PCP
6.6.1 Computation of Doubling Step
6.6.2 Computation of Addition Step
6.6.3 Computation of Final Exponentiation
6.6.4 Cost for Computing Tate Pairing
6.7 Computation of ate Pairing on PCP
6.8 Computation of R-ate Pairing on PCP
6.9 Results
6.9.1 Comparison with Pairing Implementations
6.10 Conclusion

7 Pairing Computations Against Fault and Power Attacks
7.1 Introduction
7.2 Fault Attack on Tate Pairing [82]
7.2.1 Fault Induction Through Clock Signal
7.2.2 Analysis of Existing Countermeasures
7.2.2.1 New Point Blinding Technique [82]
7.2.2.2 Altering Traditional Point Blinding [82]
7.2.3 Proposed Countermeasure
7.2.3.1 Correctness Analysis
7.2.3.2 Security Against Fault Attack
7.3 Fault Attack on Pairing in Edwards Coordinates
7.3.1 Pairing in Edwards Coordinates
7.3.2 Attack Procedure
7.3.2.1 Practical Implication of Above Fault Attack
7.3.3 Countermeasure
7.3.3.1 Correctness Analysis
7.3.3.2 Security Against Fault Attack
7.4 Power Attacks on Pairing Computations
7.4.1 Weakness of Pairing Computations over Fp
7.4.2 Proposed DPA Attack
7.4.3 Mounting the DPA on FPGA Platform
7.4.4 Proposed DPA Resistance Pairing Computation
7.5 Conclusion

8 Conclusions and Future Directions
8.1 Conclusions
8.2 Future Directions

Bibliography
Dissemination
Bio-data

Symbols and Abbreviations

Symbols:

O Point at Infinity

#E(Fq) Number of Points on an Elliptic Curve E(Fq)

E(Fq) Elliptic Curve Defined over a Finite Field Fq

Fq Finite Field with Order q

Fp Prime Field with a Large Prime Characteristic p

F_{2^m} Binary Field with Extension Degree m

F_{3^n} Characteristic-3 Field with Extension Degree n

Abbreviations:

AES Advanced Encryption Standard

ASIC Application Specific Integrated Circuit

ASIP Application Specific Instruction-set Processor

ASM Addition, Subtraction and Multiplication in Fp

BN Barreto Naehrig

BRAM Block RAM

CAU Configurable F_{p^k} Arithmetic Unit

CFP Configurable F_{p^k} Primitive

CLB Configurable Logic Block


CMOS Complementary Metal Oxide Semiconductor

DAU Data Access Unit

DES Data Encryption Standard

DLP Discrete Logarithm Problem

ECA Elliptic Curve Point Addition

ECC Elliptic Curve Cryptography

ECD Elliptic Curve Point Doubling

ECDLP Elliptic Curve Discrete Logarithm Problem

ECDSA Elliptic Curve Digital Signature Algorithm

ECMQV Elliptic Curve Menezes-Qu-Vanstone

ECSM Elliptic Curve Scalar Multiplication

EEA Extended Euclidean Algorithm

FF Flip Flop

FPGA Field Programmable Gate Array

GF Galois Field or Finite Field

IFD Instruction Fetch and Decode

LSB Least Significant Bit

LUT Look Up Table

MSB Most Significant Bit

NIST National Institute of Standards and Technology

PA Point Addition

PCP Pairing Cryptoprocessor

PD Point Doubling

PGAU Programmable GF(p) Arithmetic Unit

RSA Rivest, Shamir and Adleman


Chapter 1

Introduction

BILINEAR PAIRING is a candidate for one-way functions defined on elliptic or hyperelliptic curve groups. Pairing based cryptography is suitable for securing identity-aware and ubiquitous computing devices. The major operations in pairing based cryptography are pairing computation and elliptic curve scalar multiplication. This research focuses on designing efficient hardware architectures for these operations on the FPGA platform. Implementations of the respective algorithms may leak secret information during execution through concealed channels such as power consumption, timing, and faults. Attacks that exploit such concealed channels are known as side-channel attacks. This research also focuses on analyzing elliptic curve and pairing implementations against side-channel attacks and on developing countermeasures.

1.1 Motivation and Objective

In recent times, pairing based cryptography has gained considerable importance. As a natural consequence, its hardware implementation is extremely important. The implementations must be cost-effective in terms of both time and space requirements. This thesis explores several hardware design techniques employed in pairing based cryptography. Two complex operations often used in pairing based cryptographic schemes are elliptic curve scalar multiplication (ECSM) and pairing computation. The field programmable gate array (FPGA) is one of the suitable platforms for developing hardware that accelerates cryptographic operations. It is therefore prudent to look into architectural design techniques on the FPGA platform that improve the efficiency of ECSM and pairing computation.

Finite field arithmetic is the most important primitive of ECSM and pairing computation. Pairing based cryptography requires all the underlying finite field operations: addition, subtraction, multiplication, inversion, and division. In order to obtain an efficient design, the present work first focuses on introducing hardware sharing among the finite field operations. Modern FPGAs provide in-built features that may help in realizing optimized circuits. Thus, the work also investigates FPGA features to accelerate the finite field primitives. Subsequently, the work focuses on exploiting the scope for parallelism in the finite field algorithms. It further explores the scope for parallelism in the computation of ECSM and pairing using multiple cores of the underlying primitives.

On the other hand, side-channel and fault attacks are major threats to the implementation of any cryptographic algorithm. The present thesis explores not only these vulnerabilities of pairing schemes but also techniques to counteract them. Finally, the effect of these techniques on the entire design and the final robustness of the design are evaluated.

In this thesis, two broad aspects of hardware for pairing based cryptography, namely efficient implementation and security against side-channel attacks, have been studied separately. One of the main objectives of this thesis is to reduce the computation time of the major operations of a pairing based scheme. This reduction in computation time is brought about by targeting the following aspects of FPGA implementation:

• A hardware sharing technique has been explored to develop an optimized programmable architecture for prime field arithmetic.

• The in-built carry chains of an FPGA device have been exploited to develop a high-speed adder circuit.

• A modified interleaved multiplication technique has been proposed to reduce the critical path of a prime field multiplier.

• Multiple functional cores have been incorporated into the proposed cryptoprocessors to exploit the parallelism of ECSM and pairing computations.

Another objective of this thesis is to secure the proposed designs against side-channel attacks. To that end, the following techniques have been proposed:

• The proposed finite field primitives help to make the cryptoprocessors resistant to simple side-channel attacks.

• A new point blinding technique has been proposed which protects the secret parameter of the ECSM operation against simple power analysis (SPA), differential power analysis (DPA), and the doubling attack (DA).

• A new countermeasure has been proposed to defend pairing computations against fault attacks.

• The line function of pairings has been modified to defend against differential power attacks.

1.2 Contributions

The contributions of the thesis are summarized below:

• Design and Analysis of Elliptic Curve Cryptoprocessor. We present an elliptic curve cryptoprocessor that exploits the concept of shared arithmetic hardware, and explore its security against timing and power attacks. The contribution of this work is threefold.

1. PGAU core: We propose a Programmable GF(p) Arithmetic Unit (PGAU) that performs GF(p) addition, subtraction, multiplication, inversion, and division. The modular operations are performed directly in the 2’s complement number system. The PGAU requires 18% less area than an integrated design in which each arithmetic unit is a state-of-the-art stand-alone implementation. The PGAU occupies only 0.96 times the slice area of the existing design [106] while achieving a 2.67 times speedup over it.

2. Elliptic curve cryptoprocessor: We observe that the saving in area can be exploited by using multiple copies of the PGAU to accelerate elliptic curve scalar multiplication. Thus, we speed up the ECSM operation by using two PGAU cores. The proposed design is implemented on a Xilinx Virtex-II Pro FPGA device. The experimental results show that the proposed elliptic curve cryptoprocessor computes a 192-bit ECSM operation in 4.47 ms. The whole design requires 8972 CLB slices and runs at a 43 MHz clock on a Virtex-II Pro FPGA. The same design can run at a 61 MHz clock on a Virtex-IV FPGA platform.

3. Side-channel attacks: The PGAU is designed in such a way that it does not expose any timing or power attack vulnerabilities during the execution of finite field operations. A new point blinding technique is proposed and applied to the SPA-resistant Montgomery ladder for ECSM computation. The analysis shows that the proposed cryptoprocessor is secure against differential and non-differential timing and power attacks. To demonstrate its security against differential power analysis (DPA), we first show an actual DPA result on an FPGA implementation without any DPA resistance scheme. This result confirms that DPA is indeed capable of recovering the secret scalar multiplier. The same analysis has been performed on our proposed implementation. Even with ten times more power traces, we could not find any significant DPA peak from which to guess the secret bits. This result indicates that the proposed design is capable of protecting the secret against DPA. The proposed design is also capable of providing security against the doubling attack.
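The regular structure of the Montgomery ladder mentioned in item 3 can be illustrated with a minimal software sketch. It is not the blinded ladder proposed in this thesis: it shows only the plain ladder, in which every scalar bit triggers exactly one point addition and one point doubling. The toy curve y^2 = x^3 + 2x + 3 over F_97, the base point (3, 6), and all function names are assumptions made purely for illustration, and a Python branch on the key bit is of course not a side-channel-safe implementation.

```python
# Toy parameters (illustrative assumptions, far too small for real use):
# curve y^2 = x^3 + 2x + 3 over F_97 with base point G = (3, 6).
P_MOD, A, B = 97, 2, 3
G = (3, 6)

def is_on_curve(pt):
    """True for the point at infinity (None) or an affine point on the curve."""
    if pt is None:
        return True
    x, y = pt
    return (y * y - (x * x * x + A * x + B)) % P_MOD == 0

def ec_add(p1, p2):
    """Affine group law; handles infinity, inverse points and doubling."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None  # p2 = -p1, so the sum is the point at infinity
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

def montgomery_ladder(k, pt):
    """Compute k*pt; each bit costs one addition and one doubling."""
    r0, r1 = None, pt  # invariant: r1 - r0 == pt
    for bit in bin(k)[2:]:
        if bit == "1":
            r0, r1 = ec_add(r0, r1), ec_add(r1, r1)
        else:
            r0, r1 = ec_add(r0, r0), ec_add(r0, r1)
    return r0

def double_and_add(k, pt):
    """Reference method; the addition happens only for 1-bits."""
    acc = None
    for bit in bin(k)[2:]:
        acc = ec_add(acc, acc)
        if bit == "1":
            acc = ec_add(acc, pt)
    return acc
```

Both routines return the same point for every scalar, but only the ladder performs the same sequence of operation types regardless of the key bits, which is the property that simple power analysis cannot exploit. The modular inverse via `pow(x, -1, m)` requires Python 3.8 or later.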

• Fast Prime Field Adders and Multipliers on FPGA Platform. Finite field addition and multiplication are the most important operations in cryptography. Efficient techniques for these operations greatly affect the overall performance of a cryptoprocessor. We explore the in-built features of an FPGA device to develop high-speed prime field (Fp) primitives. The contributions of this work are briefly described here.

1. Fast carry chain (FCC): Modern FPGAs provide special carry logic for addition. The carry chains formed by the in-built carry logic are 32 bits long. Through experimental results, this chapter shows that a carry propagation adder (CPA) based on the in-built carry logic provides the minimum latency for a 32-bit addition compared to all other known addition techniques. Experimental results show that the latency of this CPA is only 6.6 ns, whereas that of the carry lookahead adder is 9.2 ns, on a Virtex-II Pro FPGA.

2. High-speed adder: Subsequently, we propose a hierarchical adder structure for large operands using the above fast carry chains (FCCs). The large operands are decomposed hierarchically down to 32-bit lengths based on the Karatsuba technique. The experimental results show that the proposed technique significantly reduces the routing delay as well as the logic delay compared to existing techniques. For comparison, we implement some existing addition techniques for 256-bit and 512-bit operands using 32-bit FCCs. They are thus designed as their respective 8-bit and 16-bit structures in which each single-bit full adder is replaced by a 32-bit FCC. The proposed 256-bit adder provides a 35% speedup over the best known carry lookahead technique on an FPGA platform.

3. Fp-multiplier: A modification of the interleaved multiplication algorithm is proposed to improve the scope for parallelism. The modified algorithm exploits the Montgomery ladder, in which the doubling and the addition within an iteration are independent of each other. Moreover, both operations are computed at every iteration, which provides balanced execution and security against non-differential side-channel attacks.

The work further proposes a parallel iterative architecture based on the modified multiplication algorithm and the high-speed adders. It exploits parallelism at two levels: the addition level and the algorithmic level. Extensive experimental results are furnished to show its performance improvement of 70% over the existing design and its security against non-differential timing and power attacks.

4. Speedup of ECC cryptoprocessor: It is then essential to validate the proposed techniques on elliptic curve and pairing computations. For elliptic curve computation, we redesign the PGAU and the ECSM cryptoprocessor, replacing the old adder circuits with the proposed high-speed adders. The experimental results show that the modified designs achieve a 30% speedup over the old designs. The same Fp-primitives are used to develop the pairing cryptoprocessor, which is described later.
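The idea behind the Fp-multiplier in item 3 can be sketched in software as follows. The first routine is the classical interleaved (double-and-add) modular multiplication, in which the addition depends on the result of the doubling. The second is a ladder-style reformulation in which the addition and the doubling of one iteration read only the values of the previous iteration, so the two could be evaluated by separate adders in parallel. This is one plausible reading of the description above, given as an assumption; the exact algorithm and architecture are those of Chapter 5, and the function names are hypothetical.

```python
def mod_add(a, b, p):
    """Add two residues in [0, p) with a single conditional subtraction."""
    s = a + b
    return s - p if s >= p else s

def interleaved_mul(x, y, p):
    """Classical interleaved multiplication: double, then add if the bit is 1."""
    y %= p
    acc = 0
    for bit in bin(x)[2:]:
        acc = mod_add(acc, acc, p)      # doubling
        if bit == "1":
            acc = mod_add(acc, y, p)    # addition waits for the doubling
    return acc

def ladder_mul(x, y, p):
    """Ladder-style multiplication keeping the invariant r1 == r0 + y (mod p)."""
    r0, r1 = 0, y % p
    for bit in bin(x)[2:]:
        # Both results depend only on the old (r0, r1), so hardware could
        # compute them at the same time; both are computed in every iteration.
        total = mod_add(r0, r1, p)
        if bit == "1":
            doubled = mod_add(r1, r1, p)
            r0, r1 = total, doubled
        else:
            doubled = mod_add(r0, r0, p)
            r0, r1 = doubled, total
    return r0
```

For example, `ladder_mul(5, 7, 97)` and `interleaved_mul(5, 7, 97)` both return 35. The ladder form spends one extra addition per zero bit compared with the classical method, in exchange for an identical operation pattern in every iteration.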

• High Speed Flexible Pairing Cryptoprocessor. In this work we propose a cryptoprocessor for the computation of pairings over Barreto-Naehrig curves (BN curves). The proposed pairing cryptoprocessor (PCP) supports arbitrary curve parameters, including the prime p. It supports all primes within the given bit length (256 bits). We develop parallel configurable hardware for computing addition, subtraction, and multiplication in Fp and F_{p^2} using the high-speed Fp-primitives described previously. Existing techniques to speed up arithmetic in extension fields [61] are used on top of it for fast computation in F_{p^6} and F_{p^12}. The major contributions of this work are highlighted here.

1. CFP design: The chapter introduces a configurable F_{p^k}-primitive (CFP) based on the high-speed Fp-primitives described previously. The CFP has inherent configurability to perform arithmetic in Fp and F_{p^2} for any p within the given bit length. Existing techniques to speed up arithmetic in extension fields [61] are implemented on top of it for fast computation in F_{p^6} and F_{p^12}.

2. Pairing cryptoprocessor: A pairing cryptoprocessor is designed with two CFP cores. The advantages of the dual core are utilized by developing a parallel schedule of the underlying Fp-operations for pairing computation. The proposed cryptoprocessor also provides flexibility in the curve parameters. Experimental results show a significant improvement in clock cycle counts for pairing computations compared to the similar design reported in [17]. As a result, the speed of the proposed cryptoprocessor on an FPGA platform is comparable with that of the existing CMOS design.

The proposed configurable F_{p^k} arithmetic cores and parallel computation result in a significant improvement in the performance of Tate, ate, and R-ate pairings over BN curves. The result is demonstrated for a 256-bit BN curve, which provides 128-bit security.
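To make the layering of F_{p^2} arithmetic on top of Fp operations concrete, the following minimal sketch builds F_{p^2} addition, subtraction, and multiplication from Fp additions, subtractions, and multiplications only. It assumes the representation F_{p^2} = Fp[u]/(u^2 + 1), which is valid when p is congruent to 3 modulo 4, and uses the toy prime 103; both choices, the Karatsuba-style three-multiplication formula, and the function names are illustrative assumptions rather than the parameters or schedule of the proposed CFP.

```python
# Toy prime (assumption): 103 % 4 == 3, so -1 is a non-residue and
# F_{p^2} can be represented as Fp[u] / (u^2 + 1).
P = 103

def fp2_add(a, b):
    """(a0 + a1*u) + (b0 + b1*u), component-wise in Fp."""
    return ((a[0] + b[0]) % P, (a[1] + b[1]) % P)

def fp2_sub(a, b):
    """(a0 + a1*u) - (b0 + b1*u), component-wise in Fp."""
    return ((a[0] - b[0]) % P, (a[1] - b[1]) % P)

def fp2_mul(a, b):
    """Product in F_{p^2} using three Fp multiplications instead of four."""
    a0, a1 = a
    b0, b1 = b
    t0 = (a0 * b0) % P                   # independent of t1 and t2
    t1 = (a1 * b1) % P                   # independent of t0 and t2
    t2 = ((a0 + a1) * (b0 + b1)) % P     # independent of t0 and t1
    real = (t0 - t1) % P                 # uses u^2 = -1
    imag = (t2 - t0 - t1) % P            # equals a0*b1 + a1*b0
    return (real, imag)
```

The three products t0, t1, and t2 do not depend on one another, so they are natural candidates for scheduling on parallel Fp multipliers. As a quick check, `fp2_mul((0, 1), (0, 1))` returns `(102, 0)`, that is, u^2 = -1 in F_103.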


• Pairing Computations Against Fault and Power Attacks. This work deals with fault and side-channel attacks on pairing computations, which is another objective of this thesis. The contributions of this chapter are summarized here.

1. Fault attack on pairing: The chapter analyzes the existing fault attacks on pairing computations and the countermeasures described in [82]. The attack assumes that the respective fault is injected into a specific register inside the pairing cryptoprocessor. With experimental results, this chapter demonstrates a technique for injecting a fault into a register by tuning the clock frequency. The chapter identifies the limitations of the existing countermeasures. To overcome these limitations, we propose a new countermeasure to defend against fault attacks on pairing computations.

2. Fault attack on Miller’s algorithm: A new representation of the addition law on elliptic curves was introduced by Edwards [65] in 2007, which provides efficient elliptic curve group operations [64]. Pairing computation in Edwards coordinates is proposed in [47]. This chapter analyzes the security of the pairing computation proposed in [47] and shows that pairing computations over BN curves and in Edwards coordinates [47] are vulnerable to a new fault attack. A suitable technique is also proposed to counteract this attack.

3. DPA on pairing: The side-channel attack based on power analysis on pairing

computation is another objective of the present work. We propose an attacking

technique based on differential power analysis on pairing computations over Fp.

Through experimental results we show how the proposed attack actually works

on an FPGA platform. A suitable technique is also proposed to counteract

against such power attack.

1.3 Organization of the Thesis

The rest of the thesis is structured as follows:

Chapter 2 gives a brief overview of finite field operations together with the related techniques and algorithms. It also covers the basic ideas of elliptic curve and pairing based cryptography. Background on side-channel and fault attacks is also provided in this chapter.

Chapter 3 reviews related work to present the state of the art in connection with the thesis.

Chapter 4 presents an elliptic curve cryptoprocessor that exploits the concept of shared arithmetic hardware and explores its security against timing and power attacks.

Chapter 5 explores the in-built features of an FPGA device to develop high-speed prime field primitives. The multiplication algorithm is modified to improve the scope for parallelism, and a high-speed $\mathbb{F}_p$ multiplier for 2’s complement numbers is proposed.

Chapter 6 first designs a configurable architecture for computing arithmetic in $\mathbb{F}_{p^k}$. It then proposes a cryptoprocessor for computing asymmetric pairings over BN curves that provide 128-bit security.

Chapter 7 deals with the security of pairing computations against fault and power attacks. The practical technique of fault induction is shown through experimental results, and actual DPA attacks on a pairing computation are described. Suitable countermeasures are also proposed in this chapter.

Chapter 8 concludes the thesis and discusses some possible directions of future work.

1.4 Conclusion

This chapter has given an overview of the whole work. The motivation behind this

research, objectives and scopes are described. In the next chapter we provide a background

of the works described in this thesis.


Chapter 2

Mathematical Background and

Preliminaries

This chapter presents the background related to this thesis. It starts with a brief description of finite field arithmetic, whose operations underlie elliptic curve and pairing computations. The chapter then discusses the basic concepts of elliptic curves and pairings to lay the foundation for the content of this thesis.

The steep growth in processing speed causes the key sizes of cryptographic schemes to increase by almost 25% every decade [7]. As per the recommendation of the National Institute of Standards and Technology (NIST), the security requirement up to 2010 is 80 bits, i.e., a computational complexity of $2^{80}$; it is 112 bits up to 2030, and 128 bits beyond 2030. Along with these security requirements, separate and independent public key techniques have evolved: RSA, elliptic curves, pairings, etc. Among these techniques, the key sizes of the latter two are smaller than those of the first. In this chapter, we briefly describe some of the methodologies, challenges, and solutions related to the latter two public key techniques.

The practical implementation of the above public key techniques and their security analysis against side-channel attacks are the major objectives of the thesis. The field programmable gate array (FPGA) is a relevant platform for developing application-specific hardware. FPGA devices provide different in-built features for basic arithmetic modules, which can be utilized for developing high-speed architectures for an application. Side-channel attacks are major threats to implementations of cryptographic algorithms. They can break a secure algorithm with very little effort by exploiting unwanted leakage during execution. The basic concepts of the FPGA platform and side-channel attacks are also described in this chapter.

2.1 Finite Field Arithmetic

A finite field, or Galois field (GF), is a field with a finite number of elements; its characteristic is always a prime. The two smallest finite fields have characteristic 2 and 3 and are known as $\mathbb{F}_2$ (or GF(2)) and $\mathbb{F}_3$ (or GF(3)), respectively. We denote a finite field with a large prime characteristic $p$ by $\mathbb{F}_p$ or GF($p$). Most of the work described in this thesis is based on $\mathbb{F}_p$. Therefore, this section describes the arithmetic operations in $\mathbb{F}_p$. Multiplication, inversion, division, and exponentiation in this field can be computed by different techniques [117]. However, this chapter discusses only the techniques that are used later to describe the proposed work.

2.1.1 Addition and Subtraction in Fp

The operation $(a+b)$ in $\mathbb{F}_p$ adds the two operands $a$ and $b$ and subtracts the modulus $p$ from the sum if $(a+b) \geq p$. The comparison can, however, be avoided in the following way. First, we compute $c = a+b$ and then $d = c-p$. The final result is either $c$ or $d$, selected by the carry-out values of these two operations.

The doubling operation $(2a) \bmod p$ is a special case of $(a+b) \bmod p$. It is computed in the same way as $\mathbb{F}_p$ addition, with the first addition for $(a+b)$ replaced by a left shift that computes $2a$.

To perform $\mathbb{F}_p$ subtraction, input $b$ is bitwise inverted and added to input $a$ with a carry-in of 1. If the result is negative (i.e., the carry-out is low), the modulus is added to produce an output in $\mathbb{F}_p$. The correct result is selected by the carry-out bit of the first adder. The respective architectures for $\mathbb{F}_p$ addition, doubling, and subtraction are described in [53, 106].
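The carry-based selection described above can be sketched in Python as follows. This is only a software model under stated assumptions, not the architecture of [53, 106]: the function names and the explicit word width k are introduced here for illustration, and the operands are assumed to satisfy 0 <= a, b < p < 2^k.

```python
def _select(c_full, p, k):
    """Reduce c_full (0 <= c_full < 2p) once, choosing c or d = c - p by carry bits."""
    mask = (1 << k) - 1
    carry_c = c_full >> k                         # carry-out of the first operation
    d_full = (c_full & mask) + (~p & mask) + 1    # d = c - p as a k-bit two's complement add
    carry_d = d_full >> k                         # carry-out of the second operation
    # c >= p exactly when one of the two carry-outs is set
    return (d_full & mask) if (carry_c | carry_d) else (c_full & mask)


def fp_add(a, b, p, k):
    """(a + b) mod p without an explicit magnitude comparison."""
    return _select(a + b, p, k)


def fp_double(a, p, k):
    """(2a) mod p: the first addition is replaced by a left shift."""
    return _select(a << 1, p, k)


def fp_sub(a, b, p, k):
    """(a - b) mod p: add the bitwise inverse of b with carry-in 1."""
    mask = (1 << k) - 1
    s_full = a + (~b & mask) + 1      # a - b in two's complement
    carry = s_full >> k               # high: a >= b, low: result is negative
    s = s_full & mask
    return s if carry else (s + p) & mask
```

For example, with the hypothetical toy parameters p = 13 and k = 4, `fp_add(9, 8, 13, 4)` selects d = 17 - 13 because the first addition overflows four bits.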


2.1.2 Multiplication in Fp

One interesting procedure for performing $\mathbb{F}_p$ multiplication is the interleaved multiplication algorithm [179, 180], shown in Algo. 2.1. Its main advantage is that it does not require any final division: at every iteration the intermediate result is reduced, so that it remains below the modulus.

Algorithm 2.1: Interleaved multiplication in $\mathbb{F}_p$, IntMult($a, b, p$).

Input: $a, b, p$, where $b = \sum_{i=0}^{k-1} 2^i b_i$.
Output: $(a \cdot b) \bmod p$.
  $s \leftarrow 0$.
  for $i$ from $k-1$ downto 0 do
    $s \leftarrow (2s) \bmod p$.
    if $b_i = 1$ then
      $s \leftarrow (s + a) \bmod p$.
    end
  end
  return $s$.
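A minimal Python sketch of Algo. 2.1 is given below to make the interleaving of shift, add, and reduce steps concrete. The function name `int_mult` and the use of a single conditional subtraction for each reduction are choices made for this illustration; the operands are assumed to satisfy 0 <= a, b < p.

```python
def int_mult(a, b, p):
    """Return (a * b) mod p by interleaving shift-add steps with reduction.

    Assumes 0 <= a, b < p, so every intermediate value stays below 2p.
    """
    k = p.bit_length()              # number of multiplier bits scanned
    s = 0
    for i in range(k - 1, -1, -1):  # from the most significant bit down to bit 0
        s <<= 1                     # s <- 2s
        if s >= p:                  # one subtraction suffices because s < p before doubling
            s -= p
        if (b >> i) & 1:
            s += a                  # s <- s + a
            if s >= p:
                s -= p
    return s
```

Because the running value is reduced in every iteration, no intermediate result ever exceeds 2p and no final division is needed.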

The main difficulty of the algorithm is the addition of large operands. The carry chain increases the latency of the addition linearly with the operand size. Thus, a carry propagation adder is inefficient for building a multiplier based on the interleaved multiplication procedure. Some modifications that use carry save adders (CSA) in the interleaved multiplication algorithm are reported in [94, 95, 121]. The pre-computations required by these modifications depend on the multiplicand $a$. They pay off in applications where repeated multiplications are performed with a fixed multiplicand and a varying multiplier. In our applications, such as elliptic curve or pairing computations, however, the finite field multiplications are normally performed on different operands. Thus, the pre-computation cost is added directly to the cost of every multiplication in elliptic curve and pairing based cryptographic schemes. A modification of the above algorithm is proposed in chapter 5 to improve the efficiency of $\mathbb{F}_p$ multiplication.

2.1.3 Inversion and Division in Fp

The modular multiplicative inverse $a^{-1} \bmod p$ of an integer $a$ exists if and only if $a$ and $p$ are relatively prime, that is, $\gcd(a, p) = 1$. Two methods for inversion are often used: Fermat’s Little Theorem and a variant of the Extended Euclidean Algorithm (EEA). The first computes the inverse by exponentiation. Many variants of these algorithms are reported in the literature; most of them are listed and discussed in [117]. One efficient variant of the EEA for $\mathbb{F}_p$ inversion is based on the binary method and is known as the Binary Inversion Algorithm, shown in Algo. 2.2. The algorithm proceeds iteratively towards the result. At every iteration either $u$ or $v$ is reduced by at least one bit, which ensures that the total number of iterations is at most $2k$, where $k$ is the maximum bit length of $p$ and $a$. In [106], the authors outline a modular division operation that uses a modular inversion followed by a modular multiplication. The binary modular inversion algorithm (Algo. 2.2) can easily be modified to perform the modular division $b/a = ba^{-1}$ directly. To obtain $(b/a) \bmod p$ with this algorithm, the variable $x_1$ is initialized in step 1 with $b$ instead of 1. We follow this algorithm for performing GF($p$) inversion and division in the elliptic curve hardware described in chapter 4.
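The binary inversion of Algo. 2.2 and its division variant can be sketched in Python as follows. The function name and the optional argument `b` (the initial value of x1) are assumptions made for this illustration; p is assumed to be an odd prime and 0 < a < p.

```python
def binary_inverse(a, p, b=1):
    """Return (b / a) mod p with the binary inversion algorithm.

    With the default b = 1 this is the plain inverse of a.
    Assumes p is an odd prime and 0 < a < p.
    """
    u, v = a, p
    x1, x2 = b % p, 0
    # Invariants: a*x1 = u*b (mod p) and a*x2 = v*b (mod p)
    while u != 1 and v != 1:
        while u % 2 == 0:
            u //= 2
            # halve x1 modulo p; adding the odd p makes an odd x1 even first
            x1 = x1 // 2 if x1 % 2 == 0 else (x1 + p) // 2
        while v % 2 == 0:
            v //= 2
            x2 = x2 // 2 if x2 % 2 == 0 else (x2 + p) // 2
        if u >= v:
            u, x1 = u - v, x1 - x2
        else:
            v, x2 = v - u, x2 - x1
    return x1 % p if u == 1 else x2 % p
```

Calling `binary_inverse(a, p)` yields the inverse, while `binary_inverse(a, p, b)` yields the quotient b/a without a separate multiplication.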

2.1.4 Montgomery Ladder for Exponentiation in Fp

The operation $a^b \bmod p$ in $\mathbb{F}_p$ can be performed by the binary square-and-multiply algorithm [117] with complexity $O(k^3)$, where $k$ represents the bit length of $p$. It has two variations: right-to-left and left-to-right. In this iterative procedure, a squaring is performed at every iteration, whereas a multiplication is performed only if the respective exponent bit $b_i = 1$. The main drawback of this procedure is that its computation is unbalanced, depending on the bit values of the exponent, which makes it vulnerable to simple side-channel attacks. To overcome this vulnerability, the Montgomery powering ladder is sometimes used [125]. The respective algorithm is shown in Algo. 2.3, where the internal squarings and multiplications are performed in $\mathbb{F}_p$.
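The balanced structure of the Montgomery powering ladder (Algo. 2.3) can be illustrated with the following Python sketch. The function name is an assumption for this illustration, and Python's built-in integer arithmetic stands in for the field multiplier; the exponent is assumed to satisfy b >= 1.

```python
def montgomery_ladder_pow(a, b, p):
    """Return a^b mod p for b >= 1 with the Montgomery powering ladder.

    Every iteration performs exactly one multiplication and one squaring,
    whatever the value of the exponent bit.
    """
    q1, q2 = a % p, (a * a) % p     # the most significant bit of b is assumed to be 1
    for i in range(b.bit_length() - 2, -1, -1):
        if (b >> i) & 1:
            q1, q2 = (q1 * q2) % p, (q2 * q2) % p
        else:
            q1, q2 = (q1 * q1) % p, (q1 * q2) % p
    return q1
```

Throughout the loop the pair satisfies q2 = q1 * a (mod p), which is why the two branches differ only in which register is squared.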

Algorithm 2.2: Binary Inversion in GF($p$).

Input: $a \in \mathbb{F}_p$.
Output: $a^{-1} \bmod p$.
  $u \leftarrow a$, $v \leftarrow p$, $x_1 \leftarrow 1$, $x_2 \leftarrow 0$.
  while $u \neq 1$ and $v \neq 1$ do
    while $u$ is even do
      $u \leftarrow u/2$.
      if $x_1$ is even then
        $x_1 \leftarrow x_1/2$.
      else
        $x_1 \leftarrow (x_1 + p)/2$.
      end
    end
    while $v$ is even do
      $v \leftarrow v/2$.
      if $x_2$ is even then
        $x_2 \leftarrow x_2/2$.
      else
        $x_2 \leftarrow (x_2 + p)/2$.
      end
    end
    if $u \geq v$ then
      $u \leftarrow u - v$, $x_1 \leftarrow x_1 - x_2$.
    else
      $v \leftarrow v - u$, $x_2 \leftarrow x_2 - x_1$.
    end
  end
  if $u = 1$ then
    return $x_1 \bmod p$.
  else
    return $x_2 \bmod p$.
  end

Algorithm 2.3: The Montgomery ladder for exponentiation.

Input: $a, b, p$, where $b = \sum_{i=0}^{k-1} 2^i b_i$.
Output: $a^b \bmod p$.
  $q_1 \leftarrow a$ and $q_2 \leftarrow a^2$.
  for $i$ from $k-2$ downto 0 do
    if $b_i = 1$ then
      $q_1 \leftarrow q_1 \cdot q_2$ and $q_2 \leftarrow (q_2)^2$.
    else
      $q_2 \leftarrow q_1 \cdot q_2$ and $q_1 \leftarrow (q_1)^2$.
    end
  end
  return $q_1$.

2.2 Elliptic Curve Cryptography

The use of elliptic curves in cryptography was independently introduced by V.S. Miller [178] and N. Koblitz [174] in the late eighties of the last century. It has largely reduced the key sizes of public key schemes compared to traditional RSA [182] based techniques. For example, a 160-bit elliptic curve based scheme is as secure as a 1024-bit RSA scheme. An elliptic curve over a finite field consists of a finite number of points, which form an abelian group. Exponentiation in the elliptic curve group is known as elliptic curve scalar multiplication (ECSM), represented as $dP$ for any integer $d$ and any point $P$. The operation $dP$ represents the sum $P + P + \cdots + P$ with $d$ summands, i.e., $(d-1)$ point additions. It is a one-way function: the forward computation, i.e., computing $Q = dP$ given $d$ and $P$, is easy, but the reverse problem is computationally hard and is captured by the following definition.

Definition 1. Elliptic curve discrete logarithm problem (ECDLP): Given an elliptic curve $E$ defined over a finite field $\mathbb{F}_q$, a point $P \in E(\mathbb{F}_q)$ of order $n$, and a second point $Q \in \langle P \rangle$, determine the integer $d \in [0, n-1]$ such that $Q = dP$.

The ECDLP is at the heart of elliptic curve cryptography. The security of an elliptic curve scheme is based on the difficulty of solving this problem. The best known algorithm for solving the ECDLP is Pollard’s rho method [183]. The algorithm has a fully exponential expected running time of $\sqrt{\pi n}/2$ point additions. The hardness of the ECDLP depends on the choice of the elliptic curve. For a given underlying field $\mathbb{F}_q$, maximum resistance to Pollard’s rho method is attained by selecting an elliptic curve $E$ for which $n$ is prime and as large as possible. The most favourable situation arises when $\#E(\mathbb{F}_q)$ is prime or almost prime, i.e., $\#E(\mathbb{F}_q) = kn$, where $n$ is prime and the co-factor $k$ is small (e.g., $k \in \{1,2,3,4\}$). In this case, since $\#E(\mathbb{F}_q)$ lies in the Hasse interval $[(\sqrt{q}-1)^2, (\sqrt{q}+1)^2]$, we have $n \approx q$, and we say that the elliptic curve has a security level of $\frac{1}{2}\log_2 q$ bits [102].

Figure 2.1 represents a typical ECC implementation hierarchy. The top level comprises elliptic curve cryptographic schemes such as ECDSA and ECMQV [155]. The second level is the elliptic curve scalar multiplication (ECSM), which consists of a sequence of point doublings (ECD) and point additions (ECA). The operations ECD and ECA, the elliptic curve group operations, form the third level. These two group operations consist of a sequence of finite field divisions, multiplications, additions, and subtractions, which belong to the fourth level of the hierarchy.

Pairing based cryptography can be viewed as an extension of elliptic curve cryptography. Along with the ECSM operation, it involves another one-way function known as pairing computation, which we discuss later.

Figure 2.1: ECC implementation hierarchy. Levels, from lowest to highest: finite field operations (addition, subtraction, multiplication, inversion); elliptic curve point addition and point doubling; elliptic curve scalar multiplication; elliptic curve cryptographic schemes.

2.2.1 Operations on Elliptic Curve

An elliptic curve $E$ over GF($p$) is often defined as the set of solutions (points) of the following equation [117, 155],

$$y^2 = x^3 + ax + b, \qquad (2.1)$$

where $x, y, a, b \in$ GF($p$) and $4a^3 + 27b^2 \neq 0$. The rational affine points on the curve and the point at infinity $\mathcal{O}$ form an abelian group. The point $\mathcal{O}$ is the identity element of the group. Thus, for every point $P \in E$, $P + \mathcal{O} = \mathcal{O} + P = P$. The group operations are known as point addition (ECA) and point doubling (ECD).


Figure 2.2: Geometric interpretation of elliptic curve operations: (a) addition of two points and (b) doubling a point.

The geometric interpretation of the addition of two points on an elliptic curve and of the doubling of a point is depicted in Fig. 2.2. Suppose that $P$ and $Q$ are two distinct points on an elliptic curve and that $P \neq -Q$. To add the points $P$ and $Q$, a line is drawn through the two points as shown in Fig. 2.2(a). This line intersects the cubic curve in exactly one more point, called $-R$. The point $-R$ is reflected in the x-axis to the point $R$, which is the result of $P+Q$.

Similarly, to add a point $P$ to itself, i.e., to double $P$, a tangent line to the curve is drawn at $P$ as shown in Fig. 2.2(b). If the y-coordinate of $P$ is not 0, the tangent line intersects the elliptic curve at exactly one other point, $-R$. The point $-R$ is reflected in the x-axis to $R$, which is the result of $P+P$ or $2P$.

The formulæ for ECA and ECD in affine coordinates are as follows. Let $P = (x_1, y_1)$ and $Q = (x_2, y_2)$ be two points on $E$. If $P = -Q$ then $P + Q = \mathcal{O}$. Otherwise $R = (x_3, y_3) = P + Q$ is given by:

$$x_3 = \lambda^2 - x_1 - x_2 \qquad (2.2)$$

$$y_3 = \lambda(x_1 - x_3) - y_1 \qquad (2.3)$$

where

$$\lambda = \begin{cases} (3x_1^2 + a)/(2y_1) & \text{if } P = Q, \\ (y_2 - y_1)/(x_2 - x_1) & \text{if } P \neq Q. \end{cases}$$

In addition to ECA and ECD, the inverse of a point $P = (x, y) \in E$ is computed as $-P = (x, -y)$.
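The affine formulas (2.2) and (2.3) translate directly into the following Python sketch. Several choices here are assumptions made for illustration only: the point at infinity is represented by `None`, field division is done with Fermat's little theorem via the built-in `pow`, and the toy curve y^2 = x^3 + 2x + 2 over GF(17) with the point (5, 1) is a small textbook example, not a parameter set used in the thesis.

```python
def ec_neg(P, p):
    """Inverse of a point: (x, y) -> (x, -y); None denotes the point at infinity."""
    if P is None:
        return None
    x, y = P
    return (x, (-y) % p)


def ec_add(P, Q, a, p):
    """Add two affine points on y^2 = x^3 + ax + b over GF(p), p an odd prime."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                                   # P = -Q, so P + Q = O
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, p - 2, p) % p         # tangent slope
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, p - 2, p) % p          # chord slope
    x3 = (lam * lam - x1 - x2) % p                    # equation (2.2)
    y3 = (lam * (x1 - x3) - y1) % p                   # equation (2.3)
    return (x3, y3)


# Toy example (assumed parameters): y^2 = x^3 + 2x + 2 over GF(17)
a, b, p = 2, 2, 17
P = (5, 1)
P2 = ec_add(P, P, a, p)       # doubling
P3 = ec_add(P, P2, a, p)      # addition of distinct points
```

A quick sanity check for such a sketch is that every computed point still satisfies the curve equation (2.1).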

The ECSM computation technique is shown in Algo. 2.4, where $k$ represents the bit size of $p$, i.e., $k = \lceil \log_2 p \rceil$. Algo. 2.4 is the Montgomery ladder [176] based ECSM algorithm. It works as follows.

Let the binary representation of $d$ be $(d_{k-1}, \cdots, d_0)$, and assume that $d_{k-1} = 1$. The algorithm starts with the pair $(P, 2P)$. At the beginning of each step $i$, we have the pair $(Q_1, Q_2) = (mP, (m+1)P)$, where $m = (d_{k-1} \cdots d_{i+1})_2$ is the integer formed by the bits processed so far. At the end of the last step ($i = 0$), we have $(Q_1, Q_2) = (dP, (d+1)P)$.

Algorithm 2.4: The Montgomery ladder for elliptic curve scalar multiplication.

Input: An integer $d \geq 1$ and a point $P$ on the elliptic curve, where $d = \sum_{i=0}^{k-1} 2^i d_i$.
Output: $dP$.
  $Q_1 \leftarrow P$ and $Q_2 \leftarrow 2P$.
  for $i$ from $k-2$ downto 0 do
    if $d_i = 1$ then
      $Q_1 \leftarrow Q_1 + Q_2$ and $Q_2 \leftarrow 2Q_2$.
    else
      $Q_2 \leftarrow Q_1 + Q_2$ and $Q_1 \leftarrow 2Q_1$.
    end
  end
  return $Q_1$.

The Montgomery ladder has two significant advantages. First, the two operations in each branch, for $d_i = 1$ and for $d_i = 0$, can be parallelized in an obvious way: point addition and point doubling can be run in parallel on two different processor cores. Second, the algorithm is resistant to non-differential (simple) side-channel attacks [156]. The cryptoprocessor proposed in chapter 4 computes the Montgomery ladder based ECSM operation by following the above point inversion, point addition, and point doubling rules in affine coordinates.
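The ladder of Algo. 2.4 can be sketched in Python on top of the affine group law. The block repeats a compact addition helper so that it is self-contained. The names, the use of `None` for the point at infinity, and the toy curve y^2 = x^3 + 2x + 2 over GF(17) are assumptions made for illustration and do not reflect the hardware design of chapter 4.

```python
def ec_add(P, Q, a, p):
    """Affine addition/doubling on y^2 = x^3 + ax + b over GF(p); None is infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, p - 2, p) % p
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, p - 2, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)


def ladder_ecsm(d, P, a, p):
    """Return dP for an integer d >= 1 with the Montgomery ladder.

    Each iteration performs one point addition and one point doubling,
    independent of the scalar bit; the pair always equals (mP, (m+1)P).
    """
    q1, q2 = P, ec_add(P, P, a, p)                   # (P, 2P): top bit of d is 1
    for i in range(d.bit_length() - 2, -1, -1):
        if (d >> i) & 1:
            q1, q2 = ec_add(q1, q2, a, p), ec_add(q2, q2, a, p)
        else:
            q1, q2 = ec_add(q1, q1, a, p), ec_add(q1, q2, a, p)
    return q1
```

Since the addition and the doubling inside one iteration use only the old values of the pair, they are independent of each other, which is what allows them to run in parallel.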



2.3 Cryptographic Pairings

The pioneering work in the field of pairing based encryption is due to Boneh and Franklin [143]. The identity based encryption (IBE) scheme proposed in [143] uses pairing computation as one of the major operations in both the encryption and the decryption procedure. The security of the scheme is based on the difficulty of solving the well-known Bilinear Diffie-Hellman problem. A survey of pairing based cryptographic schemes is given in [112]. This section gives a brief overview of Tate pairing computation and of some security issues of pairing algorithms with respect to fault attacks.

2.3.1 Tate Pairing

The name bilinear pairing indicates that it takes a pair of vectors as input and returns a number, and that it is linear in each of its input variables. For example, the dot product of vectors is a bilinear pairing. Similarly, for cryptographic applications the bilinear pairing (or simply pairing) operations are defined on elliptic or hyperelliptic curves. A pairing is a mapping $e : G_1 \times G_2 \rightarrow G_3$, where $G_1$ is a curve group defined over a finite field $\mathbb{F}_q$, $G_2$ is another curve group over the extension field $\mathbb{F}_{q^k}$, and $G_3$ is a subgroup of the multiplicative group of $\mathbb{F}_{q^k}$. The groups $G_1$ and $G_2$ may also be the same group. If $G_1 = G_2$, the mapping $e$ is called a symmetric pairing. If, on the other hand, $G_1 \neq G_2$, then $e$ is called an asymmetric pairing.

Every point on an elliptic curve is one of two kinds: a point of finite order or a point of infinite order. A point $P$ is of finite order if there exists a smallest positive integer $l$ such that $lP = \mathcal{O}$. If no such $l$ exists, then $P$ is of infinite order. In other words, if $P$ is of infinite order, the point at infinity can never be obtained by adding $P$ to itself, no matter how many times. This distinction between points of finite and infinite order leads to the following definition:

Definition 2. l-torsion point: A point $P \in E(\mathbb{F}_q)$ is called a torsion point of order $l$, or $l$-torsion point, if $P$ has order $l$.

Gathering all the torsion points of an elliptic curve $E$ forms a finite subgroup of $E(\mathbb{F}_q)$, called $E(\mathbb{F}_q)_{tor}$:

$$E(\mathbb{F}_q)_{tor} = \{P \in E(\mathbb{F}_q) \mid P \text{ has finite order}\} \subseteq E(\mathbb{F}_q).$$


Let a large odd prime $l$ divide the order of the curve group, $\#E(\mathbb{F}_q)$, and let the point $P$ be an $l$-torsion point. Let $k$ be the corresponding embedding degree, often referred to as the security multiplier in pairing computation: it is the smallest positive integer such that $l$ divides $q^k - 1$. Then the Tate pairing of order $l$ is a map

$$e_l : E(\mathbb{F}_q)[l] \times E(\mathbb{F}_{q^k})[l] \rightarrow \mathbb{F}_{q^k}^* / (\mathbb{F}_{q^k}^*)^l, \qquad (2.4)$$

where $E(\mathbb{F}_q)[l]$ denotes the subgroup of $E(\mathbb{F}_q)$ of all points of order dividing $l$, and similarly for $\mathbb{F}_{q^k}$. The $l$-Tate pairing of points $P \in E(\mathbb{F}_q)[l]$ and $Q \in E(\mathbb{F}_{q^k})[l]$ is given by $e_l(P,Q) = f_{l,P}(D)$. Here $f_{l,P}$ is a function on $E$ whose divisor is equivalent to $l(P) - l(\mathcal{O})$, and $D$ is a divisor equivalent to $(Q) - (\mathcal{O})$ whose support is disjoint from the support of $f_{l,P}$. The point $\mathcal{O}$ represents the point at infinity. For more information regarding divisors, we refer the reader to [37, 138]. The formulas for $D$ and $f_{l,P}(D)$ are given in the following equations:

$$D = \sum_i a_i P_i \qquad (2.5)$$

$$f_{l,P}(D) = \prod_i f_{l,P}(P_i)^{a_i}. \qquad (2.6)$$

Cryptographic pairings satisfy the following properties:

• Non-degeneracy: For each $P \neq \mathcal{O}$ there exists $Q \in E(\mathbb{F}_{q^k})[l]$ such that $e_l(P,Q) \neq 1$.

• Bilinearity: For any integer $n$, $e_l([n]P, Q) = e_l(P, [n]Q) = e_l(P,Q)^n$ for all $P \in E(\mathbb{F}_q)[l]$ and $Q \in E(\mathbb{F}_{q^k})[l]$.

• It is efficiently computable.

The value $e_l$ is a representative of an element of the quotient group $\mathbb{F}_{q^k}^*/(\mathbb{F}_{q^k}^*)^l$. For cryptographic protocols, however, it is essential to have a unique representative. So $e_l$ is raised to the $((q^k-1)/l)$-th power to obtain an $l$-th root of unity. The resulting value is called the reduced Tate pairing:

$$E_l(P,Q) = e_l(P,Q)^{(q^k-1)/l}. \qquad (2.7)$$
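Two quantities used above, the embedding degree k and the exponent $(q^k-1)/l$ of (2.7), can be computed with the short Python sketch below. The function names and the toy values q = 11, l = 5 are assumptions chosen for illustration; l is assumed to be a prime that does not divide q.

```python
def embedding_degree(q, l):
    """Smallest k >= 1 such that l divides q^k - 1 (assumes gcd(q, l) = 1, l > 1)."""
    k, t = 1, q % l
    while t != 1:
        t = (t * q) % l       # t = q^k mod l
        k += 1
    return k


def final_exponent(q, l):
    """Exponent (q^k - 1)/l that maps a pairing value to an l-th root of unity."""
    k = embedding_degree(q, l)
    return (q ** k - 1) // l


# Toy illustration with k = 1: q = 11, l = 5, exponent (11 - 1)/5 = 2.
q, l = 11, 5
e = final_exponent(q, l)
roots = {pow(x, e, q) for x in range(1, q)}   # images of all nonzero elements of GF(11)
```

In this toy case every element of `roots` raised to the power l equals 1 modulo q, which is the unique-representative property that the final exponentiation provides.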

The Tate pairing is computed by an ECSM based technique proposed by V. Miller [175], which is shown in Algo. 2.5. The algorithm performs a doubling for every bit of $l$, and an addition only if the corresponding bit of $l$ is 1. Finally, it returns the Tate pairing $E_l(P,Q)$. In the algorithm, $l'(Q)$ denotes the straight line $l'$ connecting two points $P_1$ and $P_2$, evaluated at the point $Q$. Let the line $l'$ intersect the curve at a third point $X$. Then $v'(Q)$ denotes the vertical line $v'$ through $X$, evaluated at $Q$ [37, 47].

Algorithm 2.5: Miller’s algorithm.

Input: $P$, an $l$-torsion point in $E(\mathbb{F}_q)$; $Q \in E(\mathbb{F}_{q^k})$.
Output: the Tate pairing $E_l(P,Q)$.
  $i = \lceil \log_2(l) \rceil$, $K \leftarrow P$, $f \leftarrow 1$.
  while $i \geq 1$ do
    Compute the equations of $l'$ and $v'$ arising in the doubling of $K$.
    $K \leftarrow 2K$ and $f \leftarrow f^2 \, l'(Q)/v'(Q)$.
    if the $i$-th bit of $l$ is 1 then
      Compute the equations of $l'$ and $v'$ arising in the addition of $K$ and $P$.
      $K \leftarrow P + K$ and $f \leftarrow f \, l'(Q)/v'(Q)$.
    end
    $i \leftarrow i - 1$.
  end
  return $f^{(q^k-1)/l}$.

The Tate pairing can only be computed efficiently if the security parameter $k$ is small. Before the work of Miyaji, Nakabayashi and Takano [145], it was assumed that $k$ is of the size of $l$ for any general curve. Thus, the early curves used in pairing based cryptography were supersingular curves, since their security multiplier satisfies $k \leq 6$. Barreto et al. [138] generalized Miller’s technique for computing pairings; the result is also known as the BKLS algorithm. A pairing computation technique on hyperelliptic curves, including supersingular curves over $\mathbb{F}_{3^m}$, was proposed by Duursma and Lee [126]. The algorithm was further improved by Kwon [111]. More recently, pairing computation in the highly efficient Edwards coordinates [64, 65] and in twisted Edwards coordinates [50] was defined by Ionica and Joux [47] and by Das and Sarkar [49], respectively. A popular and widely used elliptic curve that provides 128-bit security was proposed by Barreto and Naehrig [76]. This pairing-friendly elliptic curve is defined over a 256-bit prime field with embedding degree $k = 12$. The present thesis discusses the FPGA implementation of pairing based cryptography that is resistant to side-channel attacks. Hence, we next present an overview of FPGAs and side-channel attacks, which is essential to the understanding of the work.

2.4 FPGA Architecture

The name field programmable gate array comes from its internal structure, which consists of an array of programmable logic clusters. We use the Xilinx Virtex device family for our applications. The Xilinx Virtex-II Pro is one such FPGA device; it is implemented in nine layers using 130nm CMOS technology [137]. It has an island-style architecture, which consists of a two-dimensional array of Configurable Logic Blocks (CLBs) and programmable interconnect resources. An architectural overview of such an FPGA device is shown in Fig. 2.3. Each CLB is a cluster of four identical sub-blocks called slices and two 3-state buffers. All slices are equivalent, and each contains the following circuit elements:

• Two function generators (F & G)

• Two storage elements

• Arithmetic logic gates

• Large multiplexers

• Wide function capability

• Fast carry look-ahead chain

• Horizontal cascade chain (OR gate)

Figure 2.3: Structure of a Virtex FPGA, showing I/O blocks, logic blocks, and vertical and horizontal interconnects.

The function generators F & G are configurable as typically 4-input look-up tables (LUTs), as 16-bit shift registers, or as 16-bit distributed RAM. In addition, the two storage elements are either edge-triggered D flip-flops or level-sensitive latches. Each CLB has internal fast interconnect and is connected to a switch matrix to access general routing resources.

The LUTs are used to map fixed-input Boolean logic. The gates are used to implement special functions such as fast carry chains. Exploring the benefit of such an in-built carry chain for implementing adder and multiplier circuits with large operands is one of the major objectives of this thesis.

2.5 Side-channel Attacks

Side-channels are defined as unintended output channels of a system. A side-channel attack (SCA) exploits the information leaked unintentionally through the side-channels of a cryptographic device during the execution of specific operations. It is typically a passive and non-invasive physical attack [101]. It is passive in that it observes some physical characteristic of the device without interfering with the execution; that is, the target device behaves exactly as if no attack occurred. It is non-invasive in that it exploits only externally available, unintentionally leaked information and does not de-package or tamper with the target device.

Various side-channels that can be used to mount an attack on a cryptosystem are known; they are shown in Fig. 2.4. The information leaked through side-channels can be gathered easily in practice, and it is therefore essential that the threat of SCA be quantified when assessing the overall security of a cryptosystem.

Figure 2.4: Traditional side-channels. Besides its I/O channels (plain text, key, cipher text), a cryptographic device D leaks through side channels: power consumption, heat, computation time, sound, visible light, frequency, faulty output, electromagnetic radiation, and error messages.

2.5.1 Timing Attacks

Usually the running time of a program is merely considered as a constraint, a parameter that must be reduced as much as possible by the programmer. More surprising is the fact that the running time of a cryptographic device can also constitute an information channel, providing the attacker with invaluable information on the secret parameters involved. This is the idea of the timing attack.

The time taken to compute a cryptographic operation was the first side-channel utilized for attacks. It was brought to the attention of the cryptographic community when Paul Kocher [171] introduced a technique for exploiting timing variation in an attack in 1996. Timing variation often occurs because of data-dependent operations or unbalanced execution. For example, in the binary double-and-add (left-to-right) multiplication algorithm, Algo. 2.1, the addition takes place only if the corresponding bit of the multiplier is one. In order to exploit timing information, precise timing measurements need to be made. The first experimental result of a timing attack against an actual smartcard implementation of RSA was shown in [168].
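The data dependence mentioned for Algo. 2.1 can be made visible by counting operations instead of measuring time. The following Python sketch is a hypothetical illustration: the operation counter stands in for a timing measurement, and the function name and the operand values are assumptions, not measurements from the thesis.

```python
def int_mult_op_count(a, b, p):
    """Run interleaved multiplication and count the modular additions performed."""
    k = p.bit_length()
    s, additions = 0, 0
    for i in range(k - 1, -1, -1):
        s = (2 * s) % p              # the doubling happens in every iteration
        if (b >> i) & 1:             # the addition happens only for multiplier bits equal to 1
            s = (s + a) % p
            additions += 1
    return s, additions


p = 251
for b in (0b00000001, 0b01010101, 0b11111010):
    product, adds = int_mult_op_count(200, b, p)
    print(f"b = {b:08b}: {adds} additions")
```

The number of additions equals the Hamming weight of the multiplier b, so an implementation whose running time grows with this count leaks that weight to an observer.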



2.5.2 Power Consumption Attacks

As described in [163], integrated circuits are built out of individual transistors, which act as voltage-controlled switches. In a transistor, current flows across the transistor substrate when charge is applied to (or removed from) the gate. This current then delivers charge to the gates of other transistors, interconnect wires, and other circuit loads. The motion of electric charge consumes power and produces electromagnetic radiation, both of which are readily detectable.

Nowadays, almost all smartcards and other mobile processors are implemented as integrated circuits (IC) in CMOS technology. In these devices, two types of power consumption leakage can normally be observed: the transition count leakage, which is related to the number of bits that change their state at a time, and the Hamming weight leakage, which is related to the number of 1’s being processed at a time. The internal current flow of a cryptoprocessor can be observed from outside by measuring the current drawn from the power supply. Clearly, power analysis attacks are applicable only to hardware implementations of cryptosystems. They are particularly effective and have proven successful in attacking smartcards and other dedicated embedded systems that store a secret key. Among all types of SCA known today, power analysis attacks and the relevant countermeasures have the largest body of literature. Power analysis attacks have been demonstrated to be very powerful against most straightforward implementations of symmetric and public key ciphers. Basically, power analysis attacks are divided into Simple and Differential Power Analysis (referred to as SPA and DPA, respectively).

2.5.2.1 Simple Power Analysis (SPA) Attacks

Simple power analysis (SPA) is generally based on inspecting a visual representation of the power consumption of a device while an encryption operation is being performed. It is a technique that involves direct interpretation of power consumption measurements collected during cryptographic operations. SPA can yield information about a device’s operation as well as key material.

In an SPA attack, the attacker directly observes the power consumption of a device. The amount of power consumed varies depending on the instructions being executed, and the instructions must be distinguishable by their power traces. In addition, the attacked instruction needs to have a relatively simple or direct relationship with the secret key. For example, SPA can be used to break RSA implementations by revealing differences between multiplication and squaring operations. Similarly, many DES implementations have visible differences within permutations and shifts, and can be broken using SPA [163]. However, it is not difficult to design a system that is not vulnerable to SPA attacks.
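The RSA example above can be illustrated with an idealized Python sketch. It assumes that the attacker can perfectly tell a squaring from a multiplication in the power trace; the string of "S" and "M" symbols stands in for that trace, and all names are hypothetical.

```python
def square_and_multiply_trace(a, b, p):
    """Left-to-right square-and-multiply for b >= 1; returns the result and the operation sequence."""
    result, trace = 1, []
    for i in range(b.bit_length() - 1, -1, -1):
        result = (result * result) % p
        trace.append("S")                    # a squaring in every iteration
        if (b >> i) & 1:
            result = (result * a) % p
            trace.append("M")                # a multiplication only for exponent bit 1
    return result, "".join(trace)


def recover_exponent(trace):
    """Read the exponent off the trace: 'SM' encodes bit 1, a lone 'S' encodes bit 0."""
    bits, i = [], 0
    while i < len(trace):
        if trace[i:i + 2] == "SM":
            bits.append("1")
            i += 2
        else:
            bits.append("0")
            i += 1
    return int("".join(bits), 2)
```

A balanced algorithm such as the Montgomery ladder removes this leak, because its operation sequence is the same for every exponent of a given length.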

2.5.2.2 Differential Power Analysis (DPA) Attacks

When simple power analysis is not feasible, differential power analysis (DPA) can be tried. DPA uses many measurements. It tries to exploit the relationship between the processed data and the power consumption, whereas SPA exploits the relationship between the power consumption and the executed operations. While the latter may not be successful, the former has a better chance of success.

In a DPA attack, the attacker records the power consumption of several runs of a cryptographic algorithm implemented on an electronic device. In general, every run is performed on a random plain text with a fixed secret key. The DPA attack relies on the fact that the power consumption of a device varies when it performs the same operation on different data. This difference in power consumption is very small and is not visible in a direct plot of the traces. However, it can be measured and exploited by sophisticated offline analysis.

To get a clear idea about the DPA attack let us demonstrate it on binary field addition

which is described in [40]. Let, f (x) be an irreducible polynomial of degree m over F2. We

assume that an element a = am−1xm−1 +am−2xm−2 + · · ·+a1x+a0 of F2m ∼= F2[x]/( f (x))

is represented by the polynomial basis with ai ∈ F2.

The addition of a(x)+b(x) is performed as: (am−1⊕bm−1)xm−1+(am−2⊕bm−2)xm−2+

· · ·+(a1⊕b1)x+(a0⊕b0). Let us consider that a(x) be a secret key which is added with

b(x), the publicly known, and even be chosen by the attacker. The attacker chooses some

random b(x) and collects the power consumption for each executions.

Let $W$ be the power consumption associated with the addition $a(x) + b(x)$. Suppose the adversary chooses thousands of random $b(x)$ and collects the corresponding $W$. To recover the $i$-th bit of $a(x)$, we guess that $a_i = 0$ and divide the power consumptions into two sets $S_0$ and $S_1$ according to $b_i$:

$S_k = \{ W \mid b_i = k \}$ with $k \in \{0, 1\}$.

The differential power consumption is then

$\Delta = \langle S_1 \rangle - \langle S_0 \rangle$,

where $\langle \cdot \rangle$ denotes the average over a set.

If the guess is correct, $\Delta$ will be positive, because bit $i$ of the output is 1 for every trace in $S_1$ and 0 for every trace in $S_0$, so the Hamming weight of the output is on average one higher in $S_1$ than in $S_0$. Otherwise, $\Delta$ will be negative. By repeating this DPA technique for every bit position, the attacker can obtain all the bits of $a(x)$.
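The attack described above can be sketched as a small simulation. This is only an illustration under stated assumptions: the leakage model (power equals the Hamming weight of the result plus Gaussian noise), the function names, the trace count, and the 16-bit secret are all hypothetical and are not taken from [40].

```python
import random

def hamming_weight(x: int) -> int:
    """Number of 1 bits in x."""
    return bin(x).count("1")

def simulate_trace(secret: int, b: int, rng: random.Random, noise: float) -> float:
    """Assumed leakage model: W = HW(a XOR b) + Gaussian noise."""
    return hamming_weight(secret ^ b) + rng.gauss(0.0, noise)

def dpa_recover(secret: int, m: int, n_traces: int = 4000,
                noise: float = 1.0, seed: int = 1) -> int:
    """Recover an m-bit secret a from simulated traces of a XOR b."""
    rng = random.Random(seed)
    inputs = [rng.getrandbits(m) for _ in range(n_traces)]      # random b(x)
    traces = [simulate_trace(secret, b, rng, noise) for b in inputs]
    recovered = 0
    for i in range(m):
        # Partition the traces by bit i of the known input b.
        s1 = [w for b, w in zip(inputs, traces) if (b >> i) & 1]
        s0 = [w for b, w in zip(inputs, traces) if not (b >> i) & 1]
        delta = sum(s1) / len(s1) - sum(s0) / len(s0)
        # Guess a_i = 0: a positive delta confirms it, a negative one means a_i = 1.
        if delta < 0:
            recovered |= 1 << i
    return recovered

SECRET = 0xB3C5                      # hypothetical 16-bit secret a(x)
guess = dpa_recover(SECRET, 16)
```

In this model each bit contributes a mean difference of plus or minus one between the two sets, so a few thousand traces separate the two cases comfortably; a real device would leak far less cleanly.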

2.6 Fault Attacks

A fault attack is another powerful technique for breaking a cryptosystem [154]. These theoretical findings have been applied to both symmetric ciphers [13, 48, 63, 139] and asymmetric ciphers [161] by several researchers. In this attack, a fault is injected during the computation of a cryptographic algorithm on a cryptoprocessor, and the faulty output is exploited to deduce the secret key. The faults of a device can be characterized by several aspects, which are as follows.

• Permanent Fault: It damages a cryptographic device permanently, so that the device behaves incorrectly in all future computations. Such damage includes freezing a memory cell to a constant value, cutting a data bus, or sticking a logic output at VCC or GND.

• Transient Fault: As opposed to a permanent fault, a transient fault disturbs the device during its processing, so that it produces faulty results only during that specific computation. Examples of such disturbances are radioactive bombardment, an abnormally high or low clock frequency, and an abnormal supply voltage.

• Error Location: Some attacks require the fault to be induced in a specific location, such as a specific memory cell or a specific bit of a register.

• Time of Occurrence: Some attacks require the fault to be induced at a specific time during the computation, for example at the output of a particular round of the DES or AES algorithm.

• Error Type: Many types of error may be considered, for example flipping the value of some bits, freezing a memory cell to 0 or 1, preventing a jump from being executed, disabling the instruction decoder, or flipping memory bits in one direction only (e.g., a bit can be flipped from 1 to 0, but not the opposite).

The fault model is of great importance to the feasibility of an attack. Work on fault attacks can be categorized into two groups. The first group deals with ways to induce a given type of fault in a cryptoprocessor. The second assumes a fault model and deals with how this model can be exploited to break a cryptosystem; it is not concerned with how such faults are induced in practice. The two groups are of course complementary in determining the potential weaknesses opened up by a fault induction method.

2.6.1 Fault Induction Technique

Faults are induced by manipulating the channels that affect the device’s environment and putting the device in abnormal conditions [101]. Many channels are available to the attacker. Some of them are as follows:

• Power: An inappropriate power supply affects the behavior of a device. For example, a smartcard, as per ISO standards, must tolerate a supply voltage between 4.5 V and 5.5 V and must work properly within this range. However, a deviation of the power supply well beyond the specified tolerance may affect its functionality. It will then lead to a wrong computation result, provided that the device is able to complete the current computation.

• Clock Frequency: An abnormally high or low frequency may induce errors in processing. Fine tuning of the clock frequency, or a clock glitch at the right time, can completely change the execution of a processor. It may even cause an instruction to be skipped in the execution sequence.

• Temperature: Operating the device in extreme temperature conditions can induce faults, although this is not a good choice for mounting fault attacks in practice.

• Radiation: Correctly focused radiation can disturb the behavior of a cryptoprocessor. In practice, the attacker may put a device such as a smartcard into a microwave oven to make it perform erroneous computations.

• Light: The illumination of a transistor causes it to conduct. Thereby, it may induce a

transient fault. By applying an intense light source, it is possible to change individual

bit values in an SRAM [139]. The same technique could also interfere with jump

instructions, causing conditional branches to be taken wrongly.

• Eddy current: Eddy currents induced by the magnetic field of an alternating current in a coil can generate various errors inside a chip. They can induce faults in RAM, EPROM, EEPROM, and Flash memory cells. For example, they could change the value of a PIN code in a mobile phone card [130].

Fault-based cryptanalysis is an interesting area of research. We refer the reader to [13] and [101] for fault attack techniques on AES and RSA, respectively.

2.7 Terminologies

In the subsequent chapters, the following terms will be encountered, and therefore are

described below.

1. Finite field: This is a field that contains only finitely many elements. It is also known as a Galois field (in honor of Evariste Galois). A finite field with $p^n$ elements is denoted by $\mathbb{F}_{p^n}$ or GF($p^n$), where $p$ is a prime number called the characteristic of the field and $n$ is a positive integer.

2. Elliptic curve: This is a smooth, projective algebraic curve of genus one, on which there is a specified point O. An elliptic curve is in fact an abelian variety, that is, it has a multiplication defined algebraically with respect to which it is a (necessarily commutative) group, and O serves as the identity element.

3. Elliptic curve group operations: These are the operations under which the points of an elliptic curve form a group. An elliptic curve group is an additive group, and its addition operations are defined by the underlying elliptic curve equation. In general, there are two operations: point addition (adds two different points, e.g., P+Q) and point doubling (adds a point to itself, e.g., P+P).

4. Elliptic curve scalar multiplication (ECSM): This operation multiplies a point on an elliptic curve by an integer (scalar). It is also known as elliptic curve point multiplication, or sometimes elliptic curve exponentiation. The operation is a one-way function: the forward computation (given a point P and an integer d, compute Q = dP) is easy, but the reverse computation (given Q and P, find the integer d such that Q = dP) is hard.

5. Fq-primitives: These are the units on which the respective Fq arithmetic operations

can be performed.

6. Programmable unit: This is a hardware unit which provides inherent programmability. For example, a programmable Fp multiplier is a multiplier unit which supports all primes up to a given bit length. A programmable unit does not require the FPGA to be reconfigured when the parameters change.

7. Dual core: Two identical arithmetic units (cores) which can compute in parallel. The cores can use the same or sometimes different memory blocks for input and output. The main objective of using a dual core in a processor is to improve parallelism and reduce computation time.

8. Pairing: This is a mathematical construct in which elements are processed pairwise to generate a single element. In cryptography, the pairing is computed on a pair of points on an elliptic curve or on the Jacobian of a hyperelliptic curve, and it generates an element of a finite field.



2.8 Conclusion

This chapter has described the mathematical background and preliminaries that are essential for understanding the work described in the following chapters. It has given a brief overview of techniques and algorithms for performing arithmetic operations on finite fields of large prime characteristic. In the next chapter, we give a literature survey of the works related to the contributions of this thesis.


Chapter 3

Survey of Related Work

THIS CHAPTER DISCUSSES some of the previously published works, which, either

directly or indirectly, relate to the contributions of this thesis. Besides, it tries to pro-

vide the reader with a basic understanding of the state-of-the-art in research in this domain.

We start with the investigation of the existing works in the area of prime field elliptic curve

scalar multiplication (ECSM) on hardware platforms. Next, we will delve into the analysis

of the various reported techniques on side-channel attacks and corresponding countermea-

sures on ECSM. Finally, we will probe into the state of affairs of pairing computation

techniques, respective hardware and software implementations, and their security against

fault and side-channel attacks.

3.1 Hardware Implementation of ECSM on Prime Fields

The efficient implementation of ECSM on prime fields is achieved by applying optimizations at different hierarchical stages. Normally, research in this direction follows one of two levels of optimization:

1. Field-stage optimization. A prime field characteristic with low Hamming weight is chosen, which enables faster multiplication and inversion, mostly in the reduction stage. Specialized number systems, such as the Montgomery number system [176] and the residue number system (RNS) [117], are also used to perform prime field operations more efficiently.

2. Coordinates and scalar multiplication-stage optimizations. Research tries to reduce the number of field inversions (projective coordinates), reduce the number of point additions (windowing), and replace point doublings (endomorphism methods).

However, the efficiency of an implementation also depends on the underlying platform. For example, an architecture implemented with a customized CMOS library is much faster than the same architecture on an FPGA platform, because of the inherent differences between the platforms. Therefore, the choice of platform also plays an important role in implementing ECSM. Though an FPGA is slower, it is much cheaper than the fabrication of a customized CMOS design. An FPGA is also reconfigurable, i.e., a design can be erased and the same FPGA reused for another design. In view of these facts, the FPGA is accepted as a good choice for implementing embedded systems, including cryptographic applications. The fact that the entire design takes place in-house also raises the level of trust in FPGA-based cryptographic designs.

Apart from the platform, different levels of parallelism and pipelining can be adopted to design an efficient architecture for the ECSM operation on prime fields. ECSM is computed hierarchically, as shown in Fig. 2.1 of the previous chapter. Active research is going on to optimize ECSM architectures at each stage of this computation hierarchy. These optimizations may target either a specific platform or general architectures. The works described in [56] and [58] proposed parallelism techniques for implementing a prime field multiplier efficiently. The work of [56] proposed a pipelined GF(p) multiplier with multiple processing elements having data-width < ⌈log2 p⌉, thus exploring both pipelining and parallelism for an efficient GF(p) multiplier, whereas the work of [58] proposes a parallel Montgomery reduction multiplier.

Many hardware implementations have been documented for computing the elliptic curve scalar multiplication. A good survey of this area is given in [52]. ECC hardware is broadly designed over GF($2^m$) and GF(p). Efficient implementations of GF($2^m$) arithmetic units, which can be effectively embedded into ECSM hardware, are reported in [32]. Some of the very good ECSM hardware designs for GF($2^m$) are [16, 26, 30, 31, 41, 60, 68, 69, 71–73, 89, 103, 119].

An efficient implementation of GF(p) ECC on a general purpose processor was proposed in [88]. Customized hardware implementation of ECSM for curves defined over GF(p) was introduced in [142], where Orlando et al. proposed a reduced instruction set GF(p) ECC processor for the fixed prime $p = 2^{192} - 2^{64} - 1$. The GF(p) ALU proposed in [106] combines different arithmetic units into a common unit for an ECC processor. Thereafter, many hardware implementations were proposed, and some of the good results were shown in [4, 14–16, 31, 33, 45, 55, 58, 70, 74, 87, 107, 119, 120]. Most of them use Montgomery numbers to perform modular arithmetic. The conversion of a binary number $A$ to its Montgomery domain representation $\bar{A}$ and the reverse operation are expressed as:

$\bar{A} = \mathrm{MonPro}(A,\ 2^{2m} \bmod M,\ M)$

$A = \mathrm{MonPro}(\bar{A},\ 1,\ M)$

where MonPro(a, b, M) denotes the Montgomery product algorithm [176], which returns $a \cdot b \cdot 2^{-m} \pmod{M}$ for an $m$-bit modulus $M$ and thus performs modular multiplication on operands in the Montgomery domain. Some other implementations, like [4, 15], use RNS numbers for the underlying arithmetic of the ECSM operation. The conversions between binary and these specific representations incur additional costs. For example, in the VLSI implementation reported in [106], MonPro() takes $\log_2 M$ clock cycles to convert a $\log_2 M$-bit binary number to its equivalent Montgomery representation.
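The two conversions can be illustrated with a minimal software sketch of the Montgomery product. It assumes an odd modulus M < 2^m and Python 3.8 or later (for modular inverses via `pow`); the function names and the extra bit-length argument `m` are choices made here for illustration and do not reproduce the hardware algorithm of [106] or [176].

```python
def mon_pro(a: int, b: int, M: int, m: int) -> int:
    """Montgomery product a * b * 2^(-m) mod M, for odd M < 2^m and a, b < M."""
    R = 1 << m
    m_prime = (-pow(M, -1, R)) % R            # M' = -M^(-1) mod R
    t = a * b
    u = (t + ((t * m_prime) % R) * M) >> m    # t + k*M is divisible by R, so the shift is exact
    return u - M if u >= M else u             # single conditional subtraction

def to_montgomery(A: int, M: int, m: int) -> int:
    """A_bar = MonPro(A, 2^(2m) mod M, M) = A * 2^m mod M."""
    return mon_pro(A, pow(2, 2 * m, M), M, m)

def from_montgomery(A_bar: int, M: int, m: int) -> int:
    """A = MonPro(A_bar, 1, M)."""
    return mon_pro(A_bar, 1, M, m)

# Example with the 192-bit prime mentioned above; A and B are arbitrary test values.
M, m = 2**192 - 2**64 - 1, 192
A, B = 0x123456789ABCDEF0123456789ABCDEF, 0xFEDCBA9876543210
A_bar, B_bar = to_montgomery(A, M, m), to_montgomery(B, M, m)
product = from_montgomery(mon_pro(A_bar, B_bar, M, m), M, m)
```

A chain of multiplications stays in the Montgomery domain and pays the conversion cost only once at the start and once at the end, which is why the cost is usually acceptable for ECSM.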

Among the existing works, the designs in [14] and [15] support only NIST primes [166]. Other designs support any general prime as the field characteristic. The work in [87] showed a time-area tradeoff for implementing a GF(p) ECC processor. Sakiyama et al. in [70] accelerated the ECSM operation using parallel modular arithmetic logic units. The work described in [58] proposes two different levels of parallelism for designing an efficient ECSM architecture. It defines the parallelism applied within a single GF(p) operation as horizontal parallelism, and the parallel computation of finite field operations that have no data dependency as vertical parallelism. In the same paper, an embedded multicore system for elliptic curve cryptography was proposed. Using sixteen 18-bit multiplier cores, the system achieves both horizontal and vertical parallelism to develop a very long instruction word (VLIW) processor for computing scalar multiplication on GF(p) elliptic curves. A parallel and scalable processor for performing ECSM in both prime and binary fields was proposed in [16]. In [4], a prime field ECSM processor was proposed in which the underlying operations are performed in RNS.

Side-channel attacks [163, 171] are one of the major threats in developing cryptographic hardware. This cryptanalytic technique exploits the information leaked (through side-channels) by a device while it executes a cryptographic algorithm. The most popular side-channels are power and time [57]. Among the aforementioned designs, [4] and [45] attempted to implement GF(p) ECC hardware that is secure against side-channel attacks. The design proposed in [45] provides security against simple power analysis (SPA) and timing attacks only. However, the doubling attack [81, 118] and the differential power analysis (DPA) attack [162] are more powerful and also work on SPA-resistant designs. Hence, there is a need for ECSM hardware that is efficient as well as resistant to SPA, DPA, doubling, and timing attacks, which is the aim of this work. Very recently, in 2010, the ECSM cryptoprocessor of [4] addressed SPA and DPA attacks. However, it did not consider the more powerful doubling attack. Let us now survey the existing techniques for protecting the ECSM operation against side-channel attacks.

3.2 The ECSM Against Side-channel Attacks

This section describes the vulnerability of the ECSM operation to side-channel attacks. The ECSM operation is defined as Q = dP, where the scalar multiplier d is used as a secret key. A number of works have been reported on protecting the ECSM operation against side-channel attacks.

The computation formulæ for elliptic curve addition (ECA) and doubling (ECD) are different (see Section 2.2.1). The distinction between these two operations may leak information that reveals the bits of the secret d. Simple side-channel attacks, i.e., simple power analysis (SPA), may be applied to exploit this distinction. However, there are well-defined countermeasures for preventing this vulnerability. The SPA-resistant techniques are as follows:

1. Unifying the addition formulæ [115, 132] or considering alternative parameterizations [127, 151, 152].

2. Inserting dummy operations [45, 114, 162].

3. Using algorithms that are already regular and do not leak by definition [132, 140, 141, 153, 156, 167].

Though the above techniques are sufficient to resist SPA attacks, they remain vulnerable to more sophisticated differential side-channel attacks, i.e., differential power analysis (DPA) [162, 163]. In order to thwart differential side-channel analysis, the inputs of the ECSM, namely the base point P and the scalar multiplier d, should be randomized [83, 150, 162]. Some combined methods can also prevent differential side-channel attacks on ECSM operations [59, 149]. Some of the techniques for protecting an ECSM operation against differential and simple side-channel attacks are briefly described in the following sections.

3.2.1 Indistinguishable Point Add and Point Double

Different methods have been reported to make the point addition and point doubling formulæ indistinguishable in different coordinates. The works reported in [115, 132] proposed unified addition formulæ for GF($2^m$) elliptic curves in affine and projective coordinate systems. It is observed that every elliptic curve is isomorphic to a Weierstraß form [132], and that parameterizations other than the Weierstraß form may lead to faster unified point addition formulæ [151, 152]. These parameterizations are mainly based on either the Hessian form [148] or the Jacobi form [127]. With these parameterizations, point addition also becomes cheaper. Table 3.1 compares the computation cost of these parameterizations with that of the original Weierstraß form. In the table, the symbols M and C stand for finite field multiplication and multiplication by a constant, respectively.

Table 3.1: Point addition for elliptic curve over GF(p).

Parameterization               Cost                       Cofactor
Weierstraß form [132]          17M + 1C (general case)    –
(with unified formulæ)         16M + 1C (a4 = −1)
Hessian form [151]             12M                        h ∝ 3
Extended Jacobi form [127]     13M + 3C (general case)    h ∝ 2
                               13M + 1C (ε = 1)           h ∝ 4
∩ of 2 quadrics [152]          16M + 1C                   h ∝ 4

It may be noted that standard elliptic curves, such as those of IEEE 1363, FIPS 186.2, and SECG, use a group order #E(F) = h · q, where q is a prime and the cofactor h ≤ 4. The point addition and doubling formulæ need not be strictly equivalent to prevent side-channel analysis; it suffices to make them indistinguishable by inserting dummy operations. This is helpful when the two (distinct) elliptic curve group operations are similar. This technique has been adopted for elliptic curves defined over GF($2^m$) in [114]. The technique proposed in [114] assumes that loading and storing random values from different registers is indistinguishable, which may not be true in practice. A possible solution based on random register renaming is presented in [147]. The work reported in [45] proposed an indistinguishable computation technique for point addition and point doubling on GF(p) elliptic curves in affine coordinates.

3.2.2 Regular Point Multiplication Algorithms

The elliptic curve group operation (ECA and ECD) formulæ may have different side-channel traces, provided they do not leak any information about the secret scalar multiplier d used in evaluating Q = dP. For binary algorithms, this implies that the processing of 0 bits and 1 bits of the multiplier d must be indistinguishable. Some algorithms process both bit values of d in an atomic way. This can be achieved in the following ways.

• Classical Algorithms. This approach removes the conditional branching in double-and-add based algorithms [155] by performing a dummy point addition when the multiplier bit $d_i$ is zero. As a result, each iteration executes a point doubling followed by a point addition [162]. One such algorithm for a special form of elliptic curves was developed by Montgomery in [176]. It is particularly suited to elliptic curves defined over GF($2^m$) in a normal basis representation using Lopez and Dahab coordinates [167].

• Atomic Algorithms. This is a generalization of the double-and-add always technique, proposed by Chevallier et al. [114], which results in the concept of side-channel atomicity.

There are numerous variants of the above regular point multiplication techniques, based on efficient software implementations on different processor architectures [155].
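The double-and-add always idea can be sketched as follows. The toy curve y^2 = x^3 + 2x + 2 over F_17 with base point G = (5, 1) of order 19 is a textbook example chosen here purely for illustration, and the helper names are hypothetical; none of this is taken from the cited designs.

```python
P_MOD, A = 17, 2      # assumed toy curve y^2 = x^3 + 2x + 2 over F_17
G = (5, 1)            # base point of prime order 19 on that curve

def ec_add(P, Q):
    """Affine point addition; None stands for the point at infinity O."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                                   # P + (-P) = O
    if P == Q:                                        # doubling (ECD)
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:                                             # addition (ECA)
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def double_and_add_always(d, P, nbits):
    """Left-to-right scalar multiplication with one ECD and one ECA per bit."""
    Q = None
    for i in reversed(range(nbits)):
        Q = ec_add(Q, Q)                  # doubling, every iteration
        T = ec_add(Q, P)                  # addition, every iteration (dummy if bit is 0)
        Q = T if (d >> i) & 1 else Q      # keep or discard the sum
    return Q
```

The final selection is still a data-dependent choice in this Python sketch; a hardware design would typically realize it with a multiplexer so that the sequence of operations does not depend on the bits of d.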

3.2.3 Base Point Randomization Techniques

In general, base point randomization techniques try to develop some strategies on the

mathematical structure of the curves, which lead to efficient and simple ways for protecting

the secret multiplier d at the time of Q = dP computation against differential side-channel

analysis. The reported techniques are:

3.2.3.1 Point Blinding

The method is analogous to Chaum’s blind signature scheme for RSA [177]. The base point P is blinded by adding a secret random point R for which the value S = dR is known. The ECSM operation Q = dP is performed by computing d(P+R) and subtracting S to get Q. In the concept proposed in [83, 162, 171], the points R and S = dR are stored inside the device and refreshed at each new execution of ECSM by computing R ← rR and S ← rS, where r is a (small) random number generated at each new execution. As R is secret, the representation of the point P* = P+R is unknown in the computation of Q* = dP*, which ensures security against differential side-channel attacks. However, a drawback of this technique is the overhead of computing and storing R and S in the cryptoprocessor. Moreover, their initial values must be kept secret, which increases the key size.
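A minimal sketch of point blinding is given below, again on an assumed toy curve (y^2 = x^3 + 2x + 2 over F_17, base point (5, 1) of order 19). The function names and the concrete values of d, R, and r are hypothetical and serve only to check the identity d(P+R) − dR = dP.

```python
P_MOD, A = 17, 2      # assumed toy curve y^2 = x^3 + 2x + 2 over F_17
G = (5, 1)            # base point of prime order 19

def ec_add(P, Q):
    """Affine point addition; None stands for the point at infinity O."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def ec_neg(P):
    """Additive inverse of a point."""
    return None if P is None else (P[0], (-P[1]) % P_MOD)

def ec_mul(k, P):
    """Plain right-to-left double-and-add for a non-negative scalar k."""
    Q = None
    while k:
        if k & 1:
            Q = ec_add(Q, P)
        P = ec_add(P, P)
        k >>= 1
    return Q

def blinded_mul(d, P, R, S):
    """Compute Q = dP as d(P + R) - S, where S = dR was precomputed."""
    return ec_add(ec_mul(d, ec_add(P, R)), ec_neg(S))

def refresh(R, S, r):
    """Refresh the stored pair: R <- rR and S <- rS (S stays equal to dR)."""
    return ec_mul(r, R), ec_mul(r, S)

d = 7                                   # hypothetical secret scalar
R = ec_mul(3, G)                        # hypothetical secret blinding point
S = ec_mul(d, R)                        # stored alongside R
Q1 = blinded_mul(d, G, R, S)
R, S = refresh(R, S, 5)                 # new mask for the next execution
Q2 = blinded_mul(d, G, R, S)
```

Both executions return the same Q while operating on different intermediate points, which is the property the countermeasure relies on.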

3.2.3.2 Randomized Projective Representation

In projective coordinates, a point can be randomized in a very simple manner, because points are not uniquely represented. For example, in Jacobian coordinates, the triplets $(\theta^2 X_P : \theta^3 Y_P : \theta Z_P)_J$ with any $\theta \neq 0$ represent the same point; and in homogeneous coordinates, the triplets $(\theta X_P : \theta Y_P : \theta Z_P)$ with any $\theta \neq 0$ represent the same point. Some other projective representations are described in [170].

In the projective coordinate representation, for each new execution of the point multiplication Q = dP, the projective input point P is randomized with a random non-zero value θ [162]. Therefore, an attacker is no longer able to predict any specific bit in the binary representation of P, which is what the DPA attack relies on [162].
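The equivalence of the randomized Jacobian representations can be checked with a short sketch. The prime below is the 192-bit prime mentioned earlier in this chapter; the coordinate values, the function names, and the use of Python's `random` module (which is not a cryptographic random source) are assumptions made only for illustration.

```python
import random

P_MOD = 2**192 - 2**64 - 1        # 192-bit prime field

def randomize_jacobian(x, y, rng):
    """Map affine (x, y) to a random Jacobian triplet (theta^2 x : theta^3 y : theta)."""
    theta = rng.randrange(1, P_MOD)                   # random non-zero theta
    return (pow(theta, 2, P_MOD) * x % P_MOD,
            pow(theta, 3, P_MOD) * y % P_MOD,
            theta)

def to_affine(X, Y, Z):
    """Recover affine coordinates x = X / Z^2 and y = Y / Z^3."""
    z_inv = pow(Z, -1, P_MOD)
    return (X * pow(z_inv, 2, P_MOD) % P_MOD,
            Y * pow(z_inv, 3, P_MOD) % P_MOD)

# Arbitrary field elements standing in for the coordinates of P (hypothetical values).
X_P, Y_P = 0x1234567890ABCDEF, 0x0FEDCBA987654321
rng = random.Random(2011)
triplet_1 = randomize_jacobian(X_P, Y_P, rng)
triplet_2 = randomize_jacobian(X_P, Y_P, rng)
```

The two triplets differ in every coordinate, yet both map back to the same affine point, so the bit patterns processed by the device change from one execution to the next.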

3.2.3.3 Randomized Elliptic Curve Isomorphisms

A point P = (x, y) on an elliptic curve E can be randomized as $P^* = \varphi(P)$ on $E^* = \varphi(E)$ for a random curve isomorphism $\varphi$. Then Q = dP can be computed as $Q = \varphi^{-1}(dP^*)$ [150].
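As an illustration, the sketch below uses the standard isomorphism between short Weierstraß curves, (x, y) → (u^2 x, u^3 y), which maps y^2 = x^3 + ax + b to y^2 = x^3 + u^4 a x + u^6 b for a non-zero u. The source does not spell out the form of the isomorphism, so this particular choice, the toy curve over F_17, and the function names are assumptions.

```python
P_MOD, A = 17, 2      # assumed toy curve E: y^2 = x^3 + 2x + 2 over F_17
G = (5, 1)            # base point of prime order 19

def ec_add(P, Q, a):
    """Affine addition on y^2 = x^3 + a*x + b; None is the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def ec_mul(k, P, a):
    """Plain double-and-add for a non-negative scalar k."""
    Q = None
    while k:
        if k & 1:
            Q = ec_add(Q, P, a)
        P = ec_add(P, P, a)
        k >>= 1
    return Q

def randomized_mul(d, P, u):
    """Compute Q = phi^(-1)(d * phi(P)) with phi(x, y) = (u^2 x, u^3 y)."""
    a_star = A * pow(u, 4, P_MOD) % P_MOD             # coefficient a of E* = phi(E)
    P_star = (pow(u, 2, P_MOD) * P[0] % P_MOD,
              pow(u, 3, P_MOD) * P[1] % P_MOD)        # P* = phi(P)
    Q_star = ec_mul(d, P_star, a_star)                # all arithmetic happens on E*
    if Q_star is None:
        return None
    u_inv = pow(u, -1, P_MOD)
    return (Q_star[0] * pow(u_inv, 2, P_MOD) % P_MOD,
            Q_star[1] * pow(u_inv, 3, P_MOD) % P_MOD)
```

Only the coefficient a enters the addition formulæ, so the sketch never needs the transformed coefficient b explicitly.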

3.2.3.4 Randomized Field Isomorphisms

Let P be a point on an elliptic curve E defined over a finite field F. A random field isomorphism $J : F \to F^*$ is applied to P and E to get the point $P^* = J(P)$ on $E^* = J(E)$. Then the operation Q = dP is evaluated as $Q = J^{-1}(dP^*)$ [150].

3.2.4 Scalar Multiplier Randomization Techniques

In this section we review another way of randomizing the computation Q = dP, namely through a randomized representation of the scalar multiplier d. There are several strategies for randomizing the multiplier d, including:

1. Multiplier Blinding: The secret scalar multiplier d is blinded as $d^* = d + r \cdot \mathrm{ord}(P)$, where ord(P) denotes the order of the point P ∈ E(F) and r is randomly chosen. The operation Q = dP is then computed as $Q = d^*P$. The value d can also be randomized as $d^* = d + r \cdot \#E$, where #E denotes the order of the elliptic curve group. This relation holds because, by Lagrange’s theorem, the order of an element always divides the order of its group [162, 171].

2. Multiplier Splitting: The multiplier d can also be decomposed into two or more parts. This idea was introduced in [165] as a generalized side-channel countermeasure. The splitting can be done additively, as $d = d_1^* + d_2^*$, where $d_1^* = r$ and $d_2^* = d - r$ for a random r [146]. Multiplicative splitting of the scalar multiplier d is introduced in [131]. With this technique, Q = dP is evaluated as $Q = (d r^{-1})(rP)$ for a random r invertible modulo ord(P).

3. Frobenius Endomorphism: An endomorphism can be used to represent the multiplier d for some special types of curves. For example, the Frobenius endomorphism applied to Koblitz elliptic curves [172] is reported in [150]. It should be mentioned that the Frobenius expansion is roughly twice the length of a balanced binary representation of a number.
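The blinding and splitting strategies above can be compared in a small sketch on an assumed toy curve (y^2 = x^3 + 2x + 2 over F_17, base point of order 19). The function names are hypothetical, and the additive split reduces d − r modulo ord(P) to keep the scalar non-negative, which is a choice made here rather than something prescribed by [146].

```python
P_MOD, A = 17, 2      # assumed toy curve y^2 = x^3 + 2x + 2 over F_17
G = (5, 1)            # base point P
ORDER = 19            # ord(P) on this toy curve

def ec_add(P, Q):
    """Affine point addition; None stands for the point at infinity O."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def ec_mul(k, P):
    """Plain double-and-add for a non-negative scalar k."""
    Q = None
    while k:
        if k & 1:
            Q = ec_add(Q, P)
        P = ec_add(P, P)
        k >>= 1
    return Q

def mul_blinded(d, r, P):
    """Multiplier blinding: Q = (d + r * ord(P)) P."""
    return ec_mul(d + r * ORDER, P)

def mul_additive_split(d, r, P):
    """Additive splitting: Q = d1 P + d2 P with d1 = r and d2 = d - r (mod ord(P))."""
    d1, d2 = r % ORDER, (d - r) % ORDER
    return ec_add(ec_mul(d1, P), ec_mul(d2, P))

def mul_multiplicative_split(d, r, P):
    """Multiplicative splitting: Q = (d * r^(-1) mod ord(P)) (rP), r invertible."""
    k = d * pow(r, -1, ORDER) % ORDER
    return ec_mul(k, ec_mul(r, P))
```

All three variants return the same point as the unprotected computation while presenting a different scalar to the multiplication loop on each run.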

Regarding the scalar multiplier randomization techniques, the following can be observed.

1. Normally, the scalar multiplier satisfies d < ord(P). In the multiplier blinding scheme, $d^* = d + r \cdot \mathrm{ord}(P) > \mathrm{ord}(P)$ for any non-zero r. So, during the computation of $Q = d^*P$, some intermediate point may coincide with the neutral element of the group, i.e., the point at infinity (O). This can be exploited for side-channel analysis. In [85, 116], a special kind of differential power analysis attack known as the Zero-value Register Attack (ZRA) is described, which tries to exploit the active registers of a cryptoprocessor that hold a zero intermediate result.

2. Regarding the multiplier splitting schemes, it is not clear whether an attacker who can find the bits of d during Q = dP by some side-channel analysis could apply the same analysis to find the bits of $d_1^*$ and $d_2^*$, and thus compute the secret $d = d_1^* + d_2^*$.

In 2010, a modified technique was proposed in [11] to defend against a specific DPA attack known as address-bit DPA or ADPA [164]. A general countermeasure against ADPA on RSA and ECC was proposed in [122]; it randomizes the addresses of the data accessed during RSA and ECC exponentiation. We refer the reader to the recent survey [12] for further information on side-channel attacks on the ECSM operation and the corresponding countermeasures.



3.3 Implementation of Cryptographic Pairings

Pairing based cryptography started at the beginning of this century, when Joux [160] introduced the application of the Weil pairing to construct a three-party one-round key agreement protocol. Subsequently, Boneh and Franklin presented the first fully functional, efficient, and provably secure identity-based encryption scheme [143] using the properties of bilinear pairings on elliptic curves. Cryptographic pairings require tedious computations on elliptic curves or hyperelliptic Jacobian curves defined over large finite fields. The Weil and Tate pairings are the two oldest pairing techniques, and appeared in the literature at the same time [155]. The first efficient computation of pairings for cryptographic applications was introduced in 1986, when V. Miller described the Tate pairing computation technique over finite fields [175]. A generalized and efficient algorithm based on Miller’s technique, also known as the BKLS algorithm, was proposed almost two decades later by Barreto, Kim, Lynn, and Scott in [138]. Several authors have found further algorithmic improvements that decrease the complexity of Miller’s algorithm by reducing its loop length [3, 29, 77, 126].

Both the Weil and Tate pairings are based on Miller’s loop. Most works have focused on speeding up the computation of the Tate pairing because the Weil pairing is more time-consuming [92]. The computation of the Weil pairing, $W(P,Q) = M(P,Q)/M(Q,P)$, where M stands for Miller’s loop, needs two Miller steps. One is called the Miller lite part and the other the full Miller part [128]. On the other hand, the computation of the Tate pairing, $T(P,Q) = M(P,Q)^c$, requires one Miller lite part and one final exponentiation. At lower security levels, the full Miller part is much more time-consuming than the final exponentiation, so the Weil pairing is more time-consuming than the Tate pairing. However, comparing the final exponentiation of the Tate pairing with the computation of the full Miller part shows that a proper power of the Weil pairing can be computed faster than the Tate pairing at high security levels [91]. It is observed in [91] that at the 256-bit security level and above, the computation of the Weil pairing will be faster than that of the Tate pairing.

The underlying algebraic curve also plays an important role in pairing computation. Active research is also going on to obtain better pairing-friendly curves that provide more efficient pairing computation and more security with a smaller field size. Some of the well-known curves at present are the Miyaji, Nakabayashi, and Takano (MNT) curve [145], the Freeman curve [2, 67], and the Barreto-Naehrig (BN) curve [76]. Among them, the most popular BN curve is defined over a 256-bit prime field with embedding degree k = 12, which provides 128-bit symmetric security. Efficient computation of arithmetic operations over pairing-friendly tower extensions of finite fields is proposed in [18, 91].
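To make the BN parameterization concrete, the sketch below evaluates the standard BN polynomials p(u) = 36u^4 + 36u^3 + 24u^2 + 6u + 1 and n(u) = 36u^4 + 36u^3 + 18u^2 + 6u + 1 for a curve parameter u. The value of u used here is an assumption, chosen as one commonly quoted parameter that gives a 254-bit field; it is not taken from this thesis, and the code does not test p or n for primality.

```python
def bn_parameters(u: int):
    """Return (p, n, t): field characteristic, group order, and Frobenius trace."""
    p = 36 * u**4 + 36 * u**3 + 24 * u**2 + 6 * u + 1
    n = 36 * u**4 + 36 * u**3 + 18 * u**2 + 6 * u + 1
    t = 6 * u**2 + 1
    return p, n, t

u = -(2**62 + 2**55 + 1)          # assumed example parameter (hypothetical choice)
p, n, t = bn_parameters(u)

# Embedding degree k = 12: n divides p^12 - 1 but not p^6 - 1.
order_divides_p12 = pow(p, 12, n) == 1
order_divides_p6 = pow(p, 6, n) == 1
```

Because n divides p^12 − 1 but no smaller p^k − 1, the pairing takes its values in the degree-12 extension of the prime field, which is where the tower field arithmetic mentioned above comes in.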

Different varieties of pairing computations have appeared in the literature, based on the underlying algebraic curves and finite fields. Some of the most popular techniques are: the Duursma-Lee Tate pairing [126] over characteristic three fields $\mathbb{F}_{3^m}$;¹ the $\eta_T$ pairing over both binary and characteristic three fields [54]; and the ate [77], R-ate [28], and optimal-ate [3] pairings over large prime fields. The $\eta_T$ pairing [54] is the most efficient algorithm for symmetric pairings ($G_1 = G_2$), which are always defined over supersingular curves. Other pairings, like ate, R-ate, and optimal-ate, are known as asymmetric pairings ($G_1 \neq G_2$).

As per the National Institute of Standards and Technology (NIST) recommendation, it is essential to choose a pairing that can achieve 128-bit security for applications beyond 2030. Therefore, in this thesis we only study the existing software and hardware implementations of pairings that can achieve at least 128-bit security.

3.3.1 Software Library for 128-bit-secret Pairings

Software implementation results for symmetric pairings over supersingular curves that achieve 128-bit security are shown in [5, 24]. The software library presented in [5] takes 3.02 million cycles to compute the $\eta_T$ pairing on a supersingular curve defined over $\mathbb{F}_{2^{1223}}$. It is implemented on a dual quad-core Intel Xeon 45nm system to take advantage of parallelism on eight cores. The authors of [24] report 5.42 million cycles to compute the $\eta_T$ pairing on a supersingular curve defined over $\mathbb{F}_{3^{509}}$ on an Intel Core i7 45nm processor using eight cores.

Asymmetric pairings for achieving 128-bit security are mostly defined on BN curves. The software library presented in [8] takes 4.47 million cycles to compute the optimal-ate pairing on a 257-bit BN curve using only one core of an Intel Core 2 Quad Q6600 processor. A more efficient software implementation of asymmetric bilinear pairings at the 128-bit security level is described in [10]. In this software library, the optimal-ate pairing over a 254-bit BN curve is computed in just 2.33 million clock cycles on a single core of an Intel i7 2.8GHz processor. Some other software implementation results for asymmetric pairings over BN curves are reported in [27, 39, 42, 61].

¹ Throughout the report we use $\mathbb{F}_{q^m}$ and GF($q^m$) interchangeably to denote the finite field (Galois field) with characteristic q.

3.3.2 Hardware Design for 128-bit-secure Pairings

Hardware implementation results for pairings over BN curves were provided independently
by Kammler et al. [17] and Fan et al. [19] in 2009. Both designs are based
on 130nm CMOS technology. The first is designed as a complete application-specific
instruction-set processor (ASIP) augmented with special instructions for computing
pairings over BN curves. It computes an optimal-ate pairing over general primes in
15.8ms. The faster Fp arithmetic for BN curves proposed in [19] exploits the specific
features of BN parameters and the respective primes. This specialized design computes
one R-ate pairing over a BN curve in only 2.9ms. In [6], a compact hardware design
is proposed for computing the Tate pairing over 128-bit-security supersingular curves. It
uses a characteristic-3 field with a moderately composite extension degree to achieve
128-bit security as well as efficient tower field arithmetic. On a Virtex-4 FPGA, this
accelerator computes the pairing in 2.2ms while requiring no more than 4755 slices.

However, to the best of our knowledge, there is no FPGA design of a pairing
cryptoprocessor for BN curves. Given the popularity and reconfigurability of
FPGA devices, FPGA designs of cryptographic algorithms have a strong impact. The resource
constraints and the lower clock frequency of FPGA devices pose further design challenges
to the designer.

3.4 Fault and Side-channel Attacks on Pairings

Boneh et al. [154] introduced fault attacks in 1997 and showed how to recover the secret keys
of RSA and discrete logarithm based cryptosystems. Thereafter, a lot of research on
fault and side-channel attacks has been undertaken on different cryptosystems to recover the
secret keys. The first mention of side-channel analysis of pairings was in 2004, when Page
and Vercauteren [82, 105] described a fault attack on the Duursma-Lee algorithm [126] for
characteristic three. Their work shows that the multiplication operation in a general pairing
could be attacked using Simple Power Analysis (SPA) and a Messerges-style Differential Power
Analysis (DPA) [159]. The attack is based on a transient fault at the loop boundary of
the Miller's loop of such pairing computations. Suitable countermeasures based on the point
blinding technique [162] are also proposed in [82] for protecting the secret against the above
side-channel attacks. However, these countermeasures require additional private
parameters, which increases the complexity of the whole system, from key establishment to
pairing computations.

Thereafter, an in-depth approach for performing side-channel analysis on pairing
implementations was described in [84]. This work targets the Tate, ate, and ηT pairings, and it
deterministically calculates the partial output of a pairing computation based on the structural
expansion of basic finite field operations. Differential power analysis (DPA) on the ηT pairing
over F_{2^m} is described in [40]. It targets addition and multiplication operations in which the
secret and public parameters are directly involved. The attack works as follows:

1. Identify the addition and multiplication operations inside the ηT pairing computation
where one operand is public and the other is secret.

2. If such an addition is found, apply DPA to it. Recall that the
operation a + b in F_{2^m} is simply the bitwise XOR of a and b. Perform DPA on an
addition involving one secret and one chosen parameter.

3. If such a multiplication is found, mount a DPA attack on it. Multiplication
in F_{2^m} is performed by a shift-and-add procedure. A DPA attack on a binary field
multiplication involving one secret and one chosen parameter can be applied to
find the secret.
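
As an illustration of the two operations targeted in steps 2 and 3, the following Python sketch models F_{2^m} addition as a bitwise XOR and multiplication as a shift-and-add loop. It is only a software model written for this discussion, not the implementation attacked in [40]; the function names, the integer encoding of field elements, and the small example field GF(2^8) with reduction polynomial x^8 + x^4 + x^3 + x + 1 are assumptions chosen for illustration.

```python
def gf2m_add(a: int, b: int) -> int:
    """Addition in F_{2^m}: coefficient-wise addition mod 2, i.e. bitwise XOR."""
    return a ^ b


def gf2m_mul(a: int, b: int, modulus: int, m: int) -> int:
    """Shift-and-add multiplication in F_{2^m} (polynomial basis).

    Elements are integers whose bit i is the coefficient of x^i.
    `modulus` is the degree-m reduction polynomial in the same encoding.
    Both operands are assumed to be already reduced (degree < m).
    """
    acc = 0
    for i in reversed(range(m)):      # scan the bits of b from MSB to LSB
        acc <<= 1                     # shift: multiply the accumulator by x
        if (acc >> m) & 1:            # degree reached m, so reduce once
            acc ^= modulus
        if (b >> i) & 1:              # add: accumulate a when bit i of b is set
            acc ^= a
    return acc


# Example in GF(2^8) with x^8 + x^4 + x^3 + x + 1 (0x11B), an assumed toy field.
example_sum = gf2m_add(0x57, 0x83)
example_product = gf2m_mul(0x57, 0x83, 0x11B, 8)
```

In both routines every intermediate value depends directly on the two operands, which is why an operation that combines a secret operand with a chosen operand is a natural DPA target.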

A suitable countermeasure based on the projective coordinate randomization technique [155]
is also proposed in [40]. However, no related work on pairing computations over
prime fields has been reported in the literature.



3.5 Conclusion

This chapter has presented a survey of diverse research activities related to the design
of cryptoprocessors for elliptic curve and pairing computations. Various fault
and side-channel attacks, and the existing countermeasures for the respective operations, have
also been discussed. With this background of related work, the next chapter presents
our work on an elliptic curve cryptoprocessor.


Chapter 4

Design and Analysis of Elliptic Curve

Cryptoprocessor

THIS CHAPTER PROPOSES an elliptic curve cryptoprocessor and analyzes its security
against physical attacks. Elliptic curve operations are based on the underlying
finite field arithmetic. Therefore, this work first designs a programmable arithmetic unit
for performing addition, subtraction, multiplication, inversion, and division in prime fields.
An elliptic curve cryptoprocessor for computing scalar multiplication is subsequently
designed for curves defined over prime fields. The proposed cryptoprocessor comprises
two identical cores of the programmable arithmetic unit. We explore a parallel scheduling for
computing elliptic curve scalar multiplication on the proposed dual-core cryptoprocessor. A
suitable technique is proposed and applied to the elliptic curve cryptoprocessor to make it
resistant to differential power analysis and doubling attacks. In summary, the proposed
cryptoprocessor is inherently programmable, memoryless, and resistant to timing and
power attacks. It efficiently optimizes the area × time per bit value for elliptic curve scalar
multiplication.

4.1 Introduction

Due to the increased demand for secure communication, it is important to speed up
public key cryptography (PKC). Application areas like mobile communication emphasize
that high-performance security architectures for large volumes of data are of utmost
importance. In order to provide security in these resource-constrained devices, PKC algorithms
should be implemented in a small area with high throughput. Elliptic curve cryptography
(ECC) [174, 178] is one of the best PKC algorithms as it provides high security at smaller
bit sizes than RSA [182]. In mobile applications, ECC is regarded as more suitable than
RSA-based public key schemes because it operates with higher throughput, lower power
consumption, and smaller area requirements. Nowadays, pairing-based cryptography,
which is an extension of ECC, is mostly used in identity-aware handheld devices.

Elliptic curve scalar multiplication (ECSM) is one of the most important operations
in elliptic curve as well as pairing-based cryptography. It relies on underlying finite field
primitives like multiplication and inversion. A lot of work has been reported on
techniques for speeding up ECSM [66, 117, 155, 176]. The speed-up techniques
are essentially threefold. First, high-speed algorithms and architectures for finite field
primitives like multiplication and inversion were invented. Second, the methods of scalar
multiplication were improved to reduce the number of underlying costly operations, such as
multiplicative inversion. Finally, the architectures were improved to incorporate more
parallelism in the computation of the scalar product.

In this chapter, we present an elliptic curve scalar multiplier (or elliptic curve
cryptoprocessor) exploiting the concept of shared arithmetic hardware and explore its security
against timing and power attacks. The contribution of the chapter is threefold.

• PGAU core. We propose a Programmable GF(p) Arithmetic Unit (PGAU) that
performs GF(p) addition, subtraction, multiplication, inversion, and division. The
modular operations are performed directly in the 2's complement number system. The
PGAU requires 18% less area than an integrated design in which
each arithmetic unit is a state-of-the-art stand-alone implementation. The PGAU
takes only 0.96 times the slice area but achieves a 2.67 times speedup compared to the
GF(p) ALU [106] with respect to a target operation AB/C (mod p).

• Elliptic curve cryptoprocessor. We observe that the area saved by the PGAU design
can be exploited by using multiple copies of it in an elliptic curve cryptoprocessor. We
speed up the elliptic curve scalar multiplication by using two cores of the
proposed programmable GF(p) arithmetic unit. The proposed cryptoprocessor is
implemented on a Xilinx Virtex-II Pro FPGA platform.

• Side-channel attacks. The programmable GF(p) arithmetic unit is designed in such
a way that it does not expose any timing or power attack vulnerability during
the execution of a finite field operation. A new point blinding technique is proposed
and applied to the proposed elliptic curve cryptoprocessor. Experimental results
are provided to demonstrate its security against timing, simple power analysis (SPA),
differential power analysis (DPA), and doubling attacks.

The outline of the present chapter is as follows: the chapter starts with a brief
description of the motivation and objective of the work, followed by the description of the
proposed programmable GF(p) arithmetic unit. It then elaborates the timing and power attack
resistant elliptic curve cryptoprocessor along with experimental results.


The hardware implementation of elliptic curve based cryptographic primitives is in
demand. More specifically, elliptic curve scalar multiplication must be implemented
in hardware to achieve higher throughput for those primitives. To achieve better
performance on dedicated hardware, multiple copies of similar hardware units
are in general integrated on a die and run in parallel. Parallelism in the elliptic curve scalar
multiplication algorithm is achieved by integrating multiple copies of GF(p) adder, subtractor,
multiplier, and inverter/divider units [58]. However, multiple units demand more hardware
area and consume more power. It is also observed that the utilization of individual
GF(p) arithmetic units is low in existing designs. For example, the utilization of the
parallel ECC hardware (PPU2) in [23] is 42.7% during ECSM computation. The current work
attempts to improve the utilization of hardware.

In order to improve the utilization of hardware area, we first study the architectures of
individual GF(p) arithmetic units, which are shown in [106] and [53]. We observe that there
is considerable scope for hardware optimization. The GF(p) arithmetic units are mainly based on
two's complement adders. As per the architectures shown in [53], the GF(p) adder, subtractor,
multiplier, and divider consist of 2, 2, 3, and 8 two's complement adders, respectively.
Instead of separate adder circuits for each arithmetic operation, we can keep only an
optimum number of such circuits and reuse them for performing all four GF(p) arithmetic
operations. It is also observed that other parts, such as internal registers and counters, can
be reused for computing GF(p) multiplication and division. Thus, we can share and
optimize the hardware for computing all four GF(p) operations. The objective is for this
sharing technique to improve hardware utilization.

We introduce a programmable architecture for computing all four aforementioned
finite field operations on a single unit, which we call the programmable GF(p) arithmetic unit
(PGAU). The unit can be reprogrammed to perform any of the four operations. The
proposed PGAU is applicable not only to ECC; it can also be used for developing
any finite field cryptographic primitive.

Another objective in developing such a programmable unit is to achieve a lower
area × time per bit¹ value while computing elliptic curve scalar multiplication. There is
minimal scope for parallelism within the elliptic curve point addition (ECA)
or elliptic curve point doubling (ECD) formulæ. The sequence of operations in an ECA
(or ECD) provides only limited scope for computing two different GF(p)
operations simultaneously. Here we mainly refer to GF(p) multiplication and division, which
are the most time-consuming operations. Thus, the performance of ECSM computation
can potentially be improved by computing independent ECA and
ECD operations in parallel within an iteration of the ECSM algorithm. In [23], this parallelism
is achieved by keeping two copies of the GF(p) multiplier and divider units, which demands
twenty-two two's complement adders, two mod-k counters, and two mod-2k counters, along with
complex control circuits for scheduling the modular operations in ECA and ECD.

We alleviate the complexity of the architecture by keeping two dedicated PGAUs for
ECA and ECD. The two PGAU cores run as parallel threads, so the design achieves higher
throughput because of the increased parallelism. However, the increase in area is smaller
than for a mere duplication of the GF(p) arithmetic units because of the
compactness of the PGAU, which helps to reduce the area × time per bit value.

The security of the proposed elliptic curve cryptoprocessor against timing and power
attacks is another major consideration in this work. We propose a technique for resisting
a special type of power attack known as the doubling attack (DA). The adopted algorithms
and design techniques make the proposed design secure against differential and
non-differential timing and power attacks. Extensive experimental results are shown
against those attacks to demonstrate the strength of the proposed design.

¹ The term time per bit indicates the bitrate, which implies that the term area × time per bit is nothing but
the area × bitrate value. A lower value of this parameter indicates better performance of a design.

4.3 Programmable GF(p) Arithmetic Unit (PGAU)

In a typical elliptic curve processor, as depicted in Fig. 4.1, there are dedicated
arithmetic units to perform the finite field operations, namely addition, subtraction,
multiplication, and division. In the figure, the arithmetic units are represented as Op1, Op2, Op3,
and Op4, respectively. The controller logic schedules the operations performed by the
arithmetic units Op1−4. The present work observes that the arithmetic units have
significant commonality and are hence amenable to hardware sharing. Thus, the elliptic curve
processor can alternatively be transformed into a combination of a common logic, the
individual unshared logic of each of Op1−4, a controller logic, and a logic for configuration,
as shown in Fig. 4.2.

[Figure 4.1 diagram: controller logic and blocks Op 1 to Op 4; ports in 1 to in 4 and out 1 to out 4.]

Figure 4.1: General structure of ECC processor with dedicated hardware units.

It may be observed that the common hardware logic of Op1−4 has been extracted and
integrated into one common logic module. The unshared portions are designed as Op1−4.


[Figure 4.2 diagram: unshared logic of Op 1 to Op 4, common logic, logic for configuration, and controller logic; ports op, in, and out.]

Figure 4.2: Structure of ECC processor with shared arithmetic unit.

Depending on the operation required (denoted by op), the controller logic uses the logic
for configuration module to program the unit to perform one of the four operations.
This programmable elliptic curve processor consumes less area than the dedicated
arithmetic units of a conventional implementation. However, the programmable
processor shown in Fig. 4.2 may have the following downsides.

• It essentially has the resources to compute only one finite field operation at a time.
Therefore, it prevents the computation of several finite field operations in parallel.

• Due to the extra control logic, place and route becomes more complex, which may
require additional design time. It may also demand a larger chip area and lengthen
the critical path of the design.

In this work, however, we use elliptic curves defined over a prime field, and the points on the
curve are represented in affine coordinates. In this point representation, there is very little
scope for parallel computation of prime field operations. Thus a common arithmetic core is
useful. The elliptic curve processor with the resource sharing technique shown in Fig. 4.2
can also be implemented hierarchically from smaller modules to reduce the place and route
complexity. Therefore, the degradation caused by the additional logic is negligible. The
main objective of this work is to optimize the area × time per bit value in ECSM computation.


4.3.1 Motivation of PGAU

The main motivation behind the programmable GF(p) arithmetic unit is to optimize the area
by exploiting the resource sharing technique. Although resource sharing prevents the
computation of several GF(p) operations in parallel (because they share the same resources),
this does not cause problems with the point representation we use in the thesis, because
those point formulæ do not offer much parallelism. Motivated by this fact, we aim to design
an optimized programmable unit for computing all the underlying finite field operations.

The objective of the work is to develop a programmable unit for GF(p) arithmetic using
fewer resources. We choose underlying algorithms for the different operations that require
fewer resources and share as many resources as possible. The common resources are then
extracted from each of the architectures and used to develop a programmable GF(p) arithmetic unit
(PGAU). We found that the bit-serial interleaved multiplication and binary inversion algorithms
in GF(p) are the most suitable in this respect, at the cost of a higher multiplication time.

4.3.2 Proposed Programmable GF(p) Arithmetic Unit

Figure 4.3 presents a top-level block diagram of the proposed programmable arithmetic
unit for performing GF(p) addition, subtraction, multiplication, inversion, and division.
It consists of four ⌈log2 p⌉-bit operand registers, namely u, v, x1, and x2, for holding the
intermediate results. The operation a [op] b (mod p) is done inside the data path module,
where op indicates the operation being performed by the unit, and the result appears at port
R. The control path decodes the instructions and generates appropriate signals to configure
the data path so that the operations are performed correctly. The proposed architecture
has five different instructions, which are encoded in a four-bit opcode; one bit more than the
minimum is used to keep the controller unit and the configuration logic simple. The
operations and corresponding opcodes (op) are described in Table 4.1.

Inside the architecture, the four opcode bits are referred to as D, M, A/S, and I/D. The opcode
bits D, M, and A/S are used to control data flow inside the data path block, whereas I/D
is used only to differentiate inversion from division by initializing the x1 register to 1 and
b, respectively (see Fig. 4.3).



[Figure 4.3 diagram: control path, operand registers, MUX, and data path (common cores, unshared logic, configuration); ports a, b, p, op, clock, reset, and R.]

Figure 4.3: Programmable GF(p) adder, subtractor, multiplier, and divider unit.

Table 4.1: Different opcodes for PGAU.

Operation               D   M   A/S   I/D
GF(p) Addition          0   0   0     X
GF(p) Subtraction       0   0   1     X
GF(p) Multiplication    0   1   0     X
GF(p) Inversion         1   0   1     0
GF(p) Division          1   0   1     1
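
The opcode encoding of Table 4.1 can be summarised in a short software model. The sketch below is written in Python purely for illustration; the names `Opcode`, `OPCODES`, `decode`, and `initial_x1` are hypothetical and do not correspond to signals or modules of the actual design. It only restates the table and the rule that I/D selects the initial value of x1.

```python
from typing import NamedTuple, Optional


class Opcode(NamedTuple):
    D: int
    M: int
    AS: int                 # the A/S bit
    ID: Optional[int]       # the I/D bit; None models the don't-care entry X


# Direct transcription of Table 4.1.
OPCODES = {
    "add": Opcode(0, 0, 0, None),
    "sub": Opcode(0, 0, 1, None),
    "mul": Opcode(0, 1, 0, None),
    "inv": Opcode(1, 0, 1, 0),
    "div": Opcode(1, 0, 1, 1),
}


def decode(d: int, m: int, a_s: int, i_d: int) -> str:
    """Return the operation name selected by the four opcode bits."""
    for name, op in OPCODES.items():
        if (op.D, op.M, op.AS) == (d, m, a_s) and op.ID in (None, i_d):
            return name
    raise ValueError("bit pattern is not listed in Table 4.1")


def initial_x1(i_d: int, b: int) -> int:
    """x1 starts at 1 for inversion (I/D = 0) and at b for division (I/D = 1)."""
    return b if i_d else 1
```

Since I/D is a don't-care for the first three operations, `decode(0, 1, 0, 0)` and `decode(0, 1, 0, 1)` both select multiplication in this model.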

4.3.3 Programmable Data Path Block

Figure 4.4 depicts a broad view of the data path block of the proposed programmable
GF(p) arithmetic unit. Considering the hardware cost, parallelism, and efficiency of
computing GF(p) multiplication and division, we divide the data path block into three major
sub-blocks, namely DP1, DP2, and DP3. We use the sub-blocks for computing GF(p)
operations in the following way.

• GF(p) addition/subtraction. GF(p) addition and subtraction are performed by the
DP3 module, which takes inputs from the a, b, and p ports. It computes a ± b (mod p)
and produces the output at a′′, which then comes out through port R.

[Figure 4.4 diagram: sub-blocks DP1, DP2, and DP3 (common cores and unshared logic) with configuration muxes; inputs a, b, p, bi, and registers u, v, x1, x2; control signals D, M, A/S, and ER; intermediate values 2x, a.bi, u′, v′, x1′, x2′, u′′, v′′, x1′′, x2′′, a′′; outputs R, un, vn, x1n, x2n.]

Figure 4.4: Data path block.

• GF(p) multiplication. The proposed design computes GF(p) multiplication using the
bit-serial interleaved multiplication algorithm, which is described in chapter 2. Each
iteration of this multiplication algorithm (Algo. 2.1) consists of two steps, GF(p)
doubling and GF(p) addition. DP1 and DP3 are used to compute those two steps
in a single clock cycle. Operand register u (Fig. 4.3) accumulates the
intermediate result. At the k-th iteration, where k = ⌈log2 p⌉, the final result appears
at the a′′ port of DP3 and goes out through port R. Hence the multiplication
latency of the proposed design is k clock cycles.

• GF(p) inversion/division. The proposed design computes GF(p) inversion as well
as division using the binary inversion/division algorithm, which is described in Algo. 2.2,
section 2.1.3 of chapter 2. One iteration of this algorithm consists of three steps:
step–1 comprises the operations in steps 2.2.1 and 2.2.2, step–2 the operations
in steps 2.5.1 and 2.5.2, and step–3 the operations in steps 2.7 and 2.8 of the
algorithm. For a clear understanding of our design, the aforementioned steps of Algo. 2.2
are restated below.

Step–1:

2.2.1. u ← u/2

2.2.2. if x1 is even then x1 ← x1/2 else x1 ← (x1 + p)/2

Step–2:

2.5.1. v ← v/2

2.5.2. if x2 is even then x2 ← x2/2 else x2 ← (x2 + p)/2

Step–3:

2.7. if u ≥ v then u ← u − v, x1 ← x1 − x2

2.8. else v ← v − u, x2 ← x2 − x1

The first two steps perform similar operations on different inputs, {u, x1} and {v, x2},
whereas step–3 operates on the inputs {u, v, x1, x2}. The three data path units DP1,
DP2, and DP3 perform the aforementioned three steps, respectively. The
updated values of u, v, x1, and x2 after an iteration come out at un, vn, x1n, and x2n,
respectively. These intermediate results are then stored in the respective
registers u, v, x1, and x2. The multiplexing of intermediate results as per Algo. 2.2 is
based on the bit values u0 and v0, which indicate whether the current values of u and v
are odd. If either u or v is even, the intermediate results come from DP1
(Step–1) and DP2 (Step–2); otherwise, they come from DP3 (Step–3). All
three data path sub-blocks run in parallel for computing inversion as well as division,
and they compute one iteration of the algorithm in a single clock cycle. At every
clock cycle either u or v is reduced by one bit. Hence, the inversion/division
latency of the PGAU is at most 2k clock cycles.
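
The two multi-cycle operations described above can be modelled in software as follows. This is a minimal functional sketch in Python, assuming the usual formulations of interleaved multiplication and binary inversion/division. Algo. 2.1 and Algo. 2.2 are not reproduced in this section, so the register initialisation (u ← divisor, v ← p, x1 ← numerator or 1, x2 ← 0), the termination test, and all function and parameter names are assumptions of the sketch. It models only the arithmetic and does not attempt to reproduce the clock-cycle count of the hardware.

```python
def interleaved_mul(a: int, b: int, p: int) -> int:
    """Bit-serial interleaved multiplication a*b mod p, for 0 <= a, b < p."""
    k = p.bit_length()
    u = 0
    for i in reversed(range(k)):        # scan b from its most significant bit
        u = (2 * u) % p                 # GF(p) doubling (the role of DP1)
        if (b >> i) & 1:
            u = (u + a) % p             # GF(p) addition of a*b_i (the role of DP3)
    return u


def binary_div(numerator: int, divisor: int, p: int) -> int:
    """Return numerator / divisor mod p for an odd prime p and 0 < divisor < p.

    With numerator = 1 this is a plain inversion.
    """
    u, v = divisor, p
    x1, x2 = numerator % p, 0
    while u != 1 and v != 1:
        if u % 2 == 0 or v % 2 == 0:
            if u % 2 == 0:              # Step-1: halve u and x1
                u //= 2
                x1 = x1 // 2 if x1 % 2 == 0 else (x1 + p) // 2
            if v % 2 == 0:              # Step-2: halve v and x2
                v //= 2
                x2 = x2 // 2 if x2 % 2 == 0 else (x2 + p) // 2
        elif u >= v:                    # Step-3, case u >= v
            u, x1 = u - v, (x1 - x2) % p
        else:                           # Step-3, case u < v
            v, x2 = v - u, (x2 - x1) % p
    return x1 if u == 1 else x2


# Small worked example: 3^{-1} mod 7.
inverse_of_3_mod_7 = binary_div(1, 3, 7)
```

For the small example, the loop runs with (u, v, x1, x2) = (3, 7, 1, 0) → (3, 4, 1, 6) → (3, 2, 1, 3) → (3, 1, 1, 5) and returns x2 = 5, and indeed 3 · 5 = 15 ≡ 1 (mod 7).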

The common cores of the DP3 module are identified as the common operator logic for all
five GF(p) operations. The DP2 module and the unshared logic of the DP3 module are used only
for inversion and division. The common cores of the DP1 module are used for
multiplication, inversion, and division. The muxes are considered as the logic for configuration.
The architectures and functionalities of DP1, DP2, and DP3 are described in the following
paragraphs. The control signal ER (Fig. 4.4) is a 2-bit signal, say ER1 and ER2, with
ER1 = D ∨ M and ER2 = D ∧ T, where T is a temporary one-bit signal generated by the
following logic: if v = 1 then T = 1 else T = 0. Here ∨ and ∧ stand for Boolean OR and
AND operations. Let us now describe the functionality of each of the modules in the data
path of our proposed programmable GF(p) arithmetic unit.

Module DP1. Figure 4.5 shows the DP1 module. It consists of one ⌈log2 p⌉-bit binary
adder/subtractor and a set of multiplexers. For GF(p) multiplication (D = 0), DP1
computes 2u (mod p) and passes the result to the port 2x, whereas for GF(p) inversion and
division (D = 1) it computes u/2, and x1/2 if x1 is even or (x1 + p)/2 otherwise.
Finally, if u is even then DP1 passes the computed results to u′ and x1′; otherwise it passes
the unchanged values of u and x1 to u′ and x1′, respectively.

[Figure 4.5 diagram: shifter, adder, and muxes; inputs u, x1, p; select bits D, u0, x1 0, and u k−1; intermediate values 2u, u/2, x1/2, p + x1, (p + x1)/2; outputs 2x, u′, x1′.]

Figure 4.5: DP1 block in the data path.

Module DP2. Figure 4.6 shows the DP2 module, which is used only for inversion and
division. This module computes v/2, and x2/2 if x2 is even or (x2 + p)/2 otherwise. Finally, if
v is even then DP2 passes the updated values to v′ and x2′; otherwise it passes the old values
of v and x2 to v′ and x2′, respectively.

[Figure 4.6: DP2 module. Recoverable labels: shifter, adder, muxes; inputs v, x2, p; select bits D, v0, x2 0; intermediate values v/2, x2/2, p + x2, (p + x2)/2; outputs v′, x2′.]

[Figure 4.7: DP3 module. Recoverable labels: adder, GF(p) A/S unit, input and output muxes; inputs u, v, x1, x2, p, 2x, a, a.bi, b; control signals A/S, S, Y; subtractor output uov; outputs u′′, v′′, x1′′, x2′′, a′′.]

Module DP3. Figure 4.7 shows the detailed architecture of the DP3 module. It comprises
one GF(p) addition/subtraction (A/S) unit, input and output multiplexers, and some
additional circuitry used in inversion and division only. The GF(p) A/S unit is used for
computing all five GF(p) arithmetic operations. DP3 computes a ± b (mod p) if D = 0
and M = 0. The result of GF(p) addition or subtraction goes to the output port R through
a′′. In the case of GF(p) multiplication (D = 0, M = 1), DP3 computes 2x + a.bi (mod p).
In each iteration, the intermediate result of the multiplication is stored back in register u, while at
the final iteration the value of a′′ passes through output port R. DP3 is fully utilized for
computing GF(p) inversion and division (D = 1). For D = 1, if u ≥ v then DP3 performs
u′′ = u − v and x1′′ = x1 − x2 (mod p); otherwise it performs v′′ = v − u and x2′′ = x2 − x1 (mod
p). Thus the result of the subtractor unit (uov) is multiplexed to either u′′ or v′′, and the
result of the GF(p) A/S unit (a′′) is multiplexed to either x1′′ or x2′′. The inputs x and y of the
GF(p) A/S unit are assigned by a pair of 4×1 multiplexers that are controlled
by the select line S. The variables are in accordance with the description of Algo. 2.2. The
control signals Y and S (S1S0) are generated in the following way:

Y = 1 if u ≥ v, and Y = 0 otherwise;

S0 = MD + DY,   S1 = D,

where D and M indicate the opcode bits.

Fig. 4.8 depicts the programmable GF(p) adder/subtractor (A/S) unit. In order to
achieve a programmable modular adder/subtractor unit, we need to control the input data
of two binary adder circuits. The control signal A/S configures the circuit for GF(p)
addition or subtraction. If A/S = 0 then the unit performs x + y (mod p); otherwise
it performs x − y (mod p). Therefore, A/S is one in the inversion and division opcodes
and zero in the multiplication opcode (see Table 4.1), as addition is needed for
multiplication whereas subtraction is needed for inversion/division. The responsibilities of the
three data path sub-blocks are summarized in Table 4.2.
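
The behaviour of such a two-adder modular A/S unit can be sketched in Python as below. The sketch assumes one plausible arrangement, in which the first adder forms x ± y and the second adder applies the correction ∓p, with a final multiplexer choosing between the two results; the exact wiring of Fig. 4.8 is not recoverable from the text, and the function name `gfp_add_sub` is hypothetical.

```python
def gfp_add_sub(x: int, y: int, p: int, a_s: int) -> int:
    """Model of a modular adder/subtractor for 0 <= x, y < p.

    a_s = 0 selects x + y (mod p); a_s = 1 selects x - y (mod p).
    """
    if a_s == 0:
        first = x + y                 # first adder: plain sum
        second = first - p            # second adder: trial subtraction of p
        # final mux: keep the reduced value when the sum reached p
        return second if second >= 0 else first
    first = x - y                     # first adder: two's complement difference
    second = first + p                # second adder: correction by +p
    # final mux: use the corrected value when the difference went negative
    return second if first < 0 else first
```

Because both candidate results are always computed and only a multiplexer selects between them, the model performs the same sequence of additions regardless of the operand values.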

The programmable GF(p) arithmetic unit is also programmable in the sense that it
supports all primes of at most the given lengths (192, 224, 256 bits).


[Figure 4.8 diagram: two binary adders (ports a, b, c in, c out, s) with input and output muxes; inputs x, y, p; control signal A/S; output x + y, x − y (mod p).]

Figure 4.8: GF(p) adder and subtractor (A/S) unit.

Table 4.2: Major operations in sub-blocks of PGAU.

GF(p) Operation         DP1                      DP2                      DP3
Addition/Subtraction    −                        −                        a ± b
Multiplication          2u                       −                        2x + a.bi
Inversion/Division      u/2, x1/2, (x1 + p)/2    v/2, x2/2, (x2 + p)/2    ±(a − b), ±(x1 − x2)

4.3.4 Hardware Cost and Performance

The proposed programmable arithmetic unit takes only one clock cycle to compute a GF(p)
addition or subtraction. A GF(p) multiplication takes
⌈log2 p⌉ clock cycles, and a GF(p) inversion or division takes at most 2⌈log2 p⌉
clock cycles. Table 4.3 shows the resources required for implementing the
proposed programmable GF(p) arithmetic unit on a Virtex-II FPGA. The number of clock cycles
and the time required to compute GF(p) addition/subtraction (A/S), multiplication (M), and
inversion/division (I/D) are listed in Table 4.4.
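
The relation between cycle counts and execution times is simply time = cycles / frequency. As a small worked example, the Python sketch below evaluates it for the 192-bit configuration, using the 40 MHz clock frequency reported in Table 4.3; the function name `op_time_us` and the dictionary layout are only illustrative.

```python
def op_time_us(cycles: int, f_mhz: float) -> float:
    """Execution time in microseconds for a given cycle count and clock in MHz."""
    return cycles / f_mhz


k = 192          # operand length in bits
f_mhz = 40.0     # clock frequency of the 192-bit PGAU, taken from Table 4.3

times_us = {
    "addition/subtraction": op_time_us(1, f_mhz),        # 1 cycle
    "multiplication": op_time_us(k, f_mhz),              # k cycles
    "inversion/division": op_time_us(2 * k, f_mhz),      # at most 2k cycles
}
```

With these inputs the three entries evaluate to 0.025 µs, 4.8 µs, and 9.6 µs, which are the k = 192 values listed in Table 4.4.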

Table 4.3: Implementation result of PGAU on Virtex-II FPGA.

Prime p (bits)    Slice    LUT      Equivalent gate    Frequency (MHz)
192               3 985    7 328    60.9k              40
224               4 657    8 547    71.0k              37
256               5 379    9 821    81.5k              34

Table 4.4: Performance of PGAU on Virtex-II FPGA (k = ⌈log2 p⌉).

GF(p) operation         #Clock    Time (µs)
                                  k = 192    k = 224    k = 256
Addition/Subtraction    1         0.025      0.027      0.029
Multiplication          k         4.800      6.000      7.300
Inversion/Division      2k        9.600      12.100     14.600

Table 4.5 gives a comparative study of our proposed PGAU and stand-alone implementations
of the different GF(p) arithmetic units. In the table, we refer to a conventional processor with
dedicated GF(p) adder, subtractor, multiplier, and divider units as the integrated processing
unit (IPU). The IPU design we consider is similar to the parallel GF(p) arithmetic unit
shown in [53]. The work described in [53] provides the design architectures for each of
the individual arithmetic operations in prime fields. However, it does not provide the cost
of implementing the individual units. Thus, for a proper comparison, we implement each of the
GF(p) arithmetic units based on the architectures provided in [53]. Table 4.5 shows that
the area requirement of our PGAU is only about 82% of that of the IPU.

Table 4.5: Comparative hardware costs in LUT on Virtex-II FPGA.

Prime p (bits)    Add    Sub    Mult     Inv/Div    IPU       PGAU     PGAU/IPU
192               577    577    1 145    6 616      8 915     7 328    0.82
224               673    673    1 329    7 731      10 406    8 547    0.82
256               769    769    1 508    9 146      12 192    9 821    0.81

Add, Sub, Mult, and Inv/Div indicate stand-alone adder, subtractor, multiplier, and
inverter/divider units, respectively.

Comparison with GF(p) ALU [106]. A GF(p) ALU for an encryption processor is presented
by Daly et al. [106]. It performs modular multiplication and inversion in the Montgomery
domain. This ALU is a carry-propagate-adder based architecture consisting of 3 (k+2)-bit
adder units. It computes the multi-cycle modular operations
like multiplication and inversion by executing repeated addition/subtraction operations on
the respective data sets. The paper does not describe the control unit; we
may assume that the control signals are generated by another module outside the ALU.
Such a control unit adds overhead to the ALU of [106] for performing the
respective modular operations. However, [106] also reports the implementation results of
the architecture for computing the AB/C (mod p) operation.

The major differences between the proposed PGAU and the ALU proposed in [106] with
respect to the adopted design strategies are shown in Table 4.6. The proposed PGAU performs
all GF(p) operations in the two's complement binary number domain, whereas the GF(p)
ALU of [106] computes multiplication and inversion on Montgomery domain numbers
and therefore needs input and output conversions in elliptic curve based
applications. Another major advantage of the PGAU design is that it performs GF(p) division
directly in 2k clock cycles, while the ALU of [106] needs 3k clock cycles
because it executes an inversion followed by a multiplication.

Table 4.6: Difference between PGAU and GF(p) ALU [106].

                 GF(p) ALU [106]                         PGAU
Number Domain    Montgomery                              two's Complement
Mult. Algo.      Montgomery                              Interleaved
Inv. Algo.       Montgomery                              Binary inversion
Div. Algo.       Inversion followed by multiplication    Binary inversion/division

The implementation result of our 192-bit PGAU is compared with the GF(p) ALU [106] of the same length in Table 4.7. Both designs are implemented on a Virtex II FPGA platform. In the table, CC indicates the number of clock cycles required to compute the respective operation on the respective design, and T indicates the respective time in µs. The 192-bit implementation of [106] consumes 4135 slices and operates at a maximum clock frequency of 19 MHz, whereas the proposed PGAU of the same bit length consumes 3985 slices and runs at a maximum clock frequency of 43 MHz. Due to the direct division, the PGAU saves k clock cycles for each division operation, which is a major operation in elliptic curve point operations in affine coordinates. The small area gain in our design is due to the step-by-step optimized architecture for all operations.

Table 4.7: Performance comparison between PGAU and GF(p) ALU [106].

Design       Slice   Multiplication (A·B)   Inversion (A^−1)   Division (A/B)   AB/C
(192-bit)            CC, T                  CC, T              CC, T            CC, T
Proposed     3985    k, 4.8                 2k, 9.6            2k, 9.6          3k, 15.4
Daly [106]   4135    k, 10.1                2k, 20.2           3k, 30.3         4k, 40.4

− CC: number of clock cycles. T: time in µs.

4.4 Elliptic Curve Cryptoprocessor Resistant to Timing and Power Attacks

The proposed elliptic curve cryptoprocessor employs the above programmable GF(p) arithmetic unit. The whole architecture has inherent programmability: the prime p can be changed without reconfiguring the FPGA. The architecture is designed for computing the elliptic curve operations in affine coordinates.

In affine coordinates, point addition (ECA) consists of two multiplications and one division, and point doubling (ECD) consists of three multiplications and one division in GF(p). The elliptic curve scalar multiplication (ECSM) is performed by executing a number of ECA and ECD operations. In general, we may consider that the integer d in the dP operation contains 0.5⌈log2 d⌉ ones. Thereby, using the binary algorithm, one ECSM can be computed by ⌈log2 d⌉ ECD and 0.5⌈log2 d⌉ ECA operations [155].

However, Okeya et al. [156] pointed out that the aforementioned imbalanced ECSM computation procedure is vulnerable to non-differential side-channel attacks. The Montgomery ladder, which is described in Algo. 2.4, is balanced: it computes both ECD and ECA at every iteration irrespective of the bit value d_i. It therefore executes about 30% additional operations compared with the binary algorithm in order to defend against the above attacks.

It is shown in [162] that the balanced ECSM algorithms are secure against non-differential side-channel attacks, but they are vulnerable to differential side-channel attacks. Differential Power Analysis (DPA) is one of the most popular differential side-channel attacks. Coron [162] described the DPA on a balanced ECSM algorithm other than the Montgomery ladder. However, the operands (curve points) occurring at every iteration of the Montgomery ladder are deterministic. For example, if d_{k−2} = 1 then the operand of the ECD operation is 2P; otherwise it is P. It is a known fact that the power consumption of an ECD operation on 2P differs from that on P. Thus, there is a correlation between the secret bit d_{k−2} and the power consumption at the first iteration. In general, there are correlations between d_i and the power consumption at the respective iteration number k−i−2 for k−2 ≥ i ≥ 0. The DPA is based on such correlations. Hence, Algo. 2.4 is vulnerable to the DPA attack.

There are several ways to protect the secret in the ECSM operation against the DPA attack. The point blinding technique proposed by Coron [162], applied to the Montgomery ladder, can defend against the DPA attack. It works in the following way. The point P to be multiplied is blinded by adding a secret random point R for which S = dR is known. Scalar multiplication is done by computing the point d(R+P) and subtracting S to get Q = dP. Let us consider a user who executes a set of dP_i operations in a session. The input points R and S are private to the user and are provided in the same way as d for every new session. The random points R and S are refreshed at each new execution by computing R ← (−1)^b 2R and S ← (−1)^b 2S with a random bit b. This makes the DPA attack infeasible, since the point P′ = P+R to be multiplied by d is not known to the attacker.

A new kind of power attack, known as the doubling attack (DA) [118], breaks double-and-add-always algorithms with only two queries. It exploits the power consumption profiles of the elliptic curve device while it computes dP and d(2P). This attack also breaks the above DPA resistant point blinding technique. The only difference is that during the second query with 2P the device executes 2P+2R with probability 0.5. The attack is extended to break the Montgomery ladder in [81]. Therefore, the combination of the above two ideas can break the DPA resistant Montgomery ladder. The computation of the Montgomery ladder (Algo. 2.4) with the above point blinding technique for two input points P and 2P is shown in Table 4.8. Let us assume that d = 11001011 and M = P+R.

In the table, it is observed that if d_{i−1} = d_i then the same doubling operation is executed in the iteration that processes d_{i−1} of the d(P+R) computation and in the iteration that processes d_i of the d(2(P+R)) computation. Thus the power consumption profiles for any such i reveal whether d_{i−1} is the same as d_i, i.e., whether both of them are 0 or both are 1. Starting from the first iteration (MSB of d), we can therefore find all bits of the secret exponent d using the above doubling attack.

Table 4.8: The dM and d(2M) in the Montgomery ladder.

i        d_i   Process of dM         Process of d(2M)
7        1     Q1 = 1+M              Q1 = 1+2M
               Q2 = 2(M)             Q2 = 2(2M)
6        1     Q1 = M+2M             Q1 = 2M+4M
               Q2 = 2(2M)            Q2 = 2(4M)
5        0     Q2 = 3M+4M            Q2 = 6M+8M
               Q1 = 2(3M)            Q1 = 2(6M)
4        0     Q2 = 6M+7M            Q2 = 12M+14M
               Q1 = 2(6M)            Q1 = 2(12M)
3        1     Q1 = 12M+13M          Q1 = 24M+26M
               Q2 = 2(13M)           Q2 = 2(26M)
2        0     Q2 = 25M+26M          Q2 = 50M+52M
               Q1 = 2(25M)           Q1 = 2(50M)
1        1     Q1 = 50M+51M          Q1 = 100M+102M
               Q2 = 2(51M)           Q2 = 2(102M)
0        1     Q1 = 101M+102M        Q1 = 202M+204M
               Q2 = 2(102M)          Q2 = 2(204M)
Return         Q1 = 203M             Q1 = 406M
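The bit-recovery step of the doubling attack can be illustrated with a small Python sketch. It is only a model under stated assumptions: curve points are replaced by their integer multiple of M, and "the two power traces look alike" is replaced by an equality test on the doubled multiple. The function names ladder_doublings and recover_bits are hypothetical and not part of the thesis.

```python
def ladder_doublings(d_bits, base):
    """Return (multiples of M that get doubled, final Q1) for the ladder.

    Points are modelled by their integer multiple of M; this stand-in is an
    assumption for illustration, a real device works on curve points.
    """
    q1, q2 = base, 2 * base          # initial state: Q1 = base*M, Q2 = 2*Q1
    doubled = [base]                 # the initial doubling Q2 = 2*Q1
    for bit in d_bits[1:]:           # bits d_{k-2} ... d_0
        if bit == 1:
            doubled.append(q2)       # ECD input is Q2
            q1, q2 = q1 + q2, 2 * q2
        else:
            doubled.append(q1)       # ECD input is Q1
            q1, q2 = 2 * q1, q1 + q2
    return doubled, q1


def recover_bits(doubled_first, doubled_second):
    """Rebuild d from which doublings coincide between the two queries."""
    bits = [1]                       # the MSB d_{k-1} is 1
    for j in range(len(doubled_first) - 1):
        # A coinciding doubling means the next bit equals the previous one.
        same = doubled_first[j + 1] == doubled_second[j]
        bits.append(bits[-1] if same else 1 - bits[-1])
    return bits


d_bits = [1, 1, 0, 0, 1, 0, 1, 1]                    # d = 11001011 = 203
first, result_m = ladder_doublings(d_bits, 1)        # query processed on M
second, result_2m = ladder_doublings(d_bits, 2)      # query processed on 2M
recovered = recover_bits(first, second)
print(first, second, recovered)
```

For d = 11001011 the doubled multiples reproduce the ECD rows of Table 4.8, and comparing adjacent iterations of the two queries returns every bit of d.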

4.4.1 Modified Montgomery Ladder Against DPA and DA

The DPA and DA resistant Montgomery ladder is sketched in Algo. 4.4.1. We propose a modification of Coron's point blinding technique to defend against the doubling attack (DA). The points R and S are refreshed in the following way:

R = (−1)^b 3R,

S = (−1)^b 3S.

The modified technique remains secure against the DPA attack, as the rest of the conditions are unchanged. It is also secure against the doubling attack because in the second query it adds ±3R to the input point. Therefore, for the two queries of the DA it effectively computes d(P+R) and d(2P±3R), which do not execute any similar operation during the ith and (i−1)th iterations, respectively. The algorithm computes dP correctly because it evaluates d(P±3R) − (±3S), where S = dR.

Algorithm 4.4.1. DPA and DA resistant Montgomery ladder.
Input: An integer d ≥ 1 and points P, R, S such that S = dR.
Output: dP.
1. Q1 ← P and Q2 ← R
2. Q1 ← Q1 + Q2
3. Q2 ← 2Q1
4. for i from k−2 down to 0 do
     if d_i = 1 then
       Q1 ← Q1 + Q2 and Q2 ← 2Q2
     else
       Q2 ← Q1 + Q2 and Q1 ← 2Q1
5. b ← ∑_{j=0}^{k−1} Q1X_j (mod 2),  // Q1X is the x-coordinate of Q1
6. Q1 ← Q1 − S
7. R ← (−1)^b 3R, S ← (−1)^b 3S
8. return Q1

It is described in [108] that the ECSM operation generates a good random pattern. The security of this random generator depends on the input point P, which is private. In our case, however, the input P is public. In the proposed technique, the input point P is added to a private point R before starting the ECSM operation. The private R yields a private input (P+R) to the ECSM operation. Thus, our modified Montgomery ladder works in the same way as [108]. We take the bitwise XOR of all bits of the x-coordinate of the point Q1 = d(P+R) as a random bit b. The bit b is indeed random because the value b = ⊕_{i=1}^{k} x_i, where x_1, · · · , x_k denote the bits of a random number x, must be hard to compute if f (ECSM in our case) cannot be inverted (see § 6.1.3, [51] and § 5.9.1, [129]). The bit b is used to refresh the random point R and the corresponding S for the next execution; these operations are performed at steps 5 and 7 of Algo. 4.4.1.
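A minimal Python sketch of Algo. 4.4.1 may help to check the data flow. It is not the hardware implementation. It assumes a toy textbook curve y^2 = x^3 + 2x + 2 over GF(17) with generator (5, 1), a group of order 19, chosen only for illustration and far too small for real use. The helper names add, neg, mul, and blinded_ladder are hypothetical, and the modular inverse uses Python's built-in pow with exponent −1 (Python 3.8 or later).

```python
P_MOD, A = 17, 2     # toy curve y^2 = x^3 + 2x + 2 over GF(17); the group has order 19
G = (5, 1)           # generator of that group


def add(p, q):
    """Affine point addition; None stands for the point at infinity."""
    if p is None:
        return q
    if q is None:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                                   # p + (-p)
    if p == q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1 % P_MOD, -1, P_MOD)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    x3 = (lam * lam - x1 - x2) % P_MOD
    return x3, (lam * (x1 - x3) - y1) % P_MOD


def neg(p):
    return None if p is None else (p[0], -p[1] % P_MOD)


def mul(k, p):
    """Plain double-and-add, used here only as a reference and for 3R, 3S."""
    acc = None
    for bit in bin(k)[2:]:
        acc = add(acc, acc)
        if bit == "1":
            acc = add(acc, p)
    return acc


def blinded_ladder(d, p, r, s):
    """Sketch of Algo. 4.4.1; returns (dP, refreshed R, refreshed S)."""
    q1 = add(p, r)                                    # steps 1-2: Q1 = P + R
    q2 = add(q1, q1)                                  # step 3:    Q2 = 2*Q1
    for bit in bin(d)[3:]:                            # step 4: bits d_{k-2} ... d_0
        if bit == "1":
            q1, q2 = add(q1, q2), add(q2, q2)
        else:
            q1, q2 = add(q1, q1), add(q1, q2)
    x = 0 if q1 is None else q1[0]
    b = bin(x).count("1") % 2                         # step 5: XOR of the bits of Q1's x-coordinate
    q1 = add(q1, neg(s))                              # step 6: Q1 = Q1 - S
    r3, s3 = mul(3, r), mul(3, s)                     # step 7: refresh R and S
    if b:
        r3, s3 = neg(r3), neg(s3)
    return q1, r3, s3


d = 11
P, R = mul(3, G), mul(7, G)
S = mul(d, R)                                         # precondition S = dR
Q, R_next, S_next = blinded_ladder(d, P, R, S)
print(Q == mul(d, P), S_next == mul(d, R_next))
```

The refreshed pair keeps the invariant S = dR, so it can be fed directly into the next execution with a new public point P.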

4.4.2 The ECSM on Single PGAU-core

The DPA resistant ECSM computation described in Algo. 4.4.1 performs both ECD and ECA at every iteration. Table 4.9 shows the finite field operations required to perform ECD and ECA.

Table 4.9: Parallelism of ECD and ECA of step 4 of Algo. 4.4.1 in two GF(p) arithmetic cores.

ECD (2Q(d_i+1)) on PGAU1                                   ECA (Q1+Q2) on PGAU2
Clocks   GF(p) Operation   RTL                             Clocks   GF(p) Operation   RTL
1        M                 RA ← RQ(d_i+1)x × RQ(d_i+1)x    1        S                 RD ← RQ2y − RQ1y
k+2      A                 RB ← RA + RA                    2        S                 RE ← RQ2x − RQ1x
k+3      A                 RA ← RB + RA                    3        D                 RD ← RD / RE
k+4      A                 RA ← RA + RCP §                 2k+4     M                 RE ← RD × RD
k+5      A                 RB ← RQ(d_i+1)y + RQ(d_i+1)y    3k+5     A                 RF ← RQ1x + RQ2x
k+6      D                 RA ← RA / RB                    3k+6     S                 RE ← RE − RF
3k+7     M                 RB ← RA × RA                    3k+7     S                 RF ← RQ1x − RE
4k+8     A                 RC ← RQ(d_i+1)x + RQ(d_i+1)x    3k+8     M                 RF ← RD × RF
4k+9     S                 RB ← RB − RC                    4k+9     S                 RF ← RF − RQ1y
4k+10    S                 RC ← RQ(d_i+1)x − RB
4k+11    M                 RC ← RA × RC
5k+12    S                 RC ← RC − RQ(d_i+1)y
Result: (RB, RC) = 2Q(d_i+1), (RE, RF) = Q1 + Q2

− RTL stands for Register Transfer Logic.
− § Register RCP contains the curve parameter a.
− RQ1x, RQ1y, RQ2x, and RQ2y contain the values of the x and y coordinates of Q1 and Q2, respectively.

On a single PGAU core implementation, we perform ECA and ECD sequentially: an ECA followed by an ECD at every iteration. The computations of ECA and ECD take 4k+10 and 5k+13 clock cycles, respectively. Thus, one iteration of step 4 of Algo. 4.4.1 is computed in (9k+23) clock cycles by the single PGAU core based ECSM cryptoprocessor, where k = ⌈log2 p⌉ = ⌈log2 d⌉ (refer Table 4.9). Hence, the number of clock cycles (T_S) required to perform the ECSM operation on the single PGAU core is calculated as:

T_S = (k−1)(9k+23) + 3(5k+13) + 4(4k+10)
    = 9k^2 + 45k + 56,    (4.1)

where 3(5k+13) and 4(4k+10) clock cycles are required to execute steps 2, 3, 6, and 7 of the algorithm.
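For reference, the expansion behind Equation 4.1 can be written out term by term. The attribution of the fixed overhead to individual steps is our reading of Algo. 4.4.1 (one ECA in step 2, one ECD in step 3, one ECA in step 6, and one ECD plus one ECA for each of 3R and 3S in step 7), which gives three ECD and four ECA operations in total.

```latex
\begin{align*}
T_S &= (k-1)(9k+23) + 3(5k+13) + 4(4k+10) \\
    &= (9k^2 + 14k - 23) + (15k + 39) + (16k + 40) \\
    &= 9k^2 + 45k + 56 .
\end{align*}
```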

4.4.3 The ECSM on Dual PGAU-core

Parallelism can be exploited to achieve faster computation of elliptic curve scalar multiplication. More than one PGAU can be incorporated for this purpose.

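[Figure 4.9 (block diagram): PGAU 1 and PGAU 2, each with inputs x, y, p and outputs Ro1 and Ro2; input multiplexers; register blocks A and B holding RP, RCP, RA, RB, RC, RD, RE, RF, RQ1x, RQ1y, RQ2x, RQ2y, RRx, RRy, RSx, RSy; controller logic with inputs d, clock, and reset.]

Figure 4.9: Programmable dual-PGAU-core ECSM unit.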

The proposed design is based on a k-bit two's complement parallel adder circuit and does not use any combinational multiplier; hence it has no scope for horizontal parallelism [58]. We instead try to maximize the parallelism of GF(p) operations, i.e., we use vertical parallelism. Fig. 4.9 depicts our proposed dual-core GF(p) elliptic curve scalar multiplication (ECSM) unit. It comprises two programmable GF(p) arithmetic units (PGAUs), on which we compute the finite field operations of ECD and ECA concurrently. We assign PGAU1 to ECD and PGAU2 to ECA. A state machine based controller unit is implemented for sequencing the operations listed in Table 4.9. It is responsible for loading large operands into the respective registers through a 32-bit data port, generating opcodes for both PGAUs, selecting operand values, accumulating the ECD and ECA results at every iteration of the ECSM algorithm, updating the random pair (R,S) for the next execution, and passing the resultant point Q = dP through the 32-bit output port. There are two overlapping register blocks, A and B. The registers inside block A are associated with PGAU1. The intermediate results of ECD, which appear at Ro1 of PGAU1, are stored in one of the registers RA, RB, and RC. Similarly, the intermediate results of ECA, which appear at Ro2 of PGAU2, are stored in one of the registers RD, RE, and RF. The registers RCP and RP contain the curve parameter a and the prime modulus p, respectively. Registers RQ1x, RQ1y, RQ2x, and RQ2y contain the x and y coordinates of Q1 and Q2, the two points on which the ECSM operates. The registers shared by both register blocks A and B are RQ1x, RQ1y, RQ2x, RQ2y, and RP, which are associated with both PGAUs. The final result of an iteration is written back into the RQ1x, RQ1y, RQ2x, and RQ2y registers, which are used as the input points of the next iteration. Table 4.10 shows the data transfer among registers for the final result of an iteration.

Table 4.10: Restoring the intermediate results of dP operation.

d_i = 0        d_i = 1
RQ1x ← RB      RQ1x ← RE
RQ1y ← RC      RQ1y ← RF
RQ2x ← RE      RQ2x ← RB
RQ2y ← RF      RQ2y ← RC

The operation in step 6, Q1 ← Q1 − S, is executed in the proposed cryptoprocessor as:

Q2 ← S, Q1 ← Q1 − Q2.

In step 7, the operation R ← (−1)^b 3R is performed as:

Q2 ← R, Q2 ← 2Q2, Q2 ← Q2 + R, R ← (−1)^b Q2.

A similar process is followed for performing S ← (−1)^b 3S. These procedures are adopted to reduce the size of the multiplexers at the input ports of the PGAUs. The random bit b is generated in only one clock cycle by an XOR tree included in the controller logic.

The operation scheduling on PGAU1 and PGAU2 at different clock cycles is shown

in Table 4.9 for performing ECD and ECA in parallel. The computation of ECD takes

only 5k+13 clock cycles; whereas, ECA takes only 4k+10 clock cycles. Operations are

performed in parallel, and the next iteration depends on the results of both ECD and ECA

of current iteration. Hence, one iteration of the Montgomery ECSM ladder is computed

in 5k + 13 clock cycles by the proposed dual-core elliptic curve cryptoprocessor, where

k = ⌈log2 p⌉. The latency of Montgomery ECSM algorithm in our proposed design is

derived here.

Latency of dP computation. The overhead for the DPA and DA resistance involves the initialization, the final result computation, and the refreshment of the random points, as stated in Algo. 4.4.1. It consists of three ECD and four ECA operations, which take 3(5k+13) + 4(4k+10) clock cycles on the proposed dual-PGAU-core cryptoprocessor. Each iteration of the algorithm takes 5k+13 clock cycles. The algorithm goes through all bits of the scalar multiplier d starting from the second most significant bit. Let us consider k = ⌈log2 p⌉ = ⌈log2 d⌉. The number of clock cycles (T_D) required to perform one dP operation on the proposed cryptoprocessor is as follows.

T_D = (k−1)(5k+13) + 3(5k+13) + 4(4k+10)
    = (k+2)(5k+13) + 4(4k+10)
    = 5k^2 + 39k + 66.    (4.2)

Therefore, the latency of the proposed dual-PGAU-core ECSM cryptoprocessor is 5k^2 + 39k + 66 clock cycles. Hence, according to Equations 4.1 and 4.2, the clock cycle latencies of the single and dual core implementations are as follows.

• Latency in single-PGAU-core : 9k^2 + 45k + 56.

• Latency in dual-PGAU-core : 5k^2 + 39k + 66.
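The two latency formulas can be checked numerically with a short Python sketch. The function names are hypothetical; the cycle counts per ECA (4k+10) and ECD (5k+13) and the overhead of three ECD and four ECA operations are taken from the text.

```python
def single_core_cycles(k):
    """Sequential ECA + ECD per iteration: (4k+10) + (5k+13) = 9k+23 cycles."""
    return (k - 1) * (9 * k + 23) + 3 * (5 * k + 13) + 4 * (4 * k + 10)


def dual_core_cycles(k):
    """Parallel ECA and ECD per iteration: max(4k+10, 5k+13) = 5k+13 cycles."""
    return (k - 1) * (5 * k + 13) + 3 * (5 * k + 13) + 4 * (4 * k + 10)


k = 192
ts, td = single_core_cycles(k), dual_core_cycles(k)
print(ts, td, round(ts / td, 2))
```

For k = 192 this gives 340472 and 191874 clock cycles, a ratio of about 1.77.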

Thus, for a 192-bit ECSM (k = 192), the dual-PGAU-core implementation performs 1.8 times faster than the corresponding single-PGAU-core implementation. Throughout this thesis, our proposed elliptic curve cryptoprocessor refers to this dual-PGAU-core architecture; it is sometimes also called the dual-core elliptic curve cryptoprocessor.

4.5 Security Analysis of the Proposed Cryptoprocessor


This section shows that the proposed implementation is indeed secure against timing and power attacks. In ECC applications, d is used as the private key of the user. The (S,R) pair is also private. The private parameters are applied to the proposed cryptoprocessor once in a session. Within a session the user can decrypt several messages (P). For every decryption within a session, the (S,R) pair is refreshed using a random bit b.

4.5.1 Timing Attacks

The timing attack, introduced by Paul Kocher in 1996 [171], was the first reported side-channel attack on cryptographic implementations. Two types of timing attack are applied to ECSM implementations [45].

• The Hamming weight model. This attack is only applicable to unbalanced algorithms, where the ECA is performed only in the iterations where d_i = 1. This attack model exploits the timing measurement of the dP computation to find the Hamming weight of the secret parameter d. The attack does not find the exact bit values of d, but it reduces the search space. It works as follows: suppose the adversary knows the time required by the target device to perform ECD and ECA. The adversary measures the time required by the device to perform one dP operation. From this timing information the adversary tries to guess the Hamming weight of d.

• Statistical timing attack. This is a more sophisticated and more powerful timing attack. Suppose the target device takes different amounts of time to perform ECD operations on different points. The point processed in an iteration is correlated to the respective bit value of d. Thus, a statistical analysis of the timing variations of the ECD in a particular iteration reveals the secret bits.

The proposed elliptic curve cryptoprocessor is secure against the above mentioned timing attacks. It computes a symmetric operation at every iteration: it performs both ECD and ECA in parallel, which takes exactly 5k+13 clock cycles per iteration. According to Equation 4.2, it takes 5k^2 + 39k + 66 clock cycles to perform one dP operation. Let us consider that the time period of one clock cycle is t. Thus the measured dP computation time is t_dP = t(5k^2 + 39k + 66). It is assumed that the computation times for ECD and ECA, denoted by t_ecd and t_eca, are known to the adversary. In fact, the value of t_ecd is the same as the time required for computing each and every iteration on the proposed cryptoprocessor, which is 5k+13 clock cycles. Thus, t_dP/t_ecd is fixed for a given k, which does not help to find the Hamming weight of the secret d.

On the other hand, statistical timing analysis assumes that the value of t_ecd varies with the input points. However, in the proposed elliptic curve cryptoprocessor the value of t_ecd is the same for every point. This is achieved through the design technique adopted for implementing the PGAU. The PGAU takes a fixed amount of time for computing a GF(p) arithmetic operation on different inputs. For example, it takes exactly k and 2k clock cycles for computing multiplication and division, respectively, on every input for a given finite field. Thus, the proposed cryptoprocessor is secure against the statistical timing attack.

4.5.2 Simple Power Analysis (SPA)

Kocher [163] first described the SPA and DPA attacks. SPA observes the power consumption of one single execution of a cryptographic algorithm. The SPA attack on an ECSM implementation is based on the observation that the power consumed at a given time is related to the point operation being executed and the bit value of the secret scalar multiplier being manipulated. SPA on naive implementations, which are based on imbalanced computations, finds the bit values of the secret multiplier [45,155]. However, the proposed implementation has the following SPA resisting properties.

• It is based on the balanced computation of the Montgomery ladder.

• It does not execute any conditional branch statement.

• It performs field multiplication and squaring by the same sequence of operations.

• It performs a fixed set of operations in every iteration of the dP execution.

Thus, the power consumption profile exhibits a uniform pattern throughout the dP computation, from which it is impossible to identify the respective bit value of the secret multiplier by simple observation. Therefore, the proposed cryptoprocessor is secure against the SPA attack.

On the other hand, an n-bit scalar multiplier d consists (on average) of n/2 non-zero bits in its binary representation. In the binary double-and-add algorithm, the addition of two points on the elliptic curve is computed only while processing the non-zero bits of d. However, our implementation is based on the Montgomery ladder, which performs both point addition and point doubling at every iteration irrespective of the bit values, making the design resistant against SPA. As described in Section 2.2.1, the costs of one point addition and one point doubling are 2M+D and 3M+D, respectively, where M and D stand for multiplication and division in GF(p). Thus, the total cost of one elliptic curve scalar multiplication using the binary double-and-add procedure is n(4M+1.5D), whereas the same using the proposed procedure is n(5M+2D). The proposed procedure therefore incurs n(M+0.5D) additional operations due to the side-channel attack resistance property. In our proposed cryptoprocessor each M and D demands n and 2n clock cycles, respectively. Therefore, the overhead cost of our proposed side-channel attack resistant scheme is (n(2n)/n(7n)) × 100% ≃ 30%.
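The overhead figure can be reproduced with exact arithmetic. The sketch below uses only the costs stated in the text (ECA = 2M+D, ECD = 3M+D, M = n cycles, D = 2n cycles, and an ECA on half of the bits for the binary method); the names cycles_per_op and per_bit_cost are hypothetical.

```python
from fractions import Fraction

ECA = (2, 1)   # (multiplications, divisions) of one point addition
ECD = (3, 1)   # (multiplications, divisions) of one point doubling


def cycles_per_op(op, n):
    """Clock cycles of one point operation when M costs n and D costs 2n."""
    mults, divs = op
    return mults * n + divs * 2 * n


def per_bit_cost(n, eca_fraction):
    """Cycles per scalar bit: one ECD plus an ECA on a fraction of the bits."""
    return cycles_per_op(ECD, n) + eca_fraction * cycles_per_op(ECA, n)


n = 192
binary = per_bit_cost(n, Fraction(1, 2))   # double-and-add: ECA on half of the bits
ladder = per_bit_cost(n, 1)                # Montgomery ladder: ECA on every bit
overhead = (ladder - binary) / binary
print(overhead, float(overhead))
```

The exact ratio is 2/7, about 28.6%, which the text rounds to 30%.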

4.5.3 Differential Power Analysis (DPA)

In a DPA attack the adversary exploits deterministic variations in the power consumption that are caused by processing varying data. A DPA on ECC is described in [162]. Here we describe a similar type of DPA on the SPA resistant Montgomery ladder. A DPA on Algo. 2.4 in § 2.2.1 can be performed by noticing that at step j the processed points Q1 and Q2 depend only on the bits (d_{k−1}, · · · , d_j) of d. When a point Q_i, i ∈ {1,2}, is processed, the power consumption will be correlated to the bit pattern of Q_i and thus to a specific bit s_i (say the LSB) of Q_i. No correlation will be observed with a point that is not computed. Thus it is possible to successively recover the bits of the secret d by guessing which points are being computed by the device. The computed points for the first three bits are shown in Table 4.11.

Table 4.11: Computed points in Algo. 2.4 for first three bits of d.

d_{k−2} = 0: Q2 = 3P, Q1 = 2P
    d_{k−3} = 0: Q2 = 5P, Q1 = 4P
        d_{k−4} = 0: Q2 = 9P, Q1 = 8P
        d_{k−4} = 1: Q2 = 10P, Q1 = 9P
    d_{k−3} = 1: Q2 = 6P, Q1 = 5P
        d_{k−4} = 0: Q2 = 11P, Q1 = 10P
        d_{k−4} = 1: Q2 = 12P, Q1 = 11P

d_{k−2} = 1: Q2 = 4P, Q1 = 3P
    d_{k−3} = 0: Q2 = 7P, Q1 = 6P
        d_{k−4} = 0: Q2 = 13P, Q1 = 12P
        d_{k−4} = 1: Q2 = 14P, Q1 = 13P
    d_{k−3} = 1: Q2 = 8P, Q1 = 7P
        d_{k−4} = 0: Q2 = 15P, Q1 = 14P
        d_{k−4} = 1: Q2 = 16P, Q1 = 15P

The main objective of this DPA attack is to recover the secret d. It recovers d iteratively, starting from the second MSB (d_{k−2}). Let us examine the ECD operation in the Montgomery ladder. At the first iteration, if d_{k−2} = 0 then it computes 2P; otherwise it computes 2(2P). In both cases the point 2P appears either as the output or as the input. That means the power consumption in the first iteration is always correlated with 2P, whereas the point 4P appears only if d_{k−2} = 1. Thus the power consumption in the first iteration is correlated with 4P only if the secret bit d_{k−2} = 1.

In order to mount the DPA attack, we implement the Montgomery ladder on an FPGA platform that is specially designed for power analysis attacks. The board provides a 1-ohm resistor between the power supply and the VCCINT pin of the FPGA device. We measure the current drawn through that resistor during the ECSM computation with a current probe. The probe is a Tektronix current probe (serial number B014316). We use the probe with a TCPA300 power amplifier in standby mode. The measured power is displayed and stored in a Tektronix TDS5032B Digital Phosphor Oscilloscope. We develop software tools to automate the whole process for varying inputs. The power consumption is proportional to the voltage drop across the resistor and is measured in mV; it varies around 10 mV. The power signal is sampled at 12.5 MS/s.

The computation of ECSM with the same exponent d is performed repeatedly with varying P. The attack first targets the secret bit d_{k−2}, which is processed at the first iteration. Thus, we store the power consumptions during the first iteration of the dP computations. The power consumptions are then divided into two sets based on a specific bit (the LSB in our case) of 2P. We calculate the mean power consumption of each set, and then the absolute value of the difference between the two means. The same processing is done for 4P.
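The difference-of-mean computation can be sketched in Python on synthetic data. Everything about the traces here is an assumption made for illustration: the leakage model (a bit-dependent offset at one sample plus Gaussian noise), the trace length, and the names difference_of_means, LEAK_AT, and LEAK. It does not reproduce the measured traces of the experiment.

```python
import random


def column_means(traces):
    """Mean of every sample position over a set of traces."""
    return [sum(col) / len(col) for col in zip(*traces)]


def difference_of_means(traces, selection_bits):
    """Absolute difference of the per-sample means of the two trace sets."""
    set0 = [t for t, b in zip(traces, selection_bits) if b == 0]
    set1 = [t for t, b in zip(traces, selection_bits) if b == 1]
    return [abs(m1 - m0) for m0, m1 in zip(column_means(set0), column_means(set1))]


rng = random.Random(1)
N_TRACES, N_SAMPLES, LEAK_AT, LEAK = 2000, 50, 20, 1.0

# Selection bits: LSB of a point that is processed (2P) and of one that is not (4P).
bits_2p = [rng.getrandbits(1) for _ in range(N_TRACES)]
bits_4p = [rng.getrandbits(1) for _ in range(N_TRACES)]

traces = []
for b in bits_2p:
    trace = [rng.gauss(0.0, 1.0) for _ in range(N_SAMPLES)]
    trace[LEAK_AT] += LEAK * b          # only the processed point leaks
    traces.append(trace)

dom_2p = difference_of_means(traces, bits_2p)
dom_4p = difference_of_means(traces, bits_4p)
print(max(dom_2p), max(dom_4p))
```

In this model the curve for the processed point shows a clear peak at the leaking sample, while the curve for the point that was never computed stays at the noise level.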

[Figure 4.10 (plot): x-axis: samples; y-axis: difference-of-mean (V), scale ×10^−4; curves: for 2P and for 4P.]

Figure 4.10: Difference-of-mean power for 2000 traces.

Figure 4.10 shows the corresponding results. The attack is based on the power consumptions of 2000 different inputs. The figure shows that the difference-of-mean for 2P gives significant peaks whereas there are no significant peaks for 4P, which identifies d_{k−2} = 0. The difference-of-mean for 2P gives peaks because 2P actually occurred in the first iteration (refer Table 4.11). The same does not give any peak (it is almost nullified) for 4P because 2(2P) was never computed during the first iteration. After identifying the bit d_{k−2}, the process is repeated for further bits. Thus the SPA resistant Montgomery ladder is vulnerable to the DPA attack.

However, our modified point blinding technique has been incorporated in the Montgomery ladder. The proposed cryptoprocessor computes the ECSM operation by executing the above DPA resistant Montgomery ladder (Algo. 4.4.1). The Montgomery ladder in this algorithm starts its execution with an unknown point P+R. The point R is updated at every new execution by (−1)^b 3R with a random bit b. The randomness of our adopted random generator is described in [108], and we assume there is no weakness in the random generator. Now, whenever a new execution starts, the computation begins with a new unknown point. Thus the processed points in Algo. 4.4.1 are not deterministic. Also, there is no relation among the points occurring in adjacent executions [157]. Every new point P changes to some new unknown point P′ = P+R, for which no known specific bit value can be chosen on the target point. Thus, no correlation can be found between a specific bit of the target point and the respective power consumption.

[Figure 4.11 (plot): x-axis: samples; y-axis: difference-of-mean (V), scale ×10^−4; curves: 2000, 5000, 10000, and 20000 traces.]

Figure 4.11: Difference-of-mean power for 2P on proposed cryptoprocessor.

In order to prove the DPA resistance property, we perform the above mentioned attack

on our proposed elliptic curve cryptoprocessor. During the first iteration the attack is per-

formed with a maximum of 20,000 traces. Figure 4.11 shows the difference-of-mean during

the first iteration with respect to the LSB of 2P. The same for 4P is shown in Fig. 4.12.

It is observed that none of the difference-of-mean powers gives significant peaks, all are

nullified. Therefore, neither 2P nor 4P has occurred in the first iteration. Some unknown

points have been processed there. Thus the DPA could not identify the secret bit dk−2,

which ensures that the proposed cryptoprocessor is secure against DPA attack.

[Figure 4.12 (plot): x-axis: samples; y-axis: difference-of-mean (V), scale ×10^−4; curves: 2000, 5000, 10000, and 20000 traces.]

Figure 4.12: Difference-of-mean power for 4P on proposed cryptoprocessor.

4.5.4 Doubling Attack (DA)

It has already been shown that Coron's point blinding technique on the Montgomery ladder is vulnerable to the doubling attack (see § 4.4). The DA is based on two queries: one on some input P and the other on 2P. The DA exploits the similar (point doubling) operations in computing dP and d(2P). Therefore, it is essential for the doubling attack that the target device processes these two queries. In the case of Coron's point blinding technique, the first query is processed on some P′ = P+R. The second query is processed on 2P′ with probability 0.5, as the random point R is refreshed by ±2R.

However, in our modified point blinding technique the random point R is refreshed by ±3R. Thus, although the first query with P is processed on P′ = P+R, the second query with 2P is processed on P′′ = 2P±3R, which is not of the form 2P′. Thus, the essential requirement of the DA is not satisfied on our elliptic curve cryptoprocessor. The computations of dP′ and dP′′ for d = 11001011 are shown in Table 4.12. No similar operation is performed during the (i−1)th and ith iterations for processing the second and first queries, respectively. Hence, the computation of our modified Montgomery ladder (Algo. 4.4.1) with the proposed point blinding technique, and therefore the proposed elliptic curve cryptoprocessor, is secure against the doubling attack.
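The effect of the refresh factor can be illustrated with a small Python model. It is a formal sketch, not a proof: each point is represented by its coefficient pair (a, b) in aP + bR, so two doublings are treated as identical only when both coefficients agree, and accidental collisions for a particular R are ignored. The names doubling_inputs and shared_doublings are hypothetical.

```python
def doubling_inputs(d_bits, base):
    """Points doubled by the ladder, as (coefficient of P, coefficient of R)."""
    a, b = base
    m = 1                                 # invariant: Q1 = m*base, Q2 = (m+1)*base
    inputs = [(a, b)]                     # initial doubling Q2 = 2*Q1
    for bit in d_bits[1:]:
        t = m + bit                       # doubles Q1 when bit = 0, Q2 when bit = 1
        inputs.append((t * a, t * b))
        m = 2 * m + bit
    return inputs


def shared_doublings(d_bits, refresh):
    """Index pairs (i, j) where query 1 and query 2 double the same point."""
    first = doubling_inputs(d_bits, (1, 1))             # query P, blinded as P + R
    second = doubling_inputs(d_bits, (2, refresh))      # query 2P, blinded as 2P + refresh*R
    return [(i, j) for i, x in enumerate(first)
            for j, y in enumerate(second) if x == y]


d_bits = [1, 1, 0, 0, 1, 0, 1, 1]         # d = 11001011
coron = shared_doublings(d_bits, 2)       # refresh by 2R (the +2R case)
modified_plus = shared_doublings(d_bits, 3)
modified_minus = shared_doublings(d_bits, -3)
print(coron, modified_plus, modified_minus)
```

With the factor 2, shared doublings appear exactly where adjacent bits of d are equal; with the factor ±3 the model finds none.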

Table 4.12: The d(P+R) and d(2P+3R) in the Montgomery ladder.

i        d_i   Process of d(P+R)            Process of d(2P+3R)
7        1     Q1 = 1+(P+R)                 Q1 = 1+(2P+3R)
               Q2 = 2(P+R)                  Q2 = 2(2P+3R)
6        1     Q1 = (P+R)+2(P+R)            Q1 = (2P+3R)+(4P+6R)
               Q2 = 2(2P+2R)                Q2 = 2(4P+6R)
5        0     Q2 = 3(P+R)+4(P+R)           Q2 = (6P+9R)+(8P+12R)
...
               Q2 = 2(102P+102R)            Q2 = 2(204P+306R)
Return         Q1 = 203(P+R)                Q1 = 406P+609R

A common question that may arise is whether the proposed scheme can defend against the DA for a second query on 3P instead of 2P. The answer is yes. In that case the second query is processed on 3P′ with probability 0.5. Now, the DA basically exploits consecutive point doubling (ECD) operations (where an ECD computes 2P from an input P). The original DA is based on the fundamental observation that the output of an ECD on 2P′ is the same as the output of two consecutive ECDs on P′. This similarity of the two computations, one on P′ and the other on 2P′, continues throughout the ECSM computations and results in the DA. However, with our modification, the same observation does not hold for P′ and 3P′. Fig. 4.13 shows an execution tree for the first few iterations with all possible input combinations of the Montgomery ladder. The two queries are executed with two different inputs P and 3P, respectively.

The execution tree shows a few similarities during the first three iterations of dP′ and d(3P′). For example, if the first iteration of d(3P′) and the second iteration of dP′ both execute 6P′, then the attack could identify {d_{k−2}, d_{k−3}} = {0,1}. Similarly, if the first iteration of d(3P′) and the third iteration of dP′ both execute 12P′, then it could identify {d_{k−2}, d_{k−3}, d_{k−4}} = {1,0,0}. However, no similarities can occur at further iterations. Because of these few similarities, it is recommended to avoid those few values of d as the secret key. Hence, the DA on the proposed technique cannot be performed even by two queries on P and 3P.

[Figure 4.13 (execution tree): binary tree over the bits d_{k−2}, d_{k−3}, ..., with d_{k−1} = 1. Each node lists a pair "a, b" (Q1 = aP′, Q2 = bP′ in dP′) next to the pair "3a, 3b" (Q1, Q2 in d(3P′)), from "1, 2" and "3, 6" at the root (first iteration) to "31, 32" and "93, 96" at the leaves. Matches are marked for {d_{k−2}, d_{k−3}} = {{0,1},{1,0}} and for {d_{k−2}, d_{k−3}, d_{k−4}} = {1,0,0}. First query: P′ = P+R, computes dP′. Second query: P′′ = 3P ± 3R, computes d(3P′) with probability 0.5.]

Figure 4.13: Execution tree for doubling attack with P and 3P.
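The claim that coinciding doublings between dP′ and d(3P′) are confined to the first few iterations can be checked by brute force. The Python sketch below makes several assumptions: points are modelled by their integer multiple of P′, only the doublings inside the ladder loop are compared (iteration 1 processes d_{k−2}), the scalar length K = 8 is arbitrary, and the names doubled_multiples and matches are hypothetical.

```python
from itertools import product


def doubled_multiples(d_bits, base):
    """Multiple of P' doubled in each loop iteration of the ladder."""
    m, out = 1, []                        # invariant: Q1 = m*base, Q2 = (m+1)*base
    for bit in d_bits[1:]:
        out.append((m + bit) * base)      # doubles Q1 when bit = 0, Q2 when bit = 1
        m = 2 * m + bit
    return out


def matches(d_bits):
    """Pairs (iteration of dP', iteration of d(3P')) doubling the same point."""
    first = doubled_multiples(d_bits, 1)      # first query, processed on P'
    second = doubled_multiples(d_bits, 3)     # second query, processed on 3P'
    return [(i + 1, j + 1) for i, x in enumerate(first)
            for j, y in enumerate(second) if x == y]


K = 8                                         # scalar length used for the search
found = {}
for tail in product((0, 1), repeat=K - 1):    # all K-bit scalars with MSB = 1
    for i, j in matches((1,) + tail):
        # Record the bits d_{k-2}, ... processed up to iteration i of dP'.
        found.setdefault((i, j), set()).add(tail[:i])
print(found)
```

Under these assumptions the search reports only two cases, both against the first iteration of d(3P′): the second iteration of dP′ for the bits {0,1} and the third iteration for the bits {1,0,0}.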

4.5.5 Security of the Random Generator

Let us consider an operation P+R performed on a public parameter P and a private parameter R. As per [108], the security of the randomized ECSM operation depends on the privacy of the point P+R. Thus, the system is vulnerable if an attacker can find R. A side-channel attack (SCA) in this setting would try to recover the unknown R by exploiting the P+R operations.

A DPA attack on (A+B) in $F_{2^m}$ is shown in [40]. That field addition is performed by a (linear) XOR operation, so the power consumption of the operation is directly related to specific bits of the inputs and the output, and a correlation is found. In the case of point addition (P+R), however, the output is not linearly related to the inputs, because the addition is performed by executing a sequence of finite field operations. The correlation between the power consumption and a specific bit of the inputs and output, from which the unknown input bit is guessed, therefore cannot be found in this case. Hence, the point addition is secure against the above attack.



Let us assume that a new side-channel attack manages to exploit the (P+R) operation. Any powerful side-channel attack typically requires the same operation to be computed a sufficient number of times (say 4000 times) on the same private parameter (R) with a varying public parameter (P). This is not possible in the proposed technique, because the proposed device changes the private parameter R randomly at every new execution within a session.

However, if the (S,R) pair is fixed for a user, then the operation (P+R) on the same R is performed at the beginning of each session. Thus, 4000 sessions can give sufficient side-channel information to mount that attack. To close this vulnerability, the protocol must be designed so that at the end of a session the updated (S,R) pair is sent back to the user through a secure channel, and the latest (S,R) pair is used for the next session of that user. Used with this high-level protocol, the proposed technique completely avoids repeated (P+R) operations on the same R, which ensures the security of the proposed random generator against side-channel attacks.

4.6 ECSM Implementation Result and Comparison

We have implemented the ECSM units on FPGA platforms. The design has been done in Verilog (HDL). Synthesis, mapping, placement, and routing have been done with Xilinx ISE 7.1i. Simulation at different levels has been performed on the ModelSim XE III 6.0a simulator. Table 4.13 shows the post place-and-route results of ISE for three different bit sizes. The target device is a Xilinx Virtex-II Pro FPGA. The 192-bit dual-core implementation computes an ECSM in 4.47 ms running at 43 MHz. The 192-bit cryptoprocessor uses 8 972 slices and 3 127 flip-flops. The estimated equivalent gate count of the 192-bit implementation is 133 685.

The dual PGAU-core elliptic curve cryptoprocessor has been implemented on different FPGA platforms. The total slice area consumption is almost the same across the FPGA devices. However, the speed grade differs between devices, so the frequency of the design changes and the performance varies accordingly. Table 4.14 shows the performance of the proposed cryptoprocessor on different FPGA platforms.



Table 4.13: Hardware cost and time of ECSM operation on Virtex-II Pro FPGA.

Prime p   Frequency   Area                                          ECSM Time
(bits)    (MHz)       Slice     LUT       FF       Equivalent gate  (ms)

Single PGAU-core implementation
192       43          4 463     8 791     1 147    65 874           7.92
224       40          5 226     9 344     1 253    79 051           11.55
256       36          6 102     10 561    1 512    87 024           16.71

Dual PGAU-core implementation
192       43          8 972     15 394    3 127    133 685          4.47
224       40          10 386    17 914    3 513    154 861          6.50
256       36          11 953    20 779    3 873    177 534          9.38

Table 4.14: Performance of dual PGAU-core implementation on different FPGA.

bit    Spartan-III           Virtex-II Pro         Virtex-IV
       Frequency   Time      Frequency   Time      Frequency   Time
       (MHz)       (ms)      (MHz)       (ms)      (MHz)       (ms)
192    32          6.00      43          4.47      61          3.15
224    28          9.27      40          6.50      58          4.49
256    24          14.05     36          9.38      54          6.26

The utilization of area for performing the ECSM operation is shown in Table 4.15. It is assumed that the multiplication unit consumes only 1/6 of the slices of the inversion/division unit (refer to Table 4.5). The PPU1 proposed in [23] consists of one multiplier and one divider, which are busy for 82.3% and 65.9% of the total time required for an ECSM computation, respectively. The total area utilization of PPU1 for computing the ECSM operation is therefore only 68.2%. Similarly, PPU2 in [23] consists of two multipliers and two dividers, which are busy for 39.4%, 59.1%, 39.4%, and 39.4% of the ECSM operation. Its total area utilization is 40.9% on average during the ECSM operation. In the proposed design, by contrast, 2/3 of the area (DP1 and DP3 except the subtractor) is utilized 100% of the time and the remaining 1/3 (DP2 and the subtractor in DP3) only 40% of the time. Thus, the proposed design utilizes 80.0% of its total area on average during the execution of the above operation.
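The area utilization figures above can be checked as area-weighted averages. The following is a minimal sketch, assuming (as the text does) that a multiplier occupies 1/6 of the area of a divider; the function name `weighted_utilization` and the relative area units are illustrative choices, not part of the original design flow.

```python
def weighted_utilization(units):
    """units: list of (relative_area, busy_percent) pairs.

    Returns the area-weighted average utilization in percent.
    """
    total_area = sum(area for area, _ in units)
    return sum(area * busy for area, busy in units) / total_area

# PPU1 [23]: one multiplier (relative area 1, busy 82.3%) and
# one divider (relative area 6, busy 65.9%).
ppu1 = weighted_utilization([(1, 82.3), (6, 65.9)])

# Proposed design: 2/3 of the area busy 100%, the other 1/3 busy 40%.
proposed = weighted_utilization([(2, 100.0), (1, 40.0)])

print(round(ppu1, 1), round(proposed, 1))
```

Under these assumptions the two values round to the 68.2% and 80.0% entries of Table 4.15.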

Table 4.16 compares the performance of elliptic curve scalar multiplication designs. The design reported in [31] supports dual-field operations at the cost of a relatively large hardware area (40 219 slices), and thus we have not explicitly compared it with the proposed



Table 4.15: Area utilization of designs in ECSM operation.

PPU1 [23]    PPU2 [23]    Proposed
68.2%        40.9%        80.0%

design. Note that the designs are implemented on different platforms and use different resources, so a straightforward comparison is not fair. Instead, we compare each of them individually with the proposed elliptic curve cryptoprocessor. The embedded multi-core design proposed by Fan et al. [58] consumes 0.35 times the slice area of the proposed design, but along with slices it uses 6 Block RAMs (BRAMs), equivalent to 6 × 18 Kb of RAM, and sixteen dedicated 18-bit multipliers. On the same Virtex-II Pro platform, the proposed design gives 2.21 times the throughput of [58] for the ECSM operation. The design described in [74] consumes 15 755 slices and computes one point multiplication in 3.86 ms. The design proposed by Sakiyama et al. [70] uses 1.2 times the slice area, along with 9 × 18 Kb of RAM and two dedicated 256-bit modular arithmetic logic units. The 256-bit implementation of [70] takes 2.70 ms, which is 2.20 times the throughput of the proposed 192-bit design. The ECSM implementation reported in [87] achieves only 0.74 times the throughput of the proposed elliptic curve cryptoprocessor, in a smaller area.
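The throughput ratios quoted above follow from throughput = operand bits / ECSM time. The short sketch below recomputes them from the bit lengths and times listed in Table 4.16; the helper name `throughput_kbps` is illustrative, and the ratios agree with the quoted values only up to rounding.

```python
def throughput_kbps(bits, time_ms):
    # bits per millisecond is numerically equal to kilobits per second
    return bits / time_ms

proposed = throughput_kbps(192, 4.47)   # proposed design, 192-bit
fan = throughput_kbps(192, 9.90)        # Fan et al. [58]
sakiyama = throughput_kbps(256, 2.70)   # Sakiyama et al. [70], 256-bit
shuhua = throughput_kbps(192, 6.00)     # Shuhua et al. [87]

ratio_vs_fan = proposed / fan
ratio_sakiyama = sakiyama / proposed
ratio_shuhua = shuhua / proposed

print(f"{proposed:.1f} Kbps, {ratio_vs_fan:.2f}, {ratio_sakiyama:.2f}, {ratio_shuhua:.2f}")
```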

The design proposed in [31] provides the maximum throughput among both the CMOS and the FPGA implementations. The design proposed in [58] consumes the smallest slice area in FPGA among the designs that are also able to perform division. Our design makes a trade-off in this respect, and it has the additional feature that it resists non-differential and differential timing and power attacks. Figure 4.14 gives a pictorial view of the performance of the current design and of the designs proposed by Sakiyama et al. [70] and Fan et al. [58]. The different resources inside the Virtex-II Pro FPGA device that are used by these designs are shown in four groups. The major features that make the current design superior to the existing designs are as follows:

• The whole architecture has inherent programmability. That is, it supports all primes less than the given lengths (192, 224, 256 bits). Therefore, it can be used to process a larger number of curves defined on GF(p).

Table 4.16: Performance comparison of elliptic curve scalar multiplication over arbitrary prime fields.

Reference                  Prime p   Device          Frequency   Area                               Time    Throughput
                           (bits)                    (MHz)                                          (ms)    (Kbps)
Proposed*                  192       Virtex-II Pro   43          8972 Slice + 3127 FF               4.47    43.0
Ananyi [14], 2009‡         192       Virtex-II Pro   49          20793 Slice, 32 Mult. (18-bit)     7.24    26.5
Schinianakis [15], 2009‡   192       Virtex-E        −           25012 LUT                          3.54    54.2
Lai [31], 2008             192       Virtex-II Pro   94          40219 Slice                        1.25    153.6
Fan [58], 2007§            192       Virtex-II Pro   93          3173 Slice + 16 Mult. + 6 BRAM     9.90    19.4
Sakiyama [70], 2006        256       Virtex-II Pro   100         10847 Slice + 9 BRAM               2.70    94.8
McIvor [74], 2006          256       Virtex-II Pro   39          15755 Slice, 256 Mult. (18-bit)    5.99    42.7
Shuhua [87], 2005†         192       Virtex-II       50          2365 Slice + 2 BRAM + 1147 FF      6.00    32.0
Lai [16], 2009             160       CMOS 0.13µm     121         170 Kgates                         0.61    262.3
Lai [31], 2008             160       CMOS 0.13µm     217         151 Kgates                         0.34    470.6
Chen [55], 2007            256       CMOS 0.13µm     556         122 Kgates                         1.01    253.5
Satoh [119], 2003          256       CMOS 0.13µm     138         120 Kgates                         2.68    95.5
− * Implementation itself is secure against timing and power attacks.
− ‡ Implementation supports fixed NIST primes only.
− § It supports modular multiplication of arbitrary length.
− † Does not include divider and does not compute the result in affine co-ordinates.


• The current design is memoryless: it does not use any block RAM of the FPGA. RAM cells consume more power than logic cells [35], so in low-power applications our design is more useful than designs with memory.

• It efficiently optimizes the area × time per bit value for ECSM operation.

• The proposed design is secure against known timing and power attacks.

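[Figure 4.14 (bar charts): resources used on the Virtex-II Pro FPGA, in four groups, and ECSM computation time for three designs. Slice: Our 8972, Fan 3173, Sakiyama 10 847. Flip-flops: Our 3127, Fan 0, Sakiyama 0. BRAM: Our 0, Fan 6, Sakiyama 9. Multiplier: Our 0, Fan 16, Sakiyama 0. ECSM computation time: Our 4.47 ms, Fan 9.90 ms, Sakiyama 2.70 ms.]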

Figure 4.14: Performance of related designs with respect to area and time.

For a detailed comparison, let us consider that one LUT is equivalent to a 16×1 RAM [34] and that one BRAM consists of a 1024×18-bit RAM; hence, one BRAM is equivalent to 576 slices. We also consider, for simplicity, that one 18-bit multiplier is equivalent to 197 slices [36]. These equivalence relations are used to calculate the equivalent slice area required to implement the related designs. Table 4.17 shows the equivalent slice area and the corresponding comparative parameter, the area × time per bit value. Among the compared designs, the proposed cryptoprocessor holds the second-best position with respect to this parameter, just after the design in [70]. However, the design of [70] contains memory elements and does not provide security against any side-channel attack. Except for the current design, all the designs implement the normal double-and-add scalar multiplication algorithm. This algorithm is susceptible to simple side-channel analysis, such as simple power attacks (SPA) [155]. The proposed architecture, in contrast, protects the secret multiplier against these vulnerabilities. This incurs an additional cost of 30% more operations on average and requires an extra 6k-bit register.

Table 4.17: Performance comparison of different designs.

Reference            EqS       ECSM Time (ms)   EqS × Time / bit   TPAR   BRAM
Proposed             9 072     4.47*            211                Yes    No
Ananyi [14]          27 097    7.24             1022               No     No
Lai [31]             40 219    1.25             262                No     No
Schinianakis [15]    13 500    3.54             249                No     No
Fan [58]             9 781     9.90             504                No     Yes
McIvor [74]          15 755    3.86             238                No     No
Sakiyama [70]        16 031    2.70             169                No     Yes
− EqS: equivalent slice. TPAR: timing and power attack resistant.
− * TPAR property needs 30% additional operations and 6k bit extra register.
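The equivalent slice (EqS) entries for the designs that use multipliers or BRAMs can be reproduced from the equivalences stated in the text. The sketch below is illustrative only: the helper names are hypothetical, the raw resource counts are taken from Table 4.16, and the two-LUTs-per-slice figure in the comment is an assumption that is consistent with the 576-slice equivalence.

```python
SLICES_PER_BRAM = 576   # 1024 x 18 bits / 16 bits per LUT / 2 LUTs per slice (assumed)
SLICES_PER_MULT = 197   # one 18-bit multiplier, as assumed in the text [36]

def equivalent_slices(slices, mults=0, brams=0):
    """Fold dedicated multipliers and BRAMs into an equivalent slice count."""
    return slices + mults * SLICES_PER_MULT + brams * SLICES_PER_BRAM

def area_time_per_bit(eq_slices, time_ms, bits):
    """Comparative parameter: equivalent slices x ECSM time (ms) per operand bit."""
    return eq_slices * time_ms / bits

fan_eqs = equivalent_slices(3173, mults=16, brams=6)
sakiyama_eqs = equivalent_slices(10847, brams=9)
ananyi_eqs = equivalent_slices(20793, mults=32)

fan_at = area_time_per_bit(fan_eqs, 9.90, 192)
sakiyama_at = area_time_per_bit(sakiyama_eqs, 2.70, 256)
ananyi_at = area_time_per_bit(ananyi_eqs, 7.24, 192)

print(fan_eqs, sakiyama_eqs, ananyi_eqs)
print(round(fan_at), round(sakiyama_at), round(ananyi_at))
```

These three rows are the ones whose EqS values follow directly from the stated equivalences; the remaining rows are taken from the table as given.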

Figure 4.15 presents the ECSM computation time versus the equivalent slice area of the related designs. The area × time per bit value of our design is only 0.20 times that of the design of Ananyi et al. [14], 0.77 times that of Lai et al. [31], 0.82 times that of Schinianakis et al. [15], 0.42 times that of Fan et al. [58], 0.64 times that of McIvor et al. [74], and 1.2 times that of Sakiyama et al. [70]. The above comparison therefore shows that our design optimizes the area × time per bit value while also providing inherent timing and power attack resistance.

4.7 Conclusion

This chapter has investigated the scope of hardware sharing among the finite field primitives. The reduced area of the programmable GF(p) arithmetic unit has allowed the use of dual cores to accelerate the ECSM operation. It has been shown that the proposed cryptoprocessor performs one 192-bit ECSM operation in 4.47 ms. The proposed architecture has the additional advantage of being memoryless. Most importantly, the proposed elliptic curve cryptoprocessor provides security against timing and power attacks. Experimental results have been furnished to show that, among timing and power attack resistant elliptic curve scalar multipliers on GF(p), the proposed design achieves the best area × time per bit value.

[Figure 4.15 (scatter plot): axes are ECSM execution time (ms), 1 to 10, and equivalent slice, 5 000 to 45 000; the plotted designs are labelled Our, Fan, Sakiyama, McIvor, Lai, Ananyi, and Schinianakis.]

Figure 4.15: Performance of related designs.

In the next chapter we focus on further improving the performance of our current PGAU and ECSM cryptoprocessor on the FPGA platform. The work explores the built-in features of an FPGA device for designing optimized architectures for arithmetic operations, based on which we measure the performance gain of our current design.


Chapter 5

Fast Prime Field Adders and Multipliers

on FPGA Platform

FINITE FIELD ADDITION and multiplication are the most important operations in cryptography. Efficient techniques for these operations greatly affect the overall performance of a cryptoprocessor. The development of such a cryptoprocessor for bilinear pairings is one of the major objectives of this thesis. Bilinear pairings in cryptography are computed on elliptic or hyperelliptic curves defined over finite fields. Their security and computational efficiency depend on the underlying curve and finite field, which we call the pairing-friendly curve and the pairing-friendly field, respectively. The selection of curves and the design of an efficient cryptoprocessor for pairing computation are addressed in the next chapter. This chapter deals with efficient design techniques for prime field (Fp) arithmetic on the FPGA platform.

The efficient computation of addition/subtraction and multiplication in Fp is the main objective of this chapter. We propose high-speed prime field adder and multiplier circuits on the FPGA platform. The work shows how the in-built fast carry chains of an FPGA device can be used efficiently to develop a high-speed adder circuit. It follows the Karatsuba decomposition technique for computing the above operations. Experimental results show that, owing to the optimized addition chains, Karatsuba decomposition up to a particular level improves the performance, whereas further decomposition degrades it.


Subsequently, the chapter modifies the existing interleaved multiplication algorithm using the Montgomery ladder. The modified algorithm improves the scope for parallelism and also provides security against non-differential side-channel attacks. The experimental results show that the proposed design provides a 70% speedup over the best known existing design. An actual power analysis has been performed to show its security against the non-differential power attack, which is also known as simple power analysis (SPA).

Finally, the chapter redesigns the programmable GF(p) arithmetic unit (PGAU) and the elliptic curve scalar multiplication (ECSM) cryptoprocessor based on the proposed fast adder circuits. The experimental results show that the modified designs achieve a 30% speedup over the designs reported in the previous chapter.

5.1 Introduction

The high security required by present-day electronic applications is mostly provided by protocols built on RSA, elliptic curve, and pairing-based cryptography. Modular addition and multiplication are the fundamental operations in all such public key schemes. For elliptic curves and pairings, the key sizes are smaller than for RSA: around 256 bits for the 128-bit security level. Thus, the efficient design of modular (or finite field) addition and multiplication on large operands is one of the objectives of this thesis. The performance gained from these optimized finite field primitives further helps in designing a highly optimized processor for pairing-based cryptography.

Various techniques exist to improve the efficiency of addition. Some of the known techniques are carry select, carry lookahead, and Brent-Kung carry. All of these techniques optimize the length of the carry chain in order to improve the efficiency of an addition, but none of them considers the routing complexity of the respective adder circuit in hardware. This chapter shows experimentally that the routing delay of these addition techniques on an FPGA platform is almost the same as the logic delay.

On the other hand, modern FPGAs provide special carry logic for addition. The carry chains formed by the in-built carry logic are 32 bits long. This chapter shows that, for a 32-bit addition, the carry propagation adder based on the in-built carry logic provides the minimum latency compared to all other addition techniques. Subsequently, it develops a hierarchical adder structure for large operands using this fast carry chain (FCC). The experimental results show that the proposed technique significantly reduces the routing delay as well as the logic delay compared to the existing techniques. For a 256-bit adder, the proposed technique provides a 35% speedup compared to the best known technique on an FPGA platform.

Another major operation in any public key scheme is finite field multiplication. Fp-multiplication algorithms are based either on iterative interleaved additions or on multiplication followed by reduction. Reduction in Fp requires division by a large prime. The Montgomery partial reduction algorithm [176] avoids this division. The only drawback of that algorithm is that the operands as well as the results are in a specific format called Montgomery numbers: the operands first have to be converted into that form before the multiplication is performed. Thus, Montgomery reduction is useful for exponentiation-like algorithms, in which multiplications are computed repeatedly before the final result is produced. Some existing architectures for multiplication with Montgomery reduction are presented in [44,79,80,95–99,106,110,124,158]. The best design known so far is proposed in [44]; it computes a multiplication in only 0.38 µs in the Montgomery domain. However, it uses thirty-three 18-bit in-built multipliers and 1704 slices, i.e., about 8000 slices of an FPGA device in total, which makes it costly.
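To make the role of the Montgomery domain concrete, the following is a minimal software sketch of textbook Montgomery multiplication. It is not the architecture of [44] or of any cited design; the function names are illustrative, R = 2^k is an assumed choice of Montgomery radix, and `pow(p, -1, R)` requires Python 3.8 or later.

```python
def montgomery_setup(p, k):
    """Return (R, -p^{-1} mod R) for R = 2**k > p, with p odd."""
    R = 1 << k
    return R, (-pow(p, -1, R)) % R

def redc(T, p, k, p_neg_inv):
    """Montgomery reduction: T * R^{-1} mod p for 0 <= T < p * R."""
    mask = (1 << k) - 1
    m = ((T & mask) * p_neg_inv) & mask
    t = (T + m * p) >> k          # exact division by R: a shift, no division by p
    return t - p if t >= p else t

def mont_mul(a_bar, b_bar, p, k, p_neg_inv):
    """Multiply two operands that are already in Montgomery form."""
    return redc(a_bar * b_bar, p, k, p_neg_inv)

# Example with the 192-bit NIST prime and R = 2**192.
p = 2**192 - 2**64 - 1
k = 192
R, p_neg_inv = montgomery_setup(p, k)

a = 0x0123456789ABCDEF0123456789ABCDEF
b = 0xFEDCBA9876543210FEDCBA9876543210

# Conversion into the Montgomery domain (done once, uses an ordinary reduction).
a_bar, b_bar = (a * R) % p, (b * R) % p
prod_bar = mont_mul(a_bar, b_bar, p, k, p_neg_inv)   # still in Montgomery form
result = redc(prod_bar, p, k, p_neg_inv)             # conversion back

print(result == (a * b) % p)
```

The conversion steps at the start and end are the overhead mentioned above, which is why the method pays off mainly when many multiplications are chained.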

Fp multiplication based on interleaved addition chains was introduced by Blakley [179, 180] at the same time as Montgomery reduction. The algorithm is described in § 2.1.2. It is an iterative algorithm that follows the standard double-and-add procedure on 2's complement numbers. At every iteration it reduces the intermediate result, which therefore always remains below the modulus. The main difficulty of this algorithm is the addition of large operands: the latency of an addition increases linearly with the length of the carry chain. Thus, a carry propagation adder circuit is inefficient for building an Fp-multiplier for large operands based on the interleaved multiplication algorithm.
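The interleaved scheme can be sketched in a few lines. The version below is the textbook form of interleaved modular multiplication on non-negative Python integers, written under the assumption that both operands are already reduced modulo p; it is not a transcription of the algorithm in § 2.1.2, and the function name is illustrative.

```python
def interleaved_mulmod(a, b, p):
    """Compute (a * b) mod p for 0 <= a, b < p by interleaving
    double-and-add steps with reduction."""
    r = 0
    for i in reversed(range(b.bit_length())):   # scan multiplier bits, MSB first
        r <<= 1                                  # doubling
        if (b >> i) & 1:
            r += a                               # conditional addition
        # Here r < 3p, so at most two subtractions bring it below p.
        if r >= p:
            r -= p
        if r >= p:
            r -= p
    return r

p = 2**192 - 2**64 - 1
a = 0x0123456789ABCDEF0123456789ABCDEF
b = 0xFEDCBA9876543210FEDCBA9876543210
product = interleaved_mulmod(a, b, p)
print(product == (a * b) % p)
```

Every iteration performs full-width additions and comparisons, which is why the adder latency dominates the multiplication time.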

Some modifications of the above algorithm that reduce the multiplication latency due to the carry chains have been made in [121] and [25]. The modifications are based on the carry save adder (CSA) and rely either on a sign estimation technique [25] or on pre-computed values [121]. Both techniques require additional computations, which in turn require additional circuitry and memory elements in hardware. The pre-computations required to perform (A · B) mod p by the modified algorithms depend on the multiplicand A. Pre-computation pays off in applications where repeated multiplications are performed on a fixed multiplicand and a varying multiplier. In our applications, such as elliptic curve or pairing computations, this no longer holds, so the pre-computation cost is added directly to the cost of every multiplication. Another disadvantage of these techniques is that a carry save adder inherently requires one carry propagation adder for computing the final output. Therefore, better techniques, more suitable for applications such as elliptic curves and pairings, could be explored.

This chapter proposes an efficient architecture for the interleaved multiplication algorithm on the FPGA platform using the in-built fast carry logic. It shows different levels of operand decomposition and the corresponding parallelism for executing the interleaved multiplication algorithm. For further speedup, we modify the algorithm by exploring the scope of parallelism within an iteration. The Montgomery ladder technique [125] (see § 2.1.4) is exploited for this modification. The doubling and addition operations within an iteration of our modified algorithm are independent of each other and can thus be computed in parallel. Moreover, both operations are computed at every iteration, which provides a balanced execution and security against non-differential side-channel attacks [156]. Subsequently, a two-level parallel architecture based on the modified algorithm is proposed. The experimental results show that the proposed design gives the best performance, including the final reduction step, for multiplication in Fp for a large prime p.
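The idea of one independent doubling and one addition per iteration can be illustrated with a generic Montgomery-ladder formulation of modular multiplication. This is only a sketch under that assumption: it is not the modified algorithm of this thesis, the function name and the fixed iteration count `nbits` are illustrative, and the `%` operator stands in for the conditional subtractions a hardware design would use.

```python
def ladder_mulmod(a, b, p, nbits):
    """Ladder-style (a * b) mod p: every iteration performs exactly one
    modular addition and one modular doubling, whatever the key bit is."""
    r0, r1 = 0, a % p                 # invariant: r1 - r0 == a (mod p)
    for i in reversed(range(nbits)):  # fixed iteration count, MSB first
        s = (r0 + r1) % p             # addition (does not depend on d below)
        if (b >> i) & 1:
            d = (2 * r1) % p          # doubling of r1
            r0, r1 = s, d
        else:
            d = (2 * r0) % p          # doubling of r0
            r0, r1 = d, s
    return r0

p = 2**192 - 2**64 - 1
a = 0x0123456789ABCDEF0123456789ABCDEF
b = 0xFEDCBA9876543210FEDCBA9876543210
ladder_product = ladder_mulmod(a, b, p, 192)
print(ladder_product == (a * b) % p)
```

Because the sum `s` and the doubled value `d` are computed from the previous state only, the two operations have no data dependency on each other within an iteration, which is the property the text relies on for parallel and balanced execution.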

In order to demonstrate the efficiency gained by the proposed techniques on an FPGA platform, we redesign the PGAU and ECSM hardware described in the previous chapter. The old adder circuits are replaced by the proposed high-speed adders in the new designs. The experimental results show that a significant performance improvement is achieved due to the high-speed adder. The major contributions of the current chapter are:

1. High speed adder. It explores the utilization of the in-built carry logic of an FPGA device and proposes a hierarchical structure for developing a high-speed adder circuit for large operands.

2. Parallel Fp–multiplier. It proposes a modification of the interleaved multiplication algorithm that improves the scope for parallelism. Subsequently, we propose a parallel iterative architecture based on the proposed modification and the high-speed adders. The proposed architecture exploits parallelism at two levels: the addition level and the algorithmic level. Extensive experimental results have been furnished to show its performance improvement over existing designs and its security against non-differential timing and power attacks.

3. Speedup of elliptic curve cryptoprocessor. The PGAU and ECSM hardware described in Chapter 4 are redesigned based on the proposed high-speed adder. Subsequently, the chapter shows the respective performance gains achieved through our proposed technique.

The outline of the chapter is as follows. The chapter starts with the description of an efficient adder circuit on the FPGA platform and a comparison of its performance with the best known technique. The modification of the interleaved multiplication algorithm, the respective architecture, its implementation cost, its performance, and its security against power attacks are described thereafter. The chapter then shows the performance gains of the new PGAU and elliptic curve cryptoprocessor compared to their old designs presented in Chapter 4.

5.2 Fast Additions on FPGA

Addition is one of the fundamental operations in any cryptographic algorithm. It is also the major operation in interleaved multiplication. This section develops an efficient adder unit for FPGA platforms in general, although the discussion focuses on Xilinx Virtex-II Pro FPGA devices. Modern FPGAs support a ripple carry chain of at most 32 bits [169]. A carry chain is placed in one row of the FPGA, and it interfaces with all the FPGA cells in that row. Each 32-bit carry chain supports k-bit carry computations placed at any point within the carry chain resource, where k ≤ 32. Thus, addition of up to 32 bits in an FPGA device using a ripple-carry (or carry-propagation) adder (CPA) has the lowest routing complexity compared to other adders, including Brent-Kung [181] and carry lookahead adders.

Table 5.1 shows the performance of 32-bit adders on a Virtex-2 Pro FPGA. In the table, the total delay (TD) comprises the input buffer delay, the computation delay, and the output buffer delay. Typically, in a Virtex-2 Pro FPGA device the input buffer delay is 1.452 ns and the output buffer delay is 2.851 ns. Because of these two buffers, the actual addition time of each adder is 4.303 ns less than the time shown in the TD column. It is observed that the 32-bit carry propagation adder gives the best performance as well as the lowest area among all 32-bit adders on the FPGA platform. There are 16 slices (or 32 LUTs) in a single column of a Virtex-2 Pro FPGA device. Each LUT in a column has special carry logic that is directly connected to the next adjacent LUT. Therefore, carry propagation over up to 32 bits incurs the minimum routing delay, which makes the carry propagation adder the best adder circuit for 32-bit additions on an FPGA platform.

Table 5.1: Performance of different 32-bit adders on Virtex-2 Pro FPGA.

Adder Type           LD (ns)   RD (ns)   TD (ns)   Slice   LUT
Carry Propagation    5.882     0.799     6.621     16      32
Brent-Kung Carry     5.395     4.174     9.569     84      151
Lookahead Carry      5.271     4.017     9.288     69      127
Carry Select         5.563     4.321     9.884     82      149
− LD, RD, and TD stand for Logic Delay, Route Delay, and Total Delay.

In applications such as elliptic curve and pairing computations, it is essential to perform additions on operands that are much larger than 32 bits; 256-bit operands are very common in such applications. Let us now observe the performance of different adders for such a large bit length. We already know that, for a 32-bit addition on an FPGA platform, the carry propagation adder provides the best performance. From this point on we call it the 32-bit fast carry chain (FCC). We design various adder circuits for 256-bit addends based on the FCC. The 256-bit addends are broken into eight 32-bit parts, which are used as the basic units.



Table 5.2: Performance of 256-bit adders based on 32-bit FCC on Virtex-2 Pro.

Adder Type           LD (ns)   RD (ns)   TD (ns)   Slice   LUT
Carry Propagation    14.058    0.799     14.857    128     256
Brent-Kung Carry     7.514     6.120     13.634    373     712
Lookahead Carry      8.574     5.358     13.932    486     910
Carry Select         7.514     6.460     13.974    373     711
− LD, RD, and TD stand for Logic Delay, Route Delay, and Total Delay.

Let us assume $A = ((a_7 2^{32} + a_6) 2^{64} + (a_5 2^{32} + a_4)) 2^{128} + ((a_3 2^{32} + a_2) 2^{64} + (a_1 2^{32} + a_0))$ and $B = ((b_7 2^{32} + b_6) 2^{64} + (b_5 2^{32} + b_4)) 2^{128} + ((b_3 2^{32} + b_2) 2^{64} + (b_1 2^{32} + b_0))$, where each of the $a_i$, $b_i$ is a 32-bit word. An addition $a_i + b_i$ is performed by a 32-bit FCC. The output is $S = ((s_7 2^{32} + s_6) 2^{64} + (s_5 2^{32} + s_4)) 2^{128} + ((s_3 2^{32} + s_2) 2^{64} + (s_1 2^{32} + s_0))$, and the final carry output is $c_{out}$. The different adder circuits (carry propagation, Brent-Kung carry, lookahead carry, and carry select) are designed as their respective 8-bit structures [169], in which each bit addition is replaced by a 32-bit FCC. The comparative performance of these adders is given in Table 5.2. Among the known techniques, the Brent-Kung carry adder with 32-bit FCC provides the best performance; it takes only 13.634 ns to perform a 256-bit addition on a Virtex-2 Pro FPGA. It is observed that the routing delay of every structure except the carry propagation adder is significantly high. In the following subsection, we propose an adder structure on the FPGA platform that requires less routing delay and provides the maximum speed for large addends.

5.2.1 Proposed Addition Technique

Our addition technique follows the Karatsuba decomposition [117]. Normally, the Karatsuba technique is used to compute the multiplication of two large operands; we exploit the same technique for addition. Let us consider the addition (A+B) of two 256-bit addends A and B. The addends are broken down as shown in Fig. 5.1. They are decomposed up to level 3 (height of the decomposition tree, h = 3), although further decomposition up to level 8 (the individual bit level) is possible for a 256-bit addend. Each level of decomposition increases the scope for parallelism in computing the addition. It has already been shown that, for a 32-bit addition on an FPGA platform, the best adder unit is formed by a 32-bit FCC. Thus, we stop the decomposition at level 3 when designing the proposed 256-bit adder.

[Figure 5.1 (decomposition tree): A + B (256 : 256) splits into L and H (128 : 128 each); these split into LL, LH, HL, and HH (64 : 64 each); these in turn split into LLL, LLH, LHL, LHH, HLL, HLH, HHL, and HHH (32 : 32 each).]

Figure 5.1: The Karatsuba decomposition of 256-bit addends.

We use the Add256bit routine shown in Algo. 5.3 to compute the addition A+B. The algorithm computes the addition of 256-bit addends hierarchically. It calls three Add128bit routines, which follow the definition given in Algo. 5.4. The first call of Add128bit, at step 1, executes the L part of Fig. 5.1, whereas the next two calls, at steps 2 and 3, execute the H part with carry-in zero and carry-in one, respectively. Step 4 chooses the correct result of the H part based on its actual carry-in, which is the carry-out of the L part ($c_0$). All three calls of Add128bit are independent and can be computed in parallel.

Algorithm 5.3: Add256bit (A, B, cin).
Input: $A = 2^{128} A_1 + A_0$, $B = 2^{128} B_1 + B_0$, $c_{in}$
Output: $A + B + c_{in}$
/* $S_0, S'_0, S_1, S'_1$ are 128-bit variables. */
/* $c_0, c_1, c'_0, c'_1$ are single bit variables for carry. */
1. $\{c_0, S_0\} \leftarrow$ Add128bit($A_0, B_0, c_{in}$)
2. $\{c'_0, S'_0\} \leftarrow$ Add128bit($A_1, B_1, 0$)
3. $\{c'_1, S'_1\} \leftarrow$ Add128bit($A_1, B_1, 1$)
/* {x,Y} indicates concatenation of x and Y. */
4. $S_1 \leftarrow S'_{c_0}$, $c_1 \leftarrow c'_{c_0}$
5. return $\{c_1, S_1, S_0\}$

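The Add128bit routine is executed by calling three independent Add64bit routines. It computes the 128-bit addition in the same way as the 256-bit addition procedure; the only difference is that here the L and H parts contain 64-bit operands instead of 128-bit operands. The Add64bit routine is defined in Algo. 5.5. It computes the addition of two 64-bit operands by three 32-bit additions. Therefore, the Add256bit routine hierarchically computes the addition of two 256-bit addends. The whole computation consists of twenty-seven 32-bit additions, which are independent and can be computed in parallel.

Algorithm 5.4: Add128bit (A, B, cin).
Input: $A = 2^{64} A_1 + A_0$, $B = 2^{64} B_1 + B_0$, $c_{in}$
Output: $A + B + c_{in}$
/* $S_0, S'_0, S_1, S'_1$ are 64-bit variables. */
/* $c_0, c_1, c'_0, c'_1$ are single bit variables for carry. */
1. $\{c_0, S_0\} \leftarrow$ Add64bit($A_0, B_0, c_{in}$)
2. $\{c'_0, S'_0\} \leftarrow$ Add64bit($A_1, B_1, 0$)
3. $\{c'_1, S'_1\} \leftarrow$ Add64bit($A_1, B_1, 1$)
4. $S_1 \leftarrow S'_{c_0}$, $c_1 \leftarrow c'_{c_0}$
5. return $\{c_1, S_1, S_0\}$

Algorithm 5.5: Add64bit (A, B, cin).
Input: $A = 2^{32} A_1 + A_0$, $B = 2^{32} B_1 + B_0$, $c_{in}$
Output: $A + B + c_{in}$
/* $S_0, S'_0, S_1, S'_1$ are 32-bit variables. */
/* $c_0, c_1, c'_0, c'_1$ are single bit variables for carry. */
1. $\{c_0, S_0\} \leftarrow A_0 + B_0 + c_{in}$
2. $\{c'_0, S'_0\} \leftarrow A_1 + B_1 + 0$
3. $\{c'_1, S'_1\} \leftarrow A_1 + B_1 + 1$
4. $S_1 \leftarrow S'_{c_0}$, $c_1 \leftarrow c'_{c_0}$
5. return $\{c_1, S_1, S_0\}$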
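The three routines share one recursive pattern, which the following software sketch makes explicit. It is a behavioural model only, not the hardware description: the single recursive function `add_hier` stands in for Add256bit, Add128bit, and Add64bit, the `counter` argument is added purely to count the 32-bit base additions, and in software the three sub-additions run sequentially rather than in parallel.

```python
def add_hier(a, b, cin, width, counter):
    """Return (carry_out, sum) for two width-bit operands plus cin, using
    the low / high-with-0 / high-with-1 decomposition down to 32 bits."""
    if width == 32:
        counter[0] += 1                      # one 32-bit base addition (one FCC)
        total = a + b + cin
        return total >> 32, total & 0xFFFFFFFF
    half = width // 2
    mask = (1 << half) - 1
    a0, a1 = a & mask, a >> half
    b0, b1 = b & mask, b >> half
    c0, s0 = add_hier(a0, b0, cin, half, counter)      # L part
    c_if0, s_if0 = add_hier(a1, b1, 0, half, counter)  # H part, carry-in 0
    c_if1, s_if1 = add_hier(a1, b1, 1, half, counter)  # H part, carry-in 1
    # 2:1 selection of the H result by the carry-out of the L part
    c1, s1 = (c_if1, s_if1) if c0 else (c_if0, s_if0)
    return c1, (s1 << half) | s0

a = (1 << 256) - 1
b = 0x0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF
counter = [0]
cout, s = add_hier(a, b, 1, 256, counter)
print(((cout << 256) | s) == a + b + 1, counter[0])
```

The counter confirms that one 256-bit addition expands into 27 independent 32-bit additions, matching the count given in the text.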

The architecture of the proposed adder based on Add256bit is depicted in Fig. 5.2. It performs one 32-bit addition ($a_i + b_i$) using a 32-bit fast carry chain (FCC). A 64-bit addition is performed by three 32-bit adders and a 32-bit 2:1 MUX. A 128-bit addition is performed by three 64-bit adders and a 64-bit 2:1 MUX. The proposed 256-bit adder is thus formed by three such 128-bit adders and a 128-bit 2:1 MUX. In the 256-bit adder shown in Fig. 5.2, twenty-seven 32-bit additions are performed in parallel, and the correct result is finally selected by multiplexers. The critical path of the proposed 256-bit adder circuit consists of one 32-bit fast carry chain and three 2:1 MUXs.

Figure 5.2: The proposed 256-bit adder based on 32-bit fast carry chain.

5.2.2 Cost and performance

Algorithms 5.3, 5.4, and 5.5 show that the computation of a 256-bit addition decomposes the operands down to 256/2^3 = 32 bits. Let h denote the decomposition level, which is incremented with each decomposition. Thus, at the beginning h = 0, at the 128-bit level it is incremented to 1, and at the 32-bit level h = 3.
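The relation between the decomposition level and the structure of the adder can be tabulated with a few lines of Python. This is only a sketch under the assumption, taken from the algorithms above, that every split replaces one addition by three half-width additions and one level of 2:1 MUX; the function name `decomposition_levels` is illustrative.

```python
def decomposition_levels(operand_bits, max_level):
    """List (h, leaf_bits, leaf_adders, mux_levels) for h = 0..max_level."""
    rows = []
    for h in range(max_level + 1):
        leaf_bits = operand_bits >> h   # operand_bits / 2^h
        leaf_adders = 3 ** h            # three sub-additions per split
        mux_levels = h                  # one 2:1 MUX level per split
        rows.append((h, leaf_bits, leaf_adders, mux_levels))
    return rows


if __name__ == "__main__":
    for row in decomposition_levels(256, 6):
        print(row)
```

For 256-bit operands, level h = 3 corresponds to 32-bit leaves, twenty-seven leaf adders, and three MUX levels on the critical path.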

However, the decomposition can be continued further, down to the single-bit level: instead of performing a 32-bit addition on a 32-bit FCC, the addition could be decomposed again. Table 5.3 shows the performance of such adder circuits for different decompositions, for 256-bit and 512-bit operands, on a Virtex-2 Pro FPGA. It is observed that decomposition down to 32 bits improves the performance (i.e., reduces the latency), but further decomposition degrades it. This is due to the fast carry chains (FCC) that are inherently available on a Virtex-2 Pro FPGA. At h = 3 (for 256-bit operands), the critical path of the proposed adder contains one 32-bit FCC and three 2:1 MUXs. Further decomposition adds more MUXs to the critical path. For example, at h = 4 the critical path contains a 16-bit fast carry chain and four 2:1 MUXs. The routing delay also increases.

Table 5.3: Performance of different decompositions for adders in FPGA.

       256-bit adder            512-bit adder
h §    Slice    Latency (ns)    Slice    Latency (ns)
0      128      14.9            256      24.2
1      266      11.2            532      16.0
2      427      10.3            853      12.7
3      695      † 10.1          1384     11.8
4      1088     10.7            2160     † 11.6
5      1706     11.4            3359     12.2
6      2674     12.2            5195     12.9

− § h indicates the height of decomposition tree.
− † minimum latency at 32-bit FCC.

The latencies of fast carry chains of different lengths and of one 2:1 MUX on a Virtex-2 Pro FPGA are shown in Table 5.4. The latencies of a 32-bit FCC and a 16-bit FCC are 6.7ns and 6.1ns, whereas the latency of a 2:1 MUX is 5.1ns. So, a 32-bit addition using a 32-bit fast carry chain takes 6.7ns, whereas the same addition using three parallel 16-bit fast carry chains and one 2:1 MUX takes 6.1ns + 5.1ns = 11.2ns. Hence, further decomposition below 32 bits degrades the performance of the adder circuits on the FPGA. It was also observed previously that the 32-bit carry propagation adder provides the best performance among all existing addition techniques on an FPGA platform. The above experimental result confirms that a 32-bit FCC is the best choice for adding two 32-bit addends on an FPGA platform. Hence, we stop the decomposition at level 3 for developing a 256-bit high-speed adder circuit, and at level 4 for the 512-bit adder.

Table 5.5 shows the performance of the 256-bit and 512-bit adders on the FPGA platform. The Brent-Kung carry adder with 32-bit FCC provides the best performance among all existing techniques (see Table 5.2). It is observed that our proposed 256-bit adder gives a 35% speedup over the Brent-Kung carry adder on a Virtex-2 Pro FPGA. Due to its hierarchical structure, the routing delay is about half of that of the Brent-Kung structure.
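The speedup figures quoted in this chapter can be reproduced from the tabulated delays. The short Python sketch below assumes that speedup is defined as the ratio of the reference delay to the new delay, minus one; this definition is an assumption, chosen because it reproduces the 35% figure from the total delays in Table 5.5. The function name `speedup_percent` is illustrative.

```python
def speedup_percent(t_reference, t_new):
    """Speedup of the new design over the reference, in percent."""
    return (t_reference / t_new - 1.0) * 100.0


if __name__ == "__main__":
    # Total delays (ns) of the 256-bit adders from Table 5.5.
    print(speedup_percent(13.63, 10.07))
    # Route delays (ns) of the same two adders, as a ratio.
    print(3.13 / 6.12)
```

The second printed value is the ratio of the two route delays, which is close to one half.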



Table 5.4: Latencies of circuit elements on a Virtex-2 Pro FPGA.

Fast carry chain (FCC)                     2:1 MUX
32-bit    16-bit    8-bit     4-bit
6.7ns     6.1ns     5.8ns     5.6ns        5.1ns

Table 5.5: Performance comparison of proposed adder on Virtex-2 Pro FPGA.

                      256-bit adders                         512-bit adders
Adder Type            Slice   LD (ns)  RD (ns)  TD (ns)      Slice   LD (ns)  RD (ns)  TD (ns)
Proposed              695     6.94     3.13     10.07        2160    7.65     3.96     11.61
Brent-Kung Carry †    373     7.51     6.12     13.63        1220    9.12     7.23     16.35

− † Brent-Kung carry with 32-bit FCC is the best among existing adders (see Table 5.2).
− LD, RD, and TD stand for Logic Delay, Route Delay, and Total Delay.

5.3 Fast Fp Multipliers on FPGA

Multiplication is another important underlying operation in cryptography. For a cryptographic scheme defined over Fp it is performed as (A · B) mod p, where A, B ∈ Fp. The operation (A · B) mod p is also known as modular multiplication. In the schoolbook method, modular multiplication is performed by a multiplication followed by a division operation. Due to the division, this procedure is very costly. Different techniques have been developed for avoiding the division operation in the modular multiplication procedure [117].

Interleaved multiplication [180] (see § 2.1.2) is one of the procedures that can compute (A · B) mod p without a final division. It is a bit-serial, addition-based algorithm which takes k iterations, where k is the bit length of the operands. Therefore, in practice the performance of this algorithm depends on the efficiency of the underlying adder circuit.
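As a point of reference, a bit-serial interleaved multiplication can be sketched in Python as follows. This is a minimal model assuming the common most-significant-bit-first formulation, which may differ in detail from the variant given in § 2.1.2; the function name `interleaved_mulmod` is illustrative, and A is assumed to be already reduced modulo p.

```python
def interleaved_mulmod(a, b, p, k):
    """Compute (a * b) mod p bit-serially, assuming 0 <= a < p and b < 2^k."""
    s = 0
    for i in range(k - 1, -1, -1):   # scan B from its most significant bit
        s = 2 * s                    # doubling step
        if s >= p:
            s -= p                   # one conditional subtraction suffices
        if (b >> i) & 1:             # addition only when b_i = 1
            s += a
            if s >= p:
                s -= p
    return s


if __name__ == "__main__":
    p = 2**255 - 19                  # example prime, chosen for illustration
    a, b = 3**150 % p, 5**100 % p
    print(interleaved_mulmod(a, b, p, 256) == (a * b) % p)
```

Every iteration consists only of additions, subtractions, and comparisons on k-bit values, which is why the adder latency dominates the multiplication time.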

Table 5.6 shows the costs and performance of interleaved multipliers based on different adder circuits. It is observed that the implementation based on our proposed high-speed adder achieves a 63% speedup compared to the one based on the carry propagation adder, and a 36% speedup compared to the one based on the Brent-Kung carry adder with 32-bit FCC. This speedup is achieved through the efficient parallelism in the addition operation (our high-speed adder) on the FPGA platform. In the subsequent section we propose a modification of the interleaved multiplication algorithm that exposes parallelism within an iteration, which helps to speed it up further.

Table 5.6: Performance of the interleaved multipliers on Virtex-2 Pro FPGA.

                        256-bit Fp multiplier                   512-bit Fp multiplier
Adder Type              Slice   Frequency (MHz)  Time (µs)      Slice   Frequency (MHz)  Time (µs)
Proposed high-speed     2701    62               4.1            7890    53               9.7
Carry propagation       808     38               6.7            1560    22               23.0
Brent-Kung carry †      1853    46               5.6            6385    37               13.8
Carry select †          1838    44               5.8            6324    35               14.6
Carry lookahead †       2114    45               5.7            6826    36               14.2

− † adders are based on 32-bit fast carry chain (or FCC).

5.3.1 Proposed Multiplication Technique

This section proposes a modified interleaved multiplication algorithm based on the Montgomery ladder [125] (described in § 2.1.4). The modified algorithm for 256-bit operands (IMML256bit) is shown in Algo. 5.6. It can be scaled accordingly for other bit lengths. The main objective of the modification is to perform modular addition and doubling in parallel. In order to compute (A · B) mod p, it uses two temporary variables S0 and S1, which are initialized to A and 0, respectively. At every iteration, it performs $S_{b_i} = (S_{b_i} + S_{\bar{b}_i}) \bmod p$ and $S_{\bar{b}_i} = (2 S_{\bar{b}_i}) \bmod p$, where $\bar{b}_i$ denotes the complement of the bit $b_i$. These two operations are independent of each other.

The respective architecture is shown in Fig. 5.3. It consists of three adders, two registers, five 2:1 MUXs, and some additional circuit elements for controlling the data flow in the data path. The registers S0 and S1 are initialized to A and 0, respectively. At every iteration the registers are updated with the results produced by the Fp addition and doubling operations, as defined in Algo. 5.6. After 256 such iterations the register S1 holds the Fp multiplication result. The architecture is scalable to shorter and longer operands. It is inherently programmable and supports all primes that fit within the given length (256 or 512 bits). We show the results of the 256-bit and 512-bit implementations of the proposed multiplier on a Virtex-2 Pro FPGA.
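The data flow described above can be modelled in software. The Python sketch below follows the structure of Algo. 5.6 under stated assumptions: the helper names `add256`, `reduce_once`, and `imml256` are illustrative, `add256` models an Add256bit unit with plain integer arithmetic, and the subtraction of p is modelled as an addition of the 1's complement of p with carry-in 1.

```python
K = 256
MASK = (1 << K) - 1


def add256(a, b, cin):
    """Model of a 256-bit adder: return (carry_out, 256-bit sum)."""
    total = a + b + cin
    return total >> K, total & MASK


def reduce_once(carry, t, p):
    """Select t - p when {carry, t} >= p, else t (requires {carry, t} < 2p)."""
    c_sub, t_sub = add256(t, ~p & MASK, 1)   # t + (1's complement of p) + 1
    return t_sub if (carry or c_sub) else t


def imml256(a, b, p):
    """Interleaved Montgomery ladder sketch: (a * b) mod p for p < 2^256."""
    s = [a % p, 0]                           # S0 <- A, S1 <- 0
    for i in range(K - 1, -1, -1):
        bi = (b >> i) & 1
        # Both operations read the old register values and are independent.
        c0, t_sum = add256(s[0], s[1], 0)             # S0 + S1
        c1, t_dbl = add256(s[1 - bi], s[1 - bi], 0)   # doubling (shift by 1)
        s[bi] = reduce_once(c0, t_sum, p)
        s[1 - bi] = reduce_once(c1, t_dbl, p)
    return s[1]


if __name__ == "__main__":
    p = 2**255 - 19                          # example prime for illustration
    a, b = 3**150 % p, 5**100 % p
    print(imml256(a, b, p) == (a * b) % p)
```

The model keeps the difference S0 − S1 congruent to A throughout, so after the last iteration S1 holds (A · B) mod p, and every iteration performs exactly one modular addition and one modular doubling regardless of the bit value.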

Algorithm 5.6: Interleaved Montgomery ladder, IMML256bit(A, B, p)
Input: A, B, p, where $B = \sum_{i=0}^{255} 2^i b_i$
Output: AB mod p
/* T0, T′0, T1, T′1 are 256-bit variables. */
/* c0, c1, c′0, c′1 are single bit variables for carry. */
/* {x, Y} indicates concatenation of x and Y. \bar{p} indicates the 1's complement of p, and \bar{b}_i the complement of b_i. */
1. S1 ← 0 and S0 ← A
2. for i from 255 down to 0 do
     {c0, T_{b_i}} ← Add256bit(S_{b_i}, S_{\bar{b}_i}, 0)
     {c′0, T′_{b_i}} ← Add256bit(T_{b_i}, \bar{p}, 1)
     {c1, T_{\bar{b}_i}} ← S_{\bar{b}_i} << 1
     {c′1, T′_{\bar{b}_i}} ← Add256bit(T_{\bar{b}_i}, \bar{p}, 1)
     if (c0 or c′0) then S_{b_i} ← T′_{b_i} else S_{b_i} ← T_{b_i}
     if (c1 or c′1) then S_{\bar{b}_i} ← T′_{\bar{b}_i} else S_{\bar{b}_i} ← T_{\bar{b}_i}
3. return S1

Figure 5.3: The proposed Montgomery ladder based interleaved multiplier.

The operands are much longer than 32 bits. We break the addition operation into smaller operations by following the Karatsuba decomposition procedure. The smaller operations are performed in parallel, as described in the previous section, and their results are then combined into the final result. The combination in this case is performed by multiplexers (MUX). Each level of the binary decomposition procedure adds one additional 2:1 MUX to the critical path.

The circuit elements that form the critical path of the above 256-bit multiplier with different decompositions are listed in Table 5.7. The critical path of the multiplier is formed by two 256-bit adders and two additional MUXs. Of these two additional MUXs, one selects the correctly reduced result of the respective modular addition or doubling operation, whereas the other restores the intermediate results into the respective registers after an iteration, based on bi. The critical path of the multiplier varies for different adder circuits. It is observed that the minimum critical path delay is obtained if the Add256bit units are built from our proposed 256-bit high-speed adder. The respective minimum critical path is found at the decomposition level h = 3.

Table 5.7: Circuit elements in the critical path of 256-bit Fp multipliers.

h §    Critical path                                Latency (ns)
0      2 * 256-bit carry chain + 2 MUX              18.36
1      2(128-bit carry chain + 1 MUX) + 2 MUX       16.02
2      2(64-bit carry chain + 2 MUX) + 2 MUX        12.89
3      2(32-bit carry chain + 3 MUX) + 2 MUX        12.50
4      2(16-bit carry chain + 4 MUX) + 2 MUX        13.28
5      2(8-bit carry chain + 5 MUX) + 2 MUX         17.97

− § h indicates the height of decomposition tree, i.e. decomposition stops at 256/2^h bits.
− MUX indicates 2:1 multiplexer.

5.3.2 Cost and Performance of Multiplier

The designs have been described in Verilog HDL. Synthesis, mapping, placement, and routing have been done with Xilinx ISE 7.1i. The results are based on post-routing simulation using the ModelSim XE III 6.0a simulator. The target device is a Xilinx Virtex-II Pro FPGA. Table 5.8 shows the cost and performance of the proposed 256-bit and 512-bit Fp multipliers. For both cases we report the cost and computation time of an Fp multiplication where the operands are decomposed at different levels.

In Table 5.8, the column h indicates the decomposition level, i.e., the decomposition stops when the operands are 256/2^h bits wide (512/2^h bits in the 512-bit design). As expected, it is observed in both the 256-bit and 512-bit multipliers that the maximum speed (minimum time) is achieved when the operand decomposition stops at 32 bits. This is due to the use of the in-built 32-bit fast carry chain (or FCC) of the FPGA device.



Table 5.8: Performance of the proposed multiplier with different decompositions on a Virtex-II Pro FPGA platform.

       256-bit implementation                    512-bit implementation
h §    Slice Area  Time (µs)  Area×Time          Slice Area  Time (µs)  Area×Time
0      2271        4.7        10 674             5119        14.3       73 201
1      2123        4.1        8 704              4362        12.9       56 270
2      2808        3.3        9 266              5637        9.0        50 733
3      3475        † 3.2      11 120             7775        7.7        59 868
4      4808        3.4        16 347             9792        † 7.3      71 482
5      6888        4.6        31 685             13630       7.4        100 862

− § h indicates the height of decomposition tree, i.e. decomposition stops at 256/2^h bits.
− † minimum multiplication latency at 32-bit FCC.

Table 5.9 shows the costs and multiplication times of the proposed multiplier implemented with different adders. For example, the carry propagation adder (CPA) based multiplier is developed by following the proposed interleaved multiplication on the Montgomery ladder (Algo. 5.6), where the additions are performed by a CPA of the given length (256 or 512 bits). Similarly, the Brent-Kung carry based multiplier is designed by following Algo. 5.6, where the additions are performed by the Brent-Kung carry adder with 32-bit FCC. The high-speed adder based implementation of the proposed multiplication technique achieves a 36% speedup compared to the one based on the 256-bit Brent-Kung carry adder with 32-bit FCC. It is also observed that the proposed multiplication technique with the high-speed adder achieves a 28% speedup compared to the interleaved multiplication with the high-speed adder (shown in Table 5.6).

Table 5.9: Performance of the proposed Fp multiplier with different adders on a Virtex-2 Pro FPGA platform.

                        256-bit Fp multiplier                   512-bit Fp multiplier
Adder Type              Slice   Frequency (MHz)  Time (µs)      Slice   Frequency (MHz)  Time (µs)
Proposed high-speed     3475    80               3.2            9792    70               7.3
Carry propagation       2271    54               4.7            5119    35               14.3
Brent-Kung carry †      2547    60               4.3            8722    48               10.7
Carry select †          2515    58               4.4            8705    46               11.2
Carry lookahead †       2874    59               4.4            9117    47               10.9

− † adders are based on 32-bit fast carry chain.



Table 5.10 compares the proposed Fp multiplier with existing contemporary designs. Some implementations do not consider the pre- and post-computation costs and therefore apparently show lower multiplication costs. The designs reported in [134,135] are for some fixed primes p < 25. Thus they are not included in the comparison table. A fair comparison of the performance of the proposed interleaved multiplier can be made with existing architectures that use the same algorithm on the same platform.

Different researchers have attempted to improve the performance of the interleaved multiplier by utilizing carry save adder (CSA) units. We have implemented the redundant interleaved architecture of [121] and compared it with our proposed multipliers. A disadvantage of a CSA based algorithm is that it requires at least one final addition to obtain the correct result. The redundant interleaved multiplier additionally requires some pre-computations which depend on the multiplicand. The pre-computation and final addition costs are added to the multiplication cost. The pre- and post-computations of a CSA based multiplier require absolute addition of two large operands. In such a multiplier we perform the absolute additions with our proposed fast adder circuit. The latencies of the CSA and of our proposed adder are 4.60ns and 10.07ns, respectively. Thus, in the CSA based multiplier, the pre- and post-computations are performed with a divide-by-four clock, whereas the iterative computations are performed on the CSA with the original clock.

Figure 5.4 depicts a graphical view of the performance of contemporary designs. To our knowledge, all such existing designs are based on CSA and discard the additional costs due to pre- and post-computations. According to the results reported by different authors, the best known design takes 5.5µs for computing one 256-bit Fp multiplication based on the interleaved algorithm. It may be noted that our proposed 256-bit adder reduces the delay of the above multiplier from 5.5µs to 3.7µs. Further, our proposed IMML32FCC multiplier takes only 3.2µs for the same operation. Hence, it gives a 70% speedup over the best known existing designs described in [25] and [95]. However, one drawback of the proposed design is that it requires a higher slice area, which also increases the overall area × time per bit or AT/B value.
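The AT/B metric can be checked directly from the figures reported in Table 5.10. The following Python sketch assumes AT/B = (slice area × time in µs) / bit length, as stated in the table note; the function name `at_per_bit` is illustrative.

```python
def at_per_bit(slices, time_us, bits):
    """Area x time per bit: slice count times latency in µs, per operand bit."""
    return slices * time_us / bits


if __name__ == "__main__":
    # (slices, time in µs, bit length) as listed in Table 5.10.
    designs = {
        "Proposed": (3475, 3.2, 256),
        "Redundant interleaved [121], implemented by us": (3711, 3.7, 256),
        "Interleaved [95]": (1030, 5.5, 256),
    }
    for name, (slices, time_us, bits) in designs.items():
        print(name, round(at_per_bit(slices, time_us, bits), 1))
```

The computed values agree with the AT/B column for these rows, which illustrates that a smaller but slower design can still have the lower AT/B value.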

For the sake of completeness, further comparisons have been furnished with Montgomery reduction based multipliers implemented both on FPGA and in CMOS libraries. The Montgomery reduction involves conversions, which are not incorporated in the values mentioned in Table 5.10. However, this conversion needs to be performed only once at the beginning and once at the end when computing exponentiation-like algorithms, where multiplication is performed repeatedly.

Table 5.10: Performance comparison of different Fp multipliers.

Reference                                            Multiplication type                   Platform          Bit length   Area                            Time      AT/B §
Proposed                                             Interleaved with Montgomery ladder    Virtex II Pro     256          3475 slices                     3.2µs     43.4
Bunimov et al. [121] (implemented by us)             Redundant Interleaved                 Virtex II Pro     256          3711 slices                     3.7µs     53.6
Bunimov et al. [121] (implemented by [25], 2009) †   Redundant Interleaved                 Virtex II         256          1834 slices                     5.5µs     39.4
Amanor et al. [95], 2005                             Interleaved                           Virtex 2000E      256          1030 slices                     5.5µs     22.1
‡ AbdelFattah et al. [25], 2009                      Modified Interleaved                  Virtex II         256          2361 slices                     6.9µs     63.6
Ors et al. [120], 2003                               Montgomery reduction                  V812E-BG-560      256          1548 slices                     7.7µs     46.6
                                                                                                             512          2972 slices                     16.2µs    94.0
Daly et al. [106], 2004                              Montgomery reduction                  Virtex II 2000    256          3109 slices                     5.8µs     70.4
McIvor et al. [74], 2004                             Montgomery reduction                  Virtex II Pro     256          4663 slices + 64 Multipliers    1.3µs     88.7
Amanor et al. [95], 2005                             Montgomery reduction                  Virtex 2000E      256          1800 slices                     5.6µs     39.4
Harris et al. [99], 2005                             Montgomery reduction                  Virtex II 2000    256          5598 LUT + 5n bit RAM           3.9µs     46.3
                                                                                                             1024                                         16.0µs    48.2
Crowe et al. [98], 2005                              Montgomery reduction                  Virtex II 2000    256          5267 slices                     5.8µs     119.3
Sakiyama et al. [79], 2006                           Montgomery reduction                  Virtex-II Pro     256          4836 slices                     4.0µs     75.6
Khaleel et al. [80], 2006                            Montgomery reduction                  Virtex IV         256          34345 slices                    0.4µs     53.7
Kawakami et al. [44], 2008                           Montgomery reduction                  Virtex II Pro     256          1704 slices + 33 Multipliers    0.38µs    12.4
Savas et al. [158], 2000                             Montgomery reduction                  1.2µm CMOS        256          −                               6.6µs     −
Liu et al. [96], 2005                                Montgomery reduction                  0.13µm CMOS       1024         221 k gates                     1.5µs     −
Kaihara et al. [97], 2005                            Montgomery reduction                  0.35µm CMOS       256          18 k cells                      1.5ms     −

− † Does not consider the final CPA sum and precomputations which are required for distinct multiplicands.
− ‡ Redundant Interleaved [121] requires some precomputations which are considered to be stored inside the FPGA in [25].
− § A: Slice Area, T: Time in µs, B: Bit Length.

Figure 5.4: Different implementations of interleaved multiplier on FPGA. [Plot of slice area against Fp multiplication time (µs) for: Our 2010; Amanor et al. 2005; AbdelFattah et al. 2009; Bunimov et al. 2003 designed by AbdelFattah et al. 2009; Bunimov et al. 2003 designed by us.]

5.3.3 Security Against Timing and Power Attacks

Our proposed multiplier has the additional property of security against timing attacks and simple power analysis (SPA) attacks. This is achieved through the balanced computation of the proposed architecture. The proposed design performs every multiplication in constant time, which ensures its security against timing attacks. At every iteration it performs a fixed amount of computation, which leads to a fixed power consumption profile. Therefore, by observing a single trace of the power profile it is not possible to distinguish one iteration from another, which ensures its security against the simple power analysis (or SPA) attack.

In order to demonstrate its security against the SPA attack, we first show how this attack finds the secret by exploiting a multiplier. Let us assume that a multiplication (A · B) mod p is performed on a secret B, which is represented by $\sum_{i=0}^{k-1} 2^i b_i$. Let us further assume that the multiplication is performed by the interleaved multiplication algorithm. The doubling operation (2A) mod p is performed at every iteration, but the addition is performed at an iteration if and only if the respective bit bi = 1. The power consumption profile of a naive implementation following such an algorithm is shown in Fig. 5.5. It is easily exploited to distinguish the iterations where the addition is performed from the iterations where it is not. Hence, simple power analysis reveals all secret bit values bi, where 0 ≤ i ≤ k − 1.

Figure 5.5: Simple power analysis on a naive multiplier. [Power trace: power consumption (V, ×10^−3) against samples, with the iterations annotated by the bit values 1 0 1 1 0 1.]
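The leakage mechanism can be illustrated with an idealised software model. In the Python sketch below, each iteration of a naive multiplier is reduced to the sequence of operations it executes ("D" for doubling, "A" for addition); this abstraction stands in for the visibly different power patterns of the two iteration types and is an assumption of the sketch, not a measurement. The function names are illustrative.

```python
def naive_operation_trace(b, k):
    """Operations executed per iteration when addition depends on b_i."""
    trace = []
    for i in range(k - 1, -1, -1):
        ops = "D"                    # doubling happens in every iteration
        if (b >> i) & 1:
            ops += "A"               # addition happens only when b_i = 1
        trace.append(ops)
    return trace


def recover_bits(trace):
    """An observer reads one secret bit from each iteration pattern."""
    return [1 if ops == "DA" else 0 for ops in trace]


if __name__ == "__main__":
    secret = 0b101101
    trace = naive_operation_trace(secret, 6)
    print(trace)
    print(recover_bits(trace))
```

With the example value used here, the recovered bit sequence is 1 0 1 1 0 1, the same pattern that is annotated on the trace in Fig. 5.5.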

Figure 5.6: Simple power analysis on our proposed multiplier. [Power trace: power consumption (V, ×10^−3) against samples.]

The same analysis is performed on our proposed Fp multiplier. The respective power consumption profile is shown in Fig. 5.6. In this waveform it is not possible to distinguish one iteration from another, because all iterations consist of the same computation in our proposed multiplication technique. Hence, the proposed multiplier is indeed secure against the simple power analysis (or SPA) attack.

5.4 The PGAU and ECSM Hardware Based on Fast Adder

The programmable GF(p) arithmetic unit (PGAU) and the elliptic curve scalar multiplication (ECSM) hardware have been described in the previous chapter. In this section we show the performance gain of this hardware when the proposed fast adders are used on an FPGA platform. We replace all the adder units of the PGAU architecture by fast adders. The modified unit is called PGAU-FA. The performance gain and cost overhead of the PGAU-FA are compared with our previous PGAU in Table 5.11. It is observed that, due to the proposed addition technique, the 256-bit PGAU-FA gives a 41% speedup compared to the PGAU of the same length.

Table 5.11: Performances of PGAU-FA and PGAU on Virtex-II Pro FPGA.

       PGAU-FA                                            PGAU [Chapter 4]
bit    slice    Frequency (MHz)  TM (µs)  TID (µs)        slice    Frequency (MHz)  TM (µs)  TID (µs)     Speedup
192    6 743    56               3.43     6.86            3 895    44               4.36     8.72         27%
224    7 814    53               4.23     8.46            4 675    41               5.46     10.92        29%
256    9 408    52               4.92     9.85            5 384    37               6.92     13.84        41%

− TM and TID indicate Fp (or GF(p)) multiplication and inversion/division times, respectively.

Next we redesign the whole dual-core ECSM hardware based on PGAU-FA instead of PGAU. We call this new ECSM hardware ECSM-FA. The performance of the ECSM-FA is compared with the previous ECSM in Table 5.12. The speedup factor shows that, due to the fast adder circuits, a 256-bit elliptic curve scalar multiplication achieves a 39% speedup over its previous implementation.

Table 5.12: Comparison between ECSM hardware with and without the fast adder.

       ECSM with PGAU-FA (ECSM-FA)                 ECSM [Chapter 4]
bit    Slice     Frequency (MHz)  Time (ms)        Slice     Frequency (MHz)  Time (ms)      Speedup
192    10 350    55               3.50             8 972     43               4.47           27%
224    11 936    52               5.00             10 386    40               6.50           30%
256    13 350    50               6.75             11 953    36               9.38           39%

− Comparison is made on Virtex-II Pro platform.
− Time in ms indicates one ECSM computation time.

5.5 Conclusion

This chapter has presented techniques to improve the performance of different adders and Fp multipliers on FPGA platforms. The designs are based on the proper usage of the in-built carry chains in the FPGA device. This work shows that the carry chains have a direct impact on the level of decomposition at which the fastest adder is obtained. The interleaved multiplication algorithm has been modified based on the Montgomery powering ladder. The parallel architecture based on the proposed fast adder circuits has been shown to give a 70% speedup over the best known existing designs.

We redesigned the PGAU and ECSM hardware described in the last chapter using our proposed fast adder circuit. It is observed that the proposed adder technique indeed improves the overall performance of the ECSM hardware. The 256-bit ECSM cryptoprocessor achieves a 39% speedup at the cost of only 12% additional slices over its old design.

In summary, this chapter has presented efficient design techniques for prime field arithmetic on the FPGA platform. Towards the goal of designing an efficient and secure pairing cryptoprocessor, the next chapter deals with the selection of a pairing-friendly curve and the design of the respective cryptoprocessor. The pairing cryptoprocessor is based on the fast addition and multiplication techniques developed in this chapter for the FPGA platform.


Chapter 6

High Speed Flexible Pairing

Cryptoprocessor

APART FROM ELLIPTIC CURVE scalar multiplication, pairing computation is another costly operation in pairing based cryptography. The security and computation efficiency of a cryptographic bilinear pairing mostly depend on the underlying algebraic curves. As per the NIST recommendation, 128-bit security is essential beyond 2030 [62]. Some of the existing curves which provide 128-bit security are: Barreto-Naehrig curves (BN curves) defined over a 256-bit prime field with embedding degree 12, supersingular curves defined over a 1223-bit binary field with embedding degree 4, and supersingular curves defined over a 509-bit characteristic-3 field with embedding degree 6 [42]. Among them, the BN curves are currently the most popular.

This chapter presents a Pairing Cryptoprocessor (PCP) over Barreto-Naehrig curves. The proposed architecture is specifically designed for the field programmable gate array (FPGA) platform. The objective of this chapter is to utilize the efficient implementation of the underlying finite field primitives, namely the adder, subtractor, and multiplier described in the previous chapter, for developing a pairing cryptoprocessor. This is brought about in two stages of the cryptoprocessor design:

1. A configurable $F_{p^k}$ arithmetic unit (CAU) has been developed which has inherent configurability to perform arithmetic in $F_p$ and $F_{p^2}$ for any p less than the given length.

2. The PCP has been developed using two CAU cores as arithmetic operators along with additional control units and memory elements.

Extensive parallelism techniques have been proposed to realize a PCP which requires fewer clock cycles than the existing designs. The proposed design is the first reported result on an FPGA platform for 128-bit security. The proposed cryptoprocessor provides the flexibility to choose the curve parameters for pairing computations.

The cryptoprocessor needs 1764 k, 1242 k, and 856 k cycles for the computation of the Tate, ate, and R-ate pairings, respectively. On a Virtex-4 FPGA device it consumes 52 kSlices at 50MHz and computes the Tate, ate, and R-ate pairings in 35.3 ms, 24.9 ms, and 17.0 ms, respectively, which are comparable to known CMOS implementations.

6.1 Introduction

A cryptographic pairing [93] is a bilinear map $G_1 \times G_2 \to G_3$, where $G_1$ and $G_2$ are additive groups and $G_3$ is a multiplicative group. Let E be an elliptic curve defined over $F_q$ having even embedding degree k with respect to a prime divisor r of the order of the elliptic curve, $\#E(F_q)$. Suppose further that $r^3$ does not divide $\#E(F_q)$ and $r^2$ does not divide $q^k - 1$. Many cryptographic pairings such as the Tate pairing [138], ate pairing [77], and R-ate pairing [28] choose $G_1$ to be an order-r cyclic subgroup of $E(F_q)$, $G_2$ to be an order-r cyclic subgroup of $E(F_{q^k})$, and $G_3$ to be a subgroup of $F^*_{q^k}$ with order r. The above pairings are called asymmetric pairings as $G_1$ and $G_2$ are different. The pairings considered in this chapter are the (reduced) Tate pairing $t_r : G_1 \times G_2 \to G_3$, the ate pairing $a_r : G_2 \times G_1 \to G_3$, and the R-ate pairing $R_r : G_2 \times G_1 \to G_3$.

The selection of such groups as well as of the field types has a strong impact on the security and computation cost of the pairing. We choose Barreto-Naehrig curves (or BN curves) [76] for computing the above pairings. The BN curves are a type of elliptic curve E defined over prime fields $F_p$, having prime order $\#E(F_p)$ and embedding degree k = 12. These curves are especially well suited for the computation of the above pairings at the 128-bit security level by choosing a 256-bit prime p.

This chapter proposes a cryptoprocessor for the computation of pairings over BN curves. The proposed pairing cryptoprocessor (PCP) is flexible with respect to the curve parameters, including the prime p. It supports all primes within the given length (256 bits). The field programmable gate array (FPGA) is one of the suitable platforms for implementing cryptographic algorithms. In this chapter, we develop a parallel configurable hardware for computing addition, subtraction, and multiplication in $F_p$ and $F_{p^2}$. Existing techniques to speed up arithmetic in extension fields (see [61, 78]) are used on top of it for fast computation in $F_{p^6}$ and $F_{p^{12}}$. The major contributions of the chapter are highlighted here.

• The chapter introduces an underlying configurable primitive for $F_{p^k}$ arithmetic on the FPGA platform.

• It proposes a pairing hardware that is flexible with respect to curve parameters.

• Parallelism techniques are adopted at different levels, including the underlying finite field operations, which drastically reduces the overall cycle count of the pairing computation.

• The proposed FPGA design achieves a speed comparable with the existing CMOS design.

The proposed configurable $F_{p^k}$ arithmetic cores and parallel computation result in a significant improvement in the performance of the Tate, ate, and R-ate pairings over BN curves. The result is demonstrated for a 256-bit BN curve that provides 128-bit security.

The chapter starts with a brief description of cryptographic pairings and BN curves. Then it describes the pairing cryptoprocessor, followed by the experimental results based on BN curves and a comparative study with existing contemporary designs.

6.2 Prior Work

Software implementation results for pairings over BN curves have been shown in [8], [42], [39], and [61]. Highly optimized software running on a 64-bit Core 2 processor computes an R-ate pairing in only 10,000,000 cycles. The software implementation of [8] gives the speed record for the computation of the Optimal-ate pairing on BN curves, which is computed in 4,470,408 cycles on an Intel Core 2 Quad Q6600 processor.

An application specific instruction-set processor (ASIP) has been proposed in [17]. It is

designed by extending a RISC core with additional scalable functional units. It requires a

special programming environment in order to execute pairings. Therefore, the authors have

developed a special C compiler. Implementation result shows that the ASIP can compute

an Optimal-ate pairing in 15.8 ms over a 256-bit BN curve at 338 MHz with a 130 nm

CMOS library.

A pairing processor specifically for BN curves has been proposed in [19]. It exploits the characteristic of the field defined by BN curves and chooses the curve parameters such that the underlying Fp multiplication becomes more efficient. It shows a 5.4 times speedup of a pairing computation compared to the ASIP proposed in [17]. However, the main limitation of the pairing processor [19] is that it is useful only for computing pairings over a fixed BN curve.

6.2.1 Choice of Elliptic Curve

The most important parameters for cryptographic pairings are the underlying finite field, the order of the curve, the embedding degree, and the orders of G1, G2, and G3. These parameters should be chosen such that the best exponential-time algorithms for solving the discrete logarithm problem (DLP) in G1 and G2 and the sub-exponential-time algorithms for solving the DLP in G3 take longer than a chosen security level. We choose 128-bit symmetric-key security for the current cryptoprocessor. For the 128-bit security level, the National Institute of Standards and Technology (NIST) recommends a prime group order of 256 bits for E(Fp) and of 3072 bits for the finite field Fpk [62].

Barreto-Naehrig curves, introduced in [76], are elliptic curves over fields of prime order p with embedding degree k = 12. The BN curve is represented as:

$E_{\mathbb{F}_p} : Y^2 = X^3 + 3$

with BN parameter z = 6000000000001F2D (in hexadecimal) [61]. It forms the group E(Fp) with order $\#E(\mathbb{F}_p) = r = 36z^4 + 36z^3 + 18z^2 + 6z + 1$, which is a 256-bit prime of Hamming weight 91. The field characteristic $p = 36z^4 + 36z^3 + 24z^2 + 6z + 1$ is a 256-bit prime of Hamming weight 87, and $t - 1 = p - r = 6z^2$ is a 128-bit integer of Hamming weight 28. Here $t = p + 1 - r$ is the trace of $E_{\mathbb{F}_p}$. The prime satisfies p ≡ 7 (mod 8), so −2 is a quadratic non-residue (we represent it by β), and p ≡ 1 (mod 6).
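The parameter relations above can be checked numerically. The following is a minimal Python sketch, not part of the original design flow; the helper names `bn_parameters` and `hamming_weight` are hypothetical and introduced only for illustration.

```python
# BN parameter z from the text (hexadecimal).
Z = 0x6000000000001F2D


def bn_parameters(z):
    """Return (p, r, t) for a BN curve with parameter z."""
    p = 36 * z**4 + 36 * z**3 + 24 * z**2 + 6 * z + 1  # field characteristic
    r = 36 * z**4 + 36 * z**3 + 18 * z**2 + 6 * z + 1  # group order #E(Fp)
    t = p + 1 - r                                      # trace of Frobenius
    return p, r, t


def hamming_weight(n):
    """Number of one bits in the binary expansion of n."""
    return bin(n).count("1")


if __name__ == "__main__":
    p, r, t = bn_parameters(Z)
    print("bit lengths of p, r, t-1:",
          p.bit_length(), r.bit_length(), (t - 1).bit_length())
    print("Hamming weights of p, r, t-1:",
          hamming_weight(p), hamming_weight(r), hamming_weight(t - 1))
    print("p mod 8 =", p % 8, " p mod 6 =", p % 6)
```

Running the script prints the bit lengths, Hamming weights, and congruences so that they can be compared with the values quoted in the text.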

6.2.2 Pairing Computation

Several variants of the Tate pairing can be computed over BN curves. Among them, the ate [77], R-ate [28], and Optimal-ate [3] pairings are the most widely used. Algorithm 6.1 shows the computation of the Tate pairing. It consists of two major steps: the computation of the Miller function and the final exponentiation. The first part is computed by an optimized version of the Miller algorithm [175]. Several optimizations of this algorithm have been presented in [138]; the resulting algorithm is called the BKLS algorithm. Pairings other than Tate are computed in a similar way, using a different parameter in place of r and interchanging the input points [42].

Algorithm 6.1: Computing the Tate pairing.
Input: $P \in G_1$ and $Q \in G_2$.
Output: $e_r(P, Q)$.
Write r in binary: $r = \sum_{i=0}^{L-1} r_i 2^i$.
$T \leftarrow P$, $f \leftarrow 1$.
for i from L−2 downto 0 do
    $T \leftarrow 2T$. $f \leftarrow f^2 \cdot l_{T,T}(Q)$.
    if $r_i = 1$ and $i \neq 0$ then
        $T \leftarrow T + P$. $f \leftarrow f \cdot l_{T,P}(Q)$.
    end
end
return $f^{(q^k - 1)/r}$.
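To illustrate only the control flow of the Miller loop, the sketch below lists the sequence of doubling ("D") and addition ("A") steps for a given loop parameter. It is a hedged illustration in Python: the function name `miller_schedule` and the flag `skip_last_addition` are hypothetical, and skipping the addition at i = 0 is an assumption that follows the common formulation of the algorithm.

```python
def miller_schedule(r, skip_last_addition=True):
    """Return the list of Miller-loop steps for loop parameter r.

    "D" marks a doubling step (T <- 2T, f <- f^2 * l_TT(Q)),
    "A" marks an addition step (T <- T + P, f <- f * l_TP(Q)).
    """
    L = r.bit_length()
    steps = []
    for i in range(L - 2, -1, -1):      # from bit L-2 down to bit 0
        steps.append("D")               # every iteration doubles
        bit = (r >> i) & 1
        if bit == 1 and not (skip_last_addition and i == 0):
            steps.append("A")           # add only when the bit is set
    return steps


if __name__ == "__main__":
    # Toy loop parameter r = 0b1011; not a cryptographic value.
    print(miller_schedule(0b1011))
```

The number of "D" entries is always one less than the bit length of the loop parameter, while the number of "A" entries depends on its Hamming weight, which is why low-weight parameters are preferred.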

BN curves also admit a sextic twist [42], which means that the point Q is mapped to a point Q′ defined over Fp2. Thus the line functions $l_{T,T}(Q)$ and $l_{T,P}(Q)$ are computed over Fp2 instead of Fp12. The values of the line functions are represented as $l_0 + l_1 W^2 + l_2 W^3$, with $l_0 \in \mathbb{F}_p$, $l_1, l_2 \in \mathbb{F}_{p^2}$, and a quadratic non-residue W over Fp2. The Miller function f is computed over Fp12 and is represented as $f_0 + f_1 W + f_2 W^2 + f_3 W^3 + f_4 W^4 + f_5 W^5$, with $f_i \in \mathbb{F}_{p^2}$. So in the Tate pairing computation, $f^2$, $f \cdot l_{T,T}(Q)$, and $f \cdot l_{T,P}(Q)$ are performed in Fp12, whereas all other computations are performed in Fp and Fp2.

The detailed procedure of pairing computation, including the final exponentiation on BN curves, is described in [42] and [61]. Another efficient way of computing the final exponentiation is described in [20]. This chapter follows the descriptions given in [42] for computing the Tate, ate, and R-ate pairings. We use the Jacobian coordinate system for elliptic curve operations, where a point (X, Y, Z) corresponds to the point (x, y) in affine coordinates with $x = X/Z^2$ and $y = Y/Z^3$. Let (m, s, i) denote the cost of multiplication, squaring, and inversion in Fp. Using Jacobian coordinates, the Miller function of the Tate pairing on a BN curve requires 27934m and the final exponentiation requires 7246m + i [42]. Thus the total cost of the Tate pairing on a BN curve is 35180m + i. Similarly, the cost of the ate pairing is 23047m + i and the cost of the R-ate pairing is 15093m + 2i.

6.3 Programmable Fp-Primitive

In this section we develop a programmable Fp-primitive based on the 256-bit high-speed adder circuits described above. The essential operations for pairing computation are addition, subtraction, and multiplication in finite fields. Figure 6.1 depicts the overall architecture of the proposed Fp-adder/subtractor/multiplier unit.

6.3.1 Architecture Description

Our first objective for designing such an integrated architecture is to reduce the overall

hardware costs for computing three essential prime field operations in pairing computa-

tion. The architecture consists of several independent blocks which operate in parallel for

accelerating the execution of respective operations. The whole architecture is subdivided

into four macro-blocks (A1,A2,A3,A4) and seven micro-blocks (B1,B2,B3,B4,B5,B6,B7).

[Figure 6.1 block labels: A1: v2 ← s1 + s2 or v2 ← s1 + (¬s2) + 1; A2: w2 ← v2 + (¬p) + 1 or w2 ← v2 + p; A3: v1 ← 2u; A4: w1 ← v1 + (¬p) + 1.]

Figure 6.1: The architecture of Fp adder/subtractor/multiplier unit.

The macro-blocks are used to compute the arithmetic operations, whereas the micro-blocks are primarily responsible for dataflow among the macro-blocks, the registers, and the I/O ports. The functionality of the individual blocks is described here.

• Macro-blocks A1,A2, and A4 are 256-bit adders based on our proposed technique as

described in section 5.2.

• Block A3 performs 2u for an integer u ∈ Fp. This is a one-bit left shift, implemented purely by rewiring with no additional logic cells.

• Micro-block B1 consists of one 2:1 multiplexer that selects either v1 or w1 based on

the most significant bits (or carry-outs) of 2u and 2u− p operations. Therefore, this

block completes the 2u mod p operation.

• Block B2 selects either s1 or s2 as the input to the A3.

• Blocks B3,B4, and B5 help to compute Fp- addition and subtraction in A1 and A2.



The control signal c0 holds zero for addition and one for subtraction. Thus, if c0 = 1

then block B5 selects ¬s2 else it selects s2. Similarly, if c0 = 1 then block B4 selects

p else it selects ¬p. Block B3 completes the operation by selecting the correct result.

In case of Fp-subtraction (i.e., c0 = 1), it selects either v2 or w2 based on the most

significant bit (MSB) of v2 only, whereas, for Fp-addition it does the same based on

the MSB of both v2 and w2.

• Blocks B6 and B7 multiplex t1 (the output of 2u mod p) and t2 (the output of s1± s2

mod p) as the new value of s1 and s2 registers, respectively.

6.3.1.1 Computation of Fp-multiplication

The proposed Fp-primitive follows the parallelism technique of the Montgomery ladder [125] for computing the Blakley multiplication algorithm in Fp [180]. This algorithm is chosen for its low hardware cost and its intrinsic adaptability to the Montgomery ladder for parallelism. We rewrite it in Algorithm 7 with parenthesized superscript indices in order to emphasize the intrinsic dependencies as well as the parallelism of the multiplication procedure. The algorithm computes two intermediate results ($s_1^{(i)}$ and $s_2^{(i)}$) in each iteration. The data transfer inside the architecture (Fig. 6.1) for computing (a · b) mod p is as follows:

• The registers s1 and s2 hold the iterative results $s_1^{(i)}$ and $s_2^{(i)}$ of Algorithm 7, which are initialized to zero and a, respectively, as specified in step 1.

• Iterative execution starts from i = n − 1 and goes down to zero as shown in step 2. This step is controlled by an 8-bit counter, which belongs to the control part of the proposed design and is not shown in Fig. 6.1.

• Block B2 of Fig. 6.1 executes step 3. The modular doubling (as computed by exe-

cuting the steps 4, 6, 8, and 10) and the modular addition (as computed by executing

the steps 5, 7, 9, and 11) are performed in parallel. In Fig. 6.1, steps 4 and 6 are

performed in blocks A3 and A4, respectively, whereas, both the steps 8 and 10 are

performed in block B1. Similarly, steps 5 and 7 are performed in blocks A1 and A2,

respectively, whereas, both the steps 9 and 11 are performed in block B3. During the

execution of Fp-multiplication, control signal c0 remains zero.


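Algorithm 7: The interleaved multiplication based on Montgomery ladder†.
Input: p, $a = \sum_{i=0}^{n-1} 2^i a_i$ and $b = \sum_{i=0}^{n-1} 2^i b_i$.
Output: a · b mod p.
1. $s_1^{(n)} \leftarrow 0$; $s_2^{(n)} \leftarrow a$;
2. for i = n − 1 down to 0 do
3.    if $b_i = 1$ then $u^{(i)} \leftarrow s_2^{(i+1)}$; else $u^{(i)} \leftarrow s_1^{(i+1)}$;
4.    $v_1^{(i)} \leftarrow 2u^{(i)}$;
5.    $v_2^{(i)} \leftarrow s_1^{(i+1)} + s_2^{(i+1)}$;
6.    $w_1^{(i)} \leftarrow v_1^{(i)} + (\neg p) + 1$;
7.    $w_2^{(i)} \leftarrow v_2^{(i)} + (\neg p) + 1$;
8.    $c_1^{(i)} \leftarrow (v_1^{(i)})_n \mid (w_1^{(i)})_n$;
9.    $c_2^{(i)} \leftarrow (v_2^{(i)})_n \mid (w_2^{(i)})_n$;
10.   if $c_1^{(i)} = 1$ then $t_1^{(i)} \leftarrow w_1^{(i)}$; else $t_1^{(i)} \leftarrow v_1^{(i)}$;
11.   if $c_2^{(i)} = 1$ then $t_2^{(i)} \leftarrow w_2^{(i)}$; else $t_2^{(i)} \leftarrow v_2^{(i)}$;
12.   if $b_i = 1$ then $s_1^{(i)} \leftarrow t_2^{(i)}$; else $s_1^{(i)} \leftarrow t_1^{(i)}$;
13.   if $b_i = 1$ then $s_2^{(i)} \leftarrow t_1^{(i)}$; else $s_2^{(i)} \leftarrow t_2^{(i)}$;
14. end for
15. return $s_1^{(0)}$;
† In the algorithm, $x^{(i)}$ represents the value of x at the i-th iteration, $(x)_n$ indicates the n-th bit of x, and | indicates logical OR.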

• Finally, the results of the current iteration are stored back, as specified in steps 12 and 13, in parallel by blocks B6 and B7.

All steps from step 3 to step 13 of Algorithm 7 are performed within one clock cycle by the proposed architecture. Therefore, the proposed design takes only 256 clock cycles to compute a multiplication in Fp256.
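The ladder-based interleaved multiplication can be modelled in software to check its behaviour. The following Python sketch is a behavioural model under stated assumptions, not the hardware description: the function name `ladder_mulmod` is hypothetical, integers stand in for the n-bit registers, and the register update keeps the ladder invariant s2 − s1 ≡ a (mod p), so that one loop iteration corresponds to one clock cycle of the unit.

```python
def ladder_mulmod(a, b, p, n):
    """Compute (a * b) mod p for 0 <= a, b < p < 2**n.

    Uses only doublings, additions, and conditional subtractions of p,
    one ladder iteration per bit of b (most significant bit first).
    """
    low_mask = (1 << n) - 1
    not_p = ~p & low_mask              # n-bit one's complement of p

    def reduce_once(v):
        # w = v - p is formed as v + (~p) + 1; bit n of v or of w
        # signals v >= p, in which case the low n bits of w are kept.
        w = v + not_p + 1
        c = ((v >> n) | (w >> n)) & 1
        return (w & low_mask) if c else v

    s1, s2 = 0, a                      # invariant: s2 - s1 == a (mod p)
    for i in range(n - 1, -1, -1):
        bit = (b >> i) & 1
        u = s2 if bit else s1          # operand of the doubling
        t1 = reduce_once(2 * u)        # modular doubling
        t2 = reduce_once(s1 + s2)      # modular addition, done in parallel
        if bit:
            s1, s2 = t2, t1
        else:
            s1, s2 = t1, t2
    return s1


if __name__ == "__main__":
    print(ladder_mulmod(123, 200, 251, 8), (123 * 200) % 251)
```

The two calls to `reduce_once` are independent of each other, which mirrors the parallel doubling and addition paths of the architecture.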

6.3.1.2 Computation of Fp-addition

The proposed design executes Algorithm 8 for computing Fp-addition. As described in step 1, the architecture initializes registers s1 and s2 with operands a and b, respectively. It executes steps 2 and 3 in blocks A1 and A2. Based on the most significant bits of v2 and w2, it produces the correct result of s1 + s2 mod p in block B3, as described in steps 4 and 5. During the execution of Fp-addition the control signal c0 holds logic zero. The proposed architecture computes an Fp-addition in one clock cycle.

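Algorithm 8: The addition in prime field.
Input: p, $a = \sum_{i=0}^{n-1} 2^i a_i$ and $b = \sum_{i=0}^{n-1} 2^i b_i$.
Output: a + b mod p.
1. $s_1 \leftarrow a$; $s_2 \leftarrow b$;
2. $v_2 \leftarrow s_1 + s_2$;
3. $w_2 \leftarrow v_2 + (\neg p) + 1$;
4. $c_2 \leftarrow (v_2)_n \mid (w_2)_n$;
5. if $c_2 = 1$ then $t_2 \leftarrow w_2$; else $t_2 \leftarrow v_2$;
6. return $t_2$;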

6.3.1.3 Computation of Fp-subtraction

Subtraction a − b mod p on the proposed design is performed by executing Algorithm 9. As described in step 1, the architecture initializes registers s1 and s2 with operands a and b, respectively. The subtraction s1 − s2 is performed as s1 + (¬s2) + 1 by the A1 block of Fig. 6.1. Block B5 computes ¬s2 and sends it to the A1 block. During subtraction the control signal c0 holds one. The architecture further executes step 3 in block A2 and, based on the most significant bit of v2, produces the correct result of s1 − s2 mod p in block B3. The whole computation takes only one clock cycle on the proposed architecture.

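Algorithm 9: The subtraction in prime field.
Input: p, $a = \sum_{i=0}^{n-1} 2^i a_i$ and $b = \sum_{i=0}^{n-1} 2^i b_i$.
Output: a − b mod p.
1. $s_1 \leftarrow a$; $s_2 \leftarrow b$;
2. $v_2 \leftarrow s_1 + (\neg s_2) + 1$;
3. $w_2 \leftarrow v_2 + p$;
4. $c_2 \leftarrow (v_2)_n$;
5. if $c_2 = 1$ then $t_2 \leftarrow w_2$; else $t_2 \leftarrow v_2$;
6. return $t_2$;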
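The single-cycle addition and subtraction can likewise be modelled with an n-bit adder that exposes its carry-out. The Python sketch below is illustrative only; the names `adder`, `fp_add`, and `fp_sub` are hypothetical. For subtraction it treats the selection signal as the borrow (sign) of s1 − s2, i.e. the inverted carry-out of s1 + (¬s2) + 1, which is an assumption about how the most significant bit of v2 is encoded.

```python
def adder(x, y, carry_in, n):
    """n-bit adder model: return (sum mod 2**n, carry_out)."""
    total = x + y + carry_in
    return total & ((1 << n) - 1), total >> n


def fp_add(a, b, p, n):
    """(a + b) mod p for 0 <= a, b < p < 2**n."""
    mask = (1 << n) - 1
    v, cv = adder(a, b, 0, n)              # v2 <- s1 + s2
    w, cw = adder(v, ~p & mask, 1, n)      # w2 <- v2 + (~p) + 1 = v2 - p
    return w if (cv | cw) else v           # either carry means v2 >= p


def fp_sub(a, b, p, n):
    """(a - b) mod p for 0 <= a, b < p < 2**n."""
    mask = (1 << n) - 1
    v, cv = adder(a, ~b & mask, 1, n)      # v2 <- s1 + (~s2) + 1 = s1 - s2
    w, _ = adder(v, p, 0, n)               # w2 <- v2 + p
    borrow = 1 - cv                        # no carry-out means s1 < s2
    return w if borrow else v


if __name__ == "__main__":
    print(fp_add(200, 100, 251, 8), fp_sub(100, 200, 251, 8))
```

Both candidate results (v and w) are always computed and a multiplexer picks one, which is what allows each operation to finish in a single pass through two adders.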

6.4 A Configurable Fpk Arithmetic Unit (CAU)

It is observed that the major operations for pairing computations over BN curves are performed either in Fp or in Fp2. Thus, we design a configurable architecture for performing arithmetic in Fp and Fp2. Figure 6.2 shows the architecture of the proposed CAU. It consists of three Fp-adder/subtractor/multiplier units as described in Section 6.3. Each of these units, along with its input multiplexers, is identified as a separate block (A1, A2, A3), and the blocks can operate in parallel. The CAU operates in two modes, namely Fp-mode and Fp2-mode. In Fp-mode, it computes three independent Fp-operations on blocks A1, A2, and A3. The respective operations computed in this mode are t1 ← a0 [+/−/·] b0, t2 ← a1 [+/−/·] b1, and t3 ← a2 [+/−/·] b2, as shown in the figure.

[Figure 6.2 block labels: A1 (Fp-Add/Sub/Mult-1): t1 ← a0 ± b0, t3 ← a0 + a1, t1 ← a0 · b0, t1 ← t1 − t2; A2: t2 ← a1 ± b1, t4 ← b0 + b1, t2 ← a1 · b1, t4 ← t1 + t2, t2 ← t3 − t4; A3: t3 ← a2 ± b2, t3 ← t3 · t4, t3 ← a2 · b2. Inputs: a0, a1, a2, b0, b1, b2, and p; control signals c1 to c11.]

Figure 6.2: The architecture of Configurable Fpk Arithmetic Unit (CAU).

In Fp2-mode, the CAU computes Fp2-multiplication. Let an element α ∈ Fp2 be represented as α0 + α1X, where α0, α1 ∈ Fp and X is an indeterminate. The formula for the Karatsuba multiplication c = ab in Fp2 is:

$v_0 = a_0 b_0, \quad v_1 = a_1 b_1,$

$c_0 = v_0 + \beta v_1,$

$c_1 = (a_0 + a_1)(b_0 + b_1) - v_0 - v_1,$

where $v_0, v_1, c_0, c_1, a_0, a_1, b_0, b_1 \in \mathbb{F}_p$. Here β is a quadratic non-residue in Fp, which is −2 in the case of BN curves. We compute a · b in the proposed CAU as described in Algorithm 10.

Algorithm 10: The multiplication in Fp2.
Input: p, a = a0 + a1X and b = b0 + b1X.
Output: a · b.
1. t3 ← a0 + a1; t4 ← b0 + b1;
2. t1 ← a0 · b0; t2 ← a1 · b1; t3 ← t3 · t4;
3. t1 ← t1 − t2; t4 ← t1 + t2;
4. t1 ← t1 − t2; t2 ← t3 − t4;
5. return t1 + t2X;

All operations within a step of Algorithm 10 are computed in parallel, whereas the individual steps are executed one by one. Step 1 of the algorithm is computed by blocks A1 and A2. Then the CAU executes the three independent Fp-multiplications defined in step 2 on A1, A2, and A3, respectively. After steps 3 and 4 are executed by blocks A1 and A2, the final result is stored in registers t1 and t2 as defined in step 5. The cost of multiplication in Fp2 is 3m, where m represents the cost of one Fp-multiplication. However, because the three independent Fp-multiplication units work in parallel, this cost on the proposed CAU is only m. The Fp2-squaring is performed as a · a to reduce the multiplexer complexity of the CAU, so it incurs the same cost.
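The four-step schedule of the Fp2 multiplication can be replayed in software. The Python sketch below is a hedged illustration with a hypothetical function name `fp2_mul_cau`; it relies on Python's tuple assignment to mimic the rule that all operations within one step read the register values from before that step, and it assumes β = −2 as stated for BN curves.

```python
def fp2_mul_cau(a, b, p):
    """Multiply a = a0 + a1*X and b = b0 + b1*X in Fp2 with X^2 = -2.

    Each tuple assignment below models one step in which the listed
    operations run in parallel on the old register contents.
    """
    a0, a1 = a
    b0, b1 = b
    # Step 1: two additions in parallel.
    t3, t4 = (a0 + a1) % p, (b0 + b1) % p
    # Step 2: three independent Fp multiplications in parallel.
    t1, t2, t3 = (a0 * b0) % p, (a1 * b1) % p, (t3 * t4) % p
    # Step 3: t1 = v0 - v1 and t4 = v0 + v1 (both use the old t1).
    t1, t4 = (t1 - t2) % p, (t1 + t2) % p
    # Step 4: t1 = v0 - 2*v1 = v0 + beta*v1, t2 = cross term.
    t1, t2 = (t1 - t2) % p, (t3 - t4) % p
    return t1, t2


def fp2_mul_schoolbook(a, b, p):
    """Reference product using four Fp multiplications."""
    a0, a1 = a
    b0, b1 = b
    return (a0 * b0 - 2 * a1 * b1) % p, (a0 * b1 + a1 * b0) % p


if __name__ == "__main__":
    print(fp2_mul_cau((3, 4), (5, 6), 23), fp2_mul_schoolbook((3, 4), (5, 6), 23))
```

Subtracting t2 twice in steps 3 and 4 realizes the multiplication by β = −2 without a dedicated constant multiplier.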

The micro-instruction sequence generator determines the current operation type and generates the respective micro-instructions, which are simply the control signals ci, 1 ≤ i ≤ 11. The values of the control signals, which represent the scheduling of the different operations on the CAU, are given in Table 6.1.

Table 6.1: Micro-instructions for performing arithmetic in Fp and Fp2.

c1   c2   c3   c4   c5   c6   c7   c8   c9   c10  c11

Fp-mode, execution of three Fp-operations in parallel
0    1    1    0    1    1    1    1    1    1    0

Fp2-mode, execution of Algorithm 10†
0    0    0    0    0    0    0    0    0    1    1
0    1    1    0    0    0    1    1    1    1    0
1    2    2    1    0    0    1    0    0    0    1
1    2    3    2    0    0    1    1    0    0    0

† Each row of micro-instructions represents one step in Algorithm 10.

This sequence generator is constructed as a typical state machine that generates micro-instructions in each state. Its deterministic state transition takes place at every clock cycle based on the current state and the overall status of the CAU. For a multiplication in Fp256 it remains in the same state for 256 cycles, whereas it remains in a state for only one cycle when computing an Fp256 addition or subtraction. Thus, the cost m means 256 clock cycles in the proposed pairing cryptoprocessor. Similarly, the computation of c = ab in F(p256)2 takes only 259 clock cycles, which is approximately equal to m.

6.5 The Pairing Cryptoprocessor (PCP)

This section describes the proposed architecture of the cryptoprocessor. The main nov-

elty of the architecture lies in its efficient utilization of FPGA features. Independent oper-

ations are exploited at each level of pairing computations to evolve an optimized parallel

design. We explain here the top level of the design followed by its internal parts.

6.5.1 The Datapath Design

The major operations for pairing computations are point doubling (PD), point addition (PA), line computation (l(Q)), $f^2$, and $f \cdot l(Q)$. In the case of the Tate pairing on BN curves, PA and PD are performed on E(Fp); hence, the underlying operations are performed in Fp. Similarly, the operation l(Q) is performed in Fp2, while the other two operations are performed in Fp12. In the case of the ate and optimal-ate pairings, PA, PD, and l(Q) are performed in Fp2, and $f^2$ and $f \cdot l(Q)$ are performed in Fp12. However, each of the above computations is well defined and consists of a number of independent Fp-operations. The proposed datapath executes those independent operations in parallel to speed up pairing computations.

Figure 6.3 shows the overall resulting structure of the datapath. Two configurable Fpk

arithmetic units (CAU) are included which perform arithmetic in Fp and Fp2 depending

on their mode of configurations. The instructions to configure the CAUs are stored into

a small memory segment called instruction memory. There is a special instruction fetch

and decode (IFD) unit which reads the respective instructions and converts them to proper

configuration signals for both the CAUs. The input data to the CAUs come in parallel from

respective registers. The mechanism and regularity of data access for computing above

operations are fairly simple. The distribution of access to the registers and resolution of

access conflicts are handled efficiently at runtime by a dedicated hardware block called the data access unit (DAU), which distributes the data to the CAUs from the registers and vice versa.

[Figure 6.3 components: instruction memory, instruction fetch and decode (IFD) unit, sequence control, micro-instruction sequence generators 1 and 2, CAU-1 and CAU-2 (execution unit), access control, data access unit (DAU) with operand and result selection, and active registers 1 to d.]

Figure 6.3: The datapath of the pairing cryptoprocessor.

Each CAU performs at most three Fp-operations in parallel. Thus, overall, twelve independent operands along with the modulus p and six outputs are transferred in either direction between the memory elements and the CAUs. These on-demand concurrent data requests result in multiple independent read or write connections between the CAUs and the DAU. The DAU takes care of granting accesses. A simple multiplexing protocol is used between the CAUs and the registers, which confirms a request within the same cycle so that parallel data accesses cause no delay cycles. The data accesses and instruction sequences are hard-coded into the sequence control of the architecture, which avoids additional software development costs.

The data access conflicts have been resolved prior to the design of the DAU. The proposed design is custom hardware for pairing computations, which executes a fixed set of operations. The dependencies among the instructions are predefined, and thus the access conflicts are known. The priority of data processing and the respective execution are rearranged accordingly, which achieves maximum utilization of the CAUs.

The data access unit (DAU) acts as a mediator while transferring data between the CAUs and the memory elements. Due to the demand for parallel access, the proposed cryptoprocessor stores all intermediate results in its active registers. To achieve the intended parallelism of pairing computations on BN curves, the proposed design contains fifty 256-bit registers (i.e., d = 50 in Fig. 6.3). Each register has data-in, data-out, and enable lines. It is updated from the data-in lines when the respective enable signal is asserted. The crossbar switch (results) redirects the outputs of each operation to the registers. Similarly, the operands are redirected from the registers to the input ports of the CAUs. The respective select signals are generated by the sequence control unit prior to these two redirection procedures. The access control block synchronizes the select lines of the multiplexers for operands and results. It also synchronizes the enable signals of the registers for storing the intermediate results.

6.6 Computation of Tate Pairing on PCP

We follow the formulas and algorithms for the computation of asymmetric pairings (Tate, ate, and R-ate) given in [42]. The major computations in a pairing algorithm are the Miller function and the final exponentiation. The Miller function consists of two major steps, namely the doubling step and the addition step. Here, we discuss the computation of these steps for the Tate pairing over BN curves on our proposed PCP.

The Tate pairing (tr) over a BN curve takes input points P and Q over Fp and Fp2, respectively. The parameter r is a 256-bit prime of Hamming weight 91. Thus, the Miller algorithm runs for 255 iterations, with 255 doubling steps and 90 addition steps. There are sufficient independent operations within the doubling and addition steps that can be performed in parallel. Our proposed PCP consists of a fixed number of functional units. Therefore, the scheduling can be optimized based on the available functional units and the operations. In the following subsections, we describe the optimized scheduling of the above steps on the proposed PCP.

6.6.1 Computation of Doubling Step

The doubling step consists of the following computations.



• The point doubling (2T ) operation.

• The computation of tangent line at point T (lT,T (Q)).

• The squaring of Miller function ( f 2).

• The multiplication of Miller function with line function ( f · lT,T (Q)).

The computations of 2T, $l_{T,T}(Q)$, and $f^2$ are performed in parallel on our PCP. In Jacobian coordinates the formulae for doubling a point T = (X, Y, Z) are $2T = (X_3, Y_3, Z_3)$, where $X_3 = 9X^4 - 8XY^2$, $Y_3 = (3X^2)(4XY^2 - X_3) - 8Y^4$, and $Z_3 = 2YZ$. The tangent line at T, after clearing denominators, is $l(x, y) = 3X^3 - 2Y^2 - 3X^2 Z^2 x + Z_3 Z^2 y$ [90].
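The doubling and tangent-line formulae can be sanity-checked on a toy curve. The Python sketch below is illustrative only: the function names are hypothetical, the curve y² = x³ + 3 over F_7 is a toy example rather than the 256-bit BN curve, and the line is evaluated at a point with coordinates in Fp instead of Fp2 to keep the example short. It uses `pow(Z, -1, p)` for modular inversion, which requires Python 3.8 or later.

```python
def jacobian_double(T, p):
    """Double T = (X, Y, Z) on y^2 = x^3 + b (a = 0) in Jacobian coordinates."""
    X, Y, Z = T
    X3 = (9 * X**4 - 8 * X * Y**2) % p
    Y3 = (3 * X**2 * (4 * X * Y**2 - X3) - 8 * Y**4) % p
    Z3 = (2 * Y * Z) % p
    return X3, Y3, Z3


def tangent_line(T, x, y, p):
    """Evaluate the tangent line at T, with denominators cleared, at (x, y)."""
    X, Y, Z = T
    Z3 = 2 * Y * Z
    return (3 * X**3 - 2 * Y**2 - 3 * X**2 * Z**2 * x + Z3 * Z**2 * y) % p


def to_affine(T, p):
    """Convert Jacobian (X, Y, Z) to affine (X/Z^2, Y/Z^3)."""
    X, Y, Z = T
    z_inv = pow(Z, -1, p)
    return (X * z_inv**2) % p, (Y * z_inv**3) % p


if __name__ == "__main__":
    p = 7
    T = (2, 5, 3)                       # Jacobian form of the affine point (1, 2)
    print(to_affine(T, p), to_affine(jacobian_double(T, p), p))
    print(tangent_line(T, 1, 2, p))     # the tangent vanishes at T itself
```

Because the formulae avoid inversions, a conversion back to affine coordinates is needed only when the result is compared or exported.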

In the case of Tate pairing computation on BN curves, $X, Y, Z, X_3, Y_3, Z_3 \in \mathbb{F}_p$ and $x, y \in \mathbb{F}_{p^2}$. Let x and y be represented as $x_0 + x_1 U$ and $y_0 + y_1 U$. The above operations are performed by one of the CAUs in the following way.

Instruction   ASM1,1            ASM1,2            ASM1,3
1.            t0 ← X^2          t1 ← Y^2          t2 ← Y · Z
2.            t3 ← (t0)^2       t4 ← X · t1       t5 ← (t1)^2
3.            t4 ← 2t4          t6 ← 2t3          Z3 ← 2t2
4.            t4 ← 2t4          t6 ← 2t6          t5 ← 2t5
5.            t3 ← t3 + t6      −                 t5 ← 2t5
6.            X3 ← t3 − t2      −                 t5 ← 2t5
7.            t3 ← t4 − X3      −                 t7 ← t7 + t0
8.            t7 ← t7 · t3      t4 ← Z^2          t2 ← X · t0
9.            Y3 ← t7 − t5      t1 ← 2t1          t5 ← 2t2
10.           t4 ← t4 · t0      t0 ← t4 · Z3      −
11.           t2 ← 2t4          −                 t5 ← t2 + t5
12.           t4 ← t4 + t2      l0 ← t5 − t1      −
13.           l10 ← t4 · x0     l11 ← t4 · x1     −
14.           l20 ← t0 · y0     l21 ← t0 · y1     −

In the above table, ASMi,j stands for the j-th Fp Add/Sub/Mult unit of the i-th CAU. Nonlinear Fp operations are performed in instructions 1, 2, 8, 10, 13, and 14. If we assume that s ≈ m, then the cost of the above operations is 6m. At the same time, CAU2 starts the computation of $f^2$. We represent the Miller function f as $(f_{0,0} + f_{0,1}V + f_{0,2}V^2) + (f_{1,0} + f_{1,1}V + f_{1,2}V^2)W$, where $f_{i,j} \in \mathbb{F}_{p^2}$. The equivalent representations of f are as follows [42]:

$f = f_0 + f_1 W$, where $f_0, f_1 \in \mathbb{F}_{p^6}$

$\;\; = (f_{0,0} + f_{0,1}V + f_{0,2}V^2) + (f_{1,0} + f_{1,1}V + f_{1,2}V^2)W$, where $f_{i,j} \in \mathbb{F}_{p^2}$

$\;\; = f_{0,0} + f_{1,0}W + f_{0,1}W^2 + f_{1,1}W^3 + f_{0,2}W^4 + f_{1,2}W^5$

The computation of $c = f^2$ is performed in Fp12 using the complex method in the following way:

$v = f_0 \cdot f_1,$

$c_0 = (f_0 + f_1)(f_0 + \beta f_1) - v - \beta v,$

$c_1 = 2v,$

where $v, c_0, c_1$ are in Fp6. It requires two Fp6 multiplications. The Fp6 multiplication is performed in the tower field F(p2)3 using the Karatsuba technique, which requires six multiplications in Fp2. Let an element $a_i \in \mathbb{F}_{p^2}$ be represented as $a_{i0} + a_{i1}U$, with $a_{ij} \in \mathbb{F}_p$. The computation of $v = f_0 \cdot f_1$ on CAU2 is as follows:

1. v0 ← f00 · f10, where f00, f10 ∈ Fp2
2. v1 ← f01 · f11, where f01, f11 ∈ Fp2
3. v2 ← f02 · f12, where f02, f12 ∈ Fp2
4. t10 ← f010 + f020, t11 ← f011 + f021
5. t20 ← f110 + f120, t21 ← f111 + f121
6. t3 ← t1 · t2, where t1, t2 ∈ Fp2
7. t10 ← v10 + v20, t11 ← v11 − v21
8. t30 ← t30 − t10, t31 ← t31 − t11
9. t31 ← t31 + t31
10. v00 ← v00 − t31, v01 ← v01 + t30
11. t10 ← f000 + f010, t11 ← f001 + f011
12. t20 ← f100 + f110, t21 ← f101 + f111
13. t3 ← t1 · t2, where t1, t2 ∈ Fp2
14. t10 ← v00 + v10, t11 ← v01 − v11
15. t20 ← v21 + v21
16. t10 ← t10 + t20, t11 ← v20 − t11
17. v10 ← t30 − t10, v11 ← t31 + t11
18. t10 ← f000 + f020, t11 ← f001 + f021
19. t20 ← f100 + f120, t21 ← f101 + f121
20. t3 ← t1 · t2, where t1, t2 ∈ Fp2
21. t10 ← v00 + v20, t11 ← v01 + v21
22. t10 ← v10 − t10, t11 ← v11 − t11
23. v20 ← t30 + t10, v21 ← t31 + t11

The result $v \in \mathbb{F}_{p^6}$ is represented as $(v_{00} + v_{01}U) + (v_{10} + v_{11}U)V + (v_{20} + v_{21}U)V^2$, where $v_{ij} \in \mathbb{F}_p$. In the above computation, steps 1, 2, 3, 6, 13, and 20 perform multiplications in Fp2. Thus the cost of $v = f_0 \cdot f_1$ is 6m; it is computed on CAU2 in parallel with 2T and $l_{T,T}(Q)$ by the proposed PCP.
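The complex squaring method used for $f^2$ trades one of the three multiplications of a schoolbook squaring for a few additions. The Python sketch below checks the identity in a simplified setting: the coefficients f0 and f1 are plain integers modulo p rather than Fp6 elements, β is an arbitrary scalar, and the names `complex_square` and `direct_square` are hypothetical.

```python
def complex_square(f0, f1, beta, p):
    """Square f0 + f1*W with W^2 = beta using two multiplications.

    In the cryptoprocessor f0 and f1 are Fp6 elements; here they are
    integers mod p, which is enough to check the algebra.
    """
    v = (f0 * f1) % p                                        # multiplication 1
    c0 = ((f0 + f1) * (f0 + beta * f1) - v - beta * v) % p   # multiplication 2
    c1 = (2 * v) % p
    return c0, c1


def direct_square(f0, f1, beta, p):
    """Reference: (f0 + f1*W)^2 = (f0^2 + beta*f1^2) + 2*f0*f1*W."""
    return (f0 * f0 + beta * f1 * f1) % p, (2 * f0 * f1) % p


if __name__ == "__main__":
    print(complex_square(5, 9, -2, 23), direct_square(5, 9, -2, 23))
```

Multiplications by β are counted as cheap here because, for the tower used in the text, they reduce to additions and coefficient shuffles.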



The second Fp6 multiplication, i.e., the computation of ( f0+ f1)( f0+β f1) is performed

by both the CAUs, which costs only 3m in the PCP. Therefore, the total cost of computing

2T , lT,T (Q), and f 2 by the PCP is 9m.

The l(Q) is represented as $(l_0 + l_1 V) + (l_2 V)W$, where $l_0 \in \mathbb{F}_p$, $l_1, l_2 \in \mathbb{F}_{p^2}$, which is equivalent to $l_0 + l_1 W^2 + l_2 W^3$. The computation of $f \cdot l(Q)$ is performed in the tower field F((p2)3)2 in the following way:

$f' = f \cdot l(Q) = ((f_{0,0} + f_{0,1}V + f_{0,2}V^2) + (f_{1,0} + f_{1,1}V + f_{1,2}V^2)W) \cdot ((l_0 + l_1 V) + (l_2 V)W)$

The topmost extension is quadratic. Thus the computation of $f \cdot l(Q)$ is done by three Fp6 multiplications, which are identified as:

$t^1_1 = (l_0 + l_1 V) \cdot (f_{0,0} + f_{0,1}V + f_{0,2}V^2)$

$t^1_2 = (l_2 V) \cdot (f_{1,0} + f_{1,1}V + f_{1,2}V^2)$

$t^1_3 = (l_0 + (l_1 + l_2)V) \cdot ((f_{0,0} + f_{1,0}) + (f_{0,1} + f_{1,1})V + (f_{0,2} + f_{1,2})V^2)$

One multiplication in Fp6 using the Karatsuba method requires 18 Fp multiplications. However, due to the sparse representation of l(Q), the cost of computing $t^1_i$, 1 ≤ i ≤ 3, is less than the cost of three full Fp6 multiplications. The equations for $t^1_1$ and $t^1_3$ each require only 14 Fp multiplications. In our parallel cryptoprocessor these two equations are computed in parallel on the two CAUs, which costs 5m. The computation of $t^1_2$ requires only nine Fp multiplications, which are performed on both CAUs and cost only 2m. Therefore, the computation of $f \cdot l(Q)$ requires 37 Fp multiplications, which cost only 7m on our PCP. The total cost of the doubling step (the computation of 2T, $l_{T,T}(Q)$, $f^2$, and $f \cdot l(Q)$) of the Miller algorithm for the Tate pairing on BN curves is therefore 9m + 7m = 16m.

6.6.2 Computation of Addition Step

The addition step consists of the computations of T + P, $l_{T,P}(Q)$, and $f \cdot l_{T,P}(Q)$. The formulae for mixed Jacobian-affine addition are the following: if $T = (X_1, Y_1, Z_1)$ is in Jacobian coordinates and $P = (X_2, Y_2)$ is in affine coordinates, then $T + P = (X_3, Y_3, Z_3)$, where

$X_3 = (Y_2 Z_1^3 - Y_1)^2 - (X_2 Z_1^2 - X_1)^2 (X_1 + X_2 Z_1^2),$

$Y_3 = (Y_2 Z_1^3 - Y_1)(X_1 (X_2 Z_1^2 - X_1)^2 - X_3) - Y_1 (X_2 Z_1^2 - X_1)^3,$

$Z_3 = Z_1 (X_2 Z_1^2 - X_1).$

The line through T and P is $l(x, y) = (X_2 (Y_2 Z_1^3 - Y_1) - Y_2 Z_3) - (Y_2 Z_1^3 - Y_1)x + Z_3 \cdot y$. During the addition step of the Miller algorithm we compute the above operations in parallel on both CAUs. There are limited independent operations in this step. Therefore, there is scope for optimizing the scheduling of operations on the Fp arithmetic units. The optimization is based on the requirements of additional registers and related wiring. The respective scheduling is shown here.

Step   CAU1                                             CAU2
1.     t0 ← Y2 · Z1, t1 ← (Z1)^2                        −
2.     t0 ← t1 · t0, t1 ← t1 · X2                       −
3.     t4 ← t1 + X1, t0 ← t0 − Y1, t5 ← t1 − X1         −
4.     t3 ← (t0)^2, Z3 ← t5 · Z1, t7 ← (t5)^2           l10 ← t0 · x0, l11 ← t0 · x1
5.     t2 ← t7 · X1, t4 ← t4 · t7, t5 ← t5 · t7         t10 ← t0 · X2
6.     X3 ← t3 − t4                                     −
7.     t2 ← t2 − X3                                     −
8.     t2 ← t2 · t0, t4 ← Y2 · Z3, t5 ← t5 · Y1         l20 ← Z3 · y0, l21 ← Z3 · y1
9.     Y3 ← t2 − t5                                     l0 ← t10 − t4

In the above scheduling, the nonlinear operations (multiplication and squaring) in Fp

are performed in steps 1, 2, 4, 5, and 8. Thus, the cost of computing T +P, lT,P(Q) is 5m

in the PCP. This computation is followed by f · l(Q), which costs 7m. Therefore, the cost

for evaluating the addition step is 5m+7m = 12m in the PCP.
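The mixed Jacobian-affine addition and the chord line can be checked on the same kind of toy curve used for small examples. The Python sketch below is a hedged illustration: the function names are hypothetical, the curve y² = x³ + 3 over F_7 is a toy example, the line is evaluated at Fp points only, and `pow(Z, -1, p)` requires Python 3.8 or later.

```python
def mixed_add(T, P, p):
    """Add Jacobian T = (X1, Y1, Z1) and affine P = (X2, Y2), T != +-P."""
    X1, Y1, Z1 = T
    X2, Y2 = P
    H = (X2 * Z1**2 - X1) % p          # X2*Z1^2 - X1
    R = (Y2 * Z1**3 - Y1) % p          # Y2*Z1^3 - Y1
    X3 = (R**2 - H**2 * (X1 + X2 * Z1**2)) % p
    Y3 = (R * (X1 * H**2 - X3) - Y1 * H**3) % p
    Z3 = (Z1 * H) % p
    return X3, Y3, Z3


def chord_line(T, P, x, y, p):
    """Evaluate the line through T and P, denominators cleared, at (x, y)."""
    X1, Y1, Z1 = T
    X2, Y2 = P
    R = Y2 * Z1**3 - Y1
    Z3 = Z1 * (X2 * Z1**2 - X1)
    return (X2 * R - Y2 * Z3 - R * x + Z3 * y) % p


def to_affine(T, p):
    """Convert Jacobian (X, Y, Z) to affine (X/Z^2, Y/Z^3)."""
    X, Y, Z = T
    z_inv = pow(Z, -1, p)
    return (X * z_inv**2) % p, (Y * z_inv**3) % p


if __name__ == "__main__":
    p = 7
    T = (2, 5, 3)                      # Jacobian form of the affine point (1, 2)
    P = (2, 2)
    print(to_affine(mixed_add(T, P, p), p))
    print(chord_line(T, P, 1, 2, p), chord_line(T, P, 2, 2, p))
```

The intermediate values H and R correspond to the shared subexpressions that the schedule computes once and reuses for both the point addition and the line coefficients.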

6.6.3 Computation of Final Exponentiation

The final exponentiation is computed in the following way. It follows the optimization of factoring $(p^{12} - 1)/r$ into three parts [61] and computes $f^{(p^{12}-1)/r}$ as:

$f^{(p^{12}-1)/r} = f^{(p^6-1) \times \frac{p^6+1}{p^4-p^2+1} \times \frac{p^4-p^2+1}{r}} = \left(\left(f^{p^6-1}\right)^{p^2+1}\right)^{(p^4-p^2+1)/r}.$


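The computation is done in the following way:

1. $f \leftarrow f^{p^6-1}$.

2. $f \leftarrow f^{p^2+1}$.

3. $a \leftarrow f^{-(6z+5)}$, $b \leftarrow a^p$, $b \leftarrow a \cdot b$.

4. Compute $f^p$, $f^{p^2}$, $f^{p^3}$.

5. $f \leftarrow f^{p^3} \cdot [b \cdot (f^p)^2 \cdot f^{p^2}]^{6z^2+1} \cdot b \cdot (f^p \cdot f)^9 \cdot a \cdot f^4$.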
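The steps above can be cross-checked at the level of exponents: the product formed in step 5 should raise f to the power $(p^4 - p^2 + 1)/r$. The Python sketch below performs this integer check; it is illustrative only, and the function names `bn_p_r` and `hard_part_exponent` are hypothetical.

```python
Z = 0x6000000000001F2D   # BN parameter z from the text


def bn_p_r(z):
    """Return (p, r) for BN parameter z."""
    p = 36 * z**4 + 36 * z**3 + 24 * z**2 + 6 * z + 1
    r = 36 * z**4 + 36 * z**3 + 18 * z**2 + 6 * z + 1
    return p, r


def hard_part_exponent(z, p):
    """Total exponent applied to f by steps 3 to 5."""
    e_a = -(6 * z + 5)                 # a = f^(-(6z+5))
    e_b = e_a * (p + 1)                # b = a^p * a
    e_t = e_b + 2 * p + p**2           # b * (f^p)^2 * f^(p^2)
    return (p**3                       # f^(p^3)
            + (6 * z**2 + 1) * e_t     # [...]^(6z^2+1)
            + e_b                      # b
            + 9 * (p + 1)              # (f^p * f)^9
            + e_a                      # a
            + 4)                       # f^4


if __name__ == "__main__":
    p, r = bn_p_r(Z)
    print(r * hard_part_exponent(Z, p) == p**4 - p**2 + 1)
    print((p**6 + 1) == (p**2 + 1) * (p**4 - p**2 + 1))
```

The second check confirms the factorization used for the easy part, namely that $(p^6 + 1)/(p^4 - p^2 + 1) = p^2 + 1$.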

Table 6.2 lists the operation costs of the final exponentiation on the PCP. Raising to the power $(p^6 - 1)$ in F(p6)2 is an easy exponentiation, which is performed by a conjugation [8] (Frobenius) and a division. The Frobenius operation is $f^{p^6} = f_0 - f_1 W$. Thus, $f^{p^6-1}$ is computed by one inversion and one multiplication in Fp12. As shown in [61], the inversion $(a_0 + a_1 W)^{-1}$ in Fp12 using a quadratic over sextic extension is computed as:

$(a_0 + a_1 W)^{-1} = (a_0 - a_1 W)/((a_0)^2 + 2(a_1)^2).$

It is computed by one inversion, two multiplications, and two squarings in Fp6. The inversion $a = (a_0 + a_1 V + a_2 V^2)^{-1}$ in Fp6 using a cubic over quadratic extension is performed in the following way:

$A = (a_0)^2 + 2a_1 a_2, \quad B = -2(a_2)^2 - a_0 a_1, \quad C = (a_1)^2 - a_0 a_2,$

$F = -2a_1 C + a_0 A - 2a_2 B,$

$a = (A + BV + CV^2)/F.$

It requires one inversion, nine multiplications, and three squarings in Fp2. On the proposed PCP, we compute $a = (a_0 + a_1 U)^{-1}$ in Fp2 by:

$t_1 \leftarrow (a_0)^2, \quad t_2 \leftarrow (a_1)^2,$

$t_3 \leftarrow t_1 + 2t_2,$

$a_0 \leftarrow a_0/t_3, \quad a_1 \leftarrow -a_1/t_3.$
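The Fp2 inversion can be checked with a few lines of code. The Python sketch below is a hedged illustration: it assumes U² = −2, uses hypothetical function names, and replaces the binary inversion/division algorithm of the hardware with Python's built-in modular inverse `pow(t3, -1, p)` (Python 3.8 or later).

```python
def fp2_inv(a, p):
    """Invert a = a0 + a1*U in Fp2 with U^2 = -2 (a must be nonzero)."""
    a0, a1 = a
    t1 = (a0 * a0) % p
    t2 = (a1 * a1) % p
    t3 = (t1 + 2 * t2) % p             # norm a0^2 + 2*a1^2
    t3_inv = pow(t3, -1, p)            # stand-in for the Fp division unit
    return (a0 * t3_inv) % p, (-a1 * t3_inv) % p


def fp2_mul(a, b, p):
    """Schoolbook product in Fp2 with U^2 = -2."""
    a0, a1 = a
    b0, b1 = b
    return (a0 * b0 - 2 * a1 * b1) % p, (a0 * b1 + a1 * b0) % p


if __name__ == "__main__":
    p = 23                             # toy prime with p = 7 (mod 8)
    a = (5, 9)
    print(fp2_mul(a, fp2_inv(a, p), p))
```

The norm t3 is nonzero for every nonzero a precisely because −2 is a quadratic non-residue modulo p, so a single Fp inversion always suffices.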

The division in Fp is performed by the binary inversion/division algorithm described in Chapter 2. On our parallel PCP, the cost of the above operations is 3m. The cost of computing an inversion in Fp6 is 7m + 3m = 10m. The cost of computing an inversion in Fp12 is 20m on our PCP. Hence, the cost of computing $f^{p^6-1}$ is 29m.

Table 6.2: Operation costs for the final exponentiation on our PCP.

Operation                                                        Cost on PCP
$f^{p^6-1}$                                                      29m
$f^{p^2+1}$                                                      12m
$f^{-(6z+5)}$                                                    480m
$a^p$, $a \cdot b$, $f^p$, $f^{p^2}$, $f^{p^3}$                  21m
$T \leftarrow b \cdot (f^p)^2 \cdot f^{p^2}$                     24m
$T \leftarrow T^{6z^2+1}$                                        951m
$f^{p^3} \cdot T \cdot b \cdot (f^p \cdot f)^9 \cdot f^4$        93m

The first part of the exponentiation is not only cheap (although it does require an extension field division), it also simplifies the rest of the final exponentiation [20]. After raising to the power of $(p^d - 1)$ the field element becomes unitary [104]. This has important implications, as squaring of unitary elements is significantly cheaper than squaring of non-unitary elements, and any future inversions can be implemented by simple conjugation [27, 133].

Let $a = (a_0 + a_1 U) \in \mathbb{F}_{p^2}$. Then raising a to the power of the modulus p is computed by the following relationship:

$(a_0 + a_1 U)^p = (a_0 - a_1 U) \bmod p.$
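The conjugation rule for the p-power Frobenius can be confirmed by brute force on a small field. The Python sketch below is illustrative only: it assumes U² = −2, uses toy primes congruent to 7 modulo 8, and the function names are hypothetical.

```python
def fp2_mul(a, b, p):
    """Schoolbook product in Fp2 with U^2 = -2."""
    a0, a1 = a
    b0, b1 = b
    return (a0 * b0 - 2 * a1 * b1) % p, (a0 * b1 + a1 * b0) % p


def fp2_pow(a, e, p):
    """Square-and-multiply exponentiation in Fp2 for e >= 0."""
    result = (1, 0)
    base = a
    while e > 0:
        if e & 1:
            result = fp2_mul(result, base, p)
        base = fp2_mul(base, base, p)
        e >>= 1
    return result


def fp2_frobenius(a, p):
    """p-power Frobenius as conjugation: a0 + a1*U -> a0 - a1*U."""
    a0, a1 = a
    return a0 % p, (-a1) % p


if __name__ == "__main__":
    p = 23
    a = (5, 9)
    print(fp2_pow(a, p, p), fp2_frobenius(a, p))
```

The conjugation costs a single negation in Fp, whereas the generic exponentiation needs on the order of log p multiplications in Fp2.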

This relation can be applied to higher-order tower extensions. The exponentiation by $p^2 + 1$ in Fp12 is done by applying the $p^2$-power Frobenius automorphism followed by one multiplication in Fp12. Following the procedure described in [42], the $p^2$-power Frobenius automorphism is computed by five multiplications in Fp2. Thus the cost of $f^{p^2+1}$ on our PCP is 12m.

The above procedure is also followed for computing $f^p$ and $f^{p^3}$; each is computed by five Fp2 multiplications, which costs only 3m on our PCP. The exponentiations $f^{6z+5}$, $T^z$, and $(T^z)^{6z}$ are performed by repeated square-and-multiply. Note that 6z + 5 and 6z have bit length 66 and Hamming weight 11, while z has bit length 63 and Hamming weight 11. The respective costs of computing them on our PCP are listed in Table 6.2.

6.6.4 Cost for Computing Tate Pairing

In the case of the BN curve, r has bit length 256 and Hamming weight 91. Thus the total cost of evaluating the iterative Miller function of the Tate pairing is 5176m on our PCP. The cost of computing the final exponentiation is 1610m. Hence, the total cost of computing a Tate pairing over BN curves on our cryptoprocessor is 6786m, which takes 1,764,360 cycles.

6.7 Computation of ate Pairing on PCP

The ate pairing interchanges the input points of the Tate pairing and runs a smaller number of iterations. It uses t − 1 (instead of r) to determine the number of iterations in the Miller algorithm [42]. In the case of BN curves $t \approx \sqrt{r}$, and t − 1 is a 128-bit integer with Hamming weight 28, which halves the number of iterations and drastically reduces the number of addition steps. The computation costs 3165m, 1610m, and 4775m for the Miller function, the final exponentiation, and the ate pairing, respectively, on our proposed hardware. Hence, the number of cycles required to compute an ate pairing on the PCP is 1,241,500.

6.8 Computation of R-ate Pairing on PCP

The R-ate pairing [28] follows the same procedure as the ate pairing, but it uses a = 6z+2 (instead of t − 1) to determine the number of iterations in the Miller algorithm. Since $a \approx \sqrt{t}$ on BN curves, and a has bit length 66 and Hamming weight 9 [42], the Miller algorithm in the R-ate pairing has 65 doubling steps and 8 addition steps. Thus, on our parallel cryptoprocessor, the costs of the Miller function, the final exponentiation, and the R-ate pairing are 1681m, 1610m, and 3291m, respectively. The total number of cycles required to compute an R-ate pairing on the proposed design is 855,660.
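The step counts quoted for the Miller loop follow directly from the bit length and Hamming weight of the loop parameter. A minimal sketch in Python, assuming the usual left-to-right double-and-add loop that skips the leading bit (the function names are illustrative):

```python
def miller_steps(bit_length, hamming_weight):
    """Return (doubling steps, addition steps) of a left-to-right Miller loop.

    Assumption: the loop starts below the most significant bit, so there is
    one doubling per remaining bit and one addition per remaining set bit.
    """
    return bit_length - 1, hamming_weight - 1

def miller_steps_of(n):
    """Same count, derived from an actual loop parameter n."""
    return miller_steps(n.bit_length(), bin(n).count("1"))

if __name__ == "__main__":
    print("Tate (256 bits, weight 91):", miller_steps(256, 91))
    print("R-ate (66 bits, weight 9):", miller_steps(66, 9))
```

For the R-ate parameters this gives the 65 doubling steps and 8 addition steps stated in the text.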


6.9 Results


The whole design has been described in Verilog HDL. All synthesis results have been obtained from the Xilinx ISE Design Suite [22] using a Virtex-4 xc4vlx200 FPGA device with a supply voltage of 1.2 V. The design can run at a maximum frequency of 50 MHz. The pairing hardware uses around 52k logic slices, including the controllers and the data access unit, and 27k flip-flops for registers. It completes one Tate, ate, and R-ate pairing computation in 35.3 ms, 24.9 ms, and 17.0 ms, respectively. Table 6.3 shows the implementation results.

Table 6.3: Implementation results of the pairing cryptoprocessor on xc4vlx200 device.

Operation   Slice   LUT     FF     Frequency   Cycles   Security   Time
                                   (MHz)                (bit)      (ms)
Tate        52 k    101 k   27 k   50          1764 k   128        35.3
ate         52 k    101 k   27 k   50          1242 k   128        24.9
R-ate       52 k    101 k   27 k   50          856 k    128        17.0
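The cycle counts and times reported above can be cross-checked against the costs in m. The sketch below is in Python and rests on one assumption that the text does not state explicitly: that one unit m corresponds to 260 clock cycles, a value inferred here from 1,764,360 / 6786. The constant and function names are illustrative.

```python
# Assumption: 260 cycles per unit "m", inferred from 1,764,360 / 6786.
CYCLES_PER_M = 260
CLOCK_HZ = 50_000_000  # 50 MHz, as reported for the Virtex-4 design

# Miller function cost + final exponentiation cost, in units of m.
COSTS_M = {"Tate": 5176 + 1610, "ate": 3165 + 1610, "R-ate": 1681 + 1610}

def cycles(cost_m):
    """Convert a cost in units of m to clock cycles."""
    return cost_m * CYCLES_PER_M

def time_ms(n_cycles):
    """Execution time in milliseconds at CLOCK_HZ."""
    return n_cycles / CLOCK_HZ * 1e3

if __name__ == "__main__":
    for name, cost in COSTS_M.items():
        n = cycles(cost)
        print(f"{name}: {cost}m -> {n} cycles -> {time_ms(n):.2f} ms")
```

Under this assumption the cycle counts match the reported ones exactly, and the derived times agree with Table 6.3 up to rounding in the first decimal place.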

The critical path of the design goes through the data access mechanism, then through two 256-bit adders, the multiplexer mx1, and back through the data access mechanism. In § 5 it is shown that the latency of a 256-bit adder circuit is 9.9 ns. This latency, however, consists of an input buffer delay of 1.3 ns, an addition logic delay of 6.2 ns, and an output buffer delay of 2.4 ns. In our architecture the critical path lies between two internal registers and therefore includes neither the input buffer nor the output buffer, so only the 6.2 ns logic delay of each adder contributes. Therefore, the total latency of the critical path of the design is calculated as 3.8 ns + 2 × 6.2 ns + 1.6 ns + 2.2 ns = 20 ns.
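The link between this critical path and the 50 MHz clock can be made explicit with a short calculation. A minimal sketch in Python, using the delay terms of the sum above expressed in picoseconds to avoid floating-point rounding (the variable names are illustrative):

```python
# Delay terms of the critical-path sum, in picoseconds:
# 3.8 ns + 2 x 6.2 ns (adder logic) + 1.6 ns + 2.2 ns.
DELAYS_PS = [3800, 6200, 6200, 1600, 2200]

period_ps = sum(DELAYS_PS)      # clock period set by the critical path
f_max_mhz = 1e6 / period_ps     # 1e6 ps per microsecond gives MHz

if __name__ == "__main__":
    print(f"period = {period_ps / 1000} ns, f_max = {f_max_mhz} MHz")
```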

6.9.1 Comparison with Pairing Implementations

This section compares the performance of the proposed design with related pairing implementations over BN curves. The comparison covers actual software and dedicated hardware implementations of cryptographic pairings that achieve a 128-bit security level. Hardware implementations of the ηT pairing over binary and cubic curves are shown in [21, 75]. These designs target a lower security level (72-bit), so a comparison with the present design would be unfair. Table 6.4 gives a performance comparison of related hardware and software implementations.


Table 6.4: Hardware and software implementation of pairings over BN-curves.

Reference   Platform       Frequency   Area                        Cycles     Time
                           (MHz)                                              (ms)
Tate pairing
Proposed    Virtex-4       50          52 k Slices                 1764 k     35.3
[17]        130 nm CMOS    338         97 k Gates                  11627 k    34.4
[61]        Pentium-4      3400        -                           156740 k   -
ate pairing
Proposed    Virtex-4       50          52 k Slices                 1242 k     24.9
[17]        130 nm CMOS    338         97 k Gates                  7706 k     22.8
[19]§       130 nm CMOS    204         183 k Gates                 862 k      4.2
[1]§        Virtex-6       210         4,014 Slices, 42 DSP48E1s   336 k      1.6
[42]        Pentium-4      2400        -                           81000 k    -
[39]        64-bit core2   2400        -                           14640 k    -
[61]        Pentium-4      3400        -                           133620 k   -
optimal-ate pairing
Proposed    Virtex-4       50          52 k Slices                 856 k      17.0
[17]        130 nm CMOS    338         97 k Gates                  5340 k     15.8
[19]§       130 nm CMOS    204         183 k Gates                 593 k      2.9
[1]§        Virtex-6       210         4,014 Slices, 42 DSP48E1s   245 k      1.17
[42]        64-bit core2   2400        -                           10000 k    -

§ Implementation specifically for BN-curves with fixed parameters.


Due to its parallel structure, our PCP computes six Fp multiplications in parallel, which are completed in 256 cycles. The main features that strengthen the proposed PCP for pairing computations are as follows:

• The proposed cryptoprocessor provides the first FPGA results for pairing computation with 128-bit security.

• The adopted parallelism and the efficient use of two Fpk arithmetic cores drastically reduce the total number of cycles.

• Due to its inherent properties, the frequency of a design on an FPGA is much lower than that on an ASIC (CMOS standard cell). However, the speed achieved by the PCP is comparable to that of the CMOS standard cell design.

• The PCP can be flexibly configured for different curve parameters.

The underlying platform plays a crucial role in determining the performance of a design. Thus, comparing existing designs across different platforms is not entirely fair. We therefore try to identify platform-independent features of existing designs and compare them with our proposed one. The number of cycles required to compute pairings on different designs may be considered such a parameter.

Kammler et al. [17] reported the first hardware implementation of cryptographic pairings achieving 128-bit security. The hardware proposed in [17] is not only a cryptoprocessor but an actual ASIP: it is in fact a general-purpose processor augmented with finite field arithmetic units in order to compute pairings. It uses the same z that we have considered to generate a 256-bit BN curve. The Montgomery algorithm is used for Fp multiplication. The platform of that design is a 130 nm CMOS standard cell library, whereas our design is on a Virtex-4 FPGA. The main feature of the design in [17] is the fast modular multiplication in Fp, which takes only 68 cycles. The average cycle count of our PCP for one Fp multiplication is only 43, which is 1.6 times faster than [17]. With respect to the Tate pairing computation, the design of [17] takes 11,627 k cycles, whereas our design takes only 1764 k cycles (Table 6.4), which is only about 0.15 times the cycle count of [17].



Fan et al. [19] proposed a processor for cryptographic pairings over BN curves. They designed a fast modular multiplier in Fp, specific to BN parameters, which takes only 23 cycles. The 130 nm ASIC design of [19] provides the best known performance, taking only 2.9 ms to compute an R-ate pairing over a BN curve. This design also attains a smaller area-latency product than that of [17]. The main drawback of the design proposed in [19] is that it does not provide the flexibility to compute pairings on chosen parameters. Our design, in contrast, provides this flexibility in all aspects, which in turn requires more cycles.

The results of the software implementations [39, 42] are quite impressive. On an Intel 64-bit Core 2 processor, the R-ate pairing requires only 10,000,000 cycles. The advantages of the Intel Core 2 are its fast multiplier (two full 64-bit multiplications in 8 cycles) and its relatively high clock frequency. Even so, the software implementation takes about 12 times more clock cycles than our cryptoprocessor (10,000 k versus 856 k cycles). A very recent work by Naehrig et al. [8] shows that the Optimal-ate pairing on BN curves can be computed in 4,470,408 cycles on an Intel Core 2 Quad Q6600 processor. A software implementation of the same pairing on a different curve is described in [10]; it takes only 2.63 million clock cycles on an Intel Core i7 processor running at 2.8 GHz. However, the exact time required to compute pairings in software on desktop or server systems is not predictable. It depends on many other factors, such as the available cache memory, context switching, and the bus speed of the system.

6.10 Conclusion

In this chapter we presented an FPGA-based architecture for computing cryptographic pairings over 256-bit BN curves. The design is flexible with respect to the choice of curve parameters. Extensive parallelism has been incorporated to speed up the overall pairing computation. The design provides a speed comparable to that of the existing ASIC designs. The overall number of clock cycles required to compute pairings over BN curves is lower than that of existing designs. To the best of our knowledge, it is the first FPGA result for high-security (128-bit) cryptographic pairings.

The next chapter focuses on the security analysis of pairing computations against two physical attacks. The fault attack on a pairing computation tries to exploit the faulty output caused by a transient fault, which is injected into a specific register of the pairing cryptoprocessor. The power attack, on the other hand, exploits the variations in power consumption during pairing computations.


Chapter 7

Pairing Computations Against Fault and

Power Attacks

BILINEAR PAIRING is a new and increasingly popular way of constructing crypto-

graphic protocols. This has resulted in the development of pairing based schemes

such as identity based encryption (IBE) which are ideally used in identity aware devices.

The security of such devices depends on the security of the pairing computations. This thesis considers the security of pairing computations against physical attacks based on the covert power channel and on faulty outputs. The introductory works on the cryptanalysis of pairing computations by exploiting power consumption and faulty outputs are described in [82] and [40]. However, the existing works address only a small set of pairing computations, and they do not perform actual attacks.

7.1 Introduction

In general, implementation techniques for computing the Tate pairing, such as the Barreto, Kim, Lynn, and Scott (BKLS) algorithm [138], are effectively realized as a point multiplication with a fixed multiplier and some auxiliary operations. However, the algorithms for the Tate pairing by Duursma and Lee [126] and their modification by Kwon [111] are not based on a point multiplication algorithm. These two algorithms compute the Tate pairing on supersingular curves over the field F3m. Fault injection attacks on these two pairing algorithms have been explicitly studied by Page and Vercauteren in [82]. The attack exploits the effect of a fault in a specific register which stores the number of iterations of the pairing computation. Countermeasures for resisting fault attacks on the respective pairing algorithms are also described in the same paper.

This chapter describes how the above fault attack is mounted on a cryptoprocessor. The attack assumes that the respective fault is injected into a specific register inside the pairing cryptoprocessor. With experimental results, this chapter demonstrates a technique for injecting a fault into a register by tuning the clock frequency. The chapter also identifies the limitations of the existing countermeasures and proposes a new countermeasure to defend against fault attacks. It further identifies a weakness of pairing computations based on Miller’s algorithm and demonstrates this vulnerability on the computation of asymmetric pairings over BN curves [76] and in Edwards coordinates [47]. A suitable countermeasure against such an attack is also proposed in this chapter.

The side-channel attack based on power consumption analysis of pairing computations is another objective of this chapter. Differential power analysis (DPA) of the ηT pairing over F2m is described in [40]; it targets addition and multiplication operations performed on one secret and one public parameter. In this chapter we propose a DPA attack on pairing computations over prime fields. Through experimental results we demonstrate the proposed attack on an FPGA platform. The chapter further proposes a suitable procedure for computing pairings over prime fields that is secure against the DPA attack.

The outline of the chapter is as follows. The chapter starts with the demonstration of the fault injection technique, followed by a countermeasure against the fault attack. It then proposes a new fault attack, which is described for pairing computations over BN curves and in Edwards coordinates. The chapter then demonstrates power attacks on pairing computations over BN curves, followed by a countermeasure.

7.2 Fault Attack on Tate Pairing [82]

A fault attack on a pairing computation tries to exploit erroneous results that are produced by the device in the presence of a transient fault at the loop bound m. Algo. 7.1 presents a specialised algorithm proposed by Duursma and Lee in [126] to compute pairings on a family of hyperelliptic curves, including the supersingular curves in characteristic 3. In the algorithm, ρ, σ, and b are known system parameters. The algorithm was further improved by Kwon [111] and by Barreto et al. [138].

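Algorithm 7.1: Duursma-Lee algorithm.
Input: $P = (x_1, y_1)$, $Q = (x_2, y_2)$
Output: $f_P(\phi(Q)) \in \mu_l \subset F^*_{q^6}$

$f \leftarrow 1$
for i = 1 to m do
    $x_1 \leftarrow x_1^3$, $y_1 \leftarrow y_1^3$
    $\mu \leftarrow x_1 + x_2 + b$
    $\lambda \leftarrow -y_1 y_2 \sigma - \mu^2$
    $g \leftarrow \lambda - \mu\rho - \rho^2$
    $f \leftarrow f \cdot g$
    $x_2 \leftarrow x_2^{1/3}$, $y_2 \leftarrow y_2^{1/3}$
end
return $f^{q^3-1}$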

Page and Vercauteren [82] studied the security of pairing algorithms against fault attacks. They showed that if an adversary can induce a suitable transient fault at the loop bound m of the Duursma-Lee algorithm, then the secret point P = (x1, y1) can be revealed easily. The transient fault on m can be induced through a glitch attack, or by provoking an error in the memory or register where m is stored [113].

Suppose an adversary induces transient faults into the register that holds the loop boundary m, and measures the modified loop boundary and the corresponding pairing result. Suppose further that in two instances the loop boundary m is replaced by m ± r and m ± r + 1. The corresponding pairing results are $R_1 = e_{m \pm r}$ and $R_2 = e_{m \pm r+1}$. The ratio of these two pairings gives

\[ R = \frac{R_2}{R_1} = \frac{e_{m \pm r+1}}{e_{m \pm r}} = g_{m \pm r+1}^{\,q^3-1}, \]

where

\[ g_i = -y_1^{3^i} \cdot y_2 \sigma - \mu_i^2 - \mu_i \rho - \rho^2. \]



The value of $g_i$ can be extracted from $g_i^{q^3-1}$ through a root-finding algorithm and by solving a linear system of equations [82]. Here σ and ρ are field extension parameters known to the attacker. The attacker can then extract the values of x1 and y1 from the above equation. We refer to [82] for further analysis and information regarding this attack.

7.2.1 Fault Induction Through Clock Signal

The above fault attack assumes that a known fault has already been injected into the register holding the loop boundary. However, it does not use an actual fault injection technique. This section proposes such a technique for injecting a fault into a specific register of a cryptoprocessor. The proposed technique manipulates the clock frequency to inject the fault.

In our target application it is necessary to induce a fault in the register which holds the loop boundary. In the case of Tate pairing computation over the field F3m, the value of m is more than 512, so it is stored in a register of 10 bits or more. The hardware is assumed to be designed in such a way that it receives the value of m through a serial port. The respective register is designed as a shift register. The serial data is normally generated synchronously with the receiver’s clock, and the shift operation of the register is driven by the same clock. Therefore, an attacker who has control over the clock signal of the register can store faulty data in the register instead of the actual incoming data.

For example, consider a 12-bit register that stores the value of m. Assume further that the register is loaded serially through BUS 0 by a synchronous left shift on each clock. Therefore, 12 clocks are required to store a new 12-bit value into the register. The correct data value to be stored into the register is 1365 (010101010101 in binary).

Figure 7.1 shows the experimental results of three instances of the above load operation. One load operation takes 12 clocks, and the signal BUS 1 remains high during the load. The first instance (leftmost “/out” column) shows the correct load, which completes without any disturbance. The next two columns show the results of two faulty load operations. During the 7th and the 5th cycle of these two load operations, respectively, we raise the clock to 4 times its nominal frequency. In these cycles, due to the high-speed clock, the incoming data from BUS 0 could not be stored into the LSB of the register and no shift operation was performed. However, the counter was incremented and proceeded to the next input. As a result, the final value in the register after 12 clocks is faulty. In the above experiment the faulty values are 2742 and 2773, respectively, in the two instances. Therefore, a fault can easily be injected into the loop boundary of the pairing computation through the clock signal. The faulty value of the loop boundary is then exploited to find the secret point P = (x1, y1) of the Duursma-Lee algorithm.

Figure 7.1: Expected and faulty values of register containing the value of m. (Panels: correct value; faulty values at two instances.)
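The effect of a glitched clock cycle on the serial load can be illustrated with a small behavioural model. The following Python sketch assumes a deliberately simplified model: in a glitched cycle the incoming bit is lost and no shift happens, while the cycle counter still advances, and the register initially holds the previously loaded value. The function name and this model are assumptions for illustration; the sketch is not meant to reproduce the measured waveforms of Figure 7.1.

```python
def serial_load(bits, width=12, start=0, glitched_cycles=()):
    """Shift `bits` (MSB first) into a `width`-bit register, one bit per clock.

    Simplified fault model (assumption): in a glitched cycle the incoming
    bit is dropped and the register does not shift, but the cycle counter
    still advances to the next input bit.
    """
    mask = (1 << width) - 1
    reg = start
    for cycle, bit in enumerate(bits, start=1):
        if cycle in glitched_cycles:
            continue  # bit lost, no shift performed in this cycle
        reg = ((reg << 1) | bit) & mask
    return reg

M = 1365  # 010101010101 in binary
BITS = [(M >> k) & 1 for k in range(11, -1, -1)]  # MSB first

if __name__ == "__main__":
    correct = serial_load(BITS)
    faulty = serial_load(BITS, start=correct, glitched_cycles={5})
    print(correct, faulty)
```

In this model a single glitched cycle leaves one stale bit of the old register content in the most significant position, so the loaded loop boundary differs from the intended value even though the load still takes 12 clocks.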

7.2.2 Analysis of Existing Countermeasures

Page and Vercauteren [82] have given two countermeasures against fault attacks on pairing based cryptography. Both countermeasures are based on the point blinding technique. In the fault attack described in Section 7.2, the fault is injected randomly into the loop boundary m. The attacker can easily measure the faulty value of m through timing or power analysis. The attacker collects two pairing results R1 and R2 for the two faulty loop boundaries m ± r and m ± r + 1, respectively, and computes the ratio

\[ R = \frac{R_2}{R_1} = \frac{e_{m \pm r+1}(P,Q)}{e_{m \pm r}(P,Q)}, \tag{7.1} \]

which is exploited to compute the x and y coordinates of the secret point P. The countermeasures proposed in [82] protect against the fault attack by randomizing the input points P and Q so that the ratio R cannot be exploited.

7.2.2.1 New Point Blinding Technique [82]

The aim of the point blinding technique is to randomize the input points so that the attacker cannot utilize knowledge of the public point in the pairing computation. This countermeasure chooses two integers x, y randomly from $Z^*_l$ such that xy ≡ 1 (mod l). The points P and Q in the computation of e(P,Q) are blinded by computing xP and yQ. The pairing is computed on xP and yQ as e(xP,yQ), since it is known that

\[ e_m(P,Q) = e_m(xP,yQ) = e_m(P,Q)^{xy}. \tag{7.2} \]

In both the Duursma-Lee and the Kwon-BGOS algorithms, the input points are processed and the pairing result is produced after m iterations. According to Eq. 7.2, the pairing results on the point pairs (P,Q) and (xP,yQ) are equal. By definition, this equality holds only for correct pairing results, which are produced after m iterations of the algorithm. Therefore, when the aforementioned fault attack is applied to the new point blinding technique, the attacker cannot obtain a ratio R for which $R = e_{m \pm r+1}/e_{m \pm r}$. With this countermeasure the two outputs are:

\[ R_1 = e_{m \pm r}(x_1 P, y_1 Q) \neq e_{m \pm r}(P,Q), \]
\[ R_2 = e_{m \pm r+1}(x_2 P, y_2 Q) \neq e_{m \pm r+1}(P,Q). \]

Therefore, the ratio of R2 and R1 cannot be exploited to find the secret point P. The variables x and y are updated after every pairing computation. It is suggested in [82] that these random variables are refreshed by computing x = (x · c) mod l and y = (y · d) mod l such that (c · d) ≡ 1 (mod l).
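The algebra behind this blinding can be illustrated on a toy bilinear map. The sketch below is in Python and does not compute a real pairing: as an assumption for illustration, points are represented by their scalars in $Z_l$ and the map is e(a, b) = g^(ab) mod p, where g has prime order l = 1019 in $Z_p^*$ with p = 2039. All names and parameters are hypothetical.

```python
import secrets

# Toy parameters (assumptions): L is prime, PM = 2*L + 1 is prime, and
# G = 4 is a square modulo PM, so G has order L in the multiplicative group.
L, PM, G = 1019, 2039, 4

def toy_pairing(a, b):
    """Toy bilinear map on Z_L x Z_L: e(a, b) = G^(a*b) mod PM."""
    return pow(G, (a * b) % L, PM)

def blinding_pair():
    """Return random (x, y) with x*y = 1 (mod L)."""
    x = secrets.randbelow(L - 1) + 1   # x in [1, L-1]
    return x, pow(x, -1, L)

def refresh(x, y):
    """Refresh (x, y) with a fresh pair (c, d), c*d = 1 (mod L)."""
    c, d = blinding_pair()
    return (x * c) % L, (y * d) % L

def blinded_pairing(a, b, x, y):
    """Evaluate the map on the blinded inputs xP and yQ."""
    return toy_pairing((x * a) % L, (y * b) % L)

if __name__ == "__main__":
    x, y = blinding_pair()
    print(blinded_pairing(123, 456, x, y) == toy_pairing(123, 456))
```

The blinded and unblinded results agree because the exponent picks up the factor xy, which is 1 modulo the group order; the refreshed pair keeps this invariant.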

However, the main difficulty of the new point blinding technique is the generation of random variables x and y for which x · y ≡ 1 (mod l). One possibility is that the random variables are generated by the user during the key generation procedure. The user then supplies them to the cryptoprocessor during the pairing computation along with the private key. If the above procedure is also followed for refreshing (x,y), then another similar pair (c,d) must be available to the user and supplied to the cryptoprocessor. Therefore, this protocol demands 4m bits of additional private parameters.

Alternatively, the overhead of additional private parameters may be avoided in the following way. Two such pairs of integers (x,y) and (c,d) may be stored inside the pairing cryptoprocessor, so that the individual user does not need to supply them. The main problem with this scheme is that the architecture must be designed for a fixed l, with all four random variables stored inside the hardware. This is only possible for a fixed application with a fixed value of l; the whole hardware must be replaced if a new l needs to be chosen.

7.2.2.2 Altering Traditional Point Blinding [82]

The pairing computation on points P, Q is performed by

\[ e(P,Q) = e(P,Q+X) \cdot e(P,X)^{-1}, \]

where X is a random point. It is assumed that P is secret and Q is public. The fault attack described in [82] exploits knowledge of the public point Q. This defence mechanism [82] randomizes the public point Q using the random point X. Thus, it computes e(P,Q+X) instead of e(P,Q), and eliminates the surplus factor by multiplying with the inverse of e(P,X).

The value $e(P,X)^{-1}$ is assumed to be supplied to the cryptoprocessor in order to avoid an additional pairing computation and an inversion in the extension field. One difficulty of this countermeasure is the refreshment of the random point X. In [82], it is done by computing bX and $e(P,X)^{-b}$ such that $b \in_R \{-2,+2\}$. The major difficulty of this countermeasure is the generation of the random point X. As mentioned for the previous countermeasure, X can be considered private to the user and supplied to the cryptoprocessor during the execution of the pairing algorithm. The key size is then increased by 18m bits (2 × 6m for the x and y coordinates of X, and 6m for $e(P,X)^{-1}$), where it is assumed that the embedding degree of the underlying elliptic curve is 6 and hence the point X and $e(P,X)^{-1}$ are in $F_{q^6}$.
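The identity used by this countermeasure, and the refreshment of X with b ∈ {−2, +2}, can be illustrated on a toy bilinear map. The Python sketch below is not a real pairing: as an assumption for illustration, points are represented by scalars in $Z_l$ and e(a, b) = g^(ab) mod p with a generator g of prime order l = 1019 modulo p = 2039. All names are hypothetical.

```python
# Toy parameters (assumptions): G = 4 has prime order L modulo the prime PM.
L, PM, G = 1019, 2039, 4

def toy_pairing(a, b):
    """Toy bilinear map on Z_L x Z_L: e(a, b) = G^(a*b) mod PM."""
    return pow(G, (a * b) % L, PM)

def blinded_pairing(p, q, x_pt, inv_e_px):
    """Compute e(P, Q + X) * e(P, X)^(-1) with the stored inverse."""
    return toy_pairing(p, (q + x_pt) % L) * inv_e_px % PM

def refresh(x_pt, inv_e_px, b):
    """Refresh X <- bX and the stored value to e(P, X)^(-b), b in {-2, +2}."""
    return (b * x_pt) % L, pow(inv_e_px, b, PM)

if __name__ == "__main__":
    p, q, x_pt = 321, 654, 77
    inv_e_px = pow(toy_pairing(p, x_pt), -1, PM)
    print(blinded_pairing(p, q, x_pt, inv_e_px) == toy_pairing(p, q))
```

The refresh step keeps the stored value consistent with the new point, so no further pairing evaluation or extension-field inversion is needed after the initial setup.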

7.2.3 Proposed Countermeasure

This section proposes a suitable countermeasure against the fault attack on pairing computation. The underlying principle of the fault attack is the ability of the attacker to change the value of the loop boundary m. The attacker is also able to measure the change through timing or power analysis of the computation. Through fault induction, the attacker tries to obtain two pairing results, one for m ± r iterations and the other for m ± r + 1 iterations. Hence, our countermeasure ensures that even if there is a fault, the attacker cannot correlate the pairing output with the number of iterations. The objective is to prevent the attacker from ascertaining the ratio R2/R1 mentioned in Section 7.2. At the same time, our proposed countermeasure does not increase the size of the user’s private key.

The proposed countermeasure blinds the loop boundary m, as it is the main factor in fault attacks. It protects the loop boundary so that the attacker cannot guess the number of iterations for which the faulty output is produced. It modifies the Duursma-Lee algorithm to protect the secret point in the pairing computation against the fault attack. The modified algorithm is shown in Algo. 7.2. Other pairing computation procedures, like the Kwon-BGOS algorithm, can be modified by the same procedure in order to defend them against the fault attack.

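Algorithm 7.2: Modified Duursma-Lee algorithm.
Input: $P = (x_1, y_1)$, $Q = (x_2, y_2)$
Output: $f_P(\phi(Q)) \in \mu_l \subset F^*_{q^6}$

Choose $r_1 \in_R F^*_{q^6}$, and $r_2 \in_R Z$, $2 \le r_2 \le m$
$f_0 \leftarrow r_1$, $f_1 \leftarrow 1$
$m' \leftarrow m + r_2$
for i = 1 to m′ do
    $x_1 \leftarrow x_1^3$, $y_1 \leftarrow y_1^3$
    $\mu \leftarrow x_1 + x_2 + b$
    $\lambda \leftarrow -y_1 y_2 \sigma - \mu^2$
    $g \leftarrow \lambda - \mu\rho - \rho^2$
    $f_1 \leftarrow f_1 \cdot g$
    $j \leftarrow (i == m)$
    $f_0 \leftarrow f_j$
    $x_2 \leftarrow x_2^{1/3}$, $y_2 \leftarrow y_2^{1/3}$
end
return $f_0^{q^3-1}$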

7.2.3.1 Correctness Analysis

Theorem 7.1: The modified Duursma-Lee algorithm produces the correct result.

Proof.

Algo. 7.2 is modified from the original Duursma-Lee algorithm (Algo. 7.1) to resist side-channel and fault attacks. The original algorithm runs for m iterations and produces the pairing result after the m-th iteration. In the modified algorithm, the loop boundary m′ is random, as m′ ← m + r2 with r2 ∈R Z and 2 ≤ r2 ≤ m, so the algorithm runs for a random number of iterations. However, the intermediate pairing result f1 is copied into f0 at the m-th iteration only, and not at any other iteration. At the end of the execution, i.e., after m′ iterations, f0 holds the pairing result of m iterations. Hence, the algorithm produces the correct reduced Tate pairing result.
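The control flow used in this argument can be exercised on a toy model. The Python sketch below keeps only the loop structure of Algo. 7.2: the field arithmetic is replaced, as an assumption for illustration, by multiplication modulo a prime, the per-iteration factor g is a hypothetical stand-in, and the final exponentiation is omitted. All names are illustrative.

```python
P = 2**61 - 1  # toy prime modulus standing in for the extension field

def step_factor(i):
    """Hypothetical stand-in for the factor g computed in iteration i."""
    return (i * i + 3 * i + 7) % P

def original_loop(m):
    """Loop structure of Algo. 7.1: multiply m factors."""
    f = 1
    for i in range(1, m + 1):
        f = f * step_factor(i) % P
    return f

def blinded_loop(m, r1, r2, m_prime=None):
    """Loop structure of Algo. 7.2; m_prime can be overridden to model a fault."""
    f = [r1, 1]                       # f[0] = f0 (random), f[1] = f1
    if m_prime is None:
        m_prime = m + r2
    for i in range(1, m_prime + 1):
        f[1] = f[1] * step_factor(i) % P
        j = int(i == m)               # j = 1 only in the m-th iteration
        f[0] = f[j]                   # f0 keeps its value unless i == m
    return f[0]

if __name__ == "__main__":
    m, r1 = 20, 123456789
    print(all(blinded_loop(m, r1, r2) == original_loop(m) for r2 in range(2, m + 1)))
```

In this model every admissible r2 yields the same value as the unblinded loop, while a loop boundary that is faulted below m leaves the random value r1 in f0.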

7.2.3.2 Security Against Fault Attack

Security Assumption. The adversary can only induce unknown random faults into the loop boundary.

Theorem 7.2: The modified Duursma-Lee algorithm is secure against the fault attack.

Proof.

In the fault attack, the adversary is interested in two pairing results, $R_{m' \pm r'}$ and $R_{m' \pm r'+1}$. We consider the following two scenarios.

• Inject fault at m′: The adversary can change the value of m′ to m′ ± r′ (with random r′) by injecting a fault at m′. Thus, our modified Duursma-Lee algorithm runs for m′ ± r′ iterations. If the resulting value m′ ± r′ ≥ m, then the algorithm produces the result $R_m$ for m iterations; otherwise it produces the random value $r_1^{q^3-1}$ as the pairing result. So the adversary cannot collect two such target outputs by injecting random faults into the m′ register.

• Inject fault at m: The adversary can inject a random fault into the m register and alter m to m ± r′. Thus, the algorithm runs for m ± r′ + r2 iterations, but it produces the result $R_{m \pm r'}$ for m ± r′ iterations only, where r2 and r′ are both random. This result can be collected by the adversary. The adversary can also measure the total number of iterations m ± r′ + r2 by timing or power analysis. However, the adversary cannot correlate the outputs with the corresponding measured iteration numbers, which are in fact uncorrelated. Thus, the adversary cannot obtain two useful pairing results. Therefore, the fault attack described in [82] cannot be mounted on the proposed countermeasure.

7.3 Fault Attack on Pairing in Edwards Coordinates

This section analyzes the security of pairing computation in Edwards coordinates, as defined by Ionica and Joux [47], against fault attacks. It identifies a weakness of this algorithm in the presence of a fault and gives a suitable countermeasure.

7.3.1 Pairing in Edwards Coordinates

Edwards showed in [65] that every elliptic curve defined over an algebraic number field F is birationally equivalent to a curve over some extension of F given by the equation:

\[ x^2 + y^2 = c^2 (1 + x^2 y^2). \tag{7.3} \]

Thereafter Bernstein and Lange [64] showed that the group operations can be performed most efficiently on elliptic curves defined in Edwards coordinates. The equation $x^2 + y^2 = 1 + d x^2 y^2$ is called the Edwards curve [50]. It was shown in [50] that an Edwards curve E is birationally equivalent to the elliptic curve $E_d : (1/(1-d)) v^2 = u^3 + 2((1+d)/(1-d)) u^2 + u$ via the rational map:

\[ \psi : E_d \to E, \qquad (u,v) \mapsto \left( \frac{2u}{v}, \frac{u-1}{u+1} \right). \tag{7.4} \]

The addition formula on an Edwards curve is given by:

\[ (x_1,y_1),(x_2,y_2) \mapsto \left( \frac{x_1 y_2 + y_1 x_2}{1 + d x_1 x_2 y_1 y_2}, \frac{y_1 y_2 - x_1 x_2}{1 - d x_1 x_2 y_1 y_2} \right). \]

It is shown in [64] that the above addition law is complete when d is not a square. This means that it is defined for all pairs of input points on the Edwards curve, with no exceptions for the doubling operation, the neutral element, etc.
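The completeness of the addition law can be observed on a small example. The Python sketch below assumes a toy Edwards curve over $F_{101}$ with d = 2, which is a non-square modulo 101; the parameters and names are chosen for illustration only and are unrelated to any curve used for pairings.

```python
# Toy Edwards curve x^2 + y^2 = 1 + D*x^2*y^2 over F_101 (assumed parameters).
P_MOD = 101
D = 2
assert pow(D, (P_MOD - 1) // 2, P_MOD) == P_MOD - 1  # D is a non-square

def on_curve(pt):
    x, y = pt
    return (x * x + y * y - 1 - D * x * x * y * y) % P_MOD == 0

def add(p1, p2):
    """Edwards addition law; the same formula also handles doubling."""
    x1, y1 = p1
    x2, y2 = p2
    t = D * x1 * x2 * y1 * y2 % P_MOD
    x3 = (x1 * y2 + y1 * x2) * pow((1 + t) % P_MOD, -1, P_MOD) % P_MOD
    y3 = (y1 * y2 - x1 * x2) * pow((1 - t) % P_MOD, -1, P_MOD) % P_MOD
    return (x3, y3)

# All affine points of the toy curve, found by exhaustive search.
POINTS = [(x, y) for x in range(P_MOD) for y in range(P_MOD) if on_curve((x, y))]
NEUTRAL = (0, 1)

if __name__ == "__main__":
    ok = all(on_curve(add(a, b)) for a in POINTS for b in POINTS)
    print(len(POINTS), "points, addition closed:", ok)
```

Because d is a non-square, neither denominator vanishes for any pair of curve points, so the single formula covers generic addition, doubling, and addition with the neutral element (0, 1).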

The pairing computation in Edwards coordinates and in Twisted Edwards coordinates [50] is defined by Ionica and Joux [47], and by Das and Sarkar [49], respectively. The doubling and mixed addition steps of Miller’s algorithm for pairing computation are redefined in Edwards and Twisted Edwards coordinates in these two papers. It is shown that the computation of the pairing f in Edwards coordinates is more efficient than that in Twisted Edwards coordinates. This chapter takes the pairing computation given in [47] for analyzing security against fault attacks.

7.3.2 Attack Procedure

The fault attack defined in [82] will not work on Miller’s algorithm, Algo. 2.5, in Edwards coordinates due to the complex nature of the iterative operations. For example, the doubling operation [64] on K = (X1,Y1,Z1) gives 2K = (X3,Y3,Z3), and the formulas are:

\[ X_3 = 2 X_1 Y_1 (2 Z_1^2 - (X_1^2 + Y_1^2)), \]
\[ Y_3 = (X_1^2 + Y_1^2)(Y_1^2 - X_1^2), \]
\[ Z_3 = (X_1^2 + Y_1^2)(2 Z_1^2 - (X_1^2 + Y_1^2)). \]

Similarly, during addition K is updated by K + P, which is even more complex than doubling [64]. The point K is initialized by the secret point P = (X0,Y0,1).
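The projective doubling formulas can be checked against the affine addition law on a small example. The Python sketch below assumes a toy Edwards curve $x^2 + y^2 = 1 + d x^2 y^2$ over $F_{101}$ with the non-square d = 2; the parameters and names are illustrative and unrelated to the curves used for pairings.

```python
# Toy Edwards curve over F_101 with non-square D = 2 (assumed parameters).
P_MOD = 101
D = 2

def on_curve(x, y):
    return (x * x + y * y - 1 - D * x * x * y * y) % P_MOD == 0

def affine_double(x, y):
    """Doubling via the affine Edwards addition law with both inputs equal."""
    t = D * x * x * y * y % P_MOD
    x3 = 2 * x * y * pow((1 + t) % P_MOD, -1, P_MOD) % P_MOD
    y3 = (y * y - x * x) * pow((1 - t) % P_MOD, -1, P_MOD) % P_MOD
    return (x3, y3)

def projective_double(X, Y, Z):
    """Projective doubling formulas for (X3, Y3, Z3) as given in the text."""
    s = (X * X + Y * Y) % P_MOD       # X1^2 + Y1^2
    w = (2 * Z * Z - s) % P_MOD       # 2*Z1^2 - (X1^2 + Y1^2)
    return (2 * X * Y * w % P_MOD, s * (Y * Y - X * X) % P_MOD, s * w % P_MOD)

def to_affine(X, Y, Z):
    z_inv = pow(Z, -1, P_MOD)
    return (X * z_inv % P_MOD, Y * z_inv % P_MOD)

POINTS = [(x, y) for x in range(P_MOD) for y in range(P_MOD) if on_curve(x, y)]

if __name__ == "__main__":
    x, y = POINTS[5]
    print(to_affine(*projective_double(x, y, 1)), affine_double(x, y))
```

The agreement follows from the curve equation, which lets the two affine denominators be rewritten as $x^2 + y^2$ and $2 - (x^2 + y^2)$ before clearing them into the common coordinate $Z_3$.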

Algorithm 2.5 is realized as a point multiplication along with some additional field multiplications for computing the pairing value f. In the doubling step of Miller’s algorithm in Edwards coordinates [47], the value of f is computed by $f \leftarrow f^2 l'$, where, in the case of an even embedding degree k > 2, l′ is computed by the following equation:

\[ \begin{aligned} l' = {} & 2 X_1 Y_1 (x/y - y/x)(X_1^2 - Y_1^2)(X_1^2 + Y_1^2 - Z_1^2) \\ & - 2 (X_1^2 - Y_1^2)^2 (X_1^2 + Y_1^2 - Z_1^2) \\ & - d x^2 y^2 Z_1^2 (X_1^2 + Y_1^2)(2 Z_1^2 - X_1^2 - Y_1^2) \\ & + (X_1^2 + Y_1^2)(2 Z_1^2 - X_1^2 - Y_1^2)(X_1^2 + Y_1^2 - Z_1^2). \end{aligned} \]

The Tate pairing El(P,Q) is computed by Miller’s algorithm on points P, Q such that P is an l-torsion point on the curve E(Fq) and Q ∈ E(Fqk). In order to mount a fault attack on Miller’s algorithm in Edwards coordinates, we assume that the adversary is able to inject a fault into the register holding l. We further assume that the adversary can obtain the pairing result El(P,Q) for l = 2. This may be possible by adopting some powerful fault injection procedure, or through a number of trials with the help of timing and simple power analysis [82, 100, 113]. If l = 2, then Miller’s algorithm runs for only one iteration and executes only the doubling part of Algo. 2.5. In this scenario the pairing output is f = l′ with K = P. So f is a function of X0, Y0, x, y, and d, which is obtained from the equation of l′ by replacing X1 by X0, Y1 by Y0, and Z1 by 1. We can assume that the value of d (the curve parameter) and Q = (x,y) are known to the attacker. Thus, f simplifies to the following equation:

\[ \begin{aligned} f = {} & a_1 X_0^6 + a_2 Y_0^6 + a_3 X_0^5 Y_0 + a_4 X_0 Y_0^5 + a_5 X_0^2 Y_0^4 \\ & + a_6 X_0^4 Y_0^2 + a_7 X_0 Y_0^3 + a_8 X_0^3 Y_0 + a_9 X_0^2 Y_0^2 \\ & + a_{10} X_0^4 + a_{11} Y_0^4 + a_{12} X_0^2 + a_{13} Y_0^2, \end{aligned} \tag{7.5} \]

where $a_1, \cdots, a_{13}$ are constants, as they can be expressed in terms of the known values x, y, and d. We can linearize the above equation by introducing a number of variables. The public point Q can be changed to obtain a number of such equations. Hence, X0 and Y0 can be found by solving the resulting set of linear equations.
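The reduction of l′ to the 13 monomials of Eq. 7.5 can be checked numerically. The Python sketch below is an illustration under stated assumptions: it works over the prime field modulo $2^{31} - 1$ instead of the extension field in which x and y actually lie, and the coefficient values in `coefficients` were obtained by expanding l′ with Z1 = 1 by hand for this sketch (writing c = x/y − y/x and e = d x² y²); they are not quoted from the thesis. All names are hypothetical.

```python
import random

P = 2**31 - 1  # toy prime field (assumption; the real x, y lie in F_{q^k})

def inv(a):
    return pow(a % P, -1, P)

def l_prime(X, Y, Z, x, y, d):
    """Direct evaluation of the doubling-step expression l'."""
    c = (x * inv(y) - y * inv(x)) % P
    s = (X * X + Y * Y) % P
    t = (X * X - Y * Y) % P
    z2 = Z * Z % P
    return (2 * X * Y * c * t * (s - z2)
            - 2 * t * t * (s - z2)
            - d * x * x * y * y * z2 * s * (2 * z2 - s)
            + s * (2 * z2 - s) * (s - z2)) % P

def coefficients(x, y, d):
    """Coefficients keyed by (power of X0, power of Y0), hand-expanded for Z1 = 1."""
    c = (x * inv(y) - y * inv(x)) % P
    e = d * x * x * y * y % P
    return {
        (6, 0): -3, (0, 6): -3, (5, 1): 2 * c, (1, 5): -2 * c,
        (2, 4): -1, (4, 2): -1, (1, 3): 2 * c, (3, 1): -2 * c,
        (2, 2): 2 + 2 * e, (4, 0): 5 + e, (0, 4): 5 + e,
        (2, 0): -2 - 2 * e, (0, 2): -2 - 2 * e,
    }

def f_poly(X0, Y0, x, y, d):
    """Evaluate the 13-monomial polynomial form of f."""
    return sum(a * pow(X0, i, P) * pow(Y0, j, P)
               for (i, j), a in coefficients(x, y, d).items()) % P

if __name__ == "__main__":
    rng = random.Random(1)
    vals = [rng.randrange(1, P) for _ in range(5)]
    print(l_prime(vals[0], vals[1], 1, *vals[2:]) == f_poly(*vals))
```

The check confirms that, once Z1 = 1, only the 13 monomials of Eq. 7.5 occur and that their coefficients depend only on the known quantities x, y, and d.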

7.3.2.1 Practical Implication of Above Fault Attack

Let us assume l is a large prime (say 256 bits long in practice). Then the probability of setting l = 2 by random fault injection [113] is very low ($\approx 2^{-256}$ for a 256-bit l). Hence a random fault in register l has a very low probability of success. However, we propose a different strategy.

The requirement of our fault attack is satisfied by inverting the least significant bit of l (say l[1]) and setting i = 1. Note that since l is an odd prime, l[1] is 1. Now, if l is 256 bits long, then i has ⌈log2(256)⌉ = 8 bits. Hence the probability of setting i = 1 by random fault injection is at least $2^{-8}$. The algorithm runs for only one iteration as i = 1, and it executes only the doubling part as l[i] = 0. Thus the probability of success of the attacker is $2^{-9}$. Hence we expect the attacker to succeed about once in $2^9 = 512$ trials.
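The success estimate can be reproduced with a short calculation. The Python sketch below rests on assumptions that go beyond the text: the two fault events (setting i = 1 and flipping l[1]) are treated as independent, the flip of l[1] is assigned probability 1/2, and successive trials are treated as independent. The names are illustrative.

```python
from fractions import Fraction

# Assumed fault model: i is an 8-bit value set uniformly at random, and
# the bit l[1] is flipped with probability 1/2, independently of i.
p_i_equals_1 = Fraction(1, 2**8)
p_lsb_flipped = Fraction(1, 2)

p_success = p_i_equals_1 * p_lsb_flipped   # probability per trial
expected_trials = 1 / p_success            # mean of a geometric distribution

def p_at_least_one(trials):
    """Probability of at least one success in `trials` independent attempts."""
    return 1 - float(1 - p_success) ** trials

if __name__ == "__main__":
    print(p_success, expected_trials, round(p_at_least_one(512), 3))
```

Under this model 512 is the expected number of trials until the first success; a fixed budget of 512 trials succeeds at least once with probability of roughly 63%, so more trials are needed for a high-confidence attack.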

7.3.3 Countermeasure

In order to resist the above fault attack, it must be ensured that Miller’s algorithm does not produce a valid pairing result for l = 2, or for the condition that i = 1 and l[1] = 0. In general, l is an odd prime in the l-Tate pairing computation, which means l[1] = 1. To mount the above fault attack, however, it is essential to alter the value of l[1] from 1 to 0. Thus we suggest the modified Miller’s algorithm shown in Algo. 7.3 for defending against the fault attack. The proposed technique ensures that the pairing is not computed when l[1] = 0.

Algorithm 7.3: Fault attack resistant Miller’s algorithm.
Input: P an l-torsion point ∈ E(F_q), Q ∈ E(F_{q^k})
Output: the Tate pairing El(P,Q)
i = ⌊log2(l)⌋, K ← P, f ← 1.
if l[1] = 0 then
    return 0.
end
while i ≥ 1 do
    Compute equations of l′ and v′ arising in the doubling of K.
    K ← 2K and f ← f^2 l′(Q)/v′(Q).
    if the i-th bit of l is 1 then
        Compute equations of l′ and v′ arising in the addition of K and P.
        K ← P + K and f ← f l′(Q)/v′(Q).
    end
    i ← i − 1.
end
return f^{(q^k − 1)/l}.
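The control flow of Algorithm 7.3 can be sketched without any curve arithmetic. The following hypothetical helper only records which steps (doubling, addition) the loop would execute; the optional argument i models a faulted loop counter and is an assumption made for illustration, not part of the algorithm's interface.

```python
def guarded_miller_schedule(l, i=None):
    """Return the step sequence of the guarded Miller loop, or 0 if l is rejected.

    Bits of l are numbered from 1 at the least significant bit, as in the text.
    """
    if l & 1 == 0:              # l[1] = 0: refuse to compute the pairing
        return 0
    if i is None:
        i = l.bit_length() - 1  # floor(log2(l)) for l >= 1
    steps = []
    while i >= 1:
        steps.append("double")
        if (l >> (i - 1)) & 1:  # i-th bit of l
            steps.append("add")
        i -= 1
    return steps


if __name__ == "__main__":
    print(guarded_miller_schedule(13))           # fault-free run
    print(guarded_miller_schedule(13 ^ 1, i=1))  # LSB flipped and i forced to 1
```

In this sketch the faulted call is rejected before the loop starts, so the single-doubling output needed by the attack is never produced.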

7.3.3.1 Correctness Analysis

Theorem 7.3: The fault-attack resistant Miller’s algorithm produces the correct result for cryptographic pairing computation.

Proof.

The modified Miller’s algorithm aborts whenever l is even: it returns zero if the least significant bit (LSB) of l is zero, i.e., l[1] = 0. However, pairing computation for cryptographic applications chooses l as a large odd prime, so the LSB of l is one, i.e., l[1] = 1. In this case the proposed modified Miller’s algorithm executes exactly the same operations as its original form (Algo. 2.5). Thus it produces the correct pairing value for cryptographic applications.

7.3.3.2 Security Against Fault Attack

Theorem 7.4: Algorithm 7.3 is secure against the above fault attack.

Proof.

The fault attack described in section 7.3 assumes that the attacker is able to inject faults into particular variables during execution, namely the variables i and l. To mount the fault attack on pairing computation in Edwards coordinates it is necessary to set i = 1 and l[1] = 0. Let us assume that the adversary has successfully injected the required fault. To complete the attack, the adversary also needs the pairing result computed with the faulty values of i and l. But the proposed fault-attack resistant Miller’s algorithm does not execute the pairing with the above fault; it simply returns zero. Thus the proposed countermeasure is secure against the fault attack described in section 7.3.

7.4 Power Attacks on Pairing Computations

Page and Vercauteren [105] presented SPA and DPA attacks on the pairing computations performed by the Duursma-Lee algorithm [126] and the BLKS algorithm [138] over F_{3^m}. The power consumption attack on ηT pairing computation over F_{2^m} is described by Kim et al. in [40]. However, the corresponding attack over Fp has not been studied so far. This section investigates the security of pairing computations over Fp against power consumption attacks.

7.4.1 Weakness of Pairing Computations over Fp

In the decryption step of identity-based encryption schemes [112], a dominant operation is e(SID,U), where SID is the fixed secret key, and U is a part of a ciphertext. In this case, a power analysis attacker may try to extract the secret key from the pairing computation by repeatedly manipulating U. The Tate pairing over Fp consists of elliptic curve group operations (ECD and ECA), the line functions, and the Miller function [42]. The line functions, as defined by Chatterjee et al. [90], use both the public point U and the private point SID. The formulas of the line functions are built on the underlying Fp primitives.

During the addition step of Tate pairing computation the formula of the line function is l(x,y) = (y − Y2)Z3 − (x − X2)(Y2 Z1^3 − Y1) [90]. In pairing based cryptographic schemes, the point T = (X1,Y1,Z1) is the intermediate result of the current point doubling operation, the point U = (X2,Y2) is used as a public parameter (it could be the plaintext or message), and SID = (x,y) is used as the private key. The resultant point (T +U) is represented by (X3,Y3,Z3). Therefore, in such a scheme the operations (x − X2) and (y − Y2) could be exploited through power analysis attacks to find the x- and y-coordinates of the secret point.

7.4.2 Proposed DPA Attack

In this section, we investigate a differential power analysis (DPA) attack against the subtraction (x − X2) used in the Tate pairing on elliptic curves over Fp, where x is secret and X2 is public and known to, or even chosen by, the attacker. The subtraction (x − X2) in Fp is computed by first computing S = x − X2 and then reducing the result (if required) by adding p to S. Let us assume that all operations are performed on 2’s complement numbers. Therefore, the subtraction S = x − X2 could be performed as:

\[ S = \sum_{i=0}^{k} 2^i s_i = \sum_{i=0}^{k-1} 2^i x_i + \sum_{i=0}^{k-1} 2^i \overline{X_{2,i}} + 1, \]

where k represents the bit length of the operands (x, X2), X_{2,i} denotes the i-th bit of X2, and \overline{X_{2,i}} is its 1’s complement. The subtraction starts from the least significant bit (LSB) and computes the sum and carry bits iteratively. The formula for the i-th carry bit is

\[ c_i = x_i \overline{X_{2,i}} \oplus x_i c_{i-1} \oplus \overline{X_{2,i}} c_{i-1}. \]

Similarly, the i-th sum bit is computed as

\[ s_i = x_i \oplus \overline{X_{2,i}} \oplus c_{i-1} \quad \text{for } 0 \le i \le k-1, \text{ with } c_{-1} = 1. \]
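The sum and carry recurrences can be checked with a small bit-level sketch. The function name and the choice to return the final carry separately are assumptions for illustration; the recurrences themselves follow the text, with the bits of X2 complemented and an initial carry of 1.

```python
def subtract_bits(x, x2, k):
    """Compute x - X2 on k-bit operands as x + ~X2 + 1, bit by bit from the LSB.

    Returns (s, carry_out): s holds the sum bits s_0..s_{k-1}, and carry_out is
    the final carry c_{k-1}, which forms the top bit s_k of S.
    """
    carry = 1                                # c_{-1} = 1 supplies the "+1"
    s = 0
    for i in range(k):
        xi = (x >> i) & 1
        nx2i = ((x2 >> i) & 1) ^ 1           # 1's complement of bit i of X2
        s |= (xi ^ nx2i ^ carry) << i        # sum bit s_i
        carry = (xi & nx2i) ^ (xi & carry) ^ (nx2i & carry)   # carry bit c_i
    return s, carry


if __name__ == "__main__":
    print(subtract_bits(5, 3, 4))   # (2, 1)
```

A final carry of 1 indicates that x ≥ X2; a final carry of 0 indicates a negative difference, which is the case where p has to be added.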

The proposed DPA attack works as follows. The attacker first collects the power consumption traces for n randomly chosen public points U. We consider the simplified Hamming weight model for power leakage [159]. In this model, power consumption depends on the Hamming weight of the data being processed. Thus, we can express the power consumption W as:

W = εH +η

where H, ε, and η represent the Hamming weight of the intermediate data, the incremental

amount of power for each extra 1 in the Hamming weight, and the noise, respectively. We

assume that the average of noise η is zero.

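Let W be the power consumption associated with the subtraction operation (x − X2). We start from the LSB and iteratively find all bits of the x-coordinate of the secret point SID = (x,y). To recover the i-th bit of x, we guess that xi = 0 and divide the power consumptions into two sets according to the value of \overline{X_{2,i}} ⊕ c_{i−1}, where \overline{X_{2,i}} is the 1’s complement of the i-th bit of X2:

\[ P_b = \{ W \mid \overline{X_{2,i}} \oplus c_{i-1} = b \} \quad \text{with } b \in \{0, 1\}. \]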

Thus, the differential power consumption is:

∆ = < P1−P0 > .

If the guess is correct, then the averages of P1 and P0 are ε(M+1)/2 and ε(M−1)/2, respectively, where M corresponds to the bit length of S. Thus, if ∆ > 0, we know that xi = 0. Otherwise, the averages of P1 and P0 are ε(M−1)/2 and ε(M+1)/2, respectively, so ∆ < 0 indicates that xi = 1. There should be a positive peak when xi = 0 and a negative peak when xi = 1.

In summary, since the subtraction operation (x−X2) of line function in pairing compu-

tation is vulnerable to the proposed attack, we can recover x. Next, we can obtain the value

of y-coordinate of the secret point SID by solving the curve equation.
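The difference-of-means procedure can be reproduced in simulation. The sketch below is a software model under stated assumptions, not the FPGA measurement: each trace is a single leakage sample W = εH + η with Gaussian noise, operands are 16 bits long, the subtraction is taken modulo 2^k, and all names are hypothetical.

```python
import random


def hamming_weight(v):
    return bin(v).count("1")


def simulate_traces(x, k, n, eps=1.0, sigma=1.0, seed=0):
    """Return n pairs (X2, W) with W = eps * HW((x - X2) mod 2^k) + noise."""
    rng = random.Random(seed)
    traces = []
    for _ in range(n):
        x2 = rng.getrandbits(k)
        s = (x - x2) % (1 << k)
        traces.append((x2, eps * hamming_weight(s) + rng.gauss(0.0, sigma)))
    return traces


def carry_into(x_low, x2, i):
    """Carry c_{i-1} into bit i of x + ~X2 + 1, given the i low bits of x."""
    carry = 1
    for j in range(i):
        xj = (x_low >> j) & 1
        nj = ((x2 >> j) & 1) ^ 1
        carry = (xj & nj) ^ (xj & carry) ^ (nj & carry)
    return carry


def recover_x(traces, k):
    """Recover x bit by bit from the LSB using the sign of the difference of means."""
    x = 0
    for i in range(k):
        groups = {0: [], 1: []}
        for x2, w in traces:
            # Selection bit: complemented bit i of X2 XOR the incoming carry.
            selector = (((x2 >> i) & 1) ^ 1) ^ carry_into(x, x2, i)
            groups[selector].append(w)
        delta = (sum(groups[1]) / len(groups[1])
                 - sum(groups[0]) / len(groups[0]))
        if delta < 0:            # negative peak: the guess x_i = 0 was wrong
            x |= 1 << i
    return x


if __name__ == "__main__":
    secret = 0xB3C5
    print(hex(recover_x(simulate_traces(secret, 16, 2000), 16)))
```

The already recovered low bits of x are needed to compute the incoming carry for the next bit, which is why the recovery proceeds from the LSB upwards.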

7.4.3 Mounting the DPA on FPGA Platform

We perform the actual DPA attack on our proposed pairing cryptoprocessor (PCP). The PCP is implemented on a customized FPGA board for power analysis. We put a one-ohm resistor between the VCCint pin of the FPGA chip and the on-board voltage regulator. We measure the current drawn through that resistor during pairing computation with a Tektronix current probe (serial number B014316). We use the probe with a TCPA300 power amplifier in standby mode. The measured power is displayed and stored on a Tektronix TDS5032B Digital Phosphor Oscilloscope. We develop software tools to automate the whole process for varying inputs. The power consumption is measured in mV and varies around ±5 mV. The power signal is sampled at 12.5 MS/s.

We choose an x with x0 = 0 and perform (x − X2) 2000 times with 2000 different randomly chosen X2. The respective power consumptions are stored in 2000 one-dimensional vectors. We then partition the power vectors into two sets, namely P1 and P0. A vector is placed in set P1 if the 1’s complement of the LSB of X2, XORed with c_{−1}, equals 1; i.e., if the LSB of X2 is 1. Otherwise, the vector is placed in set P0. To compute the differential power consumption we subtract the average (mean) of the P0 vectors from the average of the P1 vectors. We call this differential power consumption vector the difference-of-means, which is represented by ∆. Then we accumulate the samples of ∆ and plot it. The respective difference-of-means is depicted in Fig. 7.2(a), which shows a positive peak as expected for x0 = 0.

[Figure 7.2: two panels. (a) difference-of-means (×10^{−3}, range 0 to 8) versus samples (0 to 200); (b) difference-of-means (×10^{−3}, range −8 to 0) versus samples (0 to 200).]

Figure 7.2: The correlation between LSB and corresponding average power differences of

an addition in Fp. (a) for x0 = 0 and (b) for x0 = 1.

The same experiment has been repeated for another x with x0 = 1. The difference-of-means in this case is plotted in Fig. 7.2(b). In this case the expectation of < P1−P0 > is negative, and we obtained the expected result with 2000 random X2.

The above experimental results show that an attacker can mount the DPA attack on pairing computation over Fp. After finding the LSB, DPA can be performed for the second LSB, and so on. The same power traces can be reused for finding all secret bits. For each secret bit, the power vectors are again partitioned into two sets depending on the current value of the selection bit for position i, and the difference-of-means is recomputed. Thus, the above DPA attack iteratively finds all bits of the x-coordinate of the secret SID. After obtaining the x-coordinate, the value of the y-coordinate can be obtained by solving the underlying elliptic curve equation.
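The final step, obtaining y from a recovered x, can be sketched as a modular square root. The example assumes a curve of the form y^2 = x^3 + b over a prime p with p ≡ 3 (mod 4), which allows the square root to be computed with a single exponentiation; the toy parameters p = 10007 and b = 3 and the function name are illustrative assumptions only.

```python
def y_candidates(x, b, p):
    """Return the y values with y^2 = x^3 + b (mod p), for a prime p = 3 (mod 4)."""
    if p % 4 != 3:
        raise ValueError("this shortcut needs p = 3 (mod 4)")
    rhs = (pow(x, 3, p) + b) % p
    y = pow(rhs, (p + 1) // 4, p)        # candidate square root of rhs
    if y * y % p != rhs:
        return []                        # x is not the abscissa of a curve point
    return sorted({y, (p - y) % p})


if __name__ == "__main__":
    for x in range(1, 6):
        print(x, y_candidates(x, 3, 10007))
```

The curve equation leaves two candidates, y and p − y, so in this sketch the attacker still has to decide between them, for example by testing both.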



7.4.4 Proposed DPA-Resistant Pairing Computation

In the pairing computation, the secret point is only used for computing the line functions. The formula of the line function during the doubling step of the Miller algorithm over Fp is as follows:

\[ l_{T,T}(x,y) = Z_3 Z^2 y - 2Y^2 - 3X^2 (Z^2 x - X), \]

where T = (X,Y,Z) is the intermediate resultant point of the Miller algorithm and 2T = (X3,Y3,Z3) [90].

The formula of lT,T (x,y) uses the secret point SID = (x,y) of identity based encryption (IBE) [143], but it does not use the public point U = (X2,Y2). Therefore, this function could not be exploited by side-channel attacks that rely on manipulating the public point.

The second line function lT,P(x,y) is computed during the addition step of the Miller algorithm. In the IBE scheme P is replaced by U. The formula of lT,U(x,y) is:

\[ l_{T,U}(x,y) = (y - Y_2) Z_3 - (x - X_2)(Y_2 Z_1^3 - Y_1), \]

where T = (X1,Y1,Z1) is the intermediate result of the doubling step and (X3,Y3,Z3) represents the result of the addition T +U. This line computation formula uses both the public point U = (X2,Y2) and the private point SID = (x,y). The computation of lT,U(x,y) is the main weakness of pairing computation over Fp against side-channel attacks. The DPA attack described above can find the x- and y-coordinates of the private point SID by exploiting the above formula.

The main drawback of the above formula is that public and private parameters are directly combined in a single Fp operation. A side-channel attack can thus exploit the respective Fp operation to find the secret bits by manipulating the public parameter U. To protect this computation against side-channel attacks, it could be computed in the following way:

\[ l_{T,P}(x,y) = \bigl( X_2 (Y_2 Z_1^3 - Y_1) - Y_2 Z_3 \bigr) - (Y_2 Z_1^3 - Y_1)\, x + Z_3 \cdot y. \]

The above computation technique does not have any Fp primitive that is performed on one public parameter and one private parameter. The attacker may try to exploit the power consumption of the cryptoprocessor during the computation of lT,P(x,y). The private parameter x in the above formula is multiplied with an unknown parameter (Y2 Z1^3 − Y1). Therefore, no difference-of-means can be computed for identifying the secret bits of x.

The second secret parameter y is multiplied with Z3 in the modified computation of lT,P(x,y). The parameter Z3 is computed by the formula Z3 = Z1(X2 Z1^2 − X1), which ensures that Z3 is unknown due to the unknown temporary point T = (X1,Y1,Z1). Therefore, no difference-of-means value can be computed based on the specific bits of Z3 for identifying the secret bits of y. Thus, the proposed countermeasure protects both the x- and y-coordinates of the secret point SID, which ensures the security of pairing computation against the DPA attack.
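That the rearranged line function computes the same value as the original one can be checked numerically. The sketch below evaluates both forms over a toy prime p = 10007 with random inputs; the function names and the toy prime are assumptions for illustration, while the two formulas and the expression for Z3 are those given in this section.

```python
import random


def z3_of(X1, Z1, X2, p):
    """Z3 = Z1 * (X2 * Z1^2 - X1) mod p."""
    return Z1 * (X2 * Z1 * Z1 - X1) % p


def line_original(x, y, X1, Y1, Z1, X2, Y2, p):
    """(y - Y2) * Z3 - (x - X2) * (Y2 * Z1^3 - Y1): mixes secret and public operands."""
    Z3 = z3_of(X1, Z1, X2, p)
    return ((y - Y2) * Z3 - (x - X2) * (Y2 * pow(Z1, 3, p) - Y1)) % p


def line_rearranged(x, y, X1, Y1, Z1, X2, Y2, p):
    """Same value, but x and y are only multiplied by terms that depend on T."""
    Z3 = z3_of(X1, Z1, X2, p)
    slope = (Y2 * pow(Z1, 3, p) - Y1) % p
    constant = (X2 * slope - Y2 * Z3) % p     # involves no secret coordinate
    return (constant - slope * x + Z3 * y) % p


if __name__ == "__main__":
    rng = random.Random(7)
    p = 10007
    args = [rng.randrange(p) for _ in range(7)]
    print(line_original(*args, p) == line_rearranged(*args, p))
```

The check only confirms functional equivalence of the two expressions; it says nothing about the leakage of either form on real hardware.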

7.5 Conclusion

This chapter started with a description of the security issues of pairing algorithms in the presence of faults. It briefly described the existing countermeasures, which are based on the point blinding technique, and proposed a new countermeasure against fault attacks. The proposed countermeasure is secure and requires less computation and memory overhead than the existing countermeasures. The chapter further showed a weakness of Miller’s algorithm in Edwards coordinates in the presence of faults, and proposed a suitable countermeasure against this weakness.

The chapter also described the security issues of pairing computations over Fp against power analysis attacks. It proposed a differential power analysis attack against pairing computations. A suitable countermeasure has also been proposed to protect the private key of the identity based encryption scheme.


Chapter 8

Conclusions and Future Directions

This chapter concludes the thesis by underlining the main contributions. It also discusses the possible directions of future work.

8.1 Conclusions

Since the onset of the elliptic curve and pairing eras at the end of the 1990s, there has been remarkable growth in their efficient and secure implementation. Progress in efficient implementation has made it feasible to use increasingly efficient algorithms and design techniques to compute the underlying finite field operations. Along with this, increasingly robust implementations against powerful physical attacks, such as side-channel and fault attacks, have been developed. This thesis aims at offering some useful techniques for reducing time and area, as well as for providing security against side-channel and fault attacks, in hardware for pairing based cryptography on FPGA platforms. The contributions of the work are summarized as follows:

• In chapter 4, the design objective has been to reduce the area and latency of the ECSM operation while protecting it against timing and power attacks, by developing a secure GF(p) elliptic curve cryptoprocessor. A shared programmable unit has been proposed that can perform GF(p) addition, subtraction, multiplication, inversion, and division. Thereafter, an elliptic curve cryptoprocessor has been proposed based on two programmable functional cores. We have proposed a new point blinding technique which can protect the secret in the ECSM operation against DPA, including the doubling attack. Through actual differential power analysis on an FPGA platform we have shown that the proposed ECSM cryptoprocessor is secure against the DPA attack.

• Chapter 5 has focussed on the utilization of in-built FPGA features for developing optimized prime field arithmetic units. We have proposed a hierarchical adder structure based on the in-built carry chains of an FPGA device, which drastically reduces the routing delay as well as the overall addition delay of a large operand adder circuit. The chapter has also proposed a parallelism technique by which the critical path delay of the interleaved multiplier has been reduced by 50% compared to the existing design. Finally, it has demonstrated that the proposed high-speed addition and multiplication techniques improve the performance of the ECSM cryptoprocessor by 30% compared to its previous version described in chapter 4.

• We have focused on the implementation of pairings over BN curves in chapter 6. This chapter has introduced a parallel cryptoprocessor on an FPGA platform for computing asymmetric pairings over BN curves. The generic design provides the flexibility to choose curve parameters. Extensive parallelism techniques have been exploited to speed up pairing computations. The overall number of clock cycles required to compute pairings over BN curves has been reduced drastically. The proposed FPGA design provides a speed comparable with the existing ASIC designs.

• In chapter 7, the major objective was the security analysis of pairing computations against physical attacks. We have demonstrated a practical fault injection technique on pairing hardware on an FPGA platform. Subsequently, this chapter has proposed a new countermeasure which overcomes the drawback of the existing countermeasures. The chapter has also shown a vulnerability of Miller’s algorithm for pairing computation over BN curves and over Edwards coordinates, and has proposed a suitable countermeasure against the attack. Finally, the chapter has demonstrated the vulnerability of pairing computations over BN curves against the DPA attack, followed by a proposal for defending against such attacks.


8.2 Future Directions


The work proposed in each chapter of this thesis can be extended for further research.

In this section, some of the directions in which the problems can be further pursued have

been described.

• In chapter 4, the PGAU has been designed based on a bit serial interleaved multiplication algorithm. In future the underlying bit serial multiplication algorithm can be replaced by a digit serial multiplication algorithm and the performance can be compared with the current design. A common programmable unit for arithmetic operations over binary, char-3, and prime fields could also be targeted.

• Chapter 5 is concerned only with the performance gained through the utilization of the in-built fast carry chains (FCC) of an FPGA device. However, modern FPGA devices provide several other in-built components, such as DSP blocks, multiplier blocks, and PowerPC cores. Future work could aim at the development of highly optimized architectures for finite field operations by utilizing these in-built components of an FPGA device.

• In chapter 6, the pairing cryptoprocessor has been proposed for computing pairings over BN curves. We have designed the cryptoprocessor for a general prime p. However, BN curves are defined over specific primes, and their other global parameters also have specific forms. Therefore, in future these properties could be exploited for developing an optimized FPGA architecture for asymmetric pairings over BN curves.

• The fault attack described in chapter 7 has considered some specific pairing com-

putations. However, there are a number of pairing-friendly elliptic curves on which

pairing computation formulæ are different. A future direction of the work could be

studying the designed attack techniques on other pairing algorithms.

• This thesis has focussed only on pairing based cryptographic hardware over prime fields. Moving slightly beyond the scope of this thesis, 128-bit-security pairing computation hardware over other finite fields may be considered. Furthermore, work towards a multi-field programmable cryptoprocessor may be taken up, such as the design of a pairing cryptoprocessor for binary, char-3, and prime fields.

• The present design methodology also does not incorporate power as a design metric.

Hence, an important future extension would be implementing low power techniques

into the design.


Bibliography

[1] J. Fan, F. Vercauteren, and I. Verbauwhede. Efficient Hardware Implementation of Fp-arithmetic for Pairing-Friendly Curves. IEEE Transactions on Computers, 2011. To appear.

[2] D. Freeman, M. Scott, and E. Teske. A taxonomy of pairing-friendly elliptic curves.

Journal of Cryptology, Vol. 23, No. 2, pp. 224–280, 2010.

[3] F. Vercauteren. Optimal pairings. IEEE Transactions on Information Theory, Vol. 56,

No. 1, pp. 455–461, 2010.

[4] N. Guillermin. A high speed coprocessor for elliptic curve scalar multiplications over

Fp. CHES 2010, LNCS 6225, pp. 48–64, 2010.

[5] D.F. Aranha, J. Lopez, and D. Hankerson. High-speed parallel software implementa-

tion of the ηT pairing. CT-RSA 2010, LNCS 5985, pp. 89–105. Springer, 2010.

[6] N. Estibals. Compact hardware for computing the Tate pairing over 128-bit-security

supersingular curves. Pairing 2010, LNCS 6487, pp. 397–416, 2010.

[7] BlueKrypt, Cryptographic key length recommendation. http://www.keylength.com/en/4/. 2010.

[8] M. Naehrig, R. Niederhagen, and P. Schwabe. New software speed records for crypto-

graphic pairings. Cryptology ePrint Archive, Report 2010/186. http://eprint.iacr.org/.

[9] R. Granger and M. Scott. Faster squaring in the cyclotomic subgroup of sixth degree

extensions. PKC 2010, LNCS 6056, pp. 209–223, 2010.


[10] J.L. Beuchat, J.E.G. Díaz, S. Mitsunari, E. Okamoto, F.R. Henríquez, and T. Teruya. High-speed software implementation of the optimal ate pairing over Barreto-Naehrig curves. Pairing 2010, LNCS 6487, pp. 21–39, 2010.

[11] M. Izumi, J. Ikegami, K. Sakiyama, and K. Ohta. Improved countermeasure against

address-bit DPA for ECC scalar multiplication. DATE 2010, pp. 981–984, 2010.

[12] J. Fan, X. Guo, E.D. Mulder, P. Schaumont, B. Preneel, and I. Verbauwhede. State-

of-the-art of secure ECC implementations: a survey on known side-channel attacks

and countermeasures. HOST 2010, pp. 76–87, 2010.

[13] D. Mukhopadhyay. An improved fault based attack of the advanced encryption stan-

dard. Africacrypt 2009, LNCS 5580, pp. 421-434, 2009.

[14] K. Ananyi, H. Alrimeih, and D. Rakhmatov. Flexible hardware processor for elliptic

curve cryptography over NIST prime fields. IEEE Trans. VLSI Systems, Vol. 17, No.

8, pp. 1099–1112, 2009.

[15] D.M. Schinianakis, A.P. Fournaris, H.E. Michail, A.P. Kakarountas, and T. Stouraitis.

An RNS implementation of an Fp elliptic curve point multiplier. IEEE Transactions

on Circuits and Systems-I, Vol. 56, No. 6, pp. 1202–1213, 2009.

[16] J.Y. Lai and C.T. Huang. A highly efficient cipher processor for dual-field elliptic

curve cryptography. IEEE Transactions on Circuits and Systems-II, Vol. 56, No. 5,

pp. 394–398, 2009.

[17] D. Kammler, D. Zhang, P. Schwabe, H. Scharwaechter, M. Langenberg, D. Auras, G.

Ascheid, and R. Mathar. Designing an ASIP for cryptographic pairings over Barreto-

Naehrig curves. CHES 2009, LNCS 5747, pp. 254–271, 2009.

[18] N. Benger and M. Scott. Constructing tower extensions for the implementation

of pairing-based cryptography. Cryptology ePrint Archive, Report 2009/556, 2009.

http://eprint.iacr.org/.

[19] J. Fan, F. Vercauteren, and I. Verbauwhede. Faster Fp-arithmetic for cryptographic

pairings on Barreto-Naehrig curves. CHES 2009, LNCS 5747, pp. 240-253, 2009.


[20] M. Scott, N. Benger, M. Charlemagne, L.J. Dominguez Perez, and E.J. Kachisa. On

the final exponentiation for calculating pairings on ordinary elliptic curves. Pairing

2009, LNCS 5671, pp. 78-88, 2009.

[21] J. Beuchat, J. Detrey, N. Estibals, E. Okamoto, and F.R. Henríquez. Hardware accelerator for the Tate pairing in characteristic three based on Karatsuba-Ofman multipliers. Cryptology ePrint Archive, Report 2009/122. http://eprint.iacr.org/.

[22] Xilinx ISE design suite, 2009. http://www.xilinx.com/tools/designtools.htm.

[23] S. Ghosh, M. Alam, D. Roychowdhury, and I. Sengupta. Parallel crypto-devices for

GF(p) elliptic curve multiplication resistant against side-channel attacks. Computers

and Electrical Engineering, Elsevier, Vol. 35, pp. 329–338, 2009.

[24] J.L. Beuchat, E.L. Trejo, L. M. Ramos, S. Mitsunari, and F.R. Henríquez. Multi-core implementation of the Tate pairing over supersingular elliptic curves. CANS 2009, LNCS 5888, pp. 413–432, 2009.

[25] A.M. AbdelFattah, A.M.B. El-Din, and H.M.A. Fahmy. An efficient architecture

for interleaved modular multiplication. World Academy of Science, Engineering and

Technology, Vol. 56, 2009.

[26] J. Jiang, J. Chen, J. Wang, D.S. Wong, and X. Deng. High performance architec-

ture for elliptic curve scalar multiplication over GF(2m). Cryptology ePrint Archive,

Report 2008/066. http://eprint.iacr.org/.

[27] M. Naehrig, P.S.L.M. Barreto, and P. Schwabe. On compressible pairings and their

computation. Africacrypt 2008, LNCS 5023, pp. 371-388, 2008.

[28] E. Lee, H.S. Lee, and C. M. Park. Efficient and generalized pairing computation on

abelian varieties. Cryptology ePrint Archive, Report 2008/040. http://eprint.iacr.org/.

[29] F. Hess. Pairing lattices. Pairing 2008, LNCS 5209, pp. 18–38, 2008.

[30] W.N. Chelton and M. Benaissa. Fast elliptic curve cryptography on FPGA. IEEE

Trans. VLSI Systems, Vol. 16, No. 2, pp. 198–205, 2008.


[31] J.Y. Lai and C.T. Huang. Elixir: High-throughput cost-effective dual-field processors

and the design framework for elliptic curve cryptography. IEEE Trans. VLSI Systems,

Vol. 16, No. 11, pp. 1567–1580, 2008.

[32] T.C. Chen, S.W. Wei, and H.J. Tsai. Arithmetic unit for finite field GF(2m). IEEE

Transactions On Circuits And Systems–I, Vol. 55, No. 3, 2008.

[33] R. Laue and S. A. Huss. Parallel memory architecture for elliptic curve cryptography

over GF(p) aimed at efficient FPGA implementation. J. Signal Process. Syst., Vol. 51,

pp. 39-55, 2008.

[34] C. Maxfield. FPGA architectures. http://www.pldesignline.com/192200165; jses-

sionid=2EZPXOVIXZFT GQSNDLQCKIKCJUNN2JVN?pgno=4, December 2008.

[35] H. Kaeslin. Digital integrated circuit design – from VLSI architectures to CMOS

fabrication. Cambridge University Press, 2008.

[36] K. Chapman. Expanding dedicated multipliers. White paper: Xilinx FPGAs. http://www.xilinx.com/support/documentation/white_papers/wp277.pdf, December 2008.

[37] J. Hoffstein, J. Pipher, and J.H. Silverman. An introduction to mathematical cryptography. Springer, 2008.

[38] A. Barenghi, G. Bertoni, L. Breveglieri, and G. Pelosi. A FPGA coprocessor for the

cryptographic Tate pairing over Fp. ITNG 2008, pp. 112-119, 2008.

[39] P. Grabher, J. Großschädl, and D. Page. On software parallel implementation of cryptographic pairings. SAC 2008. LNCS 5381, pp. 35-50, 2008.

[40] T.H. Kim, T. Takagi, D.G. Han, H. Kim, and J. Lim. Power analysis attacks and

countermeasures on ηT pairing over binary fields. ETRI Journal, Vol. 30, No. 1, pp.

68–80, 2008.

[41] C. Rebeiro and D. Mukhopadhyay. High speed compact elliptic curve cryptoprocessor

for FPGA platforms. Indocrypt 2008, LNCS 5365, pp. 376–388, 2008.


[42] D. Hankerson, A. Menezes, and M. Scott. Software implementation of pairings. In:

Joye, M., Neven, G. (eds.) Identity-Based Cryptography, 2008.

[43] J. Fan and I. Verbauwhede. Extended abstract : unified digit-serial multiplier/inverter

in finite field GF(2m). HOST 2008, 2008.

[44] K. Kawakami, K. Shigemoto, and K. Nakano. Redundant radix-2 number system for

accelerating arithmetic operations on the FPGAs. PDCAT 2008, pp. 370–377, 2008.

[45] S. Ghosh, M. Alam, D. Roychowdhury, and I. Sengupta. A GF(p) elliptic curve group

operator resistant against side channel attacks. GLSVLSI 2008, pp. 53–58, 2008.

[46] Z. Zhao. ID-based weak blind signature from bilinear pairings. International Journal

of Network Security, Vol.7, No.2, pp. 265-268, 2008.

[47] S. Ionica and A. Joux. Another approach to pairing computation in Edwards coordi-

nates. Indocrypt 2008, LNCS 5365, pp. 400-413, 2008.

[48] J. Takahashi and T. Fukunaga. Improved differential fault analysis on CLEFIA. FDTC

2008, pp. 25–34, 2008.

[49] M.P.L. Das and P. Sarkar. Pairing computation on twisted Edwards form elliptic

curves. Pairing 2008, LNCS 5209, pp. 192–210, 2008.

[50] D.J. Bernstein, P. Birkner, M. Joye, T. Lange, and C. Peters. Twisted Edwards curves.

Africacrypt 2008, LNCS 5023, pp. 389-405, 2008.

[51] J. Katz and Y. Lindell. Introduction to modern cryptography. Chapman & Hall/CRC,

2007.

[52] G.M.D. Dormale and J.J. Quisquater. High-speed hardware implementations of el-

liptic curve cryptography: a survey. J. Syst. Architect., Vol. 53, No. 23, pp. 72-84,

2007.

[53] S. Ghosh, M. Alam, I. Sengupta, and D. Roychowdhury. A robust GF(p) parallel

arithmetic unit for public key cryptography. DSD 2007, pp. 109–117, 2007.


[54] P.S.L.M. Barreto, S.D. Galbraith, C. OhEigeartaigh, and M. Scott. Efficient pairing

computation on supersingular Abelian varieties. Designs, Codes and Cryptography,

Vol. 42, pp. 239–271, 2007.

[55] G. Chen, G. Bai, and H. Chen. A high-performance elliptic curve cryptographic pro-

cessor for general curves over GF(p) based on a systolic arithmetic unit. IEEE Trans-

actions on Circuits and Systems-II, Vol. 54, No. 5, pp. 412–416, 2007.

[56] N. Mentens, K. Sakiyama, B. Preneel, and I. Verbauwhede. Efficient pipelining for

modular multiplication architectures in prime fields. GLSVLSI 2007, pp. 534-539,

2007.

[57] S. Mangard, E. Oswald, and T. Popp. Power analysis attacks. Springer, 2007.

[58] J. Fan, K. Sakiyama, and I. Verbauwhede. Elliptic curve cryptography on embedded

multicore systems. WESS 2007, pp. 17–22, 2007.

[59] J.C. Ha, J. Park, S. Moon, and S.M. Yen. Provably secure countermeasure resistant to

several types of power attack for ECC. WISA 2007, LNCS 4867, pp. 333-344, 2007.

[60] K. Sakiyama, L. Batina, B. Preneel, and I. Verbauwhede. Multicore curve-based cryp-

toprocessor with reconfigurable modular arithmetic logic units over GF(2n). IEEE

Transaction on Computers, Vol. 56, No. 9, pp. 1269-1282, 2007.

[61] A.J. Devegili, M. Scott, and R. Dahab. Implementing cryptographic pairings over

Barreto-Naehrig curves. Pairing 2007. LNCS 4575, pp. 197-207, 2007.

[62] E. Barker, W. Barker, W. Burr, W. Polk, and M. Smid. Recommendation for key management - part 1 : general (revised). NIST special publication 800-57, 2007.

[63] J. Takahashi, T. Fukunaga, and K. Yamakoshi. DFA mechanism on the AES schedule.

FDTC 2007, pp. 62-72, 2007.

[64] D.J. Bernstein and T. Lange. Faster addition and doubling on elliptic curves. Asiacrypt

2007, LNCS 4833, pp. 29-50, 2007.

[65] H.M. Edwards. A normal form for elliptic curves. Bull. AMS 44, pp. 393-422, 2007.


[66] F.R. Henríquez, N.A. Saqib, A.D. Pérez, and C.K. Koç. Cryptographic algorithms on reconfigurable hardware. Springer, 2006.

[67] D. Freeman. Constructing pairing-friendly elliptic curves with embedding degree 10.

ANTS 2006, LNCS 4076, pp. 452-465. 2006.

[68] P. K. Mishra. Pipelined computation of scalar multiplication in elliptic curve cryp-

tosystems. IEEE Transaction on Computers, Vol. 55, No. 8, pp. 1000-1010, 2006.

[69] K. Sakiyama, L. Batina, B. Preneel, and I. Verbauwhede. Superscalar coprocessor for

high-speed curve-based cryptography. CHES 2006, LNCS 4249, pp. 415-429, 2006.

[70] K. Sakiyama, E.D. Mulder, B. Preneel, and I. Verbauwhede. A parallel processing

hardware architecture for elliptic curve cryptosystems. ICASP 2006, pp. 904–907,

2006.

[71] B. Ansari and M.A. Hasan. High performance architecture for elliptic curve scalar

multiplication. The University of Waterloo, Tech. Rep. CACR 2006-01, 2006.

[72] M. Benaissa and W. M. Lim. Design of flexible GF(2m) elliptic curve cryptography

processors. IEEE Transaction Very Large Scale Integr. (VLSI) Syst., Vol. 14, No. 6,

pp. 659-662, 2006.

[73] W. Chelton and M. Benaissa. High-speed pipelined ECC processor over GF(2m). SIPS

2006, 2006.

[74] C.J. McIvor, M. McLoone, and J.V. McCanny. Hardware elliptic curve cryptographic

processor over GF(p). IEEE Transactions on Circuits and Systems-I, Vol. 53, No. 9,

pp. 1946–1957, 2006.

[75] C. Shu, S. Kwon, and K. Gaj. FPGA accelerated Tate pairing based cryptosystems

over binary fields. FPT 2006, pp. 173-180, 2006.

[76] P.S.L.M. Barreto and M. Naehrig. Pairing-friendly elliptic curves of prime order. SAC

2005. LNCS 3897, pp. 319-331, 2006.

[77] F. Hess, N.P. Smart, and F. Vercauteren. The eta pairing revisited. IEEE Transactions on Information Theory, Vol. 52, No. 10, pp. 4595-4602, 2006.

[78] A. Devegili, C. OhEigeartaigh, M. Scott, and R. Dahab. Multiplication and

squaring on pairing-friendly fields. Cryptology ePrint Archive, Report 2006/471.


[79] K. Sakiyama, B. Preneel, and I. Verbauwhede. A fast dual-field modular arithmetic

logic unit and its hardware implementation. ISCAS 2006, pp. 787–790, 2006.

[80] O.A. Khaleel, C. Papachristou, F. Wolff, and K. Pekmestzi. FPGA-based design of a

large moduli multiplier for public-key cryptographic systems. ICCD 2006, pp. 314–

319, 2006.

[81] S.M. Yen, L.C. Ko, S.J. Moon, and J.C. Ha. Relative doubling attack against Mont-

gomery ladder. ICISC 2005, LNCS 3935, pp. 117-128, 2006.

[82] D. Page and F. Vercauteren. A fault attack on pairing-based cryptography. IEEE

Transactions on Computers, Vol. 55, No. 9, pp. 1075–1080, 2006.

[83] H. Mamiya, A. Miyaji, and H. Morimoto. Secure elliptic curve exponentiation against

RPA, ZRA, DPA and SPA. IEICE Transactions on Fundamentals, Vol. E89-A, No. 8,

2006.

[84] C. Whelan and M. Scott. Side channel analysis of practical pairing implementations:

which path is more secure? Vietcrypt 2006, LNCS 4341, pp. 99–114, 2006.

[85] T. Akishita and T. Takagi. Zero-value register attack on elliptic curve cryptosystem.

IEICE Transactions on Fundamentals, Vol. E88-A, No. 1, 2005.

[86] D.J. Bernstein. Cache-timing attacks on AES. Technical report, 2005. Available at:

http://cr.yp.to/antiforgery/cachetiming-20050414.pdf

[87] W. Shusua and Z. Yuefei. A timing and area tradeoff GF(p) elliptic curve processor

architecture for FPGA. ICCCAS 2005, pp. 1308–1312, 2005.

[88] H. Eberle, S. Shantz, V. Gupta, N. Gura, L. Rarick, and L. Spracklen. Accelerating

next-generation public-key cryptosystems on general-purpose CPUs. IEEE Micro, Vol.

25, No. 2, pp. 52-59, 2005.

[89] R.C.C. Cheung, N.J. Telle, W. Luk, and P.Y.K. Cheung. Customizable elliptic curve

cryptosystems. IEEE Transactions on Very Large Scale Integr. (VLSI) Syst., Vol. 13,

No. 9, pp. 1048-1059, 2005.

[90] S. Chatterjee, P. Sarkar, and R. Barua. Efficient computation of Tate pairing in pro-

jective coordinate over general characteristic fields. ICISC 2004, LNCS 3506, pp.

168-181, 2005.

[91] N. Koblitz and A. Menezes. Pairing-based cryptography at high security levels. Cryp-

tology ePrint Archive, Report 2005/076, 2005. http://eprint.iacr.org/.

[92] C.M. Park, M.H. Kim, and M. Yung. A Remark on Implementing the Weil Pairing.

CISC 2005, LNCS 3822, pp. 313–323, 2005.

[93] S. Galbraith. Pairings. In I.F. Blake, G. Seroussi, and N.P. Smart, editors, Advances

in elliptic curve cryptography, London Mathematical Society Lecture Note Series,

chapter IX. Cambridge University Press, 2005.

[94] D.N. Amanor, C. Paar, J. Pelzl, V. Bunimov, and M. Schimmler. Efficient hardware

architectures for modular multiplication on FPGAs. International Conference on Field

Programmable Logic and Applications, pp. 539-542, 2005.

[95] D. N. Amanor. Efficient hardware architectures for modular multiplication. 2005.

Available at: http://www.crypto.ruhr-uni-bochum.de/imperia/md/content/texte/theses/dnamanorthesis.pdf

[96] Q. Liu, D. Tong, and X. Cheng. Non-interleaving architecture for hardware imple-

mentation of modular multiplication. IEEE International Symposium on Circuits and

Systems, pp. 660–663, 2005.

[97] M.E. Kaihara and N. Takagi. A hardware algorithm for modular multiplication/division.

IEEE Transactions on Computers, Vol. 54, pp. 12–21, 2005.

[98] F. Crowe, A. Daly, and W. Marnane. A scalable dual mode arithmetic unit for public

key cryptosystems. ITCC 2005, pp. 568–573, 2005.

[99] D. Harris, R. Krishnamurthy, M. Anders, S. Mathew, and S. Hsu. An improved unified

scalable radix-2 Montgomery multiplier. ARITH 2005, pp. 172–178, 2005.

[100] M. Ciet and M. Joye. Elliptic curve cryptosystems in the presence of permanent and

transient faults. Designs, Codes and Cryptography, Vol. 36, pp. 33–43, 2005.

[101] F. Koeune and F.X. Standaert. A tutorial on physical security and side-channel at-

tacks. FOSAD 2004/2005, LNCS 3655, pp. 78-108, 2005.

[102] A. Menezes, E. Teske, and A. Weng. Weak Fields for ECC. CT-RSA 2004, LNCS

2964, pp. 366-386, 2004.

[103] N. Saquib, F. Rodriguez, and A. Diaz. A parallel architecture for fast computation

of elliptic curve scalar multiplication over GF(2^n). RAW 2004, pp. 26–27, 2004.

[104] M. Scott and P. Barreto. Compressed pairings. Crypto 2004, LNCS 3152, pp. 140-

156, 2004.

[105] D. Page and F. Vercauteren. Fault and side-channel attacks on pairing based cryp-

tography. Cryptology ePrint Archive, Report 2004/283. http://eprint.iacr.org/.

[106] A. Daly, W. Marnane, T. Kerins, and E. Popovici. An FPGA implementation of a

GF(p) ALU for encryption processors. Microprocessors and Microsystems, Vol. 28,

pp. 253–260, 2004.

[107] E. Ozturk, B. Sunar, and E. Savas. Low-power elliptic curve cryptography using

scaled modular arithmetic. CHES 2004, LNCS 3156, pp. 92–106, 2004.

[108] L.P. Lee and K.W. Wong. A random number generator based on elliptic curve opera-

tions. Computers and Mathematics with Applications, Vol. 47, pp. 217–226, Elsevier,

2004.

[109] V.S. Miller. The Weil pairing, and its efficient calculation. Journal of Cryptology,

Vol. 17, pp. 235-261, 2004.

[110] C. McIvor, M. McLoone, and J. McCanny. FPGA Montgomery multiplier architec-

tures - a comparison. Field-Programmable Custom Computing Machines, pp. 279-

282, 2004.

[111] S. Kwon. Efficient Tate pairing computation for supersingular elliptic curves over

binary fields. In Cryptology ePrint Archive, Report 2004/303. http://eprint.iacr.org/.

[112] R. Dutta, R. Barua, and P. Sarkar. Pairing-based cryptographic protocols: a survey.

Cryptology ePrint Archive, Report 2004/064. http://eprint.iacr.org/.

[113] H. Bar-El, H. Choukri, D. Naccache, M. Tunstall, and C. Whelan. The sorcerer’s

apprentice guide to fault attacks. In Cryptology ePrint Archive, Report 2004/100.


[114] B. C. Mames, M. Ciet and M. Joye. Low-cost solutions for preventing simple side-

channel analysis: side-channel atomicity. IEEE Transactions on Computers, Vol. 53,

No. 6, pp. 760–768, 2004.

[115] E. Brier, I. Dechene, and M. Joye. Unified addition formulæ for elliptic curve

cryptosystems. Embedded cryptographic hardware: methodology and architectures, 2004.

[116] L. Goubin. A refined power-analysis attack on elliptic curve cryptosystems. PKC

2003, LNCS 2567, pp. 199-211, 2003.

[117] D. Hankerson, A. Menezes, and S. Vanstone. Guide to elliptic curve cryptography.

Springer, US, 2003.

[118] P. Fouque and F. Valette. The doubling attack – why upwards is better than down-

wards. CHES 2003, LNCS 2779, pp. 269–280, 2003.

[119] A. Satoh and K. Takano. A scalable dual-field elliptic curve cryptographic processor.

IEEE Transactions on Computers, Vol. 52, No. 4, pp. 449–460, 2003.

[120] S. B. Ors, L. Batina, B. Preneel, and J. Vandewalle. Hardware implementation of an

elliptic curve processor over GF(p). ASAP 2003, pp. 433–443, 2003.

[121] V. Bunimov and M. Schimmler. Area and time efficient modular multiplication of

large integers. ASAP 2003, 2003.

[122] K. Itoh, T. Izu, and M. Takenaka. A practical countermeasure against address-bit

differential power analysis. CHES 2003, LNCS 2779, pp. 382–396, 2003.

[123] L.S. Au and N. Burgess. Unified radix-4 multiplier for GF(p) and GF(2^n). ASAP

2003, pp. 1–11, 2003.

[124] S.B. Ors, L. Batina, B. Preneel, and J. Vandewalle. Hardware implementation of a

Montgomery modular multiplier in a systolic array. IPDPS ’03, pp. 184–186, 2003.

[125] M. Joye and S.M. Yen. The Montgomery powering ladder. CHES ’02, LNCS 2523,

pp. 291-302, 2003.

[126] I. Duursma and H. Lee. Tate pairing implementation for hyperelliptic curves

y^2 = x^p − x + d. Asiacrypt 2003, LNCS 2894, pp. 111-123, 2003.

[127] O. Billet and M. Joye. The Jacobi model of elliptic curve and side-channel

analysis. Applied Algebra, Algebraic Algorithms and Error-Correcting Codes 2003, LNCS

2643, pp. 34–42, 2003.

[128] J. Solinas. ID-based digital signature algorithms. 2003,

http://www.cacr.math.uwaterloo.ca/conferences/2003/ecc2003/solinas.pdf.

[129] D.R. Stinson. Cryptography: theory and practice, second edition. Chapman &

Hall/CRC, 2002.

[130] J.J. Quisquater and D. Samyde. Eddy current for magnetic analysis with active sen-

sor. Esmart 2002, pp. 185–194, 2002.

[131] E. Trichina and A. Bellezza. Implementation of elliptic curve cryptography with

built in countermeasures against side-channel attacks. CHES 2002, LNCS 2523, pp.

99-113, 2002.

[132] E. Brier and M. Joye. Weierstraß elliptic curves and side-channel attacks. PKC 2002,

LNCS 2274, pp. 335–345, 2002.

[133] M. Stam and A.K. Lenstra. Efficient subgroup exponentiation in quadratic and sixth

degree extensions. CHES 2002, LNCS 2523, pp. 318-332, 2002.

[134] G. Gaubatz. Versatile Montgomery multiplier architectures. Master's thesis, 2002.

Available at: http://www.wpi.edu/Pubs/ETD/Available/etd-0430102-120529/unrestricted/gaubatz.pdf

[135] J. Guajardo, T. Wollinger, and C. Paar. Area efficient GF(p) architectures for GF(p^m)

multipliers, 2002. Available at: http://www.wollinger.org/papers/Guajardo etal gfp

architectures.pdf

[136] H. Wu. Montgomery multiplier and squarer for a class of finite fields. IEEE Trans-

actions on Computers, Vol. 51, No. 5, 2002.

[137] Virtex-II Pro™ platform FPGA handbook. Xilinx Inc., San Jose, CA, 2002.

[138] P.S.L.M. Barreto, H. Kim, B. Lynn, and M. Scott. Efficient algorithms for pairing-

based cryptosystems. Crypto 2002, LNCS 2442, pp. 354-368, 2002.

[139] S.P. Skorobogatov and R.J. Anderson. Optical fault induction attacks. CHES 2002,

LNCS 2523, pp. 2-12, 2002.

[140] T. Izu and T. Takagi. A fast parallel elliptic curve multiplication resistant against side

channel attacks. PKC 2002, LNCS 2274, pp. 280–296, 2002.

[141] W. Fischer, C. Giraud, E.W. Knudsen, and J.P. Seifert. Parallel scalar multiplication

on general elliptic curves over Fp hedged against non-differential side-channel at-

tacks. Cryptology ePrint Archive, Report 2002/007. http://eprint.iacr.org/.

[142] G. Orlando and C. Paar. A scalable GF(p) elliptic curve processor architecture for

programmable hardware. CHES 2001, LNCS 2162, pp. 348–363, 2001.

[143] D. Boneh and M.K. Franklin. Identity-based encryption from the Weil pairing.

Crypto 2001, LNCS 2139, pp. 213-229, 2001.

[144] D. Boneh, B. Lynn, and H. Shacham. Short signatures from the Weil pairing. Asi-

acrypt 2001, LNCS 2248, pp. 514-532, 2001.

[145] A. Miyaji, M. Nakabayashi, and S. Takano. New explicit conditions on elliptic curve

traces for FR-reduction. IEICE Trans. Fundamentals, Vol. E84-A, No. 5, pp. 1234–

1243, 2001.

[146] C. Clavier and M. Joye. Universal exponentiation algorithm. CHES 2001, LNCS

2162, pp. 300-308, 2001.

[147] D. May, H.L. Muller, and N. Smart. Random register renaming to foil DPA. CHES

2001, LNCS 2162, pp. 28-38, 2001.

[148] N.P. Smart. The Hessian form of an elliptic curve. CHES 2001, LNCS 2162, pp.

118-125, 2001.

[149] E. Oswald and M. Aigner. Randomized addition-subtraction chains as a countermea-

sure against power attacks. CHES 2001, LNCS 2162, pp. 39-50, 2001.

[150] M. Joye and C. Tymen. Protection against differential analysis for elliptic curve cryp-

tography. CHES 2001, LNCS 2162, pp. 377-390, 2001.

[151] M. Joye and J. Quisquater. Hessian elliptic curves and side channel attacks. CHES

2001, LNCS 2162, pp. 402-410, 2001.

[152] P. Liardet and N. Smart. Preventing SPA/DPA in ECC systems using the Jacobi form.

CHES 2001, LNCS 2162, pp. 391-401, 2001.

[153] B. Moller. Securing elliptic curve point multiplication against side-channel attacks.

ISC 2001, LNCS 2200, pp. 324–334, 2001.

[154] D. Boneh, R.A. DeMillo, and R.J. Lipton. On the importance of eliminating errors

in cryptographic computations. Journal of Cryptology, Vol. 14, No. 2, pp. 101-119,

2001. Extended abstract in Eurocrypt 1997.

[155] I. Blake, G. Seroussi, and N. Smart. Elliptic curves in cryptography. London Math-

ematical Society Lecture Note Series, Vol. 265, Cambridge University Press, 2000.

[156] K. Okeya, H. Kurumatani, and K. Sakurai. Elliptic curves with the Montgomery

form and their cryptographic applications. PKC 2000, LNCS 1751, pp. 238–257,

2000.

[157] K. Okeya and K. Sakurai. Power analysis breaks elliptic curve cryptosystems even

secure against the timing attack. Indocrypt 2000, LNCS 1977, pp. 217–314, 2000.

[158] E. Savas, A.F. Tenca, and C.K. Koc. A scalable and unified multiplier architecture

for finite fields GF(p) and GF(2^m). CHES 2000, LNCS 1965, pp. 281–296, 2000.

[159] T.S. Messerges. Using second-order power analysis to attack DPA resistant software.

CHES 2000, LNCS 1965, pp. 238–251, 2000.

[160] A. Joux. A one round protocol for tripartite Diffie-Hellman. ANTS 2000, LNCS

1838, pp. 385–394, 2000.

[161] I. Biehl, B. Meyer, and V. Muller. Differential fault analysis on elliptic curve

cryptosystems. Crypto 2000, LNCS 1880, pp. 131-146, 2000.

[162] J.S. Coron. Resistance against differential power analysis for elliptic curve cryp-

tosystems. CHES 1999, LNCS 1717, pp. 292-302, 1999.

[163] P. Kocher, J. Jaffe, and B. Jun. Differential power analysis. Crypto 1999, LNCS

1666, pp. 388–397, 1999.

[164] T.S. Messerges, E.A. Dabbish, and R.H. Sloan. Investigations of power analysis

attacks on smartcards. USENIX Workshop on Smartcard Technology, 1999.

[165] S. Chari, C.S. Jutla, J.R. Rao, and P. Rohatgi. Towards sound approaches to counter-

act power-analysis attacks. Crypto 1999, LNCS 1666, pp. 398–412, 1999.

[166] National Institute of Standards and Technology. Recommended elliptic curves for federal

government use. Available at: http://www.csrc.nist.gov/groups/ST/toolkit/documents/dss/NISTReCur.pdf, 1999.

[167] J. Lopez and R. Dahab. Fast multiplication on elliptic curves over GF(2^m) without

precomputation. CHES 1999, LNCS 1717, pp. 316-327, 1999.

[168] J.F. Dhem, F. Koeune, P.A. Leroux, P. Mestre, J.J. Quisquater, and J.L. Willems. A

practical implementation of the timing attack. CARDIS 1998, LNCS 1820, pp. 167–

182, 1998.

[169] S. Hauck, M.M. Hosler, and T.W. Fry. High-performance carry chains for FPGAs.

FPGA 1998, pp. 223–233, 1998.

[170] H. Cohen, A. Miyaji, and T. Ono. Efficient elliptic curve exponentiation using mixed

coordinates. Asiacrypt 1998, LNCS 1514, pp. 51–65, 1998.

[171] P.C. Kocher. Timing attacks on implementations of Diffie-Hellman, RSA, DSS and

other systems. Crypto 1996, LNCS 1109, pp. 104–113, 1996.

[172] N. Koblitz. CM curves with good cryptographic properties. Crypto 1991, LNCS 576,

pp. 279–287, 1991.

[173] P.L. Montgomery. Speeding the Pollard and elliptic curve methods of factorization.

Math. Comput., Vol. 48, pp. 243–264, 1987.

[174] N. Koblitz. Elliptic curve cryptosystems. Mathematics of Computation, Vol. 48, No.

177, pp. 203–209, 1987.

[175] V.S. Miller. Short programs for functions on curves. Unpublished manuscript, 1986.

[176] P.L. Montgomery. Modular multiplication without trial division. Math. Computa-

tion, Vol. 44, pp. 519–521, 1985.

[177] D. Chaum. Security without identification: Transaction systems to make Big Brother

obsolete. Comm. ACM, Vol. 28, pp. 1030–1044, 1985.

[178] V.S. Miller. Use of elliptic curves in cryptography. Crypto 1985, LNCS 218, pp.

417–426, 1985.

[179] K.R. Sloan. Comments on a computer algorithm for calculating the product A*B

modulo M. IEEE Transactions on Computers, Vol. C-34, No. 3, pp. 290–292, 1985.

[180] G.R. Blakley. A computer algorithm for calculating the product A*B modulo M.

IEEE Transactions on Computers, Vol. C-32, No. 5, pp. 497–500, 1983.

[181] R.P. Brent and H.T. Kung. A regular layout for parallel adders. IEEE Transactions

on Computers, Vol. C-31, No. 3, pp. 260–264, 1982.

[182] R.L. Rivest, A. Shamir, and L.M. Adleman. A method for obtaining digital signa-

tures and public-key cryptosystems. Commun. ACM, Vol. 21, No. 2, pp. 120–126,

1978.

[183] J. Pollard. Monte Carlo methods for index computation mod p. Math. Comp., Vol.

32, pp. 918-924, 1978.

Disseminations

Publications Directly Related to the Thesis

Refereed Journal

1. Santosh Ghosh, Debdeep Mukhopadhyay, and Dipanwita Roy Chowdhury. Petrel:

power and timing attack resistant elliptic curve scalar multiplier based on

programmable GF(p) arithmetic unit. IEEE Transactions on Circuits and Systems-I

(TCAS–I), Vol. 58, No. 9, September 2011.

2. Santosh Ghosh, Debdeep Mukhopadhyay, and Dipanwita Roy Chowdhury. Fault

attack and countermeasures on pairing based cryptography. International Journal of

Network Security, Vol.12, No.1, pp. 26-33, January 2011.

Refereed Conference

1. Santosh Ghosh, Debdeep Mukhopadhyay, and Dipanwita Roy Chowdhury. High

speed flexible pairing cryptoprocessor on FPGA platform. Fourth International Con-

ference on Pairing-based Cryptography − Pairing 2010, LNCS 6487, pp. 450–466,

Japan, December 2010.

2. Santosh Ghosh, Debdeep Mukhopadhyay, and Dipanwita Roy Chowdhury. High

speed Fp multipliers and adders on FPGA platform. Design and Architectures for

Signal and Image Processing − DASIP 2010, Edinburgh, Scotland, October 2010.

3. Santosh Ghosh and Dipanwita Roy Chowdhury. Configurable multicore processing

unit for elliptic curve cryptography. 12th VLSI Design and Test Symposium − VDAT 2008, Bangalore, India, July 2008.

Cryptology ePrint Archive

1. Santosh Ghosh, Debdeep Mukhopadhyay, and Dipanwita Roy Chowdhury. Secu-

rity of pairing cryptoprocessor against differential power attacks. Cryptology ePrint

Archive, Report 2011/181, http://eprint.iacr.org/.

Other Publications of the Author

Refereed Journal

1. Santosh Ghosh, Monjur Alam, Dipanwita Roy Chowdhury, and Indranil Sen Gupta.

Parallel crypto-devices for GF(p) elliptic curve multiplication resistant against side

channel attacks. Computers and Electrical Engineering, Elsevier, Vol. 35, pp. 329–

338, 2009.

2. Monjur Alam, Santosh Ghosh, M jagon Mohan, Debdeep Mukhopadhyay, Dipan-

wita Roy Chowdhury, and Indranil Sen Gupta. Effect of glitches against masked

AES S-box implementation and countermeasure. Information Security (IFS), IET,

Vol. 3, No. 1, pp. 34–44, 2009.

Refereed Conference

1. Santosh Ghosh, Dipanwita Roy Chowdhury, and Abhijit Das. High Speed Crypto-

processor for ηT Pairing on 128-bit Secure Supersingular Elliptic Curves over Char-

acteristic Two Fields. CHES 2011 (to appear).

2. Santosh Ghosh. Design and analysis of pairing based cryptographic hardware for

prime fields. PhD forum ISVLSI 2011, IEEEXplore, pp. 1-2, Chennai, India, July

2011.

3. Santosh Ghosh. Design and analysis of pairing hardware on FPGA platform. PhD

forum DAC 2011, San Diego, US, June 2011.

4. Santosh Ghosh and Dipanwita Roy Chowdhury. Elliptic curve based multi-signature

scheme for multi-server systems. IEEE Tencon 2008, Hyderabad, India, November

2008.

5. Santosh Ghosh, Monjur Alam, Dipanwita Roy Chowdhury, and Indranil Sen Gupta.

A GF(p) Elliptic curve group operator resistant against side channel attacks. ACM

Great Lakes Symposium on VLSI − GLSVLSI 2008, Orlando, Florida, US, ACM,

pp. 53–58, May 2008.

6. Monjur Alam, Santosh Ghosh, Dipanwita Roy Chowdhury, and Indranil Sen Gupta.

Single chip encryptor/decryptor core implementation of AES algorithm. 21st In-

ternational Conference on VLSI Design − VLSID 2008, Hyderabad, India, IEEE

Computer Society, pp. 693–698, January 2008.

7. Santosh Ghosh, Monjur Alam, Kundan Kumar, Debdeep Mukhopadhyay, and Dipanwita

Roy Chowdhury. Preventing the side-channel leakage of masked AES S-Box.

15th International Conference on Advanced Computing & Communication − ADCOM 2007, IIT Guwahati, India, IEEE Computer Society, pp. 15–20, December

2007.

8. Avishek Saha and Santosh Ghosh. A speed-area optimization of full search block

matching hardware with applications in high-definition TVs (HDTV). High Perfor-

mance Computing − HiPC 2007, LNCS 4873, pp. 83–94, Goa, India, December

2007.

9. Santosh Ghosh, Monjur Alam, Dipanwita Roy Chowdhury, and Indranil Sen Gupta.

Effect of side channel attacks on RSA embedded devices. IEEE Tencon 2007, Taipei,

Taiwan, IEEE, pp. 1–4, November 2007.

10. Santosh Ghosh and Avishek Saha. Speed-area optimized FPGA implementation for

full search block matching. International Conference on Computer Design − ICCD

2007, IEEE Comp. Society, pp. 13–18, California, US, October 2007.

11. Santosh Ghosh, Monjur Alam, Indranil Sen Gupta, and Dipanwita Roy Chowdhury.

A robust GF(p) parallel arithmetic unit for public key cryptography. 10th EUROMICRO

Conference on Digital System Design - Architectures, Methods and Tools − DSD 2007, Lubeck, Germany, IEEE Computer Society, pp. 109–117, August 2007.

12. Monjur Alam, Santosh Ghosh, Debdeep Mukhopadhyay, Dipanwita Roy Chowdhury,

and Indranil Sen Gupta. Latency optimized AES-Rijndael with flexible mode

of operation. 11th VLSI Design and Test Symposium − VDAT 2007, pp. 413–420,

Kolkata, India, August 2007.

13. Avishek Saha, Santosh Ghosh, Shamik Sural, and Jayanta Mukherjee. Toward

memory-efficient design of video encoders for multimedia applications. ISVLSI

2007, Porto Alegre, Brazil, IEEE Computer Society, pp. 453–454, May 2007.

14. Monjur Alam, Sonai Ray, Debdeep Mukhopadhyay, Santosh Ghosh, Dipanwita Roy

Chowdhury, and Indranil Sen Gupta. An area optimized reconfigurable encryptor for

AES-Rijndael. Design Automation and Test in Europe − DATE 2007, Nice, France,

IEEE Computer Society, pp. 1116–1121, April 2007.

National Conferences and Workshops

1. Santosh Ghosh, Dipanwita Roy Chowdhury, and Indranil Sen Gupta. The ECMQV

key agreement protocol on FPGA platform. National Workshop on Cryptology 2009,

Surat, India, August 2009.

2. Santosh Ghosh and Dipanwita Roy Chowdhury. A GF(2^163) elliptic curve

cryptographic processor unit. National Conference on Information Security-Issues and

Challenges (NCISIC 2008), Orissa, India, January 2008.

3. Santosh Ghosh, Dipanwita Roy Chowdhury, and Indranil Sen Gupta. Side channel

attacks on RSA and ECC crypto devices. National Workshop on Cryptology 2007,

Coimbatore, India, September 2007.

Curriculum Vitae

SANTOSH GHOSH, son of Mr. Sadhan Ghosh and Mrs. Ambika Ghosh, was born on
September 10, 1978 in Kshiragram, Burdwan, WB, India. After passing the Higher
Secondary (10+2) Examination from Burdwan Municipal High School, Burdwan, he joined
Haldia Institute of Technology (HIT), under Vidyasagar University, to pursue a
Bachelor of Technology (B.Tech.) in Computer Science and Engineering (CSE), and
subsequently joined the CSE department of HIT as a lecturer in 2002. In 2006 he began a
Master of Science (M.S.) in the Department of Computer Science and Engineering
at the Indian Institute of Technology Kharagpur. After completing the
Master's course in 2008, he joined the same department as a PhD student in the same year.
Since then, his research has been broadly in the field of Cryptographic
Hardware and Side-channel Attacks, and the work presented in this thesis is the
outcome of that effort. His other research interests include VLSI Design and Testing.

Contact e-mail:

santosh[dot]ghosh[at]gmail[dot]com

santosh[at]cse[dot]iitkgp[dot]ernet[dot]in
