DESIGN AND ANALYSIS OF PAIRING BASED CRYPTOGRAPHIC HARDWARE FOR PRIME FIELDS
Santosh Ghosh
DESIGN AND ANALYSIS OF PAIRING BASED
CRYPTOGRAPHIC HARDWARE FOR PRIME FIELDS
Thesis submitted to the
Indian Institute of Technology Kharagpur
For award of the degree
of
Doctor of Philosophy
by
Santosh Ghosh
Under the guidance of
Professor Debdeep Mukhopadhyay and
Professor Dipanwita Roy Chowdhury
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR
JUNE 2011
© 2011 Santosh Ghosh. All Rights Reserved
APPROVAL OF THE VIVA-VOCE BOARD
June 23, 2011
Certified that the thesis entitled DESIGN AND ANALYSIS OF PAIRING BASED CRYPTOGRAPHIC HARDWARE FOR PRIME FIELDS submitted by SANTOSH GHOSH to the Indian Institute of Technology, Kharagpur, for the award of the degree Doctor of Philosophy has been accepted by the external examiners and that the student has successfully defended the thesis in the viva-voce examination held today.
Member of DSC: Dr. Arobinda Gupta, Professor, CSE Department, IIT Kharagpur, India
Member of DSC: Dr. Shamik Sural, Associate Professor, School of Information Technology, IIT Kharagpur, India
Member of DSC: Dr. Abhijit Das, Assistant Professor, CSE Department, IIT Kharagpur, India
Supervisor: Dr. Debdeep Mukhopadhyay, Assistant Professor, CSE Department, IIT Kharagpur
Supervisor: Dr. Dipanwita Roy Chowdhury, Professor, CSE Department, IIT Kharagpur
External Examiner: Dr. Bimal K. Roy, Director, Indian Statistical Institute, Kolkata
Chairman: Dr. Jayanta Mukhopadhyay, Professor and Head, CSE Department, IIT Kharagpur
CERTIFICATE
This is to certify that the thesis entitled Design and Analysis of Pairing Based
Cryptographic Hardware for Prime Fields, submitted by Santosh Ghosh to Indian
Institute of Technology, Kharagpur, is a record of bona fide research work under
our joint supervision and we consider it worthy of consideration for the award
of the degree of Doctor of Philosophy of the Institute.
Supervisor: Dr. Debdeep Mukhopadhyay, Assistant Professor, CSE, IIT Kharagpur. Date:
Supervisor: Dr. Dipanwita Roy Chowdhury, Professor, CSE, IIT Kharagpur. Date:
to Sweta
Acknowledgements
THROUGH THIS LITTLE note and limited space, I try my best to express my sincere gratitude to some of those people without whose help this thesis simply would not have come about. Foremost, I would like to thank the mysterious and divine Nature that has coursed my life across so many people and so many opportunities, and has given me the strength and health that have contributed in shaping up this thesis.
With all my heart I thank my supervisors Professor Debdeep Mukhopadhyay and Professor Dipanwita Roy Chowdhury for introducing me to this wonderful world of Cryptography and more so for making amply available their immense support, advice and encouragement in a number of ways. Along with them I thank Professor Indranil Sen Gupta for his continuous support in my research, carried out since 2005 starting from my MS degree. I also thank the members of my Doctoral Scrutiny Committee, Professor Shamik Sural, Professor Avijit Das, and Professor Arabinda Gupta, for giving me timely directions.
The role of supportive and welcoming friends and colleagues in the life of a researcher is undeniable. To this end, I would like to thank all the present and former members of the Embedded Systems Laboratory and the Department of Computer Science and Engineering at IIT Kharagpur. I am grateful to the Central Library of IIT Kharagpur for offering such a vast resource of research material and making it so easily accessible. I offer my special thanks to Ghatal, Rohan, Dhiman, Chester, Subidh, Bodhi, Debi da, Joydeb da, and Bivas da for their friendship and support.
My father has always been my greatest inspiration. I owe him my heart-filled benediction for guiding me through a path of knowledge and truth, for being the strongest support in the journey towards my dreams and aspirations. I thank my mother for her unconditional consecration in making a happy and adorable home, and for keeping me in a healthy mental and physical state. I am also thankful to all of my relatives for their love and support.
Most importantly, I thank my wife Sweta. I thank you dear for your immutable patience and love, for carving a blissful end to each of my tiring days. Your emotional and moral support pulled me through this journey. I am grateful to God for having you by my side forever.
Santosh Ghosh
CSE, IIT-Kharagpur,
June 2011
DECLARATION
I certify that
a. The work contained in this thesis is original and has been done by myself under the general supervision of my supervisor.
b. The work has not been submitted to any other Institute for any degree or diploma.
c. I have followed the guidelines provided by the Institute in writing the thesis.
d. I have conformed to the norms and guidelines given in the Ethical Code of Conduct of the Institute.
e. Whenever I have used materials (data, theoretical analysis, and text) from other sources, I have given due credit to them by citing them in the text of the thesis and giving their details in the references.
f. Whenever I have quoted written materials from other sources, I have put them under quotation marks and given due credit to the sources by citing them and giving required details in the references.
Santosh Ghosh
Department of CSE,
IIT Kharagpur
Date:
Abstract
THE PRIMARY CHALLENGE in modern day cryptographic hardware development lies in coping
with progressively stronger physical attacks, commonly referred to as side-channel analysis.
This research deals with the practical implementation and physical-security analysis of
pairing based cryptographic operations on prime fields. Pairing computation and elliptic
curve scalar multiplication are the two major operations in pairing based cryptography.
These operations in turn rely on arithmetic in finite fields, here prime fields (Fp). Hence,
this work first designs a portable and compact architecture for Fp arithmetic. Subsequently,
the work proposes an efficient dual-core cryptoprocessor for elliptic curve scalar
multiplication based on the above compact Fp core. The Field Programmable Gate Array (FPGA)
is a relevant platform which provides various in-built features for optimizing arithmetic
operations. A configurable core on an FPGA device has been developed for Fp^k arithmetic
based on the above optimized Fp primitive. Two such configurable cores are utilized for
developing a pairing cryptoprocessor which computes pairings over Barreto-Naehrig curves.
The security of pairing computations against fault and power attacks is subsequently
addressed: the work studies existing as well as new vulnerabilities of pairing computations
to these attacks and proposes suitable countermeasures to resist them.
Keywords: Pairing based cryptography, Elliptic curve cryptography, Field programmable
gate array, Prime field, Side-channel attacks.
Contents
Title Page
Certificate of Approval
Certificate
Acknowledgements
Declaration
Abstract
Table of Contents
Symbols and Abbreviations
1 Introduction
1.1 Motivation and Objective
1.2 Contributions
1.3 Organization of the Thesis
1.4 Conclusion
2 Mathematical Background and Preliminaries
2.1 Finite Field Arithmetic
2.1.1 Addition and Subtraction in Fp
2.1.2 Multiplication in Fp
2.1.3 Inversion and Division in Fp
2.1.4 Montgomery Ladder for Exponentiation in Fp
2.2 Elliptic Curve Cryptography
2.2.1 Operations on Elliptic Curve
2.3 Cryptographic Pairings
2.3.1 Tate Pairing
2.4 FPGA Architecture
2.5 Side-channel Attacks
2.5.1 Timing Attacks
2.5.2 Power Consumption Attacks
2.5.2.1 Simple Power Analysis (SPA) Attacks
2.5.2.2 Differential Power Analysis (DPA) Attacks
2.6 Fault Attacks
2.6.1 Fault Induction Technique
2.7 Terminologies
2.8 Conclusion
3 Survey of Related Work
3.1 Hardware Implementation of ECSM on Prime Fields
3.2 The ECSM Against Side-channel Attacks
3.2.1 Indistinguishable Point Add and Point Double
3.2.2 Regular Point Multiplication Algorithms
3.2.3 Base Point Randomization Techniques
3.2.3.1 Point Blinding
3.2.3.2 Randomized Projective Representation
3.2.3.3 Randomized Elliptic Curve Isomorphisms
3.2.3.4 Randomized Field Isomorphisms
3.2.4 Scalar Multiplier Randomization Techniques
3.3 Implementation of Cryptographic Pairings
3.3.1 Software Library for 128-bit-secret Pairings
3.3.2 Hardware Design for 128-bit-secret Pairings
3.4 Fault and Side-channel Attacks on Pairings
3.5 Conclusion
4 Design and Analysis of Elliptic Curve Cryptoprocessor
4.1 Introduction
4.2 Motivation and Objective
4.3 Programmable GF(p) Arithmetic Unit (PGAU)
4.3.1 Motivation of PGAU
4.3.2 Proposed Programmable GF(p) Arithmetic Unit
4.3.3 Programmable Data Path Block
4.3.4 Hardware Cost and Performance
4.4 Elliptic Curve Cryptoprocessor Resistant to Timing and Power Attacks
4.4.1 Modified Montgomery Ladder Against DPA and DA
4.4.2 The ECSM on Single PGAU-core
4.4.3 The ECSM on Dual PGAU-core
4.5 Security Analysis of the Proposed Cryptoprocessor
4.5.1 Timing Attacks
4.5.2 Simple Power Analysis (SPA)
4.5.3 Differential Power Analysis (DPA)
4.5.4 Doubling Attack (DA)
4.5.5 Security of the Random Generator
4.6 ECSM Implementation Result and Comparison
4.7 Conclusion
5 Fast Prime Field Adders and Multipliers on FPGA Platform
5.1 Introduction
5.2 Fast Additions on FPGA
5.2.1 Proposed Addition Technique
5.2.2 Cost and Performance
5.3 Fast Fp Multipliers on FPGA
5.3.1 Proposed Multiplication Technique
5.3.2 Cost and Performance of Multiplier
5.3.3 Security Against Timing and Power Attacks
5.4 The PGAU and ECSM Hardware Based on Fast Adder
5.5 Conclusion
6 High Speed Flexible Pairing Cryptoprocessor
6.1 Introduction
6.2 Prior Work
6.2.1 Choice of Elliptic Curve
6.2.2 Pairing Computation
6.3 Programmable Fp-Primitive
6.3.1 Architecture Description
6.3.1.1 Computation of Fp-multiplication
6.3.1.2 Computation of Fp-addition
6.3.1.3 Computation of Fp-subtraction
6.4 A Configurable Fp^k Arithmetic Unit (CAU)
6.5 The Pairing Cryptoprocessor (PCP)
6.5.1 The Datapath Design
6.6 Computation of Tate Pairing on PCP
6.6.1 Computation of Doubling Step
6.6.2 Computation of Addition Step
6.6.3 Computation of Final Exponentiation
6.6.4 Cost for Computing Tate Pairing
6.7 Computation of ate Pairing on PCP
6.8 Computation of R-ate Pairing on PCP
6.9 Results
6.9.1 Comparison with Pairing Implementations
6.10 Conclusion
7 Pairing Computations Against Fault and Power Attacks
7.1 Introduction
7.2 Fault Attack on Tate Pairing [82]
7.2.1 Fault Induction Through Clock Signal
7.2.2 Analysis of Existing Countermeasures
7.2.2.1 New Point Blinding Technique [82]
7.2.2.2 Altering Traditional Point Blinding [82]
7.2.3 Proposed Countermeasure
7.2.3.1 Correctness Analysis
7.2.3.2 Security Against Fault Attack
7.3 Fault Attack on Pairing in Edwards Coordinates
7.3.1 Pairing in Edwards Coordinates
7.3.2 Attack Procedure
7.3.2.1 Practical Implication of Above Fault Attack
7.3.3 Countermeasure
7.3.3.1 Correctness Analysis
7.3.3.2 Security Against Fault Attack
7.4 Power Attacks on Pairing Computations
7.4.1 Weakness of Pairing Computations over Fp
7.4.2 Proposed DPA Attack
7.4.3 Mounting the DPA on FPGA Platform
7.4.4 Proposed DPA Resistance Pairing Computation
7.5 Conclusion
8 Conclusions and Future Directions
8.1 Conclusions
8.2 Future Directions
Bibliography
Dissemination
Bio-data
Symbols and Abbreviations
Symbols:
O Point at Infinity
#E(Fq) Number of Points on an Elliptic Curve E(Fq)
E(Fq) Elliptic Curve Defined over a Finite Field Fq
Fq Finite Field with Order q
Fp Prime Field with a Large Prime Characteristic p
F2^m Binary Field with Extension Degree m
F3^n Characteristic-3 Field with Extension Degree n
Abbreviations:
AES Advanced Encryption Standard
ASIC Application Specific Integrated Circuit
ASIP Application Specific Instruction-set Processor
ASM Addition, Subtraction and Multiplication in Fp
BN Barreto Naehrig
BRAM Block RAM
CAU Configurable Fp^k Arithmetic Unit
CFP Configurable Fp^k Primitive
CLB Configurable Logic Block
CMOS Complementary Metal Oxide Semiconductor
DAU Data Access Unit
DES Data Encryption Standard
DLP Discrete Logarithm Problem
ECA Elliptic Curve Point Addition
ECC Elliptic Curve Cryptography
ECD Elliptic Curve Point Doubling
ECDLP Elliptic Curve Discrete Logarithm Problem
ECDSA Elliptic Curve Digital Signature Algorithm
ECMQV Elliptic Curve Menezes-Qu-Vanstone
ECSM Elliptic Curve Scalar Multiplication
EEA Extended Euclidean Algorithm
FF Flip Flop
FPGA Field Programmable Gate Array
GF Galois Field or Finite Field
IFD Instruction Fetch and Decode
LSB Least Significant Bit
LUT Look Up Table
MSB Most Significant Bit
NIST National Institute of Standards and Technology
PA Point Addition
PCP Pairing Cryptoprocessor
PD Point Doubling
PGAU Programmable GF(p) Arithmetic Unit
RSA Rivest, Shamir and Adleman
Chapter 1
Introduction
BILINEAR PAIRING is a candidate for one-way functions defined on elliptic or hyperelliptic
curve groups. Pairing based cryptography is suitable for securing identity-aware and
ubiquitous computing devices. The major operations in pairing based cryptography are
pairing computation and elliptic curve scalar multiplication. This research focuses on
designing efficient hardware architectures for these operations on the FPGA platform.
Implementations of the respective algorithms may leak secret information during their
execution through concealed channels such as power consumption, timing, and faults.
Attacks that exploit such concealed channels are known as side-channel attacks. This
research also focuses on analyzing elliptic curve and pairing implementations against
side-channel attacks and on countermeasures for them.
1.1 Motivation and Objective
In recent times, pairing based cryptography has attained a lot of importance. As a natural
consequence, its hardware implementation is extremely important. The implementations
must be cost-effective in terms of both time and space requirements. This thesis focuses on
exploring several hardware design techniques which are employed in pairing based
cryptography. Two complex operations, elliptic curve scalar multiplication (or ECSM) and
pairing computation, are often used in pairing based cryptographic schemes. The field
programmable gate array (or FPGA) is one of the suitable platforms on which to develop
hardware for accelerating cryptographic operations. Thus, it is prudent at this point to look
into architectural design techniques on the FPGA platform that improve the efficiency of
ECSM and pairing computation.
Finite field arithmetic is the most important primitive of ECSM and pairing computation.
Pairing based cryptography requires all the underlying finite field operations: addition,
subtraction, multiplication, inversion, and division. In order to obtain an efficient
design, the present work first focuses on introducing hardware sharing among the finite
field operations. Modern FPGAs provide in-built features which may help in realizing
optimized circuits. Thus, the proposed work also investigates FPGA features to accelerate the
finite field primitives. Subsequently, the work focuses on exploiting the scope for parallelism
in the finite field algorithms. It further explores the scope for parallelism in the computation
of ECSM and pairing using multiple cores of the underlying primitives.
On the other hand, side-channel and fault attacks are the major threats to the
implementation of any cryptographic algorithm. The present thesis explores not only these
vulnerabilities of pairing schemes but also techniques to counteract them. Finally, the effect
of these techniques on the entire design and the final robustness of the design are evaluated.
In this thesis two broad aspects of the hardware for pairing based cryptography, namely,
efficient implementation and security against side-channel attacks, have been separately
studied. One of the main objectives of this thesis is to reduce the computation time of
major operations of a pairing based scheme. This reduction in computation time is brought
about by targeting the following aspects of FPGA implementation:
• A hardware sharing technique has been explored to develop an optimized programmable
architecture for prime field arithmetic.
• The in-built carry chains of an FPGA device have been exploited to develop a
high-speed adder circuit.
• A modified interleaved multiplication technique has been proposed to reduce the
critical path of a prime field multiplier.
• Multiple functional cores have been incorporated into the proposed cryptoprocessors
for exploiting the parallelism of ECSM and pairing computations.
One more objective of this thesis is to secure the proposed designs against
side-channel attacks. In that respect the following techniques have been proposed:
• The proposed finite field primitives help to make the cryptoprocessors resistant against
simple side-channel attacks.
• A new point blinding technique has been proposed which protects the secret
parameter of the ECSM operation against simple power analysis (SPA), differential power
analysis (DPA), and the doubling attack (DA).
• A new countermeasure has been proposed to defend pairing computations against
fault attacks.
• The line function of pairings has been modified to defend against differential power attacks.
1.2 Contributions
The contributions of the thesis are summarized below:
• Design and Analysis of Elliptic Curve Cryptoprocessor. We present an elliptic
curve cryptoprocessor that exploits the concept of shared arithmetic hardware and
explore its security against timing and power attacks. The contribution of this work
is threefold.
1. PGAU core: We propose a Programmable GF(p) Arithmetic Unit (PGAU) that
performs GF(p) addition, subtraction, multiplication, inversion, and division.
The modular operations are performed directly in the 2’s complement number
system. The PGAU requires 18% less area than an integrated design in which each
arithmetic unit is a state-of-the-art stand-alone implementation. The PGAU takes
only 0.96 times the slice area but achieves a 2.67 times speedup compared to the
existing design [106].
2. Elliptic curve cryptoprocessor: We observe that the saving in area can be
exploited by using multiple copies of the PGAU for accelerating elliptic curve scalar
multiplication. Thus, we attempt to speed up the ECSM operation by using two
PGAU cores. The proposed design is implemented on a Xilinx Virtex-II Pro FPGA
device. The experimental result shows that the proposed elliptic curve
cryptoprocessor computes a 192-bit ECSM operation in 4.47 ms. The whole design
requires 8972 CLB slices and runs at a 43 MHz clock on a Virtex-II Pro FPGA. The
same design can run at a 61 MHz clock on a Virtex-IV FPGA platform.
3. Side-channel attacks: The PGAU is designed in such a way that it does not
expose any timing or power attack vulnerabilities during the execution of
finite field operations. A new point blinding technique is proposed which is
applied to the SPA resistant Montgomery ladder for ECSM computation. The
analysis shows that the proposed cryptoprocessor is indeed secure against
differential and non-differential timing and power attacks. In order to show its
security against differential power analysis (or DPA) we first show an actual
DPA result on an FPGA implementation without any DPA resistance scheme.
This result confirms that DPA is really capable of recovering the secret scalar
multiplier. The same analysis has been performed on our proposed implementation.
It is shown that even with ten times more power traces we could not
find any significant DPA peak from which to guess the secret bits. The result
indicates that the proposed design is capable of protecting the secret against DPA
attack. The proposed design is also capable of providing security against the
doubling attack.
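The SPA resistant Montgomery ladder named in item 3 can be illustrated with a small functional model. The following is a minimal sketch under stated assumptions, not the design of this thesis: the toy curve y^2 = x^3 + 2x + 2 over F_17 with base point (5, 1), the affine formulas, and the names `ec_add` and `ladder` are all chosen here for illustration, and the proposed point blinding is not reproduced. It only shows the regular structure of the ladder, in which one point addition and one point doubling are computed for every scalar bit.

```python
# Toy curve y^2 = x^3 + A*x + B over F_P (illustrative parameters, far too small for real use).
P, A, B = 17, 2, 2
G = (5, 1)      # a point on the toy curve
INF = None      # point at infinity

def ec_add(p1, p2):
    """Affine point addition on the toy curve; also handles doubling and infinity."""
    if p1 is INF:
        return p2
    if p2 is INF:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return INF                      # p2 == -p1 (includes doubling a point with y == 0)
    if p1 == p2:
        s = (3 * x1 * x1 + A) * pow((2 * y1) % P, -1, P) % P   # tangent slope (Python 3.8+)
    else:
        s = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P          # chord slope
    x3 = (s * s - x1 - x2) % P
    y3 = (s * (x1 - x3) - y1) % P
    return (x3, y3)

def ladder(k, point, bits=8):
    """Montgomery ladder: returns k*point using one add and one double per bit.

    Invariant: r1 == r0 + point. This is a functional model only; a Python
    branch on a secret bit is not itself side-channel safe.
    """
    r0, r1 = INF, point
    for i in reversed(range(bits)):
        if (k >> i) & 1:
            r0, r1 = ec_add(r0, r1), ec_add(r1, r1)
        else:
            r0, r1 = ec_add(r0, r0), ec_add(r0, r1)
    return r0
```

For example, `ladder(2, G)` returns the double of the base point. Because both operations are executed in every iteration regardless of the bit value, the sequence of operations does not depend on the scalar, which is the property the text relies on for SPA resistance.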
• Fast Prime Field Adders and Multipliers on FPGA Platform. Finite field addition
and multiplication are the most important operations in cryptography. Efficient
techniques for these operations greatly affect the overall performance of a
cryptoprocessor. We explore the in-built features of an FPGA device to develop high-speed
prime field (Fp) primitives. The contributions of this work are briefly described here.
1. Fast carry chain (FCC): Modern FPGAs provide special carry logic for
addition. The carry chains formed by the in-built carry logic are 32 bits long.
Through experimental results this chapter shows that, for a 32-bit addition, the
carry propagation adder (CPA) based on the in-built carry logic provides the
lowest latency of all known addition techniques. The latency of this CPA is only
6.6 ns, whereas that of the carry lookahead adder is 9.2 ns on a Virtex-II Pro FPGA.
2. High-speed adder: Subsequently, we propose a hierarchical adder structure
for large operands using the above fast carry chains (FCCs). The large operands are
decomposed hierarchically down to 32-bit lengths based on the Karatsuba technique.
The experimental result shows that the proposed technique significantly reduces
the routing delay as well as the logic delay compared to the existing techniques. For
comparison, we implement some existing addition techniques for 256-bit and 512-bit
operands using 32-bit FCCs. They are thus designed as their respective 8-bit and
16-bit structures, in which each single-bit full adder is replaced by a 32-bit
FCC. The proposed 256-bit adder provides a 35% speedup over the best known
carry lookahead technique on an FPGA platform.
3. Fp-multiplier: A modification of the interleaved multiplication algorithm is
proposed for improving the scope for parallelism. The modified algorithm exploits
the Montgomery ladder, in which the doubling and the addition within an iteration
are independent of each other. Moreover, both operations are computed
in every iteration, which provides a balanced execution and security against
non-differential side-channel attacks.
The work further proposes a parallel iterative architecture based on the modified
multiplication algorithm and the high speed adders. It exploits parallelism at two
levels: the addition level and the algorithmic level. Extensive experimental
results are furnished to show its performance improvement of 70% over the existing
design and its security against non-differential timing and power attacks.
4. Speedup of ECC cryptoprocessor: It is now essential to validate the proposed
technique on elliptic curve and pairing computations. In the case of elliptic curve
computation, we redesign the PGAU and the ECSM cryptoprocessor. The old adder
circuits are replaced by the proposed high speed adders in the new designs.
The experimental result shows that the modified designs achieve a 30% speedup
over the old designs. The same Fp-primitives are used to develop the pairing
cryptoprocessor which is described later.
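The idea behind item 3, an interleaved modular multiplication restructured as a Montgomery ladder, can be sketched in software. The block below is a hedged illustration and not the algorithm as specified in the thesis: the function names `interleaved_mul` and `ladder_mul` are hypothetical, the `%` operator stands in for the conditional subtractions a hardware datapath would use, and the ladder form shown is one natural reading of the description (a doubling and an addition per iteration that do not depend on each other).

```python
def interleaved_mul(a, b, p, bits):
    """Classic MSB-first interleaved multiplication: returns a*b mod p.

    The addition happens only for 1-bits of b, and it depends on the result
    of the doubling performed in the same iteration.
    """
    acc = 0
    for i in reversed(range(bits)):
        acc = (acc << 1) % p            # doubling step
        if (b >> i) & 1:
            acc = (acc + a) % p         # data-dependent addition step
    return acc

def ladder_mul(a, b, p, bits):
    """Ladder-style variant (assumed form): one doubling and one addition per bit.

    Invariant: r1 - r0 == a (mod p). Both the sum and the doubled value are
    computed from the old r0 and r1, so they could be evaluated in parallel.
    """
    r0, r1 = 0, a % p
    for i in reversed(range(bits)):
        s = (r0 + r1) % p               # addition, uses the old r0 and r1
        if (b >> i) & 1:
            r0, r1 = s, (2 * r1) % p    # doubling of r1, independent of s
        else:
            r0, r1 = (2 * r0) % p, s    # doubling of r0, independent of s
    return r0
```

Both functions return the same product for operands `a < p` and `b < 2**bits`. In the ladder form every iteration performs exactly one doubling and one addition whatever the bit value, which matches the balanced execution described above.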
• High Speed Flexible Pairing Cryptoprocessor. In this work we propose a
cryptoprocessor for the computation of pairings over Barreto-Naehrig curves (BN curves).
The proposed pairing cryptoprocessor (PCP) supports random curve parameters,
including the prime p. It supports all primes of up to the given bit length (256 bits). We
develop a parallel configurable hardware for computing addition, subtraction, and
multiplication in Fp and Fp^2 using the high-speed Fp-primitives described previously.
Existing techniques to speed up arithmetic in extension fields [61] for fast
computation in Fp^6 and Fp^12 are used on top of it. The major contributions of this work
are highlighted here.
1. CFP design: The chapter introduces a configurable Fp^k-primitive (CFP) based
on the high-speed Fp-primitives described previously. The CFP has inherent
configurability to perform arithmetic in Fp and Fp^2 for any p of up to the given
length. Existing techniques to speed up arithmetic in extension fields [61] for
fast computation in Fp^6 and Fp^12 are implemented on top of it.
2. Pairing cryptoprocessor: A pairing cryptoprocessor is designed with two
CFP-cores. The advantages of the dual core are utilized by developing a
parallel scheduling of the underlying Fp-operations for pairing computation.
The proposed cryptoprocessor also provides flexibility in the curve parameters.
Experimental results show a significant improvement in clock cycle counts for
pairing computations compared to the similar design reported in [17]. Owing to
this improvement, the speed of the proposed cryptoprocessor on an FPGA platform
is comparable with that of the existing CMOS design.
The proposed configurable Fp^k arithmetic cores and parallel computation result in a
significant improvement in the performance of Tate, ate, and R-ate pairings over BN
curves. The result is demonstrated for a 256-bit BN curve, which provides 128-bit
security.
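To make the Fp^2 arithmetic mentioned above concrete, the following sketch multiplies two elements of Fp^2 with three Fp multiplications (the Karatsuba trick). It is an illustration under assumptions rather than the CFP design: Fp^2 is represented here as Fp[u]/(u^2 - beta) with beta = -1, which is a common choice when p ≡ 3 (mod 4) but is not stated in this chapter, and the names `fp2_mul` and `fp2_mul_schoolbook` are hypothetical.

```python
def fp2_mul(x, y, p, beta=-1):
    """Multiply x = a0 + a1*u and y = b0 + b1*u in Fp[u]/(u^2 - beta).

    Uses three Fp multiplications (v0, v1, v2) instead of four. The three
    products do not depend on each other, so they could be issued in parallel.
    """
    a0, a1 = x
    b0, b1 = y
    v0 = a0 * b0 % p
    v1 = a1 * b1 % p
    v2 = (a0 + a1) * (b0 + b1) % p
    c0 = (v0 + beta * v1) % p           # a0*b0 + beta*a1*b1
    c1 = (v2 - v0 - v1) % p             # a0*b1 + a1*b0
    return (c0, c1)

def fp2_mul_schoolbook(x, y, p, beta=-1):
    """Reference version with four Fp multiplications, used only for checking."""
    a0, a1 = x
    b0, b1 = y
    return ((a0 * b0 + beta * a1 * b1) % p, (a0 * b1 + a1 * b0) % p)
```

With p = 19 and beta = -1, `fp2_mul((1, 2), (3, 4), 19)` gives `(14, 10)`, since (1 + 2u)(3 + 4u) = -5 + 10u when u^2 = -1. Trading one multiplication for a few additions is attractive whenever multiplication dominates the cost, as it does in the Fp primitives described here.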
• Pairing Computations Against Fault and Power Attacks. This work deals with
fault and side-channel attacks on pairing computations, which is another objective
of this thesis. The contributions of this chapter are summarized here.
1. Fault attack on pairing: The chapter analyzes the existing fault attacks on pairing
computations, and their countermeasures, that are described in [82]. The attack
assumes that the respective fault is injected into a specific register inside the
pairing cryptoprocessor. With experimental results this chapter demonstrates a
technique for injecting a fault into a register by tuning the clock frequency. The
chapter identifies the limitations of the existing countermeasures. To overcome
these limitations we propose a new countermeasure to defend pairing computations
against fault attacks.
2. Fault attack on Miller’s algorithm: A new representation of the addition law
on elliptic curves was introduced by Edwards [65] in 2007, which provides
efficient elliptic curve group operations [64]. Pairing computation in Edwards
coordinates is proposed in [47]. This chapter analyzes the security of the
pairing computation proposed in [47] against a new fault attack. It shows that
pairing computations over BN curves and in Edwards coordinates [47] are
vulnerable to this new fault attack. A suitable technique is also proposed to
counteract such an attack.
3. DPA on pairing: Side-channel attack on pairing computation based on power
analysis is another objective of the present work. We propose an attack
technique based on differential power analysis of pairing computations over Fp.
Through experimental results we show how the proposed attack actually works
on an FPGA platform. A suitable technique is also proposed to counteract
such a power attack.
1.3 Organization of the Thesis
The rest of the thesis is structured as follows:
Chapter 2 gives a brief overview of finite field operations with related techniques and
algorithms. It also includes the basic ideas of elliptic curve and pairing based cryptography.
Background on side-channel and fault attacks is also provided in this chapter.
Chapter 3 reports related work to present the state of the art in connection with the thesis.
Chapter 4 presents an elliptic curve cryptoprocessor exploiting the concept of shared
arithmetic hardware and explores its security against timing and power attacks.
Chapter 5 explores the in-built features of an FPGA device to develop high-speed prime
field primitives. The multiplication algorithm is modified to improve the scope for
parallelism, and a high-speed Fp multiplier for 2’s complement numbers is proposed.
Chapter 6 first designs a configurable architecture for computing arithmetic in Fpk . Then
it proposes a cryptoprocessor for computing asymmetric pairings over BN curves that
provide 128-bit security.
Chapter 7 deals with the security of pairing computations against fault and power attacks.
The actual technique of fault induction is shown through experimental results. Actual
DPA attacks on a pairing computation are described. Suitable countermeasures
are proposed in this chapter.
Chapter 8 concludes the thesis and discusses some possible directions of future work.
1.4 Conclusion
This chapter has given an overview of the whole work. The motivation behind this
research, its objectives, and its scope are described. In the next chapter we provide the
background for the work described in this thesis.
Chapter 2
Mathematical Background and
Preliminaries
THIS CHAPTER PRESENTS the background related to this thesis. It starts with a brief
description of finite field arithmetic, which provides the underlying operations in elliptic
curve and pairing computations. Then, the chapter discusses the basic concepts of elliptic
curves and pairings to outline the foundation of the content of this thesis.
The steep growth in processing speed causes the key sizes of cryptographic schemes
to increase by almost 25% every decade [7]. As per the recommendation of the National
Institute of Standards and Technology (NIST), the security requirement up to 2010 is 80
bits, i.e., a computational complexity of $2^{80}$; it is 112 bits up to 2030, and 128 bits beyond
2030. Along with these security requirements, separate and independent public
key techniques have evolved: RSA, elliptic curves, pairings, etc. Among these techniques, the
key sizes of the latter two are smaller than those of the first. In this chapter, we briefly describe
some of the methodologies, challenges, and solutions related to the latter two public key techniques.
The practical implementation of the above public key techniques and their security analysis
against side-channel attacks are the major objectives of the thesis. The field programmable
gate array (FPGA) is a relevant platform that is used to develop application specific
hardware. FPGA devices provide different in-built features for basic arithmetic
modules which can be utilized for developing high-speed architectures for an application.
Side-channel attacks are major threats to implementations of cryptographic algorithms.
They can break a secure algorithm with very little effort by exploiting unwanted leakage
during execution. The basic concepts of the FPGA platform and side-channel attacks are also
described in this chapter.
2.1 Finite Field Arithmetic
A finite field or Galois field (GF) is defined on a finite set of elements and has a prime
characteristic. The two smallest finite fields have characteristic 2 and 3, and
are known as F2 (or GF(2)) and F3 (or GF(3)), respectively. We represent a finite
field with a large prime characteristic p by Fp or GF(p). Most of the work described in this
thesis is based on Fp. Therefore, this section describes the arithmetic operations in Fp.
Multiplication, inversion, division, and exponentiation in this field can be computed
by different techniques [117]. This chapter discusses only the techniques that are
used later to describe the proposed work in the thesis.
2.1.1 Addition and Subtraction in Fp
The operation (a+b) in Fp adds two operands a and b, and subtracts the modulus p
from the sum if (a+b) ≥ p. However, the comparison can be avoided in the following
way. First, we compute c = a+b and then d = c−p. The final result is either c or
d, as decided by the carry-out values of these two operations.
The doubling operation (2a) mod p is a special case of (a+b) mod p. It is computed
in the same way as Fp addition, with the first addition (a+b) replaced by a
left shift that computes 2a.
To perform Fp subtraction, input b is bitwise inverted and added to input a with a
carry-in of 1. If the result is negative (i.e., the carry-out is low), the modulus is added to
produce an output in Fp. The correct result is selected by the carry-out bit of the first
adder. The respective architectures for Fp addition, doubling, and subtraction are described
in [53, 106].
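The carry-based selection described above can be sketched in software as follows. This is a minimal illustration, assuming k-bit operands with 0 ≤ a, b < p < 2^k. The function names (`fp_add`, `fp_double`, `fp_sub`, `_select`) and the explicit bit-width parameter `k` are hypothetical and are not taken from the architectures in [53, 106].

```python
def _select(c, p, k):
    """Reduce a (k+1)-bit value c < 2p to c mod p using only carry-out bits."""
    mask = (1 << k) - 1
    carry_c = c >> k                      # carry-out of the first adder (or shift)
    d = (c & mask) + ((~p) & mask) + 1    # second adder: c - p in two's complement
    carry_d = d >> k                      # 1 when the subtraction did not borrow
    # d is the correct result when either carry is set, i.e. when c >= p
    return (d & mask) if (carry_c | carry_d) else c


def fp_add(a, b, p, k):
    """(a + b) mod p without an explicit comparison against p."""
    return _select(a + b, p, k)


def fp_double(a, p, k):
    """(2a) mod p: the first addition is replaced by a left shift."""
    return _select(a << 1, p, k)


def fp_sub(a, b, p, k):
    """(a - b) mod p: add the bitwise inverse of b with carry-in 1."""
    mask = (1 << k) - 1
    c = a + ((~b) & mask) + 1             # a - b in two's complement
    carry = c >> k                        # a low carry-out means a < b
    c &= mask
    return c if carry else (c + p) & mask  # add the modulus back when negative
```

In hardware the two candidate results are computed side by side and a multiplexer picks one; the sketch mirrors that by computing `d` unconditionally.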
2.1.2 Multiplication in Fp
One of the interesting procedures for Fp multiplication is the interleaved multiplication
algorithm [179, 180], which is shown in Algo. 2.1. The main advantage of the
algorithm is that it does not require any final division. At every iteration the intermediate
result is reduced, so it remains below the modulus.
Algorithm 2.1: Interleaved multiplication in Fp, IntMult(a, b, p).
Input: a, b, p, where $b = \sum_{i=0}^{k-1} 2^i b_i$.
Output: (a · b) mod p.
s ← 0.
for i from k−1 downto 0 do
    s ← (2s) mod p.
    if b_i = 1 then
        s ← (s + a) mod p.
    end
end
return s.
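Algo. 2.1 can be sketched in Python as follows. This is an illustrative version only: the function name `int_mult` is hypothetical, k is taken to be the bit length of p, and each "mod p" step is realized as one conditional subtraction, which is sufficient because the intermediate value stays below 2p.

```python
def int_mult(a, b, p):
    """Interleaved multiplication: (a * b) mod p for 0 <= a, b < p."""
    s = 0
    for i in range(p.bit_length() - 1, -1, -1):   # scan b from the top bit down
        s = 2 * s                                  # doubling step
        if s >= p:
            s -= p                                 # s < 2p, one subtraction reduces it
        if (b >> i) & 1:
            s += a                                 # conditional addition step
            if s >= p:
                s -= p
    return s
```

No final division is needed because `s` is reduced in every iteration.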
The main difficulty of the algorithm is the addition of large operands. The
carry chain increases the latency of the addition linearly with the operand size. Thus, a
carry propagation adder is inefficient for a multiplier that is based on the interleaved
multiplication procedure. However, some modifications that use carry save adders (CSA)
in the interleaved multiplication algorithm are reported in [94, 95, 121]. The pre-computations
required by these modifications depend on the multiplicand a. They pay off
in applications where repeated multiplications are performed
with a fixed multiplicand and a varying multiplier. But in our applications, such as elliptic curve or
pairing computations, the finite field multiplications are normally performed on different
operands. Thus, the pre-computation cost adds directly to the cost of every multiplication
in elliptic curve and pairing based cryptographic schemes. A modification of the
above algorithm is proposed in chapter 5 to improve the efficiency of Fp multiplication.
2.1.3 Inversion and Division in Fp
The modular multiplicative inverse $a^{-1} \bmod p$ of an integer a exists if and only if a
and p are relatively prime, that is, gcd(a, p) = 1. Two methods for inversion are often used:
Fermat’s Little Theorem and a variant of the Extended Euclidean Algorithm (EEA). The first
computes the inverse by exponentiation, as $a^{p-2} \bmod p$. Many variants of these algorithms
are reported in the literature; most of them are listed and discussed in [117]. One efficient
variant of the EEA for Fp inversion is based on the binary method and is known as the Binary
Inversion Algorithm, shown in Algo. 2.2. The algorithm runs iteratively.
At every iteration either u or v is reduced by at least one bit, which ensures
that the total number of iterations is at most 2k, where k is the maximum bit length of p and
a. In [106], the authors outline a modular division operation that uses a modular
inversion followed by a modular multiplication. The binary modular inversion
algorithm (Algo. 2.2) can easily be modified to perform the modular division $b/a = b a^{-1}$. To
obtain (b/a) mod p with this algorithm, it is necessary to initialize the variable x1 in step
1 with b instead of 1. We follow this algorithm for performing GF(p) inversion and division
operations in the elliptic curve hardware which is described in chapter 4.
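The binary inversion procedure, together with the division variant obtained by initializing x1 with b, can be sketched as follows. This is a minimal illustration assuming an odd prime p and 1 ≤ a < p; the function name `binary_inverse` and the optional argument `b` are hypothetical. Unlike a hardware datapath, the sketch lets x1 and x2 become negative and reduces only at the end.

```python
def binary_inverse(a, p, b=1):
    """Return (b / a) mod p for an odd prime p; with b = 1 this is a^-1 mod p."""
    u, v = a, p
    x1, x2 = b % p, 0          # invariants: a*x1 = b*u (mod p), a*x2 = b*v (mod p)
    while u != 1 and v != 1:
        while u % 2 == 0:      # halve u; keep x1 consistent by adding p if x1 is odd
            u //= 2
            x1 = x1 // 2 if x1 % 2 == 0 else (x1 + p) // 2
        while v % 2 == 0:      # same treatment for v and x2
            v //= 2
            x2 = x2 // 2 if x2 % 2 == 0 else (x2 + p) // 2
        if u >= v:             # subtract the smaller value from the larger
            u -= v
            x1 -= x2
        else:
            v -= u
            x2 -= x1
    return (x1 if u == 1 else x2) % p
```

Calling `binary_inverse(a, p)` gives the inverse, while `binary_inverse(a, p, b)` gives the quotient b/a without a separate multiplication.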
2.1.4 Montgomery Ladder for Exponentiation in Fp
The operation $a^b \bmod p$ in Fp can be performed by the binary square-and-multiply
algorithm [117] with complexity $O(k^3)$, where k represents the bit length of p. It has two
variations: right-to-left and left-to-right. In this iterative procedure the squaring is
performed at every iteration, whereas the multiplication is performed only if the respective bit
b_i = 1. The main drawback of this procedure is that its computation is unbalanced and depends on
the bit values of the exponent, which makes it vulnerable to simple SCA attacks. To
overcome this vulnerability the Montgomery powering ladder is sometimes used [125].
The respective algorithm is shown in Algo. 2.3, where the internal squarings and
multiplications are performed in Fp.
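The Montgomery powering ladder can be sketched in Python as follows. This is an illustrative version under the assumption b ≥ 1; the function name `ladder_pow` is hypothetical, and Python's big-integer arithmetic stands in for the Fp multiplier.

```python
def ladder_pow(a, b, p):
    """Montgomery ladder: a**b mod p for an exponent b >= 1."""
    k = b.bit_length()
    q1, q2 = a % p, (a * a) % p            # invariant: q2 = q1 * a (mod p)
    for i in range(k - 2, -1, -1):
        if (b >> i) & 1:
            q1, q2 = (q1 * q2) % p, (q2 * q2) % p
        else:
            q2, q1 = (q1 * q2) % p, (q1 * q1) % p
    return q1
```

Both branches perform exactly one multiplication and one squaring, so the sequence of operations does not depend on the exponent bits. Note that Python's `if` still branches on the bit, so this sketch shows the balanced operation count only, not a constant-time implementation.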
2.2 Elliptic Curve Cryptography
The use of elliptic curves in cryptography was independently introduced by V.S. Miller
[178] and N. Koblitz [174] in the 1980s. It has greatly reduced the key
sizes of public key schemes compared with traditional RSA [182] based techniques. For example, a
160-bit elliptic curve based scheme is as secure as the well known 1024-bit RSA
scheme. An elliptic curve over a finite field consists of a finite number of points which form an
abelian group. The exponentiation in the elliptic curve group is known as elliptic curve
scalar multiplication (ECSM), which is represented as dP for any integer d and any point P.

Algorithm 2.2: Binary Inversion in GF(p).
Input: a ∈ Fp.
Output: $a^{-1} \bmod p$.
u ← a, v ← p, x1 ← 1, x2 ← 0.
while u ≠ 1 and v ≠ 1 do
    while u is even do
        u ← u/2.
        if x1 is even then
            x1 ← x1/2.
        else
            x1 ← (x1 + p)/2.
        end
    end
    while v is even do
        v ← v/2.
        if x2 is even then
            x2 ← x2/2.
        else
            x2 ← (x2 + p)/2.
        end
    end
    if u ≥ v then
        u ← u − v. x1 ← x1 − x2.
    else
        v ← v − u. x2 ← x2 − x1.
    end
end
if u = 1 then
    return (x1) mod p.
else
    return (x2) mod p.
end

Algorithm 2.3: The Montgomery ladder for exponentiation.
Input: a, b, p, where $b = \sum_{i=0}^{k-1} 2^i b_i$.
Output: $a^b \bmod p$.
q1 ← a and q2 ← $a^2$.
for i from k−2 downto 0 do
    if b_i = 1 then
        q1 ← q1 · q2 and q2 ← $(q2)^2$.
    else
        q2 ← q1 · q2 and q1 ← $(q1)^2$.
    end
end
return q1.

The operation dP represents the addition P + P + · · · + P, in which P is added to itself (d−1)
times. It is a one-way function: the forward computation, i.e., computing Q = dP given d and P, is
easy. But the reverse problem is computationally hard, and is stated in the following definition.
Definition 1. Elliptic curve discrete logarithm problem (ECDLP): Given an elliptic curve
E defined over a finite field Fq, a point P ∈ E(Fq) of order n, and a second point Q ∈ ⟨P⟩,
determine the integer d ∈ [0, n−1] such that Q = dP.
The ECDLP is the heart of elliptic curve cryptography. The security of an elliptic curve
scheme is based on the difficulty of solving this problem. The best known algorithm for solving
the ECDLP is the Pollard-rho method [183]. The algorithm has a fully-exponential expected
running time of $\sqrt{\pi n}/2$ point additions. The hardness of the ECDLP depends on the choice
of the elliptic curve. For a given underlying field Fq, maximum resistance to Pollard’s rho
method is attained by selecting an elliptic curve E for which n is prime and as large
as possible. The most favourable situation arises when #E(Fq) is prime or almost prime,
i.e., #E(Fq) = kn, where n is prime and the co-factor k is small (e.g., k ∈ {1,2,3,4}). In
this case, since #E(Fq) lies in the Hasse interval $[(\sqrt{q}-1)^2, (\sqrt{q}+1)^2]$, we have n ≈ q and
we say that the elliptic curve has a security level of $\frac{1}{2}\log_2 q$ bits [102].
Figure 2.1 represents a typical ECC implementation hierarchy. The top level comprises
elliptic curve cryptographic schemes like ECDSA and ECMQV [155]. The second level
is the elliptic curve scalar multiplication (ECSM), which consists of a sequence of point
doubling (ECD) and point addition (ECA) operations. The operations ECD and ECA, considered as
elliptic curve group operations, are in the third level. These two group operations consist
of a sequence of finite field divisions, multiplications, additions, and subtractions that belong
to the fourth level of the hierarchy.
Pairing based cryptography can be viewed as an extension of elliptic curve cryptography.
Along with the ECSM operation there is another one-way function known as pairing
computation, which we discuss later.
Figure 2.1: ECC implementation hierarchy. Levels from top to bottom: elliptic curve
cryptographic schemes; elliptic curve scalar multiplication; elliptic curve point addition and
point doubling; finite field addition, subtraction, multiplication, and inversion.
2.2.1 Operations on Elliptic Curve
An elliptic curve E over GF(p) is often defined as the set of solutions (points) of the
following equation [117, 155],
$y^2 = x^3 + ax + b$, (2.1)
where x, y, a, b ∈ GF(p) and $4a^3 + 27b^2 \neq 0$. The rational affine points on the curve and the
point at infinity O form an abelian group. The point O is the identity element of the
group. Thus, for every point P ∈ E, P + O = O + P = P. The group operations are known
as point addition (or ECA) and point doubling (or ECD).
Figure 2.2: Geometric interpretation of elliptic curve operations. (a) addition of two points
and (b) doubling a point.
The geometric interpretation of the addition of two points on an elliptic curve and of the
doubling of a point is depicted in Fig. 2.2. Suppose that P and Q are two distinct points on an
elliptic curve, and that P is not −Q. To add the points P and Q, a line is drawn through the two
points as shown in Fig. 2.2(a). This line intersects the cubic curve in exactly one more
point, called −R. The point −R is reflected in the x-axis to the point R, which represents the
resultant point of P + Q.
Similarly, to add a point P to itself, i.e., to double the point P, a tangent line to the curve is
drawn at the point P as shown in Fig. 2.2(b). If the y-coordinate of P is not 0, then the tangent
line intersects the elliptic curve at exactly one other point, −R. The point −R is reflected in the
x-axis to R, which represents the resultant point of P + P or 2P.
The formulæ for ECA and ECD in affine coordinates are as follows. Let P = (x1, y1)
and Q = (x2, y2) be two points on E. If P = −Q then P + Q = O. Otherwise R = (x3, y3) =
P + Q is given by:
$x_3 = \lambda^2 - x_1 - x_2$ (2.2)
$y_3 = \lambda(x_1 - x_3) - y_1$ (2.3)
where
$\lambda = \begin{cases} (3x_1^2 + a)/(2y_1) & \text{if } P = Q \\ (y_2 - y_1)/(x_2 - x_1) & \text{if } P \neq Q. \end{cases}$
In addition to ECA and ECD, the inverse of a point P = (x, y) ∈ E is computed as −P = (x, −y).
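The affine formulæ (2.2) and (2.3) can be sketched in Python as follows. This is an illustration only: the names `ec_add` and `ec_neg`, the representation of O as `None`, and the toy curve y² = x³ + 2x + 2 over F17 with the point (5, 1) are assumptions chosen for readability and are not parameters used in this thesis. The field inversion is done by Fermat exponentiation for brevity.

```python
O = None  # the point at infinity


def ec_neg(P, p):
    """Inverse of a point: -(x, y) = (x, -y)."""
    return O if P is O else (P[0], (-P[1]) % p)


def ec_add(P, Q, a, p):
    """Affine point addition (ECA) and doubling (ECD) on y^2 = x^3 + a*x + b over F_p."""
    if P is O:
        return Q
    if Q is O:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:     # P = -Q, including doubling with y = 0
        return O
    if P == Q:                               # doubling: slope of the tangent
        num, den = 3 * x1 * x1 + a, 2 * y1
    else:                                    # addition: slope of the chord
        num, den = y2 - y1, x2 - x1
    lam = num * pow(den % p, p - 2, p) % p   # division via Fermat inversion
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)
```

For the toy curve, doubling (5, 1) gives (6, 3), and adding (5, 1) to (6, 3) gives (10, 6); both results satisfy the curve equation modulo 17.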
The ECSM computation technique is shown in Algo. 2.4, where k represents the bit size
of p, i.e., k = ⌈log2 p⌉. Algo. 2.4 is the Montgomery ladder [176] based ECSM algorithm.
It works as follows:
Let the binary representation of d be $(d_{k-1}, \cdots, d_0)$, and assume that $d_{k-1} = 1$. The
algorithm starts with the pair (P, 2P). At the beginning of each step i, we have the pair
(Q1, Q2) = (mP, (m+1)P), where $m = (d_{k-1} \cdots d_{i+1})_2$. At the end of the last step (i = 0),
we have (Q1, Q2) = (dP, (d+1)P).
Algorithm 2.4: The Montgomery ladder for elliptic curve scalar multiplication.
Input: An integer d ≥ 1 and a point P on the elliptic curve, where $d = \sum_{i=0}^{k-1} 2^i d_i$.
Output: dP.
Q1 ← P and Q2 ← 2P.
for i from k−2 downto 0 do
    if d_i = 1 then
        Q1 ← Q1 + Q2 and Q2 ← 2Q2.
    else
        Q2 ← Q1 + Q2 and Q1 ← 2Q1.
    end
end
return Q1.
The Montgomery ladder has two significant advantages. First, both branches, for d_i = 1
and d_i = 0, can be parallelized in an obvious way: point addition and point doubling can
run in parallel on two different processor cores. Second, the algorithm is resistant to
non-differential (simple) side-channel attacks [156], because every iteration performs one
addition and one doubling regardless of the bit value. The cryptoprocessor proposed in chapter 4
computes the Montgomery ladder based ECSM operation by following the above point inversion,
point addition, and point doubling rules in affine coordinates.
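The Montgomery ladder for scalar multiplication can be sketched on top of the affine group law as follows. This is an illustrative version, assuming d ≥ 1; the function names `ec_add` and `ec_scalar_mult`, the use of `None` for the point at infinity, and the toy curve y² = x³ + 2x + 2 over F17 with the point (5, 1) are hypothetical choices and not the parameters of the cryptoprocessor in chapter 4.

```python
O = None  # the point at infinity


def ec_add(P, Q, a, p):
    """Affine addition/doubling on y^2 = x^3 + a*x + b over F_p."""
    if P is O:
        return Q
    if Q is O:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return O
    if P == Q:
        num, den = 3 * x1 * x1 + a, 2 * y1
    else:
        num, den = y2 - y1, x2 - x1
    lam = num * pow(den % p, p - 2, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)


def ec_scalar_mult(d, P, a, p):
    """Montgomery ladder: returns dP for d >= 1."""
    q1, q2 = P, ec_add(P, P, a, p)           # the pair (P, 2P); q2 - q1 = P throughout
    for i in range(d.bit_length() - 2, -1, -1):
        if (d >> i) & 1:
            q1, q2 = ec_add(q1, q2, a, p), ec_add(q2, q2, a, p)
        else:
            q2, q1 = ec_add(q1, q2, a, p), ec_add(q1, q1, a, p)
    return q1
```

Each iteration calls `ec_add` once for an addition and once for a doubling, whichever bit is processed. The two calls are independent of each other, which is what allows them to run on separate cores.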
2.3 Cryptographic Pairings
The pioneering work in the field of pairing based encryption was proposed by Boneh
and Franklin [143]. The identity based encryption (IBE) scheme proposed in [143] uses
pairing computation as one of the major operations in the encryption as well as the decryption
procedure. The security of the scheme is based on the difficulty of solving the well known
Bilinear Diffie-Hellman problem. A survey of pairing based cryptographic schemes is
given in [112]. This section gives a brief overview of Tate pairing computation and some
of the security issues of pairing algorithms under fault attacks.
2.3.1 Tate Pairing
The name bilinear pairing indicates that it takes a pair of vectors as input and returns
a number, and that it is linear in each of its input variables. For example,
the dot product of vectors is a bilinear pairing. Similarly, for cryptographic applications
the bilinear pairing (or pairing) operations are defined on elliptic or hyperelliptic curves.
A pairing is a mapping e : G1 × G2 → G3, where G1 is a curve group defined over a finite
field Fq, G2 is another curve group over the extension field $F_{q^k}$, and G3 is a subgroup of the
multiplicative group of $F_{q^k}$. Groups G1 and G2 could also be the same group. If G1 = G2
then the mapping e is called a symmetric pairing. On the other hand, if G1 ≠ G2 then e is
called an asymmetric pairing.
Every point on an elliptic curve is one of two kinds: a point of finite order or a point
of infinite order. P is a point of finite order if there exists a smallest positive integer l
such that lP = O. If no such l exists then P is of infinite order. In other words, P being of
infinite order means that the point at infinity can never be obtained by adding P to itself, no
matter how many times this is done. This distinction between points of finite and infinite order
leads to the following definition:
Definition 2. l-torsion point: A point P ∈ E(Fq) is called a torsion point of order l, or an
l-torsion point, if P has order l.
Gathering all of the torsion points of an elliptic curve E forms a finite subgroup of
E(Fq), called $E(F_q)_{tor}$: $E(F_q)_{tor} = \{P \in E(F_q) \mid P \text{ has finite order}\} \subseteq E(F_q)$.
Let a large odd prime l divide the order of the curve group (#E(Fq)), and let the point
P be an l-torsion point. Here, k is the corresponding embedding degree, often referred to
as the security multiplier in pairing computation. It is the smallest positive integer such that l
divides $q^k - 1$. Then the Tate pairing of order l is a map
$e_l : E(F_q)[l] \times E(F_{q^k})[l] \to F_{q^k}^{*}/(F_{q^k}^{*})^l$, (2.4)
where E(Fq)[l] denotes the subgroup of E(Fq) of all points of order dividing l, and similarly
for $F_{q^k}$. The l-Tate pairing on points $P \in E(F_q)[l]$, $Q \in E(F_{q^k})[l]$ is given by
$e_l(P,Q) = f_{l,P}(D)$. Here $f_{l,P}$ is a function on E whose divisor is equivalent to l(P) − l(O). D is a
divisor equivalent to (Q) − (O), whose support is disjoint from the support of $f_{l,P}$. The
point O represents the point at infinity. For more information regarding divisors, we refer
the reader to [37, 138]. The formulas for D and $f_{l,P}(D)$ are given in the following equations:
$D = \sum_i a_i P_i$ (2.5)
$f_{l,P}(D) = \prod_i f_{l,P}(P_i)^{a_i}$. (2.6)
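The definition of the embedding degree, the smallest positive k such that l divides q^k − 1, can be checked numerically with a short sketch. The function name `embedding_degree`, the search bound `max_k`, and the small values of q and l in the usage note are hypothetical toy choices and are not BN curve parameters.

```python
def embedding_degree(q, l, max_k=1000):
    """Smallest k >= 1 with l | q**k - 1, or None if none is found up to max_k."""
    t = 1
    for k in range(1, max_k + 1):
        t = (t * q) % l          # t = q**k mod l, computed incrementally
        if t == 1:
            return k
    return None
```

For instance, with q = 11 and l = 3 the function returns 2, since 3 does not divide 11 − 1 = 10 but divides 11² − 1 = 120.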
Cryptographic pairings satisfy the following properties:
• Non-degeneracy : For each P ≠ O there exists $Q \in E(F_{q^k})[l]$ such that $e_l(P,Q) \neq 1$.
• Bilinearity : For any integer n, $e_l([n]P,Q) = e_l(P,[n]Q) = e_l(P,Q)^n$ for all $P \in E(F_q)[l]$
and $Q \in E(F_{q^k})[l]$.
• It is efficiently computable.
The value $e_l$ is a representative of an element of the quotient group $F_{q^k}^{*}/(F_{q^k}^{*})^l$. However,
for cryptographic protocols it is essential to have a unique representative. So $e_l$ is raised
to the $((q^k-1)/l)$-th power to obtain an l-th root of unity. The resulting value is called the
reduced Tate pairing:
$E_l(P,Q) = e_l(P,Q)^{(q^k-1)/l}$. (2.7)
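The effect of the final exponentiation in (2.7) can be checked with a short derivation. Here e stands for any representative of the value of the pairing and c for an arbitrary non-zero element of the extension field; both symbols are introduced only for this sketch.

```latex
\begin{align*}
\left(e \cdot c^{\,l}\right)^{(q^k-1)/l}
  &= e^{(q^k-1)/l} \cdot c^{\,q^k-1} \\
  &= e^{(q^k-1)/l},
  \qquad \text{since } c^{\,q^k-1} = 1 \text{ for every } c \in F_{q^k}^{*}, \\
\left(e^{(q^k-1)/l}\right)^{l}
  &= e^{\,q^k-1} = 1 .
\end{align*}
```

The first two lines show that all representatives of the same coset are mapped to the same value, and the last line shows that this value is an l-th root of unity.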
Computation of the Tate pairing is performed by an ECSM based technique proposed by V.
Miller [175], which is shown in Algo. 2.5. The algorithm performs a doubling for every bit
of l, and it performs an addition only if the corresponding bit of l is 1. Finally, it
returns the l-Tate pairing. In the algorithm, l′(Q) indicates the divisor of the straight line l′
connecting two points P1 and P2 with respect to point Q. Let the line l′ intersect the curve
at a third point X. Then v′(Q) is the divisor of the vertical line v′ through X with respect to
Q [37, 47].
Algorithm 2.5: Miller’s algorithm.
Input: P, an l-torsion point ∈ E(Fq); $Q \in E(F_{q^k})$.
Output: the Tate pairing $E_l(P,Q)$.
i = ⌈log2(l)⌉, K ← P, f ← 1.
while i ≥ 1 do
    Compute equations of l′ and v′ arising in the doubling of K.
    K ← 2K and $f \leftarrow f^2 \, l'(Q)/v'(Q)$.
    if the i-th bit of l is 1 then
        Compute equations of l′ and v′ arising in the addition of K and P.
        K ← P + K and $f \leftarrow f \, l'(Q)/v'(Q)$.
    end
    i ← i − 1.
end
return $f^{(q^k-1)/l}$.
The Tate pairing can only be computed efficiently if the security multiplier k is small.
Before the work of Miyaji, Nakabayashi and Takano [145], it was assumed that k is of the size
of l for a general curve. Thus, the early curves used in pairing based cryptography
were supersingular curves, since their security multiplier satisfies k ≤ 6. Barreto et al.
[138] generalized Miller’s technique for computing pairings, which is also known
as the BKLS algorithm. A pairing computation technique on hyperelliptic curves, including
supersingular curves over $F_{3^m}$, is proposed by Duursma and Lee in [126]. The algorithm
is further improved by Kwon [111]. Recently, pairing computation in the highly efficient
Edwards coordinates [64, 65] and in Twisted Edwards coordinates [50] has been defined by
Ionica and Joux [47], and by Das and Sarkar [49], respectively. A popular and widely used
elliptic curve which provides 128-bit security is proposed by Barreto and Naehrig [76].
This pairing friendly elliptic curve is defined over a 256-bit prime field with embedding
degree k = 12. The present thesis discusses FPGA implementation of pairing based
cryptography that is resistant against side-channel attacks. Hence, we present an overview of
FPGAs and side-channel attacks, which is essential to the understanding of the work.
2.4 FPGA Architecture
The name field programmable gate array comes from its internal structure, which
consists of an array of programmable logic clusters. We use the Xilinx Virtex device family
for our applications. The Xilinx Virtex-II Pro is one such FPGA device; it is implemented
in nine layers using 130nm CMOS technology [137]. It has an island style architecture
which consists of a two-dimensional array of Configurable Logic Blocks (CLBs) and
programmable interconnect resources. An architectural overview of such an FPGA device is
shown in Fig. 2.3. Each CLB is a cluster of four identical sub-blocks called slices and two
3-state buffers. The slices are equivalent, and each contains the following circuit elements.
Figure 2.3: Structure of a Virtex FPGA, showing I/O blocks, logic blocks, and vertical and
horizontal interconnects.
• Two function generators (F & G)
• Two storage elements
• Arithmetic logic gates
• Large multiplexers
• Wide function capability
• Fast carry look-ahead chain
• Horizontal cascade chain (OR gate)
The function generators F & G are configurable as 4-input look-up tables (LUTs),
as 16-bit shift registers, or as 16-bit distributed RAM. In addition, the two storage elements
are either edge-triggered D-flip-flops or level-sensitive latches. Each CLB has internal fast
interconnect and is connected to a switch matrix to access general routing resources.
The LUTs are used to map fixed-input Boolean logic. The gates are used to implement
special functions such as fast carry chains. Exploring the benefit gained from such an
in-built carry chain for implementing adder and multiplier circuits with large operands is one
of the major objectives of this thesis.
2.5 Side-channel Attacks
Side-channels are defined to be unintended output channels of a system. A side-channel
attack (or SCA) exploits information leaked unintentionally through the side-channels of a
cryptographic device during the execution of specific operations. It is typically a passive
and non-invasive physical attack [101]. It is passive because it observes some physical
characteristic of the device without interfering with the execution; that is, the target device
behaves exactly as if no attack occurs. It is non-invasive because it exploits
only externally available and unintentionally leaked information; it does not de-package
or tamper with the target device.
Various side-channels that can be used to mount an attack on a cryptosystem are
shown in Fig. 2.4. Side-channel information can be gathered easily in practice,
and therefore it is essential that the threat of SCA be quantified when assessing the overall
security of a cryptosystem.
Figure 2.4: Traditional side-channels (power consumption, heat, computation time, sound,
visible light, frequency, faulty output, electromagnetic radiation, error message) alongside the
I/O channels (plain text, key, cipher text).
2.5.1 Timing Attacks
Usually the running time of a program is merely considered as a constraint, a
parameter that must be reduced as much as possible by the programmer. More surprising
is the fact that the running time of a cryptographic device can also constitute an
information channel, providing the attacker with invaluable information on the secret parameters
involved. This is the idea of the timing attack.
The time taken to compute a cryptographic operation was the first side-channel
utilized for attacks. It was brought to the attention of the cryptographic community when Paul
Kocher [171] introduced a technique for exploiting timing variation in an attack in 1996.
Timing variation often occurs because of data-dependent operations or unbalanced
execution. For example, in the binary double-and-add (left-to-right) multiplication algorithm,
Algo. 2.1, the addition takes place only if the corresponding bit of the multiplier is
one. In order to exploit timing information, precise timing measurements need to be made.
The first experimental result on a timing attack against an actual smartcard implementation
of RSA was shown in [168].
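The data dependence mentioned for Algo. 2.1 can be made visible by counting the conditional additions. The sketch below is a simplified illustration: the function name `int_mult_counting` is hypothetical, and the number of additions is used as a stand-in for running time, which is an assumption about the implementation rather than a measurement.

```python
def int_mult_counting(a, b, p):
    """Interleaved multiplication that also reports how many additions were executed."""
    s, additions = 0, 0
    for i in range(p.bit_length() - 1, -1, -1):
        s = (2 * s) % p              # executed in every iteration
        if (b >> i) & 1:
            s = (s + a) % p          # executed only when the multiplier bit is 1
            additions += 1
    return s, additions
```

The addition count equals the Hamming weight of the multiplier b, so an attacker who can time the operation precisely learns how many bits of b are set.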
2.5.2 Power Consumption Attacks
As described in [163], integrated circuits are built out of individual transistors,
which act as voltage-controlled switches. In a transistor, current flows across the
transistor substrate when charge is applied to (or removed from) the gate. This current then
delivers charge to the gates of other transistors, interconnect wires, and other circuit loads.
The motion of electric charge consumes power and produces electromagnetic radiation,
both of which are readily detectable.
Nowadays, almost all smartcards and other mobile processors are implemented as
integrated circuits (IC) in CMOS technology. From these devices, normally two types of power
consumption leakage can be observed: the transition count leakage, which is related to the
number of bits that change their state at a time, and the Hamming weight leakage, which is
related to the number of 1’s being processed at a time. The internal current flow of a
cryptoprocessor can be observed from outside by measuring the current drawn from the power
supply. Power analysis attacks are applicable only to hardware implementations
of cryptosystems. They are particularly effective and have proven successful
in attacking smartcards and other dedicated embedded systems storing the secret key.
Among all types of SCA known today, power analysis attacks and the relevant
countermeasures have the largest body of literature. Power analysis attacks have
been demonstrated to be very powerful against most straightforward implementations
of symmetric and public key ciphers. Basically, power analysis attacks are divided into
Simple and Differential Power Analysis (referred to as SPA and DPA, respectively).
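The two leakage models named above can be written down as small functions on register values. This is a minimal sketch; the function names `hamming_weight` and `transition_count` are hypothetical, and real devices leak a noisy, scaled version of these quantities.

```python
def hamming_weight(value):
    """Hamming weight leakage: number of 1 bits being processed."""
    return bin(value).count("1")


def transition_count(previous, new):
    """Transition count leakage: number of bits that change state between two values."""
    return hamming_weight(previous ^ new)
```

For example, a register changing from 0b1011 to 0b0110 toggles three bits, so the transition count model predicts a leakage of 3 for that update.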
2.5.2.1 Simple Power Analysis (SPA) Attacks
Simple power analysis (SPA) is generally based on visually inspecting the
power consumption of a device while an encryption operation is being performed.
It is a technique that involves direct interpretation of power consumption measurements
collected during cryptographic operations. SPA can yield information about a device’s
operation as well as key material.
In an SPA attack, the attacker directly observes the power consumption of a device. The
amount of power consumed varies depending on the instructions being executed, and the
instructions must be distinguishable by their power traces. In addition, the attacked instruction
needs to have a relatively simple or direct relationship with the secret key. For example, SPA can
be used to break RSA implementations by revealing differences between multiplication and
squaring operations. Similarly, many DES implementations have visible differences within
permutations and shifts, and can be broken using SPA [163]. However, it is not difficult to
design a system that is not vulnerable to SPA attacks.
2.5.2.2 Differential Power Analysis (DPA) Attacks
When simple power analysis is not feasible, differential power analysis (or DPA) can
be tried. DPA uses many measurements. It exploits the relationship between the
processed data and the power consumption, whereas SPA exploits the relationship between
the power consumption and the executed operations. While the latter may not be successful,
the former has a better chance of success.
In a DPA attack, the attacker records the power consumption of several runs of a
cryptographic algorithm implemented on an electronic device. In general, every run is performed
on a random plain text with a fixed secret key. The DPA attack relies on the fact that
the power consumption of a device varies when it performs the same operation on different data.
This difference in power consumption is very small and is not visible in direct plots of the traces.
However, it can be measured and exploited by sophisticated offline analysis.
To get a clear idea about the DPA attack, let us demonstrate it on binary field addition,
as described in [40]. Let f(x) be an irreducible polynomial of degree m over F2. We
assume that an element $a = a_{m-1}x^{m-1} + a_{m-2}x^{m-2} + \cdots + a_1 x + a_0$ of $F_{2^m} \cong F_2[x]/(f(x))$
is represented in the polynomial basis with $a_i \in F_2$.
The addition a(x) + b(x) is performed as:
$(a_{m-1} \oplus b_{m-1})x^{m-1} + (a_{m-2} \oplus b_{m-2})x^{m-2} + \cdots + (a_1 \oplus b_1)x + (a_0 \oplus b_0)$.
Let a(x) be a secret key which is added to b(x), which is publicly known and may even be
chosen by the attacker. The attacker chooses random values of b(x) and collects the power
consumption of each execution.
Let W be the power consumption associated with the addition operation a(x) + b(x).
Suppose the adversary chooses thousands of random b(x) and collects the corresponding W. To
recover the i-th bit of a(x), we guess that $a_i = 0$ and divide the power consumptions into two
sets according to $b_i$. The two power consumption sets S0 and S1 are formed in the following
way:
$S_k = \{W \mid b_i = k\}$ with k ∈ {0,1}.
Thus, the differential power consumption is
$\Delta = \langle S_1 \rangle - \langle S_0 \rangle$,
where ⟨·⟩ denotes the average over a set. Now if the guess is correct, then ∆ will be positive,
as the Hamming weight of the outputs in S1 is on average one more than that in S0. Otherwise,
∆ will be negative. Thus, by repeating the above DPA technique for every bit position, the
attacker can obtain all bit values of a(x).
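The attack on binary field addition can be reproduced in a small simulation. The sketch below does not use measured traces: it assumes a hypothetical leakage model in which the power sample W equals the Hamming weight of a(x) + b(x) plus Gaussian noise. The function names, the field size m = 16, the number of traces, and the noise level are all illustrative assumptions.

```python
import random


def simulate_trace(secret, b, noise, rng):
    """Assumed leakage model: Hamming weight of (secret XOR b) plus Gaussian noise."""
    return bin(secret ^ b).count("1") + rng.gauss(0.0, noise)


def dpa_recover(secret, m=16, traces=4000, noise=1.0, seed=1):
    """Recover an m-bit secret from simulated power samples of secret XOR b."""
    rng = random.Random(seed)
    inputs = [rng.getrandbits(m) for _ in range(traces)]          # attacker-chosen b(x)
    power = [simulate_trace(secret, b, noise, rng) for b in inputs]
    recovered = 0
    for i in range(m):
        s1 = [w for b, w in zip(inputs, power) if (b >> i) & 1]   # set S1: b_i = 1
        s0 = [w for b, w in zip(inputs, power) if not (b >> i) & 1]  # set S0: b_i = 0
        delta = sum(s1) / len(s1) - sum(s0) / len(s0)             # difference of means
        if delta < 0:              # the guess a_i = 0 was wrong, so a_i = 1
            recovered |= 1 << i
    return recovered
```

Under this model, ∆ is close to +1 when the targeted key bit is 0 and close to −1 when it is 1, because the contributions of all other bit positions average out over many random inputs.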
2.6 Fault Attacks
The fault attack is another powerful technique to break a cryptosystem [154]. The technique
has been applied to both symmetric ciphers [13, 48, 63, 139] and asymmetric
ciphers [161] by several researchers. In this attack, a fault is injected during the
computation of a cryptographic algorithm on a cryptoprocessor, and the faulty output is exploited to
deduce the secret key. The faults of a device can be characterized by several aspects,
which are as follows.
• Permanent Fault: It damages a cryptographic device in a permanent way. The
device will behave incorrectly in all future computations. Such damage includes
freezing a memory cell to a constant value, cutting a data bus, sticking a logic output at
the VCC or GND line, etc.
• Transient Fault: As opposed to a permanent fault, with a transient fault the device
is disturbed during its processing, so that it will only produce fault(s) during
that specific computation. Examples of such disturbances are radioactive bombardment,
abnormally high or low clock frequency, abnormal voltage in the power supply, etc.
• Error Location: Some attacks require the fault to be induced in a specific location, such
as a specific memory cell, a specific bit of a register, etc.
• Time of Occurrence: Some attacks require the ability to induce the fault at a specific
time during the computation, for example at a particular round output
of the DES or AES algorithm.
• Error Type: Many types of error may be considered, for example flipping the value
of some bits, freezing a memory cell to 0 or 1, preventing a jump from being executed,
disabling the instruction decoder, flipping bits in memory in only one direction (e.g., a bit can be
flipped from 1 to 0, but not the opposite), etc.
The fault model is of great importance to the feasibility of an attack. The works
on fault attacks can be categorized into two groups. The first group deals with ways to
induce a given type of fault in a cryptoprocessor. The second assumes a fault model and
deals with the way this model can be exploited to break a cryptosystem; it is not
concerned with how such faults are induced in practice. These two groups are of course
complementary in determining the potential weaknesses induced by a fault induction method.
2.6.1 Fault Induction Technique
Fault induction takes place by tuning the channels that affect the device's environment
and putting it in abnormal conditions [101]. Many channels are available to the
attacker. Some of them are as follows:
• Power: An inappropriate power supply affects the behavior of a device. For example, a
smartcard, as per ISO standards, must be able to tolerate a supply voltage between 4.5 V
and 5.5 V. Within this range the smartcard must work properly. However, a
deviation of the power supply well beyond the specified tolerance might affect
its functionality. It can indeed lead to a wrong computation result, provided that the device is
able to complete the current computation.
• Clock Frequency: An abnormally high or low frequency may induce errors in pro-
cessing. A fine tuning of clock frequency or a clock glitch at proper time can com-
pletely change the execution of a processor. It may even omit an instruction from the
execution sequence.
• Temperature: The device can be operated in extreme temperature conditions to induce
faults, although this is not a good choice for mounting fault attacks in practice.
• Radiation: Correctly focused radiation can disturb the behavior of a cryptoprocessor.
In practice, the attacker may put a device such as a smartcard into a microwave
oven to make it perform erroneous computations.
• Light: The illumination of a transistor causes it to conduct. Thereby, it may induce a
transient fault. By applying an intense light source, it is possible to change individual
bit values in an SRAM [139]. The same technique could also interfere with jump
instructions, causing conditional branches to be taken wrongly.
• Eddy current: Eddy currents induced by the magnetic field produced by an alter-
nating current in a coil could generate various errors inside a chip. It could induce a
fault in RAM, EPROM, EEPROM, and Flash memory cells. For example, it could
change the value of a pin code in a mobile phone card [130].
Fault-based cryptanalysis is an interesting area of research. We refer the reader
to [13] and [101] for fault attack techniques on AES and RSA, respectively.
2.7 Terminologies
In the subsequent chapters, the following terms will be encountered, and they are therefore
described below.
1. Finite field: This is a field that contains only finitely many elements. A finite field is
also known as a Galois field (in honor of Évariste Galois). It is an abstract-algebra
construct; a finite field with $p^n$ elements is denoted by
$\mathbb{F}_{p^n}$ or $GF(p^n)$, where $p$ is a prime number called the characteristic of the field, and $n$
is a positive integer.
2. Elliptic curve: This is a smooth, projective algebraic curve of genus one, on which
there is a specified point O. An elliptic curve is in fact an abelian variety, that is, it
has a multiplication defined algebraically with respect to which it is a (necessarily
commutative) group and O serves as the identity element.
3. Elliptic curve group operations: These are the operations with which an elliptic curve
group is formed. An elliptic curve group is an additive group. The addition operations
are defined on such a group based on the underlying elliptic curve equation. In
general, there are two operations, namely point addition (adds two different points,
e.g., P+Q) and point doubling (adds a point to itself, e.g., P+P).
4. Elliptic curve scalar multiplication (ECSM): This operation multiplies a point on
an elliptic curve by an integer (scalar). It is also known as elliptic
curve point multiplication, or sometimes elliptic curve exponentiation. This
operation is a one-way function: the forward computation of Q = dP, given a point P
and an integer d, is easy, but the reverse computation of the integer d such that
Q = dP, given Q and P, is hard.
5. Fq-primitives: These are the units on which the respective Fq arithmetic operations
can be performed.
6. Programmable unit: This is a hardware unit which provides inherent programmability.
For example, a programmable Fp multiplier is a multiplier unit which supports
all primes less than a given length. A programmable unit does not require the FPGA to be
reconfigured when the parameters change.
7. Dual core: Two identical arithmetic units (cores) which can compute in parallel.
The cores can use the same or sometimes different memory blocks for input and output.
The main objective of utilizing a dual core in a processor is to improve parallelism
and to reduce computation time.
8. Pairing: This is a mathematical construct in which elements are processed pairwise
to generate a single element. In cryptography, the pairing is performed
on a pair of points on elliptic or hyperelliptic Jacobian curves, and it generates
an element of a finite field.
2.8 Conclusion
This chapter has described the mathematical background and preliminaries that are essential
for understanding the work described in the following chapters. It has given a brief
overview of techniques and algorithms for performing arithmetic operations on finite fields
with large prime characteristic. In the next chapter, we give a literature survey of the works
related to the contributions of this thesis.
Chapter 3
Survey of Related Work
THIS CHAPTER DISCUSSES some of the previously published works, which, either
directly or indirectly, relate to the contributions of this thesis. Besides, it tries to pro-
vide the reader with a basic understanding of the state-of-the-art in research in this domain.
We start with the investigation of the existing works in the area of prime field elliptic curve
scalar multiplication (ECSM) on hardware platforms. Next, we will delve into the analysis
of the various reported techniques on side-channel attacks and corresponding countermea-
sures on ECSM. Finally, we will probe into the state of affairs of pairing computation
techniques, respective hardware and software implementations, and their security against
fault and side-channel attacks.
3.1 Hardware Implementation of ECSM on Prime Fields
The efficient implementation of ECSM on prime fields is achieved by applying optimizations
at different hierarchical stages. Normally, the research in this direction follows
one of two levels of optimization:
1. Field-stage optimization. It chooses a prime field characteristic with low Hamming
weight, which provides faster multiplication and inversion techniques, mostly in
the reduction stage. Specialized number systems are also used to perform
prime field operations more efficiently, such as the Montgomery number system [176] and
the residue number system (RNS) [117].
2. Coordinates and scalar multiplication-stage optimizations. Research tries to reduce
the number of field inversions (projective coordinates), reduce the number of point
additions (windowing), and replace point doublings (endomorphism methods).
However, the efficiency of an implementation also depends on the underlying platform. For
example, an architecture implemented on a customized CMOS library is much
faster than the same architecture on an FPGA platform. This is because of the inherently
different properties of the platforms. Therefore, the choice of platform also plays an important
role in implementing ECSM. Although an FPGA is slower, it is much cheaper than the fabrication
of a customized CMOS design. An FPGA is also reconfigurable, i.e., a design can be erased
and the same FPGA reused for another design. Considering these facts, the FPGA is accepted as
a good choice for implementing embedded systems, including cryptographic
applications. The very fact that the entire design takes place in-house also raises the level
of trust in FPGA-based cryptographic designs.
Apart from the platform, different levels of parallelism and pipelining could be adopted
to design an efficient architecture for the ECSM operation on prime fields. ECSM is computed
hierarchically, as shown in Fig. 2.1 of the previous chapter. Active research is ongoing to
optimize an ECSM architecture at each stage of its computation hierarchy. These optimizations
could be focused either on a specific platform or on general architectures. The works
described in [56] and [58] proposed parallelism techniques for implementing a prime field
multiplier efficiently. The work of [56] proposed a pipelined GF(p) multiplier with multiple
processing elements having data-width $< \lceil \log_2 p \rceil$. Thus, it exploits both pipelining
and parallelism for implementing an efficient GF(p) multiplier, whereas the work of [58]
proposes a parallel Montgomery reduction multiplier.
Many hardware implementations have been documented for computing the elliptic
curve scalar multiplication. A good survey of this area is given in [52]. ECC hardware
is broadly designed over $GF(2^m)$ and $GF(p)$. Efficient implementations of $GF(2^m)$
arithmetic units are reported in [32], which are effectively embedded into ECSM hardware.
Some of the very good ECSM hardware designs for $GF(2^m)$ are [16, 26, 30, 31, 41, 60, 68, 69,
71–73, 89, 103, 119].
An efficient implementation of GF(p) ECC on a general purpose processor was proposed
in [88]. A customized hardware implementation of ECSM for curves defined over GF(p)
was introduced in [142]. Orlando et al. in [142] proposed a reduced instruction set GF(p)
ECC processor for a fixed prime $p = 2^{192} - 2^{64} - 1$. The GF(p) ALU proposed in [106]
combines different arithmetic units into a common unit for an ECC processor. Thereafter, many
hardware implementations were proposed, and some of the good results were shown
in [4, 14–16, 31, 33, 45, 55, 58, 70, 74, 87, 107, 119, 120]. Most of them use Montgomery
numbers to perform modular arithmetic. The conversion of a binary number $A$ to
its Montgomery domain representation $\bar{A}$ and the reverse operation are expressed as:
$\bar{A} = \mathrm{MonPro}(A, 2^{2m} \bmod M, M)$
$A = \mathrm{MonPro}(\bar{A}, 1, M)$
where MonPro(a,b,M) represents the Montgomery product algorithm [176] used to compute
the modular multiplication $a \cdot b \pmod{M}$. Some other implementations, such as [4, 15], use RNS
numbers for computing the underlying arithmetic in the ECSM operation. The conversions
between binary and the corresponding specific representations incur additional costs. For
example, MonPro() takes $\log_2 M$ clock cycles in the VLSI implementation reported
in [106] to convert a $\log_2 M$-bit binary number to its equivalent Montgomery representation.
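The two conversions can be illustrated with a small software model. The sketch below assumes the usual definition of the Montgomery product, MonPro(a, b, M) = a·b·2^(−m) mod M for an odd modulus M < 2^m; the function names and the extra parameter m are illustrative assumptions, and the code models only the arithmetic, not the bit-serial hardware of [106].

```python
def mon_pro(a: int, b: int, M: int, m: int) -> int:
    """Montgomery product a * b * 2^(-m) mod M, for odd M with M < 2^m."""
    R = 1 << m
    M_neg_inv = (-pow(M, -1, R)) % R   # -M^(-1) mod R (Python 3.8+)
    t = a * b
    q = (t * M_neg_inv) % R            # chosen so that t + q*M is divisible by R
    u = (t + q * M) >> m               # exact division by R
    return u - M if u >= M else u      # single conditional subtraction

def to_montgomery(A: int, M: int, m: int) -> int:
    """A_bar = MonPro(A, 2^(2m) mod M, M), i.e. A * 2^m mod M."""
    return mon_pro(A, pow(2, 2 * m, M), M, m)

def from_montgomery(A_bar: int, M: int, m: int) -> int:
    """A = MonPro(A_bar, 1, M)."""
    return mon_pro(A_bar, 1, M, m)

M = 2**192 - 2**64 - 1                 # the fixed prime mentioned in the text
m = 192
x, y = 123456789, 987654321            # arbitrary operands smaller than M
x_bar, y_bar = to_montgomery(x, M, m), to_montgomery(y, M, m)
product = from_montgomery(mon_pro(x_bar, y_bar, M, m), M, m)
```

Here `product` equals x·y mod M: the two conversions are paid once, and every intermediate multiplication stays in the Montgomery domain.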
Among the existing works, the designs in [14] and [15] support only NIST primes [166].
The other designs support any general prime as the field characteristic. The work in [87] showed
a timing-and-area tradeoff for implementing a GF(p) ECC processor. Sakiyama et al. in [70]
accelerated the ECSM operation using parallel modular arithmetic logic units. The work
described in [58] proposes two different levels of parallelism for designing an efficient ECSM
architecture. It defines the parallelism applied within the computation of a single GF(p)
operation as horizontal parallelism, and the parallel computation of finite field operations
that have no data dependency as vertical parallelism. In the same paper, an embedded
multicore system for elliptic curve cryptography was proposed. Using sixteen 18-bit multiplier
cores, the system achieves both horizontal and vertical parallelism to develop a very
long instruction word (VLIW) processor for computing scalar multiplication on GF(p) elliptic
curves. A parallel and scalable processor for performing ECSM in both prime and
binary fields was proposed in [16]. In [4], a prime field ECSM processor was proposed in which
the underlying operations are performed in RNS.
The side-channel attack [163, 171] is one of the major threats in developing cryptographic
hardware. This cryptanalytic technique exploits information leaked (through side channels)
by the device while it executes a cryptographic algorithm. The most popular side
channels are power and time [57]. Among the aforementioned related designs, [4] and [45]
attempted to implement GF(p) ECC hardware that is secure against side-channel attacks.
The design proposed in [45] provides security against simple power analysis (SPA)
and timing attacks only. However, the doubling attack [81, 118] and the differential power
analysis (DPA) attack [162] are more powerful attacks, which can also work on SPA-resistant
designs. Hence, ECSM hardware that is efficient as well as resistant to SPA, DPA, doubling,
and timing attacks is in demand, and is the aim of this work. Very recently, in
2010, the ECSM cryptoprocessor of [4] addressed SPA and DPA attacks. However, it did not
consider the more powerful doubling attack. Let us now survey the literature on
the existing techniques for protecting the ECSM operation against side-channel attacks.
3.2 The ECSM Against Side-channel Attacks
This section describes the vulnerability of the ECSM operation to side-channel attacks.
The ECSM operation is defined as Q = dP, where the scalar multiplier d is used
as a secret key. A number of works have been reported on protecting the ECSM operation against
side-channel attacks.
The computation formulæ for elliptic curve addition (ECA) and doubling (ECD) are
different (see Section 2.2.1). The distinction between these two operations may leak
information that reveals the bit values of the secret d. Simple side-channel attacks (i.e., simple
power analysis, or SPA) may be applied to exploit this distinction. However,
there are well-defined countermeasures for preventing this vulnerability. The SPA-resistant
techniques are as follows:
1. Unifying the addition formulæ [115, 132] or considering alternative parameteriza-
tions [127, 151, 152].
2. Inserting dummy operations [45, 114, 162].
3. Using algorithms that are already regular and do not leak by definition [132,140,141,
153, 156, 167].
Though the above techniques are sufficient to resist SPA attacks, they are vulnerable
to more sophisticated differential side-channel attacks (i.e., differential power analysis, or
DPA) [162, 163]. In order to thwart differential side-channel analysis, the inputs
of the ECSM, namely the base point P and the scalar multiplier d, should be randomized [83,
150, 162]. Some combined methods can also prevent differential side-channel attacks on
ECSM operations [59, 149]. Some of the techniques to protect an ECSM operation against
differential and simple side-channel attacks are briefly described in the following sections.
3.2.1 Indistinguishable Point Add and Point Double
Different methods have been reported to make the point addition and point doubling formulæ
indistinguishable in different coordinates. The works reported in [115, 132] proposed
unified addition formulæ for $GF(2^m)$ elliptic curves in affine and projective coordinate
systems. It is observed that every elliptic curve is isomorphic to a Weierstraß form [132]
and that parameterizations other than the Weierstraß form may lead to faster unified point
addition formulæ [151, 152]. These parameterizations are mainly based on either the Hessian
form [148] or the Jacobi form [127]. It is observed that with these parameterizations the
computation of point addition also becomes cheaper. Table 3.1 gives a comparison, based
on computation cost, of these different parameterizations with the original Weierstraß
form. In the table, the symbols M and C stand for finite field multiplication and multiplication
by a constant, respectively.
It may be noted that the elliptic curves recommended in standards such as IEEE 1363,
FIPS 186-2, and SECG have group order #E(F) = h · q, where q is a prime and the cofactor h ≤ 4.
The point addition and doubling formulæ need not be strictly equivalent to prevent side-channel
analysis; this can also be achieved by inserting some dummy operations, which is helpful
when two (distinct) elliptic curve group operations are similar. This technique has
been adopted for elliptic curves defined over $GF(2^m)$ in [114]. The technique proposed
in [114] assumes that the loading/storing of random values from different registers
is indistinguishable, but in practice this may not be true. A possible solution based on
random register renaming is presented in [147]. The work reported in [45] proposed an
indistinguishable computation technique for point addition and point doubling on GF(p) elliptic
curves in affine coordinates.

Table 3.1: Point addition for elliptic curve over GF(p).
Parameterization               Cost                         Cofactor
Weierstraß form [132]          17M + 1C (general case)      –
(with unified formulæ)         16M + 1C (a4 = -1)
Hessian form [151]             12M                          h ∝ 3
Extended Jacobi form [127]     13M + 3C (general case)      h ∝ 2
                               13M + 1C (ε = 1)             h ∝ 4
∩ of 2 quadrics [152]          16M + 1C                     h ∝ 4
3.2.2 Regular Point Multiplication Algorithms
Elliptic curve group operation (ECA and ECD) formulæ may have different side-channel
traces, provided they do not leak any information about the secret scalar multiplier d when
evaluating Q = dP. For binary algorithms, this implies that the processing of the 0 bits and
the 1 bits of the multiplier d must be indistinguishable. There are algorithms which process both
0 and 1 bit values of the multiplier d in an atomic way. This can be achieved in the following
ways.
• Classical Algorithms. This trick tries to remove the conditional branching
in double-and-add based algorithms [155]. It consists of performing a dummy point addition
when the multiplier bit $d_i$ is zero. As a result, each iteration executes a point doubling
followed by a point addition [162]. One such algorithm for a special form of
elliptic curves was developed by Montgomery in [176]. This algorithm is particularly
suited to elliptic curves defined over $GF(2^m)$ in a normal basis representation
performed on López and Dahab coordinates [167].
• Atomic Algorithms. This is a generalization of the double-and-add-always technique,
proposed by Chevallier et al. [114], resulting in the concept of side-channel
atomicity.
There are numerous variants of the above regular point multiplication techniques based on
efficient software implementations on different processor architectures [155].
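The difference between a plain double-and-add loop and its regular variants can be sketched on a toy curve. The code below is only an illustration of the sequence of group operations: the curve $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$, the point (5, 1), and all function names are assumptions chosen for the example, and Python code of this kind is not itself constant-time.

```python
P_MOD, A = 17, 2          # toy curve y^2 = x^3 + 2x + 2 over F_17 (illustrative)
G = (5, 1)                # a point on the toy curve

def ec_add(P, Q):
    """Affine point addition; None stands for the point at infinity O."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                                   # P + (-P) = O
    if P == Q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def ec_mul(d, P):
    """Plain double-and-add: the addition happens only for 1 bits (SPA-leaky)."""
    R = None
    for bit in bin(d)[2:]:
        R = ec_add(R, R)
        if bit == "1":
            R = ec_add(R, P)
    return R

def double_and_add_always(d, P):
    """Every iteration does one doubling and one addition; the sum is a dummy for 0 bits."""
    R = [None, None]
    for bit in bin(d)[2:]:
        R[0] = ec_add(R[0], R[0])
        R[1] = ec_add(R[0], P)
        R[0] = R[int(bit)]             # keep the sum only when the bit is 1
    return R[0]

def montgomery_ladder(d, P):
    """Ladder with invariant R1 - R0 = P: one addition and one doubling per bit."""
    R0, R1 = None, P
    for bit in bin(d)[2:]:
        if bit == "1":
            R0, R1 = ec_add(R0, R1), ec_add(R1, R1)
        else:
            R0, R1 = ec_add(R0, R0), ec_add(R0, R1)
    return R0
```

All three functions return the same point dP; they differ only in whether the pattern of doublings and additions depends on the bits of d.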
3.2.3 Base Point Randomization Techniques
In general, base point randomization techniques try to develop some strategies on the
mathematical structure of the curves, which lead to efficient and simple ways for protecting
the secret multiplier d at the time of Q = dP computation against differential side-channel
analysis. The reported techniques are:
3.2.3.1 Point Blinding
The method is analogous to Chaum's blind signature scheme for RSA [177]. The base
point P is blinded by adding a secret random point R for which the value of S = dR is
known. The ECSM operation Q = dP is performed by computing d(P+R) and subtracting
S to get Q. In the approach proposed in [83, 162, 171], the points R and
S = dR are stored inside the device and refreshed at each new execution of ECSM by computing
R ← rR and S ← rS, where r is a (small) random integer generated at each new execution.
As R is secret, the representation of the point P* = P+R is unknown during the computation of
Q* = dP*, which ensures security against differential side-channel attacks. However, a
difficulty of this technique is the overhead due to the computation and storage of R and S
in the cryptoprocessor. Moreover, their initial values must be secret, which increases
the key size.
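The blinding and refresh steps can be sketched as follows. This is a minimal illustration on a toy curve, not the implementation of [83, 162, 171]: the curve $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$, the class name `PointBlinder`, and the way R is initialized are all assumptions made for the example.

```python
import random

P_MOD, A = 17, 2          # toy curve y^2 = x^3 + 2x + 2 over F_17 (illustrative)
G = (5, 1)

def ec_add(P, Q):
    """Affine point addition; None stands for the point at infinity O."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def ec_neg(P):
    return None if P is None else (P[0], (-P[1]) % P_MOD)

def ec_mul(d, P):
    R = None
    for bit in bin(d)[2:]:
        R = ec_add(R, R)
        if bit == "1":
            R = ec_add(R, P)
    return R

def point_order(P):
    """Brute-force order of P (fine for a toy curve only)."""
    n, T = 1, P
    while T is not None:
        T, n = ec_add(T, P), n + 1
    return n

class PointBlinder:
    """Keeps a secret pair (R, S = dR) and refreshes it after every use."""
    def __init__(self, d, n, rng):
        self.d, self.n, self.rng = d, n, rng
        self.R = ec_mul(rng.randrange(1, n), G)    # assumed initialization of R
        self.S = ec_mul(d, self.R)

    def scalar_mul(self, P):
        Q = ec_add(ec_mul(self.d, ec_add(P, self.R)), ec_neg(self.S))  # d(P+R) - S
        r = self.rng.randrange(1, self.n)          # (small) random refresh factor
        self.R, self.S = ec_mul(r, self.R), ec_mul(r, self.S)
        return Q

n = point_order(G)
blinder = PointBlinder(d=7, n=n, rng=random.Random(1))
results = [blinder.scalar_mul(G) for _ in range(10)]
```

Every entry of `results` equals 7G even though the point actually multiplied, P + R, changes from call to call.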
3.2.3.2 Randomized Projective Representation
In projective coordinates the randomization of a point can be done in a very simple
manner, because points are not uniquely represented in projective coordinates. For example,
in Jacobian coordinates, the triplets $(\theta^2 X_P : \theta^3 Y_P : \theta Z_P)_J$ with any $\theta \neq 0$
represent the same point; and in homogeneous coordinates, the triplets
$(\theta X_P : \theta Y_P : \theta Z_P)$ with any $\theta \neq 0$ represent the same point. Some other
projective representations are described in [170].
In the projective coordinates representation, for each new execution of the point multiplication
Q = dP, the projective input point P is randomized with a random non-zero value
$\theta$ [162]. Therefore, an attacker is no longer able to predict any specific bit in the binary
representation of P, which is what is used to mount the DPA attack [162].
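The Jacobian case can be checked numerically. The sketch below is an illustration under assumed parameters (the toy curve $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$ and hypothetical function names); it randomizes a point with $\theta$ and confirms that a doubling on the randomized triplet yields the same affine result. The doubling uses the standard Jacobian formulas for a short Weierstraß curve.

```python
import random

P_MOD, A = 17, 2                     # toy curve y^2 = x^3 + 2x + 2 over F_17
P_JAC = (5, 1, 1)                    # affine point (5, 1) lifted with Z = 1

def randomize(point, theta):
    """(X : Y : Z) -> (theta^2 X : theta^3 Y : theta Z), theta != 0."""
    X, Y, Z = point
    return (theta**2 * X % P_MOD, theta**3 * Y % P_MOD, theta * Z % P_MOD)

def to_affine(point):
    """(X : Y : Z) -> (X / Z^2, Y / Z^3)."""
    X, Y, Z = point
    z_inv = pow(Z, -1, P_MOD)
    return (X * z_inv**2 % P_MOD, Y * z_inv**3 % P_MOD)

def jacobian_double(point):
    """Point doubling in Jacobian coordinates (Y != 0 assumed)."""
    X, Y, Z = point
    S = 4 * X * Y * Y % P_MOD
    M = (3 * X * X + A * pow(Z, 4, P_MOD)) % P_MOD
    X3 = (M * M - 2 * S) % P_MOD
    Y3 = (M * (S - X3) - 8 * pow(Y, 4, P_MOD)) % P_MOD
    Z3 = 2 * Y * Z % P_MOD
    return (X3, Y3, Z3)

theta = random.randrange(1, P_MOD)   # fresh non-zero theta for each execution
P_rand = randomize(P_JAC, theta)
doubled = to_affine(jacobian_double(P_rand))
```

The intermediate coordinates of `P_rand` and of its double depend on $\theta$, while `doubled` is the same affine point for every choice of $\theta$.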
3.2.3.3 Randomized Elliptic Curve Isomorphisms
A point $P = (x, y)$ on an elliptic curve $E$ can be randomized as $P^* = \phi(P)$ on $E^* = \phi(E)$,
for a random curve isomorphism $\phi$. Then $Q = dP$ can be computed as
$Q = \phi^{-1}(dP^*)$ [150].
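As an illustration, the sketch below uses the standard isomorphism of short Weierstraß curves, $\phi : (x, y) \mapsto (u^2 x, u^3 y)$, which maps $y^2 = x^3 + ax + b$ to $y^2 = x^3 + u^4 a x + u^6 b$. The choice of this particular $\phi$, the toy curve over $\mathbb{F}_{17}$, and the function names are assumptions made for the example; the source does not fix a specific form.

```python
import random

P_MOD = 17
A, B = 2, 2               # toy curve E: y^2 = x^3 + 2x + 2 over F_17
G = (5, 1)

def ec_add(P, Q, a):
    """Affine addition on y^2 = x^3 + a x + b; None is the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, P_MOD)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def ec_mul(d, P, a):
    R = None
    for bit in bin(d)[2:]:
        R = ec_add(R, R, a)
        if bit == "1":
            R = ec_add(R, P, a)
    return R

def phi(P, u):
    """Map a point of E to E*: (x, y) -> (u^2 x, u^3 y)."""
    return None if P is None else (u**2 * P[0] % P_MOD, u**3 * P[1] % P_MOD)

def phi_inv(P, u):
    """Inverse map from E* back to E."""
    return phi(P, pow(u, -1, P_MOD))

def randomized_mul(d, P, u):
    """Compute Q = phi^{-1}(d * phi(P)) on the isomorphic curve E*."""
    a_star = A * pow(u, 4, P_MOD) % P_MOD      # curve coefficient of E*
    return phi_inv(ec_mul(d, phi(P, u), a_star), u)

u = random.randrange(1, P_MOD)                 # fresh non-zero u per execution
Q = randomized_mul(7, G, u)
```

The coordinates processed during the scalar multiplication live on E* and change with u, while the final result is mapped back to the original curve.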
3.2.3.4 Randomized Field Isomorphisms
Let a point $P$ be on an elliptic curve $E$ defined over a finite field $F$. A random field
isomorphism $J : F \to F^*$ is applied to $P$ and $E$ to get the point $P^* = J(P)$ on $E^* = J(E)$. Then
the operation $Q = dP$ is evaluated as $Q = J^{-1}(dP^*)$ [150].
3.2.4 Scalar Multiplier Randomization Techniques
In this section we review another way of randomizing the computation Q = dP, namely by a
randomized representation of the scalar multiplier d. There are several strategies to randomize
the multiplier d, which include:
1. Multiplier Blinding: The secret scalar multiplier d is blinded here by
$d^* = d + r \cdot \mathrm{ord}(P)$, where ord(P) denotes the order of the point P ∈ E(F), and r is randomly
chosen. The operation Q = dP is computed as $Q = d^* P$. The randomization of the
value d can also be done by $d^* = d + r \cdot \#E$, where #E denotes the order of the elliptic
curve group. This relation holds because, by Lagrange's Theorem, the order of an
element always divides the order of its group [162, 171].
2. Multiplier Splitting: The multiplier d can also be decomposed into two or more
parts. This idea was introduced in [165] as a generalized side-channel countermeasure.
The splitting can be done additively, as $d = d_1^* + d_2^*$, where $d_1^* = r$ and $d_2^* = d - r$
for a random r [146]. The multiplicative splitting of the scalar multiplier d was introduced
in [131]. Using this technique, Q = dP is evaluated as $Q = (d r^{-1})(rP)$ for a random r
invertible modulo ord(P).
3. Frobenius Endomorphism: An endomorphism can be applied to represent the multiplier
d for some special types of curves. For example, the Frobenius endomorphism
applied to Koblitz elliptic curves [172] is reported in [150]. It should be
mentioned here that the Frobenius expansion is roughly twice the length of a balanced
binary representation of a number.
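The first two strategies can be checked on a toy curve. The following sketch is illustrative only: the curve $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$, the range of r, the reduction of d − r modulo ord(P), and the function names are assumptions, and the Frobenius-based method is not covered.

```python
import math
import random

P_MOD, A = 17, 2          # toy curve y^2 = x^3 + 2x + 2 over F_17 (illustrative)
G = (5, 1)

def ec_add(P, Q):
    """Affine point addition; None stands for the point at infinity O."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def ec_mul(d, P):
    R = None
    for bit in bin(d)[2:]:
        R = ec_add(R, R)
        if bit == "1":
            R = ec_add(R, P)
    return R

def point_order(P):
    """Brute-force order of P (fine for a toy curve only)."""
    n, T = 1, P
    while T is not None:
        T, n = ec_add(T, P), n + 1
    return n

def blinded_mul(d, P, n, rng):
    """Multiplier blinding: d* = d + r * ord(P)."""
    r = rng.randrange(1, 256)
    return ec_mul(d + r * n, P)

def additive_split_mul(d, P, n, rng):
    """Additive splitting: d1 = r, d2 = d - r (reduced mod ord(P) here)."""
    r = rng.randrange(1, n)
    return ec_add(ec_mul(r, P), ec_mul((d - r) % n, P))

def multiplicative_split_mul(d, P, n, rng):
    """Multiplicative splitting: Q = (d * r^-1 mod ord(P)) (rP)."""
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:         # r must be invertible modulo ord(P)
        r = rng.randrange(1, n)
    return ec_mul(d * pow(r, -1, n) % n, ec_mul(r, P))

rng = random.Random(3)
n = point_order(G)
d = 11
variants = [f(d, G, n, rng) for f in
            (blinded_mul, additive_split_mul, multiplicative_split_mul)]
```

All three entries of `variants` equal dG, while the scalars actually fed to the multiplication loop differ on every call.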
Regarding the scalar multiplier randomization techniques, the following can be observed.
1. Normally the value of the scalar multiplier satisfies d < ord(P). In the multiplier blinding
scheme, however, $d^* = d + r \cdot \mathrm{ord}(P) > \mathrm{ord}(P)$ for any non-zero r. So, during the
computation of $Q = d^* P$, some intermediate result may be the neutral element
of the group, i.e., the point at infinity (O). This can be exploited for side-channel analysis.
In [85, 116] a special kind of differential power analysis attack, known as the Zero-value
Register Attack (ZRA), is described, which tries to exploit the active registers of a
cryptoprocessor that hold a zero intermediate result.
2. Regarding the multiplier splitting schemes, it is not clear whether an attacker who can
find the bit values of d during Q = dP using some side-channel analysis can apply the
same analysis to find the bit values of $d_1^*$ and $d_2^*$ and thereby
compute the secret $d = d_1^* + d_2^*$.
In 2010, a modified technique was proposed in [11] to defend against a specific DPA attack,
known as address-bit DPA or ADPA [164]. A general countermeasure against ADPA on
RSA and ECC was proposed in [122]. It randomizes the addresses of the data being accessed
during the exponentiation in RSA and ECC. We refer the reader to the recent survey [12]
for further information on side-channel attacks on the ECSM
operation and the corresponding countermeasures.
3.3 Implementation of Cryptographic Pairings
Pairing based cryptography started in the beginning of this century when Joux [160] in-
troduced the application of Weil pairing to construct a three-party one-round key agreement
protocol. Subsequently, Boneh and Franklin presented the first fully functional, efficient,
and provably secure identity-based encryption scheme [143] using the properties of bilinear
pairing on elliptic curves. Cryptographic pairings require tedious computations on elliptic
curves or hyperelliptic Jacobian curves defined over large finite fields. Weil and Tate are
the two oldest pairing techniques, which came at the same time in literature [155]. The
first efficient computation of pairing for cryptographic applications was introduced in 1986
when V. Miller described the Tate pairing computation technique over finite fields [175].
The generalized and efficient algorithm based on Miller’s technique was proposed after al-
most two decades by Barreto, Kim, Lynn, and Scott in [138], which is also known as BKLS
algorithm. Several authors have found further algorithmic improvements to decrease the
complexity of Miller’s algorithm by reducing its loop length [3, 29, 77, 126].
Both the Weil and Tate pairings are based on Miller's loop. Most works have focused on
speeding up the computation of the Tate pairing because the Weil pairing is more
time-consuming [92]. The computation of the Weil pairing, $W(P,Q) = M(P,Q)/M(Q,P)$ where
$M$ stands for Miller's loop, needs two Miller steps. One Miller step is called the Miller
lite part and the other is called the full Miller part [128]. On the other hand,
the computation of the Tate pairing, $T(P,Q) = M(P,Q)^c$, requires one Miller lite part and one final
exponentiation. At lower security levels, the full Miller part is much more time-consuming
than the final exponentiation, so the Weil pairing is more time-consuming
than the Tate pairing. However, by comparing the exponentiation of the Tate pairing
with the computation of the full Miller part, one can see that a proper power of the Weil pairing
can be computed faster than the Tate pairing at high security levels [91]. It is observed
in [91] that at the 256-bit security level and above, the computation of the Weil pairing will be
faster than that of the Tate pairing.
The underlying algebraic curve also plays an important role in pairing computation.
Active research is ongoing to obtain better pairing-friendly curves, which provide
more efficient pairing computation and more security with a smaller field size.
Some of the well-known curves are the Miyaji, Nakabayashi, and Takano (MNT)
curve [145], the Freeman curve [2, 67], and the Barreto-Naehrig (BN) curve [76]. Among them,
the most popular BN curve is defined over a 256-bit prime field with embedding degree
k = 12, which provides 128-bit symmetric security. Efficient computation of arithmetic
operations over pairing-friendly tower extensions of finite fields is proposed in [18, 91].
Different varieties of pairing computations have appeared in the literature based on the
underlying algebraic curves and finite fields. Some of the most popular techniques are
the Duursma-Lee Tate pairing [126] over characteristic three fields $\mathbb{F}_{3^m}$;¹ the $\eta_T$ pairing
over both binary and characteristic three fields [54]; and the ate [77], R-ate [28], and optimal-ate [3]
pairings over large prime fields. The $\eta_T$ pairing [54] is the most efficient algorithm for
symmetric pairings ($G_1 = G_2$), which are always defined over supersingular curves. Other
pairings like ate, R-ate, and optimal-ate are known as asymmetric pairings ($G_1 \neq G_2$).
As per the National Institute of Standards and Technology (NIST) recommendation, it is
essential to choose a pairing that can achieve 128-bit security for applications beyond 2030.
Therefore, in this thesis we only study the existing software and hardware implementations
of pairings which can achieve at least 128-bit security.
3.3.1 Software Library for 128-bit-secret Pairings
Software implementation results for symmetric pairings over supersingular curves
that achieve 128-bit security are shown in [5, 24]. The software library presented in [5]
takes 3.02 million cycles to compute the $\eta_T$ pairing on a supersingular curve defined over
$\mathbb{F}_{2^{1223}}$. It is implemented on a dual quad-core Intel Xeon 45 nm system to take advantage
of parallelism on eight cores. The authors of [24] report 5.42 million cycles to compute
the $\eta_T$ pairing on a supersingular curve defined over $\mathbb{F}_{3^{509}}$ on an Intel Core i7 45 nm
processor using eight cores.
Asymmetric pairings achieving 128-bit security are mostly defined on BN curves.
The software library presented in [8] takes 4.47 million cycles to compute the optimal-ate
pairing on a 257-bit BN curve using only one core of an Intel Core 2 Quad Q6600
processor. A more efficient software implementation of asymmetric bilinear pairings at
the 128-bit security level is described in [10]. In this software library, the optimal-ate pairing
over a 254-bit BN curve is computed in just 2.33 million clock cycles on a single core of
an Intel i7 2.8 GHz processor. Some other software implementation results for asymmetric
pairings over BN curves are reported in [27, 39, 42, 61].
¹ Throughout the report we use $\mathbb{F}_{q^m}$ and $GF(q^m)$ interchangeably to denote the finite field
(Galois field) with characteristic q.
3.3.2 Hardware Design for 128-bit-secret Pairings
Hardware implementation results for pairings over BN curves were reported independently
by Kammler et al. [17] and Fan et al. [19] in 2009. Both designs are based
on 130 nm CMOS technology. The first is a complete application-specific
instruction-set processor (ASIP) augmented with special instructions for computing
pairings over BN curves. It computes an optimal-ate pairing
over general primes in 15.8 ms. The faster Fp-arithmetic for BN curves proposed in [19]
exploits the specific features of BN parameters and the respective primes. This specialized
design computes one R-ate pairing over a BN curve in only 2.9 ms. In [6], a compact hardware
design is proposed for computing the Tate pairing over 128-bit-secure supersingular curves. It
uses a characteristic-3 field with a moderately composite extension degree to achieve
128-bit security as well as efficient tower field arithmetic. On a Virtex-4 FPGA, this
accelerator computes the pairing in 2.2 ms while requiring no more than 4755 slices.
However, to the best of our knowledge, no FPGA design of a pairing
cryptoprocessor for BN curves exists. Given the popularity and reconfigurability of
FPGA devices, FPGA designs of cryptographic algorithms have considerable practical impact. The resource
constraints and lower clock frequencies of FPGA devices pose further design challenges
to the designer.
3.4 Fault and Side-channel Attacks on Pairings
Boneh et al. [154] introduced fault attacks in 1997 and showed how to recover the secret keys
of RSA and discrete-logarithm-based cryptosystems. Thereafter, a lot of research on
fault and side-channel attacks has been undertaken on different cryptosystems to recover the
secret keys. The first mention of side-channel analysis of pairings was in 2004, when Page
and Vercauteren [82, 105] described a fault attack on the Duursma-Lee algorithm [126] for
characteristic three. Their work shows that the multiplication operation in a general pairing could be
attacked using Simple Power Analysis (SPA) and a Messerges-style Differential Power
Analysis (DPA) [159]. The attack is based on a transient fault at the loop boundary of
the Miller's loop of such pairing computations. Suitable countermeasures based on the point
blinding technique [162] are also proposed in [82] for protecting the secret against the above
side-channel attacks. However, these countermeasures require additional private
parameters, which increase the complexity of the whole system, from key establishment to
pairing computation.
Thereafter, an in-depth approach to side-channel analysis of pairing implementations
was described in [84]. This work targets the Tate, ate, and ηT pairings, and it
deterministically calculates the partial output of a pairing computation based on the structural
expansion of basic finite field operations. Differential power analysis (DPA) on the ηT pairing
over F_{2^m} is described in [40]. It targets addition and multiplication operations in which the
secret and public parameters are directly involved. The attack works as follows:
1. Identify the addition and multiplication operations inside the ηT pairing computation
where one operand is public and the other is secret.
2. If such an addition is found, apply DPA to it. Recall that the
operation a + b in F_{2^m} is simply the bitwise XOR of a and b. Perform DPA on an
addition involving one secret and one chosen parameter.
3. If such a multiplication is found, mount a DPA attack on it. Multiplication
in F_{2^m} is performed by a shift-and-add procedure. A DPA attack on a binary field
multiplication involving one secret and one chosen parameter can be applied to
find the secret.
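The two field operations named in the steps above can be sketched in a few lines. This is only an illustration of binary field arithmetic, not the implementation analyzed in [40]: the small field F_{2^8}, its reduction polynomial, and the function names are assumptions chosen to keep the example short.

```python
# Illustrative field only: F_{2^8} with reduction polynomial
# x^8 + x^4 + x^3 + x + 1 (an assumption for this sketch; pairing
# fields such as F_{2^1223} are far larger).
M = 8
POLY = 0x11B

def gf2m_add(a: int, b: int) -> int:
    """Addition in F_{2^m}: coefficient-wise addition mod 2, i.e. bitwise XOR."""
    return a ^ b

def gf2m_mul(a: int, b: int, m: int = M, poly: int = POLY) -> int:
    """Shift-and-add multiplication in F_{2^m}, scanning b from its lowest bit."""
    acc = 0
    for _ in range(m):
        if b & 1:
            acc ^= a      # "add" the current multiple of a
        b >>= 1
        a <<= 1           # multiply a by x
        if a >> m:        # degree reached m: reduce by the field polynomial
            a ^= poly
    return acc
```

Every intermediate value of `acc` depends on both operands, which is why an operation with one secret and one chosen operand is the natural target named in the steps above.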
A suitable countermeasure, based on the randomized projective coordinates
technique [155], is also proposed in [40]. However, no related work on pairing computations over
prime fields has been reported in the literature.
3.5 Conclusion
This chapter has presented a survey of diverse research activities related to the design
of cryptoprocessors for elliptic curve and pairing computations. Various fault
and side-channel attacks, and the existing countermeasures for the respective operations, have
also been discussed. With this background of related work, the next chapter presents
our work on an elliptic curve cryptoprocessor.
Chapter 4
Design and Analysis of Elliptic Curve
Cryptoprocessor
THIS CHAPTER PROPOSES an elliptic curve cryptoprocessor and analyzes its security
against physical attacks. Elliptic curve operations are based on the underlying
finite field arithmetic. Therefore, this work first designs a programmable arithmetic unit
for performing addition, subtraction, multiplication, inversion, and division in prime fields.
An elliptic curve cryptoprocessor for computing scalar multiplication is subsequently
designed for curves defined over prime fields. The proposed cryptoprocessor comprises
two identical programmable arithmetic unit cores. We explore a parallel schedule for
computing elliptic curve scalar multiplication on the proposed dual-core cryptoprocessor. A
suitable technique is proposed and applied to the elliptic curve cryptoprocessor to make it
resistant to differential power analysis and doubling attacks. In summary, the proposed
cryptoprocessor is inherently programmable, memoryless, and resistant to timing and
power attacks. It efficiently optimizes the area × time per bit value for elliptic curve scalar
multiplication.
4.1 Introduction
Due to the increased demand for secure communication, it is important to speed up
public key cryptography (PKC). Application areas like mobile communication emphasize
that high performance security architectures for large volumes of data are of utmost importance.
In order to provide security in these resource-constrained devices, PKC algorithms
should be implemented in a small area with high throughput. Elliptic curve cryptography
(ECC) [174, 178] is one of the best PKC algorithms, as it provides high security at smaller
bit sizes than RSA [182]. In mobile applications, ECC is regarded as more suitable than
RSA-based public key schemes because it operates with higher throughput, lower power
consumption, and smaller area requirements. Nowadays, pairing-based cryptography,
which is an extension of ECC, is mostly used in identity-aware handheld devices.
Elliptic curve scalar multiplication (ECSM) is one of the most important operations
in elliptic curve as well as pairing-based cryptography. It relies on underlying finite field
primitives like multiplication and inversion. A lot of work has been reported on
techniques for speeding up ECSM [66, 117, 155, 176]. The speed-up techniques
are essentially threefold. First, high-speed algorithms and architectures for finite field
primitives like multiplication and inversion were developed. Second, the scalar
multiplication methods were improved to reduce the number of costly underlying operations, such as
multiplicative inversion. Finally, the architectures were improved to incorporate more
parallelism in the computation of the scalar product.
In this chapter, we present an elliptic curve scalar multiplier (or elliptic curve
cryptoprocessor) that exploits the concept of shared arithmetic hardware, and we explore its security
against timing and power attacks. The contribution of the chapter is threefold.
• PGAU core. We propose a Programmable GF(p) Arithmetic Unit (PGAU) that
performs GF(p) addition, subtraction, multiplication, inversion, and division. The
modular operations are performed directly in the 2's complement number system. The
PGAU requires 18% less area than an integrated design in which
each arithmetic unit is a state-of-the-art stand-alone implementation. With respect to
the target operation AB/C (mod p), the PGAU occupies only 0.96 times the slice area of the
GF(p) ALU [106] while achieving a 2.67 times speedup.
• Elliptic curve cryptoprocessor. We observe that the area saved by the PGAU design
can be exploited by using multiple copies of it in an elliptic curve cryptoprocessor. We
attempt to speed up elliptic curve scalar multiplication by using two cores of the
proposed programmable GF(p) arithmetic unit. The proposed
cryptoprocessor is implemented on the Xilinx Virtex-II Pro FPGA platform.
• Side-channel attacks. The programmable GF(p) arithmetic unit is designed in such
a way that it exposes no timing or power attack vulnerabilities during
the execution of finite field operations. A new point blinding technique is proposed
and applied to the proposed elliptic curve cryptoprocessor. Experimental results
are provided to demonstrate its security against timing, simple power analysis (SPA),
differential power analysis (DPA), and doubling attacks.
The outline of the present chapter is as follows: the chapter starts with a brief description
of the motivation and objective of the work, followed by the description of the proposed
programmable GF(p) arithmetic unit. It then elaborates on the timing and power attack
resistant elliptic curve cryptoprocessor, along with experimental results.
4.2 Motivation and Objective
Hardware implementations of elliptic curve based cryptographic primitives are in
demand. More specifically, elliptic curve scalar multiplication must be implemented
in hardware to achieve higher throughput for those primitives. To achieve better
performance on dedicated hardware, multiple copies of similar hardware units
are generally integrated on a die and run in parallel. Parallelism in the elliptic curve scalar
multiplication algorithm is achieved by integrating multiple copies of GF(p) adder, subtractor,
multiplier, and inverter/divider units [58]. However, multiple units demand more hardware
area and consume more power. It is also observed that the utilization of individual
GF(p) arithmetic units is low in existing designs. For example, the utilization of the
parallel ECC hardware (PPU2) in [23] is 42.7% during ECSM computation. The current work
attempts to improve the utilization of hardware.
In order to improve the utilization of hardware area, we first study the architectures of
individual GF(p) arithmetic units, which are shown in [106] and [53]. We observe that there
is considerable scope for hardware optimization. The GF(p) arithmetic units are mainly based on
two's complement adders. As per the architectures shown in [53], the GF(p) adder, subtractor,
multiplier, and divider consist of 2, 2, 3, and 8 two's complement adders, respectively.
Instead of separate adder circuits in each of the arithmetic units, we can keep only an
optimum number of such circuits and reuse them for performing all four GF(p) arithmetic
operations. It is also observed that some other parts, like internal registers and counters, can
be reused for computing GF(p) multiplication and division. Thus, we can share and
optimize the hardware for computing all four GF(p) operations. The objective is for this
sharing technique to improve hardware utilization.
We introduce a programmable architecture for computing all four aforementioned
finite field operations on a single unit, which we call the programmable GF(p) arithmetic unit
(PGAU). The unit can be reprogrammed to perform any of the four operations. Moreover,
the proposed PGAU is applicable not only to ECC; it can also be used for developing
any finite field cryptographic primitive.
Another objective of developing such a programmable unit is to achieve a lower value
of area × time per bit¹ while computing elliptic curve scalar multiplication. It is a fact
that there is minimal scope for parallelism within the elliptic curve point addition (ECA)
or elliptic curve point doubling (ECD) formulæ. The sequence of operations in an ECA
(or ECD) provides only limited scope for computing two different GF(p)
operations simultaneously. Here we mainly refer to GF(p) multiplication and division, which are the most
time-consuming operations. Thus, the performance of ECSM computation
can potentially be improved by computing independent ECA and
ECD operations in parallel within an iteration of the ECSM algorithm. In [23], this parallelism is achieved by keeping
two copies of the GF(p) multiplier and divider units, which demands twenty-two two's
complement adders, two mod-k counters, and two mod-2k counters, along with complex
control circuits for scheduling the modular operations in ECA and ECD.
We reduce the complexity of the architecture by keeping two dedicated PGAUs, one for
ECA and one for ECD. The two PGAU cores run as parallel threads, which yields higher
throughput because of the increased parallelism. Moreover, the increase in area is smaller
than that of a mere duplication of the GF(p) arithmetic units because of the
compactness of the PGAU, which helps to reduce the area × time per bit value.
¹ The term time per bit is the reciprocal of the bitrate, so area × time per bit is
the area divided by the bitrate. A lower value of this parameter indicates better performance of a design.
The security of the proposed elliptic curve cryptoprocessor against timing and power
attacks is another major consideration in this work. We propose a technique for resisting
a special type of power attack known as the doubling attack (DA). The adopted algorithms
and design techniques make the proposed design secure against differential and
non-differential timing and power attacks. Extensive experimental results against those attacks
are presented to demonstrate the strength of the proposed design.
4.3 Programmable GF(p) Arithmetic Unit (PGAU)
In a typical elliptic curve processor, as depicted in Fig. 4.1, there are dedicated
arithmetic units to perform the finite field operations, namely addition, subtraction,
multiplication, and division. In the figure, these arithmetic units are denoted Op1, Op2, Op3,
and Op4, respectively. The controller logic schedules the operations performed by the
arithmetic units Op1−4. The present work observes that the arithmetic units have
significant commonality and are hence amenable to hardware sharing. Thus, the elliptic curve
processor can alternatively be transformed into a combination of a common logic, the individual
unshared logic of each of Op1−4, a controller logic, and a logic for configuration,
as shown in Fig. 4.2.
[Figure 4.1 block labels: controller logic; Op 1 to Op 4, with inputs in 1 to in 4 and outputs out 1 to out 4]
Figure 4.1: General structure of ECC processor with dedicated hardware units.
It may be observed that the common hardware logic of Op1−4 has been extracted and
integrated into one common logic module. The unshared portions are designed as Op1−4.
[Figure 4.2 block labels: unshared logic of Op 1 to Op 4; common logic; logic for configuration; controller logic; ports op, in, and out]
Figure 4.2: Structure of ECC processor with shared arithmetic unit.
Depending on the operation required (denoted by op), the controller logic uses the logic
for configuration module to program the unit to perform one of the four operations.
This programmable elliptic curve processor consumes much less area than the dedicated
arithmetic units of a conventional implementation. However, the programmable
processor shown in Fig. 4.2 may have the following downsides.
• It essentially has the resources to compute only one finite field operation at a time.
Therefore, it prevents the computation of several finite field operations in parallel.
• Due to the extra control logic, place and route becomes more complex, which may
require additional design time. It may also demand a larger chip area and
lengthen the critical path of the design.
In this work, however, we use elliptic curves defined over a prime field, with the points on the
curve represented in affine coordinates. In this point representation, there is very little
scope for parallel computation of prime field operations. Thus, a common arithmetic core is
useful. The elliptic curve processor with the resource sharing technique shown in Fig. 4.2
can also be implemented hierarchically from smaller modules to reduce the place and route
complexity. Therefore, the degradation of results due to the additional logic is negligible. The
main objective of this work is to optimize the area × time per bit value in ECSM computation.
4.3.1 Motivation of PGAU
The main motivation behind the programmable GF(p) arithmetic unit is to optimize the area
by exploiting the resource sharing technique. Resource sharing prevents the
computation of several GF(p) operations in parallel (because they share the same resources),
but this does not cause problems with the point representation we use in the thesis, because
those point formulæ do not have much parallelism. Motivated by this fact, we aim to design
an optimized programmable unit for computing all the underlying finite field operations.
The objective of the work is to develop a programmable unit for GF(p) arithmetic using
fewer resources. For the different operations, we choose underlying algorithms that require
fewer resources and have the most resources in common. The common resources are then
extracted from each of the architectures to develop a programmable GF(p) arithmetic unit
(PGAU). We found the bit-serial interleaved multiplication and binary inversion algorithms
in GF(p) to be the most suitable in this respect, at the cost of a higher multiplication time.
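As a software illustration of why bit-serial interleaved multiplication needs little more than an adder, the following sketch performs one modular doubling and one conditional modular addition per bit of the multiplier. It is a minimal model written for this discussion, not the thesis's Algo. 2.1 verbatim; the function name and the most-significant-bit-first scan order are assumptions.

```python
def interleaved_mul(a: int, b: int, p: int) -> int:
    """Bit-serial interleaved modular multiplication: returns a*b mod p.

    Assumes 0 <= a, b < p. Each of the k = bit-length(p) iterations does a
    modular doubling followed by a conditional modular addition, and every
    reduction is a single conditional subtraction of p.
    """
    k = p.bit_length()
    u = 0                           # accumulator
    for i in reversed(range(k)):    # scan b from the most significant bit
        u <<= 1                     # doubling step: u = 2u mod p
        if u >= p:
            u -= p
        if (b >> i) & 1:            # addition step: u = u + a*b_i mod p
            u += a
            if u >= p:
                u -= p
    return u

print(interleaved_mul(3, 4, 7))     # 12 mod 7 = 5
```

Because the accumulator never exceeds 2p before reduction, no multi-word multiplier or division circuit is required, which is consistent with the trade-off stated above: small hardware in exchange for k sequential iterations.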
4.3.2 Proposed Programmable GF(p) Arithmetic Unit
Figure 4.3 presents a top-level block diagram of the proposed programmable arithmetic
unit for performing GF(p) addition, subtraction, multiplication, inversion, and division.
It consists of four ⌈log2 p⌉-bit operand registers, namely u, v, x1, and x2, for holding the
intermediate results. The operation a [op] b (mod p) is done inside the data path module,
where op indicates the operation being performed by the unit, and the result appears at port
R. The control path decodes the instructions and generates appropriate signals to configure
the data path so that the operations are performed correctly. In our proposed architecture,
we have five different instructions, which are encoded in a four-bit opcode. Three bits would
suffice for five instructions; one extra bit is used to make the controller unit and the
configuration logic less complex. The operations and corresponding opcodes (op) are described in Table 4.1.
Inside the architecture, the four opcode bits are referred to as D, M, A/S, and I/D. The opcode
bits D, M, and A/S are used to control the data flow inside the data path block, whereas I/D
is used only to differentiate inversion from division by initializing the x1 register with 1 or
b, respectively (see Fig. 4.3).
[Figure 4.3 block labels: control path; operand registers; data path containing common cores, unshared logic, and configuration; MUX; ports a, b, p, op, clock, reset, and R]
Figure 4.3: Programmable GF(p) adder, subtractor, multiplier, and divider unit.
Table 4.1: Different opcodes for PGAU.
Operation               D   M   A/S   I/D
GF(p) Addition          0   0   0     X
GF(p) Subtraction       0   0   1     X
GF(p) Multiplication    0   1   0     X
GF(p) Inversion         1   0   1     0
GF(p) Division          1   0   1     1
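The opcode table can be read as a small decoder in which X marks a don't-care bit. The sketch below is only a software restatement of Table 4.1; the names `Op` and `decode` are hypothetical and do not describe the actual control path.

```python
from enum import Enum

class Op(Enum):
    # Bit order (D, M, A/S, I/D); None stands for a don't-care bit (X in Table 4.1).
    ADD = (0, 0, 0, None)
    SUB = (0, 0, 1, None)
    MUL = (0, 1, 0, None)
    INV = (1, 0, 1, 0)
    DIV = (1, 0, 1, 1)

def decode(d: int, m: int, a_s: int, i_d: int) -> Op:
    """Return the operation whose opcode pattern matches the four bits."""
    bits = (d, m, a_s, i_d)
    for op in Op:
        if all(want is None or want == got for want, got in zip(op.value, bits)):
            return op
    raise ValueError(f"unused opcode {bits}")
```

Only five of the sixteen bit patterns' equivalence classes are used, so the remaining patterns are rejected here; how the hardware treats them is not stated in the text.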
4.3.3 Programmable Data Path Block
Figure 4.4 depicts a broad view of the data path block of the proposed programmable
GF(p) arithmetic unit. Considering the hardware cost, parallelism, and efficiency of com-
puting GF(p) multiplication and division, we classify the data path block into three major
sub-blocks; namely DP1, DP2, and DP3. We use the sub-blocks for computing GF(p)
operations in the following way.
• GF(p) addition/subtraction. GF(p) addition and subtraction are performed by the
DP3 module, which takes its inputs from the a, b, and p ports. It computes a ± b (mod p)
and produces the output at a′′, which then comes out through port R.
[Figure 4.4 labels: sub-blocks DP1, DP2, and DP3 (common cores, unshared logic, muxes); registers u, v, x1, x2; inputs a, b, p, bi; control signals D, M, A/S, ER; intermediate signals 2x, a.bi, u0, v0, u′, v′, x1′, x2′, u′′, v′′, x1′′, x2′′, a′′, un, vn, x1n, x2n; output R]
Figure 4.4: Data path block.
• GF(p) multiplication. The proposed design computes GF(p) multiplication using the
bit-serial interleaved multiplication algorithm, which is described in Chapter 2. Each
iteration of this multiplication algorithm (Algo. 2.1) consists of two steps, GF(p)
doubling and GF(p) addition. DP1 and DP3 are used to compute those two steps
in only one clock cycle. Operand register u (Fig. 4.3) is used to accumulate the
intermediate result. At the k-th iteration, where k = ⌈log2 p⌉, the final result appears
at the a′′ port of DP3 and goes out through port R. Hence, the multiplication
latency of the proposed design is k clock cycles.
• GF(p) inversion/division. The proposed design computes GF(p) inversion as well
as division using the binary inversion/division algorithm, which is described in Algo. 2.2,
Section 2.1.3 of Chapter 2. One iteration of this algorithm consists of three steps. We
say that step–1 comprises the operations in steps 2.2.1 and 2.2.2, step–2 the operations
in steps 2.5.1 and 2.5.2, and step–3 the operations in steps 2.7 and 2.8 of the
algorithm. For a clear understanding of our design, these steps of Algo. 2.2
are revisited below.
Step–1:
2.2.1. u ← u/2
2.2.2. if x1 is even then x1 ← x1/2 else x1 ← (x1 + p)/2
Step–2:
2.5.1. v ← v/2
2.5.2. if x2 is even then x2 ← x2/2 else x2 ← (x2 + p)/2
Step–3:
2.7. if u ≥ v then u ← u − v, x1 ← x1 − x2
2.8. else v ← v − u, x2 ← x2 − x1
The first two steps perform similar operations on different inputs, {u, x1} and {v, x2},
whereas step–3 operates on the input {u, v, x1, x2}. Three data path units, namely DP1,
DP2, and DP3, are used to perform these three steps, respectively. The
updated values of u, v, x1, and x2 after an iteration come out at un, vn, x1n, and x2n,
respectively. These intermediate results are then stored into the respective
registers u, v, x1, and x2. The multiplexing of intermediate results as per Algo. 2.2 is done
based on the bit values u0 and v0, which indicate whether the current values of u and v
are odd. If either u or v is even, the intermediate results come from DP1
(Step–1) and DP2 (Step–2); otherwise, they come from DP3 (Step–3). All
three data path sub-blocks run in parallel for computing inversion as well as division,
and they compute one iteration of the algorithm in only one clock cycle. At every
clock cycle either u or v is reduced by one bit size. Hence, the inversion/division
latency of PGAU is at most 2k clock cycles.
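The three steps revisited above can be exercised with a short software model. This is a sketch under assumptions rather than the thesis's Algo. 2.2: the initial values u = denominator, v = p, x2 = 0, the termination test (u = 1 or v = 1), and the function name are taken from the standard binary inversion algorithm, and the numerator/denominator roles are chosen for illustration. Inversion is the special case with numerator 1, mirroring the x1 initialization described for the I/D opcode bit.

```python
def binary_divide(num: int, den: int, p: int) -> int:
    """Return num/den mod p for an odd prime p, using halvings and subtractions only."""
    if den % p == 0:
        raise ZeroDivisionError("denominator is zero modulo p")
    # Assumed initialization (standard binary inversion): x1 starts at the numerator.
    u, v, x1, x2 = den % p, p, num % p, 0
    while u != 1 and v != 1:
        if u % 2 == 0 or v % 2 == 0:
            if u % 2 == 0:                      # Step-1 (steps 2.2.1, 2.2.2)
                u //= 2
                x1 = x1 // 2 if x1 % 2 == 0 else (x1 + p) // 2
            if v % 2 == 0:                      # Step-2 (steps 2.5.1, 2.5.2)
                v //= 2
                x2 = x2 // 2 if x2 % 2 == 0 else (x2 + p) // 2
        elif u >= v:                            # Step-3 (step 2.7)
            u, x1 = u - v, (x1 - x2) % p
        else:                                   # Step-3 (step 2.8)
            v, x2 = v - u, (x2 - x1) % p
    return x1 if u == 1 else x2

print(binary_divide(1, 3, 7))   # inverse of 3 modulo 7 is 5
```

Each pass of the loop selects the halving results when either u or v is even and the subtraction results otherwise, which is the same selection rule the text attributes to the u0 and v0 bits. The loop count of this model is not claimed to equal the hardware latency.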
The common cores of the DP3 module are identified as the common operator logic for all
five GF(p) operations. The DP2 module and the unshared logic of the DP3 module are used only
for inversion and division. The common cores of the DP1 module are used for
multiplication, inversion, and division. The muxes are considered as the logic for configuration.
The architectures and functionalities of DP1, DP2, and DP3 are described in the following
paragraphs. The control signal ER (Fig. 4.4) is a 2-bit signal, say ER1 and ER2, with
ER1 = D ∨ M and ER2 = D ∧ T, where T is a temporary one-bit signal generated by the
following logic: if v = 1 then T = 1 else T = 0. Here ∨ and ∧ stand for the Boolean OR and
AND operations. Let us now describe the functionality of each of the modules in the data
path of our proposed programmable GF(p) arithmetic unit.
Module DP1. Figure 4.5 shows the DP1 module. It consists of one ⌈log2 p⌉-bit binary
adder/subtractor and a set of multiplexors. For GF(p) multiplication (D = 0), DP1
computes 2u (mod p) and passes the result to the port 2x, whereas for GF(p) inversion and
division (D = 1), it computes u/2 and, if x1 is even, x1/2, else (x1 + p)/2.
Finally, if u is even, DP1 passes the computed results to u′ and x1′; otherwise, it passes the
unchanged values of u and x1 to u′ and x1′, respectively.
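The behaviour described for DP1 can be summarized as a single function of the opcode bit D. The following is a behavioural sketch only; the function name, the tuple layout of the outputs, and the use of `None` for an unused port are assumptions made for illustration, not the structure of the circuit in Fig. 4.5.

```python
def dp1(u: int, x1: int, p: int, d: int):
    """Behavioural model of DP1: returns (two_x, u_out, x1_out).

    d = 0 (multiplication): two_x = 2u mod p, obtained with one conditional
           subtraction of p; u and x1 pass through unchanged.
    d = 1 (inversion/division): if u is even, output u/2 together with x1/2
           (x1 even) or (x1 + p)/2 (x1 odd); otherwise pass u and x1 through.
    """
    if d == 0:
        doubled = u << 1
        two_x = doubled - p if doubled >= p else doubled
        return two_x, u, x1
    if u % 2 == 0:
        x1_half = x1 >> 1 if x1 % 2 == 0 else (x1 + p) >> 1
        return None, u >> 1, x1_half
    return None, u, x1
```

In both modes the only arithmetic is one addition or subtraction involving p plus a one-bit shift, which is why a single adder/subtractor can serve both the doubling and the halving cases.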
[Figure 4.5 labels: shifter producing 2u, u/2, x1/2, and (p + x1)/2; adder with inputs a, b, cin and outputs s, cout; muxes selected by D, u(k−1), x1(0), and u0; inputs x1, u, p; outputs 2x, u′, x1′]
Figure 4.5: DP1 block in the data path.
Module DP2. Figure 4.6 shows the DP2 module, which is used only for inversion and
division. This module computes v/2 and, if x2 is even, x2/2, else (x2 + p)/2. Finally, if
v is even, DP2 passes the updated values to v′ and x2′; otherwise, it passes the old values
of v and x2 to v′ and x2′, respectively.
[Figure 4.6 labels: shifter producing v/2, x2/2, and (p + x2)/2; adder with inputs a, b, cin and outputs s, cout; muxes selected by D, x2(0), and v0; inputs x2, v, p; outputs v′, x2′]
Figure 4.6: DP2 block in the data path.
[Figure 4.7 labels: binary adder (inputs a, b, cin; output s); GF(p) A/S unit (inputs a, b, p; output s); two 4×1 input muxes and 2×1 muxes; inputs u, v, x1, x2, p, 2x, a, a.bi, b; control signals A/S, S, Y; internal signal uov; outputs u′′, v′′, x1′′, x2′′, a′′]
Figure 4.7: DP3 block in the data path.
Module DP3. Figure 4.7 shows the detailed architecture of the DP3 module. It comprises
one GF(p) addition/subtraction (A/S) unit, input and output multiplexors, and some
additional circuitry used in inversion and division only. The GF(p) A/S unit is used for
computing all five GF(p) arithmetic operations. DP3 computes a [+,−] b (mod p) if D = 0
and M = 0. The result of GF(p) addition or subtraction goes to the output port R through
a′′. In the case of GF(p) multiplication (D = 0, M = 1), DP3 computes 2x + a·bi (mod p).
In each iteration, the intermediate result of the multiplication is stored back in register u, while at
the final iteration the value of a′′ passes through output port R. DP3 is fully utilized for
computing GF(p) inversion and division (D = 1). For D = 1, if u ≥ v then DP3 performs
u′′ = u − v and x1′′ = x1 − x2 (mod p); else it performs v′′ = v − u and x2′′ = x2 − x1 (mod
p). Thus the result of the subtractor unit (uov) is multiplexed to either u′′ or v′′, and the
result of the GF(p) A/S unit (a′′) is multiplexed to either x1′′ or x2′′. The inputs x and y of the
GF(p) A/S unit are assigned by a pair of 4×1 multiplexers that are controlled
by the select line S. The variables are in accordance with the description of Algo. 2.2. The
control signals Y and S (S1S0) are generated in the following way:
Y = 1, if u ≥ v
  = 0, otherwise;
and
S0 = MD + DY
S1 = D,
where D and M indicate the opcode bits.
Fig. 4.8 depicts the programmable GF(p) adder/subtractor (A/S) unit. To
obtain a programmable modular adder/subtractor unit, we need to control the input data
of two binary adder circuits. The control signal A/S configures the circuit for GF(p)
addition or subtraction accordingly: if A/S = 0 the unit performs x + y (mod p), else
it performs x − y (mod p). Therefore, A/S is one in the inversion and division opcodes
and zero in the multiplication opcode (see Table 4.1), since
multiplication needs addition whereas inversion/division needs subtraction. The responsibilities of the
three data path sub-blocks are summarized in Table 4.2.
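A modular adder/subtractor built from two binary adders can be modelled as follows. This is a minimal sketch of the idea, assuming operands already reduced to the range [0, p); the function name and the exact way the two adder results are selected are illustrative and are not taken from Fig. 4.8.

```python
def gf_add_sub(x: int, y: int, p: int, a_s: int) -> int:
    """a_s = 0: (x + y) mod p; a_s = 1: (x - y) mod p, for 0 <= x, y < p."""
    first = x - y if a_s else x + y              # first binary adder: x +/- y
    second = first + p if a_s else first - p     # second binary adder: correct by +/- p
    if a_s:
        return second if first < 0 else first    # negative difference: add p back
    return first if second < 0 else second       # sum reached p: keep the reduced value

print(gf_add_sub(5, 4, 7, 0))   # (5 + 4) mod 7 = 2
print(gf_add_sub(3, 5, 7, 1))   # (3 - 5) mod 7 = 5
```

Both adder results are computed unconditionally and only the final selection depends on a sign, which matches the text's description of a single control signal reconfiguring the same two adders for either operation.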
The programmable GF(p) arithmetic unit is also programmable in the sense that it
supports all primes no longer than the given lengths (192, 224, 256 bits).
[Figure 4.8 labels: two binary adders (inputs a, b, cin; outputs s, cout); input muxes on x, y, and p controlled by A/S; output muxes; output x + y, x − y (mod p)]
Figure 4.8: GF(p) adder and subtractor (A/S) unit.
Table 4.2: Major operations in sub-blocks of PGAU.
GF(p) Operation        DP1                        DP2                        DP3
Addition/Subtraction   −                          −                          a ± b
Multiplication         2u                         −                          2x + a.bi
Inversion/Division     u/2, x1/2, (x1 + p)/2      v/2, x2/2, (x2 + p)/2      ±(a − b), ±(x1 − x2)
4.3.4 Hardware Cost and Performance
The proposed programmable arithmetic unit takes only one clock cycle to compute a GF(p)
addition or subtraction. A GF(p) multiplication takes
⌈log2 p⌉ clock cycles, and a GF(p) inversion or division takes at most 2⌈log2 p⌉
clock cycles. Table 4.3 shows the resources required for implementing the proposed
programmable GF(p) arithmetic unit on a Virtex-II FPGA. The number of clock cycles
and the time required to compute GF(p) addition/subtraction (A/S), multiplication (M), and
inversion/division (I/D) are listed in Table 4.4.
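The relation between cycle counts and execution times is simply time = cycles / clock frequency. As a worked check for the 192-bit case, the sketch below uses the 40 MHz frequency reported for that design in Table 4.3; the helper name is hypothetical.

```python
def latency_us(cycles: int, freq_mhz: float) -> float:
    """Time in microseconds taken by `cycles` clock cycles at `freq_mhz` MHz."""
    return cycles / freq_mhz

k = 192         # operand length in bits
f_mhz = 40.0    # clock frequency of the 192-bit PGAU as listed in Table 4.3

add_sub = latency_us(1, f_mhz)        # one clock cycle
mult = latency_us(k, f_mhz)           # k clock cycles
inv_div = latency_us(2 * k, f_mhz)    # 2k clock cycles

print(f"{add_sub:.3f} {mult:.3f} {inv_div:.3f}")   # 0.025 4.800 9.600
```

These values agree with the k = 192 column of Table 4.4.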
Table 4.3: Implementation result of PGAU on Virtex-II FPGA.
Prime p (bits)   Area (Slice)   Area (LUT)   Area (Equivalent gate)   Frequency (MHz)
192              3 985          7 328        60.9k                    40
224              4 657          8 547        71.0k                    37
256              5 379          9 821        81.5k                    34

Table 4.4: Performance of PGAU on Virtex-II FPGA.
GF(p) operation        #Clock   Time (µs), k = 192   Time (µs), k = 224   Time (µs), k = 256
Addition/Subtraction   1        0.025                0.027                0.029
Multiplication         k        4.800                6.000                7.300
Inversion/Division     2k       9.600                12.100               14.600
k = ⌈log2 p⌉

Table 4.5 gives a comparative study of our proposed PGAU and stand-alone implementations
of the different GF(p) arithmetic units. In the table, we refer to a conventional processor with
dedicated GF(p) adder, subtractor, multiplier, and divider units as the integrated processing
unit (IPU). The IPU design we consider is similar to the parallel GF(p) arithmetic unit
shown in [53]. The work described in [53] provides the design architectures for each of
the individual arithmetic operations in prime fields. However, it does not provide the cost
of implementing the individual units. Thus, for a proper comparison, we implement each of the
GF(p) arithmetic units based on the architectures provided in [53]. Table 4.5 shows that
the area requirement of our PGAU is only about 82% of that of the IPU.
Table 4.5: Comparative hardware costs in LUT on Virtex-II FPGA.
Prime p (bits)   Add   Sub   Mult    Inv/Div   IPU      PGAU    PGAU/IPU
192              577   577   1 145   6 616     8 915    7 328   0.82
224              673   673   1 329   7 731     10 406   8 547   0.82
256              769   769   1 508   9 146     12 192   9 821   0.81
− Add, Sub, Mult, and Inv/Div indicate stand-alone adder, subtractor, multiplier, and
inverter/divider units, respectively.
Comparison with GF(p) ALU [106]. A GF(p) ALU for an encryption processor is presented
by Daly et al. [106]. It performs modular multiplication and inversion in the Montgomery
domain. This ALU is a carry-propagate-adder-based architecture consisting of three (k+2)-bit
adder units. It is stated that the ALU computes multi-cycle modular operations
such as multiplication and inversion by executing repeated addition/subtraction operations on
the respective data sets. The paper does not mention anything about the control units; we
may assume that the control signals are generated by another module outside the ALU.
Such a control unit adds extra overhead to the ALU of [106] for performing the
respective modular operations. However, [106] also reports implementation results for
an architecture that computes the AB/C (mod p) operation.
The major differences between the proposed PGAU and the ALU proposed in [106] with
respect to the adopted design strategies are shown in Table 4.6. The proposed PGAU performs
all GF(p) operations in the two's complement binary number domain, whereas the GF(p)
ALU of [106] computes multiplication and inversion on Montgomery domain numbers
and thus incurs an overhead for input and output conversion in elliptic curve based
applications. Another major advantage of the PGAU design is that it performs GF(p) division
directly in 2k clock cycles, while the ALU of [106] performs the same in 3k clock cycles
by executing an inversion followed by a multiplication.
Table 4.6: Difference between PGAU and GF(p) ALU [106].
                GF(p) ALU [106]                         PGAU
Number Domain   Montgomery                              two's Complement
Mult. Algo.     Montgomery                              Interleaved
Inv. Algo.      Montgomery                              Binary inversion
Div. Algo.      Inversion followed by multiplication    Binary inversion/division
The implementation result of our 192-bit PGAU is compared with the GF(p) ALU of the
same length [106] in Table 4.7. Both designs are implemented on a Virtex II FPGA
platform. In the table, CC indicates the number of clock cycles required to compute
the respective operation on the respective design, and T indicates the corresponding time in
µs. The 192-bit implementation of [106] occupies 4135 slices and operates at a maximum
clock frequency of 19 MHz, whereas the proposed PGAU at the same bit length occupies 3985
slices and runs at a maximum clock frequency of 43 MHz. Owing to the direct division,
the PGAU saves k clock cycles for each division operation, which is a major operation in
elliptic curve point operations in affine coordinates. The small area gain in our design is
due to the step-by-step optimized architecture for all operations.
Table 4.7: Performance comparison between PGAU and GF(p) ALU [106].
Design        Slice   Multiplication (A·B)   Inversion (A^−1)   Division (A/B)   AB/C
(192-bit)             CC, T                  CC, T              CC, T            CC, T
Proposed      3985    k, 4.8                 2k, 9.6            2k, 9.6          3k, 15.4
Daly [106]    4135    k, 10.1                2k, 20.2           3k, 30.3         4k, 40.4
− CC: number of clock cycles. T: time in µs.
4.4 Elliptic Curve Cryptoprocessor Resistant to Timing and Power Attacks
The proposed elliptic curve cryptoprocessor employs the programmable GF(p) arithmetic
unit described above. The whole architecture is inherently programmable: the
prime p can be changed without reconfiguring the FPGA. The architecture is designed for
computing elliptic curve operations in affine coordinates.
In affine coordinates, point addition (ECA) consists of two multiplications and one division,
and point doubling (ECD) consists of three multiplications and one division in GF(p).
The elliptic curve scalar multiplication (ECSM) is performed by executing a number of
ECA and ECD operations. In general, we may consider that the integer d in a dP operation
contains 0.5⌈log_2 d⌉ ones on average. Thus, using the binary algorithm, one ECSM can be
computed with ⌈log_2 d⌉ ECD and 0.5⌈log_2 d⌉ ECA operations [155].
However, Okeya et al. [156] pointed out that this imbalanced ECSM computation
procedure is vulnerable to non-differential side-channel attacks. The Montgomery
ladder, described in Algo. 2.4, is balanced: it computes both an ECD and an
ECA at every iteration irrespective of the bit value d_i. It therefore executes about 30% additional
operations compared with the binary algorithm in order to defend against these attacks.
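The operation counts discussed above can be illustrated with a small counting sketch. It is not part of the proposed design: the function names are hypothetical, the scalar d = 203 is borrowed from the worked example used later in this section, and the most significant bit is assumed to be consumed by the initialization in both algorithms.

```python
def binary_counts(d):
    """Return (ECD, ECA) counts for left-to-right double-and-add on scalar d."""
    tail = bin(d)[3:]                 # bits after the most significant bit
    return len(tail), tail.count("1")


def ladder_counts(d):
    """Return (ECD, ECA) counts for the Montgomery ladder: one of each per bit."""
    n = len(bin(d)[3:])
    return n, n


d = 203  # binary 11001011
print("binary:", binary_counts(d))   # the ECA count depends on the bits of d
print("ladder:", ladder_counts(d))   # the ECA count is independent of the bits of d
```

In the binary algorithm the number of point additions follows the number of ones in d, which is exactly the imbalance that a non-differential side-channel attack observes; the ladder removes that dependence at the price of extra additions.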
It is shown in [162] that balanced ECSM algorithms are secure against non-differential
side-channel attacks, but they are vulnerable to differential side-channel
attacks. Differential Power Analysis (DPA) is one of the most popular differential
side-channel attacks. Coron [162] described the DPA on a balanced ECSM algorithm other
than the Montgomery ladder. However, the operands (curve points) that occur at every
iteration of the Montgomery ladder are also deterministic. For example, if d_{k−2} = 1 then the operand
of the ECD operation is 2P; otherwise it is P. It is a known fact that the power consumed
by an ECD on 2P differs from the power consumed by an ECD on P. Thus, there is a
correlation between the secret bit d_{k−2} and the power consumption at the first iteration. In general,
there are correlations between d_i and the power consumption at the respective iteration
number k−i−2 for k−2 ≥ i ≥ 0. The DPA is based on such correlations. Hence, Algo. 2.4
is vulnerable to DPA.
There are several ways to protect the secret in an ECSM operation against DPA.
The point blinding technique proposed by Coron [162], applied to the Montgomery ladder, can
defend against DPA. It works as follows. The point P to be multiplied is blinded
by adding a secret random point R for which S = dR is known. Scalar multiplication is done by
computing the point d(R+P) and subtracting S to get Q = dP. Let us consider a user who
executes a set of dP_i operations in a session. The input points R and S are private to the
user and are provided in the same way as d for every new session. The random points R
and S are refreshed at each new execution by computing R ← (−1)^b 2R and S ← (−1)^b 2S
with a random bit b. This makes the DPA attack infeasible since the point P′ = P+R to be
multiplied by d is not known to the attacker.
A newer kind of power attack, known as the doubling attack (DA) [118], breaks
double-and-add-always algorithms with only two queries. It exploits the power consumption profiles
of the elliptic curve device while it computes dP and d(2P). This attack also breaks the above
DPA-resistant point blinding technique. The only difference is that during the second query
with 2P the device processes 2P+2R with probability 0.5. The attack is extended to break
the Montgomery ladder in [81]. Therefore, the combination of these two ideas can break the
DPA-resistant Montgomery ladder. The computation of the Montgomery ladder (Algo. 2.4)
with the above point blinding technique for two input points P and 2P is shown in Table 4.8.
Let us assume that d = 11001011 and M = P+R.
Table 4.8: The dM and d(2M) in the Montgomery ladder.
i   d_i   Process of dM         Process of d(2M)
7   1     Q1 = 1+M              Q1 = 1+2M
          Q2 = 2(M)             Q2 = 2(2M)
6   1     Q1 = M+2M             Q1 = 2M+4M
          Q2 = 2(2M)            Q2 = 2(4M)
5   0     Q2 = 3M+4M            Q2 = 6M+8M
          Q1 = 2(3M)            Q1 = 2(6M)
4   0     Q2 = 6M+7M            Q2 = 12M+14M
          Q1 = 2(6M)            Q1 = 2(12M)
3   1     Q1 = 12M+13M          Q1 = 24M+26M
          Q2 = 2(13M)           Q2 = 2(26M)
2   0     Q2 = 25M+26M          Q2 = 50M+52M
          Q1 = 2(25M)           Q1 = 2(50M)
1   1     Q1 = 50M+51M          Q1 = 100M+102M
          Q2 = 2(51M)           Q2 = 2(102M)
0   1     Q1 = 101M+102M        Q1 = 202M+204M
          Q2 = 2(102M)          Q2 = 2(204M)
Return    Q1 = 203M             Q1 = 406M

In the table, it is observed that if d_{i−1} = d_i then the same doubling operation is executed
in the (i−1)th iteration of the d(P+R) computation and the ith iteration of the d(2(P+R))
computation. Thus, the power consumption profiles reveal, for any such i, whether d_{i−1}
is the same as d_i, i.e., whether both of them are 0 or both are 1. Starting from the first iteration (MSB
of d), all bits of the secret exponent d can then easily be found using the doubling attack.
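The collision pattern visible in Table 4.8 can be reproduced with a short simulation. This is only an abstraction under stated assumptions: points are represented by their integer multiple of M, a "collision" stands for two doubling operations whose power traces an attacker would judge identical, and the function names are hypothetical.

```python
def doubling_inputs(d, base):
    """Multiples of M fed to the doubling unit while the ladder computes d*(base*M)."""
    q1, q2 = base, 2 * base            # state after the MSB: Q1 = base*M, Q2 = 2*base*M
    inputs = [base]                    # the initial doubling Q2 = 2*Q1
    for bit in bin(d)[3:]:             # remaining bits, from d_{k-2} down to d_0
        if bit == "1":
            inputs.append(q2)          # Q2 <- 2*Q2
            q1, q2 = q1 + q2, 2 * q2
        else:
            inputs.append(q1)          # Q1 <- 2*Q1
            q1, q2 = 2 * q1, q1 + q2
    return inputs


def recover_scalar(first, second):
    """Rebuild d from collisions between the doubling inputs of the two queries."""
    bits = "1"                         # the MSB of d is 1
    for j in range(len(first) - 1):
        same = first[j + 1] == second[j]          # collision <=> adjacent bits are equal
        bits += bits[-1] if same else ("0" if bits[-1] == "1" else "1")
    return int(bits, 2)


d = 0b11001011
guess = recover_scalar(doubling_inputs(d, 1), doubling_inputs(d, 2))
print(bin(guess))
```

Each comparison leaks only whether two adjacent bits are equal, yet chaining these comparisons from the MSB downwards reconstructs the whole scalar.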
4.4.1 Modified Montgomery Ladder Against DPA and DA
The DPA- and DA-resistant Montgomery ladder is sketched in Algo. 4.4.1. We propose
a modification of Coron's point blinding technique to defend against the doubling attack (DA). The
points R and S are refreshed in the following way:
R = (−1)^b 3R,
S = (−1)^b 3S.
The modified technique remains secure against DPA, as the rest of the conditions
are unchanged. It is also secure against the doubling attack (DA) because the second
query is blinded with ±3R. Therefore, the two queries of the DA effectively compute d(P+R) and
d(2P±3R), which do not execute any similar operation during the ith and (i−1)th iterations,
respectively. The algorithm computes dP correctly because it evaluates
d(P±3R) − (±3S), where S = dR.
Algorithm 4.4.1. DPA and DA resistant Montgomery ladder.
Input: An integer d ≥ 1 and points P, R, S such that S = dR.
Output: dP.
1. Q1 ← P and Q2 ← R
2. Q1 ← Q1 + Q2
3. Q2 ← 2Q1
4. for i from k−2 down to 0 do
     if d_i = 1 then
       Q1 ← Q1 + Q2 and Q2 ← 2Q2
     else
       Q2 ← Q1 + Q2 and Q1 ← 2Q1
5. b ← ∑_{j=0}^{k−1} Q1X_j (mod 2),  // Q1X is the x-coordinate of Q1
6. Q1 ← Q1 − S
7. R ← (−1)^b 3R, S ← (−1)^b 3S
8. return Q1
It is described in [108] that the ECSM operation generates a good random pattern. The
security of this random generator depends on the input point P, which is private. In our
case, however, the input P is public. In the proposed technique, the input point P is added
to a private point R before the ECSM operation starts. The private R implies a private
input (P+R) to the ECSM operation. Thus, our modified Montgomery ladder works in the same way
as [108]. We take the bitwise XOR of all bits of the x-coordinate of the point Q1 = d(P+R)
as a random bit b. The bit b is indeed random because b = ⊕_{i=1}^{k} x_i, where
x_1, ..., x_k denote the bits of a random number x, must be hard to compute if f (ECSM in our
case) cannot be inverted (see § 6.1.3, [51] and § 5.9.1, [129]). The bit b is used to refresh
the random point R and the corresponding S for the next execution; this is performed in steps 5
and 7 of Algo. 4.4.1.
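The steps of Algo. 4.4.1 can be traced in software with the following minimal sketch. Everything outside the algorithm itself is an assumption made for illustration: the toy curve y^2 = x^3 + 2x + 2 over F_17 with base point (5, 1) of order 19 is a common textbook example and not a parameter of this thesis, the helper names are hypothetical, and `pow(x, -1, p)` (Python 3.8 or later) stands in for the hardware division.

```python
P_MOD, A = 17, 2      # toy curve y^2 = x^3 + 2x + 2 over F_17 (illustrative assumption)
G = (5, 1)            # base point of prime order 19 on the toy curve


def neg(p):
    return None if p is None else (p[0], -p[1] % P_MOD)


def add(p, q):
    """Affine point addition; None stands for the point at infinity."""
    if p is None:
        return q
    if q is None:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if p == q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1 % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)


def mul(k, p):
    """Reference double-and-add, used only to prepare inputs and check results."""
    acc = None
    while k:
        if k & 1:
            acc = add(acc, p)
        p = add(p, p)
        k >>= 1
    return acc


def blinded_ladder(d, P, R, S):
    """Follow steps 1 to 8 of Algo. 4.4.1; returns (dP, refreshed R, refreshed S)."""
    q1 = add(P, R)                                   # steps 1-2: Q1 = P + R
    q2 = add(q1, q1)                                 # step 3:    Q2 = 2*Q1
    for bit in bin(d)[3:]:                           # step 4: bits d_{k-2} .. d_0
        if bit == "1":
            q1, q2 = add(q1, q2), add(q2, q2)
        else:
            q1, q2 = add(q1, q1), add(q1, q2)
    b = bin(q1[0]).count("1") & 1 if q1 else 0       # step 5: parity of the x-coordinate
    q1 = add(q1, neg(S))                             # step 6: remove the blinding
    r3, s3 = mul(3, R), mul(3, S)                    # step 7: refresh R and S
    if b:
        r3, s3 = neg(r3), neg(s3)
    return q1, r3, s3


d = 11
P, R = mul(3, G), mul(7, G)
S = mul(d, R)
result, R_next, S_next = blinded_ladder(d, P, R, S)
print(result == mul(d, P), S_next == mul(d, R_next))
```

The second printed value checks that the refreshed pair still satisfies S = dR, which is what allows the same secret d to be reused in the next execution.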
4.4.2 The ECSM on Single PGAU-core
The DPA-resistant ECSM computation described in Algo. 4.4.1 performs both ECD
and ECA at every iteration. Table 4.9 shows the finite field operations required to perform
ECD and ECA. On a single-PGAU-core implementation, we perform ECA and ECD
sequentially: an ECA followed by an ECD at every iteration. The computations of ECA and
ECD take 4k+10 and 5k+13 clock cycles, respectively. Thus, one iteration of step 4 of Algo. 4.4.1
is computed in 9k+23 clock cycles by the single-PGAU-core ECSM cryptoprocessor,
where k = ⌈log_2 p⌉ = ⌈log_2 d⌉ (refer to Table 4.9).

Table 4.9: Parallelism of ECD and ECA of step-4 of Algo. 4.4.1 in two GF(p) arithmetic cores.
ECD (2Q_{d_i+1}) on PGAU1                                 ECA (Q1 + Q2) on PGAU2
Clocks   GF(p) Operation   RTL                            Clocks   GF(p) Operation   RTL
1        M                 RA ← RQ(d_i+1)x × RQ(d_i+1)x   1        S                 RD ← RQ2y − RQ1y
k+2      A                 RB ← RA + RA                   2        S                 RE ← RQ2x − RQ1x
k+3      A                 RA ← RB + RA                   3        D                 RD ← RD / RE
k+4      A                 RA ← RA + RCP§                 2k+4     M                 RE ← RD × RD
k+5      A                 RB ← RQ(d_i+1)y + RQ(d_i+1)y   3k+5     A                 RF ← RQ1x + RQ2x
k+6      D                 RA ← RA / RB                   3k+6     S                 RE ← RE − RF
3k+7     M                 RB ← RA × RA                   3k+7     S                 RF ← RQ1x − RE
4k+8     A                 RC ← RQ(d_i+1)x + RQ(d_i+1)x   3k+8     M                 RF ← RD × RF
4k+9     S                 RB ← RB − RC                   4k+9     S                 RF ← RF − RQ1y
4k+10    S                 RC ← RQ(d_i+1)x − RB
4k+11    M                 RC ← RA × RC
5k+12    S                 RC ← RC − RQ(d_i+1)y
Result: (RB, RC) = 2Q_{d_i+1}, (RE, RF) = Q1 + Q2
− RTL stands for Register Transfer Logic.
− § Register RCP contains the curve parameter a.
− RQ1x, RQ1y, RQ2x, and RQ2y contain the values of the x and y coordinates of Q1 and Q2, respectively.

Hence, the number of clock cycles (T_S) required to perform the ECSM operation on the single PGAU core is calculated as:
T_S = (k−1)(9k+23) + 3(5k+13) + 4(4k+10)
    = 9k^2 + 45k + 56, (4.1)
where the 3(5k+13) and 4(4k+10) clock cycles are required to execute the three ECD and four ECA
operations of steps 2, 3, 6, and 7 of the algorithm.
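The register-transfer sequences of Table 4.9 can be replayed in software as a sanity check of the affine formulas they implement. The sketch below is illustrative only: the function names are hypothetical, the toy curve y^2 = x^3 + 2x + 2 over F_17 is an assumption chosen for small numbers, and the modular inverse from `pow(x, -1, p)` (Python 3.8 or later) stands in for the PGAU division.

```python
def ecd_rtl(x, y, a, p):
    """Replay the ECD column of Table 4.9 for the point (x, y); returns (RB, RC)."""
    ra = x * x % p                    # M: RA <- RQx * RQx
    rb = (ra + ra) % p                # A: RB <- RA + RA
    ra = (rb + ra) % p                # A: RA <- RB + RA      (3x^2)
    ra = (ra + a) % p                 # A: RA <- RA + RCP     (3x^2 + a)
    rb = (y + y) % p                  # A: RB <- RQy + RQy    (2y)
    ra = ra * pow(rb, -1, p) % p      # D: RA <- RA / RB      (slope)
    rb = ra * ra % p                  # M: RB <- RA * RA
    rc = (x + x) % p                  # A: RC <- RQx + RQx
    rb = (rb - rc) % p                # S: RB <- RB - RC      (x3)
    rc = (x - rb) % p                 # S: RC <- RQx - RB
    rc = ra * rc % p                  # M: RC <- RA * RC
    rc = (rc - y) % p                 # S: RC <- RC - RQy     (y3)
    return rb, rc


def eca_rtl(x1, y1, x2, y2, p):
    """Replay the ECA column of Table 4.9 for distinct points; returns (RE, RF)."""
    rd = (y2 - y1) % p                # S: RD <- RQ2y - RQ1y
    re = (x2 - x1) % p                # S: RE <- RQ2x - RQ1x
    rd = rd * pow(re, -1, p) % p      # D: RD <- RD / RE      (slope)
    re = rd * rd % p                  # M: RE <- RD * RD
    rf = (x1 + x2) % p                # A: RF <- RQ1x + RQ2x
    re = (re - rf) % p                # S: RE <- RE - RF      (x3)
    rf = (x1 - re) % p                # S: RF <- RQ1x - RE
    rf = rd * rf % p                  # M: RF <- RD * RF
    rf = (rf - y1) % p                # S: RF <- RF - RQ1y    (y3)
    return re, rf


two_p = ecd_rtl(5, 1, 2, 17)                  # doubling of (5, 1) on the toy curve
three_p = eca_rtl(5, 1, two_p[0], two_p[1], 17)
print(two_p, three_p)
```

Counting the operations in each function gives 3 multiplications and 1 division for ECD and 2 multiplications and 1 division for ECA, in agreement with the costs stated at the beginning of Section 4.4.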
4.4.3 The ECSM on Dual PGAU-core
Parallelism can be exploited to achieve a faster computation of elliptic curve scalar
multiplication. More than one PGAU can be incorporated for this purpose.
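[Figure: block diagram showing PGAU 1 and PGAU 2 (inputs x, y, p; outputs Ro1 and Ro2), their input multiplexers, register blocks A and B (RP, RCP, RA, RB, RC, RD, RE, RF, RQ1x, RQ1y, RQ2x, RQ2y, RRx, RRy, RSx, RSy), and the controller logic driven by d, clock, and reset.]
Figure 4.9: Programmable dual-PGAU-core ECSM unit.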
The proposed design is based on a k-bit two's complement parallel adder circuit and
does not use any combinational multiplier; hence it has no scope for horizontal
parallelism [58]. We instead maximize the parallelism of GF(p) operations, i.e., we use vertical
parallelism. Fig. 4.9 depicts our proposed dual-core GF(p) elliptic curve scalar multiplication
(ECSM) unit. It comprises two programmable GF(p) arithmetic units (PGAUs),
on which we compute the finite field operations of ECD and ECA concurrently. We assign
PGAU1 to ECD and PGAU2 to ECA. A state-machine-based controller unit
sequences the operations listed in Table 4.9. It is responsible for
loading large operands into the respective registers through a 32-bit data port, generating
opcodes for both PGAUs, selecting operand values, accumulating the ECD and ECA results
at every iteration of the ECSM algorithm, updating the random pair (R,S) for the next execution,
and passing the resultant point Q = dP through the 32-bit output port. There are two
overlapping register blocks, A and B. The registers inside block A serve
PGAU1. The intermediate results of ECD, which appear at Ro1 of PGAU1, are stored in
one of the registers RA, RB, and RC. Similarly, the intermediate results of ECA, which
appear at Ro2 of PGAU2, are stored in one of the registers RD, RE, and RF. The registers
RCP and RP contain the curve parameter a and the prime modulus p, respectively. Registers
RQ1x, RQ1y, RQ2x, and RQ2y contain the x and y coordinates of Q1 and Q2, the
two points on which the ECSM operates. The registers shared by
blocks A and B are RQ1x, RQ1y, RQ2x, RQ2y, and RP, which serve
both PGAUs. The final result of an iteration is written back into the RQ1x, RQ1y, RQ2x, and
RQ2y registers, which hold the input points of the next iteration. Table 4.10 shows
the data transfer among registers for the final result of an iteration.
Table 4.10: Restoring the intermediate results of dP operation.
d_i = 0          d_i = 1
RQ1x ← RB        RQ1x ← RE
RQ1y ← RC        RQ1y ← RF
RQ2x ← RE        RQ2x ← RB
RQ2y ← RF        RQ2y ← RC
The operation in step 6, Q1 ← Q1 − S, is executed in the proposed cryptoprocessor as:
Q2 ← S, Q1 ← Q1 − Q2.
In step 7, the operation R ← (−1)^b 3R is performed as:
Q2 ← R, Q2 ← 2Q2, Q2 ← Q2 + R, R ← (−1)^b Q2.
A similar process is followed for S ← (−1)^b 3S. These procedures are
adopted to reduce the size of the multiplexers at the input ports of the PGAUs. The random bit
b is generated in a single clock cycle by an XOR tree included in the controller logic.
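The XOR tree that produces b can be modelled in software as a logarithmic-depth fold. The sketch below is a behavioural model only, not the controller's actual netlist; the function name and the padding of the width to a power of two are assumptions.

```python
def xor_tree_parity(x, k):
    """Parity of the k-bit value x, computed by repeatedly XOR-ing the two halves."""
    width = 1
    while width < k:                  # pad the width to a power of two
        width *= 2
    while width > 1:
        half = width // 2
        x = (x >> half) ^ (x & ((1 << half) - 1))   # one level of the XOR tree
        width = half
    return x & 1


x_coord = 0x1234_5678_9ABC_DEF0_1234_5678_9ABC_DEF0_1234_5678_9ABC_DEF0   # sample 192-bit value
b = xor_tree_parity(x_coord, 192)
print(b)
```

Each loop iteration corresponds to one level of two-input XOR gates, so a 192-bit operand needs 8 levels of purely combinational logic, which is consistent with producing b in one clock cycle.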
The operation scheduling on PGAU1 and PGAU2 at different clock cycles, for performing
ECD and ECA in parallel, is shown in Table 4.9. The computation of ECD takes
5k+13 clock cycles, whereas ECA takes 4k+10 clock cycles. The operations are
performed in parallel, and the next iteration depends on the results of both the ECD and the ECA
of the current iteration. Hence, one iteration of the Montgomery ECSM ladder is computed
in 5k+13 clock cycles by the proposed dual-core elliptic curve cryptoprocessor, where
k = ⌈log_2 p⌉. The latency of the Montgomery ECSM algorithm in our proposed design is
derived here.
Latency of dP computation. The overhead for DPA and DA resistance involves the
initialization, the final result computation, and the refreshment of the random points, as stated in Algo. 4.4.1.
It consists of three ECD and four ECA operations, which take 3(5k+13) + 4(4k+10)
clock cycles on the proposed dual-PGAU-core cryptoprocessor. Each iteration of the
algorithm takes 5k+13 clock cycles. The algorithm goes through all bits of the scalar multiplier
d, starting from the second most significant bit. Let us consider k = ⌈log_2 p⌉ = ⌈log_2 d⌉.
The number of clock cycles (T_D) required to perform one dP operation on the proposed
cryptoprocessor is as follows.
T_D = (k−1)(5k+13) + 3(5k+13) + 4(4k+10)
    = (k+2)(5k+13) + 4(4k+10)
    = 5k^2 + 39k + 66. (4.2)
Therefore, the latency of the proposed dual-PGAU-core ECSM cryptoprocessor is 5k^2 +
39k + 66 clock cycles. Hence, according to Equations 4.1 and 4.2, the clock cycle latencies of the
single- and dual-core implementations are as follows.
• Latency in single-PGAU-core: 9k^2 + 45k + 56.
• Latency in dual-PGAU-core: 5k^2 + 39k + 66.
Thus, for a 192-bit ECSM (k = 192), the dual-PGAU-core implementation is about
1.8 times faster than the corresponding single-PGAU-core implementation. Throughout
this thesis, our proposed elliptic curve cryptoprocessor refers to this dual-PGAU-core
architecture; it is sometimes also called the dual-core elliptic curve cryptoprocessor.
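The two latency expressions and the quoted speed-up can be checked numerically. The sketch below simply evaluates the unexpanded forms of Equations 4.1 and 4.2; the function names are hypothetical.

```python
def cycles_single(k):
    """Equation 4.1: single-PGAU-core latency in clock cycles."""
    return (k - 1) * (9 * k + 23) + 3 * (5 * k + 13) + 4 * (4 * k + 10)


def cycles_dual(k):
    """Equation 4.2: dual-PGAU-core latency in clock cycles."""
    return (k - 1) * (5 * k + 13) + 3 * (5 * k + 13) + 4 * (4 * k + 10)


k = 192
t_s, t_d = cycles_single(k), cycles_dual(k)
print(t_s, t_d, round(t_s / t_d, 2))
```

Because the fixed overhead of three ECD and four ECA operations is identical in both designs, the ratio approaches 9/5 = 1.8 as k grows and is slightly below that value for k = 192.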
4.5 Security Analysis of the Proposed Cryptoprocessor
This section shows that the proposed implementation is secure against timing
and power attacks. In ECC applications, d is used as the private key of the user. The
(S,R) pair is also private. The private parameters are supplied to the proposed cryptoprocessor
once per session. Within a session the user can decrypt several messages (P). For
every decryption within a session the (S,R) pair is refreshed using a random bit b.
4.5.1 Timing Attacks
The timing attack was introduced by Paul Kocher in 1996 [171] and was the first
reported side-channel attack on cryptographic implementations. Two types of timing
attack are applied to ECSM implementations [45].
• The Hamming weight model. This attack is applicable only to unbalanced
algorithms, where the ECA is performed only in the iterations where d_i = 1. It
exploits the timing measurement of a dP computation to find the
Hamming weight of the secret parameter d. The attack does not find the exact
bit values of d, but it reduces the search space. It works as follows: the
adversary knows the time required by the target device to perform ECD and ECA
and measures the time required by the device to perform one dP
operation. From this timing information the adversary tries to guess the
Hamming weight of d.
• Statistical timing attack. This is a more sophisticated and more powerful timing
attack. Suppose the target device takes different amounts of time to perform ECD operations
on different points. The point processed in an iteration is correlated to the respective
bit value of d. Thus, a statistical analysis of the timing variations of the ECD in a
particular iteration reveals the secret bits.
The proposed elliptic curve cryptoprocessor is secure against the above-mentioned timing
attacks. It computes a symmetric operation at every iteration: it performs both ECD and
ECA in parallel, which takes exactly 5k+13 clock cycles per iteration. According
to Equation 4.2, it takes 5k^2 + 39k + 66 clock cycles to perform one dP operation. Let
the time period of one clock cycle be t. Then the measured dP
computation time is t_dP = t(5k^2 + 39k + 66). It is assumed that the computation times for
ECD and ECA, denoted by t_ecd and t_eca, are known to the adversary. In fact, the
time required for each iteration on the proposed cryptoprocessor equals t_ecd,
which is 5k+13 clock cycles. Thus, t_dP/t_ecd is fixed for a given
k and does not help to find the Hamming weight of the secret d.
On the other hand, statistical timing analysis assumes that the value of t_ecd varies with
the input point. However, in the proposed elliptic curve cryptoprocessor the value of t_ecd is the same for
every point. This is achieved through the design technique adopted for implementing
the PGAU, which takes a fixed amount of time to compute a GF(p) arithmetic operation
regardless of the inputs. For example, it takes exactly k and 2k clock cycles to compute a
multiplication and a division, respectively, on every input for a given finite field. Thus, the proposed
cryptoprocessor is secure against the statistical timing attack.
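The Hamming weight model can be illustrated with a small timing sketch. It rests on several assumptions that are not part of the thesis: the unbalanced device is hypothetical and is modelled as running ECD and ECA sequentially with the cycle counts 5k+13 and 4k+10, the balanced device hides each ECA under the parallel ECD, the initialization overhead is ignored, and all function names are illustrative.

```python
def ecd_cycles(k):
    return 5 * k + 13


def eca_cycles(k):
    return 4 * k + 10


def unbalanced_time(d, k):
    """Cycles of a sequential double-and-add: ECD per bit, ECA only for 1-bits."""
    tail = bin(d)[3:]                      # bits after the MSB
    return len(tail) * ecd_cycles(k) + tail.count("1") * eca_cycles(k)


def ladder_time(d, k):
    """Cycles of the balanced dual-core ladder: one (5k+13)-cycle iteration per bit."""
    return len(bin(d)[3:]) * ecd_cycles(k)


def guess_hamming_weight(total_cycles, nbits, k):
    """Adversary's estimate of the Hamming weight from one timing measurement."""
    ones_after_msb = (total_cycles - (nbits - 1) * ecd_cycles(k)) // eca_cycles(k)
    return ones_after_msb + 1              # add the MSB, which is always 1


k = 192
for d in (0b11001011, 0b10000001, 0b11111111):
    leaked = guess_hamming_weight(unbalanced_time(d, k), d.bit_length(), k)
    print(bin(d), "weight guess:", leaked, "ladder cycles:", ladder_time(d, k))
```

For scalars of equal length the ladder timing is identical, so the measurement carries no information about the Hamming weight, whereas the unbalanced timing reveals it exactly in this idealized model.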
4.5.2 Simple Power Analysis (SPA)
Kocher [163] first described the SPA and DPA attacks. SPA observes the power
consumption of one single execution of a cryptographic algorithm. The SPA attack on an ECSM
implementation is based on the observation that the power consumed at a given time is
related to the point operation being executed and to the bit value of the secret scalar multiplier
being manipulated. SPA on naive implementations, which are based on imbalanced
computations, reveals the bit values of the secret multiplier [45,155]. However, the proposed
implementation has the following SPA-resisting properties.
• It is based on the balanced computation of the Montgomery ladder.
• It does not execute any conditional branch statement.
• It performs field multiplication and squaring by the same sequence of operations.
• It performs a fixed set of operations in every iteration of the dP execution.
Thus, the power consumption profile exhibits a uniform pattern throughout the dP
computation, from which the respective bit value of the secret multiplier cannot be identified
by simple observation. Therefore, the proposed cryptoprocessor is secure against
the SPA attack.
On the other hand, an n-bit scalar multiplier d contains, on average, n/2
non-zero bits in its binary representation. In the binary double-and-add algorithm, the
addition of two points on the elliptic curve is computed only while processing the non-zero
bits of d. Our implementation, however, is based on the Montgomery ladder, which performs
both point addition and point doubling at every iteration irrespective of the bit values;
this makes the design resistant against SPA. As described in Section 2.2.1, the costs of one point
addition and one point doubling are 2M+D and 3M+D, respectively, where M and D
stand for multiplication and division in GF(p). Thus, the total cost of one elliptic curve
scalar multiplication using the binary double-and-add procedure is n(4M+1.5D), whereas the
cost using the proposed procedure is n(5M+2D). The proposed procedure therefore incurs n(M+0.5D) additional operations
for its side-channel attack resistance. In our proposed cryptoprocessor each
M and D requires n and 2n clock cycles, respectively. Therefore, the overhead cost of our
proposed side-channel attack resistant scheme is (n(2n)/(n(7n))) × 100% = 2/7 ≈ 28.6%, i.e., roughly 30%.
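The overhead figure can be derived step by step as follows. This is a worked restatement of the costs given above, under the same average-case assumption of n/2 non-zero bits; the symbols C_binary and C_ladder are introduced here only for convenience.

```latex
\begin{align*}
C_{\text{binary}} &= n(3M + D) + \tfrac{n}{2}(2M + D) = n(4M + 1.5D),\\
C_{\text{ladder}} &= n(3M + D) + n(2M + D) = n(5M + 2D),\\
C_{\text{ladder}} - C_{\text{binary}} &= n(M + 0.5D).
\end{align*}
% Substituting M = n and D = 2n clock cycles:
\[
\frac{C_{\text{ladder}} - C_{\text{binary}}}{C_{\text{binary}}}
  = \frac{n(n + n)}{n(4n + 3n)}
  = \frac{2n^{2}}{7n^{2}}
  = \frac{2}{7} \approx 28.6\%.
\]
```

The ratio is independent of n, so the relative cost of the balanced computation is the same for every field size.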
4.5.3 Differential Power Analysis (DPA)
In a DPA attack the adversary exploits deterministic variations in the power consumption
that are caused by processing varying data. A DPA on ECC is described in [162]. Here we
describe a similar DPA on the SPA-resistant Montgomery ladder. A DPA on Algo. 2.4
in § 2.2.1 can be performed by noticing that at step j the processed points Q1 and Q2
depend only on the bits (d_{k−1}, ..., d_j) of d. When a point Q_i, i ∈ {1,2}, is processed, the power
consumption is correlated to the bit pattern of Q_i and thus to a specific bit s_i (say the
LSB) of Q_i. No correlation is observed with a point that is not computed. Thus it is possible
to successively recover the bits of the secret d by guessing which points are being
computed by the device. The computed points for the first three bits are shown in Table 4.11.
Table 4.11: Computed points in Algo. 2.4 for first three bits of d.
d_{k−2} = 0: Q2 = 3P, Q1 = 2P
    d_{k−3} = 0: Q2 = 5P, Q1 = 4P
        d_{k−4} = 0: Q2 = 9P, Q1 = 8P
        d_{k−4} = 1: Q2 = 10P, Q1 = 9P
    d_{k−3} = 1: Q2 = 6P, Q1 = 5P
        d_{k−4} = 0: Q2 = 11P, Q1 = 10P
        d_{k−4} = 1: Q2 = 12P, Q1 = 11P
d_{k−2} = 1: Q2 = 4P, Q1 = 3P
    d_{k−3} = 0: Q2 = 7P, Q1 = 6P
        d_{k−4} = 0: Q2 = 13P, Q1 = 12P
        d_{k−4} = 1: Q2 = 14P, Q1 = 13P
    d_{k−3} = 1: Q2 = 8P, Q1 = 7P
        d_{k−4} = 0: Q2 = 15P, Q1 = 14P
        d_{k−4} = 1: Q2 = 16P, Q1 = 15P
The main objective of this DPA attack is to recover the secret d. It recovers d iteratively,
starting from the second MSB (d_{k−2}). Let us examine the ECD operation in the Montgomery
ladder. In the first iteration, if d_{k−2} = 0 then it computes 2P; otherwise it computes 2(2P). The point
2P appears in both cases, either as the output or as the input. That means the power consumption
in the first iteration is always correlated with 2P. The point 4P, in contrast, appears only if
d_{k−2} = 1. Thus, the power consumption in the first iteration is correlated with 4P only if the
secret bit d_{k−2} = 1.
In order to mount the DPA attack, we implement the Montgomery ladder on an FPGA
platform that is specially designed for power analysis attacks. The board provides a
1-ohm resistor between the power supply and the VCCINT pin of the FPGA device. We measure
the current drawn through that resistor during the ECSM computation with a current probe,
namely a Tektronix current probe (serial number B014316). We use
the probe with a TCPA300 power amplifier in standby mode. The measured power is
displayed and stored on a Tektronix TDS5032B Digital Phosphor Oscilloscope. We develop
software tools to automate the whole process for varying inputs. The power consumption is
proportional to the voltage drop across the resistor and is measured in mV; it
varies around 10 mV. The power signal is sampled at 12.5 MS/s.
The ECSM with the same exponent d is computed repeatedly with varying
P. The attack first targets the secret bit d_{k−2}, which is processed in the first iteration. Thus,
we store the power consumption during the first iteration of each dP computation. The power
traces are then divided into two sets based on a specific bit (the LSB in our case)
of 2P. We calculate the mean power consumption of each set. Then the absolute
value of the difference between the two means (the difference-of-mean) is calculated. The
same processing is done for 4P.
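The difference-of-mean processing described above can be sketched on synthetic data. The simulation below is not a model of the measured FPGA traces: the leakage model (one sample shifted by the LSB of the processed point, Gaussian noise elsewhere), the trace length, and all names are assumptions made purely to show how the two partitions behave.

```python
import random


def difference_of_means(traces, selector):
    """Per-sample |mean(traces with selector 1) - mean(traces with selector 0)|."""
    ones = [t for t, s in zip(traces, selector) if s == 1]
    zeros = [t for t, s in zip(traces, selector) if s == 0]
    return [abs(sum(t[j] for t in ones) / len(ones)
                - sum(t[j] for t in zeros) / len(zeros))
            for j in range(len(traces[0]))]


def simulate(num_traces=4000, samples=5, leak_at=2, leak=1.0, seed=1):
    rng = random.Random(seed)
    processed = [rng.getrandbits(1) for _ in range(num_traces)]   # LSB of a point really computed
    unrelated = [rng.getrandbits(1) for _ in range(num_traces)]   # LSB of a point never computed
    traces = []
    for bit in processed:
        trace = [rng.gauss(0.0, 1.0) for _ in range(samples)]     # measurement noise
        trace[leak_at] += leak * bit                               # data-dependent leakage
        traces.append(trace)
    return (difference_of_means(traces, processed),
            difference_of_means(traces, unrelated))


dom_hit, dom_miss = simulate()
print("partition by processed point:", [round(v, 2) for v in dom_hit])
print("partition by unrelated point:", [round(v, 2) for v in dom_miss])
```

Partitioning by a bit of a point that the device really processes leaves a peak at the leaking sample, while partitioning by a bit of a point that was never computed averages out to noise; this is the distinction the attack uses to decide each secret bit.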
[Plot: difference-of-mean (V, scale ×10^−4, range 0 to 5) versus samples (0 to 2500), with one curve for 2P and one for 4P.]
Figure 4.10: Difference-of-mean power for 2000 traces.
Figure 4.10 shows the corresponding results. The attack is based on the power
consumption of 2000 different inputs. The figure shows that the difference-of-mean for
2P gives significant peaks whereas there are no significant peaks for 4P, which identifies
d_{k−2} = 0. The difference-of-mean for 2P gives peaks because 2P actually occurred in the first
iteration (refer to Table 4.11). The same does not give any peak for 4P (it is almost nullified) because
2(2P) was never computed during the first iteration. After identifying the bit d_{k−2}, the process
is repeated for further bits. Thus, the SPA-resistant Montgomery ladder is vulnerable to
the DPA attack.
However, our modified point blinding technique has been incorporated in the Montgomery
ladder. The proposed cryptoprocessor computes the ECSM operation by executing the
DPA-resistant Montgomery ladder (Algo. 4.4.1). The Montgomery ladder in this algorithm
starts its execution with an unknown point P+R. The point R is updated at every new
execution by (−1)^b 3R with a random bit b. The randomness of our adopted random generator
is described in [108], and we assume there is no weakness in the random generator. Now,
whenever a new execution begins, the computation starts with a new unknown point.
Thus, the points processed in Algo. 4.4.1 are not deterministic. Also, there is no relation
among the points occurring in adjacent executions [157]. Every new point P changes to
some new unknown point P′ = P+R, for which no specific bit value of the target point
can be predicted. Thus, no correlation can be found between a specific bit of the target
point and the respective power consumption.
[Plot: difference-of-mean (V, scale ×10^−4, range 0 to 5) versus samples (0 to 2500), with curves for 2000, 5000, 10000, and 20000 traces.]
Figure 4.11: Difference-of-mean power for 2P on proposed cryptoprocessor.
In order to validate the DPA resistance property, we perform the above-mentioned attack
on our proposed elliptic curve cryptoprocessor. The attack on the first iteration is
performed with a maximum of 20,000 traces. Figure 4.11 shows the difference-of-mean during
the first iteration with respect to the LSB of 2P. The same for 4P is shown in Fig. 4.12.
It is observed that none of the difference-of-mean curves gives significant peaks; all are
nullified. Therefore, neither 2P nor 4P has occurred in the first iteration; some unknown
points have been processed instead. Thus, the DPA could not identify the secret bit d_{k−2},
which indicates that the proposed cryptoprocessor is secure against the DPA attack.
[Plot: difference-of-mean (V, scale ×10^−4, range 0 to 5) versus samples (0 to 2500), with curves for 2000, 5000, 10000, and 20000 traces.]
Figure 4.12: Difference-of-mean power for 4P on proposed cryptoprocessor.
4.5.4 Doubling Attack (DA)
It has already been shown that Coron's point blinding technique on the Montgomery ladder is
vulnerable to the doubling attack (see § 4.4). The DA is based on two queries: one
on some input P and the other on 2P. The DA exploits the identical point doubling
operations that occur while computing dP and d(2P). Therefore, it is essential for the doubling attack
that the target device processes exactly these two queries. In Coron's point blinding
technique the first query is processed on some P′ = P+R. The second query is processed
on 2P′ with probability 0.5, as the random point R is refreshed by ±2R.
However, in our modified point blinding technique the random point R is refreshed by
±3R. Thus, although the first query with P is processed on P′ = P+R, the second query
with 2P is processed on P′′ = 2P±3R, which is not of the form 2P′. Thus, the essential
requirement of the DA is not satisfied on our elliptic curve cryptoprocessor. The computations
of dP′ and dP′′ for d = 11001011 are shown in Table 4.12. No doubling operation
is shared between adjacent iterations of the two queries.
Hence, the computation of our modified Montgomery ladder (Algo. 4.4.1) with
the proposed point blinding technique, and thus the proposed elliptic curve cryptoprocessor,
is secure against the doubling attack.
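The effect of the refresh factor can be compared with a symbolic sketch. Here every processed point is tracked as a pair (coefficient of P, coefficient of R), which assumes that P and R are formally independent; accidental coincidences between actual curve points are ignored. The function names and the `factor` and `sign` parameters are hypothetical.

```python
def doubling_inputs(d, base):
    """Doubling-unit inputs of the ladder as (coefficient of P, coefficient of R) pairs."""
    q1 = base
    q2 = (2 * base[0], 2 * base[1])
    inputs = [q1]                                   # the initial doubling Q2 = 2*Q1
    for bit in bin(d)[3:]:
        total = (q1[0] + q2[0], q1[1] + q2[1])      # Q1 + Q2
        if bit == "1":
            inputs.append(q2)
            q1, q2 = total, (2 * q2[0], 2 * q2[1])
        else:
            inputs.append(q1)
            q1, q2 = (2 * q1[0], 2 * q1[1]), total
    return inputs


def shared_doublings(d, factor, sign=1):
    """Doubling inputs common to the query on P and the query on 2P."""
    first = doubling_inputs(d, (1, 1))                  # P blinded with R
    second = doubling_inputs(d, (2, sign * factor))     # 2P blinded with +/- factor*R
    return set(first) & set(second)


d = 0b11001011
print("refresh by 2R:", sorted(shared_doublings(d, 2)))
print("refresh by 3R:", sorted(shared_doublings(d, 3)))
```

With the factor 2 and a positive sign the second query is exactly 2(P+R), so the collisions of the doubling attack reappear; with the factor 3 the two coefficient sequences can never coincide, which is the property the modified refreshment relies on.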
A common argument one may arrive is whether the proposed scheme can defend the
75
Chapter 4 Design and Analysis of Elliptic Curve Cryptoprocessor
Table 4.12: The d(P+R) and d(2P+3R)) in the Montgomery ladder.
i di Process of d(P+R) Process of d(2P+3R))7 1 Q1 = 1+(P+R) Q1 = 1+(2P+3R)
Q2 = 2(P+R) Q2 = 2(2P+3R)6 1 Q1 = (P+R)+2(P+R) Q1 = (2P+3R)+(4P+6R)
Q2 = 2(2P+2R) Q2 = 2(4P+6R)5 0 Q2 = 3(P+R)+4(P+R) Q2 = (6P+9R)+(8P+12R)
Q1 = 2(3P+3R) Q1 = 2(6P+9R)4 0 Q2 = 6(P+R)+7(P+R) Q2 = (12P+18R)+(14P+21R)
Q1 = 2(6P+6R) Q1 = 2(12P+18R)3 1 Q1 = 12(P+R)+13(P+R) Q1 = (24P+36R)+(26P+39R)
Q2 = 2(13P+13R) Q2 = 2(26P+39R)2 0 Q2 = 25(P+R)+26(P+R) Q2 = (50P+75R)+(52P+78R)
Q1 = 2(25P+25R) Q1 = 2(50P+75R)1 1 Q1 = 50(P+R)+51(P+R) Q1 = (100P+150R)+(102P+153R)
Q2 = 2(51P+51R) Q2 = 2(102P+153R)0 1 Q1 = 101(P+R)+102(P+R) Q1 = (202P+303R)+(204P+306R)
Q2 = 2(102P+102R) Q2 = 2(204P+306R)Return Q1 = 203(P+R) Q1 = 406P+309R
DA for a second query on 3P instead of 2P. The answer is yes. In that case the second query
is processed on 3P′ with a 0.5 probability. Now, the DA basically exploits the consecutive
point doubling (ECD) operations (where an ECD computes 2P from an input P). The
original DA is based on the fundamental observation that the output of an ECD on 2P′ is
same as the output of two consecutive ECD on P′. This similarity of the two computations,
one on P′ and the other on 2P′ continues throughout the ECSM computations, and result
in the DA. However, with our modification, the same observation does not hold for P′
and 3P′. Fig. 4.13 shows an execution tree for first few iterations with all possible input
combinations of the Montgomery ladder. The execution of two queries are performed with
two different inputs P and 3P, respectively.
The execution tree shows a few similarities during the first three iterations of dP′ and d(3P′).
For example, if the first iteration of d(3P′) and the second iteration of dP′ both execute 6P′, then
the attacker can identify $\{d_{k-2}, d_{k-3}\} = \{0, 1\}$. Similarly, if the first iteration of d(3P′) and
the third iteration of dP′ both execute 12P′, then the attacker can identify
$\{d_{k-2}, d_{k-3}, d_{k-4}\} = \{1, 0, 0\}$. However, no similarities can occur in further iterations.
Because of these few similarities, it is recommended to avoid those few values of d as the
secret key. Hence, the DA cannot be performed on the proposed technique even with two
queries on P and 3P.

[Figure 4.13: binary execution tree of the Montgomery ladder. Each node lists the register
pair (Q1, Q2) as multiples of P′ for dP′ and for d(3P′); edges are labelled with the key bit 0 or 1.
The root corresponds to $d_{k-1} = 1$: "1, 2" indicates Q1 = P′, Q2 = 2P′ (first iteration of dP′) and
"3, 6" indicates Q1 = 3P′, Q2 = 6P′ (first iteration of d(3P′)). Matches are marked for
$\{d_{k-2}, d_{k-3}\} = \{\{0, 1\}, \{1, 0\}\}$ and for $\{d_{k-2}, d_{k-3}, d_{k-4}\} = \{1, 0, 0\}$. First query:
P′ = P + R, computes dP′. Second query: P″ = 3P ± 3R, computes d(3P′) with probability 0.5.]
Figure 4.13: Execution tree for doubling attack with P and 3P.
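The matches discussed above can be reproduced with a small enumeration. The following is only an illustrative sketch under stated assumptions: it tracks the integer multiples of P′ held in the two ladder registers instead of real curve points, it assumes the usual Montgomery ladder update with leading bit $d_{k-1} = 1$, and it reads "executes xP′" as "the doubling of that iteration outputs xP′". The function names are hypothetical.

```python
from itertools import product


def doubling_outputs(bits, base=1):
    """Multiples of P' output by the doubling of each ladder iteration.

    bits are the key bits after the leading one (d_{k-2}, d_{k-3}, ...);
    base is 1 for the query on P' and 3 for the query on 3P'.
    """
    q1, q2 = base, 2 * base              # register state for d_{k-1} = 1
    outputs = []
    for bit in bits:
        if bit == 0:
            q1, q2 = 2 * q1, q1 + q2     # double Q1, add into Q2
            outputs.append(q1)
        else:
            q1, q2 = q1 + q2, 2 * q2     # add into Q1, double Q2
            outputs.append(q2)
    return outputs


def matches(bits):
    """Triples (iteration of d(3P'), iteration of dP', multiple) whose doublings agree."""
    on_p = doubling_outputs(bits, 1)
    on_3p = doubling_outputs(bits, 3)
    return [(i + 1, j + 1, v)
            for i, v in enumerate(on_3p)
            for j, w in enumerate(on_p) if v == w]


if __name__ == "__main__":
    # Enumerate all key-bit prefixes of length 1 to 3 and list those with a match.
    for n in (1, 2, 3):
        for bits in product((0, 1), repeat=n):
            found = matches(bits)
            if found:
                print(bits, found)
```

Under these assumptions, the prefix (0, 1) yields a match on 6P′ and the prefix (1, 0, 0) a match on 12P′, in line with the two cases named in the text.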
4.5.5 Security of the Random Generator
Let us consider an operation P+R performed on a public parameter P and a
private parameter R. As per [108], the security of the randomized ECSM operation depends
on the privacy of the point P+R. Thus, the system is vulnerable if an attacker can
find R. A side-channel attack (SCA) in this regard would try to find the
unknown R by exploiting the P+R operations.
A DPA attack on (A+B) in $\mathbb{F}_{2^m}$ is shown in [40]. That field addition is performed
by an XOR (linear) operation. The power consumption of such an operation is directly related
to a specific bit of the output and inputs, and thus a correlation is found. In the case of
point addition (P+R), however, the output is not linearly related to the inputs; it is computed by
executing a sequence of finite field operations. Hence, the correlation between the power
consumption and a specific bit of the inputs and output, from which the unknown input bit
is guessed, cannot be found in this case, and the operation is secure against the above attack.
Let us assume that a new side-channel attack manages to exploit a vulnerability of the
(P+R) operation. A common requirement of any powerful side-channel attack is
that the same operation must be computed a sufficient number of times (say 4000 times)
on the same private parameter (R) with a varying public parameter (P). This is not
possible in the proposed technique: the private parameter R is changed randomly by the
proposed device at every new execution within a session.
However, if the (S,R) pair is fixed for a user, then the operation (P+R) is performed on the
same R at the beginning of each session. Thus, 4000 sessions can give sufficient
side-channel information to mount that attack. To protect against this vulnerability, the
protocol must be designed in such a way that, at the end of a session, the updated (S,R) pair
is sent back to the user through some secure channel. The latest (S,R) pair is then used for
the next session of that user. Therefore, using our proposed technique with the above
high-level protocol completely avoids repeated execution of the (P+R) operation on the same R,
which ensures the security of the proposed random generator against side-channel attacks.
4.6 ECSM Implementation Result and Comparison
We have implemented the ECSM units on FPGA platforms. The design has been done
in Verilog (HDL). The synthesis, mapping, placement, and routing have been done with
Xilinx ISE 7.1i. Simulations at different levels have been performed on the ModelSim XE III 6.0a
simulator. Table 4.13 shows the post place-and-route results of ISE for three different bit
sizes. The target device is a Xilinx Virtex-II Pro FPGA. The 192-bit dual-core implementation
computes an ECSM in 4.47 ms running at 43 MHz. The 192-bit cryptoprocessor uses
8 972 slices and 3 127 flip-flops. The estimated equivalent gate count of the 192-bit
implementation is 133 685.
The dual PGAU-core elliptic curve cryptoprocessor has been implemented on different
FPGA platforms. The total slice area consumption is almost the same across the different
FPGA devices. However, because of the speed grade factor, the frequency of the design changes,
and thus the performance varies with the device. Table 4.14 shows the performance of the
proposed cryptoprocessor on different FPGA platforms.
Table 4.13: Hardware cost and time of ECSM operation on Virtex-II Pro FPGA.

Prime p   Frequency   Area                                             ECSM Time
(bits)    (MHz)       Slice     LUT       FF      Equivalent gate      (ms)
Single PGAU-core implementation
192       43           4 463     8 791    1 147    65 874               7.92
224       40           5 226     9 344    1 253    79 051              11.55
256       36           6 102    10 561    1 512    87 024              16.71
Dual PGAU-core implementation
192       43           8 972    15 394    3 127   133 685               4.47
224       40          10 386    17 914    3 513   154 861               6.50
256       36          11 953    20 779    3 873   177 534               9.38

Table 4.14: Performance of dual PGAU-core implementation on different FPGA.

bit    Spartan-III            Virtex-II Pro          Virtex-IV
       Frequency   Time       Frequency   Time       Frequency   Time
       (MHz)       (ms)       (MHz)       (ms)       (MHz)       (ms)
192    32           6.00      43          4.47       61          3.15
224    28           9.27      40          6.50       58          4.49
256    24          14.05      36          9.38       54          6.26
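The dependence of the ECSM time on the clock frequency alone can be checked directly from Table 4.14. The sketch below is not part of the original evaluation; it only multiplies the tabulated frequency and time to obtain an approximate cycle count per ECSM, and the names `TABLE_4_14`, `cycle_count`, and `cycle_spread` are hypothetical.

```python
# (frequency in MHz, ECSM time in ms) per device, copied from Table 4.14.
TABLE_4_14 = {
    192: {"Spartan-III": (32, 6.00), "Virtex-II Pro": (43, 4.47), "Virtex-IV": (61, 3.15)},
    224: {"Spartan-III": (28, 9.27), "Virtex-II Pro": (40, 6.50), "Virtex-IV": (58, 4.49)},
    256: {"Spartan-III": (24, 14.05), "Virtex-II Pro": (36, 9.38), "Virtex-IV": (54, 6.26)},
}


def cycle_count(freq_mhz, time_ms):
    """Clock cycles = (f * 1e6 Hz) * (t * 1e-3 s) = f * t * 1000."""
    return freq_mhz * time_ms * 1000.0


def cycle_spread(bits):
    """Relative spread of the cycle count across the three devices."""
    counts = [cycle_count(f, t) for f, t in TABLE_4_14[bits].values()]
    return (max(counts) - min(counts)) / min(counts)


if __name__ == "__main__":
    for bits, devices in TABLE_4_14.items():
        for name, (f, t) in devices.items():
            print(bits, name, round(cycle_count(f, t)))
        print(bits, "relative spread:", cycle_spread(bits))
```

For each field size the cycle count stays within one percent across the three devices, which is consistent with the statement that only the achievable frequency changes between platforms.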
The utilization of area for performing the ECSM operation is shown in Table 4.15. It is
assumed that the multiplication unit consumes only 1/6 of the slices of the inversion/division
unit (refer to Table 4.5). The PPU1 proposed in [23] consists of one multiplier and one divider,
which are utilized for 82.3% and 65.9% of the total time required for an ECSM computation.
The total area utilization of PPU1 for computing the ECSM operation is only 68.2%. Similarly,
PPU2 in [23] consists of two multipliers and two dividers, which are utilized for 39.4%, 59.1%,
39.4%, and 39.4% of the time during the ECSM operation. Its total area utilization is 40.9% on
average during the ECSM operation. In the proposed design, by contrast, 2/3 of the area (DP1
and DP3 except the subtractor) is utilized 100% of the time and the other 1/3 (DP2 and the
subtractor in DP3) is utilized only 40% of the time. Thus, the proposed design utilizes 80.0%
of the total area on average during the execution of the above operation.
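The utilization figures for PPU1 and the proposed design follow from an area-weighted average of the per-unit utilizations. The sketch below is an illustration under two assumptions: the divider is weighted six times the multiplier (from the 1/6 slice ratio above), and the percentages of PPU1 are taken in the order multiplier, divider. The function name is hypothetical.

```python
def area_utilization(units):
    """Area-weighted average utilization in percent.

    units is a list of (relative_area, utilization_percent) pairs.
    """
    total_area = sum(area for area, _ in units)
    return sum(area * util for area, util in units) / total_area


# PPU1 [23]: one multiplier (relative area 1) and one divider (relative area 6).
ppu1 = area_utilization([(1, 82.3), (6, 65.9)])

# Proposed design: 2/3 of the area busy 100% of the time, 1/3 busy 40%.
proposed = area_utilization([(2, 100.0), (1, 40.0)])

if __name__ == "__main__":
    print("PPU1:", round(ppu1, 1))
    print("Proposed:", round(proposed, 1))
```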
Table 4.15: Area utilization of designs in ECSM operation.

PPU1 [23]    PPU2 [23]    Proposed
68.2%        40.9%        80.0%

Table 4.16 shows the performance comparison of elliptic curve scalar multiplication.
The design reported in [31] supports dual-field operations with a relatively higher
hardware area (40 219 slices), and thus we have not explicitly compared it with the proposed
design. It may be observed that the designs are implemented on different platforms and
use different resources. Thus, a straightforward comparison is not fair. However,
we analyze them individually against our proposed elliptic curve cryptoprocessor. The embedded
multi-core design proposed by Fan et al. [58] consumes 0.35 times the slice area of the
proposed design, but along with slices, the design [58] uses 6 Block RAMs (BRAMs),
which is equivalent to 6 × 18 Kb of RAM, and sixteen 18-bit dedicated multipliers. On the
same Virtex-II Pro platform, the proposed design gives 2.21 times better throughput than [58]
for the ECSM operation. The design described in [74] consumes 15 755
slices and computes one point multiplication in 3.86 ms. The design proposed by Sakiyama et
al. [70] uses 1.2 times the slice area along with 9 × 18 Kb of RAM and two 256-bit dedicated
modular arithmetic logic units. The 256-bit implementation of [70] takes 2.70 ms, which
gives 2.20 times the throughput of the proposed 192-bit design. The ECSM implementation
reported in [87] gives only 0.74 times the throughput of the proposed elliptic curve
cryptoprocessor, in a smaller area.
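The throughput values behind these ratios appear to be the field size divided by the ECSM time (bits per millisecond, i.e. Kbps). The short sketch below recomputes them from the times and field sizes of Table 4.16 under that assumption; the function name is hypothetical.

```python
def throughput_kbps(bits, time_ms):
    """Bits processed per millisecond, which equals kilobits per second."""
    return bits / time_ms


proposed = throughput_kbps(192, 4.47)   # proposed design, 192-bit
fan = throughput_kbps(192, 9.90)        # Fan et al. [58], 192-bit
sakiyama = throughput_kbps(256, 2.70)   # Sakiyama et al. [70], 256-bit
shuhua = throughput_kbps(192, 6.00)     # design of [87], 192-bit

if __name__ == "__main__":
    for name, value in [("Proposed", proposed), ("Fan", fan),
                        ("Sakiyama", sakiyama), ("Shuhua", shuhua)]:
        print(name, round(value, 1), "Kbps")
    print("Proposed / Fan:", proposed / fan)
    print("Sakiyama / Proposed:", sakiyama / proposed)
    print("Shuhua / Proposed:", shuhua / proposed)
```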
The design proposed in [31] provides the maximum throughput among the CMOS as
well as the FPGA implementations. The design proposed in [58] consumes the minimum amount
of slice area on FPGA among the designs that are also able to perform the division operation.
Our design makes a trade-off in this respect, and it has the additional feature that it
resists non-differential and differential timing and power attacks. Figure 4.14 gives a
pictorial view of the performance of the current design and the designs that were proposed by
Sakiyama et al. [70] and Fan et al. [58]. The different resources inside the Virtex-II Pro FPGA
device that are used by the related designs are shown in four groups. The major features
that make the current design superior to existing designs are as follows:
• The whole architecture has inherent programmability. That is, it supports all primes
that fit within the given lengths (192, 224, 256 bits). Therefore, it can be used to process
a larger number of curves defined over GF(p).
• The current design is a memoryless design. It does not use any block RAM of the
FPGA. RAM cells consume more power than logic cells [35]. Thus, in
low power applications our design is more useful than the designs with memory.
• It efficiently optimizes the area × time per bit value for the ECSM operation.
• The proposed design is secure against known timing and power attacks.

Table 4.16: Performance comparison of elliptic curve scalar multiplication over arbitrary prime fields.

Reference                  Prime p  Device          Frequency  Area                                 Time    Throughput
                           (bits)                   (MHz)                                           (ms)    (Kbps)
Proposed*                  192      Virtex-II Pro    43        8 972 Slice + 3 127 FF               4.47     43.0
Ananyi [14], 2009‡         192      Virtex-II Pro    49        20 793 Slice, 32 Mult. (18-bit)      7.24     26.5
Schinianakis [15], 2009‡   192      Virtex-E         −         25 012 LUT                           3.54     54.2
Lai [31], 2008             192      Virtex-II Pro    94        40 219 Slice                         1.25    153.6
Fan [58], 2007§            192      Virtex-II Pro    93        3 173 Slice + 16 Mult. + 6 BRAM      9.90     19.4
Sakiyama [70], 2006        256      Virtex-II Pro   100        10 847 Slice + 9 BRAM                2.70     94.8
McIvor [74], 2006          256      Virtex-II Pro    39        15 755 Slice, 256 Mult. (18-bit)     5.99     42.7
Shuhua [87], 2005†         192      Virtex-II        50        2 365 Slice + 2 BRAM + 1 147 FF      6.00     32.0
Lai [16], 2009             160      CMOS 0.13 µm    121        170 Kgates                           0.61    262.3
Lai [31], 2008             160      CMOS 0.13 µm    217        151 Kgates                           0.34    470.6
Chen [55], 2007            256      CMOS 0.13 µm    556        122 Kgates                           1.01    253.5
Satoh [119], 2003          256      CMOS 0.13 µm    138        120 Kgates                           2.68     95.5
− * Implementation itself is secure against timing and power attacks.
− ‡ Implementation supports fixed NIST primes only.
− § It supports modular multiplication of arbitrary length.
− † Does not include divider and does not compute the result in affine co-ordinates.
[Figure 4.14: bar charts comparing our design, Fan [58], and Sakiyama [70] in four resource
groups (Slice, Flip Flops, BRAM, Multiplier), together with the ECSM computation time
(4.47 ms, 9.90 ms, and 2.70 ms, respectively).]
Figure 4.14: Performance of related designs with respect to area and time.
For a detailed comparison, let us consider one LUT to be equivalent to a 16×1 RAM [34] and
one BRAM to consist of 1024×18 bits of RAM; hence, one BRAM is equivalent to 576 slices.
For simplicity, we also consider one 18-bit multiplier to be equivalent to 197 slices [36]. The
aforementioned equivalence relations are used to calculate the equivalent slice area
required to implement the related designs. Table 4.17 shows the equivalent slice area and the
corresponding comparative parameter, the area × time per bit value. Compared to the other
designs, the proposed cryptoprocessor holds the second best position with respect to this
parameter, just after the design in [70]. However, the design of [70] contains memory
elements and does not provide security against any side-channel attack. Except for
the current design, all the designs implement the normal double-and-add scalar multiplication
algorithm. This algorithm is susceptible to simple side-channel analysis, such as simple
power attacks (SPA) [155]. The proposed architecture, however, protects the secret multiplier
against these vulnerabilities. This incurs an additional cost of 30% more operations on
average and requires an extra 6k-bit register.
Table 4.17: Performance comparison of different designs.

Reference            EqS       ECSM Time   EqS×Time/bit   TPAR   BRAM
                               (ms)
Proposed              9 072    4.47*         211          Yes    No
Ananyi [14]          27 097    7.24         1022          No     No
Lai [31]             40 219    1.25          262          No     No
Schinianakis [15]    13 500    3.54          249          No     No
Fan [58]              9 781    9.90          504          No     Yes
McIvor [74]          15 755    3.86          238          No     No
Sakiyama [70]        16 031    2.70          169          No     Yes
− EqS: equivalent slice. TPAR: timing and power attack resistant.
− * TPAR property needs 30% additional operations and 6k bit extra register.
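The equivalence relations stated before Table 4.17 can be applied mechanically. The sketch below is illustrative: it assumes two LUTs per slice (a Virtex-II Pro column holds 16 slices or 32 LUTs), derives the 576-slice BRAM equivalent from that, and recomputes the equivalent slices and the area × time per bit value for the three designs whose resources are fully listed in Table 4.16. All identifiers are hypothetical.

```python
LUT_RAM_BITS = 16            # one LUT is taken as a 16x1 RAM
LUTS_PER_SLICE = 2           # assumption: two LUTs per Virtex-II Pro slice
BRAM_BITS = 1024 * 18        # one BRAM is taken as 1024 x 18 bits
SLICES_PER_BRAM = BRAM_BITS // LUT_RAM_BITS // LUTS_PER_SLICE
SLICES_PER_MULT = 197        # one 18-bit multiplier, as assumed in the text


def equivalent_slices(slices, mults=0, brams=0):
    """Slices plus the slice equivalents of multipliers and BRAMs."""
    return slices + mults * SLICES_PER_MULT + brams * SLICES_PER_BRAM


def area_time_per_bit(eqs, time_ms, bits):
    """Comparative parameter: equivalent slices x ECSM time / field size."""
    return eqs * time_ms / bits


ananyi = equivalent_slices(20793, mults=32)
fan = equivalent_slices(3173, mults=16, brams=6)
sakiyama = equivalent_slices(10847, brams=9)

if __name__ == "__main__":
    print("slices per BRAM:", SLICES_PER_BRAM)
    print("Ananyi:", ananyi, round(area_time_per_bit(ananyi, 7.24, 192)))
    print("Fan:", fan, round(area_time_per_bit(fan, 9.90, 192)))
    print("Sakiyama:", sakiyama, round(area_time_per_bit(sakiyama, 2.70, 256)))
```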
Figure 4.15 presents the ECSM computation time versus the equivalent slice area of the related
designs. The area × time per bit value of our design is only 0.20 times that of the
design of Ananyi et al. [14], 0.77 times that of the design of Lai et al. [31], 0.82
times that of the design of Schinianakis et al. [15], 0.42 times that of the design
of Fan et al. [58], 0.64 times that of the design of McIvor et al. [74], and 1.2 times
that of the design of Sakiyama et al. [70]. Therefore, the above comparison shows
that our design optimizes the area × time per bit value along with the inherent timing and
power attack resistant property.

[Figure 4.15: scatter plot of equivalent slice area (vertical axis, 5000 to 45 000) against ECSM
execution time in ms (horizontal axis, 1 to 10), with one point each for Our, Fan, Sakiyama,
McIvor, Lai, Ananyi, and Schinianakis.]
Figure 4.15: Performance of related designs.

4.7 Conclusion
The chapter has investigated the scope of hardware sharing among the finite field
primitives. The reduced area of the programmable GF(p) arithmetic unit has allowed the use of
dual cores to accelerate the ECSM operation. It has been shown that the proposed
cryptoprocessor performs one 192-bit ECSM operation in 4.47 ms. The proposed architecture has
the additional advantage of being memoryless. Most importantly, the proposed
elliptic curve cryptoprocessor provides security against timing and power attacks.
Experimental results have been furnished to show that, among timing and power attack
resistant elliptic curve scalar multipliers on GF(p), the proposed design has the best
area × time per bit value.
In the next chapter we focus on further performance improvement of our current
PGAU and ECSM cryptoprocessor on the FPGA platform. The work explores the in-built
features of an FPGA device for designing optimized architectures for arithmetic operations,
based on which we measure the performance gain of our current design.
Chapter 5
Fast Prime Field Adders and Multipliers
on FPGA Platform
Finite field addition and multiplication are the most important operations in
cryptography. Efficient techniques for these operations greatly affect the overall
performance of a cryptoprocessor. The development of such a cryptoprocessor for bilinear
pairings is one of the major objectives of this thesis. The bilinear pairings in cryptography
are computed on elliptic or hyperelliptic curves that are defined over finite fields. Their
security and computational efficiency depend on the underlying curve and finite field. We call
them the pairing-friendly curve and the pairing-friendly field, respectively. The selection
of curves and the design of an efficient cryptoprocessor for pairing computation are addressed
in the next chapter. This chapter deals with efficient design techniques for
prime field (Fp) arithmetic on the FPGA platform.
The efficient computation of addition/subtraction and multiplication in Fp is the main
objective of this chapter. We propose high-speed prime field adder and multiplier circuits
on the FPGA platform. The work shows the efficient utilization of the in-built fast carry chains
of an FPGA device for developing a high-speed adder circuit. It follows the Karatsuba
decomposition technique for computing the above operations. Experimental results
show that, owing to the optimized addition chains, Karatsuba decomposition up to a particular
level improves the performance, whereas further decomposition degrades it.
Subsequently, the chapter modifies the existing interleaved multiplication algorithm
using the Montgomery ladder. The modified algorithm improves the scope of parallelism.
It also provides security against non-differential side-channel attacks. The
experimental results show that the proposed design provides a 70% speedup over the best known
existing design. An actual power analysis has been performed to show its security against
the non-differential power attack, which is also known as simple power analysis (SPA).
Finally, the chapter redesigns the programmable GF(p) arithmetic unit (PGAU) and
the elliptic curve scalar multiplication (ECSM) cryptoprocessor based on the proposed
fast adder circuits. The experimental results show that the modified designs achieve a 30%
speedup over the designs reported in the previous chapter.
5.1 Introduction
The high security required by present-day electronic applications is mostly
provided by protocols that are developed with RSA, elliptic curve, and pairing based
cryptography. Modular addition and multiplication are the fundamental operations in all
such public key schemes. In the case of elliptic curves and pairings, the key sizes are
smaller than in RSA; they are around 256 bits long for achieving the 128-bit security level. Thus,
the efficient design of modular (or finite field) addition and multiplication on large operands is
one of the objectives of this thesis. The performance gain from these optimized finite field
primitives further helps to design a highly optimized processor for pairing based cryptography.
Various techniques exist to improve the efficiency of addition. Some of the known
techniques are carry select, carry lookahead, and Brent-Kung carry. All of these techniques
optimize the length of the carry chain in order to improve the efficiency of an
addition. However, none of them considers the routing complexity of the respective adder
circuits in hardware. This chapter experimentally shows that the routing delay of these
addition techniques on an FPGA platform is almost the same as the logic delay.
On the other hand, modern FPGAs provide special carry logic for addition. The carry
chains formed by the in-built carry logic are 32 bits long. This chapter shows that, for a
32-bit addition, the carry propagation adder based on the in-built carry logic provides the
minimum latency compared to all other addition techniques. Subsequently, it develops a
hierarchical adder structure for large operands using the above fast carry chain (FCC). The
experimental results show that the proposed technique significantly reduces the routing
delay as well as the logic delay compared to the existing techniques. For a 256-bit adder, the
proposed technique provides a 35% speedup over the best known technique on an
FPGA platform.
Another major operation in any public key scheme is finite field multiplication.
The Fp-multiplication algorithms are based either on iterative interleaved additions or on
multiplication followed by reduction. The reduction in Fp requires division by a large
prime. The Montgomery partial reduction algorithm [176] avoids this division in
Fp. The only problem with such an algorithm is that the operands as well as the results are
in a specific format called Montgomery numbers. The operands first need to be converted into
this form before the multiplication is performed. Thus, Montgomery reduction is
useful for exponentiation-like algorithms, where multiplications are computed repeatedly
before producing the final result. Some of the existing architectures for implementing
multiplication by Montgomery reduction are shown in [44,79,80,95–99,106,110,124,158].
The best known design so far is proposed in [44], which computes a multiplication in only
0.38 µs in the Montgomery domain. However, it uses thirty-three 18-bit in-built multipliers and
1704 slices, i.e., 8000 slices of an FPGA device in total, which makes it costly.
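To make the role of the Montgomery number format concrete, the following is a minimal textbook sketch of Montgomery reduction on Python integers. It is not the architecture of [44] or of any cited design; the modulus (the NIST P-192 prime), the choice R = 2^192, and all function names are assumptions made for illustration. The three-argument `pow` with exponent −1 requires Python 3.8 or later.

```python
P192 = 2**192 - 2**64 - 1    # example odd modulus (NIST P-192 prime)
K = 192                      # Montgomery radix R = 2^K, with R > p


def mont_setup(p, k):
    """Return n' = -p^(-1) mod R, needed by the reduction step."""
    r = 1 << k
    return (-pow(p, -1, r)) % r


def redc(t, p, k, n_prime):
    """Montgomery reduction: t * R^(-1) mod p for 0 <= t < p * R, without dividing by p."""
    mask = (1 << k) - 1
    m = ((t & mask) * n_prime) & mask    # m = t * n' mod R
    u = (t + m * p) >> k                 # exact division by R
    return u - p if u >= p else u


def to_mont(a, p, k):
    """Convert a into Montgomery form a * R mod p (this step does need a division)."""
    return (a << k) % p


def mont_mul(x, y, p, k, n_prime):
    """Product of two Montgomery numbers, again in Montgomery form."""
    return redc(x * y, p, k, n_prime)


def from_mont(x, p, k, n_prime):
    """Convert a Montgomery number back to the ordinary representation."""
    return redc(x, p, k, n_prime)


if __name__ == "__main__":
    n_prime = mont_setup(P192, K)
    a, b = 0x1234567890ABCDEF, 0xFEDCBA0987654321
    x, y = to_mont(a, P192, K), to_mont(b, P192, K)
    z = from_mont(mont_mul(x, y, P192, K, n_prime), P192, K, n_prime)
    print(z == (a * b) % P192)
```

The conversions in and out of Montgomery form are the overhead mentioned above; they pay off only when many multiplications are chained between them.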
The Fp multiplication based on interleaved addition chains was introduced by
Blakley [179, 180] at about the same time as Montgomery reduction. The algorithm is described in
§ 2.1.2. It is an iterative algorithm that follows the standard double-and-add
procedure on 2's complement numbers. At every iteration it reduces the intermediate result,
which therefore always remains below the modulus. However, the main difficulty of this
algorithm is the computation of additions on large operands. The latency of the addition
operations increases linearly with the length of the carry chain. Thus, a carry propagation
adder circuit is inefficient for developing an Fp-multiplier based on the interleaved
multiplication algorithm for large operands.
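The double-and-add structure described above can be sketched as follows. This is a generic textbook variant of interleaved modular multiplication, written for illustration; it is not a transcription of the algorithm given in § 2.1.2, and the function name and the test modulus are assumptions.

```python
def interleaved_mul(a, b, p, k):
    """Compute (a * b) mod p by interleaving doubling, addition, and reduction.

    Scans the k bits of a from the most significant bit; requires 0 <= a, b < p.
    """
    r = 0
    for i in reversed(range(k)):
        r = 2 * r                 # doubling step
        if r >= p:                # r < 2p here, so one subtraction suffices
            r -= p
        if (a >> i) & 1:
            r = r + b             # addition step
            if r >= p:            # again r < 2p
                r -= p
    return r


if __name__ == "__main__":
    p = 2**192 - 2**64 - 1        # example modulus (NIST P-192 prime)
    a = 0x0123456789ABCDEF0123456789ABCDEF
    b = 0x0FEDCBA9876543210FEDCBA987654321
    print(interleaved_mul(a, b, p, 192) == (a * b) % p)
```

Every iteration performs k-bit additions and subtractions, which is why the adder latency dominates the cost of this algorithm in hardware.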
Some modifications of the above algorithm for reducing the multiplication latency due
to the carry chains have been made in [121] and [25]. The modifications are based on the carry
save adder (CSA). They either use a sign estimation technique [25]
or some pre-computed values [121]. Both techniques require some additional
computations, which need additional circuitry and memory elements in hardware. The
pre-computations required to perform (A · B) mod p by the modified algorithms
depend on the multiplicand A. The advantage of pre-computation can be exploited efficiently
in an application where repeated multiplications are performed with a fixed multiplicand
and a varying multiplier. In our applications, such as elliptic curves or pairings, this is
not the case. Thus, the pre-computation cost is added directly to the cost of the multiplication
procedure in elliptic curve and pairing based cryptographic applications. Another disadvantage
of the above techniques is that the carry save adder inherently requires one carry propagation
adder for computing the final output. Therefore, better techniques, more suitable
for applications like elliptic curves and pairings, could be explored.
This chapter proposes an efficient architecture for the interleaved multiplication algorithm
on an FPGA platform using the in-built fast carry logic. It shows different levels of decomposition
of the operands and the respective parallelism in executing the interleaved multiplication algorithm.
For further speedup we modify the above algorithm by exploring the scope of parallelism
within an iteration. The Montgomery ladder technique [125] (see § 2.1.4) is exploited
for this modification. The doubling and addition operations within an iteration of our
modified algorithm are independent of each other; thus, they can be computed in parallel.
Moreover, both operations are computed at every iteration, which provides a
balanced execution and security against non-differential side-channel attacks [156].
Subsequently, a two-level parallel architecture based on the modified algorithm is proposed. The
experimental results show that the proposed design gives the best performance, considering
the final reduction step, for multiplication in Fp for a large prime p.
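The idea of a ladder-style iteration with one doubling and one addition that do not depend on each other can be illustrated as below. This is only a generic sketch of how the Montgomery ladder pattern carries over to modular multiplication; it is an assumption made for illustration and should not be read as the modified algorithm proposed in this chapter. The function name is hypothetical.

```python
def ladder_mul(a, b, p, k):
    """Compute (a * b) mod p with a Montgomery-ladder style iteration.

    Invariant: r1 = r0 + b (mod p), and r0 = (bits of a scanned so far) * b (mod p).
    Every iteration performs exactly one addition and one doubling, whatever the bit.
    """
    r0, r1 = 0, b % p
    for i in reversed(range(k)):
        s = r0 + r1                       # addition: uses the old r0 and r1
        if s >= p:
            s -= p
        if (a >> i) & 1:
            d = 2 * r1                    # doubling: independent of s
            if d >= p:
                d -= p
            r0, r1 = s, d
        else:
            d = 2 * r0                    # doubling: independent of s
            if d >= p:
                d -= p
            r0, r1 = d, s
    return r0


if __name__ == "__main__":
    p = 2**192 - 2**64 - 1                # example modulus (NIST P-192 prime)
    a = 0x0123456789ABCDEF0123456789ABCDEF
    b = 0x0FEDCBA9876543210FEDCBA987654321
    print(ladder_mul(a, b, p, 192) == (a * b) % p)
```

Since the sum s and the doubled value d both read only the previous register contents, a hardware implementation could evaluate them concurrently, and the operation sequence is the same for a 0 bit and a 1 bit.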
In order to demonstrate the efficiency gained by the proposed techniques on an FPGA
platform, we redesign the PGAU and ECSM hardware described in the previous
chapter. The old adder circuits are replaced by the proposed high-speed adders in the
new designs. The experimental results show that a significant performance improvement has
been achieved owing to the high-speed adder. The major contributions of the current chapter
can be identified as:
1. High speed adder. It explores the utilization of the in-built carry logic of an FPGA
device and proposes a hierarchical structure for developing a high-speed adder circuit
for large operands.
2. Parallel Fp–multiplier. It proposes a modification of the interleaved multiplication
algorithm for improving the scope of parallelism. Subsequently, we propose a parallel
iterative architecture based on the proposed modification and the high-speed adders. The
proposed architecture exploits parallelism at two levels: the addition
level and the algorithmic level. Extensive experimental results have
been furnished to show its performance improvement over existing designs and its
security against non-differential timing and power attacks.
3. Speedup of elliptic curve cryptoprocessor. The PGAU and ECSM hardware
described in Chapter 4 are redesigned based on the proposed high-speed adder.
Subsequently, the chapter shows the respective performance gains achieved through our
proposed technique.
The outline of the chapter is as follows. The chapter starts with the description of an
efficient adder circuit on the FPGA platform and a comparison of its performance with the best
known technique. The modification of the interleaved multiplication algorithm, the respective
architecture, its implementation costs, its performance, and its security against power
attacks are described thereafter. The chapter then shows the performance gains of the
new PGAU and elliptic curve cryptoprocessor compared to their old designs presented in
Chapter 4.
5.2 Fast Additions on FPGA
Addition is one of the fundamental operations in computing any cryptographic
algorithm. It is also the major operation in interleaved multiplication. This section
develops an efficient adder unit for FPGA platforms in general, although the discussion
focuses on Xilinx Virtex-II Pro FPGA devices. Modern FPGAs support
a ripple carry chain of at most 32 bits [169]. A carry chain is placed in one row of the
FPGA, and it interfaces with all the FPGA cells in that row. Each of the 32-bit carry chains
supports k-bit carry computations placed at any point within a carry chain resource, where
k ≤ 32. Thus, an addition of up to 32 bits in an FPGA device using a ripple-carry (or
carry-propagation) adder (CPA) requires the lowest routing complexity compared to other adders,
including Brent-Kung [181] and carry lookahead adders.
Table 5.1 shows the performance of 32-bit adders on a Virtex-II Pro FPGA. In the table,
the total delay (TD) comprises the input buffer delay, the computation delay, and the output
buffer delay. Typically, in a Virtex-II Pro FPGA device the input buffer delay is 1.452 ns and the
output buffer delay is 2.851 ns. Because of these two buffers, the actual addition time of
each adder is 4.303 ns less than the time shown in the TD column. It is observed that
the 32-bit carry propagation adder gives the best performance as well as the lowest area among
all 32-bit adders on the FPGA platform. There are 16 slices (or 32 LUTs) in a single column
of a Virtex-II Pro FPGA device. Each LUT in a column has special carry logic
that is directly connected to the next adjacent LUT. Therefore, carry propagation over up
to 32 bits incurs the minimum routing delay, which makes this the best adder circuit for 32-bit
additions on an FPGA platform.
Table 5.1: Performance of different 32-bit adders on Virtex-II Pro FPGA.

Adder Type           LD (ns)   RD (ns)   TD (ns)   Slice   LUT
Carry Propagation    5.882     0.799     6.621      16      32
Brent-Kung Carry     5.395     4.174     9.569      84     151
Lookahead Carry      5.271     4.017     9.288      69     127
Carry Select         5.563     4.321     9.884      82     149
− LD, RD, and TD stand for Logic Delay, Route Delay, and Total Delay.
In applications like elliptic curve and pairing computations it is essential to perform
additions on large operands that are much longer than 32 bits. In such applications,
256-bit operands are very common. Let us now observe the performance of different
adders for such a large bit length. We have seen that for a 32-bit addition on an
FPGA platform, the carry propagation adder provides the best performance. From this point on,
we call it the 32-bit fast carry chain (FCC). We design various adder circuits for 256-bit
addends based on the FCC. The 256-bit addends are broken into eight 32-bit parts, and
we use them as basic units.
Table 5.2: Performance of 256-bit adders based on 32-bit FCC on Virtex-II Pro.

Adder Type           LD (ns)   RD (ns)   TD (ns)   Slice   LUT
Carry Propagation    14.058    0.799     14.857    128     256
Brent-Kung Carry      7.514    6.120     13.634    373     712
Lookahead Carry       8.574    5.358     13.932    486     910
Carry Select          7.514    6.460     13.974    373     711
− LD, RD, and TD stand for Logic Delay, Route Delay, and Total Delay.
Let us assume
$$A = \big((a_7 2^{32} + a_6)2^{64} + (a_5 2^{32} + a_4)\big)2^{128} + \big((a_3 2^{32} + a_2)2^{64} + (a_1 2^{32} + a_0)\big)$$
and
$$B = \big((b_7 2^{32} + b_6)2^{64} + (b_5 2^{32} + b_4)\big)2^{128} + \big((b_3 2^{32} + b_2)2^{64} + (b_1 2^{32} + b_0)\big),$$
where each of the $a_i$, $b_i$ is a 32-bit word. An addition $a_i + b_i$ is performed by a 32-bit FCC. The
output is
$$S = \big((s_7 2^{32} + s_6)2^{64} + (s_5 2^{32} + s_4)\big)2^{128} + \big((s_3 2^{32} + s_2)2^{64} + (s_1 2^{32} + s_0)\big),$$
and the final carry output is $c_{out}$. The different adder circuits (carry propagation, Brent-Kung carry,
lookahead carry, and carry select) are designed like their respective 8-bit structures [169],
with each single-bit addition replaced by a 32-bit FCC. A comparative performance study
of these adders is provided in Table 5.2. Among the known techniques,
the Brent-Kung carry adder with 32-bit FCC provides the best performance, taking only
13.634 ns to perform a 256-bit addition on a Virtex-II Pro FPGA. It is observed that the
routing delay of each of the structures except the carry propagation adder is significantly high.
In the following subsection, we propose an adder structure on the FPGA platform that
requires less routing delay and provides the maximum speed for large addends.
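The nested representation of a 256-bit addend in terms of eight 32-bit words can be checked with a few lines of code. The sketch below is illustrative; the function names are hypothetical.

```python
MASK32 = (1 << 32) - 1


def split_words(x):
    """Return the eight 32-bit words [a0, ..., a7] of a 256-bit integer."""
    return [(x >> (32 * i)) & MASK32 for i in range(8)]


def recompose(w):
    """Rebuild the integer with the nested 32/64/128-bit grouping used in the text."""
    low = (((w[3] << 32) + w[2]) << 64) + ((w[1] << 32) + w[0])
    high = (((w[7] << 32) + w[6]) << 64) + ((w[5] << 32) + w[4])
    return (high << 128) + low


if __name__ == "__main__":
    x = 0x0123456789ABCDEF_FEDCBA9876543210_0F1E2D3C4B5A6978_8796A5B4C3D2E1F0
    words = split_words(x)
    print([hex(w) for w in words])
    print(recompose(words) == x)
```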
5.2.1 Proposed Addition Technique
Our addition technique follows the Karatsuba decomposition [117]. Normally, the
Karatsuba technique is used to compute the multiplication of two large operands. We
exploit the same decomposition for addition. Let us consider the addition (A+B) of two 256-bit
addends A and B. The addends are broken up as shown in Fig. 5.1. They are decomposed up to
level 3 (height of the decomposition tree, h = 3). Further decomposition up to
level 8 (the individual bit level) is possible for a 256-bit addend. Each level of decomposition
increases the scope of parallelism in computing the addition. However, it has already been shown
that for a 32-bit addition on the FPGA platform, the best adder unit is formed by the 32-bit FCC.
Thus, we stop the decomposition at level 3 when designing the proposed 256-bit adder.
[Figure 5.1: decomposition tree. The 256:256-bit addition A + B splits into the halves L and H
(128:128 bits each); these split into LL, LH, HL, and HH (64:64 bits each), which in turn split into
LLL, LLH, LHL, LHH, HLL, HLH, HHL, and HHH (32:32 bits each).]
Figure 5.1: The Karatsuba decomposition of 256-bit addends.
We use the Add256bit routine shown in Algo. 5.3 for computing the addition A+B.
The algorithm computes the addition of 256-bit addends hierarchically. It calls three
Add128bit routines, which follow the definition given in Algo. 5.4. The first call of the
Add128bit routine at step 1 executes the L part of Fig. 5.1, whereas the next two calls at
steps 2 and 3 execute the H part with carry-in zero and carry-in one, respectively. Step 4
chooses the correct result of the H part based on its actual carry-in, which is the carry-out of
the L part (c0). It is observed that all three calls of the Add128bit routine are independent and can
be computed in parallel.
Algorithm 5.3: Add256bit (A, B, cin).
Input: A = 2^128·A1 + A0, B = 2^128·B1 + B0, cin
Output: A + B + cin
/* S0, S′0, S1, S′1 are 128-bit variables. */
/* c0, c1, c′0, c′1 are single bit variables for carry. */
1. {c0, S0} ← Add128bit(A0, B0, cin)
2. {c′0, S′0} ← Add128bit(A1, B1, 0)
3. {c′1, S′1} ← Add128bit(A1, B1, 1)
/* {x, Y} indicates concatenation of x and Y. */
4. S1 ← S′_{c0}, c1 ← c′_{c0}
5. return {c1, S1, S0}
An Add128bit routine is executed by calling three independent Add64bit routines. It
computes the 128-bit addition by same way of the 256-bit addition procedure. Only differ-
ence is that here L and H parts contain 64-bit operands instead of 128-bit. The Add64bit
routine is defined in Algo. 5.5. It computes the addition of two 64-bit operands by three
92
5.2 Fast Additions on FPGA
Algorithm 5.4: Add128bit (A,B,cin).Input: A = 264A1 +A0,B = 264B1 +B0,cinOutput: A+B+ cin
/* S0,S′0,S1,S
′1 are 64-bit variables. */
/* c0,c1,c′0,c
′1 are single bit variables for carry. */
1. {c0,S0}← Add64bit(A0,B0,cin)2. {c′0,S
′0}← Add64bit(A1,B1,0)
3. {c′1,S′1}← Add64bit(A1,B1,1)
/* {x,Y} indicates concatenation of x and Y. */4. S1← S
′c0
, c1← c′c0
5. return {c1,S1,S0}
Algorithm 5.5: Add64bit(A, B, cin).
Input: A = 2^32·A1 + A0, B = 2^32·B1 + B0, cin
Output: A + B + cin
/* S0, S′0, S1, S′1 are 32-bit variables. */
/* c0, c1, c′0, c′1 are single bit variables for carry. */
/* {x, Y} indicates concatenation of x and Y. */
1. {c0, S0} ← A0 + B0 + cin
2. {c′0, S′0} ← A1 + B1 + 0
3. {c′1, S′1} ← A1 + B1 + 1
4. S1 ← S′_{c0}, c1 ← c′_{c0}
5. return {c1, S1, S0}
32-bit additions. Therefore, the Add256bit routine computes the addition of two 256-bit
addends hierarchically. The whole computation consists of twenty-seven (3 × 3 × 3) 32-bit
additions, which are independent and can be computed in parallel.
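As an illustration, the hierarchical scheme of Algos. 5.3 to 5.5 can be modelled in
software. The following is a minimal Python sketch, not the hardware description used in
this work. The function name `add_hier` and the returned count of base-width additions are
assumptions introduced here for illustration. In hardware the three recursive calls run in
parallel; the sketch only checks the arithmetic.

```python
def add_hier(a, b, cin, width, base=32):
    """Return (carry_out, sum, n_base) for width-bit a + b + cin.

    n_base counts the base-width additions (illustrative only).
    """
    if width == base:
        total = a + b + cin
        return total >> width, total & ((1 << width) - 1), 1
    half = width // 2
    mask = (1 << half) - 1
    a0, a1 = a & mask, a >> half
    b0, b1 = b & mask, b >> half
    c0, s0, n0 = add_hier(a0, b0, cin, half, base)    # L part
    ch0, sh0, n1 = add_hier(a1, b1, 0, half, base)    # H part, carry-in 0
    ch1, sh1, n2 = add_hier(a1, b1, 1, half, base)    # H part, carry-in 1
    c1, s1 = (ch1, sh1) if c0 else (ch0, sh0)         # 2:1 MUX driven by c0
    return c1, (s1 << half) | s0, n0 + n1 + n2


carry, total, n_base = add_hier(2**256 - 1, 1, 0, 256)
print(carry, total, n_base)
```

For 256-bit operands the returned count is 27, which matches the number of 32-bit
additions stated above.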
The architecture of the proposed adder based on Add256bit is depicted in Fig. 5.2.
It performs one 32-bit addition (ai + bi) using a 32-bit fast carry chain (FCC). A 64-bit
addition is performed by three 32-bit adders and a 32-bit 2:1 MUX. A 128-bit addition is
performed by three 64-bit adders and a 64-bit 2:1 MUX. The proposed 256-bit adder is
therefore formed by three such 128-bit adders and a 128-bit 2:1 MUX. In the 256-bit adder
shown in Fig. 5.2, twenty-seven 32-bit additions are performed in parallel, and the
correct result is finally selected by multiplexers. The critical path of the proposed
256-bit adder circuit consists of one 32-bit fast carry chain and three 2:1 MUXs.
[Figure 5.2 schematic: each "Add 64 bit" block contains three 32-bit adders, operating on
(a_i, b_i, cin), (a_{i+1}, b_{i+1}, 0), and (a_{i+1}, b_{i+1}, 1), with 2:1 MUXs selecting
s_{i+1} and the carry-out. Nine such blocks on the words a0..a7, b0..b7 feed further MUX
levels that produce s1-0, s3-2, s7-4, and cout.]
Figure 5.2: The proposed 256-bit adder based on 32-bit fast carry chain.
5.2.2 Cost and performance
Algorithms 5.3, 5.4, and 5.5 show that the computation of a 256-bit addition
decomposes the operands down to 256/2^3 = 32 bits. Let h denote the decomposition level,
which is incremented with each decomposition. Thus h = 0 at the beginning, h = 1 at the
128-bit level, and h = 3 at the 32-bit level.
However, the decomposition can be continued further, down to the single-bit level: instead
of performing a 32-bit addition on a 32-bit FCC, the addition could be decomposed further.
Table 5.3 shows the performance of such adder circuits for 256-bit and 512-bit operands
under different decompositions on a Virtex-II Pro FPGA. Decomposition down to 32 bits
improves the performance (i.e., reduces the latency), but beyond that the performance
degrades. This is due to the fast carry chains (FCCs) that are inherently available on a
Virtex-II Pro FPGA. At h = 3 (for 256-bit operands), the critical path of the proposed
adder contains one 32-bit FCC and three 2:1 MUXs. Further decomposition adds more MUXs to
the critical path. For example, at h = 4 the critical path contains one 16-bit fast carry
chain and four 2:1 MUXs. The routing delay also increases.
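The relation between the decomposition level h and the resulting structure can be
tabulated with a short Python sketch. The helper name `decomposition` is hypothetical. The
count 3^h of base adders follows from each level replacing one adder by three; it is an
extrapolation for levels other than h = 3, where the text states twenty-seven.

```python
def decomposition(bits, h):
    """Return (base adder width, number of base adders, MUX levels) at level h."""
    base_width = bits >> h     # operands are halved at every level
    base_adders = 3 ** h       # one L part and two H parts per level
    mux_levels = h             # one 2:1 MUX is added to the critical path per level
    return base_width, base_adders, mux_levels


for h in range(0, 5):
    print(h, decomposition(256, h))
```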
Table 5.3: Performance of different decompositions for adders in FPGA.

         256-bit adder            512-bit adder
h §      Slice   Latency (ns)     Slice   Latency (ns)
0        128     14.9             256     24.2
1        266     11.2             532     16.0
2        427     10.3             853     12.7
3        695     † 10.1           1384    11.8
4        1088    10.7             2160    † 11.6
5        1706    11.4             3359    12.2
6        2674    12.2             5195    12.9

− § h indicates the height of decomposition tree.
− † minimum latency at 32-bit FCC.
The latencies of fast carry chains of different lengths and of one 2:1 MUX on a Virtex-II
Pro FPGA are shown in Table 5.4. The latencies of a 32-bit FCC and a 16-bit FCC are 6.7 ns
and 6.1 ns, whereas the latency of a 2:1 MUX is 5.1 ns. So, the latency of a 32-bit
addition using a 32-bit fast carry chain is 6.7 ns, whereas the same addition using three
parallel 16-bit fast carry chains and one 2:1 MUX takes 6.1 + 5.1 = 11.2 ns. Hence,
decomposition below 32 bits degrades the performance of the adder circuits on the FPGA. It
was also observed previously that the 32-bit carry propagation adder provides the best
performance among all existing addition techniques on an FPGA platform. The above
experimental result confirms that the 32-bit FCC is the best choice for adding two 32-bit
addends on an FPGA platform. Hence, we stop the decomposition at level 3 when developing
a 256-bit high-speed adder circuit, and at level 4 in the case of the 512-bit adder.
Table 5.5 shows the performance of the 256-bit and 512-bit adders on the FPGA platform.
The Brent-Kung carry adder with 32-bit FCC provides the best performance among all
existing techniques (see Table 5.2). Our proposed 256-bit adder gives a 35% speedup over
the Brent-Kung carry adder on a Virtex-II Pro FPGA (10.07 ns against 13.63 ns). Due to its
hierarchical structure, the routing delay is about half that of the Brent-Kung structure.
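The speedup figures quoted in this chapter appear to be computed as the ratio of the
reference delay to the new delay, minus one. That reading is an assumption inferred from
the numbers in Table 5.5. A short Python check under that assumption, with a hypothetical
helper name:

```python
def speedup_percent(t_reference, t_new):
    """Speedup of t_new over t_reference, in percent (assumed definition)."""
    return (t_reference / t_new - 1.0) * 100.0


# Total delays (ns) of the Brent-Kung carry adder and the proposed adder, Table 5.5.
print(round(speedup_percent(13.63, 10.07)))   # 256-bit adders
print(round(speedup_percent(16.35, 11.61)))   # 512-bit adders

# Route delay ratio for the 256-bit adders, Table 5.5.
print(round(3.13 / 6.12, 2))
```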
Table 5.4: Latencies of circuit elements on a Virtex-II Pro FPGA.

Fast carry chain (FCC)                      2:1 MUX
32-bit    16-bit    8-bit     4-bit
6.7 ns    6.1 ns    5.8 ns    5.6 ns        5.1 ns
Table 5.5: Performance comparison of proposed adder on Virtex-II Pro FPGA.

                      256-bit adders                      512-bit adders
Adder Type            Slice  LD (ns)  RD (ns)  TD (ns)    Slice  LD (ns)  RD (ns)  TD (ns)
Proposed              695    6.94     3.13     10.07      2160   7.65     3.96     11.61
Brent-Kung Carry †    373    7.51     6.12     13.63      1220   9.12     7.23     16.35

− † Brent-Kung carry with 32-bit FCC is the best among existing adders (see Table 5.2).
− LD, RD, and TD stand for Logic Delay, Route Delay, and Total Delay.
5.3 Fast Fp Multipliers on FPGA
Multiplication is another important underlying operation in cryptography. For a
cryptographic scheme defined over Fp it is performed as (A · B) mod p, where A, B ∈ Fp.
The operation (A · B) mod p is also known as modular multiplication. In the schoolbook
method, modular multiplication is performed by a multiplication followed by a division.
The division makes this procedure very costly, and different techniques have been
developed to avoid the division operation in modular multiplication [117].
Interleaved multiplication [180] (see § 2.1.2) is one of the procedures that can compute
(A · B) mod p without a final division. It is a bit-serial, addition-based algorithm that
takes k iterations, where k is the bit length of the operands. Therefore, in practice the
performance of this algorithm depends on the efficiency of the underlying adder circuit.
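For reference, a minimal Python sketch of a bit-serial interleaved modular multiplication
is given below. It follows the common most-significant-bit-first textbook form, which may
differ in detail from the variant described in § 2.1.2; the function name
`interleaved_mul` is an assumption. Each iteration needs one doubling, at most one
addition, and the associated conditional subtractions of p, which is why the adder
dominates the cost.

```python
def interleaved_mul(a, b, p, k):
    """Compute (a * b) mod p for 0 <= a, b < p < 2**k, scanning b from the MSB."""
    s = 0
    for i in range(k - 1, -1, -1):
        s = 2 * s                  # doubling
        if s >= p:
            s -= p                 # keep s in [0, p)
        if (b >> i) & 1:           # addition only when bit b_i is set
            s += a
            if s >= p:
                s -= p
    return s


p = 2**255 - 19                    # example modulus, chosen only for illustration
print(interleaved_mul(p - 1, p - 2, p, 255))
```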
Table 5.6 shows the costs and performance of interleaved multipliers based on different
adder circuits. The implementation based on our proposed high-speed adder achieves a 63%
speedup over the one based on a carry propagation adder, and a 36% speedup over the one
based on the Brent-Kung carry adder with 32-bit FCC. This speedup can be attributed to the
efficient parallelism in the addition operation (our high-speed adder) on the FPGA
platform. In the subsequent section we
Table 5.6: Performance of the interleaved multipliers on Virtex-II Pro FPGA.

                       256-bit Fp multiplier                512-bit Fp multiplier
Adder Type             Slice  Frequency (MHz)  Time (µs)    Slice  Frequency (MHz)  Time (µs)
Proposed high-speed    2701   62               4.1          7890   53               9.7
Carry propagation      808    38               6.7          1560   22               23.0
Brent-Kung carry †     1853   46               5.6          6385   37               13.8
Carry select †         1838   44               5.8          6324   35               14.6
Carry lookahead †      2114   45               5.7          6826   36               14.2

− † adders are based on 32-bit fast carry chain (or FCC).
propose a modification of the interleaved multiplication algorithm that exposes
parallelism within an iteration, which yields a further speedup.
5.3.1 Proposed Multiplication Technique
This section proposes a modified interleaved multiplication algorithm based on the
Montgomery ladder [125] (described in § 2.1.4). The modified algorithm for 256-bit
operands (IMML256bit) is shown in Algo. 5.6. It can be scaled accordingly for other bit
lengths. The main objective of the modification is to perform the modular addition and the
modular doubling in parallel. To compute (A · B) mod p, it uses two temporary variables S0
and S1, which are initialized to A and 0, respectively. At every iteration it performs
S_{b_i} = (S_{b_i} + S_{\bar{b}_i}) mod p and S_{\bar{b}_i} = (2 S_{\bar{b}_i}) mod p,
where \bar{b}_i denotes the complement of bit b_i. The two operations are independent of
each other.
The corresponding architecture is shown in Fig. 5.3. It consists of three adders, two
registers, five 2:1 MUXs, and some additional circuit elements for controlling the data
flow along the data path. The registers S0 and S1 are initialized to A and 0,
respectively. At every iteration the registers are updated with the results of the Fp
addition and doubling operations, as defined in Algo. 5.6. After 256 such iterations the
register S1 holds the Fp multiplication result. The architecture is scalable to shorter
and longer operands. It is inherently programmable and supports all primes within the
given length (256 or 512 bits). We show the results of the 256-bit and 512-bit
implementations of the proposed multiplier on a Virtex-II Pro FPGA.
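The data flow of Algo. 5.6 can be checked with a small software model. The Python sketch
below is an illustration under assumptions: `add256` stands in for the Add256bit unit, the
helper names are hypothetical, and the sequential code only mirrors the arithmetic of
operations that the hardware performs in parallel. Subtraction of p is done, as in
Algo. 5.6, by adding the 1's complement of p with carry-in 1.

```python
MASK = (1 << 256) - 1


def add256(a, b, cin):
    """Model of a 256-bit adder: return (carry_out, 256-bit sum)."""
    total = a + b + cin
    return total >> 256, total & MASK


def imml256(a, b, p):
    """Interleaved Montgomery ladder: (a * b) mod p for 0 <= a, b < p < 2**256."""
    s = [a, 0]                      # s[0] = S0 <- A, s[1] = S1 <- 0
    not_p = ~p & MASK               # 1's complement of p
    for i in range(255, -1, -1):
        bi = (b >> i) & 1
        nb = 1 - bi                 # complement of b_i
        c0, t_add = add256(s[bi], s[nb], 0)        # S_bi + S_nb
        c0r, t_add_red = add256(t_add, not_p, 1)   # same value minus p
        c1, t_dbl = add256(s[nb], s[nb], 0)        # S_nb << 1
        c1r, t_dbl_red = add256(t_dbl, not_p, 1)   # same value minus p
        s[bi] = t_add_red if (c0 or c0r) else t_add
        s[nb] = t_dbl_red if (c1 or c1r) else t_dbl
    return s[1]


print(imml256(3, 5, 7))             # 15 mod 7 = 1
```

Every iteration performs one addition and one doubling regardless of b_i; only the
destination registers change.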
The operands are much longer than 32 bits. We break the addition operation
Algorithm 5.6: Interleaved Montgomery ladder, IMML256bit(A, B, p).
Input: A, B, p, where B = Σ_{i=0}^{255} 2^i b_i
Output: AB mod p
/* T_0, T′_0, T_1, T′_1 are 256-bit variables. */
/* c0, c1, c′0, c′1 are single bit variables for carry. */
/* {x, Y} indicates concatenation of x and Y. \bar{p} indicates the 1's complement of p,
   and \bar{b}_i the complement of b_i. */
1. S1 ← 0 and S0 ← A
2. for i from 255 down to 0 do
     {c0, T_{b_i}} ← Add256bit(S_{b_i}, S_{\bar{b}_i}, 0)
     {c′0, T′_{b_i}} ← Add256bit(T_{b_i}, \bar{p}, 1)
     {c1, T_{\bar{b}_i}} ← S_{\bar{b}_i} << 1
     {c′1, T′_{\bar{b}_i}} ← Add256bit(T_{\bar{b}_i}, \bar{p}, 1)
     if (c0 or c′0) then S_{b_i} ← T′_{b_i}
     else S_{b_i} ← T_{b_i}
     if (c1 or c′1) then S_{\bar{b}_i} ← T′_{\bar{b}_i}
     else S_{\bar{b}_i} ← T_{\bar{b}_i}
3. return S1
[Figure 5.3 schematic: registers S0 and S1 (with clk and reset), MUXs controlled by b_i,
a left shift (2X), and three "Add 256 bit" units: one with cin = 0 for the addition and
two with cin = 1 and the modulus input M for the reductions, each followed by a 2:1 MUX.]
Figure 5.3: The proposed Montgomery ladder based interleaved multiplier.
into smaller operations by following the Karatsuba decomposition procedure. The smaller
operations are performed in parallel, as described in the previous section, and their
results are then combined into the final result. The combination in this case is
performed by multiplexers (MUXs). Each level of the binary decomposition adds one 2:1 MUX
to the critical path.
The circuit elements that form the critical path of the above 256-bit multiplier under
different decompositions are listed in Table 5.7. The critical path of the multiplier is
formed by two 256-bit adders and two additional MUXs. Of the two additional MUXs, one
selects the correctly reduced result of the respective modular addition or doubling
operation, whereas the other writes the intermediate results back into the respective
registers after an iteration, based on b_i. The critical path of the multiplier varies
for different adder circuits. The minimum critical path delay is obtained when the
Add256bit units are built from our proposed 256-bit high-speed adder, at the decomposition
level h = 3.
Table 5.7: Circuit elements in the critical path of 256-bit Fp multipliers.

h §    Critical path                                Latency (ns)
0      2 * 256-bit carry chain + 2 MUX              18.36
1      2(128-bit carry chain + 1 MUX) + 2 MUX       16.02
2      2(64-bit carry chain + 2 MUX) + 2 MUX        12.89
3      2(32-bit carry chain + 3 MUX) + 2 MUX        12.50
4      2(16-bit carry chain + 4 MUX) + 2 MUX        13.28
5      2(8-bit carry chain + 5 MUX) + 2 MUX         17.97

− § h indicates the height of decomposition tree, i.e. decomposition stops at 256/2^h bits.
− MUX indicates 2:1 multiplexer.
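The critical path latencies of Table 5.7 translate directly into a clock frequency and a
multiplication time if the ladder completes one iteration per clock cycle. A short Python
sketch of this arithmetic follows; the function names are hypothetical, and one iteration
per cycle is an assumption that is consistent with the 80 MHz and 3.2 µs reported for the
256-bit multiplier.

```python
def max_freq_mhz(critical_path_ns):
    """Maximum clock frequency for a given critical path latency."""
    return 1000.0 / critical_path_ns


def mult_time_us(bits, critical_path_ns):
    """Time of one multiplication, assuming one ladder iteration per clock cycle."""
    return bits * critical_path_ns / 1000.0


print(max_freq_mhz(12.50), mult_time_us(256, 12.50))   # h = 3 row of Table 5.7
```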
5.3.2 Cost and Performance of Multiplier
The designs have been written in Verilog HDL. Synthesis, mapping, placement, and routing
have been done with Xilinx ISE 7.1i. The results are based on post-route simulation using
the ModelSim XE III 6.0a simulator. The target device is a Xilinx Virtex-II Pro FPGA.
Table 5.8 shows the cost and performance of the proposed 256-bit and 512-bit Fp
multipliers. For both cases we report the cost and computation time of an Fp
multiplication with the operands decomposed at different levels.
In Table 5.8, the column h indicates the decomposition level, i.e., the decomposition
stops in the respective design when the operands are 256/2^h bits long (512/2^h bits for
the 512-bit implementation). As expected, in both the 256-bit and 512-bit multipliers the
maximum speed (minimum time) is achieved when the operand decomposition stops at 32 bits.
This is due to the use of the built-in 32-bit fast carry chain (FCC) of the FPGA device.
Table 5.8: Performance of the proposed multiplier with different decompositions on a
Virtex-II Pro FPGA platform.

         256-bit implementation                  512-bit implementation
h §      Slice Area  Time (µs)  Area×Time        Slice Area  Time (µs)  Area×Time
0        2271        4.7        10 674           5119        14.3       73 201
1        2123        4.1        8 704            4362        12.9       56 270
2        2808        3.3        9 266            5637        9.0        50 733
3        3475        † 3.2      11 120           7775        7.7        59 868
4        4808        3.4        16 347           9792        † 7.3      71 482
5        6888        4.6        31 685           13630       7.4        100 862

− § h indicates the height of decomposition tree, i.e. decomposition stops at 256/2^h bits.
− † minimum multiplication latency at 32-bit FCC.
Table 5.9 shows the costs and multiplication times of the proposed multiplier implemented
with different adders. For example, the carry propagation adder (CPA) based multiplier is
developed by following the proposed interleaved multiplication on the Montgomery ladder
(Algo. 5.6), where the additions are performed by a CPA of the given length (256 or 512
bits). Similarly, the Brent-Kung carry based multiplier is designed by following Algo. 5.6
with the additions performed by the Brent-Kung carry adder with 32-bit FCC. The high-speed
adder based implementation of the proposed multiplication technique achieves a 36% speedup
over the implementation based on the 256-bit Brent-Kung carry adder with 32-bit FCC. The
proposed multiplication technique with the high-speed adder also achieves a 28% speedup
over the interleaved multiplication with the high-speed adder (shown in Table 5.6).
Table 5.9: Performance of the proposed Fp multiplier with different adders on a Virtex-II
Pro FPGA platform.

                       256-bit Fp multiplier                512-bit Fp multiplier
Adder Type             Slice  Frequency (MHz)  Time (µs)    Slice  Frequency (MHz)  Time (µs)
Proposed high-speed    3475   80               3.2          9792   70               7.3
Carry propagation      2271   54               4.7          5119   35               14.3
Brent-Kung carry †     2547   60               4.3          8722   48               10.7
Carry select †         2515   58               4.4          8705   46               11.2
Carry lookahead †      2874   59               4.4          9117   47               10.9

− † adders are based on 32-bit fast carry chain.
Table 5.10 compares the proposed Fp multiplier with existing contemporary designs. Some
implementations do not account for the pre- and post-computation costs and therefore show
apparently lower multiplication costs. The designs reported in [134,135] are for some
fixed primes p < 25 and are therefore not included in the comparison table. A fair
comparison of the proposed interleaved multiplier is possible with existing architectures
that use the same algorithm on the same platform.
Different researchers have attempted to improve the performance of the interleaved
multiplier by using carry save adder (CSA) units. We have implemented the redundant
interleaved architecture of [121] and compared it with our proposed multipliers. A
disadvantage of a CSA based algorithm is that it requires at least one final addition to
obtain the correct result. The redundant interleaved multiplier additionally requires
some pre-computations that depend on the multiplicand. The pre-computation and final
addition costs are therefore added to the multiplication cost. The pre- and
post-computations of a CSA based multiplier require absolute addition of two large
operands; we perform these absolute additions with our proposed fast adder circuit. The
latencies of the CSA and of our proposed adder are 4.60 ns and 10.07 ns, respectively.
Thus, in the CSA based multiplier the pre- and post-computations run on a divide-by-four
clock, whereas the iterative computations run on the CSA at the original clock.
Figure 5.4 gives a graphical view of the performance of contemporary designs. To our
knowledge, all such existing designs are based on CSA and disregard the additional costs
of pre- and post-computations. According to the results reported by different authors,
the best known design takes 5.5 µs to compute one 256-bit Fp multiplication based on the
interleaved algorithm. Our proposed 256-bit adder reduces the delay of the above
multiplier from 5.5 µs to 3.7 µs. Further, our proposed IMML32FCC multiplier takes only
3.2 µs for the same operation. Hence, it gives a 70% speedup over the best known existing
designs, described in [25] and [95]. However, one drawback of the proposed design is that
it requires a larger slice area, which also increases the overall area × time per bit
(AT/B) value.
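The AT/B metric used in Table 5.10 is the slice area multiplied by the multiplication
time in µs, divided by the operand bit length. A short Python sketch with a hypothetical
helper name, applied to two rows of the table:

```python
def at_per_bit(slices, time_us, bits):
    """Area x time per bit: slice area times multiplication time (µs) over bit length."""
    return slices * time_us / bits


print(round(at_per_bit(3475, 3.2, 256), 1))   # proposed multiplier
print(round(at_per_bit(1030, 5.5, 256), 1))   # interleaved design of [95]
```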
For the sake of completeness, further comparisons have been furnished with Mont-
gomery reduction based multipliers implemented both on FPGA and CMOS libraries. The
Table 5.10: Performance comparison of different Fp multipliers.

Reference                                            Multiplication Type                  Platform         Bit length  Area                           Time     AT/B §
Proposed                                             Interleaved with Montgomery ladder   Virtex II Pro    256         3475 slices                    3.2 µs   43.4
Bunimov et al. [121] (implemented by us)             Redundant Interleaved                Virtex II Pro    256         3711 slices                    3.7 µs   53.6
Bunimov et al. [121] (implemented by [25], 2009) †   Redundant Interleaved                Virtex II        256         1834 slices                    5.5 µs   39.4
Amanor et al. [95], 2005                             Interleaved                          Virtex 2000E     256         1030 slices                    5.5 µs   22.1
‡ AbdelFattah et al. [25], 2009                      Modified Interleaved                 Virtex II        256         2361 slices                    6.9 µs   63.6
Ors et al. [120], 2003                               Montgomery reduction                 V812E-BG-560     256         1548 slices                    7.7 µs   46.6
                                                                                                           512         2972 slices                    16.2 µs  94.0
Daly et al. [106], 2004                              Montgomery reduction                 Virtex II 2000   256         3109 slices                    5.8 µs   70.4
McIvor et al. [74], 2004                             Montgomery reduction                 Virtex II Pro    256         4663 slices + 64 Multipliers   1.3 µs   88.7
Amanor et al. [95], 2005                             Montgomery reduction                 Virtex 2000E     256         1800 slices                    5.6 µs   39.4
Harris et al. [99], 2005                             Montgomery reduction                 Virtex II 2000   256         5598 LUT +                     3.9 µs   46.3
                                                                                                           1024        5n bit RAM                     16.0 µs  48.2
Crowe et al. [98], 2005                              Montgomery reduction                 Virtex II 2000   256         5267 slices                    5.8 µs   119.3
Sakiyama et al. [79], 2006                           Montgomery reduction                 Virtex-II Pro    256         4836 slices                    4.0 µs   75.6
Khaleel et al. [80], 2006                            Montgomery reduction                 Virtex IV        256         34 345 slices                  0.4 µs   53.7
Kawakami et al. [44], 2008                           Montgomery reduction                 Virtex II Pro    256         1704 slices + 33 Multipliers   0.38 µs  12.4
Savas et al. [158], 2000                             Montgomery reduction                 1.2 µm CMOS      256         −                              6.6 µs   −
Liu et al. [96], 2005                                Montgomery reduction                 0.13 µm CMOS     1024        221 kgates                     1.5 µs   −
Kaihara et al. [97], 2005                            Montgomery reduction                 0.35 µm CMOS     256         18 kcells                      1.5 ms   −

− † Does not consider the final CPA sum and precomputations which are required for distinct multiplicands.
− ‡ Redundant Interleaved [121] requires some precomputations which are considered to be stored inside the FPGA in [25].
− § A: Slice Area, T: Time in µs, B: Bit Length.
[Figure 5.4 plot: slice area (0 to 4000) against Fp multiplication time (0.5 to 8.0 µs)
for: Our 2010; Amanor et al. 2005; AbdelFattah et al. 2009; Bunimov et al. 2003 designed
by AbdelFattah et al. 2009; Bunimov et al. 2003 designed by us.]
Figure 5.4: Different implementations of interleaved multiplier on FPGA.
Montgomery reduction involves conversions, which are not included in the values given in
Table 5.10. However, these conversions need to be performed only once at the beginning
and once at the end of exponentiation-like algorithms, where multiplication is performed
repeatedly.
5.3.3 Security Against Timing and Power Attacks
Our proposed multiplier has the additional property of security against the timing attack
and the simple power analysis (SPA) attack. This is achieved by the balanced computation
of the proposed architecture. The design performs every multiplication in constant time,
which ensures its security against the timing attack. At every iteration it performs a
fixed amount of computation and therefore follows a fixed power consumption profile.
Hence, by observing a single trace of the power profile it is not possible to distinguish
one iteration from another, which ensures its security against the SPA attack.
To demonstrate its security against the SPA attack, we first show how this attack finds
the secret by exploiting a multiplier. Let us assume that a multiplication (A · B) mod p is
[Figure 5.5 plot: power consumption (V, scale ×10^-3, range −4 to 4) against samples
(500 to 2500); the iterations of the trace are labelled with the bit values 1 0 1 1 0 1.]
Figure 5.5: Simple power analysis on a naive multiplier.
performed on a secret B, which is represented as B = Σ_{i=0}^{k−1} 2^i b_i. Let us further
assume that the multiplication is performed by the interleaved multiplication algorithm.
The doubling operation (2A) mod p is performed at every iteration, but the addition is
performed at an iteration if and only if the respective bit b_i = 1. The power consumption
profile of a naive implementation of this algorithm is shown in Fig. 5.5. It is easily
exploited to distinguish the iterations in which an addition is performed from those in
which it is not. Hence, simple power analysis finds all secret bit values b_i, where
0 ≤ i ≤ k−1.
[Figure 5.6 plot: power consumption (V, scale ×10^-3, range −4 to 4) against samples
(500 to 2500).]
Figure 5.6: Simple power analysis on our proposed multiplier.
The same analysis is performed on our proposed Fp multiplier. The respective power
consumption profile is shown in Fig. 5.6. In this waveform it is not possible to
distinguish one iteration from another, because all iterations consist of the same
computation in our proposed multiplication technique. Hence, the proposed multiplier is
indeed secure against the simple power analysis (SPA) attack.
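The difference between the two traces can be mimicked with a toy model that records only
which operations occur in each iteration. The Python sketch below is an abstraction under
assumptions: the labels "D" (doubling) and "A" (addition) stand in for the visible power
patterns, the function names are hypothetical, and no real power measurement is modelled.
The secret 101101 is the bit pattern annotated in Fig. 5.5.

```python
def naive_trace(b, k):
    """Per-iteration operations of a multiplier that adds only when b_i = 1."""
    return ["DA" if (b >> i) & 1 else "D" for i in range(k - 1, -1, -1)]


def ladder_trace(b, k):
    """Per-iteration operations of the ladder: one add and one double, whatever b_i is."""
    return ["AD" for _ in range(k)]


def recover_bits(trace):
    """Read the secret bits off a trace by spotting the iterations with an addition."""
    return [1 if ops == "DA" else 0 for ops in trace]


secret = 0b101101
print(recover_bits(naive_trace(secret, 6)))   # the secret bits leak
print(set(ladder_trace(secret, 6)))           # a single pattern, nothing to tell apart
```

The model covers only the single-trace (SPA) setting discussed here; it says nothing about
attacks that combine many traces.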
5.4 The PGAU and ECSM Hardware Based on Fast Adder
The programmable GF(p) arithmetic unit (PGAU) and the elliptic curve scalar multiplication
(ECSM) hardware have been described in the previous chapter. In this section we show the
performance gain of this hardware when it uses the proposed fast adders on an FPGA
platform. We replace all the adder units of the PGAU architecture by fast adders. The
modified unit is called PGAU-FA. The performance gain and cost overhead of the PGAU-FA are
compared with our previous PGAU in Table 5.11. Due to the proposed addition technique, the
256-bit PGAU-FA gives a 41% speedup over the PGAU of the same length.
Table 5.11: Performances of PGAU-FA and PGAU on Virtex-II Pro FPGA.

       PGAU-FA                                       PGAU [Chapter 4]
bit    slice   Frequency (MHz)  TM (µs)  TID (µs)    slice   Frequency (MHz)  TM (µs)  TID (µs)    Speedup
192    6 743   56               3.43     6.86        3 895   44               4.36     8.72        27%
224    7 814   53               4.23     8.46        4 675   41               5.46     10.92       29%
256    9 408   52               4.92     9.85        5 384   37               6.92     13.84       41%

− TM and TID indicate Fp (or GF(p)) multiplication and inversion/division times, respectively.
Next we redesign the whole dual core ECSM hardware based on PGAU-FA instead of PGAU. We
call this new ECSM hardware ECSM-FA. The performance of the ECSM-FA is compared with the
previous ECSM in Table 5.12. The speedup factor shows that, due to the fast adder
circuits, a 256-bit elliptic curve scalar multiplication achieves a 39% speedup over its
previous implementation.
5.5 Conclusion
This chapter has presented techniques to improve the performance of adders and Fp
multipliers on FPGA platforms. The designs are based on the proper usage of the
Table 5.12: Comparison between with and without fast adder based ECSM hardwares.

       ECSM with PGAU-FA (ECSM-FA)                 ECSM [Chapter 4]
bit    Slice    Frequency (MHz)  Time (ms)         Slice    Frequency (MHz)  Time (ms)         Speedup
192    10 350   55               3.50              8 972    43               4.47              27%
224    11 936   52               5.00              10 386   40               6.50              30%
256    13 350   50               6.75              11 953   36               9.38              39%

− Comparison is made on Virtex-II Pro platform.
− Time in ms indicates one ECSM computation time.
built-in carry chains of the FPGA device. This work shows that the carry chains have a
direct impact on the level of decomposition at which the fastest adder is obtained. The
interleaved multiplication algorithm has been modified based on the Montgomery powering
ladder. The parallel architecture based on the proposed fast adder circuits has been
shown to give a 70% speedup over the best known existing designs.
We redesigned the PGAU and ECSM hardware described in the last chapter using our proposed
fast adder circuit. The proposed adder technique indeed improves the overall performance
of the ECSM hardware. The 256-bit ECSM cryptoprocessor achieves a 39% speedup at the cost
of only 12% additional slices over its old design.
In summary, this chapter has presented efficient design techniques for prime field
arithmetic on the FPGA platform. Towards the goal of designing an efficient and secure
pairing cryptoprocessor, the next chapter deals with the selection of a pairing-friendly
curve and the design of the respective cryptoprocessor. The pairing cryptoprocessor is
based on the fast addition and multiplication techniques developed in this chapter.
Chapter 6
High Speed Flexible Pairing Cryptoprocessor
Apart from elliptic curve scalar multiplication, pairing computation is another
computationally demanding operation in pairing based cryptography. The security and
computational efficiency of a cryptographic bilinear pairing mostly depend on the
underlying algebraic curves. As per the NIST recommendation, 128-bit security is
essential beyond 2030 [62]. Some of the existing curves that provide 128-bit security
are: Barreto-Naehrig curves (BN curves) defined over a 256-bit prime field with embedding
degree 12, supersingular curves defined over a 1223-bit binary field with embedding
degree 4, and supersingular curves defined over a 509-bit characteristic-3 field with
embedding degree 6 [42]. Among them, the BN curves are currently the most popular.
This chapter presents a Pairing Cryptoprocessor (PCP) over Barreto-Naehrig curves. The
proposed architecture is specifically designed for the field programmable gate array
(FPGA) platform. The objective of this chapter is to use the efficient implementations of
the underlying finite field primitives, namely the adder, subtractor, and multiplier
described in the previous chapter, to develop a pairing cryptoprocessor. This is brought
about in two stages of the cryptoprocessor design:
1. A configurable F_{p^k} arithmetic unit (CAU) has been developed which has inherent
configurability to perform arithmetic in Fp and F_{p^2} for any p less than the given
length.
2. The PCP has been developed using two CAU cores as arithmetic operators along
with additional control units and memory elements.
Extensive parallelism techniques have been proposed to realize a PCP that requires fewer
clock cycles than the existing designs. The proposed design is the first reported result
on an FPGA platform for 128-bit security. The proposed cryptoprocessor provides the
flexibility to choose the curve parameters for pairing computations.
The cryptoprocessor needs 1764 k, 1242 k, and 856 k cycles for the computation of the
Tate, ate, and R-ate pairings, respectively. On a Virtex-4 FPGA device it consumes 52
kSlices at 50 MHz and computes the Tate, ate, and R-ate pairings in 35.3 ms, 24.9 ms, and
17.0 ms, respectively, which is comparable to known CMOS implementations.
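The quoted times follow from the cycle counts and the clock frequency. A short Python
sketch with a hypothetical helper name reproduces them to within roughly 0.1 ms; the
small differences presumably come from rounding of the reported cycle counts and
frequency.

```python
def pairing_time_ms(kilo_cycles, freq_mhz):
    """Computation time in ms for a cycle count given in thousands of cycles."""
    return kilo_cycles / freq_mhz


for name, kilo_cycles in (("Tate", 1764), ("ate", 1242), ("R-ate", 856)):
    print(name, round(pairing_time_ms(kilo_cycles, 50), 2))
```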
6.1 Introduction
A cryptographic pairing [93] is a bilinear map G1 × G2 → G3, where G1 and G2 are additive
groups and G3 is a multiplicative group. Let E be an elliptic curve defined over Fq
having even embedding degree k with respect to the prime divisor r of the order of the
elliptic curve (#E(Fq)). Suppose further that r^3 does not divide #E(Fq) and r^2 does not
divide q^k − 1. Many cryptographic pairings, such as the Tate pairing [138], the ate
pairing [77], and the R-ate pairing [28], choose G1 to be an order-r cyclic subgroup of
E(Fq), G2 to be an order-r cyclic subgroup of E(F_{q^k}), and G3 to be the subgroup of
F*_{q^k} of order r. These pairings are called asymmetric pairings because G1 and G2 are
different. The pairings considered in this chapter are the (reduced) Tate pairing
t_r : G1 × G2 → G3, the ate pairing a_r : G2 × G1 → G3, and the R-ate pairing
R_r : G2 × G1 → G3.
The selection of these groups and of the field type has a strong impact on the security
and computation cost of a pairing. We choose Barreto-Naehrig curves (BN curves) [76] for
computing the above pairings. BN curves are elliptic curves E defined over prime fields
Fp, having prime order #E(Fp) and embedding degree k = 12. These curves are especially
well suited for computing the above pairings at the 128-bit security level by choosing a
256-bit prime p.
This chapter proposes a cryptoprocessor for the computation of pairings over BN curves.
The proposed pairing cryptoprocessor (PCP) is flexible in the choice of curve parameters,
including the prime p. It supports all primes within the given length (256 bits). The
field programmable gate array (FPGA) is one of the suitable platforms for implementing
cryptographic algorithms. In this chapter, we develop parallel configurable hardware for
computing addition, subtraction, and multiplication in Fp and F_{p^2}. Existing
techniques to speed up arithmetic in extension fields (see [61, 78]) are used on top of
it for fast computation in F_{p^6} and F_{p^12}.
The major contributions of the chapter are highlighted here.
• The chapter introduces an underlying configurable primitive for F_{p^k} arithmetic on
the FPGA platform.
• It proposes a pairing hardware that is flexible for curve parameters.
• Parallelism techniques are adopted at different levels, including the underlying finite
field operations, which drastically reduce the overall cycle count of pairing computation.
• The proposed FPGA design achieves a speed comparable to that of the existing CMOS
design.
The proposed configurable F_{p^k} arithmetic cores and parallel computation result in a
significant improvement in the performance of the Tate, ate, and R-ate pairings over BN
curves. The result is demonstrated for a 256-bit BN curve that provides 128-bit security.
The chapter starts with a brief description of cryptographic pairings and BN curves. It
then describes the pairing cryptoprocessor, followed by the experimental results based on
BN curves and a comparative study with existing contemporary designs.
6.2 Prior Work
Software implementation results of pairings over BN curves have been reported in [8],
[42], [39], and [61]. Highly optimized software running on a 64-bit Core 2 processor
computes an R-ate pairing in only 10,000,000 cycles. The software implementation of [8]
holds the speed record for the computation of the Optimal-ate pairing on BN curves, which
takes 4,470,408 cycles on an Intel Core 2 Quad Q6600 processor.
An application specific instruction-set processor (ASIP) has been proposed in [17]. It is
designed by extending a RISC core with additional scalable functional units. It requires
a special programming environment in order to execute pairings, for which the authors
have developed a special C compiler. The implementation results show that the ASIP can
compute an Optimal-ate pairing over a 256-bit BN curve in 15.8 ms at 338 MHz with a
130 nm CMOS library.
A pairing processor specifically for BN curves has been proposed in [19]. It exploits the
characteristic of the field defined by BN curves and chooses curve parameters such that
the underlying Fp multiplication becomes more efficient. It shows a 5.4 times speedup of
a pairing computation compared to the ASIP proposed in [17]. However, the main limitation
of the pairing processor of [19] is that it is useful only for computing pairings over a
fixed BN curve.
6.2.1 Choice of Elliptic Curve
The most important parameters for cryptographic pairings are the underlying finite
field, the order of the curve, the embedding degree, and the order of G1, G2 and G3. These
parameters should be chosen such that the best exponential time algorithms to solve the
discrete logarithm problem (DLP) in G1 and G2 and the sub-exponential time algorithms
to solve the DLP in G3 take longer than a chosen security level. We choose 128-bit
symmetric key security for the current cryptoprocessor. For the 128-bit security level, the
National Institute of Standards and Technology (NIST) recommends a prime group order of 256 bits
for $E(F_p)$ and of 3072 bits for the finite field $F_{p^k}$ [62].
Barreto-Naehrig curves, introduced in [76], are elliptic curves over fields of prime order
p with embedding degree k = 12. The BN curve is represented as:
$E/F_p : Y^2 = X^3 + 3$
with BN parameter z = 6000000000001F2D (in hexadecimal) [61]. It forms the group
$E(F_p)$ with order $\#E(F_p) = r = 36z^4 + 36z^3 + 18z^2 + 6z + 1$, which is a 256-bit prime of
Hamming weight 91. The field characteristic $p = 36z^4 + 36z^3 + 24z^2 + 6z + 1$ is a 256-bit
prime of Hamming weight 87, and $t - 1 = p - r = 6z^2$ is a 128-bit integer of Hamming
weight 28. Here $t = p + 1 - r$ is the trace of $E/F_p$. The prime satisfies $p \equiv 7 \pmod 8$
(so $-2$ is a quadratic non-residue, which we represent by $\beta$) and $p \equiv 1 \pmod 6$.
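The parameter relations above can be checked with a few lines of integer arithmetic. The following is a minimal sketch, assuming only the value of z and the polynomials quoted in the text; the helper name `hamming_weight` is introduced here for illustration.

```python
# Sketch: derive the BN parameters quoted in the text from z.
z = 0x6000000000001F2D

p = 36 * z**4 + 36 * z**3 + 24 * z**2 + 6 * z + 1   # field characteristic
r = 36 * z**4 + 36 * z**3 + 18 * z**2 + 6 * z + 1   # group order #E(F_p)
t = p + 1 - r                                        # trace of Frobenius


def hamming_weight(n: int) -> int:
    """Number of one bits in the binary expansion of n (illustrative helper)."""
    return bin(n).count("1")


for name, value in (("p", p), ("r", r), ("t - 1", t - 1)):
    print(f"{name}: {value.bit_length()} bits, Hamming weight {hamming_weight(value)}")

print("p mod 8 =", p % 8, "| p mod 6 =", p % 6)
```

The script prints the bit lengths and Hamming weights so that they can be compared against the figures stated in the text; primality of p and r is not tested here.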
6.2.2 Pairing Computation
Several varieties of the Tate pairing can be computed over BN curves. Among them,
ate [77], R-ate [28], and Optimal-ate [3] are the most widely used. Algorithm 6.1 shows the
computation of the Tate pairing. It consists of two major steps: the computation of the Miller
function and the final exponentiation. The first part is computed by an optimized version
of the Miller algorithm [175]. Several optimizations of this algorithm have been presented
in [138]; the resulting algorithm is called the BKLS algorithm. Pairings other than Tate
are computed in a similar way, using a loop parameter other than r and interchanging
the input points [42].
Algorithm 6.1: Computing the Tate pairing.
Input: $P \in G_1$ and $Q \in G_2$.
Output: $e_r(P,Q)$.
Write r in binary: $r = \sum_{i=0}^{L-1} r_i 2^i$.
$T \leftarrow P$, $f \leftarrow 1$.
for i from L−2 downto 0 do
    $T \leftarrow 2T$.
    $f \leftarrow f^2 \cdot l_{T,T}(Q)$.
    if $r_i = 1$ and $i \neq 0$ then
        $T \leftarrow T + P$.
        $f \leftarrow f \cdot l_{T,P}(Q)$.
    end
end
return $f^{(q^k-1)/r}$.
BN curves also admit a sextic twist [42], which means that the point Q is mapped
to a point Q′ defined over $F_{p^2}$. Thus the line functions $l_{T,T}(Q)$ and $l_{T,P}(Q)$ are computed
over $F_{p^2}$ instead of $F_{p^{12}}$. The values of the line functions are represented as $l_0 + l_1W^2 + l_2W^3$,
with $l_0 \in F_p$, $l_1, l_2 \in F_{p^2}$, and a quadratic non-residue W over $F_{p^2}$. The Miller function
f is computed over $F_{p^{12}}$ and is represented as $f_0 + f_1W + f_2W^2 + f_3W^3 + f_4W^4 + f_5W^5$,
with $f_i \in F_{p^2}$. So in the Tate pairing computation $f^2$, $f \cdot l_{T,T}(Q)$, and $f \cdot l_{T,P}(Q)$ are
performed in $F_{p^{12}}$, whereas all other computations are performed in $F_p$ and $F_{p^2}$.
The detailed procedure of pairing computation, including the final exponentiation on BN
curves, is described in [42] and [61]. Another efficient way of computing the final exponentiation
is described in [20]. This work follows the descriptions given in [42] for computing
the Tate, ate, and R-ate pairings. We use the Jacobian coordinate system for performing
elliptic curve operations, where a point (X,Y,Z) corresponds to the point (x,y) in affine
coordinates with $x = X/Z^2$ and $y = Y/Z^3$. Let (m, s, i) denote the cost of multiplication,
squaring, and inversion in Fp. Using the Jacobian coordinate system, the Miller function of the Tate
pairing on the BN curve requires 27934m and the final exponentiation requires 7246m + i [42].
Thus the total cost for the Tate pairing on the BN curve is 35180m + i. Similarly, the cost of the ate
pairing is 23047m + i and the cost of the R-ate pairing is 15093m + 2i.
6.3 Programmable Fp-Primitive
In this section we develop a programmable Fp-primitive based on the 256-bit high-speed
adder circuits described earlier. The essential operations for pairing computation are addition,
subtraction, and multiplication in finite fields. Figure 6.1 depicts the overall architecture of
the proposed Fp-adder/subtractor/multiplier unit.
6.3.1 Architecture Description
Our first objective for designing such an integrated architecture is to reduce the overall
hardware costs for computing three essential prime field operations in pairing computa-
tion. The architecture consists of several independent blocks which operate in parallel for
accelerating the execution of respective operations. The whole architecture is subdivided
into four macro-blocks (A1,A2,A3,A4) and seven micro-blocks (B1,B2,B3,B4,B5,B6,B7).
The macro-blocks are used to compute the arithmetic operations, whereas the micro-blocks
are primarily responsible for dataflow among the macro-blocks, the registers, and the i/o
ports. The functionality of the individual blocks is described here.
Figure 6.1: The architecture of Fp adder/subtractor/multiplier unit. Block labels in the figure: A1: v2 ← s1 + s2 or v2 ← s1 + (¬s2) + 1; A2: w2 ← v2 + (¬p) + 1 or w2 ← v2 + p; A3: v1 ← 2u; A4: w1 ← v1 + (¬p) + 1.
• Macro-blocks A1, A2, and A4 are 256-bit adders based on our proposed technique
described in Section 5.2.
• Block A3 performs 2u for an integer u ∈ Fp. This is a one-bit left shift, which
needs only rewiring and no additional logic cells.
• Micro-block B1 consists of one 2:1 multiplexer that selects either v1 or w1 based on
the most significant bits (or carry-outs) of the 2u and 2u − p operations. This
block therefore completes the 2u mod p operation.
• Block B2 selects either s1 or s2 as the input to A3.
• Blocks B3, B4, and B5 help to compute Fp-addition and subtraction in A1 and A2.
The control signal c0 holds zero for addition and one for subtraction. Thus, if c0 = 1
then block B5 selects ¬s2, else it selects s2. Similarly, if c0 = 1 then block B4 selects
p, else it selects ¬p. Block B3 completes the operation by selecting the correct result.
In case of Fp-subtraction (i.e., c0 = 1), it selects either v2 or w2 based on the most
significant bit (MSB) of v2 only, whereas for Fp-addition it does the same based on
the MSBs of both v2 and w2.
• Blocks B6 and B7 multiplex t1 (the output of 2u mod p) and t2 (the output of s1 ± s2
mod p) as the new values of the s1 and s2 registers, respectively.
6.3.1.1 Computation of Fp-multiplication
The proposed Fp-primitive follows the parallelism technique of the Montgomery ladder [125]
for computing the Blakley multiplication algorithm in Fp [180]. This algorithm is chosen
for its low hardware cost and its intrinsic adaptability to Montgomery-ladder parallelism.
We rewrite it in Algorithm 7 with parenthesized indices in superscript in order to
emphasize the intrinsic dependency as well as the parallelism of the multiplication procedure.
The algorithm computes two intermediate results ($s_1^{(i)}$ and $s_2^{(i)}$) in each iteration. The data
transfer inside the architecture (Fig. 6.1) for computing (a · b) mod p is as follows:
• The registers s1 and s2 hold the iterative results $s_1^{(i)}$ and $s_2^{(i)}$ of Algorithm 7, which are
initialized to zero and a, respectively, as specified in step 1.
• Iterative execution starts from i = n − 1 and goes down to zero as shown in step 2.
This step is executed by an 8-bit counter, which belongs to the control part of the
proposed design and is not shown in Fig. 6.1.
• Block B2 of Fig. 6.1 executes step 3. The modular doubling (steps 4, 6, 8, and 10)
and the modular addition (steps 5, 7, 9, and 11) are performed in parallel. In Fig. 6.1,
steps 4 and 6 are performed in blocks A3 and A4, respectively, whereas steps 8 and 10
are both performed in block B1. Similarly, steps 5 and 7 are performed in blocks A1 and A2,
respectively, whereas steps 9 and 11 are both performed in block B3. During the
execution of Fp-multiplication the control signal c0 remains zero.
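Algorithm 7: The interleaved multiplication based on Montgomery ladder†.
Input: p, $a = \sum_{i=0}^{n-1} 2^i a_i$ and $b = \sum_{i=0}^{n-1} 2^i b_i$.
Output: a · b mod p.
1. $s_1^{(n)} \leftarrow 0$; $s_2^{(n)} \leftarrow a$;
2. for i = n−1 down to 0 do
3.     if $b_i = 1$ then $u^{(i)} \leftarrow s_2^{(i+1)}$; else $u^{(i)} \leftarrow s_1^{(i+1)}$;
4.     $v_1^{(i)} \leftarrow 2u^{(i)}$;
5.     $v_2^{(i)} \leftarrow s_1^{(i+1)} + s_2^{(i+1)}$;
6.     $w_1^{(i)} \leftarrow v_1^{(i)} + (\neg p) + 1$;
7.     $w_2^{(i)} \leftarrow v_2^{(i)} + (\neg p) + 1$;
8.     $c_1^{(i)} \leftarrow (v_1^{(i)})_n \mid (w_1^{(i)})_n$;
9.     $c_2^{(i)} \leftarrow (v_2^{(i)})_n \mid (w_2^{(i)})_n$;
10.    if $c_1^{(i)} = 1$ then $t_1^{(i)} \leftarrow w_1^{(i)}$; else $t_1^{(i)} \leftarrow v_1^{(i)}$;
11.    if $c_2^{(i)} = 1$ then $t_2^{(i)} \leftarrow w_2^{(i)}$; else $t_2^{(i)} \leftarrow v_2^{(i)}$;
12.    if $b_i = 1$ then $s_1^{(i)} \leftarrow t_2^{(i)}$; else $s_1^{(i)} \leftarrow t_1^{(i)}$;
13.    if $b_i = 1$ then $s_2^{(i)} \leftarrow t_1^{(i)}$; else $s_2^{(i)} \leftarrow t_2^{(i)}$;
14. end for
15. return $s_1^{(0)}$;
† In the algorithm, $x^{(i)}$ represents the value of x at the i-th iteration, $(x)_n$ indicates the n-th bit of x, and | indicates logical OR. Steps 12 and 13 keep the ladder invariant $s_2 - s_1 \equiv a \pmod p$, so that $s_1$ accumulates $2s_1 + b_i a$ in every iteration.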
• Finally, the results of the current iteration are stored back as specified in steps 12 and 13,
in parallel, by blocks B6 and B7.
All steps from step 3 to step 13 of Algorithm 7 are performed within one clock cycle by the
proposed architecture. Therefore, to compute a multiplication in Fp with a 256-bit p, the
proposed design takes only 256 clock cycles.
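The per-bit data flow described above can be modelled in software. The following is a minimal sketch, not the hardware description itself: it assumes n-bit registers whose adders produce an extra bit n (treated as the carry-out), operands already reduced modulo p, and a register update chosen so that the ladder invariant s2 − s1 ≡ a (mod p) holds. The names `mod_mul_ladder` and `reduce_once` are hypothetical.

```python
def mod_mul_ladder(a: int, b: int, p: int, n: int) -> int:
    """Bit-serial a*b mod p: one modular doubling and one modular addition per bit of b.

    Assumes 0 <= a < p < 2**n and 0 <= b < 2**n.
    """
    mask = (1 << n) - 1
    neg_p = (~p) & mask                    # n-bit complement of p

    def reduce_once(v: int) -> int:
        # v is an (n+1)-bit value below 2p; bit n plays the role of the carry-out.
        w = v + neg_p + 1                  # equals v - p + 2**n
        c = ((v >> n) | (w >> n)) & 1      # a carry from v or from w means v >= p
        return (w if c else v) & mask      # keep the n register bits

    s1, s2 = 0, a                          # invariant: s2 - s1 = a (mod p)
    for i in range(n - 1, -1, -1):         # one loop pass models one clock cycle
        bit = (b >> i) & 1
        u = s2 if bit else s1              # operand selected for doubling
        t1 = reduce_once(2 * u)            # modular doubling
        t2 = reduce_once(s1 + s2)          # modular addition
        s1, s2 = (t2, t1) if bit else (t1, t2)
    return s1


if __name__ == "__main__":
    print(mod_mul_ladder(5, 6, 7, 3))      # 30 mod 7
```

Each loop pass performs the doubling and the addition independently of each other, which is the property that lets the hardware evaluate them in the same clock cycle.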
6.3.1.2 Computation of Fp-addition
The proposed design executes Algorithm 8 for computing Fp-addition. As described in
step 1, the architecture initializes registers s1 and s2 with operands a and b, respectively. It
executes steps 2 and 3 in blocks A1 and A2. Based on the most significant bits of v2 and w2,
block B3 produces the correct result of s1 + s2 mod p as described in steps 4 and 5.
During the execution of Fp-addition the control signal c0 holds logic zero. The proposed
architecture computes an Fp-addition in one clock cycle.
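Algorithm 8: The addition in prime field.
Input: p, $a = \sum_{i=0}^{n-1} 2^i a_i$ and $b = \sum_{i=0}^{n-1} 2^i b_i$.
Output: a + b mod p.
1. $s_1 \leftarrow a$; $s_2 \leftarrow b$;
2. $v_2 \leftarrow s_1 + s_2$;
3. $w_2 \leftarrow v_2 + (\neg p) + 1$;
4. $c_2 \leftarrow (v_2)_n \mid (w_2)_n$;
5. if $c_2 = 1$ then $t_2 \leftarrow w_2$; else $t_2 \leftarrow v_2$;
6. return $t_2$;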
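Algorithm 8 can be traced step by step with ordinary integers. The sketch below is illustrative only: it assumes that ¬p denotes the n-bit complement of p and that bit n of each sum is the adder's carry-out; the function name `mod_add` is hypothetical.

```python
def mod_add(a: int, b: int, p: int, n: int) -> int:
    """a + b mod p for 0 <= a, b < p < 2**n, using two n-bit additions."""
    mask = (1 << n) - 1
    v2 = a + b                           # step 2: may carry into bit n
    w2 = v2 + ((~p) & mask) + 1          # step 3: equals v2 - p + 2**n
    c2 = ((v2 >> n) | (w2 >> n)) & 1     # step 4: set exactly when v2 >= p
    t2 = w2 if c2 else v2                # step 5: select the reduced value
    return t2 & mask                     # step 6: keep the n register bits


if __name__ == "__main__":
    print(mod_add(5, 6, 7, 3))           # 11 mod 7
```

Both additions are independent of the final selection, so the sum and the trial subtraction of p can be evaluated side by side before the multiplexer picks one of them.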
6.3.1.3 Computation of Fp-subtraction
Subtraction a − b mod p on the proposed design is performed by executing Algorithm 9.
As described in step 1, the architecture initializes registers s1 and s2 with operands a and b,
respectively. The subtraction s1 − s2 is performed as s1 + (¬s2) + 1 by the A1 block of
Fig. 6.1. Block B5 computes ¬s2 and sends it to the A1 block. During subtraction the
control signal c0 holds one. The architecture further executes step 3 in block A2 and, based
on the most significant bit of v2, produces the correct result of s1 − s2 mod p in block B3.
The whole computation takes only one clock cycle on the proposed architecture.
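Algorithm 9: The subtraction in prime field.
Input: p, $a = \sum_{i=0}^{n-1} 2^i a_i$ and $b = \sum_{i=0}^{n-1} 2^i b_i$.
Output: a − b mod p.
1. $s_1 \leftarrow a$; $s_2 \leftarrow b$;
2. $v_2 \leftarrow s_1 + (\neg s_2) + 1$;
3. $w_2 \leftarrow v_2 + p$;
4. $c_2 \leftarrow (v_2)_n$;
5. if $c_2 = 1$ then $t_2 \leftarrow w_2$; else $t_2 \leftarrow v_2$;
6. return $t_2$;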
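A software trace of Algorithm 9 looks as follows. This is a sketch under a stated assumption: the difference is formed in (n+1)-bit two's complement, so that bit n of v2 is a sign bit that is set exactly when a < b. The text does not spell out this convention, and the function name `mod_sub` is hypothetical.

```python
def mod_sub(a: int, b: int, p: int, n: int) -> int:
    """a - b mod p for 0 <= a, b < p < 2**n."""
    wide = (1 << (n + 1)) - 1                 # n data bits plus one sign bit
    v2 = (a + ((~b) & wide) + 1) & wide       # step 2: a - b in two's complement
    w2 = (v2 + p) & wide                      # step 3: corrected value a - b + p
    c2 = (v2 >> n) & 1                        # step 4: sign bit, 1 when a < b
    return w2 if c2 else v2                   # step 5: select the non-negative result


if __name__ == "__main__":
    print(mod_sub(2, 5, 7, 3))                # -3 mod 7
```

As in the addition, the correction by p is computed unconditionally and only the final multiplexer depends on the sign, which matches the single-cycle behaviour described in the text.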
6.4 A Configurable Fpk Arithmetic Unit (CAU)
It is observed that the major operations for pairing computations over BN curves are
performed either in $F_p$ or in $F_{p^2}$. Thus, we design a configurable architecture for
performing arithmetic in $F_p$ and $F_{p^2}$. Figure 6.2 shows the architecture of the proposed CAU.
It consists of three Fp-adder/subtractor/multiplier units described in Section 6.3.
Each of these units, along with its input multiplexers, is identified as a separate block
(A1, A2, A3), and the blocks can operate in parallel. The CAU operates in two modes, namely
Fp-mode and Fp2-mode. In Fp-mode, it computes three independent Fp-operations on
blocks A1, A2, and A3. The respective operations computed in this mode are
t1 ← a0 [+/−/·] b0, t2 ← a1 [+/−/·] b1, and t3 ← a2 [+/−/·] b2, as shown in the figure.
Figure 6.2: The architecture of Configurable Fpk Arithmetic Unit (CAU). Operations listed per block in the figure: A1: t1 ← a0 +/− b0, t3 ← a0 + a1, t1 ← a0 · b0, t1 ← t1 − t2; A2: t2 ← a1 +/− b1, t4 ← b0 + b1, t2 ← a1 · b1, t4 ← t1 + t2, t2 ← t3 − t4; A3: t3 ← a2 +/− b2, t3 ← t3 · t4, t3 ← a2 · b2.
In Fp2-mode, the CAU computes $F_{p^2}$-multiplication. Let an element $\alpha \in F_{p^2}$ be
represented as $\alpha_0 + \alpha_1 X$, where $\alpha_0, \alpha_1 \in F_p$ and X is an indeterminate. The formula of Karatsuba
multiplication c = ab in $F_{p^2}$ is:
$v_0 = a_0 b_0$, $v_1 = a_1 b_1$,
$c_0 = v_0 + \beta v_1$,
$c_1 = (a_0 + a_1)(b_0 + b_1) - v_0 - v_1$,
where $v_0, v_1, c_0, c_1, a_0, a_1, b_0, b_1 \in F_p$. Here $\beta$ is a quadratic non-residue in $F_p$, which is $-2$
in the case of the BN curve. We compute a · b in the proposed CAU as described in Algorithm 10.
All operations within a step of Algorithm 10 are computed in parallel, whereas the
individual steps are executed one by one. Step 1 of the algorithm is computed by block
A1 and block A2. Then the CAU executes the three independent Fp-multiplications defined
in step 2 on A1, A2, and A3, respectively. After steps 3 and 4 are executed by blocks A1 and A2,
the final result is stored in the registers t1 and t2 as defined in step 5. The cost of
multiplication in $F_{p^2}$ is 3m, where m represents the cost of one Fp-multiplication. However,
due to the three parallel independent Fp-multiplication units, this cost on the proposed CAU is
only m. The $F_{p^2}$-squaring is performed as a · a in order to reduce the multiplexer complexity of
the CAU, so it has the same cost.
Algorithm 10: The multiplication in $F_{p^2}$.
Input: p, $a = a_0 + a_1X$ and $b = b_0 + b_1X$.
Output: a · b.
1. $t_3 \leftarrow a_0 + a_1$; $t_4 \leftarrow b_0 + b_1$;
2. $t_1 \leftarrow a_0 \cdot b_0$; $t_2 \leftarrow a_1 \cdot b_1$; $t_3 \leftarrow t_3 \cdot t_4$;
3. $t_1 \leftarrow t_1 - t_2$; $t_4 \leftarrow t_1 + t_2$;
4. $t_1 \leftarrow t_1 - t_2$; $t_2 \leftarrow t_3 - t_4$;
5. return $t_1 + t_2X$;
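The four register-transfer steps of Algorithm 10 can be followed in a short software model. This is a sketch assuming $X^2 = \beta = -2$ and plain Python integers for the $F_p$ operands; `fp2_mul` and `fp2_mul_schoolbook` are illustrative names, and the cycle count at the end simply adds one cycle per addition or subtraction step to the 256 cycles of the multiplication step.

```python
BETA = -2   # X^2 = BETA, the quadratic non-residue quoted for the BN prime


def fp2_mul(a, b, p):
    """(a0 + a1*X)(b0 + b1*X) with X^2 = -2, following the steps of Algorithm 10."""
    a0, a1 = a
    b0, b1 = b
    # step 1: two additions in parallel
    t3 = (a0 + a1) % p
    t4 = (b0 + b1) % p
    # step 2: three independent multiplications in parallel
    t1 = (a0 * b0) % p
    t2 = (a1 * b1) % p
    t3 = (t3 * t4) % p
    # step 3: both right-hand sides read the old t1
    t1, t4 = (t1 - t2) % p, (t1 + t2) % p
    # step 4: t1 becomes v0 - 2*v1, t2 becomes the cross term
    t1, t2 = (t1 - t2) % p, (t3 - t4) % p
    return t1, t2


def fp2_mul_schoolbook(a, b, p):
    """Reference product using four multiplications."""
    a0, a1 = a
    b0, b1 = b
    return (a0 * b0 + BETA * a1 * b1) % p, (a0 * b1 + a1 * b0) % p


CYCLES_FP2_MUL = 1 + 256 + 1 + 1   # steps 1 to 4 with a 256-cycle multiplier

if __name__ == "__main__":
    print(fp2_mul((3, 4), (5, 6), 103), CYCLES_FP2_MUL)
```

Subtracting t2 twice in steps 3 and 4 realises the multiplication by β = −2 without a dedicated constant multiplier.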
The micro-instruction sequence generator identifies the current operation type and generates
the respective micro-instructions, which are simply the control signals $c_i$, 1 ≤ i ≤ 11.
The values of the control signals, which also represent the scheduling
of the different operations on the CAU, are given in Table 6.1.
Table 6.1: Micro-instructions for performing arithmetic in Fp and Fp2.
                 c1  c2  c3  c4  c5  c6  c7  c8  c9  c10  c11
Fp-mode, execution of three Fp-operations in parallel
                 0   1   1   0   1   1   1   1   1   1    0
Fp2-mode, execution of Algorithm 10†
Step 1           0   0   0   0   0   0   0   0   0   1    1
Step 2           0   1   1   0   0   0   1   1   1   1    0
Step 3           1   2   2   1   0   0   1   0   0   0    1
Step 4           1   2   3   2   0   0   1   1   0   0    0
† Each row of micro-instructions represents one step in Algorithm 10.
This sequence generator is constructed as a typical state machine which generates
micro-instructions in each state. A deterministic state transition takes place at every clock cycle
based on the current state and the overall status of the CAU. For a multiplication in Fp with a
256-bit p, it remains in the same state for 256 cycles, whereas it remains in a state for only one
cycle when computing an Fp addition or subtraction. Thus, the cost m means 256 clock cycles
in the proposed pairing cryptoprocessor. Similarly, the computation of c = ab in $F_{p^2}$
takes 259 clock cycles (one for step 1, 256 for step 2, and one each for steps 3 and 4 of
Algorithm 10), which is approximately equal to m.
6.5 The Pairing Cryptoprocessor (PCP)
This section describes the proposed architecture of the cryptoprocessor. The main nov-
elty of the architecture lies in its efficient utilization of FPGA features. Independent oper-
ations are exploited at each level of pairing computations to evolve an optimized parallel
design. We explain here the top level of the design followed by its internal parts.
6.5.1 The Datapath Design
The major operations for pairing computations are point doubling (PD), point addition
(PA), line computation (l(Q)), $f^2$, and $f \cdot l(Q)$. In case of the Tate pairing on a BN curve, the
PA and PD are performed on $E(F_p)$. Hence, the underlying operations are performed in
$F_p$. Similarly, the operation l(Q) is performed in $F_{p^2}$, while the other two operations are
performed in $F_{p^{12}}$. In case of the ate and optimal-ate pairings, the PA, PD, and l(Q) are performed
in $F_{p^2}$, and $f^2$ and $f \cdot l(Q)$ are performed in $F_{p^{12}}$. Each of the above computations is
well defined and consists of a number of independent Fp-operations. The proposed datapath
executes those independent operations in parallel to speed up pairing computations.
Figure 6.3 shows the overall resulting structure of the datapath. Two configurable Fpk
arithmetic units (CAU) are included which perform arithmetic in Fp and Fp2 depending
on their mode of configurations. The instructions to configure the CAUs are stored into
a small memory segment called instruction memory. There is a special instruction fetch
and decode (IFD) unit which reads the respective instructions and converts them to proper
configuration signals for both the CAUs. The input data to the CAUs come in parallel from
respective registers. The mechanism and regularity of data access for computing above
operations are fairly simple. The distribution of access to the registers and the resolution of
access conflicts are handled efficiently at runtime by a dedicated hardware block called the
data access unit (DAU), which distributes the data to the CAUs from the registers and vice
versa.
Figure 6.3: The datapath of the pairing cryptoprocessor. Blocks shown in the figure: instruction memory, IFD, sequence control, access control, micro-instruction sequence generators 1 and 2, CAU-1 and CAU-2 (execution unit), registers 1 to d (active registers), and the operand and result multiplexers of the data access unit (DAU).
Each CAU performs at most three Fp-operations in parallel. Thus, overall twelve
independent operands along with the modulus p and six outputs are transferred
between the memory elements and the CAUs. These on-demand concurrent data requests
result in multiple independent read or write connections between the CAUs and the DAU. The DAU
takes care of granting accesses. A simple multiplexing protocol is used between
the CAUs and the registers, which is able to confirm a request within the same cycle so that
no delay cycles are incurred when accessing data in parallel. The data accesses and
instruction sequences are hard coded into the sequence control of the architecture, which
avoids additional software development costs.
The data access conflicts have been resolved prior to the design of the DAU. The proposed
design is custom hardware for pairing computations which executes a fixed set of operations.
The dependencies of the instructions are predefined and thus the access conflicts are known.
The priority of the data processing and the respective execution is rearranged accordingly
to achieve maximum utilization of the CAUs.
The data access unit or DAU acts as a mediator while transferring data between the CAUs
and the memory elements. Due to the demand for parallel access, the proposed cryptoprocessor
stores all intermediate results in its active registers. To achieve the intended parallelism of
pairing computations on BN curves, the proposed design contains fifty 256-bit registers
(i.e., d = 50 in Fig. 6.3). Each register has data-in, data-out, and enable lines.
It is updated from the data-in lines when the respective enable signal is asserted. The crossbar
switch (results) redirects the outputs of each operation to the registers. Similarly, the operands
are redirected from the registers to the input ports of the CAUs. The respective select signals
are generated by the sequence control unit prior to these two redirection procedures.
The access control block synchronizes the select lines of the multiplexers for operands and
results. It also synchronizes the enable signals of the registers for storing the intermediate
results.
6.6 Computation of Tate Pairing on PCP
We follow the formula and algorithms for the computation of asymmetric pairings
(Tate, ate, and R-ate) that are given in [42]. The major computations in pairing algorithm
are the Miller function and the final exponentiation. The Miller function consists of two
major steps, namely : doubling step and addition step. Here, we discuss the computation
of above steps for Tate pairing over BN curve on our proposed PCP.
The Tate pairing (tr) over BN curve takes input points P and Q over Fp and Fp2 , re-
spectively. The parameter r is a 256-bit prime of Hamming weight 91. Thus, the Miller
algorithm runs for 255 iterations having 255 doubling steps and 90 addition steps. There
are sufficient independent operations within the doubling and addition steps which can be
performed in parallel. Our proposed PCP consists of a fixed number of functional units.
Therefore, an optimization can be done based on the available functional units and the op-
erations. In the following subsections, we describe the optimized scheduling of above steps
on proposed PCP.
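The iteration counts quoted above follow directly from the binary expansion of the loop parameter. The sketch below counts them in the same way as the text does, that is, one doubling step per bit below the leading one and one addition step per set bit below the leading one; `miller_steps` is a hypothetical helper name.

```python
def miller_steps(loop_param: int):
    """Return (doubling steps, addition steps) of a left-to-right Miller loop."""
    doublings = loop_param.bit_length() - 1
    additions = bin(loop_param).count("1") - 1   # the leading bit only initialises T
    return doublings, additions


z = 0x6000000000001F2D
r = 36 * z**4 + 36 * z**3 + 18 * z**2 + 6 * z + 1

if __name__ == "__main__":
    print("Tate  (loop over r)     :", miller_steps(r))
    print("ate   (loop over t - 1) :", miller_steps(6 * z**2))
    print("R-ate (loop over 6z + 2):", miller_steps(6 * z + 2))
```

Printing the three cases side by side shows how the shorter loop parameters of the ate and R-ate pairings reduce both kinds of steps.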
6.6.1 Computation of Doubling Step
The doubling step consists of the following computations.
• The point doubling (2T ) operation.
• The computation of tangent line at point T (lT,T (Q)).
• The squaring of Miller function ( f 2).
• The multiplication of Miller function with line function ( f · lT,T (Q)).
The computations of 2T, $l_{T,T}(Q)$, and $f^2$ are performed in parallel on our PCP. In Jacobian
coordinates the formulae for doubling a point T = (X,Y,Z) are $2T = (X_3,Y_3,Z_3)$ where
$X_3 = 9X^4 - 8XY^2$, $Y_3 = (3X^2)(4XY^2 - X_3) - 8Y^4$ and $Z_3 = 2YZ$. The tangent line at T,
after clearing denominators, is $l(x,y) = 3X^3 - 2Y^2 - 3X^2Z^2x + Z_3Z^2y$ [90].
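The doubling and tangent-line formulae can be exercised on a toy curve. The following sketch assumes a small prime p = 103 in place of the 256-bit BN prime and evaluates the line at points with coordinates in $F_p$ rather than $F_{p^2}$; all function names are illustrative.

```python
def jacobian_double(T, p):
    """Double T = (X, Y, Z) on y^2 = x^3 + 3 using the formulae quoted in the text."""
    X, Y, Z = T
    X3 = (9 * X**4 - 8 * X * Y**2) % p
    Y3 = (3 * X**2 * (4 * X * Y**2 - X3) - 8 * Y**4) % p
    Z3 = (2 * Y * Z) % p
    return X3, Y3, Z3


def tangent_line(T, Z3, x, y, p):
    """Evaluate l(x, y) = 3X^3 - 2Y^2 - 3X^2 Z^2 x + Z3 Z^2 y at the point (x, y)."""
    X, Y, Z = T
    return (3 * X**3 - 2 * Y**2 - 3 * X**2 * Z**2 * x + Z3 * Z**2 * y) % p


def to_affine(T, p):
    """Map (X, Y, Z) to (X / Z^2, Y / Z^3); requires Z != 0 mod p."""
    X, Y, Z = T
    zi = pow(Z, -1, p)
    return (X * zi**2) % p, (Y * zi**3) % p


if __name__ == "__main__":
    p = 103                      # toy prime, an assumption for illustration
    T = (1, 2, 1)                # (1, 2) lies on y^2 = x^3 + 3
    D = jacobian_double(T, p)
    print(to_affine(D, p), tangent_line(T, D[2], 1, 2, p))
```

The second printed value is the tangent line evaluated at T itself, which should vanish because the line touches the curve there.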
In case of Tate pairing computation on BN curves, $X, Y, Z, X_3, Y_3, Z_3 \in F_p$ and $x, y \in F_{p^2}$.
Let us assume that x and y are represented as $x_0 + x_1U$ and $y_0 + y_1U$. The above operations
are performed by one of the CAUs in the following way.
Instruction   ASM1,1           ASM1,2           ASM1,3
1.            t0 ← X^2         t1 ← Y^2         t2 ← Y · Z
2.            t3 ← (t0)^2      t4 ← X · t1      t5 ← (t1)^2
3.            t4 ← 2t4         t6 ← 2t3         Z3 ← 2t2
4.            t4 ← 2t4         t6 ← 2t6         t5 ← 2t5
5.            t3 ← t3 + t6     −                t5 ← 2t5
6.            X3 ← t3 − t2     −                t5 ← 2t5
7.            t3 ← t4 − X3     −                t7 ← t7 + t0
8.            t7 ← t7 · t3     t4 ← Z^2         t2 ← X · t0
9.            Y3 ← t7 − t5     t1 ← 2t1         t5 ← 2t2
10.           t4 ← t4 · t0     t0 ← t4 · Z3     −
11.           t2 ← 2t4         −                t5 ← t2 + t5
12.           t4 ← t4 + t2     l0 ← t5 − t1     −
13.           l10 ← t4 · x0    l11 ← t4 · x1    −
14.           l20 ← t0 · y0    l21 ← t0 · y1    −
In the above table $ASM_{i,j}$ stands for the j-th Fp Add/Sub/Mult unit of the i-th CAU.
Nonlinear Fp operations are performed in instructions 1, 2, 8, 10, 13, and 14. If we assume that
s ≈ m, then the cost of the above operations is 6m. At the same time CAU2 starts the
computation of $f^2$. The Miller function f has the following equivalent representations [42]:
$f = f_0 + f_1W$, where $f_0, f_1 \in F_{p^6}$
$= (f_{0,0} + f_{0,1}V + f_{0,2}V^2) + (f_{1,0} + f_{1,1}V + f_{1,2}V^2)W$, where $f_{i,j} \in F_{p^2}$
$= f_{0,0} + f_{1,0}W + f_{0,1}W^2 + f_{1,1}W^3 + f_{0,2}W^4 + f_{1,2}W^5$
The computation of $c = f^2$ is performed in $F_{p^{12}}$ using the complex method in the following
way:
$v = f_0 \cdot f_1$,
$c_0 = (f_0 + f_1)(f_0 + \beta f_1) - v - \beta v$,
$c_1 = 2v$,
where $v, c_0, c_1$ are in $F_{p^6}$. It requires two $F_{p^6}$ multiplications. The $F_{p^6}$ multiplication is
performed in the tower field $F_{(p^2)^3}$ using the Karatsuba technique, which requires six
multiplications in $F_{p^2}$. Let us consider that an element $a_i \in F_{p^2}$ is represented as $a_{i0} + a_{i1}U$,
$a_{ij} \in F_p$. The computation of $v = f_0 \cdot f_1$ on CAU2 is as follows:
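1. $v_0 \leftarrow f_{00} \cdot f_{10}$, where $f_{00}, f_{10} \in F_{p^2}$
2. $v_1 \leftarrow f_{01} \cdot f_{11}$, where $f_{01}, f_{11} \in F_{p^2}$
3. $v_2 \leftarrow f_{02} \cdot f_{12}$, where $f_{02}, f_{12} \in F_{p^2}$
4. $t_{10} \leftarrow f_{010} + f_{020}$, $t_{11} \leftarrow f_{011} + f_{021}$
5. $t_{20} \leftarrow f_{110} + f_{120}$, $t_{21} \leftarrow f_{111} + f_{121}$
6. $t_3 \leftarrow t_1 \cdot t_2$, where $t_1, t_2 \in F_{p^2}$
7. $t_{10} \leftarrow v_{10} + v_{20}$, $t_{11} \leftarrow v_{11} - v_{21}$
8. $t_{30} \leftarrow t_{30} - t_{10}$, $t_{31} \leftarrow t_{31} - t_{11}$
9. $t_{31} \leftarrow t_{31} + t_{31}$
10. $v_{00} \leftarrow v_{00} - t_{31}$, $v_{01} \leftarrow v_{01} + t_{30}$
11. $t_{10} \leftarrow f_{000} + f_{010}$, $t_{11} \leftarrow f_{001} + f_{011}$
12. $t_{20} \leftarrow f_{100} + f_{110}$, $t_{21} \leftarrow f_{101} + f_{111}$
13. $t_3 \leftarrow t_1 \cdot t_2$, where $t_1, t_2 \in F_{p^2}$
14. $t_{10} \leftarrow v_{00} + v_{10}$, $t_{11} \leftarrow v_{01} - v_{11}$
15. $t_{20} \leftarrow v_{21} + v_{21}$
16. $t_{10} \leftarrow t_{10} + t_{20}$, $t_{11} \leftarrow v_{20} - t_{11}$
17. $v_{10} \leftarrow t_{30} - t_{10}$, $v_{11} \leftarrow t_{31} + t_{11}$
18. $t_{10} \leftarrow f_{000} + f_{020}$, $t_{11} \leftarrow f_{001} + f_{021}$
19. $t_{20} \leftarrow f_{100} + f_{120}$, $t_{21} \leftarrow f_{101} + f_{121}$
20. $t_3 \leftarrow t_1 \cdot t_2$, where $t_1, t_2 \in F_{p^2}$
21. $t_{10} \leftarrow v_{00} + v_{20}$, $t_{11} \leftarrow v_{01} + v_{21}$
22. $t_{10} \leftarrow v_{10} - t_{10}$, $t_{11} \leftarrow v_{11} - t_{11}$
23. $v_{20} \leftarrow t_{30} + t_{10}$, $v_{21} \leftarrow t_{31} + t_{11}$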
The result $v \in F_{p^6}$ is represented as $(v_{00} + v_{01}U) + (v_{10} + v_{11}U)V + (v_{20} + v_{21}U)V^2$,
where $v_{ij} \in F_p$. In the above computation, steps 1, 2, 3, 6, 13, and 20 perform multiplications
in $F_{p^2}$. Thus the cost of $v = f_0 \cdot f_1$ is 6m; it is computed on CAU2 in parallel with 2T and
$l_{T,T}(Q)$ by the proposed PCP.
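The complex squaring identity used for $f^2$ holds in any quadratic extension, so it can be checked on a small example. The sketch below assumes a toy setting in which the coefficients $f_0, f_1$ are integers modulo p = 103 (standing in for $F_{p^6}$ elements) and $W^2 = \beta = -2$; `quad_square_complex` is a hypothetical name.

```python
def quad_square_complex(f, beta, p):
    """Square f0 + f1*W, where W^2 = beta, with two multiplications instead of three."""
    f0, f1 = f
    v = (f0 * f1) % p                                       # first multiplication
    c0 = ((f0 + f1) * (f0 + beta * f1) - v - beta * v) % p  # second multiplication
    c1 = (2 * v) % p
    return c0, c1


if __name__ == "__main__":
    p, beta = 103, -2            # toy parameters, an assumption for illustration
    f0, f1 = 17, 42
    direct = ((f0 * f0 + beta * f1 * f1) % p, (2 * f0 * f1) % p)
    print(quad_square_complex((f0, f1), beta, p), direct)
```

Expanding $(f_0 + f_1)(f_0 + \beta f_1)$ gives $f_0^2 + \beta f_1^2 + (1 + \beta) f_0 f_1$, so subtracting $v + \beta v$ leaves exactly the constant coefficient of the square.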
The second $F_{p^6}$ multiplication, i.e., the computation of $(f_0 + f_1)(f_0 + \beta f_1)$, is performed
by both CAUs and costs only 3m in the PCP. Therefore, the total cost of computing
2T, $l_{T,T}(Q)$, and $f^2$ on the PCP is 9m.
The line value l(Q) is represented as $(l_0 + l_1V) + (l_2V)W$, where $l_0 \in F_p$, $l_1, l_2 \in F_{p^2}$, which
is equivalent to $l_0 + l_1W^2 + l_2W^3$. The computation of $f \cdot l(Q)$ is performed in the tower
field $F_{((p^2)^3)^2}$ in the following way.
$f' = f \cdot l(Q) = ((f_{0,0} + f_{0,1}V + f_{0,2}V^2) + (f_{1,0} + f_{1,1}V + f_{1,2}V^2)W) \cdot ((l_0 + l_1V) + (l_2V)W)$
The topmost extension is quadratic. Thus the computation of $f \cdot l(Q)$ is done by three $F_{p^6}$
multiplications, which are identified as:
$t^1_1 = (l_0 + l_1V) \cdot (f_{0,0} + f_{0,1}V + f_{0,2}V^2)$
$t^1_2 = (l_2V) \cdot (f_{1,0} + f_{1,1}V + f_{1,2}V^2)$
$t^1_3 = (l_0 + (l_1 + l_2)V) \cdot ((f_{0,0} + f_{1,0}) + (f_{0,1} + f_{1,1})V + (f_{0,2} + f_{1,2})V^2)$
One multiplication in $F_{p^6}$ using the Karatsuba method requires 18 Fp multiplications. However,
due to the sparse representation of l(Q), the cost of computing $t^1_i$, 1 ≤ i ≤ 3, is lower
than the cost of three full $F_{p^6}$ multiplications. The equations for $t^1_1$ and $t^1_3$ each
require only 14 Fp multiplications. In our parallel cryptoprocessor these two equations
are computed in parallel on the two CAUs, which costs 5m. The computation of $t^1_2$ requires
only nine Fp multiplications; it is performed on both CAUs and costs only 2m.
Therefore, the computation of $f \cdot l(Q)$ requires 37 Fp multiplications, which cost only
7m in our PCP. The total cost for computing the doubling step (the computation
of 2T, $l_{T,T}(Q)$, $f^2$, and $f \cdot l(Q)$) of the Miller algorithm for the Tate pairing on a BN curve is
therefore 9m + 7m = 16m.
6.6.2 Computation of Addition Step
The addition step consists of the computations of T + P, $l_{T,P}(Q)$, and $f \cdot l_{T,P}(Q)$. The
formulae for mixed Jacobian-affine addition are the following: if $T = (X_1,Y_1,Z_1)$ is in
Jacobian coordinates and $P = (X_2,Y_2)$ is in affine coordinates, then $T + P = (X_3,Y_3,Z_3)$ where
$X_3 = (Y_2Z_1^3 - Y_1)^2 - (X_2Z_1^2 - X_1)^2(X_1 + X_2Z_1^2)$,
$Y_3 = (Y_2Z_1^3 - Y_1)(X_1(X_2Z_1^2 - X_1)^2 - X_3) - Y_1(X_2Z_1^2 - X_1)^3$,
$Z_3 = Z_1(X_2Z_1^2 - X_1)$.
The line through T and P is
$l(x,y) = (X_2(Y_2Z_1^3 - Y_1) - Y_2Z_3) - (Y_2Z_1^3 - Y_1)x + Z_3 \cdot y$.
During the addition step of the Miller algorithm we compute
the above operations in parallel on both CAUs. There are limited independent operations
in this step. Therefore, there is scope for optimizing the scheduling of operations on the Fp
arithmetic units. The optimization is done based on the requirements for additional registers
and related wiring. The respective scheduling is shown here.
Step   CAU1                                              CAU2
1.     t0 ← Y2 · Z1,  t1 ← (Z1)^2                        −
2.     t0 ← t1 · t0,  t1 ← t1 · X2                       −
3.     t4 ← t1 + X1,  t0 ← t0 − Y1,  t5 ← t1 − X1        −
4.     t3 ← (t0)^2,  Z3 ← t5 · Z1,  t7 ← (t5)^2          l10 ← t0 · x0,  l11 ← t0 · x1
5.     t2 ← t7 · X1,  t4 ← t4 · t7,  t5 ← t5 · t7        t10 ← t0 · X2
6.     X3 ← t3 − t4                                      −
7.     t2 ← t2 − X3                                      −
8.     t2 ← t2 · t0,  t4 ← Y2 · Z3,  t5 ← t5 · Y1        l20 ← Z3 · y0,  l21 ← Z3 · y1
9.     Y3 ← t2 − t5                                      l0 ← t10 − t4
In the above scheduling, the nonlinear operations (multiplication and squaring) in Fp
are performed in steps 1, 2, 4, 5, and 8. Thus, the cost of computing T +P, lT,P(Q) is 5m
in the PCP. This computation is followed by f · l(Q), which costs 7m. Therefore, the cost
for evaluating the addition step is 5m+7m = 12m in the PCP.
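The mixed addition and chord-line formulae can be checked on a toy curve in the same way as the doubling step. This is a sketch assuming the small prime p = 103 and line evaluation at points with coordinates in $F_p$; `mixed_add` and `chord_line` are illustrative names.

```python
def mixed_add(T, P, p):
    """Add Jacobian T = (X1, Y1, Z1) and affine P = (X2, Y2) on y^2 = x^3 + 3.

    Returns the Jacobian sum together with N = Y2*Z1^3 - Y1, which the line reuses.
    """
    X1, Y1, Z1 = T
    X2, Y2 = P
    N = (Y2 * Z1**3 - Y1) % p
    D = (X2 * Z1**2 - X1) % p
    X3 = (N * N - D * D * (X1 + X2 * Z1**2)) % p
    Y3 = (N * (X1 * D * D - X3) - Y1 * D**3) % p
    Z3 = (Z1 * D) % p
    return (X3, Y3, Z3), N


def chord_line(P, N, Z3, x, y, p):
    """Evaluate l(x, y) = (X2*N - Y2*Z3) - N*x + Z3*y at the point (x, y)."""
    X2, Y2 = P
    return ((X2 * N - Y2 * Z3) - N * x + Z3 * y) % p


if __name__ == "__main__":
    p = 103                                  # toy prime, an assumption
    pts = [(x, y) for x in range(p) for y in range(p) if (y * y - x**3 - 3) % p == 0]
    A, B = pts[0], pts[5]
    R, N = mixed_add((A[0], A[1], 1), B, p)
    print(R, chord_line(B, N, R[2], B[0], B[1], p))
```

The line value printed last is evaluated at P itself and should be zero, since the chord passes through both T and P.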
6.6.3 Computation of Final Exponentiation
The final exponentiation is computed in the following way. It follows the optimization that
factors $(p^{12}-1)/r$ into three parts [61] and computes $f^{(p^{12}-1)/r}$ as:
$f^{(p^{12}-1)/r} = f^{(p^6-1) \cdot \frac{p^6+1}{p^4-p^2+1} \cdot \frac{p^4-p^2+1}{r}} = ((f^{p^6-1})^{p^2+1})^{(p^4-p^2+1)/r}$.
The computation is done in the following way:
1. $f \leftarrow f^{p^6-1}$.
2. $f \leftarrow f^{p^2+1}$.
3. $a \leftarrow f^{-(6z+5)}$, $b \leftarrow a^p$, $b \leftarrow a \cdot b$.
4. Compute $f^p$, $f^{p^2}$, $f^{p^3}$.
5. $f \leftarrow f^{p^3} \cdot [b \cdot (f^p)^2 \cdot f^{p^2}]^{6z^2+1} \cdot b \cdot (f^p \cdot f)^9 \cdot a \cdot f^4$.
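The factorization of the exponent can be confirmed with exact integer arithmetic. The sketch below assumes the BN parameter z quoted earlier in the chapter and tracks only exponents of f, not field elements; the variable names are illustrative.

```python
z = 0x6000000000001F2D
p = 36 * z**4 + 36 * z**3 + 24 * z**2 + 6 * z + 1
r = 36 * z**4 + 36 * z**3 + 18 * z**2 + 6 * z + 1

easy1 = p**6 - 1                             # exponent of step 1
easy2 = p**2 + 1                             # exponent of step 2
hard, rem = divmod(p**4 - p**2 + 1, r)       # exponent realised by steps 3 to 5

# Exponent accumulated by steps 3 to 5, expressed as a power of f.
a_exp = -(6 * z + 5)                         # a = f^(-(6z+5))
b_exp = a_exp * (p + 1)                      # b = a^p * a
hard_from_steps = (p**3
                   + (6 * z**2 + 1) * (b_exp + 2 * p + p**2)
                   + b_exp + 9 * (p + 1) + a_exp + 4)

if __name__ == "__main__":
    print("r divides p^4 - p^2 + 1:", rem == 0)
    print("three-part factorization holds:", easy1 * easy2 * hard * r == p**12 - 1)
    print("steps 3 to 5 give the hard part:", hard_from_steps == hard)
```

Working with exponents only keeps the check independent of any particular $F_{p^{12}}$ implementation.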
Table 6.2 lists the operation costs of the final exponentiation on the PCP. The power of
$(p^6 - 1)$ in $F_{(p^6)^2}$ is an easy exponentiation, which is performed by a conjugation [8]
(Frobenius) and a division. The operation is $f^{p^6} = f_0 - f_1W$. Thus, $f^{p^6-1}$ is performed by
one inversion and one multiplication in $F_{p^{12}}$. As shown in [61], the inversion $(a_0 + a_1W)^{-1}$
in $F_{p^{12}}$ using quadratic over sextic extensions is computed as:
$(a_0 + a_1W)^{-1} = (a_0 - a_1W)/((a_0)^2 + 2(a_1)^2)$.
It is computed by one inversion, two multiplications, and two squarings in $F_{p^6}$. The
inversion $a = (a_0 + a_1V + a_2V^2)^{-1}$ in $F_{p^6}$ using cubic over quadratic extension is performed
in the following way:
$A = (a_0)^2 + 2a_1a_2$, $B = -2(a_2)^2 - a_0a_1$, $C = (a_1)^2 - a_0a_2$,
$F = -2a_1C + a_0A - 2a_2B$,
$a = (A + BV + CV^2)/F$.
It requires one inversion, nine multiplications, and three squarings in $F_{p^2}$. On the proposed
PCP, we compute $a = (a_0 + a_1U)^{-1}$ in $F_{p^2}$ by:
$t_1 \leftarrow (a_0)^2$, $t_2 \leftarrow (a_1)^2$,
$t_3 \leftarrow t_1 + 2t_2$,
$a_0 \leftarrow a_0/t_3$, $a_1 \leftarrow -a_1/t_3$.
The division in Fp is performed by the binary inversion/division algorithm described in
Chapter 2. On our parallel PCP, the above operations cost 3m. The cost for computing
the inversion in $F_{p^6}$ is 7m + 3m = 10m. The cost for computing the inversion in $F_{p^{12}}$ is 20m in
our PCP. Hence, the cost for computing $f^{p^6-1}$ is 29m.
Table 6.2: Operation costs for the final exponentiation on our PCP.
Operation                                              Cost on PCP
$f^{p^6-1}$                                            29m
$f^{p^2+1}$                                            12m
$f^{-(6z+5)}$                                          480m
$a^p$, $a \cdot b$, $f^p$, $f^{p^2}$, $f^{p^3}$        21m
$T \leftarrow b \cdot (f^p)^2 \cdot f^{p^2}$           24m
$T \leftarrow T^{6z^2+1}$                              951m
$f^{p^3} \cdot T \cdot b \cdot (f^p \cdot f)^9 \cdot f^4$   93m
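The $F_{p^2}$ inversion sequence can be written out directly. The sketch below assumes $U^2 = -2$ and uses Python's built-in modular inverse in place of the binary inversion/division unit; `fp2_inv` and `fp2_mul` are illustrative names.

```python
def fp2_inv(a, p):
    """Inverse of a0 + a1*U with U^2 = -2, following the t1, t2, t3 sequence."""
    a0, a1 = a
    t1 = (a0 * a0) % p
    t2 = (a1 * a1) % p
    t3 = (t1 + 2 * t2) % p            # norm a0^2 + 2*a1^2, non-zero for a != 0
    t3_inv = pow(t3, -1, p)           # stands in for the binary inversion unit
    return (a0 * t3_inv) % p, (-a1 * t3_inv) % p


def fp2_mul(a, b, p):
    """Schoolbook product in F_p[U]/(U^2 + 2), used here only for checking."""
    a0, a1 = a
    b0, b1 = b
    return (a0 * b0 - 2 * a1 * b1) % p, (a0 * b1 + a1 * b0) % p


if __name__ == "__main__":
    p = 103                           # toy prime with p = 7 mod 8, an assumption
    a = (17, 42)
    print(fp2_mul(a, fp2_inv(a, p), p))
```

The norm is non-zero for every non-zero element because −2 is a quadratic non-residue when p ≡ 7 (mod 8), which is what makes the single $F_p$ inversion sufficient.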
The first part of the exponentiation is not only cheap (although it does require an
extension field division), it also simplifies the rest of the final exponentiation [20]. After
raising to the power of $(p^d - 1)$, with d = k/2, the field element becomes unitary [104]. This
has important implications, as squaring of unitary elements is significantly cheaper than squaring of
non-unitary elements, and any future inversions can be implemented by simple
conjugation [27, 133].
Let $a = (a_0 + a_1U)$ be an element of $F_{p^2}$. Then raising a to the power of
the modulus p is computed by the following relationship:
$(a_0 + a_1U)^p = (a_0 - a_1U) \bmod p$.
This relation can be applied to higher order tower extensions. The exponentiation
by $p^2 + 1$ in $F_{p^{12}}$ is done by applying the $p^2$-power Frobenius automorphism and one
multiplication in $F_{p^{12}}$. Following the procedure described in [42], the $p^2$-power Frobenius
automorphism is computed by five multiplications in $F_{p^2}$. Thus the cost of $f^{p^2+1}$ on our
PCP is 12m.
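The conjugation rule for the p-power Frobenius in $F_{p^2}$ can be confirmed by brute-force exponentiation on a small example. This is a sketch assuming the toy prime p = 103 (which satisfies p ≡ 7 mod 8) and $U^2 = -2$; the helper names are illustrative.

```python
def fp2_mul(a, b, p):
    """Product in F_p[U]/(U^2 + 2)."""
    a0, a1 = a
    b0, b1 = b
    return (a0 * b0 - 2 * a1 * b1) % p, (a0 * b1 + a1 * b0) % p


def fp2_pow(a, e, p):
    """Left-to-right square-and-multiply in F_{p^2}."""
    result = (1, 0)
    for bit in bin(e)[2:]:
        result = fp2_mul(result, result, p)
        if bit == "1":
            result = fp2_mul(result, a, p)
    return result


def fp2_frobenius(a, p):
    """p-power Frobenius as a conjugation: (a0, a1) -> (a0, -a1)."""
    a0, a1 = a
    return a0, (-a1) % p


if __name__ == "__main__":
    p = 103
    a = (17, 42)
    print(fp2_pow(a, p, p), fp2_frobenius(a, p))
```

The conjugation costs a single negation in $F_p$, whereas the explicit exponentiation needs on the order of log p multiplications in $F_{p^2}$.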
The above procedures are also followed for computing $f^p$ and $f^{p^3}$, and each is computed
by five $F_{p^2}$ multiplications, which costs only 3m on our PCP. The exponentiations $f^{6z+5}$,
$T^z$ and $(T^z)^{6z}$ are performed by repeated square-and-multiply. Note that 6z + 5 and 6z have
bit length 66 and Hamming weight 11, while z has bit length 63 and Hamming weight 11.
The respective costs for computing them on our PCP are listed in Table 6.2.
6.6.4 Cost for Computing Tate Pairing
In the case of the BN curve, r has bit length 256 and Hamming weight 91. Thus the total cost for evaluating the iterative Miller function of the Tate pairing computation is 5176m on our PCP. The cost for computing the final exponentiation is 1610m. Hence, the total cost for computing a Tate pairing over BN curves on our cryptoprocessor is 6786m, which takes 1,764,360 cycles.
6.7 Computation of ate Pairing on PCP
The ate pairing interchanges the input points of the Tate pairing and runs a smaller number of iterations. It uses $t-1$ (instead of $r$) to determine the number of iterations in the Miller algorithm [42]. In the case of BN curves $t \approx \sqrt{r}$, and $t-1$ is a 128-bit prime with Hamming weight 28, which halves the number of iterations and drastically reduces the number of addition steps. The computation costs 3165m, 1610m, and 4775m for the Miller function, the final exponentiation, and the ate pairing, respectively, on our proposed hardware. Hence, the number of cycles required to compute an ate pairing on the PCP is 1,241,500.
6.8 Computation of R-ate Pairing on PCP
The R-ate [28] pairing follows the same procedure as the ate pairing, but it uses $a = 6z+2$ (instead of $t-1$) to determine the number of iterations in the Miller algorithm. Since $a \approx \sqrt{t}$ on BN curves, and $a$ has bit length 66 and Hamming weight 9 [42], the Miller algorithm in the R-ate pairing has 65 doubling steps and 8 addition steps. Thus, on our parallel cryptoprocessor, 1681m, 1610m, and 3291m are the costs for the Miller function, the final exponentiation, and the R-ate pairing, respectively. The total number of cycles required to compute an R-ate pairing on the proposed design is 855,660.
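The cost figures of Sections 6.6.4 to 6.8 can be cross-checked with a few lines of arithmetic. In the sketch below, the factor of 260 cycles per unit m is not stated in the text; it is inferred from the reported totals (1,764,360 cycles for 6786m) and should be read as an assumption, as should the 50 MHz clock used to convert cycles into time.

```python
CYCLES_PER_M = 260        # inferred from 1,764,360 / 6786 (assumption)
FREQ_HZ = 50_000_000      # assumed clock frequency of the FPGA design
FINAL_EXP_M = 1610        # cost of the final exponentiation (Table 6.2 total)

# Miller-function costs in units of m, as quoted in the text
MILLER_M = {"Tate": 5176, "ate": 3165, "R-ate": 1681}

def pairing_cost(name):
    """Return (total cost in m, cycle count, time in ms) for one pairing."""
    total_m = MILLER_M[name] + FINAL_EXP_M
    cycles = total_m * CYCLES_PER_M
    return total_m, cycles, 1e3 * cycles / FREQ_HZ

for name in MILLER_M:
    total_m, cycles, ms = pairing_cost(name)
    print(f"{name}: {total_m}m, {cycles} cycles, {ms:.2f} ms")
```

Under these assumptions the cycle counts reproduce the three totals quoted above exactly, and the derived times land within roughly 0.1 ms of the reported figures.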
6.9 Results
The whole design has been written in Verilog (HDL). All synthesis results have been obtained from the Xilinx ISE Design Suite [22] using a Virtex-4 xc4vlx200 FPGA device with a supply voltage of 1.2V. The design can run at a maximum frequency of 50MHz. The pairing hardware uses around 52k logic slices including the controllers and the data access unit. It uses 27k flip-flops for registers. It finishes one Tate, ate, and R-ate pairing computation in 35.3ms, 24.9ms, and 17.0ms, respectively. Table 6.3 shows the implementation results.
Table 6.3: Implementation results of the pairing cryptoprocessor on xc4vlx200 device.
Operation   Slice   LUT     FF     Frequency (MHz)   Cycles   Security (bit)   Time (ms)
Tate        52 k    101 k   27 k   50                1764 k   128              35.3
ate         52 k    101 k   27 k   50                1242 k   128              24.9
R-ate       52 k    101 k   27 k   50                856 k    128              17.0
The critical path of the design goes through the data access mechanism, then through two 256-bit adders, the multiplexer mx1, and back through the data access mechanism. In § 5 it is shown that the latency of a 256-bit adder circuit is 9.9ns. However, this latency consists of an input buffer delay of 1.3ns, an addition logic delay of 6.2ns, and an output buffer delay of 2.4ns; that is, the delay reported for the stand-alone adder includes the input and output buffer delays. In our architecture the critical path lies between two internal registers and includes neither the input buffer nor the output buffer. Therefore, the total latency of the critical path of the design is calculated as 3.8ns + 2×6.2ns + 1.6ns + 2.2ns = 20ns.
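As a quick consistency check (a worked calculation, not an additional measurement), the critical-path sum and the clock frequency it implies are:

```latex
\begin{align*}
t_{\mathrm{crit}} &= 3.8 + 2 \times 6.2 + 1.6 + 2.2 = 20.0\ \mathrm{ns},\\
f_{\max} &= \frac{1}{t_{\mathrm{crit}}} = \frac{1}{20\ \mathrm{ns}} = 50\ \mathrm{MHz}.
\end{align*}
```

This agrees with the 50MHz maximum frequency reported for the design.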
6.9.1 Comparison with Pairing Implementations
This section provides a performance comparison with related pairing implementations over BN curves. Performance is compared with actual implementations of cryptographic pairings in software and on dedicated hardware achieving a 128-bit security level. Hardware implementations of the ηT pairing over binary and cubic curves are shown in [21, 75]. These designs are for a lower security level (72-bit) and hence it would be unfair to compare them with the present design. Table 6.4 gives a performance comparison of related hardware and software implementations.
Table 6.4: Hardware and software implementation of pairings over BN-curves.
Reference   Platform       Frequency (MHz)   Area                          Cycles     Time (ms)
Tate pairing
Proposed    Virtex-4       50                52 kSlices                    1764 k     35.3
[17]        130 nm CMOS    338               97 kGates                     11627 k    34.4
[61]        Pentium-4      3400              -                             156740 k   -
ate pairing
Proposed    Virtex-4       50                52 kSlices                    1242 k     24.9
[17]        130 nm CMOS    338               97 kGates                     7706 k     22.8
[19]§       130 nm CMOS    204               183 kGates                    862 k      4.2
[1]§        Virtex-6       210               4,014 Slices, 42 DSP48E1s     336 k      1.6
[42]        Pentium-4      2400              -                             81000 k    -
[39]        64-bit core2   2400              -                             14640 k    -
[61]        Pentium-4      3400              -                             133620 k   -
optimal-ate pairing
Proposed    Virtex-4       50                52 kSlices                    856 k      17.0
[17]        130 nm CMOS    338               97 kGates                     5340 k     15.8
[19]§       130 nm CMOS    204               183 kGates                    593 k      2.9
[1]§        Virtex-6       210               4,014 Slices, 42 DSP48E1s     245 k      1.17
[42]        64-bit core2   2400              -                             10000 k    -
§ implementation specifically for BN-curves with fixed parameters.
Due to its parallel structure, our PCP computes six $F_p$ multiplications in parallel, which are completed in 256 cycles. The main features that strengthen the proposed PCP for pairing computations are as follows:
• The proposed cryptoprocessor is the first FPGA result for pairing computation with 128-bit security.
• The adopted parallelism and the efficient use of two $F_{p^k}$ arithmetic cores reduce the total number of cycles drastically.
• Due to inherent properties of the platform, the frequency of a design on an FPGA is much lower than that on an ASIC (CMOS standard cell). However, the speed achieved by the PCP is comparable to that of the CMOS standard cell designs.
• The PCP can be flexibly configured for different curve parameters.
The underlying platform plays a crucial role in determining the performance of a design. Thus, comparing existing designs across different platforms is not entirely fair. We therefore try to identify the platform-independent features of existing designs and compare them with our proposed one. The number of cycles required to compute a pairing may be considered such a parameter.
Kammler et al. [17] reported the first hardware implementation of cryptographic pairings achieving 128-bit security. In [17] the proposed hardware is not only a cryptoprocessor, but an actual ASIP: it is in fact a general purpose processor, augmented with finite field arithmetic units in order to compute pairings. It uses the same z that we have considered to generate a 256-bit BN curve. The Montgomery algorithm is used for $F_p$ multiplication. The platform of the design is a 130 nm CMOS standard cell library, whereas our design is on a Virtex-4 FPGA. The main feature of the design [17] is the fast modular multiplication in $F_p$, which takes only 68 cycles. The average cycle count of our PCP for one $F_p$ multiplication is only 43, which is 1.6 times faster than [17]. With respect to the Tate pairing computation, the design of [17] takes 11 627 k cycles, whereas our design takes only 1764 k cycles, which is much less (0.15 times only) compared to [17].
Fan et al. [19] proposed a processor for cryptographic pairing over BN curves. They designed a fast modular multiplier in $F_p$, specific to BN parameters, which takes only 23 cycles. The 130 nm ASIC design of [19] provides the best known performance, taking only 2.9ms to compute an R-ate pairing over a BN curve. This design also attains a smaller area-latency product than that in [17]. The main drawback of the design proposed in [19] is that it does not provide the flexibility to compute pairings on chosen parameters. In contrast, our design provides this flexibility in all aspects, which indeed requires more cycles.
The results of software implementations [39, 42] are quite impressive. On an Intel 64-bit core2 processor, the R-ate pairing requires only 10,000,000 cycles. The advantages of the Intel core2 are that it has a fast multiplier (two full 64-bit multiplications in 8 cycles) and a relatively high clock frequency. Even so, the software implementation takes 13 times more clock cycles than our cryptoprocessor. A very recent work by Naehrig et al. [8] shows that the Optimal-ate pairing on BN curves can be computed in 4,470,408 cycles on an Intel Core 2 Quad Q6600 processor. A software implementation of the same pairing on a different curve is described in [10]. It takes only 2.63 million clock cycles on a 2.8 GHz Intel Core i7 processor. However, the exact time required to compute pairings by executing software on a desktop or server system is not predictable. It depends on many other factors such as available cache memory, context switching, bus speed of the system, etc.
6.10 Conclusion
In this chapter we presented an FPGA based architecture for computing cryptographic pairings over 256-bit BN curves. The design allows a flexible choice of curve parameters. Extensive parallelism techniques have been incorporated to speed up the overall cryptographic pairing computations. It provides a speed comparable to that of the existing ASIC designs. The overall number of clock cycles required to compute pairings over BN curves is less than that of existing flexible designs. To the best of our knowledge it is the first FPGA result for high security (128-bit) cryptographic pairings.
The next chapter focusses on the security analysis of pairing computations against two physical attacks. The fault attack on a pairing computation tries to exploit the faulty output caused by a transient fault. The fault is injected into a specific register of the pairing cryptoprocessor. On the other hand, the power attack exploits the variations of power consumption during pairing computations.
Chapter 7
Pairing Computations Against Fault and
Power Attacks
Bilinear pairing is a new and increasingly popular way of constructing cryptographic protocols. This has resulted in the development of pairing based schemes such as identity based encryption (IBE), which are ideally used in identity aware devices. The security of such devices rests on the security of the pairing computations. This thesis considers the security of the pairing computations against physical attacks based on the covert power channel and faulty outputs. The introductory works on cryptanalysis of pairing computations by exploiting power consumption and faulty outputs are described in [82] and [40]. However, the existing works have addressed only a small set of pairing computations, and they have not performed actual attacks.
7.1 Introduction
In general, implementation techniques for computing the Tate pairing, such as the Barreto, Kim, Lynn, and Scott (BKLS) algorithm [138], are effectively realized as point multiplication with a fixed multiplier and some auxiliary operations. However, the algorithms for the Tate pairing by Duursma and Lee [126] and their modification by Kwon [111] are not based on a point multiplication algorithm. These two algorithms compute the Tate pairing on supersingular curves over the field $F_{3^m}$. Fault injection attacks on these two pairing algorithms have been explicitly studied by Page and Vercauteren in [82]. The attack exploits the effect of a fault at a specific register which stores the number of iterations of the pairing computation. The countermeasures for resisting fault attacks on the respective pairing algorithms are also described in the same paper.
This chapter describes how the above fault attack is mounted on a cryptoprocessor. The attack assumes that the respective fault is injected into a specific register inside the pairing cryptoprocessor. With experimental results, this chapter demonstrates a technique for injecting a fault into a register by tuning the clock frequency. The chapter also identifies the limitations of the existing countermeasures and proposes a new countermeasure to defend against fault attacks. It further identifies a weakness of pairing computations based on Miller’s algorithm. It demonstrates this vulnerability on the computations of asymmetric pairings over BN curves [76] and in Edwards coordinates [47]. A suitable countermeasure against such an attack is also proposed in this chapter.
Side-channel attack based on power consumption analysis of pairing computation is another objective of this chapter. Differential power analysis (DPA) of the ηT pairing over $F_{2^m}$ is described in [40], which targets addition and multiplication operations performed on one secret and one public parameter. In this chapter we propose a DPA attack on pairing computations over prime fields. Through experimental results we demonstrate the proposed attack on an FPGA platform. The chapter further proposes a suitable computation procedure for pairings over prime fields which is secure against DPA attack.
The outline of the chapter is as follows: the chapter starts with the demonstration of
fault injection technique followed by proposing a countermeasure against fault attack. Then
it proposes a new fault attacking technique which is described on pairing computations over
BN curves and over Edwards coordinates. The chapter then demonstrates the power attacks
on pairing computations over BN curves which is followed by a counteracting technique.
7.2 Fault Attack on Tate Pairing [82]
A fault attack on pairing computation tries to exploit erroneous results that are produced by the device in the presence of a transient fault at the loop bound m. Algo. 7.1 presents a specialised algorithm proposed by Duursma and Lee in [126] to compute pairings on a family of hyperelliptic curves, including the supersingular curves in characteristic 3. In the algorithm, ρ, σ, and b are known system parameters. The algorithm was further improved by Kwon [111] and by Barreto et al. [138].
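Algorithm 7.1: Duursma-Lee algorithm.
Input: $P = (x_1, y_1)$, $Q = (x_2, y_2)$
Output: $f_P(\phi(Q)) \in \mu_l \subset F^*_{q^6}$
$f \leftarrow 1$
for $i = 1$ to $m$ do
    $x_1 \leftarrow x_1^3$, $y_1 \leftarrow y_1^3$
    $\mu \leftarrow x_1 + x_2 + b$
    $\lambda \leftarrow -y_1 y_2 \sigma - \mu^2$
    $g \leftarrow \lambda - \mu\rho - \rho^2$
    $f \leftarrow f \cdot g$
    $x_2 \leftarrow x_2^{1/3}$, $y_2 \leftarrow y_2^{1/3}$
end
return $f^{q^3-1}$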
Page and Vercauteren [82] studied the security of pairing algorithms against fault attack. They have shown that if an adversary can induce a suitable transient fault at the loop bound m of the Duursma-Lee algorithm, then the secret point $P(x_1, y_1)$ can be revealed easily. The transient fault on m can be induced through a glitch attack, or by provoking an error in the memory or register where m is stored [113].
Let an adversary induce transient faults into the register that holds the value of the loop boundary m, and let the adversary measure the modified loop boundary and the corresponding pairing result. Suppose the loop boundary m is replaced with $m \pm r$ and $m \pm r + 1$ in two instances. The corresponding pairing results are $R_1 = e_{m\pm r}$ and $R_2 = e_{m\pm r+1}$. The ratio of these two pairings gives
$R = \frac{R_2}{R_1} = \frac{e_{m\pm r+1}}{e_{m\pm r}} = g_{m\pm r+1}^{q^3-1}$,
where
$g_i = -y_1^{3^i} \cdot y_2\sigma - \mu_i^2 - \mu_i\rho - \rho^2$.
The value of $g_i$ can be extracted from $g_i^{q^3-1}$ through a root finding algorithm and by solving a linear system of equations [82]. Here σ and ρ are field extension parameters known to the attacker. The attacker can extract the values of $x_1$ and $y_1$ from the above equation. We refer to [82] for further analysis and information regarding the above attack.
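The reason the ratio isolates a single loop factor can be illustrated with a toy model. The sketch below does not implement the Duursma-Lee algorithm: it replaces $F_{q^6}$ by integers modulo a small prime, uses a made-up per-iteration factor `g`, and uses an arbitrary final exponent, all of which are assumptions chosen only to show the telescoping of the product.

```python
Q = 1_000_003          # toy prime modulus standing in for the extension field (assumption)
FINAL_EXP = 12345      # stand-in for the final exponent q^3 - 1 (assumption)

def g(i, secret):
    """Hypothetical per-iteration factor g_i that depends on the secret."""
    return (pow(secret, 3 * i + 1, Q) + i + 1) % Q or 1   # never zero

def faulty_pairing(bound, secret):
    """Product of g_1 .. g_bound followed by the final exponentiation."""
    f = 1
    for i in range(1, bound + 1):
        f = f * g(i, secret) % Q
    return pow(f, FINAL_EXP, Q)

def ratio(r2, r1):
    """R = R2 / R1 in the toy field."""
    return r2 * pow(r1, -1, Q) % Q
```

With two outputs obtained for consecutive faulty bounds n and n + 1, `ratio` returns exactly `g(n + 1, secret)` raised to the final exponent, mirroring the relation $R = g_{m\pm r+1}^{q^3-1}$.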
7.2.1 Fault Induction Through Clock Signal
The above fault attack assumes that a known fault is already injected into the register holding the loop boundary. However, it does not use an actual fault injection technique. This section proposes such a technique for injecting a fault into a specific register of a cryptoprocessor. The proposed technique manipulates the clock frequency to inject the fault.
In our target application it is necessary to induce a fault at the register which holds the loop boundary. In the case of Tate pairing computation over the field $F_{3^m}$, the value of m is more than 512, which can be stored in a 10-bit (or longer) register. The hardware is assumed to be designed in such a way that it receives the value of m through a serial port. The respective register is designed as a shift register. The serial data is normally generated synchronously with the receiver’s clock. The shift operation of the register is performed by the same clock. Therefore, if the attacker has control over the clock signal of the register then he (or she) can store some faulty data in the register instead of the actual incoming data.
For example, let us consider a 12-bit register that stores the value of m. Let us further assume that the register is loaded serially through BUS 0 by a synchronous left shift with the clock. Therefore, 12 clocks are required to store a new 12-bit value into the register. The correct data value which is to be stored into the register is 1365 (or 010101010101 in binary).
Figure 7.1 shows the experimental results of three instances of the above load operation. An instance of the load operation takes 12 clocks. During the load operation the signal BUS 1 remains high. The first instance (leftmost “/out” column) shows the correct load, which is performed without any disturbance. The next two columns show the results of two faulty load operations. During the 7-th and 5-th cycles of these two load operations, respectively, we tune the clock to 4 times its normal frequency. In these cycles, due to the high-speed clock, the incoming data from BUS 0 could not be stored into the LSB of the register and no shift operation was performed. However, the counter was incremented and moved on to the next input. As a result, the final value in the register after 12 clocks is faulty. In the above experiment the faulty values are 2742 and 2773, respectively, in the two instances. Therefore, fault injection into the loop boundary of a pairing computation can be done easily through the clock signal. The faulty value of the loop boundary is then exploited to find out the secret point $P = (x_1, y_1)$ of the Duursma-Lee algorithm.
Figure 7.1: Expected and faulty values of register containing the value of m (the correct value, and faulty values at two instances).
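The effect of a skipped shift can be modelled in software. The following is an idealized behavioural sketch, not a reproduction of the measured waveforms of Figure 7.1: it assumes MSB-first serial loading, assumes that a glitched cycle drops both the sample and the shift while the bit counter still advances, and takes the register contents before the load (`initial`) as a free parameter.

```python
WIDTH = 12  # register width used in the example of the text

def serial_load(value, glitch_cycle=None, initial=0):
    """Shift `value` MSB-first into a WIDTH-bit register over WIDTH clocks.

    In the glitched cycle the incoming bit is neither sampled nor shifted,
    but the bit counter still advances (assumed fault model).
    """
    reg = initial
    mask = (1 << WIDTH) - 1
    for cycle in range(1, WIDTH + 1):
        bit = (value >> (WIDTH - cycle)) & 1
        if cycle == glitch_cycle:
            continue                      # bit lost, no shift performed
        reg = ((reg << 1) | bit) & mask
    return reg
```

A fault-free call `serial_load(1365)` returns 1365. Under this model, if the register still holds 1365 from a previous load, a glitch in the 5-th cycle leaves 2773 in the register; the outcome of any glitch depends on the previous contents, since one stale bit survives at the top of the register.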
7.2.2 Analysis of Existing Countermeasures
Page and Vercauteren [82] have given two countermeasures against fault attacks on pairing based cryptography. Both countermeasures are based on the point blinding technique. In the fault attack described in Section 7.2, the fault is injected randomly into the loop boundary m. The attacker can easily measure the faulty value of m through timing or power analysis. The attacker collects two pairing results $R_1$ and $R_2$ for two faulty loop boundaries $m \pm r$ and $m \pm r + 1$, respectively, and computes the ratio
$R = \frac{R_2}{R_1} = \frac{e_{m\pm r+1}(P,Q)}{e_{m\pm r}(P,Q)}$, (7.1)
which is exploited to compute the x and y coordinates of the secret point P. The countermeasures proposed in [82] protect against the fault attack by randomizing the input points P and Q so that the ratio R cannot be exploited.
7.2.2.1 New Point Blinding Technique [82]
The aim of the point blinding technique is to randomize the input points so that the attacker cannot utilize knowledge of the public point in the pairing computation. This countermeasure chooses two integers x, y randomly from $\mathbb{Z}_l^*$ such that $xy \equiv 1 \pmod{l}$. The points P and Q in the computation of e(P,Q) are blinded by computing xP and yQ. The pairing is computed on xP and yQ as e(xP,yQ), since it is known that
$e_m(P,Q) = e_m(xP,yQ) = e_m(P,Q)^{xy}$. (7.2)
In both the Duursma-Lee and Kwon-BGOS algorithms, the input points are processed and the pairing result is produced after m iterations. According to the relationship shown in Eq. 7.2, the pairing results on the sets of points (P,Q) and (xP,yQ) are equal. By definition, this equality holds for correct pairing results only, i.e. those produced after m iterations of the above algorithms. Therefore, when the aforementioned fault attack is applied to the new point blinding technique, the attacker cannot find a ratio R for which $R = e_{m\pm r+1}/e_{m\pm r}$. With this countermeasure two such outputs are:
$R_1 = e_{m\pm r}(x_1P, y_1Q) \neq e_{m\pm r}(P,Q)$,
$R_2 = e_{m\pm r+1}(x_2P, y_2Q) \neq e_{m\pm r+1}(P,Q)$.
Therefore, the ratio of $R_2$ and $R_1$ cannot be exploited for finding out the secret point P. The variables x and y are updated after every pairing computation. It is suggested in [82] that these random variables are refreshed by computing $x = (x \cdot c) \bmod l$ and $y = (y \cdot d) \bmod l$ such that $(c \cdot d) \equiv 1 \pmod{l}$.
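The blinding identity and the refresh rule can be exercised on a toy bilinear map. The sketch below is not a pairing on an elliptic curve: it models points as integers modulo a small prime l (the multiples of fixed generators) and defines the map as $g^{ab}$ in a subgroup of order l modulo q. The parameters l = 11, q = 23, g = 2 and all function names are illustrative assumptions.

```python
import secrets

L = 11   # toy group order l (assumption)
Q = 23   # toy prime with l | q - 1 (assumption)
G = 2    # element of multiplicative order 11 modulo 23

def toy_pairing(a, b):
    """Bilinear map on Z_l x Z_l; a and b stand for the points a*G1 and b*G2."""
    return pow(G, (a * b) % L, Q)

def fresh_blinding_pair():
    """Return random (x, y) with x*y = 1 (mod l)."""
    x = secrets.randbelow(L - 1) + 1       # x in [1, l-1]
    return x, pow(x, -1, L)                 # y = x^{-1} mod l

def refresh(x, y):
    """Update (x, y) with a fresh pair (c, d), c*d = 1 (mod l)."""
    c, d = fresh_blinding_pair()
    return (x * c) % L, (y * d) % L

def blinded_pairing(a, b, x, y):
    """Evaluate the map on the blinded inputs x*P and y*Q."""
    return toy_pairing(x * a % L, y * b % L)
```

In this model the blinded and unblinded results agree for every input pair, and the refreshed pair still satisfies $xy \equiv 1 \pmod{l}$, which is the property the countermeasure relies on.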
However, the main difficulty of the new point blinding technique is the generation of random variables x and y for which $x \cdot y \equiv 1 \pmod{l}$. One possibility is that the random variables are generated by the user during the key generation procedure. The user then supplies them to the cryptoprocessor during the pairing computation along with the private key. If the above procedure is also followed for refreshing (x,y), then another similar pair (c,d) needs to be available to the user and supplied to the cryptoprocessor. Therefore, this protocol demands 4m bits of additional private parameters.
Alternatively, the overhead of private parameters may be avoided in the following way. Two such pairs of integers (x,y) and (c,d) may be stored inside the pairing cryptoprocessor, so that the individual user does not need to supply them. The main problem with this scheme is that the architecture must be designed for a fixed l, with all four random variables stored inside the hardware. This is only possible for a fixed application with a fixed value of l. The whole hardware must be replaced if a new l needs to be chosen.
7.2.2.2 Altering Traditional Point Blinding [82]
The pairing computation on points P, Q is performed by
$e(P,Q) = e(P,Q+X) \cdot e(P,X)^{-1}$,
where X is a random point. It is assumed that P is secret and Q is public. The fault attack described in [82] exploits knowledge of the public point Q. This defence mechanism [82] tries to randomize the public point Q using the random point X. Thus, it computes e(P,Q+X) instead of e(P,Q), and eliminates the surplus by multiplying by the inverse of e(P,X). The value $e(P,X)^{-1}$ is assumed to be supplied to the cryptoprocessor to avoid an additional pairing computation and an inversion in the extension field. One difficulty of this countermeasure is the refreshment of the random point X. In [82], it is done by computing $bX$ and $e(P,X)^{-b}$ such that $b \in_R \{-2,+2\}$. The major difficulty of such a countermeasure is the generation of the random point X. As mentioned for the previous countermeasure, it can be considered private to the user and supplied to the cryptoprocessor during the execution of the pairing algorithm. The key size is thereby increased by 18m bits (2×6m for the x and y coordinates of X, and 6m for $e(P,X)^{-1}$), where it is considered that the embedding degree of the underlying elliptic curve is 6 and hence the point X and $e(P,X)^{-1}$ are in $F_{q^6}$.
7.2.3 Proposed Countermeasure
This section proposes a suitable countermeasure against fault attack on pairing computation. The underlying principle of the fault attack on pairing computation is the ability of the attacker to change the value of the loop boundary m. The attacker also has the ability to measure the change from timing or power analysis of the computation. Through fault induction, the attacker tries to obtain two pairing computations, one for m + r iterations and the other for m + r + 1 iterations. Hence, our countermeasure ensures that even if there is a fault, the attacker cannot correlate the pairing output with the number of iterations. The objective is to prevent the attacker from ascertaining the ratio $R_2/R_1$, as mentioned in Section 7.2. At the same time our proposed countermeasure does not increase the size of the user’s private key.
The proposed countermeasure blinds the loop boundary m, as it is the main factor in fault attacks. It protects the loop boundary so that the attacker cannot guess the number of iterations for which the faulty output is produced. It modifies the Duursma-Lee algorithm to protect the secret point in the pairing computation against fault attack. The modified algorithm is shown in Algo. 7.2. Other pairing computation procedures, such as the Kwon-BGOS algorithm, can be modified by the same procedure in order to defend them against fault attack.
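Algorithm 7.2: Modified Duursma-Lee algorithm.
Input: $P = (x_1, y_1)$, $Q = (x_2, y_2)$
Output: $f_P(\phi(Q)) \in \mu_l \subset F^*_{q^6}$
Choose $r_1 \in_R F^*_{q^6}$, and $r_2 \in_R \mathbb{Z}$, $2 \le r_2 \le m$
$f_0 \leftarrow r_1$, $f_1 \leftarrow 1$
$m' \leftarrow m + r_2$
for $i = 1$ to $m'$ do
    $x_1 \leftarrow x_1^3$, $y_1 \leftarrow y_1^3$
    $\mu \leftarrow x_1 + x_2 + b$
    $\lambda \leftarrow -y_1 y_2 \sigma - \mu^2$
    $g \leftarrow \lambda - \mu\rho - \rho^2$
    $f_1 \leftarrow f_1 \cdot g$
    $j \leftarrow (i == m)$
    $f_0 \leftarrow f_j$
    $x_2 \leftarrow x_2^{1/3}$, $y_2 \leftarrow y_2^{1/3}$
end
return $f_0^{q^3-1}$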
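The control structure of Algo. 7.2 (a randomized loop bound with a copy that takes effect only at iteration m) can be sketched independently of the field arithmetic. In the sketch below the extension field is replaced by integers modulo a toy prime, the per-iteration factor `g` is hypothetical, and the final exponentiation is omitted; these are simplifying assumptions made only to show the blinding mechanism.

```python
import secrets

Q = 1_000_003  # toy prime modulus standing in for F_{q^6} (assumption)

def g(i):
    """Hypothetical per-iteration factor; the real g depends on P and Q."""
    return (i * i + 7) % Q

def plain_loop(m):
    """Unprotected accumulation over exactly m iterations."""
    f = 1
    for i in range(1, m + 1):
        f = f * g(i) % Q
    return f

def blinded_loop(m, m_prime=None):
    """Accumulate as in Algo. 7.2; m_prime lets a caller model a faulty bound."""
    r1 = secrets.randbelow(Q - 1) + 1        # random nonzero initial f0
    r2 = secrets.randbelow(m - 1) + 2        # 2 <= r2 <= m
    f = [r1, 1]                              # f[0] = f0, f[1] = f1
    if m_prime is None:
        m_prime = m + r2
    for i in range(1, m_prime + 1):
        f[1] = f[1] * g(i) % Q
        j = int(i == m)
        f[0] = f[j]                          # copies f1 into f0 only when i == m
    return f[0]
```

Whatever value $r_2$ takes, `blinded_loop(m)` returns the same value as `plain_loop(m)`; if a modelled faulty bound falls below m, the function returns the random $r_1$ instead of a partial product.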
7.2.3.1 Correctness Analysis
Theorem 7.1: The modified Duursma-Lee algorithm produces the correct result.
Proof.
Algo. 7.2 is modified from the original Duursma-Lee algorithm (Algo. 7.1) to make it resistant against side-channel and fault attacks. The original algorithm runs for m iterations and produces the pairing result after the m-th iteration. In the modified algorithm, the loop boundary m′ is random, as $m' \leftarrow m + r_2$ with $r_2 \in_R \mathbb{Z}$ and $r_2 \le m$. It therefore runs for a random number of iterations. However, the intermediate pairing result $f_1$ is copied into $f_0$ at the m-th iteration only. It is not copied in any other iteration. At the end of the execution, i.e. after m′ iterations, $f_0$ holds the pairing result of m iterations. Hence, the algorithm produces the correct reduced Tate pairing result.
7.2.3.2 Security Against Fault Attack
Security Assumption. The adversary can only induce unknown random faults into the loop boundary.
Theorem 7.2: The modified Duursma-Lee algorithm is secure against fault attack.
Proof.
In the fault attack, the adversary is interested in two pairing results, $R_{m'\pm r'}$ and $R_{m'\pm r'+1}$. We may consider the following two scenarios.
• Inject fault at m′ : The adversary can change the value of m′ to $m' \pm r'$ (with random r′) by injecting a fault at m′. Thus, our modified Duursma-Lee algorithm runs for $m' \pm r'$ iterations. If the resultant value $m' \pm r' \ge m$ then the algorithm produces the result $R_m$ for m iterations, else it produces the random value $r_1^{q^3-1}$ as the pairing result. So, the adversary cannot collect two such target outputs by injecting random faults at the m′ register.
• Inject fault at m : The adversary can inject a random fault at the m register, and alter m to $m \pm r'$. Thus, the algorithm runs for $m \pm r' + r_2$ iterations. But it produces the result $R_{m\pm r'}$ for $m \pm r'$ iterations only, where $r_2$ and r′ are both random. This result can be collected by the adversary. The adversary can also measure the total number of iterations $m \pm r' + r_2$ by timing or power analysis. But the adversary cannot correlate the outputs with the corresponding measured iteration numbers, which are actually not correlated. Thus, two useful pairing results cannot be found. Therefore, the fault attack described in [82] cannot be mounted on the proposed countermeasure.
7.3 Fault Attack on Pairing in Edwards Coordinates
This section analyzes the security against fault attack of pairing computation in Edwards coordinates, as defined by Ionica and Joux [47]. It identifies a weakness of this algorithm in the presence of a fault and gives a suitable countermeasure.
7.3.1 Pairing in Edwards Coordinates
Edwards showed in [65] that every elliptic curve defined over an algebraic number field F is birationally equivalent to a curve over some extension of F given by the equation:
$x^2 + y^2 = c^2(1 + x^2y^2)$ (7.3)
Thereafter Bernstein and Lange [64] showed that the group operations can be performed most efficiently on elliptic curves defined in Edwards coordinates. The equation $x^2 + y^2 = 1 + dx^2y^2$ is called the Edwards curve [50]. It was shown in [50] that an Edwards curve E is birationally equivalent to the elliptic curve $E_d : (1/(1-d))v^2 = u^3 + 2((1+d)/(1-d))u^2 + u$ via the rational map:
$\psi : E_d \rightarrow E$, $(u,v) \rightarrow \left(\frac{2u}{v}, \frac{u-1}{u+1}\right)$. (7.4)
The addition formula on the Edwards curve is given by:
$(x_1,y_1),(x_2,y_2) \rightarrow \left(\frac{x_1y_2 + y_1x_2}{1 + dx_1x_2y_1y_2}, \frac{y_1y_2 - x_1x_2}{1 - dx_1x_2y_1y_2}\right)$.
It is shown in [64] that the above addition law is complete when d is not a square. This means that it is defined for all pairs of input points on the Edwards curve, with no exceptions for the doubling operation, the neutral element, etc.
The pairing computations in Edwards coordinates and in Twisted Edwards coordinates [50] are defined by Ionica and Joux [47], and by Das and Sarkar [49], respectively. The doubling and mixed addition steps of Miller’s algorithm for pairing computation are redefined in Edwards and Twisted Edwards coordinates in these two papers. It is shown that the computation of the pairing f in Edwards coordinates is more efficient than that in Twisted Edwards coordinates. This work takes the pairing computation given in [47] for analyzing security against fault attack.
7.3.2 Attack Procedure
The fault attack defined in [82] will not work on Miller’s algorithm, Algo. 2.5, in Edwards coordinates due to the complex nature of the iterative operations. For example, the doubling operation [64] on $K = (X_1,Y_1,Z_1)$ gives $2K = (X_3,Y_3,Z_3)$, and the formulas are:
$X_3 = 2X_1Y_1(2Z_1^2 - (X_1^2 + Y_1^2))$,
$Y_3 = (X_1^2 + Y_1^2)(Y_1^2 - X_1^2)$,
$Z_3 = (X_1^2 + Y_1^2)(2Z_1^2 - (X_1^2 + Y_1^2))$.
Similarly, during addition K is updated by K + P, which is even more complex than doubling [64]. The point K is initialized by the secret point $P = (X_0,Y_0,1)$.
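The projective doubling formulas above can be checked against the affine addition law applied to a point and itself. The sketch below does this exhaustively on a toy curve $x^2 + y^2 = 1 + dx^2y^2$ over $F_{101}$ with d = 2; the prime, the value of d and the function names are illustrative assumptions (101 ≡ 5 mod 8, so 2 is a non-square and the addition law is complete).

```python
P = 101  # toy prime with p = 5 (mod 8), so d = 2 is a non-square (assumption)
D = 2

def affine_double(x, y):
    """Double (x, y) using the Edwards addition law with both inputs equal."""
    t = D * x * x * y * y % P
    x3 = 2 * x * y * pow(1 + t, -1, P) % P
    y3 = (y * y - x * x) * pow(1 - t, -1, P) % P
    return x3, y3

def projective_double(X1, Y1, Z1):
    """Doubling formulas for K = (X1, Y1, Z1) as given in the text."""
    s = (X1 * X1 + Y1 * Y1) % P          # X1^2 + Y1^2
    w = (2 * Z1 * Z1 - s) % P            # 2*Z1^2 - (X1^2 + Y1^2)
    return 2 * X1 * Y1 * w % P, s * (Y1 * Y1 - X1 * X1) % P, s * w % P

def to_affine(X, Y, Z):
    z_inv = pow(Z, -1, P)
    return X * z_inv % P, Y * z_inv % P

# All affine points of the toy curve, found by brute force
points = [(x, y) for x in range(P) for y in range(P)
          if (x * x + y * y - 1 - D * x * x * y * y) % P == 0]
```

For every point in `points`, converting the output of `projective_double` back to affine form gives the same result as `affine_double`, also when the input is rescaled by a nonzero Z.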
Algorithm 2.5 is realized as a point multiplication along with some additional field multiplications for computing the pairing value f. The value of f in the doubling step of Miller’s algorithm in Edwards coordinates [47] can be computed by $f \leftarrow f^2 l'$, where in the case of even embedding degree and k > 2, $l'$ can be computed by the following equation:
$l' = 2X_1Y_1(x/y - y/x)(X_1^2 - Y_1^2)(X_1^2 + Y_1^2 - Z_1^2)$
$\quad - 2(X_1^2 - Y_1^2)^2(X_1^2 + Y_1^2 - Z_1^2)$
$\quad - dx^2y^2Z_1^2(X_1^2 + Y_1^2)(2Z_1^2 - X_1^2 - Y_1^2)$
$\quad + (X_1^2 + Y_1^2)(2Z_1^2 - X_1^2 - Y_1^2)(X_1^2 + Y_1^2 - Z_1^2)$.
The Tate pairing $E_l(P,Q)$ is computed by Miller’s algorithm on points P, Q such that P is an l-torsion point on the curve $E(F_q)$ and $Q \in E(F_{q^k})$. In order to mount a fault attack on Miller’s algorithm in Edwards coordinates, we assume that the adversary has the ability to inject a fault at the register l. We further assume that the adversary can obtain the pairing result $E_l(P,Q)$ for l = 2. This may be possible by adopting some powerful fault injection procedure or from a number of trials with the help of timing and simple power analysis [82, 100, 113]. If l = 2 then Miller’s algorithm runs for only one iteration and executes only the doubling part of Algo. 2.5. In such a scenario the pairing output is $f = l'$ with K = P. So, f is a function of $X_0$, $Y_0$, x, y, and d, which can be deduced from the equation of $l'$ by replacing $X_1$ by $X_0$, $Y_1$ by $Y_0$, and $Z_1$ by 1. We can assume that the value of d (a curve parameter) and Q = (x,y) are known to the attacker. Thus, f simplifies to the following equation:
$f = a_1X_0^6 + a_2Y_0^6 + a_3X_0^5Y_0 + a_4X_0Y_0^5 + a_5X_0^2Y_0^4 + a_6X_0^4Y_0^2 + a_7X_0Y_0^3 + a_8X_0^3Y_0 + a_9X_0^2Y_0^2 + a_{10}X_0^4 + a_{11}Y_0^4 + a_{12}X_0^2 + a_{13}Y_0^2$, (7.5)
where $a_1, \cdots, a_{13}$ are constants, as they can be expressed in terms of the known values x, y, and d. We can linearize the above equation by treating each monomial as a separate variable. The public point Q can be changed to obtain a number of such equations. Hence, $X_0$, $Y_0$ can be found by solving the resulting set of linear equations.
7.3.2.1 Practical Implication of Above Fault Attack
Let us assume l is a large prime (say 256 bits long in practice). Then the probability of setting l = 2 by random fault injection [113] is very low ($\approx 2^{-256}$ for a 256-bit l). Hence a random fault in register l has a very low probability of success. However, we propose a different strategy.
The requirement of our fault attack is satisfied by inverting the least significant bit of l (say l[1]) and setting i = 1. Note that since l is an odd prime, l[1] is 1. Now, if l is 256 bits long then i is $\lceil \log_2(256) \rceil = 8$ bits long. Hence the probability of setting i = 1 by random fault injection is at least $2^{-8}$. The algorithm runs for only one iteration as i = 1, and it executes only the doubling part as l[i] = 0. Thus the probability of success of the attacker is $2^{-9}$. Hence we expect the attacker to succeed about once in every 512 trials.
7.3.3 Countermeasure
In order to resist the above fault attack, we ensure that Miller’s algorithm does
not produce a valid pairing result for l = 2, or when i = 1 and l[1] = 0.
In general, l is an odd prime in l-Tate pairing computation, which means l[1] = 1. But for
mounting the above fault attack it is essential to alter the value of l[1] from 1 to 0. Thus we
suggest the modified Miller’s algorithm shown in Algo. 7.3 for defending against the fault
attack. The proposed technique ensures that the pairing is not computed for l[1] = 0.
Algorithm 7.3: Fault attack resistant Miller’s algorithm.
Input: P, an l-torsion point ∈ E(F_q); Q ∈ E(F_{q^k})
Output: the Tate pairing El(P,Q)
i = ⌊log2(l)⌋, K ← P, f ← 1.
if l[1] = 0 then
    return 0.
end
while i ≥ 1 do
    Compute equations of l′ and v′ arising in the doubling of K.
    K ← 2K and f ← f^2 l′(Q)/v′(Q).
    if the i-th bit of l is 1 then
        Compute equations of l′ and v′ arising in the addition of K and P.
        K ← P + K and f ← f l′(Q)/v′(Q).
    end
    i ← i − 1.
end
return f^((q^k − 1)/l).
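The control flow of Algo. 7.3 can be sketched as follows. This is only a skeleton under stated assumptions: the curve arithmetic and the line evaluations are abstracted away, the point K is tracked as the integer scalar k with K = kP, and the function name `miller_schedule` and its `guard` switch are hypothetical. The parameter `i` models a faulted loop counter.

```python
def miller_schedule(l, i=None, guard=True):
    """Return (scalar of K, list of steps) for the Miller loop, or 0 if aborted.

    Bits of l are numbered from 1 (LSB), as in the text, so l[1] is l & 1.
    """
    if guard and (l & 1) == 0:        # countermeasure: refuse an even l
        return 0
    if i is None:
        i = l.bit_length() - 1        # floor(log2(l)) for l >= 1
    k = 1                             # K = P initially
    steps = []
    while i >= 1:
        k = 2 * k                     # doubling step: K <- 2K
        steps.append("double")
        if (l >> (i - 1)) & 1:        # i-th bit of l, 1-indexed
            k = k + 1                 # addition step: K <- K + P
            steps.append("add")
        i -= 1
    return k, steps
```

For an unfaulted odd l the loop ends with K = lP. With the LSB of l flipped and i forced to 1, the unguarded loop performs a single doubling, which is the situation the attack needs, while the guarded version returns 0.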
7.3.3.1 Correctness Analysis
Theorem 7.3: The fault-attack resistant Miller’s algorithm produces the correct result for
cryptographic pairing computation.
Proof.
The modified Miller’s algorithm aborts if l is even: it returns zero if the least significant
bit (LSB) of l is zero, i.e., l[1] = 0. But pairing computation for cryptographic applications
chooses l as a large odd prime. Thus, the LSB of l is one, i.e., l[1] = 1. In this case the
modified Miller’s algorithm executes exactly the same operations as its original form (Algo. 2.5).
Thus it produces the correct pairing value for cryptographic applications.
7.3.3.2 Security Against Fault Attack
Theorem 7.4: Algorithm 7.3 is secure against the above fault attack.
Proof.
The fault attack described in section 7.3 assumes that the attacker is able to inject faults
into particular variables during execution, namely i and l. In order to
mount the fault attack on pairing computation in Edwards coordinates it is necessary to set
i = 1 and l[1] = 0. Let us assume that the adversary has successfully injected the required
faults. To carry out the attack, it is also necessary to obtain
the pairing output computed with the faulty values of i and l. But the proposed fault-attack
resistant Miller’s algorithm does not execute the pairing with the above fault; it simply returns zero. Thus
the proposed countermeasure is secure against the fault attack described in section 7.3.
7.4 Power Attacks on Pairing Computations
Page and Vercauteren [105] presented SPA and DPA attacks on the pairing computations
performed by the Duursma-Lee algorithm [126] and the BLKS algorithm [138] over
F_{3^m}. The power consumption attack on ηT pairing computation over F_{2^m} is described by
Kim et al. in [40]. However, the corresponding attack on pairings over Fp has not been
studied so far. This section investigates the security of pairing computations over Fp
against power consumption attacks.
7.4.1 Weakness of Pairing Computations over Fp
In the decryption step of identity-based encryption schemes [112], a dominant operation
is e(SID,U), where SID is the fixed secret key, and U is a part of a ciphertext. In this case,
power analysis may try to extract the secret key from the pairing computation by repeatedly
manipulating U. The Tate pairing over Fp consists of elliptic curve group operations (ECD
and ECA), the line functions, and the Miller function [42]. The line functions, as per the
definition provided by Chatterjee et al. [90], use both the public point U and the private point
SID. The formulas of the line functions are based on the underlying Fp primitives.
During the addition step of Tate pairing computation the formula of the line function is
l(x,y) = (y − Y2)Z3 − (x − X2)(Y2 Z1^3 − Y1) [90]. In pairing based cryptographic schemes, the
point T = (X1,Y1,Z1) is the intermediate result of the current point doubling operation,
the point U = (X2,Y2) is used as a public parameter (it could be the plain texts or messages),
and SID = (x,y) is used as the private key. The resultant point (T +U) is represented by
(X3,Y3,Z3). Therefore, in such a scheme the operations (x−X2) and (y−Y2) could be
exploited through power analysis attacks to find the x- and y-coordinates of the
secret point.
7.4.2 Proposed DPA Attack
In this section, we investigate a differential power analysis (DPA) attack against the
subtraction (x−X2) used in the Tate pairing on elliptic curves over Fp, where x is secret
and X2 is public and known to, or even chosen by, the attacker. The subtraction (x−X2)
in Fp is computed by first computing S = x−X2 and then reducing the result (if
required) by adding p to S. Let us assume that all operations are performed on 2’s
complement numbers. Therefore, the subtraction S = x−X2 can be performed as
\[
S = \sum_{i=0}^{k} 2^i s_i = \sum_{i=0}^{k-1} 2^i x_i + \sum_{i=0}^{k-1} 2^i \overline{X_{2,i}} + 1,
\]
where k represents the bit length of the operands (x, X2) and $\overline{X_{2,i}}$ is the 1’s
complement of the i-th bit $X_{2,i}$ of X2. The subtraction starts from the least
significant bit (LSB) and computes the sum and carry bits iteratively. The formula for the i-th
carry bit is
\[
c_i = x_i \overline{X_{2,i}} \oplus x_i c_{i-1} \oplus \overline{X_{2,i}} c_{i-1}.
\]
Similarly, the i-th sum bit is computed as
\[
s_i = x_i \oplus \overline{X_{2,i}} \oplus c_{i-1} \quad \text{for } 0 \le i \le k-1, \text{ with } c_{-1} = 1.
\]
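The bit-serial subtraction described above can be written out as a short sketch. It assumes operands already reduced modulo p and a k-bit datapath with p < 2^k; the function names are hypothetical. The carry out of the top bit is 1 exactly when no borrow occurred, so p is added only when that carry is 0.

```python
def subtract_bitwise(x, x2, k):
    """Compute x - x2 on k bits as x + (1's complement of x2) + 1, LSB first.

    Returns (k-bit result, carry out of the top bit).
    """
    c = 1                                  # c_{-1} = 1 supplies the "+1"
    s = 0
    for i in range(k):
        xi = (x >> i) & 1
        nb = ((x2 >> i) & 1) ^ 1           # complemented bit of x2
        si = xi ^ nb ^ c                   # sum bit s_i
        c = (xi & nb) ^ (xi & c) ^ (nb & c)  # carry bit c_i
        s |= si << i
    return s, c

def sub_mod_p(x, x2, p, k):
    """Subtraction in F_p for 0 <= x, x2 < p < 2**k."""
    s, carry = subtract_bitwise(x, x2, k)
    if carry == 0:                         # borrow occurred: result was negative
        s = (s + p) & ((1 << k) - 1)
    return s
```

For example, with p = 13 and k = 4, `sub_mod_p(3, 10, 13, 4)` gives 6, which agrees with (3 − 10) mod 13.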
The proposed DPA attack works as follows. The attacker first collects the power
consumption traces for n randomly chosen public points U. We consider the simplified
Hamming weight model for power leakage [159]. In this model, power consumption
depends on the Hamming weight of the data being processed. Thus, we can express the
power consumption W as:
W = εH +η
where H, ε, and η represent the Hamming weight of the intermediate data, the incremental
amount of power for each extra 1 in the Hamming weight, and the noise, respectively. We
assume that the average of the noise η is zero.
Let W be the power consumption associated with the subtraction operation (x−X2).
We start from the LSB and iteratively find all bits of the x-coordinate of the secret point
SID = (x,y). To recover the i-th bit of x, we guess that xi = 0 and divide the power consumptions
into two sets according to the value of $\overline{X_{2,i}} \oplus c_{i-1}$, where
$\overline{X_{2,i}}$ is the complemented i-th bit of X2:
\[
P_b = \{ W \mid \overline{X_{2,i}} \oplus c_{i-1} = b \}, \quad b \in \{0,1\}.
\]
Thus, the differential power consumption is:
∆ = < P1−P0 > .
If the guess is correct, then the averages of P1 and P0 are ε(M+1)/2 and ε(M−1)/2, respectively,
where M is the bit length of S, so ∆ > 0 and we conclude that xi = 0. Otherwise,
the averages of P1 and P0 are ε(M−1)/2 and ε(M+1)/2, so ∆ < 0 and we conclude that xi = 1.
There should be a positive peak when xi = 0 and a negative peak when xi = 1.
In summary, since the subtraction operation (x−X2) of line function in pairing compu-
tation is vulnerable to the proposed attack, we can recover x. Next, we can obtain the value
of y-coordinate of the secret point SID by solving the curve equation.
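The difference-of-means procedure can be tried out on simulated traces. The following is a minimal sketch, not the measurement setup of this chapter: it assumes the Hamming weight model W = εH + η with Gaussian noise, leaks only the k-bit subtraction result, and uses hypothetical function names and parameter values.

```python
import random

def simulate_traces(x, k, n, eps=1.0, sigma=1.0, seed=1):
    """Return n pairs (X2, W) with W = eps * HW((x - X2) mod 2^k) + noise."""
    rng = random.Random(seed)
    mask = (1 << k) - 1
    traces = []
    for _ in range(n):
        x2 = rng.getrandbits(k)                 # random public operand
        s = (x - x2) & mask                     # k-bit result of the subtraction
        w = eps * bin(s).count("1") + rng.gauss(0.0, sigma)
        traces.append((x2, w))
    return traces

def recover_x(traces, k):
    """Recover x bit by bit, LSB first, from the sign of the difference-of-means."""
    carries = [1] * len(traces)                 # c_{-1} = 1 for every trace
    x_rec = 0
    for i in range(k):
        p0, p1 = [], []
        for (x2, w), c in zip(traces, carries):
            nb = ((x2 >> i) & 1) ^ 1            # complemented i-th bit of X2
            (p1 if nb ^ c else p0).append(w)    # selection bit under guess x_i = 0
        delta = sum(p1) / len(p1) - sum(p0) / len(p0)
        xi = 0 if delta > 0 else 1
        x_rec |= xi << i
        # Update each trace's carry c_i using the bit just recovered.
        carries = [
            (xi & nb) | (xi & c) | (nb & c)
            for nb, c in ((((x2 >> i) & 1) ^ 1, c)
                          for (x2, _), c in zip(traces, carries))
        ]
    return x_rec
```

The same set of traces is reused for every bit; only the partition changes, because the carry into bit i depends on the bits already recovered.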
7.4.3 Mounting the DPA on FPGA Platform
We perform the actual DPA attack on our proposed pairing cryptoprocessor (PCP).
The PCP is implemented on a customized FPGA board for power analysis. We put a
one-ohm resistor between the VCCint pin of the FPGA chip and the on-board voltage
regulator. We measure the current drawn through that resistor during pairing computation
with a Tektronix current probe (serial number
B014316). We use the probe with a TCPA300 power amplifier in standby mode. The
measured power is displayed and stored on a Tektronix TDS5032B Digital Phosphor
Oscilloscope. We develop software tools to automate the whole process for varying inputs. The
power consumption is measured in mV and varies within about ±5 mV. The
power signal is sampled at 12.5 MS/s.
We choose an x with x0 = 0 and perform (x−X2) 2000 times with 2000 different
randomly chosen X2. The respective power consumptions are stored in 2000 one-dimensional
vectors. We then partition the power vectors into two sets, namely P1 and P0.
A vector is placed in set P1 if $\overline{X_{2,0}} \oplus c_{-1} = 1$, i.e., if the LSB of X2 is 1
(recall that c−1 = 1). Otherwise, the vector is placed
in set P0. To compute the differential power consumption we subtract the average of the P0
vectors from the average of the P1 vectors. We call this differential power consumption
vector the difference-of-means, denoted by ∆. Then we accumulate the samples
of ∆ and plot it. The respective difference-of-means is depicted in Fig. 7.2(a), which shows
a positive peak as expected for x0 = 0.
[Figure 7.2: two plots of difference-of-means (vertical axis, scale ×10^−3) against samples (horizontal axis, 50 to 200); panel (a) ranges from 0 to 8, panel (b) from −8 to 0.]
Figure 7.2: The correlation between LSB and corresponding average power differences of
an addition in Fp. (a) for x0 = 0 and (b) for x0 = 1.
The same experiment has been repeated for another x with x0 = 1. The difference-of-means
in this case is plotted in Fig. 7.2(b). In this case the expectation of < P1−P0 > is
negative, and we obtained the expected result with 2000 random X2.
The above experimental result shows that an attacker can easily mount the DPA attack
on pairing computation over Fp. After finding the LSB, DPA can be performed for the
second LSB, and so on. The same power traces can be reused for finding all secret
bits. The partitioning of the power vectors into two sets depending on the current value of
the selection bit, up to the generation of the difference-of-means, is repeated for
each of the secret bits. Thus, the above DPA attack iteratively finds all bits of the
x-coordinate of the secret SID. After obtaining the x-coordinate, the value of the y-coordinate
can be obtained easily by solving the underlying elliptic curve equation.
7.4.4 Proposed DPA-Resistant Pairing Computation
In the pairing computation, the secret point is only used for computing the line functions.
The formula of the line function during the doubling step of the Miller algorithm over
Fp is as follows:
\[
l_{T,T}(x,y) = Z_3 Z^2 y - 2Y^2 - 3X^2 (Z^2 x - X),
\]
where T = (X,Y,Z) is the intermediate point of the Miller algorithm and 2T =
(X3,Y3,Z3) [90].
The formula of lT,T(x,y) uses the secret point SID = (x,y) of identity based encryption
(IBE) [143], but it does not use the public point U = (X2,Y2). Therefore, this function
cannot be exploited by side-channel attacks of this kind.
The second line function lT,P(x,y) is computed during the addition step of the Miller
algorithm. In the IBE scheme P is replaced by U. The formula of lT,U(x,y) is:
\[
l_{T,U}(x,y) = (y - Y_2) Z_3 - (x - X_2)(Y_2 Z_1^3 - Y_1),
\]
where T = (X1,Y1,Z1) is the intermediate result of the doubling step and (X3,Y3,Z3) represents the
result of the addition T +U. In this line computation formula both the public point U = (X2,Y2)
and the private point SID = (x,y) are used. The computation of lT,U(x,y) is the main weakness
of pairing computation over Fp against side-channel attacks. The DPA attack described
above can easily find the x- and y-coordinates of the private point SID by exploiting the
above formula.
The main drawback of the above formula is that the public and private parameters are
directly combined in a single Fp operation. A side-channel attack can thus exploit the
respective Fp operation to find the secret bits by manipulating the public parameter U.
To protect this computation against side-channel attacks, it can be computed in the
following way:
\[
l_{T,P}(x,y) = \bigl(X_2 (Y_2 Z_1^3 - Y_1) - Y_2 Z_3\bigr) - (Y_2 Z_1^3 - Y_1)\, x + Z_3 \cdot y.
\]
The above computation does not contain any Fp primitive that is performed
on one public parameter and one private parameter. The attacker may try to exploit the
power consumption of the cryptoprocessor during the computation of lT,P(x,y). The private
parameter x in the above formula is multiplied by an unknown value (Y2 Z1^3 − Y1).
Therefore, no difference-of-means can be computed for identifying the secret bits of x.
The second secret parameter y is multiplied by Z3 in the modified computation of
lT,P(x,y). The parameter Z3 is computed by the formula Z3 = Z1(X2 Z1^2 − X1),
which ensures that Z3 is unknown due to the unknown temporary point T = (X1,Y1,Z1). Therefore,
no difference-of-means can be computed based on specific bits of Z3 for identifying
the secret bits of y. Thus, the proposed countermeasure protects both the x- and
y-coordinates of the secret point SID, which ensures the security of pairing computation against
the DPA attack.
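That the rearranged line function gives the same value as the original one can be checked numerically. The sketch below assumes arbitrary field elements in place of actual curve points (the identity is purely algebraic) and an illustrative prime; the function names are hypothetical. In the rearranged version the secrets x and y are only ever multiplied by values derived from the intermediate point T.

```python
import random

def line_add_original(x, y, X1, Y1, Z1, X2, Y2, Z3, p):
    """l = (y - Y2) Z3 - (x - X2)(Y2 Z1^3 - Y1); mixes secret and public operands."""
    A = (Y2 * pow(Z1, 3, p) - Y1) % p
    return ((y - Y2) * Z3 - (x - X2) * A) % p

def line_add_rearranged(x, y, X1, Y1, Z1, X2, Y2, Z3, p):
    """l = (X2 A - Y2 Z3) - A x + Z3 y, with A = Y2 Z1^3 - Y1."""
    A = (Y2 * pow(Z1, 3, p) - Y1) % p
    c0 = (X2 * A - Y2 * Z3) % p        # involves no secret operand
    return (c0 - A * x + Z3 * y) % p   # x and y meet only A and Z3

def random_check(p, trials=100, seed=7):
    """Compare both formulas on random inputs; returns True if they always agree."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, y, X1, Y1, Z1, X2, Y2 = (rng.randrange(p) for _ in range(7))
        Z3 = Z1 * (X2 * Z1 * Z1 - X1) % p    # Z3 = Z1 (X2 Z1^2 - X1)
        args = (x, y, X1, Y1, Z1, X2, Y2, Z3, p)
        if line_add_original(*args) != line_add_rearranged(*args):
            return False
    return True
```

The check confirms functional equivalence only; it says nothing about the leakage of a particular hardware implementation.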
7.5 Conclusion
This chapter started with a description of the security issues of pairing algorithms
in the presence of faults. It briefly described the existing countermeasures, which are based
on the point blinding technique, and proposed a new countermeasure against fault attacks.
The proposed countermeasure requires less computation and memory
overhead than the existing countermeasures. The chapter further showed
a weakness of Miller’s algorithm in Edwards coordinates in the presence of faults, and
proposed a suitable countermeasure against this weakness.
The chapter also described the security issues of pairing computations
over Fp against power analysis attacks. It proposed a differential power analysis attack against
pairing computations, and a suitable countermeasure to protect the private
key of an identity based encryption scheme.
Chapter 8
Conclusions and Future Directions
This chapter concludes the thesis by underlining the main contributions. It
also discusses possible directions of future work.
8.1 Conclusions
Since the end of the 1990s, with the onset of the elliptic curve
and pairing eras, there has been rapid growth in their efficient and secure
implementation. Progress in efficient implementation has made it feasible
to use increasingly efficient algorithms and design techniques to compute the underlying
finite field operations. Along with this, increasingly robust implementations against
powerful physical attacks, such as side-channel and fault attacks, have been developed. This
thesis aims at offering some useful techniques towards reduction in time and area as well
as providing security against side-channel and fault attacks of hardware for pairing based
cryptography on FPGA platform. The contributions of the work are summarized as follows:
• In chapter 4, the design objective was to reduce the area and latency of the ECSM
operation while resisting timing and power attacks, by developing a secure
GF(p) elliptic curve cryptoprocessor. A shared programmable unit has been proposed
that can perform GF(p) addition, subtraction, multiplication, inversion, and division.
Thereafter, an elliptic curve cryptoprocessor has been proposed based on two
programmable functional cores. We have proposed a new point blinding technique
which can protect the secret in the ECSM operation against DPA, including the doubling
attack. Through actual differential power analysis on an FPGA platform we have shown
that the proposed ECSM cryptoprocessor is secure against the DPA attack.
• Chapter 5 has focussed on the utilization of in-built FPGA features for developing
optimized prime field arithmetic units. We have proposed a hierarchical adder struc-
ture based on the in-built carry chains of an FPGA device which drastically reduces
the routing delay as well as the overall addition delay of a large operand adder cir-
cuit. The chapter also has proposed a parallelism technique by which the critical
path delay of the interleaved multiplier has been reduced by 50% compared to the
existing design. Finally, it has demonstrated that the proposed high-speed addition
and multiplication techniques improve the performance of the ECSM cryptoprocessor
by 30% compared to its previous version described in chapter 4.
• We have focused on the implementation of pairings over BN curves in chapter 6. This
chapter has introduced a parallel cryptoprocessor on FPGA platform for computing
asymmetric pairings over BN curves. The generic design provides the flexibility
to choose curve parameters. Extensive parallelism techniques have been exploited
to speed up pairing computations. The overall number of clock cycles required to compute
pairings over BN curves has been reduced drastically. The proposed FPGA design
provides speed comparable to the existing ASIC designs.
• In chapter 7, the major objective was security analysis of pairing computations against
physical attacks. We have demonstrated a practical fault injection technique on a
pairing hardware on FPGA platform. Subsequently, this chapter has proposed a
new countermeasure which overcomes the drawback of the existing countermeasures.
The chapter also has shown a vulnerability of Miller’s algorithm for pairing
computation over BN curves and over Edwards coordinates. It has also proposed a
suitable countermeasure against the attack. Finally, the chapter has demonstrated the
vulnerability of pairing computations over BN curves against DPA attack, followed
by a proposal for defending against such attacks.
8.2 Future Directions
The work proposed in each chapter of this thesis can be extended for further research.
In this section, some of the directions in which the problems can be further pursued have
been described.
• In chapter 4, the PGAU has been designed based on a bit serial interleaved multipli-
cation algorithm. In future the underlying bit serial multiplication algorithm can be
replaced by a digit serial multiplication algorithm and the performance can be com-
pared with the current design. A common programmable unit could be aimed for
arithmetic operations over binary, char-3, and prime fields.
• Chapter 5 is concerned only with the performance gained through the utilization of
in-built fast carry chains (FCC) of an FPGA device. However, modern FPGA
devices provide several in-built components, such as DSP blocks, multiplier blocks,
and PowerPC cores. Future work could aim at the development of highly optimized
architectures for finite field operations by utilizing these in-built components of an
FPGA device.
• In chapter 6, the pairing cryptoprocessor has been proposed for computing pairings
over BN curves. We have designed the cryptoprocessor for a general prime p. However,
the BN curves are defined over specific primes. Their other global parameters also
have some specific forms. Therefore, in future these properties could be exploited for
developing an optimized FPGA architecture for asymmetric pairings over BN curves.
• The fault attack described in chapter 7 has considered some specific pairing com-
putations. However, there are a number of pairing-friendly elliptic curves on which
pairing computation formulæ are different. A future direction of the work could be
studying the designed attack techniques on other pairing algorithms.
• This thesis has focussed only on pairing based cryptographic hardware over prime
fields. Moving slightly beyond the scope of this thesis, 128-bit-security pairing
computation hardware over other finite fields may be considered. Furthermore, work
towards a multi-field programmable cryptoprocessor may be taken up, like the design
of a pairing cryptoprocessor for binary, char-3, and prime fields.
• The present design methodology also does not incorporate power as a design metric.
Hence, an important future extension would be implementing low power techniques
into the design.
Bibliography
[1] J. Fan, F. Vercauteren, and I. Verbauwhede. Efficient Hardware Implementation of
Fp-arithmetic for Pairing-Friendly Curves. IEEE Transactions on Computers, 2011. To
appear.
[2] D. Freeman, M. Scott, and E. Teske. A taxonomy of pairing-friendly elliptic curves.
Journal of Cryptology, Vol. 23, No. 2, pp. 224–280, 2010.
[3] F. Vercauteren. Optimal pairings. IEEE Transactions on Information Theory, Vol. 56,
No. 1, pp. 455–461, 2010.
[4] N. Guillermin. A high speed coprocessor for elliptic curve scalar multiplications over
Fp. CHES 2010, LNCS 6225, pp. 48–64, 2010.
[5] D.F. Aranha, J. Lopez, and D. Hankerson. High-speed parallel software implementa-
tion of the ηT pairing. CT-RSA 2010, LNCS 5985, pp. 89–105. Springer, 2010.
[6] N. Estibals. Compact hardware for computing the Tate pairing over 128-bit-security
supersingular curves. Pairing 2010, LNCS 6487, pp. 397–416, 2010.
[7] BlueKrypt, Cryptographic key length recommendation. http://www.keylength.com/en/4/.
2010.
[8] M. Naehrig, R. Niederhagen, and P. Schwabe. New software speed records for crypto-
graphic pairings. Cryptology ePrint Archive, Report 2010/186. http://eprint.iacr.org/.
[9] R. Granger and M. Scott. Faster squaring in the cyclotomic subgroup of sixth degree
extensions. PKC 2010, LNCS 6056, pp. 209–223, 2010.
[10] J.L. Beuchat, J.E.G. Díaz, S. Mitsunari, E. Okamoto, F.R. Henríquez, and T. Teruya.
High-speed software implementation of the optimal ate pairing over Barreto-Naehrig
curves. Pairing 2010, LNCS 6487, pp. 21–39, 2010.
[11] M. Izumi, J. Ikegami, K. Sakiyama, and K. Ohta. Improved countermeasure against
address-bit DPA for ECC scalar multiplication. DATE 2010, pp. 981–984, 2010.
[12] J. Fan, X. Guo, E.D. Mulder, P. Schaumont, B. Preneel, and I. Verbauwhede. State-
of-the-art of secure ECC implementations: a survey on known side-channel attacks
and countermeasures. HOST 2010, pp. 76–87, 2010.
[13] D. Mukhopadhyay. An improved fault based attack of the advanced encryption stan-
dard. Africacrypt 2009, LNCS 5580, pp. 421-434, 2009.
[14] K. Ananyi, H. Alrimeih, and D. Rakhmatov. Flexible hardware processor for elliptic
curve cryptography over NIST prime fields. IEEE Trans. VLSI Systems, Vol. 17, No.
8, pp. 1099–1112, 2009.
[15] D.M. Schinianakis, A.P. Fournaris, H.E. Michail, A.P. Kakarountas, and T. Stouraitis.
An RNS implementation of an Fp elliptic curve point multiplier. IEEE Transactions
on Circuits and Systems-I, Vol. 56, No. 6, pp. 1202–1213, 2009.
[16] J.Y. Lai and C.T. Huang. A highly efficient cipher processor for dual-field elliptic
curve cryptography. IEEE Transactions on Circuits and Systems-II, Vol. 56, No. 5,
pp. 394–398, 2009.
[17] D. Kammler, D. Zhang, P. Schwabe, H. Scharwaechter, M. Langenberg, D. Auras, G.
Ascheid, and R. Mathar. Designing an ASIP for cryptographic pairings over Barreto-
Naehrig curves. CHES 2009, LNCS 5747, pp. 254–271, 2009.
[18] N. Benger and M. Scott. Constructing tower extensions for the implementation
of pairing-based cryptography. Cryptology ePrint Archive, Report 2009/556, 2009.
http://eprint.iacr.org/.
[19] J. Fan, F. Vercauteren, and I. Verbauwhede. Faster Fp-arithmetic for cryptographic
pairings on Barreto-Naehrig curves. CHES 2009, LNCS 5747, pp. 240-253, 2009.
[20] M. Scott, N. Benger, M. Charlemagne, L.J. Dominguez Perez, and E.J. Kachisa. On
the final exponentiation for calculating pairings on ordinary elliptic curves. Pairing
2009, LNCS 5671, pp. 78-88, 2009.
[21] J. Beuchat, J. Detrey, N. Estibals, E. Okamoto, and F.R. Henríquez. Hardware
accelerator for the Tate pairing in characteristic three based on Karatsuba-Ofman multipliers.
Cryptology ePrint Archive, Report 2009/122. http://eprint.iacr.org/.
[22] Xilinx ISE design suite, 2009. http://www.xilinx.com/tools/designtools.htm.
[23] S. Ghosh, M. Alam, D. Roychowdhury, and I. Sengupta. Parallel crypto-devices for
GF(p) elliptic curve multiplication resistant against side-channel attacks. Computers
and Electrical Engineering, Elsevier, Vol. 35, pp. 329–338, 2009.
[24] J.L. Beuchat, E.L. Trejo, L. M. Ramos, S. Mitsunari, and F.R. Henríquez. Multi-core
implementation of the Tate pairing over supersingular elliptic curves. CANS 2009,
LNCS 5888, pp. 413–432, 2009.
[25] A.M. AbdelFattah, A.M.B. El-Din, and H.M.A. Fahmy. An efficient architecture
for interleaved modular multiplication. World Academy of Science, Engineering and
Technology, Vol. 56, 2009.
[26] J. Jiang, J. Chen, J. Wang, D.S. Wong, and X. Deng. High performance architec-
ture for elliptic curve scalar multiplication over GF(2m). Cryptology ePrint Archive,
Report 2008/066. http://eprint.iacr.org/.
[27] M. Naehrig, P.S.L.M. Barreto, and P. Schwabe. On compressible pairings and their
computation. Africacrypt 2008, LNCS 5023, pp. 371-388, 2008.
[28] E. Lee, H.S. Lee, and C. M. Park. Efficient and generalized pairing computation on
abelian varieties. Cryptology ePrint Archive, Report 2008/040. http://eprint.iacr.org/.
[29] F. Hess. Pairing lattices. Pairing 2008, LNCS 5209, pp. 18–38, 2008.
[30] W.N. Chelton and M. Benaissa. Fast elliptic curve cryptography on FPGA. IEEE
Trans. VLSI Systems, Vol. 16, No. 2, pp. 198–205, 2008.
[31] J.Y. Lai and C.T. Huang. Elixir: High-throughput cost-effective dual-field processors
and the design framework for elliptic curve cryptography. IEEE Trans. VLSI Systems,
Vol. 16, No. 11, pp. 1567–1580, 2008.
[32] T.C. Chen, S.W. Wei, and H.J. Tsai. Arithmetic unit for finite field GF(2m). IEEE
Transactions On Circuits And Systems–I, Vol. 55, No. 3, 2008.
[33] R. Laue and S. A. Huss. Parallel memory architecture for elliptic curve cryptography
over GF(p) aimed at efficient FPGA implementation. J. Signal Process. Syst., Vol. 51,
pp. 39-55, 2008.
[34] C. Maxfield. FPGA architectures. http://www.pldesignline.com/192200165;
jsessionid=2EZPXOVIXZFT GQSNDLQCKIKCJUNN2JVN?pgno=4, December 2008.
[35] H. Kaeslin. Digital integrated circuit design – from VLSI architectures to CMOS
fabrication. Cambridge University Press, 2008.
[36] K. Chapman. Expanding dedicated multipliers. White paper : Xilinx FPGAs.
http://www.xilinx.com/support/documentation/white papers /wp277.pdf, December
2008.
[37] J. Hoffstein, J. Pipher, and J.H. Silverman. An introduction to mathematical
cryptography. Springer, 2008.
[38] A. Barenghi, G. Bertoni, L. Breveglieri, and G. Pelosi. A FPGA coprocessor for the
cryptographic Tate pairing over Fp. ITNG 2008, pp. 112-119, 2008.
[39] P. Grabher, J. Großschädl, and D. Page. On software parallel implementation of
cryptographic pairings. SAC 2008. LNCS 5381, pp. 35-50, 2008.
[40] T.H. Kim, T. Takagi, D.G. Han, H. Kim, and J. Lim. Power analysis attacks and
countermeasures on ηT pairing over binary fields. ETRI Journal, Vol. 30, No. 1, pp.
68–80, 2008.
[41] C. Rebeiro and D. Mukhopadhyay. High speed compact elliptic curve cryptoprocessor
for FPGA platforms. Indocrypt 2008, LNCS 5365, pp. 376–388, 2008.
[42] D. Hankerson, A. Menezes, and M. Scott. Software implementation of pairings. In:
Joye, M., Neven, G. (eds.) Identity-Based Cryptography, 2008.
[43] J. Fan and I. Verbauwhede. Extended abstract : unified digit-serial multiplier/inverter
in finite field GF(2m). HOST 2008, 2008.
[44] K. Kawakami, K. Shigemoto, and K. Nakano. Redundant radix-2 number system for
accelerating arithmetic operations on the FPGAs. PDCAT 2008, pp. 370–377, 2008.
[45] S. Ghosh, M. Alam, D. Roychowdhury, and I. Sengupta. A GF(p) elliptic curve group
operator resistant against side channel attacks. GLSVLSI 2008, pp. 53–58, 2008.
[46] Z. Zhao. ID-based weak blind signature from bilinear pairings. International Journal
of Network Security, Vol.7, No.2, pp. 265-268, 2008.
[47] S. Ionica and A. Joux. Another approach to pairing computation in Edwards coordi-
nates. Indocrypt 2008, LNCS 5365, pp. 400-413, 2008.
[48] J. Takahashi and T. Fukunaga. Improved differential fault analysis on CLEFIA. FDTC
2008, pp. 25–34, 2008.
[49] M.P.L. Das and P. Sarkar. Pairing computation on twisted Edwards form elliptic
curves. Pairing 2008, LNCS 5209, pp. 192–210, 2008.
[50] D.J. Bernstein, P. Birkner, M. Joye, T. Lange, and C. Peters. Twisted Edwards curves.
Africacrypt 2008, LNCS 5023, pp. 389-405, 2008.
[51] J. Katz and Y. Lindell. Introduction to modern cryptography. Chapman & Hall/CRC,
2007.
[52] G.M.D. Dormale and J.J. Quisquater. High-speed hardware implementations of el-
liptic curve cryptography: a survey. J. Syst. Architect., Vol. 53, No. 23, pp. 72-84,
2007.
[53] S. Ghosh, M. Alam, I. Sengupta, and D. Roychowdhury. A robust GF(p) parallel
arithmetic unit for public key cryptography. DSD 2007, pp. 109–117, 2007.
[54] P.S.L.M. Barreto, S.D. Galbraith, C. OhEigeartaigh, and M. Scott. Efficient pairing
computation on supersingular Abelian varieties. Designs, Codes and Cryptography,
Vol. 42, pp. 239–271, 2007.
[55] G. Chen, G. Bai, and H. Chen. A high-performance elliptic curve cryptographic pro-
cessor for general curves over GF(p) based on a systolic arithmetic unit. IEEE Trans-
actions on Circuits and Systems-II, Vol. 54, No. 5, pp. 412–416, 2007.
[56] N. Mentens, K. Sakiyama, B. Preneel, and I. Verbauwhede. Efficient pipelining for
modular multiplication architectures in prime fields. GLSVLSI 2007, pp. 534-539,
2007.
[57] S. Mangard, E. Oswald, and T. Popp. Power analysis attacks. Springer, 2007.
[58] J. Fan, K. Sakiyama, and I. Verbauwhede. Elliptic curve cryptography on embedded
multicore systems. WESS 2007, pp. 17–22, 2007.
[59] J.C. Ha, J. Park, S. Moon, and S.M. Yen. Provably secure countermeasure resistant to
several types of power attack for ECC. WISA 2007, LNCS 4867, pp. 333-344, 2007.
[60] K. Sakiyama, L. Batina, B. Preneel, and I. Verbauwhede. Multicore curve-based cryp-
toprocessor with reconfigurable modular arithmetic logic units over GF(2n). IEEE
Transaction on Computers, Vol. 56, No. 9, pp. 1269-1282, 2007.
[61] A.J. Devegili, M. Scott, and R. Dahab. Implementing cryptographic pairings over
Barreto-Naehrig curves. Pairing 2007. LNCS 4575, pp. 197-207, 2007.
[62] E. Barker, W. Barker, W. Burr, W. Polk, and M. Smid. Recommendation for key
management - part 1 : general (revised). NIST special publication 800-57, 2007.
[63] J. Takahashi, T. Fukunaga, and K. Yamakoshi. DFA mechanism on the AES schedule.
FDTC 2007, pp. 62-72, 2007.
[64] D.J. Bernstein and T. Lange. Faster addition and doubling on elliptic curves. Asiacrypt
2007, LNCS 4833, pp. 29-50, 2007.
[65] H.M. Edwards. A normal form for elliptic curves. Bull. AMS 44, pp. 393-422, 2007.
164
BIBLIOGRAPHY
[66] F.R. Henríquez, N.A. Saqib, A.D. Pérez, and Ç.K. Koç. Cryptographic algorithms on
reconfigurable hardware. Springer, 2006.
[67] D. Freeman. Constructing pairing-friendly elliptic curves with embedding degree 10.
ANTS 2006, LNCS 4076, pp. 452-465. 2006.
[68] P. K. Mishra. Pipelined computation of scalar multiplication in elliptic curve cryp-
tosystems. IEEE Transaction on Computers, Vol. 55, No. 8, pp. 1000-1010, 2006.
[69] K. Sakiyama, L. Batina, B. Preneel, and I. Verbauwhede. Superscalar coprocessor for
high-speed curve-based cryptography. CHES 2006, LNCS 4249, pp. 415-429, 2006.
[70] K. Sakiyama, E.D. Mulder, B. Preneel, and I. Verbauwhede. A parallel processing
hardware architecture for elliptic curve cryptosystems. ICASP 2006, pp. 904–907,
2006.
[71] B. Ansari and M.A. Hasan. High performance architecture for elliptic curve scalar
multiplication. The University of Waterloo, Tech. Rep. CACR 2006-01, 2006.
[72] M. Benaissa and W. M. Lim. Design of flexible GF(2m) elliptic curve cryptography
processors. IEEE Transaction Very Large Scale Integr. (VLSI) Syst., Vol. 14, No. 6,
pp. 659-662, 2006.
[73] W. Chelton and M. Benaissa. High-speed pipelined ECC processor over GF(2m). SIPS
2006, 2006.
[74] C.J. McIvor, M. McLoone, and J.V. McCanny. Hardware elliptic curve cryptographic
processor over GF(p). IEEE Transactions on Circuits and Systems-I, Vol. 53, No. 9,
pp. 1946–1957, 2006.
[75] C. Shu, S. Kwon, and K. Gaj. FPGA accelerated Tate pairing based cryptosystems
over binary fields. FPT 2006, pp. 173-180, 2006.
[76] P.S.L.M. Barreto and M. Naehrig. Pairing-friendly elliptic curves of prime order. SAC
2005. LNCS 3897, pp. 319-331, 2006.
[77] F. Hess, N.P. Smart, and F. Vercauteren. The eta pairing revisited. IEEE Transactions
on Information Theory, Vol. 52, No. 10, pp. 4595-4602, 2006.
[78] A. Devegili, C. Ó hÉigeartaigh, M. Scott, and R. Dahab. Multiplication and
squaring on pairing-friendly fields. Cryptology ePrint Archive, Report 2006/471.
http://eprint.iacr.org/.
[79] K. Sakiyama, B. Preneel, and I. Verbauwhede. A fast dual-field modular arithmetic
logic unit and its hardware implementation. ISCAS 2006, pp. 787–790, 2006.
[80] O.A. Khaleel, C. Papachristou, F. Wolff, and K. Pekmestzi. FPGA-based design of a
large moduli multiplier for public-key cryptographic systems. ICCD 2006, pp. 314–
319, 2006.
[81] S.M. Yen, L.C. Ko, S.J. Moon, and J.C. Ha. Relative doubling attack against Mont-
gomery ladder. ICISC 2005, LNCS 3935, pp. 117-128, 2006.
[82] D. Page and F. Vercauteren. A fault attack on pairing-based cryptography. IEEE
Transactions on Computers, Vol. 55, No. 9, pp. 1075–1080, 2006.
[83] H. Mamiya, A. Miyaji, and H. Morimoto. Secure elliptic curve exponentiation against
RPA, ZRA, DPA and SPA. IEICE Transactions on Fundamentals, Vol. E89-A, No. 8,
2006.
[84] C. Whelan and M. Scott. Side channel analysis of practical pairing implementations:
which path is more secure? Vietcrypt 2006, LNCS 4341, pp. 99–114, 2006.
[85] T. Akishita and T. Takagi. Zero-value register attack on elliptic curve cryptosystem.
IEICE Transactions on Fundamentals, Vol. E88-A, No. 1, 2005.
[86] D.J. Bernstein. Cache-timing attacks on AES. Technical report, 2005. Available at:
http://cr.yp.to/antiforgery/cachetiming-20050414.pdf
[87] W. Shusua and Z. Yuefei. A timing and area tradeoff GF(p) elliptic curve processor
architecture for FPGA. ICCCAS 2005, pp. 1308–1312, 2005.
[88] H. Eberle, S. Shantz, V. Gupta, N. Gura, L. Rarick, and L. Spracklen. Accelerating
next-generation public-key cryptosystems on general-purpose CPUs. IEEE Micro, Vol.
25, No. 2, pp. 52-59, 2005.
[89] R.C.C. Cheung, N.J. Telle, W. Luk, and P.Y.K. Cheung. Customizable elliptic curve
cryptosystems. IEEE Transactions on Very Large Scale Integr. (VLSI) Syst., Vol. 13,
No. 9, pp. 1048-1059, 2005.
[90] S. Chatterjee, P. Sarkar, and R. Barua. Efficient computation of Tate pairing in pro-
jective coordinate over general characteristic fields. ICISC 2004, LNCS 3506, pp.
168-181, 2005.
[91] N. Koblitz and A. Menezes. Pairing-based cryptography at high security levels. Cryp-
tology ePrint Archive, Report 2005/076, 2005. http://eprint.iacr.org/.
[92] C.M. Park, M.H. Kim, and M. Yung. A Remark on Implementing the Weil Pairing.
CISC 2005, LNCS 3822, pp. 313–323, 2005.
[93] S. Galbraith. Pairings. In I.F. Blake, G. Seroussi, and N.P. Smart, editors, Advances
in elliptic curve cryptography, London Mathematical Society Lecture Note Series,
chapter IX. Cambridge University Press, 2005.
[94] D.N. Amanor, C. Paar, J. Pelzl, V. Bunimov, and M. Schimmler. Efficient hardware
architectures for modular multiplication on FPGAs. International Conference on Field
Programmable Logic and Applications, pp. 539-542, 2005.
[95] D. N. Amanor. Efficient hardware architectures for modular multiplication. 2005.
Available at: http://www.crypto.ruhr-uni-bochum.de/imperia/md/content/texte/theses/dnamanorthesis.pdf
[96] Q. Liu, D. Tong, and X. Cheng. Non-interleaving architecture for hardware imple-
mentation of modular multiplication. IEEE International Symposium on Circuits and
Systems, pp. 660–663, 2005.
[97] M.E. Kaihara and N. Takagi. A hardware algorithm for modular multiplication/division.
IEEE Transactions on Computers, Vol. 54, pp. 12–21, 2005.
[98] F. Crowe, A. Daly, and W. Marnane. A scalable dual mode arithmetic unit for public
key cryptosystems. ITCC 2005, pp. 568–573, 2005.
[99] D. Harris, R. Krishnamurthy, M. Anders, S. Mathew, and S. Hsu. An improved unified
scalable radix-2 Montgomery multiplier. ARITH 2005, pp. 172–178, 2005.
[100] M. Ciet and M. Joye. Elliptic curve cryptosystems in the presence of permanent and
transient faults. Designs, Codes and Cryptography, Vol. 36, pp. 33–43, 2005.
[101] F. Koeune and F.X. Standaert. A tutorial on physical security and side-channel at-
tacks. FOSAD 2004/2005, LNCS 3655, pp. 78-108, 2005.
[102] A. Menezes, E. Teske, and A. Weng. Weak Fields for ECC. CT-RSA 2004, LNCS
2964, pp. 366-386, 2004.
[103] N. Saqib, F. Rodriguez, and A. Diaz. A parallel architecture for fast computation
of elliptic curve scalar multiplication over GF(2^n). RAW 2004, pp. 26–27, 2004.
[104] M. Scott and P. Barreto. Compressed pairings. Crypto 2004. LNCS 3152, pp. 140-
156, 2004.
[105] D. Page and F. Vercauteren. Fault and side-channel attacks on pairing based cryp-
tography. Cryptology ePrint Archive, Report 2004/283. http://eprint.iacr.org/.
[106] A. Daly, W. Marnane, T. Kerins, and E. Popovici. An FPGA implementation of a
GF(p) ALU for encryption processors. Microprocessors and Microsystems, Vol. 28,
pp. 253–260, 2004.
[107] E. Ozturk, B. Sunar, and E. Savas. Low-power elliptic curve cryptography using
scaled modular arithmetic. CHES 2004, LNCS 3156, pp. 92–106, 2004.
[108] L.P. Lee and K.W. Wong. A random number generator based on elliptic curve opera-
tions. Computers and Mathematics with Applications, Vol. 47, pp. 217–226, Elsevier,
2004.
[109] V.S. Miller. The Weil pairing, and its efficient calculation. Journal of Cryptology,
Vol. 17, pp. 235-261, 2004.
[110] C. McIvor, M. McLoone, and J. McCanny. FPGA Montgomery multiplier architec-
tures - a comparison. Field-Programmable Custom Computing Machines, pp. 279-
282, 2004.
[111] S. Kwon. Efficient Tate pairing computation for supersingular elliptic curves over
binary fields. In Cryptology ePrint Archive, Report 2004/303. http://eprint.iacr.org/.
[112] R. Dutta, R. Barua, and P. Sarkar. Pairing-based cryptographic protocols : a survey.
Cryptology ePrint Archive, Report 2004/064. http://eprint.iacr.org/.
[113] H. Bar-El, H. Choukri, D. Naccache, M. Tunstall, and C. Whelan. The sorcerer’s
apprentice guide to fault attacks. In Cryptology ePrint Archive, Report 2004/100.
http://eprint.iacr.org/.
[114] B. Chevallier-Mames, M. Ciet, and M. Joye. Low-cost solutions for preventing simple
side-channel analysis: side-channel atomicity. IEEE Transactions on Computers, Vol. 53,
No. 6, pp. 760–768, 2004.
[115] E. Brier, I. Déchène, and M. Joye. Unified addition formulæ for elliptic curve
cryptosystems. Embedded cryptographic hardware: methodology and architectures, 2004.
[116] L. Goubin. A refined power-analysis attack on elliptic curve cryptosystems. PKC
2003, LNCS 2567, pp. 199-211, 2003.
[117] D. Hankerson, A. Menezes, and S. Vanstone. Guide to elliptic curve cryptography.
Springer, US, 2003.
[118] P. Fouque and F. Valette. The doubling attack – why upwards is better than down-
wards. CHES 2003, LNCS 2779, pp. 269–280, 2003.
[119] A. Satoh and K. Takano. A scalable dual-field elliptic curve cryptographic processor.
IEEE Transactions on Computers, Vol. 52, No. 4, pp. 449–460, 2003.
[120] S. B. Ors, L. Batina, B. Preneel, and J. Vandewalle. Hardware implementation of an
elliptic curve processor over GF(p). ASAP 2003, pp. 433–443, 2003.
[121] V. Bunimov and M. Schimmler. Area and time efficient modular multiplication of
large integers. ASAP 2003, 2003.
[122] K. Itoh, T. Izu, and M. Takenaka. A practical countermeasure against address-bit
differential power analysis. CHES 2003, LNCS 2779, pp. 382–396, 2003.
[123] L.S. Au and N. Burgess. Unified radix-4 multiplier for GF(p) and GF(2^n). ASAP
2003, pp. 1–11, 2003.
[124] S.B. Ors, L. Batina, B. Preneel, and J. Vandewalle. Hardware implementation of a
Montgomery modular multiplier in a systolic array. IPDPS ’03, pp. 184–186, 2003.
[125] M. Joye and S.M. Yen. The Montgomery powering ladder. CHES ’02, LNCS 2523,
pp. 291-302, 2003.
[126] I. Duursma and H. Lee. Tate pairing implementation for hyperelliptic curves
y^2 = x^p − x + d. Asiacrypt 2003, LNCS 2894, pp. 111-123, 2003.
[127] O. Billet and M. Joye. The Jacobi model of elliptic curve and side-channel
analysis. Applied Algebra, Algebraic Algorithms and Error-Correcting Codes 2003, LNCS
2643, pp. 34–42, 2003.
[128] J. Solinas. ID-based digital signature algorithms. 2003,
http://www.cacr.math.uwaterloo.ca/conferences/2003/ecc2003/solinas.pdf.
[129] D.R. Stinson. Cryptography: theory and practice, second edition. Chapman &
Hall/CRC, 2002.
[130] J.J. Quisquater and D. Samyde. Eddy current for magnetic analysis with active sen-
sor. Esmart 2002, pp. 185–194, 2002.
[131] E. Trichina and A. Bellezza. Implementation of elliptic curve cryptography with
built in countermeasures against side-channel attacks. CHES 2002, LNCS 2523, pp.
99-113, 2002.
[132] E. Brier and M. Joye. Weierstraß elliptic curves and side-channel attacks. PKC 2002,
LNCS 2274, pp. 335–345, 2002.
[133] M. Stam and A.K. Lenstra. Efficient subgroup exponentiation in quadratic and sixth
degree extensions. CHES 2002, LNCS 2523, pp. 318-332, 2002.
[134] G. Gaubatz. Versatile Montgomery multiplier architectures. Master's thesis, 2002.
Available at: http://www.wpi.edu/Pubs/ETD/Available/etd-0430102-120529/unrestricted/gaubatz.pdf
[135] J. Guajardo, T. Wollinger, and C. Paar. Area efficient GF(p) architectures for GF(p^m)
multipliers, 2002. Available at: http://www.wollinger.org/papers/Guajardo etal gfp
architectures.pdf
[136] H. Wu. Montgomery multiplier and squarer for a class of finite fields. IEEE Trans-
actions on Computers, Vol. 51, No. 5, 2002.
[137] Virtex-II Pro™ platform FPGA handbook. Xilinx Inc., San Jose, CA, 2002.
[138] P.S.L.M. Barreto, H. Kim, B. Lynn, and M. Scott. Efficient algorithms for pairing-
based cryptosystems. Crypto 2002, LNCS 2442, pp. 354-368, 2002.
[139] S.P. Skorobogatov and R.J. Anderson. Optical fault induction attacks. CHES 2002,
LNCS 2523, pp. 2-12, 2002.
[140] T. Izu and T. Takagi. A fast parallel elliptic curve multiplication resistant against side
channel attacks. PKC 2002, LNCS 2274, pp. 280–296, 2002.
[141] W. Fischer, C. Giraud, E.W. Knudsen, and J.P. Seifert. Parallel scalar multiplication
on general elliptic curves over Fp hedged against non-differential side-channel at-
tacks. Cryptology ePrint Archive, Report 2002/007. http://eprint.iacr.org/.
[142] G. Orlando and C. Paar. A scalable GF(p) elliptic curve processor architecture for
programmable hardware. CHES 2001, LNCS 2162, pp. 348–363, 2001.
[143] D. Boneh and M.K. Franklin. Identity-based encryption from the Weil pairing.
Crypto 2001, LNCS 2139, pp. 213-229, 2001.
[144] D. Boneh, B. Lynn, and H. Shacham. Short signatures from the Weil pairing. Asi-
acrypt 2001, LNCS 2248, pp. 514-532, 2001.
[145] A. Miyaji, M. Nakabayashi, and S. Takano. New explicit conditions on elliptic curve
traces for FR-reduction. IEICE Trans. Fundamentals, Vol. E84-A, No. 5, pp. 1234–1243,
2001.
[146] C. Clavier and M. Joye. Universal exponentiation algorithm. CHES 2001, LNCS
2162, pp. 300-308, 2001.
[147] D. May, H.L. Muller, and N. Smart. Random register renaming to foil DPA. CHES
2001, LNCS 2162, pp. 28-38, 2001.
[148] N.P. Smart. The Hessian form of an elliptic curve. CHES 2001, LNCS 2162, pp.
118-125, 2001.
[149] E. Oswald and M. Aigner. Randomized addition-subtraction chains as a countermea-
sure against power attacks. CHES 2001, LNCS 2162, pp. 39-50, 2001.
[150] M. Joye and C. Tymen. Protection against differential analysis for elliptic curve cryp-
tography. CHES 2001, LNCS 2162, pp. 377-390, 2001.
[151] M. Joye and J. Quisquater. Hessian elliptic curves and side channel attacks. CHES
2001, LNCS 2162, pp. 402-410, 2001.
[152] P. Liardet and N. Smart. Preventing SPA/DPA in ECC systems using the Jacobi form.
CHES 2001, LNCS 2162, pp. 391-401, 2001.
[153] B. Möller. Securing elliptic curve point multiplication against side-channel attacks.
ISC 2001, LNCS 2200, pp. 324–334, 2001.
[154] D. Boneh, R.A. DeMillo, and R.J. Lipton. On the importance of eliminating errors
in cryptographic computations. Journal of Cryptology, Vol. 14, No. 2, pp. 101-119,
2001. Extended abstract in Eurocrypt 1997.
[155] I. Blake, G. Seroussi, and N. Smart. Elliptic curves in cryptography. London Math-
ematical Society Lecture Note Series, Vol. 265, Cambridge University Press, 2000.
[156] K. Okeya, H. Kurumatani, and K. Sakurai. Elliptic curves with the Montgomery
form and their cryptographic applications. PKC 2000, LNCS 1751, pp. 238–257,
2000.
[157] K. Okeya and K. Sakurai. Power analysis breaks elliptic curve cryptosystems even
secure against the timing attack. Indocrypt 2000, LNCS 1977, pp. 217–314, 2000.
[158] E. Savas, A.F. Tenca, and C.K. Koc. A scalable and unified multiplier architecture
for finite fields GF(p) and GF(2^m). CHES 2000, LNCS 1965, pp. 281–296, 2000.
[159] T.S. Messerges. Using second-order power analysis to attack DPA resistant software.
CHES 2000, LNCS 1965, pp. 238–251, 2000.
[160] A. Joux. A one round protocol for tripartite Diffie-Hellman. ANTS 2000, LNCS
1838, pp. 385–394, 2000.
[161] I. Biehl, B. Meyer, and V. Müller. Differential fault analysis on elliptic curve
cryptosystems. Crypto 2000, LNCS 1880, pp. 131-146, 2000.
[162] J.S. Coron. Resistance against differential power analysis for elliptic curve cryp-
tosystems. CHES 1999, LNCS 1717, pp. 292-302, 1999.
[163] P. Kocher, J. Jaffe, and B. Jun. Differential power analysis. Crypto 1999, LNCS
1666, pp. 388–397, 1999.
[164] T.S. Messerges, E.A. Dabbish, and R.H. Sloan. Investigations of power analysis
attacks on smartcards. USENIX Workshop on Smartcard Technology, 1999.
[165] S. Chari, C.S. Jutla, J.R. Rao, and P. Rohatgi. Towards sound approaches to counter-
act power-analysis attacks. Crypto 1999, LNCS 1666, pp. 398–412, 1999.
[166] National Institute of Standards and Technology. Recommended elliptic curves for
federal government use. Available at:
http://www.csrc.nist.gov/groups/ST/toolkit/documents/dss/NISTReCur.pdf, 1999.
[167] J. López and R. Dahab. Fast multiplication on elliptic curves over GF(2^m) without
precomputation. CHES 1999, LNCS 1717, pp. 316-327, 1999.
[168] J.F. Dhem, F. Koeune, P.A. Leroux, P. Mestre, J.J. Quisquater, and J.L. Willems. A
practical implementation of the timing attack. CARDIS 1998, LNCS 1820, pp. 167–
182, 1998.
[169] S. Hauck, M.M. Hosler, and T.W. Fry. High-performance carry chains for FPGAs.
FPGA 1998, pp. 223–233, 1998.
[170] H. Cohen, A. Miyaji, and T. Ono. Efficient elliptic curve exponentiation using mixed
coordinates. Asiacrypt 1998, LNCS 1514, pp. 51–65, 1998.
[171] P.C. Kocher. Timing attacks on implementations of Diffie-Hellman, RSA, DSS and
other systems. Crypto 1996, LNCS 1109, pp. 104–113, 1996.
[172] N. Koblitz. CM curves with good cryptographic properties. Crypto 1991, LNCS 576,
pp. 279–287, 1991.
[173] P.L. Montgomery. Speeding the Pollard and elliptic curve methods of factorization.
Math. Comput., Vol. 48, pp. 243–264, 1987.
[174] N. Koblitz. Elliptic curve cryptosystems. Mathematics of Computation, Vol. 48, No.
177, pp. 203–209, 1987.
[175] V.S. Miller. Short programs for functions on curves. Unpublished manuscript, 1986.
[176] P.L. Montgomery. Modular multiplication without trial division. Math. Computa-
tion, Vol. 44, pp. 519–521, 1985.
[177] D. Chaum. Security without identification: Transaction systems to make Big Brother
obsolete. Comm. ACM, Vol. 28, pp. 1030–1044, 1985.
[178] V.S. Miller. Use of elliptic curves in cryptography. Crypto 1985, LNCS 218, pp.
417–426, 1985.
[179] K.R. Sloan. Comments on a computer algorithm for calculating the product A*B
modulo M. IEEE Transactions on Computers, Vol. C-34, No. 3, pp. 290–292, 1985.
[180] G.R. Blakley. A computer algorithm for calculating the product A*B modulo M.
IEEE Transactions on Computers, Vol. C-32, No. 5, pp. 497–500, 1983.
[181] R.P. Brent and H.T. Kung. A regular layout for parallel adders. IEEE Transactions
on Computers, Vol. C-31, No. 3, pp. 260–264, 1982.
[182] R.L. Rivest, A. Shamir, and L.M. Adleman. A method for obtaining digital signa-
tures and public-key cryptosystems. Commun. ACM, Vol. 21, No. 2, pp. 120–126,
1978.
[183] J. Pollard. Monte Carlo methods for index computation mod p. Math. Comp., Vol.
32, pp. 918-924, 1978.
Disseminations
Publications Directly Related to the Thesis
Refereed Journal
1. Santosh Ghosh, Debdeep Mukhopadhyay, and Dipanwita Roy Chowdhury. Petrel:
power and timing attack resistant elliptic curve scalar multiplier based on
programmable GF(p) arithmetic unit. IEEE Transactions on Circuits and Systems − I
(TCAS–I), Vol. 58, No. 9, September 2011.
2. Santosh Ghosh, Debdeep Mukhopadhyay, and Dipanwita Roy Chowdhury. Fault
attack and countermeasures on pairing based cryptography. International Journal of
Network Security, Vol. 12, No. 1, pp. 26-33, January 2011.
Refereed Conference
1. Santosh Ghosh, Debdeep Mukhopadhyay, and Dipanwita Roy Chowdhury. High
speed flexible pairing cryptoprocessor on FPGA platform. Fourth International Con-
ference on Pairing-based Cryptography − Pairing 2010, LNCS 6487, pp. 450–466,
Japan, December 2010.
2. Santosh Ghosh, Debdeep Mukhopadhyay, and Dipanwita Roy Chowdhury. High
speed Fp multipliers and adders on FPGA platform. Design and Architectures for
Signal and Image Processing − DASIP 2010, Edinburgh, Scotland, October 2010.
3. Santosh Ghosh and Dipanwita Roy Chowdhury. Configurable multicore processing
unit for elliptic curve cryptography. 12th VLSI Design and Test Symposium − VDAT
2008, Bangalore, India, July 2008.
Cryptology ePrint Archive
1. Santosh Ghosh, Debdeep Mukhopadhyay, and Dipanwita Roy Chowdhury. Secu-
rity of pairing cryptoprocessor against differential power attacks. Cryptology ePrint
Archive, Report 2011/181, http://eprint.iacr.org/.
Other Publications of the Author
Refereed Journal
1. Santosh Ghosh, Monjur Alam, Dipanwita Roy Chowdhury, and Indranil Sen Gupta.
Parallel crypto-devices for GF(p) elliptic curve multiplication resistant against side
channel attacks. Computers and Electrical Engineering, Elsevier, Vol. 35, pp. 329–
338, 2009.
2. Monjur Alam, Santosh Ghosh, M jagon Mohan, Debdeep Mukhopadhyay, Dipan-
wita Roy Chowdhury, and Indranil Sen Gupta. Effect of glitches against masked
AES S-box implementation and countermeasure. Information Security (IFS), IET,
Vol. 3, No. 1, pp. 34–44, 2009.
Refereed Conference
1. Santosh Ghosh, Dipanwita Roy Chowdhury, and Abhijit Das. High Speed Crypto-
processor for ηT Pairing on 128-bit Secure Supersingular Elliptic Curves over Char-
acteristic Two Fields. CHES 2011 (to appear).
2. Santosh Ghosh. Design and analysis of pairing based cryptographic hardware for
prime fields. PhD forum ISVLSI 2011, IEEE Xplore, pp. 1-2, Chennai, India, July
2011.
3. Santosh Ghosh. Design and analysis of pairing hardware on FPGA platform. PhD
forum DAC 2011, San Diego, US, June 2011.
4. Santosh Ghosh and Dipanwita Roy Chowdhury. Elliptic curve based multi-signature
scheme for multi-server systems. IEEE Tencon 2008, Hyderabad, India, November
2008.
5. Santosh Ghosh, Monjur Alam, Dipanwita Roy Chowdhury, and Indranil Sen Gupta.
A GF(p) Elliptic curve group operator resistant against side channel attacks. ACM
Great Lakes Symposium on VLSI − GLSVLSI 2008, Orlando, Florida, US, ACM,
pp. 53–58, May 2008.
6. Monjur Alam, Santosh Ghosh, Dipanwita Roy Chowdhury, and Indranil Sen Gupta.
Single chip encryptor/decryptor core implementation of AES algorithm. 21st In-
ternational Conference on VLSI Design − VLSID 2008, Hyderabad, India, IEEE
Computer Society, pp. 693–698, January 2008.
7. Santosh Ghosh, Monjur Alam, Kundan Kumar, Debdeep Mukhopadhyay, and Dipanwita
Roy Chowdhury. Preventing the side-channel leakage of masked AES S-Box. 15th
International Conference on Advanced Computing & Communication − ADCOM 2007,
IIT Guwahati, India, IEEE Computer Society, pp. 15–20, December 2007.
8. Avishek Saha and Santosh Ghosh. A speed-area optimization of full search block
matching hardware with applications in high-definition TVs (HDTV). High Perfor-
mance Computing − HiPC 2007, LNCS 4873, pp. 83–94, Goa, India, December
2007.
9. Santosh Ghosh, Monjur Alam, Dipanwita Roy Chowdhury, and Indranil Sen Gupta.
Effect of side channel attacks on RSA embedded devices. IEEE Tencon 2007, Taipei,
Taiwan, IEEE, pp. 1–4, November 2007.
10. Santosh Ghosh and Avishek Saha. Speed-area optimized FPGA implementation for
full search block matching. International Conference on Computer Design − ICCD
2007, IEEE Comp. Society, pp. 13–18, California, US, October 2007.
11. Santosh Ghosh, Monjur Alam, Indranil Sen Gupta, and Dipanwita Roy Chowdhury.
A robust GF(p) parallel arithmetic unit for public key cryptography. 10th EUROMICRO
Conference on Digital System Design - Architectures, Methods and Tools − DSD 2007,
Lübeck, Germany, IEEE Computer Society, pp. 109–117, August 2007.
12. Monjur Alam, Santosh Ghosh, Debdeep Mukhopadhyay, Dipanwita Roy Chowdhury,
and Indranil Sen Gupta. Latency optimized AES-Rijndael with flexible mode
of operation. 11th VLSI Design and Test Symposium − VDAT 2007, pp. 413–420,
Kolkata, India, August 2007.
13. Avishek Saha, Santosh Ghosh, Shamik Sural, and Jayanta Mukherjee. Toward
memory-efficient design of video encoders for multimedia applications. ISVLSI
2007, Porto Alegre, Brazil, IEEE Computer Society, pp. 453–454, May 2007.
14. Monjur Alam, Sonai Ray, Debdeep Mukhopadhyay, Santosh Ghosh, Dipanwita Roy
Chowdhury, and Indranil Sen Gupta. An area optimized reconfigurable encryptor for
AES-Rijndael. Design Automation and Test in Europe − DATE 2007, Nice, France,
IEEE Computer Society, pp. 1116–1121, April 2007.
National Conferences and Workshops
1. Santosh Ghosh, Dipanwita Roy Chowdhury, and Indranil Sen Gupta. The ECMQV
key agreement protocol on FPGA platform. National Workshop on Cryptology 2009,
Surat, India, August 2009.
2. Santosh Ghosh and Dipanwita Roy Chowdhury. A GF(2^163) elliptic curve
cryptographic processor unit. National Conference on Information Security-Issues and
Challenges (NCISIC 2008), Orissa, India, January 2008.
3. Santosh Ghosh, Dipanwita Roy Chowdhury, and Indranil Sen Gupta. Side channel
attacks on RSA and ECC crypto devices. National Workshop on Cryptology 2007,
Coimbatore, India, September 2007.
Curriculum Vitae
SANTOSH GHOSH, son of Mr. Sadhan Ghosh and Mrs. Ambika Ghosh, was born on
September 10, 1978 at Kshiragram, Burdwan, WB, India. After passing the Higher
Secondary (10+2) Examination from Burdwan Municipal High School, Burdwan, he joined
Haldia Institute of Technology (HIT) under Vidyasagar University to pursue a
Bachelor of Technology (B.Tech.) in Computer Science and Engineering (CSE), and
subsequently joined the CSE department of HIT as a lecturer in 2002. In 2006 he began
a Master of Science (M.S.) in the Department of Computer Science and Engineering
at the Indian Institute of Technology Kharagpur. After completing the Master's course
in 2008, he joined the same department as a PhD student in the same year. Since then,
his research has been broadly in the field of Cryptographic Hardware and Side-channel
Attacks, and the work presented in this thesis is the outcome of that effort. His other
research interests include VLSI Design and Testing.
Contact e-mail:
santosh[dot]ghosh[at]gmail[dot]com
santosh[at]cse[dot]iitkgp[dot]ernet[dot]in