Upload
others
View
21
Download
0
Embed Size (px)
Citation preview
Modular Exponentiation on ReconfigurableHardware
byThomas Blum
A ThesisSubmitted to the Faculty
of theWORCESTER POLYTECHNIC INSTITUTEIn partial fulfillment of the requirements for the
Degree of Master of Sciencein
Electrical Engineering
April 8th, 1999
Approved:
Prof. Christof Paar Prof. Yusuf LeblebiciECE Department ECE DepartmentThesis Advisor Thesis Committee
Prof. Fred J. Looft Prof. John OrrECE Department ECE Department HeadThesis Committee
Abstract
It is widely recognized that security issues will play a crucial role in the majority of futurecomputer and communication systems. A central tool for achieving system security arecryptographic algorithms. For performance as well as for physical security reasons, it isoften advantageous to realize cryptographic algorithms in hardware. In order to overcomethe well-known drawback of reduced flexibility that is associated with traditional ASIC so-lutions, this contribution proposes arithmetic architectures which are optimized for modernfield programmable gate arrays (FPGAs). The proposed architectures perform modularexponentiation with very long integers. This operation is at the heart of many practi-cal public-key algorithms such as RSA and discrete logarithm schemes. We combine twoversions of Montgomery modular multiplication algorithm with new systolic array designswhich are well suited for FPGA realizations. The first one is based on a radix of two and iscapable of processing a variable number of bits per array cell leading to a low cost design.The second design uses a radix of sixteen, resulting in a speed–up of a factor three at thecost of more used resources. The designs are flexible, allowing any choice of operand andmodulus.
Unlike previous approaches, we systematically implement and compare several versionsof our new architecture for different bit lengths. We provide absolute area and timingmeasures for each architecture on Xilinx XC4000 series FPGAs. As a first practical resultwe show that it is possible to implement modular exponentiation at secure bit lengths ona single commercially available FPGA. Secondly we present faster processing times thanpreviously reported. The Diffie–Hellman key exchange scheme with a modulus of 1024bits and an exponent of 160 bits is computed in 1.9 ms. Our fastest design computes a1024 bit RSA decryption in 3.1 ms when the Chinese remainder theorem is applied. Thesetimes are more than ten times faster than any reported software implementation. They alsooutperform most of the hardware–implementations presented in technical literature.
ii
Contents
1 Introduction 11.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Thesis Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Previous Work 42.1 Montgomery Reduction and Redundant Representation . . . . . . . . . . . 42.2 Montgomery Reduction and Systolic Arrays . . . . . . . . . . . . . . . . . . 62.3 Other Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.4 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3 Preliminaries: Public–Key Algorithms 93.1 RSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93.2 Algorithms Based on the Discrete Logarithm Problem in Finite Fields . . . 103.3 Elliptic Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4 Preliminaries: Modular Exponentiation 144.1 Square & Multiply Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 144.2 Montgomery Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154.3 Montgomery Multiplication for Radix Two . . . . . . . . . . . . . . . . . . 184.4 High–Radix Montgomery Algorithm . . . . . . . . . . . . . . . . . . . . . . 20
5 General Design Considerations 235.1 Xilinx XC4000 Series FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.1.1 Configurable Logic Blocks . . . . . . . . . . . . . . . . . . . . . . . . 235.1.2 Routing Topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245.1.3 Special Features of the XC4000 Family . . . . . . . . . . . . . . . . 265.1.4 Cost and Speed Evaluation . . . . . . . . . . . . . . . . . . . . . . . 27
5.2 Architectures Suitable for FPGAs . . . . . . . . . . . . . . . . . . . . . . . . 275.2.1 Systolic Array vs. Redundant Representation . . . . . . . . . . . . . 275.2.2 State Machine and Storage Elements . . . . . . . . . . . . . . . . . . 31
iii
6 Design 1: A Resource Efficient Architecture 326.1 Design Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326.2 Processing Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346.3 Modular Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376.4 Modular Exponentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.4.1 Data Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396.4.2 Function Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
7 Design 2: A Speed Efficient Architecture 467.1 Design Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467.2 Processing Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497.3 Modular Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517.4 Modular Exponentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
7.4.1 Data Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547.4.2 Function Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
8 Methodology 598.1 Xilinx Synopsys Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . 608.2 Simulation and Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . 608.3 Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 618.4 Place and Route . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
9 Results 639.1 Design 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639.2 Design 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 659.3 Application to RSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 679.4 Application to Algorithms Based on the Discrete Logarithm Problem . . . . 689.5 Application to Elliptic Curves . . . . . . . . . . . . . . . . . . . . . . . . . . 68
10 Comparison and Outlook 72
A Test Bench Sample 75
B Synosys Script 83
C Simulation Results 89C.1 Processing Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89C.2 Systolic Array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94C.3 Modular Exponentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
iv
List of Tables
5.1 Routing per CLB in XC4000XL and XC4000XV devices [38] . . . . . . . . . 265.2 Amount of CLBs used for RAM and DP RAM blocks on XC4000 devices . 27
9.1 Design 1: CLB usage, minimal clock cycle time, and time–area product ofmodular exponentiation architectures on Xilinx FPGAs . . . . . . . . . . . 63
9.2 Design 1: CLB usage and execution time for a full modular exponentiation 659.3 Design 2: CLB usage, minimal clock cycle time, and time–area product of
modular exponentiation architectures on Xilinx FPGAs . . . . . . . . . . . 659.4 Design2: CLB usage and execution time for a full modular exponentiation 669.5 Application to RSA: Encryption . . . . . . . . . . . . . . . . . . . . . . . . 679.6 Application to RSA: Decryption . . . . . . . . . . . . . . . . . . . . . . . . 679.7 CLB usage and execution time for algorithms based on the DL–problem . . 689.8 Operations for a point addition . . . . . . . . . . . . . . . . . . . . . . . . . 699.9 Operations for a Doubling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 709.10 Application to Elliptic Curves: Execution time for point addition and dou-
bling (160 bits) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 719.11 Application to Elliptic Curves: Execution time for a general point multipli-
cation (160 bits) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
v
List of Figures
5.1 XC4000 CLB Structure [38] . . . . . . . . . . . . . . . . . . . . . . . . . . . 245.2 Xilinx FPGA structure [38] . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
6.1 Processing element (unit) of Design 1 that computes Si+1 = (Si+qi ·M)/2+ai ·B, qi, ai ∈ {0, 1} . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6.2 Systolic Array for modular multiplication . . . . . . . . . . . . . . . . . . . 376.3 Design for a modular exponentiation . . . . . . . . . . . . . . . . . . . . . . 406.4 DP RAM Z Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426.5 Exp RAM Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446.6 Prec RAM Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
7.1 Processing Element (unit) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497.2 Systolic array for modular multiplication . . . . . . . . . . . . . . . . . . . . 527.3 Design for a modular exponentiation . . . . . . . . . . . . . . . . . . . . . . 547.4 DP RAM Z Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
8.1 Design flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
C.1 Processing Element: Loading of the pre-computation factor and calculationof its multiples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
C.2 Processing Element: Computation of two modular multiplications (first cycles) 92C.3 Processing Element: Computation of two modular multiplications (last cycles). 93C.4 Systolic Array: Beginning of the pre-computation . . . . . . . . . . . . . . . 95C.5 Systolic Array: End of the pre-computation in units 0,1,2 . . . . . . . . . . 96C.6 Systolic Array: End of the pre-computation in units 41,42,43 . . . . . . . . 97C.7 Modular Exponentiation: Beginning of pre-computation . . . . . . . . . . . 100C.8 Modular Exponentiation: End of pre-computation and beginning of Z1, P1
calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101C.9 Modular Exponentiation: Loading the exponent and computation of a mod-
ular exponentiation with a 19–bit exponent . . . . . . . . . . . . . . . . . . 102
vi
Chapter 1
Introduction
1.1 Motivation
It is widely recognized that security issues will play a crucial role in many future
computer and communication systems. A central tool for achieving system security
is cryptography. For performance as well as for physical security reasons it is often
required to realize cryptographic algorithms in hardware. Traditional ASIC solutions,
however, have the well-known drawback of reduced flexibility compared to software
solutions. Since modern security protocols are increasingly defined to be algorithm
independent, a high degree of flexibility with respect to the cryptographic algorithms
is desirable. A promising solution which combines high flexibility with the speed and
physical security of traditional hardware is the implementation of cryptographic algo-
rithms on reconfigurable devices such as FPGAs and EPLDs. In the case of public-key
schemes, algorithm independence can mean not only a change of the actual crypto-
graphic algorithm but also a change of parameters such as bit length, modulus, or
exponents. One application, dealt with in this report, includes arithmetic architec-
tures for modular exponentiation with very long integers which is at the heart of most
modern public-key schemes. Most notably, both RSA and discrete logarithm-based
1
CHAPTER 1. INTRODUCTION 2
(e.g., Diffie-Hellman key exchange or the Digital Signature Algorithm, DSA) schemes
require modular long number exponentiation.
The challenge at hand is to design arithmetic architectures for operands with up to
1024 bits on current FPGAs. The very long word lengths prohibit the application of
many proposed architectures as they would result in unrealistically large resource re-
quirements. In this thesis we derive two modular exponentiation architectures which
combine Montgomery’s modular reduction scheme and novel systolic array architec-
tures. The systolic array architecture requires considerably fewer logic resources than
many other systolic array architectures for modular arithmetic. This is crucial, as
one of our goals was to derive solutions that can fit into a single FPGA, a design
goal that has many cost and design advantages over multi–FPGA solutions. Another
important objective was to systematically implement various architecture options for
different bit lengths and compare performance and resource usage.
1.2 Thesis Goals
Based on the general considerations in the previous section we defined the following
goals for the thesis research:
1. Implementation of a 1024–bit modular exponentiation architecture in a single
commercially available FPGA device. 1024 bits is the recommended bit size
for RSA and discrete logarithm systems and thus highly relevant for practical
applications. The computation time of our architecture should be close to a
previously reported architecture that used 16 FPGAs [33].
2. It should be investigated which of the two proposed general architecture op-
tions, based on systolic arrays and a redundant representation, is best suited
for modular exponentiation architectures on FPGAs.
CHAPTER 1. INTRODUCTION 3
3. To find an optimal resource usage and computation time trade–off for the FPGA
architectures that will be designed.
4. A resource efficient FPGA design should be developed which allows the imple-
mentation of a 1024 bit architecture at moderate costs.
5. It should be investigated weather the high–radix Montgomery modular mul-
tiplication algorithm proposed in [9] can be used for modular exponentiation
architectures on FPGAs.
6. Develop and implement a design that is considerably faster than any previously
reported FPGA architecture and reaches speeds similar to the fastest design
reported in technical literature [27].
1.3 Thesis Outline
This thesis is structured as follows. In Chapter 2, we summarize some of the previous
work on modular exponentiation. Chapter 3 describes three families of algorithms in
public–key cryptography, and the modular arithmetic needed for their implementa-
tion. Chapter 4 describes algorithms for modular exponentiation and multiplication
and some simplifications and speed-ups for their hardware implementation. In Sec-
tion 5 we summarize some of the relevant features of the Xilinx XC4000 FPGA series.
Based on these features we derive some characteristics for our architectures. Chap-
ter 6 outlines our architecture for modular exponentiation, optimized for low resource
usage. Chapter 7 describes an architecture optimized for speed. Chapter 8 describes
our methodology and tools that were used for this research. Chapter 9 posts the tim-
ing and area results obtained. A comparison to other architectures and an outlook
conclude this thesis.
Chapter 2
Previous Work
In the following, we will summarize relevant previous work in the field of modular
multiplication. Most presented approaches are based on an algorithm proposed by
Peter Montgomery in 1985 [19], either in conjunction with a redundant number rep-
resentation or in a systolic array architecture. Solutions using other algorithms have
also been presented.
2.1 Montgomery Reduction and Redundant Rep-
resentation
Applying Montgomery’s algorithm, the cost of a modular exponentiation is reduced
to a series of additions of very long integers. To avoid the carry propagation in
multiplication/addition architectures several solutions have been proposed in the lit-
erature. They either use Montgomery’s algorithm, in combination with a redundant
radix number system [26, 33, 7, 9, 36] or a Residue Number System [2].
In [7] Montgomery’s modular multiplication algorithm is adapted for an efficient
hardware implementation. A gain in speed results from a faster clock, due to sim-
pler combinatorial logic. Compared to previous techniques based on Brickell’s Algo-
4
CHAPTER 2. PREVIOUS WORK 5
rithm [4], a speed-up factor of two is reported.
The Research Laboratory of Digital Equipment Corp. in Paris implemented mod-
ular exponentiation architectures on FPGAs [33, 26]. They utilized an array of 16
XILINX 3090 FPGAs. Compared to XILINX 4000 series in terms of flip–flops, this
is equivalent to a chip with 5100 configurable logic blocks (CLBs). In terms of logic
resources this is equivalent to a chip of 4000 CLBs. In their work they used sev-
eral speed-up methods [26] including the Chinese remainder theorem, asynchronous
carry completion adder, and a windowing exponentiation method. The implemen-
tation computes a 970bit RSA decryption at a rate of 185kb/s (5.2ms per 970 bit
decryption) and a 512 bit RSA decryption in excess of 300 kb/s (1.7ms per 512 bit
decryption). A drawback of this solution is that the binary representation of the
modulus is hardwired into the logic representation so that the architecture has to be
reconfigured with every new modulus.
The problem of using high radices in Montgomery’s modular multiplication al-
gorithm is the more complex determination of the quotient. This behavior made
a pipelined execution of the algorithm impossible. Reference [9] rewrites the algo-
rithm and avoids thereby any operation involved in the quotient determination. The
necessary pre–computation has to be done only once for a given modulus.
Reference [36] proposes a novel VLSI architecture for Montgomery’s modular mul-
tiplication algorithm. The critical path that determines the clock speed is pipelined.
This is done by interleaving each iteration of the algorithm. Compared to previous
propositions, an improvement of the time–area product of a factor two is reported.
Reference [2] describes a new approach using a Residue Number System (RNS).
The algorithm is implemented with n moduli in the RNS on n reasonably simple
processors. The resulting processing time is O(n).
CHAPTER 2. PREVIOUS WORK 6
2.2 Montgomery Reduction and Systolic Arrays
There have been a number of proposals for systolic array architectures for modular
arithmetic. However, no implementations have been reported to our knowledge. In [8]
a VLSI solution is presented where a modular multiplication is calculated in (4m +
1) · 3m/2 clock cycles, where m is the number of bits of the modulus. That is
approximately four times more cycles than in a conventional solution. In terms of
resources, this design would be suitable for FPGA.
Similar two-dimensional systolic arrays are presented in [10, 35, 36]. For a radix
of two they all propose an m × m matrix of one bit processing elements. With this
configuration 2m modular multiplications are calculated at the same time and the
theoretical throughput is one modular multiplication per clock cycle. In terms of
resources, such a solution is not feasible in either VLSI or FPGA for the bit length
required in public-key algorithms. Even implementing only one row of processing
elements, (resulting in m times slower throughput) into presently available FPGAs is
difficult in terms of resources.
In reference [30] a linear systolic array was obtained by systematically mapping a
two-dimensional graph model onto a one-dimensional systolic array.
Reference [15] describes an architecture based on one row of processing elements
and a radix of two. Squarings and multiplications are computed in parallel. The
system requires n systolic processing elements for an n–bit modular exponentiation,
and the resulting execution time is 2n2 clock cycles.
2.3 Other Work
References [4, 25, 32, 31] describe different algorithms for modular multiplication
avoiding costly division. Reference [3] compares these algorithms. An overview of
previously presented architectures for VLSI implementations and their underlying
CHAPTER 2. PREVIOUS WORK 7
algorithms for modular integer arithmetic is also provided in this contribution. Ref-
erence [5] summarizes the chips available in 1990 for performing RSA encryption.
In Reference [34] a generalization of [4] is presented and some conclusions are
drawn about the choice of the radix.
Reference [29] proposes a radix–4 hardware algorithm. A redundant number rep-
resentation is used and the propagation of carries in additions is therefore avoided.
A processing speed–up of about six times compared to previous work is reported.
More recently an approach [39] has been presented that utilizes pre-computed
complements of the modulus and is based on the iterative Horner’s rule. Compared
to Montgomery’s algorithms these approaches use the most significant bits of an inter-
mediate result to decide which multiples of the modulus to subtract. The drawback of
these solutions is that they either need a large amount of storage space or many clock
cycles to complete a modular multiplication. The authors attempted to overcome
the later problem by a higher clock frequency which is possible due to a simplified
modulo reduction operation.
2.4 Implementations
To our knowledge, the fastest reported software implementation of modular exponen-
tiation [37] computes RSA decryption with a 1024–bit modulus in 43 ms.
In Reference [23] a table with several VLSI hardware implementations for RSA is
published. The fastest chip computes RSA decryption with a 512–bit modulus in 8
ms. These chips are somewhat dated, though. More recently an ASIC implementa-
tion has been reported [16] that computes RSA decryption with a 1024–bit modulus
in 150 ms. However, the author claims in [27] that 1024–bit exponentiation architec-
tures with 10 ms computation time are available. This time corresponds to an RSA
computation time of 2.5 ms if the Chinese remainder theorem is used for speeding–
CHAPTER 2. PREVIOUS WORK 8
up the computation. Only one FPGA implementation of RSA has been reported in
technical literature so far [33]. 970-bit RSA decryption is computed in 5.2 ms in this
approach. A detailed comparison of the modular exponentiation architectures that
we develop in this thesis with the previously reported implementations will be given
in Chapter 9.
Chapter 3
Preliminaries: Public–Key
Algorithms
In this chapter we review the three most popular families of public key algorithms.
Information on secure key length is given as well as speed-up methods proposed in
the literature. We will show that all algorithm families are based on modular long
number arithmetic.
3.1 RSA
RSA was proposed by Rivest, Shamir and Adleman [21] in 1978. The private key of a
user consists of two large primes p and q and an exponent D. The public key consists
of the modulus M = p × q, M =∑m−1i=0 mi2
i, mi ∈ {0, 1} and an exponent E such
that E = D−1 mod (p − 1)(q − 1), E =∑n−1i=0 ei2
i, ei ∈ {0, 1}. In the remainder of
this thesis we assume that E can be represented by n bits, and M can be represented
by m digits. To encrypt a message X the user computes:
Y = XE modM
9
CHAPTER 3. PRELIMINARIES: PUBLIC–KEY ALGORITHMS 10
Decryption is done by calculating:
X = Y D modM
The identical operations are used for the RSA digital signature scheme. In order to
thwart currently known attacks, the modulus M and thus X and Y should have a
length of 768 – 1024 bits. Both encryption and decryption require algorithms for
computing a modular exponentiation.
For speeding up encryption the use of a short exponent E has been proposed [13].
Recommended by the International Telecommunications Union ITU is the the Fermat
prime F4 = 216+1. Using F4, the encryption is executed in only 17 operations. Other
short exponents proposed include E = 3 and E = 17.
Obviously the same trick can not be used for decryption, as the decryption expo-
nent D must be kept secret. But using the knowledge of the factors of M = q × p,
the Chinese Remainder Theorem [20] can be applied by the decrypting party. Two
m/2 size modular exponentiations and an additional recombination instead of one
m size modular exponentiation are computed in this case. Each modular exponen-
tiation of length m/2 takes 1/4 of the time required for an m – bit exponentiation
(see Chapter 4). If both exponentiations are performed serially, an over–all speed–up
factor of two is achieved. If they are performed in parallel, a speed–up factor of four
is achieved.
3.2 Algorithms Based on the Discrete Logarithm
Problem in Finite Fields
The best known public–key schemes based on the discrete logarithm problem in finite
fields are the Diffie-Hellman key exchange scheme, the Digital Signature Algorithm
(DSA) and the ElGamal encryption scheme (see, e.g., [17]). As an example, we
CHAPTER 3. PRELIMINARIES: PUBLIC–KEY ALGORITHMS 11
present below the Diffie-Hellman key exchange scheme, proposed in 1976 by W. Diffie
and M.E. Hellman [6].
The goal of this protocol is to establish a secret session key between to parties
over an insecure channel. The two parties, Alice and Bob, want to establish a secret
key without Oscar, the adversary, being able to compute this key. During the setup
phase Alice and Bob obtain the public parameters p and α. Parameter p is a large
prime and α a primitive element in Z∗p or a subgroup of Z∗p .
The algorithm proceeds as follows:
1a) Alice generates a random key: 1b) Bob generates a random key:
aA ∈ {2, 3, . . .p− 1} (private) aB ∈ {2, 3, . . .p− 1} (private)
2a) Alice computes her public key: 2b) Bob computes his public key:
βA = αaA mod p (public) βB = αaB mod p (public)
3a) Alice sends βA to BobβA−→βB←− 3b) Bob send βB to Alice
4a) Alice computes: 4b) Bob computes:
Ks = βaAB = (αaB)aA = αaB·aA mod p Ks = (αaA)aB = αaA·aB mod p
After the final stage of the algorithm, Alice and Bob share a session keyKs. Oscar
cannot regenerate the session key from the public parameters α, βA, and βB because
the two random integers aA and aB, generated by Alice and Bob are private and were
never transmitted over the insecure channel.
The computational complexity of the algorithm lies in steps 2 and 4, the com-
putation of a modular exponentiation. The index–calculus method is the currently
best known attack against discrete logarithm-based schemes. In order to thwart this
attack, the modulus p and thus α should have a length of 768–1024 bits, and even
longer bit lengths are recommended for highly sensitive applications. If α generates
a subgroup of order n, the exponents aA, aB can be restricted to 0 < aA,aB < n. In
CHAPTER 3. PRELIMINARIES: PUBLIC–KEY ALGORITHMS 12
practice, a 160 bit exponent can be used with moduli up to 1024 bit.
3.3 Elliptic Curves
Elliptic Curve public–key cryptosystems were proposed independently in 1986/1987
by Victor Miller [18] and Neil Koblitz [14]. We restrict ourselves in the following to
curves over prime fields, as opposed to curves over extension fields such as GF (2m).
An elliptic curve is a set of all pairs (x,y), x, y ∈ Zp, that fulfill the equation:
y2 ≡ x3 + ax+ b mod p
To perform an addition of two points P1 = (x1,y1), P2 = (x2,y2), P3 = P1 + P2 =
(x3,y3) we need to compute the following equations:
x3 = λ2 − x1 − x2
y3 = λ · (x1 − x3)− y1
λ =
y2−y1x2−x1
mod p, if P1 = P2 (addition)3x21+a
2y1mod p, if P1 = P2 (doubling)
The complexity of this operation is two multiplications and one inversion (point–
addition) or three multiplications and one inversion (point–doubling), if we ignore
additions and subtractions. The inversion is very costly to implement. To optimize
the addition of two points by avoiding the inversion, the use of projective coordinates
has been proposed. A projective point (X,Y ,Z) in the projective plane can be iden-
tified with a point (x,y) in the affine plane. The homogeneous elliptic curve is a set
of all points (X,Y ,Z) that fulfill the equation:
ZY 2 ≡ X3 + aXZ2 + bZ3 mod p
The addition formulae are now [12]:
CHAPTER 3. PRELIMINARIES: PUBLIC–KEY ALGORITHMS 13
addition: X3 = V A
Y3 = U(V 2X1Z2 −A)− V 3Y1Z2
Z3 = V 3Z1Z2
where U = Y2Z1 − Y1Z2, V = X2Z1 −X1Z2, A = U2Z1Z2 − V 2T , T = X2Z1 +X1Z2
doubling: X3 = 2SH
Y3 = W (4F −H)− 8E2
Z3 = 8S3
where S = Y1Z1, W = 3X21 + aZ21 , E = Y1S, F = X1E, H = W 2 − 8F
To perform an elliptic curve projective space addition we have to compute 15 multi-
plications, and 12 multiplications are needed for a doubling operation.
A method similar to Algorithm 4.1 combines additons and doublings to a general
point multiplication, e · P , that is, addition of the point P e–times to itself. Point
multiplication is the core operation in elliptic curve public key crypto–systems. If we
use a modulus and operands of lengthm+1 bits, m doublings and an average of m/2
additions have to be executed.
The currently best known attack against elliptic curve public key crypto–systems
uses the Silver–Pohlig–Hellmann algorithm [24] together with Pollard’s rho method.
In order to thwart this attack, the modulus p and thus X, Y , and Z should have
a length of at least 160 bits. We note that this operand bit length is considerably
shorter than in the case of RSA or DL schemes.
Chapter 4
Preliminaries: Modular
Exponentiation
In this chapter we review the square & multiply algorithm, which is the most popular
algorithm for modular exponentiation. Secondly we develop versions of Montgomery’s
modular multiplication algorithm, which are well suited for hardware implementa-
tions.
4.1 Square & Multiply Algorithm
The public–key schemes described in Chapter 3 are based on modular exponentiation
or repeated point addition. Both operations are in their most basic forms done by
the square and multiply algorithm [13].
Algorithm 4.1 compute Z = XE modM , where E =∑n−1i=0 ei2
i, ei ∈ {0, 1}
1. Z = X
2. FOR i = n− 2 down to 0 DO
3. Z = Z2 modM
4. IF ei = 1 THEN Z = Z ·X mod M
14
CHAPTER 4. PRELIMINARIES: MODULAR EXPONENTIATION 15
5. END FOR
Algorithm 4.1 takes 2(n−1) operations in the worst case and 1.5(n−1) on average. To
compute a squaring and a multiplication in parallel we can use the following version
of the square & multiply algorithm [36]:
Algorithm 4.2 computes P = XE mod M , where E =∑n−1i=0 ei2
i, ei ∈ {0, 1}
1. P0 = 1, Z0 = X
2. FOR i = 0 to n− 1 DO
3. Zi+1 = Z2i modM
4. IF ei = 1 THEN Pi+1 = Pi · Zi modM
ELSE Pi+1 = Pi
5. END FOR
Algorithm 4.2 takes 2n operations in the worst case and 1.5n on average. A speed–
up can be achieved by applying the l – ary method [13] which is a generalization
of Algorithm 4.1. The l – ary method processes l exponent bits at the time. The
drawback here is that (2l − 2) multiples of X have to be precomputed and stored. A
reduction to 2l−1 pre–computations is possible. The resulting complexity is roughly
n/l multiplications and n squaring operations.
4.2 Montgomery Reduction
As shown in the previous section, modular exponentiation is reduced to a series of
modular multiplications and squaring steps. The algorithm for modular multiplica-
tion described below has been proposed by P. L. Montgomery in 1985 [19]. It is a
method for multiplying two integers modulo M , while avoiding division by M . The
idea is to transform the integers in m-residues and compute the multiplication with
these m-residues. Finally we transform back to the normal representation. This ap-
proach is only beneficial if we compute a series of multiplications in the transform
domain (e.g., modular exponentiation).
CHAPTER 4. PRELIMINARIES: MODULAR EXPONENTIATION 16
To compute the Montgomery multiplication, we chose a radix R > M , with
gcd(M,R) = 1. Division by R has to be inexpensive, thus an optimal choice is
R = 2m if we assume that M =∑m−1i=0 mi2
i. The m-residue of x is xR modM .
We also compute M ′ = −M−1 mod R. Now we define a function MRED(T) that
computes TR−1 modM : This function computes the normal representation of T ,
given T is an m-residue.
Algorithm 4.3 MRED(T): computes a Montgomery reduction of T
T < RM , R = 2m, M =∑m−1i=0 mi2
i, gcd(M,R) = 1
1. U = TM ′ mod R
2. t = (T + UM)/R
3. IF t ≥M RETURN t−M
ELSE RETURN t
The result of MRED(T) is t = TR−1 mod M . For the proof of this equation,
see [19].
Now we consider a multiplication of two integers a and b in the transform domain,
where their respective representations are (aR modM) and (bR mod M). To acquire
the result (abR mod M) we feed their product into MRED(T):
MRED((aR modM) · (bR mod M)) = abR2R−1 = abR mod M
For a modular exponentiation we can repeat this step numerous times according to
Algorithm 4.1 or 4.2 to get the final result ZR mod M (Algorithm 4.1) or PnR mod M
(Algorithm 4.2). We finally feed one of these values into MRED(T) to get the result
Z modM or Pn modM .
The initial transform step still requires costly modular reductions. To avoid the
division involved, we can take the following approach. First we compute R2 mod M
using division. This step needs to be done only once for a given cryptosystem. To get
a and b in the transform domain we run MRED(a ·R2 modM) and MRED(b ·R2 mod
CHAPTER 4. PRELIMINARIES: MODULAR EXPONENTIATION 17
M) to get aR mod M and bR modM . Obviously, any variable can be transformed
in this manner.
We now consider a hardware implementation of Algorithm 4.3: To compute step
2 we need an m × m–bit multiplication and a 2m–bit addition. The intermediate
result can have as many as 2m bits. Instead of computing U at once, we can compute
one digit of an r–radix representation at a time. We have to chose a radix r, such
that gcd(M, r) = 1 [28]. Division by r has to be inexpensive, thus an optimal choice
is r = 2k. All variables are now represented in a basis–r representation. Another
improvement is to include the multiplication A×B in the algorithm.
Algorithm 4.4 [7] Montgomery Modular Multiplication for computing A·B mod M ,
where M =∑m−1i=0 (2
k)imi, mi ∈ {0, 1 . . . 2k − 1};
B =∑m−1i=0 (2
k)ibi, bi ∈ {0, 1 . . . 2k − 1};
A =∑m−1i=0 (2
k)iai, ai ∈ {0, 1 . . . 2k − 1};
A,B < M ; M < R = 2km; M ′ = −M−1 mod 2k; gcd(2k,M) = 1
1. S0 = 0
2. FOR i = 0 to m− 1 DO
3. qi = (((Si + aiB) mod 2k)M ′) mod 2k
4. Si+1 = (Si + qiM + aiB)/2k
5. END FOR
6. IF Sm ≥M RETURN Sm −M
ELSE RETURN Sm
The output of Algorithm 4.4 is Sm = ABR−1 mod M . Considering a radix r = 2k,
we need at most two k× k– bit multiplications and a k–bit addition to compute step
3. For step 4 two k×m– bit multiplications and two m+ k–bit additions are needed.
The maximal bit length of S is reduced to m+k+2 bits, compared to the 2m bits of
Algorithm 4.3. In Section 4.3 we review further improvements of Algorithm 4.4 for
the case of r = 2. Section 4.4 treats the algorithm for larger radix.
CHAPTER 4. PRELIMINARIES: MODULAR EXPONENTIATION 18
4.3 Montgomery Multiplication for Radix Two
Algorithm 4.5 is a simplification of Algorithm 4.4 for radix r = 2. For the radix
r = 2, the operations in step 3 of Algorithm 4.4 are done modulo 2. The modulus
M must be odd due to the condition gcd(M, 2k) = 1. It follows immediately that
M ≡ 1 mod 2. Hence M ′ ≡ −M−1 mod 2 also degenerates to M ′ = 1. Thus the
multiplication by M ′ mod 2 in step 3 can be omitted.
Algorithm 4.5 [7] Montgomery Modular Multiplication (Radix r = 2) for computing
A ·B modM , where M =∑m−1i=0 2
imi, mi ∈ {0, 1};
B =∑m−1i=0 2
ibi, bi ∈ {0, 1};
A =∑m−1i=0 2
iai, ai ∈ {0, 1};
A,B < M ; M < R = 2m; gcd(2,M) = 1
1. S0 = 0
2. FOR i = 0 to m− 1 DO
3. qi = (Si + aiB) mod 2
4. Si+1 = (Si + qiM + aiB)/2
5. END FOR
6. IF Sm ≥M RETURN Sm −M
ELSE RETURN Sm
The final comparison and subtraction in step 6 of Algorithm 4.5 would be costly
to implement, as an m bit comparision is very slow or expensive in terms of resource
usage . It would also make a pipelined execution of the algorithm impossible. It can
easily be verified that Si+1 < 2M always holds if A, B < M . Sm, however, can not
be reused as input A or B for the next modular multiplication. If we perform two
more executions of the for loop with am+1 = 0 and inputs A, B < 2M , the inequality
Sm+2 < 2M is satisfied. Now, Sm+2 can be used as input B for the next modular
multiplication. We just allow S to have two more bits for intermediate results.
To further reduce the complexity of Algorithm 4.5, B can be shifted up by one
position, i.e., multiplied by two [7]. This results in ai ·B mod 2 = 0 and the addition
CHAPTER 4. PRELIMINARIES: MODULAR EXPONENTIATION 19
in step 3 is avoided. In the update of Si+1 we replace (Si + qiM + aiB)/2 by (Si +
qiM)/2 + aiB. The cost of this simplification is one more execution of the loop with
am+2 = 0. The algorithm below comprises the just mentioned optimizations.
Algorithm 4.6 [7] MONT R2(A,B): Montgomery Modular Multiplication (Radix
r = 2) for computing A ·B modM , where M =∑m−1i=0 2
imi, mi ∈ {0, 1};
B =∑mi=0 2
ibi, bi ∈ {0, 1};
A =∑m+2i=0 2
iai, ai ∈ {0, 1}, am+1 = 0, am+2 = 0;
A,B < 2M , M < R = 2m+2; gcd(2,M) = 1
1. S0 = 0
2. FOR i = 0 to m+ 2 DO
3. qi = Si mod 2
4. Si+1 = (Si + qi ·M)/2 + ai ·B
5. END FOR
The algorithm above calculates Sm+3 = (2−(m+2)AB) modM . To get the cor-
rect result we need an extra Montgomery modular multiplication by 22(m+2) modM .
However, if further multiplications are required as in exponentiation algorithms, it is
better to pre–multiply all inputs by the factor 22(m+2) modM . Thus every interme-
diate result carries a factor 2m+2. We just need to Montgomery multiply the result
by “1” to eliminate that factor.
The final Montgomery multiplication with “1” insures that our final result is
smaller than M . Consider Algorithm 4.6 with B < 2M and A = (0, . . . , 0, 1). We
will get S1 = a0 · B = B < 2M . As all remaining ai = 0, we get at most Si+1 =
(Si +M)/2 → M . If only one qi = 0 (i = 1, 2 . . .m + 2), then Si+1 = Si/2 < M
(probability: 1− 2−(m+2)).
The computational complexity of Algorithm 4.6 lies in the two additions of m
bit operands for computing Si+1. Recall that m ≈ 160 − 1024 is of great interest in
public–key algorithms. As the propagation of m carries is too slow and an equivalent
CHAPTER 4. PRELIMINARIES: MODULAR EXPONENTIATION 20
carry look ahead logic requires to many resources, two different strategies have been
pursued in literature:
1. Redundant representation: The intermediate results are kept in redundant form.
Resolution into binary representation is only done at the very end and for feeding
the intermediate result back as ai in Algorithm 4.6.
2. Systolic Arrays: Typically m processing units calculate 1 bit per clock cycle.
The computed carries, qi and ai are “pumped” through the processing units.
As these signals have to be distributed only between adjacent processing units,
a faster clock speed and a resulting higher throughput should be possible. The
cost is a higher latency and possibly more resources.
4.4 High–Radix Montgomery Algorithm
The goal of this section is to improve Algorithm 4.4 to make it suitable for a hardware
implementation. At first we avoid the costly comparison and subtraction of step 6.
The output Sm has to be small enough to be fed back in the algorithm as A or B.
We change the conditions to 4M < 2km and A,B < 2M . This results in Sm < 2M
as needed for further processing. The penalty is two more executions of the loop (see
also Section 4.3 for k = 1).
In Section 4.3 the multiplication in the quotient qi determination of Algorithm 4.4
was avoided. This is also possible for higher radixes [9]. M has to be transformed
to M̃ = (M ′ mod 2k)M . This step must be performed only once for a given crypto–
system. The conditions for the algorithm are 4M̃ < 2km and A,B < 2M̃ now.
Thus for the same bit length of M the loop has to be executed one more time. An
other penalty is a larger range of Sm. Algorithm 4.7 comprises the above mentioned
improvements.
CHAPTER 4. PRELIMINARIES: MODULAR EXPONENTIATION 21
Algorithm 4.7 [9] Montgomery Modular Multiplication for computing A·B mod M ,
where M =∑m−3i=0 (2
k)imi, mi ∈ {0, 1 . . . 2k − 1};
M̃ = (M ′ mod 2k)M , M̃ =∑m−2i=0 (2
k)im̃i, m̃i ∈ {0, 1 . . . 2k − 1};
B =∑m−1i=0 (2
k)ibi, bi ∈ {0, 1 . . . 2k − 1};
A =∑m−1i=0 (2
k)iai, ai ∈ {0, 1 . . . 2k − 1};
A,B < 2M̃ ; 4M̃ < 2km; M ′ = −M−1 mod 2k
1. S0 = 0
2. FOR i = 0 to m− 1 DO
3. qi = (Si + aiB) mod 2k
4. Si+1 = (Si + qiM̃ + aiB)/2k
5. END FOR
The quotient qi determination complexity can further be reduced by replacing
B by B · 2k. Since aiB mod 2k ≡ 0, step 3 is reduced to qi = Si mod 2k. The
addition in step 3 is avoided at the cost of an additional iteration of the loop, to
compensate for the extra factor 2k in B. A Montgomery algorithm optimized for
hardware implementation is shown below:
Algorithm 4.8 [9] MONT RH(A,B):Montgomery Modular Multiplication for com-
puting A ·B mod M , where M =∑m−3i=0 (2
k)imi, mi ∈ {0, 1 . . . 2k − 1};
M̃ = (M ′ mod 2k)M , M̃ =∑m−2i=0 (2
k)im̃i, m̃i ∈ {0, 1 . . . 2k − 1};
B =∑m−1i=0 (2
k)ibi, bi ∈ {0, 1 . . . 2k − 1};
A =∑mi=0(2
k)iai, ai ∈ {0, 1 . . . 2k − 1}, am = 0;
A,B < 2M̃ ; 4M̃ < 2km; M ′ = −M−1 mod 2k
1. S0 = 0
2. FOR i = 0 to m DO
3. qi = (Si) mod 2k
4. Si+1 = (Si + qiM̃)/2k + aiB
5. END FOR
The result of Algorithm 4.8 is Sm+1 = ABR−1 modM where R = 2km modM .
We perform the same pre–computations as mentioned in Section 4.3: Pre–multiply
CHAPTER 4. PRELIMINARIES: MODULAR EXPONENTIATION 22
all inputs by the factor 22km modM . Thus every intermediate result carries a factor
2km. We just need to Montgomery multiply the final result by 1 to eliminate that
factor.
The final Montgomery multiplication with 1 makes sure our final result is smaller
than M̃ . Consider Algorithm 4.8 with B < 2M̃ and A = (0, . . . , 0, 1). We will
obtain S1 = a0 · B = B < 2M̃ . As all remaining ai = 0, we get at most Si+1 =
(Si+(2k−1)M̃)/2k → M̃ . If only one qi = 2k−1, (i = 1, 2 . . .m), qi ∈ {0, 1 . . . 2k−1},
then Si+1 < M̃ . M̃ , however, can be 2k− 1 times larger than M and the same is true
for the result of a modular exponentiation Sm+1. The last quotient qm and the factor
M ′ mod 2k might determine the number of timesM has to be subtracted from Sm+1.
This behavior, however, has not been studied in this thesis. We assume that the final
comparison and eventual modular reduction step is performed outside of our design.
Algorithm 4.8 is used for the architecture described in Chapter 7. The designs we
implemented have moduli of 160, 256, 512, 768, and 1024 bits. The radix chosen was
r = 24 = 16. For the rest of this thesis we use the following convention for specifying
the complexity: M =∑m−1i=0 (2
k)imi ⇒ M̃ =∑mi=0(2
k)im̃i, B =∑m+1i=0 (2
k)ibi, A =∑m+2i=0 (2
k)iai, am+2 = 0. The loop in Algorithm 4.8 is executed m + 3 times and
the response is Sm+3 = ABR−1 modM , where R = 2k(m+2) modM . Thus the pre–
computation factor is 22k(m+2) modM .
The computational complexity of Algorithm 4.8 lies in the two additions of m+ k
bit operands for computing Si+1. Another costly operation is the computation of the
multiples of M and B in step 4.
Chapter 5
General Design Considerations
In this chapter we present some of the relevant features of the Xilinx XC4000 Series
FPGAs and introduce a metric for FPGA cost and performance evaluation. Based on
these features we derive some characteristics for our architectures. The pros and cons
of two different approaches to implement Montgomery’s algorithm are exhibited.
5.1 Xilinx XC4000 Series FPGAs
5.1.1 Configurable Logic Blocks
An FPGA device consists of three types of reconfigurable elements, the Configurable
Logic Blocks (CLBs), I/O blocks (IOBs) and routing resources [38].
Figure 5.1.1 shows the structure of a CLB. An XC4000 CLB is made up of three
look–up tables (LUT) F , G, and H, two flip-flops and programmable multiplexers.
Any boolean function of 5 inputs, any two functions of 4 inputs and some functions of
up to 9 inputs can be computed in one CLB. The multiplexers can route the outputs
of the look–up tables directly to the outputs or to the flip-flops. In the first case the
flip-flops can be utilized to store direct inputs. This is important for a design with a
large amount of registers. The same CLB can be used to store two bits and compute
two independent logic functions. Thus a considerable amount of resources can be
23
CHAPTER 5. GENERAL DESIGN CONSIDERATIONS 24
programmed duringconfiguration
G
D Q
EC
D Q
EC
F
H
Figure 5.1: XC4000 CLB Structure [38]
saved as will be shown in Sections 6.2 and 7.2.
5.1.2 Routing Topologies
Programmable routing resources connect the CLBs and IOBs into a network. The
structure of XC4000 FPGA devices is shown in Figure 5.2. A matrix of switch boxes
is placed over the CLB array. These switch boxes make it possible to connect any
two CLBs together.
Routing in the XILINX FPGA is accomplished through a hierarchal structure.
CHAPTER 5. GENERAL DESIGN CONSIDERATIONS 25
Programmable Switch Matrices CLB
Figure 5.2: Xilinx FPGA structure [38]
Each row or column of routing lines between CLBs has a number of different types
of lines. These include single, double, quad, long, and global lines. Single lines
route signals between adjacent CLBs. Double lines stretch over two CLBs. For the
architectures described in Chapters 6 and 7, devices of the XC4000XL and XC4000XV
families were used. Table 5.1 shows their routing resources per CLB:
The numbers show that a large amount of connections are available for closely
located CLBs. On the other hand routing gets increasingly difficult for CLBs further
apart. In Section 5.2.1 we will discuss some consequences this issue causes for the
choice of our architectures. For a detailed description of the routing structure inside
the XILINX FPGA, refer to [38].
CHAPTER 5. GENERAL DESIGN CONSIDERATIONS 26
Table 5.1: Routing per CLB in XC4000XL and XC4000XV devices [38]
Vertical Horizontal
Singles 8 8
Doubles 4 4
Quads 12 12
Longlines 10 6
Direct Connects 2 2
Globals 8 0
Carry Logic 1 0
5.1.3 Special Features of the XC4000 Family
The XC4000 devices contain dedicated, hard–wired carry logic to both accelerate and
condense arithmetic functions such as adders and counters [38]. An n bit ripple carry
adder is implemented in n/2 + 2 CLBs. The ripple carry outputs are routed between
CLBs on high speed dedicated paths. The maximum delay from the operand input
to the sum output of a N–bit adder is approximately:
tpd = 4.5 +N · 0.35 [ns]
For an n–bit counter, the minimum clock period is approximately:
tclk−clk = 9.5 +N · 0.35 [ns]
These values vary slightly for different devices and speed grades.
Another very useful feature of the XC4000 devices is the possibility to implement
RAM in CLBs. A single CLB can be programmed as a 16 × 2 bit or 32 × 1 bit
ROM/RAM or as a 16 × 1 bit Dual Port RAM. RAM with larger address width
requires considerably more resources. Table 5.2 shows the amount of CLBs used for
implementing 64× 2–bit, 128× 2–bit, and 256× 2–bit RAM and DP RAM blocks.
CHAPTER 5. GENERAL DESIGN CONSIDERATIONS 27
Table 5.2: Amount of CLBs used for RAM and DP RAM blocks on XC4000 devices
64× 2–bits 128× 2–bits 256 × 2–bits
CLBs CLBs CLBs
RAM 6 12 24
DP RAM 16 32 64
5.1.4 Cost and Speed Evaluation
In previous work [36, 35, 7] related to modular arithmetic architectures, the gate
count model has been used for cost evaluation and the gate delay model for speed
evaluation. This is not appropriate for FPGAs. As the functional unit of an FPGA
is the CLB, we evaluate the cost (C) in number of CLBs. The operation time (T)
consists of logic delay in the CLBs and routing delay and is obtained from Xilinx’s
Timing Analyzer software. As a third parameter we use the time–area product (TA).
It is defined by time multiplied by cost.
5.2 Architectures Suitable for FPGAs
5.2.1 Systolic Array vs. Redundant Representation
As described in Section 4.3, there have been two principle approaches proposed to
compute Montgomery modular multiplication.
1. Avoid the carry propagation delays by keeping intermediate results in redundant
representation. Resolution into binary representation is only done at the very
end and for feeding the intermediate result back as ai in Algorithm 4.8.
2. Systolic Arrays: Processing units compute successive values for a single digit
position. The computed carries, qi and ai, are “pumped” through the processing
units.
CHAPTER 5. GENERAL DESIGN CONSIDERATIONS 28
A solution following approach 1 has already been implemented in FPGAs [33]. A
matrix of 16 FPGAs has been used. The second approach using systolic arrays has
drawn considerable attention in the research community. However, no architectures
that specifically target FPGAs have been reported, nor are there reports of ASIC
implementations of such systolic architectures.
The question at hand is which solution is better suited for an FPGA implementa-
tion. To answer this we first take a closer look at the equation we have to implement
(Section 4.4):
qi = Si mod 2k
Si+1 = (Si + qi · M̃)/2k + ai ·B (5.1)
Equation (5.1) is executed iteratively to compute a modular multiplication. A
series of modular multiplications combine to a modular exponentiation according to
the square & multiply Algorithms 4.1 or 4.2. Additionally a pre–computation and
post–computation are necessary (Section 4.3 and 4.4).
An architecture comprising these computations has two major parts:
1. The arithmetic part computes Equation (5.1). The operands B, M̃ and Si must
be stored and multiples of M̃ and B have to be calculated. Furthermore two
additions have to be computed. To avoid a long carry chain, we can divide the
adder into units that typically compute one digit of Si+1 in Equation (5.1). If
the units compute one iteration of (5.1) at the same time, the result Si+1 must
be kept in redundant representation. A separate adder is needed to resolve the
redundancy of Si+1 for further processing. In a systolic array approach the units
compute an iteration of 5.1 successively. The overflow of a unit is fed as carry
to the next unit. In both approaches we have to globally feed qi and ai to all
units. The quotient qi is computed once per iteration in the least significant
unit. The result of the modular multiplication Sm+3 must be stored and reused
CHAPTER 5. GENERAL DESIGN CONSIDERATIONS 29
as operand A for the successive squaring. Therefore all digits of Sm+3 need to
be distributed to all units of the design.
2. A control and storage part contains the finite state machine (FSM) that con-
trols the execution of the square & multiply algorithm and the pre– and post–
computations. We also need storage elements for the pre–computation factor,
the exponent and operand A. A is processed serially and has to be distributed
digit by digit globally over the arithmetic part.
Clearly, the arithmetic part will utilize most of the resources. Considerations concern-
ing its design will decide between an systolic array and a redundant representation
architecture. Some considerations concerning the control part are discussed in Sub-
section 5.2.2.
Two recent MS theses [22] and [11] in the same field found that a major problem
when implementing designs in FPGAs, is the availability of enough routing resources.
As discussed in Section 5.1.2, the problem worsens if there are many connections
between CLBs which are far apart. When considering an implementation of an ar-
chitecture with redundant representation there are some routing issues to deal with:
To compute Equation (5.1) with a radix r = 2k, qi and ai, each k bits wide and
some control signals have to be distributed globally over the arithmetic part. As
all redundant digits of Si+1 are computed concurrently, the signals have to arrive at
their destination at the same time. The resulting high fan–out causes also considerable
propagation delays. Another issue is feeding B and M̃ to the locations where they are
stored and processed. To avoid the full size bus that feeds these signals in parallel, we
need a systolic approach at least for the loading of B and M̃ . Lastly, the redundant
result Sm+1 of a modular multiplication has to be resolved and distributed as A for
further processing. As the low order bits of A are needed first, the resolution of
the most significant bits is not critical in terms of propagation delay. The routing,
however, is an issue as the full length result has to be stored and distributed all over
CHAPTER 5. GENERAL DESIGN CONSIDERATIONS 30
the arithmetic part.
Routing in a systolic array architecture is much less critical:
In a systolic array architecture we have typically m units, each processing k bits.
The k signals qi and ai and the control signals are “pumped” through the units,
from register to register. k bits of Si+1 are computed per clock cycle. Only k bits
of the result Sm+1 of a modular multiplication are valid at the time. Thus with an
additional k bit register per unit we can “pump” the result back through the units.
We can further store the result in RAM, k bits at the time, for further processing.
So far no connections stretch over more than one unit. If reasonably small units are
designed routing is not problematic. B and M̃ can be loaded over k bit wide buses.
The load signal propagates through the units and activates the clock enable of one
unit per cycle. Thus only 2k signals have to be routed all over the design.
Summarizing the last two paragraphs, three major advantages of a systolic array
architecture were found.
routing resources Most connections are within one unit or are stretching to an
adjacent unit. The availability of enough routing resources is much less an issue
compared to an architecture where signals have to be distributed all over the
design.
propagation delay A higher clock frequency is possible due to less additional rout-
ing delays of long paths.
synchronous design A fully synchronous design is possible, as we do not need
an asynchronous carry completion adder as described in [33], to resolve the
redundant representation of the result S.
In contrast to these three advantage, we possibly need more resources to implement
a systolic array. A considerable amount of registers is needed for “pumping” the
operand A, the quotient Q, the control word and the result S consecutively through
CHAPTER 5. GENERAL DESIGN CONSIDERATIONS 31
the units. CLBs used for implementing registers, however, can be reused for logic
functions as pointed out in Section 5.1.1.
5.2.2 State Machine and Storage Elements
We chose a synchronous methodology for all designs presented in this thesis. In the
synchronous approach, the clock period is determined by the longest combinatorial
delay between two registers.
One of the major goals of this work was to design a high–speed modular expo-
nentiation architecture. This can be done by reducing the amount of iterations in
Algorithm 4.8 by choosing a high radix. Secondly the propagation delay of Step 4 in
the same algorithm needs to be optimized. This step is typically computed in one
clock cycle and determines the minimum clock period.
The FSM and central storage elements have to be designed in such a way that
their minimum clock period is smaller than the combinatorial delay of the arithmetic
part. To speed–up the FSM as much as possible “one–hot encoding” was used. Each
state is represented with an individual bit, resulting in decreased logic complexity
associated with each state [1]. On the other hand more registers are needed. We
accept this minor disadvantage as the FSM uses only a small percentage of the whole
design.
In the central storage elements we need to store data serially. Such data includes
the pre–computation factor, the exponent, and intermediate results of the square &
multiply algorithm. The usage of RAM saves a large amount of resources compared
to registers. A 1024 bit exponent stored in a 16 × 64-bit RAM uses 32 CLBs, while
an equivalent register needs 512 CLBs. RAM with address width larger than 4 bit
require more resources and feature larger access delays, due to additional address
decoding. The RAM have to be designed in such a way that their access time does
not determine the minimal clock period of the design.
Chapter 6
Design 1: A Resource Efficient
Architecture
In this chapter we describe our first architecture. The goal was to design an area
efficent architecture using Algorithm 4.6. As target devices we use the Xilinx XC4000
family as described in Chapter 5. The results of the actual implementation of the
architecture will be described in Chapter 9.
6.1 Design Overview
A general radix 2 systolic array as proposed in [10, 35, 8] utilizes m times m pro-
cessing elements, where m is the number of bits of the modulus and each element
processes a single bit. 2m modular multiplications can be processed simultaneously,
featuring a throughput of one modular multiplication per clock cycle and a latency
of 2m cycles. As this approach would result in unrealistically large CLB counts for
the bit length required in modern public–key schemes, we implemented only one row
of processing elements. With this approach two modular multiplications can be pro-
cessed simultaneously and the performance reduces to a throughput of two modular
multiplications per 2m cycles. The latency remains 2m cycles.
32
CHAPTER 6. DESIGN 1: A RESOURCE EFFICIENT ARCHITECTURE 33
The second consideration was the choice of the radix r = 2k. Increasing k reduces
the amount of steps to be executed in Algorithm 4.8. Such an approach, however,
requires more resources: The main expense lies in the computation of the 2k multiples
of M and B in Algorithm 4.8. They can either be pre-computed and stored in RAM
or calculated by a multiplexer network as proposed in Reference [9]. Clearly, the CLB
count becomes smallest for r = 2, as no multiples of M or B have to be calculated
or pre–computed.
Using a radix r = 2, the following equation has to be implemented (Algorithm 4.6):
Si+1 = (Si + qi ·M)/2 + ai ·B, qi, ai ∈ {0, 1}
To further reduce the required number of CLBs we took the following measures:
1. A considerable amount of CLBs are used in the overhead of a processing element
(unit). At least ai, qi and two control bits have to be stored and decoded in each
unit. If only one bit is processed per unit, the overhead is required m times.
In order to save resources we implemented units that process u = 4,8,16 bits.
With this approach we need only m/u instead of m units, and a considerable
amount of overhead can be saved. For processing more than one bit per unit
large adders are needed. Thus we expect the processing time and resource
requirements to increase exponentially. The possibility to use the fast carry
ripple adder as described in Section 5.1.3, however, causes processing time and
CLB count to grow only proportionally.
2. Computing u bits per unit, the operands Si, B and M have all u bits in each
unit. The result Si+1 has a maximum of u+ 2 bits, u result bits and two carry
bits that are fed to the next unit. Instead of using two adders and computing
the two additions serially in each execution of the loop, we pre-compute B+M
and store the result in a register. The resulting carry is fed to the next unit.
Thus we need only one adder, that adds Si to 0, B, M or B +M . This adder
CHAPTER 6. DESIGN 1: A RESOURCE EFFICIENT ARCHITECTURE 34
is also used for computing B +M . With this approach the result Si+1 is only
u+ 1 bits wide and just one carry bit is fed to the next unit.
Similar to the approach in [15] we compute squarings and multiplications in par-
allel. As explained in Section 6.3, this measure fully utilizes every cycle.
Design 1 can be divided hierarchically into three levels.
Processing Element Computes u bits of a modular multiplication.
Modular Multiplication An array of processing elements computes a modular
multiplication.
Modular Exponentiation Combine modular multiplications to a modular expo-
nentiation according to Algorithm 4.2.
In the following we describe the system with a bottom–up approach.
6.2 Processing Elements
Figure 6.1 shows the implementation of a processing element.
In the processing elements we need the following registers:
• M-Reg (u bits): storage of the modulus
• B-Reg (u bits): storage of the B multiplier
• B+M-Reg (u bits): storage of the intermediate result B +M
• S-Reg (u+ 1 bits): storage of the intermediate result (inclusive carry)
• S-Reg-2 (u− 1 bits): storage of the intermediate result
• Control-Reg (3 bits): control of the multiplexers and clock enables
CHAPTER 6. DESIGN 1: A RESOURCE EFFICIENT ARCHITECTURE 35
S_Reg_2
B_Reg
u-bit Adder Result_R
eg
Decode
Control_R
eg
Result_Out
Result_In
Control_In
Control_Out
q_i, a_i-In
q_i, a_i-Out
Carry_In
B_In
--u
--u
--u
--u
--u --u
--u --u
--u
--u
--u
--u
--1
q_i, a_i-Reg
Mux_B
Mux_1 Mux_2
Mux_R
es
Control
--
----
----
----
----
uu u
u
2
2
3
3
1
2
1u-1
Carry_Out
--u+1
"0"
1
"0"B+M_Reg M_Reg
--u-1
M_In S_0_In
S_Reg
S_0_Out
Figure 6.1: Processing element (unit) of Design 1 that computes Si+1 = (Si + qi ·
M)/2 + ai ·B, qi, ai ∈ {0, 1}
• ai,qi (2 bits): multiplier A, quotient Q
• Result-Reg (u bits): storage of the result at the end of a multiplication
The registers need a total of (6u + 5)/2 CLBs, the adder u/2 + 2 CLBs, the
multiplexers 4 · u/2 CLBs, and the decoder 2 CLBs. The possibility of re–using
registers for combinatorial logic allows some savings of CLBs. MuxB and MuxRes
are implemented in the CLBs of B-Reg and Result-Reg, Mux1 and Mux2 partially in
M-Reg and B+M-Reg. The resulting costs are approximately 3u+ 4 CLBs per u–bit
processing unit. That is 3 to 4 CLBs per bit, depending on the unit size u.
Let’s compare this expense to the resources needed for a one bit unit implemen-
CHAPTER 6. DESIGN 1: A RESOURCE EFFICIENT ARCHITECTURE 36
tation (u = 1). We would need a total of seven bit register space (M , B, ai, qi,
control(2) and result) plus eventually a B + M register and a 4-bit input – 3 bit
output (2 carries, result) adder. Together with one or two CLBs for decoding the
control word and multiplexing, we would have a total of 6 or 7 CLBs per unit. With
such a large amount of CLBs we can implement a much faster architecture, as we will
see in Chapter 7.
Before a unit can compute a modular multiplication, the system parameters have
to be loaded. M is stored into M-Reg of the unit. At the beginning of a modular
multiplication, the operand B is loaded from either B-in or S-Reg, according to the
select line of multiplexer B-Mux. The next step is to compute M +B once and store
the result in the B+M-Reg. This operation needs two clock cycles, as the result is
clocked into S-Reg first. The select lines of Mux1 and Mux2 are controlled by ai or
the control word respectively.
In the following 2(m+ 2) cycles a modular multiplication is computed according
to Algorithm 4.6. Multiplexer Mux1 selects one of its inputs 0, M , B, B +M to be
fed in the adder according to the value of the binary variables ai and qi. Mux2 feeds
the u− 1 most significant bits of the previous result S-Reg2 plus the least significant
result bit of the next unit (division by two/shift right) into the second input of the
adder. The result is stored in S-Reg for one cycle. The least significant bit goes into
the unit to the right (division by two / shift right) and the carry to the unit to the
left. In this cycle a second modular multiplication is calculated in the adder, with
updated values of S-Reg2, ai and qi. The second multiplication uses the same operand
B but a different operand A.
At the end of a modular multiplication, Sm+3 is valid for one cycle at the output
of the adder. This value is both stored into Result-Reg, as fed via S-Reg into B-Reg.
The result of the second multiplication is fed into Result-Reg one cycle later.
CHAPTER 6. DESIGN 1: A RESOURCE EFFICIENT ARCHITECTURE 37
6.3 Modular Multiplication
Figure 6.2 shows how the processing elements are connected to an array for computing
an m–bit modular multiplication. To compute Si+1 = (Si + qi ·M)/2 + ai · B with
M =∑m−1i=0 2
i · mi, we need m/u + 1 units. Unit1 . . .Unit(m/u)−1 are designed as
described in Section 6.2. Unit0 has only u − 1 B inputs as B0 is added to a shifted
value Si + qiM . The result bit S-Reg0 is always zero according to the properties of
Montgomery’s algorithm. Unitm/u processes the most significant bit of B and the
temporary overflow of the intermediate result Si+1. There is no M input into this
unit.
S_0_OutCarry_In
Control_outq_i, a_i-OutCarry_Out
B_In
Result_In
Unit_0
Carry_In
Control_outq_i, a_i-OutCarry_Out
B_In
Result_In
Unit_1
Control_In
Result_Out
q_i-Ina_i-In
a_inq_i, a_i-Inq_i, a_i-In
Control_In
Result_Out
Control_In
Result_Out
Control_In
Result_Out
B_odd_InB_even_In
M_even_InM_odd_In
M_InM_In
B_even_BusB_odd_BusM_even_BusM_odd_Bus
B_In
Unit_(m/u) Units_(m/u)-1 ... 2
S_0_InS_0_Out S_0_Out S_0_In
Figure 6.2: Systolic Array for modular multiplication
The inputs and outputs of the units are connected to each other in the following
way. The control word, qi and ai are pumped from right to left through the units.
The result is pumped from left to right. The carry-out signals are fed to the carry-in
inputs to the right. Output S 0 Out is always connected to input S 0 In of the unit
to the right. This represents the division by 2 of the equation.
At first the modulus M is fed into the units. To allow enough time for the signals
to propagate to all the units, M is valid for two clock cycles. We use two M-Buses,
the M-even-Bus connected to all even numbered units unit0, unit2 . . . unitm/u, and
the M-odd-Bus connected to all odd numbered units unit1, unit3 . . . unit(m/u)−1. This
CHAPTER 6. DESIGN 1: A RESOURCE EFFICIENT ARCHITECTURE 38
approach allows to feed u bits of M̃ to the units per clock cycle. Thus it takes m/u
cycles to load the full modulus M .
The operand B is loaded similarly. The signals are also valid for two clock cycles.
We also use two B-Buses, the B-even-Bus connected to all even numbered units
unit0, unit2 . . . unitm/u, and the B-odd-Bus connected to all odd numbered units unit1,
unit3 . . . unit(m/u)−1.
After the operand B is loaded, the computation of Algorithm 4.6 can begin.
Starting at the rightmost unit0, the control word, ai, and qi are fed into their registers.
The adder computes S-Reg-2 plus B, M , or B +M in one clock cycle according to
ai and qi. The least significant bit of the result is read back as qi+1 for the next
computation. The resulting carry bit, the control word, ai and qi are pumped into
the unit to the left, where the same computation takes place in the next clock cycle.
In such a systolic fashion the control word, ai, qi, and the carry bits are pumped from
right to left through the whole unit array. The division by two in Algorithm 4.6 leads
also to a shift–right operation. The least significant bit of a unit’s addition (S0) is
always fed back into the unit to the right. After a modular multiplication is completed,
the results are pumped from left to right through the units and consecutively stored
in RAM for further processing.
A single processing element computes u bits of Si+1 = (Si + qi ·M)/2 + ai ·B of
Algorithm 4.6. In clock cycle i, unit0 computes bits 0 . . . u − 1 of Si. In cycle i+ 1,
unit1 uses the resulting carry and computes bits u . . . 2u − 1 of Si. Unit0 uses the
right shifted (division by 2) bit u of Si (S0) to compute bits 0 . . . u−1 of Si+1 in clock
cycle i+ 2.
Clock cycle i + 1 is unproductive in unit0 while waiting for the result of unit1.
This inefficiency is avoided by computing squares and multiplications in parallel ac-
cording to Algorithm 4.2. Both pi+1 and zi+1 depend on zi. We therefore store the
intermediate result zi in the B–Registers and feed zi and pi into the ai input of the
CHAPTER 6. DESIGN 1: A RESOURCE EFFICIENT ARCHITECTURE 39
units for squaring and multiplication.
6.4 Modular Exponentiation
6.4.1 Data Flow
Figure 6.3 shows how the array of units is utilized for modular exponentiation. At
the heart of our design is a finite state machine (FSM) with 17 states. An idle
state, four states for loading the system parameters, and four times three states
for computing the modular exponentiation. The actual modular exponentiation is
executed in four main states, pre-computation1, pre-computation2, computation, and
post-computation. Each of these main states is subdivided in three sub–states, load-B,
B+M, and calculate-multiplication. The control word fed into control-in is encoded
according to the states. The FSM is clocked at half the clock rate. The same is true
for loading and reading the RAM and DP RAM elements. This measure makes sure
the maximal propagation time is in the units. Thus the minimal clock cycle time and
the resulting speed of a modular exponentiation relates to the effective computation
time in the units and not to the computation of overhead.
Before a modular exponentiation is computed, the system parameters have to be
loaded. The modulus M is read 2u bits at the time from I/O into M-Reg. Reading
starts from low order bits to high order bits. M is fed from M-Reg u bits at the
time alternatively to M-even-Bus and M-odd-Bus. The signals are valid two cycles
at a time. The exponent E is read 16 bits at the time from I/O and stored into
Exp-RAM. The first 16 bit wide word from I/O specifies the length of the exponent
in bits. Up to 64 following words contain the actual exponent. The pre–computation
factor 22(m+2) mod M is read from I/O 2u bits at the time. It is stored into Prec-RAM.
In state Pre-compute1 we read the X value from I/O, u bits per clock cycle, and
store it into DP RAM Z. At the same time the pre–computation factor 22(m+2) mod
CHAPTER 6. DESIGN 1: A RESOURCE EFFICIENT ARCHITECTURE 40
Exp RAM--
1
--
1
Prec RAM
M_even_In
X_In
--u
--1
--u
Statemachine
TDM
DP RAM Z DP RAM P
1
1
Res_Out
u
uPrec_In
--2u
Exp_In--2u
u
u
Units_(m/u)...0
M_In
M_R
EG
B_even_In
a_i_InControl_In
Result_Out
3
16
B_odd_In
M_odd_In
Figure 6.3: Design for a modular exponentiation
M is read from Prec RAM and fed u bits per clock cycle alternatively via the
B-even-Bus and B-odd-Bus to the B–registers of the units. In the next two clock
cycles, B +M is calculated in the units.
Now we begin calculating the pre–computation. The initial values of Algorithm 4.2
are P0 = 1 and Z0 = X. Both values have to be multiplied by 22(m+2) mod M . This
can be done in parallel as both multiplications use a common operand 22(m+2) modM ,
that is already stored in B. The time division multiplexing unit (TDM) reads X from
DP RAM Z and multiplexes X and 1, 0 . . . 0 on the ai–bus into the units.
After 2(m+3) clock cycles the low order bits of the result of MONT R2(22(m+2) mod
M , X)= X · 2m+2 modM appear at Result-Out and are stored in DP RAM Z. The
low order bits of the result of MONT R2(22(m+2) mod M , 1)= 2m+2 modM appear
at Result-Out one cycle later and are stored in DP RAM P. This process repeats for
2m cycles, until all digits of the two results are saved in DP RAM Z and DP RAM P.
The result X · 2m+2 modM is also stored in the B-registers of the units.
In state pre-compute2 the actual computation of Algorithm 4.2 begins. For both
CHAPTER 6. DESIGN 1: A RESOURCE EFFICIENT ARCHITECTURE 41
calculations of Z1 and P1 we use Z0 as an operand. This value is stored in the B-
registers. The second operand Z0 or P0 respectively, is read from DP RAM Z and
DP RAM P and “pumped” via TDM as ai into the units. After another 2(m + 3)
clock cycles the low order bits of the result of Z1 and P1 appear at Result-Out. Z1 is
stored in DP RAM Z. P1 is needed only if the first bit of the exponent e0 is equal to
“1”. Depending on e0, P1 is either stored in DP RAM P or discarded.
In state compute the loop of Algorithm 4.2 is executed n − 1 times. Zi in
DP RAM Z is updated after every cycle and “pumped” back as ai into the units.
Pi in DP RAM P is updated only if the relevant bit of the exponent ei is equal to
“1”. In this way always the last stored Pi is “pumped” back into the units.
After the processing of en−1, the FSM enters state post-compute. Pn = XE mod
M · 2m+2 mod M is stored in DP RAM P now. To eliminate the factor 2m+2 (Sec-
tion 4.3) from the result Pn, we compute a final Montgomery multiplicationMONT(Pn,
“1”). First the vector 0, 0, . . . 0, 1 is fed alternatively via the B-even-Bus and B-odd-Bus
into the B–registers of the units. Pn is “pumped” from DP RAM P as ai into the
units. After state post-compute is executed, u bits of the result Pn = XE modM are
valid at the I/O port. Every two clock cycles another u bits appear at I/O. State
pre-compute1 can be re–entered immediately now for the calculation of another X
value.
A full modular exponentiation is computed in 2(n+2)(m+4) clock cycles. That
is the delay it takes from inserting the first u bits of X into the device until the first
u result bits appear at the output. At that point, another X value can enter the
device. With a additional latency of m/u clock cycles the last u bits appear on the
output bus.
CHAPTER 6. DESIGN 1: A RESOURCE EFFICIENT ARCHITECTURE 42
6.4.2 Function Blocks
In this subsection we explain the function blocks in Figure 6.3: DP RAM P, DP RAM Z,
EXP RAM, and Prec RAM.
Figure 6.4 shows the design of DP RAM Z. An m/u × u bit DP RAM is at the
heart of this unit. It has separate write (A) and read (DPRA) address inputs. The
DecoderStatus-
3
Status
S-R-FFR
CLKS Q_Out
Clock
CounterRead-
Clk_En
term_cntQ_Out
m/u x u DP RAM
Clock
Shift-Register
Clock
Data_In
2
u
1
WR_CLK
DPODI
WR_ENDPRAA
u
Clock
Counter
term_cntQ_Out
Write-
ld_enable
=02log (u)
Clock
u-bit-Register
D_loadQ_Out
Clk_En
Q_Out
D_load
u
u
1
1+log (m/u)2
log (m/u)
Data_Out
Terminal_Count
1
SEL
MUX
1+log (m)2
Figure 6.4: DP RAM Z Unit
write-counter counting up to m/u computes the write address (A). The write-counter
CHAPTER 6. DESIGN 1: A RESOURCE EFFICIENT ARCHITECTURE 43
starts counting (clock-enable) in sub–states B-load when the first u bits of Zi appear
at data in. At the same time the enable signal of the DP RAM is active and data
is stored in DP RAM. Terminal-count resets count–enable and write–enable of DP
RAM when m/u is reached. The read-counter is enabled in the sub–states compute.
When read-counter reaches its upper limit m+2, terminal-count triggers the FSM to
transit into sub-state B-load. The log2(m/u) most significant bits of the read-counter
value (q out) address DPRA of the DP RAM. Every u cycles another value stored in
the DP RAM is read. This value is loaded into the shift register when the log2(u)
least significant bits of q out reach zero. The next u cycles u bits appear bit by bit at
the serial output of the shift register. The last value of zi is stored in a u–bit register.
This measure allows us to select an m/u×u–bit DP RAM instead of an 2m/u×u–bit
DP RAM (m = 2x, x = 8, 9, 10).
DP RAM P works almost the same way. It has an additional input ei, that acti-
vates the write-enable signal of the DP RAM in the case of ei = 1.
Figure 6.5 shows the design of Exp RAM. For the simulation results please refer
to Figure C.9 in the appendix. In the first cycle of the load-exponent state, the first
word is read from I/O and stored into the 10–bit register. Its value specifies the
length of the exponent in bits. In the next cycles the exponent is read 16–bit at a
time and stored in RAM. The storage address is computed by a 6–bit write counter.
At the beginning of each compute state the 10–bit read counter is enabled. Its 6
most significant bits compute the memory address. Thus every 16th activation, a
new value is read from RAM. This value is stored in the 16–bit shift–register at the
same time (when the 4 least significant bits of read counter are equal to zero). When
read counter reaches the value specified in the 10–bit register, the terminate signal
triggers the FSM to enter state post-compute.
Figure 6.6 shows the design of Prec RAM. In state load–pre–factor the pre-
computation factor is read 2u bits at the time from I/O and stored in RAM. A
CHAPTER 6. DESIGN 1: A RESOURCE EFFICIENT ARCHITECTURE 44
10
Q_Out
D_load
Clock
1
WR_CLK
DPODI
WR_EN
A
ld_enable
Clock
Q_In
Q_Out
16 16 Clk_En
4Clk_En
e_i
ab
terminate
64x16 RAM16bit-Shift-
Register
a=b
=0
Register
ReadCounter
Q_Out
ClockClk_En
StatusDecoder
Q_Out
CounterWrite
Clk_EnClock
MuxStatus
Clock
Data_In
6
3
6
10
10
10-bit
10
Figure 6.5: Exp RAM Unit
counter that counts up to m/2u addresses the RAM. When all m/2u values are
read, the terminal-count signal triggers the FSM to leave state load–pre–factor. In
state pre–compute1 the pre–computation factor is read from RAM and fed to the
B–registers of the units. The counter is incremented each clock cycle and 2u bits are
loaded in the 2u–bit register. From there u bits are fed on B-even-bus each positive
edge of the clock. On the negative clock edge, u bits are fed on the B-odd-bus via the
u–bit register.
CHAPTER 6. DESIGN 1: A RESOURCE EFFICIENT ARCHITECTURE 45
2log (m/2u)2log (m/2u)
x 2u RAM
Q_Out
Clock
Data_In
WR_CLK
DPODI
WR_EN
A
ClockS
R Q_out
Clock
term_cnt
Clk_En
Q_out
2u u
Clock
u
B_Odd_Bus
u
ClockQ_Out
Clk_EnD_In
D_InClk_En
B_Even_Bus
2u2u
Status
m2u
R-S-FF
Counter
Register2u-bit
Registeru-bit
-bit
Terminal_Count
Figure 6.6: Prec RAM Unit
Chapter 7
Design 2: A Speed Efficient
Architecture
In this chapter we describe our second architecture. The goal was to design a speed
efficient architecture using Algorithm 4.8 with a larger radix. As target devices we
use the Xilinx XC4000 family as described in Chapter 5. The results of the actual
implementation of the architecture will be described in Chapter 9.
7.1 Design Overview
Design 1, described in Chapter 6, was optimized in terms of resource usage. In order
to speed–up the design two approaches can be taken. We can either try to reduce
the cycle time, or the number of cycles per modular multiplication. Reduction of the
cycle time can be achieved by computing one instead of u bits per unit. Compared
to a design with 4 bit units, the resulting speed–up of approximately 20% (without
three ripple carry delays) comes at the expense of additional 30% resources. Thus
the time–area product becomes worse. Section 4.4 describes how the number of steps
per modular multiplication can be reduced. Using a radix r = 2k, k > 1, reduces
the number of steps in Algorithm 4.6 by a factor k. The following computation of
46
CHAPTER 7. DESIGN 2: A SPEED EFFICIENT ARCHITECTURE 47
Algorithm 4.8 has to be executed m+ 3 times (i = 0 to m+ 2):
qi = Si mod 2k
Si+1 = (Si + qi · M̃)/2k + ai ·B
where B =∑m+1i=0 (2
k)i · bi;
A =∑m+2i=0 (2
k)i · ai, am+2 = 0;
M̃ =∑mi=0(2
k)i · m̃i;
M =∑m−1i=0 (2
k)i ·mi;
Please note that qi and ai are digits with k bits. One of the major problems when im-
plementing this equation is computing multiples of B and M̃ . Reference [9] proposes
a multiplexer network. This approach is not suitable for a systolic array implemen-
tation into FPGA because of the following reasons:
1. For a radix of 22 the multiplexer could be implemented in one CLB per bit
length, but already a radix of 24 uses more than four CLBs per bit. This would
result in unrealistically large CLB counts for secure bit length.
2. In a systolic array we typically compute k bits per processing element. With
a multiplexer solution the internal bit length becomes 2k resulting in twice as
much costs for adders and registers.
To avoid the doubling of the internal bit length of a unit the following approach
which is optimized for the CLB architectures at hand can be taken.
1. Pre-compute the multiples of B and M̃ at the beginning of the execution of
Montgomery’s algorithm and store the results for further use.
2. Let the carries of this pre–computations propagate to the units to the left.
If a unit processes k bits, the stored multiples will also have k bits and the internal
bit length will not exceed k + 2 bits (addition of 3 operands). The cost is additional
CHAPTER 7. DESIGN 2: A SPEED EFFICIENT ARCHITECTURE 48
2k clock cycles for calculating the 2k multiples of B. For small k values, this expense
is negligible compared to the total amount of 2(m+3) cycles for the whole algorithm.
As the value of M̃ is changed infrequently, we can compute its multiples externally
and store these values into the units. This approach saves some additional CLBs.
As storage elements we can either use registers or RAM elements. For k larger
than 2, registers are not suitable as they utilize one CLB per 2 stored bits. RAM
elements are very efficient up to an address width of 4 bits. Their implementation
requires only one CLB per two bits data width (2 CLBs for a 16 × 4 bit RAM).
The resource requirements grow rapidly, though, for larger address width. A 64 × 6
bit implementation (k = 6) utilizes 18 CLBs, a 256 × 8 bit implementation (k = 8)
utilizes 96 CLBs (see Table 5.2). Both would result in unrealistically large CLB counts
for secure bit length. Additionally the 26 or even 28 clock cycles for computing the
multiples of B are not negligible any more. To achieve an optimal time–area product
we implemented therefore an architecture with a radix r = 24. We compute 4 bits
per processing element. The multiples of M̃ are computed externally once and stored
in the units. Similar to Design 1, we use square and multiply Algorithm 4.2 and
compute squares and multiplications in parallel.
Design 2 can be divided hierarchically into three levels.
Processing Element Computes 4 bits of a modular multiplication.
Modular Multiplication An array of processing elements computes a modular
multiplication.
Modular Exponentiation Combines modular multiplications to a modular expo-
nentiation according to Algorithm 4.2.
In the following we describe the system with a bottom–up approach.
CHAPTER 7. DESIGN 2: A SPEED EFFICIENT ARCHITECTURE 49
7.2 Processing Elements
Figure 7.1 shows the implementation of a processing element.
Si+1 = (Si + qi · M̃)/2k + ai ·B
5 bit Reg
B_Adder
3
B_Reg
Decode
Control_R
eg
Control_In
Control_Out
B_In
--u
Mux_B
Control
----
-- a_i-Out
a_i-Reg
q_i-Reg
Carry_In
q_i-Out
q_i-In
a_i-In
D_out
D_in D_in
D_out
Carry
Decode
B_carry_in
Carry_Out
RAM16x4 bit
RAMB-16x4 bit
B_carry_out
Mux_R
es
Result_In
Result_Out
Result_R
eg
Ad Ad
--4
M_In~
B+M-Adder
M-
4 4
4 4
4
4
4
4
44
5
carry_0
carry_12
4
42
6
1
41
4
4
~
~
~
S_In
S+B+M-Adder
S_Reg
S_Out
3
-
Figure 7.1: Processing Element (unit)
CHAPTER 7. DESIGN 2: A SPEED EFFICIENT ARCHITECTURE 50
The following elements are needed:
• B-Reg (4 bits): storage of the B multiplier
• B-Adder-Reg (5 bits): storage of multiples of B
• S-Reg (4 bits): storage of the intermediate result Si (Algorithm 4.8)
• Control-Reg (3 bits): control of the multiplexers and clock enables
• ai-Reg (4 bits): multiplier A
• qi-Reg (4 bits): quotient Q
• Result-Reg (4 bits): storage of the result at the end of a multiplication
• B-Adder (4 bits): Adds B to the previously computed multiple of B
• B+M̃ -Adder (4 bits): Adds a multiple of M̃ to a multiple of B
• S+B+M̃ -Adder (5 bits): Adds the intermediate result Si to B + M̃
• B-RAM (16x4 bits): Stores 16 multiples of B
• M̃-RAM (16x4 bits): Stores 16 multiples of M̃
For a timing model of the processing element shown in Figure 7.1 please refer to
Figure C.2 in the appendix.
The registers need a total of 14 CLBs, the adders 13 CLBs and the RAM blocks 4
CLBs. The possibility of re–using registers for combinatorial logic allows some savings
of CLBs. Thus a processing element utilizes a total of 24 CLBs, which is equal to 6
CLBs per processed bit.
Before a unit can compute a modular multiplication, the system parameters have
to be loaded. The relevant bits of the multiples of M̃ are loaded into M̃-RAM starting
from M̃ , 2 · M̃ to 15 · M̃ . Each multiple of M̃ is valid on the input M̃-in for 2 clock
CHAPTER 7. DESIGN 2: A SPEED EFFICIENT ARCHITECTURE 51
cycles. The write enable signal for M̃ -RAM is decoded from Control-In. The address
input qi is incremented every two cycles.
At the beginning of a modular multiplication, the operand B is loaded from either
B-in or S-Reg, according to the select line of multiplexer B-Mux. This value is stored
in B-RAM while address input ai is equal to one. The write enable signal for B-RAM
is decoded from control-in. After each clock cycle, B is added to the accumulated B
multiple and stored in B-RAM, while ai is incremented by one. The resulting carry
propagates to the adjacent unit. The pre–computation and storage of the multiples of
B takes 16 clock cycles. For a simulation of this behavior please refer to Figure C.1.
When computing the modular multiplication, the relevant multiples of B and
M̃ are addressed by ai and qi. Both values are added up and added to S-in (Si in
Algorithm 4.8). The result of these additions will not exceed 47 (3·15+(carry-in = 2)).
Thus carry-out ∈ {0, 1, 2}. This behavior saves us another full adder. The incoming
carry-in is decoded in the carry-decode unit and the resulting signals carry 0 (carry-in
= 2) and carry 1 (carry-in = 1 or 2) can be fed in the carry-in inputs of the two
adders. A simulation of a modular multiplication can be found in Figure C.2 in the
appendix.
At the end of a modular multiplication the result Sm+3 appears in S-Reg. This
value is stored via Mux-Res into Result-Reg and via B-Mux into B-Reg for further
processing (see Figure C.3 in the appendix).
7.3 Modular Multiplication
Figure 7.2 shows how the processing elements are connected to an array for computing
a full size modular multiplication. To compute Si+1 = (Si + qi · M̃)/16 + ai ·B with
operand B =∑m+1i=0 16
i · bi, we need m + 3 units. Units 1 . . .m + 1 are designed as
described in Section 7.2. Unit0 does not have a B input as B is not shifted by 4
bits (division by 16) in above mentioned equation. The four result bits S-Reg3...0 are
CHAPTER 7. DESIGN 2: A SPEED EFFICIENT ARCHITECTURE 52
S_Out
B_InB_InControl_In
a_i-Inq_i-In
B_Carry_InCarry_In
Result_Out
Control_out
Result_In
Carry_Out
a_i-Outq_i-OutB_Carry_Out
Unit_2
Result_Out
Carry_In
Control_In
B_Carry_Inq_i-Ina_i-In
Control_outa_i-Outq_i-Out
Carry_OutB_Carry_Out
Result_In
Unit_1
a_i-Inq_i-In
Result_Out
Carry_In
a_i-Out
Carry_Out
Control_out
q_i-Out
Unit_0
Result_Out
Control_InControl_In
a_i-In
q_i-In
Control_In
q_i-In
a_i-In
B_even_in
B_odd_inB_odd_BUS
B_even_BUS
M_odd_BUS
M_even_BUS
Unit_m+2 Units_m+1 ... 3
M_InB_In M_In M_In
M_odd_In
M_even_In
S_InS_Out S_Out S_In S_Out S_In
Figure 7.2: Systolic array for modular multiplication
always equal to zero according to the properties of Montgomery’s algorithm. Unitm+2
on the other hand does not have an M input. It processes the most significant bit of
B and the temporarily occurring overflow of Si+1.
The inputs and outputs of the units are connected to each other in the following
way. The control word, qi and ai are pumped from right to left through the units.
The result is pumped from left to right. The carry-out signals are fed to the carry-in
inputs to the right. Output S out is always connected to input S in of the unit to the
right. This represents the division by 16 of the computation.
At first the multiples of the modulus M̃ are fed into the units. To allow enough
time for the signals to propagate to all the units, the modulus M̃ is valid for two
clock cycles. We use two M̃ -buses, the M̃ -even-bus connected to all even numbered
units unit0, unit2 . . . unitm+2, and the M̃-odd-bus connected to all odd numbered units
unit1, unit3 . . . unitm+1. This approach allows to feed 4 bits of M̃ to the units per
clock cycle. It takes m+2 cycles to feed one multiple of M̃ into all the units, a total
of 15 · (m+ 2) cycles to load all multiples of M̃ . As this step is executed only when
the modulus is changed, the long load time can be accepted.
The operand B is loaded similarly. The signals are also valid for two clock cy-
cles. We also use two B-buses, the B-even-bus connected to all even numbered units
CHAPTER 7. DESIGN 2: A SPEED EFFICIENT ARCHITECTURE 53
unit2, unit4 . . . unitm+2, and the B-odd-bus connected to all odd numbered units unit1,
unit3 . . . unitm+1. In cycle 1 the control word with state load-B is fed into control-in
of unit0 and is valid for two cycles. In cycle 2, the control word propagates to unit1.
In cycle 2 and 3, bits b0 . . . b3 are fed on the B-odd-bus and stored in unit1. In cycle
3 and 4, bits b4 . . . b7 are fed on the B-even-bus and stored in unit2. Also in cycle
3 the control word with state multiple-B is fed into control-in of unit0 and is valid
for 16 cycles. Unit1 computes the multiples of B in cycles 4 to 20, unit2 in cycles 5
to 21. In cycle 19, Unit0 starts to compute a squaring according to Algorithm 4.8.
Therefore bits a0 . . . a3 of the operand A are fed in ai-in, and the control word with
state calculate-multiplication in control-in. In cycle 20, unit1 uses the resulting carries
and computes bits 0 . . . 3 of S1. These bits are fed back either in S-in, as well as in
qi-in of unit0. Unit0 is ready to compute the second loop of the squaring S2 in cycle
21.
Clock cycle 20 is unproductive in unit0 while waiting for the result S-out of unit1.
This inefficiency is avoided by computing squares and multiplications in parallel ac-
cording to Algorithm 4.2. Both pi+1 and zi+1 depend on zi. We therefore store the
intermediate result zi in the B–Registers and feed zi and pi into the ai input of the
units for squaring and multiplication.
In cycle 20+2(m+2) bits 0 . . . 3 of Sm+3 are available at result-out of unit1. Thus
the first modular multiplication takes 2m+22 cycles from feeding b0 . . . b3 on the bus
until the result is available. Further squaring operations take 2m+ 20 cycles. While
the last loop of the modular multiplication is computed, the result of the squaring is
stored in B-Reg in only one cycle. Thus the two cycles for loading B from the bus
are not needed.
For the simulation of the systolic array please refer to Figures C.4, C.5 and C.6
in the appendix.
CHAPTER 7. DESIGN 2: A SPEED EFFICIENT ARCHITECTURE 54
7.4 Modular Exponentiation
7.4.1 Data Flow
Figure 7.3 shows how the array of units is utilized for modular exponentiation. For
simulation results please refer to Figures C.7, C.8 and C.9 in the appendix.
Exp RAM
Prec RAM
DP RAM Z
q_Mux
--
1
B_odd_In
X_In
Statemachine
TDM
DP RAM P
3
Exp_In
16
a_Counter
q_counter z_Mux
44
44
4
--8
44
--4
--4
--4
Units_m+2..0M
_RE
G
M_In
M_even_InM_odd_In
--8
q_i_In
a_i_InControl_In
Result_OutRes_Out
Prec_In
B_even_In
Figure 7.3: Design for a modular exponentiation
At the heart of our design is a finite state machine (FSM) with 17 states. An
idle state, four states for loading the system parameters, and four times three states
for computing the modular exponentiation. The actual modular exponentiation is
executed in four main states, pre-computation1, pre-computation2, computation, and
post-computation. Each of these main states is subdivided in three sub–states, load-B,
multiple-B, and calculate-multiplication. The control word fed into control-in is en-
coded according to the states. The FSM is clocked at half the clock rate. The same
CHAPTER 7. DESIGN 2: A SPEED EFFICIENT ARCHITECTURE 55
is true for loading and reading the RAM and DP RAM elements. This measure
makes sure the maximal propagation time is in the units. Thus the minimal clock
cycle time and the resulting speed of a modular exponentiation relates to the effective
computation time and not to the computation of overhead.
Before a modular exponentiation is computed, the system parameters have to be
loaded. Multiples of the modulus M̃ have to be computed externally and are read
eight bit at a time from I/O in M̃-Reg. Reading starts from low order bits to high
order bits, and from M̃ to 15 · M̃ . From M̃ -Reg the multiples of M̃ are fed 4 bits at
the time alternatively to M̃-even-bus and M̃ -odd-bus. The address for the M̃-RAM
of the units is generated by the q-counter and fed via q-Mux into qi-In. After loading
a full multiple of M̃ , q-counter is incremented. The exponent E is read 16 bits at
the time from I/O and stored into exp-RAM. The first 16 bit wide word from I/O
specifies the length of the exponent in bits. Up to 64 following words contain the
actual exponent. The pre–computation factor 28(m+2) mod M is read from I/O eight
bits at the time. It is stored into prec-RAM.
In state pre-compute1 we read the X value from I/O, 4 bits per clock cycle, and
store it intoDP RAM Z. At the same time the pre-computation factor 28(m+2) mod M
is read from prec-RAM and fed 4 bits per clock cycle alternatively via the B-even-bus
and B-odd-bus to the B–registers of the units. In the next 16 clock cycles, the multi-
ples of B = 28(m+2) mod M are calculated.
Now we begin calculating the pre–computation. The initial values of Algorithm 4.2
are P0 = 1 and Z0 = X. Both values have to be multiplied by the pre–computation
factor. This can be done in parallel as both multiplications use a common operand
28(m+2) modM , that is already stored in B. The time division multiplexing unit
(TDM) reads X from DP RAM Z and multiplexes X and 1, 0 . . . 0 on the ai–bus
into the units. After 2(m + 2) clock cycles the low order bits of the result of
MONT HR(28(m+2) modM , X)= X · 24(m+2) mod M appear at result-out and are
CHAPTER 7. DESIGN 2: A SPEED EFFICIENT ARCHITECTURE 56
stored in DP RAM Z. The low order bits of the result of MONT HR(28(m+2) modM ,
1)= 24(m+2) modM appear at result-out one cycle later and are stored in DP RAM P.
This process repeats for 2(m+ 2) cycles, until all digits of the two results are saved
in DP RAM Z and P. The result X · 24(m+2) modM is also being stored in the B-
registers of the units. For the simulation of this behavior please refer to Figure C.7
in the appendix.
In state pre-compute2 the actual computation of Algorithm 4.2 begins. For both
calculations of Z1 and P1 we use Z0 as an operand. This value is stored in the B-
registers. The second operand Z0 or P0 respectively, is read from DP RAM Z and
P and “pumped” via TDM as ai into the units (Figure C.8 in the appendix). After
another 2(m + 2) clock cycles the low order bits of the result of Z1 and P1 appear
at result-out. Z1 is stored in DP RAM Z. P1 is needed only if the first bit of the
exponent e0 is equal to “1”. Depending on e0, P1 is either stored in DP RAM P or
discarded.
In state compute the loop of Algorithm 4.2 is executed n − 1 times. Zi in
DP RAM Z is updated after every cycle and “pumped” back as ai into the units.
Pi is updated only if the relevant bit of the exponent ei is equal to “1”. In this way
always the last stored Pi is “pumped” back into the units.
After the processing of en−1, the FSM enters state post-compute. Pn = XE ·
24(m+2) modM is stored in DP RAM P now. To eliminate the factor 24(m+2) from
the result Pn, we compute a final Montgomery multiplication MONT HR(Pn, “1”).
First the vector 0, 0, . . . 0, 1 is fed alternatively via the B-even-bus and B-odd-bus into
the B–registers of the units. Pn is “pumped” from DP RAM P as ai into the units.
After state post-compute is executed the result XE modM is ready at the I/O port. 4
bits appear every two clock cycles. State pre-compute1 can be re–entered immediately
now for the calculation of another X value.
A full modular exponentiation is computed in (n+2)(2m+20) clock cycles. That
CHAPTER 7. DESIGN 2: A SPEED EFFICIENT ARCHITECTURE 57
is the delay it takes from inserting the first 4 bits of X into the device, until the first
4 result bits appear at the output. At that point, another X value can enter the
device. With an additional latency of m+2 clock cycles the last 4 bits appear on the
output bus.
7.4.2 Function Blocks
In this subsection we explain the function blocks in Figure 6.3: DP RAM P and
DP RAM Z. The modules EXP RAM and Prec RAM are explained in Section 6.4.2.
Figure 7.4 shows the design of DP RAM Z. Anm×4 bit DP RAM is at the heart
of this unit. It has separate write (A) and read (DPRA) address inputs. Two counters
Write-
Clock
Counter
Clk_En
term_cntQ_Out
ClockClk_En
Q_OutD_In
4-bit REG
ClockClk_En
Q_Out
4-bit REG
D_In
Decoder
Clock
CounterRead-
Clk_EnData_In
WR_CLK
CLKR
S
Clock
S-R-FF
Q_Out
term_cntQ_Out
DPODI
WR_ENDPRAA
4
4
Status
2
2log (m)
m x 4 DP RAM
2log (m)
MUX
2
2
Data_Out
Terminal_Count
sel
Status-
Figure 7.4: DP RAM Z Unit
that count up to m+ 2 compute these addresses. The write-counter starts counting
CHAPTER 7. DESIGN 2: A SPEED EFFICIENT ARCHITECTURE 58
(clock- enable) in sub–states B-load when the first digit of Zi appears at data in. At
the same time the enable signal of the DP RAM is active and data is stored in DP
RAM. When m + 2 is reached, the terminal-count signal of the write-counter resets
the two enable signals. The read-counter is enabled in sub–states compute. The data
of DP RAM is addressed by q out of the read-counter and appears immediately at
DPO. When read-counter reaches m+ 2, terminal-count triggers the FSM to transit
into sub-state B-load. The last two values of zi are stored in a 4–bit register each.
This measure allows us to choose a 100% utilized m× 4–bit DP RAM instead of an
only 50% utilized 2m× 4–bit DP RAM.
DP RAM P works almost the same way. It has an additional input ei, that acti-
vates the write-enable signal of the DP RAM in the case of ei = “1′′.
Chapter 8
Methodology
In our implementation we adopted the following design approach that resulted in fast
verification of gate level netlists as well as back annotated designs:
1. Design entry
2. Logic verification
3. Synthesis
4. Place and Route
5. Timing Verification
The entire design, with the exception of vendor specific soft macros, was entered in
VHDL format. Once the design was developed in VHDL, boolean logic and ma-
jor timing errors were verified by simulating the gate level description with Synop-
sys VHDL analyzer (vhdlan) and VHDL debugger (vhdldbx) version 1998.08. The
next step involved the synthesis of the VHDL code with Synopsys Design Compiler
(fpga analyzer) version 1998.08. The output of this step was an optimized netlist
describing the gate level design in XILINX format. The most time consuming step
was the compilation of the synthesized design with the place and route tools available
59
CHAPTER 8. METHODOLOGY 60
from Xilinx. This process was accomplished with the XILINX Design Manager tools
version M1.5.19. The final step of the design flow was to verify the design once again
but this time with the physical net, CLB, and pad delays introduced when the design
was placed into a specific device. This was accomplished with the same test benches
and simulation models that were used during the logic verification stage. Synopsys
(vhdldbx) was used once again to verify back-annotated designs. The timing results
from Chapter 9 were all computed by the Xilinx timing analyzer and verified by the
Synopsis vhdl debugger. They were not verified with an actual chip.
8.1 Xilinx Synopsys Interface
Figure 8.1 presents a flow chart diagram of the design flow with Xilinx-Synopsys-
Interface (XSI) tools. The XSI tools provide for a transition between results obtained
from Synopsys synthesis and the Xilinx place and route tools. The XSI module
includes all libraries necessary for Synopsys fpga analyzer to interpret gates into
logical blocks so that synthesis can be performed at this level. The design ware
libraries provided by Xilinx are automatically instantiated when possible. Synthesis
results include report files on area and timing utilization, design netlist and constraints
that are used in the place and route process, and Synopsys design files that describe
the entire system.
8.2 Simulation and Verification
Verification of the design is done at two points. First, it is applied to the initial VHDL
design. This verifies only the logic without delays. The input to this verification
process is a test bench written in VHDL and the actual VHDL design. The results
are compared to test vectores generated by Maples.
The post place and route verification uses the same test bench. The VHDL input
CHAPTER 8. METHODOLOGY 61
Synopsys Simulator ver. 1998.08
SynopsysLibs
C Model
user constr
back_anno.vhd back_anno.sdf
ROUTE, TRACE, BACK_ANNOTATE)
(BUILD, MAP, TRACE, PLACE,
XILIX
XILINX libsScript
VHDL Design
Testbench
bitstream
To PROM
design constr. design netlist
Synopsys FPGA Analyzer ver. 1998.08
Figure 8.1: Design flow
model to this stage is different. Here the VHDL model is obtained from the XILINX
place and route tools. This VHDLmodel includes a separate file defining all net, CLB,
and port delays associated with the placed design. Once again, verification process
involves testing all vectors against test vectors generated by Maples. A sample test
bench can be found in Appendix A.
8.3 Synthesis
To synthesize our designs, script files were developed that could be launched from
within the fpga analyzer. These scripts would elaborate, compile, optimize the
design, and prepare report summaries. The output of the synthesis tools is a design
netlist and constraints file. A sample script file is provided in Appendix B.
CHAPTER 8. METHODOLOGY 62
8.4 Place and Route
The input to the place and route tools is a design netlist and constraints file generated
by Synopsys, as well as a possible user constraints file. The user constraints have
higher priority over the Synopsys constraints and may include additional constraints
relaxing the clock period or implementing pin assignment. The output of this process
is a bit-stream file that can be used to directly program the device and the back-
annotated design that can be simulated for timing verification (Figure 8.1).
Chapter 9
Results
9.1 Design 1
We implemented Design 1 for various bit lengths and unit widths. Table 9.1 shows
our results in terms of used CLBs (C), clock cycle time (T) and the time–area product
(TA).
Table 9.1: Design 1: CLB usage, minimal clock cycle time, and time–area product of
modular exponentiation architectures on Xilinx FPGAs
160 bit 256 bit 512 bit
C T TA C T TA C T TA
u [CLBs] [ns] [CLB·µs] [CLBs] [ns] [CLB·µs] [CLBs] [ns] [CLB·µs]
4 951 17.3 16.4 1307 17.5 22.8 2555 17.7 45.2
8 820 19.6 16.0 1122 19.8 22.2 2094 19.1 39.9
16 790 21.1 16.6 1110 21.7 24.0 2001 21.8 43.6
768 bit 1024 bit
C T TA C T TA
u [CLBs] [ns] [CLB·µs] [CLBs] [ns] [CLB·µs]
4 3745 19.1 71.5 4865 19.2 93.4
8 3132 19.4 60.7 4224 23.4 98.8
16 2946 21.6 63.6 3786 23.7 89.7
63
CHAPTER 9. RESULTS 64
The majority of CLBs is used in the units. In Section 6.2 we derived an approx-
imation of 3u + 4 CLBs per unit which proves to be correct for large designs. The
overhead consists mainly of RAM, dual port RAM, shift registers, counters and the
state machine. Counters and their decoding for addressing RAM and dual port RAM
are more costly for larger designs. On the other hand, we used the same state machine
for all designs in Table 9.1.
The clock cycle time T in Table 9.1 is the propagation delay from B-Reg through
mux1 and the carries of the adder to the registered carry, plus the setup time of the
flip-flop. We compare this delay to the optimal cycle time calculated by the Xilinx
timing analyzer; for a 4–bit unit the delay with optimal routing is 10.5 ns (256 and
512 bit designs) and 12.7 ns (768 and 1024 bit designs); for an 8–bit unit 11.2 ns and
13.7 ns and for a 16–bit unit 12.8 ns and 15.5 ns. The larger designs were implemented
in larger FPGA devices featuring different delay specifications. Otherwise we expect
the same cycle times for designs with the same unit size. The additional routing delay
is between 50% and 80% above the optimal propagation delay. For designs up to 768
and 1024 (u = 4) bits it remains approximately constant; it deteriorates for 1024 bit
designs with unit sizes u = 8 and u = 16. The same can be said about the place and
route time: we experienced run–times of a couple of hours on a AMD–K6–2/300 MHz
PC for designs up to 768 and 1024 (u = 4) bits, up to a week for the 1024 (u = 8
and u = 16) bit designs. Different design methods, such as hard–macros for a single
unit, would probably improve routing delay and place and route time.
The time–area product shows that designs with 8–bit units are generally most
efficient.
Table 9.2 shows our results for a full length modular exponentiation. The purpose
of this table is to compare our design to previous propositions. None of the popular
public key schemes as described in Chapter 3 requires computing a modular exponen-
tiation with equal size exponent and modulus. A full modular exponentiation with
CHAPTER 9. RESULTS 65
Table 9.2: Design 1: CLB usage and execution time for a full modular exponentiation
512 bit 768 bit 1024 bit
u C T C T C T
CLBs [ms] CLBs [ms] CLBs [ms]
4 2555 9.38 3745 22.71 4865 40.50
8 2094 10.13 3123 23.06 4224 49.36
16 2001 11.56 2946 25.68 3786 49.99
an n bit exponent and an m bit modulus is computed in 2(n+2)(m+4) clock cycles.
9.2 Design 2
Table 9.3 shows our results of Design 2 in terms of used CLBs (C), clock cycle time
(T) and the time–area product (TA).
Table 9.3: Design 2: CLB usage, minimal clock cycle time, and time–area product of
modular exponentiation architectures on Xilinx FPGAs
160 bit 256 bit 512 bit
C T TA C T TA C T TA
CLBs] [ns] [CLB·µs] [CLBs] [ns] [CLB·µs] [CLBs] [ns] [CLB·µs]
1219 20.8 25.4 1818 21.3 38.7 3413 20.7 70.6
768 bit 1024 bit
C T TA C T TA
CLBs] [ns] [CLB·µs] [CLBs] [ns] [CLB·µs]
5071 20.1 101.9 6633 21.9 145.2
The time–area products of Table 9.3 are between 50% and 70% larger compared
to those of Table 9.1. Design 2, however, is far more efficent than Design 1, as we
gain a speed–up of approximately a factor 4 by computing a modular exponentiation
with 4 times fewer cycles.
CHAPTER 9. RESULTS 66
The majority of CLBs is expended in the units, that is 6 CLBs per bit of the
modulus. The overhead consists mainly of RAM, DP RAM, counters, registers, and
the state machine. Between 300 CLBs for the 160–bit design and 500 CLBs for the
1024–bit design are used for overhead.
The clock cycle time T in Table 9.3 is the access delay qi → D out of the M̃ -RAM
or ai→ D out of the B-RAM plus the delay through the two adders to the registered
carry in S Reg, plus the setup time of the flip-flop (see Figure 7.1). We compare
this delay to the optimal cycle time calculated by the Xilinx timing analyzer; for
the smaller designs (160–512 bits) the delay with optimal routing is 14.7 ns, for the
larger designs 15.7 ns. The larger designs were implemented in larger FPGA devices
featuring different delay specifications. Otherwise we expected the same cycle times
for all designs as the difference between designs lies in the amount of units. The
additional routing delay is about 30% above the optimal propagation delay. This is
considerably better than the additional routing delay of Design 1. The structure with
a RAM block and two adders in series seems to be less of a routing problem than the
register–multiplexer–adder structure in Design 1.
Table 9.4: Design2: CLB usage and execution time for a full modular exponentiation
512 bit 768 bit 1024 bit
design C T C T C T
CLBs [ms] CLBs [ms] CLBs [ms]
2 3413 2.93 5071 6.25 6633 11.95
1 2555 9.38 3745 22.71 4865 40.50
Table 9.4 shows our results for a full length modular exponentiation. A full mod-
ular exponentiation with an n bit exponent and an m digit modulus is computed in
2 ·(n+2)(m+10) clock cycles. For an easy comparison we included the fastest Design
1 in Table 9.4. A speed–up of approximately a factor 3.5 is gained using Design 2.
CHAPTER 9. RESULTS 67
9.3 Application to RSA
Table 9.5 shows our results from the tables above, applied to RSA. The encryption
time is calculated for the F4 exponent, requiring 2 ·19(m+4) clock cycles for Design 1
and 2 ·19(m+10) clock cycles if Design 2 is used. Please note thatM =∑m−1i=0 (2
k)imi
for k = 1 in Design 1 and k = 4 in Design 2.
Table 9.5: Application to RSA: Encryption
512 bit 1024 bit
design u C T C T
CLBs [ms] CLBs [ms]
1 4 2555 0.35 4865 0.75
8 2094 0.37 4224 0.91
16 2001 0.43 3786 0.93
2 3413 0.11 6633 0.22
For decryption we apply the Chinese remainder theorem. We either decryptm bits
with an m/2 bit architecture serially, or with two m/2 bit architectures in parallel.
The first approach uses only half as many resources, the later is twice as fast.
Table 9.6: Application to RSA: Decryption
512 bit 512 bit 1024 bit 1024 bit
2 · 256 serial 2 · 256 parallel 2 · 512 serial 2 · 512 parallel
design u C T C T C T C T
CLBs [ms] CLBs [ms] CLBs [ms] CLBs [ms]
1 4 1307 4.69 2614 2.37 2555 18.78 5110 10.18
8 1122 5.31 2244 2.56 2094 20.26 4188 12.41
16 1110 5.82 2220 2.92 2001 23.12 4002 12.52
2 1818 1.62 3636 0.79 3413 5.87 6826 3.10
CHAPTER 9. RESULTS 68
9.4 Application to Algorithms Based on the Dis-
crete Logarithm Problem
Table 9.7 applies our results to encryption and decryption for algorithms based on
the descrete logarithm problem. As the exponent is only 160 bits long, computation
is about six times faster than for a full 1024 bit modular exponentiation.
Table 9.7: CLB usage and execution time for algorithms based on the DL–problem
512 bit 768 bit 1024 bit
design u C T C T C T
CLBs [ms] CLBs [ms] CLBs [ms]
1 4 2555 2.97 3745 4.77 4865 6.39
8 2094 3.19 3123 4.85 4224 7.79
16 2001 3.64 2946 5.4 3786 7.89
2 3413 0.93 5071 1.31 6633 1.89
9.5 Application to Elliptic Curves
Section 3.3 states that an elliptic curve projective space addition can be performed in
15 operations, and 12 operations are needed for a doubling. As operations we count
multiplications only. Additions, subtractions and shift operations are neglected.
With both our designs we can compute two multiplications in parallel, under the
condition that the same operand B is used in both multiplications. For example
B(C +D) can be computed in parallel. The operation for adding and doubling are
listed in Tables 9.8 and 9.9. As many steps as possible are computed in parallel,
including the pre– and post–computations. In an addition 16 operations are executed
in series, 8 can be done in parrallel.
For a point doubling we need 14 operations in series and 8 can be done in parallel.
In a general point multiplication a series of addings and doublings are executed. Thus
CHAPTER 9. RESULTS 69
Table 9.8: Operations for a point addition
Common Multiplication 1 Multiplication 2 Type of
Operand Operand Result Operand Result Operation
22k(m+2) X1 X ′1 X2 X ′2 pre-computation
22k(m+2) Y1 Y ′1 Y2 Y ′2 pre-computation
22k(m+2) Z1 Z ′1 Z2 Z ′2 pre-computation
Z ′1 Y ′2 Z ′1Y′2 X ′2 Z1X
′2
Z ′2 Y ′1 Z ′2Y′1 X ′1 Z2X
′1
Z ′2 Z ′1 Z ′2Z′1
V ′ V ′ (V 2)′
(V 2)′ T ′ (V 2)′T ′ V ′ (V 3)′
(V 2)′ Z2X′1 (V 2)′Z2X ′1
(V 3)′ Z ′2Z′1 Z ′3 Z ′2Y
′1 (V 3)′Z ′2Y
′1
U ′ U ′ (U2)′
(U2)′ Z ′2Z′1 A′
(V 2)′Z2X ′1 − A U ′ Y ′3
V ′ A′ X ′3
1 X ′3 X3 Y ′3 Y3 post-computation
1 Z ′3 Z3 post-computation
the pre-computations have to be done only once before the first operation, and the
post-computations after the last operation. The total amount of clock cycles for a
normal addition is therefore 11 · 2(m + 4) if Design 1 is used, and 11 · 2(m + 10)
cycles with Design 2. A doubling without pre– and post–computations is executed in
8 · 2(m+ 4) cycles (Design 1), or 8 · 2(m+ 10) cycles (Design 2). We assume that all
values are stored externally. The subtractions, additions and shift operations could
be done internally with almost no additonal resource requirements. As we process
only u bit (Design 1) or 4 bit (Design 2) at a time, addition or subtraction could be
done serially. In the doubling operation we have multiplications by eight. Thus the
CHAPTER 9. RESULTS 70
Table 9.9: Operations for a Doubling
Common Multiplication 1 Multiplication 2 Type of
Operand Operand Result Operand Result Operation
22k(m+2) X1 X ′1 X2 X ′2 pre-computation
22k(m+2) Y1 Y ′1 Y2 Y ′2 pre-computation
22k(m+2) Z1 Z ′1 Z2 Z ′2 pre-computation
22k(m+2) a a′ Pre-computation
Z ′1 Y ′1 S ′ Z ′1 (Z21)′
S ′ S ′ (S2)′ Y ′1 E′
X ′1 X ′1 (X21 )′ E′ F ′
(Z21)′ a′ W ′
W ′ W ′ (W 2)′
E′ E′ (E2)′
4F ′ −H W ′ Y ′3
S ′ (S2)′ (S3)′ 2H ′ X ′3
1 X ′3 X3 Y ′3 Y3 post-computation
1 Z ′3 Z3 post-computation
operands are getting larger than specified in Chapter 4 (A,B > 2M). Therefor an
at least 4 bit larger architecture has to be chosen. The resulting additional resource
requirements and clock cycles however can be neglected.
Table 9.10 shows the total execution time for the basic operations in elliptic curve
cryptosystems. Table 9.11 shows the average time needed for a general point mul-
tiplication in an elliptic curve cryptosystems. The modulus and the operands are
160 bits long. Thus 160 doublings and an average of 80 additions are computed. It
should be noted that advanced point multiplication algorithms (see, e.g., [37]) can
reduce the number of point additions considerably at the cost of additional memory
for pre–computed points.
CHAPTER 9. RESULTS 71
Table 9.10: Application to Elliptic Curves: Execution time for point addition and
doubling (160 bits)
Addition Doubling
design u T T
[µs] [µs]
1 4 62 39
8 71 45
16 76 48
2 23 15
Table 9.11: Application to Elliptic Curves: Execution time for a general point mul-
tiplication (160 bits)
design u T [ms]
1 4 11.3
8 12.8
16 13.8
2 4.2
Chapter 10
Comparison and Outlook
We compare our fastest RSA 512/1024 bit designs of Table 9.6 to the fastest soft-
and hardware solutions we found in the literature [33, 26, 37]. Our 0.8 ms decryption
time is about 11 times faster than the 512 bit software implementation (9.1 ms) on
a 150MHz Alpha [26]. The fastest 1024 bit software implementation [37] of 43.3 ms
running on a PPro–200 based PC is about 14 times slower than our best result (3.1
ms).
Most reported hardware implementations of modular arithmetic are somewhat
dated, making a fair comparison difficult. It is nevertheless interesting to look at
previously reported performances. The fastest reported FPGA design [33] (1.7 ms
for a 512 bit modulus and 5.2 ms for a 970 bit modulus) is a factor 2.1/1.8 slower
than ours (2.8 ms for a 970 bit modulus). It is possible, however, that their solution
upgraded to currently available FPGA technology, would reach similar speeds. A
drawback of the solution in [33] is, however, that the binary representation of the
modulus is hardwired into the logic representation so that the architecture has to
be reconfigured with every new modulus. The user of such an implementation needs
to own the full development tools for synthesis, placing and routing of FPGAs, if
RSA with different moduli should be executed. Our design stores the modulus, the
exponent and the pre–computation factor in registers and RAM. A second advantage
72
CHAPTER 10. COMPARISON AND OUTLOOK 73
of our design is that it is implemented into one device instead of a matrix of 16
devices. Using currently available FPGA technology the design [33] would probably
also fit in a single device.
The fastest ASIC solution was presented in [16]. A 1024–bit modular exponentia-
tion with a 1024–bit exponent is performed in an average of 10 ms (50% bits equal to
“1”) without applying the Chinese remainder theorem. Our Design 2 processes the
same computation in 11.9 ms, where average and worst case computation times are
the same.
To improve our design in terms of speed the following conclusions can be drawn.
1. Choice of radix r = 2k: We believe that the radix r = 24 = 16 chosen for
Design 2 is the optimal choice for optimizing the time–area product in a Xilinx
XC4000 FPGA implementation. Smaller radixes are simpler to implement at
the drawback of more clock cycles. Larger radixes result in very large resource
requirements for the computation of the 2k − 1 multiples of the operand B.
2. To fully utilize each clock cycle we compute squaring operations and multipli-
cations in parallel. Fully utilize is not quite correct, though, as we have to
compute only an average of n/2 multiplications. We could speed-up our design
by a factor 1.33 on average if we use Algorithm 4.1 and compute two modular
exponentiations in parallel. The worst case timing improvement is zero how-
ever. The drawback of this approach is that two operands B and their multiples
have to be stored. The units would need two 32× 4 bit RAM blocks instead of
two 16 × 4 bit RAM blocks and two additional registers. Thus the additional
resource requirement is one CLB per computed bit of the modulus.
3. Exponent: In Section 4.1 the l–ary method was discussed. This method is
particularly useful if we choose an approach as discussed in the last paragraph.
For l = 2 we have to store only two additional values X2 and X3. The worst
CHAPTER 10. COMPARISON AND OUTLOOK 74
case execution time is improved by a factor 2, the average time by a factor 1.5.
4. Implementation of our designs in devices from other FPGA vendors. We might
be able to run the designs at faster clock frequencies using different devices.
An architecture as proposed in (2) and (3) would barely fit into the largest avail-
able device of the Xilinx XC4000 family. The combined resulting speed-up is a factor
two compared to the tables in Chapter 9.
Appendix A
Test Bench Sample
-- VHDL Architecture TB_example_hr.vhd
-- Tests the functionality of the monXb_r4.vhd designs (X=160,256,512,768,1024).
-- Test_vectors: M = ’1...1’, 2M = ’2...2’, 3M = ’3...3’ ... 15M = ’F...F’
-- E = 1101101010101010101 (MSB-LSB)
-- Pre-computation factor = ’1’,’2’,’2’,’3’,’3’... (digit_0, digit_1 ...)
-- X = ’2’,’3’,’4’ ... (digit_0, digit_1 ...)
--
--Created:
-- by - Thomas Blum
-- at - 02/03/99
Library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.std_logic_unsigned.all;
use IEEE.std_logic_textio.all;
use work.array_types.all;
entity E is
GENERIC(
width : positive := w_idth; -- unit width
slices : positive := s_lices; -- number of units
length : positive := l_ength -- length of modulus
);
end E;
ARCHITECTURE A OF E IS
-- Architecture declarations
75
APPENDIX A. TEST BENCH SAMPLE 76
CONSTANT clk_prd_f : time := 20 ns;
CONSTANT clk_prd_s : time := 40 ns;
CONSTANT del_prd : time := 5 ns;
SIGNAL modulus_in : std_logic_vector(7 DOWNTO 0);
SIGNAL ttn : std_logic_vector(7 DOWNTO 0);
SIGNAL data_in : std_logic_vector(3 DOWNTO 0);
SIGNAL CLOCK_F : std_logic;
SIGNAL CLOCK_S : std_logic;
SIGNAL reset : std_logic;
SIGNAL exponent_in : std_logic_vector(15 DOWNTO 0);
SIGNAL encrypt_in : std_logic;
SIGNAL load_mod_in : std_logic;
SIGNAL load_exp_in : std_logic;
SIGNAL load_ttn_in : std_logic;
SIGNAL result_out : std_logic_vector(3 DOWNTO 0);
SIGNAL zero_out : std_logic_vector(3 DOWNTO 0);
--internal clock signal
SIGNAL iclk_f : std_logic;
SIGNAL iclk_s : std_logic;
--clock procedure
PROCEDURE wait_clock(CONSTANT clk_ticks:integer) IS
VARIABLE i : integer := 0;
BEGIN
FOR i IN 1 TO clk_ticks*2 LOOP
WAIT UNTIL iclk_s’EVENT;
END LOOP;
END wait_clock;
PROCEDURE wait_clock_f(CONSTANT clk_ticks:integer) IS
VARIABLE i : integer := 0;
BEGIN
FOR i IN 1 TO clk_ticks LOOP
WAIT UNTIL iclk_s’EVENT;
END LOOP;
END wait_clock_f;
--test component declaration
component mon160b_r4
PORT(
modulus_in : IN std_logic_vector(7 DOWNTO 0);
ttn : IN std_logic_vector(7 DOWNTO 0);
data_in : IN std_logic_vector(3 DOWNTO 0);
APPENDIX A. TEST BENCH SAMPLE 77
CLOCK_F : IN std_logic;
CLOCK_S : IN std_logic;
reset : IN std_logic;
exponent_in : IN std_logic_vector(15 DOWNTO 0);
encrypt_in : IN std_logic;
load_mod_in : IN std_logic;
load_exp_in : IN std_logic;
load_ttn_in : IN std_logic;
result_out : OUT std_logic_vector(3 DOWNTO 0);
zero_out : OUT std_logic_vector(3 DOWNTO 0)
);
END COMPONENT;
BEGIN
--test component instantiation
UUT: mon160b_r4
PORT MAP(
modulus_in => modulus_in,
ttn => ttn,
data_in => data_in,
CLOCK_F => CLOCK_F,
CLOCK_S => CLOCK_S,
reset => reset,
exponent_in => exponent_in,
encrypt_in => encrypt_in,
load_mod_in => load_mod_in,
load_exp_in => load_exp_in,
load_ttn_in => load_ttn_in,
result_out => result_out,
zero_out => zero_out);
--testbench procedure
flow_process: PROCESS
-- Process declarations
variable i : integer := 0;
BEGIN
--***************************************************
-- initialize signals and reset: ’0’ -> ’1’ -> ’0’
--***************************************************
APPENDIX A. TEST BENCH SAMPLE 78
reset <= ’0’;
data_in <= "0000";
ttn <= "00000000";
exponent_in <= "0000000000000000";
load_mod_in <= ’0’;
load_exp_in <= ’0’;
load_ttn_in <= ’0’;
encrypt_in <= ’0’;
modulus_in <= "00000000";
load_mod_in <= ’0’;
wait_clock_f(1);
wait for del_prd;
reset <= ’1’;
wait_clock_f(1);
wait for del_prd;
reset <= ’0’;
wait_clock_f(1);
wait for del_prd;
--***************************************************
-- load modulus: This is painful but has to be done
-- only once for a given system
--***************************************************
load_mod_in <= ’1’;
wait_clock(1);
wait for del_prd;
modulus_in <= "00010001"; -- 1M (units 0 ... m)
load_mod_in <= ’0’;
wait_clock(length/(2*width));
wait for del_prd;
load_mod_in <= ’1’;
wait_clock(1);
wait for del_prd;
modulus_in <= "00100010"; -- 2M (units 0 ... m)
load_mod_in <= ’0’;
wait_clock(length/(2*width));
wait for del_prd;
load_mod_in <= ’1’;
wait_clock(1);
wait for del_prd;
modulus_in <= "00110011"; -- 3M (units 0 ... m)
load_mod_in <= ’0’;
wait_clock(length/(2*width));
wait for del_prd;
APPENDIX A. TEST BENCH SAMPLE 79
load_mod_in <= ’1’;
wait_clock(1);
wait for del_prd;
modulus_in <= "01000100"; -- 4M (units 0 ... m)
load_mod_in <= ’0’;
wait_clock(length/(2*width));
wait for del_prd;
load_mod_in <= ’1’;
wait_clock(1);
wait for del_prd;
modulus_in <= "01010101"; -- 5M (units 0 ... m)
load_mod_in <= ’0’;
wait_clock(length/(2*width));
wait for del_prd;
load_mod_in <= ’1’;
wait_clock(1);
wait for del_prd;
modulus_in <= "01100110"; -- 6M (units 0 ... m)
load_mod_in <= ’0’;
wait_clock(length/(2*width));
wait for del_prd;
load_mod_in <= ’1’;
wait_clock(1);
wait for del_prd;
modulus_in <= "01110111"; -- 7M (units 0 ... m)
load_mod_in <= ’0’;
wait_clock(length/(2*width));
wait for del_prd;
load_mod_in <= ’1’;
wait_clock(1);
wait for del_prd;
modulus_in <= "10001000"; -- 8M (units 0 ... m)
load_mod_in <= ’0’;
wait_clock(length/(2*width));
wait for del_prd;
load_mod_in <= ’1’;
wait_clock(1);
wait for del_prd;
modulus_in <= "10011001"; -- 9M (units 0 ... m)
load_mod_in <= ’0’;
wait_clock(length/(2*width));
wait for del_prd;
load_mod_in <= ’1’;
APPENDIX A. TEST BENCH SAMPLE 80
wait_clock(1);
wait for del_prd;
modulus_in <= "10101010"; -- 10M (units 0 ... m)
load_mod_in <= ’0’;
wait_clock(length/(2*width));
wait for del_prd;
load_mod_in <= ’1’;
wait_clock(1);
wait for del_prd;
modulus_in <= "10111011"; -- 11M (units 0 ... m)
load_mod_in <= ’0’;
wait_clock(length/(2*width));
wait for del_prd;
load_mod_in <= ’1’;
wait_clock(1);
wait for del_prd;
modulus_in <= "11001100"; -- 12M (units 0 ... m)
load_mod_in <= ’0’;
wait_clock(length/(2*width));
wait for del_prd;
load_mod_in <= ’1’;
wait_clock(1);
wait for del_prd;
modulus_in <= "11011101"; -- 13M (units 0 ... m)
load_mod_in <= ’0’;
wait_clock(length/(2*width));
wait for del_prd;
load_mod_in <= ’1’;
wait_clock(1);
wait for del_prd;
modulus_in <= "11101110"; -- 14M (units 0 ... m)
load_mod_in <= ’0’;
wait_clock(length/(2*width));
wait for del_prd;
load_mod_in <= ’1’;
wait_clock(1);
wait for del_prd;
modulus_in <= "11111111"; -- 15M (units 0 ... m)
load_mod_in <= ’0’;
wait_clock(length/(2*width));
wait for del_prd;
wait_clock(1);
wait for del_prd;
modulus_in <= "00000000"; --end load modulus
wait_clock(1);
APPENDIX A. TEST BENCH SAMPLE 81
wait for del_prd;
--***************************************************
-- load exponent
--***************************************************
load_exp_in <= ’1’; --prepare load exponent: load_input->hi
wait_clock(1);
wait for del_prd;
exponent_in <= "0000000000010011"; --counter_value / nuber of bits in exponent
wait_clock(1);
wait for del_prd;
exponent_in <= "1101010101010101"; --16 bit exponent
wait_clock(1);
wait for del_prd;
exponent_in <= "0000000000000110"; --16 bit exponent
load_exp_in <= ’0’; --load_input->low
wait_clock(1);
wait for del_prd;
exponent_in <= "0000000000000000"; --16 bit exponent
wait_clock(1);
wait for del_prd;
--***************************************************
-- load ttn
--***************************************************
load_ttn_in <= ’1’;
wait_clock(1);
wait for del_prd;
ttn <= "00100001"; -- ls-digits of precomputation factor: ’1’, ’1’
wait_clock(1);
wait for del_prd;
for i in 1 to (length/(2*width)) loop
ttn <= ttn + "00010001"; -- add ’1’, ’1’ to next digit
wait_clock(1);
wait for del_prd;
end loop;
load_ttn_in <= ’0’;
wait_clock(1);
wait for del_prd;
ttn <= "00000000";
wait_clock(1);
wait for del_prd;
APPENDIX A. TEST BENCH SAMPLE 82
--***************************************************
-- start encryption
--***************************************************
encrypt_in <= ’1’;
wait_clock(2);
wait for del_prd;
data_in <= "0010"; -- ls-digit of X = ’2’
wait_clock(1);
wait for del_prd;
for i in 0 to (length/(width)) loop
data_in <= data_in + "0001"; - add ’1’ to next digit of X
wait_clock(1);
wait for del_prd;
end loop;
data_in <= "0000";
wait_clock(100000);
END PROCESS flow_process;
clock_gen_f : PROCESS
BEGIN
iclk_f <= ’1’;
WAIT FOR clk_prd_f/2;
iclk_f <= ’0’;
WAIT FOR clk_prd_f/2;
END PROCESS clock_gen_f;
CLOCK_f <= iclk_f;
clock_gen_s : PROCESS
BEGIN
iclk_s <= ’1’;
WAIT FOR clk_prd_s/2;
iclk_s <= ’0’;
WAIT FOR clk_prd_s/2;
END PROCESS clock_gen_s;
CLOCK_S <= iclk_s;
END A;
--architecture configuration
configuration CFG_TB_mont_BEHAVIORAL of E is
for A
end for;
end CFG_TB_mont_BEHAVIORAL;
Appendix B
Synosys Script
/* Sample Script for Synopsys to Xilinx Using */
/* FPGA Compiler targeting an XC4000XL device */
/* Set the name of the design"s top-level */
TOP = mon160b_r4
F1 = package_r4
F2 = module_pack
F3 = unit4b_r4ram
F4 = f_unit4b_r4ram
F5 = e_unit4b_r4ram
F6 = exp_ram_64x16s
F7 = mu_block_ram_40x4dp
F8 = sq_block_ram_40x4dp
F9 = state_mach
F10 = ttn_ram_20x8s
/* Set the name of the design"s LOGIBLOX */
F11 = reg_2b
F12 = reg_3b
F13 = reg_4b
F14 = reg_5b
F15 = reg_6b
F16 = reg_8b
F17 = reg_10b
F18 = reg_16b
F19 = shift16bit
F20 = clk_div_a
F21 = count_4b_15m
F22 = count_5b_20m
F23 = count_6b_u
F24 = count_6b_42m
F25 = count_10b_u
83
APPENDIX B. SYNOSYS SCRIPT 84
F26 = ram4x4
F27 = ram5x4
F28 = ram32x8s
F29 = ram64x16s
F30 = ram64x4dp
designer = "Thomas Blum"
company = "WPI Crypto Group"
part = "4085XLBG432-09"
/* Read the LOGIBLOX. */
read -format edif "WORK/" + F11 + ".edn"
read -format edif "WORK/" + F12 + ".edn"
read -format edif "WORK/" + F13 + ".edn"
read -format edif "WORK/" + F14 + ".edn"
read -format edif "WORK/" + F15 + ".edn"
read -format edif "WORK/" + F16 + ".edn"
read -format edif "WORK/" + F17 + ".edn"
read -format edif "WORK/" + F18 + ".edn"
read -format edif "WORK/" + F19 + ".edn"
read -format edif "WORK/" + F20 + ".edn"
read -format edif "WORK/" + F21 + ".edn"
read -format edif "WORK/" + F22 + ".edn"
read -format edif "WORK/" + F23 + ".edn"
read -format edif "WORK/" + F24 + ".edn"
read -format edif "WORK/" + F25 + ".edn"
read -format edif "WORK/" + F26 + ".edn"
read -format edif "WORK/" + F27 + ".edn"
read -format edif "WORK/" + F28 + ".edn"
read -format edif "WORK/" + F29 + ".edn"
read -format edif "WORK/" + F30 + ".edn"
/* Analyze and Elaborate the design file. */
analyze -format vhdl "sim_rtl/" + F1 + ".vhd"
analyze -format vhdl "sim_rtl/" + F2 + ".vhd"
analyze -format vhdl "sim_rtl/" + F3 + ".vhd"
analyze -format vhdl "sim_rtl/" + F4 + ".vhd"
analyze -format vhdl "sim_rtl/" + F5 + ".vhd"
analyze -format vhdl "sim_rtl/" + F6 + ".vhd"
analyze -format vhdl "sim_rtl/" + F7 + ".vhd"
analyze -format vhdl "sim_rtl/" + F8 + ".vhd"
analyze -format vhdl "sim_rtl/" + F9 + ".vhd"
analyze -format vhdl "sim_rtl/" + F10 + ".vhd"
analyze -format vhdl "sim_rtl/" + TOP + ".vhd"
elaborate TOP
APPENDIX B. SYNOSYS SCRIPT 85
/*Set the current design to unit4b_r4ram level. */
current_design F3
/* Don’t touch the logiblox*/
set_dont_touch ("b_ram_inst")
set_dont_touch ("m_ram_inst")
set_dont_touch ("a_i_reg")
set_dont_touch ("q_i_reg")
set_dont_touch ("res_reg_inst")
set_dont_touch ("control_reg")
set_dont_touch ("b_in_reg")
set_dont_touch ("b_mult_reg")
set_dont_touch ("result_reg")
/*Set the current design to the f_unit4b_r4ram level. */
current_design F4
/* Don’t touch the logiblox*/
set_dont_touch ("m_ram_inst")
set_dont_touch ("a_i_reg")
set_dont_touch ("q_i_reg")
set_dont_touch ("res_reg_inst")
set_dont_touch ("control_reg")
set_dont_touch ("result_reg")
/*Set the current design to the e_unit4b_r4ram level. */
current_design F5
/* Don’t touch the logiblox*/
set_dont_touch ("b_ram_inst")
set_dont_touch ("res_reg_inst")
set_dont_touch ("res_reg_2_inst")
set_dont_touch ("control_reg")
set_dont_touch ("b_in_reg")
set_dont_touch ("b_mult_reg")
set_dont_touch ("result_reg")
/*Set the current design to the exp_ram_64x16s level. */
current_design F6
/* Don’t touch the logiblox*/
APPENDIX B. SYNOSYS SCRIPT 86
set_dont_touch ("ram_inst")
set_dont_touch ("cnt_ram_wr")
set_dont_touch ("cnt_ram_rd")
set_dont_touch ("count_val_reg")
set_dont_touch ("shift_reg")
/*Set the current design to the mu_block_ram_40x4dp level. */
current_design F7
/* Don’t touch the logiblox*/
set_dont_touch ("dp_ram_y")
set_dont_touch ("count_wrt")
set_dont_touch ("count_rd")
set_dont_touch ("data_in_reg_f")
set_dont_touch ("data_in_reg_s")
set_dont_touch ("data_r")
/*Set the current design to the sq_block_ram_40x4dp level. */
current_design F8
/* Don’t touch the logiblox*/
set_dont_touch ("dp_ram_y")
set_dont_touch ("count_wrt")
set_dont_touch ("count_rd")
set_dont_touch ("data_in_reg")
set_dont_touch ("data_r")
/*Set the current design to the state_mach level. */
current_design F9
/* Don’t touch the logiblox*/
set_dont_touch ("cnt_a")
/*Set the current design to the ttn_ram_20x8s level. */
current_design F10
/* Don’t touch the logiblox*/
set_dont_touch ("ram_inst")
set_dont_touch ("cnt_ram")
set_dont_touch ("output_reg_even")
set_dont_touch ("output_reg_odd")
APPENDIX B. SYNOSYS SCRIPT 87
/*Set the current design to the top level. */
current_design TOP
/* Don’t touch the logiblox*/
set_dont_touch ("mod_even")
set_dont_touch ("mod_odd")
set_dont_touch ("ttn_reg")
set_dont_touch ("reg_exp")
set_dont_touch ("clk_divider")
remove_constraint -all
/* Uniquify the design and reset the schematic */
uniquify
create_schematic -size infinite -gen_database
/* include timming and area constraints */
remove_constraint -all
remove_clock -all
create_clock -period 20 -waveform {0 10} CLOCK_F
create_clock -period 40 -waveform {20 40} CLOCK_S
group_path -critical_range 10000 -default
set_input_delay 0 -clock CLOCK_F { all_inputs()}
set_output_delay 0 -clock CLOCK_F { all_outputs()}
set_input_delay 0 -clock CLOCK_S { all_inputs()}
set_output_delay 0 -clock CLOCK_S { all_outputs()}
set_operating_conditions WCCOM
/* Indicate which ports are pads. */
set_port_is_pad "*"
set_pad_type -no_clock all_inputs()
set_pad_type -clock CLOCK_F
set_pad_type -clock CLOCK_S
set_pad_type -slewrate LOW all_outputs()
insert_pads
/* link */
link
/* Synthesize the design.*/
APPENDIX B. SYNOSYS SCRIPT 88
compile -boundary_optimization -map_effort high
/* Write the design report files. */
report_fpga > "reports/" + TOP + ".fpga"
report_timing > "reports/" + TOP + ".timing"
report_constraint -verbose > "reports/" + TOP + ".cnst"
/* Write out an intermediate DB file to save state */
write -format db -hierarchy -output "db/" + TOP + "_compiled.db"
/* Replace CLBs and IOBs primitives (XC4000E/EX/XL only) */
replace_fpga
/* Set the part type for the output netlist. */
set_attribute TOP "part" -type string part
/* Write out the intermediate DB file to save state*/
write -format db -hierarchy -output "db/" + TOP + ".db"
/* Write out the timing constraints */
ungroup -all -flatten
write_script > "dc/" + TOP + ".dc"
/* Save design in XNF format as <design>.sxnf */
write -format xnf -hierarchy -output "sxnf/" + TOP + ".sxnf"
/* XILINX primitive to convert Synopsys design constraints to Xilinx format*/
sh /usr/local/xilinx/bin/sol/dc2ncf "dc/" + TOP + ".dc"
Appendix C
Simulation Results
In this chapter the simulation results are shown of Design 2, with a modulus of 160
bits. For a more detailed description of the data flow, please refer to Sections 7.2, 7.3,
and 7.4.
C.1 Processing Elements
The following sequence of three figures shows the pre place-and-route simulation
results for processing element 3. Figure C.1 shows the loading of the pre–computation
factor into B and the calculation of its multiples. Figure C.2 shows the first cycles
of the two modular multiplications. In Figure C.3 finally, the last cycles of these
operations are shown, the storing of the first multiplication result in B-reg, and the
calculation of the multiples of B.
The signals shown in the simulation sequence are as follows:
Figure: C.1
CLOCK F ⇒ Fast clock signal, system clock.
CLOCK S ⇒ Slow clock signal(CLOCK F/2).
control ⇒ Current state encoded.
b in ⇒ Operand B.
control in ⇒ control input for unit3.
89
APPENDIX C. SIMULATION RESULTS 90
b ⇒ Input B registered.
a i ⇒ Operand A.
write ram b ⇒ Write enable signal for B-RAM.
b m reg ⇒ Data input of B-RAM (b p multregistered).
b p mult ⇒ Temporarily result of B-multiplication.
c b in ⇒ Carry input B-multiplication.
c b out ⇒ Carry output B-multiplication.
Figure: C.2
carry in ⇒ Carry input from unit to the right.
q in ⇒ Quotient Q.
res in ⇒ Result S of an iteration from unit to the left.
modulus ⇒ Multiple of M according to qi (1M = 1).
a t b ⇒ Multiple of B according to ai (1B = 2).
mod p b ⇒ modulus plus a t b plus carry 0.
mod p b p s ⇒ mod p b plus res in plus carry 1.
carry out ⇒ Bits 4 and 5 of the result mod p b p s registered.
res out ⇒ Bits 3 to 0 of the result mod p b p s registered.
res reg ⇒ Bits 5 to 0 of the result mod p b p s registered.
Figure: C.3
result in ⇒ Result of a modular multiplication from unit to the left.
result out ⇒ Result of a modular multiplication mod p b p s or result in
to unit to the right .
load b ⇒ Write enable signal for B-reg.
b ⇒ Result of squaring stored in B-reg.
APPENDIX
C.SIMULATIONRESULTS
91
CLOCK_F
CLOCK_S
curr_state
control
b_in
control_in
a_i
b
write_ram_b
b_m_reg
b_p_bmult
c_b_in
c_b_out
idle
0
1
0
0
0
0
2
2
2
1
4
4
2
3
6
6
3
8
8
4
4
10
10
5
12
12
6
5
14
14
7
17
17
6
8
3
3
9
5
5
A
7
7
7
pre_c1_ld_b
B
4
9
9
C
8
11
11
D
13
13
E
9
15
4
0
F
pre_c1_mult
3
A
3
2
2
15
17
pre_c1_calc
0
B
0
1
0
0
0
us17.450 17.500 17.550 17.600 17.650 17.700 17.750 17 .800 17 .850 17 .900 17.950
Figure C.1: Processing Element: Loading of the pre-computation factor and calculation of its multiples
APPENDIX C. SIMULATION RESULTS 92
CLOCK_F
CLOCK_S
curr_state
control
control_in
a_i
carry_in
q_in
res_in
modulus
a_t_b
mod_p_b
mod_p_b_p_s
carry_out
res_out
res_reg
F
0
01
01
00
1
0
0
2
04
1
4
04
04
0
4
0
1
02
0
2
02
02
2
2
6
3
0E
2
6
08
0E
1
0
3
0
04
1
0
01
04
9
2
E
0
1
4
14
9
8
12
14
3
0
4
1
4
0
07
3
0
03
07
A
1
7
0
A
5
1F
A
A
14
1F
6
F
1
8
0
0E
6
0
06
0E
0
0
E
7
6
14
0
C
0C
14
D
1
4
0
E
0
1C
D
0
0D
1C
F
2
C
1
1
7
20
F
E
1E
20
B
1
0
2
D
0
19
B
0
0B
3
8
0
2
0
2
1
03
03
1
9
19
pre_c1_calc
0
0
0
1
8
9
8
0
0
1
0
3
0
us18.000 18.050 18.100 18.150 18 .200 18.250
Figure C.2: Processing Element: Computation of two modular multiplications (firstcycles)
APPENDIX C. SIMULATION RESULTS 93
CLOCK_F
CLOCK_S
curr_state
control
control_in
a_i
carry_in
q_in
res_in
modulus
a_t_b
mod_p_b
mod_p_b_p_s
carry_out
res_out
res_reg
result_in
result_out
load_b
b
c_b_in
b_m_reg
write_ram_b
c_b_out
16
2
6
1
B
0
0D
0
2
02
0D
9
0
D
0
7
18
9
10
18
3
1
8
9
0D
3
03
0D
0
0
D
3
03
0
0
00
03
0
3
6
B
0F
7
08
3
00
0
0
9
0
01
0A
0F
9
03
F
2
F
1
03
04
07
9
1
4
1
2
05
05
00
0A
B
1
6
0
3
07
07
0E
B
6
8
0
4
09
09
00
11
B
A
6
0
5
0B
0B
05
B
C
D
0
6
0D
0D
00
08
9
E
D
0
7
0F
0F
0C
9
1
B
0
8
02
02
00
0F
9
3
B
0
9
04
04
13
9
5
8
A
06
06
00
06
4
7
8
0
B
08
6
08
0A
4
9
2
0
C
0A
0A
00
0D
D
B
2
0
D
0C
0C
11
D
8
D
5
0
E
0E
0E
8
6
0
00
F
5
0
F
F
10
10
00
F
1
9
0
7
1
8
08
08
08
9
1
0
1
8
0
9
0F
0
F
0F
pre_c2_mult
0
3
1
3
04
3
8
0
1
0
1
C
0D
0D
1
F
0F
pre_c2_calc
0
0
0
0
00
Figure C.3: Processing Element: Computation of two modular multiplications (lastcycles).
APPENDIX C. SIMULATION RESULTS 94
C.2 Systolic Array
The following sequence of three figures shows the pre place-and-route simulation
results for the systolic array. Figure C.4 shows the loading of the pre–computation
factor into B, the calculation of its multiples, and the first cycles of the two modular
multiplications. Figure C.5 shows the last cycles of these operations in units 0,1
and 2, the storing of the first multiplication result in B-reg, and calculation of the
multiples of B. In Figure C.6 finally, the end of the modular multiplications in the
units 41,42 and 43 is shown.
The signals shown in the simulation sequence are as follows:
Figure: C.4
CLOCK F ⇒ Fast clock signal, system clock.
CLOCK S ⇒ Slow clock signal(CLOCK F/2).
b even ⇒ bus B to even numbered units.
b odd ⇒ bus B to odd numbered units.
control in ⇒ Inputs control of units 0 to 2 and 42,43.
a in ⇒ Inputs operand A of units 0 to 2 and 42,43.
res out ⇒ Result of unit1 (reused as qi in next iteration).
q in ⇒ Inputs quotient Q of units 0 to 2 and 42,43.
Figures: C.5 and C.6
result out ⇒ Modular multiplication results of units 0 to 3 and 41,42,43
pumped trough the systolic array.
APPENDIX
C.SIMULATIONRESULTS
95
CLOCK_F
CLOCK_S
curr_state
b_even
b_odd
control_in
control_in
control_in
control_in
control_in
a_in
a_in
a_in
a_in
a_in
res_out
q_in
q_in
q_in
q_in
q_in
idle
0
0
0
0
0
0
0
1
1
1
0
2
0
2
2
1
2
3
3
2
3
0
4
4
3
3
5
5
4
4
0
6
6
5
4
7
7
6
5
0
8
8
7
5
9
9
8
6
0
A
A
9
6
B
B
A
7
0
C
C
B
7
D
D
C
8
0
4
E
0
0
E
D
8
F
4
F
0
E
9
0
2
1
4
1
2
F
0
9
0
1
0
1
1
2
A
2
2
3
2
3
0
1
1
A
1
0
1
0
2
3
B
9
9
4
9
4
1
0
3
B
3
0
3
0
9
4
C
A
A
5
A
5
3
0
6
C
6
0
6
0
A
5
D
0
0
6
0
6
6
0
D
D
D
0
D
0
0
6
E
F
F
7
F
7
D
0
E
B
B
0
B
0
F
7
F
2
2
8
2
8
B
0
8
F
8
0
8
0
2
8
0
D
D
9
D
9
8
0
1
0
1
0
1
0
D
9
2
9
9
A
9
A
1
0
3
1
3
0
3
0
9
A
3
7
7
B
7
B
3
0
6
2
6
0
6
0
7
B
4
9
9
C
9
C
6
0
D
3
D
0
D
0
9
C
5
5
5
D
5
D
D
0
B
4
B
0
0
B
0
5
D
6
4
4
E
0
4
E
B
0
0
8
8
0
8
0
4
E
1
0
A
A
A
8
0
2
1
1
1
1
A
3
2
0
0
F
0
F
1
4
3
3
3
0
3
0
0
F
5
4
4
4
1
4
1
3
0
6
5
6
6
0
6
0
4
1
7
6
5
5
2
5
2
6
0
8
7
D
D
0
D
0
5
2
9
8
0
0
3
0
3
D
0
A
9
B
B
0
B
0
0
3
B
A
2
2
4
2
4
B
0
C
B
8
8
0
8
0
2
4
D
C
F
F
5
F
5
8
0
0
E
D
1
1
0
4
1
0
F
5
F
0
E
6
6
6
4
6
6
1
0
1
2
F
3
3
0
3
0
6
6
0
1
1
2
F
F
7
7
F
3
0
2
3
0
1
6
0
6
pre_c1_mult
7
5
3
3
3
3
3
pre_c1_calc
2
1
0
0
0
0
0
8
0
7
0
3
F
6
F
1
2
F
us17.400 17 .500 17.600 17 .700 17.800 17 .900 18 .000 18.100 18 .200 18.300 18 .400 18.500 18 .600 18.700 18 .800 18.900
Figure C.4: Systolic Array: Beginning of the pre-computation
APPENDIX
C.SIMULATIONRESULTS
96
CLOCK_F
CLOCK_S
curr_state
b_even
b_odd
control_in
control_in
control_in
control_in
control_in
a_in
a_in
a_in
a_in
a_in
q_in
q_in
q_in
q_in
q_in
result_out
result_out
result_out
result_out
result_out
result_out
result_out
8
0
8
C
1
9
8
B
B
8
B
A
C
1
0
C
C
B
3
7
B
A
0
1
0
C
2
F
3
7
0
0
E
0
0
1
8
2
2
F
1
1
9
0
0
4
9
8
2
2
6
2
0
9
1
2
F
6
9
3
3
2
6
2
1
2
5
3
F
6
4
4
2
6
3
D
1
E
A
5
3
5
5
C
2
4
D
1
2
2
E
A
6
6
2
C
5
3
D
8
2
2
7
7
C
E
6
3
D
C
8
8
8
C
E
7
C
3
0
0
C
9
9
E
0
8
C
3
E
B
0
0
A
A
E
0
9
E
C
9
C
E
B
B
B
0
D
A
E
C
6
2
9
C
C
C
0
D
B
9
E
4
B
6
2
D
D
D
3
C
9
E
A
3
4
B
6
E
6
6
E
D
3
D
B
9
F
7
A
3
F
6
F
6
E
3
A
B
9
9
F
7
E
1
6
1
E
3
A
F
5
B
1
6
9
0
4
0
4
A
B
2
1
E
5
A
7
1
6
9
4
4
9
A
2
5
0
4
D
2
A
7
2
2
2
F
4
9
5
6
4
D
2
6
8
8
6
2
F
9
2
8
9
6
4
C
1
C
1
F
9
0
8
6
1
C
8
9
2
A
A
2
F
0
F
9
C
1
E
D
1
C
1
D
1
D
9
0
5
A
2
F
A
B
E
D
C
0
0
C
0
F
5
1
D
D
2
E
A
B
D
3
D
3
F
5
D
0
C
E
A
2
E
E
F
F
E
5
D
E
C
D
3
D
9
A
A
C
A
C
D
E
C
7
F
E
5
6
D
9
0
D
D
0
E
C
7
A
C
B
C
5
6
0
E
0
E
C
7
D
D
0
E
7
B
C
D
2
2
D
7
D
3
0
E
C
3
E
7
9
9
D
3
9
2
D
1
D
C
3
3
A
A
3
D
3
9
0
9
E
F
1
D
B
B
B
B
9
3
A
3
3
0
B
2
E
F
A
3
3
A
9
3
C
0
B
B
F
1
B
2
0
5
0
5
0
3
8
C
3
A
0
8
F
1
2
7
7
2
3
C
8
0
5
2
8
0
8
8
2
A
0
7
2
C
B
8
F
D
9
D
F
8
2
A
6
0
C
B
F
9
0
F
9
8
A
6
3
0
D
F
2
3
1
C
0
1
0
calc_mult
3
3
3
6
1
6
calc_calc
2
1
0
0
0
3
6
F
0
9
0
0
4
1
F
1
6
A
6
D
3
8
3
E
us22 .550 22.600 22.650 22 .700 22 .750 22.800 22.850 22 .900 22 .950 23.000 23.050 23.100 23 .150 23.200 23.250 23 .300 23 .350 23 .400 23.450 23.500 23 .550 23 .600
Figure C.5: Systolic Array: End of the pre-computation in units 0,1,2
APPENDIX
C.SIMULATIONRESULTS
97
CLOCK_F
CLOCK_S
curr_state
b_even
b_odd
control_in
control_in
control_in
control_in
control_in
a_in
a_in
a_in
a_in
a_in
q_in
q_in
q_in
q_in
q_in
result_out
result_out
result_out
result_out
result_out
result_out
result_out
9
8
2
A
6
0
C
B
F
9
F
9
8
A
6
3
0
D
F
2
3
C
0
1
0
D
1
0
A
3
E
8
F
9
6
3
0
4
F
4
F
6
3
6
D
1
0
E
3
1
0
5
5
E
6
4
F
3
3
D
2
1
9
D
9
D
E
3
1
5
6
3
2
E
D
D
E
3
1
7
6
9
D
4
3
A
C
A
C
3
7
A
1
D
E
5
4
7
8
8
7
1
7
6
A
A
C
6
5
A
6
D
8
7
7
7
6
D
4
4
D
A
6
D
F
8
7
F
3
F
3
D
F
3
4
D
6
9
8
9
3
3
9
D
3
0
F
F
3
A
9
F
0
F
0
F
3
3
3
9
0
B
A
3
9
9
3
3
0
3
F
0
C
B
B
C
B
C
0
3
8
9
3
D
C
8
A
A
8
3
8
B
C
4
6
E
D
6
2
6
6
2
8
4
A
8
F
6
E
A
0
6
0
A
8
4
7
6
2
1
E
F
6
6
7
6
0
A
4
0
4
1
E
3
3
7
A
6
6
4
9
0
4
A
E
A
E
6
A
3
7
3
2
4
9
6
7
7
6
3
5
A
E
6
A
8
6
2
C
3
C
3
A
2
3
5
7
6
C
1
8
6
1
B
B
1
3
5
2
6
C
3
A
2
C
1
7
7
7
7
5
2
6
D
B
1
1
D
A
2
A
D
D
A
2
6
D
5
7
7
0
C
1
D
6
6
5
A
D
A
6
D
D
3
0
C
D
1
1
D
D
5
6
6
A
F
E
D
3
8
F
8
F
A
6
F
5
1
D
A
C
F
E
3
2
3
2
A
F
C
6
8
F
D
0
A
C
0
4
0
4
6
F
D
C
3
2
0
E
D
0
3
2
2
3
F
C
D
0
4
2
D
0
E
9
9
C
D
6
2
3
9
2
D
8
7
7
8
D
6
9
5
A
3
9
1
4
1
4
6
5
9
7
8
B
B
A
3
7
E
E
6
5
9
3
1
4
3
A
B
B
6
8
7
8
5
9
3
8
E
0
5
3
A
6
3
3
6
9
3
8
3
7
8
7
2
0
5
1
A
1
A
3
8
3
D
3
6
7
2
3
9
9
3
8
D
1
A
3
D
F
5
5
D
9
3
3
F
9
D
F
2
D
D
2
D
5
1
0
F
9
6
2
6
2
D
2
4
F
1
0
D
0
0
D
6
2
5
4
F
5
2
3
3
9
7
7
7
7
7
7
7
calc_calc
2
1
0
0
0
0
0
A
5
D
D
5
1
2
0
D
9
0
0
0
0
0
0
0
us23.600 23.650 23.700 2 3.7 50 2 3.80 0 23.850 23.900 23.950 24.000 24.050 24 .100 24.150 24 .20 0 24.250 24.300 24.350 24.400 24.450 2 4.50 0 24 .55 0 24.600 24.650
Figure C.6: Systolic Array: End of the pre-computation in units 41,42,43
APPENDIX C. SIMULATION RESULTS 98
C.3 Modular Exponentiation
The following sequence of three figures shows the pre place-and-route simulation re-
sults of the modular exponentiation. Figures C.7 and C.8 show the pre-computation.
In Figures C.9 an overview of a 19-bit modular exponentiation is given.
The signals shown in the simulation sequence are as follows:
Figure: C.7
CLOCK F ⇒ Fast clock signal, system clock.
CLOCK S ⇒ Slow clock signal(CLOCK F/2).
data in ⇒ X value to encrypt/decrypt from input.
b even ⇒ bus B to even numbered units.
b odd ⇒ bus B to odd numbered units.
control in ⇒ Control word to systolic array.
a i ⇒ Operand A(= X) to to systolic array.
q i ⇒ Quotient Q to systolic array.
y out ⇒ Data output of DP RAM Z.
Figure: C.8
result out ⇒ Result from systolic array.
y out ⇒ Data output of DP RAM Z.
y out ⇒ Data output of DP RAM P.
max count sq ⇒ Terminal count signal from DP RAM Z.
Figure: C.9
data in ⇒ Data input of Exp RAM:
1.value: number of bits ei of exponent (19).
2.value: first word of exponent (e15 . . . e0).
APPENDIX C. SIMULATION RESULTS 99
3.value: second word of exponent (e18 . . . e16).
exponent ⇒ Bit ei of exponent.
finish ⇒ Terminal signal in Exp RAM (see Figure 6.5).
sig pre c1 ⇒ Signal for state pre-computation1.
sig pre c2 ⇒ Signal for state pre-computation2.
sig calc ⇒ Signal for state computation.
sig post ⇒ Signal for state post-computation.
sig load ⇒ Signal for sub–state load.
APPENDIX C. SIMULATION RESULTS 100
CLOCK_F
CLOCK_S
curr_state
data_in
b_even
b_odd
control_in
a_i
q_i
y_out
0
0
0
2
1
1
2
2
3
2
3
3
4
4
3
5
4
6
5
4
7
5
8
6
5
9
6
A
7
6
B
7
C
8
7
D
8
4
E
0
9
8
F
9
0
2
1
A
9
0
1
A
2
2
3
B
A
1
0
B
3
9
4
C
B
3
0
C
4
A
5
D
C
6
0
D
5
0
6
E
D
D
0
E
6
F
7
F
E
B
0
F
7
2
8
0
F
8
0
0
8
D
9
1
0
1
0
2
9
9
A
2
1
3
0
3
A
7
B
3
pre_c1_mult
2
3
pre_c1_calc
4
4
3
0
6
B
0
us17.450 17.50 0 17.5 50 17.600 17.650 17.700 17.750 17 .8 00 17.850 17 .9 00 17.950 18.000 18.05 0 18.100 18.150 18.200 18.250 18.300 18.350
Figure C.7: Modular Exponentiation: Beginning of pre-computation
APPENDIX C. SIMULATION RESULTS 101
CLOCK_F
CLOCK_S
curr_state
data_in
b_even
b_odd
control_in
a_i
q_i
result_out
y_out
y_out
max_count_sq
C 8
0
0
E
1
9
0
0
1
5
2
A
2
0
2
1
3
3
0
3
E
4
8
0
4
7
5
0
0
5
F
6
2
0
6
7
0
0
3
7
8
9
0
8
8
9
C
0
0
9
A
A
0
2
A
9 3
pre_c1_calc
B
0
0
B 0
B
8
0
6
0
pre_c
0
2
1
1
2
3
0
us19.450 19.500 19.550 19.600 19.650 19.700 19.750 19.800 19 .850 19 .900 19.950 20 .000
CLOCK_F
CLOCK_S
curr_state
data_in
b_even
b_odd
control_in
a_i
q_i
result_out
y_out
y_out
max_count_sq
0
0
7
1
9
2
8
3
5
4
3
5
F
6
9
7
1
8
B
9
6
A
B
B
D
C
9
D
B
E
9
F
8
2
0
7
1
4
0
9
2
9
7
8
1
D
F
5
5
5
8
3
4
F
E
F
9
3
F
9
6
1
1
1
0
1
B
pre_c2_mult
6
0
9
3
1
6
pre_c2_calc
0
2
1
A
B
0
us20 .100 20 .150 20.200 20 .250 20.300 20 .350 20.400 20.450 20.500 20.550 20.600 20.650
Figure C.8: Modular Exponentiation: End of pre-computation and beginning of Z1,P1 calculation
APPENDIX C. SIMULATION RESULTS 102
CLOCK_F
CLOCK_S
curr_state
data_in
exponent
finish
idle
0000 0013
st_ld_cnt_exp
D555
st_load_exp
0006
idle
0000
us16 .000 16 .020 16 .040 16.060 16 .080 16 .100 16 .120 16 .140 16 .160 16.180
curr_state
data_in
exponent
finish
sig_pre_c1
sig_pre_c2
sig_calc
sig_post
sig_load
calc
0000
us15 20 25 30 35 40 45 50 55 60 65 70 75 80 85
Figure C.9: Modular Exponentiation: Loading the exponent and computation of amodular exponentiation with a 19–bit exponent
Bibliography
[1] P. Alfke and B. New. Implementing State Machines in LCA Devices. Xilinx,
Inc., San Jose, CA. Xilinx Application Note XAPP 027.001.
[2] J. Bajard, L. Didier, and P. Kornerup. An RNS Montgomery modular multipli-
cation algorithm. IEEE Transactions on Computers, 47(7):766–76, July 1998.
[3] T. Beth and D. Gollmann. Algorithm engineering for public key algorithms.
IEEE Journal on Selected Areas in Communications, 7(4):458–65, May 1989.
[4] E.F. Brickell. A fast modular multiplication algorithm with application in public
key cryptography. In Advances in Cryptology — CRYPTO ’82, pages 51–60.
Chaum et al., Eds. New York, Plenum Press, 1983.
[5] E.F. Brickell. A survey of hardware implementations of RSA. In Advances in
Cryptology — CRYPTO ’89, pages 368–70. Springer-Verlag, 1990.
[6] W. Diffie and M.E. Hellman. New directions in cryptography. IEEE Transactions
on Information Theory, IT-22:644–654, 1976.
[7] S. E. Eldridge and C. D. Walter. Hardware implementation of Montgomery’s
modular multiplication algorithm. IEEE Transactions on Computers, 42(6):693–
699, July 1993.
[8] W. Gai and H. Chen. A systolic linear array for modular multiplication. In 2nd
International Conference on ASIC, pages 171–4, 1996.
103
BIBLIOGRAPHY 104
[9] H.Orup. Simplifying quotient determination in high-radix modular multiplica-
tion. In Proceedings 12th Symposium on Computer Arithmetic, pages 193–9,
1995.
[10] K. Iwamura, T. Matsumoto, and H. Imai. Montgomery modular-multiplication
method and systolic arrays suitable for modular exponentiation. Electronics and
Communications in Japan, Part 3, 77(3):40–51, March 1994.
[11] J.-P. Kaps. High speed FPGA architectures for the Data Encryption Standard.
Master’s thesis, ECE Dept., Worcester Polytechnic Institute, Worcester, USA,
May 1998.
[12] K.Koyama and Y.Tsuruoka. Speeding up elliptic cryptosystems by using a signed
binary window method. In Advances in Cryptology — EUROCRYPT ’92, pages
345–357, May 1992.
[13] D.E. Knuth. The Art of Computer Programming. Volume 2: Seminumerical
Algorithms. Addison-Wesley, Reading, Massachusetts, 2nd edition, 1981.
[14] N. Koblitz. Elliptic curve cryptosystems. Mathematics of Computation, 48:203–
209, 1987.
[15] P. Kornerup. A systolic, linear-array multiplier for a class of right-shift algo-
rithms. IEEE Transactions on Computers, 43(8):892–8, August 1994.
[16] N. Lange. Single-chip implementation of a cryptosystem for financial applica-
tions. Lecture Notes in Computer Science, 1318:135–44, 1997.
[17] A. J. Menezes, P.C.van Oorschot, and S.A. Vanstone. Handbook of Applied
Cryptography. CRC Press, 1997.
BIBLIOGRAPHY 105
[18] V. Miller. Uses of elliptic curves in cryptography. In Lecture Notes in Computer
Science 218: Advances in Cryptology — CRYPTO ’85, pages 417–426. Springer-
Verlag, Berlin, 1986.
[19] P.L. Montgomery. Modular multiplication without trial division. Mathematics
of Computation, 44(170):519–21, April 1985.
[20] J.J. Quisquater and C. Couvreur. Fast decipherment algorithm for RSA public–
key cryptosystem. Electronics Letters, 18:905–7, October 1982.
[21] R.L. Rivest, A. Shamir, and L. Adleman. A method for obtaining digital signa-
tures and public key cryptosystems. Communications of the ACM, 21(2):120–6,
Feb. 1978.
[22] M. Rosner. Elliptic curve cryptosystems on reconfigurable hardware. Master’s
thesis, ECE Dept., Worcester Polytechnic Institute, Worcester, USA, May 1998.
[23] B. Schneier. Applied Cryptography. Wiley & Sons, 2nd edition, 1995.
[24] S.C.Pohlig and M.E. Hellman. An improved algorithm for computing logarithms
over gf(p) and its cryptographyc significance. IEEE Transactions on Information
Theory, 24(1):106–110, January 1978.
[25] H. Sedlak. An rsa cryptography processor. In Proceedings of EUROCRYPT ’87,
pages 127–42. Springer-Verlag, 1987.
[26] M. Shand and J. Vuillemin. Fast implementations of RSA cryptography. In Pro-
ceedings 11th IEEE Symposium on Computer Arithmetic, pages 252–259, 1993.
[27] Sican GmbH. RSA public key algorithm.
http://www.sican.de/internet/homepage internet d.html.
BIBLIOGRAPHY 106
[28] S.R.Dusse and B.S.Kaliski. A cryptographic library for the motorola dsp 56000.
In Advances in Cryptology — EUROCRYPT ’90, pages 230–244. Springer-
Verlag, 1991. LNCS 740.
[29] N. Takagi. A radix-4 modular multiplication hardware algorithm efficient for it-
erative modular multiplications. In Proceedings 10th IEEE Symposium on Com-
puter Arithmetic, pages 35–42, 1991.
[30] A. Tiountchik. Systolic modular exponentiation via Montgomery algorithm.
Electronic Letters, 34(9):874–5, April 1998.
[31] A. Vandemeulebroecke. A single chip 1024 bits RSA processor. In Advances in
Cryptology — EUROCRYPT ’89, pages 219–36. Springer-Verlag, 1990.
[32] M. Vielhaber. Entwurf und Layout eines RSA-Prozessors auf einer Chipkarte.
Master’s thesis, University of Karlsruhe, Karlsruhe, Germany, 1988.
[33] J.E. Vuillemin, P. Bertin, D. Roncin, M. Shand, H.H. Touati, and P. Boucard.
Programmable active memories: Reconfigurable systems come of age. IEEE
Transactions on VLSI Systems, 4(1):56–69, Mar 1996.
[34] C.D. Walter. Fast modular multiplication using 2-power radix. International
Journal of Computer Mathematics, 39(1–2):21–8, 1991.
[35] C.D. Walter. Systolic modular multiplication. IEEE Transactions on Computers,
42(3):376–8, March 1993.
[36] P.A. Wang. New VLSI architectures of RSA public key cryptosystems. In Pro-
ceedings of 1997 IEEE International Symposium on Circuits and Systems, vol-
ume 3, pages 2040–3, 1997.
BIBLIOGRAPHY 107
[37] E. De Win, S. Mister, B. Preneel, and M. Wiener. On the performance of signa-
ture schemes based on elliptic curves. In Algorithmic Number Theory Symposium
III, pages 252–266. Springer-Verlag, 1998.
[38] Xilinx, Inc., San Jose, CA. The Programmable Logic Data Book, 1996.
[39] J. Yong-Yin and W.P. Burleson. VLSI array algorithms and architectures for
RSA modular multiplication. IEEE Transactions on VLSI Systems, 5(2):211–
17, June 1997.