EFFICIENT FPGA IMPLEMENTATION OF FFT BASED
MULTIPLIERS
Lo Sing Cheng, Ali Miri, Tet Hin Yeap
School of Information Technology and Engineering
University of Ottawa
Ottawa, Ontario, Canada, K1N 6N5
e-mail: lcheng,samiri,[email protected]
Abstract
Finite field multiplication is one of the most useful arithmetic operations and has applications in many areas such as signal processing, coding theory and cryptography. However, it is also one of the most time-consuming operations in both software and hardware, which makes it pertinent to develop a fast and efficient implementation. In this paper, we propose a novel FFT-based finite field multiplier to address this problem.
The Fast Fourier Transform (FFT) is a collection of computationally efficient algorithms that compute the Discrete Fourier Transform (DFT). For our purposes, we will use its efficient computation for polynomial multiplication. The FFT performs polynomial multiplication in $O(n \log n)$ time, compared to $O(n^2)$ for the classical method. The idea of using the FFT for finite field multiplication has been researched extensively, but to our knowledge, this is the first implementation in hardware.
Keywords: Fast Fourier Transform, Finite Field, Multiplier, Number Theoretic Transform
1 Theoretical Aspect
The Fast Fourier Transform (FFT) is a collection of computationally efficient algorithms that compute the Discrete Fourier Transform (DFT). The DFT is the computation of the point-value representation of a sequence of N samples. It has many applications in digital signal processing, such as linear filtering, correlation analysis, and spectrum analysis. For our purposes, we will use its efficient computation for polynomial multiplication.
The FFT performs polynomial multiplication in $O(n \log n)$ time, a significant improvement over the $O(n^2)$ time of the classical method. The underlying idea is to first evaluate the two polynomials at a special set of values, perform the pointwise multiplication, and then interpolate the result to recover the coefficients of the product, as shown in Figure 1.
[Figure 1 diagram: the standard representation of two polynomials is mapped to their point-value representation by evaluation (DFT), $O(n \log n)$; pointwise multiplication, $O(n)$, gives the point-value representation of the product; interpolation (IDFT), $O(n \log n)$, gives the standard representation of the product. The direct path, classical multiplication, costs $O(n^2)$.]
Figure 1: FFT Polynomial Multiplication Structure
1.1 DFT
The Fourier Transform is a mathematical transform from the time domain to the frequency domain. The signal has to be sampled before the DFT can be applied. For an N-sample transform, the DFT is defined by the formula:
$$X(k) = \sum_{n=0}^{N-1} x(n)\,\omega_N^{nk}, \quad 0 \le k \le N-1$$
$$\omega_N = e^{-j2\pi/N}$$
Here $\omega_N$ is the N-th root of unity, which in the above case lies in the complex plane; $x(n)$ is the sample at time index n; and $X(k)$, for $0 \le k \le N-1$, forms a vector of N values indexed by frequency, corresponding to the magnitudes of the sine waves into which the time-indexed signal is decomposed. For detailed information on the DFT, refer to [5].
Similarly, the Inverse Discrete Fourier Transform (IDFT) is
$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\,\omega_N^{-nk}, \quad 0 \le n \le N-1$$
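As an illustration only, the two definitions can be checked numerically by evaluating the sums directly in $O(N^2)$ time. The Python sketch below is not part of the implementation described in this paper; the function names `dft` and `idft` and the sample vector are assumptions chosen for the example.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT: X(k) = sum_n x(n) * w^(n*k), with w = exp(-2j*pi/N)."""
    N = len(x)
    w = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * w ** (n * k) for n in range(N)) for k in range(N)]

def idft(X):
    """Direct inverse: x(n) = (1/N) * sum_k X(k) * w^(-n*k)."""
    N = len(X)
    w = cmath.exp(-2j * cmath.pi / N)
    return [sum(X[k] * w ** (-n * k) for k in range(N)) / N for n in range(N)]

samples = [1, 2, 3, 4]        # illustrative input, not taken from the paper
spectrum = dft(samples)       # X(0) is the plain sum of the samples
recovered = idft(spectrum)    # equals samples up to floating-point rounding
```

The round trip recovers the input only up to rounding, which is the numerical error discussed next.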
1.1.1 Numerical Error
Numerical errors appear during the computation. If they are sufficiently small, they can be neglected. The bound on the numerical error α on the $x_i$ after the FFT process can be proven to be
$$\alpha \le 6 n^2 B^2 \log(n)\,\varepsilon$$
where $\varepsilon \cong 10^{-16}$ and B is the base of the polynomial.
For our implementation, we overcome all numerical error by using the Number Theoretic Transform (NTT) approach in place of the FFT. This consists of working in a finite field with the appropriate $\omega_N$ value.
1.2 FFT Algorithms
The FFT exists in two functionally equivalent forms known as decimation in time (DIT) and decimation in frequency (DIF). The various algorithms that result from the FFT are collectively known as Radix-R Fast Fourier Transforms. The most popular Radix-R choices are R = 2 and R = 4. For our implementation we use Radix-4.
0-7803-8886-0/05/$20.00 ©2005 IEEE. CCECE/CCGEI, Saskatoon, May 2005
Authors Absent - Paper Not Presented
1.2.1 Radix-4 Algorithm
The Radix-4 algorithm splits the DFT summation into four sums of N/4 terms each, resulting in the formula:
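$$X(k) = \sum_{n=0}^{\frac{N}{4}-1} x(n)\,\omega_N^{kn} + \omega_N^{\frac{Nk}{4}} \sum_{n=0}^{\frac{N}{4}-1} x\!\left(n + \tfrac{N}{4}\right)\omega_N^{kn} + \omega_N^{\frac{Nk}{2}} \sum_{n=0}^{\frac{N}{4}-1} x\!\left(n + \tfrac{N}{2}\right)\omega_N^{kn} + \omega_N^{\frac{3Nk}{4}} \sum_{n=0}^{\frac{N}{4}-1} x\!\left(n + \tfrac{3N}{4}\right)\omega_N^{kn}$$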
The twiddle factor is the multiplicative factor defined by a power of $\omega_N$. The twiddle factors for the Radix-4 FFT are
$$\omega_N^{\frac{kN}{4}} = (-j)^k, \quad \omega_N^{\frac{kN}{2}} = (-1)^k, \quad \omega_N^{\frac{3kN}{4}} = (j)^k$$
Therefore the formula can be divided into four separate equations, where $k = 0, 1, ..., \frac{N}{4} - 1$, as follows:
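$$X(4k) = \sum_{n=0}^{\frac{N}{4}-1} \left[x(n) + x\!\left(n + \tfrac{N}{4}\right) + x\!\left(n + \tfrac{N}{2}\right) + x\!\left(n + \tfrac{3N}{4}\right)\right] \omega_N^{0}\,\omega_{N/4}^{kn}$$
$$X(4k+1) = \sum_{n=0}^{\frac{N}{4}-1} \left[x(n) - j\,x\!\left(n + \tfrac{N}{4}\right) - x\!\left(n + \tfrac{N}{2}\right) + j\,x\!\left(n + \tfrac{3N}{4}\right)\right] \omega_N^{n}\,\omega_{N/4}^{kn}$$
$$X(4k+2) = \sum_{n=0}^{\frac{N}{4}-1} \left[x(n) - x\!\left(n + \tfrac{N}{4}\right) + x\!\left(n + \tfrac{N}{2}\right) - x\!\left(n + \tfrac{3N}{4}\right)\right] \omega_N^{2n}\,\omega_{N/4}^{kn}$$
$$X(4k+3) = \sum_{n=0}^{\frac{N}{4}-1} \left[x(n) + j\,x\!\left(n + \tfrac{N}{4}\right) - x\!\left(n + \tfrac{N}{2}\right) - j\,x\!\left(n + \tfrac{3N}{4}\right)\right] \omega_N^{3n}\,\omega_{N/4}^{kn}$$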
Note that the input to each $\frac{N}{4}$-point DFT is a linear combination of four signal samples scaled by a twiddle factor. This procedure is repeated v times, where $v = \log_4 N$. The total computational cost of Radix-4 is $\frac{3}{8} N \log_2 N$ complex multiplications and $N \log_2 N$ complex additions. The output of Radix-4 is in bit-reversed order relative to the input.
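The decomposition can be sketched in software as a recursive radix-4 decimation-in-frequency routine and compared against the direct DFT. This is a minimal floating-point illustration under stated assumptions, not the hardware design of this paper: the names `dft` and `radix4_dif` are hypothetical, the input length must be a power of 4, and the recursion writes each branch straight into its final position, so no separate reordering step appears.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT used as the reference."""
    N = len(x)
    w = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * w ** (n * k) for n in range(N)) for k in range(N)]

def radix4_dif(x):
    """Recursive radix-4 decimation-in-frequency FFT; len(x) must be a power of 4."""
    N = len(x)
    if N == 1:
        return list(x)
    q = N // 4
    w = cmath.exp(-2j * cmath.pi / N)
    branches = [[], [], [], []]
    for n in range(q):
        a, b, c, d = x[n], x[n + q], x[n + 2 * q], x[n + 3 * q]
        # Butterfly: the four bracketed linear combinations of the equations.
        t0 = a + b + c + d
        t1 = a - 1j * b - c + 1j * d
        t2 = a - b + c - d
        t3 = a + 1j * b - c - 1j * d
        # Twiddle: branch r is scaled by w^(r*n) before the N/4-point DFT.
        for r, t in enumerate((t0, t1, t2, t3)):
            branches[r].append(t * w ** (r * n))
    out = [0] * N
    for r in range(4):
        out[r::4] = radix4_dif(branches[r])   # these are X(4k + r)
    return out

x = list(range(16))       # illustrative 16-point input
X = radix4_dif(x)         # agrees with dft(x) up to rounding
```

For N = 16 the recursion has two levels, matching $v = \log_4 16 = 2$.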
1.3 FFT Multiplication
We now show how to use the DFT to multiply two polynomials of degree at most n−1 in $O(n \log n)$ time. Suppose that we are given two polynomials p(x) and q(x), where
$$p(x) = a_0 + a_1 x + ... + a_{n-1} x^{n-1}$$ and
$$q(x) = b_0 + b_1 x + ... + b_{m-1} x^{m-1},$$
and $\max\{m, n\} = n$. If we multiply the polynomials together, the product pq will have degree n+m−2. Suppose that the polynomial pq(x) has the coefficient representation
$$pq(x) = c_0 + c_1 x + ... + c_{n+m-2} x^{n+m-2}$$
Refer to Figure 1 for the structure of FFT multiplications.
Algorithm 1 FFT multiplication
1: Consider the two polynomials p (of degree n−1) and q (of degree m−1). Let n′ be the smallest power of 2 satisfying n′ ≥ n+m−1.
2: Call the FFT algorithm on the polynomial p to calculate the values of $p(\omega_{n'}^{k})$ for all the n′-th roots of unity (i.e. for all 0 ≤ k ≤ n′−1).
3: Call the FFT algorithm on the polynomial q to calculate the values of $q(\omega_{n'}^{k})$ for all the n′-th roots of unity (i.e. for all 0 ≤ k ≤ n′−1).
4: Compute $y_k = pq(\omega_{n'}^{k}) = p(\omega_{n'}^{k})\,q(\omega_{n'}^{k})$ for every k = 0, 1, ..., n′−1. This is the DFT of pq.
5: Compute the inverse DFT by a single application of the FFT algorithm (with the root of unity $\omega_{n'}^{-1}$ as the principal root of unity) and then multiply the resulting vector by $\frac{1}{n'}$. Output the resulting list as coefficients $c_0, c_1, ..., c_{n'-1}$.
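The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative floating-point version under stated assumptions, not the implementation presented in this paper: it uses a simple recursive radix-2 FFT rather than Radix-4, the names `fft` and `fft_multiply` are hypothetical, and the final rounding assumes integer coefficients small enough for the rounding error to stay below 0.5.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of 2.
    With invert=True the inverse root of unity is used (no 1/n scaling here)."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft_multiply(p, q):
    """Multiply integer polynomials given as coefficient lists, lowest degree first."""
    size = len(p) + len(q) - 1               # n + m - 1 product coefficients
    n = 1
    while n < size:                          # step 1: smallest power of 2 >= n + m - 1
        n *= 2
    P = fft(list(p) + [0] * (n - len(p)))    # step 2: evaluate p
    Q = fft(list(q) + [0] * (n - len(q)))    # step 3: evaluate q
    Y = [u * v for u, v in zip(P, Q)]        # step 4: pointwise products
    c = fft(Y, invert=True)                  # step 5: inverse transform ...
    return [round((v / n).real) for v in c[:size]]   # ... scaled by 1/n

# (1 + 2x + 3x^2)(4 + 5x) = 4 + 13x + 22x^2 + 15x^3
product = fft_multiply([1, 2, 3], [4, 5])    # [4, 13, 22, 15]
```

The explicit rounding in the last step is exactly what the NTT approach removes, since all arithmetic there is exact.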
2 Previous Work
Since the publication of a fast algorithm for the DFT by Cooley and Tukey in 1965, FFT algorithms have played a key role in the widespread use of digital signal processing in a variety of applications. The FFT is also the basis of the best known method of multiplication, discovered by Schönhage and Strassen. We will now provide a brief literature review of FFT multipliers.
In 1976, Robert T. Moenck [4] discussed four different techniques for improving fast polynomial multiplication: the Karatsuba algorithm, the hybrid FFT algorithm, the Mixed-Basis FFT algorithm, and the Hybrid-Mixed-Basis FFT algorithm.
In 1993, Yiquan Wu and Zhaoda Zhu [6] proposed four different versions of the DFT (DFT-j, j = I, II, III, IV). The underlying idea is to eliminate all complex multiplications, because they require double the arithmetic operations and storage of a real multiplier. The new real-multiplier FFT-j algorithms are proposed for all four versions of the $N = 2^m$ DFT. All algorithms were implemented in software, in FORTRAN-77. The paper did not discuss using the FFT as a multiplier; it discussed the details of the FFT itself.
There have not been many actual implementations of FFT-based multipliers, although the FFT itself has been studied extensively. Despite all the proposed enhancements to the FFT, we have decided to implement the Radix-4 FFT with the NTT approach. Given the modular structure of our FFT multiplier, it would be trivial to replace the FFT and IFFT modules with future enhanced methods.
3 Implementation
When designing our FFT multiplier we made several modifications due to error and FPGA constraints. Our multiplier is a pipelined hybrid Radix-4 parallel NTT multiplier over GF($2^{113}$). Working in a finite field allows us to hard-code parameters. Instead of using the complex root $\omega_N = e^{-j2\pi/N}$, we use the NTT approach, where $\omega_N$ is a primitive N-th root of unity in the finite field $Z_m$. We will now outline the advantages of using the NTT approach.
Error Elimination - The NTT works in a finite field, eliminating all floating-point values.
Complex Arithmetic Elimination - The NTT uses finite fields, thereby eliminating the use of floating-point representations.
Register Reduction - The NTT eliminates all complex numbers. A hardware implementation represents a complex number with two separate registers; using the NTT, we halve the number of registers.
Complexity Reduction - The NTT eliminates all complex arithmetic operations.
An additional modification to our multiplier is a rearrangement of the structure of Radix-4. We make all commutator stages constant (refer to Figure 2). Figure 3 depicts the overall architecture of our implementation.
Figure 2: Restructured Radix-4 for a 16 point FFT
[Figure 3 diagram: blocks labeled Setup (two), FFT (two), Pointwise Multiplication and Finalize; signal widths labeled 113 and 240.]
Figure 3: Architecture of Finite Field FFT Multiplier
3.1 Preliminary Calculations
In exchange for the NTT's finite field advantages, a lot of precalculation must be performed. We must find a sufficiently large m such that the ring $Z_m$ contains a principal N-th root of unity. To determine m we use some basic group theory. In any ring containing a principal N-th root of unity r, the values $r, r^2, ..., r^N$ form a cyclic group of order N.
In order to simplify the task of finding a cyclic subgroup of the proper order, we restrict our attention to prime moduli. Furthermore, the multiplicative group of nonzero elements in any finite field is cyclic. Thus, if m is prime, the multiplicative group of nonzero elements is a cyclic group of order m − 1, that is, containing m − 1 elements. Such a group has an element of order N exactly when N divides m − 1.

[Figure 4 diagram: blocks labeled Commutator, Twiddle, Commutator and Rearrange; signals labeled En, done, RST and Clk; 16 coefficients at the input and at the output.]
Figure 4: Detailed FFT Module
In addition to finding the value m, we must also take into account the size of N, so as to reduce the padding of the coefficients of the polynomial. After many trial-and-error attempts, using [2], we settled on $m = 16417 = 2^4 \times 1026 + 1$ and N = 16, where the 16-th primitive root of unity of $F_{16417}$ is $\omega_{16} = 7339$. For the twiddle factors, we must calculate all constants $\omega_N^i$ for i = 0, 1, ..., N − 1 and their inverses. We used the java.math package for all calculations.
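The constants quoted above can be checked with a few lines of modular arithmetic. The sketch below uses Python rather than the java.math package used for this work, and is only an illustration: the variable names and the helper `find_root_of_unity` are hypothetical, and the helper shows one possible search, not necessarily the procedure of [2].

```python
M = 16417       # prime modulus from the text
N = 16          # transform length from the text
OMEGA = 7339    # 16-th primitive root of unity from the text

# N must divide M - 1 for an element of order N to exist.
assert (M - 1) % N == 0
# For N a power of 2, w is a primitive N-th root iff w^(N/2) = -1 (mod M).
assert pow(OMEGA, N, M) == 1
assert pow(OMEGA, N // 2, M) == M - 1

OMEGA_INV = pow(OMEGA, M - 2, M)   # inverse by Fermat's little theorem (M prime)
N_INV = pow(N, M - 2, M)           # the 1/N factor of the inverse transform

twiddles = [pow(OMEGA, i, M) for i in range(N)]
inv_twiddles = [pow(OMEGA_INV, i, M) for i in range(N)]

def find_root_of_unity(m, n):
    """Return some primitive n-th root of unity mod prime m (n a power of 2),
    or None if no candidate works. Hypothetical helper for illustration."""
    for g in range(2, m):
        w = pow(g, (m - 1) // n, m)     # w^n = 1 by Fermat's little theorem
        if pow(w, n // 2, m) == m - 1:  # order is exactly n
            return w
    return None
```

In this field `twiddles[4]` is 12571, which equals −3846 mod 16417, and `twiddles[8]` is 16416, which equals −1.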
3.2 FFT Setup Module
The setup module of the multiplier prepares the input for the FFT modules. The inputs of the setup module are in polynomial representation. The setup module performs all the padding and incorporates all the preliminary calculations.
3.3 FFT Module
The first step in implementing the FFT was to standardize the Radix-4 structure. In previous implementations of the FFT, the commutator module arranged the data using a lookup table indexed by the current stage. We eliminated the lookup table by changing the structure to that of Figure 2. One can see that the input and output arrangements of the butterflies remain constant.
The FFT module performs the actual FFT operation with the fixed twiddle factors. The twiddle factors are defined from the $\omega_N$ value. Figure 4 is a detailed view of the FFT module, including all its components and their interconnections.
The commutator feeds the appropriate coefficients into the Radix-4 butterflies. Each commutator includes 4 Radix-4 butterflies. Figure 5 shows the butterfly operation.
The twiddle module multiplies the output of the commutator by the appropriate value of $\omega_N^{xn}\,\omega_{N/4}^{kn}$, where x = 0, 1, 2, 3. For x = 0 or n = 0 the factor $\omega_N^{xn}$ equals 1, and for k = 0 or n = 0 the factor $\omega_{N/4}^{kn}$ equals 1, so the corresponding multiplication is not needed.

[Figure 5 diagram: butterfly with inputs x(n), x(n+N/4), x(n+N/2), x(n+3N/4), adders and subtractors, the constant 3846, and outputs a(n), a(n+N/4), a(n+N/2), a(n+3N/4).]
Figure 5: Radix-4 Butterfly
The rearrange module rearranges the output into the correct order according to the bit-reversal criterion.
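A Radix-4 butterfly over $F_{16417}$ can be sketched as follows. This is an illustration, not the Verilog design: it assumes that the constant 3846 of Figure 5 plays the role of j (its square is −1 modulo 16417), so that −j corresponds to $\omega_{16}^{4} = 12571$. The function names are hypothetical.

```python
M = 16417
J = 3846             # assumed role of j: J * J = -1 (mod M)
W4 = (M - J) % M     # 12571 = 7339^4 mod M, the role of -j

def butterfly(x0, x1, x2, x3):
    """Four linear combinations of the Radix-4 equations, reduced mod M."""
    a0 = (x0 + x1 + x2 + x3) % M
    a1 = (x0 - J * x1 - x2 + J * x3) % M
    a2 = (x0 - x1 + x2 - x3) % M
    a3 = (x0 + J * x1 - x2 - J * x3) % M
    return [a0, a1, a2, a3]

def direct_4point(x):
    """Reference: 4-point transform with root W4, evaluated term by term."""
    return [sum(x[n] * pow(W4, n * k, M) for n in range(4)) % M
            for k in range(4)]

outputs = butterfly(1, 2, 3, 4)   # matches direct_4point([1, 2, 3, 4])
```

Because multiplication by ±1 is free, the only constant multiplication inside the butterfly is by 3846, and no complex arithmetic is involved.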
3.4 Point-wise Multiplier Module
The point-wise multiplier module does exactly what its name implies. It multiplies the outputs of the two FFT modules coefficient by coefficient. Suppose the two output polynomials of the FFT are A(x) and B(x) and the output polynomial of the point-wise multiplier is C(x). Then $C_0 = A_0 \times B_0$, $C_1 = A_1 \times B_1$ and so on. This involves a total of 16 multiplications, one for each coefficient.
3.5 Inverse FFT Module
The Inverse FFT Module has a similar structure to the FFT Module (Figure 4). The only major difference is that the twiddle module now incorporates the $\frac{1}{N}$ factor and uses the inverse twiddle constants.
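The forward transform, point-wise multiplication and inverse transform can be put together in a short software model. The sketch below is an assumption-laden illustration rather than the pipelined design: it evaluates each 16-point transform directly in $O(N^2)$ instead of using Radix-4, it omits the setup and finalization steps, and the names `ntt` and `ntt_multiply` are hypothetical. The result is the product with coefficients reduced modulo 16417.

```python
M, N, OMEGA = 16417, 16, 7339    # constants from Section 3.1

def ntt(a, root):
    """Direct length-N transform over Z_M using the given root of unity."""
    return [sum(a[n] * pow(root, n * k, M) for n in range(N)) % M
            for k in range(N)]

def ntt_multiply(p, q):
    """Product of two polynomials mod M; needs len(p) + len(q) - 1 <= N."""
    assert len(p) + len(q) - 1 <= N
    A = ntt(list(p) + [0] * (N - len(p)), OMEGA)   # zero-pad, then transform
    B = ntt(list(q) + [0] * (N - len(q)), OMEGA)
    C = [(a * b) % M for a, b in zip(A, B)]        # 16 point-wise multiplications
    root_inv = pow(OMEGA, M - 2, M)                # inverse twiddle base
    n_inv = pow(N, M - 2, M)                       # the 1/N factor
    return [(v * n_inv) % M for v in ntt(C, root_inv)]

# (1 + 2x + 3x^2)(4 + 5x) = 4 + 13x + 22x^2 + 15x^3
coeffs = ntt_multiply([1, 2, 3], [4, 5])           # [4, 13, 22, 15, 0, ..., 0]
```

All intermediate values are integers below 16417, so the result is exact and needs no rounding.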
3.6 Finalization Module
The finalization module performs the reverse operation of the setup module. The finalization module rearranges the output, removes the padded coefficients, reconstructs the product and performs the GF($2^{113}$) reduction.
4 Results
There are no FFT multipliers that we came across in our research; therefore we do not include a comparison of the performance of the FFT multiplier as a whole. The next table compares the FFT module itself, a component that has been implemented extensively for signal processing applications. For comparison purposes we chose an implementation with similar characteristics: the Se-Hann Lee and Shao-Hua Shih implementation [3], an FPGA Radix-4 design. Despite the fact that our input coefficients are 16 bits wide compared to their 8 bits, we see improvement in all aspects. We believe this is due to the use of the NTT approach, which eliminates all complex numbers and complex arithmetic.
                FFT Module       Lee and Shih [3]
Input           16-pt Radix-4    16-pt Radix-4
Coeff. width    16 bits          8 bits
Language        Verilog          Verilog
Type            Pipeline NTT     Pipeline FFT
Software        Xilinx ISE       Xilinx ISE
Max Freq        527.426 MHz      5.796 MHz
Clk Cycles      2                18
Time            3.79 ns          3.02 µs
Slices          11218            19872

The results of our FFT multiplier, which includes the setup modules and the finalization module, are displayed in the following table.

                FFT Multiplier
Max Frequency   527.426 MHz
Clock Cycles    5
Time            9.48 ns
Slices          45865
LUT/FF          81995/8
5 Conclusions
In this paper, we have presented a new finite field FFT multiplier implementation, which uses the NTT approach. The NTT approach has many advantages, including reduced complexity, fewer registers and the elimination of numerical error, making this, to our knowledge, the first and fastest implementation of its kind.
References
[1] Fedorenko, S. and Trifonov, P., On Computing the Fast Fourier Transform over Finite Fields, In the Proceedings of the Eighth International Workshop on Algebraic and Combinatorial Coding Theory, Tsarskoe Selo, Russia, pp. 108-111, 2002.
[2] Howell, R. R., How To Find FFT Constants, available at http://www.cis.ksu.edu/~howell/calculator/how.html, 2001.
[3] Lee, S. and Shih, S., FPGA Based Solutions for The Fourier Transform, available at http://web.uvic.ca/~sshih/index.html, University of Victoria, July, 2004.
[4] Moenck, R. T., Practical Fast Polynomial Multiplication, In the Proceedings of the Third ACM Symposium on Symbolic and Algebraic Computation (SYMSAC), pp. 136-148, 1976.
[5] Oppenheim, Alan V. and Willsky, Alan S., Signals And Systems, Second Edition. Prentice Hall Signal Processing Series, 1997.
[6] Wu, Y. and Zhu, Z., The New Real-Multiplier FFT-J Algorithms, In the Proceedings of the IEEE 1993 National Aerospace and Electronics Conference (NAECON), vol. 1, pp. 90-93, 1993.