[IEEE Canadian Conference on Electrical and Computer Engineering, 2005. - Saskatoon, SK, Canada (May 1-4, 2005)] Canadian Conference on Electrical and Computer Engineering, 2005. -

EFFICIENT FPGA IMPLEMENTATION OF FFT BASED MULTIPLIERS

Lo Sing Cheng, Ali Miri, Tet Hin Yeap

School of Information Technology and Engineering, University of Ottawa, Ottawa, Ontario, Canada, K1N 6N5

e-mail: lcheng,samiri,[email protected]

Abstract

Finite field multiplication is one of the most useful arithmetic operations and has applications in many areas such as signal processing, coding theory and cryptography. However, it is also one of the most time-consuming operations in both software and hardware, which makes it pertinent to develop a fast and efficient implementation. In this paper, we propose a novel FFT based finite field multiplier to address this problem.

The Fast Fourier Transform (FFT) is a collection of computationally efficient algorithms that perform the Discrete Fourier Transform (DFT). For our purposes, we will use its efficient computation for polynomial multiplication. The FFT performs polynomial multiplication in $O(n \log n)$ time compared to the classical method time of $O(n^2)$. The idea of using the FFT for finite field multiplication has been researched extensively, but to our knowledge, this is the first implementation in hardware.

Keywords: Fast Fourier Transform, Finite Field, Multiplier, Number Theoretic Transform

1 Theoretical Aspect

The Fast Fourier Transform (FFT) is a collection of computationally efficient algorithms that perform the Discrete Fourier Transform (DFT). The DFT is the computation of the point-value representation of a sequence of N samples. It has many applications in digital signal processing, such as linear filtering, correlation analysis, and spectrum analysis. For our purposes, we will use its efficient computation for polynomial multiplication.

The FFT performs polynomial multiplication in $O(n \log n)$ time, a significant improvement over the classical method time of $O(n^2)$. The underlying idea behind using the FFT for polynomial multiplication is to first evaluate the two polynomials at a special set of values, perform the point-wise multiplication, and then interpolate the result to recover the coefficients of the product (Figure 1).

[Figure 1 diagram: the standard representation of two polynomials is mapped to the point-value representation by evaluation (DFT), $O(n \log n)$; pointwise multiplication, $O(n)$, gives the point-value representation of the product; interpolation (IDFT), $O(n \log n)$, gives the standard representation of the product. The direct path between the two standard representations is classical multiplication, $O(n^2)$.]

Figure 1: FFT Polynomial Multiplication Structure

1.1 DFT

The Fourier Transform is a mathematical transform from the time domain to the frequency domain. To perform the transform, the signal has to be sampled before using the DFT. For an N-sample transform, the DFT is defined by the formula:

$$X(k) = \sum_{n=0}^{N-1} x(n)\,\omega_N^{nk}, \qquad 0 \le k \le N-1$$

$$\omega_N = e^{-j2\pi/N}$$

Here $\omega_N$ is the N-th root of unity, which in the above case lies in the complex plane; x(n) is the sample at time index n; and X(k) is a vector of N values at frequency index k corresponding to the magnitude of the sine waves resulting from the decomposition of the time-indexed signal. For detailed information on the DFT refer to [5].

Similarly, the Inverse Discrete Fourier Transform (IDFT) becomes

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\,\omega_N^{-nk}, \qquad 0 \le n \le N-1$$
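The two definitions above can be checked numerically. The following is a minimal sketch that evaluates the DFT and IDFT sums directly in $O(N^2)$ time; the function names `dft` and `idft` and the sample vector are illustrative assumptions, not part of the original design.

```python
import cmath

def dft(x):
    """Direct evaluation of X(k) = sum_n x(n) * w^(n*k), with w = exp(-j*2*pi/N)."""
    N = len(x)
    w = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * w ** (n * k) for n in range(N)) for k in range(N)]

def idft(X):
    """Direct evaluation of x(n) = (1/N) * sum_k X(k) * w^(-n*k)."""
    N = len(X)
    w = cmath.exp(-2j * cmath.pi / N)
    return [sum(X[k] * w ** (-n * k) for k in range(N)) / N for n in range(N)]

samples = [1, 2, 3, 4]          # hypothetical input sequence
spectrum = dft(samples)         # point-value representation
recovered = idft(spectrum)      # back to the original samples, up to rounding
```

The recovered values match the input only up to floating-point rounding, which is the numerical error discussed next.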

1.1.1 Numerical Error

Numerical errors appear during the computation. If the numerical errors are sufficiently small they can be neglected. The bound on the numerical error $\alpha$ on the $x_i$ after the FFT process can be proven to be

$$\alpha \le 6 n^2 B^2 \log(n)\,\varepsilon$$

where $\varepsilon \cong 10^{-16}$ and B is the base of the polynomial.

For our implementation, we overcome all numerical error by using the Number Theoretic Transform (NTT) approach in place of the FFT. This consists of working in a finite field with the appropriate $\omega_N$ value.

1.2 FFT Algorithms

The FFT exists in two functionally equivalent forms known as decimation in time (DIT) and decimation in frequency (DIF). The various algorithms that result from the FFT are collectively known as Radix-R Fast Fourier Transforms. The most popular Radix-R choices are those of R = 2 and R = 4. For our implementation we use Radix-4.

0-7803-8886-0/05/$20.00 ©2005 IEEE. CCECE/CCGEI, Saskatoon, May 2005. Authors Absent - Paper Not Presented.

1.2.1 Radix-4 Algorithm

The Radix-4 algorithm splits the DFT summation into four sums of N/4 terms each, resulting in the formula:

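$$X(k) = \sum_{n=0}^{\frac{N}{4}-1} x(n)\,\omega_N^{kn} + \omega_N^{\frac{Nk}{4}} \sum_{n=0}^{\frac{N}{4}-1} x\!\left(n+\tfrac{N}{4}\right)\omega_N^{kn} + \omega_N^{\frac{Nk}{2}} \sum_{n=0}^{\frac{N}{4}-1} x\!\left(n+\tfrac{N}{2}\right)\omega_N^{kn} + \omega_N^{\frac{3Nk}{4}} \sum_{n=0}^{\frac{N}{4}-1} x\!\left(n+\tfrac{3N}{4}\right)\omega_N^{kn}$$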

The twiddle factor is the multiplicative factor defined by a power of $\omega_N$. The twiddle factors for the Radix-4 FFT are

$$\omega_N^{kN/4} = (-j)^k, \qquad \omega_N^{kN/2} = (-1)^k, \qquad \omega_N^{3kN/4} = (j)^k$$

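Therefore the formula can be divided into four separate equations, where $k = 0, 1, \ldots, \frac{N}{4}-1$, as follows:

$$X(4k) = \sum_{n=0}^{\frac{N}{4}-1} \left[x(n) + x\!\left(n+\tfrac{N}{4}\right) + x\!\left(n+\tfrac{N}{2}\right) + x\!\left(n+\tfrac{3N}{4}\right)\right] \omega_N^{0}\,\omega_{N/4}^{kn}$$

$$X(4k+1) = \sum_{n=0}^{\frac{N}{4}-1} \left[x(n) - j\,x\!\left(n+\tfrac{N}{4}\right) - x\!\left(n+\tfrac{N}{2}\right) + j\,x\!\left(n+\tfrac{3N}{4}\right)\right] \omega_N^{n}\,\omega_{N/4}^{kn}$$

$$X(4k+2) = \sum_{n=0}^{\frac{N}{4}-1} \left[x(n) - x\!\left(n+\tfrac{N}{4}\right) + x\!\left(n+\tfrac{N}{2}\right) - x\!\left(n+\tfrac{3N}{4}\right)\right] \omega_N^{2n}\,\omega_{N/4}^{kn}$$

$$X(4k+3) = \sum_{n=0}^{\frac{N}{4}-1} \left[x(n) + j\,x\!\left(n+\tfrac{N}{4}\right) - x\!\left(n+\tfrac{N}{2}\right) - j\,x\!\left(n+\tfrac{3N}{4}\right)\right] \omega_N^{3n}\,\omega_{N/4}^{kn}$$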

Note that the input to each $\frac{N}{4}$-point DFT is a linear combination of four signal samples scaled by a twiddle factor. This procedure is repeated v times, where $v = \log_4 N$. The total computational cost of Radix-4 is $\frac{3}{8} N \log_2 N$ complex multiplications and $N \log_2 N$ complex additions. The output of Radix-4 is in bit-reversed order relative to the input.
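One step of the Radix-4 decomposition can be checked against the direct DFT. The sketch below, with assumed helper names `dft` and `radix4_split`, forms the four linear combinations, applies the twiddle $\omega_N^{rn}$, and takes an N/4-point DFT of each; it is a textbook illustration over the complex numbers, not the hardware structure described later.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT with w = exp(-j*2*pi/N)."""
    N = len(x)
    w = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * w ** (n * k) for n in range(N)) for k in range(N)]

def radix4_split(x):
    """One radix-4 decimation-in-frequency step using four N/4-point DFTs."""
    N = len(x)
    q = N // 4
    w = cmath.exp(-2j * cmath.pi / N)
    X = [0j] * N
    # Butterfly coefficients for outputs X(4k), X(4k+1), X(4k+2), X(4k+3).
    coeffs = [(1, 1, 1, 1), (1, -1j, -1, 1j), (1, -1, 1, -1), (1, 1j, -1, -1j)]
    for r, (c0, c1, c2, c3) in enumerate(coeffs):
        # Linear combination of four samples, scaled by the twiddle w^(r*n).
        g = [(c0 * x[n] + c1 * x[n + q] + c2 * x[n + 2 * q] + c3 * x[n + 3 * q])
             * w ** (r * n) for n in range(q)]
        sub = dft(g)                      # N/4-point DFT, kernel w_{N/4}^(k*n)
        for k in range(q):
            X[4 * k + r] = sub[k]
    return X

x = [complex(v) for v in range(16)]       # hypothetical 16-point input
direct = dft(x)
split = radix4_split(x)
max_diff = max(abs(a - b) for a, b in zip(direct, split))
```

Applying the same split recursively to each N/4-point DFT gives the $\log_4 N$ stages mentioned above.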

1.3 FFT Multiplication

We now show how to use the DFT to multiply two polynomials of degree at most n−1 in $O(n \log n)$ time. Suppose that we are given two polynomials p(x) and q(x), where

$$p(x) = a_0 + a_1 x + \cdots + a_{n-1} x^{n-1}$$

and

$$q(x) = b_0 + b_1 x + \cdots + b_{m-1} x^{m-1},$$

and max{m, n} = n. If we multiply the polynomials together, the product pq will have degree n+m−2. Suppose that the polynomial pq(x) has the coefficient representation

$$pq(x) = c_0 + c_1 x + \cdots + c_{n+m-2}\, x^{n+m-2}$$

Refer to Figure 1 for the structure of FFT multiplication.

Algorithm 1 FFT multiplication

1: Consider the two polynomials p (of degree n−1) and q (of degree m−1). Let n′ be the smallest power of 2 satisfying n′ ≥ n+m−1.

2: Call the FFT algorithm on the polynomial p to calculate the values of $p(\omega_{n'}^{k})$ for all the n′-th roots of unity (i.e., for all 0 ≤ k ≤ n′−1).

3: Call the FFT algorithm on the polynomial q to calculate the values of $q(\omega_{n'}^{k})$ for all the n′-th roots of unity (i.e., for all 0 ≤ k ≤ n′−1).

4: Compute $y_k = pq(\omega_{n'}^{k}) = p(\omega_{n'}^{k})\,q(\omega_{n'}^{k})$ for every k = 0, 1, ..., n′−1. This is the DFT of pq.

5: Compute the inverse DFT by a single application of the FFT algorithm (with the root of unity $\omega_{n'}^{-1}$ as the principal root of unity) and then multiply the resulting vector by $\frac{1}{n'}$. Output the resulting list as coefficients $c_0, c_1, \ldots, c_{n'-1}$.
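The steps of Algorithm 1 can be sketched in a few lines. This is a minimal floating-point illustration that assumes a recursive radix-2 FFT and integer coefficients; the function names `fft` and `fft_multiply` are hypothetical, and the final rounding is needed only because complex arithmetic is inexact.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of 2."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1            # inverse uses the root omega^-1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft_multiply(p, q):
    """Multiply integer-coefficient polynomials (lowest degree first)."""
    size = len(p) + len(q) - 1            # n + m - 1 product coefficients
    n2 = 1
    while n2 < size:                      # step 1: smallest power of 2 >= n + m - 1
        n2 *= 2
    P = fft(p + [0] * (n2 - len(p)))      # step 2: evaluate p
    Q = fft(q + [0] * (n2 - len(q)))      # step 3: evaluate q
    Y = [u * v for u, v in zip(P, Q)]     # step 4: point-wise products
    c = fft(Y, invert=True)               # step 5: FFT with the inverse root ...
    return [round((v / n2).real) for v in c[:size]]   # ... then scale by 1/n'

product = fft_multiply([1, 2, 3], [4, 5, 6])   # (1 + 2x + 3x^2)(4 + 5x + 6x^2)
```

The rounding in the last step is safe only while the accumulated error stays below one half, which is the constraint the error bound of Section 1.1.1 expresses.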

2 Previous Work

Since the publication of a fast algorithm for the DFT by Cooley and Tukey in 1965, FFT algorithms have played a key role in the widespread use of digital signal processing in a variety of applications. The FFT is also the basis of the best known multiplication method, discovered by Schönhage and Strassen. We will now provide a brief literature review of FFT multipliers.

In 1976, Robert T. Moenck [4] discussed four different techniques for improving fast polynomial multiplication: the Karatsuba algorithm, the hybrid FFT algorithm, the Mixed-Basis FFT algorithm, and the Hybrid-Mixed-Basis FFT algorithm.

In 1993, Yiquan Wu and Zhaoda Zhu [6] proposed four different versions of the DFT (DFT-j, j = I, II, III, IV). The underlying idea is to eliminate all complex multiplications because they require double the arithmetic operations and storage compared to real multiplications. New real-multiplier FFT-j algorithms are proposed for all four versions of the $N = 2^m$ DFT. All algorithms were implemented in software, in FORTRAN-77. The paper did not discuss using the FFT as a multiplier but instead discussed the details of the FFT.

There have not been many actual implementations of FFT based multipliers, although the FFT itself has been studied extensively. Despite all the proposed enhancements to the FFT, we have decided to implement the Radix-4 FFT with the NTT approach. Given the modular structure of our implementation of the FFT multiplier, it would be trivial to replace the FFT and IFFT modules with future enhanced methods.

3 Implementation

When designing our FFT multiplier we made several modifications due to error and FPGA constraints. Our multiplier is a pipelined hybrid Radix-4 parallel NTT multiplier over $GF(2^{113})$. Working in a finite field allows us to hard code parameters. Instead of using the complex root $\omega_N = e^{-j2\pi/N}$, we use the NTT approach, where $\omega_N$ is a primitive N-th root of unity in the finite field $Z_m$. We will now outline the advantages of using the NTT approach.

Error Elimination - The NTT uses a finite field, eliminating all floating-point arithmetic.

Complex Arithmetic Elimination - The NTT uses finite fields, thereby eliminating the use of floating-point representations.

Register Reduction - The NTT eliminates all complex numbers. A hardware implementation represents a complex number with two separate registers. Using the NTT, we halve the number of registers.

Complexity Reduction - The NTT eliminates all complex arithmetic operations.

An additional modification to our multiplier is a rearrangement of the structure of Radix-4. We make all commutator stages constant (refer to Figure 2). Figure 3 depicts the overall architecture of our implementation.

Figure 2: Restructured Radix-4 for a 16 point FFT

[Figure 3 diagram: two Setup blocks feed two FFT blocks, followed by the Pointwise Multiplication and Finalize blocks; the connecting buses are labeled 113 and 240.]

Figure 3: Architecture of Finite Field FFT Multiplier

3.1 Preliminary Calculations

In exchange for the NTT's finite field advantages, a lot of precalculation must be performed. We must find a sufficiently large m such that the modular ring contains a principal N-th root of unity. To determine m we must use some basic group theory. In any ring containing a principal N-th root of unity r, the values $r, r^2, \ldots, r^N$ form a cyclic group of order N.

[Figure 4 diagram: Commutator, Twiddle, Commutator, and Rearrange blocks with 16 coefficients in and 16 coefficients out; control signals En, done, RST, and Clk.]

Figure 4: Detailed FFT Module

In order to simplify the task of finding a cyclic subgroup of a proper order, we restrict our attention to prime moduli. Furthermore, it is the case that the multiplicative group of nonzero elements in any finite field is cyclic. Thus, if m is prime, the multiplicative group of nonzero elements is a cyclic group of order m − 1, containing m − 1 elements. It therefore has a cyclic subgroup of order N exactly when N divides m − 1.
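The group-theoretic argument translates directly into a search for roots of unity. The following sketch assumes a prime modulus m and a transform length N that is a power of 2; the function name `primitive_nth_roots` is hypothetical, and the brute-force loop over all residues is chosen for clarity rather than speed.

```python
def primitive_nth_roots(m, N):
    """Return all elements of order exactly N in Z_m^* (m prime, N a power of 2)."""
    if (m - 1) % N != 0:
        return []                        # no subgroup of order N exists
    roots = set()
    for g in range(2, m):
        # r^N = g^(m-1) = 1 by Fermat's little theorem, so r is an N-th root of unity.
        r = pow(g, (m - 1) // N, m)
        # For N a power of 2, the order is exactly N when r^(N/2) = -1 (mod m).
        if pow(r, N // 2, m) == m - 1:
            roots.add(r)
    return sorted(roots)

roots = primitive_nth_roots(16417, 16)   # modulus and length used in this paper
small = primitive_nth_roots(17, 4)       # tiny example: elements of order 4 mod 17
```

For the modulus and length used in this paper, the search returns eight candidates, any of which could serve as $\omega_{16}$.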

In addition to finding the value m, we must also take into account the size of N, so as to reduce the padding of the coefficients of the polynomial. After many trial and error attempts, using [2], we settled on $m = 16417 = 2^4 \times 1026 + 1$ and N = 16, where the primitive 16-th root of unity of $F_{16417}$ is $\omega_{16} = 7339$. For the twiddle factors, we must calculate all constants $\omega_N^{i}$ for i = 0, 1, ..., N − 1 and their inverses. We used the java.math package for all calculations.
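The constants named above can be recomputed with modular exponentiation. This sketch uses Python's built-in three-argument `pow` in place of the java.math package used by the authors; the variable names are illustrative assumptions, while the values of m, N, and $\omega_{16}$ are taken from the text.

```python
M = 16417        # prime modulus from the text
N = 16           # transform length
OMEGA = 7339     # stated primitive 16th root of unity

# Forward twiddle constants omega^i for i = 0..N-1.
twiddles = [pow(OMEGA, i, M) for i in range(N)]

# Inverse constants: omega^(-i) = omega^(N-i), because omega^N = 1.
inv_twiddles = [pow(OMEGA, (N - i) % N, M) for i in range(N)]

# Scaling factor 1/N in Z_M, via Fermat's little theorem (M is prime).
n_inv = pow(N, M - 2, M)

# Sanity checks on the stated parameters.
is_root = pow(OMEGA, N, M) == 1
is_primitive = pow(OMEGA, N // 2, M) == M - 1
```

Because all 16 powers are distinct and $\omega^{8} \equiv -1$, the stated value is a primitive 16-th root of unity modulo 16417.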

3.2 FFT Setup Module

The setup module of the multiplier prepares the input for the FFT modules. The inputs of the setup module are in polynomial representation. The setup module performs all the padding and incorporates all the preliminary calculations.

3.3 FFT Module

The first step in implementing the FFT was to standardize the Radix-4 structure. In all previous implementations of the FFT, the commutator module arranged the data using a lookup table in terms of the current stage. We eliminated the lookup table by changing the structure to that of Figure 2. One can see that the input and output arrangements of the butterflies remain constant.

The FFT module performs the actual FFT operation with the fixed twiddle factors. The twiddle factors are defined from the $\omega_N$ value. Figure 4 is a detailed view of the FFT module including all its components and their interconnections.

The commutator inputs the appropriate coefficients into the Radix-4 butterfly. Each commutator includes 4 Radix-4 butterflies. Figure 5 shows the butterfly operation.

[Figure 5 diagram: inputs x(n), x(n+N/4), x(n+N/2), x(n+3N/4) are combined by adders and subtractors, with the constant 3846, to produce outputs a(n), a(n+N/4), a(n+N/2), a(n+3N/4).]

Figure 5: Radix-4 Butterfly

The twiddle module multiplies the output of the commutator with the appropriate value of $\omega_N^{xn}\,\omega_{N/4}^{kn}$, where x = 0, 1, 2, 3. We know from basic arithmetic that for x = 0, n = 0 or k = 0 the twiddle factor is equal to 1 and multiplication is not needed.

The rearrange module rearranges the output into the correct order as per the bit reversal criteria.
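The commutator, twiddle, and rearrange steps can be illustrated in software for the 16-point case. The sketch below is a textbook two-stage radix-4 decimation-in-frequency NTT over $Z_{16417}$, not the authors' restructured constant-geometry design; the names `butterfly`, `ntt16_radix4`, `ntt_naive`, and `J` are assumptions made for illustration. In the finite field, $\omega^{N/4}$ takes the place of $-j$.

```python
M, OMEGA, N = 16417, 7339, 16
J = pow(OMEGA, N // 4, M)   # omega^(N/4) plays the role of -j; J*J = -1 (mod M)

def butterfly(a, b, c, d):
    """Radix-4 butterfly in Z_M: the 4-point NTT of (a, b, c, d)."""
    return ((a + b + c + d) % M,
            (a + J * b - c - J * d) % M,
            (a - b + c - d) % M,
            (a - J * b - c + J * d) % M)

def ntt16_radix4(x):
    """Two-stage radix-4 decimation-in-frequency NTT of 16 values."""
    # Stage 1: butterflies over samples N/4 apart, then twiddle by omega^(r*n).
    stage1 = [0] * N
    for n in range(4):
        outs = butterfly(x[n], x[n + 4], x[n + 8], x[n + 12])
        for r in range(4):
            stage1[4 * r + n] = outs[r] * pow(OMEGA, r * n, M) % M
    # Stage 2: 4-point butterflies on each block; no twiddles are needed.
    stage2 = [0] * N
    for r in range(4):
        stage2[4 * r:4 * r + 4] = butterfly(*stage1[4 * r:4 * r + 4])
    # Rearrange: position 4r + k holds X(4k + r), a base-4 digit reversal.
    return [stage2[4 * (i % 4) + i // 4] for i in range(N)]

def ntt_naive(x):
    """Direct O(N^2) transform used as a reference."""
    return [sum(x[n] * pow(OMEGA, n * k, M) for n in range(N)) % M
            for k in range(N)]

data = list(range(1, 17))                 # hypothetical 16 input coefficients
fast = ntt16_radix4(data)
reference = ntt_naive(data)
```

Note that $-J \bmod 16417 = 3846$, which matches the constant that appears in Figure 5.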

3.4 Point-wise Multiplier Module

The point-wise multiplier module does exactly what its name implies. It multiplies the output polynomials of the FFT on a coefficient basis. Suppose the two output polynomials of the FFT are A(x) and B(x) and the output polynomial of the point-wise multiplier is C(x). Then $C_0 = A_0 \times B_0$, $C_1 = A_1 \times B_1$, and so on. This requires a total of 16 multiplications, one for each coefficient.

3.5 Inverse FFT Module

The Inverse FFT Module has a similar structure to the FFT Module (Figure 4). The only major difference is that the twiddle module now incorporates the $\frac{1}{N}$ factor and uses the inverse twiddle constants.
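The setup, transform, point-wise, and inverse stages can be chained in a short software model. This is a minimal sketch over $Z_{16417}$ using direct $O(N^2)$ transforms for clarity; the function names are hypothetical, and the $GF(2^{113})$ reduction of the finalization module is not modeled because the reduction polynomial is not given in the text.

```python
M, OMEGA, N = 16417, 7339, 16

def ntt(x, root):
    """Direct O(N^2) transform in Z_M using the given N-th root of unity."""
    return [sum(x[n] * pow(root, n * k, M) for n in range(N)) % M
            for k in range(N)]

def intt(X):
    """Inverse transform: same structure with omega^-1, then scale by 1/N."""
    omega_inv = pow(OMEGA, N - 1, M)      # omega^-1 = omega^(N-1)
    n_inv = pow(N, M - 2, M)              # 1/N in Z_M (M is prime)
    return [(v * n_inv) % M for v in ntt(X, omega_inv)]

def ntt_multiply(p, q):
    """Multiply polynomials with coefficients in Z_M; product must fit in N terms."""
    assert len(p) + len(q) - 1 <= N
    A = ntt(p + [0] * (N - len(p)), OMEGA)    # zero padding + forward NTT
    B = ntt(q + [0] * (N - len(q)), OMEGA)
    C = [(a * b) % M for a, b in zip(A, B)]   # 16 point-wise products
    c = intt(C)                               # inverse NTT with the 1/N factor
    return c[:len(p) + len(q) - 1]            # drop the padded coefficients

result = ntt_multiply([1, 2, 3], [4, 5, 6])
round_trip = intt(ntt(list(range(16)), OMEGA))
```

Unlike a floating-point FFT, every intermediate value here is an exact integer modulo M, so no rounding step is required as long as the true product coefficients are smaller than M.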

3.6 Finalization Module

The finalization module performs the reverse operation of the setup module. The finalization module rearranges the output, removes the padded coefficients, reconstructs the product and performs the $GF(2^{113})$ reduction.

4 Results

We did not come across any FFT multipliers in our research; therefore we do not include a comparison of the performance of the complete FFT multiplier. The next table compares the FFT module itself, which has been implemented extensively for signal processing applications. For comparison purposes we chose an implementation that has similar characteristics. The Se-Hann Lee and Shao-Hua Shih implementation [3] was an FPGA Radix-4 design. Despite the fact that our input coefficient is 16 bits compared to their 8 bits, we can see improvement in all aspects. We believe this is due to the use of the NTT approach, which eliminates all complex numbers and complex arithmetic.

              FFT Module       Lee and Shih [3]
Input         16-pt Radix-4    16-pt Radix-4
Coeff. size   16 bits          8 bits
Language      Verilog          Verilog
Type          Pipeline NTT     Pipeline FFT
Software      Xilinx ISE       Xilinx ISE
Max Freq      527.426 MHz      5.796 MHz
Clk Cycles    2                18
Time          3.79 ns          3.02 µs
Slices        11218            19872

The results of our FFT multiplier, which include the setup modules and finalization module, are displayed in the following table.

                FFT Multiplier
Max Frequency   527.426 MHz
Clock Cycles    5
Time            9.48 ns
Slices          45865
LUT/FF          81995/8
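As a consistency check on the reported figures for our design, the times follow from dividing the clock-cycle count by the maximum frequency. The symbols $t$, $c$, and $f_{\max}$ are introduced here only for illustration:

$$t = \frac{c}{f_{\max}}, \qquad \frac{2}{527.426\ \text{MHz}} \approx 3.79\ \text{ns}, \qquad \frac{5}{527.426\ \text{MHz}} \approx 9.48\ \text{ns}$$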

5 Conclusions

In this paper, we have presented a new finite field FFT multiplier implementation, which uses the NTT approach. The NTT approach has many advantages including reduced complexity, fewer registers and elimination of error, making this, to our knowledge, the first and fastest multiplier of its kind.

References

[1] Fedorenko, S. and Trifonov, P., On Computing the Fast Fourier Transform over Finite Fields, In Proceedings of the Eighth International Workshop on Algebraic and Combinatorial Coding Theory, Tsarskoe Selo, Russia, pp. 108-111, 2002.

[2] Howell, R. R., How To Find FFT Constants, available at http://www.cis.ksu.edu/~howell/calculator/how.html, 2001.

[3] Lee, S. and Shih, S., FPGA Based Solutions for The Fourier Transform, available at http://web.uvic.ca/~sshih/index.html, University of Victoria, July, 2004.

[4] Moenck, R. T., Practical Fast Polynomial Multiplication, In Proceedings of the Third ACM Symposium on Symbolic and Algebraic Computation (SYMSAC), pp. 136-148, 1976.

[5] Oppenheim, Alan V. and Willsky, Alan S., Signals And Systems, Second Edition. Prentice Hall Signal Processing Series, 1997.

[6] Wu, Y. and Zhu, Z., The New Real-Multiplier FFT-J Algorithms, In Proceedings of the IEEE 1993 National Aerospace and Electronics Conference (NAECON), vol. 1, pp. 90-93, 1993.
