Optimization of Rotations in FFTs


Linköping Studies in Science and Technology
Dissertations, No. 1423

    Optimization of Rotations in FFTs

    Fahad Qureshi

    Department of Electrical Engineering

Linköping 2012


Linköping Studies in Science and Technology
Dissertations, No. 1423

Fahad Qureshi
[email protected]
Division of Electronics Systems
Department of Electrical Engineering
Linköping University
SE-581 83 Linköping, Sweden

Copyright © 2012 Fahad Qureshi, unless otherwise noted. All rights reserved.
Papers A, B, C, E, and G reprinted with permission from IEEE.
Paper D, partial work, reprinted with permission from IEEE.
Paper F reprinted with permission from IEICE.

Qureshi, Fahad
Optimization of Rotations in FFTs
ISBN 978-91-7519-973-3
ISSN 0345-7524

Typeset with LaTeX 2ε
Printed by LiU-Tryck, Linköping, Sweden 2012


    To my loving parents and family


Popular Science Summary

The discrete Fourier transform (DFT) is a digital signal processing algorithm used in many contexts. Examples include measuring which frequencies a signal contains, as well as certain types of communication, such as broadband over the telephone network (ADSL), wireless networks (WiFi), and fourth-generation mobile telephone systems (LTE).

Computing the discrete Fourier transform is very demanding, and so-called fast Fourier transforms (FFTs) have been under development for a long time. Although the famous mathematician Carl Friedrich Gauss described a similar technique in the early 1800s, FFTs did not gain widespread use until the mid-1950s, when they were proposed as a way to compute a DFT efficiently on the new computers that were beginning to appear here and there.

FFT algorithms are based on the fact that a long sequence of data can be split into several shorter sequences, the DFTs of the parts can be computed separately, and the results can then be combined into the final answer. Doing this requires multiplying the partial results by a complex number whose magnitude is one, which is referred to as rotating the partial results.

As an example, a DFT of length 16 can be split into eight DFTs of length two. Their outputs are rotated and fed into two DFTs of length eight. The DFTs of length eight can be split into smaller parts, and so on. These decompositions can be done in different ways, but so far only a few of them have been explored, even though more than 50 years have passed since the FFT algorithm was popularized. The different decompositions lead to FFT algorithms that are fundamentally alike, but with one important difference: the rotations are not equally hard to realize in hardware. This thesis studies this fact and the impact it has on implementations.

The contributions of this work can be divided into two parts. First, we have computed exactly how many different ways the decomposition can be done and examined these from various aspects. Second, we have proposed several efficient ways to realize a subset of the rotations in hardware.

By using graphs to represent the decompositions, we have determined the number of different possible decompositions. These are far more numerous than the dozen or so that have previously been proposed explicitly in the literature, and their number grows quickly with the number of data points used for the DFT. We have studied them based on how many rotations are needed; sometimes a rotation can be simplified, in the extreme case to no rotation at all. We have also studied how much the coefficients being multiplied by change, which affects how much energy the multiplications consume, how much memory is needed to store all the coefficients, and how much digital noise the multiplications generate. The results are in many cases remarkable: it turns out that the previously most used FFT algorithm is in many ways the worst, and that using other common algorithms on the wrong hardware can markedly reduce certain cost measures. However, several of the costs partly counteract each other, so no single algorithm is best at everything. In practice, one has to choose which cost is most important. This work, however, increases the number of possible solutions significantly.

When multiplying by a constant, the hardware can be simplified by decomposing the operation into multiplications by two, which can be implemented without any circuitry, together with additions and subtractions, just as multiplication by ten can be done by appending a zero to a decimal number. For example, consider a multiplication by 15. Instead of an ordinary multiplication, we can first multiply by 16, which is two times two times two times two, and then subtract the original number. In this way the multiplication can be carried out with a subtraction, which is much less costly to build in hardware than a multiplier. When multiplying by a small set of constants, similar techniques can be used, with added hardware that reroutes the connections so that exactly the selected constants can be created. Some rotations in an FFT algorithm can only take on a few values, and we have exploited this to propose a number of efficient circuits for multiplication by precisely those numbers. The first is based on using trigonometric identities to express a number of different rotation constants in terms of other numbers. The second method is based on the fact that the magnitude of the rotation does not actually need to be one, as long as it is the same for all rotations in a decomposition. In this way we have managed to reduce the number of additions and subtractions further, and some of the results should be optimal; that is, we should have the highest accuracy attainable for a given number of additions and subtractions.
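The shift-and-add idea in the summary (multiplying by 15 as 16x − x) can be sketched in a few lines of Python. This is an illustrative software analogue of the hardware technique, not code from the thesis:

```python
def mul15(x: int) -> int:
    # 15*x = 16*x - x: in hardware, a shift by four bit positions (free wiring)
    # followed by one subtraction replaces a full multiplier.
    return (x << 4) - x

def mul10(x: int) -> int:
    # 10*x = 8*x + 2*x: two shifts and one addition, analogous to appending
    # a zero in decimal.
    return (x << 3) + (x << 1)

assert mul15(7) == 7 * 15
assert mul10(13) == 13 * 10
```

The same principle, applied to the small set of constants a twiddle factor can take, underlies the low-complexity rotators proposed in the thesis.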


Preface

This dissertation presents my research from December 2007 to December 2011 at the Division of Electronics Systems, Department of Electrical Engineering, Linköping University, Sweden. The following papers are included in this thesis.

    Paper A

The analysis of different approaches to storing the twiddle factor coefficients of radix-2^i FFT algorithms for an SDF (single-path delay feedback) pipelined FFT architecture is discussed. The comparative analysis of the different approaches is based on complexity when implemented on both FPGAs and ASICs.

F. Qureshi and O. Gustafsson, "Analysis of twiddle factor memory complexity of radix-2^i pipelined FFTs," in Proc. Asilomar Conf. Signals Syst. Comp., Pacific Grove, CA, Nov. 1–4, 2009, pp. 217–220.

    Paper B

Radix-2^2 equivalent FFT algorithms based on binary tree decomposition are proposed. In this work, we analyze the differences among these algorithms in terms of the switching activity of the twiddle factor multiplication, which is related to the power consumption of the circuit. Experimental results show that the equivalent algorithms can lead to savings of 40% in switching activity.

F. Qureshi and O. Gustafsson, "Twiddle factor memory switching activity analysis of radix-2^2 and equivalent FFT algorithms," in Proc. IEEE Int. Symp. Circuits Syst., Paris, France, May 30–Jun. 2, 2010, pp. 4145–4148.

    Paper C

Based on the binary tree decomposition, radix-2 FFT algorithms for a 4k-point FFT are proposed. These are compared in terms of complexity and switching activity of the twiddle factor multiplication. The results show the trade-off between complexity and switching activity.

F. Qureshi, S. A. Alam, and O. Gustafsson, "4k-point FFT algorithms based on optimized twiddle factor multiplication for FPGAs," in Proc. IEEE Asia Pacific Postgraduate Research Microelectron. Electron., Shanghai, China, Sept. 22–24, 2010, pp. 225–228.

    Paper D

A systematic method for the generation of all possible FFT algorithms based on binary tree decomposition is proposed. This method generates radix-2 FFT algorithms for the computation of 2^n-point FFTs. The differences among these FFT algorithms lie in the properties of the twiddle factor multiplication, such as switching activity, size of coefficient memories, and round-off noise, which are related to the power consumption, area, and accuracy of the circuit. Experimental results show the importance of proper algorithm selection, since suitable algorithms can achieve very high savings in all properties of the twiddle factor multiplication.

F. Qureshi, M. Garrido, and O. Gustafsson, "Generation of all radix-2 fast Fourier transform algorithms using binary trees and its analysis," manuscript.

A preliminary version of the above work can be found in:

F. Qureshi and O. Gustafsson, "Generation of all radix-2 fast Fourier transform algorithms using binary trees," in Proc. European Conf. Circuit Theory Design, Linköping, Sweden, Aug. 29–31, 2011, pp. 677–680.

    Paper E

A process denoted addition aware quantization is proposed. By searching among coefficients with a longer word length, this process can determine multiplication coefficients that have the same low complexity as the rounded value but a smaller approximation error, or a lower complexity but the same approximation error.

O. Gustafsson and F. Qureshi, "Addition aware quantization for low complexity and high precision constant multiplication," IEEE Signal Process. Lett., vol. 17, no. 2, pp. 173–176, Feb. 2010.

    Paper F

A novel low-complexity structure for the computation of twiddle factors in FFTs based on shift-and-add multiplication is presented. This structure computes the twiddle factor multiplication with a resolution of 32 points. In addition, the work revisits and improves the complexity of a twiddle factor with a resolution of 16 points and, for completeness, calculates the coefficients and the complexity of a twiddle factor with a resolution of 8 points. Furthermore, finite word length analysis is performed for all coefficients. The round-off errors and the coefficients optimized by addition aware quantization for varying requirements have been derived.

F. Qureshi and O. Gustafsson, "Low-complexity constant multiplication based on trigonometric identities with applications to FFTs," IEICE Trans. Fundamentals, vol. E94-A, no. 11, pp. 2361–2368, Nov. 2011.

    Preliminary versions of the above work can be found in:

F. Qureshi and O. Gustafsson, "Low-complexity reconfigurable complex constant multiplication for FFTs," in Proc. IEEE Int. Symp. Circuits Syst., Taipei, Taiwan, May 24–27, 2009, pp. 1137–1140.


F. Qureshi and O. Gustafsson, "Error analysis of low-complexity reconfigurable complex constant multiplication for FFTs," in Proc. Swedish System-on-Chip Conf., Rusthållargården, Arild, Sweden, May 4–5, 2009.

Paper G

A low-complexity realization of constant angle rotators based on shifts, adders, and subtractors is presented. The results show that redundant CORDIC and scaled constant multiplication provide the best results, depending on which angle is considered. It is also shown that the precision can vary by several bits for the same number of adders and subtractors; hence, the correct choice of rotator architecture is crucial for a low-complexity realization.

F. Qureshi, M. Garrido, and O. Gustafsson, "Alternatives for low-complexity complex rotators," in Proc. IEEE Int. Conf. Electron. Circuits Syst., Athens, Greece, Dec. 12–15, 2010, pp. 17–20.

Paper H

A novel approach for obtaining low-adder rotators is presented. The approach is based on scaling the complex coefficients that define the rotations. As a result, it can be used to derive rotators that are efficient in both area and performance.

M. Garrido, F. Qureshi, and O. Gustafsson, "Low-adder rotators based on coefficient scaling," manuscript.

    Paper I

A unified architecture is proposed to compute 2-, 3-, 4-, 5-, and 7-point DFTs, based on the Winograd Fourier transform algorithm (WFTA). The complexity of the unified architecture is equal to the complexity of the 7-point DFT plus six multiplexers.

F. Qureshi, M. Garrido, and O. Gustafsson, "Unified architecture based on Winograd Fourier transform algorithm for 2, 3, 4, 5, and 7 small point DFTs," manuscript.

Contributions have also been made in the following publications, but their contents are not directly relevant, or are less relevant, to the topic of this thesis.

M. Abbas, F. Qureshi, Z. Sheikh, O. Gustafsson, H. Johansson, and K. Johansson, "Comparison of multiplierless implementation of nonlinear-phase versus linear-phase FIR filters," in Proc. Asilomar Conf. Signals Syst. Comp., Pacific Grove, CA, Oct. 26–29, 2008, pp. 598–601.

S. Athar, O. Gustafsson, F. Qureshi, and I. Kale, "On the efficient computation of single-bit input word length pipelined FFTs," IEICE Electronics Express, vol. 8, no. 17, pp. 1437–1443, 2011.


Acknowledgments

All praise and gratitude to almighty ALLAH, the most gracious and the most merciful, who gave me the courage and strength and made it possible for me to write this thesis as the result of four years of continuous effort and research. Apart from the long-term commitment and consistent endeavors, there are people to whom I owe a lot. Their kind guidance, subjective information, and moral encouragement helped me greatly in the accomplishment of this task. They certainly deserve my warmest gratitude for their contributions and support throughout my thesis.

My supervisor Dr. Oscar Gustafsson for his continuous guidance, patience, and selfless helping attitude throughout these four years. His sage advice, insightful criticism, and patient encouragement aided the writing of this thesis in innumerable ways. He made me realize what I am capable of, gave me his valuable time, and understood my ideas even when I could not convey them.

My co-supervisor Dr. Mario Garrido, who played a very positive role in my learning and understanding during my PhD research work.

Prof. Håkan Johansson, who first accepted me as a PhD student and provided me a great opportunity and a higher seat of learning in the very competitive and motivational research environment at the Electronics Systems Division.

The Higher Education Commission (HEC) of Pakistan is gratefully acknowledged for its financial support. I am thankful to the Swedish Institute (SI) for coordinating the scholarship program. Linköping University is also deeply appreciated for the overall support.

Dr. Amir Eghbali and Dr. Anton Blad for their help with LaTeX problems.

Dr. Muhammad Abbas and Zaka Ullah for their valuable suggestions and discussions. I always received their help and cooperation whenever needed.

    Peter Johansson for helping me with all administrative issues.

All previous and present members of the Electronics Systems Division for their moral support and for creating such a friendly environment in the group.

My friends Dr. Rashid Ali and Dr. Rashad Ramzan for their extraordinary help and guidance at the start of my PhD studies.

My friends Imran Qureshi, Syed Asad Alam, and Muhammad Irfan Kazim for their kind help in proofreading my thesis.



My friends and colleagues in Pakistan for the help and care they have provided during my career, especially Imtiaz Rabbani, Sabz Ali, Muhammad Harris Bhutto, Abdul Jabbar Jamali, and Asif Hussain Memon.

All my friends who provided me support and encouragement during my stay in Sweden, especially Dr. Rizwan Asghar, Dr. Naveed Ahsan, Dr. Ihsan ullah Kashif, Ali Saeed, Saima Athar, Nadeem Afzal, Muhammad Tauqir Pasha, Syed Ahmed Aamir, Fahad Qazi, Zafar Iqbal, Muhammad Saifullah Khan, Dr. Jawad ul Hasan, Muhammad Junaid, Muhammad Farooq, Jawad Ahmed, Muhammad Fahim Ul Haque, and many more.

All my family members, including my uncles, aunts, and cousins, for their prayers and support during my career.

My younger brothers Fawad Qureshi and Asadullah Qureshi and my sister Zuha Qureshi for their support in taking care of the responsibilities at my native home during my absence.

My in-laws for their prayers and encouragement throughout these four years.

Special thanks to my mother and my father for their non-stop prayers and encouragement, which are definitely part of the reason that I have come this far in life.

Finally, my life-partner Sumera Fahad for her devotion, patience, and unconditional support. Certainly, this work would never have been possible without her. And my son Arhm, for taking away all the day-time worries with his innocent acts.


Contents

1 Introduction
  1.1 Basic Concept of FFT Algorithms
  1.2 Applications of DFT
  1.3 Rotation and Its Importance

2 FFT Algorithms
  2.1 Cooley-Tukey FFT Algorithm
    2.1.1 Classic FFT Algorithms
    2.1.2 Improved FFT Algorithms
    2.1.3 Split-Radix FFT Algorithm
    2.1.4 Prime Factor Algorithm
  2.2 Winograd Fourier Transform Algorithm
  2.3 Other Fourier Transform Algorithms
    2.3.1 Bluestein Chirp Fourier Transform
    2.3.2 Rader Prime Algorithm
    2.3.3 Goertzel Algorithm

3 FFT Architectures
  3.1 Direct Implementation
  3.2 Memory-Based Architectures
  3.3 Pipelined
    3.3.1 Single-Path Delay Feedback Pipelined FFT Architecture
    3.3.2 Multi-Path Delay Feedback Pipelined FFT Architecture
    3.3.3 Multi-Path Delay Commutator Pipelined FFT Architecture
  3.4 Bit-Reversal / I/O Order

4 Rotation and its Hardware
  4.1 Rotation
  4.2 General Complex Multiplier
    4.2.1 Approach I
    4.2.2 Approach II
    4.2.3 Approach III
  4.3 CORDIC
  4.4 Constant Multiplication

5 Finite Word Length Effects
  5.1 Coefficient Quantization
  5.2 Round-Off Error
  5.3 Overflow
  5.4 Scaling
    5.4.1 Static Data Scaling
    5.4.2 Dynamic Data Scaling

6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work

References

A Analysis of Twiddle Factor Memory Complexity of Radix-2^i Pipelined FFTs
  1 Introduction
  2 Architectures for Twiddle Factor Memories
    2.1 Single Look-Up Table
    2.2 Twiddle Factor Memory with Address Mapping
    2.3 Twiddle Factor Memory with Address Mapping and Symmetry
    2.4 Address Mapping
  3 Analysis and Results
  4 Conclusions
  References

B Twiddle Factor Memory Switching Activity Analysis of Radix-2^2 and Equivalent FFT Algorithms
  1 Introduction
  2 Radix-2^2 FFT and its Equivalent Algorithms
    2.1 Case I
    2.2 Case II
    2.3 Case III
    2.4 Cases IV and V
  3 Results
  4 Conclusions
  References

C 4k-Point FFT Algorithms Based on Optimized Twiddle Factor Multiplication for FPGAs
  1 Introduction
  2 Criteria for Algorithm Selection
    2.1 Complexity of W_N-Multiplier
    2.2 Switching Activity
  3 Proposed Architectures Based on Radix-2^i
  4 Results
  5 Conclusions
  References

D Generation of All Radix-2 Fast Fourier Transform Algorithms Using Binary Trees and Its Analysis
  1 Introduction
  2 Architecture
  3 Binary Tree Representation of Cooley-Tukey Algorithm
  4 FFT Algorithms Generated by Binary Tree and Architecture
    4.1 Number of Algorithms
    4.2 Twiddle Factor Index Generation
  5 Implementation Parameters of Twiddle Factor Multiplication
    5.1 Switching Activity
    5.2 Memory Complexity
    5.3 Number of Non-Trivial Multiplications
    5.4 Round-Off Noise
  6 Quantitative Results
    6.1 Switching Activity
    6.2 Memory Complexity
    6.3 Number of Non-Trivial Multiplications
    6.4 Round-Off Noise
    6.5 Comparing Classic Algorithms and Good Candidates
  7 Conclusions
  References

E Addition Aware Quantization for Low Complexity and High Precision Constant Multiplication
  1 Introduction
  2 Addition Aware Quantization
  3 Design Examples
    3.1 Rational Numbers
    3.2 Trigonometric Constants
    3.3 CORDIC Scale Factor Compensation
    3.4 Joint Optimization of Several Factors
  4 Conclusions
  5 Acknowledgment
  References

F Low-Complexity Constant Multiplication Based on Trigonometric Identities with Applications to FFTs
  1 Introduction
  2 Complex Constant Multipliers
    2.1 W_8-Multiplier
    2.2 W_16-Multiplier
    2.3 W_32-Multiplier
  3 Finite Wordlength Error Analysis
    3.1 Coefficient Quantization Error
    3.2 Round-Off Noise
  4 Results
    4.1 Coefficient Quantization and Optimized Coefficients
    4.2 Comparison with Previous Method
  5 Conclusions
  6 Acknowledgments
  References

G Alternatives for Low-Complexity Complex Rotators
  1 Introduction
  2 Rotator Alternatives
    2.1 CORDIC
    2.2 Constant Multiplication
    2.3 Scaled Constant Multiplication
    2.4 General Scaled Constant Multiplication
    2.5 Addition Aware Quantization
  3 Results
  4 Conclusions
  References

H Low-Adder Rotators Based on Coefficient Scaling
  1 Introduction
  2 Rotations in Fixed-Point Arithmetic
  3 Obtaining Low-Adder Kernels
    3.1 Obtaining Candidate Kernels
    3.2 Calculating the Number of Adders for Each Rotation
    3.3 Calculating the Number of Adders for the Kernel
    3.4 Choosing the Most Suitable Kernel
    3.5 Finding a Low-Complexity Realization
  4 Proposed Low-Adder Kernels
  5 Hardware Implementation of Low-Adder Kernels
  6 Comparison
  7 Conclusion
  References

I Unified Architecture Based on Winograd Fourier Transform Algorithm for 2, 3, 4, 5, and 7 Point DFTs
  1 Introduction
  2 Architectures of 2, 3, 4, 5, and 7-Point DFTs
  3 Proposed Unified Architecture
  4 Conclusions
  References


    Chapter 1

    Introduction

The discrete Fourier transform (DFT) is an important signal processing block in various applications, such as communication systems, patient monitoring, and speech, signal, and image processing. Efficient implementation of the DFT is fundamental in many cost- and hardware-constrained applications. Various methods for efficiently computing the DFT have been the subject of a large body of published literature. These methods are commonly referred to as fast Fourier transform (FFT) algorithms.

In this chapter, the basic concepts of FFT algorithms are presented. The chapter further discusses different applications of the DFT and the importance of rotations in transforms.

1.1 Basic Concept of FFT Algorithms

FFT algorithms are used to compute DFTs efficiently. These algorithms are not based on any approximation, so they produce the same results as the direct computation of the DFT. The DFT is defined as:

    X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk},   k = 0, 1, \ldots, N-1,   (1.1)

where N represents the size of the DFT, n is the time index, k is the frequency index, and W_N^{nk} is the twiddle factor. The twiddle factor W_N^{nk} is defined as:

    W_N^{nk} = e^{-j 2\pi nk/N} = \cos(2\pi nk/N) - j \sin(2\pi nk/N).   (1.2)

The inverse discrete Fourier transform (IDFT), which recovers the original sequence, is given by:

    x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k) W_N^{-kn},   n = 0, 1, \ldots, N-1.   (1.3)
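As a concrete sketch, (1.1) and (1.3) can be evaluated directly in Python. This is an illustrative O(N^2) reference implementation, not an FFT:

```python
import cmath

def dft(x):
    """Direct evaluation of (1.1): X(k) = sum_n x(n) * W_N^{nk}."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N)
                for n in range(N)) for k in range(N)]

def idft(X):
    """Direct evaluation of (1.3): x(n) = (1/N) sum_k X(k) * W_N^{-kn}."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N)
                for k in range(N)) / N for n in range(N)]

# The IDFT recovers the original sequence, as (1.3) states.
x = [1.0, 2.0, 3.0, 4.0]
assert all(abs(a - b) < 1e-9 for a, b in zip(idft(dft(x)), x))
```

Each output sample requires N complex multiplications, which motivates the fast algorithms discussed next.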


To compute the DFT efficiently, a number of fast Fourier transform algorithms have been proposed to reduce the computational complexity, including the Cooley-Tukey algorithm [1], the prime factor algorithm [2, 3], and the Winograd algorithm [4, 5]. The approach that unifies the derivation of these algorithms is divide and conquer [6]. In this approach, the N-point DFT is mapped into several sub-DFTs in such a way that the following condition is satisfied:

    \sum_{i=1}^{L} \mathrm{Cost}(\text{sub-DFT}_{N(i)}) + \mathrm{Cost}(\text{interconnections}) \leq \mathrm{Cost}(\mathrm{DFT}_N),   (1.4)

where L is the number of sub-DFTs, Cost(DFT) is the number of operations, Cost(interconnections) is the cost of the twiddle factor multiplications and index mapping, and DFT_N is the N-point DFT. The power of the divide-and-conquer approach lies in applying the decomposition recursively until the sub-DFTs are sufficiently small, which reduces the order of complexity of the FFT algorithm. The approach can also be applied directly to convolution algorithms, breaking them down into multiple small convolutions [2].
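The recursive divide-and-conquer idea can be sketched for the radix-2 case, where an N-point DFT is split into two N/2-point sub-DFTs whose outputs are combined after rotation by the twiddle factors (an illustrative Python sketch, assuming N is a power of two):

```python
import cmath

def fft(x):
    """Radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    N = len(x)
    if N == 1:
        return list(x)
    even = fft(x[0::2])   # N/2-point sub-DFT of the even-indexed samples
    odd = fft(x[1::2])    # N/2-point sub-DFT of the odd-indexed samples
    # Rotate the odd sub-DFT outputs by the twiddle factors W_N^k.
    t = [cmath.exp(-2j * cmath.pi * k / N) * odd[k] for k in range(N // 2)]
    return ([even[k] + t[k] for k in range(N // 2)] +
            [even[k] - t[k] for k in range(N // 2)])

# Same result as direct evaluation of (1.1), with O(N log N) operations.
X = fft([1, 2, 3, 4])
assert abs(X[0] - 10) < 1e-9 and abs(X[1] - (-2 + 2j)) < 1e-9
```

The twiddle factor rotations in the combine step are exactly the operations whose hardware cost this thesis optimizes.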

1.2 Applications of DFT

The discrete Fourier transform finds limitless applications in many areas of signal processing. In the era of fast computing, it has become increasingly important to enhance existing FFT algorithms to meet the ever-increasing range of applications in the field of digital signal processing.

DFTs have also been used extensively, in the form of FFT processors, in multi-carrier transmission systems such as orthogonal frequency-division multiplexing (OFDM). In multi-carrier modulation, such as OFDM and discrete multitone (DMT), data symbols are transmitted in parallel on multiple sub-carriers, and multi-carrier modulation-based transceivers involve real-time DFT computations [7]. The FFT-based channel estimation method is derived from the maximum likelihood criterion and was originally proposed for OFDM systems with pilot preambles. In order to save bandwidth and improve system performance, decision-feedback (DF) data symbols are usually exploited to track channel variations in subsequent OFDM data symbols; this method is called DF DFT-based channel estimation [8].
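The role of the DFT pair in multi-carrier modulation can be illustrated with a small sketch: the transmitter maps sub-carrier symbols to a time-domain signal with an IDFT, and the receiver recovers them with a DFT. This is illustrative Python with hypothetical symbol values; real systems add cyclic prefixes, channels, and equalization:

```python
import cmath

def transform(x, sign):
    # sign = -1 gives the DFT, sign = +1 the (unnormalized) IDFT.
    N = len(x)
    return [sum(x[n] * cmath.exp(sign * 2j * cmath.pi * n * k / N)
                for n in range(N)) for k in range(N)]

# Hypothetical QPSK symbols on 8 sub-carriers.
symbols = [1+1j, 1-1j, -1+1j, -1-1j, 1+1j, -1-1j, 1-1j, -1+1j]

# Transmitter: the IDFT turns parallel sub-carrier symbols into a time signal.
time_signal = [s / len(symbols) for s in transform(symbols, +1)]

# Receiver: the DFT recovers the sub-carrier symbols.
recovered = transform(time_signal, -1)
assert all(abs(a - b) < 1e-9 for a, b in zip(recovered, symbols))
```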

    The DFT is also extensively used in sonar and radar systems, which require millions of multiplications per second. Seismic processing requires extensive processing of seismic data [9]. Computerized tomography, which is widely used to synthetically form images of internal organs of the human body, likewise demands massive amounts of signal processing. Remote sensing is another field employing huge amounts of processing: satellite photographs can be processed digitally to merge several images, enhance features or combine information received on different wavelengths.


Chapter 2

FFT Algorithms

2.1 Cooley-Tukey FFT Algorithm

    k = Qk_1 + k_2,    k_1 = 0, 1, ..., P-1,    k_2 = 0, 1, ..., Q-1.    (2.5)

    Using (2.4) and (2.5), the DFT (2) can be written as:

    X(k) = \sum_{n_1=0}^{P-1} \sum_{n_2=0}^{Q-1} x(n) W_N^{(n_1 + Pn_2)(Qk_1 + k_2)}.    (2.6)

    Further development, considering that W_N^{n_1Qk_1} = W_P^{n_1k_1}, W_N^{n_2Pk_2} = W_Q^{n_2k_2} and W_N^{Pn_2Qk_1} = 1, leads to:

    X(Qk_1 + k_2) = \sum_{n_1=0}^{P-1} \sum_{n_2=0}^{Q-1} x(n_1 + Pn_2) W_P^{n_1k_1} W_N^{n_1k_2} W_Q^{n_2k_2}
                  = \sum_{n_1=0}^{P-1} \underbrace{\Big[\sum_{n_2=0}^{Q-1} x(n_1 + Pn_2) W_Q^{n_2k_2}\Big]}_{Q\text{-point DFT}} \underbrace{W_N^{n_1k_2}}_{\text{twiddle factor}} \underbrace{W_P^{n_1k_1}}_{P\text{-point DFT}},    (2.7)

    where the two summations are indexed by n_1 and n_2 and are referred to as the inner and outer DFTs. As a result, an N-point DFT is broken down into P-point and Q-point DFTs. The output of the inner DFT is multiplied by W_N^{n_1k_2}, which is called the twiddle factor multiplication. The scheme of the decomposition is shown in Fig. 2.1, where the left and right sides represent the P-point inner DFT and the Q-point outer DFT, respectively. Between those DFTs, a twiddle factor multiplication indicates a rotation by W_N = e^{-j2\pi/N}.
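    The decomposition into Q-point inner DFTs, twiddle factor rotations and P-point outer DFTs can be checked numerically. The following plain-Python sketch (illustrative naming, not thesis code) computes an N = P·Q point DFT that way and can be compared against the direct DFT:

```python
import cmath

def dft(x):
    """Direct N-point DFT used as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def dft_pq(x, P, Q):
    """Cooley-Tukey mapping for N = P*Q: Q-point inner DFTs, twiddle factors,
    P-point outer DFTs, with output index k = Q*k1 + k2."""
    N = P * Q
    W = lambda M, e: cmath.exp(-2j * cmath.pi * e / M)
    X = [0j] * N
    for k1 in range(P):
        for k2 in range(Q):
            acc = 0j
            for n1 in range(P):
                inner = sum(x[n1 + P * n2] * W(Q, n2 * k2) for n2 in range(Q))
                acc += inner * W(N, n1 * k2) * W(P, n1 * k1)  # twiddle, then outer DFT
            X[Q * k1 + k2] = acc
    return X
```

    For N = 15 with P = 3 and Q = 5, dft_pq reproduces the direct DFT to within rounding error.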

    The binary tree representation is an effective way of understanding the differences between various FFT algorithms [28]. In the binary tree representation, a node with value n represents a 2^n-point DFT. Each node has at most two leaves, which represent the inner and outer DFTs. An example of a binary tree corresponding to this DFT scheme is shown in Fig. 2.1. The left leaf corresponds to a set of 2^q DFTs of size 2^p, whereas the right leaf represents a set of 2^p DFTs of size 2^q. This decomposition is applied recursively until the leaf nodes match the radix. When the binary tree is finalized, all leaf nodes correspond to the radix and the internal nodes correspond to the twiddle factor multiplication stages, which connect the two decomposed DFTs.

    2.1.1 Classic FFT Algorithms

    In this section, classic FFT algorithms are discussed. Firstly, the Cooley-Tukey algorithm is based on decomposing the N-point DFT into s = log_r N stages,


    Figure 2.1: Decomposition scheme and binary tree of the Cooley-Tukey FFT algorithm (log_2(N) = n = p + q).

    Figure 2.2: Flow graph of the radix-2 butterfly.

    where r is the radix of the FFT and N is a power of r. The decomposition splits an N-point DFT into two arbitrary factors and is continued recursively until all the remaining transform sizes are equal to r. Each radix-r FFT is called a butterfly. It has r inputs and r outputs and calculates an r-point FFT. The most popular radices are 2 and 4, which are suited for FFT sizes that are powers of 2 and 4, respectively. Figure 2.2 represents a radix-2 butterfly, which calculates:

    X(0) = x(0) + x(1),
    X(1) = x(0) - x(1).    (2.8)

    Figure 2.3 shows a radix-4 butterfly. Note that it requires a complex rotation by e^{-j\pi/2}. This complex rotation is considered a trivial rotation as it can be


    Figure 2.3: Flow graph of the radix-4 butterfly.

    realized by interchanging the real and imaginary parts and/or changing the sign of the input data. Rotations by 1, -1, j and -j are also trivial. Furthermore, higher radices generally reduce the number of stages at the cost of increasing the complexity of each stage, because radices higher than 4 require butterflies with non-trivial rotations. A non-trivial rotation can be implemented using a general complex multiplier, the CORDIC algorithm or constant multiplications. As a result, the selection of the radix has a significant impact on the complexity of the selected FFT algorithm.
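    Trivial rotations can be realized without any multiplier at all. As a minimal sketch (the helper name is illustrative), multiplication by -j reduces to swapping the real and imaginary parts and negating one of them:

```python
def rotate_minus_j(z):
    """Trivial rotation by -j: swap real and imaginary parts and negate one,
    since (-j) * (a + jb) = b - ja. No multiplier is needed."""
    return complex(z.imag, -z.real)
```

    Rotations by 1, -1 and j are handled analogously, which is why radix-2 and radix-4 butterflies can be built from adders and wiring alone.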

    Figures 2.4(a), (b) and (c) show the binary trees of the radix-2 DIF, radix-2 DIT and radix-4 DIF FFT algorithms, respectively. The node number represents the twiddle factor multiplication between the two decomposed stages, and the index of the twiddle factor can be generated from a counter in the range [0, N-1] [19]. If compared, the radix-4 FFT algorithm has half the number of stages of the radix-2 FFT algorithm. Likewise, the number of stages can be reduced further by using a higher-radix FFT algorithm. Finally, the most common decompositions, namely decimation in time (DIT) and decimation in frequency (DIF), are discussed. For an N-point DFT, a decomposition using decimation in time separates the input sequence into its even- and odd-indexed samples. This breaks down the N-point DFT into two N/2-point DFTs plus some additions and rotations. Then, the decomposition proceeds iteratively on the N/2-point DFTs until the whole algorithm has been simplified. Likewise, decimation in frequency decomposes the DFT iteratively into DFTs of half size. Unlike DIT, DIF starts from the output frequencies by separating them into even- and odd-indexed ones. Figures 2.4(a) and (b) represent the binary trees of a radix-2 DIF and DIT FFT, respectively. By comparing both figures, it can be observed that the DIF and DIT decompositions of the FFT only differ in the location of the twiddle factor multiplications.

    2.1.2 Improved FFT Algorithms

    Improved FFT algorithms are basically modified versions of the classic FFT algorithms. As discussed for the classical algorithms, the selection of the radix has a large impact on the complexity of an FFT algorithm. Higher radices help in reducing


    Figure 2.4: Binary trees of classical FFT algorithms: (a) Radix-2 DIF, (b) Radix-2 DIT, and (c) Radix-4 DIF.

    Figure 2.5: Binary tree of the radix-2^2 FFT algorithm.

    the number of stages. However, the complexity of the butterflies increases [10]. The radix-2^2 FFT algorithm is one of the most popular FFT algorithms. In radix-2^2 the number of non-trivial multiplications is exactly the same as in radix-4. The binary tree diagram of radix-2^2 is depicted in Fig. 2.5, which shows that the alternate twiddle factor multiplication stages carry the W_4 twiddle factor. The W_4 twiddle factor multiplication involves only trivial multiplications.

    Furthermore, radix-2^3 and radix-2^4 can be used to implement the radix-8 and radix-16 butterflies by means of radix-2 [23]. The radix-2^3 and radix-2^4 FFT algorithms have the same twiddle factor stages as the radix-8 and radix-16 butterfly FFT algorithms. The binary trees of the radix-2^3 with radix-8 butterfly and radix-2^4 with radix-16 butterfly FFT algorithms are shown in Fig. 2.6. The radix-2^4 algorithm can further be modified by first splitting the radix-16 into two radix-4 stages and then decomposing these into radix-2. In this way, the W_8 twiddle factor multiplication stage can be removed, as shown in Fig. 2.7(a) [23]. The twiddle factor multiplication stages of the radix-2, radix-2^2, radix-2^3, radix-2^4 and modified radix-2^4 FFT algorithms are tabulated in Table 2.1. This table includes all the twiddle factor multiplication stages used to implement the FFT


    Table 2.1: Multiplication at different stages for improved FFT algorithms.

    Radix    Stage: 1      2        3        4        5        ...   s
    2               W_N    W_{N/2}  W_{N/4}  W_{N/8}  W_{N/16} ...   W_4
    2^2             W_4    W_N      W_4      W_{N/4}  W_4      ...   W_4
    2^3             W_4    W_8      W_N      W_4      W_8      ...   W_8
    2^4             W_4    W_8      W_{16}   W_N      W_4      ...   W_{16}

    algorithm. In addition, there are many other FFT algorithms, such as those based on the balanced binary tree [28], radix-2^k [16] and radix-r^k [17]. However, there are still hundreds of FFT algorithms that can be generated. In paper D, the authors propose a method to generate the possible FFT algorithms. The differences among these FFT algorithms are analyzed in terms of twiddle factor multiplication properties, such as switching activity, size of the coefficient memories and round-off noise, which are related to the power consumption, area and accuracy of the circuit.

    2.1.3 Split-Radix FFT Algorithm

    Split-radix FFT algorithms for DFTs of size 2^n were introduced and are famous for having the lowest number of multiplications and additions [2]. This algorithm provides an optimum number of operations for power-of-two sized DFTs. The main idea outlining the split-radix FFT algorithm is that different decompositions can be used for different parts of the algorithm. Thus, independent parts of the algorithm should be computed independently and should use the best possible computational scheme, regardless of what scheme is used for other parts of the algorithm. A reduction in computational complexity is achieved by using this approach [3].

    As an example, let us consider a split-radix FFT algorithm for a DFT of size 2^n. A radix-2 decomposition is applied to the even-indexed outputs of the DFT, while a radix-4 decomposition is used on the odd-indexed ones. This approach thus combines the advantages of both the radix-2 and radix-4 algorithms by minimizing the number of non-trivial twiddle factors at any one stage while minimizing the number of such stages. When the split-radix algorithm is applied to radix-2 and radix-4 simultaneously on different parts of an N-point FFT, it results in an L-shaped butterfly, which is shown in Fig. 2.8. However, when the decomposition is applied iteratively, unlike the uni-radix FFT algorithms, the split radix does not progress stage by stage, thus resulting in a less regular FFT algorithm than the uni-radix [16] or mixed-radix FFT algorithms [29].
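    The L-shaped split radix-2/4 decomposition can be sketched as a short recursion, assuming a power-of-two size. The function name and the direct handling of the base cases are illustrative choices, not code from this thesis:

```python
import cmath

def srfft(x):
    """Split radix-2/4 DIF sketch: radix-2 on the even-indexed outputs,
    radix-4 on the odd-indexed ones (the L-shaped butterfly)."""
    N = len(x)
    if N == 1:
        return list(x)
    if N == 2:
        return [x[0] + x[1], x[0] - x[1]]
    W = lambda e: cmath.exp(-2j * cmath.pi * e / N)
    # Even outputs: one half-size DFT, no twiddle factors.
    even = srfft([x[n] + x[n + N // 2] for n in range(N // 2)])
    # Odd outputs: two quarter-size DFTs after rotations by W^n and W^{3n}.
    u, v = [], []
    for n in range(N // 4):
        a = x[n] - x[n + N // 2]
        b = x[n + N // 4] - x[n + 3 * N // 4]
        u.append((a - 1j * b) * W(n))        # feeds X(4m + 1)
        v.append((a + 1j * b) * W(3 * n))    # feeds X(4m + 3)
    U, V = srfft(u), srfft(v)
    X = [0j] * N
    X[0::2] = even
    X[1::4] = U
    X[3::4] = V
    return X
```

    The even outputs pass through a single radix-2 half-size DFT without twiddle factors, while the odd outputs pass through two quarter-size DFTs after the rotations by W_N^n and W_N^{3n}, matching the split radix-2/4 butterfly of Fig. 2.8.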


    Figure 2.6: Binary trees of FFT algorithms: (a) Radix-8, (b) Radix-2^3, (c) Radix-16, and (d) Radix-2^4.


    Figure 2.7: Binary trees of FFT algorithms: (a) Modified radix-2^4 and (b) Balanced binary tree decomposition.

    Figure 2.8: Split radix-2/4 butterfly.


    2.1.4 Prime Factor Algorithm

    The elimination of the twiddle factor multiplications is obtained by using the prime factor algorithm for the computation of the FFT, also called the Good-Thomas algorithm [18]. The signal flow graph of the prime factor FFT decomposition is shown in Fig. 2.9, where there is no multiplication between the two decomposed sub-DFTs. This algorithm uses a similar divide and conquer approach, where the sizes of the decomposed DFTs are relatively prime. Two numbers are relatively prime, or coprime, if their greatest common divisor is one. In this case, a special index mapping based on the Chinese remainder theorem is formed to connect the decomposed DFTs. The indices of the input sequence x(n) and the output sequence X(k) can be substituted by the following:

    n = ⟨An_1 + Bn_2⟩_N,    n_1 = 0, 1, ..., P-1,    n_2 = 0, 1, ..., Q-1,    (2.9)

    k = ⟨Ck_1 + Dk_2⟩_N,    k_1 = 0, 1, ..., P-1,    k_2 = 0, 1, ..., Q-1,    (2.10)

    where ⟨f⟩_N denotes f modulo N. The coefficients A, B, C and D have to be chosen so that the following conditions are satisfied:

    ⟨AC⟩_N = Q,    ⟨BD⟩_N = P,    ⟨AD⟩_N = ⟨BC⟩_N = 0,    (2.11)

    and further development considering these conditions leads to:

    W_N^{⟨An_1 + Bn_2⟩_N ⟨Ck_1 + Dk_2⟩_N} = W_P^{n_1k_1} W_Q^{n_2k_2},    (2.12)

    thus no twiddle factor multiplication is required. Finding the required values is not simple; it relies on the Chinese remainder theorem. One possible set of coefficients that satisfies these conditions is the following:

    A = Q,    B = P,    C = Q⟨Q^{-1}⟩_P,    D = P⟨P^{-1}⟩_Q,    (2.13)

    where ⟨Q^{-1}⟩_P denotes the multiplicative inverse of Q modulo P, i.e., if ⟨Q^{-1}⟩_P = α then ⟨Qα⟩_P = 1. Note that many solutions of these conditions may exist for specific values of P and Q. By applying this new index mapping, the DFT (2) becomes

    X(k) = \sum_{n_1=0}^{P-1} \underbrace{\Big[\sum_{n_2=0}^{Q-1} x(⟨Qn_1 + Pn_2⟩_N) W_Q^{n_2k_2}\Big]}_{Q\text{-point DFT}} \underbrace{W_P^{n_1k_1}}_{P\text{-point DFT}},    (2.14)

    where there is no twiddle factor multiplication interconnecting the two decomposed DFTs, as shown in Fig. 2.9. Apparently, there is no considerable disadvantage


    from (2.14). However, it requires a complex index mapping of the input data [10]. This algorithm can also be applied recursively, as long as the decomposed DFTs can be factorized into coprime DFTs. Furthermore, it is even possible to compute a DFT using both the Cooley-Tukey FFT and the prime factor FFT algorithm.

    Figure 2.9: Decomposition based on the prime factor FFT algorithm.
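    The Good-Thomas index maps can be exercised numerically. The following hedged sketch (illustrative naming; the three-argument pow(x, -1, m) modular inverse needs Python 3.8 or later) computes an N = P·Q point DFT with coprime P and Q, with no twiddle factor between the two sub-DFT stages:

```python
import cmath
from math import gcd

def pfa_dft(x, P, Q):
    """Good-Thomas sketch for N = P*Q with gcd(P, Q) = 1: the CRT index maps
    remove the twiddle factors between the two sub-DFT stages."""
    N = P * Q
    assert gcd(P, Q) == 1
    C = Q * pow(Q, -1, P)          # k-map coefficients from the CRT
    D = P * pow(P, -1, Q)
    W = lambda M, e: cmath.exp(-2j * cmath.pi * e / M)
    X = [0j] * N
    for k1 in range(P):
        for k2 in range(Q):
            acc = 0j
            for n1 in range(P):
                acc += sum(x[(Q * n1 + P * n2) % N] * W(Q, n2 * k2)
                           for n2 in range(Q)) * W(P, n1 * k1)
            X[(C * k1 + D * k2) % N] = acc   # CRT output mapping
    return X
```

    The input map n = ⟨Qn_1 + Pn_2⟩_N and the output map k = ⟨Ck_1 + Dk_2⟩_N absorb the twiddle factors entirely, at the cost of the more complex addressing.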

    2.2 Winograd Fourier Transform Algorithm

    The Winograd Fourier transform algorithm (WFTA) is an FFT algorithm which achieves a reduction in the number of multiplications from order O(N^2) for the direct DFT to order O(N) [4, 5, 30]. However, the hardware implementation gets complicated, mainly due to the complex index mapping. Furthermore, the number of additions increases significantly compared to other FFT algorithms. Thus, the WFTA may only be efficient for implementing small-size DFTs.

    This FFT algorithm is also based on the decomposition of the N-point DFT into sub-DFTs. Each sub-DFT is computed using the Winograd algorithm, which can be defined as:

    [X(1), ..., X(N-1)]^T = OMI [x(1), ..., x(N-1)]^T,    (2.15)

    where I is a matrix corresponding to the additions between the inputs, M is a diagonal multiplication matrix, and O is a matrix corresponding to the additions after the multiplications (so the input additions I act on the data first). These operations are also called the preweave additions, the


    multiplications, and the postweave additions, and they are shown in Fig. 2.10. The blocks are realized on the basis of Rader's permutation, which converts a DFT into a cyclic convolution, after which Winograd's cyclic convolution algorithm is applied for the computation of the DFT.

    Unlike the Cooley-Tukey FFT algorithm or the prime factor algorithm, the WFTA does not process one sub-DFT at a time. The WFTA first computes the preweave additions of all sub-DFTs, then computes all multiplications of the sub-DFTs, and finally computes the postweave additions of all sub-DFTs.

    Figure 2.10: Block diagram of the Winograd algorithm (input, pre-addition, multiplication, post-addition, output).
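    The preweave/multiplication/postweave structure can be made concrete with the textbook 3-point Winograd module; this sketch is not taken from the thesis, and the variable names are illustrative:

```python
import math

def winograd_dft3(x0, x1, x2):
    """3-point DFT in Winograd form: preweave additions, only two
    non-trivial multiplications, postweave additions."""
    # Preweave (input) additions
    s1 = x1 + x2
    s2 = x1 - x2
    t = x0 + s1
    # Diagonal multiplications
    m1 = (math.cos(2 * math.pi / 3) - 1) * s1
    m2 = -1j * math.sin(2 * math.pi / 3) * s2
    # Postweave (output) additions
    return t, t + m1 + m2, t + m1 - m2
```

    A direct 3-point DFT would use four general complex multiplications; here only the two diagonal multiplications m1 and m2 remain, which is the saving the M matrix of (2.15) represents.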

    2.3 Other Fourier Transforms Algorithms

    More algorithms have been proposed that are based on circular convolution and on recursive computation of the DFT.

    2.3.1 Bluestein Chirp Fourier Transforms

    The Bluestein chirp Fourier transform is based on the chirp z-transform algorithm [31]; in other words, it is a special case of the chirp z-transform algorithm. The algorithm is flexible and can be applied to any size of DFT. By applying the Bluestein chirp algorithm, the DFT (2) becomes:

    X(k) = W_N^{k^2/2} \sum_{n=0}^{N-1} x(n) W_N^{-(n-k)^2/2} W_N^{n^2/2},    (2.16)

    where W_N^{k^2/2} and W_N^{n^2/2} are the chirp signals. In order to better understand the algorithm, its block diagram is presented in Fig. 2.11. It consists of multiplication stages by chirp signals before and after the convolution block. The Bluestein chirp Fourier transform algorithm requires 2N multiplications and a cyclic convolution of length N, so it is not efficient in reducing the number of operations. However, it is suited to a variety of applications [32]. Furthermore, it is possible to use fast convolution algorithms for the cyclic convolution.
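    Equation (2.16) can be sketched directly, evaluating the chirp convolution by brute force rather than with a fast convolution algorithm; the half-angle helper W below stands for W_N^{e/2} and the naming is illustrative:

```python
import cmath

def bluestein_dft(x):
    """Bluestein sketch: pre- and post-multiplication by chirp signals
    around a convolution with the chirp kernel (evaluated directly here)."""
    N = len(x)
    W = lambda e: cmath.exp(-2j * cmath.pi * e / (2 * N))   # W_N^(e/2)
    X = []
    for k in range(N):
        # convolution of x(n)*W_N^(n^2/2) with the chirp W_N^(-(n-k)^2/2)
        conv = sum(x[n] * W(n * n) * W(-((n - k) ** 2)) for n in range(N))
        X.append(W(k * k) * conv)               # post-multiplication chirp
    return X
```

    The identity behind it is nk = (n^2 + k^2 - (n-k)^2)/2, which turns every W_N^{nk} into two chirp multiplications and one convolution term.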

    2.3.2 Rader Prime Algorithm

    The Rader algorithm is also based on cyclic convolution and is only used to compute DFTs of prime size [33]. It requires some indexing operations and


    Figure 2.11: Block diagram of the Bluestein chirp algorithm.

    Figure 2.12: Block diagram of the Rader prime algorithm.

    the computation of a cyclic convolution. However, the cyclic convolution has length N-1, which is not prime. Thus, the prime-length DFT is first converted into a cyclic convolution of non-prime length. The DC output is computed separately by:

    X(0) = \sum_{n=0}^{N-1} x(n).    (2.17)

    The primitive element (generator) p generates all values of the output index k except 0. Now, replacing n by ⟨p^n⟩_N and k by ⟨p^k⟩_N, and substituting (2.17), gives the following equation:

    X(⟨p^k⟩_N) = x(0) + \sum_{n=0}^{N-2} x(⟨p^n⟩_N) W_N^{⟨p^{⟨n+k⟩_{N-1}}⟩_N},    k = 1, ..., N-1,    (2.18)

    which is now solved by applying a cyclic convolution. The block diagram of the algorithm is depicted in Fig. 2.12, where the cyclic convolution block computes the DFT outputs with indices 1 to N-1, X(0) is calculated by a simple summation, and finally all DFT outputs are combined.
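    The whole scheme can be sketched for a small prime, evaluating the length-(N-1) cyclic convolution directly; the function and variable names are illustrative, and the generator search is brute force:

```python
import cmath

def rader_dft(x):
    """Rader sketch for prime N: X(0) by summation, the remaining outputs
    as a length-(N-1) cyclic convolution over generator-permuted indices."""
    N = len(x)
    # Find a primitive element (generator) p modulo the prime N.
    p = next(g for g in range(2, N)
             if len({pow(g, q, N) for q in range(N - 1)}) == N - 1)
    W = lambda e: cmath.exp(-2j * cmath.pi * e / N)
    u = [x[pow(p, -q, N)] for q in range(N - 1)]   # permuted input sequence
    v = [W(pow(p, r, N)) for r in range(N - 1)]    # permuted twiddle kernel
    X = [0j] * N
    X[0] = sum(x)                                  # DC term, simple summation
    for m in range(N - 1):
        X[pow(p, m, N)] = x[0] + sum(u[q] * v[(m - q) % (N - 1)]
                                     for q in range(N - 1))
    return X
```

    Since p has order N-1, the index arithmetic ⟨m-q⟩_{N-1} makes the sum a genuine cyclic convolution, which in a real implementation would be computed with a fast convolution algorithm.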

    2.3.3 Goertzel Algorithm

    The Goertzel algorithm is a digital signal processing (DSP) technique for identifying frequency components of a signal. While the general FFT algorithm computes evenly across the bandwidth of the incoming signal, the Goertzel algorithm looks at specific, predetermined frequencies.

    The Goertzel algorithm exploits the periodicity of the phase factor W_N^k and allows the computation of the DFT to be expressed as a linear filtering operation. Since W_N^{-kN} = 1, the DFT (2) can be multiplied by this factor:


    X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk} W_N^{-kN},    (2.19)

         = \sum_{n=0}^{N-1} x(n) W_N^{-k(N-n)}.    (2.20)

    If Y_k(n) is defined as

    Y_k(n) = \sum_{p=0}^{n} x(p) W_N^{-k(n-p)},    (2.21)

    then it is clear that Y_k(n) is the convolution of the finite-duration input sequence x(n) of length N with a filter that has the impulse response

    h_k(n) = W_N^{-kn} u(n).    (2.22)

    The system with impulse response h_k(n) has the transfer function

    H_k(z) = \frac{1}{1 - W_N^{-k} z^{-1}}.    (2.23)

    Thus the entire DFT can be computed by passing the block of input data through a parallel bank of N single-pole filters, where each filter has a pole at the corresponding frequency of the DFT. The difference equation corresponding to (2.23) computes Y_k(n) recursively as

    Y_k(n) = W_N^{-k} Y_k(n-1) + x(n),    Y_k(-1) = 0.    (2.24)

    The complex multiplication and addition in this equation can be avoided by combining the pairs of single-pole filters possessing complex-conjugate poles. This leads to the two-pole filter given by

    H_k(z) = \frac{1 - W_N^{k} z^{-1}}{1 - 2\cos(2\pi k/N) z^{-1} + z^{-2}}.    (2.25)

    The flow graph of the system corresponding to the second-order Goertzel algorithm is shown in Fig. 2.13. The recursive part iterates N cycles and then applies one complex multiplication to obtain X(k).


    Figure 2.13: Flow graph of the second-order Goertzel algorithm.
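    The second-order recursion maps onto a few lines of code: N iterations with the real feedback coefficient 2cos(2πk/N), followed by a single complex multiplication from the FIR numerator of (2.25). A hedged sketch with illustrative naming:

```python
import cmath, math

def goertzel(x, k):
    """Second-order Goertzel recursion for the single DFT bin X(k)."""
    N = len(x)
    coeff = 2 * math.cos(2 * math.pi * k / N)   # real feedback coefficient
    s1 = s2 = 0j
    for sample in x:                            # N multiply-free-of-complex steps
        s1, s2 = sample + coeff * s1 - s2, s1
    s = coeff * s1 - s2                         # one extra zero-input step
    return s - cmath.exp(-2j * math.pi * k / N) * s1   # single complex multiply
```

    Because only one bin is computed per filter, this pays off when just a few predetermined frequencies are needed, e.g. tone detection, rather than the full spectrum.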


    Chapter 3

    FFT Architectures

    The key to high performance in FFT hardware is to have the computational elements organized in such a way that they match the structure of the computational algorithm. In Chapter 2, FFT algorithms were discussed, particularly regarding the optimization of the number of operations. In real implementations, the number of operations is often not as important as the amount of resources required and the utilization of those resources.

    An FFT architecture consists of processing elements, including butterflies and rotators, together with memory and data management circuits. Each butterfly is usually reused to compute several butterflies of the FFT algorithm. Similarly, rotators can be reused to perform several rotations. This reuse of the processing elements reduces the area of the circuit.

    FFT algorithms can be implemented using different FFT architectures. Among them, the selection depends on the required specifications. Three basic kinds of specifications can be found in the literature: signal processing specifications such as accuracy, application specifications such as latency and throughput, and hardware device specifications, which include area limitations and sometimes power consumption.

    The most common FFT hardware architectures are explained in the next sections, including the direct implementation of the FFT algorithm, memory-based architectures and pipelined architectures.

    3.1 Direct Implementation

    The direct implementation is an isomorphic mapping of the signal flow graph of an FFT algorithm onto hardware. It requires a number of processing elements equal to the number of operations. Thus, this may not be an efficient architecture for most applications, being very hardware intensive. Nevertheless, it can be suitable for small-size FFTs and for extremely high


    Figure 3.1: Memory-based FFT.

    throughputs [34]. Furthermore, the utilization of the butterflies and rotators is 100%. The hardware cost in terms of butterflies and rotators is proportional to O(N log N).

    3.2 Memory-Based Architectures

    Memory-based architectures are also called in-place or iterative architectures. They consist of one or more processing elements that calculate all the butterflies and twiddle factor multiplications of the algorithm. Since the memory-based FFT architecture uses one or more processing elements, one or more memories are required to store the values. The FFT computation is done by fetching the input data from memory, processing them in the processing element and storing the result back in the memory. This process is iterated until all FFT computations are completed. This is an efficient architecture to meet stringent low-area requirements for large-size FFTs [29, 35-38].
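    The fetch-process-store iteration can be modeled behaviorally in software. The sketch below reuses one radix-2 DIF butterfly-plus-rotator "processing element" over a single memory array; the scheduling and naming are illustrative choices, not the thesis's architecture:

```python
import cmath

def fft_in_place(mem):
    """Behavioral sketch of a memory-based (in-place) FFT: a single radix-2
    DIF butterfly/rotator processing element reused over one memory array."""
    N = len(mem)
    span = N // 2
    while span >= 1:                           # log2(N) passes over the memory
        for start in range(0, N, 2 * span):
            for j in range(span):
                i0, i1 = start + j, start + j + span
                a, b = mem[i0], mem[i1]        # fetch an operand pair
                tw = cmath.exp(-2j * cmath.pi * j / (2 * span))
                mem[i0] = a + b                # butterfly ...
                mem[i1] = (a - b) * tw         # ... plus rotation, written back
        span //= 2
    return mem                                 # outputs in bit-reversed order
```

    After log2(N) passes over the memory the results sit in bit-reversed order, mirroring how a single processing element is time-multiplexed over all butterflies of the algorithm at the cost of throughput.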

    A simple memory-based architecture is shown in Fig. 3.1. It consists of a memory and an FFT core, which computes the butterflies and rotations. As seen in Fig. 3.1, after every iteration the data are stored in memory, so it is necessary to compute the whole FFT before new samples are received. Thus, the memory-based architecture is unable to compute the FFT when data arrive continuously at the input. Indeed, this can be solved by adding extra memory for storing the incoming data while the FFT is being calculated.

    In memory-based architectures the throughput depends on the latency and the size of the FFT. The higher the computation time of the FFT core, the higher the latency and the lower the throughput. Likewise, large-size FFTs require a large number of iterations, which lowers the throughput. As a result, this architecture may be unsuitable for FFT computation in real-time applications.

    To improve the throughput and to decrease the latency in memory-based architectures, high-radix processing elements are used. However, when memory-based architectures and higher-radix processing elements are used together, memory conflict problems arise due to improper memory accesses. Thus, multi-bank memories are required in order to solve these conflicts [39]. Parallel processing is another approach to increase the throughput and decrease the latency. The column FFT architecture is an example of the highest parallelism [37, 38, 40-42], where an entire stage of the FFT signal flow graph is computed in parallel, as shown in Fig. 3.2. These architectures have high throughput, but the hardware cost is increased as


    Figure 3.2: Memory-based FFT architecture with parallel processing elements (P.E.) and a switching network.

    compared to single processing element architectures.

    3.3 Pipelined

    Pipelined architectures [3, 14, 16, 23, 28, 43-53] are also called streaming architectures. These architectures have a regular structure, moderate area, relatively simple control and high throughput. In addition, a major advantage of pipelined architectures is that they can process a continuous flow of data without any additional circuitry, as opposed to memory-based architectures. Furthermore, it is possible to increase the clock frequency by adding registers in order to divide the critical path, which increases the throughput. As a result, pipelined architectures have high throughput rates and are suitable for real-time applications.

    In pipelined FFT architectures, a single stage of the FFT algorithm is computed by a single stage of the architecture. Figure 3.3 shows a 16-point radix-2 FFT algorithm and its pipelined architecture. In order to start the computations, x(n) and x(n + N/2) should be available. For a continuous flow of data, the first half of the samples has to be stored in memory until the second half arrives. Based on the memory storage, there are two main categories: delay commutator and delay feedback. In the next sections, the most common pipelined FFT architectures based on these categories are discussed, including single-path delay feedback (SDF), multi-path delay commutator (MDC) and multi-path delay feedback (MDF).


    Figure 3.3: Transformation from the signal flow graph of a 16-point radix-2 FFT to a pipelined FFT architecture.

    3.3.1 Single-Path Delay Feedback Pipelined FFT Architecture

    The single-path delay feedback architecture has a feedback loop at every stage [14, 23, 28]. The butterfly processing element stores the input samples in a feedback memory until they are required for the butterfly computation. These butterflies have a 50% utilization ratio. This architecture has one continuous data stream of one sample per clock cycle, which means that at every clock cycle one sample is fed in and one sample is received at the output. However, higher throughput can be achieved by increasing the clock frequency.

    Figure 3.4 shows the radix-r butterfly SDF pipelined FFT architecture. In this architecture, the number of stages is s = log_r(N) and each stage contains a radix-r butterfly, a rotator and memory or buffers. The radix-r butterfly consists of an adder, a subtractor and multiplexers. In the case of radices higher than 4, rotators are also required inside the butterflies for the computation of the non-trivial rotations. Furthermore, multiplexers are used to control whether the inputs of the memory are fed directly, without any computation, or after the butterfly computation is performed. These multiplexers are usually controlled by a clock signal. In the case of radix-2, the multiplexers of the first stage switch their position after N/2 clock cycles, and every N/4 clock cycles in the subsequent stages. Likewise, the memory requirement is only N-1, although it requires N cycles to provide the first output. As far as the rotators are concerned, they can be implemented using different architectures, which will be discussed in Chapter 4. In general, a single-path delay feedback architecture for the FFT is considered an optimal choice in terms of hardware cost and performance for many applications.


    Figure 3.4: Radix-r single-path delay feedback architecture.

    Figure 3.5: Radix-r multi-path delay feedback pipelined FFT architecture.

    3.3.2 Multi-Path Delay Feedback Pipelined FFT Architecture

    Multi-path delay feedback pipelined FFT architectures are also known as parallel feedback architectures. These architectures consist of parallel SDF pipelined FFT architectures for processing several samples [44, 54-56]. Like the SDF pipelined FFT, the MDF architecture has a 50% butterfly utilization ratio. Thus, the parallelism does not improve the utilization ratio. MDF architectures can process samples in parallel, so they have high throughput at the expense of hardware cost.

    MDF pipelined FFT architectures can use different radices, such as radix-2 [54], radix-2^2 [56] and radix-2^4 [44]. Figure 3.5 shows the radix-r multi-path delay feedback architecture, where each path has a single input and a single output. Similar to other architectures, a radix-r butterfly is used to compute the r-point DFTs. The figure shows the rotators in each path for the twiddle factor multiplications. However, their number can be minimized by using scheduling techniques and specialized constant multipliers, which are based on sharing the same rotator among multiple streams. Furthermore, a memory is used in the feedback loop, which feeds the samples to the butterfly stages according to the data flow requirements of the circuit.

    3.3.3 Multi-Path Delay Commutator Pipelined FFT Architecture

    The multi-path delay commutator pipelined FFT architecture is the most straightforward implementation of FFT algorithms, also referred to as the feedforward


    Figure 3.6: Radix-r multi-path delay commutator pipelined FFT architecture.

FFT architecture [16, 57-59]. This architecture has no feedback loop: data are processed by the butterflies and the rotators and then fed to the next stage. As a result, the butterflies have a 100% utilization ratio [16]. Generally, it computes several samples in parallel, so it provides higher throughput than the SDF pipelined architecture at the cost of hardware.

The MDC pipelined architecture can be used with several butterflies, such as radix-2 [47], radix-4 [45], and radix-8 [45]. These architectures can be improved by using radix-2^2, radix-2^3, radix-2^4, and radix-2^k [16]. Figure 3.6 shows the radix-r MDC pipelined FFT architecture. It consists of radix-r butterflies, rotators, buffers, and commutators. After the radix-r butterfly computes the r-point DFTs, the result passes through the twiddle factor multiplication and is then fed to the buffer and commutator. The twiddle factor multiplication is performed by the rotator; its implementation techniques are discussed in the next chapter. The purpose of the buffer and commutator block is to store the samples coming from one stage and reorder them into the right order for the next stage. Hence, the buffer and commutator block is placed after the twiddle factor multiplication. Each buffer and commutator block has a different buffer size, determined by the stage number.

3.4 Bit-Reversal / I/O Order

Bit reversal is an algorithm that permutes a set of indexed data according to a reversing of the bits of the index [60]. It is often used to sort the output frequencies of the FFT, which are usually provided in bit-reversed order.

The bit reversal of N = 2^n indexed data reorders the data according to a reversing of the bits of the index. This means that any sample with index I = b_{n-1}, ..., b_1, b_0 moves to the place BR(I) = b_0, b_1, ..., b_{n-1}. In other words, bit reversal is its own inverse, i.e., BR(x) = BR^{-1}(x). Therefore, if the data are in natural order, the bit reversal algorithm obtains them in bit-reversed order, and vice versa.

There are different approaches to implement bit reversal. The most common approach is to store the set of data in memory and read the data from the memory afterwards [61, 62]. Another approach is based on circuits


consisting of buffers and multiplexers in series. This architecture is also called the optimum bit reversal circuit because it has a minimum number of registers and minimum latency [63].
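The bit-reversal permutation described above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the thesis, and the function names are my own:

```python
def bit_reverse(index, n_bits):
    """Reverse the n_bits-bit binary representation of index."""
    result = 0
    for _ in range(n_bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

def bit_reversed_order(data):
    """Permute a list of length N = 2^n into bit-reversed order."""
    n_bits = (len(data) - 1).bit_length()
    return [data[bit_reverse(i, n_bits)] for i in range(len(data))]
```

For N = 8, the index 1 = 001 maps to 4 = 100, so natural order [0, 1, ..., 7] becomes [0, 4, 2, 6, 1, 5, 3, 7]. Applying the permutation twice restores the original order, reflecting that BR is its own inverse.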


    Chapter 4

    Rotation and its Hardware

In this chapter, the different possibilities to implement rotations are discussed. The rotations can be implemented using different algorithms, such as the general complex multiplier, CORDIC, and the complex constant multiplier. The CORDIC and the complex constant multiplier are multiplierless implementations, where shifts and adds are used to implement the rotation.

    4.1 Rotation

A rotation is the multiplication by a complex number whose magnitude is equal to one, i.e., only the phase of the data is affected. The rotation of a complex number x + iy by an angle α is calculated by the following equations:

X = x cos α − y sin α,
Y = y cos α + x sin α,     (4.1)

where X and Y are the real and imaginary parts of the result. Thus, the rotation can be written as:

X + iY = (x cos α − y sin α) + i (y cos α + x sin α).     (4.2)

The rotation in (4.1) is shown in Fig. 4.1. This rotation can be computed in several different ways, including general complex multiplication [20], the CORDIC algorithm [21, 22], and algorithms based on constant multiplication [23-26].
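As a quick sanity check, the rotation (4.1) can be sketched in Python and compared with multiplication by a unit-magnitude complex exponential. The function names are illustrative, not from the thesis:

```python
import cmath
import math

def rotate(x, y, alpha):
    """Rotate the complex number x + iy by the angle alpha, as in (4.1)."""
    X = x * math.cos(alpha) - y * math.sin(alpha)
    Y = y * math.cos(alpha) + x * math.sin(alpha)
    return X, Y

def rotate_complex(x, y, alpha):
    """The same rotation expressed as multiplication by e^{i*alpha}."""
    z = (x + 1j * y) * cmath.exp(1j * alpha)
    return z.real, z.imag
```

Both functions agree to floating-point precision, since e^{iα} has magnitude one and only changes the phase of x + iy.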


Figure 4.1: Operational diagram of rotation.

Figure 4.2: General complex multiplier.

4.2 General Complex Multiplier

The most common way of implementing the rotation is by using four multipliers and two adders. When complex multipliers are used, the sine and cosine components of the rotation angle are usually stored in a memory. The architecture is shown in Fig. 4.2, where the multiplier symbol represents the multiplications and additions of (4.1), and the memory stores the sine and cosine components of the rotation angles. In a pipelined FFT architecture, each stage performs N rotations, so N twiddle factor coefficients need to be stored in the memory. A number of approaches have been applied to reduce the size of the coefficient memory [28, 64]. The coefficients in memory are the inputs to the complex multiplier, which calculates the rotation of the input sample. Furthermore, the number of 0-to-1 bit transitions between successive coefficients read from the coefficient memory is defined as the switching activity (SA), which is related to the power consumption of the circuit [65-67].

The complex multiplier consists of four real multiplications and two additions. Regardless of the application of the rotation, the complex multiplier can be realized by different approaches. These approaches generally reduce the number of real multiplications from four to three at the cost of additional additions.

  • 8/10/2019 Optimization of Rotations in FFTs.pdf

    49/69

    4.2. General Complex Multiplier 29

Figure 4.3: Algorithm with three multipliers and five adders.

4.2.1 Approach I

The general rotation (4.1) can also be written as:

X = x (cos α + sin α) − (x + y) sin α,
Y = x (cos α + sin α) + (y − x) cos α.     (4.3)

This requires only three multiplications and five additions: one multiplication is replaced by three additions. It is an appropriate solution when the number of available multipliers is limited. The structure based on (4.3) is depicted in Fig. 4.3.
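The rearrangement in (4.3) can be checked numerically against the direct rotation (4.1). A Python sketch with illustrative names, where the comments count the operations:

```python
import math

def rotate_3mult_5add(x, y, alpha):
    """Rotation with three multiplications and five additions, following (4.3)."""
    c, s = math.cos(alpha), math.sin(alpha)
    t = x * (c + s)          # multiplication 1; (c + s) is addition 1
    X = t - (x + y) * s      # multiplication 2; additions 2 and 3
    Y = t + (y - x) * c      # multiplication 3; additions 4 and 5
    return X, Y
```

Expanding the products shows that the sin and cos terms of x cancel, recovering X = x cos α − y sin α and Y = y cos α + x sin α exactly.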

4.2.2 Approach II

Another way to rewrite the rotation (4.1) is:

X = (x + y) cos α − y (sin α + cos α),
Y = (x + y) cos α + x (sin α − cos α).     (4.4)

This requires three real multiplications and three additions, replacing one multiplication by an addition. However, it needs to store three coefficients: since (sin α + cos α) and (sin α − cos α) can be precomputed, one replaces the coefficient sin α and the other is stored at a new location in memory. The structure based on (4.4) is shown in Fig. 4.4.
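A numerical sketch of (4.4), with the two coefficient sums treated as precomputed memory contents. Names are illustrative, not from the thesis:

```python
import math

def rotate_3mult_3add(x, y, alpha):
    """Rotation with three multiplications and three additions, following (4.4)."""
    # the three stored coefficients
    c = math.cos(alpha)
    k1 = math.sin(alpha) + c   # precomputed (sin + cos)
    k2 = math.sin(alpha) - c   # precomputed (sin - cos)
    t = (x + y) * c            # addition 1; multiplication 1
    X = t - y * k1             # multiplication 2; addition 2
    Y = t + x * k2             # multiplication 3; addition 3
    return X, Y
```

The shared product t = (x + y) cos α is what saves the fourth multiplication; only the three additions on the data path remain, since the coefficient sums are fetched from memory.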

4.2.3 Approach III

This approach can be applied only to rotations that are based on lifting [68]. The rotation (4.1) can be written in matrix form as follows:

[ X ]   [ cos α   −sin α ] [ x ]
[ Y ] = [ sin α    cos α ] [ y ]     (4.5)


Figure 4.4: Algorithm with three multipliers and three adders.

Figure 4.5: Algorithm with three multipliers and three adders.

When the lifting approach is applied to (4.5), the following factorization is obtained:

[ X ]   [ 1  −tan(α/2) ] [   1     0 ] [ 1  −tan(α/2) ] [ x ]
[ Y ] = [ 0      1     ] [ sin α   1 ] [ 0      1     ] [ y ]     (4.6)

By multiplying out the matrices, the following equations are obtained:

X = x − x sin α tan(α/2) − 2y tan(α/2) + y sin α tan^2(α/2),
Y = y + x sin α − y sin α tan(α/2).     (4.7)

Based on these equations, the structure is depicted in Fig. 4.5. This approach also requires three real multiplications and three additions. However, there is no need for an additional coefficient, as it just replaces cos α with tan(α/2) in memory.
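The three lifting steps in (4.6) can be applied sequentially, each one a shear that modifies only one coordinate. A Python sketch with illustrative names (angles near π are excluded, since tan(α/2) then grows without bound):

```python
import math

def rotate_lifting(x, y, alpha):
    """Rotation by three lifting (shear) steps, following (4.6)."""
    t = math.tan(alpha / 2)
    s = math.sin(alpha)
    x1 = x - y * t     # first lifting step:  [1 -tan(a/2); 0 1]
    Y = y + x1 * s     # second lifting step: [1 0; sin(a) 1]
    X = x1 - Y * t     # third lifting step:  [1 -tan(a/2); 0 1]
    return X, Y
```

Each step needs one multiplication and one addition, giving the three multiplications and three additions stated above; a useful side effect of lifting is that the structure is exactly invertible by running the shears backwards with negated coefficients.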

4.3 CORDIC

CORDIC (COordinate Rotation DIgital Computer) is a popular algorithm for the implementation of multiplierless rotations [21, 22, 69]. It realizes the rotation by means of a series of shifts and additions, which reduces the amount of hardware. It is also suitable where multipliers are not available. However, it may affect the accuracy, since it is based on an approximation.


Figure 4.6: CORDIC micro-rotation.

The CORDIC algorithm decomposes the angle to be rotated, θ, into a sum of M predefined angles, α_i, according to:

θ = Σ_{i=0}^{M−1} σ_i α_i + ε,     (4.8)

where ε is the error of the approximation, σ_i indicates the direction of the so-called micro-rotation, and:

α_i = tan^{−1}(2^{−i}).     (4.9)

The angles that define the micro-rotations have the property that they can be rotated by shifts and additions, which significantly reduces the hardware resources. The micro-rotations are carried out as follows:

x_{i+1} = x_i − σ_i 2^{−i} y_i,
y_{i+1} = y_i + σ_i 2^{−i} x_i.     (4.10)

The hardware circuit for the case σ_i = 1 is depicted in Fig. 4.6. The angle α_i by which the input datum is rotated is chosen by setting the number of bits that are shifted before the additions and subtractions are carried out.

Usually σ_i ∈ {−1, 1}. This forces all the micro-rotations to be calculated either clockwise or counter-clockwise and assures a constant gain of the CORDIC, which can be compensated by multiplying the outputs by:

K = Π_{i=0}^{M−1} cos(α_i) = Π_{i=0}^{M−1} cos(tan^{−1}(2^{−i})).     (4.11)

This option is preferable when the circuit is used for rotating several different angles and a constant gain for all of them is required, as happens in the rotators for the FFT. However, a constant rotator only needs to rotate a single angle. In this case it is better to consider σ_i ∈ {−1, 0, 1}. This approach is called redundant CORDIC [70] and allows certain micro-rotations to be removed, reducing the number of adders.
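A behavioral model of the non-redundant CORDIC (σ_i ∈ {−1, 1}) with the gain compensated as in (4.11) can be sketched in Python. This is an illustrative floating-point model, not a description of any particular hardware:

```python
import math

def cordic_rotate(x, y, theta, m=16):
    """Rotate (x, y) by theta using m CORDIC micro-rotations (4.10),
    then compensate the constant gain with K from (4.11)."""
    z = theta  # residual angle still to be rotated
    for i in range(m):
        sigma = 1 if z >= 0 else -1
        x, y = x - sigma * y * 2.0**-i, y + sigma * x * 2.0**-i
        z -= sigma * math.atan(2.0**-i)
    k = 1.0
    for i in range(m):
        k *= math.cos(math.atan(2.0**-i))
    return x * k, y * k
```

After m micro-rotations the residual angle is bounded by α_{m−1} = tan^{−1}(2^{−(m−1)}), so accuracy improves by roughly one bit per iteration; the convergence range is limited to |θ| ≤ Σ α_i ≈ 1.74 rad, with larger angles handled by a trivial pre-rotation of ±π/2 or π.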


Figure 4.7: Simplified constant multiplication.

Figure 4.8: Rotation based on constant multiplication.

4.4 Constant Multiplication

For constant multiplication, it is possible to replace a general multiplier by shifts, adders, and subtractors. Adders and subtractors have the same complexity, so both are referred to as adders. When an input signal is multiplied by one or more constants, a simple method is to realize each multiplier individually. This can be done optimally for coefficients of up to 19 bits [71]. However, it is also possible to exploit the redundancies between the constants in order to reduce the complexity of the hardware. The architecture of a constant multiplier is shown in Fig. 4.7, where the input data x is multiplied by the coefficient 45, the output being 45x.

In terms of complexity, shift operations are considered free, so the aim is to reduce the number of adders in the implementation of the constant multiplications. For complex multiplications in particular, each input is multiplied by two constant coefficients. A dedicated algorithm for realizing multiple constant multiplication (MCM) with two constants has been proposed in [72]. MCM is a constant multiplication algorithm in which the input is multiplied by several constant coefficients. In fact, any rotation angle can be realized with constant multiplications, so it is possible to implement the rotation (4.1) using a constant multiplication algorithm. It is based on implementing the constant values of the sine and cosine functions of the rotation. A general block diagram of the rotator is shown in Fig. 4.8. It consists of two constant multiplication blocks, where the constant values of sine and cosine are implemented by shifts and adders, plus two additional adders, which complete the rotation. In addition, the complexity of the rotation based on constant multiplication depends on the precision requirement of the rotator.
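For example, the multiplication by 45 in Fig. 4.7 can be built from shifts and adders, and exploiting the factorization 45 = 5 · 9 reduces the adder count from three to two. This is a sketch of the idea, not necessarily the exact network in the figure:

```python
def mult45_direct(x):
    """45x = 32x + 8x + 4x + x (45 = 101101 in binary): three adders."""
    return (x << 5) + (x << 3) + (x << 2) + x

def mult45_factored(x):
    """45x = 9 * (5x), with 5x = x + 4x and 9t = t + 8t: two adders."""
    t = x + (x << 2)
    return t + (t << 3)
```

Reusing the intermediate value 5x is exactly the kind of shared subexpression that MCM algorithms search for automatically across several coefficients.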

The author proposes architectures to implement the twiddle factor multiplication based on constant multiplication, where different approaches are used to reduce the complexity and improve the accuracy. In paper F, an architecture is proposed that makes use of the trigonometric identities of the angles. This allows finding expressions that can be reused for different angles. As a result, a simplified architecture is obtained for the implementation of the twiddle factor multiplication. Another architecture, proposed in paper H, uses twiddle factor coefficients that have low adder cost and higher accuracy. These can be obtained by scaling the coefficients. The number of adders is further optimized using the multiple constant multiplication (MCM) approach.


    Chapter 5

    Finite Word Length Effects

FFT algorithms are implemented in hardware with finite word length data and arithmetic. As a result, the coefficients and internal signals in FFT architectures are quantized to a finite word length, which determines the input dynamic range, precision, and hardware resources. This generally leads to phenomena known as finite word length effects.

This chapter presents these effects in FFTs, such as coefficient quantization error, data quantization error, and overflow. Here, the data quantization error is explained in terms of round-off error.

5.1 Coefficient Quantization

It is usually required to quantize all the coefficients to a fixed number of bits, because it is not possible to keep an unlimited word length for the coefficients. If the rotation coefficients were represented perfectly by the required number of bits, there would be no error in the implementation; however, this increases the hardware cost. There is thus a trade-off between hardware cost and accuracy. The coefficient quantization causes an error in the computed result of arithmetic operations. Although the nature of coefficient quantization is inherently not statistical, useful results may be obtained by means of a statistical analysis.

The quantized coefficient is defined as the sum of the coefficient and the error introduced by the coefficient quantization:

c_q = c + ε_c,     (5.1)

where c is the coefficient and ε_c is the coefficient quantization error. If rounding with b fractional bits is used, |ε_c| ≤ 2^{−(b+1)} = 0.5 ulp¹.

¹ Unit of least position, i.e., the weight of the least significant bit of the representation. When using b fractional bits, ulp = 2^{−b}.


In the FFT, the rotations are one of the sources of coefficient quantization error. The rotation consists of two coefficients, cos α and sin α, which always have a value in the range [−1, 1]. If cos α and sin α are represented by a finite number of bits, a rotation with coefficient quantization error can be described as:

(cos α)_q = cos α + ε_cos,
(sin α)_q = sin α + ε_sin,     (5.2)

where ε_cos and ε_sin are the coefficient quantization errors of the cosine and sine coefficients, respectively. As a result, the overall expression of the rotation with coefficient quantization is:

X_q = x (cos α + ε_cos) − y (sin α + ε_sin),
Y_q = y (cos α + ε_cos) + x (sin α + ε_sin).     (5.3)

Taking all of the above into account, the rotation error due to the quantization of the coefficients is calculated by subtracting (4.1) from (5.3):

ε_X = x ε_cos − y ε_sin,
ε_Y = y ε_cos + x ε_sin.     (5.4)

The absolute value of the error becomes:

E = sqrt( (x² + y²) (ε_cos² + ε_sin²) ).     (5.5)

The error depends on the magnitude of the input and on the coefficient quantization errors. The coefficient quantization errors can, however, be optimized in order to reduce the error [27].
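The relation (5.5) can be verified numerically by rounding the coefficients to b fractional bits. The helper name and the chosen input values below are illustrative only:

```python
import math

def quantize(value, b):
    """Round value to b fractional bits."""
    scale = 2 ** b
    return round(value * scale) / scale

x, y, alpha, b = 0.6, -0.3, 2 * math.pi * 3 / 64, 10
c, s = math.cos(alpha), math.sin(alpha)
cq, sq = quantize(c, b), quantize(s, b)
e_cos, e_sin = cq - c, sq - s

# rotation with quantized coefficients, as in (5.3), and with exact ones, (4.1)
Xq, Yq = x * cq - y * sq, y * cq + x * sq
X, Y = x * c - y * s, y * c + x * s

error = math.hypot(Xq - X, Yq - Y)
predicted = math.sqrt((x**2 + y**2) * (e_cos**2 + e_sin**2))
```

Since (5.4) gives ε_X and ε_Y as a rotation-like combination of ε_cos and ε_sin, the measured error magnitude matches the expression (5.5) to floating-point precision, and both ε_cos and ε_sin respect the 0.5 ulp rounding bound.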

Each output of the FFT computation passes through a number of cascaded rotations. Thus, the errors are accumulated at the output of the FFT. The effect of the coefficient quantization can be computed by using the Frobenius norm, which is based on comparing the quantized transfer function to the ideal transfer function [73]. This can be defined as:

F_N = Σ_{i=0}^{N−1} Σ_{j=0}^{N−1} | Z_{i,j} − Ẑ_{i,j} |²     (dB),     (5.6)

where Z is the ideal matrix of rotations and Ẑ represents the one with quantized coefficients. As a result, the impact of the coefficient quantization in the FFT can be analyzed.
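A direct way to evaluate a Frobenius-norm comparison of this kind is to quantize the entries of the DFT matrix and accumulate the squared deviations. A small Python sketch; the helper names are mine, and the square root is included so the result is a norm:

```python
import cmath
import math

def dft_matrix(n):
    """Ideal n x n matrix of DFT rotations (twiddle factors)."""
    return [[cmath.exp(-2j * math.pi * i * j / n) for j in range(n)]
            for i in range(n)]

def quantize_matrix(z, b):
    """Round real and imaginary parts of every entry to b fractional bits."""
    s = 2 ** b
    return [[complex(round(w.real * s) / s, round(w.imag * s) / s) for w in row]
            for row in z]

def frobenius_deviation(z, zq):
    """Frobenius norm of the deviation between ideal and quantized matrices."""
    return math.sqrt(sum(abs(a - c) ** 2
                         for ra, rc in zip(z, zq) for a, c in zip(ra, rc)))
```

As expected, allocating more fractional bits to the coefficients shrinks the deviation, which is how the trade-off between coefficient word length and FFT accuracy can be explored.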

5.2 Round-Off Error

In fixed-point representation, it is inefficient to increase the word length after each multiplication. Therefore, the products must be quantized using rounding or truncation. The quantized products can be expressed as the sum of the unquantized


Figure 5.1: Linear noise model of the quantization.

product and a random noise source, as shown in Fig. 5.1 [74-76]. Considering the multiplication of the quantized numbers x and c, the product X is itself quantized to b bits. The quantized product X_Q = [cx]_Q is modelled by an additive error e. Thus, the quantized product is expressed as:

X_Q = cx + e.     (5.7)

Round-off error sources in a design are often modelled as random noise sources with statistical properties, variance σ² and mean value m, determined by the quantization model and the word length [74]. Assume that the variance and mean value for the quantization at node i are σ_i² and m_i, respectively. Then, the total mean and variance of the round-off noise for K sources are given as [74]:

m_tot = Σ_{i=1}^{K} m_i Σ_{n=0}^{∞} h_i(n),     (5.8)

σ_tot² = Σ_{i=1}^{K} σ_i² Σ_{n=0}^{∞} h_i²(n)
       = Σ_{i=1}^{K} (σ_i² / 2π) ∫_{−π}^{π} |H_i(e^{jωT})|² d(ωT),     (5.9)

where h_i(n) is the impulse response from noise source i to the output and H_i(e^{jωT}) is the corresponding transfer function. As a result, the variance of the round-off error is equal to the sum of the contributions of all noise sources.

In the FFT, a multiplication of the data by a rotation introduces round-off noise, which means that a noise source is added after each multiplication. The total round-off noise is calculated by adding the effect of all round-off noise sources at the output [76].
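For rounding to b fractional bits, the error of a single source is commonly modelled as uniformly distributed on [−ulp/2, ulp/2], giving a mean close to 0 and a variance close to ulp²/12. A quick Monte-Carlo check of this model (an illustrative sketch, not from the thesis):

```python
import random

def round_to(value, b):
    """Round value to b fractional bits."""
    scale = 2 ** b
    return round(value * scale) / scale

b = 12
ulp = 2.0 ** -b
random.seed(1)
samples = [random.uniform(-1.0, 1.0) for _ in range(100000)]
errors = [round_to(v, b) - v for v in samples]

mean = sum(errors) / len(errors)
variance = sum(e * e for e in errors) / len(errors)
# expected: mean close to 0, variance close to ulp**2 / 12
```

Truncation instead of rounding would shift the error to [−ulp, 0], giving a nonzero mean of −ulp/2 that accumulates through (5.8); this is why rounding is usually preferred despite its slightly higher cost.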

5.3 Overflow

Overflow occurs when the result of an operation exceeds the range representable with the available finite word length [74]. This happens in the addition/subtraction operations of


Figure 5.2: Butterfly with scaling.

the FFT. Generally, it is assumed that the input data are numbers in the range [−1, 1], represented with a certain number of bits. Under this interpretation, the output of an addition/subtraction operation may fall outside the range [−1, 1], which causes an overflow.

5.4 Scaling

Scaling is used to prevent overflows in fixed-point realizations while at the same time keeping the signal level as high as possible to reduce the round-off noise [10]. It adjusts the range of the data by inserting scaling multipliers. However, the scaling must not disturb the transfer function.

5.4.1 Static Data Scaling

In order to avoid overflows in FFT architectures, it is common to introduce a scaling factor after each butterfly stage. To completely guarantee that overflows will not happen, the outputs can be divided by two after a radix-2 butterfly, as shown in Fig. 5.2. The factor two comes from the assumption that the inputs have a given value, which can at most be do
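The safe scaling for the radix-2 butterfly of Fig. 5.2 can be sketched as follows: with inputs in [−1, 1], halving both outputs keeps them in [−1, 1] as well. This is a minimal behavioral model, not a hardware description:

```python
def radix2_butterfly_scaled(x0, x1):
    """Radix-2 butterfly with the outputs divided by two to avoid overflow."""
    return (x0 + x1) / 2, (x0 - x1) / 2
```

In hardware the division by two is a free right shift of the result; its price is not area but a half-bit of additional round-off noise per stage, which is the trade-off static scaling makes to rule out overflow entirely.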